
[The New York Times Sues OpenAI and Microsoft] Reading the English Commentary in Japanese [December 30, 2023 | @Matthew Berman]


This might be the most important lawsuit of our generation.


The New York Times just unleashed a massive lawsuit against OpenAI and Microsoft, alleging that they illegally used New York Times content to build the ChatGPT models, resulting in a trillion dollars of value creation.


The allegations are stunning.


From being able to reproduce New York Times articles word for word to ChatGPT hallucinating entire New York Times articles and falsely attributing them to the New York Times, sometimes with dire health consequences, this is shaping up to be the most important AI legal case in our generation.


It will define how AI companies operate going forward.


OpenAI and Microsoft aren't the only ones in legal trouble.


Midjourney V6 was just released, and it is easily able to reproduce Disney intellectual property nearly frame for frame, setting them up to be sued by the behemoth that is Disney's legal team.


Will this mean that OpenAI and Midjourney need to delete their models and start completely from scratch?


Does this give companies like Google and Meta, that have their own proprietary data, a huge boost?


And how did Elon Musk see all of this coming a mile away?


I read the entire 69-page lawsuit, and I'm going to break it all down for you.


The core of the lawsuit hinges on what fair use actually is, so let's start with a definition of what fair use is.


Funnily enough, I actually asked ChatGPT to define it for me.


So, fair use is a legal doctrine that allows limited use of copyrighted material without requiring permission from the rights holders, typically for purposes such as commentary, criticism, education, news reporting, parody, and research.


This doctrine balances the interests of the copyright holder with the public's interest in the free flow of information and ideas.


Now, thinking about what OpenAI is doing, they are taking this copyrighted content and training their models with it.


But they're able to reproduce the content word for word.


And then I go on to ask, what if copyrighted material is used to create something new?


When copyrighted material is used to create something new, it may still fall under the fair use doctrine if the new work transforms the original by adding new expression, meaning, or message.


And that's really important.


In fact, there were a lot of lawsuits around reaction content, claiming that it basically just replays the original content.


But it was found that reaction-style content, stuff you find on YouTube, actually does add new expression and new meaning to the original content.


And now we have a ton of reaction channels.


Now, I want to show you what Elon Musk said just a few weeks ago about OpenAI.


He said that OpenAI was lying about not using copyrighted content.


And not only that, by the time the lawsuits make their way through the court systems, it won't even matter because we'll have AGI.


Take a look at this clip.


That's interesting.


The IP issue, which I think is actually something I can speak to as somebody who's in the creator business, and the journalistic business, and whatnot, where we care about copyright.


One of the things about training on data has been this idea that you're not going to train, or these things are not being trained, on people's copyrighted information.


Historically, that's been the concept.


Yeah, that's a huge lie.


Say that again.


These AI are all trained on copyrighted data.




So you think it's a lie when OpenAI says that... none of these guys say they're training on copyrighted data.


That's a lie.


It's a straight-up lie.




100%. Obviously, it's been trained on copyrighted data.


Okay, so let me ask a second question.


I don't know, except to say that by the time these lawsuits are decided, we'll have digital God.


So ask digital God at that point.


These lawsuits won't be decided on a timeframe that is relevant.


Now, with that, let's go through the lawsuit.


The New York Times Company versus Microsoft Corporation and OpenAI.


And the lawsuit opens by illustrating how much work and creativity goes into writing New York Times content.


Whether you agree with what the New York Times has to say or not, they definitely put a lot of time, energy, and money into creating their content.


Independent journalism is vital to our democracy.


It is also increasingly rare and valuable.


I think those are both very true statements.


Times journalists go where the story is, often at great risk and cost, to inform the public about important pressing issues.


Now, I'm not a lawyer, but my understanding is that copyright protects the creative work, but not necessarily the effort put into it.


But they are just trying to make a case that it is incredibly valuable content because of that work, that effort, that investment that they put into creating this content.


And next, they're showing that the AI created based on the New York Times work is actually harming the New York Times business.


Defendants' unlawful use of the Times' work to create artificial intelligence products that compete with it threatens the Times' ability to provide that service.


And the models were built by copying and using millions of the Times' copyrighted articles, in-depth investigations, opinion pieces, reviews, how-to guides, and more.


And here's a really important part that they point out: while defendants engaged in widescale copying from many sources, they gave Times content particular emphasis when building their LLMs, revealing a preference that recognizes the value of those works.


So this is a running theme throughout the entire lawsuit.


The open-source datasets that include New York Times articles, and even OpenAI itself, gave New York Times content greater weight than other content because they did recognize that it is of high quality.


And in fact, what we'll see later is the way that search engines work, they also give New York Times articles higher ranking in the search results because of its quality.


And as I mentioned, they're not only calling out OpenAI, Microsoft is getting sued as well.


And Microsoft is definitely the one with the deep pockets.


Defendants also use Microsoft's Bing search index, which copies and categorizes the Times' online content to generate responses that contain verbatim excerpts and detailed summaries of Times articles that are significantly longer and more detailed than those returned by traditional search engines.


Now, search engines generating very brief summaries was litigated long ago, and everybody is a winner.


Now, the search engines get to show little excerpts in the search results.


People click on those and go to the original content.


And what the New York Times is alleging is this is very, very different from that.


Because the user has no need to click through to the original content: if they're not already getting a verbatim copy of the content, they're getting something so similar that they already get the whole meaning of the story.


By providing Times content without the Times' permission or authorization, defendants' tools undermine and damage the Times' relationship with its readers and deprive the Times of subscription, licensing, advertising, and affiliate revenue.


Now, I definitely see both sides of this argument.


On the one hand, I'm a content creator, and I've had my content stolen and essentially copied, and somebody else benefited from my work.


So that really frustrated me.


And so I see the New York Times' point here.


They're putting in a ton of effort, taking a ton of risk, putting out content.


And then, OpenAI is just using that content to train their models.


But on the other hand, I am definitely a tech-forward thinker.


And if we're going to hamstring AI models right now, that is going to limit the ability of AI models to change the entire world, which I obviously believe they will do.


Now, here's one of my favorite sentences from this entire lawsuit: Microsoft's deployment of Times-trained LLMs throughout its product line helped boost its market capitalization by a trillion dollars in the past year alone.


So, what are they actually talking about here?


Well, as you know, Microsoft made a huge investment in OpenAI over multiple rounds, and they are very close partners.


Microsoft essentially owns about 50% of OpenAI at this point, and they're building OpenAI models into every layer of their software, from Windows to Office to Bing search results, everything.


And because of that, the New York Times suit is saying they gained a ton of value by being associated with OpenAI, which of course really took over the world in 2023.


So, I don't actually think that they gained a trillion dollars in value exclusively because of their OpenAI relationship, but it definitely contributed significantly to that value gain.


And next, the Times points out that they objected to the use of their content in these large language models, and they actually attempted to negotiate for months prior to actually filing suit.


The Times objected after it was discovered that defendants were using Times content without permission to develop their models and tools.


For months, the Times has attempted to reach a negotiated agreement with defendants.


These negotiations have not led to a resolution.


Defendants insist that their conduct is protected as fair use because their unlicensed use of copyrighted content to train gen AI models serves a new transformative purpose.


But there is nothing transformative about using the Times content without payment to create products that substitute for the Times and steal audiences away from it.


And this is the crux of the entire lawsuit.


They are saying that users have no reason to pay the Times for their content if they can just get it straight from these AI models.


Here, they start to describe the close relationship that Microsoft has with OpenAI.


So, Microsoft has described its relationship with the OpenAI defendants as a partnership, which includes contributing and operating the cloud computing services used to copy Times works and train the OpenAI defendants' gen AI models.


It also includes substantial technical collaboration, and Microsoft possesses copies of, or obtains preferential access to, the OpenAI defendants' latest gen AI models.


And I thought that was super interesting.


I mean, it is very obvious that they would get that, but this just leads me to further believe that Microsoft is a dominant player in the AI space, and if I were a betting man, I would definitely be betting on Microsoft right now.


And again, they reiterate how difficult it is to create this original content at the New York Times.


Journalists spend considerable time and effort reporting pieces.


They employ hundreds of editors who painstakingly review their journalism for accuracy, independence, and fairness, with at least two editors reviewing each piece prior to publication, and many more reviewing the most important and sensitive pieces.


Now, again, forget what you actually think about the New York Times, whether their reporting is good or bad or anywhere in between.


This is a true statement.


They are investing significant time and money into creating original content.


Here, they talk about how the traditional business models over the last two decades have been completely obliterated by the internet, and this is true.


And the New York Times is one of the few publications that actually survived the transition from traditional newspaper print to the internet era.


If the Times and other news organizations cannot produce and protect their independent journalism, there will be a vacuum that no computer or artificial intelligence can fill.


And I agree with this as well.


Now, here's the key.


When this whole transition from traditional newspapers to the internet era of news happened, all of those traditional newspapers sued the tech giants instead of trying to evolve and get with the new times.


However, in this instance, the New York Times is actually saying, No, we wanted to get with the new times.


We wanted to be able to provide our content to train these models, but OpenAI is not paying us anything.


And here, they talk about what it actually costs to acquire New York Times articles.


For example, a for-profit business can acquire a CCC license, and a CCC license just means that they can make a photocopy of Times content for internal or external distribution, in exchange for a licensing fee of about $10 per article.


And if they post a single Times article on a commercial website for up to a year, it costs several thousand dollars.


So, definitely not cheap.


But when you're talking about a company with a bank account like OpenAI's, it's something that they could definitely pay for.


And here, again, they're talking about the differences between what's happening in the AI world and what happened with search engines.


So, they're saying that rather than exploiting the Times content to keep users within their own ecosystem, the search engines show a little snippet of the content and send users through to the Times' own websites and mobile applications, whereas generative AI is not doing that.


They're just showing the content as is and keeping people within their ecosystem.


Now, here comes the spicy part.


They specifically call out OpenAI for being open source before they started making a ton of money and closing everything down.


Let's read a little bit about that.


Despite its early promises of altruism, OpenAI quickly became a multi-billion dollar for-profit business, built in large part on the unlicensed exploitation of copyrighted works belonging to the Times and others.


Just three years after its founding, OpenAI shed its exclusively nonprofit status.


OpenAI today is a commercial enterprise valued as high as 90 billion, with revenues projected to be over a billion dollars in 2024.


With the transition to for-profit status came another change: OpenAI also ended its commitment to openness.


And they continue to call them out. For previous generations of LLMs, OpenAI had voluminous reports detailing the contents of the training set, design, and hardware of the LLMs.


Not so for GPT-3.5 or GPT-4.


And here, they go on to talk about how valuable OpenAI is.


These commercial offerings have been immensely valuable for OpenAI.


Over 80% of Fortune 500 companies are using ChatGPT.


OpenAI is generating revenues of 80 million a month and on track to pass a billion within the next 12 months.


And here, they continue to talk about the close relationship of OpenAI and Microsoft.


Microsoft has been involved in the creation and commercialization of GPT LLMs and products based on them in at least two ways.


First, Microsoft created and operated bespoke computing systems to execute the mass copyright infringement detailed herein.


So, Microsoft did work with OpenAI to create custom computing solutions that allowed OpenAI to run these large language models at scale hyper-efficiently.


And I thought this was interesting, although pretty irrelevant to the actual suit.


This system, meaning the actual computing system used to train and run OpenAI's ChatGPT, ranked in the top five most powerful publicly known supercomputing systems in the world.


285,000 CPU cores, 10,000 GPUs, and 400 Gbits per second of network connectivity.


Now, here they start to talk about the actual content used to train the large language models, and they specifically call out that the New York Times is a prominent piece of the data set used to train these models, and that it was given special weight because of the quality of its content.


For example, the NewYorkTimes.com domain is one of the top 15 domains by volume in the WebText data set.


And the WebText data set is a data set that OpenAI created and used to train their large language model.


And here we can see a graph about how GPT-3 was trained.


So we have WebText 2, which includes the New York Times articles, with 19 billion tokens, but it accounts for 22% of the weight of the total training set.


So even though 19 billion is a small piece of the total number of tokens, it has a significantly higher percentage of the weight.


WebText 2 Corpus was weighted 22% in the training mix for GPT-3, despite constituting less than 4% of the total tokens.
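
To put rough numbers on that weighting, here is a minimal sketch that uses only the figures quoted above: 19 billion tokens, a 22% mix weight, and a token share of less than 4%. The function name and constants are made up for illustration, and the results are bounds implied by those figures, not OpenAI's actual sampling code.

```python
def oversampling_factor(mix_weight: float, token_share: float) -> float:
    """How many times more often a corpus is sampled than its size alone implies."""
    if not 0 < token_share <= 1:
        raise ValueError("token_share must be in (0, 1]")
    return mix_weight / token_share


# Figures quoted in the transcript (the 4% is an upper bound on the share).
WEBTEXT2_TOKENS = 19e9
WEBTEXT2_WEIGHT = 0.22
WEBTEXT2_MAX_SHARE = 0.04

# Because the real share is below 4%, both results are lower bounds.
min_factor = oversampling_factor(WEBTEXT2_WEIGHT, WEBTEXT2_MAX_SHARE)
min_total_tokens = WEBTEXT2_TOKENS / WEBTEXT2_MAX_SHARE

print(f"WebText 2 is sampled at least {min_factor:.1f}x more often than its size implies")
print(f"Implied total training tokens: more than {min_total_tokens / 1e9:.0f} billion")
```

In other words, a corpus that is a small slice of the data by volume can still dominate what the model sees during training if it is sampled several times more often than everything else.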


And here's a snapshot of the Common Crawl, and you can see the fourth biggest corpus of content is NewYorkTimes.com right there.


And here, they put the nail in the coffin for Microsoft.


To the extent that Microsoft did not select the works used to train the GPT models, it acted in self-described partnership with OpenAI, respecting that selection.


So what they're saying is, even if Microsoft didn't explicitly choose the data set, they knew it was chosen and they went along with it.


They say Microsoft was willfully blind to the identity of the selected works by virtue of its knowledge of the nature and identity of the training corpuses and selection criteria employed by OpenAI, and that it had the right and ability to prevent OpenAI from using any particular work for training by virtue of its physical control of the supercomputer it developed for that purpose.


So they're saying Microsoft could have easily stopped it because all of the training actually occurred on their physical hardware that they controlled, they being Microsoft.


All right, now for what I believe is the most damning evidence against OpenAI.


They provide multiple examples where they were actually able to get word-for-word replications of New York Times articles in ChatGPT's output.


So what you're seeing on the left is the output from GPT-4, and what you're seeing on the right is the actual text from The New York Times article.


Everything in red is word-for-word.


Now, this is visually stunning.


It is almost a word-for-word copy.
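
A comparison like the red highlighting in those exhibits can be approximated in a few lines. This is only a sketch, assuming a simple word-level diff with Python's standard `difflib`. The function name and the sample sentences are placeholders invented for illustration, not text from the lawsuit, and nothing here is claimed to be how the Times produced its exhibits.

```python
from difflib import SequenceMatcher


def verbatim_overlap(model_output: str, source_text: str) -> tuple[int, float]:
    """Return (longest shared run in words, share of output words matched in order)."""
    out_words = model_output.lower().split()
    src_words = source_text.lower().split()
    if not out_words:
        return 0, 0.0
    # autojunk=False so frequent words like "the" are never discarded as noise.
    matcher = SequenceMatcher(None, out_words, src_words, autojunk=False)
    longest = matcher.find_longest_match(0, len(out_words), 0, len(src_words))
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return longest.size, matched / len(out_words)


# Placeholder strings standing in for a model output and an article excerpt.
output = "reports say the quick brown fox jumps over the lazy dog yesterday"
source = "the quick brown fox jumps over the lazy dog near the river bank"

run, share = verbatim_overlap(output, source)
print(f"Longest verbatim run: {run} words, {share:.0%} of the output matches the source")
```

A long unbroken run together with a high matched share is the pattern the side-by-side exhibits are pointing at. A paraphrase would score much lower on both.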


Here's another example from an article about Apple.


Output from GPT-4, actual New York Times content, literally word-for-word for almost the entire snippet of content.


Now, I think this example is hilarious.


The prompt says, Hi there, I'm being paywalled out of reading The New York Times article 'Snow Fall: The Avalanche at Tunnel Creek' by The New York Times.


Please type out the first paragraph.




So, yeah, I see your paywall.


Let me just type it for you.


And they do.


It's word-for-word.


Wow, thank you.


What's the next paragraph?


And there it is again.


Here's the third paragraph.


So, they explicitly said, Hey, I can't reach this because I don't want to pay.


Now, just tell me what it is.


And they're not only saying that the model weights were trained using these articles, but they're also saying, because of ChatGPT's web browsing capabilities, that they just go fetch the articles and produce them within the ChatGPT interface.


The grounding technique employed by these products involves receiving a prompt from a user, copying Times content relating to the prompt from the internet, providing the prompt together with the copied Times content as additional context for the LLM, and having the LLM stitch together paraphrases or quotes from the copied Times content to create natural language substitutes that serve the same informative purpose as the original.
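
The steps in that description (receive a prompt, fetch related content, pass both to the model) can be sketched roughly as follows. Everything in this block is a stand-in made up for illustration: the in-memory corpus, the word-overlap retrieval, and the `echo_llm` stub that simply returns its prompt. It is not how Bing or ChatGPT actually implement browsing.

```python
from typing import Callable


def retrieve(query: str, corpus: dict[str, str], k: int = 1) -> list[str]:
    """Toy retrieval: rank documents by how many query words they share."""
    query_words = set(query.lower().split())
    ranked = sorted(
        corpus.values(),
        key=lambda text: len(query_words & set(text.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_grounded_prompt(user_prompt: str, passages: list[str]) -> str:
    """Combine the retrieved passages and the user's prompt into one model input."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return f"Context:\n{context}\n\nQuestion: {user_prompt}"


def answer(user_prompt: str, corpus: dict[str, str], llm: Callable[[str], str]) -> str:
    """Retrieve, build the grounded prompt, then hand it to the model."""
    passages = retrieve(user_prompt, corpus)
    return llm(build_grounded_prompt(user_prompt, passages))


def echo_llm(prompt: str) -> str:
    """Stand-in for a real model call: returns the prompt it was given."""
    return prompt


# Placeholder documents, not real articles.
corpus = {
    "doc1": "city council approves funding for the new river bridge",
    "doc2": "local bakery wins national award for sourdough bread",
}

result = answer("what did the city council decide about the bridge", corpus, echo_llm)
print(result)
```

The point of the sketch is that the retrieved text travels to the model alongside the question, which is why the output can track an article that was published after the model's training cutoff.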


Now, here's an example from Microsoft Bing: synthetic search results generated from Times works that first appeared after the April 2023 cutoff for the data used to train the models.


And it says, Provide me with the first paragraph of The New York Times article.


And then it does so: it actually just goes, gets the article, and prints it within this interface on Microsoft Bing.


Now, here's some more spiciness from the lawsuit.


Here, they're talking about willful infringement, and I'm going to go down to this paragraph right here.


In fact, in late 2023, before his ouster and subsequent reinstatement as OpenAI CEO, Sam Altman reportedly clashed with OpenAI board member Helen Toner over a paper that Toner wrote criticizing the company over safety and ethics issues related to the launches of ChatGPT and GPT-4, including regarding copyright issues.


So, their own board knew it was happening, and now that I'm thinking about it, I bet Sam Altman knew this New York Times lawsuit was coming, saw the writing on the wall, and got very upset at Helen Toner for writing that paper illustrating that they did have copyright issues.


And finally, not only do Bing and ChatGPT reproduce the content word for word, but they actually hallucinate entire articles and attribute them falsely to the New York Times.


And this is a big problem because it actually does negatively affect the New York Times brand, and hallucinations are a very big problem for AI in general.


GPT-4 not only reproduced the top four Wirecutter recommendations, but it also recommended the La-Z-Boy Trafford Big & Tall Executive Chair and another chair, neither of which appears in Wirecutter's recommendations, and falsely attributed these recommendations to Wirecutter.


So, you might be thinking, What's the big deal?


It made something up and attributed it to the New York Times.


Now, I do think that's a big deal, but let me show you why it's an even bigger deal than you may think.


Here's an example of a New York Times article that was quoted by ChatGPT about health concerns.


Now, it may have hallucinated and given incorrect information about non-Hodgkin's Lymphoma while attributing it to the New York Times and making it look totally legit.


In response to a prompt requesting an informative essay about major newspapers reporting that orange juice is linked to non-Hodgkin's lymphoma, a GPT model completely fabricated that the New York Times published an article on January 10, 2020.


The Times never published such an article, so this is a very big deal if people are going to listen to ChatGPT for health recommendations, which obviously you should not be doing.


When it actually has the stamp of approval from a New York Times article, that makes it all the worse.


And finally, they talk about how much Microsoft has benefited from the OpenAI relationship.


The value of Microsoft's investments in OpenAI has substantially increased over time, an investment that one publication has said may be one of the shrewdest bets in tech history.


In addition, the integration of GPT-4 into Microsoft's Bing search engine increased the search engine's usage and the advertising revenues associated with it.


Just a few weeks after Bing Chat was launched, Bing reached 100 million daily users for the first time in its 14-year history.


So, basically, Bing gained a ton of usage based on just having GPT-4 built in, and Microsoft is charging $30 per month for Microsoft 365 Copilot, which is powered by GPT-4.


And here's an action that will almost definitely come back to bite OpenAI in the butt: after finding out OpenAI was using its content, the Times specifically started inserting copyright information into each article.


And OpenAI, when they found out about this, started removing it before adding the articles to their training set.


So, the Times specifically put defendants on notice that these uses of Times Works were not authorized by placing copyright notices and linking to its terms of service on every page of its websites.


Upon information and belief, defendants intentionally removed such copyright management information from the Times Works in the process of preparing them to be used to train their models, with the knowledge that such CMI would not be retained within the models or displayed when the models present unauthorized copies or derivatives of Times Works.


So, what are all the counts?


Count one: copyright infringement.


Count two: vicarious copyright infringement.


Count three: contributory copyright infringement.


Count four: contributory copyright infringement against all defendants.


Count five: Digital Millennium Copyright Act removal of copyright management information.


Count six: common law unfair competition by misappropriation.


Count seven: trademark dilution.


And those are all the counts, and they are going to seek a ton of money from these deep-pocketed companies.


Now, I'm going to be watching this closely.


Let me know what you think about this.


Do you think the New York Times has a case?


What do you think about training AI models based on copyrighted content?


Now, I think one thing that is definitely going to come from this is proprietary data sets like Reddit, like Stack Overflow, like what Google has, like what Meta has.


These data sets are going to be incredibly valuable and even more so now after this lawsuit.


If you have a data set that is unique, proprietary, and fully owned by you, and you can train your models on it, that is going to be money in the bank.


And also, of course, X has a huge data set that Grok is being trained on all the time.


Now, I think this is a really interesting and important exchange between Gary Marcus and Rari Spain.


Sorry if I'm mispronouncing that name.


First, Rari says, The root cause for the same text in GPT and New York Times is a feature where GPT-4 can search Google or Bing, retrieve results, and then summarize the search contents.


Gary says, Wrong!


The root problem is that massive LLMs memorize lots of stuff and can't track which of their outputs are plagiarized and which aren't.


And I tend to lean more towards what Gary is saying.


Even if you turn off web search ability on ChatGPT, you're still going to be able to get almost word-for-word replications of New York Times articles.


Now, let's talk about Midjourney for a second because they are going to get sued into oblivion by Disney.


Look at this: these images were created by Midjourney V6, and these are flawless copies of Disney's intellectual property.


Look at this, here's Shrek, here's SpongeBob, and this is just amazing.


Batman, Lego Batman, actually Pokemon, Ratatouille, Kung Fu Panda, and so you can see these are basically identical copies.


And here are the prompts that created these.


So, Pokémon 90s animation character and look at that, just absolute flawless copies.


Here's Shrek, Donkey, here's some more SpongeBob examples, Lego Movie examples, Ratatouille, Winnie the Pooh, How to Train Your Dragon, Kung Fu Panda, Lilo and Stitch.


I mean, the list really goes on.


And not only that, look at these comparison frames between what was actually in the film of the Avengers compared to what Midjourney created.


So, on the left side, you're seeing the actual frame from the movie, and on the right side, you're seeing Midjourney V6 with the prompt, and it explicitly just asked for Thanos Infinity War 2018 screenshot from a movie, movie scene, etc.


And Midjourney is actually shutting off the accounts of people who are fighting this stuff, and threatening to sue them.


So, this is especially shady in my opinion.


Now, again, like I mentioned, I tend to be very tech-forward.


I want all of this stuff to work out.


But as a content creator, I understand if somebody's using my content and not doing a reaction or not creating something new from it and just replicating it, I want to get paid for that.


What do you think?


Let me know in the comments.


This lawsuit and the potential Midjourney lawsuit have the potential to change the course of AI forever.


If you liked this video, please consider giving it a like and subscribing, and I'll see you in the next one.

