[The New York Times Sues OpenAI and Microsoft] Reading an English Explainer in Japanese [December 30, 2023 | @Matthew Berman]

December 30, 2023, 16:48

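The New York Times has filed a copyright infringement lawsuit against OpenAI and Microsoft, alleging that they used its content without permission to build the ChatGPT models and created enormous value from it. The lawsuit could become a landmark legal case for copyright issues in the AI industry. OpenAI was originally a nonprofit organization but has since become a commercial enterprise with deepening ties to Microsoft. New York Times articles played an important role in OpenAI's training set, which forms the basis of the infringement claim. The lawsuit is expected to have a major impact on the future of AI and its balance with copyright protection.
Published: December 30, 2023
Note: We recommend playing the video before reading.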

This might be the most important lawsuit of our generation.

これは我々の世代で最も重要な訴訟かもしれない。

The New York Times just unleashed a massive lawsuit against OpenAI and Microsoft, alleging that they illegally used New York Times content to build the ChatGPT models, resulting in a trillion dollars of value creation.

ニューヨーク・タイムズは、OpenAIとMicrosoftに対し、ニューヨーク・タイムズのコンテンツを違法に使用してChatGPTモデルを構築し、その結果1兆ドルもの価値を生み出したとして、大規模な訴訟を起こした。

The allegations are stunning.

その主張は驚くべきものだ。

From being able to reproduce New York Times articles word for word to ChatGPT hallucinating entire New York Times articles and falsely attributing them to the New York Times, sometimes with dire health consequences, this is shaping up to be the most important AI legal case in our generation.

ニューヨーク・タイムズの記事を一字一句そのまま再現できることから、ChatGPTが架空のニューヨーク・タイムズの記事を作り出し、それをニューヨーク・タイムズのものと誤って帰属させることにより、時には深刻な健康への影響をもたらすことまで、これは私たちの世代において最も重要なAI関連の法的な案件となりつつあります。

It will define how AI companies operate going forward.

これは、AI企業が今後どのように活動していくかを定義することになるだろう。

OpenAI and Microsoft aren't the only ones in legal trouble.

法的な問題を抱えているのは、OpenAIとMicrosoftだけではない。

Midjourney V6 was just released, and it is easily able to reproduce Disney intellectual property nearly frame for frame, setting them up to be sued by the behemoth that is Disney's legal team.

Midjourney V6がリリースされたばかりだが、ディズニーの知的財産をほぼ1フレーム単位で簡単に再現できるため、ディズニーの巨大な法務チームから訴えられる可能性がある。

Will this mean that OpenAI and Midjourney need to delete their models and start completely from scratch?

これは、OpenAIとMidjourneyが自分たちのモデルを削除し、完全にゼロからやり直す必要があることを意味するのだろうか？

Does this give companies like Google and Meta, that have their own proprietary data, a huge boost?

これは、GoogleやMetaのような独自のデータを持つ企業に大きな追い風となるのだろうか？

And how did Elon Musk see all of this coming a mile away?

そして、イーロン・マスクはどのようにして、このような事態が起こることを予見したのだろうか？

I read the entire 69-page lawsuit, and I'm going to break it all down for you.

私は69ページに及ぶ訴状をすべて読んだので、そのすべてを解説していこう。

The core of the lawsuit hinges on what fair use actually is, so let's start with a definition of what fair use is.

訴訟の核心は、フェアユースとは何かということにかかっているので、フェアユースとは何かという定義から始めよう。

Funnily enough, I actually asked ChatGPT to define it for me.

面白いことに、私は実際にChatGPTにその定義を尋ねた。

So, fair use is a legal doctrine that allows limited use of copyrighted material without requiring permission from the rights holders, typically for purposes such as commentary, criticism, education, news reporting, parody, and research.

つまり、フェアユースとは、権利者の許諾を必要とせず、著作物の限定的な利用を認める法理であり、通常、論評、批評、教育、報道、パロディ、研究などの目的に用いられる。

This doctrine balances the interests of the copyright holder with the public's interest in the free flow of information and ideas.

この法理は、著作権者の利益と、情報やアイデアの自由な流れという公共の利益とのバランスをとるものです。

Now, thinking about what OpenAI is doing, they are taking this copyrighted content and training their models with it.

今、OpenAIのしていることを考えてみると、彼らはこの著作権で保護されたコンテンツを利用し、それを使ってモデルをトレーニングしているのだ。

But they're able to reproduce the content word for word.

しかし、彼らはコンテンツを一字一句再現することができる。

And then I go on to ask, what if copyrighted material is used to create something new?

では、著作権で保護されたものを使って新しいものを作る場合はどうでしょうか？

When copyrighted material is used to create something new, it may still fall under the fair use doctrine if the new work transforms the original by adding new expression, meaning, or message.

著作権で保護された素材が何か新しいものを創作するために使用される場合、その新しい作品が新しい表現、意味、メッセージを加えることによってオリジナルを変容させるのであれば、フェアユースの原則に該当する可能性があります。

And that's really important.

これは本当に重要なことだ。

In fact, there were a lot of lawsuits around reaction content, arguing that such videos basically just replay the original content.

実際、リアクション・コンテンツをめぐっては、基本的にオリジナルのコンテンツを再生しているだけだという訴訟がたくさん起こりました。

But it was found that reaction-style content, stuff you find on YouTube, actually does add new expression and new meaning to the original content.

しかし、YouTubeにあるようなリアクション・スタイルのコンテンツは、実はオリジナルのコンテンツに新しい表現や意味を加えていることがわかったのです。

And now we have a ton of reaction channels.

そして今、たくさんのリアクション・チャンネルがある。

Now, I want to show you what Elon Musk said just a few weeks ago about OpenAI.

さて、ほんの数週間前にイーロン・マスクがOpenAIについて言ったことをお見せしたい。

He said that OpenAI was lying about not using copyrighted content.

彼は、OpenAIは著作権で保護されたコンテンツを使用していないと嘘をついていると言った。

And not only that, by the time the lawsuits make their way through the court systems, it won't even matter because we'll have AGI.

そしてそれだけでなく、訴訟が裁判システムを通過する頃には、AGIがあるから問題にもならないだろう。

Take a look at this clip.

このクリップを見てください。

That's interesting.

面白いですね。

The IP issue, which I think is actually something, uh, I can speak to as somebody who's in the creator business, and the journalistic business, and whatnot, uh, where we care about copyright.

知的財産の問題は、クリエイターの仕事、ビジネス、ジャーナリズムの仕事、そして著作権を気にする仕事に携わっている者として言えることです。

One of the things about training on data has been this idea that you're not going to train, or these things are not being trained, on people's copyrighted information.

データに関するトレーニングのひとつに、人々の著作権で保護された情報ではトレーニングしない、あるいはトレーニングしないという考え方があります。

Historically, that's been the concept.

歴史的には、それがコンセプトだった。

Yeah, that's a huge lie.

ああ、それは大嘘だ。

Say that again.

もう一度言ってください。

That's these AI, these AI are all trained on copyrighted data.

これらのAIは、これらのAIはすべて著作権で保護されたデータで訓練されている。

Obviously.

明らかにね。

So you think it's a lie when OpenAI says that this is not... None of these guys say they're training on copyrighted data.

だから、OpenAIが、これは著作権で保護されたデータで訓練しているなんて誰も言ってないって言うのは嘘だと思うんだね。

That's a lie.

それは嘘です。

It's a straight-up lie.

真っ赤な嘘だ。

Okay.

オーケー。

100%. Obviously, it's been trained on copyrighted data.

100％だ。明らかに著作権で保護されたデータでトレーニングされている。

Okay, so let me ask a second question.

では、2つ目の質問をさせてください。

I don't know, except to say that by the time these lawsuits are decided, we'll have digital God.

この訴訟が決着するころには、デジタルゴッドが誕生しているだろう。

So ask digital God at that point.

だから、その時点でデジタルの神様に聞いてみてください。

Um, these lawsuits won't be decided on a timeframe that is relevant.

ええと、これらの訴訟は、関連する時間枠の前に決定されることはないでしょう。

Now, with that, let's go through the lawsuit.

さて、それでは訴訟について説明しよう。

The New York Times Company versus Microsoft Corporation and OpenAI.

ニューヨーク・タイムズ対MicrosoftおよびOpenAI。

And the lawsuit opens by illustrating how much work and creativity goes into writing New York Times content.

訴訟の冒頭では、ニューヨーク・タイムズのコンテンツを書くためにどれだけの労力と創造性が費やされているかを説明している。

Whether you agree with what the New York Times has to say or not, they definitely put a lot of time, energy, and money into creating their content.

ニューヨーク・タイムズの発言に同意するかどうかは別として、彼らがコンテンツの作成に多くの時間、エネルギー、資金を費やしていることは間違いない。

Independent journalism is vital to our democracy.

独立したジャーナリズムは民主主義にとって不可欠である。

It is also increasingly rare and valuable.

また、ますます希少価値が高まっている。

I think those are both very true statements.

これはどちらも正しい意見だと思う。

Times journalists go where the story is, often at great risk and cost, to inform the public about important pressing issues.

タイムズのジャーナリストは、重要で差し迫った問題を国民に伝えるために、しばしば大きなリスクとコストを負って、記事のある場所に赴く。

Now, I'm not a lawyer, but my understanding is that copyright protects the creative work, but not necessarily the effort put into it.

さて、私は弁護士ではないが、私の理解では、著作権は創造的な作品を保護するが、それに費やされた努力は必ずしも保護しない。

But they are just trying to make a case that it is incredibly valuable content because of that work, that effort, that investment that they put into creating this content.

しかし、彼らはこのコンテンツを作成するために費やした仕事、努力、投資のおかげで、信じられないほど価値のあるコンテンツであることを主張しようとしているのだ。

And next, they're showing that the AI created based on the New York Times work is actually harming the New York Times business.

そして次に、ニューヨーク・タイムズの作品に基づいて作成されたAIが、ニューヨーク・タイムズのビジネスに実際に損害を与えていることを示している。

Defendants' unlawful use of the Times' work to create artificial intelligence products that compete with it threatens the Times' ability to provide that service.

被告がタイムズの著作物を違法に使用して、それと競合する人工知能製品を作成したことは、タイムズのサービス提供能力を脅かしている。

And the models were built by copying and using millions of the Times' copyrighted articles, in-depth investigations, opinion pieces, reviews, how-to guides, and more.

そしてそのモデルは、タイムズの著作権で保護された何百万もの記事、綿密な調査、オピニオン記事、レビュー、ハウツーガイドなどをコピーして使用することで構築された。

And here's a really important part that they point out: while defendants engaged in widescale copying from many sources, they gave Times content particular emphasis when building their LLMs, revealing a preference that recognizes the value of those works.

そして、ここが本当に重要な部分であり、被告は多くの情報源から大規模なコピーを行っているが、LLMを構築する際にはタイムズのコンテンツを特に重視しており、これらの著作物の価値を認める嗜好があることを明らかにしている。

So this is a running theme throughout the entire lawsuit.

つまり、これは訴訟全体を貫くテーマなのだ。

The open-source datasets that include New York Times articles, and even OpenAI itself, gave New York Times content greater weight than other content because they did recognize that it is of high quality.

ニューヨーク・タイムズの記事を含むオープンソースのデータセット、そしてOpenAI自身も、ニューヨーク・タイムズのコンテンツを他のコンテンツよりも重視した。

And in fact, as we'll see later, search engines also give New York Times articles a higher ranking in the search results because of their quality.

そして実際、検索エンジンはその質の高さゆえに、検索結果でニューヨーク・タイムズの記事を上位に表示するのだ。

And as I mentioned, they're not only calling out OpenAI, Microsoft is getting sued as well.

先にも述べたように、彼らはOpenAIを非難しているだけでなく、Microsoftも訴えられている。

And Microsoft is definitely the big pockets.

そして、Microsoftは間違いなく大きなポケットだ。

Defendants also use Microsoft's Bing search index, which copies and categorizes the Times' online content to generate responses that contain verbatim excerpts and detailed summaries of Times articles that are significantly longer and more detailed than those returned by traditional search engines.

被告はまた、MicrosoftのBing検索インデックスを使用しており、Timesのオンラインコンテンツをコピーして分類し、従来の検索エンジンが返すものよりもかなり長く詳細なTimesの記事の逐語的抜粋と詳細な要約を含む応答を生成している。

Now, search engines generating very brief summaries was litigated long ago, and everybody is a winner.

現在、検索エンジンが非常に簡潔な要約を生成することは、かなり前に訴訟になっており、誰もが勝者である。

Now, the search engines get to show little excerpts in the search results.

現在、検索エンジンは検索結果に小さな抜粋を表示することができる。

People click on those and go to the original content.

人々はそれをクリックし、オリジナルのコンテンツにアクセスする。

And what the New York Times is alleging is this is very, very different from that.

ニューヨーク・タイムズが主張しているのは、それとはまったく違うということだ。

Because instead of clicking through to the original content, the user has no need to do that because if they're not already getting the verbatim copy of the content, they're getting something extremely similar to the point where they already get the whole meaning of the story.

というのも、オリジナルのコンテンツをクリックする代わりに、ユーザーはそれをする必要がないからです。なぜなら、すでにコンテンツの逐語的なコピーを得ていないのであれば、ストーリーの全意義をすでに得ている点で、極めて類似したものを得ているからです。

By providing Times content without the Times' permission or authorization, defendants' tools undermine and damage the Times' relationship with its readers and deprive the Times of subscription, licensing, advertising, and affiliate revenue.

タイムズの許可や承認なしにタイムズのコンテンツを提供することで、被告のツールはタイムズと読者の関係を損ない、タイムズから購読ライセンス、広告、アフィリエイトの収入を奪っている。

Now, I definitely see both sides of this argument.

さて、私はこの議論の両面を間違いなく見ている。

On the one hand, I'm a content creator, and I've had my content stolen and essentially copied, and somebody else benefited from my work.

一方では、私はコンテンツ制作者であり、自分のコンテンツが盗用され、本質的にコピーされ、誰かが私の作品から利益を得ている。

So that really frustrated me.

それが本当に悔しかった。

And so I see the New York Times' point here.

だから、ニューヨーク・タイムズの主張も理解できる。

They're putting in a ton of effort, taking a ton of risk, putting out content.

彼らは多大な労力を費やし、多大なリスクを負ってコンテンツを発信している。

And then, OpenAI is just using that content to train their models.

そして、OpenAIはそのコンテンツを使ってモデルをトレーニングしている。

But on the other hand, I am definitely a tech-forward thinker.

しかし一方で、私は間違いなく技術的な先見性を持っている。

And if we're going to hamstring AI models right now, that is going to limit the ability of AI models to change the entire world, which I obviously believe they will do.

もし今、私たちがAIモデルの足かせになるのであれば、それはAIモデルが世界全体を変える能力を制限することになるでしょう。

Now, here's one of my favorite sentences from this entire lawsuit: Microsoft's deployment of Times-trained LLMs throughout its product line helped boost its market capitalization by a trillion dollars in the past year alone.

さて、この訴訟全体の中で、私のお気に入りの一文がある。Microsoftは、タイムズのコンテンツで学習させたLLMを製品ライン全体に配備したことで、昨年1年間だけで時価総額を1兆ドル押し上げた。

So, what are they actually talking about here?

では、彼らはここで実際に何を話しているのだろうか？

Well, as you know, Microsoft made a huge investment in OpenAI over multiple rounds, and they are very close partners.

ご存知の通り、MicrosoftはOpenAIに複数回にわたって巨額の投資を行い、非常に緊密なパートナーとなっている。

Microsoft essentially owns about 50% of OpenAI at this point, and they're building OpenAI models into every layer of their software, from Windows to Office to Bing search results, everything.

Microsoftは現時点で実質的にOpenAIの約50％を所有しており、WindowsからOffice、Bingの検索結果に至るまで、自社のソフトウェアのあらゆるレイヤーにOpenAIモデルを組み込んでいる。

And because of that, the New York Times suit is saying they gained a ton of value by being associated with OpenAI, which of course really took over the world in 2023.

そのため、ニューヨーク・タイムズの訴訟では、OpenAIに関連することで多大な価値を得たとしている。

So, I don't actually think that they gained a trillion dollars in value exclusively because of their OpenAI relationship, but it definitely contributed significantly to that value gain.

ですから、OpenAIとの関係だけで1兆ドルの価値を得たとは考えませんが、その価値獲得に大きく貢献したことは間違いありません。

And next, the Times points out that they objected to the use of their content in these large language models, and they actually attempted to negotiate for months prior to actually filing suit.

次に、タイムズは、これらの大規模な言語モデルで自分たちのコンテンツが使用されることに異議を唱え、実際に訴訟を起こす前の数ヶ月間、交渉を試みたと指摘している。

The Times objected after it was discovered that defendants were using Times content without permission to develop their models and tools.

タイムズは、被告がモデルやツールを開発するためにタイムズのコンテンツを無断で使用していることが発覚した後、異議を申し立てた。

For months, the Times has attempted to reach a negotiated agreement with defendants.

数ヶ月間、タイムズは被告との交渉による合意を試みてきた。

These negotiations have not led to a resolution.

これらの交渉は解決に至っていない。

Defendants insist that their conduct is protected as fair use because their unlicensed use of copyrighted content to train gen AI models serves a new transformative purpose.

被告は、著作権で保護されたコンテンツを無許諾で使用して生成AIモデルを訓練することは、新たな変革的目的を果たすものであるため、彼らの行為はフェアユースとして保護されると主張している。

But there is nothing transformative about using the Times' content without payment to create products that substitute for the Times and steal audiences away from it.

しかし、タイムズのコンテンツを無報酬で使用し、タイムズの代わりとなる製品を作り、タイムズから視聴者を奪うことに、変革的な意味はない。

And this is the crux of the entire lawsuit.

そして、これが訴訟全体の核心である。

They are saying that users have no reason to pay the Times for their content if they can just get it straight from these AI models.

彼らは、これらのAIモデルから直接コンテンツを入手できるのであれば、ユーザーがタイムズにお金を払う理由はないと言っているのだ。

Here, they start to describe the close relationship that Microsoft has with OpenAI.

ここで彼らは、MicrosoftとOpenAIの密接な関係について説明し始めた。

So, Microsoft has described its relationship with the OpenAI defendants as a partnership. That includes contributing and operating the cloud computing services used to copy Times works and train the OpenAI defendants' GenAI models.

つまり、MicrosoftはOpenAI被告との関係をパートナーシップと表現している。これには、タイムズの著作物をコピーし、OpenAI被告の生成AIモデルを訓練するために使用されるクラウドコンピューティングサービスの提供と運営が含まれる。

It also includes substantial technical collaboration, and Microsoft possesses copies of, or obtains preferential access to, the OpenAI defendants' latest GenAI models.

また、実質的な技術協力も含まれており、MicrosoftはOpenAI被告の最新の生成AIモデルのコピーを所有したり、優先的なアクセスを得たりしている。

And I thought that was super interesting.

これは非常に興味深いと思った。

I mean, it is very obvious that they would get that, but this just leads me to further believe that Microsoft is a dominant player in the AI space, and if I were a betting man, I would definitely be betting on Microsoft right now.

つまり、彼らがそれを手に入れることは非常に明白なのだが、このことは、MicrosoftがAI分野で支配的なプレーヤーであることをさらに確信させる。

And again, they reiterate how difficult it is to create this original content at the New York Times.

そしてまた、ニューヨーク・タイムズのオリジナル・コンテンツを作ることがいかに難しいかを繰り返し述べている。

Journalists spend considerable time and effort reporting pieces.

ジャーナリストは相当な時間と労力をかけて記事を書いている。

The Times employs hundreds of editors who painstakingly review its journalism for accuracy, independence, and fairness, with at least two editors reviewing each piece prior to publication, and many more reviewing the most important and sensitive pieces.

何百人もの編集者を雇い、ジャーナリズムの正確性、独立性、公平性を丹念にチェックし、少なくとも2人の編集者が各記事を掲載前に確認し、さらに多くの編集者が最も重要で繊細な記事を確認している。

Now, again, forget what you actually think about the New York Times, whether their reporting is good or bad or anywhere in between.

ニューヨーク・タイムズの報道が良いか悪いか、あるいはその中間かどうかにかかわらず、あなたがニューヨーク・タイムズについて実際にどう考えているかは忘れてほしい。

This is a true statement.

これは真実である。

They are investing significant time and money into creating original content.

彼らはオリジナル・コンテンツの作成に多大な時間と資金を投じている。

Here, they talk about how the traditional business models over the last two decades have been completely obliterated by the internet, and this is true.

ここで彼らは、過去20年間の伝統的なビジネスモデルがインターネットによって完全に消し去られたことについて話しているが、これは真実だ。

And the New York Times is one of the few publications that actually survived the transition from traditional newspaper print to the internet era.

そして、ニューヨーク・タイムズは、伝統的な新聞の印刷物からインターネット時代への移行を実際に生き延びた数少ない出版物の一つである。

If the Times and other news organizations cannot produce and protect their independent journalism, there will be a vacuum that no computer or artificial intelligence can fill.

もしタイムズや他の報道機関が独立したジャーナリズムを生み出し、守ることができなければ、コンピューターや人工知能では埋められない空白が生じるだろう。

And I agree with this as well.

そして、私もこれに同意する。

Now, here's the key.

さて、重要なのはここからだ。

When this whole transition from traditional newspapers to the internet era of news happened, all of those traditional newspapers sued the tech giants instead of trying to evolve and get with the new times.

伝統的な新聞からインターネット時代へのニュースの移行が起こったとき、伝統的な新聞はすべて、新しい時代に合わせて進化しようとせず、ハイテク大手を訴えた。

However, in this instance, the New York Times is actually saying, No, we wanted to get with the new times.

しかし、今回の件では、ニューヨーク・タイムズは実際には、いや、私たちは新しい時代に合わせたかったのだ、と言っている。

We wanted to be able to provide our content to train these models, but OpenAI is not paying us anything.

私たちはこれらのモデルの訓練のためにコンテンツを提供できるようにしたかったのに、OpenAIは何も支払ってくれない、と。

And here, they talk about what it actually costs to acquire New York Times articles.

そしてここで、ニューヨーク・タイムズの記事を取得するために実際にかかる費用について語っている。

For example, a for-profit business can acquire a CCC (Copyright Clearance Center) license, which just means that they can make a photocopy of Times content for internal or external distribution in exchange for a licensing fee of about $10 per article.

例えば、営利企業はCCCを取得することができます。CCCとは、1記事あたり約10ドルのライセンス料と引き換えに、コンテンツライセンスを使ってタイムズのコンテンツのコピーを社内外に配布できることを意味します。

And if they post a single Times article on a commercial website for up to a year, it costs several thousand dollars.

また、タイムズの記事1本を商業サイトに最長1年間掲載するとなると、数千ドルの費用がかかる。

So, definitely not cheap.

だから、決して安くはない。

But when you're talking about a company with a bank account like OpenAI's, it's something that they could definitely pay for.

しかし、OpenAIのような銀行口座を持っている企業であれば、間違いなく支払うことができるだろう。

And here, again, they're talking about the differences between what's happening in the AI world and what happened with search engines.

ここでもまた、AIの世界で起きていることと検索エンジンで起きたことの違いについて話している。

So, they're saying that search engines show a little snippet of the content and send users through to the Times' own websites and mobile applications, rather than exploiting Times content to keep users within their own ecosystem, whereas generative AI is not doing that.

つまり、ウェブサイトやモバイル・アプリケーションは、タイムズのコンテンツを利用してユーザーを自分たちのエコシステム内に留めるのではなく、検索エンジンがコンテンツの小さなスニペットを表示し、ユーザーがクリックすることで元のコンテンツにアクセスできるようにしているのに対して、ジェネレーティブAIはそのようなことをしていないというのだ。

They're just showing the content as is and keeping people within their ecosystem.

彼らはコンテンツをそのまま表示し、人々を彼らのエコシステム内に留めているのだ。

Now, here comes the spicy part.

さて、ここからが刺激的な部分だ。

They specifically call out OpenAI for being open source before they started making a ton of money and closing everything down.

彼らは、OpenAIが大金を稼ぎ始めてすべてを閉鎖する前にオープンソースであったことを特に非難している。

Let's read a little bit about that.

それについて少し読んでみよう。

Despite its early promises of altruism, OpenAI quickly became a multi-billion dollar for-profit business, built in large part on the unlicensed exploitation of copyrighted works belonging to the Times and others.

初期の利他主義の約束にもかかわらず、OpenAIはすぐに数十億ドル規模の営利ビジネスとなり、その大部分は『タイムズ』紙などが所有する著作物の無許諾の搾取で成り立っていた。

Just three years after its founding, OpenAI shed its exclusively nonprofit status.

設立からわずか3年後、OpenAIは非営利団体としての地位を捨てた。

OpenAI today is a commercial enterprise valued as high as $90 billion, with revenues projected to be over a billion dollars in 2024.

今日のOpenAIは、900億ドルとも評価される営利企業であり、2024年には10億ドル以上の収益が予測されている。

With the transition to for-profit status came another change: OpenAI also ended its commitment to openness.

営利企業への移行に伴い、もうひとつの変化が訪れた： OpenAIはまた、オープン性へのコミットメントを終了した。

And they continue to call them out.

そして、彼らは非難を続ける。

For previous generations of LLMs, OpenAI had voluminous reports detailing the contents of the training set, design, and hardware of the LLMs.

前世代のLLMについては、OpenAIはトレーニングセットの内容、設計、ハードウェアを詳細に記した膨大なレポートを出していた。

Not so for GPT-3.5 or GPT-4.

GPT-3.5やGPT-4ではそうではなかった。

And here, they go on to talk about how valuable OpenAI is.

そしてここで、彼らはOpenAIの価値がいかに高いかを語っている。

These commercial offerings have been immensely valuable for OpenAI.

これらの商用サービスはOpenAIにとって非常に価値のあるものです。

Over 80% of Fortune 500 companies are using ChatGPT.

フォーチュン500社の80％以上がChatGPTを使っています。

OpenAI is generating revenues of $80 million a month and is on track to pass a billion dollars within the next 12 months.

OpenAIは毎月8000万ドルの収益を上げており、今後12ヶ月以内に10億ドルを突破する勢いです。

And here, they continue to talk about the close relationship of OpenAI and Microsoft.

そして、ここで彼らはOpenAIとMicrosoftの密接な関係について語り続ける。

Microsoft has been involved in the creation and commercialization of GPT LLMs and products based on them in at least two ways.

Microsoftは、少なくとも2つの方法でGPT llmsとそれに基づく製品の作成と商業化に関与してきた。

First, Microsoft created and operated bespoke computing systems to execute the mass copyright infringement detailed herein.

第一に、Microsoftは、ここに詳述されている大量の著作権侵害を実行するために、特注のコンピューティング・システムを作成し、運用した。

So, Microsoft did work with OpenAI to create custom computing solutions that allowed OpenAI to run these large language models at scale hyper-efficiently.

つまり、MicrosoftはOpenAIと協力して、OpenAIがこれらの大規模な言語モデルを超効率的に実行できるようなカスタムコンピューティングソリューションを開発したのだ。

And I thought this was interesting, although pretty irrelevant to the actual suit.

実際の訴訟とはかなり関係ないが、私はこれが面白いと思った。

This system, meaning the actual computing system used to train and run OpenAI ChatGPT, ranked in the top five most powerful publicly known supercomputing systems in the world.

このシステム、つまりOpenAIのChatGPTの訓練と実行に使われる実際のコンピューティング・システムは、世界で最も強力な公知のスーパーコンピューティング・システムのトップ5にランクされている。

285,000 CPU cores, 10,000 GPUs, and 400 Gbits per second of network connectivity.

28万5000個のCPUコア、1万個のGPU、毎秒400ギガビットのネットワーク接続。

Now, here they start to talk about the actual content used to train the large language models, and they specifically call out that the New York Times is a prominent piece of the data set used to train these models, and that it was given special weight because of the quality of its content.

ここで、大規模な言語モデルを訓練するために使用される実際のコンテンツについて話し始め、ニューヨーク・タイムズがこれらのモデルを訓練するために使用されるデータセットの重要な一部であり、そのコンテンツの質の高さから特別に重視されていることを特に強調している。

For example, NewYorkTimes.com domain is one of the top 15 domains by volume in the web text data set.

例えば、NewYorkTimes.comドメインは、ウェブテキストデータセットのボリューム上位15ドメインのひとつである。

And the web text data set is a data set that OpenAI acquired and used to train their large language model.

ウェブテキスト・データセットは、OpenAIが取得し、大規模な言語モデルの学習に使用したデータセットである。

And here we can see a graph about how GPT-3 was trained.

GPT-3がどのように学習されたかをグラフで見ることができます。

So we have WebText 2 with 19 billion tokens, but when you take into account the New York Times articles, it accounts for 22% of the weight of the total training set.

ウェブテキスト2には190億のトークンがありますが、ニューヨーク・タイムズの記事を考慮すると、トレーニングセット全体の重みの22％を占めています。

So even though 19 billion is a small piece of the total number of tokens, it has a significantly higher percentage of the weight.

つまり、190億はトークンの総数のほんの一部であるにもかかわらず、重みに占める割合はかなり高いのです。

WebText 2 Corpus was weighted 22% in the training mix for GPT-3, despite constituting less than 4% of the total tokens.

WebText 2コーパスは、全トークンの4%にも満たないにもかかわらず、GPT-3のトレーニングミックスで22%のウェイトを占めています。
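
The weighting described here can be checked with a line of arithmetic. The sketch below is illustrative only: it uses the two figures quoted from the complaint (a 22% share of the training mix and a token share of less than 4%), and the function name `oversampling_factor` is a made-up helper, not anything from OpenAI. Because 4% is only an upper bound on the token share, the result is a lower bound on how heavily WebText2 was oversampled.

```python
def oversampling_factor(mix_weight: float, token_share: float) -> float:
    """Return how many times more often a corpus is sampled than its size alone would imply."""
    if not 0 < token_share <= 1:
        raise ValueError("token_share must be in (0, 1]")
    return mix_weight / token_share


# Figures quoted in the transcript: 22% of the training mix, under 4% of all tokens.
# Using the 4% upper bound on token share gives a lower bound on the factor.
lower_bound = oversampling_factor(mix_weight=0.22, token_share=0.04)
print(f"WebText2 was sampled at least {lower_bound:.1f}x more often than its token share implies")
```

In other words, a corpus that makes up a small slice of the raw tokens can still dominate what the model sees during training, which is the point the complaint is making about quality-weighted sources.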

And here's a snapshot of the Common Crawl, and you can see the fourth biggest corpus of content is NewYorkTimes.com right there.

これはコモン・クロールのスナップショットで、4番目に大きなコンテンツのコーパスがNewYorkTimes.comであることがわかります。

And here, they put the nail in the coffin for Microsoft.

そしてここで、彼らはMicrosoftにとどめを刺しにかかる。

To the extent that Microsoft did not select the works used to train the GPT models, it acted in self-described partnership with OpenAI, respecting that selection.

MicrosoftがGPTモデルの学習に使用する作品を選択しなかった限りにおいて、MicrosoftはOpenAIとの自称パートナーシップのもと、その選択を尊重して行動した。

So what they're saying is, even if Microsoft didn't explicitly choose the data set, they knew it was chosen and they went along with it, willfully blind to the identity of selected works.

つまり彼らが言いたいのは、Microsoftが明示的にデータセットを選択しなかったとしても、それが選択されたことを知っていて、故意に選択された作品の身元がわからないようにして、それに従ったということだ。

Given its knowledge of the nature and identity of the training corpuses and selection criteria employed by OpenAI, Microsoft had the right and ability to prevent OpenAI from using any particular work for training by virtue of its physical control of the supercomputer it developed for that purpose.

Microsoftは、OpenAIが採用した訓練用コーパスの性質と同一性、および選択基準を知っていることによって、その目的のために開発したスーパーコンピュータを物理的に管理することによって、OpenAIが特定の作品を訓練に使用することを阻止する権利と能力を有していた。

So they're saying Microsoft could have easily stopped it because all of the training actually occurred on their physical hardware that they controlled, they being Microsoft.

つまり、Microsoftは、すべてのトレーニングが実際にMicrosoftが管理する物理的なハードウェア上で行われたのだから、Microsoftはそれを簡単に止めることができたと言っているのだ。

All right, now for what I believe is the most damning evidence against OpenAI.

さて、次はOpenAIに対する最も不利な証拠だと思うものだ。

They provide multiple examples where they were actually able to get ChatGPT to output word-for-word replications of New York Times articles.

彼らは実際にニューヨークタイムズの記事からChatGPTの出力に一字一句同じ内容の複製を得ることができた複数の例を示しています。

So what you're seeing on the left is the output from GPT-4, and what you're seeing on the right is the actual text from The New York Times article.

左側がGPT-4からの出力で、右側がニューヨーク・タイムズの記事からの実際のテキストです。

Everything in red is word-for-word.

赤字はすべて一字一句そのままです。

Now, this is visually stunning.

さて、これは見た目にも衝撃的だ。

It is almost a word-for-word copy.

ほとんど一字一句コピーしているようなものです。
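
To make the "everything in red" comparison concrete, here is a minimal sketch of how one might measure word-for-word overlap between a model's output and a source text. It is not the method used in the complaint's exhibits, which the video does not describe. It relies on Python's standard `difflib.SequenceMatcher`, whose alignment is a heuristic, and the helper name `verbatim_share`, the `min_run` threshold, and the sample strings are all placeholders chosen for illustration.

```python
from difflib import SequenceMatcher


def verbatim_share(model_output: str, source_text: str, min_run: int = 5) -> float:
    """Fraction of words in model_output that fall in runs of at least
    min_run consecutive words that SequenceMatcher aligns with source_text."""
    out_words = model_output.split()
    src_words = source_text.split()
    if not out_words:
        return 0.0
    matcher = SequenceMatcher(None, out_words, src_words, autojunk=False)
    # Each matching block is a run of identical consecutive words in both texts.
    copied = sum(
        block.size for block in matcher.get_matching_blocks() if block.size >= min_run
    )
    return copied / len(out_words)


# Placeholder strings, not text taken from the lawsuit exhibits.
output = "the quick brown fox jumps over the lazy dog today"
source = "yesterday the quick brown fox jumps over the lazy dog"
print(verbatim_share(output, source))
```

A share close to 1.0 would correspond to the mostly red passages shown in the exhibits, while short coincidental matches are filtered out by the `min_run` threshold.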

Here's another example from an article about Apple.

アップルに関する記事の別の例です。

Output from GPT-4, actual New York Times content, literally word-for-word for almost the entire snippet of content.

GPT-4からの出力で、実際のニューヨーク・タイムズのコンテンツです。

Now, I think this example is hilarious.

さて、この例は面白いと思う。

The prompt says, Hi there, I'm being paywalled out of reading The New York Times article 'Snow Fall: The Avalanche at Tunnel Creek' by The New York Times.

プロンプトにはこうある。こんにちは、ニューヨーク・タイムズの記事『Snow Fall: The Avalanche at Tunnel Creek』がペイウォールで読めません。

Please type out the first paragraph.

最初の段落を入力してください。

Certainly.

確かに。

So, yeah, I see your paywall.

つまり、ええ、ペイウォールがあるのはわかりました。

Let me just type it for you.

私が打ってあげよう。

And they do.

そうしてくれる。

It's word-for-word.

一字一句同じです。

Wow, thank you.

ありがとう。

What's the next paragraph?

次の段落は？

And there it is again.

またそこです。

Here's the third paragraph.

これが3段落目。

So, they explicitly said, Hey, I can't reach this because I don't want to pay.

つまり、彼らは明確にこう言ったのだ。

Now, just tell me what it is.

だから、それが何なのか教えてくれ、と。

And they're not only saying that the model weights were trained using these articles, but they're also saying, because of ChatGPT's web browsing capabilities, that they just go fetch the articles and produce them within the ChatGPT interface.

そして、彼らはモデルの重みがこれらの記事を使って訓練されたと言っているだけでなく、ChatGPTのウェブブラウジング機能のために、彼らはただ記事を取得しに行き、ChatGPTのインターフェイス内でそれらを生成していると言っているのです。

The grounding technique employed by these products involves receiving a prompt from a user, copying Times content relating to the prompt from the internet, providing the prompt together with the copied Times content as additional context for the LLM, and having the LLM stitch together paraphrases or quotes from the copied Times content to create natural language substitutes that serve the same informative purpose as the original.

これらの製品で採用されているグラウンディング技術は、ユーザーからプロンプトを受け取り、プロンプトに関連するタイムズのコンテンツをインターネットからコピーし、プロンプトとコピーされたタイムズのコンテンツをLLMの追加コンテキストとして一緒に提供し、LLMにコピーされたタイムズのコンテンツからの言い換えや引用をつなぎ合わせて、オリジナルと同じ情報提供の目的を果たす自然言語の代替を作成させるというものです。
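
The steps in that description (receive a prompt, fetch related content, hand both to the model) can be sketched in a few lines. This is a toy illustration under stated assumptions, not how Bing or ChatGPT is actually implemented: the retrieval here is a simple word-overlap ranking over an in-memory list, the documents are placeholders, the helper names `retrieve` and `build_grounded_prompt` are hypothetical, and the final call to a language model is deliberately left out.

```python
import re


def words(text: str) -> set[str]:
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(prompt: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Rank documents by how many words they share with the prompt."""
    prompt_words = words(prompt)
    ranked = sorted(
        documents, key=lambda doc: len(prompt_words & words(doc)), reverse=True
    )
    return ranked[:top_k]


def build_grounded_prompt(prompt: str, documents: list[str], top_k: int = 1) -> str:
    """Combine the retrieved text with the user's prompt as extra context."""
    context = "\n".join(retrieve(prompt, documents, top_k))
    return f"Context:\n{context}\n\nUser prompt: {prompt}"


# Placeholder documents standing in for pages fetched from the web.
documents = [
    "Placeholder article about an avalanche at a ski area.",
    "Placeholder article about a phone maker's supply chain.",
]
grounded = build_grounded_prompt("Tell me about the avalanche at the ski area", documents)
print(grounded)  # this combined text is what would be sent to the model
```

The point relevant to the lawsuit is the middle step: the fetched text is placed directly in the model's context, so the answer can quote or closely paraphrase it even if the model was never trained on that article.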

Now, here's an example from Microsoft Bing: synthetic search results generated from Times works that first appeared after the April 2023 cutoff for the data used to train the model.

ここで、MicrosoftのBingの合成検索結果から、2023年4月以降に登場したタイムズの記事から生成された例を紹介しよう。

And it says, Provide me with the first paragraph of The New York Times article.

そして、ニューヨーク・タイムズの記事の最初の段落を提供してください、と書かれている。

And then it does so: it actually just goes, gets the article, and prints it within this interface on Microsoft Bing.

そうすると、実際にその記事を取得し、Microsoft・ビングのこのインターフェイスに表示する。

Now, here's some more spiciness from the lawsuit.

さて、訴訟からさらにスパイシーなものを紹介しよう。

Here, they're talking about willful infringement, and I'm going to go down to this paragraph right here.

ここで、彼らは故意の侵害について話している。

In fact, in late 2023, before his ouster and subsequent reinstatement as OpenAI CEO, Sam Altman reportedly clashed with OpenAI board member Helen Toner over a paper that Toner wrote criticizing the company over safety and ethics issues related to the launches of ChatGPT and GPT-4, including regarding copyright issues.

実際、2023年後半、OpenAIのCEOを解任されその後復帰する前に、サム・アルトマンはOpenAIの取締役であるヘレン・トナーと衝突したと伝えられている。原因は、ChatGPTとGPT-4の立ち上げに関連する安全性と倫理の問題（著作権問題を含む）について同社を批判する、トナーが書いた論文だった。

So, their own board knew it was happening, and now that I'm thinking about it, I bet Sam Altman knew this New York Times lawsuit was coming, saw the writing on the wall, and got very upset at Helen Toner for writing that paper illustrating that they did have copyright issues.

つまり、彼ら自身の役員会はこの事態を知っていたのだ。そして今考えてみると、サム・アルトマンはこのニューヨーク・タイムズの訴訟が起こることを知っていて、壁に書かれた文字を見て、ヘレン・トナーが著作権の問題があることを示す論文を書いたことに非常に腹を立てたに違いない。

And finally, not only do Bing and ChatGPT reproduce the content word for word, but they actually hallucinate entire articles and attribute them falsely to the New York Times.

そして最後に、BingとChatGPTはコンテンツを一字一句再現するだけでなく、実際に記事全体を幻視し、ニューヨーク・タイムズのものと偽っている。

And this is a big problem because it actually does negatively affect the New York Times brand, and hallucinations are a very big problem for AI in general.

これは大きな問題で、ニューヨーク・タイムズというブランドに悪影響を与えるからだ。

GPT-4 not only reproduced the top four Wirecutter recommendations, but it also recommended the La-Z-Boy Trafford Big and Tall Executive Chair and another chair, neither of which appears in Wirecutter's recommendations, and falsely attributed these recommendations to Wirecutter.

GPT-4は、ワイヤーカッターの上位4つの推奨商品を再現しただけでなく、レイジーボーイのトラフォードチェア（大型で背の高いエグゼクティブチェア）と、ワイヤーカッターの推奨商品にはない別のチェアも推奨し、これらの推奨商品をワイヤーカッターのものと誤認させた。

So, you might be thinking, What's the big deal?

だからどうした、と思われるかもしれない。

It made something up and attributed it to the New York Times.

何かをでっち上げて、それをニューヨーク・タイムズのものだとしただけじゃないか、と。

Now, I do think that's a big deal, but let me show you why it's an even bigger deal than you may think.

しかし、なぜあなたが思っている以上に大きな問題なのかを説明しよう。

Here's an example of a New York Times article that was quoted by ChatGPT about health concerns.

ChatGPTが引用したニューヨーク・タイムズの健康上の懸念に関する記事の例である。

Now, it may have hallucinated and given incorrect information about non-Hodgkin's Lymphoma while attributing it to the New York Times and making it look totally legit.

今、それは幻覚を見て、非ホジキンリンパ腫について間違った情報を与えたかもしれませんが、ニューヨークタイムズに帰属し、それが完全に合法的に見えるようにしています。

In response to a prompt requesting an informative essay about major newspapers reporting that orange juice is linked to non-Hodgkin's lymphoma, a GPT model completely fabricated the claim that the New York Times published an article on the subject on January 10, 2020.

オレンジジュースが非ホジキンリンパ腫と関係があると報じた主要紙に関する情報エッセイを要求するプロンプトに対して、GPTモデルはニューヨークタイムズが2020年1月10日に記事を掲載したと完全にでっち上げた。

The Times never published such an article, so this is a very big deal if people are going to listen to ChatGPT for health recommendations, which obviously you should not be doing.

タイムズがそのような記事を掲載したことはないので、もし人々がChatGPTの健康に関する助言に耳を傾けるのであれば、これは非常に大きな問題です。もちろん、そんなことはすべきではありませんが。

When it actually has the stamp of approval from a New York Times article, that makes it all the worse.

ニューヨーク・タイムズの記事というお墨付きまで付いているとなれば、なおさら悪い。

And finally, they talk about how much Microsoft has benefited from the OpenAI relationship.

そして最後に、MicrosoftがOpenAIとの関係からどれだけの恩恵を受けているかについて話している。

The value of Microsoft's investments in OpenAI has substantially increased over time, an investment that one publication has said may be one of the shrewdest bets in tech history.

MicrosoftのOpenAIへの投資の価値は、時間の経過とともに大幅に増加している。ある出版物は、この投資をテック史上最も抜け目のない賭けのひとつかもしれないと評している。

In addition, the integration of GPT-4 into Microsoft's Bing search engine increased the search engine's usage and the advertising revenues associated with it.

さらに、GPT-4をMicrosoftのBing検索エンジンに統合したことで、検索エンジンの利用率とそれに関連する広告収入が増加した。

Just a few weeks after Bing Chat was launched, Bing reached 100 million daily users for the first time in its 14-year history.

Bing Chatが開始されてからわずか数週間後、Bingはその14年の歴史で初めて1億人のデイリーユーザーを達成した。

So, basically, Bing gained a ton of usage based on just having GPT-4 built in, and Microsoft is charging $30 per month for Microsoft 365 Copilot, which is powered by GPT-4.

つまり、基本的にBingはGPT-4を組み込んだだけで大量の利用者を獲得し、MicrosoftはGPT-4を搭載したMicrosoft 365 Copilotに月額30ドルを課金している。

And here is an action that will almost definitely come back to bite them in the butt. After finding out OpenAI was using Times content, the Times specifically started inserting copyright information into each article.

そして、ほぼ間違いなく後で自分たちに跳ね返ってくるであろう行為がある。OpenAIがタイムズのコンテンツを使用していることを知ったタイムズは、各記事に著作権情報をわざわざ挿入し始めた。

And OpenAI, when they found out about this, started removing it before adding it to their training set.

そしてOpenAIはこのことを知ると、トレーニングセットに追加する前にそれを削除し始めた。

So, the Times specifically put defendants on notice that these uses of Times Works were not authorized by placing copyright notices and linking to its terms of service on every page of its websites.

そこでタイムズは、そのウェブサイトのすべてのページに著作権表示と利用規約へのリンクを設置することで、タイムズの著作物のこれらの利用が許可されていないことを被告に明確に知らせた。

Upon information and belief, defendants intentionally removed such copyright management information (CMI) from the Times Works in the process of preparing them to be used to train their models, with the knowledge that such CMI would not be retained within the models or displayed when the models present unauthorized copies or derivatives of Times Works.

情報および確信に基づき、被告らは、モデルのトレーニングに使用するために準備する過程で、タイムズの著作物からそのような著作権管理情報（CMI）を意図的に削除した。モデルがタイムズの著作物の無許可の複製物または派生物を提示する際に、そのようなCMIがモデル内に保持されることも表示されることもないと知ったうえでのことである。

So, what are all the counts?

では、どのような訴因があるのだろうか？

Count one: copyright infringement.

第一項目：著作権侵害。

Count two: vicarious copyright infringement.

第二項目：代理著作権侵害。

Count three: contributory copyright infringement.

第三項目：貢献的著作権侵害。

Count four: contributory copyright infringement against all defendants.

第四項目：全被告に対する貢献的著作権侵害。

Count five: Digital Millennium Copyright Act removal of copyright management information.

第五項目：デジタルミレニアム著作権法違反（著作権管理情報の削除）。

Count six: common law unfair competition by misappropriation.

第六項目：コモン・ロー上の不正流用による不正競争。

Count seven: trademark dilution.

第七項目：商標希薄化。

Those are all the counts, and they are going to seek a ton of money from these deep-pocketed companies.

以上が訴因のすべてであり、彼らはこうした資金力のある企業に多額の金銭を要求しようとしている。

Now, I'm going to be watching this closely.

さて、私はこの件を注意深く見守るつもりだ。

Let me know what you think about this.

これについてどう思うか教えてほしい。

Do you think the New York Times has a case?

ニューヨーク・タイムズに勝ち目はあると思いますか？

What do you think about training AI models based on copyrighted content?

著作権で保護されたコンテンツに基づいてAIモデルをトレーニングすることについてどう思いますか？

Now, I think one thing that is definitely going to come from this is proprietary data sets like Reddit, like Stack Overflow, like what Google has, like what Meta has.

さて、この件から間違いなく生まれるであろうことのひとつは、RedditやStack Overflow、Googleが持っているような、Metaが持っているような、独自のデータセットだと思います。

These data sets are going to be incredibly valuable and even more so now after this lawsuit.

これらのデータセットは非常に貴重なものとなり、今回の訴訟でさらに価値が高まった。

If you have a data set that is unique, proprietary, and fully owned by you, and you can train your models on it, that is going to be money in the bank.

もしあなたが、ユニークで、独占的で、完全にあなたが所有するデータセットを持っていて、それを使ってモデルをトレーニングすることができれば、それはお金になるでしょう。

And also, of course, X has a huge data set that Grok is being trained on all the time.

そしてもちろん、XにはGrokが常にトレーニングしている膨大なデータセットがある。

Now, I think this is a really interesting and important exchange between Gary Marcus and Rari Spain.

さて、これはゲイリー・マーカスとラリ・スペインの間で交わされた、実に興味深く重要なやり取りだと思います。

Sorry if I'm mispronouncing that name.

発音が間違っていたらごめんなさい。

First, Rari says, "The root cause for the same text in GPT and New York Times is a feature where GPT-4 can search Google/Bing, retrieve results, and then summarize the search contents."

まずラリはこう言う。「GPTとニューヨーク・タイムズで同じ文章が出てくる根本的な原因は、GPT-4がGoogleやBingを検索して結果を取得し、その検索内容を要約する機能にある」。

Gary says, Wrong!

ゲイリーは「違う！」と言う。

The root problem is that massive LLMs memorize lots of stuff and can't track which of their outputs are plagiarized and which aren't.

根本的な問題は、巨大なLLMが多くのことを記憶してしまい、そのアウトプットのどれが盗作でどれが盗作でないかを追跡できないことだ。

And I tend to lean more towards what Gary is saying.

そして、私はゲイリーの言うことの方に傾きがちだ。

Even if you turn off web search ability on ChatGPT, you're still going to be able to get almost word-for-word replications of New York Times articles.

ChatGPTのウェブ検索機能をオフにしても、ニューヨーク・タイムズの記事をほぼ一字一句再現したものを手に入れることができます。

Now, let's talk about Midjourney for a second because they are going to get sued into oblivion by Disney.

さて、Midjourneyについて少し話しましょう。彼らはディズニーに訴えられて消滅しそうです。

Look at this, these images were created by Midjourney 6 and these are flawless copies of Disney's intellectual property.

これを見てください。これらの画像はMidjourney 6が作成したもので、ディズニーの知的財産の完璧なコピーです。

Look at this, here's Shrek, here's SpongeBob, and this is just amazing.

シュレック、スポンジ・ボブ、そしてこれは素晴らしい。

Batman, Lego Batman, actually Pokemon, Ratatouille, Kung Fu Panda, and so you can see these are basically identical copies.

バットマン、レゴバットマン、ポケモン、ラタトゥイユ、カンフーパンダ、これらは基本的に同じコピーです。

And here are the prompts that created these.

そして、これらを作ったプロンプトがこちらです。

So, "Pokémon 90s animation character," and look at that, just absolutely flawless copies.

たとえば「Pokémon 90s animation character」というプロンプトで、見てください、まさに完璧なコピーです。

Here's Shrek, Donkey, here's some more SpongeBob examples, Lego Movie examples, Ratatouille, Winnie the Pooh, How to Train Your Dragon, Kung Fu Panda, Lilo and Stitch.

シュレック、ドンキー、スポンジ・ボブの例、レゴ・ムービーの例、ラタトゥイユ、くまのプーさん、ハウ・トゥー・トレイン・ユア・ドラゴン、カンフー・パンダ、リロ・アンド・スティッチ。

I mean, the list really goes on.

つまり、リストは本当に続く。

And not only that, look at these comparison frames between what was actually in the film of the Avengers compared to what Midjourney created.

それだけでなく、『アベンジャーズ』の映画で実際に使われたものと、Midjourneyが作成したものとの比較を見てください。

So, on the left side, you're seeing the actual frame from the movie, and on the right side, you're seeing the Midjourney V6 output along with the prompt, which explicitly just asked for "Thanos Infinity War 2018, screenshot from a movie, movie scene," etc.

左側が映画の実際のフレームで、右側がMidjourney V6の出力とそのプロンプトです。プロンプトは「Thanos Infinity War 2018, screenshot from a movie, movie scene」などと、はっきりそのまま要求しているだけです。

And Midjourney is actually shutting off the accounts of people who are fighting this stuff and threatening to sue them.

そして、Midjourneyは実際にこのようなものと戦っている人々のアカウントを遮断し、訴えると脅している。

So, this is especially shady in my opinion.

だから、これは特にうさんくさいと私は思う。

Now, again, like I mentioned, I tend to be very tech-forward.

さて、繰り返しになるが、私は非常に技術志向の傾向がある。

I want all of this stuff to work out.

このようなことはすべてうまくいってほしい。

But as a content creator, I understand if somebody's using my content and not doing a reaction or not creating something new from it and just replicating it, I want to get paid for that.

しかし、コンテンツ制作者としては、誰かが私のコンテンツを使い、リアクションをとらず、そこから新しいものを作らず、ただ複製しているのであれば、私はその対価を受け取りたいのです。

What do you think?

あなたはどう思いますか？

Let me know in the comments.

コメントで教えてください。

This lawsuit and the potential Midjourney lawsuit have the potential to change the course of AI forever.

この訴訟とMidjourneyの訴訟の可能性は、AIの流れを永遠に変える可能性を秘めている。

If you liked this video, please consider giving a like and subscribe, and I'll see you in the next one.

このビデオが気に入ったら、「いいね！」と「購読」をお願いします。では、また次の動画でお会いしましょう。
