
[Experiment Results] Gemini Case Study #1: Full Transcription of the Google I/O 2024 Video with Gemini 1.5 → Blog Article → Japanese Translation, Step by Step

This article is supplementary material for the following newsletter issue.

https://tetumemo.m-newsletter.com/posts/73c30cceee2866c7

🎥 Source video

https://www.youtube.com/watch?v=uFroTufv6es&ab_channel=Google

📑 Full transcript generated by Gemini 1.5 Flash

Google's ambitions and artificial intelligence Gemini AI is rolling out to work, and it's completely changing the way we work. You know, a lot has happened in a year. There have been new beginnings. We found new ways to find new ideas. And new solutions to age-old problems. Sorry about your shirt. We dreamt of things, never too old for a treehouse. We trained for things. Alright, let's go, go, go. And learned about this thing. We found new paths. Took the next step. And made the big leap. Can I move? We filled days like they were weeks, and more happened in months than has happened in years. We feel days like they were weeks, and more happened in months than has happened in years. Whoops! 3 eggs! Gemini models Fueling Scientific Research Alpha code 2 Cracking Competitive Coding Gemini 1.5 pro Smashing Context Window Limits Things got bigger. Like way bigger. And it wasn't all just for him. Or for her. It was for everyone. Making AI helpful for everyone And you know what? We're just getting started. Google. Mountain View. Can I shoot a video of my possessions, and turn that into a searchable database? So I ran to my bookshelf, and I shot a video just panning my camera along the bookshelf, and I fed the video into the model. It gave me the titles and authors of the books. Even though the authors weren't visible on those book spines and on the bookshelf there was a squirrel nutcracker sat in front of the book truncating the title. You could just see the word "sight." See. And it still guess the correct book. The range of things you can do with that is almost unlimited. So at that point for me, it was just like a click. Like this is it. I thought like I had like a super power in my hands. It was poetry. It was beautiful. I was so happy. It just this This is going to be amazing. This is This is going to help people. This This kind of where the future of language models are going. 
Personalized to you, not because you trained it to be personal to you, but personal to you because you can give it such a vast understanding of who you are. And So, at that point for me, it was just like a click. Like this is it. I thought like I had like a super power in my hands. It was poetry. It was beautiful. I was so happy. It just this this is going to be amazing. This is this is going to help people. This this kind of where the future of language models are going. Personalized to you, not because you trained it to be personal to you, but personal to you because you can give it such a vast understanding of who you are. We've been investing in AI for more than a decade and innovating at every layer of the stack. Research, product, infrastructure, we're going to talk about it all today. Still, we are in the very early days of the AI platform shift. We see so much opportunity ahead for creators, for developers, for startups, for everyone. Helping to drive those opportunities is what our Gemini era is all about. So let's get started. A year ago on this stage, we first shared our plans for Gemini AI, a frontier model built to be natively multi-model from the very beginning. That could reason across text, images, video, code, and more. It's a big step in turning any input into any output. And I/O for a new generation. Since then, we introduced the first Gemini AI models, are most capable AI. They demonstrated state-of-the-art performance on every multi-modal benchmark. And that was just the beginning. 2 months later, we introduced Gemini 1.5 Pro delivering a big breakthrough in long context. It can run 1 million tokens in production consistently, more than any other large-scale foundation model yet. We want everyone to benefit from what Gemini can do. So, we've worked quickly to share these advances with all of you. Today more than 1.5 million developers use Gemini models across our tools. You're using it to debug code. Get new insights. 
And to build build the next generation of AI applications. We've also been bringing Gemini's breakthrough capabilities across our products in powerful ways. We'll show examples today across search, Photos, workspace, Android, and more. Today, all of our 2 billion user products use Gemini. And we've introduced new experiences, too. Including on mobile, where people can interact with Gemini directly through the app. Now available on Android and iOS. And through Gemini advanced, which provides access to our most capable models. Over 1 million people have signed up to try it, in just 3 months. And it continues to show strong momentum. One of the most exciting transformations with Gemini has been in Google Search. In the past year, we've answered billions of queries as part of our Search Generative Experience. People are using it to search in entirely new ways. And asking new types of questions. Longer, and more complex queries, even searching with photos, and getting back the best the web has to offer. We've been testing this experience outside of labs. And we are encouraged to see, not only an increase in search usage, but also an increase in user satisfaction. I'm excited to announce that we will begin we will begin launching this fully revamped experience, AI Overviews to everyone in the US this week, and we'll bring it to more countries soon. There's so much innovation happening in search, thanks to Gemini. We can create much more powerful search experiences, including within our products. Let me show you an example in Google Photos. We launched Google Photos almost 9 years ago. Since then, people have used it to organize their most important memories. Today, that amounts to more than 6 billion photos, and videos uploaded every single day. And people love using Photos to search across their life. With Gemini, you're making that a whole lot easier. Say you're at a parking station ready to pay, but you can't recall your license plate number. 
Before, you could search photos for keywords. And then scroll through years worth of photos, looking for the right one. Now, you can simply ask Photos. It knows the cars that appear often. It triangulates which one is yours, and just tells you the license plate number. And ask Photos can also help you search your memories in a deeper way. For example, you might be reminiscing about your daughter, Lucia's early milestones. You can ask Photos when did Lucia learn to swim. You can even follow up with something more complex. Show me how Lucia's swimming has progressed. Here Gemini goes beyond a simple search, recognizing different contexts. From doing laps in the pool to sparkling in the ocean. To the Texan dates on her swimming certificates. And for those packages, it ups all together in a summary. You can really take it all in, and relive amazing memories all over again. We are rolling out ask photos this this summer, with more capabilities to come. Unlocking knowledge across formats is why we built Gemini to be multi-modal from the ground up. It's one model, with all the modalities built in. So, not only does it understand each type of input, it finds connections between them. Multi modality radically expands the questions we can ask, and the answers we will get back. Long context takes this a step further. Enabling us to bring in even more information. Hundreds of pages of text, hours of audio, a full hour of video, or an entire code repo. Or, if you want roughly 96 cheesecake factory menus. For that many menus, you need a 1 million token context window. Now possible with Gemini 1.5 pro. Developers have been using it in super interesting ways. Let's take a look. I remember the announcement, the 1 million token context window and d- my first reaction was, there's no way. They were able to achieve this. I wanted to to test its technical skills. 
Um so I uploaded a line chart uh it was temperatures uh between like Tokyo, and Berlin, and how they vary across the 12 months of the year. So I I got in there, and I threw in the Python library that I was really struggling with. And I just asked it a simple question. And it nailed it. It could find specific references to comments in the code, and specific requests that people had made. And other issues that people had had, but then suggests a fix for it that related to what I was working on. I immediately tried to kind of crash it, so I I took, you know, four or five research papers I had on my desktop. And it's a mind-blowing experience when you add so much text, and then you see the kind of amount of tokens you add is not even at half the capacity. It felt a little bit like Christmas because you saw things kind of peppered up to the top of your feed, about like, "Oh wow, I built this thing," or, "Oh, it's doing this." And I would have never expected. Can I shoot a video of my possessions, and turn that into a searchable database? So I ran to my bookshelf, and I shot a video just panning my camera along the bookshelf, and I fed the video into the model. It gave me the titles and authors of the books. Even though the authors weren't visible on those book spines, and on the bookshelf there was a squirrel nutcracker sat in front of the book truncating the title. You could just see the word "sight." See? And it still guessed the correct book. The range of things you can do with that is almost unlimited. So at that point, for me it was just like a click, like this is it. I thought like I had like a super power in my hands. It was poetry. It was beautiful. I was so happy. It just this this is going to be amazing. This is This is going to help people. This this kind of where the future of language models are going. Personalized to you, not because you trained it to be personal to you, but personal to you because you can give it such a vast understanding of who you are. 
We've been investing in AI for more than a decade, and innovating at every layer of the stack. Research, product, infrastructure, we're going to talk about it all today. Still, we are in the very early days, and you'll see glimpses of our approach throughout the day. But, let me show you the kinds of use cases we are working hard to solve. Let's start with shopping. It's pretty fun to shop for shoes. And a lot less fun to return them when they don't fit. Imagine if Gemini could do all the steps for you. Searching your inbox for the receipt. Locating the order number from your email. Filling out a return form. And even scheduling a pick up. That's much easier, right? Imagine if Gemini could do all the steps for you. Searching your inbox for the receipt. Locating the order number from your email. Filling out a return form. And even scheduling a pick up. That's much easier, right? We've been rolling out Gemini 1.5 Pro with long context in preview, over the last few months. We've made a series of quality improvements across translation, coding, and reasoning. You'll see these updates reflected in the models starting today. I'm excited to announce that we are bringing this improved version of Gemini 1.5 Pro to all developers globally. In addition today, Gemini 1.5 pro with 1 million contexts is now directly available for consumers in Gemini advanced, and can be used across 35 languages. 1 million tokens is opening up entirely new possibilities. It's exciting. But, I think we can push ourselves even further. So today, we are expanding the context window to 2 million tokens. We see this is how we will make the most progress against our mission. Organizing the world's information across every input, making it accessible via any output, and combining the world's information with the information in your world, in a way that's truly useful for you. To fully realize the benefits of AI, we will continue to break new ground. Google DeepMind is hard at work. 
To share more, please welcome, for the first time on the I/O stage, Sir Demis.
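A transcription step like the one above can also be reproduced programmatically rather than through the AI Studio UI. The sketch below builds a `generateContent` request for the Gemini REST API (`v1beta`), pairing a video previously uploaded via the File API with a transcription instruction. The `file_uri` value and the exact prompt wording are illustrative assumptions, not the prompt used for this article.

```python
import json
import urllib.request

API_URL = ("https://generativelanguage.googleapis.com/v1beta/"
           "models/gemini-1.5-flash:generateContent")

def build_transcribe_request(file_uri: str, mime_type: str = "video/mp4") -> dict:
    """Build a generateContent payload that pairs an uploaded video
    (a File API URI) with a verbatim-transcription instruction."""
    return {
        "contents": [{
            "parts": [
                {"file_data": {"mime_type": mime_type, "file_uri": file_uri}},
                {"text": "Transcribe all spoken audio in this video verbatim."},
            ]
        }]
    }

def transcribe(file_uri: str, api_key: str) -> str:
    """POST the request and return the model's transcript text.
    Performs a live network call, so it needs a valid API key."""
    payload = json.dumps(build_transcribe_request(file_uri)).encode("utf-8")
    req = urllib.request.Request(
        f"{API_URL}?key={api_key}",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["candidates"][0]["content"]["parts"][0]["text"]
```

`build_transcribe_request` alone shows the request shape; `transcribe` additionally requires that the video has already been uploaded through the File API, since the endpoint accepts file URIs rather than raw YouTube links.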

📒 Output after the blog-article prompt (English)

Google's Gemini: Unleashing the Power of AI, One Click at a Time

The world is changing, and it's changing fast. New discoveries, innovative solutions, and groundbreaking technologies are emerging at an unprecedented pace, transforming the way we live, work, and interact with the world around us. At the heart of this transformation lies Artificial Intelligence, and Google is leading the charge with its groundbreaking new AI model: Gemini.

This isn't just another AI; this is a paradigm shift. Gemini is a multi-modal marvel, capable of seamlessly navigating the intricate world of text, images, video, and code, weaving them together to create an experience that transcends the limitations of traditional AI. Imagine searching the internet with a photo, analyzing hours of video with a single prompt, or debugging complex code with unparalleled efficiency. This is the power of Gemini, and it's already changing the way we interact with technology.

From the hushed halls of research labs to the vibrant tapestry of everyday life, Gemini is making its presence felt. Developers are leveraging its unprecedented capabilities to build the next generation of AI applications. Google products, from the ubiquitous Search to the deeply personal Photos, are being supercharged with Gemini's intelligence, delivering a user experience that is both intuitive and transformative.

But this is just the beginning. Google's vision for Gemini extends far beyond incremental improvements; it's about fundamentally changing the way we interact with information. Imagine a future where AI can anticipate your needs, automate complex tasks, and provide personalized assistance tailored to your unique circumstances. This is the future Google is building, and Gemini is the key.

Intrigued? Let's dive deeper into the world of Gemini, exploring its capabilities, its impact, and its potential to reshape our world.

Gemini: The Multi-Modal Maestro

Traditionally, AI models have been confined to specific domains, excelling in text processing or image recognition, but struggling to bridge the gap between different modalities. Gemini shatters these limitations, ushering in an era of true multi-modal understanding.

Imagine a world where you can search the internet using a photo, asking your AI assistant to "find me shoes like these, but in blue." Or picture yourself analyzing hours of video footage with a simple command: "show me all the instances where the main character smiles." Gemini makes these scenarios possible, seamlessly integrating different modalities to deliver a rich and comprehensive understanding of the information it processes.

This multi-modal prowess is not just a technological feat; it's a paradigm shift in the way we interact with information. We no longer need to confine ourselves to the limitations of text-based search or single-modal analysis. Gemini empowers us to explore the world of information in a way that mirrors our own multi-faceted understanding of the world around us.

Breaking the Context Barrier: Gemini's Million-Token Leap

One of Gemini's most remarkable achievements is its ability to process vast amounts of information, thanks to its groundbreaking long context window. Imagine being able to feed an AI model an entire research paper, a full hour of video footage, or even hundreds of pages of text, and have it instantly comprehend the intricate nuances and complex relationships within that data. This is the power of Gemini's long context window, and it's a game changer for AI applications.

This leap in context handling opens up a world of possibilities. Developers are leveraging Gemini's 1 million token context window to tackle complex coding challenges, analyze lengthy documents, and even create personalized AI assistants that can retain vast amounts of information about their users.

But Google isn't stopping there. In a bold move that pushes the boundaries of AI even further, Google is expanding Gemini's context window to a staggering 2 million tokens. This unprecedented capacity will enable Gemini to process even more information, delivering deeper insights, more nuanced responses, and a level of understanding that was previously unimaginable.

Gemini in Action: Transforming Everyday Experiences

Gemini's impact extends far beyond the realm of research labs and developer tools. It's already making its presence felt in everyday Google products, enhancing the user experience and delivering new levels of functionality.

Google Search, the cornerstone of the internet, is undergoing a radical transformation with the integration of Gemini. Now, users can engage in a more natural and conversational search experience, asking complex questions and receiving comprehensive answers that go beyond simple keyword matching. Search results are no longer limited to text; Gemini can integrate images, videos, and other rich media, providing a more immersive and informative search experience.

Google Photos, the repository of our most cherished memories, is also being revolutionized by Gemini. Imagine being able to search your photo library with nuanced queries like "show me photos of my daughter's first swimming lesson" or "find all the pictures where my family is laughing." Gemini makes these queries possible, understanding the context of your photos and delivering precisely the results you're looking for.

These are just a few examples of how Gemini is transforming everyday experiences. As Gemini continues to evolve, we can expect to see its impact ripple across a wider range of Google products, blurring the lines between human and machine interaction and delivering an experience that is both intuitive and empowering.

The Future of AI: Personalization, Automation, and Empowerment

Google's vision for Gemini extends far beyond incremental improvements; it's about fundamentally changing the way we interact with information. Imagine a future where AI can anticipate your needs, automate complex tasks, and provide personalized assistance tailored to your unique circumstances. This is the future Google is building, and Gemini is the key.

Imagine an AI assistant that can help you navigate the complexities of daily life, from booking appointments to managing finances to planning trips. This assistant would not be a generic chatbot; it would be deeply personalized, understanding your preferences, learning your habits, and anticipating your needs. It would be a trusted companion, empowering you to manage your life with greater ease and efficiency.

This vision of a personalized and empowering AI is not a distant dream; it's a future that Google is actively working towards. With Gemini at its core, this future is closer than ever before.

The Power of One Click: Unleashing the Potential of Gemini

The stories emerging from developers and early adopters of Gemini are filled with a sense of awe and wonder. They describe a feeling of having a "superpower" at their fingertips, a tool that unlocks new levels of creativity, efficiency, and understanding.

This sense of empowerment is at the heart of Google's vision for Gemini. It's about making AI accessible to everyone, regardless of technical expertise. It's about democratizing the power of information and putting it in the hands of those who can use it to make a difference.

This is the power of one click: the ability to access the world's information, in all its forms, with a simple gesture. It's the power to analyze, create, and understand like never before. It's the power to transform the world, one interaction at a time.

Google's Gemini is not just an AI model; it's a catalyst for change, a spark that ignites a new era of innovation and progress. As we embark on this journey, we can be certain of one thing: the world will never be the same.
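The transcript-to-article step shown in this section amounts to a single follow-up prompt over the same long context. A minimal sketch of that prompt construction follows; the instruction wording is my assumption, not the exact prompt used for this article.

```python
def build_article_prompt(transcript: str) -> str:
    """Wrap a raw transcript in an instruction asking the model to
    rewrite it as a structured English blog article."""
    instruction = (
        "Rewrite the following keynote transcript as a well-structured "
        "English blog article with a title and section headings. "
        "Keep every factual claim grounded in the transcript.\n\n"
        "Transcript:\n"
    )
    return instruction + transcript
```

Because Gemini 1.5's long context window holds the entire transcript, the whole text can be passed in one request instead of being chunked and summarized piecewise.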

📙 Output after the Japanese-translation prompt

Google の Gemini: ワンクリックで AI の力を解き放つ

**世界は変化しており、しかも急速に変化しています。新しい発見、革新的なソリューション、そして画期的なテクノロジーがかつてないほどのペースで登場し、私たちが生き、働き、周りの世界と関わる方法を変えています。**この変革の中心にあるのは人工知能であり、Google は画期的な新しい AI モデル、Gemini でその先頭に立っています。

これは単なる AI ではありません。これはパラダイムシフトです。 Gemini は、テキスト、画像、動画、コードという複雑な世界をシームレスにナビゲートできるマルチモーダルな驚異であり、それらを組み合わせて従来の AI の限界を超えた体験を生み出します。 写真を使ってインターネットを検索したり、1 つのプロンプトで何時間もかかるビデオ映像を分析したり、比類のない効率で複雑なコードをデバッグしたりすることを想像してみてください。これが Gemini の力であり、すでにテクノロジーとの関わり方を変えつつあります。

**研究室の静かな空間から、活気に満ちた日常生活のタペストリーまで、Gemini はその存在感を示しています。**開発者は、次世代の AI アプリケーションを構築するために、その前例のない機能を活用しています。ユビキタスな検索から深くパーソナルな写真まで、Google の製品は Gemini の知性によって強化されており、直感的で革新的なユーザー体験を提供しています。

しかし、これはほんの始まりにすぎません。Google の Gemini に対するビジョンは、段階的な改善をはるかに超えたものです。それは、情報との関わり方を根本的に変えることです。 AI があなたのニーズを予測し、複雑なタスクを自動化し、あなたの状況に合わせて調整されたパーソナライズされた支援を提供する未来を想像してみてください。これは Google が構築している未来であり、Gemini がその鍵となります。

興味がありますか?では、Gemini の世界を深く掘り下げ、その能力、影響、そして私たちの世界を再構築する可能性を探ってみましょう。

Gemini: マルチモーダルなマエストロ

従来の AI モデルは、テキスト処理や画像認識などの特定の分野に限定されており、異なるモダリティ間のギャップを埋めるのに苦労していました。 Gemini はこれらの制限を打ち破り、真のマルチモーダル理解の時代を到来させます。

写真を用いてインターネットを検索し、「これと同じような靴を、ただし青色のものを探してください」と AI アシスタントに依頼する世界を想像してみてください。あるいは、「主人公が笑っているすべてのインスタンスを表示する」という簡単なコマンドで何時間もかかるビデオ映像を分析することを想像してみてください。 Gemini は、異なるモダリティをシームレスに統合することで、これらのシナリオを実現し、処理する情報に対する豊かで包括的な理解を提供します。

このマルチモーダルな能力は、単なる技術的な偉業ではありません。それは、情報との関わり方のパラダイムシフトです。もはや、テキストベースの検索や単一モーダル分析の限界に縛られる必要はありません。 Gemini は、私たち自身の多面的な世界理解を反映した方法で、情報の海を探検することを可能にします。

コンテキストの壁を打ち破る: Gemini の100万トークンの飛躍

Gemini の最も注目すべき成果の 1 つは、画期的なロングコンテキストウィンドウにより、膨大な量の情報を処理できることです。 AI モデルに研究論文全体、1 時間分のビデオ映像、さらには数百ページのテキストを入力し、そのデータ内の複雑なニュアンスと関係を即座に理解できると想像してみてください。 これが Gemini のロングコンテキストウィンドウの力であり、AI アプリケーションのゲームチェンジャーとなります。

このコンテキスト処理の飛躍は、可能性の世界を開きます。開発者は、複雑なコーディングの課題に取り組んだり、長いドキュメントを分析したり、ユーザーに関する膨大な量の情報を保持できるパーソナライズされた AI アシスタントを作成したりするために、Gemini の 100 万トークンコンテキストウィンドウを活用しています。

**しかし、Google はそこで止まりません。AI の境界線をさらに押し広げる大胆な動きとして、Google は Gemini のコンテキストウィンドウを驚異的な 200 万トークンに拡張しています。**この前例のない容量により、Gemini はさらに多くの情報を処理できるようになり、より深い洞察、より微妙な応答、そしてこれまで想像もできなかったレベルの理解を提供します。

Gemini の実用例: 日常体験の変革

Gemini の影響は、研究室や開発ツールを超えて、広範囲に及んでいます。それはすでに日常的に使用される Google 製品に影響を与え、ユーザーエクスペリエンスを向上させ、新しいレベルの機能を提供しています。

**インターネットの礎石である Google 検索は、Gemini の統合により、根本的な変革を遂げています。**ユーザーは、より自然で会話的な検索エクスペリエンスに参加できるようになり、複雑な質問をして、単純なキーワードマッチングを超えた包括的な回答を受け取ることができます。 検索結果はもはやテキストに限定されません。Gemini は、画像、ビデオ、その他の豊富なメディアを統合し、より没入型で有益な検索エクスペリエンスを提供します。

私たちの大切な思い出の宝庫である Google フォトも、Gemini によって革命を起こされています。「娘の初めてのスイミングレッスンの写真を見せて」や「家族が笑っている写真をすべて見つけて」といった微妙なクエリでフォトライブラリを検索できると想像してみてください。Gemini はこれらのクエリを実現し、写真のコンテキストを理解して、あなたが探している結果を正確に提供します。

これらは、Gemini が日常体験をどのように変えているかを示すほんの一例です。 Gemini が進化し続けるにつれて、その影響はより広範囲の Google 製品に波及し、人と機械の相互作用の境界線を曖昧にし、直感的で力を与える体験を提供することが期待されます。

AI の未来: パーソナライゼーション、自動化、エンパワーメント

Google の Gemini に対するビジョンは、段階的な改善をはるかに超えたものです。それは、情報との関わり方を根本的に変えることです。 AI があなたのニーズを予測し、複雑なタスクを自動化し、あなたの状況に合わせて調整されたパーソナライズされた支援を提供する未来を想像してみてください。 これは Google が構築している未来であり、Gemini がその鍵となります。

予約から財務管理、旅行計画まで、日常生活の複雑さを乗り切るのに役立つ AI アシスタントを想像してみてください。 このアシスタントは、一般的なチャットボットではありません。それは深くパーソナライズされ、あなたの好みを理解し、あなたの習慣を学び、あなたのニーズを予測します。それは信頼できる仲間であり、あなたがより簡単に効率的にあなたの人生を管理することを可能にします。

パーソナライズされ、力を与える AI のこのビジョンは、遠い夢ではありません。それは Google が積極的に取り組んでいる未来です。 Gemini をその核とすることで、この未来はこれまで以上に近づいています。

ワンクリックの力: Gemini の可能性を解き放つ

Gemini の開発者やアーリーアダプターから寄せられる話は、畏敬の念と驚きに満ちています。 彼らは、指先に「超能力」を持っているような感覚、つまり新しいレベルの創造性、効率性、理解を解き放つツールを持っているような感覚を説明しています。

このエンパワーメントの感覚は、Google の Gemini に対するビジョンの核心にあります。それは、技術的な専門知識に関係なく、すべての人が AI にアクセスできるようにすることです。それは、情報の力を民主化し、それを変化をもたらすために使用できる人々の手に委ねることです。

これがワンクリックの力です。シンプルなジェスチャーで、あらゆる形態の世界中の情報にアクセスできることです。それは、これまで以上に分析し、創造し、理解する力です。それは、一度に 1 つのインタラクションで世界を変える力です。

Google の Gemini は、単なる AI モデルではありません。それは変化の触媒であり、イノベーションと進歩の新しい時代に火をつける火花です。 この旅に乗り出すとき、私たちは 1 つのことを確信することができます。世界は二度と同じになることはありません。
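The final Japanese-translation step is simply one more prompt in the chain. The sketch below chains the two post-transcription steps through a generic `generate` callable (a placeholder for any Gemini client call), so the prompt wording, again an assumption for illustration, stays decoupled from the API.

```python
from typing import Callable

def build_translation_prompt(article: str, target_language: str = "Japanese") -> str:
    """Ask the model to translate a finished article, preserving structure."""
    return (
        f"Translate the following blog article into {target_language}, "
        "preserving headings, emphasis, and proper nouns.\n\n" + article
    )

def transcript_to_localized_article(
    transcript: str,
    generate: Callable[[str], str],
) -> str:
    """Chain the two prompts: transcript -> English article -> translation."""
    article = generate(
        "Rewrite the following keynote transcript as a structured "
        "English blog article:\n\n" + transcript
    )
    return generate(build_translation_prompt(article))
```

In practice `generate` could wrap a Gemini client call such as the official SDK's `GenerativeModel("gemini-1.5-pro").generate_content(...).text`, so the same chain runs end to end against the live model.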

(End of generated output)

📝 Newsletter

In the newsletter, every Monday evening, I share my hands-on experiments so you can experience them vicariously, along with a nudge to take your own next step toward the future.

I mainly diagram and test AI tools, and publish those experiences as a newsletter.

Subscribe here 👇️

📝 Author

テツメモ|tetumemo (X account)

テツメモ|tetumemo (link collection)

