[Distil-Whisper: A Faster Speech Recognition Model] English commentary with Japanese translation [November 3, 2023 | @1littlecoder]

2023年11月4日 10:40

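The Hugging Face team has released Distil-Whisper, a new, fast speech-to-text model. It is a distilled, smaller version of the original Whisper model, specialized for English, and it stays within 1% word error rate (WER) of the original. Distil-Whisper is available on the Hugging Face model Hub and runs with the Transformers library, for example in a Google Colab notebook. Transcription is fast: in the video, a 5-minute clip took about 8 seconds and a 9-minute-29-second clip took 28 seconds. It currently supports English only, though support for other languages is expected in the future.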
Published: November 3, 2023
Note: It is recommended to play the video before reading.

We have a new, very fast speech-to-text using Distil-Whisper.

Distil-Whisperを使った新しい、非常に高速な音声テキスト変換がある。

Distillation is a process in which you reduce the size of a deep learning model.

蒸留（Distillation）とは、ディープラーニングモデルのサイズを縮小するプロセスです。

Hugging Face team has reduced the size of Whisper.

Hugging FaceチームはWhisperのサイズを縮小しました。

That means Whisper is now going to be faster, smaller, and it can create much more text.

つまり、Whisperはより速く、より小さくなり、より多くのテキストを作成できるようになった。

It can transcribe much more text using a lesser amount of time, lesser amount of compute.

より少ない時間、より少ない計算量で、より多くのテキストを書き写すことができる。

That's exactly what we're going to see in this video on this Google Colab notebook.

これこそが、このGoogle Colabノートブックで私たちがこのビデオで見ようとしていることなのです。

What is this project about?

このプロジェクトは何についてですか？

This project is Distil-Whisper, so it's the distilled version of Whisper, for English only.

このプロジェクトはDistil-Whisperで、Whisperを蒸留した英語専用のバージョンです。

That's something that you need to keep in mind.

これは覚えておいてほしいことです。

This is six times faster, 49% smaller, and it performs within 1% of word error rate (WER) on out-of-distribution evaluation set.

Distil-Whisperは6倍速く、49％小さくなり、分布外の評価セットで単語誤り率（WER）が1％以内に収まります。
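
For readers who have not met the metric before, here is a minimal sketch of how a word error rate can be computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` and the sample sentences are illustrative only, and this is not the evaluation code the Distil-Whisper team used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences: one reference word ("the") is dropped.
reference = "welcome back to the channel"
hypothesis = "welcome back to channel"
print(word_error_rate(reference, hypothesis))
```

Real evaluations usually normalize punctuation and casing more carefully before comparing, so treat the number from a sketch like this as a rough guide.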

So, which means it doesn't do bad when you compare it with the original Whisper that OpenAI actually launched.

つまり、OpenAIが実際に公開したオリジナルのWhisperと比較しても、悪い結果ではないということだ。

The Whisper large V2 is the base one and they've got two versions: Distil large V2 and Distil medium English.

Whisper large V2が基本で、2つのバージョンがある： Distil large V2とDistil medium Englishだ。

So, for my experiments, I use medium because I've been using the medium model for a lot of different tasks.

私の実験では、これまで多くの異なるタスクにmediumモデルを使ってきたので、mediumモデルを使用しています。

I kind of found that the medium model has the sweet spot.

ミディアムモデルにはスイートスポットがあるんだ。

How do you use it?

どのように使うのですか？

There are three things that you can do with this one.

これによって、3つのことができます。

You can do short-form transcription, couple of seconds.

数秒間の短い形式の転写ができます。

You can do long-form transcription.

長い形式の転写ができます。

And third, you can use this model as an assistant for the original Whisper for speculative decoding, which increases the speed at which the original Whisper would transcribe.

そして3つ目として、このモデルを元のWhisperのアシスタントとして使い、投機的デコーディング（speculative decoding）を行うことができます。これにより、元のWhisperが転写する速度が向上します。
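
To make the third option more concrete, here is a toy sketch of greedy speculative decoding: a small draft model proposes a few tokens, the large target model checks them, and the output stays identical to what the target alone would produce. The two "models" below are hypothetical stand-in functions, not Whisper. In Transformers the same idea is, as far as I know, exposed by passing an `assistant_model` to generation.

```python
from typing import Callable, List

NextToken = Callable[[List[str]], str]  # maps a token prefix to the next token


def greedy_decode(model: NextToken, prompt: List[str], max_new: int) -> List[str]:
    """Plain greedy decoding: one model call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out[len(prompt):]


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[str],
                       max_new: int, k: int = 4) -> List[str]:
    """Greedy speculative decoding with a draft model proposing k tokens at a time."""
    out = list(prompt)
    produced = 0
    while produced < max_new:
        # 1) The cheap draft model proposes up to k tokens.
        n = min(k, max_new - produced)
        proposal: List[str] = []
        for _ in range(n):
            proposal.append(draft(out + proposal))

        # 2) The target model verifies the proposal. A real model would score
        #    all positions in one batched forward pass; here we loop for clarity.
        accepted: List[str] = []
        for token in proposal:
            expected = target(out + accepted)
            if token == expected:
                accepted.append(token)
            else:
                accepted.append(expected)  # keep the target's token, drop the rest
                break

        out.extend(accepted)
        produced += len(accepted)
    return out[len(prompt):]


# Hypothetical toy models: the draft agrees with the target most of the time.
SENTENCE = "hey friends welcome back to the channel".split()


def target_model(prefix: List[str]) -> str:
    return SENTENCE[len(prefix) % len(SENTENCE)]


def draft_model(prefix: List[str]) -> str:
    if len(prefix) % 5 == 4:
        return "<wrong>"  # deliberate mistake so the rejection path is exercised
    return target_model(prefix)


print(speculative_decode(target_model, draft_model, [], max_new=10))
```

The speed-up comes from the target model checking several draft tokens at once instead of generating them one by one, which is why a small, accurate draft model such as Distil-Whisper is a natural assistant.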

So, we're going to see the long-form transcription in this particular video and I'm going to show you how that is happening.

では、このビデオでは長い形式の転写を見ていきます。それがどのように行われているかをお見せします。

The model is already available on Hugging Face's model Hub if you want to download.

このモデルはすでにHugging Faceのモデルハブでダウンロードできます。

So, it's available here.

ですから、ここで入手可能です。

You can go to the files and versions and you can see the model available as a safetensors file and also the PyTorch file.

ファイルとバージョンのところに行くと、safetensorsファイルとPyTorchファイルがあります。

You can use whatever you like.

お好きなものをお使いください。

Now, let me take you to the Google Colab notebook.

では、Google Colabノートブックをご覧ください。

If you see, I just took one audio clip of mine.

見てください、私のオーディオ・クリップを1つ取りました。

So, this is the audio clip and this audio clip, as you can see, it's a 5-minute audio clip.

これがオーディオクリップで、このオーディオクリップは見ての通り5分のオーディオクリップです。

Goes to 5 minutes and for that 5-minute audio clip, it took about 8 seconds and it did the transcription.

5分になり、その5分のオーディオクリップには約8秒かかり、転写が行われました。

So, it goes on to say that, Okay, we have a bunch of news to cover starting from M AI, which is something that everybody covered.

それでは、M AIから始まるいくつかのニュースをカバーすると書かれています。

So, I'm not going to spend a lot of time on the obvious news that you probably would know.

というわけで、おそらく皆さんもご存知のような明らかなニュースにはあまり時間を割きません。

Let's understand the code in detail.

コードの詳細を理解しよう。

The first thing is this Distil-Whisper is integrated with Transformers, so you can use Hugging Face Transformers to do the transcription.

まず、このDistil-WhisperはTransformersと統合されているので、Hugging Face Transformersを使って文字起こしができる。

So, we're going to install Transformers, accelerate, and if you were to use datasets from Hugging Face, then you can install datasets as well.

つまり、TransformersとAccelerateをインストールし、Hugging Faceのデータセットを使うのであれば、Datasetsもインストールする。

If you were to use Flash Attention 2, which will speed up this process, or if your GPU supports it and you have got the right PyTorch version, then you can use it.

もしFlash Attention 2を使うなら、このプロセスをスピードアップできますし、GPUが対応していてPyTorchのバージョンが合っていれば、それを使うこともできます。

In our particular case, this is not helpful if you do not use Flash Attention, but still, if you want to speed up the process, then you can use Optimum.

私たちの特別なケースでは、Flash Attentionを使用しない場合は役に立ちませんが、それでもプロセスをスピードアップしたいのであれば、Optimumを使用することができます。

And then use BetterTransformer from Optimum, which we'll see in the rest of the code.

そして、OptimumのBetterTransformerを使う。これはコードの残りの部分で見ていきます。

So, install Optimum once you have installed all these libraries.

これらのライブラリをすべてインストールしたら、Optimumをインストールしよう。
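
In a notebook, the install step is typically a single `pip install transformers accelerate datasets optimum` cell. As a small sanity check afterwards, the following sketch reports which of the packages named in the video are importable. The helper name `missing_packages` is made up for this example.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of top-level package names that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Package names as mentioned in the video; datasets and optimum are optional there.
required = ["transformers", "accelerate"]
optional = ["datasets", "optimum"]

print("missing required:", missing_packages(required))
print("missing optional:", missing_packages(optional))
```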

This is just a utility for me to wrap text in Google Colab, so never mind this.

これはGoogle Colabでテキストを折り返すためのユーティリティなので、気にしないでほしい。

The next thing is to import torch and, from Transformers, import AutoModelForSpeechSeq2Seq, AutoProcessor, and pipeline.

次に、torchをインポートし、TransformersからAutoModelForSpeechSeq2Seq、AutoProcessor、pipelineをインポートする。

We're going to use primarily pipeline to create the automatic speech recognition ASR pipeline.

主にパイプラインを使って、自動音声認識ASRパイプラインを作成する。

Then, from optimum.bettertransformer, import BetterTransformer.

そして、optimum.bettertransformerから、BetterTransformerをインポートします。

This is just for your reference.

これはあくまで参考のためです。

We're not going to use this particular function here.

ここでは、この特定の関数を使うつもりはありません。

So, we're going to directly convert the model to a BetterTransformer model.

というわけで、モデルを直接BetterTransformerモデルに変換します。

This is the step where you specify whether you have got a GPU.

これはGPUを持っているかどうかを指定するステップです。

If it is a GPU, it is going to accept it.

GPUであれば、それを受け入れます。

So, I'm running this on a T4 GPU on Google Colab.

というわけで、Google ColabのT4 GPUでこれを実行しています。

If you use this Google Colab notebook that I linked in the YouTube description, you can directly use it.

YouTubeの説明でリンクしたGoogle Colabノートブックを使えば、直接使うことができます。

If it is not there, then it will fall back to the CPU, where it will take a lot of time because this is not optimized for CPU.

GPUがない場合はCPUを使うことになるが、これはCPUに最適化されていないため、かなりの時間がかかる。

Then, you specify the torch dtype based on the kind of machine that you have got.

そして、自分の持っているマシンの種類に応じて、torchのdtype（データ型）を指定します。
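
The device and dtype choice described here can be sketched as follows. The rule assumed is half precision (`float16`) on a CUDA GPU and full precision (`float32`) on CPU, which is a common convention but not something the video spells out. The helper `choose_device_and_dtype` is a hypothetical name, and torch is imported defensively so the snippet still runs where it is not installed.

```python
def choose_device_and_dtype(cuda_available: bool):
    """Return (device, dtype_name): float16 on a CUDA GPU, float32 on CPU."""
    if cuda_available:
        return "cuda:0", "float16"
    return "cpu", "float32"


try:
    import torch

    device, dtype_name = choose_device_and_dtype(torch.cuda.is_available())
    torch_dtype = getattr(torch, dtype_name)  # e.g. torch.float16
except ImportError:
    # torch is not installed here; fall back to the CPU answer for illustration.
    device, dtype_name = choose_device_and_dtype(False)
    torch_dtype = None

print(device, dtype_name)
```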

Then, you use the model, whether you want the large model, whether you want the medium model, you specify the model here, and you start downloading the model.

次に、モデルを使用します。大きなモデルを使用するか、中モデルを使用するかを指定し、モデルをダウンロードし始めます。

So, if you have got Flash Attention support, then you specify here use_flash_attention_2 is equal to True.

Flash Attentionをサポートしている場合は、ここでuse_flash_attention_2をTrueに指定します。

But if you do not have flash attention support, then you don't have to use it.

しかし、フラッシュ・アテンションをサポートしていない場合は、使用する必要はありません。

low_cpu_mem_usage is True.

low_cpu_mem_usageはTrueにする。

Accelerate will help you with this.

Accelerateがこれを助けてくれる。

Set use_safetensors so that it takes the safetensors file instead of the PyTorch file.

use_safetensorsを指定して、PyTorchファイルの代わりにsafetensorsファイルを読み込む。

Now, after you specify all these things, then you move the model to GPU first, and then you convert the model into a BetterTransformer model.

さて、これらをすべて指定したら、まずモデルをGPUに移動させ、次にそのモデルをBetterTransformerモデルに変換します。

So, here we are moving the model to GPU, then we are converting the model into an Optimum BetterTransformer model.

つまり、ここではモデルをGPUに移動し、次にモデルをOptimumのBetterTransformerモデルに変換しています。

After you have done all these things, now you have to specify the processor: AutoProcessor.from_pretrained, and give the model ID.

これらの作業をすべて終えたら、今度はプロセッサを指定します。AutoProcessor.from_pretrainedにモデルIDを渡します。
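
Putting the loading steps together, a sketch along these lines should match what is described: build the `from_pretrained` arguments, move the model to the device, optionally convert it with BetterTransformer, then load the processor. The helper names `build_model_kwargs` and `load_model_and_processor` are hypothetical. The `use_flash_attention_2` argument is the one named in the video; as far as I know, newer Transformers releases prefer `attn_implementation="flash_attention_2"`, and `to_bettertransformer()` needs Optimum installed and may be unavailable in recent versions, so check the version you have.

```python
def build_model_kwargs(dtype, use_flash_attention: bool = False) -> dict:
    """Collect the from_pretrained arguments mentioned in the walkthrough."""
    kwargs = {
        "torch_dtype": dtype,
        "low_cpu_mem_usage": True,  # relies on Accelerate being installed
        "use_safetensors": True,    # load the safetensors weights, not the PyTorch file
    }
    if use_flash_attention:
        # Argument name as used at the time of the video (assumption for newer versions).
        kwargs["use_flash_attention_2"] = True
    return kwargs


def load_model_and_processor(model_id: str, device: str, dtype,
                             use_flash_attention: bool = False,
                             use_bettertransformer: bool = True):
    """Load the model and processor; transformers is imported lazily on purpose."""
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        model_id, **build_model_kwargs(dtype, use_flash_attention)
    )
    model.to(device)  # move to GPU (or CPU) first
    if use_bettertransformer and not use_flash_attention:
        model = model.to_bettertransformer()  # requires the optimum package
    processor = AutoProcessor.from_pretrained(model_id)
    return model, processor


print(build_model_kwargs("float16"))
```

A call might look like `load_model_and_processor("distil-whisper/distil-medium.en", device, torch_dtype)`, where the model ID is the Hub name of the medium English checkpoint.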

Now is the time when you're going to create the audio automatic speech recognition pipeline.

これから、オーディオの自動音声認識パイプラインを作成します。

So, create a pipeline for automatic speech recognition with the model and the tokenizer from the processor; the chunk length is what is very important for long-form transcription.

つまり、モデルと、プロセッサから取り出したトークナイザーを指定して、自動音声認識のパイプラインを作成します。長い形式の転写を行うには、チャンクの長さが非常に重要です。

So, chunk length specifies how does it want to chunk.

チャンクの長さは、どのようにチャンクするかを指定します。

And then do the transcription, so that even if you have like a 1-hour audio clip, it can chunk it and then it can do the transcription for you.

そして、転写を行います。つまり、1時間のオーディオクリップでも、それを分割して転写することができます。

That is what you're specifying here.

これが、ここで指定することです。
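
The sketch below shows the idea of chunking and one plausible way to build the pipeline. `chunk_spans` is a simplified illustration of cutting a long clip into overlapping windows, not the library's actual algorithm. The values `chunk_length_s=15`, `batch_size=16`, `max_new_tokens=128` and the 2.5-second overlap are assumptions, since the video does not state them, and passing `feature_extractor` alongside the tokenizer is likewise an assumption about what the notebook does.

```python
def chunk_spans(duration_s: float, chunk_length_s: float = 15.0, overlap_s: float = 2.5):
    """Simplified illustration: split [0, duration_s] into overlapping windows."""
    if chunk_length_s <= overlap_s:
        raise ValueError("chunk_length_s must be larger than overlap_s")
    step = chunk_length_s - overlap_s
    spans = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_length_s, duration_s)
        spans.append((start, end))
        if end >= duration_s:
            break
        start += step
    return spans


def build_asr_pipeline(model, processor, device, dtype, chunk_length_s: int = 15):
    """Create the long-form ASR pipeline; transformers is imported lazily."""
    from transformers import pipeline

    return pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        max_new_tokens=128,
        chunk_length_s=chunk_length_s,  # enables chunked long-form transcription
        batch_size=16,                  # chunks are transcribed in batches
        torch_dtype=dtype,
        device=device,
    )


# A 40-second clip becomes three overlapping 15-second windows.
print(chunk_spans(40.0))
```

Because the chunks are independent, they can be batched on the GPU, which is a large part of why long clips finish so quickly.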

Once the pipeline is ready, then all you have to do is go to file upload.

パイプラインの準備ができたら、あとはファイルのアップロードに進むだけです。

You can use this to upload the file.

ファイルをアップロードするには、これを使います。

You can download it from the internet, whatever you want to do.

インターネットからダウンロードすることもできます。

And the next step is, whatever the name of the file that you uploaded, you specify the name of the file here, and then it will start transcribing it.

そして次のステップは、アップロードしたファイルの名前が何であれ、ここでファイル名を指定すると、テープ起こしが始まります。
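
The transcription call itself is short. In Colab the upload is usually done with `files.upload()` from `google.colab`, and the pipeline then takes the file name and returns a dictionary with a `"text"` entry. The sketch below wraps that in a hypothetical `transcribe` helper and uses a stand-in pipeline so it runs without a GPU; `clip.mp3` is a placeholder file name.

```python
def transcribe(asr_pipeline, audio_path: str) -> str:
    """Run the ASR pipeline on one file and return the transcribed text."""
    result = asr_pipeline(audio_path)
    return result["text"].strip()


def fake_pipeline(audio_path: str) -> dict:
    """Stand-in for the real pipeline object, for illustration only."""
    return {"text": " Hey friends, welcome back to the channel."}


print(transcribe(fake_pipeline, "clip.mp3"))
```

With the real pipeline object in place of `fake_pipeline`, the same call would transcribe the uploaded file.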

Like I said, I had already done the transcription previously.

さっきも言ったように、私はすでにテープ起こしを終えていました。

I showed it to you.

それをお見せしました。

I just uploaded a file.

ファイルをアップロードしただけです。

Now, this file is from this YouTube video of Ali Abdaal.

このファイルはAli AbdaalのYouTubeビデオからのものです。

So this file is 9 minutes 29 seconds.

このファイルは9分29秒です。

So we're going to see in real time how much time it takes, so that you can judge for yourself whether you want to use Distil-Whisper or not.

ですから、Distil-Whisperを使うかどうかを判断できるように、どれくらいの時間がかかるかをリアルタイムで見てみましょう。

So I've already downloaded the MP3 of it, and I've uploaded it here.

すでにMP3をダウンロードして、ここにアップロードしました。

So I'm going to just go copy it, copy the path, come back here, and then I'm going to paste it.

だから、それをコピーして、パスをコピーして、ここに戻ってきて、貼り付けます。

And, uh, I can just give the file name because it's in the root.

そして、ルート内にあるので、ファイル名を指定します。

And then I'm going to not print the text.

それから、テキストは表示（print）しないようにします。

I don't need to.

その必要はない。

Just do the result.

結果だけでいいんです。

When I do this thing, you can see that it is going to start.

この作業を行うと、開始されることがわかります。

Like I said, for a 5-minute clip, it took me 8 seconds.

さっき言ったように、5分のクリップで8秒かかった。

So technically, this should take about like 16 seconds.

だから、単純計算では16秒くらいかかるはずなんだ。

But, uh, let's see what is going to happen.

でも、何が起こるか見てみよう。

Once it is done, you can just print the final result.

完了したら、最終結果を表示（print）するだけです。

And then this will give you the text. Like I said, this currently works only with the English language.

そうすると結果のテキストが得られます。さっきも言ったように、これは今のところ英語でしか使えません。

So if you were using Whisper for a different language, maybe this is not a good solution for this.

だから、もし違う言語でWhisperを使っていたら、これはあまりいい解決策ではないかもしれない。

But I guess they would probably expand this to other languages as well in the future because this is a tried and tested solution.

しかし、これは実績のあるソリューションなので、将来的にはおそらく他の言語にも拡大されるでしょう。

So it took 28 seconds for a 9.29 or 9 minutes 29 seconds clip.

9.29、つまり9分29秒のクリップに28秒かかったわけだ。
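
The two timings from the video can be turned into a quick back-of-the-envelope comparison. The sketch below computes how many times faster than real time each run was, and what a purely linear extrapolation from the 5-minute run would predict for the longer clip. The function name is illustrative, and the timings are the ones reported in the video, not new measurements.

```python
def speedup_over_real_time(audio_seconds: float, processing_seconds: float) -> float:
    """How many seconds of audio are transcribed per second of processing."""
    return audio_seconds / processing_seconds


short_clip_s = 5 * 60     # 5-minute clip, transcribed in about 8 s
long_clip_s = 9 * 60 + 29  # 9 min 29 s clip, transcribed in 28 s

short_speedup = speedup_over_real_time(short_clip_s, 8)
long_speedup = speedup_over_real_time(long_clip_s, 28)

# If time scaled linearly with duration, the long clip would take about this long.
linear_estimate_s = long_clip_s * 8 / short_clip_s

print(f"short clip: {short_speedup:.1f}x real time")
print(f"long clip:  {long_speedup:.1f}x real time")
print(f"linear estimate for long clip: {linear_estimate_s:.1f} s (observed: 28 s)")
```

The observed 28 seconds is above the linear estimate, so the scaling in this single run was not perfectly linear. One run per clip is not enough to say why.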

Let's go and print this thing and then let's see how it goes.

さあ、これを表示（print）して、どうなるか見てみよう。

Okay, it says, Hey friends, welcome back to the channel.

よし、それによると、みなさん、チャンネルへようこそと書いてあります。

Let's talk about focusing.

集中について話しましょう。

So you can see the entire text that has been transcribed.

ということで、書き起こされたテキスト全体が見えます。

And, um, this is all in 28 seconds for a 10-minute clip.

そして、これは10分のクリップで28秒です。

It's quite amazing.

すごいですね。

And, um, you can do it on the free Google Colab version.

これは無料のGoogle Colabバージョンでできます。

If you are doing it on Colab, every time you have to download the model.

Colabでやる場合は、毎回モデルをダウンロードする必要があります。

If you have got your own GPU, you don't have to download the model and do the setup every time.

自分のGPUを持っていれば、毎回モデルをダウンロードしてセットアップする必要はありません。

And then, um, wait for the CPU-optimized Distil-Whisper that is going to come soon, according to the team.

そして、チームによるとCPUに最適化されたDistil-Whisperがもうすぐ登場するとのことなので、それを待つことにしよう。

And you can go check out the other details here.

その他の詳細は、こちらでチェックできる。

Like, if you want to do short form, you can do it without um chunking.

例えば、ショートフォームを使いたいなら、チャンキングなしでできる。

If you want to do speculative decoding, they've got the same thing.

投機的デコーディングをしたいなら、同じものがある。

You can use our Google Colab notebook to do the same thing as well.

Google Colabノートブックを使っても同じことができる。

Just modify the code.

コードを修正するだけです。

So, in this video, we learned how to do speech to text.

このビデオでは、音声をテキストに変換する方法を学びました。

In fact, like one of the fastest in the world at this point.

実際、現時点で世界最速のもののひとつです。

Um, except the quantized models or whisper.cpp.

量子化モデルやwhisper.cppを除けばね。

So, this is a version of Whisper that does speech to text using a distilled version of the Whisper model on GPU.

これは、GPU上でWhisperモデルの蒸留版を使って音声のテキスト化を行うWhisperのバージョンです。

We have seen that an approximately 10-minute clip takes 28 seconds.

約10分のクリップには28秒かかりました。

A 5-minute clip took 8 seconds.

5分のクリップは8秒かかった。

So, let me know in the comment section what you feel about it.

それでは、コメント欄で感想をお聞かせください。

The Google Colab notebook will be in the YouTube description for you to directly check it out.

Google Colabノートブックは、YouTubeの説明文に記載されているので、直接確認してほしい。

See you in another video.

また別のビデオでお会いしましょう。

Happy prompting.

それではまた。
