Automated Meeting Minutes Generation using Faster-Whisper, Pyannote, and ChatGPT
Introduction
Creating meeting minutes is a common requirement after most conferences and meetings. The task is often seen as time-consuming and tedious, especially since the resulting documents are rarely read thoroughly. To address this, I developed a Python script that generates meeting minutes from audio or video files; this article walks through that script.
The script leverages three powerful AI models:
Faster-Whisper (for speech recognition)
Pyannote (for speaker identification)
ChatGPT (for text generation)
I previously wrote an article about using Whisper, where processing time was a significant drawback. This article therefore uses Faster-Whisper, a more efficient reimplementation, and the speed improvement is substantial and very welcome for this use case.
Script Overview
The script operates through the following steps (sketched in code just after this list):
Loading and preprocessing the audio/video file
Transcribing the audio using Faster-Whisper
Identifying speakers with Pyannote
Integrating transcription and speaker information
Generating meeting minutes using ChatGPT
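At a high level, these steps chain together as in the rough sketch below. This is only an outline: the function names are the ones defined in the full script later in this article, and "meeting.mp4" is a placeholder file name.
# Rough sketch of the overall flow (see the full script below for the actual implementation)
wav_file = convert_to_wav(extract_audio_from_video("meeting.mp4"))      # step 1: load and preprocess
transcription, segments = transcribe_audio(wav_file, "large-v3")        # step 2: Faster-Whisper transcription
diarization = diarize_audio(wav_file, min_speakers=1, max_speakers=5)   # step 3: Pyannote speaker diarization
speaker_segments = assign_speakers_to_segments(segments, diarization)   # step 4: merge transcript and speakers
minutes = generate_minutes(speaker_segments)                            # step 5: ChatGPT meeting minutes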
Let's examine each component in detail, but first, we'll cover the environment setup. I run this script on Ubuntu 24.04 on WSL, using an NVIDIA RTX 3070Ti GPU.
Environment Setup
Obtaining an OpenAI API Key
Create an account on the OpenAI website and acquire an API key. For detailed instructions, please refer to this article.
Acquiring a Hugging Face Access Token
For guidance on obtaining a Hugging Face access token, please consult this article:
https://zenn.dev/protoout/articles/73-hugging-face-setup
Note that pyannote/speaker-diarization-3.1 is a gated model, so you also need to accept its user conditions on its Hugging Face model page before the script can download it.
Docker Engine Installation
https://docs.docker.com/engine/install/ubuntu/
Remove potentially old Docker installations:
for pkg in docker.io docker-doc docker-compose docker-compose-v2 podman-docker containerd runc; do sudo apt remove $pkg; done
Install Docker Engine:
sudo apt update
sudo apt install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo docker run hello-world
docker -v
wsl --shutdown
Configure Docker to run without sudo (replace rui with your own username):
sudo groupadd docker
sudo usermod -aG docker rui
wsl --shutdown
Set Docker to start automatically:
To avoid manually starting Docker after each reboot, add the following to your sudoers file:
sudo visudo
Add this line at the end (it allows members of the docker group to start the Docker service without a password):
%docker ALL=(ALL) NOPASSWD: /usr/sbin/service docker start
sudo nano ~/.bashrc
Add this line at the end:
if [[ $(service docker status | awk '{print $4}') = "not" ]]; then
    sudo service docker start > /dev/null
fi
source ~/.bashrc
NVIDIA Docker Installation
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Creating a Docker Container
In your Ubuntu on WSL home folder, pull the CUDA image from the NVIDIA NGC catalog (CUDA | NVIDIA NGC) and start a container with GPU access:
docker pull nvcr.io/nvidia/cuda:12.2.2-cudnn8-devel-ubuntu22.04
docker run -it --gpus all nvcr.io/nvidia/cuda:12.2.2-cudnn8-devel-ubuntu22.04
apt update && apt full-upgrade -y
apt install git wget nano ffmpeg -y
Miniconda Installation
cd ~
mkdir tmp
cd tmp
https://docs.anaconda.com/miniconda/#miniconda-latest-installer-links
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# yes, enter, yes
# remove the tmp folder
cd ..
rm -rf tmp
exit
Back on the Ubuntu side, find the stopped container, start it, and re-enter it:
docker container ls -a
docker start <container id>
docker exec -it <container id> /bin/bash
Conda Environment Setup
mkdir faster_whisper
cd faster_whisper
nano environment_faster_whisper.yml
name: faster-whisper
channels:
  - conda-forge
  - pytorch
  - nvidia
  - defaults
dependencies:
  - python=3.11
  - pip
  - nvidia/label/cuda-12.2.2::cuda-toolkit
  - tiktoken
  - pip:
      - faster-whisper
      - pyannote.audio
      - pydub==0.25.1
      - torch
      - torchvision
      - torchaudio
      - moviepy
      - openai
      - python-dotenv
conda env create -f environment_faster_whisper.yml
conda activate faster-whisper
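Before writing the main script, it is worth a quick sanity check that PyTorch inside the new environment can actually see the GPU. Below is a minimal check (saved here as a hypothetical check_gpu.py, not part of the main script); the device name printed will differ on your machine.
# check_gpu.py -- confirm that the conda environment has GPU access
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
If this prints False, revisit the NVIDIA Container Toolkit setup and make sure the container was started with --gpus all.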
Script
nano faster_whisper_script.py
import os
import subprocess
from faster_whisper import WhisperModel
from pyannote.audio import Pipeline
from pydub import AudioSegment
from moviepy.editor import VideoFileClip
import torch
from openai import OpenAI
from dotenv import load_dotenv
import argparse
import tiktoken
def get_env_variable(var_name):
    value = os.getenv(var_name)
    if value is None:
        raise ValueError(f"Environment variable {var_name} is not set")
    return value

# Load the .env file
load_dotenv()

# Retrieve OpenAI API key and HuggingFace token from environment variables
OPENAI_API_KEY = get_env_variable("OPENAI_API_KEY")
HF_TOKEN = get_env_variable("HF_TOKEN")

def get_device():
    return 'cuda' if torch.cuda.is_available() else 'cpu'
def convert_to_wav(input_file):
    file_ext = os.path.splitext(input_file)[1].lower()
    if file_ext in [".mp3", ".mp4", ".wma"]:
        print(f"Converting {input_file} to WAV")
        # Build the output path from the file stem so it works regardless of extension case
        wav_file_path = os.path.splitext(input_file)[0] + ".wav"
        if file_ext == ".wma":
            try:
                subprocess.run(['ffmpeg', '-i', input_file, wav_file_path], check=True)
                print(f"Converted file saved as {wav_file_path}")
            except subprocess.CalledProcessError:
                raise ValueError(f"Error converting WMA file: {input_file}")
        else:
            audio = AudioSegment.from_file(input_file, format=file_ext[1:])
            audio.export(wav_file_path, format="wav")
            print(f"Converted file saved as {wav_file_path}")
        return wav_file_path
    elif file_ext == ".wav":
        return input_file
    else:
        raise ValueError(f"Unsupported file format: {file_ext}")
def extract_audio_from_video(video_file):
    # Derive the .wav path from the file stem (handles .mp4/.MP4 alike)
    audio_file = os.path.splitext(video_file)[0] + ".wav"
    print(f"Extracting audio from {video_file}")
    try:
        video = VideoFileClip(video_file)
        audio = video.audio
        audio.write_audiofile(audio_file)
        video.close()
        audio.close()
        print(f"Audio extracted and saved as {audio_file}")
    except Exception as e:
        print(f"An error occurred while extracting audio: {str(e)}")
        raise
    return audio_file
def transcribe_audio(file_path, model_name, beam_size=5, compute_type="float16"):
    print(f"Transcribing audio from {file_path} using model {model_name}")
    print(f"Beam size: {beam_size}, Compute type: {compute_type}")
    device = get_device()
    model = WhisperModel(model_name, device=device, compute_type=compute_type)
    try:
        # First, attempt multi-segment language detection
        language_info = model.detect_language_multi_segment(file_path)
        detected_language = language_info[0]
    except AttributeError:
        # If multi-segment method is not available, use standard language detection
        print("Multi-segment language detection not available. Using standard detection.")
        segments, info = model.transcribe(file_path, beam_size=beam_size)
        detected_language = info.language
    print(f"Detected language: {detected_language}")
    segments, info = model.transcribe(file_path, beam_size=beam_size, language=detected_language)
    transcription = ""
    formatted_segments = []
    for segment in segments:
        transcription += segment.text + " "
        formatted_segments.append({
            "start": segment.start,
            "end": segment.end,
            "text": segment.text
        })
    print("Transcription completed")
    return transcription.strip(), formatted_segments
def diarize_audio(file_path, num_speakers=None, min_speakers=None, max_speakers=None):
    print(f"Performing speaker diarization on {file_path}")
    device = get_device()
    print(f"Using device: {device}")
    try:
        pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token=HF_TOKEN)
        # pyannote expects a torch.device object rather than a plain string
        pipeline.to(torch.device(device))
    except Exception as e:
        print(f"Failed to load the pipeline: {e}")
        return None
    if num_speakers is not None:
        diarization = pipeline(file_path, num_speakers=num_speakers)
    elif min_speakers is not None and max_speakers is not None:
        diarization = pipeline(file_path, min_speakers=min_speakers, max_speakers=max_speakers)
    else:
        diarization = pipeline(file_path)
    print(f"Speaker diarization completed: {diarization}")
    return diarization
def assign_speakers_to_segments(segments, diarization):
    if diarization is None:
        print("Diarization is None. Skipping speaker assignment.")
        return []
    print("Assigning speakers to segments")
    speaker_segments = []
    current_speaker = None
    current_segment = None
    unmatched_segments = []
    for segment in segments:
        start_time = segment['start']
        end_time = segment['end']
        text = segment['text']
        matched = False
        for turn, _, speaker in diarization.itertracks(yield_label=True):
            overlap = min(end_time, turn.end) - max(start_time, turn.start)
            if overlap > 0:  # Consider it a match if there's an overlap
                if current_speaker == speaker:
                    # Append to the current segment
                    current_segment['end'] = end_time
                    current_segment['text'] += " " + text
                else:
                    # Save the current segment and start a new one
                    if current_segment:
                        speaker_segments.append(current_segment)
                    current_speaker = speaker
                    current_segment = {
                        "start": start_time,
                        "end": end_time,
                        "speaker": speaker,
                        "text": text
                    }
                matched = True
                break
        if not matched:
            print(f"No matching speaker found for segment: {text} [{start_time} - {end_time}]")
            unmatched_segments.append({
                "start": start_time,
                "end": end_time,
                "speaker": "UNKNOWN",
                "text": text
            })
    if current_segment:
        speaker_segments.append(current_segment)
    # Merge unmatched segments into speaker_segments
    all_segments = speaker_segments + unmatched_segments
    all_segments.sort(key=lambda x: x['start'])  # Sort by start time
    print("Speakers assigned to segments")
    return all_segments
def count_tokens(text):
    encoding = tiktoken.encoding_for_model("gpt-4-turbo")
    return len(encoding.encode(text))

def split_transcript(transcript, max_tokens_per_chunk):
    words = transcript.split()
    chunks = []
    current_chunk = []
    current_chunk_tokens = 0
    for word in words:
        word_tokens = count_tokens(word)
        if current_chunk_tokens + word_tokens > max_tokens_per_chunk:
            chunks.append(" ".join(current_chunk))
            current_chunk = [word]
            current_chunk_tokens = word_tokens
        else:
            current_chunk.append(word)
            current_chunk_tokens += word_tokens
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
def generate_minutes(speaker_segments):
    print("Generating meeting minutes using OpenAI API")
    client = OpenAI(api_key=OPENAI_API_KEY)
    transcript = "\n".join([f"Speaker {seg['speaker']} [{seg['start']} - {seg['end']}]: {seg['text']}" for seg in speaker_segments])
    # Keep each chunk small enough that prompt + completion fit within the model's context window
    max_tokens_per_chunk = 3000  # Adjust this based on your token limits and the model used
    chunks = split_transcript(transcript, max_tokens_per_chunk)
    all_minutes = []
    for chunk in chunks:
        messages = [
            {"role": "system", "content": "Please create meeting minutes for the following conversation."},
            {"role": "user", "content": chunk}
        ]
        try:
            response = client.chat.completions.create(
                messages=messages,
                model="gpt-4",
                max_tokens=4096,  # Adjust the max tokens as necessary
                n=1,
                stop=None,
                temperature=0.5,
            )
            if response.choices and len(response.choices) > 0:
                minutes = response.choices[0].message.content.strip()
                all_minutes.append(minutes)
            else:
                print("No choices in the response")
                return None
        except Exception as e:
            print(f"An error occurred: {e}")
            return None
    return "\n".join(all_minutes)
def parse_arguments():
    parser = argparse.ArgumentParser(description="Speech to Text with Speaker Diarization")
    parser.add_argument("audio_file", help="Path to the audio or video file")
    parser.add_argument("--model", choices=["tiny", "base", "small", "medium", "large", "large-v2", "large-v3"],
                        default="base", help="Faster-whisper model to use")
    parser.add_argument("--num_speakers", type=int, help="Number of speakers (if known)")
    parser.add_argument("--min_speakers", type=int, help="Minimum number of speakers")
    parser.add_argument("--max_speakers", type=int, help="Maximum number of speakers")
    parser.add_argument("--beam_size", type=int, default=5, help="Beam size for transcription")
    parser.add_argument("--compute_type", choices=["float16", "float32", "int8"],
                        default="float16", help="Compute type for the model")
    return parser.parse_args()
def main(args):
    print(f"Processing file: {args.audio_file}")
    file_ext = os.path.splitext(args.audio_file)[1].lower()
    if file_ext == ".mp4":
        audio_file = extract_audio_from_video(args.audio_file)
    else:
        audio_file = args.audio_file
    if file_ext in [".mp3", ".mp4", ".wav", ".wma"]:
        audio_file = convert_to_wav(audio_file)
    transcription, segments = transcribe_audio(audio_file, args.model, args.beam_size, args.compute_type)
    print(f"Transcription: {transcription}")
    diarization = diarize_audio(audio_file, num_speakers=args.num_speakers,
                                min_speakers=args.min_speakers, max_speakers=args.max_speakers)
    speaker_segments = assign_speakers_to_segments(segments, diarization)
    output_file = os.path.splitext(audio_file)[0] + "_output.txt"
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write("Speakers assigned to segments\n")
        for segment in speaker_segments:
            segment_text = f"Speaker {segment['speaker']} [{segment['start']} - {segment['end']}]: {segment['text']}"
            print(segment_text)
            f.write(segment_text + "\n")
    print(f"Results written to {output_file}")
    minutes = generate_minutes(speaker_segments)
    if minutes is not None:
        minutes_file = os.path.splitext(audio_file)[0] + "_minutes.txt"
        with open(minutes_file, 'w', encoding='utf-8') as f:
            f.write(minutes)
        print(f"Meeting minutes written to {minutes_file}")
    else:
        print("Failed to generate meeting minutes.")

if __name__ == "__main__":
    args = parse_arguments()
    main(args)
Create a .env file with your API keys:
nano .env
OPENAI_API_KEY=sk-XXXXXXXXXXXXXXXXXXXXXX
HF_TOKEN=hf_ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ
Usage Instructions
Help message
python faster_whisper_script.py --help
usage: faster_whisper_script.py [-h] [--model {tiny,base,small,medium,large,large-v2,large-v3}] [--num_speakers NUM_SPEAKERS] [--min_speakers MIN_SPEAKERS] [--max_speakers MAX_SPEAKERS]
[--beam_size BEAM_SIZE] [--compute_type {float16,float32,int8}]
audio_file
Speech to Text with Speaker Diarization
positional arguments:
audio_file Path to the audio or video file
options:
-h, --help show this help message and exit
--model {tiny,base,small,medium,large,large-v2,large-v3}
Faster-whisper model to use
--num_speakers NUM_SPEAKERS
Number of speakers (if known)
--min_speakers MIN_SPEAKERS
Minimum number of speakers
--max_speakers MAX_SPEAKERS
Maximum number of speakers
--beam_size BEAM_SIZE
Beam size for transcription
--compute_type {float16,float32,int8}
Compute type for the model
Copying Files and Running the Script
Copy the input file from Windows into the Docker container (use docker container ls -a on the Ubuntu command line to find the container name):
docker cp "/mnt/c/Windows_path/VVVV.wav" docker_container_name:root/faster_whisper
Example usage
python faster_whisper_script.py XXXX.wav --model large-v3 --min_speakers 1 --max_speakers 5
To copy files from the Docker container back to Windows:
docker cp docker_container_name:root/faster_whisper/xxx.txt "/mnt/c/Windows path/"
Conclusion
This Python script combines cutting-edge AI models to automatically generate high-quality meeting minutes from audio or video input, significantly streamlining the work of turning recorded meetings and lectures into written documents.
The implementation of Faster-Whisper has notably reduced processing time, enhancing the efficiency of the entire workflow.
I hope this tool allows you to allocate more time to your educational and research activities, leading to increased productivity and satisfaction in your academic endeavors.
The rapid advancement of AI technology is truly remarkable. I encourage you to explore and utilize such AI technologies to create more efficient work environments tailored to your needs.
Your feedback, questions, and suggestions for improvement are highly valued. Please feel free to leave a comment below.