Trying MERT on Google Colab

What is MERT

MERT is an Acoustic Music Understanding Model, i.e., a model built for music understanding.
In this post, I'd like to see whether it can handle downstream tasks such as music classification.


Links

Colab
GitHub

Preparation

Open Google Colab and switch the runtime to GPU via "Runtime → Change runtime type" in the menu.
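
If you want to double-check that a GPU is actually attached, you can run the following in a cell (an optional check I added):

!nvidia-smi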

Environment setup

Install the required packages:

!pip install transformers accelerate datasets
!pip install nnAudio

Inference

(1) Loading the model

from transformers import Wav2Vec2FeatureExtractor
from transformers import AutoModel
import torch
from torch import nn
import torchaudio.transforms as T
from datasets import load_dataset


device = "cuda" if torch.cuda.is_available() else "cpu"

# loading our model weights
model = AutoModel.from_pretrained("m-a-p/MERT-v1-95M", trust_remote_code=True).to(device)
# loading the corresponding preprocessor config
processor = Wav2Vec2FeatureExtractor.from_pretrained("m-a-p/MERT-v1-95M", trust_remote_code=True)
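
Before feeding any audio in, it can help to print the sampling rate the feature extractor expects and the model size (a small optional check I added; both attributes are standard Hugging Face ones):

print("expected sampling rate:", processor.sampling_rate)
print("parameters: %.1fM" % (sum(p.numel() for p in model.parameters()) / 1e6))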

(2) Preparing the demo data and running inference

dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate

resample_rate = processor.sampling_rate
# make sure the sample_rate aligned
if resample_rate != sampling_rate:
    print(f'setting rate from {sampling_rate} to {resample_rate}')
    resampler = T.Resample(sampling_rate, resample_rate, dtype=torch.float64)
else:
    resampler = None

# audio file is decoded on the fly
if resampler is None:
    input_audio = dataset[0]["audio"]["array"]
else:
    input_audio = resampler(torch.from_numpy(dataset[0]["audio"]["array"]))
  
inputs = processor(input_audio, sampling_rate=resample_rate, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

Here is what the embedding representations look like:

# take a look at the output shape, there are 13 layers of representation
# each layer performs differently in different downstream tasks, you should choose empirically
all_layer_hidden_states = torch.stack(outputs.hidden_states).squeeze()
print(all_layer_hidden_states.shape) # [13 layer, Time steps, 768 feature_dim]

# for utterance level classification tasks, you can simply reduce the representation in time
time_reduced_hidden_states = all_layer_hidden_states.mean(-2)
print(time_reduced_hidden_states.shape) # [13, 768]

# you can even use a learnable weighted average representation
aggregator = nn.Conv1d(in_channels=13, out_channels=1, kernel_size=1)
weighted_avg_hidden_states = aggregator(time_reduced_hidden_states.unsqueeze(0).cpu()).squeeze()
print(weighted_avg_hidden_states.shape) # [768]

Advanced Application

Rather than full-blown classification, I'll check which songs end up similar to one another.
The targets are Vaundy, ONE OK ROCK, Official髭男dism, and TWICE.
(0) Installing yt-dlp

!python3 -m pip install -U yt-dlp

(1) Preparing the data

!yt-dlp -x --audio-format wav "https://youtu.be/V-gxqhWEbxI" -o "%(title)s.%(ext)s"  # Vaundy
!yt-dlp -x --audio-format wav "https://youtu.be/6YZlFdTIdzM" -o "%(title)s.%(ext)s"  # ONE OK ROCK
!yt-dlp -x --audio-format wav "https://youtu.be/hN5MBlGv2Ac" -o "%(title)s.%(ext)s"  # Official髭男dism
!yt-dlp -x --audio-format wav "https://youtu.be/oLrp9uTa9gw" -o "%(title)s.%(ext)s"  # Official髭男dism 2
!yt-dlp -x --audio-format wav "https://youtu.be/lD-GY7WiTd4" -o "%(title)s.%(ext)s"  # TWICE

(2) Preprocessing
Extract roughly the 30-60 second portion of each track:

start = 30
end = 60
!ffmpeg -i "/content/そんなbitterな話 ⧸ Vaundy:MUSIC VIDEO.wav" -ss $start -t $end /content/out_vaundy.wav
!ffmpeg -i "/content/ONE OK ROCK - Clock Strikes [Official Music Video].wav" -ss $start -t $end /content/out_oneok.wav
!ffmpeg -i "/content/Official髭男dism - Subtitle [Official Video].webm" -ss $start -t $end /content/out_official.wav
!ffmpeg -i "/content/Official髭男dism - TATTOO [Official Video].webm" -ss $start -t $end /content/out_official2.wav
!ffmpeg -i "/content/Bouquet.webm" -ss $start -t $end /content/out_twice.wav

(3) Running inference
Define a function that computes an embedding for an audio file:

import librosa

def calc_embedding_for_music(input_audio_path):
    # librosa.load resamples to 22050 Hz by default
    y, sr = librosa.load(input_audio_path)

    resample_rate = processor.sampling_rate
    # make sure the sample rate matches what the processor expects
    if resample_rate != sr:
        print(f'setting rate from {sr} to {resample_rate}')
        resampler = T.Resample(sr, resample_rate, dtype=torch.float64)
    else:
        resampler = None

    if resampler is None:
        input_audio = y
    else:
        input_audio = resampler(torch.from_numpy(y).to(torch.float64))

    inputs = processor(input_audio, sampling_rate=resample_rate, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)

    # a learnable weighted average over layers would need fine-tuning first:
    # aggregator = nn.Conv1d(in_channels=13, out_channels=1, kernel_size=1)
    # weighted_avg_hidden_states = aggregator(time_reduced_hidden_states.unsqueeze(0).cpu()).squeeze()
    # return weighted_avg_hidden_states  # [768]

    # [13 layers, time steps, 768] -> average over time -> [13, 768] -> flatten to [13 * 768]
    all_layer_hidden_states = torch.stack(outputs.hidden_states).squeeze()
    time_reduced_hidden_states = all_layer_hidden_states.mean(-2)
    return time_reduced_hidden_states.view(-1).cpu()
  
from numpy import dot
from numpy.linalg import norm

def calc_cos_sim(a, b):
    # cosine similarity between two 1-D embedding vectors
    return dot(a, b) / (norm(a) * norm(b))

Compute the embedding for each clip:

embed_vaundy = calc_embedding_for_music("/content/out_vaundy.wav").detach().numpy()
embed_oneok = calc_embedding_for_music("/content/out_oneok.wav").detach().numpy()
embed_official = calc_embedding_for_music("/content/out_official.wav").detach().numpy()
embed_official2 = calc_embedding_for_music("/content/out_official2.wav").detach().numpy()
embed_twice = calc_embedding_for_music("/content/out_twice.wav").detach().numpy()
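
Each embedding is the 13 per-layer time averages concatenated into one vector, so it has 13 × 768 = 9984 dimensions:

print(embed_vaundy.shape)  # (9984,)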

Compute the cosine similarities:

print("official vs vaundy: ", calc_cos_sim(embed_official, embed_vaundy))
print("official vs oneok: ", calc_cos_sim(embed_official, embed_oneok))
print("official vs official2: ", calc_cos_sim(embed_official, embed_official2))
print("official vs twice: ", calc_cos_sim(embed_official, embed_twice))

output

official vs vaundy:  0.9231535
official vs oneok:  0.9128771
official vs official2:  0.9318339
official vs twice:  0.907945

It seems to work after a fashion, but the embeddings didn't end up very far apart in the embedding space... Maybe the model really does need to be fed (fine-tuned on) a dataset of this kind of music?
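
Since the comment in the code above notes that each layer behaves differently depending on the downstream task, one cheap thing to try before fine-tuning is to compare the similarity layer by layer. The sketch below is my own addition and assumes the flattened 13 × 768 embeddings computed above:

import numpy as np

def per_layer_cos_sim(a, b, n_layers=13, dim=768):
    # reshape the flattened embedding back to [13 layers, 768 dims]
    a = a.reshape(n_layers, dim)
    b = b.reshape(n_layers, dim)
    return (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

# which layers separate "same artist" from "different artists" the most?
print("official vs official2:", per_layer_cos_sim(embed_official, embed_official2).round(3))
print("official vs twice:    ", per_layer_cos_sim(embed_official, embed_twice).round(3))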

Closing thoughts

In this post I tried MERT, a large model for music understanding, and used it for music analysis as a downstream task. The project itself has apparently been around for a while, but the trend of tackling problems with large-scale models, just as with ViT and LLMs, is clearly reaching music as well. I'd like to keep exploring this area.

Going forward, I plan to keep posting articles on experiments with LLMs, diffusion models, image analysis, and 3D, so stay tuned.
