How to create a video with sound in Python

July 11, 2021, 21:49

Hello, this is Tan2. Today's topic is how to create a video with sound in Python. I could not find much about this even after searching, so I am summarizing it here.

Candidate 1: opencv

When working with video or images in Python, the most commonly used library is probably opencv (cv2). However, opencv on its own does not appear to be able to create a video with audio.

Candidate 2: pyav

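With pyav, I was able to create a video with audio from an input video that has audio, using the code below.
It is the official sample, slightly modified so that it produces a video with audio.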

import av
import numpy as np


input_ = av.open("./videos/01_smile.mp4")
output = av.open('./videos/sample02.mp4', 'w')

# Select the first video stream and the first audio stream of the input
# so that their packets can be demuxed together.
in_stream = input_.streams.get(audio=0, video=0)
# Create the output streams. The frame rate, frame size, sample rate and
# channel layout are hard-coded and assumed to match the input.
out_stream_video = output.add_stream("h264", rate=30, width=1920, height=1080)
out_stream_audio = output.add_stream("aac", rate=44100, layout="stereo")
audio_first_flag = True
for i, packet in enumerate(input_.demux(in_stream)):

   # We need to skip the "flushing" packets that `demux` generates.
   if packet.dts is None:
       continue

   # Handle the packet according to the type of stream it came from.
   if packet.stream.type == 'video': 
       for frame in packet.decode():
           # get PIL image
           image = frame.to_image()
           # to numpy
           image = np.array(image)
           # process the image: set the G channel to its maximum
           image[:, :, 1] = 255
           # to frame
           new_frame = av.VideoFrame.from_ndarray(image, format='rgb24')
           # encode frame to packet
           for new_packet in out_stream_video.encode(new_frame):
               # mux packet
               output.mux(new_packet)

   elif packet.stream.type == 'audio':
       for audio_frame in packet.decode():
           # to numpy
           arr = audio_frame.to_ndarray()
           # decrease volume
           arr = arr * 0.1
           # to audio frame
           new_audio_frame = av.AudioFrame.from_ndarray(arr, format="fltp")
           new_audio_frame.rate = 44100
           # encode frame to packet
           for new_packet in out_stream_audio.encode(new_audio_frame):
               output.mux(new_packet)


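# Flush the frames still buffered in the encoders; without this the
# last frames can be missing from the output.
for new_packet in out_stream_video.encode():
   output.mux(new_packet)
for new_packet in out_stream_audio.encode():
   output.mux(new_packet)

input_.close()
output.close()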

Here is a brief explanation of the flow. The sample above processes an input video and writes the result to a separate video file.

It applies the following two edits.
1. Set the G value of RGB to the maximum (255) in every frame of the original video
2. Reduce the volume to 1/10 (scale the waveform amplitude to 1/10)
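
Once a frame has been converted to an array, both edits are plain NumPy operations. The following is a minimal sketch and not part of the original sample: the helper names `maximize_green` and `scale_volume` are hypothetical, and it assumes an rgb24 image shaped (height, width, 3) with dtype uint8 and floating-point audio samples.

```python
import numpy as np


def maximize_green(image: np.ndarray) -> np.ndarray:
    """Return a copy of an RGB image with the G channel set to 255.

    Assumes shape (height, width, 3), dtype uint8, channel order R, G, B.
    """
    out = image.copy()
    out[:, :, 1] = 255
    return out


def scale_volume(samples: np.ndarray, gain: float = 0.1) -> np.ndarray:
    """Scale the waveform amplitude by `gain`, keeping the input dtype.

    Assumes floating-point samples, for example the float32 arrays that
    a "fltp" audio frame converts to.
    """
    return (samples * gain).astype(samples.dtype)


# Tiny synthetic inputs, used only for illustration.
image = np.zeros((2, 2, 3), dtype=np.uint8)
green = maximize_green(image)

samples = np.array([[0.5, -0.5], [1.0, -1.0]], dtype=np.float32)
quiet = scale_volume(samples)
```

Unlike the in-place assignment in the sample, `maximize_green` works on a copy, so the input array is left untouched.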

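The concepts that appear in the code can be pictured roughly as follows.
- input_ and output are warehouses that hold the data
- a stream is something like a distribution route that carries the warehouse's parcels
- a packet is a parcel
- frame and audio_frame are the contents inside a parcel's box
- once the warehouse is filled with boxed parcels, it can be played back as a video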

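With that in mind, the program flow is roughly as follows.
1. Select the video stream and the audio stream of the original video so that they can be read together.
2. Specify the output video file and create a video stream and an audio stream for it.
3. Take packets, each a small piece of the video's contents, out of the input streams one by one in a for loop.
4. A packet is either a video packet or an audio packet, so the processing branches on its type.
5. For a video packet, take the frame (the image data) out of the packet.
6. Convert the image data to a numpy array.
7. Set the G value of RGB to 255.
8. Create a frame from the numpy array.
9. Use the output's video stream to create packets for the output.
10. Put the created packets into the output.
11. If the packet taken from the input is audio, process the sound in the same way.
12. Close the input and output video files.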
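
To try the flow above without an input file, arrays of the same kind can be synthesized directly. The sketch below is an illustration under assumptions: the helper names `make_tone` and `make_image` are hypothetical, and it assumes the layouts used in the sample, that is an rgb24 image shaped (height, width, 3) in uint8 and planar float stereo audio ("fltp") shaped (channels, samples) in float32. Arrays like these are what the sample hands to `av.VideoFrame.from_ndarray` and `av.AudioFrame.from_ndarray`.

```python
import numpy as np

SAMPLE_RATE = 44100  # same sample rate as the sample above


def make_tone(freq_hz: float, n_samples: int, start: int = 0,
              channels: int = 2) -> np.ndarray:
    """Sine tone in planar layout: shape (channels, n_samples), float32.

    `start` is the index of the first sample, so consecutive chunks
    continue the same waveform.
    """
    t = (start + np.arange(n_samples)) / SAMPLE_RATE
    wave = np.sin(2 * np.pi * freq_hz * t).astype(np.float32)
    # Copy the same waveform into every channel.
    return np.tile(wave, (channels, 1))


def make_image(width: int, height: int, red: int) -> np.ndarray:
    """Solid-colour rgb24 image: shape (height, width, 3), uint8."""
    image = np.zeros((height, width, 3), dtype=np.uint8)
    image[:, :, 0] = red
    return image


# One chunk of a 440 Hz tone and one small reddish image.
tone = make_tone(440.0, 1024)
image = make_image(64, 48, red=200)
```

Note that the image array is indexed as (height, width, channel), so a 64x48 picture has shape (48, 64, 3).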

https://github.com/PyAV-Org/PyAV/issues/302
https://github.com/PyAV-Org/PyAV/issues/469

Other candidates

There is also moviepy, a library for doing video editing in Python. I would like to try it at some point.
