The Legendary Episode Where I Added Lyrics to a Sing-and-Play Video at Any Timing I Like, in Swift

September 16, 2021, 19:38

畑田 here.

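Using AVFoundation, I implemented a feature that adds a background image to a recorded video and shows each lyric only for as long as it is being tapped. (Amazing!)
Accurate, easy-to-follow information on video editing with AVFoundation is scarce, so I am recording the implementation here, with explanations of the classes that appear along the way.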

AVComposition

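AVComposition is a class for combining data from multiple sources into one, and it is a subclass of AVAsset.

An AVComposition is a collection of tracks, each representing one kind of media such as audio or video. (A track is represented by AVCompositionTrack, which inherits from AVAssetTrack.) So it really is a friend of AVAsset.

Each track is in turn made up of segments (AVCompositionTrackSegment), which describe the media data stored in a container by means of a URL, a track identifier, a time mapping, and so on.

As you might guess from it being a subclass of AVAsset, an AVComposition can be handed to AVPlayer (wrapped in an AVPlayerItem) and played back.

In addition, any AV data that can be represented as a file can be combined, regardless of its container format.

The time mapping governs how long the media lasts: it pairs a time range in the source with a time range in the destination (the composition). If the two ranges have the same duration, the media is stored as is; if they differ, it is stretched or compressed uniformly to fit, so it plays at a rate of source duration / target duration.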
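The relationship between a time mapping and playback speed can be sketched as follows. This is a minimal illustration, not code from the app: `playbackRate(of:)` and `describeSegments(of:)` are hypothetical helper names, and the 10-second and 5-second ranges are made-up values.

```swift
import AVFoundation

/// Playback rate implied by a time mapping: source duration / target duration.
/// (Hypothetical helper, not part of AVFoundation.)
func playbackRate(of mapping: CMTimeMapping) -> Double {
    return mapping.source.duration.seconds / mapping.target.duration.seconds
}

/// Prints where each non-empty segment of a composition track comes from
/// and the rate at which it plays. (Hypothetical helper.)
func describeSegments(of track: AVCompositionTrack) {
    for segment in track.segments where !segment.isEmpty {
        let name = segment.sourceURL?.lastPathComponent ?? "(no URL)"
        print(name, "track:", segment.sourceTrackID, "rate:", playbackRate(of: segment.timeMapping))
    }
}

// 10 seconds of source squeezed into 5 seconds of the composition
// plays at twice the natural speed.
let tenSeconds = CMTimeRange(start: .zero, duration: CMTime(seconds: 10, preferredTimescale: 600))
let fiveSeconds = CMTimeRange(start: .zero, duration: CMTime(seconds: 5, preferredTimescale: 600))
let rate = playbackRate(of: CMTimeMapping(source: tenSeconds, target: fiveSeconds))
print(rate)
```

Calling `describeSegments(of:)` on a composition track after inserting media is a quick way to check which file and time range each segment actually refers to.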

These tracks and segments are easy to access and to export. And by instantiating a new AVMutableComposition and using AVMutableCompositionTrack, you can also rebuild an existing composition or create one from scratch.

With AVMutableCompositionTrack and AVMutableComposition you can insert, remove, and scale media through a high-level API.
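As a rough sketch of the removal and scaling operations (insertion is shown in the main listing further down): `cutAndSlowDown` is a hypothetical helper name, and the choice of "twice as long" is only an example.

```swift
import AVFoundation

/// Removes `cut` from every track of the composition, then stretches `slow`
/// to twice its length so that it plays at half speed.
/// Note: `slow` is interpreted on the timeline as it is AFTER the cut,
/// because removing a range shifts everything that follows it.
/// (Hypothetical helper, not part of AVFoundation.)
func cutAndSlowDown(_ composition: AVMutableComposition, cut: CMTimeRange, slow: CMTimeRange) {
    composition.removeTimeRange(cut)
    let doubled = CMTimeMultiply(slow.duration, multiplier: 2)
    composition.scaleTimeRange(slow, toDuration: doubled)
}
```

Both calls act on all tracks of the composition at once, which keeps audio and video in sync; AVMutableCompositionTrack offers the same operations for a single track.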

For video editing and creating a new composition in particular, refer to the diagram below.

Here is a bit of source code.

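import AVFoundation

// note: this snippet uses `return`, so it is assumed to live inside a function;
// `someURL` and `urlToWriteTo` are placeholders for your own file URLs
// create an empty composition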
let mutableComposition = AVMutableComposition()
// add an empty video track to the composition
guard let compositionVideoTrack = mutableComposition.addMutableTrack(withMediaType: .video, preferredTrackID: kCMPersistentTrackID_Invalid) else {
   print("failed to add video track")
   return
}
// add an empty audio track to the composition
guard let compositionAudioTrack = mutableComposition.addMutableTrack(withMediaType: .audio, preferredTrackID: kCMPersistentTrackID_Invalid) else {
   print("failed to add audio track")
   return
}
// add a video track of asset to the composition video track
// note: tracks(withMediaType:) is deprecated as of iOS 16 in favor of the async loadTracks(withMediaType:)
let videoAsset = AVURLAsset(url: someURL)
guard let sourceVideoTrack = videoAsset.tracks(withMediaType: .video).first else { return }
do {
   try compositionVideoTrack.insertTimeRange(sourceVideoTrack.timeRange, of: sourceVideoTrack, at: .zero)
} catch {
   print(error)
}
// add an audio track of asset to the composition audio track
let audioAsset = AVURLAsset(url: someURL)
guard let sourceAudioTrack = audioAsset.tracks(withMediaType: .audio).first else { return }
do {
   try compositionAudioTrack.insertTimeRange(sourceAudioTrack.timeRange, of: sourceAudioTrack, at: .zero)
} catch {
   print(error)
}
// create export session
guard let session = AVAssetExportSession(asset: mutableComposition, presetName: AVAssetExportPresetPassthrough) else {
   print("failed to prepare session")
   return
}
// set up output file
session.outputURL = urlToWriteTo
session.outputFileType = .mp4
// export the composition
session.exportAsynchronously {
   switch session.status {
   case .completed:
       print("completed")
   case .failed:
       // avoid force-unwrapping: error can be nil
       print("export error: \(session.error?.localizedDescription ?? "unknown error")")
   default:
       break
   }
}

To insert only part of the source, replace the body of the do block that puts the asset's track onto the composition track with the following. Here only the first five seconds are inserted.

let firstFiveSeconds = CMTimeRange(start: .zero, end: CMTime(seconds: 5.0, preferredTimescale: sourceVideoTrack.naturalTimeScale))
try compositionVideoTrack.insertTimeRange(firstFiveSeconds, of: sourceVideoTrack, at: .zero)

AVMutableVideoCompositionInstruction, AVMutableVideoCompositionLayerInstruction

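To actually configure operations on a track (opacity, translation, cropping), you use these classes.

You also set the start time and duration of each operation at the same time.

You can also read the orientation in which the video was shot and rotate the video track to match. (See the layerInstruction.setTransform(sourceVideoTrack.preferredTransform, at: .zero) line in the source code below.)

The diagram above shows one AVMutableVideoCompositionLayerInstruction being used per AVMutableCompositionTrack.

However, you can also build multiple AVMutableVideoCompositionLayerInstructions for a single AVMutableCompositionTrack, so more complex edits can be done efficiently as well.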
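As one example of an operation with a start time and a duration, here is a minimal sketch of a fade-out at the end of a track. `makeFadeOutInstruction` and its `fadeSeconds` parameter are hypothetical names introduced for illustration, and the sketch assumes the track is at least as long as the fade.

```swift
import AVFoundation

/// Builds an instruction that shows `track` at full opacity and fades it
/// to transparent over the last `fadeSeconds` seconds.
/// Assumes `totalDuration` is at least `fadeSeconds` long.
/// (Hypothetical helper, not part of AVFoundation.)
func makeFadeOutInstruction(for track: AVCompositionTrack,
                            totalDuration: CMTime,
                            fadeSeconds: Double = 1.0) -> AVMutableVideoCompositionInstruction {
    let fade = CMTime(seconds: fadeSeconds, preferredTimescale: 600)
    let fadeRange = CMTimeRange(start: CMTimeSubtract(totalDuration, fade), duration: fade)

    let layerInstruction = AVMutableVideoCompositionLayerInstruction(assetTrack: track)
    layerInstruction.setOpacity(1.0, at: .zero)
    layerInstruction.setOpacityRamp(fromStartOpacity: 1.0, toEndOpacity: 0.0, timeRange: fadeRange)

    let instruction = AVMutableVideoCompositionInstruction()
    instruction.timeRange = CMTimeRange(start: .zero, duration: totalDuration)
    instruction.layerInstructions = [layerInstruction]
    return instruction
}
```

The returned instruction would go into the `instructions` array of an AVMutableVideoComposition, in the same place as the `instruction` in the listing further down.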

AVMutableVideoComposition

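The name resembles AVMutableComposition, but this class is where you configure additional information that applies to an AVMutableComposition.

Specifically, that means the frame duration, the render size, and the track operations introduced above (AVMutableVideoCompositionInstruction).

Passing this AVMutableVideoComposition together with the AVMutableComposition to AVPlayer (through its AVPlayerItem) or to AVAssetExportSession applies the additional information to the video.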
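For the playback side, a minimal sketch might look like this. `makePreviewPlayer` is a hypothetical helper name; the export side is covered by the listing that follows.

```swift
import AVFoundation

/// Wraps a composition in a player item, attaches the video composition
/// (render size, frame duration, instructions), and returns a player.
/// (Hypothetical helper, not part of AVFoundation.)
func makePreviewPlayer(composition: AVComposition,
                       videoComposition: AVVideoComposition?) -> AVPlayer {
    let item = AVPlayerItem(asset: composition)
    item.videoComposition = videoComposition
    return AVPlayer(playerItem: item)
}
```

One caveat, to the best of my knowledge: a video composition whose `animationTool` is set (as in the CALayer sections later on) is meant for export only, and AVPlayerItem does not accept it, so a preview like this should use a video composition without the animation tool.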

An explanation alone probably still leaves the usage unclear, so here is some more source code.

It hands the videoComposition instance to session, the export session created in the previous listing, before the export is run.

// create mutable video composition which mainly decides frame duration, render size and instruction
let videoComposition: AVMutableVideoComposition = AVMutableVideoComposition()
// set up frame duration and render size
videoComposition.frameDuration = CMTimeMake(value: 1, timescale: 30)
videoComposition.renderSize = sourceVideoTrack.naturalSize
// create mutable video composition instruction
let instruction: AVMutableVideoCompositionInstruction = AVMutableVideoCompositionInstruction()
// set time range in which this instruction is active
instruction.timeRange = CMTimeRangeMake(start: CMTime.zero, duration: mutableComposition.duration)
// create and set up layer instruction
let layerInstruction: AVMutableVideoCompositionLayerInstruction = AVMutableVideoCompositionLayerInstruction(assetTrack: compositionVideoTrack)
layerInstruction.setTransform(sourceVideoTrack.preferredTransform, at: .zero) // rotate video here
instruction.layerInstructions = [layerInstruction]
videoComposition.instructions = [instruction]
// set mutable video composition in export session
session.videoComposition = videoComposition
// export below...
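
One detail worth checking in the listing above: `renderSize` is set to `naturalSize`, which is the size before `preferredTransform` is applied. For footage shot in portrait, the transform typically rotates the frame by 90 degrees, so the width and height may need to be swapped. A minimal sketch, where `orientedSize` is a hypothetical helper name and the transform in the usage lines is a typical portrait value chosen for illustration:

```swift
import AVFoundation  // also brings in CoreGraphics

/// Size of the video as displayed, after the track's preferred transform.
/// (Hypothetical helper, not part of AVFoundation.)
func orientedSize(naturalSize: CGSize, preferredTransform: CGAffineTransform) -> CGSize {
    // Transform the full frame and take the size of its bounding box.
    let rect = CGRect(origin: .zero, size: naturalSize).applying(preferredTransform)
    return CGSize(width: abs(rect.width), height: abs(rect.height))
}

// A 1920x1080 frame rotated by 90 degrees is displayed as 1080x1920.
let portraitTransform = CGAffineTransform(a: 0, b: 1, c: -1, d: 0, tx: 1080, ty: 0)
let displayed = orientedSize(naturalSize: CGSize(width: 1920, height: 1080),
                             preferredTransform: portraitTransform)
print(displayed)
```

In the listing above this would replace the right-hand side of the `renderSize` assignment, for example `orientedSize(naturalSize: sourceVideoTrack.naturalSize, preferredTransform: sourceVideoTrack.preferredTransform)`.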

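Inserting a background

AVMutableVideoComposition also lets you overlay things on the video: animations, subtitles, an image on a video, or a video on a video.

This is done by building a layer hierarchy for the picture with CALayer.

Below, a background image (a still image) is placed on top of the video.

Write this in the listing above, before the video composition is passed to session.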

// get video size (`sourceVideoTrack` is the track from the listing above)
let videoSize: CGSize = sourceVideoTrack.naturalSize
// create parent layer
let parentLayer: CALayer = CALayer()
parentLayer.frame = CGRect(x: 0, y: 0, width: videoSize.width, height: videoSize.height)
// create video layer which will be associated with composition video track
let videoLayer: CALayer = CALayer()
videoLayer.frame = CGRect(x: 0, y: 0, width: videoSize.width, height: videoSize.height)
// add video layer to parent layer and attach video contents to video layer
parentLayer.addSublayer(videoLayer)
videoComposition.animationTool = AVVideoCompositionCoreAnimationTool(postProcessingAsVideoLayer: videoLayer, in: parentLayer)
// if original video selected, `backgroundImage` is nil and background layer shouldn't be created or added to parent layer
if let backgroundImage = backgroundImage {
   // create background layer
   let backgroundLayer: CALayer = CALayer()
   backgroundLayer.frame = CGRect(x: 0, y: 0, width: videoSize.width, height: videoSize.height)
   backgroundLayer.opacity = 1.0
   backgroundLayer.masksToBounds = true
   backgroundLayer.backgroundColor = UIColor.clear.cgColor
   backgroundLayer.contentsGravity = CALayerContentsGravity.resizeAspectFill
   // add contents to background layer
   backgroundLayer.contents = backgroundImage.cgImage
   // add background layer over video layer
   parentLayer.addSublayer(backgroundLayer)
   // make video layer transparent to lighten the video; the opaque background covers it anyway
   videoLayer.opacity = 0
}

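Inserting custom subtitles

The selling point of the product under development is its lyric feature! So I will record how the lyrics are inserted as well.

In the app, you set up the lyrics and then, while listening to the music, each lyric is inserted only for the time during which you were touching it.
This is the actual product code.

Calling the mergeMovie() method saves the video with lyrics to the app's tmp directory and to the photo library.

The lyricData property here is carried over from the previous screen. It is an array whose element type is Dictionary<String, Any>, but in practice each element has the form ["id": Int, "lyric": String, "start_at": CMTime, "end_at": CMTime].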
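
Because each element is a Dictionary<String, Any>, the code below has to force-cast every value. As a possible alternative, here is a minimal sketch of a typed wrapper. `LyricEntry` is a hypothetical type that does not exist in the product code; only the four dictionary keys come from the description above.

```swift
import CoreMedia

/// Typed view of one lyricData element. (Hypothetical type for illustration.)
struct LyricEntry {
    let id: Int
    let lyric: String
    let startAt: CMTime  // time at which the lyric appears
    let endAt: CMTime    // time at which the lyric disappears

    /// Returns nil when a key is missing or a value has an unexpected type,
    /// instead of crashing like `as!` would.
    init?(_ dictionary: [String: Any]) {
        guard let id = dictionary["id"] as? Int,
              let lyric = dictionary["lyric"] as? String,
              let startAt = dictionary["start_at"] as? CMTime,
              let endAt = dictionary["end_at"] as? CMTime else { return nil }
        self.id = id
        self.lyric = lyric
        self.startAt = startAt
        self.endAt = endAt
    }
}
```

With this, something like `lyricData.compactMap(LyricEntry.init)` would yield only the well-formed entries, and the rendering code could read `entry.startAt` directly.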

The getFromTmp(file:) method is one I defined myself, so the code will not run as a straight copy and paste without it.
Source code of the getFromTmp(file:) method

import Foundation
extension FileManager {
   class func getFromTmp(file name: String) -> URL {
       let tmpDirURL = self.default.temporaryDirectory
       let url = tmpDirURL.appendingPathComponent(name)
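       if self.default.fileExists(atPath: url.path) {
           // remove any stale file so the exporter can write to this URL
           do {
               try self.default.removeItem(at: url)
               print("file \(name) already existed and was deleted")
           } catch {
               print("failed to delete existing file \(name):", error)
           }
       }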
       return url
   }
}

// the methods below assume `import AVFoundation`, `import Photos`, and `import UIKit` at the top of the file
private func mergeMovie() {
   // load the source video asset
   let asset = AVURLAsset(url: movieURL)
   print(movieURL.path)
   // extract video track from asset
   guard let videoTrack = asset.tracks(withMediaType: .video).first else { return print("video track not found") }
   // extract audio track from asset
   guard let audioTrack = asset.tracks(withMediaType: .audio).first else { return print("audio track not found") }
   // create empty base composition
   let mutableComposition: AVMutableComposition = AVMutableComposition()
   // create empty composition video and audio tracks
   let compositionVideoTrack: AVMutableCompositionTrack! = mutableComposition.addMutableTrack(withMediaType: .video, preferredTrackID: kCMPersistentTrackID_Invalid)
   let compositionAudioTrack: AVMutableCompositionTrack! = mutableComposition.addMutableTrack(withMediaType: .audio, preferredTrackID: kCMPersistentTrackID_Invalid)
   // insert source video track to the composition video track
   do {
       try compositionVideoTrack.insertTimeRange(CMTimeRangeMake(start: CMTime.zero, duration: asset.duration), of: videoTrack, at: CMTime.zero)
   } catch {
       print("insert video track error:", error)
   }
   // insert source audio track to the composition audio track
   do {
       try compositionAudioTrack.insertTimeRange(CMTimeRangeMake(start: CMTime.zero, duration: asset.duration), of: audioTrack, at: CMTime.zero)
   } catch {
       print("insert audio track error:", error)
   }
   // set up rotation: keep the transform that encodes the orientation the video was shot in
   let preferredTransform = videoTrack.preferredTransform
   //        compositionVideoTrack.preferredTransform = preferredTransform // not effective
   // create export session with base composition
   _assetExportSession = AVAssetExportSession(asset: mutableComposition, presetName: AVAssetExportPresetMediumQuality)
   // create mutable video composition which mainly decides duration, render size and instruction
   let videoComposition: AVMutableVideoComposition = AVMutableVideoComposition()
   videoComposition.renderSize = videoTrack.naturalSize
   videoComposition.frameDuration = CMTimeMake(value: 1, timescale: 30)
   let instruction: AVMutableVideoCompositionInstruction = AVMutableVideoCompositionInstruction()
   instruction.timeRange = CMTimeRangeMake(start: CMTime.zero, duration: mutableComposition.duration)
   let layerInstruction: AVMutableVideoCompositionLayerInstruction = AVMutableVideoCompositionLayerInstruction(assetTrack: compositionVideoTrack)
   layerInstruction.setTransform(preferredTransform, at: .zero) // required!
   instruction.layerInstructions = [layerInstruction]
   videoComposition.instructions = [instruction]
   let videoSize: CGSize = mutableComposition.naturalSize
   // create lyric layer
   let lyricLayer = self.makeLyricLayer(for: mutableComposition)
   // create parent layer
   let parentLayer: CALayer = CALayer()
   parentLayer.frame = CGRect(x: 0, y: 0, width: videoSize.width, height: videoSize.height)
   let videoLayer: CALayer = CALayer()
   videoLayer.frame = CGRect(x: 0, y: 0, width: videoSize.width, height: videoSize.height)
   parentLayer.addSublayer(videoLayer)
   parentLayer.addSublayer(lyricLayer)
   videoComposition.animationTool = AVVideoCompositionCoreAnimationTool(postProcessingAsVideoLayer: videoLayer, in: parentLayer)
   // set mutable video composition in export session
   _assetExportSession?.videoComposition = videoComposition
   // set up export session
   exportURL = FileManager.getFromTmp(file: "completion_movie.mov")
   _assetExportSession?.outputFileType = AVFileType.mov
   _assetExportSession?.outputURL = exportURL
   _assetExportSession?.shouldOptimizeForNetworkUse = true
   // export
   _assetExportSession?.exportAsynchronously(completionHandler: {() -> Void in
       if self._assetExportSession?.status == AVAssetExportSession.Status.failed {
           print("failed:", self._assetExportSession?.error ?? "error")
       }
       if self._assetExportSession?.status == AVAssetExportSession.Status.completed {
           // save to photo library
           PHPhotoLibrary.shared().performChanges({
               PHAssetChangeRequest.creationRequestForAssetFromVideo(atFileURL: self.exportURL)
           })
           print("saved to \(self._assetExportSession!.outputURL!.path)")
       }
   })
}
private func makeLyricLayer(for mutableComposition: AVMutableComposition) -> CALayer {
   let videoSize = mutableComposition.naturalSize
   // create parent layer
   let lyricLayer: CALayer = CALayer()
   lyricLayer.frame = CGRect(x: 0, y: 0, width: videoSize.width, height: videoSize.height)
   lyricLayer.opacity = 1.0
   lyricLayer.backgroundColor = UIColor.clear.cgColor
   lyricLayer.masksToBounds = true
   // prepare animation
   let frameAnimation: CAKeyframeAnimation = CAKeyframeAnimation(keyPath: "contents")
   frameAnimation.beginTime = AVCoreAnimationBeginTimeAtZero // attention! apple recommends
   frameAnimation.duration = CMTimeGetSeconds(mutableComposition.duration)
   frameAnimation.repeatCount = 1
   frameAnimation.autoreverses = false
   frameAnimation.isRemovedOnCompletion = false // apple recommends
   frameAnimation.fillMode = CAMediaTimingFillMode.forwards
   frameAnimation.calculationMode = CAAnimationCalculationMode.discrete
   // set up key times
   var imageKeyTimes: Array<NSNumber> = []
   let frameCount = Int(frameAnimation.duration * 30) // duration [s] * frame rate [/s] = total frame count
   for i in 0 ... frameCount {
       imageKeyTimes.append((Double(i)/Double(frameCount)) as NSNumber)
   }
   frameAnimation.keyTimes = imageKeyTimes
   // set up values
   var imageValues: Array<CGImage> = []
   for currentFrame in 0 ... frameCount {
       // begin rendering setting
       UIGraphicsBeginImageContext(videoSize)
       // get position of current frame in total video length
       let ratio = Double(currentFrame) / Double(frameCount)
       for lyricDatum in lyricData {
           let startTime = lyricDatum["start_at"] as! CMTime // time at which the lyric appears
           let endTime = lyricDatum["end_at"] as! CMTime // time at which lyric disappears
           // if current frame is between start time and end time, render lyric label
           if CMTimeGetSeconds(startTime) / CMTimeGetSeconds(mutableComposition.duration) < ratio && ratio < CMTimeGetSeconds(endTime) / CMTimeGetSeconds(mutableComposition.duration) {
               let lyric = lyricDatum["lyric"] as! String
               let attributedString = NSMutableAttributedString(string: lyric, attributes: [NSAttributedString.Key.font: UIFont.systemFont(ofSize: 36, weight: UIFont.Weight(rawValue: 1)), NSAttributedString.Key.foregroundColor: UIColor.white])
               let label = UILabel()
               label.frame.size = videoSize
               label.textAlignment = .center
               label.numberOfLines = 2
               label.clipsToBounds = true
               label.allowsDefaultTighteningForTruncation = true
               label.attributedText = attributedString
               label.drawText(in: CGRect(x: lyricLayer.bounds.origin.x, y: lyricLayer.bounds.maxY * 2 / 3, width: videoSize.width, height: videoSize.height / 3))
           }
           // else render nothing
       }
       let lyricImage = UIGraphicsGetImageFromCurrentImageContext()
       UIGraphicsEndImageContext()
       guard let cgLyricImage = lyricImage?.cgImage else { continue }
       imageValues.append(cgLyricImage)
   }
   frameAnimation.values = imageValues
   // add animation to lyric layer
   lyricLayer.add(frameAnimation, forKey: nil)
   return lyricLayer
}

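This code turned out to be too heavy, since it renders one image for every frame. Smoothly moving animation is a different story, but if all you need is subtitles, the code below is the better choice.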


private func makeLyricLayer(for mutableComposition: AVMutableComposition) -> CALayer {
   let videoSize = mutableComposition.naturalSize
   // create parent layer
   let lyricLayer: CALayer = CALayer()
   lyricLayer.frame = CGRect(x: 0, y: 0, width: videoSize.width, height: videoSize.height)
   lyricLayer.opacity = 1.0
   lyricLayer.backgroundColor = UIColor.clear.cgColor
   lyricLayer.masksToBounds = true
   // prepare animation
   let frameAnimation: CAKeyframeAnimation = CAKeyframeAnimation(keyPath: "contents")
   frameAnimation.beginTime = AVCoreAnimationBeginTimeAtZero // attention! apple recommends
   frameAnimation.duration = CMTimeGetSeconds(mutableComposition.duration)
   frameAnimation.repeatCount = 1
   frameAnimation.autoreverses = false
   frameAnimation.isRemovedOnCompletion = false // apple recommends
   frameAnimation.fillMode = CAMediaTimingFillMode.forwards
   frameAnimation.calculationMode = CAAnimationCalculationMode.discrete
   // set up key times
   var imageKeyTimes: Array<NSNumber> = []
   // set up values
   var imageValues: Array<CGImage> = []
   // create transparent image
   UIGraphicsBeginImageContext(videoSize)
   let emptyImage = UIGraphicsGetImageFromCurrentImageContext()
   // end the context before any early return so it is not left open
   UIGraphicsEndImageContext()
   guard let cgEmptyImage = emptyImage?.cgImage else { return lyricLayer }
   lyricData.sort() { d0, d1 in
       let startTime0 = d0["start_at"] as! CMTime
       let startTime1 = d1["start_at"] as! CMTime
       return startTime0 < startTime1
   }
   for (index, lyricDatum) in lyricData.enumerated() {
       let startTime = lyricDatum["start_at"] as! CMTime // time at which the lyric appears
       let endTime = lyricDatum["end_at"] as! CMTime // time at which lyric disappears
       if startTime == endTime { continue }
       if index == 0, startTime != .zero {
           imageKeyTimes.append(0)
           imageValues.append(cgEmptyImage)
       }
       // key times at which the lyric appears and disappears, normalized to 0...1
       imageKeyTimes.append(NSNumber(value: CMTimeGetSeconds(startTime) / CMTimeGetSeconds(mutableComposition.duration)))
       imageKeyTimes.append(NSNumber(value: CMTimeGetSeconds(endTime) / CMTimeGetSeconds(mutableComposition.duration)))
       // begin rendering
       UIGraphicsBeginImageContext(videoSize)
       let lyric = lyricDatum["lyric"] as! String
       let attributedString = NSMutableAttributedString(string: lyric, attributes: [NSAttributedString.Key.font: UIFont.systemFont(ofSize: 24), NSAttributedString.Key.foregroundColor: UIColor.white])
       let label = UILabel()
       label.frame.size = videoSize
       label.textAlignment = .center
       label.numberOfLines = 2
       label.clipsToBounds = true
       label.allowsDefaultTighteningForTruncation = true
       label.attributedText = attributedString
       label.drawText(in: CGRect(x: lyricLayer.bounds.origin.x, y: lyricLayer.bounds.maxY * 2 / 3, width: videoSize.width, height: videoSize.height / 3))
       guard let cgLyricImage = UIGraphicsGetImageFromCurrentImageContext()?.cgImage else {
           UIGraphicsEndImageContext()
           imageValues.append(cgEmptyImage)
           imageValues.append(cgEmptyImage)
           continue
       }
       // end rendering
       UIGraphicsEndImageContext()
       imageValues.append(cgLyricImage)
       imageValues.append(cgEmptyImage)
   }
   // set the last key time and the last value
   imageKeyTimes.append(1)
   imageValues.append(cgEmptyImage)
   // set key times
   frameAnimation.keyTimes = imageKeyTimes
   print(imageKeyTimes)
   // set values
   frameAnimation.values = imageValues
   print(imageValues)
   // add animation to lyric layer
   lyricLayer.add(frameAnimation, forKey: nil)
   return lyricLayer
}
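
The key-time bookkeeping in the lighter version can be checked in isolation, without rendering any images. The following is a minimal sketch of the same idea in plain Swift: `LyricKeyframe` and `makeKeyframes` are hypothetical names, times are plain seconds rather than CMTime, and the lyric intervals are assumed not to overlap.

```swift
import Foundation

/// One step of a discrete "contents" animation: from `keyTime` on, show the
/// lyric at `lyricIndex`, or the transparent image when it is nil.
/// (Hypothetical type for illustration.)
struct LyricKeyframe: Equatable {
    let keyTime: Double   // normalized to 0...1
    let lyricIndex: Int?  // index into the original `intervals` array
}

/// Builds two keyframes per lyric (appear, disappear) instead of one per frame.
/// Zero-length intervals are skipped, as in the listing above.
func makeKeyframes(intervals: [(start: Double, end: Double)], duration: Double) -> [LyricKeyframe] {
    guard duration > 0 else { return [] }
    let sorted = intervals.enumerated()
        .filter { $0.element.start < $0.element.end }
        .sorted { $0.element.start < $1.element.start }
    var frames: [LyricKeyframe] = []
    // Start with the transparent image unless the first lyric begins at time zero.
    if sorted.first?.element.start != 0 {
        frames.append(LyricKeyframe(keyTime: 0, lyricIndex: nil))
    }
    for (index, interval) in sorted {
        frames.append(LyricKeyframe(keyTime: interval.start / duration, lyricIndex: index))
        frames.append(LyricKeyframe(keyTime: min(interval.end / duration, 1), lyricIndex: nil))
    }
    return frames
}

// Two lyrics in a 10-second video, given out of order on purpose.
let frames = makeKeyframes(intervals: [(start: 6, end: 8), (start: 2, end: 4)], duration: 10)
print(frames.map { $0.keyTime })
```

A 10-second video at 30 frames per second needs about 300 images in the per-frame version, while this approach needs one image per lyric plus a single transparent one. As far as I can tell from Apple's documentation for `keyTimes`, discrete mode expects the array to start at 0.0, end at 1.0, and contain one more entry than `values`, so the key times here would be followed by a final 1.0 when they are handed to CAKeyframeAnimation.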