Installing the Julius speech recognition engine on macOS Monterey via Homebrew

Hello.
This article summarizes how to install the speech recognition engine Julius via Homebrew. I already manage most of my other packages with Homebrew, so I decided to try installing Julius the same way.

Other options are to download the source code from GitHub Releases and install from it, or to clone the repository and install from that.

Environment (as of 2022/04/10)

  • macOS Monterey 12.2 / Intel Core i7

  • Homebrew 3.4.5

  • Julius v4.6

  • Julius Dictation Kit

You can check on Homebrew Formulae whether Julius can be installed on your OS. Note that Julius by itself does not include an acoustic model or a language model, so you also need to install the Julius Dictation Kit.

Install julius, julius-dictation-kit

julius-dictation-kit is not included in Homebrew itself; it is installed from Homebrew-nlp, a repository (tap) dedicated to natural language processing.
First, add Homebrew-nlp.

% brew tap uetchy/nlp

Next, install julius and julius-dictation-kit via Homebrew.

% brew install julius julius-dictation-kit

This completes the installation.
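
To confirm that the install step worked, one option is to check that the julius command is on PATH and to ask Homebrew for the installed versions. The following is a minimal sketch, not a required step; has_command is a hypothetical helper name introduced here for illustration.

```shell
#!/bin/sh
# has_command is a hypothetical helper: succeeds if the given command is on PATH.
has_command() {
  command -v "$1" >/dev/null 2>&1
}

if has_command julius; then
  echo "julius found at: $(command -v julius)"
else
  echo "julius is not on PATH; check the output of brew install" >&2
fi

if has_command brew; then
  # Show the installed version of each formula (ignore a non-zero exit status).
  brew list --versions julius julius-dictation-kit || true
fi
```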

Trying real-time speech recognition

Let's run speech recognition with the PC's microphone as input.
Julius is run with the models from the installed julius-dictation-kit.

The path of a package installed with Homebrew can be obtained with "brew --prefix julius-dictation-kit".

% brew --prefix julius-dictation-kit
/usr/local/opt/julius-dictation-kit
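
Because the jconf paths are built from the output of brew --prefix, it can help to confirm that the files exist before launching Julius. The following is a minimal sketch under that assumption; check_kit_files is a hypothetical helper name, and the two file names are the ones passed with -C in this article.

```shell
#!/bin/sh
# check_kit_files is a hypothetical helper, not part of Julius or Homebrew.
# It returns 0 only if both jconf files exist under <prefix>/share.
check_kit_files() {
  kit_prefix="$1"
  missing=0
  for conf in main.jconf am-gmm.jconf; do
    if [ ! -f "$kit_prefix/share/$conf" ]; then
      echo "missing: $kit_prefix/share/$conf" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Only query Homebrew when it is actually installed.
if command -v brew >/dev/null 2>&1; then
  KIT_PREFIX="$(brew --prefix julius-dictation-kit 2>/dev/null || true)"
  if check_kit_files "$KIT_PREFIX"; then
    echo "jconf files found under $KIT_PREFIX/share"
  fi
fi
```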

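This time I used the GMM-based model.
The thing to watch out for when running it is to add the -nostrip option. In my environment, without this option Julius could not obtain audio input and failed to start, as the first run below shows.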

% julius \
  -C `brew --prefix julius-dictation-kit`/share/main.jconf \
  -C `brew --prefix julius-dictation-kit`/share/am-gmm.jconf

// omitted //

STAT: ###### initialize input device
Stat: adin_darwin: sample rate = 16000
Error: adin_darwin: cannot set InputUnit's EnableIO(Input)
ERROR: m_adin: failed to ready input device

% julius -nostrip \
  -C `brew --prefix julius-dictation-kit`/share/main.jconf \
  -C `brew --prefix julius-dictation-kit`/share/am-gmm.jconf

// omitted //
STAT: All init successfully done

STAT: ###### initialize input device
----------------------- System Information begin ---------------------
JuliusLib rev.4.6 (fast)

Engine specification:
 -  Base setup   : fast
 -  Supported LM : DFA, N-gram, Word
 -  Extension    : LibSndFile
 -  Compiled by  : gcc -g -O2 -fPIC
Library configuration: version 4.6
 - Audio input
    primary A/D-in driver   : coreaudio (MacOSX CoreAudio)
    available drivers       :
    wavefile formats        : various formats by libsndfile ver.1
    max. length of an input : 320000 samples, 150 words
 - Language Model
    class N-gram support    : yes
    MBR weight support      : yes
    word id unit            : short (2 bytes)
 - Acoustic Model
    multi-path treatment    : autodetect
 - External library
    file decompression by   : zlib library
 - Process hangling
    fork on adinnet input   : no
 - built-in SIMD instruction set for DNN
    SSE AVX FMA
    FMA is available maximum on this cpu, use it
 - built-in CUDA support: no


------------------------------------------------------------
Configuration of Modules

 Number of defined modules: AM=1, LM=1, SR=1

 Acoustic Model (with input parameter spec.):
 - AM00 "_default"
	hmmfilename=/usr/local/opt/julius-dictation-kit/share/model/phone_m/jnas-tri-3k16-gid.binhmm
	hmmmapfilename=/usr/local/opt/julius-dictation-kit/share/model/phone_m/logicalTri

 Language Model:
 - LM00 "_default"
	vocabulary filename=/usr/local/opt/julius-dictation-kit/share/model/lang_m/bccwj.60k.htkdic
	n-gram  filename=/usr/local/opt/julius-dictation-kit/share/model/lang_m/bccwj.60k.bingram (binary format)

 Recognizer:
 - SR00 "_default" (AM00, LM00)

------------------------------------------------------------
Speech Analysis Module(s)

[MFCC01]  for [AM00 _default]

 Acoustic analysis condition:
	       parameter = MFCC_E_D_N_Z (25 dim. from 12 cepstrum + energy, abs energy supressed with CMN)
	sample frequency = 16000 Hz
	   sample period =  625  (1 = 100ns)
	     window size =  400 samples (25.0 ms)
	     frame shift =  160 samples (10.0 ms)
	    pre-emphasis = 0.97
	    # filterbank = 24
	   cepst. lifter = 22
	      raw energy = False
	energy normalize = False
	    delta window = 2 frames (20.0 ms) around
	     hi freq cut = OFF
	     lo freq cut = OFF
	 zero mean frame = ON
	       use power = OFF
	             CVN = OFF
	            VTLN = OFF

    spectral subtraction = off

 cep. mean normalization = yes, real-time MAP-CMN, updating initial mean with last 500 input frames
  initial mean from file = N/A
   beginning data weight = 100.00
 cep. var. normalization = no

	 base setup from = Julius defaults

------------------------------------------------------------
Acoustic Model(s)

[AM00 "_default"]

 HMM Info:
    8443 models, 3090 states, 3090 mpdfs, 49440 Gaussians are defined
	      model type = context dependency handling ON
      training parameter = MFCC_E_N_D_Z
	   vector length = 25
	number of stream = 1
	     stream info = [0-24]
	cov. matrix type = DIAGC
	   duration type = NULLD
	max mixture size = 16 Gaussians
     max length of model = 5 states
     logical base phones = 43
       model skip trans. = not exist, no multi-path handling

 AM Parameters:
        Gaussian pruning = none (full computation)  (-gprune)
    short pause HMM name = "sp" specified, "sp" applied (physical)  (-sp)
  cross-word CD on pass1 = handle by approx. (use 3-best of same LC)

------------------------------------------------------------
Language Model(s)

[LM00 "_default"] type=n-gram

 N-gram info:
	            spec = 3-gram, backward (right-to-left)
	        OOV word = <unk>(id=2)
	    wordset size = 59084
	  1-gram entries =      59084  (  0.5 MB)
	  2-gram entries =    2476660  ( 27.7 MB) (64% are valid contexts)
	  3-gram entries =    7894442  ( 52.8 MB)
	LR 2-gram entries=    2476660  (  9.7 MB)
	           pass1 = given additional forward 2-gram

 Vocabulary Info:
        vocabulary size  = 64274 words, 366102 models
        average word len = 5.7 models, 17.1 states
       maximum state num = 54 nodes per word
       transparent words = not exist
       words under class = 9444 words

 Parameters:
	(-silhead)head sil word = 0: "<s> @0.000000 [] silB(silB)"
	(-siltail)tail sil word = 1: "</s> @0.000000 [。] silE(silE)"

------------------------------------------------------------
Recognizer(s)

[SR00 "_default"]  AM00 "_default"  +  LM00 "_default"

 Lexicon tree:
	 total node num = 415714
	  root node num =    632
	(148 hi-freq. words are separated from tree lexicon)
	  leaf node num =  64274
	 fact. node num =  64274

 Inter-word N-gram cache: 
	root node to be cached = 195 / 631 (isolated only)
	word ends to be cached = 59084 (all)
	  max. allocation size = 46MB
	(-lmp)  pass1 LM weight = 8.0  ins. penalty = -2.0
	(-lmp2) pass2 LM weight = 8.0  ins. penalty = -2.0
	(-transp)trans. penalty = +0.0 per word
	(-cmalpha)CM alpha coef = 0.050000

 Search parameters: 
	    multi-path handling = no
	(-b) trellis beam width = 1500
	(-bs)score pruning thres= disabled
	(-n)search candidate num= 30
	(-s)  search stack size = 500
	(-m)    search overflow = after 10000 hypothesis poped
	        2nd pass method = searching sentence, generating N-best
	(-b2)  pass2 beam width = 100
	(-lookuprange)lookup range= 5  (tm-5 <= t <tm+5)
	(-sb)2nd scan beamthres = 80.0 (in logscore)
	(-n)        search till = 30 candidates found
	(-output)    and output = 1 candidates out of above
	 IWCD handling:
	   1st pass: approximation (use 3-best of same LC)
	   2nd pass: loose (apply when hypo. is popped and scanned)
	 factoring score: 1-gram prob. (statically assigned beforehand)
	short pause segmentation = off
	fall back on search fail = off, returns search failure

------------------------------------------------------------
Decoding algorithm:

	1st pass input processing = real time, on-the-fly
	1st pass method = 1-best approx. generating indexed trellis
	output word confidence measure based on search-time scores

------------------------------------------------------------
FrontEnd:

 Input stream:
	             input type = waveform
	           input source = microphone
	    device API          = default
	          sampling freq. = 16000 Hz
	         threaded A/D-in = supported, on
	   zero frames stripping = off
	         silence cutting = on
	             level thres = 2000 / 32767
	         zerocross thres = 60 / sec.
	             head margin = 300 msec.
	             tail margin = 400 msec.
	              chunk size = 1000 samples
	       FVAD switch value = -1 (disabled)
	    long-term DC removal = off
	    level scaling factor = 1.00 (disabled)
	      reject short input = < 800 msec
	      reject  long input = off

----------------------- System Information end -----------------------

Notice for feature extraction (01),
	*************************************************************
	* Cepstral mean normalization for real-time decoding:       *
	* NOTICE: The first input may not be recognized, since      *
	*         no initial mean is available on startup.          *
	*************************************************************

------
### read waveform input
Stat: adin_portaudio: audio cycle buffer length = 256000 bytes
Stat: adin_portaudio: sound capture devices:
Stat: adin_portaudio: use default device
Stat: adin_portaudio: [Core Audio: MacBook Proのマイク]
Stat: adin_portaudio: (you can specify device by "PORTAUDIO_DEV_NUM=number"
Stat: adin_portaudio: try to set default low latency from portaudio: 0 msec
Stat: adin_portaudio: latency was set to 228.500000 msec
STAT: AD-in thread created
<<< please speak >>>

"please speak" is displayed, and Julius is now ready to recognize speech.
As a test, I had it recognize "こんにちは" ("hello").

pass1_best:  こんにちは 。
pass1_best_wordseq: <s> こんにちは+感動詞 </s>
pass1_best_phonemeseq: silB | k o N n i ch i w a | silE
pass1_best_score: -2789.926758
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 11897 generated, 1183 pushed, 197 nodes popped in 121
sentence1:  こんにちは 。
wseq1: <s> こんにちは+感動詞 </s>
phseq1: silB | k o N n i ch i w a | silE
cmscore1: 0.709 0.651 1.000
score1: -2804.239258
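
If only the final recognized text is of interest, the lines starting with "sentence1:" can be filtered out of this output. The following is a minimal sketch, assuming the output format shown above; extract_sentences is a hypothetical helper name, and output piped from Julius may arrive with some delay because of buffering.

```shell
#!/bin/sh
# extract_sentences is a hypothetical helper used only in this sketch.
# It reads Julius output on stdin and prints the text of each "sentenceN:" line.
extract_sentences() {
  sed -n 's/^sentence[0-9]*: *//p'
}

# Possible usage (not executed here):
#   julius -nostrip \
#     -C "$(brew --prefix julius-dictation-kit)/share/main.jconf" \
#     -C "$(brew --prefix julius-dictation-kit)/share/am-gmm.jconf" \
#     | extract_sentences
```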

In closing

This article summarized how to install Julius via Homebrew.
Next, I would like to get recognition working offline on arbitrary audio files.
That's all.
