NLP | Trying out custom named entity recognition rules with GINZA v5

October 20, 2021, 00:29

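While reading "BERT/GPT-3/DALL-E 自然言語処理・画像処理・音声処理 人工知能プログラミング実践入門", I learned about GINZA, a Japanese natural language processing library provided by Megagon Labs, Recruit's AI research institute.
Note: I include a link to the book, but this note does not follow the book's content. It only covers what I tried on my own with the latest version.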

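What caught my interest

BERT was about the only name I knew among natural language processing libraries, so GINZA caught my interest. However, the book uses GINZA 4.0.5, which is a slightly older version. v5 appears to have been released on 2021/08/26, and as of 2021/10/01 the latest version was 5.0.2 (2021/09/06).

Trying it out, but...

Since I was going to try it anyway, I wanted to use the latest version, but there appear to be several breaking changes between v4 and v5.

I ran into errors when trying it on Google Colaboratory, so I want to leave a record of how I investigated and got it working, even though I do not fully understand everything.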

Loading the analysis model package

The first error occurred when loading ja-ginza, the analysis model package.

!pip install ginza  # the book pins a specific version here
import spacy
nlp = spacy.load('ja_ginza')

The code above is supposed to load the model, but it failed with a "can't find model" error (the same error you get when you do not restart the runtime).

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-2-935bffa734f0> in <module>()
     1 import spacy
----> 2 nlp = spacy.load('ja_ginza')

1 frames
/usr/local/lib/python3.7/dist-packages/spacy/util.py in load_model(name, vocab, disable, exclude, config)
   352     if name in OLD_MODEL_SHORTCUTS:
   353         raise IOError(Errors.E941.format(name=name, full=OLD_MODEL_SHORTCUTS[name]))
--> 354     raise IOError(Errors.E050.format(name=name))
   355 
   356 

OSError: [E050] Can't find model 'ja_ginza'. It doesn't seem to be a Python package or a valid path to a data directory.

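So I checked the official site, which states that the model package also has to be installed with pip. It also says that ja-ginza-electra, an analysis model package with improved accuracy, has been released, so I decided to try that one instead.

Incidentally, the official site recommends 16 GB or more of memory when using ja-ginza-electra, but it ran on the free tier of Google Colaboratory (which reportedly has 12 GB of memory).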

!pip install -U ginza ja-ginza-electra
!pip freeze

While I am at it, I also check the versions of the installed packages.

ginza==5.0.2
ja-ginza-electra==5.0.0
spacy==3.1.3
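
As an alternative to scanning the full `pip freeze` output, the same check can be sketched with the standard library. This is a minimal sketch, assuming Python 3.8 or later (where `importlib.metadata` is available); the helper name `installed_versions` is made up for illustration and is not part of GINZA or spaCy.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Return {distribution name: version string, or None if not installed}."""
    versions: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


# The three distributions this note cares about.
report = installed_versions(["ginza", "ja-ginza-electra", "spacy"])
for name, version in report.items():
    print(f"{name}=={version}" if version else f"{name}: not installed")
```

A package reported as not installed here would also explain an E050 error from `spacy.load`, since the model is looked up as an installed Python package.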

After restarting the runtime, running the following loaded the model successfully.

import spacy
nlp = spacy.load('ja_ginza_electra')

Adding named entity recognition rules

Rules can apparently be added with EntityRuler, but its usage has changed in the latest version, so I got an error.

# How this is written for older versions (I have not tested that it works)
from spacy.pipeline import EntityRuler
ruler = EntityRuler(nlp)
ruler.add_patterns([
    {'label': 'Person', 'pattern': 'サツキ'},
    {'label': 'Person', 'pattern': 'メイ'}
])
nlp.add_pipe(ruler)

# The resulting error is shown below
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-c2915b8de193> in <module>()
     5                     {'label': 'Person', 'pattern': 'メイ'}
     6 ])
----> 7 nlp.add_pipe(ruler)

/usr/local/lib/python3.7/dist-packages/spacy/language.py in add_pipe(self, factory_name, name, before, after, first, last, source, config, raw_config, validate)
   755             bad_val = repr(factory_name)
   756             err = Errors.E966.format(component=bad_val, name=name)
--> 757             raise ValueError(err)
   758         name = name if name is not None else factory_name
   759         if name in self.component_names:

ValueError: [E966] `nlp.add_pipe` now takes the string name of the registered component factory, not a callable component. Expected string, but got <spacy.pipeline.entityruler.EntityRuler object at 0x7f7c00bd5190> (name: 'None').

- If you created your component with `nlp.create_pipe('name')`: remove nlp.create_pipe and call `nlp.add_pipe('name')` instead.

- If you passed in a component like `TextCategorizer()`: call `nlp.add_pipe` with the string name instead, e.g. `nlp.add_pipe('textcat')`.

- If you're using a custom component: Add the decorator `@Language.component` (for function components) or `@Language.factory` (for class components / factories) to your custom component and assign it a name, e.g. `@Language.component('your_name')`. You can then run `nlp.add_pipe('your_name')` to add it to the pipeline.

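It appears that the argument to spaCy's add_pipe has changed from a component object to a string (the string name of the registered component factory, according to the error message). There also seem to be two approaches: using the EntityRuler class directly, or using the add_pipe method of the Language class.

Note: my understanding is that the nlp object is an instance of the Language class (Language class is created when you call spacy.load)

One more thing: even after fixing the above, the rule was not actually reflected in the results, so I investigated further.

The explanation was in spaCy's documentation on rule-based matching.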
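
The older snippet above could be rewritten for spaCy v3 roughly as follows. This is only a sketch under stated assumptions: it uses `spacy.blank("en")` and romanized patterns so that it runs without downloading a model, whereas in this note `nlp` would come from `spacy.load('ja_ginza_electra')`. The sample sentence is made up for illustration.

```python
import spacy

# Assumption: a blank English pipeline stands in for the GINZA model so the
# sketch runs without a model download. It contains no "ner" component.
nlp = spacy.blank("en")

# spaCy v3: pass the registered factory name as a string.
# add_pipe creates the component and returns it.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "Person", "pattern": "Satsuki"},
    {"label": "Person", "pattern": "Mei"},
])

# Hypothetical sample sentence.
doc = nlp("Satsuki and Mei moved to the countryside.")
ents = [(ent.text, ent.label_) for ent in doc.ents]

print(nlp.pipe_names)
print(ents)
```

Because the blank pipeline has no statistical entity recognizer, nothing competes with the rule here. With a trained model the picture can differ.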

The entity ruler is designed to integrate with spaCy’s existing pipeline components and enhance the named entity recognizer. If it’s added before the "ner" component, the entity recognizer will respect the existing entity spans and adjust its predictions around it. This can significantly improve accuracy in some cases. If it’s added after the "ner" component, the entity ruler will only add spans to the doc.ents if they don’t overlap with existing entities predicted by the model. To overwrite overlapping entities, you can set overwrite_ents=True on initialization.

I was not sure what "the "ner" component" referred to (ner = named entity recognizer), so I tried the overwrite_ents=True option mentioned at the end, and the rule was then applied correctly.

config = {
   'overwrite_ents': True
}
ruler = nlp.add_pipe('entity_ruler', config=config)
patterns = [{'label': 'Company', 'pattern': 'Baseconnect株式会社'}]
ruler.add_patterns(patterns)
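
The overlap behavior described in the quoted passage can be modeled in plain Python. This is a simplified illustration, not spaCy's implementation: spans are `(start, end, label)` tuples over token indices with an exclusive end, overlaps among the rule matches themselves are ignored, and the function name `merge_rule_spans` and the sample spans are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start token, end token exclusive, label)


def overlaps(a: Span, b: Span) -> bool:
    """True if the two token ranges share at least one token."""
    return a[0] < b[1] and b[0] < a[1]


def merge_rule_spans(
    model_ents: List[Span],
    rule_spans: List[Span],
    overwrite_ents: bool = False,
) -> List[Span]:
    """Model of an entity ruler that runs after the statistical recognizer."""
    kept = list(model_ents)
    added: List[Span] = []
    for span in rule_spans:
        clashes = [ent for ent in kept if overlaps(ent, span)]
        if clashes and not overwrite_ents:
            # Default: the model's overlapping entity wins, the rule is skipped.
            continue
        # Otherwise drop any overlapping model entities and add the rule's span.
        kept = [ent for ent in kept if not overlaps(ent, span)]
        added.append(span)
    return sorted(kept + added)


# Hypothetical case: the model labeled only token 0, while the rule
# matches tokens 0 and 1 as a single company name.
model_ents = [(0, 1, "Product")]
rule_spans = [(0, 2, "Company")]

without_overwrite = merge_rule_spans(model_ents, rule_spans)
with_overwrite = merge_rule_spans(model_ents, rule_spans, overwrite_ents=True)
print(without_overwrite)
print(with_overwrite)
```

In this model the rule only takes effect in the overlapping case when `overwrite_ents` is true, which matches what I observed above.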

Comparing named entity extraction with and without the rule

While I am at it, here is how the result changes with and without the rule.

from spacy import displacy
doc = nlp('<tbody><tr><th>社名</th><td>Baseconnect株式会社/BaseconnectInc.</td></tr><tr><th>設立日</th><td>2017年1月17日</td></tr><tr><th>役員</th><td><p>代表取締役　國重侑輝</p><p>取締役　　　野中崇史</p><p>社外取締役　山川隆義</p><p>常勤監査役　尾堂隆久</p></td></tr><tr><th>従業員数</th><td>社員40名・アルバイト142名</td></tr><tr><th>資本金</th><td>13億1912万円（累計増資額）</td></tr><tr><th>拠点</th><td><dl><dt>京都本社</dt><dd>〒604-8861<brclass="u-spItem">京都府京都市中京区壬生神明町1−5</dd></dl></td></tr></tbody>')
displacy.render(doc, style='ent', jupyter=True)

Running the code above gives the following result.

Before the rule was applied, the result looked like the following. Adding the rule gives the expected result.