Building an English Quiz in Python - Part 2

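Last time, when I built an English quiz, there were several problems:
- Words with the third-person singular s could not be converted.
- Expressions that were not in the verb's base form could not be converted.
- etc.
After looking into it, NLTK (Natural Language Toolkit) seemed likely to handle these cases, so I tried it.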
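
As a quick illustration of why a stemmer helps with the third-person s, here is a minimal sketch using NLTK's PorterStemmer. The word list is made up for this example, and the stems the Porter algorithm produces are not always dictionary words.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Example words chosen for illustration: each inflected form is paired
# with its base form, and both should reduce to the same stem.
words = ["bothers", "bother", "slipping", "slip", "comes", "come"]
stems = {w: stemmer.stem(w) for w in words}

for word, stemmed in stems.items():
    print(f"{word} -> {stemmed}")
```

Because both the script and the dictionary phrase are stemmed before comparison, "bothers to" in the script can match "bother to" in the dictionary. A stemmer works by suffix rules, so irregular forms such as lost/lose should not be expected to share a stem.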

import re
import nltk
from nltk import stem
# Running the following downloads the data needed for the conversion. It only needs to be run once.
# nltk.download("wordnet")
# nltk.download("punkt")

data="""
Hi Shota, you look specially sharp today. Is that a new suit? That ties goes well with it.
Thank you Nancy, Yes, it is a new suit.
Any special occasion ?
I’m attending a meeting of the New York chapter of my alumni association this evening. I didn’t wanna look too casual.
You are dressed to kill. You know Shota you are one of the very few people at Alex and Alex who still bothers to dress well. Apparel has lost its appeal. Who needs fashion these days when you express yourself to social media.
Yes. Sartorial standards are certainly slipping. Neckties are disappearing even among bankers. People with sneakers everywhere even to wedding and church services. I read some whether half of American say they can wear jeans to work.
It’s not a trend that I particularly like.
Call me old fashion but I think certain standard should be maintain when it comes to what to wear and when.
I’m on the same opinion. I think it’s un-professional to present a slovenly appearance.
But As I’m sure you are wear more companies are allowing employees to dress casually any day of the week.

[words and phrases]
look sharp : 洒落た身なりである
alumni : 同窓会
be dressed to kill : めかしこんでいる
bother to : わざわざ〜する
apparel: アパレル, clothing, attire
lose ones appeal: 魅力を失う
express oneself through: 〜を通して自己表現する
sartorial standard: 服装の基準
slip: 低下する、衰える、 (名)小さな誤り
church service: 教会での礼拝
wear jeans to work: ジーンズを履いて仕事に行く
call me old fashion but : 時代遅れだと言われるだろうが
when it comes to: に関して言えば
be with the same opinion: 同じ意見である
un-professional: プロらしくない
slovenly appearance: だらしない身なり
allow someone to dress casual: 人がカジュアルな服装をするのを認める
any day of the week: 何曜日でも、いつでも
Who needs: Who needs? 誰が何を必要とするのでしょう?(誰も必要としていない)
"""

# Build the dictionary eng_dict
eng_dict = {}
for line in data.splitlines():
   if ":" in line:
       tmp = line.split(':')
       eng_dict[tmp[0].strip()] = tmp[1].strip()  # strip removes leading and trailing whitespace

# Extract the script into text
text = ""
for line in data.splitlines(keepends=True):
   if "[" in line:  # Everything up to [words and phrases] is treated as the script.
       break
   else:
       text += line
       
stemmer = stem.PorterStemmer()

match_phrases = []

def get_replace_script(script):
   global match_phrases
   tokenize_script = nltk.word_tokenize(script)
   replaced_tokenize_script = nltk.word_tokenize(script)
   replaced_tokenize_script = [i.lower() for i in replaced_tokenize_script]
   morph_script = [stemmer.stem(i) for i in replaced_tokenize_script]

   for phrase in eng_dict:
       tokenize_phrase = nltk.word_tokenize(phrase)
       tokenize_phrase = [i.lower() for i in tokenize_phrase]
       morph_phrase = [stemmer.stem(i) for i in tokenize_phrase]

       for i in range(len(morph_script)-len(morph_phrase)+1):
           if morph_script[i:i+len(morph_phrase)] == morph_phrase:
               match_phrases.append(phrase)
               for j in range(i, i+len(morph_phrase)):
                   tokenize_script[j] = tokenize_script[j][0:1] + '_'*len(tokenize_script[j][1:])  # e.g. April -> A____


   output = " ".join(tokenize_script)
   output = re.sub(r' ([,\.\?])', r'\1', output)  # Remove the space before , . and ?
   output = re.sub(r' ([’\']) ', r'\1', output)  # Remove the spaces around ’ (a different character from ') and around '
   return output

def main():

   for line in text.splitlines():
       print(get_replace_script(line))

   print("--- matched phrase(s) ---")
   for phrase in eng_dict:
       if phrase in match_phrases:
           print(phrase)

   print("--- unmatched phrase(s) ---")
   for phrase in eng_dict:
       if phrase not in match_phrases:
           print(phrase)

main()
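
The masking and punctuation cleanup done inside get_replace_script can be tried in isolation. The sketch below uses only the standard library; mask_word and join_tokens are hypothetical helper names introduced for this example, and the token list is hand-written rather than produced by nltk.word_tokenize.

```python
import re


def mask_word(word):
    """Keep the first character and replace the rest with underscores."""
    return word[0:1] + "_" * len(word[1:])


def join_tokens(tokens):
    """Join tokens with spaces, then pull punctuation back against the words."""
    output = " ".join(tokens)
    output = re.sub(r" ([,.?])", r"\1", output)  # no space before , . ?
    output = re.sub(r" ([’']) ", r"\1", output)   # no spaces around an apostrophe
    return output


print(mask_word("April"))  # A____
print(join_tokens(["I", "’", "m", "fine", ",", "thanks", "."]))  # I’m fine, thanks.
```

The second substitution only rejoins a token that was split around an apostrophe. It does not undo other splits made by the tokenizer, which is why "wanna" appears as "wan na" in the result.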

Result

Hi Shota, you look specially sharp today. Is that a new suit?
That ties goes well with it.
Thank you Nancy, Yes, it is a new suit.
Any special occasion?
I’m attending a meeting of the New York chapter of my a_____ association this evening.
I didn’t wan na look too casual.
You are dressed to kill. You know Shota you are one of the very few people at Alex and Alex who still b______ t_ dress well. A______ has lost its appeal. W__ n____ fashion these days when you express yourself to social media.
Yes. S________ s________ are certainly s_______. Neckties are disappearing even among bankers. People with sneakers everywhere even to wedding and c_____ s_______. I read some whether half of American say they can w___ j____ t_ w___.
It’s not a trend that I particularly like.
C___ m_ o__ f______ b__ I think certain standard should be maintain w___ i_ c____ t_ what to wear and when.
I’m on the same opinion. I think it’s u______________ to present a s_______ a_________.
But As I’m sure you are wear more companies are allowing employees to dress casually a__ d__ o_ t__ w___.
--- matched phrase(s) ---
alumni
bother to
apparel
sartorial standard
slip
church service
wear jeans to work
call me old fashion but
when it comes to
un-professional
slovenly appearance
any day of the week
Who needs
--- unmatched phrase(s) ---
look sharp
be dressed to kill
lose ones appeal
express oneself through
be with the same opinion
allow someone to dress casual

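The conversion now works somewhat better (10/19 → 13/19 phrases).
The six phrases listed above could not be converted for the following reasons.
- The script says "look specially sharp", so an extra word sits between "look" and "sharp".
- "I’m" was not converted to "I am" but was split into "I" and "m", so it did not match "be".
- Pronouns were not converted (e.g. yourself -> oneself).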
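
One possible way to handle the pronoun case, which is not what the script above does, is to treat generic words in the dictionary phrase as wildcards. The sketch below assumes the tokens are already lowercased and stemmed, and the PLACEHOLDERS set and both function names are hypothetical.

```python
# Hypothetical placeholder words: each may stand for any single token.
PLACEHOLDERS = {"oneself", "ones", "someone"}


def window_matches(window, phrase):
    """Compare two token lists position by position; placeholders match anything."""
    if len(window) != len(phrase):
        return False
    return all(p in PLACEHOLDERS or w == p for w, p in zip(window, phrase))


def find_with_placeholders(tokens, phrase):
    """Return the start index of every window of tokens that matches phrase."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1)
            if window_matches(tokens[i:i + n], phrase)]


# Hand-written tokens standing in for an already normalized sentence.
tokens = ["companies", "allow", "employees", "to", "dress", "casual"]
phrase = ["allow", "someone", "to", "dress", "casual"]
print(find_with_placeholders(tokens, phrase))  # [1]
```

This only covers the placeholder problem. An extra word inside a phrase, as in "look specially sharp", and forms of "be" would still need separate handling.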

Key points this time

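- Use NLTK to tokenize the English text into words.
- Use NLTK's stemmer to reduce words to their stems, so that inflected verbs line up with their base forms.
- Use list slices to find where a phrase occurs in the tokenized list.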
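
The slice-based search in the last point can be isolated as a small function. This is a minimal sketch with hand-written tokens that are assumed to be already lowercased and stemmed; find_phrase is a hypothetical name, not a function from the script above.

```python
def find_phrase(tokens, phrase):
    """Return every index where phrase occurs as a contiguous run in tokens."""
    n = len(phrase)
    # Slide a window of length n over tokens and compare each slice to phrase.
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]


tokens = ["when", "it", "come", "to", "what", "to", "wear"]
print(find_phrase(tokens, ["when", "it", "come", "to"]))  # [0]
print(find_phrase(tokens, ["to"]))                        # [3, 5]
print(find_phrase(tokens, ["look", "sharp"]))             # []
```

Comparing whole slices keeps the code short. The cost is one list comparison per starting position, which is fine for scripts of this size.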

Reference websites

NLTKの使い方をいろいろ調べてみた ("Looking into various ways to use NLTK")
