Getting Started with Natural Language Processing through the NLP 100 Exercise (言語処理100本ノック): Chapter 1

April 3, 2021, 04:24

Introduction

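Lately a number of things piled up and I was starting to lose motivation, so I decided to begin something new: natural language processing, which I have been interested in for a while. I plan to keep notes here, diary-style, on what I learn and what I should improve.
The site is here: https://nlp100.github.io/ja/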

My approach to each problem is to first attempt it without looking anything up, then compare my solution with a model answer at the end.

Environment
・OS: Ubuntu 20.04 LTS
・Env: Jupyter Lab (Python 3.8)

00. Reversed string

Obtain the string in which the characters of "stressed" are arranged in reverse order (from the last character to the first).

x = 'stressed' # assign "stressed" to x
r_x = '' # start with an empty string r_x
for i in range(0, len(x)):
   r_x = r_x + x[len(x)-i-1]
print(r_x)

output: desserts

It has been so long since I wrote code that I had forgotten an awful lot... Recalling it all made for good review, but it took a while.

01. "パタトクカシーー"

Take the 1st, 3rd, 5th, and 7th characters of the string "パタトクカシーー" and concatenate them.

x = "パタトクカシーー"
new_x = ""
for i in range (0, len(x)): 
   if (i%2==0): # an even index i corresponds to the 1st, 3rd, 5th, and 7th characters
       new_x += x[i]
print(new_x)

output: パトカー

Once 00 was solved and the for loop made sense, this one was easy.
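
As a possible shortcut for problems 00 and 01, Python's slice syntax with a step can stand in for both loops. This is only a minimal sketch, not the model answer, and the variable names are chosen just for illustration.

```python
# Sketch: extended slicing, written as seq[start:stop:step]
s = 'stressed'
reversed_s = s[::-1]   # step -1 walks from the last character to the first
print(reversed_s)

t = 'パタトクカシーー'
odd_chars = t[::2]     # step 2 keeps indices 0, 2, 4, 6 (the 1st, 3rd, 5th, 7th characters)
print(odd_chars)
```

Both slices build a new string and leave the original untouched.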

02. "パトカー" + "タクシー" = "パタトクカシーー"

Interleave the characters of "パトカー" and "タクシー" from the beginning to obtain the string "パタトクカシーー".

x = "パトカー"
y = "タクシー" 
new_xy = ""
for i in range (0, len(x)):
   new_xy += x[i]+y[i]
print(new_xy)

output: パタトクカシーー

I don't know whether this is the best approach, but I wrote it the way it came to mind.
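
One alternative worth comparing is the built-in zip, which pairs up the characters at the same position in both strings. The sketch below is a possible variant rather than a confirmed model answer, and the name interleaved is an assumption made for this example.

```python
x = 'パトカー'
y = 'タクシー'

# zip yields (x[0], y[0]), (x[1], y[1]), ...; each pair is joined,
# then all pairs are concatenated into one string
interleaved = ''.join(a + b for a, b in zip(x, y))
print(interleaved)
```

zip stops at the end of the shorter input, so this version does not raise an IndexError when the two strings differ in length, although the extra characters of the longer string are silently dropped.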

03. Pi

Split the sentence "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics." into words, and create a list of the number of (alphabetic) characters in each word, in order of appearance.

x = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."
# remove ',' and '.'
x = x.replace(',','')
x = x.replace('.','')

word_list = x.split(' ') # split on spaces to get the list of words
char_size = [] # empty list for the word lengths
for i in word_list: 
   char_size.append(len(i)) # append each word's length to the list
print(char_size)

Output: [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]

Using split was obvious right away. Commas and periods end up attached to the words, so I removed them with replace. I had forgotten about append and ran into "TypeError: 'int' object is not iterable".
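
The same steps (split, drop punctuation, measure) can also be folded into a single list comprehension. This is a sketch under the assumption that punctuation appears only at the edges of words, which holds for this sentence; the names sentence and lengths are made up for the example.

```python
sentence = ('Now I need a drink, alcoholic of course, '
            'after the heavy lectures involving quantum mechanics.')

# split() with no argument splits on any whitespace;
# strip(',.') removes commas and periods from both ends of each word
lengths = [len(word.strip(',.')) for word in sentence.split()]
print(lengths)
```

Because strip only touches the ends of a word, punctuation inside a word would still be counted.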

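04. Element symbols

Split the sentence "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can." into words, take the first character of the 1st, 5th, 6th, 7th, 8th, 9th, 15th, 16th, and 19th words and the first two characters of every other word, and create an associative array (dictionary or map) from each extracted string to the position of its word (its index counted from the start of the sentence).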

x = 'Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can.'
#remove , and .
x = x.replace('.','')
x = x.replace(',','')
#make word list
word_list = x.split(' ')
word_dict = {}
num = 0
for i in word_list:
   num += 1
   if num in [1, 5, 6, 7, 8, 9, 15, 16, 19]:
       word_dict[num]=i[0]
   else:
       word_dict[num]=i[0:2]
print(word_dict)

output: {1: 'H', 2: 'He', 3: 'Li', 4: 'Be', 5: 'B', 6: 'C', 7: 'N', 8: 'O', 9: 'F', 10: 'Ne', 11: 'Na', 12: 'Mi', 13: 'Al', 14: 'Si', 15: 'P', 16: 'S', 17: 'Cl', 18: 'Ar', 19: 'K', 20: 'Ca'}

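There is probably a cleaner way to write this... but I settled for it for now. One thing to note: the problem asks for a mapping from the extracted string to the word position, while this dictionary maps position to string, so the keys and values are the reverse of what was requested.
It felt good to recall the dict type as well, and the element symbols came out properly. I memorized these back in junior high; it takes me back.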
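
The problem statement asks for a dictionary keyed by the extracted string with the word position as the value. A minimal sketch of that direction, using enumerate to count positions from 1, could look like this; the names one_char_positions and symbol_to_position are assumptions for the example.

```python
sentence = ('Hi He Lied Because Boron Could Not Oxidize Fluorine. '
            'New Nations Might Also Sign Peace Security Clause. Arthur King Can.')
one_char_positions = {1, 5, 6, 7, 8, 9, 15, 16, 19}

symbol_to_position = {}
# enumerate(..., start=1) yields (1, 'Hi'), (2, 'He'), ... so no manual counter is needed
for position, word in enumerate(sentence.split(), start=1):
    length = 1 if position in one_char_positions else 2
    symbol_to_position[word[:length]] = position
print(symbol_to_position)
```

Only the first one or two characters of each word are used, so the trailing periods never need to be removed here.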

05. n-gram

Write a function that builds n-grams from a given sequence (such as a string or a list). Use it to obtain the word bi-grams and character bi-grams of the sentence "I am an NLPer".

def n_gram(x, n): #define n_gram function
   #word n-gram 
   x_word = x.split(' ') #get words list 
   word_gram = [] 
   for i in range(0, len(x_word)):
       if(i<=len(x_word)-n):
           word_gram.append(x_word[i:i+n])
           continue
       else:
           break
   print(str(n)+"-gram(word): ")
   print(word_gram)
   
   #character n-gram
   x_char = x.replace(' ','')
   char_gram = []
   for i in range(0, len(x_char)):
       if(i<=len(x_char)-n): 
           char_gram.append(x_char[i:i+n])    
           continue
       else: 
           break
   print(str(n)+"-gram(char): ")
   print(char_gram)

x = "I am an NLPer"
n = 2
n_gram(x, n)

Output:  2-gram(word): 
         [['I', 'am'], ['am', 'an'], ['an', 'NLPer']]
         2-gram(char): 
         ['Ia', 'am', 'ma', 'an', 'nN', 'NL', 'LP', 'Pe', 'er']

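At first I had no idea what the problem meant... what on earth is an n-gram?
Looking it up, an n-gram is a run of n consecutive units, and a bi-gram is a 2-gram. Incidentally, a 1-gram is called a uni-gram and a 3-gram a tri-gram. Word n-grams take words as the unit and character n-grams take characters. Character n-grams are used in particular for natural languages such as Japanese and Chinese, where word boundaries are ambiguous.
This took a fair amount of time; I want to compare it with the model answer soon.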
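
Since an n-gram is just a window of n consecutive units, one function can serve both cases if it accepts any sliceable sequence. The sketch below is one possible shape for it, not the model answer. Unlike the code above, it keeps spaces in the character bi-grams; whether to drop them is a design choice the problem leaves open.

```python
def n_gram(seq, n):
    """Return every run of n consecutive items in seq (a string or a list)."""
    # range stops at len(seq) - n so the last window is still full length
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

text = 'I am an NLPer'
word_bigrams = n_gram(text.split(), 2)   # list in, list slices out
char_bigrams = n_gram(text, 2)           # string in, substrings out (spaces kept)
print(word_bigrams)
print(char_bigrams)
```

Returning the list instead of printing it lets the caller reuse the result, for example by converting it to a set.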

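06. Sets

Let X and Y be the sets of character bi-grams contained in "paraparaparadise" and "paragraph", respectively, and find the union, intersection, and difference of X and Y. In addition, check whether the bi-gram 'se' is contained in X and in Y.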

def char_n_gram(x,n):
   #character n-gram
   x_char = x.replace(' ','')
   char_gram = []
   for i in range(0, len(x_char)):
       if(i<=len(x_char)-n): 
           char_gram.append(x_char[i:i+n])    
           continue
       else: 
           break
   return char_gram 
X = set(char_n_gram("paraparaparadise",2)) # convert the bi-gram list to a set and assign it to X
Y = set(char_n_gram("paragraph",2)) # convert the bi-gram list to a set and assign it to Y

#Union
print(X.union(Y))
#Intersection
print(X.intersection(Y))
#Symmetric difference (the problem itself asks for the plain difference, X.difference(Y))
print(X.symmetric_difference(Y))

Output: {'pa', 'ra', 'is', 'ar', 'ph', 'ad', 'ap', 'ag', 'se', 'gr', 'di'}
        {'ap', 'pa', 'ra', 'ar'}
        {'di', 'is', 'ph', 'ad', 'ag', 'gr', 'se'}

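Only character bi-grams are needed, so I slightly modified the n_gram function from problem 05: I took the "#character n-gram" part and changed print to return. After that I converted the results to sets and applied the set operations directly. Note that symmetric_difference returns the elements found in exactly one of the two sets, which is not the same as the difference X - Y that the problem asks for, and the check for 'se' is still missing.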
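
The code above calls symmetric_difference and does not test for 'se', whereas the problem asks for the difference and a membership check. A minimal sketch of those pieces with Python's set operators might look like the following; char_bigrams is a hypothetical helper written for this example, not the function defined above.

```python
def char_bigrams(text):
    """Hypothetical helper: the set of character bi-grams in text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}

X = char_bigrams('paraparaparadise')
Y = char_bigrams('paragraph')

print(X | Y)   # union
print(X & Y)   # intersection
print(X - Y)   # difference: bi-grams in X but not in Y
print(Y - X)   # difference in the other direction
print('se' in X, 'se' in Y)   # membership test with the in operator
```

The operators |, & and - are equivalent to the union, intersection and difference methods. Sets are unordered, so the printed element order can vary between runs.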

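07. Sentence generation from a template

Implement a function that takes arguments x, y, and z and returns the string "x時のyはz" (roughly, "y at x o'clock is z"). Then set x=12, y="気温", z=22.4 and check the result.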

def sentence(x, y, z):
   print(str(x)+"時の"+str(y)+"は"+str(z)) 
   
x=12
y="気温"
z=22.4
sentence(x,y,z)

Output: 12時の気温は22.4

Is this really all there is to it? (lol)
Well, I guess it's fine.
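
The problem says the function should return the string, while the version above prints it. A small sketch of a returning variant, using an f-string for the template, is shown below; the variable name result is an assumption for the example.

```python
def sentence(x, y, z):
    # An f-string converts each argument to text, so no explicit str() calls are needed
    return f'{x}時の{y}は{z}'

result = sentence(12, '気温', 22.4)
print(result)
```

Returning the value keeps the function usable in other expressions, and the caller decides whether to print it.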

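08. Cipher text

Implement a function cipher that converts each character of a given string according to the following specification:
・If it is a lowercase English letter, replace it with the character whose code is (219 - character code)
・Output any other character as is
Use this function to encrypt and decrypt an English message.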

def cipher(S):
   new = []
   for s in S:
       if 97 <= ord(s) <= 122:
           s = chr(219 - ord(s))
       new.append(s)
   return ''.join(new)
 
s = 'I am an NLPer'
new = cipher(s)
print (new)
print (cipher(new))

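Whether a character is a lowercase letter can also be checked with the islower method, and uppercase letters with isupper. Encryption was easy, and decryption turns out to need nothing extra: since 219 - (219 - code) = code, applying cipher a second time restores the original message, which is what the final print does.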
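
To see why one function covers both directions, note that 97 ('a') and 122 ('z') add up to 219, so the mapping mirrors the alphabet and is its own inverse. The sketch below is a compact variant of the same idea, with a round-trip check; the names message, encrypted and decrypted are assumptions for the example.

```python
def cipher(text):
    # Only ASCII lowercase letters (codes 97-122) are mirrored: a <-> z, b <-> y, ...
    return ''.join(chr(219 - ord(c)) if 'a' <= c <= 'z' else c for c in text)

message = 'I am an NLPer'
encrypted = cipher(message)
decrypted = cipher(encrypted)   # 219 - (219 - code) == code, so the same call decodes
print(encrypted)
print(decrypted)
```

The range test 'a' <= c <= 'z' is used here instead of islower so that only the 26 ASCII letters are transformed.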

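09. Typoglycemia

Write a program that, for a sequence of words separated by spaces, keeps the first and last characters of each word and randomly shuffles the order of the remaining characters. Words of length 4 or less are left unchanged. Give it a suitable English sentence (for example, "I couldn’t believe that I could actually understand what I was reading : the phenomenal power of the human mind .") and check the result.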

def random_gen_sent(x):
   import random 
   
   def rand_ints_nodup(a, b, k): # generate k distinct random integers in [a, b]
       ns = []
       while len(ns) < k:
           n = random.randint(a, b)
           if not n in ns:
               ns.append(n)
       return ns

   #replace , . : to ''
   x = x.replace(':','')
   x = x.replace('.','')
   x = x.replace(',','')
   #get the words list 
   x_word = x.split(' ')
   gene_sent = "" #make empty
   new_word = "" #make empty
   
   for i in x_word:
       if(len(i)>4):
           new_word = ""
           new_word += i[0] 
           #random with rand_ints_nodup function 
           randlist = rand_ints_nodup(1,len(i)-2,len(i)-2) # random order of the inner indices 1 .. len(i)-2
           for rand in randlist:
               new_word +=i[rand]
           new_word += i[len(i)-1]
           gene_sent+=new_word #add the word to gene_sent
       else: 
           gene_sent+=i #add the word to gene_sent
       
       gene_sent+=" " #add space

   print(gene_sent)
x = "I couldn’t believe that I could actually understand what I was reading : the phenomenal power of the human mind ."
random_gen_sent(x)

Output: I cou’lndt bveilee that I culod aclulaty unndtrsaed what I was raiedng  the pnhenmeaol pewor of the huamn mind

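Hmm, I feel there must be a better way to write this. The approach: first define a function that generates random numbers without duplicates; then, for each word longer than 4 characters, apply it to every character except the first and last to build a new word.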
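
One candidate for a shorter version is the standard library's random.shuffle, which reorders a list in place and so replaces the hand-written duplicate-free index generator. The sketch below is an alternative under that assumption, not the model answer. The function name typoglycemia is made up for the example, and unlike the code above it leaves the punctuation tokens in the sentence.

```python
import random

def typoglycemia(text):
    shuffled_words = []
    for word in text.split():
        if len(word) > 4:
            middle = list(word[1:-1])   # every character except the first and last
            random.shuffle(middle)      # shuffles the list in place
            word = word[0] + ''.join(middle) + word[-1]
        shuffled_words.append(word)
    return ' '.join(shuffled_words)

original = ('I couldn’t believe that I could actually understand what I was '
            'reading : the phenomenal power of the human mind .')
result = typoglycemia(original)
print(result)
```

The output differs from run to run, and a shuffle can occasionally reproduce the original order of a word.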

Reflections and improvements

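Because it had been a while since I wrote any program, and a long time since I touched Python at all, this took more time than I expected...
Still, it was good review. I wrote everything myself once through, so next I'll check it against the model code floating around online. Looking forward to it.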
