Regular Expressions in Python

かわだ

October 25, 2023, 05:56

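Mastering regular expressions may not quite amount to mastering Python, but they are a deep topic, so let's work through a few examples.

For an overview of the syntax, I found the table in this Qiita article very easy to follow.

Cleaning up X (Twitter) data

First, let's process the following string so that it is ready for text mining.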

text = "RT @pythonleaner Python is easy! 😀 https://abcd.com #Python #education"

The processing, in outline:

・Remove unneeded strings
・Replace the user ID with "Username"
・Convert 😀 (an emoji expressing emotion) to text

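First, remove the leading "RT ".

import re
text = "RT @pythonleaner Python is easy! 😀 https://abcd.com #Python #education"

# Raw string, so the backslash in \s is passed to the regex engine as-is.
text2 = re.sub(r'^RT\s+', "", text)
text2

This matches "RT" at the start of the string (^) followed by one or more (+) whitespace characters (\s), and replaces the match with "". The pattern is written as a raw string (r'...') so that Python does not try to interpret \s as a string escape.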

'@pythonleaner Python is easy! 😀 https://abcd.com #Python #education'
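As a side note, a pattern like RT\s+ on its own matches anywhere in the string, not only at the beginning. Here is a minimal sketch of the difference that anchoring with ^ makes. The sample string is made up for illustration: it contains "RT" a second time, at the end of the word "ART".

```python
import re

# Hypothetical sample: "RT" appears at the start and again inside "ART".
sample = "RT @user I love ART galleries"

# Without an anchor, every "RT" followed by whitespace is removed.
unanchored = re.sub(r'RT\s+', "", sample)

# With ^, only an "RT" at the very start of the string is removed.
anchored = re.sub(r'^RT\s+', "", sample)

print(unanchored)
print(anchored)
```

The unanchored version also eats the end of "ART" and glues the remaining "A" onto the next word, while the anchored version leaves the rest of the text alone.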

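Next, replace "@pythonleaner" with "Username".

import re
text = "RT @pythonleaner Python is easy! 😀 https://abcd.com #Python #education"

# Strip the leading "RT ".
text2 = re.sub(r'^RT\s+', "", text)
# Replace the @mention with a placeholder.
text3 = re.sub(r'@\w+', "Username", text2)
text3

This replaces "@" followed by one or more (+) word characters (\w: letters, digits, and the underscore) with "Username".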

'Username Python is easy! 😀 https://abcd.com #Python #education'
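One thing to watch out for: @\w+ also fires on the "@" inside an e-mail address. The sketch below shows that, along with one possible stricter variant that uses a negative lookbehind, (?<!\w), to require that the "@" does not directly follow a word character. The sample strings and the stricter pattern are my own illustration, not part of the original steps.

```python
import re

# Hypothetical inputs for illustration.
mention = "@python_learner99: thanks!"
email = "contact bob@example.com"

simple = r'@\w+'
# Assumption: a mention never directly follows a letter, digit, or underscore.
strict = r'(?<!\w)@\w+'

mention_simple = re.sub(simple, "Username", mention)
email_simple = re.sub(simple, "Username", email)
mention_strict = re.sub(strict, "Username", mention)
email_strict = re.sub(strict, "Username", email)

print(mention_simple)
print(email_simple)
print(mention_strict)
print(email_strict)
```

Note that \w+ stops at the colon, because ":" is not a word character, while the underscore and the digits are part of the match.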

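Next, remove "https://abcd.com".

import re
text = "RT @pythonleaner Python is easy! 😀 https://abcd.com #Python #education"

# Strip the leading "RT ".
text2 = re.sub(r'^RT\s+', "", text)
# Replace the @mention with a placeholder.
text3 = re.sub(r'@\w+', "Username", text2)
# Remove the URL together with the whitespace that follows it.
text4 = re.sub(r'(http|https)://\S+\s*', "", text3)
text4

The alternatives "http" or "https" are written as the group (http|https). The rest of the pattern matches "://" followed by one or more non-whitespace characters (\S+), so the whole URL is removed. The trailing \s* also consumes the whitespace after the URL; without it, a double space would be left where the URL used to be. The slashes do not need backslash escapes in Python.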

'Username Python is easy! 😀 #Python #education'
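A small gotcha with the (http|https) group, in case you want to extract URLs instead of deleting them: when a pattern contains a capturing group, re.findall returns the contents of the group, not the whole match. The sketch below compares the capturing group with a non-capturing group, (?:...), and with the shorter spelling https? (an "s" that may or may not be there). The sample text is shortened for illustration.

```python
import re

text = "Python is easy! https://abcd.com #Python"

# Capturing group: findall returns only what the group matched.
captured = re.findall(r'(http|https)://\S+', text)

# Non-capturing group: findall returns the whole match.
non_capturing = re.findall(r'(?:http|https)://\S+', text)

# Same idea written with an optional "s".
optional_s = re.findall(r'https?://\S+', text)

print(captured)
print(non_capturing)
print(optional_s)
```

For re.sub the difference does not matter, since the whole match is replaced either way.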

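Next, remove the "#" from the hashtags.

import re
text = "RT @pythonleaner Python is easy! 😀 https://abcd.com #Python #education"

# Strip the leading "RT ".
text2 = re.sub(r'^RT\s+', "", text)
# Replace the @mention with a placeholder.
text3 = re.sub(r'@\w+', "Username", text2)
# Remove the URL together with the whitespace that follows it.
text4 = re.sub(r'(http|https)://\S+\s*', "", text3)
# Drop the "#" but keep the tag text.
text5 = re.sub('#', "", text4)
text5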

'Username Python is easy! 😀 Python education'
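Once the "#" is gone, you can no longer tell which words were hashtags. If that information is useful for your analysis, one option is to collect the tags before stripping the symbol. This is only a sketch; the variable name hashtags is my own.

```python
import re

text4 = "Username Python is easy! 😀 #Python #education"

# Capture the word characters after each "#" (the group excludes the "#" itself).
hashtags = re.findall(r'#(\w+)', text4)

# Then strip the "#" symbols as before.
text5 = re.sub('#', "", text4)

print(hashtags)
print(text5)
```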

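Finally, convert the emoji to text. There is a library named, fittingly, "emoji", which lets you turn emoji into words.

import re
import emoji  # third-party package: pip install emoji

text = "RT @pythonleaner Python is easy! 😀 https://abcd.com #Python #education"

# Strip the leading "RT ".
text2 = re.sub(r'^RT\s+', "", text)
# Replace the @mention with a placeholder.
text3 = re.sub(r'@\w+', "Username", text2)
# Remove the URL together with the whitespace that follows it.
text4 = re.sub(r'(http|https)://\S+\s*', "", text3)
# Drop the "#" but keep the tag text.
text5 = re.sub('#', "", text4)
# Replace each emoji with its :name: form.
text6 = emoji.demojize(text5)
text6

😀 has become :grinning_face:.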

'Username Python is easy! :grinning_face: Python education'
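The emoji name still carries colons and an underscore. If you would rather have plain words for text mining, a regular expression can finish the job. The following is a rough sketch that assumes names always look like :word_characters: as in the output above; the helper name_to_words is hypothetical, and the pattern could also hit other colon-delimited text such as the middle of a timestamp.

```python
import re

text6 = 'Username Python is easy! :grinning_face: Python education'

def name_to_words(match):
    # group(1) is the name between the colons; turn underscores into spaces.
    return match.group(1).replace("_", " ")

# A function can be passed to re.sub in place of a replacement string.
plain = re.sub(r':(\w+):', name_to_words, text6)

print(plain)
```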

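Normalizing to standard English

Here is the next example. Emotion-laden writing like this is very common, but it is a problem for text mining.

text="Keeeeep going!!!!! We'd like to do so....."

Removing repeated characters

Let's change keeeeep to keep.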

import re

text="Keeeeep going!!!!! We'd like to do so....."
text = text.lower()
text = re.sub(r'(.)\1+', r'\1\1', text)
text

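(.) captures any single character as group 1, and \1+ matches one or more further copies of that same character. Together they find any character repeated two or more times in a row, and the run is replaced with exactly two copies (\1\1). Runs are cut to two rather than one because English words such as "keep" legitimately contain double letters. The result is shown below. Note that !!!!! and ..... were shortened as well.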

"keep going!! we'd like to do so.."

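To see why the replacement is \1\1 and not \1, here is a small comparison on a shortened version of the example. It is only an illustration of the trade-off: collapsing runs to a single character damages words that really do have a double letter.

```python
import re

text = "keeeeep going!!!!!"

# Collapse every run of a repeated character to two copies.
to_two = re.sub(r'(.)\1+', r'\1\1', text)

# Collapse every run to a single copy.
to_one = re.sub(r'(.)\1+', r'\1', text)

print(to_two)
print(to_one)
```

Collapsing to two keeps "keep" intact at the cost of leaving "!!", which is why a second clean-up step for punctuation is still needed.
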
(?=...): positive lookahead

Next, let's clean up the !! and .. that are still left.
Add the following line.

text = re.sub(r'[?.!]+(?=[?.!])', "", text)

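[] denotes a character set. [?.!]+ means one or more consecutive occurrences of any of "?", ".", or "!". Inside a set these characters are literal, so they need no escaping. () on the other hand forms a group: a grouped pattern such as (\?\.!)+ would match the whole sequence "?.!" repeated, which is not what we want. (Writing (?.!)+ without the escapes is not even a valid pattern, because "(?" introduces special group syntax.)
Next, "(?=...)" is called a positive lookahead. For example, r'super(?=man)' matches "super" only when it is immediately followed by "man", so it finds no match in "superwoman". The text inside the lookahead is checked but is not part of the match.
In our case, the pattern finds a run of one or more of ?, ., ! that is followed by yet another ?, ., or !, and replaces that run with "". Only the last punctuation mark of each run survives.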
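The super(?=man) example can be checked directly. This short sketch shows that the lookahead decides whether there is a match, but the "man" part is not included in what is matched.

```python
import re

lookahead = re.compile(r'super(?=man)')

hit = lookahead.search("superman")
miss = lookahead.search("superwoman")

print(hit.group())  # the matched text
print(hit.span())   # start and end positions of the match
print(miss)
```

The match on "superman" is just "super", spanning positions 0 to 5, and the search on "superwoman" returns None.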

import re

text="Keeeeep going!!!!! We'd like to do so....."
text = text.lower()
text = re.sub(r'(.)\1+', r'\1\1', text)
text = re.sub(r'[?.!]+(?=[?.!])', "", text)
text

Running this gives:

"keep going! we'd like to do so."

As you can see, the text is now much cleaner.
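The same pattern also handles mixed punctuation, since any of the three characters counts. The sketch below runs it on a made-up sentence, and also confirms that an unescaped (?.!)+ is rejected by the re module rather than treated as a group.

```python
import re

# Hypothetical sample with mixed and repeated punctuation.
sample = "really?!?! wow... ok."

# Delete every punctuation run that is followed by one more punctuation mark.
cleaned = re.sub(r'[?.!]+(?=[?.!])', "", sample)
print(cleaned)

# "(?" starts special group syntax, so "(?." is not a valid pattern.
try:
    re.compile(r'(?.!)+')
    invalid_raises = False
except re.error:
    invalid_raises = True
print(invalid_raises)
```

Each run is reduced to its final character, so "?!?!" becomes "!" and "..." becomes ".", while the single "." at the end is left as it is.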

Expanding contractions

English has contractions such as I'm and That's. For text mining, we expand these into separate words as well.
A handy library for this is contractions. It provides a dictionary of mappings such as "I'm": 'I am'.

import contractions
print(contractions.contractions_dict)

This prints the full mapping from contractions to their expanded forms.

We use this mapping to replace the contractions.

import re
import contractions

text="Keeeeep going!!!!! We'd like to do so....."
text = text.lower()
text = re.sub(r'(.)\1+', r'\1\1', text)
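text = re.sub(r'[?.!]+(?=[?.!])', "", text)

# Plain substring replacement of each contraction with its expansion.
# The text was lowercased above, so only lowercase keys can match.
for k, v in contractions.contractions_dict.items():
    text = text.replace(k, v)
text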

The result looks like this.

'keep going! we would like to do so.'

Compared with the original "Keeeeep going!!!!! We'd like to do so.....", that is a lot cleaner.
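Since this article is about regular expressions, here is one more variation: doing the contraction replacement with a single compiled pattern and word boundaries (\b) instead of repeated str.replace calls. This is a sketch under assumptions. SAMPLE_CONTRACTIONS is a tiny hand-made dictionary for illustration, not the library's real mapping, and expand is a hypothetical helper name.

```python
import re

# Hypothetical mini-dictionary; the real library's mapping is much larger.
SAMPLE_CONTRACTIONS = {
    "we'd": "we would",
    "i'm": "i am",
    "that's": "that is",
}

# Try longer keys first so a longer contraction wins over a shorter prefix.
keys = sorted(SAMPLE_CONTRACTIONS, key=len, reverse=True)

# \b...\b makes sure a key only matches as a whole word.
pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")

def expand(text):
    # Look up each matched contraction and substitute its expansion.
    return pattern.sub(lambda m: SAMPLE_CONTRACTIONS[m.group(1)], text)

result = expand("keep going! we'd like to do so. i'm sure that's fine.")
print(result)
```

The text is scanned once, and because of the word boundaries a key cannot match in the middle of a longer word.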