
Looking at the correspondence between English Wikipedia and Simple English Wikipedia (2)

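In the previous post, I compared Simple English Wikipedia (hereafter SimpleWiki) with English Wikipedia (hereafter EnWiki) and found about 200,000 article titles that appear in both.
This post prepares the body text of those articles for a closer look.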

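When I counted the shared article titles at the end of the previous post, I merged the two dumps into a single JSON file.
Each entry is keyed by the shared article title and holds the shared title (title), the EnWiki article ID (id), the EnWiki body text (text), the SimpleWiki article ID (s_id), and the SimpleWiki body text (s_text).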

(omitted)
length which is typically at the decimeter scale) using the following empirical relationships:\nformula_18.\nSee also.\nReferences.",
        "s_id": "76367",
        "s_text": "The albedo of an object is the extent to which it reflects light, more specifically light which comes from the Sun. It is defined as the ratio of reflected to incident electromagnetic radiation. It is a unitless measure indicative of a surface's or body's diffuse reflectivity. The word is derived from \"albus\", a Latin word for \"white\".\nOther websites."
    },
    "A": {
        "title": "A",
        "id": "290",
        "text": "First letter of the Latin alphabet\nA, or a, is the first letter and the first vowel of the Latin alphabet, 

(omitted)

Code points.\nThese are the code points for the forms of the letter in various systems\n1 Also for encodings based on ASCII, including the DOS, Windows, ISO-8859 and Macintosh families of encodings.\nUse as a number.\nIn the hexadecimal (base 16) numbering system, A is a number that corresponds to the number 10 in decimal (base 10) counting.\nNotes.\nFootnotes.\nReferences.",
        "s_id": "8",
        "s_text": "A or a is the first letter of the English alphabet. The small letter, a or α, is used as a lower case vowel.\nWhen it is spoken, ā is said as a long a, a Diphthong of ĕ and y. A is similar to Alphabet of the Greek alphabet. That is not surprising, because it means the same sound.\n\"Alpha and Omega\" (the last letter of the Greek alphabet) means from beginning to the end. In musical notation, the letter A is the symbol of a note in the scale, below B and above G.\nA is the letter that was used to represent a team in an old TV show, The A-Team. A capital a is written \"A\". Use a capital a at the start of a sentence if writing.\nA is also a musical note, sometimes referred to as \"La\".\nWhere it came from.\nThe letter 'A' was in the Phoenician alphabet's aleph. This symbol came from a simple picture of an ox head.\nThis Phoenician letter helped make the basic blocks of later types of the letter. The Greeks later modified this letter and used it as their letter alpha. The Greek alphabet was used by the Etruscans in northern Italy, and the Romans later modified the Etruscan alphabet for their own language.\nUsing the letter.\nThe letter A has six different sounds. It can sound like æ, in the International Phonetic Alphabet, such as the word \"pad\". Other sounds of this letter are in the words \"father\", which developed into another sound, such as in the word \"ace\".\nUse in mathematics.\nIn algebra, the letter \"A\" along with other letters at the beginning of the alphabet is used to represent known quantities.\nIn geometry, capital A, B, C etc. are used to label line segments, lines, etc. Also, A is typically used as one of the letters to label an angle in a triangle.\nIts letter shape is referred to abstractly in Sir William Vallance Douglas Hodge's 5th postulate, the basis for, as one of the Millennium Prize Problems, the Hodge Conjecture.\nReferences."
    },
    "Alabama": {
        "title": "Alabama",
        "id": "303",
        "text": "U.S. state\nAlabama () is a state in the Southeastern region of the United States, bordered by Tennessee to the north; Georgia to the east; Florida and the Gulf of Mexico to the south; and Mississippi to the west. Alabama is the 30th largest by area and the 24th-most populous of the U.S. states.\nAlabama is nicknamed the \"Yellowhammer State\", after the state bird.
(omitted)

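I could process this JSON directly every time, but text and s_text are hard to read in this form, and if something goes wrong partway through processing, tracking down the cause by eye is painful.
So I expand each EnWiki and SimpleWiki article into its own text file. To keep the article correspondence I went to the trouble of establishing, the two files of a pair get the same file name, with a newly assigned shared ID as the prefix.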

f'{_num}_en_{_id}_simple_{_s_id}.txt'

That is the file name structure, where {_num} is the newly introduced shared prefix. Corresponding articles get the same file name, and the Simple/En distinction is made by the directory the file is stored in.


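This way, the directory name of the file you are looking at tells you whether it is an EnWiki or a SimpleWiki article, and the counterpart article can be found under the same file name in the other directory. Both article IDs also stay in the file name, so you can trace back to the dump files to check.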
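As a rough illustration of how this naming scheme can be used, the sketch below finds the counterpart of a file and recovers the IDs from its name. The helper names `counterpart` and `parse_ids` are hypothetical, and the layout `<root>/en/` and `<root>/simple/` is assumed from the example paths in this post.

```python
import re
from pathlib import Path

# Pattern for f'{_num}_en_{_id}_simple_{_s_id}.txt' (assumes all three parts are digits).
FILE_NAME_RE = re.compile(r"^(?P<num>\d+)_en_(?P<id>\d+)_simple_(?P<s_id>\d+)\.txt$")


def counterpart(path):
    """Return the path of the paired article in the other directory (hypothetical helper)."""
    path = Path(path)
    other = {"en": "simple", "simple": "en"}[path.parent.name]
    return path.parent.parent / other / path.name


def parse_ids(path):
    """Recover the shared prefix and both article IDs from a file name (hypothetical helper)."""
    match = FILE_NAME_RE.match(Path(path).name)
    if match is None:
        raise ValueError(f"unexpected file name: {path}")
    return match.groupdict()
```

For example, `counterpart("text/en/1000_en_4548_simple_91088.txt")` points at the same file name under `text/simple/`, and `parse_ids` on either path yields the EnWiki ID `4548` and the SimpleWiki ID `91088`.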

The script for this reformatting step is here.
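A minimal sketch of this expansion step might look like the following. It is not the original script: the function name `expand_articles`, the output layout `<out_dir>/en` and `<out_dir>/simple`, and the way `_num` is assigned (a running number in JSON key order, starting at 0) are all assumptions made for illustration.

```python
import json
from pathlib import Path


def expand_articles(json_path, out_dir):
    """Write each article pair to out_dir/en and out_dir/simple under one shared file name."""
    with open(json_path, encoding="utf-8") as f:
        articles = json.load(f)

    en_dir = Path(out_dir) / "en"
    simple_dir = Path(out_dir) / "simple"
    en_dir.mkdir(parents=True, exist_ok=True)
    simple_dir.mkdir(parents=True, exist_ok=True)

    written = []
    # Assumption: the shared prefix is simply the position of the entry in the JSON.
    for _num, entry in enumerate(articles.values()):
        _id, _s_id = entry["id"], entry["s_id"]
        name = f"{_num}_en_{_id}_simple_{_s_id}.txt"
        # The body text already uses "\n" between paragraphs and section titles,
        # so writing it as-is gives one paragraph or title per line.
        (en_dir / name).write_text(entry["text"] + "\n", encoding="utf-8")
        (simple_dir / name).write_text(entry["s_text"] + "\n", encoding="utf-8")
        written.append(name)
    return written
```

Writing the same `name` into both directories is what keeps the pairing intact without any separate index file.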

Each EnWiki and SimpleWiki article is now a text file, with the same-title relationship preserved.
Let's open one of these files.

$ less text/en/1000_en_4548_simple_91088.txt

Type of snowstorm
A blizzard is a severe snowstorm characterized by strong sustained winds and low visibility, lasting for a prolonged period of time—typically at least three or four hours. A ground blizzard is a weather condition where snow is not falling but loose snow on the ground is lifted and blown by strong winds. Blizzards can have an immense size and usually stretch to hundreds or thousands of kilometres.
Definition and etymology.
In the United States, the National Weather Service defines a blizzard as a severe snow storm characterized by strong winds causing blowing snow that results in low visibilities. The difference between a blizzard and a snowstorm is the strength of the wind, not the amount of snow. To be a blizzard, a snow storm must have sustained winds or frequent gusts that are greater than or equal to with blowing or drifting snow which reduces visibility to or less and must last for a prolonged period of time—typically three hours or more.
Environment Canada defines a blizzard as a storm with wind speeds exceeding accompanied by visibility of or less, resulting from snowfall, blowing snow, or a combination of the two. These conditions must persist for a period of at least four hours for the storm to be classified as a blizzard, except north of the arctic tree line, where that threshold is raised to six hours.
The Australia Bureau of Meteorology describes a blizzard as, "Violent and very cold wind which is laden with snow, some part, at least, of which has been raised from snow covered ground."
While severe cold and large amounts of drifting snow may accompany blizzards, they are not required. Blizzards can bring whiteout conditions, and can paralyze regions for days at a time, particularly where snowfall is unusual or rare.
A severe blizzard has winds over , near zero visibility, and temperatures of or lower. In Antarctica, blizzards are associated with winds spilling over the edge of the ice plateau at an average velocity of .
Ground blizzard refers to a weather condition where loose snow or ice on the ground is lifted and blown by strong winds. The primary difference between a ground blizzard as opposed to a regular blizzard is that in a ground blizzard no precipitation is produced at the time, but rather all the precipitation is already present in the form of snow or ice at the surface.
(omitted)

As you can see, each line is a paragraph or a section title.

It is not a single sentence.

So the next preprocessing step is to split the text into sentences.

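There are several implementations for splitting English text into sentences, for example nltk's Punkt and the Moses tokenizer.
I tried several of them, and for Wikipedia articles, where semicolons, colons, and quotations show up frequently, spaCy's en_core_web_sm model gave the best splits, so I went with it.
(As a check, I split this article with various tools and compared the results.)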

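Given a text string, spaCy not only splits it into sentences but also handles most of the basic processing, such as part-of-speech tagging, parsing, and NER, with just
doc = nlp(text)
which is convenient, so let's attach that information in the same pass.
Each output file still corresponds to one SimpleWiki or EnWiki article, and I create two kinds:
files with one sentence per line (.txt), and
files with one token (or EOS) per line (.mecab).
For each token I record the surface form, the coarse and fine-grained part-of-speech tags, the lowercased surface form, the lemma, and whether it is a stop word, separated by tabs.
The resulting files are stored in a separate location, keeping the same en and simple directory structure and shared file names as before.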

The script is here.
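A sketch of such a script, not the original one, could look like this. The function names `token_line`, `render`, and `convert_file` are hypothetical, and running `nlp` once per input line (one paragraph or section title at a time) is an assumption. The column order follows the sample output in this post: surface form, `pos_`, `tag_`, `lower_`, `lemma_`, `is_stop`.

```python
from pathlib import Path


def token_line(token):
    """Tab-separated columns for one spaCy token, in the order of the sample output."""
    return "\t".join([token.text, token.pos_, token.tag_,
                      token.lower_, token.lemma_, str(token.is_stop)])


def render(doc):
    """Return (sentence_lines, mecab_lines) for one parsed paragraph."""
    sentences, mecab = [], []
    for sent in doc.sents:
        tokens = [t for t in sent if not t.is_space]
        if not tokens:
            continue  # skip sentences made only of whitespace
        sentences.append(sent.text.strip())
        mecab.extend(token_line(t) for t in tokens)
        mecab.append("EOS")  # sentence terminator line
    return sentences, mecab


def convert_file(src, dst_txt, nlp):
    """Write dst_txt (one sentence per line) and dst_txt + '.mecab' (one token per line)."""
    sentences, mecab = [], []
    for line in Path(src).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Assumption: each input line is parsed on its own.
        sent_lines, mecab_lines = render(nlp(line))
        sentences.extend(sent_lines)
        mecab.extend(mecab_lines)
    dst_txt = Path(dst_txt)
    dst_txt.parent.mkdir(parents=True, exist_ok=True)
    dst_txt.write_text("\n".join(sentences) + "\n", encoding="utf-8")
    Path(str(dst_txt) + ".mecab").write_text("\n".join(mecab) + "\n", encoding="utf-8")


# Usage (requires spaCy and the en_core_web_sm model to be installed):
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   convert_file("text/en/1000_en_4548_simple_91088.txt",
#                "sentences/en/1000_en_4548_simple_91088.txt", nlp)
```

Keeping `render` separate from file handling makes it easy to check the output format on a single paragraph before converting the whole directory.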

$ less sentences/en/1000_en_4548_simple_91088.txt

Type of snowstorm
A blizzard is a severe snowstorm characterized by strong sustained winds and low visibility, lasting for a prolonged period of time—typically at least three or four hours.
A ground blizzard is a weather condition where snow is not falling but loose snow on the ground is lifted and blown by strong winds.
Blizzards can have an immense size and usually stretch to hundreds or thousands of kilometres.
Definition and etymology.
In the United States, the National Weather Service defines a blizzard as a severe snow storm characterized by strong winds causing blowing snow that results in low visibilities.
The difference between a blizzard and a snowstorm is the strength of the wind, not the amount of snow.
To be a blizzard, a snow storm must have sustained winds or frequent gusts that are greater than or equal to with blowing or drifting snow which reduces visibility to or less and must last for a prolonged period of time—typically three hours or more.
Environment Canada defines a blizzard as a storm with wind speeds exceeding accompanied by visibility of or less, resulting from snowfall, blowing snow, or a combination of the two.
These conditions must persist for a period of at least four hours for the storm to be classified as a blizzard, except north of the arctic tree line, where that threshold is raised to six hours.
The Australia Bureau of Meteorology describes a blizzard as, "Violent and very cold wind which is laden with snow, some part, at least, of which has been raised from snow covered ground.
(omitted)
$ less sentences/en/1000_en_4548_simple_91088.txt.mecab

Type    NOUN    NN      type    type    False
of      ADP     IN      of      of      True
snowstorm       NOUN    NN      snowstorm       snowstorm       False
A       DET     DT      a       a       True
blizzard        NOUN    NN      blizzard        blizzard        False
is      AUX     VBZ     is      be      True
a       DET     DT      a       a       True
severe  ADJ     JJ      severe  severe  False
snowstorm       NOUN    NN      snowstorm       snowstorm       False
characterized   VERB    VBN     characterized   characterize    False
by      ADP     IN      by      by      True
strong  ADJ     JJ      strong  strong  False
sustained       ADJ     JJ      sustained       sustained       False
winds   NOUN    NNS     winds   wind    False
and     CCONJ   CC      and     and     True
low     ADJ     JJ      low     low     False
visibility      NOUN    NN      visibility      visibility      False
,       PUNCT   ,       ,       ,       False
lasting VERB    VBG     lasting last    False
for     ADP     IN      for     for     True
a       DET     DT      a       a       True
prolonged       ADJ     JJ      prolonged       prolonged       False
period  NOUN    NN      period  period  False
of      ADP     IN      of      of      True
time    NOUN    NN      time    time    False
—       PUNCT   :       —       —       False
typically       ADV     RB      typically       typically       False
at      ADV     RB      at      at      True
least   ADV     RBS     least   least   True
three   NUM     CD      three   three   True
or      CCONJ   CC      or      or      True
four    NUM     CD      four    four    True
hours   NOUN    NNS     hours   hour    False
.       PUNCT   .       .       .       False
EOS

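The .mecab format is borrowed from the morphological analyzer MeCab. Its default output splits the text into tokens, writes one token per line with its token information (in a format slightly different from the tab-separated one used here), and puts a line containing only "EOS" at the end of each sentence.
Having a dedicated end-of-sentence line means you do not need a flag variable when writing loops, and counting those lines gives the number of sentences in the file reliably.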
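To illustrate both points, here is a small sketch that groups token rows into sentences and counts sentences using only the EOS lines. The function names `read_sentences` and `count_sentences` are hypothetical, and the columns are assumed to be tab-separated.

```python
def read_sentences(lines):
    """Yield one list of token rows per sentence from .mecab-style lines."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if line == "EOS":
            # The EOS line itself marks the boundary, so no extra flag is needed.
            yield sentence
            sentence = []
        elif line:
            sentence.append(line.split("\t"))


def count_sentences(lines):
    """Number of sentences equals the number of EOS lines."""
    return sum(1 for line in lines if line.rstrip("\n") == "EOS")
```

Both functions accept any iterable of lines, so an open file object can be passed directly, for example `with open(path, encoding="utf-8") as f: n = count_sentences(f)`.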

The remaining preprocessing step is to find the corresponding sentences between corresponding articles, which I will cover in the next post.
