
Comparing the English Wikipedia Datasets on Hugging Face

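This post loads the English Wikipedia data from two Wikipedia datasets on Hugging Face, wikipedia and graelo/wikipedia, and compares them.
The conclusion: graelo/wikipedia looks like the better choice, because its snapshot is newer (20230901 versus 20220301) and it contains more rows.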

Code to load the datasets

The core is the following two load_dataset calls.

from datasets import load_dataset

# The datasets are stored on an external hard disk, so cache_dir is specified.
# Download the en data from the wikipedia dataset. This took about 25 minutes.
wiki_en = load_dataset("wikipedia", "20220301.en", cache_dir="/Volumes/DataSets/DataSets")

# Download the en data from the graelo/wikipedia dataset.
graelo_wiki = load_dataset("graelo/wikipedia", "20230901.en", cache_dir="/Volumes/DataSets/DataSets")

Checking the data in the Python interpreter

% python
Python 3.11.6 (main, Oct  2 2023, 13:45:54) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from datasets import load_dataset
>>> wiki_en = load_dataset("wikipedia", "20220301.en", cache_dir="/Volumes/DataSets/DataSets", trust_remote_code=True)
>>> graelo_wiki = load_dataset("graelo/wikipedia", "20230901.en", cache_dir="/Volumes/DataSets/DataSets", trust_remote_code=True)

Loading dataset shards: 100%|█████████████████████████████████████████████████████████████████| 43/43 [00:29<00:00,  1.48it/s]
>>> 
>>> wiki_en
DatasetDict({
    train: Dataset({
        features: ['id', 'url', 'title', 'text'],
        num_rows: 6458670
    })
})
>>> graelo_wiki
DatasetDict({
    train: Dataset({
        features: ['id', 'url', 'title', 'text'],
        num_rows: 6705754
    })
})

The 20220301 English dataset from wikipedia contains 6,458,670 rows.
The 20230901 English dataset from graelo/wikipedia contains 6,705,754 rows.
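To put the two counts in perspective, the gap can be worked out directly. The following is a minimal sketch that uses only the num_rows values printed above; the function name compare_row_counts is made up for illustration and is not part of the datasets library.

```python
def compare_row_counts(old_rows: int, new_rows: int) -> tuple[int, float]:
    """Return the absolute and percentage increase from old_rows to new_rows."""
    diff = new_rows - old_rows
    pct = diff / old_rows * 100
    return diff, pct


# num_rows values copied from the interpreter output above
WIKI_EN_ROWS = 6458670   # wikipedia, 20220301.en
GRAELO_ROWS = 6705754    # graelo/wikipedia, 20230901.en

diff, pct = compare_row_counts(WIKI_EN_ROWS, GRAELO_ROWS)
print(f"graelo/wikipedia has {diff:,} more rows ({pct:.2f}% more)")
# prints: graelo/wikipedia has 247,084 more rows (3.83% more)
```

With the datasets already loaded, the same numbers should be available as wiki_en["train"].num_rows and graelo_wiki["train"].num_rows instead of hard-coded constants.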

>>> wiki_en['train'][9]
{'id': '316', 'url': 'https://en.wikipedia.org/wiki/Academy%20Award%20for%20Best%20Production%20Design', 'title': 'Academy Award for Best Production Design', 'text': "The Academy Award for Best Production Design recognizes achievement for art direction in film. The category's original name was Best Art Direction, but was changed to its current name in 2012 for the 85th Academy Awards. This change resulted from the Art Director's branch of the Academy of Motion Picture Arts and Sciences (AMPAS) being renamed the Designer's branch. Since 1947, the award is shared with the set decorator(s). It is awarded to the best interior design in a film.\n\nThe films below are listed with their production year (for example, the 2000 Academy Award for Best Art Direction is given to a film from 1999). In the lists below, the winner of the award for each year is shown first, followed by the other nominees in alphabetical order.\n\nSuperlatives\n\nWinners and nominees\n\n1920s\n\n1930s\n\n1940s\n\n1950s\n\n1960s\n\n1970s\n\n1980s\n\n1990s\n\n2000s\n\n2010s\n\n2020s\n\nSee also\n BAFTA Award for Best Production Design\n Critics' Choice Movie Award for Best Production Design\n\nNotes\n\nReferences\n\nBest Production Design\n\nAwards for best art direction"}
>>> graelo_wiki['train'][1]
{'id': '710', 'url': 'https://en.wikipedia.org/wiki/Foreign%20relations%20of%20Angola', 'title': 'Foreign relations of Angola', 'text': "The foreign relations of Angola are based on Angola's strong support of U.S. foreign policy as the Angolan economy is dependent on U.S. foreign aid.\nFrom 1975 to 1989, Angola was aligned with the Eastern bloc, in particular the Soviet Union, Libya, and Cuba. Since then, it has focused on improving relationships with Western countries, cultivating links with other Portuguese-speaking countries, and asserting its own national interests in Central Africa through military and diplomatic intervention. In 1993, it established formal diplomatic relations with the United States. It has entered the Southern African Development Community as a vehicle for improving ties with its largely Anglophone neighbors to the south. Zimbabwe and Namibia joined Angola in its military intervention in the Democratic Republic of the Congo, where Angolan troops remain in support of the Joseph Kabila government. It also has intervened in the Republic of the Congo (Brazzaville) in support of Denis Sassou-Nguesso in the civil war.\n\nSince 1998, Angola has successfully worked with the United Nations Security Council to impose and carry out sanctions on UNITA. More recently, it has extended those efforts to controls on conflict diamonds, the primary source of revenue for UNITA during the Civil War that ended in 2002. At the same time, Angola has promoted the revival of the Community of Portuguese-Speaking Countries (CPLP) as a forum for cultural exchange and expanding ties with Portugal (its former ruler) and Brazil (which shares many cultural affinities with Angola) in particular. 
Angola is a member of the Port Management Association of Eastern and Southern Africa (PMAESA).\n\nDiplomatic relations \nList of countries with which Angola maintains diplomatic relations with:\n\nBilateral relations\n\nAfrica\n\nAmericas\n\nAsia\n\nEurope\n\nSee also \n List of diplomatic missions in Angola\n List of diplomatic missions of Angola\n Visa requirements for Angolan citizens\n\nReferences\n\nExternal links"}

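The two outputs above show the contents of one sample record from each dataset. Both have the same fields: id, url, title, and text.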
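Full records are long, so a condensed view can make them easier to compare side by side. The sketch below assumes only the four fields shown in the output (id, url, title, text); summarize_record is a hypothetical helper, and the sample record is a shortened, hand-made stand-in rather than real dataset output.

```python
from urllib.parse import unquote


def summarize_record(record: dict) -> dict:
    """Condense one Wikipedia record into a few easy-to-compare fields."""
    text = record["text"]
    # Paragraphs and section titles are separated by blank lines in the text field
    blocks = [b for b in text.split("\n\n") if b.strip()]
    return {
        "id": record["id"],
        "title": record["title"],
        "url": unquote(record["url"]),  # decode %20 and similar escapes
        "chars": len(text),
        "blocks": len(blocks),
    }


# A shortened, hand-made record in the same shape as the output above
sample = {
    "id": "710",
    "url": "https://en.wikipedia.org/wiki/Foreign%20relations%20of%20Angola",
    "title": "Foreign relations of Angola",
    "text": "First paragraph.\n\nSee also\n\nReferences",
}
print(summarize_record(sample))
```

In the interpreter session, the same helper could be called as summarize_record(wiki_en['train'][9]) or summarize_record(graelo_wiki['train'][1]).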
