
GLUE - The Standard Benchmark for English Natural Language Processing

1. GLUE

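"GLUE" (General Language Understanding Evaluation) is the standard benchmark for English natural language processing. It bundles test data for language tasks such as paraphrase detection and question answering, and this data is used to compute an overall score for a model's language ability.

It has become the de facto standard in English NLP, and papers that present a new language model customarily report a "GLUE score".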

In addition, the more difficult "SuperGLUE" has been available since 2019.

2. Leaderboard

The leaderboard that ranks models by "GLUE score" can be viewed here:

GLUE Leaderboard

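3. Tasks and Datasets

Brief descriptions of the GLUE tasks are as follows.

・CoLA : Judge whether a sentence is grammatically acceptable English.
・SST-2 : Classify the sentiment of a movie review (positive or negative).
・MRPC : Judge whether two sentences have the same meaning.
・STS-B : Rate the similarity of two sentences (news headlines, captions, and the like) on a scale from 0 to 5.
・QQP : Judge whether two questions have the same meaning.
・MNLI-m / MNLI-mm : Classify the relation between two sentences (entailment, contradiction, neutral).
・SQuAD : Extract the answer to a question from a context. It is not a GLUE task itself, but QNLI is derived from it, so it is covered as a bonus at the end.
・QNLI : Judge whether a sentence contains the answer to a question.
・RTE : Classify the relation between two sentences (entailment, not entailment).
・WNLI : Judge whether a sentence in which the pronoun has been replaced is entailed by the original sentence.

The datasets for the GLUE tasks are available here:

GLUE Tasks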

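3-1. CoLA (The Corpus of Linguistic Acceptability)

"CoLA" is the task of judging whether a sentence is grammatically acceptable English. The dataset is drawn from 23 books and journal articles.

◎ Data format
Each line consists of 4 tab-separated columns.

・Column 1 : Source of the sentence.
・Column 2 : Label (0 = not acceptable, 1 = acceptable).
・Column 3 : Label originally assigned by the author of the source (* = not acceptable, empty = acceptable).
・Column 4 : The sentence.

◎ Sample
A sample is shown below.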

clc95	0	*	In which way is Sandy very anxious to see if the students will be able to solve the homework problem?
c-05	1		The book was written by John.
c-05	0	*	Books were sent to each other by the students.
swb04	1		She voted for herself.
swb04	1		I saw that gas can explode.
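
As a minimal sketch of how these rows could be read, the following uses Python's standard `csv` module on three rows copied from the sample above. The function name `read_cola` and the tuple layout are illustrative choices, not part of GLUE.

```python
import csv
import io

# Three rows copied from the sample above; "\t" marks the tab separators.
SAMPLE = (
    "c-05\t1\t\tThe book was written by John.\n"
    "c-05\t0\t*\tBooks were sent to each other by the students.\n"
    "swb04\t1\t\tShe voted for herself.\n"
)

def read_cola(text):
    """Yield (source, label, original_mark, sentence) from CoLA-style TSV text."""
    # QUOTE_NONE treats any quotation marks in a sentence as ordinary characters.
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for source, label, mark, sentence in reader:
        yield source, int(label), mark, sentence

rows = list(read_cola(SAMPLE))
for source, label, mark, sentence in rows:
    verdict = "acceptable" if label == 1 else "not acceptable"
    print(f"[{source}] {verdict}: {sentence}")
```

In these rows the asterisk in column 3 appears exactly when the label in column 2 is 0.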

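3-2. SST-2 (The Stanford Sentiment Treebank)

"SST-2" is the task of classifying the sentiment of a movie review (positive or negative).

◎ Data format
Each line consists of 2 tab-separated columns.

・Column 1 : The sentence.
・Column 2 : Label (0 = negative, 1 = positive).

◎ Sample
A sample is shown below.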

hide new secretions from the parental units 	0
contains no wit , only labored gags 	0
that loves its characters and communicates something rather beautiful about human nature 	1
remains utterly satisfied to remain the same throughout 	0
on the worst revenge-of-the-nerds clichés the filmmakers could dredge up 	0
that 's far too tragic to merit such superficial treatment 	0
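
The two-column layout can be split on the last tab of each line. The sketch below does this for three of the sample rows and counts the labels. `read_sst2` and `LABEL_NAMES` are hypothetical names introduced for illustration.

```python
from collections import Counter

# Three rows copied from the sample above; note the space before each tab.
SAMPLE = (
    "hide new secretions from the parental units \t0\n"
    "contains no wit , only labored gags \t0\n"
    "that loves its characters and communicates something rather beautiful about human nature \t1\n"
)

LABEL_NAMES = {0: "negative", 1: "positive"}

def read_sst2(text):
    """Return a list of (sentence, label_name) pairs from SST-2-style TSV text."""
    examples = []
    for line in text.splitlines():
        # Split on the last tab so the sentence is kept whole.
        sentence, label = line.rsplit("\t", 1)
        examples.append((sentence.strip(), LABEL_NAMES[int(label)]))
    return examples

examples = read_sst2(SAMPLE)
label_counts = Counter(label for _, label in examples)
print(label_counts)
```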

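3-3. MRPC (Microsoft Research Paraphrase Corpus)

"MRPC" is the task of judging whether two sentences have the same meaning. The dataset is drawn from online news.

◎ Data format
Each line consists of 5 tab-separated columns.

・Column 1 : Whether the two sentences have the same meaning (0 = different meaning, 1 = same meaning).
・Column 2 : ID of sentence 1.
・Column 3 : ID of sentence 2.
・Column 4 : Sentence 1.
・Column 5 : Sentence 2.

◎ Sample
A sample is shown below.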

1	702876	702977	Amrozi accused his brother, whom he called "the witness", of deliberately distorting his evidence.	Referring to him as only "the witness", Amrozi accused his brother of deliberately distorting his evidence.
0	2108705	2108831	Yucaipa owned Dominick's before selling the chain to Safeway in 1998 for $2.5 billion.	Yucaipa bought Dominick's in 1995 for $693 million and sold it to Safeway for $1.8 billion in 1998.
1	1330381	1330521	They had published an advertisement on the Internet on June 10, offering the cargo for sale, he added.	On June 10, the ship's owners had published an advertisement on the Internet, offering the explosives for sale.
0	3344667	3344648	Around 0335 GMT, Tab shares were up 19 cents, or 4.4%, at A$4.56, having earlier set a record high of A$4.57.	Tab shares jumped 20 cents, or 4.6%, to set a record closing high at A$4.57.
1	1236820	1236712	The stock rose $2.11, or about 11 percent, to close Friday at $21.51 on the New York Stock Exchange.	PG&E Corp. shares jumped $1.63 or 8 percent to $21.03 on the New York Stock Exchange on Friday.
1	738533	737951	Revenue in the first quarter of the year dropped 15 percent from the same period a year earlier.	With the scandal hanging over Stewart's company, revenue the first quarter of the year dropped 15 percent from the same period a year earlier.
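
MRPC sentences contain double quotes (for example "the witness"), which a CSV parser may try to interpret as field quoting. The sketch below reads two of the sample rows with quoting disabled. The `ParaphrasePair` type and `read_mrpc` function are illustrative names, not part of the dataset.

```python
import csv
import io
from typing import NamedTuple

class ParaphrasePair(NamedTuple):
    is_paraphrase: bool
    id1: str
    id2: str
    sentence1: str
    sentence2: str

# Two rows copied from the sample above.
SAMPLE = (
    '1\t702876\t702977\tAmrozi accused his brother, whom he called "the witness", of deliberately distorting his evidence.'
    '\tReferring to him as only "the witness", Amrozi accused his brother of deliberately distorting his evidence.\n'
    "0\t2108705\t2108831\tYucaipa owned Dominick's before selling the chain to Safeway in 1998 for $2.5 billion."
    "\tYucaipa bought Dominick's in 1995 for $693 million and sold it to Safeway for $1.8 billion in 1998.\n"
)

def read_mrpc(text):
    """Parse MRPC-style TSV text into ParaphrasePair records."""
    # QUOTE_NONE keeps the double quotes inside sentences as literal characters.
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        ParaphrasePair(label == "1", id1, id2, s1, s2)
        for label, id1, id2, s1, s2 in reader
    ]

pairs = read_mrpc(SAMPLE)
for pair in pairs:
    print(pair.is_paraphrase, "|", pair.sentence1)
```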

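3-4. STS-B (Semantic Textual Similarity Benchmark)

"STS-B" is the task of rating the similarity of two sentences (news headlines, captions, and the like) with a score from 0 to 5.

◎ Data format
Each line consists of 10 tab-separated columns.

・Column 1 : Index (0 and up).
・Column 2 : Genre.
・Column 3 : File name.
・Column 4 : Year.
・Column 5 : Index in the original dataset.
・Column 6 : Source of sentence 1.
・Column 7 : Source of sentence 2.
・Column 8 : Sentence 1.
・Column 9 : Sentence 2.
・Column 10 : Score (0.000 to 5.000).

◎ Sample
A sample is shown below.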

0	main-captions	MSRvid	2012test	0001	none	none	A plane is taking off.	An air plane is taking off.	5.000
1	main-captions	MSRvid	2012test	0004	none	none	A man is playing a large flute.	A man is playing a flute.	3.800
2	main-captions	MSRvid	2012test	0005	none	none	A man is spreading shreded cheese on a pizza.	A man is spreading shredded cheese on an uncooked pizza.	3.800
3	main-captions	MSRvid	2012test	0006	none	none	Three men are playing chess.	Two men are playing chess.	2.600
4	main-captions	MSRvid	2012test	0009	none	none	A man is playing the cello.	A man seated is playing the cello.	4.250
5	main-captions	MSRvid	2012test	0011	none	none	Some men are fighting.	Two men are fighting.	4.250
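
Because the two sentences and the score are the last three fields of each row, they can be picked out from the end of the line. The sketch below does this for two sample rows and rescales the score to the range 0 to 1. The rescaling is only an illustrative convenience, not something the benchmark requires, and `read_stsb` is a hypothetical name.

```python
# Two rows copied from the sample above.
SAMPLE = (
    "0\tmain-captions\tMSRvid\t2012test\t0001\tnone\tnone"
    "\tA plane is taking off.\tAn air plane is taking off.\t5.000\n"
    "3\tmain-captions\tMSRvid\t2012test\t0006\tnone\tnone"
    "\tThree men are playing chess.\tTwo men are playing chess.\t2.600\n"
)

MAX_SCORE = 5.0  # upper end of the similarity scale

def read_stsb(text):
    """Return (sentence1, sentence2, normalized_score) for each STS-B-style row."""
    pairs = []
    for line in text.splitlines():
        fields = line.split("\t")
        # The last three fields are sentence 1, sentence 2 and the score.
        sentence1, sentence2, score = fields[-3], fields[-2], float(fields[-1])
        pairs.append((sentence1, sentence2, score / MAX_SCORE))
    return pairs

pairs = read_stsb(SAMPLE)
for sentence1, sentence2, score in pairs:
    print(f"{score:.2f}  {sentence1}  <->  {sentence2}")
```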

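3-5. QQP (Quora Question Pairs)

"QQP" is the task of judging whether two questions have the same meaning. The dataset is drawn from the community question answering site Quora.

◎ Data format
Each line consists of 6 tab-separated columns.

・Column 1 : ID.
・Column 2 : ID of question 1.
・Column 3 : ID of question 2.
・Column 4 : Question 1.
・Column 5 : Question 2.
・Column 6 : Whether the two questions have the same meaning (0 = different meaning, 1 = same meaning).

◎ Sample
A sample is shown below.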

133273	213221	213222	How is the life of a math student? Could you describe your own experiences?	Which level of prepration is enough for the exam jlpt5?	0
402555	536040	536041	How do I control my horny emotions?	How do you control your horniness?	1
360472	364011	490273	What causes stool color to change to yellow?	What can cause stool to come out as little balls?	0
150662	155721	7256	What can one do after MBBS?	What do i do after my MBBS ?	1
183004	279958	279959	Where can I find a power outlet for my laptop at Melbourne Airport?	Would a second airport in Sydney, Australia be needed if a high-speed rail link was created between Melbourne and Sydney?	0
119056	193387	193388	How not to feel guilty since I am Muslim and I'm conscious we won't have sex together?	I don't beleive I am bulimic, but I force throw up atleast once a day after I eat something and feel guilty. Should I tell somebody, and if so who?	0
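
With six columns it can be easier to address fields by name than by position. The sketch below reads two sample rows with `csv.DictReader`. The column names in `FIELDNAMES` are supplied by this example as an assumption and are not taken from the article.

```python
import csv
import io

# Column names chosen for this example (an assumption, not taken from the article).
FIELDNAMES = ["id", "qid1", "qid2", "question1", "question2", "is_duplicate"]

# Two rows copied from the sample above.
SAMPLE = (
    "150662\t155721\t7256\tWhat can one do after MBBS?\tWhat do i do after my MBBS ?\t1\n"
    "360472\t364011\t490273\tWhat causes stool color to change to yellow?"
    "\tWhat can cause stool to come out as little balls?\t0\n"
)

reader = csv.DictReader(
    io.StringIO(SAMPLE),
    fieldnames=FIELDNAMES,
    delimiter="\t",
    quoting=csv.QUOTE_NONE,  # treat quotation marks in questions as plain text
)
rows = list(reader)

# Keep only the pairs labeled as having the same meaning.
duplicates = [
    (row["question1"], row["question2"])
    for row in rows
    if row["is_duplicate"] == "1"
]
print(duplicates)
```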

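3-6. MNLI-m (MultiNLI Matched) / MNLI-mm (MultiNLI Mismatched)

"MNLI-m" and "MNLI-mm" are tasks of classifying the relation between two sentences (entailment, contradiction, neutral). MNLI-m is evaluated on the same genres as the training data, and MNLI-mm on different genres.

◎ Data format
Each line consists of 12 tab-separated columns.

・Column 1 : Index (0 and up).
・Column 2 : Prompt ID.
・Column 3 : Pair ID.
・Column 4 : Genre.
・Column 5 : Binary parse of sentence 1.
・Column 6 : Binary parse of sentence 2.
・Column 7 : Parse of sentence 1.
・Column 8 : Parse of sentence 2.
・Column 9 : Sentence 1.
・Column 10 : Sentence 2.
・Column 11 : Label (entailment, contradiction, neutral).
・Column 12 : Gold label (entailment, contradiction, neutral).

◎ Sample
A sample is shown below.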

0	31193	31193n	government	( ( Conceptually ( cream skimming ) ) ( ( has ( ( ( two ( basic dimensions ) ) - ) ( ( product and ) geography ) ) ) . ) )	( ( ( Product and ) geography ) ( ( are ( what ( make ( cream ( skimming work ) ) ) ) ) . ) )	(ROOT (S (NP (JJ Conceptually) (NN cream) (NN skimming)) (VP (VBZ has) (NP (NP (CD two) (JJ basic) (NNS dimensions)) (: -) (NP (NN product) (CC and) (NN geography)))) (. .)))	(ROOT (S (NP (NN Product) (CC and) (NN geography)) (VP (VBP are) (SBAR (WHNP (WP what)) (S (VP (VBP make) (NP (NP (NN cream)) (VP (VBG skimming) (NP (NN work)))))))) (. .)))	Conceptually cream skimming has two basic dimensions - product and geography.	Product and geography are what make cream skimming work. 	neutral	neutral
1	101457	101457e	telephone	( you ( ( know ( during ( ( ( the season ) and ) ( i guess ) ) ) ) ( at ( at ( ( your level ) ( uh ( you ( ( ( lose them ) ( to ( the ( next level ) ) ) ) ( if ( ( if ( they ( decide ( to ( recall ( the ( the ( parent team ) ) ) ) ) ) ) ) ( ( the Braves ) ( decide ( to ( call ( to ( ( recall ( a guy ) ) ( from ( ( triple A ) ( ( ( then ( ( a ( double ( A guy ) ) ) ( ( goes up ) ( to ( replace him ) ) ) ) ) and ) ( ( a ( single ( A guy ) ) ) ( ( goes up ) ( to ( replace him ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) )	( You ( ( ( ( lose ( the things ) ) ( to ( the ( following level ) ) ) ) ( if ( ( the people ) recall ) ) ) . ) )	(ROOT (S (NP (PRP you)) (VP (VBP know) (PP (IN during) (NP (NP (DT the) (NN season)) (CC and) (NP (FW i) (FW guess)))) (PP (IN at) (IN at) (NP (NP (PRP$ your) (NN level)) (SBAR (S (INTJ (UH uh)) (NP (PRP you)) (VP (VBP lose) (NP (PRP them)) (PP (TO to) (NP (DT the) (JJ next) (NN level))) (SBAR (IN if) (S (SBAR (IN if) (S (NP (PRP they)) (VP (VBP decide) (S (VP (TO to) (VP (VB recall) (NP (DT the) (DT the) (NN parent) (NN team)))))))) (NP (DT the) (NNPS Braves)) (VP (VBP decide) (S (VP (TO to) (VP (VB call) (S (VP (TO to) (VP (VB recall) (NP (DT a) (NN guy)) (PP (IN from) (NP (NP (RB triple) (DT A)) (SBAR (S (S (ADVP (RB then)) (NP (DT a) (JJ double) (NNP A) (NN guy)) (VP (VBZ goes) (PRT (RP up)) (S (VP (TO to) (VP (VB replace) (NP (PRP him))))))) (CC and) (S (NP (DT a) (JJ single) (NNP A) (NN guy)) (VP (VBZ goes) (PRT (RP up)) (S (VP (TO to) (VP (VB replace) (NP (PRP him))))))))))))))))))))))))))))	(ROOT (S (NP (PRP You)) (VP (VBP lose) (NP (DT the) (NNS things)) (PP (TO to) (NP (DT the) (JJ following) (NN level))) (SBAR (IN if) (S (NP (DT the) (NNS people)) (VP (VBP recall))))) (. .)))	you know during the season and i guess at at your level uh you lose them to the next level if if they decide to recall the the parent team the Braves decide to call to recall a guy from triple A then a double A guy goes up to replace him and a single A guy goes up to replace him	You lose the things to the following level if the people recall.	entailment	entailment
2	134793	134793e	fiction	( ( One ( of ( our number ) ) ) ( ( will ( ( ( carry out ) ( your instructions ) ) minutely ) ) . ) )	( ( ( A member ) ( of ( my team ) ) ) ( ( will ( ( execute ( your orders ) ) ( with ( immense precision ) ) ) ) . ) )	(ROOT (S (NP (NP (CD One)) (PP (IN of) (NP (PRP$ our) (NN number)))) (VP (MD will) (VP (VB carry) (PRT (RP out)) (NP (PRP$ your) (NNS instructions)) (ADVP (RB minutely)))) (. .)))	(ROOT (S (NP (NP (DT A) (NN member)) (PP (IN of) (NP (PRP$ my) (NN team)))) (VP (MD will) (VP (VB execute) (NP (PRP$ your) (NNS orders)) (PP (IN with) (NP (JJ immense) (NN precision))))) (. .)))	One of our number will carry out your instructions minutely.	A member of my team will execute your orders with immense precision.	entailment	entailment
3	37397	37397e	fiction	( ( How ( ( ( do you ) know ) ? ) ) ( ( All this ) ( ( ( is ( their information ) ) again ) . ) ) )	( ( This information ) ( ( belongs ( to them ) ) . ) )	(ROOT (S (SBARQ (WHADVP (WRB How)) (SQ (VBP do) (NP (PRP you)) (VP (VB know))) (. ?)) (NP (PDT All) (DT this)) (VP (VBZ is) (NP (PRP$ their) (NN information)) (ADVP (RB again))) (. .)))	(ROOT (S (NP (DT This) (NN information)) (VP (VBZ belongs) (PP (TO to) (NP (PRP them)))) (. .)))	How do you know? All this is their information again.	This information belongs to them.	entailment	entailment
4	50563	50563n	telephone	( yeah ( i ( ( tell you ) ( what ( ( though ( if ( you ( go ( price ( some ( of ( those ( tennis shoes ) ) ) ) ) ) ) ) ) ( i ( can ( see ( why ( now ( you ( know ( they ( 're ( ( getting up ) ( in ( the ( hundred ( dollar range ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) )	( ( The ( tennis shoes ) ) ( ( have ( ( a range ) ( of prices ) ) ) . ) )	(ROOT (S (VP (VB yeah) (S (NP (FW i)) (VP (VB tell) (NP (PRP you)) (SBAR (WHNP (WP what)) (S (SBAR (RB though) (IN if) (S (NP (PRP you)) (VP (VBP go) (VP (VB price) (NP (NP (DT some)) (PP (IN of) (NP (DT those) (NN tennis) (NNS shoes)))))))) (NP (FW i)) (VP (MD can) (VP (VB see) (SBAR (WHADVP (WRB why)) (S (ADVP (RB now)) (NP (PRP you)) (VP (VBP know) (SBAR (S (NP (PRP they)) (VP (VBP 're) (VP (VBG getting) (PRT (RP up)) (PP (IN in) (NP (DT the) (CD hundred) (NN dollar) (NN range)))))))))))))))))))	(ROOT (S (NP (DT The) (NN tennis) (NNS shoes)) (VP (VBP have) (NP (NP (DT a) (NN range)) (PP (IN of) (NP (NNS prices))))) (. .)))	yeah i tell you what though if you go price some of those tennis shoes i can see why now you know they're getting up in the hundred dollar range	The tennis shoes have a range of prices.	neutral	neutral
5	110116	110116e	telephone	( ( my walkman ) ( broke ( so ( i ( 'm ( upset ( now ( i ( just ( have ( to ( ( turn ( the stereo ) ) ( up ( real loud ) ) ) ) ) ) ) ) ) ) ) ) ) )	( ( ( ( I ( 'm ( upset ( that ( ( my walkman ) broke ) ) ) ) ) and ) ( now ( I ( have ( to ( ( turn ( the stereo ) ) ( up ( really loud ) ) ) ) ) ) ) ) . )	(ROOT (S (NP (PRP$ my) (NN walkman)) (VP (VBD broke) (SBAR (IN so) (S (NP (FW i)) (VP (VBP 'm) (ADJP (VBN upset) (SBAR (RB now) (S (NP (FW i)) (ADVP (RB just)) (VP (VBP have) (S (VP (TO to) (VP (VB turn) (NP (DT the) (NN stereo)) (ADVP (RB up) (RB real) (JJ loud)))))))))))))))	(ROOT (S (S (NP (PRP I)) (VP (VBP 'm) (ADJP (VBN upset) (SBAR (IN that) (S (NP (PRP$ my) (NN walkman)) (VP (VBD broke))))))) (CC and) (S (ADVP (RB now)) (NP (PRP I)) (VP (VBP have) (S (VP (TO to) (VP (VB turn) (NP (DT the) (NN stereo)) (ADVP (RB up) (RB really) (JJ loud))))))) (. .)))	my walkman broke so i'm upset now i just have to turn the stereo up real loud	I'm upset that my walkman broke and now I have to turn the stereo up really loud.	entailment	entailment
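
Most of an MNLI row consists of parse trees, so a reader that only needs the plain sentences can select a few columns by position. The sketch below rebuilds the row with index 3 from the sample above and extracts the genre, both sentences and the gold label. `LABEL_TO_ID` is an arbitrary mapping chosen for this example, and `parse_mnli_row` is a hypothetical helper.

```python
# The row with index 3 from the sample above, rebuilt field by field.
ROW = "\t".join([
    "3", "37397", "37397e", "fiction",
    "( ( How ( ( ( do you ) know ) ? ) ) ( ( All this ) ( ( ( is ( their information ) ) again ) . ) ) )",
    "( ( This information ) ( ( belongs ( to them ) ) . ) )",
    "(ROOT (S (SBARQ (WHADVP (WRB How)) (SQ (VBP do) (NP (PRP you)) (VP (VB know))) (. ?)) "
    "(NP (PDT All) (DT this)) (VP (VBZ is) (NP (PRP$ their) (NN information)) (ADVP (RB again))) (. .)))",
    "(ROOT (S (NP (DT This) (NN information)) (VP (VBZ belongs) (PP (TO to) (NP (PRP them)))) (. .)))",
    "How do you know? All this is their information again.",
    "This information belongs to them.",
    "entailment", "entailment",
])

# Arbitrary label-to-integer mapping, chosen only for this example.
LABEL_TO_ID = {"entailment": 0, "neutral": 1, "contradiction": 2}

def parse_mnli_row(line):
    """Pick the genre, the two plain sentences and the gold label from one row."""
    fields = line.split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, got {len(fields)}")
    return {
        "genre": fields[3],
        "sentence1": fields[8],
        "sentence2": fields[9],
        "label_id": LABEL_TO_ID[fields[11]],
    }

example = parse_mnli_row(ROW)
print(example)
```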

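3-7. QNLI (Question NLI)

"QNLI" is the task of judging whether a sentence contains the answer to a question.

◎ Data format
Each line consists of 4 tab-separated columns.

・Column 1 : Index (0 and up).
・Column 2 : Question.
・Column 3 : Sentence.
・Column 4 : Label (entailment = the sentence contains the answer, not_entailment = it does not).

◎ Sample
A sample is shown below.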

0	When did the third Digimon series begin?	Unlike the two seasons before it and most of the seasons that followed, Digimon Tamers takes a darker and more realistic approach to its story featuring Digimon who do not reincarnate after their deaths and more complex character development in the original Japanese.	not_entailment
1	Which missile batteries often have individual launchers several kilometres from one another?	When MANPADS is operated by specialists, batteries may have several dozen teams deploying separately in small sections; self-propelled air defence guns may deploy in pairs.	not_entailment
2	What two things does Popper argue Tarski's theory involves in an evaluation of truth?	He bases this interpretation on the fact that examples such as the one described above refer to two things: assertions and the facts to which they refer.	entailment
3	What is the name of the village 9 miles north of Calafat where the Ottoman forces attacked the Russians?	On 31 December 1853, the Ottoman forces at Calafat moved against the Russian force at Chetatea or Cetate, a small village nine miles north of Calafat, and engaged them on 6 January 1854.	entailment
4	What famous palace is located in London?	London contains four World Heritage Sites: the Tower of London; Kew Gardens; the site comprising the Palace of Westminster, Westminster Abbey, and St Margaret's Church; and the historic settlement of Greenwich (in which the Royal Observatory, Greenwich marks the Prime Meridian, 0° longitude, and GMT).	not_entailment
5	When is the term 'German dialects' used in regard to the German language?	When talking about the German language, the term German dialects is only used for the traditional regional varieties.	entailment
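
Since the QNLI label is a string with only two values, one way to handle it is to convert it to a boolean when loading. The sketch below does this for two sample rows. The `QnliExample` dataclass and its field names are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class QnliExample:
    index: int
    question: str
    sentence: str
    contains_answer: bool

# Two rows copied from the sample above.
SAMPLE = (
    "1\tWhich missile batteries often have individual launchers several kilometres from one another?"
    "\tWhen MANPADS is operated by specialists, batteries may have several dozen teams deploying "
    "separately in small sections; self-propelled air defence guns may deploy in pairs.\tnot_entailment\n"
    "5\tWhen is the term 'German dialects' used in regard to the German language?"
    "\tWhen talking about the German language, the term German dialects is only used for the "
    "traditional regional varieties.\tentailment\n"
)

def read_qnli(text):
    """Parse QNLI-style TSV text into QnliExample records."""
    examples = []
    for line in text.splitlines():
        index, question, sentence, label = line.split("\t")
        examples.append(
            QnliExample(int(index), question, sentence, label == "entailment")
        )
    return examples

examples = read_qnli(SAMPLE)
for ex in examples:
    print(ex.index, ex.contains_answer, ex.question)
```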

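3-8. RTE (Recognizing Textual Entailment)

"RTE" is the task of classifying the relation between two sentences (entailment, not entailment).

◎ Data format
Each line consists of 4 tab-separated columns.

・Column 1 : Index (0 and up).
・Column 2 : Sentence 1.
・Column 3 : Sentence 2.
・Column 4 : Label (entailment, not_entailment).

◎ Sample
A sample is shown below.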

0	No Weapons of Mass Destruction Found in Iraq Yet.	Weapons of Mass Destruction Found in Iraq.	not_entailment
1	A place of sorrow, after Pope John Paul II died, became a place of celebration, as Roman Catholic faithful gathered in downtown Chicago to mark the installation of new Pope Benedict XVI.	Pope Benedict XVI is the new leader of the Roman Catholic Church.	entailment
2	Herceptin was already approved to treat the sickest breast cancer patients, and the company said, Monday, it will discuss with federal regulators the possibility of prescribing the drug for more breast cancer patients.	Herceptin can be used to treat breast cancer.	entailment
3	Judie Vivian, chief executive at ProMedica, a medical service company that helps sustain the 2-year-old Vietnam Heart Institute in Ho Chi Minh City (formerly Saigon), said that so far about 1,500 children have received treatment.	The previous name of Ho Chi Minh City was Saigon.	entailment
4	A man is due in court later charged with the murder 26 years ago of a teenager whose case was the first to be featured on BBC One's Crimewatch. Colette Aram, 16, was walking to her boyfriend's house in Keyworth, Nottinghamshire, on 30 October 1983 when she disappeared. Her body was later found in a field close to her home. Paul Stewart Hutchinson, 50, has been charged with murder and is due before Nottingham magistrates later.	Paul Stewart Hutchinson is accused of having stabbed a girl.	not_entailment
5	Britain said, Friday, that it has barred cleric, Omar Bakri, from returning to the country from Lebanon, where he was released by police after being detained for 24 hours.	Bakri was briefly detained, but was released.	entailment

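3-9. WNLI (Winograd NLI)

"WNLI" is the task of judging whether a sentence in which the pronoun has been replaced is entailed by the original sentence.

◎ Data format
Each line consists of 4 tab-separated columns.

・Column 1 : Index (0 and up).
・Column 2 : Sentence 1 (the original sentence).
・Column 3 : Sentence 2 (the sentence with the pronoun replaced).
・Column 4 : Label (0 = not entailed, 1 = entailed).

◎ Sample
A sample is shown below.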

0	I stuck a pin through a carrot. When I pulled the pin out, it had a hole.	The carrot had a hole.	1
1	John couldn't see the stage with Billy in front of him because he is so short.	John is so short.	1
2	The police arrested all of the gang members. They were trying to stop the drug trade in the neighborhood.	The police were trying to stop the drug trade in the neighborhood.	1
3	Steve follows Fred's example in everything. He influences him hugely.	Steve influences him hugely.	0
4	When Tatyana reached the cabin, her mother was sleeping. She was careful not to disturb her, undressing and climbing back into her berth.	mother was careful not to disturb her, undressing and climbing back into her berth.	0
5	George got free tickets to the play, but he gave them to Eric, because he was particularly eager to see it.	George was particularly eager to see it.	0
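
To make the meaning of the 0/1 label concrete, the sketch below groups two of the sample rows by label. The grouping and the name `group_wnli_by_label` are illustrative only.

```python
from collections import defaultdict

# Two rows copied from the sample above.
SAMPLE = (
    "0\tI stuck a pin through a carrot. When I pulled the pin out, it had a hole."
    "\tThe carrot had a hole.\t1\n"
    "3\tSteve follows Fred's example in everything. He influences him hugely."
    "\tSteve influences him hugely.\t0\n"
)

def group_wnli_by_label(text):
    """Map each label (0 or 1) to its list of (sentence1, sentence2) pairs."""
    groups = defaultdict(list)
    for line in text.splitlines():
        _index, sentence1, sentence2, label = line.split("\t")
        groups[int(label)].append((sentence1, sentence2))
    return groups

groups = group_wnli_by_label(SAMPLE)
for label in sorted(groups):
    status = "entailed" if label == 1 else "not entailed"
    for _sentence1, sentence2 in groups[label]:
        print(f"{status}: {sentence2}")
```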

[Bonus] SQuAD

◎ SQuAD (Stanford Question Answering Dataset)
A question answering dataset. Its structure, written in Python dict notation with comments, is as follows.

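{
  "version": "v2.0",
  "data": [
    {
      "title": "Sample",  # title
      "paragraphs": [
        {
          "context": "On Sunday I went to Akihabara with a friend.",  # context
          "qas": [
            {
              "id": "00001",  # question ID
              "question": "Where did you go?",  # question
              "is_impossible": False,  # True if the question cannot be answered from the context
              "answers": [
                {
                  "text": "Akihabara",  # answer
                  "answer_start": 20  # character index in the context where the answer starts
                },
                # ... more answers
              ]
            },
          ]
        },
        # ... more paragraphs
      ]
    },
    # ... more articles
  ]
}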
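
The `answer_start` field is a character offset into `context`, so slicing the context at that offset should give back the answer text. The sketch below checks this on a small SQuAD-style record of the same shape as the one above, written as JSON with English text. The record contents and the function name `iter_answer_spans` are made up for illustration.

```python
import json

# A small SQuAD-style record (illustrative data, same shape as described above).
SQUAD_JSON = """
{
  "version": "v2.0",
  "data": [
    {
      "title": "Sample",
      "paragraphs": [
        {
          "context": "On Sunday I went to Akihabara with a friend.",
          "qas": [
            {
              "id": "00001",
              "question": "Where did you go?",
              "is_impossible": false,
              "answers": [{"text": "Akihabara", "answer_start": 20}]
            }
          ]
        }
      ]
    }
  ]
}
"""

def iter_answer_spans(dataset):
    """Yield (question_id, answer_text, context_slice) for every answer."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    end = start + len(answer["text"])
                    yield qa["id"], answer["text"], context[start:end]

dataset = json.loads(SQUAD_JSON)
spans = list(iter_answer_spans(dataset))
for question_id, text, context_slice in spans:
    print(question_id, text == context_slice)
```

If the slice and the answer text differ, the offset in the record is wrong for that context.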


