
Cheat! Realistic evaluation of LLMs without cheating

The world of local large language models (LLMs) has grown massively in the past year, with many amazing models such as Phi and Mistral showing real promise as useful AI that can run on more devices than ever before.

However, it remains hard to quantify how much smarter these models have become in the past year.

While benchmarks such as those tracked by the Open LLM Leaderboard are useful quantifications of the “intelligence” of a given AI model, they can also present a moral hazard.

The Open LLM Leaderboard

Because the training data of many models has not been made public, it can be hard to determine whether a model was previously trained on the very data it is being evaluated on (and is thus cheating). This makes it easy for an unscrupulous AI developer to secretly train their model on popular benchmarks and release the flawed model to fanfare and adulation thanks to its perceived high performance on those benchmarks.

While techniques exist that can help detect whether a given piece of text was included in a model’s training data, engineers looking to game the benchmarks could still find ways to evade such detection in the future.

A different approach to evaluating LLMs is to put questions to them directly, one at a time. This idea has been scaled up into the Chatbot Arena Leaderboard, which lets users compare the outputs of two LLMs blind and vote for whichever output is better.

All of these matchups are then aggregated into one large leaderboard, where each LLM’s score is based on a chess-style Elo rating.
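For readers unfamiliar with Elo ratings, here is a minimal sketch of a chess-style update applied to a single blind matchup; the K-factor and starting ratings are illustrative assumptions, not the parameters actually used by the Chatbot Arena Leaderboard.

```python
# Minimal sketch of a chess-style Elo update for one blind matchup.
# The K-factor and starting ratings are illustrative assumptions, not the
# actual parameters of the Chatbot Arena Leaderboard.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated ratings after one matchup (score_a: 1 win, 0.5 tie, 0 loss)."""
    e_a = expected_score(rating_a, rating_b)
    return rating_a + k * (score_a - e_a), rating_b + k * ((1.0 - score_a) - (1.0 - e_a))

# Example: two equally rated models, model A wins the blind comparison.
print(update_elo(1000.0, 1000.0, score_a=1.0))  # -> (1016.0, 984.0)
```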

The current Chatbot Arena Leaderboard rankings

From this, we can identify the LLM that is generally most useful for an average user, with a much lower chance that the model is cheating in some way.

While this approach is undoubtedly helpful for the average user, it still requires thousands of users performing many matchups to evaluate an LLM fully, which makes it unrealistic for all but the most popular LLMs.

So, how can we evaluate an LLM in a small-scale way without having the possibility of the model cheating in its answers? We propose the CNN Weekly Quiz as a possible solution.

Since April 2022, CNN has released a “5 Things CNN Weekly Quiz” which consists of 10 questions about the previous week’s news.

A typical question from the CNN News Quiz (2024 Jan 4th edition)

These questions have multiple-choice answers, and upon answering, the correct answer is revealed along with a link to the news article that answers it.

We propose that this data could be used as a weekly updated reading comprehension quiz.

A CNN article is linked to almost every answer, giving a grounding reference for each question-answer pair.

The timeliness of these quizzes means the questions cannot have been seen during training, provided the right cut-off date is chosen for which weekly quizzes to use: as long as the quizzes used for evaluation were written after the most recent publication date of an LLM, it should be impossible for the model to have learned the answers ahead of time.
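As a concrete illustration, the check below keeps only quizzes published after a model’s training cut-off; the quiz_date field name and ISO date format are assumptions about how the records might be stored, not the published schema.

```python
from datetime import date

# Keep only quizzes published strictly after the model's training cut-off,
# so the model cannot have seen the questions or answers during training.
# The "quiz_date" field name is a hypothetical choice for illustration.
def filter_unseen_quizzes(quizzes: list[dict], model_cutoff: date) -> list[dict]:
    return [q for q in quizzes if date.fromisoformat(q["quiz_date"]) > model_cutoff]

quizzes = [{"quiz_date": "2023-09-28"}, {"quiz_date": "2024-01-04"}]
print(filter_unseen_quizzes(quizzes, model_cutoff=date(2023, 10, 1)))
# -> [{'quiz_date': '2024-01-04'}]
```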

Moreover, the multiple-choice nature of the dataset means that it can be used to evaluate LLMs automatically, by simply taking the option the model assigns the highest probability of generating as its answer (as has been done in other LLM evaluation packages).
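As a minimal sketch of that scheme, the snippet below scores each option by the average log-likelihood the model assigns to it given the question; the model name and prompt format are placeholders, and established harnesses such as EleutherAI’s lm-evaluation-harness implement more careful variants of the same idea.

```python
# Sketch: pick the multiple-choice option the model assigns the highest
# (length-normalised) log-likelihood, conditioned on the question.
# "gpt2" is a stand-in; swap in the LLM you actually want to evaluate.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def choice_logprob(question: str, choice: str) -> float:
    """Average log-probability of the choice tokens, given the question as context."""
    prompt_ids = tokenizer(f"Question: {question}\nAnswer:", return_tensors="pt").input_ids
    choice_ids = tokenizer(" " + choice, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, choice_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # The logit at position i predicts token i+1, so the choice tokens are
    # scored by the logits starting one position before the choice begins.
    log_probs = torch.log_softmax(logits[0, prompt_ids.shape[1] - 1 : -1], dim=-1)
    token_scores = log_probs.gather(1, choice_ids[0].unsqueeze(1)).squeeze(1)
    return token_scores.mean().item()

def answer(question: str, choices: list[str]) -> str:
    """Return the option with the highest score as the model's answer."""
    return max(choices, key=lambda c: choice_logprob(question, c))
```

Running answer() over every question in a weekly quiz and comparing against the stored correct answers then gives an accuracy score for that week.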

The data structure of the generated CNN weekly quiz dataset
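The exact schema is defined by the generation code linked below, but each record presumably bundles the quiz date, the question text, the multiple-choice options, the correct answer, and the grounding CNN article; the dictionary below is a hypothetical illustration of that shape, not the actual output format.

```python
# Hypothetical shape of one quiz record (field names are assumptions).
example_record = {
    "quiz_date": "2024-01-04",               # publication date of the weekly quiz
    "question": "…",                         # question text from the quiz
    "choices": ["…", "…", "…", "…"],         # the multiple-choice options
    "answer": "…",                           # correct option, revealed on answering
    "article_url": "https://www.cnn.com/…",  # linked CNN article grounding the answer
}
```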

We provide the code for downloading and generating this quiz dataset here.

The copyright of the text from the CNN Weekly Quiz and all associated articles belongs to CNN.

In summary, we invite all LLM developers, researchers and practitioners to evaluate their models using this dataset and hope that this leads to a new paradigm of up-to-date evaluation of text-based AIs.
