
How to train a Karasu

Our team at Lightblue has developed a new state-of-the-art public large language model (LLM) for Japanese called ao-Karasu. In this article, I would like to share how we made this model, what it can do, what it cannot do, and what we are doing to further improve LLMs in Japanese.

Making the model

In order to make the best possible LLM in Japanese, we decided to fine-tune a pre-trained public model. This was done as the cost of pre-training a model from scratch is currently estimated in the billions of dollars (Llama 3 is being trained on 600,000 H100 equivalents, with a price tag of >$10 billion).

Therefore, we needed to choose a model which already has good base performance before we further improve its Japanese abilities.

Base model choice

Based on the benchmarks available to us at the time, such as MT-Bench, we found that the Qwen 1.5 72B Chat model was the most performant on general purpose leaderboards such as the LMSYS Chatbot Arena Leaderboard:

LMSYS Chatbot Arena Leaderboard score

This is a global benchmark evaluated by users from all over the world, meaning that relatively little of this evaluation directly measures Japanese ability. However, from this we can see which models are generally useful to users worldwide. We hypothesize that it is easier to teach an LLM Japanese than to teach it facts and logic, so we chose a model based on this multilingual benchmark.
From this benchmark, we found that Qwen was the standout, being the only model in the top 10 of this leaderboard that is not proprietary (i.e. the weights are available to run offline). We salute the Qwen team for their excellent model and their commitment to open source!

Dataset choice

We decided to train our model on a large dataset of more than a million entries drawn from several different sources.
Much of our dataset was made up of data that we used previously to train our other models, such as Qarasu.

Our main additions this time were:

  • Publicly available technical blogs written in Japanese

  • Publicly available news stories

  • Publicly available QA site answers

For the QA site data, we filtered based on answer popularity and then took the top answer for each question as the AI response. This has obvious drawbacks in that the responses are not necessarily written from an AI's perspective (e.g. "Me and my husband…" would be strange to hear from an AI), so we are aiming to do further filtering in the future.
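
As an illustration, here is a minimal sketch of the kind of popularity filtering described above; the `likes` field, the threshold, and the chat message format are assumptions for illustration rather than our exact pipeline.

```python
# Hypothetical sketch: keep only sufficiently popular answers, then use the
# single most popular answer per question as the assistant response.
def qa_to_training_example(question: str, answers: list[dict], min_likes: int = 5):
    """`answers` is assumed to look like [{"text": ..., "likes": ...}, ...]."""
    popular = [a for a in answers if a.get("likes", 0) >= min_likes]
    if not popular:
        return None  # drop questions with no well-received answer
    best = max(popular, key=lambda a: a["likes"])
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": best["text"]},
        ]
    }
```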

For the news and technical articles, we created a chatbot-style supervised learning task by having the LLM either generate an article from its title or generate a title from an article, with a 50% chance of either.
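
To make this concrete, below is a minimal sketch of how such an example could be built; the prompt wording and chat format are illustrative assumptions, not our exact templates.

```python
import random

# Hypothetical sketch: turn a (title, article) pair into a chat-style example,
# asking for the article given the title or the title given the article,
# with a 50% chance of either direction.
def make_title_article_example(title: str, article: str) -> dict:
    if random.random() < 0.5:
        prompt = f"次のタイトルで記事を書いてください。\nタイトル: {title}"
        response = article
    else:
        prompt = f"次の記事にタイトルを付けてください。\n記事: {article}"
        response = title
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response},
        ]
    }
```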

This dataset amounted to more than 1.1 billion characters, the largest components of which were:

  • ~450 million characters from Wikipedia-based QA (same as Qarasu)

  • ~200 million characters from technical blogs (new)

  • ~200 million characters from Japanese QA site answers (new)

  • ~100 million characters from LLM generated prompts and responses (same as Qarasu)

  • ~70 million characters from news articles (new)

Training

In order to train Qwen 72B at full precision, we would require more than 600GB of VRAM and many weeks of training, meaning the total cost of training such a model would be tens of thousands of dollars. We hypothesized that we could instead train using a lower-precision, parameter-efficient method (i.e. QLoRA: LoRA on a quantized base model) and still get good results when teaching Qwen to understand Japanese. This would reduce the necessary VRAM to ~80GB and reduce the training cost to a few thousand dollars. After training, we simply merged our trained adapter into the full-precision 72B model and then quantized that model using AutoAWQ.
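
As a rough sketch of this last step, merging a PEFT LoRA adapter and quantizing with AutoAWQ could look something like the following; the paths and quantization settings are illustrative assumptions, not our exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
from awq import AutoAWQForCausalLM

BASE = "Qwen/Qwen1.5-72B-Chat"
ADAPTER = "path/to/lora-adapter"        # hypothetical adapter path
MERGED = "aokarasu-72B-merged"          # hypothetical output directories
QUANTIZED = "aokarasu-72B-AWQ-4bit"

# 1. Load the full-precision base model and merge the LoRA adapter into it.
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")
merged = PeftModel.from_pretrained(base, ADAPTER).merge_and_unload()
merged.save_pretrained(MERGED)

tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.save_pretrained(MERGED)

# 2. Quantize the merged model to 4 bit with AutoAWQ.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model = AutoAWQForCausalLM.from_pretrained(MERGED)
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(QUANTIZED)
tokenizer.save_pretrained(QUANTIZED)
```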

We planned to train our model for roughly 40 days in order to cover all 1.1 billion characters, but we found that the training loss reached a plateau after roughly 400 steps (about one day).

Training loss

After three days, we had three checkpoints saved - at 388 steps, 776 steps, and 1164 steps. Upon evaluation of the checkpoints, we found that the first checkpoint had a marked boost in Japanese MT-Bench scores compared to the base Qwen 1.5 72B Chat model, while the subsequent two checkpoints had worse scores than the base model. Thus we abandoned training after three days and a cost of ~$400.
We hypothesize that a small amount of training on our Japanese data (i.e. a few hundred steps, amounting to ~20M characters) improved the Japanese comprehension and generation abilities of our model. However, our training data may have been too noisy, so after initially learning some Japanese, our model may have overfit to the distribution of our noisy training data rather than becoming an ever more competent general AI assistant.
From our experience training this model, we have found that we can make notable improvements to a model using relatively few resources (only ~20M characters and QLoRA). We have also found that training data needs to be very clean and close to the target data distribution (in our case, general AI assistant conversations) to be useful. Thus, in the future we aim to build a much smaller, filtered, high-quality dataset in Japanese, similar to LIMA in English.

Evaluation

First, we evaluated our model on the MT-Bench benchmark and found that its scores exceeded those of GPT 3.5!

AoKarasu-72B MT-Bench (Ja) performance

However, as good as the benchmark result is, we also have to soberly analyse how the model performs in real life. We ran an internal demo where everyone at Lightblue, from engineers to the sales team to HR, tested the model and gave their feedback. With this internal evaluation, we were able to get a better idea of exactly how well our model performs in real-world situations.

What it can do

We found that the model was good at logical reasoning and writing. For example, the model was able to come up with some interesting gift ideas for a dad who likes fishing, under the condition that it did not suggest fishing rods and the like.

Idea generation by AoKarasu

What it cannot do

We found that the model was predisposed to quite aggressive hallucinations when asked for a fact outright. For example, it said the President of Brazil is "Joan de Aliju Pendragon", which would be an interesting name for a novel, but is unfortunately not the correct answer.

An example of a mistake by AoKarasu

So we found that our model is seemingly quite "intelligent", in that it can perform many logical and creative tasks quite well, but is somewhat lacking in "knowledge".

This also calls into question the efficacy of MT-Bench as a benchmark for model quality. We know that an LLM such as GPT 3.5 would usually not give such a fictitious answer for the president of Brazil, somewhat diluting the importance of AoKarasu gaining higher MT-Bench scores. Therefore, we require more sophisticated and varied evaluation criteria for measuring model quality in the future.

However, we are encouraged by the overall progress of our model compared to our previous Qarasu model, so we will endeavour to fix the inadequacies of AoKarasu to make it an even better contribution to the Japanese AI community.

Improvements

As we have seen, AoKarasu performs well in many situations, but is still deficient in some key areas. We propose three ways to further improve this model:

  • Quality over quantity with training data
    We have seen that our model reached a peak in performance relatively early in training on our somewhat noisy large finetuning dataset. Therefore, we propose training the model on a much smaller, more focused, higher quality dataset.

  • Adding grounding
    In order to fix hallucinations, we propose training on RAG and function-calling style datasets to ground all answers in factual information. This would allow the LLM to reference correct information at inference time before replying, when appropriate (see the sketch after this list).

  • Better evaluations
    We propose evaluating models on a suite of conversation benchmarks to get a more holistic appraisal of the performance of a model. We envisage a scenario where MT-Bench scores are used alongside ELYZA-100 and LB-Bench scores. More to come on LB-Bench in the coming weeks… 👀
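
As a concrete illustration of the grounding idea in the second point above, a RAG-style training example might be formatted something like this; the prompt template and field names are assumptions for illustration only.

```python
# Hypothetical sketch: build a training example whose answer must be grounded
# in retrieved reference documents supplied in the prompt.
def make_grounded_example(question: str, retrieved_docs: list[str], answer: str) -> dict:
    context = "\n\n".join(retrieved_docs)
    prompt = (
        "以下の参考情報のみに基づいて質問に答えてください。\n"
        f"参考情報:\n{context}\n\n質問: {question}"
    )
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": answer},
        ]
    }
```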

Conclusion

We invite the community to try our latest LLM and follow us on Huggingface. We want to make truly spectacular Japanese LLMs that are as useful as possible to as many people as possible, and we hope you will join us on this mission.

Huggingface: https://huggingface.co/lightblue

Model page: https://huggingface.co/lightblue/aokarasu-72B

Model page (4bit): https://huggingface.co/lightblue/aokarasu-72B-AWQ-4bit
