Human evaluation has been the gold standard for assessing the quality and accuracy of large language models (LLMs), especially for open-ended tasks such as creative writing and coding. However, human evaluation is slow, expensive, and often requires specialized expertise.
Researchers at Meta FAIR have introduced a novel approach called the Self-Taught Evaluator, which leverages synthetic data to train LLM evaluators without the need for human annotations. The method comes with a few caveats, but it could significantly improve the efficiency and scalability of LLM evaluation for enterprises that want to build custom models.
The challenges of LLM evaluation
LLMs are often used as evaluators themselves, playing a crucial role in aligning other models with human preferences or improving their own performance during training. This is especially important for tasks where multiple valid answers are possible, as is often the case with creative or complex instructions.
However, training accurate LLM evaluators typically relies on extensive human-annotated data, which is costly and time-consuming to acquire. This bottleneck is self-defeating: the evaluators meant to replace slow human judgment depend on that same human effort, hindering the rapid development and deployment of new LLM-based applications.
The Self-Taught Evaluator addresses this challenge by using a training approach that eliminates the need for human-labeled data. It is built on top of the LLM-as-a-Judge concept, where the model is provided with an input, two possible answers, and an evaluation prompt. The LLM-as-a-Judge model aims to determine which response is better by generating a reasoning chain that reaches the correct result.
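To make the setup concrete, here is a minimal sketch of the LLM-as-a-Judge pattern in Python. The `complete` function is a placeholder for any chat-completion call (a local Llama endpoint, for instance), and the prompt wording is illustrative, not the paper's exact template.

```python
# Minimal LLM-as-a-Judge sketch. `complete(prompt)` stands in for any
# chat-completion call; it is a placeholder, not a real library function.

JUDGE_TEMPLATE = """You are an impartial judge. Given a user instruction and
two candidate responses, reason step by step about which response is better,
then end with a final line reading "Verdict: A" or "Verdict: B".

Instruction:
{instruction}

Response A:
{response_a}

Response B:
{response_b}
"""

def judge(complete, instruction: str, response_a: str, response_b: str) -> tuple[str, str]:
    """Return (reasoning_trace, verdict), where verdict is 'A' or 'B'."""
    trace = complete(JUDGE_TEMPLATE.format(
        instruction=instruction, response_a=response_a, response_b=response_b))
    # The verdict is parsed from the last line of the generated reasoning chain.
    verdict = "A" if trace.strip().splitlines()[-1].strip().endswith("A") else "B"
    return trace, verdict
```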
The Self-Taught Evaluator starts with a seed LLM and a large collection of unlabeled human-written instructions, such as those commonly found in production systems.
First, the model selects a set of instructions from the uncurated pool. For each instruction, the Self-Taught Evaluator generates a pair of model responses: one designated as “chosen” and the other as “rejected.” The chosen response is designed to be of higher quality than the rejected response.
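The snippet below sketches one plausible way to synthesize such a pair, following the paper's general idea of degrading quality by answering a slightly modified version of the instruction; the prompt text and helper names are illustrative assumptions, not the paper's exact templates.

```python
def make_preference_pair(complete, instruction: str) -> dict:
    """Create a synthetic (chosen, rejected) pair for one unlabeled instruction.

    `complete(prompt)` is a placeholder for any LLM completion call.
    """
    # Chosen: a normal, best-effort response to the original instruction.
    chosen = complete(instruction)

    # Rejected: ask the model for a similar but subtly different instruction,
    # then answer *that* one. The answer is plausible yet off-target for the
    # original instruction, so it serves as the lower-quality response.
    modified_instruction = complete(
        "Write a new instruction that is similar to the one below but asks "
        "for something slightly different:\n\n" + instruction)
    rejected = complete(modified_instruction)

    return {"instruction": instruction, "chosen": chosen, "rejected": rejected}
```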
The model is then trained iteratively. In each iteration, it samples multiple LLM-as-a-Judge reasoning traces and judgments for each example. If a reasoning chain reaches the correct verdict, meaning it prefers the chosen response, the example is added to the training set. The final dataset consists of examples that each comprise the input instruction, the pair of chosen and rejected answers, and a judgment chain. The model is then fine-tuned on this new training set, producing an updated model for the next iteration.
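Below is a sketch of one such iteration, reusing the `judge` and `make_preference_pair` helpers from the earlier snippets. The sampling count and the final `fine_tune` call are stand-ins for whatever training stack is actually used.

```python
import random

def build_training_set(complete, judge, pairs, samples_per_example: int = 8):
    """One Self-Taught Evaluator iteration: keep only reasoning traces whose
    verdict agrees with the synthetic label, then fine-tune on them."""
    training_examples = []
    for pair in pairs:
        for _ in range(samples_per_example):
            # Randomize the answer order so the judge cannot rely on position.
            swap = random.random() < 0.5
            a, b = ((pair["rejected"], pair["chosen"]) if swap
                    else (pair["chosen"], pair["rejected"]))
            trace, verdict = judge(complete, pair["instruction"], a, b)
            correct_label = "B" if swap else "A"
            if verdict == correct_label:
                # The reasoning chain reached the known-correct verdict, so it
                # becomes a supervised training example for the next judge.
                training_examples.append({
                    "instruction": pair["instruction"],
                    "response_a": a, "response_b": b,
                    "judgment": trace,
                })
                break  # one correct trace per example is enough
    return training_examples

# fine_tune(model, training_examples) would then yield the next-iteration
# judge; `fine_tune` is a placeholder for your training stack, not a real API.
```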
Putting the Self-Taught Evaluator to the test
The researchers initialized their Self-Taught Evaluator with the Llama 3-70B-Instruct model. They used the WildChat dataset, which contains a large pool of human-written instructions, and selected more than 20,000 examples in the reasoning category. They also tested other datasets and tasks, including coding and math word problems. They let the self-teaching pipeline generate the answers and the entire training set without any human intervention.
Their experiments showed that the Self-Taught Evaluator significantly improved the accuracy of the base model on the popular RewardBench benchmark, increasing it from 75.4% to 88.7% after five iterations without any human annotation. This performance comes close to models trained on human-labeled data and in some cases surpasses them, including some private frontier models.
They observed similar improvements on MT-Bench, a benchmark that evaluates the performance of LLMs on multi-turn conversations.
Implications for enterprises
This research contributes to a growing trend of techniques that use LLMs in automated loops for self-improvement. These techniques can significantly reduce the manual effort required to create high-performing LLMs, paving the way for more efficient and scalable development and deployment of AI-powered applications.
The Self-Taught Evaluator can benefit enterprises that possess large amounts of unlabeled corporate data and want to fine-tune models on their own data without extensive manual annotation and evaluation. It also hints at how Meta will use its rich dataset of unlabeled user-generated data to train and improve its current and future models.
While promising, the Self-Taught Evaluator does have limitations. It relies on an initial seed model that is instruction-tuned and aligned with human preferences. In their experiments, the researchers used the Mixtral 8x22B mixture-of-experts model as the seed for creating their initial training dataset.
Enterprises will need to carefully consider the seed and base models that are relevant to their specific data and tasks. It is also important to note that standardized benchmarks often don't represent the full capabilities and limitations of LLMs. At the same time, fully automated loops that rely solely on LLMs to evaluate their own outputs can latch onto meaningless shortcuts that optimize the model for a benchmark but fail on real-world tasks. Enterprises should run their own manual tests at different stages of the training and evaluation process to make sure the model is in fact getting closer to the kind of performance they have in mind.