Advances in large language models (LLMs) have lowered the barriers to creating machine learning applications. With simple instructions and prompt engineering techniques, you can get an LLM to perform tasks that would have otherwise required training custom machine learning models. This is especially useful for companies that don’t have in-house machine learning talent and infrastructure, or product managers and software engineers who want to create their own AI-powered products.
However, easy-to-use models come with tradeoffs. Without a systematic approach to tracking how LLMs perform in their applications, enterprises can end up with mixed and unstable results.
Public benchmarks vs custom evals
The most popular way to evaluate LLMs today is to measure their performance on general benchmarks such as MMLU, MATH and GPQA. AI labs often market their models’ performance on these benchmarks, and online leaderboards rank models based on their evaluation scores. But while these evals measure the general capabilities of models on tasks such as question-answering and reasoning, most enterprises need to measure performance on very specific tasks.
“Public evals are primarily a method for foundation model creators to market the relative merits of their models,” Ankur Goyal, co-founder and CEO of Braintrust, told VentureBeat. “But when an enterprise is building software with AI, the only thing they care about is does this AI system actually work or not. And there’s basically nothing you can transfer from a public benchmark to that.”
Instead of relying on public benchmarks, enterprises need to create custom evals based on their own use cases. Evals typically involve presenting the model with a set of carefully crafted inputs or tasks, then measuring its outputs against predefined criteria or human-generated references. The point of these assessments is to measure performance on the specific task the application is built to handle.
The most common way to create an eval is to capture real user data and format it into tests. Organizations can then use these evals to backtest their application and the changes that they make to it.
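As a rough illustration of turning captured user data into tests, the sketch below converts log records into eval cases. The record fields (`user_message`, `approved_reply`) and the `EvalCase` type are hypothetical names chosen for this example, not part of any particular product; real logs will need their own cleaning rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One test: an input taken from real usage plus a reference answer."""
    input: str
    expected: str


def cases_from_logs(logs):
    """Turn raw log records into eval cases.

    Assumes each record is a dict with hypothetical keys 'user_message'
    and 'approved_reply'. Records missing either field, and repeated
    user messages, are skipped.
    """
    seen = set()
    cases = []
    for record in logs:
        user_msg = (record.get("user_message") or "").strip()
        approved = (record.get("approved_reply") or "").strip()
        if not user_msg or not approved or user_msg in seen:
            continue
        seen.add(user_msg)
        cases.append(EvalCase(input=user_msg, expected=approved))
    return cases


# Made-up log records for illustration only.
logs = [
    {"user_message": "How do I reset my password?",
     "approved_reply": "Use the reset link on the login page."},
    {"user_message": "How do I reset my password?",
     "approved_reply": "Use the reset link on the login page."},
    {"user_message": "Cancel my order", "approved_reply": ""},
]
cases = cases_from_logs(logs)
```

The resulting list can be saved and rerun every time a prompt, setting or model changes, which is what makes backtesting possible.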
“With custom evals, you’re not testing the model itself. You’re testing your own code that maybe takes the output of a model and processes it further,” Goyal said. “You’re testing their prompts, which is probably the most common thing that people are tweaking and trying to refine and improve. And you’re testing the settings and the way you use the models together.”
How to create custom evals
To make a good eval, every organization must invest in three key components. First is the data used to create the examples to test the application. The data can be handwritten examples created by the company’s staff, synthetic data created with the help of models or automation tools, or data collected from end users such as chat logs and tickets.
“Handwritten examples and data from end users are dramatically better than synthetic data,” Goyal said. “But if you can figure out tricks to generate synthetic data, it can be effective.”
The second component is the task itself. Unlike the generic tasks that public benchmarks represent, the custom evals of enterprise applications are part of a broader ecosystem of software components. A task might be composed of several steps, each of which has its own prompt engineering and model selection techniques. There might also be other non-LLM components involved. For example, you might first classify an incoming request into one of several categories, then generate a response based on the category and content of the request, and finally make an API call to an external service to complete the request. It is important that the eval covers the entire pipeline, not just the individual model calls.
“The important thing is to structure your code so that you can call or invoke your task in your evals the same way it runs in production,” Goyal said.
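One way to follow that advice, sketched here under simple assumptions, is to write the task as a single function that receives its model client and external service as arguments. Production passes in the real clients; the eval passes in recorded or stubbed ones. The function names, prompts and categories below are all hypothetical.

```python
from typing import Callable


def handle_request(text: str,
                   llm: Callable[[str], str],
                   send: Callable[[str, str], bool]) -> dict:
    """Classify a request, draft a reply, then call an external service.

    'llm' and 'send' are injected so the same code path runs in
    production and in evals.
    """
    category = llm(
        f"Classify this request as 'billing' or 'support': {text}"
    ).strip().lower()
    reply = llm(f"Write a short {category} reply to: {text}")
    delivered = send(category, reply)
    return {"category": category, "reply": reply, "delivered": delivered}


# Stand-ins used only for this example; a real eval would call the
# actual model and a sandboxed version of the external service.
def fake_llm(prompt: str) -> str:
    if prompt.startswith("Classify"):
        return "billing" if "invoice" in prompt.lower() else "support"
    return "Thanks, we are looking into it."


def fake_send(category: str, reply: str) -> bool:
    return True


result = handle_request("My invoice is wrong", fake_llm, fake_send)
```

Because the eval calls `handle_request` directly, it exercises the classification step, the generation step and the API call together rather than one prompt in isolation.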
The final component is the scoring function you use to grade the results of your framework. There are two main types of scoring functions. Heuristics are rule-based functions that can check well-defined criteria, such as testing a numerical result against the ground truth. For more complex tasks such as text generation and summarization, you can use LLM-as-a-judge methods, which prompt a strong language model to evaluate the result. LLM-as-a-judge requires advanced prompt engineering.
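A heuristic scorer can be only a few lines long. The sketch below checks a numerical result against the ground truth; the function name and the choice to grade the last number in the output are assumptions made for illustration.

```python
import math
import re


def numeric_match(output: str, expected: float, rel_tol: float = 1e-6) -> float:
    """Return 1.0 if the last number in the output matches the expected value.

    Thousands separators are removed before parsing. Outputs with no
    number at all score 0.0.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", output.replace(",", ""))
    if not numbers:
        return 0.0
    value = float(numbers[-1])
    return 1.0 if math.isclose(value, expected, rel_tol=rel_tol) else 0.0


score_ok = numeric_match("The total is 1,250.50 dollars", 1250.5)
score_bad = numeric_match("2 + 2 = 5", 4)
```

Rule-based checks like this are cheap and deterministic, which is why they are usually the first choice when the criteria are well defined.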
“LLM-as-a-judge is hard to get right and there’s a lot of misconception around it,” Goyal said. “But the key insight is that just like it is with math problems, it’s easier to validate whether the solution is correct than it is to actually solve the problem yourself.”
The same rule applies to LLMs. It’s much easier for an LLM to evaluate a produced result than it is to do the original task. It just requires the right prompt.
“Usually the engineering challenge is iterating on the wording or the prompting itself to make it work well,” Goyal said.
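To make the idea concrete, here is a minimal sketch of an LLM-as-a-judge scorer for summaries. The prompt wording and the `judge` callable are hypothetical; in practice `judge` would wrap a call to a strong model, and the wording is the part that usually needs iteration.

```python
JUDGE_PROMPT = """You are grading a summary.
Source text:
{source}

Summary:
{summary}

Does the summary state only facts found in the source?
Answer with one word: YES or NO."""


def judge_summary(source: str, summary: str, judge) -> float:
    """Ask a judge model for a YES/NO verdict and map it to a score.

    'judge' is any callable that takes a prompt string and returns the
    model's text response. Anything other than a leading YES scores 0.0.
    """
    verdict = judge(JUDGE_PROMPT.format(source=source, summary=summary))
    words = verdict.strip().split()
    first = words[0].strip(".,!:").upper() if words else ""
    return 1.0 if first == "YES" else 0.0


# Stubbed judges standing in for a real model call.
score_yes = judge_summary("The cat sat.", "A cat sat.", lambda p: "Yes.")
score_no = judge_summary("The cat sat.", "A dog ran.", lambda p: "NO")
```

Constraining the judge to a one-word answer keeps parsing simple; a richer rubric would need a more careful output format.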
Innovating with strong evals
The LLM landscape is evolving quickly and providers are constantly releasing new models. Enterprises will want to upgrade or change their models as old ones are deprecated and new ones are made available. One of the key challenges is making sure that your application will remain consistent when the underlying model changes.
With good evals in place, changing the underlying model becomes as straightforward as running the new models through your tests.
“If you have good evals, then switching models feels so easy that it’s actually fun. And if you don’t have evals, then it is awful. The only solution is to have evals,” Goyal said.
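In code, comparing models can amount to running the same cases and scorer against each candidate. The sketch below uses two stub functions in place of real models; their names, the toy cases and the exact-match scorer are all assumptions for illustration.

```python
def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when output and reference match, ignoring case and spacing."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(task, cases, scorer) -> float:
    """Run every case through the task and return the average score."""
    scores = [scorer(task(inp), expected) for inp, expected in cases]
    return sum(scores) / len(scores) if scores else 0.0


# Toy eval set and stub "models", made up for this example.
cases = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9")]


def current_model(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Paris", "3*3": "9"}[prompt]


def candidate_model(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "paris", "3*3": "6"}[prompt]


models = {"current": current_model, "candidate": candidate_model}
report = {name: run_eval(model, cases, exact_match)
          for name, model in models.items()}
```

A drop in the candidate's average score, as in this toy run, is the signal to inspect individual failures before switching.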
Another issue is the changing data that the model faces in the real world. As customer behavior changes, companies will need to update their evals. Goyal recommends implementing a system of “online scoring” that continuously runs evals on real customer data. This approach allows companies to automatically evaluate their model’s performance on the most current data and incorporate new, relevant examples into their evaluation sets, ensuring the continued relevance and effectiveness of their LLM applications.
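The article does not describe how online scoring is implemented, so the following is only a sketch of the general idea: score a sample of live traffic and flag low-scoring records as candidates for the eval set. The record fields, sampling rate, threshold and scorer are all hypothetical.

```python
import random


def online_score(records, scorer, sample_rate, threshold, rng):
    """Score a random sample of production records.

    Returns the average score of the sampled records (or None if none
    were sampled) and the list of records that scored below 'threshold',
    which are candidates for adding to the eval set.
    """
    scored = []
    flagged = []
    for record in records:
        if rng.random() >= sample_rate:
            continue  # record not sampled
        score = scorer(record["input"], record["output"])
        scored.append(score)
        if score < threshold:
            flagged.append(record)
    average = sum(scored) / len(scored) if scored else None
    return average, flagged


def non_empty(_input: str, output: str) -> float:
    """Toy scorer: an empty reply counts as a failure."""
    return 1.0 if output.strip() else 0.0


# Made-up traffic; sample_rate=1.0 scores every record.
traffic = [
    {"input": "Where is my order?", "output": "It ships tomorrow."},
    {"input": "Refund please", "output": ""},
    {"input": "Change my email", "output": "Done."},
]
average, flagged = online_score(traffic, non_empty, sample_rate=1.0,
                                threshold=0.5, rng=random.Random(0))
```

Flagged records still need review and a reference answer before they become useful test cases.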
As language models continue to reshape the landscape of software development, adopting new habits and methodologies becomes crucial. Implementing custom evals represents more than just a technical practice; it’s a shift in mindset towards rigorous, data-driven development in the age of AI. The ability to systematically evaluate and refine AI-powered solutions will be a key differentiator for successful enterprises.