DeepMind and UC Berkeley show how to make the most of LLM inference-time compute


Given the high cost and slow pace of training large language models (LLMs), there is an ongoing discussion about whether spending more compute at inference time can improve an LLM's performance without retraining it.

In a new study, researchers at DeepMind and the University of California, Berkeley, explore ways to improve the performance of LLMs by strategically allocating compute resources during inference. Their findings, detailed in a new research paper, suggest that by optimizing the use of inference-time compute, LLMs can achieve substantial performance gains without the need for larger models or additional pre-training.

The tradeoff between inference-time and pre-training compute

The dominant approach to improving LLM performance has been to scale up model size and pre-training compute. However, this approach has limitations. Larger models are expensive to train and require more resources to run, which can make them impractical to deploy in some settings, such as on resource-constrained devices.

The alternative is to use more compute during inference to improve the accuracy of LLM responses on challenging prompts. This approach can enable the deployment of smaller LLMs while still achieving comparable performance to larger, more computationally expensive models. 

The question is: given a fixed amount of inference-time compute, which inference method gets the best performance out of an LLM, and how does the result compare to that of a larger pre-trained model?

The most popular approach to scaling test-time computation is best-of-N sampling, where the model generates N outputs in parallel and the one judged best (for example, by a verifier) is selected as the final answer. However, there are other ways to use inference-time compute to improve LLMs. For example, instead of generating multiple responses in parallel, you can have the model revise and correct its response in multiple sequential steps. Another method is to change the verification mechanism that chooses the best of the generated responses. You can also combine parallel and sequential sampling with multiple verification strategies and search algorithms to get an even richer landscape of inference-time optimization strategies.

Parallel vs sequential revision (source: arXiv)
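
To make the contrast concrete, here is a minimal Python sketch of the two families of strategies: parallel best-of-N sampling versus sequential self-revision. The `generate`, `revise` and `score` functions are hypothetical stand-ins for calls to a base LLM, a revision-tuned model and a verifier; they are our assumptions for illustration, not the paper's implementation.

```python
import random

# Hypothetical stand-ins for model and verifier calls (illustration only).
def generate(prompt):
    """Sample one candidate answer from the base model."""
    return f"candidate-{random.randint(0, 9)}"

def revise(prompt, previous_answer):
    """Ask the model to improve its previous answer."""
    return previous_answer + "-revised"

def score(prompt, answer):
    """Verifier score for an answer; higher is better."""
    return random.random()

def best_of_n(prompt, n):
    """Parallel strategy: sample n answers independently, keep the top-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda a: score(prompt, a))

def sequential_revision(prompt, budget):
    """Sequential strategy: spend the same budget revising one answer step by step."""
    answer = generate(prompt)
    for _ in range(budget - 1):
        answer = revise(prompt, answer)
    return answer

budget = 8  # total model calls allowed at test time
print(best_of_n("What is 12 * 17?", budget))
print(sequential_revision("What is 12 * 17?", budget))
```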

To determine the optimal inference-time strategy, the researchers define “test-time compute-optimal scaling strategy” as the “strategy that chooses hyperparameters corresponding to a given test-time strategy for maximal performance benefits on a given prompt at test time.”
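
One way to write this definition schematically (our notation, not necessarily the paper's exact formulation): for a prompt q and a test-time compute budget N, choose the hyperparameters θ of the inference procedure that maximize the chance of producing the correct answer y*(q), where Target(θ, N, q) is the distribution over final answers that the procedure induces:

```latex
\theta^{*}_{q}(N) \;=\; \arg\max_{\theta}\;
  \mathbb{E}_{y \sim \mathrm{Target}(\theta,\, N,\, q)}
  \left[\, \mathbf{1}\{\, y = y^{*}(q) \,\} \,\right]
```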

“Ideally, test-time compute should modify the distribution so as to generate better outputs than naïvely sampling from the LLM itself would,” the researchers write.

Different ways to use inference-time compute

The researchers explored two main strategies for using inference-time compute to improve LLM performance. The first strategy focuses on modifying the proposal distribution, which is the process by which the LLM generates responses. This can be achieved by fine-tuning the LLM to iteratively revise its answers in complex reasoning-based settings.
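
One way to picture this first strategy: instead of committing to the model's first attempt, sample a chain of revisions and let a verifier pick the best answer anywhere along the chain. The sketch below is a simplified illustration under our own assumptions; `generate`, `revise` and `score` are again hypothetical stand-ins for a base model, a revision-tuned model and a verifier, passed in as plain callables.

```python
def revise_then_verify(prompt, budget, generate, revise, score):
    """Proposal-distribution strategy (sketch): produce an initial answer, revise it
    budget - 1 times, then return the answer along the chain that the verifier
    scores highest, rather than blindly trusting the last revision."""
    chain = [generate(prompt)]
    for _ in range(budget - 1):
        chain.append(revise(prompt, chain[-1]))
    return max(chain, key=lambda answer: score(prompt, answer))
```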

The second strategy involves optimizing the verifier, which is the mechanism used to select the best answer from the generated responses. This can be done by training a process-based reward model that evaluates the correctness of individual steps in an answer.
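
A process-based reward model (PRM) scores every intermediate step of a solution rather than just the final answer. The sketch below shows one way such step-level scores could be aggregated and used for selection; `step_reward` is a hypothetical stand-in for a trained PRM, and taking the minimum over steps is one plausible aggregation choice among several (a product or the last-step score are also common).

```python
from typing import Callable, List

def prm_score(steps: List[str], step_reward: Callable[[List[str], str], float]) -> float:
    """Aggregate per-step PRM scores into one solution-level score.
    Under the min aggregation, a single weak step sinks the whole solution."""
    scores = [step_reward(steps[:i], step) for i, step in enumerate(steps)]
    return min(scores) if scores else 0.0

def pick_best_solution(solutions: List[List[str]], step_reward) -> List[str]:
    """Verifier-based selection: keep the candidate whose weakest step is strongest."""
    return max(solutions, key=lambda s: prm_score(s, step_reward))
```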

To evaluate their approach, the researchers conducted experiments with both methods on the challenging MATH benchmark using PaLM-2 models. 

“With both approaches, we find that the efficacy of a particular test-time compute strategy depends critically on both the nature of the specific problem at hand and the base LLM used,” the researchers write.

For easier problems, where the base LLM can already produce reasonable responses, allowing the model to iteratively refine its initial answer proved more effective than generating multiple samples in parallel. For more difficult problems that require exploring different solution strategies, resampling multiple responses in parallel or running a tree search guided by a process-based reward model was more effective.

Different answer verification strategies (source: arXiv)
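
The tree search against a process-based reward model mentioned above can be pictured as a small beam search in which the PRM prunes weak partial solutions at every step. The sketch below is illustrative only; `propose_step`, `step_reward` and `is_complete` are hypothetical stand-ins for a step-generating model, a PRM and a completion check, and the beam sizes are arbitrary.

```python
def beam_search_with_prm(prompt, propose_step, step_reward, is_complete,
                         beam_width=4, expansions=4, max_depth=8):
    """Beam search guided by a process reward model (sketch).
    At each depth, every partial solution in the beam is extended with several
    candidate next steps, each extension is scored by the PRM, and only the
    top-scoring beam_width extensions survive."""
    beam = [([], 0.0)]  # (steps so far, cumulative PRM score)
    for _ in range(max_depth):
        candidates = []
        for steps, total in beam:
            if is_complete(steps):
                candidates.append((steps, total))  # keep finished solutions as-is
                continue
            for _ in range(expansions):
                step = propose_step(prompt, steps)
                candidates.append((steps + [step], total + step_reward(steps, step)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_width]
        if all(is_complete(steps) for steps, _ in beam):
            break
    return beam[0][0]  # the highest-scoring solution found
```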

“This finding illustrates the need to deploy an adaptive ‘compute-optimal’ strategy for scaling test-time compute, wherein the specific approach for utilizing test-time compute is selected depending on the prompt, so as to make the best use of additional computation,” the researchers write.
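
In code, that adaptive routing could look something like the sketch below. The difficulty estimate, the 0.5 cut-off and the two concrete strategies are illustrative assumptions on our part (the paper bins questions by difficulty, for example using the base model's own success rate), not its exact recipe.

```python
def compute_optimal_answer(prompt, budget, estimate_difficulty,
                           revision_strategy, search_strategy):
    """Adaptive 'compute-optimal' dispatch (sketch): route each prompt to the
    test-time strategy that tends to work best at its estimated difficulty."""
    difficulty = estimate_difficulty(prompt)  # assumed to return a value in [0, 1]
    if difficulty < 0.5:
        # Easier prompts: iteratively revising one answer tends to win.
        return revision_strategy(prompt, budget)
    # Harder prompts: parallel sampling or PRM-guided search tends to win.
    return search_strategy(prompt, budget)
```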

By appropriately allocating test-time compute, the researchers were able to significantly improve performance, surpassing the best-of-N baseline while using only about 25% of the computation.

Balancing test-time compute with pre-training compute

The researchers also investigated the extent to which test-time computation can substitute for additional pre-training. They compared the performance of a smaller model with additional test-time compute to a 14x larger model with more pre-training.
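
On the inference side, the FLOPs-matched comparison is easy to picture with a back-of-envelope calculation, using the standard approximation that generating one token costs roughly 2 × parameters FLOPs. The model sizes and answer length below are our illustrative assumptions, not the paper's figures.

```python
# Rough inference-FLOPs comparison (illustrative numbers, ~2 * params FLOPs per token).
small_params = 1e9                    # assumed smaller model
large_params = 14 * small_params      # the "14x larger" comparison model
tokens_per_answer = 500               # assumed answer length, same for both models

flops_large_pass = 2 * large_params * tokens_per_answer   # one greedy pass, large model
flops_small_pass = 2 * small_params * tokens_per_answer   # one greedy pass, small model

# For the inference cost of a single pass of the larger model, the smaller model
# can afford roughly 14 samples or revision steps per query.
calls_for_small = flops_large_pass / flops_small_pass
print(f"~{calls_for_small:.0f} small-model calls per large-model call")
```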

For easier and medium-difficulty questions, the smaller model with additional test-time compute performed comparably to the larger pre-trained model. 

“This finding suggests that rather than focusing purely on scaling pretraining, in some settings it is more effective to pretrain smaller models with less compute, and then apply test-time compute to improve model outputs,” the researchers write.

However, for the most challenging questions, additional pre-training compute proved to be more effective. This indicates that current approaches to scaling test-time compute may not be a perfect substitute for scaling pre-training in all scenarios.

The researchers suggest several future directions for research, including exploring more complex strategies that combine different revision and search techniques and developing more efficient methods for estimating question difficulty.

“Overall, [our study] suggests that even with a fairly naïve methodology, scaling up test-time computation can already serve to be more preferable to scaling up pretraining, with only more improvements to be attained as test-time strategies mature,” the researchers write. “Longer term, this hints at a future where fewer FLOPs are spent during pretraining and more FLOPs are spent at inference.”


