Get your MMLU score 20X cheaper and 1000x faster

News & Announcements

vijil

Apr 27, 2024



How can we make LLM benchmarking faster and cheaper? One approach is to improve the execution speed of the LLM evaluation engine. Another is to compress the benchmark down to its essential prompts. At Vijil, we set out to do both.

Over spring and summer this year, we built Vijil Evaluate, a high-performance evaluation engine that executes tests at massively parallel scale. At the same time, we constructed a “Lite” version of every benchmark of interest using tinyBenchmarks, a principled method for approximating a full benchmark score from a small, representative subset of its prompts. The first of these benchmarks that we want to share today is MMLU-Pro.
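To give a feel for how a “Lite” benchmark can stand in for the full one, here is a minimal conceptual sketch of an anchor-point approximation in the spirit of tinyBenchmarks: score the model on a small set of representative anchor questions and combine the results with precomputed weights. The anchor IDs, weights, and function names below are hypothetical illustrations, not the tinyBenchmarks API.

```python
from typing import Callable, Dict

# Illustrative only: in an anchor-point scheme like tinyBenchmarks, each anchor
# question stands in for a cluster of similar questions from the full benchmark,
# so its weight is roughly the fraction of the benchmark it represents.
# The question IDs and weights below are hypothetical.
ANCHOR_WEIGHTS: Dict[str, float] = {
    "q_0042": 0.013,
    "q_0317": 0.009,
    "q_1138": 0.011,
    # ... on the order of 100 anchors for a benchmark like MMLU-Pro
}


def estimate_benchmark_score(
    is_correct: Callable[[str], bool],
    anchor_weights: Dict[str, float],
) -> float:
    """Estimate full-benchmark accuracy from the anchor questions alone.

    `is_correct(question_id)` should run the model on one anchor question
    and return True if its answer matches the gold answer.
    """
    total_weight = sum(anchor_weights.values())
    weighted_correct = sum(
        weight for qid, weight in anchor_weights.items() if is_correct(qid)
    )
    return weighted_correct / total_weight


# Usage sketch: `run_model_on_question` would call your LLM provider and grade
# the multiple-choice answer -- about 100 calls instead of ~12,000 for MMLU-Pro.
# score = estimate_benchmark_score(run_model_on_question, ANCHOR_WEIGHTS)
```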

In this post, we compare our Lite version to the full version of MMLU-Pro using test results obtained by Vijil Evaluate. The Lite version of MMLU-Pro reproduces the score of the full version with 95% accuracy at 95% cost savings, running 1000x faster on Vijil Evaluate than the full MMLU-Pro on the default evaluation harness. 

This means that we can test GPT-4o with nearly the same score (72.3 Lite vs. 72.4 full) at about 5% of the cost ($2.30 vs. $53) and at 1293X the speed (87 seconds vs. 31.25 hours). The benefit would be similar for any model that you want to test with the same benchmark. Here, the cost of an evaluation is the cost of model inference, priced in $ per 1M tokens, and the time to complete the full benchmark is extrapolated from 10 iterations of the scripts at https://github.com/TIGER-AI-Lab/MMLU-Pro.
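The headline ratios follow directly from those numbers; a quick sanity check:

```python
# Sanity-check the headline ratios from the GPT-4o comparison above.
full_cost_usd, lite_cost_usd = 53.0, 2.3
full_time_hours, lite_time_seconds = 31.25, 87.0

cost_ratio = full_cost_usd / lite_cost_usd                  # ~23x cheaper
speed_ratio = (full_time_hours * 3600) / lite_time_seconds  # ~1293x faster
cost_fraction = lite_cost_usd / full_cost_usd               # ~4% of full cost

print(f"{cost_ratio:.0f}x cheaper ({cost_fraction:.0%} of the full cost)")
print(f"{speed_ratio:.0f}x faster")
```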

In the charts below, note that cost and time are plotted on a logarithmic scale.

Use Vijil Evaluate to run MMLU-Pro quickly and cheaply

Vijil Evaluate makes it easy to run MMLU-Pro, either the full or the Lite version, on any LLM to which you have access. Sign up at https://vijil.ai to use the minimalist UI and dashboard, or just modify the Colab notebook below with your provider and model.

Vijil_Evaluate_MMLU.ipynb

You will need a Vijil API key to get started. To get one, send an email to contact@vijil.ai (tell them Aditya sent you). 
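As a rough illustration of the workflow the notebook wraps, a run might look like the sketch below. The package import, client class, method, and parameter names here are assumptions made for illustration; the Colab notebook above is the authoritative reference for the actual client calls.

```python
# Hypothetical sketch of launching an MMLU-Pro (Lite) evaluation with a Vijil
# API key. Client class, method names, and parameters are assumptions; follow
# Vijil_Evaluate_MMLU.ipynb for the real calls.
import os

from vijil import Vijil  # assumed package and client name

client = Vijil(api_key=os.environ["VIJIL_API_KEY"])

evaluation = client.evaluations.create(
    model_hub="openai",           # your model provider
    model_name="gpt-4o",          # the model you want to score
    harnesses=["mmlu_pro_lite"],  # assumed harness name; or the full "mmlu_pro"
)
print(evaluation.id, evaluation.status)
```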

© 2025 Vijil. All rights reserved.