Opinion

vijil
Apr 27, 2024
When developers want to understand how well different Large Language Models (LLMs) perform across a common set of tasks, they turn to standard benchmarks such as Massive Multitask Language Understanding (MMLU) and Grade School Math 8K (GSM8K). These benchmarks also help them track changes in performance that result from changes to the training data or the model parameters. Running benchmarks for every change and every model, however, is time-consuming and expensive. For example, the time and cost of running MMLU-Pro, a more robust and challenging version of MMLU, on three popular LLMs are as follows.
Use Vijil Evaluate to run MMLU-Pro quickly and cheaply
Vijil Evaluate makes it easy to run MMLU-Pro (either the full or the Lite version) on any LLM to which you have access. Sign up at https://vijil.ai to use the minimalist UI and dashboard, or just modify this Colab notebook with your provider and model, as sketched below.
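To give a rough sense of what the Colab notebook boils down to, here is a minimal sketch. It assumes a Python client along the lines of a vijil package; the import, the Vijil constructor, the evaluations.create call, and the hub, model, and harness identifiers are placeholders rather than the documented API, so follow the notebook for the exact calls.

```python
import os

# Placeholder import: the notebook's actual client package and class names
# may differ from this sketch.
from vijil import Vijil

# Authenticate with the Vijil API key you obtained by email (see below).
client = Vijil(api_key=os.environ["VIJIL_API_KEY"])

# Kick off an MMLU-Pro run against your own provider and model.
# The hub, model, and harness values here are illustrative; replace them
# with the ones your provider and the notebook expect.
evaluation = client.evaluations.create(
    model_hub="openai",          # e.g. OpenAI, Together, or a custom endpoint
    model_name="gpt-3.5-turbo",  # any model you have API access to
    harnesses=["mmlu_pro"],      # or the Lite variant for a shorter run
)

# The dashboard (or the notebook's polling cell) shows progress and the
# final MMLU-Pro score once the run completes.
print(evaluation)
```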
You will need a Vijil API key to get started. To get one, send an email to contact@vijil.ai (tell them Aditya sent you).