Model Validation with Benchmarks
In this tutorial, we’ll show you how to compare model performance on common benchmark tasks using the Benchmark service with the Remyx CLI.
Follow the instructions on how to install and authenticate using the Remyx CLI before you begin.
Overview
Off-the-shelf foundation models can perform a diverse range of tasks and can also be specialized to cater to the specific needs of your application. Techniques like fine-tuning enable these models to be tailored for specific tasks or skills. However, it’s crucial to ensure that your fine-tuned model continues to perform well on relevant tasks initially present in the original model, even if those tasks were not the primary focus during training.
While a fine-tuned model may excel in a specialized domain, it might unintentionally lose proficiency in critical areas such as mathematical reasoning or logical inference, capabilities that can be important for the model’s effectiveness in real-world scenarios. For instance, a healthcare chatbot using BioMistral/BioMistral-7B may need to reason through dosage adjustments based on a patient’s weight and age, a task where the mathematical reasoning demonstrated by the base model mistralai/Mistral-7B-Instruct-v0.1 remains relevant to the chatbot application.
As you adapt a model to better suit your application, it’s important to periodically evaluate how these changes impact the relevant skills that were present in the base model. This ensures your specialized model retains the necessary capabilities to perform reliably in its intended use.
In this tutorial, we’ll explore how to compare a base model and a fine-tuned model using the Benchmark evaluation task in Remyx. This will allow you to systematically assess the impact of fine-tuning on performance across key benchmarks, ensuring your model strikes the right balance between newly introduced and previously present skills.
First, let’s start by importing the necessary components:
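The imports below are a minimal sketch; the module paths (remyxai.client.myx_board, remyxai.api.evaluations) and class names are assumptions based on the Remyx Python SDK naming, so adjust them to match your installed version.

```python
# Assumed import paths for the Remyx Python SDK; adjust to your installed version.
from remyxai.client.myx_board import MyxBoard       # groups models and stores grouped evaluation results
from remyxai.api.evaluations import EvaluationTask  # identifies the evaluation to run (e.g., Benchmark)
```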
Comparing Base Models and Fine-tuned Models
In this example, we’ll compare the base model mistralai/Mistral-7B-Instruct-v0.1 to a further-trained model, BioMistral/BioMistral-7B, which was trained on a large corpus of medical data.
By instantiating a MyxBoard, you can organize the results of grouped evaluations for your model comparison.
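As a sketch, a MyxBoard grouping the two models might be created as follows; the model_repo_ids and name keyword arguments are assumptions about the constructor signature.

```python
# Create a MyxBoard that groups the base and fine-tuned models for comparison.
model_ids = [
    "mistralai/Mistral-7B-Instruct-v0.1",  # base model
    "BioMistral/BioMistral-7B",            # further trained on a medical corpus
]
myx_board = MyxBoard(model_repo_ids=model_ids, name="mistral-vs-biomistral")
```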
Running Benchmarks
Powered by lighteval, the Benchmark evaluation task lets us run predefined, open-source benchmarks covering a variety of tasks and skills.
For this example, we’ll evaluate both models on "leaderboard|gsm8k|0|0", "lighteval|asdiv|0|0", and "bigbench|logical_deduction|0|0", which cover common math and reasoning benchmarks.
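Collecting the lighteval-style task identifiers in a list keeps the evaluation call below tidy:

```python
# lighteval-style task identifiers covering math and reasoning skills.
benchmark_tasks = [
    "leaderboard|gsm8k|0|0",           # grade-school math word problems
    "lighteval|asdiv|0|0",             # arithmetic word problems
    "bigbench|logical_deduction|0|0",  # logical deduction puzzles
]
```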
Now we’re ready to launch evaluation jobs with the Remyx API. In this example, the benchmark tasks are run against each candidate model and the results are asynchronously logged to the MyxBoard.
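A sketch of launching the job is shown below; the evaluate() method, the EvaluationTask.BENCHMARK member, and the tasks keyword argument are assumptions about the SDK surface.

```python
# Launch the Benchmark evaluation; results are logged to the MyxBoard asynchronously.
myx_board.evaluate(EvaluationTask.BENCHMARK, tasks=benchmark_tasks)
```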
When the Benchmark evaluation job finishes, it will automatically print a message saying the job is complete.
To see the evaluation results, you can use:
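The call below is a sketch; get_results() is an assumed accessor name on the MyxBoard.

```python
# Fetch the benchmark results logged to the MyxBoard.
results = myx_board.get_results()
print(results)
```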
This returns a JSON object with the results from each benchmark task requested.
Conclusion
Comparing base and specialized models on key benchmarks helps ensure your specialized model retains relevant skills and capabilities while excelling at the newly introduced tasks for your application.