BioMistral/BioMistral-7B may need to understand how to reason through dosage adjustments based on a patient’s weight and age - a task where proficiency in mathematical reasoning, as demonstrated by the base model mistralai/Mistral-7B-Instruct-v0.1, is relevant to the chatbot application.
As you adapt a model to better suit your application, it’s important to periodically evaluate how these changes affect the relevant skills present in the base model. This ensures your specialized model retains the capabilities it needs to perform reliably for its intended use case.
In this tutorial, we’ll explore how to compare a base model and a fine-tuned model using the Benchmark evaluation task in Remyx. This lets you systematically assess how fine-tuning affects performance across key benchmarks, ensuring your model strikes the right balance between newly introduced skills and those carried over from the base model.
First, let’s import the necessary components.
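A minimal sketch of the imports, assuming the remyxai Python client exposes a RemyxAPI client, a MyxBoard for grouping the models to evaluate, and an EvaluationTask enum (the exact module paths here are assumptions; check the Remyx docs for the current ones):

```python
# NOTE: module paths below are assumptions about the remyxai Python client;
# verify them against the current Remyx documentation.
from remyxai.client.remyx_client import RemyxAPI    # API client (assumed path)
from remyxai.client.myxboard import MyxBoard        # groups models for side-by-side evaluation (assumed path)
from remyxai.api.evaluations import EvaluationTask  # enum of evaluation tasks, e.g. BENCHMARK (assumed path)
```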
In this example, we’ll compare the base model mistralai/Mistral-7B-Instruct-v0.1 to a further-trained model, BioMistral/BioMistral-7B, which was trained on a large corpus of medical data.
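To keep the comparison organized, both Hugging Face model IDs can be grouped together, for instance on a MyxBoard (a sketch; the constructor arguments are assumptions):

```python
# Hugging Face repo IDs for the base model and its medically fine-tuned counterpart
model_ids = [
    "mistralai/Mistral-7B-Instruct-v0.1",  # base model with general math/reasoning skills
    "BioMistral/BioMistral-7B",            # further trained on a medical corpus
]

# Group the models for side-by-side evaluation (constructor arguments are assumptions)
myx_board = MyxBoard(model_repo_ids=model_ids, name="base-vs-biomistral")
```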
The Benchmark evaluation task lets us run predefined, open-source benchmarks covering a variety of tasks and skills. For this example, we’ll evaluate both models on "leaderboard|gsm8k|0|0", "lighteval|asdiv|0|0", and "bigbench|logical_deduction|0|0", which cover common math and reasoning skills.
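Putting it together, the sketch below shows one plausible way to kick off the Benchmark task over both models; the evaluate() call and its evals keyword are assumptions, while the task strings are the lighteval-style identifiers listed above.

```python
# lighteval-style benchmark identifiers (suite|task|few-shot settings)
benchmarks = [
    "leaderboard|gsm8k|0|0",           # grade-school math word problems
    "lighteval|asdiv|0|0",             # arithmetic word problems
    "bigbench|logical_deduction|0|0",  # multi-step logical reasoning
]

remyx_api = RemyxAPI()

# Run the Benchmark evaluation task on every model in the MyxBoard
# (method name and keyword arguments are assumptions; see the Remyx docs)
remyx_api.evaluate(myx_board, [EvaluationTask.BENCHMARK], evals=benchmarks)

# Retrieve the scores once the runs finish
results = myx_board.get_results()
print(results)
```

Once the runs complete, the per-benchmark scores can be compared side by side to check whether the medical fine-tuning in BioMistral/BioMistral-7B traded away any of the base model’s math and reasoning performance.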