Before you begin, follow the instructions to install and authenticate with the Remyx CLI.
Overview
Off-the-shelf foundation models can perform a diverse range of tasks and can also be specialized for the specific needs of your application. Techniques like fine-tuning tailor these models to particular tasks or skills. However, it's crucial to ensure that your fine-tuned model continues to perform well on relevant capabilities present in the original model, even if those capabilities were not the focus of training. While a fine-tuned model may excel in a specialized domain, it might unintentionally lose proficiency in critical areas such as mathematical reasoning or logical inference that matter for the model's effectiveness in real-world scenarios. For instance, a healthcare chatbot using BioMistral/BioMistral-7B may need to reason through dosage adjustments based on a patient's weight and age, a task where the mathematical reasoning demonstrated by the base model mistralai/Mistral-7B-Instruct-v0.1 is directly relevant to the chatbot application.
As you adapt a model to better suit your application, it’s important to periodically evaluate how these changes impact the relevant skills that were present in the base model. This ensures your specialized model retains the necessary capabilities to perform reliably in its intended use.
In this tutorial, we’ll explore how to compare a base model and a fine-tuned model using the Benchmark evaluation task in Remyx. This will allow you to systematically assess the impact of fine-tuning on performance across key benchmarks, ensuring your model strikes the right balance between newly introduced and previously present skills.
First, let’s start by importing the necessary components:
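Below is a minimal sketch of the setup, assuming the Remyx Python client exposes a RemyxAPI client and an EvaluationTask type; the exact module paths are assumptions, so verify them against your installed SDK.

```python
# NOTE: module paths are assumptions about the Remyx Python client;
# check them against the SDK version you have installed.
from remyxai.client import RemyxAPI
from remyxai.api.evaluations import EvaluationTask
```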
Comparing Base Models and Fine-tuned Models
In this example, we'll compare the base model mistralai/Mistral-7B-Instruct-v0.1 to BioMistral/BioMistral-7B, a model further trained on a large corpus of medical data.
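As a first step, we can collect the Hugging Face identifiers of the two models we want to compare:

```python
# Hugging Face identifiers for the base and fine-tuned models
models = [
    "mistralai/Mistral-7B-Instruct-v0.1",  # base instruct model
    "BioMistral/BioMistral-7B",            # further trained on medical text
]
```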
Running Benchmarks
Powered by lighteval, the Benchmark evaluation task lets us run predefined, open-source benchmarks covering a variety of tasks and skills.
For this example, we'll evaluate both models on the "leaderboard|gsm8k|0|0", "lighteval|asdiv|0|0", and "bigbench|logical_deduction|0|0" tasks, covering common math and reasoning benchmarks.
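Putting it together, a sketch of the benchmark run might look like the following; the evaluate call and its parameters are assumptions about the Remyx client rather than a definitive API reference.

```python
# Lighteval-style task strings for math and reasoning benchmarks
tasks = [
    "leaderboard|gsm8k|0|0",
    "lighteval|asdiv|0|0",
    "bigbench|logical_deduction|0|0",
]

client = RemyxAPI()

# Assumed call: submit a Benchmark evaluation over both models
# (`models` is the list defined above) on the selected tasks.
client.evaluate(EvaluationTask.BENCHMARK, models, tasks=tasks)
```

Once the evaluation completes, compare each model's scores per benchmark to see whether the medical fine-tuning has cost any of the base model's math or reasoning ability.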