Model Validation with Benchmarks
In this tutorial, we’ll show you how to compare model performance on common benchmark tasks using the Benchmark service with the Remyx CLI.
Follow the instructions on how to install and authenticate using the Remyx CLI before you begin.
Overview
Off-the-shelf foundation models can perform a diverse range of tasks and can also be specialized to cater to the specific needs of your application. Techniques like fine-tuning enable these models to be tailored for specific tasks or skills. However, it’s crucial to ensure that your fine-tuned model continues to perform well on relevant tasks initially present in the original model, even if those tasks were not the primary focus during training.
While a fine-tuned model may excel in a specialized domain, it might unintentionally lose proficiency in critical areas such as mathematical reasoning or logical inference, capabilities that can be important for the model’s effectiveness in real-world scenarios. For instance, a healthcare chatbot using BioMistral/BioMistral-7B may need to reason through dosage adjustments based on a patient’s weight and age, a task where the mathematical reasoning demonstrated by the base model mistralai/Mistral-7B-Instruct-v0.1 remains relevant to the chatbot application.
As you adapt a model to better suit your application, it’s important to periodically evaluate how these changes impact the relevant skills that were present in the base model. This ensures your specialized model retains the necessary capabilities to perform reliably in its intended use.
In this tutorial, we’ll explore how to compare a base model and a fine-tuned model using the Benchmark evaluation task in Remyx. This will allow you to systematically assess the impact of fine-tuning on performance across key benchmarks, ensuring your model strikes the right balance between newly introduced and previously present skills.
First, let’s start by importing the necessary components:
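The imports below are a minimal sketch; the module paths (remyxai.client.myx_board, remyxai.api.evaluations) and class names are assumptions based on the Remyx Python SDK naming, so adjust them to match your installed version.

```python
# Assumed import paths for the Remyx Python SDK; adjust to your installed version.
from remyxai.client.myx_board import MyxBoard       # groups models and stores grouped evaluation results
from remyxai.api.evaluations import EvaluationTask  # identifies the evaluation to run (e.g., Benchmark)
```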
Comparing Base Models and Fine-tuned Models
In this example, we’ll compare the base model mistralai/Mistral-7B-Instruct-v0.1 to a further-trained model, BioMistral/BioMistral-7B, which was trained on a large corpus of medical data.
By instantiating a MyxBoard, you can organize the results of grouped evaluations for your model comparison.
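As a sketch, a MyxBoard grouping the two models might be created as follows; the model_repo_ids and name keyword arguments are assumptions about the constructor signature.

```python
# Create a MyxBoard that groups the base and fine-tuned models for comparison.
model_ids = [
    "mistralai/Mistral-7B-Instruct-v0.1",  # base model
    "BioMistral/BioMistral-7B",            # further trained on a medical corpus
]
myx_board = MyxBoard(model_repo_ids=model_ids, name="mistral-vs-biomistral")
```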
Running Benchmarks
Powered by lighteval, the Benchmark evaluation task lets us run predefined, open-source benchmarks covering a variety of tasks and skills.
For this example, we’ll evaluate both models on "leaderboard|gsm8k|0|0", "lighteval|asdiv|0|0", and "bigbench|logical_deduction|0|0", which cover common math and reasoning benchmarks.
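Collecting the lighteval-style task identifiers in a list keeps the evaluation call below tidy:

```python
# lighteval-style task identifiers covering math and reasoning skills.
benchmark_tasks = [
    "leaderboard|gsm8k|0|0",           # grade-school math word problems
    "lighteval|asdiv|0|0",             # arithmetic word problems
    "bigbench|logical_deduction|0|0",  # logical deduction puzzles
]
```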
Now we’re ready to launch evaluation jobs with the Remyx API. In this example, the benchmark tasks are run against each candidate model and the results are asynchronously logged to the MyxBoard.
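A sketch of launching the job is shown below; the evaluate() method, the EvaluationTask.BENCHMARK member, and the tasks keyword argument are assumptions about the SDK surface.

```python
# Launch the Benchmark evaluation; results are logged to the MyxBoard asynchronously.
myx_board.evaluate(EvaluationTask.BENCHMARK, tasks=benchmark_tasks)
```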
When the Benchmark evaluation job finishes, it will automatically print a message saying the job is complete.
To see the evaluation results, you can use:
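The call below is a sketch; get_results() is an assumed accessor name on the MyxBoard.

```python
# Fetch the benchmark results logged to the MyxBoard.
results = myx_board.get_results()
print(results)
```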
This returns a JSON object with the results from each benchmark task requested.
Conclusion
Comparing base and specialized models on key benchmarks helps ensure your specialized model retains relevant skills and capabilities while excelling at the newly introduced tasks for your application.