This tutorial shows you how to build a custom model evaluation using the MyxBoard in the Remyx CLI.

You can also follow along in this Colab notebook.

Overview

Generic benchmark tasks help measure the general capabilities of a foundation model. In practice, however, customized evaluations are more useful for determining whether an LLM fits your application.

The ideal LLM base model for your application depends on many factors, such as accuracy and speed. With Remyx Studio, you can use the LLM evaluation API to narrow your experiments to the relevant base models and evaluation criteria. In this tutorial, we’ll guide you through:

  • Creating a MyxBoard with the Remyx CLI from a list of models or a Hugging Face collection.
  • Ranking models by their performance on a custom task using MyxMatch.
  • Viewing and sharing your results.

Before you begin, follow the instructions to install the Remyx CLI and authenticate.

Creating Your MyxBoard

The first step in comparing LLMs is creating a MyxBoard. A MyxBoard allows you to manage a group of models and run evaluations on them. You can either create a MyxBoard using a list of model identifiers or directly from a Hugging Face collection.

In this example, we create a MyxBoard either under a name we choose or under a name borrowed from the collection identifier. We can then compare the models in the list or in that collection.
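As a rough stand-in for this step, the sketch below models the idea in plain Python without calling the Remyx library: a board is a named group of model identifiers, and the name is either chosen explicitly or borrowed from the collection identifier. `BoardSketch`, its methods, and every identifier shown are hypothetical illustrations, not the Remyx API. The collection members are passed in by hand, whereas a real implementation would presumably look them up on the Hugging Face Hub.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BoardSketch:
    """Hypothetical stand-in for a MyxBoard: a named group of model identifiers."""

    name: str
    model_ids: List[str] = field(default_factory=list)
    collection_id: Optional[str] = None

    @classmethod
    def from_models(cls, name: str, model_ids: List[str]) -> "BoardSketch":
        # Drop duplicate identifiers while keeping the original order.
        unique_ids = list(dict.fromkeys(model_ids))
        return cls(name=name, model_ids=unique_ids)

    @classmethod
    def from_collection(
        cls,
        collection_id: str,
        model_ids: List[str],
        name: Optional[str] = None,
    ) -> "BoardSketch":
        # Borrow the board name from the collection identifier unless one is given.
        board = cls.from_models(name or collection_id, model_ids)
        board.collection_id = collection_id
        return board


# Usage with made-up identifiers (assumptions, not real repositories).
by_list = BoardSketch.from_models("my-board", ["org-a/model-small", "org-b/model-large"])
by_collection = BoardSketch.from_collection(
    "example-org/example-collection",
    ["org-a/model-small", "org-b/model-large"],
)
print(by_list.name, by_list.model_ids)
print(by_collection.name, by_collection.model_ids)
```

Both constructors produce the same kind of object, which is the point of the design: later evaluation steps only need the group of model identifiers and do not care how the group was assembled.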

Ranking Your MyxBoard Models with MyxMatch

Once you’ve created a MyxBoard, you can compare the models to find the best base model for your application. The MyxMatch evaluation task helps you assess the performance of each model, ranking them based on how well they align with your specific use case, using a sample prompt.

Choose a prompt that closely matches the data your model will encounter in your application.

MyxMatch creates a custom benchmark using your sample prompt to evaluate models with LLM-as-a-judge. It ranks the models based on how well they fit your application, helping you quickly find the ones that best meet your needs.

In this example, we use the MyxMatch task to evaluate the models’ responses to the prompt: “You are a media analyst. Objective: Analyze the media coverage. Phase 1: Begin analysis.” After the evaluation is finished, we retrieve and display the results. The CLI will notify you once the evaluation is complete.
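The ranking idea behind an LLM-as-a-judge evaluation can be sketched in plain Python, independent of the Remyx library. In the sketch below, each model's response to the sample prompt is scored by a judge function, and the models are sorted by score. `rank_models`, `toy_judge`, the model names, and the responses are all hypothetical. The keyword-overlap judge is only a placeholder so the example runs offline; a real judge would be a call to an LLM, and this is not how MyxMatch computes its scores.

```python
from typing import Callable, Dict, List, Tuple

PROMPT = (
    "You are a media analyst. Objective: Analyze the media coverage. "
    "Phase 1: Begin analysis."
)


def rank_models(
    responses: Dict[str, str],
    judge: Callable[[str, str], float],
    prompt: str,
) -> List[Tuple[str, float]]:
    """Score each model's response with the judge and sort best-first."""
    scored = [(model_id, judge(prompt, text)) for model_id, text in responses.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_judge(prompt: str, response: str) -> float:
    """Placeholder judge: fraction of longer prompt words echoed in the response."""
    keywords = {w.strip(".:,").lower() for w in prompt.split() if len(w) > 4}
    words = {w.strip(".:,").lower() for w in response.split()}
    return len(keywords & words) / len(keywords)


# Made-up model names and responses, for illustration only.
responses = {
    "org-a/model-small": "Hello there.",
    "org-b/model-large": "I will analyze the media coverage and begin phase one.",
}

for model_id, score in rank_models(responses, toy_judge, PROMPT):
    print(f"{model_id}: {score:.2f}")
```

Passing the judge in as a function keeps the ranking logic separate from the scoring logic, so the placeholder can be swapped for an LLM-backed judge without touching `rank_models`.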

Viewing and Sharing Your MyxBoard

The Remyx CLI stores your MyxBoard and handles all updates automatically, so your MyxBoards stay up to date and easy to retrieve. If you created a MyxBoard from a Hugging Face collection, you can also store your results in that collection as a dataset with the push_to_hf() method.
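To give a rough picture of what a results dataset could contain, the sketch below flattens a ranking into JSON Lines records, one row per model. The function name `results_to_jsonl` and the record fields are assumptions made for illustration; the layout that push_to_hf() actually writes is not described in this tutorial.

```python
import json
from typing import List, Tuple


def results_to_jsonl(board_name: str, ranking: List[Tuple[str, float]]) -> str:
    """Serialize a best-first ranking as JSON Lines (hypothetical record layout)."""
    lines = []
    for rank, (model_id, score) in enumerate(ranking, start=1):
        record = {
            "board": board_name,
            "rank": rank,
            "model": model_id,
            "score": score,
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)


# Made-up ranking, for illustration only.
example_ranking = [("org-b/model-large", 0.75), ("org-a/model-small", 0.0)]
print(results_to_jsonl("my-board", example_ranking))
```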

You can also view your results in the Remyx Studio app. Find them in the Myxboard tab under the name you gave your MyxBoard.

Conclusion

With the MyxBoard, you can customize and streamline the evaluation of LLMs for your specific needs. Whether you’re ranking models on a custom task or sharing your results with the broader community, the MyxBoard makes it easy to tailor and track LLM evaluations. It also adds context to your ML artifacts, which supports your experiment workflow.

Stay tuned for thousands more fine-grained evaluation tasks coming soon to the Remyx CLI!