Overview
Assessing LLMs with generic benchmark tasks can help measure the general capabilities of a foundation model. In practice, however, customized evaluations are the most useful way to determine an LLM's fitness for your application. The ideal LLM base model for your application may depend on many factors, such as accuracy and speed. With Remyx Studio, you can use the LLM evaluation API to narrow the scope of your experiments to the relevant base models and evaluation criteria. In this tutorial, we'll guide you through the process of:
- Creating a MyxBoard with the Remyx CLI from a list of models or a collection.
- Ranking models based on their performance on a custom task using MyxMatch.
- Viewing and sharing your results.
Follow the instructions on how to install and authenticate using the Remyx CLI before you begin.
Creating Your MyxBoard
The first step in comparing LLMs is creating a MyxBoard. A MyxBoard lets you manage a group of models and run evaluations on them. You can create a MyxBoard either from a list of model identifiers or directly from a Hugging Face collection.
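As a reference point, here is a minimal Python sketch of both creation paths. The import path, the MyxBoard constructor arguments, and the collection option shown here are assumptions about the Remyx Python client based on the workflow described in this tutorial; consult the Remyx CLI reference for the exact names.

```python
# Sketch: creating a MyxBoard from a list of models or from a Hugging Face
# collection. The module path and argument names are assumptions, not the
# confirmed Remyx API; check the Remyx CLI reference for exact names.
from remyxai.client.myxboard import MyxBoard  # assumed module path

# Option 1: from a list of Hugging Face model identifiers (example models)
model_ids = [
    "microsoft/Phi-3-mini-4k-instruct",
    "Qwen/Qwen2-1.5B-Instruct",
    "mistralai/Mistral-7B-Instruct-v0.3",
]
myx_board = MyxBoard(model_repo_ids=model_ids, name="support-bot-candidates")

# Option 2: directly from an existing Hugging Face collection
# (argument names are illustrative)
myx_board = MyxBoard(
    hf_collection_name="your-username/llm-candidates",
    name="support-bot-candidates",
    from_hf_collection=True,
)
```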
Ranking Your MyxBoard Models with MyxMatch
Once you've created a MyxBoard, you can compare its models to find the best base model for your application. The MyxMatch evaluation task assesses each model against a sample prompt and ranks the models by how well they align with your specific use case. Choose a prompt that closely matches the data your models will encounter when your application is in use.
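Continuing the sketch above, running MyxMatch might look like the following. The EvaluationTask enum and the run_evaluations() call are assumed names rather than the confirmed interface; the key point is passing a prompt that is representative of your application's traffic.

```python
# Sketch: ranking MyxBoard models with MyxMatch on a representative prompt.
# EvaluationTask.MYXMATCH and run_evaluations() are assumed names; consult
# the Remyx CLI reference for the exact evaluation API.
from remyxai.api.evaluations import EvaluationTask  # assumed module path

# `myx_board` is the MyxBoard created in the previous sketch.
# Use a prompt that mirrors what your application will actually send.
prompt = "Summarize this customer support ticket and suggest a next step: ..."

myx_board.run_evaluations([EvaluationTask.MYXMATCH], prompt=prompt)
```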
Viewing and Sharing Your MyxBoard
The MyxBoard is stored and updated automatically by the Remyx CLI, so your MyxBoards stay up to date and are easy to retrieve. If you created a MyxBoard from a Hugging Face collection, you can also store your results in that collection as a dataset with the push_to_hf() method.
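Putting it together, retrieving and sharing results might look like the sketch below. push_to_hf() is the method described above; get_results() and the shape of the returned results are assumptions shown only for illustration.

```python
# Sketch: fetching stored results and pushing them back to the source
# Hugging Face collection as a dataset. get_results() is an assumed accessor;
# push_to_hf() is the method referenced in this section.
results = myx_board.get_results()
print(results)  # inspect the stored MyxMatch rankings

# Only applies when the MyxBoard was created from a Hugging Face collection.
myx_board.push_to_hf()
```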