Evaluate
Find the Best Model
When building applications with foundation models, one of the biggest challenges is choosing the right model from many options. It’s about more than finding a model that’s accurate or fast. Often, the best model is the one that has been trained on data representative of your task. However, it’s typically not known what data was used to train most base models, and empirically finding the best-fit model can take a lot of time and resources.
With the Remyx evaluation APIs, you can narrow your experiments to the models that matter most for your goals. Using tools like MyxMatch, you can rank models based on how well they align with your data, helping you quickly find the best fit.
Evaluate in the Studio
You can find MyxMatch under the “Explore” section of the home view. Once you’ve opened the tool, enter a name for your matching job. Include a representative sample or prompt in the context box to help source the best model. Optionally, click the “Model Selection” dropdown to choose which models you want to compare; by default, all of the available models are selected.
After you’ve clicked “Rank,” you’ll be redirected to the Myxboards view, where you can monitor the progress of your matching job. Once it’s finished, you’ll see a table ranking all selected models, with the top-ranked ones at the top. If you’re ready to move on to the next step and train a model, look for the “Train” button in the last column and click it.
Evaluate using the CLI
You can also follow along with this Colab notebook.
Creating Your MyxBoard
The first step in comparing LLMs is creating a MyxBoard. A MyxBoard allows you to manage a group of models and run evaluations on them. You can either create a MyxBoard using a list of model identifiers or directly from a Hugging Face collection.
In this example, we create a MyxBoard using a name of our choosing, or one borrowed from the collection identifier. This lets us compare the models listed or those within that collection.
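The two construction paths described above can be sketched as follows. Note that this uses a minimal stand-in class, not the actual Remyx SDK: the class name `MyxBoard` comes from this doc, but the method names, signatures, and example model identifiers here are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MyxBoard:
    """Illustrative stand-in for a MyxBoard: a named group of models
    to evaluate together. Not the real Remyx SDK surface."""
    name: str
    models: List[str] = field(default_factory=list)

    @classmethod
    def from_models(cls, name: str, models: List[str]) -> "MyxBoard":
        """Path 1: create a board from an explicit list of model identifiers."""
        return cls(name=name, models=models)

    @classmethod
    def from_hf_collection(cls, collection_id: str, models: List[str]) -> "MyxBoard":
        """Path 2: borrow the board name from a Hugging Face collection
        identifier. (In the real workflow, the model list would come from
        the collection itself; it is passed explicitly here to stay
        self-contained.)"""
        return cls(name=collection_id, models=models)

# Explicit model list with a chosen name (model IDs are just examples):
board = MyxBoard.from_models(
    "media-analyst-eval",
    ["mistralai/Mistral-7B-v0.1", "meta-llama/Llama-2-7b-hf"],
)

# Name borrowed from a (hypothetical) collection identifier:
board2 = MyxBoard.from_hf_collection(
    "my-org/my-collection-1234",
    ["mistralai/Mistral-7B-v0.1"],
)
```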
Ranking Models with MyxMatch
Once you’ve created a MyxBoard, you can compare the models to find the best base model for your application. The MyxMatch evaluation task helps you assess the performance of each model, ranking them based on how well they align with your specific use case, using a sample prompt.
How it Works
MyxMatch calculates two fitness scores based on each model’s responses to the prompt you provide. It creates a synthetic dataset by expanding on the input prompt, then applies LLM-as-a-judge evaluations to each candidate model.
The first score captures how well a response fits the prompt, serving as a baseline for each base model. The second score is calculated after each base model assumes expert and novice personas on the topic of your prompt. We then measure how well each model adheres to its persona, yielding a score for each model’s “trainability” on the topic or task of your prompt.
These scores can uncover the models with the best priors for your application without requiring costly training of each candidate.
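The scoring flow above can be sketched roughly as follows. Everything here is an illustrative assumption, not Remyx’s actual implementation: the judge is a toy word-overlap heuristic standing in for a real LLM-as-a-judge call, and the persona prompts and score formulas are simplified.

```python
from statistics import mean

def judge(prompt: str, response: str) -> float:
    """Stub LLM-as-a-judge returning a fitness score in [0, 1].
    A real judge would prompt a strong LLM with a grading rubric;
    this toy heuristic just rewards overlap with the prompt's words."""
    prompt_words = set(prompt.lower().split())
    response_words = set(response.lower().split())
    return len(prompt_words & response_words) / max(len(prompt_words), 1)

def score_model(prompt: str, generate) -> dict:
    """Compute the two fitness scores for one candidate model.
    `generate(prompt)` is a stand-in for sampling from the model."""
    # Score 1: how well the plain response fits the prompt (baseline).
    baseline = judge(prompt, generate(prompt))
    # Score 2: judged after the model answers as an expert and as a
    # novice on the prompt's topic; adherence to the personas is
    # summarized as a "trainability" score (simple mean here).
    persona_scores = [
        judge(prompt, generate(f"As a {persona}, respond: {prompt}"))
        for persona in ("expert", "novice")
    ]
    trainability = mean(persona_scores)
    return {"baseline": baseline, "trainability": trainability}
```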
Run evaluation
Make sure to choose a prompt that closely matches the data your model will encounter in your application.
In this example, we use the MyxMatch task to evaluate the models’ responses to the prompt: “You are a media analyst. Objective: Analyze the media coverage. Phase 1: Begin analysis.” After the evaluation is finished, we retrieve and display the results. The CLI will notify you once the evaluation is complete.
Viewing and Sharing Your MyxBoard
The MyxBoard will be stored, and all updates are handled automatically by the Remyx CLI, ensuring your MyxBoards stay up-to-date and easily retrievable. If you created a MyxBoard from a Hugging Face collection, you can also store your results in that collection as a dataset with the push_to_hf() method.
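To illustrate what ends up in the collection, the sketch below packages ranking results as dataset records. The record schema and score values are assumptions; push_to_hf() handles this step for you in the real workflow, so this only shows the general shape of the data.

```python
# Hypothetical results (values are made up for illustration):
results = {
    "model-a": {"baseline": 0.72, "trainability": 0.81},
}

# Flatten into one record per model, dataset-style:
records = [{"model": name, **scores} for name, scores in results.items()]

# With the Hugging Face `datasets` library, records like these could be
# published as a dataset (requires authentication, so shown as a comment):
#   from datasets import Dataset
#   Dataset.from_list(records).push_to_hub("my-org/myxboard-results")
```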
You can also view your results in the Remyx Studio app. Find it in the Myxboard tab under the name you created.
With the MyxBoard, you can customize and streamline model evaluation for your specific needs. Whether you’re ranking models on a custom task or sharing results with the broader community, the MyxBoard makes it easy to tailor and track evaluations while adding context to your ML artifacts, enhancing your experiment workflow.
Stay tuned for thousands more fine-grained evaluation tasks coming soon to the Remyx CLI!
What’s next?
You can explore how to train and deploy a model: