Comparing LLMs with MyxBoard

This tutorial will show you how to create your own custom evaluation of models using the MyxBoard within the Remyx CLI.

You can also follow along in this Colab notebook.

Overview

Assessing LLMs with generic benchmark tasks can help measure the general capabilities of a foundation model. In practice, however, customized evaluations are far more useful for determining an LLM's fitness for your specific application.

The ideal LLM base model for your application may depend on many factors, such as accuracy and speed. With Remyx Studio, you can use the LLM evaluation API to narrow your experiments to the base models and evaluation criteria that matter for your use case. In this tutorial, we'll guide you through the process of:

  • Creating a MyxBoard using the Remyx CLI from a list of models or collection.
  • Ranking models based on their performance for a custom task using MyxMatch.
  • Viewing and sharing your results.

Creating Your MyxBoard

The first step in comparing LLMs is creating a MyxBoard. A MyxBoard allows you to manage a group of models and run evaluations on them. You can either create a MyxBoard using a list of model identifiers or directly from a Hugging Face collection.

from remyxai.client.myxboard import MyxBoard

# View supported models
print(MyxBoard.MYXMATCH_SUPPORTED_MODELS)
model_ids = ["Phi-3-mini-4k-instruct", "Qwen2-1.5B"]
myx_board_name = "my_myxboard"
myx_board = MyxBoard(model_repo_ids=model_ids, name=myx_board_name)

# Or instantiate MyxBoard from a collection
# collection_name = "remyxai/llm-foundation-models-670422ac74c4fb4c24fa0831"
# myx_board = MyxBoard(hf_collection_name=collection_name)

In this example, we create a MyxBoard, either giving it a name of our choosing or letting it borrow the collection identifier as its name. This lets us compare the models we listed or the models in that collection.
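If you only want to evaluate a subset of the supported models, here is a minimal sketch for filtering the list before building the board (it assumes MYXMATCH_SUPPORTED_MODELS is a plain list of model identifier strings, as printed above; the "instruct" filter and the board name are illustrative):

# Keep only instruction-tuned variants from the supported list
# (assumes MYXMATCH_SUPPORTED_MODELS is a list of model-name strings)
instruct_models = [
    m for m in MyxBoard.MYXMATCH_SUPPORTED_MODELS
    if "instruct" in m.lower()
]
myx_board = MyxBoard(model_repo_ids=instruct_models, name="instruct-only-myxboard")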

Ranking Your MyxBoard Models with MyxMatch

Once you've created a MyxBoard, you can compare the models to find the best base model for your application. The MyxMatch evaluation task helps you assess the performance of each model, ranking them based on how well they align with your specific use case, using a sample prompt.

MyxMatch creates a custom benchmark using your sample prompt to evaluate models with LLM-as-a-judge. It ranks the models based on how well they fit your application, helping you quickly find the ones that best meet your needs.

from remyxai.client.remyx_client import RemyxAPI
from remyxai.api.evaluations import EvaluationTask

# Initialize RemyxAPI client
remyx_api = RemyxAPI()

# Define evaluation task and prompt
tasks = [EvaluationTask.MYXMATCH]
prompt = "You are a media analyst. Objective: Analyze the media coverage. Phase 1: Begin analysis."

# Run the evaluation
remyx_api.evaluate(myx_board, tasks, prompt=prompt)

# Once the evaluation is complete, fetch the results
results = myx_board.get_results()

In this example, we use the MyxMatch task to evaluate the models' responses to the media-analyst prompt defined above. After the evaluation finishes, we retrieve the results; the CLI will notify you once the evaluation is complete.
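The evaluation runs asynchronously, so if you are scripting rather than watching the CLI, a simple polling loop can serve as a sketch (the 30-second interval and the assumption that get_results() returns an empty value until the run finishes are illustrative, not guaranteed API behavior):

import time

# Poll until results are available (assumes empty results mean the run is still in progress)
results = myx_board.get_results()
while not results:
    time.sleep(30)
    results = myx_board.get_results()
print(results)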

Viewing and Sharing Your MyxBoard

The MyxBoard will be stored, and all updates will be handled automatically by the Remyx CLI, ensuring your MyxBoards are always up-to-date and easily retrievable. If you created a MyxBoard from a Hugging Face collection, you can also store your results in that collection as a dataset with the push_to_hf() method.

# View all your results or fetch them by evaluation task
myx_board.get_results() # all results
myx_board.get_results([EvaluationTask.MYXMATCH]) # by task

# Your results will be JSON formatted
> {'myxmatch': [{'model': 'Phi-3-mini-4k-instruct',
   'rank': 1,
   'prompt': 'You are a media analyst. Objective: Analyze the media coverage. Phase 1: Begin analysis.'},
  {'model': 'Qwen2-1.5B',
   'rank': 2,
   'prompt': 'You are a media analyst. Objective: Analyze the media coverage. Phase 1: Begin analysis.'}]}

# Optionally, push your results to your collection
# myx_board.push_to_hf()
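
To turn the JSON above into a quick leaderboard printout, here is a small sketch that works off the result structure shown (the 'myxmatch', 'rank', and 'model' keys are taken from the example output and may differ in your results):

# Print a simple ranking from the MyxMatch results
results = myx_board.get_results()  # dict like {'myxmatch': [...]}, per the example above
for entry in sorted(results.get("myxmatch", []), key=lambda e: e["rank"]):
    print(f"{entry['rank']}. {entry['model']}")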

You can also view your results in the Remyx Studio app. Find them in the MyxBoard tab under the name you chose.

[Screenshot: MyxBoard results view in Remyx Studio]

Conclusion

With the MyxBoard, you can customize and streamline the evaluation of LLMs for your specific needs. Whether you're ranking models on a custom task or sharing results with the broader community, the MyxBoard makes it easy to tailor and track LLM evaluations while adding context to your ML artifacts and keeping your experiment workflow organized.

Stay tuned for thousands more fine-grained evaluation tasks coming soon to the Remyx CLI!
