Find the Best Model

When building applications with foundation models, one of the biggest challenges is choosing the right model from many options. It’s about more than finding a model that’s accurate or fast. Often, the best model is the one that has been trained on data representative of your task. However, it’s typically not known what data was used to train most base models, and empirically finding the best-fit model can take a lot of time and resources.

With the Remyx evaluation APIs, you can narrow your experiments to the models that matter most for your goals. Using tools like MyxMatch, you can rank models based on how well they align with your data, helping you quickly find the best fit.

Evaluate in the Studio

You can find the MyxMatch tool under the “Explore” section of the home view. Once you’ve opened the tool, start by entering a name for your matching job. Include a representative sample or prompt in the context box to help source the best model. Optionally, click the “Model Selection” dropdown to choose which models you want to compare; by default, all available models are selected.

After you’ve clicked “Rank,” you’ll be redirected to the Myxboards view, where you can monitor the progress of your matching job. Once it’s finished, you’ll see a table of all selected models ranked from best fit to worst. When you’re ready to move on and train a model, click the “Train” button in the last column.

Evaluate using the CLI

You can also follow along with this Colab notebook.

Creating Your MyxBoard

The first step in comparing LLMs is creating a MyxBoard. A MyxBoard allows you to manage a group of models and run evaluations on them. You can either create a MyxBoard using a list of model identifiers or directly from a Hugging Face collection.

Create a MyxBoard
from remyxai.client.myxboard import MyxBoard

# View supported models
print(MyxBoard.MYXMATCH_SUPPORTED_MODELS)
model_ids = ["Phi-3-mini-4k-instruct", "Qwen2-1.5B"]
myx_board_name = "my_myxboard"
myx_board = MyxBoard(model_repo_ids=model_ids, name=myx_board_name)

# Or instantiate MyxBoard from a collection
# collection_name = "remyxai/llm-foundation-models-670422ac74c4fb4c24fa0831"
# myx_board = MyxBoard(hf_collection_name=collection_name)

In this example, we create a MyxBoard named with our chosen name or, when instantiating from a Hugging Face collection, named after the collection identifier. This lets us compare the models we listed or the models in that collection.
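If you only want to compare a subset of models, you can filter the supported list before building the MyxBoard. A minimal sketch, assuming MYXMATCH_SUPPORTED_MODELS is a flat list of model identifier strings like the ones printed above:

from remyxai.client.myxboard import MyxBoard

# Keep only a subset of the supported models, e.g. smaller checkpoints
# (assumes MYXMATCH_SUPPORTED_MODELS is a list of identifier strings)
small_models = [
    model_id
    for model_id in MyxBoard.MYXMATCH_SUPPORTED_MODELS
    if "mini" in model_id or "1.5B" in model_id
]

small_myx_board = MyxBoard(model_repo_ids=small_models, name="small_models_myxboard")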

Ranking Models with MyxMatch

Once you’ve created a MyxBoard, you can compare the models to find the best base model for your application. The MyxMatch evaluation task helps you assess the performance of each model, ranking them based on how well they align with your specific use case, using a sample prompt.

How it Works

MyxMatch calculates two fitness scores from each model’s responses to the prompt you provide. It creates a synthetic dataset by expanding on the input prompt, then applies LLM-as-a-judge evaluations to each candidate model.

The first score captures how well a response fits the prompt, giving a baseline for each base model. The second score is calculated after each base model assumes expert and novice personas on the topic of your prompt. We measure how well each model adheres to those personas, which yields a score for each model’s “trainability” on the topic or task of your prompt.

These scores can uncover the models with the best priors for your application without requiring costly training of each candidate.
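To make the two scores concrete, here is a toy sketch of the pattern. This is not the Remyx implementation; judge_fit is a hypothetical stand-in for the LLM-as-a-judge step, and generate is any function that returns a model’s response to a prompt:

# Illustrative sketch of the two-score idea (not the Remyx implementation)
def judge_fit(response: str, prompt: str) -> float:
    # Hypothetical stand-in for an LLM-as-a-judge call returning a 0-1 score
    overlap = len(set(prompt.lower().split()) & set(response.lower().split()))
    return min(1.0, overlap / 10)

def score_model(generate, prompt: str) -> dict:
    # Baseline score: how well the model's plain response fits the prompt
    baseline = judge_fit(generate(prompt), prompt)

    # Persona scores: the same model answers as an expert and as a novice;
    # adherence to the personas hints at "trainability" on the topic
    expert = judge_fit(generate("Answer as a domain expert. " + prompt), prompt)
    novice = judge_fit(generate("Answer as a complete novice. " + prompt), prompt)

    return {"baseline": baseline, "persona_adherence": (expert + novice) / 2}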

Run evaluation

Make sure to choose a prompt that closely matches the data your model will encounter in your application.
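One simple way to do this is to sample a prompt directly from the requests your application already receives. A minimal sketch, assuming a hypothetical prompts.jsonl file of logged requests with a "text" field:

import json
import random

# Sample a representative prompt from logged application requests.
# The file name "prompts.jsonl" and its "text" field are hypothetical;
# adapt them to wherever your application stores real user inputs.
with open("prompts.jsonl") as f:
    logged_prompts = [json.loads(line)["text"] for line in f]

prompt = random.choice(logged_prompts)
print(prompt)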

Evaluate with MyxMatch
from remyxai.client.remyx_client import RemyxAPI
from remyxai.api.evaluations import EvaluationTask

# Initialize RemyxAPI client
remyx_api = RemyxAPI()

# Define evaluation task and prompt
tasks = [EvaluationTask.MYXMATCH]
prompt = "You are a media analyst. Objective: Analyze the media coverage. Phase 1: Begin analysis."

# Run the evaluation
remyx_api.evaluate(myx_board, tasks, prompt=prompt)

# Once the evaluation is complete, fetch and display the results
results = myx_board.get_results()
print(results)

In this example, we use the MyxMatch task to evaluate the models’ responses to the prompt: “You are a media analyst. Objective: Analyze the media coverage. Phase 1: Begin analysis.” The CLI will notify you once the evaluation is complete; we then retrieve and display the results.

Viewing and Sharing Your MyxBoard

The MyxBoard will be stored, and all updates will be handled automatically by the Remyx CLI, ensuring your MyxBoards are always up-to-date and easily retrievable. If you created a MyxBoard from a Hugging Face collection, you can also store your results in that collection as a dataset with the push_to_hf() method.

View Results
# View all your results or fetch them by evaluation task
myx_board.get_results() # all results
myx_board.get_results([EvaluationTask.MYXMATCH]) # by task
# Results are returned as a dictionary, for example:
{
  'myxmatch': [
    {
      'model': 'Phi-3-mini-4k-instruct',
      'rank': 1,
      'prompt': 'You are a media analyst. Objective: Analyze the media coverage. Phase 1: Begin analysis.'
    },
    {
      'model': 'Qwen2-1.5B',
      'rank': 2,
      'prompt': 'You are a media analyst. Objective: Analyze the media coverage. Phase 1: Begin analysis.'
    }
  ]
}

# Optionally, push your results to your collection
# myx_board.push_to_hf()
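Because the results come back as plain Python data, you can also post-process them yourself, for example to print a simple leaderboard. A minimal sketch, assuming get_results() returns the dictionary structure shown above:

# Print a simple leaderboard from the MyxMatch results
# (assumes the {"myxmatch": [{"model": ..., "rank": ...}, ...]} structure shown above)
myxmatch_results = myx_board.get_results([EvaluationTask.MYXMATCH])

for entry in sorted(myxmatch_results["myxmatch"], key=lambda e: e["rank"]):
    print(f"{entry['rank']}. {entry['model']}")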

You can also view your results in the Remyx Studio app, in the Myxboards tab under the name you created.

With the MyxBoard, you can customize and streamline model evaluation for your specific needs. Whether you’re ranking models on a custom task or sharing results with the broader community, the MyxBoard makes it easy to tailor and track evaluations while adding context to your ML artifacts and enhancing your experiment workflow.

Stay tuned for thousands more fine-grained evaluation tasks coming soon to the Remyx CLI!

What’s next?

You can explore how to train and deploy a model: