Find Your Best Model

Model evaluation is more than a single score—it’s about comparing models based on your application’s requirements and constraints.

Remyx simplifies this process by ranking and evaluating models in context. The MyxBoard acts as a customizable leaderboard, organizing evaluations by your chosen metrics, tasks, and datasets. This helps you explore model performance, trade-offs, and strengths across scenarios.

MyxBoard: Model Evaluation Tracker

The MyxBoard organizes model evaluations into groups, making it easy to compare models across datasets and benchmarks. It keeps evaluations structured, reproducible, and shareable, so you can quickly find the best model for your needs.

Available Evaluations

MyxMatch:

  • Custom evaluations tailored to your application’s unique needs.
  • A custom benchmark synthesized from sample prompts and/or context about your use case, assessed per model by an LLM-as-a-Judge.

MyxMatch uses multiple LLM-as-a-Judge techniques to score each model’s responses to the prompt you provide. These scores are combined to rank the models, identifying the best fit without expensive training. Models at the top of the ranking demonstrate the strongest alignment with your application.
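
To make the aggregation step concrete, here is a minimal, purely illustrative sketch of how scores from several judges could be averaged into a single ordering. The rank_models helper and the score values are hypothetical and not part of the Remyx API; MyxMatch performs this aggregation for you.

Combining judge scores (illustrative)
# Hypothetical sketch: average per-judge scores and sort models best-first.
# Neither the function nor the scores below come from the Remyx API.

def rank_models(judge_scores: dict[str, list[float]]) -> list[tuple[str, float]]:
    averaged = {model: sum(scores) / len(scores) for model, scores in judge_scores.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)

example_scores = {
    "microsoft/Phi-3-mini-4k-instruct": [4.5, 4.0, 4.8],
    "Qwen/Qwen2-1.5B": [3.9, 4.1, 3.7],
}
for rank, (model, score) in enumerate(rank_models(example_scores), start=1):
    print(rank, model, round(score, 2))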

Benchmark:

  • Evaluations powered by lighteval, using predefined benchmark tasks such as TruthfulQA and other industry-standard datasets, addressed by lighteval-style task strings (see the sketch below).
  • Performance comparisons based on benchmark-specific metrics.
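
Benchmark tasks are referenced with lighteval-style task strings like "leaderboard|truthfulqa:mc|0|0", which appears again in the Quickstart below. As a rough guide, the fields follow lighteval's suite|task|num_fewshot|truncate_fewshot layout; the helper below is a hypothetical convenience for composing such strings, not part of the Remyx API, so double-check the format against the lighteval documentation for your installed version.

Composing a benchmark task string (illustrative)
# Hypothetical helper, not part of the Remyx API. Assumes lighteval's
# "suite|task|num_fewshot|truncate_fewshot" task-string layout.

def benchmark_task(suite: str, task: str, num_fewshot: int = 0, truncate_fewshot: int = 0) -> str:
    return f"{suite}|{task}|{num_fewshot}|{truncate_fewshot}"

print(benchmark_task("leaderboard", "truthfulqa:mc"))  # leaderboard|truthfulqa:mc|0|0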

Quickstart

You can quickly create a MyxBoard to track and view multiple evaluation results, either from a list of models or from a Hugging Face collection. Here’s an example that launches MyxMatch and Benchmark evaluation jobs for a MyxBoard using the Remyx Python API.

Running MyxMatch and Benchmark evaluations
from remyxai.client.myxboard import MyxBoard
from remyxai.api.evaluations import (
    AvailableArchitectures,
    EvaluationTask,
    BenchmarkTask,
)
from remyxai.client.remyx_client import RemyxAPI

# View available model architectures
available_archs = AvailableArchitectures().list_architectures()
print(available_archs)

# (Optional) Check if a model is supported using the validator helper function
# from remyxai.utils.validators import validate_model
# is_valid, reason = validate_model("microsoft/Phi-3-mini-4k-instruct", available_archs)

# View available benchmark evals
print(BenchmarkTask.list_tasks())

# Create a MyxBoard with selected models
model_ids = ["microsoft/Phi-3-mini-4k-instruct", "Qwen/Qwen2-1.5B"]
myx_board_name = "small-llm-comparison-media"
myx_board = MyxBoard(model_repo_ids=model_ids, name=myx_board_name)

# Define tasks: MyxMatch and Benchmark
tasks = [EvaluationTask.MYXMATCH, EvaluationTask.BENCHMARK]
bench_tasks = ["leaderboard|truthfulqa:mc|0|0"]
prompt = "You are a media analyst. Objective: Analyze the media coverage. Phase 1: Begin analysis."

# Run evaluations
remyx_api = RemyxAPI()
remyx_api.evaluate(myx_board, tasks, prompt=prompt, benchmark_tasks=bench_tasks)

# Retrieve results
results = myx_board.get_results()

# Example results
print(results)

The evaluate method launches the specified evaluation jobs and polls periodically for their completion. Once their status is "COMPLETE", you can fetch the results with myx_board.get_results(). You can expect a JSON object containing all job results:

Evaluation Results
{
  "myxmatch": [
    {
      "model": "Phi-3-mini-4k-instruct",
      "rank": 1,
      "prompt": "You are a media analyst. Objective: Analyze the media coverage. Phase 1: Begin analysis."
    },
    {
      "model": "Qwen2-1.5B",
      "rank": 2,
      "prompt": "You are a media analyst. Objective: Analyze the media coverage. Phase 1: Begin analysis."
    }
  ],
  "benchmark": [
    {
      "model": "microsoft/Phi-3-mini-4k-instruct",
      "metrics": {
        "leaderboard|truthfulqa:mc|0|0": {
          "truthfulqa_mc1": 0.361,
          "truthfulqa_mc1_stderr": 0.0168,
          "truthfulqa_mc2": 0.545,
          "truthfulqa_mc2_stderr": 0.0151
        }
      },
      "eval_tasks": "leaderboard|truthfulqa:mc|0|0",
      "execution_time": "09:38 November 15, 2024",
      "run_id": "2ad3d47d37724301bf28ef8979553b82"
    },
    {
      "model": "Qwen/Qwen2-1.5B",
      "metrics": {
        "leaderboard|truthfulqa:mc|0|0": {
          "truthfulqa_mc1": 1.0,
          "truthfulqa_mc1_stderr": 0.0,
          "truthfulqa_mc2": 0.0,
          "truthfulqa_mc2_stderr": 0.0
        }
      },
      "eval_tasks": "leaderboard|truthfulqa:mc|0|0",
      "execution_time": "09:38 November 15, 2024",
      "run_id": "2ad3d47d37724301bf28ef8979553b82"
    }
  ]
}
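
Because the results come back as a plain dictionary, you can post-process them however you like. The short sketch below is not part of the Remyx API; it simply assumes the structure shown above and prints the top-ranked MyxMatch model along with one benchmark metric per model.

Working with the results (illustrative)
# Assumes `results` has the structure shown above; adjust key names if yours differs.

top_match = min(results["myxmatch"], key=lambda entry: entry["rank"])
print(f"Best MyxMatch fit: {top_match['model']}")

for entry in results["benchmark"]:
    for task_name, metrics in entry["metrics"].items():
        print(entry["model"], task_name, metrics.get("truthfulqa_mc2"))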

Viewing and Sharing Your MyxBoard

The MyxBoard is stored and kept up to date automatically by the Remyx CLI, so your MyxBoards are always current and easily retrievable. If you created a MyxBoard from a Hugging Face collection, you can also store your results in that collection as a dataset with the push_to_hf() method.

View Results
# Fetch results
myx_board.get_results()

# Optionally, push your results to your Hugging Face collection
# (requires a MyxBoard created from a collection)
myx_board.push_to_hf()

You can also view your results in the Remyx Studio app. Find it in the Myxboard tab under the name you created.

Evaluate in the Studio

You can find MyxMatch under the “Explore” section of the home view. Once you’ve opened the tool, fill in a name for your matching job and include a representative sample or prompt in the context box to help source the best model. Optionally, click the “Model Selection” dropdown to choose which models to compare; by default, all available models are selected.

After you’ve clicked “Rank,” you’ll be redirected to the Myxboards view, where you can monitor the progress of your matching job. Once it’s finished, you’ll see a table of all selected models ordered by rank, best first. When you’re ready to move on and train a model, click the “Train” button in the last column.

Future Developments

With the MyxBoard, you can customize and streamline model evaluation for your specific needs. Whether you’re ranking models by alignment with your use case or sharing results with the broader community, the MyxBoard makes it easy to tailor and track evaluations while adding context to your ML artifacts and enhancing your experiment workflow.

Stay tuned for the next updates to the evaluation features on Remyx, including:

  • Support for more fine-grained evaluation tasks and model comparison options.
  • Expanded evaluation modes, including multimodal and agent-based tests.