In this tutorial, we’ll show you how to use MyxMatch with the Remyx CLI to select a model based on context about your application.

Follow the instructions on how to install and authenticate using the Remyx CLI before you begin.

Overview

Customer service chat applications are on the rise, but with new LLMs released constantly, which one makes the best base model for your application? In this tutorial, we’ll show how easy it is to evaluate candidate models for your use-case based on some relevant context.

# Standard library modules for optional logging and error reporting
import logging
import traceback

# Remyx client classes for organizing and running evaluations
from remyxai.client.myxboard import MyxBoard
from remyxai.client.remyx_client import RemyxAPI
from remyxai.api.evaluations import EvaluationTask

Comparing Candidate Models

Each LLM’s baseline capabilities are shaped by its training methods and datasets. In this example, we want the model with the strongest priors for handling customer queries.

model_ids = [
    'microsoft/Phi-3-mini-4k-instruct', 
    'BioMistral/BioMistral-7B', 
    'codellama/CodeLlama-7b-Instruct-hf', 
    'gorilla-llm/gorilla-openfunctions-v2', 
    'meta-llama/Llama-2-7b-hf', 
    'mistralai/Mistral-7B-Instruct-v0.3', 
    'meta-llama/Meta-Llama-3-8B', 
    'meta-llama/Meta-Llama-3-8B-Instruct', 
    'Qwen/Qwen2-1.5B', 
    'Qwen/Qwen2-1.5B-Instruct'
]

Instantiating a MyxBoard lets you group the results of your evaluations in one place for model comparison.

myx_board_name = "customer_support_myxboard"
myx_board = MyxBoard(model_repo_ids=model_ids, name=myx_board_name)

Making Your MyxMatch

MyxMatch is a service that simplifies custom model evaluation using LLM-as-a-Judge with synthetic data. All you need is a bit of context about your use-case or a few representative data samples.

tasks = [EvaluationTask.MYXMATCH]

# Representative data sample
prompt = "Your product is defective. It doesn't work as advertised."

Now we’re ready to launch evaluation jobs with the Remyx API. In this example, the prompt is tested against the candidate models and the results are asynchronously logged to the MyxBoard.

remyx_api = RemyxAPI()
remyx_api.evaluate(myx_board, tasks, prompt=prompt)
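
Optionally, you can wrap the same call in basic error handling using the logging and traceback modules imported earlier. This is a minimal sketch of one way to do it; the logging setup is our own choice and is not required by the Remyx API.

# Optional: submit the evaluation with basic error reporting.
logging.basicConfig(level=logging.INFO)

try:
    remyx_api.evaluate(myx_board, tasks, prompt=prompt)
    logging.info("Submitted MyxMatch evaluation for MyxBoard '%s'", myx_board_name)
except Exception:
    logging.error("Failed to submit evaluation:\n%s", traceback.format_exc())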

When the MyxMatch evaluation job finishes, a message is automatically printed indicating that the job is complete.

To see the evaluation results, you can use:

results = myx_board.get_results()
print(results)

The results are returned as JSON like the following:

[
  {
    "models": [
      { "model": "Qwen2-1.5B", "rank": 1 },
      { "model": "Phi-3-mini-4k-instruct", "rank": 2 },
      { "model": "Meta-Llama-3-8B-Instruct", "rank": 3 },
      { "model": "Meta-Llama-3-8B", "rank": 4 },
      { "model": "CodeLlama-7b-Instruct-hf", "rank": 5 },
      { "model": "Qwen2-1.5B-Instruct", "rank": 6 },
      { "model": "Mistral-7B-Instruct-v0.3", "rank": 7 },
      { "model": "BioMistral-7B", "rank": 8 },
      { "model": "gorilla-openfunctions-v2", "rank": 9 },
      { "model": "Llama-2-7b-hf", "rank": 10 }
    ],
    "prompt": "Your product is defective. It doesn't work as advertised."
  }
]
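
Assuming get_results() returns this structure as plain Python data, you can post-process it with the standard library. The snippet below is a small sketch based on the structure shown above; the output filename is just an example.

import json

# Print a readable ranking for each evaluated prompt
# (assumes the result structure shown above).
for entry in results:
    print(f"Prompt: {entry['prompt']}")
    for ranked in sorted(entry["models"], key=lambda m: m["rank"]):
        print(f"  {ranked['rank']:>2}. {ranked['model']}")

# Optionally save the raw results for later comparison
# (the filename here is just an example).
with open("customer_support_myxmatch_results.json", "w") as f:
    json.dump(results, f, indent=2)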

Conclusion

After scoring and ranking each model’s response, the Qwen2-1.5B model stands out from the remaining candidates with strong baseline capabilities for customer service use-cases.