Customer service chat applications are on the rise, but with new LLMs released constantly, which one makes the best base model for your application? In this tutorial, we'll show how easy it is to evaluate candidate models for your use-case, given some relevant context.
Each LLM's baseline capabilities are shaped by its training methods and datasets. In this example, we want the model with the best priors for handling customer queries.
MyxMatch is a service that simplifies custom model evaluation using LLM-as-a-Judge with synthetic data. All you need is a bit of context about your use-case or a few representative data samples.
```python
tasks = [EvaluationTask.MYXMATCH]

# Representative data sample
prompt = "Your product is defective. It doesn't work as advertised."
```
Now we’re ready to launch evaluation jobs with the Remyx API. In this example, the prompt is tested against the candidate models and the results are asynchronously logged to the MyxBoard.
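As a minimal, self-contained sketch of that step (the import paths, the MyxBoard constructor arguments, and the evaluate signature below are assumptions inferred from the snippets in this tutorial, not a verbatim copy of the Remyx API; the candidate model list is hypothetical):

```python
from remyxai.api.evaluations import EvaluationTask  # assumed import path
from remyxai.client.myx_board import MyxBoard       # assumed import path

# Hypothetical candidate models to compare
models = [
    "Qwen/Qwen2-1.5B",
    "candidate-model-2",
    "candidate-model-3",
]

# Assumed constructor: a MyxBoard tracks the models under comparison
myx_board = MyxBoard(model_repo_ids=models, name="customer-service-eval")

# Setup repeated from above so this sketch runs on its own
tasks = [EvaluationTask.MYXMATCH]
prompt = "Your product is defective. It doesn't work as advertised."

# Assumed call: launches the evaluation job; results are logged
# asynchronously to the MyxBoard as the job runs
myx_board.evaluate(tasks=tasks, prompt=prompt)
```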
When the MyxMatch evaluation job finishes, it automatically prints a message saying the job is complete. To see the evaluation results, you can use:
```python
results = myx_board.get_results()
print(results)
```
The results will be stored in a JSON object like the following:
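The exact fields depend on the Remyx API version; a hypothetical illustration of the shape (field names invented for illustration, with the top rank matching the outcome described below) might look like:

```json
{
  "myxmatch": {
    "prompt": "Your product is defective. It doesn't work as advertised.",
    "ranking": [
      { "model": "Qwen/Qwen2-1.5B", "rank": 1 },
      { "model": "candidate-model-2", "rank": 2 },
      { "model": "candidate-model-3", "rank": 3 }
    ]
  }
}
```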
After scoring and ranking each model's response, the Qwen2-1.5B model stands out from the other candidates, showing strong baseline capabilities for customer service use-cases.