Customer service chat applications are on the rise, but with new LLMs released constantly, which one makes the best base model for your application? In this tutorial, we’ll show how easy it is to evaluate candidate models for your use case, given a bit of relevant context.
Each LLM’s baseline capabilities are shaped by its training methods and datasets. In this example, we want the model with the best priors for handling customer queries.
MyxMatch is a service that simplifies custom model evaluation using LLM-as-a-Judge with synthetic data. All you need is a bit of context about your use case or a few representative data samples.
tasks = [EvaluationTask.MYXMATCH]

# Representative data sample
prompt = "Your product is defective. It doesn't work as advertised."
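To build intuition for what LLM-as-a-Judge scoring does with a sample like this, here is a minimal, self-contained sketch. The `judge_score` function below is a stand-in for a real judge-model call and is not part of the Remyx SDK; a production judge would prompt a strong LLM with a scoring rubric instead of matching keywords.

```python
def judge_score(prompt: str, response: str) -> float:
    """Placeholder judge: reward empathetic, solution-oriented replies.

    A real LLM-as-a-Judge implementation would send the prompt, the
    response, and a rubric to a judge model and parse its score.
    """
    keywords = ("sorry", "apologize", "replace", "refund", "help")
    return float(sum(word in response.lower() for word in keywords))


def rank_candidates(prompt: str, responses: dict) -> list:
    """Score each candidate model's response and rank models best-first."""
    scored = [(model, judge_score(prompt, text)) for model, text in responses.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


prompt = "Your product is defective. It doesn't work as advertised."
responses = {
    "model-a": "That's not our problem.",
    "model-b": "I'm sorry to hear that. We can help with a refund or a replacement.",
}
print(rank_candidates(prompt, responses))
# The empathetic, solution-oriented reply from model-b ranks first.
```

The same compare-and-rank loop runs across every candidate model's response to the prompt; MyxMatch handles that orchestration for you.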
Now we’re ready to launch evaluation jobs with the Remyx API. In this example, the prompt is tested against each candidate model, and the results are logged asynchronously to the MyxBoard.
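Because results are logged asynchronously, client code typically polls until the job finishes before reading the MyxBoard. The sketch below illustrates that pattern; `fetch_status` is a hypothetical stub standing in for a real status call, not part of the Remyx SDK.

```python
import time

def fetch_status(job_id: str, _state={"calls": 0}) -> str:
    """Stub status endpoint: reports 'running' twice, then 'completed'.

    A real client would issue an API request for the job's status here.
    """
    _state["calls"] += 1
    return "completed" if _state["calls"] >= 3 else "running"


def wait_for_results(job_id: str, poll_interval: float = 0.01, timeout: float = 5.0) -> str:
    """Poll until the evaluation job finishes or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status(job_id)
        if status in ("completed", "failed"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError(f"job {job_id!r} did not finish within {timeout}s")


print(wait_for_results("myxmatch-demo"))  # the stub completes on the third poll
```

A short poll interval with a hard timeout keeps the loop responsive without hammering the API; an exponential backoff is a common refinement for longer jobs.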
After each model’s response is scored and ranked, the Qwen2-1.5B model stands out from the remaining candidates with strong baseline capabilities for customer service use cases.