LLM Evaluation - Comet's Opik #81

manisnesan · 2024-12-22T14:58:22Z

Why

What

How

Human review: manually review a dozen or so examples to catch obvious mistakes and identify what the evaluation pipeline should test for.
Unit tests: write granular tests for each part of the pipeline, such as retrieval, generation, and post-processing.
A/B testing: run controlled experiments to measure the impact of LLM updates, combining human and automated feedback to drive better decisions.
Fine-tuning & debugging: use annotated production data to fine-tune models, so that production-grade data feeds back into model improvements.
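
To make the unit-test item concrete, here is a minimal sketch of per-stage tests. Everything in it is a hypothetical stand-in: `DOCS`, `retrieve`, `generate`, and `postprocess` are toy functions invented for illustration, not Opik APIs, and `generate` fakes the LLM call so the tests stay deterministic. In a real pipeline each stage would wrap the actual retriever and model.

```python
import re
from typing import List

# Hypothetical toy corpus; a real pipeline would query a vector store or search index.
DOCS = {
    "doc1": "Opik is a tool for evaluating LLM applications.",
    "doc2": "Bananas are rich in potassium.",
}


def _words(text: str) -> set:
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query: str, k: int = 1) -> List[str]:
    """Retrieval stage: rank document ids by word overlap with the query."""
    ranked = sorted(DOCS, key=lambda d: len(_words(query) & _words(DOCS[d])), reverse=True)
    return ranked[:k]


def generate(query: str, context_ids: List[str]) -> str:
    """Generation stage: stand-in for an LLM call that echoes the retrieved context."""
    return "  Answer: " + " ".join(DOCS[d] for d in context_ids) + "  "


def postprocess(raw: str) -> str:
    """Post-processing stage: trim whitespace and drop the 'Answer:' prefix."""
    text = raw.strip()
    if text.startswith("Answer:"):
        text = text[len("Answer:"):]
    return text.strip()


# One small test per stage, so a failure points at the stage that broke.
def test_retrieval():
    assert retrieve("What is Opik?") == ["doc1"]


def test_generation_uses_context():
    assert "evaluating LLM applications" in generate("What is Opik?", ["doc1"])


def test_postprocessing():
    assert postprocess("  Answer: hello  ") == "hello"


if __name__ == "__main__":
    for test in (test_retrieval, test_generation_uses_context, test_postprocessing):
        test()
    print("all stage tests passed")
```

The functions follow the `test_*` naming convention, so a runner such as pytest would collect them as well. Testing stages separately keeps a retrieval regression from being misread as a generation problem.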
