Releases: confident-ai/deepeval
All RAG metrics now offer score reasoning, and a lot more.
In this release:
- Faithfulness, Answer Relevancy, Contextual Relevancy, Contextual Precision, and Contextual Recall now each offer a reason for the given score.
- Azure OpenAI now supported via a single command in the CLI: https://docs.confident-ai.com/docs/metrics-introduction#using-azure-openai
- New Summarization Metric that uses the QAG framework for its implementation: https://docs.confident-ai.com/docs/metrics-summarization
- Pulling datasets from Confident AI now offers an intermediate step for additional data processing before evaluation: https://docs.confident-ai.com/docs/confident-ai-evaluate-datasets#pull-your-dataset-from-confident-ai
- Decoupled imports from transformers, sentence_transformers, and pandas to reduce package size
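One common way to decouple heavy imports is to defer them until the feature that needs them is actually called. The following is a minimal sketch of that general pattern, not deepeval's actual implementation; load_optional and to_rows are hypothetical names introduced only for illustration.

```python
import importlib


def load_optional(module_name: str, feature: str):
    """Import an optional dependency only when a feature needs it.

    Hypothetical helper for illustration; not part of deepeval.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for {feature}. "
            "Install it to use this feature."
        ) from exc


def to_rows(records):
    # pandas is imported here rather than at module import time, so users
    # who never call this function do not need pandas installed.
    pd = load_optional("pandas", "tabular export")
    return pd.DataFrame(records)
```

With this layout, importing the package stays cheap, and a missing dependency produces a clear error only at the point where it is needed.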
Lots of new features
Lots of new features this release:
- JudgementalGPT now allows for different languages - useful for our APAC and European friends
- RAGAS metrics now support all OpenAI models - useful for those running into context length issues
- LLMEvalMetric now returns a reason for its score
- deepeval test run now has hooks that are called on test run completion
- evaluate now displays retrieval_context for RAG evaluation
- RAGAS metric now displays a metric breakdown for all its distinct metrics
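To illustrate the idea of a hook that fires when a test run completes, here is a generic sketch of the pattern. It does not use deepeval's API; on_run_complete and run_tests are hypothetical names, and the real hook mechanism may differ.

```python
from typing import Callable, Dict, List

# Registry of callbacks to invoke once the whole run has finished.
_completion_hooks: List[Callable[[Dict[str, int]], None]] = []


def on_run_complete(func):
    """Register a callback (hypothetical decorator, for illustration only)."""
    _completion_hooks.append(func)
    return func


def run_tests(tests):
    """Run each test callable, then pass a summary to every registered hook."""
    summary = {"passed": 0, "failed": 0}
    for test in tests:
        try:
            test()
            summary["passed"] += 1
        except AssertionError:
            summary["failed"] += 1
    for hook in _completion_hooks:
        hook(summary)
    return summary
```

A hook registered this way could, for example, post the pass/fail counts to a chat channel or a dashboard after every run.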
Continuous Evaluation
Automatically integrated with Confident AI for continuous evaluation throughout the lifetime of your LLM (app):
- log evaluation results and analyze metric passes / fails
- compare and pick the optimal hyperparameters (e.g. prompt templates, chunk size, models used, etc.) based on evaluation results
- debug evaluation results via LLM traces
- manage evaluation test cases / datasets in one place
- track events to identify live LLM responses in production
- add production events to existing evaluation datasets to strengthen evals over time
Evaluate entire datasets
Mid-week bug-fix release with an extra feature:
- run_test now works
- new function evaluate, which evaluates a list of test cases (a dataset) on metrics you define, all without having to go through the CLI. More info here: https://docs.confident-ai.com/docs/evaluation-datasets#evaluate-your-dataset-without-pytest
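Conceptually, evaluating a dataset without the CLI means looping over test cases and applying each metric in plain Python. The sketch below shows that idea with stand-in types; SimpleCase, Metric, exact_match, and evaluate_cases are hypothetical names and are not deepeval's classes or signatures, so consult the linked docs for the real API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SimpleCase:
    """Hypothetical stand-in for a test case."""
    input: str
    actual_output: str
    expected_output: str


@dataclass
class Metric:
    """Hypothetical stand-in for a metric with a pass threshold."""
    name: str
    measure: Callable[[SimpleCase], float]
    threshold: float = 0.5


def exact_match(case: SimpleCase) -> float:
    # Toy metric: 1.0 when outputs match, ignoring case and outer whitespace.
    same = case.actual_output.strip().lower() == case.expected_output.strip().lower()
    return 1.0 if same else 0.0


def evaluate_cases(cases: List[SimpleCase], metrics: List[Metric]) -> List[Dict]:
    """Score every case on every metric and record whether it passed."""
    results = []
    for case in cases:
        for metric in metrics:
            score = metric.measure(case)
            results.append({
                "input": case.input,
                "metric": metric.name,
                "score": score,
                "passed": score >= metric.threshold,
            })
    return results
```

Because this is an ordinary function call, it can run from a script or notebook without going through a test runner.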
Judgemental GPT
In this release, deepeval has added support for:
- JudgementalGPT, a dedicated LLM app developed by Confident AI to perform evaluations more robustly and accurately. JudgementalGPT provides a score and a reason for the score.
- Parallel testing: execute test cases in parallel and speed up evaluation by up to 100x.
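The speedup from parallel testing comes from overlapping slow, I/O-bound evaluations such as LLM API calls. The sketch below shows the general idea with a thread pool, where each result carries both a score and a reason; it assumes threads as the mechanism, which may not be how deepeval parallelizes, and run_case and run_parallel are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def run_case(case: str) -> Dict:
    # Stand-in for a slow, I/O-bound evaluation such as an LLM API call.
    time.sleep(0.05)
    score = 1.0 if case.strip() else 0.0
    reason = "non-empty output" if score else "empty output"
    return {"case": case, "score": score, "reason": reason}


def run_parallel(cases: List[str], max_workers: int = 8) -> List[Dict]:
    """Evaluate cases concurrently; results keep the order of the inputs."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_case, cases))
```

The actual speedup depends on how many cases can run at once and on any rate limits imposed by the evaluation backend.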
v0.20.17
new release
v0.20.16
new release
v0.20.15
new release
v0.20.14
prepare for release