Renamed criterias in LLM-as-a-Judge metrics to criteria. #1545

Open
wants to merge 17 commits into
base: main
from


tejaswini
Member

No description provided.

@martinscooper
Collaborator

I would use italic for CrossProviderInferenceEngine

@@ -46,11 +46,11 @@ An LLM as a Judge metric consists of several essential components:
1. The judge model, such as *Llama-3-8B-Instruct* or *gpt-3.5-turbo*, which evaluates the performance of other models.
2. The platform responsible for executing the judge model, such as Huggingface, OpenAI API and IBM's deployment platforms such as WatsonX and RITS.
A lot of these model and catalog combinations are already predefined in our catalog. The models are prefixed by metrics.llm_as_judge.direct followed by the platform and the model name.
-For instance, metrics.llm_as_judge.direct.rits.llama3_1_70b refers to llama3 70B model that uses RITS deployment service.
+For instance, *metrics.llm_as_judge.direct.rits.llama3_1_70b* refers to llama3 70B model that uses RITS deployment service.
Collaborator


I would change llama3 70B -> LLama3 70B
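
For readers following the diff above, the catalog naming convention it describes (prefix, then platform, then model name) can be sketched as a small helper. This is only an illustration: `judge_catalog_name` and its validation are hypothetical and not part of unitxt. Only the prefix and the example name come from the quoted documentation.

```python
# Prefix quoted in the documentation diff above.
PREFIX = "metrics.llm_as_judge.direct"


def judge_catalog_name(platform: str, model: str) -> str:
    """Build a dotted catalog name from a platform and a model name.

    Hypothetical helper for illustration; unitxt does not expose it.
    """
    for part in (platform, model):
        # A component must be non-empty and must not contain the separator.
        if not part or "." in part:
            raise ValueError(f"invalid catalog name component: {part!r}")
    return ".".join([PREFIX, platform, model])


print(judge_catalog_name("rits", "llama3_1_70b"))
# metrics.llm_as_judge.direct.rits.llama3_1_70b
```

Rejecting dots inside a component keeps the three-part structure unambiguous, which is presumably why the catalog writes the model as `llama3_1_70b` rather than `llama3.1_70b`.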

@@ -8,43 +8,16 @@
from .error_utils import UnitxtError
from .inference import (
InferenceEngine,
OptionSelectingByLogProbsInferenceEngine,
Collaborator


Although these imports are unused, they are here so that users can import anything from unitxt.llmasjudge. Let's re-add them.

@yoavkatz
Member

yoavkatz commented Jan 24, 2025

Since we already announced it, we need to merge it as soon as possible. Are there any blockers to doing it today? Smaller fixes can be done later. We also need to make a new release and move FM-eval to it, so that FM-eval users do not start using the old name.

tejaswini-nexplore and others added 16 commits January 24, 2025 09:22
* add text2sql templates
* add data management utility for text2sql
* add basic template
* add sql execution accuracy metric
* add text2sql execution accuracy metric
* add text2sql task
* condition download in presence of a cache dir
* add init file
* add processors
* add processors
* add basic template
* change id to int
* change notations in templates
* push to catalog
* add evidence, remove SL
* remove unused function, fix
* fix imports from unitxt.text2sql
* push to catalog
* fix cache location
* add example
* fix imports
* add func_timeout to test reqs
* fix typing
* change template name
* push to catalog
* add req
* add local model option
* fix databases download
* fix databases download
* add loader limit to make example faster
* fix cache paths, avoid re-download
* add type schema
* remove imports from inits
* add text2sql to inits
* update card to use serializers
* add schema serializer
* add text2sql serializer to default template
* add schema to task
* adjust templates to using serializer
* adjust templates to using serializer
* fix processor
* remove target prefix from template
* add shuffle to bird
* add shuffle to bird
* edit template
* remove comment from init
* clear processors code
* add option with ticks
* add anls metric
* add template
* drop comment
* remove recursion limit
* add loader_limit to example
* fix recursion error
* move import to within metric
* remove catalog files wo prepare
* fix typing
* change template in example
* moving text2sql implementation to the main src dir
* fix imports
* fix imports
* fix imports
* fix imports
* import data_utils
* fix formatting
* refactor names
* add processors tests
* add more tests
* add tests
* refactor: allow more data sources
* allow db source input
* organize imports
* update example
* add db_type to task
* format
* add db_type to task
* add local db definition ability
* add EE tests
* add tests
* rename file
* rename file
* update sql metric
* rename file
* refactor types, serializers and metric

---------

Signed-off-by: Yotam-Perlitz <[email protected]>
* Add deduplicate operator
* Deduplicate MMLU
* Update Deduplicate example in documentation for clarity
* Deduplicate social iqa

---------

Signed-off-by: elronbandel <[email protected]>
* Add mtrag benchmark
* Add multi_type_serializer for references and prediction fields in various JSON metrics
* Remove unused TempOperator class and delete obsolete multi_turn.json task file

---------

Signed-off-by: elronbandel <[email protected]>
Signed-off-by: Martín Santillán Cooper <[email protected]>
Signed-off-by: Martín Santillán Cooper <[email protected]>
@martinscooper force-pushed the fixing_criterias_in_catalog branch from 9a968fd to 8efc10b on January 24, 2025 19:17
8 participants