
Releases: IBM/unitxt

1.11.1

08 Jul 05:52
b23fb42

Non backward compatible changes

  • The input_format field of the InputOutputTemplate class is now a required field; templates that do not use it must explicitly set it to None (see the sketch after this list). by @elronbandel in #982
  • Fix MRR RAG metric: fix the MRR wiring and allow context_ids to be a list of strings instead of a List[List[str]], so that the list of predicted context ids can be passed directly, as was done in unitxt version 1.7. Corresponding tests were added. This change may change the scores of the MRR metric. by @matanor in
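
A minimal sketch of the new requirement, assuming the standard unitxt template API (unitxt.templates.InputOutputTemplate); the format strings are illustrative.

from unitxt.templates import InputOutputTemplate

# input_format is now a required field; the format strings here are illustrative.
classification_template = InputOutputTemplate(
    input_format="Classify the following text: {text}",
    output_format="{label}",
)

# A template that does not use input_format must set it to None explicitly.
generation_template = InputOutputTemplate(
    input_format=None,
    output_format="{answer}",
)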

New Features

  • Add the option to specify the number of processes to use for parallel dataset loading by @csrajmohan in #974
  • Add option for lazy load hf inference engine by @elronbandel in #980
  • Added a format based on Huggingface format by @yoavkatz in #988

New Assets

  • Add code mixing metric, add language identification task, add format for Starling model by @arielge in #956

Bug Fixes

Documentation

  • Add an example that shows how to use LLM as a judge that takes the references into account… by @eladven in #981
  • Improve the examples table documentation by @eladven in #976

Refactoring

Testing and CI/CD

New Contributors

Full Changelog: 1.11.0...1.11.1

1.11.0 (#996)

07 Jul 11:32
306fc50

Non backward compatible changes

  • The input_format field of the InputOutputTemplate class is now a required field; templates that do not use it must explicitly set it to None. by @elronbandel in #982
  • Fix MRR RAG metric: fix the MRR wiring and allow context_ids to be a list of strings instead of a List[List[str]], so that the list of predicted context ids can be passed directly, as was done in unitxt version 1.7. Corresponding tests were added. This change may change the scores of the MRR metric. by @matanor in

New Features

  • Add the option to specify the number of processes to use for parallel dataset loading by @csrajmohan in #974
  • Add option for lazy load hf inference engine by @elronbandel in #980
  • Added a format based on Huggingface format by @yoavkatz in #988

New Assets

  • Add code mixing metric, add language identification task, add format for Starling model by @arielge in #956

Bug Fixes

Documentation

  • Add an example that shows how to use LLM as a judge that takes the references into account… by @eladven in #981
  • Improve the examples table documentation by @eladven in #976

Refactoring

Testing and CI/CD

New Contributors

Full Changelog: 1.10.3...1.11.0

1.10.3

04 Jul 08:23

Non backward compatible changes

  • The input_format field of the InputOutputTemplate class is now a required field; templates that do not use it must explicitly set it to None. by @elronbandel in #982
  • Fix MRR RAG metric: fix the MRR wiring and allow context_ids to be a list of strings instead of a List[List[str]], so that the list of predicted context ids can be passed directly, as was done in unitxt version 1.7. Corresponding tests were added. This change may change the scores of the MRR metric. by @matanor in

New Features

  • Add the option to specify the number of processes to use for parallel dataset loading by @csrajmohan in #974
  • Add option for lazy load hf inference engine by @elronbandel in #980
  • Added a format based on Huggingface format by @yoavkatz in #988

New Assets

  • Add code mixing metric, add language identification task, add format for Starling model by @arielge in #956

Bug Fixes

Documentation

  • Add an example that shows how to use LLM as a judge that takes the references into account… by @eladven in #981
  • Improve the examples table documentation by @eladven in #976

Refactoring

Testing and CI/CD

New Contributors

Full Changelog: 1.10.2...1.10.3

1.10.2

04 Jul 06:17
97243ad

Non backward compatible changes

  • None - this release is fully compatible with the previous release.

New Features

  • Added num_proc parameter, an optional integer that specifies the number of processes to use for parallel dataset loading (see the sketch after this list), by @csrajmohan in #974
  • Add option to lazy load hf inference engine and fix requirements mechanism by @elronbandel in #980
  • Add code mixing metric, add language identification task, add format for Starling model by @arielge in #956
  • Add metrics: domesticated safety and regard by @dafnapension in #983
  • Make input_format required field in InputOutputTemplate by @elronbandel in #982
  • Added a format based on Huggingface format by @yoavkatz in #988
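
A minimal sketch of the new num_proc option, assuming it is exposed on the HuggingFace loader (unitxt.loaders.LoadHF); the dataset name is illustrative.

from unitxt.loaders import LoadHF

# Illustrative: load a HuggingFace dataset using 4 worker processes in parallel.
loader = LoadHF(path="glue", name="wnli", num_proc=4)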

Bug Fixes

  • Fix the error in the examples table by @eladven in #976
  • Fix MRR RAG metric: fix the MRR wiring and allow context_ids to be a list of strings instead of a List[List[str]], so that the list of predicted context ids can be passed directly, as was done in unitxt version 1.7. Corresponding tests were added. by @matanor in #969
  • Fix llama_3_ibm_genai_generic_template by @lga-zurich in #978

Documentation

  • Add an example that shows how to use LLM as a judge that takes the references into account… by @eladven in #981

Refactoring

Testing and CI/CD

New Contributors

Full Changelog: 1.10.1...1.10.2

1.10.1

01 Jul 08:04
59b0a62

Main Changes

  • Continued major improvements to the documentation, including a new code examples section with standalone Python code that shows how to perform evaluation, add new datasets, compare formats, use LLMs as judges, and more. Cards for datasets from HuggingFace now have detailed descriptions, and new documentation covers RAG tasks and metrics.
  • load_dataset can now load cards defined in a Python file (and not only in the catalog). See example; a sketch also follows this list.
  • The evaluation results returned from evaluate now include two fields, predictions and processed_predictions. See example.
  • Task fields can have defaults, so a field not specified in the card receives its default value. For example, multi-class classification has text as the default text_type. See example.
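
A minimal sketch of loading a card defined in Python rather than fetched from the catalog, assuming the standard unitxt API; the dataset, task, and template catalog names are illustrative.

from unitxt import load_dataset
from unitxt.blocks import LoadHF, TaskCard

# A card defined directly in a Python file, not registered in the catalog
# (the task and templates catalog names are illustrative).
card = TaskCard(
    loader=LoadHF(path="glue", name="wnli"),
    task="tasks.classification.multi_class.relation",
    templates="templates.classification.multi_class.relation.all",
)

# load_dataset accepts the in-memory card alongside catalog assets.
# The results returned by evaluate() on the resulting predictions now also
# carry "predictions" and "processed_predictions" fields.
dataset = load_dataset(
    card=card,
    template="templates.classification.multi_class.relation.default",
    loader_limit=20,
)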

Non backward compatible changes

You need to recreate any cards/metrics you added by running the corresponding prepare//.py file. You can recreate all cards simply by running python utils/prepare_all_artifacts.py. This will avoid the type error.

The AddFields operator was renamed to Set and the CopyFields operator was renamed to Copy (see the sketch below). Previous code should continue to work, but all existing code in the unitxt and fm-eval repos has been updated to the new names.
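
A minimal sketch of the renamed operators, assuming they live in unitxt.operators; the field names and values are illustrative.

from unitxt.operators import Copy, Set

# AddFields -> Set: attach constant values to every instance.
add_classes = Set(fields={"classes": ["positive", "negative"]})

# CopyFields -> Copy: copy the value of one field into another.
copy_label = Copy(field="label", to_field="references")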

New Features

Bug Fixes

Documentation

New Assets

Testing and CI/CD

New Contributors

Full Changelog: 1.10.0...1.10.1

Unitxt 1.10.0

03 Jun 18:22
10ee34c

Main changes

  • Added support for handling sensitive data. When data is loaded from a data source using a Loader, the user can specify the classification of the data (e.g. "public" or "proprietary"). Unitxt components such as metrics and inference engines then check whether they are allowed to process the data based on their own configuration. For example, an LLM as judge that sends data to remote services can be configured to send only "public" data to those services. This replaces the UNITXT_ALLOW_PASSING_DATA_TO_REMOTE_API option, which was a general flag that was not data dependent and hence error prone. See the sketch after this list.
    See more details in https://unitxt.readthedocs.io/en/latest/docs/data_classification_policy.html
  • Added support for a metric score prefix. Each metric has a new optional string attribute score_prefix that is prepended to the names of all scores it generates. This allows the same metric to be applied to different fields of a task while keeping the resulting scores distinguishable.
  • New Operators tutorial and Loaders documentation
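
A minimal sketch of the two mechanisms above, assuming data_classification_policy can be set on a loader and score_prefix on a metric; the file path and prefix are illustrative.

from unitxt.loaders import LoadCSV
from unitxt.metrics import Accuracy

# Mark loaded instances as proprietary; components configured to process only
# "public" data (e.g. an LLM as judge calling a remote service) will refuse them.
loader = LoadCSV(
    files={"test": "my_private_data.csv"},  # illustrative path
    data_classification_policy=["proprietary"],
)

# score_prefix is prepended to every score name this metric produces, so the
# same metric can be applied to different task fields without name clashes.
context_accuracy = Accuracy(score_prefix="context_")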

Non backward compatible changes

  • StreamInstanceOperator was renamed to InstanceOperator

New Features

  • Support for handling sensitive data sent to remote services by @pawelknes in #806, @yoavkatz in #868
  • Added new NER metric using fuzzywuzzy logic by @sarathsgvr in #808
  • Added loader from HF spaces by @pawelknes in #860
  • Add metric prefix in main by @yoavkatz in #878
  • Add MinimumOneExamplePerLabelRefiner to ensure that at least one example of each label appears in the training data, by @alonh in #867

Bug Fixes

New Assets

Documentation

New Contributors

Full Changelog: 1.9.0...1.10.0

Unitxt 1.9.0

20 May 12:20

What's Changed

The most important things are:

  • Addition of LLM as a Judge metrics and tasks, both for evaluating LLMs as judges and for using them to evaluate other tasks. Read more in the LLM as a Judge Tutorial
  • Addition of RAG response generation tasks and datasets, as part of an effort to add comprehensive RAG evaluation to unitxt.
  • Renaming FormTask to Task for simplicity (see the sketch after this list)
  • Major improvements to documentation and tutorials
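
A minimal sketch of the FormTask to Task rename, assuming Task is importable from unitxt.blocks as FormTask was; the field names are illustrative.

from unitxt.blocks import Task  # formerly FormTask

sentiment_task = Task(
    inputs={"text": "str", "classes": "List[str]"},
    outputs={"label": "str"},
    prediction_type="str",
    metrics=["metrics.accuracy"],
)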

Breaking Changes 🚨

  • Ensure consistent evaluation of CI across implementations [Might change previous results] by @dafnapension in #844
  • Fix the default format so it is the same as formats.empty in the catalog. This impacts runs that did not specify a format, by @yoavkatz in #848
  • The LoadJson operator moved from unitxt.processors to unitxt.struct_data_operators
  • Fixed YesNoTemplate and DiverseLabelsSampler to support binary task typing. YesNoTemplate now expects the class field to contain a string rather than a single-element list of strings, by @yoavkatz in #836

Bug Fixes

New Features

New Assets

Documentation

New Contributors

Full Changelog: 1.8.1...1.9.0

1.8.1

06 May 07:52
d931ce2

What's Changed

  • Fix missing experiment_id for multiprocessing evaluation by @alonh in #798
  • Add cache to metric prediction_type to speedup by @yoavkatz in #801

Full Changelog: 1.8.0...1.8.1

Unitxt 1.8.0

05 May 14:50
ece6ff9

What's Changed

In this release, the main improvement focuses on introducing type checking within Unitxt tasks. Tasks are fundamental to the Unitxt protocol, acting as standardized blueprints for those integrating new datasets into Unitxt. They facilitate the use of task-specific templates and metrics. To guarantee precise dataset processing in line with the task schema, we've introduced explicit types to the task fields.

For example, consider the NER task in Unitxt, previously defined as follows:

from unitxt.blocks import FormTask
from unitxt.catalog import add_to_catalog

add_to_catalog(
    FormTask(
        inputs=["text", "entity_types"],
        outputs=["spans_starts", "spans_ends", "text", "labels"],
        metrics=["metrics.ner"],
    ),
    "tasks.ner",
)

Now, the NER task definition includes explicit types:

add_to_catalog(
    FormTask(
        inputs={"text": "str", "entity_types": "List[str]"},
        outputs={
            "spans_starts": "List[int]",
            "spans_ends": "List[int]",
            "text": "List[str]",
            "labels": "List[str]",
        },
        prediction_type="List[Tuple[str,str]]",
        metrics=["metrics.ner"],
    ),
    "tasks.ner",
)

This enhancement aligns with Unitxt's goal that definitions should be easily understandable and capable of facilitating validation processes with appropriate error messages to guide developers in identifying and solving issues.

For now, using the original definition format without typing will continue to work, but it will generate a warning message like the ones below. You should begin adapting your task definitions by adding types.

'inputs' field of Task should be a dictionary of field names and their types. For example, {'text': 'str', 'classes': 'List[str]'}. Instead only '['question', 'question_id', 'topic']' was passed. All types will be assumed to be 'Any'. In future version of unitxt this will raise an exception.
'outputs' field of Task should be a dictionary of field names and their types. For example, {'text': 'str', 'classes': 'List[str]'}. Instead only '['reference_answers', 'reference_contexts', 'reference_context_ids', 'is_answerable_label']' was passed. All types will be assumed to be 'Any'. In future version of unitxt this will raise an exception.

Special thanks to @pawelknes who implemented this important feature. It truly demonstrates the collective power of the Unitxt community and the invaluable contributions made by Unitxt users beyond the core development team. Such contributions are highly appreciated and encouraged.

  • For more detailed information, please refer to #710

Breaking Changes

"metrics.spearman", "metrics.kendalltau_b", "metrics.roc_auc": prediction type is float.
"metrics.f1_binary","metrics.accuracy_binary", "metrics.precision_binary", "metrics.recall_binary", "metrics.max_f1_binary", "metrics.max_accuracy_binary": prediction type is Union[float, int], references must be equal to 0 or 1

Bug Fixes

New Assets

New Features

  • Type checking for task definition by @pawelknes in #710
  • Add open and ibm_genai to llm as judge inference engine by @OfirArviv in #782
  • Add negative class score for binary precision, recall, f1 and max f1 by @lilacheden in #788
    1. Add negative class score for binary precision, recall, f1 and max f1, e.g. f1_binary now also returns "f1_binary_neg".
    2. Support Unions in metric prediction_type
    3. Add processor cast_to_float_return_nan_if_failed
    4. Breaking change: Make prediction_type of metrics numeric:
      A. "metrics.kendalltau_b", "metrics.roc_auc": prediction type is float.
      B. "metrics.f1_binary","metrics.accuracy_binary", "metrics.precision_binary", "metrics.recall_binary", "metrics.max_f1_binary", "metrics.max_accuracy_binary": prediction type is Union[float, int], references must be equal to 0 or 1
  • Group shuffle by @sam-data-guy-iam in #639

Documentation

Full Changelog: 1.7.7...1.8.0


Unitxt 1.7.9

05 May 12:43
ef01b8d

What's Changed

Full Changelog: 1.7.7...1.7.9