The main improvements in this version focus on caching strategies, dataset loading, and speed optimizations.
Hugging Face Datasets Caching Policy
We have completely revised our caching policy and how we handle Hugging Face datasets in order to improve performance.
- Hugging Face datasets are now cached by default. This means that the `LoadHF` loader will cache downloaded datasets in the HF cache directory (typically `~/.cache/huggingface/datasets`).
- To disable this caching mechanism, use `unitxt.settings.disable_hf_datasets_cache = True`.
- All Hugging Face datasets are first downloaded and then processed.
- This means the entire dataset is downloaded, which is faster for most datasets. However, if you want to process a huge dataset and the HF dataset supports streaming, you can load it in streaming mode with `LoadHF(name="my-dataset", streaming=True)`.
- To enable streaming mode by default for all Hugging Face datasets, use `unitxt.settings.stream_hf_datasets_by_default = True` (see the sketch below).
While the new defaults (full download & caching) may make the initial dataset load slower, subsequent loads will be significantly faster.
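Putting the pieces together, here is a minimal sketch of the new defaults and the opt-outs described above. The dataset name `"my-dataset"` is a placeholder taken from the snippet above, and the `LoadHF` instances are only constructed here, since loaders are normally used inside a card rather than called directly.

```python
# Minimal sketch of the new HF loading defaults; "my-dataset" is a placeholder.
import unitxt
from unitxt.loaders import LoadHF

# Default behavior: datasets behind LoadHF are fully downloaded and cached under
# ~/.cache/huggingface/datasets, so repeated loads are fast.
loader = LoadHF(name="my-dataset")

# Opt out of HF dataset caching if you always want fresh downloads:
unitxt.settings.disable_hf_datasets_cache = True

# Stream a huge dataset instead of downloading it in full:
streaming_loader = LoadHF(name="my-dataset", streaming=True)

# Or make streaming the default for every HF dataset:
unitxt.settings.stream_hf_datasets_by_default = True
```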
Unitxt Datasets Caching Policy
By default, when loading datasets with `unitxt.load_dataset`, the dataset is prepared from scratch each time you call the function.
This ensures that any changes made to the card definition are reflected in the output.
This process may take a few seconds, and for large datasets, repeated loading can accumulate overhead.
If you are using fixed datasets from the catalog, you can enable caching of the prepared Unitxt datasets.
The datasets are cached in the Hugging Face cache (typically `~/.cache/huggingface/datasets`):

```python
from unitxt import load_dataset

ds = load_dataset(card="my_card", use_cache=True)
```
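In practice, only the first call with `use_cache=True` pays the preparation cost; an identical repeated call is served from the cache. A minimal sketch, reusing the placeholder card name from above:

```python
from unitxt import load_dataset

ds = load_dataset(card="my_card", use_cache=True)        # prepared from scratch, then cached
ds_again = load_dataset(card="my_card", use_cache=True)  # reuses the cached copy
```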
Faster Unitxt Dataset Preparation
To improve dataset loading speed, we have optimized how Unitxt datasets are prepared.
Background:
Unitxt datasets are converted to Hugging Face datasets because Hugging Face datasets store data on disk and keep only the necessary parts in memory (via PyArrow). This enables efficient handling of large datasets without excessive memory usage.
Previously, `unitxt.load_dataset` used built-in Hugging Face methods for dataset preparation, which included unnecessary type handling and verification, slowing down the process.
Key improvements:
- We now create the Hugging Face dataset directly, reducing preparation time by almost 50% (see the illustrative sketch below).
- With this optimization, Unitxt datasets are now faster than ever!
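For intuition only (this is a simplified sketch, not the actual Unitxt internals), creating a Hugging Face dataset directly from already-prepared records looks roughly like this; the records and field names are hypothetical:

```python
# Illustrative sketch: build a Hugging Face dataset directly from prepared records,
# bypassing the generic loading path and its extra type handling and verification.
# The records and field names below are hypothetical.
from datasets import Dataset

records = [
    {"source": "What is the capital of France?", "target": "Paris"},
    {"source": "What is 2 + 2?", "target": "4"},
]

ds = Dataset.from_list(records)  # backed by a PyArrow table
print(ds.column_names)           # ['source', 'target']
```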
What's Changed
- End of year summary blog post by @elronbandel in #1530
- Updated documentation and examples of LLM-as-Judge by @tejaswini in #1532
- Eval assist documentation by @tejaswini in #1537
- Update notification banner styles and add 2024 summary blog link by @elronbandel in #1538
- Add more granite llm as judge artifacts by @martinscooper in #1516
- Fix Australian legal qa dataset by @elronbandel in #1542
- Set use 1 shot for wikitq in tables_benchmark by @yifanmai in #1541
- Bugfix: indexed row major serialization fails with None cell values by @yifanmai in #1540
- Solve issue of expired token in Unitxt Assistant by @eladven in #1543
- Add Replicate inference support by @elronbandel in #1544
- add a filter to wikitq by @ShirApp in #1547
- Add text2sql tasks by @perlitz in #1414
- Add deduplicate operator by @elronbandel in #1549
- Fix the authentication problem by @eladven in #1550
- Attach assistant answers to their origins with url link by @elronbandel in #1528
- Add mtrag benchmark by @elronbandel in #1548
- Update end of year summary blog by @elronbandel in #1552
- Add data classification policy to CrossProviderInferenceEngine initialization based on selected model by @elronbandel in #1539
- Fix recently broken rag metrics by @elronbandel in #1554
- Renamed criterias in LLM-as-a-Judge metrics to criteria - Breaking change by @tejaswini in #1545
- Finqa hash to top by @elronbandel in #1555
- Refactor safety metric to be faster and updated by @elronbandel in #1484
- Improve assistant by @elronbandel in #1556
- Feature/add global mmlu cards by @eliyahabba in #1561
- Add quality dataset by @eliyahabba in #1563
- Add CollateInstanceByField operator to group data by specific field by @sarathsgvr in #1546
- Fix prompts table benchmark by @ShirApp in #1565
- Create new IntersectCorrespondingFields operator by @pklpriv in #1531
- Add granite documents format by @elronbandel in #1566
- Revisit huggingface cache policy - BREAKING CHANGE by @elronbandel in #1564
- Add global mmlu lite sensitivity cards by @eliyahabba in #1568
- Add schema-linking by @KyleErwin in #1533
- fix the printout of empty strings in the yaml cards of the catalog by @dafnapension in #1567
- Use repr instead of to_json for unitxt dataset caching by @elronbandel in #1570
- Added key value extraction evaluation and example with images by @yoavkatz in #1529
New Contributors
- @tejaswini made their first contribution in #1532
- @KyleErwin made their first contribution in #1533
Full Changelog: 1.17.0...1.18.0