Allow pushing config version to hub #7378

momeara · 2025-01-21T22:35:07Z

Feature request

Currently, a dataset can be versioned when it is created by passing the version argument to load_dataset(...). For example, create outcomes.csv on the command line:

printf 'id,value\n1,0\n2,0\n3,1\n4,1\n' > outcomes.csv

and then load it as a dataset with a version:

import datasets

dataset = datasets.load_dataset(
    "csv",
    data_files="outcomes.csv",
    keep_in_memory=True,
    version="1.0.0",
)

The version is stored in each split's info and can be accessed with, e.g., next(iter(dataset.values())).info.version

This dataset can be uploaded to the hub with dataset.push_to_hub(repo_id = "maomlab/example_dataset"). This creates a dataset repository on the hub with the following YAML header in README.md, which does not include the version information:

---
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: value
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  download_size: 1332
  dataset_size: 64
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

As a result, when I load the dataset back from the hub, the version information is missing:

dataset_from_hub_no_version = datasets.load_dataset("maomlab/example_dataset")
next(iter(dataset_from_hub_no_version.values())).info.version

I can add the version information manually on the hub by appending it to the end of the config entry in the configs section of README.md:

...
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  version: 1.0.0
---

When I then download the dataset, the version information is correct.
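The manual edit above can be sketched as plain string manipulation on the README text. This is only an illustration of the workaround, not an API of the datasets library: the helper name add_config_version is hypothetical, and it assumes the front matter is laid out exactly as shown above (config entries start with an unindented "- config_name: ..." line and their keys are indented by two spaces).

```python
def add_config_version(readme: str, config_name: str, version: str) -> str:
    """Append a `version` key to one entry of the `configs` list.

    Hypothetical helper: assumes each config entry starts with an
    unindented "- config_name: <name>" line and that the entry's
    remaining keys are indented with spaces.
    """
    out = []
    in_target = False  # True while we are inside the matching config entry
    for line in readme.split("\n"):
        # The entry ends at the next list item or at any unindented line
        # (for example the closing "---" of the front matter).
        if in_target and (line.startswith("- ") or not line.startswith(" ")):
            out.append(f"  version: {version}")
            in_target = False
        if line.strip() == f"- config_name: {config_name}":
            in_target = True
        out.append(line)
    if in_target:
        # The entry ran to the end of the text without a closing line.
        out.append(f"  version: {version}")
    return "\n".join(out)


readme = """---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
"""

updated = add_config_version(readme, "default", "1.0.0")
print(updated)
```

The updated text would still have to be written back to README.md in the dataset repository, for example through the hub's web editor.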

Motivation

Why adding version information for each config makes sense

The version information is already recorded in the dataset config info data structure, and the library already parses it correctly when it is present on the hub, so it makes sense to sync it with push_to_hub.
Keeping the version info at the config level is different from keeping it at the branch level: the former relates to the version of the specific dataset the config refers to, rather than the version of the dataset curation itself.

An explanation for the current behavior:

In datasets/src/datasets/info.py:159, the _INCLUDED_INFO_IN_YAML variable doesn't include "version".

If my reading of the code is right, adding "version" to _INCLUDED_INFO_IN_YAML would allow the version information to be uploaded to the hub.
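The effect of such an allow-list can be sketched in isolation. The snippet below is not the library's code: ToyInfo, to_yaml_dict, and INCLUDED_FIELDS are hypothetical stand-ins, assumed here only to show how a field that is absent from the list is dropped on export and kept once it is added.

```python
from dataclasses import asdict, dataclass

# Hypothetical allow-list, standing in for _INCLUDED_INFO_IN_YAML.
INCLUDED_FIELDS = ["config_name", "dataset_size"]


@dataclass
class ToyInfo:
    """Hypothetical stand-in for a dataset info record."""

    config_name: str
    dataset_size: int
    version: str


def to_yaml_dict(info: ToyInfo, included: list) -> dict:
    """Keep only the fields whose names appear in the allow-list."""
    return {key: value for key, value in asdict(info).items() if key in included}


info = ToyInfo(config_name="default", dataset_size=64, version="1.0.0")

# With the original list, the version is silently dropped.
without_version = to_yaml_dict(info, INCLUDED_FIELDS)
# Adding "version" to the list lets it through.
with_version = to_yaml_dict(info, INCLUDED_FIELDS + ["version"])

print(without_version)
print(with_version)
```

Whether the real export path behaves exactly like this filter is the assumption the request rests on, so it would need to be confirmed against info.py.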

Your contribution

Request: add "version" to _INCLUDED_INFO_IN_YAML in datasets/src/datasets/info.py:159

momeara added the enhancement New feature or request label Jan 21, 2025
