Allow pushing config version to hub #7378

Open
momeara opened this issue Jan 21, 2025 · 0 comments
Labels
enhancement New feature or request

momeara commented Jan 21, 2025

Feature request

Currently, a dataset can be given a version at creation time by passing the version argument to load_dataset(...). For example, create outcomes.csv on the command line:

printf 'id,value\n1,0\n2,0\n3,1\n4,1\n' > outcomes.csv

and then load it with a version:

import datasets

dataset = datasets.load_dataset(
    "csv",
    data_files="outcomes.csv",
    keep_in_memory=True,
    version="1.0.0",
)

The version is stored in each split's info and can be accessed with, e.g., next(iter(dataset.values())).info.version

This dataset can be uploaded to the hub with dataset.push_to_hub(repo_id = "maomlab/example_dataset"). This creates a dataset on the hub with the following metadata in the README.md, which does not include the version information:

---
dataset_info:
  features:
  - name: id
    dtype: int64
  - name: value
    dtype: int64
  splits:
  - name: train
    num_bytes: 64
    num_examples: 4
  download_size: 1332
  dataset_size: 64
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---

As a result, when I load the dataset back from the hub, the version information is missing:

dataset_from_hub_no_version = datasets.load_dataset("maomlab/example_dataset")
next(iter(dataset_from_hub_no_version.values())).info.version

I can add the version information manually on the hub by appending it to the end of the config's entry in the configs section:

...
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
  version: 1.0.0
---

And then when I download it, the version information is correct.
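The manual edit above can be sketched as a small text transformation. This is only an illustration: add_version_to_config is a hypothetical helper, not part of datasets, and it assumes the README.md front matter has the simple layout shown in this issue (a top-level configs: list closed by a --- line).

```python
def add_version_to_config(readme: str, config_name: str, version: str) -> str:
    """Append `version: <version>` to one entry of the `configs:` list.

    Hypothetical helper; it only handles the front matter layout shown
    in this issue and expects the front matter to end with a `---` line.
    """
    out = []
    in_configs = False  # inside the top-level `configs:` list
    in_target = False   # inside the list item for `config_name`
    done = False
    for line in readme.splitlines():
        # The target item ends at the first line not indented under it.
        if in_target and not line.startswith("  "):
            out.append(f"  version: {version}")
            in_target, done = False, True
        # The `configs:` section ends at the next top-level key or `---`.
        if in_configs and not line.startswith((" ", "- ")):
            in_configs = False
        if line == "configs:":
            in_configs = True
        elif in_configs and not done and line == f"- config_name: {config_name}":
            in_target = True
        out.append(line)
    return "\n".join(out) + "\n"


example = (
    "---\n"
    "configs:\n"
    "- config_name: default\n"
    "  data_files:\n"
    "  - split: train\n"
    "    path: data/train-*\n"
    "---\n"
)
patched = add_version_to_config(example, "default", "1.0.0")
print(patched)
```

A real fix would live in push_to_hub itself; this only mirrors the edit made by hand.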

Motivation

Why adding version information for each config makes sense

  1. The version is already recorded in the dataset config's info data structure, and the library already parses it correctly from the README.md metadata, so it makes sense for push_to_hub to write it as well.
  2. Version info at the config level is different from version info at the branch level: the former relates to the version of the specific dataset the config refers to, rather than the version of the dataset curation itself.

An explanation for the current behavior:

In datasets/src/datasets/info.py:159, the _INCLUDED_INFO_IN_YAML variable doesn't include "version".

If my reading of the code is right, adding "version" to _INCLUDED_INFO_IN_YAML would allow the version information to be uploaded to the hub.
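A minimal sketch of how an allow-list of this kind drops a field during serialization. SketchInfo, to_yaml_dict, and both lists are hypothetical stand-ins for illustration, not the library's actual code:

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SketchInfo:
    """Hypothetical, simplified stand-in for a dataset info record."""
    config_name: str = "default"
    version: Optional[str] = None
    download_size: Optional[int] = None
    dataset_size: Optional[int] = None


# Hypothetical allow-lists playing the role of _INCLUDED_INFO_IN_YAML,
# without and with the proposed "version" entry.
INCLUDED_WITHOUT_VERSION = ["config_name", "download_size", "dataset_size"]
INCLUDED_WITH_VERSION = INCLUDED_WITHOUT_VERSION + ["version"]


def to_yaml_dict(info: SketchInfo, included: list) -> dict:
    """Keep only the allow-listed fields that are actually set."""
    data = asdict(info)
    return {key: data[key] for key in included if data.get(key) is not None}


info = SketchInfo(version="1.0.0", download_size=1332, dataset_size=64)
without_version = to_yaml_dict(info, INCLUDED_WITHOUT_VERSION)
with_version = to_yaml_dict(info, INCLUDED_WITH_VERSION)
print(without_version)
print(with_version)
```

In this sketch the version is present on the in-memory object in both cases; only the allow-list decides whether it reaches the serialized metadata.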

Your contribution

Request: add "version" to _INCLUDED_INFO_IN_YAML in datasets/src/datasets/info.py:159

momeara added the enhancement (New feature or request) label Jan 21, 2025