PathIterable dataset with a single file and no validation will fail [BUG] #160

Open
jdeschamps opened this issue Jun 21, 2024 · 2 comments
Labels: bug (Something isn't working), wontfix (This will not be worked on)

@jdeschamps

Describe the bug
If no validation data is provided, the PathIterableDataset will try to split files (rather than patches) between train and validation. However, with a single file this will always throw an error.

from pathlib import Path

import numpy as np
from tifffile import imwrite

from careamics.config import DataConfig
from careamics.config.support import SupportedData
from careamics.lightning import CAREamicsTrainData

# write a single small 2D image to disk
rng = np.random.default_rng(42)
data = rng.integers(0, 255, (32, 32))
data_path = Path(".") / "data.tif"
imwrite(data_path, data)

data_config = DataConfig(
    data_type=SupportedData.TIFF.value,
    patch_size=(16, 16),
    axes="YX",
    batch_size=1,
)

# a single training file, no validation data, and no in-memory loading
data_module = CAREamicsTrainData(
    data_config=data_config,
    train_data=str(data_path),
    use_in_memory=False,
)
data_module.prepare_data()
data_module.setup()

Error:

ValueError: Not enough files to split a minimum of 5 files, got 1 files.
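
The failure mode can be illustrated without CAREamics. The sketch below is not the library's code: `split_files`, `val_fraction` and `minimum_files` are hypothetical names, and the threshold of 5 is only taken from the error message above. It shows why a file-level split cannot succeed when there is a single file.

```python
from pathlib import Path


def split_files(files, val_fraction=0.1, minimum_files=5):
    """Split a list of files (not patches) into train and validation lists.

    Hypothetical helper, not the CAREamics implementation.
    """
    n_files = len(files)
    if n_files < minimum_files:
        # a single file can never satisfy a file-level minimum
        raise ValueError(
            f"Not enough files to split a minimum of {minimum_files} files, "
            f"got {n_files} files."
        )
    # reserve at least one file for validation
    n_val = max(1, int(n_files * val_fraction))
    return files[:-n_val], files[-n_val:]
```

With ten files this returns nine training files and one validation file; with `[Path("data.tif")]` it raises the same kind of `ValueError` as reported here.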

This only applies when the data does not fit in memory (according to the CAREamics definition), which is an impossible case: if the data does not fit in memory, then we cannot train from this single file.

This issue cannot really be fixed unless we make the dataset even more complex (e.g. keep the validation set in memory and extract it randomly in a first pass).
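
As a rough idea of what such a first pass could look like, here is a minimal sketch. It assumes a 2D image that can be indexed lazily (for example a memory-mapped array) and a regular, non-overlapping patch grid. `extract_validation_patches` and its parameters are hypothetical and do not exist in CAREamics.

```python
import numpy as np


def extract_validation_patches(image, patch_size, n_val, rng):
    """Draw n_val random patches from a regular grid and keep them in memory.

    Hypothetical helper: returns (val_patches, train_ids), where train_ids
    are the grid indices of the patches left for training.
    """
    ph, pw = patch_size
    rows, cols = image.shape[0] // ph, image.shape[1] // pw
    n_patches = rows * cols
    if not 0 < n_val < n_patches:
        raise ValueError(f"n_val must be in [1, {n_patches - 1}], got {n_val}.")

    # pick validation patches without replacement
    val_ids = rng.choice(n_patches, size=n_val, replace=False)

    # np.stack copies the selected patches, so they stay in memory
    val_patches = np.stack(
        [
            image[(i // cols) * ph : (i // cols + 1) * ph,
                  (i % cols) * pw : (i % cols + 1) * pw]
            for i in val_ids
        ]
    )

    # every other grid position remains available for training
    train_ids = np.setdiff1d(np.arange(n_patches), val_ids)
    return val_patches, train_ids
```

The training iterator would then have to skip the validation positions, which is the extra complexity mentioned above.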

I am leaning towards waiting for the Zarr dataset and then retiring the PathIterableDataset. We could then provide a convenience function to convert train/validation/test files into a single Zarr archive and use it for training/prediction.

@jdeschamps added the bug label on Jun 21, 2024
@melisande-c

> This only applies when the data does not fit in memory (according to the CAREamics definition), which is an impossible case: if the data does not fit in memory, then we cannot train from this single file.

An error saying that the file is too big should probably be raised before we get to this point!

@jdeschamps added the wontfix label on Dec 17, 2024
@jdeschamps

Probably superseded by #292
