Add BigEarthNetv2 dataset #2371

Open
wants to merge 2 commits into
base: main
Conversation

ando-shah

Basic BENv2 dataset added. _verify(..) and checksums are still missing.

@github-actions github-actions bot added the datasets Geospatial or benchmark datasets label Oct 28, 2024
from torchgeo.datasets.geo import NonGeoDataset
from torchgeo.datasets.utils import download_url, extract_archive, sort_sentinel2_bands

class BigEarthNetv2(NonGeoDataset):
Collaborator

I think the existing BigEarthNet dataset and version 2 share a lot of code. The __init__ signature is the same as well, so it would be beneficial to inherit from BigEarthNet instead of NonGeoDataset. That way most of the code does not have to be written twice, and only the relevant parts need to be overridden.
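
A minimal sketch of the inheritance approach suggested here. To stay self-contained it uses a stand-in base class rather than torchgeo's real BigEarthNet; the class names, the `metadata_file` attribute, and the `describe` method are hypothetical and only illustrate overriding the parts that differ.

```python
class BaseDataset:
    """Stand-in for the existing v1 dataset class (hypothetical, not torchgeo's API)."""

    # Hypothetical class attribute naming the metadata file for this version.
    metadata_file = "v1_metadata.csv"

    def __init__(self, root: str = "data", split: str = "train") -> None:
        self.root = root
        self.split = split

    def describe(self) -> str:
        # Shared logic that a subclass reuses unchanged.
        return (
            f"{type(self).__name__}(root={self.root!r}, "
            f"split={self.split!r}, metadata={self.metadata_file!r})"
        )


class DatasetV2(BaseDataset):
    """Version 2: same __init__ signature, only the differing parts are overridden."""

    # Made-up file name, used purely for illustration.
    metadata_file = "v2_metadata.parquet"


ds = DatasetV2(root="data", split="train")
print(ds.describe())
```

The subclass inherits `__init__` and `describe` as-is, so only version-specific details need new code.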

Collaborator

Can someone point me to a summary of the changes between v1 and v2? Basically I'm curious if this is just a newer version of the same dataset (slightly more images or better labels) or a completely new iteration. We could either do:

BigEarthNet(version=1)
BigEarthNet(version=2)

in the case of the former or:

BigEarthNet()
BigEarthNetv2()

in the case of the latter.

Author

@ando-shah ando-shah Oct 28, 2024

Yes you're right, I've basically used v1 as the template to write v2.
v2 is the same dataset with the following changes:

  • Better labels
  • Better spatial splits
  • Fewer images - problematic images removed
  • Pixel-level labels ('maps') are additionally available, along with the multi-class image-level labels

The reBEN paper details these differences.

Collaborator

Gotcha. In that case I'm kind of leaning towards adding a version parameter to the existing dataset. We could default to 2. This would require the fewest code modifications.
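
A rough sketch of what a version parameter defaulting to 2 could look like. The class below is a hypothetical stand-in, not the existing torchgeo dataset, and the per-version file names are made up for illustration.

```python
class VersionedDataset:
    """Hypothetical stand-in showing a `version` argument (not torchgeo's actual API)."""

    # Per-version settings; the file names are invented for this example.
    _configs = {
        1: {"metadata": "v1_metadata.csv"},
        2: {"metadata": "v2_metadata.parquet"},
    }

    def __init__(self, root: str = "data", split: str = "train", version: int = 2) -> None:
        if version not in self._configs:
            raise ValueError(
                f"Unsupported version {version!r}; expected one of {sorted(self._configs)}"
            )
        self.root = root
        self.split = split
        self.version = version
        # Everything that differs between versions is looked up in one place.
        self.config = self._configs[version]


latest = VersionedDataset()          # defaults to version 2
legacy = VersionedDataset(version=1)  # opt in to the older release
```

Keeping the differences in a single lookup table means the rest of the class can stay shared between versions.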

Collaborator

We have a similar discussion on naming in #2394

@adamjstewart adamjstewart added this to the 0.7.0 milestone Oct 28, 2024
Author

ando-shah commented Oct 28, 2024 via email

dir_s2 = self.metadata_locs["s2"]["directory"]
dir_maps = self.metadata_locs["maps"]["directory"]

self.metadata = pd.read_parquet(os.path.join(self.root, filename))
Collaborator

Just noticed that this is missing a filter on the split, so the metadata is not restricted to the requested set.
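
One possible way to add such a filter, sketched on a toy DataFrame instead of the real parquet file. The column names `patch_id` and `split` are assumptions and may not match the actual metadata schema.

```python
import pandas as pd

# Toy metadata standing in for the parquet file; the column names are assumptions.
metadata = pd.DataFrame(
    {
        "patch_id": ["a", "b", "c", "d"],
        "split": ["train", "validation", "test", "train"],
    }
)


def filter_by_split(df: pd.DataFrame, split: str) -> pd.DataFrame:
    """Keep only the rows that belong to the requested split."""
    return df[df["split"] == split].reset_index(drop=True)


train = filter_by_split(metadata, "train")
print(train["patch_id"].tolist())
```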


Args:
root: root directory where dataset can be found
split: train/val/test split to load
Collaborator

And in v2, the val split is called "validation".
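
A small sketch of how the documented train/val/test names could be translated to the v2 naming. The helper and the mapping are hypothetical; only the "val" to "validation" rename comes from this thread, and the assumption that "train" and "test" are unchanged is not verified here.

```python
# Hypothetical mapping from user-facing split names to the names stored in v2.
# Only "val" -> "validation" is taken from the review; the rest is assumed.
SPLIT_ALIASES = {"train": "train", "val": "validation", "test": "test"}


def resolve_split(split: str) -> str:
    """Translate a user-facing split name into the name used in the v2 metadata."""
    try:
        return SPLIT_ALIASES[split]
    except KeyError:
        raise ValueError(
            f"Unknown split {split!r}; expected one of {sorted(SPLIT_ALIASES)}"
        ) from None


print(resolve_split("val"))
```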

Labels
datasets Geospatial or benchmark datasets
3 participants