Error when loading MINT-1T-PDF-2023-06 #13

Open
hancheolcho opened this issue Oct 17, 2024 · 4 comments
hancheolcho commented Oct 17, 2024

Hi,
First of all, thank you for releasing MINT-1T dataset :-)

I loaded one of the MINT-1T datasets (MINT-1T-PDF-CC-2023-06) but encountered the following error.

In [1]: ds = load_dataset(
    "mlfoundations/MINT-1T-PDF-CC-2023-06",
    download_config=DownloadConfig(resume_download=True))


...
Downloading data: 100%|████████████████████| 65007/65007 [9:17:51<00:00,  1.94files/s]
Computing checksums: 100%|████████████████████| 65007/65007 [00:59<00:00, 1087.82it/s]
Traceback (most recent call last):
  File "/home/hancheol/projects/mllm-data/mllm_data/a_preprocess/mint_1t_pdf/00_download.py", line 24, in <module>
    main(args)
  File "/home/hancheol/projects/mllm-data/mllm_data/a_preprocess/mint_1t_pdf/00_download.py", line 7, in main
    ds = load_dataset(
         ^^^^^^^^^^^^^
  File "/home/hancheol/anaconda3/lib/python3.11/site-packages/datasets/load.py", line 2609, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/hancheol/anaconda3/lib/python3.11/site-packages/datasets/builder.py", line 1027, in download_and_prepare
    self._download_and_prepare(
  File "/home/hancheol/anaconda3/lib/python3.11/site-packages/datasets/builder.py", line 1789, in _download_and_prepare
    super()._download_and_prepare(
  File "/home/hancheol/anaconda3/lib/python3.11/site-packages/datasets/builder.py", line 1100, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hancheol/anaconda3/lib/python3.11/site-packages/datasets/packaged_modules/webdataset/webdataset.py", line 82, in _split_generators
    inferred_arrow_schema = pa.concat_tables(pa_tables, promote_options="default").schema
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/table.pxi", line 5245, in pyarrow.lib.concat_tables
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: struct fields don't match or are in the wrong order: Input fields: struct<bff_contained_ngram_count_before_dedupe: int64, image_metadata: list<item: struct<height: int64, page: int64, sha256: string, width: int64, xref: int64>>, images: list<item: string>, pdf_name: string, texts: list<item: string>, url: string> output fields: struct<bff_contained_ngram_count_before_dedupe: int64, image_metadata: list<item: struct<height: int64, page: int64, sha256: string, width: int64, xref: int64>>, images: list<item: string>, language_id_whole_page_fasttext: struct<en: double>, pdf_name: string, previous_word_count: int64, texts: list<item: string>, url: string>

The error message shows that the output schema has two additional fields, language_id_whole_page_fasttext and previous_word_count:

Input fields:
    struct<
        bff_contained_ngram_count_before_dedupe: int64, 
        image_metadata: list<item: struct<
            height: int64, 
            page: int64, 
            sha256: string, 
            width: int64, 
            xref: int64
        >>, 
        images: list<item: string>, 
        pdf_name: string, 
        texts: list<item: string>, 
        url: string
    >

Output fields:
    struct<
        bff_contained_ngram_count_before_dedupe: int64, 
        image_metadata: list<item: struct<
            height: int64, 
            page: int64, 
            sha256: string, 
            width: int64, 
            xref: int64
        >>, 
        images: list<item: string>, 
        language_id_whole_page_fasttext: struct<en: double>,    # <=New one 
        pdf_name: string, 
        previous_word_count: int64,    # <=New one
        texts: list<item: string>, 
        url: string
    >
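
For what it's worth, the failure seems reproducible with plain pyarrow: under promote_options="default", concat_tables can add missing top-level columns, but it apparently does not reconcile structs whose nested fields differ. A minimal sketch with made-up values:

import pyarrow as pa

# Two single-row tables whose struct columns have different nested fields,
# mimicking one sample with and one without the extra metadata keys
t1 = pa.Table.from_pylist([{"json": {"url": "a", "texts": ["x"]}}])
t2 = pa.Table.from_pylist([{"json": {"url": "b", "texts": ["y"], "previous_word_count": 3}}])

# Raises ArrowTypeError: struct fields don't match or are in the wrong order
pa.concat_tables([t1, t2], promote_options="default")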

Do you have any idea how to fix it?

Best regards,
Han-Cheol

anas-awadalla (Collaborator) commented Oct 17, 2024

Hi Han-Cheol! Thanks for your interest. I am not too sure why the Hugging Face default way of loading the dataset returns this error; it seems that some samples might be missing some metadata fields. I have managed to load the data using the following code, hope this is helpful! Feel free to reopen if the issue persists.

from datasets import load_dataset, Value
from tqdm import tqdm
from PIL import Image
import io
import json

def read_mint_dataset(num_samples=None):
    # Load the dataset
    dataset = load_dataset("mlfoundations/MINT-1T-PDF-CC-2024-18", split="train", streaming=True)

    # Cast the raw json column to a plain string (cast_column expects a
    # feature type such as Value('string'), not the bare string 'string')
    dataset = dataset.cast_column('json', Value('string'))

    # Create a progress bar
    pbar = tqdm(desc="Reading samples", unit="sample")

    # Iterate through the dataset
    for i, sample in enumerate(dataset):
        # Update the progress bar
        pbar.update(1)

        # Read the (possibly multi-page) TIFF as a list of PIL Images
        tiff_image = sample['tiff']
        if isinstance(tiff_image, Image.Image):
            tiff = tiff_image
        else:
            tiff = Image.open(io.BytesIO(tiff_image))

        images = []
        for page in range(getattr(tiff, 'n_frames', 1)):
            if hasattr(tiff, 'seek'):
                tiff.seek(page)
            image = tiff.copy()
            if image.mode != 'RGB':
                image = image.convert('RGB')
            images.append(image)

        # Print some information about the sample; the json column may arrive
        # as a raw JSON string, so parse it before indexing into it
        metadata = sample['json']
        if isinstance(metadata, str):
            metadata = json.loads(metadata)
        print(f"Sample {i}:")
        print(f"  Text data: {metadata['texts']}")
        print(f"  Number of images: {len(images)}")
        print("---")

        if num_samples is not None and i + 1 >= num_samples:
            break

        if not isinstance(tiff_image, Image.Image):
            tiff.close()

    pbar.close()

if __name__ == "__main__":
    # Read 100 samples from the dataset
    read_mint_dataset(num_samples=100)

hancheolcho (Author) commented Oct 17, 2024

@anas-awadalla Thank you for the fast reply :-)

It is quite strange: when I run the load_dataset command as follows, I still see the error below (I am using datasets 3.0.1).

In [17]: ds = load_dataset("mlfoundations/MINT-1T-PDF-CC-2024-18", split="train", streaming=True)
---------------------------------------------------------------------------
ArrowTypeError                            Traceback (most recent call last)
Cell In[17], line 1
----> 1 ds = load_dataset("mlfoundations/MINT-1T-PDF-CC-2024-18", split="train", streaming=True)

File ~/anaconda3/lib/python3.11/site-packages/datasets/load.py:2093, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, keep_in_memory, save_infos, revision, token, streaming, num_proc, storage_options, trust_remote_code, **config_kwargs)
   2091 # Return iterable dataset in case of streaming
   2092 if streaming:
-> 2093     return builder_instance.as_streaming_dataset(split=split)
   2095 # Download and prepare data
   2096 builder_instance.download_and_prepare(
   2097     download_config=download_config,
   2098     download_mode=download_mode,
   (...)
   2101     storage_options=storage_options,
   2102 )

File ~/anaconda3/lib/python3.11/site-packages/datasets/builder.py:1263, in DatasetBuilder.as_streaming_dataset(self, split, base_path)
   1256 dl_manager = StreamingDownloadManager(
   1257     base_path=base_path or self.base_path,
   1258     download_config=DownloadConfig(token=self.token, storage_options=self.storage_options),
   1259     dataset_name=self.dataset_name,
   1260     data_dir=self.config.data_dir,
   1261 )
   1262 self._check_manual_download(dl_manager)
-> 1263 splits_generators = {sg.name: sg for sg in self._split_generators(dl_manager)}
   1264 # By default, return all splits
   1265 if split is None:

File ~/anaconda3/lib/python3.11/site-packages/datasets/packaged_modules/webdataset/webdataset.py:87, in WebDataset._split_generators(self, dl_manager)
     79     raise ValueError(
     80         "The TAR archives of the dataset should be in WebDataset format, "
     81         "but the files in the archive don't share the same prefix or the same types."
     82     )
     83 pa_tables = [
     84     pa.Table.from_pylist(cast_to_python_objects([example], only_1d_for_numpy=True))
     85     for example in first_examples
     86 ]
---> 87 inferred_arrow_schema = pa.concat_tables(pa_tables, promote_options="default").schema
     88 features = datasets.Features.from_arrow_schema(inferred_arrow_schema)
     90 # Set Image types

File ~/anaconda3/lib/python3.11/site-packages/pyarrow/table.pxi:6106, in pyarrow.lib.concat_tables()

File ~/anaconda3/lib/python3.11/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()

File ~/anaconda3/lib/python3.11/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

ArrowTypeError: struct fields don't match or are in the wrong order: Input fields: struct<bff_contained_ngram_count_before_dedupe: int64, image_metadata: list<item: struct<height: int64, page: int64, sha256: string, width: int64, xref: int64>>, images: list<item: string>, pdf_name: string, texts: list<item: string>, url: string> output fields: struct<bff_contained_ngram_count_before_dedupe: int64, image_metadata: list<item: struct<height: int64, page: int64, sha256: string, width: int64, xref: int64>>, images: list<item: string>, language_id_whole_page_fasttext: struct<en: double>, pdf_name: string, previous_word_count: int64, texts: list<item: string>, url: string>

anas-awadalla (Collaborator) commented Oct 17, 2024

Ah, OK. I was using datasets 2.16.1, but I can confirm that 3.0.1 has this issue. Here is how I fixed it:

from datasets import load_dataset, Features, Value, Image
from tqdm import tqdm
from PIL import Image as PILImage
import io
import json

def read_mint_dataset(num_samples=None):
    # Load the dataset with explicit features so that datasets does not try
    # to infer (and fail to unify) the nested json schema
    dataset = load_dataset("mlfoundations/MINT-1T-PDF-CC-2024-18", split="train", streaming=True, features=Features(
        tiff=Image(decode=True),
        json=Value('string')
    ))

    # Create a progress bar
    pbar = tqdm(desc="Reading samples", unit="sample")

    # Iterate through the dataset
    for i, sample in enumerate(dataset):
        # Update the progress bar
        pbar.update(1)

        # Read the (possibly multi-page) TIFF as a list of PIL Images
        tiff_image = sample['tiff']
        if isinstance(tiff_image, PILImage.Image):
            tiff = tiff_image
        else:
            # Use PIL, not the datasets Image feature class, to open raw bytes
            tiff = PILImage.open(io.BytesIO(tiff_image))

        images = []
        for page in range(getattr(tiff, 'n_frames', 1)):
            if hasattr(tiff, 'seek'):
                tiff.seek(page)
            image = tiff.copy()
            if image.mode != 'RGB':
                image = image.convert('RGB')
            images.append(image)

        # Print some information about the sample; the json column is a raw
        # JSON string here, so parse it before indexing into it
        metadata = sample['json']
        if isinstance(metadata, str):
            metadata = json.loads(metadata)
        print(f"Sample {i}:")
        print(f"  Text data: {metadata['texts']}")
        print(f"  Number of images: {len(images)}")
        print("---")

        if num_samples is not None and i + 1 >= num_samples:
            break

        if not isinstance(tiff_image, PILImage.Image):
            tiff.close()

    pbar.close()

if __name__ == "__main__":
    # Read 100 samples from the dataset
    read_mint_dataset(num_samples=100)

anas-awadalla reopened this Oct 17, 2024
hancheolcho (Author) commented Oct 21, 2024

@anas-awadalla Hi,

Finally, I was able to load the data following your code. Thank you so much :-)

P.S. My code differs slightly in that 1) MINT-1T-PDF was downloaded with the huggingface-cli download command, and 2) I load each tar file as a separate HF Dataset for parallel processing, as follows.

    ds = load_dataset(
        "webdataset", data_files=filepath, split="train", streaming=True,
        features=Features(tiff=Image(decode=True), json=Value("string")),
    )
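
For reference, a minimal sketch of the per-shard parallel loop (the shard path, worker count, and per-sample step below are illustrative assumptions, not my exact code):

from glob import glob
from multiprocessing import Pool

from datasets import load_dataset, Features, Image, Value

def process_shard(filepath):
    # One HF Dataset per tar shard, using the explicit-features workaround
    ds = load_dataset(
        "webdataset", data_files=filepath, split="train", streaming=True,
        features=Features(tiff=Image(decode=True), json=Value("string")),
    )
    for sample in ds:
        pass  # per-sample preprocessing goes here

if __name__ == "__main__":
    # Hypothetical local directory populated by `huggingface-cli download`
    shards = sorted(glob("MINT-1T-PDF-CC-2023-06/**/*.tar", recursive=True))
    with Pool(processes=8) as pool:
        pool.map(process_shard, shards)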
