fix: 🎨 Parquet data type fixed #1456

Closed

Conversation

jeremyarancio
Contributor

What

  • Timestamps in the JSONL/Parquet export were in the wrong format: raw Unix timestamps
  • Using DuckDB, this is now fixed in the Parquet file exported to Hugging Face (HF)

Screenshot

(image attached)

Part of

@jeremyarancio
Contributor Author

This PR should also contain the fix for images, but it is not done, for 2 reasons:

  • The field's JSON structure is not consistent across rows, leading to issues during the conversion to Parquet. Not sure if the problem can be solved...
  • The data contained in the images field is actually not that relevant for users.

@raphael0202
Collaborator

> The field's JSON structure is not consistent across rows, leading to issues during the conversion to Parquet. Not sure if the problem can be solved...

We should then reformat the images field so that it is Parquet compatible. One possibility would be to make images a list of dicts such as:

{
    "sizes": {
        "100": {
            "h": 100,
            "w": 56
        },
        "400": {
            "h": 400,
            "w": 225
        },
        "full": {
            "h": 3555,
            "w": 2000
        }
    },
    "uploaded_t": "1490702616",
    "uploader": "twan51"
    "imgid": "5",
    "key": "front_fr"
}

The difference here is that the dictionary key would be part of the item directly.
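For reference, a hypothetical illustration of the current shape (values are made up): the images object is keyed by the image key, and the set of keys differs from product to product, which is what breaks schema inference:

{
    "front_fr": {
        "sizes": {"100": {"h": 100, "w": 56}, "full": {"h": 3555, "w": 2000}},
        "uploaded_t": "1490702616",
        "uploader": "twan51",
        "imgid": "5"
    },
    "1": {
        "sizes": {"100": {"h": 100, "w": 75}, "full": {"h": 2000, "w": 1500}},
        "uploaded_t": "1490702620",
        "uploader": "twan51",
        "imgid": "1"
    }
}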

> The data contained in the images field is actually not that relevant for users.

It is important, it's one of the fields I (and other reusers) use the most often to fetch the images ;)

@@ -18,7 +18,7 @@ COPY (
     completeness,
     correctors_tags,
     countries_tags,
-    created_t,
+    to_timestamp(created_t)::datetime AS created_t, -- Convert from unixtime to datetime
Collaborator

Hmm, I'm not sure I understand why you do this?
created_t is an int, and it looks like it's correctly detected by Parquet.

Contributor Author

Indeed, but it is a Unix timestamp, which is not particularly obvious to someone new to the dataset

Collaborator

> Indeed, but it is a Unix timestamp, which is not particularly obvious to someone new to the dataset

We should keep it as is, so that we preserve compatibility as much as possible with the other database dumps (JSONL, MongoDB, API, etc.).
We can add a created_datetime version, just like the CSV does, with the converted datetime though!
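A minimal sketch of how that could look in the export query (illustrative, not the final change):

    created_t,                                               -- keep the raw Unix timestamp for compatibility
    to_timestamp(created_t)::datetime AS created_datetime,   -- add a converted datetime column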

@jeremyarancio
Contributor Author

> It is important, it's one of the fields I (and other reusers) use the most often to fetch the images ;)

Ok, if it is actually useful

@jeremyarancio
Contributor Author

> We should then reformat the images field so that it is Parquet compatible. One possibility would be to make images a list of dicts [...]

I tried different methods, including what you proposed, but I can't manage to get the field schema recognized by Parquet...
Here's the "list of dicts" solution in DuckDB:

-- Attempt: turn each product's images dict into an array of structs,
-- injecting the original dict key as a "key" field on every item.
CREATE OR REPLACE TABLE processed_images AS (
    WITH processed_images AS (
        SELECT
            code,
            json_merge_patch(
                to_json({'key': unnest(json_keys(images))}),
                unnest(images->'$.*')
            ) AS merged
        FROM jsonl_sample
    )
    SELECT
        code,
        array_agg(merged) AS images_array
    FROM processed_images
    GROUP BY code
);

-- Then join the aggregated images back onto the main export query.
SELECT
    jsonl.code,
    <...>,
    processed_images.images_array
FROM jsonl_sample AS jsonl
LEFT JOIN processed_images
    ON jsonl.code = processed_images.code;

@raphael0202
Collaborator

A possibility would be to iterate over the JSONL from Python code, post-process the data, and create the Parquet file directly by submitting the data and the schema, no?
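A rough sketch of that approach with pyarrow (file names, the reduced schema, and the post_process helper are illustrative, not the project's actual code):

import json

import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative, reduced schema: the real export would declare all columns explicitly.
SCHEMA = pa.schema([
    ("code", pa.string()),
    ("created_t", pa.int64()),
    ("images", pa.list_(pa.struct([
        ("key", pa.string()),
        ("imgid", pa.string()),
        ("uploaded_t", pa.string()),
        ("uploader", pa.string()),
    ]))),
])


def post_process(product: dict) -> dict:
    """Normalize one JSONL row so that it matches the declared schema."""
    images = product.get("images") or {}
    return {
        "code": str(product.get("code", "")),
        "created_t": int(product.get("created_t") or 0),
        "images": [
            {
                "key": str(key),
                "imgid": str(value.get("imgid", "")),
                "uploaded_t": str(value.get("uploaded_t", "")),
                "uploader": str(value.get("uploader", "")),
            }
            for key, value in images.items()
        ],
    }


# The real dump is huge, so in practice this would be written in batches
# (e.g. with pyarrow.parquet.ParquetWriter); kept in memory here to keep the sketch short.
with open("products.jsonl") as f:
    rows = [post_process(json.loads(line)) for line in f]

table = pa.Table.from_pylist(rows, schema=SCHEMA)
pq.write_table(table, "food.parquet")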

@jeremyarancio
Contributor Author

This is the type of problem I'm dealing with, in addition to how large the dataset is and other bugs...
-> Data conversion is not done right because the data is not consistent across the JSONL

@raphael0202
Collaborator

> Data conversion is not done right because the data is not consistent across the JSONL

What's not consistent, for example?

@jeremyarancio
Contributor Author

> Data conversion is not done right because the data is not consistent across the JSONL

> What's not consistent, for example?

Check above: the data is either a string or an int for the same item.
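For instance, two hypothetical rows where the same nested field switches type (values made up, but this is the pattern):

{"code": "123", "images": {"1": {"imgid": "5", "uploaded_t": "1490702616"}}}
{"code": "456", "images": {"1": {"imgid": 5, "uploaded_t": 1620000000}}}

Parquet needs a single type per (nested) column, so this kind of drift makes a naive conversion fail.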

@raphael0202
Collaborator

raphael0202 commented Nov 12, 2024

Ok, I think you should use a data validation library such as pydantic (https://pydantic.dev/) to validate the input data and force the correct type!
Pydantic should be fast enough for this purpose.

@jeremyarancio
Contributor Author

jeremyarancio commented Nov 12, 2024

Pydantic could be useful later on to impose a data contract on the JSONL, to prevent any modification of the data schema before the conversion process.
Who knows how the JSONL will change in the coming months.
But it doesn't "force" the correct type. The correct type needs to be converted manually, like in the example below.

I'll handle future incorrect JSONL field types using try & except for each column; if the conversion raises, the column is left unconverted and a warning message is logged (a sketch of this follows the snippet below).

[
    {
        "key": str(key),
        "imgid": str(value.get("imgid", "unknown")),
        "sizes": {
            "100": {
                "h": int(value.get("sizes", {}).get("100", {}).get("h", 0) or 0),  # (or 0) because "h" or "w" can be None, leading to an error with int()
                "w": int(value.get("sizes", {}).get("100", {}).get("w", 0) or 0),
            },
            "200": {
                "h": int(value.get("sizes", {}).get("200", {}).get("h", 0) or 0),
                "w": int(value.get("sizes", {}).get("200", {}).get("w", 0) or 0),
            },
            "400": {
                "h": int(value.get("sizes", {}).get("400", {}).get("h", 0) or 0),
                "w": int(value.get("sizes", {}).get("400", {}).get("w", 0) or 0),
            },
            "full": {
                "h": int(value.get("sizes", {}).get("full", {}).get("h", 0) or 0),
                "w": int(value.get("sizes", {}).get("full", {}).get("w", 0) or 0),
            },
        },
        "uploaded_t": str(value.get("uploaded_t", "unknown")),
        "uploader": str(value.get("uploader", "unknown")),
    }
    for key, value in image.items()
]
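A minimal sketch of that column-level fallback (function and column names are illustrative, not the project's actual code):

import logging

logger = logging.getLogger(__name__)


def convert_column(product: dict, column: str, converter) -> None:
    """Try to convert one column in place; on failure, leave it as-is and log a warning."""
    try:
        product[column] = converter(product[column])
    except (AttributeError, KeyError, TypeError, ValueError) as err:
        logger.warning("Column %r was not converted: %s", column, err)


# Example on a single JSONL row: created_t converts, images is left untouched and warned about.
product = {"code": "123", "created_t": "1490702616", "images": None}
convert_column(product, "created_t", int)
convert_column(product, "images", lambda imgs: [dict(v, key=k) for k, v in imgs.items()])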

@raphael0202
Collaborator

raphael0202 commented Nov 12, 2024

> But it doesn't "force" the correct type.

It can definitely do that (note the inconsistent types for h and w in "sizes"):

from pydantic import BaseModel
from typing import Literal

class ImageSize(BaseModel):
    h: int
    w: int


class Image(BaseModel):
    key: str
    sizes: dict[Literal["100", "200", "400", "full"], ImageSize]
    uploaded_t: int
    imgid: int | None = None
    uploader: str | None = None


raw_data = {
    "key": "nutrition_fr",
    "imgid": "3",
    "sizes": {
        "100": {"h": 100, "w": 100},
        "200": {"h": "200", "w": 200},
        "full": {"h": 600, "w": "600"}
    },
    "uploaded_t": 1620000000,
}

image = Image(**raw_data)

print(image.model_dump())

# Output:
# {'key': 'nutrition_fr', 'sizes': {'100': {'h': 100, 'w': 100}, '200': {'h': 200, 'w': 200}, 'full': {'h': 600, 'w': 600}}, 'uploaded_t': 1620000000, 'imgid': 3, 'uploader': None}

Pydantic is a validation library, so it can convert automatically from a string to an int, as long as the conversion is possible and makes sense.

@jeremyarancio
Contributor Author

That's actually cleaner indeed!

@jeremyarancio
Contributor Author

openfoodfacts/openfoodfacts-exports#6
Full postprocessing works 🔥
Still would like to add unit tests and exception handling (draft)
