fix: 🎨 Parquet data type fixed #1456

Closed

Conversation

jeremyarancio
Contributor

What

  • Timestamps in the JSONL/Parquet export were in the wrong format: raw Unix timestamps
  • Using DuckDB, this is now fixed in the Parquet file exported to Hugging Face (HF)

Screenshot

(image attached)

Part of

@jeremyarancio
Contributor Author

This PR should also contain the fix for images, but it is not done, for 2 reasons:

  • The field's JSON structure is not consistent across rows, leading to issues during the conversion to Parquet. Not sure if the problem can be solved...
  • The data contained in the images field is actually not that relevant for users.

@raphael0202
Collaborator

> The field's JSON structure is not consistent across rows, leading to issues during the conversion to Parquet. Not sure if the problem can be solved...

We should then reformat the images field so that it is Parquet compatible. One possibility would be to make images a list of dicts such as:

{
    "sizes": {
        "100": {
            "h": 100,
            "w": 56
        },
        "400": {
            "h": 400,
            "w": 225
        },
        "full": {
            "h": 3555,
            "w": 2000
        }
    },
    "uploaded_t": "1490702616",
    "uploader": "twan51"
    "imgid": "5",
    "key": "front_fr"
}

The difference here is that the dictionary key would be part of the item directly.
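For reference, a hypothetical illustration of the current shape (values are made up): the images object is keyed by the image key, and the set of keys differs from product to product, which is what breaks schema inference:

{
    "front_fr": {
        "sizes": {"100": {"h": 100, "w": 56}, "full": {"h": 3555, "w": 2000}},
        "uploaded_t": "1490702616",
        "uploader": "twan51",
        "imgid": "5"
    },
    "1": {
        "sizes": {"100": {"h": 100, "w": 75}, "full": {"h": 2000, "w": 1500}},
        "uploaded_t": "1490702620",
        "uploader": "twan51",
        "imgid": "1"
    }
}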

> The data contained in the images field is actually not that relevant for users.

It is important, it's one of the fields I (and other reusers) use the most often to fetch the images ;)

@@ -18,7 +18,7 @@ COPY (
     completeness,
     correctors_tags,
     countries_tags,
-    created_t,
+    to_timestamp(created_t)::datetime AS created_t, -- Convert from unixtime to datetime
Collaborator

Hmm, I'm not sure I understand why you do this?
created_t is an int, and it looks like it's correctly detected by Parquet.

Contributor Author

Indeed, but it is a Unix timestamp, which is not particularly obvious to someone new to the dataset

Collaborator

> Indeed, but it is a Unix timestamp, which is not particularly obvious to someone new to the dataset

We should keep it as is, so that we preserve compatibility as much as possible with the other database dumps (JSONL, MongoDB, API, etc.).
We can add a created_datetime version, just like the CSV does, with the converted datetime though!
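A minimal sketch of how that could look in the export query (illustrative, not the final change):

    created_t,                                               -- keep the raw Unix timestamp for compatibility
    to_timestamp(created_t)::datetime AS created_datetime,   -- add a converted datetime column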

@jeremyarancio
Contributor Author

> It is important, it's one of the fields I (and other reusers) use the most often to fetch the images ;)

Ok, if it is actually useful

@jeremyarancio
Contributor Author

> We should then reformat the images field so that it is Parquet compatible. One possibility would be to make images a list of dicts [...]

I tried different methods, including what you proposed, but I can't manage to get the field schema recognized by Parquet...
Here's the "list of dicts" solution in DuckDB:

-- Attempt: turn each product's images dict into an array of structs,
-- injecting the original dict key as a "key" field on every item.
CREATE OR REPLACE TABLE processed_images AS (
    WITH processed_images AS (
        SELECT
            code,
            json_merge_patch(
                to_json({'key': unnest(json_keys(images))}),
                unnest(images->'$.*')
            ) AS merged
        FROM jsonl_sample
    )
    SELECT
        code,
        array_agg(merged) AS images_array
    FROM processed_images
    GROUP BY code
);

-- Then join the aggregated images back onto the main export query.
SELECT
    jsonl.code,
    <...>,
    processed_images.images_array
FROM jsonl_sample AS jsonl
LEFT JOIN processed_images
    ON jsonl.code = processed_images.code;

@raphael0202
Collaborator

A possibility would be to iterate over the JSONL from Python code, post-process the data, and create the Parquet file directly by submitting the data and the schema, no?
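A rough sketch of that approach with pyarrow (file names, the reduced schema, and the post_process helper are illustrative, not the project's actual code):

import json

import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative, reduced schema: the real export would declare all columns explicitly.
SCHEMA = pa.schema([
    ("code", pa.string()),
    ("created_t", pa.int64()),
    ("images", pa.list_(pa.struct([
        ("key", pa.string()),
        ("imgid", pa.string()),
        ("uploaded_t", pa.string()),
        ("uploader", pa.string()),
    ]))),
])


def post_process(product: dict) -> dict:
    """Normalize one JSONL row so that it matches the declared schema."""
    images = product.get("images") or {}
    return {
        "code": str(product.get("code", "")),
        "created_t": int(product.get("created_t") or 0),
        "images": [
            {
                "key": str(key),
                "imgid": str(value.get("imgid", "")),
                "uploaded_t": str(value.get("uploaded_t", "")),
                "uploader": str(value.get("uploader", "")),
            }
            for key, value in images.items()
        ],
    }


# The real dump is huge, so in practice this would be written in batches
# (e.g. with pyarrow.parquet.ParquetWriter); kept in memory here to keep the sketch short.
with open("products.jsonl") as f:
    rows = [post_process(json.loads(line)) for line in f]

table = pa.Table.from_pylist(rows, schema=SCHEMA)
pq.write_table(table, "food.parquet")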

@jeremyarancio
Contributor Author

This is the type of problem I'm dealing with, in addition to how large the dataset is and other bugs...
-> Data conversion is not done right because the data is not consistent across the JSONL

@raphael0202
Collaborator

> Data conversion is not done right because the data is not consistent across the JSONL

What's not consistent, for example?

@jeremyarancio
Contributor Author

> Data conversion is not done right because the data is not consistent across the JSONL

> What's not consistent, for example?

Check above: the data is either a string or an int for the same item.
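For instance, two hypothetical rows where the same nested field switches type (values made up, but this is the pattern):

{"code": "123", "images": {"1": {"imgid": "5", "uploaded_t": "1490702616"}}}
{"code": "456", "images": {"1": {"imgid": 5, "uploaded_t": 1620000000}}}

Parquet needs a single type per (nested) column, so this kind of drift makes a naive conversion fail.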

@raphael0202
Collaborator

raphael0202 commented Nov 12, 2024

Ok, I think you should use a data validation library such as pydantic (https://pydantic.dev/) to validate the input data and force the correct type!
Pydantic should be fast enough for this purpose.

@jeremyarancio
Contributor Author

jeremyarancio commented Nov 12, 2024

Pydantic could be useful later on to impose a data contract on the JSONL, to prevent any modification of the data schema before the conversion process.
Who knows how the JSONL will change in the coming months.
But it doesn't "force" the correct type. The correct type needs to be converted manually, like in the example below.

I'll handle future incorrect JSONL field types using try & except for each column; if the conversion raises, the column is left unconverted and a warning message is logged (a sketch of this follows the snippet below).

[
    {
        "key": str(key),
        "imgid": str(value.get("imgid", "unknown")),
        "sizes": {
            "100": {
                "h": int(value.get("sizes", {}).get("100", {}).get("h", 0) or 0),  # (or 0) because "h" or "w" can be None, leading to an error with int()
                "w": int(value.get("sizes", {}).get("100", {}).get("w", 0) or 0),
            },
            "200": {
                "h": int(value.get("sizes", {}).get("200", {}).get("h", 0) or 0),
                "w": int(value.get("sizes", {}).get("200", {}).get("w", 0) or 0),
            },
            "400": {
                "h": int(value.get("sizes", {}).get("400", {}).get("h", 0) or 0),
                "w": int(value.get("sizes", {}).get("400", {}).get("w", 0) or 0),
            },
            "full": {
                "h": int(value.get("sizes", {}).get("full", {}).get("h", 0) or 0),
                "w": int(value.get("sizes", {}).get("full", {}).get("w", 0) or 0),
            },
        },
        "uploaded_t": str(value.get("uploaded_t", "unknown")),
        "uploader": str(value.get("uploader", "unknown")),
    }
    for key, value in image.items()
]
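A minimal sketch of that column-level fallback (function and column names are illustrative, not the project's actual code):

import logging

logger = logging.getLogger(__name__)


def convert_column(product: dict, column: str, converter) -> None:
    """Try to convert one column in place; on failure, leave it as-is and log a warning."""
    try:
        product[column] = converter(product[column])
    except (AttributeError, KeyError, TypeError, ValueError) as err:
        logger.warning("Column %r was not converted: %s", column, err)


# Example on a single JSONL row: created_t converts, images is left untouched and warned about.
product = {"code": "123", "created_t": "1490702616", "images": None}
convert_column(product, "created_t", int)
convert_column(product, "images", lambda imgs: [dict(v, key=k) for k, v in imgs.items()])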

@raphael0202
Collaborator

raphael0202 commented Nov 12, 2024

> But it doesn't "force" the correct type.

It can definitely do that (note the inconsistent types for h and w in "sizes"):

from pydantic import BaseModel
from typing import Literal

class ImageSize(BaseModel):
    h: int
    w: int


class Image(BaseModel):
    key: str
    sizes: dict[Literal["100", "200", "400", "full"], ImageSize]
    uploaded_t: int
    imgid: int | None = None
    uploader: str | None = None


raw_data = {
    "key": "nutrition_fr",
    "imgid": "3",
    "sizes": {
        "100": {"h": 100, "w": 100},
        "200": {"h": "200", "w": 200},
        "full": {"h": 600, "w": "600"}
    },
    "uploaded_t": 1620000000,
}

image = Image(**raw_data)

print(image.model_dump())

# Output:
# {'key': 'nutrition_fr', 'sizes': {'100': {'h': 100, 'w': 100}, '200': {'h': 200, 'w': 200}, 'full': {'h': 600, 'w': 600}}, 'uploaded_t': 1620000000, 'imgid': 3, 'uploader': None}

Pydantic is a validation library, so it can convert automatically from a string to an int, as long as the conversion is possible and makes sense.

@jeremyarancio
Contributor Author

That's actually cleaner indeed!

@jeremyarancio
Contributor Author

openfoodfacts/openfoodfacts-exports#6
Full postprocessing works 🔥
Still would like to add unit tests and exception handling (draft)
