fix: 🎨 Parquet data type fixed #1456
Conversation
This PR should also contain the fix for
We should then reformat the images like this:

```json
{
    "sizes": {
        "100": {
            "h": 100,
            "w": 56
        },
        "400": {
            "h": 400,
            "w": 225
        },
        "full": {
            "h": 3555,
            "w": 2000
        }
    },
    "uploaded_t": "1490702616",
    "uploader": "twan51",
    "imgid": "5",
    "key": "front_fr"
}
```

The difference here is that the dictionary key would be part of the item directly.
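The reshaping described above can be sketched in plain Python (a sketch with made-up sample data, not code from this PR):

```python
# Hypothetical sketch: turn the per-image dict keyed by image name into a
# list of items, with the former dictionary key inlined as "key".
# Field names follow the JSON example above.

def images_dict_to_list(images: dict) -> list[dict]:
    """Turn {"front_fr": {...}, ...} into [{"key": "front_fr", ...}, ...]."""
    return [{"key": key, **value} for key, value in images.items()]


images = {
    "front_fr": {
        "sizes": {"full": {"h": 3555, "w": 2000}},
        "uploaded_t": "1490702616",
        "uploader": "twan51",
        "imgid": "5",
    }
}
items = images_dict_to_list(images)
```

Each item is now self-describing, so the whole collection can be stored as a Parquet list column with a fixed schema.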
It is important; it's one of the fields I (and other reusers) use most often to fetch the images ;)
```diff
@@ -18,7 +18,7 @@ COPY (
     completeness,
     correctors_tags,
     countries_tags,
-    created_t,
+    to_timestamp(created_t)::datetime AS created_t, -- Convert from unixtime to datetime
```
Hmm, I'm not sure I understand why you do this? created_t is an int, and it looks like it's correctly detected by Parquet.
Indeed, but it is a Unix timestamp, which is not particularly obvious to someone new to the dataset.
> Indeed, but it is a Unix timestamp, which is not particularly obvious to someone new to the dataset.

We should keep it as is, so that we keep as much compatibility as possible with the other database dumps (JSONL, MongoDB, API, etc.). We can add a created_datetime version, just like the CSV does, with the converted datetime, though!
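As a sketch of what such a converted column would hold, in Python (the value is the example timestamp from the JSON above; in the actual export the conversion would happen in SQL via to_timestamp):

```python
from datetime import datetime, timezone

# Sketch only: a created_datetime column would hold the existing Unix epoch
# (created_t) converted to a human-readable datetime.
created_t = 1490702616
created_datetime = datetime.fromtimestamp(created_t, tz=timezone.utc)
print(created_datetime.isoformat())  # → 2017-03-28T12:03:36+00:00
```

Keeping the raw created_t column untouched preserves compatibility with the other dumps, while the extra column spares newcomers the epoch conversion.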
OK, if it is actually useful.
I tried different methods, including what you proposed, but I can't manage to get the field schema recognized by Parquet...

```sql
CREATE OR REPLACE TABLE processed_images AS (
    WITH processed_images AS (
        SELECT code,
            json_merge_patch(
                to_json({'key': unnest(json_keys(images))}),
                unnest(images->'$.*')
            ) AS merged
        FROM jsonl_sample
    )
    SELECT
        code,
        array_agg(merged) AS images_array
    FROM processed_images
    GROUP BY code
);

SELECT
    jsonl.code,
    <...>,
    processed_images.images_array
FROM jsonl_sample AS jsonl
LEFT JOIN processed_images
    ON jsonl.code = processed_images.code;
```
A possibility would be to iterate over the JSONL from Python code, post-process the data, and create the Parquet file directly by submitting the data and the schema, no?
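That could look like the following sketch in plain Python (names and sample data are made up; the Parquet write itself, e.g. via pyarrow with an explicit schema, is omitted):

```python
import json

# Hypothetical sketch: stream the JSONL, normalize each product's images dict
# into a list of items with forced string types, then hand the rows plus an
# explicit schema to a Parquet writer (not shown here).

def normalize_images(images: dict) -> list[dict]:
    """Inline the dict key and force every field to str."""
    return [
        {
            "key": str(key),
            "imgid": str(value.get("imgid", "unknown")),
            "uploaded_t": str(value.get("uploaded_t", "unknown")),
            "uploader": str(value.get("uploader", "unknown")),
        }
        for key, value in images.items()
    ]


def iter_rows(lines):
    for line in lines:
        product = json.loads(line)
        yield {
            "code": product.get("code"),
            "images": normalize_images(product.get("images", {})),
        }


sample = ['{"code": "123", "images": {"front_fr": {"imgid": 5, "uploader": "twan51"}}}']
rows = list(iter_rows(sample))
```

Because the types are forced in Python before the write, Parquet never has to infer a schema from inconsistent input.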
What's inconsistent, for example?
Check above: the data is either a string or an int for the same item.
OK, I think you should use a data validation library such as pydantic (https://pydantic.dev/) to validate the input data and force the correct type!
Pydantic could be useful later on to impose a data contract on the JSONL, to prevent any modification of the data schema before the conversion process. I'll handle future incorrect JSONL field types using try & except for each column:

```python
[
    {
        "key": str(key),
        "imgid": str(value.get("imgid", "unknown")),
        "sizes": {
            "100": {
                # (or 0) because "h" or "w" can be None, leading to an error with int()
                "h": int(value.get("sizes", {}).get("100", {}).get("h", 0) or 0),
                "w": int(value.get("sizes", {}).get("100", {}).get("w", 0) or 0),
            },
            "200": {
                "h": int(value.get("sizes", {}).get("200", {}).get("h", 0) or 0),
                "w": int(value.get("sizes", {}).get("200", {}).get("w", 0) or 0),
            },
            "400": {
                "h": int(value.get("sizes", {}).get("400", {}).get("h", 0) or 0),
                "w": int(value.get("sizes", {}).get("400", {}).get("w", 0) or 0),
            },
            "full": {
                "h": int(value.get("sizes", {}).get("full", {}).get("h", 0) or 0),
                "w": int(value.get("sizes", {}).get("full", {}).get("w", 0) or 0),
            },
        },
        "uploaded_t": str(value.get("uploaded_t", "unknown")),
        "uploader": str(value.get("uploader", "unknown")),
    }
    for key, value in image.items()
]
```
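The per-column try & except could be factored into a small helper, e.g. (a sketch, not code from this PR):

```python
def safe_int(value, default: int = 0) -> int:
    """Coerce a possibly-missing, None, or string field to int, falling back to default."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return default
```

This mirrors the `or 0` pattern above: None raises TypeError, junk strings raise ValueError, and both fall back to the default instead of crashing the conversion.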
It can definitely do that (note the inconsistent types in raw_data below):

```python
from typing import Literal

from pydantic import BaseModel


class ImageSize(BaseModel):
    h: int
    w: int


class Image(BaseModel):
    key: str
    sizes: dict[Literal["100", "200", "400", "full"], ImageSize]
    uploaded_t: int
    imgid: int | None = None
    uploader: str | None = None


raw_data = {
    "key": "nutrition_fr",
    "imgid": "3",
    "sizes": {
        "100": {"h": 100, "w": 100},
        "200": {"h": "200", "w": 200},
        "full": {"h": 600, "w": "600"},
    },
    "uploaded_t": 1620000000,
}

image = Image(**raw_data)
print(image.model_dump())
# Output:
# {'key': 'nutrition_fr', 'sizes': {'100': {'h': 100, 'w': 100}, '200': {'h': 200, 'w': 200}, 'full': {'h': 600, 'w': 600}}, 'uploaded_t': 1620000000, 'imgid': 3, 'uploader': None}
```

Pydantic is a validation library, so it can convert automatically from a string to an int, as long as the conversion is possible and makes sense.
That's actually cleaner indeed!
openfoodfacts/openfoodfacts-exports#6 |
What
Screenshot
Part of