Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OGRLayer::GetArrowStream(): add a DATETIME_AS_STRING=YES/NO option #11213

Merged
merged 4 commits into from
Nov 25, 2024

Conversation

rouault
Copy link
Member

@rouault rouault commented Nov 5, 2024

  • GetArrowStream(): add DATETIME_AS_STRING=YES/NO. Defaults to NO. Added in GDAL 3.11. Whether DateTime fields should be returned as a (normally ISO-8601 formatted) string by drivers. The aim is to be able to handle mixed timezones (or timezone naive values) in the same column. All drivers must honour that option, and potentially fallback to the OGRLayer generic implementation if they cannot (which is the case for the Arrow, Parquet and ADBC drivers).
    When DATETIME_AS_STRING=YES, the TIMEZONE option is ignored.

Fixes geopandas/pyogrio#487

  • OGRLayer::GetArrowStream(): when DATETIME_AS_STRING=YES, expose "GDAL:OGR:type":"DateTime" metadata in the ArrowSchema of DateTime fields

  • CreateFieldFromArrowSchema(): take into account GDAL:OGR:Type=DataTime when ArrowSchema.format='u' (string)

  • ogr2ogr: GPKG/FlatGeoBuf -> other format: in Arrow code path, use DATETIME_AS_STRING to preserve origin timezone

Fixes #11212

@rouault rouault added the funded through GSP Work funded through the GDAL Sponsorship Program label Nov 5, 2024
@rouault rouault added this to the 3.11.0 milestone Nov 5, 2024
@rouault rouault force-pushed the GetArrowStream_DATETIME_AS_STRING branch 3 times, most recently from 55352c2 to ae77c2f Compare November 6, 2024 00:40
@coveralls
Copy link
Collaborator

coveralls commented Nov 6, 2024

Coverage Status

coverage: 73.708% (-0.002%) from 73.71%
when pulling 324d024 on rouault:GetArrowStream_DATETIME_AS_STRING
into 17f6835 on OSGeo:master.

@rouault rouault force-pushed the GetArrowStream_DATETIME_AS_STRING branch from ae77c2f to ad41bb8 Compare November 6, 2024 01:26
@rouault
Copy link
Member Author

rouault commented Nov 6, 2024

@theroggy @jorisvandenbossche I'm thinking that in this DATETIME_AS_STRING=YES mode, in the ArrowSchema of datetime fields exposed as string (format='u'), we should probably also set the metadata field with a hint for the DateTime semantics. Any suggestion of an appropriate value for it?

@jorisvandenbossche
Copy link
Contributor

Thanks a lot for looking into this!

we should probably also set the metadata field with a hint for the DateTime semantics. Any suggestion of an appropriate value for it?

Would you just want to indicate that the original GDAL/OGR type was a DateTime? Or is there more information about the column that GDAL can know at that point?
For the type, maybe something like "gdal:type": "DateTime" ? (there is not yet any precedence where you store some information like this is any file format?)

@rouault
Copy link
Member Author

rouault commented Nov 6, 2024

Would you just want to indicate that the original GDAL/OGR type was a DateTime?

actually, I'm just remembering that we have already something. https://gdal.org/en/latest/doxygen/classOGRLayer.html#a3ffa8511632cbb7cff06a908e6668f55 mentions:

Starting with GDAL 3.8, the ArrowSchema::metadata field filled by the get_schema() callback may be set with the potential following items:
    "GDAL:OGR:alternative_name": value of OGRFieldDefn::GetAlternativeNameRef()
    "GDAL:OGR:comment": value of OGRFieldDefn::GetComment()
    "GDAL:OGR:default": value of OGRFieldDefn::GetDefault()
    "GDAL:OGR:subtype": value of OGRFieldDefn::GetSubType()
    "GDAL:OGR:width": value of OGRFieldDefn::GetWidth() (serialized as a string)
    "GDAL:OGR:unique": value of OGRFieldDefn::IsUnique() (serialized as "true" or "false")
    "GDAL:OGR:domain_name": value of OGRFieldDefn::GetDomainName()

Those are only filled when they cannot be expressed with an Arrow concept.
So logically that should be extended with "GDAL:OGR:type": "DateTime" in that situation

DATETIME_AS_STRING=YES/NO. Defaults to NO. Added in GDAL 3.11.
Whether DateTime fields should be returned as a (normally ISO-8601
formatted) string by drivers. The aim is to be able to handle mixed
timezones (or timezone naive values) in the same column.
All drivers must honour that option, and potentially fallback to the
OGRLayer generic implementation if they cannot (which is the case for the
Arrow, Parquet and ADBC drivers).
When DATETIME_AS_STRING=YES, the TIMEZONE option is ignored.

Fixes geopandas/pyogrio#487
@rouault rouault force-pushed the GetArrowStream_DATETIME_AS_STRING branch from 5468ef2 to f75539c Compare November 22, 2024 17:59
…:OGR:type":"DateTime" metadata in the ArrowSchema of DateTime fields
@rouault rouault force-pushed the GetArrowStream_DATETIME_AS_STRING branch from f75539c to 324d024 Compare November 23, 2024 15:15
@rouault rouault merged commit c7606da into OSGeo:master Nov 25, 2024
32 of 36 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
funded through GSP Work funded through the GDAL Sponsorship Program
Projects
None yet
3 participants