Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extensions should coerce values to correct type #1044

Closed
jsignell opened this issue Mar 15, 2023 · 10 comments
Closed

Extensions should coerce values to correct type #1044

jsignell opened this issue Mar 15, 2023 · 10 comments
Labels
bug Things which are broken extension Implement a STAC extension in PySTAC
Milestone

Comments

@jsignell
Copy link
Member

jsignell commented Mar 15, 2023

Specifically non-null values in "proj:epsg" and "proj:shape" should be coerced to int. Ideally this should happen on the properties field itself rather than only when directly getting the prop from the extension class

Context

I was debugging an issue on VEDA where epsg was being read as a float and causing some issues.

import pystac
from pystac.extensions.projection import ProjectionExtension

item = pystac.Item.from_file("https://staging-stac.delta-backend.com/collections/nceo_africa_2017/items/AGB_map_2017v0m_COG")
print(ProjectionExtension.has_extension(item))  # False
print(item.properties["proj:epsg"])  # 4326.0

I decided that maybe there was a mismatch between the STAC version and the extension version, so I pulled the file locally (using wget) and changed proj ext schema version from v1.0.0 to v1.1.0

import pystac
from pystac.extensions.projection import ProjectionExtension

item = pystac.Item.from_file("AGB_map_2017v0m_COG")
print(ProjectionExtension.has_extension(item))  # True
print(item.properties["proj:epsg"])  # 4326.0
print(ProjectionExtension.ext(item).epsg)  # 4326.0

This time the proj ext is understood , but in the process of pulling the file all my int fields were converted to floats. I think this is because json doesn't really understand the difference between ints and floats. but since my STAC entry contains floats rather than ints, the value is always a float. Even when getting it directly from the extension class.

@gadomski gadomski added bug Things which are broken extension Implement a STAC extension in PySTAC labels Mar 15, 2023
@gadomski gadomski added this to the 1.8 milestone Mar 15, 2023
@jsignell
Copy link
Member Author

I might be off in my assessment. I was looking at the https://staging-stac.delta-backend.com/collections/nceo_africa_2017/items/AGB_map_2017v0m_COG using a JSON prettifying extension and it looked like proj:epsg was in the item as an int, but when I look at the raw version it's a float.

@jsignell
Copy link
Member Author

jsignell commented Mar 15, 2023

So the part about wgetting the file converting from int to float is probably incorrect. I edited the original description to fix that misunderstanding.

@vincentsarago
Copy link
Member

using python https

import httpx
r = httpx.get("https://staging-stac.delta-backend.com/collections/nceo_africa_2017/items/AGB_map_2017v0m_COG")
r.json()["properties"]["proj:epsg"]
>>>  4326.0

using curl

curl https://staging-stac.delta-backend.com/collections/nceo_africa_2017/items/AGB_map_2017v0m_COG   
{...,"proj:epsg":4326.0,...}%  

When I use jq it does coerce the float to in

curl https://staging-stac.delta-backend.com/collections/nceo_africa_2017/items/AGB_map_2017v0m_COG | jq -c '.properties."proj:epsg"'
4326

@jsignell
Copy link
Member Author

Thanks that is helpful context. From googling around a bit it seems like json doesn't really have a concept of int vs float. So it is probably unreliable to depend on how the data is stored.

@jsignell
Copy link
Member Author

json validation does not catch this either because in json schema numbers with a trailing .0 are considered valid integers: https://json-schema.org/understanding-json-schema/reference/numeric.html

@jsignell
Copy link
Member Author

I was toying around with the idea of doing int coersion in the json_loads method itself. This is what it would look like in the native json library:

import math
import json

def parse_maybe_int(obj):
    result = float(obj)
    decimal, whole = math.modf(result)
    if decimal == 0:
        result = int(whole)
    return result

json.loads('{"foo": 1.0}', parse_float=parse_maybe_int)  # {'foo': 1}
json.loads('{"foo": 1.1}', parse_float=parse_maybe_int)  # {'foo': 1.1}
json.loads('{"foo": 1}', parse_float=parse_maybe_int)  # {'foo': 1}

But then I realized that there is an alternate library that is used: orjson and I don't see a similar mechanism for hooking into the float parsing. It looks like it is explicitly not supported and they recommend handling coersion after the load.

Before I go down the path of implementing that I'd like to get some feedback on whether this approach seems too aggressive for pystac objects.

@jsignell
Copy link
Member Author

Just to make this explicit. I think there are two very different approaches available:

  1. In the top level json_loads coerce all floats that end with .0 to ints. This is at the highest level and does not involve any changes to how extensions work or any per-extension changes.
  2. Make type coercion the responsibility of the extensions. If we choose this approach we will need to do special deserialization per-extension on each pystac object to ensure that the correct coercion is applied.

@gadomski
Copy link
Member

Make type coercion the responsibility of the extensions. If we choose this approach we will need to do special deserialization per-extension on each pystac object to ensure that the correct coercion is applied.

My instinct is option 2, option 1 feels a little to magical to me.

Typing and (de)serialization is a recurring problem in PySTAC (e.g. #1047), and I think forcing extensions to define the behavior that they want is the best call -- e.g. for projection, you may want to error if epsg has a fractional part, but another extension might be ok with a round/truncate to get from a float to an int.

@jsignell
Copy link
Member Author

I agree that using a dedicated library seems like a better fit. Do you think we should close this since the preferred approach is captured in #1092? FWIW, VEDA is correcting the values within the pgstac database, and there are conversations going on about changing the type of proj:epsg to be string rather than int, which would also fix the original issue that I was running into.

@gadomski
Copy link
Member

Do you think we should close this since the preferred approach is captured in #1092?

Sure. I think an issue specifically focused on a single extension (e.g. "Proj extension should fail if EPSG is not an integer") would be more actionable.

@gadomski gadomski closed this as not planned Won't fix, can't repro, duplicate, stale Apr 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Things which are broken extension Implement a STAC extension in PySTAC
Projects
None yet
Development

No branches or pull requests

3 participants