Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a serialization library #1092

Open
gadomski opened this issue Apr 11, 2023 · 9 comments
Open

Use a serialization library #1092

gadomski opened this issue Apr 11, 2023 · 9 comments
Milestone

Comments

@gadomski
Copy link
Member

In PySTAC, we do a lot of work converting STAC objects to and from JSON (to_dict/from_dict) and that process can be error prone and tricky to maintain. As discussed in #1047, the use of a (de)serialization could:

  • get us some conversion methods for free,
  • possibly improve performance, and
  • possibly open up serialization to non-JSON formats

We should explore what it would take to add a serialization library as a dependency and convert our data structures to use it if it makes sense.

Two options (there may be more):

We probably don't want to use pydantic unless it's gotten a lot faster since last checked.

Downsides

By adding a dependency on a serialization library, we move away from our "light, few/no dependencies" model that we've been operating in for v1. Hopefully we can reduce our own code complexity by offloading some work to the serialization library, but we should ensure that the juice is worth the squeeze.

@TomAugspurger
Copy link
Collaborator

FYI, I started on https://github.com/TomAugspurger/msgspec-stac/blob/main/msgspec_stac.py a few weeks back but haven't done anything with it since (using msgspec). It seemed pretty straightforward, but I had a few questions:

  1. I don't think we can reasonably do full STAC validation using a library like this. Something like validating datetime, start_datetime, and end_datetime are together valid (i.e. start_datetime and end_datetime are provided if datetime is None), sounds hard and slow. We should I think view validation as a nice benefit, rather than the motivation for using one of these libraries.
  2. It wasn't clear to me what the relationship would be between pystac objects and (say) msgspec's objects. Would a pystac.Item be a msgspec.Struct subclass? Or would it internally use a struct (which would I think remove much of the performance benefit)?

We probably don't want to use pydantic unless it's gotten a lot faster since last checked.

In theory pydanic v2 is much faster (though still slower than msgspec last I saw).

@gadomski
Copy link
Member Author

gadomski commented May 9, 2023

I did a dead-simple msgspec implementation myself just now using class Item(Struct), and ran into issues around flattening dictionaries: jcrist/msgspec#315. It's not uncommon to have extra fields at the top level of STAC objects, and those need to be captured by a deserialization library. Maybe there's a good way to do it, but it wasn't immediately obvious to me.

For more context, this is how it's done in Rust: https://github.com/gadomski/stac-rs/blob/dfaaabc00f581af3d6b948ee3de24f4b68e5acdd/stac/src/item.rs#L66-L68

@huard
Copy link

huard commented Aug 29, 2023

There's a pydantic-stac implementation here https://github.com/stac-utils/stac-pydantic which doesn't seem really active, but I still made a PR to migrate it to pydantic v2 in the hope it'd be useful at some point.

@gadomski
Copy link
Member Author

@thomas-maschler did some ad-hoc benchmarking in radiantearth/stac-spec#1252 (comment) and found pydantic to be slower than than pystac in the deserialization case.

@eseglem
Copy link

eseglem commented Nov 14, 2023

I would be very curious about where the time difference are coming from. All things being the same, I would definitely expect Pydantic should be faster since most of the work is happening in Rust and not Python. If I had to take a guess, it may be related to pystac validation being optional vs being built in with pydantic.

There may be some ways around that as well as ways to improve performance vs those benchmarks. It could be very beneficial to go that route, if the performance is acceptable, as it would supersede stac-pydantic and help consolidate the ecosystem. I guess it really depends on how important that performance is vs maintenance effort and such.

@rbavery
Copy link

rbavery commented Mar 15, 2024

bumping this! the code complexity for pystac is pretty high and when implementing the MLModel extension, pydantic v2 has felt more approachable. is pydantic 2 or another serialization library an option for pystac v2?

@gadomski
Copy link
Member Author

is pydantic 2 or another serialization library an option for pystac v2?

Could be, however no one (that I know of) is currently working on a pystac v2 at the moment.

@KeynesYouDigIt
Copy link
Contributor

so https://github.com/stac-utils/stac-pydantic already exists - but I am new and I know there are some functionality gaps.

Would it be a high value / good first project for me to look at adding this? we could eventually support multiple serialization formats, defaulting to stac-pydantic - that would be powerful.

@gadomski
Copy link
Member Author

gadomski commented Nov 3, 2024

Would it be a high value / good first project for me to look at adding this? we could eventually support multiple serialization formats, defaulting to stac-pydantic - that would be powerful.

You're of course welcome to poke at it, but I'd say this is a pretty complex first ticket -- you'll end up touching most of the codebase. I'll note too that some of the performance considerations discussed in this issue might be resolved by #1434.

Zooming out, I think that we could pivot towards simply interoperability between pystac and stac-pydantic so folks can use both in the same ecosystem more seamlessly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants