Improve the use of JSON Schema in tooling beyond validation #18

jonaslagoni · 2021-06-03T11:29:30Z

jonaslagoni
Jun 3, 2021

Just to align the thoughts, JSON Schema is for specifying validation rules that data should comply with and it should continue with that. This is perfect when you want to validate data, but what if you want to know the structure of the data? Cause as soon as you have valid data, you have structure and this underlying structure is what I refer to as the data model for the JSON Schema definition. One could see it as a side-effect to the specification.

The JSON Schema keywords, while simple on the surface, are complex beneath, and to truly understand how each keyword affects the underlying data model requires knowledge of the specification and how data is validated. However, even with that knowledge, there are still gaping holes that are up for interpretation.
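As one illustration of such a hole (an example chosen here, not one taken from the thread): a schema that declares `properties` but no `type` looks like it describes an object, yet it also accepts values that are not objects at all. The sketch below uses a toy validator that only understands `type`, `properties`, and `required`; the function name `is_valid` and the schema are hypothetical, and this is not a real JSON Schema implementation.

```python
# Toy validator covering only `type`, `properties`, and `required`.
# It is an illustrative assumption, not a real JSON Schema implementation.
def is_valid(schema, instance):
    expected = schema.get("type")
    if expected == "object" and not isinstance(instance, dict):
        return False
    if expected == "string" and not isinstance(instance, str):
        return False
    if isinstance(instance, dict):
        # `required` and `properties` only constrain object instances.
        for name in schema.get("required", []):
            if name not in instance:
                return False
        for name, subschema in schema.get("properties", {}).items():
            if name in instance and not is_valid(subschema, instance[name]):
                return False
    return True


# Hypothetical schema: it "looks like" a Person model, but never says `type: object`.
person_schema = {"properties": {"name": {"type": "string"}}}

print(is_valid(person_schema, {"name": "Ada"}))   # True: the shape we expect
print(is_valid(person_schema, {"name": 42}))      # False: name must be a string
print(is_valid(person_schema, {}))                # True: name is not required
print(is_valid(person_schema, "just a string"))   # True: non-objects are not constrained
```

A model generator has to decide whether this schema means "an object with an optional `name`" or "anything at all", and validation rules alone do not settle that.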

Each organization (AsyncAPI, IBM, other tooling providers for OpenAPI, etc.) currently works in parallel, solving the very same problems when it comes to using JSON Schema beyond validation in tooling, and I think we should change that.

I think @jdesrosiers puts it perfectly in terms of what we have to achieve:

To me, an ideal code gen vocabulary is one that allows me to use the full power of JSON Schema for validation while also allowing me to use the same schemas for code gen.

The specific suggestion

As I see it, it comes down to two core issues that need to be solved:

Specify how you take any valid (no restriction on the use of keywords) JSON Schema definition (any specification version) and interpret it to a data model definition.
- (long term) Potentially it could also define mappings from the data model to specific languages, as this process can be done in parallel.
Propose an "official" code generation vocabulary that complements the defined interpretation rules, and describe how the two work hand in hand.

(long term) provide a test suite for libraries to test compliance.

To me defining a code generation vocabulary cannot be done without solving the issue of how to interpret the validation rules.

What solving these issues should result in:

Anyone wanting to implement a model generation library has a clear document that explains the procedure of how to take any input JSON Schema and output a data model.
Instead of only using JSON Schema in tooling we now have a more concrete definition of the data model that we can use in our tooling instead of using JSON Schema alone.
The users have a consistent experience across any specification in terms of code generation from JSON Schema.
The users have a clear code generation vocabulary that they can use to consistently get the "correct" data model.

So, how do we solve it?

I suggest that it happens under a separate repository in the JSON Schema organization called (just gonna name something) json-schema-data-model alongside a GitHub team (datamodeling?) with members that actively maintain and contribute to that project. (when do you "actively" maintain something? 🤔 Is that defined under the organization?)

To get started the first issue in the repository should define the scope of the work in more detail so everyone agrees on what should be solved.

I suggest that the first team of maintainers are those who dedicate min x (10?) hours a week to kickstart the project. They should structure the work into GitHub milestones, splitting up the work into smaller chunks, what needs to be solved first, what can be done in parallel, etc.

This should be done somewhat separately from the core team working on the actual JSON Schema specification so it can continue to evolve. That being said, the involvement of the JSON Schema team is expected, and that team should be the mediator and guide in this process.

Anyone should be able to jump in and contribute, but to slowly improve something we must start with something, so the datamodeling team should aim to provide a first version as soon as possible.

Scope of this discussion

Since each organization has its own ecosystem, no specific library should be provided; let each organization do that (we could, but I think it is out of scope, at least until the JSON Schema organization starts to provide official tooling 😄). So I suggest we keep it purely theoretical.
Do not go into details about how to solve the said issues since that is out of scope for this discussion.

Questions to spark the discussion

Should this happen under the JSON Schema organization?
Do the proposed issues make sense? Anything that should be removed/added?
Does the proposed process make sense?

jdesrosiers · 2021-06-03T17:32:57Z

jdesrosiers
Jun 3, 2021
Maintainer

I think this is all great and having the JSON Schema organization host this project is a good idea. However, it will have to be BYOM (bring your own maintainers). We already had to put JSON Hyper-Schema on the back burner because we don't have the resources to maintain it right now. Adding another spec would just put us back in the same position. It would be your project that you lead and publish. I'm happy to follow along and give suggestions, but if I have 10 hours a week to spare, it will go into rebooting hyper-schema.

1 reply

jonaslagoni Jun 4, 2021
Author

Understandable, also why I added ... separately from the core team working on the actual JSON Schema specification so it can continue to evolve. ....

As we already spend a great deal of time working on solving this on our own sides, it makes sense to me to move the work to an official channel that we all can benefit from.

handrews · 2021-06-03T17:43:07Z

handrews
Jun 3, 2021

Related: OAI/OpenAPI-Specification#2542

Also related: I am working on clarifying what a core-only JSON Schema implementation looks like, which will help disentangle what parts of JSON Schema really must be present, what parts can be mixed and matched, and what possibilities that might open up. For example, I just delivered a (sadly private so I can't explain further) generative-only vocabulary for a client, that does not use any standard vocabulary other than core.

Obviously, many if not most generative cases will want to share schemas with validation in a sort of bidirectional workflow, but there are a lot of possibilities that probably aren't obvious yet.

0 replies

gregsdennis · 2021-06-03T20:21:19Z

gregsdennis
Jun 3, 2021
Maintainer

Also related: I've had users request the ability to generate C# data models from schemas, and I've found it's considerably more difficult to figure out than the other way around.

It would be great to have a standard definition on how a schema should be interpreted to generate a model.

One thing to be aware of: we need to consider both strongly- and loosely-typed languages as they have different capabilities.

1 reply

gregsdennis Jul 3, 2021
Maintainer

Leaving this here for reference should someone decide to pick this up.

https://github.com/quicktype/quicktype

Relequestual · 2021-06-04T15:06:19Z

Relequestual
Jun 4, 2021
Maintainer

I'm for this. As @jdesrosiers pointed out, this needs people to make it work, but not just any people, and I think you're (@jonaslagoni) aware of that.

I think your biggest challenge is going to be getting ANY level of time commitment from others. This has been my previous experience. I would love to be proved wrong.

I would suggest a slight alteration of your proposal, but it's mostly semantics.

Establish what's already out there, who it's made by, and note how recent and active it is
Find out if they would be interested in collaborating in an "official" harmonized version. This may be a hard sell, it may not
Get commitment to ANY amount of time from companies
Get commitment (on the intent) to use any resulting published specification
Outline your ideal timeline and likely working patterns and structures, and see if that aligns with their desires

My feeling is you will need at least 2 larger companies, and maybe 2-4 medium or smaller companies to at least agree in principle, and hopefully then commit to contribute or AT LEAST review and give feedback at various points in your development.

I'm happy for JSON Schema as an org to house, manage teams, guide and review work, and "stamp" a publication (whatever we decide that looks like).

A good first step to this might be to jump on a call with @mkistler as he has been the most interested in a new vocabulary like this, and the most active in trying to make it happen. Jump on a call, see if you can agree on a possible process as above. Report back, and let's go from there.

Sound reasonable?

1 reply

jonaslagoni Jun 7, 2021
Author

I think your biggest challenge is going to be getting ANY level of time commitment from others. This has been my previous experience. I would love to be proved wrong.

Me too 😄

I would suggest a slight alteration of your proposal, but it's mostly semantics. ...

Yea this all makes sense to me 👍

A good first step to this might be to jump on a call with @mkistler as he has been the most interested in a new vocabulary like this, and the most active in trying to make it happen. Jump on a call, see if you can agree on a possible process as above. Report back, and let's go from there.

Yep sounds good to me, gonna see if he replies to this thread, otherwise, I gotta find some other way of making contact 🙂

If we end up setting up a call it will be public for everyone to join. So I will provide link, time, and date here in this discussion unless a repo is created in the meantime.

handrews · 2021-06-04T16:45:21Z

handrews
Jun 4, 2021

@jonaslagoni @Relequestual I would suggest including @jdesrosiers along with @mkistler on any call. Mike is coming from a tooling background, while Jason understands the intent of what we've thought through within the JSON Schema project already (e.g. the approach of disambiguating validation constructs, rather than creating new data modeling constructs or just forbidding a bunch of things). That will give you a balance of perspectives (as can be seen in the OAS issue that I linked above, where both comment).

As also noted in the OAS issue, I would strongly suggest avoiding any thought of a "one true solution" vocabulary for code generation. You will need different features for different sorts of languages (strong or loosely typed, OO or functional). There will be things you could easily generate in Python or JavaScript that will be difficult or impossible in C++ or Java.

JSON Schema draft 2020-12's modularity, where the unevaluated* keywords are in their own vocabulary because of the additional challenges inherent in implementing them (particularly in a memory-constrained streaming environment), is a good model here. You are likely to have a core of common data modeling features, and then several additional keyword vocabularies that support various less universal features.
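A rough sketch of how that layering could look in practice, assuming a made-up code generation vocabulary: in draft 2020-12 a meta-schema lists vocabularies under `$vocabulary`, where `true` means an implementation that does not recognize the vocabulary must refuse to process the schema and `false` means it may proceed. The core, applicator, and unevaluated URIs below are the real 2020-12 ones; the `example.com` URIs, the `codegen-base` vocabulary, and the helper function are hypothetical.

```python
# Real draft 2020-12 vocabulary URIs.
CORE = "https://json-schema.org/draft/2020-12/vocab/core"
APPLICATOR = "https://json-schema.org/draft/2020-12/vocab/applicator"
UNEVALUATED = "https://json-schema.org/draft/2020-12/vocab/unevaluated"
# Hypothetical vocabulary for common data modeling features (made-up URI).
CODEGEN_BASE = "https://example.com/vocab/codegen-base"

meta_schema = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "$id": "https://example.com/meta/codegen",  # hypothetical
    "$vocabulary": {
        CORE: True,          # required
        APPLICATOR: True,    # required
        CODEGEN_BASE: True,  # required common modeling core
        UNEVALUATED: False,  # optional: tools may proceed without it
    },
}


def missing_required_vocabularies(meta_schema, supported):
    """Return the required vocabularies (value true) that a tool lacks."""
    declared = meta_schema.get("$vocabulary", {})
    return sorted(
        uri for uri, required in declared.items()
        if required and uri not in supported
    )


# A constrained generator that skips unevaluated* can still process the schema.
streaming_generator = {CORE, APPLICATOR, CODEGEN_BASE}
print(missing_required_vocabularies(meta_schema, streaming_generator))  # []

# A plain validator that knows nothing about code generation has to refuse.
plain_validator = {CORE, APPLICATOR, UNEVALUATED}
print(missing_required_vocabularies(meta_schema, plain_validator))
```

Less universal features (say, ones that only make sense for OO or for functional targets) could then live in additional optional vocabularies alongside the common core.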

2 replies

jdesrosiers Jun 4, 2021
Maintainer

I'd be happy to join a call if that happens.

jonaslagoni Jun 7, 2021
Author

Sounds good, I will add a comment to the discussion of time and place so anyone can join if they want to 👍

Relequestual · 2021-06-18T07:39:56Z

Relequestual
Jun 18, 2021
Maintainer

On the OAI call (2021-06-17) I heard there was a meeting of a special interest group. @jonaslagoni do you see that special interest group in OAI as separate from this effort, or are you OK for that to be the focus of activity? Either answer is fine.

12 replies

Relequestual Jun 23, 2021
Maintainer

I honestly don't mind where that discussion happens.

What I see happening in the future is something like recognising levels of vocabulary maturity.
How we evaluate that and the implications are to be decided.
I think unless there are strong cases for JSON Schema as an org supporting a specific new vocabulary, the involved parties should be free to do as they wish.

That being said, we should be able to provide a safe place for them to do so. A repo outside of this org for example might not have the same Code of Conduct, if any.

MikeRalphson Jun 23, 2021

The AWG repos use the CC-CofC: https://github.com/APIs-Working-Group/.github/blob/main/CODE_OF_CONDUCT.md#contributor-covenant-code-of-conduct

jdesrosiers Jun 23, 2021
Maintainer

I would love to see the JSON Schema organization host this work. It's not that we don't want to take on things other than validation, it's just that we don't have the capacity for anything else. Remember, this is a small group of volunteers. As long as you can bring in the people to do the work (making proposals, writing the spec, writing proofs of concept, etc.), we'd be thrilled to support that work in any way we can.

Relequestual Jun 24, 2021
Maintainer

On reflection, I think it makes sense to do it within the JSON Schema org. Visibility is important here.

@MikeRalphson can we add to the next OAI agenda please "Propose JSON Schema org hosts codegen vocab proposal work"?

MikeRalphson Jun 24, 2021

Added to this week's OAI/TDC agenda

Relequestual · 2021-06-24T07:45:00Z

Relequestual
Jun 24, 2021
Maintainer

@jonaslagoni Assuming the proposal to host the vocab proposal work in the JSON Schema org is accepted, do you want to propose a repo name?
I would guess it might need to leave room for other groups to propose something similar.

I wonder if the next step here is to try and define what is in and out of scope, kind of like a mini simplified charter. If you can agree on that, you'll have a basis to start this work collaboratively.

13 replies

Relequestual Aug 3, 2021
Maintainer

No, this specification will not define any sort of target artifact. It will specify common generic data modeling constructs and semantics. That should be easily mappable to most data modeling features in most programming languages, but no specific outcome can be specified. Every language is going to be a little different and we can't (nor do we want to) specify outcomes for all of them.

Part of the charter says...

A test suite is expected to be produced alongside the work, to better discuss and formulate specific cases.

How does a test suite work if you don't intend to define specific outcomes?
I agree it would be onerous to specify specific outcomes for even the most popular languages.
How would you approach a test suite if such a suite doesn't cover specific outcomes? (I'm not suggesting it can't be done that way, but I'm wondering if an approach has been considered and devised here.)

jonaslagoni Aug 3, 2021
Author

... , but no specific outcome can be specified. Every language is going to be a little different and we can't (nor do we want to) specify outcomes for all of them.

@jdesrosiers are you suggesting that the SIG should not handle this, or that the repository should never accept something like this? I think that if the community wants the specifics of how we port the common generic data modeling constructs to a specific programming language, they should be able to contribute this, maybe not through the SIG work, but to me the SIG is nothing more than a group pushing the work forward. What the SIG includes and excludes should not limit what the community wants to contribute to the repo.

I will most likely contribute this part (some of it) of how it can be ported to languages, as it is in our use-case in AsyncAPI 🙂 But for the SIG itself it will not be a focus point, as that would require a good chunk of work.

How does a test suite work if you don't intend to define specific outcomes?
I agree it would be onerous to specify specific outcomes for even the most popular languages.
How would you approach a test suite if such a suite doesn't cover specific outcomes? (I'm not suggesting it can't be done that way, but I'm wondering if an approach has been considered and devised here.)

@Relequestual one of the first tasks I am gonna propose for the SIG is gonna be to decide how we want to define the description of said data models.

I.e. instead of directly trying to solve JSON Schema -> Java class, we need to try and port it to a common definition JSON Schema -> (Some common definition format) -> Java class, as @jdesrosiers pointed out.

It could be that we adopt one of the existing definition languages such as JTD as the common format so the process we will solve becomes JSON Schema -> JTD -> Java class, but it could be something completely different (don't start the discussion here, just wanted to point it out) 😄

In terms of the test suite, I suspect that it will be this outcome, JSON Schema -> (Some common definition format), that will initially be included. If the community later wants it to include the specific language outputs, so be it, time will tell 🙂

During this week, I will start to create issues and discussions, as well as very specific outcomes that the SIG should push forward. As I can see there are some different assumptions 😄

DavidBiesack Aug 3, 2021

@jdesrosiers Thanks for the explanation, Jason, and for being patient with me. I grasp your perspective and agree what you have works (with that understanding). My comments were made based on

"Specify how JSON Schema (a specification for validating JSON content) can be mapped to data modeling and data definition languages." Is that a correct interpretation? (Ben replied yes)

that is, I saw "JSON Schema with a data definition vocabulary" -> mapping process -> data definition, whereas I think I now understand your position that "JSON Schema with a data definition vocabulary" is a data definition language.

(I come to this from the world of OpenAPI, where we use things like openapi-generator to generate type declarations and data structures in programming languages/DDL, and so that tainted my expectations; i.e. we will still use vocab-idl in this way, although the output will be more accurate, instead of being based on somewhat arbitrary interpretations of the present data model "intention" expressed in current JSON Schema.)

jdesrosiers Aug 4, 2021
Maintainer

@jonaslagoni

instead of directly trying to solve JSON Schema -> Java class, we need to try and port it to a common definition JSON Schema -> (Some common definition format) -> Java class

I very much want to avoid some intermediate format. I think this specification should be describing how to interpret a schema as a DDL, not how to transform a schema into a DDL. It shouldn't need to have an intermediate representation.

@Relequestual

How does a test suite work if you don't intend to define specific outcomes?

I think the test suite can be target DDL agnostic. We can write tests that say, "this schema describes a type Foo with property bar of type int". A specific output format isn't necessary to describe that. Let's say I have a code gen tool that generates Java classes. This is the process I would use:

Run the test schema through the tool and get Java code as text back
Eval the Java code
Use reflection to assert that the eval'd code results in a class with name "Foo" that has a property "bar" of type "int"

This process allows implementers to use a common test suite no matter their target language and only requires that we define assertions, not some intermediate DDL.

karenetheridge Aug 5, 2021
Maintainer

exactly! the schema is the DDL.

If one is unsure where to begin, may I suggest just start writing code. Take a schema that you are familiar with, and start sketching out what the equivalent code should look like that should be generated from that. Then consider some changes to the schema, and how that might change the code. That should give some starting points, if only a library of examples to start. After that, the specification itself should start to write itself.

(at least, that's what I would do.)
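As a small instance of that exercise (the schema, the class names, and the choice of mapping "not required" to an optional field are all assumptions made for illustration, not something decided in this thread):

```python
from dataclasses import dataclass
from typing import Optional

# Version 1 of a hypothetical schema: only `name` is required.
schema_v1 = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name"],
}


# Hand-sketched equivalent: the non-required property becomes optional.
@dataclass
class PersonV1:
    name: str
    age: Optional[int] = None


# Version 2 changes one keyword: `age` is now required as well.
schema_v2 = {**schema_v1, "required": ["name", "age"]}


# The sketch changes with it: `age` loses its default and must be supplied.
@dataclass
class PersonV2:
    name: str
    age: int


print(PersonV1("Ada"))      # age defaults to None
print(PersonV2("Ada", 36))  # both fields must be given
```

Even this small pair raises the questions a specification would have to answer, such as whether a missing optional property is `None`, absent, or something else in the generated model.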
