Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support user-defined format in string schemas #29

Closed
Zac-HD opened this issue Dec 19, 2019 · 12 comments
Closed

Support user-defined format in string schemas #29

Zac-HD opened this issue Dec 19, 2019 · 12 comments

Comments

@Zac-HD
Copy link
Member

Zac-HD commented Dec 19, 2019

Thanks to @Stranger6667 in schemathesis/schemathesis#337 (comment):

[kiwicom has] some custom formats, e.g. for payment card numbers - and we declare it in the schema + use validation for it via jsonschema + connexion:

 @draft4_format_checker.checks("card_number")
 def is_card_number(value):
     # Validation based on Luhn algorithm

This is definitely in-scope for hypothesis-jsonschema, because user-defined formats are explicitly allowed by the spec! The options for this are to either add an argument, so that the API looks like

def from_type(schema: Schema, *, formats: Dict[str, SearchStrategy[str]] = None) -> ...:

or to maintain a global registry of known formats and strategies. In either case I'd want to raise an error if we ever try to generate a format value for which jsonschema does not have a known validator.

At the moment I'm leaning towards global state, with validation and the caveat that users can't override formats defined in the spec, but feedback would be most welcome.

@Stranger6667
Copy link
Contributor

Hi and sorry for a late reply :)

With a global registry, it might be easier to use by 3rd party tools like Schemathesis. E.g. users will be able to use hypothesis-jsonschema API directly instead of the wrapper we have now in Schemathesis (which is not necessarily a better idea). With an explicit argument, it feels cleaner to me (since there will be no mutation of internals) and Schemathesis may keep its interface for custom formats and call from_schema without mutating that format dictionary. In this case, there will be no backward-incompatible changes in Schemathesis + there will be no need to access the internals of hypothesis-jsonschema

In either case I'd want to raise an error if we ever try to generate a format value for which jsonschema does not have a known validator.

Hm, for Schemathesis it means, that if there is a custom format in the schema, then it will fail on any affected endpoint. Which, actually may help to find spots where we can generate data better suited than random strings. But I see a possible downside - there might be fewer errors discovered. We actually found a few flaws in our input validators on data that didn't fit the specified format. With the validation we'll have to write a strategy for data generation and the previously discovered cases will not be generated.

Also, it might slow down error discovery - e.g. if there are two fields A (without format) and B (with format) and there is an error that depends on A's value, then we won't discover this error until a proper format strategy will not be specified. It might be hard for users that are not experienced with Hypothesis and use Schemathesis as a zero-config tool.

But I do feel that it makes internal logic more explicit in both packages and the cases I mentioned above could be also handled explicitly on the Schemathesis side, e.g. by passing some "default" format strategy for unknown formats (this behavior may be configurable as well).

Also, since the scope of hypothesis-jsonschema is generating data that fit it also makes sense to have validation. Btw, I was experimenting with a package that generates data that doesn't fit - to widen Schemathesis's scope, but it is still WIP on the Schemathesis side.

So, in short I think passing format strategies explicitly + validation might be the most explicit and clean approach :)

@Zac-HD
Copy link
Member Author

Zac-HD commented Apr 12, 2020

Hmm, some examples might help:

    # Imagine that we've registered a validator for the format "hello",
    # which the only valid value is "world", and no other formats.
>>> FORMATS: Dict[str, SearchStrategy[str]] = {"hello": st.just("world")}

    # If we pass a strategy for a custom format, it gets used
>>> from_schema({"type": "string", "format": "hello"}, custom_formats=FORMATS)
st.just("world")

    # If the schema contains an unknown format, ignore the format keyword (status quo)
>>> from_schema({"type": "string", "format": "something else"})
st.text()

    # It's an error to pass a strategy for a standard format
>>> from_schema(..., custom_formats={"date": ...})
Traceback: ...
InvalidArgument("Cannot pass a custom format for `date`, "
                "because it is part of the jsonschema standard.")

    # It's an error to pass a strategy for a custom format unless a
    # validator for that format is registered on the draft(4,6,7) validator
    # See https://python-jsonschema.readthedocs.io/en/stable/validate/#validating-formats
>>> from_schema(..., custom_formats={"foo": ...})
Traceback: ...
InvalidArgument("A strategy was passed for the custom format `foo`, "
                "but no checker is registered for this format.")

    # Error if generated value for custom format doesn't validate
>>> from_schema(
        {"type": "string", "format": "hello"},
        custom_formats={"hello": st.just("an invalid string")}
    ).example()
Traceback: ...
InvalidArgument("Got 'an invalid string' from custom_formats['hello']=..., "
                "but this value does not conform to the format checker.")

And this shouldn't actually be too hard to implement, so long as we're careful to thread the custom_formats dict through all our recursive calls to from_schema and the variants for arrays, objects, and strings.

@Stranger6667
Copy link
Contributor

Aha, so you propose to connect format strategies with jsonschema's registered checks, right? Sounds interesting to me, I didn't think about it initially :) I like that it will add some extra validation on user-defined strategies.

Thank you, Zac, your examples are very helpful and from my pov reflect a very clean and reliable approach :)

@Zac-HD
Copy link
Member Author

Zac-HD commented Apr 12, 2020

Aha, so you propose to connect format strategies with jsonschema's registered checks, right?

Yep, that's right!

@Zac-HD
Copy link
Member Author

Zac-HD commented Apr 14, 2020

Untested prototype: master...custom-formats - @Stranger6667 I'd love to know how this would work for you!

@Stranger6667
Copy link
Contributor

@Zac-HD Seems excellent to me :)

@Zac-HD
Copy link
Member Author

Zac-HD commented Jul 15, 2020

@Stranger6667 - I've rebased, finished, and tested this 🎉

Can you pip install -U git+https://github.com/Zac-HD/hypothesis-jsonschema@custom-formats, try it out, and let me know how it goes? I don't really have a use-case, but if it works for you I'll ship it 😄

@Stranger6667
Copy link
Contributor

Hi @Zac-HD !

Thank you for working on this! I am thinking about a couple of use cases where additional jsonschema check might not be handy

  1. Non-Python app testing. Usually, the user spins up the app and tests it via e.g. schemathesis.from_uri("http://app/schema.yaml") or CLI.

In this case, there is no dependency on jsonschema on the app side and to avoid InvalidArgument error it will require writing some extra check and wrapping it with e.g. draft4_format_checker.checks decorator. Also, depending on the original check code (which might be in e.g. Java) it might require a re-implementation in Python (as I see the _get_format_filter implementation) to get the validated output. And since such checks might be quite complex I am concerned that it might actually make usage of hypothesis-jsonschema / schemathesis harder, than before. I also assume that if the app under test is written in some other language, then Python might be not the language of primary expertise which also implies some additional complexity for the users.

  1. Apps that don't use jsonschema.

Even though jsonschema is the most popular package for JSON Schema validation there are some alternatives. For example fastjsonschema (supports custom formats) and jsonschema-rs (doesn't support custom formats yet). My original case was tied to connexion which uses jsonschema, but for other frameworks, it is not the case (e.g. rororo plans to use jsonschema-rs). Bringing jsonschema checks will add a necessity for these users to implement additional code to run tests.

In general, I like the idea of extra validation, but it seems to me that it might negatively affect the experience of the users outside of connexion-like use case :( Maybe it could be optional somehow?

@Zac-HD
Copy link
Member Author

Zac-HD commented Jul 15, 2020

That makes sense. Let's relax the requirement a little: still use the registered checker if there is one, but allow custom formats without a registered checker too.

@Stranger6667
Copy link
Contributor

Sounds awesome! :)

@Zac-HD
Copy link
Member Author

Zac-HD commented Jul 16, 2020

Alright! I've added the feature in 61e0cdf, and shipped it as version 0.17.0 🎉

@Zac-HD Zac-HD closed this as completed Jul 16, 2020
@Stranger6667
Copy link
Contributor

Great, thanks! I will take a look at integrating the new version into Schemathesis :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants