Validation in linked multi-graph #152
Hi @svenschneider, you are correct: the error you are seeing is caused by the line you linked. PySHACL iterates over each named Graph in the Dataset and validates each one separately.
This is essentially for historical and backward-compatibility reasons. Early versions of PySHACL did not support Datasets that contained named graphs. They could operate on flat Graphs only (i.e., only N3, NT, Turtle, and RDF/XML files). The easiest and least disruptive way to enable this functionality in PySHACL at the time was to simply iterate over all of the named Graphs in the Dataset and validate them individually. This behavior suited the example datasets we were using because they did not have links between the graphs as your example does.
One hurdle to a better implementation is how RDFLib handles named Graphs in Datasets. When performing a lookup on a Dataset, you usually need to specify the identifier of the graph you're querying. So in your example, when validating node
There is a feature in RDFLib called "default_union" that can be enabled on Dataset objects and that allows the application to execute queries across all named graphs in a dataset at once. If PySHACL used that feature, then your example would work as you expect. PySHACL does not use it because it is a nonstandard operating mode for RDFLib. It is not enabled by default in RDFLib, and if PySHACL enabled it, it could cause unexpected behaviour for users who expect RDFLib dataset queries to operate in the "normal" manner. Additionally, I believe that the last time I experimented with enabling that feature, it caused some W3C SHACL Test Suite tests to no longer pass, but I don't recall the specifics.
I have been thinking of implementing an optional operating mode for PySHACL, something like "union-graph mode", that would force "default_union" enabled on the target Dataset, and would run the validator once on the whole Dataset rather than running the validator individually over each named graph. If users find this mode to be useful and convenient, then it may become the default operating mode for an eventual "backward-incompatible" v1.0 release.
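To make the difference between the two behaviours concrete, here is a minimal toy model in plain Python. It does not use RDFLib or PySHACL: a dataset is just a dict from graph name to a set of triples, the node names ex:foo and ex:bar are made up for illustration, and the union_mode flag is a hypothetical name borrowed from the "union-graph mode" idea above, not an existing PySHACL option.

```python
# Toy model only: a dataset is a dict mapping a graph name to a set of
# (subject, predicate, object) triples. Nothing here is PySHACL or RDFLib API.
RDF_TYPE = "rdf:type"

# Hypothetical data: ex:foo (in ex:g2) links to ex:bar, which is typed in ex:g1.
dataset = {
    "ex:g1": {("ex:bar", RDF_TYPE, "ex:Bar")},
    "ex:g2": {
        ("ex:foo", RDF_TYPE, "ex:Foo"),
        ("ex:foo", "ex:ref", "ex:bar"),
    },
}


def class_violations(triples):
    """Return (node, value) pairs where an ex:Foo node has an ex:ref value
    that is not typed ex:Bar within the given set of triples."""
    violations = []
    for s, p, o in sorted(triples):
        if p == RDF_TYPE and o == "ex:Foo":
            for s2, p2, ref in sorted(triples):
                if s2 == s and p2 == "ex:ref":
                    if (ref, RDF_TYPE, "ex:Bar") not in triples:
                        violations.append((s, ref))
    return violations


def validate(ds, union_mode=False):
    """union_mode is a hypothetical flag for this sketch."""
    if union_mode:
        # Validate once over the union of all named graphs.
        merged = set().union(*ds.values())
        return class_violations(merged)
    # Validate each named graph in isolation and collect the results.
    found = []
    for name in sorted(ds):
        found.extend(class_violations(ds[name]))
    return found


per_graph = validate(dataset)
union = validate(dataset, union_mode=True)
print("per graph:", per_graph)
print("union:", union)
```

In this model the per-graph run reports a violation for ex:foo because the type of ex:bar lives in a different graph, while the union run reports none.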
Hi @ashleysommer, thanks for that elaborate and quick reply! I can now see why it's implemented as is. Additionally, I have been able to reproduce the problem you mention with respect to the queries in an RDFLib Dataset. At the moment, as a workaround, I flatten the whole data graph before passing it to PySHACL. For now that works for my setup, but I don't know how far that approach generalizes or which problems it could introduce.
As for RDFLib's "default_union" feature, instead of "forcing" it on the input Dataset object, you could perhaps check whether it is enabled and only then execute the queries on the whole Dataset? The consequences could be that (i) this results in unexpected SHACL behaviour for users of the library; and (ii) there is (yet) another special case to be handled in the code. Thus, your suggestion of an explicit processing mode seems like the nicer solution.
One more observation: upon re-reading the SHACL standard, it seems that validating an RDF dataset is out of scope for SHACL and will hence remain implementation-specific. In particular, here it says
At the very least it remains vague on what should happen when you provide an RDF dataset (instead of an RDF graph) to a SHACL processor.
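For readers wondering what the flattening workaround amounts to, here is a rough sketch in plain Python rather than RDFLib. The dataset is modelled as a dict from graph name to a set of triples, and all node names are hypothetical. It only illustrates the idea of merging every named graph into one flat set before validation, not the actual code used in my setup.

```python
# Toy model only: graph name -> set of (subject, predicate, object) triples.
# Node names are hypothetical; this is not RDFLib or PySHACL API.
dataset = {
    "ex:g1": {
        ("ex:bar", "rdf:type", "ex:Bar"),
        ("ex:bar", "rdfs:label", "shared"),
    },
    "ex:g2": {
        ("ex:foo", "rdf:type", "ex:Foo"),
        ("ex:foo", "ex:ref", "ex:bar"),
        ("ex:bar", "rdfs:label", "shared"),  # same triple also stated in ex:g1
    },
}


def flatten(ds):
    """Merge all named graphs into one set of triples, dropping graph names."""
    flat = set()
    for triples in ds.values():
        flat |= triples
    return flat


flat = flatten(dataset)
total_statements = sum(len(triples) for triples in dataset.values())

# The link from ex:foo can now be resolved within a single graph.
print(("ex:bar", "rdf:type", "ex:Bar") in flat)
# Statements repeated across graphs collapse into one triple.
print(total_statements, len(flat))
```

One thing the sketch makes visible is that flattening discards which graph a triple came from, so it only fits setups where that information is not needed during validation.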
Yep, those are two paragraphs I did think about including in my previous response. You're right, I have thought about checking whether "default_union" is already enabled on the input Dataset at runtime and, if it is, using the alternate validation behaviour. But as you said, that introduces yet another alternate operating mode for PySHACL that users may trigger accidentally, leading to unexpected behaviour. Also, that particular feature could not be used by the PySHACL CLI tool, because the CLI constructs the graphs by parsing files, so it cannot determine whether "default_union" should be on or off.
And you brought up another point that I should have mentioned in my previous comment, regarding the wording of the SHACL spec. It does remain vague on how to approach this problem, and that is the reason that early versions of PySHACL intentionally only operated on single graphs.
Hi everyone,
I am trying to validate a data graph that is composed of multiple named graphs with links between the named graphs. However, pySHACL does not seem to be able to follow the links. The following two files exemplify this problem.
First, the Turtle shape graph (shape.ttl). It defines an ex:Foo node shape which expects an ex:ref path of class ex:Bar.
Next, the TriG data graph (data.trig). It defines two graphs (ex:g1 and ex:g2) where the second graph contains an ex:Foo with a link to an ex:Bar (defined in the first graph).
When I now run pyshacl on those files (pyshacl -s shape.ttl data.trig) I get the following output:
I would expect this graph to validate without violations. This is for example the case on the SHACL playground.
I think the reason for the validation error is the following line:
pySHACL/pyshacl/validate.py, line 259 in a9f5192
Hence, I wonder (i) whether it is possible to iterate over the whole target graph; and (ii) what the rationale behind iterating over the named graphs individually is.
Thanks for developing the library!
Best regards
Sven