
Which parties carry what costs of text/turtle changes, and do those outweigh which benefits for whom? #141

Open
RubenVerborgh opened this issue Jan 14, 2025 · 48 comments

Comments

@RubenVerborgh
Member

RubenVerborgh commented Jan 14, 2025

Summary

In rdfjs/N3.js#484, I learned that the specifications intend to redefine the set of valid documents under the text/turtle media type (and presumably others).

Such a change might not be possible/desired, or should at least be acknowledged as a breaking change, with a resulting cost/benefit analysis.

Definitions

  • text/turtle as the media type defined by https://www.w3.org/TR/turtle/
  • valid-turtle as the (infinite) set of valid Turtle 1.1 documents
  • invalid-turtle as the (infinite) set of documents that are not in valid-turtle
  • spec-compliant Turtle parser as a piece of software that:
    • for each document in valid-turtle, produces the corresponding set of triples
    • for each document in invalid-turtle, rejects it (possibly with details on the syntax error)

Note here that the above definition includes rejection; the 1.1 specification text does not require it, but its test cases do.

Potential problems

  1. Retroactively changing the definition of text/turtle breaks existing spec-compliant Turtle parsers, as they will incorrectly label valid text/turtle documents as invalid.
  2. There is no way to distinguish Turtle 1.1 from Turtle 1.2.
  • While 1 could be argued away as "1.1 parsers only break on 1.2 Turtle", it's a problem that the parser will not be able to tell you why it breaks. Does it break because it's invalid Turtle 1.1? Does it break because it's valid Turtle 1.2? Does it break because it's invalid Turtle 1.2, despite this document intending to be within the 1.1 subset? I.e., should or shouldn't it have worked with this particular text/turtle document and no other context?
  3. Building on 2, neither new nor old parsers will be able to fully automatically validate Turtle documents, since they need to be told out of band whether to validate for 1.1 or 1.2.
  4. Because of the closed-set nature of text/turtle in the Turtle 1.1 spec, any changes to that set (whether deletions or additions) would contradict the Turtle 1.1 spec itself / make it invalid.
  5. The problem will happen again in RDF 1.3.
  6. As a more specific instance of 3, there is no standards-based way for clients or servers to indicate they only support Turtle 1.1, nor to discover whether recipients support Turtle 1.1 or 1.2 (or 1.3), as Accept: text/turtle does not tell them. Nor does Content-Type: text/turtle tell them whether their parser can handle the contents, and we could be 20 gigabytes in before we notice it doesn't.
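To make the indistinguishability concrete: absent any version signal, a consumer can only guess from surface patterns. A minimal sketch of that guessing (the 1.2 markers are assumptions based on the current drafts, and pattern matching like this also misfires on string literals and comments, which is exactly why it is not a solution):

```python
import re

# Assumed RDF 1.2 Turtle surface markers (per the current drafts):
# triple terms "<<( ... )>>" and base-direction tags like "@en--ltr".
TURTLE_12_MARKERS = [
    re.compile(r'<<\('),                                     # triple terms
    re.compile(r'@[A-Za-z]+(-[A-Za-z0-9]+)*--(ltr|rtl)\b'),  # base direction
]

def guess_version(doc: str) -> str:
    """Guess which Turtle version a document targets, with no other context.

    Without an explicit in-band or out-of-band signal, this heuristic is all
    a consumer has; either answer could be wrong (a "1.1" document may be
    1.2 data that happens to stay in the shared subset, a "1.2" hit may be
    invalid 1.1, or a marker inside a quoted literal).
    """
    if any(marker.search(doc) for marker in TURTLE_12_MARKERS):
        return "turtle-1.2?"
    return "turtle-1.1?"

doc_11 = '<http://example.org/s> <http://example.org/p> "chat"@en .'
doc_12 = '<http://example.org/s> <http://example.org/p> "chat"@en--ltr .'

print(guess_version(doc_11))  # turtle-1.1?
print(guess_version(doc_12))  # turtle-1.2?
```

The question marks in the return values are the point: the guess can never be confirmed from the document alone.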

Analysis

Unlike formats like HTML, Turtle 1.1 does not contain provisions for upgrading. The specification assumes a closed set of valid documents. We find further evidence in a number of bad test cases (https://www.w3.org/2013/TurtleTests/), which explicitly consider more permissive parsers to be non-compliant.

There is a note in the spec (but only a note, and thus explicitly non-normative):

This specification does not define how Turtle parsers handle non-conforming input documents.

but this non-normative statement is contradicted by the bad test cases, which parsers need to reject in order to produce a compliant report.

Although the considered changes for 1.2 are presumably not in contradiction with those bad cases, the test suite was not designed to be exhaustive. Rather, the 1.1 specification considers text/turtle to be a closed set, and the test cases consider a handful of examples to verify the set is indeed closed.

In particular, no extension points were left open on purpose.
Therefore, the 1.1 spec is not only defining “Turtle 1.1”, but also strictly finalizing text/turtle.

(The IANA submission's reservation that "The W3C reserves change control over this specifications [sic]." does not change the above arguments.)

Potential solutions

A set of non-mutually exclusive solutions, which each cover part or all of the problem space:

  1. Factual disagreements with the above.

  2. The introduction of a new media type.

  3. The introduction of a new profile on top of the existing text/turtle media type.

  4. A change to the Turtle 1.1 spec that adds extension points or otherwise opens the set of text/turtle.

  5. Syntactical support in Turtle 1.2 for extension and/or versioning.

@afs
Contributor

afs commented Jan 14, 2025

Thank you for the analysis.

We do have https://www.w3.org/TR/rdf12-turtle/#changes-12 and the matter will be in "RDF 1.2 New".

The WG is discussing levels of conformance.

@afs
Contributor

afs commented Jan 14, 2025

Related:
There are links to specific versions of documents for both RDF and SPARQL:

https://www.w3.org/TR/rdf12-turtle/ -- currently, the 1.2 working draft. This will be the REC when published. Title "RDF 1.2 Turtle".
https://www.w3.org/TR/rdf11-turtle/ -- The RDF 1.1 published standard. Title "RDF 1.1 Turtle".
https://www.w3.org/TR/rdf-turtle/ -- Tracks the latest publication. Currently, 1.1.
https://www.w3.org/TR/turtle/ -- old name, tracks "rdf-turtle".

@RubenVerborgh
Member Author

The WG is discussing levels of conformance.

Interesting; may I suggest a standards-based mechanism for agents to indicate this level? (A media type or profile comes to mind.)

Or would “classic conformance” de facto mean parsing only the RDF 1.1 subset (in which case it would be equivalent to one of the points above)? However, this does not seem to be the case, given for instance that base directions are being added to literals (in which case “classic” might be a confusing/misleading term).

@afs
Contributor

afs commented Jan 15, 2025

[This is not a WG response]

Any approach for versioning can have costs on both the reader-side and the writer-side.

For example, anything in the HTTP headers that makes the data consumer's task easier puts a requirement on the data producer. Just as RDF 1.2 syntax can appear a long way into the delivered stream for a reader, putting the version information in an HTTP header makes the writer's life harder, because it may need to see all the data first: no stream writing without recording which version the data uses, which would also be a producer-side burden.

One way to publish data is using a web server's support for mapping file extensions to Content-Type headers (.htaccess for httpd, types{} for nginx, etc.). This also appears with data dumps in archives such as zip.

Today, a toolkit may need to "know" to look at the URL to derive the content type if no reliable Content-Type is available.

Given the file extension situation, I think any solution will not help RDF that much. Software will want to handle the static/non-profile/file-extension/... cases anyway. Only a domain specific (i.e. consumer and producer) deployment can be sure the global rules are in-play.

There is a trade-off of whether the long term continued provision of a migration solution is a greater burden than the evolution itself. Such migration should never be withdrawn -- "The web is not versioned".

@RubenVerborgh
Member Author

Thanks, @afs. I want to leave space for others so will be brief, but quickly:

  • Not explicitly indicating feature/version/… support also incurs costs.
  • Your answer covers the case where such indications happen in the message metadata; different arguments and trade-offs apply when they happen in the body. As a quick example, a first-line @version 1.2 or @features literal-direction would cause a desired fail-fast on 1.1 parsers, and assist 1.2 and future parsers.
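A sketch of that fail-fast effect (the @version directive is purely hypothetical here, not part of any Turtle spec): a parser that checks the first line can reject an unsupported document immediately, instead of gigabytes into the stream.

```python
# Hypothetical fail-fast pre-check for an (invented) "@version" directive.
# Nothing here is standardized; it only illustrates the fail-fast idea.

SUPPORTED_VERSIONS = {"1.1"}  # what this hypothetical parser implements

def check_version(first_line: str) -> None:
    """Raise immediately if the document declares an unsupported version."""
    tokens = first_line.split()
    if tokens[:1] == ["@version"] and len(tokens) >= 2:
        version = tokens[1].rstrip(".")
        if version not in SUPPORTED_VERSIONS:
            raise ValueError(
                f"unsupported Turtle version {version}; "
                f"this parser supports {sorted(SUPPORTED_VERSIONS)}"
            )

check_version('@prefix ex: <http://example.org/> .')  # no directive: proceed as today
check_version('@version 1.1 .')                       # declared and supported: proceed
try:
    check_version('@version 1.2 .')                   # declared and unsupported: fail fast
except ValueError as e:
    print(e)
```

Note that undeclared documents still behave exactly as today, which is what makes an optional directive backward-compatible for 1.1 parsers that ignore unknown first lines, but only if the 1.1 grammar were opened up to tolerate it.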

@niklasl

niklasl commented Jan 16, 2025

Maybe an optional version or feature declaration, to support fail-fast detection? With the implicit being "latest REC". It should perhaps be clearly stated that implementations are required to follow the evolution of the format; with the reciprocal requirement of evolving the format responsibly, aspiring to standardize once "sufficient" implementation coverage has been established. AFAIK, there is a requirement of multiple independent implementations; perhaps that number should be a function of the "cardinality of known deployments" and "how viable it is to upgrade them"? (I know it is a practical impossibility to quantify that on the web scale, but it goes to show awareness of the complexity underlying these judgement calls. And that we (W3C members) have a responsibility to care and cater for cooperative evolution to ensure web interop.)

I think this follows the conventions @afs referenced, which is a trade-off I'm cautiously in agreement with. Defining a new format (mime-type + suffix) is the only other viable option AFAICS; and while that caters for more overlap in deployments, it also induces a certain inertia and growing technical debt. (When is the previous format "sunset"? How is the data quality impacted during the overlap period? How do applications take the difference in expressivity into account?)

I see no practical way around some form of social contracts, as even content negotiation is not merely technical (q=0.9 ...). The most important contract is for publishers to avoid utilizing new features until their consumers have been notified and been able to upgrade; balanced with the need for precision in the domain of discourse among those who already have (we form a web after all).

@RubenVerborgh
Member Author

With the implicit being "latest REC"

There is a trade-off of whether the long term continued provision of a migration solution is a greater burden than the evolution itself.

"The web is not versioned".

The key difference being that—for example—HTTP, HTML, and CSS have explicit behaviors on how to deal with unsupported constructs. HTTP proxies have rules on how to deal with unknown headers, HTTP has version negotiation, HTML has rules for unknown tags and attributes, CSS has rules for unsupported properties and even syntax.

So the Web's ability to be non-versioned is baked into the design of those technologies. Conversely, RDF adopting the non-versioned philosophy does not equate to doing nothing on the feature-support/versioning front, but rather means being very explicit about how non-versioning is to be made possible.

In summary, not doing anything puts us on neither a versioned nor a non-versioned trajectory. They are not binary opposites; the third option, “incompatible with both versioning and non-versioning”, is the unfortunate default choice.

@lisp

lisp commented Jan 16, 2025

to take a concrete example as precedent, i do not recall that, in the transition from sparql 1.0 to sparql 1.1, the continued use of the same media type designators was problematic.

in what sense, other than the concern about "late failure" for large documents, should that matter for document media types?
the notion that 1.2 documents would be marked may seem attractive, but to fail early would still require a change to import control flow.
and the inability to modify deployed 1.0-version resources is central to the problem.

@RubenVerborgh
Member Author

to take a concrete example as precedent, i do not recall that in the transition from sparql 1.0 to sparql 1.1 was the continued use of the same media type designators was problematic.

Apples and oranges.
SPARQL is not a data language, nor is it problem-free.
The context of a query language is very different, including:

  • limited average and typical document length
  • different consequence of failure, with immediate and specific feedback
    • failure is in fact sometimes triggered deliberately for endpoint feature discovery
  • absence of streaming parsing
  • different reuse context: individual queries tend to be sent to specific endpoints

So the upgrade path of SPARQL is much more similar to that of SQL, with similar challenges and non-issues.
Not comparable to that of HTTP, HTML, CSS, RDF.

And quite a pain in practice: one typically needs to know out-of-band what precise SPARQL endpoint software an interface is running, which determines how well certain SPARQL 1.0 or 1.1 features are supported.

In contrast, at least today, text/turtle has been 100% unambiguous since the introduction of the spec.
If anything, let's not go the SPARQL route.

in what sense […] should that matter for document media types?

RDF is about enabling interoperability. Yes, on the semantic level, but not having interoperability on the syntactical level precludes that.

In the pre-1.1 days, “Turtle” had been around as a format for over a decade, and parsers were incompatible with each other. It was quite the nightmare, trying to exchange data or write parsers. There was no established (let alone standard) way of knowing what subset was supported by everyone. The Turtle standard solved this by bringing certainty about what is and isn't text/turtle.

The proposed re-definition of text/turtle without any explicit indication sends us back on a path where parsers may or may not be compatible with a certain Turtle version, and they can't even tell us. We cannot ask servers or clients. We have to know what software they are running. Not exactly the automated interoperability goal.

other than the concern about "late failure" for large documents

One might not even know. One could've parsed a 1.2 document wrongly without ever knowing. One could've rejected or accepted a document based on the wrong assumption (because assumptions are all you have, in band). One doesn't know if downstream systems are compatible with 1.1 or 1.2, because they can't tell.

It's an absolute interoperability nightmare that systems don't even have the words to express what they do and do not support. In a context where we're advocating for semantic interoperability, failing at syntactic interoperability is a serious flaw from a technical and strategic perspective. It adds a serious degree of brittleness, the details of which only a small group of people understand, which carries a major risk of reflecting badly on RDF as a whole for not being a sustainable—let alone interoperable—technology. People will say that RDF doesn't work reliably across systems, and they will be right.

@lisp

lisp commented Jan 16, 2025

SPARQL is not a data language, ...

that may be, unless one is concerned with sparql processors.

@lisp

lisp commented Jan 16, 2025

RDF is about enabling interoperability. Yes, on the semantic level, but not having interoperability on the syntactical level precludes that.

we agree - vehemently.
as much as ambiguous recommendations are not the answer, neither is error signalling and handling.
would graph store protocol endpoint service descriptions provide sufficient information to the architectures which you envision, in order for them to more effectively control requests?

@Ostrzyciel

2 cents from someone who did implement a non-standard RDF format that has an analogue of Ruben's proposed @version 1.2 or @features literal-direction – it sounds like a nice idea, but implementing a serializer that would reliably set such flags is a pain. You essentially need to predict the future: "will this document need 1.2 features or not?" This may seem like a trivial question if we are dealing with a small piece of metadata on the Web, but is completely impossible if we have something like a database dump or any other long stream of data.

I ended up making the serializer always claim that all features are used by default. Then, it's up to the user to tell the serializer that "this and that" feature won't be needed. This creates an obvious compatibility problem, because parsers will simply refuse to read these files, even though in practice the feature may not be used. I have not found a better solution to this problem. I think this is a sensible compromise for my ugly format, but I would be against this in W3C formats. More details here.

Overall, I think a sensible solution would be to embrace the mess and just live with the fact that RDF formats can evolve. I would also like to ask the WG to kindly consider producing some "best practices" for how to mark that an RDF file is 1.2, in a use case-specific manner. I like the suggestion from @lisp for adding some info in graph store protocol descriptions. I'm also curious if something like a non-mandatory HTTP header would be an option. Or maybe a comment at the start of the file (like a shebang in .sh files) – of course, entirely optional. (disclaimer: I did not think these ideas through, they may be VERY bad)
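The prediction problem Ostrzyciel describes can be sketched with a hypothetical streaming writer (the API and the "# version:" marker are invented for illustration): because output is streamed, any up-front marker must be committed before a single triple has been seen, so the only safe default is to always declare the newest version, even for data that never uses a 1.2 feature.

```python
from typing import Iterable, Iterator

def serialize_stream(triples: Iterable[str],
                     declared_version: str = "1.2") -> Iterator[str]:
    """Hypothetical streaming Turtle writer with an up-front version marker.

    The marker is emitted before any triple is seen, so the writer cannot
    know whether 1.2 features will actually occur later in the stream.
    Declaring "1.2" conservatively means even plain 1.1 data gets flagged,
    which strict 1.1 consumers would then refuse to read.
    """
    yield f"# version: {declared_version}\n"  # committed before seeing any data
    for triple in triples:
        yield triple + "\n"

plain_11_data = ['<http://example.org/s> <http://example.org/p> "chat"@en .']
output = "".join(serialize_stream(plain_11_data))
print(output.splitlines()[0])  # "# version: 1.2" even though no 1.2 feature is used
```

The alternative, buffering the whole stream to decide the marker afterwards, defeats streaming entirely, which is the trade-off the comment above describes.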

@HolgerKnublauch

Intuitively to me it sounds like TTL documents that use any of the new features need a new media type and file ending.

@lisp

lisp commented Jan 16, 2025

I'm also curious if something like a non-mandatory HTTP header would be an option.

legacy software will not see them.
placing them such that the control flow of those components will have to be aware of them is not effective.
a service description, based on which a higher-level process can orchestrate operations would be much more effective.

@namedgraph

namedgraph commented Jan 16, 2025

Isn't the situation with Turtle 1.1 and Turtle 1.2 a bit like Turtle and TriG? In both cases the former syntax is a subset of the latter.
With Turtle and TriG we got distinct media types (text/turtle and application/trig to be exact). Why shouldn't the same apply to Turtle 1.2?

@dr0i

dr0i commented Jan 16, 2025

Consuming data that is suddenly Turtle 1.2 (delivered with an unchanged text/turtle MIME type), which now breaks my formerly working Turtle parser (say, a widely used library), is like an API break resulting in a non-working program.
So this is bad.
To avoid this, library developers provide different versions over time, marking those with API breaks and those that should be compatible, using semantic versioning.
It's unlikely that data deliverers would provide different Turtle versions, even if there were an HTTP header (or other mechanism) for that.
I ACK the problem, but tend to see it like @niklasl ("I see no practical way around some form of social contracts").
(BTW, even if we only change our data schema, not the RDF version, we also call this out as a possible API break to our customers, as even this can break consumers' programs.)

@coolharsh55

Hi. My thoughts on this from a practicality perspective: I echo Ruben's argument that we should be aiming to support interoperability and backwards compatibility - especially when we know exactly how and why an existing system will break due to new changes. For Turtle, the mime type can be versioned - there is precedent for this if we look at existing mime types.

If we don't version the mime type, existing systems will break. They will need to be updated to support Turtle 1.2. There is no way to distinguish between Turtle 1.1 and Turtle 1.2, so there is no way for them to silently fail or ignore Turtle 1.2. There is also no way to fail with context, i.e. "failed because it doesn't handle Turtle 1.2": it will fail equally for valid Turtle 1.2 and invalid Turtle 1.1. So this is not a trivially fixable change. Not desirable IMO.

If we do version the mime type, existing systems will not break. If they have to support Turtle 1.2, then they MUST be updated anyway, and hence there is an opportunity for these systems to add the mime type handling change alongside the Turtle 1.2 handling changes. It might result in some extra work, and potentially some complex cases around mime type handling. However, we know for sure that existing systems won't break (assuming the mime type is used as intended here), and if they do get an incorrectly assigned mime type, then the fix is to use the correct mime type. So this should be the desirable state.

This also brings up the question of what should happen when Turtle 1.3 eventually is required. Again versioning the mime type is an option, but pragmatically, having the version in the document itself is the best forwards-compatible solution and a known best practice. It would be ideal to have it here.

@rubensworks
Member

rubensworks commented Jan 17, 2025

Another important consideration to take into account here is the length of Accept headers when doing requests within a browser.

Long accept headers in browsers are problematic

The Fetch spec (CORS section) specifies that each header (including the Accept header) is limited to 128 characters.
But even this limit is already causing issues in practice when just taking into account today's RDF media types for content negotiation.

As an example, the Comunica query engine uses the following Accept header by default, which contains 324 characters:

Accept: application/n-quads,application/trig;q=0.95,application/ld+json;q=0.9,application/n-triples;q=0.8,text/turtle;q=0.6,application/rdf+xml;q=0.5,text/n3;q=0.35,application/xml;q=0.3,image/svg+xml;q=0.3,text/xml;q=0.3,text/html;q=0.2,application/xhtml+xml;q=0.18,application/json;q=0.135,text/shaclc;q=0.1,text/shaclc-ext;q=0.05

Hence, when we do these requests in a browser, we must truncate this Accept header to 128 characters, which causes some (valid) RDF media types to not even be requested from the server.
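The figures are easy to verify; a quick sketch (the 128-character limit comes from the Fetch spec's CORS-safelisted request-header rules):

```python
# Comunica's default Accept header, as quoted above.
accept = (
    "application/n-quads,application/trig;q=0.95,application/ld+json;q=0.9,"
    "application/n-triples;q=0.8,text/turtle;q=0.6,application/rdf+xml;q=0.5,"
    "text/n3;q=0.35,application/xml;q=0.3,image/svg+xml;q=0.3,text/xml;q=0.3,"
    "text/html;q=0.2,application/xhtml+xml;q=0.18,application/json;q=0.135,"
    "text/shaclc;q=0.1,text/shaclc-ext;q=0.05"
)
print(len(accept))  # 324, well over the 128-character CORS-safelisted limit

# Truncating to 128 characters and cutting at the last complete entry
# drops most media types from the negotiation:
kept = accept[:128].rsplit(",", 1)[0]
print(kept.count(",") + 1)  # 5 media types survive the cut
```

So out of 15 media types, only the first 5 (ending at text/turtle;q=0.6) can be expressed within the safelisted limit.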

New media types exacerbate this problem

As such, I believe introducing new media types for each RDF serialization in 1.2 is not the right way forward.
Because this would essentially halve the number of media types that can be requested from a server within a browser environment.

For example, the following (which contains some arbitrary new media types for 1.2) already reaches the limit according to CORS:

Accept: application/n-quads,application/n-quads-12,application/trig;q=0.95,application/trig-12;q=0.95,application/ld+json;q=0.9

And this problem would only get worse for every new RDF version:

Accept: application/n-quads,application/n-quads-12,application/n-quads-13,application/n-quads-13,application/n-quads-14

Towards a solution

My initial thought when reading this issue was that profile-based negotiation could be a good solution,
but this is not very compatible with CORS either (a longer Accept header, or new headers that are not allowed by default under CORS).

From this perspective, my feeling is that new media types or profile-based negotiation are not the way to go, and that in-band solutions such as @version might be better (there is precedent for this in JSON-LD's @version).
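For reference, the JSON-LD precedent looks like this: a context may carry an @version entry whose value must be the number 1.1 (not a string). Shown here as a plain Python/JSON sketch; see the JSON-LD 1.1 spec for the exact processing-mode semantics.

```python
import json

# In-band version declaration as used by JSON-LD 1.1:
# the context itself pins the processing mode.
context = {
    "@context": {
        "@version": 1.1,  # must be the number 1.1, not the string "1.1"
        "name": "http://schema.org/name",
    }
}
print(json.dumps(context, indent=2))
```

The appeal for Turtle would be the same: the signal travels with the document, so it survives file extensions, archives, and misconfigured servers.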


Not only does this problem apply to RDF serialization, it also applies to SPARQL result serializations: SPARQL/JSON, SPARQL/XML, SPARQL/CSV, SPARQL/TSV.

@namedgraph

profile-based negotiation could be a good solution

Except that again, no established frameworks (e.g. JAX-RS implementations) support it.

@lisp

lisp commented Jan 17, 2025

Except that again, no established frameworks (e.g. JAX-RS implementations) support it.

which is why it is better to implement the logic which verifies availability of the required media type on a higher level.
as long as there are legacy applications, a client application framework will have to validate the service endpoint before the request is made.

@kasei
Contributor

kasei commented Jan 18, 2025

In contrast, at least today, text/turtle has been 100% unambiguous since the introduction of the spec.

While that's true for the spec version, I don't think the same can be said for the widespread use of the Team Submission that predates the spec. The same media type was in use for years before Turtle 1.1 was introduced and brought with it changes to the syntax. I'm not sure that's reason to do the same thing again, but this isn't the first time we've been faced with this issue.

@hvdsomp

hvdsomp commented Jan 19, 2025

Towards a solution

My initial thought when reading this issue was that profile-based negotiation could be a good solution, but this is not very compatible with CORS either (longer Accept header or new headers that are now allowed by default under CORS).

Just felt like pointing out that there’s also an IETF Internet Draft on profile based negotiation, of which @RubenVerborgh is co-author. It’s been in the works for quite a long time. There’s been renewed interest from the cultural heritage community and even from the W3C where some consider this a topic that falls in the IETF realm. See https://datatracker.ietf.org/doc/draft-svensson-profiled-representations/.

@RubenVerborgh
Member Author

widespread use of the Team Submission that predates the spec. […] I'm not sure that's reason to do the same thing again, but this isn't the first time we've been faced with this issue.

Technologically? Been there, done that.

Reputationally? Not so much.
Fifteen years ago, at least we could say: “All of this mess happens because Turtle isn't yet a standard.”
Without a solution, we'll have to say henceforth: “All of this mess happens because Turtle is a standard. Twice—so far.”

@TallTed
Member

TallTed commented Jan 23, 2025

@RubenVerborgh — You pointed to "Extended discussion at https://ruben.verborgh.org/articles/fine-grained-content-negotiation/"

— which included —

Particularly exciting is that multiple profiles can be combined in a single response, in contrast to the single-dimensional nature of MIME types.

First thing, your writing betrays a limited understanding of your topic, as you refer consistently to "MIME types", which are actually "media types", though they are used in a universe of MIME.

Next, I bear relatively recent scars of a years-long effort to convince IETF to follow their own documentation and work with a number of folks (including me) who wanted to extend media types by defining how to interpret multiple + therein. Part of the scarring came from IETF rejecting their own pre-existing profile extension, especially when the value(s) of profile are URIs, because there's a relatively SMALL character count beyond which those profile values are now to be considered malware(!).

In other words -- your "extended discussion" (which is really an extended monologue) has been overtaken by events, and is no longer (if it ever was) applicable.

@mielvds

mielvds commented Jan 29, 2025

@TallTed questioning Ruben V's credibility with silly 'tomayto, tomahto' comments is not relevant, nor helpful. That said, valid point about profile.

I haven't seen any syntax change on the Web without a change in the media type. While I find @rubensworks' concerns valid, I don't think there's a way around it. Clients will have to be pickier about which content type they request, and eventually either text/turtle will gradually disappear or parsers will have caught up and it won't matter anymore.

@Tpt

Tpt commented Jan 29, 2025

I haven't seen any syntax change on the Web without a change in the media type

I am not sure this is true. For example, XML 1.0 and 1.1 share the same media type. Similarly, HTML, CSS and JavaScript had significant syntax changes (especially the latter two) without a media type change. On the RDF-related side, JSON-LD 1.1 changed the default processing mode without a media type change, and SPARQL got the very large 1.1 update...

@TallTed
Member

TallTed commented Jan 29, 2025

@mielvds

questioning Ruben V's credibility with silly 'tomayto, tomahto' comments is not relevant, nor helpful

I don't think I made any 'tomayto, tomahto' comments, silly or otherwise.

@RubenVerborgh pointed to an 8 year old article he had written, as if it had some greater authority behind it than himself, and called it a "discussion", which would usually mean that it involved multiple participants. Indeed, the article page says "This Linked Research article has been peer-reviewed and accepted for the Workshop on Smart Descriptions & Smarter Vocabularies (SDSVoc) following the Call for Papers."

If you follow the link to that Call for Papers, you can see that (emphasis mine) "Short position papers are required in order to participate in this workshop. These are not academic papers but descriptions of the problem you’d like the workshop to discuss and the presentation you would like to offer. ‘Papers’ can be as simple as a short description of a tool or service to be demonstrated and the technologies used. Each organization or individual wishing to participate must submit a position paper explaining their interest in the workshop by the deadline. The intention is to make sure that participants have an active interest in the area, and that the workshop will benefit from their presence."

So, far from being "peer-reviewed", the only authority behind that article is @RubenVerborgh himself, and the "discussion" consists of only 2 comments (made 7 and 6 years ago) from other people (respectively, @VladimirAlexiev, who wanted to clarify the meaning of 1 sentence, and @nicholascar, who pointed to a document then-in-progress which has since moved to Content Negotiation by Profile; W3C Working Draft, 02 October 2023), of which only @VladimirAlexiev's is really about the content of the article, and that only about one phrase in one sentence, to which @RubenVerborgh made a one sentence reply, which did not result in any clarifying change within the article.

I stand by my assessment of the article, and of profile-based content negotiation.

@RubenVerborgh
Member Author

For example XML 1.0 and 1.1 share the same media type.

See above; HTML, XML etc. have explicitly created in-document mechanisms for versioning, which is why they can do this. Turtle does not, which is why we have the problem. (N3, in contrast, was explicitly equipped for this reason with an @keyword mechanism.)

your writing betrays a limited understanding

@TallTed Let's keep this thread about tech to save people's inboxes; feel free to send other comments to mine.

@afs
Contributor

afs commented Jan 29, 2025

RDF 1.2 does not change the vast majority of RDF data. Triple terms and base direction will be uncommon.

My concern is that we end up "splitting the world" into "RDF 1.1" and "RDF 1.2", yet all RDF 1.1 is valid RDF 1.2 and, very often, data from an RDF 1.2 publisher is valid RDF 1.1.

It is a cost-benefit decision.

JSON-LD didn't change the media type - it was 1.0 compatible (nearly) and did introduce optional @version to control the processing.

Profiles do have a role in asking an RDF 1.2 data publisher that does use triple terms to return RDF 1.1-compatible, unstarred data.

@Tpt

Tpt commented Jan 29, 2025

+1 to @afs

See above; HTML, XML etc. have explicitly created in-document mechanisms for versioning, which is why they can do this.

Indeed, XML has a versioning mechanism. However, some of the examples I gave, like JavaScript, SPARQL and, I believe, CSS, do not have versioning, and syntax introduced in recent versions produces hard syntax errors in the previous ones. There are definitely syntax changes on the web without media type changes.

@RubenVerborgh
Member Author

There are definitely syntax changes on the web without media type changes

…and never without breakage, which was the initial point: I expect format-based breakage to cause fatal technical and reputational damage to an ecosystem supposedly defined by interoperability.

@RubenVerborgh RubenVerborgh changed the title Does re-use of the same MIME types constitute a breaking change? Which parties carry what costs of text/turtle changes, and do those outweigh which benefits for whom? Jan 29, 2025
@gkellogg
Member

The JSON-LD WG discussed this in the 2024-12-11 meeting and recorded it as discussion on w3c/json-ld-syntax#436. The feeling is that there are many instances where the content served under a given media type has changed, and that changing to a new media type is highly disruptive, to the point of constraining any real adoption of the new versions.

  • text/html has gone through many changes.
  • application/jp2 (for JPEG 2000) is often ignored and application/jpeg is used instead.
  • Microsoft changed .doc to .docx and many other related formats, but these were gross changes from a proprietary media type to something more open (application/msword => application/vnd.openxmlformats-officedocument.wordprocessingml.document).

JSON-LD does have a version announcement feature (@version: 1.1), which is optional but may be mandatory for JSON-LD 1.2.

w3c/rdf-xml#49 (comment) suggests an rdf:version="1.2" attribute which could enable dirLangString and triple terms, and could also change the behavior of rdf:ID for reification. But it's not as easy to accomplish for other formats (Turtle/TriG could potentially introduce @version, but this can't reasonably be done for N-Triples or N-Quads).

@RubenVerborgh
Member Author

RubenVerborgh commented Jan 29, 2025

I don't mean to hog the thread, but we're all increasingly talking past each other here.

Lots of true statements above:

  • Breaking changes have been made with and without changes in media type.
  • Non-breaking changes have been made with and without changes in media type.
  • Profiles have and haven't been used for things.

None of that was ever under question originally, nor does it answer the matter at hand.


So let me make my own question that started this thread more explicit and specific:

  • Overall: is it desirable / indeed the best cost–benefit trade-off to make changes to text/turtle?
    • Precedents in either direction exist, and cherry-picking does not help explain our specific trade-off.
  • In more detail:
    • What exactly are the costs / breakages of a proposed text/turtle change?
    • What exactly are the benefits of changing text/turtle compared to any alternatives (regardless of what those are)?
    • Who carries the costs? Who receives the benefits? Are they the same parties, and if not, why is that fair?
    • What percentage of systems will be affected? What percentage of systems are expected to upgrade? What percentage of documents are expected to contain non-backwards compatible statements? (And if that percentage is really low… why is it even worth breaking things?)
    • Are things going to break again with a future update?
  • Finally: Are any of those costs prohibitively expensive technologically, operationally, or reputationally?

@mielvds

mielvds commented Jan 30, 2025

I haven't seen any syntax change on the Web without a change in the media type

I am not sure this is true. For example, XML 1.0 and 1.1 share the same media type. Similarly, HTML, CSS and JavaScript had significant syntax changes (especially the latter two) without media type changes. On the RDF-related elements, JSON-LD 1.1 changed the default processing mode without a media type change, SPARQL got the very large 1.1 update...

Sorry, I should have been more nuanced (I meant: changes that cause old parsers to break rather than ignore), and you are right. It's what Ruben says: both strategies have been used, and probably the character limit on HTTP headers was a major driver for those not changing. BTW, now that you mention the JSON-LD approach: in that line of reasoning, turtle/trig or ntriples/nquads might as well have been one syntax with a versioning construct.

After processing what has been said in this thread, my two cents:

  • What exactly are the costs / breakages of a proposed text/turtle change?

I think the benefits would be limited. If we don't change, some (or most) parsers will break... temporarily. If they are still being used and maintained, someone will provide a fix. Else, it's maybe not that big of a problem.

If we do consider this as a huge problem (I don't think it is), you can only do this with a media-type change (this was my previous point).

  • What exactly are the benefits of changing text/turtle compared to any alternatives (regardless of what those are)?

Parsers will be able to anticipate or mitigate Turtle 1.2 documents without additional changes in protocol or practice, and will be able to explain to the user why it's not working, but only for online documents. I wonder whether "cannot parse 1.2" provides enough benefit compared to "invalid syntax error".

  • Who carries the costs? Who receives the benefits? Are they the same parties, and if not, why is that fair?

The developers of parsers, consumers of RDF, and publishers of RDF all carry costs and receive benefits, and mostly those operating in a Web context. There's a lot of offline processing too, and that's where a @version would be very welcome.

  • What percentage of systems will be affected? What percentage of systems are expected to upgrade? What percentage of documents are expected to contain non-backwards compatible statements? (And if that percentage is really low… why is it even worth breaking things?)

I don't know, but in my practice I almost never rely on the RDF media types, because I don't process data published on the Web that I don't somehow control.

  • Are things going to break again with a future update?

Probably, but it will be less of a problem because of the 1.2 versioning system that results from this discussion :)

  • Finally: Are any of those costs prohibitively expensive technologically, operationally, or reputationally?

No. I don't think this will cause a reputation problem; it didn't for JavaScript, CSS or HTML, and much can be prevented by thoroughly communicating the spec beforehand. Web standards think (long), specify, formalise, document, and communicate, and therefore allow for much more migration time than any other format.

@pchampin
Contributor

@coolharsh55

If we do version the mime type, existing systems will not break.

They might, though. If DBPedia migrates to an RDF 1.2 system, and there is a new media type for Turtle 1.2, then it is likely that they will now serve their content with this new media type (see @Ostrzyciel's comment above). Then old clients will suddenly become unable to consume DBPedia data, even if the data has not changed a bit, and is still effectively Turtle 1.1.

If we don't version the mime type, existing systems will break.

Not necessarily. Following my DBPedia example above: since most of the data will still remain RDF 1.1-compatible, old clients won't even notice the migration.

Of course, I'm not claiming that keeping the same media-type is free of problems. Actually I'm not sure which option I prefer...

@coolharsh55

@pchampin

They might, though. If DBPedia migrates to an RDF 1.2 system, and there is a new media type for Turtle 1.2, then it is likely that they will now serve their content with this new media type (see @Ostrzyciel's #141 (comment) above). Then old clients will suddenly become unable to consume DBPedia data, even if the data has not changed a bit, and is still effectively Turtle 1.1.

Not necessarily. Following my DBPedia example above: since most of the data will still remain RDF 1.1-compatible, old clients won't even notice the migration.

The assumption that most data will remain 1.1-compatible may not hold if we make heavy use of 1.2 in the future. My argument is that mime types are better for existing 'old clients' that won't be updated any time soon, because they won't be served 1.2 content that they cannot handle. If DBPedia or other systems migrate to 1.2 and stop serving 1.1 content, then hopefully it will be an informed decision. This shouldn't be an argument against providing a way to distinguish between versions. Also, the counter-argument should be stated here: if there is no distinction between mime types and DBPedia starts serving 1.2 content, the old clients are going to run into errors anyway. At least with a difference in mime type we have 'control' over the failing conditions, and a way to continue to run old clients with the old mime type.

@TallTed
Member

TallTed commented Jan 30, 2025

They might, though. If DBPedia migrates to an RDF 1.2 system, and there is a new media type for Turtle 1.2, then it is likely that they will now serve their content with this new media type (see @Ostrzyciel's #141 (comment) above). Then old clients will suddenly become unable to consume DBPedia data, even if the data has not changed a bit, and is still effectively Turtle 1.1.

Well, Turtle 1.1 consumers that rely on some default server behavior might choke on such new Turtle 1.2 serializations that were previously delivered as Turtle 1.1, if this new Turtle 1.2 is made the default and/or the media type remains text/turtle.

But I would expect that Turtle 1.1 consumers that correctly use ConNeg and the Accept: header to request text/turtle should get Turtle 1.1, and those that request the new text/turtle12 (or text/turtle-01-02 or whatever new version-identifying media type is chosen) should get Turtle 1.2.

(I don't think the WG can retroactively shoehorn version info into Turtle documents, nor some other serializations of RDF 1.2, because those serializations were specified as if there would be no changes against which to future proof. I consider this a major error, but it is where we are.)

I'm confident that this will be how things work, because DBpedia is hosted by Virtuoso, and Virtuoso supports nearly if not all RDF serializations. Turtle 1.2 is not yet available, because the WG hasn't finished specifying it, but I expect Virtuoso to natively handle RDF 1.2 and SPARQL 1.2 (possibly with a new /sparql12 endpoint alongside the existing /sparql endpoint, though this might be an admin-configurable option) once these things are fully specified.

I don't think the WG has yet considered the full matrix of permutations, of what will happen when mixing tools conforming to each version of RDF and of SPARQL, with data conforming to each version of RDF and queries conforming to each version of SPARQL.

I don't think the WG can conclusively state now whether our new specifications will include some breaking or only non-breaking changes, and thus be 2.0 or 1.2. I think the WG can conclusively state the WG is trying to produce non-breaking changes to 1.2 (as the WG is Chartered to do).
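
The ConNeg behavior described above can be sketched server-side, using the hypothetical text/turtle12 media type mentioned earlier as a placeholder (the actual name, if any, is undecided; q-value handling is omitted for brevity):

```python
def negotiate_turtle(accept_header: str) -> str:
    """Pick which Turtle serialization to serve based on the Accept header.

    'text/turtle12' is a placeholder for whatever version-identifying
    media type might eventually be chosen; it is not standardized.
    """
    requested = [part.split(";")[0].strip().lower()
                 for part in accept_header.split(",")]
    if "text/turtle12" in requested:
        return "text/turtle12"  # client opted in to Turtle 1.2
    return "text/turtle"        # default: legacy Turtle 1.1

# Legacy clients asking for text/turtle keep getting Turtle 1.1:
assert negotiate_turtle("text/turtle, */*;q=0.1") == "text/turtle"
assert negotiate_turtle("application/json, text/turtle12;q=0.9") == "text/turtle12"
```

Under this scheme, a 1.1-only consumer never sees 1.2 syntax unless it explicitly asks for it.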

@rat10
Contributor

rat10 commented Feb 2, 2025

(I don't think the WG can retroactively shoehorn version info into Turtle documents, nor some other serializations of RDF 1.2, because those serializations were specified as if there would be no changes against which to future proof. I consider this a major error, but it is where we are.)

Naive question: can't we define something - and even if it's only a standardized comment like # turtle 1.2 in the first line - to help in disambiguation? That sure is not perfect, but probably still better than nothing.

@coolharsh55

Naive question: can't we define something - and even if it's only a standardized comment like # turtle 1.2 in the first line - to help in disambiguation? That sure is not perfect, but probably still better than nothing.

@rat10 This is from my note on what I see happening: if we do add a magic first line but don't change the mime type, then 1.2 parsers have a way to identify whether they are looking at a 1.2 or 1.x document, but previous parsers won't understand this and as a result will mark the document as invalid Turtle. Since there is no mechanism in Turtle to indicate version or dictate how parsers should check for version compatibility, any solution which relies purely on the addition of fields/info will break existing parsers. So AFAIK the options are:

  1. define a magic string like @version 1.2 as the first line to avoid changing the mime type, which will certainly break existing parsers as this is not defined in 1.1, but lets us keep the same mime type; OR
  2. define a new mime type for 1.2 so that it has to be specifically requested/handled differently, which will not break existing parsers but will also lock them out of future graphs, as they cannot be sure whether a 1.x document is also a valid 1.1 document

Whether 1 or 2 is preferred depends on how one does the cost-value analysis. Both will require changing parsers anyway to handle 1.2. Approach 1 is better considering we don't need to make changes to the stack all the way from content-negotiation / mime-handling bits upwards. Approach 2 is better as it doesn't have a 'penalty' for not changing existing codebases as they won't be given 1.2 with the same mime-type (and they will reject the new mime-type).

My preference is for Approach 2, because even if we take Approach 1 right now, 5-10 years down the line we're going to have this same problem again when there is a version 1.3 which will again break the then-current parsers, etc. So I would prefer not breaking stuff and updating things now, rather than breaking them now and again at every version change. To do this, we would need a new mime type, e.g. turtlex (for expanded), and add a @version X.y to the Turtle spec so that future versions can be declared purely in the document.
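
Approach 1 can be sketched as follows, assuming a hypothetical `@version X.Y` magic first line (no such construct exists in Turtle 1.1, which is precisely why a 1.1 parser would fail early on it):

```python
import re

# Hypothetical version declaration; not part of any Turtle grammar.
# A real Turtle 1.1 parser would reject this line as a syntax error.
VERSION_RE = re.compile(r'^@version\s+(\d+\.\d+)\s*$')

def sniff_version(document: str) -> str:
    """Return the version a Turtle document declares, defaulting to '1.1'."""
    lines = document.lstrip().splitlines()
    first_line = lines[0] if lines else ""
    match = VERSION_RE.match(first_line)
    return match.group(1) if match else "1.1"

assert sniff_version("@version 1.2\n<a> <b> <c> .") == "1.2"
assert sniff_version("<a> <b> <c> .") == "1.1"
```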

@rat10
Contributor

rat10 commented Feb 2, 2025

Naive question: can't we define something - and even if it's only a standardized comment like # turtle 1.2 in the first line - to help in disambiguation? That sure is not perfect, but probably still better than nothing.

@rat10 This is from my note on what I see happening: if we do add a magic first line but don't change the mime type, then 1.2 parsers have a way to identify whether they are looking at a 1.2 or 1.x document, but previous parsers won't understand this and as a result will mark the document as invalid Turtle. Since there is no mechanism in Turtle to indicate version or dictate how parsers should check for version compatibility, any solution which relies purely on the addition of fields/info will break existing parsers.

I'm not claiming any authority w.r.t. this issue but let me just add that

  • it would probably be much easier to update a parser to check for that comment as opposed to updating it to handle triple terms
  • it would be relatively easy to check if a failed parse began with such a comment, providing potentially helpful debug hints

I.e. such a comment would at least make failures less opaque. And failures seem to be unavoidable anyway. OTOH, the comment line by itself should not break anything.
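
The second point can be sketched concretely: assuming a hypothetical `# turtle 1.2` first-line comment (not standardized), a wrapper around a failed parse could turn an opaque syntax error into an actionable one:

```python
def explain_parse_failure(document: str, error: Exception) -> str:
    """Augment a parse error using a hypothetical '# turtle 1.2'
    version comment (not standardized) on the first line."""
    lines = document.lstrip().splitlines()
    first_line = lines[0].lower() if lines else ""
    if first_line.startswith("# turtle 1.2"):
        return (f"{error} (document declares Turtle 1.2; "
                "this parser only supports Turtle 1.1)")
    return str(error)

msg = explain_parse_failure("# turtle 1.2\n<a> <b> <<( ... )>> .",
                            ValueError("unexpected '<<' at line 2"))
assert "only supports Turtle 1.1" in msg
```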

@domel

domel commented Feb 2, 2025

Naive question: can't we define something - and even if it's only a standardized comment like # turtle 1.2 in the first line - to help in disambiguation? That sure is not perfect, but probably still better than nothing.

In libraries like ANTLR, comments are usually ignored in the abstract syntax tree (AST) by defining them as HIDDEN tokens. This is because comments do not affect the language's syntax, including them would unnecessarily increase AST complexity, and removing them simplifies further code processing.

In ANTLR, comments can be defined as hidden tokens:

COMMENT: '#' ~[\r\n]* -> skip;
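
One caveat for a version-bearing comment (this is an assumption about how such a grammar might be adapted, not a quote from any published Turtle grammar): with `-> skip` the token is discarded entirely, so routing comments to ANTLR's hidden channel instead keeps them out of the parse tree while still letting a tool inspect them:

```antlr
COMMENT: '#' ~[\r\n]* -> channel(HIDDEN);
```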

@VladimirAlexiev
Contributor

We can add @version "1.2", and require such line if the document includes 1.2 features.
This will cause 1.1 processors to fail early.
(It should also be allowed to appear in the middle, to support cases of concatenating Turtle documents.)

This is the route of JSON-LD https://w3c.github.io/json-ld-syntax/#h-note-4 :

The use of 1.1 for the value of @version is intended to cause a JSON-LD 1.0 processor to stop processing.

(I think JSONLD 1.0 had keyword @version but it was string-valued, and was purposely changed to the number 1.1 in JSONLD 1.1)

And we need a profile parameter in the MIME type. Something like text/turtle; version=1.2.


I tend to agree with @rubensworks sentiment against a proliferation of content types.
But it won't be overly bad if we add turtle-star, trig-star, ntriples-star, nquads-star, etc.

@TallTed
Member

TallTed commented Feb 4, 2025

But it won't be overly bad if we add turtle-star, trig-star, ntriples-star, nquads-star, etc.

s/-star/-1.2/g (or similar).

-star should disappear from informed use, especially by us, as it has already been leading to confusion: people hear about this work and discover the original whitepaper and/or the CG report before the 1.2 documents. Especially because the earlier works are "finished" while the 1.2 specs are still "in progress", they think that implementing conformance to the earlier works is worthwhile... which it isn't.

@gkellogg
Member

gkellogg commented Feb 5, 2025

We can add @version "1.2", and require such line if the document includes 1.2 features. This will cause 1.1 processors to fail early. (It should also be allowed to appear in the middle, to support cases of concatenating turtles.)

Starting in Turtle 1.1, we've been moving away from @-keywords, preferring PREFIX over @prefix; we would want to take this into consideration.

This is the route of JSON-LD https://w3c.github.io/json-ld-syntax/#h-note-4 :

The use of 1.1 for the value of @version is intended to cause a JSON-LD 1.0 processor to stop processing.

(I think JSONLD 1.0 had keyword @version but it was string-valued, and was purposely changed to the number 1.1 in JSONLD 1.1)

JSON-LD 1.0 did not constrain the set of potential keywords; they might look like normal terms to a JSON-LD processor. But no term allows the use of a numeric value, so "@version": 1.1 would fail on a JSON-LD 1.0 processor. JSON-LD 1.1 corrected the mistake by white-listing specific keywords, so that we wouldn't have that particular issue again.
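
For illustration, a minimal JSON-LD 1.1 context using this mechanism (a sketch; the term mapping is arbitrary). The numeric value of @version is exactly what a 1.0 processor cannot accept:

```json
{
  "@context": {
    "@version": 1.1,
    "name": "http://schema.org/name"
  }
}
```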

And we need a profile parameter in the MIME type. Something like text/turtle; version=1.2.

I tend to agree with @rubensworks sentiment against a proliferation of content types. But it won't be overly bad if we add turtle-star, trig-star, ntriples-star, nquads-star, etc.

I'm not a fan of changing the media type, but I could live with it.

@domel

domel commented Feb 5, 2025

I agree with @VladimirAlexiev. This suggestion aligns well with the approach outlined in RFC 6381, which introduces a codecs parameter for audio and video formats to refine format variations without altering the primary MIME type. In a similar manner, we could adopt a version parameter to differentiate between various iterations of the format while preserving compatibility.

For instance, using text/turtle; version=1.2 would explicitly indicate adherence to the RDF 1.2 specification. This method would also allow for distinctions such as version=1.2-classic to denote a variant that retains compatibility with 1.1 but includes dirLangString, etc.

By implementing this approach, we ensure clear version identification while maintaining seamless integration with existing MIME type handling mechanisms.
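
A version parameter would ride on existing media-type parameter handling. A sketch using only the Python standard library (the version parameter itself is, of course, hypothetical):

```python
from email.message import Message

def parse_content_type(value: str):
    """Split a Content-Type value into (media type, parameters)."""
    msg = Message()
    msg["Content-Type"] = value
    params = dict(msg.get_params()[1:])  # first entry is the type itself
    return msg.get_content_type(), params

media_type, params = parse_content_type("text/turtle; version=1.2")
assert media_type == "text/turtle"
assert params.get("version") == "1.2"

# A plain text/turtle value simply has no version parameter:
assert parse_content_type("text/turtle")[1].get("version") is None
```

Per RFC 2045, implementations must ignore parameters whose names they do not recognize, which is what would make this approach backwards-compatible in principle; whether deployed frameworks actually do so is an open question.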

@pchampin
Contributor

FTR, the discussions on that topic that we had in the community group, back in the days: w3c/rdf-star#43

@w3cbot

w3cbot commented Feb 13, 2025

This was discussed during the #rdf-star meeting on 13 February 2025.

View the transcript

Which parties carry what costs of text/turtle changes, and do those outweigh which benefits for whom? 1

ora: Something from Ruben Verborgh. Any thoughts?

pfps: This is not limited to Turtle; RDF/XML, N-Triples... all of our surface syntaxes are in exactly the same situation

rubensworks: also the SPARQL results syntaxes

ora: We need two things: eventually a solution, and before that a response to Ruben

rubensworks: Based on some scrawling, several solutions have been proposed

rubensworks: the simplest one, do nothing and carry on with the changes

rubensworks: Second solution: introduce a new media type, but it comes with some costs: adoption, and web browsers restricting Accept header length

rubensworks: Another is to make use of profiles

rubensworks: Another, and it's the one I am personally in favor of, is to keep the media type but introduce some in-syntax version mechanism, just like JSON-LD and HTML

rubensworks: the motivation is that currently text/turtle means Turtle 1.1, and if we change that, the contract breaks

gtw: All line-based formats will have a hard time with in-format versioning, but it sounds like a good approach for e.g. Turtle

gtw: if not, we open ourselves to getting a new media type each time a WG updates the spec

AndyS: My personal concern is that to fix a problem we pay occasionally, we split the world in two

AndyS: If there is an inline version thing, then there is a choice, is it mandatory or optional to use RDF 1.2 features?

pchampin: I like the idea of optional inline version with the statement "you should add a version if you use RDF 1.2 features such that clients can break gracefully"

pchampin: Line-based formats make versioning harder but error recovery much easier: you can just restart on the next line

james: With content negotiation, I would expect that the server can answer with whatever it wants

AndyS: A syntax error is not a new class of error a web system would have to deal with anyway

ora: I am not sure a HTML-like recovery system would be a good thing for RDF

ora: If there is something the client does not understand, the client should know it does not understand

<pchampin> well, RDF is monotonic, so a subset of the triples would at least not contradict the whole graph

ora: we need a decision on the principles, even if we don't deal with the details of the decision yet

<fsasaki> just to add my comment to the minutes: having the HTML / CSS / JavaScript way of "doing nothing" may have some benefits, going the path of loose coupling a bit more than done so far for RDF

ora: How would this inline mechanism work in Turtle?

<niklasl> (My notes that I pseudo-read-from): I'm +1 to keep media type. Excellent points by all. I think focus is best spent on considering viability/workability of a version parameter; and also pro/con of in-band (version keyword in "prelude") (break early; mandatory or not)? With considerations on what is deployed out there, and when clients/servers are

<niklasl> likely to upgrade to 1.2 (as in make use of the new features; cf. unstarring). Error recovery is also very interesting (I think CSS is particularly interesting in that respect).

james: My point is not only that the server can return the content-type it wants but it can also provide the adequate signal for the client to understand why it might fail

pchampin: The inline mechanism should be a SHOULD, because there are situations where it is hard for a publisher to know if they are going to make use of specific features like triple terms

pchampin: and so to make sure that in the happy case (no triple terms) the old implementations still work

<Zakim> AndyS, you wanted to ask about general toolkit (e.g Spring) support for "version="

<tl> +1 to an optional (SHOULD) @version parameter, since it allows breaking early and with an explanation

AndyS: I don't know what the state of the world is out there; if you send a `version=` parameter on a media type, would frameworks like Spring fail or handle it properly?

pchampin: the version parameter of requests should be optional

pchampin: I will update my FOAF profile to serve the response with `version=1` and send a message to the semweb mailing list to see if implementations choke on it

ora: Would it make sense to enumerate what the different approaches mean for the different syntaxes?

<niklasl> Yes; one or both of 1) version keyword and 2) version parameter on existing media types?

rubensworks: I am not sure it makes sense to enumerate all approaches, even if it seems that the inline version in addition to the version= media type parameter is getting traction

ora: I am wondering, should we somehow clearly spell out what the failure modes are and possible courses of action


@pchampin
Contributor

As discussed yesterday, I started a survey to assess the impact, on legacy implementations, of adding a version parameter on the media-types.
