Split mediaType into two properties; one for Link, one for Object? (contentType?) #638

Open
trwnh opened this issue Feb 9, 2025 · 4 comments


trwnh commented Feb 9, 2025

https://www.w3.org/TR/activitystreams-vocabulary/#dfn-mediatype

When used on a Link, identifies the MIME media type of the referenced resource.

When used on an Object, identifies the MIME media type of the value of the content property. If not specified, the content property is assumed to contain text/html content.

This kind of "multiple applicability" is generally bad semantic design, since you are using the same term/symbol for different concepts. When used differently, there should be different terms.

I would say that in a next version, we consider doing something like:

mediaType
: Domain: Link
: Range: MIME media type
: Functional: True
: Comment: Identifies the MIME media type of the referenced resource (href of a Link).

contentType
: Domain: Object
: Range: MIME media type
: Functional: True
: Comment: Identifies the MIME media type of the value of the content property. If not specified, the content property is assumed to contain text/html content.
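
To make the split concrete, here is a minimal sketch of how a consumer might read the two properties. It assumes plain compacted AS2 JSON as input; `contentType` is only the name proposed above and is not defined in AS2 today, and both function names are hypothetical.

```python
from typing import Any, Mapping, Optional

# Default stated in the current AS2 definition of mediaType on Object.
DEFAULT_CONTENT_TYPE = "text/html"


def link_media_type(link: Mapping[str, Any]) -> Optional[str]:
    """Media type of the resource that a Link's href points to, if declared."""
    return link.get("mediaType")


def content_media_type(obj: Mapping[str, Any]) -> str:
    """Media type of the value of an Object's content property.

    Prefers the proposed (hypothetical) contentType, falls back to the
    current mediaType for backwards compatibility, then to text/html.
    """
    return obj.get("contentType") or obj.get("mediaType") or DEFAULT_CONTENT_TYPE
```

Falling back to `mediaType` on an Object is a design choice for this sketch only; whether a next version would keep that fallback is an open question.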

As for motivation: mediaType has very limited applicability when applied to Object.content, since binary representations don't really make sense as values of content. But it makes sense to use all kinds of different values when applied to Link.href, as the referenced resource can be basically anything (text or binary).


Tangentially, one other thing that might work and might align better with JSON-LD / RDF is to look into "typed values", so basically something like this in expanded JSON-LD form:

{
  "https://www.w3.org/ns/activitystreams#content": {
    "@value": "<p lang='en'>hello world</p>",
    "@type": "http://www.w3.org/1999/02/22-rdf-syntax-ns#HTML"
  }
}

Which is equivalent to the following Turtle/N-Triples:

_:b0 <https://www.w3.org/ns/activitystreams#content> "<p lang='en'>hello world</p>"^^<http://www.w3.org/1999/02/22-rdf-syntax-ns#HTML> .

Not sure how much sense this makes, though... What we currently have is a language tag (@language), which cannot be used at the same time as @type. An RDF Literal is either language/direction-tagged or has some other datatype, but not both. So it might make sense to keep a separate contentType property while the value of content remains a language-tagged string.

Collaborator

nightpool commented Feb 10, 2025 via email

Author

trwnh commented Feb 10, 2025

I think we can't fully avoid the "different new properties", and to an extent this mirrors how things like HTTP have a Content-Type header which functions identically to how as:mediaType applies to Object.content. The difference is that an AS2 document can have more than just content. But still, we're probably dealing with at most 2 or 3 properties here -- contentType and maybe summaryType if something like #620 gets considered, plus something for links (mediaType could be reused, but we could also define hrefType if we really wanted to...). We've already declared name MUST be plain-text, and that requirement makes sense. Are there other properties that need a variable media type for their literal values?


Is this limitation tracked anywhere?

RDF Literals https://w3c.github.io/rdf-concepts/spec/#section-Graph-Literal are composed of the following:

  • Lexical form. This is a simple string representation of the value.
  • Datatype IRI. This is what the lexical form gets "coerced to".

In concrete syntaxes like JSON-LD or Turtle, there is often syntactic sugar for "simple literals", which don't have an explicitly stated datatype; instead, the datatype is inferred from the syntax. For example, "foo" in Turtle is equivalent to "foo"^^<http://www.w3.org/2001/XMLSchema#string> by default, and 1 in Turtle is equivalent to "1"^^<http://www.w3.org/2001/XMLSchema#integer>. JSON-LD uses similar syntactic sugar to convert JSON strings into xsd:string Literals, JSON numbers into xsd:integer or xsd:double Literals, and JSON booleans into xsd:boolean Literals.

  • IFF the datatype IRI is http://www.w3.org/1999/02/22-rdf-syntax-ns#langString, then a third component is the BCP47 language (JSON-LD @language, Turtle "hello"@en).
  • IFF the datatype IRI is http://www.w3.org/1999/02/22-rdf-syntax-ns#dirLangString, then there is a third component for the language as above, as well as a fourth component for the direction (ltr or rtl, expressed with JSON-LD @direction or with Turtle "hello"@en--ltr)
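
The components listed above can be sketched as a small data structure. This is only an illustration of the constraints as I understand them, not an RDF library API; the `Literal` class and its field names are made up for this example.

```python
from dataclasses import dataclass
from typing import Optional

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"


@dataclass(frozen=True)
class Literal:
    """Hypothetical model of an RDF 1.2 literal (illustration only)."""

    lexical_form: str
    datatype: str = XSD_STRING  # what a "simple literal" gets by default
    language: Optional[str] = None  # BCP47 tag
    direction: Optional[str] = None  # "ltr" or "rtl"

    def __post_init__(self) -> None:
        expects_language = self.datatype in (RDF + "langString", RDF + "dirLangString")
        expects_direction = self.datatype == RDF + "dirLangString"
        # A language is present iff the datatype is one of the two tagged types.
        if (self.language is not None) != expects_language:
            raise ValueError("language requires rdf:langString or rdf:dirLangString")
        # A direction is present iff the datatype is rdf:dirLangString.
        if (self.direction is not None) != expects_direction:
            raise ValueError("direction requires rdf:dirLangString")
        if self.direction is not None and self.direction not in ("ltr", "rtl"):
            raise ValueError("direction must be 'ltr' or 'rtl'")
```

With this model, `Literal("hello", RDF + "langString", language="en")` is accepted, while an `RDF + "HTML"` literal that also carries `language="en"` is rejected, because a literal has exactly one datatype.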

If you try to use both features on the JSON-LD playground this becomes a bit more apparent:

{"https://www.w3.org/ns/activitystreams#content": {
  "@value": "<p>Hello world</p>",
  "@language": "en",
  "@type": "http://www.w3.org/1999/02/22-rdf-syntax-ns#HTML"
}}

jsonld.SyntaxError: Invalid JSON-LD syntax; an element containing "@value" may not contain both "@type" and either "@language" or "@direction".
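
The rule behind that error can be approximated in a few lines. This is a rough sketch of the check only, not how jsonld.js or any other processor implements it; `check_value_object` is a hypothetical helper.

```python
from typing import Any, Mapping


def check_value_object(node: Mapping[str, Any]) -> None:
    """Reject a value object that mixes @type with @language or @direction."""
    if "@value" not in node:
        return  # not a value object, so this rule does not apply
    if "@type" in node and ("@language" in node or "@direction" in node):
        raise ValueError(
            'an element containing "@value" may not contain both "@type" '
            'and either "@language" or "@direction"'
        )
```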

RDF 1.2 also warns about this when defining rdf:HTML like so: https://w3c.github.io/rdf-concepts/spec/#section-html

Any language annotation (lang="…"), text directionality annotation (dir="…"), or XML namespaces (xmlns) desired in the HTML content must be included explicitly in the HTML literal. [...]

So in expanded JSON-LD form, these are fine:

{"https://www.w3.org/ns/activitystreams#content": {
  "@value": "<p>Hello world</p>",
  "@language": "en"
}}

{"https://www.w3.org/ns/activitystreams#content": {
  "@value": "<p lang='en'>Hello world</p>",
  "@type": "http://www.w3.org/1999/02/22-rdf-syntax-ns#HTML"
}}

The former is a language-tagged string (rdf:langString) in the English language (en), equivalent to "<p>Hello world</p>"@en... but applications know out-of-band that they can parse it as HTML (based on contentType or its default value).

The latter is an HTML Literal, which is not the same as a language-tagged string. Literals can only have one datatype, unlike Resources which can have multiple types/classes via rdf:type. As an HTML Literal, you know to parse it as HTML, but any language or direction information needs to be encoded in-band within the Literal's value.
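
To illustrate what "in-band" means here, a rough sketch of converting the first form into the second. Wrapping the markup in a `<div>` is purely my assumption for the example and not something any spec prescribes; `to_html_literal` is a hypothetical helper.

```python
from html import escape
from typing import Any, Dict, Mapping

RDF_HTML = "http://www.w3.org/1999/02/22-rdf-syntax-ns#HTML"


def to_html_literal(node: Mapping[str, Any]) -> Dict[str, Any]:
    """Move @language/@direction from the value object into the HTML itself."""
    attrs = ""
    if "@language" in node:
        lang = escape(node["@language"], quote=True)
        attrs += f' lang="{lang}"'
    if "@direction" in node:
        direction = escape(node["@direction"], quote=True)
        attrs += f' dir="{direction}"'
    value = node["@value"]
    if attrs:
        # Assumption: a wrapper element is an acceptable carrier for the tags.
        value = f"<div{attrs}>{value}</div>"
    return {"@value": value, "@type": RDF_HTML}
```

The result no longer carries `@language`, so generic RDF tooling sees only an rdf:HTML literal and has to parse the markup to recover the language.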

Based on my current understanding, we essentially have to choose between:

  • Define datatype IRIs for every MIME type we are interested in using, a la rdf:HTML (so maybe something like example:Markdown as a datatype IRI for Markdown literals?)
  • Use MIME types in a separate property (so something like contentType: "text/html" informs how to process the value of content)
    • If we wanted to go further in a not-so-backwards-compatible way, we could bundle these together into an object node? ({"value": "foo", "mediaType": "text/plain"}) -- but this would probably not be worth it, since it's not directly a natural language property anymore...

In the interest of taking the least destructive path, probably a simple single contentType will work fine here. If there was justification for exploding it or unpacking it further, then maybe that could be done, but I don't particularly see that justification right now. If anything, it would make more sense to define a content property that was always HTML (typed value rdf:HTML Literal) and then encode the language and direction within that HTML literal, but this is perhaps similarly unjustifiable at this point.

Collaborator

nightpool commented Feb 10, 2025 via email

Author

trwnh commented Feb 10, 2025

If that restriction is to be loosened, it needs to be loosened all the way down the stack at the RDF abstract level (and then at the JSON-LD concrete level). I'm not sure how we would proceed there...
