Skip to content

Proposal for Semweb library redesign

Wouter Beek edited this page Oct 6, 2015 · 18 revisions

Proposal for Semweb library redesign

1 Literals

1.1 Internal literal representations

Observations:

  • There are currently 3 compound term representations for literals:
    1. literal(type(D, Lex))
    2. literal(lang(Lang, Lex))
    3. literal(Lex)
  • Simple literals (variant 3) were removed in RDF 1.1. They are now (implicitly typed) XSD strings.
  • RDF 1.1 specifies the values of language-tagged strings as pairs of lexical expression and language tag in that order.

Proposal:

  • One representation for literals:
    1. lexical(D, Lex)
    2. lexical(rdf:langString, Lex-Lang)
    3. lexical(xsd:string, Lex)
  • rdf_assert(S,P,O), rdf(S,P,O) is always true.
  • rdf_assert(S,P,O), rdf(S,P,X), X = O is always true.
  • It is expensive to translate to canonical lexical expressions.

1.2 Lexical expression or value?

Observation:

  • Storing literal values i.o. lexical expressions would speed up rdf/[3,4] considerably.

Proposal:

  • One representation for interpreted literals:
    1. type(D, V)
    2. type(rdf:langString, Lex-Lang)
    3. type(xsd:string, V)
  • Add rdf_load/2 option interpret_literals(+bool) to either assert lexical/1 (default) or type/1.
  • If a datatype IRI is not recognized and interpret_literals(true) then throw a warning and assert the literal using lexical/1.
  • Remove option convert_typed_literal(:).

1.3 Canonical lexical expressions

Observation:

  • If we can presume that all literals have canonical lexical expressions then this makes it easier to search for triples.

Proposal:

  • All assertion predicates that assert lexical expressions do so canonically.

1.4 Syntactic sugar for literals

Observation:

  • Turtle is an often-used format for manual RDF data entry.

Proposal:

  • Use Turtle-like syntactic sugar for literals:
    • Text@Lang (Turtle: "Lex"@Lang)
    • Value^^Type (Turtle: "Lex"^^Datatype)

1.4 Literal sorting

Observation:

  • Literals are sorted, but the sorting has little relation to the SPARQL defined ordering. This implies we cannot optimize SPARQL queries that use comparison operators.

1.5 Language-tag matching

Not all predicates are currently matching language tags in language-tagged strings according to BCP 47, e.g., [2] currently does not match [1].

[1]   ex:a ex:b "c"@en-US .
[2]   ?- rdfs_label(rdf:s, en, Label).

2 Simplify the basic API

There should be a difference between (1) predicates that enumerate existing RDF terms, (2) predicates that create new RDF terms and (3) predicates that check whether something is an RDF term.

2.1 API for enumerating existing terms

Observations:

  • rdf_resource/1 does not enumerate all resource-denoting terms. It enumerates the terms that are blank nodes or IRIs and that appear in the subject or object position of some triple.
  • Enumeration based on positional occurrence (rdf_subject/1 and rdf_current_predicate/1) is missing object and node support.
  • Enumeration based on kind (rdf_current_literal/1) is missing support for blank nodes, IRIs and names.

Proposal:

  • Remove rdf_resource/1.
  • Enumerate terms:
    • Add rdf_term/1, uniquely enumerating the current RDF terms.
  • Enumerate terms based on their positional occurrence:
    • Keep rdf_subject/1 as-is.
    • Rename rdf_current_predicate/1rdf_predicate/1.
    • Add rdf_object/1 to complement rdf_subject/1 and rdf_predicate/1.
    • Add rdf_node/1, uniquely enumerating terms that occur in either the subject or object position.
  • Enumerate terms based on their kind:
    • Rename rdf_current_literal/1rdf_literal/1.
    • Add rdf_bnode/1, uniquely enumerating terms that are blank nodes.
    • Add rdf_iri/1, uniquely enumerating terms that are neither blank node nor literal.
    • Add rdf_name/1, uniquely enumerating terms that are either IRI or literal.

2.2 API for creating new terms

Observations:

  • Blank nodes and nodes are different concepts in RDF 1.1.

Proposal:

  • Rename rdf_bnode/1rdf_create_bnode/1.
  • Replace rdf_node/1 by the predicate described in section 2.1.

2.3 API for checking a term's kind

Observations:

  • Kind checking (rdf_is_bnode/1 and rdf_is_literal/1) is missing support for IRIs.
  • rdf_is_resource/1 does not succeed for all resource-denoting terms.

Proposal:

  • Keep rdf_is_bnode/1 as-is.
  • Keep rdf_is_literal/1 as-is.
  • Rename rdf_is_resource/1rdf_is_iri/1.
  • Change rdf_is_resource/1/rdf_is_iri/1 to not succeed for non-IRIs.

3 Load/save RDF serializations

3.1 Uniform API for parsers/writers

Observations:

  • Reading/writing RDF files is currently spread over multiple modules.

Proposals:

  • Core predicates:
    • rdf_load(+Input, +Options)
    • rdf_save(+Input, +Options)
  • Wrapper predicates:
    • rdf_load(+Input)
    • rdf_save(+Input)
  • Properties or rdf_load/[1,2] and rdf_save/[1,2]:
    • Using open_any/5 and close_any/1, Input can a (1) file, (2) stream, or (3) HTTP(S) IRI.
    • Using archive_data_stream/3, all archive streams of Input are processed in sequence.
  • Properties of rdf_load/[1,2]:
    • If format(-) then use rdf_guess_format/3 to determine the serialization format.
    • If Input is an HTTP(S) IRI then option base_iri is automatically set.
    • Option format(?) can either be used to set a serialization format or it can be used to return the serialization format that was heuristically found.

3.2 Loading named graphs

Observations:

  • rdf_load(+Input, +Options) with option graph(+Graph) does not make clear where quadruples are stored?
  • rdf_save(+Input, +Options) with option graph(+Graph) does not allow a collection graphs to be saved to a quadruples file.

Proposal:

  • For rdf_load/2 option dataset(-Default:atom, -Named:list(atom)) returns the graphs that data has be loaded into (i.e., Named=[] for triples).
  • For rdf_save/2 option dataset(+Default:atom, +Named:list(atom)) allows a subset of the currently loaded graphs to be saved to file.
  • Option graph(?) is a shorthand for option dataset(?,[]).

3.3 Serialization format-specific parsers/writers

Proposal:

  • Rename predicates used for loading/saving XML/RDF:
    • rdf_load/[1,2]rdf_xml_load/[1,2]
    • rdf_save/[1,2]rdf_xml_save/[1,2]
    • rdf_save_footer/1rdf_xml_save_footer/1
    • rdf_save_header/2rdf_xml_save_header/2
    • rdf_save_subject/3rdf_xml_save_subject/3

3.4 Missing parsers/generators

The following formats are not supported:

  • JSON-LD
  • N3

4 Collections

Observations:

  • Collections are part of RDF not RDFS.

Proposals:

  • Rename arguments:
    • DBGraph
    • SetCollection
  • Rename predicates:
    • rdfs_member/2rdf_member/2
    • rdfs_list_to_prolog_list/2rdf_list/2
    • rdfs_assert_list/[2,3]rdf_assert_list/[2,3]
  • Support recursive lists for rdf_list/2 and rdf_assert_list/[2,3].

5 Improve terminology

5.1 API

Option names:

  • base_uribase_iri.
  • register_namespacesregister_prefixes.
  • namespacesprefixes.

5.2 Documentation

Proposal:

  • Section "Enumerating objects" → "Enumerating RDF terms", since non only object terms are enumerated.

Terminology:

Term Intended meaning
RDF statement RDF triple or RDF quadruple

Variable names:

Variable name Intended expansion
C Class
D Datatype
G Graph
I Instance
Lex Lexical expression
O Object term
P Predicate term
Property
S Subject term
T RDF term
V Value

6 Graphs

6.1 Internal quadruple representation

Observations:

  • Named graphs (according to the data) are currently conflated with the ability to load data into different graphs (according to the application).
  • Originally, Line was the line in the file from which the triple was loaded. library(rdf_persistency) uses Line as time stamp to correlate transactions on different graphs. Caching applications might use it for last access, etc. Possibly we should separate that.

Proposal:

  • What about saying that a triple is identified by and have a predicate rdf_triple_property(S,P,O,G,Property) to query additional properties (last access, line, etc.)?
  • In the internal storage format for statements the 4th argument must be the graph name (either an IRI for named graphs or the default graph). We need a new rdf/5 (probably with a different name to query).

6.2 Default graph name

The default graph is currently called user.

6.3 Graph statistics

RDF statistics does not show the correct number of graphs in case a graph is unloaded:

?- use_module(library(semweb/rdf_db)).
true.
?- rdf_create_graph(test).
true.
?- rdf_unload_graph(test).
true.
?- rdf_statistics(graphs(N)).
N = 1.
?- rdf_graph(G).
false.

7 Backward chaining

Proposals:

  • Rename predicates:
    • rdfs_individual_of/2rdfs_instance_of/2
Clone this wiki locally