A toolset to work with N-Triples
Author: | Felix Bensmann |
Date: | 07. Dec. 2015 |
Last change: | 16. Jan. 2016 |
Please note: | This document is intended to provide help to get started with ReshapeRDF, nothing more. Its content is subject to change. |
- Introduction
- Sorted N-Triples
- Terms
- Setup
- Commands
- Commands for everyday use
- block
- checksorting
- extractresources
- filter
- getenrichment
- help
- merge
- mergedir
- ntriplify
- pick
- removeduplicates
- renameproperty
- restorebn
- securelooseends
- sort
- split
- version
- Special commands
- analyzetype
- correct
- extractduplicatelinks
- extractreferenced
- outline
- pigeonhole
- pumpup
- subtract
- Getting Started
Processing RDF mass data can be an error-prone job. Common triplestores offer functionality for querying and manipulating RDF data, but only few can handle mass data (say, more than 200 million statements) at the same time. Typical operations like data import and SPARQL queries tend to be time-consuming and inconvenient for comprehensive reshaping operations.
So, when working with simply structured graph data, a solution can be to refrain from using a triplestore and to work with dump files instead. Recurring tasks are extracting entities of a certain class from a large dataset, subdividing a dataset into blocks according to a certain property (blocking), filtering the data, removing resources and single statements, renaming properties, and similar reshaping operations.
Unfortunately, organizing one's RDF mass data in the desired manner cannot be done easily with available out-of-the-box tools.
The tool at hand was developed to enable users of large RDF datasets to efficiently organize and reshape their data without the need of a triplestore.
When there is an RDF dump file to process, users cannot take for granted that stored resources are held together. This is especially true for the N-Triples file format, but it also applies to the RDF/XML file format, which even provides a way to cluster statements by syntax. At the same time, resources within such files cannot be found efficiently: the whole file has to be read in and the stream examined from start to end to find all occurrences. Complex searches cannot be handled at all.
To overcome these limitations, this tool uses an intermediate file format, operated on by a given set of commands, to organize data in a more flexible way. This format is "Sorted N-Triples" (SNT): as the name already indicates, alphabetically sorted N-Triples.
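A minimal Python sketch of what SNT means in practice (the triples below are hypothetical example data, and this is an illustration, not the tool's implementation): a plain codepoint sort groups all statements of a subject into one contiguous run.

```python
# Illustrative sketch: producing Sorted N-Triples by a plain codepoint sort.
# The triples below are hypothetical example data.
triples = [
    '<http://example.org/b> <http://xmlns.com/foaf/0.1/name> "Bob" .',
    '<http://example.org/a> <http://xmlns.com/foaf/0.1/name> "Alice" .',
    '<http://example.org/a> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .',
]

# Sorting by codepoint places all statements of a subject next to each
# other, so later operations can stream the file resource by resource.
snt = sorted(triples)
for line in snt:
    print(line)
```

Because every statement of a resource is adjacent after sorting, downstream operations can process arbitrarily large files in a single streaming pass.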
The following example depicts how SNTs can be used for an interlinking and enrichment process.
- Convert a non-SNT file to N-Triples
- Sort it
- Extract relevant resources (one iteration)
- Split the extracted resources into smaller datasets (one iteration)
- Interlink the data, by whatever means
- If necessary convert the links to SNT
- Merge the links into the data (one iteration)
The flexible nature of this tool is especially helpful with heterogeneous datasets.
Copy the JAR archive reshaperdf-1.0-SNAPSHOT.jar and the lib folder to a directory of your choice. The software requires at least JRE 1.7.
It is helpful to provide a script "reshaperdf" in /bin that facilitates the calls to the program.
#!/bin/bash
# Author: John Smith
# Purpose: Facilitates calls to ReshapeRDF.
java -jar reshaperdf-1.0-SNAPSHOT.jar "$@"
- triple and statement: In this application, a triple and a statement, as known from the RDF context, are the same thing. They always fit on one line.
- line based: An operation is called line based if it treats triples as plain text lines.
- statement based: An operation is called statement based if it treats triples as triples/statements.
- resource based: An operation is called resource based if it sees the data as a list of individual resources.
This chapter outlines the operations and their usage. A command can be called using the following syntax:
java -jar reshaperdf-1.0-SNAPSHOT.jar <command> [<command parameter> ...]
The chapter is subdivided into a section about commands intended for everyday use and a section about special commands that have no purpose in everyday use but come in handy in exotic use cases. The special commands are available in their own branch.
None of the commands will ever overwrite an input file; instead they produce a new file with the desired changes. However, existing files will be overwritten by output files without notification.
Comments are usually not processed by the commands. Most commands require the long form of a URI.
Name | block |
Usage | block <input file> <output dir> <predicate> <char offset> <char length> |
Type | Resource based |
Description | Assigns the resources of the input file to blocks according to a given character sequence of a given property's value. One block is one file. Files that exceed a statement count of 100 000 are further split into files of 100 000. |
Argument: input file | The input file, requires SNT. |
Argument: output dir | The directory to store the output in. |
Argument: predicate | The property to block by. Requires long namespace version. |
Argument: char offset | The offset of the character sequence in the property's value. Use 0 for no offset. If the offset is greater than the value's length, the whole property value is evaluated. |
Argument: char length | The length of the character sequence in the property's value. If the length is greater than the value's length, the whole property value is evaluated. |
Output | A set of SNT files in the given output directory. |
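The blocking idea can be sketched in a few lines of Python. This is a simplified illustration with hypothetical data, not the command's implementation: a character window (offset/length) of a chosen property value is used as the block key.

```python
from collections import defaultdict

# Hypothetical sketch of blocking: resources are grouped by a character
# window (offset/length) taken from the value of a chosen property.
def block_key(value, offset, length):
    # Fall back to the whole value when the offset exceeds it, mirroring
    # the behaviour described above; slicing past the end naturally
    # yields the remainder of the value.
    if offset >= len(value):
        return value
    return value[offset:offset + length]

records = {  # subject -> value of the blocking property (hypothetical data)
    "<http://example.org/p1>": "Miller",
    "<http://example.org/p2>": "Mills",
    "<http://example.org/p3>": "Smith",
}

blocks = defaultdict(list)
for subject, value in records.items():
    blocks[block_key(value, 0, 3)].append(subject)

print(dict(blocks))  # "Miller" and "Mills" share the block key "Mil"
```

Resources whose property values share the same character window end up in the same block, which is the basis for blocking-based interlinking.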
Name | checksorting |
Usage | checksorting <input file> |
Type | Statement based |
Description | Checks the input file for proper sorting. This sorting differs from line sorting in that it ignores control characters. |
Argument: input file | The input file, requires N-Triples. |
Output | Prints "Sorted" to stdout if sorted correctly, "Not sorted" otherwise. |
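The core of such a check can be sketched as a streaming pass in Python. This sketch only compares raw lines; the real command additionally ignores control characters, which is omitted here.

```python
# Sketch of a streaming sortedness check: only the previous line is kept
# in memory, so arbitrarily large files can be verified.
def check_sorting(lines):
    prev = None
    for line in lines:
        if prev is not None and line < prev:
            return "Not sorted"
        prev = line
    return "Sorted"

print(check_sorting(["<a> <p> <b> .", "<b> <p> <c> ."]))  # Sorted
print(check_sorting(["<b> <p> <c> .", "<a> <p> <b> ."]))  # Not sorted
```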
Name | extractresources |
Usage | extractresources <input file> <output file> <predicate> <object> <offset> <length> |
Type | Resource based |
Description | Extracts resources with a given predicate-object combination. |
Argument: input file | The input file, requires N-Triples. |
Argument: output file | Name of the output file, the file with the extracted resources. |
Argument: predicate | The predicate to look for, namespace has to be in long form. Use a "?" to indicate a wildcard. |
Argument: object | The object to look for. Can be a literal or a URL. Use a "?" to indicate a wildcard. |
Argument: offset | Number of the matching resource to start from. |
Argument: length | Number of resources to extract. -1 indicates to use all available resources. |
Output | An SNT file with the extracted resources. |
Name | filter |
Usage | filter <whitelist|blacklist> <input file> <filter file> <output file> |
Type | Resource based |
Description | Removes statements from an N-Triples file according to a whitelist or blacklist. |
Argument: whitelist|blacklist | Either "whitelist" or "blacklist" to indicate what kind of filter is to be used. |
Argument: input file | File to filter |
Argument: filter file | A text file containing the properties to be subject to the filter. Is a simple line-based text file. |
Argument: output file | Name of the file to store the output in. |
Output | An SNT file with the remaining resources. |
Name | getenrichment |
Usage | getenrichment <linkfile> <resource file> <output file> |
Type | Statement based/Resource based |
Description | Extracts resources from an SNT file that are addressed by the object of an SNT link file. Missing resources in the resource file are ignored. The subjects of the extracted statements are changed to the subject of the link. |
Argument: linkfile | The link file, requires SNT. |
Argument: resource file | An SNT file containing the resources to be extracted. |
Argument: output file | Name of the output file. The file containing the extracted resources. |
Output | An SNT file with the extracted resources. |
See also extractreferenced.
Name | help |
Usage | help <cmd> |
Type | - |
Description | Displays the help text for the specified command. |
Argument: cmd | Name of the command. |
Output | Help text for the specified command. |
Name | merge |
Usage | merge <output file> <input file1> <input file2> [<input file3>...] |
Type | Statement based |
Description | Merges multiple sorted N-Triples files. |
Argument: output file | The name of the output file. |
Argument: input file1 | An SNT file containing statements to be merged. |
Argument: input file2 | Another SNT file containing statements to be merged. |
Argument: input fileN | Further optional SNT files containing statements to be merged. |
Output | An SNT file with the merged results. |
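Merging sorted files never requires loading them whole; an n-way merge streams all inputs in parallel. A minimal Python sketch with hypothetical in-memory "files" (Python's `heapq.merge` does exactly this for sorted iterables):

```python
import heapq

# Sketch of merging sorted N-Triples streams: heapq.merge performs an
# n-way merge and only ever holds one line per input in memory.
file_a = ['<a> <p> "1" .', '<c> <p> "3" .']
file_b = ['<b> <p> "2" .', '<d> <p> "4" .']

merged = list(heapq.merge(file_a, file_b))
print(merged)  # all four statements, still in sorted order
```

With real files, the two lists would be replaced by open file handles; the streaming property is what makes merging very large dumps feasible.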
Name | mergedir |
Usage | mergedir <input dir> <output file> |
Type | Statement based |
Description | Merges SNT files that are in the same directory. Extends namespaces to their long forms. |
Argument: input dir | The name of the directory containing the SNT files to be merged. Subdirectories are also searched. |
Argument: output file | An SNT file containing the merged statements. |
Output | An SNT file with the merged results. |
Name | ntriplify |
Usage | ntriplify <input dir> <output file> [<JSON-LD context URI> <JSON-LD context file>][...] |
Type | Statement based |
Description | Converts all RDF files from a directory into N-Triples and merges them into a single file. |
Argument: input dir | The name of the directory containing the RDF files. Subdirectories are also searched. |
Argument: output file | The name of the output file. |
Argument: JSON-LD context URIs and files | Optional. It is possible to state a mapping of JSON-LD context URIs to local JSON-LD context files. The context URIs and file paths have to be given in pairs separated by a space. The command uses the local context whenever the remote context is not available. |
Output | An N-Triples file containing the converted statements. |
Name | pick |
Usage | pick <input file> <output file> <s|p|o|stmt|res> <s|list|?> <p|list|?> <o|list|?> |
Type | Dependent on search pattern |
Description | Takes an input file and extracts all subjects, predicates, objects, statements or resources matching the specified pattern, and writes them to a file. A "?" character can be used as a wildcard. Example: infile.nt outfile.nt o subjectlist.txt predicatelist.txt ? returns all objects whose statements match any combination of subjectlist and predicatelist. |
Argument: input file | The name of the input file. Sorted N-Triples are required. |
Argument: output file | The name of the output file. |
Argument: return type | The kind of information to be returned; one of subject, predicate, object, statement or resource. |
Argument: subject expression | The expression for matching the subject: A single URL, a file containing a list of URLs or a wildcard. |
Argument: predicate expression | The expression for matching the predicate: A single URL, a file containing a list of URLs or a wildcard. |
Argument: object expression | The expression for matching the object: a single URL/literal, a file containing a list of URLs or literals, or a wildcard. Datatypes and language tags cannot be processed. |
Output | An N-Triples file containing the output. |
Name | removeduplicates |
Usage | removeduplicates <input file> <output file> |
Type | Line based |
Description | Removes duplicate statements from an SNT file. Keeps one line of each kind. |
Argument: input file | The name of the input file, requires SNT. |
Argument: output file | The name of the output file. |
Output | An SNT file containing the remaining statements. |
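On sorted input, duplicate statements are always adjacent, so duplicate removal is a single pass keeping one line per run of equal lines. A Python sketch of the idea with hypothetical data:

```python
from itertools import groupby

# Sketch of duplicate removal on sorted input: identical lines are
# adjacent, so one pass with groupby keeps exactly one line of each kind.
lines = ['<a> <p> "1" .', '<a> <p> "1" .', '<b> <p> "2" .']
deduped = [line for line, _ in groupby(lines)]
print(deduped)
```

This is why the command requires SNT: on unsorted input, duplicates may be far apart and cannot be removed in a streaming pass.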
Name | renameproperty |
Usage | renameproperty <input file> <output file> <property> <substitute> [<property> <substitute>...] |
Type | Statement based |
Description | Renames a property. Requires long namespaces. |
Argument: input file | The name of the input file, requires SNT with long namespaces. |
Argument: output file | The name of the output file. |
Argument: property | The property to be replaced. Long namespace required. |
Argument: substitutes | The substitute property. Long namespace required. |
Output | A copy of the input file, in SNT, with replaced properties. |
Name | restorebn |
Usage | restorebn <input file> <output file> |
Type | Statement based |
Description | Restores blank nodes within an N-Triples file that were transcribed, e.g. by the ntriplify command. |
Argument: input file | The name of the input file, requires N-Triples with long namespaces. |
Argument: output file | The name of the output file. |
Output | A copy of the input file with restored blank nodes. |
Name | securelooseends |
Usage | securelooseends <file A> <file B> <output file> <predicate1> <substitute1> [<predicate2> ...] |
Type | Resource based |
Description | Extracts resources from file B that are referenced in file A. Then reduces each such resource to a meaningful string and adds it to the referencing resource. |
Argument: file A | An SNT input file containing the references. |
Argument: file B | An SNT input file containing the resources that are referenced in file A. |
Argument: output file | The name of the output file. |
Argument: predicate1 | A property from file A whose reference is to be looked up in file B. |
Argument: substitute1 | A property to map the meaningful string to. |
Output | An SNT file containing the resulting statements, e.g. <s> <substitute1> "meaningful string" |
Name | sort |
Usage | sort <input file> <output file> |
Type | Statement based |
Description | Sorts an N-Triples file in ascending order of codepoints. |
Argument: input file | The name of the input file, requires N-Triples in long namespace form. |
Argument: output file | The name of the output file. |
Output | An SNT file containing all the statements from the input file. |
Name | split |
Usage | split <input file> <output file prefix> <resources per file> |
Type | Resource based |
Description | Splits an SNT file into several smaller files, each with a given number of resources. |
Argument: input file | The name of the input file, requires SNT. |
Argument: output file prefix | Prefix for the output files, e.g. /home/data/part_ |
Argument: resources per file | Number of resources per file. |
Output | Multiple SNT files, e.g. /home/data/part_1.nt etc. |
Name | version |
Usage | version |
Type | - |
Description | Prints the version to the screen, e.g. v0.1 . |
Name | analyzetype |
Usage | analyzetype <input file> <type> <predicate1> [<predicate2> ...] |
Type | Resource based |
Description | Counts the occurrences of literal objects for one or more properties for a given rdf:type. When more than one property is used, the combinations of properties are counted as well. Output is written to a CSV file. The entries are ranked by their occurrences. Use case example: a ranking of the most common first name and last name combinations for persons could be created. See also: pigeonhole |
Argument: input file | The input file, requires SNT. |
Argument: type | The type of resource to be analyzed, e.g. foaf:Person |
Argument: predicate1 | The property to examine. Requires long namespace version. |
Argument: further predicates | Further predicates, requires long namespace version. |
Output | One CSV file for every property and combination of properties, names are chosen automatically. |
Name | correct |
Usage | correct <input file> <output file> |
Type | Line based |
Description | Removes invalid triples from a given file, or replaces invalid characters with the "?" character. |
Argument: input file | The input file, requires N-Triples. |
Argument: output file | Name of the output file. |
Output | An N-Triples file without the problematic triples. |
Name | extractduplicatelinks |
Usage | extractduplicatelinks <input file> |
Type | Statement based |
Description | Extracts statements that do not address their subject or object exclusively. Use case example: find owl#sameAs links in a link set that connect commodity resources, or rather identify such resources. Useful in combination with the subtract command. |
Argument: input file | The input file, requires SNT. |
Output | Two N-Triples files: subjects.nt contains all statements that do not address their subject exclusively; objects.nt contains all statements that do not address their objects exclusively. |
Name | extractreferenced |
Usage | extractreferenced <file A> <file B> <output file> <predicate1> [<predicate2> ...] |
Type | Resource based |
Description | Extracts resources from file B that are referenced in file A. Missing resources in file B are ignored. |
Argument: file A | The input file containing the references. SNT required. |
Argument: file B | A second input file containing the referenced resources. SNT required. |
Argument: output file | The name of the output file. This file will contain the extracted resources. |
Output | An SNT file containing the extracted resources. |
See also getenrichment.
Name | outline |
Usage | outline <input file> <output file> <target property> |
Type | Resource based |
Description | Creates literal representations for each resource in a file. The representation is mapped to a given property. |
Argument: input file | The input file with the resources to be outlined. SNT required. |
Argument: output file | The name of the file to store the output in. |
Argument: target property | The property to assign the outline to. |
Output | An SNT file with one statement for each resource. <original subject> <target property> "literal representation" |
See also: securelooseends.
Name | pigeonhole |
Usage | pigeonhole <input file> <output file A> <output file B> <output file C> <CSV> <total threshold> |
Type | Resource based |
Description | Extracts the resources from an SNT file according to the frequency of their attribute values. A CSV file, such as produced by analyzetype, provides the necessary information: it contains combinations of values (a single property is also considered a combination) of the covered properties, together with a number "total" that indicates the number of occurrences of the combination in the input file. The entries in this CSV file are sorted by this number. The command reads the CSV entries until the "total" field falls below the threshold, then stops. It then reads the input file resource by resource and looks each resource's property combination up in the CSV table. If the combination has an entry in the table, the resource is written to file A. If the combination is not present in the table, the resource is written to file B. If the resource does not contain all of the properties stated in the CSV file, it is written to file C. Thus the command extracts the resources of the top X most frequent property combinations. |
Argument: input file | The input file with the resources to be pigeonholed. SNT required. |
Argument: output file A | The name of the file to store the output in. |
Argument: output file B | The name of the file to store the output in. |
Argument: output file C | The name of the file to store the output in. |
Argument: CSV | A CSV file containing the frequencies of the properties values. Same as the output of the analyzetype command. |
Argument: total threshold | A positive integer to be used as lower threshold on the total frequencies column in the CSV. Can be used to close out uncommon values. |
Output | Three SNT files with the resources of their category. File A contains all resources with a property combination among the top X (limited by the threshold parameter) most frequent combinations. File B contains all resources that have values for the requested properties but are not in the top X combinations. File C contains the remaining resources. |
Use together with analyzetype.
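The classification logic can be sketched in Python. All data here is hypothetical (first/last name combinations with invented counts); the sketch only illustrates the A/B/C decision, not the command's file handling.

```python
# Simplified sketch of the pigeonhole idea with hypothetical data:
# a frequency table (as produced by analyzetype) is cut off at a
# threshold, and each resource's value combination is looked up in it.
csv_rows = [  # (value combination, total), sorted by total descending
    (("John", "Smith"), 120),
    (("Anna", "Miller"), 80),
    (("Zoe", "Quux"), 2),
]
threshold = 10
frequent = {combo for combo, total in csv_rows if total >= threshold}

def pigeonhole(resource):
    # resource: dict property -> value; the two checked properties are
    # first and last name in this example.
    try:
        combo = (resource["first"], resource["last"])
    except KeyError:
        return "C"  # does not have all required properties
    return "A" if combo in frequent else "B"

print(pigeonhole({"first": "John", "last": "Smith"}))  # A: frequent combo
print(pigeonhole({"first": "Zoe", "last": "Quux"}))    # B: below threshold
print(pigeonhole({"first": "Anna"}))                   # C: property missing
```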
Name | pumpup |
Usage | pumpup <input file> <output file> |
Type | Statement based |
Description | Extends the namespaces in an N-Triples file to their long forms, using the namespaces stated below. The file "namespaces.txt" specifying these namespaces comes with the binaries and can be adapted to custom needs. Many commands already include this functionality. |
Argument: input file | The name of the input file, requires N-Triples. |
Argument: output file | The name of the output file. |
Output | An N-Triples file containing the expanded statements. |
List of namespaces with their respective short forms.
- bf http://bibframe.org/vocab/
- bibo http://purl.org/ontology/bibo/
- dbp http://dbpedia.org/ontology/
- dc http://purl.org/dc/elements/1.1/
- dct http://purl.org/dc/terms/
- foaf http://xmlns.com/foaf/0.1/
- gnd http://d-nb.info/standards/elementset/gnd#
- owl http://www.w3.org/2002/07/owl#
- rdac http://rdaregistry.info/Elements/c/
- rdai http://rdaregistry.info/Elements/i/
- rdam http://rdaregistry.info/Elements/m/
- rdau http://rdaregistry.info/Elements/u/
- rdaw http://rdaregistry.info/Elements/w/
- rdf http://www.w3.org/1999/02/22-rdf-syntax-ns#
- rdfs http://www.w3.org/2000/01/rdf-schema#
- schema https://schema.org/
- skos http://www.w3.org/2004/02/skos/core#
- void http://rdfs.org/ns/void#
- sch http://schema.org/
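Prefix expansion amounts to a table lookup. A hedged Python sketch using two entries from the table above (this mirrors the idea, not the command's actual parsing, which operates on full N-Triples terms):

```python
# Sketch of namespace expansion: short prefixes from a table like the
# one above are rewritten to their long forms.
NAMESPACES = {
    "foaf:": "http://xmlns.com/foaf/0.1/",
    "rdf:": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}

def pump_up(term):
    for prefix, long_form in NAMESPACES.items():
        if term.startswith(prefix):
            return "<" + long_form + term[len(prefix):] + ">"
    return term  # already in long form, or unknown prefix

print(pump_up("foaf:Person"))  # <http://xmlns.com/foaf/0.1/Person>
```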
Name | subtract |
Usage | subtract <file A> <file B> <output file> |
Type | Line based |
Description | Removes all statements from file A that are also in file B. |
Argument: file A | The name of the first file, requires SNT. |
Argument: file B | The name of the second file, requires SNT. |
Argument: output file | Name of the output file. |
Output | The resulting file containing SNT. |
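Because both inputs are sorted, the subtraction can run as one synchronized pass over A and B without any in-memory index. A Python sketch with hypothetical data:

```python
# Sketch of subtract on sorted input: a single synchronized pass removes
# from A every line that also occurs in B.
def subtract(a, b):
    b_iter = iter(b)
    b_line = next(b_iter, None)
    out = []
    for line in a:
        # Advance B until it has caught up with the current line of A.
        while b_line is not None and b_line < line:
            b_line = next(b_iter, None)
        if line != b_line:
            out.append(line)
    return out

file_a = ['<a> <p> "1" .', '<b> <p> "2" .', '<c> <p> "3" .']
file_b = ['<b> <p> "2" .']
print(subtract(file_a, file_b))  # ['<a> <p> "1" .', '<c> <p> "3" .']
```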
Some steps to get started:
- Prepare your data in a single directory, have it in one of these formats: .nt, .rdf, .xml, .jsonld.
- Convert your data to N-Triples if it is not already in that format.
java -jar reshaperdf-1.0-SNAPSHOT.jar ntriplify ./myrdf ./nt/mydata.nt
- Sort your data.
java -jar reshaperdf-1.0-SNAPSHOT.jar sort ./nt/mydata.nt ./nt/mydata_sorted.nt
- Extract all persons (foaf:Person) from the file into another file.
java -jar reshaperdf-1.0-SNAPSHOT.jar extractresources ./nt/mydata_sorted.nt ./nt/mypersons.nt http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://xmlns.com/foaf/0.1/Person 0 -1