-
Notifications
You must be signed in to change notification settings - Fork 273
Batch Implementation
<dependency>
<groupId>com.tinkerpop.blueprints</groupId>
<artifactId>blueprints-core</artifactId>
<version>??</version>
</dependency>
BatchGraph
wraps any TransactionalGraph
to enable batch loading of a large number of edges and vertices by chunking the entire load into smaller batches and maintaining a memory-efficient vertex cache so that intermediate transactional states can be flushed after each chunk is loaded to release memory.
BatchGraph
is ONLY meant for loading data and does not support any retrieval or removal operations. That is, BatchGraph only supports the following methods:
-
addVertex()
for adding vertices -
addEdge()
for adding edges -
getVertex()
to be used when adding edges - Property getter, setter and removal methods for vertices and edges as well as
getId()
An important limitation of BatchGraph
is that edge properties can only be set immediately after the edge has been added. If other vertices or edges have been created in the meantime, setting, getting or removing properties will throw exceptions. This is done to avoid caching of edges which would require a great amount of memory.
BatchGraph
wraps TransactionalGraph
. To wrap arbitrary graphs, use BatchGraph.wrap()
which will additionally wrap non-transactional graphs.
BatchGraph
can also automatically set the provided element ids as properties on the respective element. Use setVertexIdKey()
and setEdgeIdKey()
to set the keys for the vertex and edge properties respectively. This is useful when the graph implementation ignores supplied ids and allows to make the loaded graph compatible for later wrapping with IdGraph
(see Id Implementation) when setting the vertex and edge Id keys to IdGraph.ID
.
As an example, suppose we are loading a large number of edges defined by a String array with four entries called quads:
- The out vertex id
- The in vertex id
- The label of the edge
- A string annotation for the edge, i.e. an edge property
Assuming this array is very large, loading all these edges in a single transaction is likely to exhaust main memory. Furthermore,
one would have to rely on the database indexes to retrieve previously created vertices for a given id. BatchGraph
addresses
both of these issues.
BatchGraph bgraph = new BatchGraph(graph, BatchGraph.IdType.STRING, 1000);
for (String[] quad : quads) {
Vertex[] vertices = new Vertex[2];
for (int i=0;i<2;i++) {
vertices[i] = bgraph.getVertex(quad[i]);
if (vertices[i]==null) vertices[i]=bgraph.addVertex(quad[i]);
}
Edge edge = bgraph.addEdge(null,vertices[0],vertices[1],quad[2]);
edge.setProperty("annotation",quad[3]);
}
First, a BatchGraph
bgraph is created wrapping an existing graph and setting the id type to IdType.STRING
and the batch size to 1000.
BatchGraph
maintains a mapping from the external vertex ids, in our example the first two entries in the String array describing th edge,
to the internal vertex ids assigned by the wrapped grahp database. Since this mapping is maintained in memory, it is potentially much faster
than the database index. By specifying the IdType
, BatchGraph
chooses the most memory-efficient mapping data structure and applies compression
algorithms if possible. There are four different IdTypes
:
- OBJECT : For arbitrary object vertex ids. This is the most generic and least space efficient type.
- STRING : For string vertex ids. Attempts to apply string compression and prefixing strategies to reduce the memory footprint.
- URL : For string vertex ids that parse as URLs. Applies URL specific compression schemes that are more efficient than generic string compression.
- NUMBER : For numeric vertex ids. Uses primitive data structures that requires significantly less memory.
The last argument in the constructor is the batch size, that is, the number of vertices and edges to load before committing a transaction and starting a
new one.
The for-loop then iterates over all the quad String arrays and creates an edge for each by first retrieving or creating the vertex end points
and then creating the edge. Note, that we set the edge property immediately after creating the edge. This is required because
edges are only kept in memory until the next edge is created for efficiency reasons.
The above describes how BatchGraph
can be used to load data into a graph under the assumption that the wrapped graph is initially empty. BatchGraph
can also be used to incrementally batch load edges and vertices into a graph with existing data. In this case, vertices may already exist for given ids.
If the wrapped graph does not ignore ids, then enabling incremental batch loading is as simple as calling setLoadingFromScratch(false)
, i.e. to disable the assumption that data is loaded into an empty graph. If the wrapped graph does ignore ids, then one has to tell BatchGraph
how to find existing vertices for a given id by specifying the vertex id key using setVertexIdKey(uid)
where uid is some string for the property key. Also, uid must be key indexed for this to work.
graph.createKeyIndex("uid",Vertex.class);
BatchGraph bgraph = new BatchGraph(graph, BatchGraph.IdType.STRING, 1000);
bgraph.setVertexIdKey("uid);
bgraph.setLoadingFromScratch(false);
//Load data as shown above
Note, that incremental batch loading is more expensive than loading from scratch because BatchGraph
has to call on the wrapped graph to determine whether a vertex exists for a given id.