Batch Implementation

okram edited this page Oct 25, 2012 · 12 revisions

<dependency>
   <groupId>com.tinkerpop.blueprints</groupId>
   <artifactId>blueprints-core</artifactId>
   <version>??</version>
</dependency>

BatchGraph wraps any TransactionalGraph to enable batch loading of a large number of edges and vertices. It chunks the entire load into smaller batches and maintains a memory-efficient vertex cache, so that the intermediate transactional state can be flushed after each chunk is loaded to release memory.

BatchGraph is ONLY meant for loading data and does not support any retrieval or removal operations. That is, BatchGraph only supports the following methods:

  • addVertex() for adding vertices
  • addEdge() for adding edges
  • getVertex() to be used when adding edges
  • Property getter, setter and removal methods for vertices and edges as well as getId()

An important limitation of BatchGraph is that edge properties can only be set immediately after the edge has been added. If other vertices or edges have been created in the meantime, setting, getting or removing properties will throw exceptions. This is done to avoid caching of edges which would require a great amount of memory.
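The limitation above can be illustrated with a small, self-contained sketch. It does not use the Blueprints API: SingleEdgeLoader and EdgeHandle are hypothetical names, and the choice of IllegalStateException is an assumption (the text only says that exceptions are thrown). The sketch keeps exactly one edge in memory and invalidates its handle as soon as another element is added.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical loader (not part of Blueprints) that holds only the latest edge in memory.
class SingleEdgeLoader {
    private long elementCounter = 0;
    private long currentEdgeSeq = -1;                 // sequence number of the edge still in memory
    private Map<String, Object> currentEdgeProps = null;

    // Handle given to the caller; usable only while its edge is the most recent element.
    class EdgeHandle {
        private final long seq;

        EdgeHandle(long seq) {
            this.seq = seq;
        }

        void setProperty(String key, Object value) {
            if (seq != currentEdgeSeq) {
                // Assumed exception type; the edge has already been released.
                throw new IllegalStateException("edge is no longer held in memory");
            }
            currentEdgeProps.put(key, value);
        }
    }

    void addVertex() {
        elementCounter++;
        // Adding any other element releases the previously held edge.
        currentEdgeSeq = -1;
        currentEdgeProps = null;
    }

    EdgeHandle addEdge() {
        // The new edge replaces whatever edge was held before.
        currentEdgeSeq = ++elementCounter;
        currentEdgeProps = new HashMap<>();
        return new EdgeHandle(currentEdgeSeq);
    }
}
```

Because only a sequence number and one property map are retained, memory use stays constant no matter how many edges are loaded, which is the trade-off the limitation buys.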

The BatchGraph constructor expects a TransactionalGraph. To wrap an arbitrary graph, use BatchGraph.wrap(), which additionally wraps non-transactional graphs.

BatchGraph can also automatically set the provided element ids as properties on the respective elements. Use setVertexIdKey() and setEdgeIdKey() to set the property keys for vertices and edges respectively. This is useful when the graph implementation ignores supplied ids. It also makes the loaded graph compatible with later wrapping by IdGraph (see Id Implementation), provided the vertex and edge id keys are set to IdGraph.ID.

As an example, suppose we are loading a large number of edges, each defined by a String array with four entries (a quad). The collection of these arrays is called quads:

  1. The out vertex id
  2. The in vertex id
  3. The label of the edge
  4. A string annotation for the edge, i.e. an edge property

Assuming this array is very large, loading all these edges in a single transaction is likely to exhaust main memory. Furthermore,
one would have to rely on the database indexes to retrieve previously created vertices for a given id. BatchGraph addresses
both of these issues.

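// Wrap the graph: String vertex ids, batch size of 1000
BatchGraph bgraph = new BatchGraph(graph, BatchGraph.IdType.STRING, 1000);
for (String[] quad : quads) {
    // quad[0] is the out vertex id, quad[1] is the in vertex id
    Vertex[] vertices = new Vertex[2];
    for (int i = 0; i < 2; i++) {
        // Retrieve the vertex, or create it if it does not exist yet
        vertices[i] = bgraph.getVertex(quad[i]);
        if (vertices[i] == null) vertices[i] = bgraph.addVertex(quad[i]);
    }
    // quad[2] is the edge label
    Edge edge = bgraph.addEdge(null, vertices[0], vertices[1], quad[2]);
    // Set the property immediately, before any other element is added
    edge.setProperty("annotation", quad[3]);
}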

First, a BatchGraph bgraph is created wrapping an existing graph and setting the id type to IdType.STRING and the batch size to 1000.
BatchGraph maintains a mapping from the external vertex ids, in our example the first two entries in the String array describing the edge,
to the internal vertex ids assigned by the wrapped graph database. Since this mapping is maintained in memory, it is potentially much faster
than the database index. By specifying the IdType, BatchGraph chooses the most memory-efficient mapping data structure and applies compression
algorithms if possible. There are four different IdTypes:

  • OBJECT : For arbitrary object vertex ids. This is the most generic and least space efficient type.
  • STRING : For string vertex ids. Attempts to apply string compression and prefixing strategies to reduce the memory footprint.
  • URL : For string vertex ids that parse as URLs. Applies URL specific compression schemes that are more efficient than generic string compression.
  • NUMBER : For numeric vertex ids. Uses primitive data structures that require significantly less memory.
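To give a feel for why a primitive mapping needs less memory than a generic object mapping, here is a minimal sketch of an open-addressing map from long external ids to long internal ids. It is not BatchGraph's actual data structure: LongIdMap is a hypothetical class, and the sketch assumes that Long.MIN_VALUE is never used as an external id and that internal ids are non-negative (so -1 can signal "unknown").

```java
import java.util.Arrays;

// Hypothetical map (not part of Blueprints) backed by two primitive arrays,
// which avoids allocating a boxed Long and an entry object per vertex.
class LongIdMap {
    private static final long EMPTY = Long.MIN_VALUE; // reserved marker for a free slot
    private long[] keys;
    private long[] values;
    private int size = 0;

    LongIdMap(int expectedSize) {
        int cap = 16;
        while (cap < expectedSize * 2) cap <<= 1; // power of two, load factor at most 0.5
        keys = new long[cap];
        values = new long[cap];
        Arrays.fill(keys, EMPTY);
    }

    private static int slot(long key, int length) {
        int h = Long.hashCode(key);
        h ^= (h >>> 16);
        return h & (length - 1);
    }

    void put(long externalId, long internalId) {
        if (externalId == EMPTY) throw new IllegalArgumentException("reserved id");
        if ((size + 1) * 2 > keys.length) resize();
        int i = slot(externalId, keys.length);
        // Linear probing: walk forward until the key or a free slot is found.
        while (keys[i] != EMPTY && keys[i] != externalId) i = (i + 1) & (keys.length - 1);
        if (keys[i] == EMPTY) size++;
        keys[i] = externalId;
        values[i] = internalId;
    }

    // Returns the internal id, or -1 if the external id is unknown.
    long get(long externalId) {
        int i = slot(externalId, keys.length);
        while (keys[i] != EMPTY) {
            if (keys[i] == externalId) return values[i];
            i = (i + 1) & (keys.length - 1);
        }
        return -1;
    }

    int size() {
        return size;
    }

    private void resize() {
        long[] oldKeys = keys;
        long[] oldValues = values;
        keys = new long[oldKeys.length * 2];
        values = new long[oldKeys.length * 2];
        Arrays.fill(keys, EMPTY);
        size = 0;
        for (int j = 0; j < oldKeys.length; j++) {
            if (oldKeys[j] != EMPTY) put(oldKeys[j], oldValues[j]);
        }
    }
}
```

Each entry costs two array slots of eight bytes each (before the headroom kept for the load factor), whereas a generic map of boxed values pays for object headers and references on top of the payload.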

The last argument in the constructor is the batch size, that is, the number of vertices and edges to load before committing a transaction and starting a
new one.
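The effect of the batch size can be sketched without any graph database. The following is an illustrative model only, not BatchGraph's implementation: CountingStore stands in for a transactional graph and ChunkedLoader for the batching wrapper, and both names are hypothetical.

```java
// Hypothetical stand-in for a transactional store (not part of Blueprints).
class CountingStore {
    int pending = 0;     // elements written since the last commit
    int commits = 0;     // number of commits performed so far
    int maxPending = 0;  // largest number of uncommitted elements ever held

    void write() {
        pending++;
        maxPending = Math.max(maxPending, pending);
    }

    void commit() {
        pending = 0;
        commits++;
    }
}

// Illustrative wrapper that commits after every batchSize added elements.
class ChunkedLoader {
    private final CountingStore store;
    private final int batchSize;
    private int remaining;

    ChunkedLoader(CountingStore store, int batchSize) {
        if (batchSize <= 0) throw new IllegalArgumentException("batchSize must be positive");
        this.store = store;
        this.batchSize = batchSize;
        this.remaining = batchSize;
    }

    void add() {
        store.write();
        if (--remaining == 0) {
            // Batch is full: commit and start a new transaction.
            store.commit();
            remaining = batchSize;
        }
    }

    // Commits the final, partially filled batch, if there is one.
    void finish() {
        if (store.pending > 0) store.commit();
        remaining = batchSize;
    }
}
```

In this model the number of uncommitted elements never exceeds the batch size, regardless of how many elements are loaded in total. A smaller batch size lowers peak memory at the cost of more commits.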

The for-loop then iterates over all the quad String arrays and creates an edge for each by first retrieving or creating the vertex end points
and then creating the edge. Note that we set the edge property immediately after creating the edge. This is required because,
for efficiency reasons, an edge is only kept in memory until the next edge is created.

Incremental Loading

The above describes how BatchGraph can be used to load data into a graph under the assumption that the wrapped graph is initially empty. BatchGraph can also be used to incrementally batch load edges and vertices into a graph with existing data. In this case, vertices may already exist for given ids.

If the wrapped graph does not ignore ids, then enabling incremental batch loading is as simple as calling setLoadingFromScratch(false), which disables the assumption that data is loaded into an empty graph. If the wrapped graph does ignore ids, then one has to tell BatchGraph how to find existing vertices for a given id by specifying the vertex id key using setVertexIdKey(uid), where uid is some string for the property key. This key must also be key indexed in the wrapped graph for the lookup to work.

graph.createKeyIndex("uid",Vertex.class);
BatchGraph bgraph = new BatchGraph(graph, BatchGraph.IdType.STRING, 1000);
bgraph.setVertexIdKey("uid");
bgraph.setLoadingFromScratch(false);
//Load data as shown above

Note that incremental batch loading is more expensive than loading from scratch because BatchGraph has to call on the wrapped graph to determine whether a vertex exists for a given id.
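The extra cost can be made concrete with a small model of the lookup order. This is a sketch under stated assumptions, not BatchGraph's code: IncrementalResolver is a hypothetical class, a plain Map stands in for the key index of the wrapped graph, and the order (in-memory cache first, then the wrapped graph, then creation) is assumed from the description above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical resolver (not part of Blueprints) modelling vertex lookup during loading.
class IncrementalResolver {
    private final Map<String, Long> cache = new HashMap<>(); // in-memory id mapping
    private final Map<String, Long> existing;                // stands in for the graph's key index
    private final boolean loadingFromScratch;
    private long nextInternalId;
    int indexLookups = 0;                                    // how often the "graph" was consulted

    IncrementalResolver(Map<String, Long> existing, boolean loadingFromScratch) {
        this.existing = existing;
        this.loadingFromScratch = loadingFromScratch;
        long max = -1;
        for (long v : existing.values()) max = Math.max(max, v);
        this.nextInternalId = max + 1;
    }

    long resolve(String externalId) {
        Long id = cache.get(externalId);
        if (id != null) return id;          // cheap: answered from memory
        if (!loadingFromScratch) {
            indexLookups++;                 // expensive: ask the wrapped graph
            id = existing.get(externalId);
        }
        if (id == null) {
            id = nextInternalId++;          // vertex does not exist yet: create it
            existing.put(externalId, id);
        }
        cache.put(externalId, id);
        return id;
    }
}
```

When loading from scratch, a cache miss means the vertex cannot exist, so the wrapped graph is never queried. In incremental mode, every first occurrence of an id costs one index lookup.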