Adding Rows in Batches and trying to Save Vertex Ids


Venkat Dasari

May 3, 2021, 8:22:29 PM
to Gremlin-users
Hi,
 I am trying to add vertices in batches and save the vertex ids for later use when building edges. Whenever I call .toList() after every 1000 records, I always get back only one record. What am I missing here?
 I have been stuck on this for more than a month, and I don't know how else to load large data into the graph. If this works, I want to do the same thing using Spark and parallelize the load.

private static GraphTraversalSource createTraversal() {
    final String[] contactPoints = new String[]{"someIpAddress"};
    String graphName = "";
    // Gryo serializer with the JanusGraph and TinkerGraph IO registries
    final MessageSerializer serializer = Serializers.GRYO_V3D0.simpleInstance();
    final Map<String, Object> gryoConfig = new HashMap<>();
    gryoConfig.put("ioRegistries",
            Arrays.asList(JanusGraphIoRegistry.class.getName(), TinkerIoRegistryV3d0.class.getName()));
    serializer.configure(gryoConfig, null);
    final String remoteTraversalSource = graphName.isEmpty() ? "g" : graphName + "_traversal";

    GraphTraversalSource g = traversal().withRemote(DriverRemoteConnection.using(Cluster.build()
                    .serializer(serializer)
                    .addContactPoints(contactPoints)
                    .create(),
            remoteTraversalSource));

    return g;
}

public static void main(String[] args) throws Exception {
    StopWatch sw = new StopWatch();
    Random rnd = new Random();
    sw.start();
    for (int j = 0; j < 100; j++) {
        System.out.println("========== ITERATION ===== " + j);
        GraphTraversalSource g = createTraversal();
        GraphTraversal<Vertex, Vertex> vertexT = null;
        List<Vertex> vertexList = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            vertexT = g.addV("Market");

            int mktId = rnd.nextInt();
            vertexT = vertexT.property("mkt_id", mktId);
        }
        vertexList = vertexT.toList();
        System.out.println("Size of Iteration: " + vertexList.size());
        sw.split();
        System.out.println("Execution for Iteration : " + (sw.getSplitTime() / 1000) + " seconds");
    }
}



Bassem Naguib

May 4, 2021, 1:36:39 AM
to Gremlin-users
There are two problems I can see in your code. First, this line
vertexT = g.addV("Market");
overwrites the previous traversal without executing it.

Your code needs to look something like this:

GraphTraversal<Vertex, Vertex> vertexT = null;
for (int i = 0; i < 1000; i++)
{
    if (vertexT == null)
        vertexT = g.AddV("Market");
    else
        vertexT = vertexT.AddV("Market");

    ...
}
vertexT.Iterate();

The second problem is that you are expecting
vertexList = vertexT.toList();
to give you a list of all the vertices you added. But it will only give you the last vertex you added, because the last step in your traversal is AddV(), which returns the vertex that was just added.

If you would like to get the full list of vertices, you need to use something like this:
IList<Vertex> vertexList = g.V().HasLabel("Market").ToList();
Or, if you just need the count and not the vertices:
long count = g.V().HasLabel("Market").Count().Next();

My code is C# code. Hopefully the Java code will not be too different.

Venkat Dasari

May 6, 2021, 11:25:35 AM
to Gremlin-users
Querying for the full vertexList would take forever, since the graph is going to be massive.
Even if I add indexing, that would make the load slower (although it is in the plan).

What I did instead was parallelize the load using Spark and, for every 1000 records, call .toList(), get those ids back, save them to a data frame, and move on.
It worked perfectly fine. I was able to load 1.5 million rows in 20 seconds, but I occasionally see weird errors like the one below:

Caused by: java.util.concurrent.CompletionException: java.lang.IllegalStateException: Connection to server is no longer active
at java.util.concurrent.CompletableFuture.reportJoin(CompletableFuture.java:375)
at java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1947)
at org.apache.tinkerpop.gremlin.driver.ResultSet.one(ResultSet.java:119)
at org.apache.tinkerpop.gremlin.driver.ResultSet$1.hasNext(ResultSet.java:171)
at org.apache.tinkerpop.gremlin.driver.ResultSet$1.next(ResultSet.java:178)
at org.apache.tinkerpop.gremlin.driver.ResultSet$1.next(ResultSet.java:165)

Venkat Dasari

May 6, 2021, 5:40:52 PM
to Gremlin-users
So, now I am running on Spark and tried to load about 20 million vertices with 20 partitions, creating a graph traversal for every 10000 rows and committing every 1000 rows. The error I see is below.
Are there any configuration properties that I need to change, or any other suggestions that could help?

java.util.concurrent.CompletionException: java.lang.IllegalStateException: Connection to server is no longer active
at java.util.concurrent.CompletableFuture.reportJoin(CompletableFuture.java:375)
at java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1947)
at org.apache.tinkerpop.gremlin.driver.ResultSet.one(ResultSet.java:119)
at org.apache.tinkerpop.gremlin.driver.ResultSet$1.hasNext(ResultSet.java:171)
at org.apache.tinkerpop.gremlin.driver.ResultSet$1.next(ResultSet.java:178)
at org.apache.tinkerpop.gremlin.driver.ResultSet$1.next(ResultSet.java:165)
at org.apache.tinkerpop.gremlin.driver.remote.DriverRemoteTraversal$TraverserIterator.next(DriverRemoteTraversal.java:146)
at org.apache.tinkerpop.gremlin.driver.remote.DriverRemoteTraversal$TraverserIterator.next(DriverRemoteTraversal.java:131)
at org.apache.tinkerpop.gremlin.driver.remote.DriverRemoteTraversal.nextTraverser(DriverRemoteTraversal.java:112)
at org.apache.tinkerpop.gremlin.process.remote.traversal.step.map.RemoteStep.processNextStart(RemoteStep.java:80)
at org.apache.tinkerpop.gremlin.process.traversal.step.util.AbstractStep.next(AbstractStep.java:129)
at org.apache.tinkerpop.gremlin.process.traversal.step.util.AbstractStep.next(AbstractStep.java:39)
at org.apache.tinkerpop.gremlin.process.traversal.Traversal.fill(Traversal.java:184)
at org.apache.tinkerpop.gremlin.process.traversal.Traversal.toList(Traversal.java:122)
at com.iqvia.janus.JanusRemoteTraversalVertexMapPartitionJava$1$1.hasNext(JanusRemoteTraversalVertexMapPartitionJava.java:185)
at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:43)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
Caused by: java.lang.IllegalStateException: Connection to server is no longer active
at org.apache.tinkerpop.gremlin.driver.Handler$GremlinResponseHandler.lambda$channelInactive$0(Handler.java:213)
at java.util.concurrent.ConcurrentHashMap$ValuesView.forEach(ConcurrentHashMap.java:4707)
at org.apache.tinkerpop.gremlin.driver.Handler$GremlinResponseHandler.channelInactive(Handler.java:213)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)


Stephen Mallette

May 10, 2021, 8:52:26 AM
to gremli...@googlegroups.com
Usually that client-side error means the server closed the channel for some reason. I'd check the server to see if you have some errors there that would help you discern why this is happening. 


Venkat Dasari

May 10, 2021, 9:47:13 AM
to Gremlin-users
I figured out the problem: committing smaller batches resolved the issue.

I am now able to load 20 million vertices in 1.32 seconds using Spark.

Thanks for your help
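
For readers following along, the batching pattern discussed in this thread (write a fixed number of rows, collect the ids that come back, move on to the next batch) can be sketched in plain Java roughly as follows. This is only an illustration under stated assumptions, not the code used in the thread: BatchLoader, loadInBatches and the writeBatch hook are hypothetical names, and the stand-in writer in main() does not talk to a graph at all. In a real load, writeBatch would submit one batch to the server and return the ids of the vertices it created.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class BatchLoader {

    /**
     * Splits rows into batches of at most batchSize, hands each batch to
     * writeBatch, and collects the ids that each call returns.
     * writeBatch is a hypothetical hook (an assumption of this sketch): it
     * stands for whatever submits one batch and returns the created ids.
     */
    public static <R, I> List<I> loadInBatches(List<R> rows, int batchSize,
                                               Function<List<R>, List<I>> writeBatch) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<I> ids = new ArrayList<>(rows.size());
        for (int start = 0; start < rows.size(); start += batchSize) {
            int end = Math.min(start + batchSize, rows.size());
            // subList is only a view, so copy it before handing it to the hook.
            List<R> batch = new ArrayList<>(rows.subList(start, end));
            ids.addAll(writeBatch.apply(batch));
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Integer> mktIds = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            mktIds.add(i);
        }
        // Stand-in writer: pretends each row was stored with the id "v-<mkt_id>".
        List<Integer> batchSizes = new ArrayList<>();
        List<String> ids = loadInBatches(mktIds, 1000, batch -> {
            batchSizes.add(batch.size());
            List<String> created = new ArrayList<>();
            for (Integer mktId : batch) {
                created.add("v-" + mktId);
            }
            return created;
        });
        // prints "batches: [1000, 1000, 500], ids: 2500"
        System.out.println("batches: " + batchSizes + ", ids: " + ids.size());
    }
}
```

Keeping the batch size as a parameter makes it easy to try smaller batches, which is what resolved the connection errors here. The right value depends on the server and has to be found by experiment.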