Guidance on importing 10-20k vertex data into TinkerPop


Gishu Pillai

Jul 4, 2018, 12:35:50 PM
to Gremlin-users
Hi,

Here's what I have tried.

* Wrote a Node.js script to scrape the data from the source system into JSON.
* Wrote another script that reads the JSON dump and translates it to the graph equivalent. This script makes REST calls to create each vertex/edge as needed.

Now, with a load of around 10-20k nodes and edges, I see that the REST call intermittently throws an error:
address:"127.0.0.1"
code:"ECONNRESET"
errno:"ECONNRESET"
message:"connect ECONNRESET 127.0.0.1:8182"
port:8182
stack:"Error: connect ECONNRESET 127.0.0.1:8182\n at Object.exports._errnoException (util.js:1007:11)\n at exports._exceptionWithHostPort (util.js:1030:20)\n at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1080:14)"
syscall:"connect"

  1. Maybe the problem is the chattiness? Are there any server settings that I can tweak? When I try the failing command in Postman, it succeeds with a 200.
  2. The Node.js Gremlin client (websockets) was another option vs. REST; however, it doesn't seem to work out of the box - I get an "Invalid OpProcessor" error for a simple call.
  3. Is there any other option to bulk import vertices and edges as a batch? I can generate CSV files with vertices and edges (Neptune has a published format, but I need something that works against Gremlin) so that I can test multiple graph implementations w.r.t. performance for common queries.
PS: I would like to avoid writing Java for ingestion if possible, and I also don't want to create GraphSON files by hand (although import would then be a breeze).

Thanks,
Gishu

Stephen Mallette

Jul 9, 2018, 7:41:57 AM
to Gremlin-users
I don't know what could cause that error on the client side - is there anything in the server logs as a result of that client-side error?

> Node/JS gremlin client (websockets) was another option vs REST - however it doesn't seem to work out of the box. Invalid OpProcessor error for a simple call. 

You can't just connect with standard websockets, as there is a subprotocol that must be adhered to. I assume that's why you got that exception.

> Is there any other option to bulk import vertices and edges as a batch?

For the size of data you're loading, I don't see the point in taking any more complex bulk-loading paths. To me, a simple Groovy script executed in the Gremlin Console would be best, but loading through Gremlin Server with JS should work fine too.
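For the JS route, here is a minimal sketch using the official gremlin npm package (gremlin-javascript), which speaks the websocket subprotocol for you so you don't hit the OpProcessor issue - the endpoint, labels and property values below are just illustrative:

// Minimal load sketch with the official gremlin-javascript driver (npm: gremlin).
// The driver implements the Gremlin Server subprotocol, so no hand-rolled
// websocket messages are needed. Endpoint and labels are illustrative.
const gremlin = require('gremlin');
const { DriverRemoteConnection } = gremlin.driver;
const { Graph } = gremlin.structure;
const __ = gremlin.process.statics;

const connection = new DriverRemoteConnection('ws://localhost:8182/gremlin');
const g = new Graph().traversal().withRemote(connection);

async function load() {
  // Create two vertices, then an edge between them.
  const a = await g.addV('person').property('name', 'alice').next();
  const b = await g.addV('person').property('name', 'bob').next();
  await g.V(a.value.id).addE('knows').to(__.V(b.value.id)).iterate();
  await connection.close();
}

load().catch(console.error);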




Gishu Pillai

Jul 30, 2018, 4:54:24 AM
to Gremlin-users
I was hoping that the JS client library abstracts away the websocket protocol from me. It just looks like the client is broken, because even the sample in the documentation doesn't work. I was running short of time and didn't easily grok the docs on configuring the logger; there's too much noise on the console with the default configuration.

I did end up getting the ingestion through via Node.js + REST. Here's what helped.

* You can batch commands in a single Gremlin query - just make sure you end each statement with .next(). If not, you would only observe the result of the last statement in the batch (see the sketch after this list).
* Used a queue to limit concurrent REST calls, which also made it easy to retry the occasional failures during bursts. I had to throttle the queue to avoid running out of memory with large graphs.
* Increase the heap available to Gremlin Server - JAVA_OPTIONS="-Xms512m -Xmx4096m".
* Increase the script timeout - as the graph grows larger, operations may take longer. Update server-config.yaml with scriptEvaluationTimeout: 120000.
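Roughly, the batching and throttling looked like the sketch below. The endpoint, batch size and concurrency limit are illustrative values, and the queue is a simplified version of what I actually ran:

// Simplified sketch of the batched REST ingestion (Node.js, no dependencies).
const http = require('http');

// Build one Gremlin script from several statements. Each statement must end
// in .next() (or another terminal step), otherwise only the last statement's
// result comes back. (Values are assumed not to contain quotes; real code
// should escape them.)
function buildBatch(names) {
  return names
    .map(n => `g.addV('person').property('name','${n}').next();`)
    .join('\n');
}

// POST a script to the Gremlin Server HTTP endpoint.
function submit(script) {
  const body = JSON.stringify({ gremlin: script });
  return new Promise((resolve, reject) => {
    const req = http.request(
      { host: '127.0.0.1', port: 8182, method: 'POST', path: '/',
        headers: { 'Content-Type': 'application/json' } },
      res => {
        let data = '';
        res.on('data', c => (data += c));
        res.on('end', () => resolve(data));
      });
    req.on('error', reject);
    req.end(body);
  });
}

// Tiny concurrency limiter: at most MAX_IN_FLIGHT requests at once, which
// is what stopped the intermittent ECONNRESETs during bursts.
const MAX_IN_FLIGHT = 4;
async function drain(batches) {
  const queue = batches.slice();
  const workers = Array.from({ length: MAX_IN_FLIGHT }, async () => {
    while (queue.length) await submit(queue.shift());
  });
  await Promise.all(workers);
}

drain([buildBatch(['alice', 'bob']), buildBatch(['carol', 'dave'])])
  .catch(console.error);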

TinkerGraph has its limitations: at some point it runs out of memory, and you would see GC errors as it approaches the breaking point.

Against AWS Neptune, I was able to use the same script to upload ~3M vertices and ~8M edges with the following modifications:
* Neptune does NOT support Gremlin variables, so I had to rewrite the Gremlin queries to be single-line (see the sketch after this list).
* Neptune does NOT support certain clauses - check out their page on Gremlin deviations; it will save you a load of time.
* Neptune errors out if the value in a has() clause is blank - same for setProperty() too, I think - so I had to add additional checks.
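To illustrate the single-line rewrite and the blank-value guard (the labels and values here are made up):

// Before (uses a Gremlin variable; Neptune rejects this):
//   v1 = g.addV('person').property('name', name).next();
//   g.V(v1).addE('works_at').to(g.V(companyId)).next();
//
// After: one single-line statement per record, built in JS, with a guard so
// has()/property() never receives a blank value.
function personQuery(name, companyId) {
  if (!name || companyId == null) return null; // skip blank values
  return `g.addV('person').property('name','${name}')` +
         `.addE('works_at').to(V(${JSON.stringify(companyId)})).next()`;
}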

All said and done, a bulk API out of the box would have been much better. Neptune has one; I have yet to try it out.


Thanks,
Gishu

Stephen Mallette

Jul 30, 2018, 7:04:01 AM
to Gremlin-users
Well, I'm glad you got things working, but it's really not clear where things went wrong for you except for memory issues. 

> TinkerGraph has its limitations: at some point it runs out of memory, and you would see GC errors as it approaches the breaking point.

Well, it's an in-memory database, so it's not going to release any memory as you load things into it. Also, it shares the same memory space as Gremlin Server, so you need to configure enough memory for both. If you need a more memory-efficient version of TinkerGraph and you have a defined schema, then this might help:


I'm going to increase the default -Xmx in Gremlin Server, I think - its current size seems insufficient for anything but toy graphs.

> All said and done, a bulk API out of the box would have been much better.

Because of the disparity of database types and options, TinkerPop has taken the position of not supplying bulk-loading capabilities, beyond:

1. CloneVertexProgram - an OLAP-based loader over Spark/Hadoop that requires a graph to supply its own InputFormat and OutputFormat.
2. The new g.io().read()/write() steps, which *could* be implemented by different graph databases to offer a Gremlin-level abstraction over their highly efficient loaders. So Neptune could allow for g.io('file.csv').read() and back that with their bulk loading system, or Neo4j could do the same using the Cypher CSV loader. What's nice about that approach is that it lets GLVs (like gremlin-javascript) have a native method for doing bulk data loads (sketched below).

Both of the above approaches will be available in 3.4.0.
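A rough sketch of what that could look like from gremlin-javascript once 3.4.0 is out - the file path is illustrative, and whether the step is backed by a provider's bulk loader is entirely up to that provider:

// Hypothetical use of the 3.4.0 io() step from gremlin-javascript.
// The file must be readable by the server/provider, not by this client.
const gremlin = require('gremlin');
const { DriverRemoteConnection } = gremlin.driver;
const { Graph } = gremlin.structure;

const connection = new DriverRemoteConnection('ws://localhost:8182/gremlin');
const g = new Graph().traversal().withRemote(connection);

g.io('data/vertices.csv').read().iterate()
  .then(() => connection.close())
  .catch(console.error);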






