Split GraphSON Export of Large Graph


Matteo Lissandrini

Apr 17, 2020, 6:09:13 AM
to Gremlin-users
Hello,
I would like to export a graph with

```
graph.io(graphson()).writeGraph('/data/exported-graph.json')
```

but the graph is too large to do it in one go.

Since the new GraphSON is written node by node, can I do it in batches?
Something like:

for (i = 0; i <2; i+=batch) {
   g.V().range(i, i+batch).io(graphson()).writeGraph('/data/exported-graph-'+i+'.json');
};


This code does not work, of course, but is something like it possible?


Thanks,
Matteo

Stephen Mallette

Apr 20, 2020, 9:26:23 AM
to gremli...@googlegroups.com
If your graph is large (i.e. tens of millions of edges) then you should probably look into using Hadoop:


but you should be able to do it programmatically in batches as well, I would think. You would probably want to instantiate a GraphSONWriter directly, though, and call this method in your loop:

public void writeVertex(final OutputStream outputStream, final Vertex v, final Direction direction)

or simply pass an Iterator<Vertex> to:

public void writeVertices(final OutputStream outputStream, final Iterator<Vertex> vertexIterator, final Direction direction)

You can make a new instance of GraphSONWriter with the builder pattern:

GraphWriter writer = GraphSONWriter.build().create();
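
As a minimal sketch of the second option, the snippet below passes a traversal (which is itself an `Iterator<Vertex>`) straight to `writeVertices()`. The toy graph from `TinkerFactory.createModern()`, the in-memory buffer, and the choice of `Direction.OUT` are assumptions made only to keep the example self-contained; a real export would use the large graph and a `FileOutputStream`.

```groovy
import org.apache.tinkerpop.gremlin.structure.Direction
import org.apache.tinkerpop.gremlin.structure.io.graphson.GraphSONWriter
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerFactory

// Stand-in for the real graph (assumption: the TinkerPop "modern" toy graph)
graph = TinkerFactory.createModern()

// Writer created through the builder, as suggested above
writer = GraphSONWriter.build().create()

// In-memory buffer instead of a file, purely for illustration
buffer = new ByteArrayOutputStream()

// A traversal is an Iterator<Vertex>, so it can be handed over directly
writer.writeVertices(buffer, graph.traversal().V(), Direction.OUT)

// Split the output into lines to inspect what was written
lines = buffer.toString().readLines()
println "vertex records written: ${lines.size()}"
```

With the default builder settings each vertex should land on its own line, together with its edges in the requested direction, which is what makes it possible to split the output across several files.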

You might also want to take a look at GraphSONMapper which you can use directly for more control or pass a custom instance to GraphSONWriter with extra configuration for use there:


It too can be generated with builder pattern:


HTH,

Stephen

--
You received this message because you are subscribed to the Google Groups "Gremlin-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gremlin-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gremlin-users/9dc9a04c-8d99-4d14-ba96-a69ebd03112a%40googlegroups.com.

Matteo Lissandrini

Apr 20, 2020, 10:24:48 AM
to Gremlin-users

Hi,

I've hacked together this solution using subgraph(), so each batch I write out is identified by a set of edges. This is like partitioning the graph.
Since GraphSON represents one node together with all its adjacent edges, this should not lose any data, right?
I am not sure whether I should use the `both` or just the `out` direction.

for (i = 0; i < size; i++) {
   System.out.println(i);
   g.V().range(i*batch, (i+1)*batch).bothE().subgraph('subGraph').cap('subGraph').next().io(graphson()).writeGraph('/graph.' + i + '.json');
}

Stephen Mallette

Apr 20, 2020, 10:57:03 AM
to gremli...@googlegroups.com
You could use subgraph() but that seems unnecessary compared with the other options I presented. Also, if you have a "large" graph it seems like it will take a long time to do it that way. You're currently iterating V() once per batch with no guarantees that you are getting all the vertices in the graph because you are assuming the same ordering of results returned by the graph. The first change you should make is to iterate V() once and batch that Iterator. Of course, that makes it a bit harder to use subgraph() but makes it even easier to use one of the approaches I presented in my last post where you use GraphSONWriter directly. I'd avoid the subgraph() approach because you're not storing data terribly efficiently either. You could be storing the same vertex many many times across your batches that way.



Matteo Lissandrini

Apr 21, 2020, 3:43:14 AM
to Gremlin-users

Yes, my solution is not terribly efficient, but every other option I tried before failed.

I am trying the GraphWriter approach. Would the following be what you had in mind?

// writer() returns a builder, so create() is needed to get the GraphWriter
final GraphWriter writer = graph.io(IoCore.graphson()).writer().create();


size = 200;
c = g.V().count().next();
batch = (c / size) as int;

ids = g.V().id();
a = [];
i = 0;
for (idx in ids) {
  a.add(idx);
  // flush when the batch is full or the id iterator is exhausted
  if (a.size() > batch || !ids.hasNext()) {
    i++;
    final OutputStream os = new FileOutputStream('/graph.export.' + i + '.json');
    a = a as Set;
    writer.writeVertices(os, g.V(a), Direction.BOTH);
    os.close();
    a = [];
  }
}







Stephen Mallette

Apr 21, 2020, 7:49:38 AM
to gremli...@googlegroups.com
From my previous post:

You can make a new instance of GraphSONWriter with the builder pattern:

GraphWriter writer = GraphSONWriter.build().create();

You might also want to take a look at GraphSONMapper which you can use directly for more control or pass a custom instance to GraphSONWriter with extra configuration for use there:


It too can be generated with builder pattern:


I would prefer the Builder method over using io() because you may find the need to modify settings somewhere, which you can't do as easily via the method you're using. Your approach would work, but it's still very expensive because you do a full graph scan to get the count:

c = g.V().count().next(); 

a full graph scan for the ids:

ids = g.V().id();

and then finally one query per vertex, which means you touch every vertex in your graph three times. Ideally you should be iterating all vertices just once. Here's one way to do it:

vertices = g.V();[]
batchSize = 100
counter = 0
currentBatch = 1
writer = GraphSONWriter.build().create()
os = null
while (vertices.hasNext()) {
  def v = vertices.next()
  def newBatch = counter % batchSize == 0
  if (newBatch) {
    if (null != os) os.close()
    os = new FileOutputStream("graph-${currentBatch}.json")
    currentBatch++
  }
  writer.writeVertex(os, v, OUT)
  os.write("\n".getBytes())
  counter++  
}
os.close()

The above is designed to run in Gremlin Console but I don't think you should have trouble converting it to Java. 


Matteo Lissandrini

Apr 21, 2020, 8:16:53 AM
to Gremlin-users

Thanks a lot!
A few questions.

 1- When you say "modify settings" what is that?

 2- Does GraphSON require that a node export only its outgoing edges, or is it fine with any direction? Surely exporting only OUT (or only IN) is best, to avoid useless redundancy.

Stephen Mallette

Apr 21, 2020, 8:40:00 AM
to gremli...@googlegroups.com
> 1- When you say "modify settings" what is that?

Please see the javadoc links I provided.


> 2- Does GraphSON require that a node export only its outgoing edges, or is it fine with any direction? Surely exporting only OUT (or only IN) is best, to avoid useless redundancy.

If you just want to export a complete graph, then choosing either incoming or outgoing is sufficient. You don't need to do both, as it will write each edge twice. I wouldn't say that GraphSON requires in, out, or both edges in particular; it can contain any of those options and still be valid.
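
The redundancy can be checked by counting how many edge records each direction would produce across all vertices. This is a small sketch on the TinkerPop "modern" toy graph (an assumption, not the graph from this thread), and the variable names are illustrative only.

```groovy
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerFactory

// Toy graph used only for illustration (assumption)
g = TinkerFactory.createModern().traversal()

// Number of distinct edges in the graph
edges = g.E().count().next()

// Edge records that would be written if every vertex carried
// its OUT edges, its IN edges, or BOTH
viaOut  = g.V().outE().count().next()
viaIn   = g.V().inE().count().next()
viaBoth = g.V().bothE().count().next()

// One direction covers every edge exactly once; BOTH covers each edge twice
assert viaOut == edges
assert viaIn == edges
assert viaBoth == 2 * edges
```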


Matteo Lissandrini

Apr 21, 2020, 9:03:16 AM
to Gremlin-users
Thanks again, last question (I hope).

I see in the builder that the default is GraphSONVersion.V2_0.
Is there a reason? I think V3 is the future-proof choice now, right?



On Tuesday, 21 April 2020 14:40:00 UTC+2, Stephen Mallette wrote:
> 1- When you say "modify settings" what is that?

Please see the javadoc links I provided.

Stephen Mallette

Apr 21, 2020, 1:10:19 PM
to gremli...@googlegroups.com
I don't think there's a reason for it to be GraphSON 2.0. It was probably an oversight as we went to 3.4.x, but I don't remember. We should probably change that to GraphSON 3.0 for 3.5.0.


Matteo Lissandrini

Apr 21, 2020, 4:47:16 PM
to Gremlin-users
Ok, thanks.
So is the writer (without mapper) going to export to v2 at the moment?




Matteo Lissandrini

Jun 16, 2020, 1:55:15 PM
to Gremlin-users




Hi again,

there seems to be a problem with this technique, maybe a bug in the reader/writer?

Here I'm using `air-routes-latest` from Practical Gremlin as an example:

```bash
git clone https://github.com/krlawrence/graph.git practical-gremlin
chmod 777 practical-gremlin/sample-data
cd practical-gremlin/sample-data

docker run --rm -it -v ${PWD}:/datasets tinkerpop/gremlin-console
gremlin> Gremlin.version()
==>3.4.6

```


```gremlin

conf = new BaseConfiguration();
conf.setProperty("gremlin.tinkergraph.vertexIdManager","LONG");
conf.setProperty("gremlin.tinkergraph.edgeIdManager","LONG");
conf.setProperty("gremlin.tinkergraph.vertexPropertyIdManager","LONG");
graph = TinkerGraph.open(conf);
graph.io(graphml()).readGraph('/datasets/air-routes-latest.graphml');
g=graph.traversal();

g.E().count()
==>57574

writer = GraphSONWriter.build().mapper(GraphSONMapper.build().version(GraphSONVersion.V3_0).create()).create();
os = new FileOutputStream("/datasets/air-routes-latest.json");
vertices = g.V();
while (vertices.hasNext()) {
  def v = vertices.next();
  writer.writeVertex(os, v, OUT);
  os.write("\n".getBytes());
}
os.close();

g3 = TinkerGraph.open(); g3t = g3.traversal(); 
GraphSONReader.build().mapper(GraphSONMapper.build().create()).create().readGraph(new FileInputStream(new File('/datasets/air-routes-latest.json')), g3t.getGraph());
g3t = g3.traversal();
==>graphtraversalsource[tinkergraph[vertices:3742 edges:0], standard]
```

Am I doing something wrong?


Thanks a lot!

Stephen Mallette

Jun 17, 2020, 7:12:13 AM
to gremli...@googlegroups.com
You wrote the graph using writeVertex() but then read it in with readGraph(). That's a reasonable expectation but in a sense you wrote with a custom format by choosing OUT edges. I believe readGraph() expects IN edges. fwiw writeGraph() writes BOTH.


Matteo Lissandrini

Jun 17, 2020, 8:51:54 AM
to Gremlin-users

I see, so if I use BOTH I can use readGraph then?



Stephen Mallette

Jun 17, 2020, 9:09:58 AM
to gremli...@googlegroups.com
Just IN should be fine. I was just saying that writeGraph() uses BOTH. Technically either should work, but IN is obviously more space-efficient.
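
Putting this advice together, a round trip might look like the sketch below: write every vertex with `Direction.IN`, one per line, then load the result with `readGraph()`. The toy graph, the in-memory streams, and the shared GraphSON 3.0 mapper are assumptions for illustration; the calls themselves are the ones already used in this thread.

```groovy
import org.apache.tinkerpop.gremlin.structure.Direction
import org.apache.tinkerpop.gremlin.structure.io.graphson.GraphSONMapper
import org.apache.tinkerpop.gremlin.structure.io.graphson.GraphSONReader
import org.apache.tinkerpop.gremlin.structure.io.graphson.GraphSONVersion
import org.apache.tinkerpop.gremlin.structure.io.graphson.GraphSONWriter
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerFactory
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph

// Toy source graph (assumption); a real export would use the large graph
source = TinkerFactory.createModern()

// Same explicit GraphSON version on both sides, so no default is relied upon
mapper = GraphSONMapper.build().version(GraphSONVersion.V3_0).create()
writer = GraphSONWriter.build().mapper(mapper).create()
reader = GraphSONReader.build().mapper(mapper).create()

// Write one vertex per line, carrying only its IN edges
buffer = new ByteArrayOutputStream()
vertices = source.traversal().V()
while (vertices.hasNext()) {
  writer.writeVertex(buffer, vertices.next(), Direction.IN)
  buffer.write("\n".getBytes())
}

// Read everything back into an empty graph
copy = TinkerGraph.open()
reader.readGraph(new ByteArrayInputStream(buffer.toByteArray()), copy)

// Vertex and edge counts should match the source graph
assert copy.traversal().V().count().next() == source.traversal().V().count().next()
assert copy.traversal().E().count().next() == source.traversal().E().count().next()
```

When the export is split across several files, each file holds only part of the vertices, so the files would presumably need to be concatenated (or otherwise read together) before a single `readGraph()` call can resolve every edge.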


Matteo Lissandrini

Jun 17, 2020, 1:02:12 PM
to Gremlin-users
Thanks, this seems to work now!



