Import GraphSON directly to a remote server


Kunal Mukherjee

unread,
Mar 8, 2024, 7:20:16 PM
to Gremlin-users
Is there a Java library function to directly import GraphSON into my remote gremlin-server?

I'm sorry if the question is too naive, but I have exhaustively reviewed the documentation and various tutorials looking for such a function, and I didn't find one.

My current function is iteratively adding the vertices and then the edges, along with their properties. I was wondering if there is a native implementation.

```
private static void uploadGraphToServer(Client client, File graphFile) throws IOException {
    // Read the GraphSON file into a local in-memory TinkerGraph first.
    TinkerGraph graph = TinkerGraph.open();
    GraphSONReader reader = GraphSONReader.build().create();
    try (InputStream in = new FileInputStream(graphFile)) {
        reader.readGraph(in, graph);
    }

    // Re-create each vertex on the remote server, copying its properties.
    for (Vertex v : graph.traversal().V().toList()) {
        try {
            String vertexId = escapeStringForGremlin(v.id().toString());
            StringBuilder addVertexQuery = new StringBuilder(
                String.format("g.addV('%s').property('id', '%s')", v.label(), vertexId));
            for (String key : v.keys()) {
                Object value = v.property(key).value();
                addVertexQuery.append(String.format(".property('%s', '%s')",
                    key, escapeStringForGremlin(value.toString())));
            }
            client.submit(addVertexQuery.toString()).all().join();
        } catch (Exception e1) {
            System.err.println("Error submitting vertex to Gremlin server: " + e1.getMessage());
        }
    }

    // Then re-create each edge between the vertices added above.
    for (Edge e : graph.traversal().E().toList()) {
        try {
            StringBuilder addEdgeQuery = new StringBuilder(String.format(
                "g.V().has('id', '%s').addE('%s').to(V().has('id', '%s'))",
                escapeStringForGremlin(e.outVertex().id().toString()),
                e.label(),
                escapeStringForGremlin(e.inVertex().id().toString())));
            for (String key : e.keys()) {
                Object value = e.property(key).value();
                addEdgeQuery.append(String.format(".property('%s', '%s')",
                    key, escapeStringForGremlin(value.toString())));
            }
            // Submit once per edge, after all properties have been appended.
            client.submit(addEdgeQuery.toString()).all().join();
        } catch (Exception e2) {
            System.err.println("Error submitting edge to Gremlin server: " + e2.getMessage());
        }
    }
}
```
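The `escapeStringForGremlin` helper referenced above isn't shown; a minimal sketch of what it might look like (the exact escaping is an assumption, covering the characters that would break a single-quoted Groovy string literal):

```java
public class GremlinEscapes {
    // Hypothetical implementation of the helper referenced above: escape
    // backslashes and single quotes so a value can be safely embedded in a
    // single-quoted Groovy string literal inside a Gremlin script.
    public static String escapeStringForGremlin(String s) {
        return s.replace("\\", "\\\\").replace("'", "\\'");
    }
}
```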

Ken Hu

unread,
Mar 12, 2024, 3:13:40 AM
to Gremlin-users
It depends on the provider you are using; some providers have their own loading mechanisms that allow for this.

If you are asking about a remote TinkerGraph, then I believe the answer is no. The simplest way to do it would be to have the GraphSON file directly accessible from the remote server so you could do something like "g.io(fileToRead).with(IO.reader, IO.graphson).read().iterate()" or the GraphSONReader equivalent. Otherwise, I don't think that there is another native solution.
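If the file is reachable from the server's filesystem, the script Ken mentions could be submitted through the same `Client` used in the earlier code; a sketch (the builder name and the escaping are assumptions, not part of any library API):

```java
public class IoQueryBuilder {
    // Build the Gremlin script for the server to evaluate. Note the path
    // must exist on the *server's* filesystem, not the client machine's.
    public static String buildIoReadQuery(String serverSidePath) {
        String escaped = serverSidePath.replace("\\", "\\\\").replace("'", "\\'");
        return "g.io('" + escaped + "').with(IO.reader, IO.graphson).read().iterate()";
    }
}
```

The resulting string would then be passed to `client.submit(...)` just like the per-vertex queries above.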

Kunal Mukherjee

unread,
Mar 12, 2024, 11:08:38 AM
to Gremlin-users
Thank you Ken. 

I was wondering if you could give me some pointers about whether Gremlin is the right solution for my use case.

I have over 100k instances of heterogeneous, relational, node-and-edge-attributed graphs, each with approximately 5k vertices and 10k edges. Some example queries that I want to run on these instances are:

1. given a traversal routine, what are the vertices contained in the path for all the graph instances?
2. what is the average clustering coefficient for all graphs?
3. what nodes participate in clustering triangles for a specific graph instance?

Another issue I realized is that multiple graphs cannot natively persist in the gremlin-server because of its in-memory database. Currently, I have the vanilla gremlin-server running, and I am using the Java API to interact with it. Therefore, I am considering using JanusGraph with a Cassandra backend so that multiple graphs can persist together.

If you can provide me with any pointers, that would be greatly appreciated.

Josh Perryman

unread,
Mar 13, 2024, 10:15:13 AM
to Gremlin-users
It sounds like this is a one-time or project-scoped data analysis effort, as opposed to an application that will support regular usage/workflows. In that case, I'd recommend investing a little more time in Gremlin Server instead of taking on the operational complexity of JanusGraph, and especially Cassandra.

But there is a question of size here. It doesn't sound like the data is attribute-heavy and likely could fit in memory. If it is too large for the RAM available, I'd recommend first looking at JanusGraph + Berkeley DB instead of Cassandra. It should be a much simpler setup. Avoid using a distributed system as much as you can if you don't need it.

On the data modeling side, it is possible to have all of the >100k instances in a single graph, though they'd be disjoint from one another. You have to model/manage the primary keys and supporting attributes to take this approach, but it shouldn't be that difficult. 

To close, I just want to call out that the term "graph" seems to be overloaded here. TinkerPop / Gremlin Server uses the term graph to represent "a collection of vertices and edges which can be loaded in memory" and that could contain multiple of your "graph instances". Don't let the technology's terminology differences (or similarities!) from your project's terminology be limiting. 

Best,

Josh

Kunal Mukherjee

unread,
Mar 13, 2024, 12:26:26 PM
to Gremlin-users
Thank you, Josh.

In my "graph," each vertex is one of three types and has 10 attributes (7 numerical, 3 string), and each edge is one of five types and has 8 attributes (4 numerical, 4 string). So, one graph instance can be loaded into the available RAM, but I do not think 100k+ instances can be loaded into memory. When I loaded one graph into the vanilla gremlin-server with its default in-memory DB, it took 1.2 GB of RAM. That is why I am saying 100k+ instances will not fit in memory.

Initially, I thought that my backend DB should support that mechanism where my "graph" instance is loaded into memory (hopefully in a serializable fashion), and then the query is executed. Then, the "next" graph is loaded, the query is executed, and finally, the aggregated result is returned. If I follow your suggestion and make one gremlin graph containing all of my graphs, then my initial approach won't work, as I don't think all of my graphs can be loaded into memory at once.

As you suggested, I will look into JanusGraph + Berkeley DB. I just want to confirm: after reading this information, do you still believe this is the way to go?

```
Some example queries that I want to run on my graph database:

1. given a traversal routine, what are the vertices contained in the path for all of my "graph" instances?
2. what is the average clustering coefficient for all of my graphs?
3. what nodes participate in clustering triangles for a specific graph instance?
```

Again, thank you so much for your detailed answer. Any pointers would be really appreciated. 

Josh Perryman

unread,
Mar 13, 2024, 7:15:38 PM
to gremli...@googlegroups.com
Yes, Gremlin Server is notoriously memory-inefficient. As a reference implementation, performance isn't a high priority. I recall that there was a fork at some point which made remarkable improvements to the memory efficiency, but I can't recall its name. 

Given the possible memory constraints, JanusGraph is a good next approach. Data size in memory is not the same as on disk, and I think the odds are good that Berkeley DB would be sufficient, since it can scale to 256 TB. There may be other limitations you'll hit if the data really is nearly 1 GB on disk per instance.

Setting up JanusGraph + Berkeley DB and loading 1,000 instances should give a good benchmark, and indicate if it will work. What's nice is that you can use this little "dev" instance for experimenting and rapid iteration while building your analysis. Then build the larger version. Also, if you need to switch from Berkeley DB to Cassandra, there won't be any changes to the Gremlin code. 

Best, 

Josh


Josh Perryman

unread,
Mar 13, 2024, 7:22:27 PM
to gremli...@googlegroups.com
I found the better memory graph project. They replaced their fork of Gremlin Server with this: https://github.com/ShiftLeftSecurity/overflowdb

-Josh


Kunal Mukherjee

unread,
Mar 19, 2024, 5:01:35 PM
to Gremlin-users
Hey Josh, 

Thank you for such a detailed explanation. I think I am almost there. 

But now, when I am trying to upload multiple graph instances (stored in multiple GraphSON files), I am getting duplicate vertex and/or duplicate edge errors after the first graph is uploaded. I think this is due to the fact that each vertex/edge has an attribute called "id," and whenever the second graph is being uploaded, the "id" attribute is conflicting. 

By any chance, do you have a tutorial or blog post that shows how to load multiple graphs into one JanusGraph instance? The closest I came was this tutorial, https://github.com/JoinTheGraph/jointhegraph.github.io/blob/main/articles/hosting-multiple-graphs-on-janusgraph/index.md, and I think it is correct since it uses "ConfigurationManagementGraph", mentioned in the dynamic graphs documentation, https://docs.janusgraph.org/operations/dynamic-graphs/. But it does not use Berkeley DB as the backend, so I changed that, and I was trying to load the graphs from the application following https://groups.google.com/g/janusgraph-users/c/9s51WJL6dTA. That's where I am getting the duplicated "id" error.

I have gone through ten pages of Google results trying to find any resource that could help in this search, so I am reaching out again.


Josh Perryman

unread,
Mar 20, 2024, 12:29:54 PM
to gremli...@googlegroups.com
I would suggest that this is more of a data modeling issue than a technology one. I saw something similar at my day job a couple of weeks ago. 

tl;dr: design your keys and properties to avoid conflicts and to support having all of the instance graphs in one graph database.

Guessing at your data model, I suspect that each instance is some permutation of highly similar data. So that it looks something like this: 

instance 1:
 - node1: {label: A, id: 1, str: 'def'}
 - node2: {label: B, id: 2, str: 'ghi'}
 - edge1: {label: C, src_id: 1, dst_id: 2, mag: 0.50}

instance 2:
 - node1: {label: A, id: 1, str: 'def'}
 - node2: {label: B, id: 2, str: 'ghi'}
 - edge1: {label: C, src_id: 1, dst_id: 2, mag: 0.25}

I'm sure that your data is a lot more interesting than my trivial example, but it should be sufficient to illustrate the point. 

I'm expecting that the workflow is something like this: 

1. extract a specific instance into memory - this will be most if not all of the instance
2. perform some set of analysis or computations which has specific expectations around the id's in order to compare results
3. save results somewhere for further analysis

I suggest that changing the data model to something like the following will support your workflow and storing multiple instances in one database:

instance 1:
 - vertex1: {label: A, id: '1:1', str: 'def', instance_id: 1, vertex_id: 1}
 - vertex2: {label: B, id: '1:2', str: 'ghi', instance_id: 1, vertex_id: 2}
 - edge1: {label: C, src_id: '1:1', dst_id: '1:2', mag: 0.50}

instance 2:
 - vertex1: {label: A, id: '2:1', str: 'efd', instance_id: 2, vertex_id: 1}
 - vertex2: {label: B, id: '2:2', str: 'igh', instance_id: 2, vertex_id: 2}
 - edge1: {label: C, src_id: '2:1', dst_id: '2:2', mag: 0.25}

The trick here is to make the vertex's primary key, in this case the id property, a composite of the instance designation (instance_id) and the vertex designation (vertex_id). I've done that as a string concatenation, but there are other ways. For example, if each instance has fewer than 1,000 vertices, then you can multiply the instance_id by 1000 and then add the vertex id value. For instance 1, the IDs would be vertex1: 1001 and vertex2: 1002.
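Both key schemes can be sketched as small helpers (class and method names are illustrative; the numeric variant assumes fewer than 1,000 vertices per instance, as described above):

```java
public class CompositeIds {
    // String variant: concatenate the instance and vertex designations,
    // e.g. instance 1, vertex 2 -> "1:2".
    public static String stringId(long instanceId, long vertexId) {
        return instanceId + ":" + vertexId;
    }

    // Numeric variant: assumes vertexId < 1000 within every instance,
    // e.g. instance 1, vertex 2 -> 1002.
    public static long numericId(long instanceId, long vertexId) {
        return instanceId * 1000 + vertexId;
    }
}
```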

I do have a slight preference for integers over strings as id values, for performance reasons. But there are good "easily and quickly deciphered by humans" reasons for using strings, especially when the id performance costs are overshadowed by the cost/time of the other computations.

Note that this changes your analytics computations slightly, since they must use the vertex_id values for comparisons between instances, not the id value itself.

It should also simplify things in that you can get all of the vertices for a single instance with something like: 

g.V().has('instance_id', 1)

I'm unsure of the JanusGraph indexing capabilities, but there should be some index in place on instance_id to avoid scanning all of the data in the graph every time a single instance is queried. 

Best, 

Josh



Tamás Cservenák

unread,
Mar 20, 2024, 12:39:23 PM
to gremli...@googlegroups.com
Just my 5 cents:
there is also Bitsy:

Disclaimer: I'm unsure how up to date it is with the latest Gremlin...

T


Kunal Mukherjee

unread,
Mar 21, 2024, 3:08:47 PM
to Gremlin-users
Thank you so much, Josh and T. I will update this thread as I make progress :)