Optimizing inserts with multiple vertices + edges

531 views
Skip to first unread message

Ryan Beatty

unread,
Jun 13, 2019, 6:41:02 PM6/13/19
to Gremlin-users
I'm having trouble getting a query to be performant. I have the following use case: I have a list of dictionary/map/associative array objects that I would like to add to my graph. The graph is a bipartite graph with one set of vertices labeled `obj`'s representing the dictionary objects themselves as vertices. The other set of vertices are labeled `att`'s and represent unique fields from the dictionary objects. Example: if there is a dictionary with key `a` and value `foo`, there will be a `att` vertex with `type` as `a` and `value` as `foo`. `obj` vertices are connected to `att` vertices via edges (if `obj` vertex `v1` has property `a` with value `foo`, there will be an edge from `v1` to `v2` where `v2` is an `att` vertex with `type` as `a` and `value` as `foo`). The idea here is that you should be able to start at any `obj` vertex and traverse its edges to find out what other `obj` vertices have the same properties. Hopefully I've explained that clearly enough...

My problem is that the latency of inserts is not performant enough for my needs. Using AWS Neptune + gremlin_python, I'm seeing 50 millisecond latency when I'm only trying to insert a single dictionary object using my query. Is there anything I'm doing obviously wrong here? I don't have a great understanding of how gremlin traversals work internally, but my first thought is maybe the repeated calls to `V()` can be avoided? Thanks in advance for any help!!

g.inject(dicts).unfold().as_('dict') \
    .addV('obj') \
    .property('a', select('a')) \
    .property('b', select('b')) \
    .property('c', select('c')) \
    .property('d', select('d')) \
    .property('e', select('e')) \
    .as_('obj_vertex') \
    .coalesce(
        V().hasLabel('att').has('type', 'a') \
           .where(eq('dict')).by('value').by(select('a')),
        addV('att') \
           .property('type', 'a') \
           .property('value', select('dict').select('a'))) \
    .as_('a_vertex') \
    .addE('a').from_('obj_vertex').to('a_vertex') \
    .addE('a').from_('a_vertex').to('obj_vertex') \        
    .coalesce(
        V().hasLabel('att').has('type', 'b') \
           .where(eq('dict')).by('value').by(select('b')),
        addV('att') \
           .property('type', 'b') \
           .property('value', select('dict').select('b'))) \
    .as_('b_vertex') \
    .addE('b').from_('obj_vertex').to('b_vertex') \
    .addE('b').from_('b_vertex').to('obj_vertex') \
    .coalesce(
        V().hasLabel('att').has('type', 'c') \
           .where(eq('dict')).by('value').by(select('c')),
        addV('att') \
           .property('type', 'c') \
           .property('value', select('dict').select('c'))) \
    .as_('c_vertex') \
    .addE('c').from_('obj_vertex').to('c_vertex') \
    .addE('c').from_('c_vertex').to('obj_vertex') \
    .coalesce(
        V().hasLabel('att').has('type', 'd') \
           .where(eq('dict')).by('value').by(select('d')),
        addV('att') \
           .property('type', 'd') \
           .property('value', select('dict').select('d'))) \
    .as_('d_vertex') \
    .addE('d').from_('obj_vertex').to('d_vertex') \
    .addE('d').from_('d_vertex').to('obj_vertex') \
    .coalesce(
        V().hasLabel('att').has('type', 'e') \
           .where(eq('dict')).by('value').by(select('e')),
        addV('att') \
           .property('type', 'e') \
           .property('value', select('dict').select('e'))) \
    .as_('e_vertex') \
    .addE('e').from_('obj_vertex').to('e_vertex') \
    .addE('e').from_('e_vertex').to('obj_vertex').iterate()


Daniel Kuppitz

unread,
Jun 14, 2019, 11:19:48 AM6/14/19
to gremli...@googlegroups.com
I don't know any Neptune implementation details, but I guess this traversal produces a lot of full scans, thus it won't scale. The problem here is, that where(eq()).by() is usually not translated into an index lookup, hence loading with getOrCreate behavior becomes very expensive. You could create your own in-memory index for existing att vertices (basically just a nested map: type->value->vertex). Then, of course, the traversal's memory requirements will increase significantly, but that's still better than having an exponential growth rate for the full scan lookups.

Here's a traversal I came up with:

g.V().hasLabel('att').as('v').
  group('m').
    by('type').
    by(group().
         by('value')).
  cap('m').
  constant(dicts).
  unfold().as('dict').
  addV('obj').as('v').
  sideEffect(select('dict').unfold().as('kv').
             select('v').
               property(select('kv').by(keys), select('kv').by(values))).
  properties().as('p').
  coalesce(select('m').
             select(select('p').by(key)).
             select(select('p').by(value)).unfold(),
           addV('att').
               property('type', select('p').by(key)).
               property('value', select('p').by(value)).
             group('m').
               by('type').
               by(group().
                    by('value'))).
  addE('has_att').
    property('name', select('p').by(key)).
    from('v').iterate()

It's a little bit more dynamic than yours as it doesn't care what's inside each dict.

With dicts set to:

dicts = [["a":"foo1","b":"bar","c":1]
        ,["a":"foo1","b":"baz","c":5]]

...the traversal above produces the following graph:

image.png

HTH.

Cheers,
Daniel


--
You received this message because you are subscribed to the Google Groups "Gremlin-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gremlin-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gremlin-users/f563a2d6-fd61-40f9-90d6-5659ef1f3e8b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Stephen Mallette

unread,
Jun 18, 2019, 6:52:06 AM6/18/19
to gremli...@googlegroups.com
Here's another person with a similar question that just popped up on SO:


I wonder if we really shouldn't think about what aspects of the language need to change to make bulk updates like this easier to do. As long as we equate one transaction to one traversal we're going to keep subjecting folks to Gremlin gymnastics as they try to optimize their mutations. 

Ryan Beatty

unread,
Jun 19, 2019, 4:02:35 PM6/19/19
to Gremlin-users
Daniel and Stephen,

Thanks for the response! I think I was a bit misleading with my question. The primary thing I'm interested in at the moment speeding up the latency to insert one dict object into the graph representation I described. Making bulk inserts faster this way would be an added benefit, but I should've reworded my question better

I think AWS released a bulk loading tool for Neptune, so that was what I was probably going to use to load up all of my initial data. The trick with dynamically adding properties to a vertex is very helpful though. It does look like Neptune might not possibly support all gremlin features though. I can run Daniel's query just fine locally, but get the following when sending to Neptune:

gremlin_python.driver.protocol.GremlinServerError: 599: {"requestId":"ca96155f-deef-45dd-8261-c04688d0394f","code":"InternalFailureException","detailedMessage":"Could not locate method: DefaultGraphTraversal.select([[SelectOneStep(last,p,key)]])

Enter code here...

From playing around with the query, it seems to be something in the coalesce() step
To unsubscribe from this group and stop receiving emails from it, send an email to gremli...@googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "Gremlin-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gremli...@googlegroups.com.

Daniel Kuppitz

unread,
Jun 20, 2019, 1:37:59 PM6/20/19
to gremli...@googlegroups.com
Hi Ryan,

The primary thing I'm interested in at the moment speeding up the latency to insert one dict object into the graph

For a single one, I wouldn't even start to mess with inject() and all other kinds of Gremlin kung-fu steps. Just let your app go through the dict entries and build the traversal. Something like this:

gremlin> m = ['foo':'bar','ping':'pong']
==>foo=bar
==>ping=pong
gremlin> t = g.addV(); m.each {k, v -> t.property(k, v)}; [t.bytecode, t.toString()]
==>[[], [addV(), property(foo, bar), property(ping, pong)]]
==>[AddVertexStartStep({ping=[pong], foo=[bar]})]

As you can see, if you do this, the traversal will be compiled into a single instruction. It can't get faster than this.

... it seems to be something in the coalesce() step

It's actually the nested select(select(...)) that they don't support.

Cheers,
Daniel


To unsubscribe from this group and stop receiving emails from it, send an email to gremlin-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gremlin-users/2d6faaf7-aaf7-4d1a-bb61-409d53e9766e%40googlegroups.com.
Reply all
Reply to author
Forward
0 new messages