Hi Florian,
Good questions. I had to confirm my memory, but here's the storage layer behavior I think you'd see for a
few different scenarios. Let's say we have a vertex 'A' with 0.5 million adjacent vertices, all of them reached by outbound edges from 'A'.
g.V(a).outE().count(): the storage adapter will retrieve all 0.5 million adjacent edges at once,
using the default query limit, which is Integer.MAX_VALUE (2,147,483,647).
g.V(a).outE().limit(123): the limit will be set to 246 here, or 2x the limit you
set (see line 437 of BasicVertexCentricQueryBuilder if you're curious why it's 2x)
g.V(a).out().limit(123): like above, the limit will be set to 246 because even though
it's an out() rather than an outE(), the edges already carry the inbound vertex ids, so there is no need
to retrieve the vertices.
g.V(a).out().valueMap().limit(123): limit is set to MAX_VALUE and all the edges
are retrieved even though we only need 123 of them. This occurs because we do
not have a strategy (yet) that will pull that limit leftwards. For the time being, you
could move that limit before the valueMap yourself and get the desired behavior.
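For example (a sketch reusing the same vertex 'A'), reordering it like this should get you back to the 2x limit behavior from the earlier cases, since the limit applies to the vertex-centric query before the valueMap runs:

g.V(a).out().limit(123).valueMap()

That way only ~246 edges come back from storage instead of the full 0.5 million.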
g.V(a).out().range(100000, 110000): limit is set to 2x the upper bound (220,000).
Since there is no vertex-centric index on the edge, the edges aren't sorted in
any way that would let us page through them, so retrieval always has to start at 0 every time.
So, as you can see from these, the storage backend is part of it, but we still
have scenarios where a large quantity of edges must be returned, which will at
some point cause issues on the Janus side, whether due to heap pressure or
the extra processing required to filter after retrieval.
With regard to splitting up the adjacency lists, you can create partitioned vertex
labels [1]. I have not used this feature, so I'm not sure how well it works in practice,
but again, it may only help you at the storage layer if you have queries that you
expect to retrieve a large set of edges for; the bottleneck will just move
to the Janus JVM. If you can use vertex-centric indices (VCIs) and constrain your
queries, you'll have less of an issue. The benefit of a VCI is that it's one of the
few spots where we can push extra predicates down to the storage layer, thereby
pruning unwanted data quickly and greatly reducing the post-processing in the JVM.
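In case it's useful, here's roughly what both of those look like through the management API. Treat it as a sketch: the 'user' label, 'knows' edge label, and 'time' property key are just placeholders for your own schema, and depending on your Janus version the sort order enum may be Order.decr rather than Order.desc.

mgmt = graph.openManagement()
// Partitioned vertex label: adjacency data for these vertices gets spread across partitions at the storage layer
mgmt.makeVertexLabel('user').partition().make()
// Vertex-centric index on 'knows' edges, sorted by the 'time' property
time = mgmt.makePropertyKey('time').dataType(Integer.class).make()
knows = mgmt.makeEdgeLabel('knows').make()
mgmt.buildEdgeIndex(knows, 'knowsByTime', Direction.BOTH, Order.desc, time)
mgmt.commit()

With that index in place, something like g.V(a).outE('knows').has('time', gt(2020)).limit(10) should be able to push the label and the 'time' predicate down to the storage layer, so only the matching slice of the adjacency list is read.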
For the upsert case, Janus should be able to make that check without a full scan as long
as you formulate the query in Gremlin such that the AdjacentVertexFilterOptimizerStrategy
will optimize it. You could also use the low-level Janus query API, but Gremlin is preferable
in my opinion. Here's an example to try out; my test vertex has 510,000 adjacent edges.
The timing from my laptop is included so you can see it's pretty snappy, and this is with the global cache turned off.
gremlin> clockWithResult{g.tx().rollback(); adjVertex = g.V(356568).next(); g.V(356568).V(934072).outE("knows").where(inV().is(eq(adjVertex))).next()}
==>5.61414471
==>e[51qv-k0qg-1lh-7n4o][934072-knows->356568]
You can probably roll that first vertex lookup into the Gremlin query if you want. Also, again, if you don't
specify the edge label, it'll still pull everything back. If you run an explain on that query, you'll see the strategy in action.
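For example, something like this (reusing the ids and adjVertex from the example above) should show AdjacentVertexFilterOptimizerStrategy in the list of applied strategies along with the traversal it rewrites:

g.V(934072).outE("knows").where(inV().is(eq(adjVertex))).explain()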
Well, that turned out longer than expected but I hope there are a few things in there to try out.
--Ted