Speeding up the performance of a query on the traversal API

167 views
Skip to first unread message

Mark Needham

unread,
Oct 5, 2012, 5:20:42 PM10/5/12
to ne...@googlegroups.com
Hey,

I have a graph which is modelled like this:

Research Area 1 -> Research Area B -> Research Area X -> Product A -> Sales 123
Research Area 1 -> Research Area B -> Product B -> Sales 456
Research Area 1 -> Research Area C -> Product Z -> Sales 789

And so on. There are 4 levels of sub research areas. 

What I want to do is get the top level research areas and the 2nd from top research areas and show the sales of products which come underneath them.

So the final result would be something like this:

Research Area 1
-- Research Area B 579
-- Research Area C 789

I've indexed the top level research areas to speed up that lookup but I'm struggling to get the rest of the query to run quickly but it seems like it should be possible so I must be doing something wrong. This is what I have:

    private Map<String, BigDecimal> calculateTotalSales(List<Node> subResearchAreas) {
        final Map<String, BigDecimal> totalSales = new HashMap<String, BigDecimal>();
        org.neo4j.graphdb.traversal.Traverser traverse = Traversal.description()
                .depthFirst()
                .relationships(DynamicRelationshipType.withName("has_child"), Direction.OUTGOING)
                .evaluator(Evaluators.toDepth(4))
                .evaluator(new Evaluator() {
                    public Evaluation evaluate(Path path) {
                        if(path.length() == 0){
                            return Evaluation.EXCLUDE_AND_CONTINUE;
                        }

                        Node subResearchArea = path.endNode();
                        Iterable<Relationship> researchAreaToProducts = subResearchArea.getRelationships(DynamicRelationshipType.withName("primary_research_area"));
                        for (Relationship researchAreaToProduct : researchAreaToProducts) {
                            Node product = researchAreaToProduct.getEndNode();
                            Iterable<Relationship> productsToSales = product.getRelationships(DynamicRelationshipType.withName("sold"));
                            for (Relationship productToSale : productsToSales) {
                                Node sales = productToSale.getEndNode();
                                BigDecimal monthlySales = new BigDecimal((Double) sales.getProperty("sales"));
                                String name = (String) path.startNode().getProperty("display_name");
                                if (totalSales.containsKey(name)) {
                                    totalSales.put(name, totalSales.get(name).add(monthlySales));
                                } else {
                                    totalSales.put(name, monthlySales);
                                }
                            }
                        }

                        return Evaluation.INCLUDE_AND_CONTINUE;
                    }
                })
                .traverse(subResearchAreas.toArray(new Node[]{}));
        for (Path forceEvaluation : traverse) { }
        return totalSales;
    }

This traversal gets run 34 times since there are 34 top level research areas and it gets called inside a loop. Overall this bit of the code takes about 8.5 seconds to run so that's around 0.3 seconds per traversal. 

If there's a better way to solve this problem and I'm going about it totally the wrong way please let me know as well! I've tried changing the code to not use the traversal API and just do the relationship lookups directly on the nodes inside for loops but that actually takes longer than using the traversal API. I also tried pushing the 'primary_research_area' and 'sold' relationships into the 'relationships' method call on the API but that made it slower as well.


Thanks in advance,
Mark

Peter Neubauer

unread,
Oct 5, 2012, 11:26:52 PM10/5/12
to ne...@googlegroups.com
Mark,
how big is the dataset, how many pathes are you touching (just count
the invocations of the Evaluators)? Have the database somewhere? I am
vaguely thinking that from your code, "looking ahead" in your
evaluators might be the thing that is causing the slowdown. Would like
to test it with some data.

Cheers,

/peter neubauer

G: neubauer.peter
S: peter.neubauer
P: +46 704 106975
L: http://www.linkedin.com/in/neubauer
T: @peterneubauer

Neo4j 1.8 GA - http://www.dzone.com/links/neo4j_18_release_fluent_graph_literacy.html
> --
>
>

Mark Needham

unread,
Oct 6, 2012, 1:39:19 AM10/6/12
to ne...@googlegroups.com
Peter,

I put a count into that custom evaluator that I've got and then printed out its value just before the return line of the method and this is what it printed out:

invocations = 113
invocations = 166
invocations = 75
invocations = 270
invocations = 81
invocations = 142
invocations = 146
invocations = 19
invocations = 2
invocations = 2458
invocations = 188
invocations = 681
invocations = 164
invocations = 4
invocations = 3
invocations = 97
invocations = 13
invocations = 4
invocations = 1
invocations = 1
invocations = 34
invocations = 2839
invocations = 1
invocations = 52
invocations = 2
invocations = 1
invocations = 1
invocations = 1
invocations = 1
invocations = 1
invocations = 1
invocations = 1
invocations = 1
invocations = 1

I did try putting the 'primary_research_area' and 'sold' traversals as separate lines on the call to the traversal API i.e.

org.neo4j.graphdb.traversal.Traverser traverse = Traversal.description()
                .depthFirst()
                .relationships(DynamicRelationshipType.withName("has_child"), Direction.OUTGOING)
                .relationships(DynamicRelationshipType.withName("primary_research_area"), Direction.OUTGOING)
                .relationships(DynamicRelationshipType.withName("sold"), Direction.OUTGOING)
....

But that actually seemed to make things slower. 

I'm sure I also tried passing in all the subResearchAreas in one go instead of doing 34 calls to the traversal API but that didn't have any impact on the running time.

Mark

Michael Hunger

unread,
Oct 6, 2012, 2:44:22 AM10/6/12
to ne...@googlegroups.com
What about introducing a top level node and just running one big traversal?
Instead of 34?

Did you try to profile it? Is it on cold caches?

Some ideas

Move the rel types into an enum
Use exclude and continue as your not interested in the nodes anyway?

Imho your traversal tries to aggregate on every level not just on the last? Check e.g for path length

Perhaps just return the paths (research area to product to sale) from the traversal and do the aggregation on the returned paths?

HTH
Michael

Sent from mobile device
--
 
 

Mark Needham

unread,
Oct 6, 2012, 3:32:19 AM10/6/12
to ne...@googlegroups.com
When you say profile it what do you mean? I have jvisualvm open but all I was really looking for there was whether I had the heap space large enough and the JVM wasn't running GC because it couldn't keep the whole thing in memory. I think I've now got the heap size large enough that it's fine. Is there something else/some other tool you'd suggest?

Do I achieve the same outcome as  'move rel types into an enum' by putting the DynamicRelationshipType.withName calls as fields? I tried that and didn't see any difference.

I tried "Use exclude and continue as your not interested in the nodes anyway?" and didn't see any difference.

I didn't quite understand what you meant by "Imho your traversal tries to aggregate on every level not just on the last? Check e.g for path length" 

Are you saying that it runs the code in my custom evaluator even when the last node on the path doesn't have a product hanging of it? Or something else?

> "Perhaps just return the paths (research area to product to sale) from the traversal and do the aggregation on the returned paths?" 

I haven't tried this yet but will try now.

> "What about introducing a top level node and just running one big traversal?
> Instead of 34?" 

Ditto.

Michael Hunger

unread,
Oct 6, 2012, 3:42:38 AM10/6/12
to ne...@googlegroups.com
I mean method time profiling, 

The evaluator is called for every node in the traversal (imho)

But you want to run the code only for the leaf nodes





Sent from mobile device
--
 
 

Mark Needham

unread,
Oct 6, 2012, 4:28:01 AM10/6/12
to ne...@googlegroups.com
I tried this suggestion:

"Perhaps just return the paths (research area to product to sale) from the traversal and do the aggregation on the returned paths?"

This is what code I have:

   DynamicRelationshipType has_child = DynamicRelationshipType.withName("has_child");

    private Map<String, BigDecimal> calculateTotalSales(List<Node> subResearchAreas) {
        final Map<String, BigDecimal> totalSales = new HashMap<String, BigDecimal>();

        org.neo4j.graphdb.traversal.Traverser traverse = Traversal.description()
                .depthFirst()
                .relationships(has_child, Direction.OUTGOING)
                .relationships(primary_research_area, Direction.OUTGOING)
                .relationships(sold, Direction.OUTGOING)
                .evaluator(Evaluators.includeWhereLastRelationshipTypeIs(sold))
                .evaluator(Evaluators.toDepth(6))
                .traverse(subResearchAreas.toArray(new Node[]{}));
        for (Path path : traverse) {
            Node sales = path.endNode();

            BigDecimal monthlySales = new BigDecimal((Double) sales.getProperty("sales"));
            String name = (String) path.startNode().getProperty("display_name");
            if (totalSales.containsKey(name)) {
                totalSales.put(name, totalSales.get(name).add(monthlySales));
            } else {
                totalSales.put(name, monthlySales);
            }
        }

        return totalSales;
    }

Is that what you had in mind? 

It takes 23 seconds to execute that vs about 10 seconds for the original version. I changed the depth of the search to 6 because it now needs to also traverse the 'primary_research_area' and 'sold' edges. 

Michael Hunger

unread,
Oct 6, 2012, 5:02:28 AM10/6/12
to ne...@googlegroups.com
Interesting.
Do you have a dataset/generator?

Sent from mobile device
--
 
 

Mark Needham

unread,
Oct 6, 2012, 8:33:18 AM10/6/12
to ne...@googlegroups.com
I tried profiling what was going on with yourkit:

There are 11,000 calls to org.neo4j.kernel.impl.traversal.TraverserIterator.fetchNextOrNull()  and then 9,000 calls to org.neo4j.kernel.impl.traversal.TraversalBranchImpl.expandRelationships() - I'm not sure exactly what I should be looking for to try and work out what's causing it to run so slowly though?

Mark Needham

unread,
Oct 6, 2012, 8:42:46 AM10/6/12
to ne...@googlegroups.com
Attaching a snapshot of the yourkit trace - that might be more useful than my summary.
neo-snapshot.html

Michael Hunger

unread,
Oct 6, 2012, 9:09:27 PM10/6/12
to ne...@googlegroups.com
Two more things:

# try to use cache_type strong 
db = new GraphDatabaseFactory().newEmbeddedDatabaseBuilder(path).setConfig(GraphDatabaseSettings.cache_type, GraphDatabaseSettings.CacheTypeSetting.strong).newGraphDatabase();
# the bigdecimal addition takes quite a bit of time (perhaps it is possible to store the sales as whole number (money * 100)

The traversal framework seems to add a bit of overhead, esp with multiple given rel-types (I tried a custom-rel-expander but that didn't perform much better (esp. b/c of your variable length path in between)

I tried to connect the top-level research areas to a root node and it worked out quite well, so if you change that part it might help too.



--
 
 
<neo-snapshot.html>

Mattias Persson

unread,
Oct 10, 2012, 8:01:43 AM10/10/12
to ne...@googlegroups.com
I like your second approach better. A couple of things:

1) You could probably have just one evaluator:

    Evaluators.lastRelationshipIs( INCLUDE_AND_PRUNE, EXCLUDE_AND_CONTINUE, sold );

this way you tell the traversal to not even try to expand the product nodes. Will probably spare the traversal lots of getRelationships calls.

2) For every returned path, just key on the Node and when you got it all "convert" the key into the name by doing the getProperty on them.

3) Can a product be sold by many research areas? In that case you'd probably need uniqueness level NODE_PATH

4) Counts could instead be kept as a property per research levels, just the number of directly related products so that the traversal can just traverse through the research areas and accumulate that count.

2012/10/7 Michael Hunger <michael...@neotechnology.com>

--
 
 



--
Mattias Persson, [mat...@neotechnology.com]
Hacker, Neo Technology
www.neotechnology.com
Reply all
Reply to author
Forward
0 new messages