Message from discussion
Speeding up the performance of a query on the traversal API
Received: by 10.236.157.234 with SMTP id o70mr8394981yhk.4.1349526802171;
Sat, 06 Oct 2012 05:33:22 -0700 (PDT)
X-BeenThere: neo4j@googlegroups.com
Received: by 10.236.198.39 with SMTP id u27ls7238332yhn.6.gmail; Sat, 06 Oct
2012 05:33:20 -0700 (PDT)
Received: by 10.236.110.67 with SMTP id t43mr1397500yhg.6.1349526800355;
Sat, 06 Oct 2012 05:33:20 -0700 (PDT)
Date: Sat, 6 Oct 2012 05:33:18 -0700 (PDT)
From: Mark Needham <m.h.need...@gmail.com>
To: neo4j@googlegroups.com
Message-Id: <701fbc9b-f785-45ce-9628-9dc9e9cef198@googlegroups.com>
In-Reply-To: <8B8EAD72-C302-415A-89B1-4D1195771366@neopersistence.com>
References: <7399290d-0222-49bb-94bd-d79319943072@googlegroups.com> <930B3C1A-F1EA-4541-B761-1C8AC59E6782@neopersistence.com> <e1fe68e7-f8d6-44ad-8062-62f3f03012b8@googlegroups.com> <58F7B0EB-019C-49D2-8B42-5334931F1C51@neopersistence.com> <4ba0bef6-fcf8-4405-8309-0040fa429cad@googlegroups.com>
<8B8EAD72-C302-415A-89B1-4D1195771366@neopersistence.com>
Subject: Re: [Neo4j] Speeding up the performance of a query on the traversal
API
MIME-Version: 1.0
Content-Type: multipart/mixed;
boundary="----=_Part_123_4406288.1349526798702"
------=_Part_123_4406288.1349526798702
Content-Type: multipart/alternative;
boundary="----=_Part_124_4287465.1349526798702"
------=_Part_124_4287465.1349526798702
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
I tried profiling what was going on with yourkit:
There are 11,000 calls to
org.neo4j.kernel.impl.traversal.TraverserIterator.fetchNextOrNull() and
then 9,000 calls to
org.neo4j.kernel.impl.traversal.TraversalBranchImpl.expandRelationships() -
I'm not sure exactly what I should be looking for to try and work out
what's causing it to run so slowly though?
On Saturday, 6 October 2012 10:02:38 UTC+1, Michael Hunger wrote:
>
> Interesting.
> Do you have a dataset/generator?
>
> Sent from mobile device
>
> Am 06.10.2012 um 10:28 schrieb Mark Needham <m.h.n...@gmail.com<javascript:>
> >:
>
> I tried this suggestion:
>
> "Perhaps just return the paths (research area to product to sale) from the
> traversal and do the aggregation on the returned paths?"
>
> This is what code I have:
>
> DynamicRelationshipType has_child =
> DynamicRelationshipType.withName("has_child");
>
> private Map<String, BigDecimal> calculateTotalSales(List<Node>
> subResearchAreas) {
> final Map<String, BigDecimal> totalSales = new HashMap<String,
> BigDecimal>();
>
> org.neo4j.graphdb.traversal.Traverser traverse =
> Traversal.description()
> .depthFirst()
> .relationships(has_child, Direction.OUTGOING)
> .relationships(primary_research_area, Direction.OUTGOING)
> .relationships(sold, Direction.OUTGOING)
>
> .evaluator(Evaluators.includeWhereLastRelationshipTypeIs(sold))
> .evaluator(Evaluators.toDepth(6))
> .traverse(subResearchAreas.toArray(new Node[]{}));
> for (Path path : traverse) {
> Node sales = path.endNode();
>
> BigDecimal monthlySales = new BigDecimal((Double)
> sales.getProperty("sales"));
> String name = (String)
> path.startNode().getProperty("display_name");
> if (totalSales.containsKey(name)) {
> totalSales.put(name,
> totalSales.get(name).add(monthlySales));
> } else {
> totalSales.put(name, monthlySales);
> }
> }
>
> return totalSales;
> }
>
> Is that what you had in mind?
>
> It takes 23 seconds to execute that vs about 10 seconds for the original
> version. I changed the depth of the search to 6 because it now needs to
> also traverse the 'primary_research_area' and 'sold' edges.
>
>
>
> On Saturday, 6 October 2012 08:42:46 UTC+1, Michael Hunger wrote:
>>
>> I mean method time profiling,
>>
>> http://docs.oracle.com/javase/6/docs/technotes/guides/visualvm/profiler.html
>>
>> The evaluator is called for every node in the traversal (imho)
>>
>> But you want to run the code only for the leaf nodes
>>
>>
>>
>>
>>
>> Sent from mobile device
>>
>> Am 06.10.2012 um 09:32 schrieb Mark Needham <m.h.n...@gmail.com>:
>>
>> When you say profile it what do you mean? I have jvisualvm open but all I
>> was really looking for there was whether I had the heap space large enough
>> and the JVM wasn't running GC because it couldn't keep the whole thing in
>> memory. I think I've now got the heap size large enough that it's fine. Is
>> there something else/some other tool you'd suggest?
>>
>> Do I achieve the same outcome as 'move rel types into an enum' by
>> putting the DynamicRelationshipType.withName calls as fields? I tried that
>> and didn't see any difference.
>>
>> I tried "Use exclude and continue as your not interested in the nodes
>> anyway?" and didn't see any difference.
>>
>> I didn't quite understand what you meant by "Imho your traversal tries to
>> aggregate on every level not just on the last? Check e.g for path length"
>>
>> Are you saying that it runs the code in my custom evaluator even when the
>> last node on the path doesn't have a product hanging of it? Or something
>> else?
>>
>> > "Perhaps just return the paths (research area to product to sale) from
>> the traversal and do the aggregation on the returned paths?"
>>
>> I haven't tried this yet but will try now.
>>
>> > "What about introducing a top level node and just running one big
>> traversal?
>> > Instead of 34?"
>>
>> Ditto.
>>
>> On Saturday, 6 October 2012 07:44:37 UTC+1, Michael Hunger wrote:
>>>
>>> What about introducing a top level node and just running one big
>>> traversal?
>>> Instead of 34?
>>>
>>> Did you try to profile it? Is it on cold caches?
>>>
>>> Some ideas
>>>
>>> Move the rel types into an enum
>>> Use exclude and continue as your not interested in the nodes anyway?
>>>
>>> Imho your traversal tries to aggregate on every level not just on the
>>> last? Check e.g for path length
>>>
>>> Perhaps just return the paths (research area to product to sale) from
>>> the traversal and do the aggregation on the returned paths?
>>>
>>> HTH
>>> Michael
>>>
>>> Sent from mobile device
>>>
>>> Am 05.10.2012 um 23:20 schrieb Mark Needham <m.h.n...@gmail.com>:
>>>
>>> Hey,
>>>
>>> I have a graph which is modelled like this:
>>>
>>> Research Area 1 -> Research Area B -> Research Area X -> Product A ->
>>> Sales 123
>>> Research Area 1 -> Research Area B -> Product B -> Sales 456
>>> Research Area 1 -> Research Area C -> Product Z -> Sales 789
>>>
>>> And so on. There are 4 levels of sub research areas.
>>>
>>> What I want to do is get the top level research areas and the 2nd from
>>> top research areas and show the sales of products which come underneath
>>> them.
>>>
>>> So the final result would be something like this:
>>>
>>> Research Area 1
>>> -- Research Area B 579
>>> -- Research Area C 789
>>>
>>> I've indexed the top level research areas to speed up that lookup but
>>> I'm struggling to get the rest of the query to run quickly but it seems
>>> like it should be possible so I must be doing something wrong. This is what
>>> I have:
>>>
>>> private Map<String, BigDecimal> calculateTotalSales(List<Node>
>>> subResearchAreas) {
>>> final Map<String, BigDecimal> totalSales = new HashMap<String,
>>> BigDecimal>();
>>> org.neo4j.graphdb.traversal.Traverser traverse =
>>> Traversal.description()
>>> .depthFirst()
>>>
>>> .relationships(DynamicRelationshipType.withName("has_child"),
>>> Direction.OUTGOING)
>>> .evaluator(Evaluators.toDepth(4))
>>> .evaluator(new Evaluator() {
>>> public Evaluation evaluate(Path path) {
>>> if(path.length() == 0){
>>> return Evaluation.EXCLUDE_AND_CONTINUE;
>>> }
>>>
>>> Node subResearchArea = path.endNode();
>>> Iterable<Relationship> researchAreaToProducts =
>>> subResearchArea.getRelationships(DynamicRelationshipType.withName("primary_research_area"));
>>> for (Relationship researchAreaToProduct :
>>> researchAreaToProducts) {
>>> Node product =
>>> researchAreaToProduct.getEndNode();
>>> Iterable<Relationship> productsToSales =
>>> product.getRelationships(DynamicRelationshipType.withName("sold"));
>>> for (Relationship productToSale :
>>> productsToSales) {
>>> Node sales = productToSale.getEndNode();
>>> BigDecimal monthlySales = new
>>> BigDecimal((Double) sales.getProperty("sales"));
>>> String name = (String)
>>> path.startNode().getProperty("display_name");
>>> if (totalSales.containsKey(name)) {
>>> totalSales.put(name,
>>> totalSales.get(name).add(monthlySales));
>>> } else {
>>> totalSales.put(name, monthlySales);
>>> }
>>> }
>>> }
>>>
>>> return Evaluation.INCLUDE_AND_CONTINUE;
>>> }
>>> })
>>> .traverse(subResearchAreas.toArray(new Node[]{}));
>>> for (Path forceEvaluation : traverse) { }
>>> return totalSales;
>>> }
>>>
>>> This traversal gets run 34 times since there are 34 top level research
>>> areas and it gets called inside a loop. Overall this bit of the code takes
>>> about 8.5 seconds to run so that's around 0.3 seconds per traversal.
>>>
>>> If there's a better way to solve this problem and I'm going about it
>>> totally the wrong way please let me know as well! I've tried changing the
>>> code to not use the traversal API and just do the relationship lookups
>>> directly on the nodes inside for loops but that actually takes longer than
>>> using the traversal API. I also tried pushing the 'primary_research_area'
>>> and 'sold' relationships into the 'relationships' method call on the API
>>> but that made it slower as well.
>>>
>>>
>>> Thanks in advance,
>>> Mark
>>>
>>> --
>>>
>>>
>>>
>>> --
>>
>>
>>
>> --
>
>
>
>
------=_Part_124_4287465.1349526798702
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: quoted-printable
I tried profiling what was going on with yourkit:<div><br></div><div>There =
are 11,000 calls to org.neo4j.kernel.impl.traversal.TraverserIterator.fetch=
NextOrNull() and then 9,000 calls to org.neo4j.kernel.impl.traversal.=
TraversalBranchImpl.expandRelationships() - I'm not sure exactly what I sho=
uld be looking for to try and work out what's causing it to run so slowly t=
hough?</div><div><br><br>On Saturday, 6 October 2012 10:02:38 UTC+1, Michae=
l Hunger wrote:<blockquote class=3D"gmail_quote" style=3D"margin: 0;margin=
-left: 0.8ex;border-left: 1px #ccc solid;padding-left: 1ex;"><div dir=3D"au=
to"><div>Interesting.</div><div>Do you have a dataset/generator?<br><br>Sen=
t from mobile device</div><div><br>Am 06.10.2012 um 10:28 schrieb Mark Need=
ham <<a href=3D"javascript:" target=3D"_blank" gdf-obfuscated-mailto=3D"=
PcuKUlsP_KgJ">m.h.n...@gmail.com</a>>:<br><br></div><blockquote type=3D"=
cite"><div>I tried this suggestion:<div><br></div><div>"Perhaps just return=
the paths (research area to product to sale) from the traversal and do the=
aggregation on the returned paths?"</div><div><br></div><div>This is what =
code I have:</div><div><br></div><div> DynamicRelationshipType =
has_child =3D DynamicRelationshipType.<wbr>withName("has_child");</div><div=
><br></div><div><div> private Map<String, BigDecimal> ca=
lculateTotalSales(List<Node> subResearchAreas) {</div><div> &nb=
sp; final Map<String, BigDecimal> totalSales =3D new Ha=
shMap<String, BigDecimal>();</div><div><br></div><div> &=
nbsp; org.neo4j.graphdb.traversal.<wbr>Traverser traverse =3D Traver=
sal.description()</div><div>  =
; .depthFirst()</div><div> =
.relationships(has_child, Direction.OUTGOING)</div><div> =
; .relationships(primary_<=
wbr>research_area, Direction.OUTGOING)</div><div>  =
; .relationships(sold, Direction.OUTGOING)</div=
><div> .evaluator(Ev=
aluators.<wbr>includeWhereLastRelationshipTy<wbr>peIs(sold))</div><div>&nbs=
p; .evaluator(Evaluators.t=
oDepth(<wbr>6))</div><div> =
.traverse(subResearchAreas.<wbr>toArray(new Node[]{}));</div><div>&n=
bsp; for (Path path : traverse) {</div><div> &nb=
sp; Node sales =3D path.endNode();</div><div><b=
r></div><div> BigDecimal monthlySa=
les =3D new BigDecimal((Double) sales.getProperty("sales"));</div><div>&nbs=
p; String name =3D (String) path.startNo=
de().getProperty("<wbr>display_name");</div><div>  =
; if (totalSales.containsKey(name)) {</div><div> =
totalSales.put(name, totalSales.=
get(name).add(<wbr>monthlySales));</div><div> &n=
bsp; } else {</div><div> &n=
bsp; totalSales.put(name, monthlySales);</div><div> &nb=
sp; }</div><div> }</div><di=
v><br></div><div> return totalSales;</div><div>&=
nbsp; }</div><div><br></div><div>Is that what you had in mind? =
</div><div><br></div><div>It takes 23 seconds to execute that vs about 10 s=
econds for the original version. I changed the depth of the search to 6 bec=
ause it now needs to also traverse the 'primary_research_area' and 'sold' e=
dges. </div><div><br></div><div><br></div><br>On Saturday, 6 October 2=
012 08:42:46 UTC+1, Michael Hunger wrote:<blockquote class=3D"gmail_quote"=
style=3D"margin:0;margin-left:0.8ex;border-left:1px #ccc solid;padding-lef=
t:1ex"><div dir=3D"auto"><div>I mean method time profiling, </div><div=
><span><a href=3D"http://docs.oracle.com/javase/6/docs/technotes/guides/vis=
ualvm/profiler.html" target=3D"_blank">http://docs.oracle.com/javase/<wbr>6=
/docs/technotes/guides/<wbr>visualvm/profiler.html</a></span></div><div><sp=
an><br></span></div><div><span>The evaluator is called for every node in th=
e traversal (imho)</span></div><div><span><br></span></div><div><span>But y=
ou want to run the code only for the leaf nodes</span></div><div><span><br>=
</span></div><div><span><br></span></div><div><br></div><div><br><br>Sent f=
rom mobile device</div><div><br>Am 06.10.2012 um 09:32 schrieb Mark Needham=
<<a>m.h.n...@gmail.com</a>>:<br><br></div><blockquote type=3D"cite">=
<div><div>When you say profile it what do you mean? I have jvisualvm open b=
ut all I was really looking for there was whether I had the heap space larg=
e enough and the JVM wasn't running GC because it couldn't keep the whole t=
hing in memory. I think I've now got the heap size large enough that it's f=
ine. Is there something else/some other tool you'd suggest?</div><div><br><=
/div><div>Do I achieve the same outcome as 'move rel types into an en=
um' by putting the DynamicRelationshipType.<wbr>withName calls as fiel=
ds? I tried that and didn't see any difference.</div><div><br></div><div>I =
tried "Use exclude and continue as your not interested in the nodes anyway?=
" and didn't see any difference.</div><div><br></div><div>I didn't quite un=
derstand what you meant by "Imho your traversal tries to aggregate on every=
level not just on the last? Check e.g for path length" </div><div><br=
></div><div>Are you saying that it runs the code in my custom evaluator eve=
n when the last node on the path doesn't have a product hanging of it? Or s=
omething else?</div><div><br></div><div>> "Perhaps just return the paths=
(research area to product to sale) from the traversal and do the aggregati=
on on the returned paths?" </div><div><br></div><div>I haven't tried t=
his yet but will try now.</div><div><br></div><div>> "What about introdu=
cing a top level node and just running one big traversal?<div>> Instead =
of 34?" </div><div><br></div><div>Ditto.</div><br>On Saturday, 6 Octob=
er 2012 07:44:37 UTC+1, Michael Hunger wrote:<blockquote class=3D"gmail_qu=
ote" style=3D"margin:0;margin-left:0.8ex;border-left:1px #ccc solid;padding=
-left:1ex"><div dir=3D"auto"><div>What about introducing a top level node a=
nd just running one big traversal?</div><div>Instead of 34?</div><div><br><=
/div><div>Did you try to profile it? Is it on cold caches?</div><div><br></=
div><div>Some ideas</div><div><br></div><div>Move the rel types into an enu=
m</div><div>Use exclude and continue as your not interested in the nodes an=
yway?</div><div><br></div><div>Imho your traversal tries to aggregate on ev=
ery level not just on the last? Check e.g for path length</div><div><br></d=
iv><div>Perhaps just return the paths (research area to product to sale) fr=
om the traversal and do the aggregation on the returned paths?</div><div><b=
r></div><div>HTH</div><div>Michael</div><div><br>Sent from mobile device</d=
iv><div><br>Am 05.10.2012 um 23:20 schrieb Mark Needham <<a>m.h.n...@gma=
il.com</a>>:<br><br></div><blockquote type=3D"cite"><div>Hey,<div><br></=
div><div>I have a graph which is modelled like this:</div><div><br></div><d=
iv>Research Area 1 -> Research Area B -> Research Area X -> Produc=
t A -> Sales 123</div><div>Research Area 1 -> Research Area B -> P=
roduct B -> Sales 456<br></div><div>Research Area 1 -> Research Area =
C -> Product Z -> Sales 789<br></div><div><br></div><div>And so on. T=
here are 4 levels of sub research areas. </div><div><br></div><div>Wha=
t I want to do is get the top level research areas and the 2nd from top res=
earch areas and show the sales of products which come underneath them.</div=
><div><br></div><div>So the final result would be something like this:</div=
><div><br></div><div>Research Area 1</div><div>-- Research Area B 579</div>=
<div>-- Research Area C 789</div><div><br></div><div>I've indexed the top l=
evel research areas to speed up that lookup but I'm struggling to get the r=
est of the query to run quickly but it seems like it should be possible so =
I must be doing something wrong. This is what I have:</div><div><br></div><=
div><div> private Map<String, BigDecimal> calculateTotal=
Sales(List<Node> subResearchAreas) {</div><div> &=
nbsp; final Map<String, BigDecimal> totalSales =3D new HashMap<Str=
ing, BigDecimal>();</div><div> org.neo4j.grap=
hdb.traversal.<wbr>Traverser traverse =3D Traversal.description()</div><div=
> .depthFirst()</div=
><div> .relationship=
s(<wbr>DynamicRelationshipType.<wbr>withName("has_child"), Direction.OUTGOI=
NG)</div><div> .eval=
uator(Evaluators.toDepth(<wbr>4))</div><div> &nb=
sp; .evaluator(new Evaluator() {</div><div> &nbs=
p; public Evaluatio=
n evaluate(Path path) {</div><div> =
if(path.length() =3D=3D 0){</div=
><div> =
return Evaluation.EXCLUDE_AND_<wbr>CONTINUE;</=
div><div> &nb=
sp; }</div><div><br></div><div> &n=
bsp; Node subResearchArea =
=3D path.endNode();</div><div> &nb=
sp; Iterable<Relationship> researc=
hAreaToProducts =3D subResearchArea.<wbr>getRelationships(<wbr>DynamicRelat=
ionshipType.<wbr>withName("primary_research_<wbr>area"));</div><div> =
 =
; for (Relationship researchAreaToProduct : researchAreaToProducts) {</div>=
<div> =
Node product =3D researchAreaToProduct.<wbr>get=
EndNode();</div><div>  =
; Iterable<Relationship> pr=
oductsToSales =3D product.getRelationships(<wbr>DynamicRelationshipType.<wb=
r>withName("sold"));</div><div> &n=
bsp; for (Relationship pro=
ductToSale : productsToSales) {</div><div>  =
; &nb=
sp; Node sales =3D productToSale.getEndNode();</div><div> &nbs=
p; &n=
bsp; BigDecimal monthlySales =3D new BigDecimal((Double) sale=
s.getProperty("sales"));</div><div>  =
; Str=
ing name =3D (String) path.startNode().getProperty("<wbr>display_name");</d=
iv><div> &nbs=
p; if (totalSales.containsKey(nam=
e)) {</div><div> &nb=
sp; totalSal=
es.put(name, totalSales.get(name).add(<wbr>monthlySales));</div><div> =
&nbs=
p; } else {</div><div> &nbs=
p; &n=
bsp; totalSales.put(name, monthlySales);</div><div>&nb=
sp; &=
nbsp; }</div><div> &=
nbsp; }</div=
><div> =
}</div><div><br></div><div>  =
; return Evaluation.INCLUD=
E_AND_<wbr>CONTINUE;</div><div> &n=
bsp; }</div><div> &n=
bsp; })</div><div> &=
nbsp; .traverse(subResearchAreas.<wbr>toArray(new Node[]{}));</div><=
div> for (Path forceEvaluation : traverse) { }</=
div><div> return totalSales;</div><div> &n=
bsp; }</div></div><div><br></div><div>This traversal gets run 34 times sinc=
e there are 34 top level research areas and it gets called inside a loop. O=
verall this bit of the code takes about 8.5 seconds to run so that's around=
0.3 seconds per traversal. </div><div><br></div><div>If there's a bet=
ter way to solve this problem and I'm going about it totally the wrong way =
please let me know as well! I've tried changing the code to not use the tra=
versal API and just do the relationship lookups directly on the nodes insid=
e for loops but that actually takes longer than using the traversal API. I =
also tried pushing the 'primary_research_area' and 'sold' relationships int=
o the 'relationships' method call on the API but that made it slower as wel=
l.</div><div><br></div><div><br></div><div>Thanks in advance,</div><div>Mar=
k</div>
<p></p>
-- <br>
<br>
<br>
</div></blockquote></div></blockquote></div>
<p></p>
-- <br>
<br>
<br>
</div></blockquote></div></blockquote></div>
<p></p>
-- <br>
<br>
<br>
</div></blockquote></div></blockquote></div>
------=_Part_124_4287465.1349526798702--
------=_Part_123_4406288.1349526798702--