Traversing Large (weighted) graphs: performance, data structure, indexes

gg4u

Oct 3, 2014, 5:43:55 PM
to ne...@googlegroups.com
Hi,

Here is my new question; I ran into this issue:

I have a large weighted graph with only one schema index on nodes (Topic):
4M topics and 100M rels.

I wanted to find paths between two given nodes.

I tried queries like the ones below.
Since it is a weighted graph, I compute the weight of a path as the sum of its relationship weights (the weight property is called 'proximity' here).

The problem is that a query of this type, on such a large graph, takes ages.

Note that looking up the nodes through the index or directly by internal id gives the same response times.
Is there any way to bring performance down to a reasonable production time? (lower than 1s ... which means 3 orders of magnitude ... )

MATCH (n) , (m), p = (n)-[*0..2]-(m)
where id(n) = 103105 and id(m) = 1386672
with p, n, m
return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity order by pathProximity DESC;

~1M ms !!! 


same as
MATCH (n:Topic) , (m:Topic), p = (n)-[*0..2]-(m)
where n.name = 'title-1' and m.name = 'title-2'
with p, n, m
return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity order by pathProximity DESC;

~2M ms !!! 

Michael Hunger

Oct 3, 2014, 6:25:48 PM
to ne...@googlegroups.com
How many paths are returned from your query?

MATCH p = (n)-[*0..2]-(m)
where id(n) = 103105 and id(m) = 1386672
return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity 
order by pathProximity DESC;

your index is on :Topic(name) ?

MATCH p = (n:Topic)-[*0..2]-(m:Topic)
where n.name = 'title-1' and m.name = 'title-2'
return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity 
order by pathProximity DESC;

Can you profile your queries?


and enter:

profile MATCH p = (n)-[*0..2]-(m)
where id(n) = 103105 and id(m) = 1386672
return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity 
order by pathProximity DESC;

and

profile MATCH p = (n:Topic)-[*0..2]-(m:Topic)
where n.name = 'title-1' and m.name = 'title-2'
return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity 
order by pathProximity DESC;

and share the results


gg4u

Oct 9, 2014, 3:41:47 AM
to ne...@googlegroups.com
Hi Michael, thank you.
Sure, I am posting my profile results below!


On Saturday, October 4, 2014 at 00:25:48 UTC+2, Michael Hunger wrote:
How many paths are returned from your query?

in this case, 9 paths
 

MATCH p = (n)-[*0..2]-(m)
where id(n) = 103105 and id(m) = 1386672
return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity 
order by pathProximity DESC;

your index is on :Topic(name) ?

neo4j-sh (?)$ schema
==> Indexes
==>   ON :Topic(name) ONLINE
==>   ON :Topic(id)   ONLINE
==>   ON :topic(name) ONLINE
==>   ON :topic(id)   ONLINE
==>
==> No constraints
 

MATCH p = (n:Topic)-[*0..2]-(m:Topic)
where n.name = 'title-1' and m.name = 'title-2'
return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity 
order by pathProximity DESC;

Can you profile your queries?
 


and enter:

profile MATCH p = (n)-[*0..2]-(m)
where id(n) = 103105 and id(m) = 1386672
return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity 
order by pathProximity DESC;


==> [...results...]
==> 9 rows
==> 
==> ColumnFilter
==>   |
==>   +Sort
==>     |
==>     +Extract
==>       |
==>       +ExtractPath
==>         |
==>         +PatternMatcher
==>           |
==>           +NodeByIdOrEmpty(0)
==>             |
==>             +NodeByIdOrEmpty(1)
==> 
==> +--------------------+------+--------+-------------------+-----------------------------------+
==> |           Operator | Rows | DbHits |       Identifiers |                             Other |
==> +--------------------+------+--------+-------------------+-----------------------------------+
==> |       ColumnFilter |    9 |      0 |                   |     keep columns p, pathProximity |
==> |               Sort |    9 |      0 |                   | Cached(pathProximity of type Any) |
==> |            Extract |    9 |     36 |                   |                     pathProximity |
==> |        ExtractPath |    9 |      0 |                 p |                                   |
==> |     PatternMatcher |    9 |      0 | n, m,   UNNAMED13 |                                   |
==> | NodeByIdOrEmpty(0) |    1 |      1 |              m, m |                      {  AUTOINT1} |
==> | NodeByIdOrEmpty(1) |    1 |      1 |              n, n |                      {  AUTOINT0} |
==> +--------------------+------+--------+-------------------+-----------------------------------+
==> 
 

and

profile MATCH p = (n:Topic)-[*0..2]-(m:Topic)
where n.name = 'title-1' and m.name = 'title-2'
return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity 
order by pathProximity DESC;



==> [...results...]
==> 9 rows
==> 
==> ColumnFilter
==>   |
==>   +Sort
==>     |
==>     +Extract
==>       |
==>       +ExtractPath
==>         |
==>         +Filter
==>           |
==>           +TraversalMatcher
==> 
==> +------------------+---------+---------+-------------+-------------------------------------------------------------------+
==> |         Operator |    Rows |  DbHits | Identifiers |                                                             Other |
==> +------------------+---------+---------+-------------+-------------------------------------------------------------------+
==> |     ColumnFilter |       9 |       0 |             |                                     keep columns p, pathProximity |
==> |             Sort |       9 |       0 |             |                                 Cached(pathProximity of type Any) |
==> |          Extract |       9 |      36 |             |                                                     pathProximity |
==> |      ExtractPath |       9 |       0 |           p |                                                                   |
==> |           Filter |       9 | 3032385 |             | (hasLabel(m:Topic(0)) AND Property(m,name(1)) == {  AUTOSTRING1}) |
==> | TraversalMatcher | 1010795 | 1024307 |             |                                                 m,   UNNAMED19, m |
==> +------------------+---------+---------+-------------+-------------------------------------------------------------------+
 

and share the results

(
Also, in case it is useful: when I start the server I get this warning message:
./neo4j console
WARNING: Max 256 open files allowed, minimum of 40 000 recommended. See the Neo4j manual.

I thought it might have something to do with looking up files in the db ... ?
)


Also, a quick question about improving the query:
the results contain duplicates (the same path can occur more than once).
I don't understand why: I thought it was because I used undirected rels, but the results are not consistent: some paths are duplicated, some are not.
Should I use a collection function to avoid duplicates, like a 'union'?
If I understood the manual correctly, chaining query parts (with ... with) should not 'significantly' increase the time to obtain results, because the whole query is 'interpreted' as a single transaction.
Am I right? Or is there maybe a more efficient query than the one I posted, to improve the response time?

thank you

Rodger

Oct 13, 2014, 12:37:50 PM
to ne...@googlegroups.com
Hello,

I've done a lot of RDBMS performance tuning.
Just a few quick thoughts.


Be sure to run the queries in the shell, if you are not already doing so.

How many rows are returned? Just sorting, then returning many rows, 
takes a long time to scroll them to output. 



If you are getting duplicates, it may be the equivalent of a cartesian product, 
one of the worst things that can happen in RDBMS, and also one
of the least known. See my presentation on them here:


Try:

return p, count (*) 
order by count(*)



Without me looking at the raw data, and the query result, you
seem to have many operations going on. So, you have a lot of rows in 
the profile output.  As a general rule, the more rows there are in the 
profile, the slower the response time is. 
ie. the more complex the query, the slower it is. 


If I were looking at this, I would try to isolate which part of 
the query is the slow part.  The Return clause, or the Match clause?


You've already tried the response times with the data.
Try simply:
return count(*) .
How many seconds response time is that, versus the original query? 
What is the resulting profile?





See also the tuning presentations I've done: 
They are quick reads. 

I've put a number of principles in there that you might apply.
ie. Could you create the NEO4J equivalent of a temp table?


Hope this helps.

gg4u

Oct 14, 2014, 3:06:06 AM
to ne...@googlegroups.com
Hi Rodger,

Thank you for your insights!
Please see my comments below:

On Monday, October 13, 2014 at 18:37:50 UTC+2, Rodger wrote:
Hello,

I've done a lot of RDBMS performance tuning.
Just a few quick thoughts.


Be sure to run the queries in the shell, if you are not already doing so.


Yes, they are run in the shell:
 
How many rows are returned? Just sorting, then returning many rows, 
takes a long time to scroll them to output. 



9 rows
In the answer above, I wrote 9 paths

 

If you are getting duplicates, it may be the equivalent of a cartesian product, 
one of the worst things that can happen in RDBMS, and also one
of the least known. See my presentation on them here:

So I had a look at your PDF, page 11,

and I think the idea you are suggesting is to avoid duplicates (you called them 'cartesian products') by enforcing conditions.
However, since this is a graph db and not a relational one, it is not clear to me where this applies, because in the graph db I don't have 'joined' queries between tables.
The conditions I have are, at least in my case, properties (with an index on properties) and undirected rels.
 


Try:

return p, count (*) 
order by count(*)

I ran:

profile MATCH (n:Topic) , (m:Topic), p = (n)-[*0..2]-(m) where n.name = 'Topic1' and m.name = 'Topic2' with p, n, m return p, count(*) order by count(*);

and I got the following (note there are also duplicates among the paths: is it because I have both (a)-[]->(b) and (a)<-[]-(b)?):

==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | p                                                                                                                                                                                                                                           | count(*) |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[71185298]{proximity:68},Node[1401899]{id:21375850,name:"Topic3"},:P_Topic_Link[71185313]{proximity:32},Node[1386672]{id:21245,name:"Topic2"}]                   | 1        |
==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[88675719]{proximity:28},Node[2594397]{id:31760062,name:"Topic4"},:P_Topic_Link[88675745]{proximity:23},Node[1386672]{id:21245,name:"Topic2"}]           | 1        |
==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[30736000]{proximity:32},Node[2515502]{id:3106745,name:"Topic5"},:P_Topic_Link[30735974]{proximity:82},Node[1386672]{id:21245,name:"Topic2"}] | 1        |
==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[68206383]{proximity:72},Node[1202629]{id:19635605,name:"Topic6"},:P_Topic_Link[68206440]{proximity:32},Node[1386672]{id:21245,name:"Topic2"}]              | 1        |
==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[98898173]{proximity:23},Node[3329750]{id:38567205,name:"Topic7"},:P_Topic_Link[98898126]{proximity:124},Node[1386672]{id:21245,name:"Topic2"}]                        | 1        |
==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[58107755]{proximity:55},Node[506613]{id:13841207,name:"Topic8"},:P_Topic_Link[58107766]{proximity:27},Node[1386672]{id:21245,name:"Topic2"}]                             | 1        |
==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[98898173]{proximity:23},Node[3329750]{id:38567205,name:"Topic7"},:P_Topic_Link[1025873]{proximity:124},Node[1386672]{id:21245,name:"Topic2"}]                         | 1        |
==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[5662626]{proximity:47},Node[736816]{id:157427,name:"Topic9"},:P_Topic_Link[5662565]{proximity:138},Node[1386672]{id:21245,name:"Topic2"}]                  | 1        |
==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[5662626]{proximity:47},Node[736816]{id:157427,name:"Topic9"},:P_Topic_Link[1025864]{proximity:138},Node[1386672]{id:21245,name:"Topic2"}]                  | 1        |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> 9 rows
==> 
==> ColumnFilter(0)
==>   |
==>   +Sort
==>     |
==>     +EagerAggregation
==>       |
==>       +ColumnFilter(1)
==>         |
==>         +ExtractPath
==>           |
==>           +Filter
==>             |
==>             +TraversalMatcher
==> 
==> +------------------+---------+---------+-------------+----------------------------------------------------------------------------------+
==> |         Operator |    Rows |  DbHits | Identifiers |                                                                            Other |
==> +------------------+---------+---------+-------------+----------------------------------------------------------------------------------+
==> |  ColumnFilter(0) |       9 |       0 |             |                                                         keep columns p, count(*) |
==> |             Sort |       9 |       0 |             | Cached(  INTERNAL_AGGREGATE931614f3-4def-4fc4-a80b-c6fca3839817 of type Integer) |
==> | EagerAggregation |       9 |       0 |             |                                                                                p |
==> |  ColumnFilter(1) |       9 |       0 |             |                                                             keep columns p, n, m |
==> |      ExtractPath |       9 |       0 |           p |                                                                                  |
==> |           Filter |       9 | 3032385 |             |                (hasLabel(m:Topic(0)) AND Property(m,name(1)) == {  AUTOSTRING1}) |
==> | TraversalMatcher | 1010795 | 1024307 |             |                                                                m,   UNNAMED36, m |
==> +------------------+---------+---------+-------------+----------------------------------------------------------------------------------+
==> 



Without me looking at the raw data, and the query result, you
seem to have many operations going on. So, you have a lot of rows in 
the profile output.

Only 9
 
 As a general rule, the more rows there are in the 
profile, the slower the response time is. 
ie. the more complex the query, the slower it is. 


If I were looking at this, I would try to isolate which part of 
the query is the slow part.  The Return clause, or the Match clause?


You've already tried the response times with the data.
Try to simply: 
return count(*) .

I ran:
MATCH (n:Topic) , (m:Topic), p = (n)-[*0..2]-(m) where n.name = 'Topic1' and m.name = 'Topic2' with p, n, m return p, count(*) order by count(*);

and obtained 9 rows in 182799 ms

I ran:
MATCH (n:Topic), (m:Topic) where n.name = 'Topic1' and m.name = 'Topic2' with n, m return count(*);

and obtained a result in 856 ms


profile MATCH (n:Topic), (m:Topic) where n.name = 'Topic1' and m.name = 'Topic2' with n, m return count(*);

results in:


==> ColumnFilter
==>   |
==>   +EagerAggregation
==>     |
==>     +SchemaIndex(0)
==>       |
==>       +SchemaIndex(1)
==> 
==> +------------------+------+--------+-------------+-------------------------------+
==> |         Operator | Rows | DbHits | Identifiers |                         Other |
==> +------------------+------+--------+-------------+-------------------------------+
==> |     ColumnFilter |    1 |      0 |             |         keep columns count(*) |
==> | EagerAggregation |    1 |      0 |             |                               |
==> |   SchemaIndex(0) |    1 |      2 |        m, m | {  AUTOSTRING1}; :Topic(name) |
==> |   SchemaIndex(1) |    1 |      2 |        n, n | {  AUTOSTRING0}; :Topic(name) |
==> +------------------+------+--------+-------------+-------------------------------+
 
How many seconds response time is that, versus the original query? 
What is the resulting profile?




So it looks like the huge amount of time is actually spent traversing the graph,
while matching a node by its full name string takes a reasonable time (~900 ms).

Any idea for improving the performance of the traversal?

It is a real problem: even fetching the first neighbors of a node hits the same issue, which currently makes this unfeasible for production.
Does anyone have a real case with a graph of similar size and structure, running a similar query?

As an example, this query obtains the first neighbors of node Topic44:

MATCH (n:Topic) , (m), p = (n)-[*0..1]-(m)
where n.name = 'Topic44' 
with p, n, m
return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity order by pathProximity DESC LIMIT 6

It returns
6 rows in ~65000 ms, versus 6 rows in less than a second with a NoSQL store.
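
A possible factor in the first-neighbor query above is the standalone `(m)` pattern: it carries no label or predicate, so the planner may treat it as a second, unconstrained starting point instead of expanding from `n` alone. A minimal sketch that anchors only on the indexed node and expands a single hop follows. It assumes the zero-length path (the node itself, with proximity 0) is not needed, uses `r` as an illustrative name for the relationship, and has not been run against this dataset.

```cypher
// Look up the start node through the :Topic(name) index,
// then expand exactly one hop in either direction.
MATCH (n:Topic)-[r]-(m)
WHERE n.name = 'Topic44'
// A single relationship needs no reduce(): its weight is the path weight.
RETURN m.name AS neighbor, r.proximity AS pathProximity
ORDER BY pathProximity DESC
LIMIT 6;
```

Running it with `profile` in front would show whether the plan starts from a SchemaIndex lookup or still scans.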

Any idea?

Thank you guys for helping!! I hope to find a solution soon.
Thank you, I have seen the presentations;
they are mostly about SQL tuning.
I've just used the Neo4j structure to store a graph with the same label on 4M topics (I MUST keep it with one label), an index on the Topic(name) property, and Cypher to query the db;
this is my data structure.

Michael Hunger

Oct 14, 2014, 4:00:29 AM
to ne...@googlegroups.com
Can you try this:

profile 
MATCH (n:Topic), (m:Topic)
 where n.name = 'Topic1' and m.name = 'Topic2' 
MATCH  p = (n)-[*0..2]-(m)
return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity 
order by pathProximity DESC 
LIMIT 6

gg4u

Oct 14, 2014, 4:59:18 AM
to ne...@googlegroups.com
Yes:

neo4j-sh (?)$ profile  MATCH (n:Topic), (m:Topic) where n.name = 'Topic1' and m.name = 'Topic2'  MATCH  p = (n)-[*0..2]-(m) return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity  order by pathProximity DESC  LIMIT 6;
==> 
[...results...]
==> 6 rows
==> 
==> ColumnFilter
==>   |
==>   +Top
==>     |
==>     +Extract
==>       |
==>       +ExtractPath
==>         |
==>         +PatternMatcher
==>           |
==>           +SchemaIndex(0)
==>             |
==>             +SchemaIndex(1)
==> 
==> +----------------+------+--------+-------------------+-------------------------------------------------+
==> |       Operator | Rows | DbHits |       Identifiers |                                           Other |
==> +----------------+------+--------+-------------------+-------------------------------------------------+
==> |   ColumnFilter |    6 |      0 |                   |                   keep columns p, pathProximity |
==> |            Top |    6 |      0 |                   | {  AUTOINT3}; Cached(pathProximity of type Any) |
==> |        Extract |    9 |     36 |                   |                                   pathProximity |
==> |    ExtractPath |    9 |      0 |                 p |                                                 |
==> | PatternMatcher |    9 |      0 | n, m,   UNNAMED94 |                                                 |
==> | SchemaIndex(0) |    1 |      2 |              m, m |                   {  AUTOSTRING1}; :Topic(name) |
==> | SchemaIndex(1) |    1 |      2 |              n, n |                   {  AUTOSTRING0}; :Topic(name) |
==> +----------------+------+--------+-------------------+-------------------------------------------------+
==> 
neo4j-sh (?)$ 

Rodger

Oct 14, 2014, 11:28:57 AM
to ne...@googlegroups.com
Hello again,

Actually when I mentioned the shell, I was referring to 
the character shell. On Linux, run with 
./neo4j-shell


Although, if you are only getting 9 rows back, this shouldn't make much difference. 


-----

Looking at these two queries:


MATCH (n:Topic) , (m:Topic), p = (n)-[*0..2]-(m) 
where n.name = 'Topic1' and m.name = 'Topic2' 
with p, n, m 
return p, count(*) 
order by count(*);


9 rows in 182799 ms

(3 minutes)


MATCH (n:Topic), (m:Topic) 
where n.name = 'Topic1' and m.name = 'Topic2' 
with n, m 
return count(*);

856ms

(less than a second) 


The critical part that slows things down seems to be: 
p = (n)-[*0..2]-(m) 


BTW, how many rows are returned with this simpler, faster query? 



-----


More elementary analysis I would typically do.
If I didn't already know the answers. 


How many nodes are in the whole dataset? 


MATCH (*)  
return count(*) 



Do the Topics dominate the dataset?
Or, are they just a small percent of the nodes?


Find the distinct set of Labels, including Topic. 

MATCH (x)
RETURN labels(x), count(*)
order by count(*)



Find the distinct set of Topics:

MATCH (n:Topic)  
return n.name , count(*) 
order by count(*)


Do you get many nodes for Topic 1 and 2, but find only 9 paths between them?
(more work)

Or, just a few nodes for topic 1 and 2? (above)
(less work)


-----------


I'm also wondering if you are getting multiple paths between
the same two nodes, thus the duplicates. 

See a post I did last year on this subject:

Counting Many Paths Between Nodes In NEO4J

See the part:
List Every Distinct Path Between Two Specific Nodes:


In which case, you might want to look at shortestPath() or allShortestPaths (). 
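
The output posted earlier in the thread hints at where the duplicates come from: some rows list the same three nodes and differ only in one relationship id (for example `P_Topic_Link[98898126]` versus `P_Topic_Link[1025873]` between Topic7 and Topic2), which is what parallel relationships between the same pair of nodes would produce. If only the node sequence matters, a sketch like the following collapses such rows. Keeping the largest proximity per node sequence is an arbitrary choice made here for illustration, and the aliases `topicNames`, `bestProximity` and `variants` are hypothetical.

```cypher
MATCH (n:Topic), (m:Topic)
WHERE n.name = 'Topic1' AND m.name = 'Topic2'
MATCH p = (n)-[*0..2]-(m)
// Describe each path by its node names and compute its total weight.
WITH extract(x IN nodes(p) | x.name) AS topicNames,
     reduce(total = 0, r IN relationships(p) | total + r.proximity) AS pathProximity
// Grouping by topicNames merges paths that differ only in which
// parallel relationship they use; count(*) shows how many were merged.
RETURN topicNames, max(pathProximity) AS bestProximity, count(*) AS variants
ORDER BY bestProximity DESC;
```

This removes duplicates from the result only; the traversal still visits every parallel relationship, so it is not expected to change the response time.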
 

-----

Just some thoughts.  

Hope it's useful and not too elementary. 

Michael Hunger

Oct 14, 2014, 4:54:43 PM
to ne...@googlegroups.com
How many rows does this return?

MATCH (n:Topic) , (m:Topic), p = (n)-[*0..2]-(m) where n.name = 'Topic1' and m.name = 'Topic2' with p, n, m return p, count(*) order by count(*);

your aggregation was only on the same paths, so you get 9 different paths but you didn't show the counts per path.

and obtain 9 rows in 182799 ms

gg4u

Oct 15, 2014, 5:00:46 AM
to ne...@googlegroups.com
Hi Michael,

your aggregation was only on the same paths, so you get 9 different paths but you didn't show the counts per path. 

That is not clear to me yet; I am going to post the results for each query you suggested trying.

Rodger, to summarize this test:
4M nodes labeled 'Topic'
100M rels (weighted)
Index on Topic(name), a string property on each node
'Topic' dominates the whole dataset, and this will be a subgraph of a larger network (if I can get this to production-level response times, the next step will be a graph of 85M nodes and ~2B rels, with the same type of structure: properties stored on the nodes themselves rather than decoupled into other nodes). So this is a first real-case test to see whether using the Neo4j data structure is feasible versus NoSQL.
And I'd love the answer to be yes :D

Michael, here is another test with other topics (I think not cached):

MATCH (n:Topic) , (m:Topic), p = (n)-[*0..2]-(m) where n.name = 'Topic100' and m.name = 'Topic2' with p, n, m return p, count(*) order by count(*);

results:
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | p                                                                                                                                                                                                                                                 | count(*) |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | [Node[4114904]{id:7955,name:"Topic100"},:P_Topic_Link[10618620]{proximity:90},Node[3528892]{id:411782,name:"Topic101"},:P_Topic_Link[1025954]{proximity:68},Node[1386672]{id:21245,name:"Topic2"}]                                               | 1        |
==> | [Node[4114904]{id:7955,name:"Topic100"},:P_Topic_Link[2424845]{proximity:91},Node[3719110]{id:52502,name:"Topic102"},:P_Topic_Link[1025923]{proximity:85},Node[1386672]{id:21245,name:"Topic2"}]                    | 1        |
==> | [Node[4114904]{id:7955,name:"Topic100"},:P_Topic_Link[100682940]{proximity:19},Node[3461206]{id:39782569,name:"Topic103"},:P_Topic_Link[100682931]{proximity:107},Node[1386672]{id:21245,name:"Topic2"}]            | 1        |
==> | [Node[4114904]{id:7955,name:"Topic100"},:P_Topic_Link[21653222]{proximity:82},Node[706102]{id:1551073,name:"Topic104"},:P_Topic_Link[21653218]{proximity:87},Node[1386672]{id:21245,name:"Topic2"}]                                 | 1        |

(.... results ...)
 
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> 67 rows
==> 3900775 ms

gg4u

Oct 15, 2014, 5:17:43 AM
to ne...@googlegroups.com
Hi Rodger,

Thank you for your blog page: it is useful for the next steps (I see it also covers extracting paths and nodes belonging to the first, second and later generations)!

A quick note on what you suggested:
MATCH (*) Return count(*);
==> SyntaxException: Invalid input '*': expected whitespace, an identifier, node labels, a property map, ')' or org$neo4j$cypher$internal$compiler$v2_1$parser$Patterns$$PatternElement (line 1, column 8)

Huh, maybe it was accepted in an older version?
I can tell you that, roughly, each node has up to ~100 rels to first-generation nodes.
In this data structure, a topic is unique (there is only one 'Topic1').
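
The syntax error is expected: `*` is not a valid node pattern in Cypher, so the parentheses need an identifier (or must be left empty). A sketch of the counts Rodger was presumably after, with `x` and `r` as arbitrary identifiers:

```cypher
// Count every node in the store (this scans all nodes).
MATCH (x)
RETURN count(x);

// Count every relationship once by fixing a direction.
// On a store with ~100M rels this is a full scan and may be slow.
MATCH ()-[r]->()
RETURN count(r);
```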

Michael Hunger

Oct 15, 2014, 8:04:57 AM
to ne...@googlegroups.com
Hi,

from the profiling it seems that Cypher selects the wrong pattern matcher if we separate the node-lookup and path-match.

profile
 MATCH  p = (n:Topic)-[*0..2]-(m:Topic) 
 where n.name = 'Topic1' and m.name = 'Topic2'  
 return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity  
 order by pathProximity DESC  LIMIT 6;


+------------------+------+--------+-------------+-------------------------------------------------------------------+
|         Operator | Rows | DbHits | Identifiers |                                                             Other |
+------------------+------+--------+-------------+-------------------------------------------------------------------+
|     ColumnFilter |    0 |      0 |             |                                     keep columns p, pathProximity |
|              Top |    0 |      0 |             |                   {  AUTOINT3}; Cached(pathProximity of type Any) |
|          Extract |    0 |      0 |             |                                                     pathProximity |
|      ExtractPath |    0 |      0 |           p |                                                                   |
|           Filter |    0 |      0 |             | (hasLabel(m:Topic(0)) AND Property(m,name(1)) == {  AUTOSTRING1}) |
| TraversalMatcher |    0 |      1 |             |                                                 m,   UNNAMED20, m |
+------------------+------+--------+-------------+-------------------------------------------------------------------+

gg4u

Oct 15, 2014, 8:52:09 AM
to ne...@googlegroups.com
Hi Michael,

Sorry, I don't understand what that means.
Can I help you help me sort out the issue somehow? :)

What could I check or correct?
What is a pattern matcher, and can you teach me how to read the profile to reach your conclusion?
What are the possible reasons for selecting the wrong pattern matcher, and how can it be corrected?

thank you

Michael Hunger

Oct 15, 2014, 3:56:01 PM
to ne...@googlegroups.com
Can you just try this please?

MATCH  p = (n:Topic)-[*0..2]-(m:Topic) 
 where n.name = 'Topic1' and m.name = 'Topic2'  
 return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity  
 order by pathProximity DESC  LIMIT 6;


gg4u

Oct 15, 2014, 6:01:33 PM
to ne...@googlegroups.com
Sure, I tried three examples with (n), (n:Topic) and allShortestPaths(), and also profiled them:

1.

MATCH  p = (n:Topic)-[*0..2]-(m:Topic)   where n.name = 'Topic1' and m.name = 'Topic2'    return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity    order by pathProximity DESC  LIMIT 6;

==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[5662626]{proximity:47},Node[736816]{id:157427,name:"Topic3"},:P_Topic_Link[5662565]{proximity:138},Node[1386672]{id:21245,name:"Topic2"}]                  | 185
==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[5662626]{proximity:47},Node[736816]{id:157427,name:"Topic3"},:P_Topic_Link[1025864]{proximity:138},Node[1386672]{id:21245,name:"Topic2"}]                  | 185           |

...


==> 6 rows
==> 162423 ms


profile MATCH  p = (n:Topic)-[*0..2]-(m:Topic)   where n.name = 'Topic1' and m.name = 'Topic2'    return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity    order by pathProximity DESC  LIMIT 6;

==> 6 rows
==> 
==> ColumnFilter
==>   |
==>   +Top
==>     |
==>     +Extract
==>       |
==>       +ExtractPath
==>         |
==>         +Filter
==>           |
==>           +TraversalMatcher
==> 
==> +------------------+---------+---------+-------------+-------------------------------------------------------------------+
==> |         Operator |    Rows |  DbHits | Identifiers |                                                             Other |
==> +------------------+---------+---------+-------------+-------------------------------------------------------------------+
==> |     ColumnFilter |       6 |       0 |             |                                     keep columns p, pathProximity |
==> |              Top |       6 |       0 |             |                   {  AUTOINT3}; Cached(pathProximity of type Any) |
==> |          Extract |       9 |      36 |             |                                                     pathProximity |
==> |      ExtractPath |       9 |       0 |           p |                                                                   |
==> |           Filter |       9 | 3032385 |             | (hasLabel(m:Topic(0)) AND Property(m,name(1)) == {  AUTOSTRING1}) |
==> | TraversalMatcher | 1010795 | 1024307 |             |                                                 m,   UNNAMED20, m |
==> +------------------+---------+---------+-------------+-------------------------------------------------------------------+
==> 


MATCH p = allShortestPaths((n:Topic)-[*..2]-(m:Topic)) where n.name = 'Topic1' and m.name = 'Topic2' with p, n, m return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity order by pathProximity;

==> 9 rows
==> 10111 ms


==> 9 rows
==> 
==> ColumnFilter
==>   |
==>   +Sort
==>     |
==>     +Extract
==>       |
==>       +ShortestPath
==>         |
==>         +SchemaIndex(0)
==>           |
==>           +SchemaIndex(1)
==> 
==> +----------------+------+--------+-------------+-----------------------------------+
==> |       Operator | Rows | DbHits | Identifiers |                             Other |
==> +----------------+------+--------+-------------+-----------------------------------+
==> |   ColumnFilter |    9 |      0 |             |     keep columns p, pathProximity |
==> |           Sort |    9 |      0 |             | Cached(pathProximity of type Any) |
==> |        Extract |    9 |     36 |             |                     pathProximity |
==> |   ShortestPath |    9 |      0 |           p |                                   |
==> | SchemaIndex(0) |    1 |      2 |        m, m |     {  AUTOSTRING1}; :Topic(name) |
==> | SchemaIndex(1) |    1 |      2 |        n, n |     {  AUTOSTRING0}; :Topic(name) |
==> +----------------+------+--------+-------------+-----------------------------------+
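
As a rough way to compare the two profiles above, one can add up the DbHits column of each plan. The sketch below only does arithmetic on the numbers already printed; it assumes DbHits is a fair proxy for work done, and the profiler reports 0 for the ShortestPath operator itself, so the second total may undercount.

```python
# DbHits per operator, copied from the two profiles above (example 1).
traversal_plan = {
    "ColumnFilter": 0, "Top": 0, "Extract": 36, "ExtractPath": 0,
    "Filter": 3032385, "TraversalMatcher": 1024307,
}
shortest_path_plan = {
    "ColumnFilter": 0, "Sort": 0, "Extract": 36, "ShortestPath": 0,
    "SchemaIndex(0)": 2, "SchemaIndex(1)": 2,
}

traversal_total = sum(traversal_plan.values())
shortest_total = sum(shortest_path_plan.values())

# Rows produced by TraversalMatcher, i.e. candidate paths handed to Filter.
traversal_rows = 1010795
hits_per_row, remainder = divmod(traversal_plan["Filter"], traversal_rows)

print(traversal_total, shortest_total, hits_per_row, remainder)
```

The Filter hits divide evenly by the TraversalMatcher rows, which is consistent with the label and name check being evaluated for every candidate path rather than the end node being looked up once through the index.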


2. 

MATCH  p = (n:Topic)-[*0..2]-(m:Topic)   where n.name = 'Topic44' and m.name = 'Topic2'    return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity    order by pathProximity DESC  LIMIT 6;

==> 6 rows
==> 906108 ms



==> 6 rows
==> 
==> ColumnFilter
==>   |
==>   +Top
==>     |
==>     +Extract
==>       |
==>       +ExtractPath
==>         |
==>         +Filter
==>           |
==>           +TraversalMatcher
==> 
==> +------------------+---------+---------+-------------+-------------------------------------------------------------------+
==> |         Operator |    Rows |  DbHits | Identifiers |                                                             Other |
==> +------------------+---------+---------+-------------+-------------------------------------------------------------------+
==> |     ColumnFilter |       6 |       0 |             |                                     keep columns p, pathProximity |
==> |              Top |       6 |       0 |             |                   {  AUTOINT3}; Cached(pathProximity of type Any) |
==> |          Extract |      67 |     268 |             |                                                     pathProximity |
==> |      ExtractPath |      67 |       0 |           p |                                                                   |
==> |           Filter |      67 | 3246003 |             | (hasLabel(m:Topic(0)) AND Property(m,name(1)) == {  AUTOSTRING1}) |
==> | TraversalMatcher | 1082001 | 1097166 |             |                                                 m,   UNNAMED20, m |
==> +------------------+---------+---------+-------------+-------------------------------------------------------------------+



MATCH p = allShortestPaths((n:Topic)-[*..2]-(m:Topic)) where n.name = 'Topic44' and m.name = 'Topic2' with p, n, m return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity order by pathProximity;


Magically, and for the first time:
146 ms


so:

profile MATCH p = allShortestPaths((n:Topic)-[*..2]-(m:Topic)) where n.name = 'Topic44' and m.name = 'Topic2' with p, n, m return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity order by pathProximity;


==> 67 rows
==> 
==> ColumnFilter
==>   |
==>   +Sort
==>     |
==>     +Extract
==>       |
==>       +ShortestPath
==>         |
==>         +SchemaIndex(0)
==>           |
==>           +SchemaIndex(1)
==> 
==> +----------------+------+--------+-------------+-----------------------------------+
==> |       Operator | Rows | DbHits | Identifiers |                             Other |
==> +----------------+------+--------+-------------+-----------------------------------+
==> |   ColumnFilter |   67 |      0 |             |     keep columns p, pathProximity |
==> |           Sort |   67 |      0 |             | Cached(pathProximity of type Any) |
==> |        Extract |   67 |    268 |             |                     pathProximity |
==> |   ShortestPath |   67 |      0 |           p |                                   |
==> | SchemaIndex(0) |    1 |      2 |        m, m |     {  AUTOSTRING1}; :Topic(name) |
==> | SchemaIndex(1) |    1 |      2 |        n, n |     {  AUTOSTRING0}; :Topic(name) |
==> +----------------+------+--------+-------------+-----------------------------------+
==> 




3. 
So I tried:

MATCH p = allShortestPaths((n:Topic)-[*..2]-(m:Topic)) where n.name = 'Topic66' and m.name = 'Topic111' with p, n, m return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity order by pathProximity;

2 rows
34337 ms

and 

MATCH p = (n:Topic)-[*..2]-(m:Topic) where n.name = 'Topic66' and m.name = 'Topic111' with p, n, m return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity order by pathProximity;

2411 rows
3228423 ms !!

Please also note that each row has a duplicate.
(In my structure I do have (a:Topic)-[]->(b:Topic) and (b:Topic)-[]->(a:Topic), but I thought that (a:Topic)-[]-(b:Topic) would give unique results since paths are the same ... huh ?)
...
==> | [Node[1103460]{id:18831,name:"Topic66"},:P_Topic_Link[68136903]{proximity:189},Node[1198508]{id:19594028,name:"Topic113"},:P_Topic_Link[68136874]{proximity:368},Node[1603710]{id:22939,name:"Topic111"}]                                                                            | 557           |
==> | [Node[1103460]{id:18831,name:"Topic66"},:P_Topic_Link[68136903]{proximity:189},Node[1198508]{id:19594028,name:"Topic113"},:P_Topic_Link[1113182]{proximity:368},Node[1603710]{id:22939,name:"Topic111"}]                                                                             | 557           |




So it seems that **allShortestPaths()** is faster and gives **almost** the results I want, but **only** if earlier searches have already been made (cached). Could that be true?
It would partially make sense: I expect graph algorithms to be faster than retrieving general paths, but retrieving 67 rows of general paths cannot be that slow... (906108 ms against 146 ms, more than three orders of magnitude slower than allShortestPaths()??)

Would it make sense if I posted a Python script that generates a random structure similar to mine, along with the configuration files I used for my server and the batch importer, and the header I used for loading the CSV with the batch importer? You could then tell me whether the response time is below 1 s (production time).
You could run the same tests and post the results and a step-by-step guide?

gg4u

Oct 15, 2014, 6:12:35 PM
to ne...@googlegroups.com
Profile for the last query:
profile MATCH p = (n:Topic)-[*..2]-(m:Topic) where n.name = 'Topic66' and m.name = 'Topic111' with p, n, m return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity order by pathProximity;

==> 2411 rows
==> 
==> ColumnFilter(0)
==>   |
==>   +Sort
==>     |
==>     +Extract
==>       |
==>       +ColumnFilter(1)
==>         |
==>         +ExtractPath
==>           |
==>           +Filter
==>             |
==>             +TraversalMatcher
==> 
==> +------------------+---------+---------+-------------+-------------------------------------------------------------------+
==> |         Operator |    Rows |  DbHits | Identifiers |                                                             Other |
==> +------------------+---------+---------+-------------+-------------------------------------------------------------------+
==> |  ColumnFilter(0) |    2411 |       0 |             |                                     keep columns p, pathProximity |
==> |             Sort |    2411 |       0 |             |                                 Cached(pathProximity of type Any) |
==> |          Extract |    2411 |    9640 |             |                                                     pathProximity |
==> |  ColumnFilter(1) |    2411 |       0 |             |                                              keep columns p, n, m |
==> |      ExtractPath |    2411 |       0 |           p |                                                                   |
==> |           Filter |    2411 | 4910094 |             | (hasLabel(m:Topic(0)) AND Property(m,name(1)) == {  AUTOSTRING1}) |
==> | TraversalMatcher | 1636698 | 1681810 |             |                                                 m,   UNNAMED19, m |
==> +------------------+---------+---------+-------------+-------------------------------------------------------------------+

Mark Findlater

Oct 16, 2014, 5:23:03 AM
to ne...@googlegroups.com
There is a lot of history here that I cannot follow, and Michael is clearly thinking about something, which means that the solution is not simple. But your profile (which reads bottom-up) does not start well and isn't using your indexes. Unless I have missed something about why you cannot do this, your very last query should perform (much) better if it begins with an index hit rather than a TraversalMatcher.

MATCH (n:Topic{name:"Topic66"}), (m:Topic{name:"Topic111"})
WITH n, m 
MATCH p = (n)-[*..2]-(m)
WITH p, n, m 
RETURN p, reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity order by pathProximity;

Also, your assertion "would give unique results since paths are the same ... huh ?" is incorrect, because the paths are not the same: the nodes in the paths may be, but the relationships (the traversal routes) are not. Is there any reason for you to duplicate all of your relationships, given that you can navigate them in either direction anyway?
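
To make the point about duplicates concrete, here is a small Python sketch on a hypothetical toy graph (the relationship ids 1 to 3 are made up; only the topic names come from the thread). It treats a path as a sequence of relationships, so two links stored between the same pair of nodes yield two distinct paths over identical nodes when direction is ignored.

```python
# Hypothetical toy graph: relationship id -> (start node, end node).
# The link between Topic113 and Topic111 is stored once per direction,
# as in the thread's data.
relationships = {
    1: ("Topic66", "Topic113"),
    2: ("Topic113", "Topic111"),
    3: ("Topic111", "Topic113"),  # the reverse link, stored separately
}

def undirected_paths(start, end, max_hops):
    """Enumerate paths ignoring direction; a path is a tuple of relationship ids."""
    found = []

    def walk(node, used):
        if node == end and used:
            found.append(tuple(used))
        if len(used) == max_hops:
            return
        for rel_id, (a, b) in relationships.items():
            if rel_id in used:
                continue  # each relationship is used at most once per path
            if node == a:
                walk(b, used + [rel_id])
            elif node == b:
                walk(a, used + [rel_id])

    walk(start, [])
    return found

paths = undirected_paths("Topic66", "Topic111", 2)
print(paths)
```

Both paths visit Topic66, Topic113 and Topic111, but they differ in the second relationship, which matches the pairs of rows in the output above where only the second relationship id changes.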

Apologies if I have gone way off piste,

M
...

gg4u

Nov 7, 2014, 10:36:15 AM
to ne...@googlegroups.com
Thank you Mark:

You're right, this thread has become hard to follow, and the issue is still open.
I will re-import everything, since I haven't found a solution:
maybe I am doing something wrong when importing and creating the indexes?

I have also put together a Python script that generates a random weighted graph with textual labels, as a test.
I'd love to hear what other people can find out... :))

Here's my contribution:

...

Michael Hunger

Nov 7, 2014, 9:06:15 PM
to ne...@googlegroups.com
You didn't mention before that you used the "superfast" batch-inserter. I think that version is still a work in progress, and I'm not sure it creates a normal store.


I used my own batch-inserter  github.com/jexp/batch-import
with these batch.properties:

dump_configuration=false
cache_type=none
use_memory_mapped_buffers=true
neostore.propertystore.db.index.keys.mapped_memory=5M
neostore.propertystore.db.index.mapped_memory=5M
neostore.nodestore.db.mapped_memory=200M
neostore.relationshipstore.db.mapped_memory=2G
neostore.propertystore.db.mapped_memory=100M
neostore.relationshipgroupstore.db.mapped_memory=10M
neostore.propertystore.db.strings.mapped_memory=100M
batch_array_separator=,
batch_import.csv.quotes=false
#batch_import.csv.delim=,
#batch_import.node_index.source_id=exact
#batch_import.node_index.topic=fulltext


Importing 111111001 Relationships took 478 seconds 

Total import time: 520 seconds  

Then running your queries, actually without the second limit:

| [Node[103105]{name:"1963-64 Austrian football championship"},:My_Proximity[102026221]{proximity:13},Node[2513520]{name:"Cowley plant"},:My_Proximity[108842982]{proximity:28},Node[5523128]{name:"Kinzirô Miyake"}]                                                                                                                   | 41            |
| [Node[103105]{name:"1963-64 Austrian football championship"},:My_Proximity[102026221]{proximity:13},Node[2513520]{name:"Cowley plant"},:My_Proximity[25343932]{proximity:27},Node[9598046]{name:"Suzdal Urban Settlement"}]                                                                                                           | 40            |
| [Node[103105]{name:"1963-64 Austrian football championship"},:My_Proximity[102026221]{proximity:13},Node[2513520]{name:"Cowley plant"},:My_Proximity[108581215]{proximity:13},Node[2627627]{name:"DSFA"}]                                                                                                                             | 26            |
| [Node[103105]{name:"1963-64 Austrian football championship"},:My_Proximity[102026221]{proximity:13},Node[2513520]{name:"Cowley plant"}]                                                                                                                                                                                               | 13            |
| [Node[103105]{name:"1963-64 Austrian football championship"}]                                                                                                                                                                                                                                                                         | 0             |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2241 rows
129 ms

Add this to the config of the server: 
4G heap


# Default values for the low-level graph engine

neostore.nodestore.db.mapped_memory=250M

neostore.relationshipstore.db.mapped_memory=500M

neostore.propertystore.db.mapped_memory=250M

neostore.propertystore.db.strings.mapped_memory=250M



can you try this?

also add this to your neo4j.properties
neostore.relationshipgroupstore.db.mapped_memory=10M





On Fri, Oct 3, 2014 at 11:43 PM, gg4u <luigi...@gmail.com> wrote:

gg4u

Nov 9, 2014, 1:03:17 PM
to ne...@googlegroups.com
Hi Michael, 

My conclusion is that I used the super-fast importer and created a faulty data structure. I see the super-fast batch importer has now been removed from git?

Still, I need your help to clear up my confusion about the batch-importer documentation on indexing node properties, in line with the docs for Neo4j 2.1.5.

After the batch import, I had no schema.
I am also confused about the difference between the schema command, which shows indexes on NodeLabel(NodeProperty), and the index --indexes command, which shows indexes under the names I gave them.

I don't know exactly how to use them...


I will describe, step by step, which tools and configurations I used.
I hope this write-up will be useful to other people too.

Please see the questions below; maybe the answers will help pin down where my hiccup is in setting up an import with properly configured indexes.



Hypothesis
I think the issue is in the generation of schema and indexes with batch-import (https://github.com/jexp/batch-import/tree/20).

Using Neo4j 2.1.5, how do I generate a schema?

The "Schema Indexes" section of the documentation at https://github.com/jexp/batch-import/tree/20
says I should pre-construct the db and create the schema upfront if I want to use schema indexes.

I don't know how to generate a schema upfront, before creating any nodes, and then import my nodes.
Maybe simply by running, upfront,
create index on :Topic(name)
?
(For those who have only just joined the conversation: 'Topic' and 'name' are the label and the node property; see 'My indexes' below.)


Tools
I left the super-fast batch importer alone.

Instead, I used the batch-import version for Neo4j 2.0 and later.

I downloaded the batch importer files from git and used the lib folder from the zip download.


My indexes
In headers for nodes.csv and rels.csv

Nodes.csv

id:int:source_id     NodeType:label       name:string:topic

As the label in the NodeType column I set "Topic" (capital 'T'), e.g.:

3998932     Topic       Neo4J Traversing test


Rels.csv

id:int:source_id      id:int:source_id     type    proximity:int
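
For anyone who wants to reproduce the setup with synthetic data, here is a minimal Python sketch that produces files with the two headers above. It is not the script mentioned earlier in the thread. The tab delimiter, the generated topic names, the proximity range 1 to 500, and the reuse of the P_Topic_Link type seen in the query output are all assumptions made for illustration.

```python
import csv
import io
import random

def generate_graph(num_nodes, num_rels, seed=42):
    """Build random nodes and weighted relationships (hypothetical test data)."""
    rng = random.Random(seed)
    nodes = [(i, "Topic", f"Topic{i}") for i in range(1, num_nodes + 1)]
    rels = []
    for _ in range(num_rels):
        # sample() picks two different node ids, so there are no self-links
        start, end = rng.sample(range(1, num_nodes + 1), 2)
        rels.append((start, end, "P_Topic_Link", rng.randint(1, 500)))
    return nodes, rels

def to_tsv(header, rows):
    """Serialize a header and rows as tab-separated text (delimiter is assumed)."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()

nodes, rels = generate_graph(num_nodes=5, num_rels=8)
nodes_tsv = to_tsv(["id:int:source_id", "NodeType:label", "name:string:topic"], nodes)
rels_tsv = to_tsv(["id:int:source_id", "id:int:source_id", "type", "proximity:int"], rels)

print(nodes_tsv)
print(rels_tsv)
```

Writing nodes_tsv and rels_tsv to nodes.csv and rels.csv gives inputs of the same shape as the ones described here; scale num_nodes and num_rels up to approximate the real graph.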




Batch.properties
dump_configuration=false
cache_type=none
use_memory_mapped_buffers=true
neostore.propertystore.db.index.keys.mapped_memory=5M
neostore.propertystore.db.index.mapped_memory=5M
neostore.nodestore.db.mapped_memory=200M
neostore.relationshipstore.db.mapped_memory=2G
neostore.propertystore.db.mapped_memory=200M
neostore.relationshipgroupstore.db.mapped_memory=10M
neostore.propertystore.db.strings.mapped_memory=200M
batch_array_separator=,
batch_import.csv.quotes=false
#batch_import.csv.delim=,
batch_import.node_index.source_id=exact
batch_import.node_index.topic=fulltext

Neo4j.properties

It is NOT clear to me what auto-indexing is for:
does it index node properties? Full-text, int, and arrays as well?

Could you please clarify, according to the docs in git and the Neo4j 2.1.5 docs:

Are the headers of my CSV columns set properly for legacy indexes or for 2.0+ indexes?
Should I use auto-indexing to index the full-text 'topic' and 'source_id'?
If so, could you please tell me which syntax to use in the headers and in batch.properties?

In neo4j.properties, what is a key? The property name, or the name of the index? In my example:

# The node property keys to be auto-indexed, if enabled
node_keys_indexable= name, id 
or
node_keys_indexable= source_id, topic 
?



Neo4j-wrapper.conf
Setting memory heap to 4G (min):

wrapper.java.initmemory=4096
#wrapper.java.maxmemory=512




Results

After the import, I found that no schema had been created:

neo4j-sh (?)$ schema
==> No indexes
==> 
==> No constraints


while:

neo4j-sh (?)$ index --indexes
==> Node indexes:
==>   topic
==>   source_id



Are these indexes meant to be used only with a schema?
I suppose so...


Query: 
MATCH (n {name : "Topic1"}) , (m {name : "Topic2"}),
p = allShortestPaths((n)-[*..2]-(m))
with p, n, m
return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity order by pathProximity DESC

12 rows
~11K ms

Query:
MATCH p = (n:Topic)-[*0..2]-(m:Topic) where n.name = 'My topic name1' and m.name= 'My topic name2'
return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity order by pathProximity DESC  LIMIT 6;

Had to abort: it took too long (several minutes).


While using internal ID:

MATCH p = allShortestPaths((n)-[*..4]-(m))
where ID(n) = 103105 and ID(m) = 2513520
with p, n, m
return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity order by pathProximity DESC

229 rows
~1.5K ms

and 

229 rows in 134ms if cached



So the hiccup is in matching the property:

Query:
MATCH (a) where a.name = 'My topic name' return ID(a)

1 row in 5K ms
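
A plausible reading of this number: the query has no label, and at this point the schema command reported no indexes, so the only way to answer it is to check the name property of every node. The Python sketch below is a hypothetical in-memory analogy, not Neo4j code, contrasting such a scan with a lookup in a prebuilt index.

```python
# Hypothetical in-memory stand-in for the node store: internal id -> properties.
nodes = {i: {"name": f"Topic{i}"} for i in range(100_000)}

def find_by_scan(name):
    """No usable index: every node's property has to be checked in turn."""
    checked = 0
    for node_id, props in nodes.items():
        checked += 1
        if props["name"] == name:
            return node_id, checked
    return None, checked

# Building this dict plays the role of creating an index on the name property.
name_index = {props["name"]: node_id for node_id, props in nodes.items()}

def find_by_index(name):
    """With an index: a single lookup, independent of the number of nodes."""
    return name_index.get(name)

scan_id, checked = find_by_scan("Topic99999")
indexed_id = find_by_index("Topic99999")

print(scan_id, checked, indexed_id)
```

Both calls return the same node, but the scan touches every node in the worst case, while the index lookup does not grow with the size of the store.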


while:

MATCH p = allShortestPaths((n)-[*..4]-(m))
where ID(n) = 103105 and ID(m) = 1386672
with p, n, m
return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity order by pathProximity DESC

returns 9 rows in 62ms


Using my indexed source_id (id:int:source_id) also takes a long time, as do the full-text indexed names (name:string:topic):

MATCH p = allShortestPaths((n)-[*..4]-(m))
where n.id = 1092923 and ID(m) = 21245
with p, n, m
return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity order by pathProximity DESC

649 rows in 8K ms



Trying to resolve Schema
I tried to fix the schema after the import by adding:

CREATE INDEX ON :Topic(name)

Now i have:

schema
ON :Topic(name) ONLINE

and I see that:


MATCH p = allShortestPaths((n:Topic)-[*..4]-(m:Topic))
where n.name = 'MyTopicName1' and m.name = 'MyTopicName2'
with p, n, m
return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity order by pathProximity DESC

9 rows in 70 ms FUCK YEAH! 


Please note that:
CREATE INDEX ON :Topic(id)

where 'id' is my own id for the topics (see the headers in the CSV),

results in:

no data returned.

Is there a conflict with using the word 'id' as a property name?


Testing LABEL AND Indexed Properties with (proper?) schema AGAINST Internal ID

Queries that do not use allShortestPaths do complete, but they still take up to an order of magnitude longer than with the internal ID:

**Using Label AND Indexed Property **

MATCH p = (n:Topic)-[*0..2]-(m:Topic) where n.name = 'Topic1' and m.name= 'Topic2'
return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity order by pathProximity DESC  LIMIT 6;

6 rows in 32253 ms

**Using Label AND Internal ID **

MATCH p = (n:Topic)-[*0..2]-(m:Topic) where ID(n) = 4115407 and ID(m) = 667541
return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity order by pathProximity DESC  LIMIT 6;

6 rows in 5K ms

**Using only Internal ID **

MATCH p = (n)-[*0..2]-(m) where ID(n) = 4115407 and ID(m) = 667541
return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity order by pathProximity DESC  LIMIT 6;

6 rows in 16Kms


1. Could you please explain why this query takes longer than allShortestPaths?
2. Could you please explain why a query with a label and the internal ID seems to be faster than a query with only the internal ID?
I understood that labels work with indexes... I cannot figure out whether and why they should matter with the internal ID...



Conclusion
There has been an improvement: my previous data structure was likely not imported properly, because matching nodes by internal ID was taking a long time as well.

And also trying to fix the schema with:
CREATE INDEX ON: MyNodeLabel(MyNodeProperty)


The issue was (probably) that I used the super-fast batch importer rather than the batch importer (I cannot remember exactly now, two months have passed :)


This time I used the batch importer, in the version for Neo4j 2.0 and later.
I see that matching by internal ID now produces the results I expect, and that fixing the schema afterwards should produce "regular" results too.

Although it seems possible to create the optional schema afterwards, I would like to learn how to properly index node properties while importing them.
I think I am still confused about the concept of indexing, and about the difference between the indexes shown by schema and by index --indexes.

If you are just joining the discussion now, do read this!

However, it is still not at all clear to me how to use the batch-importer tools to properly import indexes that work with the schema in Neo4j 2.1.5:

Are the indexes created with the batch importer legacy indexes (meant for use before 2.0) or indexes that comply with 2.0?

What are the correct syntax and steps for batch-import to create indexes for fast queries (full-text + exact) in Neo4j 2.1.5, and thus the optional schema?

What is auto-indexing meant to do: does it auto-index (exact and full-text) node properties and create a schema for Neo4j 2.1.5? If so, what is the correct syntax?

If auto-indexing was not meant for legacy indexes, could you please elaborate on the auto-indexing examples on GitHub?

Could you also include an example of importing an array of properties for nodes with the batch importer, so that all the properties in the array are indexed for either full-text or exact search?

What is the largest set of nodes of a certain type (label) that can be indexed by one of their properties for full-text search while still keeping query times efficient? (E.g. here I have 4M nodes, ALL of type 'Topic'. Can this set scale to 100M? Are there any benchmarks?)
 



Meanwhile, big thanks!