Re: Neo4j, slow queries with Cypher

K. Emre KISA

unread,

Apr 29, 2013, 5:06:16 PM4/29/13

to ne...@googlegroups.com

Plus i have these auto index settings;

# Enable auto-indexing for nodes, default is false
node_auto_indexing=true

# The node property keys to be auto-indexed, if enabled
node_keys_indexable=_type, name, lastname, no, accountNo, balance, lastTransactionTime

# Enable auto-indexing for relationships, default is false
relationship_auto_indexing=true

# The relationship property keys to be auto-indexed, if enabled
relationship_keys_indexable=_type, amount, time, date, lastTransactionDate, lastTransactionTime

On Monday, April 29, 2013 11:39:03 PM UTC+3, K. Emre KISA wrote:

Hi

I am new to Neo4j and trying to simulate database of a Bank.

In my database, i have 100 accounts, which are owned by 100 customers via "owns" relationship.

There are randomly generated ~3000 money transfers among these 100 accounts. Represented by "transaction" relationship.

What i want to do is to identify all potential fraudelent money transfers.

START n = node:node_auto_index(_type="account") MATCH p = n-[r1:transaction]-p1-[r2:transaction]->m WHERE r1.clerk = r2.clerk and r1.date <= r2.date and r1.time <= r2.time and r2.date - r1.date <= 10 and ( ((r1.amount >= r2.amount * 0.9) and (r1.amount <= r2.amount * 1.1)) or (r2.amount >= r1.amount * 0.9) and (r2.amount <= r1.amount * 1.1) ) return n, r1, p1, r2, m

First; How should i model transation times? Currently i use it as "r.date = 20130401"
I want to be able to query according to time intervals.

Second; The query above works slow. It takes 5 seconds to run. How can i speed it up.

Third; Would neo4j still run slow if i want to run the query above on 10 million transactions? Is accessing so many nodes is a problem that no database can handle, even neo4j?

My configurations are like below;
# Initial Java Heap Size (in MB)
#wrapper.java.initmemory=4096
# Maximum Java Heap Size (in MB)
#wrapper.java.maxmemory=4096

If you need any metrics for further inspection, first you should try to explain me how to get them :)

Last question;

start a1 = node (16124), a2 = node(16147) match a1-[r:transaction*10]->a2 return r

On the same small database, this query takes forever to run. What could be the problem?

Thanks alot!

K. Emre KISA

unread,

Apr 30, 2013, 2:36:46 PM4/30/13

to ne...@googlegroups.com

My Neo4j version is 1.9.RC1

Java 1.6.0_41

Windows 8

in the Neo4j /bin folder, i have also made following changes in base.bat file. Hope it is correct

set binPath="%javaPath%\bin\java.exe -XX:+UseNUMA -Xmx4096M -server -XX:+UseConcMarkSweepGC ....//rest is untouched

Michael Hunger

unread,

Apr 30, 2013, 2:43:30 PM4/30/13

to ne...@googlegroups.com

You should make those changes to neo4j-wrapper.conf

Any chance to share your generated db?

Sent from mobile device

--
You received this message because you are subscribed to the Google Groups "Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email to neo4j+un...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

K. Emre KISA

unread,

May 2, 2013, 1:48:56 PM5/2/13

to ne...@googlegroups.com

Hi Michael,

I uploaded all files in "neo4j-enterprise-1.9.RC1\data\graph.db" folder

I am using web console to generate data, hope this is the correct address

graph.db.rar

Michael Hunger

unread,

May 3, 2013, 2:55:30 AM5/3/13

to ne...@googlegroups.com

Hi,

this is an interesting use-case.

So far people have mostly created aggregated money flow graphs that were centered around people (aka owners of accounts) and filtered out the transactions that are non-standard transactions (e.g. payments to insurances, newspapers etc).

And ran those kinds of queries on the aggregated (or extracted) graphs.

A general problem with this kind of query is that it is a graph global analytics query which touches the whole graph in the end.

As each node, e.g. account has x relationships, each step (n) that you take outwards touches x to the power of n (x^n) rels, so with 1000 rels per node, and 3 steps you are already at potentially a billion relationships that you touch.

Would it be possible to share your generated graph (e.g. zipped on dropbox) I'd love to profile your query and speed it up.

you can run "profile start ...." yourself too.

Your memory config is not active, you have to uncomment these lines to become active.

Do you run on linux, or windows? Is that the first run of the query or a subsequent run? What version of Neo4j are you running?

see my explanation above, quoted below too.

As each node, e.g. account has x relationships, each step (n) that you take outwards touches x to the power of n (x^n) rels, so with 1000 rels per node, and 3 steps you are already at potentially a billion relationships that you touch.

you probably want to run a shortest-path algorithm instead?

also *10 means 10 or more, if you want to have at most 10 use *..10

start a1 = node (16124), a2 = node(16147) match

shortestPath(a1-[r:transaction*..10]->a2)

On the same small database, this query takes forever to run. What could be the problem?

Thanks alot!

Michael Hunger

unread,

May 3, 2013, 11:54:57 PM5/3/13

to ne...@googlegroups.com

Sorry,

you ran into a shortcoming of cypher, which is that it doesn't yet pull in predicates that span multiple elements of a path into the pattern matcher.

If you convert your cypher query into Java code for the core-API it runs in 100ms on my machine, code is below.

HTH

Michael

package com.example.neo4j;

import org.neo4j.graphdb.DynamicRelationshipType;

import org.neo4j.graphdb.GraphDatabaseService;

import org.neo4j.graphdb.Node;

import org.neo4j.graphdb.Relationship;

import org.neo4j.graphdb.index.IndexHits;

import org.neo4j.graphdb.index.ReadableIndex;

import org.neo4j.helpers.Pair;

import org.neo4j.kernel.EmbeddedGraphDatabase;

import java.util.ArrayList;

import java.util.Collection;

/*

START n = node:node_auto_index(_type="account")

MATCH p = n-[r1:transaction]-p1-[r2:transaction]->m

WHERE r1.clerk = r2.clerk

and r1.date <= r2.date and r1.time <= r2.time

and r2.date - r1.date <= 10

and (

((r1.amount >= r2.amount * 0.9) and (r1.amount <= r2.amount * 1.1))

or

(r2.amount >= r1.amount * 0.9) and (r2.amount <= r1.amount * 1.1)

)

return n, r1, p1, r2, m

;

START n = node:node_auto_index(_type="account")

return count(*)

;

*/

public class KisaMatcher {

private static final DynamicRelationshipType TRANSACTION = DynamicRelationshipType.withName("transaction");

public static void main(String[] args) {

final GraphDatabaseService db = new EmbeddedGraphDatabase("kisa.db");

for (int i=0;i<10;i++) {

findAccounts(db);

}

db.shutdown();

}

private static void findAccounts(GraphDatabaseService db) {

final ReadableIndex<Node> index = db.index().getNodeAutoIndexer().getAutoIndex();

Collection<Pair<Relationship,Relationship>> result=new ArrayList<Pair<Relationship, Relationship>>();

long start = System.currentTimeMillis();

final IndexHits<Node> accounts = index.get("_type", "account");

for (Node account : accounts) {

for (Relationship r1 : account.getRelationships(TRANSACTION)) {

long date1 = (Long)r1.getProperty("date");

long time1 = (Long)r1.getProperty("time");

double amount1 = (Double)r1.getProperty("amount");

String clerk= (String) r1.getProperty("clerk");

final Node otherAccount = r1.getOtherNode(account);

for (Relationship r2 : otherAccount.getRelationships(TRANSACTION)) {

long date2 = (Long)r2.getProperty("date");

if (date2 < date1) continue;

long time2 = (Long)r2.getProperty("time");

if (time2 < time1) continue;

double amount2 = (Double)r2.getProperty("amount");

final double delta = amount2 / amount1;

if (delta<0.9 || delta>1.1) continue;

if (!r2.getProperty("clerk").equals(clerk)) continue;

result.add(Pair.of(r1,r2));

}

System.out.println("Result "+result.size()+" took "+(System.currentTimeMillis()-start)+" ms");

}

<graph.db.rar>

K. Emre KISA

unread,

May 4, 2013, 3:28:07 AM5/4/13

to ne...@googlegroups.com

Michael, you are a life saver, thank you.

I haven't used core API yet but i will give it a try.

Thank a lot

Michael Hunger

unread,

May 4, 2013, 7:08:27 AM5/4/13

to ne...@googlegroups.com

You're welcome.

Hope you are successful

Sent from mobile device

K. Emre KISA

unread,

May 12, 2013, 1:49:24 PM5/12/13

to ne...@googlegroups.com

Hi Michael,

I really need your help. On the attached files, you can find the code i use.

There are 3 methods in InternalFraudMatcher.java

1st is insertData() : generates 1 million accounts and customers in batch

2nd is generateTransactions() : generates random transactions in batch (max 3 transaction per account)

3rd is findAccounts() : queries transactions to find the pattern with Core API

The problem is, even though i use Core API the execution takes 4.5 seconds (cold run takes 11 seconds :/). I can execute same query in 1.5 seconds with MySQL.

To be able to use the code;

- You should first uncomment and run first two methods to generate data.

- Since i use AutoIndex and batch, after 2nd method is run you should run AutoIndexUpdater.java to update indexes. (i copied this from one of your posts, thanks again)

- And finally you should use findAccounts() method to execute query.

BTW , I had to downgrade to neo4j-enterprise-1.8.2 because the same code gives error with 1.9.RC1.

These are my JVM parameters;

-server -D64 -Xms4096M -Xmx4096M -cp D:\Swf\DB\Neo4j\neo4j-enterprise-1.8.2\lib\neo4j-kernel.jar;D:\Swf\DB\Neo4j\neo4j-enterprise-1.8.2\jta.jar -Xss2048k  -XX:SurvivorRatio=9 -XX:TargetSurvivorRatio=90 -XX:+UseParNewGC  -Xloggc:"C:\Users\Emre KISA\Desktop\gc.txt" -XX:+PrintGCDetails

You can also find Garbage Collector statistics of GCViewer on the attached screenshot.

Thanks in advance

AutoIndexUpdater.java

InternalFraudMatcher.java

MySQLTest.java

GarbageCollector_GCviewer.png

K. Emre KISA

unread,

May 12, 2013, 2:12:07 PM5/12/13

to ne...@googlegroups.com

Actually, it takes 0.266 seconds to run with MySQL :)

Michael Hunger

unread,

May 13, 2013, 1:56:49 AM5/13/13

to ne...@googlegroups.com

Hi,

I took a look at your code and rewrote it a bit :) Code is attached.

# First your batch-insertion doesn't create auto-indexes, so I changed that to a transactional creation

# I left off the relationship indexes

# I fixed the mmio-settings for store sizes

For the lookup, some performance culprits:

# single threaded access only uses one CPU

# lucene lookup for 1M elements (_type=account) is slow

# changed that to GlobalGraphOperations.at(db).getAllNodes() and then a property-filter predicate which is twice as fast for the cached version

# still too slow :), pulled out the list of accounts upfront

# introduced threading and segmented the account list into segments of 1000 each which are each submitted to a threadpool of 4 threads (I have a quadcore CPU with 8 virtual CPUs).

JVM options: -Xmx4G -Xms4G -Xmn2G -server -d64 -XX:+UseParNewGC

Please note that you're running on windows, and there the MMIO-space is part of of the Java heap (unlike on MacOS / Linux) so you might have to increase your heap a bit

Here are some numbers of the last run (I have a SSD):

Physical mem: 6144MB, Heap size: 3891MB

Attempt 0

# of Potential Fraudulent Activities : 2, took 7477 ms