Re: Neo4j, slow queries with Cypher

1,218 views
Skip to first unread message

K. Emre KISA

unread,
Apr 29, 2013, 5:06:16 PM4/29/13
to ne...@googlegroups.com
Plus i have these auto index settings;

# Enable auto-indexing for nodes, default is false
node_auto_indexing=true

# The node property keys to be auto-indexed, if enabled
node_keys_indexable=_type, name, lastname, no, accountNo, balance, lastTransactionTime

# Enable auto-indexing for relationships, default is false
relationship_auto_indexing=true

# The relationship property keys to be auto-indexed, if enabled
relationship_keys_indexable=_type, amount, time, date, lastTransactionDate, lastTransactionTime



On Monday, April 29, 2013 11:39:03 PM UTC+3, K. Emre KISA wrote:
Hi

I am new to Neo4j and trying to simulate database of a Bank.

In my database, i have 100 accounts, which are owned by 100 customers via "owns" relationship.

There are randomly generated ~3000 money transfers  among these 100 accounts. Represented by "transaction" relationship.

What i want to do is to identify all potential fraudelent money transfers. 


START n = node:node_auto_index(_type="account")
MATCH p = n-[r1:transaction]-p1-[r2:transaction]->m
WHERE r1.clerk = r2.clerk
   and r1.date <= r2.date and r1.time <= r2.time
   and r2.date - r1.date <= 10
    and (
         ((r1.amount >= r2.amount * 0.9) and (r1.amount <= r2.amount * 1.1))
           or
         (r2.amount >= r1.amount * 0.9) and (r2.amount <= r1.amount * 1.1)  
       )
return n, r1, p1, r2, m

First; How should i model transation times? Currently i use it as "r.date = 20130401"
I want to be able to query according to time intervals.

Second; The query above works slow. It takes 5 seconds to run. How can i speed it up.

Third; Would neo4j still run slow if i want to run the query above on 10 million transactions? Is accessing so many nodes is a problem that no database can handle, even neo4j?

My configurations are like below;
# Initial Java Heap Size (in MB)
#wrapper.java.initmemory=4096
# Maximum Java Heap Size (in MB)
#wrapper.java.maxmemory=4096

If you need any metrics for further inspection, first you should try to explain me how to get them :)

Last question;

start a1 = node (16124), a2 = node(16147)
   match
a1-[r:transaction*10]->a2

return r

On the same small database, this query takes forever to run. What could be the problem?

Thanks alot!

K. Emre KISA

unread,
Apr 30, 2013, 2:36:46 PM4/30/13
to ne...@googlegroups.com
My Neo4j version is 1.9.RC1
Java 1.6.0_41
Windows 8

in the Neo4j /bin folder, i have also made following changes in base.bat file. Hope it is correct

set binPath="%javaPath%\bin\java.exe -XX:+UseNUMA -Xmx4096M -server -XX:+UseConcMarkSweepGC ....//rest is untouched


Michael Hunger

unread,
Apr 30, 2013, 2:43:30 PM4/30/13
to ne...@googlegroups.com
You should make those changes to neo4j-wrapper.conf

Any chance to share your generated db?

Sent from mobile device
--
You received this message because you are subscribed to the Google Groups "Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email to neo4j+un...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

K. Emre KISA

unread,
May 2, 2013, 1:48:56 PM5/2/13
to ne...@googlegroups.com
Hi Michael,

I uploaded all files in "neo4j-enterprise-1.9.RC1\data\graph.db" folder

I am using web console to generate data, hope this is the correct address
graph.db.rar

Michael Hunger

unread,
May 3, 2013, 2:55:30 AM5/3/13
to ne...@googlegroups.com
Hi,

this is an interesting use-case.

So far people have mostly created aggregated money flow graphs that were centered around people (aka owners of accounts) and filtered out the transactions that are non-standard transactions (e.g. payments to insurances, newspapers etc).
And ran those kinds of queries on the aggregated (or extracted) graphs.

A general problem with this kind of query is that it is a graph global analytics query which touches the whole graph in the end.

As each node, e.g. account has x relationships, each step (n) that you take outwards touches x to the power of n (x^n) rels, so with 1000 rels per node, and 3 steps you are already at potentially a billion relationships that you touch.

Would it be possible to share your generated graph (e.g. zipped on dropbox) I'd love to profile your query and speed it up.

you can run "profile start ...." yourself too.

Your memory config is not active, you have to uncomment these lines to become active.

Do you run on linux, or windows? Is that the first run of the query or a subsequent run? What version of Neo4j are you running?


see my explanation above, quoted below too.

As each node, e.g. account has x relationships, each step (n) that you take outwards touches x to the power of n (x^n) rels, so with 1000 rels per node, and 3 steps you are already at potentially a billion relationships that you touch.


you probably want to run a shortest-path algorithm instead?

also *10 means 10 or more, if you want to have at most 10 use *..10


start a1 = node (16124), a2 = node(16147)
   match

shortestPath(a1-[r:transaction*..10]->a2)

On the same small database, this query takes forever to run. What could be the problem?

Thanks alot!

Michael Hunger

unread,
May 3, 2013, 11:54:57 PM5/3/13
to ne...@googlegroups.com
Sorry,

you ran into a shortcoming of cypher, which is that it doesn't yet pull in predicates that span multiple elements of a path into the pattern matcher.

If you convert your cypher query into Java code for the core-API it runs in 100ms on my machine, code is below.

HTH

Michael

package com.example.neo4j;

import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.index.IndexHits;
import org.neo4j.graphdb.index.ReadableIndex;
import org.neo4j.helpers.Pair;
import org.neo4j.kernel.EmbeddedGraphDatabase;

import java.util.ArrayList;
import java.util.Collection;

/*
START n = node:node_auto_index(_type="account")
MATCH p = n-[r1:transaction]-p1-[r2:transaction]->m
WHERE r1.clerk = r2.clerk
   and r1.date <= r2.date and r1.time <= r2.time
   and r2.date - r1.date <= 10
    and (
         ((r1.amount >= r2.amount * 0.9) and (r1.amount <= r2.amount * 1.1))
           or
         (r2.amount >= r1.amount * 0.9) and (r2.amount <= r1.amount * 1.1)
       )
return n, r1, p1, r2, m
;

START n = node:node_auto_index(_type="account")
return count(*)
;
 */
public class KisaMatcher {

    private static final DynamicRelationshipType TRANSACTION = DynamicRelationshipType.withName("transaction");

    public static void main(String[] args) {
        final GraphDatabaseService db = new EmbeddedGraphDatabase("kisa.db");

        for (int i=0;i<10;i++) {
            findAccounts(db);
        }
        db.shutdown();
    }

    private static void findAccounts(GraphDatabaseService db) {
        final ReadableIndex<Node> index = db.index().getNodeAutoIndexer().getAutoIndex();
        Collection<Pair<Relationship,Relationship>> result=new ArrayList<Pair<Relationship, Relationship>>();
        long start = System.currentTimeMillis();
        final IndexHits<Node> accounts = index.get("_type", "account");
        for (Node account : accounts) {
            for (Relationship r1 : account.getRelationships(TRANSACTION)) {
                long date1 = (Long)r1.getProperty("date");
                long time1 = (Long)r1.getProperty("time");
                double amount1 = (Double)r1.getProperty("amount");
                String clerk= (String) r1.getProperty("clerk");
                final Node otherAccount = r1.getOtherNode(account);
                for (Relationship r2 : otherAccount.getRelationships(TRANSACTION)) {
                    long date2 = (Long)r2.getProperty("date");
                    if (date2 < date1) continue;
                    long time2 = (Long)r2.getProperty("time");
                    if (time2 < time1) continue;
                    double amount2 = (Double)r2.getProperty("amount");
                    final double delta = amount2 / amount1;
                    if (delta<0.9 || delta>1.1) continue;
                    if (!r2.getProperty("clerk").equals(clerk)) continue;
                    result.add(Pair.of(r1,r2));
                }
            }
        }
        System.out.println("Result "+result.size()+" took "+(System.currentTimeMillis()-start)+" ms");
    }
}

<graph.db.rar>

K. Emre KISA

unread,
May 4, 2013, 3:28:07 AM5/4/13
to ne...@googlegroups.com
Michael, you are a life saver, thank you.

I haven't used core API yet but i will give it a try.

Thank a lot

Michael Hunger

unread,
May 4, 2013, 7:08:27 AM5/4/13
to ne...@googlegroups.com
You're welcome.

Hope you are successful

Sent from mobile device

K. Emre KISA

unread,
May 12, 2013, 1:49:24 PM5/12/13
to ne...@googlegroups.com
Hi Michael,

I really need your help. On the attached files, you can find the code i use. 

There are 3 methods in InternalFraudMatcher.java

1st is insertData() : generates 1 million accounts and customers in batch 
2nd is generateTransactions() : generates random transactions in batch (max 3 transaction per account)
3rd is findAccounts() : queries transactions to find the pattern with Core API

The problem is, even though i use Core API the execution takes 4.5 seconds (cold run takes 11 seconds :/). I can execute same query in 1.5 seconds with MySQL.  

To be able to use the code;
 - You should first uncomment and run first two methods to generate data.
 - Since i use AutoIndex and batch, after 2nd method is run you should run AutoIndexUpdater.java to update indexes. (i copied this from one of your posts, thanks again)
 - And finally you should use findAccounts() method to execute query.



BTW , I had to downgrade to neo4j-enterprise-1.8.2 because the same code gives error with 1.9.RC1.

These are my JVM parameters;

-server -D64 -Xms4096M -Xmx4096M -cp D:\Swf\DB\Neo4j\neo4j-enterprise-1.8.2\lib\neo4j-kernel.jar;D:\Swf\DB\Neo4j\neo4j-enterprise-1.8.2\jta.jar -Xss2048k  -XX:SurvivorRatio=9 -XX:TargetSurvivorRatio=90 -XX:+UseParNewGC  -Xloggc:"C:\Users\Emre KISA\Desktop\gc.txt" -XX:+PrintGCDetails


You can also find Garbage Collector statistics of GCViewer on the attached screenshot.


Thanks in advance
AutoIndexUpdater.java
InternalFraudMatcher.java
MySQLTest.java
GarbageCollector_GCviewer.png

K. Emre KISA

unread,
May 12, 2013, 2:12:07 PM5/12/13
to ne...@googlegroups.com
Actually, it takes 0.266 seconds to run with MySQL :)

Michael Hunger

unread,
May 13, 2013, 1:56:49 AM5/13/13
to ne...@googlegroups.com
Hi,

I took a look at your code and rewrote it a bit :) Code is attached.

# First your batch-insertion doesn't create auto-indexes, so I changed that to a transactional creation
# I left off the relationship indexes
# I fixed the mmio-settings for store sizes

For the lookup, some performance culprits: 

# single threaded access only uses one CPU
# lucene lookup for 1M elements (_type=account) is slow
# changed that to GlobalGraphOperations.at(db).getAllNodes() and then a property-filter predicate which is twice as fast for the cached version
# still too slow :), pulled out the list of accounts upfront
# introduced threading and segmented the account list into segments of 1000 each which are each submitted to a threadpool of 4 threads (I have a quadcore CPU with 8 virtual CPUs).

JVM options: -Xmx4G -Xms4G -Xmn2G -server -d64 -XX:+UseParNewGC

Please note that you're running on windows, and there the MMIO-space is part of of the Java heap (unlike on MacOS / Linux) so you might have to increase your heap a bit

Here are some numbers of the last run (I have a SSD):

Physical mem: 6144MB, Heap size: 3891MB
Attempt 0
# of Potential Fraudulent Activities : 2, took 7477 ms
Victim Transaction Middle Transaction2 Destination
1022218 1500215 494550 1119790 1496626
1383186 1878128 580036 1135953 1926382
Attempt 1
# of Potential Fraudulent Activities : 2, took 928 ms
Victim Transaction Middle Transaction2 Destination
1022218 1500215 494550 1119790 1496626
1383186 1878128 580036 1135953 1926382
Attempt 2
# of Potential Fraudulent Activities : 2, took 834 ms
Victim Transaction Middle Transaction2 Destination
1022218 1500215 494550 1119790 1496626
1383186 1878128 580036 1135953 1926382
Attempt 3
# of Potential Fraudulent Activities : 2, took 526 ms
Victim Transaction Middle Transaction2 Destination
1022218 1500215 494550 1119790 1496626
1383186 1878128 580036 1135953 1926382
Attempt 4
# of Potential Fraudulent Activities : 2, took 527 ms
Victim Transaction Middle Transaction2 Destination
1022218 1500215 494550 1119790 1496626
1383186 1878128 580036 1135953 1926382


InternalFraudMatcher.java

K. Emre KISA

unread,
May 13, 2013, 1:41:27 PM5/13/13
to ne...@googlegroups.com
Your willingness to help is far beyond my expectations Michael

Thanks a lot

I love your community :)

I will run your code asap

K. Emre KISA

unread,
May 13, 2013, 1:50:43 PM5/13/13
to ne...@googlegroups.com
Hi Stefan,

When i first read about Neo4j one of the cons people were saying is that, "you really have to change how you think about designing a database". 

Relationship type per clerk? Now i can see what they meant :)

That surely sounds cool but doesn't that approach have any side effects? I think that if i design database that way, then it becomes highly specialized for a specific use case doesn't it?

Thanks


On Monday, May 13, 2013 1:41:47 AM UTC+3, Stefan Plantikow wrote:
Hi,



Am Montag, 29. April 2013 22:39:03 UTC+2 schrieb K. Emre KISA:
Hi

I am new to Neo4j and trying to simulate database of a Bank.

In my database, i have 100 accounts, which are owned by 100 customers via "owns" relationship.

There are randomly generated ~3000 money transfers  among these 100 accounts. Represented by "transaction" relationship.

What i want to do is to identify all potential fraudelent money transfers. 


START n = node:node_auto_index(_type="account")
MATCH p = n-[r1:transaction]-p1-[r2:transaction]->m
WHERE r1.clerk = r2.clerk
   and r1.date <= r2.date and r1.time <= r2.time
   and r2.date - r1.date <= 10
    and (
         ((r1.amount >= r2.amount * 0.9) and (r1.amount <= r2.amount * 1.1))
           or
         (r2.amount >= r1.amount * 0.9) and (r2.amount <= r1.amount * 1.1)  
       )
return n, r1, p1, r2, m


Did you consider using an alternative way to model this information?  A possible approach could be to store transactions as a time sorted linked list and use that to only query relevant time windows.   

Additionally it might be worthwhile to keep track of which transactions have been issued by the same clerk directly in the graph.  To achieve this, you could even have a relationship type per clerk and sort transactions of the same clerk in a linked list by time. 

The fraud detection could then navigate from each clerk to the set of relevant transactions in a time windows and just check if they touch the same accounts.
If you just want to generate reports this has the benefit of being easily parallelizable per clerk.


Cheers,


Stefan
Reply all
Reply to author
Forward
0 new messages