Possible memory leak when using MERGE queries (2.1.2)


Matt Aldridge

Jul 2, 2014, 2:20:10 PM
to ne...@googlegroups.com
Hi everyone,

I have a use case that appears to expose a memory leak in Neo4j. I've been testing this with Neo4j 2.1.2 on OSX.

I've created a test case that reproduces the issue consistently and mimics the behavior of my real-world application. https://gist.github.com/mlaldrid/85a03fc022170561b807 This uses py2neo to interface with Neo4j's Cypher transactional HTTP endpoint. To force the suspected memory leak behavior to surface more quickly, I limit Neo4j's max heap to 1GB. In practice I tend to use an 8GB heap, but the misbehavior still occurs (albeit delayed).

In my real-world application, we need to CREATE millions of primary nodes of interest and MERGE ancillary nodes into the graph, as they can be shared by any number of other primary nodes. In the test case here we give the primary nodes the Person label, and the ancillary nodes are labeled Address and Phone. A fixed set of Address and Phone nodes are generated and randomly attached to Person nodes.

Each Cypher transaction CREATEs 1000 Person nodes and MERGEs in 2 Address and 1 Phone node for each Person. The transactions are created and then committed without any intermediate executions of the open transaction.
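The gist has the full test script; as a rough sketch (not the gist itself — the labels, property names, and the `build_batch` helper below are illustrative), one batch's statements could be assembled like this, using Cypher 2.x-style `{param}` placeholders:

```python
import random

def build_batch(people, addresses, phones):
    """Build (statement, parameters) pairs for one transaction:
    CREATE each Person, then MERGE its shared Address/Phone nodes
    and attach them with relationships."""
    statements = []
    for name in people:
        # Primary node: always new, so plain CREATE.
        statements.append((
            "CREATE (p:Person {name: {name}})",
            {"name": name}))
        # Ancillary nodes: shared across Persons, so MERGE de-duplicates.
        for street in random.sample(addresses, 2):
            statements.append((
                "MATCH (p:Person {name: {name}}) "
                "MERGE (a:Address {street: {street}}) "
                "CREATE (p)-[:LIVES_AT]->(a)",
                {"name": name, "street": street}))
        number = random.choice(phones)
        statements.append((
            "MATCH (p:Person {name: {name}}) "
            "MERGE (ph:Phone {number: {number}}) "
            "CREATE (p)-[:HAS_PHONE]->(ph)",
            {"name": name, "number": number}))
    return statements
```

All statements for a batch are appended to one transaction and committed in a single round trip, which is what makes the per-batch timings in the log comparable.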

This log demonstrates increasingly poor load performance until finally Neo4j runs out of heap space and fails a transaction:

% time python neo4j_heap_stress.py
2014-07-01 17:30:07,596 :: __main__ :: Generating fake data ...
2014-07-01 17:30:31,430 :: __main__ :: Creating label indices ...
2014-07-01 17:30:31,992 :: __main__ :: Beginning load ...
2014-07-01 17:33:33,949 :: __main__ :: Finished batch 100
2014-07-01 17:35:49,346 :: __main__ :: Finished batch 200
2014-07-01 17:37:56,856 :: __main__ :: Finished batch 300
2014-07-01 17:40:01,333 :: __main__ :: Finished batch 400
2014-07-01 17:42:04,855 :: __main__ :: Finished batch 500
2014-07-01 17:44:11,104 :: __main__ :: Finished batch 600
2014-07-01 17:46:17,261 :: __main__ :: Finished batch 700
2014-07-01 17:48:21,778 :: __main__ :: Finished batch 800
2014-07-01 17:50:28,206 :: __main__ :: Finished batch 900
2014-07-01 17:52:39,424 :: __main__ :: Finished batch 1000
2014-07-01 17:54:56,618 :: __main__ :: Finished batch 1100
2014-07-01 17:57:22,797 :: __main__ :: Finished batch 1200
2014-07-01 18:02:27,327 :: __main__ :: Finished batch 1300
2014-07-01 18:12:35,143 :: __main__ :: Finished batch 1400
2014-07-01 18:24:16,126 :: __main__ :: Finished batch 1500
2014-07-01 18:38:25,835 :: __main__ :: Finished batch 1600
2014-07-01 18:56:18,826 :: __main__ :: Finished batch 1700
2014-07-01 19:22:00,779 :: __main__ :: Finished batch 1800
2014-07-01 20:17:18,317 :: __main__ :: Finished batch 1900
Traceback (most recent call last):
  File "neo4j_heap_stress.py", line 112, in <module>
    main()
  File "neo4j_heap_stress.py", line 34, in main
    load_batch(fake, addresses, phones)
  File "neo4j_heap_stress.py", line 68, in load_batch
    tx.commit()
  File "/Users/matt/.virtualenvs/dataenv/lib/python2.7/site-packages/py2neo/cypher.py", line 242, in commit
    return self._post(self._commit or self._begin_commit)
  File "/Users/matt/.virtualenvs/dataenv/lib/python2.7/site-packages/py2neo/cypher.py", line 217, in _post
    raise TransactionError.new(error["code"], error["message"])
py2neo.cypher.CouldNotCommit: commit threw exception
python neo4j_heap_stress.py  1351.47s user 48.76s system 10% cpu 3:46:27.77 total

Early in the load process, 100K Person nodes are inserted approximately every 2 minutes. However, after 1.2M Person nodes (batch 1200) have been inserted into the graph, the batch times slow dramatically. This corresponds with increased Java garbage collection times; near the end, the GC suspends application threads for several seconds at a time. I have captured Neo4j's GC logs if there is any interest in seeing them.

Eventually a transaction fails, and this is in console.log:

Exception in thread "GC-Monitor"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "GC-Monitor"
Exception in thread "DateCache"
Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "DateCache"
Exception in thread "RMI RenewClean-[192.168.1.214:60253]" java.lang.OutOfMemoryError: Java heap space


Of all the variables here, the heavy use of MERGE queries appears to be the culprit behind the performance degradation. Before changing my application's graph model to share ancillary nodes among the primary nodes, I was doing a simple CREATE for all nodes, which scaled nearly linearly to tens of millions of nodes in the graph, using the same Cypher transaction method via py2neo. Only when I wanted to de-duplicate the shared ancillary nodes with MERGE did this behavior arise. Hopefully that's not a red herring, but something is almost certainly amiss here.

Please let me know if I can provide any other relevant details to reproduce.

Thanks,
-Matt

Matt Aldridge

Jul 7, 2014, 8:47:43 AM
to ne...@googlegroups.com
FWIW, I have replicated this issue on 2.0.3 as well. While GC churn kicks in at approximately the same point as with 2.1.2, it is interesting to note how much faster the test case's Cypher queries run in 2.1.2: something like 50% faster! :)

Nonetheless, the memory leak continues to be an issue for me. AFAICT, the py2neo API is properly opening, submitting, and closing the Cypher transactions according to spec. I'd greatly appreciate any assistance in determining whether this is indeed a bug in Neo4j.

Thanks,
-Matt

Matt Aldridge

Jul 16, 2014, 10:51:05 AM
to ne...@googlegroups.com
Following up on this, my initial hypothesis that the problem lies with MERGE queries appears incorrect, or at least overly specific. In another application, I needed to add relationships among millions of nodes in a pre-existing graph, essentially using Cypher queries following the pattern "MATCH (a), (b) CREATE (a)-[r]-(b)". Running with a 10GB heap (unnecessarily large, I thought, but just in case), I can make it through about 16M queries before GC churn takes over and Jetty eventually times out.
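As an aside for anyone reproducing this: in Cypher, CREATE requires a relationship type and a direction, so a runnable form of that pattern looks something like the constant below (the Person label, id property, and KNOWS type are illustrative, not from my application):

```python
# Hypothetical parameterized form of the relationship-load query.
# Looking both nodes up by a label + indexed property keeps the MATCH
# cheap; {a_id}/{b_id} are Cypher 2.x-style parameter placeholders.
LINK_QUERY = (
    "MATCH (a:Person {id: {a_id}}), (b:Person {id: {b_id}}) "
    "CREATE (a)-[:KNOWS]->(b)"
)
```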

To eliminate the possibility of any bugs in py2neo's transaction/HTTP connection handling, I altered my code to remove the py2neo dependency and communicate directly with Neo4j's begin/commit endpoint (http://localhost:7474/db/data/transaction/commit). ... Same result.
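For reference, a minimal sketch of talking to that endpoint directly (Python 3 urllib here rather than my original Python 2.7 code; the helper names are mine, but the JSON body shape is the one Neo4j 2.x's transactional endpoint expects):

```python
import json
import urllib.request

COMMIT_URL = "http://localhost:7474/db/data/transaction/commit"

def build_payload(statements):
    """Shape a list of (cypher, params) pairs into the JSON body the
    transactional endpoint expects: {"statements": [...]}."""
    return json.dumps({
        "statements": [
            {"statement": stmt, "parameters": params}
            for stmt, params in statements
        ]
    })

def post_commit(statements):
    """POST one self-contained transaction (begin + commit in a single
    request) and return the decoded JSON response."""
    req = urllib.request.Request(
        COMMIT_URL,
        data=build_payload(statements).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Hitting the /transaction/commit URL directly like this removes any client-library transaction handling from the equation, which is exactly the point of the experiment.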

I updated my test code gist (https://gist.github.com/mlaldrid/85a03fc022170561b807) similarly, eliminating py2neo and using the begin/commit endpoint directly. Running against the Neo4j community edition with a 512MB heap, I see GC churn at about 350K queries. I've also downloaded a trial copy of the enterprise edition and run the same test code with the same 512MB heap for an hour (about 3.5M queries) with no signs of a memory leak or GC churn.

Why is this memory leak behavior different in the community and enterprise editions? Is it something that the enterprise edition's "advanced caching" feature solves? Is it a known but opaque limitation of the community edition?

Thanks,
-Matt

Chris Vest

Jul 16, 2014, 12:06:54 PM
to ne...@googlegroups.com
Yeah, it’s a known bug: https://github.com/neo4j/neo4j/pull/2690

--
Chris Vest
System Engineer, Neo Technology
[ skype: mr.chrisvest, twitter: chvest ]


--
You received this message because you are subscribed to the Google Groups "Neo4j" group.

Matt Aldridge

Jul 16, 2014, 12:54:11 PM
to ne...@googlegroups.com
Ah nice, thank you for the link!

-Matt