OOME in Brooklyn


Alex Heneveld

Apr 1, 2014, 2:36:21 PM
to brooklyn-dev

Hi folks-

A couple of people (including me) have hit OOMEs while running Brooklyn.

A bit of poking around in jvisualvm suggests the culprit is the InmemoryDatagrid, or something leaking into it.  I haven't yet determined what.

It occurred for me when I was running several blueprints, including cutting-edge Docker ones, after having run several others, some of which had hit failures and network drops (joys of hotel wifi). Worst of all, the max heap size seems to be about 128MB, judging by the logs, which were constant at this when the OOME hit:

    BrooklynGarbageCollector [brooklyn-gc]: brooklyn gc (after) - using 129MB / 130MB memory

(This is a little weird in itself because `java -XX:+PrintFlagsFinal -version` suggests MaxHeapSize should be 512MB on my system, but I guess that's something Eclipse is doing.) In any case, with that little memory it might not even be a leak. What makes me suspicious is that we have 80k keys in that data grid while I've only got a handful of entities and tasks, and that other folks have reported OOMEs with more reasonable -Xmx1g settings.
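
One way to check the effective limit from inside the running JVM, rather than from a separate `java -XX:+PrintFlagsFinal` process, is to ask `Runtime` directly. The sketch below is illustrative only: `HeapReport` and `describeHeap` are made-up names, and the "used / committed" layout is an assumption loosely modelled on the log line above, not Brooklyn's actual code.

```java
// Sketch only: class and method names are illustrative, not Brooklyn code.
class HeapReport {
    static final long MB = 1024L * 1024L;

    /** Formats current heap usage in MB, plus the maximum the JVM will try to use. */
    static String describeHeap() {
        Runtime rt = Runtime.getRuntime();
        long committed = rt.totalMemory();        // heap currently reserved by the JVM
        long used = committed - rt.freeMemory();  // portion of the reserved heap in use
        long max = rt.maxMemory();                // effective limit; Long.MAX_VALUE if unbounded
        return String.format("using %dMB / %dMB memory (max heap %dMB)",
                used / MB, committed / MB, max / MB);
    }

    public static void main(String[] args) {
        System.out.println(describeHeap());
    }
}
```

If the reported max heap differs from what `-XX:+PrintFlagsFinal` shows, the launcher (an IDE run configuration, for example) is probably passing its own `-Xmx`.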

A quick scan suggests most are location settings and ID strings ... perhaps too many copies of our brooklyn properties.

I'll keep looking but wanted to post in case any of you have more light to shed on this.

Of course persisting, restarting, and restoring fixes things, but my thinking for next steps is:

* We should track down and fix this leak and any others
* We should spend some time improving memory usage efficiency (we've done this with tasks and threads to huge gain)
* We should periodically report usage information for the InmemoryDatagrid
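
As a rough illustration of the last point, periodic usage reporting could look something like the sketch below. `CountingStore` is a hypothetical stand-in, not the real InmemoryDatagrid API, and the choice of counters (current size plus a running create count) is an assumption about what would be useful to log.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a key-value store; not the real InmemoryDatagrid API.
class CountingStore {
    private final Map<String, Object> entries = new ConcurrentHashMap<>();
    private final AtomicLong createCount = new AtomicLong();

    void put(String key, Object value) {
        // Count only genuinely new keys; overwriting an existing key is not a create.
        if (entries.put(key, value) == null) createCount.incrementAndGet();
    }

    void remove(String key) {
        entries.remove(key);
    }

    String usageSummary() {
        return "size=" + entries.size() + ", createCount=" + createCount.get();
    }

    /** Prints the summary at a fixed interval on a daemon thread until the returned executor is shut down. */
    ScheduledExecutorService reportEvery(long period, TimeUnit unit) {
        ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "store-usage-reporter");
            t.setDaemon(true);
            return t;
        });
        exec.scheduleAtFixedRate(
                () -> System.out.println("store usage: " + usageSummary()),
                period, period, unit);
        return exec;
    }
}
```

A size that keeps climbing while the number of live entities stays flat would be the signal to look for.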

Best
Alex

Andrew Kennedy (Cloudsoft)

Apr 1, 2014, 3:18:03 PM
to brookl...@googlegroups.com
Alex Heneveld wrote:
> A quick scan suggests most are location settings and ID strings ...
> perhaps too many copies of our brooklyn properties.

Could be due to the behaviour of 'resolve' and similar in
'LocationRegistry' which create new locations whenever the REST API
lists them? This seems suspect to me, and there _is_ a note in the
registry interface warning about this...

Andrew.
--
-- andrew kennedy ? software engineer : https://github.com/grkvlt/ ;

Aled Sage

Apr 1, 2014, 3:27:42 PM
to brookl...@googlegroups.com
Thanks Alex for the details.

Andrew, agreed it's suspect. I'm actually working on removing the need
for that - it's only being instantiated to get the displayName of the
locations that one can create!

Alex, do you have the output of `jmap -histo <pid>` that you can share?

Aled

Aled Sage

Apr 2, 2014, 9:57:43 AM
to brookl...@googlegroups.com
Hi all,

I've added some logging for datagrid usage [1].

When running EntityCleanupLongevityTest.testAppCreatedStartedAndStopped (which creates, stops, and unmanages 100,000 apps sequentially) it logged:

    using 370MB / 489MB memory; storage: {datagrid={size=178956, createCount=894748}, refsMapSize=178950, listsMapSize=0}; tasks: 0 active, 1 in memory (-2 incomplete and 984216 total submitted)

So we are leaving stuff behind in the datagrid. I'm continuing to investigate what (and also testing locations specifically, based on Andrew's hunch).

Aled

[1] https://github.com/brooklyncentral/brooklyn/pull/1301

Aled Sage

Apr 2, 2014, 11:18:34 AM
to brookl...@googlegroups.com
I've pushed a second commit to that pull request that fixes the memory leak.

When deleting a location (i.e. unmanaging it), we weren't removing its state from the datagrid.

As Andrew pointed out, we currently have a method that creates new location instances each time the REST API is polled, so the state left behind was consuming a lot of memory!
Next step for me is to stop it creating the locations in the first place.
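
For anyone following along, the bug pattern can be sketched roughly as below. All names here (`LocationStateDemo`, `manageNewLocation`, `unmanageLeaky`, `unmanageFixed`) are hypothetical and the map is only a stand-in for the datagrid; this models the shape of the leak and the fix, not the actual change in the pull request.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical names throughout; this models the bug pattern, not Brooklyn's actual classes.
class LocationStateDemo {
    /** Stand-in for the datagrid: per-location state keyed by location ID. */
    static final Map<String, Map<String, Object>> datagrid = new ConcurrentHashMap<>();

    /** Creates and registers a new location on every call, as a resolve-per-poll would. */
    static String manageNewLocation(String spec) {
        String id = UUID.randomUUID().toString();
        Map<String, Object> state = new ConcurrentHashMap<>();
        state.put("spec", spec);
        datagrid.put(id, state);
        return id;
    }

    /** Leaky variant: the location is treated as gone, but its state stays in the grid. */
    static void unmanageLeaky(String id) {
        // Intentionally removes nothing from the datagrid.
    }

    /** Fixed variant: unmanaging also drops the location's state. */
    static void unmanageFixed(String id) {
        datagrid.remove(id);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) unmanageLeaky(manageNewLocation("localhost"));
        System.out.println("after leaky unmanage: " + datagrid.size() + " entries");

        datagrid.clear();
        for (int i = 0; i < 1000; i++) unmanageFixed(manageNewLocation("localhost"));
        System.out.println("after fixed unmanage: " + datagrid.size() + " entries");
    }
}
```

In the leaky variant the map gains one entry per simulated poll; in the fixed variant it returns to empty. Not creating the locations at all removes the churn entirely.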

Aled

Duncan Johnston Watt

Apr 4, 2014, 3:24:03 AM
to brooklyn-dev
What is the status of this, please? Also, what about AMP 2.0.0-M1, which is needed for several client engagements/due diligence? Will it be possible to patch it, or will it already include the fix?

Thanks

Duncan
--
Duncan Johnston-Watt
CEO | Cloudsoft Corporation

Twitter | @duncanjw
Mobile | +44 777 190 2653
Skype | duncan_johnstonwatt

Cloudsoft Corporation Limited, Registered in Scotland No: SC349230.  Registered Office: 13 Dryden Place, Edinburgh, EH9 1RP
 

Alex Heneveld

Apr 4, 2014, 8:15:26 AM
to brookl...@googlegroups.com

Quick update on the OOME problem:

We have plugged the offending leaks in master/SNAPSHOT.  Nice work Aled & co.

The soak tests now pass (and there are new soak/longevity tests, as well as unit tests which look for many kinds of leaks).  There is also better reporting of consumption.  The source was some of the additional location persistence information not being cleared on location teardown.  (And longevity tests not being run frequently enough!)

AMP 2.0.0-M1 *does* include these patches and is working its way through the upload-to-sonatype and release process.  (AMP is the commercially supported build of Brooklyn; like Brooklyn it is freely available although under different license terms.)

I think we should cut a brooklyn-070-M2 release sooner than we might normally to incorporate this fix.  In the meantime we recommend that the latest AMP M1 build be used for any long-running deployments.


Note that it is still possible in some situations to have leaks if tasks are left running when entities are destroyed.  These can be removed programmatically or by persisting and recycling a management server.  But I think we should work on better GUI+API hooks to manage and clean up these dangling tasks, to report on consumption information, and in general to autonomically manage ourselves as a server tier (which is on the roadmap as part of the federation management pieces).

Thanks
Alex

Duncan Johnston Watt

Apr 4, 2014, 11:19:08 AM
to brooklyn-dev
Alex

Great news. Delighted AMP 2.0.0-M1 will include these patches.

Best

Duncan