Excessive memory usage running Gremlin Server on huge Docker ECS instances


sergio....@aurea.com

unread,
Jun 8, 2017, 1:06:48 PM6/8/17
to Gremlin-users
Hello guys,

We have two x1.large nodes (1 TB of memory and 64 processors each) running on an ECS cluster that hosts more than 300 Gremlin Server 3.2.4 instances. For each of them we reserve 2.5 GB of memory, which is plenty for our Neo4j databases. So the two servers provide 2 TB of memory and 128 processors, which should be more than enough.
The issue we noticed is that if we run the containers locally on Docker, we see an average of 500 MB just after startup, which is reasonable. But when we deploy the same container to ECS (which is Linux) on a huge host like x1.large, memory usage is much higher, around 4-6 GB per service.

We are aware that Java has some limitations when running as a container service, as described here: https://developers.redhat.com/blog/2017/03/14/java-inside-docker/
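The core of that limitation is that, before the container-awareness work landed in later JDK releases, the JVM sized its ergonomics from the host rather than the cgroup. A minimal sketch (not Gremlin-specific) that prints what the JVM believes it has to work with:

```java
// Prints the CPU count and max heap the JVM has detected.
// Inside a container on a pre-container-aware JVM, these reflect
// the *host* (e.g. 128 CPUs on an x1 instance), not the cgroup limits.
public class JvmResources {
    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("availableProcessors = " + cpus);
        System.out.println("maxMemory (MB)      = " + maxHeapMb);
    }
}
```

Run inside one of the containers on the x1 host, this should make it obvious whether the JVM is seeing 128 CPUs instead of the container's share.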

Also, we know the issue occurs only on huge nodes, because we also tested a 40+ instance cluster of m4.2xlarge and it worked as expected (low memory usage, low processor usage, everything went fine). But we cannot proceed with smaller instances, because using the huge x1.large instances is a business requirement.

Here is the last GC logs for a failing service: https://gist.github.com/sergiofigueras/642d32321487cd8c04ba5582158e6b65

So far we have attempted the following:
- Set memory limits (-Xms/-Xmx) to 128m/2g. Result: services were killed by the OOM killer.
- Deactivated the GC alarms using -XX:-UseGCOverheadLimit. Result: memory usage was still very high.
- Reduced the number of parallel GC threads to 8 using -XX:ParallelGCThreads. Result: memory usage was reduced to 2 GB, but the services did not start.
- Changed the GC strategy to G1. Result: memory usage was still high.
- Reduced the instance size to m4.2xlarge instead of x1.large. Result: worked, but was not accepted by the business team.

Given all this, it seems to us that some code/library used internally by Gremlin Server is probably sizing itself from the overall limits of the x1 host, not respecting that it's running inside a container, probably due to a Java constraint. We suspect this because we have seen others hit this kind of issue with other libraries/projects, e.g.: https://github.com/moby/moby/issues/32788

My questions are: is anybody aware of this? Is there anything we are missing? How would you suggest we debug it?

Any suggestions for anything in this situation is REALLY appreciated :) :)

Thanks!

Sergio.


Stephen Mallette

unread,
Jun 12, 2017, 7:50:29 AM6/12/17
to Gremlin-users
That's a weird problem. It's really odd that the JVM would behave differently in Docker on a larger instance as compared to a smaller instance.

> we have an average of 500mb just after startup, which is fair enough. But when we check the same container to ECS (which is Linux) on a huge host, we noticed that the memory usage was much more higher, like 4 ~ 6GB per service, while using a huge host like x1.large.

I don't know what could account for that kind of memory growth in Gremlin Server after startup. Gremlin Server is mostly just a host for graph instances; at startup it only loads graphs and initializes script engines. You mentioned you were using Neo4j, so I would expect Neo4j to be the bulk of the memory requirements. The size of your graph would lead to different -Xmx requirements, but if I remember correctly the memory only gets consumed once the Neo4j caches are warmed (which is when Neo4j gets fast, with everything in memory). You don't mention what you do on startup - do you warm caches through an init script or anything like that?





--
You received this message because you are subscribed to the Google Groups "Gremlin-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gremlin-users+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/gremlin-users/f433d511-5a1a-4253-94e2-5edb66ba96d8%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

sergio....@aurea.com

unread,
Jun 12, 2017, 9:40:16 AM6/12/17
to Gremlin-users
Thanks for your reply, @Stephen. Yeah, it's really odd.

We are not doing anything special on startup. It is as simple as this:

mkdir -p -m a=rwx /data/databases
unzip -o $ZIP_NAME -d /data/databases/graphdb
chmod -R 0777 /data/databases/graphdb
/bin/bash /gremlin-server/bin/gremlin-server.sh conf/gremlin-server-neo4j.yaml

We think the problem is probably some internal library that is not respecting its limits, or is using the whole host's resources to calculate something. Java has known issues with that, as in https://github.com/docker-library/openjdk/issues/57

Robert Dale

unread,
Jun 12, 2017, 9:49:46 AM6/12/17
to gremli...@googlegroups.com
You need to explicitly set all memory parameters.
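To expand on why "all" matters: a JVM's footprint is more than the heap - Metaspace, thread stacks, the code cache, and direct buffers all sit outside -Xmx and must be capped separately (e.g. -XX:MaxMetaspaceSize, -Xss). A small sketch using plain JMX (nothing Gremlin-specific) that reports heap vs. non-heap usage, which can help show where the extra gigabytes go:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Reports heap vs. non-heap (Metaspace, code cache, ...) usage via JMX.
// Non-heap memory is NOT bounded by -Xmx, so it has to be limited
// with its own flags when the container has a hard memory cap.
public class MemoryReport {
    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        MemoryUsage nonHeap = ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage();
        System.out.printf("heap used:     %d MB%n", heap.getUsed() / (1024 * 1024));
        System.out.printf("non-heap used: %d MB%n", nonHeap.getUsed() / (1024 * 1024));
    }
}
```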


For more options, visit https://groups.google.com/d/optout.
--
Robert Dale

sergio....@aurea.com

unread,
Jun 12, 2017, 9:57:11 AM6/12/17
to Gremlin-users
Hello @Robert,

Which parameters? I'm setting the following for JVM: -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xms128m -Xmx8G 

And for the container, I'm running with reserved memory of 1GB.

Is there any parameter missing?

Thanks!

Kedar Mhaswade

unread,
Jun 12, 2017, 11:49:34 AM6/12/17
to gremli...@googlegroups.com
On Mon, Jun 12, 2017 at 6:57 AM, <sergio....@aurea.com> wrote:
> Hello @Robert,
>
> Which parameters? I'm setting the following for JVM: -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xms128m -Xmx8G
>
> And for the container, I'm running with reserved memory of 1GB.

This requires some Java heap tuning experience and running inside a container does create some complications. Many operational questions also arise. For instance, are you configuring any swap on the instance/container? It is indeed puzzling why this would happen only with huge EC2 instances and not large ones.

For starters, if you know that your Gremlin Server JVM runs fine with 500 MB, why don't you start the JVM on the huge instances with -Xms128m -Xmx1g? That way, you are explicitly making any object allocation on the JVM heap beyond ~1g fail. Another very useful parameter is -XX:+HeapDumpOnOutOfMemoryError, which gives you the heap after the JVM fails to allocate memory. One technique that falls short of full profiling but is worth using is taking thread dumps and then analyzing them with something like Spotify's thread dump analyzer. That way, you can understand what the various threads were doing at the time the JVM went down in flames. Remember to take a few thread dumps when you do this.
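For the thread-dump route, jstack <pid> (or kill -3 <pid>) works from outside the process; if it's easier to instrument the service itself, an in-process sketch like this counts live threads and prints their names, which is usually enough to spot a runaway pool:

```java
import java.util.Map;

// Enumerates live threads from inside the JVM - a cheap way to see
// whether some pool has spun up hundreds of threads on a many-core host.
public class ThreadCensus {
    public static void main(String[] args) {
        Map<Thread, StackTraceElement[]> all = Thread.getAllStackTraces();
        System.out.println("live threads: " + all.size());
        all.keySet().stream()
           .map(Thread::getName)
           .sorted()
           .forEach(name -> System.out.println("  " + name));
    }
}
```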

Regards,
Kedar




Robert Dale

unread,
Jun 12, 2017, 1:24:09 PM6/12/17
to gremli...@googlegroups.com
I think the main difference is in the number of cores. I see you did try `-XX:ParallelGCThreads`, which brought the memory down, and that supports my theory.

`-XX:+PrintFlagsFinal` will tell you what the JVM has configured itself to use. It would be useful to see that output along with the entire gremlin-server startup log.

I think you were on the right track with some lib using up resources. However, it's not memory directly, but thread explosion (which in turn uses memory). If possible, just let Java run unconstrained and see where it ends up - that is, how much memory, how many threads.
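To make the core-count theory concrete: HotSpot's ergonomics derive the parallel GC thread count from the visible CPU count - roughly ncpus when ncpus <= 8, else 8 plus 5/8 of the remainder (an approximation of the documented heuristic, not a quote of the VM source). On a 128-vCPU x1 host that means dozens of GC threads per JVM, each with its own stacks and work buffers, multiplied across 300 containers:

```java
// Approximates HotSpot's default ParallelGCThreads ergonomics:
//   ncpus <= 8 -> ncpus
//   ncpus  > 8 -> 8 + (ncpus - 8) * 5 / 8
public class GcThreadEstimate {
    static int parallelGcThreads(int ncpus) {
        return ncpus <= 8 ? ncpus : 8 + (ncpus - 8) * 5 / 8;
    }

    public static void main(String[] args) {
        // m4.2xlarge exposes 8 vCPUs; an x1 host exposes 128 to every container.
        System.out.println("8 vCPUs   -> " + parallelGcThreads(8));   // 8
        System.out.println("128 vCPUs -> " + parallelGcThreads(128)); // 83
    }
}
```

That jump from 8 to ~83 GC threads per service would be consistent with things behaving fine on m4.2xlarge and blowing up on x1.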



Robert Dale

Robert Dale

unread,
Jun 12, 2017, 1:25:58 PM6/12/17
to gremli...@googlegroups.com
Apparently, ctrl-something sends... let me continue..

If not, it would be interesting to isolate gremlin-server from neo4j and see what happens with each independently. Try gremlin-server with the default conf/gremlin-server.yaml. Also try starting the neo4j standalone server with the memory parameters you have. Again, see where the memory and threads end up. You may need to take continuous dumps during startup.

Robert Dale

On Mon, Jun 12, 2017 at 1:24 PM, Robert Dale <rob...@gmail.com> wrote:
I think the main difference is in the number of cores.  I see you did try using `

-XX:ParallelGCThreads` which brought the memory down which supports my theory.

`-XX:+PrintFlagsFinal` will tell you what the JVM has configured itself to use.  It would be useful to see that output along with the entire gremlin-server start up log.

I think you were on the right track with some lib using up resources.  However, it's not memory directly, but thread explosion (which in turn uses memory).  If possible, just let java run unconstrained and see where it ends up at.  That is, how much memory, how many threads.



Robert Dale

Robert Dale

unread,
Jun 12, 2017, 1:30:47 PM6/12/17
to gremli...@googlegroups.com
You can constrain the number of threads gremlin-server will use with gremlinPool [1]. I don't know if these are eagerly allocated. 
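For reference, gremlinPool lives at the top level of the Gremlin Server YAML; a minimal fragment (the values here are illustrative, not recommendations - check your version's reference docs for defaults):

```yaml
# conf/gremlin-server-neo4j.yaml (fragment)
# gremlinPool: threads available for executing Gremlin scripts
# threadPoolWorker: Netty worker threads handling requests
gremlinPool: 8
threadPoolWorker: 2
```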


Robert Dale
