Hi.
There's been a lot of discussion around memory usage in the warden containers, and Glyn and I have been exchanging some notes in the background about it over the past several weeks.
Originally, memory reported was only RSS but, unfortunately, that's not the whole story.
Within a cgroup, the memory subsystem uses RSS+cache to calculate and limit bytes in use. Some of the cache can be evicted while other bits can't. When the cgroup encounters memory pressure, it tries to release the portions of cache that can be evicted. When there's nothing left to evict, it gives up, you reach the limit, and the container falls over.
One interesting portion of the cache is the 'active file' cache. This represents the amount of memory used to cache dirty file data that has not been written to disk. The Linux kernel daemon that's responsible for writing this data hasn't been hooked into cgroups, so it doesn't flush this when there's pressure within a container. As a result, when new app instances are starting and droplets are unzipped, a lot of this section of the cache gets used up. This is especially true of Java apps since they tend to be large droplets.
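To see how RSS and cache add up for a given container, one option is to read the cgroup's `memory.stat` file. The following is only a rough sketch: it assumes the cgroup v1 `memory.stat` format (one `name value` pair per line, with `rss`, `cache`, `active_file` and `inactive_file` counters), the function names are made up for illustration, and the sample numbers are invented rather than taken from a real container.

```python
def parse_memory_stat(text):
    """Parse 'name value' lines (cgroup v1 memory.stat style) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip anything that is not a simple "name <integer>" pair.
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def summarize(stats):
    """Return RSS, cache, their sum, and the file-cache counters, all in bytes."""
    rss = stats.get("rss", 0)
    cache = stats.get("cache", 0)
    return {
        "rss": rss,
        "cache": cache,
        "rss_plus_cache": rss + cache,  # roughly what counts against the limit
        "active_file": stats.get("active_file", 0),
        "inactive_file": stats.get("inactive_file", 0),
    }


# Invented sample values, for illustration only.
SAMPLE = """cache 209715200
rss 314572800
active_file 157286400
inactive_file 52428800
"""

summary = summarize(parse_memory_stat(SAMPLE))
```

On a real host the text would come from a file such as `/sys/fs/cgroup/memory/<group>/memory.stat`; the exact mount point and group name depend on how the containers are set up, so treat that path as an assumption.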
To unsubscribe from this group and stop receiving emails from it, send an email to vcap-dev+unsubscribe@cloudfoundry.org.
@Glyn, your write-up looks very well thought out. I would be happy to follow any requests for an improvement like this and would lend any technical support I could to help it get through. I have a lot of operating experience (MVS, z/OS) but only a working knowledge of Linux; still, I'm willing to help in any way I can.
I guess that leaves a PR for Cloud Foundry. What are we thinking?
Steve
On Fri, Feb 21, 2014 at 5:58 AM, Glyn Normington <gnorm...@gopivotal.com> wrote:
On 20/02/2014 16:10, Stephen Kinder wrote:
> Also, I think it is reasonable that this community help drive the better integration of the Linux kernel daemon with cgroups, and at least offer the user stories which make it desirable for the Linux kernel to do more appropriate cache management for the cgroup itself.
Why don't we develop a description of the use case here and then, when we are happy with it, post it to the LKML? If anyone here works with kernel developers, it might also be possible to submit a patch.
Here's a first draft of the use case - I'm sure it can be improved upon. I'm a bit nervous discussing approaches to a solution, but that seems more constructive than simply lobbing in a requirement.
----
Subject: Kernel scanning/freeing to relieve cgroup memory pressure
Currently, a memory cgroup can hit its oom limit when pages could, in principle, be reclaimed by the kernel except that the kernel does not respond directly to cgroup-local memory pressure.
A use case where this is important is running a moderately large Java application in a memory cgroup in a PaaS environment where cost to the user depends on the memory limit ([1]). Users need to tune the memory limit to reduce their costs. During application initialisation, large numbers of JAR files are opened (read-only) and read while loading the application code and its dependencies. This is reflected in a peak of file cache usage which can push the memory cgroup's memory usage significantly higher than the value actually needed to run the application.
Possible approaches include (1) automatic response to cgroup-local memory pressure in the kernel, and (2) a kernel API for reclaiming memory from a cgroup which could be driven under oom notification (with the oom killer disabled for the cgroup - it would be enabled if the cgroup was still oom after calling the kernel to reclaim memory).
Clearly (1) is the preferred approach. The closest facility in the kernel to (2) is to ask the kernel to free pagecache using `echo 1 > /proc/sys/vm/drop_caches`, but that is too wide-ranging, especially in a PaaS environment hosting multiple applications. A similar facility could be provided for a cgroup via a cgroup pseudo-file `memory.drop_caches`.
Other approaches include a mempressure cgroup ([2]) which would not be suitable for PaaS applications. See [3] for Andrew Morton's response. A related workaround ([4]) was included in the 3.6 kernel.
Related discussions:
[1] link to this vcap-dev thread
[2] https://lwn.net/Articles/531077/
[3] https://lwn.net/Articles/531138/
[4] https://lkml.org/lkml/2013/6/6/462 & https://github.com/torvalds/linux/commit/e62e384e.
----
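For reference, the host-wide facility mentioned in the draft can be sketched as below. This is a minimal illustration, not a recommendation: the function name and its parameters are made up here, writing to `/proc/sys/vm/drop_caches` needs root, and it affects every application on the host, which is exactly the objection raised in the draft. The per-cgroup `memory.drop_caches` file is only a proposal in this thread, so the sketch does not assume it exists.

```python
import os

# The real, host-wide sysctl file. A per-cgroup equivalent is only proposed
# in this thread and is NOT assumed to exist.
GLOBAL_DROP_CACHES = "/proc/sys/vm/drop_caches"


def drop_pagecache(path=GLOBAL_DROP_CACHES, sync_first=True):
    """Write "1" to `path`, asking the kernel to free clean pagecache.

    Dirty pages cannot be dropped, so by default os.sync() is called first
    to get them written back to disk.
    """
    if sync_first:
        os.sync()
    with open(path, "w") as f:
        f.write("1\n")
```

Calling `drop_pagecache()` as root is equivalent to `sync; echo 1 > /proc/sys/vm/drop_caches`. The `path` parameter exists only so the function can be pointed at a scratch file when experimenting.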
Thoughts? Edits?
Regards,
Glyn
--
Steve
I posted our requirement to the LKML: https://lkml.org/lkml/2014/4/2/205
Thanks Michal - very helpful!
> On Tue 15-04-14 09:38:10, Glyn Normington wrote:
> > On 14/04/2014 21:50, Johannes Weiner wrote:
> > > On Mon, Apr 14, 2014 at 09:11:25AM +0100, Glyn Normington wrote:
> > > > Johannes/Michal
> > > > What are your thoughts on this matter? Do you see this as a valid
> > > > requirement?
> > > As Tejun said, memory cgroups *do* respond to internal pressure and
> > > enter targeted reclaim before invoking the OOM killer. So I'm not
> > > exactly sure what you are asking.
> > We are repeatedly seeing a situation where a memory cgroup with a given
> > memory limit results in an application process in the cgroup being killed
> > oom during application initialisation. One theory is that dirty file cache
> > pages are not being written to disk to reduce memory consumption before the
> > oom killer is invoked. Should memory cgroups' response to internal pressure
> > include writing dirty file cache pages to disk?
> This depends on the kernel version. OOM with a lot of dirty pages on
> memcg LRUs was a big problem. Now we are waiting for pages under
> writeback during reclaim which should prevent such spurious OOMs.
> Which kernel versions are we talking about? The fix (or better said
> workaround) I am thinking about is e62e384e9da8 memcg: prevent OOM with
> too many dirty pages.
I won't waste your time with the details of our setup unless the problem recurs with e62e384e9da8 in place.
> I am still not sure I understand your setup and the problem. Could you
> describe your setup (what runs where under what limits), please?
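Since the answer hinges on which kernel is running, a quick first check is to compare the running kernel's version against 3.6, the release the draft above says included the workaround. The sketch below is a rough heuristic only: the function names are invented for illustration, and distribution kernels may backport fixes (or carry other changes), so a version comparison cannot prove that e62e384e9da8 is or is not present.

```python
import re


def kernel_version_tuple(release):
    """Extract (major, minor) from a release string such as '3.2.0-58-generic'."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if not m:
        raise ValueError("unrecognised kernel release: %r" % release)
    return (int(m.group(1)), int(m.group(2)))


def likely_has_dirty_page_workaround(release, first_release=(3, 6)):
    """Heuristic: True if the release is at or after the first mainline
    release said to contain the workaround. Backports are not detected."""
    return kernel_version_tuple(release) >= first_release
```

On a Linux host the release string can be obtained with `platform.release()` or `uname -r`, for example `likely_has_dirty_page_workaround(platform.release())`.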