cf app output

Daniel Mikusa

Feb 13, 2014, 5:16:56 PM

to vcap...@cloudfoundry.org

Recently I noticed that the output from “cf app <app>”, particularly the memory information, has changed. Previously when I ran the command for my application, it would report usage around 10M, which fits for my application since it’s very small.

Now as I run “cf app <app>” for the same application it’s reporting that it’s using almost all of the memory that I’ve assigned to it (127.6 of 128M).

Curious as to why it was taking this much memory and suspecting my application was at fault, I fired up bosh-lite (CF v156) and peeked into the warden container to see what was running. After poking around a bit, I could see that it was not using anywhere near the limit of 128M of memory.

Trying to understand this, I took a look at the code and stumbled upon this commit, which seems to explain the situation.

https://github.com/cloudfoundry/dea_ng/commit/e5fd523aaf2cddeda01d4134ecd2a929829c4147

According to the commit message, the reported memory usage is now including both the used memory (RSS) and the cache memory, and while I understand the rationale listed in the commit message, I’m not sure that I agree for the following reasons.

1.) As an application developer, I often run “cf app <app>” to see the stats for my application, in particular the CPU and memory usage. When it comes to memory, the key thing that I want to know is how much memory is currently being used to run my application and how much I have remaining before I hit my memory limit. If this output contains the memory being used by the OS to cache things, which it currently does, then my perspective on what is available to me seems skewed, with it looking like I have less available memory than I actually do.

It is my understanding that while the memory being used by the OS to cache things is technically in use, it is still available to my application should my application need it. In other words, if there is memory pressure then the OS should stop using it to cache things. Thus from an application developer’s perspective, memory used by the cache is not actually in use.

2.) The OS is greedy and will try to cache as much as possible. Because the memory usage stat includes the amount of memory used by the OS to do this, the memory usage reported by “cf app” over time is going to show that an application is almost out of memory (much like it does when you run free on a Linux system that has been running for a while). I feel like this could be confusing and possibly even lead developers to draw invalid conclusions about an application (like it needs more memory or that there’s a memory leak).

Has anyone else noticed this change? If so, what are your thoughts?

Thanks

Dan

Dan Higham

Feb 13, 2014, 5:35:44 PM

to vcap...@cloudfoundry.org

I see the same thing too, Dan. I have a Rails application that CF says
is using 148M of 256M:

requested state: started
instances: 1/1
usage: 256M x 1 instances
urls: rails-console-test.cfapps.io

     state     since                    cpu    memory         disk
#0   running   2014-02-13 11:12:55 AM   0.0%   148M of 256M   108M of 1G

However, on the container itself, if I run ps and free, I see the following;

vcap@17hoi7ahnbh:~$ ps auxww
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0    992   288 ?        S<s  19:12   0:00 wshd: 17hoi7ahnbh
vcap        29  0.0  0.0  19232  1720 ?        S<s  19:12   0:00 /bin/bash
vcap        31  0.0  0.0  10304  3004 ?        S<l  19:12   0:00 ./console-server-linux-amd64 -console-process=bash -main-process=bundle exec rails s -p 8080
vcap        32  0.0  0.0  19232   740 ?        S<   19:12   0:00 /bin/bash
vcap        33  0.0  0.0  19232   744 ?        S<   19:12   0:00 /bin/bash
vcap        34  0.0  0.0   5544   552 ?        S<   19:12   0:00 tee /home/vcap/logs/stdout.log
vcap        36  0.0  0.0   5544   556 ?        S<   19:12   0:00 tee /home/vcap/logs/stderr.log
vcap        40  0.0  0.1 124688 45004 ?        S<l  19:12   0:04 /home/vcap/app/vendor/ruby-1.9.3/bin/ruby bin/rails s -p 8080
vcap        46  0.0  0.0      0     0 ?        Z<s  19:13   0:00 [bash] <defunct>
vcap        63  0.0  0.0      0     0 ?        Z<s  19:18   0:00 [bash] <defunct>
vcap        71  0.0  0.0      0     0 ?        Z<s  19:19   0:00 [bash] <defunct>
vcap        79  0.0  0.0      0     0 ?        Z<s  19:23   0:00 [bash] <defunct>
vcap        82  0.0  0.0  19404  1936 pts/1    S<s+ 20:27   0:00 bash
vcap        85  0.0  0.0      0     0 ?        Z<s  21:24   0:00 [bash] <defunct>
vcap        92  0.0  0.0      0     0 ?        Z<s  21:58   0:00 [bash] <defunct>
vcap        95  0.0  0.0      0     0 ?        Z<s  22:11   0:00 [bash] <defunct>
vcap        99  0.0  0.0      0     0 ?        Z<s  22:12   0:00 [bash] <defunct>
vcap       104  0.0  0.0      0     0 ?        Z<s  22:21   0:00 [bash] <defunct>
vcap       109  0.0  0.0  19404  1936 pts/2    S<s  22:23   0:00 bash
vcap       112  0.0  0.0  15316  1196 pts/2    R<+  22:23   0:00 ps auxww

vcap@17hoi7ahnbh:~$ free
             total       used       free     shared    buffers     cached
Mem:      35129128   32199936    2929192          0    3161816   12730340
-/+ buffers/cache:   16307780   18821348
Swap:     35134148       3200   35130948

The Rails process is using 0.1%! So my guess is that memory is being
used by the staging process and then cached rather than completely
freed up, as you have already mentioned. I agree with you: the memory
usage reported by CF should not include cache.
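
For readers unfamiliar with the "-/+ buffers/cache" row, here is a minimal Python sketch of the arithmetic behind it, using the figures from the free output above (in kilobytes, as free prints them by default). The variable names are illustrative only.

```python
# Figures copied from the `free` output above (kilobytes).
total = 35129128
used = 32199936
free_kb = 2929192
buffers = 3161816
cached = 12730340

# The "-/+ buffers/cache" row treats buffers and page cache as reclaimable:
# it subtracts them from "used" and adds them to "free".
used_excluding_cache = used - buffers - cached
free_including_cache = free_kb + buffers + cached

print(used_excluding_cache)  # 16307780, matches the second row of the output
print(free_including_cache)  # 18821348, matches the second row of the output
```

The two derived numbers add up to the total again, which is the sense in which cache is "in use" but still counted as available.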

Dan

> To unsubscribe from this group and stop receiving emails from it, send an email to vcap-dev+u...@cloudfoundry.org.

--
Kind Regards

Dan Higham
Pivotal Support

Matthew Sykes

Feb 13, 2014, 6:04:07 PM

to vcap...@cloudfoundry.org

Hi. There's been a lot of discussion around memory usage in the warden containers and Glyn and I have been exchanging some notes in the background about it over the past several weeks.

Originally, memory reported was only RSS but, unfortunately, that's not the whole story.

Within a cgroup, the memory subsystem uses RSS+cache to calculate and limit bytes in use. Some of the cache can be evicted while other bits can't. When the cgroup encounters memory pressure, it tries to release the portions of cache that can be evicted. When there's nothing left to evict, it gives up, you reach the limit, and the container falls over.

One interesting portion of the cache is the 'active file' cache. This represents the amount of memory used to cache dirty file data that has not been written to disk. The Linux kernel daemon that's responsible for writing this data hasn't been hooked into cgroups, so it doesn't flush this data when there's pressure within a container. As a result, when new app instances are starting and droplets are unzipped, a lot of this section of the cache gets used up. This is especially true of Java apps since they tend to have large droplets.

Now when you put large droplets together with application runtimes that allocate lots of memory quickly, a bunch of memory gets used that the container's memory control system can't evict. This caused a lot of Java applications to fail during staging or initial start.

When these Java apps started to fail, many people wanted to know what was going on because the information coming back from cf stats indicated they were nowhere near the limit of the container. Unfortunately, they did reach the limit - it just wasn't in the RSS portion of the memory accounting.

What's being reported now is actually the 'bytes in use' value of the container. The calculation could be changed to subtract the 'inactive_file' portion of cache so the stuff that's evictable doesn't show up as in use. I've had mixed feelings about making that change since what's reported now is actual bytes in use. On the other hand, it doesn't accurately represent the headroom you may have.
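
To make the two candidate calculations concrete, here is a minimal Python sketch. It assumes the cgroup v1 memory.stat format of one "name value" pair per line; the sample numbers are hypothetical (chosen to resemble a 128M container that is nearly full) and are not measurements from any real container. The function names are illustrative and are not taken from dea_ng or warden.

```python
def parse_memory_stat(text):
    """Parse 'name value' lines, as found in a cgroup v1 memory.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def bytes_in_use(stats):
    """What is reported now, per the description above: RSS plus cache."""
    return stats["rss"] + stats["cache"]


def bytes_in_use_excluding_evictable(stats):
    """The alternative discussed above: also subtract inactive_file."""
    return bytes_in_use(stats) - stats["inactive_file"]


# Hypothetical sample values, in bytes (an assumption for illustration).
SAMPLE = """\
cache 96468992
rss 37326848
inactive_file 71303168
active_file 25165824
"""

stats = parse_memory_stat(SAMPLE)
limit = 128 * 1024 * 1024

mib = 1024 * 1024
print("rss+cache:          %.1fM of %dM" % (bytes_in_use(stats) / mib, limit / mib))
print("minus inactive_file: %.1fM of %dM"
      % (bytes_in_use_excluding_evictable(stats) / mib, limit / mib))
```

With numbers like these, the first figure sits right at the limit while the second suggests substantial headroom, which is the trade-off described above.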

I'm interested in other opinions.

Matthew Sykes
matthe...@gmail.com

Matthew Sykes

Feb 13, 2014, 6:13:06 PM

to vcap...@cloudfoundry.org

Putting the cache bit aside for a moment, your container usage is more than just the ruby process - it's all of the tasks that are associated with the container. From your data above, it looks like the container has about 57M of RSS.

Also, tools like 'top' and 'free' run off of the data inside the /proc file system that represents the host environment - not the container. This throws a lot of people off. So your ruby process is using 0.1% of your host's memory - not of the container limit.
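
The "about 57M" figure can be reproduced by adding up the RSS column of the ps output quoted earlier in the thread. A minimal Python sketch follows; the values are copied from that output, and ps reports RSS in kilobytes. Note that summing per-process RSS can count shared pages more than once, so this is only a rough estimate.

```python
# PID -> RSS in kilobytes, copied from the ps output earlier in the thread.
# The defunct bash processes all have an RSS of 0 and are omitted.
RSS_KB = {
    1: 288,      # wshd
    29: 1720,    # /bin/bash
    31: 3004,    # console-server-linux-amd64
    32: 740,     # /bin/bash
    33: 744,     # /bin/bash
    34: 552,     # tee stdout.log
    36: 556,     # tee stderr.log
    40: 45004,   # ruby bin/rails s -p 8080
    82: 1936,    # bash (pts/1)
    109: 1936,   # bash (pts/2)
    112: 1196,   # ps auxww
}

total_kb = sum(RSS_KB.values())
print(total_kb)  # 57676
```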

--
Matthew Sykes
matthe...@gmail.com

Dan Higham

Feb 13, 2014, 6:14:49 PM

to vcap...@cloudfoundry.org

Heh, I figured 0.1% was a bit low! Would it not be possible to report
both the container usage and the inactive_file portion?

Glyn Normington

Feb 14, 2014, 4:15:53 AM

to vcap...@cloudfoundry.org

The difference between inactive file and active file appears to be simply how long it has been since the memory was accessed ([1], [2]). Pages are moved between active and inactive lists and those on the inactive list are candidates for writing to disk. So inactive file memory would also seem to contribute towards container oom.

[1] http://linux-mm.org/Swapout (written by an intern, but looks plausible at least at a high level)

[2] http://serverfault.com/questions/132088/what-can-cause-an-increase-in-inactive-memory-and-how-to-reclaim-it (first para of answer)

Glyn Normington

Feb 14, 2014, 4:29:40 AM

to vcap...@cloudfoundry.org

On Thursday, February 13, 2014 11:04:07 PM UTC, Matthew Sykes wrote:

> Hi. There's been a lot of discussion around memory usage in the warden containers and Glyn and I have been exchanging some notes in the background about it over the past several weeks.
>
> Originally, memory reported was only RSS but, unfortunately, that's not the whole story.
>
> Within a cgroup, the memory subsystem uses RSS+cache to calculate and limit bytes in use. Some of the cache can be evicted while other bits can't. When the cgroup encounters memory pressure, it tries to release the portions of cache that can be evicted. When there's nothing left to evict, it gives up, you reach the limit, and the container falls over.
>
> One interesting portion of the cache is the 'active file' cache. This represents the amount of memory used to cache dirty file data that has not been written to disk. The linux kernel daemon that's responsible for writing this data hasn't been hooked into cgroups so it doesn't flush this when there's pressure within a container so when new app instances are starting and droplets are unzipped, a lot of this section of the cache gets used up. This is especially true of java apps since they tend to be large droplets.

This seems to be the nub of the problem and explains why (especially Java) apps have to be given a large memory limit. Do you know what solutions are being investigated in the linux kernel? For instance https://lwn.net/Articles/531077/ proposes that user-space applications should be built to respond to memory pressure in their cgroup, which seems like the wrong direction as Andrew Morton points out (https://lwn.net/Articles/531138/). Any more up to date information would be really useful.

This warden PR https://github.com/cloudfoundry/warden/pull/54 may help as it logs memory.stat and other info when the container hits oom (but before the oom killer runs), so this will provide evidence of whether or not the oom could be avoided, in principle, by the kernel.

Matthew Sykes

Feb 14, 2014, 7:13:28 AM

to vcap...@cloudfoundry.org

> So inactive file memory would also seem to contribute towards container oom

Hi. I also meant to point to 'writeback' for disk (which is part of cache instead of active). Oh well. :)

The point I was trying to make at the end was that, in practice, inactive tends to be evicted under pressure while active is not. If the kernel decides to move pages from inactive to active, that prevents them from being quickly reclaimed under pressure; if they move the other way, they can be evicted.

You can usually see what bits of memory will go away under pressure by telling the kernel to drop caches (echo 1 > /proc/sys/vm/drop_caches). Every time I've ever done this I see inactive_file fall to zero while active_file remains unchanged.
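
One way to check this observation is to snapshot the container's memory counters before and after dropping caches and compare them. The Python sketch below shows only the comparison step; the two snapshots are hypothetical values made up to mirror the behaviour described above, and `changed_fields` is an illustrative helper, not part of any Cloud Foundry tool.

```python
def changed_fields(before, after):
    """Return {name: (before, after)} for every counter whose value changed."""
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }


# Hypothetical snapshots in bytes (assumed values, not real measurements).
before = {"rss": 37326848, "inactive_file": 71303168, "active_file": 25165824}
after = {"rss": 37326848, "inactive_file": 0, "active_file": 25165824}

for name, (old, new) in changed_fields(before, after).items():
    print("%s: %d -> %d" % (name, old, new))
```

In this made-up example only inactive_file changes, which is the pattern reported above.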

So where we are with stats right now shows an accurate representation of bytes used but it doesn't help in determining the head-room an app has. I'm okay with where we are but it has come as a surprise to some.

Daniel Mikusa

Feb 14, 2014, 8:46:39 AM

to vcap...@cloudfoundry.org

Matthew,

Thanks for that fantastic explanation!

You asked for opinions, so here’s mine. After reading your post, I’m not sure that there can be one memory stat to rule them all. If there is that would be awesome, but from what you’re saying it sounds like there could be drawbacks no matter what we report. Given that, my thought would be to just provide more information to our users so they can make their own decisions.

For example:

1.) When running “cf app <app>”, the command would report the same information it does now. However the output would note something like “container memory” or “memory (rss+cache)”, so users can better understand what this “memory” stat is reporting.

2.) Add an option like “cf app --detail <app>” or “cf app --verbose <app>” that would show a breakdown of memory usage, definitely RSS and cache, but possibly also inactive. That way, if the summary doesn’t seem correct, a user can get a breakdown and better understand what’s happening.

Also, it probably goes without saying, but including this in the documentation would help too. Definitely something on the docs website, but it might also be nice to include an explanation of the stats with the output of “cf app” or even with “cf app -h”.

Anyway, there’s my $0.02. Thanks again for the explanation.

Dan

Rohit Kelapure

Feb 14, 2014, 1:53:41 PM

to vcap...@cloudfoundry.org

Matt,

Reading the above description, shouldn't the solution be to hook up the Linux kernel daemon responsible for writing the dirty cache file data to disk to the cgroup subsystem? If this is not possible, my suggestion is to report both actual bytes in use as well as the active and inactive file cache sizes.

-cheers,

Rohit

Stephen Kinder

Feb 20, 2014, 11:10:13 AM

to vcap...@cloudfoundry.org

I really appreciate the detailed responses to @Daniel's post from @Matt and @Glyn.

I think that CF would benefit from giving the consumer an idea of headroom remaining. Consider that vendors that are providing commercial deployments of CF are considering or piloting charging based on "memory". While these models do not represent true utility pricing in that one pays for the entire memory they've reserved and not "used", the customer should have some indication of how close they are getting to consuming their reservation. I suspect most will try to run in the smallest container they can fit given their throughput demands, and they will misinterpret the current results of CF STATS.

I agree with @Matt and @Glyn that reporting remaining headroom may be tricky. Given the current state of Linux kernel support, and given that there are some "interesting" cache eviction delays that can occur, it may not be possible to guarantee a clear "headroom" at any given time. It seems best, as @Daniel suggested, to give more memory detail in the reporting so that at least one could reason about it oneself.

Also, I think it is reasonable that this community help drive better integration of the Linux kernel daemon with cgroups, and at least offer the user stories which make it desirable for the Linux kernel to do more appropriate cache management for the cgroup itself.

Glyn Normington

Feb 21, 2014, 5:58:09 AM

to vcap...@cloudfoundry.org

On 20/02/2014 16:10, Stephen Kinder wrote:
> Also, I think it is also reasonable, that this community help drive
> the better integration of the linux kernel daemon with cgroups, and at
> least offer the user stories which make it desirable for the linux
> kernel to do more appropriate cache management for the cgroup itself.

Why don't we develop a description of the use case here and then, when
we are happy with it, post it to the LKML? If anyone here works with
kernel developers, it might also be possible to submit a patch.

Here's a first draft of the use case - I'm sure it can be improved upon.
I'm a bit nervous discussing approaches to a solution, but that seems
more constructive than simply lobbing in a requirement.

----
Subject: Kernel scanning/freeing to relieve cgroup memory pressure

Currently, a memory cgroup can hit its oom limit when pages could, in
principle, be reclaimed by the kernel except that the kernel does not
respond directly to cgroup-local memory pressure.

A use case where this is important is running a moderately large Java
application in a memory cgroup in a PaaS environment where cost to the
user depends on the memory limit ([1]). Users need to tune the memory
limit to reduce their costs. During application initialisation large
numbers of JAR files are opened (read-only) and read while loading the
application code and its dependencies. This is reflected in a peak of
file cache usage which can push the memory cgroup memory usage
significantly higher than the value actually needed to run the application.

Possible approaches include (1) automatic response to cgroup-local
memory pressure in the kernel, and (2) a kernel API for reclaiming
memory from a cgroup which could be driven under oom notification (with
the oom killer disabled for the cgroup - it would be enabled if the
cgroup was still oom after calling the kernel to reclaim memory).

Clearly (1) is the preferred approach. The closest facility in the
kernel to (2) is to ask the kernel to free pagecache using `echo 1 >
/proc/sys/vm/drop_caches`, but that is too wide-ranging, especially in
a PaaS environment hosting multiple applications. A similar facility
could be provided for a cgroup via a cgroup pseudo-file
`memory.drop_caches`.

Other approaches include a mempressure cgroup ([2]) which would not be
suitable for PaaS applications. See [3] for Andrew Morton's response. A
related workaround ([4]) was included in the 3.6 kernel.

Related discussions:
[1] link to this vcap-dev thread
[2] https://lwn.net/Articles/531077/
[3] https://lwn.net/Articles/531138/
[4] https://lkml.org/lkml/2013/6/6/462 &
https://github.com/torvalds/linux/commit/e62e384e.
----

Thoughts? Edits?

Regards,
Glyn

Stephen Kinder

Feb 25, 2014, 8:05:23 PM

to vcap...@cloudfoundry.org

@Glyn, your write up looks very well thought out. I would be happy to follow any requests for an improvement like this and would lend any technical support I could to help it get through. I have a lot of operating experience (MVS, z/OS) but just a working knowledge of Linux, but willing to help in any way I can.

I guess that leaves a PR for Cloud Foundry. What are we thinking?

Steve

--
Steve

Glyn Normington

Feb 26, 2014, 4:55:06 AM

to vcap...@cloudfoundry.org

On Wednesday, February 26, 2014 1:05:23 AM UTC, Stephen Kinder wrote:

> @Glyn, your write up looks very well thought out. I would be happy to follow any requests for an improvement like this and would lend any technical support I could to help it get through. I have a lot of operating experience (MVS, z/OS) but just a working knowledge of Linux, but willing to help in any way I can.

I was hoping for a bit more feedback and possibly some corrections before approaching the LKML. Anyone else have thoughts on the above draft post?

> I guess that leaves a PR for Cloud Foundry. What are we thinking?

I'm not sure if a PR is feasible without a change in cgroups. `echo 1 > /proc/sys/vm/drop_caches` under the oom notification (with the oom killer turned off) seems too likely to screw up system performance.


Glyn Normington

Apr 2, 2014, 9:58:44 AM

to vcap...@cloudfoundry.org

I posted our requirement to the LKML: https://lkml.org/lkml/2014/4/2/205

Mike Youngstrom

Apr 11, 2014, 11:30:33 AM

to vcap...@cloudfoundry.org

So, what does Tejun's response mean for this issue?


Glyn Normington

Apr 14, 2014, 4:08:48 AM

to vcap...@cloudfoundry.org

Tejun appears to be delegating to the memory resource controller developers. The point of raising the issue was to ensure the relevant developers were aware of it, so hopefully they are now. I'll ask them for some feedback and post back here.

Glyn Normington

Apr 17, 2014, 4:09:04 AM

to vcap...@cloudfoundry.org

It seems there is a possible workaround for the problem, but it was introduced in kernel v3.6 under this commit:

https://github.com/torvalds/linux/commit/e62e384e9da8

and unless this fix was backported into the custom kernel used by warden, we will be missing it. I'll look into this and report back.

The relevant part of the kernel mailing list thread is below - apologies for the poor formatting.

---

On 16/04/2014 10:11, Michal Hocko wrote:
> On Tue 15-04-14 09:38:10, Glyn Normington wrote:
>> On 14/04/2014 21:50, Johannes Weiner wrote:
>>> On Mon, Apr 14, 2014 at 09:11:25AM +0100, Glyn Normington wrote:
>>>> Johannes/Michal
>>>>
>>>> What are your thoughts on this matter? Do you see this as a valid
>>>> requirement?
>>> As Tejun said, memory cgroups *do* respond to internal pressure and
>>> enter targetted reclaim before invoking the OOM killer. So I'm not
>>> exactly sure what you are asking.
>> We are repeatedly seeing a situation where a memory cgroup with a given
>> memory limit results in an application process in the cgroup being killed
>> oom during application initialisation. One theory is that dirty file cache
>> pages are not being written to disk to reduce memory consumption before the
>> oom killer is invoked. Should memory cgroups' response to internal pressure
>> include writing dirty file cache pages to disk?
> This depends on the kernel version. OOM with a lot of dirty pages on
> memcg LRUs was a big problem. Now we are waiting for pages under
> writeback during reclaim which should prevent from such spurious OOMs.
> Which kernel versions are we talking about? The fix (or better said
> workaround) I am thinking about is e62e384e9da8 memcg: prevent OOM with
> too many dirty pages.

Thanks Michal - very helpful!

The kernel version, as reported by uname -r, is 3.2.0-23-generic.

According to https://github.com/torvalds/linux/commit/e62e384e9da8, the above workaround first went into kernel version 3.6, so we should plan to upgrade.

> I am still not sure I understand your setup and the problem. Could you
> describe your setup (what runs where under what limits), please?

I won't waste your time with the details of our setup unless the problem recurs with e62e384e9da8 in place.

Regards,
Glyn
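
The version comparison discussed above can be sketched in Python as follows. This is only a rough check under stated assumptions: the function names are illustrative, and a distribution kernel may carry a backport of the workaround, so an older version number is a hint that the fix may be missing rather than proof.

```python
import re


def kernel_version(release):
    """Extract (major, minor) from a `uname -r` style release string."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def may_lack_workaround(release, first_fixed=(3, 6)):
    """True if the release predates the version that first shipped the fix."""
    return kernel_version(release) < first_fixed


print(may_lack_workaround("3.2.0-23-generic"))  # True
```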
