Control groups and Resource Management notes (part I)

Balbir Singh

Aug 1, 2008, 8:59:35 PM

to Linux Containers

Hi, All,

This is the first part of the resource management and control groups discussion.

I might have made mistakes while taking notes or typing them out, please feel

free to correct them for me or send me corrections.

The notes are really large, so they'll come in installments. This is the first

part of the notes.

Control Groups

==============

1. Multiphase locking - Paul brought up his multi phase locking design and

suggested approaches to implementing it. The problem with control groups

currently is that transactions cannot be atomically committed. If some

transactions fail (can_attach() callback fails or returns error), then there is

no notification sent out to groups that already committed the transaction.

The suggested design includes

- Acquiring locks across callbacks - Balbir opposed this approach

stating that this would make it easier for subsystems to deadlock.

Balbir instead suggested that each callback hold its own lock and

add an undo operation that cannot fail (returns void), since

uncharging usually succeeds. Dave suggested doing undo without holding

any locks.
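
The undo idea from item 1 can be sketched in user-space Python. This is only an illustration under assumptions, not the kernel design that was discussed: the `Subsystem` class, the `undo_attach()` name and the task-count limit are all hypothetical. Each subsystem takes only its own lock, and the undo step returns nothing and cannot fail.

```python
import threading


class Subsystem:
    """Toy stand-in for a cgroup subsystem; every name here is hypothetical."""

    def __init__(self, name, limit):
        self.name = name
        self.limit = limit            # maximum number of tasks accepted
        self.charged = 0              # tasks currently charged
        self._lock = threading.Lock() # per-subsystem lock, never held across callbacks

    def can_attach(self, task):
        # Phase 1: try to charge the task under this subsystem's own lock.
        with self._lock:
            if self.charged >= self.limit:
                return False
            self.charged += 1
            return True

    def undo_attach(self, task):
        # Undo returns nothing and cannot fail: uncharging always succeeds.
        with self._lock:
            self.charged -= 1


def attach_task(subsystems, task):
    """Attach task to every subsystem, or roll back the ones already committed."""
    committed = []
    for subsys in subsystems:
        if not subsys.can_attach(task):
            # Notify the subsystems that already committed, newest first.
            for done in reversed(committed):
                done.undo_attach(task)
            return False
        committed.append(subsys)
    return True
```

With this shape, a failure in a later subsystem leaves the earlier ones exactly as they were before the attach started.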

2. Procs - Balbir and others have asked for an API to move all threads of a

process in one go from one control group to another. The question about doing it

in user space was asked. Doing it in user space is easy, but it can be expensive

(moving all threads one by one - acquiring the cgroup lock and releasing it for

every thread). What happens if another move is requested while a partial move is

in progress? Dave suggested that we have an abstract aggregation so that we

don't need to keep adding interfaces for every aggregation. Balbir mentioned

that the aggregations of interest are processes, process groups and sessions and

the kernel already knows about these (there are data structures to link all

elements together). Abstracting it is a good idea, but hard to implement.

Paul asked what the behaviour should be, if a process being moved has several

threads belonging to different cgroups. The answer that came up was that they

should all be migrated to the destination cgroup.
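
For reference, the user-space approach mentioned in item 2 looks roughly like the sketch below. It assumes a mounted cgroup hierarchy whose directories expose a `tasks` file and a `/proc/<pid>/task` directory listing the thread IDs; the function names and the `proc_root` parameter are made up for illustration.

```python
import os


def thread_ids(pid, proc_root="/proc"):
    """Return the thread IDs of a process, read from <proc_root>/<pid>/task."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    return sorted(int(name) for name in os.listdir(task_dir))


def move_process(pid, cgroup_dir, proc_root="/proc"):
    """Move every thread of pid by writing one TID at a time to 'tasks'."""
    tasks_file = os.path.join(cgroup_dir, "tasks")
    moved = []
    for tid in thread_ids(pid, proc_root):
        # One open/write/close per thread, so the kernel acquires and
        # releases the cgroup lock once for every thread moved.
        with open(tasks_file, "a") as f:
            f.write(f"{tid}\n")
        moved.append(tid)
    return moved
```

The loop is not atomic: a thread created after the directory listing is read is missed, and a second move requested halfway through can interleave with this one, which is the partial-move problem raised above.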

3. Cgroup lock - The cgroup lock is held at various places in the system. The

question is -- is cgroup_lock() becoming the next BKL? Several solutions were

discussed - making the lock per hierarchy or per cgroup or use subsystem locks.

Paul mentioned that cgroups already use RCU.

4. Binary statistics - The question about binary statistics was raised. Since

control groups don't enforce any particular kind of API, is there a way to

generically handle control files and their parameters in the library? Paul

suggested his binary API approach, where every control group and its API is

documented in an api file. Eric suggested using an ASCII interface (since that

is very generic) and using one file per API. Balbir mentioned that this will

lead to too many dentries and the issues related to having an extensive number of dentries.
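
To make the ASCII option in item 4 concrete, here is a minimal sketch of how a library could read control files generically without knowing each subsystem's API. It assumes every file holds either a single value or "key value" lines; the function names are hypothetical and real control files may use other layouts.

```python
import os


def _number(token):
    """Convert a token to int when possible, otherwise keep the string."""
    try:
        return int(token)
    except ValueError:
        return token


def parse_control_file(text):
    """Parse one ASCII control file: a single value or 'key value' lines."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    if len(lines) == 1 and len(lines[0]) == 1:
        return _number(lines[0][0])
    if lines and all(len(fields) == 2 for fields in lines):
        return {key: _number(value) for key, value in lines}
    return text  # unknown layout: hand back the raw text


def read_cgroup(cgroup_dir):
    """Read every readable regular file in a cgroup directory."""
    result = {}
    for name in sorted(os.listdir(cgroup_dir)):
        path = os.path.join(cgroup_dir, name)
        if not os.path.isfile(path):
            continue
        try:
            with open(path) as f:
                result[name] = parse_control_file(f.read())
        except OSError:
            continue  # write-only or otherwise unreadable control file
    return result
```

The cost Balbir pointed out is visible here: one file, and therefore one dentry, per value that the library wants to read.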

5. User space notifications - Kamezawa had requested user space notification

(through inotify) when a control group reaches its memory limit, for example.

The question that was asked was: what happens if no one is listening for

notifications? Denis suggested using a FIFO mechanism. Balbir suggested using

netlink and building stuff on top of cgroupstats. With netlink we can pass

type, value and length of arguments, making it more suitable for this kind of

information exchange. The only concern with netlink is that it can lose

messages. The general consensus was to add one FIFO per control group and use

that for all notifications related to the control group.
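
The type/length/value point about netlink in item 5 can be illustrated with a small encoder that follows the netlink attribute layout: a 16-bit length that covers the header and payload, a 16-bit type, then the payload padded to a 4-byte boundary. The attribute type numbers and the example payloads are assumptions for illustration, not an existing cgroup notification format.

```python
import struct

NLA_HDR = struct.Struct("=HH")  # nla_len, nla_type in native byte order

# Hypothetical attribute types for a "limit reached" notification.
ATTR_CGROUP_PATH = 1
ATTR_USAGE_BYTES = 2


def _align4(n):
    """Round n up to the next multiple of 4."""
    return (n + 3) & ~3


def pack_attr(attr_type, payload):
    """Encode one attribute: header, payload, zero padding to 4 bytes."""
    length = NLA_HDR.size + len(payload)  # length excludes the padding
    padding = b"\0" * (_align4(length) - length)
    return NLA_HDR.pack(length, attr_type) + payload + padding


def unpack_attrs(buf):
    """Decode a buffer of attributes into (type, payload) pairs."""
    attrs = []
    offset = 0
    while offset + NLA_HDR.size <= len(buf):
        length, attr_type = NLA_HDR.unpack_from(buf, offset)
        if length < NLA_HDR.size or offset + length > len(buf):
            raise ValueError("truncated attribute")
        attrs.append((attr_type, buf[offset + NLA_HDR.size:offset + length]))
        offset += _align4(length)
    return attrs
```

A receiver that does not know a given type can skip it using the length field, which is what makes this kind of encoding suitable for extensible notifications.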

Resource management

===================

1. Memory controller - Balbir mentioned that this is best discussed at the

memory controller BoF.

2. Device subsystem was discussed and it was decided that mount (filesystem)

namespace and device namespace are the best places to handle device subsystem

issues.

3. Memrlimit - Balbir discussed the memrlimit controller. Dave and Paul are

opposed to doing any limits based on virtual address space. Balbir mentioned

that it serves several purposes

a. It allows us to control swap usage

b. It allows us to build a generic rlimits infrastructure

c. It allows us to fail applications nicely

Paul mentioned that (c) was not useful since no applications handle it today.

Balbir disagreed, saying that this was not sufficient reason to prevent future

applications from handling malloc()/mmap() failure. Balbir asked why overcommit

accounting was not useful.

There was general agreement that a mlock() controller would be useful.

4. CPU controller - There was a request for a hard limit feature. Peter opposed

the approach, stating that anyone wanting hard limits should use the real time

group scheduler, and that a new EDF scheduler is being implemented. Denis mentioned

that without hard limits it is not possible for a service provider to

decide/plan how much capacity a single CPU can provide. Balbir mentioned that

with hard limits and SLAs, a service provider could, on reaching the hard limit,

save power by hard limiting execution on a CPU that is meeting its SLA

requirements. Peter mentioned that hard limits would make the group scheduler

non-work-conserving.

Peter also updated everyone about the new load balancing patches that will make

it into the next merge window.
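
Peter's work-conserving point in item 4 can be shown with a toy calculation. This is not the group scheduler's algorithm, only arithmetic: the group names, shares, demands and the 100-unit CPU capacity are made-up values.

```python
def work_conserving(shares, demand, capacity=100.0):
    """Proportional shares: capacity one group leaves unused flows to others."""
    alloc = {g: 0.0 for g in shares}
    active = {g for g in shares if demand[g] > 0}
    remaining = capacity
    while active and remaining > 1e-9:
        total = sum(shares[g] for g in active)
        satisfied = set()
        handed_out = 0.0
        for g in active:
            offer = remaining * shares[g] / total
            take = min(offer, demand[g] - alloc[g])
            alloc[g] += take
            handed_out += take
            if demand[g] - alloc[g] <= 1e-9:
                satisfied.add(g)
        remaining -= handed_out
        if not satisfied:
            break  # everyone still wants more: capacity is fully used
        active -= satisfied
    return alloc


def hard_limited(limits, demand):
    """Hard limits: a group never exceeds its cap, even if the CPU is idle."""
    return {g: min(limits[g], demand[g]) for g in limits}


demand = {"a": 20.0, "b": 100.0}
fair = work_conserving({"a": 1, "b": 1}, demand)
capped = hard_limited({"a": 50.0, "b": 50.0}, demand)
idle_with_caps = 100.0 - sum(capped.values())
```

In this example group `b` gets the 30 units that `a` does not need under proportional shares, while with 50/50 hard limits those 30 units stay idle. That idle capacity is both the predictability Denis asked for and the loss of work conservation Peter objected to.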

5. Kernel memory controller - The kernel memory controller was discussed

briefly. Pavel has not been actively working on it. Denis mentioned that it

would be nice to have a network buffer controller as well. The question was raised

whether the kernel memory controller should be merged with the existing memory

controller.

6. Swap subsystem - Daisuke mentioned that the swap subsystem works well for

fundamental operations and that he posted a version of the patch three weeks

ago. The patch controls swap entries to control the swap usage of a control

group. Paul mentioned that Google has a patch internally to link swap files to

cpusets. Balbir asked Serge about his swap namespace patches. The swap namespace

is a different issue altogether (compared to the swap controller). Currently

the swap controller is a part of the memory controller. There has been some

discussion about it being an independent controller.

Warm Regards,

Balbir Singh

Linux Technology Center

IBM, ISTL


Balbir Singh

Aug 1, 2008, 9:11:03 PM

to Linux Containers, linux kernel mailing list, libcg-devel

Here's part II (part I can be found at
https://lists.linux-foundation.org/pipermail/containers/2008-August/012128.html)

Resource management (cont'd)
============================
7. Disk IO controller - There was a general discussion on the various disk IO
controllers:
a. DM - IOBand
b. IO throttle
c. Anticipatory
d. CFQ

It was decided that it would be best for all the stakeholders to work together
and let Jens Axboe and the block layer experts figure out what would be right
for the Linux kernel.

8. Network traffic control - Paul discussed network traffic control and the
approach followed by Google. The existing classifier mechanism can be easily
extended by adding a classifier id (based on the control group). This is used in
combination with netfilters. Balbir mentioned that Thomas Graf was also looking
at something similar and raised the issue of input bandwidth control. Balbir
also pointed people to CKRM where the solution has been implemented. The OpenVZ
and Google teams will post their patches.

9. Network permissions - There was a recommendation to use security hooks for
network permissions. Paul explained that they use permissions with:
a. connect
b. bind
c. accept

The issue of using netlabels was brought up.

10. Freezer subsystem - The freezer system was discussed briefly. Serge
mentioned the patches and wanted to collect feedback (if any) on them.

11. OOM Handler - The OOM handler was discussed in detail. Balbir mentioned
certain shortcomings of the OOM handler
a. Logic - it is based on total_vm, is that the correct metric for
OOMing?
b. Concurrency - it kills several tasks at once

There was a discussion on moving the policy for OOM handling to user space. Paul
described how the OOM handler has been modified at Google to notify user space
when a CPUSet runs out of memory. Balbir asked if OOMing on reaching limits is a
good idea; the general view in the discussion was that it might not be such a good idea.

Control group library
=====================
Dhaval and Balbir introduced libcgroups and the purpose of the library and the
goals. Balbir described on paper what the current design looks like; it consists of

1. API
2. Test framework
3. A configuration subsystem

Dhaval discussed configuration syntax of XML versus home made. The issue of
classification of tasks was brought up. The reason that we want to classify
tasks is that we want them to move at fork/exec time to the correct cgroup so that

1. They don't consume resources in the parent's group
2. The movement is automatic

It was generally agreed upon that the classification should take place in user
space. Eric and others suggested having a wrapper to start the application in
the correct cgroup (wrapper around fork/exec). Dave suggested that one might
even go to the extent of hacking, such that a process is ptraced after fork/exec,
moved to the correct group and resumed. Using SELinux contexts was also recommended.
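
The wrapper idea can be sketched as follows, assuming a mounted cgroup hierarchy where writing a PID to a group's `tasks` file moves that task; the function name and the exit code 127 are illustrative choices, not part of libcgroup. The child moves itself between fork and exec, so the application never runs in the parent's group.

```python
import os
import sys


def run_in_cgroup(cgroup_dir, argv):
    """Fork, move the child into cgroup_dir, then exec argv in the child."""
    pid = os.fork()
    if pid == 0:
        # Child: join the destination group before the application starts.
        try:
            with open(os.path.join(cgroup_dir, "tasks"), "w") as f:
                f.write(str(os.getpid()))
            os.execvp(argv[0], argv)
        finally:
            # Reached only if the move or the exec failed.
            os._exit(127)
    # Parent: wait for the child and report how it exited.
    _, status = os.waitpid(pid, 0)
    exit_code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    return pid, exit_code


if __name__ == "__main__" and len(sys.argv) > 2:
    # Usage: wrapper.py <cgroup directory> <command> [args...]
    _, code = run_in_cgroup(sys.argv[1], sys.argv[2:])
    sys.exit(code)
```

This only covers applications that are started through the wrapper; anything that forks and changes identity later still needs another mechanism.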

Vivek brought up using PAM plugins to do classification; this suggestion was
well received. The decision was to do classification in user space and then
think of kernel space if it cannot be done in user space. Denis suggested that
classification is useful. In OpenVZ they classify all apache children to a
different group. Balbir asked Denis to post their classification infrastructure
as RFC.

Balbir asked for contributions to libcgroup. Libcgroup will affect system design
and both administrators and application administrators. Now is a good time to
get *involved*.

KOSAKI Motohiro

Aug 5, 2008, 3:53:35 AM

to bal...@linux.vnet.ibm.com, kosaki....@jp.fujitsu.com, Linux Containers, linux kernel mailing list, libcg-devel

Hi balbir-san,

Thank you for the nice minutes.
They are very helpful for people who were not invited (including me).

> 10. Freezer subsystem - The freezer system was discussed briefly. Serge
> mentioned the patches and wanted to collect feedback (if any) on them.

Who uses it?

AFAIK the freezer is generally used by HPC people,
but they think MPI processes must be frozen.

Unfortunately, open source MPI implementations use various inter-process
communication methods (e.g. SYSV IPC, sockets, ptrace),

so a general freezer implementation is very difficult.

> 11. OOM Handler - The OOM handler was discussed in detail. Balbir mentioned
> certain shortcomings of the OOM handler
> a. Logic - it is based on total_vm, is that the correct metric for
> OOMing?
> b. Concurrency - it kills several tasks at once
>
> There was a discussion on moving the policy for OOM handling to user space. Paul
> described how the OOM handler has been modified at Google to notify user space
> when a CPUSet runs out of memory. Balbir asked if OOMing on reaching limits is a
> good idea; the general view in the discussion was that it might not be such a good idea.

A cpuset-based limit is (slightly) harder to use.
A memcgroup-based one is better.

In addition, notification on reaching a limit can be very generic.

Various limits (e.g. cpu time, memory usage), various notifications
(e.g. kill process, send signal, inotify), and various targets
(each process in the cgroup or a manager process) can be thought of.

> Control group library
> =====================
> Dhaval and Balbir introduced libcgroups and the purpose of the library and the
> goals. Balbir described on paper what the current design looks like; it consists of
>
> 1. API
> 2. Test framework
> 3. A configuration subsystem
>
> Dhaval discussed configuration syntax of XML versus home made. The issue of
> classification of tasks was brought up. The reason that we want to classify
> tasks is that we want them to move at fork/exec time to the correct cgroup so that

I don't recommend XML, because XML is a tree-based syntax but we want more flexible
classification. I also guess XML reduces human readability.

> 1. They don't consume resources in the parent's group
> 2. The movement is automatic
>
> It was generally agreed upon that the classification should take place in user
> space. Eric and others suggested having a wrapper to start the application in
> the correct cgroup (wrapper around fork/exec). Dave suggested that one might
> even go to the extent of hacking, such that a process is ptraced after fork/exec,
> moved to the correct group and resumed. Using SELinux contexts was also recommended.
>
> Vivek brought up using PAM plugins to do classification; this suggestion was
> well received. The decision was to do classification in user space and then
> think of kernel space if it cannot be done in user space. Denis suggested that
> classification is useful. In OpenVZ they classify all apache children to a
> different group. Balbir asked Denis to post their classification infrastructure
> as RFC.

I'm not sure about this issue,
but I like the PAM approach.

Vivek Goyal

Aug 5, 2008, 9:38:21 AM

to KOSAKI Motohiro, bal...@linux.vnet.ibm.com, Linux Containers, linux kernel mailing list, libcg-devel

Thanks, Balbir, for the nice summary.

Well, it was Rik Van Riel's idea to use PAM plugins so that processes
are put into right user cgroups upon login.

Is PAM-based classification alone sufficient? I noticed a couple of
instances which will bypass PAM. For example:

- If one starts apache with "service httpd start", then httpd threads change
their uid/gid to "apache/apache". But these threads will continue to
run in the cgroup belonging to root and will not go into apache cgroup.

- apache also offers "suexec" tool which execs a CGI script under a
different user than the user who launched the web server. I quickly
grepped the source code of suexec and it does not seem to be using
PAM. That means CGI scripts running under some other user name will
continue to run in the cgroup where apache is running.

I am not sure how many more such corner cases there are. These cases can
either be covered by modifying the application, by using some kind of
wrapper around the application, or by writing a classification daemon.
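
To make the corner cases concrete, here is a minimal sketch of the kind of rule lookup a wrapper or classification daemon could apply whenever a task changes its uid or execs a new program. The rule table, the uid value 48 for the apache user, and the group names are all assumptions for illustration; this is not an existing libcgroup format.

```python
# Each rule: (uid or None for any uid, command or None for any command,
# destination cgroup). The first matching rule wins.
RULES = [
    (48, None, "apache"),      # hypothetical uid of the "apache" user
    (None, "httpd", "apache"), # httpd started by any user, including root
    (0, None, "admin"),        # everything else run by root
]
DEFAULT_CGROUP = "default"


def classify(uid, command, rules=RULES):
    """Return the destination cgroup for a task with this uid and command."""
    for rule_uid, rule_command, cgroup in rules:
        if rule_uid is not None and rule_uid != uid:
            continue
        if rule_command is not None and rule_command != command:
            continue
        return cgroup
    return DEFAULT_CGROUP
```

With rules like these, both the httpd threads that switch to the apache uid and a CGI script started through suexec under that uid would resolve to the `apache` group, provided something re-runs the lookup at the moment the uid changes. A PAM module alone never sees that event.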

Do we really need a classification daemon to cover such cases, or is the wrapper
approach sufficient? I remember somebody at the minisummit mentioning
that it should work without any apache modifications.

Thanks
Vivek

KAMEZAWA Hiroyuki

Aug 5, 2008, 9:01:37 PM

to Vivek Goyal, KOSAKI Motohiro, Linux Containers, libcg-devel, linux kernel mailing list, bal...@linux.vnet.ibm.com

Thanks, too.

> Well, it was Rik Van Riel's idea to use PAM plugins so that processes
> are put into right user cgroups upon login.
>
> Is PAM-based classification alone sufficient? I noticed a couple of
> instances which will bypass PAM. For example:
>
> - If one starts apache with "service httpd start", then httpd threads change
> their uid/gid to "apache/apache". But these threads will continue to
> run in the cgroup belonging to root and will not go into apache cgroup.
>
> - apache also offers "suexec" tool which execs a CGI script under a
> different user than the user who launched the web server. I quickly
> grepped the source code of suexec and it does not seem to be using
> PAM. That means CGI scripts running under some other user name will
> continue to run in the cgroup where apache is running.
>
> I am not sure how many more such corner cases there are. These cases can
> either be covered by modifying the application, by using some kind of
> wrapper around the application, or by writing a classification daemon.
>
> Do we really need a classification daemon to cover such cases, or is the wrapper
> approach sufficient? I remember somebody at the minisummit mentioning
> that it should work without any apache modifications.
>

We can go ahead step by step. I think PAM support is the first step.
The daemon is the second.

1. PAM
2. A daemon for task placement (via netlink ?)

I think developing "a daemon for task placement" is important,
but it cannot be a perfect solution for every situation.

The third step is

3. Modify applications (in newer version of them.)

"should work without any apache modifications" is (maybe) necessary. But for
perfect control, it's not enough. We should provide library support that makes it
easy to modify applications.

I think the development cost for "2" is bigger than for "1" and "3". If "2" is hard,
starting from "1" and adding support functions for "3" is an option.
If support for "3" is ready, someone may start implementing "2" in an easier
way.

Thanks,
-Kame

Vivek Goyal

Aug 6, 2008, 9:06:26 AM

to KAMEZAWA Hiroyuki, KOSAKI Motohiro, Linux Containers, libcg-devel, linux kernel mailing list, bal...@linux.vnet.ibm.com

On Wed, Aug 06, 2008 at 10:05:00AM +0900, KAMEZAWA Hiroyuki wrote:

[..]

A phase-wise approach makes sense. I already have working patches for
the following things.

1. PAM module for placement of tasks
2. Modification of init scripts and a tool "cgclassify" so that at boot up
time "init" and other system services are moved to "admin"'s group.
3. libcgroup API so that applications can use it to place forked children
in the right cgroup before doing exec.
4. A command line tool "execcg" which helps a user launch an application in
a specific "cgroup".

5. A classification daemon (based on netlink as of today. Should move to
cgroup fs based notification mechanism probably.)

I think in phase 1, we can get the first 4 items merged and stabilized and then
work on the daemon in phase 2 (if need be).

One issue with the daemon was raised with respect to containers. It will
also interfere with the placement of container threads, and this is not
desired.

This will have to be worked out.

Thanks
Vivek
