This is the first part of my notes from the resource management and control
groups discussion. I might have made mistakes while taking notes or typing them
out, so please feel free to correct them or send me corrections. The notes are
long, so they will come in installments.
Control Groups
==============
1. Multiphase locking - Paul brought up his multiphase locking design and
suggested approaches to implementing it. The problem with control groups
currently is that transactions cannot be committed atomically. If part of a
transaction fails (a can_attach() callback fails or returns an error), no
notification is sent to the groups that have already committed the transaction.
The suggested design includes
- Acquiring locks across callbacks - Balbir opposed this approach,
stating that it would make it easier for subsystems to deadlock.
Balbir instead suggested that each callback hold its own lock and
that an undo operation be added that cannot fail (returns void), since
uncharging usually succeeds. Dave suggested doing the undo without
holding any locks.
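A minimal sketch of the undo idea from item 1, written in Python rather than kernel C. The names `Subsystem`, `cancel_attach` and `attach_task`, and the simple charge counter, are hypothetical and chosen only for illustration. The sketch shows each subsystem taking its own lock inside its callback, and a rollback path that returns nothing and is written so that it cannot fail.

```python
import threading


class Subsystem:
    """Hypothetical stand-in for a cgroup subsystem (illustration only)."""

    def __init__(self, name, limit):
        self.name = name
        self.limit = limit
        self.charged = 0
        self.lock = threading.Lock()  # each subsystem holds its own lock

    def can_attach(self, cost):
        """Phase 1: reserve resources. May fail by returning False."""
        with self.lock:
            if self.charged + cost > self.limit:
                return False
            self.charged += cost
            return True

    def cancel_attach(self, cost):
        """Undo of phase 1. Returns nothing; uncharging always succeeds."""
        with self.lock:
            self.charged -= cost


def attach_task(subsystems, cost):
    """Attach to every subsystem, or roll back the ones that already agreed."""
    committed = []
    for subsys in subsystems:
        if not subsys.can_attach(cost):
            # Notify the subsystems that already committed, newest first.
            for done in reversed(committed):
                done.cancel_attach(cost)
            return False
        committed.append(subsys)
    return True
```

No lock is held across two callbacks here, so one subsystem cannot deadlock against another through this path.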
2. Procs - Balbir and others have asked for an API to move all threads of a
process from one control group to another in one go. The question of doing it
in user space was raised. Doing it in user space is easy, but it can be
expensive (moving the threads one by one means acquiring and releasing the
cgroup lock for every thread). What happens if another move is requested while a
partial move is in progress? Dave suggested that we have an abstract aggregation
so that we don't need to keep adding interfaces for every aggregation. Balbir
mentioned that the aggregations of interest are processes, process groups and
sessions, and that the kernel already knows about these (there are data
structures that link all the elements together). Abstracting it is a good idea,
but hard to implement.
Paul asked what the behaviour should be if a process being moved has several
threads belonging to different cgroups. The answer that came up was that they
should all be migrated to the destination cgroup.
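For reference, the user space approach from item 2 can be sketched as follows. This is an illustration under assumptions, not a proposed interface: the function names are hypothetical, `tasks_path` is assumed to be the `tasks` file of the destination cgroup, and the `proc_root` parameter exists only so the sketch can be exercised against a fake directory tree.

```python
import os


def list_thread_ids(pid, proc_root="/proc"):
    """Return the thread IDs of a process, read from <proc_root>/<pid>/task."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    return sorted(int(name) for name in os.listdir(task_dir) if name.isdigit())


def move_process(pid, tasks_path, proc_root="/proc"):
    """Move every thread of `pid` by writing one thread ID per write.

    Each write is a separate request to the kernel, which is the cost
    discussed above. Threads created between the listing and the last
    write are missed, so the move is not atomic.
    """
    moved = []
    for tid in list_thread_ids(pid, proc_root):
        with open(tasks_path, "a") as tasks:
            tasks.write(f"{tid}\n")
        moved.append(tid)
    return moved
```

The window between listing the threads and finishing the writes is exactly where a concurrent move request or a newly created thread causes trouble.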
3. Cgroup lock - The cgroup lock is held at various places in the system. The
question is -- is cgroup_lock() becoming the next BKL? Several solutions were
discussed: making the lock per hierarchy or per cgroup, or using subsystem locks.
Paul mentioned that cgroups already use RCU.
4. Binary statistics - The question of binary statistics was raised. Since
control groups don't enforce any particular kind of API, is there a way to
handle control files and their parameters generically in the library? Paul
suggested his binary API approach, where every control group and its API is
documented in an api file. Eric suggested using an ASCII interface (since that
is very generic) with one file per API. Balbir mentioned that this would
lead to too many dentries and the issues that come with a large number of
dentries.
5. User space notifications - Kamezawa had requested user space notification
(through inotify) when, for example, a control group reaches its memory limit.
One question asked was what happens if no one is listening for
notifications. Denis suggested using a FIFO mechanism. Balbir suggested using
netlink and building on top of cgroupstats. With netlink we can pass the
type, length and value of arguments, making it more suitable for this kind of
information exchange. The only concern with netlink is that it can lose
messages. The general consensus was to add one FIFO per control group and use
that for all notifications related to the control group.
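To make the type/length/value point from item 5 concrete, here is a small sketch of such an encoding. It is loosely modelled on the netlink attribute layout (a 16-bit length covering header plus payload, a 16-bit type, padding to 4 bytes), but it is only an illustration: the attribute type numbers and function names are hypothetical and not taken from any kernel header.

```python
import struct

HEADER = struct.Struct("=HH")  # length (header + payload), then type
ALIGN = 4

# Hypothetical attribute types, not taken from any kernel header.
ATTR_CGROUP_PATH = 1
ATTR_MEMORY_USAGE = 2


def pack_attr(attr_type, payload):
    """Encode one attribute as length, type, payload, padded to 4 bytes."""
    length = HEADER.size + len(payload)
    padding = -length % ALIGN
    return HEADER.pack(length, attr_type) + payload + b"\0" * padding


def unpack_attrs(buffer):
    """Decode back-to-back attributes into a list of (type, payload) pairs."""
    attrs = []
    offset = 0
    while offset + HEADER.size <= len(buffer):
        length, attr_type = HEADER.unpack_from(buffer, offset)
        if length < HEADER.size or offset + length > len(buffer):
            raise ValueError("truncated or corrupt attribute")
        attrs.append((attr_type, buffer[offset + HEADER.size:offset + length]))
        offset += (length + ALIGN - 1) // ALIGN * ALIGN  # skip the padding
    return attrs
```

A self-describing format like this lets a listener skip attribute types it does not understand, which is harder with free-form text written to a FIFO.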
Resource management
===================
1. Memory controller - Balbir mentioned that this is best discussed at the
memory controller BoF.
2. Device subsystem - This was discussed and it was decided that the mount
(filesystem) namespace and the device namespace are the best places to handle
device subsystem issues.
3. Memrlimit - Balbir discussed the memrlimit controller. Dave and Paul are
opposed to any limits based on virtual address space. Balbir mentioned
that it serves several purposes:
a. It allows us to control swap usage
b. It allows us to build a generic rlimits infrastructure
c. It allows us to fail applications nicely
Paul mentioned that (c) was not useful since no applications handle it today.
Balbir disagreed, saying that this was not sufficient reason to prevent future
applications from handling malloc()/mmap() failure. Balbir also asked why
overcommit accounting was not useful.
There was general agreement that an mlock() controller would be useful.
4. CPU controller - There was a request for a hard limit feature. Peter opposed
the approach, stating that anyone wanting hard limits should use the real-time
group scheduler, and that a new EDF scheduler is being implemented. Denis
mentioned that without hard limits it is not possible for a service provider to
decide or plan how much capacity a single CPU can provide. Balbir mentioned that
with hard limits and SLAs, a service provider could save power by hard limiting
execution on a CPU that is already meeting its SLA requirements. Peter mentioned
that hard limits would make the group scheduler non-work-conserving.
Peter also updated everyone about the new load balancing patches that will make
it into the next merge window.
5. Kernel memory controller - The kernel memory controller was discussed
briefly. Pavel has not been actively working on it. Denis mentioned that it
would be nice to have a network buffer controller as well. The question was
raised whether the kernel memory controller should be merged with the existing
memory controller.
6. Swap subsystem - Daisuke mentioned that the swap subsystem works well for
fundamental operations and that he posted a version of the patch three weeks
ago. The patch controls swap entries to control the swap usage of a control
group. Paul mentioned that Google has a patch internally to link swap files to
cpusets. Balbir asked Serge about his swap namespace patches. The swap namespace
is a different issue altogether (compared to the swap controller). Currently
the swap controller is a part of the memory controller. There has been some
discussion about making it an independent controller.
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
Resource management (cont'd)
============================
7. Disk IO controller - There was a general discussion of the various disk IO
controllers:
a. DM - IOBand
b. IO throttle
c. Anticipatory
d. CFQ
It was decided that it would be best for all the stakeholders to work together
and let Jens Axboe and the block layer experts figure out what would be right
for the Linux kernel.
8. Network traffic control - Paul discussed network traffic control and the
approach followed by Google. The existing classifier mechanism can be easily
extended by adding a classifier id (based on the control group). This is used in
combination with netfilter. Balbir mentioned that Thomas Graf was also looking
at something similar and raised the issue of input bandwidth control. Balbir
also pointed people to CKRM, where the solution has been implemented. The OpenVZ
and Google teams will post their patches.
9. Network permissions - There was a recommendation to use security hooks for
network permissions. Paul explained which operations they apply permissions to:
a. connect
b. bind
c. accept
The issue of using netlabels was brought up.
10. Freezer subsystem - The freezer system was discussed briefly. Serge
mentioned the patches and wanted to collect feedback (if any) on them.
11. OOM Handler - The OOM handler was discussed in detail. Balbir mentioned
certain shortcomings of the OOM handler:
a. Logic - it is based on total_vm; is that the correct metric for
OOMing?
b. Concurrency - it kills several tasks at once
There was a discussion on moving the policy for OOM handling to user space. Paul
described how the OOM handler has been modified at Google to notify user space
when a cpuset runs out of memory. Balbir asked if OOMing on reaching limits is a
good idea; the general feeling was that it might not be.
Control group library
=====================
Dhaval and Balbir introduced libcgroup, its purpose and its goals. Balbir
described on paper what the current design looks like. It consists of
1. API
2. Test framework
3. A configuration subsystem
Dhaval discussed the configuration syntax: XML versus a home-made format. The
issue of classification of tasks was brought up. The reason we want to classify
tasks is that we want them to move to the correct cgroup at fork/exec time, so that
1. They don't consume resources in the parent's group
2. The movement is automatic
It was generally agreed that the classification should take place in user
space. Eric and others suggested having a wrapper to start the application in
the correct cgroup (a wrapper around fork/exec). Dave suggested that one might
even go to the extent of a hack in which a process is ptraced after fork/exec,
moved to the correct group and resumed. Using SELinux contexts was also recommended.
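A minimal sketch of the fork/exec wrapper idea, assuming the destination cgroup is selected by writing a PID to its `tasks` file. The function name `run_in_cgroup` is hypothetical and this is not the libcgroup API. The child places itself in the cgroup after fork and before exec, so the program never runs in the parent's group.

```python
import os


def run_in_cgroup(tasks_path, argv):
    """Fork, move the child into a cgroup, then exec `argv` in the child.

    Returns (child_pid, exit_code). exit_code is -1 if the child was
    killed by a signal and 127 if placement or exec failed.
    """
    pid = os.fork()
    if pid == 0:
        try:
            # Child: join the destination cgroup before exec.
            with open(tasks_path, "a") as tasks:
                tasks.write(f"{os.getpid()}\n")
            os.execvp(argv[0], argv)
        finally:
            # Only reached if placement or exec failed; never return
            # into the caller's code from the child.
            os._exit(127)
    _, status = os.waitpid(pid, 0)
    code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    return pid, code
```

Usage would look like `run_in_cgroup("/path/to/cgroup/tasks", ["httpd"])`, where the path is a placeholder for wherever the hierarchy is mounted.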
Vivek brought up using PAM plugins to do classification; this suggestion was
well received. The decision was to do classification in user space and to
consider kernel space only if it cannot be done in user space. Denis said that
classification is useful: in OpenVZ they classify all apache children into a
different group. Balbir asked Denis to post their classification infrastructure
as an RFC.
Balbir asked for contributions to libcgroup. libcgroup will affect system design,
and both administrators and application administrators. Now is a good time to
get *involved*.
Thank you for the nice minutes.
They are very helpful for people who were not invited (including me).
> 10. Freezer subsystem - The freezer system was discussed briefly. Serge
> mentioned the patches and wanted to collect feedback (if any) on them.
Who uses it?
AFAIK the freezer is used mostly by HPC people,
and they need MPI processes to be frozen.
Unfortunately, open source MPI implementations use various inter-process
communication methods (e.g. SysV IPC, sockets, ptrace),
so a general freezer implementation is very difficult.
> 11. OOM Handler - The OOM handler was discussed in detail. Balbir mentioned
> certain short comings of the OOM handler
> a. Logic - it is based on total_vm, is that the correct metric for
> OOMing?
> b. Concurrency - it kills several tasks at once
>
> There was a discussion on moving the policy for OOM handling to user space. Paul
> described how the OOM handler has been modified at google to notify user space
> when a CPUSet runs out of memory. Balbir asked if OOMing on reaching limits is a
> good idea, it was generally discussed that it might not be such a good idea.
A cpuset-based limit is (slightly) hard to use.
A memcgroup-based one is better.
In addition, notification on reaching a limit can be made very generic:
various limits (e.g. CPU time, memory usage), various notifications
(e.g. kill the process, send a signal, inotify) and various targets
(each process in the cgroup, or a manager process) can be considered.
> Control group library
> =====================
> Dhaval and Balbir introduced libcgroups and the purpose of the library and the
> goals. Balbir described on paper what the current design looks like, it consists of
>
> 1. API
> 2. Test framework
> 3. A configuration subsystem
>
> Dhaval discussed configuration syntax of XML versus home made. The issue of
> classification of tasks was brought up. The reason that we want to classify
> tasks is that we want them to move at fork/exec time to the correct cgroup so that
I don't recommend XML, because XML is a tree-based syntax but we want more flexible
classification. I also think XML reduces human readability.
> 1. They don't consume resources in the parents group
> 2. The movement is automatic
>
> It was generally agreed upon that the classification should take place in user
> space. Eric and others suggested having a wrapper to start the application in
> the correct cgroup (wrapper around fork/exec). Dave suggested that one might
> even go the extent of hacking, such that a process is ptraced after fork/exec,
> moved to the correct group and resumed. Using SELinux contexts was also recommended.
>
> Vivek brought up using PAM plugins to do classifications, this suggestion was
> nicely received. The decision was to do classification in user space and then
> think of kernel space if it cannot be done in user space. Denis suggested that
> classification is useful. In OpenVZ they classify all apache children to a
> different group. Balbir asked Denis to post their classification infrastructure
> as RFC.
I'm not sure about this issue,
but I like the PAM approach.
Thanks, Balbir, for the nice summary.
Well, it was Rik Van Riel's idea to use PAM plugins so that processes
are put into the right user cgroups upon login.
Is PAM-based classification alone sufficient? I noticed a couple of
cases that bypass PAM. For example:
- If one starts apache with "service httpd start", the httpd threads change
their uid/gid to "apache/apache". But these threads will continue to
run in the cgroup belonging to root and will not go into the apache cgroup.
- apache also offers the "suexec" tool, which execs a CGI script under a
different user than the one who launched the web server. I quickly
grepped through the source code of suexec and it does not seem to use
PAM. That means CGI scripts running under some other user name will
continue to run in the cgroup where apache is running.
I am not sure how many more such corner cases there are. These cases can
be covered by modifying the application, by using some kind of
wrapper around the application, or by writing a classification daemon.
Do we really need a classification daemon to cover such cases, or is the wrapper
approach sufficient? I remember somebody at the minisummit mentioning
that it should work without any apache modifications.
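Whichever of these mechanisms is used, it needs some mapping from a task's credentials to a destination cgroup. A minimal sketch of such a lookup is below; the rule table, the uid/gid numbers and the cgroup names are all hypothetical, and the first-match-wins policy is just one possible choice.

```python
WILDCARD = None  # matches any uid or gid

# Hypothetical rules: (uid, gid, destination cgroup). First match wins.
RULES = [
    (48, WILDCARD, "/apache"),
    (WILDCARD, 100, "/users"),
    (0, WILDCARD, "/admin"),
]


def classify(uid, gid, rules=RULES, default="/"):
    """Return the cgroup for a task with the given credentials."""
    for rule_uid, rule_gid, cgroup in rules:
        uid_ok = rule_uid is WILDCARD or rule_uid == uid
        gid_ok = rule_gid is WILDCARD or rule_gid == gid
        if uid_ok and gid_ok:
            return cgroup
    return default
```

A PAM module would call such a lookup once at login. The httpd and suexec cases above need it to run again when the uid/gid changes, which is what a wrapper or a daemon would have to provide.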
Thanks
Vivek
Thanks, too.
> Well, it was Rik Van Riel's idea to use PAM plugins so that processes
> are put into right user cgroups upon login.
>
> Is pam based classification alone is sufficient? I noticed couple of
> instances which will avoid pam. For example.
>
> - If one starts apache "service httpd start", then httpd threads change
> their uid/gid to "apache/apache". But these threads will continue to
> run in the cgroup belonging to root and will not go into apache cgroup.
>
> - apache also offers "suexec" tool which execs a CGI script under a
> different user than the user who has launched web server. I quickly
> grepped for source code of suexec and it does not seem to be using
> pam. That means CGI scripts running under some other user name will
> continue to run in cgroup where apache is running.
>
> I am not sure how many more such corner cases are there. These cases can
> either be covered by modification of application or using some kind of
> wrapper around application or by writing classification daemon.
>
> Do we really need classification daemon to cover such cases or wrapper
> approach is sufficient? I remember somebody in minisummit was mentioning
> that it should work without any apache modifications.
>
We can go ahead step by step. I think PAM support is the first step.
The daemon is the second.
1. PAM
2. A daemon for task placement (via netlink?)
I think developing "a daemon for task placement" is important,
but it cannot be a perfect solution for every situation.
The third step is
3. Modify applications (in newer versions of them).
"Should work without any apache modifications" is (maybe) necessary. But for
perfect control, it's not enough. We should support a way to modify
applications easily, in the library.
I think the development cost for "2" is bigger than for "1" and "3". If "2" is
hard, starting from "1" and adding support functions for "3" is an option.
Once support for "3" is ready, someone may be able to implement "2" more
easily.
Thanks,
-Kame
[..]
A phase-wise approach makes sense. I already have working patches for
the following things.
1. A PAM module for placement of tasks
2. Modification of init scripts and a tool, "cgclassify", so that at boot
time "init" and other system services are moved to the "admin" group.
3. libcgroup API so that applications can place forked children
in the right cgroup before doing exec.
4. A command line tool, "execcg", which helps a user launch an application in
a specific cgroup.
5. A classification daemon (based on netlink as of today; it should probably
move to a cgroup fs based notification mechanism).
I think in phase 1 we can get the first 4 items merged and stabilized and then
work on the daemon in phase 2 (if need be).
One issue with the daemon was raised with respect to containers: it would
also interfere with the placement of container threads, which is not
desired.
This will have to be worked out.
Thanks
Vivek