/etc/cgconfig.conf:
mount {
    cpuset = /cgroup/cpuset;
    cpu = /cgroup/cpu;
#   cpuacct = /cgroup/cpuacct;
    memory = /cgroup/memory;
#   devices = /cgroup/devices;
#   freezer = /cgroup/freezer;
#   net_cls = /cgroup/net_cls;
#   blkio = /cgroup/blkio;
}

group regular_users {
    cpu {
        cpu.shares=100;
    }
    cpuset {
        cpuset.cpus=4-19;
        cpuset.mems=0-1;
    }
    memory {
        memory.limit_in_bytes=48G;
        memory.soft_limit_in_bytes=48G;
        memory.memsw.limit_in_bytes=60G;
    }
}

template regular_users/%U {
    cpu {
        cpu.shares=100;
    }
    cpuset {
        cpuset.cpus=4-19;
        cpuset.mems=0-1;
    }
    memory {
        memory.limit_in_bytes=4G;
        memory.soft_limit_in_bytes=2G;
        memory.memsw.limit_in_bytes=6G;
    }
}

/etc/cgrules.conf:
#
# Include an explicit rule for root, otherwise commands with
# the setuid bit set on them will inherit the original user's
# gid and probably wind up under @everyone:
#
root cpuset,cpu,memory /
#
# sysadmin
#
user1 cpuset,cpu,memory /
user2 cpuset,cpu,memory /
#
# sysstaff
#
user3 cpuset,cpu,memory regular_users/
user4 cpuset,cpu,memory regular_users/
#
# workgroups:
#
@everyone cpuset,cpu,memory regular_users/%U/
@group1 cpuset,cpu,memory regular_users/%U/
@group2 cpuset,cpu,memory regular_users/%U/
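A quick way to check that the per-user template above is actually being applied is to read the limits back from the mounted hierarchy. The following is a minimal sketch (not part of the original configuration), assuming the cgroup v1 mount points from the cgconfig.conf above (/cgroup/memory) and that cgconfig/cgred have already created one regular_users/<name> directory per user; names and paths are illustrative:

import os

# Mount point and group path taken from the cgconfig.conf above (assumption:
# cgroup v1, cgconfig/cgred running, per-user directories already created).
CGROUP_MEM = "/cgroup/memory/regular_users"

def read_int(path):
    # Each of these cgroup control files holds a single integer value.
    with open(path) as f:
        return int(f.read().strip())

for name in sorted(os.listdir(CGROUP_MEM)):
    cg = os.path.join(CGROUP_MEM, name)
    if not os.path.isdir(cg):
        continue
    limit = read_int(os.path.join(cg, "memory.limit_in_bytes"))
    usage = read_int(os.path.join(cg, "memory.usage_in_bytes"))
    print("%-16s limit=%6d MiB  usage=%6d MiB"
          % (name, limit // 2**20, usage // 2**20))

Each per-user directory created from the regular_users/%U template should report a 4G hard limit, while the regular_users group as a whole is capped at 48G.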
Hi Manuel,

A possible workaround is to configure a per-user cgroups limit on the frontend node so that a single user cannot allocate more than 1GB of RAM (or whatever value you prefer). The user would still be able to abuse the machine, but as soon as their memory usage goes above the limit their job will be killed by cgroups, and this should not affect the users behaving correctly too much.

In any case, the best solution I know of is a non-technical one. When a user abuses the system we close the account. They quickly send an email asking what happened and why they cannot log in, and we reply that, since they abused the system, we won't reopen the account until their boss contacts us and asks us to. Once the user has had to explain the "problem" to his/her boss, they don't abuse the system again ;)

regards,
Pablo.
I think cgroups is probably more elegant ... but here is another script:
https://github.com/FredHutch/IT/blob/master/py/loadwatcher.py#L59
The email text is hard-coded, so please change it before using. We put this in place in Oct 2017, when things were getting out of control because folks were using much more multithreaded software than before. Since then we have had 95 users removed from one of the login nodes and several hundred warnings sent. The 'killall -9 -v -g -u username' command has been very effective. We have 3 login nodes with 28 cores and almost 400G RAM.
Dirk
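For reference, here is a minimal sketch of the watch-and-kill approach Dirk describes, not the actual loadwatcher.py linked above. The 400 % threshold matches the notification below; the exempt list and dry-run flag are illustrative assumptions:

import subprocess
from collections import defaultdict

CPU_LIMIT_PCT = 400.0   # policy from the notification below: at most 4 cores
EXEMPT = {"root"}       # illustrative: accounts that are never killed
DRY_RUN = True          # print instead of kill while testing

def cpu_by_user():
    # Sum %CPU per user from procps ps output.
    out = subprocess.check_output(
        ["ps", "-eo", "user:32,pcpu", "--no-headers"], text=True)
    totals = defaultdict(float)
    for line in out.splitlines():
        user, pcpu = line.split()
        totals[user] += float(pcpu)
    return totals

for user, pct in sorted(cpu_by_user().items(), key=lambda kv: -kv[1]):
    if user in EXEMPT or pct <= CPU_LIMIT_PCT:
        continue
    print("%s is using %.0f %% CPU (limit %.0f %%)" % (user, pct, CPU_LIMIT_PCT))
    if DRY_RUN:
        print("would run: killall -9 -v -g -u " + user)
    else:
        # Same enforcement command as mentioned above.
        subprocess.call(["killall", "-9", "-v", "-g", "-u", user])

In practice you would also send a warning email before removing anything, as loadwatcher.py does with the notification shown below.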
-----Original Message-----
From: hpcx...@lists.fhcrc.org [mailto:hpcxxxx...@lists.fhcrc.org] On Behalf Of loadwatchx...@fhcrc.org
Sent: Tuesday, November 14, 2017 11:45 AM
To: Doe, John <xxxxxxxxx@fredhutch.org>
Subject: [hpcpol] RHINO3: Your jobs have been removed!
This is a notification message from loadwatcher.py, running on host RHINO3. Please review the following message:
jdoe, your CPU utilization on rhino3 is currently 4499 %!
For short term jobs you can use no more than 400 % or 4.0 CPU cores on the Rhino machines.
We have removed all your processes from this computer.
Please try again and submit batch jobs
or use the 'grabnode' command for interactive jobs.
see http://scicomp.fhcrc.org/Gizmo%20Cluster%20Quickstart.aspx
or http://scicomp.fhcrc.org/Grab%20Commands.aspx
or http://scicomp.fhcrc.org/SciComp%20Office%20Hours.aspx
If output is being captured, you may find additional information in your logs.
Dirk Petersen
Scientific Computing Director
Fred Hutch
1100 Fairview Ave. N.
Mail Stop M4-A882
Seattle, WA 98109
Phone: 206.667.5926
Skype: internetchen