Iperf processes keep running

xxlio...@gmail.com

Jul 11, 2019, 7:10:12 AM
to Pantheon
Hi there

The wrappers for some of the schemes, for example cubic and bbr, use iperf to generate traffic.
In my setup, however, it appears that these iperf processes never get stopped properly, even if I extend their definition with a time parameter (I assume this is due to the unfortunate killing of the process after each experiment).

The only way I have found so far to remove these processes is 'pkill -9 iperf', i.e. sending SIGKILL (which is a terrible solution, if you can even call it that). pkill's default signal (SIGTERM) is not enough to terminate them. In my setup this leaves hundreds of processes that should be long dead lingering around, consuming memory and CPU time and still holding file references.
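
A gentler variant of the pkill workaround is to escalate: send SIGTERM first, allow a grace period, and SIGKILL only the processes that are still around. The following is a minimal Python sketch of that idea, not something that ships with Pantheon. The helper names (pids_by_name, is_alive, stop_processes) and the default grace period are assumptions made for illustration, and pids_by_name relies on the pgrep utility being installed.

```python
import os
import signal
import subprocess
import time


def pids_by_name(name):
    """Return PIDs of processes whose name matches exactly (via `pgrep -x`)."""
    result = subprocess.run(['pgrep', '-x', name],
                            capture_output=True, text=True)
    return [int(tok) for tok in result.stdout.split()]


def is_alive(pid):
    """Return True if pid still exists; reap it first if it is our own child."""
    try:
        reaped, _ = os.waitpid(pid, os.WNOHANG)
        if reaped == pid:
            return False  # our child has exited and is now reaped
    except ChildProcessError:
        pass  # not our child; fall through to the signal-0 probe
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but is owned by another user
    return True


def stop_processes(pids, grace=5.0, poll=0.1):
    """SIGTERM all pids, wait up to `grace` seconds, then SIGKILL survivors.

    Returns the list of PIDs that had to be SIGKILLed.
    """
    for pid in pids:
        try:
            os.kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            pass  # already gone
    deadline = time.monotonic() + grace
    remaining = [p for p in pids if is_alive(p)]
    while remaining and time.monotonic() < deadline:
        time.sleep(poll)
        remaining = [p for p in remaining if is_alive(p)]
    for pid in remaining:
        try:
            os.kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # exited between the last check and now
    return remaining
```

Calling stop_processes(pids_by_name('iperf')) after a run would then act like a plain pkill for well-behaved processes and fall back to SIGKILL only for the stragglers. A process that ends up being SIGKILLed still gets no chance to clean up after itself.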

My question is, do you see any good way of cleaning this up?
I have not really worked with iperf before, but it seems to me that if one does not kill it, it should clean up after itself once its runtime expires.

Thanks,
Joel

satadal.se...@gmail.com

Jul 11, 2019, 7:46:31 AM
to Pantheon
Hi Joel,

As someone who has been trying to run batch experiments with Pantheon, I am facing this problem too. For example, if you set --run-times to a somewhat larger number (say 10) and run the experiment on a machine with little RAM (e.g. EC2 micro instances), the later runs report much lower bandwidth than the first few. Presumably this happens because the later iperf instances cannot send enough data to keep the link occupied, since they compete for RAM with lingering iperf instances from earlier runs. I tried separating the runs by invoking Pantheon N times in a loop (with each invocation doing a single run); this makes things better, but the problem returns after a while.
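
The loop described above can be sketched in Python as follows. This is only an illustration under stated assumptions: run_batch is a hypothetical helper, `command` stands for whatever single-run Pantheon invocation you use (for example your ./test.py command line), and `cleanup` is any callable you supply to remove leftover processes between runs.

```python
import subprocess


def run_batch(command, times, cleanup):
    """Run `command` `times` times, calling `cleanup()` after every run.

    Returns the list of exit codes, one per run.
    """
    codes = []
    for _ in range(times):
        try:
            codes.append(subprocess.run(command).returncode)
        finally:
            # Clean up even if launching the command itself failed.
            cleanup()
    return codes
```

Putting the cleanup in a `finally` clause means leftovers are removed even when a run fails to start, so a single bad run does not leave stale processes behind for the next one.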

Thanks,
Satadal

xxlio...@gmail.com

Jul 11, 2019, 8:26:44 AM
to Pantheon
Hi Satadal

Thanks for your response! Great to hear I am not the only one who ran into this :)
That is exactly what I ended up doing as a "hotfix", including sending a SIGKILL after every run.

I am really not a fan of using SIGKILL when it is not absolutely necessary.


Thanks,
Joel

Francis Y. Yan

Jul 11, 2019, 7:03:20 PM
to Schneepflueg, Pantheon
Hi Joel,

This is a good question! I noticed the bug in iperf too -- one obvious fix is to re-implement a simple iperf without such a bug. 

Alternatively, if you specify "--pkill-cleanup" when running "./test.py", iperf will be killed with SIGKILL properly, which I admit is not ideal. Please feel free to submit a pull request if you implement a simple iperf :)

Best,
Francis

--
You received this message because you are subscribed to the Google Groups "Pantheon" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pantheon-stanf...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/pantheon-stanford/1680bb9f-6727-4efc-8498-43da13b7f921%40googlegroups.com.

scene

Apr 24, 2020, 1:35:55 AM
to Pantheon
Hi Francis,
Recently I have been trying to run batch experiments with Pantheon, and I am facing this problem too.
The pkill command only stops the iperf process; it does not release the kernel congestion control module (e.g. bbr) that the iperf test specified.
After the test, I ran "lsmod | grep bbr", and the result looks like this:
 test@ubuntu ~/workspace/pantheon/tests: lsmod |grep bbr                    
tcp_bbr        28672  1

But when I use iperf without Pantheon and pkill iperf during the test, both the iperf process and the kernel congestion control module are stopped/released normally.

How can I release the kernel congestion control module (e.g. bbr) that the iperf test specified when killing iperf in Pantheon?
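
In the lsmod output above, the third column ("Used by") is the module's use count, so the trailing 1 means something still holds a reference to tcp_bbr. A small Python sketch for checking this value between runs is shown below. The function names are hypothetical; the parser accepts either lsmod output or the contents of /proc/modules, which uses the same first three columns.

```python
def module_use_count(name, modules_text):
    """Return the use count of module `name`, or None if it is not listed.

    `modules_text` is lsmod output or the contents of /proc/modules;
    in both, the columns start with: name, size, use count.
    """
    for line in modules_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == name:
            return int(fields[2])
    return None


def read_proc_modules(path='/proc/modules'):
    """Read the kernel's module list (Linux only)."""
    with open(path) as f:
        return f.read()
```

For example, module_use_count('tcp_bbr', read_proc_modules()) could be polled after a run to see when the count drops back to 0.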


Best,
scene

Francis Y. Yan

Apr 24, 2020, 2:41:49 PM
to scene, Pantheon
Hi scene,

Thanks for reaching out! So you want to unload the kernel module for some reason? Typically you can load the tcp_bbr kernel module and just leave it loaded; this wouldn't affect your default congestion control.

If it helps, you could consider wrapping Pantheon's scripts with your own script to clean up and set system states to the desired ones? Here's how we load a kernel module and enable it as congestion control: https://github.com/StanfordSNR/pantheon/blob/master/src/helpers/kernel_ctl.py. Please let me know if this works for you!

Best,
Francis

scene

Apr 25, 2020, 4:49:25 AM
to Pantheon
Hi Francis, 

Thanks for your reply!
Normally, each run that uses the bbr module calls its initialization function and, at the end, its release function to free system resources.
When I use Pantheon to run batch experiments with a modified kernel bbr module (with only some logging added), I load the kernel module once and leave it loaded. But for each bbr test run, the release function is not called immediately after Pantheon kills the process (kill_proc_group); it takes hundreds of seconds before the release finally happens.
If I leave things as they are, the number of sockets in use accumulates, and occasionally the following problem occurs:

Done testing bbr
Died on std::runtime_error: `/sbin/iptables -w -t nat -D POSTROUTING -j MASQUERADE -m connmark --mark 27794': process died on signal 15
getsockopt failed strangely: No child processes
Died on std::runtime_error: `/sbin/iptables -w -t nat -D PREROUTING -s 100.64.0.4 -j CONNMARK --set-mark 27794': process exited with failure status 1
kill: (27876): No such process
Testing scheme reno for experiment run 1/1...
$ /home/workspace/pantheon/src/wrappers/reno.py run_first
[tunnel server manager (tsm)] $ python /home/workspace/pantheon/src/experiments/tunnel_manager.py
getsockopt failed strangely: No child processes
Died on std::runtime_error: `/sbin/iptables -w -t nat -D POSTROUTING -j MASQUERADE -m connmark --mark 27735': process exited with failure status 1
getsockopt failed strangely: No child processes
Died on std::runtime_error: `/sbin/iptables -w -t nat -D PREROUTING -s 100.64.0.2 -j CONNMARK --set-mark 27735': process exited with failure status 1
tunnel manager is running
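
For readers unfamiliar with the kill_proc_group step mentioned above: as the name suggests, it signals a whole process group rather than a single PID. Pantheon's own helper is not reproduced here; the sketch below only shows the general technique with Python's standard library, and the function names and grace period are hypothetical.

```python
import os
import signal
import subprocess


def start_in_own_group(cmd):
    """Start cmd as the leader of a new session, and thus a new process group."""
    return subprocess.Popen(cmd, start_new_session=True)


def terminate_group(proc, grace=5.0):
    """SIGTERM the whole group of `proc`; SIGKILL it if it outlives `grace`.

    Returns the exit status of `proc` (negative signal number if signalled).
    """
    pgid = os.getpgid(proc.pid)
    os.killpg(pgid, signal.SIGTERM)
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)
        return proc.wait()
```

Signalling the group reaches children that the wrapper spawned, such as iperf, but it only waits for the group leader; descendants that ignore SIGTERM can still linger.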

Francis Y. Yan

Apr 25, 2020, 12:59:19 PM
to scene, Pantheon
Hi scene,

Thanks for the clarification! Unfortunately I don't have an idea either :(

Pantheon doesn't make any assumptions about the tested scheme, so the best we can do is send a SIGTERM to the scheme's process after each test run. We actually don't recommend using pkill or sending SIGKILL; they are last resorts used on our servers for cleaning up schemes that don't handle SIGTERM well, especially iperf. Perhaps in your case, the issue only persists when iperf is run as a child process? To be honest, I can't think of a reason off the top of my head why the BBR module is not released, but if there's anything you can do to avoid using pkill or SIGKILL, you should do it! E.g., replace iperf with your own sockets using BBR; we have examples in another project: https://github.com/StanfordSNR/puffer/blob/master/src/net/socket.cc#L214.
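
The linked example is C++; a rough Python analogue of "your own sockets using BBR" is sketched below. It is not taken from Pantheon or puffer, and the function names are assumptions for illustration. It uses the Linux-only TCP_CONGESTION socket option (exposed as socket.TCP_CONGESTION in Python 3.6+). Selecting 'bbr' requires the tcp_bbr module to be loaded and, for unprivileged users, to be listed in net.ipv4.tcp_allowed_congestion_control.

```python
import socket
import time


def get_congestion_control(sock):
    """Return the name of the congestion control algorithm used by `sock`."""
    raw = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16)
    return raw.split(b'\x00', 1)[0].decode()


def set_congestion_control(sock, name):
    """Select algorithm `name` for `sock` and return what is now in effect."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, name.encode())
    return get_congestion_control(sock)


def send_for(host, port, seconds, cc='bbr', chunk=b'x' * 65536):
    """Connect, select `cc`, send data for `seconds`, then close cleanly.

    Returns the number of bytes handed to the kernel.
    """
    with socket.create_connection((host, port)) as sock:
        set_congestion_control(sock, cc)
        deadline = time.monotonic() + seconds
        sent = 0
        while time.monotonic() < deadline:
            sent += sock.send(chunk)
        return sent
```

Because the sender stops on its own deadline and closes the socket itself, no signal is needed to end the transfer, which sidesteps the SIGTERM handling problem entirely.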

Best,
Francis

scene

Apr 26, 2020, 7:08:19 AM
to Pantheon
Hi, Francis,

As you pointed out, the issue only persists when iperf is run as a child process. I wonder whether the problem is related to the separate network namespace?
Maybe I'll have to replace iperf with my own sockets.
Thanks for your advice!

Best, 
Scene
