High memory utilization and issue stopping all Bro processes

1,438 views
Skip to first unread message

bmp

unread,
Oct 30, 2013, 4:39:14 PM10/30/13
to securit...@googlegroups.com
Hi,

I setup SO recently using the 12.04.3-20130904 ISO. I've set it up to install only Bro and ELSA for now. The server is monitoring a 10G link, but its not fully utilized - the peak traffic has been approx 5Gbps.

Server specs:
Dell R720
Dual E5-2680 2.7 GHz CPUs
96GB of RAM (6*16GB 1600 MHz RDIMMS)
NIC: Intel Ethernet Converged Network Adapter X520-SR
Disk Space: 4x 1.2TB 10K RPM SAS 6Gbps

I have Bro configured to run 10 worker processes. Initially I had assigned it more, but the CPU load avg was btw 15 - 20, and I tried setting different values for lb_procs to reduce the load. Finally I stayed with 10 - the load avg has been btw 10-15.

Bro seems to start and run fine initially - broctl netstats show very few drops, total RAM utilization is around 25G.

But over time (about a day) the RAM utilization goes upto 80-90G and broctl netstats shows lot more packet drops. This seems to be happening consistently, and after a reboot things seem to be normal.

Instead of a reboot, this time I tried running broctl stop - its says all processes have stopped.

But RAM utilization doesn't really reduce at all.

free -g
total used free shared buffers cached
Mem: 94 81 12 0 0 64
-/+ buffers/cache: 17 77

pgrep -lf bro
7186 bash /opt/bro/share/broctl/scripts/run-bro -i eth1 -U .status -p broctl -p broctl-live -p local -p Dell-tracker11-eth1-3 local.bro broctl base/frameworks/cluster local-worker.bro broctl/auto
7256 /opt/bro/bin/bro -i eth1 -U .status -p broctl -p broctl-live -p local -p Dell-tracker11-eth1-3 local.bro broctl base/frameworks/cluster local-worker.bro broctl/auto
7260 /opt/bro/bin/bro -i eth1 -U .status -p broctl -p broctl-live -p local -p Dell-tracker11-eth1-3 local.bro broctl base/frameworks/cluster local-worker.bro broctl/auto
12752 bash /opt/bro/share/broctl/scripts/run-bro -U .status -p broctl -p broctl-live -p local -p proxy local.bro broctl base/frameworks/cluster local-proxy broctl/auto
12761 /opt/bro/bin/bro -U .status -p broctl -p broctl-live -p local -p proxy local.bro broctl base/frameworks/cluster local-proxy broctl/auto
12769 /opt/bro/bin/bro -U .status -p broctl -p broctl-live -p local -p proxy local.bro broctl base/frameworks/cluster local-proxy broctl/auto
12807 bash /opt/bro/share/broctl/scripts/run-bro -i eth1 -U .status -p broctl -p broctl-live -p local -p Dell-tracker11-eth1-10 local.bro broctl base/frameworks/cluster local-worker.bro broctl/auto
12826 bash /opt/bro/share/broctl/scripts/run-bro -i eth1 -U .status -p broctl -p broctl-live -p local -p Dell-tracker11-eth1-3 local.bro broctl base/frameworks/cluster local-worker.bro broctl/auto
12864 bash /opt/bro/share/broctl/scripts/run-bro -i eth1 -U .status -p broctl -p broctl-live -p local -p Dell-tracker11-eth1-6 local.bro broctl base/frameworks/cluster local-worker.bro broctl/auto
12875 bash /opt/bro/share/broctl/scripts/run-bro -i eth1 -U .status -p broctl -p broctl-live -p local -p Dell-tracker11-eth1-7 local.bro broctl base/frameworks/cluster local-worker.bro broctl/auto
12886 bash /opt/bro/share/broctl/scripts/run-bro -i eth1 -U .status -p broctl -p broctl-live -p local -p Dell-tracker11-eth1-8 local.bro broctl base/frameworks/cluster local-worker.bro broctl/auto
12896 bash /opt/bro/share/broctl/scripts/run-bro -i eth1 -U .status -p broctl -p broctl-live -p local -p Dell-tracker11-eth1-9 local.bro broctl base/frameworks/cluster local-worker.bro broctl/auto
12925 /opt/bro/bin/bro -i eth1 -U .status -p broctl -p broctl-live -p local -p Dell-tracker11-eth1-10 local.bro broctl base/frameworks/cluster local-worker.bro broctl/auto
12929 /opt/bro/bin/bro -i eth1 -U .status -p broctl -p broctl-live -p local -p Dell-tracker11-eth1-3 local.bro broctl base/frameworks/cluster local-worker.bro broctl/auto
12934 /opt/bro/bin/bro -i eth1 -U .status -p broctl -p broctl-live -p local -p Dell-tracker11-eth1-6 local.bro broctl base/frameworks/cluster local-worker.bro broctl/auto
12935 /opt/bro/bin/bro -i eth1 -U .status -p broctl -p broctl-live -p local -p Dell-tracker11-eth1-7 local.bro broctl base/frameworks/cluster local-worker.bro broctl/auto
12936 /opt/bro/bin/bro -i eth1 -U .status -p broctl -p broctl-live -p local -p Dell-tracker11-eth1-8 local.bro broctl base/frameworks/cluster local-worker.bro broctl/auto
12937 /opt/bro/bin/bro -i eth1 -U .status -p broctl -p broctl-live -p local -p Dell-tracker11-eth1-9 local.bro broctl base/frameworks/cluster local-worker.bro broctl/auto
12946 /opt/bro/bin/bro -i eth1 -U .status -p broctl -p broctl-live -p local -p Dell-tracker11-eth1-10 local.bro broctl base/frameworks/cluster local-worker.bro broctl/auto
12960 /opt/bro/bin/bro -i eth1 -U .status -p broctl -p broctl-live -p local -p Dell-tracker11-eth1-3 local.bro broctl base/frameworks/cluster local-worker.bro broctl/auto
12974 /opt/bro/bin/bro -i eth1 -U .status -p broctl -p broctl-live -p local -p Dell-tracker11-eth1-6 local.bro broctl base/frameworks/cluster local-worker.bro broctl/auto
12977 /opt/bro/bin/bro -i eth1 -U .status -p broctl -p broctl-live -p local -p Dell-tracker11-eth1-7 local.bro broctl base/frameworks/cluster local-worker.bro broctl/auto
12984 /opt/bro/bin/bro -i eth1 -U .status -p broctl -p broctl-live -p local -p Dell-tracker11-eth1-9 local.bro broctl base/frameworks/cluster local-worker.bro broctl/auto
12991 /opt/bro/bin/bro -i eth1 -U .status -p broctl -p broctl-live -p local -p Dell-tracker11-eth1-8 local.bro broctl base/frameworks/cluster local-worker.bro broctl/auto


top -n1

Tasks: 334 total, 7 running, 318 sleeping, 0 stopped, 9 zombie
Cpu(s): 14.5%us, 9.9%sy, 1.7%ni, 72.5%id, 0.1%wa, 0.0%hi, 1.2%si, 0.0%st
Mem: 98961896k total, 85586608k used, 13375288k free, 227424k buffers
Swap: 126629100k total, 0k used, 126629100k free, 67451052k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12936 root 20 0 661m 480m 74m R 98 0.5 577:31.46 bro
12937 root 20 0 631m 450m 74m S 69 0.5 580:34.83 bro
12935 root 20 0 717m 536m 74m R 63 0.6 582:56.00 bro
12925 root 20 0 728m 547m 74m R 49 0.6 579:12.69 bro
12929 root 20 0 708m 527m 74m S 35 0.5 570:39.86 bro
12934 root 20 0 655m 474m 74m R 29 0.5 579:09.55 bro
7256 root 20 0 627m 447m 74m R 27 0.5 546:31.56 bro
12769 root 25 5 257m 49m 888 S 27 0.1 480:26.62 bro
12960 root 25 5 199m 83m 64m S 20 0.1 232:36.78 bro
12946 root 25 5 197m 83m 64m S 18 0.1 233:13.47 bro
12974 root 25 5 199m 83m 64m S 18 0.1 237:47.34 bro
12991 root 25 5 199m 83m 64m S 18 0.1 233:05.52 bro
7260 root 25 5 202m 89m 64m S 16 0.1 210:06.25 bro
12977 root 25 5 199m 83m 64m R 16 0.1 232:18.54 bro
12984 root 25 5 199m 83m 64m S 10 0.1 233:24.38 bro
12761 root 20 0 545m 291m 3384 S 8 0.3 30:26.24 bro


So it looks like even after I ran broctl stop, Bro is still running some processes and using up memory.

memstat
529284k: PID 7256 (/opt/bro/bin/bro)
100008k: PID 7260 (/opt/bro/bin/bro)
517416k: PID 12761 (/opt/bro/bin/bro)
222632k: PID 12769 (/opt/bro/bin/bro)
632392k: PID 12925 (/opt/bro/bin/bro)
612356k: PID 12929 (/opt/bro/bin/bro)
557584k: PID 12934 (/opt/bro/bin/bro)
621604k: PID 12935 (/opt/bro/bin/bro)
563824k: PID 12936 (/opt/bro/bin/bro)
533456k: PID 12937 (/opt/bro/bin/bro)
95496k: PID 12946 (/opt/bro/bin/bro)
97540k: PID 12960 (/opt/bro/bin/bro)
97540k: PID 12974 (/opt/bro/bin/bro)
97540k: PID 12977 (/opt/bro/bin/bro)
97540k: PID 12984 (/opt/bro/bin/bro)
97532k: PID 12991 (/opt/bro/bin/bro)

I've attached the sostat and cpuinfo output.


Could you kindly provide some guidance to help troubleshoot this further?


Thanks!


cpuinfo.txt
sostat.txt
Message has been deleted
Message has been deleted

Doug Burks

unread,
Oct 31, 2013, 9:46:06 AM10/31/13
to securit...@googlegroups.com
Hi Benson,

Yes, I've seen this happen. I'm hoping that it's a bug in broctl that
has been fixed for the upcoming Bro 2.2 release. Perhaps Seth or Liam
can comment?

On Wed, Oct 30, 2013 at 4:48 PM, bmp <benson....@gmail.com> wrote:
> Also, I haven't made any change to my local.bro or other bro scripts on my site.
>
> I only edited my Bro config files:
> - network.cfg to include my home net
> - broctl.cfg: Commented out MailTo field to stop the hourly Bro reports.
> --
> You received this message because you are subscribed to the Google Groups "security-onion" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to security-onio...@googlegroups.com.
> To post to this group, send email to securit...@googlegroups.com.
> Visit this group at http://groups.google.com/group/security-onion.
> For more options, visit https://groups.google.com/groups/opt_out.



--
Doug Burks
http://securityonion.blogspot.com

Liam Randall

unread,
Oct 31, 2013, 10:21:42 AM10/31/13
to securit...@googlegroups.com
The reason Bro 2.2 has been in Beta so long was squashing a couple of memory leaks- some were from the new code, others were old ones exacerbated by the new frameworks.  On the 10 Gig + links I am working with now I am seeing excellent memory performance.

Excited to get 2.2 final out the door and into everyones hands.  The files and intel frameworks are really going to blow a lot of people away.

Liam

Doug Burks

unread,
Oct 31, 2013, 10:31:03 AM10/31/13
to securit...@googlegroups.com
Liam, have you seen this particular issue of having extra bro workers
running? I'm guessing that broctl thinks a worker has crashed and
spawns a new one, but the old one is still actually running and
consuming resources.

seth...@gmail.com

unread,
Oct 31, 2013, 12:00:57 PM10/31/13
to securit...@googlegroups.com

On Oct 31, 2013, at 9:46 AM, Doug Burks <doug....@gmail.com> wrote:

> Yes, I've seen this happen. I'm hoping that it's a bug in broctl that
> has been fixed for the upcoming Bro 2.2 release. Perhaps Seth or Liam
> can comment?

I think there is more than one problem here. The extreme memory use is one and the Bro processes not being stopped with "broctl stop" is the other. I don't know why either one is happening with the information that has been given so far.

Benson, could you add "@load misc/profiling" to your local.bro script? That should start writing out some information about sizes of global variables (and lots of other information). It will be directly written to the spool/<node-name> directory on each host for each node.

.Seth

Doug Burks

unread,
Oct 31, 2013, 12:43:39 PM10/31/13
to securit...@googlegroups.com
I could be totally wrong here, but I'm envisioning a scenario where
the high RAM usage could just be a symptom of a broctl bug...

My interpretation of Benson's email is that the RAM usage is
"acceptable" at first but over a day or so gets much higher, at which
time "broctl stop" stops the running processes that it thinks are
running BUT there are other bro processes still consuming RAM. I've
experienced this myself. At first, everything is fine, "pgrep -lf
bro" shows the correct number of workers as configured via node.cfg.
After a day or so, "pgrep -lf bro" MAY show more Bro workers on an
interface than configured via node.cfg and RAM usage is much higher
simply due to the additional workers. Is it possible that, perhaps
during hourly/daily rotation, an old Bro worker process is not killed
properly? Doing "broctl stop", waiting a few minutes, and then doing
"pgrep -lf bro" shows that some of those Bro workers are still running
and consuming resources.

If I'm interpreting Benson's email correctly and if I'm interpreting
the symptoms I've seen correctly, then my thinking is that perhaps
it's a bug in broctl that's not killing off old Bro processes
correctly and that fixing that bug would result in RAM usage returning
to "acceptable" levels.

Thoughts?

bmp

unread,
Oct 31, 2013, 1:53:50 PM10/31/13
to securit...@googlegroups.com
Thanks for the response guys!

Doug, you're right about the issue I'm having.

I rebooted the server today morning and things came back to normal.

free -g
total used free shared buffers cached

Mem: 94 27 67 0 0 18
-/+ buffers/cache: 8 85
Swap: 120 0 120

broctl netstats
Dell-tracker11-eth1-1: 1383226434.801952 recvd=17283751 dropped=5 link=17283751
Dell-tracker11-eth1-10: 1383226435.002263 recvd=17411012 dropped=1 link=17411012
Dell-tracker11-eth1-2: 1383226435.202305 recvd=17612996 dropped=5 link=17612996
Dell-tracker11-eth1-3: 1383226435.401446 recvd=16103695 dropped=4 link=16103695
Dell-tracker11-eth1-4: 1383226435.602099 recvd=16994737 dropped=6 link=16994737
Dell-tracker11-eth1-5: 1383226435.802116 recvd=19468767 dropped=7 link=19468767
Dell-tracker11-eth1-6: 1383226436.006219 recvd=20788249 dropped=9 link=20788249
Dell-tracker11-eth1-7: 1383226436.155957 recvd=21992413 dropped=3 link=21992413
Dell-tracker11-eth1-8: 1383226436.404069 recvd=16942935 dropped=8 link=16942935
Dell-tracker11-eth1-9: 1383226436.606389 recvd=20682335 dropped=12 link=20682335

top
top - 13:26:01 up 10 min, 1 user, load average: 7.32, 6.81, 3.67
Tasks: 338 total, 14 running, 310 sleeping, 0 stopped, 14 zombie
Cpu(s): 23.7%us, 5.7%sy, 1.1%ni, 67.5%id, 0.2%wa, 0.0%hi, 1.7%si, 0.0%st
Mem: 98961896k total, 11230080k used, 87731816k free, 47640k buffers
Swap: 126629100k total, 0k used, 126629100k free, 4127720k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

4087 root 20 0 623m 440m 75m R 90 0.5 6:23.50 bro
4086 root 20 0 620m 437m 71m R 81 0.5 6:00.67 bro
4076 root 20 0 619m 436m 71m R 74 0.5 5:56.89 bro
4083 root 20 0 619m 437m 75m S 74 0.5 6:04.21 bro
4088 root 20 0 618m 436m 75m S 73 0.5 5:59.86 bro
4089 root 20 0 621m 438m 71m R 72 0.5 6:12.30 bro
4078 root 20 0 615m 432m 71m S 70 0.4 6:11.44 bro
4085 root 20 0 613m 431m 71m R 70 0.4 6:25.19 bro
4080 root 20 0 622m 439m 71m R 69 0.5 5:59.31 bro
4084 root 20 0 619m 436m 75m R 62 0.5 5:58.01 bro
1947 root 20 0 205m 37m 3796 S 34 0.0 3:01.88 perl
3877 root 25 5 297m 19m 904 R 30 0.0 2:32.53 bro
3921 root 25 5 225m 18m 888 S 26 0.0 2:45.11 bro
1809 root 20 0 265m 200m 2908 S 22 0.2 1:59.14 syslog-ng
3868 root 20 0 1964m 54m 3944 S 19 0.1 1:49.53 bro
4131 root 25 5 199m 82m 64m R 16 0.1 1:24.02 bro
4108 root 25 5 199m 82m 64m R 15 0.1 1:24.51 bro
4109 root 25 5 199m 82m 64m S 15 0.1 1:26.26 bro
4145 root 25 5 199m 82m 64m S 15 0.1 1:25.49 bro
4114 root 25 5 199m 82m 64m S 14 0.1 1:24.51 bro
4093 root 25 5 199m 82m 64m S 14 0.1 1:26.57 bro
4123 root 25 5 199m 82m 64m S 14 0.1 1:24.92 bro
4100 root 25 5 199m 82m 64m R 13 0.1 1:25.75 bro
4117 root 25 5 199m 82m 64m R 13 0.1 1:25.47 bro
4138 root 25 5 199m 82m 64m S 13 0.1 1:25.87 bro
7641 root 20 0 63496 13m 2296 R 12 0.0 0:00.37 perl
3913 root 20 0 353m 99m 3960 S 6 0.1 0:35.00 bro


Seth, I've loaded the profiling module as you suggested. How long should I let it run before collecting the logs?


Thanks,
Benson

Seth Hall

unread,
Nov 2, 2013, 10:17:50 PM11/2/13
to securit...@googlegroups.com

On Oct 30, 2013, at 4:39 PM, bmp <benson....@gmail.com> wrote:

> So it looks like even after I ran broctl stop, Bro is still running some processes and using up memory.

This has been an ongoing issue for us with BroControl for quite some time but the problem only seems to expose itself on hosts that are overloaded so we've always had a difficult time tracking down the issue.

.Seth


--
Seth Hall
International Computer Science Institute
(Bro) because everyone has a network
http://www.bro.org/

signature.asc

Doug Burks

unread,
Nov 4, 2013, 10:34:24 AM11/4/13
to securit...@googlegroups.com
Now that I'm actively monitoring this, it looks like most, if not all,
of my sensors are doing this (and they aren't overloaded).

Seth, what can I provide to help track down the issue?

sfear...@gmail.com

unread,
Nov 4, 2013, 2:50:01 PM11/4/13
to securit...@googlegroups.com
Just to add more to this because I think it's the same thing I've been seeing.

I've got a sensor only box that has only 1 sensor NIC. The NIC is not hooked into a TAP or SPAN port as I am simply doing some testing with it. About the only thing I expect to see is broadcast traffic which is what I see. At some point after being up for a couple of days the bro logs will stop logging local broadcasts on our network. The 1st time it happened I tried a reboot and that fixed it. That was 3 days ago. Querying for udp class=BRO_CONN in ELSA returned expected broadcast traffic.

Now fast forward 3 days and that same ELSA query no longer returns results. It looks like they ceased on Sun Nov 03 01:59:59 UTC.

This box obviously has no load on it, but I do witness the load continually go up over time. This is a 2 proc box so you can see that I am already having load problems.

19:47:25 up 3 days, 3:16, 1 user, load average: 2.34, 2.28, 2.15


root 2992 1 0 Nov01 ? 00:00:00 bash /opt/bro/share/broctl/scripts/run-bro -U .status -p broctl -p broctl-live -p local -p manager local.bro broctl base/frameworks/cluster local-manager.bro broctl/auto
root 3132 2992 1 Nov01 ? 01:17:04 /opt/bro/bin/bro -U .status -p broctl -p broctl-live -p local -p manager local.bro broctl base/frameworks/cluster local-manager.bro broctl/auto
root 3140 3132 17 Nov01 ? 13:10:47 /opt/bro/bin/bro -U .status -p broctl -p broctl-live -p local -p manager local.bro broctl base/frameworks/cluster local-manager.bro broctl/auto
root 3207 1 0 Nov01 ? 00:00:00 bash /opt/bro/share/broctl/scripts/run-bro -U .status -p broctl -p broctl-live -p local -p proxy local.bro broctl base/frameworks/cluster local-proxy broctl/auto
root 3301 3207 1 Nov01 ? 01:12:04 /opt/bro/bin/bro -U .status -p broctl -p broctl-live -p local -p proxy local.bro broctl base/frameworks/cluster local-proxy broctl/auto
root 3306 3301 17 Nov01 ? 13:19:59 /opt/bro/bin/bro -U .status -p broctl -p broctl-live -p local -p proxy local.bro broctl base/frameworks/cluster local-proxy broctl/auto
root 3391 1 0 Nov01 ? 00:00:00 bash /opt/bro/share/broctl/scripts/run-bro -i eth1 -U .status -p broctl -p broctl-live -p local -p sensor01-eth1-1 local.bro broctl base/frameworks/cluster local-worker.bro broctl/auto
root 3466 3391 15 Nov01 ? 11:56:40 /opt/bro/bin/bro -i eth1 -U .status -p broctl -p broctl-live -p local -p sensor01-eth1-1 local.bro broctl base/frameworks/cluster local-worker.bro broctl/auto
root 3471 3466 10 Nov01 ? 08:15:59 /opt/bro/bin/bro -i eth1 -U .status -p broctl -p broctl-live -p local -p sensor01-eth1-1 local.bro broctl base/frameworks/cluster local-worker.bro broctl/auto
root 19620 1 0 Nov03 ? 00:00:00 bash /opt/bro/share/broctl/scripts/run-bro -U .status -p broctl -p broctl-live -p local -p manager local.bro broctl base/frameworks/cluster local-manager.bro broctl/auto
root 19814 19620 1 Nov03 ? 00:30:33 /opt/bro/bin/bro -U .status -p broctl -p broctl-live -p local -p manager local.bro broctl base/frameworks/cluster local-manager.bro broctl/auto
root 19821 19814 12 Nov03 ? 04:01:52 /opt/bro/bin/bro -U .status -p broctl -p broctl-live -p local -p manager local.bro broctl base/frameworks/cluster local-manager.bro broctl/auto
root 20888 1 0 Nov03 ? 00:00:00 bash /opt/bro/share/broctl/scripts/run-bro -U .status -p broctl -p broctl-live -p local -p manager local.bro broctl base/frameworks/cluster local-manager.bro broctl/auto
root 20982 1 0 18:00 ? 00:00:00 bash /opt/bro/share/broctl/scripts/run-bro -U .status -p broctl -p broctl-live -p local -p manager local.bro broctl base/frameworks/cluster local-manager.bro broctl/auto
root 21133 20982 1 18:00 ? 00:01:34 /opt/bro/bin/bro -U .status -p broctl -p broctl-live -p local -p manager local.bro broctl base/frameworks/cluster local-manager.bro broctl/auto
root 21139 21133 10 18:00 ? 00:11:38 /opt/bro/bin/bro -U .status -p broctl -p broctl-live -p local -p manager local.bro broctl base/frameworks/cluster local-manager.bro broctl/auto
root 21233 20888 1 Nov03 ? 00:40:35 /opt/bro/bin/bro -U .status -p broctl -p broctl-live -p local -p manager local.bro broctl base/frameworks/cluster local-manager.bro broctl/auto
root 21241 21233 13 Nov03 ? 05:33:34 /opt/bro/bin/bro -U .status -p broctl -p broctl-live -p local -p manager local.bro broctl base/frameworks/cluster local-manager.bro broctl/auto

punkrokk

unread,
Nov 5, 2013, 3:38:05 PM11/5/13
to securit...@googlegroups.com
I've seen this happen also - generally when I have a server oversubscribed. By the time this has happened for me - I'm dropping packets because either I don't have enough workers or I am out of cores/cycles on a processor. Have you seen any of these symptoms?

PS - I have NOT seen this w/Bro 2.2 beta. But I haven't tried it against the same load.

JP

Doug Burks

unread,
Nov 11, 2013, 8:49:20 AM11/11/13
to securit...@googlegroups.com
I think the issue I was seeing with orphaned Bro processes was due to
a bug in procps. I installed updates last week, which included an
updated procps package, and haven't had the issue since then.

"Changelog
procps (1:3.2.8-11ubuntu6.3) precise; urgency=low
* Avoid SEGV if file2str should read zero bytes. This is a backport of
526bc5df from upstream. When utility buffers were introduced for file2str
read requests, a subtle change was inadvertently introduced such that a
read of zero no longer returns a -1 value. This returns to the behavior to
returning -1 on zero byte reads. (LP: #1242746)
-- Dave Chiluk <chi...@canonical.com> Mon, 28 Oct 2013 10:56:11 -0700"

https://launchpad.net/ubuntu/+source/procps/1:3.2.8-11ubuntu6.3

bmp

unread,
Nov 14, 2013, 9:38:58 AM11/14/13
to securit...@googlegroups.com

Thanks Doug!

I did an update last week and no longer see the orphaned Bro processes when I stop Bro after running it for a day. But now, I see more drops than I used to before:
broctl netstats
Dell-tracker10-eth1-1: 1384438884.721735 recvd=2720065410 dropped=8656673 link=2720065410
Dell-tracker10-eth1-10: 1384438884.925728 recvd=2916822709 dropped=4174310 link=2916822709
Dell-tracker10-eth1-2: 1384438885.124428 recvd=2779923906 dropped=4468393 link=2779923906
Dell-tracker10-eth1-3: 1384438885.321093 recvd=3101140429 dropped=10454802 link=3101140429
Dell-tracker10-eth1-4: 1384438885.527422 recvd=3350559230 dropped=140228913 link=3350559230
Dell-tracker10-eth1-5: 1384438885.725776 recvd=3216089847 dropped=3511198 link=3216089847
Dell-tracker10-eth1-6: 1384438885.894169 recvd=3858174518 dropped=235303989 link=3858174518
Dell-tracker10-eth1-7: 1384438886.028788 recvd=2756054906 dropped=4015840 link=2756054906
Dell-tracker10-eth1-8: 1384438886.329623 recvd=2762332191 dropped=6322883 link=2762332191
Dell-tracker10-eth1-9: 1384438886.529575 recvd=2792720985 dropped=3562489 link=2792720985

When I start Bro, it runs fine for an hour or more and then the drop rate begin to increase.

Any thoughts?

Again, the server is monitoring traffic that peaks to 5Gbps, its got 96G of RAM and Bro is configured to run 10 worker processes. And when the drop rate increases there still ~15G of free RAM available. The server has 32 CPU processors, so there's room for more Bro processes to run - I tried increases that before but didnt see any improvement in the drop rate.

Thanks,
Benson

Doug Burks

unread,
Nov 14, 2013, 9:44:44 AM11/14/13
to securit...@googlegroups.com
Have you installed the updates released yesterday? The new sysctl
settings *might* help some:
http://blog.securityonion.net/2013/11/new-snort-nsm-and-sostat-packages.html

If you're monitoring 5Gbps, I'd definitely recommend increasing your
number of Bro workers. I think the last recommendation I heard from
the Bro team was to limit to half your number of processors, so try
increasing to 16 Bro workers, let it run for a while, then send "sudo
sostat-redacted".
> --
> You received this message because you are subscribed to the Google Groups "security-onion" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to security-onio...@googlegroups.com.
> To post to this group, send email to securit...@googlegroups.com.
> Visit this group at http://groups.google.com/group/security-onion.
> For more options, visit https://groups.google.com/groups/opt_out.



--
Doug Burks
http://securityonion.net

bmp

unread,
Nov 14, 2013, 1:47:45 PM11/14/13
to securit...@googlegroups.com

Alright so I bumped it up to 16 processes and rebooted the server. See attached sostats output.

free -g
total used free shared buffers cached

Mem: 94 13 81 0 0 3
-/+ buffers/cache: 9 84
Swap: 120 0 120

The CPU usage has gone up significantly...
top - 18:44:03 up 14 min, 1 user, load average: 16.17, 15.72, 9.81
Tasks: 375 total, 20 running, 335 sleeping, 0 stopped, 20 zombie
Cpu0 : 41.8%us, 5.7%sy, 1.4%ni, 40.0%id, 0.0%wa, 0.0%hi, 11.1%si, 0.0%st
Cpu1 : 55.2%us, 5.9%sy, 1.0%ni, 29.9%id, 0.0%wa, 0.3%hi, 7.6%si, 0.0%st
Cpu2 : 25.4%us, 8.5%sy, 1.1%ni, 59.2%id, 0.0%wa, 0.0%hi, 5.9%si, 0.0%st
Cpu3 : 43.1%us, 9.3%sy, 1.8%ni, 35.9%id, 1.1%wa, 0.0%hi, 8.9%si, 0.0%st
Cpu4 : 64.8%us, 3.2%sy, 1.1%ni, 22.2%id, 0.0%wa, 0.0%hi, 8.8%si, 0.0%st
Cpu5 : 25.8%us, 11.8%sy, 1.4%ni, 52.6%id, 0.0%wa, 0.0%hi, 8.4%si, 0.0%st
Cpu6 : 52.8%us, 7.1%sy, 1.8%ni, 29.4%id, 0.0%wa, 0.0%hi, 8.9%si, 0.0%st
Cpu7 : 45.7%us, 6.5%sy, 1.8%ni, 37.7%id, 0.0%wa, 0.0%hi, 8.3%si, 0.0%st
Cpu8 : 37.7%us, 11.6%sy, 1.4%ni, 40.9%id, 0.0%wa, 0.0%hi, 8.3%si, 0.0%st
Cpu9 : 49.3%us, 5.0%sy, 1.4%ni, 38.2%id, 0.0%wa, 0.0%hi, 6.1%si, 0.0%st
Cpu10 : 47.2%us, 5.9%sy, 2.1%ni, 38.1%id, 0.0%wa, 0.0%hi, 6.6%si, 0.0%st
Cpu11 : 58.9%us, 5.8%sy, 0.7%ni, 24.7%id, 0.0%wa, 0.0%hi, 9.9%si, 0.0%st
Cpu12 : 33.8%us, 6.8%sy, 2.5%ni, 48.6%id, 0.0%wa, 0.0%hi, 8.3%si, 0.0%st
Cpu13 : 53.9%us, 5.3%sy, 0.7%ni, 30.1%id, 0.0%wa, 0.0%hi, 9.9%si, 0.0%st
Cpu14 : 36.6%us, 11.0%sy, 2.9%ni, 42.9%id, 0.0%wa, 0.0%hi, 6.6%si, 0.0%st
Cpu15 : 34.8%us, 9.3%sy, 1.5%ni, 46.7%id, 0.0%wa, 0.4%hi, 7.4%si, 0.0%st
Cpu16 : 52.4%us, 5.4%sy, 1.4%ni, 40.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu17 : 51.0%us, 5.1%sy, 0.3%ni, 43.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu18 : 49.3%us, 5.4%sy, 1.7%ni, 43.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu19 : 36.6%us, 7.5%sy, 2.7%ni, 53.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu20 : 57.9%us, 3.8%sy, 1.4%ni, 37.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu21 : 30.4%us, 5.9%sy, 1.7%ni, 61.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu22 : 39.4%us, 5.9%sy, 1.4%ni, 52.9%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu23 : 52.7%us, 6.2%sy, 0.7%ni, 40.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu24 : 30.3%us, 9.5%sy, 2.5%ni, 57.4%id, 0.0%wa, 0.0%hi, 0.4%si, 0.0%st
Cpu25 : 21.0%us, 3.7%sy, 1.0%ni, 74.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu26 : 54.4%us, 4.8%sy, 0.3%ni, 40.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu27 : 37.6%us, 4.7%sy, 1.0%ni, 56.3%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu28 : 48.3%us, 7.2%sy, 1.0%ni, 43.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu29 : 44.1%us, 4.2%sy, 1.0%ni, 50.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu30 : 30.2%us, 7.4%sy, 2.5%ni, 59.6%id, 0.0%wa, 0.0%hi, 0.4%si, 0.0%st
Cpu31 : 32.8%us, 6.9%sy, 1.7%ni, 58.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 98961896k total, 14630068k used, 84331828k free, 51184k buffers
Swap: 126629100k total, 0k used, 126629100k free, 3968468k cached


I'll let it run for a while and send the sostats output again.

Thanks,
Benson

Seth Hall

unread,
Nov 14, 2013, 3:05:24 PM11/14/13
to securit...@googlegroups.com

On Nov 14, 2013, at 1:47 PM, bmp <benson....@gmail.com> wrote:

> Alright so I bumped it up to 16 processes and rebooted the server. See attached sostats output.

I would also load misc/capture-loss (it will create a capture-loss.log after 15 minutes and write to it every 15 minutes)…

@load misc/capture-loss

Add that to local.bro then do "check manager", "install", and "restart" in broctl.
signature.asc

Doug Burks

unread,
Nov 14, 2013, 3:17:53 PM11/14/13
to securit...@googlegroups.com
Security Onion users have capture-loss loaded automatically:
https://github.com/sethhall/bro-securityonion/blob/master/load-non-default-scripts.bro

bmp

unread,
Nov 14, 2013, 4:39:56 PM11/14/13
to securit...@googlegroups.com
I guess I forgot to attach the sostats output before...
sostats-redacted.txt

Doug Burks

unread,
Nov 14, 2013, 4:47:41 PM11/14/13
to securit...@googlegroups.com
Dropped stats look better than they did before:

Dell-tracker10-eth1-1: 1384454424.644318 recvd=15562926 dropped=3 link=15562926
Dell-tracker10-eth1-10: 1384454424.844317 recvd=15911346 dropped=4 link=15911346
Dell-tracker10-eth1-11: 1384454425.043700 recvd=15614984 dropped=6 link=15614984
Dell-tracker10-eth1-12: 1384454425.244326 recvd=16443020 dropped=7 link=16443020
Dell-tracker10-eth1-13: 1384454425.445089 recvd=15848874 dropped=1 link=15848874
Dell-tracker10-eth1-14: 1384454425.648424 recvd=16674988 dropped=10
link=16674988
Dell-tracker10-eth1-15: 1384454425.782254 recvd=15894364 dropped=5 link=15894364
Dell-tracker10-eth1-16: 1384454426.048371 recvd=14921808 dropped=4 link=14921808
Dell-tracker10-eth1-2: 1384454426.248352 recvd=16117355 dropped=8 link=16117355
Dell-tracker10-eth1-3: 1384454426.449090 recvd=13414290 dropped=8 link=13414290
Dell-tracker10-eth1-4: 1384454426.648269 recvd=15082478 dropped=2 link=15082478
Dell-tracker10-eth1-5: 1384454426.848042 recvd=14553499 dropped=6 link=14553499
Dell-tracker10-eth1-6: 1384454427.050978 recvd=17531897 dropped=7 link=17531897
Dell-tracker10-eth1-7: 1384454427.252275 recvd=15519243 dropped=12 link=15519243
Dell-tracker10-eth1-8: 1384454427.448383 recvd=15208056 dropped=11 link=15208056
Dell-tracker10-eth1-9: 1384454427.638822 recvd=16631278 dropped=12 link=16631278

On Thu, Nov 14, 2013 at 4:39 PM, bmp <benson....@gmail.com> wrote:
> I guess I forgot to attach the sostats output before...
>

bmp

unread,
Nov 15, 2013, 8:01:38 AM11/15/13
to securit...@googlegroups.com
Those stats were taken few minutes after rebooting the server yesterday (2pm EST)- I just forgot to attach it to the mail.

Here's how it looks this morning (8am).
broctl netstats
Dell-tracker10-eth1-1: 1384520278.468432 recvd=1365921754 dropped=904 link=1365921754
Dell-tracker10-eth1-10: 1384520278.668452 recvd=130895343 dropped=1795032 link=130895343
Dell-tracker10-eth1-11: 1384520278.872467 recvd=1353338121 dropped=47628 link=1353338121
Dell-tracker10-eth1-12: 1384520279.072505 recvd=1393140075 dropped=47293 link=1393140075
Dell-tracker10-eth1-13: 1384520279.272388 recvd=1360511289 dropped=191999 link=1360511289
Dell-tracker10-eth1-14: 1384520279.472515 recvd=1398079476 dropped=814 link=1398079476
Dell-tracker10-eth1-15: 1384520279.672414 recvd=1375886700 dropped=526954460 link=1375886700
Dell-tracker10-eth1-16: 1384520279.872528 recvd=1384310838 dropped=91720 link=1384310838
Dell-tracker10-eth1-2: 1384520280.072523 recvd=1401012236 dropped=25761 link=1401012236
Dell-tracker10-eth1-3: 1384520280.271685 recvd=1373363316 dropped=625486 link=1373363316
Dell-tracker10-eth1-4: 1384520280.472520 recvd=1480524233 dropped=5029529 link=1480524233
Dell-tracker10-eth1-5: 1384520280.676484 recvd=2014091406 dropped=12635690 link=2014091406
Dell-tracker10-eth1-6: 1384520280.877537 recvd=1405829162 dropped=156378139 link=1405829162
Dell-tracker10-eth1-7: 1384520281.076487 recvd=1436587318 dropped=190466 link=1436587318
Dell-tracker10-eth1-8: 1384520281.276508 recvd=1429048012 dropped=173966 link=1429048012
Dell-tracker10-eth1-9: 1384520281.476540 recvd=1379976753 dropped=714 link=1379976753

root@Dell-tracker10:~# free -g


total used free shared buffers cached

Mem: 94 55 38 0 0 29
-/+ buffers/cache: 25 68
Swap: 120 0 120


sostats-redacted.txt

Doug Burks

unread,
Nov 16, 2013, 4:41:22 PM11/16/13
to securit...@googlegroups.com
Your load average still looks OK considering you have 32 cores. Try
increasing your Bro workers again and see what happens.

bmp

unread,
Nov 19, 2013, 9:21:44 AM11/19/13
to securit...@googlegroups.com
So I bumped it up to 22 workers yesterday and let it run for a day, see attached sostat.

Some worker nodes are dropping a lot... I've also attached the bro capture_loss logs.

I've just re-added '@load misc/profiling' to my local.bro. I'm going to reboot, let it run for a while again and see if I can get more error details for the misbehaving worker nodes.

sostat.txt
capture_loss.txt

Seth Hall

unread,
Nov 19, 2013, 3:59:10 PM11/19/13
to securit...@googlegroups.com

On Nov 19, 2013, at 9:21 AM, bmp <benson....@gmail.com> wrote:

> Some worker nodes are dropping a lot... I've also attached the bro capture_loss logs.


What happened at Nov 19 11:29:27 UTC? It looks like your capture loss was great until that point and then it jumped way up.
signature.asc

bmp

unread,
Nov 20, 2013, 10:47:29 AM11/20/13
to securit...@googlegroups.com
Not sure what happened at that time 6:29am EST . Im thinking maybe backup traffic or some other large single stream session.

I added some filters on my tap to remove backup and video traffic. Will let it run for a day and see how it performs now.

Are there any tools that can do flow-ip profiling to see if there are any large single stream session over a given period of time? Is there a rule of thumb for how much traffic each Bro worker can handle?

Seth Hall

unread,
Nov 20, 2013, 11:20:45 AM11/20/13
to securit...@googlegroups.com

On Nov 20, 2013, at 10:47 AM, bmp <benson....@gmail.com> wrote:

> Not sure what happened at that time 6:29am EST . Im thinking maybe backup traffic or some other large single stream session.

I would be a bit surprised if that was it because all of your Bro processes jumped in packet loss. It could be interesting to take a look at your packet loss notices around that time to see if they jumped too. capture-loss will detect loss off of your box as well so I'd like to make sure that the packets are being lost on the box and not on a failing tap or something first.

> Are there any tools that can do flow-ip profiling to see if there are any large single stream session over a given period of time? Is there a rule of thumb for how much traffic each Bro worker can handle?

Not yet, I've been meaning to write a Bro script for a long time that provides some diagnostic information like that. I've been starting to tool around with a few different mechanisms for doing that over the past few months too, but I don't have anything solid yet.
signature.asc

bmp

unread,
Nov 20, 2013, 12:07:17 PM11/20/13
to securit...@googlegroups.com
Note sure if I collected the right ones... but this is what I ran...

zcat /nsm/bro/logs/2013-11-19/notice.* |grep 'Dropped_Packets' |cf|less > /var/tmp/dropped_packets.txt

See attached logs.

The second file consists of log entries where the drops were greater that 500.

dropped_packets.txt
captureloss_notice.txt
large_dropped_packets.txt
Reply all
Reply to author
Forward
0 new messages