Zeek 3.0.1 high CPU usage

Pete

Feb 22, 2020, 10:09:49 AM
to security-onion
All,

I'm seeing an issue with Zeek 3.0.1 where some of the worker processes peg the CPU at 100%.  The worker continues processing packets and writing logs, so the only way to detect this is to observe the CPU consumed by the Zeek worker processes, e.g. with top.  For me, the problem appears anywhere from a few minutes to a day after Zeek was last restarted.  I've paused upgrades from Zeek 2.6.4 until I get this figured out, as it's affecting both of the sensors I've upgraded to 3.0.1.

I'm working the issue with Justin from Corelight, and I see the 3.0 series is considered an LTS release, so I'm hoping for a resolution other than upgrading to 3.1.  From preliminary investigation, I think it's in an area of the code that's completely reworked in 3.1, so maybe he'll just back-port that.

I'm curious if anyone else is seeing this.  It would even be good to know if you've checked it and are not seeing the issue, as that would help me know it's something specific to my setup...

Can I talk a few SecurityOnion users into checking the CPU usage of their Zeek worker processes to see if any are pegged at 100+%, please?

An easy way to check is to run
  top -cbn1 -p$(pgrep -f '/opt/bro/bin/zeek -i'|paste -sd,)
and look for the %CPU of the first process listed.
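
A small wrapper around the same check can list only the workers that look pegged. This is just a sketch: the `hot_pids` helper and the 95% threshold are made up for illustration, it assumes the `/opt/bro/bin/zeek -i` process pattern from the command above, and it assumes top's default batch layout, where %CPU is the ninth column.

```shell
# hot_pids LIMIT: read "PID %CPU" pairs on stdin and print the PIDs whose
# %CPU is at or above LIMIT (default 95). Hypothetical helper name.
hot_pids() {
    awk -v limit="${1:-95}" '$2 + 0 >= limit { print $1 }'
}

# Live check: collect the worker PIDs, ask top for one batch sample, keep
# only the per-process rows (first field is a numeric PID), then filter.
pids=$(pgrep -f '/opt/bro/bin/zeek -i' | paste -sd, -)
if [ -n "$pids" ]; then
    top -bn1 -p "$pids" | awk '$1 ~ /^[0-9]+$/ { print $1, $9 }' | hot_pids 95
fi
```

No output means no worker was at or above the threshold at the moment of the sample, so it is worth repeating the check a few times after a restart.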
--
Pete

Doug Burks

Feb 23, 2020, 3:57:13 PM
to securit...@googlegroups.com
Hi Pete,

I'm not seeing this on any of my production boxes at all.  They all show Zeek running at the same CPU usage as the previous version of Bro.

Are you running any non-default Zeek scripts?

From your post on the Zeek mailing list, it looks like you're using AF_PACKET.  Have you tried switching to PF_RING to see if that makes any difference at all?

Are you running netsniff-ng?  If so, can you verify that it is running with the --no-hwtimestamp option?

Are you logging in TSV or JSON format?  Have you tried switching that format to see if it makes any difference at all?

Are you able to share a full sostat?

--
Follow Security Onion on Twitter!
https://twitter.com/securityonion
---
You received this message because you are subscribed to the Google Groups "security-onion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to security-onio...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/security-onion/f75cd7b3-6ef7-4600-9989-b751216546b6%40googlegroups.com.


--
Doug Burks
CEO
Security Onion Solutions, LLC

Pete

Feb 23, 2020, 5:00:36 PM
to security-onion
Thanks for checking, Doug.


On Sunday, 23 February 2020 20:57:13 UTC, Doug Burks wrote:
Are you running any non-default Zeek scripts?
I've enabled intel and frameworks/intel/whitelist, and disabled MHR.  No custom scripts.
 
From your post on the Zeek mailing list, it looks like you're using AF_PACKET.  Have you tried switching to PF_RING to see if that makes any difference at all?
I am using AF_PACKET.  I have not tried switching to PF_RING.  Is there a guide on how to do that?
 
Are you running netsniff-ng?  If so, can you verify that it is running with the --no-hwtimestamp option?
I am running netsniff-ng.  A ps listing shows it is running with --no-hwtimestamp.

Are you logging in TSV or JSON format?  Have you tried switching that format to see if it makes any difference at all?
I'm logging to JSON.  If I switch to TSV, logstash loses the ability to import into Elastic, doesn't it? 

Are you able to share a full sostat?
I'll see about getting that together this week.  This is an install done by loading Ubuntu Server, adding your repo, and installing securityonion-[sensor,server,elastic].  It is running a few other python scripts and docker images.
--
Pete

Doug Burks

Feb 24, 2020, 5:24:04 AM
to securit...@googlegroups.com
Hi Pete,

Replies inline.

On Sun, Feb 23, 2020 at 5:00 PM Pete <peti...@gmail.com> wrote:
Thanks for checking, Doug.

On Sunday, 23 February 2020 20:57:13 UTC, Doug Burks wrote:
Are you running any non-default Zeek scripts?
I've enabled intel and frameworks/intel/whitelist, and disabled MHR.  No custom scripts.
 
From your post on the Zeek mailing list, it looks like you're using AF_PACKET.  Have you tried switching to PF_RING to see if that makes any difference at all?
I am using AF_PACKET.  I have not tried switching to PF_RING.  Is there a guide on how to do that?

Off the top of my head, it should be something like:

sudo so-zeek-stop

Edit /opt/zeek/etc/node.cfg and change lb_method=custom to lb_method=pf_ring

sudo so-zeek-start
 
 
Are you running netsniff-ng?  If so, can you verify that it is running with the --no-hwtimestamp option?
I am running netsniff-ng.  A ps listing shows it is running with --no-hwtimestamp.

OK, that's good.  The reason I ask is that we did run into a somewhat similar issue previously with Suricata on AF_PACKET hitting 100% CPU usage when netsniff-ng would enable hardware timestamps (default for certain hardware).  We worked around that issue by adding the --no-hwtimestamp option to our default netsniff-ng options:


Are you logging in TSV or JSON format?  Have you tried switching that format to see if it makes any difference at all?
I'm logging to JSON.  If I switch to TSV, logstash loses the ability to import into Elastic, doesn't it? 

That depends on where you're doing the actual parsing.  If you're using the traditional Logstash parsing, then it should be able to handle both JSON and TSV format.  If you're using the new LOGSTASH_MINIMAL config where Logstash sends unparsed logs to Elasticsearch where they are then parsed using ingest node parsing, then that only handles JSON format.
 

Are you able to share a full sostat?
I'll see about getting that together this week.  This is an install done by loading Ubuntu Server, adding your repo, and installing securityonion-[sensor,server,elastic].  It is running a few other python scripts and docker images.

Are any of your additions using AF_PACKET?  I'm wondering if there might be a weird interaction similar to the Suricata/netsniff-ng issue mentioned above.
 

Pete Nelson

Feb 24, 2020, 8:46:29 AM
to securit...@googlegroups.com
Thanks, Doug. Responses inline.
--
Pete

On Mon, Feb 24, 2020 at 10:24 AM Doug Burks
<doug....@securityonionsolutions.com> wrote:
> On Sun, Feb 23, 2020 at 5:00 PM Pete <peti...@gmail.com> wrote:
>> I am using AF_PACKET. I have not tried switching to PF_RING. Is there a guide on how to do that?
> Off the top of my head, it should be something like:
> sudo so-zeek-stop
> Edit /opt/zeek/etc/node.cfg and change lb_method=custom to lb_method=pf_ring
> sudo so-zeek-start

Thank you. I'll give that a shot. We have spare CPU/RAM and
relatively low packet rate, so PF_RING should have no trouble keeping
up.

>> I'm logging to JSON. If I switch to TSV, logstash loses the ability to import into Elastic, doesn't it?
> That depends on where you're doing the actual parsing. If you're using the traditional Logstash parsing, then it should be able to handle both JSON and TSV format. If you're using the new LOGSTASH_MINIMAL config where Logstash sends unparsed logs to Elasticsearch where they are then parsed using ingest node parsing, then that only handles JSON format.

We're running an ArcSight connector each for Zeek, Snort, and syslog
(and some others sent via syslog on alternate ports). When we
converted from the prior SecurityOnion release, which changed the
default Zeek logs from TSV to JSON, we converted the connector as
well. Switching back to TSV would require restoring the old-style TSV
connector. It's doable, but would be one of the last things I'd want
to try.

I'll look into the LOGSTASH_MINIMAL config, as the startup time on
logstash is a pain point. That's a separate discussion, though.

> Are any of your additions using AF_PACKET? I'm wondering if there might be a weird interaction similar to the Suricata/netsniff-ng issue mentioned above.
I don't think so. I am using the PF_RING-enabled libpcap to monitor
and parse DNS separately from bro.

Pete Nelson

Feb 24, 2020, 9:20:04 AM
to securit...@googlegroups.com
> > Off the top of my head, it should be something like:
> > sudo so-zeek-stop
> > Edit /opt/zeek/etc/node.cfg and change lb_method=custom to lb_method=pf_ring
> > sudo so-zeek-start
>
> Thank you. I'll give that a shot. We have spare CPU/RAM and
> relatively low packet rate, so PF_RING should have no trouble keeping
> up.

I was hopeful this would be a solution, but within minutes, I had one
process back up at 102% again (100 on the main thread and a little on
a couple of the others). If anything, it might be happening sooner
using pf_ring.

I'll wait a bit to see if Justin responds before I start fiddling with
other areas.
--
Pete

Francois

Feb 24, 2020, 1:21:25 PM
to security-onion
Pete,

I think I might be experiencing the same thing as you (I did post about it earlier - https://groups.google.com/forum/#!topic/security-onion/fD8DpO3hdgM).

The top command from your first post shows 9 of 11 processes at >= 100%.

I'm on Ubuntu with latest SO and pretty much vanilla install (I do have some custom logstash configs for an in-house custom app and Cisco firewall).

Let me know if I can help.

Thanks,

Francois

Pete

Feb 24, 2020, 2:49:56 PM
to security-onion
Francois,

That definitely sounds like what I'm experiencing.  How did you build your server(s), from the SecurityOnion install ISO, or from Ubuntu's Server install ISO plus manually adding the SO PPA and packages?

You can run the following command as root and compare the number produced between zeek processes running at 100%+ and those that are running at normal CPU levels (replace 28964 with your zeek PID, ignore job control messages in the output):
  strace -p 28964 -s 220 2>&1 | grep nanosleep | wc -l & sleep 1; pkill strace; sleep 0.2
I'm seeing around 6500 calls to nanosleep for the normal processes and 0 for those consuming all of the CPU.
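
To sample every worker in one pass, the one-liner above can be wrapped in a loop. A rough sketch, assuming the `/opt/bro/bin/zeek -i` process pattern from the first post, root privileges for strace, and coreutils `timeout` standing in for the background job plus pkill; `count_nanosleep` is a hypothetical helper name.

```shell
# count_nanosleep: count the lines of strace output on stdin that mention
# nanosleep. grep -c exits non-zero on a count of 0, so mask that status.
count_nanosleep() {
    grep -c 'nanosleep' || true
}

# Attach to the main thread of each worker for about one second and report
# how many nanosleep calls were seen. Run as root.
for pid in $(pgrep -f '/opt/bro/bin/zeek -i'); do
    calls=$(timeout 1 strace -p "$pid" -s 220 2>&1 | count_nanosleep)
    echo "$pid nanosleep_calls=$calls"
done
```

A worker reporting a count of 0 here would match the pegged processes described above, while counts in the thousands would match the normal ones.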

I am running additional logstash configs too, plus java-based event connectors, but this seems much more low-level than that.  This appears to be an IPC issue between POSIX threads within the zeek process, if I'm reading my trace info and perf output correctly.

Thanks for speaking up.  I'm sorry you're seeing it too, but glad I'm not the only one.  :)

If you'd like to let Justin know you see it too, please join the zeek mailing list at
--
Pete

Doug Burks

Feb 24, 2020, 6:06:15 PM
to securit...@googlegroups.com
Thanks for confirming, Pete.  At least we can eliminate the AF_PACKET plugin as a possible culprit.

Sounds like Justin knows where the issue is.  If you guys are able to convince him that it is indeed a bug, then perhaps they can fix it and release Zeek 3.0.2?
 

Pete

Feb 25, 2020, 12:14:35 PM
to security-onion
I think I'm onto something, and I think it explains why the guys at Corelight may have missed this.

High-performance Zeek installs use CPU pinning, as you can reduce latency and cache misses by using the same CPU socket as the one the NIC IRQs come in on.  They likely do much or all of their testing in this mode, as it offers the best performance and highest throughput.  I don't, as I have ample performance without pinning, and there are other loads on the system that wouldn't benefit from that.

Yet, when I pin the Zeek workers to a set of CPU cores, I no longer see them eventually peg CPU usage at 100%.

My guess is that it's a race condition of some sort: the main thread and one of its children interact in such a way that a pipe used for signalling between the two ends up with an extra character in it.  That character shows up as ready for I/O when calling select(), which prevents the main thread from ever calling usleep() between packets.
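
The general mechanism behind that guess (not the actual Zeek or Broker code) can be shown with a named pipe in bash: a byte that is written but never drained keeps the descriptor readable on every check, so a loop that sleeps only when nothing is ready would never sleep. This is an illustration under stated assumptions: it needs bash 4+ on Linux, the pipe path is a throwaway temp file, `fd_state` is a made-up helper, and bash's `read -t 0` stands in for the select() readiness test.

```shell
#!/usr/bin/env bash
# Illustration only: an undrained byte keeps a pipe "ready" indefinitely.
dir=$(mktemp -d)
mkfifo "$dir/wake"
exec 3<>"$dir/wake"        # open read/write so neither end blocks

# "read -t 0" only asks whether fd 3 has input; it consumes nothing.
fd_state() { if read -t 0 -u 3; then echo ready; else echo idle; fi; }

before=$(fd_state)         # nothing written yet
printf 'x' >&3             # the stray "extra character"
after=$(fd_state)          # now reported as ready
still=$(fd_state)          # asking again does not drain the byte
read -r -n 1 -u 3 byte     # drain it, as a correct reader would
drained=$(fd_state)        # back to idle

echo "$before $after $still $drained"
exec 3>&-
rm -rf "$dir"
```

The second and third checks both report ready because nothing consumed the byte in between; only the explicit read returns the descriptor to idle.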

I have not yet dedicated those cores to just Zeek, and have not found any works-for-all guides for pinning processes to the CPUs since it is highly system specific.  

Francois, if you try pinning your workers to specific CPUs, can you please let me know if that fixes your load issue?
--
Pete



Doug Burks

Feb 25, 2020, 2:04:59 PM
to securit...@googlegroups.com
Hi Pete,

It does sound like you may be onto something.  The production systems I previously reported on are single socket systems.  I had also tried to duplicate your report in a VM, but I believe all the VMs I tried were configured in VMware to be single socket systems.  So that may help explain why I was unable to duplicate your issue.


Francois

Feb 25, 2020, 5:55:59 PM
to security-onion
Pete,

I'll give this a go (once I figure out how to do it) and get back to you.

Thanks!

François

Pete Nelson

Feb 25, 2020, 6:34:07 PM
to securit...@googlegroups.com
Francois,

There's a slide deck (warning: PDF) at
https://www.zeek.org/zeekweek2019/slides/michal-mozilla.pdf that has
some hints. I'm still learning myself, so I'm not ready yet to say
what's working for me.

The simple, but not optimal, solution is to pin things to CPUs without
regard to NIC interrupts, reserved cores, and such. All that's
required for that is adding a line in the /opt/zeek/etc/node.cfg file
in the sensor's stanza like "pin_cpus=0,1,2" (use exactly the same
number of CPUs as you have specified in lb_procs) and restarting zeek.
That doesn't reserve those CPUs for zeek only, though. If other
tasks are running there and zeek needs the CPU, in theory the other
tasks should be moved.
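
For illustration, a pinned worker stanza might look like the commented sample in the block below, and a quick awk check can confirm that the number of pin_cpus entries matches lb_procs. The stanza name, interface, and the `check_pinning` helper are hypothetical; only `lb_method`, `lb_procs`, `pin_cpus`, and the node.cfg path come from this thread, and the check assumes plain `key=value` lines with no spaces around the `=`.

```shell
# Sample stanza (hypothetical names and values):
#   [sensor-eth1]
#   type=worker
#   host=localhost
#   interface=eth1
#   lb_method=custom
#   lb_procs=3
#   pin_cpus=0,1,2

# check_pinning FILE: for every stanza that sets both lb_procs and pin_cpus,
# print the stanza header followed by "ok" or "mismatch".
check_pinning() {
    awk -F= '
        /^\[/            { section = $0 }
        $1 == "lb_procs" { procs[section] = $2 + 0 }
        $1 == "pin_cpus" { cpus[section] = split($2, ids, ",") }
        END {
            for (s in procs)
                if (s in cpus)
                    print s, (cpus[s] == procs[s] ? "ok" : "mismatch")
        }
    ' "$1"
}

# Usage: check_pinning /opt/zeek/etc/node.cfg
```

After restarting zeek, running `taskset -cp <pid>` against a worker PID shows the affinity list that process actually ended up with.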
--
Pete

Doug Burks

Mar 11, 2020, 1:04:23 PM
to securit...@googlegroups.com
Hi Pete and Francois,

I just released our Zeek 3.0.3 packages for testing:
https://github.com/Security-Onion-Solutions/security-onion/issues/1726

It would be great if you are able to do some testing on a test box and verify that this CPU usage issue is resolved in your environments.

Thanks!


Francois

Mar 11, 2020, 5:41:25 PM
to security-onion
Unfortunately, I do not have a test box that is experiencing this problem.

Steven J

Mar 12, 2020, 10:48:36 AM
to securit...@googlegroups.com

@Doug Burks, thank you for:
' Edit /opt/zeek/etc/node.cfg and change lb_method=custom to lb_method=pf_ring '

I was already running pf_ring for my 4 workers and somebody snuck this one in when I wasn't looking.
Setting method to pf_ring has put my cpu consumption back to the 12-34% range. :-)

Sjm


Doug Burks

Mar 12, 2020, 12:59:59 PM
to securit...@googlegroups.com
Hi Steven,

I'm not sure I understand.  The suggestion I gave was just a temporary test so that we could try to pinpoint if this were an issue with AF_PACKET or PF_RING.  It turned out that it was not related to either of those but was an issue in the broker component in Zeek.  That broker issue should be resolved in Zeek 3.0.3, currently in testing.  So the hope is that Zeek 3.0.3 will be back to normal levels of CPU usage regardless of AF_PACKET or PF_RING.

In general, we see better performance and reliability with AF_PACKET and that's why it's our standard for new installations now.

Steven Malm

Mar 12, 2020, 2:25:33 PM
to securit...@googlegroups.com

Had I not read your test suggestion, I would not have noticed we had both running at the same time.
Choosing the one that was already working has resolved my version of the issue. :-)


 


Pete

Mar 12, 2020, 3:36:03 PM
to security-onion
Steven Malm and Steven J,

I wanted to make sure you don't misunderstand what's going on here.  The fix is not setting the lb_method.  I have personally seen the CPU load peg using both AF_PACKET and PF_RING.  It may appear you've fixed it, but what you're seeing is just due to zeek being restarted; without the version upgrade, it is likely to come back after some time.

The fix is coming in Zeek v3.0.3 in the form of a code update in the broker package.

Doug, I've been running v3.0.3 on one system for close to a day, and haven't seen the issue.  I'm going to test on another system or two, but so far it looks good.  I haven't tried any other variants as described in the testing procedure, however.

I'll give another update tomorrow.
--
Pete



Doug Burks

Mar 12, 2020, 3:49:22 PM
to securit...@googlegroups.com
Thanks for the update, Pete!


Doug Burks

Mar 17, 2020, 1:17:58 PM
to securit...@googlegroups.com

Francois

Mar 17, 2020, 2:20:49 PM
to security-onion
Nice!  Updating today...

