We are currently running a POC of Security Onion. Our setup:
1 Bro/Zeek forward sensor (2x18 core Xeons/72 HT)
1 Suricata forward sensor (2x12 core Xeons/48 HT)
1 Master (2x12 core Xeons/48 HT)
5 Storage nodes (4-6 core VMware hosts)
All disks are either spinning drives or VMware virtual
disks.
As this is a proof-of-concept setup, the hardware is not
ideal. However, I have been able to pinpoint some possible
areas for improvement in the SO setup.
(For reference, we are running the securityonionsolutionselas
docker images on the master and the storage nodes.)
We currently have our border routers sending traffic
to a GigaMON. That traffic is then sent to the Zeek
SO sensor, the Suricata SO sensor, and one of our
production snort sensors (among other things).
I first noticed a problem with the bro events in SO
when looking at the graphs for daily traffic. As the
day progressed, the number of events went down, then
went back up at night; in general, the opposite of
what you'd expect.
Also, when investigating events found in our snort setup,
it seemed that some of those events had no corresponding
events in Bro.
After doing some checking, I found that during some parts
of the day, over 95% of our bro events were never making
it off the Bro sensor.
syslog-ng-ctl was reporting high numbers of event drops. I
set up a cron job to dump syslog-ng-ctl stats to a file
once per hour, and wrote a little script to summarize the
counters on hourly intervals (our time zone is EDT; a crontab
sketch is included after the stats below):
Morning:
Stats for Interval : 2019-09-18 10:00:01 (2019-09-18 06:00:01 EDT) - 2019-09-18 11:00:01 :
Seconds : 3600.0 (1 hour)
Processed : 34,877,990
Dropped : 27,882,494
Written : 6,995,496
Written/s : 1943.19
Processed/s : 9688.33
Drops/s : 7745.14
Drop % : 79.94%
Peak Time:
Stats for Interval : 2019-09-18 16:00:01 (2019-09-18 12:00:01 EDT) - 2019-09-18 17:00:01 :
Seconds : 3600.0 (1 hour)
Processed : 102,267,929
Dropped : 98,591,776
Written : 3,676,153
Written/s : 1021.15
Processed/s : 28407.76
Drops/s : 27386.60
Drop % : 96.41%
There are quite a few drops in the morning, but during peak
traffic the drop rate climbs above 96%. Meanwhile, according
to Kibana, everything appears to be working as expected, so
nothing in the UI hints at the drops.
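For reference, the collection side of those stats is nothing fancy;
it was roughly a crontab entry like this (the dump path is an
assumption, and the %'s need escaping inside cron):

  # dump the cumulative syslog-ng counters once per hour
  0 * * * * (date -u '+\%F \%T'; /usr/sbin/syslog-ng-ctl stats) >> /nsm/syslog-ng-stats.log

The per-interval numbers above are then just the differences between
consecutive dumps, since syslog-ng-ctl reports cumulative counters.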
In trying to pinpoint where the issues are, we first made some
changes to syslog-ng. We enabled flow-control so syslog-ng
would try to ensure all the messages in the bro log files
made it into Elastic. This worked after a fashion: all the
logs were eventually ingested into Elastic, but it took hours
to do so, and bro logs were rotated out several times before
syslog-ng could finish, leaving large gaps in the timeline
with a few hours containing huge spikes of events.
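For reference, enabling flow-control is just a flag on the log path;
the change was along these lines (the source/destination names here
are placeholders rather than the actual SO config):

  log {
      source(s_bro_logs);
      destination(d_logstash);
      flags(flow-control);
  };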
Next, we turned our attention to the master to see if part
of the problem was happening there.
I installed the non-oss version of Kibana on our storage
nodes so I could check ingestion rates. With the default
syslog-ng setup, we were ingesting about 500 eps per node,
or roughly 2500 eps total.
On the master, we increased the logstash setting
pipeline.batch.size
from 125 to 2048 and saw a modest improvement in the rate of
ingestion. We eventually increased it to 40960, which helped
more, but the ingested events started falling behind.
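The change itself is a one-liner in logstash.yml (exactly where that
file lives depends on how the logstash docker container mounts its
config), keeping in mind that larger batches also increase heap usage:

  # default is 125; larger batches mean fewer, bigger bulk requests
  pipeline.batch.size: 2048
  #pipeline.batch.size: 40960   # second attempt: helped more, but ingestion fell behind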
This seemed to indicate that logstash was having trouble
ingesting events at the rate syslog-ng was sending them. However,
logstash gives no indication of processing problems, even with
the logging level raised to "trace".
The syslog-ng documentation seems to agree:
https://www.syslog-ng.com/technical-documents/doc/syslog-ng-open-source-edition/3.16/administration-guide/destination-queue-full
If flow-control is disabled, syslog-ng will drop messages if the destination queues are full
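In case anyone wants to double-check the logstash side themselves,
the log level can be raised either persistently in logstash.yml or
at runtime through the node API (the logger name below is only an
example):

  # logstash.yml
  log.level: trace

  # or, at runtime, on the master:
  curl -XPUT 'localhost:9600/_node/logging' -H 'Content-Type: application/json' \
    -d '{"logger.logstash.inputs.tcp" : "TRACE"}'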
To further investigate this, I wrote a small Perl script
("perlstash") that reads events on port 6050, "tags" them
with "syslogng" via a search/replace, and pipes them into
redis-cli --pipe using the redis protocol. I then ran this
from a Docker container.
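The attached files have the details, but the core of the approach is
small enough to sketch here (the redis list key and the tagging
substitution are simplified guesses; check the redis input config on
the storage nodes for the actual key):

  #!/usr/bin/perl
  # perlstash (sketch): accept events on 6050 and push them straight
  # into redis, bypassing logstash on the master.
  use strict;
  use warnings;
  use IO::Socket::INET;

  my $key = 'logstash';    # redis list the storage nodes read from (assumed)

  my $server = IO::Socket::INET->new(
      LocalPort => 6050,
      Proto     => 'tcp',
      Listen    => 10,
      ReuseAddr => 1,
  ) or die "listen failed: $!";

  # redis-cli --pipe reads raw redis protocol on stdin and bulk-loads it
  open(my $redis, '|-', 'redis-cli --pipe') or die "can't start redis-cli: $!";

  while (my $client = $server->accept) {
      while (my $line = <$client>) {
          chomp $line;
          # "tag" the JSON event so the storage-node filters pick it up
          $line =~ s/\}\s*$/,"tags":["syslogng"]}/;
          # redis protocol encoding of: RPUSH <key> <event>
          print {$redis} join("\r\n",
              '*3',
              '$5', 'RPUSH',
              '$' . length($key),  $key,
              '$' . length($line), $line,
          ) . "\r\n";
      }
  }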
The improvement was pretty significant. From syslog-ng:
Morning:
Stats for Interval : 2019-10-11 05:00:01 (2019-10-11 01:00:01 EDT) - 2019-10-11 06:00:01 :
Seconds : 3600.0 (1 hour)
Processed : 46,553,013
Dropped : 16,615,449
Written : 29,937,564
Written/s : 8315.99
Processed/s : 12931.39
Drops/s : 4615.40
Drop % : 35.69%
Peak Time:
Stats for Interval : 2019-10-11 16:00:01 (2019-10-11 12:00:01 EDT) - 2019-10-11 17:00:02 :
Seconds : 3601.0 (1 hour)
Processed : 84,793,371
Dropped : 54,784,684
Written : 30,008,687
Written/s : 8333.43
Processed/s : 23547.17
Drops/s : 15213.74
Drop % : 64.61%
The events written now remain roughly constant through the day at
approximately 30M/hr. The storage node backends are reporting
8-10k eps, which roughly corresponds to this. With this, it seems
reasonable to conclude that the master-node-as-queue design is a
significant bottleneck in our current setup.
Next, I set up a 3-node kafka cluster and configured syslog-ng
with kafka-c support (a trimmed version of that destination config
is included after the stats below). With this setup I was able to
improve the numbers even further:
Stats for Interval : 2019-10-16 19:00:01 (2019-10-16 15:00:01 EDT) - 2019-10-16 19:25:55 :
Seconds : 1554.0 (25 minutes)
Processed : 37,587,448
Dropped : 7,423,401
Written : 30,164,047
Written/s : 19410.58
Processed/s : 24187.55
Drops/s : 4776.96
Drop % : 19.75%
Stats for Interval : 2019-10-16 19:25:55 (2019-10-16 15:25:55 EDT) - 2019-10-16 20:00:02 :
Seconds : 2047.0 (35 minutes)
Processed : 48,866,709
Dropped : 9,173,456
Written : 39,693,253
Written/s : 19390.94
Processed/s : 23872.35
Drops/s : 4481.41
Drop % : 18.77%
The storage nodes couldn't really keep up with this volume
and seemed to max out at around 12.5k eps. I only kept
this setup running for about an hour, then re-enabled the
Perl reader and let kafka drain out. Once all the queued events
were finally ingested, it came to roughly 2x the number of
events ingested by the Perl reader alone.
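For reference, the syslog-ng side of that test was a kafka-c
destination along these lines (trimmed; the full config is attached,
the broker list and topic are placeholders, and the exact driver
options vary between syslog-ng releases, so check the admin guide for
your version):

  destination d_kafka {
    kafka-c(
      config(
        "metadata.broker.list" => "kafka1:9092,kafka2:9092,kafka3:9092",
        "queue.buffering.max.ms" => "1000"
      )
      topic("bro_raw")
    );
  };
  log { source(s_bro_logs); destination(d_kafka); };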
I would also think sending bro logs straight to kafka could
improve performance even further, but I haven't tried the
plugin found here:
https://github.com/apache/metron-bro-plugin-kafka
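I haven't tested it, but going by that plugin's README, the Zeek-side
config would look roughly like this (topic and broker list are
placeholders):

  @load packages   # assumes the plugin was installed via bro-pkg/zkg
  redef Kafka::logs_to_send = set(Conn::LOG, HTTP::LOG, DNS::LOG);
  redef Kafka::topic_name = "bro_raw";
  redef Kafka::tag_json = T;
  redef Kafka::kafka_conf = table(
      ["metadata.broker.list"] = "kafka1:9092,kafka2:9092,kafka3:9092"
  );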
These are the issues I can see at this point:
1) There is no alerting on dropped syslog-ng messages in the
SO setup.
2) The master-as-queue design seems to impose a significant
bottleneck on event processing.
3) It's difficult to pinpoint what is causing the bottleneck, as
logstash on the master doesn't seem to indicate any problems
ingesting the events.
4) logstash and redis on the master end up fighting each other
for resources.
Questions:
1) What would be the preferred method of getting syslog-ng stats
into SO? A metricbeat module?
2) Would the SO group be interested in adding kafka clusters or
other queueing systems to the SO configuration?
There's no doubt our setup would benefit from hardware improvements,
but it seems that with some adjustments an SO cluster could
perform better than the default forward/master/storage setup.
With all that said, it's also entirely possible I've missed some
things or could do things differently. I welcome any and all
feedback, thanks.
(Attached are the Perl program/Dockerfile I used for testing,
the kafka config I added to /etc/logstash/custom on the master,
and the syslog-ng kafka-c config example.)
--
Jim Hranicky
Data Security Specialist
UF Information Technology
720 SW 2nd Avenue Suite 450, North Tower, Gainesville, FL 32605
352-273-1341