Help monitor 10G links


yawnbox

Oct 30, 2014, 8:07:55 PM
to securit...@googlegroups.com
Hi SO community,

I manage a large lab environment that has no NSM. To give an idea of what I'm up against: I'm currently TAPing (but failing to keep up with) one of two border switches, and during the day that TAP is a fully saturated 10G link. I will eventually need to monitor the second border switch as well, for a total of 20Gbps.

Hardware limitations:
- E5-2620 (2x 2GHz 6-core CPUs; 24 logical cores with HT)
- 192GB of RAM

I can use several of these servers if needed, but each server can't support more than 192GB of RAM. I know this is not recommended for a 10G link, so I will need recommendations for what needs to be cut out, likely rules.

Q1- will cutting down on rules allow me to monitor a 10G link with 192GB of RAM?

I don't have experience deploying server/sensor configurations. Having read the wiki, it is not clear to me how the storage needs to be deployed if I use one "server" and two "sensors", presuming that I can leverage one 192GB RAM node for one 10G TAP.

When I initially deployed SO on one of the above servers as a standalone node, within 30 minutes I had used at least 5% of 2TB, and looking at the NIC stats now, I see the server can't keep up at all:

Packets: 8,857,581,801
Dropped: 56,371,058
...and hundreds of thousands of bad "SURICATA STREAM" events.

Q2- should a "server" have the most storage space, or should the "sensors"?

Each node could have up to 14TB of space, but I presume the sensors don't need that much.

One of my biggest problems is that I don't grasp how all of the SO tools work with each other. I do not know how to troubleshoot my issues, or where bottlenecks are.

Q3- Is there a high-level overview of how Suricata works with Bro, or Squert with MySql, etc?

Q4- what tuning would need to be done above and beyond the default SO.iso advanced configuration?

Q5- how would you monitor two 10G TAPs with my limitations?

I appreciate any and all feedback. I manage a small SO standalone server for another company that I volunteer for, but it's a 500Mbps TAP that the standard SO configuration is fine for. This deployment is going to teach me a lot, and I hope to contribute back to the community when possible.

Cheers

Lee Sharp

Oct 30, 2014, 10:30:47 PM
to securit...@googlegroups.com
Replies inline and heavily trimmed...

On 10/30/2014 07:07 PM, yawnbox wrote:
> Q1- will cutting down on rules allow me to monitor a 10G link with 192GB of RAM?

You will need to. Probably start with the rules generating most of those "SURICATA STREAM" events.

> I don't have experience deploying server/sensor configurations. Having read the wiki, it is not clear to me how the storage needs to be deployed if I use one "server" and two "sensors", presuming that I can leverage one 192GB RAM node for one 10G TAP.

Start with the big boy and see how it tunes.

> When I initially deployed SO on one of the above servers as a standalone node, within 30 minutes I had used at least 5% of 2TB, and looking at the NIC stats now, I see the server can't keep up at all:
>
> Packets: 8,857,581,801
> Dropped: 56,371,058
> ...and hundreds of thousands of bad "SURICATA STREAM" events.

Not a surprise...

> Q2- should a "server" have the most storage space, or should the "sensors"?
>
> Each node could have up to 14TB of space, but I presume the sensors don't need that much.
>
> One of my biggest problems is that I don't grasp how all of the SO tools work with each other. I do not know how to troubleshoot my issues, or where bottlenecks are.

The Server stores the database, which holds all your logs and the
Sguil and Snorby events.
The Sensor stores the pcaps. At 10gig, you will need a lot of space for
pcaps. Note that you set the disk-usage threshold at which it starts
deleting them. Unless you like fragmentation, I recommend 75% not 90%...
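
For scale, the arithmetic at line rate is brutal:

10 Gb/s / 8 = 1.25 GB/s
1.25 GB/s * 86,400 s/day = ~108 TB/day
14 TB / 108 TB per day = ~3 hours of pcap

So even a 14TB /nsm only buys a few hours of full-rate capture. Budget
accordingly.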

> Q4- what tuning would need to be done above and beyond the default SO.iso advanced configuration?

Disabling unneeded rules, and thresholding some others.

> Q5- how would you monitor two 10G TAPs with my limitations?

Probably two sensors with serious power behind them.

Lee

Doug Burks

Oct 31, 2014, 9:56:43 AM
to securit...@googlegroups.com
Hi yawnbox,

Replies inline.

On Thu, Oct 30, 2014 at 8:07 PM, yawnbox <yaw...@gmail.com> wrote:
> Hi SO community,
>
> I manage a large lab environment that has no NSM. To give an idea of what I'm up against: I'm currently TAPing (but failing to keep up with) one of two border switches, and during the day that TAP is a fully saturated 10G link. I will eventually need to monitor the second border switch as well, for a total of 20Gbps.
>
> Hardware limitations:
> - E5-2620 (2x 2GHz 6-core CPUs; 24 logical cores with HT)
> - 192GB of RAM
>
> I can use several of these servers if needed

If you can load-balance the traffic between sensors (using a
flow-based load balancer, for example), then you can just keep adding
sensors as needed to handle additional traffic.

> , but each server can't support more than 192GB of RAM. I know this is not recommended for a 10G link, so I will need recommendations for what needs to be cut out, likely rules.

Yes, only run the IDS rules necessary for your environment, and only
run the processes necessary for your environment.

> Q1- will cutting down on rules allow me to monitor a 10G link with 192GB of RAM?

It depends on rules, traffic, hardware, and many other variables. If
you can provide sostat-redacted output, perhaps we can provide more
detailed recommendations.
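
(You can generate that on the sensor with something like:

sudo sostat-redacted > sostat-redacted.txt

and attach the resulting file.)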

> I don't have experience deploying server/sensor configurations. Having read the wiki, it is not clear to me how the storage needs to be deployed if I use one "server" and two "sensors", presuming that I can leverage one 192GB RAM node for one 10G TAP.

If you're running full packet capture and ELSA, then your sensors need
MUCH more storage than your master server.

> When I initially deployed SO on one of the above servers as a standalone node, within 30 minutes I had used at least 5% of 2TB, and looking at the NIC stats now, I see the server can't keep up at all:
>
> Packets: 8,857,581,801
> Dropped: 56,371,058
> ...and hundreds of thousands of bad "SURICATA STREAM" events.
>
> Q2- should a "server" have the most storage space, or should the "sensors"?
>
> Each node could have up to 14TB of space, but I presume the sensors don't need that much.

If you're running full packet capture, Bro, and/or ELSA, then your
sensors need MUCH more storage than your master server.

> One of my biggest problems is that I don't grasp how all of the SO tools work with each other. I do not know how to troubleshoot my issues, or where bottlenecks are.
>
> Q3- Is there a high-level overview of how Suricata

Suricata sniffs traffic and writes out NIDS alerts in unified2 format.
barnyard2 scoops up the unified2 alerts and sends them to 3
destinations by default:
1. Snorby database
2. Sguil's Snort agent --> Sguil database (MySQL)
3. syslog --> ELSA

> works with Bro

Bro sniffs traffic and writes logs to /nsm/bro/logs/current/.
syslog-ng scoops up these logs and sends them to ELSA, where they are
stored in MySQL and indexed by Sphinx.

> , or Squert with MySql, etc?

Squert is a web interface for the Sguil database.
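
Putting those pieces together, the default data flow looks roughly like
this (a simplified sketch of the above, not an exhaustive diagram):

sniffing NIC -> Suricata -> unified2 -> barnyard2 -> {Snorby DB, Sguil DB (MySQL), syslog -> ELSA}
sniffing NIC -> Bro -> /nsm/bro/logs/current/ -> syslog-ng -> ELSA (MySQL + Sphinx)
Squert -> web view of the Sguil DB (MySQL)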

> Q4- what tuning would need to be done above and beyond the default SO.iso advanced configuration?

Depends on what choices you made during Advanced Setup. Have you seen
our PostInstallation page?
https://code.google.com/p/security-onion/wiki/PostInstallation

If you can provide sostat-redacted output, perhaps we can provide more
detailed recommendations.

> Q5- how would you monitor two 10G TAPs with my limitations?

I'd use a flow-based load balancer to evenly distribute the traffic
among multiple sensors.

> I appreciate any and all feedback. I manage a small SO standalone server for another company that I volunteer for, but it's a 500Mbps TAP that the standard SO configuration is fine for. This deployment is going to teach me a lot, and I hope to contribute back to the community when possible.


--
Doug Burks
Need Security Onion Training or Commercial Support?
http://securityonionsolutions.com

yawnbox

Nov 3, 2014, 8:51:53 PM
to securit...@googlegroups.com
I've deployed a server/sensor configuration, and I've managed to replace my sensor with something more appropriate. The sensor now has 512GB of RAM, and four ten-core E7-4870 @ 2.40GHz CPUs (80 cores total with HT).

I've also created an /nsm XFS volume which is about 11TB.
NOTE: the directions here are slightly wrong: https://code.google.com/p/security-onion/wiki/NewDisk

it's "umount" not "unmount"

with the Ubuntu 12.04 SO .iso, this line was added to /etc/fstab:
UUID=234524-3452-45674-3456-456732455674567 /nsm xfs rw,user,auto 0 1
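
(The UUID comes from "sudo blkid"; to sanity-check the entry without
rebooting, "sudo mount -a && df -h /nsm" should bring /nsm up straight
from fstab.)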

I configured the sensor to use 20 cores for Suricata, and 20 for Bro. The documentation does not specify whether hyper-threaded cores count as cores.
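
For reference, the Bro worker count lives in /opt/bro/etc/node.cfg on my
install; the worker stanza looks roughly like this (the interface name is
just an example):

[worker-1]
type=worker
host=localhost
interface=eth1
lb_method=pf_ring
lb_procs=20

lb_method=pf_ring is what lets the 20 worker processes share one NIC.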

I'm still losing a lot of packets. I've attached a preliminary sostat-redacted. Something still seems like it can't keep up.

sostat_redacted.txt

Lee Sharp

Nov 3, 2014, 10:41:12 PM
to securit...@googlegroups.com
CPU is not a problem, and your memory is fine. I am thinking that your
drives cannot keep up with the packet capture. You may need an array
of SSDs to sustain a 10Gig write rate.

Lee

Doug Burks

Nov 4, 2014, 7:40:50 AM
to securit...@googlegroups.com
Replies inline.

On Mon, Nov 3, 2014 at 8:51 PM, yawnbox <yaw...@gmail.com> wrote:
> I've deployed a server/sensor configuration, and I've managed to replace my sensor with something more appropriate. The sensor now has 512GB of RAM, and four ten-core E7-4870 @ 2.40GHz CPUs (80 cores total with HT).
>
> I've also created an /nsm XFS volume which is about 11TB.
> NOTE: the directions here are slightly wrong: https://code.google.com/p/security-onion/wiki/NewDisk
>
> it's "umount" not "unmount"

Fixed, thanks!

> with the Ubuntu 12.04 SO .iso, this line was added to /etc/fstab:
> UUID=234524-3452-45674-3456-456732455674567 /nsm xfs rw,user,auto 0 1
>
> I configured the sensor to use 20 cores for Suricata, and 20 for Bro. The documentation does not specify whether hyper-threaded cores count as cores.
>
> I'm still losing a lot of packets. I've attached a preliminary sostat-redacted. Something still seems like it can't keep up.

You should disable these services:
* prads (sessions/assets)[ OK ]
* sancp_agent (sguil)[ OK ]
* pads_agent (sguil)[ OK ]
* argus[ OK ]
* http_agent (sguil)[ OK ]

https://code.google.com/p/security-onion/wiki/DisablingProcesses
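
In short: edit /etc/nsm/securityonion.conf, set the corresponding
*_ENABLED variables to "no" (the exact variable names are on that wiki
page), and then restart the sensor processes:

sudo nsm_sensor_ps-restart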

yawnbox

Nov 14, 2014, 8:29:46 PM
to securit...@googlegroups.com
Ok. My issue with severe packet drop appears to be resolved by disabling flow control. The default pause parameters were:

admin@securityonion:~$ ethtool -a ethX
Pause parameters for ethX:
Autonegotiate: off
RX: on
TX: on

This works great:
sudo ethtool -A ethX tx off rx off autoneg off

But it does not persist across reboots, so it is not a permanent solution.

None of these solutions work:
https://help.ubuntu.com/community/UbuntuLTSP/FlowControl

"Disabling flow control with module parameters"
"Disabling flow control with ethtool"

Just to comment, changing the MTU from 1500 to 9000 also did not affect packet drop. It definitely appears to be flow control.

Doug Burks

Nov 15, 2014, 9:26:29 AM
to securit...@googlegroups.com
You could probably add it to our existing ethtool statements in
/etc/network/interfaces:
https://code.google.com/p/security-onion/wiki/NetworkConfiguration

I don't know of anybody else who has had to disable flow control so,
out of curiosity, what are your sniffing interfaces plugged into? And
how are those ports configured?

yawnbox

Nov 17, 2014, 8:03:27 PM
to securit...@googlegroups.com
On Saturday, November 15, 2014 6:26:29 AM UTC-8, Doug Burks wrote:
> You could probably add it to our existing ethtool statements in
> /etc/network/interfaces:
> https://code.google.com/p/security-onion/wiki/NetworkConfiguration
>
> I don't know of anybody else who has had to disable flow control so,
> out of curiosity, what are your sniffing interfaces plugged into? And
> how are those ports configured?
>

Modifying this line doesn't appear to work.

post-up ethtool -G $IFACE rx 4096; for i in rx tx sg tso ufo gso gro lro; do ethtool -K $IFACE $i off; do ethtool -A $IFACE tx off rx off; done

We're simply mirroring a single 10G upstream port on an Arista 7050 switch.

Doug Burks

Nov 18, 2014, 9:24:38 AM
to securit...@googlegroups.com
Replies inline.

On Mon, Nov 17, 2014 at 8:03 PM, yawnbox <yaw...@gmail.com> wrote:
> On Saturday, November 15, 2014 6:26:29 AM UTC-8, Doug Burks wrote:
>> You could probably add it to our existing ethtool statements in
>> /etc/network/interfaces:
>> https://code.google.com/p/security-onion/wiki/NetworkConfiguration
>>
>> I don't know of anybody else who has had to disable flow control so,
>> out of curiosity, what are your sniffing interfaces plugged into? And
>> how are those ports configured?
>>
>
> Modifying this line doesn't appear to work.
>
> post-up ethtool -G $IFACE rx 4096; for i in rx tx sg tso ufo gso gro lro; do ethtool -K $IFACE $i off; do ethtool -A $IFACE tx off rx off; done

Looks like you have an extra "do" in there (there should only be one
to start the for-loop). Try something like this:

post-up ethtool -G $IFACE rx 4096; for i in rx tx sg tso ufo gso gro lro; do ethtool -K $IFACE $i off; ethtool -A $IFACE tx off rx off; done
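
Since the -A call doesn't use $i, you could also hoist it out of the
loop so it only runs once, and add autoneg off to match the manual
command that worked for you:

post-up ethtool -G $IFACE rx 4096; for i in rx tx sg tso ufo gso gro lro; do ethtool -K $IFACE $i off; done; ethtool -A $IFACE tx off rx off autoneg off

(Functionally the same apart from the added autoneg off.)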

> We're simply mirroring a single 10G upstream port on an Arista 7050 switch.

What kind of NIC in your sensor?

yawnbox

Nov 18, 2014, 1:17:15 PM
to securit...@googlegroups.com
On Tuesday, November 18, 2014 6:24:38 AM UTC-8, Doug Burks wrote:
> Replies inline.

> Looks like you have an extra "do" in there (there should only be one
> to start the for-loop). Try something like this:
>
> post-up ethtool -G $IFACE rx 4096; for i in rx tx sg tso ufo gso gro lro; do ethtool -K $IFACE $i off; ethtool -A $IFACE tx off rx off; done
>
> > We're simply mirroring a single 10G upstream port on an Arista 7050 switch.
>
> What kind of NIC in your sensor?

Your modification works great (removing the second "do"), thank you Doug.

$ ethtool -a ethX
Pause parameters for ethX:
Autonegotiate: off
RX: off
TX: off

We're using an Intel X520-SR2 NIC, Intel SFP+ optics, and new OM4 fiber.

Some dmesg output:

ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver - version 3.15.1-k
Multiqueue Enabled: Rx Queue count = 63, Tx Queue count = 63
PCI Express bandwidth of 32GT/s available
Speed:5.0GT/s, Width: x8, Encoding Loss:20%
NIC Link is Up 10 Gbps, Flow Control: None

I see that the driver version is older (http://downloadmirror.intel.com/14687/eng/ixgbe-3.22.3.tar.gz is the newest), but I don't know how to update it in Linux or whether it's needed.
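
From what I've read (untested on my part), the usual out-of-tree build
for the Intel driver goes roughly like:

tar xzf ixgbe-3.22.3.tar.gz
cd ixgbe-3.22.3/src
sudo make install    # needs linux-headers-$(uname -r)
sudo rmmod ixgbe && sudo modprobe ixgbe

but I'd want confirmation before trying that on this sensor.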

yawnbox

Mar 5, 2015, 6:06:04 PM
to securit...@googlegroups.com
My last reply, several months ago, appears to not have been approved for posting. I wonder why such a tyrannical system exists for this open source project. Hoping this one gets approved, I have a valuable update.

It turns out that when a mirror port is created on an Arista switch, flow control is turned off on it by default. This caused a huge number of PAUSE frames to be sent from the server (which has flow control on by default) to the switch, which never replied or paused. To enable flow control on the Arista, I ran:

conf term
int Et98
flowcontrol receive on
flowcontrol send on
exit

I've since had zero packet drop on this interface, which averages about 2.5 Gbps.
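
To confirm from the sensor side (counter names vary by driver, but on
ixgbe something like this works):

ethtool -a ethX
ethtool -S ethX | grep -i -E 'xon|xoff'

The xon/xoff flow-control counters stop climbing once the switch honors
the PAUSE frames.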

Additionally, because it's possible for one port to become overloaded with Rx *and* Tx traffic (if the sum of both ever surpasses 10Gbps), I changed the uplink mirror to send Rx traffic to one port and Tx traffic to another. I altered the Arista config as follows:

show monitor session
config t
monitor session redirect_1 source ethernet 97 rx
monitor session redirect_2 source ethernet 97 tx
monitor session redirect_2 destination ethernet 99
int ethernet 99
no shutdown
exit
exit
show int status |grep 99

(port "98" was already set as the "97" mirror destination, and "99" was previously off)

To recap:

port 97 = 10Gbps uplink being mirrored (source)
port 98 = 97 Rx mirror destination
port 99 = 97 Tx mirror destination

Next, I reconfigured the disk array to handle 1250MB/s burst writes (10Gbps / 8 = 1.25GB/s) by building a RAID-0 array across 12 SATA 3 SSDs. Typical write bursts seem to be about 350MB/s, which I see with the "iotop" tool:

sudo apt-get install sysstat dstat iotop
sudo iotop -aoP
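
To sanity-check the array's sequential write rate, a quick and crude
test that bypasses the page cache (writes and then removes a 20GB file):

sudo dd if=/dev/zero of=/nsm/ddtest bs=1M count=20000 oflag=direct
sudo rm /nsm/ddtest

dd prints the achieved throughput when it finishes; it should be near
the 1250MB/s target.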

I've also burned the newest SO version, but I have a problem. Suricata doesn't seem to be running. The steps I took:

1. burned new ISO
2. installed SO onto a Dell R910
3. restarted
4. sudo apt-get update && sudo apt-get upgrade
5. restarted
6. set up SO with advanced config

Running sostat-quick shows that Suricata is not running. Performing an NSM restart shows that Suricata was not running previously but is successfully started with the restart. Immediately running sostat-quick again shows that Suricata is not running. Because of the above upgrade, I'm running the newest version of all apps. Snorby shows zero events even after running for 12 hours.

Please see the attached sostat-redacted.

I'd like to make one final note: performing a dist-upgrade breaks grub. I'm not able to upgrade the Linux kernel without making the server unbootable.

Cheers
so-redacted2-copy.txt

Doug Burks

Mar 5, 2015, 7:13:08 PM
to securit...@googlegroups.com
Hi yawnbox,

Replies inline.

On Thu, Mar 5, 2015 at 6:06 PM, yawnbox <yaw...@gmail.com> wrote:
> My last reply, several months ago, appears to not have been approved for posting. I wonder why such a tyrannical system exists for this open source project. Hoping this one gets approved, I have a valuable update.

If you're referring to your reply on 11/18, it looks like it was posted:
https://groups.google.com/d/topic/security-onion/g6y5oy4rHnk/discussion

Our Google Group is moderated, but this is a fairly standard practice
to protect the members of the group from spam and malicious email.
Your account was approved at some point in the past. After that
point, your messages to the group were no longer moderated (like
today's message).

> It turns out that when a mirror port is created on an Arista switch, flow control is turned off on it by default. This caused a huge number of PAUSE frames to be sent from the server (which has flow control on by default) to the switch, which never replied or paused. To enable flow control on the Arista, I ran:
>
> conf term
> int Et98
> flowcontrol receive on
> flowcontrol send on
> exit
>
> I've since had zero packet drop on this interface that appears to have a 2.5 Gbps average.
>
> Additionally, because it's possible for one port to become overloaded with Rx *and* Tx traffic (if the sum of both ever surpasses 10Gbps), I changed the uplink mirror to send Rx traffic to one port and Tx traffic to another. I altered the Arista config as follows:

Are you sending RX traffic to eth5 and TX traffic to eth7 (or vice
versa)? If so, then Suricata and other sniffing processes won't get
to see both sides of your network flows. If you have to split RX and
TX on the Arista, then you will need to bridge/bond eth5 and eth7 back
to one interface (like br0 or bond0) to be sniffed by Suricata and the
other sniffing processes.
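
A rough /etc/network/interfaces sketch for that (assumes the ifenslave
package is installed; interface names are just examples):

auto bond0
iface bond0 inet manual
  bond-slaves eth5 eth7
  bond-mode balance-rr
  post-up ip link set dev bond0 up promisc on

Since the bond never transmits, the bonding mode hardly matters; the
point is that one logical interface sees both directions of each flow.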
Have you checked the Suricata log in /var/log/nsm/HOSTNAME-INTERFACE/
for additional clues?

> Please see the attached sostat-redacted.

Your load average is pretty high:
load average: 86.76, 78.46, 76.78

How many CPU cores do you have?

Your min_num_slots is low:
Min Num Slots : 4096

If you're seeing lots of traffic, you should max that value out (65534
I believe).
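
min_num_slots is a PF_RING kernel module parameter; if memory serves,
on SO you set it in /etc/modprobe.d/pf_ring.conf:

options pf_ring transparent_mode=0 min_num_slots=65534

and then reload the pf_ring module (or reboot) and re-check sostat.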

> I'd like to make one final note: performing a dist-upgrade breaks grub. I'm not able to upgrade the Linux kernel without making the server unbootable.

Did you create a dedicated /boot partition at the beginning of the drive?