My primary long-term test setup:
Master server: VM, 4 cores / 8GB of RAM, on RAID5 SATA - no issues
Storage node 1: physical box, 8 cores / 64GB of RAM, on a single 400GB SSD - no issues
Storage node 2: VM, 4 cores / 16GB of RAM, on RAID5 SATA - seeing if I can provoke performance issues
Sensor: physical box, 4 cores / 12GB of RAM, local SATA disks - no issues
My sensor sees up to 100 megabytes of traffic (10 megabytes is average) and can keep up with the disk writes and CPU.
The storage nodes never go above 100 IOPS when no queries are running and don't seem to have any disk issues.
I can peg the CPUs on the storage nodes with a long "*something*" query; otherwise they stay at a load average of around 1 or less.
The master server never seems to have much load at all.
Storage count and size
so-storage-02: 37GB of /nsm/elastic
{"count":53588166,"_shards":{"total":81,"successful":81,"skipped":0,"failed":0}}
SO-STORAGE: 48GB of /nsm/elastic
{"count":67116284,"_shards":{"total":90,"successful":90,"skipped":0,"failed":0}}
So it looks like I am getting about 1,400 - 1,500 logs per megabyte of disk
(85 GB of disk = roughly 85,000 MB, 120,704,450 logs, ~1,420 logs per megabyte)
I am planning out multiple sensors with about 40x the traffic
Some of the things I am trying to figure out:
1. Storage nodes - everyone on the Elasticsearch forums says SSD and RAID10 or RAID0.
What is everyone seeing for IOPS on their big storage nodes? I am leaning toward RAID10, but I am really wondering if I can get away with RAID5 plus a hot spare. I only see high IOPS during queries, and RAID5 shouldn't carry a big penalty on reads.
So I am thinking that even with the RAID5 write penalty I would stay within my array's IOPS range during writes, and any queries would be reads, which RAID5 doesn't penalize much. Everything I have read suggests this is a bad idea, so I am skeptical of my own logic here (rough numbers below).
Right now I am considering running 2 storage nodes, each with 12 non-SSD disks - one node in RAID10 and one in RAID5 - to see how they run in production.
Anybody have any opinions or real world data on this topic?
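For what it's worth, here is the back-of-envelope math I keep coming back to. The per-disk number is just an assumption for 7200 RPM SATA, and I'm using the textbook write penalties of 4 for RAID5 and 2 for RAID10:
# rough sketch only - 12 disks per node, ~75 random IOPS per 7200 RPM SATA disk assumed
DISKS=12; PER_DISK=75
echo "RAID5  ~ $(( DISKS * PER_DISK / 4 )) write IOPS, ~ $(( DISKS * PER_DISK )) read IOPS"
echo "RAID10 ~ $(( DISKS * PER_DISK / 2 )) write IOPS, ~ $(( DISKS * PER_DISK )) read IOPS"
That puts RAID5 around 225 write IOPS on paper, which is why I thought I might squeak by given the ~100 IOPS I see with no queries running.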
Thanks
Thanks!
I recently got the "everything needs to be virtual in the DC" mandate, so I am running some tests in the lab with VMware.
Hardware: 8 cores with hyperthreading enabled (16 logical cores)
76GB of RAM
RAID10 - 6 x 1TB disks
Test 1 - slight CPU oversubscription of physical cores
SO Master 2 cores / 8GB RAM
SO Storage 8 cores / 64GB RAM / Java heap set to 28GB (see note below)
Sensor is dedicated hardware
Notes - under heavy query I would get visualization timeouts in Kibana (*value* all data type)
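For anyone repeating the heap change: it's nothing fancy - on a stock Elasticsearch install it lives in jvm.options, though treat the path as an assumption since Security Onion may manage that file differently:
# keep -Xms and -Xmx equal, and stay under ~31GB so the JVM keeps compressed object pointers
sudo sed -i 's/^-Xms.*/-Xms28g/; s/^-Xmx.*/-Xmx28g/' /etc/elasticsearch/jvm.options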
Test 2 - CPU pinning and affinity - set cores to logical, with 1 reserved on each VM for ESXi
SO Master 3 cores / 8GB RAM (pinned to logical cores 0-3)
SO Storage 11 cores / 64GB RAM (pinned to logical cores 4-15) / Java heap set to 28GB
Sensor is dedicated hardware
Notes - so many issues - CPU pegged on ESXi - this was a Bad Idea (TM)
Test 3 - split storage into 2 nodes with half the memory each - I didn't see the extra memory getting used as cache on the single storage box, and I thought this might help by providing more Java heap overall.
SO Master 2 cores / 8GB RAM
SO Storage 4 cores / 32GB RAM / Java heap set to 24GB
SO Storage 4 cores / 32GB RAM / Java heap set to 24GB
Sensor is dedicated hardware
Notes - visualization timeouts everywhere. I have no idea whether splitting up the cores just slowed everything down or whether it was simply a bad idea. Overall much slower, with more problems than one storage node with more memory and CPU.
Test 4 - very slight CPU oversubscription of physical cores - the idea is that even under 100% storage load at least 1 core is available for the master
SO Master 2 cores / 8GB RAM
SO Storage 7 cores / 64GB RAM / Java heap set to 28GB
Sensor is dedicated hardware
Notes - this seems to be OK: no visualization timeouts. Free memory started getting used for caching, which greatly sped up queries, especially repeated ones.
I used htop, nload, dstat, iostat (I like dstat --disk-tps), and the VMware console performance tab to troubleshoot and try to discern the impact of each choice.
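For the disk side, the invocations were along these lines (5-second samples):
iostat -xm 5          # per-device read/write IOPS, await, and %util
dstat --disk-tps 5    # reads/writes per second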
This not-very-scientific process left me with the following thoughts:
1. Avoid virtualizing the storage nodes.
2. If forced to virtualize, don't oversubscribe - and I had no luck with pinning or affinity either (what is the point of virtualization at that point?).
3. I seemed more CPU-bound than disk-bound - during a heavy query the CPU was pegged, but disk IOPS never went above my array's specs - maybe I am not using a big enough dataset, or my test query is a bad one.
Let me know if anyone else is doing this type of testing and what results you have gotten.
Found a neat tool, fio, for benchmarking IOPS.
Just 100% Reads
sudo fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=100
90% writes and 10% reads (guessing this is what Security Onion will be like)
sudo fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=10
100% writes
sudo fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=0
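One small gotcha (just how I understand fio to behave with these options): it leaves the 4G test file named "test" in the current directory, so clean it up between runs if space is tight:
rm -f test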
Test system is Ubuntu 16.04, fully patched as of today, 08/26/2018
ESXi 5.5, Dell PERC controller, 8 CPU cores
Disks are 6 x Dell midline 7200 RPM SATA
RAID5:  max read 1400 IOPS, max write 400 IOPS, 90/10 write/read mix: 40/400
RAID10: max read 1300 IOPS, max write 1400 IOPS, 90/10 write/read mix: 120/1100
So my thoughts are:
1. The RAID10 really makes a difference on writes, about 3.5x (expected).
2. The constant writes could really bog down reads on the RAID5 (expected).
3. While the RAID5 may keep up with the writes, it will really kill read performance (I should have known that, but the disk space savings were tempting).
4. I still need to get an idea of how many write IOPS I will need in production. If it is more than 50% of my array's max write IOPS, then I don't think I can run the RAID5.
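My rough plan for getting that production write number (just a sketch - the device name is a placeholder for whatever actually backs /nsm/elastic on the existing storage node):
# find the device behind /nsm/elastic, then watch the w/s column over a busy stretch
df /nsm/elastic
iostat -xm sda 60    # swap sda for the device df reported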
Anyway, I thought I would share in case anyone else is working through the same planning.
My advice is quite simple.
I would use the sensors for sensor work (IDS, Bro, netsniff-ng, CapMe) and do the analytics only on the storage nodes (Elastic data nodes). Those nodes should run only on SSD for best performance, and I would add more nodes if required. That way, if you see that search is slow, you can always add more storage nodes, while the sensors themselves stay dedicated to their own work.
A long, long time ago, when we collected about 50 to 100GB of ES logs per day, I had dedicated ES data nodes - 4 Dell 730xd servers with 64GB each, no virtualization, RAID0 and ES 1.7. Despite the dedicated ES nodes I was still able to kill my cluster through Java heap exhaustion, usually thanks to colleagues running heavy queries. Nowadays a lot of things have changed and some controls are in place, but those controls will kill queries that run too long (default 30s).
Also, for best performance it is advisable for an ES data node to have at least 64GB of RAM, 8 vCPUs and 4-6TB of SSD, with a shard count of at most 600 per node.
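An easy sanity check on shards per node is the _cat/allocation API (assuming Elasticsearch is reachable on localhost:9200):
curl -s 'localhost:9200/_cat/allocation?v'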
Doug always said that hardware is cheap compared to the value you get!
Of course, if you do analysis very rarely your requirements could be lower, but if the system is unstable for your analysts, they will start to complain...
You also need to evaluate your environment and the data on your network. We had some sites where traffic was quite low but there were a lot of sessions (so more logs), and others with high traffic but a low session count (so fewer logs).
You can always take one server, deploy it as standalone in your environment and see how it behaves. Based on the results you can adjust your setup. It will cost you nothing, but you will gain experience and be able to plan properly.
Regards,
Audrius
I have convinced myself to run RAID10 with my non-SSD drives on the storage nodes. No changing that this year. I do know some people who are planning on RAID5 for their storage nodes, so I wanted to see how well that would do.