My primary long-term test setup:
Master server: VM, 4 cores / 8GB of RAM, on RAID5 SATA - no issues
Storage node 1: physical box, 8 cores / 64GB of RAM, on a single 400GB SSD - no issues
Storage node 2: VM, 4 cores / 16GB of RAM, on RAID5 SATA - seeing if I can provoke performance issues
Sensor: physical box, 4 cores / 12GB of RAM, local SATA disks - no issues
My sensor sees up to 100 megabytes of traffic (10 megabytes is average) and can keep up with the disk writes and CPU.
The storage nodes never go above 100 IOPS when no queries are running and don't seem to have any disk issues.
I can peg the CPUs on the storage nodes with a long "*something*" query; otherwise they stay at a load average of around 1 or less.
The master server never seems to have much load at all.
Storage count and size
so-storage-02: 37GB of /nsm/elastic
{"count":53588166,"_shards":{"total":81,"successful":81,"skipped":0,"failed":0}}
SO-STORAGE: 48GB of /nsm/elastic
{"count":67116284,"_shards":{"total":90,"successful":90,"skipped":0,"failed":0}}
So it looks like I am getting about 1,400 - 1,500 logs per megabyte of disk
(85 GB of disk = roughly 85,000 MB, 120,704,450 logs, ~1,420 logs per megabyte)
I am planning out multiple sensors with about 40x the traffic
Some of the things I am trying to figure out:
1. Storage nodes - everyone on the Elasticsearch forums says SSD and RAID10 or RAID0.
What is everyone seeing for IOPS on their big storage nodes? I am leaning toward RAID10, but I am really wondering if I can get away with RAID5 plus a hot spare. I only see high IOPS during queries, and RAID5 shouldn't carry a big penalty on reads.
So I am thinking that even with the RAID5 write penalty I would stay within my array's IOPS range during writes, and any queries would be reads, which RAID5 doesn't penalize much. Everything I have read suggests this is a bad idea, so I am skeptical of my own logic here (rough numbers below).
Right now I am considering running 2 storage nodes, each with 12 non-SSD disks - one node in RAID10 and one in RAID5 - to see how they run in production.
Anybody have any opinions or real world data on this topic?
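For what it's worth, here is the back-of-envelope math I keep coming back to. The per-disk number is just an assumption for 7200 RPM SATA, and I'm using the textbook write penalties of 4 for RAID5 and 2 for RAID10:
# rough sketch only - 12 disks per node, ~75 random IOPS per 7200 RPM SATA disk assumed
DISKS=12; PER_DISK=75
echo "RAID5  ~ $(( DISKS * PER_DISK / 4 )) write IOPS, ~ $(( DISKS * PER_DISK )) read IOPS"
echo "RAID10 ~ $(( DISKS * PER_DISK / 2 )) write IOPS, ~ $(( DISKS * PER_DISK )) read IOPS"
That puts RAID5 around 225 write IOPS on paper, which is why I thought I might squeak by given the ~100 IOPS I see with no queries running.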
Thanks
Thanks!
I recently got the "everything needs to be virtual in the DC" mandate, so I am running some tests in the lab with VMware.
Hardware: 8 cores with hyperthreading enabled (16 logical cores)
76GB of RAM
RAID10 - 6 x 1TB disks
Test 1 - slight CPU oversubscription of physical cores
SO Master 2 cores / 8GB RAM
SO Storage 8 cores / 64GB RAM / Java heap set to 28GB (see note below)
Sensor is dedicated hardware
Notes - under heavy query I would get visualization timeouts in Kibana (*value* all data type)
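For anyone repeating the heap change: it's nothing fancy - on a stock Elasticsearch install it lives in jvm.options, though treat the path as an assumption since Security Onion may manage that file differently:
# keep -Xms and -Xmx equal, and stay under ~31GB so the JVM keeps compressed object pointers
sudo sed -i 's/^-Xms.*/-Xms28g/; s/^-Xmx.*/-Xmx28g/' /etc/elasticsearch/jvm.options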
Test 2 - CPU pinning and affinity - set cores to logical, with 1 reserved on each VM for ESXi
SO Master 3 cores / 8GB RAM (pinned to logical cores 0-3)
SO Storage 11 cores / 64GB RAM (pinned to logical cores 4-15) / Java heap set to 28GB
Sensor is dedicated hardware
Notes - so many issues - CPU pegged on ESXi - this was a Bad Idea (TM)
Test 3 - split storage into 2 nodes with half the memory each - I didn't see the extra memory getting used as cache on the single storage box, and I thought this might help by providing more Java heap overall.
SO Master 2 cores / 8GB RAM
SO Storage 4 cores / 32GB RAM / Java heap set to 24GB
SO Storage 4 cores / 32GB RAM / Java heap set to 24GB
Sensor is dedicated hardware
Notes - visualization timeouts everywhere. I have no idea whether splitting up the cores just slowed everything down or whether it was simply a bad idea. Overall much slower, with more problems than one storage node with more memory and CPU.
Test 4 - very slight CPU oversubscription of physical cores - the idea is that even under 100% storage load at least 1 core is available for the master
SO Master 2 cores / 8GB RAM
SO Storage 7 cores / 64GB RAM / Java heap set to 28GB
Sensor is dedicated hardware
Notes - this seems to be OK: no visualization timeouts. Free memory started getting used for caching, which greatly sped up queries, especially repeated ones.
I used htop, nload, dstat, iostat (I like dstat --disk-tps), and the VMware console performance tab to troubleshoot and try to discern the impact of each choice.
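For the disk side, the invocations were along these lines (5-second samples):
iostat -xm 5          # per-device read/write IOPS, await, and %util
dstat --disk-tps 5    # reads/writes per second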
This not-very-scientific process left me with the following thoughts:
1. Avoid virtualizing the storage nodes.
2. If forced to virtualize, don't oversubscribe - and I had no luck with pinning or affinity either (what is the point of virtualization at that point?).
3. I seemed more CPU-bound than disk-bound - during a heavy query the CPU was pegged, but disk IOPS never went above my array's specs - maybe I am not using a big enough dataset, or my test query is a bad one.
Let me know if anyone else is doing this type of testing and what results you have gotten.
Found a neat tool, fio, for benchmarking IOPS.
Just 100% Reads
sudo fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=100
90% writes and 10% reads (guessing this is what Security Onion will be like)
sudo fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=10
100% writes
sudo fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=0
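One small gotcha (just how I understand fio to behave with these options): it leaves the 4G test file named "test" in the current directory, so clean it up between runs if space is tight:
rm -f test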
Test system is Ubuntu 16.04, fully patched as of today, 08/26/2018
ESXi 5.5, Dell PERC controller, 8 CPU cores
Disks are 6 x Dell midline 7200 RPM SATA
RAID5:  max read 1400 IOPS, max write 400 IOPS, 90/10 write/read mix: 40/400
RAID10: max read 1300 IOPS, max write 1400 IOPS, 90/10 write/read mix: 120/1100
So my thoughts are:
1. The RAID10 really makes a difference on writes, about 3.5x (expected).
2. The constant writes could really bog down reads on the RAID5 (expected).
3. While the RAID5 may keep up with the writes, it will really kill read performance (I should have known that, but the disk space savings were tempting).
4. I still need to get an idea of how many write IOPS I will need in production. If it is more than 50% of my array's max write IOPS, then I don't think I can run the RAID5.
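My rough plan for getting that production write number (just a sketch - the device name is a placeholder for whatever actually backs /nsm/elastic on the existing storage node):
# find the device behind /nsm/elastic, then watch the w/s column over a busy stretch
df /nsm/elastic
iostat -xm sda 60    # swap sda for the device df reported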
Anyway, I thought I would share in case anyone else is working through the same planning.
My advice is quite simple.
I would use the sensors for sensor work (IDS, Bro, netsniff-ng, CapMe) and do the analytics only on the storage nodes (Elastic data nodes). Those nodes should run only on SSD for best performance, and I would add more nodes if required. That way, if you see that search is slow, you can always add more storage nodes, while the sensors themselves stay dedicated to their own work.
A long, long time ago, when we collected about 50 to 100GB of ES logs per day, I had dedicated ES data nodes - 4 Dell 730xd servers with 64GB each, no virtualization, RAID0 and ES 1.7. Despite the dedicated ES nodes I was still able to kill my cluster through Java heap exhaustion, usually thanks to colleagues running heavy queries. Nowadays a lot of things have changed and some controls are in place, but those controls will kill queries that run too long (default 30s).
Also, for best performance it is advisable for an ES data node to have at least 64GB of RAM, 8 vCPUs and 4-6TB of SSD, with a shard count of at most 600 per node.
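An easy sanity check on shards per node is the _cat/allocation API (assuming Elasticsearch is reachable on localhost:9200):
curl -s 'localhost:9200/_cat/allocation?v'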
Doug always said that hardware is cheap compared to the value you get!
Of course, if you do analysis very rarely your requirements could be lower, but if the system is unstable for your analysts, they will start to complain...
You also need to evaluate your environment and the data on your network. We had some sites where traffic was quite low but there were a lot of sessions (so more logs), and others with high traffic but a low session count (so fewer logs).
You can always take one server, deploy it as standalone in your environment and see how it behaves. Based on the results you can adjust your setup. It will cost you nothing, but you will gain experience and be able to plan properly.
Regards,
Audrius
I have convinced myself to run RAID10 with my non-SSD drives on the storage nodes. No changing that this year. I do know some people who are planning on RAID5 for their storage nodes, so I wanted to see how well that would do.