First off, big thanks to Doug and team for their work with Security Onion and the awesome job integrating Elastic into the SO stack! I'm looking forward to leveraging Bro in Elastic on Security Onion.
I stood up an SO Elastic alpha release box on the same network as one of my production Security Onion sensors. I'm trying to ship Bro logs from the prod sensor to the new Elastic box to test out Elastic, but according to syslog-ng on the source sensor, the vast majority of the logs are being dropped in transit. I had this functionality working on a previous release of the Elastic preview - I can't remember if that was TP1 or TP2 - and I don't believe network congestion is the limiting factor here. For context, on the previous TP I had running, the old ELK instance was seeing about 50 million logs per day. This alpha has been running for about 18 hours now and has only imported about 300,000 logs.
To send the logs, I copied the relevant syslog-ng configuration lines out of the new SO-ELK alpha server and added them to the source sensor's configuration. I modified the output to send to the IP of the ELK box and added a UFW rule to allow connectivity.
I've used the `syslog-ng-ctl stats` command to confirm that the syslog-ng-to-Logstash transport step is where my logs are being dropped.
I've tried the following:
* Switched the syslog-ng output between TCP and UDP - both resulted in about the same amount of loss.
* Increased RAM and worker count on Logstash - no change observed.
* Set up a syslog-ng receiver on the ELK box, wrote the Bro logs to disk via syslog-ng, and used syslog-ng to send the contents of that file into Logstash. The source SO sensor reported FAR fewer dropped logs with this approach, but syslog-ng on the ELK box reported a large number of drops from that file source to the Logstash destination. This makes me think it's a Logstash input / ingest-rate issue.
* Modified syslog-ng settings such as log_fifo_size and flow-control. Nothing I tried here made a difference.
* Briefly stopped sending syslog from the source sensor to our SIEM, thinking I was putting too much stress on syslog-ng - but this didn't make a difference either.
* Commented out all log source lines except source(s_bro_conn);. Still observed a high rate of log loss.
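For reference, the flow-control / log_fifo_size variant mentioned above looked roughly like this (the buffer size is just an illustrative value, not a recommendation):

```
# Larger output buffer on the destination (size is illustrative)
destination d_logstash_bro {
    tcp("IP.Of.ELK.Box" port(6050)
        log_fifo_size(200000)
        template("$(format-json --scope selected_macros --scope nv_pairs --exclude DATE --key ISODATE)\n"));
};

# flow-control tells syslog-ng to throttle the sources when the
# destination buffer fills, instead of dropping messages
log {
    source(s_bro_conn);
    destination(d_logstash_bro);
    flags(flow-control);
};
```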
user@sourceSensor:/var/log$ sudo syslog-ng-ctl stats | grep logstash_bro
destination;d_logstash_bro;;a;processed;17253139
dst.tcp;d_logstash_bro#0;tcp,Dest.SO.ELK.IP:6050;a;dropped;17229709
dst.tcp;d_logstash_bro#0;tcp,Dest.SO.ELK.IP:6050;a;processed;17253139
dst.tcp;d_logstash_bro#0;tcp,Dest.SO.ELK.IP:6050;a;stored;10000
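To put those counters in perspective, here's a quick sketch that pulls the numbers out of the stats lines above (assuming the semicolon-separated field layout shown, with the counter as the last field):

```shell
# Counter lines copied from the syslog-ng-ctl stats output above
dropped_line='dst.tcp;d_logstash_bro#0;tcp,Dest.SO.ELK.IP:6050;a;dropped;17229709'
processed_line='dst.tcp;d_logstash_bro#0;tcp,Dest.SO.ELK.IP:6050;a;processed;17253139'

# The counter value is the last semicolon-separated field
dropped=$(echo "$dropped_line" | awk -F';' '{print $NF}')
processed=$(echo "$processed_line" | awk -F';' '{print $NF}')

# Integer percentage of logs dropped in transit
echo "drop rate: $(( dropped * 100 / processed ))%"   # prints "drop rate: 99%"
```

In other words, essentially everything sent to the Logstash destination is being dropped.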
Here are the relevant syslog-ng configuration lines on the sensor / source machine (note: I didn't modify any of the log source lines, so I'm intentionally leaving those out):
destination d_logstash_bro { tcp("IP.Of.ELK.Box" port(6050) template("$(format-json --scope selected_macros --scope nv_pairs --exclude DATE --key ISODATE)\n")); };
log {
    source(s_bro_conn);
    ... # rest of the Bro log sources
    log { filter(f_bro_headers); flags(final); };
    log { destination(d_logstash_bro); };
};
So I guess my questions are:
1. Is anyone else seeing logs dropped between syslog-ng and Logstash on a standalone ELK alpha box (taking the network factor entirely out of the equation)? Running `syslog-ng-ctl stats` and checking the counters for the logstash destination should show this.
2. Has anyone else attempted sending these logs from a production SO sensor to an ELK TP or alpha release and encountered this issue?
3. Are there any tuning options that may help? Any recommendations from anyone who has encountered similar issues?
I've attached sostat-redacted output for both the source and destination systems in case there's any useful information in there for troubleshooting this.
Thanks in advance for any input / help!
James Gordon
Hi Doug,
I increased the Logstash (and ES) RAM values in /etc/nsm/securityonion.conf. Setup initially dedicated 8227m to each of these; I upped both to 12000m. The machine has 32 GB of RAM.
# Elasticsearch options
ELASTICSEARCH_ENABLED="yes"
ELASTICSEARCH_HOST="localhost"
ELASTICSEARCH_PORT=9200
#ELASTICSEARCH_HEAP="8227m"
ELASTICSEARCH_HEAP="12000m"
ELASTICSEARCH_OPTIONS=""
# Logstash options
LOGSTASH_ENABLED="yes"
#LOGSTASH_HEAP="8227m"
LOGSTASH_HEAP="12000m"
LOGSTASH_OPTIONS=""
I modified /etc/logstash/logstash.yml to increase the pipeline workers - I went from 1 to 4 workers to see if this made any improvement.
root@securityonion-elk:~# cat /etc/logstash/logstash.yml
path.config: /usr/share/logstash/pipeline
queue.type: persisted
queue.max_bytes: 1gb
pipeline.workers: 4
path.logs: /var/log/logstash
After making these changes I restarted Elastic with so-elastic-restart. Because I built the ELK machine on a test network and then moved it to production to ingest logs, so-elastic-restart timed out getting Docker configs and used cached values. I don't see where the number of pipeline workers is reflected in the running Logstash process, but the heap size change definitely applied.
root@securityonion-elk:~# ps -ef | grep logstash
jgordon 14758 14741 7 12:10 ? 00:29:10 /usr/bin/java -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+DisableExplicitGC -Djava.awt.headless=true -Dfile.encoding=UTF-8 -XX:+HeapDumpOnOutOfMemoryError -Djava.security.egd=file:/dev/urandom -Xmx12000m -Xms12000m -Xss2048k -Djffi.boot.library.path=/usr/share/logstash/vendor/jruby/lib/jni -Xbootclasspath/a:/usr/share/logstash/vendor/jruby/lib/jruby.jar -classpath : -Djruby.home=/usr/share/logstash/vendor/jruby -Djruby.lib=/usr/share/logstash/vendor/jruby/lib -Djruby.script=jruby -Djruby.shell=/bin/sh org.jruby.Main /usr/share/logstash/lib/bootstrap/environment.rb logstash/runner.rb
Thanks,
James Gordon
James,
Are you still experiencing issues with this, or were you able to get this resolved?
Thanks,
Wes
Wes,
I rebuilt the ELK server last night. Just rebuilding it fixed the dropped logs as reported by syslog-ng on the production sensor, but still only a very small portion of the logs were making it into Elasticsearch. I noticed some error logs in /var/log/logstash/logstash.log about DomainStats timing out, so I disabled domain stats in /etc/nsm/securityonion.conf, rebooted, and things are working much better! I will note that I had to increase the Logstash worker count to keep up with the logs. It fell behind pretty quickly last night - after an hour of ingesting logs it was 45 minutes behind - so I bumped the worker count to 8, and it was up to date when I checked on it this morning. Over the past 14 hours ELK has ingested just under 30 million logs.
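For reference, the worker bump was a one-line change in /etc/logstash/logstash.yml, with the rest of that file left as I posted earlier:

```
pipeline.workers: 8
```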
As mentioned in my response to Doug, I built the ELK instance in a test / lab environment with open access to the internet, then swapped it over to production. It does not currently have proxy rules in place to connect outbound to the internet, so I'm assuming that's the cause of the domain stats timeouts.
Thanks,
James Gordon
Had a similar issue with domainstats slowing down the indexing of logs; the system went through long periods of sporadic log indexing.
Sorted it by setting DOMAIN_STATS_ENABLED="no" in /etc/nsm/securityonion.conf.
[2017-10-18T12:50:54,205][ERROR][logstash.filters.rest ] error in rest filter {:request=>[:get, "http://domainstats:20000/domain/creation_date/angsrvr.com", {}], :json=>false, :code=>nil, :body=>nil, :client_error=>#<Manticore::StreamClosedException: Could not read from stream: Read timed out>}
Looks like the DOCKER iptables chain has no forwarding rule for domainstats:
iptables -L DOCKER
Chain DOCKER (2 references)
target prot opt source destination
ACCEPT tcp -- anywhere 172.17.0.4 tcp dpt:9300
ACCEPT tcp -- anywhere 172.17.0.4 tcp dpt:9200
ACCEPT tcp -- anywhere 172.17.0.5 tcp dpt:6053
ACCEPT tcp -- anywhere 172.17.0.5 tcp dpt:6052
ACCEPT tcp -- anywhere 172.17.0.5 tcp dpt:6051
ACCEPT tcp -- anywhere 172.17.0.5 tcp dpt:6050
ACCEPT tcp -- anywhere 172.17.0.6 tcp dpt:5601
The container is running but not listening on tcp/20000:
# docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
399aeb648082 securityonionsolutions/so-curator "/bin/bash" 16 minutes ago Up 16 minutes so-curator
3c38a258221a securityonionsolutions/so-elastalert "/opt/start-elasta..." 16 minutes ago Up 16 minutes so-elastalert
8ec4f2d73a44 securityonionsolutions/so-kibana "/bin/sh -c /usr/l..." 16 minutes ago Up 16 minutes 127.0.0.1:5601->5601/tcp so-kibana
5e40eb533ae5 securityonionsolutions/so-logstash "/usr/local/bin/do..." 16 minutes ago Up 16 minutes 5044/tcp, 9600/tcp, 127.0.0.1:6050-6053->6050-6053/tcp so-logstash
0ce595e769bc securityonionsolutions/so-elasticsearch "/bin/bash bin/es-..." 16 minutes ago Up 16 minutes 127.0.0.1:9200->9200/tcp, 127.0.0.1:9300->9300/tcp so-elasticsearch
a9ecb1effec9 securityonionsolutions/so-domainstats "/bin/sh -c '/usr/..." 16 minutes ago Up 16 minutes 20000/tcp so-domainstats
8552535e82e1 securityonionsolutions/so-freqserver "/bin/sh -c '/usr/..." 16 minutes ago Up 16 minutes 10004/tcp so-freqserver