It seems that after our upgrade yesterday, we are seeing problems on the Security Onion manager.
The first indication that something was wrong was that we could not log in to the Squert web UI. When investigating further, we noticed that we could not log in to the database from the command line either, because there were too many connections to it.
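For anyone checking the same thing, the connection count and limit can be inspected with standard MySQL commands, something along these lines (exact credentials may differ on your setup):
mysql -uroot -e 'SHOW PROCESSLIST;'
mysql -uroot -e "SHOW VARIABLES LIKE 'max_connections';"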
After some reading we came across threads here describing similar problems, for example this one: https://groups.google.com/forum/#!topic/security-onion/xqMrSmnBng4. Within that thread, this other older thread was referenced: https://groups.google.com/forum/#!topic/security-onion/xF9v3RfNnaM
We can confirm that we have most of the symptoms described in these threads, but when performing this step from https://groups.google.com/forum/#!topic/security-onion/xF9v3RfNnaM:
# verify there is no pid column in the query_log table:
mysql -uroot -Delsa_web -e 'describe query_log;'
We can see that the pid column is already there:
+--------------+----------------------+------+-----+-------------------+----------------+
| Field        | Type                 | Null | Key | Default           | Extra          |
+--------------+----------------------+------+-----+-------------------+----------------+
| qid          | int(10) unsigned     | NO   | PRI | NULL              | auto_increment |
| uid          | int(10) unsigned     | NO   | MUL | NULL              |                |
| query        | varchar(8000)        | NO   |     | NULL              |                |
| system       | tinyint(3) unsigned  | NO   |     | 0                 |                |
| timestamp    | timestamp            | NO   | MUL | CURRENT_TIMESTAMP |                |
| num_results  | int(11)              | YES  |     | NULL              |                |
| milliseconds | int(10) unsigned     | YES  |     | NULL              |                |
| archive      | tinyint(1)           | NO   | MUL | 0                 |                |
| pid          | smallint(5) unsigned | YES  |     | NULL              |                |
+--------------+----------------------+------+-----+-------------------+----------------+
Nonetheless, we were able to disable the job by commenting out the cron entry in /etc/cron.d/elsa, as described in the previously mentioned thread.
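For reference, the entry we commented out looks something like this (schedule and paths as on our system; the exact line may differ):
# * * * * * root perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf > /dev/null 2>&1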
If I try to run the command "/opt/elsa/web/cron.pl -c /etc/elsa_web.conf" manually from the shell, I get a massive loop of Perl error messages, which seems odd. I have searched other threads for similar reports of these messages but have found none.
I should probably also mention that we are seeing some batching problems when running queries in ELSA. The messages look something like "Query 1460 submitted. SENSOR: Batching because estimated query time is 1051 seconds. 127.0.0.1: Batching because estimated query time is 1051 seconds."
I'm assuming this is caused by the cron.pl script not being able to finish since the upgrade? Queries work fine if "nobatch:1" is appended to the query.
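For example, a query like this one (using one of the searches mentioned further down) returns results immediately with nobatch:1 appended:
class=BRO_WEIRD "-" groupby:name nobatch:1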
I have attached a log containing the error messages produced when running the cron.pl script manually as described above. I have also attached the output of "sostat-redacted" from our manager.
Any ideas about what could be causing this?
I have answered your questions below.
On Thursday, July 30, 2015, at 21:17:43 UTC+2, Doug Burks wrote:
> Do you have a master and one or more separate sensor boxes?
One master server and one sensor.
> Did you upgrade the master first and then all of the sensor boxes?
Yes, we ran "sudo soup" on the manager first, and after it had restarted we ran the same thing on the sensor.
> Were there are any errors during the upgrade?
We did not observe any errors in the output during or after the upgrade.
> When was the last time you upgraded (before yesterday)?
The last upgrade before this one was several months ago.
> Can you provide sostat output from the sensor boxes as well?
Yes, absolutely. I have attached it in this reply.
> Do you have any ELSA alerts (scheduled queries)? If so, have you
> tried disabling them?
I have checked in ELSA, and there are none to be found there.
> Have you checked the log files in /nsm/elsa/data/elsa/log/ for additional clues?
Yes, we did some testing during the day (it's evening here now), and there have been a few interesting findings in that log. We had a lot of these errors:
* ERROR [2015/07/24 08:17:08] /opt/elsa/web/lib/API.pm (354) API::__ANON__ 26515 [undef]
malformed JSON string, neither array, object, number, string or atom, at character offset 0 (before "(end of string)") at /opt/elsa/web/lib/API.pm line 350.
Header: $VAR1 = {
'Reason' => 'Forbidden',
'URL' => 'https://127.0.0.1:3154/API/result?qid=597',
'Status' => '403'
};
body: $VAR1 = undef;
We also saw some errors that seemed to be related to our proxy server, which is set in the /etc/environment file:
* ERROR [2015/07/24 08:18:09] /opt/elsa/web/lib/Utils.pm (679) Utils::__ANON__ 26560 [undef]
malformed JSON string, neither array, object, number, string or atom, at character offset 0 (before "<!DOCTYPE HTML PUBL
I...") at /opt/elsa/web/lib/Utils.pm line 674.
Header: $VAR1 = {
'URL' => 'http://127.0.0.1:50000/API/info',
'connection' => 'close',
'Status' => '504',
'HTTPVersion' => '1.0',
'date' => 'Fri, 24 Jul 2015 06:21:43 GMT',
'x-squid-error' => 'ERR_CONNECT_FAIL 111',
'content-length' => '1079',
'Reason' => 'Gateway Time-out',
'content-type' => 'text/html',
'via' => '1.0 PROXYIP:800 (squid/2.7.STABLE9)',
'server' => 'squid/2.7.STABLE9',
'x-cache' => 'MISS from inf-proxy1'
};
We tried temporarily moving the environment file out of place and rebooting, and that seemed to take care of those errors, but we are still seeing the same kind of output when trying to run the cron.pl script.
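(Roughly what we did: "sudo mv /etc/environment /etc/environment.bak" followed by a reboot. An untested alternative might be to keep the file but add an exception like no_proxy="localhost,127.0.0.1" so that requests to 127.0.0.1 bypass the proxy; that is just an assumption on our part.)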
However, we got the cron.pl script running again yesterday evening by dropping some rows from the "saved_results" table in the elsa_web database, after seeing mentions of duplicate entries like the following in the ELSA web.log. That seems to have solved the problem of cron.pl not being able to complete:
DBD::mysql::st execute failed: Duplicate entry '1357' for key 'PRIMARY' QUERY: INSERT INTO saved_results (qid, comments) VALUES(?,?)
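The cleanup was roughly along these lines, with the qid value taken from the duplicate-entry error (adjust it to whatever web.log reports):
mysql -uroot -Delsa_web -e 'DELETE FROM saved_results WHERE qid=1357;'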
We figure we ended up with some transactions not being completed during the upgrade. It is now okay to query data indexed/collected since then.
We will let it run like this over the weekend, and hopefully we will be back on track afterwards if the cron.pl job runs as it should.
On the sensor, the "log_size_limit" is set to 432000000000.
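(We checked that value with something like: grep log_size_limit /etc/elsa_node.conf)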
I will report back if that solved the problem.
We also noticed that there seem to be a few Perl processes running the cron.pl script that are again getting stuck:
root 1996 1977 0 13:37 ? 00:00:00 /bin/sh -c perl /opt/elsa/node/elsa.pl -c /etc/elsa_node.conf
root 1997 1996 0 13:37 ? 00:00:03 perl /opt/elsa/node/elsa.pl -c /etc/elsa_node.conf
root 12011 12005 0 14:08 ? 00:00:00 /bin/sh -c perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf > /dev/null 2>&1
root 12013 12011 99 14:08 ? 00:28:16 perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf
root 12116 12111 0 14:09 ? 00:00:00 /bin/sh -c perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf > /dev/null 2>&1
root 12120 12116 99 14:09 ? 00:27:15 perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf
root 12370 12367 0 14:10 ? 00:00:00 /bin/sh -c perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf > /dev/null 2>&1
root 12375 12370 99 14:10 ? 00:26:15 perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf
root 12454 12448 0 14:11 ? 00:00:00 /bin/sh -c perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf > /dev/null 2>&1
root 12456 12454 99 14:11 ? 00:25:15 perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf
root 12500 12498 0 14:12 ? 00:00:00 /bin/sh -c perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf > /dev/null 2>&1
root 12503 12500 99 14:12 ? 00:24:15 perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf
root 14877 14876 0 14:26 ? 00:00:00 /bin/sh -c perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf > /dev/null 2>&1
root 14878 14877 75 14:26 ? 00:07:47 perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf
root 14928 14925 0 14:27 ? 00:00:00 /bin/sh -c perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf > /dev/null 2>&1
root 14932 14928 83 14:27 ? 00:07:46 perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf
root 15036 15029 0 14:29 ? 00:00:00 /bin/sh -c perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf > /dev/null 2>&1
root 15039 15036 98 14:29 ? 00:07:15 perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf
root 15158 15147 0 14:30 ? 00:00:00 /bin/sh -c perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf > /dev/null 2>&1
root 15164 15158 98 14:30 ? 00:06:15 perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf
root 15742 15736 0 14:31 ? 00:00:00 /bin/sh -c perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf > /dev/null 2>&1
root 15744 15742 97 14:31 ? 00:05:15 perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf
That said, the script runs OK if you run it manually, so maybe this isn't actually a problem, or isn't related?
Any ideas about what more we can check? I'm sort of wondering whether it would be possible to wipe the ELSA database and re-create it with a package reinstall or something? Is that possible without actually losing the syslog data?
I have made some other observations during this time as well that might be interesting:
* There are differences in the query_timeout value in the new config files on both the sensor and the manager, between the config file revisions /etc/elsa_web.conf and /etc/elsa_web.conf.20150729 (20150729 is when we ran soup). Notably, the value was 10000 on both sensor and manager before the upgrade, but afterwards it was set to 1000 on the manager and 55 on the sensor (see the diff command below). We tried setting them back to the old value and restarting, with no success.
* There is a lot of data accumulated in the "/nsm/elsa_data/elsa/mysql" directory on the sensor. The oldest file there is from 20150729 and the newest is from today.
* When running queries in the ELSA UI, data is returned fine for queries where the from-date is set to 20150410. Queries on older data work fine up until 20150729. If we choose 20150739 or some later date (up until 24 hours ago), we get the "batched query" message. A query with a from-date set to 24 hours ago also works fine.
I guess that even if our index date range is still only ~24 hours, we should not get these "batched query" messages; ELSA should fetch the data from the archive instead, as it does with the older data that we can query just fine. Do you have any more ideas about what we can try next?
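In case it helps, this is roughly how we compared the two config revisions for the query_timeout difference mentioned above (and the same on the sensor against its own backup file):
diff /etc/elsa_web.conf.20150729 /etc/elsa_web.conf | grep -i timeout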
Thanks!
On Friday, August 21, 2015, at 18:16:23 UTC+2, Doug Burks wrote:
> Replies inline.
>
> On Fri, Aug 21, 2015 at 12:01 PM, Magnus Wild <mag...@kalasarn.se> wrote:
> > Hello again! Sorry it has taken a while to get back to this thread. Anyway, we have tried increasing the log_size_limit quite a lot now, but we are still seeing the same index date range.
>
> What exactly did you set log_size_limit to?
>
Right now it is set to 740000000000 on both the sensor and the server, which is approximately 90% of the available disk space on the /nsm mount point.
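(How we arrived at that number, roughly: check the size of the /nsm mount point in bytes with "df -B1 /nsm" and take about 90% of it; 740000000000 bytes is roughly 740 GB.)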
> Have you tried restarting services since making that change?
>
Yes, we have restarted multiple times since the changes were made.
> > I have made some other observations during this time as well, that might be interresting:
> >
> > * There are differences in the query_timeout-value in the new config files on both sensor and manager, between the config-file revisions /etc/elsa_web.conf and /etc/elsa_web.conf.20150729 (20150729 is when we ran soup). Notably, the value was set to 10000 on both sensor and manager before the upgrade, but after it was set to 1000 on the manager and 55 on the sensor. Tried to set them back to the old value and restart, with no success.
>
> Yes, previously there was an issue with ELSA (unrelated to the problem
> you're experiencing) that we had tried to workaround by increasing
> query_timeout to 10000. This new version of ELSA no longer has that
> particular issue, so we've reverted query_timeout to its default value
> of 55.
>
Ah, all right. That explains it.
> > * There is a lot of data accumulated in the "/nsm/elsa_data/elsa/mysql"-directory on the sensor. The oldes file here is from 20150729 and the newest is from today.
>
> Yes, this is where ELSA stores its data.
Weirdly enough, there is only data from within the time period that we are having trouble querying (well, plus the 24-hour period that we can actually query successfully, of course). But isn't it quite a coincidence that the oldest files here are from the same date that we ran the upgrade?
If we check the same directory on the manager, we also have some files there, but only from today, and only three files in total.
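(The file dates were checked with something like "ls -ltr /nsm/elsa_data/elsa/mysql", oldest first, on both boxes.)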
>
> > * When running queries in the elsa UI, data is returned fine for queries where the from-date is set to 20150410. It is fine to do queries on older data up until 20150729. If we chose 20150739
>
> Is this a typo? 20150739 is an invalid date.
Ah, yes! It was supposed to be 20150730. :)
>
> > or some later date (up until 24 hours ago), we get the "batched query" message. A query with a fromdate set to 24 hours ago also works fine.
>
> > I guess that even if our index date range is still only ~24 hours, we should not get these "batched query" messages, elsa should fetch the data from the archive, as with the older data that we can do queries on fine. Do you have any more ideas about what we can try to get further here?
>
> When you do these test queries, are you monitoring the logs in
> /nsm/elsa/data/elsa/log/ for additional clues?
>
Yes, we have. I have a hard time finding much that stands out, though, in connection with running queries. I have attached some typical output from the logs produced just after a query has been run. For the first one, I did a simple query with start date 20150819 for the term "md5". The second time, I used the same start date and ran the "Bro top weird types" query (class=BRO_WEIRD "-" groupby:name) from the ELSA frame in Security Onion.
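For what it's worth, the attached log output was captured roughly like this while submitting the queries in the UI (assuming web.log and node.log are the relevant files in that directory):
tail -f /nsm/elsa/data/elsa/log/web.log /nsm/elsa/data/elsa/log/node.log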
We do see some messages like "No saved results":
* DEBUG [2015/08/21 21:56:12] /opt/elsa/web/lib/Controller.pm (220) Controller::__ANON__ 28277 [undef]
got results body from peer ilp-NSM-S-TC-1: $VAR1 = '{"warnings":[{"trace":"No saved results for qid 2202 found. at /opt/
But that seems to occur continuously, not only when running queries, which I guess is because of some scheduled activity?
Something else I noticed in the node.log is lines like this:
Effective log_size_limit: 740000000000, archive_limit: 244200000000
The archive_limit is set quite a lot lower than the log_size_limit. Would it be worth trying to increase that limit as well? How should it be set in relation to log_size_limit?
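(One more observation, in case it is relevant: 244200000000 / 740000000000 = 0.33, i.e. the effective archive_limit is exactly 33% of log_size_limit, so our assumption is that it is derived as a percentage of log_size_limit rather than set independently. We have not confirmed where that percentage comes from.)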