After Security Onion upgrade, problems with /opt/elsa/web/cron.pl processes being created indefinitely.


Magnus Wild

Jul 30, 2015, 8:27:36 AM
to security-onion
Hello!

It seems that after our upgrade yesterday, we are seeing problems on the Security Onion manager.

The first indication that something was wrong was that we could not log in to the Squert web UI. When we investigated further, we noticed that we could not log in to the database from the command line either, because there were too many connections to it.

After some reading we came across threads here describing similar problems, for example this one: https://groups.google.com/forum/#!topic/security-onion/xqMrSmnBng4. Within that thread, this other older thread was referenced: https://groups.google.com/forum/#!topic/security-onion/xF9v3RfNnaM

We can confirm that we have most of the symptoms described in these threads, but when performing this step from https://groups.google.com/forum/#!topic/security-onion/xF9v3RfNnaM

# verify there is no pid column in the query_log table:
mysql -uroot -Delsa_web -e 'describe query_log;'

We can see that the pid column is already present:
+--------------+----------------------+------+-----+-------------------+----------------+
| Field        | Type                 | Null | Key | Default           | Extra          |
+--------------+----------------------+------+-----+-------------------+----------------+
| qid          | int(10) unsigned     | NO   | PRI | NULL              | auto_increment |
| uid          | int(10) unsigned     | NO   | MUL | NULL              |                |
| query        | varchar(8000)        | NO   |     | NULL              |                |
| system       | tinyint(3) unsigned  | NO   |     | 0                 |                |
| timestamp    | timestamp            | NO   | MUL | CURRENT_TIMESTAMP |                |
| num_results  | int(11)              | YES  |     | NULL              |                |
| milliseconds | int(10) unsigned     | YES  |     | NULL              |                |
| archive      | tinyint(1)           | NO   | MUL | 0                 |                |
| pid          | smallint(5) unsigned | YES  |     | NULL              |                |
+--------------+----------------------+------+-----+-------------------+----------------+
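For reference, if the pid column had been missing, a statement along the following lines (matching the column definition shown above) would presumably add it; this is just our reading of the referenced thread, so double-check there first:

mysql -uroot -Delsa_web -e 'ALTER TABLE query_log ADD COLUMN pid SMALLINT UNSIGNED NULL;'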

Nonetheless, we could disable the job by commenting out the cron job line in /etc/cron.d/elsa, as also described in the previously mentioned thread.
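For reference, the entry we commented out in /etc/cron.d/elsa looks roughly like this (reconstructed from the ps output further down in this thread; the exact schedule field may differ):

# * * * * * root perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf > /dev/null 2>&1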

If I run "/opt/elsa/web/cron.pl -c /etc/elsa_web.conf" manually from the shell, I get a seemingly endless loop of Perl error messages, which seems odd. I have searched other threads for similar reports of these messages but have found none.

I should also mention that we are seeing some batching problems when running queries in ELSA. The messages read something like "Query 1460 submitted. SENSOR: Batching because estimated query time is 1051 seconds. 127.0.0.1: Batching because estimated query time is 1051 seconds."
I'm assuming this is caused by the cron.pl script not being able to finish since the upgrade? Queries work fine if "nobatch:1" is appended to the query.
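For example, a query line like the following (using the same "md5" term we tested with later) returns results without batching:

md5 nobatch:1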

I have attached a log containing the error messages produced when running the cron.pl script manually as described above. I have also attached the output of "sostat-redacted" from our manager.

Any ideas about what could be causing this?

cronpl_output.txt
SOSTAT_NSM_MAN.txt

Doug Burks

Jul 30, 2015, 3:17:43 PM
to securit...@googlegroups.com
Do you have a master and one or more separate sensor boxes?

Did you upgrade the master first and then all of the sensor boxes?

Were there any errors during the upgrade?

When was the last time you upgraded (before yesterday)?

Can you provide sostat output from the sensor boxes as well?

Do you have any ELSA alerts (scheduled queries)? If so, have you
tried disabling them?

Have you checked the log files in /nsm/elsa/data/elsa/log/ for additional clues?
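For example, something like this on the master will surface recent errors quickly (web.log is one of the files in that directory; adjust the filename as needed):

grep -i error /nsm/elsa/data/elsa/log/web.log | tail -20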



--
Doug Burks
Need Security Onion Training or Commercial Support?
http://securityonionsolutions.com

Magnus Wild

Jul 30, 2015, 4:52:57 PM
to security-onion, doug....@gmail.com
Hello Doug!

I have answered your questions below.

On Thursday, 30 July 2015 at 21:17:43 UTC+2, Doug Burks wrote:
> Do you have a master and one or more separate sensor boxes?

One master server and one sensor.

> Did you upgrade the master first and then all of the sensor boxes?

Yes, we ran "sudo soup" on the manager first, and after it had restarted we ran the same thing on the sensor.

> Were there any errors during the upgrade?

We did not observe any errors in the output during or after the upgrade.



> When was the last time you upgraded (before yesterday)?

The last upgrade we ran was a long time ago, probably several months back.

> Can you provide sostat output from the sensor boxes as well?

Yes, absolutely. I have attached it in this reply.

> Do you have any ELSA alerts (scheduled queries)? If so, have you
> tried disabling them?

I have checked in ELSA, and there are none to be found there.

> Have you checked the log files in /nsm/elsa/data/elsa/log/ for additional clues?

Yes, we did some testing during the day (it's evening here now), and there have been a few interesting findings in that log. We had a lot of these errors:

* ERROR [2015/07/24 08:17:08] /opt/elsa/web/lib/API.pm (354) API::__ANON__ 26515 [undef]
malformed JSON string, neither array, object, number, string or atom, at character offset 0 (before "(end of string)") at /opt/elsa/web/lib/API.pm line 350.

Header: $VAR1 = {
'Reason' => 'Forbidden',
'URL' => 'https://127.0.0.1:3154/API/result?qid=597',
'Status' => '403'
};

body: $VAR1 = undef;

And also some entries that seemed to be related to our proxy server, which is set in the /etc/environment file:

* ERROR [2015/07/24 08:18:09] /opt/elsa/web/lib/Utils.pm (679) Utils::__ANON__ 26560 [undef]
malformed JSON string, neither array, object, number, string or atom, at character offset 0 (before "<!DOCTYPE HTML PUBLI...") at /opt/elsa/web/lib/Utils.pm line 674.

Header: $VAR1 = {
'URL' => 'http://127.0.0.1:50000/API/info',
'connection' => 'close',
'Status' => '504',
'HTTPVersion' => '1.0',
'date' => 'Fri, 24 Jul 2015 06:21:43 GMT',
'x-squid-error' => 'ERR_CONNECT_FAIL 111',
'content-length' => '1079',
'Reason' => 'Gateway Time-out',
'content-type' => 'text/html',
'via' => '1.0 PROXYIP:800 (squid/2.7.STABLE9)',
'server' => 'squid/2.7.STABLE9',
'x-cache' => 'MISS from inf-proxy1'
};

We tried temporarily moving the environment file out of the way and rebooting, and that seemed to take care of those errors. But we are still seeing the same kind of output when trying to run the cron.pl script.
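In case it helps anyone else, what we did was roughly the following (the .disabled suffix is just a name we picked):

sudo mv /etc/environment /etc/environment.disabled
sudo reboot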

SOSTAT_NSM_SEN.txt

Doug Burks

Jul 30, 2015, 10:18:07 PM
to Magnus Wild, security-onion
From sostat_nsm_sen.txt:

ELSA Index Date Range:
MIN(start) MAX(end)
2015-07-29 14:17:29 2015-07-30 20:39:05

The default search range in ELSA is 48 hours. Since your index date
range is less than 48 hours, this is causing your searches to go to
archive and batch.
You can test this by searching ELSA with a start date that's later
than MIN(start) as shown in your Index Date Range. This should search
indexes and not batch.

What is log_size_limit set to in /etc/elsa_node.conf on the sensor?
You'll most likely want to increase it to ensure you have more than 48
hours in your Index Date Range.
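For reference, log_size_limit is a single value in bytes in /etc/elsa_node.conf; a hypothetical example line (pick a value that fits your disk):

"log_size_limit": 500000000000,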

Magnus Wild

Jul 31, 2015, 7:50:25 AM
to security-onion, mag...@kalasarn.se, doug....@gmail.com
It seems that if we set the date range to 12 hours ago, then it works! I guess this has something to do with the cron.pl script not running during that day, which caused every query for data in that time range to be batched?

We did get the cron.pl script running again yesterday evening by dropping some rows from the "saved_results" table in the elsa_web database, since we saw mentions of duplicate entries like the one below in the ELSA web.log. That seemed to solve the problem with cron.pl not being able to complete:

DBD::mysql::st execute failed: Duplicate entry '1357' for key 'PRIMARY' QUERY: INSERT INTO saved_results (qid, comments) VALUES(?,?)
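What we ran was essentially something like the following, with 1357 being the qid from the error above (back up the elsa_web database first; the exact rows to drop will differ per install):

mysql -uroot -Delsa_web -e 'DELETE FROM saved_results WHERE qid=1357;'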

We figure some transactions were left incomplete during the upgrade. Querying data indexed/collected since then now works fine.

We will run it like this over the weekend, and hopefully we will be back on track afterwards if the cron.pl job runs as it should.

On the sensor, the "log_size_limit" is set to 432000000000.

I will report back if that solved the problem.

Magnus Wild

Aug 4, 2015, 11:03:50 AM
to security-onion, mag...@kalasarn.se, doug....@gmail.com
I am afraid the fix we implemented did not actually solve the problem. We made a slight modification to /opt/elsa/web/lib/Query/SQL.pm to make the "estimated query time" calculation work properly. Initially it seemed to work, but on further investigation we get null results instead. If we revert the file to the package default, we get the batched-query behaviour back.

We also noticed that there seem to be a number of Perl processes running the cron.pl script that are getting stuck again:

root 1996 1977 0 13:37 ? 00:00:00 /bin/sh -c perl /opt/elsa/node/elsa.pl -c /etc/elsa_node.conf
root 1997 1996 0 13:37 ? 00:00:03 perl /opt/elsa/node/elsa.pl -c /etc/elsa_node.conf
root 12011 12005 0 14:08 ? 00:00:00 /bin/sh -c perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf > /dev/null 2>&1
root 12013 12011 99 14:08 ? 00:28:16 perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf
root 12116 12111 0 14:09 ? 00:00:00 /bin/sh -c perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf > /dev/null 2>&1
root 12120 12116 99 14:09 ? 00:27:15 perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf
root 12370 12367 0 14:10 ? 00:00:00 /bin/sh -c perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf > /dev/null 2>&1
root 12375 12370 99 14:10 ? 00:26:15 perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf
root 12454 12448 0 14:11 ? 00:00:00 /bin/sh -c perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf > /dev/null 2>&1
root 12456 12454 99 14:11 ? 00:25:15 perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf
root 12500 12498 0 14:12 ? 00:00:00 /bin/sh -c perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf > /dev/null 2>&1
root 12503 12500 99 14:12 ? 00:24:15 perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf
root 14877 14876 0 14:26 ? 00:00:00 /bin/sh -c perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf > /dev/null 2>&1
root 14878 14877 75 14:26 ? 00:07:47 perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf
root 14928 14925 0 14:27 ? 00:00:00 /bin/sh -c perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf > /dev/null 2>&1
root 14932 14928 83 14:27 ? 00:07:46 perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf
root 15036 15029 0 14:29 ? 00:00:00 /bin/sh -c perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf > /dev/null 2>&1
root 15039 15036 98 14:29 ? 00:07:15 perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf
root 15158 15147 0 14:30 ? 00:00:00 /bin/sh -c perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf > /dev/null 2>&1
root 15164 15158 98 14:30 ? 00:06:15 perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf
root 15742 15736 0 14:31 ? 00:00:00 /bin/sh -c perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf > /dev/null 2>&1
root 15744 15742 97 14:31 ? 00:05:15 perl /opt/elsa/web/cron.pl -c /etc/elsa_web.conf

That said, the script runs OK if you run it manually, so maybe this isn't actually a problem, or isn't related?
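For reference, the stuck workers can be listed and, if necessary, cleared with something like this (a generic sketch, not a step we were advised to take):

ps -ef | grep '[c]ron.pl'
sudo pkill -f '/opt/elsa/web/cron.pl'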

Any ideas on what more we can check? I'm wondering whether it would be possible to wipe the ELSA database and re-create it with a package reinstall or something. Is that possible without actually losing the syslog data?

Doug Burks

Aug 4, 2015, 1:30:49 PM
to security-onion
First, I would double-check the log files in /nsm/elsa/data/elsa/log/
for any additional clues like the "Duplicate entry" log you found
earlier.

Once you've resolved all errors I would then go back to my previous
recommendations regarding log_size_limit and ELSA Index Date Range.
If Index Date Range is less than the default 2 day search window, then
I would either:

- increase log_size_limit until Index Date Range is greater than the
default 2 day search window

OR

- decrease the default search window of 2 days in /etc/elsa_web.conf
to fit within your Index Date Range:
"default_start_time_offset": 2,
(then restart Apache)
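For example, to shrink the window to 1 day (the apache2 service name assumes the stock setup):

"default_start_time_offset": 1,
sudo service apache2 restart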

Magnus Wild

Aug 5, 2015, 10:38:22 AM
to security-onion
Hello again Doug!

I just noticed that the log_size_limit on the manager was set much lower than on the sensor: it was set to 60GB on the manager, while the sensor was set to 400GB. I have increased it somewhat to see whether that makes any difference over the next few days.

I also re-checked the ELSA web.log and see a few different DEBUG-labelled messages about some qids missing saved results. From what I understand, that happens because the query returns nothing. Here is a sample of one of those messages:

* DEBUG [2015/08/05 14:27:09] /opt/elsa/web/lib/Controller.pm (220) Controller::__ANON__ 10632 [undef]
got results body from peer ilp-NSM-S-TC-1: $VAR1 = '{"warnings":[{"trace":"No saved results for qid 1963 found. at /opt/elsa/web/lib/Controller.pm line 97\\n\\tController::add_warning(\'Controller::Peers=HASH(0x5b59690)\', 404, \'No saved results for qid 1963 found.\') called at /opt/elsa/web/lib/Controller.pm line 300\\n\\tController::__ANON__() called at /usr/share/perl5/Try/Tiny.pm line 76\\n\\teval {...} called at /usr/share/perl5/Try/Tiny.pm line 67\\n\\tTry::Tiny::try(\'CODE(0x885a978)\', \'Try::Tiny::Catch=REF(0x80ee2b8)\') called at /opt/elsa/web/lib/Controller.pm line 324\\n\\tController::local_result(\'Controller::Peers=HASH(0x5b59690)\', \'HASH(0x819f778)\') called at /opt/elsa/web/lib/Controller.pm line 206\\n\\tController::result(\'Controller::Peers=HASH(0x5b59690)\', \'HASH(0x819f778)\') called at /opt/elsa/web/lib/View/API.pm line 121\\n\\tView::API::__ANON__() called at /usr/share/perl5/Try/Tiny.pm line 76\\n\\teval {...} called at /usr/share/perl5/Try/Tiny.pm line 67\\n\\tTry::Tiny::try(\'CODE(0x7f28e58)\', \'Try::Tiny::Catch=REF(0x901ca68)\') called at /opt/elsa/web/lib/View/API.pm line 135\\n\\tView::API::__ANON__(\'CODE(0x901d5f0)\') called at /usr/share/perl5/Plack/Util.pm line 324\\n\\tPlack::Util::__ANON__(\'CODE(0x92e0d50)\') called at /usr/share/perl5/Starman/Server.pm line 225\\n\\tStarman::Server::process_request(\'Starman::Server=HASH(0x16a1900)\') called at /usr/share/perl5/Net/Server.pm line 142\\n\\tNet::Server::run_client_connection(\'Starman::Server=HASH(0x16a1900)\') called at /usr/share/perl5/Net/Server/PreFork.pm line 273\\n\\teval {...} called at /usr/share/perl5/Net/Server/PreFork.pm line 273\\n\\tNet::Server::PreFork::run_child(\'Starman::Server=HASH(0x16a1900)\') called at /usr/share/perl5/Net/Server/PreFork.pm line 229\\n\\tNet::Server::PreFork::run_n_children(\'Starman::Server=HASH(0x16a1900)\', 1) called at /usr/share/perl5/Net/Server/PreFork.pm line 456\\n\\tNet::Server::PreFork::coordinate_children(\'Starman::Server=HASH(0x16a1900)\') called at /usr/share/perl5/Net/Server/PreFork.pm line 410\\n\\tNet::Server::PreFork::run_parent(\'Starman::Server=HASH(0x16a1900)\') called at /usr/share/perl5/Starman/Server.pm line 104\\n\\tStarman::Server::run_parent(\'Starman::Server=HASH(0x16a1900)\') called at /usr/share/perl5/Net/Server/PreFork.pm line 147\\n\\tNet::Server::PreFork::loop(\'Starman::Server=HASH(0x16a1900)\') called at /usr/share/perl5/Net/Server.pm line 117\\n\\tNet::Server::run(\'Starman::Server=HASH(0x16a1900)\', \'port\', \'ARRAY(0x1af6210)\', \'host\', \'ARRAY(0x1af6120)\', \'proto\', \'ARRAY(0x1a95ed8)\', \'serialize\', \'flock\', ...) called at /usr/share/perl5/Starman/Server.pm line 61\\n\\tStarman::Server::run(\'Starman::Server=HASH(0x16a1900)\', \'CODE(0x16a1a80)\', \'HASH(0x16a1d80)\') called at /usr/share/perl5/Plack/Handler/Starman.pm line 18\\n\\tPlack::Handler::Starman::run(\'Plack::Handler::Starman=HASH(0x16a1990)\', \'CODE(0x16a1a80)\') called at /usr/share/perl5/Plack/Loader/Delayed.pm line 20\\n\\tPlack::Loader::Delayed::run(\'Plack::Loader::Delayed=HASH(0x166dec8)\', \'Plack::Handler::Starman=HASH(0x16a1990)\') called at /usr/share/perl5/Plack/Runner.pm line 263\\n\\tPlack::Runner::run(\'Plack::Runner=HASH(0x166d940)\') called at /usr/bin/starman line 31\\n","shortmess":"No saved results for qid 1963 found. at /opt/elsa/web/lib/Controller.pm line 97\\n","data":null,"code":404,"message":"No saved results for qid 1963 found."}],"incomplete":{"127.0.0.1":0}}';
* DEBUG [2015/08/05 14:27:09] /opt/elsa/web/lib/Controller.pm (178) Controller::__ANON__ 10632 [undef]
stats: $VAR1 = {
'ilp-NSM-S-TC-1' => {
'total_request_time' => '0.423200130462646'
}
};
* TRACE [2015/08/05 14:27:09] /opt/elsa/web/cron.pl (99) main:: 10632 [undef]
cron.pl finished.

Other than that, I did not observe anything else in the web.log. We'll see if the increase of log_size_limit solves the issue. I'm curious about the root cause here, though. Could the setting/config have been overwritten by some package during the upgrade? /etc/elsa_web.conf seems to have been at least, since there is now an "elsa_web.conf.20150729" stored there, which I assume is a backup of the old configuration file. /etc/elsa_node.conf does not have such a backup file, though.
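To see exactly what the upgrade changed, a quick comparison of the backup against the live file helps:

diff /etc/elsa_web.conf.20150729 /etc/elsa_web.conf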

Magnus Wild

Aug 10, 2015, 3:31:04 AM
to security-onion
I just wanted to give you an update on this issue. It seems that the increase of log_size_limit in elsa_node.conf on the manager did the trick. We can now run queries within the default time range without problems, and we can also see that the number of indexed logs has increased from around 125 million to 252 million, which seems promising.

I'm not exactly sure about the initial problems with the cron.pl script, but I think they were resolved by dropping the duplicate rows in the database, as that problem went away pretty early in the troubleshooting process.

Thank you Doug for your patience and pointers. It helped put us on the right track and eventually solve the issue.

/Magnus

Magnus Wild

Aug 11, 2015, 10:00:07 AM
to security-onion
Sadly, it didn't last long. We are still having issues with batched queries showing up, and I have some new information to share.

We have one manager and one sensor. If I run sostat on the manager, I get this information in the ELSA stanza:

=========================================================================
ELSA
=========================================================================
Syslog-ng
Checking for process:
31528 supervising syslog-ng
31529 /usr/sbin/syslog-ng -p /var/run/syslog-ng.pid
Checking for connection:
Connection to localhost 514 port [tcp/shell] succeeded!

MySQL
Checking for process:
2090 /usr/sbin/mysqld
Checking for connection:
Connection to localhost 3306 port [tcp/mysql] succeeded!

Sphinx
Checking for process:
1917 su -s /bin/sh -c exec "$0" "$@" sphinxsearch -- /usr/bin/searchd --nodetach
Checking for connection:
Connection to localhost 9306 port [tcp/*] succeeded!

ELSA Buffers in Queue:
2
If this number is consistently higher than 20, please see:
https://github.com/Security-Onion-Solutions/security-onion/wiki/FAQ#why-does-sostat-show-a-high-number-of-elsa-buffers-in-queue

ELSA Directory Sizes:
5.4G /nsm/elsa/data
48M /var/lib/mysql/syslog
32K /var/lib/mysql/syslog_data

ELSA Index Date Range:
+---------------------+---------------------+
| MIN(start) | MAX(end) |
+---------------------+---------------------+
| 2015-04-08 13:42:37 | 2015-08-11 13:44:40 |
+---------------------+---------------------+

If I run it on the sensor, I get this stanza:

=========================================================================
ELSA
=========================================================================
Syslog-ng
Checking for process:
18160 supervising syslog-ng
18163 /usr/sbin/syslog-ng -p /var/run/syslog-ng.pid
Checking for connection:
Connection to localhost 514 port [tcp/shell] succeeded!

MySQL
Checking for process:
2544 /usr/sbin/mysqld
Checking for connection:
Connection to localhost 50000 port [tcp/*] succeeded!

Sphinx
Checking for process:
2505 su -s /bin/sh -c exec "$0" "$@" sphinxsearch -- /usr/bin/searchd --nodetach
Checking for connection:
Connection to localhost 9306 port [tcp/*] succeeded!

ELSA Buffers in Queue:
3
If this number is consistently higher than 20, please see:
https://github.com/Security-Onion-Solutions/security-onion/wiki/FAQ#why-does-sostat-show-a-high-number-of-elsa-buffers-in-queue

ELSA Directory Sizes:
618G /nsm/elsa/data
121M /var/lib/mysql/syslog
3.9M /var/lib/mysql/syslog_data

ELSA Index Date Range:
+---------------------+---------------------+
| MIN(start) | MAX(end) |
+---------------------+---------------------+
| 2015-08-10 13:01:23 | 2015-08-11 13:50:21 |
+---------------------+---------------------+

I don't think it makes sense that we even have an index date range on the sensor, right? Could it be that the configuration has been updated in the upgrade process somehow?

/Magnus

Doug Burks

Aug 11, 2015, 7:20:14 PM
to securit...@googlegroups.com
On Tue, Aug 11, 2015 at 10:00 AM, Magnus Wild <mag...@kalasarn.se> wrote:
> ELSA Index Date Range:
> +---------------------+---------------------+
> | MIN(start) | MAX(end) |
> +---------------------+---------------------+
> | 2015-08-10 13:01:23 | 2015-08-11 13:50:21 |
> +---------------------+---------------------+
>
> I don't think it makes sense that we even have an index date range on the sensor, right? Could it be that the configuration has been updated in the upgrade process somehow?

Sensors have their own local ELSA databases and so, yes, they should
have their own Index Date Range. You'll want to increase
log_size_limit in /etc/elsa_node.conf on the sensor until you have at
least a 48-hour window in that Index Date Range.

Magnus Wild

Aug 21, 2015, 12:01:15 PM
to security-onion
Hello again! Sorry it has taken a while to get back to this thread. Anyway, we have tried increasing the log_size_limit quite a lot now, but we are still seeing the same index date range.

I have made some other observations during this time as well that might be interesting:

* There are differences in the query_timeout value in the new config files on both sensor and manager, between the config file revisions /etc/elsa_web.conf and /etc/elsa_web.conf.20150729 (20150729 is when we ran soup). Notably, the value was set to 10000 on both sensor and manager before the upgrade, but afterwards it was set to 1000 on the manager and 55 on the sensor. We tried setting them back to the old value and restarting, with no success.
* There is a lot of data accumulated in the "/nsm/elsa_data/elsa/mysql" directory on the sensor. The oldest file here is from 20150729 and the newest is from today.
* When running queries in the ELSA UI, data is returned fine for queries where the from-date is set to 20150410. It is fine to query older data up until 20150729. If we choose 20150739 or some later date (up until 24 hours ago), we get the "batched query" message. A query with a from-date set to 24 hours ago also works fine.

I guess that even if our index date range is still only ~24 hours, we should not get these "batched query" messages; ELSA should fetch the data from the archive, as it does with the older data that we can query fine. Do you have any more ideas about what we can try next?

Thanks!

Doug Burks

Aug 21, 2015, 12:16:23 PM
to securit...@googlegroups.com
Replies inline.

On Fri, Aug 21, 2015 at 12:01 PM, Magnus Wild <mag...@kalasarn.se> wrote:
> Hello again! Sorry it has taken a while to get back to this thread. Anyway, we have tried increasing the log_size_limit quite a lot now, but we are still seeing the same index date range.

What exactly did you set log_size_limit to?

Have you tried restarting services since making that change?

> I have made some other observations during this time as well that might be interesting:
>
> * There are differences in the query_timeout value in the new config files on both sensor and manager, between the config file revisions /etc/elsa_web.conf and /etc/elsa_web.conf.20150729 (20150729 is when we ran soup). Notably, the value was set to 10000 on both sensor and manager before the upgrade, but afterwards it was set to 1000 on the manager and 55 on the sensor. We tried setting them back to the old value and restarting, with no success.

Yes, previously there was an issue with ELSA (unrelated to the problem
you're experiencing) that we had tried to work around by increasing
query_timeout to 10000. This new version of ELSA no longer has that
particular issue, so we've reverted query_timeout to its default value
of 55.

> * There is a lot of data accumulated in the "/nsm/elsa_data/elsa/mysql" directory on the sensor. The oldest file here is from 20150729 and the newest is from today.

Yes, this is where ELSA stores its data.

> * When running queries in the ELSA UI, data is returned fine for queries where the from-date is set to 20150410. It is fine to query older data up until 20150729. If we choose 20150739

Is this a typo? 20150739 is an invalid date.

> or some later date (up until 24 hours ago), we get the "batched query" message. A query with a from-date set to 24 hours ago also works fine.

> I guess that even if our index date range is still only ~24 hours, we should not get these "batched query" messages; ELSA should fetch the data from the archive, as it does with the older data that we can query fine. Do you have any more ideas about what we can try next?

When you do these test queries, are you monitoring the logs in
/nsm/elsa/data/elsa/log/ for additional clues?

Magnus Wild

Aug 21, 2015, 6:11:31 PM
to security-onion
Hello again!

On Friday, 21 August 2015 at 18:16:23 UTC+2, Doug Burks wrote:
> Replies inline.
>
> On Fri, Aug 21, 2015 at 12:01 PM, Magnus Wild <mag...@kalasarn.se> wrote:
> > Hello again! Sorry it has taken a while to get back to this thread. Anyway, we have tried increasing the log_size_limit quite a lot now, but we are still seeing the same index date range.
>
> What exactly did you set log_size_limit to?
>

Right now it is set to 740000000000 on both sensor and server, approximately 90% of the available disk space on the /nsm mount point.
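(For what it's worth, 90% of the /nsm filesystem size in bytes can be computed with something along these lines; the awk field index assumes the standard single-line df output:)

df -B1 -P /nsm | awk 'NR==2 {printf "%.0f\n", $2*0.9}'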

> Have you tried restarting services since making that change?
>

Yes, we have restarted multiple times since the changes were made.

> > I have made some other observations during this time as well that might be interesting:
> >
> > * There are differences in the query_timeout value in the new config files on both sensor and manager, between the config file revisions /etc/elsa_web.conf and /etc/elsa_web.conf.20150729 (20150729 is when we ran soup). Notably, the value was set to 10000 on both sensor and manager before the upgrade, but afterwards it was set to 1000 on the manager and 55 on the sensor. We tried setting them back to the old value and restarting, with no success.
>
> Yes, previously there was an issue with ELSA (unrelated to the problem
> you're experiencing) that we had tried to work around by increasing
> query_timeout to 10000. This new version of ELSA no longer has that
> particular issue, so we've reverted query_timeout to its default value
> of 55.
>

Ah, all right. That explains it.

> > * There is a lot of data accumulated in the "/nsm/elsa_data/elsa/mysql" directory on the sensor. The oldest file here is from 20150729 and the newest is from today.
>
> Yes, this is where ELSA stores its data.

Weirdly enough, there is only data from within the time period that we are having trouble querying (plus the 24-hour period that we can actually query successfully, of course). But isn't it very coincidental that the oldest files here are from the same date that we ran the upgrade?

If we check in the same directory on the manager, we also have some files there, but only from today, and only three files in total.

>
> > * When running queries in the ELSA UI, data is returned fine for queries where the from-date is set to 20150410. It is fine to query older data up until 20150729. If we choose 20150739
>
> Is this a typo? 20150739 is an invalid date.

Ah, yes! It was supposed to be 20150730. :)

>
> > or some later date (up until 24 hours ago), we get the "batched query" message. A query with a from-date set to 24 hours ago also works fine.
>
> > I guess that even if our index date range is still only ~24 hours, we should not get these "batched query" messages; ELSA should fetch the data from the archive, as it does with the older data that we can query fine. Do you have any more ideas about what we can try next?
>
> When you do these test queries, are you monitoring the logs in
> /nsm/elsa/data/elsa/log/ for additional clues?
>

Yes, we have. I have a hard time finding much that stands out, though, in connection with when we are running queries. I have attached some typical output from the logs produced just after a query has been run. For the first set, I ran a simple query with start date 20150819 and the term "md5". The second time, I used the same start date and ran the "Bro top weird types" query (class=BRO_WEIRD "-" groupby:name) from the ELSA frame in Security Onion.

We do see some messages like "No saved results":

* DEBUG [2015/08/21 21:56:12] /opt/elsa/web/lib/Controller.pm (220) Controller::__ANON__ 28277 [undef]
got results body from peer ilp-NSM-S-TC-1: $VAR1 = '{"warnings":[{"trace":"No saved results for qid 2202 found. at /opt/

But that seems to occur continuously, not only when running queries, which I guess is because of some scheduled activity?

Something else I saw is this line in the node.log:

Effective log_size_limit: 740000000000, archive_limit: 244200000000

The archive_limit is set quite a lot lower than the log_size_limit. Would it be worth trying to increase that limit as well? How should it be adjusted in relation to log_size_limit?
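(A quick sanity check: 33% of 740000000000 is exactly 244200000000, so archive_limit looks like it is derived from log_size_limit via an archive percentage; this is just our assumption:)

echo $((740000000000 * 33 / 100))    # prints 244200000000, matching the logged archive_limit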

manager_md5_elsa_web.log
sensor_bro_weird_elsa_web.log
sensor_md5_elsa_web.log
server_bro_weird_elsa_web.log

Doug Burks

Aug 22, 2015, 6:27:27 AM
to securit...@googlegroups.com
On Fri, Aug 21, 2015 at 6:11 PM, Magnus Wild <mag...@kalasarn.se> wrote:
>> What exactly did you set log_size_limit to?
>>
>
> Right now it is set to 740000000000 on both sensor and server, approximately 90% of the available disk space on the /nsm mount point.

My understanding so far is that you increased log_size_limit to
roughly 740GB and restarted services, but the sensor's index date
range has not increased. Let's focus on that issue. Please run the
following on your sensor:
sudo du --max-depth=0 -h /nsm/elsa/data /var/lib/mysql/syslog /var/lib/mysql/syslog_data

Also, how much traffic is this sensor seeing (in terms of average/peak
Mbps or Gbps)?

Magnus Wild

Aug 24, 2015, 3:35:50 AM
to security-onion
Hello again!

Here is the output of the du-command:

656G /nsm/elsa/data
130M /var/lib/mysql/syslog
5.2M /var/lib/mysql/syslog_data

Average is about 80Mbps and peak is around 100Mbps.

Doug Burks

Aug 24, 2015, 7:25:30 AM
to securit...@googlegroups.com
Replies inline.

On Mon, Aug 24, 2015 at 3:35 AM, Magnus Wild <mag...@kalasarn.se> wrote:
> Hello again!
>
> Here is the output of the du-command:
>
> 656G /nsm/elsa/data
> 130M /var/lib/mysql/syslog
> 5.2M /var/lib/mysql/syslog_data

If you watch this over a few hours, is 656G the maximum? Seems like
it's not quite getting to the 740000000000 value that you set in
/etc/elsa_node.conf.

> Average is about 80Mbps and peak is around 100Mbps.

I would think that 656G would be enough to store an Index Date Range
of more than 48 hours if all you're sending into ELSA is just Bro logs
for a 100Mbps connection. Are you sending any other logs to ELSA?

Here is a test you could try on the sensor if you don't mind losing a
little data:
- go into /etc/elsa_node.conf and drop the archive percentage from 33
down to something like 15 or 20
- reboot the sensor
- watch the sostat output on the sensor and see if Index Date Range increases

Here is another test you could try on the sensor if you don't mind
dropping all of the ELSA data on that sensor:
- go into /etc/elsa_node.conf and drop the log_size_limit from
740000000000 down to a much smaller value like 100
- reboot the sensor
- watch the output of "sudo du --max-depth=0 -h /nsm/elsa/data
/var/lib/mysql/syslog" and verify that old logs are removed and disk
space is released
- go into /etc/elsa_node.conf and increase log_size_limit back to 740000000000
- reboot the sensor
- after 48 hours, check the output of sostat and see if the Index Date
Range includes that entire 48 hours
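If you want to keep an eye on the purge while it runs, something like this works (the 5-minute interval is arbitrary):

sudo watch -n 300 du --max-depth=0 -h /nsm/elsa/data /var/lib/mysql/syslog /var/lib/mysql/syslog_data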

Magnus Wild

Aug 25, 2015, 5:21:04 AM
to security-onion
Hello again!

On Monday, 24 August 2015 at 13:25:30 UTC+2, Doug Burks wrote:
> Replies inline.
>
> On Mon, Aug 24, 2015 at 3:35 AM, Magnus Wild <mag...@kalasarn.se> wrote:
> > Hello again!
> >
> > Here is the output of the du-command:
> >
> > 656G /nsm/elsa/data
> > 130M /var/lib/mysql/syslog
> > 5.2M /var/lib/mysql/syslog_data
>
> If you watch this over a few hours, is 656G the maximum? Seems like
> it's not quite getting to the 740000000000 value that you set in
> /etc/elsa_node.conf.
>

It seems to change periodically but stays around 650-670G used. I have not observed it going beyond 670G.

> > Average is about 80Mbps and peak is around 100Mbps.
>
> I would think that 656G would be enough to store an Index Date Range
> of more than 48 hours if all you're sending into ELSA is just Bro logs
> for a 100Mbps connection. Are you sending any other logs to ELSA?
>
> Here is a test you could try on the sensor if you don't mind losing a
> little data:
> - go into /etc/elsa_node.conf and drop the archive percentage from 33
> down to something like 15 or 20
> - reboot the sensor
> - watch the sostat output on the sensor and see if Index Date Range increases
>

I tried changing that value to 15 yesterday. There is still only a 24-hour Index Date Range available after the change (I made the change approximately 12+ hours ago).

> Here is another test you could try on the sensor if you don't mind
> dropping all of the ELSA data on that sensor:
> - go into /etc/elsa_node.conf and drop the log_size_limit from
> 740000000000 down to a much smaller value like 100
> - reboot the sensor
> - watch the output of "sudo du --max-depth=0 -h /nsm/elsa/data
> /var/lib/mysql/syslog" and verify that old logs are removed and disk
> space is released
> - go into /etc/elsa_node.conf and increase log_size_limit back to 740000000000
> - reboot the sensor
> - after 48 hours, check the output of sostat and see if the Index Date
> Range includes that entire 48 hours
>

I think we will try this next. Right now it is more important to get this working for future investigations than to retain all the old ELSA data. We are still really only in a POC, so it is not THAT important to retain all the data at the moment.

I'll report back how things are going. Thanks!

Magnus Wild

Aug 28, 2015, 3:34:34 AM
to security-onion
Status update: we have purged the logs from the sensor as you suggested, and it seems to have worked fine. No more batched queries, and we get the expected results within this 2-day period.

One thing I noticed, though, is that the disk is quickly filling up, and it seems that ELSA is using most of the space again. On 20150826, after the purge, we had 42G used on the /nsm mount point. Now, roughly two days later, we have 615G used, which means we will hit our 2-day index limit pretty soon. Is this really the expected increase for two days' worth of ~100Mbps traffic?

/Magnus

Doug Burks

Aug 29, 2015, 7:22:02 AM
to securit...@googlegroups.com
On Fri, Aug 28, 2015 at 3:34 AM, Magnus Wild <mag...@kalasarn.se> wrote:
> Status update: we have purged the logs from the sensor as you suggested, and it seems to have worked fine. No more batched queries, and we get the expected results within this 2-day period.
>
> One thing I noticed, though, is that the disk is quickly filling up, and it seems that ELSA is using most of the space again. On 20150826, after the purge, we had 42G used on the /nsm mount point. Now, roughly two days later, we have 615G used, which means we will hit our 2-day index limit pretty soon. Is this really the expected increase for two days' worth of ~100Mbps traffic?

That's somewhat dependent on your exact mix of traffic.

Looking back at your sostat output from the sensor, it shows that
you're sniffing both eth1 and eth3. How much traffic is each of
these interfaces seeing?

Your sostat output also shows that you have 3 Bro workers for each of
these sniffing interfaces. You probably don't need 3 Bro workers for
just 100Mbps of traffic, so are you sure that's all the traffic
they're seeing?

Also, are you sending any other logs to ELSA?

Any syslog?

Any OSSEC agents?

Magnus Wild

Oct 7, 2015, 7:32:46 AM
to security-onion
Hello Doug!

Sorry to resurrect such an old thread, but I felt it would be beneficial to get some closure here.

The problem has been mitigated for the time being, and we now have a good 4-5 days for the search window in ELSA, which has resolved the "batched query" issues we were seeing.

Both interfaces see a peak of about ~100Mbps each, so my previous statement should have been that we are seeing something like ~200Mbps at peak.

No syslog is forwarded to ELSA, and there are no OSSEC agents other than the sensor and manager.

What we did see, though, was that there were enormous amounts of DNS requests coming in from two specific hosts, and these seemed to represent the majority of the logs we saw in ELSA. Eventually it boiled down to a misbehaving application making thousands of requests per second. We have no direct control over those machines or applications, so we ended up with a BPF to filter out specifically the DNS traffic between the DNS server and those application servers. After a few days the ELSA index built up enough to let us query 4-5 days back in time, which is sufficient for us right now. If we need more days in the index, I guess our way forward would be to tune it further or expand the disk that houses the logs.
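For anyone wanting to do something similar, the filter was along these lines (the addresses here are placeholders, and the exact bpf.conf location depends on how your sensor is set up):

not (port 53 and host 192.0.2.53 and (host 192.0.2.101 or host 192.0.2.102))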

Thank you for your patience, Doug; you have been very helpful throughout the troubleshooting of this issue.