High Latency MySQL Connections

Mike Lococo

Sep 29, 2011, 6:33:52 PM

to barnyar...@googlegroups.com

Hi Folks,

I'm experiencing slow event processing from barnyard2 on a host that is
connected to its mysql server via a high-bandwidth but high-latency
transatlantic link (roundtrips of 0.2-0.3 seconds). I strongly suspect
that mysql roundtrip delays are causing barnyard to process events at
about 1-2 events per second (compared to hundreds of events per second
for by2 instances on a lan close to the mysql server). My questions are:

1) Does by2 have any instrumentation that would help me confirm or deny
this suspicion?
2) Has anyone successfully run by2 outputting to mysql over a
high-latency link?
3) Has anyone tried to do so and failed?
4) Do you have any suggestions for increasing throughput besides
standing up another db-server close to the sensor? For example, is
it possible to get barnyard2 to use multiple parallel database
insert threads? There's plenty of bandwidth left on the link, it's
just the long roundtrip times that are a problem.

Additional details that may prove interesting to folks considering this
question:

* I first observed that unified2 files were building up in Snort's log
directory even though Barnyard is set to archive files to a separate
directory once they're processed.
* I then confirmed that I was getting out of date timestamped events
from this sensor inserted into the database, and that the delay was
growing over time.
* I stopped by2, deleted its waldo file, deleted the unified2 file that
it was processing, and restarted by2. The timestamps jumped up to a
more recent value corresponding to when the new u2 file was created,
but events were still coming in slowly (about 1/sec) in spite of
the fact that several unified2 log files were queued up for
processing. Additionally, the timestamp gap was still growing over
time... aka by2 was falling further behind because snort was
generating events faster than they were being inserted into the
database.
* I checked the disk IO/CPU on the db server, which is minimal. It's a
pretty beefy system that isn't loaded anywhere near capacity.
* I ran a tcpdump and confirmed that it takes a second or more to
complete the 7 tcp roundtrips that comprise an insert (BEGIN, SELECT
from signature, INSERT into event, INSERT into tcp/udp/icmp-hdr,
INSERT into iphdr, INSERT into data, COMMIT) and that queries appear
to be sequential with no overlap that indicates multiple database
insert threads. I generate more than 1 event per second on this
sensor, so that rate isn't fast enough to keep up.

Any suggestions are much appreciated.

Cheers,
Mike Lococo

beenph

Sep 29, 2011, 8:23:34 PM

to barnyar...@googlegroups.com

On Thu, Sep 29, 2011 at 6:33 PM, Mike Lococo <mikel...@gmail.com> wrote:
> Hi Folks,
>

Hi Mike, have you tried my db reliability branch?

It's still a work in progress, but mysql support is working.

Would you be able to test it?

https://github.com/binf/barnyard2/tree/dbConnectionReliability

I will reply to the rest of your questions in a subsequent e-mail
within the next hour, since there's a lot more to say.

beenph

Sep 30, 2011, 12:13:16 AM

to barnyar...@googlegroups.com

On Thu, Sep 29, 2011 at 6:33 PM, Mike Lococo <mikel...@gmail.com> wrote:
> Hi Folks,
>

> I'm experiencing slow event processing from barnyard2 on a host that is
> connected to its mysql server via a high-bandwidth but high-latency
> transatlantic link (roundtrips of 0.2-0.3 seconds). I strongly suspect that
> mysql roundtrip delays are causing barnyard to process events at about 1-2
> events per second (compared to hundreds of events per second for by2
> instances on a lan close to the mysql server). My questions are:
>
> 1) Does by2 have any instrumentation that would help me confirm or deny
> this suspicion?

Not internally, but you could modify spo_database to measure the time
it takes to execute each query and report it to you.
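
To illustrate the kind of per-query timing suggested here, below is a minimal sketch in Python rather than a patch to spo_database. The `timed_execute` helper, the `query_stats` table, and the in-memory SQLite database are all assumptions made for the example; barnyard2 has no such function, and a real measurement would have to wrap the MySQL calls inside the output plugin.

```python
import sqlite3
import time
from collections import defaultdict

# Hypothetical bookkeeping: wall-clock totals per statement type.
query_stats = defaultdict(lambda: {"count": 0, "seconds": 0.0})

def timed_execute(conn, sql, params=()):
    """Run one statement and record how long it took, keyed by its first keyword."""
    label = sql.split(None, 1)[0].upper()
    start = time.perf_counter()
    cursor = conn.execute(sql, params)
    elapsed = time.perf_counter() - start
    query_stats[label]["count"] += 1
    query_stats[label]["seconds"] += elapsed
    return cursor

# Stand-in database; over a WAN link each call would include the network round trip.
conn = sqlite3.connect(":memory:")
timed_execute(conn, "CREATE TABLE event (cid INTEGER, signature INTEGER)")
timed_execute(conn, "INSERT INTO event VALUES (?, ?)", (1, 42))
timed_execute(conn, "INSERT INTO event VALUES (?, ?)", (2, 42))
timed_execute(conn, "SELECT cid FROM event WHERE signature = ?", (42,))

for label, stats in sorted(query_stats.items()):
    print(f"{label}: {stats['count']} queries, {stats['seconds'] * 1000:.3f} ms total")
```

If the per-query time measured on the sensor is close to the link's round-trip time, that would support the suspicion that the network, not the database, is the bottleneck.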

> 2) Has anyone successfully run by2 outputting to mysql over a
> high-latency link?
> 3) Has anyone tried to do so and failed?

Yes but it would require customization on your end.

> 4) Do you have any suggestions for increasing throughput besides
> standing up another db-server close to the sensor? For example, is
> it possible to get barnyard2 to use multiple parallel database
> insert threads? There's plenty of bandwidth left on the link, it's
> just the long roundtrip times that are a problem.
>
> Additional details that may prove interesting to folks considering this
> question:
>
> * I first observed that unified2 files were building up in Snort's log
> directory even through Barnyard is set to archive files to a separate
> directory once they're processed.

- How do you specify the archive directory? Via the command line?
- Do you have more than one snort process writing to the directory?

The archive mechanism works as it always has: when a file has been
processed, it is moved to the archive directory.

> * I then confirmed that I was getting out of date timestamped events
> from this sensor inserted into the database, and that the delay was
> growing over time.
> * I stopped by2, deleted its waldo file, deleted the unified2 file that
> it was processing, and restarted by2. The timestamps jumped up to a
> more recent value corresponding to when the new u2 file was created,
> but events were still were coming in slowly (about 1/sec) in spite of
> the fact that several unified2 log files were queued up for
> processing. Additionally, the timestamp gap was still growing over
> time... aka by2 was falling further behind because snort was
> generating events faster than they were being inserted into the
> database.

This situation is known as "backlog": if spo_database can't push
data to the database fast enough, it will fall behind.

> * I checked the disk IO/CPU on the db server, which is minimal. It's a
> pretty beefy system that isn't loaded anywhere near capacity.

Well, I'm not an expert on MySQL, but I am sure there are a lot of tweaks
that can be applied on the server side to make it use more resources so
it ends up being faster. If you use the default configuration on a system
that could technically handle a lot more, then your database could be
"underconfigured", and this could slow down some queries, which would
add up to latency as the database gets bigger and dirtier (dead
tuples/pages/indexes, etc.).

> * I ran a tcpdump and confirmed that it takes a second or more to
> complete the 7 tcp roundtrips that comprise an insert (BEGIN, SELECT
> from signature, INSERT into event, INSERT into tcp/udp/icmp-hdr,
> INSERT into iphdr, INSERT into data, COMMIT) and that queries appear
> to be sequential with no overlap that indicates multiple database
> insert threads. I generate more than 1 event per second on this
> sensor, so that rate isn't fast enough to keep up.

spo_database.c is in a state where it requires a lot of interaction
with the database to insert a single event, and it is impossible to
make those queries non-sequential.

This ultimately results in an almost unavoidable backlog state if you
have high network latency, a lot of events, and especially if some of
the queries take time to execute.

The way barnyard is conceived, it is not possible for now to have
multiple threads inserting into the database, since only one event at a
time is read from the spooler and sent to the respective output
plugin.

Having multiple "threads" sending multiple events to multiple instances
of the same output plugin would not guarantee performance.

Other areas are much more prone to improvement than threading, like the
database schema itself.

While a new database schema is being worked on, expect a new
database plugin to be created so that backward compatibility is
maintained with applications still using the old schema.

Since a new database schema affects more than just barnyard2, it's not
possible for the moment to give you an effective timeframe for when it
will take a public form.

Hopefully this has enlightened you.

If you have more questions/comments, do not hesitate.

-elz

Mike Lococo

Sep 30, 2011, 11:36:37 AM

to barnyar...@googlegroups.com

beenph wrote:
> Hi mike, have you tried my db reliability branch?
> Its still work in progress but mysql support is working.
> Would you be able to test it?
>
> https://github.com/binf/barnyard2/tree/dbConnectionReliability

I'm happy to test if it might help. Does it have any specific changes
that might address the latency issue?

>> Additional details that may prove interesting to folks considering this
>> question:
>>
>> * I first observed that unified2 files were building up in Snort's log
>> directory even through Barnyard is set to archive files to a separate
>> directory once they're processed.
>

> The archive mechanism works as it allways has been working thus, when
> one file is processed it is moved to the archive directory.

I understand that this is expected behavior in a backlog state, I just
mention it as the anomaly I first observed that alerted me to the backlog.

>> * I checked the disk IO/CPU on the db server, which is minimal. It's a
>> pretty beefy system that isn't loaded anywhere near capacity.
>
> Well im not an expert on MySQL but i am sure there is alot of tweaks that
> can be applied on the server side to make it uses more ressource so it
> would end up being faster.

While I'm no expert, I'm fairly familiar with MySQL performance
monitoring and config-tweaking. I think the quickest statistic to
illustrate my point is that the roundtrip time for a syn -> syn/ack for
these two hosts is 208 milliseconds, and the database queries are
returning in 210 milliseconds, so the latency caused by the network is
100x that caused by the database.

Also, barnyard instances local to this DB spike to moderately high event
rates without a db issue.

> spo_database.c is in a state where its requires alot of interaction
> with the database to insert an single event and it is impossible
> make them non sequential.
>
> This ultimately result in almost unavoidable backlog state if you
> have a high network latency, alot of event and especialy if some of
> the queries take time to execute.
>
> The way barnyard is conseived it is not possible for now to have
> multiple threads inserting in the database since only one event at a
> time is read from the spooler and sent to the respective output
> plugin.
>
> Having multiple "threads" sending multiple events to multiple instance
> of the same output plugin would not garantee performance.
>
> Other area are mutch prone to improvement than threading, like the
> database schema it self.

Database schema fixes don't get you very far, though. Even given a
perfect schema where an event can be inserted in a single roundtrip,
without parallelization by2 maxes out at the following rates over a wan
link (currently the max is 5-8x lower than these figures because many
roundtrips are needed for each insert):

80-100 ms -> 10-12.5 eps     - Typical latency for US east-coast to
                               US west-coast or to western Europe
200 ms    -> 5 eps           - Typical latency for US east-coast to
                               Asia or Australia
300+ ms   -> 3 or fewer eps  - Typical latencies to locations with
                               poor-quality internet infrastructure
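
The arithmetic behind these figures can be sketched as follows. This is only a back-of-the-envelope model, assuming strictly serial round trips and negligible query execution time; the function name `max_events_per_second` is made up for the example, and the count of 7 round trips per event is taken from the tcpdump observation earlier in the thread.

```python
def max_events_per_second(rtt_seconds, roundtrips_per_event=1):
    """Upper bound on the serial insert rate when every round trip must
    complete before the next one starts."""
    if rtt_seconds <= 0 or roundtrips_per_event <= 0:
        raise ValueError("rtt_seconds and roundtrips_per_event must be positive")
    return 1.0 / (rtt_seconds * roundtrips_per_event)

for rtt_ms in (80, 100, 200, 300):
    rtt = rtt_ms / 1000.0
    ideal = max_events_per_second(rtt)                            # one round trip per event
    current = max_events_per_second(rtt, roundtrips_per_event=7)  # observed insert sequence
    print(f"{rtt_ms} ms: {ideal:.1f} eps with 1 roundtrip, {current:.2f} eps with 7")
```

Under this model the rate is bounded by latency alone, which is why spare bandwidth on the link does not help.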

The only feasible solution that I can imagine to achieve higher event
rates over wan links is to have some framework for initiating multiple
inserts in parallel, either by:

- Having multiple instances of whatever is doing the inserting, with a
queue accepting events from the spooler and dispatching them
round-robin.
- Batching events and sending them in groups that share the delay of a
single roundtrip... or a single group of roundtrips that is O(1) with
respect to the number of events in the batch.
- Use the network as a queue by not waiting for responses before
sending new requests.
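
The first option above can be sketched roughly as follows. This is a toy simulation, not barnyard2 code: `fake_insert` is a hypothetical stand-in that sleeps once per round trip instead of talking to a database, the round-trip time is scaled down so the example runs quickly, and a thread pool's internal work queue takes the place of a strict round-robin dispatcher.

```python
import time
from concurrent.futures import ThreadPoolExecutor

RTT_SECONDS = 0.005       # scaled-down stand-in for a ~200 ms WAN round trip
ROUNDTRIPS_PER_EVENT = 7  # BEGIN, SELECT, four INSERTs, COMMIT

def fake_insert(event_id):
    """Hypothetical insert: waits out each round trip, then reports the event as stored."""
    for _ in range(ROUNDTRIPS_PER_EVENT):
        time.sleep(RTT_SECONDS)
    return event_id

def insert_all(event_ids, workers):
    """Hand events to `workers` concurrent inserters; return (results, elapsed seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(fake_insert, event_ids))
    return results, time.perf_counter() - start

events = list(range(16))
serial_results, serial_time = insert_all(events, workers=1)
parallel_results, parallel_time = insert_all(events, workers=8)
print(f"1 worker:  {len(events) / serial_time:.1f} eps")
print(f"8 workers: {len(events) / parallel_time:.1f} eps")
```

Because the workers spend nearly all their time waiting on the network, throughput in this model scales roughly with the number of inserters. One caveat: with concurrent inserters, events may reach the database out of order, so whatever assigns event ids has to tolerate that.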

I'm actually quite surprised that no one has tried to operate by2 over a
wan-link yet, but it strikes me as a valuable feature. In the meantime,
if the db reliability branch doesn't address high-latency throughput
I'll have to stand up a database at the site and then copy events back
to my primary database in batches, which feels awfully ugly.

Let me know if you'd consider contracting to develop this as a feature.
I can't promise that my purchasing group is agile enough to understand
an open-source feature-development contract... but something phrased as
a support-contract might work. If you're open to such an idea, ping me
off-list with a pricing range that would make it worth your effort and
I'll start shopping the idea around my organization. We have contracts
with companies associated with a few other OSS projects we use and it's
generally worked out well for both sides.

Cheers,
Mike Lococo

beenph

Sep 30, 2011, 6:24:31 PM

to barnyar...@googlegroups.com

On Fri, Sep 30, 2011 at 11:36 AM, Mike Lococo <mikel...@gmail.com> wrote:
> beenph wrote:
>>
>> Hi mike, have you tried my db reliability branch?
>> Its still work in progress but mysql support is working.
>> Would you be able to test it?
>>
>> https://github.com/binf/barnyard2/tree/dbConnectionReliability
>
> I'm happy to test if it might help. Does it have any specific changes that
> might address the latency issue?
>

It should help.

>>> Additional details that may prove interesting to folks considering this
>>> question:
>>>
>>> * I first observed that unified2 files were building up in Snort's log
>>> directory even through Barnyard is set to archive files to a separate
>>> directory once they're processed.
>>
>> The archive mechanism works as it allways has been working thus, when
>> one file is processed it is moved to the archive directory.
>
> I understand that this is expected behavior in a backlog state, I just
> mention it as the anomaly I first observed that alerted me to the backlog.
>

Using the default database schema/mode of operation, backlog can happen
in many contexts, and a "slow link" is just one of the many places where
it can happen.

>>> * I checked the disk IO/CPU on the db server, which is minimal. It's a
>>> pretty beefy system that isn't loaded anywhere near capacity.
>>
>> Well im not an expert on MySQL but i am sure there is alot of tweaks that
>> can be applied on the server side to make it uses more ressource so it
>> would end up being faster.
>
> While I'm no expert, I'm fairly familiar with MySQL performance monitoring
> and config-tweaking. I think the quickest statistic to illustrate my point
> is that the roundtrip time for a syn -> syn/ack for these two hosts is 208
> milliseconds, and the database queries are returning in 210 milliseconds, so
> the latency caused by the network is 100x that caused by the database.
>

This is one of the issues addressed by the patch: since the connection
will stay open, the only round trip you should experience is the
query's execution time.

> Also barnyards local to this DB spike at moderately high event rates without
> a db issue.
>
>> spo_database.c is in a state where its requires alot of interaction
>> with the database to insert an single event and it is impossible
>> makethem non sequential.
>>
>> This ultimately result in almost unavoidable backlog state if you
>> have a high network latency, alot of event and especialy if some of
>> the queries take time to execute.
>>
>> The way barnyard is conseived it is not possible for now to have
>> multiple threads inserting in the database since only one event at a
>> time is read from the spooler and sent to the respective output
>> plugin.
>>
>> Having multiple "threads" sending multiple events to multiple instance
>> of the same output plugin would not garantee performance.
>>
>> Other area are mutch prone to improvement than threading, like the
>> database schema it self.
>
> Database schema fixes don't get you very far, though. Even given a perfect
> schema where an event can be inserted in a single roundtrip, without
> parallelization by2 maxes out at the following rates over a wan link
> (currently the max is 5-8x lower than these figures because many roundtrips
> are needed for each insert):
>
>

> The only feasible solution that I can imagine to achieve higher event rates
> over wan links is to have some framework for initiating multiple inserts in
> parallel, either by:
>

I would like to refute your claim that the schema can't help improve the
basic insertion capacity.

Inserting signature text, inserting references, inserting classifications,
and manually handling the event id (cid) are all things that can slow down
the insertion process, for example.

Parallel insertion is not really possible or effective as a solution
if the schema is not handling the event id (cid) using an internal
sequence (database sequences).
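
The contrast between client-managed and database-managed event ids can be sketched like this. The tables below are simplified and hypothetical (the real schema keys events on more than a single cid column), SQLite stands in for MySQL, and the `SELECT MAX(cid)` approach is just one possible way a client might manage ids; it is not a description of how spo_database does it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified tables for illustration only.
conn.execute("CREATE TABLE event_manual (cid INTEGER NOT NULL UNIQUE, signature INTEGER)")
conn.execute("CREATE TABLE event_seq (cid INTEGER PRIMARY KEY AUTOINCREMENT, signature INTEGER)")

def insert_manual(conn, signature):
    """Client-managed id: costs an extra query, and two concurrent writers
    could compute the same next cid."""
    (next_cid,) = conn.execute(
        "SELECT COALESCE(MAX(cid), 0) + 1 FROM event_manual"
    ).fetchone()
    conn.execute(
        "INSERT INTO event_manual (cid, signature) VALUES (?, ?)", (next_cid, signature)
    )
    return next_cid

def insert_sequenced(conn, signature):
    """Database-managed id: a single statement, and the database returns the cid."""
    cursor = conn.execute("INSERT INTO event_seq (signature) VALUES (?)", (signature,))
    return cursor.lastrowid

manual_ids = [insert_manual(conn, 42) for _ in range(3)]
seq_ids = [insert_sequenced(conn, 42) for _ in range(3)]
print("client-managed ids:  ", manual_ids)
print("database-managed ids:", seq_ids)
```

When the database assigns the id, each inserter needs no knowledge of what the others are doing, which is the property parallel insertion depends on.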

> - Having multiple instances of whatever is doing the inserting, with a
> queue accepting events from the spooler and dispatching them
> round-robin.
> - Batching events and sending them in groups that share the delay of a
> single roundtrip... or a single group of roundtrips that is O(1) with
> respect to the number of events in the batch.
> - Use the network as a queue by not waiting for responses before
> sending new requests.
>

> I'm actually quite surprised that no one has tried to operate by2 over a
> wan-link yet, but it strikes me as a valuable feature. In the meantime, if
> the db reliability branch doesn't address high-latency throughput I'll have
> to stand up a database at the site and then copy events back to my primary
> database in batches, which feels awfully ugly.
>

Oh, I'm pretty sure people have done it (raise hands), but some might
have changed some code or developed their own output plugin, for
obvious reasons.

The reliability patch is written to stop the bleeding on some corners
of the output plugin,