Nimrod overwhelmed by rate of ingest

ad...@metabroadcast.com

Feb 19, 2013, 4:44:37 AM
to nimro...@googlegroups.com
Hi,

We're finding that Nimrod cannot cope with calculating historical percentiles over a small window like 5 minutes when the rate of log ingest is high. We currently have a service split across 3 servers, each of which is logged and tracked independently in Nimrod; these per-server endpoints operate as expected, without any gaps in data or failed queries in Nimrod. Per server, the peak rate of Nimrod-formatted log lines is around 72 per second. However, when these logs are aggregated into Nimrod so that they appear under a single endpoint, things begin to fail once the rate of ingest reaches around 150 per second.

In the nimrod application log file we see the following stack traces:
2013-02-19 07:49:01,268 ERROR nimrod.core.metric - integrity constraint violation: unique constraint or index violation; SYS_PK_10065 table: METRICS
java.sql.SQLIntegrityConstraintViolationException: integrity constraint violation: unique constraint or index violation; SYS_PK_10065 table: METRICS
        at org.hsqldb.jdbc.Util.sqlException(Unknown Source)
        at org.hsqldb.jdbc.Util.sqlException(Unknown Source)
        at org.hsqldb.jdbc.JDBCPreparedStatement.fetchResult(Unknown Source)
        at org.hsqldb.jdbc.JDBCPreparedStatement.executeUpdate(Unknown Source)
        at clojure.java.jdbc$do_prepared_return_keys$fn__590.invoke(jdbc.clj:548)
        at clojure.java.jdbc$transaction_STAR_.invoke(jdbc.clj:384)
        at clojure.java.jdbc$do_prepared_return_keys.invoke(jdbc.clj:548)
        at clojure.lang.AFn.applyToHelper(AFn.java:163)
        at clojure.lang.AFn.applyTo(AFn.java:151)
        at clojure.core$apply.invoke(core.clj:603)
        at clojure.java.jdbc$insert_values.doInvoke(jdbc.clj:578)
        at clojure.lang.RestFn.invoke(RestFn.java:442)
        at clojure.java.jdbc$update_or_insert_values$fn__603.invoke(jdbc.clj:639)
        at clojure.java.jdbc$transaction_STAR_.invoke(jdbc.clj:384)
        at clojure.java.jdbc$update_or_insert_values.invoke(jdbc.clj:636)
        at nimrod.core.store.DiskStore$fn__912$fn__913.invoke(store.clj:115)
        at clojure.java.jdbc$transaction_STAR_.invoke(jdbc.clj:372)
        at nimrod.core.store.DiskStore$fn__912.invoke(store.clj:114)
        at clojure.java.jdbc$with_connection_STAR_.invoke(jdbc.clj:302)
        at nimrod.core.store.DiskStore.set_metric(store.clj:113)
        at nimrod.core.metric$compute_metric.invoke(metric.clj:138)
        at nimrod.log.processor$process.invoke(processor.clj:37)
        at nimrod.log.tailer$create_tailer$fn__1175.invoke(tailer.clj:20)
        at nimrod.log.tailer.proxy$tayler.TailerListenerAdapter$0.handle(Unknown Source)

Would changing the history window to something smaller, like 60 seconds, help with this issue, or is the problem coming from elsewhere?

Regards,

Adam Horwich

Sergio Bossa

Feb 19, 2013, 6:51:13 AM
to nimro...@googlegroups.com, nimro...@googlegroups.com
Hi Adam,

Nimrod can definitely sustain good read/write performance at thousands of updates per second.

So the problem is either:
1) The constraint violation, which leaves the database in a somewhat corrupted state: how many of those do you see? How often?
2) The dataset in the 5-minute aggregation: how large is it?

Please note that if you see #1 often, I strongly suggest reverting to a working backup ASAP.

Cheers,

Sergio Bossa
Sent by iPhone

ad...@metabroadcast.com

Feb 19, 2013, 7:56:06 AM
to nimro...@googlegroups.com
Hey Sergio,

My concern about rates and sampling is that we are seeing some log endpoints and all their gauges operate uninterrupted, while for others every gauge in the failing endpoint stops. Overall update rates peak at around 250 per second, possibly higher.

I've checked for constraint violations over the last 3 weeks. We had the odd violation around the 1st of Feb, but only a few. They next occurred, and became more frequent, from the 11th of Feb, when the DB defrag process took 10+ hours and was interrupted as a result. Is it possible that some corruption was introduced there?

Aggregations for our most heavily used endpoints are around 20,000 data points for the 5-minute window.
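
For a sense of scale, this is roughly the kind of computation involved per window (a purely illustrative nearest-rank percentile sketch, nothing to do with Nimrod's actual aggregation code):

;; Illustrative only: nearest-rank percentile over one window of samples.
(defn percentile [values p]
  (let [sorted (vec (sort values))
        n      (count sorted)
        idx    (max 0 (dec (long (Math/ceil (* (/ p 100.0) n)))))]
    (nth sorted idx)))

;; e.g. over a 5-minute window of ~20,000 samples:
;; (let [window (repeatedly 20000 rand)]
;;   {:p50 (percentile window 50)
;;    :p95 (percentile window 95)
;;    :p99 (percentile window 99)})
;; Each call sorts the full window, which is why large windows under heavy
;; ingest get expensive.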

Adam.

ad...@metabroadcast.com

Feb 19, 2013, 11:59:21 AM
to nimro...@googlegroups.com
Incidentally, we tested with a clean database and the same volume of metrics. So far we have not seen any gaps in Nimrod API availability; however, we are still frequently seeing the integrity constraint violations.

Regards,

Adam

ad...@metabroadcast.com

Feb 19, 2013, 1:37:34 PM
to nimro...@googlegroups.com
Sadly, that too has started showing prolonged periods of API unavailability for certain log endpoints. All on a clean DB :(

Sergio Bossa

Feb 19, 2013, 2:36:13 PM
to nimro...@googlegroups.com
So let me try to recap:

1) The old database shows exceptions and slow aggregation.
2) The new database has no performance problems but still shows the same exceptions.

Is that correct?


Sergio Bossa
Sent by iPhone

Adam Horwich

Feb 19, 2013, 2:55:48 PM
to nimro...@googlegroups.com
After about 3 hours of running without issues, the same problem of prolonged request failures recurred, and aggregation is still pretty slow.

Sergio Bossa

Feb 19, 2013, 6:22:23 PM
to nimro...@googlegroups.com, nimro...@googlegroups.com
Did anything change since the last time things were working smoothly?

Sergio Bossa
Sent by iPhone

Adam Horwich

Feb 19, 2013, 6:46:27 PM
to nimro...@googlegroups.com
Well, the only thing to change noticeably is the volume of logs going into Nimrod, after we identified and resolved an Apache log parsing bottleneck ;)

Other changes include:
- The use of the Apache timestamp for the log insertion time to Nimrod (i.e. metrics may be submitted to Nimrod out of time sequence)
- The volume of API requests increased to generate metrics for Graphite
- An upgrade to 0.6.0

Sergio Bossa

Feb 20, 2013, 12:55:49 AM
to nimro...@googlegroups.com
On Tue, Feb 19, 2013 at 11:46 PM, Adam Horwich <ad...@metabroadcast.com> wrote:

> The use of the apache timestamp for log insertion time to nimrod (i.e.
> metrics may be submitted to nimrod out of time sequence)

Could you elaborate on this? How was it working before, and how does
it work now?

--
Sergio Bossa
http://www.linkedin.com/in/sergiob

Adam Horwich

Feb 20, 2013, 4:06:55 AM
to nimro...@googlegroups.com
Before, our script was generating a timestamp at execution time and using that when submitting to Nimrod, which was obviously wrong.

Since Nimrod requires log timestamps to millisecond precision, and Apache log times are to the nearest second, we simply append three zeros to the end of the epoch seconds. Does this cause a problem for Nimrod if the same metric ID is submitted with the same timestamp multiple times?
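
For clarity, the conversion itself is trivial; a minimal sketch (a hypothetical helper in our log pre-processing, not anything inside Nimrod):

;; Hypothetical helper from our log pre-processing (not part of Nimrod):
;; Apache logs give epoch seconds, Nimrod expects epoch milliseconds,
;; so we simply multiply by 1000 (i.e. append three zeros).
(defn apache-seconds->nimrod-millis [epoch-seconds-str]
  (* 1000 (Long/parseLong epoch-seconds-str)))

;; (apache-seconds->nimrod-millis "1361262277") => 1361262277000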

Sergio Bossa

Feb 20, 2013, 9:27:18 AM
to nimro...@googlegroups.com
On Wed, Feb 20, 2013 at 9:06 AM, Adam Horwich <ad...@metabroadcast.com> wrote:

> Does this cause a problem for nimrod if the same metric id is submitted with
> the same timestamp multiple times?

Nope.

Could you post your nimrod.conf?

Sergio Bossa

Feb 21, 2013, 11:34:17 AM
to nimro...@googlegroups.com
Hi Adam,

did you sort this out?

I'll try to reproduce a similar scenario in terms of ingestion rate.
In the meantime, it would be helpful if you could attach your
nimrod.conf and /system/stats output under load.

Adam Horwich

Feb 21, 2013, 1:09:03 PM
to nimro...@googlegroups.com
Hi Sergio,

We're still seeing intermittent unresponsive endpoints and a significant number of integrity constraint violation exceptions in the logs, even on a fresh DB. The rate of ingest is highly variable for our use case, so the output of /system/stats fluctuates significantly. During a relatively calm period, sampling over a minute, the average is somewhere around 600 store operations per second, but I've seen it reach over 3,000. The troubling peaks occur roughly every 2 hours, and there we regularly see store operations above 2,000.
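
For reference, this is roughly how we sample the stats to get the figures above (the host and port are placeholders for our deployment, not Nimrod defaults):

;; Placeholder host/port for our deployment; adjust as needed.
(defn system-stats []
  (slurp "http://nimrod-host:8000/system/stats"))

;; We poll this about once a minute and compare successive snapshots to
;; derive the per-second averages quoted above.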

nimrod.conf:
..
store {
    type : disk
    path : /data/nimrod
    options {
        "cache.entries" : 1000
        "cache.results" : 5000000
        "defrag.limit" : 10
    }
    sampling {
        "requests.frequency" : 25000
        "requests.factor" : 10
        "other-requests.frequency" : 25000
        "other-requests.factor" : 10
        "more-things.frequency" : 0
        "more-things.factor" : 1
        "even-more-things.frequency" : 0
        "even-more-things.factor" : 1
    }
}
server {
    max-busy-requests : 100
}

Regards,

Adam

-- 
Adam Horwich
Sent with Sparrow

Sergio Bossa

Feb 22, 2013, 1:29:50 AM
to nimro...@googlegroups.com
Thanks Adam, I'm having a look at this and trying to reproduce.
One more question: do you introduce many new metric names during the day? Or are they almost always the same (that is, only the metric values change)?

Adam Horwich

Feb 22, 2013, 4:43:26 AM
to nimro...@googlegroups.com
It's rare that new metric names appear in the list in the course of a day. We did, however, add individual web server monitoring a few weeks ago, which roughly doubled the number of metric names. Those endpoints do not show any interruption in Nimrod, though. It's only the one that receives the bursty pattern of metrics every 2 hours that seems not to cope.

Adam

-- 
Adam Horwich
Sent with Sparrow

Sergio Bossa

Feb 22, 2013, 12:03:45 PM
to nimro...@googlegroups.com
Hi Adam,

I may have reproduced your problem: could you send me a thread dump
taken during the blocked aggregate requests?

ad...@metabroadcast.com

Feb 22, 2013, 12:13:18 PM
to nimro...@googlegroups.com
Hey Sergio, I was literally just coming back to this thread to mention that restoring from an old backup did not help. We can see on /system/stats that metrics are being ingested, and the /metric_name endpoint is updating with the most recent metric value, but no historical percentiles are being calculated. I've attached the thread dump output.

Adam
nimrod-threads.txt

Sergio Bossa

Feb 22, 2013, 1:06:29 PM
to nimro...@googlegroups.com
Could you attach a few more dumps?
Also, is the CPU constantly busy by any chance?


Sergio Bossa
Sent by iPhone
<nimrod-threads.txt>

ad...@metabroadcast.com

Feb 25, 2013, 6:44:15 AM
to nimro...@googlegroups.com
CPU is not constantly busy on the Nimrod host, but it is shared with other services. I've attached another thread dump, although Nimrod is currently defragging its DB and has been for the last 10 hours.

Adam
nimrod-threads2.txt

Sergio Bossa

Feb 26, 2013, 5:20:52 AM
to nimro...@googlegroups.com
On Mon, Feb 25, 2013 at 11:44 AM, <ad...@metabroadcast.com> wrote:

> Nimrod is currently defragging its DB and has been for the last 10 hours.

So are Nimrod defrags lasting that long even with a defrag limit set at 10%?

Adam Horwich

Feb 26, 2013, 5:31:01 AM
to nimro...@googlegroups.com
That was the first defrag after recovering from a backup (so there was a lot of fragmented data), after we had observed the same problem with a blank DB, but that's an aside. Any further developments on the intermittent historical endpoint behaviour?

Adam

-- 
Adam Horwich
Sent with Sparrow

Sergio Bossa

Feb 27, 2013, 5:35:17 PM
to nimro...@googlegroups.com
On Tue, Feb 26, 2013 at 10:31 AM, Adam Horwich <ad...@metabroadcast.com> wrote:

> Any further developments on the intermittent historical
> endpoint behaviour?

I've tracked the problem down to a database bug, at least that's what it looks like.
I'll submit a bug report to hsqldb (too busy to fix it myself) and find a workaround in Nimrod; I'll keep you posted.

Adam Horwich

Feb 27, 2013, 5:39:07 PM
to nimro...@googlegroups.com
Thaaaanks Sergio! :)



Sergio Bossa

Mar 3, 2013, 4:29:00 PM
to nimro...@googlegroups.com
Hi Adam,

I tracked the problem down to some hsqldb limitations under heavy write loads, and overcame it with some fixes you can find in the wip-hsqldb-bugs branch.
I was able to consistently reproduce both the integrity violation and the hanging queries, but in that branch everything now works fine all the time, so it should be OK.
Let me know if it works for you too, so I can port those fixes into the main branch and make a release ASAP.

Cheers,

Sergio B.

Adam Horwich

Mar 4, 2013, 12:46:36 PM
to nimro...@googlegroups.com
Hi Sergio,

I've just been testing the new build. It sat there for 15 minutes doing nothing and then gave an OOM:

2013-03-04 17:35:32,556 ERROR nimrod.core.startup - java.lang.OutOfMemoryError: Java heap space
java.sql.SQLException: java.lang.OutOfMemoryError: Java heap space
        at org.hsqldb.jdbc.Util.sqlException(Unknown Source)
        at org.hsqldb.jdbc.Util.sqlException(Unknown Source)
        at org.hsqldb.jdbc.JDBCPreparedStatement.fetchResult(Unknown Source)
        at org.hsqldb.jdbc.JDBCPreparedStatement.executeQuery(Unknown Source)
        at clojure.java.jdbc$with_query_results_STAR_.invoke(jdbc.clj:672)
        at nimrod.core.store.DiskStore$fn__890$fn__898$fn__899.invoke(store.clj:87)
        at clojure.java.jdbc$with_query_results_STAR_.invoke(jdbc.clj:673)
        at nimrod.core.store.DiskStore$fn__890$fn__898.invoke(store.clj:81)
        at clojure.java.jdbc$transaction_STAR_.invoke(jdbc.clj:372)
        at nimrod.core.store.DiskStore$fn__890.invoke(store.clj:80)
        at clojure.java.jdbc$with_connection_STAR_.invoke(jdbc.clj:302)
        at nimrod.core.store.DiskStore.init(store.clj:73)
        at nimrod.core.store$new_disk_store.invoke(store.clj:283)
        at nimrod.core.setup$setup_store.invoke(setup.clj:22)
        at nimrod.core.setup$setup.invoke(setup.clj:32)
        at nimrod.core.startup$_main.invoke(startup.clj:7)
        at clojure.lang.AFn.applyToHelper(AFn.java:159)
        at clojure.lang.AFn.applyTo(AFn.java:151)
        at nimrod.core.startup.main(Unknown Source)
Caused by: org.hsqldb.HsqlException: java.lang.OutOfMemoryError: Java heap space
        at org.hsqldb.error.Error.error(Unknown Source)
        at org.hsqldb.result.Result.newErrorResult(Unknown Source)
        at org.hsqldb.StatementDMQL.execute(Unknown Source)
        at org.hsqldb.Session.executeCompiledStatement(Unknown Source)
        at org.hsqldb.Session.execute(Unknown Source)
        ... 17 more
Caused by: java.lang.OutOfMemoryError: Java heap space
        at org.hsqldb.RowAVL.setNewNodes(Unknown Source)
        at org.hsqldb.RowAVL.<init>(Unknown Source)
        at org.hsqldb.persist.RowStoreAVLHybrid.getNewCachedObject(Unknown Source)
        at org.hsqldb.navigator.RowSetNavigatorDataTable.add(Unknown Source)
        at org.hsqldb.navigator.RowSetNavigatorDataTable.<init>(Unknown Source)
        at org.hsqldb.QuerySpecification.buildResult(Unknown Source)
        at org.hsqldb.QuerySpecification.getSingleResult(Unknown Source)
        at org.hsqldb.QuerySpecification.getResult(Unknown Source)
        at org.hsqldb.StatementQuery.getResult(Unknown Source)
        ... 20 more

I've increased the max heap from 2.5 to 2.9 GB, but I'm somewhat concerned that this is a result of our large dataset.

Regards,

Adam Horwich

-- 
Adam Horwich
Sent with Sparrow

Sergio Bossa

Mar 5, 2013, 4:42:52 AM
to nimro...@googlegroups.com
On Mon, Mar 4, 2013 at 5:46 PM, Adam Horwich <ad...@metabroadcast.com> wrote:

> I've just been testing the new build. It sat there for 15 minutes doing
> nothing and then gave an OOM:

That must be due to some non-optimized queries I had to change to overcome the hsqldb problems: I'll take a look and get back to you.

Adam Horwich

Mar 8, 2013, 6:36:46 AM
to nimro...@googlegroups.com
Hi Sergio,

Have you had any time to progress with this? We've reverted for now. We're thinking of working around the large-data-volume problem with monthly rotation of databases, but we still have the issue of the volume of insertions.

Adam

-- 
Adam Horwich
Sent with Sparrow

Sergio Bossa

Mar 11, 2013, 4:47:12 AM
to nimro...@googlegroups.com
Hi Adam,

sorry for the late response.

I'm testing a different implementation; I'll keep you posted later today.

ad...@metabroadcast.com

Mar 14, 2013, 6:06:01 AM
to nimro...@googlegroups.com
Hey Sergio,

Any progress on this?

Adam

Sergio Bossa

Apr 7, 2013, 8:06:15 AM
to nimro...@googlegroups.com
Hi Adam,

it took some time, but you should finally have a redesigned version of the disk store which should address the problems above, as well as provide more control over the defrag operation (see https://github.com/sbtourist/nimrod/issues/60).

Check it out in the https://github.com/sbtourist/nimrod/tree/wip-hsqldb-batcher branch and give it a try!

ad...@metabroadcast.com

Apr 8, 2013, 9:34:12 AM
to nimro...@googlegroups.com
Hi Sergio,

I'm still seeing serious problems reliably calling the aggregate endpoints. Based on the stack traces, we're seeing failures not just on aggregate but also on list metrics:

at nimrod.core.store.DiskStore.aggregate_history(store.clj:264)
...
at nimrod.core.store.DiskStore.list_metrics(store.clj:210)

We're using an optimised EBS volume with 1,000 IOPS. A defrag kicked in 15 minutes ago, and iostat shows lots of reads, but it doesn't actually look like anything is being written to the new defragged DB.

Adam

Sergio Bossa

Apr 8, 2013, 9:43:22 AM
to nimro...@googlegroups.com
On Mon, Apr 8, 2013 at 2:34 PM, <ad...@metabroadcast.com> wrote:
 
> I'm still seeing serious problems reliably calling the aggregate endpoints.

Are you seeing those problems building from the head of the wip-hsqldb-batcher branch?

> Based on the stack traces, we're seeing failures not just on aggregate but also on list metrics:

Do they hang or return any exceptions?

> A defrag kicked in 15 minutes ago, and iostat shows lots of reads, but it doesn't actually look like anything is being written to the new defragged DB.

Are you seeing problems only during defrag?

ad...@metabroadcast.com

Apr 8, 2013, 9:54:19 AM
to nimro...@googlegroups.com

> Are you seeing those problems building from the head of the wip-hsqldb-batcher branch?

Yep, I'm using the latest code on that branch.

> Do they hang or return any exceptions?

Currently the queries are returning 404 immediately after an initial pause when first running over a new time window. Things were fine initially when I started up, but they are now in this state. There are no stack traces generated by the queries.

> Are you seeing problems only during defrag?

No, these persist outside of defrag. I raised the defrag issue because it looked like odd behaviour; does the new mechanism do in-place defragmentation now instead of writing out a new file?

Adam. 

Sergio Bossa

Apr 8, 2013, 11:11:13 AM
to nimro...@googlegroups.com
On Mon, Apr 8, 2013 at 2:54 PM, <ad...@metabroadcast.com> wrote:

> Currently the queries are returning 404 immediately after an initial pause when first running over a new time window. Things were fine initially when I started 
> up but are now in this state. There are no stack traces generated from the queries.

You see 404 with list metrics queries too?!

If you see them with aggregate queries, it means Nimrod fell behind your 5m queries, which seems to point at performance problems with the new single-writer approach.

What's the "processed-metrics-per-second-per-file" system stat, and what's the actual logging rate (if you have a ballpark figure)?

Adam Horwich

Apr 8, 2013, 12:05:13 PM
to nimro...@googlegroups.com

> You see 404 with list metrics queries too?!
>
> If you see them with aggregate queries, it means Nimrod fell behind your 5m queries, which seems to point at performance problems with the new single-writer approach.
>
> What's the "processed-metrics-per-second-per-file" system stat, and what's the actual logging rate (if you have a ballpark figure)?

I'm not currently seeing 404s on the list metrics query (though I do get invalid JDBC argument errors while the defrag is active). However, with the default defrag settings, defrag kicks in every hour and takes about 25 minutes to complete (which is obviously impractical for a production environment), so it gets in the way of things.

I've also tried a simple history query without any parameters, and that 404s immediately. It's very difficult to know the actual logging rate since we're using real-world data, but the numbers hitting processed-metrics-per-second look to be in the right ballpark.

Adam

ad...@metabroadcast.com

Apr 19, 2013, 5:10:47 AM
to nimro...@googlegroups.com
Hi Sergio,

Any further insight on this? We're now permanently unable to use Nimrod at our current rate of metric ingestion and requests.

Regards,

Adam

Sergio Bossa

Apr 19, 2013, 6:54:56 AM
to nimro...@googlegroups.com
Hi Adam,

I'm working on yet another alternative solution to work around this hsqldb bug, but spare time is limited. In any case, I hope to get you a version to try out by Monday.

Cheers,

Sergio B.



Sergio Bossa

Apr 22, 2013, 3:47:05 AM
to nimro...@googlegroups.com
Hi Adam,

there is a new branch called wip-hsqldb-bugs-v3, which should both overcome the hsqldb concurrency problem and avoid the major slowdowns causing 404 errors: please give it a try, and I'll keep my fingers crossed in the meantime ;)

Adam Horwich

Apr 22, 2013, 5:39:02 AM
to nimro...@googlegroups.com
Good news! We have metrics again!!

I'll keep an eye on it over the next 24 hours to ensure it consistently records metrics.

Adam

-- 
Adam Horwich
Sent with Sparrow


ad...@metabroadcast.com

Apr 22, 2013, 8:07:00 AM
to nimro...@googlegroups.com
Yeah,

It was all good until the first defrag kicked in and completed (after 90 minutes). Once the defrag ended there seemed to be a large backlog of messages to process, and all aggregate endpoints are 404ing.

Store ops per second are happily reaching over 3,200, and iostat shows little iowait and little disk read/write while aggregate queries are being made.

Peeking at the non-aggregate endpoint, the timestamps of the metrics are not backlogged.

Regards,

Adam Horwich

Sergio Bossa

Apr 22, 2013, 8:24:43 AM
to nimro...@googlegroups.com
On Mon, Apr 22, 2013 at 1:07 PM, <ad...@metabroadcast.com> wrote:

> It was all good until the first defrag kicked in and completed (after 90 minutes).

Does that mean the defrag itself took 90 minutes? Or did it kick in after 90 minutes?

In the former case it explains the aggregate requests going 404 (the metrics collection itself fell behind due to the long defrag), but it leaves me puzzled, as this really shouldn't happen: did you try with a brand new database? If not, could you try that and report back?

Adam Horwich

Apr 22, 2013, 9:36:34 AM
to nimro...@googlegroups.com
The defrag of the existing 13 GB DB took 90 minutes, yes. I have since started with a fresh DB and it has successfully survived its first defrag, although that only lasted 25 seconds. I'll keep an eye on things for the next few days in case this behaviour only presents itself after the DB reaches a certain size.

Adam

-- 
Adam Horwich
Sent with Sparrow


Sergio Bossa

Apr 22, 2013, 9:41:57 AM
to nimro...@googlegroups.com
On Mon, Apr 22, 2013 at 2:36 PM, Adam Horwich <ad...@metabroadcast.com> wrote:
> The defrag of the existing 13 GB DB took 90 minutes, yes. I have since started with a fresh DB and it has successfully survived its first defrag, although that only lasted 25 seconds. I'll keep an eye on things for the next few days in case this behaviour only presents itself after the DB reaches a certain size.

So yes, that happened because the defrag previously triggered on the percentage of wasted space, which is not really deterministic and is pretty much proportional to the actual size of the database.
I've now implemented defrag differently, so it happens after a configurable number of metric updates (1 million by default): as a consequence, you should get constant, predictable and hopefully time-bound defrag delays.
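
Conceptually the new trigger is nothing more than a counter; here is an illustrative sketch of the idea (not the actual Nimrod code):

;; Illustrative sketch of the counter-based defrag trigger described above,
;; not the actual Nimrod implementation.
(def updates-since-defrag (atom 0))
(def defrag-every 1000000) ;; the 1 million updates default mentioned above

(defn record-update-and-maybe-defrag! [defrag-fn]
  (when (zero? (mod (swap! updates-since-defrag inc) defrag-every))
    (defrag-fn)))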

Let me know how it goes!

ad...@metabroadcast.com

Apr 25, 2013, 8:44:34 AM
to nimro...@googlegroups.com
Hi Sergio,

After 3 days with a fresh database, which has grown to 3.7 GB, we are now seeing the symptom of aggregate timeouts again. When a defrag kicks in, does a backlog of waiting log entries build up?

Adam

Sergio Bossa

Apr 25, 2013, 1:39:24 PM
to nimro...@googlegroups.com
On Thu, Apr 25, 2013 at 1:44 PM, <ad...@metabroadcast.com> wrote:
 
> After 3 days with a fresh database, which has grown to 3.7 GB, we are now seeing the symptom of aggregate timeouts again. When a defrag kicks in, does a backlog of waiting log entries build up?

Yes, defrag certainly causes an overall slowdown; again, I'll have to look into this and into possible improvements.