Socorro Date Ranges



Jun 5, 2008, 12:25:55 PM
Lots of details below, but it boils down to a few questions:
* What types of data are important to keep over time?
* How long do we need to keep it?
* What can we delete and when?

In our crash reporting system, we have different types of data and
while we are in the process of improving our maintenance scripts and
strategy, the issue of "how much data do we need" has resurfaced.

We store different types of data:
* report metadata (reports, dumps)
* module data attached to each report (1:100 cardinality, contains
loaded modules for the crash)
* top 10 frames (1:10 cardinality, contains meta for top to stacks/

The main problem with the Socorro database is the size of the
modules table. So far we have not deleted any information from the
database, but it is nearing critical mass.

To alleviate some of the size and speed issues, Lars and I are working
on auto-partitioning and auto-pruning scripts, as well as removing
non-literals from date-bound database queries (see Webtools->Socorro
0.5 in Bugzilla).
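
To make the partitioning and pruning idea concrete, here is a minimal sketch. It assumes weekly partitions named after their Monday start date, a column called "date", and a retention window passed in as a parameter since that number is still undecided. All names here are hypothetical and not taken from the Socorro code. The date bounds are rendered as literal constants, on the assumption that the planner can only skip partitions when it sees constants rather than expressions evaluated at run time.

```python
from datetime import date, timedelta
from typing import Iterable, List


def weekly_partition_name(table: str, day: date) -> str:
    """Hypothetical naming scheme: one partition per week, keyed by its Monday."""
    monday = day - timedelta(days=day.weekday())
    return f"{table}_{monday:%Y%m%d}"


def date_bound_query(table: str, start: date, end: date) -> str:
    """Build a query whose date bounds are literals computed in the application."""
    return (
        f"SELECT count(*) FROM {table} "
        f"WHERE date >= '{start.isoformat()}' AND date < '{end.isoformat()}'"
    )


def partitions_to_prune(
    table: str, partition_starts: Iterable[date], today: date, keep_days: int
) -> List[str]:
    """Return partitions whose whole week falls before the retention cutoff."""
    cutoff = today - timedelta(days=keep_days)
    return [
        weekly_partition_name(table, start)
        for start in sorted(partition_starts)
        if start + timedelta(days=7) <= cutoff
    ]
```

With a 90 day window and today set to 2008-06-05, a partition starting 2008-02-25 would be listed for pruning while one starting 2008-03-03 would be kept, because part of its week is still inside the window.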


When it comes to auto-pruning we are unsure of the sliding window of
availability we need to maintain. It's been 30 days, 90 days, 6
months -- I've gotten different answers. Now we need to decide on

Chris Hofmann

Jun 12, 2008, 4:29:24 PM
to Chris Hofmann

One of the things we considered in the past for pruning the database
is to remove "like" records in the database. Once we standardize on a
good system for setting up unique stack signatures we really don't
need to keep the full stack dump around too long. Where we need
statistical information we can just use a counter for the unique stack
signature, and where we need the full stack trace we can keep around
one copy of it to point to.
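
A rough sketch of what that could look like, assuming each report carries an id and a stack signature. The Report and SignatureSummary types and the function name are made up for illustration. The first report seen for a signature is kept as the example to point to, later ones only bump the counter and become candidates for pruning.

```python
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class Report(NamedTuple):
    report_id: str
    signature: str
    stack: str


class SignatureSummary(NamedTuple):
    count: int        # how many reports shared this signature
    example_id: str   # the one full report we keep around


def collapse_like_reports(
    reports: Iterable[Report],
) -> Tuple[Dict[str, SignatureSummary], Set[str]]:
    """Return per-signature summaries and the ids of reports that can be pruned."""
    summaries: Dict[str, SignatureSummary] = {}
    prunable: Set[str] = set()
    for report in reports:
        seen = summaries.get(report.signature)
        if seen is None:
            # First report with this signature becomes the kept example.
            summaries[report.signature] = SignatureSummary(1, report.report_id)
        else:
            # Later reports only increment the counter.
            summaries[report.signature] = seen._replace(count=seen.count + 1)
            prunable.add(report.report_id)
    return summaries, prunable
```

Keying the dictionary on (signature, platform) instead of signature alone would keep one example per platform.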

We wouldn't be able to do that right now, since so many of the stack
signatures cover widely varied stack dumps. I forget which bug it is,
but Ted was working on a patch for cleaning up the signatures to make
them more unique.


Dan Mosedale

Jun 12, 2008, 5:13:52 PM
IMO, the stack traces are vastly more useful than most of the rest of
the data. I'm with chofmann in that as long as we keep a single example
stack trace (per-platform, perhaps) indefinitely of each "stack
signature", we're in pretty good shape.



Jun 16, 2008, 4:02:40 PM
to morgamic
Hey There,

This isn't an immediate solution, but for the longer run it seems like
we have two basic types of queries:

1) Loading the specifics of a particular crash report
2) Doing queries across a series of reports (e.g. all crashes with stack
frame foobar).

Given we can partition the data not just by date but arbitrarily by
crash-report-id we could partition the longer term data into a series of
cheap distributed databases (where the application tier knows which
database to pull from based on crash-id). Then we could use something
like Hadoop to easily write aggregated queries across the distributed
set.
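
Roughly what I have in mind, as a toy sketch only: in-memory dicts stand in for the cheap databases, and the function names are invented for the example. The application tier hashes the crash id to pick a shard for single-report loads, and a cross-report query runs against each shard separately before the partial results are summed, which is the part a framework like Hadoop would distribute.

```python
import zlib
from typing import Dict, List

Shard = Dict[str, List[str]]  # crash id -> stack frames, one dict per database


def shard_for(crash_id: str, n_shards: int) -> int:
    """Pick a shard from the crash id with a stable hash (CRC32 here)."""
    return zlib.crc32(crash_id.encode("utf-8")) % n_shards


def store(shards: List[Shard], crash_id: str, frames: List[str]) -> None:
    shards[shard_for(crash_id, len(shards))][crash_id] = frames


def load(shards: List[Shard], crash_id: str) -> List[str]:
    """Query type 1: load one specific report from the shard that owns it."""
    return shards[shard_for(crash_id, len(shards))][crash_id]


def count_crashes_with_frame(shards: List[Shard], frame: str) -> int:
    """Query type 2: count per shard (map), then sum the partials (reduce)."""
    per_shard = (
        sum(1 for frames in shard.values() if frame in frames)
        for shard in shards
    )
    return sum(per_shard)
```

One catch with a plain modulo scheme is that changing the number of shards moves most crash ids to a different shard, so the shard count would need to be fixed up front or replaced with something smarter.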

Just an idea..


