Using mongodb as a time-series database

keith2011

Nov 8, 2011, 2:58:08 PM

to mongodb-user

We're considering using mongodb to store time-series data that is
collected from a variety of data acquisition systems. Here are some
characteristics of the data management system we are building:
* We have many test systems scattered around different sites within
our company.
* Each test system contains one or more data acquisition systems.
* Each data acquisition system samples analog and digital channels.
There can be as many as 10000 channels per test system.
* Channels are grouped into rate groups where all the channels are
sampled at the same time (or from the same multiplexed ADC). Sample
rates are typically 25 Hz - 1 kHz but some are 8-10 kHz.
* We expect to "stream" live data into the DB at sample rates up to 1
kHz. Anything higher than that, we expect to post-process into the DB
if it can't be done live.
* The sample data for a rate group is a time series: absolute
timestamp + sample values from each channel in the group. There can
be gaps in the timestamps (due to channels being temporarily
unavailable or communication problems). Timestamps are provided
externally (from the data source).
* Samples are (sometimes) in raw counts, with one set of calibration
mapping counts to sample volts and another set of calibration mapping
from volts to engineering units (e.g. PSI, deg C). These
calibrations are updated at different times by different processes.
* All sample data is written once by a single writer (for each rate
group) and read *many* times by multiple concurrent readers. Once
sample data is collected into the DB, it cannot be modified. It can
only be moved or deleted. Readers will want to see data as it is
being written.
* We will be collecting 100s of TBs of sample data, but fortunately
the data is highly compressible. We are planning on running on top of
a compressed file system.
* The most common performance-driving use-case of the time-series DB
is to provide a "quick plot" of selected channels (draw the plot
within seconds). This boils down to finding the fastest way to render
a plot of a huge set of data samples. The current best method is to
down-sample the selected channels (one sample per pixel -- about
1000-2000 samples for each channel). This means the query will make
large "skips" over huge data sets (caching it all in memory is not an
option). We'd also like to have an efficient min/max computation for
each pixel, which will mean pre-computing some sort of min/max binary
tree and storing this in the DB as well.
* The second most common performance-driving use-case is to export
all the samples for the selected channels within a selected time
range.
* Other less-common use cases would be to query by value (e.g. "when
were the times when channel X > 3.5 and channel Y < 12.4?")
* Associate meta-data with each time series (what system collected
it, what test was it a part of, when the test started and ended, what
calibration was used, when was the last time the DAQ card was
calibrated, etc.) and allow the user to query for data by meta-data
("show me the data sets for this type of test between these dates").
* Provide a method to annotate the time-series data ("this channel
went bad between these times")
* Allow analysts to upload derived/processed data and associate it
with the raw data.
* Provide authentication and permission levels. For instance, only
privileged users can delete data sets for a specific test system and
the owner of one test system should not be allowed to muck with
another test system unless they are specifically authorized for it.
* Provide an interface to a multi-tiered storage system. Tier 1 is
the current "hot" data (what analysts are most interested in) and is
an array of SSDs. Tier 2 is cheaper/slower (HDDs) but more massive
storage. Tier 3 is even cheaper/slower (archives) but even more
massive. The system needs to provide automatic "aging" of time-series
data sets to higher tiers (i.e. data is copied and then removed from
the lower tier). The aging policy would be inactivity after some
amount of time. The system needs to also provide automatic reload of
data to lower tiers (i.e. upon a query for old data, the data is
cached back to a lower tier).
* Data collected at remote sites needs to be copied to the
headquarters storage, but the sites do not want to have copies of all
other sites.
* The data management system needs to be 100% reliable. Some of our
data is extremely valuable (loss of the data could have dire
consequences).
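
To make the "quick plot" bullet above concrete, here is a minimal sketch of the per-pixel min/max reduction and of a pre-computed min/max pyramid (the "binary tree" mentioned in that bullet). It is plain Python over an in-memory list and says nothing about how the levels would be stored in the DB. The function names minmax_per_pixel and build_minmax_pyramid are made up for illustration.

```python
from typing import List, Tuple

MinMax = Tuple[float, float]


def minmax_per_pixel(samples: List[float], pixels: int) -> List[MinMax]:
    """Reduce samples to at most `pixels` (min, max) pairs, one per plot column."""
    if pixels <= 0:
        raise ValueError("pixels must be positive")
    n = len(samples)
    bins = min(pixels, n)  # never more bins than samples, so no bin is empty
    out = []
    for p in range(bins):
        # Integer bin edges: every sample lands in exactly one bin.
        lo = p * n // bins
        hi = (p + 1) * n // bins
        chunk = samples[lo:hi]
        out.append((min(chunk), max(chunk)))
    return out


def build_minmax_pyramid(samples: List[float]) -> List[List[MinMax]]:
    """Level 0 holds one (s, s) pair per sample; each higher level merges
    neighbouring pairs, halving the length until a single pair remains."""
    level = [(s, s) for s in samples]
    levels = [level]
    while len(level) > 1:
        merged = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            merged.append((min(lo for lo, _ in pair), max(hi for _, hi in pair)))
        level = merged
        levels.append(level)
    return levels
```

With a pyramid like this stored alongside the raw samples, a plot that is W pixels wide could read from the first level whose length is close to W instead of scanning every raw sample. Whether that is fast enough would need to be measured.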

We have some specific questions on how mongodb could fit our problem:
* Is there anything above that will be difficult to achieve with
mongodb?
* What is the best schema to optimize the performance-driving use
cases?
* Will mongo's global lock be an issue for us?
* How can we provide user authentication on a per-test basis? Our
understanding is that mongodb only provides per-database
authentication.
* There are existing file system tiered caching systems, but they
would of course work independently of mongodb. Would this be a
problem? Or should the tiering be done from mongodb? If so, how?

Any comments or suggestions would be appreciated. And please be
honest if you think maybe mongodb isn't the best fit (just because it
*can* do it doesn't mean it's the best solution).

Thanks,
Keith.

Richard Kreuter

Nov 8, 2011, 4:22:10 PM

to mongodb-user

That's quite a list of requirements.

In general, MongoDB should be pretty good at storing and retrieving
your fairly large data sets (in your case, you'd probably want to
consider a sharded deployment to handle the high write volume). It
might take some head-scratching to figure out the optimal way to handle
certain aspects (e.g., for the rendering, if I understand the
requirements, you might need to incorporate some synthetic fields to
make the sampling work maximally well in MongoDB; by including
something like a random number uniformly distributed between 0 and 1
in the pixel's sample, you could query for samples whose random number
was below some threshold, in order to control your down-sampling). In
general, most of these sorts of design details have to get worked out
on a case-by-case basis. Exporting samples should be fairly
straightforward, given suitable indexes on timestamp fields.
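
A rough sketch of the random-field idea, under some assumptions: each sample document gets a synthetic field (called "r" here) drawn uniformly from [0, 1) at insert time, and the down-sampling query keeps only documents whose "r" falls below target/total. The field names "ts", "values" and "r" and the helper names are invented for illustration. The filter dict uses the standard $gte/$lt query operators, but it is only evaluated locally here rather than sent to a server.

```python
import random


def tag_sample(timestamp, values, rng=random):
    """Build a sample document with a synthetic uniform random field 'r'."""
    return {"ts": timestamp, "values": values, "r": rng.random()}


def downsample_filter(t_start, t_end, total_samples, target_samples):
    """Query document selecting roughly `target_samples` of the
    `total_samples` documents in the half-open range [t_start, t_end)."""
    threshold = min(1.0, target_samples / max(total_samples, 1))
    return {"ts": {"$gte": t_start, "$lt": t_end}, "r": {"$lt": threshold}}


def matches_downsample(doc, flt):
    """Local stand-in for the server evaluating the filter above."""
    ts, r = flt["ts"], flt["r"]
    return ts["$gte"] <= doc["ts"] < ts["$lt"] and doc["r"] < r["$lt"]
```

Note that this yields approximately the target count, not exactly, and the kept samples are not evenly spaced in time. Narrow spikes can be dropped, so it does not replace a per-pixel min/max.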

However, a few requirements jump out as not immediately
straightforward given what MongoDB gives you:

(0) MongoDB's index format isn't optimized for ranges over
multi-dimensional spaces (your less common use case with ranges over
two fields), but some optimizations we're planning for version 2.2
will improve this query pattern.
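
For reference, the two-field range from the original question ("channel X > 3.5 and channel Y < 12.4") would look something like the filter below, assuming each sample document stores channel values under top-level fields named "X" and "Y" (a made-up layout). The small matcher is only a local stand-in to show what the filter selects. It is not how the server evaluates queries and says nothing about index use.

```python
import operator

# Range query over two fields, written as a MongoDB-style filter document.
VALUE_FILTER = {"X": {"$gt": 3.5}, "Y": {"$lt": 12.4}}

_OPS = {
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches_range_filter(doc, flt):
    """Return True if every field in `flt` is present in `doc` and
    satisfies all of its comparison operators."""
    for field, conds in flt.items():
        if field not in doc:
            return False
        if not all(_OPS[op](doc[field], bound) for op, bound in conds.items()):
            return False
    return True
```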

(1) you'd have to implement your authentication logic yourself, or
else put what you're calling "tests" into separate databases to rely
on MongoDB's authentication.
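
If you go the "implement it yourself" route, the check would live in your application layer in front of the database. A minimal sketch, assuming a hypothetical User record that maps each test system to a set of granted actions. The names User, is_allowed and delete_data_set are invented for illustration, and the dict called store stands in for the real data store.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class User:
    name: str
    # test system id -> granted actions, e.g. {"read", "delete"}
    grants: Dict[str, Set[str]] = field(default_factory=dict)


def is_allowed(user: User, test_system: str, action: str) -> bool:
    """True only if the user was explicitly granted `action` on `test_system`."""
    return action in user.grants.get(test_system, set())


def delete_data_set(user: User, test_system: str, store: dict, data_set_id: str) -> None:
    """Delete a data set, refusing unless the user holds the 'delete' grant."""
    if not is_allowed(user, test_system, "delete"):
        raise PermissionError(f"{user.name} may not delete from {test_system}")
    store[test_system].pop(data_set_id)
```

The default is deny: a user with no entry for a test system gets nothing, which matches the requirement that owners cannot touch another test system unless specifically authorized.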

(2) MongoDB isn't really designed for uneven data distribution (your
model where headquarters accumulates all data gathered by all remote
sites, but remote sites don't want all the data headquarters has).
For this, you'd need your own custom ETL-like layer that loaded data
from separate deployments at the remote sites into a (presumably
larger) headquarters deployment.
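
A very rough sketch of what such a one-way loader could look like, using in-memory lists and dicts in place of the site and headquarters deployments. It assumes each document carries an "_id" and a monotonically increasing "ts", and that a per-site watermark (the last timestamp copied) is persisted between runs. The function name sync_site_to_hq is hypothetical.

```python
def sync_site_to_hq(site_docs, hq_docs, site_id, watermark):
    """Copy documents newer than `watermark` from one site into HQ.

    `site_docs` is a list of dicts, `hq_docs` a dict keyed by (site_id, _id).
    Returns the new watermark. Re-running with the same watermark is
    harmless because HQ entries are keyed, so copies overwrite themselves.
    """
    new_docs = sorted(
        (d for d in site_docs if d["ts"] > watermark),
        key=lambda d: d["ts"],
    )
    for doc in new_docs:
        # Tag each copy with its origin so HQ can tell sites apart.
        hq_docs[(site_id, doc["_id"])] = dict(doc, site=site_id)
    return new_docs[-1]["ts"] if new_docs else watermark
```

A timestamp watermark only works if data never arrives late. Post-processed data with older timestamps would need a different change marker, such as an insertion sequence number.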

(3) Similarly, there is nothing built into MongoDB that would solve
your tiering problem out of the box, so you'd have to roll that
yourself.
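
As a starting point for rolling your own, the aging policy itself is simple to express. The sketch below assumes you keep a small catalog record per data set with its current tier and last access time. The record fields and the names plan_aging and on_query are invented for illustration, and the actual copy-then-remove between storage tiers is left out.

```python
def plan_aging(data_sets, now, max_idle):
    """Return (id, from_tier, to_tier) moves for data sets idle too long.

    `max_idle` maps a tier number to the idle time (seconds) after which a
    data set ages to the next tier. A tier with no entry is never aged.
    """
    moves = []
    for ds in data_sets:
        limit = max_idle.get(ds["tier"])
        if limit is not None and now - ds["last_access"] > limit:
            moves.append((ds["id"], ds["tier"], ds["tier"] + 1))
    return moves


def on_query(ds, now):
    """Record an access and mark the data set for reload into tier 1."""
    ds["last_access"] = now
    ds["tier"] = 1
    return ds
```

Keeping this catalog in a small collection of its own would let the aging job and the query path share one source of truth about where each data set currently lives.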
