IMHO, sober usage of plain log files will solve the problem. One thing
worth to consider is storing metadata in the database, just to
simplify files management and make queries faster.
Sincerely,
--Victor
On Mar 24, 12:27 pm, Heiner Bunjes <h.bun...@googlemail.com> wrote:
> I need a database to log and retrieve sensor data.
> The scale is as follows:
> (01) Tenthousand sensors deliver 1 record per second each
> -> Insert rate = 10kilorecord/s
> -> Insert rate = 315 gigarecord per Year
> (02) They have to be stored for ~3 years
> -> Size of database = 1 terarecord
> (03) Each record has about 200 bytes
> -> Size of database = 200TB
> (04) The system will start with just 700 sensors but will
> soon increase upto the described volume
> The main operations on the data are
> (05) Append the new records to the database
> (written records are never changed)
> (06) Return all records for sensor X with a timestamp
> between timestamp_a and timestamp_b
> (07) Return the records specified in (06) ordered by
> timestamp
> (08) Delete all records older than Y
> Further the following is true
> (09) The database system MUST be free and open source
> (10) The DB SHOULD be easy to administrate
> Which is the most appropriate database for this?
> p.s.:
> I am currently havin a look at riak but I'm not sure how
> to efficiently implement (06) and (07) or if there is a
> much better choice than riak.
If you're interested in a SQL approach, infobright<http://www.infobright.com>is open source, and they claim to be focusing themselves on machine generated data ....
On Thu, Mar 24, 2011 at 4:58 PM, Viktor Sovietov <victor.sove...@gmail.com>wrote:
> IMHO, sober usage of plain log files will solve the problem. One thing > worth to consider is storing metadata in the database, just to > simplify files management and make queries faster.
> Sincerely,
> --Victor
> On Mar 24, 12:27 pm, Heiner Bunjes <h.bun...@googlemail.com> wrote: > > Hello!
> > I need a database to log and retrieve sensor data.
> > The scale is as follows:
> > (01) Tenthousand sensors deliver 1 record per second each > > -> Insert rate = 10kilorecord/s > > -> Insert rate = 315 gigarecord per Year
> > (02) They have to be stored for ~3 years > > -> Size of database = 1 terarecord
> > (03) Each record has about 200 bytes > > -> Size of database = 200TB
> > (04) The system will start with just 700 sensors but will > > soon increase upto the described volume
> > The main operations on the data are
> > (05) Append the new records to the database > > (written records are never changed)
> > (06) Return all records for sensor X with a timestamp > > between timestamp_a and timestamp_b
> > (07) Return the records specified in (06) ordered by > > timestamp
> > (08) Delete all records older than Y
> > Further the following is true
> > (09) The database system MUST be free and open source
> > (10) The DB SHOULD be easy to administrate
> > Which is the most appropriate database for this?
> > p.s.: > > I am currently havin a look at riak but I'm not sure how > > to efficiently implement (06) and (07) or if there is a > > much better choice than riak.
> > Thanks > > Heiner
> -- > You received this message because you are subscribed to the Google Groups > "NOSQL" group. > To post to this group, send email to nosql-discussion@googlegroups.com. > To unsubscribe from this group, send email to > nosql-discussion+unsubscribe@googlegroups.com. > For more options, visit this group at > http://groups.google.com/group/nosql-discussion?hl=en.
-- Jim Peters +1-415-608-0851 (Cell) +1-416-466-9790 (Home) +1-415-508-8651 (Google Voice)
i think the general term for what you are looking for is a "time series database". eventmonitor data accelerator is the one i'm somewhat familiar with, but it's not open source, might be specific to the financial data and it's web presence is pretty opaque. the wikipedia entry doesn't list any implementations, but after a quick search, rrdtool sounds like it might be what you're looking for. if not, your write requirements are pretty modest. if your read requirements are minimal as well, you could build an ad-hoc solution, eg using flat files as viktor suggests
On Thu, Mar 24, 2011 at 6:27 AM, Heiner Bunjes <h.bun...@googlemail.com> wrote:
> Hello!
> I need a database to log and retrieve sensor data.
> The scale is as follows:
> (01) Tenthousand sensors deliver 1 record per second each > -> Insert rate = 10kilorecord/s > -> Insert rate = 315 gigarecord per Year
> (02) They have to be stored for ~3 years > -> Size of database = 1 terarecord
> (03) Each record has about 200 bytes > -> Size of database = 200TB
> (04) The system will start with just 700 sensors but will > soon increase upto the described volume
> The main operations on the data are
> (05) Append the new records to the database > (written records are never changed)
> (06) Return all records for sensor X with a timestamp > between timestamp_a and timestamp_b
> (07) Return the records specified in (06) ordered by > timestamp
> (08) Delete all records older than Y
> Further the following is true
> (09) The database system MUST be free and open source
> (10) The DB SHOULD be easy to administrate
> Which is the most appropriate database for this?
> p.s.: > I am currently havin a look at riak but I'm not sure how > to efficiently implement (06) and (07) or if there is a > much better choice than riak.
> Thanks > Heiner
> -- > You received this message because you are subscribed to the Google Groups "NOSQL" group. > To post to this group, send email to nosql-discussion@googlegroups.com. > To unsubscribe from this group, send email to nosql-discussion+unsubscribe@googlegroups.com. > For more options, visit this group at http://groups.google.com/group/nosql-discussion?hl=en.
I am not ready to say that Riak would be the right solution here,
although I think it could work. You have some good things going for
you. A specific, well defined spec and immutable data. Using the
criteria you provided it would not be that much work to spec out a
solution. If I were to use Riak, as it exists today, for this project
I would do the following:
-Make extensive usage of key filters.
Key filters is a mechanism by which you can construct a filter to
select, from memory, a subset of keys to pass to map/reduce. This
means that if you use meaningful key names you get a big win
performance wise, and in reality would be one of the only ways to do
this. In your case the naming convention I would use would be
something like "sensorname_timestamp". You could construct a filter
that would look for only keys starting with "sensorname" and having a
"timestamp" between A and B where timestamp is a unix int. See
http://wiki.basho.com/Key-Filters.html .
-Map/Reduce considerations:
Maps are distributed in Riak. Meaning that after your key list has
been constructed via the key filter phase, the coordinating node (Riak
is homogeneous - any node can service any request) will hash the keys
in the set to their corresponding nodes in the cluster and distributed
the requests to all those nodes. The cluster will fetch the keys from
all those nodes and return them - out of order - to the coordinating
node. The reduce, in this case a sort, is applied on the coordinating
node.
Things to think about is the total key set returned via the key
filter. Given your exacting parameters we can calculate that each
sensor will have a maximum of 60*60*24 keys per day which if you are
doing daily rollups would push 86,400 keys to your coordinating node.
Once hashed (relatively inexpensive) the fetch would be fanned out
across the cluster and the cluster would return 86,400*250 bytes of
data back to the coordinating node for sorting (I padded your payload
a touch). Recall that the reduce (sort in this case) is not
distributed. This means that your coordinating node is sorting 21MB
data. If you are writing your m/r in javascript you will incur a
penalty for the marshaling between the erlang vm and the javascript
vm. If you do use javascript I would pay attention to the javascript
specific vm config items in the config file, namely the number of js
vm's and their memory allocations. But in reality I would get the m/r
written in erlang - not hard considering the simplicity of your needs.
-File system backend
The other thing to consider is the number of keys per node in your
cluster. Bitcask, the default file backend for Riak, stores a hint
file of all its keys in memory. This means the more keys you have the
more memory you eat. Fortunately, there is a metric to ascertain
maximum capacity like so:
cluster ram = ( key size + overhead ) * number of keys
In your case the sheer number of 1trillion keys will exhaust the
capacity for total ram in a cluster. In short, you can absolutely not
do this with the bitcask backend and the current convention of 1 key
per sensor per second. Your only other alternative would be the innodb
backend and I would have a serious conversation with the Basho guys
about that. The other alternative would be to rollup your 86,400 keys
per day per sensor into 1 key , thereby reducing your total key count
to simply 10 k per day. Each key would have 20 some odd MB of data but
that is totally manageable. Of course, your rollup frequency can be
shortened from a day to an hour. But you get my point, if you want to
use Bitcask (which you probably do) you will need to reduce your
overall key count. See https://spreadsheets.google.com/ccc?key=0Ak4OBkABJPsxdEowYXc2akxnYU9x... .
I would post this exact question to the Riak mailing list and see what
the Basho folks say. That said, I would also take a look here:
http://discoproject.org/ . I saw the main developer give a demo at
erlang factory just today. It is pretty impressive and may be a good
fit. Think hadoop but switch java with erlang.
Do write back with your findings.
Cheers,
-Alexander
@siculars
On Mar 24, 3:27 am, Heiner Bunjes <h.bun...@googlemail.com> wrote:
> I need a database to log and retrieve sensor data.
> The scale is as follows:
> (01) Tenthousand sensors deliver 1 record per second each
> -> Insert rate = 10kilorecord/s
> -> Insert rate = 315 gigarecord per Year
> (02) They have to be stored for ~3 years
> -> Size of database = 1 terarecord
> (03) Each record has about 200 bytes
> -> Size of database = 200TB
> (04) The system will start with just 700 sensors but will
> soon increase upto the described volume
> The main operations on the data are
> (05) Append the new records to the database
> (written records are never changed)
> (06) Return all records for sensor X with a timestamp
> between timestamp_a and timestamp_b
> (07) Return the records specified in (06) ordered by
> timestamp
> (08) Delete all records older than Y
> Further the following is true
> (09) The database system MUST be free and open source
> (10) The DB SHOULD be easy to administrate
> Which is the most appropriate database for this?
> p.s.:
> I am currently havin a look at riak but I'm not sure how
> to efficiently implement (06) and (07) or if there is a
> much better choice than riak.
Many thanks to all of you who answered my question.
Especially I thank @siculars whose answer is extremely helpful.
From your answers I got some input to add some requirements to my
list.
######## <requirements version="2">
I need a database to log and retrieve sensor data.
Glossary
- Node = A computer on which on instance of the database
is running
- Blip = one data record send by a sensor
- Bin = A container holding a lot of blips. The database
could store bins instead of blips to reduce the total number
of keys.
- Blip page = The sorted list of all blips for a specific sensor
and a specific time range.
The scale is as follows:
(01) Tenthousand sensors deliver 1 blip per second each
-> Insert rate = 10kiloblip/s
-> Insert rate = 315 gigablip per Year
(02) They have to be stored for ~3 years
-> Size of database = 1 terablip
(03) Each blip has about 200 bytes
-> Size of database = 200TB
(04) The system will start with just 700 sensors but will
soon increase upto the described volume.
The main operations on the data are:
(05) Append the new blips to the database
(written blips are never changed)!
(06) Return all blips for sensor X with a timestamp
between timestamp_a and timestamp_b!
With other words: Return a blip page.
(07) Return all the blips specified in (06) ordered
by timestamp!
(08) Delete all blips older than Y!
Further the following is true:
(09) The database system MUST be free and open source.
(10) The DB SHOULD be easy to administrate.
(11) 99.9% of the blips are inserted in
chronological order, the rest is not.
(12) All data MUST still be writable and readable while less
then the configurable number N of nodes are down (unexpectedly).
(13) The mechanisms to distibute the data to the available
nodes SHOULD be handled by the database.
This means that the database SHOULD automatically
redistribute the data when nodes are added or removed.
######## </requirements>
The application I am building is a kind of complex event processing
core.
It is mainly written in erlang.
#### Plain logfiles
are an interesting approach but IMO are very
much in conflict with (10, 11, 12, 13).
It would mean to do most of the work by myself.
#### infobright
seems not to be able to fullfill replication (12, 13) with
its free variant.
#### eventmonitor data accelerator
is not free (09).
#### rrdtools
doesn't seem to be distributed. So again I would have to do
very much by myself.
#### riak
Hmm.
As it looks I have (at least) the problem that
at 1 key/blip I get to many keys for Bitcask (InnoDB might work).
So I could build records which hold multiple blips; I will call them
bins.
One bin might hold 86,400 blips (e.g.) but only has one single key.
This will cause some work for packing and unpacking.
And it would in some cases increase the amount of data that is send
to the coordinating node because as I understand it, a whole bin would
be
sent even if only one blip is needed. Right? Or is there a mechanism
to
already filter the content of the bin before sending it to the
coordinating
node?
I'm still reading the pagination and discoproject stuff.
Give me some time before answering that!
At the moment I am really wondering if there is no off the shelf
solution for my needs. The problem I have seems so natural to me.
I just have blips to be logged away and I have users who later on
want to have a look at specific blip pages. It's just the amount of
data that is unusual.
There is probably more than one database that would fit the bill, but I would encourage you to at least have a look at Cassandra.
It handles times series very well and in particular (05), (06) and (07) will be easy to do and efficient (Cassandra will store your records in sorted order on disk, so reading a range is just a seek to where the range start and then sequential read).
For (08), there could be a number of solutions depending on the exact data layout you use, but I'll mention here that Cassandra supports TTL at the record level, so if you know upfront that you need to keep a record 3 years, that will be super easy (and cheap). If that don't work for you, writing a map-reduce job to do this should be straightforward (using say Hive to make that simpler).
It's free and open source and I think fairly easy to administrate. The scale won't be a problem since that's why Cassandra is built for.
On Thu, Mar 24, 2011 at 11:27 AM, Heiner Bunjes <h.bun...@googlemail.com> wrote: > Hello!
> I need a database to log and retrieve sensor data.
> The scale is as follows:
> (01) Tenthousand sensors deliver 1 record per second each > -> Insert rate = 10kilorecord/s > -> Insert rate = 315 gigarecord per Year
> (02) They have to be stored for ~3 years > -> Size of database = 1 terarecord
> (03) Each record has about 200 bytes > -> Size of database = 200TB
> (04) The system will start with just 700 sensors but will > soon increase upto the described volume
> The main operations on the data are
> (05) Append the new records to the database > (written records are never changed)
> (06) Return all records for sensor X with a timestamp > between timestamp_a and timestamp_b
> (07) Return the records specified in (06) ordered by > timestamp
> (08) Delete all records older than Y
> Further the following is true
> (09) The database system MUST be free and open source
> (10) The DB SHOULD be easy to administrate
> Which is the most appropriate database for this?
> p.s.: > I am currently havin a look at riak but I'm not sure how > to efficiently implement (06) and (07) or if there is a > much better choice than riak.
> Thanks > Heiner
> -- > You received this message because you are subscribed to the Google Groups "NOSQL" group. > To post to this group, send email to nosql-discussion@googlegroups.com. > To unsubscribe from this group, send email to nosql-discussion+unsubscribe@googlegroups.com. > For more options, visit this group at http://groups.google.com/group/nosql-discussion?hl=en.
> i think the general term for what you are looking for is a "time
> series database". eventmonitor data accelerator is the one i'm
> somewhat familiar with, but it's not open source, might be specific to
> the financial data and it's web presence is pretty opaque. the
> wikipedia entry doesn't list any implementations, but after a quick
> search, rrdtool sounds like it might be what you're looking for. if
> not, your write requirements are pretty modest. if your read
> requirements are minimal as well, you could build an ad-hoc solution,
> eg using flat files as viktor suggests
> On Thu, Mar 24, 2011 at 6:27 AM, Heiner Bunjes <h.bun...@googlemail.com> wrote:
> > Hello!
> > I need a database to log and retrieve sensor data.
> > The scale is as follows:
> > (01) Tenthousand sensors deliver 1 record per second each
> > -> Insert rate = 10kilorecord/s
> > -> Insert rate = 315 gigarecord per Year
> > (02) They have to be stored for ~3 years
> > -> Size of database = 1 terarecord
> > (03) Each record has about 200 bytes
> > -> Size of database = 200TB
> > (04) The system will start with just 700 sensors but will
> > soon increase upto the described volume
> > The main operations on the data are
> > (05) Append the new records to the database
> > (written records are never changed)
> > (06) Return all records for sensor X with a timestamp
> > between timestamp_a and timestamp_b
> > (07) Return the records specified in (06) ordered by
> > timestamp
> > (08) Delete all records older than Y
> > Further the following is true
> > (09) The database system MUST be free and open source
> > (10) The DB SHOULD be easy to administrate
> > Which is the most appropriate database for this?
> > p.s.:
> > I am currently havin a look at riak but I'm not sure how
> > to efficiently implement (06) and (07) or if there is a
> > much better choice than riak.
> > Thanks
> > Heiner
> > --
> > You received this message because you are subscribed to the Google Groups "NOSQL" group.
> > To post to this group, send email to nosql-discussion@googlegroups.com.
> > To unsubscribe from this group, send email to nosql-discussion+unsubscribe@googlegroups.com.
> > For more options, visit this group athttp://groups.google.com/group/nosql-discussion?hl=en.
How does the application code read the sensors? For example, is there
a process for each sensor that hangs for a second and reads an IP
address? Are there multiple processes each of which polls a group of
sensors? Is there a device driver that reads the sensor data and
asynchronously e.g., via DMA, and puts the readings in a shared memory
segment?
What are the requirements for availability? Recoverability in the
event of hardware failure?
Sounds like a nice little afternoon project with GT.M (http://fis- gtm.com). Full disclosure: I manage GT.M.
Regards
-- Bhaskar (yes, it's my last name, but that's what I'm called)
ks dot bhaskar at fisglobal dot com <-- send e-mail here
--
GT.M - Rock solid. Lightning fast. Secure. No compromises.
On Mar 24, 6:27 am, Heiner Bunjes <h.bun...@googlemail.com> wrote:
> I need a database to log and retrieve sensor data.
> The scale is as follows:
> (01) Tenthousand sensors deliver 1 record per second each
> -> Insert rate = 10kilorecord/s
> -> Insert rate = 315 gigarecord per Year
> (02) They have to be stored for ~3 years
> -> Size of database = 1 terarecord
> (03) Each record has about 200 bytes
> -> Size of database = 200TB
> (04) The system will start with just 700 sensors but will
> soon increase upto the described volume
> The main operations on the data are
> (05) Append the new records to the database
> (written records are never changed)
> (06) Return all records for sensor X with a timestamp
> between timestamp_a and timestamp_b
> (07) Return the records specified in (06) ordered by
> timestamp
> (08) Delete all records older than Y
> Further the following is true
> (09) The database system MUST be free and open source
> (10) The DB SHOULD be easy to administrate
> Which is the most appropriate database for this?
> p.s.:
> I am currently havin a look at riak but I'm not sure how
> to efficiently implement (06) and (07) or if there is a
> much better choice than riak.
It is supposed to be a more scalable and higher performing time series database than rrd. It was developed by Orbitz.com for their serious data acquisition and graphing needs. I haven't used it yet but look forward to playing with it some day.
-- Tracy Reed Digital signature attached for your safety. Copilotco Professionally Managed PCI Compliant Secure Hosting 866-MY-COPILOT x101 http://copilotco.com