A strawman for logging huge amounts of (streaming) data in mongo


Amit Manjhi

Jun 3, 2011, 12:35:06 PM
to mongod...@googlegroups.com
Hi,

I wanted to get the group's perspective on how best to organize the following use case. 
  1. The application can receive a fair amount of data per day.
  2. Most of this data is just being logged, apart from some minimal computation done using only the most recent data, say data received in the last few minutes.
  3. All the data must be available, but the overall size of the database must be capped. It is okay if old data (say, more than a week old) is archived and not available in the production database.
If there were no need to archive the data, capped collections would have worked. But it seems there is no auto-archiving mechanism for capped collections. If I do not use a capped collection, the memory footprint of the application will not grow, but the database size will keep on growing.

I came up with the following strawman to handle this situation and wanted feedback on it. The application keeps moving to a new database every fixed time period, say a month. So every db request received in the month of June gets logged to "db_june_2011", every db request received in the month of July gets logged to "db_july_2011", and so on. At the beginning of a month, I can just archive the database files older than the past two months. Seems like this can be easily coded into the application logic. What are the drawbacks of this solution? Is there a limit on the number of databases in mongo? (I searched but couldn't find any.)
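To make the strawman concrete, here is roughly what I have in mind on the application side. This is just a sketch assuming pymongo; the "requests" collection and the helper names are made up:

from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

def db_name_for(ts):
    # e.g. a request logged on 2011-06-03 goes to "db_june_2011"
    return "db_%s_%d" % (ts.strftime("%B").lower(), ts.year)

def log_request(doc):
    now = datetime.now(timezone.utc)
    client[db_name_for(now)]["requests"].insert_one(doc)

def archive_old_databases(keep):
    # Run at the start of each month: back up each monthly database that is
    # not in the `keep` set (e.g. with mongodump), then drop it, so the
    # on-disk size stays roughly capped at two months of data.
    for name in client.list_database_names():
        if name.startswith("db_") and name not in keep:
            # ... dump `name` to archival storage here first ...
            client.drop_database(name)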

Regards,
Amit Manjhi

Scott Hernandez

Jun 4, 2011, 12:59:54 AM
to mongod...@googlegroups.com
This is a reasonable solution. There is no limit on the number of databases, aside from the limits of the filesystem for the number of files in a directory/partition.

The drawbacks of a system like this are that you need to customize the query logic to run queries against old dbs/collections. It also means you have to select the current insertion database/collection based on the date/time of the insert -- this is not a big problem, but it does mean putting more logic in the application and query system (if you want to search the archives).
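For instance, a query helper that fans out over the last couple of monthly databases and merges the results in the application might look something like this (just a pymongo sketch; the "requests" collection and the db_<month>_<year> naming follow the strawman above):

from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

def find_in_recent_months(query, months=2):
    # Run the query against the current and previous monthly databases
    # and merge the results in the application.
    now = datetime.now(timezone.utc)
    year, month = now.year, now.month
    existing = set(client.list_database_names())
    results = []
    for _ in range(months):
        name = "db_%s_%d" % (datetime(year, month, 1).strftime("%B").lower(), year)
        if name in existing:
            results.extend(client[name]["requests"].find(query))
        month -= 1
        if month == 0:
            month, year = 12, year - 1
    return results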

There are a few solutions very similar to this for collecting statistical data, where a new collection is used per time period and the data is then reduced and stored in an archival format for long-term storage/aggregation.
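As a rough illustration of that reduction step (the field names, the daily granularity, and the "archive" database are all placeholders, and this assumes a server version with the aggregation pipeline):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

def summarize_month(raw_db_name):
    # Collapse a month's raw log documents into per-day counts and store
    # them in a small archive collection before the raw database is dropped.
    raw = client[raw_db_name]["requests"]
    pipeline = [
        {"$group": {
            "_id": {
                "day": {"$dateToString": {"format": "%Y-%m-%d", "date": "$ts"}},
                "path": "$path",
            },
            "count": {"$sum": 1},
        }},
    ]
    summaries = list(raw.aggregate(pipeline))
    if summaries:
        client["archive"]["daily_counts"].insert_many(summaries)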


Jayesh

Jun 4, 2011, 1:11:12 PM
to mongodb-user
To further help contain your data, you can name the database by the month only (dropping the year). This way you can reduce some amount of hard-coding. I had used a similar solution in RDBMS apps, where I empty/truncate a table before I start loading into it for the first time. This way, I always had at least 11 months of data on hand.

Also, depending on the usage, you can abstract the fact that data is going into different DBs behind just a few "API"-like functions to insert, update, and query data. The functions can determine the target database internally, and if you ever choose to move to a single database, you would still be fine.
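Just to sketch the idea (the function and collection names are made up, and in a real app the truncate-on-first-write flag would need to be persisted so a restart does not re-empty the current month's db):

from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
_truncated_for = None  # which month's db has already been emptied (in-memory only)

def _current_db():
    global _truncated_for
    name = "db_%s" % datetime.now(timezone.utc).strftime("%B").lower()  # e.g. "db_june"
    if _truncated_for != name:
        client.drop_database(name)  # empty it before the first load of a new month
        _truncated_for = name
    return client[name]

def insert_event(doc):
    _current_db()["events"].insert_one(doc)

def find_events(query):
    return _current_db()["events"].find(query)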


Amit Manjhi

Jun 5, 2011, 11:34:27 AM
to mongod...@googlegroups.com
Thanks Scott and Jayesh. 

For this particular case, the overhead of customizing the query logic to run queries against old dbs/collections should be fine. Having a few API-like functions to abstract the fact that the data is going into different DBs is a great suggestion -- I was planning to do something like that anyway so that I can test and tune the time period as well.
