Offline data generation for later exploration via mongodb

Jacob

Sep 4, 2017, 5:00:45 PM
to mongodb-user
Hi,
I'd appreciate expert advice on the following use case. I haven't managed to find similar posts, so if this is a duplicate, I apologize; please point me to the relevant threads.

In summary, my use case looks like this: generate a lot of data with diverse processes -> people select subsets of that data (still a lot, but much less than was generated) -> bring the selected data online for exploration in a NoSQL database.


In a little more detail, the situation looks like this:

- There is a huge number of periodic batch processes running in different places (geos, data centers), creating data files of varying size (say, up to 50 GB). The processes are not continuous, i.e. they run for a few hours and terminate, so it is not constantly streaming data.

- Many of the data files are not needed later on and are therefore removed. Others are deemed necessary and preserved. The decision about what is needed and what is not can only be made post factum (i.e. there is no way to avoid generating the files in the first place). I'd like to work with the preserved data collectively via MongoDB (mostly to index and query it).

- The data files today are in a BSON-like format: raw data, no indexes. I control the format and the embedded software component that writes it out, so I can change it as desired. I need local-disk-like throughput when generating/storing the data.

- Since only part of the data is eventually needed, I want to avoid uploading all data to a central server (or cluster) at the time of generation.

- I understand that I can probably run a local server next to each batch process and create a MongoDB collection locally, then copy those collections (BSON files) that are needed to a central server/cluster and "restore" them via mongorestore. After reading the documentation and playing around, it seems that this works as a series of inserts/bulk inserts into the database, i.e. it involves copying the data document by document (a rough sketch of this path follows below).
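
To make that concrete, here is a rough sketch of the two halves in PyMongo. The file names, database/collection names and batch size are made up for illustration; this is just the insert-based path I described, not a claim about how mongorestore is implemented internally:

    # Batch side: the embedded writer appends raw BSON documents to a local
    # file (a plain concatenation of BSON documents, which is the same layout
    # a mongodump .bson file uses).
    from bson import BSON

    def write_batch(path, documents):
        with open(path, "ab") as f:
            for doc in documents:
                f.write(BSON.encode(doc))

    # Central side: load a preserved file into the central cluster by
    # re-inserting every document -- the "series of bulk inserts" mentioned above.
    from bson import decode_file_iter
    from pymongo import MongoClient

    def load_batch(path, mongo_uri, db_name, coll_name, batch_size=1000):
        coll = MongoClient(mongo_uri)[db_name][coll_name]
        batch = []
        with open(path, "rb") as f:
            for doc in decode_file_iter(f):
                batch.append(doc)
                if len(batch) >= batch_size:
                    coll.insert_many(batch, ordered=False)
                    batch = []
        if batch:
            coll.insert_many(batch, ordered=False)

Even with large unordered batches, every document still goes through the insert path and all indexes are rebuilt on the central server, which is exactly the overhead I am hoping to avoid.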


The question is: is this the fastest overall way to prepare and bring up massive amounts of offline data whose format is under my control? Is there a way to "prepare" a collection locally (sort of offline) and then "integrate" it into a server database seamlessly, without re-inserting it? Maybe by using the storage engine's API directly (e.g. the WiredTiger API)?


Thanks!

Jacob

Kevin Adistambha

Sep 19, 2017, 3:03:29 AM
to mongodb-user

Hi Jacob,

The question is: is this the fastest overall way to prepare and bring up massive amounts of offline data whose format is under my control? Is there a way to "prepare" a collection locally (sort of offline) and then "integrate" it into a server database seamlessly, without re-inserting it? Maybe by using the storage engine's API directly (e.g. the WiredTiger API)?

Currently there is no method to prepare and attach offline data as you described using WiredTiger. However, there is a feature request for exactly this functionality in SERVER-19043. Please vote/comment on the ticket if you feel it is relevant to your use case.

Note that you can do this with the MMAPv1 storage engine, as long as the namespace (i.e. the database name) doesn't overlap with anything already in use. Having said that, by using MMAPv1 you also give up the advantages of WiredTiger, such as compression and better concurrency. Please also note that even though MMAPv1 supports this in practice, you are directly modifying the dbpath content of MongoDB, which carries a certain amount of risk and could leave the deployment in an undefined state, resulting in a non-functional database. I would suggest thoroughly testing your procedure if you want to proceed with this method.
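
Purely to illustrate what "directly modifying the dbpath content" means in practice, here is a minimal sketch. The paths and database name are hypothetical, both mongod instances must be shut down cleanly before copying, and the database must not already exist in the destination dbpath:

    # Hypothetical sketch of the MMAPv1 file-level copy described above.
    # MMAPv1 (without directoryPerDB) stores each database as <name>.ns plus
    # numbered data files <name>.0, <name>.1, ... directly inside the dbpath.
    import glob
    import os
    import shutil

    def copy_mmapv1_database(src_dbpath, dst_dbpath, db_name):
        files = glob.glob(os.path.join(src_dbpath, db_name + ".*"))
        if not files:
            raise RuntimeError("no MMAPv1 files found for database " + db_name)
        for path in files:
            shutil.copy2(path, dst_dbpath)

    # Example with hypothetical paths:
    # copy_mmapv1_database("/data/batch-node/db", "/data/central/db", "preserved_batch_001")

Again, this bypasses the server entirely, so treat it as an at-your-own-risk procedure and verify the copied collections (for example with db.collection.validate()) after restarting mongod.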

Best regards,
Kevin
