Re: Advice Needed: Bulk load into Mongo from CSV (81 million rows)

Kyle Wolfe

Jan 25, 2013, 5:10:51 PM
to mongod...@googlegroups.com
PHP fgetcsv()

On Friday, January 25, 2013 4:35:26 PM UTC-5, MongoUser wrote:
Hi,

I am a newbie to MongoDB and am experimenting to see whether we could use it to store time series data. I have several CSV files containing this data, each about ~2 GB in size and holding about 81 million records. Each CSV holds data for one type of time series; the timestamp is stored as a number and values are recorded at 5-minute intervals for around 4,000 objects. I need to load this data into Mongo. What is the fastest way to do this? The data is in the following format:

trf1.csv: contains a timestamp & value for each object at 5-minute intervals. There are around 4,000 objects:

object_id,metric,timestamp,value
197666,trf1,1352985262,0.944456778
197514,trf1,1352985272,0.776543234
197666,trf1,1352987301,0.944456778

Likewise there are trf2.csv, trf3.csv, etc. for the other types of time series.

I'm not sure what the best way to store this is, but here's what I'm thinking: a collection called metrics which will contain a document for each type of metric. For example:

db.metric.insert({
    _id : "trf1",
    objects : [
        {
            object : 197666,
            ts : [
                { timestamp : 1352985262, value : 0.944456778 },
                { timestamp : 1352987301, value : 0.944456778 }
                // and so on, ~50,000 timestamp/value pairs
            ]
        },
        {
            object : 197514,
            ts : [
                { timestamp : 1352985262, value : 0.944456778 },
                { timestamp : 1352987301, value : 0.944456778 }
                // and so on.....
            ]
        }
    ]
})
// and a similar document with _id : "trf2", and so on for each metric

Please share your suggestions on how to best model & insert this data. Thanks

MongoUser

Jan 25, 2013, 5:41:46 PM
to mongod...@googlegroups.com
Thanks Kyle. I have not used PHP before :-(  I'm looking for a quick way to load up this data and run some queries against it to evaluate Mongo performance.

Any other thoughts/ideas out there? I am also looking for input on the best way to store this data in Mongo. Thanks for your help.

Sam Millman

Jan 25, 2013, 6:32:37 PM
to mongod...@googlegroups.com
Provided your data is already in the shape you want to insert, you can use mongoimport: http://docs.mongodb.org/manual/administration/import-export/

Otherwise you will need to write something yourself, perhaps a JavaScript file, that will do this for you.



Gustavo Niemeyer

Jan 25, 2013, 6:32:37 PM
to mongod...@googlegroups.com
On Fri, Jan 25, 2013 at 7:35 PM, MongoUser <gupta...@gmail.com> wrote:
> object_id,metric,timestamp,value
> 197666, trf1,1352985262, 0.944456778
> 197514,trf1, 1352985272, 0.776543234
> 197666, trf1,1352987301, 0.944456778
>
> Likewise there are trf2.csv, trf3.csv etc for other types of time series.

Before you have a reason to diverge from it, I'd try the most
straightforward way: load the CSV and insert one document per row. Put
a small sample in, experiment with your application to see if that's
what you want, and move on.

A few other suggestions:

- Don't create further indexes before you load it all.
- Use one-letter field names. At several hundred million entries,
small savings will matter.

There are other potential tweaks to do, but I'd recommend not
overthinking the solution before you've had any interaction with the
environment at all.
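
For concreteness, a minimal sketch of that one-document-per-row load in
Python with PyMongo (the database/collection names, the one-letter field
names, and the batch size are all just illustrative assumptions):

import csv
from pymongo import MongoClient

coll = MongoClient().tsdb.trf1              # hypothetical database/collection

batch = []
with open("trf1.csv") as f:
    for row in csv.DictReader(f):           # header: object_id,metric,timestamp,value
        # one-letter field names keep per-document overhead small
        batch.append({
            "o": int(row["object_id"]),
            "m": row["metric"].strip(),
            "t": int(row["timestamp"]),
            "v": float(row["value"]),
        })
        if len(batch) == 1000:              # send batches instead of one insert per row
            coll.insert_many(batch)
            batch = []
if batch:
    coll.insert_many(batch)

(insert_many needs PyMongo 3.x; with older drivers, coll.insert(batch) does the same bulk insert.)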


gustavo @ http://niemeyer.net

MongoUser

Jan 26, 2013, 8:32:37 PM
to mongod...@googlegroups.com
Hi Gustavo
Thank you for your suggestion. I did try loading the CSV in its simplest format using mongoimport - so basically it would create one document per row. However this process was very slow - about 13,500 records/sec - and it would have taken several hours for me to load one CSV. I need to load 12 CSV files with ~80 million rows each.

I'm not sure if creating a new document per row is the best way to store this data in Mongo - maybe this is the reason for the slow import speed?

Also, since there are a huge number of rows, it will take a long time to process these CSV files into the nested document structure I mentioned earlier - it might take days for me to process & load all the data :-(

Any other suggestions? Has anyone stored time series data in Mongo before and used it for running analytical queries? Note that for this evaluation I am only loading 90 days of data; in reality this data could span 3 years. Please advise.

Thanks 

Ronald Stalder

Jan 27, 2013, 5:01:24 PM
to mongod...@googlegroups.com
Hi MongoUser

Just a guess (I haven't played around with it): mongoimport might be faster importing JSON than CSV (if mongoimport converts to JSON before inserting). So you could use a streaming editor to convert the CSV to JSON and pipe it into mongoimport - you'd then have two processes sharing the job: one doing the conversion and the other inserting the data.
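
Something along these lines, for instance - a small streaming converter whose output is piped straight into mongoimport (the field names and the mongoimport options are only an illustration):

# csv2json.py - turn each CSV row on stdin into one JSON document on stdout
import csv, json, sys

for row in csv.DictReader(sys.stdin):        # expects the object_id,metric,timestamp,value header
    doc = {
        "o": int(row["object_id"]),
        "m": row["metric"].strip(),
        "t": int(row["timestamp"]),
        "v": float(row["value"]),
    }
    sys.stdout.write(json.dumps(doc) + "\n")

# run as, roughly:  python csv2json.py < trf1.csv | mongoimport --db tsdb --collection trf1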

Depending on the nature of the queries you'll run against the data, you could also think about creating a collection per metric and running mongoimport in parallel for several CSV files (I understand you have one CSV file per metric: trf1, trf2, etc.). "Several" because there will be an optimum somewhere; running all 12 in parallel might be slower than running a few at a time.

Another thought: have you tried the --dbpath option on mongoimport?

cheers
Ronald

Kyle Wolfe

Jan 28, 2013, 9:54:19 AM
to mongod...@googlegroups.com
This is entirely dependent on your system IO, and 13.5k a second isn't all that bad. Keep indexes off your tables until after you're done inserting.

Wojons Tech

Jan 28, 2013, 12:04:44 PM
to mongod...@googlegroups.com
Make sure to turn off indexes - you don't want to index these documents as you write them. Find out the average line size of the CSV, then calculate the amount of memory the server has, and insert the right number of documents so you don't fill up your memory. Use the bulk insert command to send that many documents in, and wait for mongostat to show the qw (queued writes) column back at 0. Then check cat /proc/meminfo and look at Dirty; wait until it is less than a few megabytes. Dirty is the amount of data in memory waiting to be written to disk. You may want to clear your cache and run it again until you're done. When you rebuild your indexes, make sure to use the background flag so the indexes build behind the scenes. You may also want to use the sparse flag to save disk space.
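
For the index rebuild at the end, with PyMongo that looks something like this (the collection and field names are hypothetical, and a sparse index only saves space if the indexed field is missing from many documents):

from pymongo import ASCENDING, MongoClient

coll = MongoClient().tsdb.trf1      # hypothetical database/collection
# build after the bulk load; background=True lets it build behind the scenes
coll.create_index([("o", ASCENDING), ("t", ASCENDING)], background=True)
# add sparse=True only for a field that many documents lack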

Gustavo Niemeyer

Jan 29, 2013, 2:35:24 AM
to mongod...@googlegroups.com, MongoUser
On Sat, Jan 26, 2013 at 11:32 PM, MongoUser <gupta...@gmail.com> wrote:
> Thank you for your suggestion. I did try loading the csv in its simplest
> format using mongoimport - so basically it would create one document per
> row. However this process was very slow - about 13500 records/sec - and it
> would have taken several hours for me to load one csv. I need to load 12 csv
> files with ~80 mill rows.

You would be done by now.

> I'm not sure if creating a new document per row is the best way to store
> this data in Mongo - maybe this is the reason for the slow speed with
> import?

If you're not sure, trying it out might be instructive.

> Any other suggestions? Has anyone stored time series data in Mongo before
> and used it for running analytical queries? Note that for this evaluation I
> am only loading 90 days of data, in reality this data could span across 3
> years. Please advise.

I've already had MongoDB holding hundreds of millions of records with
sub-second responses, but it's hard to prescribe a design for that.
It really depends on the use you're going to make of your data, and
once you grow big enough, you'll have to design more than just your
schema to make it viable.


gustavo @ http://niemeyer.net

mounica...@gmail.com

Oct 21, 2015, 6:01:50 PM
to mongodb-user
Hi MongoUser,

Did you figure out a way to do this? I am facing the same problem: I have to load millions of records from CSV files into MongoDB. I am currently doing it in Python, processing row by row, and it's very, very slow. I would like to know if you figured anything out.

Thanks

Asya Kamsky

Oct 21, 2015, 7:22:33 PM
to mongod...@googlegroups.com
Use mongoimport.  It's been completely rearchitected and rewritten since that thread was started.  
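
For a CSV with a header row, an invocation along these lines is usually all that's needed (the database and collection names here are just placeholders):

mongoimport --db tsdb --collection trf1 --type csv --headerline --file trf1.csv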


--
Asya Kamsky
Lead Product Manager
MongoDB
Download MongoDB - mongodb.org/downloads
Free MongoDB Monitoring - cloud.mongodb.com
Free Online Education - university.mongodb.com
Get Involved - mongodb.org/community
We're Hiring! - https://www.mongodb.com/careers

asimov

Oct 22, 2015, 3:18:29 PM
to mongodb-user
I have to preprocess certain columns in the CSV; for example, I need to convert the Date and Time to a timestamp, clean up invalid dates, rename some columns, etc. Is there a way to do this using mongoimport?
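
If mongoimport can't express that kind of transformation, my fallback would be a small preprocessing pass over the CSV before importing - a rough sketch, where the column names and the date format are only assumptions:

import csv
from datetime import datetime

with open("in.csv", newline="") as src, open("out.csv", "w", newline="") as dst:
    writer = csv.DictWriter(dst, fieldnames=["object_id", "timestamp", "value"])
    writer.writeheader()
    for row in csv.DictReader(src):
        try:
            ts = datetime.strptime(row["Date"] + " " + row["Time"], "%m/%d/%Y %H:%M:%S")
        except ValueError:
            continue                              # skip rows with invalid dates
        writer.writerow({
            "object_id": row["ObjectId"],         # renamed from the original column
            "timestamp": int(ts.timestamp()),
            "value": row["Value"],
        })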

jona...@findmeon.com

Oct 22, 2015, 5:10:24 PM
to mongodb-user
I recently did something similar in Python, migrating several million records from Postgres to mongo.

After some experimenting, I realized the bottleneck was on the "Import", not on Mongo.

To get around that, I did the following:

1. I chunked the data into batches.  With Postgres, that just meant creating a table of "ids_to_migrate", then iteratively querying it (I used exclusive table locking to eliminate duplicate batches).  With a CSV, one would probably have to chunk the file into multiple parts and then use something to co-ordinate who gets a batch.

2. I wrote a script that works as a task-runner.  It pulled N items from the database (500 was a good number for me), inserted them into mongo, then repeated until no items were left.

3. I kept spawning new task-runners, while monitoring CPU/MEM (via top) and running a script that updated the total number of records every minute and calculated the differential.

On my server, I could migrate several thousand records per minute per task-runner, up to 5 task-runners (so roughly 5x that rate in total). Six or more task-runners dropped the efficiency of each runner by more than the gains from the new runner, so overall performance went down.

A 5x improvement on the import was great though.  Everything was migrated by lunch, instead of having to wait overnight.
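
In outline, each task-runner looked something like this (a simplified sketch rather than the actual script; it assumes each runner is handed one pre-chunked CSV file, and all of the names are made up):

import csv, sys
from pymongo import MongoClient

BATCH = 500                                   # ~500 per batch, per the numbers above

def run(chunk_path):
    # one task-runner: read its chunk and insert in batches until nothing is left
    coll = MongoClient().targetdb.records     # hypothetical target collection
    batch = []
    with open(chunk_path) as f:
        for row in csv.DictReader(f):
            batch.append(dict(row))
            if len(batch) == BATCH:
                coll.insert_many(batch)
                batch = []
    if batch:
        coll.insert_many(batch)

if __name__ == "__main__":
    run(sys.argv[1])                          # e.g. python runner.py chunk_03.csv

Several of these then ran in separate processes, monitored with top and the record-count script described above.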

Vikas Meena

May 19, 2016, 9:53:55 AM
to mongodb-user
Hey jona,

Hope you are doing well.

I am trying to do the same thing you did: I have to store data from multiple CSV files of different types in MongoDB, and I am using Python for that. Could you tell me a little more about your script - how exactly did you make those task-runners?

And if you know of any other way of doing this, please do help me out here.