Importing pipe/tilde separated file into Mongo via Mongoimport


Srinivas

Dec 22, 2010, 8:35:34 PM
to mongodb-user
Hi,

I am trying to port data from a Sybase database into Mongo via the
mongoimport utility. From the command line options, I see that
mongoimport accepts data as JSON, CSV or TSV (tab separated). I am
extracting data from Sybase using Sybase's bcp tool, where I can
specify a delimiter. The problem is that I have columns which can
contain spaces/commas, so using the csv/tsv options in mongoimport is
not possible (i.e. it is possible but doesn't help me). When I try to
import these files using mongoimport, the documents pushed into Mongo
are not right (e.g. the field "78.5% IDC" goes into the database as
two different fields instead of one).

I could convert the BCP files from Sybase into JSON, but that means
parsing them again and converting them into a JSON-format file (which
is going to take a lot of time, since I have several large tables of
over 200 million rows each).

Is there some undocumented option to pass to mongoimport that allows
specifying a non-standard delimiter to get around this? If not, could
such an option be added to mongoimport?

My setup is mongo-stable-server-20101220-mongodb_1 on 64-bit CentOS
5.5, and Sybase ASE 12.5.2 on a different server (32-bit RHEL 4
Update 4).

Srinivas.

Eliot Horowitz

Dec 22, 2010, 8:52:12 PM
to mongod...@googlegroups.com
There isn't an option now, but there is a case for one:
http://jira.mongodb.org/browse/SERVER-1008

I would recommend parsing the file yourself and inserting directly
into Mongo, bypassing mongoimport.
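
A rough sketch with the C++ driver would be something like this
(untested; the file path, field names and collection name below are
just placeholders):

#include <fstream>
#include <sstream>
#include <string>
#include "client/dbclient.h"   // C++ driver header in the source tree; path may vary

using namespace std;
using namespace mongo;

int main() {
    DBClientConnection c;
    c.connect("localhost:27017");

    ifstream in("/pathto/table.bcp");        // pipe-delimited bcp output
    string line;
    while (getline(in, line)) {
        // split each row on '|' and build one document per row
        stringstream ss(line);
        string f1, f2;
        getline(ss, f1, '|');
        getline(ss, f2, '|');
        BSONObjBuilder b;
        b.append("col1", f1);                // real column names go here
        b.append("col2", f2);
        c.insert("myDB.myColl", b.obj());
    }
    return 0;
}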


Andrew Armstrong

Dec 22, 2010, 10:01:33 PM
to mongodb-user
I've never used Sybase, but is there an escape option for your
export? A CSV file, for example, can still accept commas in a field
value; they just need to be escaped (e.g. quoted or backslashed). I
assume mongoimport respects escape sequences.
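
For example (made-up record), a row like this should keep the comma
inside a single field, assuming mongoimport honors standard CSV
quoting:

name,description
"Acme, Inc","78.5% IDC"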

- Andrew

D Boyd

Dec 23, 2010, 10:08:49 AM
to mongodb-user
I would vote for adding the option to mongoimport. In the meantime,
though, converting from one delimiter to another can be
straightforward with regex replacements.
I am assuming that if you are using | as a delimiter, it does not
occur in your data. The conversion would be three steps:
1. Escape all existing quotes, so " becomes \"
2. Replace all existing delimiters with new ones and quote the fields.
3. Put a " at the start and end of each line.

Something along these lines might work (this is off the top of my
head and may not be exactly right):

cat <file> | sed -e 's/"/\\"/g' | sed -e 's/|/","/g' | sed -e 's/^/"/' | sed -e 's/$/"/' > <outfile>
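
For example (made-up values), a bcp line like

ACME|78.5% IDC|12.50

should come out as

"ACME","78.5% IDC","12.50"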


You don't mention having embedded newlines in your data, which would
really complicate things.

Another option would be a short Java program that reads from Sybase
using JDBC and writes directly to Mongo.
Then you could do any denormalization you need in that same step.

Srinivas

Dec 23, 2010, 6:09:11 PM
to mongodb-user
Thanks for the responses. My source data comes from a Sybase server
which is my DR box, and my approximate window to grab all data for
this particular database is 4 hours (the only time in the whole week
when no database changes are happening). This means that a snapshot
of the database taken in this window is going to be "consistent",
since no table-level data will be changed during it.

@Eliot - What I did initially was to grab all tables from Sybase with
pipe or '~~' as separators and push them into Mongo via batch_inserts.
This made sense for the small tables (<10 million rows), where I
could see insert rates of approx 30,000 rows/s (my entire setup is
currently proof-of-concept and the box I have for Mongo is a beefed-up
desktop, definitely not server class). However, I have a few large
tables (200+ million rows) for which a rate of 30K rows/s is just not
enough.
When I shut down mongod and run mongoimport with the --dbpath
parameter, I see rates of 70,000 rows/s (which makes a huge difference
to the time taken, since I have a limited maintenance window in which
to complete this exercise). Hence the desire to use mongoimport...

@Andrew - Some of the table bcp-out files are over 20 GB in size, and
there's no way I can get Sybase's bcp (a proprietary tool) to escape
commas in the files it generates. That would mean running each file
through a Perl script to do this, and hence additional time needed.

My data does not have embedded newlines, AFAIK. I did try reading this
data straight from Sybase, but that means crazy selects - I want *all*
the table data, which means a table scan (and will lock that table for
the duration of the select).
Using bcp (similar to mongoexport) is several times faster and lets me
pull out a 200-million-row table in 40 mins (the select would take
_much_ longer), which is why I don't do an "online" select (from
Sybase) and insert (into Mongo).

Srinivas.

Srinivas

Dec 31, 2010, 5:50:30 PM
to mongodb-user
Hi,
I managed to hack the source of import.cpp (into custom_import.cpp;
this was from the mongo 1.6.5 source) to allow arbitrary user-defined
separators, and it works pretty well. I haven't checked it for
performance etc., but my source data goes into Mongo just fine...

mongocustom_import -h localhost:27017 -u admin -p XXXXXX --dbpath /var/lib/mongo -d myDB -c PreIndex -f "collist1-n" --type osv --drop --stopOnError --zsep '~~' --file /pathto/PreIndex.out.utf8

Fri Dec 31 14:26:35 [tools] CMD: drop myDB.PreIndex
OSV type with separator ~~
Fri Dec 31 14:26:35 [tools] building new index on { _id: 1 } for myDB.PreIndex
Fri Dec 31 14:26:35 [tools] done for 0 records 0.008secs
...
imported 6319145 objects
Fri Dec 31 14:28:34 dbexit:
Fri Dec 31 14:28:34 [tools] shutdown: going to close listening sockets...
Fri Dec 31 14:28:34 [tools] shutdown: going to flush oplog...
Fri Dec 31 14:28:34 [tools] shutdown: going to close sockets...
Fri Dec 31 14:28:34 [tools] shutdown: waiting for fs preallocator...
Fri Dec 31 14:28:34 [tools] shutdown: closing all files...
Fri Dec 31 14:28:35 closeAllFiles() finished

Fri Dec 31 14:28:35 [tools] shutdown: removing fs lock...
Fri Dec 31 14:28:35 dbexit: really exiting now
Fri Dec 31 14:28:35 PST 2010: Finished

Approx rate of import => 53100 r/s


For those interested, copy http://onepwr.dyndns.org/code/custom_import.cpp
as custom_import.cpp into mongodb-src-r1.6.5/tools/. Also modify
SConstruct to have:
normalTools = [ "dump" , "restore" , "export" , "import" , "files" ,
"stat" , "custom_import" ]

Then "scons all" will build mongocustom_import.cpp in root folder.
Strip it and use as needed. I guess I could get rid of a lot of
unnecessary stuff in here as this program does'nt need to be as
"flexible" as mongoimport.

I was also wondering if anyone could point me to how to do a
batch/bulk insert using the C++ driver? I realized that mongoimport
also does a single insert at a time.
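
From dbclient.h it looks like insert() has an overload that takes a
vector<BSONObj>, so maybe something along these lines is the way to
batch them up (untested; parseLine() is just a stand-in for whatever
builds the BSONObj from a row):

// conn is a connected DBClientConnection, in is the open bcp file
void bulkLoad(DBClientConnection& conn, istream& in) {
    vector<BSONObj> batch;
    batch.reserve(1000);
    string line;
    while (getline(in, line)) {
        batch.push_back(parseLine(line));       // row -> BSONObj conversion
        if (batch.size() == 1000) {             // one insert message per 1000 docs
            conn.insert("myDB.PreIndex", batch);
            batch.clear();
        }
    }
    if (!batch.empty())
        conn.insert("myDB.PreIndex", batch);    // flush the remainder
}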

Srinivas.




