Processing a 5gig+ JSON file and pushing data to CouchDB - running out of memory


Miles Burton

Mar 28, 2014, 9:48:44 AM
to nod...@googlegroups.com
Hey guys,

I'm working on some semi-large data-sets and I'm attempting to upload them to CouchDB.

My logic is fairly simple:
* Read the local file as a stream
* Pipe to JSONStream and pull out each entity
* Listen for data event and save to CouchDB with forceSave: false

What seems to be happening is that each CouchDB upload takes a while, so unsaved entities accumulate and slowly eat up memory until the process falls over.

I can really only see two solutions:
* Combine entities into an array (maybe 10k entities at a time), or
* throttle the number of events JSONStream can emit.

Does anyone have any reading material which would help? I'm fairly new to streams in Node.js, and CouchDB doesn't seem to have a streamable write.

Stefan Klein

Mar 28, 2014, 10:05:19 AM
to nod...@googlegroups.com
Hi,

I had to import a zipped CSV file (~6 million lines) into CouchDB. Here's what I did:

I piped the CSV through a transform function to get JSON, and implemented a writable stream which buffered 20k documents (each one isn't very big) and wrote them in one go using CouchDB's _bulk_docs API. Worked like a charm.
Sadly, CouchDB doesn't handle overly large _bulk_docs requests gracefully; it just crashes. :/

regards,
Stefan



--
--
Job Board: http://jobs.nodejs.org/
Posting guidelines: https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
You received this message because you are subscribed to the Google
Groups "nodejs" group.
To post to this group, send email to nod...@googlegroups.com
To unsubscribe from this group, send email to
nodejs+un...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/nodejs?hl=en

---
You received this message because you are subscribed to the Google Groups "nodejs" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nodejs+un...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Miles Burton

Mar 28, 2014, 10:11:21 AM
to nod...@googlegroups.com

Yeah, I used a similar approach with Mongo.

Is it not possible to limit the number of events emitted by JSONStream?

Sent from my phone, apologies for brevity


Stefan Klein

Mar 28, 2014, 10:30:45 AM
to nod...@googlegroups.com
2014-03-28 15:11 GMT+01:00 Miles Burton <miles....@gmail.com>:

Yeah I used a similar approach with mongo.

Is it not possible to limit the number of events emitted by JSONStream?

If you don't use the data event, but instead pipe your JSONStream into your own writable stream and call the callback once you've got the response from CouchDB, Node.js will handle pausing and resuming for you.
Why would you want to pause/resume or limit the stream yourself?

--
Stefan

Floby

Mar 29, 2014, 7:22:21 AM
to nod...@googlegroups.com
Your approach is good, but using the data event won't give you enough control over the speed of the flow.

I have a similar use case, only with XML files. At the end of my pipeline is a writable stream I called a CradlePusher, because I used the cradle module.

It's implemented as a streams2 Writable stream, which will handle back-pressure for you when implemented correctly. The whole point of these is that they're easier to get right than old streams.

CradlePusher.prototype._write = function _write(resource, encoding, callback) {
    if (typeof resource.id === 'undefined') {
        return callback(new TypeError('Given resource does not have an ID'));
    }

    var self = this;

    this._db.get(resource.id, function (err, doc) {
        if (!err) {
            self._updateDoc(doc, resource, callback);
        }
        else if (err.error === 'not_found' && err.reason === 'missing') {
            self._pushNewDoc(resource, callback);
        }
        else {
            return callback(err);
        }
    });
};

My use case also has update logic, which is why I'm not using bulk updates anymore and have to get a doc before updating it. But the interesting part is the use of the write method's callback: calling it at the right time gets back-pressure handled more gracefully.

Bruno Jouhier

Mar 29, 2014, 8:41:33 AM
to nod...@googlegroups.com
I have all sorts of data pumps running between JSON files, MongoDB, MySQL and Oracle. Memory usage is completely flat (I actually used them to debug memory leaks in the Oracle driver).

The pumps are built with https://github.com/Sage/ez-streams

For example JSON to mongo:

var ez = require('ez-streams');

function jsonFileToMongo(errHandler, sourceFileName, targetMongoCollection) {
    var reader = ez.devices.file.text.reader(sourceFileName);
    var writer = ez.devices.mongodb.writer(targetMongoCollection);
    reader.transform(ez.transforms.json.parser()).pipe(errHandler, writer);
}

Ez-streams are callback-based rather than event-based. This has two important consequences:
  • back-pressure is a non-issue: you don't control back-pressure by wiring events but by inserting .buffer(bufsize) elements into your processing chains.
  • exception handling is robust: every chain ends with a "reducer" (pipe is a reducer). All exceptions that are not caught explicitly in the transforms are propagated to the reducer's error handler.
There is no device yet for CouchDB, but it takes very little to write one (the mongo device is 26 lines, the oracle device 31 lines).

Bruno

Miles Burton

Mar 29, 2014, 8:55:37 AM
to nod...@googlegroups.com
Thanks all, I'll look into these solutions and see what I can come up with. Cheers


--
Many thanks,
Miles Burton