Memory leak when processing huge XML and creating new database entries - Can I stream to database?


Einar Magnússon

Oct 15, 2016, 12:53:38 PM
to LoopbackJS
Hi,
I'm processing a 350 MB XML file on my LoopBack server, and of course I immediately ran into trouble with my previous method of just reading the whole file and using xpath to extract what I need. The XML is basically a very long list of entries:

<xml>
  <entry>
    ...
  </entry>
  <entry>
    ...
  </entry>
  ...
</xml>

So I tried using readline to extract each entry on its own, process it, insert it into the database (I'm using a local MongoDB), and then continue reading the XML. This works fine for 10 minutes or so, and then the script fails with 'Allocation failed - process out of memory'.
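Roughly what my readline version looks like (simplified; parseEntry, the file name and the Entry model are placeholders, not my actual code):

var fs = require('fs');
var readline = require('readline');

var rl = readline.createInterface({
  input: fs.createReadStream('entries.xml')
});

var lines = [];
rl.on('line', function (line) {
  lines.push(line);
  if (line.indexOf('</entry>') !== -1) {
    var entry = parseEntry(lines.join('\n')); // xpath extraction, omitted here
    lines = [];
    // Fire-and-forget: nothing slows the reader down, so unfinished
    // create() calls (and their data) pile up until V8 runs out of heap.
    // 'app' is the LoopBack app; 'Entry' stands in for my actual model.
    app.models.Entry.create(entry, function (err) {
      if (err) console.error(err);
    });
  }
});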

I've googled around a bit, and here are two posts by someone with a very similar problem:

It seems that what I need to do is implement this as a stream all the way through, so that the built-in backpressure mechanism throttles the reads to match the slow write side. But don't I then need to implement a streaming write into the database as well? Is that possible using the built-in model API, or do I need a third-party streaming module like 'stream-to-mongo'?
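If a plain object-mode Writable around the model API counts as "streaming into the database", I imagine it would look something like this (just my guess, not working code; someEntryStream is a placeholder for a parser stream I don't have yet):

var Writable = require('stream').Writable;

// An object-mode Writable that wraps Entry.create(). Piping into it should
// give backpressure, because the source gets paused whenever the sink's
// internal buffer fills up while creates are still pending.
var dbSink = new Writable({
  objectMode: true,
  write: function (entry, encoding, callback) {
    app.models.Entry.create(entry, callback);
  }
});

// someEntryStream would emit one parsed entry object at a time.
someEntryStream.pipe(dbSink);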

My current workaround is to wait until each entry has been saved before I start processing the next.
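In code, that workaround is roughly this change to the handler above:

// Instead of fire-and-forget, pause the reader until the save completes.
// (readline can still emit a few already-buffered lines after pause(),
// but this keeps the backlog small.)
rl.pause();
app.models.Entry.create(entry, function (err) {
  if (err) console.error(err);
  rl.resume();
});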

*** A corollary question: while the script is creating all these entries (in the future it will mostly be updating, e.g. removing expired entries), the API seems to become completely unresponsive. Since Node should be able to handle all of this in a non-blocking way, my guess is that the mongo server is the bottleneck? Any thoughts on how to improve the situation?

Cheers,
Einar

Raymond Feng

Oct 15, 2016, 1:22:30 PM
to loopb...@googlegroups.com
What is the size of each entry, and how do you parse the XML?

I'd probably do the following:

1. Create a read stream from the XML document
2. Use the `sax` module to parse it, process it entry by entry using its opentag/closetag events, and write each entry into mongodb (rough sketch below)
3. Use something like `heapdump` to profile memory usage
4. Adjust the Node.js heap size if necessary. See https://www.quora.com/How-does-Node-js-do-memory-allocation

The key is to do the work in small chunks, asynchronously, so that the event-loop thread isn't blocked for too long.
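A rough sketch of steps 1 and 2 (untested; the file name, the Entry model and the field handling are placeholders you'd adapt to your schema):

var fs = require('fs');
var sax = require('sax');

var fileStream = fs.createReadStream('entries.xml');
var saxStream = sax.createStream(true); // strict mode

var current = null;    // entry object being built
var currentTag = null; // child element currently open

saxStream.on('opentag', function (node) {
  if (node.name === 'entry') {
    current = {};
  } else if (current) {
    currentTag = node.name;
  }
});

saxStream.on('text', function (text) {
  if (current && currentTag && text.trim()) {
    current[currentTag] = text.trim();
  }
});

saxStream.on('closetag', function (name) {
  currentTag = null;
  if (name !== 'entry' || !current) return;
  var entry = current;
  current = null;
  // Throttle the source while mongodb absorbs the write. This is only
  // approximate backpressure (already-buffered data keeps parsing), but
  // it keeps the number of in-flight creates small.
  fileStream.pause();
  app.models.Entry.create(entry, function (err) {
    if (err) console.error(err);
    fileStream.resume();
  });
});

saxStream.on('error', function (err) {
  console.error('parse error:', err);
});

fileStream.pipe(saxStream);

For 4, the heap limit can be raised with e.g. `node --max-old-space-size=4096 server.js` (adjust to your entry script), but if the streaming works you shouldn't need it.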

Thanks,
Sent from my iPhone 6 Plus

Einar Magnússon

Oct 16, 2016, 1:27:24 PM
to LoopbackJS
Thanks for your reply,

The size of each entry is approximately 1 KB, but the entries are grouped in bunches of about 60 KB, so I'm parsing each 60 KB chunk in one go.
I'm currently parsing with 'readline' and just checking for the element I'm looking for. It works similarly to your suggestion, but it's not really streaming.

I started out writing the code with an XML streaming parser but had some trouble understanding how to do it, and then I read up a bit more on streams and decided it didn't make sense anyway if I couldn't stream all the way into the database.

You are right that I should profile what's actually going on; I haven't had much experience doing that, but I have to learn some day. I was actually trying to figure out how to do it with StrongLoop Arc, but do you recommend using heapdump instead?

Cheers,
Einar

Raymond Feng

Oct 16, 2016, 1:39:47 PM
to loopb...@googlegroups.com
StrongLoop Arc works too. It provides a GUI.
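If you do want to try heapdump at some point, usage is roughly this (the module is `heapdump` on npm):

var heapdump = require('heapdump');

// Write a snapshot from code (for example every N entries)...
heapdump.writeSnapshot(function (err, filename) {
  console.log('heap snapshot written to', filename);
});

// ...or send SIGUSR2 to the process to trigger one:
//   kill -USR2 <pid>

Load the resulting .heapsnapshot files into Chrome DevTools (Memory tab) and compare two of them to see what keeps growing.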


Sent from my iPhone 6 Plus