Working on large datasets in pandas

harsh

Apr 12, 2017, 11:23:44 PM
to PyData
I want to work with 50 GB of data stored in XML format in pandas, doing custom transformations from the XML data to CSV. But my system hangs when I try to load the 50 GB data file. Any suggestions would be appreciated.

luc.k

Apr 13, 2017, 6:14:35 AM
to PyData
Did you try xmldataset?
I've never tried it on such a large dataset, but it gives you a way to read XML into pandas.
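A rough sketch of how that could look, assuming xmldataset's parse_using_profile entry point (the profile syntax, tag names, and file name below are illustrative; check the project docs):

import xmldataset
import pandas as pd

# Illustrative profile: collect fields of each <record> element into a
# dataset named "records" -- adjust the paths to your real schema
profile = """
root
    record
        id   = dataset:records
        name = dataset:records"""

# Note: this reads the whole document into memory, so it may not get
# around the 50 GB problem by itself
with open("data.xml") as f:  # placeholder file name
    output = xmldataset.parse_using_profile(f.read(), profile)

df = pd.DataFrame(output["records"])  # one row per <record>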

Olivier Jeulin

Apr 13, 2017, 9:33:17 AM
to PyData
You should use XSLT or XQuery to transform the XML to CSV first. That's all the more true if you only need a subset of the data contained in the XML.
Or, in Python, use streamed parsing…
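A minimal sketch of that streamed approach with the standard library's iterparse (the <record> tag, column names, and file names are placeholders for your schema):

import csv
import xml.etree.ElementTree as ET

with open("output.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["id", "name", "value"])  # assumed columns
    # iterparse yields each element as its closing tag is read, so the
    # whole 50 GB file is never held in memory at once
    for event, elem in ET.iterparse("data.xml", events=("end",)):
        if elem.tag == "record":
            writer.writerow([
                elem.findtext("id"),
                elem.findtext("name"),
                elem.findtext("value"),
            ])
            elem.clear()  # drop the element's children to keep memory flat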
For a huge file like that, the best choice would be to use the streaming capability of XSLT 3: I highly recommend Saxon (http://www.saxonica.com/download/download_page.xml) for speed and efficiency (paid version, or use a trial licence).

But if you want to stick with free software, you can use XQuery and load your file into an XML database (eXist-db http://www.exist-db.org/ or BaseX http://basex.org/).
Both XSLT and XQuery have a "text" output mode.

If you don't know XSLT already, try XQuery; it's easier to use.
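And once the XML has been flattened to CSV, pandas can apply the custom transformations without loading the whole file, using read_csv's chunksize parameter (the file names here are placeholders):

import pandas as pd

first = True
for chunk in pd.read_csv("big.csv", chunksize=100000):
    out = chunk  # placeholder: apply the custom transformations here
    # append each processed chunk, writing the header only once
    out.to_csv("transformed.csv", mode="a", header=first, index=False)
    first = False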