large input file causes out-of-memory

Matan Safriel

Oct 20, 2016, 05:34:35

To: OpenRefine

Hi,

I've taken OpenRefine for a spin with a 5 GB XML file.

It seems that no amount of memory (up to a 16 GB -Xmx heap) is enough to get past the second step of creating a new OpenRefine project.

Are there any tips and tricks?

Thanks,

Matan

Owen Stephens

Oct 20, 2016, 08:42:29

To: OpenRefine

My general experience is that if you hit issues importing data into OpenRefine (OR) and you've already raised the memory to your maximum, the only thing left is to pre-process the data to get it ready to load into OR.

Sometimes changing format can help (e.g. xls may get imported more easily than equivalent csv)

Sometimes making sure you are only importing the data you really need can help (e.g. empty columns in spreadsheets can cause problems on import, so removing those columns in Excel beforehand is highly advisable and should let the import go through)

Sometimes you have to only import a partial set of data (either fewer records, or only selected fields from each record)

My feeling (which I can't quantify, but which is based on my general experience) is that hierarchical data causes more performance issues than tabular data, even for similar amounts of data. So the first thing I'd look at is whether you can flatten the XML to CSV outside OR. Of course, this may not be possible.
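
As a rough illustration of flattening outside OR, here is a minimal Python sketch that streams the XML with the standard library's iterparse and writes one CSV row per record, so the whole file never has to sit in memory. The function name, the record tag, and the field names are hypothetical placeholders, and the sketch assumes that records are direct children of the root element, that the wanted fields are direct children of each record, and that the file uses no XML namespaces.

```python
import csv
import xml.etree.ElementTree as ET


def flatten_xml_to_csv(xml_path, csv_path, record_tag, fields):
    """Stream records from a large XML file into a flat CSV file.

    record_tag and fields are placeholders: adjust them to your schema.
    Returns the number of records written.
    """
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(fields)  # header row
        # iterparse yields elements incrementally instead of building the full tree
        context = ET.iterparse(xml_path, events=("start", "end"))
        _, root = next(context)  # the first event is the start of the root element
        for event, elem in context:
            if event == "end" and elem.tag == record_tag:
                # A missing field becomes an empty cell
                writer.writerow([elem.findtext(f, default="") for f in fields])
                count += 1
                # Drop records that have already been written to keep memory use flat
                root.clear()
    return count
```

A call might look like flatten_xml_to_csv("big.xml", "big.csv", "record", ["id", "title"]), where the file names and tags are again made up. Listing only the fields you need also covers the "import only selected fields" idea, since everything else is discarded during the pass.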

Even if this is possible, I'd say 5 GB of data is likely to cause OR issues anyway.

If so, then you'd have to fall back on working with a subset of the data - so the next question would be whether you could extract just the relevant parts of each XML record to create a smaller import file you can work on.

Or whether it makes sense to split the data into smaller sets of records and work on these smaller record sets one by one.
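
Splitting into smaller record sets can be sketched along the same lines. The Python below is only one possible approach, not something OR provides: it streams the input and writes a new file every N records. The function name, the "records" wrapper element, and the output naming scheme are hypothetical, and it again assumes that records are direct children of the root element.

```python
import xml.etree.ElementTree as ET


def split_xml(xml_path, out_prefix, record_tag, records_per_file, wrapper_tag="records"):
    """Split a large XML file into files of at most records_per_file records.

    Each output file wraps its records in a wrapper_tag element (a placeholder name).
    Returns the list of file paths written, in order.
    """
    paths = []
    batch = []

    def flush():
        # Write the current batch to the next numbered file, then empty the batch
        path = f"{out_prefix}{len(paths) + 1:04d}.xml"
        with open(path, "w", encoding="utf-8") as out:
            out.write(f"<{wrapper_tag}>\n")
            out.write("\n".join(batch))
            out.write(f"\n</{wrapper_tag}>\n")
        paths.append(path)
        batch.clear()

    context = ET.iterparse(xml_path, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            elem.tail = None  # do not carry inter-record whitespace into the output
            batch.append(ET.tostring(elem, encoding="unicode"))
            root.clear()  # release records that are already serialized
            if len(batch) == records_per_file:
                flush()
    if batch:
        flush()  # write the final, possibly smaller, batch
    return paths
```

Each resulting file can then be imported as its own OR project. A sensible chunk size depends on the record structure and the available heap, so it is best found by experiment.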

If none of this is possible or helps, I suspect you are at the point where OR is not going to do the job for you and it's time to look at alternative tools.

Owen

qi cui

Oct 20, 2016, 22:03:51

To: OpenRefine

Owen is right: a 5 GB XML file is too much for 16 GB of RAM. If you just want to find the upper limit of OR, you can start with a 1 GB XML file. Or, if you really need to process such a big file, you can check RefinePro.com, which provides AWS instances to suit your needs.

Joe Wicentowski

Oct 23, 2016, 04:37:36

To: OpenRefine

Also, for issues importing XML into OR, see
https://github.com/OpenRefine/OpenRefine/issues/1095.

I turned to pre-processing my XML data - turning it into TSV with an
XQuery before loading into OR.
