memory limit ?


Cato Nano

Jul 7, 2015, 6:39:23 AM
to openr...@googlegroups.com
Hello

I have a ~15 million line CSV file.

Is OpenRefine the right tool for such a file?

I have created a project with that file, but the import process has been running for 60 minutes (and counting) and is constantly using ~380% CPU.

I see a message informing me that 99% of the memory is in use, which amounts to roughly 1.2 GB.

But I have 8 GB on this machine. How do I instruct OpenRefine to use more memory?

Will that help and make the import process faster?

Thanks

Owen Stephens

Jul 7, 2015, 6:58:06 AM
to openr...@googlegroups.com
I'd say that 15 million lines is more than OpenRefine can comfortably handle. However, the number of lines is not the only factor - in my experience a large number of columns will cause performance problems even with relatively small numbers of rows. I've been working on a 1 million row, 3 column project with 4 GB of RAM allocated to OpenRefine, and that worked OK; although OpenRefine was a little sluggish for some operations, it didn't slow down enough to be a problem. My general experience suggests that multi-million line files are probably not going to work effectively in OpenRefine. If you need OpenRefine functionality for such files, you might want to consider breaking the file down into several smaller files, working with one of them, and then applying the same transformations to the other files (although whether this is an effective strategy may depend on the type of data and the work you are doing).
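One way to do that kind of split on a Unix-like system - file names and the chunk size here are assumptions for illustration - is with head, tail, and split, repeating the header row so each chunk imports into OpenRefine on its own:

```shell
# Sketch: split a large CSV into ~1M-row chunks, each with the header.
# "big.csv" and the chunk size are illustrative, not from the thread.
head -n 1 big.csv > header.csv                  # save the header row
tail -n +2 big.csv | split -l 1000000 - chunk_  # 1M data rows per chunk
for f in chunk_*; do
  cat header.csv "$f" > "$f.csv" && rm "$f"     # prepend header to each chunk
done
```

Each resulting chunk_*.csv is then a self-contained CSV you can import and transform separately.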

My experience also suggests that creating the project in the first place is the limiting step - if I can get the data into OpenRefine at all, then I can probably use OpenRefine to work with it (although again some operations can be problematic - facets, clustering, etc.). Different file formats perform differently on import - I have a feeling that importing Excel files takes less memory than the equivalent CSV, but that's a vague memory of trying this out on a single file that was giving me problems, not something I've experimented with substantially. If you are seeing 100% (or close to it) memory usage on the import for several minutes, I suspect the import process is going to fail, to be honest.

All of that said, I would certainly suggest increasing the amount of memory available to OpenRefine - instructions are at https://github.com/OpenRefine/OpenRefine/wiki/FAQ:-Allocate-More-Memory
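For what it's worth, the linked page boils down to raising the Java heap. On Linux/Mac the launcher script accepts a -m flag (on Windows you instead edit REFINE_MEMORY in refine.ini); the value below is just an example, not a recommendation:

```shell
# Start OpenRefine with a larger Java heap (4 GB here is illustrative).
# Windows users: set REFINE_MEMORY=4096M in refine.ini instead.
./refine -m 4096M
```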

Owen

Thad Guidry

Jul 7, 2015, 10:20:23 AM
to openrefine
OpenRefine was originally intended, designed, and tested around a 1 million line workload.  Memory is both your friend and your enemy in overcoming that original design, and adding more of it is not always the most effective way forward for certain use cases.

In addition to what Owen said, you might think about the operations you actually want to perform on that file.
GNU tools like awk and sed are quite handy and have no limitation on file size other than those of the underlying disk format (ext2, ext3, etc.).
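To illustrate the sort of streaming edit those tools make cheap even on a 15-million-line file - the column index, file names, and typo below are made up for the example:

```shell
# Stream-process a large CSV without loading it into memory.
# Column 2, "input.csv", and the typo are assumptions for illustration.
awk -F',' 'BEGIN{OFS=","} {$2 = toupper($2); print}' input.csv > upper.csv  # upper-case column 2
sed 's/recieve/receive/g' input.csv > fixed.csv                             # fix a recurring typo
```

Both commands read the file line by line, so memory use stays flat no matter how many rows you have.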

If you're looking to do large pattern analysis or clustering on a column (or columns) with millions or billions of rows, you might try database technologies like MongoDB, or even search technologies like Elasticsearch. I have used both for very expressive multi-million record search/analyze/replace/transform work (you just have to learn their provided functions, or sometimes plugins, which can perform a lot of magic for just a little bit of learning).



