Extension of OpenRefine with HBase

Nikhil Siddhartha

Sep 28, 2013, 2:09:59 PM

to openr...@googlegroups.com

Hi,

I am a 4th-year Computer Science student at IIIT Hyderabad. Our team of 3 members is trying to extend the storage capabilities of OpenRefine with HBase. I would like to know whether, in your opinion, this is a feasible project. Any kind of feedback or guidance would be greatly appreciated.

fabio.tacchelli

Sep 29, 2013, 6:05:07 AM

to openr...@googlegroups.com

Hi there!

Using HBase could be a good improvement for loading massive datasets.

Currently I'm working on a smaller scale (I wanted to load 4 million rows with less than 8 GB of RAM), implementing in-memory compression with the lowest possible performance impact, but HBase would be on a different level :P

Tom Morris

Sep 29, 2013, 6:19:36 PM

to openr...@googlegroups.com

On Sat, Sep 28, 2013 at 2:09 PM, Nikhil Siddhartha <nikhilsi...@gmail.com> wrote:

Hi,
I am a 4th-year Computer Science student at IIIT Hyderabad. Our team of 3 members is trying to extend the storage capabilities of OpenRefine with HBase. I would like to know whether, in your opinion, this is a feasible project. Any kind of feedback or guidance would be greatly appreciated.

That certainly sounds feasible, but I don't have enough hands-on experience with HBase to offer any concrete guidance.

Another interesting backend might be Google's BigQuery API.

Tom

Stefano Mazzocchi

Sep 29, 2013, 7:24:26 PM

to openr...@googlegroups.com

Remember that feasible might not be usable.

The key usability feature of Refine is its speed: the UI is designed around a backend with almost no latency (<100ms) and immediate action. Having such a backend is what made this special style of interaction design possible. If you move it, as-is, over to a backend that takes several seconds per operation (plus network time when it is served from a hosted solution instead of running on your local machine), the usability of the whole app will drop to the point of not being much more useful than other solutions. You will have succeeded in getting something working, but likely nobody will want to use it.

This should not prevent you from experimenting with it, but set your expectations accordingly: you will never have a scalable OpenRefine that feels like the one running in local memory on your own machine. No matter how much smart software and hardware you throw at it, it will be at the mercy of I/O latencies (disk and, far worse, network).

Tom rightly suggests BigQuery, which is heavily optimized for facet-style filtering operations (you have no idea how much), but its latency is at least an order of magnitude worse (seconds instead of hundreds of milliseconds) than what OpenRefine would need for a seamless transition (even without including network latency). HBase will likely add another order of magnitude (tens of seconds per query, and Refine could send many simultaneous queries).
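To make those orders of magnitude concrete, here is a rough back-of-envelope sketch in Python. The per-query figures (100 ms, 1 s, 10 s) are only round stand-ins for "<100ms", "seconds" and "tens of seconds"; the number of queries per UI refresh and the concurrency level are made-up assumptions, and none of this is measured from OpenRefine, BigQuery or HBase.

```python
# Back-of-envelope latency model. Every number here is an illustrative
# assumption, not a measurement of any real backend.
BACKENDS_MS = {
    "local in-memory": 100,    # stand-in for "<100ms"
    "BigQuery-like": 1_000,    # stand-in for "seconds"
    "HBase-like": 10_000,      # stand-in for "tens of seconds"
}


def refresh_time_ms(per_query_ms, n_queries, concurrency):
    """Time for one UI refresh when n_queries run in batches of `concurrency`."""
    batches = -(-n_queries // concurrency)  # ceiling division
    return batches * per_query_ms


def summarize(n_queries=5, concurrency=1):
    """Refresh time per backend for a hypothetical refresh of n_queries queries."""
    return {
        name: refresh_time_ms(ms, n_queries, concurrency)
        for name, ms in BACKENDS_MS.items()
    }


if __name__ == "__main__":
    for name, ms in summarize().items():
        print(f"{name:16s} {ms / 1000:6.1f} s per refresh")
```

Even if every query of a refresh ran fully in parallel (concurrency equal to the number of queries), the refresh could not finish faster than a single query, so under these assumptions the slower backends stay at seconds or tens of seconds per click.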

If I were you, I'd try with Dremel (aka BigQuery) first because if that doesn't work, there is no reason to even try to use HBase (or similar distributed database solutions).
