Re: Running Refine on App Engine?

Thad Guidry

Nov 1, 2012, 1:59:27 PM

to refine

You need a pure ETL approach to your problem. Refine is not well suited to your task, since it is not engineered for pipelining the way other ETL tools are. You also do not need to port it into the cloud. That has been tried, but it didn't get very far because of a few issues (though don't let that stop you from hacking your heart out, if you want to).

Both Talend Open Studio and Pentaho Data Integration (Kettle) have drop-in de-duplicating components (unique row / record) that could handle billions of rows / records on only 1 GB of RAM, if you wanted to.
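To give a feel for how de-duplication can stay within a fixed memory budget, here is a minimal sketch of one common technique: sort the data in chunks that fit in memory, spill each sorted chunk to disk, then merge the chunks and drop adjacent duplicates. This is an illustration only, not a description of how the Talend or Pentaho components are implemented. The function names, the `chunk_rows` parameter, and the assumption that each record is one line of a UTF-8 text file are all made up for the example.

```python
import heapq
import os
import tempfile
from itertools import groupby


def _write_sorted_run(rows, tmp_dir):
    """Sort one in-memory chunk and spill it to a temporary file."""
    rows.sort()
    fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".run", text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as run:
        run.writelines(rows)
    return path


def dedupe_file(src_path, dst_path, chunk_rows=1_000_000):
    """Write the unique lines of src_path to dst_path, in sorted order.

    Hypothetical helper: at most chunk_rows lines are held in memory at once.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        run_paths = []
        chunk = []
        with open(src_path, encoding="utf-8") as src:
            for line in src:
                # Normalise the line ending so a final line without "\n"
                # still compares equal to its duplicates.
                chunk.append(line.rstrip("\n") + "\n")
                if len(chunk) >= chunk_rows:
                    run_paths.append(_write_sorted_run(chunk, tmp_dir))
                    chunk = []
        if chunk:
            run_paths.append(_write_sorted_run(chunk, tmp_dir))

        runs = [open(p, encoding="utf-8") for p in run_paths]
        try:
            with open(dst_path, "w", encoding="utf-8") as dst:
                # heapq.merge streams the sorted runs lazily; groupby then
                # collapses each group of equal neighbouring lines to one.
                for line, _ in groupby(heapq.merge(*runs)):
                    dst.write(line)
        finally:
            for run in runs:
                run.close()
```

A call such as `dedupe_file("rows.csv", "rows_unique.csv")` (file names are placeholders) would leave one copy of each distinct line, at the cost of losing the original row order. Every spilled chunk is held open during the merge, so a very large input with a small `chunk_rows` would need a multi-pass merge to stay under the operating system's open-file limit.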

My suggestion is to just use Pentaho for your repeatable tasks.

Pentaho's download link: http://sourceforge.net/projects/pentaho/files/Data%20Integration/ and help center: http://infocenter.pentaho.com

On Thu, Nov 1, 2012 at 5:22 AM, Kenan <ken...@google.com> wrote:

Hi all,

Is it possible to run Refine on Google's App Engine, since it's written in Java? I've never ported anything to the cloud myself, so I have absolutely no idea what kind of modifications would be needed, but I could most probably find someone with the right know-how to help me if this is indeed possible.

The reason I'm asking is that I am running into memory issues when de-duplicating spreadsheets with more than 5 million rows, and I suppose using App Engine instead of my Linux instance may be an option? For several reasons I can't use more than 12 GB of RAM with my Linux instance.

Cheers,
Kenan

--
-Thad
http://www.freebase.com/view/en/thad_guidry

Kenan Bardt

Nov 6, 2012, 11:18:18 AM

to google...@googlegroups.com

Thanks a lot for your answer. I don't think what I'm doing is worth the effort of having Refine ported; it's probably easier to write a tool from the ground up that does just what I currently do in Refine. I will check out Pentaho and see whether I can integrate it into my projects in the meantime!

Cheers,

Kenan

Tom Morris

Nov 6, 2012, 11:44:07 AM

to google-refine

I suspect Thad was talking about a different kind of porting ("cloud" is pretty all-encompassing and could involve porting to something like BigQuery, depending on what you were trying to do).

Rather than worrying about any kind of porting at all, I'd look into using Amazon EC2 or Google Compute Engine to run the existing Refine -- assuming of course that it's a good fit functionality-wise for what you're trying to do.

Tom

Kenan Bardt

Nov 6, 2012, 11:49:15 AM

to google...@googlegroups.com

Will it scale, though? As I understood Thad, Refine is not programmed to scale in the cloud (e.g. like software that can't really take advantage of multiple CPUs, or that won't be able to handle twice as much data if the RAM is doubled)?

Tom Morris

Nov 6, 2012, 12:28:06 PM

to google-refine

I could give you a theoretical answer, but it'd only cost you a couple of dollars to find the actual answer for your real world data.

The stated design center for Refine is under 1M rows, but that's a simplistic rule of thumb. Refine will use as much memory as you can give your JVM. Refine gets a certain amount of parallelism through the use of separate threads for various types of operations (handling connections, background save processing, etc.), but the core GREL algorithms themselves are generally not parallelized (i.e. it won't run N copies of an algorithm over N row sets, each holding 1/N of the rows, even if the operations are easily decomposed).
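For readers unfamiliar with the "N copies over 1/N row sets" idea, the following is a rough sketch of what such a decomposition looks like for a row-by-row transform. It is not Refine code and says nothing about Refine's internals; the function names and the `executor_cls` parameter are invented for the illustration. Note that in CPython a thread pool will not speed up CPU-bound string work, so the sketch only shows the shape of the split, apply, and recombine steps.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(rows, n):
    """Split rows into n contiguous slices of near-equal size."""
    size, extra = divmod(len(rows), n)
    slices, start = [], 0
    for i in range(n):
        # The first `extra` slices take one additional row each.
        end = start + size + (1 if i < extra else 0)
        slices.append(rows[start:end])
        start = end
    return slices


def transform_rows(rows, transform):
    """Apply a per-row transform to one slice."""
    return [transform(row) for row in rows]


def parallel_transform(rows, transform, n=4, executor_cls=ThreadPoolExecutor):
    """Run the transform over n slices concurrently, preserving row order."""
    slices = split_rows(rows, n)
    with executor_cls(max_workers=n) as pool:
        # Executor.map yields results in the order the slices were submitted.
        results = pool.map(transform_rows, slices, [transform] * n)
        return [row for part in results for row in part]
```

For example, `parallel_transform([" a", "b ", " c "], str.strip, n=2)` trims every value while keeping the rows in their original order. This only works because each row is transformed independently of the others; operations that need to see the whole column, such as de-duplication, do not decompose this simply.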

In my experience, most performance problems with Refine are caused by either a) heap thrashing from too little memory allocated to the VM or b) browser issues with very large tables (e.g. many columns) or lists (big facet choice lists). If you look at the JVM stats for the 12 GB heap JVM with your 5M row data set, you may be able to get a sense of whether heap contention is an issue.

My gut feeling is that 5M rows should be easily doable with sufficient memory, but the only way to tell is to try it.

Tom

Thad Guidry

Nov 6, 2012, 1:06:49 PM

to refine

Tom is right that 5M should be doable. I have tested and handled 10 million rows with 3 columns on a 20 GB instance, using GREL string transforms on the columns. But the fact that the core GREL algorithms are not parallelized is what I was referring to when I suggested an ETL pipeline solution as an alternative; it makes up for Refine's deficiency there. In my testing, it was about 100 times faster to use Pentaho or Talend for the same kind of work.