Re: Suitability of OpenRefine for custom ETL process?

Thad Guidry

Dec 4, 2012, 4:59:44 PM
to openr...@googlegroups.com
Why not use a dedicated open-source ETL tool such as Talend Open Studio or Pentaho Data Integration?

How convoluted do your transformations need to be?

Can you give us some examples of the transformations you need, applied to a few sample rows?


On Tue, Dec 4, 2012 at 3:44 PM, Justin Wiley <justin...@gmail.com> wrote:
Forgive me if I'm duplicating an existing thread; I searched but didn't find anything similar.

I recently came across Google Refine / OpenRefine, and noticed that it can record and export transformations in JSON form.

I'm considering a custom ETL process using OpenRefine, consisting of two steps:
  1. transform a sample of raw data in OpenRefine and store the transformations
  2. replay the transformations on new data in OpenRefine
I'm imagining step 1 would occur on a local machine, and step 2 would occur on a private cloud VM with live data (after the locally created transformations had been synced).
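
As a minimal sketch of what step 2 could look like, assuming a local OpenRefine server on its default port (3333), an already-created project, and a history extracted via Undo/Redo > Extract... in the UI (the project id and file name below are placeholders, and newer OpenRefine releases may also require a CSRF token):

    # Replay a saved operation history against a running OpenRefine server.
    # Assumes OpenRefine on its default port (3333) and an existing project;
    # the project id and file name are placeholders.
    import requests

    REFINE_URL = "http://127.0.0.1:3333"
    PROJECT_ID = "1234567890123"  # hypothetical project id

    # JSON exported from "Undo/Redo > Extract..." in the OpenRefine UI
    with open("customer_a_transforms.json") as f:
        operations = f.read()

    resp = requests.post(
        REFINE_URL + "/command/core/apply-operations",
        params={"project": PROJECT_ID},
        data={"operations": operations},
    )
    resp.raise_for_status()
    print(resp.json())  # {"code": "ok"} on success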

I've seen client libraries like https://github.com/maxogden/refine-ruby/ that handle step 2 programmatically.
  • Are there known / obvious drawbacks to this approach?
  • I anticipate processing in the realm of 6,500 rows of CSV data per hour with a modest number of transformations... I assume the server could handle this?
  • As Google Refine becomes OpenRefine, do you expect it will change dramatically in a way that leaves me stranded?
Thanks for your hard work, awesome project!!

Justin

--
-Thad
http://www.freebase.com/view/en/thad_guidry

Alex Gordon

Dec 4, 2012, 5:01:59 PM
to openr...@googlegroups.com
Justin, can you please give more detail on exactly what kinds of transformations you are performing?



Justin Wiley

Dec 5, 2012, 9:47:59 AM
to openr...@googlegroups.com
Sure. Essentially we need to transform customer data to fit our API.

For example:

Given our product's API requires the following fields:  customer_number, email, race, ethnicity

And Customer A gives us a CSV with:  visit_number, email_address, race_and_ethnicity

We need to rename the columns appropriately, and split race_and_ethnicity into two fields using a regex or some heuristic.
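
In OpenRefine's extracted-history terms, those two steps would look roughly like the following. This is only a sketch: the exact field names vary across OpenRefine versions, and the slash separator is just an assumed delimiter for illustration.

    [
      {
        "op": "core/column-rename",
        "oldColumnName": "visit_number",
        "newColumnName": "customer_number",
        "description": "Rename column visit_number to customer_number"
      },
      {
        "op": "core/column-split",
        "engineConfig": {"facets": [], "mode": "row-based"},
        "columnName": "race_and_ethnicity",
        "mode": "separator",
        "separator": "/",
        "regex": false,
        "guessCellType": false,
        "removeOriginalColumn": true,
        "description": "Split column race_and_ethnicity by separator"
      }
    ]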

We assume each additional customer B, C...N will produce a different CSV, which will require custom transforms.

Off-the-shelf ETL does seem like the logical approach. However, from what I understand of Pentaho and similar tools, they:
  • have a high learning curve, and/or require consultants
  • don't allow for test-driven development, i.e. verifying that transforms behave reliably
OpenRefine is appealing because the GUI is a lot more attractive than the huge custom Eclipse mess I see in demos, and because the transforms are JSON, they seem easier to parse and work with programmatically.
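
Because the history is plain JSON, a pre-replay sanity test can be as simple as this sketch (the file name and whitelist are hypothetical):

    # Assert properties of a transform history before replaying it on live
    # data. The file name and the allowed-operations set are hypothetical.
    import json

    with open("customer_a_transforms.json") as f:
        ops = json.load(f)

    ALLOWED = {"core/column-rename", "core/column-split", "core/text-transform"}
    unknown = [o["op"] for o in ops if o["op"] not in ALLOWED]
    assert not unknown, "unexpected operations: %s" % unknown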

Thad Guidry

Dec 5, 2012, 10:33:15 AM
to openr...@googlegroups.com
Justin,

This is indeed the realm of regular pipeline ETL tools. Any programming language like Python or Ruby could also be used to hack this up quickly, though perhaps your team is limited in its skill set for that.
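
For instance, the rename-and-split case above is only a few lines of plain Python. This sketch assumes slash-separated race/ethnicity values and the column names from Justin's example; the delimiter and file names are illustrative guesses.

    # Hand-rolled version of the rename + split transform on a CSV file.
    import csv

    with open("customer_a.csv", newline="") as src, \
         open("normalized.csv", "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(
            dst, fieldnames=["customer_number", "email", "race", "ethnicity"])
        writer.writeheader()
        for row in reader:
            # assumed format: "race/ethnicity" in a single field
            race, _, ethnicity = row["race_and_ethnicity"].partition("/")
            writer.writerow({
                "customer_number": row["visit_number"],
                "email": row["email_address"],
                "race": race.strip(),
                "ethnicity": ethnicity.strip(),
            })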

Your use case is exactly what I am doing in Pentaho at this very moment... I took a customer's CSV file and split multi-valued fields where appropriate, using a simple regex along with an additional formula. I built the Transformation from one that I had previously built for the same case just a few months ago. Many ETL products let you rinse, repeat, and reuse your existing code or transformations to adapt to changing customers' needs. For your case, it's probably just a few tweaks here and there as you go through each customer's CSV file. Piece of cake.

I'd be happy to help you wire something up quickly in about 5 minutes that handles your case in Pentaho.

In fact, a quick sample of your use case is already done for you in Pentaho.

You can also just download and install Pentaho Data Integration 4.4 and, if on Windows, start it with the spoon.bat file (which will launch the Java startup and configuration and get you running).

Go to File, then Import From An XML File, then navigate to the folder path /pdi-ce-4.4.0-stable/data-integration/samples/transformations and load the sample file called "CSV Input - Reading customer data.ktr".
Then double-click the boxy component on the screen called "CSV file input" and change or tweak it to your desire.
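
Once the .ktr is saved, replaying it on new files can be scripted too. A sketch using PDI's command-line runner Pan, wrapped in Python; the install path, .ktr path, and option syntax are assumptions, so check them against your own PDI install (pan.sh is Pan.bat on Windows):

    # Run a saved Pentaho transformation with Pan from a script.
    import subprocess

    exit_code = subprocess.call([
        "/opt/pdi-ce-4.4.0-stable/data-integration/pan.sh",  # Pan.bat on Windows
        "-file=/etl/customer_a.ktr",  # hypothetical transformation path
        "-level=Basic",
    ])
    if exit_code != 0:
        raise SystemExit("transformation failed with exit code %d" % exit_code)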

Reach me in private mail, offlist, if you want more help.

Martin Magdinier

Dec 27, 2012, 11:38:25 AM
to openr...@googlegroups.com
Justin,

I am doing similar work too, integrating CSV files coming from various partners (hence different field names and formats) into a single database.

We use OpenRefine to explore each partner's file and design the process (what needs to be done). Then we reproduce it in Talend Studio for the ETL part. I haven't found Talend that hard to learn for basic processing. It also has a drag-and-drop interface for each sub-process, pretty good documentation, and an active forum.

Let us know how you manage this.
Martin
