what do you use Google Refine for?


David Huynh

May 22, 2011, 12:35:14 AM
to <google-refine@googlegroups.com>
Hello everyone,

I've given a few talks/tutorials on Google Refine recently and I keep getting asked the same question--what do people use it for? I think there's a diversity in how Refine is being used, and it'd be good to get a sense of how diverse.

I would like to conduct an informal survey here to ask, what have you used Refine for and how? Here are some more concrete questions:

- What domain of data do you deal with? Journalism? Open government data? Scientific data? Business data? Web logs? ...
- What do the data sets you deal with look like? How many rows and columns?
- What tasks do you perform using Refine? Simple transformations (e.g., fixing date format)? Structural editing (e.g., transposing rows/columns)? Clustering to fix inconsistencies? Reconciliation? ...
- Which feature(s) make you choose Refine as opposed to another tool?
- What tool would you have used instead? Scripting? An ETL tool? Excel?

We can get everyone's input and then summarize them on a wiki page. It would help some of us who do advocacy/outreach for Refine, to tell our audiences how versatile Refine is. It'd also help us developers concentrate on the most used features.

Thanks in advance!

David

fadi maali

May 22, 2011, 12:51:31 PM
to google...@googlegroups.com
Hello,

I mainly use Google Refine to transform tabular data into RDF. My second use is making sense of data, i.e., getting a general idea of what is in a dataset.

- What domain of data do you deal with? Journalism? Open government data? Scientific data? Business data? Web logs? ...

Open government data

- What do the data sets you deal with look like? How many rows and columns?

As I am focusing on datasets about small regions (e.g., a county or a city), most of the datasets are small: on average 100 rows and 5 columns.

- What tasks do you perform using Refine? Simple transformations (e.g., fixing date format)? Structural editing (e.g., transposing rows/columns)? Clustering to fix inconsistencies? Reconciliation? ...

Mainly simple transformations (fixing typos, removing duplicates based on facets only) and reconciliation.

- Which feature(s) make you choose Refine as opposed to another tool?

In order: Facets, scriptability (GREL), bulk edit

- What tool would you have used instead? Scripting? An ETL tool? Excel?

If I didn't use Google Refine, only scripting would serve my needs.

Regards,
Fadi

Jeanne Kramer-Smyth

May 22, 2011, 8:46:38 PM
to google...@googlegroups.com
Let's see:

- What domain of data do you deal with? Journalism? Open government data? Scientific data? Business data? Web logs? ...
So far I have used it to work through candidate terms for building a controlled vocabulary. Example sources of this data are tag lists, search logs and web analytics

- What do the data sets you deal with look like? How many rows and columns?
For this type of work, it is between one and a small number of columns, with between a few hundred and a few thousand rows.

- What tasks do you perform using Refine? Simple transformations (e.g., fixing date format)? Structural editing (e.g., transposing rows/columns)? Clustering to fix inconsistencies? Reconciliation? ...
The magic that Refine gives me for this task is clustering to fix inconsistencies and just to let me get counts on usage of an idea rather than a strict single value.

- Which feature(s) make you choose Refine as opposed to another tool?
I know of no other way to do this!

- What tool would you have used instead? Scripting? An ETL tool? Excel?
Manual torture with Excel.

I also hope to use this to explore other open data sets, electronic records metadata and other web logs.

Thanks!
Jeanne Kramer-Smyth
http://www.spellboundblog.com


Håkan Jonsson

May 23, 2011, 3:47:49 AM
to google-refine
Hi,

> - What domain of data do you deal with? Journalism? Open government data?
> Scientific data? Business data? Web logs? ...

I mainly use it for cleaning up survey results. I use Google Docs
Forms to create surveys, and then clean up the data using Google
Refine.

> - What do the data sets you deal with look like? How many rows and columns?

Small sets. A couple of hundred rows and up to about 50 columns.

> - What tasks do you perform using Refine? Simple transformations (e.g.,
> fixing date format)? Structural editing (e.g., transposing rows/columns)?
> Clustering to fix inconsistencies? Reconciliation? ...

First of all, finding errors or inconsistencies in data, then fixing
them.
Also simple transformations, e.g., changing formats.

> - Which feature(s) make you choose Refine as opposed to another tool?

Facets on different types

> - What tool would you have used instead? Scripting? An ETL tool? Excel?

Google Docs.

/Håkan

Luigi Selmi

May 23, 2011, 4:22:00 AM
to google...@googlegroups.com
Hi David and all,

Q) What domain of data do you deal with? Journalism? Open government data? Scientific data? Business data? Web logs? ..
R) Government and business data

Q) What do the data sets you deal with look like? How many rows and columns?
R) Tabular data with thousands of rows, 20 or more columns

Q) What tasks do you perform using Refine? Simple transformations (e.g., fixing date format)? Structural editing (e.g., transposing rows/columns)?
R) Mainly reconciliation towards our and others' services after some transformations and editing and also mapping into some ontology using the RDF extension

Q) Which feature(s) make you choose Refine as opposed to another tool?
R) Facets, extensibility, freely available with license

Q) What tool would you have used instead? Scripting? An ETL tool? Excel?
R) Hard to say. Scripting or a Java application to do reconciliation, but without the functionality to improve the quality of the data

Best regards

Luigi Selmi






Friedrich Lindenberg

May 23, 2011, 4:40:30 AM
to google...@googlegroups.com
Hi David,

cool questionnaire:

On Sun, May 22, 2011 at 6:35 AM, David Huynh <dfh...@gmail.com> wrote:
> I've given a few talks/tutorials on Google Refine recently and I keep
> getting asked the same question--what do people use it for? I think there's
> a diversity in how Refine is being used, and it'd be good to get a sense of
> how diverse.
> I would like to conduct an informal survey here to ask, what have you used
> Refine for and how? Here are some more concrete questions:
> - What domain of data do you deal with? Journalism? Open government data?
> Scientific data? Business data? Web logs? ...

I mostly use Refine for government spending and budgetary data as well
as general cleanup of code sheets for various types of government
activities, programmes etc.

> - What do the data sets you deal with look like? How many rows and columns?

A typical European member state budget document is about 15 columns
and 140000 rows. Once I'm done with it, I've usually denormalized it
to about 20 columns.

> - What tasks do you perform using Refine? Simple transformations (e.g.,
> fixing date format)? Structural editing (e.g., transposing rows/columns)?
> Clustering to fix inconsistencies? Reconciliation? ...

The most important aspect is data exploration, easy faceting etc. I
try to keep most of the intrusive stuff in python scripts (I don't
fully trust the repeatability of refine scripts) but the functions I
use most are:

* Simple transforms and column generation
* Transpositions from columns to rows
* Reconciliation
* Filtered deletion

> - Which feature(s) make you choose Refine as opposed to another tool?

The intuitive web-based UI and its simplicity; data exploration tools,
simple GREL syntax. Running it on your own computer is a big factor,
I'm not sure I will still use the cloud-based version as it will
necessarily not be file-based but running against some big datastore
which probably won't fit into my workflow nearly as well.

> - What tool would you have used instead? Scripting? An ETL tool? Excel?

Not a big Excel user; I normally use Python scripts, SQL and document
dbs, Apache Solr, and custom applications, but really the niche was
(and I suspect still is) wide open.

- Friedrich

Gary Frederick

May 23, 2011, 10:11:09 AM
to google...@googlegroups.com
:-)


On Saturday, May 21, 2011 11:35:14 PM UTC-5, David Huynh wrote:

- What domain of data do you deal with? Journalism? Open government data? Scientific data? Business data? Web logs? ...
various, mostly seeing what it can do
 
- What do the data sets you deal with look like? How many rows and columns?
not big, 10 or fewer columns and 200-1000 rows
 
- What tasks do you perform using Refine? Simple transformations (e.g., fixing date format)? Structural editing (e.g., transposing rows/columns)? Clustering to fix inconsistencies? Reconciliation? ...

some simple transformations, reconciliation, structural editing

- Which feature(s) make you choose Refine as opposed to another tool?
I was looking at Google Fusion Tables and Refine was mentioned.

I am cleaning up my data and then saving in JSON or some other source. It's very easy to modify to get exactly what I want.
 
- What tool would you have used instead? Scripting? An ETL tool? Excel?
Python scripting
what's Excel? ;-)

Benoit Thiell

May 23, 2011, 11:11:10 AM
to google...@googlegroups.com
On Sun, May 22, 2011 at 12:35 AM, David Huynh <dfh...@gmail.com> wrote:
> - What domain of data do you deal with? Journalism? Open government data?
> Scientific data? Business data? Web logs? ...

Affiliation data about academic publication in astronomy and physics.

> - What do the data sets you deal with look like? How many rows and columns?

I'm currently working on cleaning up affiliation data for authors in
astronomy and in physics. The first set has around 1 million rows and
the second set has 6 million rows. My project has 4 columns (original
affiliation, modified affiliation, emails, bibliographic reference
number) although I work on decreasing the number of single strings
only in one column.

Working with several million rows in Refine is sometimes painful
especially because of the

> - What tasks do you perform using Refine? Simple transformations (e.g.,
> fixing date format)? Structural editing (e.g., transposing rows/columns)?
> Clustering to fix inconsistencies? Reconciliation? ...

I mostly use transformations (mostly in Jython and GREL). No
structural editing, no reconciliation. I use some clustering but due
to the amount of data, have to stick to the safest option only
(fingerprint).
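
[Editor's illustration] For readers unfamiliar with the fingerprint method mentioned above, here is a rough Python sketch of the general idea: values that reduce to the same normalized key are grouped together. This is a simplified approximation and not Refine's actual implementation; the function names `fingerprint` and `cluster` and the sample affiliation strings are made up for illustration.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Simplified key: lowercase, drop accents and punctuation, sort unique tokens."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    # Remove combining marks left over from decomposing accented characters.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove ASCII punctuation, then split on whitespace.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group distinct values that share a key; keep only groups with variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical affiliation strings, only for demonstration.
sample = ["MIT, Cambridge", "Cambridge MIT", "CERN"]
clusters = cluster(sample)
```

Because the key ignores case, punctuation and token order, this kind of clustering is conservative: it only merges strings made of the same words, which is presumably why it counts as the "safest" option on very large data.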

> - Which feature(s) make you choose Refine as opposed to another tool?

> - What tool would you have used instead? Scripting? An ETL tool? Excel?

We could simply load the data and work on it with a scripting language
(Python or Perl) and regular expressions, but Google Refine offers a
nicer graphical alternative and also the possibility to look back in
the history of the modifications.

Benoit.

David Huynh

May 23, 2011, 12:26:32 PM
to <google-refine@googlegroups.com>
Thanks to everyone who has replied so far! Please keep the responses coming!

I've thought of a few more questions here, just as suggestions of how you might describe your usage of Refine.

- Do you ever collaborate with other people on a common Refine project?

- Do you care either way if Refine is a desktop app or a hosted web app (like Google Spreadsheets)? Or do you want to host it yourself in your intranet?

- What other tools do you use Refine *in conjunction* with? E.g., Google Fusion Tables, R, Matlab, ...

- Is there some name registry that you want to reconcile to? E.g., names of politicians in country X, names of consumer products, names of cardinals and popes in the first millennium CE, names of bird species, ...

- How do you describe Refine to someone else? E.g., "Excel++", "Excel + database hybrid", ...

- Fill in the blank: "It would be awesome if ... Refine ... ".

Thanks :)

David

Iain Sproat

May 23, 2011, 12:39:04 PM
to google...@googlegroups.com
I've put the questions into a Google Docs form, if you could add your
response there it would be appreciated:

https://spreadsheets.google.com/viewform?formkey=dGIzbHBtY3hxYkhhNlJycEloNTl2TFE6MQ

Thanks

Iain

Douglas Galbi

May 23, 2011, 7:25:59 PM
to google...@googlegroups.com

I've only experimented a bit with Google Refine. Reconciliation, which I would use to add information about companies, services, and geographic entities, would provide the most added value to me. Most of my data cleaning is so simple that the torture of Excel hasn't been enough to push me over the fixed cost of getting good at Google Refine. But I think that getting good at Google Refine is a worthy investment for me, and I hope to do it.

Thanks for your work on this data tool.

Sincerely,
Douglas

Paul Makepeace

May 23, 2011, 8:09:06 PM
to google...@googlegroups.com
On Mon, May 23, 2011 at 09:40, Friedrich Lindenberg
<friedrich....@okfn.org> wrote:
> The most important aspect is data exploration, easy faceting etc. I
> try to keep most of the intrusive stuff in python scripts (I don't
> fully trust the repeatability of refine scripts)

I'm curious where/how you came to this suspicion. I routinely rely on
scripted actions with Refine and have found them to work well.

Paul

Friedrich Lindenberg

May 24, 2011, 6:45:28 AM
to google...@googlegroups.com
Hi,

An example: one of the things I'm doing at the moment is trying to
convert a set of XML-based project descriptions into a CSV list of the
financial support each project has received. This means I have one
script that does the basic flattening of XML into CSV, I then loaded
it into Refine to make some basic observations: about half of the
projects have a geographic zone, the other half has a country
associated with them. Both zones and countries are dirty, so I used
refine's clustering mechanism to get them cleaned and then made a
lookup sheet of country -> zone (e.g. Congo -> Sub-saharan Africa)
which I "cross'd" in.

Nice process, but I need to repeat it every month. While I could do
this manually each time, it's much nicer to use a cron job and an
additional script that goes through the raw CSV and looks up countries
and regions against a Google Spreadsheet via the GData API to
normalize them and to derive regions from countries. This now gives me
a new, refined version of the CSV each night that I can cross check
using a third script with a few "assert" statements and then load into
the target DB.
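
[Editor's illustration] The normalize-then-check steps described above might look roughly like the following Python sketch. It is only a guess at the shape of such a pipeline, not the actual scripts: the lookup tables are hard-coded dictionaries standing in for the Google Spreadsheet, and the column names (`project`, `country`, `amount`), the function names and the sample rows are all hypothetical.

```python
import csv
import io

# Hypothetical lookup tables; in the workflow described above these would
# come from a shared spreadsheet rather than being hard-coded.
COUNTRY_ALIASES = {"congo": "Congo", "rep. congo": "Congo", "kenya": "Kenya"}
COUNTRY_TO_ZONE = {"Congo": "Sub-saharan Africa", "Kenya": "Sub-saharan Africa"}


def refine_rows(raw_csv: str):
    """Normalize the 'country' column and derive a 'zone' column."""
    rows = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        key = row["country"].strip().lower()
        country = COUNTRY_ALIASES.get(key, row["country"].strip())
        row["country"] = country
        row["zone"] = COUNTRY_TO_ZONE.get(country, "")
        rows.append(row)
    return rows


def check_rows(rows):
    """Sanity checks to run before loading into the target database."""
    for row in rows:
        assert row["zone"], f"no zone for country {row['country']!r}"
        assert float(row["amount"]) >= 0, f"negative amount in {row}"


# Made-up input standing in for the flattened CSV.
raw = "project,country,amount\nA, congo ,1000\nB,Rep. Congo,250.5\n"
rows = refine_rows(raw)
check_rows(rows)
```

Keeping the checks in a separate function means a failed assertion stops the nightly run before anything reaches the database.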

I guess the point I'm trying to make is that Refine automation would
only solve half of my problem: I could make the steps in the app
repeatable, but not the larger process of handling the data.

- Friedrich

Richard Cyganiak

May 24, 2011, 8:06:17 AM
to google...@googlegroups.com
On 23 May 2011, at 17:26, David Huynh wrote:
> - How do you describe Refine to someone else? E.g., "Excel++", "Excel + database hybrid", ...

“Like Excel, but optimized for exploring, cleaning, and transforming really large sheets”

> - Fill in the blank: "It would be awesome if ... Refine ... ".

Command-line tool for running Refine scripts. Address geocoding. Matching geocodes to areas.

Best,
Richard

David Huynh

May 25, 2011, 6:16:03 PM
to google...@googlegroups.com
Hi everyone,

Thanks again for responding to this survey! Feel free to continue responding here or using Iain's spreadsheet form


I've collated all the responses here


David

drLization

Jun 7, 2011, 1:01:46 AM
to google...@googlegroups.com
Done, filled in for my situation :)

Great tool!