cool questionnaire:
On Sun, May 22, 2011 at 6:35 AM, David Huynh <dfh...@gmail.com> wrote:
> I've given a few talks/tutorials on Google Refine recently and I keep
> getting asked the same question--what do people use it for? I think there's
> a diversity in how Refine is being used, and it'd be good to get a sense of
> how diverse.
> I would like to conduct an informal survey here to ask, what have you used
> Refine for and how? Here are some more concrete questions:
> - What domain of data do you deal with? Journalism? Open government data?
> Scientific data? Business data? Web logs? ...
I mostly use Refine for government spending and budgetary data as well
as general cleanup of code sheets for various types of government
activities, programmes etc.
> - What do the data sets you deal with look like? How many rows and columns?
A typical European member state budget document is about 15 columns
and 140000 rows. Once I'm done with it, I've usually denormalized it
to about 20 columns.
> - What tasks do you perform using Refine? Simple transformations (e.g.,
> fixing date format)? Structural editing (e.g., transposing rows/columns)?
> Clustering to fix inconsistencies? Reconciliation? ...
The most important aspect is data exploration, easy faceting etc. I
try to keep most of the intrusive stuff in Python scripts (I don't
fully trust the repeatability of Refine scripts) but the functions I
use most are:
* Simple transforms and column generation
* Transpositions from columns to rows
* Reconciliation
* Filtered deletion
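
For readers unfamiliar with the column-to-row transposition mentioned in the list above, here is a minimal Python sketch of the same idea done outside Refine. The sample data, the column names and the helper name columns_to_rows are all made up for illustration and are not taken from any real budget file.

```python
import csv
import io


def columns_to_rows(rows, id_fields, value_fields,
                    key_name="year", value_name="amount"):
    """Yield one output row per (input row, value column) pair."""
    for row in rows:
        for field in value_fields:
            # Copy the identifying columns unchanged.
            out = {name: row[name] for name in id_fields}
            # The former column header becomes a value in its own column.
            out[key_name] = field
            out[value_name] = row[field]
            yield out


# Hypothetical wide-format sheet: one column per year.
SAMPLE = """programme,2009,2010
Roads,100,120
Schools,80,95
"""

reader = csv.DictReader(io.StringIO(SAMPLE))
long_rows = list(columns_to_rows(reader, ["programme"], ["2009", "2010"]))

for r in long_rows:
    print(r)
```

Each input row yields one output row per year column, so a sheet with N rows and two year columns becomes 2N rows.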
> - Which feature(s) make you choose Refine as opposed to another tool?
The intuitive web-based UI and its simplicity; data exploration tools,
simple GREL syntax. Running it on your own computer is a big factor,
I'm not sure I will still use the cloud-based version as it will
necessarily not be file-based but running against some big datastore
which probably won't fit into my workflow nearly as well.
> - What tool would you have used instead? Scripting? An ETL tool? Excel?
Not a big Excel user, normally using Python scripts, SQL and document
dbs; Apache Solr; and custom applications but really the niche was
(and I suspect still is) wide open.
- Friedrich
Hello everyone,
I've given a few talks/tutorials on Google Refine recently and I keep getting asked the same question--what do people use it for? I think there's a diversity in how Refine is being used, and it'd be good to get a sense of how diverse.
I would like to conduct an informal survey here to ask, what have you used Refine for and how? Here are some more concrete questions:
- What domain of data do you deal with? Journalism? Open government data? Scientific data? Business data? Web logs? ...
- What do the data sets you deal with look like? How many rows and columns?
- What tasks do you perform using Refine? Simple transformations (e.g., fixing date format)? Structural editing (e.g., transposing rows/columns)? Clustering to fix inconsistencies? Reconciliation? ...
- Which feature(s) make you choose Refine as opposed to another tool?
- What tool would you have used instead? Scripting? An ETL tool? Excel?
Affiliation data about academic publications in astronomy and physics.
> - What do the data sets you deal with look like? How many rows and columns?
I'm currently working on cleaning up affiliation data for authors in
astronomy and in physics. The first set has around 1 million rows and
the second set has 6 million rows. My project has 4 columns (original
affiliation, modified affiliation, emails, bibliographic reference
number) although I work on decreasing the number of single strings
only in one column.
Working with several million rows in Refine is sometimes painful
especially because of the
> - What tasks do you perform using Refine? Simple transformations (e.g.,
> fixing date format)? Structural editing (e.g., transposing rows/columns)?
> Clustering to fix inconsistencies? Reconciliation? ...
I mostly use transformations (mostly in Jython and GREL). No
structural editing, no reconciliation. I use some clustering but due
to the amount of data, have to stick to the safest option only
(fingerprint).
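
As a rough illustration of what fingerprint clustering does, here is a small Python sketch that approximates the idea: normalise each string to a key, then group strings that share a key. It is an approximation written for this note, not Refine's own code, and the affiliation strings in the usage lines are invented.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build an order- and punctuation-insensitive key for a string."""
    # Trim surrounding whitespace and lowercase.
    s = value.strip().lower()
    # Fold accented characters to their closest ASCII form.
    s = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")
    # Drop ASCII punctuation.
    s = s.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, de-duplicate and sort the tokens.
    tokens = sorted(set(s.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values sharing a fingerprint; keep groups with 2+ members."""
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    return [g for g in groups.values() if len(g) > 1]


# Invented affiliation strings, for illustration only.
clusters = cluster([
    "Observatoire de Paris",
    "Paris, Observatoire de",
    "CERN",
])
print(clusters)
```

Because each value is keyed once and grouped in a dictionary, the cost grows linearly with the number of rows, which is one plausible reason this method stays usable on very large sets.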
> - Which feature(s) make you choose Refine as opposed to another tool?
> - What tool would you have used instead? Scripting? An ETL tool? Excel?
We could simply load the data and work on it with a scripting language
(Python or Perl) and regular expressions but Google Refine offers a
nicer graphical alternative and also the possibility to look back in
the history of the modifications.
Benoit.
https://spreadsheets.google.com/viewform?formkey=dGIzbHBtY3hxYkhhNlJycEloNTl2TFE6MQ
Thanks
Iain
I'm curious where/how you came to this suspicion. I routinely rely on
scripted actions with Refine and have found them to work well.
Paul
An example: one of the things I'm doing at the moment is trying to
convert a set of XML-based project descriptions into a CSV list of the
financial support each project has received. This means I have one
script that does the basic flattening of XML into CSV. I then loaded
the result into Refine to make some basic observations: about half of
the projects have a geographic zone, the other half have a country
associated with them. Both zones and countries are dirty, so I used
Refine's clustering mechanism to get them cleaned and then made a
lookup sheet of country -> zone (e.g. Congo -> Sub-saharan Africa)
which I "cross'd" in.
Nice process, but I need to repeat it every month. While I could do
this manually each time, it's much nicer to use a cron job and an
additional script that goes through the raw CSV and looks up countries
and regions against a Google Spreadsheet via the GData API to
normalize them and to derive regions from countries. This now gives me
a new, refined version of the CSV each night that I can cross check
using a third script with a few "assert" statements and then load into
the target DB.
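
A minimal Python sketch of the normalize-then-check steps described above, under loud assumptions: the lookup tables are hard-coded dictionaries standing in for the Google Spreadsheet (no GData calls are shown), and the CSV columns, the alias entries and the function names refine_rows and check are hypothetical. Only the Congo -> Sub-saharan Africa pair comes from the message itself.

```python
import csv
import io

# Stand-ins for the lookup sheet (assumed contents, except Congo's zone).
COUNTRY_ALIASES = {
    "congo": "Congo",
    "rep. congo": "Congo",
    "kenia": "Kenya",
    "kenya": "Kenya",
}
COUNTRY_TO_ZONE = {
    "Congo": "Sub-saharan Africa",
    "Kenya": "Sub-saharan Africa",
}


def refine_rows(rows):
    """Normalise country names and derive a zone for each row."""
    for row in rows:
        raw = row["country"].strip().lower()
        country = COUNTRY_ALIASES.get(raw)
        out = dict(row)
        # Keep the original value when no alias is known.
        out["country"] = country or row["country"]
        out["zone"] = COUNTRY_TO_ZONE.get(country, "")
        yield out


def check(rows):
    """Cross-check the refined rows before loading them anywhere."""
    for row in rows:
        assert row["zone"], "no zone for %r" % row["country"]
        assert float(row["amount"]) >= 0, "negative amount in %r" % row


# Hypothetical flattened CSV, as the first script might produce it.
RAW = "project,country,amount\nP1, congo ,1000\nP2,Kenia,250\n"

refined = list(refine_rows(csv.DictReader(io.StringIO(RAW))))
check(refined)
print(refined)
```

Run from cron, a script like this would fail loudly on an unknown country instead of silently loading a row without a zone.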
I guess the point I'm trying to make is that Refine automation would
only solve half of my problem: I could make the steps in the app
repeatable, but not the larger process of handling the data.
- Friedrich
“Like Excel, but optimized for exploring, cleaning, and transforming really large sheets”
> - Fill in the blank: "It would be awesome if ... Refine ... ".
Command-line tool for running Refine scripts. Address geocoding. Matching geocodes to areas.
Best,
Richard