How do I reconcile against my own data?

Binayak

Oct 12, 2011, 1:04:33 PM
to Google Refine
Hi,

I have a dataset that includes a column of application names (say). I
would like to save these names in some fashion, so that if ever I have
another dataset that includes application names, I can reconcile my
'official' set to get the canonical name. Again, this data is all
private to me, and doesn't come from dbpedia or any other source.

It seems that I have 2 choices:
1. Reconcile against a sparql endpoint. This sounds complicated to
set up.
2. Reconcile against an RDF dump. This sounds better, but I'm
lacking information about the expected format of said RDF dump.

Does anyone have a small example of #2, or even some clarity on how
best to do this with Google Refine?

Is there a third option that I'm missing?

Thanks!

fadi maali

Oct 12, 2011, 3:58:19 PM
to google...@googlegroups.com
Hi,

Google Refine can reconcile against structured data in multiple formats, not only RDF. It comes with the ability to reconcile against Freebase. For reconciling against other sources, a standard API needs to be implemented on top of the source (the API is documented at: http://code.google.com/p/google-refine/wiki/ReconciliationServiceApi ). A few people have already implemented the API on top of a number of data sources, for example OpenCorporates (http://opencorporates.com/) and Kasabi (http://kasabi.com/doc/api/reconciliation ).
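
A rough sketch of what a client and a service exchange under that API, as I read the wiki page linked above: the client sends a `queries` parameter holding a JSON object keyed by arbitrary ids, and the service answers, per key, with a `result` list of candidates carrying `id`, `name`, `type`, `score` and `match`. The helper names and the sample ids are made up for illustration.

```python
import json
from urllib.parse import parse_qs, urlencode


def build_recon_request(names, limit=3):
    """Encode a batch of names as a form-encoded 'queries' parameter."""
    queries = {f"q{i}": {"query": name, "limit": limit}
               for i, name in enumerate(names)}
    return urlencode({"queries": json.dumps(queries)})


def decode_recon_request(body):
    """What a service would do first: recover the query batch from the body."""
    return json.loads(parse_qs(body)["queries"][0])


def best_matches(response_text):
    """Map each query key to (id, name, score) of its top candidate, or None."""
    best = {}
    for key, payload in json.loads(response_text).items():
        candidates = payload.get("result", [])
        if candidates:
            top = max(candidates, key=lambda c: c.get("score", 0))
            best[key] = (top["id"], top["name"], top.get("score"))
        else:
            best[key] = None
    return best


# Hand-written sample response (hypothetical ids), not output from a real service.
SAMPLE_RESPONSE = json.dumps({
    "q0": {"result": [
        {"id": "app/1", "name": "Mozilla Firefox", "type": [], "score": 80.0, "match": False},
        {"id": "app/2", "name": "Firefox OS", "type": [], "score": 55.0, "match": False},
    ]},
    "q1": {"result": []},
})

print(decode_recon_request(build_recon_request(["Firefox", "MS Word"])))
print(best_matches(SAMPLE_RESPONSE))
```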

The two options you listed with RDF are provided by the RDF extension (http://lab.linkeddata.deri.ie/2010/grefine-rdf-extension/ ), which is a third-party extension and not part of the core of Google Refine. (Disclaimer: I am this third party :-) ) The RDF dump option accepts any RDF file (N3, Turtle, RDF/XML...); you just need to tell the extension which property (or properties) the file uses for labeling the entities.
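
As a small illustration of such a dump, here is a sketch that turns a plain list of names into Turtle with one labeled entity per name. The choice of `rdfs:label` as the labeling property and the `http://example.org/app/` base URI are my assumptions for the example; any property works as long as you name it when setting up the service.

```python
def turtle_literal(text):
    """Escape a string for use inside a double-quoted Turtle literal."""
    escaped = (text.replace("\\", "\\\\")
                   .replace('"', '\\"')
                   .replace("\n", "\\n")
                   .replace("\r", "\\r"))
    return f'"{escaped}"'


def names_to_turtle(names, base="http://example.org/app/"):
    """Emit one '<uri> rdfs:label "name" .' triple per name (base URI is hypothetical)."""
    lines = ["@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .", ""]
    for i, name in enumerate(names, start=1):
        lines.append(f"<{base}{i}> rdfs:label {turtle_literal(name)} .")
    return "\n".join(lines) + "\n"


print(names_to_turtle(["Mozilla Firefox", 'Acme "Pro" Editor']))
```

Saving the printed text as a `.ttl` file should give a file that can be offered to the RDF dump option, with `rdfs:label` named as the label property.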

Hope that helps.

Regards,
Fadi

Tom Morris

Oct 12, 2011, 4:41:32 PM
to google...@googlegroups.com
A standalone toy reconciliation service might be something useful as
an example for people building reconciliation services as well as
something useful in and of itself. I'm thinking of something like a
little Python web server that reads a simple CSV or JSON file
containing the "database." Nothing like this exists today, to my
knowledge, but it wouldn't be that hard to put together.
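
To make the idea concrete, here is a minimal sketch of such a toy service using only the Python standard library. It is not an existing project: the CSV layout (`id` and `name` columns), the `difflib` scoring, the port, and the `example.org` identifiers are all assumptions, and it has not been tested against Refine itself.

```python
import csv
import difflib
import io
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# Hypothetical inline "database"; a real service would read a CSV file instead.
SAMPLE_CSV = ("id,name\napp/1,Mozilla Firefox\n"
              "app/2,Microsoft Word\napp/3,LibreOffice Writer\n")

# Service metadata returned when a request carries no queries (values made up).
METADATA = {"name": "Toy CSV reconciliation",
            "identifierSpace": "http://example.org/app/",
            "schemaSpace": "http://example.org/schema/"}


def load_names(csv_text):
    """Return (id, name) pairs from CSV text with 'id' and 'name' columns."""
    return [(r["id"], r["name"]) for r in csv.DictReader(io.StringIO(csv_text))]


def reconcile(query, rows, limit=3):
    """Score every row against the query; return the top candidates."""
    scored = []
    for row_id, name in rows:
        ratio = difflib.SequenceMatcher(None, query.lower(), name.lower()).ratio()
        scored.append({"id": row_id, "name": name, "type": [],
                       "score": round(ratio * 100, 1), "match": ratio == 1.0})
    scored.sort(key=lambda c: c["score"], reverse=True)
    return scored[:limit]


def answer(queries_json, rows):
    """Answer a batch of queries shaped like {"q0": {"query": "..."}, ...}."""
    queries = json.loads(queries_json)
    return {key: {"result": reconcile(q["query"], rows, q.get("limit", 3))}
            for key, q in queries.items()}


class ReconHandler(BaseHTTPRequestHandler):
    rows = load_names(SAMPLE_CSV)

    def _handle(self, params):
        if "queries" in params:
            payload = answer(params["queries"][0], self.rows)
        else:
            payload = METADATA
        data = json.dumps(payload).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def do_GET(self):
        self._handle(parse_qs(urlparse(self.path).query))

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self._handle(parse_qs(self.rfile.read(length).decode("utf-8")))


def serve(port=8000):
    """Run the toy service on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), ReconHandler).serve_forever()
```

Calling `serve()` starts it on http://127.0.0.1:8000/. A real client may expect more than this sketch provides (JSONP-wrapped metadata, for instance), so treat it as a starting point only.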

You might also look at using the cross() function to reference other
Refine projects. You could keep your standard database as a Refine
project and use the cross() function to do lookups against it.
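
In GREL such a lookup would look something like `cell.cross("Official Apps", "alias")[0].cells["canonical"].value`, where the project and column names are hypothetical. The Python sketch below only mimics the behaviour as I understand it: an exact-value join against a reference table, with no fuzzy matching.

```python
def build_index(reference_rows, key_column):
    """Group reference rows by the exact value found in key_column."""
    index = {}
    for row in reference_rows:
        index.setdefault(row[key_column], []).append(row)
    return index


def lookup(value, index, target_column):
    """Return target_column of the first row whose key equals value, else None."""
    matches = index.get(value, [])
    return matches[0][target_column] if matches else None


# Hypothetical "official" project: one row per known alias.
official = [
    {"alias": "MS Word", "canonical": "Microsoft Word"},
    {"alias": "Word", "canonical": "Microsoft Word"},
    {"alias": "FF", "canonical": "Mozilla Firefox"},
]
index = build_index(official, "alias")

print(lookup("FF", index, "canonical"))  # Mozilla Firefox
print(lookup("ff", index, "canonical"))  # None: the comparison is exact
```

The second lookup shows the limitation: anything that is not an exact match needs to be normalized first or handled by a real reconciliation service.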

Tom

Michael Lenczner

Oct 12, 2011, 4:42:57 PM
to google...@googlegroups.com
On Wed, Oct 12, 2011 at 4:41 PM, Tom Morris <tfmo...@gmail.com> wrote:
> A standalone toy reconciliation service might be something useful as
> an example for people building reconciliation services as well as
> something useful in and of itself.  I'm thinking of something like a
> little Python web server that reads a simple CSV or JSON file
> containing the "database."  Nothing like this exists today, to my
> knowledge, but it wouldn't be that hard to put together.
>
FYI - I would be interested in that.

Thad Guidry

Oct 12, 2011, 4:59:26 PM
to google...@googlegroups.com
I mentioned to David that Refine has a web server component, Jetty.  It also can read and absorb many formats already and store them in a columnar form (a pseudo-database) as a Refine Project.

Refine itself could handle most of this for smaller to medium datasets, I think.

What Refine currently lacks for reconciling against its own projects is a better "mapping" interface, where a user can quickly weight matches (using n-grams, etc.) across the columns they want weighted. Something more than the cross() function using strings alone. (Currently you can do a bit of this with Cross(value.ngram).blah constructs.)
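
One way to picture that kind of weighting, as a sketch only: compute an n-gram similarity per column and combine the columns with user-supplied weights. Character bigrams and Jaccard overlap are my choices here, not something Refine provides, and the column names are hypothetical.

```python
def ngrams(text, n=2):
    """Character n-grams of a lowercased, whitespace-normalized string."""
    cleaned = " ".join(text.lower().split())
    if len(cleaned) < n:
        return {cleaned} if cleaned else set()
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def ngram_similarity(a, b, n=2):
    """Jaccard overlap of the two n-gram sets, from 0.0 to 1.0."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty values are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def weighted_score(row, candidate, weights, n=2):
    """Weighted average of per-column similarities; weights maps column -> weight."""
    total = sum(weights.values())
    return sum(weight * ngram_similarity(row.get(col, ""), candidate.get(col, ""), n)
               for col, weight in weights.items()) / total


row = {"name": "Mozila Firefox", "vendor": "Mozilla"}
candidates = [
    {"name": "Mozilla Firefox", "vendor": "Mozilla"},
    {"name": "Microsoft Word", "vendor": "Microsoft"},
]
weights = {"name": 3, "vendor": 1}  # the name column counts three times as much

best = max(candidates, key=lambda c: weighted_score(row, c, weights))
print(best["name"])
```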

Once we have a better mapping interface, then it should be silly simple to Reconcile a column of data in Refine against another Refine project's column.

David Huynh

Oct 12, 2011, 6:03:33 PM
to google...@googlegroups.com, Shawn Simister
I think Shawn Simister might already have a toy recon server somewhere ...

David

Shawn Simister

Oct 12, 2011, 7:45:08 PM
to google...@googlegroups.com, Shawn Simister
Yes, I have a Java recon service that I could open source. It's going to take some time to get it ready for public consumption, but I'll update the list when it's online.
--
Shawn Simister

Developer Programs Engineer
Google, San Francisco

Peter Nõu

Oct 13, 2011, 4:38:33 AM
to google...@googlegroups.com
Eagerly anticipating this! It would be super duper useful in my use
cases: I often have to reconcile between different 'excel' sheets that
expose subsets of a data warehouse that's impossible (for political
reasons) to get to in the timeframe and within reason/keeping the job.
Thanks for the great efforts of everyone involved, in the past and in
the future. Refine is 'the best'. Finally, after 15+ years of
procrastination, I am starting to work with expressions, scripting etc. a
little bit /peter

fadi maali

Oct 13, 2011, 11:53:14 AM
to google...@googlegroups.com
Hello,

With the RDF extension, you can export the data of a particular project as RDF and then use the exported data to define a reconciliation service, i.e. a reconciliation service based on an RDF dump.
This reconciliation service can then be used to reconcile data in other projects, which in effect enables reconciliation across projects.

I am not advocating RDF here, and I understand that it is more for people who already know and need RDF. My point is that having the full cycle run inside Refine alone was very handy in my experience: no other servers, no need to upload the data somewhere outside Refine, etc.

Cheers,
Fadi

Thad Guidry

Feb 27, 2013, 8:00:29 PM
to openr...@googlegroups.com
Rainer,

Tom Morris is busy finishing up some code changes to make sure that the Reconcile menu options in OpenRefine work against Freebase again (with the new Google APIs). Stay tuned over the next few weeks for further announcements.


On Wed, Feb 27, 2013 at 8:57 AM, Rainer <rainer....@gmail.com> wrote:
> Hello there, I just discovered this thread and was wondering whether there are any advances in this direction. FYI, I'm a social researcher and I'm currently trying to code a quite messy survey database containing lots of string values which I would like to reconcile against some predefined lists (eg. ISO-country codes, village names with geocodes, etc.)
>
> Thanks!
>
> Rainer

--
-Thad
http://www.freebase.com/view/en/thad_guidry

Tom Morris

Feb 28, 2013, 12:03:59 AM
to openr...@googlegroups.com
On Wed, Feb 27, 2013 at 9:57 AM, Rainer <rainer....@gmail.com> wrote:

> Hello there, I just discovered this thread and was wondering whether there are any advances in this direction. FYI, I'm a social researcher and I'm currently trying to code a quite messy survey database containing lots of string values which I would like to reconcile against some predefined lists (eg. ISO-country codes, village names with geocodes, etc.)

Hi Rainer.  Some of this you can do by reconciling against Freebase (e.g. ISO country codes).  For reconciling against things that you can't find in Freebase, the two best choices are still the RDF extension and the OpenReconcile project.

If it's a straight lookup with no similarity measures or other fanciness required, you can also use the cross() function to join between multiple Refine projects.

Tom