Getting the Best Reconciliation / Wikidata etc


N.H. Leroy

Jun 16, 2017, 11:32:20 AM
to OpenRefine
Hello,

I'm interested in using the Wikidata reconciliation to get the best match on people and organizations. In my tests, matching on name alone, the top result was accurate only about 25 percent of the time. Because I am running it on thousands of names, I won't be able to go through each candidate manually to choose the best one.

I was wondering about using additional metadata I have on the person. I see there is a way in OpenRefine to use other columns to match against other properties. But that seems to be meant for structured data like IDs, locations, and occupations.

Let's say I have:

Bill Williams      Biographical description

Can this unstructured biographical description (which I won't be structuring, given the time it would take) be used in the Wikidata reconciliation to choose a better match? If so, how would it be used? If it were in the same column as the name, would the service look at the content surrounding "Bill Williams" when choosing him? That is not desirable, because it would then pick up entity matches in the description too, but would it improve the pick for Bill Williams? If, more ideally, the biographical line were in a secondary column, what would it match against? And because it's unstructured, would it even be a problem to match it against structured data like a Wikidata property?

Thanks in advance!

Neal



Ettore Rizza

Jun 16, 2017, 3:11:37 PM
to OpenRefine
Interesting question. Theoretically, it should be possible to compute a semantic similarity between a "description" column and the Wikidata header. But this is not yet the case.


Antonin Delpeuch (lists)

Jun 19, 2017, 3:56:30 AM
to openr...@googlegroups.com
Hi,

Unfortunately I do not see any way to use the reconciliation interface
with unstructured data like this. It is a very interesting topic though!
If I had to do it, I would try to use DBpedia Spotlight to detect named
entities in the descriptions, then extract the ones which are locations
or occupations (and dates), and use these in the reconciliation
interface. But that is a lot of work.
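
A rough sketch of the first step of that idea, for readers who want to try it. The endpoint URL, the `confidence` parameter and the shape of the JSON response (`Resources`, `@types`, `@surfaceForm`, `@URI`) are assumptions about the public DBpedia Spotlight service and should be checked against its documentation; the function names are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed public endpoint; verify before relying on it.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"


def annotate(text, confidence=0.5):
    """Send one description to Spotlight and return the decoded JSON."""
    url = SPOTLIGHT_URL + "?" + urlencode({"text": text, "confidence": confidence})
    req = Request(url, headers={"Accept": "application/json"})
    with urlopen(req, timeout=30) as resp:
        return json.load(resp)


def entities_of_type(payload, wanted_type):
    """Keep only the annotated entities whose type list contains wanted_type.

    Assumes each resource carries a comma-separated "@types" string,
    e.g. "Schema:Place,DBpedia:Place".
    """
    hits = []
    for res in payload.get("Resources", []):
        types = res.get("@types", "").split(",")
        if wanted_type in types:
            hits.append((res.get("@surfaceForm"), res.get("@URI")))
    return hits
```

The extracted places (or occupations) could then be written to their own column and used as an extra property in the reconciliation dialog.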

Antonin
> --
> You received this message because you are subscribed to the Google
> Groups "OpenRefine" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to openrefine+...@googlegroups.com
> <mailto:openrefine+...@googlegroups.com>.
> For more options, visit https://groups.google.com/d/optout.

Ettore Rizza

Jun 19, 2017, 6:41:26 AM
to OpenRefine, li...@antonin.delpeuch.eu
Here is a possibility, using this free API (http://swoogle.umbc.edu/SimService/api.html):


For each Wikidata Description of the candidates, I computed a semantic similarity with my own description using this Jython script :

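import sys
# Make the locally installed "requests" package visible to OpenRefine's Jython interpreter
sys.path.append(r'D:\jython2.7.0\Lib\site-packages')
from requests import get

sss_url = "http://swoogle.umbc.edu/SimService/GetSimilarity"

def sss(s1, s2, type='relation', corpus='webbase'):
    # Ask the similarity service to score two phrases
    response = None
    try:
        response = get(sss_url, params={'operation':'api','phrase1':s1,'phrase2':s2,'type':type,'corpus':corpus})
        return float(response.text.strip())
    except:
        # response is still None if the request itself failed
        print 'Error in getting similarity for %s: %s' % ((s1,s2), response)
        return 0.0

# OpenRefine wraps the expression in a function, so the top-level return is valid here
return sss(value, cells['description']['value'], type='relation', corpus='webbase')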

As you can see, the candidate whose bio looks most like the Wikidata description got the highest score.
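
For illustration, the selection step (score every candidate description, keep the highest) can be sketched outside OpenRefine as follows. This is not the script above: the token-overlap (Jaccard) score is only an offline stand-in for the semantic similarity service, and the candidate identifiers and descriptions are invented.

```python
def jaccard(a, b):
    """Crude stand-in similarity: shared words divided by all words."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def rank_candidates(my_description, candidates, similarity=jaccard):
    """Return (score, candidate_id) pairs, best match first.

    candidates maps a candidate identifier to its Wikidata description.
    Any function taking two strings and returning a float can be passed
    as similarity, for example a wrapper around a semantic similarity API.
    """
    scored = [(similarity(my_description, desc), cid)
              for cid, desc in candidates.items()]
    return sorted(scored, reverse=True)


# Invented example data
candidates = {
    "candidate-A": "American jazz pianist",
    "candidate-B": "Australian rules footballer",
}
ranked = rank_candidates("American jazz pianist and composer", candidates)
```

Swapping `jaccard` for a real semantic measure changes the scores but not the ranking logic.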

Problem: For this demo, I had to copy and paste the Wikidata descriptions manually.
@Antonin : Is there a way to retrieve them using the tool https://tools.wmflabs.org/openrefine-wikidata/en/fetch_values?item= ?

Antonin Delpeuch (lists)

Jun 19, 2017, 7:14:46 AM
to openr...@googlegroups.com
On 19/06/2017 11:41, Ettore Rizza wrote:
> Here is a possibility, using this free API
> <http://swoogle.umbc.edu/SimService/api.html> :
>

Very nice!

> @Antonin : Is there a way to retrieve them using the tool
> https://tools.wmflabs.org/openrefine-wikidata/en/fetch_values?item= ?

Not currently. But it should not be too hard with this:
https://www.wikidata.org/w/api.php?format=json&action=wbgetentities&props=descriptions&ids=

Antonin
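
A minimal sketch of how the `wbgetentities` call above could be used to fetch descriptions for a batch of candidates, assuming Python 3 outside OpenRefine. The function names, the `User-Agent` string and the choice to return an empty string for a missing description are choices made for this example, not part of any existing tool.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://www.wikidata.org/w/api.php"


def build_url(qids, language="en"):
    """Build a wbgetentities URL asking only for descriptions."""
    params = {
        "format": "json",
        "action": "wbgetentities",
        "props": "descriptions",
        "languages": language,
        "ids": "|".join(qids),  # several ids can be sent in one request
    }
    return API_URL + "?" + urlencode(params)


def extract_descriptions(payload, language="en"):
    """Map each entity id to its description ('' when there is none)."""
    result = {}
    for qid, entity in payload.get("entities", {}).items():
        desc = entity.get("descriptions", {}).get(language)
        result[qid] = desc["value"] if desc else ""
    return result


def fetch_descriptions(qids, language="en"):
    """Call the API (network access needed) and return {id: description}."""
    req = Request(build_url(qids, language),
                  headers={"User-Agent": "description-fetch-sketch/0.1"})
    with urlopen(req, timeout=30) as resp:
        return extract_descriptions(json.load(resp), language)
```

The resulting descriptions could be pasted into a column and compared with the local biography, instead of copying them by hand.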

Ettore RIZZA

Jun 19, 2017, 7:17:47 AM
to openr...@googlegroups.com



Drew Roberson

Mar 19, 2019, 6:31:56 PM
to OpenRefine
Hello,

I'm currently working on a large scale reconciliation project using OpenRefine. I'm trying to reconcile various name forms from my archives against LCnaf, Wikidata, VIAF, and ULAN. I have about 1,600 names that should match when I reconcile with Wikidata, but I'm only getting 55.

I'm using the standard service, trying to play with name forms (e.g., "Smith, John (1942-1999)" and "Smith, John"), but I'm still not having much luck. Since most of these 1,600 people are photographers, I added a new column, "Occupation" and tried to match this field specifically with the "Occupation" field on Wikidata. Unfortunately, this has not helped either.
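
For reference, here is a sketch of what a name-plus-occupation query roughly looks like when sent to a reconciliation service, assuming the standard reconciliation API query format. P106 is Wikidata's "occupation" property and Q5 is "human"; the helper name, the row layout and the `limit` value are invented for this example.

```python
import json


def build_queries(rows, type_id="Q5", occupation_pid="P106"):
    """Build a batch of reconciliation queries keyed q0, q1, ...

    Each row is a dict with a "name" and an optional "occupation".
    The occupation is sent as an extra property to help rank candidates.
    """
    queries = {}
    for i, row in enumerate(rows):
        query = {"query": row["name"], "type": type_id, "limit": 5}
        if row.get("occupation"):
            query["properties"] = [{"pid": occupation_pid, "v": row["occupation"]}]
        queries["q%d" % i] = query
    return queries


rows = [
    {"name": "George Daniell", "occupation": "photographer"},
    {"name": "Mariel Vidal"},
]
# The batch is sent as a JSON string in the "queries" form parameter
form_data = {"queries": json.dumps(build_queries(rows))}
```

Looking at the request this way makes clear that an extra property only helps rank candidates the name search already found; it cannot surface people who have no Wikidata item.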

Does anyone have any suggestions on how to maximize my reconciliation results?

[Screenshot: Screen Shot 2019-03-19 at 3.27.48 PM.png, captioned "none"]

[Screenshot: Screen Shot 2019-03-19 at 3.28.32 PM.png, captioned "matched"]

Thanks,
Drew

Owen Stephens

Mar 20, 2019, 5:30:00 AM
to OpenRefine
Hi Drew,

How do you know that 1,600 names should match against Wikidata? Looking at the examples you give that did not match, I've tried searching for them by hand on Wikidata:

Juan Enrique Bedoya
Mariel Vidal
Joseph Vitone
Hay Wrightson
Hughes & Mullin
Henneman & Malone
George Daniell
Lawrie Brown

Searching these by hand on Wikidata does not find hits for me, with one exception (George Daniell). While I don't claim to have searched exhaustively, my initial conclusion would be that most of these are not in Wikidata.

A few other pointers that might help:
  • If you can separate the dates of birth/death from the name, that will probably improve your hit rate. In the case of George Daniell I suspect this may be why you did not get the match.
  • Wikidata labels for people are usually just their name in the order it would normally be written. I would try re-ordering your names to be like this (e.g. George Daniell, not Daniell, George).
  • Where the person has multiple names or initials, you could try excluding these in case they haven't been included in the Wikidata entry (e.g. try searching for Juan Bedoya as well as Juan Enrique Bedoya). I'd only do this if I failed to get a match the first time.
  • Look for non-personal names in your data and separate them out. For example, you have two company names in the listed examples, "Hughes & Mullin" and "Henneman & Malone". You might want to try these with "and" instead of "&", and generally handle them differently (e.g. for people you could add P31=Q5 to your reconciliation process, but not for companies).
  • Where you know definitely that a person is in Wikidata and the name has not matched, see if you can work out why, and see if you can make adjustments to make that match. Fixing one line is likely to fix multiple lines.
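
The first, second and fourth pointers can be sketched as a small clean-up function. This is only an illustration: the regular expression assumes life dates appear in trailing parentheses as in "Smith, John (1942-1999)", and treating any name containing "&" as a company is a rough heuristic chosen for this example.

```python
import re

# Trailing life dates such as "(1942-1999)" or "(1942-)"
DATES_RE = re.compile(r"\s*\((\d{4})?\s*-\s*(\d{4})?\)\s*$")


def split_name(raw):
    """Separate dates from a name and put the name in natural order."""
    text = raw.strip()
    birth = death = None
    m = DATES_RE.search(text)
    if m:
        birth, death = m.group(1), m.group(2)
        text = text[:m.start()].strip()
    is_org = "&" in text  # heuristic only: flags likely company names
    if is_org:
        text = text.replace("&", "and")
    elif text.count(",") == 1:
        # "Family, Given" becomes "Given Family"
        family, given = [part.strip() for part in text.split(",")]
        text = given + " " + family
    return {"name": text, "birth": birth, "death": death, "is_org": is_org}
```

For example, `split_name("Smith, John (1942-1999)")` yields the name "John Smith" with the two years kept separately, so they can go into their own columns. The `is_org` flag can be used to skip the P31=Q5 constraint for companies.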

There is no magic solution of course - ultimately (unless you know different) it's possible that only a small fraction of the people in your archives have Wikidata records. My advice would be to put some time/resource limit on how long you spend. If you can't find a Wikidata match with a reasonable amount of effort, then it either doesn't exist, or the Wikidata record isn't good enough to match to.

I hope some of this helps

Owen

Drew Roberson

Mar 23, 2019, 1:13:19 PM
to openr...@googlegroups.com
Hi Owen,

It turns out I misunderstood, and most of the names I have probably will not, in fact, match with Wikidata, as you said. This set of names came from the George Eastman House Authority File, some of which have been uploaded to Wikidata, but clearly many have not.

Since this group of names/companies are photographers or related to photography in some way, I have found a lot of the names on New York Public Library's Photographers' Identities Catalog ( http://pic.nypl.org/ ), but unfortunately I don't think it has been legitimized as a name authority file quite yet. On their FAQ page:

"Is PIC an Authority File?
You will notice many of the PIC entries contain links to external authorities such as VIAF, ULAN and Wikidata. These authority files provide useful persistent identifiers that ensure we are all talking about the same thing, Dorothea Lange for example. When used in the cataloging of materials, these identifiers begin to form a web of data across institutions. In the future, as NYPL’s linked data program grows, we hope to contribute to this web of data by publishing our own identifiers and RDF data for the entities found in PIC, especially for those not found in other authorities. Until then PIC can aid in connecting you to these existing authority files."

They do provide IDs (e.g., "Mariel Vidal" --> ID: 393357), but I'm unclear if this is a reliable URI. What's more, there doesn't seem to be a standard service for PIC to reconcile on OpenRefine. But maybe this is a topic for another OpenRefine Google Group...

Thank you for the help and tips!

Drew


Alexandra Alisa Provo

Apr 1, 2019, 12:33:12 PM
to openr...@googlegroups.com
Hi!

If you want to reconcile against PIC, you could grab their CSV data (https://github.com/NYPL/pic-data/) and use Reconcile-CSV (https://github.com/okfn/reconcile-csv). It would be great to get PIC data into Wikidata--perhaps that would be a good conversation to start with the creator. I'd definitely be interested!

Alex
--
Alexandra Provo
Metadata Librarian
Division of Libraries
New York University
20 Cooper Square, 3rd floor
New York, NY 10003

alexand...@nyu.edu
212-992-7534
pronouns: she/her/hers

Ettore Rizza

Apr 1, 2019, 2:56:30 PM
to OpenRefine
+1 for reconcile-csv, which is a very good application. Regarding the integration of PIC in Wikidata, it seems very feasible since this database seems open source. Once the "PIC id" property is requested and created, it should not be very difficult to import some of these CSVs in Mix n' Match to check whether the artist names already have a Wikidata page.

Alexandra Alisa Provo

Apr 1, 2019, 3:04:14 PM
to openr...@googlegroups.com
Thanks, Ettore! I think it would be good to get in touch with the creator of the database. I'll reach out.


Ettore Rizza

Apr 1, 2019, 3:04:54 PM
to OpenRefine
Erratum: PIC already has a Wikidata property, but only 17,854 entities are linked with this external ID.

Owen Stephens

Apr 2, 2019, 3:52:03 AM
to OpenRefine

On Monday, 1 April 2019 20:56:30 UTC+2, Ettore Rizza wrote:
+1 for reconcile-csv, which is a very good application. Regarding the integration of PIC in Wikidata, it seems very feasible since this database seems open source. Once the "PIC id" property is requested and created, it should not be very difficult to import some of these CSVs in Mix n' Match to check whether the artist names already have a Wikidata page.


Out of interest Ettore - why Mix 'n' Match rather than using OpenRefine to do the matching and creation?
 

Ettore Rizza

Apr 2, 2019, 4:26:33 AM
to OpenRefine
@owen: This is another possibility of course, but when a Mix n' Match catalog exists (as for PIC), it seems to me that the reconciliation is simpler and more collaborative. Just a feeling.

David Lowe

Apr 2, 2019, 3:05:39 PM
to OpenRefine
Hi all, I'm David Lowe, the editor of PIC. Alex tipped me off to this discussion (thanks!), so I'm here if I can help or answer any questions. As Ettore pointed out, PIC IDs do have a Wikidata property, and there is a (now old) Mix-n-Match project. I'm not sure if there's a way to take that down and upload a newer snapshot of the data... it's been a while since I've had the time to look at MnM. But let me know how I can help; I'd love to get more (or all) of PIC into WD. And thanks all for the interest.
d