Hi all,
To celebrate the new version of OpenRefine I have added a new feature to
the reconciliation interface. (But there is no need to upgrade anything
on your side to use it, as it is entirely server-side.)
Until now, matching was only supported for string-based fields (such as
monolingual texts or identifiers). I have now added custom matching
strategies for most value types in Wikibase. For instance, matching on
geographical coordinates is now possible (use the "lat,long" format in
OpenRefine).
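To make the coordinate case concrete, here is a minimal sketch of how a "lat,long" cell could be parsed and compared with a candidate's coordinates. The function names, the haversine distance and the 1 km cutoff are all assumptions chosen for illustration; the scoring actually used by the service may well differ.

```python
import math

def parse_lat_long(text):
    """Parse a "lat,long" string such as "48.8566,2.3522" into two floats."""
    lat_str, lng_str = text.split(",")
    return float(lat_str), float(lng_str)

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, long) pairs."""
    lat1, lng1, lat2, lng2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    # Clamp to 1.0 to guard against rounding errors near antipodal points.
    return 2 * 6371.0 * math.asin(math.sqrt(min(1.0, h)))

def coordinate_score(cell, candidate, max_km=1.0):
    """Hypothetical score: 100 for identical points, falling linearly to 0 at max_km."""
    distance = haversine_km(parse_lat_long(cell), candidate)
    return max(0.0, 100.0 * (1 - distance / max_km))
```

For example, coordinate_score("48.8566,2.3522", (48.8566, 2.3522)) gives the maximum score, while a candidate located in another city gets 0.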
In addition, you can now extract sub-fields from these complex values. For
instance, if for some reason you only have records of people with their
month and day of birth (but not their year of birth), you can match on that.
This is all described in the documentation:
https://github.com/OpenRefine/OpenRefine/wiki/Reconciliation#comparing-values
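As an illustration of the sub-field idea, the sketch below compares only the month and day of two dates. The helper names and the "MM-DD" cell format are assumptions made for this example, not the syntax used by the service; the documentation linked above describes the real one.

```python
from datetime import date

def month_day(value):
    """Extract the (month, day) sub-field from a full date."""
    return (value.month, value.day)

def parse_month_day(text):
    """Parse a hypothetical "MM-DD" cell such as "03-14" into (month, day)."""
    month_str, day_str = text.split("-")
    return (int(month_str), int(day_str))

def month_day_matches(cell, candidate_birth_date):
    """True when the cell agrees with the candidate on month and day, whatever the year."""
    return parse_month_day(cell) == month_day(candidate_birth_date)
```

With these helpers, month_day_matches("03-14", date(1879, 3, 14)) holds even though the cell carries no year.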
A few notes of caution:
- you should not rely too much on the particular behaviour of the
scoring as it might evolve in the future. Of course I try to keep the
scoring sensible for every use case, but things can break. Please let me
know if that happens!
- all this is just a reranking *after* the search for Qids: the extra
values are not used to query Wikidata. So, for instance, if you have
very noisy names but accurate geographical coordinates, this
reconciliation interface will not find the right items.
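The second point can be pictured with a small sketch. Everything in it is made up for illustration: the Qids are placeholders, the scores are invented, and summing the name score with the property score is an assumption rather than the formula used by the service.

```python
def rerank(candidates, property_score):
    """Reorder candidates that a name search has already returned.

    candidates: list of dicts with hypothetical keys "qid" and "name_score".
    property_score: function mapping a Qid to a score for the extra columns.
    Items missing from `candidates` can never be added at this stage.
    """
    return sorted(
        candidates,
        key=lambda c: c["name_score"] + property_score(c["qid"]),
        reverse=True,
    )

# Invented search results: the right item ("Q3") was not retrieved by name.
found_by_name = [
    {"qid": "Q1", "name_score": 60},
    {"qid": "Q2", "name_score": 55},
]
coordinate_scores = {"Q1": 0, "Q2": 40, "Q3": 100}
ranked = rerank(found_by_name, lambda qid: coordinate_scores.get(qid, 0))
print([c["qid"] for c in ranked])
```

The coordinates move the second candidate ahead of the first, but "Q3" stays out of reach however well its coordinates match, because the name search never returned it.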
Also: contributors welcome!
- if you want to contribute subfields, it is quite straightforward: just
add a class here:
https://github.com/wetneb/openrefine-wikidata/blob/master/wdreconcile/subfields.py
- if you want to improve the scoring methods for particular values, it
is here:
https://github.com/wetneb/openrefine-wikidata/blob/master/wdreconcile/wikidatavalue.py
Cheers,
Antonin