questions on gazetteer API/using gazetteer with OpenRefine


Erik Paulson

May 28, 2021, 5:55:19 PM
to open-source-...@googlegroups.com
Hello - 

I've been a big fan of the DeDupe work for quite a while. Thanks for providing it, and great work with it so far!

In somewhat related work, there's the W3C Reconciliation API, which grew out of the OpenRefine/Google Refine tool for data wrangling. (The usual use case there is matching records against a canonical dataset, e.g. matching a list of cities against their Wikidata/Wikipedia entries.)

Recently there's been some interest in extending the protocol to support providing "feedback" about which match candidate the user selected, in case a server wants to incorporate that result into future suggestions. The OpenRefine workflow wouldn't ever be quite the same as DeDupe's - I don't think OpenRefine would be as nice for large batches as DeDupe - but there is some overlap.

I wanted to play with that protocol idea a bit, and this seemed like a great time to try the DeDupe Python library, especially the gazetteer configuration.
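For context, here's roughly the flow I'm using - a minimal sketch, assuming dedupe 2.x with dict-style field definitions, and with a made-up 'name' field:

    import dedupe

    fields = [{'field': 'name', 'type': 'String'}]

    gazetteer = dedupe.Gazetteer(fields)
    # data1 = messy records, data2 = canonical records,
    # both as {record_id: {field: value}} dictionaries
    gazetteer.prepare_training(data1, data2)
    dedupe.console_label(gazetteer)  # or call mark_pairs() directly
    gazetteer.train()
    gazetteer.index(data2)
    results = gazetteer.search(data1, n_matches=1)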

I had just a couple of questions, mostly around mark_pairs.

The pairs that mark_pairs takes don't use the IDs/keys from the data1/data2 dictionaries passed to prepare_training; instead it takes the actual record data, and it is very particular about how that data is ordered. I accidentally swapped the order of the datasets within the pairs, so even though I was passing "matching" pairs, I had data1 and data2 reversed and got a screenful of errors. Could mark_pairs be changed to take a pair of IDs as input instead of the actual records? That might make the expected order clearer.
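For reference, this is the shape I'm passing today - the 'match'/'distinct' keys are from the docs, the specific record keys are just illustrative, and my reading is that each tuple must be (data1 record, data2 record) in that order:

    labeled_examples = {
        'match': [
            # (messy record, canonical record) - swapping these blew up for me
            (data1['messy-id-1'], data2['canon-id-7']),
        ],
        'distinct': [
            (data1['messy-id-2'], data2['canon-id-3']),
        ],
    }
    gazetteer.mark_pairs(labeled_examples)

A purely hypothetical ID-based call could look like mark_pairs({'match': [('messy-id-1', 'canon-id-7')], ...}) and resolve the records internally.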

I assume that both the "messy" and "canonical" records passed to mark_pairs() must be drawn from the same sets of records passed to prepare_training(), but I did not experiment to verify that. I suppose it's possible that only the canonical record must be present? When you pass data to search(), the search target doesn't seem to be required to come from the "messy" dataset - I passed a record with a made-up ID to search() and it worked just fine (though my search handler uses a StaticGazetteer, so it no longer has the data1 or data2 datasets, unless the Gazetteer object saves the two inputs when it writes out the learned_settings file). Do the pairs for mark_pairs in fact have to be drawn from both data1 and data2?
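Here's roughly what I did on the server side (a sketch; the settings file name, field, and query key are mine):

    # load the trained model; the canonical data is reloaded separately
    with open('learned_settings', 'rb') as f:
        gazetteer = dedupe.StaticGazetteer(f)
    gazetteer.index(data2)

    # 'q-123' never appeared in data1, but search() is happy with it
    query = {'q-123': {'name': 'Madison'}}
    results = gazetteer.search(query, n_matches=3)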

The sample_size defaults to 150K, which I assume is the number of candidate pairs drawn from the cross product of data1 and data2 (my tiny dataset is 332 and 534 rows, so I just barely squeak by). For the reconciliation API, it's not normally the case that we'd have the full "messy" dataset available at training time. Is there a good rule of thumb for a workable lower bound on the number of candidate pairs available to the training process? A sketch of the arithmetic I have in mind is below.
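In case it helps to see it concretely, this is how I think it works out for my data, assuming the sample really is drawn from the cross product:

    # candidate pairs come from the cross product of the two datasets
    available_pairs = 332 * 534  # = 177,288, just over the 150K default

    # defensively cap the sample at what's actually available
    gazetteer.prepare_training(
        data1, data2,
        sample_size=min(150_000, available_pairs),
    )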

My hacky server is here:

Thanks!

-Erik
