Reconcile with multiple possible values for the same property


Paul

Jun 14, 2022, 12:03:50 PM
to openr...@googlegroups.com
Dear all,

First of all, thank you very much for OpenRefine.

I am using the reconcile function against Wikidata and have stumbled
upon an advanced use case.

Let's say I have names of people who I know are either painters,
sculptors or engravers.

To reconcile, I can query people (Q5) with the first name on P735, and
I'd like to add occupation (P106), not with a single value but with
three possible values.

The only way I found to do that is to generate as many rows as there
are possible values in my record: split multi-valued cells on
occupation, fill down, reconcile all the rows, then reduce the record
back into one row...

It's a bit complicated AND it generates as many queries as there are
possible values. I guess the reconciliation API would allow multiple
"should" constraints on the same property?
If so, would there be a way to craft a multi-valued query on one
property?

I guess I am dreaming, but I just want to make sure.
If there's a way to do it by going through an API or similar, I am also
interested, as long as we can still use the marvelous reconciliation UI
(facets/candidates) afterwards.

Let me know if that rings any bells.

Best regards,

Paul

Paul

Jun 14, 2022, 6:14:27 PM
to openr...@googlegroups.com
As a follow-up: in my use case, what I need is something like an "OR"
between values.
Others might need an "AND".
Actually, an "AND" can be done by using multiple columns on the same
property. Candidates get a better score the more of those values they
have in the targeted property; 100 means a full AND match.

In my case, doing this with one column, and so one property constraint,
per value (painter, sculptor, engraver...) lowers the scores, as one
person having all the possible occupations is very unlikely, though
still logically possible. In such a case, candidates with a score above
~70 could be considered full matches, which can be done in two clicks
thanks to the beautiful OpenRefine reconciliation facets and actions.

So I guess that's the way to go.
And at least in that case we stick to only one query per alignment.

To finish this very long talk-to-myself thread: what I would ideally
need here is a way to ask for a scoring mechanism that outputs a fixed
score regardless of the number of "OR" values that match the property.
Very specific, I guess.
Above all, it's not in OpenRefine's hands anyway, as "The way candidates
are retrieved from the underlying database and scored against the query
is left entirely at the discretion of the service."
https://reconciliation-api.github.io/specs/0.1/#a-note-on-candidate-retrieval-and-scoring

Ok let's dig then:
"For each supplied property, all query values are matched against
reference values and the maximum matching score of all pairs is used as
the similarity score for this property."
https://openrefine-wikibase.readthedocs.io/en/latest/scoring.html#global-matching-formula
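
A minimal Python sketch of the quoted rule may help to see why it reads like an OR. This is only an illustration of the sentence above, not the service's actual code: the function names are made up, and the exact-match similarity is an assumption standing in for whatever per-value matching the service really uses.

```python
def exact_match(query_value, reference_value):
    """Hypothetical similarity: 100 on a case-insensitive exact match, else 0."""
    same = query_value.strip().lower() == reference_value.strip().lower()
    return 100.0 if same else 0.0


def property_score(query_values, reference_values, similarity=exact_match):
    """Maximum similarity over all (query value, reference value) pairs."""
    return max(
        (similarity(q, r) for q in query_values for r in reference_values),
        default=0.0,  # no pairs to compare: treat as no match
    )


wanted = ["painter", "sculptor", "engraver"]
print(property_score(wanted, ["sculptor", "teacher"]))  # 100.0
print(property_score(wanted, ["gardener"]))             # 0.0
```

Under this reading, one perfectly matching value is enough to reach the maximum for that property, but only if all the values arrive together as the query values of a single property.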

But wait. If I understand correctly, this would actually mean that if
any value matches perfectly, the score should be maximal. That's an
OR...
Hmm, what I see from playing with OpenRefine does not match that
statement.
Can OpenRefine actually send a list of multiple values for one
property? Or does it send a list repeating the same property several
times, with one value each?
Sounds like the second option to me.

At that point I wish I knew a way to see the query sent by OpenRefine
to the reconciliation service.
When I do a query with two different columns on the same property, I
think what OpenRefine does is:
```
{
  "q0": {
    "query": "Paul Girard",
    "type": "DifferentiatedPerson",
    "limit": 5,
    "properties": [
      {
        "pid": "occupation",
        "v": "painter"
      },
      {
        "pid": "occupation",
        "v": "sculptor"
      }
    ],
    "type_strict": "should"
  }
}
```

where it could also be
```
{
  "q0": {
    "query": "Paul Girard",
    "type": "DifferentiatedPerson",
    "limit": 5,
    "properties": [
      {
        "pid": "occupation",
        "v": [
          "painter",
          "sculptor"
        ]
      }
    ],
    "type_strict": "should"
  }
}
```
Syntax taken from
https://reconciliation-api.github.io/specs/0.1/#structure-of-a-reconciliation-query
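
For anyone who wants to try both shapes by hand, here is a rough Python sketch that builds them and encodes the batch as the `queries` form parameter described in the 0.1 spec. The helper names are hypothetical, Q5 and P106 are the type and property mentioned earlier in this thread, and the plain-string occupation values are placeholders. Nothing is sent over the network, and no particular endpoint is assumed.

```python
import json
from urllib.parse import urlencode


def build_query(name, pid, values, grouped):
    """Build one reconciliation query (hypothetical helper).

    grouped=False: one property entry per value (same pid repeated).
    grouped=True:  a single property entry whose "v" is a list.
    """
    if grouped:
        properties = [{"pid": pid, "v": list(values)}]
    else:
        properties = [{"pid": pid, "v": value} for value in values]
    return {
        "query": name,
        "type": "Q5",
        "limit": 5,
        "properties": properties,
        "type_strict": "should",
    }


def encode_batch(queries):
    """Encode a batch of queries as a form body with a `queries` parameter."""
    return urlencode({"queries": json.dumps(queries)})


occupations = ["painter", "sculptor", "engraver"]
repeated = build_query("Paul Girard", "P106", occupations, grouped=False)
as_list = build_query("Paul Girard", "P106", occupations, grouped=True)

print(json.dumps({"q0": repeated}, indent=2))
print(json.dumps({"q0": as_list}, indent=2))
print(encode_batch({"q0": as_list}))
```

Posting each body to a reconciliation service and comparing the candidate scores would show whether the service treats the two shapes differently.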


A second closing note: a very common use case for what I am describing
here is matching people (Q5) when you have a list of first names (P735).


Sorry for being so long.
I hope this makes sense to someone.

Best regards,
Paul

Antoine Beaubien

Jun 14, 2022, 7:06:59 PM
to OpenRefine
Hi Paul,

   I don't understand your scenario.

   But let's take a simple use case. You have a table with people's names in the first column.
  • You reconcile the name to a QID.
  • You create a new occupation column from the reconciled name.
  • If a name has more than ONE occupation, you will get more rows for each record.
At that point, you can manually add new rows to people to give them more occupations, and that can be pushed back to Wikidata if you set up the schema.

Records are pushed to the same Wikidata item. So you can have multiple rows for the same Wikidata item. There is no need to join or split those cells.

Regards,
   Antoine 

Thad Guidry

Jun 14, 2022, 7:24:58 PM
to openr...@googlegroups.com
Hi Paul,

So, let's start at the beginning with how your data is currently set up in the datagrid with the columns, which will impact how the underlying query is composed by OpenRefine's default recon batch handling.

Does it look somewhat like this where you have a single row for a person and then multiple columns that describe the person, where each column could be part of an AND against multiple properties of a recon service?

Person_First | Person_Last | Occupation1 | Occupation2 | Occupation3 | Birth_location
John | Smith | painter | sculptor | gardener | United Kingdom

or more like this as a record ? ...

id | Person_First | Person_Last | Occupation | Birth_location
1 | John | Smith | painter | United Kingdom
1 | John | Smith | sculptor | United Kingdom
1 | John | Smith | gardener | United Kingdom
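
For reference, going from the second layout to the first can be sketched in a few lines of Python. This is only an illustration with a made-up helper name; it assumes rows are plain dictionaries keyed by the column names above and that every column except Occupation is identical within a record.

```python
def records_to_wide(rows, key="id", multi="Occupation"):
    """Merge rows sharing `key` into one row with numbered `multi` columns."""
    merged = {}  # record id -> single-valued columns
    values = {}  # record id -> values of the multi-valued column, in order
    for row in rows:
        record_id = row[key]
        if record_id not in merged:
            merged[record_id] = {k: v for k, v in row.items() if k != multi}
            values[record_id] = []
        values[record_id].append(row[multi])
    for record_id, wide_row in merged.items():
        for index, value in enumerate(values[record_id], start=1):
            wide_row[f"{multi}{index}"] = value
    return list(merged.values())


record = [
    {"id": 1, "Person_First": "John", "Person_Last": "Smith",
     "Occupation": occupation, "Birth_location": "United Kingdom"}
    for occupation in ("painter", "sculptor", "gardener")
]
print(records_to_wide(record))
```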

Incidentally, the first example is somewhat more intuitive and also a bit easier to align against recon services (depending on the service and its algorithms).
As for the AND... the first example is how we originally designed OpenRefine (Gridworks) to batch recon properties against Freebase, which did allow exclusive AND and OR later on in its life, sometime around 2012-2013. They were called "constraints" and accessed through the "filter" query keyword.  Google Refine 2.6 is the version that incorporated the changes for the new Freebase Reconciliation API that allowed full constraint capability.
See https://markmail.org/message/fdysrya6pi6pzfdy
And https://developers.google.com/freebase/v1/search-cookbook#schema-constraints

Behind the scenes the MQL queries looked something like this:
 query "john" filter: "(all type:/occupation/painter type:/occupation/sculptor type:/occupation/gardener /people/person/nationality:"United Kingdom")"



Thad Guidry

Jun 14, 2022, 7:32:35 PM
to openr...@googlegroups.com
I forgot to link to the most important part of your question: the combining behavior of multiple properties in a batch recon.
The original Freebase service and Google Refine had support for:  https://developers.google.com/freebase/v1/search-overview#advanced-filtering

This combining behavior can be overridden and better controlled with the filter parameter which offers a richer interface to combining constraints. It is an s-expression, possibly arbitrarily nested, where the operator is one of:

  • any, logically an OR
  • all, logically an AND
  • not
  • should, which can only be used at the top level and which denotes that the constraint is optional. During scoring, matches that don't match optional constraints have their score divided in half for each optional constraint they don't match.
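
To make the operator list concrete, here is a small Python sketch that assembles such filter strings and applies the halving rule for missed optional constraints. It is an illustration only: the helper names are invented, the `type:/occupation/...` operands simply reuse the form from my earlier message, and nothing here talks to a real service.

```python
OPERATORS = {"any", "all", "not", "should"}


def s_expr(operator, *operands):
    """Join an operator and its operands into one parenthesised expression."""
    if operator not in OPERATORS:
        raise ValueError(f"unknown operator: {operator}")
    return "(" + " ".join([operator, *operands]) + ")"


def optional_penalty(score, missed_optional):
    """Halve the score once per optional constraint that did not match."""
    return score / (2 ** missed_optional)


occupations = ["painter", "sculptor", "engraver"]
either = s_expr("any", *(f"type:/occupation/{o}" for o in occupations))
nested = s_expr("all", either, '/people/person/nationality:"United Kingdom"')

print(either)
print(nested)
print(optional_penalty(100, 2))  # 25.0
```

The `any` form is the OR that Paul is asking for: one matching occupation satisfies the constraint.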
I am not entirely sure of the current state of Wikidata recon and would have to defer to developers like Antonin Delpeuch who have first-hand knowledge of the constraint or filtering mechanisms for the Wikidata recon service.

Paul OuestWare

Jun 27, 2022, 8:11:59 AM
to OpenRefine
Thank you for your answers.
Yes, I need to understand how the actual reconciliation system works under the hood to explain this multiple-value issue.

To state my question more clearly, I submitted an issue which narrows it down to a precise example using first names: https://github.com/OpenRefine/OpenRefine/issues/4993

I hope this will not be considered too off-topic.

best,

Paul
PS: I switched to my Gmail address as I could not join the group with my other email...