Prior probabilities

Vasco

Sep 15, 2015, 7:38:33 AM

to duke

Hi,

Is there any way to adjust the prior? It seems to always be set to 0.5. Is there some reason I am missing why an adjustable prior doesn't make sense with Duke?

To explain a little bit why I am asking: I am using Duke with the Elasticsearch plugin to build an interactive data linkage tool. With the ES plugin, if a field that is subject to similarity calculations is missing in the index, it gets a score of 0.5 (which makes sense, I think). For a record in which none of the fields subject to similarity calculations are present, that means a total score of 0.5. This intuitively seems wrong to me, as the probability of two records representing the same entity when there is no evidence of them being similar should (in my case) be close to 0, i.e., the prior probability is low.
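To make the arithmetic concrete, here is a minimal sketch in Python. It assumes the per-field probabilities are folded together with the usual pairwise naive Bayes rule, starting from 0.5; the function names and the `prior` parameter are hypothetical and only there for illustration, they are not part of Duke's API.

```python
from functools import reduce


def combine(p1: float, p2: float) -> float:
    """Pairwise naive Bayes combination of two match probabilities."""
    num = p1 * p2
    return num / (num + (1.0 - p1) * (1.0 - p2))


def score(field_probs, prior=0.5):
    """Fold per-field probabilities into one score, starting from `prior`.

    `prior` is a hypothetical knob; the thread says it is fixed at 0.5.
    """
    return reduce(combine, field_probs, prior)


# Three compared fields, all missing, each contributing the neutral 0.5.
all_missing = [0.5, 0.5, 0.5]

print(score(all_missing))              # remains at the 0.5 starting point
print(score(all_missing, prior=0.01))  # remains at the low starting point
```

Under this rule a 0.5 is neutral: it leaves whatever value it is combined with unchanged. So a record with no comparable fields ends up at exactly the starting value, which is why the choice of prior is visible in this case.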

Maybe I am thinking about this in the wrong way? Any input appreciated.

--
Vasco

Alan Johnson

Sep 29, 2015, 8:52:02 AM

to duke

I think you are correct; starting with a prior of a 50/50 chance of a match makes sense in general, but I would think that the evidence (or complete lack thereof) in this case should reduce it to 0 as a result.

Lars Marius Garshol

Sep 29, 2015, 9:12:36 AM

to duke

* Vasco

Is there any way to adjust the prior? It seems to always be set to 0.5. Is there some reason I am missing why an adjustable prior doesn't make sense with Duke?

There is no way to adjust the prior at the moment. There are several reasons for this. One is that we don't really know what the prior is, and it is strongly affected by the configuration of the "database" (where we look up candidate records). If you use the API to compare records directly that's going to affect the prior, too.

Further, if you lower the prior, is that really any different from increasing the threshold? It's not clear to me that it is.
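A rough way to check this intuition, sketched in Python: if the score is built by naive Bayes combination starting from 0.5 (an assumption here), then switching to a different prior just adds a constant in log-odds space, so a threshold on the adjusted score corresponds to a shifted threshold on the unadjusted one. The helper names below are hypothetical, not Duke functions.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))


def apply_prior(score: float, prior: float) -> float:
    """Re-base a score that was computed from 0.5 onto another prior."""
    return sigmoid(logit(score) + logit(prior))


def equivalent_threshold(threshold: float, prior: float) -> float:
    """Threshold on the unadjusted score that gives the same decisions
    as `threshold` applied to the prior-adjusted score."""
    return sigmoid(logit(threshold) - logit(prior))


# Example: a prior of 0.1 with threshold 0.8 behaves like the default
# prior of 0.5 with a threshold of 36/37 (about 0.973).
t = equivalent_threshold(0.8, 0.1)
for s in (0.5, 0.9, 0.97, 0.98, 0.999):
    assert (apply_prior(s, 0.1) >= 0.8) == (s >= t)
```

So for match decisions the two are interchangeable, as long as scores stay strictly between 0 and 1. What does change is the number shown to a user: a record with no evidence would display the prior rather than 0.5, while the ranking of candidates stays the same.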

So that's the reason why there's no way to change it at the moment.

To explain a little bit why I am asking: I am using Duke with the Elasticsearch plugin to build an interactive data linkage tool. With the ES plugin, if a field that is subject to similarity calculations is missing in the index, it gets a score of 0.5 (which makes sense, I think). For a record in which none of the fields subject to similarity calculations are present, that means a total score of 0.5. This intuitively seems wrong to me, as the probability of two records representing the same entity when there is no evidence of them being similar should (in my case) be close to 0, i.e., the prior probability is low.

Maybe I am thinking about this in the wrong way?

No, I think you're right about this from a probability point of view. The question is whether it's worth doing anything about it. If you factor in a prior in the UI, you need to also do it in Duke, I guess, for consistency.

--Lars Marius
