Dear All,
I am reviving this thread because I would like to ask for some clarification -- if possible. Based on what has already been discussed in this thread, the weight provides a confidence measure on the validity of the assertion where it is found. Fine.
In ConceptNet 4, the equivalent measure was the `score' of an assertion, which was an integer. There it made sense to filter out assertions with a score of 0 or less, and in fact this was common practice in order to get rid of spurious links; see, for example, the paper "AnalogySpace: Reducing the Dimensionality of Common Sense Knowledge" (AAAI 2008).
So, I have what I believe are two natural questions:
(A) Is there a similar universal weight threshold that can be used for ConceptNet 5.6 as a score of 0 was used in ConceptNet 4?
(I assume the answer is `no'.)
(B) What about weight thresholds per dataset?
For example, based on the above descriptions and a quick glance at the code, I can see that the default weight values are:
-- conceptnet 4 : 1.0
-- dbpedia: 0.5 or 1.0
-- emoji: 1.0 (even though no weight is set in the code, all the assertions currently have this value)
-- opencyc: 1.0
-- verbosity: not clear, because it depends on some `score'; there are 4875 distinct weight values, ranging from 0.1 to 15.414
-- wiktionary: 0.25 or 1.0
-- wordnet 3.1: 1.0 or 2.0
Excluding verbosity, which I do not know how to treat at the moment: for all the other datasets mentioned above, I see that when we restrict to the English portions in the case of multiple languages (e.g., `/d/conceptnet/4/en'), the respective weights of the assertions are never smaller than the default values mentioned above (taking the minimum in case of multiple default values).
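To make this concrete, here is roughly how I computed the per-dataset minima -- just a sketch, assuming the tab-separated assertions dump where the fifth column is a JSON blob with `dataset' and `weight' keys (the helper name is mine):

```python
import json
from collections import defaultdict

def min_weight_per_dataset(lines):
    """Record the smallest weight observed for each dataset,
    scanning lines of the tab-separated assertions dump whose
    fifth field is the JSON metadata of the assertion."""
    minima = defaultdict(lambda: float("inf"))
    for line in lines:
        info = json.loads(line.rstrip("\n").split("\t")[4])
        dataset, weight = info["dataset"], info["weight"]
        if weight < minima[dataset]:
            minima[dataset] = weight
    return dict(minima)
```

Running this over the English portions is what gives the observation above: no dataset's minimum falls below its smallest default value.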
This is in sharp contrast to how `score' used to behave in ConceptNet 4. In fact, ConceptNet 4 is part of ConceptNet 5.6, so unless all the assertions that had a score of 0 or less in ConceptNet 4 were dropped when importing it into ConceptNet 5.6 (which would indeed be a good thing), some care may be needed for assertions whose weights are near the default values.
So, the question remains:
Do people still drop some assertions from the datasets, even if this means they may have to use different thresholds for different datasets?
As a last remark, there are assertions in the database with weights of, say, 0.1, or 0.101, or 0.102, or .... Based on all the above default values, it looks like such assertions should probably be dropped when one wants to apply an algorithm that operates on `meaningful' assertions -- regardless of the dataset these assertions come from. But if that is the case, then we should probably drop assertions whose weights are lower than the default values of their individual datasets, correct?
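In code, the per-dataset filtering I have in mind would look something like the following sketch. The threshold table just restates the minimum default values listed above; the function names are mine, and I again assume the tab-separated dump format with the JSON metadata (including `dataset' and `weight') in the fifth column:

```python
import json

# Minimum default weight per dataset, taken from the values listed
# above (using the minimum where a dataset has several defaults).
THRESHOLDS = {
    "/d/conceptnet/4": 1.0,
    "/d/dbpedia": 0.5,
    "/d/emoji": 1.0,
    "/d/opencyc": 1.0,
    "/d/wiktionary": 0.25,
    "/d/wordnet": 1.0,
}

def keep_assertion(dataset, weight):
    """Keep an assertion only if its weight reaches its dataset's
    threshold; datasets without a threshold (e.g. verbosity) are
    kept unconditionally."""
    for prefix, minimum in THRESHOLDS.items():
        if dataset.startswith(prefix):
            return weight >= minimum
    return True

def filter_dump(lines):
    """Yield only the dump lines whose assertions pass keep_assertion."""
    for line in lines:
        info = json.loads(line.rstrip("\n").split("\t")[4])
        if keep_assertion(info["dataset"], info["weight"]):
            yield line
```

Whether these thresholds are actually the right ones per dataset is exactly what I am asking.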
Comments? Thoughts?
Best regards,
Dimitris