Gazetteer - Datasets for training, testing, and application

Nov 26, 2020, 3:29:22 PM

to open source deduplication

Hello everyone,

I am starting to use Dedupe 2.0 and learning about Active Learning as well.

First, I would like to know whether it is possible to train on one dataset and then run the clustering process on another dataset with the same structure.

Specifically, I want to use the Gazetteer class, and I have the following questions:

I have a messy dataset that is already labeled (i.e., I know all the duplicates). I want to use 70% of it to train the model and the remaining 30% to test the accuracy of Dedupe. For testing, I want to reuse the training and settings files generated during training and use the test data only for the clustering process.
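A minimal sketch of the 70/30 split and the accuracy check, in plain Python and independent of Dedupe itself. It assumes each record carries a hypothetical `cluster_id` field identifying its known duplicate group; the function names are illustrative, not part of any library. Splitting by cluster rather than by record keeps every known duplicate group entirely on one side, so the test set does not leak into training. Because the split counts clusters, the record-level proportion is only approximately 70/30.

```python
import random
from collections import defaultdict
from itertools import combinations


def split_by_cluster(records, cluster_key, train_fraction=0.7, seed=0):
    """Split records into train/test so that no known cluster spans both sides.

    records: dict mapping record ID -> record dict.
    cluster_key: name of the field holding the known cluster label (assumed).
    """
    clusters = defaultdict(list)
    for record_id, record in records.items():
        clusters[record[cluster_key]].append(record_id)

    # Shuffle cluster labels deterministically, then cut at train_fraction.
    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)
    n_train = round(len(cluster_ids) * train_fraction)

    train_ids = {rid for cid in cluster_ids[:n_train] for rid in clusters[cid]}
    train = {rid: rec for rid, rec in records.items() if rid in train_ids}
    test = {rid: rec for rid, rec in records.items() if rid not in train_ids}
    return train, test


def duplicate_pairs(assignment):
    """Return the set of unordered record-ID pairs that share a cluster label."""
    members = defaultdict(list)
    for record_id, label in assignment.items():
        members[label].append(record_id)
    return {
        frozenset(pair)
        for ids in members.values()
        for pair in combinations(ids, 2)
    }


def pairwise_precision_recall(true_assignment, predicted_assignment):
    """Compare predicted clusters with known clusters at the pair level."""
    true_pairs = duplicate_pairs(true_assignment)
    predicted_pairs = duplicate_pairs(predicted_assignment)
    hits = len(true_pairs & predicted_pairs)
    precision = hits / len(predicted_pairs) if predicted_pairs else 1.0
    recall = hits / len(true_pairs) if true_pairs else 1.0
    return precision, recall
```

The predicted assignment would come from whatever the clustering step returns for the 30% test portion, mapped to one label per record ID.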

Also, I have another messy dataset with the same fields as the previous one that is not labeled yet, and I want to discover all its duplicates. Is it possible to reuse the previous training and settings files to classify the new messy dataset? The canonical dataset in the previous case (i.e., the one used for the training process) is a subset of the new canonical dataset that would be used in this case.

Finally, I would like to know if there is a recommended number of samples and labeled samples to use for the training process. I am thinking of using 28,000 samples, with 10% of them labeled (i.e., 2,800 labeled samples).
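Since the duplicates are already known, labeled pairs can be generated from the ground truth instead of being labeled by hand. The sketch below is plain Python under stated assumptions: `truth` is a hypothetical mapping from messy record ID to the matching canonical record ID, and `build_labeled_pairs` is an illustrative name. The output is a dict with `match` and `distinct` lists of (messy record, canonical record) pairs, which to my understanding is the shape Dedupe's `mark_pairs` expects; please check that against the documentation for your version.

```python
import random


def build_labeled_pairs(messy, canonical, truth, n_pairs, seed=0):
    """Build roughly n_pairs labeled pairs, half matches and half non-matches.

    messy, canonical: dicts mapping record ID -> record dict.
    truth: dict mapping messy record ID -> matching canonical record ID (assumed).
    """
    rng = random.Random(seed)
    messy_ids = sorted(truth)
    canonical_ids = sorted(canonical)
    chosen = rng.sample(messy_ids, min(n_pairs // 2, len(messy_ids)))

    labeled = {"match": [], "distinct": []}
    for messy_id in chosen:
        true_id = truth[messy_id]
        labeled["match"].append((messy[messy_id], canonical[true_id]))

        # Pair the same messy record with a randomly chosen wrong canonical record.
        others = [cid for cid in canonical_ids if cid != true_id]
        if others:
            wrong_id = rng.choice(others)
            labeled["distinct"].append((messy[messy_id], canonical[wrong_id]))
    return labeled
```

With the figures above, `n_pairs=2800` would give up to 1,400 matches and 1,400 non-matches. Randomly drawn non-matches tend to be easy cases, so they may be less informative than the uncertain pairs that active learning would select.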