Adding labeled training data to already trained model

Tom Proctor

Jun 18, 2018, 4:48:00 PM

to open source deduplication

I have successfully been using dedupe to identify duplicates in a data set. I will be adding new data to this data set periodically, and the added data will often contain a large number of novel words.

When I load a training file and then label new data with the console labeler, the labeling prompt mentions the previously labeled data (e.g. `28/10 positive, 30/10 negative` just after loading a training file). However, the previous labels do not appear to be used in training. If I press "f" to finish immediately, I receive a `TypeError: descriptor 'union' of 'set' object needs an argument`, which I believe is triggered when no training data is available. If I label only one or two additional pairs, I end up with terrible results, as would be expected from a model trained on almost no labeled data.

I'd like to be able to add to already labeled training data using the console labeler, but it doesn't seem to be working for me. I've poked around in the documentation, but for the life of me I can't find a way to do this beyond manually adding the old training data. This might be a bug, since the previous training is mentioned in the prompt, but I can't be sure because the documentation on this point is practically non-existent.
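
For reference, a minimal sketch of what "manually adding old training data" could look like. It assumes the training file is JSON with `match` and `distinct` keys, each holding a list of record pairs; that layout, the helper name `merge_training`, and the sample records are assumptions for illustration, not something confirmed by the dedupe documentation.

```python
import json


def merge_training(old, new):
    """Combine two training dicts, dropping pairs that appear more than once.

    Both arguments are assumed to look like
    {"match": [[rec_a, rec_b], ...], "distinct": [[rec_a, rec_b], ...]}.
    """
    merged = {"match": [], "distinct": []}
    for label in merged:
        seen = set()
        for pair in old.get(label, []) + new.get(label, []):
            # Serialize the pair so it can be used as a set member.
            key = json.dumps(pair, sort_keys=True)
            if key not in seen:
                seen.add(key)
                merged[label].append(pair)
    return merged


# Hypothetical labels from an earlier session and from a new one.
old = {
    "match": [[{"name": "Bob Smith"}, {"name": "Robert Smith"}]],
    "distinct": [[{"name": "Bob Smith"}, {"name": "Ann Lee"}]],
}
new = {
    "match": [
        [{"name": "Bob Smith"}, {"name": "Robert Smith"}],
        [{"name": "Ann Lee"}, {"name": "Anne Lee"}],
    ],
    "distinct": [],
}

combined = merge_training(old, new)
print(len(combined["match"]), len(combined["distinct"]))
```

The merged dict could then be written back out with `json.dump` and loaded as the training file for the next run. Whether the library then actually trains on all of those pairs is exactly the open question above.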