Using the mysql_example as a template, I've implemented dedupe on my own data set. When I apply my own flavor of blocking to concentrate the data and surface more duplicates, training on that subset runs quite well: I learn rules that apply well to the subset and can use them to deduplicate it. Unfortunately, when I then run that model on the full data set, it only finds duplicates within the subset I was working with; records outside that subset are entirely excluded from the deduplication. (To concentrate the positive instances, I used the subset of records whose name starts with "i".)
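To make the problem concrete, here is a minimal sketch of the subsetting I described, using hypothetical records and field names (not my actual data): restricting to names beginning with "i" concentrates duplicates for training, but any duplicate pair outside that subset never enters the picture.

```python
# Hypothetical records; 'name' and 'address' are illustrative field names.
records = {
    1: {"name": "ingrid smith", "address": "12 oak st"},
    2: {"name": "ingrid smyth", "address": "12 oak street"},
    3: {"name": "robert jones", "address": "9 elm ave"},
    4: {"name": "robert jones", "address": "9 elm avenue"},
}

# My "flavor of blocking": keep only records whose name starts with "i".
subset = {k: v for k, v in records.items() if v["name"].startswith("i")}

# Records 3 and 4 (a genuine duplicate pair) are excluded from the subset,
# so a model trained and applied this way can never pair them.
print(sorted(subset))  # → [1, 2]
```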
But when I don't do my own blocking, training finds so few positive instances of duplication that I can't build a model. In practice, I'm seeing at least 10-50 negative training instances in a row for every single positive. Is this behavior reasonable, given that I know (from the subset example above) there are significantly more positives available that shouldn't require wading through so many negatives to find?
Any general feedback is appreciated, but I have a couple of specific questions:
In the mysql_example, a random set of pairs is presented for training. I wouldn't expect many duplicates in a randomly drawn set of pairs, so the behavior I'm seeing is consistent with duplicates being sparse. Is there a different technique for building a training set that will both generalize to the full data set and contain a higher concentration of positive duplicate instances?
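One approach I've been considering (sketched below in plain Python, not dedupe's actual API): mix uniformly random pairs, which keep the sample representative, with pairs that share a cheap blocking key, which are far more likely to be positives. The field name and the first-token key are illustrative assumptions.

```python
import itertools
import random
from collections import defaultdict

def candidate_sample(records, n_random=5, seed=0):
    """Blocked pairs (likely positives) plus random pairs (coverage)."""
    rng = random.Random(seed)
    ids = list(records)

    # Random pairs keep the sample representative of the full data set.
    all_pairs = list(itertools.combinations(ids, 2))
    random_pairs = rng.sample(all_pairs, min(n_random, len(all_pairs)))

    # Blocked pairs: records sharing the first word of the (hypothetical)
    # 'name' field are much more likely to be duplicates, so labeling
    # them surfaces positives faster than uniform sampling would.
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[rec["name"].split()[0]].append(rid)
    blocked_pairs = [
        pair
        for block_ids in blocks.values()
        for pair in itertools.combinations(block_ids, 2)
    ]

    # Blocked pairs first, duplicates removed, order preserved.
    return list(dict.fromkeys(blocked_pairs + random_pairs))

records = {
    1: {"name": "ingrid smith"},
    2: {"name": "ingrid smyth"},
    3: {"name": "robert jones"},
    4: {"name": "robert j0nes"},
}
pairs = candidate_sample(records)
print(pairs[:2])  # the two blocked pairs come first: (1, 2) and (3, 4)
```

Labeling a sample like this should raise the positive rate well above what uniform random pairs give, while the random component keeps the trained rules from overfitting to one block the way my "i"-only subset did.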
Should I expect the random collection of pairs to be presented sequentially during training, or should I expect the model to get progressively better at finding and asking about likely positive duplicate instances?
The good news is that I'm past the minimal installation and setup problems and am starting to dig into how the package works and how to optimize the results. I see a lot of potential in the approach you're using, and I hope both to make use of this on a variety of problems I'm faced with and to contribute some of what I'm learning back to the project (probably in the form of documentation rather than code).
Thanks
Asoka