Using the mysql_example as a template, I've implemented dedupe on my own data set. When I apply my own flavor of blocking to concentrate the data and surface more duplicates, training on that subset runs quite well: I learn rules that apply well to the subset and can use them to deduplicate it. Unfortunately, when I then run that model on the full data set, it only finds duplicates within the subset I was working with; records outside that subset are entirely excluded from the deduplication. (To concentrate the positive instances, I used the subset of records whose name starts with "i".)
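To make the problem concrete, here is a minimal sketch of the subsetting I described, using hypothetical records and field names (not my actual data): restricting to names beginning with "i" concentrates duplicates for training, but any duplicate pair outside that subset never enters the picture.

```python
# Hypothetical records; 'name' and 'address' are illustrative field names.
records = {
    1: {"name": "ingrid smith", "address": "12 oak st"},
    2: {"name": "ingrid smyth", "address": "12 oak street"},
    3: {"name": "robert jones", "address": "9 elm ave"},
    4: {"name": "robert jones", "address": "9 elm avenue"},
}

# My "flavor of blocking": keep only records whose name starts with "i".
subset = {k: v for k, v in records.items() if v["name"].startswith("i")}

# Records 3 and 4 (a genuine duplicate pair) are excluded from the subset,
# so a model trained and applied this way can never pair them.
print(sorted(subset))  # → [1, 2]
```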
But when I don't do my own blocking, training finds so few positive instances of duplication that I can't build a model. In practice, I'm seeing at least 10-50 negative training instances in a row for every single positive. Is this behavior reasonable, given that I know (from the subset example above) there are significantly more positives available that shouldn't require wading through so many negatives to find?
Any general feedback is appreciated, but I have a couple of specific questions:
In the mysql_example, a random set of pairs is presented for training. I wouldn't expect many duplicates in a randomly drawn set of pairs, so the behavior I'm seeing is consistent with duplicates being sparse. Is there a different technique for building a training set that will both generalize to the full data set and contain a higher concentration of positive duplicate instances?
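One approach I've been considering (sketched below in plain Python, not dedupe's actual API): mix uniformly random pairs, which keep the sample representative, with pairs that share a cheap blocking key, which are far more likely to be positives. The field name and the first-token key are illustrative assumptions.

```python
import itertools
import random
from collections import defaultdict

def candidate_sample(records, n_random=5, seed=0):
    """Blocked pairs (likely positives) plus random pairs (coverage)."""
    rng = random.Random(seed)
    ids = list(records)

    # Random pairs keep the sample representative of the full data set.
    all_pairs = list(itertools.combinations(ids, 2))
    random_pairs = rng.sample(all_pairs, min(n_random, len(all_pairs)))

    # Blocked pairs: records sharing the first word of the (hypothetical)
    # 'name' field are much more likely to be duplicates, so labeling
    # them surfaces positives faster than uniform sampling would.
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[rec["name"].split()[0]].append(rid)
    blocked_pairs = [
        pair
        for block_ids in blocks.values()
        for pair in itertools.combinations(block_ids, 2)
    ]

    # Blocked pairs first, duplicates removed, order preserved.
    return list(dict.fromkeys(blocked_pairs + random_pairs))

records = {
    1: {"name": "ingrid smith"},
    2: {"name": "ingrid smyth"},
    3: {"name": "robert jones"},
    4: {"name": "robert j0nes"},
}
pairs = candidate_sample(records)
print(pairs[:2])  # the two blocked pairs come first: (1, 2) and (3, 4)
```

Labeling a sample like this should raise the positive rate well above what uniform random pairs give, while the random component keeps the trained rules from overfitting to one block the way my "i"-only subset did.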
Should I expect the random collection of pairs to be presented sequentially during training, or should I expect the model to get progressively better at finding and asking about likely positive duplicate instances?
The good news is that I'm past the minimal installation and setup problems and am starting to dig into how the package works and how to optimize the results. I see a lot of potential in the approach you're using, and I hope both to make use of this on a variety of problems I'm faced with and to contribute some of what I'm learning back to the project (probably in the form of documentation rather than code).
Thanks
Asoka