Size of labeled examples

Darek Lubomski

Jan 22, 2018, 5:49:09 AM

to open source deduplication

Hello, the dedupe library asks me for 10 examples of duplicates and 10 examples of non-duplicates.

Will increasing this number give me better results?

For example, I have one million records and 10,000 of them are labeled as duplicates. Should I use all 10,000 as training data, or will a few of them be sufficient?

Abhinav Jain

Mar 1, 2018, 11:58:10 PM

to open source deduplication

If you put all 10,000 into training, the first question is how you would do that: you need to already know the duplicates, and if you already know the duplicates there is no need for dedupe. You definitely don't need to give all 10,000 examples. Just give a few, something like 20-25 each of positive and negative examples, and you will find the answer.
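To make the "just give a few" advice concrete, here is a minimal sketch of drawing a small, balanced sample from an existing set of labels. It does not import dedupe. The function name `sample_training_pairs`, the toy records, and the default of 25 per class are all hypothetical, and the `match`/`distinct` dictionary shape is an assumption about the labeled-pair format dedupe accepts, so check the documentation for your version before feeding the result in. It also assumes the known duplicate list is complete, so any pair not on it can be treated as a non-duplicate.

```python
import random


def sample_training_pairs(records, duplicate_pairs, n_per_class=25, seed=0):
    """Pick a small, balanced set of labeled pairs from existing labels.

    records: dict mapping record id -> record (a dict of field values)
    duplicate_pairs: iterable of (id_a, id_b) known to be duplicates
    Assumption: every pair NOT in duplicate_pairs is a true non-duplicate.
    """
    rng = random.Random(seed)

    # Normalise each pair so (a, b) and (b, a) count as the same pair.
    known = sorted({tuple(sorted(p)) for p in duplicate_pairs})
    known_set = set(known)

    # Positive examples: a random subset of the known duplicates.
    match_ids = rng.sample(known, min(n_per_class, len(known)))

    # Negative examples: random pairs that are not known duplicates.
    ids = sorted(records)
    max_distinct = len(ids) * (len(ids) - 1) // 2 - len(known_set)
    target = min(n_per_class, max_distinct)
    distinct_ids = set()
    while len(distinct_ids) < target:
        pair = tuple(sorted(rng.sample(ids, 2)))
        if pair not in known_set:
            distinct_ids.add(pair)

    return {
        "match": [(records[a], records[b]) for a, b in match_ids],
        "distinct": [(records[a], records[b]) for a, b in sorted(distinct_ids)],
    }


# Toy data standing in for the one million records in the question.
records = {i: {"id": i, "name": f"person {i}"} for i in range(1, 7)}
labeled = sample_training_pairs(records, [(1, 2), (3, 4)], n_per_class=2)
print(len(labeled["match"]), len(labeled["distinct"]))
```

With real data you would pass your full record dictionary and the 10,000 known duplicate pairs, keep `n_per_class` in the 20-25 range suggested above, and hold back the remaining labels to check how well the trained model does.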
