Size of labeled examples


Darek Lubomski

Jan 22, 2018, 5:49:09 AM
to open source deduplication
Hello, the dedupe library asks me for 10 examples of duplicates and 10 examples of non-duplicates.
Will increasing this number give me better results?

For example, I have one million records and 10,000 of them are labeled as duplicates. Should I use all 10,000 as training data, or will a few of them be sufficient?



Abhinav Jain

Mar 1, 2018, 11:58:10 PM
to open source deduplication
If you put all 10,000 into training, the first question is how you would do that: you would need to know the duplicates already, and if you already know the duplicates, there is no need for dedupe. You definitely don't need to give all 10,000 examples. Just give a few, around 20-25 each of positive and negative examples, and you will find the answer.
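
If some duplicates are already labeled, as in the question, one way to follow this advice is to sample a small, balanced set of pairs from them instead of feeding in everything. The snippet below is only a rough sketch in plain Python. The function name `sample_training_pairs`, the `group_of` mapping (record id to known duplicate group), and the sample records are all made up for illustration. The `match` / `distinct` dictionary shape is my assumption about what dedupe accepts for pre-labeled pairs, so check the documentation for your version before relying on it.

```python
import random
from collections import defaultdict
from itertools import combinations


def sample_training_pairs(records, group_of, n_per_class=25, seed=0):
    """Sample up to n_per_class duplicate and non-duplicate record pairs.

    records:  dict mapping record id -> record (a dict of fields)
    group_of: dict mapping record id -> label of its known duplicate group
              (hypothetical input; records with the same label are duplicates)
    """
    rng = random.Random(seed)
    ids = sorted(records)

    # Collect record ids by known duplicate group.
    groups = defaultdict(list)
    for rec_id in ids:
        groups[group_of[rec_id]].append(rec_id)

    # Candidate matches: every pair inside a known duplicate group.
    match_ids = [pair for members in groups.values()
                 for pair in combinations(members, 2)]
    rng.shuffle(match_ids)
    match_ids = match_ids[:n_per_class]

    # Candidate non-matches: random pairs drawn from different groups.
    distinct_ids = set()
    attempts = 0
    while len(distinct_ids) < n_per_class and attempts < 100 * n_per_class:
        attempts += 1
        a, b = rng.sample(ids, 2)
        if group_of[a] != group_of[b]:
            distinct_ids.add((min(a, b), max(a, b)))

    def to_pairs(id_pairs):
        return [(records[a], records[b]) for a, b in id_pairs]

    # Assumed shape for labeled pairs: {"match": [...], "distinct": [...]}
    return {"match": to_pairs(match_ids),
            "distinct": to_pairs(sorted(distinct_ids))}


# Tiny made-up data set: records 1-2 and 3-4 are known duplicates.
records = {
    1: {"name": "Jan Kowalski", "city": "Warszawa"},
    2: {"name": "J. Kowalski", "city": "Warszawa"},
    3: {"name": "Anna Nowak", "city": "Krakow"},
    4: {"name": "Anna Nowak", "city": "Kraków"},
    5: {"name": "Piotr Zielinski", "city": "Gdansk"},
}
group_of = {1: "a", 2: "a", 3: "b", 4: "b", 5: "c"}

training = sample_training_pairs(records, group_of, n_per_class=2)
print(len(training["match"]), len(training["distinct"]))
```

With the real data you would raise `n_per_class` to something like 20-25, train on that sample, and then use the rest of the 10,000 labeled records to check how well the trained model does rather than to train it.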