id            | 1
school        | University of Southern California
abbrs         | USC
other columns | ...

name          | Joe Smith     | Jane Smith
address       | ...           | ...
school1       | Smith College | Univ. of Southern California
school2       | USC
school3       |
other columns | ...           | ...
Hi,
I found your email on GitHub. I'd like to ask you a couple of questions about your library, dedupe (I wasn't sure whether this belonged in the issues):
I have a large dataset of students and the schools they studied at:
name
address
school1
school2
school3
other columns
I have another dataset of schools and their abbreviations (abbrs):
school
abbrs
other columns
1. Would it be a good start to split each school's abbreviations into separate columns and use that table as the training set? The program would then take the student records and place each student in multiple blocks, one for each school they attended (undergrad, grad, PhD, etc.).
2. Or should I create multiple records for each student, one for each school they attended?
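To make option 2 concrete, here is a minimal sketch of the reshaping I have in mind. It is plain Python and does not touch dedupe at all; the function name explode_by_school and the sample values are only illustrative, and I am assuming each student is a dict keyed by the column names above.

```python
def explode_by_school(student, school_fields=("school1", "school2", "school3")):
    """Return one record per non-empty school field of a student.

    Hypothetical helper: every output record keeps the non-school
    columns and carries a single "school" value.
    """
    shared = {k: v for k, v in student.items() if k not in school_fields}
    records = []
    for field in school_fields:
        school = (student.get(field) or "").strip()
        if school:  # skip empty school2/school3 slots
            records.append({**shared, "school": school})
    return records


# Illustrative sample rows, not real data.
students = [
    {"name": "Joe Smith", "address": "...",
     "school1": "Smith College", "school2": "USC", "school3": ""},
    {"name": "Jane Smith", "address": "...",
     "school1": "Univ. of Southern California", "school2": "", "school3": ""},
]

# One flat list with a record per (student, school) combination.
long_records = [rec for s in students for rec in explode_by_school(s)]
```

With this layout each record has exactly one school field to compare, at the cost of repeating the student's other columns.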
My main question is whether I can give the schools (name, abbr1, abbr2, ...) as training input and the student records as the input to the program. If so, it would be great if you could help me with a small example of how to do this with your library.
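Roughly, this is what I imagine by "training input". The sketch below only builds the data in plain Python; it assumes that labeled training data can be expressed as a dict with "match" and "distinct" lists of record pairs, which is my reading of the library and may well be wrong. The helper school_to_match_pairs and the list-valued abbrs field are my own assumptions.

```python
from itertools import combinations


def school_to_match_pairs(school):
    """Pair up every name variant of one school as a known match.

    Hypothetical helper: assumes "abbrs" is a list of abbreviations.
    """
    names = [school["school"]] + list(school.get("abbrs", []))
    variants = [{"school": n} for n in names if n]
    # Every two variants of the same school refer to the same entity.
    return list(combinations(variants, 2))


# Illustrative sample row, following the schools table layout.
schools = [
    {"id": 1, "school": "University of Southern California", "abbrs": ["USC"]},
]

# Assumed training layout: pairs known to match, pairs known to differ.
labeled = {"match": [], "distinct": []}
for s in schools:
    labeled["match"].extend(school_to_match_pairs(s))
```

Pairs taken from two different schools could fill the "distinct" list in the same way, if that is useful as negative examples.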
Thanks,
Parin