Re: Dedupe solution


Forest Gregg

Nov 23, 2013, 6:48:23 PM
to open-source-...@googlegroups.com
Hi Parin,

If I understand correctly, you have one school dataset that looks like this:

id | school                            | abbrs | other columns
 1 | University of Southern California | USC   | ...


where each row is a unique school, and there is one abbreviation for each school. Is that right?

Then you have another dataset, of students, that looks like this:

name       | address | school1                      | school2 | school3 | other columns
Joe Smith  | ...     | Smith College                | USC     |         | ...
Jane Smith | ...     | Univ. of Southern California |         |         | ...


And you want to match up the schools in the student dataset with the schools in the school dataset. Is that right?

Assuming that I understand you, I would proceed as follows.

I would create a new dataset of 'school names' that combines the school names from all of those fields, keeping the school_id wherever it is known:

name,school_id
University of Southern California,1
USC,1
Smith College,
USC,
Univ. of Southern California,

And run dedupe over this data.
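As a rough illustration, the combined dataset could be assembled along these lines before it is handed to dedupe. This is only a sketch in plain Python: the field names (`id`, `school`, `abbrs`, `school1` to `school3`), the sample rows, and the helper `build_school_names` are assumptions taken from the example tables in this thread, not part of dedupe itself.

```python
# Sketch: stack every school name from both datasets into one long list.
# Field names and sample rows are assumptions based on the example tables.

schools = [
    {"id": 1, "school": "University of Southern California", "abbrs": "USC"},
]

students = [
    {"name": "Joe Smith", "school1": "Smith College",
     "school2": "USC", "school3": ""},
    {"name": "Jane Smith", "school1": "Univ. of Southern California",
     "school2": "", "school3": ""},
]


def build_school_names(schools, students):
    """Return rows of {'name', 'school_id'}; school_id is None when unknown."""
    rows = []
    # Names and abbreviations from the school dataset carry a known id.
    for school in schools:
        rows.append({"name": school["school"], "school_id": school["id"]})
        if school["abbrs"]:
            rows.append({"name": school["abbrs"], "school_id": school["id"]})
    # Names typed in by students have no id yet; skip empty cells.
    for student in students:
        for field in ("school1", "school2", "school3"):
            value = student.get(field, "").strip()
            if value:
                rows.append({"name": value, "school_id": None})
    return rows


school_names = build_school_names(schools, students)
for row in school_names:
    print(row)
```

The rows with a `school_id` of `None` are the ones that still need to be matched.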

You will get dedupe clusters that, hopefully, group synonyms together. You are likely to get two clusters for each school: one for the full name and one for the abbreviation. You should be able to combine them easily because you are keeping track of the school_id.
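Combining the clusters afterwards might look something like the following. It is a sketch only: it assumes the clustering output has already been flattened into `(cluster_id, name, school_id)` tuples, which is a hypothetical format chosen for illustration and not dedupe's actual return value, and `label_clusters` is a made-up helper.

```python
from collections import defaultdict


def label_clusters(records):
    """Map each cluster_id to the single known school_id it contains.

    records: iterable of (cluster_id, name, school_id or None).
    A cluster with no known id, or with conflicting ids, maps to None.
    """
    ids_per_cluster = defaultdict(set)
    for cluster_id, _name, school_id in records:
        ids_per_cluster[cluster_id]  # make sure the cluster is registered
        if school_id is not None:
            ids_per_cluster[cluster_id].add(school_id)

    labels = {}
    for cluster_id, school_ids in ids_per_cluster.items():
        if len(school_ids) == 1:
            labels[cluster_id] = next(iter(school_ids))
        else:
            labels[cluster_id] = None
    return labels


# Hypothetical clustering result: full names in cluster 0,
# abbreviations in cluster 1, an unmatched school in cluster 2.
records = [
    (0, "University of Southern California", 1),
    (0, "Univ. of Southern California", None),
    (1, "USC", 1),
    (1, "USC", None),
    (2, "Smith College", None),
]

labels = label_clusters(records)
print(labels)
```

Clusters 0 and 1 both end up labelled with school 1, so they can be treated as one school. Cluster 2 has no known id and would need a manual look.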

Make sense?

Best,

Forest








 


On Sat, Nov 23, 2013 at 2:23 PM, Parin Jogani <ppjo...@usc.edu> wrote:

Hi,

I found your email on github. Want to ask you a couple questions regarding your lib dedupe (not sure if this question would go in the issues):

 

I have a large dataset of students and the schools they studied in:

name | address | school1 | school2 | school3 | other columns

 

I have another set of schools and their abbrs

school | abbrs | other columns

 

1. Would it be a good start to split the school abbreviations into separate columns and give that as a training set? The program would then take in the student information and place each student in multiple blocks, depending on the number of schools he is a part of (undergrad, grad, PhD, etc.).

2. Or should I create multiple records for each student (one for each school he is a part of)?

My main question is whether I can give the schools (name, abbr1, abbr2, ...) as training input and give the student info as input to the program. If yes, it would be awesome if you could help me with a small example of how to go about this with your library.

 

Thanks,

Parin

https://github.com/waater




--
773.888.2718
2231 N. Monticello Ave.
Chicago, IL 60647

PJ

Nov 23, 2013, 7:43:11 PM
to open-source-...@googlegroups.com
What about USC as University of Southern California and USC as University of South Carolina? Should I expect the student to be a part of both clusters after I run dedupe, or will dedupe put each row in only one cluster?

Forest Gregg

Nov 24, 2013, 12:11:49 AM
to open-source-...@googlegroups.com
At least for the foreseeable future, dedupe really won't be able to do anything a human can't. If all you tell me is that a student went to USC, then I won't be able to tell whether you mean University of South Carolina or University of Southern California. Neither will dedupe.
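One thing that can be done ahead of time is to list the abbreviations that point at more than one school, so those rows can be set aside for manual review. The snippet below is a sketch under assumptions: the ids and the tuple layout are invented for illustration, the "SC" abbreviation for Smith College is made up, and `ambiguous_abbreviations` is a hypothetical helper, not a dedupe function.

```python
from collections import defaultdict

# Illustrative rows only: (id, school, abbreviation).
schools = [
    (1, "University of Southern California", "USC"),
    (2, "University of South Carolina", "USC"),
    (3, "Smith College", "SC"),  # abbreviation invented for the example
]


def ambiguous_abbreviations(schools):
    """Return {abbreviation: sorted ids} for abbreviations shared by 2+ schools."""
    ids_by_abbr = defaultdict(set)
    for school_id, _name, abbr in schools:
        ids_by_abbr[abbr].add(school_id)
    return {
        abbr: sorted(ids)
        for abbr, ids in ids_by_abbr.items()
        if len(ids) > 1
    }


ambiguous = ambiguous_abbreviations(schools)
print(ambiguous)
```

Any student row whose only school value is one of these abbreviations cannot be assigned to a single school without more information, such as the address.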






