Help with R logic - near similar name


rammano...@gmail.com

Aug 25, 2020, 10:40:02 AM
to datameet
Hi,

I have collected hospital data from multiple sources. However, each source uses a different name for the same hospital. I am trying to produce a clean list with no duplicates. I am using R and couldn't resolve this with stringdist_join. I would appreciate any suggestions on an approach.

For example, a hospital in Guntur (A.P.) is listed under the following names. Can we mark (or eliminate) the duplicates?

Example 1
SANKARA EYE HOSPITAL(GUNTUR)
SANKARA EYE HOSPITAL
SANKARA EYE HOSPITAL ( A UNIT OF SRI KANCHI KAMA KOTI MEDICAL TRUST)  


Example 2
ASHIRWAD HEART HOSPITAL ( GHATKOPAR )
Ashirwad Heart Hospital
ASHIRWAD HEART HOSPITAL ( GHATKOPAR )
Ashirwad Heart Hospita-Ghatkopar  

Thanks
Ram

Dilawar Singh

Aug 25, 2020, 10:48:44 AM
to datameet
I'm not sure what the R equivalent of Python's difflib (SequenceMatcher) is. If there is one, it should work.
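A minimal Python sketch of the difflib idea mentioned above, using two of the hospital names from the original question. The helper name `similarity` and the lowercasing step are assumptions made for illustration; only `SequenceMatcher` itself comes from the standard library.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


base = "SANKARA EYE HOSPITAL"
variants = [
    "SANKARA EYE HOSPITAL(GUNTUR)",
    "SANKARA EYE HOSPITAL ( A UNIT OF SRI KANCHI KAMA KOTI MEDICAL TRUST)",
    "Ashirwad Heart Hospital",
]

# Print each variant with its similarity to the base name.
for name in variants:
    print(f"{similarity(base, name):.2f}  {name}")
```

Longer suffixes pull the ratio down, so a single cutoff will not suit every name. It is worth inspecting the scores on real data before choosing one.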


Rahul Gupta

Aug 25, 2020, 11:16:21 AM
to data...@googlegroups.com
Hi Ram,

Not sure if there is something very similar to FuzzyWuzzy (Python) in R. But you can try this link

It is a similar kind of approximate string matching. You can set your own threshold and filter the data accordingly.
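To illustrate the threshold idea, here is a rough Python sketch. It uses the standard library's `difflib.get_close_matches` as a stand-in, not FuzzyWuzzy or the linked R approach. The helper `close_names` and the 0.7 cutoff are assumptions chosen for this example.

```python
from difflib import get_close_matches

hospitals = [
    "ASHIRWAD HEART HOSPITAL ( GHATKOPAR )",
    "Ashirwad Heart Hospital",
    "Ashirwad Heart Hospita-Ghatkopar",
    "SANKARA EYE HOSPITAL",
]


def close_names(query, candidates, threshold=0.7):
    """Return candidates whose similarity to query is at least threshold, best first."""
    # Compare lowercased strings, but remember the original spelling.
    lowered = {c.lower(): c for c in candidates}
    hits = get_close_matches(
        query.lower(), list(lowered), n=max(1, len(lowered)), cutoff=threshold
    )
    return [lowered[h] for h in hits]


matches = close_names("Ashirwad Heart Hospital", hospitals)
print(matches)
```

Raising the threshold trades missed duplicates for fewer false matches, so it usually needs tuning against a hand-checked sample.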


Ravikant P

Aug 25, 2020, 3:18:43 PM
to data...@googlegroups.com

Hi Ram,

I faced a similar situation on a project where I had to match village names in one dataset against another dataset containing ~44000 villages in Maharashtra. To find the exact (or closest) match, I used the following tricks.

I applied these steps to both strings being compared:
  1. Remove whitespace.
  2. Convert everything to lowercase.
  3. Compare the strings for an exact match. If none is found, go to the next step.
  4. Remove everything inside parentheses, including the parentheses themselves.
  5. Compare the strings for an exact match. If none is found, go to the next step.
  6. Remove characters such as ! - _ . and so on.
  7. Compare the strings for an exact match. If none is found, go to the next step.
  8. Compare the two strings for Levenshtein distance = 1, then for distance = 2, and so on. (The larger the distance, the less reliable the match.)
Levenshtein distance: reference1, wikipedia
Reference 1 mentions the function 'adist' in R. I haven't used R, so I don't know much about the readily available functions and packages, but a quick search showed me a few more functions like adist.
This might help.
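The cascade above can be sketched in Python roughly as follows. This is an illustration and not the code used on the village project: the function names (`levenshtein`, `stage_keys`, `best_match`) and the `max_distance=2` default are assumptions. In R, `adist` would take the place of the hand-written distance function.

```python
import re


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def stage_keys(name):
    """Return the three progressively stricter normalisations of a name."""
    key1 = re.sub(r"\s+", "", name).lower()  # steps 1-2: whitespace, lowercase
    key2 = re.sub(r"\([^)]*\)", "", key1)    # step 4: drop parenthesised text
    key3 = re.sub(r"[\W_]+", "", key2)       # step 6: drop punctuation
    return [key1, key2, key3]


def best_match(name, candidates, max_distance=2):
    """Return (candidate, distance) for the first match found, else None."""
    name_keys = stage_keys(name)
    cand_keys = [(c, stage_keys(c)) for c in candidates]
    for stage in range(3):                   # steps 3, 5 and 7: exact matches
        for cand, keys in cand_keys:
            if keys[stage] == name_keys[stage]:
                return cand, 0
    for distance in range(1, max_distance + 1):  # step 8: widen the search
        for cand, keys in cand_keys:
            if levenshtein(name_keys[2], keys[2]) == distance:
                return cand, distance
    return None


master = ["Ashirwad Heart Hospital", "SANKARA EYE HOSPITAL"]
print(best_match("ASHIRWAD HEART HOSPITAL ( GHATKOPAR )", master))
```

Stopping at a small maximum distance keeps short names from matching unrelated entries. Names that differ by a long suffix, such as "Hospita-Ghatkopar", fall through and need one of the other approaches in this thread.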

Best, Ravikant.


Sudatta Ray

Aug 25, 2020, 3:18:45 PM
to data...@googlegroups.com
Hi Ram,

I faced similar issues, and the following worked for me:

1. Make everything lower or upper case using tolower/toupper.
2. Use grep to match the common pattern in the name.
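A small Python sketch of the same two steps, for illustration only. In R these would be `tolower` and `grep`. The helper `grep_names` and the regular expression are assumptions made for this example.

```python
import re


def grep_names(pattern, names):
    """Return the names whose lowercased form matches the regex pattern."""
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name.lower())]


names = [
    "ASHIRWAD HEART HOSPITAL ( GHATKOPAR )",
    "Ashirwad Heart Hospital",
    "Ashirwad Heart Hospita-Ghatkopar",
    "SANKARA EYE HOSPITAL",
]

# "hospita" rather than "hospital" so the truncated spelling is also caught.
ashirwad = grep_names(r"ashirwad\s+heart\s+hospita", names)
print(ashirwad)
```

This works well when the common pattern is known in advance, but it needs one pattern per hospital, so it suits spot fixes better than a full 18k-row list.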

Best,
Sudatta

m...@ncf-india.org

Aug 25, 2020, 8:51:24 PM
to datameet
Hi Ram

In addition to the helpful suggestions made above, here are some R-specific pointers:
- stringr is an extremely helpful package for most of the string manipulation actions recommended above (whitespace removal, tokenisation, regex matching).
- You may also need a package that helps you compute ‘distances’ between the strings you are comparing. stringdist is one such package. However, with Indian names, I found some of the phonetic distance algorithms (rogerroot, soundex) in the phonics package much more helpful.
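To show what a phonetic key does, here is a rough Python sketch of Soundex written from its commonly published rules. It is not the phonics package's implementation, and results may differ from that package in edge cases. The names `soundex` and `phonetic_key` are assumptions for illustration.

```python
import re

# Consonant groups that share a Soundex digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the 4-character Soundex code of a single word (ASCII letters only)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    last = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != last:
            result += code
        last = code   # vowels reset the previous code
    return (result + "000")[:4]


def phonetic_key(name):
    """Soundex code of every word in a name, as a tuple."""
    return tuple(soundex(token) for token in re.findall(r"[A-Za-z]+", name))


print(phonetic_key("Sankara Eye Hospital"))
print(phonetic_key("SHANKARA EYE HOSPITAL"))
```

Two spellings that yield the same key are candidates for the same hospital. Soundex was designed for English surnames, which may be why other phonetic algorithms can work better on Indian names.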

Hope this helps! Good luck!
Madhu

Nikhil VJ

Aug 26, 2020, 2:50:09 AM
to datameet
Hi Ram,

I'm not sure about R, but if you have the list in an Excel or CSV file, then OpenRefine can help you iron it all out in a jiffy. Check out this article I've written that explains the flow for this particular task:

OpenRefine is a tool made for non-coders to clean up messy data. Site: https://openrefine.org/ 

--
Cheers,
Nikhil VJ
https://nikhilvj.co.in


rammano...@gmail.com

Aug 29, 2020, 1:53:10 PM
to datameet
Thank you Dilawar, Rahul, Ravikant, Sudatta, Madhu, Nikhil:

I mixed and matched all the options you suggested, and I now have a list of 18k hospitals in India. I will be providing this data at http://india-data.com/, where people can search for information by pincode. A beta version is live at http://india-data.com/pincode/221107/.

Thanks again to all.

Regards
Ram

Herry Gulabani

Aug 30, 2020, 10:09:15 AM
to data...@googlegroups.com
Ram, 

Would it be possible to add the number of beds to the hospital data? If so, it could be a great data source to complement the Census Number of Hospital Beds data.

Or is it some kind of location/map data that you are scraping to get the names and locations of hospitals?

Great work by the way! 



--
Herry Gulabani
Master Of Planning (2019)
USC Sol Price School of Public Policy
