Help with R logic - near similar name

rammano...@gmail.com

Aug 25, 2020, 10:40:02 AM

to datameet

Hi,

I have collected hospital data from multiple sources. However, each source uses a different name for the same hospital. I am trying to clean the list so that it has no duplicates. I am using R and couldn't resolve this with stringdist_join. I would appreciate any suggestions on an approach.

For example, a hospital in Guntur (A.P.) is listed under the following names. Can we mark (or eliminate) the duplicates?

Example 1
SANKARA EYE HOSPITAL(GUNTUR)
SANKARA EYE HOSPITAL
SANKARA EYE HOSPITAL ( A UNIT OF SRI KANCHI KAMA KOTI MEDICAL TRUST)

Example 2

ASHIRWAD HEART HOSPITAL ( GHATKOPAR )
Ashirwad Heart Hospital
ASHIRWAD HEART HOSPITAL ( GHATKOPAR )
Ashirwad Heart Hospita-Ghatkopar

Thanks

Ram

Dilawar Singh

Aug 25, 2020, 10:48:44 AM

to datameet

I'm not sure what the R equivalent of Python's difflib (SequenceMatcher) is. If there is one, it will work.
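For reference, here is a minimal Python sketch of the difflib idea, using names from the original post. The helper name `similarity` and the choice to lowercase before comparing are assumptions for illustration, not something from this thread.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] for two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


names = [
    "SANKARA EYE HOSPITAL(GUNTUR)",
    "SANKARA EYE HOSPITAL",
    "Ashirwad Heart Hospital",
    "Ashirwad Heart Hospita-Ghatkopar",
]

# Score every pair of names; high-scoring pairs are candidate duplicates
# that still need a manual check.
pairs = sorted(
    ((similarity(a, b), a, b) for a, b in combinations(names, 2)),
    reverse=True,
)

if __name__ == "__main__":
    for score, a, b in pairs:
        print(f"{score:.2f}  {a!r}  <->  {b!r}")
```

Comparing every pair grows quadratically with the list size, so for a large list it may help to group names by city or pincode first.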

Sent from a handheld device. Pardon the brevity and typos.

--
Datameet is a community of Data Science enthusiasts in India. Know more about us by visiting http://datameet.org
---
You received this message because you are subscribed to the Google Groups "datameet" group.
To unsubscribe from this group and stop receiving emails from it, send an email to datameet+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/datameet/19ee8101-84ec-42b0-974a-43035b5902f1n%40googlegroups.com.

Rahul Gupta

Aug 25, 2020, 11:16:21 AM

to data...@googlegroups.com

Hi Ram,

I'm not sure if there is something very similar to FuzzyWuzzy (Python) in R, but you can try this link:

https://astrostatistics.psu.edu/su07/R/html/base/html/agrep.html

It does a similar kind of approximate string matching. You can set your own threshold criteria and filter the data accordingly.
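agrep itself is an R function. As a rough illustration of the "set a threshold and filter" idea only, here is a Python sketch using difflib.get_close_matches. It is not the same algorithm as agrep, and the helper name `near_duplicates` and the default cutoff of 0.8 are assumptions.

```python
from difflib import get_close_matches


def near_duplicates(name, others, cutoff=0.8):
    """Return the entries of `others` whose similarity to `name` is >= cutoff.

    Comparison is case-insensitive; the original spellings are returned.
    """
    target = name.lower()
    return [
        other
        for other in others
        # get_close_matches returns a non-empty list only when the single
        # candidate scores at or above the cutoff.
        if get_close_matches(target, [other.lower()], n=1, cutoff=cutoff)
    ]


if __name__ == "__main__":
    hospitals = ["ASHIRWAD HEART HOSPITAL", "Sankara Eye Hospital"]
    print(near_duplicates("Ashirwad Heart Hospital", hospitals))
```

Raising the cutoff gives fewer, safer matches; lowering it catches more variants but also more false positives.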

Ravikant P

Aug 25, 2020, 3:18:43 PM

to data...@googlegroups.com

Hi Ram,

For one project I had to match a village name in one dataset against another dataset containing ~44,000 villages in Maharashtra, so I faced a similar situation. To find an exact (or the closest) match, I used the following tricks.

For both strings being compared:

1. Remove white space.
2. Convert everything to lowercase.
3. Compare the strings for an exact match. If none is found, go to the next step.
4. Remove everything inside parentheses, including the parentheses themselves.
5. Compare the strings for an exact match. If none is found, go to the next step.
6. Remove characters like ! - _ . etc.
7. Compare the strings for an exact match. If none is found, go to the next step.
8. Compare the two strings for a Levenshtein distance of 1, then a distance of 2, and so on. (The greater the distance, the less accurate the result.)

Levenshtein distance: reference1, wikipedia

Reference 1 mentions the function 'adist' in R. I haven't used R, so I don't know much about the readily available functions and packages, but a quick search showed me a few more functions like adist.
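The steps above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names (`levenshtein`, `normalise_stages`, `match`), the `max_distance` default of 2, and the choice to strip every non-alphanumeric character in the last cleaning stage are all hypothetical, not the code used in the project.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def normalise_stages(name: str) -> list:
    """Return three progressively more aggressive normalisations of a name."""
    s1 = re.sub(r"\s+", "", name).lower()  # remove white space, lowercase
    s2 = re.sub(r"\([^)]*\)", "", s1)      # drop parenthesised text
    s3 = re.sub(r"[^a-z0-9]", "", s2)      # drop punctuation such as ! - _ .
    return [s1, s2, s3]


def match(name: str, candidates: list, max_distance: int = 2):
    """Return (candidate, distance) for the first match found, else (None, None)."""
    target = normalise_stages(name)
    staged = [(c, normalise_stages(c)) for c in candidates]
    # Exact comparison after each cleaning stage.
    for level in range(3):
        for cand, stages in staged:
            if stages[level] == target[level]:
                return cand, 0
    # Fuzzy fallback: accept distance 1 first, then 2, and so on.
    for dist in range(1, max_distance + 1):
        for cand, stages in staged:
            if levenshtein(target[2], stages[2]) == dist:
                return cand, dist
    return None, None


if __name__ == "__main__":
    print(match("SANKARA EYE HOSPITAL(GUNTUR)", ["Sankara Eye Hospital"]))
```

With the thread's own examples, "Ashirwad Heart Hospita-Ghatkopar" would not match "ASHIRWAD HEART HOSPITAL ( GHATKOPAR )" at a small distance, because the locality survives in one string and is removed with the parentheses in the other. Cases like that likely need an extra rule or a manual pass.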

This might help.

Best, Ravikant.

Sudatta Ray

Aug 25, 2020, 3:18:45 PM

to data...@googlegroups.com

Hi Ram,

I faced similar issues, and the following worked for me:

1. Make everything lower or upper case using tolower/toupper.

2. Use grep to match the common pattern in the names.

Best,

Sudatta

m...@ncf-india.org

Aug 25, 2020, 8:51:24 PM

to datameet

Hi Ram

In addition to the helpful suggestions made above, here are some R-specific pointers:

- stringr is an extremely helpful package for most of the string manipulation actions (whitespace removal, tokenisation, regex matching) recommended above.

- You may also need a package that helps you compute ‘distances’ between the strings you are comparing. stringdist is one such package. However, with Indian names, I found some of the phonetic distance algorithms (rogerroot, soundex) in the phonics package much more helpful.

Hope this helps! Good luck!

Madhu

Nikhil VJ

Aug 26, 2020, 2:50:09 AM

to datameet

Hi Ram,

I'm not sure about R, but if you have the list in an Excel / CSV file, then OpenRefine can help you iron it all out in a jiffy. Check out this article I've written, which explains the flow for this particular task:

http://datameet.org/2018/06/13/openrefine-bus-stop/

OpenRefine is a tool made for non-coders to clean up messy data. Site: https://openrefine.org/

--
Cheers,
Nikhil VJ
https://nikhilvj.co.in

rammano...@gmail.com

Aug 29, 2020, 1:53:10 PM

to datameet

Thank you Dilawar, Rahul, Ravikant, Sudatta, Madhu, Nikhil:

I mixed and matched all the options you suggested. I now have a list of 18k hospitals in India. I will be providing this data at http://india-data.com/ , where people can search for information by pincode. A beta version is live at http://india-data.com/pincode/221107/ .

Thanks again to all.

Regards

Ram

Herry Gulabani

Aug 30, 2020, 10:09:15 AM

to data...@googlegroups.com

Ram,

Would it be possible to add the number of beds to the hospital data? If so, it could be a great data source to complement the Census Number of Hospital Beds data.

Or is it some kind of location/map data that you are scraping to get the names and locations of hospitals?

Great work by the way!

--

Herry Gulabani

Master Of Planning (2019)

USC Sol Price School of Public Policy

(213)431-7634 | gulaba...@gmail.com

Website: gulabani.wixsite.com/portfolio