> Here I will give you some more typical records (they are all assumed to be
> matched records) from my global master. I have changed all records to upper
> records for clarity.
>
> MAHABOOB SAHEB S ,MAHABOOB SAHEB
> CHANDRAMALA P ,P CHANDRAMALA
> SREERAMULU RAJU DA,SREERAMULURAJU
> GOVINDA SWAMY SRIP,S G SWAMY
> SUBBARAYULU MANIGA,SUBBARAYULU M
> RAJENDRAN A R ,A R RAJENDRAN
> MODIN SHAREEF SHAI,S M SHAREEF
> MODIN SHAREEF SHAI,MODIN SHAREEF
> MODIN SHAREEF SHAI,SHAREEF
>
> Please see the last 3 records which are all matched but spelled little
> differently.
I wrote a program in another language, the TXR language, to attack this problem
in a detailed way in hopes of getting good results.
The general approach I chose is to generate permutations of one name, and find
the other name among those permutations. If that is not successful, the same
search is tried again with the roles of the two names swapped.
To "generate permutations" means to take a name, and generate all possible
orderings of its elements, under all possible reductions of its elements to
one-letter initials, and all possible reductions of the name by omission of
components. A constraint is that each permutation has at least one full name
word in it (in other words, a permutation cannot reduce everything to initials,
so "A R R" is not a valid permutation of "RAJENDRAN A R"). A refinement, also,
is that any two-letter word consisting of all caps is treated as a pair of
initials, so "Archer SK" is treated as three components: "Archer S K".
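For readers who do not know TXR, the rules above can be approximated in Python.
This is only a rough sketch under stated assumptions, not a translation of my
script: the names split_components, apply_mask and name_variants are made up
for illustration, omitted components are simply dropped rather than kept as
empty strings, and duplicate orderings are collapsed by collecting into a set.

```python
from itertools import permutations, product


def split_components(name):
    """Split a name into full words and one-letter initials.

    Assumption: an all-caps word of one or two letters is a run of initials.
    """
    words, initials = [], []
    for word in name.split():
        if len(word) <= 2 and word.isupper():
            initials.extend(word)          # "SK" -> "S", "K"
        else:
            words.append(word)
    return words, initials


def apply_mask(mask, parts):
    """Keep, shorten to an initial, or drop each part according to its mask."""
    out = []
    for action, part in zip(mask, parts):
        if action == "keep":
            out.append(part)
        elif action == "initial":
            out.append(part[0])
        # "omit" contributes nothing
    return out


def name_variants(name):
    """Return the set of component tuples that can be derived from name."""
    words, initials = split_components(name)
    variants = set()
    for word_mask in product(("keep", "initial", "omit"), repeat=len(words)):
        if "keep" not in word_mask:
            continue                       # at least one word stays in full
        for init_mask in product(("keep", "omit"), repeat=len(initials)):
            parts = apply_mask(word_mask, words) + apply_mask(init_mask, initials)
            variants.update(permutations(parts))
    return variants
```

With this sketch, ("A", "R", "RAJENDRAN") is among the variants of
"RAJENDRAN A R", while ("A", "R", "R") is not, because no full word survives
in it.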
The ultimate matching of a name among the permutations is done with white space
removed. That takes care of problem situations like "SREERAMULU RAJU" versus
"SREERAMULURAJU". For instance, the permutation "FOO BAR" matches both
"FOO BAR" and "FOOBAR".
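The white-space-insensitive comparison can be pictured with a few lines of
Python. This is an illustrative sketch only; squash and variant_matches are
hypothetical helper names, and a variant is assumed to be a tuple of name
components.

```python
def squash(text):
    """Remove every white space character from text."""
    return "".join(text.split())


def variant_matches(variant, name):
    """True if the variant's components, run together, spell the squashed name."""
    return "".join(variant) == squash(name)


print(variant_matches(("FOO", "BAR"), "FOO BAR"))   # True
print(variant_matches(("FOO", "BAR"), "FOOBAR"))    # True
print(variant_matches(("BAR", "FOO"), "FOOBAR"))    # False
```

The last call fails on purpose: the reversed order only matches if that
ordering is itself one of the generated permutations.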
I have taken your two data sets and combined them. I converted the original
example into the same fixed-column format (18 columns, comma, 18 columns).
$ cat names.txt
MAHABOOB SAHEB S  ,MAHABOOB SAHEB
CHANDRAMALA P     ,P CHANDRAMALA
SREERAMULU RAJU DA,SREERAMULURAJU
GOVINDA SWAMY SRIP,S G SWAMY
SUBBARAYULU MANIGA,SUBBARAYULU M
RAJENDRAN A R     ,A R RAJENDRAN
MODIN SHAREEF SHAI,S M SHAREEF
MODIN SHAREEF SHAI,MODIN SHAREEF
MODIN SHAREEF SHAI,SHAREEF
Archer SK         ,Archer S K
Chapman W         ,Chapman William
Edward Phil       ,Edward P
Samuel C          ,Samson P
$ txr match.txr names.txt
MAHABOOB SAHEB S  ,MAHABOOB SAHEB     key: ("MAHABOOB" "SAHEB" "")
CHANDRAMALA P     ,P CHANDRAMALA      key: ("CHANDRAMALA" "P")
SREERAMULU RAJU DA,SREERAMULURAJU     key: ("SREERAMULU" "RAJU" "" "")
GOVINDA SWAMY SRIP,S G SWAMY          key: ("S" "G" "SWAMY")
SUBBARAYULU MANIGA,SUBBARAYULU M      key: ("SUBBARAYULU" "M")
RAJENDRAN A R     ,A R RAJENDRAN      key: ("RAJENDRAN" "A" "R")
MODIN SHAREEF SHAI,S M SHAREEF        key: ("S" "M" "SHAREEF")
MODIN SHAREEF SHAI,MODIN SHAREEF      key: ("MODIN" "SHAREEF" "")
MODIN SHAREEF SHAI,SHAREEF            key: ("SHAREEF" "" "")
Archer SK         ,Archer S K         key: ("Archer" "S" "K")
Chapman W         ,Chapman William    key: ("Chapman" "W")
Edward Phil       ,Edward P           key: ("Edward" "P")
Samuel C          ,Samson P           *mismatch*
For a match, the program prints the "key", which shows us the permutation that
was generated from one name and which matched the other. So for instance, we
know that RAJENDRAN A R and A R RAJENDRAN completely matched on all three
components because the key is ("RAJENDRAN" "A" "R").
Similarly, from the key we know that "SREERAMULU RAJU DA" and "SREERAMULURAJU"
matched on ("SREERAMULU" "RAJU" "" ""). The two blank strings represent the
fact that the permutations were being generated on the left side, which has
four components: "SREERAMULU", "RAJU", "D" and "A". "DA" is treated as initials
and split into two. But only the permutation with the initials blanked out
could match the right-hand side, which has no such initials.
Note that even if the master file had the catenated name backwards as
"RAJUSREERAMULU", the match would still be made, because of this permutation
process.
Contents of "match.txr" script:
@(do
   (defun name-list (name-string)
     (tok-str name-string #/\S+/))
   (defun sort-by-desc-length (name-list)
     [sort name-list > length])
   (defun break-up-initials (name-list)
     (mappend [iff (do and [all @1 chr-isupper] (< (length @1) 3))
                   (op tok-str @1 #/./)
                   list]
              name-list))
   (defun perm-with-masks (masks list)
     (collect-each ((mask masks))
       (collect-each ((mask-elem mask)
                      (list-elem list))
         (caseq mask-elem
           (:omit "")
           (:shorten [list-elem 0..1])
           (:preserve list-elem)))))
   (defun nameperm (name-string)
     (let* ((nl [[chain name-list sort-by-desc-length break-up-initials]
                 name-string])
            (long-name-count [count-if (op < 1) nl length])
            (initial-count (- (length nl) long-name-count))
            (long-perm-masks (remove-if (op none @1 (op equal :preserve))
                                        (rperm '(:shorten :omit :preserve)
                                               long-name-count)))
            (initials-perm-masks (rperm '(:omit :preserve) initial-count))
            (long-perm (perm-with-masks long-perm-masks nl))
            (initials-perm (perm-with-masks initials-perm-masks
                                            [nl long-name-count..:]))
            (combined-perm (append-each ((lp long-perm))
                             (collect-each ((ip initials-perm))
                               (append lp ip)))))
       [mappend perm combined-perm]))
   (defun name-derived-from (name-1 name-2)
     [find (regsub #/\s/ "" name-1) (nameperm name-2) : cat-str]))
@(repeat)
@{left 18},@{right 18}
@(do (let ((f (or (name-derived-from left right)
                  (name-derived-from right left))))
       (put-line `@{left 18},@{right 18} @(if
                                            f `key: @(tostring f)`
                                            "*mismatch*")`)))
@(end)
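The loop at the bottom of the script reads each fixed-column record and tries
the derivation in both directions. A hedged Python sketch of just that driver
logic follows. The names split_record, match_either_way and report are
hypothetical, and subset_key is a deliberately crude stand-in for the real
permutation search, present only so that the example runs by itself. Unlike
the TXR pattern, this sketch also tolerates a right-hand field shorter than
18 columns.

```python
def split_record(line, width=18):
    """Split 'left,right' where each field occupies `width` columns."""
    if line[width:width + 1] != ",":
        raise ValueError(f"no comma in column {width + 1}: {line!r}")
    left = line[:width]
    right = line[width + 1:width + 1 + width]
    return left.strip(), right.strip()


def match_either_way(left, right, derived_from):
    """Try derived_from(left, right) first, then the swapped call."""
    return derived_from(left, right) or derived_from(right, left)


def subset_key(name_1, name_2):
    """Stand-in matcher (assumption): every word of name_1 occurs in name_2."""
    words = name_1.split()
    if words and set(words) <= set(name_2.split()):
        return tuple(words)
    return None


def report(line, derived_from=subset_key):
    """Format one record the way the script's output is laid out."""
    left, right = split_record(line)
    key = match_either_way(left, right, derived_from)
    verdict = f"key: {key}" if key else "*mismatch*"
    return f"{left:<18},{right:<18} {verdict}"


print(report("MODIN SHAREEF SHAI,SHAREEF"))
print(report(f"{'Samuel C':<18},{'Samson P':<18}"))
```

Passing the matcher in as a parameter keeps the two-direction logic separate
from whatever rule decides that one name is derived from the other.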