word matching in two different fields

nag

unread,

Feb 3, 2015, 10:05:05 PM2/3/15

to

Hi all,

I am trying to extract records by matching two fields.
I have two columns

Archer SK, Archer S K
Chapman W, Chapman William
Edward Phil, Edward P
Samuel C, Samson P

awk -F, '{
pname = toupper($1);gsub(/ /,"",pname)
mastname =toupper($2);gsub(/ /,"",mastname)
if(pname==mastname){
print pname,mastname
}
}' input

This will give onle the first record. Actually the second and third records are also matched ones. (Ex: in second record William in second field is taken as W in first field)

I want to match the records at least one word(more than one letter) is matched in the second field the same record. Then second and third records also be extracted.

Please help. Thank you.

Janis Papanagnou

unread,

Feb 3, 2015, 11:38:27 PM2/3/15

to

Your program is not designed reliable; "Edward S T" will match "Edwards T".

Just guessing what you want...

define FS=", +"
sub " " by "_" in $1 and in $2
gsub " " by "" in $1 and in $2
compare $1 and $2 up to the length of the shorter string
if equal print $1 and $2 with "_" replaced by " "

Janis

>
>
>

nag

unread,

Feb 4, 2015, 8:30:07 AM2/4/15

to

Nope.

Here I will give you some more typical records (they are all assumed to be matched records) from my global master. I have changed all records to upper records for clarity.

MAHABOOB SAHEB S ,MAHABOOB SAHEB
CHANDRAMALA P ,P CHANDRAMALA
SREERAMULU RAJU DA,SREERAMULURAJU
GOVINDA SWAMY SRIP,S G SWAMY
SUBBARAYULU MANIGA,SUBBARAYULU M
RAJENDRAN A R ,A R RAJENDRAN
MODIN SHAREEF SHAI,S M SHAREEF
MODIN SHAREEF SHAI,MODIN SHAREEF
MODIN SHAREEF SHAI,SHAREEF

Please see the last 3 records which are all matched but spelled little differently.

Anton Treuenfels

unread,

Feb 4, 2015, 9:38:52 AM2/4/15

to

"nag" <visi...@gmail.com> wrote in message
news:b238e3e6-75d8-4040...@googlegroups.com...

As far as I can tell your original example is working exactly the way it is
written. Only the first record can match the criteria you specified. And as
far as I can tell none of the records in your second example will match.

Stripping out spaces is a good start (and what allows the first record of
the first example to match), but it doesn't do anything for records that are
spelled differently since they can never match exactly. Perhaps you'd be
better off with a different test. Instead of

if ( pname == mastname )

try

if ( (index(pname, mastname) ==1) || (index(mastname,pname) == 1) )

which should catch any case where the shorter of the two names matches the
start of the longer one. The first three records of your first example
should pass this test. In your second example it should match records one,
three, five and eight.

- Anton Treuenfels

Kaz Kylheku

unread,

Feb 4, 2015, 10:13:40 AM2/4/15

to

On 2015-02-04, nag <visi...@gmail.com> wrote:
> On Wednesday, 4 February 2015 10:08:27 UTC+5:30, Janis Papanagnou wrote:
>> On 04.02.2015 04:05, nag wrote:
>> > Hi all,
>> >
>> > I am trying to extract records by matching two fields.
>> > I have two columns
>> >
>> > Archer SK, Archer S K
>> > Chapman W, Chapman William
>> > Edward Phil, Edward P
>> > Samuel C, Samson P

> Here I will give you some more typical records (they are all assumed to be
> matched records) from my global master.
>
> I have changed all records to upper
> records for clarity.

Bad idea. You should cite the original data that is to be fed to the
proposed program, without any such alterations.

> MAHABOOB SAHEB S ,MAHABOOB SAHEB

This very first case reveals something that your sample data missed.
You want these names to match, even though the given names are different.
The initials on the left are SS and on the right they are S.

> CHANDRAMALA P ,P CHANDRAMALA

This case also reveals something not in your original sample data:
the surname and given initial are reordered.

> SREERAMULU RAJU DA,SREERAMULURAJU

Here, "DA" is missing on the right, yet you say this matches.

Also, you neglected to reveal that the data uses fixed-width 18 character
fields. There could be cases where this helps, like some record that happens
to contain a spurious comma.

Bruce Horrocks

unread,

Feb 4, 2015, 10:22:16 AM2/4/15

to

On 04/02/2015 03:05, nag wrote:
> I want to match the records at least one word(more than one letter)
> is matched in the second field the same record. Then second and third
> records also be extracted.

You could split the strings into two arrays of words using the split
function with space as the separator. Then drop all one-letter words
from each array. Then see if the first word from pname is in mastname
(using the 'in' operator.)

If so then you have found a record where at least one word matches.

--
Bruce Horrocks
Surrey
England
(bruce at scorecrow dot com)

nag

unread,

Feb 4, 2015, 7:51:42 PM2/4/15

to

The two fields are taken from different files. The second one is master file field and the first one is user provided. I am trying to correct the master on the basis of user data. In some cases the surname or real names are written in shorter form. I am trying to match the data on at least one "word" of the user data matches in the mastdata. I will further filter the same after initial filtering is over.

nag

unread,

Feb 4, 2015, 8:09:34 PM2/4/15

to

Thank you Anton and Janis..
the following could give me maximum number of matched records..

awk 'BEGIN{FS=",+"}{
pname = $1; sub(/ /,"_",pname)
mastname = $2; sub(/ /,"_",mastname)
gsub(/ /,"",pname)
gsub(/ /,"",mastname)
if ( (index(pname, mastname) ==1) || (index(mastname,pname) == 1) ){
gsub(/_/," ",pname)
gsub(/_/," ",mastname)
print pname,mastname
}
}'

Janis Papanagnou

unread,

Feb 4, 2015, 8:18:19 PM2/4/15

to

On 05.02.2015 02:09, nag wrote:
>
> awk 'BEGIN{FS=",+"}{

Above code means an arbitrary number of commas, while the
upthread suggested FS, which was

BEGIN{FS=", +"}

means a comma followed by an arbitrary amount of blanks.

Janis

Kaz Kylheku

unread,

Feb 4, 2015, 10:12:03 PM2/4/15

to

On 2015-02-04, nag <visi...@gmail.com> wrote:

> Here I will give you some more typical records (they are all assumed to be
> matched records) from my global master. I have changed all records to upper
> records for clarity.
>
> MAHABOOB SAHEB S ,MAHABOOB SAHEB
> CHANDRAMALA P ,P CHANDRAMALA
> SREERAMULU RAJU DA,SREERAMULURAJU
> GOVINDA SWAMY SRIP,S G SWAMY
> SUBBARAYULU MANIGA,SUBBARAYULU M
> RAJENDRAN A R ,A R RAJENDRAN
> MODIN SHAREEF SHAI,S M SHAREEF
> MODIN SHAREEF SHAI,MODIN SHAREEF
> MODIN SHAREEF SHAI,SHAREEF
>
> Please see the last 3 records which are all matched but spelled little
> differently.

I wrote a program in another language, the TXR language, to attack this problem
in a detailed way in hopes of getting good results.

The general approach I chose is to generate permutations of one name, and find
the other name among those permutations. If that is not succesful, then try it
with the names reversed.

To "generate permutations" means to take a name, and generate all possible
orderings of its elements, under all possible reductions of its elements to
one-letter initials, and all possible reductions of the name by omission of
components. A constraint is that each permutation has at least one full name
word in it (in other words, a permutation cannot reduce everything to initials,
so "A R R" is not a valid permutation of "RAJENDRAN A R"). A refinement, also,
is that any two-letter word consisting of all caps is treated as a pair of
initials, so "Archer SK" is treated as three components: "Archer S K".

The ultimate matching of a name among the permutations is done with white space
removed. That takes care of probelm situations like "SREERAMULU RAJU" versus
"SREERAMULURAJU". For instance the permutation "FOO BAR" matches "FOO BAR"
and "FOOBAR".

I have taken your two data sets and combined them. I converted the original
example one into this fixed column format (18 columns, comma, 18 columns).

$ cat names.txt

MAHABOOB SAHEB S ,MAHABOOB SAHEB
CHANDRAMALA P ,P CHANDRAMALA
SREERAMULU RAJU DA,SREERAMULURAJU
GOVINDA SWAMY SRIP,S G SWAMY
SUBBARAYULU MANIGA,SUBBARAYULU M
RAJENDRAN A R ,A R RAJENDRAN
MODIN SHAREEF SHAI,S M SHAREEF
MODIN SHAREEF SHAI,MODIN SHAREEF
MODIN SHAREEF SHAI,SHAREEF

Archer SK ,Archer S K

Chapman W ,Chapman William

Edward Phil ,Edward P
Samuel C ,Samson P

$ txr match.txr names.txt
MAHABOOB SAHEB S ,MAHABOOB SAHEB key: ("MAHABOOB" "SAHEB" "")
CHANDRAMALA P ,P CHANDRAMALA key: ("CHANDRAMALA" "P")
SREERAMULU RAJU DA,SREERAMULURAJU key: ("SREERAMULU" "RAJU" "" "")
GOVINDA SWAMY SRIP,S G SWAMY key: ("S" "G" "SWAMY")
SUBBARAYULU MANIGA,SUBBARAYULU M key: ("SUBBARAYULU" "M")
RAJENDRAN A R ,A R RAJENDRAN key: ("RAJENDRAN" "A" "R")
MODIN SHAREEF SHAI,S M SHAREEF key: ("S" "M" "SHAREEF")
MODIN SHAREEF SHAI,MODIN SHAREEF key: ("MODIN" "SHAREEF" "")
MODIN SHAREEF SHAI,SHAREEF key: ("SHAREEF" "" "")
Archer SK ,Archer S K key: ("Archer" "S" "K")
Chapman W ,Chapman William key: ("Chapman" "W")
Edward Phil ,Edward P key: ("Edward" "P")
Samuel C ,Samson P *mismatch*

For a match, the program prints the "key": which shows us the permutation that
was generated from one name, and which matched the other. So for instance, we
know that RAJENDRAN A R and A R RAJENDRAN completely matched on all three
components because the key is ("RAJENDRAN" "A" "R").

Similarly, from the key we know that "SREERAMULU RAJU DA" and "SREERAMULURAJU"
matched on ("SREERAMULU" "RAJU" "" ""). The two blank strings represent the
fact that the permutations were being generated on the left side, which has
four components: "SREERAMLU", "RAJU", "D" and "A". "DA" is treated as initials
and split into two. But only the permutation with the initials blanked out
could match the right hand side, which has no such initials.

Note that even if the master file had the catenated name backwards as
"RAJUSREERAMULU", the match would still be made, because of this permutation
process.

Contents of "match.txr" script:

@(do
(defun name-list (name-string)
(tok-str name-string #/\S+/))

(defun sort-by-desc-length (name-list)
[sort name-list > length])

(defun break-up-initials (name-list)
(mappend [iff (do and [all @1 chr-isupper] (< (length @1) 3))
(op tok-str @1 #/./)
list]
name-list))

(defun perm-with-masks (masks list)
(collect-each ((mask masks))
(collect-each ((mask-elem mask)
(list-elem list))
(caseq mask-elem
(:omit "")
(:shorten [list-elem 0..1])
(:preserve list-elem)))))

(defun nameperm (name-string)
(let* ((nl [[chain name-list sort-by-desc-length break-up-initials]
name-string])
(long-name-count [count-if (op < 1) nl length])
(initial-count (- (length nl) long-name-count))
(long-perm-masks (remove-if (op none @1 (op equal :preserve))
(rperm '(:shorten :omit :preserve)
long-name-count)))
(initials-perm-masks (rperm '(:omit :preserve) initial-count))
(long-perm (perm-with-masks long-perm-masks nl))
(initials-perm (perm-with-masks initials-perm-masks
[nl long-name-count..:]))
(combined-perm (append-each ((lp long-perm))
(collect-each ((ip initials-perm))
(append lp ip)))))
[mappend perm combined-perm]))

(defun name-derived-from (name-1 name-2)
[find (regsub #/\s/ "" name-1) (nameperm name-2) : cat-str]))
@(repeat)
@{left 18},@{right 18}
@(do (let ((f (or (name-derived-from left right)
(name-derived-from right left))))
(put-line `@{left 18},@{right 18} @(if
f `key: @(tostring f)`
"*mismatch*")`)))
@(end)