Am asking on behalf of a colleague who is doing a project involving matching based on Social Security numbers. The client wants to match two files using the Social Security number and wants to detect as possible matches cases where position by position 7 of the nine digits match. I am presuming that is because they want to account for the possibility of transposed digits (but don't know that for sure).
I've come up with what I call a brute force solution that uses the sum function in the where clause to count position by position how many characters match. Based on the sample below, it seems to work.
But I am wondering if any of the RegEx gurus on the list can comment on whether this can be done with RegEx.
TIA. My sample data and quick and dirty brute force code follows.
data one; input ssn $9.; datalines; 123456789 987654321 121212121 343434343 ; data two; input ssn $9.; datalines; 123456789 987645321 121212121 232323232 ; proc sql; create table matches as select a.ssn as ssn_one, b.ssn as ssn_two, case when (a.ssn ne b.ssn) then 'Not Exact' else 'Exact' end as MatchResults from one a, two b where sum(substr(a.ssn,1,1)=substr(b.ssn,1,1), substr(a.ssn,2,1)=substr(b.ssn,2,1), substr(a.ssn,3,1)=substr(b.ssn,3,1), substr(a.ssn,4,1)=substr(b.ssn,4,1), substr(a.ssn,5,1)=substr(b.ssn,5,1), substr(a.ssn,6,1)=substr(b.ssn,6,1), substr(a.ssn,7,1)=substr(b.ssn,7,1), substr(a.ssn,8,1)=substr(b.ssn,8,1), substr(a.ssn,9,1)=substr(b.ssn,9,1) ) ge 7; quit;
Don Henderson Don.Hender...@hcsbi.com http://www.hcsbi.com (301) 570-5530 (office) (301) 980-9027 (cell)
This is not using RegEx but on Key-Indexing. I presume that TWO has to be compared with ONE and the matching SSN based on the first 7 digits of SSN-string of ONE is wanted.
data one; input ssn $9.; datalines; 123456789 987654321 121212121 343434343 ; run; data two; input ssn $9.; datalines; 123456789 987645321 121212121 232323232 ; run; data need; array k[9999999] _temporary_; do until(eof); set one end = eof; one = input(substr(ssn,1,7),7.); k[one] = 1; end; do until(eof1); set two end = eof1; two = input(substr(ssn,1,7),7.); if k[two] then output; end; drop one two; run;
donaldjhender...@hotmail.com> wrote: > And no, this is not a Sci-Fi question ;-).
> Am asking on behalf of a colleague who is doing a project involving > matching > based on Social Security numbers. The client wants to match two files using > the Social Security number and wants to detect as possible matches cases > where position by position 7 of the nine digits match. I am presuming that > is because they want to account for the possibility of transposed digits > (but don't know that for sure).
> I've come up with what I call a brute force solution that uses the sum > function in the where clause to count position by position how many > characters match. Based on the sample below, it seems to work.
> But I am wondering if any of the RegEx gurus on the list can comment on > whether this can be done with RegEx.
> TIA. My sample data and quick and dirty brute force code follows.
> data one; > input ssn $9.; > datalines; > 123456789 > 987654321 > 121212121 > 343434343 > ; > data two; > input ssn $9.; > datalines; > 123456789 > 987645321 > 121212121 > 232323232 > ; > proc sql; > create table matches as > select a.ssn as ssn_one, > b.ssn as ssn_two, > case > when (a.ssn ne b.ssn) then 'Not Exact' > else 'Exact' > end as MatchResults > from one a, two b > where sum(substr(a.ssn,1,1)=substr(b.ssn,1,1), > substr(a.ssn,2,1)=substr(b.ssn,2,1), > substr(a.ssn,3,1)=substr(b.ssn,3,1), > substr(a.ssn,4,1)=substr(b.ssn,4,1), > substr(a.ssn,5,1)=substr(b.ssn,5,1), > substr(a.ssn,6,1)=substr(b.ssn,6,1), > substr(a.ssn,7,1)=substr(b.ssn,7,1), > substr(a.ssn,8,1)=substr(b.ssn,8,1), > substr(a.ssn,9,1)=substr(b.ssn,9,1) > ) ge 7; > quit;
I don't think that this type of problem can be solved using regexp - or at least I haven't a clue how I would go about such a solution. As you want to seek similarity between two strings finding the edit distance seems like a good bet.
The only problem is what edit distance algorithm to use. SAS allows you to specify your own scores for use in edit distance calculation with the call compcost routine, the values are then used in the compged function, this may be useful for other unique types of similarity calculations.
However, given your problem it seems like the complev function would work best as the Levenshtein Edit Distance allows insertion, deletion or substitution of a single character where the cost of each operation is 1. Since you know the length of your strings can filter for when we have le 3 mismatches.
data one; input ssn $9.; datalines; 123456789 987654321 121212121 343434343 ; data two; input ssn $9.; datalines; 123456789 987645321 121212121 232323232 ; proc sql; create table matches as select a.ssn as ssn_one, b.ssn as ssn_two, case when (a.ssn ne b.ssn) then 'Not Exact' else 'Exact' end as MatchResults from one a, two b where sum(substr(a.ssn,1,1)=substr(b.ssn,1,1), substr(a.ssn,2,1)=substr(b.ssn,2,1), substr(a.ssn,3,1)=substr(b.ssn,3,1), substr(a.ssn,4,1)=substr(b.ssn,4,1), substr(a.ssn,5,1)=substr(b.ssn,5,1), substr(a.ssn,6,1)=substr(b.ssn,6,1), substr(a.ssn,7,1)=substr(b.ssn,7,1), substr(a.ssn,8,1)=substr(b.ssn,8,1), substr(a.ssn,9,1)=substr(b.ssn,9,1) ) ge 7; quit;
proc sql; create table matches2 as select a.ssn as ssn_one, b.ssn as ssn_two, case when (a.ssn ne b.ssn) then 'Not Exact' else 'Exact' end as MatchResults from one a, two b where complev(a.ssn, b.ssn) le 3; quit;
> Am asking on behalf of a colleague who is doing a project involving matching > based on Social Security numbers. The client wants to match two files using > the Social Security number and wants to detect as possible matches cases > where position by position 7 of the nine digits match. I am presuming that > is because they want to account for the possibility of transposed digits > (but don't know that for sure).
> I've come up with what I call a brute force solution that uses the sum > function in the where clause to count position by position how many > characters match. Based on the sample below, it seems to work.
> But I am wondering if any of the RegEx gurus on the list can comment on > whether this can be done with RegEx.
> TIA. My sample data and quick and dirty brute force code follows.
> data one; > input ssn $9.; > datalines; > 123456789 > 987654321 > 121212121 > 343434343 > ; > data two; > input ssn $9.; > datalines; > 123456789 > 987645321 > 121212121 > 232323232 > ; > proc sql; > create table matches as > select a.ssn as ssn_one, > b.ssn as ssn_two, > case > when (a.ssn ne b.ssn) then 'Not Exact' > else 'Exact' > end as MatchResults > from one a, two b > where sum(substr(a.ssn,1,1)=substr(b.ssn,1,1), > substr(a.ssn,2,1)=substr(b.ssn,2,1), > substr(a.ssn,3,1)=substr(b.ssn,3,1), > substr(a.ssn,4,1)=substr(b.ssn,4,1), > substr(a.ssn,5,1)=substr(b.ssn,5,1), > substr(a.ssn,6,1)=substr(b.ssn,6,1), > substr(a.ssn,7,1)=substr(b.ssn,7,1), > substr(a.ssn,8,1)=substr(b.ssn,8,1), > substr(a.ssn,9,1)=substr(b.ssn,9,1) > ) ge 7; > quit;
> On Thu, 22 Jan 2009 10:22:02 -0500, Don Henderson > <donaldjhender...@HOTMAIL.COM> wrote:
>>And no, this is not a Sci-Fi question ;-).
>>Am asking on behalf of a colleague who is doing a project involving > matching >>based on Social Security numbers. The client wants to match two files > using >>the Social Security number and wants to detect as possible matches cases >>where position by position 7 of the nine digits match. I am presuming that >>is because they want to account for the possibility of transposed digits >>(but don't know that for sure).
>>I've come up with what I call a brute force solution that uses the sum >>function in the where clause to count position by position how many >>characters match. Based on the sample below, it seems to work.
>>But I am wondering if any of the RegEx gurus on the list can comment on >>whether this can be done with RegEx.
>>TIA. My sample data and quick and dirty brute force code follows.
>>data one; >> input ssn $9.; >>datalines; >>123456789 >>987654321 >>121212121 >>343434343 >>; >>data two; >> input ssn $9.; >>datalines; >>123456789 >>987645321 >>121212121 >>232323232 >>; >>proc sql; >> create table matches as >> select a.ssn as ssn_one, >> b.ssn as ssn_two, >> case >> when (a.ssn ne b.ssn) then 'Not Exact' >> else 'Exact' >> end as MatchResults >> from one a, two b >> where sum(substr(a.ssn,1,1)=substr(b.ssn,1,1), >> substr(a.ssn,2,1)=substr(b.ssn,2,1), >> substr(a.ssn,3,1)=substr(b.ssn,3,1), >> substr(a.ssn,4,1)=substr(b.ssn,4,1), >> substr(a.ssn,5,1)=substr(b.ssn,5,1), >> substr(a.ssn,6,1)=substr(b.ssn,6,1), >> substr(a.ssn,7,1)=substr(b.ssn,7,1), >> substr(a.ssn,8,1)=substr(b.ssn,8,1), >> substr(a.ssn,9,1)=substr(b.ssn,9,1) >> ) ge 7; >>quit;
> hi, don, > here is one way. see matches2 below. sas 9.1.3. hth. > cheers, > chang
> data one; > input ssn $9.; > datalines; > 123456789 > 987654321 > 121212121 > 343434343 > ; > data two; > input ssn $9.; > datalines; > 123456789 > 987645321 > 121212121 > 232323232 > ; > proc sql; > create table matches as > select a.ssn as ssn_one, > b.ssn as ssn_two, > case > when (a.ssn ne b.ssn) then 'Not Exact' > else 'Exact' > end as MatchResults > from one a, two b > where sum(substr(a.ssn,1,1)=substr(b.ssn,1,1), > substr(a.ssn,2,1)=substr(b.ssn,2,1), > substr(a.ssn,3,1)=substr(b.ssn,3,1), > substr(a.ssn,4,1)=substr(b.ssn,4,1), > substr(a.ssn,5,1)=substr(b.ssn,5,1), > substr(a.ssn,6,1)=substr(b.ssn,6,1), > substr(a.ssn,7,1)=substr(b.ssn,7,1), > substr(a.ssn,8,1)=substr(b.ssn,8,1), > substr(a.ssn,9,1)=substr(b.ssn,9,1) > ) ge 7;
> /* using prx */ > create table matches2 as > select a.ssn as ssn_one > , b.ssn as ssn_two > , ifc(a.ssn=b.ssn,'Exact','Not Exact') as MatchResults > length=9 > from one a, two b > where countc(prxchange("s/(.)(?=(.{8}\1))/M/" > , 9, catt(a.ssn,b.ssn)),"M")>=7; > quit;
> proc compare base=matches compare=matches2; > run; > /* on lst in part: > NOTE: No unequal values were found. All values compared are exactly equal. > */
<donaldjhender...@HOTMAIL.COM> wrote: >And no, this is not a Sci-Fi question ;-).
>Am asking on behalf of a colleague who is doing a project involving matching >based on Social Security numbers. The client wants to match two files using >the Social Security number and wants to detect as possible matches cases >where position by position 7 of the nine digits match. I am presuming that >is because they want to account for the possibility of transposed digits >(but don't know that for sure).
>I've come up with what I call a brute force solution that uses the sum >function in the where clause to count position by position how many >characters match. Based on the sample below, it seems to work.
>But I am wondering if any of the RegEx gurus on the list can comment on >whether this can be done with RegEx.
>TIA. My sample data and quick and dirty brute force code follows.
>data one; > input ssn $9.; >datalines; >123456789 >987654321 >121212121 >343434343 >; >data two; > input ssn $9.; >datalines; >123456789 >987645321 >121212121 >232323232 >; >proc sql; > create table matches as > select a.ssn as ssn_one, > b.ssn as ssn_two, > case > when (a.ssn ne b.ssn) then 'Not Exact' > else 'Exact' > end as MatchResults > from one a, two b > where sum(substr(a.ssn,1,1)=substr(b.ssn,1,1), > substr(a.ssn,2,1)=substr(b.ssn,2,1), > substr(a.ssn,3,1)=substr(b.ssn,3,1), > substr(a.ssn,4,1)=substr(b.ssn,4,1), > substr(a.ssn,5,1)=substr(b.ssn,5,1), > substr(a.ssn,6,1)=substr(b.ssn,6,1), > substr(a.ssn,7,1)=substr(b.ssn,7,1), > substr(a.ssn,8,1)=substr(b.ssn,8,1), > substr(a.ssn,9,1)=substr(b.ssn,9,1) > ) ge 7; >quit;
hi, don, here is one way. see matches2 below. sas 9.1.3. hth. cheers, chang
data one; input ssn $9.; datalines; 123456789 987654321 121212121 343434343 ; data two; input ssn $9.; datalines; 123456789 987645321 121212121 232323232 ; proc sql; create table matches as select a.ssn as ssn_one, b.ssn as ssn_two, case when (a.ssn ne b.ssn) then 'Not Exact' else 'Exact' end as MatchResults from one a, two b where sum(substr(a.ssn,1,1)=substr(b.ssn,1,1), substr(a.ssn,2,1)=substr(b.ssn,2,1), substr(a.ssn,3,1)=substr(b.ssn,3,1), substr(a.ssn,4,1)=substr(b.ssn,4,1), substr(a.ssn,5,1)=substr(b.ssn,5,1), substr(a.ssn,6,1)=substr(b.ssn,6,1), substr(a.ssn,7,1)=substr(b.ssn,7,1), substr(a.ssn,8,1)=substr(b.ssn,8,1), substr(a.ssn,9,1)=substr(b.ssn,9,1) ) ge 7;
/* using prx */ create table matches2 as select a.ssn as ssn_one , b.ssn as ssn_two , ifc(a.ssn=b.ssn,'Exact','Not Exact') as MatchResults length=9 from one a, two b where countc(prxchange("s/(.)(?=(.{8}\1))/M/" , 9, catt(a.ssn,b.ssn)),"M")>=7; quit;
proc compare base=matches compare=matches2; run; /* on lst in part: NOTE: No unequal values were found. All values compared are exactly equal. */
Thanks for these suggestions. The idea of a distance function (e.g., COMPLEV) had not occurred to me.
I don't have access to the real data my colleague will be using (he is working in a locked down environment as it is sensitive financial data), but I will create some volume test data and run some experiments to compare the performance of the three approaches and will report back.
-----Original Message----- From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On Behalf Of karma Sent: Thursday, January 22, 2009 12:10 PM To: SA...@LISTSERV.UGA.EDU Subject: Re: Regex - Seven of Nine Matching
" don't think that this type of problem can be solved using regexp "
Proven wrong in zero time :(
2009/1/22 Chang Chung <chang_y_ch...@hotmail.com>: > On Thu, 22 Jan 2009 10:22:02 -0500, Don Henderson > <donaldjhender...@HOTMAIL.COM> wrote:
>>And no, this is not a Sci-Fi question ;-).
>>Am asking on behalf of a colleague who is doing a project involving > matching >>based on Social Security numbers. The client wants to match two files > using >>the Social Security number and wants to detect as possible matches >>cases where position by position 7 of the nine digits match. I am >>presuming that is because they want to account for the possibility of >>transposed digits (but don't know that for sure).
>>I've come up with what I call a brute force solution that uses the >>sum function in the where clause to count position by position how >>many characters match. Based on the sample below, it seems to work.
>>But I am wondering if any of the RegEx gurus on the list can comment >>on whether this can be done with RegEx.
>>TIA. My sample data and quick and dirty brute force code follows.
>>data one; >> input ssn $9.; >>datalines; >>123456789 >>987654321 >>121212121 >>343434343 >>; >>data two; >> input ssn $9.; >>datalines; >>123456789 >>987645321 >>121212121 >>232323232 >>; >>proc sql; >> create table matches as >> select a.ssn as ssn_one, >> b.ssn as ssn_two, >> case >> when (a.ssn ne b.ssn) then 'Not Exact' >> else 'Exact' >> end as MatchResults >> from one a, two b >> where sum(substr(a.ssn,1,1)=substr(b.ssn,1,1), >> substr(a.ssn,2,1)=substr(b.ssn,2,1), >> substr(a.ssn,3,1)=substr(b.ssn,3,1), >> substr(a.ssn,4,1)=substr(b.ssn,4,1), >> substr(a.ssn,5,1)=substr(b.ssn,5,1), >> substr(a.ssn,6,1)=substr(b.ssn,6,1), >> substr(a.ssn,7,1)=substr(b.ssn,7,1), >> substr(a.ssn,8,1)=substr(b.ssn,8,1), >> substr(a.ssn,9,1)=substr(b.ssn,9,1) >> ) ge 7; >>quit;
> hi, don, > here is one way. see matches2 below. sas 9.1.3. hth. > cheers, > chang
> data one; > input ssn $9.; > datalines; > 123456789 > 987654321 > 121212121 > 343434343 > ; > data two; > input ssn $9.; > datalines; > 123456789 > 987645321 > 121212121 > 232323232 > ; > proc sql; > create table matches as > select a.ssn as ssn_one, > b.ssn as ssn_two, > case > when (a.ssn ne b.ssn) then 'Not Exact' > else 'Exact' > end as MatchResults > from one a, two b > where sum(substr(a.ssn,1,1)=substr(b.ssn,1,1), > substr(a.ssn,2,1)=substr(b.ssn,2,1), > substr(a.ssn,3,1)=substr(b.ssn,3,1), > substr(a.ssn,4,1)=substr(b.ssn,4,1), > substr(a.ssn,5,1)=substr(b.ssn,5,1), > substr(a.ssn,6,1)=substr(b.ssn,6,1), > substr(a.ssn,7,1)=substr(b.ssn,7,1), > substr(a.ssn,8,1)=substr(b.ssn,8,1), > substr(a.ssn,9,1)=substr(b.ssn,9,1) > ) ge 7;
> /* using prx */ > create table matches2 as > select a.ssn as ssn_one > , b.ssn as ssn_two > , ifc(a.ssn=b.ssn,'Exact','Not Exact') as MatchResults > length=9 > from one a, two b > where countc(prxchange("s/(.)(?=(.{8}\1))/M/" > , 9, catt(a.ssn,b.ssn)),"M")>=7; quit;
> proc compare base=matches compare=matches2; run; > /* on lst in part: > NOTE: No unequal values were found. All values compared are exactly equal. > */
Find out if the intent is to tolerate exactly one transposition of adjacent digits. If so, I believe you can use CALL COMPCOST to adjust the coefficients, then use the COMPGED function to get the precise subset.
<donaldjhender...@HOTMAIL.COM> wrote: >Karma and Chang,
>Thanks for these suggestions. The idea of a distance function (e.g., >COMPLEV) had not occurred to me.
>I don't have access to the real data my colleague will be using (he is >working in a locked down environment as it is sensitive financial data), but >I will create some volume test data and run some experiments to compare the >performance of the three approaches and will report back.
>Thanks again! >Donh
>-----Original Message----- >From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On Behalf Of karma >Sent: Thursday, January 22, 2009 12:10 PM >To: SA...@LISTSERV.UGA.EDU >Subject: Re: Regex - Seven of Nine Matching
>" don't think that this type of problem can be solved using regexp "
>Proven wrong in zero time :(
>2009/1/22 Chang Chung <chang_y_ch...@hotmail.com>: >> On Thu, 22 Jan 2009 10:22:02 -0500, Don Henderson >> <donaldjhender...@HOTMAIL.COM> wrote:
>>>And no, this is not a Sci-Fi question ;-).
>>>Am asking on behalf of a colleague who is doing a project involving >> matching >>>based on Social Security numbers. The client wants to match two files >> using >>>the Social Security number and wants to detect as possible matches >>>cases where position by position 7 of the nine digits match. I am >>>presuming that is because they want to account for the possibility of >>>transposed digits (but don't know that for sure).
>>>I've come up with what I call a brute force solution that uses the >>>sum function in the where clause to count position by position how >>>many characters match. Based on the sample below, it seems to work.
>>>But I am wondering if any of the RegEx gurus on the list can comment >>>on whether this can be done with RegEx.
>>>TIA. My sample data and quick and dirty brute force code follows.
>>>data one; >>> input ssn $9.; >>>datalines; >>>123456789 >>>987654321 >>>121212121 >>>343434343 >>>; >>>data two; >>> input ssn $9.; >>>datalines; >>>123456789 >>>987645321 >>>121212121 >>>232323232 >>>; >>>proc sql; >>> create table matches as >>> select a.ssn as ssn_one, >>> b.ssn as ssn_two, >>> case >>> when (a.ssn ne b.ssn) then 'Not Exact' >>> else 'Exact' >>> end as MatchResults >>> from one a, two b >>> where sum(substr(a.ssn,1,1)=substr(b.ssn,1,1), >>> substr(a.ssn,2,1)=substr(b.ssn,2,1), >>> substr(a.ssn,3,1)=substr(b.ssn,3,1), >>> substr(a.ssn,4,1)=substr(b.ssn,4,1), >>> substr(a.ssn,5,1)=substr(b.ssn,5,1), >>> substr(a.ssn,6,1)=substr(b.ssn,6,1), >>> substr(a.ssn,7,1)=substr(b.ssn,7,1), >>> substr(a.ssn,8,1)=substr(b.ssn,8,1), >>> substr(a.ssn,9,1)=substr(b.ssn,9,1) >>> ) ge 7; >>>quit;
>> hi, don, >> here is one way. see matches2 below. sas 9.1.3. hth. >> cheers, >> chang
>> data one; >> input ssn $9.; >> datalines; >> 123456789 >> 987654321 >> 121212121 >> 343434343 >> ; >> data two; >> input ssn $9.; >> datalines; >> 123456789 >> 987645321 >> 121212121 >> 232323232 >> ; >> proc sql; >> create table matches as >> select a.ssn as ssn_one, >> b.ssn as ssn_two, >> case >> when (a.ssn ne b.ssn) then 'Not Exact' >> else 'Exact' >> end as MatchResults >> from one a, two b >> where sum(substr(a.ssn,1,1)=substr(b.ssn,1,1), >> substr(a.ssn,2,1)=substr(b.ssn,2,1), >> substr(a.ssn,3,1)=substr(b.ssn,3,1), >> substr(a.ssn,4,1)=substr(b.ssn,4,1), >> substr(a.ssn,5,1)=substr(b.ssn,5,1), >> substr(a.ssn,6,1)=substr(b.ssn,6,1), >> substr(a.ssn,7,1)=substr(b.ssn,7,1), >> substr(a.ssn,8,1)=substr(b.ssn,8,1), >> substr(a.ssn,9,1)=substr(b.ssn,9,1) >> ) ge 7;
>> /* using prx */ >> create table matches2 as >> select a.ssn as ssn_one >> , b.ssn as ssn_two >> , ifc(a.ssn=b.ssn,'Exact','Not Exact') as MatchResults >> length=9 >> from one a, two b >> where countc(prxchange("s/(.)(?=(.{8}\1))/M/" >> , 9, catt(a.ssn,b.ssn)),"M")>=7; quit;
>> proc compare base=matches compare=matches2; run; >> /* on lst in part: >> NOTE: No unequal values were found. All values compared are exactly equal. >> */
Don: I've generally used the SPEDIS() function (or expressions involving SPEDIS()) for comparing two SSN during the past decade, and I find SPEDIS() a better measure of matching than any of the "number of matching digits" measures. It also works very fast. It does a "cost of rearrangement" calculation that assigns low costs to simple transpositions and exchanges of characters. The expression I use adjusts for lengths of strings (for cases where a SSN has fewer than nine digits, such as when a person records only the last four digits) and imposes symmetry of comparison (because SPEDIS() can yield different results for comparisons of the reverses of two SSN). As a rule a SPEDIS() score of <= 0.05 indicates either a recording error in one or both of two SSN or SSN for two siblings (since persons applying at the same time for SSN may receive consecutive numbers). S
-----Original Message----- From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On Behalf Of Don Henderson Sent: Thursday, January 22, 2009 12:36 PM To: SA...@LISTSERV.UGA.EDU Subject: Re: Regex - Seven of Nine Matching
Karma and Chang,
Thanks for these suggestions. The idea of a distance function (e.g., COMPLEV) had not occurred to me.
I don't have access to the real data my colleague will be using (he is working in a locked down environment as it is sensitive financial data), but I will create some volume test data and run some experiments to compare the performance of the three approaches and will report back.
Thanks again! Donh
-----Original Message----- From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On Behalf Of karma Sent: Thursday, January 22, 2009 12:10 PM To: SA...@LISTSERV.UGA.EDU Subject: Re: Regex - Seven of Nine Matching
" don't think that this type of problem can be solved using regexp "
Proven wrong in zero time :(
2009/1/22 Chang Chung <chang_y_ch...@hotmail.com>: > On Thu, 22 Jan 2009 10:22:02 -0500, Don Henderson > <donaldjhender...@HOTMAIL.COM> wrote:
>>And no, this is not a Sci-Fi question ;-).
>>Am asking on behalf of a colleague who is doing a project involving > matching >>based on Social Security numbers. The client wants to match two files > using >>the Social Security number and wants to detect as possible matches >>cases where position by position 7 of the nine digits match. I am >>presuming that is because they want to account for the possibility of >>transposed digits (but don't know that for sure).
>>I've come up with what I call a brute force solution that uses the >>sum function in the where clause to count position by position how >>many characters match. Based on the sample below, it seems to work.
>>But I am wondering if any of the RegEx gurus on the list can comment >>on whether this can be done with RegEx.
>>TIA. My sample data and quick and dirty brute force code follows.
>>data one; >> input ssn $9.; >>datalines; >>123456789 >>987654321 >>121212121 >>343434343 >>; >>data two; >> input ssn $9.; >>datalines; >>123456789 >>987645321 >>121212121 >>232323232 >>; >>proc sql; >> create table matches as >> select a.ssn as ssn_one, >> b.ssn as ssn_two, >> case >> when (a.ssn ne b.ssn) then 'Not Exact' >> else 'Exact' >> end as MatchResults >> from one a, two b >> where sum(substr(a.ssn,1,1)=substr(b.ssn,1,1), >> substr(a.ssn,2,1)=substr(b.ssn,2,1), >> substr(a.ssn,3,1)=substr(b.ssn,3,1), >> substr(a.ssn,4,1)=substr(b.ssn,4,1), >> substr(a.ssn,5,1)=substr(b.ssn,5,1), >> substr(a.ssn,6,1)=substr(b.ssn,6,1), >> substr(a.ssn,7,1)=substr(b.ssn,7,1), >> substr(a.ssn,8,1)=substr(b.ssn,8,1), >> substr(a.ssn,9,1)=substr(b.ssn,9,1) >> ) ge 7; >>quit;
> hi, don, > here is one way. see matches2 below. sas 9.1.3. hth. > cheers, > chang
> data one; > input ssn $9.; > datalines; > 123456789 > 987654321 > 121212121 > 343434343 > ; > data two; > input ssn $9.; > datalines; > 123456789 > 987645321 > 121212121 > 232323232 > ; > proc sql; > create table matches as > select a.ssn as ssn_one, > b.ssn as ssn_two, > case > when (a.ssn ne b.ssn) then 'Not Exact' > else 'Exact' > end as MatchResults > from one a, two b > where sum(substr(a.ssn,1,1)=substr(b.ssn,1,1), > substr(a.ssn,2,1)=substr(b.ssn,2,1), > substr(a.ssn,3,1)=substr(b.ssn,3,1), > substr(a.ssn,4,1)=substr(b.ssn,4,1), > substr(a.ssn,5,1)=substr(b.ssn,5,1), > substr(a.ssn,6,1)=substr(b.ssn,6,1), > substr(a.ssn,7,1)=substr(b.ssn,7,1), > substr(a.ssn,8,1)=substr(b.ssn,8,1), > substr(a.ssn,9,1)=substr(b.ssn,9,1) > ) ge 7;
> /* using prx */ > create table matches2 as > select a.ssn as ssn_one > , b.ssn as ssn_two > , ifc(a.ssn=b.ssn,'Exact','Not Exact') as MatchResults > length=9 > from one a, two b > where countc(prxchange("s/(.)(?=(.{8}\1))/M/" > , 9, catt(a.ssn,b.ssn)),"M")>=7; quit;
> proc compare base=matches compare=matches2; run; > /* on lst in part: > NOTE: No unequal values were found. All values compared are exactly > equal. */
One other possibility to think about is the COMPARE function, which simply compares two strings and returns the position of the first non-matching character (or 0 for an exact match).
So, my approach would be:
1. Call COMPARE once to try to find the first difference. 2. If another difference is possible (there was a difference prior to the last character in the string), call it a second time for the remaining (rightmost) portion of the string. 3. Similar to 2, call it a 3rd time if necessary.
If you find a 3rd difference, you don't have a match by your 7-of-9 rule. Otherwise you do.
Note that you need to use ABS with COMPARE, since it may return either a positive or negative result (depending on which of the two strings has the "greater" value).
-----Original Message----- From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On Behalf Of Don Henderson Sent: Thursday, January 22, 2009 2:59 PM To: SA...@LISTSERV.UGA.EDU Subject: Re: Regex - Seven of Nine Matching
Sig, Muthia and Howard,
Thanks for the alternative suggestions.
However, the requirement is 7 of the 9 positions match. And I believe that this is mandated by some regulatory rule. So alternative matching algorithms, even if they might be "better" in an abstract sense are not an option. It was just speculation on my part as to the reason (transposition) for 7 of the 9 match; it was not part of the requirement.
Given the specifics of the requirements, I am curious as to the performance issues and so I will compare the RegEx, complev and brute force method. The online doc says that COMPLEV is faster than COMPGED which is faster than SPEDIS for a comparison that COMPLEV can do.
Thanks again for the suggestions however.
Regards, donh
-----Original Message----- From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On Behalf Of Sigurd Hermansen Sent: Thursday, January 22, 2009 1:31 PM To: SA...@LISTSERV.UGA.EDU Subject: Re: Regex - Seven of Nine Matching
Don: I've generally used the SPEDIS() function (or expressions involving SPEDIS()) for comparing two SSN during the past decade, and I find SPEDIS() a better measure of matching than any of the "number of matching digits" measures. It also works very fast. It does a "cost of rearrangement" calculation that assigns low costs to simple transpositions and exchanges of characters. The expression I use adjusts for lengths of strings (for cases where a SSN has fewer than nine digits, such as when a person records only the last four digits) and imposes symmetry of comparison (because SPEDIS() can yield different results for comparisons of the reverses of two SSN). As a rule a SPEDIS() score of <= 0.05 indicates either a recording error in one or both of two SSN or SSN for two siblings (since persons applying at the same time for SSN may receive consecutive numbers). S
-----Original Message----- From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On Behalf Of Don Henderson Sent: Thursday, January 22, 2009 12:36 PM To: SA...@LISTSERV.UGA.EDU Subject: Re: Regex - Seven of Nine Matching
Karma and Chang,
Thanks for these suggestions. The idea of a distance function (e.g., COMPLEV) had not occurred to me.
I don't have access to the real data my colleague will be using (he is working in a locked down environment as it is sensitive financial data), but I will create some volume test data and run some experiments to compare the performance of the three approaches and will report back.
Thanks again! Donh
-----Original Message----- From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On Behalf Of karma Sent: Thursday, January 22, 2009 12:10 PM To: SA...@LISTSERV.UGA.EDU Subject: Re: Regex - Seven of Nine Matching
" don't think that this type of problem can be solved using regexp "
Proven wrong in zero time :(
2009/1/22 Chang Chung <chang_y_ch...@hotmail.com>: > On Thu, 22 Jan 2009 10:22:02 -0500, Don Henderson > <donaldjhender...@HOTMAIL.COM> wrote:
>>And no, this is not a Sci-Fi question ;-).
>>Am asking on behalf of a colleague who is doing a project involving > matching >>based on Social Security numbers. The client wants to match two files > using >>the Social Security number and wants to detect as possible matches >>cases where position by position 7 of the nine digits match. I am >>presuming that is because they want to account for the possibility of >>transposed digits (but don't know that for sure).
>>I've come up with what I call a brute force solution that uses the >>sum function in the where clause to count position by position how >>many characters match. Based on the sample below, it seems to work.
>>But I am wondering if any of the RegEx gurus on the list can comment >>on whether this can be done with RegEx.
>>TIA. My sample data and quick and dirty brute force code follows.
>>data one; >> input ssn $9.; >>datalines; >>123456789 >>987654321 >>121212121 >>343434343 >>; >>data two; >> input ssn $9.; >>datalines; >>123456789 >>987645321 >>121212121 >>232323232 >>; >>proc sql; >> create table matches as >> select a.ssn as ssn_one, >> b.ssn as ssn_two, >> case >> when (a.ssn ne b.ssn) then 'Not Exact' >> else 'Exact' >> end as MatchResults >> from one a, two b >> where sum(substr(a.ssn,1,1)=substr(b.ssn,1,1), >> substr(a.ssn,2,1)=substr(b.ssn,2,1), >> substr(a.ssn,3,1)=substr(b.ssn,3,1), >> substr(a.ssn,4,1)=substr(b.ssn,4,1), >> substr(a.ssn,5,1)=substr(b.ssn,5,1), >> substr(a.ssn,6,1)=substr(b.ssn,6,1), >> substr(a.ssn,7,1)=substr(b.ssn,7,1), >> substr(a.ssn,8,1)=substr(b.ssn,8,1), >> substr(a.ssn,9,1)=substr(b.ssn,9,1) >> ) ge 7; >>quit;
> hi, don, > here is one way. see matches2 below. sas 9.1.3. hth. > cheers, > chang
> data one; > input ssn $9.; > datalines; > 123456789 > 987654321 > 121212121 > 343434343 > ; > data two; > input ssn $9.; > datalines; > 123456789 > 987645321 > 121212121 > 232323232 > ; > proc sql; > create table matches as > select a.ssn as ssn_one, > b.ssn as ssn_two, > case > when (a.ssn ne b.ssn) then 'Not Exact' > else 'Exact' > end as MatchResults > from one a, two b > where sum(substr(a.ssn,1,1)=substr(b.ssn,1,1), > substr(a.ssn,2,1)=substr(b.ssn,2,1), > substr(a.ssn,3,1)=substr(b.ssn,3,1), > substr(a.ssn,4,1)=substr(b.ssn,4,1), > substr(a.ssn,5,1)=substr(b.ssn,5,1), > substr(a.ssn,6,1)=substr(b.ssn,6,1), > substr(a.ssn,7,1)=substr(b.ssn,7,1), > substr(a.ssn,8,1)=substr(b.ssn,8,1), > substr(a.ssn,9,1)=substr(b.ssn,9,1) > ) ge 7;
> /* using prx */ > create table matches2 as > select a.ssn as ssn_one > , b.ssn as ssn_two > , ifc(a.ssn=b.ssn,'Exact','Not Exact') as MatchResults > length=9 > from one a, two b > where countc(prxchange("s/(.)(?=(.{8}\1))/M/" > , 9, catt(a.ssn,b.ssn)),"M")>=7; quit;
> proc compare base=matches compare=matches2; run; > /* on lst in part: > NOTE: No unequal values were found. All values compared are exactly > equal. */
However, the requirement is 7 of the 9 positions match. And I believe that this is mandated by some regulatory rule. So alternative matching algorithms, even if they might be "better" in an abstract sense are not an option. It was just speculation on my part as to the reason (transposition) for 7 of the 9 match; it was not part of the requirement.
Given the specifics of the requirements, I am curious as to the performance issues and so I will compare the RegEx, complev and brute force method. The online doc says that COMPLEV is faster than COMPGED which is faster than SPEDIS for a comparison that COMPLEV can do.
-----Original Message----- From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On Behalf Of Sigurd
Hermansen Sent: Thursday, January 22, 2009 1:31 PM To: SA...@LISTSERV.UGA.EDU Subject: Re: Regex - Seven of Nine Matching
Don: I've generally used the SPEDIS() function (or expressions involving SPEDIS()) for comparing two SSN during the past decade, and I find SPEDIS() a better measure of matching than any of the "number of matching digits" measures. It also works very fast. It does a "cost of rearrangement" calculation that assigns low costs to simple transpositions and exchanges of characters. The expression I use adjusts for lengths of strings (for cases where a SSN has fewer than nine digits, such as when a person records only the last four digits) and imposes symmetry of comparison (because SPEDIS() can yield different results for comparisons of the reverses of two SSN). As a rule a SPEDIS() score of <= 0.05 indicates either a recording error in one or both of two SSN or SSN for two siblings (since persons applying at the same time for SSN may receive consecutive numbers). S
-----Original Message----- From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On Behalf Of Don Henderson Sent: Thursday, January 22, 2009 12:36 PM To: SA...@LISTSERV.UGA.EDU Subject: Re: Regex - Seven of Nine Matching
Karma and Chang,
Thanks for these suggestions. The idea of a distance function (e.g., COMPLEV) had not occurred to me.
I don't have access to the real data my colleague will be using (he is working in a locked down environment as it is sensitive financial data), but I will create some volume test data and run some experiments to compare the performance of the three approaches and will report back.
Thanks again! Donh
-----Original Message----- From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On Behalf Of karma Sent: Thursday, January 22, 2009 12:10 PM To: SA...@LISTSERV.UGA.EDU Subject: Re: Regex - Seven of Nine Matching
" don't think that this type of problem can be solved using regexp "
Proven wrong in zero time :(
2009/1/22 Chang Chung <chang_y_ch...@hotmail.com>: > On Thu, 22 Jan 2009 10:22:02 -0500, Don Henderson > <donaldjhender...@HOTMAIL.COM> wrote:
>>And no, this is not a Sci-Fi question ;-).
>>Am asking on behalf of a colleague who is doing a project involving > matching >>based on Social Security numbers. The client wants to match two files > using >>the Social Security number and wants to detect as possible matches >>cases where position by position 7 of the nine digits match. I am >>presuming that is because they want to account for the possibility of >>transposed digits (but don't know that for sure).
>>I've come up with what I call a brute force solution that uses the >>sum function in the where clause to count position by position how >>many characters match. Based on the sample below, it seems to work.
>>But I am wondering if any of the RegEx gurus on the list can comment >>on whether this can be done with RegEx.
>>TIA. My sample data and quick and dirty brute force code follows.
>>data one; >> input ssn $9.; >>datalines; >>123456789 >>987654321 >>121212121 >>343434343 >>; >>data two; >> input ssn $9.; >>datalines; >>123456789 >>987645321 >>121212121 >>232323232 >>; >>proc sql; >> create table matches as >> select a.ssn as ssn_one, >> b.ssn as ssn_two, >> case >> when (a.ssn ne b.ssn) then 'Not Exact' >> else 'Exact' >> end as MatchResults >> from one a, two b >> where sum(substr(a.ssn,1,1)=substr(b.ssn,1,1), >> substr(a.ssn,2,1)=substr(b.ssn,2,1), >> substr(a.ssn,3,1)=substr(b.ssn,3,1), >> substr(a.ssn,4,1)=substr(b.ssn,4,1), >> substr(a.ssn,5,1)=substr(b.ssn,5,1), >> substr(a.ssn,6,1)=substr(b.ssn,6,1), >> substr(a.ssn,7,1)=substr(b.ssn,7,1), >> substr(a.ssn,8,1)=substr(b.ssn,8,1), >> substr(a.ssn,9,1)=substr(b.ssn,9,1) >> ) ge 7; >>quit;
> hi, don, > here is one way. see matches2 below. sas 9.1.3. hth. > cheers, > chang
> data one; > input ssn $9.; > datalines; > 123456789 > 987654321 > 121212121 > 343434343 > ; > data two; > input ssn $9.; > datalines; > 123456789 > 987645321 > 121212121 > 232323232 > ; > proc sql; > create table matches as > select a.ssn as ssn_one, > b.ssn as ssn_two, > case > when (a.ssn ne b.ssn) then 'Not Exact' > else 'Exact' > end as MatchResults > from one a, two b > where sum(substr(a.ssn,1,1)=substr(b.ssn,1,1), > substr(a.ssn,2,1)=substr(b.ssn,2,1), > substr(a.ssn,3,1)=substr(b.ssn,3,1), > substr(a.ssn,4,1)=substr(b.ssn,4,1), > substr(a.ssn,5,1)=substr(b.ssn,5,1), > substr(a.ssn,6,1)=substr(b.ssn,6,1), > substr(a.ssn,7,1)=substr(b.ssn,7,1), > substr(a.ssn,8,1)=substr(b.ssn,8,1), > substr(a.ssn,9,1)=substr(b.ssn,9,1) > ) ge 7;
> /* using prx */ > create table matches2 as > select a.ssn as ssn_one > , b.ssn as ssn_two > , ifc(a.ssn=b.ssn,'Exact','Not Exact') as MatchResults > length=9 > from one a, two b > where countc(prxchange("s/(.)(?=(.{8}\1))/M/" > , 9, catt(a.ssn,b.ssn)),"M")>=7; quit;
> proc compare base=matches compare=matches2; run; > /* on lst in part: > NOTE: No unequal values were found. All values compared are exactly > equal. */
If it is performance you are primarily after (which would be logical to seek trying to process a Cartesian product explosion) and you still want to remain confined to SQL, I would suggest you change from substr(x,y,1) to char(x,y). Since char() returns a single byte, that avoids gobs of the idle blank-to-blank comparisons.
This simple change results in saving almost half the time in my tests (with 9.1.3 under AIX), which I have done on your own sample data with one record added to each file, then exploded 1000 times, so that 5000 "ssn" numbers are compared to 5000. For example, you SQL step with SUBSTR() ran 130 seconds, while the same step using CHAR() ran 77 seconds.
If you/your friend are not confined to SQL, I would rather use the DATA step for the simple reason that its logic (absent from SQL) allows to short-circuit the process of comparing a ssn1*ssn2 pair any time the number of *non-matching* digits exceeds 2, and even before that never consider a comparison if the strings are equal. For example,
data matches (drop = _:) ; set one (rename = (ssn = ssn1)) ; do p = 1 to n ; set two (rename = (ssn = ssn2)) point = p nobs = n ; _ne_cnt = 0 ; if ssn1 ne ssn2 then do ; ExactMatch = "N" ; do _cnt = 1 to 9 while (_ne_cnt <= 2) ; if char (ssn1, _cnt) ne char (ssn2, _cnt) then _ne_cnt ++ 1 ; end ; if _cnt <= 9 then continue ; end ; else ExactMatch = "Y" ; if _ne_cnt <= 2 then output ; end ; run ;
cuts the SQL processing time down to 47 seconds. Eschewing the re-reading of the second data set by placing it into an array beforehand,
data matches3 (drop = _:) ; array s [99999] $ 9 _temporary_ ; if _n_ = 1 then do _n_ = 1 to n ; set two nobs = n; s [_n_] = ssn ; end ; set one (rename = (ssn = ssn1)) ; do _n_ = 1 to n ; ssn2 = s[_n_] ; _ne_cnt = 0 ; if ssn1 ne ssn2 then do ; ExactMatch = "N" ; do _cnt = 1 to 9 while (_ne_cnt <= 2) ; if char (ssn1, _cnt) ne char (ssn2, _cnt) then _ne_cnt ++ 1 ; end ; if _cnt <= 9 then continue ; end ; else ExactMatch = "Y" ; if _ne_cnt <= 2 then output ; end ; run ;
reduces the run time further down to 32 seconds, meaning that a rather simple DATA step logic allows increasing performance four-fold from the original point.
I rather like Karma's COMPLEV() approach for its elegance, and it greatly simplifies the code, but unfortunately it is quite computationally intensive in addition to the obvious fact that it has to consider all the character in both strings being compared unconditionally. I have not tested regexen, but am pretty much sure they would fair no better for the same reason. Methinks any method failing to cut on the number of byte-to-byte comparisons is not going to do better than your "brute force". The only reason all the "improvements" suggested above do work is that they, including char() vs substr(), simply let SAS avoid extra work. For the same reason, Mike Rhoads' suggestions are very likely to improve run times as well. Given the simplistic nature of any Lazy Susan logic, it is effectively a brute force, only applied in a different direction.
The performance difference between 30 and 130 seconds is still not a big deal for one willing to wait an extra minute, but that may very well change if instead of 5000*5000 SSNs to compare, each file were mere 10 times longer. That would increase the processing time 100 fold, and we would be talking about the run time from 4 hours to 1.
Kind regards ------------ Paul Dorfman Jax, FL ------------
<donaldjhender...@HOTMAIL.COM> wrote: >Sig, Muthia and Howard,
>Thanks for the alternative suggestions.
>However, the requirement is 7 of the 9 positions match. And I believe that >this is mandated by some regulatory rule. So alternative matching >algorithms, even if they might be "better" in an abstract sense are not an >option. It was just speculation on my part as to the reason (transposition) >for 7 of the 9 match; it was not part of the requirement.
>Given the specifics of the requirements, I am curious as to the performance >issues and so I will compare the RegEx, complev and brute force method. The >online doc says that COMPLEV is faster than COMPGED which is faster than >SPEDIS for a comparison that COMPLEV can do.
>Thanks again for the suggestions however.
>Regards, >donh
>-----Original Message----- >From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On Behalf Of Sigurd >Hermansen >Sent: Thursday, January 22, 2009 1:31 PM >To: SA...@LISTSERV.UGA.EDU >Subject: Re: Regex - Seven of Nine Matching
>Don: >I've generally used the SPEDIS() function (or expressions involving >SPEDIS()) for comparing two SSN during the past decade, and I find SPEDIS() >a better measure of matching than any of the "number of matching digits" >measures. It also works very fast. It does a "cost of rearrangement" >calculation that assigns low costs to simple transpositions and exchanges of >characters. The expression I use adjusts for lengths of strings (for cases >where a SSN has fewer than nine digits, such as when a person records only >the last four digits) and imposes symmetry of comparison (because SPEDIS() >can yield different results for comparisons of the reverses of two SSN). As >a rule a SPEDIS() score of <= 0.05 indicates either a recording error in one >or both of two SSN or SSN for two siblings (since persons applying at the >same time for SSN may receive consecutive numbers). >S
>-----Original Message----- >From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On Behalf Of Don >Henderson >Sent: Thursday, January 22, 2009 12:36 PM >To: SA...@LISTSERV.UGA.EDU >Subject: Re: Regex - Seven of Nine Matching
>Karma and Chang,
>Thanks for these suggestions. The idea of a distance function (e.g., >COMPLEV) had not occurred to me.
>I don't have access to the real data my colleague will be using (he is >working in a locked down environment as it is sensitive financial data), but >I will create some volume test data and run some experiments to compare the >performance of the three approaches and will report back.
>Thanks again! >Donh
>-----Original Message----- >From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On Behalf Of karma >Sent: Thursday, January 22, 2009 12:10 PM >To: SA...@LISTSERV.UGA.EDU >Subject: Re: Regex - Seven of Nine Matching
>" don't think that this type of problem can be solved using regexp "
>Proven wrong in zero time :(
>2009/1/22 Chang Chung <chang_y_ch...@hotmail.com>: >> On Thu, 22 Jan 2009 10:22:02 -0500, Don Henderson >> <donaldjhender...@HOTMAIL.COM> wrote:
>>>And no, this is not a Sci-Fi question ;-).
>>>Am asking on behalf of a colleague who is doing a project involving >> matching >>>based on Social Security numbers. The client wants to match two files >> using >>>the Social Security number and wants to detect as possible matches >>>cases where position by position 7 of the nine digits match. I am >>>presuming that is because they want to account for the possibility of >>>transposed digits (but don't know that for sure).
>>>I've come up with what I call a brute force solution that uses the >>>sum function in the where clause to count position by position how >>>many characters match. Based on the sample below, it seems to work.
>>>But I am wondering if any of the RegEx gurus on the list can comment >>>on whether this can be done with RegEx.
>>>TIA. My sample data and quick and dirty brute force code follows.
>>>data one; >>> input ssn $9.; >>>datalines; >>>123456789 >>>987654321 >>>121212121 >>>343434343 >>>; >>>data two; >>> input ssn $9.; >>>datalines; >>>123456789 >>>987645321 >>>121212121 >>>232323232 >>>; >>>proc sql; >>> create table matches as >>> select a.ssn as ssn_one, >>> b.ssn as ssn_two, >>> case >>> when (a.ssn ne b.ssn) then 'Not Exact' >>> else 'Exact' >>> end as MatchResults >>> from one a, two b >>> where sum(substr(a.ssn,1,1)=substr(b.ssn,1,1), >>> substr(a.ssn,2,1)=substr(b.ssn,2,1), >>> substr(a.ssn,3,1)=substr(b.ssn,3,1), >>> substr(a.ssn,4,1)=substr(b.ssn,4,1), >>> substr(a.ssn,5,1)=substr(b.ssn,5,1), >>> substr(a.ssn,6,1)=substr(b.ssn,6,1), >>> substr(a.ssn,7,1)=substr(b.ssn,7,1), >>> substr(a.ssn,8,1)=substr(b.ssn,8,1), >>> substr(a.ssn,9,1)=substr(b.ssn,9,1) >>> ) ge 7; >>>quit;
>> hi, don, >> here is one way. see matches2 below. sas 9.1.3. hth. >> cheers, >> chang
>> data one; >> input ssn $9.; >> datalines; >> 123456789 >> 987654321 >> 121212121 >> 343434343 >> ; >> data two; >> input ssn $9.; >> datalines; >> 123456789 >> 987645321 >> 121212121 >> 232323232 >> ; >> proc sql; >> create table matches as >> select a.ssn as ssn_one, >> b.ssn as ssn_two, >> case >> when (a.ssn ne b.ssn) then 'Not Exact' >> else 'Exact' >> end as MatchResults >> from one a, two b >> where sum(substr(a.ssn,1,1)=substr(b.ssn,1,1), >> substr(a.ssn,2,1)=substr(b.ssn,2,1), >> substr(a.ssn,3,1)=substr(b.ssn,3,1), >> substr(a.ssn,4,1)=substr(b.ssn,4,1), >> substr(a.ssn,5,1)=substr(b.ssn,5,1), >> substr(a.ssn,6,1)=substr(b.ssn,6,1), >> substr(a.ssn,7,1)=substr(b.ssn,7,1), >> substr(a.ssn,8,1)=substr(b.ssn,8,1), >> substr(a.ssn,9,1)=substr(b.ssn,9,1) >> ) ge 7;
>> /* using prx */ >> create table matches2 as >> select a.ssn as ssn_one >> , b.ssn as ssn_two >> , ifc(a.ssn=b.ssn,'Exact','Not Exact') as MatchResults >> length=9 >> from one a, two b >> where countc(prxchange("s/(.)(?=(.{8}\1))/M/" >> , 9, catt(a.ssn,b.ssn)),"M")>=7; quit;
>> proc compare base=matches compare=matches2; run; >> /* on lst in part:
The crush of actual work has kept me from coming out to play on the 'L as much as I would like. (Don't you hate that!) Paul's solution has both inspired me to take a few minutes off to think about the problem from a different angle and come up with a solution.
As a first thought, I would conclude that a solution in the c or c++ language would work far faster than a SAS solution simply because c treats strings as character arrays and iterating through a series of character arrays should be an order faster than substringing strings. Paul's char() method helps reduce that advantage.
I next wondered where I would hide if I were a more efficient database programming solution. I immediately thought of an efficient database operation (a join) and a way to implement one (transposing character strings).
This SQL solution requires a couple of scans to set up implicit array indexes. It also minimizes the data values being carried from data sources to the solution. It won't perform O(n), but it will scale up to at least tens of thousands of SSN. For really large-scale problems, one of those hyperactive hash object solutions could approach O(n). I'll leave that to Paul, Greg, Chang, or one of the other uberhashers.
data one; input ssn $9.; datalines; 123456789 987654321 121212121 343434343 ; data two; input ssn $9.; datalines; 123456789 987645321 121212121 232323232 ; data one1; do until(EOF); i1+1; set one end=EOF; do d1=1 to 9; v1=char(ssn,d1); output; end; end; run; data two2; do until(EOF); i2+1; set two end=EOF; do d2=1 to 9; v2=char(ssn,d2); output; end; end; run; proc sql; create table matches as select i1,i2,sum(m) as count from (select i1,i2,d1,d2,(v1=v2) as m from one1 as t1,two2 as t2 where t1.d1=t2.d2 ) group by i1,i2 having sum(m)>=7 ; quit;
-----Original Message----- From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On Behalf Of Paul Dorfman Sent: Thursday, January 22, 2009 5:19 PM To: SA...@LISTSERV.UGA.EDU Subject: Re: Regex - Seven of Nine Matching
Don,
If it is performance you are primarily after (which would be logical to seek trying to process a Cartesian product explosion) and you still want to remain confined to SQL, I would suggest you change from substr(x,y,1) to char(x,y). Since char() returns a single byte, that avoids gobs of the idle blank-to-blank comparisons.
This simple change results in saving almost half the time in my tests (with 9.1.3 under AIX), which I have done on your own sample data with one record added to each file, then exploded 1000 times, so that 5000 "ssn" numbers are compared to 5000. For example, you SQL step with SUBSTR() ran 130 seconds, while the same step using CHAR() ran 77 seconds.
If you/your friend are not confined to SQL, I would rather use the DATA step for the simple reason that its logic (absent from SQL) allows to short-circuit the process of comparing a ssn1*ssn2 pair any time the number of *non-matching* digits exceeds 2, and even before that never consider a comparison if the strings are equal. For example,
data matches (drop = _:) ; set one (rename = (ssn = ssn1)) ; do p = 1 to n ; set two (rename = (ssn = ssn2)) point = p nobs = n ; _ne_cnt = 0 ; if ssn1 ne ssn2 then do ; ExactMatch = "N" ; do _cnt = 1 to 9 while (_ne_cnt <= 2) ; if char (ssn1, _cnt) ne char (ssn2, _cnt) then _ne_cnt ++ 1 ; end ; if _cnt <= 9 then continue ; end ; else ExactMatch = "Y" ; if _ne_cnt <= 2 then output ; end ; run ;
cuts the SQL processing time down to 47 seconds. Eschewing the re-reading of the second data set by placing it into an array beforehand,
data matches3 (drop = _:) ; array s [99999] $ 9 _temporary_ ; if _n_ = 1 then do _n_ = 1 to n ; set two nobs = n; s [_n_] = ssn ; end ; set one (rename = (ssn = ssn1)) ; do _n_ = 1 to n ; ssn2 = s[_n_] ; _ne_cnt = 0 ; if ssn1 ne ssn2 then do ; ExactMatch = "N" ; do _cnt = 1 to 9 while (_ne_cnt <= 2) ; if char (ssn1, _cnt) ne char (ssn2, _cnt) then _ne_cnt ++ 1 ; end ; if _cnt <= 9 then continue ; end ; else ExactMatch = "Y" ; if _ne_cnt <= 2 then output ; end ; run ;
reduces the run time further down to 32 seconds, meaning that a rather simple DATA step logic allows increasing performance four-fold from the original point.
I rather like Karma's COMPLEV() approach for its elegance, and it greatly simplifies the code, but unfortunately it is quite computationally intensive in addition to the obvious fact that it has to consider all the character in both strings being compared unconditionally. I have not tested regexen, but am pretty much sure they would fair no better for the same reason. Methinks any method failing to cut on the number of byte-to-byte comparisons is not going to do better than your "brute force". The only reason all the "improvements" suggested above do work is that they, including char() vs substr(), simply let SAS avoid extra work. For the same reason, Mike Rhoads' suggestions are very likely to improve run times as well. Given the simplistic nature of any Lazy Susan logic, it is effectively a brute force, only applied in a different direction.
The performance difference between 30 and 130 seconds is still not a big deal for one willing to wait an extra minute, but that may very well change if instead of 5000*5000 SSNs to compare, each file were mere 10 times longer. That would increase the processing time 100 fold, and we would be talking about the run time from 4 hours to 1.
Kind regards ------------ Paul Dorfman Jax, FL ------------
On Thu, 22 Jan 2009 14:59:21 -0500, Don Henderson <donaldjhender...@HOTMAIL.COM> wrote:
>Sig, Muthia and Howard,
>Thanks for the alternative suggestions.
>However, the requirement is 7 of the 9 positions match. And I believe >that this is mandated by some regulatory rule. So alternative matching >algorithms, even if they might be "better" in an abstract sense are not >an option. It was just speculation on my part as to the reason (transposition) >for 7 of the 9 match; it was not part of the requirement.
>Given the specifics of the requirements, I am curious as to the performance >issues and so I will compare the RegEx, complev and brute force method. The >online doc says that COMPLEV is faster than COMPGED which is faster >than SPEDIS for a comparison that COMPLEV can do.
>Thanks again for the suggestions however.
>Regards, >donh
>-----Original Message----- >From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On Behalf Of Sigurd >Hermansen >Sent: Thursday, January 22, 2009 1:31 PM >To: SA...@LISTSERV.UGA.EDU >Subject: Re: Regex - Seven of Nine Matching
>Don: >I've generally used the SPEDIS() function (or expressions involving >SPEDIS()) for comparing two SSN during the past decade, and I find SPEDIS() >a better measure of matching than any of the "number of matching >digits" measures. It also works very fast. It does a "cost of >rearrangement" calculation that assigns low costs to simple >transpositions and exchanges of >characters. The expression I use adjusts for lengths of strings (for >cases where a SSN has fewer than nine digits, such as when a person >records only the last four digits) and imposes symmetry of comparison >(because SPEDIS() can yield different results for comparisons of the >reverses of two SSN). As >a rule a SPEDIS() score of <= 0.05 indicates either a recording error >in one >or both of two SSN or SSN for two siblings (since persons applying at >the same time for SSN may receive consecutive numbers). S
>-----Original Message----- >From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On Behalf Of >Don Henderson >Sent: Thursday, January 22, 2009 12:36 PM >To: SA...@LISTSERV.UGA.EDU >Subject: Re: Regex - Seven of Nine Matching
>Karma and Chang,
>Thanks for these suggestions. The idea of a distance function (e.g., >COMPLEV) had not occurred to me.
>I don't have access to the real data my colleague will be using (he is >working in a locked down environment as it is sensitive financial >data), but >I will create some volume test data and run some experiments to compare the >performance of the three approaches and will report back.
>Thanks again! >Donh
>-----Original Message----- >From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On Behalf Of >karma >Sent: Thursday, January 22, 2009 12:10 PM >To: SA...@LISTSERV.UGA.EDU >Subject: Re: Regex - Seven of Nine Matching
>" don't think that this type of problem can be solved using regexp "
>Proven wrong in zero time :(
>2009/1/22 Chang Chung <chang_y_ch...@hotmail.com>: >> On Thu, 22 Jan 2009 10:22:02 -0500, Don Henderson >> <donaldjhender...@HOTMAIL.COM> wrote:
>>>And no, this is not a Sci-Fi question ;-).
>>>Am asking on behalf of a colleague who is doing a project involving >> matching >>>based on Social Security numbers. The client wants to match two files >> using >>>the Social Security number and wants to detect as possible matches >>>cases where position by position 7 of the nine digits match. I am >>>presuming that is because they want to account for the possibility of >>>transposed digits (but don't know that for sure).
>>>I've come up with what I call a brute force solution that uses the >>>sum function in the where clause to count position by position how >>>many characters match. Based on the sample below, it seems to work.
>>>But I am wondering if any of the RegEx gurus on the list can comment >>>on whether this can be done with RegEx.
>>>TIA. My sample data and quick and dirty brute force code follows.
I haven't been able to tell if the requirement was simply to check each record against its counterpart, or against all of the other records.
If its the former, wouldn't a simple data step sum be the quickest and easiest? For example:
data matches; set one; set two (rename=(ssn=ssn2)); matches=0; do i=1 to 9; matches=matches+(substr(ssn,i,1) eq substr(ssn2,i,1)); end; if matches gt 6 then output; run;
Art ------- On Thu, 22 Jan 2009 12:35:58 -0500, Don Henderson
<donaldjhender...@HOTMAIL.COM> wrote: >Karma and Chang,
>Thanks for these suggestions. The idea of a distance function (e.g., >COMPLEV) had not occurred to me.
>I don't have access to the real data my colleague will be using (he is >working in a locked down environment as it is sensitive financial data), but >I will create some volume test data and run some experiments to compare the >performance of the three approaches and will report back.
>Thanks again! >Donh
>-----Original Message----- >From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On Behalf Of karma >Sent: Thursday, January 22, 2009 12:10 PM >To: SA...@LISTSERV.UGA.EDU >Subject: Re: Regex - Seven of Nine Matching
>" don't think that this type of problem can be solved using regexp "
>Proven wrong in zero time :(
>2009/1/22 Chang Chung <chang_y_ch...@hotmail.com>: >> On Thu, 22 Jan 2009 10:22:02 -0500, Don Henderson >> <donaldjhender...@HOTMAIL.COM> wrote:
>>>And no, this is not a Sci-Fi question ;-).
>>>Am asking on behalf of a colleague who is doing a project involving >> matching >>>based on Social Security numbers. The client wants to match two files >> using >>>the Social Security number and wants to detect as possible matches >>>cases where position by position 7 of the nine digits match. I am >>>presuming that is because they want to account for the possibility of >>>transposed digits (but don't know that for sure).
>>>I've come up with what I call a brute force solution that uses the >>>sum function in the where clause to count position by position how >>>many characters match. Based on the sample below, it seems to work.
>>>But I am wondering if any of the RegEx gurus on the list can comment >>>on whether this can be done with RegEx.
>>>TIA. My sample data and quick and dirty brute force code follows.
>>>data one; >>> input ssn $9.; >>>datalines; >>>123456789 >>>987654321 >>>121212121 >>>343434343 >>>; >>>data two; >>> input ssn $9.; >>>datalines; >>>123456789 >>>987645321 >>>121212121 >>>232323232 >>>; >>>proc sql; >>> create table matches as >>> select a.ssn as ssn_one, >>> b.ssn as ssn_two, >>> case >>> when (a.ssn ne b.ssn) then 'Not Exact' >>> else 'Exact' >>> end as MatchResults >>> from one a, two b >>> where sum(substr(a.ssn,1,1)=substr(b.ssn,1,1), >>> substr(a.ssn,2,1)=substr(b.ssn,2,1), >>> substr(a.ssn,3,1)=substr(b.ssn,3,1), >>> substr(a.ssn,4,1)=substr(b.ssn,4,1), >>> substr(a.ssn,5,1)=substr(b.ssn,5,1), >>> substr(a.ssn,6,1)=substr(b.ssn,6,1), >>> substr(a.ssn,7,1)=substr(b.ssn,7,1), >>> substr(a.ssn,8,1)=substr(b.ssn,8,1), >>> substr(a.ssn,9,1)=substr(b.ssn,9,1) >>> ) ge 7; >>>quit;
>> hi, don, >> here is one way. see matches2 below. sas 9.1.3. hth. >> cheers, >> chang
>> data one; >> input ssn $9.; >> datalines; >> 123456789 >> 987654321 >> 121212121 >> 343434343 >> ; >> data two; >> input ssn $9.; >> datalines; >> 123456789 >> 987645321 >> 121212121 >> 232323232 >> ; >> proc sql; >> create table matches as >> select a.ssn as ssn_one, >> b.ssn as ssn_two, >> case >> when (a.ssn ne b.ssn) then 'Not Exact' >> else 'Exact' >> end as MatchResults >> from one a, two b >> where sum(substr(a.ssn,1,1)=substr(b.ssn,1,1), >> substr(a.ssn,2,1)=substr(b.ssn,2,1), >> substr(a.ssn,3,1)=substr(b.ssn,3,1), >> substr(a.ssn,4,1)=substr(b.ssn,4,1), >> substr(a.ssn,5,1)=substr(b.ssn,5,1), >> substr(a.ssn,6,1)=substr(b.ssn,6,1), >> substr(a.ssn,7,1)=substr(b.ssn,7,1), >> substr(a.ssn,8,1)=substr(b.ssn,8,1), >> substr(a.ssn,9,1)=substr(b.ssn,9,1) >> ) ge 7;
>> /* using prx */ >> create table matches2 as >> select a.ssn as ssn_one >> , b.ssn as ssn_two >> , ifc(a.ssn=b.ssn,'Exact','Not Exact') as MatchResults >> length=9 >> from one a, two b >> where countc(prxchange("s/(.)(?=(.{8}\1))/M/" >> , 9, catt(a.ssn,b.ssn)),"M")>=7; quit;
>> proc compare base=matches compare=matches2; run; >> /* on lst in part: >> NOTE: No unequal values were found. All values compared are exactly equal. >> */
And permit me to acknowledge that I probably should have asked a few more questions before posting this question as a favor to my colleague.
First, let me answer Paul's question that yes performance is the number one criteria. My brute force approach using a cartesian product was simply a first cut at testing the viability of the idea. But there is nothing about the problem or the client that requires SQL; so a data step solution is an option. And I had thought about an approach using SET with the POINT as Paul showed in his example but figured I would defer that until later.
Now to provide some additional details about the data and what is being looked for. I can't say too much about them other than to say:
- the first file is expected to contain upwards of 3M distinct SSNs - the second one could contain anywhere from 3M to 5M rows - it is expected that most rows could/should have an exact match, but there could be SSNs in the first file that are not in the second, and vice versa. - the expected matching is that there is a (0 or 1)-M relationship between the two files where M could be 0, 1, 2, etc. - what is still unknown (my colleague needs to get clarification from the client) is whether once an SSN in the second file matches exactly to an SSN in the first file is it excluded from the 7/9 matching
Clarification of the latter point is a key item before further investigation as it will dramatically impact the number of potential matches. While I don't want to even think about the real combinations and permutations, if we had all possible 9 digit numbers (clearly not the case, but makes this rough approximation easier to think about), then by just looking at as single position, we could have (I think) 81 possible matches (9 other potential values for each of the nine positions) for each row in the first file. Factor in a second position and I think we can all see where this is going.
As of now I think I am going to assume that if an exact match is found for an SSN in the second file that it can be excluded from further analysis. The business logic behind the 1-M matching suggests that this might be a reasonable assumption.
So again, thx for all the great ideas. As I said earlier, once I get a better handle on the scope of the data and refined requirements for how exact matches are handled, I will evaluate the alternatives suggested and report back.
> -----Original Message----- > From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On > Behalf Of Sigurd Hermansen > Sent: Thursday, January 22, 2009 6:34 PM > To: SA...@LISTSERV.UGA.EDU > Subject: Re: Regex - Seven of Nine Matching
> The crush of actual work has kept me from coming out to play > on the 'L as much as I would like. (Don't you hate that!) > Paul's solution has both inspired me to take a few minutes > off to think about the problem from a different angle and > come up with a solution.
> As a first thought, I would conclude that a solution in the c > or c++ language would work far faster than a SAS solution > simply because c treats strings as character arrays and > iterating through a series of character arrays should be an > order faster than substringing strings. Paul's char() method > helps reduce that advantage.
> I next wondered where I would hide if I were a more efficient > database programming solution. I immediately thought of an > efficient database operation (a join) and a way to implement > one (transposing character strings).
> This SQL solution requires a couple of scans to set up > implicit array indexes. It also minimizes the data values > being carried from data sources to the solution. It won't > perform O(n), but it will scale up to at least tens of > thousands of SSN. For really large-scale problems, one of > those hyperactive hash object solutions could approach O(n). > I'll leave that to Paul, Greg, Chang, or one of the other uberhashers.
> data one; > input ssn $9.; > datalines; > 123456789 > 987654321 > 121212121 > 343434343 > ; > data two; > input ssn $9.; > datalines; > 123456789 > 987645321 > 121212121 > 232323232 > ; > data one1; > do until(EOF); > i1+1; > set one end=EOF; > do d1=1 to 9; > v1=char(ssn,d1); > output; > end; > end; > run; > data two2; > do until(EOF); > i2+1; > set two end=EOF; > do d2=1 to 9; > v2=char(ssn,d2); > output; > end; > end; > run; > proc sql; > create table matches as > select i1,i2,sum(m) as count > from (select i1,i2,d1,d2,(v1=v2) as m > from one1 as t1,two2 as t2 > where t1.d1=t2.d2 > ) > group by i1,i2 having sum(m)>=7 > ; > quit;
> -----Original Message----- > From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On > Behalf Of Paul Dorfman > Sent: Thursday, January 22, 2009 5:19 PM > To: SA...@LISTSERV.UGA.EDU > Subject: Re: Regex - Seven of Nine Matching
> Don,
> If it is performance you are primarily after (which would be > logical to seek trying to process a Cartesian product > explosion) and you still want to remain confined to SQL, I > would suggest you change from substr(x,y,1) to char(x,y). > Since char() returns a single byte, that avoids gobs of the > idle blank-to-blank comparisons.
> This simple change results in saving almost half the time in > my tests (with 9.1.3 under AIX), which I have done on your > own sample data with one record added to each file, then > exploded 1000 times, so that 5000 "ssn" numbers are compared > to 5000. For example, you SQL step with SUBSTR() ran 130 > seconds, while the same step using CHAR() ran 77 seconds.
> If you/your friend are not confined to SQL, I would rather > use the DATA step for the simple reason that its logic > (absent from SQL) allows to short-circuit the process of > comparing a ssn1*ssn2 pair any time the number of > *non-matching* digits exceeds 2, and even before that never > consider a comparison if the strings are equal. For example,
> data matches (drop = _:) ; > set one (rename = (ssn = ssn1)) ; > do p = 1 to n ; > set two (rename = (ssn = ssn2)) point = p nobs = n ; > _ne_cnt = 0 ; > if ssn1 ne ssn2 then do ; > ExactMatch = "N" ; > do _cnt = 1 to 9 while (_ne_cnt <= 2) ; > if char (ssn1, _cnt) ne char (ssn2, _cnt) then > _ne_cnt ++ 1 ; > end ; > if _cnt <= 9 then continue ; > end ; > else ExactMatch = "Y" ; > if _ne_cnt <= 2 then output ; > end ; > run ;
> cuts the SQL processing time down to 47 seconds. Eschewing > the re-reading of the second data set by placing it into an > array beforehand,
> data matches3 (drop = _:) ; > array s [99999] $ 9 _temporary_ ; > if _n_ = 1 then do _n_ = 1 to n ; > set two nobs = n; > s [_n_] = ssn ; > end ; > set one (rename = (ssn = ssn1)) ; > do _n_ = 1 to n ; > ssn2 = s[_n_] ; > _ne_cnt = 0 ; > if ssn1 ne ssn2 then do ; > ExactMatch = "N" ; > do _cnt = 1 to 9 while (_ne_cnt <= 2) ; > if char (ssn1, _cnt) ne char (ssn2, _cnt) then > _ne_cnt ++ 1 ; > end ; > if _cnt <= 9 then continue ; > end ; > else ExactMatch = "Y" ; > if _ne_cnt <= 2 then output ; > end ; > run ;
> reduces the run time further down to 32 seconds, meaning that > a rather simple DATA step logic allows increasing performance > four-fold from the original point.
> I rather like Karma's COMPLEV() approach for its elegance, > and it greatly simplifies the code, but unfortunately it is > quite computationally intensive in addition to the obvious > fact that it has to consider all the character in both > strings being compared unconditionally. I have not tested > regexen, but am pretty much sure they would fair no better > for the same reason. Methinks any method failing to cut on > the number of byte-to-byte comparisons is not going to do > better than your "brute force". The only reason all the > "improvements" suggested above do work is that they, > including char() vs substr(), simply let SAS avoid extra > work. For the same reason, Mike Rhoads' suggestions are very > likely to improve run times as well. Given the simplistic > nature of any Lazy Susan logic, it is effectively a brute > force, only applied in a different direction.
> The performance difference between 30 and 130 seconds is > still not a big deal for one willing to wait an extra minute, > but that may very well change if instead of 5000*5000 SSNs to > compare, each file were mere 10 times longer. That would > increase the processing time 100 fold, and we would be > talking about the run time from 4 hours to 1.
> On Thu, 22 Jan 2009 14:59:21 -0500, Don Henderson > <donaldjhender...@HOTMAIL.COM> wrote:
> >Sig, Muthia and Howard,
> >Thanks for the alternative suggestions.
> >However, the requirement is 7 of the 9 positions match. And > I believe > >that this is mandated by some regulatory rule. So > alternative matching > >algorithms, even if they might be "better" in an abstract > sense are not > >an option. It was just speculation on my part as to the reason > (transposition) > >for 7 of the 9 match; it was not part of the requirement.
> >Given the specifics of the requirements, I am curious as to the > performance > >issues and so I will compare the RegEx, complev and brute > force method. > The > >online doc says that COMPLEV is faster than COMPGED which is faster > >than SPEDIS for a comparison that COMPLEV can do.
> >Thanks again for the suggestions however.
> >Regards, > >donh
> >-----Original Message----- > >From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On Behalf Of > Sigurd
Don: Generally fuzzy matching of SSN or other identifiers turns out to be a Cartesian product problem; if one has an ID that can be used to join two datasets, one doesn't need a SSN to link records for the same person in two datasets.
A 7 of 9 digit SSN match rule won't identify matching records perfectly, but may do well enough. Expansion of the number of rows won't make much difference so long as the datasets used in comparisons remain very narrow. 1-M matching will only expand the number of potential matches. Watch out for M-M matching, though, because the yield of a join will explode. Many linkage problems have duplicates in data and inadvertent M-M matches.
I've worked with Paul D. on similar problems for the last ten years. I'd look to him for technical expertise. Any neophyte in the field of fuzzy or probabilistic record linkage has a steep learning curve to climb. The literature goes back more than thirty years. S
-----Original Message----- From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On Behalf Of Don Henderson Sent: Thursday, January 22, 2009 8:57 PM To: SA...@LISTSERV.UGA.EDU Subject: Re: Regex - Seven of Nine Matching
Sig, Paul, Art,
Thanks again for the insights.
And permit me to acknowledge that I probably should have asked a few more questions before posting this question as a favor to my colleague.
First, let me answer Paul's question that yes performance is the number one criteria. My brute force approach using a cartesian product was simply a first cut at testing the viability of the idea. But there is nothing about the problem or the client that requires SQL; so a data step solution is an option. And I had thought about an approach using SET with the POINT as Paul showed in his example but figured I would defer that until later.
Now to provide some additional details about the data and what is being looked for. I can't say too much about them other than to say:
- the first file is expected to contain upwards of 3M distinct SSNs - the second one could contain anywhere from 3M to 5M rows - it is expected that most rows could/should have an exact match, but there could be SSNs in the first file that are not in the second, and vice versa. - the expected matching is that there is a (0 or 1)-M relationship between the two files where M could be 0, 1, 2, etc. - what is still unknown (my colleague needs to get clarification from the client) is whether once an SSN in the second file matches exactly to an SSN in the first file is it excluded from the 7/9 matching
Clarification of the latter point is a key item before further investigation as it will dramatically impact the number of potential matches. While I don't want to even think about the real combinations and permutations, if we had all possible 9 digit numbers (clearly not the case, but makes this rough approximation easier to think about), then by just looking at as single position, we could have (I think) 81 possible matches (9 other potential values for each of the nine positions) for each row in the first file. Factor in a second position and I think we can all see where this is going.
As of now I think I am going to assume that if an exact match is found for an SSN in the second file that it can be excluded from further analysis. The business logic behind the 1-M matching suggests that this might be a reasonable assumption.
So again, thx for all the great ideas. As I said earlier, once I get a better handle on the scope of the data and refined requirements for how exact matches are handled, I will evaluate the alternatives suggested and report back.
Thanks again, -donh
> -----Original Message----- > From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On Behalf Of > Sigurd Hermansen > Sent: Thursday, January 22, 2009 6:34 PM > To: SA...@LISTSERV.UGA.EDU > Subject: Re: Regex - Seven of Nine Matching
> The crush of actual work has kept me from coming out to play on the 'L > as much as I would like. (Don't you hate that!) Paul's solution has > both inspired me to take a few minutes off to think about the problem > from a different angle and come up with a solution.
> As a first thought, I would conclude that a solution in the c or c++ > language would work far faster than a SAS solution simply because c > treats strings as character arrays and iterating through a series of > character arrays should be an order faster than substringing strings. > Paul's char() method helps reduce that advantage.
> I next wondered where I would hide if I were a more efficient database > programming solution. I immediately thought of an efficient database > operation (a join) and a way to implement one (transposing character > strings).
> This SQL solution requires a couple of scans to set up implicit array > indexes. It also minimizes the data values being carried from data > sources to the solution. It won't perform O(n), but it will scale up > to at least tens of thousands of SSN. For really large-scale problems, > one of those hyperactive hash object solutions could approach O(n). > I'll leave that to Paul, Greg, Chang, or one of the other uberhashers.
> data one; > input ssn $9.; > datalines; > 123456789 > 987654321 > 121212121 > 343434343 > ; > data two; > input ssn $9.; > datalines; > 123456789 > 987645321 > 121212121 > 232323232 > ; > data one1; > do until(EOF); > i1+1; > set one end=EOF; > do d1=1 to 9; > v1=char(ssn,d1); > output; > end; > end; > run; > data two2; > do until(EOF); > i2+1; > set two end=EOF; > do d2=1 to 9; > v2=char(ssn,d2); > output; > end; > end; > run; > proc sql; > create table matches as > select i1,i2,sum(m) as count > from (select i1,i2,d1,d2,(v1=v2) as m > from one1 as t1,two2 as t2 > where t1.d1=t2.d2 > ) > group by i1,i2 having sum(m)>=7 > ; > quit;
> -----Original Message----- > From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On Behalf Of > Paul Dorfman > Sent: Thursday, January 22, 2009 5:19 PM > To: SA...@LISTSERV.UGA.EDU > Subject: Re: Regex - Seven of Nine Matching
> Don,
> If it is performance you are primarily after (which would be logical > to seek trying to process a Cartesian product > explosion) and you still want to remain confined to SQL, I would > suggest you change from substr(x,y,1) to char(x,y). Since char() > returns a single byte, that avoids gobs of the idle blank-to-blank > comparisons.
> This simple change results in saving almost half the time in my tests > (with 9.1.3 under AIX), which I have done on your own sample data with > one record added to each file, then exploded 1000 times, so that 5000 > "ssn" numbers are compared to 5000. For example, you SQL step with > SUBSTR() ran 130 seconds, while the same step using CHAR() ran 77 > seconds.
> If you/your friend are not confined to SQL, I would rather use the > DATA step for the simple reason that its logic (absent from SQL) > allows to short-circuit the process of comparing a ssn1*ssn2 pair any > time the number of > *non-matching* digits exceeds 2, and even before that never consider a > comparison if the strings are equal. For example,
> data matches (drop = _:) ; > set one (rename = (ssn = ssn1)) ; > do p = 1 to n ; > set two (rename = (ssn = ssn2)) point = p nobs = n ; > _ne_cnt = 0 ; > if ssn1 ne ssn2 then do ; > ExactMatch = "N" ; > do _cnt = 1 to 9 while (_ne_cnt <= 2) ; > if char (ssn1, _cnt) ne char (ssn2, _cnt) then _ne_cnt ++ > 1 ; > end ; > if _cnt <= 9 then continue ; > end ; > else ExactMatch = "Y" ; > if _ne_cnt <= 2 then output ; > end ; > run ;
> cuts the SQL processing time down to 47 seconds. Eschewing the > re-reading of the second data set by placing it into an array > beforehand,
> data matches3 (drop = _:) ; > array s [99999] $ 9 _temporary_ ; > if _n_ = 1 then do _n_ = 1 to n ; > set two nobs = n; > s [_n_] = ssn ; > end ; > set one (rename = (ssn = ssn1)) ; > do _n_ = 1 to n ; > ssn2 = s[_n_] ; > _ne_cnt = 0 ; > if ssn1 ne ssn2 then do ; > ExactMatch = "N" ; > do _cnt = 1 to 9 while (_ne_cnt <= 2) ; > if char (ssn1, _cnt) ne char (ssn2, _cnt) then _ne_cnt ++ > 1 ; > end ; > if _cnt <= 9 then continue ; > end ; > else ExactMatch = "Y" ; > if _ne_cnt <= 2 then output ; > end ; > run ;
> reduces the run time further down to 32 seconds, meaning that a rather > simple DATA step logic allows increasing performance four-fold from > the original point.
> I rather like Karma's COMPLEV() approach for its elegance, and it > greatly simplifies the code, but unfortunately it is quite > computationally intensive in addition to the obvious fact that it has > to consider all the character in both strings being compared > unconditionally. I have not tested regexen, but am pretty much sure > they would fair no better for the same reason. Methinks any method > failing to cut on the number of byte-to-byte comparisons is not going > to do better than your "brute force". The only reason all the > "improvements" suggested above do work is that they, > including char() vs substr(), simply let SAS avoid extra > work. For the same reason, Mike Rhoads' suggestions are very > likely to improve run times as well. Given the simplistic > nature of any Lazy Susan logic, it is effectively a brute > force, only applied in a different direction.
> The performance difference between 30 and 130 seconds is still not a > big deal for one willing to wait an extra minute, but that may very > well change if instead of 5000*5000 SSNs to compare, each file were > mere 10 times longer. That would increase the processing time 100 > fold, and we would be
As I said originally, the specs from the client were very specific - 7 out of the 9 positions match. I don't know the source of this and we've been unable (so far) to engage in conversations regarding other matching criteria. So for now we need to go with this scheme (but I do agree that other schemes/approaches are likely to be better in a number of ways).
Now to an interim progress report on what I've discovered so far.
First, it seems that the complev function won't do this sort of matching. But it does pick up other potential matches that probably should be of interest (but, sigh, are not part of the specs). Consider the following:
890 data complev; 891 insertions = complev('123456789', 892 '012345678'); 893 put insertions=; 894 run;
insertions=2
Clearly all nine digits are different. But since complev counts insertions/deletions needed to make the strings match, it won't do the 7/9 matching.
And before I proceed, let me say that the data volumes are pretty big. The first file is expected to be upwards of 3M rows and the second file on the order of 4M rows. Given that volume and the cartesian product nature of the comparisons (whether in SQL or the data step/array approach), that is simply way too much data. So we have been able to convince the client to drop from the second file any SSN that matches exactly to the first file. We expect that will reduce the second file to around 100K rows.
That is still too much (I expect) for a SQL solution, so I decided to go down the data step path. I have lots more work/analysis to do as this needs to be as optimal as possible. Eventually this will be running on a MF, but for now, I have to use my PC and randomly generated data (not ideal, but I don't have access to the real data). And yes, I know that real data is likely to have patterns that might alter the comparisons.
Using a variation of the approach suggested by Paul and Mike, here is what I have found so far for a first file of 1M rows and a second file of 33K (roughly one third the size of the real files).
- I first ran a simple data step that loaded the second file into an array of size 33K. That data step then ready the 1M rows and for each row iterated thru the array. It did nothing but iterate. This was just to give me a benchmark of the overhead to handle the data volumes.
This step took roughly 1 hour on an (admittedly) underpowered PC.
- Next I ran a compare that used the compare logic suggested by Mike to compare the strings.
This step took roughly 4 hours.
- Next, I looked at adding some logic to do more short-circuiting. Basically what I did was build pointers for the second file so that once I knew there were 3 differences in the first X characters, I would skip to next value that had a different value in the first X characters.
That dramatically reduced the time - to around 40 minutes - less that the baseline where no work was done other than looping thru all the combinations.
Have lots more analysis to do and more short-circuiting criteria that I can probably add. But I wanted to thank everyone now for their input by posting this "progress" report.
Once I've done some more work on this, I will summarize it by writing up an article on sasCommunity.org ;-).
Again, thx to everyone for their input/suggestions.
> -----Original Message----- > From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On > Behalf Of Sigurd Hermansen > Sent: Thursday, January 22, 2009 9:39 PM > To: SA...@LISTSERV.UGA.EDU > Subject: Re: Regex - Seven of Nine Matching
> Don: > Generally fuzzy matching of SSN or other identifiers turns > out to be a Cartesian product problem; if one has an ID that > can be used to join two datasets, one doesn't need a SSN to > link records for the same person in two datasets.
> A 7 of 9 digit SSN match rule won't identify matching records > perfectly, but may do well enough. Expansion of the number of > rows won't make much difference so long as the datasets used > in comparisons remain very narrow. 1-M matching will only > expand the number of potential matches. Watch out for M-M > matching, though, because the yield of a join will explode. > Many linkage problems have duplicates in data and inadvertent > M-M matches.
> I've worked with Paul D. on similar problems for the last ten > years. I'd look to him for technical expertise. Any neophyte > in the field of fuzzy or probabilistic record linkage has a > steep learning curve to climb. The literature goes back more > than thirty years. > S
> -----Original Message----- > From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On > Behalf Of Don Henderson > Sent: Thursday, January 22, 2009 8:57 PM > To: SA...@LISTSERV.UGA.EDU > Subject: Re: Regex - Seven of Nine Matching
> Sig, Paul, Art,
> Thanks again for the insights.
> And permit me to acknowledge that I probably should have > asked a few more questions before posting this question as a > favor to my colleague.
> First, let me answer Paul's question that yes performance is > the number one criteria. My brute force approach using a > cartesian product was simply a first cut at testing the > viability of the idea. But there is nothing about the problem > or the client that requires SQL; so a data step solution is > an option. And I had thought about an approach using SET with > the POINT as Paul showed in his example but figured I would > defer that until later.
> Now to provide some additional details about the data and > what is being looked for. I can't say too much about them > other than to say:
> - the first file is expected to contain upwards of 3M distinct SSNs > - the second one could contain anywhere from 3M to 5M rows > - it is expected that most rows could/should have an exact > match, but there could be SSNs in the first file that are not > in the second, and vice versa. > - the expected matching is that there is a (0 or 1)-M > relationship between the two files where M could be 0, 1, 2, etc. > - what is still unknown (my colleague needs to get > clarification from the > client) is whether once an SSN in the second file matches > exactly to an SSN in the first file is it excluded from the > 7/9 matching
> Clarification of the latter point is a key item before > further investigation as it will dramatically impact the > number of potential matches. While I don't want to even think > about the real combinations and permutations, if we had all > possible 9 digit numbers (clearly not the case, but makes > this rough approximation easier to think about), then by just > looking at as single position, we could have (I think) 81 > possible matches (9 other potential values for each of the > nine positions) for each row in the first file. Factor in a > second position and I think we can all see where this is going.
> As of now I think I am going to assume that if an exact match > is found for an SSN in the second file that it can be > excluded from further analysis. The business logic behind the > 1-M matching suggests that this might be a reasonable assumption.
> So again, thx for all the great ideas. As I said earlier, > once I get a better handle on the scope of the data and > refined requirements for how exact matches are handled, I > will evaluate the alternatives suggested and report back.
> Thanks again, > -donh
> > -----Original Message----- > > From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On > Behalf Of > > Sigurd Hermansen > > Sent: Thursday, January 22, 2009 6:34 PM > > To: SA...@LISTSERV.UGA.EDU > > Subject: Re: Regex - Seven of Nine Matching
> > The crush of actual work has kept me from coming out to > play on the 'L > > as much as I would like. (Don't you hate that!) Paul's solution has > > both inspired me to take a few minutes off to think about > the problem > > from a different angle and come up with a solution.
> > As a first thought, I would conclude that a solution in the > c or c++ > > language would work far faster than a SAS solution simply because c > > treats strings as character arrays and iterating through a > series of > > character arrays should be an order faster than > substringing strings. > > Paul's char() method helps reduce that advantage.
> > I next wondered where I would hide if I were a more > efficient database > > programming solution. I immediately thought of an efficient > database > > operation (a join) and a way to implement one (transposing > character > > strings).
> > This SQL solution requires a couple of scans to set up > implicit array > > indexes. It also minimizes the data values being carried from data > > sources to the solution. It won't perform O(n), but it will > scale up > > to at least tens of thousands of SSN. For really > large-scale problems, > > one of those hyperactive hash object solutions could approach O(n). > > I'll leave that to Paul, Greg, Chang, or one of the other > uberhashers.
> > data one; > > input ssn $9.; > > datalines; > > 123456789 > > 987654321 > > 121212121 > > 343434343 > > ; > > data two; > > input ssn $9.; > > datalines; > > 123456789 > > 987645321 > > 121212121 > > 232323232 > > ; > > data one1; > > do until(EOF); > > i1+1; > > set one end=EOF; > > do d1=1 to 9; > > v1=char(ssn,d1); > > output; > > end; > > end; > > run; > > data two2; > > do until(EOF); > > i2+1; > > set two end=EOF; > > do d2=1 to 9; > > v2=char(ssn,d2); > > output; > > end; > > end; > > run; > > proc sql; > > create table matches as > > select i1,i2,sum(m) as count > > from (select i1,i2,d1,d2,(v1=v2) as m > > from one1 as t1,two2 as t2 > > where t1.d1=t2.d2 > > ) > > group by i1,i2 having sum(m)>=7 > > ; > > quit;
> As I said originally, the specs from the client were very specific - 7 out > of the 9 positions match. I don't know the source of this and we've been > unable (so far) to engage in conversations regarding other matching > criteria. So for now we need to go with this scheme (but I do agree that > other schemes/approaches are likely to be better in a number of ways).
> Now to an interim progress report on what I've discovered so far.
> First, it seems that the complev function won't do this sort of matching. > But it does pick up other potential matches that probably should be of > interest (but, sigh, are not part of the specs). Consider the following:
> 890 data complev; > 891 insertions = complev('123456789', > 892 '012345678'); > 893 put insertions=; > 894 run;
> insertions=2
> Clearly all nine digits are different. But since complev counts > insertions/deletions needed to make the strings match, it won't do the 7/9 > matching.
> And before I proceed, let me say that the data volumes are pretty big. The > first file is expected to be upwards of 3M rows and the second file on the > order of 4M rows. Given that volume and the cartesian product nature of the > comparisons (whether in SQL or the data step/array approach), that is simply > way too much data. So we have been able to convince the client to drop from > the second file any SSN that matches exactly to the first file. We expect > that will reduce the second file to around 100K rows.
> That is still too much (I expect) for a SQL solution, so I decided to go > down the data step path. I have lots more work/analysis to do as this needs > to be as optimal as possible. Eventually this will be running on a MF, but > for now, I have to use my PC and randomly generated data (not ideal, but I > don't have access to the real data). And yes, I know that real data is > likely to have patterns that might alter the comparisons.
> Using a variation of the approach suggested by Paul and Mike, here is what I > have found so far for a first file of 1M rows and a second file of 33K > (roughly one third the size of the real files).
> - I first ran a simple data step that loaded the second file into an array > of size 33K. That data step then ready the 1M rows and for each row iterated > thru the array. It did nothing but iterate. This was just to give me a > benchmark of the overhead to handle the data volumes.
> This step took roughly 1 hour on an (admittedly) underpowered PC.
> - Next I ran a compare that used the compare logic suggested by Mike to > compare the strings.
> This step took roughly 4 hours.
> - Next, I looked at adding some logic to do more short-circuiting. Basically > what I did was build pointers for the second file so that once I knew there > were 3 differences in the first X characters, I would skip to next value > that had a different value in the first X characters.
> That dramatically reduced the time - to around 40 minutes - less that the > baseline where no work was done other than looping thru all the > combinations.
> Have lots more analysis to do and more short-circuiting criteria that I can > probably add. But I wanted to thank everyone now for their input by posting > this "progress" report.
> Once I've done some more work on this, I will summarize it by writing up an > article on sasCommunity.org ;-).
> Again, thx to everyone for their input/suggestions.
> -donh
> > -----Original Message----- > > From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On > > Behalf Of Sigurd Hermansen > > Sent: Thursday, January 22, 2009 9:39 PM > > To: SA...@LISTSERV.UGA.EDU > > Subject: Re: Regex - Seven of Nine Matching
> > Don: > > Generally fuzzy matching of SSN or other identifiers turns > > out to be a Cartesian product problem; if one has an ID that > > can be used to join two datasets, one doesn't need a SSN to > > link records for the same person in two datasets.
> > A 7 of 9 digit SSN match rule won't identify matching records > > perfectly, but may do well enough. Expansion of the number of > > rows won't make much difference so long as the datasets used > > in comparisons remain very narrow. 1-M matching will only > > expand the number of potential matches. Watch out for M-M > > matching, though, because the yield of a join will explode. > > Many linkage problems have duplicates in data and inadvertent > > M-M matches.
> > I've worked with Paul D. on similar problems for the last ten > > years. I'd look to him for technical expertise. Any neophyte > > in the field of fuzzy or probabilistic record linkage has a > > steep learning curve to climb. The literature goes back more > > than thirty years. > > S
> > -----Original Message----- > > From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On > > Behalf Of Don Henderson > > Sent: Thursday, January 22, 2009 8:57 PM > > To: SA...@LISTSERV.UGA.EDU > > Subject: Re: Regex - Seven of Nine Matching
> > Sig, Paul, Art,
> > Thanks again for the insights.
> > And permit me to acknowledge that I probably should have > > asked a few more questions before posting this question as a > > favor to my colleague.
> > First, let me answer Paul's question that yes performance is > > the number one criteria. My brute force approach using a > > cartesian product was simply a first cut at testing the > > viability of the idea. But there is nothing about the problem > > or the client that requires SQL; so a data step solution is > > an option. And I had thought about an approach using SET with > > the POINT as Paul showed in his example but figured I would > > defer that until later.
> > Now to provide some additional details about the data and > > what is being looked for. I can't say too much about them > > other than to say:
> > - the first file is expected to contain upwards of 3M distinct SSNs > > - the second one could contain anywhere from 3M to 5M rows > > - it is expected that most rows could/should have an exact > > match, but there could be SSNs in the first file that are not > > in the second, and vice versa. > > - the expected matching is that there is a (0 or 1)-M > > relationship between the two files where M could be 0, 1, 2, etc. > > - what is still unknown (my colleague needs to get > > clarification from the > > client) is whether once an SSN in the second file matches > > exactly to an SSN in the first file is it excluded from the > > 7/9 matching
> > Clarification of the latter point is a key item before > > further investigation as it will dramatically impact the > > number of potential matches. While I don't want to even think > > about the real combinations and permutations, if we had all > > possible 9 digit numbers (clearly not the case, but makes > > this rough approximation easier to think about), then by just > > looking at as single position, we could have (I think) 81 > > possible matches (9 other potential values for each of the > > nine positions) for each row in the first file. Factor in a > > second position and I think we can all see where this is going.
> > As of now I think I am going to assume that if an exact match > > is found for an SSN in the second file that it can be > > excluded from further analysis. The business logic behind the > > 1-M matching suggests that this might be a reasonable assumption.
> > So again, thx for all the great ideas. As I said earlier, > > once I get a better handle on the scope of the data and > > refined requirements for how exact matches are handled, I > > will evaluate the alternatives suggested and report back.
> > Thanks again, > > -donh
> > > -----Original Message----- > > > From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On > > Behalf Of > > > Sigurd Hermansen > > > Sent: Thursday, January 22, 2009 6:34 PM > > > To: SA...@LISTSERV.UGA.EDU > > > Subject: Re: Regex - Seven of Nine Matching
> > > The crush of actual work has kept me from coming out to > > play on the 'L > > > as much as I would like. (Don't you hate that!) Paul's solution has > > > both inspired me to take a few minutes off to think about > > the problem > > > from a different angle and come up with a solution.
> > > As a first thought, I would conclude that a solution in the > > c or c++ > > > language would work far faster than a SAS solution simply because c > > > treats strings as character arrays and iterating through a > > series of > > > character arrays should be an order faster than > > substringing strings. > > > Paul's char() method helps reduce that advantage.
> > > I next wondered where I would hide if I were a more > > efficient database > > > programming solution. I immediately thought of an efficient > > database > > > operation (a join) and a way to implement one (transposing > > character > > > strings).
> > > This SQL solution requires a couple of scans to set up > > implicit array > > > indexes. It also minimizes the data values being carried from data > > > sources to the solution. It won't perform O(n), but it will > > scale up > > > to at least tens of thousands of SSN. For really > > large-scale problems, > > > one of those hyperactive hash object solutions could approach O(n). > > > I'll leave that to Paul, Greg, Chang, or one of the other > > uberhashers.
But why... a Levenshtein implementation is already available via complev() function, and if that is not enough, there is always compged() augmented with call compcost() giving a much wider latitude in distance editing. Regardless, it appears that any of these are way too expensive for Don's purposes, forthey invariably have to examine all 9 characters, and then perform rather heavy computations. Comparing the corresponding digits and applying as much logical short-circuiting as much as possible (including that dictated by intimate data knowledge) is better equipped to deal with (unavoidable?) Cartesian product nature of the problem. I am using the question mark in the parentheses deliberately... for I have not fully proven to myself yet that no method to avoid comparing all to all exists.
Kind regards ------------ Paul Dorfman Jax, FL ------------
> I know you know C# a bit now so this algorithmic approach may help you > solve the issue. You could possibly code this in SAS or call it via C# > and include it into the dataset.
> But why... a Levenshtein implementation is already available via complev() > function, and if that is not enough, there is always compged() augmented with > call compcost() giving a much wider latitude in distance editing. Regardless, > it appears that any of these are way too expensive for Don's purposes, > forthey invariably have to examine all 9 characters, and then perform rather > heavy computations. Comparing the corresponding digits and applying as much > logical short-circuiting as much as possible (including that dictated by > intimate data knowledge) is better equipped to deal with (unavoidable?) > Cartesian product nature of the problem. I am using the question mark in > the parentheses deliberately... for I have not fully proven to myself yet that > no method to avoid comparing all to all exists.
> > I know you know C# a bit now so this algorithmic approach may help you > > solve the issue. You could possibly code this in SAS or call it via C# > > and include it into the dataset.
> > Just some late night ideas...
> > Alan- Hide quoted text -
> - Show quoted text -
Paul,
I wasn't aware that it was already in SAS.
Don just learned enough C# to get him started so I will let him assess whether this is a fit that he is comfortable with. I think he should compare the SAS vs the C# approach but it depends on his client, O/S, circumstances, etc.
I was actually trying to research a regex approach to it and came across the algorithm. It's cool that SAS already has it built in.
Replying to Alan yesterday, I wrote, in part: "...to deal with (unavoidable?) Cartesian product nature of the problem. I am using the question mark in the parentheses deliberately... for I have not fully proven to myself yet that no method to avoid comparing all to all exists." The reason for the doubt was a hint in Sig's first post in the thread. What I now think is that I have proven the opposite.
The key here is the matching pattern constraint in the specifications. To wit, there are 9!/7!/(9-7)!=36 7-digit positional patterns in a 9-digit string, and we have to match only pattern #1 of SSN1 to pattern #1 of SSN2, pattern #2 of SSN1 to pattern #2 of SSN2,..., and so on till pattern #36. Since thus no cross-pattern matching is required, separate patterns can be matched independently. Hence the algorithm can be devised as follows:
DO PATTERN = 1 to 36 pass through each file extract 7-digit pattern #N from SSN1 and SSN1 have it as key and SSN1 and SSN2 as satellites perform many-to-many match of result sets Output matches and append them to final data set END
At the first glance, this scheme appears idiotic, because effectively it means exploding each file 36 times, then matching by (pattern number * pattern). However, it appears less so after the realization that the number of passes through each file is limited to *only* 36 *regardless* of the file size, and after the explosion has occurred, matching on each pattern becomes time-linear, provided that a proper matching method is selected.
Aside from the speed, the concrete implementation is largely a matter of taste and experience. SQL will have no problem with many-to-many matching of 3mx4m rows by a pattern, and that can be simply iterated 36 times. Or, for each pattern, one side can be indexed and searched during a pass through the other side, collecting satellites for duplicate patterns by looping over duplicate key index values.
Below, I choose the method I think is the fastest. To avoid re-reading the files from disk, they are buffered on the first file passes into a [large enough] array, which is subsequebtly re-read 35 times. The actual matching by a pattern is done by storing the pattern ID from file DSN1 and its discriminating enumerator N as a hash key and SSN1 - as a satellite, then searching the table for each respective pattern from SSN2. The discriminator is necessary because a pattern can correspond to more than one SSN1. (In Version 9.2, multi-hash can be used to harvest different SSN1 for the same key pattern, but here I am running 9.1.3 and hence use my old - and very simple - trick of storing and harvesting hash dupes, which ought to be apparent from the code.) It is important that after each pass, the table is explicitly deleted, else memory overload can (and does) occur. (Again, in 9.2 it is not necessary - instead .clear() method should be used.) Although .delete() is an expensive method, it is done only 36 times and, practically speaking, has no adverse effect.
%let size = 1e5 ;
data dsn1 dsn2 ; length ssn $ 9 ; do _n_ = 1 to &size ; ssn = put (ceil (ranuni(1) * 1e9), z9.) ; output dsn1 ; ssn = put (ceil (ranuni(1) * 1e9), z9.) ; output dsn2 ; end ; run ;
data matches (keep = ssn: exact) ; array ss [2, 5000000] $ 9 _temporary_ ; do i = 1 to 8 ; do j = i + 1 to 9 ; dcl hash h (hashexp: 16) ; h.definekey ('id', 'n') ; h.definedata ('ssn1') ; h.definedone () ; do p = 1 to n1 ; if i = 1 and j = 2 then do ; set dsn1 (rename = (ssn = ssn1)) nobs = n1 ; ss [1, p] = ssn1 ; end ; else ssn1 = ss [1, p] ; id = ssn1 ; substr (id, i, 1) = "*" ; substr (id, j, 1) = "*" ; do n = 1 by 1 until (h.check() ne 0) ; end ; h.add() ; end ; do p = 1 to n2 ; if i = 1 and j = 2 then do ; set dsn2 (rename = (ssn = ssn2)) nobs = n2 ; ss [2, p] = ssn2 ; end ; else ssn2 = ss [2, p] ; id = ssn2 ; substr (id, i, 1) = "*" ; substr (id, j, 1) = "*" ; do n = 1 by 1 ; if h.find() ne 0 then leave ; if ssn1 = ssn2 then Exact = "Y" ; else Exact = "N" ; output ; end ; end ; h.delete() ; idn ++ 1 ; put "Progress: pattern # " idn 2.-R " processed." ; end ; end ; stop ; run ;
This 1e5*1e5 match runs in under 10 seconds on a garden variety AIX box and outputs 35,938 records, taking up 250M of RAM. What is critical, though, is how it scales up. Here are some log figures:
Even the 5m by 5m match runs under 27 minutes, which includes producing the enormous output (in the real environment likely to be infinitesimally smaller). The real time scales almost linearly as the sizes grow, "almost" probably being due to the O(log(N)), rather than O(1), nature of the AVL trees underlying SAS hash. Still, the comparison math is nowhere near the Cartesian explosion. Running 5m by 5m match using the scheme above means 36 passes over side 1, during which 36*5e6=1.8*5e7 SSNs are considered, plus, roughly speaking, 5e6*log(5e6)=3.5e7 SSN pairs are considered searching the table. Running the same via a Cartesian explosion means 1 pass through file DSN1 and 5m passes through file DSN2, during which 2.5e13 SSN pairs are considered - the difference of good 5 orders of magnitude.
Two more notes. First, on some mainframes, getting 640M of main storage may be a (pretty ridiculous) problem, although I would not expect a system programmer to fret about trading a short peak of main storage usage for a job quickly leaving the queue.
Second, in the above code no effort was made to first eliminate full SSN matches (although it still seems to run ok) and/or 8-digit matches before 7-digit patterns are considered. For example, the 2m*2m output from above contains 80,280 exact matches responsible for 2,890,080 7-digit matches, plus gobs of 8-digit matched (I did not count) responsible for 9*gobs output 7-digit matches. I reckon eliminating both beforehand could very well cut the real times rather considerably.
Kind regards ------------ Paul Dorfman Jax, FL ------------
<donaldjhender...@HOTMAIL.COM> wrote: >And no, this is not a Sci-Fi question ;-).
>Am asking on behalf of a colleague who is doing a project involving matching >based on Social Security numbers. The client wants to match two files using >the Social Security number and wants to detect as possible matches cases >where position by position 7 of the nine digits match. I am presuming that >is because they want to account for the possibility of transposed digits >(but don't know that for sure).
>I've come up with what I call a brute force solution that uses the sum >function in the where clause to count position by position how many >characters match. Based on the sample below, it seems to work.
>But I am wondering if any of the RegEx gurus on the list can comment on >whether this can be done with RegEx.
>TIA. My sample data and quick and dirty brute force code follows.
>data one; > input ssn $9.; >datalines; >123456789 >987654321 >121212121 >343434343 >; >data two; > input ssn $9.; >datalines; >123456789 >987645321 >121212121 >232323232 >; >proc sql; > create table matches as > select a.ssn as ssn_one, > b.ssn as ssn_two, > case > when (a.ssn ne b.ssn) then 'Not Exact' > else 'Exact' > end as MatchResults > from one a, two b > where sum(substr(a.ssn,1,1)=substr(b.ssn,1,1), > substr(a.ssn,2,1)=substr(b.ssn,2,1), > substr(a.ssn,3,1)=substr(b.ssn,3,1), > substr(a.ssn,4,1)=substr(b.ssn,4,1), > substr(a.ssn,5,1)=substr(b.ssn,5,1), > substr(a.ssn,6,1)=substr(b.ssn,6,1), > substr(a.ssn,7,1)=substr(b.ssn,7,1), > substr(a.ssn,8,1)=substr(b.ssn,8,1), > substr(a.ssn,9,1)=substr(b.ssn,9,1) > ) ge 7; >quit;
I create some realistically sized test data (3e6 in the main file, and 1e5 in the smaller file) and using Mike's suggestion of the COMPARE function recursively:
ssn2 = ssnArray(i); firstMismatch = abs(compare(ssn,ssn2)); else if firstMismatch gt 7 then do; /* last two chars don't matter */ output; continue; end; /* last two chars don't matter */ secondMismatch = abs(compare(substr(ssn,firstMismatch+1),substr(ssn2,firstMismatch+1))); if secondMismatch = 0 or firstMismatch+secondMismatch=9 then do; /* only one or two mismatches */ output; continue; end; /* only one or two mismatches */ Mismatch = abs(compare(substr(ssn,firstMismatch+secondMismatch+1),substr(ssn2,firstMis m atch+secondMisMatch+1))); if Mismatch = 0 then do; /* only two mismatches */ output; continue; end; /* only two mismatches */
Basically this incorporates some of the pattern matching you generalized (e.g., the first mismatch is in position 8/9 so we know it will match on 7 positions), etc.
This turned out to perform faster than looping and counting using the CHAR function as you suggested in your first post (which indeed was faster than comparing all the characters and faster than substr):
do _cnt = 1 to 9 while (_ne_cnt <= 2) ; if char (ssn1, _cnt) ne char (ssn2, _cnt) then _ne_cnt ++ 1 ; end;
I then augmented the short circuiting by recognizing that once I knew the third mismatch was in position X, that I could skip over all the other SSNs in the array that had the same first X positions. That cut the time by almost a factor of two. However, on an old PC it still took over 3 hours.
But I expect your code below will blow that out of the water. Am remote now with no access to that test data. Once I am able to get to that PC, I will try to adjust my code to reflect this approach and let you know what the performance numbers are. Note that in my example, I only needed to load the smaller file into an array and (as you pointed out) getting that much memory may be an issue. But I will cross that bridge when I come to it ;-).
Again, many thanks for the insights (and the code).
-----Original Message----- From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On Behalf Of Paul
Dorfman Sent: Monday, January 26, 2009 5:10 PM To: SA...@LISTSERV.UGA.EDU Subject: Re: Regex - Seven of Nine Matching
Don,
Replying to Alan yesterday, I wrote, in part: "...to deal with (unavoidable?) Cartesian product nature of the problem. I am using the question mark in the parentheses deliberately... for I have not fully proven to myself yet that no method to avoid comparing all to all exists." The reason for the doubt was a hint in Sig's first post in the thread. What I now think is that I have proven the opposite.
The key here is the matching pattern constraint in the specifications. To wit, there are 9!/7!/(9-7)!=36 7-digit positional patterns in a 9-digit string, and we have to match only pattern #1 of SSN1 to pattern #1 of SSN2, pattern #2 of SSN1 to pattern #2 of SSN2,..., and so on till pattern #36. Since thus no cross-pattern matching is required, separate patterns can be matched independently. Hence the algorithm can be devised as follows:
DO PATTERN = 1 to 36 pass through each file extract 7-digit pattern #N from SSN1 and SSN1 have it as key and SSN1 and SSN2 as satellites perform many-to-many match of result sets Output matches and append them to final data set END
At the first glance, this scheme appears idiotic, because effectively it means exploding each file 36 times, then matching by (pattern number * pattern). However, it appears less so after the realization that the number of passes through each file is limited to *only* 36 *regardless* of the file size, and after the explosion has occurred, matching on each pattern becomes time-linear, provided that a proper matching method is selected.
Aside from the speed, the concrete implementation is largely a matter of taste and experience. SQL will have no problem with many-to-many matching of 3mx4m rows by a pattern, and that can be simply iterated 36 times. Or, for each pattern, one side can be indexed and searched during a pass through the other side, collecting satellites for duplicate patterns by looping over duplicate key index values.
Below, I choose the method I think is the fastest. To avoid re-reading the files from disk, they are buffered on the first file passes into a [large enough] array, which is subsequebtly re-read 35 times. The actual matching by a pattern is done by storing the pattern ID from file DSN1 and its discriminating enumerator N as a hash key and SSN1 - as a satellite, then searching the table for each respective pattern from SSN2. The discriminator is necessary because a pattern can correspond to more than one SSN1. (In Version 9.2, multi-hash can be used to harvest different SSN1 for the same key pattern, but here I am running 9.1.3 and hence use my old - and very simple - trick of storing and harvesting hash dupes, which ought to be apparent from the code.) It is important that after each pass, the table is explicitly deleted, else memory overload can (and does) occur. (Again, in 9.2 it is not necessary - instead .clear() method should be used.) Although .delete() is an expensive method, it is done only 36 times and, practically speaking, has no adverse effect.
%let size = 1e5 ;
data dsn1 dsn2 ; length ssn $ 9 ; do _n_ = 1 to &size ; ssn = put (ceil (ranuni(1) * 1e9), z9.) ; output dsn1 ; ssn = put (ceil (ranuni(1) * 1e9), z9.) ; output dsn2 ; end ; run ;
data matches (keep = ssn: exact) ; array ss [2, 5000000] $ 9 _temporary_ ; do i = 1 to 8 ; do j = i + 1 to 9 ; dcl hash h (hashexp: 16) ; h.definekey ('id', 'n') ; h.definedata ('ssn1') ; h.definedone () ; do p = 1 to n1 ; if i = 1 and j = 2 then do ; set dsn1 (rename = (ssn = ssn1)) nobs = n1 ; ss [1, p] = ssn1 ; end ; else ssn1 = ss [1, p] ; id = ssn1 ; substr (id, i, 1) = "*" ; substr (id, j, 1) = "*" ; do n = 1 by 1 until (h.check() ne 0) ; end ; h.add() ; end ; do p = 1 to n2 ; if i = 1 and j = 2 then do ; set dsn2 (rename = (ssn = ssn2)) nobs = n2 ; ss [2, p] = ssn2 ; end ; else ssn2 = ss [2, p] ; id = ssn2 ; substr (id, i, 1) = "*" ; substr (id, j, 1) = "*" ; do n = 1 by 1 ; if h.find() ne 0 then leave ; if ssn1 = ssn2 then Exact = "Y" ; else Exact = "N" ; output ; end ; end ; h.delete() ; idn ++ 1 ; put "Progress: pattern # " idn 2.-R " processed." ; end ; end ; stop ; run ;
This 1e5*1e5 match runs in under 10 seconds on a garden variety AIX box and outputs 35,938 records, taking up 250M of RAM. What is critical, though, is how it scales up. Here are some log figures:
Even the 5m by 5m match runs under 27 minutes, which includes producing the enormous output (in the real environment likely to be infinitesimally smaller). The real time scales almost linearly as the sizes grow, "almost" probably being due to the O(log(N)), rather than O(1), nature of the AVL trees underlying SAS hash. Still, the comparison math is nowhere near the Cartesian explosion. Running 5m by 5m match using the scheme above means 36 passes over side 1, during which 36*5e6=1.8*5e7 SSNs are considered, plus, roughly speaking, 5e6*log(5e6)=3.5e7 SSN pairs are considered searching the table. Running the same via a Cartesian explosion means 1 pass through file DSN1 and 5m passes through file DSN2, during which 2.5e13 SSN pairs are considered - the difference of good 5 orders of magnitude.
Two more notes. First, on some mainframes, getting 640M of main storage may be a (pretty ridiculous) problem, although I would not expect a system programmer to fret about trading a short peak of main storage usage for a job quickly leaving the queue.
Second, in the above code no effort was made to first eliminate full SSN matches (although it still seems to run ok) and/or 8-digit matches before 7-digit patterns are considered. For example, the 2m*2m output from above contains 80,280 exact matches responsible for 2,890,080 7-digit matches, plus gobs of 8-digit matched (I did not count) responsible for 9*gobs output 7-digit matches. I reckon eliminating both beforehand could very well cut the real times rather considerably.
Kind regards ------------ Paul Dorfman Jax, FL ------------
On Thu, 22 Jan 2009 10:22:02 -0500, Don Henderson <donaldjhender...@HOTMAIL.COM> wrote:
>And no, this is not a Sci-Fi question ;-).
>Am asking on behalf of a colleague who is doing a project involving matching >based on Social Security numbers. The client wants to match two files using >the Social Security number and wants to detect as possible matches >cases where position by position 7 of the nine digits match. I am >presuming that is because they want to account for the possibility of >transposed digits (but don't know that for sure).