Account Options

  1. Sign in
The old Google Groups will be going away soon, but your browser is incompatible with the new version.
Google Groups Home
« Groups Home
Regex - Seven of Nine Matching
There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic.
There was an error processing your request. Please try again.
flag
  21 messages - Expand all  -  Translate all to Translated (View all originals)
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
Don Henderson  
View profile  
 More options Jan 22 2009, 10:22 am
Newsgroups: comp.soft-sys.sas
From: donaldjhender...@HOTMAIL.COM (Don Henderson)
Date: Thu, 22 Jan 2009 10:22:02 -0500
Local: Thurs, Jan 22 2009 10:22 am
Subject: Regex - Seven of Nine Matching
And no, this is not a Sci-Fi question ;-).

Am asking on behalf of a colleague who is doing a project involving matching
based on Social Security numbers. The client wants to match two files using
the Social Security number and wants to detect as possible matches cases
where position by position 7 of the nine digits match. I am presuming that
is because they want to account for the possibility of transposed digits
(but don't know that for sure).

I've come up with what I call a brute force solution that uses  the sum
function in the where clause to count position by position how many
characters match. Based on the sample below, it seems to work.

But I am wondering if any of the RegEx gurus on the list can comment on
whether this can be done with RegEx.

TIA. My sample data and quick and dirty brute force code follows.

data one;
 input ssn $9.;
datalines;
123456789
987654321
121212121
343434343
;
data two;
 input ssn $9.;
datalines;
123456789
987645321
121212121
232323232
;
proc sql;
 create table matches as
 select a.ssn as ssn_one,
        b.ssn as ssn_two,
        case
         when (a.ssn ne b.ssn) then 'Not Exact'
         else 'Exact'
        end as MatchResults
 from one a, two b
 where sum(substr(a.ssn,1,1)=substr(b.ssn,1,1),
           substr(a.ssn,2,1)=substr(b.ssn,2,1),
           substr(a.ssn,3,1)=substr(b.ssn,3,1),
           substr(a.ssn,4,1)=substr(b.ssn,4,1),
           substr(a.ssn,5,1)=substr(b.ssn,5,1),
           substr(a.ssn,6,1)=substr(b.ssn,6,1),
           substr(a.ssn,7,1)=substr(b.ssn,7,1),
           substr(a.ssn,8,1)=substr(b.ssn,8,1),
           substr(a.ssn,9,1)=substr(b.ssn,9,1)
          ) ge 7;
quit;

Don Henderson
Don.Hender...@hcsbi.com
http://www.hcsbi.com
(301) 570-5530 (office)
(301) 980-9027 (cell)


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Muthia Kachirayan  
View profile  
 More options Jan 22 2009, 12:50 pm
Newsgroups: comp.soft-sys.sas
From: muthia.kachira...@GMAIL.COM (Muthia Kachirayan)
Date: Thu, 22 Jan 2009 13:50:43 -0400
Local: Thurs, Jan 22 2009 12:50 pm
Subject: Re: Regex - Seven of Nine Matching
Don,

This is not using RegEx but on Key-Indexing. I presume that TWO has to be
compared with ONE and the matching SSN based on the first 7 digits of
SSN-string of ONE is wanted.

data one;
 input ssn $9.;
datalines;
123456789
987654321
121212121
343434343
;
run;
data two;
 input ssn $9.;
datalines;
123456789
987645321
121212121
232323232
;
run;
data need;
array k[9999999] _temporary_;
do until(eof);
    set one end = eof;
    one = input(substr(ssn,1,7),7.);
    k[one] = 1;
end;
do until(eof1);
    set two end = eof1;
    two = input(substr(ssn,1,7),7.);
    if k[two] then output;
end;
drop one two;
run;

Kind regards,

Muthia Kachirayan

On Thu, Jan 22, 2009 at 11:22 AM, Don Henderson <


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
karma  
View profile  
 More options Jan 22 2009, 12:08 pm
Newsgroups: comp.soft-sys.sas
From: dorjeta...@GOOGLEMAIL.COM (karma)
Date: Thu, 22 Jan 2009 17:08:49 +0000
Local: Thurs, Jan 22 2009 12:08 pm
Subject: Re: Regex - Seven of Nine Matching
Hi Don,

I don't think that this type of problem can be solved using regexp -
or at least I haven't a clue how I would go about such a solution. As
you want to seek similarity between two strings finding the edit
distance seems like a good bet.

The only problem is what edit distance algorithm to use. SAS allows
you to specify your own scores for use in edit distance calculation
with the call compcost routine, the values are then used in the
compged function, this may be useful for other unique types of
similarity calculations.

However, given your problem it seems like the complev function would
work best as the Levenshtein Edit Distance allows insertion, deletion
or substitution of a single character where the cost of each operation
is 1. Since you know the length of your strings can filter for when we
have le 3 mismatches.

data one;
 input ssn $9.;
datalines;
123456789
987654321
121212121
343434343
;
data two;
 input ssn $9.;
datalines;
123456789
987645321
121212121
232323232
;
proc sql;
 create table matches as
 select a.ssn as ssn_one,
       b.ssn as ssn_two,
       case
        when (a.ssn ne b.ssn) then 'Not Exact'
        else 'Exact'
       end as MatchResults
 from one a, two b
 where sum(substr(a.ssn,1,1)=substr(b.ssn,1,1),
          substr(a.ssn,2,1)=substr(b.ssn,2,1),
          substr(a.ssn,3,1)=substr(b.ssn,3,1),
          substr(a.ssn,4,1)=substr(b.ssn,4,1),
          substr(a.ssn,5,1)=substr(b.ssn,5,1),
          substr(a.ssn,6,1)=substr(b.ssn,6,1),
          substr(a.ssn,7,1)=substr(b.ssn,7,1),
          substr(a.ssn,8,1)=substr(b.ssn,8,1),
          substr(a.ssn,9,1)=substr(b.ssn,9,1)
         ) ge 7;
quit;

 proc sql;
 create table matches2 as
 select a.ssn as ssn_one,
       b.ssn as ssn_two,
       case
        when (a.ssn ne b.ssn) then 'Not Exact'
        else 'Exact'
       end as MatchResults
 from one a, two b
 where complev(a.ssn, b.ssn) le 3;
quit;

proc compare base=matches compare=matches2 list;
   run;

2009/1/22 Don Henderson <donaldjhender...@hotmail.com>:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
karma  
View profile  
 More options Jan 22 2009, 12:09 pm
Newsgroups: comp.soft-sys.sas
From: dorjeta...@GOOGLEMAIL.COM (karma)
Date: Thu, 22 Jan 2009 17:09:58 +0000
Local: Thurs, Jan 22 2009 12:09 pm
Subject: Re: Regex - Seven of Nine Matching
" don't think that this type of problem can be solved using regexp "

Proven wrong in zero time :(

2009/1/22 Chang Chung <chang_y_ch...@hotmail.com>:


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Chang Chung  
View profile  
 More options Jan 22 2009, 12:08 pm
Newsgroups: comp.soft-sys.sas
From: chang_y_ch...@HOTMAIL.COM (Chang Chung)
Date: Thu, 22 Jan 2009 12:08:11 -0500
Local: Thurs, Jan 22 2009 12:08 pm
Subject: Re: Regex - Seven of Nine Matching
On Thu, 22 Jan 2009 10:22:02 -0500, Don Henderson

hi, don,
here is one way. see matches2 below. sas 9.1.3. hth.
cheers,
chang

data one;
  input ssn $9.;
datalines;
123456789
987654321
121212121
343434343
;
data two;
  input ssn $9.;
datalines;
123456789
987645321
121212121
232323232
;
proc sql;
  create table matches as
  select a.ssn as ssn_one,
         b.ssn as ssn_two,
         case
         when (a.ssn ne b.ssn) then 'Not Exact'
         else 'Exact'
         end as MatchResults
  from one a, two b
  where sum(substr(a.ssn,1,1)=substr(b.ssn,1,1),
        substr(a.ssn,2,1)=substr(b.ssn,2,1),
        substr(a.ssn,3,1)=substr(b.ssn,3,1),
        substr(a.ssn,4,1)=substr(b.ssn,4,1),
        substr(a.ssn,5,1)=substr(b.ssn,5,1),
        substr(a.ssn,6,1)=substr(b.ssn,6,1),
        substr(a.ssn,7,1)=substr(b.ssn,7,1),
        substr(a.ssn,8,1)=substr(b.ssn,8,1),
        substr(a.ssn,9,1)=substr(b.ssn,9,1)
        ) ge 7;

  /* using prx */
  create table matches2 as
  select a.ssn as ssn_one
       , b.ssn as ssn_two
       , ifc(a.ssn=b.ssn,'Exact','Not Exact') as MatchResults
         length=9
  from   one a, two b
  where  countc(prxchange("s/(.)(?=(.{8}\1))/M/"
         , 9, catt(a.ssn,b.ssn)),"M")>=7;
quit;

proc compare base=matches compare=matches2;
run;
/* on lst in part:
NOTE: No unequal values were found. All values compared are exactly equal.
*/


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Don Henderson  
View profile  
 More options Jan 22 2009, 12:35 pm
Newsgroups: comp.soft-sys.sas
From: donaldjhender...@HOTMAIL.COM (Don Henderson)
Date: Thu, 22 Jan 2009 12:35:58 -0500
Local: Thurs, Jan 22 2009 12:35 pm
Subject: Re: Regex - Seven of Nine Matching
Karma and Chang,

Thanks for these suggestions. The idea of a distance function (e.g.,
COMPLEV) had not occurred to me.

I don't have access to the real data my colleague will be using (he is
working in a locked down environment as it is sensitive financial data), but
I will create some volume test data and run some experiments to compare the
performance of the three approaches and will report back.

Thanks again!
Donh


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
"Howard Schreier  
View profile  
 More options Jan 22 2009, 12:48 pm
Newsgroups: comp.soft-sys.sas
From: hs AT dc-sug DOT org ("Howard Schreier)"
Date: Thu, 22 Jan 2009 12:48:38 -0500
Local: Thurs, Jan 22 2009 12:48 pm
Subject: Re: Regex - Seven of Nine Matching
Find out if the intent is to tolerate exactly one transposition of adjacent
digits. If so, I believe you can use CALL COMPCOST to adjust the
coefficients, then use the COMPGED function to get the precise subset.

On Thu, 22 Jan 2009 12:35:58 -0500, Don Henderson


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Sigurd Hermansen  
View profile  
 More options Jan 22 2009, 1:31 pm
Newsgroups: comp.soft-sys.sas
From: HERMA...@WESTAT.COM (Sigurd Hermansen)
Date: Thu, 22 Jan 2009 13:31:26 -0500
Local: Thurs, Jan 22 2009 1:31 pm
Subject: Re: Regex - Seven of Nine Matching
Don:
I've generally used the SPEDIS() function (or expressions involving SPEDIS()) for comparing two SSN during the past decade, and I find SPEDIS() a better measure of matching than any of the "number of matching digits" measures. It also works very fast. It does a "cost of rearrangement" calculation that assigns low costs to simple transpositions and exchanges of characters. The expression I use adjusts for lengths of strings (for cases where a SSN has fewer than nine digits, such as when a person records only the last four digits) and imposes symmetry of comparison (because SPEDIS() can yield different results for comparisons of the reverses of two SSN). As a rule a SPEDIS() score of <= 0.05 indicates either a recording error in one or both of two SSN or SSN for two siblings (since persons applying at the same time for SSN may receive consecutive numbers).
S


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Mike Rhoads  
View profile  
 More options Jan 22 2009, 3:55 pm
Newsgroups: comp.soft-sys.sas
From: RHOAD...@WESTAT.COM (Mike Rhoads)
Date: Thu, 22 Jan 2009 15:55:11 -0500
Local: Thurs, Jan 22 2009 3:55 pm
Subject: Re: Regex - Seven of Nine Matching
Don,

One other possibility to think about is the COMPARE function, which simply compares two strings and returns the position of the first non-matching character (or 0 for an exact match).

So, my approach would be:

1.  Call COMPARE once to try to find the first difference.
2.  If another difference is possible (there was a difference prior to the last character in the string), call it a second time for the remaining (rightmost) portion of the string.
3.  Similar to 2, call it a 3rd time if necessary.

If you find a 3rd difference, you don't have a match by your 7-of-9 rule.  Otherwise you do.

Note that you need to use ABS with COMPARE, since it may return either a positive or negative result (depending on which of the two strings has the "greater" value).

Good luck!

Mike Rhoads
Rhoad...@Westat.com


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Don Henderson  
View profile  
 More options Jan 22 2009, 2:59 pm
Newsgroups: comp.soft-sys.sas
From: donaldjhender...@HOTMAIL.COM (Don Henderson)
Date: Thu, 22 Jan 2009 14:59:21 -0500
Local: Thurs, Jan 22 2009 2:59 pm
Subject: Re: Regex - Seven of Nine Matching
Sig, Muthia and Howard,

Thanks for the alternative suggestions.

However, the requirement is 7 of the 9 positions match. And I believe that
this is mandated by some regulatory rule. So alternative matching
algorithms, even if they might be "better" in an abstract sense are not an
option. It was just speculation on my part as to the reason (transposition)
for 7 of the 9 match; it was not part of the requirement.

Given the specifics of the requirements, I am curious as to the performance
issues and so I will compare the RegEx, complev and brute force method. The
online doc says that COMPLEV is faster than COMPGED which is faster than
SPEDIS for a comparison that COMPLEV can do.

Thanks again for the suggestions however.

Regards,
donh


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Paul Dorfman  
View profile  
 More options Jan 22 2009, 5:19 pm
Newsgroups: comp.soft-sys.sas
From: sash...@BELLSOUTH.NET (Paul Dorfman)
Date: Thu, 22 Jan 2009 17:19:07 -0500
Local: Thurs, Jan 22 2009 5:19 pm
Subject: Re: Regex - Seven of Nine Matching
Don,

If it is performance you are primarily after (which would be logical to
seek trying to process a Cartesian product explosion) and you still want to
remain confined to SQL, I would suggest you change from substr(x,y,1) to
char(x,y). Since char() returns a single byte, that avoids gobs of the idle
blank-to-blank comparisons.

This simple change results in saving almost half the time in my tests (with
9.1.3 under AIX), which I have done on your own sample data with one record
added to each file, then exploded 1000 times, so that 5000 "ssn" numbers
are compared to 5000. For example, you SQL step with SUBSTR() ran 130
seconds, while the same step using CHAR() ran 77 seconds.

If you/your friend are not confined to SQL, I would rather use the DATA
step for the simple reason that its logic (absent from SQL) allows to
short-circuit the process of comparing a ssn1*ssn2 pair any time the number
of *non-matching* digits exceeds 2, and even before that never consider a
comparison if the strings are equal. For example,

data matches (drop = _:) ;
   set one (rename = (ssn = ssn1)) ;
   do p = 1 to n ;
      set two (rename = (ssn = ssn2)) point = p nobs = n ;
      _ne_cnt = 0 ;
      if ssn1 ne ssn2 then do ;
         ExactMatch = "N" ;
         do _cnt = 1 to 9 while (_ne_cnt <= 2) ;
            if char (ssn1, _cnt) ne char (ssn2, _cnt) then _ne_cnt ++ 1 ;
         end ;
         if _cnt <= 9 then continue ;
      end ;
      else ExactMatch = "Y" ;
      if _ne_cnt <= 2 then output ;
   end ;
run ;

cuts the SQL processing time down to 47 seconds. Eschewing the re-reading
of the second data set by placing it into an array beforehand,

data matches3 (drop = _:) ;
   array s [99999] $ 9 _temporary_ ;
   if _n_ = 1 then do _n_ = 1 to n ;
      set two nobs = n;
      s [_n_] = ssn ;
   end ;
   set one (rename = (ssn = ssn1)) ;
   do _n_ = 1 to n ;
      ssn2 = s[_n_] ;
      _ne_cnt = 0 ;
      if ssn1 ne ssn2 then do ;
         ExactMatch = "N" ;
         do _cnt = 1 to 9 while (_ne_cnt <= 2) ;
            if char (ssn1, _cnt) ne char (ssn2, _cnt) then _ne_cnt ++ 1 ;
         end ;
         if _cnt <= 9 then continue ;
      end ;
      else ExactMatch = "Y" ;
      if _ne_cnt <= 2 then output ;
   end ;
run ;

reduces the run time further down to 32 seconds, meaning that a rather
simple DATA step logic allows increasing performance four-fold from the
original point.

I rather like Karma's COMPLEV() approach for its elegance, and it greatly
simplifies the code, but unfortunately it is quite computationally
intensive in addition to the obvious fact that it has to consider all the
character in both strings being compared unconditionally. I have not tested
regexen, but am pretty much sure they would fair no better for the same
reason. Methinks any method failing to cut on the number of byte-to-byte
comparisons is not going to do better than your "brute force". The only
reason all the "improvements" suggested above do work is that they,
including char() vs substr(), simply let SAS avoid extra work. For the same
reason, Mike Rhoads' suggestions are very likely to improve run times as
well. Given the simplistic nature of any Lazy Susan logic, it is
effectively a brute force, only applied in a different direction.

The performance difference between 30 and 130 seconds is still not a big
deal for one willing to wait an extra minute, but that may very well change
if instead of 5000*5000 SSNs to compare, each file were mere 10 times
longer. That would increase the processing time 100 fold, and we would be
talking about the run time from 4 hours to 1.

Kind regards
------------
Paul Dorfman
Jax, FL
------------

On Thu, 22 Jan 2009 14:59:21 -0500, Don Henderson

...

read more »


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Sigurd Hermansen  
View profile  
 More options Jan 22 2009, 6:34 pm
Newsgroups: comp.soft-sys.sas
From: HERMA...@WESTAT.COM (Sigurd Hermansen)
Date: Thu, 22 Jan 2009 18:34:14 -0500
Local: Thurs, Jan 22 2009 6:34 pm
Subject: Re: Regex - Seven of Nine Matching
The crush of actual work has kept me from coming out to play on the 'L as much as I would like. (Don't you hate that!) Paul's solution has both inspired me to take a few minutes off to think about the problem from a different angle and come up with a solution.

As a first thought, I would conclude that a solution in the c or c++ language would work far faster than a SAS solution simply because c treats strings as character arrays and iterating through a series of character arrays should be an order faster than substringing strings. Paul's char() method helps reduce that advantage.

I next wondered where I would hide if I were a more efficient database programming solution. I immediately thought of an efficient database operation (a join) and a way to implement one (transposing character strings).

This SQL solution requires a couple of scans to set up implicit array indexes. It also minimizes the data values being carried from data sources to the solution. It won't perform O(n), but it will scale up to at least tens of thousands of SSN. For really large-scale problems, one of those hyperactive hash object solutions could approach O(n). I'll leave that to Paul, Greg, Chang, or one of the other uberhashers.

data one;
 input ssn $9.;
datalines;
123456789
987654321
121212121
343434343
;
data two;
 input ssn $9.;
datalines;
123456789
987645321
121212121
232323232
;
data one1;
   do until(EOF);
      i1+1;
      set one end=EOF;
             do d1=1 to 9;
                    v1=char(ssn,d1);
                        output;
                 end;
   end;
run;
data two2;
   do until(EOF);
      i2+1;
      set two  end=EOF;
             do d2=1 to 9;
                    v2=char(ssn,d2);
                        output;
                 end;
   end;
run;
proc sql;
   create table matches as
   select i1,i2,sum(m) as count
   from (select i1,i2,d1,d2,(v1=v2) as m
         from one1 as t1,two2 as t2
                 where t1.d1=t2.d2
                 )
   group by i1,i2 having sum(m)>=7
   ;
quit;

...

read more »


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Arthur Tabachneck  
View profile  
 More options Jan 22 2009, 7:25 pm
Newsgroups: comp.soft-sys.sas
From: art...@NETSCAPE.NET (Arthur Tabachneck)
Date: Thu, 22 Jan 2009 19:25:25 -0500
Local: Thurs, Jan 22 2009 7:25 pm
Subject: Re: Regex - Seven of Nine Matching
Don,

I haven't been able to tell if the requirement was simply to check each
record against its counterpart, or against all of the other records.

If its the former, wouldn't a simple data step sum be the quickest and
easiest?  For example:

data matches;
  set one;
  set two (rename=(ssn=ssn2));
  matches=0;
  do i=1 to 9;
    matches=matches+(substr(ssn,i,1) eq substr(ssn2,i,1));
  end;
  if matches gt 6 then output;
run;

Art
-------
On Thu, 22 Jan 2009 12:35:58 -0500, Don Henderson


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Don Henderson  
View profile  
 More options Jan 22 2009, 8:56 pm
Newsgroups: comp.soft-sys.sas
From: donaldjhender...@HOTMAIL.COM (Don Henderson)
Date: Thu, 22 Jan 2009 20:56:49 -0500
Local: Thurs, Jan 22 2009 8:56 pm
Subject: Re: Regex - Seven of Nine Matching
Sig, Paul, Art,

Thanks again for the insights.

And permit me to acknowledge that I probably should have asked a few more
questions before posting this question as a favor to my colleague.

First, let me answer Paul's question that yes performance is the number one
criteria. My brute force approach using a cartesian product was simply a
first cut at testing the viability of the idea. But there is nothing about
the problem or the client that requires SQL; so a data step solution is an
option. And I had thought about an approach using SET with the POINT as Paul
showed in his example but figured I would defer that until later.

Now to provide some additional details about the data and what is being
looked for. I can't say too much about them other than to say:

- the first file is expected to contain upwards of 3M distinct SSNs
- the second one could contain anywhere from 3M to 5M rows
- it is expected that most rows could/should have an exact match, but there
could be SSNs in the first file that are not in the second, and vice versa.
- the expected matching is that there is a (0 or 1)-M relationship between
the two files where M could be 0, 1, 2, etc.
- what is still unknown (my colleague needs to get clarification from the
client) is whether once an SSN in the second file matches exactly to an SSN
in the first file is it excluded from the 7/9 matching

Clarification of the latter point is a key item before further investigation
as it will dramatically impact the number of potential matches. While I
don't want to even think about the real combinations and permutations, if we
had all possible 9 digit numbers (clearly not the case, but makes this rough
approximation easier to think about), then by just looking at as single
position, we could have (I think) 81 possible matches (9 other potential
values for each of the nine positions) for each row in the first file.
Factor in a second position and I think we can all see where this is going.

As of now I think I am going to assume that if an exact match is found for
an SSN in the second file that it can be excluded from further analysis. The
business logic behind the 1-M matching suggests that this might be a
reasonable assumption.

So again, thx for all the great ideas. As I said earlier, once I get a
better handle on the scope of the data and refined requirements for how
exact matches are handled, I will evaluate the alternatives suggested and
report back.

Thanks again,
-donh

...

read more »


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Sigurd Hermansen  
View profile  
 More options Jan 22 2009, 9:39 pm
Newsgroups: comp.soft-sys.sas
From: HERMA...@WESTAT.COM (Sigurd Hermansen)
Date: Thu, 22 Jan 2009 21:39:03 -0500
Local: Thurs, Jan 22 2009 9:39 pm
Subject: Re: Regex - Seven of Nine Matching
Don:
Generally fuzzy matching of SSN or other identifiers turns out to be a Cartesian product problem; if one has an ID that can be used to join two datasets, one doesn't need a SSN to link records for the same person in two datasets.

A 7 of 9 digit SSN match rule won't identify matching records perfectly, but may do well enough. Expansion of the number of rows won't make much difference so long as the datasets used in comparisons remain very narrow. 1-M matching will only expand the number of potential matches. Watch out for M-M matching, though, because the yield of a join will explode. Many linkage problems have duplicates in data and inadvertent M-M matches.

I've worked with Paul D. on similar problems for the last ten years. I'd look to him for technical expertise. Any neophyte in the field of fuzzy or probabilistic record linkage has a steep learning curve to climb. The literature goes back more than thirty years.
S

...

read more »


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Don Henderson  
View profile  
 More options Jan 24 2009, 5:21 pm
Newsgroups: comp.soft-sys.sas
From: donaldjhender...@HOTMAIL.COM (Don Henderson)
Date: Sat, 24 Jan 2009 17:21:06 -0500
Local: Sat, Jan 24 2009 5:21 pm
Subject: Re: Regex - Seven of Nine Matching
Thanks Sig (et al),

As I said originally, the specs from the client were very specific - 7 out
of the 9 positions match. I don't know the source of this and we've been
unable (so far) to engage in conversations regarding other matching
criteria. So for now we need to go with this scheme (but I do agree that
other schemes/approaches are likely to be better in a number of ways).

Now to an interim progress report on what I've discovered so far.

First, it seems that the complev function won't do this sort of matching.
But it does pick up other potential matches that probably should be of
interest (but, sigh, are not part of the specs). Consider the following:

890  data complev;
891   insertions = complev('123456789',
892                        '012345678');
893   put insertions=;
894  run;

insertions=2

Clearly all nine digits are different. But since complev counts
insertions/deletions needed to make the strings match, it won't do the 7/9
matching.

And before I proceed, let me say that the data volumes are pretty big. The
first file is expected to be upwards of 3M rows and the second file on the
order of 4M rows. Given that volume and the cartesian product nature of the
comparisons (whether in SQL or the data step/array approach), that is simply
way too much data. So we have been able to convince the client to drop from
the second file any SSN that matches exactly to the first file. We expect
that will reduce the second file to around 100K rows.

That is still too much (I expect) for a SQL solution, so I decided to go
down the data step path. I have lots more work/analysis to do as this needs
to be as optimal as possible. Eventually this will be running on a MF, but
for now, I have to use my PC and randomly generated data (not ideal, but I
don't have access to the real data). And yes, I know that real data is
likely to have patterns that might alter the comparisons.

Using a variation of the approach suggested by Paul and Mike, here is what I
have found so far for a first file of 1M rows and a second file of 33K
(roughly one third the size of the real files).

- I first ran a simple data step that loaded the second file into an array
of size 33K. That data step then ready the 1M rows and for each row iterated
thru the array. It did nothing but iterate. This was just to give me a
benchmark of the overhead to handle the data volumes.

This step took roughly 1 hour on an (admittedly) underpowered PC.

- Next I ran a compare that used the compare logic suggested by Mike to
compare the strings.

This step took roughly 4 hours.

- Next, I looked at adding some logic to do more short-circuiting. Basically
what I did was build pointers for the second file so that once I knew there
were 3 differences in the first X characters, I would skip to next value
that had a different value in the first X characters.

That dramatically reduced the time - to around 40 minutes - less that the
baseline where no work was done other than looping thru all the
combinations.

Have lots more analysis to do and more short-circuiting criteria that I can
probably add. But I wanted to thank everyone now for their input by posting
this "progress" report.

Once I've done some more work on this, I will summarize it by writing up an
article on sasCommunity.org ;-).

Again, thx to everyone for their input/suggestions.

-donh

...

read more »


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Savian  
View profile  
 More options Jan 25 2009, 1:34 am
Newsgroups: comp.soft-sys.sas
From: Savian <savian....@gmail.com>
Date: Sat, 24 Jan 2009 22:34:25 -0800 (PST)
Local: Sun, Jan 25 2009 1:34 am
Subject: Re: Regex - Seven of Nine Matching
On Jan 24, 3:21 pm, donaldjhender...@HOTMAIL.COM (Don Henderson)
wrote:

...

read more »


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Paul Dorfman  
View profile  
 More options Jan 25 2009, 2:20 pm
Newsgroups: comp.soft-sys.sas
From: sash...@BELLSOUTH.NET (Paul Dorfman)
Date: Sun, 25 Jan 2009 19:20:21 +0000
Local: Sun, Jan 25 2009 2:20 pm
Subject: Re: Regex - Seven of Nine Matching
Alan,

But why... a Levenshtein implementation is already available via complev()
function, and if that is not enough, there is always compged() augmented with
call compcost() giving a much wider latitude in distance editing. Regardless,
it appears that any of these are way too expensive for Don's purposes,
forthey invariably have to examine all 9 characters, and then perform rather
heavy computations. Comparing the corresponding digits and applying as much
logical short-circuiting as much as possible (including that dictated by
intimate data knowledge) is better equipped to deal with (unavoidable?)
Cartesian product nature of the problem. I am using the question mark in
the parentheses deliberately... for I have not fully proven to myself yet that
no method to avoid comparing all to all exists.

Kind regards
------------
Paul Dorfman
Jax, FL
------------


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Savian  
View profile  
 More options Jan 25 2009, 5:16 pm
Newsgroups: comp.soft-sys.sas
From: Savian <savian....@gmail.com>
Date: Sun, 25 Jan 2009 14:16:17 -0800 (PST)
Local: Sun, Jan 25 2009 5:16 pm
Subject: Re: Regex - Seven of Nine Matching
On Jan 25, 12:20 pm, sash...@BELLSOUTH.NET (Paul Dorfman) wrote:

Paul,

I wasn't aware that it was already in SAS.

Don just learned enough C# to get him started so I will let him assess
whether this is a fit that he is comfortable with. I think he should
compare the SAS vs the C# approach but it depends on his client, O/S,
circumstances, etc.

I was actually trying to research a regex approach to it and came
across the algorithm. It's cool that SAS already has it built in.

Alan


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Paul Dorfman  
View profile  
 More options Jan 26 2009, 5:10 pm
Newsgroups: comp.soft-sys.sas
From: sash...@BELLSOUTH.NET (Paul Dorfman)
Date: Mon, 26 Jan 2009 17:10:11 -0500
Local: Mon, Jan 26 2009 5:10 pm
Subject: Re: Regex - Seven of Nine Matching
Don,

Replying to Alan yesterday, I wrote, in part: "...to deal with
(unavoidable?) Cartesian product nature of the problem. I am using the
question mark in the parentheses deliberately... for I have not fully
proven to myself yet that no method to avoid comparing all to all exists."
The reason for the doubt was a hint in Sig's first post in the thread.
What I now think is that I have proven the opposite.

The key here is the matching pattern constraint in the specifications. To
wit, there are 9!/7!/(9-7)!=36 7-digit positional patterns in a 9-digit
string, and we have to match only pattern #1 of SSN1 to pattern #1 of SSN2, pattern #2 of SSN1 to pattern #2 of SSN2,..., and so on till pattern #36.
Since thus no cross-pattern matching is required, separate patterns can be
matched independently. Hence the algorithm can be devised as follows:

DO PATTERN = 1 to 36
pass through each file
extract 7-digit pattern #N from SSN1 and SSN1
have it as key and SSN1 and SSN2 as satellites
perform many-to-many match of result sets
Output matches and append them to final data set
END

At the first glance, this scheme appears idiotic, because effectively it
means exploding each file 36 times, then matching by (pattern number *
pattern). However, it appears less so after the realization that the number
of passes through each file is limited to *only* 36 *regardless* of the
file size, and after the explosion has occurred, matching on each pattern
becomes time-linear, provided that a proper matching method is selected.

Aside from the speed, the concrete implementation is largely a matter of
taste and experience. SQL will have no problem with many-to-many matching
of 3mx4m rows by a pattern, and that can be simply iterated 36 times. Or,
for each pattern, one side can be indexed and searched during a pass
through the other side, collecting satellites for duplicate patterns by
looping over duplicate key index values.

Below, I choose the method I think is the fastest. To avoid re-reading the
files from disk, they are buffered on the first file passes into a [large
enough] array, which is subsequebtly re-read 35 times. The actual matching
by a pattern is done by storing the pattern ID from file DSN1 and its
discriminating enumerator N as a hash key and SSN1 - as a satellite, then
searching the table for each respective pattern from SSN2. The
discriminator is necessary because a pattern can correspond to more than
one SSN1. (In Version 9.2, multi-hash can be used to harvest different SSN1
for the same key pattern, but here I am running 9.1.3 and hence use my old
- and very simple - trick of storing and harvesting hash dupes, which ought
to be apparent from the code.) It is important that after each pass, the
table is explicitly deleted, else memory overload can (and does) occur.
(Again, in 9.2 it is not necessary - instead .clear() method should be
used.) Although .delete() is an expensive method, it is done only 36 times
and, practically speaking, has no adverse effect.

%let size = 1e5 ;

data dsn1 dsn2 ;
   length ssn $ 9 ;
   do _n_ = 1 to &size ;
      ssn = put (ceil (ranuni(1) * 1e9), z9.) ; output dsn1 ;
      ssn = put (ceil (ranuni(1) * 1e9), z9.) ; output dsn2 ;
   end ;
run ;

data matches (keep = ssn: exact) ;
   array ss [2, 5000000] $ 9 _temporary_ ;
   do i = 1 to 8 ;
      do j = i + 1 to 9 ;
         dcl hash   h (hashexp: 16) ;
         h.definekey  ('id', 'n') ;
         h.definedata ('ssn1') ;
         h.definedone () ;
         do p = 1 to n1 ;
            if i = 1 and j = 2 then do ;
               set dsn1 (rename = (ssn = ssn1)) nobs = n1 ;
               ss [1, p] = ssn1 ;
            end ;
            else ssn1 = ss [1, p] ;
            id = ssn1 ;
            substr (id, i, 1) = "*" ;
            substr (id, j, 1) = "*" ;
            do n = 1 by 1 until (h.check() ne 0) ;
            end ;
            h.add() ;
         end ;
         do p = 1 to n2 ;
            if i = 1 and j = 2 then do ;
               set dsn2 (rename = (ssn = ssn2)) nobs = n2 ;
               ss [2, p] = ssn2 ;
            end ;
            else ssn2 = ss [2, p] ;
            id = ssn2 ;
            substr (id, i, 1) = "*" ;
            substr (id, j, 1) = "*" ;
            do n = 1 by 1 ;
               if h.find() ne 0 then leave ;
               if ssn1 = ssn2 then Exact = "Y" ;
               else                Exact = "N" ;
               output ;
            end ;
         end ;
         h.delete() ;
         idn ++ 1 ;
         put "Progress: pattern # " idn 2.-R " processed." ;
      end ;
   end ;
   stop ;
run ;

This 1e5*1e5 match runs in under 10 seconds on a garden variety AIX box and
outputs 35,938 records, taking up 250M of RAM. What is critical, though, is
how it scales up. Here are some log figures:

Size, obs | Real time, sec | Memory usage, Mb | Output obs
----------+----------------+------------------+-----------
1e5*1e5   |    10          |     250          |     35,938
----------+----------------+------------------+-----------
5e5*5e5   |    65          |     280          |    897,950
----------+----------------+------------------+-----------
1e6*1e6   |   170          |     320          |  3,582,144
----------+----------------+------------------+-----------
2e6*2e6   |   440          |     400          | 14,331,078
----------+----------------+------------------+-----------
5e6*5e6   |  1620          |     640          | 89,596,633

Even the 5m by 5m match runs under 27 minutes, which includes producing the
enormous output (in the real environment likely to be infinitesimally
smaller). The real time scales almost linearly as the sizes grow, "almost"
probably being due to the O(log(N)), rather than O(1), nature of the AVL
trees underlying SAS hash. Still, the comparison math is nowhere near the
Cartesian explosion. Running 5m by 5m match using the scheme above means 36
passes over side 1, during which 36*5e6=1.8*5e7 SSNs are considered, plus,
roughly speaking, 5e6*log(5e6)=3.5e7 SSN pairs are considered searching the
table. Running the same via a Cartesian explosion means 1 pass through file
DSN1 and 5m passes through file DSN2, during which 2.5e13 SSN pairs are
considered - the difference of good 5 orders of magnitude.

Two more notes. First, on some mainframes, getting 640M of main storage may
be a (pretty ridiculous) problem, although I would not expect a system
programmer to fret about trading a short peak of main storage usage for a
job quickly leaving the queue.

Second, in the above code no effort was made to first eliminate full SSN
matches (although it still seems to run ok) and/or 8-digit matches before
7-digit patterns are considered. For example, the 2m*2m output from above
contains 80,280 exact matches responsible for 2,890,080 7-digit matches,
plus gobs of 8-digit matched (I did not count) responsible for 9*gobs
output 7-digit matches. I reckon eliminating both beforehand could very
well cut the real times rather considerably.

Kind regards
------------
Paul Dorfman
Jax, FL
------------

On Thu, 22 Jan 2009 10:22:02 -0500, Don Henderson


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Don Henderson  
View profile  
 More options Jan 26 2009, 6:00 pm
Newsgroups: comp.soft-sys.sas
From: donaldjhender...@HOTMAIL.COM (Don Henderson)
Date: Mon, 26 Jan 2009 18:00:41 -0500
Local: Mon, Jan 26 2009 6:00 pm
Subject: Re: Regex - Seven of Nine Matching
Paul,

All I can say is WOW. And thanks.

I create some realistically sized test data (3e6 in the main file, and 1e5
in the smaller file) and using Mike's suggestion of the COMPARE function
recursively:

ssn2 = ssnArray(i);
firstMismatch = abs(compare(ssn,ssn2));
else if firstMismatch gt 7 then
do;  /* last two chars don't matter */
   output;
   continue;
end; /* last two chars don't matter */
secondMismatch =
abs(compare(substr(ssn,firstMismatch+1),substr(ssn2,firstMismatch+1)));
if secondMismatch = 0 or firstMismatch+secondMismatch=9 then
do;  /* only one or two mismatches */
   output;
   continue;
end; /* only one or two mismatches */
Mismatch =
abs(compare(substr(ssn,firstMismatch+secondMismatch+1),substr(ssn2,firstMis m
atch+secondMisMatch+1)));
if Mismatch = 0 then
do;  /* only two mismatches */
   output;
   continue;
end; /* only two mismatches */

Basically this incorporates some of the pattern matching you generalized
(e.g., the first mismatch is in position 8/9 so we know it will match on 7
positions), etc.

This turned out to perform faster than looping and counting using the CHAR
function as you suggested in your first post (which indeed was faster than
comparing all the characters and faster than substr):

do _cnt = 1 to 9 while (_ne_cnt <= 2) ;
   if char (ssn1, _cnt) ne char (ssn2, _cnt) then _ne_cnt ++ 1 ;
end;

I then augmented the short circuiting by recognizing that once I knew the
third mismatch was in position X, that I could skip over all the other SSNs
in the array that had the same first X positions. That cut the time by
almost a factor of two. However, on an old PC it still took over 3 hours.

But I expect your code below will blow that out of the water. Am remote now
with no access to that test data. Once I am able to get to that PC, I will
try to adjust my code to reflect this approach and let you know what the
performance numbers are. Note that in my example, I only needed to load the
smaller file into an array and (as you pointed out) getting that much memory
may be an issue. But I will cross that bridge when I come to it ;-).

Again, many thanks for the insights (and the code).

...

read more »


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
End of messages
« Back to Discussions « Newer topic     Older topic »