Extracting zip code from a string variable

Kessie

Sep 30, 2011, 14:54:50
Hi everyone,

I am looking to extract a zip code from a string variable that
contains both an address and a zip code. The SUBSTR function would not
work, because the zip codes do not appear at a fixed position (e.g. at
the end of the string), nor do they come in a single format (XXXXX-XXXX
or XXXXX). Stata has a very neat command that lets you identify a
certain number of consecutive digits and extract them as a separate
variable. Is there anything like that in SPSS? Please help! :)

Kessie

David

Sep 30, 2011, 16:50:36
Kessie,
Please post several examples of the field you need to parse
(particularly instances which are the more troublesome) in addition to
examples which are fairly representative of the normal case.
Are there *ANY* consistent attributes manifested which one can
leverage to extract this? Is it *ALWAYS* the last sequence of
*NUMERIC* values? From what you state there may be a hyphenated
version. One idea is to loop from the end of the string to the
beginning and collect and concatenate any numerics until you hit a non-
numeric. OTOH very difficult to say without seeing some actual
examples of the dragon you wish to slay! I cannot recall ever seeing
addresses where the ZIP isn't at or near the end.
Please clarify. How large is the file? It may be less of a pain to
edit by hand.
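
For illustration, the "loop from the end of the string" idea could be sketched as follows. This is a Python sketch rather than SPSS syntax; the function name `trailing_zip` is hypothetical, the second and third sample addresses are made up or shortened from ones in this thread, and treating "-" as part of the number is an assumption made so that the hyphenated form survives.

```python
def trailing_zip(address):
    """Return the run of digits/hyphens at the very end of the string, or ""."""
    allowed = "0123456789-"
    collected = []
    # Walk backwards from the last non-blank character.
    for ch in reversed(address.rstrip()):
        if ch not in allowed:
            break  # the first non-numeric character ends the scan
        collected.append(ch)
    # Characters were gathered right to left, so restore the original order.
    return "".join(reversed(collected))


print(trailing_zip("John Smith, 1234 white road. somewhere USA, 10334"))  # 10334
print(trailing_zip("23 x street, 12345-6789"))                            # 12345-6789
print(repr(trailing_zip("Another addie, 10556 Colorado Springs")))        # ''
```

The last call returns an empty string, which shows why this only works when the ZIP really is the final token.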
HTH, David

David

Sep 30, 2011, 19:06:16
Here is some SPSS syntax which will extract all elements within a
string containing contiguous numbers or "-" and throw them into a
VECTOR. From there it should be easy enough to loop through the
vector elements and figure out what's what based on other logic.
Leaving that as an issue for you to resolve. There is probably a
neater way to do this without the repetitive code, but I got tired of
F'n around with the logic to lose three lines. Besides it's beer
time!!!
HTH, David

--
DATA LIST / ADDRESS 1-200 (A).
begin data
John Smith, 1234 white road. somewhere USA, 10334
Another addie , 123456 xx ave , 10556 Colorado Springs 303-567-2233
end data.

* Collect each run of digits and hyphens into the next element of ZIPS.
VECTOR ZIPS(4,A12).
STRING #ZTMP (A12).
STRING #SCHAR(A1).
COMPUTE #ZINDEX=1.
LOOP #SCOUNT=1 TO LENGTH(RTRIM(ADDRESS)).
+  COMPUTE #SCHAR=SUBSTR(ADDRESS,#SCOUNT,1).
+  DO IF INDEX( #SCHAR,"0123456789-",1) GT 0.
+    COMPUTE #INNUM=1.
+    COMPUTE #ZTMP=CONCAT(RTRIM(#ZTMP),#SCHAR).
+    DO IF #SCOUNT=LENGTH(RTRIM(ADDRESS)).
+      COMPUTE ZIPS(#ZINDEX)=#ZTMP.
+      COMPUTE #ZTMP="".
+      COMPUTE #INNUM=0.
+    END IF.
+  ELSE.
+    DO IF #INNUM.
+      COMPUTE ZIPS(#ZINDEX)=#ZTMP.
+      COMPUTE #ZTMP="".
+      COMPUTE #INNUM=0.
+      COMPUTE #ZINDEX=#ZINDEX+1.
+    END IF.
+  END IF.
END LOOP.
LIST ZIPS1 TO ZIPS4.

ZIPS1        ZIPS2        ZIPS3        ZIPS4

1234         10334
123456       10556        303-567-2233


Number of cases read: 2 Number of cases listed: 2
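
The "other logic" left open above might look something like the following. This is only a sketch, written in Python rather than SPSS syntax; the names `numeric_runs`, `zip_candidates`, and `ZIP_SHAPE` are hypothetical, and the rule "keep only runs shaped like XXXXX or XXXXX-XXXX" is an assumption about what counts as a zip code.

```python
import re

# Assumed shape of a zip code: five digits, optionally followed by -XXXX.
ZIP_SHAPE = re.compile(r"[0-9]{5}(-[0-9]{4})?")


def numeric_runs(address):
    """Return every run of digits and hyphens, like the ZIPS vector."""
    return re.findall(r"[0-9-]+", address)


def zip_candidates(address):
    """Keep only the runs that match the assumed zip code shape in full."""
    return [run for run in numeric_runs(address) if ZIP_SHAPE.fullmatch(run)]


addresses = [
    "John Smith, 1234 white road. somewhere USA, 10334",
    "Another addie , 123456 xx ave , 10556 Colorado Springs 303-567-2233",
]
for address in addresses:
    print(numeric_runs(address), zip_candidates(address))
# ['1234', '10334'] ['10334']
# ['123456', '10556', '303-567-2233'] ['10556']
```

A five-digit street number would still pass this filter, so some positional rule (for example, preferring the last candidate) may be needed on real data.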

Jon Peck

Oct 1, 2011, 09:01:02
Here is a simple way to do this using pattern recognition. It requires the Python Extensions and the SPSSINC TRANS extension command from the SPSS Community website (www.ibm.com/developerworks/spssdevcentral).

Some test data (the last line is quoted because DATA LIST FREE treats blanks and commas as delimiters):
data list free /address(a20).
begin data.
abc12345xyz
abc12345-6789zzz
"23 x street, 12345"
end data.

* pattern matching function.
begin program.
import re
def find(x):
    return re.findall(r"\d{4,5}", x)  # this line must be indented
end program.

* command to drive all this.
spssinc trans result = fivedigit fourdigit type=5
/formula find(address).

This command looks for runs of either 5 or 4 digits and returns two string variables of length 5, fivedigit and fourdigit, holding whatever was found (zero, one, or two matches). The length restriction provides some, though incomplete, protection from street numbers. The command is designed to fail if it finds more than two matching patterns.
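
To see what the pattern returns outside SPSS, it can be run in plain Python. The sketch below compares it with a stricter alternative that is not part of the original suggestion: `STRICT` is an assumed pattern that requires exactly five digits, with an optional -XXXX suffix, not touching any other digit. The last two sample strings are hypothetical.

```python
import re

LOOSE = re.compile(r"\d{4,5}")  # the pattern used in find() above
# Assumed alternative: exactly five digits, optional -XXXX, no adjacent digits.
STRICT = re.compile(r"(?<![0-9])[0-9]{5}(?:-[0-9]{4})?(?![0-9])")

samples = [
    "abc12345xyz",
    "abc12345-6789zzz",
    "23 x street, 12345",
    "1234 white road, 10334",  # hypothetical: 4-digit street number
    "123456 xx ave, 10556",    # hypothetical: 6-digit street number
]
for s in samples:
    print(s, LOOSE.findall(s), STRICT.findall(s))
# abc12345xyz ['12345'] ['12345']
# abc12345-6789zzz ['12345', '6789'] ['12345-6789']
# 23 x street, 12345 ['12345'] ['12345']
# 1234 white road, 10334 ['1234', '10334'] ['10334']
# 123456 xx ave, 10556 ['12345', '10556'] ['10556']
```

Note that the stricter pattern returns a ZIP+4 as a single string of up to 10 characters, so the result variable would need to be longer than 5 if it were used with SPSSINC TRANS.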

HTH,
Jon Peck
