We have a report in HTML format that was generated by a public safety
agency. Regrettably, the format of the data prevents us from
aggregating records or conducting other descriptive analyses. The
following is an example of the way the data are generated by the
agency – they are unable to provide the data in any other format. I
should note that each record starts with the “—ACKNOWLEDGMENT –“ and
ends with “—END – “.
-- ACKNOWLEDGEMENT--
ENTER ACCEPTED AS FOLLOWING RECORD
LOST GUN
SERIAL NO: TBP12345 LOSS DATE:
11/27/2009
MAKE: SAT ENTRY DATE:
11/27/2009
MODEL: TP123 PCN:
A123456789
CALIBER: 9 NIC:
A123456789
TYPE: PI
CASE NO: 08-1234567
ENTERING MNE: A12345678
ENTERING AGY: AL12345678K7 – WINNIPEG COUNTY SHERIFF'S OFFICE
NOTIFY AGY: NO NOTIFY/PUBLICLY AVAILABLE
MISC: 2009-11WPG12345 TAURUS 3" BBL TITANIUM - FULLY LOADED
--END--
I was wondering if anyone could suggest syntax that would allow us to
import the data in tabular format that contains the following
variables. The desired database would include the following variables
that correspond to the above reported variables.
RTYPE DATEREP ENTRYD PCN NIC SERIALNO
LOST GUN 27-NOV-2009 27-NOV-2009 A123456789 A123456789 TBP12345
VARS continued….
MAKE MODEL CALIBRE TYPE CASENO ENTERINGMNE
SAT TP123 9 PI 08-1234567 A12345678
VARS continued….
ENTERINGAGY NOTIFYAGY
AL12345678K7-WINNIPEG COUNTY SHERIFF’S OFFICE NO NOTIFY/PUBLIC
AVAILABLE\
VARS continued….
MISC
2009-11WPG12345 TAURUS 3" BBL TITANIUM - FULLY LOADED
I don't have time to work on it right now. But here is a suggestion.
If you read in the data as one long string variable,
1. You can use SELECT IF to get rid of any unneeded rows.
2. You can use the CHAR.INDEX function to test for the presence of
various sub-strings on that row (e.g., to find if "MAKE" or "MODEL"
appears on the line).
3. You can use the SUBSTR function to pull out substrings that contain
the information you need.
4. After extracting all of the needed info and putting it into
appropriate variables, you can use AGGREGATE (and its FIRST function)
to reduce things from several rows to one row per entry.
HTH.
--
Bruce Weaver
bwe...@lakeheadu.ca
http://sites.google.com/a/lakeheadu.ca/bweaver/Home
"When all else fails, RTFM."
>Hi Everyone,
>
>We have a report in HTML format that was generated by a public safety
>agency. Regrettably, the format of the data prevents us from
>aggregating records or conducting other descriptive analyses. The
>following is an example of the way the data are generated by the
>agency – they are unable to provide the data in any other format. I
>should note that each record starts with the “—ACKNOWLEDGMENT –“ and
>ends with “—END – “.
Do you really mean "HTML format", which has all sorts of
invisible formatting characters, etc.? - The first thing that
I do with one of those is to copy it into a somewhat dumb
editor that does not support HTML, like Notepad, and the
editor strips out all the decoration (bold, italics, table-format)
and leaves a plain vanilla ASCII text file.
>
>-- ACKNOWLEDGEMENT--
>
>ENTER ACCEPTED AS FOLLOWING RECORD
>LOST GUN
> SERIAL NO: TBP12345 LOSS DATE:
>11/27/2009
> MAKE: SAT ENTRY DATE:
>11/27/2009
> MODEL: TP123 PCN:
>A123456789
> CALIBER: 9 NIC:
>A123456789
> TYPE: PI
> CASE NO: 08-1234567
>ENTERING MNE: A12345678
>ENTERING AGY: AL12345678K7 – WINNIPEG COUNTY SHERIFF'S OFFICE
> NOTIFY AGY: NO NOTIFY/PUBLICLY AVAILABLE
> MISC: 2009-11WPG12345 TAURUS 3" BBL TITANIUM - FULLY LOADED
>
>--END--
>
>I was wondering if anyone could suggest syntax that would allow us to
>import the data in tabular format that contains the following
Is it a fixed format, with the same number of lines each time, and
everything in the same place? That would call for the use of the
original, simple sort of fixed input formats that SPSS started with.
- Specify the line and column and type of data and var-name.
See your documentation.
If it is somewhat more varying, you can use "Input Record" to
tell it how to recognize each line, and the variable(s) in that line.
The solution that Bruce suggested is a brute-force variation that
you may have to retreat to if the simpler solutions can't be applied.
>variables. The desired database would include the following variables
>that correspond to the above reported variables.
>
>
>RTYPE DATEREP ENTRYD PCN NIC SERIALNO
>LOST GUN 27-NOV-2009 27-NOV-2009 A123456789 A123456789 TBP12345
>
>VARS continued….
>
>MAKE MODEL CALIBRE TYPE CASENO ENTERINGMNE
>SAT TP123 9 PI 08-1234567 A12345678
>
>VARS continued….
>
>ENTERINGAGY NOTIFYAGY
>AL12345678K7-WINNIPEG COUNTY SHERIFF’S OFFICE NO NOTIFY/PUBLIC
>AVAILABLE\
>
>VARS continued….
>MISC
>2009-11WPG12345 TAURUS 3" BBL TITANIUM - FULLY LOADED
--
Rich Ulrich
Here's what I had in mind. It works on a text file containing the
data you showed copied 3 times, and stored as a .txt file.
new file.
dataset close all.
GET DATA
/TYPE=TXT
/FILE='C:\Temp\htmldata.txt'
/DELCASE=LINE
/DELIMITERS=""
/ARRANGEMENT=DELIMITED
/FIRSTCASE=1
/IMPORTCASE=ALL
/VARIABLES=
V1 A78.
CACHE.
EXECUTE.
dataset name htmldata.
* Trim leading blanks .
compute v1 = ltrim(v1).
exe.
* Number the records .
do if ($casenum EQ 1).
- compute record = 1.
else.
- do if char.index(v1,"-- ACKNOWLEDGEMENT--") EQ 0.
- compute record = lag(record).
- else. /* "-- ACKNOWLEDGEMENT--" line .
- compute record = lag(record) + 1.
- end if.
end if.
exe.
* Get variable RTYPE .
string RTYPE(A25).
if rtrim(lag(v1)) EQ "ENTER ACCEPTED AS FOLLOWING RECORD" RTYPE =
RTRIM(v1).
exe.
* Get variables REPDATE and ENTRYD.
* I assume REPDATE is the LOSS DATE .
numeric REPDATE ENTRYD (Adate).
string pcn nic (a15).
compute #blank = char.rindex(rtrim(v1)," ").
do if char.index(v1,"LOSS DATE:") GT 0.
- compute repdate = number( char.substr(v1,#blank+1),adate10 ).
else if char.index(v1,"ENTRY DATE:") GT 0.
- compute entryd = number( char.substr(v1,#blank+1),adate10 ).
else if char.index(v1,"PCN:") GT 0.
- compute PCN = char.substr(v1,#blank+1) .
else if char.index(v1,"NIC:") GT 0.
- compute NIC = char.substr(v1,#blank+1) .
end if.
exe.
* Get SERIALNO, MAKE, MODEL, CALIBER, TYPE .
string serialno make model type caseno enteringmne (a15)/
enteringagy notifyagy misc (a80).
compute #colon = char.index(rtrim(v1),":").
do if char.index(v1,"SERIAL NO:") GT 0.
- compute serialno = char.substr(v1,#colon+2).
else if char.index(v1,"MAKE:") GT 0.
- compute make = char.substr(v1,#colon+2).
else if char.index(v1,"MODEL:") GT 0.
- compute model = char.substr(v1,#colon+2).
else if char.index(v1,"CALIBER:") GT 0.
- compute caliber = number( char.substr(v1,#colon+2),f5 ).
else if char.index(v1,"TYPE:") GT 0.
- compute type = char.substr(v1,#colon+2).
else if char.index(v1,"CASE NO:") GT 0.
- compute caseno = char.substr(v1,#colon+2).
else if char.index(v1,"ENTERING MNE:") GT 0.
- compute enteringmne = char.substr(v1,#colon+2).
else if char.index(v1,"ENTERING AGY:") GT 0.
- compute enteringagy = char.substr(v1,#colon+2).
else if char.index(v1,"NOTIFY AGY:") GT 0.
- compute notifyagy = char.substr(v1,#colon+2).
else if char.index(v1,"MISC:") GT 0.
- compute misc = char.substr(v1,#colon+2).
end if.
exe.
DATASET DECLARE newdata.
AGGREGATE
/OUTFILE='newdata'
/BREAK=record
/RTYPE=MAX(RTYPE)
/REPDATE=MAX(REPDATE)
/ENTRYD=MAX(ENTRYD)
/pcn=MAX(pcn)
/nic=MAX(nic)
/serialno=MAX(serialno)
/make=MAX(make)
/model=MAX(model)
/type=MAX(type)
/caseno=MAX(caseno)
/enteringmne=MAX(enteringmne)
/enteringagy=MAX(enteringagy)
/notifyagy=MAX(notifyagy)
/misc=MAX(misc).
dataset activate newdata window=front.
* End of syntax .
See FILE TYPE command!!!!
Hi David. Good thought. That would work great for most of it. But
I'm not quite sure how one would grab the RTYPE variable. It sits on
a line by itself, and will not be "LOST GUN" for every record.
--
Bruce Weaver
bwe...@lakeheadu.ca
another option would be using Perl, which is very convenient at
messing around with deleting spaces, splitting up sentences, getting
the desired info and formatting it how you want. I always use it in
cases like this and it is quite easy. Checkout http://www.comp.leeds.ac.uk/Perl/start.html
for some info, that tutorial would allow you to do what you want. It's
a good idea to copy it to a txt file like Rich said.
take care,
Filip Van Droogenbroeck