TCL For Parsing

ajocius

Nov 10, 2005, 5:44:21 PM

Group,
I'm trying to find an application that will take a very large text file
and parse it. I get data files that have more than a million lines of text.
This causes the files to become extremely large and most if not all normal
applications (Excel or Word) just choke and cough. One of my European
colleagues suggested I use TCL. Is this possible in TCL and, if so, is TCL
easy to learn?

Tony

Jeff Godfrey

Nov 10, 2005, 5:55:43 PM

"ajocius" <ajo...@insightbb.com> wrote in message
news:93Qcf.558992$xm3.351865@attbi_s21...

> Group,
> I'm trying to find an application that will take a very large
> text file and parse it. I get data files that have more than a
> million lines of text. This causes the files to become extremely
> large and most if not all normal applications (Excel or Word) just
> choke and cough.

> One of my European colleagues suggested I use TCL.

Good suggestion - you probably owe him a beer... ;^)

>Is this possible in TCL

Absolutely.

> is TCL easy to learn.

Yes, very much so.

If you care to post more details regarding the contents of your file
and exactly what you want to parse from it, you'll likely get some
more concrete suggestions, but rest assured - you can likely do what
you want quite easily using Tcl.

Jeff

ajocius

Nov 10, 2005, 6:29:29 PM

Jeff,
Well, the file has a .txt extension. About 50% of the file is fluff that
I don't need and would just like to eliminate. Also, unlike a lot of other
text files, the data starts with a series of lines with asterisks ***** and
there is always a header to a group of data. Once a list of data is
complete, usually on the order of about 500 to 1000 or so lines for a single
tested part, then the asterisks start again and data is added in 50

ajocius

Nov 10, 2005, 7:18:09 PM

Jeff and Group,
Awesome, below you will find a portion of a sample text file that I
get from my tester. The file can be called almost anything, so ideally I
would have to select the file. The information between asterisks ***** is
information that identifies my product's serial number, Date, UUT result,
and time. This information is important; everything else between the
asterisks is not needed. Then what follows is somewhat disjoint. Any line
with a DONE is to be discarded, any delay is to be discarded, anything skipped
is discarded and on and on. Then I'm looking to save the limits and the
results (Pass or Fail). Instead of saving in the present format, I want to
save it in columns. Eventually I will have thousands of each product and I
need to perform statistical analysis (to be done in Excel). So, below is a
sample of what I'm looking to get. I hope this example is viewable
correctly in the group window.

Expected parsed text:
Serial Number     Time        Execution Time  UUT Result  DIM Result  Measurement  Low  High  Check Software ID CD Model  ... etc.
89FCHWM153011072  2:53:51 PM  362.8 s         Failed      Passed      1            1    1     Failed                      ...
89FCHWM153011068  3:00:00 PM  240.2 s         Passed      Passed      1            1    1     Passed                      ...

Text from original text file below.

****************************************
UUT Report
Test Socket Index: 0
Serial Number: 89FCHWM153011072
"Date: Friday, November 04, 2005"
Time: 2:53:51 PM
Operator: administrator
Execution Time: 362.8368623 seconds
Number of Results: 370
UUT Result: Failed
****************************************

Begin Sequence: MainSequence
(C:\ITS Functional\Models\_KiaCDUSA25OCT05.seq)

Lock Panel: Done
Delay: Done
Delay: Done
Delay: Done
Delay: Done
Delay: Done
Delay: Done
Delay: Done
Delay: Done
Delay: Done
Delay: Done
Delay: Done
Delay: Done
Delay: Done
Delay: Done
Delay: Done
Delay: Done
Delay: Done
Delay: Done
Delay: Done
Delay: Done
Delay: Done
Delay: Done
Delay: Done
Delay: Done
Delay: Done
Delay: Done
Reset Button_Status: Done
Delay: Done
Dimming - Brightest: Done
Module Time: 0.0005246
Delay: Done
Dimming - Bright: Done
Module Time: 0.0006663
Delay: Done
Dimming - Medium Bright: Done
Module Time: 0.0005179
Delay: Done
Get Start button: Passed
Module Time: 0.0117736
Delay: Done
DIM Results: Passed
Measurement: 1
Limits:
Low: 1
High: 1
Comparison Type: GELE (>= <=)
Unlock Panel: Done
Lock Panel: Done
Pass Ser Num to Main: Done
Ser Num - Prefix: Done
Ser Num - Broadcast Code: Done
Ser Num - PML: Done
Ser Num - Shift: Done
Ser Num - Year: Done
Ser Num - Julian Date: Done
Ser Num - Seq Num: Done
dB Ref Level 0dB: Done
Module Time: 0.0003989
Get Locals.CAN: Done
BAUD 100K: Done
Start Diagnostics: Done
Delay: Done
Get TestToolMsgID: Done
Power Run Mode ON: Done
Set Backlight ON3: Done
Unlock Panel: Done
Lock Panel: Done
Check Software ID CD Model: Failed
String: AA0424AA04
Limits:
String: AA0448AA02
Comparison Type: Ignore Case
Module Time: 0.1635819
Data = aa 04 24 aa 04 ; Number of Tries = 1; Failure Report =
Check Software ID ICDX Model: Skipped
Set Timed Diagnostics TRUE: Done
Unlock Panel: Done
Lock Panel: Done
Start Diagnostics: Done
RF MAX VOLUME Level: Passed
Measurement: 17
Units: dB
Limits:
Low: 5
High: 25
Comparison Type: GELE (>= <=)
Unit: dB
LR MAX VOLUME Level: Passed
Measurement: 17
Units: dB
Limits:
Low: 5
High: 25
Comparison Type: GELE (>= <=)
Unit: dB
RR MAX VOLUME Level: Passed
Measurement: 17
Units: dB
Limits:


Michael A. Cleverly

Nov 10, 2005, 9:15:45 PM

Yes this is possible in Tcl. Yes Tcl is easy to learn.

Michael

Michael A. Cleverly

Nov 10, 2005, 9:59:30 PM

On Fri, 11 Nov 2005, ajocius wrote:

> The information between asterisks ***** is information that identifies
> my product's serial number, Date, UUT result, and time. This
> information is important; everything else between the asterisks is not
> needed. Then what follows is somewhat disjoint. Any line with a DONE
> is to be discarded, any delay is to be discarded, anything skipped is
> discarded and on and on. Then I'm looking to save the limits and the
> results (Pass or Fail). Instead of saving in the present format, I want
> to save it in columns. Eventually I will have thousands of each product
> and I need to perform statistical analysis (to be done in Excel). So,
> below is a sample of what I'm looking to get.
>

> Expected parsed text:
> Serial Number     Time        Execution Time  UUT Result  DIM Result  Measurement  Low  High  Check Software ID CD Model  ... etc.
> 89FCHWM153011072  2:53:51 PM  362.8 s         Failed      Passed      1            1    1     Failed                      ...

In your previous message you asked if Tcl was easy to learn, so I take it
that you don't yet know Tcl. Don't worry--spend some time reading a book
(or the man pages) and the wiki and experimenting a bit. You'll want to
make sure you take enough time to understand the rules of Tcl. Not hard
as there are only eleven of them. See: http://wiki.tcl.tk/endekalogue

Now, there are several approaches you could take with Tcl to parse this
data.

You could read the entire file into memory (using the open, read, close
commands) and use a series of regular expressions (regexp command) to
parse the data and pick out the pieces you are interested in.
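
As a rough sketch of this first approach (not part of the original post): the helper names readFile and extractField are made up for illustration, and the pattern assumes every field sits on its own line as "Label: value", as in the sample data. Pairing serial numbers with results by position also assumes each report contains both fields.

```tcl
# Hypothetical helper: read a whole file into one string.
proc readFile {path} {
    set fh [open $path r]
    set text [read $fh]
    close $fh
    return $text
}

# Hypothetical helper: return the first word after every "<label>:" line.
# Assumes $label contains no regular-expression metacharacters.
proc extractField {text label} {
    set re [format {^[ \t]*%s:[ \t]*(\S+)} $label]
    set values {}
    # -line makes ^ match at the start of every line; -all -inline returns
    # a flat list: fullMatch capture fullMatch capture ...
    foreach {all value} [regexp -all -inline -line -- $re $text] {
        lappend values $value
    }
    return $values
}

# An inline sample stands in for [readFile yourfile.txt] so the sketch
# runs without any input file.
set sample "Serial Number: 89FCHWM153011072\nTime: 2:53:51 PM\nUUT Result: Failed\nDelay: Done\n"
foreach serial [extractField $sample "Serial Number"] \
        result [extractField $sample "UUT Result"] {
    puts "$serial\t$result"
}
```

Reading a million-line file in one go is usually fine on a modern machine, but memory use grows with the file, which is one reason to consider the line-at-a-time approach instead.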

Another direction you could go is to read the file one line at a time (in
which case you'd use gets instead of read), look at what the line was, and
then either save the data or discard it.
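
A minimal sketch of the line-at-a-time approach, under the assumption that only steps ending in Passed or Failed are of interest. The proc name keepResults is hypothetical, and the demo uses "file tempfile", which needs Tcl 8.6 or later; with a real report you would simply open it with [open $filename r].

```tcl
# Hypothetical helper: read a channel line by line and keep only the steps
# that report Passed or Failed. Returns a list of {stepName result} pairs.
proc keepResults {chan} {
    set kept {}
    # gets returns -1 at end of file
    while {[gets $chan line] >= 0} {
        set line [string trim $line]
        # Lines ending in Done, Skipped, a number, etc. do not match and
        # are silently discarded.
        if {[regexp {^(.+):\s*(Passed|Failed)$} $line -> name result]} {
            lappend kept [list $name $result]
        }
    }
    return $kept
}

# Demo on a small temporary file (contents copied from the sample data).
set fh [file tempfile path]
puts $fh "Delay: Done"
puts $fh "Get Start button: Passed"
puts $fh "Check Software ID ICDX Model: Skipped"
puts $fh "Check Software ID CD Model: Failed"
seek $fh 0
foreach pair [keepResults $fh] {
    puts [join $pair \t]
}
close $fh
file delete $path
```

Because only one line is held in memory at a time, this form copes with arbitrarily large files.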

A third way (which I'll show below) involves writing just enough Tcl
procedures (the ones matching the first "word" on each line that you are
interested in) and having Tcl do the parsing & heavy lifting by just
[source]'ing your file. This works (at least with your sample data)
since everything is contained on a single line and there don't appear to
be any instances of characters that would have special significance (such
as $, [, ], etc.)

Save the script below to a file and it will write tab-delimited output to
stdout. If you redirect that to a file (or change it to write to a file
directly--an exercise for the reader :-) with an .xls extension then Excel
should be able to open & read it.

Welcome to Tcl!

Michael

proc unknown {cmd args} {
    # do nothing for all the other lines in the file
}

set headers [list "Serial Number"]
proc Serial {Number: serial} {
    set ::row [list $serial]
    set ::last Serial
}

lappend headers Time
proc Time: {time ampm} {
    lappend ::row "$time $ampm"
    set ::last Time:
}

lappend headers "Execution Time"
proc Execution {Time: time units} {
    lappend ::row "[format %0.1f $time] [string index $units 0]"
    set ::last Execution
}

lappend headers "UUT Result"
proc UUT {what args} {
    if {$what == "Result:"} then {
        lappend ::row [lindex $args 0]
    }
    set ::last UUT
}

lappend headers "DIM Result"
proc DIM {Results: result} {
    lappend ::row $result
    set ::last DIM
}

lappend headers "Measurement"
proc Measurement: {measurement} {
    if {$::last == "DIM"} then {
        lappend ::row $measurement
    }
}

lappend headers "Low"
proc Low: {qty} {
    if {$::last == "DIM"} then {
        lappend ::row $qty
    }
}

lappend headers "High"
proc High: {qty} {
    if {$::last == "DIM"} then {
        lappend ::row $qty
    }
}

lappend headers "Check Software ID CD Model"
proc Check {Software ID type Model: result} {
    if {$type == "CD"} then {
        lappend ::row $result
        puts [join $::row \t]
    }
    set ::last "Check: Software ID $type Model:"
}

puts [join $headers \t]
source input.txt

Melissa Schrumpf

Nov 10, 2005, 10:53:31 PM

ajocius wrote:

> Jeff and Group,
> Awesome, below you will find a portion of a sample text file that I
> get from my tester. The file can be called almost anything, so ideally I
> would have to select the file. The information between asterisks ***** is
> information that identifies my product's serial number, Date, UUT result,
> and time. This information is important; everything else between the
> asterisks is not needed. Then what follows is somewhat disjoint. Any line
> with a DONE is to be discarded,

Tony,

While it's trivial to discard the lines with "Done" using Tcl, save
yourself some trouble and disk space: TestStand has a setting where you
can suppress the output of any step based on the result text... I have
my TestStand machines configured to suppress logging of any step whose
result text is "Done".

I do a bit of postprocessing on the log files myself, parsing out
results for transcription to other formats. (Actually, for injection
into other processes, but that's another story). One thing I do is make
it easy to identify the results I want to keep. For example, if the
Pass/Fail test is important, but the Delay is not, I might name the
Pass/Fail step "+Test" ... then, when parsing the files, look for lines
starting with "+".

Basically, I'd code this as a state machine... (untested)

set state begin

set filename [tk_getOpenFile ...]
if {$filename eq ""} {return}

set savedata {}
set fid [open $filename r]
# gets returns -1 at end of file, so no phantom empty line is processed
while {[gets $fid str] >= 0} {
    set str [string trim $str]

    switch -- $state {
        begin {
            if {[regexp {^\*} $str]} {set state header}
        }
        header {
            if {[regexp {^\*} $str]} {set state test; continue}
            # look for lines containing important stuff...
            # e.g. regexp {^UUT Result:\s+(\S+)$} $str all result
            # lappend savedata $result
        }
        test {
            # a new row of asterisks starts the next report's header
            if {[regexp {^\*} $str]} {set state header; continue}
            # again, parse out what you need, lappend it to savedata
        }
    }
}
close $fid

set fid [open ${filename}.out w]
foreach str $savedata {puts $fid $str}
close $fid

You could, of course, write the output file concurrently with parsing.
You can also accumulate statistics and write them out. Whatever you
need. It's all reasonably straightforward.
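
One way the "write concurrently with parsing" idea might look, as a sketch only: the proc name writeHeaderRows and the choice of three header fields are assumptions for illustration, and the demo again relies on "file tempfile" (Tcl 8.6 or later). Each row is written as soon as the closing line of asterisks is seen, so nothing accumulates in memory.

```tcl
# Hypothetical proc: copy selected header fields of every report from $in
# to $out, one tab-separated row per report, writing as it goes.
proc writeHeaderRows {in out} {
    set state begin
    set wanted [list "Serial Number" "Time" "UUT Result"]
    puts $out [join $wanted \t]
    while {[gets $in line] >= 0} {
        set line [string trim $line]
        # true when the line starts with a literal asterisk
        set isStars [string match {\**} $line]
        switch -- $state {
            begin - test {
                if {$isStars} {
                    array unset field
                    set state header
                }
            }
            header {
                if {$isStars} {
                    # closing row of asterisks: emit the finished row now
                    set row {}
                    foreach name $wanted {
                        if {[info exists field($name)]} {
                            lappend row $field($name)
                        } else {
                            lappend row ""
                        }
                    }
                    puts $out [join $row \t]
                    set state test
                } elseif {[regexp {^([^:]+):\s*(.*)$} $line -> name value]} {
                    set field($name) $value
                }
            }
        }
    }
}

# Demo: one report header followed by a step line, written to stdout.
set in [file tempfile inPath]
puts $in "****"
puts $in "UUT Report"
puts $in "Serial Number: 89FCHWM153011072"
puts $in "Time: 2:53:51 PM"
puts $in "UUT Result: Failed"
puts $in "****"
puts $in "Delay: Done"
seek $in 0
writeHeaderRows $in stdout
close $in
file delete $inPath
```

The test state is left empty here; per-step results would be collected there in the same way before the next line of asterisks resets the row.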

Good luck!

MKS

--
MKS

Ralf Fassel

Nov 11, 2005, 5:23:28 AM

* "Michael A. Cleverly" <mic...@cleverly.com>

| set headers [list "Serial Number"]
| proc Serial {Number: serial}

--<snip-snip>--

Not trusting my eyes...what a true hack.
Isn't TCL a lovely language.

R'

Uwe Klein

Nov 11, 2005, 5:54:34 AM

Painfully beautiful.

This needs to be presented in a glass showcase, and
I would prefer to put a solution like this in a
safe interpreter.
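
A sketch of what the safe-interpreter variant might look like, assuming only the serial numbers are wanted. The proc name serialsFromUntrusted is made up, and the demo uses "file tempfile" (Tcl 8.6 or later). Commands such as exec, open and file are hidden in a safe interpreter, so a stray line in the data cannot reach them; note that ordinary built-ins such as set or while are still visible, so this limits the damage rather than making arbitrary input harmless.

```tcl
# Hypothetical wrapper: evaluate a report file inside a safe interpreter
# and return the serial numbers it contains.
proc serialsFromUntrusted {path} {
    set child [interp create -safe]
    interp eval $child {
        # swallow every line whose first word is not one of our procs
        proc unknown {args} {}
        set serials {}
        proc Serial {label serial} {lappend ::serials $serial}
    }
    # "source" is hidden inside a safe interpreter; only the trusted
    # parent can invoke it on the child's behalf.
    set code [catch {interp invokehidden $child source $path} msg]
    if {$code == 0} {
        set msg [interp eval $child {set serials}]
    }
    # always clean up the child, even when sourcing failed
    interp delete $child
    if {$code != 0} {
        error $msg
    }
    return $msg
}

# Demo on a small temporary file.
set fh [file tempfile path]
puts $fh "****"
puts $fh "Serial Number: 89FCHWM153011072"
puts $fh "Unlock Panel: Done"
close $fh
puts [serialsFromUntrusted $path]
file delete $path
```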

uwe

ajocius

Nov 11, 2005, 7:23:39 AM

All kinds of courses of action. Is there a website similar to
www.excelforum where TCL programmers meet to discuss problems and
TCL-related goodies? I'm looking for a list of previously answered TCL
questions that I can use as a reference.

Tony


Uwe Klein

Nov 11, 2005, 7:33:31 AM

ajocius wrote:
> All kinds of courses of action. Is there a website similar to
> www.excelforum where TCL programmers meet to discuss problems and
> TCL-related goodies? I'm looking for a list of previously answered TCL
> questions that I can use as a reference.

The Tclers Wiki:
http://wiki.tcl.tk/0

Usage, Tricks, Treats

and Links to anything Tcl

uwe

Mark Tarver

Nov 12, 2005, 12:20:22 PM

I'm not a TCL guru; but I'm not sure TCL is the
best choice for what you want. I'd use a high-level
declarative language such as Lisp or Qi. Prolog is
good for parsing, but I'm not sure how well it would
take to 10^6 lines of text.

IMHO I'd use Lisp or Qi and learn about streams.
This will allow you to read and parse more-or-less
synchronously without having to read in the whole
shebang. Lisp is more mainstream, Qi is written
in Lisp and has patterns which makes life easier.
See www.lambdassociates.org for Qi which also
has a link somewhere to CLisp.

Mark

Cameron Laird

Nov 13, 2005, 8:08:02 AM

In article <1131816022....@g43g2000cwa.googlegroups.com>,

.
.
.
If you're going in *that* direction, why not Snobol?
Or, perhaps more soberly, Icon?

Mark Tarver

Nov 13, 2005, 12:10:32 PM

Yes, they would be good too. SNOBOL is a bit old though.

Mark

MH

Nov 15, 2005, 2:26:21 PM

In article <5rRcf.559082$xm3.538060@attbi_s21>,

ajocius <ajo...@insightbb.com> wrote:
>Jeff and Group,
> Awesome, below you will find a portion of a sample text file that I
>get from my tester. The file can be called almost anything, so ideally I
>would have to select the file. The information between asterisks ***** is
>information that identifies my product's serial number, Date, UUT result,
>and time. This information is important; everything else between the
>asterisks is not needed. Then what follows is somewhat disjoint. Any line
>with a DONE is to be discarded, any delay is to be discarded, anything skipped
>is discarded and on and on. Then I'm looking to save the limits and the
>results (Pass or Fail). Instead of saving in the present format, I want to
>save it in columns. Eventually I will have thousands of each product and I
>need to perform statistical analysis (to be done in Excel). So, below is a
>sample of what I'm looking to get. I hope this example is viewable
>correctly in the group window.
>
>Expected parsed text:
>Serial Number     Time        Execution Time  UUT Result  DIM Result  Measurement  Low  High  Check Software ID CD Model  ... etc.
>89FCHWM153011072  2:53:51 PM  362.8 s         Failed      Passed      1            1    1     Failed                      ...
>89FCHWM153011068  3:00:00 PM  240.2 s         Passed      Passed      1            1    1     Passed                      ...

Many good suggestions here..

Here's (hopefully) one more.

If you're looking to skip/ignore certain text, try:

grep -v Done file.txt > file2.txt
grep -v Delay file2.txt > file3.txt

If you get a LOT of Done lines that you just want to discard, this should
cut down your file to something more manageable.
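
The same inverse filtering can be sketched in plain Tcl, which avoids needing grep at all. The proc name dropLines is hypothetical, and the demo uses "file tempfile" (Tcl 8.6 or later); a real run would open the input and output files with [open ... r] and [open ... w].

```tcl
# Hypothetical proc: copy $in to $out, dropping every line that matches
# any of the given regular expressions. Returns the number of lines dropped.
proc dropLines {in out patterns} {
    set dropped 0
    while {[gets $in line] >= 0} {
        set keep 1
        foreach p $patterns {
            if {[regexp -- $p $line]} {
                set keep 0
                break
            }
        }
        if {$keep} {
            puts $out $line
        } else {
            incr dropped
        }
    }
    return $dropped
}

# Demo: filter a small temporary file to stdout in a single pass.
set in [file tempfile inPath]
puts $in "Lock Panel: Done"
puts $in "Delay: Done"
puts $in "Get Start button: Passed"
puts $in "Module Time: 0.0117736"
seek $in 0
set n [dropLines $in stdout {Done Delay}]
close $in
file delete $inPath
puts stderr "dropped $n lines"
```

Unlike the two chained grep commands, this makes one pass over the file and writes no intermediate file.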

(if you're in windows, as I think you are, you might have to download a copy
of grep for windows)

Mattias

Darren New

Nov 15, 2005, 3:48:55 PM

MH wrote:
> (if you're in windows, as I think you are, you might have to download a copy
> of grep for windows)

It's called "find" on Windows.

--
Darren New / San Diego, CA, USA (PST)
Sabotage? Communist conspiracy? Or just
Microsoft again? Only time will tell.

Dan Smart

Nov 15, 2005, 4:54:23 PM

Darren New wrote:
> MH wrote:
>
>> (if you're in windows, as I think you are, you might have to download
>> a copy
>> of grep for windows)
>
>
> It's called "find" on Windows.
>

::fileutil::grep pattern ?files?
On just about any O/S you care to name shirley?

Dan "colon-colon-f-i-l..." Smart

Donal K. Fellows

Nov 15, 2005, 5:38:23 PM

Darren New wrote:
> MH wrote:

> > grep for windows

> It's called "find" on Windows.

Ugh. On XP at least, it's much nicer to use "findstr". Or, better yet,
download a copy of "grep" from, say, http://unxutils.sourceforge.net/
(which I can recommend).

Donal.

Bob Techentin

Nov 17, 2005, 2:07:22 AM

"Donal K. Fellows" <donal.k...@man.ac.uk>

> Or, better yet,
> download a copy of "grep" from, say, http://unxutils.sourceforge.net/

I'd even suggest getting MingW/MSYS from
http://sourceforge.net/projects/tcl/

Bob
--
Bob Techentin techenti...@NOSPAMmayo.edu
Mayo Foundation (507) 538-5495
200 First St. SW FAX (507) 284-9171
Rochester MN, 55901 USA http://www.mayo.edu/sppdg/