Group, I'm trying to find an application that will take a very large text file and parse it. I get data files that have more than a million lines of text. This makes the files extremely large, and most if not all normal applications (Excel or Word) just choke and cough. One of my European colleagues suggested I use TCL. Is this possible in TCL, and if so, is TCL easy to learn?

> Group,
> I'm trying to find an application that will take a very large
> text file and parse it. I get data files that have more than a
> million lines of text. This causes the files to become extremely
> large and most if not all normal applications (Excel or Word) just
> choke and cough.
> One of my European colleagues suggested I use TCL.
Good suggestion - you probably owe him a beer... ;^)
>Is this possible in TCL
Absolutely.
> is TCL easy to learn.
Yes, very much so.
If you care to post more details regarding the contents of your file and exactly what you want to parse from it, you'll likely get some more concrete suggestions, but rest assured - you can likely do what you want quite easily using Tcl.
Jeff, Well, the file has a .txt extension. About 50% of the file is fluff that I don't need and would just like to eliminate. Also, unlike a lot of other text files, the data starts with a series of lines of asterisks ***** and there is always a header to a group of data. Once a list of data is complete, usually on the order of 500 to 1000 or so lines for a single tested part, the asterisks start again and data is added in 50
Jeff and Group, Awesome. Below you will find a portion of a sample text file that I get from my tester. The file can be called almost anything, so ideally I would be able to select the file. The information between the asterisks ***** identifies my product's serial number, date, UUT result, and time. That information is important; everything else between the asterisks is not needed. What follows is somewhat disjointed. Any line with a DONE is to be discarded, any delay is to be discarded, anything skipped is discarded, and so on. Then I'm looking to save the limits and the results (Pass or Fail). Instead of saving in the present format, I want to save it in columns. Eventually I will have thousands of each product and I need to perform statistical analysis (to be done in Excel). So, below is a sample of what I'm looking to get. I hope this example is viewable correctly in the group window.
Expected parsed text:

Serial Number     Time        Execution Time  UUT Result  DIM Result  Measurement  Low  High  Check Software ID CD Model:....... etc
89FCHWM153011072  2:53:51 PM  362.8 s         Failed      Passed      1            1    1     Failed..................
89FCHWM153011068  3:00:00 PM  240.2 s         Passed      Passed      1            1    1     Passed..............
Text from original text file below.
****************************************
UUT Report
Test Socket Index: 0
Serial Number: 89FCHWM153011072
"Date: Friday, November 04, 2005"
Time: 2:53:51 PM
Operator: administrator
Execution Time: 362.8368623 seconds
Number of Results: 370
UUT Result: Failed
****************************************
Begin Sequence: MainSequence (C:\ITS Functional\Models\_KiaCDUSA25OCT05.seq)
Lock Panel: Done Delay: Done Delay: Done Delay: Done Delay: Done Delay: Done Delay: Done Delay: Done Delay: Done Delay: Done Delay: Done Delay: Done Delay: Done Delay: Done Delay: Done Delay: Done Delay: Done Delay: Done Delay: Done Delay: Done Delay: Done Delay: Done Delay: Done Delay: Done Delay: Done Delay: Done Delay: Done Reset Button_Status: Done Delay: Done Dimming - Brightest: Done Module Time: 0.0005246 Delay: Done Dimming - Bright: Done Module Time: 0.0006663 Delay: Done Dimming - Medium Bright: Done Module Time: 0.0005179 Delay: Done Get Start button: Passed Module Time: 0.0117736 Delay: Done DIM Results: Passed Measurement: 1 Limits: Low: 1 High: 1 Comparison Type: GELE (>= <=) Unlock Panel: Done Lock Panel: Done Pass Ser Num to Main: Done Ser Num - Prefix: Done Ser Num - Broadcast Code: Done Ser Num - PML: Done Ser Num - Shift: Done Ser Num - Year: Done Ser Num - Julian Date: Done Ser Num - Seq Num: Done dB Ref Level 0dB: Done Module Time: 0.0003989 Get Locals.CAN: Done BAUD 100K: Done Start Diagnostics: Done Delay: Done Get TestToolMsgID: Done Power Run Mode ON: Done Set Backlight ON3: Done Unlock Panel: Done Lock Panel: Done Check Software ID CD Model: Failed String: AA0424AA04 Limits: String: AA0448AA02 Comparison Type: Ignore Case Module Time: 0.1635819 Data = aa 04 24 aa 04 ; Number of Tries = 1; Failure Report = Check Software ID ICDX Model: Skipped Set Timed Diagnostics TRUE: Done Unlock Panel: Done Lock Panel: Done Start Diagnostics: Done RF MAX VOLUME Level: Passed Measurement: 17 Units: dB Limits: Low: 5 High: 25 Comparison Type: GELE (>= <=) Unit: dB LR MAX VOLUME Level: Passed Measurement: 17 Units: dB Limits: Low: 5 High: 25 Comparison Type: GELE (>= <=) Unit: dB RR MAX VOLUME Level: Passed Measurement: 17 Units: dB Limits:
On Thu, 10 Nov 2005, ajocius wrote:
> I'm trying to find an application that will take a very large text file
> and parse it. I get data files that have more than a million lines of text.
> This causes the files to become extremely large and most if not all normal
> applications (Excel or Word) just choke and cough. One of my European
> colleagues suggested I use TCL. Is this possible in TCL and if so, is TCL
> easy to learn.
Yes this is possible in Tcl. Yes Tcl is easy to learn.
On Fri, 11 Nov 2005, ajocius wrote:
> The information between asterisks ***** is information that identifies
> my product's serial number, Date, UUT result, and time. This
> information is important; everything else between the asterisks is not
> needed. Then what follows is somewhat disjoint. Any line with a DONE
> is to be discarded, any delay is to be discarded, anything skipped is
> discarded and on and on. Then I'm looking to save the limits and the
> results (Pass or Fail). Instead of saving in the present format, I want
> to save it in columns. Eventually I will have thousands of each product
> and I need to perform statistical analysis (to be done in Excel). So,
> below is a sample of what I'm looking to get.

> Expected parsed text:
> Serial Number Time Execution Time UUT Result DIM Result Measurement Low High Check Software ID CD Model:....... etc
> 89FCHWM153011072 2:53:51 PM 362.8 s Failed Passed 1 1 1 Failed..................
In your previous message you asked if Tcl was easy to learn, so I take it that you don't yet know Tcl. Don't worry--spend some time reading a book (or the man pages) and the wiki and experimenting a bit. You'll want to make sure you take enough time to understand the rules of Tcl. Not hard as there are only eleven of them. See: http://wiki.tcl.tk/endekalogue
Now, there are several approaches you could take with Tcl to parse this data.
You could read the entire file into memory (using the open, read, and close commands) and use a series of regular expressions (the regexp command) to parse the data and pick out the pieces you are interested in.
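A minimal sketch of that read-everything approach, assuming the report fits comfortably in memory. The proc names slurp and reportSummaries are made up for illustration, and the patterns are based only on the sample header shown earlier in the thread:

```tcl
# Hypothetical helper: read a whole file into one string.
proc slurp {path} {
    set fid [open $path r]
    set data [read $fid]
    close $fid
    return $data
}

# Hypothetical helper: pair each "Serial Number:" value with the
# "UUT Result:" value of the same report, in file order.
# Assumes every report header contains exactly one of each line.
proc reportSummaries {data} {
    set serials {}
    foreach {match serial} [regexp -all -inline -line \
            {^[ \t]*Serial Number:[ \t]*(\S+)} $data] {
        lappend serials $serial
    }
    set results {}
    foreach {match result} [regexp -all -inline -line \
            {^[ \t]*UUT Result:[ \t]*(\S+)} $data] {
        lappend results $result
    }
    set rows {}
    foreach serial $serials result $results {
        lappend rows [list $serial $result]
    }
    return $rows
}
```

Something like `foreach row [reportSummaries [slurp report.txt]] {puts [join $row \t]}` would then print one tab-separated line per tested part (report.txt is a placeholder file name).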
Another direction you could go is to read the file one line at a time (in which case you'd use gets instead of read), look at what the line was, and then either save the data or discard it.
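The line-at-a-time approach might look roughly like this. It is only a sketch: keepInteresting is a hypothetical name, and the rule that lines ending in ": Done" or ": Skipped" are noise is taken from the description of the data earlier in the thread:

```tcl
# Hypothetical helper: read lines from an open channel and return
# only the ones worth keeping. Blank lines and steps whose result is
# "Done" or "Skipped" are dropped.
proc keepInteresting {chan} {
    set kept {}
    # gets returns -1 at end of file, so this loop stops cleanly
    while {[gets $chan line] >= 0} {
        set line [string trim $line]
        if {$line eq ""} continue
        if {[string match "*: Done" $line]
                || [string match "*: Skipped" $line]} continue
        lappend kept $line
    }
    return $kept
}
```

Because only one line is held at a time (plus whatever you decide to keep), this should cope with a million-line file without reading it all into memory first.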
A third way (which I'll show below) involves writing just enough Tcl procedures (the ones matching the first "word" on each line that you are interested in) and having Tcl do the parsing & heavy lifting by just [source]'ing your file. This works (at least with your sample data) since everything is contained on a single line and there don't appear to be any instances of characters that would have special significance (such as $, [, ], etc.)
Save the script below to a file and it will write tab-delimited output to stdout. If you redirect that to a file (or change it to write to a file directly--an exercise for the reader :-) with an .xls extension then Excel should be able to open & read it.
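For the write-to-a-file variant, one possible shape is sketched here. writeTabFile is a hypothetical name, and it assumes the rows have been collected in a list rather than printed as they are found:

```tcl
# Hypothetical helper: write a header row and data rows to $path as
# tab-delimited text. Each row is a Tcl list of column values.
proc writeTabFile {path headers rows} {
    set out [open $path w]
    puts $out [join $headers \t]
    foreach row $rows {
        puts $out [join $row \t]
    }
    close $out
}
```

For example, `writeTabFile results.xls {"Serial Number" "UUT Result"} {{89FCHWM153011072 Failed}}` would write a two-line file. Note that the content is still plain tab-delimited text, so some Excel versions may ask for confirmation before opening it under an .xls name.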
Welcome to Tcl!
Michael
proc unknown {cmd args} {
    # do nothing for all the other lines in the file
}

set headers [list "Serial Number"]
proc Serial {Number: serial} {
    # a new serial number starts a new output row
    set ::row [list $serial]
    set ::last Serial
}

lappend headers Time
proc Time: {time ampm} {
    lappend ::row "$time $ampm"
    set ::last Time:
}

lappend headers "Execution Time"
proc Execution {Time: time units} {
    # e.g. "362.8368623 seconds" becomes "362.8 s"
    lappend ::row "[format %0.1f $time] [string index $units 0]"
    set ::last Execution
}

lappend headers "UUT Result"
proc UUT {what args} {
    # "UUT Report" also lands here, so only act on "UUT Result:"
    if {$what == "Result:"} then {
        lappend ::row [lindex $args 0]
    }
    set ::last UUT
}

lappend headers "DIM Result"
proc DIM {Results: result} {
    lappend ::row $result
    set ::last DIM
}

lappend headers "Measurement"
proc Measurement: {measurement} {
    # only keep the measurement that belongs to the DIM step
    if {$::last == "DIM"} then {
        lappend ::row $measurement
    }
}

lappend headers "Low"
proc Low: {qty} {
    if {$::last == "DIM"} then {
        lappend ::row $qty
    }
}

lappend headers "High"
proc High: {qty} {
    if {$::last == "DIM"} then {
        lappend ::row $qty
    }
}

lappend headers "Check Software ID CD Model"
proc Check {Software ID type Model: result} {
    if {$type == "CD"} then {
        # last column of interest: emit the finished row
        lappend ::row $result
        puts [join $::row \t]
    }
    set ::last "Check: Software ID $type Model:"
}

# The posted script breaks off at this point. Presumably it ended by
# printing the header row and then sourcing the data file named on the
# command line (an assumption, not part of the original post):
puts [join $headers \t]
source [lindex $argv 0]
ajocius wrote:
> Jeff and Group,
> Awesome, below you will find a portion of a sample text file that I
> get from my tester. The file can be called almost anything, so ideally I
> would have to select the file. The information between asterisks ***** is
> information that identifies my product's serial number, Date, UUT result,
> and time. This information is important; everything else between the
> asterisks is not needed. Then what follows is somewhat disjoint. Any line
> with a DONE is to be discarded,
Tony,
While it's trivial to discard the lines with "Done" using Tcl, save yourself some trouble and disk space: TestStand has a setting where you can suppress the output of any step based on the result text... I have my TestStand machines configured to suppress logging of any step whose result text is "Done."
I do a bit of postprocessing on the log files myself, parsing out results for transcription to other formats. (Actually, for injection into other processes, but that's another story). One thing I do is make it easy to identify the results I want to keep. For example, if the Pass/Fail test is important, but the Delay is not, I might name the Pass/Fail step "+Test" ... then, when parsing the files, look for lines starting with "+."
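As a rough illustration of that naming trick, the filter below keeps only lines that start with "+" and strips the marker. markedSteps is a hypothetical name, and the sketch assumes the step name appears at the start of each log line:

```tcl
# Hypothetical helper: return only the lines whose step name was
# marked with a leading "+", with the marker removed.
proc markedSteps {lines} {
    set kept {}
    foreach line $lines {
        set line [string trimleft $line]
        if {[string index $line 0] eq "+"} {
            lappend kept [string range $line 1 end]
        }
    }
    return $kept
}
```

The advantage of this design is that the parser never needs a list of step names; whoever edits the test sequence decides what gets kept.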
Basically, I'd code this as a state machine... (untested)
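set state begin

set filename [tk_getOpenFile ...]
if {![string compare $filename ""]} {return}

set savedata {}
set fid [open $filename r]
# gets returns -1 at end of file; testing eof before reading would
# process one bogus empty line at the end
while {[gets $fid str] >= 0} {
    set str [string trimleft $str]

    switch -- $state {
        begin {
            # skip everything up to the first row of asterisks
            if {[regexp {^[\*].*$} $str]} {set state header}
        }
        header {
            # the second row of asterisks closes the header
            if {[regexp {^[\*].*$} $str]} {set state test; continue}
            # look for lines containing important stuff...
            # e.g. regexp {UUT Result:[[:space:]]+([^[:space:]]+)$} $str all result
            # lappend savedata $result
        }
        test {
            # a new row of asterisks starts the next part's header
            if {[regexp {^[\*].*$} $str]} {set state header; continue}
            # again, parse out what you need, lappend it to savedata
        }
    }
}
close $fid

set fid [open ${filename}.out w]
foreach str $savedata {puts $fid $str}
close $fid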
You could, of course, write the output file concurrently with parsing. You can also accumulate statistics and write them out. Whatever you need. It's all reasonably straight-forward.
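Accumulating statistics could be as simple as counting Passed/Failed per step. This is a sketch only: tallyResults is a hypothetical name, it assumes Tcl 8.5 or later (for dict), and it assumes result lines look like "Step name: Passed" as in the sample:

```tcl
# Hypothetical helper: count how often each step passed or failed.
# Returns a dict keyed by a two-element list {step result}.
# Assumes Tcl 8.5+ for the dict command.
proc tallyResults {lines} {
    set tally [dict create]
    foreach line $lines {
        if {[regexp {^(.+):[ \t]*(Passed|Failed)$} \
                [string trim $line] -> step result]} {
            # dict incr treats a missing key as 0
            dict incr tally [list $step $result]
        }
    }
    return $tally
}
```

A loop such as `dict for {key n} [tallyResults $lines] {puts "[join $key \t]\t$n"}` would then write the counts as tab-separated text.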
All kinds of courses of action. Is there a website similar to www.excelforum where TCL programmers meet to discuss problems and TCL related goodies? I'm looking for a list of previously answered TCL questions that I can use as a reference.
ajocius wrote:
> All kinds of courses of action. Is there a website similar to
> www.excelforum where TCL programmers meet to discuss problems and TCL
> related goodies. I'm looking for a list of previously answered TCL
> questions that I can use as a reference.
I'm not a TCL guru; but I'm not sure TCL is the best choice for what you want. I'd use a high-level declarative language such as Lisp or Qi. Prolog is good for parsing, but I'm not sure how well it would take to 10^6 lines of text.
IMHO I'd use Lisp or Qi and learn about streams. This will allow you to read and parse more-or-less synchronously without having to read in the whole shebang. Lisp is more mainstream, Qi is written in Lisp and has patterns which makes life easier. See www.lambdassociates.org for Qi which also has a link somewhere to CLisp.
In article <1131816022.499288.15...@g43g2000cwa.googlegroups.com>,
Mark Tarver <dr.mtar...@ukonline.co.uk> wrote:
>I'm not a TCL guru; but I'm not sure TCL is the
>best choice for what you want. I'd use a high-level
>declarative language such as Lisp or Qi. Prolog is
>good for parsing, but I'm not sure how well it would
>take to 10^6 lines of text.
>
>IMHO I'd use Lisp or Qi and learn about streams.
>This will allow you to read and parse more-or-less
>synchronously without having to read in the whole
>shebang. Lisp is more mainstream, Qi is written
>in Lisp and has patterns which makes life easier.
>See www.lambdassociates.org for Qi which also
>has a link somewhere to CLisp.
. . . If you're going in *that* direction, why not Snobol? Or, perhaps more soberly, Icon?
ajocius <ajoc...@insightbb.com> wrote:
>Jeff and Group,
> Awesome, below you will find a portion of a sample text file that I
>get from my tester. The file can be called almost anything, so ideally I
>would have to select the file. The information between asterisks ***** is
>information that identifies my product's serial number, Date, UUT result,
>and time. This information is important; everything else between the
>asterisks is not needed. Then what follows is somewhat disjoint. Any line
>with a DONE is to be discarded, any delay is to be discarded, anything skipped
>is discarded and on and on. Then I'm looking to save the limits and the
>results (Pass or Fail). Instead of saving in the present format, I want to
>save it in columns. Eventually I will have thousands of each product and I
>need to perform statistical analysis (to be done in Excel). So, below is a
>sample of what I'm looking to get. I hope this example is viewable
>correctly in the group window.
>
>Expected parsed text:
>Serial Number Time Execution Time UUT Result DIM Result Measurement Low High Check Software ID CD Model:....... etc
>89FCHWM153011072 2:53:51 PM 362.8 s Failed Passed 1 1 1 Failed..................
>89FCHWM153011068 3:00:00 PM 240.2 s Passed Passed 1 1 1 Passed..............
Many good suggestions here..
Here's (hopefully) one more.
If you're looking to skip/ignore certain text, try:
Darren New wrote:
> MH wrote:
> > grep for windows
> It's called "find" on Windows.
Ugh. On XP at least, it's much nicer to use "findstr". Or, better yet, download a copy of "grep" from, say, http://unxutils.sourceforge.net/ (which I can recommend).
Bob
--
Bob Techentin                        techentin.rob...@NOSPAMmayo.edu
Mayo Foundation                                       (507) 538-5495
200 First St. SW                                  FAX (507) 284-9171
Rochester MN, 55901 USA                   http://www.mayo.edu/sppdg/