Speedup of regexp required for Tcl-Scanner

Dr. Detlef Groth

unread,

Apr 29, 2008, 12:38:33 AM4/29/08

to

Hello all,

I am currently trying to speedup my regexp searches. I saw that the
tcl-regexp are much slower than perl ones. Does anybody have ideas to
improve performance on this code ? It requires about 50 seconds on my
single core machine with about 1.5Ghz for a 5Mb-file.

I put an example inputfile here: http://www.dgroth.de/sample.blast

Here is the code:

namespace eval SampleScanner2 {
variable state INITIAL
}
proc SampleScanner2::BEGIN {nstate} {
variable state
set state $nstate
}
proc SampleScanner2::yylex {yyin} {
variable state
set reg0 {^T?BLAST[XPN]}
set reg1 {^Query=[ ]+[^\s]+}
set reg2 {^Sequences[ ]producing[ ]significant[ ]alignments}
set reg3 {^>[^\s]+}
set reg4 {^.|\n}
set bsize 128
set yychar 0
set buffer [read $yyin $bsize]
set yyleng 0
while {1} {
set yy_rule -1
incr yychar $yyleng
set buffer [string range $buffer $yyleng end]
if {[string length $buffer] < $bsize} {
set nbuff [read $yyin $bsize]
if {[string length $nbuff] == 0 && [string length $buffer]
== 0} {
break
}

append buffer $nbuff
}
if {[regexp -- $reg0 $buffer yytext]} {
set yy_rule 0
} elseif {$state eq "XSTATE" && [regexp -- $reg1 $buffer
yytext]} {
set yy_rule 1
} elseif {$state eq "XSTATE" && [regexp -- $reg2 $buffer
yytext]} {
set yy_rule 2
} elseif {$state eq "SSTATE" && [regexp -- $reg3 $buffer
yytext]} {
set yy_rule 3
} elseif {[regexp -line -- $reg4 $buffer yytext]} {
set yy_rule 4
} else {
set yytext [string range $buffer 0 1]
}
set yyleng [string length $yytext]
switch -- $yy_rule {
0 {
puts $yytext
BEGIN XSTATE
}
1 {
puts -nonewline [string range $yytext 7 end]
}
2 {
puts " has alignments"
BEGIN SSTATE
}
3 {
puts -nonewline "id is: "
puts [string range $yytext 1 end]
BEGIN INITIAL
}
4 {
# do nothing
}
default
{
puts -nonewline $yytext
}
}
}
return 0
}
if {$argv0 eq [info script]} {
set start [clock seconds]
if {[llength $argv] == 0} {
puts stderr "Usage: [info script] \[-options values\]
<infile>"
exit 0
}

if {[catch {open [lindex $argv end] r} yyin]} {
puts stderr "Could not open [lindex $argv end]"
exit 0
}
if {[expr [llength [lrange $argv 1 end]] % 2] != 0} {
puts stderr "submitting uneven number of arguments"
}
SampleScanner2::yylex $yyin
close $yyin
set end [clock seconds]
set secs [expr {$end-$start}]
puts "required $secs seconds"
}

regards,
Detlef

EL

unread,

Apr 29, 2008, 2:02:06 AM4/29/08

to

Dr. Detlef Groth schrieb:

> I am currently trying to speedup my regexp searches. I saw that the
> tcl-regexp are much slower than perl ones.

Hmm... unfortunately I lost the link, but not long ago somebody
mentioned here exactly the opposite: Tcl's regexp engine is the fastest
one that is available.

> Does anybody have ideas to
> improve performance on this code ? It requires about 50 seconds on my
> single core machine with about 1.5Ghz for a 5Mb-file.
>
> I put an example inputfile here: http://www.dgroth.de/sample.blast

Why don't you use the XML output of BLAST and then parse the records
with SAX (or tDOM, if appropriate)? Or split them up, put them into a
database and use a combination of text search and SQL queries... if
speed matters over simplicify? (Well, to put BLAST records into a
database is not /that/ complicated. BioSQL [1] might help).
I could imagine that this gives you much better performance than regexp
searches on huge text files.

[1] http://www.biosql.org/

--
Eckhard

Dr. Detlef Groth

unread,

Apr 29, 2008, 2:33:04 AM4/29/08

to

Hi Eckard,

for the BLAST-case your alternative is perfectly but I am as well
interested in parsing, scanning output which is not xml. The blast was
just a dummy example. BioSQL btw seem currently not support SQLite
which is my prefered choice.

regards,
Detef

EL

unread,

Apr 29, 2008, 3:16:22 AM4/29/08

to

Dr. Detlef Groth schrieb:

> for the BLAST-case your alternative is perfectly but I am as well
> interested in parsing, scanning output which is not xml. The blast was
> just a dummy example.

Ok... But the Tcl regexp engine is probably good for this as well. If I
only had the link to this page which confirms that Tcl regexp's are fast...

> BioSQL btw seem currently not support SQLite
> which is my prefered choice.

Hmm, with the amount of data from sequences, BLAST's and annotations I
doubt that SQLite is a good fit. I would rather use Postgresql or
Oracle, which are both supported by BioSQL (and in addition have
excellent Tcl support).

--
Eckhard

SimonG

unread,

Apr 29, 2008, 4:23:10 AM4/29/08

to

On Apr 29, 8:16 am, EL <eckhardnos...@gmx.de> wrote:
> Dr. Detlef Groth schrieb:
>
> > for the BLAST-case your alternative is perfectly but I am as well
> > interested in parsing, scanning output which is not xml. The blast was
> > just a dummy example.
>
> Ok... But the Tcl regexp engine is probably good for this as well. If I
> only had the link to this page which confirms that Tcl regexp's are fast...
>

Perhaps you're thinking of

http://shootout.alioth.debian.org/gp4/benchmark.php?test=regexdna&lang=all

Simon Geard

Alexandre Ferrieux

unread,

Apr 29, 2008, 4:37:24 AM4/29/08

to

On Apr 29, 8:33 am, "Dr. Detlef Groth" <dgr...@gmx.de> wrote:
> Hi Eckard,
>
> for the BLAST-case your alternative is perfectly but I am as well
> interested in parsing, scanning output which is not xml. The blast was
> just a dummy example. BioSQL btw seem currently not support SQLite
> which is my prefered choice.
>

As far as regexp performance is concerned, please notice that the raw
metal will be exposed only if you crunch big strings through a regexp.
Otherwise, the exquisite efficiency of Tcl's DFA (vs. NFA for most
other environments) will be completely drowned in the overhead of hand-
cutting small chunks of data.

-Alex

rf

unread,

Apr 29, 2008, 5:04:49 AM4/29/08

to

On 29 Apr., 06:38, "Dr. Detlef Groth" <dgr...@gmx.de> wrote:
>... trying to speedup my regexp searches

my idea is to use a bigger value for bsize or not to use bsize at all,
if memory is not the bottleneck, and to proceed in buffer with "regexp
-start $yyleng -- ..."
in reg2 the brackets may be omitted

Roland Frank

Dr. Detlef Groth

unread,

Apr 29, 2008, 5:06:11 AM4/29/08

to

> http://shootout.alioth.debian.org/gp4/benchmark.php?test=regexdna&lan...
>
> Simon Geard

After writing the same application with the same logic in Perl it
indeed does not seem to make a difference between Perl and Tcl. I
certainly can not post the Perl-code here ?

with bsize=128
$ perl SampleScanner2.pl .sample.blast > /dev/null
required 56 seconds

$ /c/Tclkit/tclkitsh852.exe SampleScanner2.tcl sample.blast > /dev/
null
required 58 seconds

Just by increasing the bsize variable Perl becomes better:

bsize=120*40
$ perl SampleScanner2.pl .sample.blast > /dev/null
required 128 seconds

$ /c/Tclkit/tclkitsh852.exe SampleScanner2.tcl sample.blast > /dev/
null
required 172 seconds

So the only performance solution might be to implement some parts in
C. But will
Tcl_RegExp indeed perform better than regexp ?

regards,
Detlef

Dr. Detlef Groth

unread,

Apr 29, 2008, 5:08:50 AM4/29/08

to

On 29 Apr., 10:37, Alexandre Ferrieux <alexandre.ferri...@gmail.com>
wrote:

That's what I was thought first, but in fact the performance is better
with smaller strings and more read-operations. With bsize=128 we are
twice as fast as with bsize=40*128. Furthermore if I uncomment the
regexp-statements it seems that 80% till 90% of processing time
belongs to the regexp part.

regards,
Detlef

Dr. Detlef Groth

unread,

Apr 29, 2008, 5:15:52 AM4/29/08

to

Thanks for the suggestion, that's the way my ifickle http://wiki.tcl.tk/14292
is working. However the performance is slightly lower with the ifickle/
fickle approach. But I will check it again.

Detlef

Alexandre Ferrieux

unread,

Apr 29, 2008, 5:23:40 AM4/29/08

to

Can you isolate one single large string and one single regexp
exhibiting this pathological slowness ?

-Alex

EL

unread,

Apr 29, 2008, 6:28:43 AM4/29/08

to

Dr. Detlef Groth schrieb:

> So the only performance solution might be to implement some parts in
> C. But will
> Tcl_RegExp indeed perform better than regexp ?

Probably, since you do not only [regexp] but also read/write, set/unset
and lots of other operations... on the whole file and in Tcl.
That's why I was proposing the DB solution. Postgresql has Tcl's regexp
engine built in and is certainly better optimized for fast data access
than text files. As other DBMS too.

--
Eckhard

Dr. Detlef Groth

unread,

Apr 29, 2008, 7:00:58 AM4/29/08

to

Probably not. As I need this chunked behaviour and stepping through
different states.
The code on top should be fully functional.
It seems also that it is not pathological slow but it is what can be
currently achieved with plain tcl. A boost I achieved by changing reg4
line from
set reg4 {^.|\n}
to:
set reg4 {^[^\n]+|\n}
to grab as much as possible
This finally gives me one order of magnitude. As I can see now:

3 seconds with 128 byte and 10 seconds with 40*128 bytes. So I should
try to grab to the end of line if possible.

Thanks,
Detlef

Dr. Detlef Groth

unread,

Apr 29, 2008, 7:06:14 AM4/29/08

to

Totally agree that after initial scanning/parsing the best way to
retrieve the data is from a RDBMS-databases.
See one of my presentations to this issue:
http://bioscanners.sourceforge.net/docs/presentations/dg-affinity.html

regards,
Detlef

rf

unread,

Apr 29, 2008, 7:17:16 AM4/29/08

to

On 29 Apr., 11:04, rf <roland.fr...@free.fr> wrote:

set all [read $yyin]
foreach buffer [split $all \n] {
# is still worth an attempt
}

rf

unread,

Apr 29, 2008, 7:23:31 AM4/29/08

to

there is some improvement using full buffer versus line or line-like
processing
as with

set all [read $yyin]
foreach buffer [split $all \n] {

#main loop
}

Roland

Dr. Detlef Groth

unread,

Apr 29, 2008, 9:07:21 AM4/29/08

to

Unfortunately I can't read all input into one variable as typical user
input in bioinformatics can be several Gigabytes large.

Detlef

Dr. Detlef Groth

unread,

Apr 29, 2008, 9:29:33 AM4/29/08

to

Just a small demo that

regexp might be the slower part:

proc timeit {txt cmd} {
set num 100
set count 1000
set run {}
for {set n 0} {$n < 100} {incr n} {
lappend run $cmd
}
set val [uplevel 1 [list time \
[join $run {; }] $count]]
set tmp [lreplace $val 0 0 \
[expr {[lindex $val 0]/(1.0*$num)}]]
puts "$txt: [lrange $tmp 0 1]"
}

set str {
BLASTP 2.2.17 [Aug-19-2007]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A.
Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database
search
programs", Nucleic Acids Res. 25:3389-3402.

Reference for composition-based statistics:
Schaffer, Alejandro A., L. Aravind, Thomas L. Madden,
Sergei Shavirin, John L. Spouge, Yuri I. Wolf,
Eugene V. Koonin, and Stephen F. Altschul (2001),
"Improving the accuracy of PSI-BLAST protein database searches with
composition-based statistics and other refinements", Nucleic Acids
Res. 29:2994-3005.

Query= tr|Q5DDY0|Q5DDY0_SCHJA SJCHGC06319 protein - Schistosoma
japonicum (Blood fluke).
}
proc s2 {} {
global str
set state 1
#if {$state == 1} {
# set x true
#}
regexp -nocase -- {^T?BLAST[XNP]} $str
}
proc dummy {} {}
set reg0 {^T?BLAST[XNP]}
timeit "regexp" { regexp -- $reg0 $str }
timeit "regexp" { regexp -- $reg0 $str }
timeit "regexp -nocase" { regexp -nocase -- $reg0 $str }
timeit "regexp -nocase" { regexp -nocase -- $reg0 $str }
timeit "regexp -nocase -start 10 " { regexp -nocase -start 10 --
$reg0 $str }
timeit "regexp -nocase -start 10 " { regexp -nocase -start 10 --
$reg0 $str }
timeit "regexp -nocase -start 10 " { regexp -nocase -start 10 --
$reg0 $str }
timeit "string length" { string length $str }
timeit "string range" { string range $str 5 end-1 }
timeit "compare numbers" { if {3 == 3} { } }
timeit "compare strings" { if {"string" eq "string" } {} }
timeit "proc first" { dummy }
timeit "proc second" { dummy }

gives us:
regexp: 3.4902699999999998 microseconds
regexp: 2.28327 microseconds
regexp -nocase: 2.36895 microseconds
regexp -nocase: 4.01335 microseconds
regexp -nocase -start 10 : 2.54778 microseconds
regexp -nocase -start 10 : 2.55221 microseconds
regexp -nocase -start 10 : 2.44581 microseconds
string length: 0.35381 microseconds
string range: 1.67269 microseconds
compare numbers: 0.0565 microseconds
compare strings: 0.05791 microseconds
proc first: 0.57409 microseconds
proc second: 0.5254099999999999 microseconds

regards,
Detlef

Dr. Detlef Groth

unread,

Apr 29, 2008, 9:52:05 AM4/29/08

to

On 29 Apr., 11:15, "Dr. Detlef Groth" <dgr...@gmx.de> wrote:
> On 29 Apr., 11:04, rf <roland.fr...@free.fr> wrote:
>
> > On 29 Apr., 06:38, "Dr. Detlef Groth" <dgr...@gmx.de> wrote:
>
> > >... trying to speedup my regexp searches
>
> > my idea is to use a bigger value for bsize or not to use bsize at all,
> > if memory is not the bottleneck, and to proceed in buffer with "regexp
> > -start $yyleng -- ..."
> > in reg2 the brackets may be omitted
>
> > Roland Frank
>

Finally I got it working with the start option and indeed the code is
three times as fast as my old code. Here is the revised version:

namespace eval SampleScanner3 {
variable state INITIAL
}
proc SampleScanner3::BEGIN {nstate} {

variable state
set state $nstate
}

proc SampleScanner3::yylex {yyin} {
variable state
set reg0 {\AT?BLAST[XPN]}
set reg1 {\AQuery=[ ]+[^\s]+}
set reg2 {\ASequences[ ]producing[ ]significant[ ]alignments}
set reg3 {\A>[^\s]+}
set reg4 {\A[^\n]+|\n}
set bsize [expr 128*40]

set yychar 0
set buffer [read $yyin $bsize]
set yyleng 0

set yyindex 0
set yytext ""

while {1} {
set yy_rule -1

if {[expr {[string length $buffer] - $yyindex}] < $bsize} {
set buffer [string range $buffer $yyindex end]

set nbuff [read $yyin $bsize]
if {[string length $nbuff] == 0 && [string length $buffer]
== 0} {
break
}
append buffer $nbuff

set yyindex 0
}
if {[regexp -start $yyindex -- $reg0 $buffer yytext]} {
set yy_rule 0
} elseif {$state eq "XSTATE" && [regexp -start $yyindex --

$reg1 $buffer yytext]} {
set yy_rule 1

} elseif {$state eq "XSTATE" && [regexp -start $yyindex --

$reg2 $buffer yytext]} {
set yy_rule 2

} elseif {$state eq "SSTATE" && [regexp -start $yyindex --

$reg3 $buffer yytext]} {
set yy_rule 3

} elseif {[regexp -start $yyindex -- $reg4 $buffer yytext]} {
set yy_rule 4
} else {
set yytext [string range $buffer $yyindex [expr {$yyindex
+1}]]

}
set yyleng [string length $yytext]

incr yyindex $yyleng
incr yychar $yyleng

switch -- $yy_rule {
0 {
puts $yytext
BEGIN XSTATE
}
1 {
puts -nonewline [string range $yytext 7 end]
}
2 {
puts " has alignments"
BEGIN SSTATE
}
3 {
puts -nonewline "id is: "
puts [string range $yytext 1 end]
BEGIN INITIAL
}
4 {

#puts -nonewline "..$yytext" ;# do nothing

}
default
{
puts -nonewline $yytext
}
}
}
return 0
}
if {$argv0 eq [info script]} {
set start [clock seconds]
if {[llength $argv] == 0} {
puts stderr "Usage: [info script] \[-options values\]
<infile>"
exit 0
}

if {[catch {open [lindex $argv end] r} yyin]} {
puts stderr "Could not open [lindex $argv end]"
exit 0
}
if {[expr [llength [lrange $argv 1 end]] % 2] != 0} {
puts stderr "submitting uneven number of arguments"
}

SampleScanner3::yylex $yyin

close $yyin
set end [clock seconds]
set secs [expr {$end-$start}]

puts stderr "required $secs seconds"
}

Together with the optimized reg4 regexp it parses the 5Mb-Blastfile in
about 1-2 seconds.
Finally I know how to rebuild the scanner generator. Thanks to all.
regards,
Detlef

Alexandre Ferrieux

unread,

Apr 29, 2008, 10:06:17 AM4/29/08

to

On Apr 29, 1:00 pm, "Dr. Detlef Groth" <dgr...@gmx.de> wrote:
> Probably not. As I need this chunked behaviour and stepping through
> different states.
> The code on top should be fully functional.

Sorry but I won't be diving into this. I was trying to help you
separate the part spent in the real regexp engine from the part spent
outside. Without this analysis you cannot require a "speedup of
regexp" as the title says.

> It seems also that it is not pathological slow but it is what can be
> currently achieved with plain tcl. A boost I achieved by changing reg4
> line from
> set reg4 {^.|\n}
> to:
> set reg4 {^[^\n]+|\n}
> to grab as much as possible
> This finally gives me one order of magnitude. As I can see now:
>
> 3 seconds with 128 byte and 10 seconds with 40*128 bytes. So I should
> try to grab to the end of line if possible.

Whatever. It's not clear whether you are requesting assistance for
optimization of your regexps or trying to draw attention to an
unexpected algorithmic worst case.

In any case, from a macroscopic look, I have the impression that
you're trying to reimplement line-oriented prefix parsing (considering
the frequency of ^ and \n in your regexp) within an awkward block-
based parsing loop. I'd suggest instead:

while {[gets stdin line]>=0} {
switch -regexp -- $line {
{^Someprefix} {...}
...
}
}

-Alex

Dr. Detlef Groth

unread,

Apr 29, 2008, 10:15:40 AM4/29/08

to

> Whatever. It's not clear whether you are requesting assistance for
> optimization of your regexps or trying to draw attention to an
> unexpected algorithmic worst case.
>
> In any case, from a macroscopic look, I have the impression that
> you're trying to reimplement line-oriented prefix parsing (considering
> the frequency of ^ and \n in your regexp) within an awkward block-
> based parsing loop. I'd suggest instead:
>
> while {[gets stdin line]>=0} {
> switch -regexp -- $line {
> {^Someprefix} {...}
> ...
> }
> }
>
> -Alex
>

I can't use line based parsing with switch, as I might catch two
different regexp on the same line. Anyway with some ideas gathered
here we get a performance boost from about 50 seconds to one or two
seconds for a 5Mb file. The principle idea is to find a template where
a Flex-like input file can't be used to fill this to generate a
scanner.

regards,
detlef

Donal K. Fellows

unread,

Apr 29, 2008, 11:29:33 AM4/29/08

to

Dr. Detlef Groth wrote:
> I am currently trying to speedup my regexp searches. I saw that the
> tcl-regexp are much slower than perl ones. Does anybody have ideas to
> improve performance on this code ? It requires about 50 seconds on my
> single core machine with about 1.5Ghz for a 5Mb-file.

Just for the record, which version of Tcl are you using? If you're on
8.4.*, you'll definitely want to think about upgrading to the 8.5 series.

Donal.

Donal K. Fellows

unread,

Apr 29, 2008, 11:35:22 AM4/29/08

to

rf wrote:
> set all [read $yyin]
> foreach buffer [split $all \n] {
> # is still worth an attempt
> }

He said he was working with BLAST databases. They can be *very* large,
many times available memory. (Not as bad as Earth Sciences datasets, but
still massive.)

Donal.

Alexandre Ferrieux

unread,

Apr 29, 2008, 12:00:50 PM4/29/08

to

On Apr 29, 4:15 pm, "Dr. Detlef Groth" <dgr...@gmx.de> wrote:
>
> I can't use line based parsing with switch, as I might catch two
> different regexp on the same line. Anyway with some ideas gathered
> here we get a performance boost from about 50 seconds to one or two
> seconds for a 5Mb file. The principle idea is to find a template where
> a Flex-like input file can't be used to fill this to generate a
> scanner.

Well, I'm glad for you, but with a vague overview like this one you're
not giving back much :-( I would have been glad to hunt down any
algorithmic or architectural issue, but I'm not doing this in the
dark.

-Alex

Cameron Laird

unread,

Apr 29, 2008, 7:41:38 PM4/29/08

to

In article <fv6dkv$qs6$00$1...@news.t-online.com>,
EL <eckhar...@gmx.de> wrote:
.
.

.
>Hmm... unfortunately I lost the link, but not long ago somebody
>mentioned here exactly the opposite: Tcl's regexp engine is the fastest
>one that is available.

.
.
.
In some, though not, all ways: <URL:
http://www.unixreview.com/documents/s=10121/ur0702e/ >.