I am currently trying to speedup my regexp searches. I saw that the
tcl-regexp are much slower than perl ones. Does anybody have ideas to
improve performance on this code ? It requires about 50 seconds on my
single core machine with about 1.5Ghz for a 5Mb-file.
I put an example inputfile here: http://www.dgroth.de/sample.blast
Here is the code:
namespace eval SampleScanner2 {
variable state INITIAL
}
proc SampleScanner2::BEGIN {nstate} {
variable state
set state $nstate
}
proc SampleScanner2::yylex {yyin} {
variable state
set reg0 {^T?BLAST[XPN]}
set reg1 {^Query=[ ]+[^\s]+}
set reg2 {^Sequences[ ]producing[ ]significant[ ]alignments}
set reg3 {^>[^\s]+}
set reg4 {^.|\n}
set bsize 128
set yychar 0
set buffer [read $yyin $bsize]
set yyleng 0
while {1} {
set yy_rule -1
incr yychar $yyleng
set buffer [string range $buffer $yyleng end]
if {[string length $buffer] < $bsize} {
set nbuff [read $yyin $bsize]
if {[string length $nbuff] == 0 && [string length $buffer]
== 0} {
break
}
append buffer $nbuff
}
if {[regexp -- $reg0 $buffer yytext]} {
set yy_rule 0
} elseif {$state eq "XSTATE" && [regexp -- $reg1 $buffer
yytext]} {
set yy_rule 1
} elseif {$state eq "XSTATE" && [regexp -- $reg2 $buffer
yytext]} {
set yy_rule 2
} elseif {$state eq "SSTATE" && [regexp -- $reg3 $buffer
yytext]} {
set yy_rule 3
} elseif {[regexp -line -- $reg4 $buffer yytext]} {
set yy_rule 4
} else {
set yytext [string range $buffer 0 1]
}
set yyleng [string length $yytext]
switch -- $yy_rule {
0 {
puts $yytext
BEGIN XSTATE
}
1 {
puts -nonewline [string range $yytext 7 end]
}
2 {
puts " has alignments"
BEGIN SSTATE
}
3 {
puts -nonewline "id is: "
puts [string range $yytext 1 end]
BEGIN INITIAL
}
4 {
# do nothing
}
default
{
puts -nonewline $yytext
}
}
}
return 0
}
if {$argv0 eq [info script]} {
set start [clock seconds]
if {[llength $argv] == 0} {
puts stderr "Usage: [info script] \[-options values\]
<infile>"
exit 0
}
if {[catch {open [lindex $argv end] r} yyin]} {
puts stderr "Could not open [lindex $argv end]"
exit 0
}
if {[expr [llength [lrange $argv 1 end]] % 2] != 0} {
puts stderr "submitting uneven number of arguments"
}
SampleScanner2::yylex $yyin
close $yyin
set end [clock seconds]
set secs [expr {$end-$start}]
puts "required $secs seconds"
}
regards,
Detlef
> I am currently trying to speedup my regexp searches. I saw that the
> tcl-regexp are much slower than perl ones.
Hmm... unfortunately I lost the link, but not long ago somebody
mentioned here exactly the opposite: Tcl's regexp engine is the fastest
one that is available.
> Does anybody have ideas to
> improve performance on this code ? It requires about 50 seconds on my
> single core machine with about 1.5Ghz for a 5Mb-file.
>
> I put an example inputfile here: http://www.dgroth.de/sample.blast
Why don't you use the XML output of BLAST and then parse the records
with SAX (or tDOM, if appropriate)? Or split them up, put them into a
database and use a combination of text search and SQL queries... if
speed matters over simplicify? (Well, to put BLAST records into a
database is not /that/ complicated. BioSQL [1] might help).
I could imagine that this gives you much better performance than regexp
searches on huge text files.
--
Eckhard
for the BLAST-case your alternative is perfectly but I am as well
interested in parsing, scanning output which is not xml. The blast was
just a dummy example. BioSQL btw seem currently not support SQLite
which is my prefered choice.
regards,
Detef
> for the BLAST-case your alternative is perfectly but I am as well
> interested in parsing, scanning output which is not xml. The blast was
> just a dummy example.
Ok... But the Tcl regexp engine is probably good for this as well. If I
only had the link to this page which confirms that Tcl regexp's are fast...
> BioSQL btw seem currently not support SQLite
> which is my prefered choice.
Hmm, with the amount of data from sequences, BLAST's and annotations I
doubt that SQLite is a good fit. I would rather use Postgresql or
Oracle, which are both supported by BioSQL (and in addition have
excellent Tcl support).
--
Eckhard
Perhaps you're thinking of
http://shootout.alioth.debian.org/gp4/benchmark.php?test=regexdna&lang=all
Simon Geard
As far as regexp performance is concerned, please notice that the raw
metal will be exposed only if you crunch big strings through a regexp.
Otherwise, the exquisite efficiency of Tcl's DFA (vs. NFA for most
other environments) will be completely drowned in the overhead of hand-
cutting small chunks of data.
-Alex
my idea is to use a bigger value for bsize or not to use bsize at all,
if memory is not the bottleneck, and to proceed in buffer with "regexp
-start $yyleng -- ..."
in reg2 the brackets may be omitted
Roland Frank
After writing the same application with the same logic in Perl it
indeed does not seem to make a difference between Perl and Tcl. I
certainly can not post the Perl-code here ?
with bsize=128
$ perl SampleScanner2.pl .sample.blast > /dev/null
required 56 seconds
$ /c/Tclkit/tclkitsh852.exe SampleScanner2.tcl sample.blast > /dev/
null
required 58 seconds
Just by increasing the bsize variable Perl becomes better:
bsize=120*40
$ perl SampleScanner2.pl .sample.blast > /dev/null
required 128 seconds
$ /c/Tclkit/tclkitsh852.exe SampleScanner2.tcl sample.blast > /dev/
null
required 172 seconds
So the only performance solution might be to implement some parts in
C. But will
Tcl_RegExp indeed perform better than regexp ?
regards,
Detlef
That's what I was thought first, but in fact the performance is better
with smaller strings and more read-operations. With bsize=128 we are
twice as fast as with bsize=40*128. Furthermore if I uncomment the
regexp-statements it seems that 80% till 90% of processing time
belongs to the regexp part.
regards,
Detlef
Thanks for the suggestion, that's the way my ifickle http://wiki.tcl.tk/14292
is working. However the performance is slightly lower with the ifickle/
fickle approach. But I will check it again.
Detlef
Can you isolate one single large string and one single regexp
exhibiting this pathological slowness ?
-Alex
> So the only performance solution might be to implement some parts in
> C. But will
> Tcl_RegExp indeed perform better than regexp ?
Probably, since you do not only [regexp] but also read/write, set/unset
and lots of other operations... on the whole file and in Tcl.
That's why I was proposing the DB solution. Postgresql has Tcl's regexp
engine built in and is certainly better optimized for fast data access
than text files. As other DBMS too.
--
Eckhard
3 seconds with 128 byte and 10 seconds with 40*128 bytes. So I should
try to grab to the end of line if possible.
Thanks,
Detlef
regards,
Detlef
set all [read $yyin]
foreach buffer [split $all \n] {
# is still worth an attempt
}
Roland
Unfortunately I can't read all input into one variable as typical user
input in bioinformatics can be several Gigabytes large.
Detlef
regexp might be the slower part:
proc timeit {txt cmd} {
set num 100
set count 1000
set run {}
for {set n 0} {$n < 100} {incr n} {
lappend run $cmd
}
set val [uplevel 1 [list time \
[join $run {; }] $count]]
set tmp [lreplace $val 0 0 \
[expr {[lindex $val 0]/(1.0*$num)}]]
puts "$txt: [lrange $tmp 0 1]"
}
set str {
BLASTP 2.2.17 [Aug-19-2007]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A.
Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database
search
programs", Nucleic Acids Res. 25:3389-3402.
Reference for composition-based statistics:
Schaffer, Alejandro A., L. Aravind, Thomas L. Madden,
Sergei Shavirin, John L. Spouge, Yuri I. Wolf,
Eugene V. Koonin, and Stephen F. Altschul (2001),
"Improving the accuracy of PSI-BLAST protein database searches with
composition-based statistics and other refinements", Nucleic Acids
Res. 29:2994-3005.
Query= tr|Q5DDY0|Q5DDY0_SCHJA SJCHGC06319 protein - Schistosoma
japonicum (Blood fluke).
}
proc s2 {} {
global str
set state 1
#if {$state == 1} {
# set x true
#}
regexp -nocase -- {^T?BLAST[XNP]} $str
}
proc dummy {} {}
set reg0 {^T?BLAST[XNP]}
timeit "regexp" { regexp -- $reg0 $str }
timeit "regexp" { regexp -- $reg0 $str }
timeit "regexp -nocase" { regexp -nocase -- $reg0 $str }
timeit "regexp -nocase" { regexp -nocase -- $reg0 $str }
timeit "regexp -nocase -start 10 " { regexp -nocase -start 10 --
$reg0 $str }
timeit "regexp -nocase -start 10 " { regexp -nocase -start 10 --
$reg0 $str }
timeit "regexp -nocase -start 10 " { regexp -nocase -start 10 --
$reg0 $str }
timeit "string length" { string length $str }
timeit "string range" { string range $str 5 end-1 }
timeit "compare numbers" { if {3 == 3} { } }
timeit "compare strings" { if {"string" eq "string" } {} }
timeit "proc first" { dummy }
timeit "proc second" { dummy }
gives us:
regexp: 3.4902699999999998 microseconds
regexp: 2.28327 microseconds
regexp -nocase: 2.36895 microseconds
regexp -nocase: 4.01335 microseconds
regexp -nocase -start 10 : 2.54778 microseconds
regexp -nocase -start 10 : 2.55221 microseconds
regexp -nocase -start 10 : 2.44581 microseconds
string length: 0.35381 microseconds
string range: 1.67269 microseconds
compare numbers: 0.0565 microseconds
compare strings: 0.05791 microseconds
proc first: 0.57409 microseconds
proc second: 0.5254099999999999 microseconds
regards,
Detlef
namespace eval SampleScanner3 {
variable state INITIAL
}
proc SampleScanner3::BEGIN {nstate} {
variable state
set state $nstate
}
proc SampleScanner3::yylex {yyin} {
variable state
set reg0 {\AT?BLAST[XPN]}
set reg1 {\AQuery=[ ]+[^\s]+}
set reg2 {\ASequences[ ]producing[ ]significant[ ]alignments}
set reg3 {\A>[^\s]+}
set reg4 {\A[^\n]+|\n}
set bsize [expr 128*40]
set yychar 0
set buffer [read $yyin $bsize]
set yyleng 0
set yyindex 0
set yytext ""
while {1} {
set yy_rule -1
if {[expr {[string length $buffer] - $yyindex}] < $bsize} {
set buffer [string range $buffer $yyindex end]
set nbuff [read $yyin $bsize]
if {[string length $nbuff] == 0 && [string length $buffer]
== 0} {
break
}
append buffer $nbuff
set yyindex 0
}
if {[regexp -start $yyindex -- $reg0 $buffer yytext]} {
set yy_rule 0
} elseif {$state eq "XSTATE" && [regexp -start $yyindex --
$reg1 $buffer yytext]} {
set yy_rule 1
} elseif {$state eq "XSTATE" && [regexp -start $yyindex --
$reg2 $buffer yytext]} {
set yy_rule 2
} elseif {$state eq "SSTATE" && [regexp -start $yyindex --
$reg3 $buffer yytext]} {
set yy_rule 3
} elseif {[regexp -start $yyindex -- $reg4 $buffer yytext]} {
set yy_rule 4
} else {
set yytext [string range $buffer $yyindex [expr {$yyindex
+1}]]
}
set yyleng [string length $yytext]
incr yyindex $yyleng
incr yychar $yyleng
switch -- $yy_rule {
0 {
puts $yytext
BEGIN XSTATE
}
1 {
puts -nonewline [string range $yytext 7 end]
}
2 {
puts " has alignments"
BEGIN SSTATE
}
3 {
puts -nonewline "id is: "
puts [string range $yytext 1 end]
BEGIN INITIAL
}
4 {
#puts -nonewline "..$yytext" ;# do nothing
}
default
{
puts -nonewline $yytext
}
}
}
return 0
}
if {$argv0 eq [info script]} {
set start [clock seconds]
if {[llength $argv] == 0} {
puts stderr "Usage: [info script] \[-options values\]
<infile>"
exit 0
}
if {[catch {open [lindex $argv end] r} yyin]} {
puts stderr "Could not open [lindex $argv end]"
exit 0
}
if {[expr [llength [lrange $argv 1 end]] % 2] != 0} {
puts stderr "submitting uneven number of arguments"
}
SampleScanner3::yylex $yyin
close $yyin
set end [clock seconds]
set secs [expr {$end-$start}]
puts stderr "required $secs seconds"
}
Together with the optimized reg4 regexp it parses the 5Mb-Blastfile in
about 1-2 seconds.
Finally I know how to rebuild the scanner generator. Thanks to all.
regards,
Detlef
Sorry but I won't be diving into this. I was trying to help you
separate the part spent in the real regexp engine from the part spent
outside. Without this analysis you cannot require a "speedup of
regexp" as the title says.
> It seems also that it is not pathological slow but it is what can be
> currently achieved with plain tcl. A boost I achieved by changing reg4
> line from
> set reg4 {^.|\n}
> to:
> set reg4 {^[^\n]+|\n}
> to grab as much as possible
> This finally gives me one order of magnitude. As I can see now:
>
> 3 seconds with 128 byte and 10 seconds with 40*128 bytes. So I should
> try to grab to the end of line if possible.
Whatever. It's not clear whether you are requesting assistance for
optimization of your regexps or trying to draw attention to an
unexpected algorithmic worst case.
In any case, from a macroscopic look, I have the impression that
you're trying to reimplement line-oriented prefix parsing (considering
the frequency of ^ and \n in your regexp) within an awkward block-
based parsing loop. I'd suggest instead:
while {[gets stdin line]>=0} {
switch -regexp -- $line {
{^Someprefix} {...}
...
}
}
-Alex
I can't use line based parsing with switch, as I might catch two
different regexp on the same line. Anyway with some ideas gathered
here we get a performance boost from about 50 seconds to one or two
seconds for a 5Mb file. The principle idea is to find a template where
a Flex-like input file can't be used to fill this to generate a
scanner.
regards,
detlef
Just for the record, which version of Tcl are you using? If you're on
8.4.*, you'll definitely want to think about upgrading to the 8.5 series.
Donal.
He said he was working with BLAST databases. They can be *very* large,
many times available memory. (Not as bad as Earth Sciences datasets, but
still massive.)
Donal.
Well, I'm glad for you, but with a vague overview like this one you're
not giving back much :-( I would have been glad to hunt down any
algorithmic or architectural issue, but I'm not doing this in the
dark.
-Alex