demerphq writes:
> As you can see, except for the _construct benchmarks, B wins by a
> large margin. The _construct tests are designed to see what the
> overhead is of constructing the trie for nothing (ie, the match is at
> ^), and shows that the construct time is half as fast, (this is
> unsurprising as the entire cost of A must be carried by B as the
> optimisation doesnt occur until study_chunk()). OTOH the parse times
> are much better. perl_keywds searches for a list of words like perl
> keywords in the "bench" script that comes as part of perlbench, and
> shows that for this type of matching the trie is much much faster than
> the current mechanism.
FWIW, this looks like it'd be excellent for SpamAssassin ;)
I haven't had much time to look over the implementation, and I'm not
really any use for reviewing it from a p5p POV due to lack of familiarity
with perl internals, but the benchmark figures look fantastic and the
implementation details sound good.
I'd love to see this get into perl, even if just as an option enabled
through a "use" pragma. (in my opinion, if your regexp will benefit
from a trie, you will know that in advance.)
- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Exmh CVS
iD8DBQFCEl5SMJF5cimLx9ARAkZzAJwOzVx2bCXNu0S1tWCLsP9mCNrSbACfaOYb
V2eYWw4dhf756XEfZccu2F8=
=yTzv
-----END PGP SIGNATURE-----
I think that the problem is that the two people best qualified to comment on
this (Dave and Hugo) are both busy at the moment.
It looks very interesting to me, and has been in perl's todo for a long time,
but I don't know enough about the regexp engine to be competent to comment
further.
Nicholas Clark
Maybe somebody with SpamAssassin could do some benchmarks?
I'm a little concerned it might be misleading -- we've optimised our code
quite a lot around the *existing*, sans-trie, re engine, like so:
foreach $line (@body) {
foreach $rule (@ruleset) {
next unless ($scores{$rule} != 0); # skip zero-scoring rules
/$rule/ and $score+=$scores{$rule};
}
}
If this patch were included in perl, we'd rewrite our code to optimise
it's behaviour for *this* engine, like so:
my $allrules = eval 'qr/(?:' . join("|", @ruleset) . ')/';
# probably a bit more sophisticated than that, but you
# get the idea
foreach $rule (@ruleset) {
foreach $line (@body) {
next unless $line =~ $allrules;
foreach $rule (@ruleset) { .. } # as above
}
So a simple run of SpamAssassin may not illustrate the speedups
without more changes on our side.
Plus (as far as I know) none of us have ever built bleadperl ;)
I can try it out if there's a tarball though, and give some
rough ideas of the speedup without our code changes?
- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)
Comment: Exmh CVS
iD8DBQFCEmV5MJF5cimLx9ARAtHqAKCM9bcEd781a3NNDuhqIldOQuneCACfb9kj
3AHm4LPLpVO0dIIkvDtUZV0=
=Qplv
-----END PGP SIGNATURE-----
> (who's just finished lighting Othello, and who will be lighting Carpe
> Jugulum in three weeks' time, and who is still wondering how he was
> volunteered for both)
By not following Rincewind's advice? (oh. you said Carpe Jugulum, not Mort)
Nicholas Clark
Oy, I seem to have been promoted in my absence!
I know practically nothing about the internals of the regex engine - I'm
just tinkering with the outside edges of it as regards the (?{...}) stuff.
Dave
(who's just finished lighting Othello, and who will be lighting Carpe
Jugulum in three weeks' time, and who is still wondering how he was
volunteered for both)
--
"Emacs isn't a bad OS once you get used to it.
It just lacks a decent editor."
> Plus (as far as I know) none of us have ever built bleadperl ;)
> I can try it out if there's a tarball though, and give some
> rough ideas of the speedup without our code changes?
Giving an idea of speedup [or not :-(] would be useful. I don't think that
there are any recent tarballs - everyone tends to get the current source from
the rsync server. Something like
rsync --delete -avz rsync://ftp.linux.activestate.com/perl-current/ ~/Perl/bleadperl-src/
Nicholas Clark
I was waiting for Hugo's advice on this patch.
Also, it would be nice to have something that works with /i, if that can
be done reliably.
> I'd love to see this get into perl, even if just as an option enabled
> through a "use" pragma. (in my opinion, if your regexp will benefit
> from a trie, you will know that in advance.)
or a new end-of-regexp switch. But if it works well I don't see why
this shouldn't be optimized by default (although avoiding slowdowns
in some cases would even be better -- do they affect regexp compilation
or execution ?)
[...]
> I did however do some benchmarks on the perfomance of one particular
> regex in the SA suite. Its from SA::Bayes and looks like this:
>
> ($token =~ /^(?:a(?:nd|ny|ble|ll|re)|
> m(?:uch|ost|ade|ore|ail|ake|ailing|any|ailto)|
> t(?:his|he|ime|hrough|hat)|
> w(?:hy|here|ork|orld|ith|ithout|eb)|
> f(?:rom|or|ew)| e(?:ach|ven|mail)|
> o(?:ne|ff|nly|wn|ut)| n(?:ow|ot|eed)|
> s(?:uch|ame)| l(?:ook|ike|ong)|
> y(?:ou|our|ou're)|
> The|has|have|into|using|http|see|It's|it's|
> number|just|both|come|years|right|know|already|
> people|place|first|because|
> And|give|year|information|can)$/x);
>
> As you can see this regex has been PreSuf'ed for a minor speedup. Run
> times of this regex being used to search regcomp.c (a large file) in a
> while /\b(RE)\b/ loop were prepatch: 21/s postpatch: 51/s. When the
> pre-suf in the pattern is unrolled the times were prepatched:10/s
> postpatched:100/s. (Note this is the number of times to extract all
> of the matches from the file.)
If you're benching, can you compare how your trie version (out)?performs
regexps produced by Regexp::Optimizer and Regexp::Assemble on the
above list of words?
Regexp::Assemble produces the following RE on the list of above words:
(?:m(?:a(?:il(?:ing|to)?|[dk]e)|o(?:re|st)|uch)|
w(?:ith(?:out)?|h(?:ere|y)|or(?:ld|k)|eb)|
a(?:l(?:ready|l)|(?:bl|r)e|n[dy])|t(?:h(?:rough|at|is|e)|ime)|
i(?:n(?:formation|to)|t's)|o(?:n(?:ly|e)|ff|ut|wn)|
y(?:ou(?:'re|r)?|ears?)|(?:p(?:eopl|lac)|giv)e|
n(?:o[tw]|umber|eed)|f(?:irst|rom|ew|or)|l(?:o(?:ng|ok)|ike)|
h(?:a(?:ve|s)|ttp)|s(?:(?:am|e)e|uch)|e(?:mail|ach|ven)|
b(?:ecause|oth)|(?:righ|jus)t|c(?:ome|an)|using|It's|know|And)
Regexp::Optimizer produces:
(?:a(?:n[dy]|l(?:l|ready)|(?:bl|r)e)|m(?:o(?:st|re)|
a(?:il(?:ing|to)?|[dk]e)|uch)|t(?:h(?:is|e|rough|at)|ime)|
w(?:h(?:y|ere)|or(?:k|ld)|ith(?:out)?|eb)|f(?:rom|or|ew|irst)|
e(?:ach|ven|mail)|o(?:n(?:e|ly)|ff|wn|ut)|n(?:o[wt]|eed|umber)|
s(?:uch|(?:am|e)e)|l(?:o(?:ok|ng)|ike)|y(?:ou(?:r|'re)?|ears?)|
h(?:a(?:s|ve)|ttp)|i(?:n(?:to|formation)|t's)|b(?:oth|ecause)|
c(?:ome|an)|p(?:eopl|lac)e|using|It's|(?:jus|righ)t|know|And|give)
(word-wrapped to avoid mangling, should be on one line).
David
That's what I was alluding to.
> To be honest i would really love it if someone with stronger perl guts
> and stonger C skills than me could give the code a good once over.
> Even if whoever does it knows nothing about the regex engines
> particulars a sanity check would be really appreciated.
/me has to look at it more thoroughly then :)
> > > I'd love to see this get into perl, even if just as an option enabled
> > > through a "use" pragma. (in my opinion, if your regexp will benefit
> > > from a trie, you will know that in advance.)
> >
> > or a new end-of-regexp switch. But if it works well I don't see why
> > this shouldn't be optimized by default (although avoiding slowdowns
> > in some cases would even be better -- do they affect regexp compilation
> > or execution ?)
>
> Ok, in comp sci terms the cost of construction the trie is
> proportional to the number of characters in the trie. So IMO the cost
> is negligable. In what is the common case (a short list of short
> words of relatively few unique characters) you probably wouldnt even
> notice the slowdown as the overall match time would improve more than
> the construct time degrades.
Cool.
> In human terms: there is a slight slowdown of compilation, that is
> more than made up with the execution time differences. IMO most times
> the optimisation will kick in the size of the alternation being
> optimized will be so small as to have negligable construction costs.
> The only time I could see the construction costs being an issue would
> be with things like:
>
> "a"=~/a|very|extremely|long|list|of|words|that|could|match|but|the|first|one|will|always|match|so|the|regex|is|a|bit|of|a|waste|anyway/;
>
> The trouble is the only way we can "opt-out" of the optimisation
> sensibly is if we know what string we are matching against at the
> compile time of the regex, since the regex engine has no access to
> such knowledge we cant really do this.
>
> Anyway, the results I show are that even simple subpatterns benefit
> from conversion to a Trie and it happens automatically so i dont see
> the need of a switch except possibly to force the disabling of the
> optimisation for some reason. Im not really sure how that would work
> yet tho.
Agreeing.
> I really hope that some of you guys try the patch out. Some hands on
> feedback would be appreciated, especially as im on Win32 and have no
> real access to *nix right now.
At least on Linux, all tests pass.
Dangit, sorry about that. Still getting used to writing mails in gmail.
> demerphq wrote:
> > On Wed, 16 Feb 2005 16:30:47 +0100, Rafael Garcia-Suarez
> > > I was waiting for Hugo's advice on this patch.
> > > Also, it would be nice to have something that works with /i, if that can
> > > be done reliably.
> >
> > Its done. I hope reliably too, but there is this annoying segfault
> > with SA that i need to figure out. While it does appear to be related
> > to my code im a bit pushed to come up with an explanation for why
> > slight changes to code around the regex could change the segfault
> > behaviour. (Note i cant get SA to segfault when i add debug text, and
> > SA is the only thing ive tested against that has problems, and the
> > code that has problems is evaled closures (possibly inside of a fork
> > on win32) so it could be almost anything.)
>
> That's what I was alluding to.
Yeah fair enough. Im working on it though. :-)
>
> > To be honest i would really love it if someone with stronger perl guts
> > and stonger C skills than me could give the code a good once over.
> > Even if whoever does it knows nothing about the regex engines
> > particulars a sanity check would be really appreciated.
>
> /me has to look at it more thoroughly then :)
Thanks that would be really appreciated.
> > I really hope that some of you guys try the patch out. Some hands on
> > feedback would be appreciated, especially as im on Win32 and have no
> > real access to *nix right now.
>
> At least on Linux, all tests pass.
Is that the second patch i posted or the first Raphael?
(trie_5.patch.gz is the latest)
Cheers,
yves
--
First they ignore you, then they laugh at you, then they fight you,
then you win.
+Gandhi
Damn automatic gmail reply-to, as well.
> > At least on Linux, all tests pass.
>
> Is that the second patch i posted or the first Raphael?
> (trie_5.patch.gz is the latest)
The second.
Yeah ive put together a benchmark program to test my current build
against blead. It adds whitespace and the /x modifier to the regexes
to make them easier to read. It tests the same list of words (sorted
however) as was in the SA module with and without \b wrappers as
constructed by a bunch of the Regexp modules. If anybody is interested
in seeing the code they should ask off list, but the underlying
benchmark is:
my $found=0; $found++ while $text=~/$re{$name}/g;
Where $text is my local regcomp.c (just because its a large file)
Here is the output:
----------------------------------------------------
used the following Regex modules: Regexp::List, Regexp::Assemble,
Regexp::Optimizer, Regex::PreSuf
Read 181168 chars from '\blead_trie\regcomp.c'
Got 74 words to match
Testing regexes:
Regexp 'orig' :
(?x-ism:(
(?: a (?: nd | ny | ble | ll | re )| m (?: uch | ost | ade | ore |
ail | ake | ailing | any | ailto )| t (?: his | he | ime | hrough |
hat )| w (?: hy | here | ork | orld | ith | ithout | eb )| f (?:
rom | or | ew )| e (?: ach | ven | mail )| o (?: ne | ff | nly | wn
| ut )| n (?: ow | ot | eed )| s (?: uch | ame )| l (?: ook | ike |
ong )| y (?: ou | our | ou're )| The | has | have | into | using |
http | see | It's | it's | number | just | both | come | years |
right | know | already | people | place | first | because | And |
give | year | information | can )
))
Regexp 'RA' [Regexp::Assemble] :
(?x-ism:(
(?: m (?: a (?: il (?: ing | to )?| [dk]e | ny )| o (?: re | st )|
uch )| w (?: ith (?: out )?| h (?: ere | y )| or (?: ld | k )| eb
)| a (?: l (?: ready | l )|(?: bl | r ) e | n[dy] )| t (?: h (?:
rough | at | is | e )| ime )| i (?: n (?: formation | to )| t's
)|(?: p (?: eopl | lac )| giv | Th ) e | o (?: n (?: ly | e )| ff |
ut | wn )| y (?: ou (?: 're | r )?| ears ?)| n (?: o[tw] | umber |
eed )| f (?: irst | rom | ew | or )| l (?: o (?: ng | ok )| ike )|
h (?: a (?: ve | s )| ttp )| s (?:(?: am | e ) e | uch )| e (?:
mail | ach | ven )| b (?: ecause | oth )|(?: righ | jus ) t | c (?:
ome | an )| using | It's | know | And )
))
Regexp 'RL' [Regexp::List] :
(?x-ism:(
(?= [AITabcefghijklmnoprstuwy] )(?: a (?: l (?: l | ready )| n[dy]
|(?: bl | r ) e )| b (?: ecause | oth )| c (?: an | ome )| e (?:
ach | mail | ven )| f (?: ew | irst | or | rom )| h (?: a (?: s |
ve )| ttp )| i (?: n (?: formation | to )| t\'s )| l (?: o (?: ng |
ok )| ike )| m (?: a (?: il (?: ing | to )?| [dk]e | ny )| o (?: re
| st )| uch )| n (?: o[tw] | eed | umber )| o (?: n (?: e | ly )|
ff | ut | wn )| p (?: eopl | lac ) e | s (?:(?: am | e ) e | uch )|
t (?: h (?: at | e | is | rough )| ime )| w (?: h (?: ere | y )|
ith (?: out )?| or (?: k | ld )| eb )| y (?: ears ?| ou (?: \'re |
r )?)| And | It\'s |(?: Th | giv ) e |(?: jus | righ ) t | know |
using )
))
Regexp 'RO' [Regexp::Optimizer] :
(?x-ism:(
(?: a (?: l (?: l | ready )| n[dy] |(?: bl | r ) e )| b (?: ecause
| oth )| c (?: an | ome )| e (?: ach | mail | ven )| f (?: ew |
irst | or | rom )| h (?: a (?: s | ve )| ttp )| i (?: n (?:
formation | to )| t's )| l (?: o (?: ng | ok )| ike )| m (?: a (?:
il (?: ing | to )?| [dk]e | ny )| o (?: re | st )| uch )| n (?:
o[tw] | eed | umber )| o (?: n (?: e | ly )| ff | ut | wn )| p (?:
eopl | lac ) e | s (?:(?: am | e ) e | uch )| t (?: h (?: at | e |
is | rough )| ime )| w (?: h (?: ere | y )| ith (?: out )?| or (?:
k | ld )| eb )| y (?: ears ?| ou (?: 're | r )?)| And | It's |(?:
Th | giv ) e |(?: jus | righ ) t | know | using )
))
Regexp 'RPS' [Regex::PreSuf] :
(?x-ism:(
(?: And | It\'s | The | a (?: ble | l (?: ready | l )| n[dy] | re
)| b (?: ecause | oth )| c (?: an | ome )| e (?: ach | mail | ven
)| f (?: ew | irst | or | rom )| give | h (?: a (?: ve | s )| ttp
)| i (?: n (?: formation | to )| t\'s )| just | know | l (?: ike |
o (?: ng | ok ))| m (?: a (?: de | il (?: ing | to )?| ke | ny )| o
(?: re | st )| uch )| n (?: eed | o[tw] | umber )| o (?: ff | n (?:
ly | e )| ut | wn )| p (?: eopl | lac ) e | right | s (?: ame | ee
| uch )| t (?: h (?: at | is | rough | e )| ime )| using | w (?: eb
| h (?: ere | y )| ith (?: out )?| or (?: ld | k ))| y (?: ears ?|
ou (?: \'re | r )?))
))
Regexp 'smpl' :
(?x-ism:(
And | It's | The | able | all | already | and | any | are | because
| both | can | come | each | email | even | few | first | for |
from | give | has | have | http | information | into | it's | just
| know | like | long | look | made | mail | mailing | mailto | make
| many | more | most | much | need | not | now | number | off | one
| only | out | own | people | place | right | same | see | such |
that | the | this | through | time | using | web | where | why |
with | without | work | world | year | years | you | you're | your
))
Running tests for 2 seconds on 2 perls:
Pfx |Perl
----+--------------------------
A |g:\perl_trie\bin\perl.exe
B |g:\perl_blead\bin\perl.exe
----+--------------------------
Rate B_smpl B_orig B_RO B_RA B_RPS A_RA B_RL A_RO A_RPS
A_orig A_RL A_smpl
B_smpl 2.87/s -- -38% -43% -44% -58% -58% -61% -64% -65%
-75% -77% -89%
B_orig 4.64/s 62% -- -7% -10% -31% -33% -37% -42% -43%
-59% -62% -82%
B_RO 4.99/s 74% 7% -- -3% -26% -28% -32% -37% -39%
-56% -60% -81%
B_RA 5.15/s 80% 11% 3% -- -24% -25% -30% -35% -37%
-55% -58% -80%
B_RPS 6.75/s 136% 45% 35% 31% -- -2% -8% -15% -17%
-41% -45% -74%
A_RA 6.89/s 140% 48% 38% 34% 2% -- -6% -13% -15%
-39% -44% -73%
B_RL 7.34/s 156% 58% 47% 42% 9% 7% -- -8% -10%
-35% -41% -71%
A_RO 7.95/s 177% 71% 59% 54% 18% 15% 8% -- -2%
-30% -36% -69%
A_RPS 8.13/s 184% 75% 63% 58% 20% 18% 11% 2% --
-28% -34% -68%
A_orig 11.4/s 296% 145% 128% 120% 68% 65% 55% 43% 40%
-- -8% -56%
A_RL 12.4/s 331% 166% 148% 140% 83% 80% 68% 56% 52%
9% -- -52%
A_smpl 25.7/s 795% 453% 414% 398% 280% 273% 250% 223% 216%
126% 108% --
Testing with /\b(PAT)\b/
Pfx |Perl
----+--------------------------
A |g:\perl_trie\bin\perl.exe
B |g:\perl_blead\bin\perl.exe
----+--------------------------
Rate B_smpl B_orig B_RO B_RA A_RPS B_RPS A_RA A_RO B_RL
A_orig A_RL A_smpl
B_smpl 11.9/s -- -38% -43% -43% -48% -56% -58% -63% -69%
-74% -76% -88%
B_orig 19.3/s 62% -- -7% -7% -15% -28% -32% -40% -50%
-57% -62% -80%
B_RO 20.7/s 74% 7% -- -0% -9% -23% -27% -36% -46%
-54% -59% -79%
B_RA 20.8/s 75% 8% 0% -- -9% -23% -27% -36% -46%
-54% -59% -79%
A_RPS 22.7/s 91% 18% 10% 10% -- -15% -20% -30% -41%
-50% -55% -77%
B_RPS 26.8/s 126% 39% 29% 29% 18% -- -6% -17% -30%
-41% -47% -73%
A_RA 28.5/s 139% 47% 37% 37% 25% 6% -- -12% -26%
-37% -43% -71%
A_RO 32.4/s 172% 68% 56% 56% 42% 21% 14% -- -15%
-28% -36% -67%
B_RL 38.2/s 222% 98% 84% 84% 68% 42% 34% 18% --
-16% -24% -61%
A_orig 45.3/s 281% 135% 118% 118% 99% 69% 59% 40% 18%
-- -10% -54%
A_RL 50.4/s 324% 161% 143% 143% 122% 88% 77% 56% 32%
11% -- -49%
A_smpl 98.9/s 732% 413% 377% 376% 335% 269% 248% 206% 159%
119% 96% --