Message from discussion
Semantics for regexes
Newsgroups: perl.perl6.internals
Path: g2news1.google.com!news1.google.com!newsfeed.stanford.edu!nntp.perl.org
Return-Path: <d...@sidhe.org>
Mailing-List: contact perl6-internals-h...@perl.org; run by ezmlm
Delivered-To: mailing list perl6-intern...@perl.org
Received: (qmail 28477 invoked from network); 2 Sep 2004 14:16:17 -0000
Received: from x1.develooper.com (63.251.223.170)
by lists.develooper.com with SMTP; 2 Sep 2004 14:16:17 -0000
Received: (qmail 15019 invoked by uid 225); 2 Sep 2004 14:16:17 -0000
Delivered-To: perl6-intern...@perl.org
Received: (qmail 15015 invoked by alias); 2 Sep 2004 14:16:17 -0000
X-Spam-Status: No, hits=-4.9 required=8.0
tests=BAYES_00
X-Spam-Check-By: la.mx.develooper.com
Received: from 178.94.252.64.snet.net (HELO sprite.sidhe.org) (64.252.94.178)
by la.mx.develooper.com (qpsmtpd/0.27.1) with SMTP; Thu, 02 Sep 2004 07:16:14 -0700
Received: (qmail 29283 invoked from network); 2 Sep 2004 14:15:15 -0000
X-Scanned-By: AMaViS-ng at sidhe.org
Received: from unknown (HELO ?172.24.18.155?) (d...@157.130.220.242)
by 178.94.252.64.snet.net with SMTP; 2 Sep 2004 14:15:13 -0000
Mime-Version: 1.0
X-Sender: dan@localhost
Message-ID: <a06110418bd5cd77826f0@[172.24.18.155]>
In-Reply-To: <20040902135658.GA96544@amse.pair.com>
References: <a0611040ebd5bbe03d7e1@[172.24.18.155]>
<20040902032432.GI12318@kevin.fink.com>
<a06110416bd5cc39f8005@[172.24.18.155]>
<20040902135658.GA96544@amse.pair.com>
Date: Thu, 2 Sep 2004 10:15:55 -0400
To: Felix Gallo <f...@sunscreamer.com>
Subject: Re: Semantics for regexes
Cc: Steve Fink <st...@fink.com>, perl6-intern...@perl.org
Content-Type: text/plain; charset="us-ascii" ; format="flowed"
Approved: n...@nntp.perl.org
From: d...@sidhe.org (Dan Sugalski)
At 9:56 AM -0400 9/2/04, Felix Gallo wrote:
>Dan writes:
>> [...]
>> Yes, and some of the initial list already has ops to do those bits,
>> though I fully plan on evil cheating versions for some extra speed.
>
>If I recall correctly, someone with the best intentions attempted
>to write a clear, object-oriented (but still C/C++ based) regex
>engine to replace the existing highly hairy code. As a result it
>missed fitting in the cache, and ended up being like two orders
>of magnitude slower.
>
>Not arguing for premature optimization, but anyone thinking about
>regex might want to look ahead (get it?) to the end game where
>regex pretty much has to be within .2x of its current speed.
I don't think we're going to be able to manage doing our matches in
20% of the time of the current regex engine. That's a bit ambitious,
even for me. :)
FWIW, we did some work on this a few years back. (And yes, the fact
that I can say that with a straight face scares me) We managed to run
only about 25% slower than the perl 5 regex engine when using the
basic string ops. This was pre-ICU, though, so the numbers may well
be significantly worse now.
Anyway, that's the whole point of this exercise. I want to know what
primitive operations the regex engine will need. (Or, rather, I've a
pretty good idea and I want to make sure nothing is missed) When I
know I plan on taking all the best accepted standards and practices
of software engineering, throwing them out, and cheating like hell
with those string primitives. If this means having several alternate
class matching schemes (since bitmaps are likely fastest for small
sets like 7/8 bit ASCII and some sort of trie fastest for large ones
for the asian sets and Unicode) then fine. If this means being able
to pin a single string in a fixed location to avoid a single pointer
indirection on each access, fine. If it means knowing that since the
input is US-ASCII or Latin-x characters are 8 bits and there are no
combining characters so we drop to direct byte access, fine.
I'm not so much resigned to cheating as looking forward to it with
relish. (And a bit of catsup, with a side of cole slaw)
--
Dan
--------------------------------------it's like this-------------------
Dan Sugalski even samurai
d...@sidhe.org have teddy bears and even
teddy bears get drunk