Account Options

  1. Sign in
The old Google Groups will be going away soon, but your browser is incompatible with the new version.
Google Groups Home
« Groups Home
Message from discussion Semantics for regexes

Newsgroups: perl.perl6.internals
Path: g2news1.google.com!news1.google.com!newsfeed.stanford.edu!nntp.perl.org
Return-Path: <d...@sidhe.org>
Mailing-List: contact perl6-internals-h...@perl.org; run by ezmlm
Delivered-To: mailing list perl6-intern...@perl.org
Received: (qmail 28477 invoked from network); 2 Sep 2004 14:16:17 -0000
Received: from x1.develooper.com (63.251.223.170)
  by lists.develooper.com with SMTP; 2 Sep 2004 14:16:17 -0000
Received: (qmail 15019 invoked by uid 225); 2 Sep 2004 14:16:17 -0000
Delivered-To: perl6-intern...@perl.org
Received: (qmail 15015 invoked by alias); 2 Sep 2004 14:16:17 -0000
X-Spam-Status: No, hits=-4.9 required=8.0
	tests=BAYES_00
X-Spam-Check-By: la.mx.develooper.com
Received: from 178.94.252.64.snet.net (HELO sprite.sidhe.org) (64.252.94.178)
  by la.mx.develooper.com (qpsmtpd/0.27.1) with SMTP; Thu, 02 Sep 2004 07:16:14 -0700
Received: (qmail 29283 invoked from network); 2 Sep 2004 14:15:15 -0000
X-Scanned-By: AMaViS-ng at sidhe.org
Received: from unknown (HELO ?172.24.18.155?) (d...@157.130.220.242)
  by 178.94.252.64.snet.net with SMTP; 2 Sep 2004 14:15:13 -0000
Mime-Version: 1.0
X-Sender: dan@localhost
Message-ID: <a06110418bd5cd77826f0@[172.24.18.155]>
In-Reply-To: <20040902135658.GA96544@amse.pair.com>
References: <a0611040ebd5bbe03d7e1@[172.24.18.155]>
 <20040902032432.GI12318@kevin.fink.com>
 <a06110416bd5cc39f8005@[172.24.18.155]>
 <20040902135658.GA96544@amse.pair.com>
Date: Thu, 2 Sep 2004 10:15:55 -0400
To: Felix Gallo <f...@sunscreamer.com>
Subject: Re: Semantics for regexes
Cc: Steve Fink <st...@fink.com>, perl6-intern...@perl.org
Content-Type: text/plain; charset="us-ascii" ; format="flowed"
Approved: n...@nntp.perl.org
From: d...@sidhe.org (Dan Sugalski)

At 9:56 AM -0400 9/2/04, Felix Gallo wrote:
>Dan writes:
>>  [...]
>>  Yes, and some of the initial list already has ops to do those bits,
>>  though I fully plan on evil cheating versions for some extra speed.
>
>If I recall correctly, someone with the best intentions attempted
>to write a clear, object-oriented (but still C/C++ based) regex
>engine to replace the existing highly hairy code.  As a result it
>missed fitting in the cache, and ended up being like two orders
>of magnitude slower.
>
>Not arguing for premature optimization, but anyone thinking about
>regex might want to look ahead (get it?) to the end game where
>regex pretty much has to be within .2x of its current speed.

I don't think we're going to be able to manage doing our matches in 
20% of the time of the current regex engine. That's a bit ambitious, 
even for me. :)

FWIW, we did some work on this a few years back. (And yes, the fact 
that I can say that with a straight face scares me) We managed to run 
only about 25% slower than the perl 5 regex engine when using the 
basic string ops. This was pre-ICU, though, so the numbers may well 
be significantly worse now.

Anyway, that's the whole point of this exercise. I want to know what 
primitive operations the regex engine will need. (Or, rather, I've a 
pretty good idea and I want to make sure nothing is missed) When I 
know I plan on taking all the best accepted standards and practices 
of software engineering, throwing them out, and cheating like hell 
with those string primitives. If this means having several alternate 
class matching schemes (since bitmaps are likely fastest for small 
sets like 7/8 bit ASCII and some sort of trie fastest for large ones 
for the asian sets and Unicode) then fine. If this means being able 
to pin a single string in a fixed location to avoid a single pointer 
indirection on each access, fine. If it means knowing that since the 
input is US-ASCII or Latin-x characters are 8 bits and there are no 
combining characters so we drop to direct byte access, fine.

I'm not so much resigned to cheating as looking forward to it with 
relish. (And a bit of catsup, with a side of cole slaw)
-- 
				Dan

--------------------------------------it's like this-------------------
Dan Sugalski                          even samurai
d...@sidhe.org                         have teddy bears and even
                                       teddy bears get drunk