
Most efficient method to search text?


Robin Siebler

Oct 15, 2002, 8:35:29 PM
I wrote a script to search a slew of files for certain words and
names. However, I am sure that there has to be a faster/better way to
do it. Here is what I am doing:

1. Load words to exclude into a list.
2. Load names to exclude into a list.
3. Load words to include into a list.
4. Remove any duplicates from the name list.
5. Generate a list of files to search.
6. Open the 1st file.
7. Search each line:
   a. For a word (line.find(word)). If I get a hit, I then use an RE
      to perform a more exact search (the pattern I am using is
      '\w+word|word\w+|word').
      i. Compare any matches against the include/exclude list. If
         it is a match, keep searching. Otherwise, log the line.
   b. For a name (line.find(name)). If I get a hit, I then use an RE
      to perform a more exact search (the pattern that I am using is
      '\bname\b'). If I get a hit, log the line.

The reason that I first search using line.find() is that in the past I
have done some searches for simple strings and found that line.find()
was much faster than an RE, so I am only using an RE when I need to.
I am including my code below (unfortunately, Google screws the
formatting up). Any suggestions to improve it would be appreciated.

import re   # (excerpt from a larger script: FunkyList, listFiles, unique,
            #  clean_up and the var dictionary are defined elsewhere)

LinesFound = []; LinesFound = FunkyList(LinesFound); msg = ""
LineNum = 0; Header = 0
LogFile = var['LogFile']
print '\nGenerating file list...'   #Let user see that script is running
FilesToSearch = listFiles(var['SearchPath'], var['SearchExt'], var['Recurse'])
if len(FilesToSearch) == 0:
    print 'No Files Found!'
    clean_up(var)
else:
    print 'Number of files to search: ' + str(len(FilesToSearch))
    print 'Processing files...',    #Let user see that script is running
    while FilesToSearch:
        FileBeingSearched = FilesToSearch.pop()   #Get/remove last file name
        open_file = open(FileBeingSearched)
        print "\nProcessing " + FileBeingSearched,
        for line in open_file.xreadlines():
            LineNum += 1
            #Let user see that script is running
            if LineNum > 24 and LineNum % 100 == 0: print ".",
            for word in var['ExcludeWords']:   #Search line for proscribed words
                #Perform a case insensitive search for word *anywhere* in the line
                if line.lower().find(word.lower()) != -1:
                    pattern = r'\w+word|word\w+|word'
                    pattern = pattern.replace('word', word.lower())
                    s_word = re.compile(pattern, re.IGNORECASE)
                    #If the phrase was found, get a list containing the matches
                    match_found = unique(s_word.findall(line))
                    for match in match_found:
                        #If the word contains an underscore
                        if match.find('_') != -1:
                            words = r'\w+'
                            w_find = re.compile(words, re.IGNORECASE)
                            words = ''
                            for item in w_find.findall(line):
                                if item.find('_') != -1:
                                    words = words + ' ' + str(item.split('_'))
                                else:
                                    words = words + ' ' + item
                            m_found = unique(s_word.findall(words))
                            for item in m_found:
                                if item in var['ExcludeWords'] and \
                                   item not in var['IncludeWords']:
                                    msg = '\tLine ' + str(LineNum) + ': The word "' + \
                                          word + '" was found in: "' + line.strip() + '"'
                                    LinesFound.append(msg)
                                    break
                        elif match not in var['IncludeWords']:
                            #Is the word in IncludeWords?
                            msg = '\tLine ' + str(LineNum) + ': The word "' + \
                                  word + '" was found in: "' + line.strip() + '"'
                            LinesFound.append(msg)
                            break
            for name in var['Names']:
                #Search line for names
                if line.lower().find(name.lower()) != -1:
                    #Perform a case insensitive search
                    #(raw string so \b means word boundary, not backspace)
                    pattern = r'\bname\b'
                    pattern = pattern.replace('name', name)
                    s_word = re.compile(pattern, re.IGNORECASE)
                    match_found = unique(s_word.findall(line))
                    #If the phrase was found, get a list containing the matches
                    for match in match_found:
                        if match in var['Names']:
                            msg = '\tLine ' + str(LineNum) + ': The name "' + name + \
                                  '" was found in: "' + line.strip() + '"'
                            LinesFound.append(msg)
                            break
            if len(LinesFound) > 0:
                if not Header:
                    LogFile.write('Proscribed words were found in ' +
                                  FileBeingSearched + '\n')
                    LogFile.write('\n')
                    Header = 1
                for line in LinesFound:
                    LogFile.write(line + '\n')
                LogFile.write('\n')
                LinesFound = []
        open_file.close()
        LineNum = 0; Header = 0; hit = 0
    print '\nProcessing Complete.'

Michael Hudson

Oct 16, 2002, 9:35:09 AM
robin....@corp.palm.com (Robin Siebler) writes:

> I wrote a script to search a slew of files for certain words and
> names. However, I am sure that there has to be a faster/better way to
> do it.

I should point out that I'm not really replying in order to help you
-- if I do, that's a bonus. I've actually wanted to sit down and
write this code out for a while, and this seemed like a good excuse.

Here's a way to quickly (I hope! Haven't done any benchmarks) tell if
one of a bunch of words is contained in a chunk of text, assuming the
words are known beforehand:

import string

wordchars = string.letters
allchars = map(chr, range(256))
hittoken = 'hit'

def _compile(wordlist, d, skipalong):
    t = {}
    for word in wordlist:
        t.setdefault(word[0], []).append(word[1:])
    for k, v in t.iteritems():
        d[k] = {}
        if '' in v:
            for c in allchars:
                d[k][c] = hittoken
            for c in wordchars:
                d[k][c] = skipalong
        else:
            for c in allchars:
                d[k][c] = skipalong
            _compile([y for y in v if y <> ''], d[k], skipalong)

def compile(wordlist):
    root = {}
    skipalong = {}
    for c in allchars:
        skipalong[c] = root
    for c in wordchars:
        skipalong[c] = skipalong
    for c in allchars:
        root[c] = root
    _compile(wordlist, root, skipalong)
    return root

def match(set, text):
    _hittoken = hittoken
    for c in text + '\000':
        set = set[c]
        if set is _hittoken:
            return 1
    else:
        return 0

You probably don't want to try and print the kind of things compile()
returns...

I don't really feel like explaining it, either, but it shouldn't be
too hard to grasp the principles if you've heard of a dfa.
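
Roughly: each dict is a DFA state, and indexing it with the next
character is the transition. A quick trace for compile(['Tim']) run
over the text 'to Tim!':

# 't', 'o', ' ' : stay in the root state (no word starts with them)
# 'T', 'i', 'm' : walk down the Tim branch, one dict per character
# '!'           : a non-letter right after the whole word, so the
#                 transition is to hittoken and match() returns 1
#
# Scanning 'Timothy' instead: after 'Tim' the letter 'o' sends you to
# the skipalong state (still inside a longer word), so no hit.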

You use it like this:

>>> set = robin.compile(['Tim', 'Guido', 'Barry'])
>>> files = glob.glob(os.path.expanduser("~/src/python/dist/src/Lib/*.py"))
>>> for file in files:
... if robin.match(set, open(file).read()):
... print os.path.basename(file)
...
cgi.py
doctest.py
getpass.py
gettext.py
imputil.py
profile.py
pstats.py
pty.py
pydoc.py
random.py
rfc822.py
site.py
smtpd.py
tabnanny.py
telnetlib.py
threading.py
tokenize.py
whrandom.py
xmlrpclib.py
sets.py
heapq.py
>>>

There are surely cleverer Boyer-Moore tricks you can pull, but I
personally think this code is really really neat. You should be able
to write lexers very like this (though not full-on regexps --
backtracking ain't gonna fit, here).

Cheers,
M.

--
ROOSTA: Ever since you arrived on this planet last night you've
been going round telling people that you're Zaphod
Beeblebrox, but that they're not to tell anyone else.
-- The Hitch-Hikers Guide to the Galaxy, Episode 7

Jeff Epler

Oct 16, 2002, 10:52:06 AM
On Wed, Oct 16, 2002 at 01:35:09PM +0000, Michael Hudson wrote:
> Here's a way to quickly (I hope! Haven't done any benchmarks) tell if
> one of a bunch of words is contained in a chunk of text, assuming the
> words are known beforehand [...]

Is there any reason to suppose that this is more efficient than using
re.compile("|".join(re.escape(words))).match? I haven't looked at the
implementation of sre, but it should be able to generate a simple DFA
for this RE and execute something very much like your 'match' function,
but at C speeds.

Having one dict lookup per character scanned seems like a pretty huge
chunk of overhead.

Jeff

Michael Hudson

Oct 16, 2002, 11:19:58 AM
Jeff Epler <jep...@unpythonic.net> writes:

> On Wed, Oct 16, 2002 at 01:35:09PM +0000, Michael Hudson wrote:
> > Here's a way to quickly (I hope! Haven't done any benchmarks) tell if
> > one of a bunch of words is contained in a chunk of text, assuming the
> > words are known beforehand [...]
>
> Is there any reason to suppose that this is more efficient than using
> re.compile("|".join(re.escape(words))).match?

No, and it's not (once you fiddle what you said to

re.compile("|".join(
['\\b'+re.escape(word)+'\\b' for word in words])).search

so it does roughly the same thing).
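
Spelled out as a function, that's roughly this (note that re.escape
gets applied to each word, not to the list as a whole):

import re

def compile_words_re(words):
    # one alternation, word boundaries on both sides, metacharacters escaped
    pattern = "|".join([r'\b' + re.escape(w) + r'\b' for w in words])
    return re.compile(pattern).search

search = compile_words_re(['Tim', 'Guido', 'Barry'])
print search("Barry wrote this module") is not None   # True
print search("Barrymore alone") is not None           # False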

> I haven't looked at the implementation of sre, but it should be able
> to generate a simple DFA for this RE and execute something very much
> like your 'match' function, but at C speeds.

Yup. Dunno if it does either, but it's quick.

> Having one dict lookup per character scanned seems like a pretty huge
> chunk of overhead.

The original reason I started writing code like this was that I needed
to feed in the string a character at a time and there was enough else
going on to easily swamp one measly dict lookup. Another difference
is that it's easy to fiddle my scheme into telling you which word
matched (this isn't exactly hard with re either).

I can't think of many ways of speeding it up in Python -- perhaps
running the text through map(ord, text) and using lists? Doesn't seem
to make much difference. psyco seems to make more difference, and I
guess playing for psyco a bit could get more (these are all eyeball
tests, no numbers I'm afraid).

Anyway, what I thought was a neat bit of code was the point...

Cheers,
M.

--
Whaaat? That is the most retarded thing I have seen since, oh,
yesterday -- Kaz Kylheku, comp.lang.lisp

Tim Peters

Oct 16, 2002, 12:42:41 PM
[Michael Hudson]

> Here's a way to quickly (I hope! Haven't done any benchmarks) tell if
> one of a bunch of words is contained in a chunk of text, assuming the
> words are known beforehand [...]

[Jeff Epler]


> Is there any reason to suppose that this is more efficient than using
> re.compile("|".join(re.escape(words))).match?

Time it.

> I haven't looked at the implementation of sre, but it should be able to
> generate a simple DFA for this RE and execute something very much like
> your 'match' function, but at C speeds.

re doesn't build a DFA. Alternatives are searched for one at a time, left
to right. See Friedl's "Mastering Regular Expressions" book for long
discussions of the tradeoffs.
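
A quick way to see the one-at-a-time behaviour (the leftmost alternative
that matches wins, not the longest match a DFA-style engine would report):

>>> import re
>>> re.match('a|aa', 'aa').group()
'a'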


Aahz

Oct 16, 2002, 1:59:20 PM
In article <mailman.1034786661...@python.org>,

Tim Peters <tim...@comcast.net> wrote:
>
>re doesn't build a DFA. Alternatives are searched for one at a time, left
>to right. See Friedl's "Mastering Regular Expressions" book for long
>discussions of the tradeoffs.

In case people weren't paying attention a couple of months ago, Friedl
now has the second edition out.
--
Aahz (aa...@pythoncraft.com) <*> http://www.pythoncraft.com/

Project Vote Smart: http://www.vote-smart.org/

jep...@unpythonic.net

Oct 16, 2002, 7:39:24 PM
On Wed, Oct 16, 2002 at 12:42:41PM -0400, Tim Peters wrote:
> re doesn't build a DFA. Alternatives are searched for one at a time, left
> to right. See Friedl's "Mastering Regular Expressions" book for long
> discussions of the tradeoffs.

I repent of the sin of suggesting this approach. Oh well.

Jeff

Bengt Richter

Oct 16, 2002, 8:22:23 PM

That's pretty cool, but for just finding words I just thought of an alternative,
has_word ;-) Given a wordlist and a string to search, the word-search part could
be a long one-liner, but I think this is a little clearer ;-) Don't know about the
timing: no guarantees...

#!/usr/bin/python
# __PyPAN_Py has_word.py -- searches string and returns words found of given list
#
# May also be run at command line to search for words in globbed files:
#   Usage: python has_word.py [word]+ -- [fileglob]+
#
def has_word(
    wordlist,    # 'word' or ['word', 'word2', 'etc']
    aString,     # string to find words in
    trans = ''.join([c.isalnum() and c or ' ' for c in map(chr, range(256))])
    ):
    """
    Search for any of a list of words in a string and return list of words found.
    """
    if isinstance(wordlist, str): wordlist = [wordlist]  # allow single word for convenience
    string_words = dict(map(None, aString.translate(trans).split(), []))
    return [w for w in wordlist if w in string_words]    # and 1 or 0 for bool return

if __name__ == '__main__':
    import sys, os, glob
    try:
        eow = sys.argv.index('--')
        words = sys.argv[1:eow]
        if eow < 0 or not words: raise ValueError
        globs = sys.argv[eow+1:]
        for g in globs:
            files = glob.glob(g)   # e.g. "~/src/python/dist/src/Lib/*.py"
            for filename in files:
                found = has_word(words, file(filename).read())
                if found:
                    print '%16s: %s' % (os.path.basename(filename), found)
    except:
        print 'Usage: python has_word.py [word]+ -- [fileglob]+'
# __PyPAN__

Which may be used like:

[17:07] C:\pywk\ut>python has_word.py Tim Guido Barry Michael -- d:\python22\lib\*.py
cgi.py: ['Guido', 'Michael']
doctest.py: ['Tim']
getpass.py: ['Guido']
gettext.py: ['Barry']
imputil.py: ['Tim', 'Guido']
profile.py: ['Guido']
pstats.py: ['Guido']
pty.py: ['Guido']
pydoc.py: ['Guido']
random.py: ['Tim', 'Guido']
rfc822.py: ['Guido']
site.py: ['Guido']
smtpd.py: ['Barry']
tabnanny.py: ['Tim']
telnetlib.py: ['Guido']
threading.py: ['Tim']
tokenize.py: ['Tim']
whrandom.py: ['Guido']

Regards,
Bengt Richter

Tim Peters

Oct 17, 2002, 12:28:45 AM
[Tim]

> re doesn't build a DFA. Alternatives are searched for one at a
> time, left to right. See Friedl's "Mastering Regular Expressions" book
> for long discussions of the tradeoffs.

[Jeff Epler]


> I repent of the sin of suggesting this approach. Oh well.

It's not a sin, but it's not nearly as straightforward as what you're
imagining. Read the book <wink> -- Python/Perl regular expressions are
irregular in ways that go beyond the capabilities of DFAs.

Especially for purposes of building lexers, it might be useful if the re
package could recognize when a DFA approach was sufficient and practical,
and switch to a different scheme entirely then. Or it might not. Build
code to try it both ways, and let us know how it turns out ...


Michael Hudson

Oct 17, 2002, 7:14:09 AM
Tim Peters <tim...@comcast.net> writes:

> Especially for purposes of building lexers, it might be useful if the re
> package could recognize when a DFA approach was sufficient and practical,
> and switch to a different scheme entirely then. Or it might not. Build
> code to try it both ways, and let us know how it turns out ...

Indeed, my code canes re when the wordlist gets long. Here's some
numbers (do_comp(n) builds a list of n words composed of ten random
ascii characters and searches the Python standard library for them
using first my code, then a re-based approach):

>>> import robin
>>> robin.do_comp(10)
2.275832057
0.317215919495
>>> robin.do_comp(100)
2.2694439888
0.980538964272
>>> robin.do_comp(1000)
2.36677598953
8.53031408787
>>> robin.do_comp(2000)
2.34253299236
16.9222760201

So roughly linear behaviour from re, n-independent behaviour from my
code. Isn't it nice when things do what you expect?
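
For the record, do_comp is nothing fancy; from memory it's roughly this
(a sketch, not the exact code I timed; paths and details will vary):

import glob, random, re, string, time

def do_comp(n, libglob='/usr/local/lib/python2.2/*.py'):
    # n random 10-letter words, so there are no hits at all
    words = [''.join([random.choice(string.letters) for i in range(10)])
             for j in range(n)]
    text = ''.join([open(f).read() for f in glob.glob(libglob)])

    automaton = compile(words)      # the dict automaton from my first post
    t = time.time()
    match(automaton, text)
    print time.time() - t

    searcher = re.compile('|'.join([re.escape(w) for w in words])).search
    t = time.time()
    searcher(text)
    print time.time() - t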

On the other hand, the "compiler" code I posted yesterday consumes
vast gobs of memory -- trying to compile a 10000 long list of random
words got the OOM killer into action. A version using lists seems
slightly better in this respect, though slower in execution.

Cheers,
M.

--
Have you considered downgrading your arrogance to a reasonable level?
-- Erik Naggum, comp.lang.lisp, to yet another C++-using troll

Bengt Richter

Oct 17, 2002, 6:25:08 PM
On Thu, 17 Oct 2002 11:14:09 GMT, Michael Hudson <m...@python.net> wrote:

>Tim Peters <tim...@comcast.net> writes:
>
>> Especially for purposes of building lexers, it might be useful if the re
>> package could recognize when a DFA approach was sufficient and practical,
>> and switch to a different scheme entirely then. Or it might not. Build
>> code to try it both ways, and let us know how it turns out ...
>
>Indeed, my code canes re when the wordlist gets long. Here's some

If this is easy to add to your test harness, I'd be interested to see what
this search does in comparison with the longer word lists (it's probably
faster than the has_word.py that I posted elsewhere, depending on relative lengths
of word lists and strings, and on internal vs. external looping and allocation).

====================

# __PyPAN_Py word_is_in.py -- Check if at least one word in list is in string.
def word_is_in(
    wordlist,    # ['word', 'word2', 'etc']
    aString,     # string to find words in
    trans = ''.join([c.isalnum() and c or ' ' for c in map(chr, range(256))])
    ):
    """
    Search for any of a list of words in a string and return 1 if any found, else 0.
    """
    word_dict = dict(zip(wordlist, wordlist))
    for w in aString.translate(trans).split():
        if w in word_dict: return 1
    return 0
# __PyPAN__

====================
The trans table treats anything non-alphanumeric as a word delimiter. You could mess with that.

You could write a fast C function that takes advantage of not actually having to do
the translate and split, but instead just finds the successive tokens and checks
for them in the word_dict. For searching large files, this would help with memory use
as well as speed.

BTW, a split method for strings that would do a pre-translate could be useful.
You could specify it by way of a character filter function like isalnum (used the
way the trans argument above is built), but a builtin str.splitcs(character_set_filter)
would recognize references to its own standard methods (e.g., isalnum) and use cached
info instead of building it each time. Alternatively, a string could be taken to imply
'['+the_string+']+' as a findall regex.
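
E.g., just a sketch of the idea (not an existing str method):

import re

def splitcs(aString, charset='A-Za-z0-9_'):
    # treat the character-set string as '[...]+' and findall the tokens
    return re.findall('[' + charset + ']+', aString)

print splitcs('foo_bar, baz-42!')   # ['foo_bar', 'baz', '42']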


>numbers (do_comp(n) builds a list of n words composed of ten random
>ascii characters and searches the Python standard library for them
>using first my code, then a re-based approach):
>
>>>> import robin
>>>> robin.do_comp(10)
>2.275832057
>0.317215919495
>>>> robin.do_comp(100)
>2.2694439888
>0.980538964272
>>>> robin.do_comp(1000)
>2.36677598953
>8.53031408787
>>>> robin.do_comp(2000)
>2.34253299236
>16.9222760201
>
>So roughly linear behaviour from re, n-independent behaviour from my
>code. Isn't it nice when things do what you expect?
>
>On the other hand, the "compiler" code I posted yesterday consumes
>vast gobs of memory -- trying to compile a 10000 long list of random
>words got the OOM killer into action. A version using lists seems
>slightly better in this respect, though slower in execution.
>

Regards,
Bengt Richter

Michael Hudson

Oct 18, 2002, 6:03:10 AM
bo...@oz.net (Bengt Richter) writes:

> On Thu, 17 Oct 2002 11:14:09 GMT, Michael Hudson <m...@python.net> wrote:
>
> >Tim Peters <tim...@comcast.net> writes:
> >
> >> Especially for purposes of building lexers, it might be useful if the re
> >> package could recognize when a DFA approach was sufficient and practical,
> >> and switch to a different scheme entirely then. Or it might not. Build
> >> code to try it both ways, and let us know how it turns out ...
> >
> >Indeed, my code canes re when the wordlist gets long. Here's some
>
> If this is easy to add to your test harness, I'd be interested to see what
> this search does in comparison, with the longer word lists (it's probably
> faster than has_word.py that I posted elsewhere, depending on relative lengths
> of word lists and strings, and internal vs external looping and allocation.

It's quick:

>>> robin.do_comp(1000)
compile... 3.42289197445
compile2... 2.49848008156
compile3... 0.696313977242
compile4... 4.04265594482
compile_re... 0.627331018448
compile_bengt... 0.00175499916077

test... 1.39854204655
test2... 2.93543899059
test3... 3.2231388092
test4... 2.15867292881
test_re... 8.38554108143
test_bengt... 0.437232971191

I'm sure Tim once said something along the lines of "Python doesn't
give much advice for getting good performance, beyond a not-so-subtle
hint to exploit dicts for all they're worth" but I can't find it now.

Cheers,
M.
hmm, look at that sig...

--
Premature optimization is the root of all evil.
-- Donald E. Knuth, Structured Programming with goto Statements

Bengt Richter

Oct 18, 2002, 2:32:20 PM

Thanks. I wonder what would happen if you used word lists and files
guaranteed *not* to have any hits, so they'd all be searched to the end.
Or what was your mix (no. of search words, length of files, hit distribution)?

>I'm sure Tim once said something along the lines of "Python doesn't
>give much advice for getting good performance, beyond a not-so-subtle
>hint to exploit dicts for all they're worth" but I can't find it now.
>

There's probably a lot to learn from the dictionary implementation code
... one of these days ;-)

>Cheers,
>M.
>hmm, look at that sig...
>
>--
> Premature optimization is the root of all evil.
> -- Donald E. Knuth, Structured Programming with goto Statements

Premature celebration is a related problem ;-)

Regards,
Bengt Richter

Tim Peters

Oct 18, 2002, 3:59:23 PM
[Michael Hudson, searching for words]

> ...
> Indeed, my code canes re when the wordlist gets long. Here's some
> numbers ...
> ...

> So roughly linear behaviour from re, n-independent behaviour from my
> code. Isn't it nice when things do what you expect?

When you're young, it can seem that way. At my age, it mostly makes one
nervous, waiting for the other and always larger and much smellier foot to
fall <wink>.

> On the other hand, the "compiler" code I posted yesterday consumes
> vast gobs of memory -- trying to compile a 10000 long list of random
> words got the OOM killer into action. A version using lists seems
> slightly better in this respect, though slower in execution.

Which is, of course, the traditional tradeoff between NFA and DFA
approaches -- in bad cases you can pay in memory and preprocessing time due
to exponential state explosion, or you can pay at runtime.

I see Bengt discovered how to use dicts for this, and that's probably what
you'll use in the future too <wink>. For the more general case of lexing,
you're searching for patterns (like r'[A-Za-z_]\w*'), and twisting a dict
gets at best obscure. There's a large literature now on compromise
approaches, building only the parts of the DFA that the target string
actually needs to explore, in a dynamic way. This sounds horribly expensive
at first glance (never mind that mixing ears with eyes doesn't make
sense), but in some practical cases there are exceedingly clever ways to
"fake it" with a handful of machine-level instructions per character, using
integers as bit vectors representing the *set* of NFA nodes currently
active.
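
For a taste of the trick, here's the single-pattern version (the classic
"shift-and" bit-parallel scan; each bit of state is one active NFA node;
not anyone's production code, just the flavor):

def shift_and(pattern, text):
    # per-character bitmask: bit i is set where pattern[i] == c
    # (assumes a non-empty pattern; plain ints are fine for short ones)
    masks = {}
    for i in range(len(pattern)):
        masks[pattern[i]] = masks.get(pattern[i], 0) | (1 << i)
    accept = 1 << (len(pattern) - 1)   # bit for the final NFA node
    state = 0                          # set of currently active NFA nodes
    for pos in range(len(text)):
        # advance every active node, start a fresh attempt at this position,
        # and keep only the nodes whose outgoing edge matches text[pos]
        state = ((state << 1) | 1) & masks.get(text[pos], 0)
        if state & accept:
            return pos - len(pattern) + 1   # index where a match starts
    return -1

E.g. shift_and('im', 'Tim Peters') returns 1.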

Gonzo speeding of all arguably common cases of pattern-matching is a
lifetime job. Although there's nothing unique about pattern-matching in
that ...

learning-to-live-with-fast-enough-leaves-more-time-for-life-ly y'rs - tim


Michael Hudson

Oct 21, 2002, 8:53:40 AM
bo...@oz.net (Bengt Richter) writes:

> Thanks. I wonder what would happen if you used word lists and files
> guaranteed *not* to have any hits, so they'd all be searched to the end.
> Or what was your mix (no of search words, length of files, hit distribution)?

Oh, no hits at all. I was making up random words of length 10.

> >I'm sure Tim once said something along the lines of "Python doesn't
> >give much advice for getting good performance, beyond a not-so-subtle
> >hint to exploit dicts for all they're worth" but I can't find it now.
> >
> There's probably a lot to learn from the dictionary implementation code
> ... one of these days ;-)

Not as much as there is to be learnt from using it, I suspect.

Cheers,
M.

--
I've even been known to get Marmite *near* my mouth -- but never
actually in it yet. Vegamite is right out.
UnicodeError: ASCII unpalatable error: vegamite found, ham expected
-- Tim Peters, comp.lang.python

Michael Hudson

Oct 22, 2002, 7:24:23 AM
Tim Peters <tim...@comcast.net> writes:

> [Michael Hudson, searching for words]
> > ...
> > Indeed, my code canes re when the wordlist gets long. Here's some
> > numbers ...
> > ...
> > So roughly linear behaviour from re, n-independent behaviour from my
> > code. Isn't it nice when things do what you expect?
>
> When you're young, it can seem that way. At my age, it mostly makes one
> nervous, waiting for the other and always larger and much smellier foot to
> fall <wink>.

Well, as I'm only doing this for fun, when it gets squashed by a heavy
foot I'll just give up and do something else :)

> > On the other hand, the "compiler" code I posted yesterday consumes
> > vast gobs of memory -- trying to compile a 10000 long list of random
> > words got the OOM killer into action. A version using lists seems
> > slightly better in this respect, though slower in execution.
>
> Which is, of course, the traditional tradeoff between NFA and DFA
> approaches -- in bad cases you can pay in memory and preprocessing time due
> to exponential state explosion, or you can pay at runtime.

I don't think I'm suffering from state explosion, inasmuch as I have
a version that uses a similar algorithm but different data structures
(flat lists) which goes *very* much quicker -- faster than the sre
compiler (tho' given that that does more and is written in Python, I
guess that's not a major achievement).

I don't know where the other compiler is spending its time, and all
that trying to use hotshot to find out did was get me worried about
what I've done to its overhead.

> I see Bengt discovered how to use dicts for this, and that's probably what
> you'll use in the future too <wink>.

Nah, far too boring.

> For the more general case of lexing, you're searching for patterns
> (like r'[A-Za-z_]\w*'), and twisting a dict gets at best obscure.

I'd bet.

> There's a large literature now on compromise approaches, building
> only the parts of the DFA that the target string actually needs to
> explore, in a dynamic way. This sounds horribly expensive at first
> glance (never mind that mixing ears with eyes doesn't make
> sense), but in some practical cases there are exceedingly clever
> ways to "fake it" with a handful of machine-level instructions per
> character, using integers as bit vectors representing the *set* of
> NFA nodes currently active.

Hmm, I'm basically implementing a (limited) regular expression engine
-- FOR FUN! I *really* need to get out more...

> Gonzo speeding of all arguably common cases of pattern-matching is a
> lifetime job. Although there's nothing unique about pattern-matching in
> that ...
>
> learning-to-live-with-fast-enough-leaves-more-time-for-life-ly y'rs - tim

Shh!

Cheers,
M.

--
31. Simplicity does not precede complexity, but follows it.
-- Alan Perlis, http://www.cs.yale.edu/homes/perlis-alan/quotes.html
