
doing hundreds of re.subs efficiently on large strings


nihilo

Mar 25, 2003, 4:46:04 PM
I have a simple application that converts a file in one format to a file
in another format using string.replace and re.sub. There are about 150
replacements occurring on files that are on the order of a few KB. Right
now, I represent the file contents as a string, and naively use
string.replace and re.sub to replace all the substrings that I need to
replace. I knew that I would have to come up with a faster, more memory
friendly solution eventually, but I just wanted to implement the basic
idea as simply as possible at first.

Now I am at a loss as to how to proceed. I need a stringbuffer
equivalent that supports at least re.sub. Two alternatives that I have
seen for StringBuffer are using lists and then joining into a string at
the end, or using an array. Neither of these seems to be of much use to
me, since I am doing more than just appending to the end of the string.
Is there another pre-existing alternative that I'm overlooking? Or has
anybody come up with a good solution for this issue?

tia,

nihilo

Bengt Richter

Mar 25, 2003, 8:02:01 PM

Is your substitution recursive? I.e., does each substitution operate on
the finished result of a previous substitution? As in

>>> 'abc'.replace('b','x').replace('xc','z')
'az'

If not, you could split on the befores and then walk through the list
and substitute corresponding afters and join the result, e.g.,

An example source:

>>> s = """\
... before1, before2 plain stuff
... and before3 and before4, and
... some more plain stuff.
... """
>>> print s
before1, before2 plain stuff
and before3 and before4, and
some more plain stuff.

Regex to split out befores:
>>> import re
>>> rxo = re.compile(r'(before1|before2|before3|before4)')

The parens retain the matches as the odd-indexed items:
>>> rxo.split(s)
['', 'before1', ', ', 'before2', ' plain stuff\nand ', 'before3', ' and ', 'before4', ', and\nsome more plain stuff.\n']

A dict to look up substitutions:
>>> subdict = dict([('before'+x, 'after'+x) for x in '1234'])
>>> subdict
{'before4': 'after4', 'before1': 'after1', 'before2': 'after2', 'before3': 'after3'}

As above, but bind to s, so we can do substitutions on the odd elements:
>>> s = rxo.split(s)
>>> s
['', 'before1', ', ', 'before2', ' plain stuff\nand ', 'before3', ' and ', 'before4', ', and\nsome more plain stuff.\n']

Do the substitution:
>>> for i in xrange(1,len(s),2): s[i] = subdict[s[i]]
...
>>> s
['', 'after1', ', ', 'after2', ' plain stuff\nand ', 'after3', ' and ', 'after4', ', and\nsome more plain stuff.\n']

Join into single string:
>>> ''.join(s)
'after1, after2 plain stuff\nand after3 and after4, and\nsome more plain stuff.\n'

Print it to see in original format:
>>> print ''.join(s)
after1, after2 plain stuff
and after3 and after4, and
some more plain stuff.

You could easily wrap this in a function, of course.
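For instance (a quick sketch; the name multisub is just for illustration,
reusing the rxo and subdict objects from above):

>>> def multisub(s, rxo=rxo, subdict=subdict):
...     parts = rxo.split(s)                # befores land at the odd indexes
...     for i in xrange(1, len(parts), 2):
...         parts[i] = subdict[parts[i]]
...     return ''.join(parts)
...
>>> multisub('before1 and before3')
'after1 and after3'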

Regards,
Bengt Richter

Anders J. Munch

Mar 26, 2003, 6:13:52 AM
"nihilo" <exni...@NOmyrealCAPSbox.com> wrote:
> I have a simple application that converts a file in one format to a file
> in another format using string.replace and re.sub. There are about 150
> replacements occurring on files that are on the order of a few KB. Right
> now, I represent the file contents as a string, and naively use
> string.replace and re.sub to replace all the substrings that I need to
> replace. I knew that I would have to come up with a faster, more memory
> friendly solution eventually, but I just wanted to implement the basic
> idea as simply as possible at first.

Hmm, 150 replacements on a few-KB text, that shouldn't take long. Why
do you need it to be faster and more memory-friendly? Are there
realtime constraints? Do you have a lot of these files that you need
to process in batch?

<plug>
If so, it might be worth taking a look at my Msub
http://home.inbound.dk/~andersjm/provide/msub/
in which all 150 replacements can be done efficiently in parallel.
</plug>

Staying in Python, and assuming that there are no overlaps between
matches, here's a way to do single-pass replace of all your
regexps.

Combine your 150 regexps into one big regexp. (Say regexps A,B and C
combine into (A)|(B)|(C), you get the picture.) Then use re.finditer
(assuming Python version >=2.2) to locate the matching substrings one
by one, as well as the intermediate unmatched parts. Which component
was matched can be identified by looking at the position of the first
non-None entry in groups(). Collect the output (unmatched
parts+replacement strings) in a list of string fragments which you
''.join into one big string at the end.
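A rough sketch of the idea (stand-in regexps and replacements, just to
show the shape of the loop):

import re

regexps = ['A', 'B', 'C']      # stand-ins for the real 150 regexps
repls = ['x', 'y', 'z']        # parallel replacement strings
bigre = re.compile('(' + ')|('.join(regexps) + ')')

text = 'aAbBcC'
out = []
pos = 0
for m in bigre.finditer(text):
    out.append(text[pos:m.start()])    # unmatched part before this match
    groups = m.groups()
    for i in xrange(len(groups)):      # first non-None group says which
        if groups[i] is not None:      # component regexp matched
            out.append(repls[i])
            break
    pos = m.end()
out.append(text[pos:])                 # unmatched tail
print ''.join(out)                     # -> axbycz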

- Anders


nihilo

Mar 26, 2003, 11:45:22 PM
Bengt Richter wrote:
> On Tue, 25 Mar 2003 21:46:04 GMT, nihilo <exni...@NOmyrealCAPSbox.com> wrote:
>
>
>>I have a simple application that converts a file in one format to a file
>>in another format using string.replace and re.sub. There are about 150
>>replacements occurring on files that are on the order of a few KB. Right
>>now, I represent the file contents as a string, and naively use
>>string.replace and re.sub to replace all the substrings that I need to
>>replace. I knew that I would have to come up with a faster, more memory
>>friendly solution eventually, but I just wanted to implement the basic
>>idea as simply as possible at first.
>>
>>Now I am at a loss as to how to proceed. I need a stringbuffer
>>equivalent that supports at least re.sub. Two alternatives that I have
>>seen for StringBuffer are using lists and then joining into a string at
>>the end, or using an array. Neither of these seems to be of much use to
>>me, since I am doing more than just appending to the end of the string.
>>Is there another pre-existing alternative that I'm overlooking? Or has
>>anybody come up with a good solution for this issue?
>>
>
> Is your substitution recursive? I.e., does each substitution operate on
> the finished result of a previous substitution? As in
>
> >>> 'abc'.replace('b','x').replace('xc','z')
> 'az'

Yes, this is the case. The result of one replace is the input for the
next. Is it better to cascade them like this? Perhaps I am
misunderstanding what happens with the cascading calls, but I thought
that it would be the same as doing it the obvious way, except that the
intermediate strings can't be referenced when the calls are cascaded
together.

Thanks for the really nice solution below. I will adapt my code and
handle the regular expression substitutions as you have suggested.

cheers,

nihilo

nihilo

Mar 27, 2003, 12:50:16 AM
Anders J. Munch wrote:
> "nihilo" <exni...@NOmyrealCAPSbox.com> wrote:
>
>>I have a simple application that converts a file in one format to a file
>>in another format using string.replace and re.sub. There are about 150
>>replacements occurring on files that are on the order of a few KB. Right
>>now, I represent the file contents as a string, and naively use
>>string.replace and re.sub to replace all the substrings that I need to
>>replace. I knew that I would have to come up with a faster, more memory
>>friendly solution eventually, but I just wanted to implement the basic
>>idea as simply as possible at first.
>
>
> Hmm, 150 replacements on a few-KB text, that shouldn't take long. Why
> do you need it to be faster and more memory-friendly? Are there
> realtime constraints? Do you have a lot of these files that you need
> to process in batch?

Actually, I'm apparently really bad at estimating. I just checked a
couple of files, and they are from 20 KB to 60 KB, and there are
something like 200 replaces and perhaps 100 regular expression
substitutions, some of which are pretty complex. There are realtime
constraints, since the application basically downloads a file and
processes it while a user is waiting. It already takes a second or so to
download the file, and the processing takes from a couple to six or
seven seconds, which is already too much for impatient users like me ;-)

> <plug>
> If so, it might be worth taking a look at my Msub
> http://home.inbound.dk/~andersjm/provide/msub/
> in which all 150 replacements can be done efficiently in parallel.
> </plug>

Thanks, I checked out your application. It looks great, but I'm only doing
one file at a time, and I would like to stick with a purely Python solution.


> Staying in Python, and assuming that there are no overlaps between
> matches, here's a way to do single-pass replace of all your
> regexps.

> Combine your 150 regexps into one big regexp. (Say regexps A,B and C
> combine into (A)|(B)|(C), you get the picture.) Then use re.finditer
> (assuming Python version >=2.2) to locate the matching substrings one
> by one, as well as the intermediate unmatched parts.

How do I use finditer to find the unmatched parts also (the stuff
between the matches)? It looks to me like it just gives you access to
the stuff that is matched, but maybe I'm missing something.

> Which component
> was matched can be identified by looking at the position of the first
> non-None entry in groups(). Collect the output (unmatched
> parts+replacement strings) in a list of string fragments which you
> ''.join into one big string at the end.
>
> - Anders

I am sorry but I am still not sure how you are suggesting the
replacement occur. After I have the iterator, what would I do with each
of the match objects in the iterator?

Sorry for my confusion, but I'm pretty new to python and regular
expressions.

thanks,

nihilo

Anders J. Munch

Mar 27, 2003, 3:36:34 AM
"nihilo" <exni...@NOmyrealCAPSbox.com> wrote:
> How do I use finditer to find the unmatched parts also (the stuff
> between the matches)? It looks to me like it just gives you access to
> the stuff that is matched, but maybe I'm missing something.

Look at start() and end().

>>> import re
>>> s = 'hi nihilo'
>>> for m in re.finditer('(n)|(o)', s):
...     print m.start(), m.end()
...
3 4
8 9
>>> s[:3], s[3:4], s[4:8], s[8:9], s[9:]
('hi ', 'n', 'ihil', 'o', '')
>>>

- Anders

John Machin

Mar 27, 2003, 6:09:56 AM
nihilo <exni...@NOmyrealCAPSbox.com> wrote in message news:<3E82911D...@NOmyrealCAPSbox.com>...

>
> I am sorry but I am still not sure how you are suggesting the
> replacement occur. After I have the iterator, what would I do with each
> of the match objects in the iterator?

Below is an example of what Anders was talking about.
HTH,
John

=== file raboof.py ===
import re
from_str = ["foo", "bar", "zot"]
to_str = ["oofay", "arbay", "otzay"]
pat = "|".join(["(" + x + ")" for x in from_str])
print pat
text_in = "fee fie foo and bar or zot then foo again"
asm = []
asmap = asm.append
lpos = 0
for m in re.finditer(pat, text_in):
    spos, epos = m.span()
    grpnum = m.lastindex        # which alternative matched (1-based)
    print spos, epos, grpnum
    asmap(text_in[lpos:spos])   # unmatched text before this match
    asmap(to_str[grpnum-1])     # the replacement
    lpos = epos
asmap(text_in[lpos:])           # trailing unmatched text
text_out = "".join(asm)
print text_in
print text_out

=== results ===
C:\junk>python raboof.py
(foo)|(bar)|(zot)
8 11 1
16 19 2
23 26 3
32 35 1
fee fie foo and bar or zot then foo again
fee fie oofay and arbay or otzay then oofay again

C:\junk>
==================

Alex Martelli

Mar 27, 2003, 6:33:11 AM
Anders J. Munch wrote:

Yeah, but having to keep track of "one after the latest .end" in
the loop to accumulate the fragments is fussy and may be a bit
error-prone. I'm not sure performance advantages justify that.
Let's check -- performance should ALWAYS be measured, NEVER
guessed at...!!! Here's a module ms.py with both approaches:

import re

# simulate a couple of dozen needed substitutions
from string import lowercase
uppercase = lowercase.upper()
# say allres is a list of RE's (without groups!), and
# allsub a parallel list of strings to substitute for
# each corresponding RE when it matches
allres = list(lowercase)
allsub = ['(%s!)'%c for c in allres]

# a biggish string, about 10K char, on which to do substitutions
bigstr = (lowercase+uppercase) * 200

# join the many REs (which have no groups) into one big RE
bigre = re.compile( '(' + ')|('.join(allres) + ')' )

def sub_1(bigre=bigre, bigstr=bigstr, allsub=allsub):
    "substitutions with delicate, tricky approach"
    pieces = []
    prev = 0
    for m in bigre.finditer(bigstr):
        pieces.append(bigstr[prev:m.start()])
        pieces.append(allsub[m.lastindex-1])
        prev = m.end()
    pieces.append(bigstr[prev:])
    return ''.join(pieces)

def sub_2(bigre=bigre, bigstr=bigstr, allsub=allsub):
    "substitutions with simple, straightforward approach"
    def onesub(m):
        return allsub[m.lastindex-1]
    return bigre.sub(onesub, bigstr)

assert sub_1() == sub_2()


And the winner is...:


[alex@lancelot alex]$ python timeit.py -s 'import ms' 'ms.sub_1()'
10 loops, best of 3: 1.28e+05 usec per loop
[alex@lancelot alex]$ python timeit.py -s 'import ms' 'ms.sub_2()'
10 loops, best of 3: 9.92e+04 usec per loop


...the simple, straightforward approach, with 99 milliseconds per
loop, versus 128 milliseconds per loop for the fussy, tricky one!


Don't guess -- MEASURE...!!!


Alex

Anders J. Munch

Mar 27, 2003, 8:42:02 AM
"Alex Martelli" <al...@aleax.it> wrote:
> Yeah, but having to keep track of "one after the latest .end" in
> the loop to accumulate the fragments is fussy and may be a bit
> error-prone.

Only when you know a better way. That replacements could be functions
had escaped my attention; your solution is clearly better. No
measurements needed to realise that.

And both solutions may or may not be better than the OP's original
one-sub-at-a-time approach. I'll leave it to the OP to find that out.
After all, the OP can do the testing with the right regexps on the
right data. Synthetic benchmarks give synthetic results.

- Anders


Alex Martelli

Mar 27, 2003, 10:29:12 AM
Anders J. Munch wrote:

> "Alex Martelli" <al...@aleax.it> wrote:
>> Yeah, but having to keep track of "one after the latest .end" in
>> the loop to accumulate the fragments is fussy and may be a bit
>> error-prone.
>
> Only when you know a better way. That replacements could be functions
> had escaped my attention; your solution is clearly better. No
> measurements needed to realise that.

It's simpler/cleaner, and yes, that's more an issue of aesthetics than
of measurements. Not sure what you mean by the .sub method "escaping
your attention" when it's right in the subject (I'm just using it with
the composite re, just like you're proposing to use .finditer), but if
you're speaking more generally, sure -- one can't measure an approach
one hasn't even thought of (that's why posting here is so useful, as
often people WILL propose lots of different approaches).


> And both solutions may or may not be better than the OP's original
> one-sub-at-a-time approach. I'll leave it to the OP to find that out.
> After all, the OP can do the testing with the right regexps on the
> right data. Synthetic benchmarks give synthetic results.

Definitely. Moreover, one should start comparative measurements of
different approaches only when one KNOWS the simplest, cleanest one
doesn't yield satisfactory performance AND that the area in question
is the application's performance bottleneck, as shown by profiling.


Alex

Anders J. Munch

Mar 27, 2003, 11:39:48 AM
"Alex Martelli" <al...@aleax.it> wrote:
> It's simpler/cleaner, and yes, that's more an issue of aesthetics than
> of measurements. Not sure what you mean by the .sub method "escaping
> your attention" when it's right in the subject

Reading help(re.sub), I took the 'repl' parameter for a simple string
parameter and missed the dual typing.

- Anders

Alex Martelli

Mar 27, 2003, 1:25:08 PM
Anders J. Munch wrote:

Ah, yes -- that help text is definitely remiss in not mentioning the
sub method's special powers and pointing to the docs about it!


Alex

nihilo

Mar 27, 2003, 4:45:16 PM
Bengt Richter wrote:
> On Tue, 25 Mar 2003 21:46:04 GMT, nihilo <exni...@NOmyrealCAPSbox.com> wrote:
<snip/>

I finally got around to actually testing the time for each of these
approaches. Method 1 was with a bunch of string.replaces, and method 2
was compiling everything into one big regex, splitting the string, and
using a dictionary for lookup of the term to substitute for all the
odd-numbered items in the list (the matched terms). The results were
very surprising. string.replace averaged .25 seconds, while the method
outlined above averaged .43 seconds!

The bottleneck in my code turned out to be the regular expressions,
which take almost a second and a half in total. Since they all contain
groups, I think I'm stuck with what I have at present, though I'm
wondering whether it is worthwhile to precompile all the expressions and
use cPickle to load them from a file.

Anyway, thanks again for the help. Your solution is more elegant than a
hundred and fifty replaces, and better memory-wise too, even if it is a
bit slower.

-nihilo

Bengt Richter

Mar 29, 2003, 7:12:51 PM

Thanks for testing, that is interesting. It would be interesting to see
the test code you ran, so we could see the reasons for .25 vs .43 (and
I could revise my reality-model as necessary ;-)

Regards,
Bengt Richter

nihilo

Apr 6, 2003, 8:28:13 PM
Bengt Richter wrote:
>
> Thanks for testing, that is interesting. It would be interesting to see
> the test code you ran, so we could see the reasons for .25 vs .43 (and
> I could revise my reality-model as necessary ;-)
>
> Regards,
> Bengt Richter

Hi Bengt,

Here is some code that I created to test the two versions.

Original:


import time, re
start = time.clock()
data = ''  # get http://news.google.com for sample data to run this on.
           # I was just putting this in the file for testing to avoid time
           # complications over the network. I've omitted it here because
           # it is more than 50KB.

data = data.replace('US and Britain', 'Oceania')
data = data.replace('US and British', 'Oceanian')
data = data.replace('America and Britain', 'Oceania')
data = data.replace('the U.S.A.', 'Oceania')
data = data.replace('the United States of America', 'Oceania')
data = data.replace('the U.S.', 'Oceania')
data = data.replace('the USA', 'Oceania')
data = data.replace('the US', 'Oceania')
data = data.replace('The US', 'Oceania')
data = data.replace('the United States', 'Oceania')
data = data.replace('U.S.A.', 'Oceania')
data = data.replace('United States of America', 'Oceania')
data = data.replace('U.S.', 'Oceania')
data = data.replace(' US', ' Oceania')
data = data.replace('US ', 'Oceania ')
data = data.replace('US,', 'Oceania,')
data = data.replace('US-', 'Oceania-')
data = data.replace('United States', 'Oceania')
data = data.replace('American', 'Oceanian')
data = data.replace('British', 'Oceanian')
data = data.replace('Great Britain', 'Oceania')
data = data.replace('Britain', 'Oceania')
data = data.replace('United Kingdom', 'Oceania')
data = data.replace('European Union countries', 'Oceania provinces')
data = data.replace('European nations', 'Oceania provinces')
data = data.replace('European', 'Oceanian')
data = data.replace('Europe', 'Oceania')
data = data.replace('the EU', 'Oceania')
data = data.replace('Australian', 'Oceanian')
data = data.replace('Australia', 'Oceania')
data = data.replace('Washington Post', 'Ministry of Truth')
data = data.replace('Washington DC', 'Airstrip One')
data = data.replace('Washington D.C.', 'Airstrip One')
data = data.replace('Washington, D.C.', 'Airstrip One')
data = data.replace('Washington, DC', 'Airstrip One')
data = data.replace('Washington', 'Airstrip One')
data = data.replace('WASHINGTON', 'Airstrip One')
data = data.replace('Boy Scouts', 'Junior Anti-Sex League')
data = data.replace('Girl Scouts', 'Junior Anti-Sex League')
data = data.replace('President Hussein', 'Emmanuel Goldstein, Enemy of the People')
data = data.replace('Saddam Hussein', 'Emmanuel Goldstein')
data = data.replace('Saddam', 'Goldstein')
data = data.replace('Hussein', 'Goldstein')
data = data.replace('President Bush', 'Big Brother')
data = data.replace('president Bush', 'Big Brother')
data = data.replace('George Bush', 'Big Brother')
data = data.replace('George W. Bush', 'Big Brother')
data = data.replace('President George W Bush', 'Big Brother')
data = data.replace('George W Bush', 'Big Brother')
data = data.replace('Mr. Bush', 'Big Brother')
data = data.replace('Bush', 'Big Brother')
data = data.replace('Donald Rumsfeld', 'O\' Brien')
data = data.replace('Rumsfeld', "O' Brien")
data = data.replace('United Nations', 'bloc nations')
data = data.replace('UN', 'bloc nations')
data = data.replace('U.N.', 'bloc nations')
data = data.replace('coalition allies', 'bloc allies')
data = data.replace(' allies', ' bloc allies')
data = data.replace('Japanese', 'Eastasian')
data = data.replace('Chinese', 'Eastasian')
data = data.replace('Japan', 'Eastasia')
data = data.replace('China', 'Eastasia')
data = data.replace('North Korea', 'Eastasia')
data = data.replace('South Korea', 'Eastasia')
data = data.replace('India', 'Eastasia')
data = data.replace('Hong Kong', 'Eastasia')
data = data.replace('Vietnam', 'Eastasia')
data = data.replace('Singapore', 'Eastasia')
data = data.replace('the Phillipines', 'Eastasia')
data = data.replace('Thailand', 'Eastasia')
data = data.replace('Sri Lanka', 'Eastasia')
data = data.replace('Mexico', 'Oceania')
data = data.replace('Canadian', 'Oceanian')
data = data.replace('Canada', 'Oceania')
data = data.replace('Federal Bureau of Investigation', 'Ministry of Love')
data = data.replace('Department of State', 'Ministry of Truth')
data = data.replace('Pentagon', 'Ministry of Peace')
data = data.replace(' Post', ' Ministry of Truth')
data = data.replace('Russian', 'Eastasian')
data = data.replace('Russia', 'Eastasia')
data = data.replace('Australian', 'Oceanian')
data = data.replace('Australia', 'Oceania')
data = data.replace('Christian Science Monitor', 'INGSOC Science Monitor')
data = data.replace('Christianity', 'INGSOC')
data = data.replace('Washington Post', 'Ministry of Truth')
data = data.replace('Washington&nbsp;Post', 'Ministry&nbsp;of&nbsp;Truth')
data = data.replace('Daily&nbsp;Telegraph', 'Ministry&nbsp;of&nbsp;Truth')
data = data.replace('New&nbsp;York&nbsp;Times', 'Ministry&nbsp;of&nbsp;Truth')
data = data.replace('International Herald Tribune', 'Ministry of Truth')
data = data.replace('International&nbsp;Herald&nbsp;Tribune', 'Ministry&nbsp;of&nbsp;Truth')
data = data.replace('BBC', 'Ministry of Truth')
data = data.replace('wartime', 'peacetime')
data = data.replace('congressional', 'Inner Party')
data = data.replace('Senate', 'Inner Party Senate')
data = data.replace('House of Representatives', 'Inner Party House')
data = data.replace('House', 'Inner Party House')
data = data.replace('Congress', 'Inner Party')
data = data.replace('Marines', 'Peace Keepers')
data = data.replace('soldiers', 'peace keepers')
data = data.replace('Republican Guards', 'Resistance Elite Troops')
data = data.replace('Republican Guard', 'Resistance Elite Troops')
data = data.replace('Department of Defense', 'Ministry of Peace')
data = data.replace('Department of Homeland Security', 'Ministry of Love')
data = data.replace('Associated Press', 'Ministry of Truth')
data = data.replace('FBI', 'Ministry of Love')
data = data.replace('CNN', 'Ministry of Truth')
data = data.replace('New York Times', 'Ministry of Truth')
data = data.replace('UPI', 'Ministry of Truth')
data = data.replace('United Press International', 'Ministry of Truth')
data = data.replace('Federal Reserve', 'Ministry of Plenty')
data = data.replace('Homeland Security', 'Motherland Security')
data = data.replace('homeland security', 'motherland security')
data = data.replace('protesters', 'restistance thought criminals')
data = data.replace('prisoner', 'Ministry of Love prison guest')
# data = data.replace(' prison', 'Ministry of Love')
data = data.replace('solitary confinement', 'Room 101')
data = data.replace('anti-war', 'Resistance')
data = data.replace('Anti-war', 'Resistance')
data = data.replace('paramilitary', 'Ministry of Peace')
data = data.replace('para-military', 'Ministry of Peace')
data = data.replace('military', 'Ministry of Peace')
data = data.replace('Military', 'Ministry of Peace')
data = data.replace('Gulf War', 'Gulf Peace Conflict')
data = data.replace('news network', 'Ministry of Truth')
data = data.replace('Al-Qaeda', 'Resistance')
data = data.replace('Al Qaeda', 'Resistance')
data = data.replace('Al-Quaida', 'Resistance')
data = data.replace('al Qaeda', 'Resistance')
data = data.replace('Qaeda', 'Resistance')
data = data.replace('Taliban', 'Resistance')
data = data.replace('Taleban', 'Resistance')
data = data.replace("Royal Air Force", 'Ministry of Peace Air Force')
data = data.replace('air force', 'Ministry of Peace Air Force')
data = data.replace('Invasion', 'Liberation')
data = data.replace('invasion', 'liberation')
data = data.replace('\'shock and awe\'', 'awe-shock')
data = data.replace('\"shock and awe\"', 'awe-shock')
data = data.replace('weapons of mass destruction', 'gifts of mass coercion')
data = data.replace('demonstrators', 'thoughtcriminals')
data = data.replace('protestors', 'thoughtcriminals')

print time.clock() - start

And the modified code is:

import time, re
start = time.clock()
data = '' # same data
dict = {
    'US and Britain' : 'Oceania', 'US and British' : 'Oceanian',
    'America and Britain' : 'Oceania', 'the U.S.A.' : 'Oceania',
    'the United States of America' : 'Oceania', 'the U.S.' : 'Oceania',
    'the USA' : 'Oceania', 'the US' : 'Oceania', 'The US' : 'Oceania',
    'the United States' : 'Oceania', 'U.S.A.' : 'Oceania',
    'United States of America' : 'Oceania', 'U.S.' : 'Oceania',
    ' US' : ' Oceania', 'US ' : 'Oceania ', 'US,' : 'Oceania,',
    'US-' : 'Oceania-', 'United States' : 'Oceania',
    'American' : 'Oceanian', 'British' : 'Oceanian',
    'Great Britain' : 'Oceania', 'Britain' : 'Oceania',
    'United Kingdom' : 'Oceania',
    'European Union countries' : 'Oceania provinces',
    'European nations' : 'Oceania provinces', 'European' : 'Oceanian',
    'Europe' : 'Oceania', 'the EU' : 'Oceania',
    'Australian' : 'Oceanian', 'Australia' : 'Oceania',
    'Washington Post' : 'Ministry of Truth',
    'Washington DC' : 'Airstrip One', 'Washington D.C.' : 'Airstrip One',
    'Washington, D.C.' : 'Airstrip One', 'Washington, DC' : 'Airstrip One',
    'Washington' : 'Airstrip One', 'WASHINGTON' : 'Airstrip One',
    'Boy Scouts' : 'Junior Anti-Sex League',
    'Girl Scouts' : 'Junior Anti-Sex League',
    'President Hussein' : 'Emmanuel Goldstein, Enemy of the People',
    'Saddam Hussein' : 'Emmanuel Goldstein', 'Saddam' : 'Goldstein',
    'Hussein' : 'Goldstein', 'President Bush' : 'Big Brother',
    'president Bush' : 'Big Brother', 'George Bush' : 'Big Brother',
    'George W. Bush' : 'Big Brother', 'President George W Bush' : 'Big Brother',
    'George W Bush' : 'Big Brother', 'Mr. Bush' : 'Big Brother',
    'Bush' : 'Big Brother', 'Donald Rumsfeld' : 'O\' Brien',
    'Rumsfeld' : 'O\'Brien', 'United Nations' : 'bloc nations',
    'UN' : 'bloc nations', 'U.N.' : 'bloc nations',
    'coalition allies' : 'bloc allies', ' allies' : ' bloc allies',
    'Japanese' : 'Eastasian', 'Chinese' : 'Eastasian',
    'Japan' : 'Eastasia', 'China' : 'Eastasia',
    'North Korea' : 'Eastasia', 'South Korea' : 'Eastasia',
    'India' : 'Eastasia', 'Hong Kong' : 'Eastasia',
    'Vietnam' : 'Eastasia', 'Singapore' : 'Eastasia',
    'the Phillipines' : 'Eastasia', 'Thailand' : 'Eastasia',
    'Sri Lanka' : 'Eastasia', 'Mexico' : 'Oceania',
    'Canadian' : 'Oceanian', 'Canada' : 'Oceania',
    'Federal Bureau of Investigation' : 'Ministry of Love',
    'Department of State' : 'Ministry of Truth',
    'Pentagon' : 'Ministry of Peace', ' Post' : ' Ministry of Truth',
    'Russian' : 'Eastasian', 'Russia' : 'Eastasia',
    'Australian' : 'Oceanian', 'Australia' : 'Oceania',
    'Christian Science Monitor' : 'INGSOC Science Monitor',
    'Christianity' : 'INGSOC', 'Washington Post' : 'Ministry of Truth',
    'Washington&nbsp;Post' : 'Ministry&nbsp;of&nbsp;Truth',
    'Daily&nbsp;Telegraph' : 'Ministry&nbsp;of&nbsp;Truth',
    'New&nbsp;York&nbsp;Times' : 'Ministry&nbsp;of&nbsp;Truth',
    'International Herald Tribune' : 'Ministry of Truth',
    'International&nbsp;Herald&nbsp;Tribune' : 'Ministry&nbsp;of&nbsp;Truth',
    'BBC' : 'Ministry of Truth', 'wartime' : 'peacetime',
    'congressional' : 'Inner Party', 'Senate' : 'Inner Party Senate',
    'House of Representatives' : 'Inner Party House',
    'House' : 'Inner Party House', 'Congress' : 'Inner Party',
    'Marines' : 'Peace Keepers', 'soldiers' : 'peace keepers',
    'Republican Guards' : 'Resistance Elite Troops',
    'Republican Guard' : 'Resistance Elite Troops',
    'Department of Defense' : 'Ministry of Peace',
    'Department of Homeland Security' : 'Ministry of Love',
    'Associated Press' : 'Ministry of Truth', 'FBI' : 'Ministry of Love',
    'CNN' : 'Ministry of Truth', 'New York Times' : 'Ministry of Truth',
    'UPI' : 'Ministry of Truth',
    'United Press International' : 'Ministry of Truth',
    'Federal Reserve' : 'Ministry of Plenty',
    'Homeland Security' : 'Motherland Security',
    'homeland security' : 'motherland security',
    'protesters' : 'restistance thought criminals',
    'prisoner' : 'Ministry of Love prison guest',
    'solitary confinement' : 'Room 101', 'anti-war' : 'Resistance',
    'Anti-war' : 'Resistance', 'paramilitary' : 'Ministry of Peace',
    'para-military' : 'Ministry of Peace', 'military' : 'Ministry of Peace',
    'Military' : 'Ministry of Peace', 'Gulf War' : 'Gulf Peace Conflict',
    'news network' : 'Ministry of Truth', 'Al-Qaeda' : 'Resistance',
    'Al Qaeda' : 'Resistance', 'Al-Quaida' : 'Resistance',
    'al Qaeda' : 'Resistance', 'Qaeda' : 'Resistance',
    'Taliban' : 'Resistance', 'Taleban' : 'Resistance',
    'Royal Air Force' : 'Ministry of Peace Air Force',
    'air force' : 'Ministry of Peace Air Force',
    'Invasion' : 'Liberation', 'invasion' : 'liberation',
    '\'shock and awe\'' : 'awe-shock', '\"shock and awe\"' : 'awe-shock',
    'weapons of mass destruction' : 'gifts of mass coercion',
    'demonstrators' : 'thoughtcriminals', 'protestors' : 'thoughtcriminals' }
pat = re.compile(
    '(Canada|the U\.S\.A\.|air force|International Herald Tribune|'
    'Associated Press|prisoner|Great Britain|George W Bush|US and British|'
    'soldiers|Federal Bureau of Investigation|Europe|demonstrators|'
    'Australia|Congress|Saddam|the U\.S\.|Boy Scouts|solitary confinement|'
    'Taliban|Russian|Thailand|military|Hong Kong|congressional|protestors|'
    'Department of Defense|House of Representatives|'
    'Christian Science Monitor|Canadian|President Bush|the US|'
    'European nations|BBC|protesters|Daily&nbsp;Telegraph|Washington|'
    'American|invasion|Invasion|Military|Marines| US|Rumsfeld|U\.N\.|'
    'Gulf War|paramilitary|North Korea|coalition allies|House|'
    'para-military|US |India|\\\'shock and awe\\\'|Senate|George W. Bush|'
    '\\\"shock and awe\\\"|Al Qaeda|South Korea|Pentagon|'
    'President George W Bush|US,| allies|Washington|New York Times|'
    'wartime|Christianity|Taleban|Bush|New&nbsp;York&nbsp;Times|Al-Quaida|'
    'homeland security|Washington Post|Washington DC|Australian| Post|'
    'Mr\. Bush|United Nations|President Hussein|the United States|'
    'Homeland Security|Federal Reserve|America and Britain|Singapore|'
    'president Bush|the Phillipines|United States of America|Britain|'
    'Donald Rumsfeld|Republican Guards|Department of Homeland Security|'
    'Qaeda|China|al Qaeda|European|Department of State|'
    'Washington&nbsp;Post|the EU|George Bush|'
    'International&nbsp;Herald&nbsp;Tribune|UPI|WASHINGTON|U\.S\.|'
    'news network|European Union countries|FBI|British|Royal Air Force|'
    'United States|United Press International|Vietnam|Republican Guard|'
    'anti-war|CNN|Anti-war|Hussein|Russia|Al-Qaeda|Washington D\.C\.|'
    'Saddam Hussein|the United States of America|Chinese|Mexico|'
    'Girl Scouts|U\.S\.A\.|US-|the USA|Japanese|United Kingdom|'
    'US and Britain|UN|The US|Sri Lanka|Japan|weapons of mass destruction)')
strings = pat.split(data)
for i in xrange(1,len(strings),2):
    strings[i] = dict[strings[i]]
data = ''.join(strings)

I just checked these again, and the results were .052 on average for the
first version, and .080 on average for the second. The second would
obviously be a lot faster if I pre-compiled the pattern and cPickled it,
but I wanted to include compilation time in the test. The fastest
version that I have tested is using Alex's replacement suggestion in
combination with pre-compilation and cPickle, but the naive data =
data.replace is surprisingly fast.

cheers,

nihilo

Bengt Richter

Apr 7, 2003, 12:34:13 PM
On Mon, 07 Apr 2003 00:28:13 GMT, nihilo <exni...@NOmyrealCAPSbox.com> wrote:

>Bengt Richter wrote:
>>
>> Thanks for testing, that is interesting. It would be interesting to see
>> the test code you ran, so we could see the reasons for .25 vs .43 (and
>> I could revise my reality-model as necessary ;-)
>>
>> Regards,
>> Bengt Richter
>
>Hi Bengt,
>
>Here is some code that I created to test the two versions.
>
>Original:
>

[...]


>
>print time.clock() - start
>
>And the modified code is:
>
>import time, re
>start = time.clock()
>data = '' # same data
>dict = { 'US and Britain' : 'Oceania', 'US and British' : 'Oceanian',

[...~ 135 items ...]


>pat = re.compile('(Canada|the U\.S\.A\.|air force|International Herald

[... ditto ...]


>strings = pat.split(data)
>for i in xrange(1,len(strings),2):
>    strings[i] = dict[strings[i]]
>data = ''.join(strings)
>
>I just checked these again, and the results were .052 on average for the
>first version, and .080 on average for the second. The second would
>obviously be a lot faster if I pre-compiled the pattern and cPickled it,

I had assumed pre-compiled regex, but I learned something anyway:
The split is a _lot_ slower than I would have thought. I had guessed that
it would have compiled to a fast state machine like flex might generate
for or-ed static string matches, but something else must be happening.
Maybe it was not worth special-casing static ors.

The loop to substitute into the list plus joining the result is quite fast,
but the split is 2+ orders of magnitude slower than that, and the regex
compilation is somewhat faster than the split, but not much, apparently.
Something like 175:150:1 for split:regexcomp:subst-in-list-using-dict-and-join
in a test I ran on a 63.5k html file with 100 matching substrings in the data.

>but I wanted to include compilation time in the test. The fastest

Why? Just curious -- would the patterns be changing as often as the data?

>version that I have tested is using Alex's replacement suggestion in

What version of python are you running? Mine wouldn't compile the regex
in the form Alex suggested (with every '|' term separately parenthesized,
like '(x)|(y)' vs '(x|y)' ). Maybe I typoed when I typed the change, but
I got

File "D:\python22\lib\sre_compile.py", line 442, in compile
assert p.pattern.groups <= 100,\
AssertionError: sorry, but this version only supports 100 named groups

I was hoping to compare regex speed using the two forms. Can't pursue it now though...

>combination with pre-compilation and cPickle, but the naive data =
>data.replace is surprisingly fast.
>

Yes.
BTW, neither of the regex-based methods are equivalent in semantics
to the naive sequence of data.replace statements, because each data.replace
has the possibility (in general, even if maybe not in the actual example)
of matching something created by a previous data.replace.
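
A tiny contrived illustration of the difference:

>>> import re
>>> 'ab'.replace('a', 'b').replace('b', 'c')  # second replace sees first's output
'cc'
>>> re.sub('(a)|(b)', lambda m: ['b', 'c'][m.lastindex - 1], 'ab')
'bc'

The single-pass version rewrites each piece of the input exactly once, so
the 'b' it produces is never re-examined.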

Thanks for posting the code.

Regards,
Bengt Richter

nihilo

Apr 7, 2003, 9:49:18 PM

Hmmm, I was thinking in terms of equivalent operations, or starting from
a blank slate in each case, but that is silly. What is important is the
time, and if one solution allows you to do some of the work beforehand,
so much the better.

>
>>version that I have tested is using Alex's replacement suggestion in
>
> What version of python are you running? Mine wouldn't compile the regex
> in the form Alex suggested (with every '|' term separately parenthesized,
> like '(x)|(y)' vs '(x|y)' ). Maybe I typoed when I typed the change, but
> I got
>
> File "D:\python22\lib\sre_compile.py", line 442, in compile
> assert p.pattern.groups <= 100,\
> AssertionError: sorry, but this version only supports 100 named groups
>
> I was hoping to compare regex speed using the two forms. Can't pursue it now though...

I'm running 2.3a2, and got the same error. I compared without the
parentheses, figuring that it would give me a rough estimate of how
quick it was. I am going to break the pattern up into sets of 100 named
groups when I have a little time. I'll post final times when I get
around to it if you're curious, including your version with the patterns
precompiled.
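
Something along these lines, perhaps (an untested sketch -- make_subber
and the pairs list are just names I'm using here, and it only handles
literal strings, not the regex patterns):

import re

def make_subber(pairs, chunk=99):
    # pairs is a list of (before, after) strings; build one combined,
    # parenthesized pattern per chunk of at most 99 alternatives, to
    # stay under the 100-group limit
    compiled = []
    for i in xrange(0, len(pairs), chunk):
        part = pairs[i:i+chunk]
        pat = re.compile('(' + ')|('.join([re.escape(b) for b, a in part]) + ')')
        compiled.append((pat, [a for b, a in part]))
    def subber(text):
        # one pass per chunk, so a later chunk can match text produced
        # by an earlier one
        for pat, repls in compiled:
            text = pat.sub(lambda m, repls=repls: repls[m.lastindex - 1], text)
        return text
    return subber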

>>combination with pre-compilation and cPickle, but the naive data =
>>data.replace is surprisingly fast.
>>
>
> Yes.
> BTW, neither of the regex-based methods are equivalent in semantics
> to the naive sequence of data.replace statements, because each data.replace
> has the possibility (in general, even if maybe not in the actual example)
> of matching something created by a previous data.replace.

Yes, I realize that. What I actually want is the semantics of the
regexes, since I don't want to substitute for any of the words that have
already been substituted. The data.replace method was just an easy way
of getting the same result, since none of the data.replace output words
appear as inputs to later data.replaces.

> Thanks for posting the code.
>
> Regards,
> Bengt Richter

No problem. Thanks for all your help.

regards,

-nihilo
