Problem with re module

John Harrington

Mar 22, 2011, 1:56:21 PM

I'm trying to use the following substitution,

lineList[i]=re.sub(r'(\\begin{document})([^$])',r'\1\n\n\2',lineList[i])

I intend this to match any string "\begin{document}" that doesn't end
in a line ending. If there's no line ending, then, I want to place
two carriage returns between the string and the non-line end
character.

However, this places carriage returns even when the string is followed
directly after with a line ending. Can someone explain to me why this
match is not behaving as I intend it to, especially the ([^$])?

Also, how can I write a regex that matches what I wish to match, as
described above?

Many thanks,
John

John Bokma

Mar 22, 2011, 2:16:08 PM

John Harrington <bearti...@gmail.com> writes:

> I'm trying to use the following substitution,
>
> lineList[i]=re.sub(r'(\\begin{document})([^$])',r'\1\n\n
> \2',lineList[i])
>
> I intend this to match any string "\begin{document}" that doesn't end
> in a line ending. If there's no line ending, then, I want to place
> two carriage returns between the string and the non-line end
> character.
>
> However, this places carriage returns even when the string is followed
> directly after with a line ending. Can someone explain to me why this
> match is not behaving as I intend it to, especially the ([^$])?

[^$] matches: not a $ character

You might want [^\n]

--
John Bokma j3b

Blog: http://johnbokma.com/ Facebook: http://www.facebook.com/j.j.j.bokma
Freelance Perl & Python Development: http://castleamber.com/
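
To see the point about [^$] concretely, here is a minimal sketch. The short sample strings are made up for illustration and are not from the original file:

```python
import re

# Inside a character class, $ is a literal dollar sign, not an end-of-line anchor.
dollar_class = re.compile(r'x([^$])')

# A newline is "not a dollar sign", so [^$] matches it.
assert dollar_class.search('x\n') is not None
# Only a literal '$' right after x prevents the match.
assert dollar_class.search('x$') is None

# [^\n] excludes the newline character itself.
newline_class = re.compile(r'x([^\n])')
assert newline_class.search('x\n') is None
assert newline_class.search('xy').group(1) == 'y'
```

So with [^$] the newline after \begin{document} is captured as group 2, which is why the substitution fires on every line.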

Peter Otten

Mar 22, 2011, 2:35:57 PM

to pytho...@python.org

John Harrington wrote:

> I'm trying to use the following substitution,
>
> lineList[i]=re.sub(r'(\\begin{document})([^$])',r'\1\n\n
> \2',lineList[i])
>
> I intend this to match any string "\begin{document}" that doesn't end
> in a line ending. If there's no line ending, then, I want to place
> two carriage returns between the string and the non-line end
> character.
>
> However, this places carriage returns even when the string is followed
> directly after with a line ending. Can someone explain to me why this
> match is not behaving as I intend it to, especially the ([^$])?

Quoting http://docs.python.org/library/re.html:
"""
Special characters are not active inside sets. For example, [akm$] will
match any of the characters 'a', 'k', 'm', or '$';

"""
>
> Also, how can I write a regex that matches what I wish to match, as
> described above?

I think you want a "negative lookahead assertion", (?!...):

>>> print(re.compile("(xxx)(?!$)", re.MULTILINE).sub(r"\1**", "aaa bbb xxx\naaa xxx bbb\nxxx"))
aaa bbb xxx
aaa xxx** bbb
xxx
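
Applied to the \begin{document} case, the lookahead idea might look like the sketch below. The helper name split_after_begin and the sample text are assumptions for illustration only:

```python
import re

# (?!$) succeeds only when the position is NOT at the end of a line.
# With re.MULTILINE, $ matches just before every newline, not only at the end.
BEGIN_DOC = re.compile(r'(\\begin{document})(?!$)', re.MULTILINE)

def split_after_begin(text):
    # Hypothetical helper: add a blank line after \begin{document}
    # unless the line already ends right there.
    return BEGIN_DOC.sub(r'\1\n\n', text)

sample = '\\begin{document}\nFirst\n\\begin{document}Extra stuff\n'
result = split_after_begin(sample)

# The first occurrence is left alone; only the second one is split.
assert result == '\\begin{document}\nFirst\n\\begin{document}\n\nExtra stuff\n'
```

Because the lookahead consumes nothing, there is no second group to put back in the replacement. Note that a trailing space or carriage return would still count as "not the end of the line" here.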

John Harrington

Mar 22, 2011, 2:40:11 PM

On Mar 22, 11:16 am, John Bokma <j...@castleamber.com> wrote:

> John Harrington <beartiger....@gmail.com> writes:
> > I'm trying to use the following substitution,
>
> > lineList[i]=re.sub(r'(\\begin{document})([^$])',r'\1\n\n
> > \2',lineList[i])
>
> > I intend this to match any string "\begin{document}" that doesn't end
> > in a line ending. If there's no line ending, then, I want to place
> > two carriage returns between the string and the non-line end
> > character.
>
> > However, this places carriage returns even when the string is followed
> > directly after with a line ending. Can someone explain to me why this
> > match is not behaving as I intend it to, especially the ([^$])?
>
> [^$] matches: not a $ character
>
> You might want [^\n]

Thank you, John.

I thought that when you use "r" before the regex, $ matches an end of
line. But, in any case, if I use "[^\n]" as you suggest I get the
same result.

Here's a script that illustrates the problem. Any help would be
appreciated!:

#BEGIN SCRIPT
import re

outlist = []
myfile = "raw.tex"

fin = open(myfile, "r")
lineList = fin.readlines()
fin.close()

for i in range(0,len(lineList)):
    lineList[i]=re.sub(r'(\\begin{document})([^\n])',r'\1\n\n\2',lineList[i])
    outlist.append(lineList[i])

fou = open(myfile, "w")
for i in range(len(outlist)):
    fou.write(outlist[i])
fou.close()
#END SCRIPT

And the file raw.tex:

%BEGIN TeX FILE
\begin{document}
This line should remain right after the above line in the output, but
doesn't

\begin{document}Extra stuff here should appear below the begin line
and does in the output.
%END TeX FILE

Benjamin Kaplan

Mar 22, 2011, 3:07:06 PM

to pytho...@python.org

On Tue, Mar 22, 2011 at 2:40 PM, John Harrington
<bearti...@gmail.com> wrote:
> On Mar 22, 11:16 am, John Bokma <j...@castleamber.com> wrote:
>> John Harrington <beartiger....@gmail.com> writes:
>> > I'm trying to use the following substitution,
>>
>> > lineList[i]=re.sub(r'(\\begin{document})([^$])',r'\1\n\n
>> > \2',lineList[i])
>>
>> > I intend this to match any string "\begin{document}" that doesn't end
>> > in a line ending. If there's no line ending, then, I want to place
>> > two carriage returns between the string and the non-line end
>> > character.
>>
>> > However, this places carriage returns even when the string is followed
>> > directly after with a line ending. Can someone explain to me why this
>> > match is not behaving as I intend it to, especially the ([^$])?
>>
>> [^$] matches: not a $ character
>>
>> You might want [^\n]
>
> Thank you, John.
>
> I thought that when you use "r" before the regex, $ matches an end of
> line. But, in any case, if I use "[^\n]" as you suggest I get the
> same result.
>

r before a string has nothing to do with regexes. It signals a raw
string: backslash escape sequences won't be processed.
>>> print('a\tb')
a       b
>>> print(r'a\tb')
a\tb

We use raw strings for regexes because otherwise you'd have to
remember to double up all your backslashes, and double up your
doubled-up backslashes when you really want a backslash.
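
A small sketch of the same point, checked with assertions instead of print. The variable names are only for illustration:

```python
import re

# In a normal string literal, Python turns \t into a single tab character.
normal = 'a\tb'
# In a raw string literal, the backslash and the t stay as two characters.
raw = r'a\tb'

assert len(normal) == 3
assert len(raw) == 4

# Without the r prefix, every backslash meant for the regex must be doubled.
assert r'\\begin' == '\\\\begin'

# The r prefix does not change what $ means: outside a character class it is
# an end-of-line anchor in both spellings.
assert re.search('x$', 'x\n') is not None
assert re.search(r'x$', 'x\n') is not None
```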

Your script works for me. Do you have a space after the \begin{document}
or something? Because that would get moved. You might want to check for
non-whitespace characters in the regex instead of just non-newlines.
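
A rough sketch of the difference between the two checks. The carriage-return case is an assumption (a file with Windows line endings read without newline translation) and was not confirmed in this thread:

```python
import re

REPL = r'\1\n\n\2'
pattern_any = re.compile(r'(\\begin{document})([^\n])')   # any non-newline character
pattern_nonspace = re.compile(r'(\\begin{document})(\S)')  # any non-whitespace character

clean = '\\begin{document}\n'
trailing_space = '\\begin{document} \n'
trailing_cr = '\\begin{document}\r\n'  # assumed scenario, see the note above

# [^\n] leaves a clean line alone but fires on invisible trailing characters.
assert pattern_any.sub(REPL, clean) == clean
assert pattern_any.sub(REPL, trailing_space) == '\\begin{document}\n\n \n'
assert pattern_any.sub(REPL, trailing_cr) == '\\begin{document}\n\n\r\n'

# \S ignores spaces, carriage returns, and newlines alike.
for line in (clean, trailing_space, trailing_cr):
    assert pattern_nonspace.sub(REPL, line) == line

# Real text after the command is still moved down.
moved = pattern_nonspace.sub(REPL, '\\begin{document}Extra\n')
assert moved == '\\begin{document}\n\nExtra\n'
```

Printing repr(line) for the offending line is a quick way to see whether such an invisible character is present.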

John Harrington

Mar 22, 2011, 3:30:58 PM

Matching the non-whitespace works, but I'm troubled I can't match a
non-end-of-line. No, there was no space after the string.

Thank you for your help, Ben

Ethan Furman

Mar 22, 2011, 7:26:21 PM

to John Harrington, pytho...@python.org

Here's the important tidbit:

re.sub(r'(\\begin{document})(.+)', r'\1\n\n\2', line)

From the docs:
'.'
(Dot.) In the default mode, this matches any character except a newline.
If the DOTALL flag has been specified, this matches any character
including a newline.

'+'
Causes the resulting RE to match 1 or more repetitions of the preceding
RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will
not match just ‘a’.
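
As a quick check of the two doc excerpts above, here is a minimal sketch with made-up sample lines. It also shows what happens if DOTALL is switched on:

```python
import re

pattern = r'(\\begin{document})(.+)'
repl = r'\1\n\n\2'

bare = '\\begin{document}\n'
with_text = '\\begin{document}Extra stuff\n'

# Default mode: . does not match the newline, so .+ finds nothing on a bare line.
default_bare = re.sub(pattern, repl, bare)
default_text = re.sub(pattern, repl, with_text)
assert default_bare == bare
assert default_text == '\\begin{document}\n\nExtra stuff\n'

# With DOTALL, . also matches the newline, so even the bare line is rewritten.
dotall_bare = re.sub(pattern, repl, bare, flags=re.DOTALL)
assert dotall_bare == '\\begin{document}\n\n\n'
```

The default mode is the one wanted here. Like [^\n], .+ will still match a trailing space or other invisible non-newline character.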

And here's the entire program, a bit more pythonically:

8<---------------------------------------------------------------
import re

outlist = []
myfile = "raw.tex"

fin = open(myfile, "r")
lineList = fin.readlines()
fin.close()

for line in lineList:
    # .+ needs at least one non-newline character after \begin{document}
    line = re.sub(r'(\\begin{document})(.+)', r'\1\n\n\2', line)
    outlist.append(line)

fou = open(myfile, "w")

for line in outlist:
    fou.write(line)
fou.close()
8<---------------------------------------------------------------

Hope this helps!

~Ethan~