# Problem with re module

### John Harrington

Mar 22, 2011, 1:56:21 PM
I'm trying to use the following substitution,

lineList[i]=re.sub(r'(\\begin{document})([^$])',r'\1\n\n\2',lineList[i])

I intend this to match any string "\begin{document}" that doesn't end
in a line ending. If there's no line ending, then, I want to place
two carriage returns between the string and the non-line end
character.

However, this places carriage returns even when the string is followed
directly after with a line ending. Can someone explain to me why this
match is not behaving as I intend it to, especially the ([^$])?

Also, how can I write a regex that matches what I wish to match, as
described above?

Many thanks,
John

### John Bokma

Mar 22, 2011, 2:16:08 PM
John Harrington <bearti...@gmail.com> writes:

> I'm trying to use the following substitution,
>
> lineList[i]=re.sub(r'(\\begin{document})([^$])',r'\1\n\n\2',lineList[i])
>
> I intend this to match any string "\begin{document}" that doesn't end
> in a line ending. If there's no line ending, then, I want to place
> two carriage returns between the string and the non-line end
> character.
>
> However, this places carriage returns even when the string is followed
> directly after with a line ending. Can someone explain to me why this
> match is not behaving as I intend it to, especially the ([^$])?

[^$] matches: not a $ character

You might want [^\n]
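
A quick way to see the difference between the two character classes is to run both on a line that ends right after \begin{document}. This is only a minimal Python 3 sketch: the sample string is made up for illustration and assumes the line still carries its trailing newline, as lines returned by readlines() do.

```python
import re

# Hypothetical sample line, with the trailing newline that readlines() keeps.
line = "\\begin{document}\n"

# Inside a character class, $ is a literal dollar sign, so [^$] accepts "\n".
with_dollar_class = re.sub(r'(\\begin{document})([^$])', r'\1\n\n\2', line)

# [^\n] refuses the newline, so nothing matches and the line is unchanged.
with_newline_class = re.sub(r'(\\begin{document})([^\n])', r'\1\n\n\2', line)

print(repr(with_dollar_class))   # '\\begin{document}\n\n\n'
print(repr(with_newline_class))  # '\\begin{document}\n'
```

With [^$] the newline itself is captured as group 2, which is why extra blank lines appear even when nothing else follows on the line.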

--
John Bokma j3b

Freelance Perl & Python Development: http://castleamber.com/

### Peter Otten

Mar 22, 2011, 2:35:57 PM
to pytho...@python.org
John Harrington wrote:

> I'm trying to use the following substitution,
>
> lineList[i]=re.sub(r'(\\begin{document})([^$])',r'\1\n\n\2',lineList[i])
>
> I intend this to match any string "\begin{document}" that doesn't end
> in a line ending. If there's no line ending, then, I want to place
> two carriage returns between the string and the non-line end
> character.
>
> However, this places carriage returns even when the string is followed
> directly after with a line ending. Can someone explain to me why this
> match is not behaving as I intend it to, especially the ([^$])?

Quoting http://docs.python.org/library/re.html:
"""
Special characters are not active inside sets. For example, [akm$] will match any of the characters 'a', 'k', 'm', or '$';

"""
>
> Also, how can I write a regex that matches what I wish to match, as
> described above?

I think you want a "negative lookahead assertion", (?!...):

>>> print re.compile("(xxx)(?!$)", re.MULTILINE).sub(r"\1**", "aaa bbb xxx\naaa xxx bbb\nxxx")
aaa bbb xxx
aaa xxx** bbb
xxx

### John Harrington

Mar 22, 2011, 2:40:11 PM
On Mar 22, 11:16 am, John Bokma <j...@castleamber.com> wrote:
> John Harrington <beartiger....@gmail.com> writes:
> > I'm trying to use the following substitution,
>
> > lineList[i]=re.sub(r'(\\begin{document})([^$])',r'\1\n\n\2',lineList[i])
>
> > I intend this to match any string "\begin{document}" that doesn't end
> > in a line ending.  If there's no line ending, then, I want to place
> > two carriage returns between the string and the non-line end
> > character.
>
> > However, this places carriage returns even when the string is followed
> > directly after with a line ending.  Can someone explain to me why this
> > match is not behaving as I intend it to, especially the ([^$])?
>
> [^$] matches: not a $ character
>
> You might want [^\n]

Thank you, John.

I thought that when you use "r" before the regex, $ matches an end of
line. But, in any case, if I use "[^\n]" as you suggest I get the
same result.

Here's a script that illustrates the problem. Any help would be
appreciated!:

#BEGIN SCRIPT
import re

outlist = []
myfile = "raw.tex"

fin = open(myfile, "r")
lineList = fin.readlines()
fin.close()

for i in range(0,len(lineList)):
    lineList[i]=re.sub(r'(\\begin{document})([^\n])',r'\1\n\n\2',lineList[i])
    outlist.append(lineList[i])

fou = open(myfile, "w")
for i in range(len(outlist)):
    fou.write(outlist[i])
fou.close()
#END SCRIPT

And the file raw.tex:

%BEGIN TeX FILE
\begin{document}
This line should remain right after the above line in the output, but
doesn't

\begin{document}Extra stuff here should appear below the begin line
and does in the output.
%END TeX FILE

### Benjamin Kaplan

Mar 22, 2011, 3:07:06 PM
to pytho...@python.org
On Tue, Mar 22, 2011 at 2:40 PM, John Harrington
<bearti...@gmail.com> wrote:
> On Mar 22, 11:16 am, John Bokma <j...@castleamber.com> wrote:
>> John Harrington <beartiger....@gmail.com> writes:
>> > I'm trying to use the following substitution,
>>
>> >      lineList[i]=re.sub(r'(\\begin{document})([^$])',r'\1\n\n\2',lineList[i])
>>
>> > I intend this to match any string "\begin{document}" that doesn't end
>> > in a line ending. If there's no line ending, then, I want to place
>> > two carriage returns between the string and the non-line end
>> > character.
>>
>> > However, this places carriage returns even when the string is followed
>> > directly after with a line ending. Can someone explain to me why this
>> > match is not behaving as I intend it to, especially the ([^$])?
>>
>> [^$] matches: not a $ character
>>
>> You might want [^\n]
>
> Thank you, John.
>
> I thought that when you use "r" before the regex, $ matches an end of
> line.  But, in any case, if I use "[^\n]" as you suggest I get the
> same result.
>

r before a string has nothing to do with regexes. It signals a raw
string: escape sequences won't be processed.
>>> print 'a\tb'
a b
>>> print r'a\tb'
a\tb

We use raw strings for regexes because otherwise, you'd have to
remember to double up all your backslashes. And double up your doubled up
backslashes when you really want a backslash.
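
The same point can be checked in Python 3 by comparing string lengths and literals directly. This is just an illustrative sketch; the variable names are made up.

```python
# The r prefix changes how the string literal is read by Python,
# not how the re module behaves.
plain = 'a\tb'   # \t is turned into a single tab character
raw = r'a\tb'    # the backslash and the t stay as two characters

print(len(plain))  # 3
print(len(raw))    # 4

# To match one literal backslash, the regex needs \\ in the pattern.
# Without the r prefix that takes four backslashes in the source code.
same_pattern = ('\\\\begin' == r'\\begin')
print(same_pattern)  # True
```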

Works for me. Do you have a space after the \begin{document} or
something? Because that gets moved. You might want to check for
non-whitespace characters in the regex instead of just non-newlines.
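
A minimal Python 3 sketch of the non-whitespace check might look like the following. The helper name split_after_begin and the sample lines are hypothetical; in particular, the thread does not establish whether the real file contained a trailing space or a carriage return, so those samples are only there to show how \S treats them.

```python
import re

# \S matches any character that is not whitespace, so spaces, tabs,
# "\r" and "\n" after \begin{document} are all ignored.
PATTERN = re.compile(r'(\\begin{document})(\S)')

def split_after_begin(line):
    """Insert a blank line after \\begin{document} only if visible text follows."""
    return PATTERN.sub(r'\1\n\n\2', line)

# Hypothetical sample lines for illustration.
samples = [
    "\\begin{document}\n",             # plain line ending: unchanged
    "\\begin{document} \n",            # trailing space: unchanged
    "\\begin{document}\r\n",           # carriage return + newline: unchanged
    "\\begin{document}Extra stuff\n",  # visible text: gets split
]
for s in samples:
    print(repr(split_after_begin(s)))
```

One trade-off of this pattern: text separated from \begin{document} by a space is also left alone, because the character right after the closing brace is whitespace.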

### John Harrington

Mar 22, 2011, 3:30:58 PM

Matching the non-whitespace works, but I'm troubled I can't match a
non-end-of-line. No, there was no space after the string.

Thank you for your help, Ben
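
For the "not at end of line" condition itself, the negative lookahead suggested earlier in the thread can also be applied one line at a time. The following is a rough Python 3 sketch under the assumption that each line ends with a plain "\n"; the name split_unless_eol is made up for illustration.

```python
import re

# (?!$) succeeds only where the end of the line does NOT come next.
# Without re.MULTILINE, $ matches at the very end of the string and
# just before a final newline, which suits lines from readlines().
NOT_AT_EOL = re.compile(r'(\\begin{document})(?!$)')

def split_unless_eol(line):
    # The lookahead consumes nothing, so only group 1 is put back.
    return NOT_AT_EOL.sub(r'\1\n\n', line)

print(repr(split_unless_eol("\\begin{document}\n")))
print(repr(split_unless_eol("\\begin{document}Extra stuff\n")))
```

Because $ is used outside a character class here, it keeps its special meaning, unlike in [^$].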

### Ethan Furman

Mar 22, 2011, 7:26:21 PM
to John Harrington, pytho...@python.org

Here's the important tidbit:

re.sub(r'(\\begin{document})(.+)', r'\1\n\n\2', line)

From the docs:
'.'
(Dot.) In the default mode, this matches any character except a newline.
If the DOTALL flag has been specified, this matches any character
including a newline.

'+'
Causes the resulting RE to match 1 or more repetitions of the preceding
RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will
not match just ‘a’.
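
To see how these two pieces combine, here is a small Python 3 sketch that applies the (.+) pattern to in-memory lines instead of a file. The sample lines are made up and assume plain "\n" line endings.

```python
import re

# Hypothetical lines, shaped like the ones in raw.tex.
lines = [
    "\\begin{document}\n",
    "\\begin{document}Extra stuff here\n",
]

# .+ needs at least one character that is not a newline, so a line that
# ends right after \begin{document} cannot match and stays unchanged.
fixed = [re.sub(r'(\\begin{document})(.+)', r'\1\n\n\2', line) for line in lines]

for line in fixed:
    print(repr(line))
```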

And here's the entire program, a bit more pythonically:

8<---------------------------------------------------------------
import re

outlist = []
myfile = "raw.tex"

fin = open(myfile, "r")
lineList = fin.readlines()
fin.close()

for line in lineList:
    line = re.sub(r'(\\begin{document})(.+)', r'\1\n\n\2', line)
    outlist.append(line)

fou = open(myfile, "w")

for line in outlist:
    fou.write(line)
fou.close()
8<---------------------------------------------------------------

Hope this helps!

~Ethan~