I am having a problem when substituting a raw string. When I do the following:
re.sub('abc', r'a\nb\nc', '123abcdefg')
I get
"""
123a
b
cdefg
"""
what I want is
r'123a\nb\ncdefg'
How do I get what I want?
Thanks,
-EdK
Ed Keith
e_...@yahoo.com
Blog: edkeith.blogspot.com
It looks like raw strings let you avoid escaping backslashes when
specifying the literal, but re.sub still processes the escapes in the
replacement. Changing your replacement string to r'a\\nb\\nc' seems to
give the desired output.
cheers
> I am having a problem when substituting a raw string. When I do the
> following:
>
> re.sub('abc', r'a\nb\nc', '123abcdefg')
>
> I get
>
> """
> 123a
> b
> cdefg
> """
>
> what I want is
>
> r'123a\nb\ncdefg'
From http://docs.python.org/library/re.html#re.sub
re.sub(pattern, repl, string[, count])
...repl can be a string or a function; if
it is a string, any backslash escapes in
it are processed. That is, \n is converted
to a single newline character, \r is
converted to a linefeed, and so forth.
So you'll have to double your backslashes:
py> re.sub('abc', r'a\\nb\\nc', '123abcdefg')
'123a\\nb\\ncdefg'
--
Gabriel Genellina
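For readers on Python 3, here is a minimal sketch of the behaviour described above. It assumes only the standard re module; the variable names are made up for illustration.

```python
import re

text = '123abcdefg'

# Single backslash in the template: re.sub turns \n into a real newline.
with_newlines = re.sub('abc', r'a\nb\nc', text)

# Doubled backslash in the template: each \\ becomes one literal backslash.
with_backslashes = re.sub('abc', r'a\\nb\\nc', text)

print(repr(with_newlines))     # '123a\nb\ncdefg'
print(repr(with_backslashes))  # '123a\\nb\\ncdefg'
```

The second repr shows doubled backslashes only because repr escapes them; the string itself contains a single backslash before each n.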
That is going to be a nontrivial exercise. I have control over the pattern, but the texts to be substituted and substituted into will be read from user-supplied files. I need to reproduce the exact text that is read from the file.
Maybe what I should do is use re to break the string into two pieces, the part before the pattern to be replaced and the part after it, then splice the replacement text in between them. Seems like doing it the hard way, but it should work.
Thanks,
-EdK
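The splice idea above could be sketched roughly as follows. This is only one possible reading of it: splice_first is a hypothetical helper name, and it replaces the first match only.

```python
import re

def splice_first(pattern, replacement, text):
    """Replace the first match of pattern with replacement, taken literally.

    splice_first is a hypothetical helper, not part of the re module.
    """
    match = re.search(pattern, text)
    if match is None:
        return text  # nothing to replace
    # Slice around the match; the replacement is never parsed as a template.
    return text[:match.start()] + replacement + text[match.end():]

print(splice_first('abc', r'a\nb\nc', '123abcdefg'))  # 123a\nb\ncdefg
```

Because the replacement is joined by plain string concatenation, whatever was read from the file comes out unchanged.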
There is a helper function re.escape() that you can use to sanitize the
substitution:
>>> print re.sub('abc', re.escape(r'a\nb\nc'), '123abcdefg')
123a\nb\ncdefg
Peter
> There is a helper function re.escape() that you can use to sanitize the
> substitution:
>
>>>> print re.sub('abc', re.escape(r'a\nb\nc'), '123abcdefg')
> 123a\nb\ncdefg
Unfortunately re.escape does much more than that:
py> print re.sub('abc', re.escape(r'a.b.c'), '123abcdefg')
123a\.b\.cdefg
I think the string_escape encoding is what the OP needs:
py> print re.sub('abc', r'a\n(b.c)\nd'.encode("string_escape"), '123abcdefg')
123a\n(b.c)\nddefg
--
Gabriel Genellina
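Note that the string_escape codec exists only in Python 2. For this particular purpose, a rough Python 3 sketch is to double the backslashes and nothing else, which is less than string_escape did but enough to get the text through re.sub unchanged. escape_replacement is a hypothetical helper name.

```python
import re

def escape_replacement(replacement):
    """Double every backslash so re.sub inserts the text unchanged.

    escape_replacement is a hypothetical helper, not part of the re module.
    """
    return replacement.replace('\\', '\\\\')

user_text = r'a\n(b.c)\nd'
result = re.sub('abc', escape_replacement(user_text), '123abcdefg')
print(result)  # 123a\n(b.c)\nddefg
```

Unlike re.escape, this leaves the dot and the parentheses alone, since they have no special meaning in a replacement string.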
> Unfortunately re.escape does much more than that:
>
> py> print re.sub('abc', re.escape(r'a.b.c'), '123abcdefg')
> 123a\.b\.cdefg
Sorry, I didn't think of that.
> I think the string_escape encoding is what the OP needs:
>
> py> print re.sub('abc', r'a\n(b.c)\nd'.encode("string_escape"),
> '123abcdefg')
> 123a\n(b.c)\nddefg
Another possibility:
>>> print re.sub('abc', lambda m: r'a\nb\n.c\a', '123abcdefg')
123a\nb\n.c\adefg
Peter
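The function form generalises to text read from a file. A small sketch, assuming a wrapper called literal_sub (a hypothetical name):

```python
import re

def literal_sub(pattern, replacement, text, count=0):
    """Substitute replacement for pattern without template processing.

    literal_sub is a hypothetical helper, not part of the re module.
    """
    # The function ignores the match and returns the replacement verbatim.
    return re.sub(pattern, lambda match: replacement, text, count=count)

print(literal_sub('abc', r'a\nb\n.c\a', '123abcdefg'))  # 123a\nb\n.c\adefg
```

When repl is callable, re.sub uses its return value as is, so no backslash handling takes place.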
> Another possibility:
>
> >>> print re.sub('abc', lambda m: r'a\nb\n.c\a',
> '123abcdefg')
> 123a\nb\n.c\adefg
I'm not sure whether that is clever, ugly, or just plain strange!
I think I'll stick with:
>>> m = re.match('^(.*)abc(.*)$', '123abcdefg')
>>> print m.group(1) + r'a\nb\n.c\a' + m.group(2)
123a\nb\n.c\adefg
It's much less likely to fry the poor maintenance programmer's mind.
On 12/16/2009 9:35 AM, Gabriel Genellina wrote:
> From http://docs.python.org/library/re.html#re.sub
>
> re.sub(pattern, repl, string[, count])
>
> ...repl can be a string or a function; if
> it is a string, any backslash escapes in
> it are processed. That is, \n is converted
> to a single newline character, \r is
> converted to a linefeed, and so forth.
>
> So you'll have to double your backslashes:
I'm not persuaded that the docs are clear. Consider:
>>> 'ab\\ncd' == r'ab\ncd'
True
Naturally enough. So I think the right answer is:
1. this is a documentation bug (i.e., the documentation
fails to specify unexpected behavior for raw strings), or
2. this is a bug (i.e., raw strings are not handled correctly
when used as replacements)
I vote for 2.
Peter's use of a function highlights just how odd this is:
getting the raw string via a function produces a different
result than providing it directly. If this is really the
way things ought to be, I'd appreciate a clear explanation
of why.
Alan Isaac
> Naturally enough. So I think the right answer is:
>
> 1. this is a documentation bug (i.e., the documentation
> fails to specify unexpected behavior for raw strings), or
> 2. this is a bug (i.e., raw strings are not handled correctly
> when used as replacements)
<neo> There is no raw string. </neo>
A raw string is not a distinct type from an ordinary string
in the same way byte strings and Unicode strings are. It
is merely a notation for constants, like writing integers
in hexadecimal.
>>> (r'\n', u'a', 0x16)
('\\n', u'a', 22)
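The same point in Python 3 terms, as a short sketch (the variable names are only illustrative):

```python
raw = r'\n'        # raw notation: backslash followed by n
escaped = '\\n'    # ordinary notation for the same two characters
newline = '\n'     # ordinary notation for one newline character

print(raw == escaped)              # True
print(type(raw) is type(newline))  # True: both are plain str
print(len(raw), len(newline))      # 2 1
```

By the time re.sub receives its arguments, nothing records which notation was used in the source code.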
Yes, that was a mistake. But the problem remains::
>>> re.sub('abc', r'a\nb\n.c\a','123abcdefg') == re.sub('abc', 'a\\nb\\n.c\\a',' 123abcdefg') == re.sub('abc', 'a\nb\n.c\a','123abcdefg')
True
>>> r'a\nb\n.c\a' == 'a\\nb\\n.c\\a' == 'a\nb\n.c\a'
False
Why are the first two strings being treated as if they are the last one?
That is, why isn't '\\' being processed in the obvious way?
This still seems wrong. Why isn't it?
More simply, consider::
>>> re.sub('abc', '\\', '123abcdefg')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python26\lib\re.py", line 151, in sub
    return _compile(pattern, 0).sub(repl, string, count)
  File "C:\Python26\lib\re.py", line 273, in _subx
    template = _compile_repl(template, pattern)
  File "C:\Python26\lib\re.py", line 260, in _compile_repl
    raise error, v # invalid expression
sre_constants.error: bogus escape (end of line)
Why is this the proper handling of what one might think would be an
obvious substitution?
Thanks,
Alan Isaac
Was this a straight cut and paste or did you make a manual change? Is
that leading space in the middle one a copying error? I get False for
what you actually have there for obvious reasons.
> >>> r'a\nb\n.c\a' == 'a\\nb\\n.c\\a' == 'a\nb\n.c\a'
> False
>
> Why are the first two strings being treated as if they are the last one?
They aren't. The last string is different.
>>> for x in (r'a\nb\n.c\a', 'a\\nb\\n.c\\a', 'a\nb\n.c\a'): print repr(x)
...
'a\\nb\\n.c\\a'
'a\\nb\\n.c\\a'
'a\nb\n.c\x07'
> That is, why isn't '\\' being processed in the obvious way?
> This still seems wrong. Why isn't it?
What do you think is wrong? What would the "obvious" way of handling
'\\' be?
>
> More simply, consider::
>
> >>> re.sub('abc', '\\', '123abcdefg')
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File "C:\Python26\lib\re.py", line 151, in sub
> return _compile(pattern, 0).sub(repl, string, count)
> File "C:\Python26\lib\re.py", line 273, in _subx
> template = _compile_repl(template, pattern)
> File "C:\Python26\lib\re.py", line 260, in _compile_repl
> raise error, v # invalid expression
> sre_constants.error: bogus escape (end of line)
>
> Why is this the proper handling of what one might think would be an
> obvious substitution?
Is this what you want? What you have is a re expression consisting of
a single backslash that doesn't escape anything (EOL) so it barfs.
>>> re.sub('abc', r'\\', '123abcdefg')
'123\\defg'
--
D'Arcy J.M. Cain <da...@druid.net> | Democracy is three wolves
http://www.druid.net/darcy/ | and a sheep voting on
+1 416 425 1212 (DoD#0082) (eNTP) | what's for dinner.
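A Python 3 sketch of the two cases just discussed. The exact error message differs between versions, so the sketch only checks that an error is raised.

```python
import re

try:
    re.sub('abc', '\\', '123abcdefg')   # template is one lone backslash
except re.error as exc:
    failure = str(exc)
else:
    failure = None

# Two backslashes in the template produce one backslash in the output.
fixed = re.sub('abc', r'\\', '123abcdefg')

print(failure is not None)  # True: the lone backslash was rejected
print(fixed)                # 123\defg
```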
Some of these regex escape sequences are the same as those of string
literals, eg \n represents a newline; others are different, eg \b in a
regex represents a word boundary and not a backspace as in a string
literal.
You can match a newline in a regex by either using an actual newline
character ('\n' in a string literal) or an escape sequence ('\\n' or
r'\n' in a string literal). If you want a regex to match an actual
backslash followed by a letter 'n' then you need to escape the backslash
in the regex and then either use a raw string literal or escape it again
in a non-raw string literal.
Match characters: <newline>
    Regex: \n
    Raw string literal: r'\n'
    Non-raw string literal: '\\n'

Match characters: \n
    Regex: \\n
    Raw string literal: r'\\n'
    Non-raw string literal: '\\\\n'

Replace with characters: <newline>
    Replacement: \n
    Raw string literal: r'\n'
    Non-raw string literal: '\\n'

Replace with characters: \n
    Replacement: \\n
    Raw string literal: r'\\n'
    Non-raw string literal: '\\\\n'
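The four cases listed above can be checked with a few assertions. This is a sketch against Python 3's re module; the sample strings are made up.

```python
import re

newline = '\n'             # one character
backslash_n = '\\n'        # two characters: backslash, n

# Matching: regex \n finds a newline, regex \\n finds backslash + n.
assert re.search(r'\n', 'x' + newline + 'y')
assert re.search(r'\\n', 'x' + backslash_n + 'y')
assert not re.search(r'\\n', 'x' + newline + 'y')

# Replacing: template \n inserts a newline, template \\n inserts backslash + n.
assert re.sub('-', r'\n', 'x-y') == 'x' + newline + 'y'
assert re.sub('-', r'\\n', 'x-y') == 'x' + backslash_n + 'y'

# Raw and non-raw literals spell the same strings.
assert r'\n' == '\\n' and r'\\n' == '\\\\n'
```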
On 12/17/2009 12:19 PM, D'Arcy J.M. Cain wrote:
> They aren't. The last string is different.
Of course it is different.
That is the basis of my question.
Why is it being treated as if it is the same?
(See the end of this post.)
> Alan G Isaac<alan....@gmail.com> wrote:
>> More simply, consider::
>>
>> >>> re.sub('abc', '\\', '123abcdefg')
>> Traceback (most recent call last):
>> File "<stdin>", line 1, in<module>
>> File "C:\Python26\lib\re.py", line 151, in sub
>> return _compile(pattern, 0).sub(repl, string, count)
>> File "C:\Python26\lib\re.py", line 273, in _subx
>> template = _compile_repl(template, pattern)
>> File "C:\Python26\lib\re.py", line 260, in _compile_repl
>> raise error, v # invalid expression
>> sre_constants.error: bogus escape (end of line)
>>
>> Why is this the proper handling of what one might think would be an
>> obvious substitution?
On 12/17/2009 12:19 PM, D'Arcy J.M. Cain wrote:
> Is this what you want? What you have is a re expression consisting of
> a single backslash that doesn't escape anything (EOL) so it barfs.
>>>> re.sub('abc', r'\\', '123abcdefg')
> '123\\defg'
Turning again to the documentation:
"if it is a string, any backslash escapes in it are processed.
That is, \n is converted to a single newline character, \r is
converted to a linefeed, and so forth."
So why is '\n' converted to a newline but '\\' does not become a literal
backslash? OK, I don't do much string processing, so perhaps this is where
I am missing the point: how is the replacement being "converted"?
(As Peter's example shows, if you supply the replacement via
a function, this does not happen.) You suggest it is just a matter of
it being an re, but::
>>> re.sub('abc', 'a\\nc','1abcd') == re.sub('abc', 'a\nc','1abcd')
True
>>> re.compile('a\\nc') == re.compile('a\nc')
False
So I have two strings that are not the same, nor do they compile
equivalently, yet apparently they are "converted" to something
equivalent for the substitution. Why? Is my question clearer?
If the answer looks too obvious to state, assume I'm missing it anyway
and please state it. As I said, I seldom use the re module.
Alan Isaac
However, regex objects never compare equal to each other, so, strictly
speaking, re.compile('a\nc') != re.compile('a\nc').
However, having said that, the re module contains a cache (keyed on the
string and options supplied), so the first re.compile('a\nc') will put
the regex object in the cache and the second re.compile('a\nc') will
return that same regex object from the cache. If you clear the cache in
between the two calls (do re._cache.clear()) you'll get two different
regex objects which won't compare equal even though they are to all
intents identical.
OK, this is helpful.
(I did check equality but did not understand
I got True only because re used caching.)
So is the bottom line the following?
A string replacement is not just "converted"
as described in the documentation, essentially
it is compiled?
But that cannot quite be right. E.g., \b will be a back
space not a word boundary. So then the question arises
again, why isn't '\\' a backslash? Just because?
Why does it not get the "obvious" conversion?
Thanks,
Alan Isaac
Because the re module uses backslashes for escaping, you'll need to
escape a literal backslash with a backslash in the string you give it.
But string literals also use backslashes for escaping, so you'll need to
escape each of those backslashes with a backslash.
> So is the bottom line the following?
> A string replacement is not just "converted"
> as described in the documentation, essentially
> it is compiled?
That depends entirely on what you mean.
> But that cannot quite be right. E.g., \b will be a back
> space not a word boundary. So then the question arises
> again, why isn't '\\' a backslash? Just because?
> Why does it not get the "obvious" conversion?
'\\' *is* a backslash. That string containing a single backslash is then
processed by the re module which sees a backslash, tries to interpret it
as an escape, fails and barfs.
"re.compile('a\\nc')" passes a sequence of four characters to re.compile:
'a', '\', 'n' and 'c'. re.compile() then does its own interpretation:
'a' passes through as is, '\' flags an escape which combined with 'n'
produces the newline character (0x0a), and 'c' passes through as is.
"re.compile('a\nc')" by contrast passes a sequence of three characters to
re.compile: 'a', 0x0a and 'c'. re.compile() does its own interpretation,
which happens not to change any of the characters, resulting in the same
regular expression as before.
Your problem is that you are conflating the compile-time processing of
string literals with the run-time processing of strings specific to re.
--
Rhodri James *-* Wildebeeste Herder to the Masses
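The two stages described above can be made visible with a short sketch: the string literal is decoded first, and re then interprets whatever characters it receives. The variable names are illustrative only.

```python
import re

four_chars = 'a\\nc'   # a, backslash, n, c
three_chars = 'a\nc'   # a, newline, c

print(len(four_chars), len(three_chars))  # 4 3

target = 'a\nc'  # the text both patterns should match
print(bool(re.fullmatch(four_chars, target)))   # True
print(bool(re.fullmatch(three_chars, target)))  # True

# Neither pattern matches the four-character text with a literal backslash.
print(re.fullmatch(four_chars, 'a\\nc'))  # None
```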
> Regular expressions and replacement strings have their own escaping
> mechanism, which also uses backslashes.
This seems like a misfeature to me. It makes sense for
a regular expression to give special meanings to backslash
sequences, because it's a sublanguage with its own syntax.
But I can't see any earthly reason to do that with the
*replacement* string, which is just data.
It looks like a feature that's been blindly copied over
from Perl without thinking about whether it makes sense
in Python.
--
Greg
>>> re.sub('a(.)c', r'\1', "123abcdefg")
'123bdefg'
Still think the replacement string is "just data"?
--
\S
under construction
For example, swapping pairs of words:
>>> re.sub(r'(\w+) (\w+)', r'\2 \1', r'first second third fourth')
'second first fourth third'
Python also allows you to provide a function that returns the
replacement string, but that seems a bit long-winded for those cases
when a simple replacement template would suffice.
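For comparison, here is the word swap written both ways, as a sketch in Python 3. The function version spells out what the template \2 \1 abbreviates.

```python
import re

text = 'first second third fourth'
pattern = r'(\w+) (\w+)'

# Template form: \2 and \1 refer to the captured groups.
by_template = re.sub(pattern, r'\2 \1', text)

# Function form: the same swap, written against the match object.
by_function = re.sub(pattern, lambda m: m.group(2) + ' ' + m.group(1), text)

print(by_template)                 # second first fourth third
print(by_template == by_function)  # True
```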
I got that from MRAB's posts. (Thanks.)
What I'm not getting is why the replacement string
gets this particular interpretation. What is the payoff?
(Contrast e.g. Vim's substitution syntax.)
Thanks,
Alan
Of course that "conversion" is needed in the replacement.
But e.g. Vim substitutions handle this fine without the
odd (to non perlers) handling of backslashes in replacement.
Alan Isaac
Short answer: Python is not Perl, Python's re.sub is not Vim's :s.
Slightly longer answer: Different environments have different needs;
Vim users more often need to substitute plain text. All in all,
decisions about default behavior are usually made so that fewer backslashes
are needed for the more common case in the particular environment.
> In simple cases you might be replacing with the same string every time,
> but other cases you might want the replacement to contain substrings
> captured by the regex.
But you can give it a function that has access to the
match object and can produce whatever replacement string
it wants.
You already have a complete programming language at
your disposal. There's no need to invent yet another
mini-language for the replacement string.
--
Greg
The same can't be said for regex replacement strings, which are far more
specialised.
And list comps don't make anything *harder*, they just make things
easier. In contrast, the current behaviour of regex replacements makes it
difficult to use special characters as part of the replacement string.
That's not good.
--
Steven
> On 12/17/2009 7:59 PM, Rhodri James wrote:
>> "re.compile('a\\nc')" passes a sequence of four characters to
>> re.compile: 'a', '\', 'n' and 'c'. re.compile() then does it's own
>> interpretation: 'a' passes through as is, '\' flags an escape which
>> combined with 'n' produces the newline character (0x0a), and 'c' passes
>> through as is.
>
>
> I got that from MRAB's posts. (Thanks.)
> What I'm not getting is why the replacement string
> gets this particular interpretation. What is the payoff?
So that the substitution escapes \1, \2 and so on work.
Assuming I remember correctly, the function capability came after the
replacement capability. I think that breaking replacement would be a
Bad Idea.
--
Aahz (aa...@pythoncraft.com) <*> http://www.pythoncraft.com/
Weinberg's Second Law: If builders built buildings the way programmers wrote
programs, then the first woodpecker that came along would destroy civilization.