How to escape # hash character in regex match strings

504c...@gmail.com

Jun 10, 2009, 10:47:13 AM

I've encountered a problem with my RegEx learning curve -- how to
escape hash characters # in strings being matched, e.g.:

>>> string = re.escape('123#abc456')
>>> match = re.match('\d+', string)
>>> print match

<_sre.SRE_Match object at 0x00A6A800>
>>> print match.group()

123

The correct result should be:

123456

I've tried to escape the hash symbol in the match string without
result.

Any ideas? Is the answer something I overlooked in my lurching Python
schooling?

Peter Otten

Jun 10, 2009, 11:31:40 AM

504c...@gmail.com wrote:

> I've encountered a problem with my RegEx learning curve -- how to
> escape hash characters # in strings being matched, e.g.:
>
>>>> string = re.escape('123#abc456')
>>>> match = re.match('\d+', string)
>>>> print match
>
> <_sre.SRE_Match object at 0x00A6A800>
>>>> print match.group()
>
> 123
>
> The correct result should be:
>
> 123456

>>> "".join(re.findall(r"\d+", "123#abc456"))
'123456'
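
For readers on Python 3, here is a minimal sketch of why the two calls give different results: re.match only tries at the start of the string and stops at the first non-digit, while re.findall collects every run of digits. The variable names are illustrative only.

```python
import re

text = "123#abc456"

# re.match only tries at position 0, so it returns the first digit run.
first_run = re.match(r"\d+", text).group()

# re.findall returns every non-overlapping digit run, in order.
all_runs = re.findall(r"\d+", text)

# Joining the runs drops everything that is not a digit.
digits_only = "".join(all_runs)

print(first_run)    # 123
print(all_runs)     # ['123', '456']
print(digits_only)  # 123456
```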

> I've tried to escape the hash symbol in the match string without
> result.
>
> Any ideas? Is the answer something I overlooked in my lurching Python
> schooling?

re.escape() is used to build the regex from a string that may contain
characters that have a special meaning in regular expressions but that you
want to treat as literals. You can for example search for r"C:\dir" with

>>> re.compile(re.escape(r"C:\dir")).findall(r"C:\dir C:7ir")
['C:\\dir']

Without escaping you'd get

>>> re.compile(r"C:\dir").findall(r"C:\dir C:7ir")
['C:7ir']
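
The same point can be shown on the string from the original question. This is a small Python 3 sketch with illustrative variable names: escaping the subject only alters the data, whereas escaping the pattern is what makes a literal search string safe.

```python
import re

subject = "123#abc456"

# Escaping the *subject* changes the data: a backslash is inserted before '#'.
escaped_subject = re.escape(subject)
print(escaped_subject)  # 123\#abc456

# Escaping the *pattern* turns regex metacharacters into literals.
literal = "a.c"
unescaped_hits = re.findall(literal, "a.c abc")
escaped_hits = re.findall(re.escape(literal), "a.c abc")
print(unescaped_hits)  # ['a.c', 'abc']
print(escaped_hits)    # ['a.c']
```

So in the original post, re.match saw '123\#abc456' and still stopped after the first three digits.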

Peter

David Shapiro

Jun 10, 2009, 12:42:44 PM

to Peter Otten, pytho...@python.org

Maybe using a Unicode equivalent of # would do the trick.


Lie Ryan

Jun 11, 2009, 3:01:04 AM

You weren't clear about what you wanted, so I'm guessing it's this:

>>> s = '123#abc456'
>>> re.match(r'\d+', re.sub(r'#\D+', '', s)).group()
'123456'
>>> s = '123#this is a comment and is ignored456'
>>> re.match(r'\d+', re.sub(r'#\D+', '', s)).group()
'123456'


504c...@gmail.com

Jun 11, 2009, 10:29:45 AM

On Jun 11, 2:01 am, Lie Ryan <lie.1...@gmail.com> wrote:

> '123456'

Sorry I wasn't more clear. I positively appreciate your reply. It
provides half of what I'm hoping to learn. The hash character is
actually a desirable hook to identify a data entity in a scraping
routine I'm developing, but not a character I want in the scrubbed
data.

In my application, the hash makes a string of alphanumeric characters
unique from other alphanumeric strings. The strings I'm looking for
are actually manually-entered identifiers, but a real machine-created
identifier shouldn't contain that hash character. The correct pattern
should be 'A1234509', but is instead often merely entered as '#12345'
when the first character, representing an alphabet sequence for the
month, and the last two characters, representing a two-digit year, can
be assumed. Identifying the hash character in a RegEx match is a way
of trapping the string and transforming it into its correct machine-
generated form.

Other patterns the strings can take in their manually-created
form:

A#12345
#1234509

Garbage in, garbage out -- I know. I wish I could tell the people
entering the data how challenging it is to work with what they
provide, but it is, after all, a screen-scraping routine.

I'm surprised it's been so difficult to find an example of the hash
character in a RegEx string -- for exactly this type of situation,
since it's so common in the real world that people want to put a pound
symbol in front of a number.

Thanks!

Rhodri James

Jun 11, 2009, 6:12:38 PM

to pytho...@python.org

On Thu, 11 Jun 2009 15:22:44 +0100, Brian D <brian...@gmail.com> wrote:

> I'm surprised it's been so difficult to find an example of the hash
> character in a RegEx string -- for exactly this type of situation,
> since it's so common in the real world that people want to put a pound
> symbol in front of a number.

It's a character with no special meaning to the regex engine, so I'm not
in the least surprised that there aren't many examples containing it.
You could just as validly claim that there aren't many examples involving
the letter 'q'.
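
A quick Python 3 sketch of that point, using the sample entries quoted earlier in the thread: in a normal (non-verbose) pattern, '#' needs no escaping and simply matches itself. The names pattern and results are illustrative.

```python
import re

# Outside re.VERBOSE mode, '#' in a pattern matches a literal '#'.
pattern = re.compile(r"#(\d+)")

results = {}
for entry in ["#12345", "A#12345", "#1234509", "A1234509"]:
    found = pattern.search(entry)
    # Record the digits after the hash, or None when there is no hash.
    results[entry] = found.group(1) if found else None
    print(entry, "->", results[entry])
```

Only the last entry, which has no hash, fails to match.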

By the way, I don't know what you're doing but I'm seeing all of your
posts twice, from two different addresses. This is a little confusing,
to put it mildly, and doesn't half break the threading.

--
Rhodri James *-* Wildebeest Herder to the Masses

Tim Chase

Jun 11, 2009, 6:19:33 PM

to Rhodri James, pytho...@python.org

>> I'm surprised it's been so difficult to find an example of the hash
>> character in a RegEx string -- for exactly this type of situation,
>> since it's so common in the real world that people want to put a pound
>> symbol in front of a number.
>
> It's a character with no special meaning to the regex engine, so I'm not
> in the least surprised that there aren't many examples containing it.
> You could just as validly claim that there aren't many examples involving
> the letter 'q'.

It depends on whether the re.VERBOSE option is passed. If you're using
a verbose regexp, you can use "#" to comment portions of it:

import re

r = re.compile(r"""
    \d+      # some digits
    [aeiou]  # a vowel
""", re.VERBOSE)
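
This is the one case where the hash does need care. The sketch below (Python 3, pattern names are illustrative) shows that in verbose mode an unescaped '#' starts a comment, so a literal hash has to be written as \# or placed inside a character class.

```python
import re

# Unescaped: everything from '#' to the end of the line is a comment,
# so this pattern is effectively just [A-Z]? and can match nothing at all.
unescaped = re.compile(r"""
    [A-Z]?   # optional month letter
    #\d{5}   (meant as hash plus five digits, but read as a comment)
""", re.VERBOSE)

# Escaped with a backslash: a literal '#'.
escaped = re.compile(r"""
    [A-Z]?   # optional month letter
    \#       # literal hash
    \d{5}    # five digits
""", re.VERBOSE)

# Inside a character class '#' is also literal, even in verbose mode.
in_class = re.compile(r"[A-Z]? [#] \d{5}", re.VERBOSE)

print(repr(unescaped.match("#12345").group()))  # ''
print(escaped.match("#12345").group())          # #12345
print(in_class.match("A#12345").group())        # A#12345
```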

-tkc

Lie Ryan

Jun 14, 2009, 3:48:38 AM

Brian D wrote:

> On Jun 11, 9:22 am, Brian D <brianden...@gmail.com> wrote:
>> On Jun 11, 2:01 am, Lie Ryan <lie.1...@gmail.com> wrote:
>>
>>
>>

>> Sorry I wasn't more clear. I positively appreciate your reply. It
>> provides half of what I'm hoping to learn. The hash character is
>> actually a desirable hook to identify a data entity in a scraping
>> routine I'm developing, but not a character I want in the scrubbed
>> data.
>>
>> In my application, the hash makes a string of alphanumeric characters
>> unique from other alphanumeric strings. The strings I'm looking for
>> are actually manually-entered identifiers, but a real machine-created
>> identifier shouldn't contain that hash character. The correct pattern
>> should be 'A1234509', but is instead often merely entered as '#12345'
>> when the first character, representing an alphabet sequence for the
>> month, and the last two characters, representing a two-digit year, can
>> be assumed. Identifying the hash character in a RegEx match is a way
>> of trapping the string and transforming it into its correct machine-
>> generated form.
>>

>> I'm surprised it's been so difficult to find an example of the hash
>> character in a RegEx string -- for exactly this type of situation,
>> since it's so common in the real world that people want to put a pound
>> symbol in front of a number.
>>

>> Thanks!
>
> By the way, other forms the strings can take in their manually created
> forms:

>
> A#12345
> #1234509
>
> Garbage in, garbage out -- I know. I wish I could tell the people
> entering the data how challenging it is to work with what they
> provide, but it is, after all, a screen-scraping routine.

Perhaps it's like this?

>>> # you can use re.search if that suits better
>>> a = re.match(r'([A-Z]?)#(\d{5})(\d\d)?', 'A#12345')
>>> b = re.match(r'([A-Z]?)#(\d{5})(\d\d)?', '#1234509')
>>> a.group(0)
'A#12345'
>>> a.group(1)
'A'
>>> a.group(2)
'12345'
>>> a.group(3)
>>> b.group(0)
'#1234509'
>>> b.group(1)
''
>>> b.group(2)
'12345'
>>> b.group(3)
'09'
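
Building on those groups, here is one possible sketch (Python 3) of turning a hash-marked entry into the machine-style form described earlier in the thread. The function name normalize_id and the default_month and default_year parameters are hypothetical; the thread only says the missing month letter and two-digit year "can be assumed", not how, so the defaults below are placeholders.

```python
import re

# Optional month letter, a literal hash, five digits, optional two-digit year.
ID_PATTERN = re.compile(r"([A-Z]?)#(\d{5})(\d\d)?")

def normalize_id(raw, default_month="A", default_year="09"):
    """Return the hash-free identifier, or None if raw is not hash-marked.

    default_month and default_year are assumed placeholders that stand in
    for whatever the scraper knows about the record's month and year.
    """
    m = ID_PATTERN.fullmatch(raw.strip())
    if m is None:
        return None
    month, number, year = m.groups()
    # An absent letter is '' and an absent year is None; both are falsy.
    return (month or default_month) + number + (year or default_year)

for entry in ["#12345", "A#12345", "#1234509", "A1234509"]:
    print(entry, "->", normalize_id(entry))
```

An entry without a hash, such as 'A1234509', returns None here, so identifiers that are already in machine form would need to be passed through separately.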