Re: String comparison question

Michael Spencer

Mar 19, 2006, 8:38:27 PM

to pytho...@python.org

Olivier Langlois wrote:

> I would like to make a string comparison that returns true regardless of
> the number of spaces and newline characters between the words

>
> like 'A B\nC' = 'A\nB C'
>

import string
NULL = string.maketrans("","")
WHITE = string.whitespace

def compare(a, b):
    """Compare two strings, disregarding whitespace -> bool"""
    return a.translate(NULL, WHITE) == b.translate(NULL, WHITE)

Here, str.translate deletes the characters in its optional second argument.
Note that this does not work with unicode strings.
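
For readers on Python 3, where every str is a Unicode string and
string.maketrans no longer exists, a roughly equivalent sketch is shown
below. It assumes Python 3; the name DELETE_WHITE is only illustrative and
does not come from the original post.

```python
import string

# With three arguments, str.maketrans builds a table that deletes
# every character listed in the third argument.
DELETE_WHITE = str.maketrans("", "", string.whitespace)

def compare(a, b):
    """Compare two strings, disregarding ASCII whitespace -> bool"""
    return a.translate(DELETE_WHITE) == b.translate(DELETE_WHITE)

print(compare("A B\nC", "A\nB C"))  # True
print(compare("A B", "AB"))         # True: whitespace is removed, not normalized
```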

Michael

Olivier Langlois

Mar 19, 2006, 8:42:47 PM

to pytho...@python.org

Hi Michael!

Your suggestion is fantastic and is doing exactly what I was looking
for! Thank you very much.
There is something that I'm wondering though. Why wouldn't the solution you
proposed work with Unicode strings?

Olivier Langlois
http://www.quazal.com


Michael Spencer

Mar 19, 2006, 10:15:50 PM

to pytho...@python.org

Olivier Langlois wrote:
> Hi Michael!
>
> Your suggestion is fantastic and is doing exactly what I was looking
> for! Thank you very much.
> There is something that I'm wondering though. Why wouldn't the solution you
> proposed work with Unicode strings?
>

Simply, that str.translate with two arguments isn't implemented for unicode
strings. I don't know the underlying reason, or how hard it would be to change.

If you do need the comparison functionality for unicode strings, you'll have
to go with a different approach. For example, using regular expressions:

import re

def compare2(a, b):
    """Compare two basestrings, disregarding whitespace -> bool"""
    # Raw strings keep the backslash in \s intact for the regex engine
    return re.sub(r"\s+", "", a) == re.sub(r"\s+", "", b)

This is slower than the str.translate approach, though it has the advantage that
you could easily modify it to normalize, rather than eliminate whitespace. This
would be a more useful comparison in many cases.

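def compare3(a, b):
    """Compare two basestrings, normalizing whitespace -> bool"""
    # \s+ rather than \s*: the latter also matches the empty string between
    # characters, which would insert a space between every pair of characters
    return re.sub(r"\s+", " ", a) == re.sub(r"\s+", " ", b)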

Continuing the disclaimers: none of these approaches makes any attempt to deal
specially with quoted whitespace or any other sort of escapes.
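
As a small illustration of the normalizing variant, the sketch below
collapses each run of whitespace to one space and also trims the ends, so
that leading or trailing whitespace does not affect the result. The pattern
\s+ matches only real runs of whitespace, whereas \s* also matches the empty
string between characters. The function names are hypothetical, and the code
assumes Python 3.

```python
import re

def normalize(s):
    """Collapse each run of whitespace to one space and trim the ends."""
    return re.sub(r"\s+", " ", s).strip()

def compare_normalized(a, b):
    """Compare two strings after normalizing whitespace -> bool"""
    return normalize(a) == normalize(b)

print(compare_normalized("A B\nC", " A\nB   C "))  # True
print(compare_normalized("A B", "AB"))             # False: a word boundary differs
```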

Michael

Alex Martelli

Mar 19, 2006, 11:20:50 PM

to

Michael Spencer <ma...@telcopartners.com> wrote:

> Olivier Langlois wrote:
> > Hi Michael!
> >
> > Your suggestion is fantastic and is doing exactly what I was looking
> > for! Thank you very much.
> > There is something that I'm wondering though. Why wouldn't the solution you
> > proposed work with Unicode strings?
> >
> Simply, that str.translate with two arguments isn't implemented for
> unicode strings. I don't know the underlying reason, or how hard it would
> be to change.

A Unicode string's translate takes a dict argument -- you delete
characters by mapping their ord(...) to None. For example:

>>> u'banana'.translate({ord('a'):None})
u'bnn'

That is in fact much handier, when all you want to do is delete some
characters, than using string.maketrans to create a "null" translation
table and passing as the 2nd argument the string of chars to delete.

With unicode .translate, you can also translate a character into a
STRING...:

>>> u'banana'.translate({ord('a'):u'ay'})
u'baynaynay'

...which is simply impossible with a plain string's .translate.
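
The same dict-based interface is what str.translate uses in Python 3. A
short sketch, assuming Python 3 (the variable name table is only
illustrative): str.maketrans also accepts a single dict whose keys are
one-character strings, which saves writing ord(...) by hand.

```python
# Delete a character by mapping its ordinal to None
print("banana".translate({ord("a"): None}))  # bnn

# Map one character to a longer string
print("banana".translate({ord("a"): "ay"}))  # baynaynay

# str.maketrans converts single-character keys to ordinals
table = str.maketrans({"a": "ay", "n": None})
print("banana".translate(table))  # bayayay
```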

Alex

Alex Martelli

Mar 19, 2006, 11:23:57 PM

to

Michael Spencer <ma...@telcopartners.com> wrote:

With unicode, you could do something strictly equivalent, as follows:

nowhite = dict.fromkeys(ord(c) for c in string.whitespace)

and then

return a.translate(nowhite) == b.translate(nowhite)
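
One caveat worth illustrating: string.whitespace lists only the ASCII
whitespace characters, so a table built from it leaves other Unicode
whitespace (such as U+2003 EM SPACE) in place. The sketch below assumes
Python 3, and both function names are hypothetical. It contrasts the
table-based comparison with one built on split(), which recognizes Unicode
whitespace when called without arguments.

```python
import string

NOWHITE = dict.fromkeys(ord(c) for c in string.whitespace)

def compare_ascii_ws(a, b):
    """Ignore only the ASCII whitespace listed in string.whitespace."""
    return a.translate(NOWHITE) == b.translate(NOWHITE)

def compare_any_ws(a, b):
    """Ignore any whitespace that str.split() recognizes."""
    return "".join(a.split()) == "".join(b.split())

em_spaced = "A\u2003B"  # EM SPACE between A and B
print(compare_ascii_ws(em_spaced, "AB"))  # False
print(compare_any_ws(em_spaced, "AB"))    # True
```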

Alex

Michael Spencer

Mar 20, 2006, 2:20:22 AM

to pytho...@python.org

Alex Martelli wrote:
> Michael Spencer <ma...@telcopartners.com> wrote:

>>
>> Here, str.translate deletes the characters in its optional second argument.
>> Note that this does not work with unicode strings.
>
> With unicode, you could do something strictly equivalent, as follows:
>
> nowhite = dict.fromkeys(ord(c) for c in string.whitespace)
>
> and then
>
> return a.translate(nowhite) == b.translate(nowhite)
>
>
> Alex

Interesting! But annoying to have to use unicode.translate differently from
str.translate:

import string
NULL = string.maketrans("","")
WHITE = string.whitespace

NO_WHITE_MAP = dict.fromkeys(ord(c) for c in WHITE)

def compare(a, b):
    """Compare two basestrings, disregarding whitespace -> bool"""
    if isinstance(a, unicode):
        astrip = a.translate(NO_WHITE_MAP)
    else:
        astrip = a.translate(NULL, WHITE)
    if isinstance(b, unicode):
        bstrip = b.translate(NO_WHITE_MAP)
    else:
        bstrip = b.translate(NULL, WHITE)
    return astrip == bstrip

In fact, now that you've pointed it out, I like the unicode.translate interface
much better than str.translate(translation_table, deletechars = None). But it
would also be nice if these interfaces were compatible.
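
For comparison, Python 3 keeps a similar split between its two string types:
str.translate takes a single mapping, while bytes.translate keeps the
(table, delete) form and accepts None as the table. A minimal sketch of the
same dispatch, assuming Python 3; strip_white and WHITE_BYTES are
hypothetical names:

```python
import string

# str.translate: one mapping of code points, None meaning "delete"
NO_WHITE_MAP = dict.fromkeys(map(ord, string.whitespace))

# bytes.translate: a table (or None) plus the bytes to delete
WHITE_BYTES = string.whitespace.encode("ascii")

def strip_white(s):
    """Remove ASCII whitespace from a str or bytes object."""
    if isinstance(s, bytes):
        return s.translate(None, WHITE_BYTES)
    return s.translate(NO_WHITE_MAP)

print(strip_white("A B\nC"))   # ABC
print(strip_white(b"A B\nC"))  # b'ABC'
```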

Perhaps str.translate could be extended to take a single mapping (optionally) as
its argument:

i.e., behavior like:

def translate(self, table, deletechars=None):
    """S.translate(table [,deletechars]) -> string

    Return a copy of the string S, where all characters occurring
    in the optional argument deletechars are removed, and the
    remaining characters have been mapped through the given
    translation table, which must be either a string of length 256
    or a map of str ordinals to str ordinals, strings or None.
    Unmapped characters are left untouched. Characters mapped to None
    are deleted."""
    if hasattr(table, "keys"):
        if deletechars:
            raise ValueError("Can't specify deletechars with a mapping table")
        table_map = table
        table = ""
        deletechars = ""
        for key in range(256):
            if key in table_map:
                val = table_map[key]
                if val is None:
                    # Deleted characters still need a placeholder table entry
                    deletechars += chr(key)
                    val = chr(key)
                if not isinstance(val, str):
                    # An ordinal was given; convert it to a character
                    val = chr(val)
            else:
                # Unmapped characters are left untouched
                val = chr(key)
            table += val

    # Pass an empty string when no deletechars were given
    return str.translate(self, table, deletechars or "")

Michael

luc.s...@gmail.com

Mar 20, 2006, 7:46:30 AM

to

Michael Spencer wrote:
> Olivier Langlois wrote:
>
> > I would like to make a string comparison that returns true regardless of
> > the number of spaces and newline characters between the words
> >
> > like 'A B\nC' = 'A\nB C'

Here is how I do such comparisons:

if a.strip().split() == b.strip().split()

Luc

Peter Otten

Mar 20, 2006, 7:59:21 AM

to

luc.s...@gmail.com wrote:

The strip() is not necessary:

>>> " a b c\n ".split() == "a b c".split()
True

Peter

Fredrik Lundh

Mar 20, 2006, 8:25:15 AM

to pytho...@python.org

luc.s...@gmail.com wrote:

>> > I would like to make a string comparison that returns true regardless of
>> > the number of spaces and newline characters between the words
>> >
>> > like 'A B\nC' = 'A\nB C'
>
> Here is how I do such comparisons:
>
> if a.strip().split() == b.strip().split()

clever solution (I was about to post a split/join solution, but the join is of course
meaningless), but the strip() isn't necessary: the default version of split already
removes leading and trailing whitespace:

>>> " hello world ".split()
['hello', 'world']

>>> " hello world ".split(None)
['hello', 'world']
>>> " hello world ".split(" ")
['', 'hello', 'world', '']

</F>

Michael Spencer

Mar 20, 2006, 2:41:25 PM

to pytho...@python.org

Fredrik Lundh wrote:

>
> >>> " hello world ".split()
> ['hello', 'world']

a.split() == b.split() is a convenient test, provided you want to normalize
whitespace rather than ignore it. I took the OP's requirements to mean that
'A B' == 'AB', but this is just a guess.
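
The difference between the two readings can be sketched as follows. The
function names are hypothetical and the code assumes Python 3: comparing the
split lists normalizes whitespace, while joining the pieces first ignores it
altogether.

```python
def same_words(a, b):
    """Normalize whitespace: the word boundaries must agree."""
    return a.split() == b.split()

def same_chars(a, b):
    """Ignore whitespace: only the non-whitespace characters must agree."""
    return "".join(a.split()) == "".join(b.split())

print(same_words("A B\nC", "A\nB C"))  # True
print(same_words("A B", "AB"))         # False
print(same_chars("A B", "AB"))         # True
```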

Michael

Olivier Langlois

Mar 20, 2006, 2:46:35 PM

to pytho...@python.org

Hi Michael,

Normalizing the whitespace is what I was looking to do. I guess that
aspect of my original query was not clear enough. But with either
solution, I get the result I wanted.

Greetings,
Olivier Langlois
http://www.quazal.com

Fredrik Lundh

Mar 22, 2006, 3:32:23 AM

to pytho...@python.org

Michael Spencer wrote:

> a.split() == b.split() is a convenient test, provided you want to normalize
> whitespace rather than ignore it. I took the OP's requirements to mean that
> 'A B' == 'AB', but this is just a guess.

I'm sure someone has studied this in more detail, but intuitively, partially
based on common mistakes when writing regular expressions, I'd say that
when people talk about "any number of something", they mean "one or
more" more often than not.
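
In regular-expression terms, that is the difference between the * and +
quantifiers. A brief sketch, assuming Python 3 and using hypothetical
pattern names:

```python
import re

zero_or_more = re.compile(r"A\s*B")  # whitespace between A and B is optional
one_or_more = re.compile(r"A\s+B")   # at least one whitespace character required

print(bool(zero_or_more.fullmatch("AB")))     # True
print(bool(one_or_more.fullmatch("AB")))      # False
print(bool(one_or_more.fullmatch("A \n B")))  # True
```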

</F>