'séd' (repr is 's\\xc3\\xa9d')
u'séd' (repr is u's\\xe9d')
[Note: your reprs are wrong; each \\ should be a single \.]
You need to decode the byte (non-unicode) string and compare the result
with the unicode string. To do that, you need to know the encoding that
was used for the byte string. In the example that you gave, it is almost
certainly UTF-8.
>>> 's\xc3\xa9d'.decode('utf8')
u's\xe9d'
>>> u's\xe9d'.encode('utf8')
's\xc3\xa9d'
>>>
HTH,
John
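For what it's worth, the decode-then-compare step could be wrapped up roughly like this. This is only a sketch: the helper name same_text and its encoding parameter are made up for illustration, and treating undecodable bytes as "not equal" is an assumption rather than the only reasonable policy. The b'' and u'' prefixes are used so the snippet should behave the same on Python 2.6+ and Python 3.3+.

```python
def same_text(raw, text, encoding='utf-8'):
    """Hypothetical helper: decode the byte string `raw` with `encoding`
    and compare the result with the unicode string `text`."""
    try:
        decoded = raw.decode(encoding)
    except UnicodeDecodeError:
        # Assumption: bytes that are not valid in `encoding` never match.
        return False
    return decoded == text


print(same_text(b's\xc3\xa9d', u's\xe9d'))          # UTF-8 bytes
print(same_text(b's\xe9d', u's\xe9d'))              # Latin-1 bytes, wrong guess
print(same_text(b's\xe9d', u's\xe9d', 'latin-1'))   # Latin-1 bytes, right guess
```

The second call shows why the encoding matters: the same text stored as Latin-1 is not valid UTF-8, so the guess has to match how the bytes were actually produced.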
Determine what encoding the former string is using (looks like UTF-8),
and convert it to Unicode before doing the comparison.
>>> b = 's\xc3\xa9d'
>>> u = u's\xe9d'
>>> b
's\xc3\xa9d'
>>> u
u's\xe9d'
>>> unicode(b, "utf-8")
u's\xe9d'
>>> unicode(b, "utf-8") == u
True
</F>
@-salutations
--
Michel Claveau
You may also want to look at unicodedata.normalize(). For example, é can
be represented multiple ways:
>>> import unicodedata
>>> unicodedata.normalize('NFC', u'é')
u'\xe9'
>>> unicodedata.normalize('NFD', u'é')
u'e\u0301'
>>> u'\xe9' == u'e\u0301'
False
The first form is "composed", just being U+00E9 (LATIN SMALL LETTER E
WITH ACUTE). The second form is "decomposed", being made up of U+0065
(LATIN SMALL LETTER E) and U+0301 (COMBINING ACUTE ACCENT).
Even though they represent the same thing to a human, they don't compare
as equal. But if you normalize them to the same form, they will.
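A minimal sketch of that normalize-then-compare idea follows. The function name equal_normalized is hypothetical, and defaulting to NFC is just one choice; any form works as long as both sides use the same one.

```python
import unicodedata


def equal_normalized(a, b, form='NFC'):
    """Hypothetical helper: compare two unicode strings after bringing
    both to the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


composed = u'\xe9'         # U+00E9
decomposed = u'e\u0301'    # U+0065 followed by U+0301

print(composed == decomposed)                          # plain comparison
print(equal_normalized(composed, decomposed))          # both converted to NFC
print(equal_normalized(composed, decomposed, 'NFD'))   # both converted to NFD
```

If one of the strings starts out as bytes, it would need to be decoded first, since unicodedata.normalize() works on unicode strings.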
For more information, look at the unicodedata module's documentation:
<http://docs.python.org/lib/module-unicodedata.html>