getting rid of —

someone

Jul 1, 2009, 12:05:37 PM
Hello,

How can I remove the '—' sign from a string, or split at that character?
I get a Unicode error if I try to do it:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position
1: ordinal not in range(128)


Thanks, Pet

script is # -*- coding: UTF-8 -*-

Benjamin Peterson

Jul 1, 2009, 7:55:23 PM
to pytho...@python.org
someone <petshmidt <at> googlemail.com> writes:

>
> Hello,
>
> how can I replace '—' sign from string? Or do split at that character?
> Getting unicode error if I try to do it:
>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position
> 1: ordinal not in range(128)


Please paste your code. I suspect that you are mixing unicode and normal strings.


MRAB

Jul 1, 2009, 7:56:22 PM
to pytho...@python.org
someone wrote:
> Hello,
>
> how can I replace '—' sign from string? Or do split at that character?

> Getting unicode error if I try to do it:
>
> UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position
> 1: ordinal not in range(128)
>
>
> Thanks, Pet
>
> script is # -*- coding: UTF-8 -*-

It sounds like you're mixing bytestrings with Unicode strings. I can't
be any more helpful because you haven't shown the code.
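
A minimal sketch of what "mixing" leads to, written for Python 3 (the thread itself uses Python 2). Python 2 decoded byte strings as ASCII implicitly whenever they were combined with Unicode strings; the explicit decode below stands in for that step. The input `raw` and the helper `try_ascii_decode` are made up for illustration.

```python
# Python 3 sketch. Python 2 ran this ASCII decode implicitly whenever a
# byte string was combined with a unicode string.

# Hypothetical input: an em dash stored as the Windows-1252 byte 0x97,
# surrounded by spaces.
raw = b" \x97 "


def try_ascii_decode(data):
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode("ascii")
    except UnicodeDecodeError as exc:
        return str(exc)


message = try_ascii_decode(raw)
print(message)
```

The message names the offending byte and its position, which is the information the later replies use to work out the encoding involved.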

Tep

Jul 2, 2009, 4:25:29 AM
On 2 Jul., 01:56, MRAB <pyt...@mrabarnett.plus.com> wrote:
> someone wrote:
> > Hello,
>
> > how can I replace '—' sign from string? Or do split at that character?

> > Getting unicode error if I try to do it:
>
> > UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position
> > 1: ordinal not in range(128)
>
> > Thanks, Pet
>
> > script is # -*- coding: UTF-8 -*-
>
> It sounds like you're mixing bytestrings with Unicode strings. I can't
> be any more helpful because you haven't shown the code.

Oh, I'm sorry. Here it is

def cleanInput(input):
    return input.replace('—', '')

Tep

Jul 2, 2009, 4:31:46 AM

I also need:

#input is html source code, I have problem with only this character
#input = 'foo — bar'
#return should be foo
def splitInput(input):
    parts = input.split(' — ')
    return parts[0]


Thanks!

Simon Forman

Jul 3, 2009, 12:40:42 AM

Okay people want to help you but you must make it easy for us.

Post again with a small piece of code that is runnable as-is and that
causes the traceback you're talking about, AND post the complete
traceback too, as-is.

I just tried a bit of your code above in my interpreter here and it
worked fine:

|>>> data = 'foo — bar'
|>>> data.split('—')
|['foo ', ' bar']
|>>> data = u'foo — bar'
|>>> data.split(u'—')
|[u'foo ', u' bar']

Figure out the smallest piece of "html source code" that causes the
problem and include that with your next post.

HTH,
~Simon

You might also read this: http://catb.org/esr/faqs/smart-questions.html

Tep

Jul 3, 2009, 6:28:25 AM

The problem was that I had converted the "html source code" to a unicode
object and didn't encode it back to utf-8 before using split...
Thanks for the help, and sorry for the not-so-smart question
Pet

Mark Tolonen

Jul 3, 2009, 10:58:35 AM
to pytho...@python.org

"Tep" <pets...@googlemail.com> wrote in message
news:46d36544-1ea2-4391...@o6g2000yqj.googlegroups.com...

> On 3 Jul., 06:40, Simon Forman <sajmik...@gmail.com> wrote:
> > On Jul 2, 4:31 am, Tep <petshm...@googlemail.com> wrote:
[snip]

> > > > > > how can I replace '—' sign from string? Or do split at that
> > > > > > character?
> > > > > > Getting unicode error if I try to do it:
> >
> > > > > > UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in
> > > > > > position
> > > > > > 1: ordinal not in range(128)
> >
> > > > > > Thanks, Pet
> >
> > > > > > script is # -*- coding: UTF-8 -*-
[snip]

> > I just tried a bit of your code above in my interpreter here and it
> > worked fine:
> >
> > |>>> data = 'foo — bar'
> > |>>> data.split('—')
> > |['foo ', ' bar']
> > |>>> data = u'foo — bar'
> > |>>> data.split(u'—')
> > |[u'foo ', u' bar']
> >
> > Figure out the smallest piece of "html source code" that causes the
> > problem and include that with your next post.
>
> The problem was, I've converted "html source code" to unicode object
> and didn't encoded to utf-8 back, before using split...
> Thanks for help and sorry for not so smart question
> Pet

You'd still benefit from posting some code. You shouldn't be converting
back to utf-8 to do a split, you should be using a Unicode string with split
on the Unicode version of the "html source code". Also make sure your file
is actually saved in the encoding you declare. I print the encoding of your
symbol in two encodings to illustrate why I suspect this.

Below, assume "data" is your "html source code" as a Unicode string:

# -*- coding: UTF-8 -*-

data = u'foo — bar'

print repr(u'—'.encode('utf-8'))
print repr(u'—'.encode('windows-1252'))
print data.split(u'—')
print data.split('—')


OUTPUT:

'\xe2\x80\x94'
'\x97'


[u'foo ', u' bar']

Traceback (most recent call last):
  File "C:\dev\python\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 427, in ImportFile
    exec codeObj in __main__.__dict__
  File "<auto import>", line 1, in <module>
  File "x.py", line 6, in <module>
    print data.split('—')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

Note that using the Unicode string in split() works. Also note the decode
byte in the error message when using a non-Unicode string to split the
Unicode data. In your original error message the decode byte that caused an
error was 0x97, which is 'EM DASH' in Windows-1252 encoding. Make sure to
save your source code in the encoding you declare. If I save the above
script in windows-1252 encoding and change the coding line to windows-1252 I
get the same results, but the decode byte is 0x97.

# coding: windows-1252


data = u'foo — bar'

print repr(u'—'.encode('utf-8'))
print repr(u'—'.encode('windows-1252'))
print data.split(u'—')
print data.split('—')

'\xe2\x80\x94'
'\x97'


[u'foo ', u' bar']

Traceback (most recent call last):
  File "C:\dev\python\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 427, in ImportFile
    exec codeObj in __main__.__dict__
  File "<auto import>", line 1, in <module>
  File "x.py", line 6, in <module>
    print data.split('—')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x97 in position 0: ordinal not in range(128)
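
For readers on Python 3, the same comparison can be sketched as follows. This is only an illustration, not the code from the post: `EM_DASH` and the helper `decodes_as` are names introduced here. It shows why the byte in the error message hints at the encoding the file was really saved in.

```python
# Python 3 sketch: the em dash encodes to different bytes depending on the
# encoding, so the byte reported in the error reveals how the file was saved.
EM_DASH = "\u2014"  # U+2014 EM DASH

utf8_bytes = EM_DASH.encode("utf-8")
cp1252_bytes = EM_DASH.encode("windows-1252")
print(repr(utf8_bytes))    # b'\xe2\x80\x94'
print(repr(cp1252_bytes))  # b'\x97'


def decodes_as(data, encoding):
    """Return True if `data` is valid in `encoding` (illustrative helper)."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# A lone 0x97 byte is not valid UTF-8, so a file declared as UTF-8 but saved
# as Windows-1252 cannot be read back correctly.
print(decodes_as(cp1252_bytes, "utf-8"))  # False
```

A leading 0xe2 in the error therefore points at UTF-8 data, while 0x97 points at Windows-1252.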

-Mark


Tep

Jul 3, 2009, 12:24:43 PM
On 3 Jul., 16:58, "Mark Tolonen" <metolone+gm...@gmail.com> wrote:
> "Tep" <petshm...@googlemail.com> wrote in message

I've posted code below

> You shouldn't be converting
> back to utf-8 to do a split, you should be using a Unicode string with split
> on the Unicode version of the "html source code". Also make sure your file
> is actually saved in the encoding you declare. I print the encoding of your
> symbol in two encodings to illustrate why I suspect this.

The file was indeed in windows-1252; I've changed this. For the errors,
see below.

#! /usr/bin/python
# -*- coding: UTF-8 -*-
import urllib2
import re
def getTitle(input):
    title = re.search('<title>(.*?)</title>', input)
    title = title.group(1)
    print "FULL TITLE", title.encode('UTF-8')
    parts = title.split(' — ')
    return parts[0]


def getWebPage(url):
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    headers = { 'User-Agent' : user_agent }
    req = urllib2.Request(url, '', headers)
    response = urllib2.urlopen(req)
    the_page = unicode(response.read(), 'UTF-8')
    return the_page


def main():
    url = "http://bg.wikipedia.org/wiki/%D0%91%D0%B0%D1%85%D1%80%D0%B5%D0%B9%D0%BD"
    title = getTitle(getWebPage(url))
    print title[0]


if __name__ == "__main__":
    main()


Traceback (most recent call last):
  File "C:\user\Projects\test\src\new_main.py", line 29, in <module>
    main()
  File "C:\user\Projects\test\src\new_main.py", line 24, in main
    title = getTitle(getWebPage(url))
FULL TITLE Бахрейн — Уикипедия
  File "C:\user\Projects\test\src\new_main.py", line 9, in getTitle
    parts = title.split(' — ')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position

MRAB

Jul 3, 2009, 12:54:10 PM
to pytho...@python.org

The input is Unicode, so it's probably better for the regular expression
to also be Unicode:

title = re.search(u'<title>(.*?)</title>', input)

(In the current implementation it actually doesn't matter.)

> title = title.group(1)
> print "FULL TITLE", title.encode('UTF-8')
> parts = title.split(' — ')

The title is Unicode, so the string with which you're splitting should
also be Unicode:

parts = title.split(u' — ')
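
Put together, the advice amounts to: decode the bytes once, then use Unicode for the pattern, the separator and the data. Here is a small Python 3 sketch of that idea (the thread uses Python 2, where the `u` prefix is what makes a literal Unicode; Python 3.3+ still accepts it). The names `page_bytes` and `get_title` are made up for this example, and the hard-coded bytes stand in for what `response.read()` might return.

```python
import re

# Hypothetical stand-in for the bytes returned by response.read().
page_bytes = "<title>Бахрейн — Уикипедия</title>".encode("utf-8")

# Decode once, right after reading; everything after this point is text.
page = page_bytes.decode("utf-8")


def get_title(text):
    """Return the part of the <title> before ' — ', or None if no title."""
    match = re.search(u"<title>(.*?)</title>", text)
    if match is None:
        return None
    # Unicode separator for Unicode data, as suggested above.
    return match.group(1).split(u" — ")[0]


print(get_title(page))
```

Returning None when no title is found is a choice made for this sketch; the posted code would raise an AttributeError in that case.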

Tep

Jul 3, 2009, 2:34:21 PM


Oh, so simple. I'm new to python and still feel uncomfortable with
unicode stuff.

Thanks to all for help!
