character encoding conversion
 
Newsgroups: comp.lang.python
From: "Dylan" <dyl...@yahoo.com>
Date: Sun, 12 Dec 2004 01:28:29 -0000
Local: Sat, Dec 11 2004 8:28 pm
Subject: character encoding conversion

Here's what I'm trying to do:

- scrape some html content from various sources

The issue I'm running into:

- some of the sources have incorrectly encoded characters... for
example, cp1252 curly quotes that were likely the result of the author
copying and pasting content from Word

I've searched and read for many hours, but have not found a solution
for handling the case where the page author does not use the character
encoding that they have specified.

Things I have tried include encode()/decode(), and replacement lookup
tables (i.e. something like
http://groups-beta.google.com/group/comp.lang.python/browse_thread/th...
). However, I am still unable to convert the characters to something
meaningful. In the case of the lookup table, this failed as all of
the improperly encoded characters were returned as ? rather than
their original encoding.

I'm using urllib and htmllib to open, read, and parse the html
fragments, Python 2.3 on OS X 10.3

Any ideas or pointers would be greatly appreciated.

-Dylan Schiemann
http://www.dylanschiemann.com/


 
Newsgroups: comp.lang.python
From: "Martin v. Löwis" <mar...@v.loewis.de>
Date: Sun, 12 Dec 2004 17:51:37 +0100
Local: Sun, Dec 12 2004 11:51 am
Subject: Re: character encoding conversion

Dylan wrote:
> Things I have tried include encode()/decode()

This should work. If you somehow manage to guess the encoding,
e.g. guess it as cp1252, then

   htmlstring.decode("cp1252").encode("us-ascii", "xmlcharrefreplace")

will give you a file that contains only ASCII characters, and
character references for everything else.
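
As a small illustration of that call: the sketch below is written for Python 3, which is an assumption (the thread itself targets Python 2.3, where the same method names apply to byte strings), and the sample bytes are made up.

```python
# cp1252 bytes: curly quotes (0x93, 0x94) around a word containing "ä" (0xE4)
raw = b"\x93B\xe4r\x94"

text = raw.decode("cp1252")  # bytes -> Unicode text
ascii_bytes = text.encode("us-ascii", "xmlcharrefreplace")  # non-ASCII -> &#NNNN;

print(ascii_bytes)  # b'&#8220;B&#228;r&#8221;'
```

The result contains only ASCII bytes; an HTML or XML parser turns the character references back into the original characters.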

Now, how should you guess the encoding? Here is a strategy:
1. use the encoding that was sent through the HTTP header. Be
    absolutely certain to not ignore this encoding.
2. use the encoding in the XML declaration (if any).
3. use the encoding in the http-equiv meta element (if any)
4. use UTF-8
5. use Latin-1, and check that there are no characters in the
    range(128,160)
6. use cp1252
7. use Latin-1

Going in order from 1 to 6, check whether you manage to decode
the input. Notice that in step 5, decoding will always succeed;
consider it a failure if you get any control characters (from
range(128, 160)). If step 6 fails as well, fall back to Latin-1
again in step 7.

When you find the first encoding that decodes correctly, encode
it with ascii and xmlcharrefreplace, and you won't need to worry
about the encoding, anymore.
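
The seven steps above could be sketched roughly as follows. This is Python 3 (an assumption; the thread uses Python 2.3), and the function name guess_decode and its declared parameter are hypothetical. The caller is expected to pass whatever encodings it found in the HTTP header, the XML declaration and the meta element, in that order; finding them is not covered here.

```python
def guess_decode(data, declared=()):
    """Return (text, encoding_used) following the fallback order above."""
    # Steps 1-3: encodings declared by the HTTP header, XML declaration, meta tag
    for encoding in declared:
        if not encoding:
            continue
        try:
            return data.decode(encoding), encoding
        except (UnicodeError, LookupError):  # wrong or unknown encoding
            pass
    # Step 4: UTF-8 is strict enough that a successful decode is a good sign
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    # Step 5: Latin-1 always decodes, so accept it only without C1 control bytes
    if not any(0x80 <= byte < 0xA0 for byte in data):
        return data.decode("latin-1"), "latin-1"
    # Step 6: cp1252 assigns printable characters to most of range(128, 160)
    try:
        return data.decode("cp1252"), "cp1252"
    except UnicodeDecodeError:
        pass
    # Step 7: last resort, cannot fail
    return data.decode("latin-1"), "latin-1"


text, used = guess_decode(b"\x93quoted\x94", declared=["utf-8"])
print(used)                                      # cp1252
print(text.encode("ascii", "xmlcharrefreplace"))  # b'&#8220;quoted&#8221;'
```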

Regards,
Martin


 
Newsgroups: comp.lang.python
From: Christian Ergh <christian.e...@gmail.com>
Date: Sun, 12 Dec 2004 20:29:59 +0100
Local: Sun, Dec 12 2004 2:29 pm
Subject: Re: character encoding conversion

I have a similar problem, with characters like äöüAÖÜß and so on. I am
extracting some content out of webpages, and they deliver whatever,
sometimes not even giving any encoding information in the header. But
your solution sounds quite good, i just do not know if
- it works with the characters i mentioned
- what encoding do you have in the end
- and how exactly are you doing all this? All with somestring.decode()
or... Can you please give an example for these 7 steps?
Thanx in advance for the help
Chris

 
Newsgroups: comp.lang.python
From: "Martin v. Löwis" <mar...@v.loewis.de>
Date: Sun, 12 Dec 2004 23:13:25 +0100
Local: Sun, Dec 12 2004 5:13 pm
Subject: Re: character encoding conversion

Christian Ergh wrote:
> - it works with the characters i mentioned

It does.

> - what encoding do you have in the end

US-ASCII

> - and how exactly are you doing all this? All with somestring.decode()
> or... Can you please give an example for these 7 steps?

I could, but I don't have the time - just try to come up with some
code, and I'll try to comment on it.

Regards,
Martin


 
Newsgroups: comp.lang.python
From: Christian Ergh <christian.e...@gmail.com>
Date: Mon, 13 Dec 2004 09:41:56 +0100
Local: Mon, Dec 13 2004 3:41 am
Subject: Re: character encoding conversion

Something like this?
Chris

import urllib2

url = 'www.someurl.com'
f = urllib2.urlopen(url)
data = f.read()
# if it is not in the pagecode, how do i get the encoding of the page?
pageencoding = ???
xmlencoding  = 'whatever i parsed out of the file'
htmlmetaencoding = 'whatever i parsed out of the metatag'
f.close()
         try:
             data = data.decode(pageencoding)
         except:
             try:
                 data = data.decode(xmlencoding)
             except:
                 try:
                     data = data.decode(htmlmetaencoding)
                 except:
                     try:
                        data = data.encode('UTF-8')
                    except:
                        flag = true
                        for char in data:
                            if 127 < ord(char) < 128:
                                flag = false
                            if flag:
                                try:
                                    data = data.encode('latin-1')
                                except:
                                    pass
                    try:
                        data = data.encode('cp1252')
                    except:
                        pass
        try:
            data = data.encode('latin-1')
        except:
            pass:
data = data.encode("ascii", "xmlcharrefreplace")


 
Newsgroups: comp.lang.python
From: Steven Bethard <steven.beth...@gmail.com>
Date: Mon, 13 Dec 2004 08:58:15 GMT
Local: Mon, Dec 13 2004 3:58 am
Subject: Re: character encoding conversion

Christian Ergh wrote:
> flag = true
> for char in data:
>     if 127 < ord(char) < 128:
>         flag = false
> if flag:
>     try:
>         data = data.encode('latin-1')
>     except:
>         pass

A little OT, but (assuming I got your indentation right[1]) this kind of
loop is exactly what the else clause of a for-loop is for:

for char in data:
     if 127 < ord(char) < 128:
         break
else:
     try:
         data = data.encode('latin-1')
     except:
         pass

Only saves you one line of code, but you don't have to keep track of a
'flag' variable.  Generally, I find that when I want to set a 'flag'
variable, I can usually do it with a for/else instead.
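
A self-contained version of the same for/else pattern, as a sketch in Python 3 (where iterating over bytes yields integers, so ord() is not needed). The function name is made up, and the upper bound of 160 follows the range(128, 160) check from earlier in the thread, which is an assumption about what the loop was meant to test.

```python
def decode_latin1_if_clean(data):
    """Decode as Latin-1 unless some byte falls in range(128, 160)."""
    for byte in data:
        if 127 < byte < 160:
            break  # found a control byte: the else clause is skipped
    else:
        # runs only when the loop finished without hitting break
        return data.decode("latin-1")
    return None


print(decode_latin1_if_clean(b"caf\xe9"))  # café
print(decode_latin1_if_clean(b"\x93hi"))   # None
```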

Steve

[1] Messed up indentation happens in a lot of clients if you have tabs
in your code.  If you can replace tabs with spaces before posting, this
usually solves the problem.


 
Newsgroups: comp.lang.python
From: Peter Otten <__pete...@web.de>
Date: Mon, 13 Dec 2004 10:09:32 +0100
Local: Mon, Dec 13 2004 4:09 am
Subject: Re: character encoding conversion

Even more off-topic:

>>> for char in data:

...     if 127 < ord(char) < 128:
...             break
...
>>> print char

127.5

:-)

Peter


 
Newsgroups: comp.lang.python
From: Christian Ergh <christian.e...@gmail.com>
Date: Mon, 13 Dec 2004 10:37:10 +0100
Local: Mon, Dec 13 2004 4:37 am
Subject: Re: character encoding conversion
Once more, indention should be correct now, and the 128 is gone too. So,
something like this?
Chris

import urllib2

url = 'http://www.someurl.com'
f = urllib2.urlopen(url)
data = f.read()
# if it is not in the pagecode, how do i get the encoding of the page?
pageencoding = '???'
xmlencoding  = 'whatever i parsed out of the file'
htmlmetaencoding = 'whatever i parsed out of the metatag'
f.close()
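try:
    # step 1: encoding from the HTTP header
    data = data.decode(pageencoding)
except (LookupError, UnicodeError):
    try:
        # step 2: encoding from the XML declaration
        data = data.decode(xmlencoding)
    except (LookupError, UnicodeError):
        try:
            # step 3: encoding from the http-equiv meta element
            data = data.decode(htmlmetaencoding)
        except (LookupError, UnicodeError):
            try:
                # step 4
                data = data.decode('UTF-8')
            except UnicodeError:
                # step 5: Latin-1 always decodes, so only accept it
                # if there are no bytes in range(128, 160)
                flag = True
                for char in data:
                    if 127 < ord(char) < 160:
                        flag = False
                if flag:
                    data = data.decode('latin-1')
                else:
                    try:
                        # step 6
                        data = data.decode('cp1252')
                    except UnicodeError:
                        # step 7
                        data = data.decode('latin-1')
data = data.encode("ascii", "xmlcharrefreplace")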


 
Newsgroups: comp.lang.python
From: Christian Ergh <christian.e...@gmail.com>
Date: Mon, 13 Dec 2004 10:32:12 +0100
Local: Mon, Dec 13 2004 4:32 am
Subject: Re: character encoding conversion

Well yes, that happens when doing a quick hack and not reviewing it, 128
has to be 160 of course...

 
Newsgroups: comp.lang.python
From: Max M <m...@mxm.dk>
Date: Mon, 13 Dec 2004 10:48:14 +0100
Local: Mon, Dec 13 2004 4:48 am
Subject: Re: character encoding conversion

Christian Ergh wrote:

A simple way to try out different encodings in a given order:

# -*- coding: latin-1 -*-

def get_encoded(st, encodings):
     "Returns an encoding that doesn't fail"
     for encoding in encodings:
         try:
             st_encoded = st.decode(encoding)
             return st_encoded, encoding
         except UnicodeError:
             pass

st = 'Test characters æøå ÆØÅ'
encodings = ['utf-8', 'latin-1', 'ascii', ]
print get_encoded(st, encodings)

     (u'Test characters \xe6\xf8\xe5 \xc6\xd8\xc5', 'latin-1')

--

hilsen/regards Max M, Denmark

http://www.mxm.dk/
IT's Mad Science


 
Newsgroups: comp.lang.python
From: Christian Ergh <christian.e...@gmail.com>
Date: Mon, 13 Dec 2004 11:35:25 +0100
Local: Mon, Dec 13 2004 5:35 am
Subject: Re: character encoding conversion
- snip -
> def get_encoded(st, encodings):
>     "Returns an encoding that doesn't fail"
>     for encoding in encodings:
>         try:
>             st_encoded = st.decode(encoding)
>             return st_encoded, encoding
>         except UnicodeError:
>             pass

-snip-
This works fine, but after this you have three possible encodings (or
even more, looking at the data in the net you'll see a lot of
encodings...)- what we need is just one for all.
Chris

 
Newsgroups: comp.lang.python
From: Christian Ergh <christian.e...@gmail.com>
Date: Mon, 13 Dec 2004 15:50:06 +0100
Local: Mon, Dec 13 2004 9:50 am
Subject: Re: character encoding conversion
Dylan wrote:
> Here's what I'm trying to do:

> - scrape some html content from various sources

> The issue I'm running into:

> - some of the sources have incorrectly encoded characters... for
> example, cp1252 curly quotes that were likely the result of the author
> copying and pasting content from Word

Finally: for me this works. It all lives inside my own class, and the
module has a logger, so for reuse you would need to adapt those parts.
I am updating a PostgreSQL database, in case someone wonders about the
__setattr__, and my class inherits from SQLObject.

     def doDecode(self, st):
         "Returns st decoded with the first encoding that doesn't fail"
         for encoding in encodings:
             try:
                 stEncoded = st.decode(encoding)
                 return stEncoded
             except UnicodeError:
                 pass

     def setAttribute(self, name, data):
         import HTMLFilter
         data = self.doDecode(data)
         try:
             data = data.encode('ascii', "xmlcharrefreplace")
         except:
             log.warn('new method did not fit')

         try:
             if '&#' in data:
                 data = HTMLFilter.HTMLDecode(data)
         except UnicodeDecodeError:
             log.debug('HTML decoding failed!!!')

         try:
             data = data.encode('utf-8')
         except:
             log.warn('new utf 8 method did not fit')

         try:
             self.__setattr__(name, data)
         except:
             log.debug('1. try failed: ')
             log.warning(type(data))
             log.debug(data)
             log.warning('Some unicode error while updating')


 
Newsgroups: comp.lang.python
From: Christian Ergh <christian.e...@gmail.com>
Date: Mon, 13 Dec 2004 15:55:57 +0100
Local: Mon, Dec 13 2004 9:55 am
Subject: Re: character encoding conversion
Forgot a part... You need the encoding list:

encodings = [
     'utf-8',
     'latin-1',
     'ascii',
     'cp1252',
     ]


 
Newsgroups: comp.lang.python
From: "Martin v. Löwis" <mar...@v.loewis.de>
Date: Mon, 13 Dec 2004 23:59:51 +0100
Local: Mon, Dec 13 2004 5:59 pm
Subject: Re: character encoding conversion

Christian Ergh wrote:
> Once more, indention should be correct now, and the 128 is gone too. So,
> something like this?

Yes, something like this. The tricky part is, of course, the
fragments which you didn't implement.

Also, it might be possible to do this in a for loop, e.g.

for encoding in (pageencoding, xmlencoding, htmlmetaencoding,
                 "UTF-8", "Latin-1-no-controls", "cp1252", "Latin-1"):
    try:
        data = data.decode(encoding)
        break
    except UnicodeError:
        pass

You then just need to add the Latin-1-no-controls codec, or you need
to special-case this in the loop.
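
One way to add such a codec, sketched in Python 3 (an assumption; the thread uses Python 2.3). "latin-1-no-controls" is not a standard codec; the search function, the helper names and the decode_first function below are all made up for illustration. The name comparison is normalized by hand because different Python versions pass hyphens to search functions differently.

```python
import codecs


def _decode_no_controls(data, errors="strict"):
    """Latin-1 decoding that rejects bytes in range(128, 160)."""
    raw = bytes(data)
    for index, byte in enumerate(raw):
        if 0x80 <= byte < 0xA0:
            raise UnicodeDecodeError("latin-1-no-controls", raw,
                                     index, index + 1, "C1 control byte")
    return codecs.latin_1_decode(raw, errors)


def _search(name):
    # Called by the codec registry for names it does not already know
    if name.replace("-", "_").lower() == "latin_1_no_controls":
        return codecs.CodecInfo(name="latin-1-no-controls",
                                encode=codecs.latin_1_encode,
                                decode=_decode_no_controls)
    return None


codecs.register(_search)


def decode_first(data, encodings):
    """Return (text, encoding) for the first encoding that decodes data."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except (UnicodeError, LookupError):
            pass
    raise ValueError("no encoding in the list could decode the data")


ORDER = ("UTF-8", "Latin-1-no-controls", "cp1252", "Latin-1")
print(decode_first(b"\x93hi\x94", ORDER)[1])  # cp1252
print(decode_first(b"caf\xe9", ORDER)[1])     # Latin-1-no-controls
```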

> # if it is not in the pagecode, how do i get the encoding of the page?
> pageencoding = '???'

You need to remember the HTTP connection that you got the HTML file
from. The webserver may have sent a Content-Type header.
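
A minimal sketch of pulling the charset out of a Content-Type value, in Python 3 (an assumption; the thread uses Python 2.3 and urllib2). The function name is hypothetical. It parses a header string directly, so it runs without a network connection.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=ISO-8859-1"))  # iso-8859-1
print(charset_from_content_type("text/html"))                      # None
```

With urllib.request in Python 3, the response's headers attribute is a message object of the same family, so response.headers.get_content_charset() should give the same answer for a live connection.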

> xmlencoding  = 'whatever i parsed out of the file'
> htmlmetaencoding = 'whatever i parsed out of the metatag'

Depending on the library you use, these aren't that trivial, either.
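
For rough experiments, the two declarations can be sniffed with regular expressions, as in the Python 3 sketch below. This is an approximation and not a real HTML or XML parser: the patterns, the function name and the 4096-byte limit are assumptions, and unusual markup will defeat them.

```python
import re

# encoding="..." inside an XML declaration at the very start of the document
XML_DECL = re.compile(
    rb'^\s*<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')
# charset=... anywhere inside a <meta ...> tag
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.IGNORECASE)


def sniff_declared(data, limit=4096):
    """Return (xml_encoding, meta_encoding); each is a str or None."""
    head = data[:limit]  # declarations are expected near the top
    xml = XML_DECL.search(head)
    meta = META_CHARSET.search(head)
    return (xml.group(1).decode("ascii") if xml else None,
            meta.group(1).decode("ascii") if meta else None)


page = (b'<?xml version="1.0" encoding="ISO-8859-1"?><html><head>'
        b'<meta http-equiv="Content-Type" '
        b'content="text/html; charset=windows-1252"></head></html>')
print(sniff_declared(page))  # ('ISO-8859-1', 'windows-1252')
```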

Regards,
Martin


 
Newsgroups: comp.lang.python
From: "Martin v. Löwis" <mar...@v.loewis.de>
Date: Tue, 14 Dec 2004 00:02:48 +0100
Local: Mon, Dec 13 2004 6:02 pm
Subject: Re: character encoding conversion

Max M wrote:
> A simple way to try out different encodings in a given order:

The loop is fine - although ('UTF-8', 'Latin-1', 'ASCII') is
somewhat redundant. The 'ASCII' case is never considered, since
Latin-1 effectively works as a catch-all encoding (as all byte
sequences can be considered Latin-1 - whether they are meaningful
data is a different question).
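
The catch-all behaviour is easy to check. The sketch below is Python 3 (an assumption; the thread uses Python 2.3), and first_decodable is a made-up helper that reports which encoding in a list wins.

```python
def first_decodable(data, encodings):
    """Return the name of the first encoding that decodes data, or None."""
    for encoding in encodings:
        try:
            data.decode(encoding)
            return encoding
        except UnicodeError:
            pass
    return None


# Latin-1 maps byte N to code point N, so every possible byte decodes
all_bytes = bytes(range(256))
assert [ord(c) for c in all_bytes.decode("latin-1")] == list(range(256))

sample = b"\x93quoted\x94"  # cp1252 curly quotes
print(first_decodable(sample, ["utf-8", "latin-1", "ascii", "cp1252"]))  # latin-1
print(first_decodable(sample, ["utf-8", "cp1252", "latin-1"]))           # cp1252
```

Anything listed after Latin-1 is never reached, so stricter encodings have to come first.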

Regards,
Martin


 