- some of the sources have incorrectly encoded characters... for example, cp1252 curly quotes that were likely the result of the author copying and pasting content from Word
I've searched and read for many hours, but have not found a solution for handling the case where the page author does not use the character encoding that they have specified.
Things I have tried include encode()/decode() and replacement lookup tables (i.e. something like http://groups-beta.google.com/group/comp.lang.python/browse_thread/th... ). However, I am still unable to convert the characters to something meaningful. The lookup table failed because all of the improperly encoded characters came back as ? rather than in their original encoding.
I'm using urllib and htmllib to open, read, and parse the HTML fragments, with Python 2.3 on OS X 10.3.
Any ideas or pointers would be greatly appreciated.
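To make the symptom concrete, here is a minimal sketch of what cp1252 curly quotes look like when a page is read as Latin-1. It is written for Python 3 (the thread itself uses Python 2.3), and the sample bytes and the helper name has_c1_controls are assumptions made up for illustration, not taken from any real page.

```python
# Hypothetical sample: bytes from a page that was saved as Windows-1252
# (cp1252) even though it may declare ISO-8859-1 (Latin-1).
raw = b"\x93quoted\x94 text"

# Latin-1 maps every byte to a code point, so this never fails,
# but 0x93 and 0x94 become invisible C1 control characters.
as_latin1 = raw.decode("latin-1")

# cp1252 maps the same bytes to the intended curly quotes.
as_cp1252 = raw.decode("cp1252")


def has_c1_controls(text):
    """Return True if text contains code points in range(128, 160)."""
    return any(128 <= ord(ch) < 160 for ch in text)


print(repr(as_latin1), has_c1_controls(as_latin1))
print(repr(as_cp1252), has_c1_controls(as_cp1252))

# One way question marks can appear: encoding to a charset that lacks
# the character, with the 'replace' error handler.
print(as_cp1252.encode("ascii", "replace"))
```

Whether this is what produced the ? characters in the lookup-table attempt cannot be told from the post; it is only one common way to end up with them.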
will give you a file that contains only ASCII characters, and character references for everything else.
Now, how should you guess the encoding? Here is a strategy:
1. use the encoding that was sent through the HTTP header. Be absolutely certain to not ignore this encoding.
2. use the encoding in the XML declaration (if any).
3. use the encoding in the http-equiv meta element (if any)
4. use UTF-8
5. use Latin-1, and check that there are no characters in the range(128,160)
6. use cp1252
7. use Latin-1
In the order from 1 to 6, check whether you manage to decode the input. Notice that in step 5, you will definitely get successful decoding; consider this a failure if you get any control characters (from range(128, 160)). If step 6 fails as well, fall back to Latin-1 again in step 7.
When you find the first encoding that decodes correctly, encode the result with ascii and xmlcharrefreplace, and you won't need to worry about the encoding anymore.
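The seven steps could be sketched roughly as follows. This is Python 3, not the Python 2.3 used in the thread, and the names guess_and_decode, to_ascii_with_charrefs and the declared parameter are hypothetical; extracting the declared encodings from the HTTP header, XML declaration and meta element is assumed to happen elsewhere.

```python
def has_c1_controls(text):
    """Return True if text contains code points in range(128, 160)."""
    return any(128 <= ord(ch) < 160 for ch in text)


def guess_and_decode(raw, declared=()):
    """Decode raw bytes using the first candidate encoding that works.

    declared holds the encodings found in steps 1-3 (HTTP header,
    XML declaration, http-equiv meta element), in that order.
    Returns a (text, encoding) tuple.
    """
    # Steps 1-3: whatever the page or server declared.
    attempts = [(enc, False) for enc in declared if enc]
    # Steps 4-7; the flag says whether C1 control characters count as failure.
    attempts += [
        ("utf-8", False),
        ("latin-1", True),   # step 5: reject if range(128, 160) shows up
        ("cp1252", False),
        ("latin-1", False),  # step 7: last resort, always succeeds
    ]
    for encoding, reject_c1 in attempts:
        try:
            text = raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # LookupError covers encoding names Python does not know.
            continue
        if reject_c1 and has_c1_controls(text):
            continue
        return text, encoding
    raise AssertionError("unreachable: Latin-1 decodes any byte string")


def to_ascii_with_charrefs(text):
    """Return text as ASCII, with character references for everything else."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


text, used = guess_and_decode(b"\x93hi\x94")
print(used, to_ascii_with_charrefs(text))
```

Note that, followed literally, a page that declares Latin-1 but contains cp1252 bytes is accepted at step 1, because the control-character check belongs only to step 5. Applying the same check to declared Latin-1 would be a variation on the strategy, not part of it.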
> Regards,
> Martin
I have a similar problem, with characters like äöüAÖÜß and so on. I am extracting some content out of web pages, and they deliver whatever, sometimes not even giving any encoding information in the header. Your solution sounds quite good, but I do not know:
- whether it works with the characters I mentioned
- what encoding you have in the end
- how exactly you are doing all this. All with somestring.decode(), or...?
Can you please give an example for these 7 steps?
Thanks in advance for the help,
Chris
Something like this? Chris
import urllib2

url = 'www.someurl.com'
f = urllib2.urlopen(url)
data = f.read()
# if it is not in the pagecode, how do i get the encoding of the page?
pageencoding = ???
xmlencoding = 'whatever i parsed out of the file'
htmlmetaencoding = 'whatever i parsed out of the metatag'
f.close()
try:
    data = data.decode(pageencoding)
except:
    try:
        data = data.decode(xmlencoding)
    except:
        try:
            data = data.decode(htmlmetaencoding)
        except:
            try:
                data = data.encode('UTF-8')
            except:
                flag = true
                for char in data:
                    if 127 < ord(char) < 128:
                        flag = false
                if flag:
                    try:
                        data = data.encode('latin-1')
                    except:
                        pass
                try:
                    data = data.encode('cp1252')
                except:
                    pass
                try:
                    data = data.encode('latin-1')
                except:
                    pass
data = data.encode("ascii", "xmlcharrefreplace")
Christian Ergh wrote:
> flag = true
> for char in data:
>     if 127 < ord(char) < 128:
>         flag = false
> if flag:
>     try:
>         data = data.encode('latin-1')
>     except:
>         pass
A little OT, but (assuming I got your indentation right[1]) this kind of loop is exactly what the else clause of a for-loop is for:
for char in data:
    if 127 < ord(char) < 128:
        break
else:
    try:
        data = data.encode('latin-1')
    except:
        pass
Only saves you one line of code, but you don't have to keep track of a 'flag' variable. Generally, I find that when I want to set a 'flag' variable, I can usually do it with a for/else instead.
Steve
[1] Messed up indentation happens in a lot of clients if you have tabs in your code. If you can replace tabs with spaces before posting, this usually solves the problem.
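For comparison, here is a small runnable sketch of the flag version next to the for/else version, in Python 3. The function names are made up for illustration, and the upper bound is 160, following step 5 of the strategy (range(128, 160)), rather than the 128 in the quoted snippet.

```python
def is_clean_with_flag(text):
    """Flag version: True if text has no characters in range(128, 160)."""
    flag = True
    for ch in text:
        if 127 < ord(ch) < 160:
            flag = False
    return flag


def is_clean_with_for_else(text):
    """for/else version: the else branch runs only if the loop never breaks."""
    for ch in text:
        if 127 < ord(ch) < 160:
            break  # found a C1 control character, skip the else branch
    else:
        return True
    return False


for sample in ("caf\u00e9", "\x93hi", ""):
    print(repr(sample), is_clean_with_flag(sample), is_clean_with_for_else(sample))
```

The for/else form also stops at the first offending character, whereas the flag loop as written scans the whole string.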
Once more; the indentation should be correct now, and the 128 is gone too. So, something like this? Chris
import urllib2

url = 'www.someurl.com'
f = urllib2.urlopen(url)
data = f.read()
# if it is not in the pagecode, how do i get the encoding of the page?
pageencoding = '???'
xmlencoding = 'whatever i parsed out of the file'
htmlmetaencoding = 'whatever i parsed out of the metatag'
f.close()
try:
    data = data.decode(pageencoding)
except:
    try:
        data = data.decode(xmlencoding)
    except:
        try:
            data = data.decode(htmlmetaencoding)
        except:
            try:
                data = data.encode('UTF-8')
            except:
                flag = true
                for char in data:
                    if 127 < ord(char) < 160:
                        flag = false
                if flag:
                    try:
                        data = data.encode('latin-1')
                    except:
                        pass
                try:
                    data = data.encode('cp1252')
                except:
                    pass
                try:
                    data = data.encode('latin-1')
                except:
                    pass
data = data.encode("ascii", "xmlcharrefreplace")
> Even more off-topic:
>
> >>> for char in data:
> ...     if 127 < ord(char) < 128:
> ...         break
> ...
> >>> print char
> 127.5
>
> :-)
>
> Peter
Well yes, that happens when doing a quick hack and not reviewing it, 128 has to be 160 of course...
> def get_encoded(st, encodings):
>     "Returns an encoding that doesn't fail"
>     for encoding in encodings:
>         try:
>             st_encoded = st.decode(encoding)
>             return st_encoded, encoding
>         except UnicodeError:
>             pass
-snip-
This works fine, but after this you still have three possible encodings (or even more; looking at the data on the net, you'll see a lot of encodings...). What we need is just one for all.
Chris
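One way to end up with "just one for all" is to treat the guessed encoding as an input detail only: decode to Unicode with whichever encoding works, then always encode to a single target such as UTF-8. A minimal Python 3 sketch, where get_decoded, normalise_to_utf8 and the default encoding list are assumptions for illustration:

```python
def get_decoded(raw, encodings):
    """Return (text, encoding) for the first encoding that decodes raw."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeError:
            pass
    # Raise explicitly instead of silently returning None.
    raise ValueError("none of the encodings could decode the input")


def normalise_to_utf8(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Return the same text as UTF-8 bytes, whatever the source encoding was."""
    text, _source_encoding = get_decoded(raw, encodings)
    return text.encode("utf-8")


# The same character arriving in two different encodings ends up identical.
from_latin1 = normalise_to_utf8(b"\xe4")
from_utf8 = normalise_to_utf8("\u00e4".encode("utf-8"))
print(from_latin1 == from_utf8)
```

After this step everything downstream (database, templates) only ever sees UTF-8, regardless of how many source encodings turned up.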
Finally: for me this works. It is all inside my own class, and the module has a logger, so for reuse you would need to fix this stuff... I am updating a PostgreSQL database, in case someone wonders about the __setattr__, and my class inherits from SQLObject.
def doDecode(self, st):
    "Returns an encoding that doesn't fail"
    for encoding in encodings:
        try:
            stEncoded = st.decode(encoding)
            return stEncoded
        except UnicodeError:
            pass

def setAttribute(self, name, data):
    import HTMLFilter
    data = self.doDecode(data)
    try:
        data = data.encode('ascii', "xmlcharrefreplace")
    except:
        log.warn('new method did not fit')

    try:
        if '&#' in data:
            data = HTMLFilter.HTMLDecode(data)
    except UnicodeDecodeError:
        log.debug('HTML decoding failed!!!')

    try:
        data = data.encode('utf-8')
    except:
        log.warn('new utf 8 method did not fit')
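A standalone approximation of the same pipeline (decode with the first encoding that works, resolve character references, store as UTF-8) might look like the sketch below. It is Python 3 and leaves out the class and SQLObject. The HTMLFilter module is not shown in the thread, so the standard library's html.unescape stands in for HTMLFilter.HTMLDecode; that substitution, the ENCODINGS tuple and the function names are all assumptions.

```python
import html
import logging

log = logging.getLogger(__name__)

# Assumed candidate list; the original module-level 'encodings' is not shown.
ENCODINGS = ("utf-8", "cp1252", "latin-1")


def do_decode(raw, encodings=ENCODINGS):
    """Return raw decoded with the first encoding that does not fail."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeError:
            log.debug("decoding as %s failed", encoding)
    raise ValueError("none of the encodings could decode the input")


def clean_attribute(raw):
    """Decode scraped bytes, resolve character references, return UTF-8 bytes."""
    text = do_decode(raw)
    # Stand-in for HTMLFilter.HTMLDecode: turns &#8364; and &amp; into characters.
    text = html.unescape(text)
    return text.encode("utf-8")


# Hypothetical scraped value: cp1252 curly quotes plus a numeric reference.
stored = clean_attribute(b"\x93caf\xe9\x94 &#8364;5")
print(stored.decode("utf-8"))
```

Because the text is already Unicode after decoding, the intermediate ASCII/xmlcharrefreplace round trip from the original is skipped here; only references that were present in the page itself need resolving.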
>> - scrape some html content from various sources
>> The issue I'm running to:
>> - some of the sources have incorrectly encoded characters... for >> example, cp1252 curly quotes that were likely the result of the author >> copying and pasting content from Word
Max M wrote:
> A simple way to try out different encodings in a given order:
The loop is fine - although ('UTF-8', 'Latin-1', 'ASCII') is somewhat redundant. The 'ASCII' case is never considered, since Latin-1 effectively works as a catch-all encoding (as all byte sequences can be considered Latin-1 - whether they are meaningful data is a different question).
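The catch-all behaviour is easy to check. In this Python 3 sketch, first_working is a hypothetical helper that mimics the try-in-order loop and reports which encoding was accepted:

```python
# Every possible byte value, 0 through 255.
all_bytes = bytes(range(256))

# Latin-1 maps byte N to code point N, so decoding can never fail.
text = all_bytes.decode("latin-1")
print([ord(ch) for ch in text] == list(range(256)))


def first_working(raw, encodings):
    """Return the name of the first encoding that decodes raw, or None."""
    for encoding in encodings:
        try:
            raw.decode(encoding)
        except UnicodeError:
            continue
        return encoding
    return None


order = ("UTF-8", "Latin-1", "ASCII")
print(first_working(all_bytes, order))  # not valid UTF-8, so Latin-1 catches it
print(first_working(b"plain", order))   # pure ASCII is already valid UTF-8
```

So with this ordering the 'ASCII' entry is unreachable from both sides: pure ASCII input is accepted as UTF-8, and anything else that fails UTF-8 is accepted as Latin-1.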