lihui
Mar 29, 2009, 1:04:04 AM
to pyth...@googlegroups.com, python-chinese
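Hi all,
Is there an algorithm that automatically detects the encoding of a piece of Chinese text? It needs to be highly reliable.
I have tried two snippets of code, and neither works very well.
1) Code 1: from modou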
def utf8Detect(text):
    """Heuristically detect whether a byte string is UTF-8 encoded."""
    lastch = 0
    begin = 0
    BOM = True
    BOMchs = (0xEF, 0xBB, 0xBF)
    good = 0
    bad = 0
    for char in text:
        ch = ord(char)
        # Compare the first three bytes against the UTF-8 BOM.
        if begin < 3:
            BOM = (BOMchs[begin] == ch) and BOM
            begin += 1
            continue
        # A complete BOM match settles the question; stop scanning.
        if begin == 3 and BOM:
            break
        if (ch & 0xC0) == 0x80:
            # Continuation byte: good after a lead byte, bad after ASCII.
            if (lastch & 0xC0) == 0xC0:
                good += 1
            elif (lastch & 0x80) == 0:
                bad += 1
        elif (lastch & 0xC0) == 0xC0:
            # Lead byte that is not followed by a continuation byte.
            bad += 1
        lastch = ch
    return (begin == 3 and BOM) or (good >= bad)
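The good/bad pair counting above is a heuristic, so it can accept byte strings that are not well-formed UTF-8. A stricter check is to let the codec validate the whole input. The following is a minimal sketch, assuming Python 3 and `bytes` input; the name `is_valid_utf8` is hypothetical and does not come from this thread.

```python
import codecs


def is_valid_utf8(data):
    """Return True if `data` (bytes) is well-formed UTF-8.

    Hypothetical helper for illustration only.
    """
    # An optional UTF-8 BOM is allowed and skipped before validation.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    try:
        # The default error handler is 'strict', so any malformed
        # sequence raises UnicodeDecodeError.
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True
```

Note that pure ASCII input also passes, since ASCII is a subset of UTF-8. A True result means only that the bytes are valid UTF-8, not that the text was necessarily written in it.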
2) Code 2
def zh2uni(s):
    """Automatically convert an encoded string to unicode.

    Chinese (PRC): gb2312 gbk gb18030 big5hkscs hz
    Chinese (ROC): big5 cp950
    Japanese: cp932 shift-jis shift-jisx0213 shift-jis-2004 euc-jp
              euc-jisx0213 euc-jis-2004 iso-2022-jp iso-2022-jp-1
              iso-2022-jp-2 iso-2022-jp-3 iso-2022-jp-ext iso-2022-jp-2004
    Korean: cp949 euc-kr johab iso-2022-kr
    Unicode: utf-7 utf-8 utf-16 utf-16-be utf-16-le
    -------------------------------------------------------
    Tries gb18030, utf-8, big5, euc_jp, utf-16, hz and euc_kr in order
    and returns (codec_name, unicode_text) for the first that decodes."""
    if isinstance(s, unicode):
        return 'utf-8', s
    # 'euc_jp' stands in for the 'jp' entry, which is not a registered
    # Python codec name (an assumption about the intent).
    for c in ('gb18030', 'utf-8', 'big5', 'euc_jp', 'utf-16', 'hz', 'euc_kr'):
        try:
            return c, s.decode(c)
        except (UnicodeError, LookupError):
            pass
    return 'unk', s
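One likely reason the trial-decoding loop above is unreliable is its order: gb18030 accepts a very wide range of byte sequences, so trying it first tends to hide UTF-8 input. The sketch below, assuming Python 3 and `bytes` input, checks for a byte order mark first and then tries the stricter UTF-8 codec before the permissive ones. The name `guess_zh_encoding`, the `_BOMS` table, and the default candidate order are all hypothetical choices made for illustration.

```python
import codecs

# Byte order marks mapped to a codec that consumes them while decoding.
_BOMS = (
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
)


def guess_zh_encoding(data, candidates=('utf-8', 'gb18030', 'big5')):
    """Return (codec_name, text), or (None, data) if nothing decodes.

    Hypothetical helper: the candidate order is a heuristic, not a rule.
    """
    # 1. A BOM is the strongest signal available, so check it first.
    for bom, name in _BOMS:
        if data.startswith(bom):
            try:
                return name, data.decode(name)
            except UnicodeDecodeError:
                break  # BOM-like prefix but invalid body; fall through
    # 2. Try strict codecs before permissive ones.
    for name in candidates:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue
    return None, data
```

Even with this ordering, trial decoding cannot reliably tell GBK from Big5, because most Big5 byte sequences are also valid GBK. Distinguishing those needs a statistical approach based on character frequencies, such as the one used by the chardet package.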