How do I detect the encoding of a Chinese string?

lihui

Mar 29, 2009, 1:04:04 AM

to pyth...@googlegroups.com, python-chinese

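Hi all,

Is there an algorithm that automatically detects the encoding of a piece of Chinese text? It needs to be highly reliable.
I have tried two snippets, and neither works very well.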

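1) Snippet 1, from modou:
def utf8Detect(text):
    """Detect if a byte string is UTF-8 encoded"""
    lastch = 0
    begin = 0
    BOM = True
    BOMchs = (0xEF, 0xBB, 0xBF)
    good = 0
    bad = 0
    # bytearray() yields integer byte values on both Python 2 and 3
    for ch in bytearray(text):
        if begin < 3:
            # Compare the first three bytes against the UTF-8 BOM
            BOM = (BOMchs[begin] == ch) and BOM
            begin += 1
            continue
        if BOM:
            # The text starts with a UTF-8 BOM; no need to look further
            break
        if (ch & 0xC0) == 0x80:
            # Continuation byte (10xxxxxx)
            if (lastch & 0xC0) == 0xC0:
                good += 1   # follows a lead byte (11xxxxxx)
            elif (lastch & 0x80) == 0:
                bad += 1    # follows an ASCII byte
        elif (lastch & 0xC0) == 0xC0:
            bad += 1        # lead byte not followed by a continuation byte
        lastch = ch
    return (begin == 3 and BOM) or good >= bad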

2) Snippet 2:
def zh2uni(s):
    """Automatically convert an encoded string to unicode

    Chinese (PRC): gb2312 gbk gb18030 big5hkscs hz
    Chinese (ROC): big5 cp950
    Japanese: cp932 shift-jis shift-jisx0213 shift-jis-2004 euc-jp
              euc-jisx0213 euc-jis-2004 iso-2022-jp iso-2022-jp-1
              iso-2022-jp-2 iso-2022-jp-3 iso-2022-jp-ext iso-2022-jp-2004
    Korean: cp949 euc-kr johab iso-2022-kr
    Unicode: utf-7 utf-8 utf-16 utf-16-be utf-16-le
    -------------------------------------------------------
    Tries gb18030, utf-8, big5, euc_jp, utf-16, hz and euc_kr in turn"""
    # Anything that is not a byte string (e.g. already-decoded text) is returned unchanged
    if not isinstance(s, bytes):
        return 'utf-8', s
    # 'jp' is not a Python codec name; 'euc_jp' is assumed here in its place
    for c in ('gb18030', 'utf-8', 'big5', 'euc_jp', 'utf-16', 'hz', 'euc_kr'):
        try:
            return c, s.decode(c)
        except UnicodeError:
            pass
    return 'unk', s
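
Both ideas above can be approximated with the codecs that ship with Python. What follows is only a minimal sketch, assuming Python 3 and `bytes` input; the names `is_valid_utf8` and `decode_first_match` and the candidate order are illustrative assumptions, not something from this thread. UTF-8 is tried first because its strict decoder rejects most text written in legacy double-byte encodings.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def decode_first_match(data: bytes, candidates=("utf-8", "gb18030", "big5")):
    """Return (codec_name, text) for the first candidate that decodes data,
    or (None, None) if none of them does."""
    for name in candidates:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue
    return None, None


if __name__ == "__main__":
    sample = "中文"
    print(decode_first_match(sample.encode("utf-8")))
    print(decode_first_match(sample.encode("gb18030")))
```

A successful decode is not proof of the encoding: the legacy double-byte encodings overlap heavily, so the first candidate that happens to accept the bytes wins even when it is the wrong one.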

shell909090

Mar 29, 2009, 11:48:34 AM

to pyth...@googlegroups.com

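lihui wrote:

Good grief, I really want to say RTFM.
Go google chardet.
There are a few ways to guess the original encoding. One is to use the BOM, which only helps for UTF-8; that is your first approach.
The second is to try decoding over and over until something succeeds; that is your second approach. In production, though, we all use
iconv (chardet in Python) to guess the encoding. Hardly anyone writes this by hand.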
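
The BOM check mentioned above can be sketched with the constants in the standard `codecs` module. This is a minimal sketch, assuming Python 3 and `bytes` input; `sniff_bom` is a hypothetical helper name. The module also defines BOMs for UTF-16 and UTF-32, so the sketch checks those as well.

```python
import codecs

# Longer BOMs come first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def sniff_bom(data: bytes):
    """Return a codec name that consumes the BOM at the start of data,
    or None if data does not start with a known BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


if __name__ == "__main__":
    data = codecs.BOM_UTF8 + "中文".encode("utf-8")
    name = sniff_bom(data)
    print(name, data.decode(name))
```

A missing BOM says nothing about the encoding, so this can only serve as a first step before other guesses.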

刘其帅

Mar 29, 2009, 9:27:42 PM

to pyth...@googlegroups.com

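chardet builds per-character frequency statistics from sample texts and, when matching, picks the most likely encoding. It has nothing to do with iconv.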
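
The idea of scoring candidate encodings by character statistics can be illustrated with a toy. This is not chardet's actual algorithm or data: `COMMON_HANZI` is a tiny hand-picked set standing in for a real frequency table, and `score` and `guess_by_frequency` are hypothetical names. The sketch assumes Python 3 and `bytes` input.

```python
# Tiny stand-in for a real frequency table (assumption: hand-picked characters).
COMMON_HANZI = set("的一是不了人我在有他这中大来上国个到说们")


def score(text: str) -> float:
    """Fraction of the characters in text that appear in COMMON_HANZI."""
    if not text:
        return 0.0
    return sum(ch in COMMON_HANZI for ch in text) / len(text)


def guess_by_frequency(data: bytes, candidates=("utf-8", "gb18030", "big5")):
    """Decode data with every candidate and return the (codec, score) pair
    with the highest score, or (None, 0.0) if no candidate decodes it."""
    best = (None, 0.0)
    for name in candidates:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue
        s = score(text)
        # Earlier candidates win ties because only a strictly higher score replaces them.
        if best[0] is None or s > best[1]:
            best = (name, s)
    return best


if __name__ == "__main__":
    print(guess_by_frequency("我们是中国人".encode("gb18030")))
```

Unlike the first-success loop, every candidate that decodes is compared, so a wrong encoding that merely happens to accept the bytes tends to lose to the one that yields plausible text.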

2009/3/29 shell909090 <shell...@gmail.com>

shell909090

Mar 29, 2009, 11:14:29 PM

to pyth...@googlegroups.com

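刘其帅 wrote:

> Good grief, I really want to say RTFM.
> Go google chardet.
> There are a few ways to guess the original encoding. One is to use the BOM, which only helps for UTF-8; that is your first approach.
> The second is to try decoding over and over until something succeeds; that is your second approach. In production, though, we all use
> iconv (chardet in Python) to guess the encoding. Hardly anyone writes this by hand.

Right, which is why I said iconv for C and chardet for Python. If I had meant to recommend iconv, I would have suggested a Python wrapper for iconv.
This question comes up like clockwork.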
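
For reference, calling chardet from Python might look like the following. This is a minimal sketch, assuming the third-party `chardet` package is installed; `detect_with_chardet` is a hypothetical wrapper name. `chardet.detect` takes bytes and returns a dict that includes `encoding` and `confidence` keys.

```python
def detect_with_chardet(data: bytes):
    """Return (encoding, confidence) as guessed by chardet,
    or None if chardet is not installed."""
    try:
        import chardet  # third-party package, not part of the standard library
    except ImportError:
        return None
    result = chardet.detect(data)
    return result["encoding"], result["confidence"]


if __name__ == "__main__":
    raw = ("如何判断一个中文字符串的编码？" * 5).encode("gb18030")
    print(detect_with_chardet(raw))
```

The result is a guess with a confidence value, and short inputs give the detector little to work with, so it is worth checking the confidence before trusting the encoding.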

lihui

Apr 10, 2009, 3:33:18 AM

to pyth...@googlegroups.com

chardet works well, thanks for the pointer.

Previously, since everything was in Simplified Chinese, I got by with

try:
    m.decode('gb2312')
except UnicodeDecodeError:
    m.decode('utf-8')

After checking the docs, I realized that gb18030 and gb2312 are actually different :)
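
The difference is easy to see from Python. This is a minimal sketch, assuming Python 3 and its built-in `gb2312`, `gbk` and `gb18030` codecs; `can_encode` is a hypothetical helper. GB2312 covers only a few thousand Simplified Chinese characters, while GB18030 covers all of Unicode, so text that is valid GB18030 may be rejected by `gb2312`.

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    # A common character, a character outside GB2312, and one outside the BMP.
    for ch in ("中", "镕", "\U00020000"):
        support = {c: can_encode(ch, c) for c in ("gb2312", "gbk", "gb18030")}
        print(repr(ch), support)
```

Decoding with `gb18030` instead of `gb2312` is therefore the safer choice for Simplified Chinese text of unknown origin.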

2009/3/30 shell909090 <shell...@gmail.com>

Jiahua Huang

Apr 10, 2009, 5:03:52 AM

to pyth...@googlegroups.com

2009/4/10 lihui <lihu...@gmail.com>:
> After checking the docs, I realized that gb18030 and gb2312 are actually different :)
>

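orz, no wonder so many people in China dare to label text that plainly contains GB18030 characters as GB2312, which lacks those characters.
They thought the two were the same...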

shell909090

Apr 10, 2009, 5:48:16 AM

to pyth...@googlegroups.com

Jiahua Huang wrote:

I noticed that long ago. It turned even my "jiong" into mojibake.
