MPNxxx〓Encoding in Python programs: please check whether this is right

Oyster

Nov 6, 2007, 8:48:57 PM

to python-cn:CPyUG

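Here is my summary; please check whether it is right.
1. Write Chinese characters in the source as ordinary strings, '汉字',
not as unicode strings, u'汉字'.

2. Save the source file as UTF-8 and put #coding=utf-8 at the top of the file.
In fact, it is enough for the interpreter to find "# coding <encoding>", so it does not
matter even if you sloppily write "encoding" instead. I personally like
# coding="utf-8"
because it looks like a valid Python assignment statement.

3. To print a string, or a variable that holds a string, you need
print unicode('汉字','utf-8')
Unfortunately, the encoding argument cannot be omitted here in the hope that the utf-8 declared in the file header will be used:
omitting it works for me on py24, but py25 raises an error.

4. When passing a string as a function argument, convert it to a unicode object first.

5. unicode + str returns unicode, so such a value can be printed without converting it again.

6. Printing a class instance (or calling str on the instance) calls __str__,
which also returns a str object, and that raises an error here.
unicode(obj) first looks for __unicode__ and returns its value; if it is not found,
__str__ is used instead.

7. If a variable comes from raw_input, it is a str object and must first be converted to a unicode object with either
a) unicode(variable, sys.stdin.encoding)
or
b) variable.decode(sys.stdin.encoding)

8. For file I/O, use codecs.open.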
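
Point 2 can be checked against the pattern that PEP 263 gives (in its original form) for locating the declaration. The sketch below is written in Python 3 syntax; the helper name `declared_encoding` is hypothetical and only illustrates the matching step, not the interpreter's full logic (which also restricts the search to the first two lines and honors a UTF-8 byte order mark).

```python
import re

# Pattern quoted in PEP 263 for finding the declaration inside a comment.
CODING_RE = re.compile(r"coding[:=]\s*([-\w.]+)")

def declared_encoding(line):
    """Return the encoding named in a header comment line, or None."""
    if not line.lstrip().startswith("#"):
        return None
    match = CODING_RE.search(line)
    return match.group(1) if match else None

# Try the header spellings that appear in this thread.
for header in ("#coding=utf-8",
               "# -*- coding: UTF-8 -*-",
               "# encoding: utf-8",
               '# coding="utf-8"'):
    print(repr(header), "->", declared_encoding(header))
```

Under this pattern "encoding" is accepted because it contains "coding", but the quoted spelling is not matched, since a quote character cannot start the encoding name. A file with a quoted header would then be treated as having no declaration, unless something else (such as a byte order mark written by the editor) identifies it as UTF-8.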

My rather messy test code:
# coding='utf-8'
import locale
import sys

def hi(n2, name):
    return '%s, %s' % (n2,name)

name=raw_input('name>> ')
n2='你好'
print type(name), type(n2)
print hi(n2,name)
print hi(unicode(n2,'utf-8'),unicode(name,sys.stdin.encoding))
print hi(unicode(n2,'utf-8'),name.decode(sys.stdin.encoding))
print map(len, (unicode(n2,'utf-8'),name.decode(sys.stdin.encoding)))
print map(type,
(n2,name,unicode(n2,'utf-8'),unicode(name,sys.stdin.encoding)))
print map(type,(hi(n2,name),hi(unicode(n2,'utf-8'),unicode(name,
sys.stdin.encoding))))
print unicode(n2,'utf-8') + unicode(name,sys.stdin.encoding)

print sys.stdin.encoding, sys.stdout.encoding
print

def info(name, age=0):
    return "I'm %s, and %0i years old" % (name, age)

print '===1st test'

print unicode(info(unicode('张','utf-8'), 20)) # the output here is correct
print 'length=', len(unicode('张','utf-8')) # the length is also computed correctly
print

print '===2nd test'
name=raw_input('name>> ') # type one Chinese character, e.g. 李
print type(name)
print name, unicode(name,sys.stdin.encoding), name.decode(sys.stdin.encoding)
print map(len,(name, unicode(name,sys.stdin.encoding),
    name.decode(sys.stdin.encoding)))
print unicode(name,sys.stdin.encoding)==name.decode(sys.stdin.encoding)
print info(name, 21) # output is fine
print unicode(info(name, 21),sys.stdin.encoding) # output is fine
print info(unicode(name,sys.stdin.encoding), 21) # output is fine
print map(type, (info(name, 21),
    info(unicode(name,sys.stdin.encoding), 21)))
print

print '===3rd test'
print type(name) # it is a str
name2='李'
print type(name) # still a str
print name==name2 # False?
print info(name, 21) # output is fine
print 'length=', len(name) # but the length is wrong
print

class human(object):
    def __init__(self, name, age=0):
        self.name=name
        self.age=age
    def __str__(self):
        print 'in __str__'
        return "I'm %s, and %0i years old" %(self.name, self.age)
    def __unicode__(self):
        return "unistr: I'm %s, and %0i years old" % (self.name, self.age)

print '===4th test'
zhang=human(unicode('张','utf-8'),20)
print 'zhang.name=',zhang.name # output is fine
print 'length=', len(zhang.name) # length is fine

print unicode(zhang) # output is fine
#print zhang # error!
#print str(zhang) # error!
#print unicode(str(zhang)) # error!

t=unicode(zhang)
import codecs
outfile = codecs.open("test.txt", "w", "utf-8")
outfile.write(t)
#outfile.write(t.decode('utf-8'))
#outfile.write(unicode('汉字','utf-8'))
outfile.close()
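
Points 3, 7 and 8 of the summary amount to one habit: decode bytes once where they enter the program, work with text inside, and let codecs.open encode on the way out. The following is a minimal sketch in Python 3 syntax, where `bytes` plays the role of Python 2's `str` and `str` plays the role of `unicode`. The helper `to_text` is hypothetical, and the GBK bytes stand in for console input under the assumption that the console uses GBK.

```python
import codecs
import os
import tempfile

def to_text(data, encoding):
    """Decode bytes to text once, at the boundary; pass text through unchanged."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data

# The same character as a UTF-8 source file and as a GBK console would deliver it.
utf8_bytes = "张".encode("utf-8")
gbk_bytes = "张".encode("gbk")

name = to_text(utf8_bytes, "utf-8")
same_name = to_text(gbk_bytes, "gbk")

# The byte counts depend on the encoding; the character count does not.
print(len(utf8_bytes), len(gbk_bytes), len(name))
print(name == same_name)

# codecs.open encodes on write and decodes on read.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with codecs.open(path, "w", "utf-8") as outfile:
    outfile.write(name)
with codecs.open(path, "r", "utf-8") as infile:
    round_trip = infile.read()
print(round_trip == name)
```

Once both inputs are decoded with their own encodings they compare equal, which is what the `name==name2 # False?` line in the third test is missing.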

Zoom.Quiet

Nov 6, 2007, 10:13:07 PM

to pyth...@googlegroups.com, pyth...@googlegroups.com, cpug-ea...@googlegroups.com, zp...@googlegroups.com

On Nov 7, 2007 9:48 AM, Oyster <lepto....@gmail.com> wrote:
> Here is my summary; please check whether it is right.
Archived!! Please keep going.
http://wiki.woodpecker.org.cn/moin/PyInChinese
PS:
http://wiki.woodpecker.org.cn/moin/BPUG/2007-03-03
has a related analysis ...


Jiahua Huang

Nov 7, 2007, 6:12:24 AM

to pyth...@googlegroups.com

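1. Either way works

2. The wiki has an analysis

3. Not needed
3.1 A str can be printed directly just the same
3.2 Before using unicode(), add
import sys
reload(sys)
sys.setdefaultencoding('utf8')

4. Not needed

I won't go into the rest.

On 07-11-7, Oyster<lepto....@gmail.com> wrote: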

Oyster

Nov 8, 2007, 12:49:27 AM

to python-cn:CPyUG

Still py24 and py25 on Chinese Windows 2000. The scripts are run from the command line (saved as UTF-8 first, of course).
################# start of p1.py #####################
#coding='utf-8' #line1

import sys #line2
reload(sys) #line3
sys.setdefaultencoding('utf-8') #line4

def hi(a, b): #line5
    return a+b #line6

ua=u'你好' #line7
print len(ua) # prints 2, fine #line8
print ua # prints 你好, fine #line9
print #line10

a='你好' #line11
print len(a),type(a) # prints 6 <type 'str'>; 6 is not what I want #line12
print len(unicode(a)),type(unicode(a)) # prints 2 <type 'unicode'>, fine #line13
print a # prints 浣犲ソ #line14
print unicode(a) # prints 你好, fine #line15
print

b=raw_input('>> ') # type 啊 #line16
print b # prints 啊 #line17
print hi(a,b) # prints 浣犲ソ啊 #line18
print hi(ua,b) # fails immediately, as follows #line19
'''
Traceback (most recent call last):
  File "p1.py", line 25, in ?
    print hi(ua,b) #杈撳嚭娴ｇ姴銈藉晩 #line18
  File "p1.py", line 8, in hi
    return a+b #line6
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb0 in position 0: unexpected code byte
'''
################# end of p1.py #####################
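
Both symptoms in p1.py can be reproduced from the bytes alone. The sketch below uses Python 3 syntax and assumes the Windows console works in GBK, which the thread does not state outright; the variable names are only illustrative.

```python
# What the UTF-8 source file stores for a='你好'.
utf8_bytes = "你好".encode("utf-8")

# What a GBK console would deliver for the typed character 啊.
gbk_input = "啊".encode("gbk")

# A GBK console interprets the six UTF-8 bytes as three unrelated characters.
mojibake = utf8_bytes.decode("gbk")

# Decoding the GBK input as UTF-8 fails on its very first byte.
try:
    gbk_input.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(len(utf8_bytes), len(mojibake))
print(hex(gbk_input[0]), error.start if error else None)
```

Under that assumption, line14 prints garbage because the console reads UTF-8 bytes as GBK, and line19 fails because the default encoding set by sys.setdefaultencoding('utf-8') is applied to console input that is not UTF-8. Decoding the input with sys.stdin.encoding instead avoids the second problem.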

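My conclusions:
On Nov 7, 7:12 PM, "Jiahua Huang" <jhuangjia...@gmail.com> wrote:
> 1. Either way works
line12 and line14 show that not just any way works.

>
> 2. The wiki has an analysis
>
> 3. Not needed
> 3.1 A str can be printed directly just the same

line14 shows that a str cannot simply be printed as-is; converting to unicode is needed.

> 3.2 Before using unicode(), add
> import sys
> reload(sys)
> sys.setdefaultencoding('utf8')

Thanks a lot.

>
> 4. Not needed
line18 and line19 still show that it is "needed".

If you say "just convert to unicode inside the function body", then
1. converting inside the function body is still a conversion, not direct use
2. look at the two examples below (which are in fact equivalent here); they show that converting to unicode inside the function body does not work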
################### start of p2.py #################
#coding='utf-8' #line1

import sys #line2
reload(sys) #line3
sys.setdefaultencoding('utf-8') #line4

def hi(a, b): #line5
    return unicode(a)+unicode(b) #line6

ua=u'你好' #line7

a='你好' #line8

b=raw_input('>> ') # type 啊 #line9

#print hi(a,b) # fails immediately, as follows #line10
'''
Traceback (most recent call last):
  File "p2.py", line 16, in ?
    print hi(a,b) #鐩存帴鍑洪敊濡備笅 #line10
  File "p2.py", line 8, in hi
    return unicode(a)+unicode(b) #line6
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb0 in position 0: unexpected code byte
'''

print hi(ua,b) # fails immediately, as follows #line11
'''
Traceback (most recent call last):
  File "p2.py", line 26, in ?
    print hi(ua,b) #鐩存帴鍑洪敊濡備笅 #line11
  File "p2.py", line 8, in hi
    return unicode(a)+unicode(b) #line6
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb0 in position 0: unexpected code byte
'''
############## end of p2.py ################

############## start of p3.py ################
#coding='utf-8' #line1

import sys #line2
reload(sys) #line3
sys.setdefaultencoding('utf-8') #line4

def hi(a, b): #line5
    return u'%s%s' %(a,b) #line6

ua=u'你好' #line7

a='你好' #line8

b=raw_input('>> ') # type 啊 #line9

print hi(a,b) # fails immediately, as follows #line10
'''
Traceback (most recent call last):
  File "p3.py", line 16, in ?
    print hi(a,b) #鐩存帴鍑洪敊濡備笅 #line10
  File "p3.py", line 8, in hi
    return u'%s%s' %(a,b) #line6
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb0 in position 0: unexpected code byte
'''

print hi(ua,b) # fails immediately, as follows #line11
'''
Traceback (most recent call last):
  File "p3.py", line 26, in ?
    print hi(ua,b) #鐩存帴鍑洪敊濡備笅 #line11
  File "p3.py", line 8, in hi
    return u'%s%s' %(a,b) #line6
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb0 in position 0: unexpected code byte
'''
############## end of p3.py ################

> I won't go into the rest.
Same here.

Jiahua Huang

Nov 8, 2007, 1:03:57 AM

to pyth...@googlegroups.com

Stop insisting on that cumbersome approach.

Below is your code exactly as posted,
except that the header is changed to
#!/usr/bin/python
# -*- coding: UTF-8 -*-

huahua@huahua:dirty$ cat p1.py
#!/usr/bin/python
# -*- coding: UTF-8 -*-

import sys #line2
reload(sys) #line3
sys.setdefaultencoding('utf-8') #line4

def hi(a, b): #line5
    return a+b #line6

ua=u'你好' #line7
print len(ua) # prints 2, fine #line8
print ua # prints 你好, fine #line9
print #line10

a='你好' #line11
print len(a),type(a) # prints 6 <type 'str'>; 6 is not what I want #line12
print len(unicode(a)),type(unicode(a)) # prints 2 <type 'unicode'>, fine #line13
print a # prints 浣犲ソ #line14
print unicode(a) # prints 你好, fine #line15
print

b=raw_input('>> ') # type 啊 #line16
print b # prints 啊 #line17
print hi(a,b) # prints 浣犲ソ啊 #line18
print hi(ua,b) # fails immediately, as follows #line19

huahua@huahua:dirty$ ./p1.py
2
你好

6 <type 'str'>
2 <type 'unicode'>
你好
你好

>> 啊
啊
你好啊
你好啊

Jiahua Huang

Nov 8, 2007, 1:05:08 AM

to pyth...@googlegroups.com

huahua@huahua:dirty$ cat p2.py

#!/usr/bin/python
# -*- coding: UTF-8 -*-
import sys #line2
reload(sys) #line3
sys.setdefaultencoding('utf-8') #line4

def hi(a, b): #line5
    return unicode(a)+unicode(b) #line6

ua=u'你好' #line7

a='你好' #line8

b=raw_input('>> ') # type 啊 #line9

print hi(a,b) # fails immediately, as follows #line10

huahua@huahua:dirty$ ./p2.py
>> 啊
你好啊

Jiahua Huang

Nov 8, 2007, 1:10:54 AM

to pyth...@googlegroups.com

huahua@huahua:dirty$ cat p3.py

#!/usr/bin/python
# -*- coding: UTF-8 -*-
import sys #line2
reload(sys) #line3
sys.setdefaultencoding('utf-8') #line4

def hi(a, b): #line5
    return u'%s%s' %(a,b) #line6

ua=u'你好' #line7

a='你好' #line8

b=raw_input('>> ') # type 啊 #line9

print hi(a,b) # fails immediately, as follows #line10
huahua@huahua:dirty$ ./p3.py
>> 啊
你好啊

2007-11-08-141006_433x398_scrot.png

Jiahua Huang

Nov 8, 2007, 1:13:41 AM

to pyth...@googlegroups.com

When you move to Py3000, which uses unicode,
you may well run into even more confusion.
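
For reference, this is how the same mix of text and console bytes behaves in Python 3 as eventually released. It is only a sketch: the GBK bytes stand in for console input under the assumption of a GBK console, and nothing here is taken from the posts themselves.

```python
text = "你好"                  # text, like Python 2's u'你好'
data = "啊".encode("gbk")      # bytes, like what raw_input returned in Python 2

try:
    combined = text + data     # Python 3 performs no implicit decoding here
except TypeError:
    # The encoding has to be stated explicitly.
    combined = text + data.decode("gbk")

print(len(combined))
```

The concatenation that raised UnicodeDecodeError in p1.py becomes a TypeError, so the explicit decode is required on every platform rather than only where the default encoding happens to be wrong.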

Jiahua Huang

Nov 8, 2007, 1:19:20 AM

to pyth...@googlegroups.com

To add: the Python environment here is UTF-8

huahua@huahua:dirty$ locale
LANG=zh_CN.UTF-8
LANGUAGE=zh_CN:zh
LC_CTYPE=zh_CN.UTF-8
LC_NUMERIC="zh_CN.UTF-8"
LC_TIME="zh_CN.UTF-8"
LC_COLLATE="zh_CN.UTF-8"
LC_MONETARY="zh_CN.UTF-8"
LC_MESSAGES="zh_CN.UTF-8"
LC_PAPER="zh_CN.UTF-8"
LC_NAME="zh_CN.UTF-8"
LC_ADDRESS="zh_CN.UTF-8"
LC_TELEPHONE="zh_CN.UTF-8"
LC_MEASUREMENT="zh_CN.UTF-8"
LC_IDENTIFICATION="zh_CN.UTF-8"
LC_ALL=

For Python's system encoding, see below

huahua@huahua:dirty$ python
Python 2.5.1 (r251:54863, May 2 2007, 16:56:35)
[GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> str1='这个是中文'
>>> print str1
这个是中文
>>> str1
'\xe8\xbf\x99\xe4\xb8\xaa\xe6\x98\xaf\xe4\xb8\xad\xe6\x96\x87'
>>> # Python 2's default system encoding is ascii
...
>>> unicode(str1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 0: ordinal not in range(128)
>>> # once the system encoding is changed to utf8, much more can be used directly
...
>>> import sys
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('utf8')
>>> unicode(str1)
u'\u8fd9\u4e2a\u662f\u4e2d\u6587'
>>> print unicode(str1)
这个是中文
>>>
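
The session above can be restated without touching the process-wide default. This is a sketch in Python 3 syntax, reusing the byte string printed in the session; the explicit `decode` call is the swapped-in technique here, not what the post itself does.

```python
# The bytes shown by the interpreter for str1 in the session above.
str1 = b'\xe8\xbf\x99\xe4\xb8\xaa\xe6\x98\xaf\xe4\xb8\xad\xe6\x96\x87'

# Decoding as ASCII fails on the first byte, as in the traceback.
try:
    str1.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Naming the encoding at the call site gives the same result as
# changing the default encoding, but only for this one value.
text = str1.decode("utf-8")

print(len(str1), len(text))
print(ascii(text))
```

The difference between the two approaches is scope: sys.setdefaultencoding changes how every implicit conversion in the process behaves, while an explicit decode states the encoding of one particular input.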

Jiahua Huang

Nov 8, 2007, 1:35:05 AM

to pyth...@googlegroups.com

The code is as above;

only the header is changed to

Screenshot-p2.py + (-tmp-dirty) (2 of 3) - VIM.png

Oyster

Nov 8, 2007, 1:35:23 AM

to python-cn:CPyUG

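It is not that I am insisting on what you call a cumbersome approach; rather, you are ignoring the concrete examples here, as well as this principle: "proving that a claim holds takes many examples, and even ten thousand confirming examples do not guarantee that the ten-thousand-and-first will hold; conversely, disproving a claim takes only a single counterexample."

Back to the topic: perhaps this comes from differences in how Windows and Linux handle character encodings. I do not want to start an operating system war. I just want my program to work properly on Windows, and if someone finds my script useful and wants to run it on their Linux machine, I do not want to have to modify the source at that point. I do not know Linux, but I believe this cumbersome approach should behave the same on both platforms.

Let's agree to disagree; I think the world still has Windows in addition to Linux.

I'll deal with python3k once it is officially released.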

Oyster

Nov 8, 2007, 1:37:09 AM

to python-cn:CPyUG

I forgot to say: handling strings this way really is a hassle, haha.

Jiahua Huang

Nov 8, 2007, 1:44:45 AM

to pyth...@googlegroups.com

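No, the only thing you are overlooking
is the encoding of your CMD.
Windows' internal encoding is UTF-16,
but your CMD's default encoding is not.

It is not that I am ignoring your system;
it is just that, since you know about sys.stdin.encoding,
why not look at sys.getdefaultencoding(),
or check the value of sys.stdin.encoding.

I have previously posted how to change cmd's encoding, and decent terminals that can replace cmd.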
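
The values mentioned here can be collected in one place before deciding how to decode input. This is a minimal sketch; the function name `encoding_report` is hypothetical, and the `getattr` calls only guard against a stream that carries no `encoding` attribute.

```python
import locale
import sys

def encoding_report():
    """Collect the encodings Python reports for this process."""
    return {
        "default": sys.getdefaultencoding(),
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
        "locale": locale.getpreferredencoding(False),
    }

for key, value in sorted(encoding_report().items()):
    print(key, "=", value)
```

Running it in both environments from this thread would show directly whether the console encoding and the default encoding agree, which is the point in dispute.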

On 07-11-8, Oyster<lepto....@gmail.com> wrote:
> It is not that I am insisting on what you call a cumbersome approach,

Oyster

Nov 8, 2007, 5:10:21 AM

to python-cn:CPyUG

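Python's encodings already make my head spin, so I would rather find a method that is cumbersome but works once and for all, so that I do not have to modify the #coding line at the top on different systems
^_^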

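On Nov 8, 2:44 PM, "Jiahua Huang" <jhuangjia...@gmail.com> wrote:
> No, the only thing you are overlooking
> is the encoding of your CMD.
> Windows' internal encoding is UTF-16,
> but your CMD's default encoding is not.
>
> It is not that I am ignoring your system;
> it is just that, since you know about sys.stdin.encoding,
> why not look at sys.getdefaultencoding(),
> or check the value of sys.stdin.encoding.
>
> I have previously posted how to change cmd's encoding, and decent terminals that can replace cmd.
>
> On 07-11-8, Oyster<lepto.pyt...@gmail.com> wrote:
>
> > It is not that I am insisting on what you call a cumbersome approach,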
