Reading a UTF-8 encoded Chinese file with Python 3.1

adream307

Apr 11, 2011, 2:25:49 AM

to python-cn

The code is very simple:

# -*- coding: utf-8 -*-
fp=open('中文.txt',encoding='utf8')
for lines in fp.readlines():
    print(lines)

The contents of 中文.txt are as follows; the file is UTF-8 encoded:

中文
中文
中文

Running it produces this error:

UnicodeEncodeError: 'gbk' codec can't encode character '\ufeff' in position 0: illegal multibyte sequence

But running fp.readlines() on its own gives this result:

['\ufeff中文','中文','中文']

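What I'd like to know is whether, after reading the file in, I have to strip the first two bytes by hand.

Judging by the error message, print tries to encode '\ufeff中文' as 'gbk', and that is what triggers the error. Is it possible to make print use UTF-8 instead?

My environment is Windows XP + Python 3.1.

Thanks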

依云

Apr 11, 2011, 2:41:44 AM

to pyth...@googlegroups.com

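Do you want Python to print a pile of garbage? Python decides how to
encode its output based on the system. For Python 3, all I know is that
on Linux, setting the LANG variable before starting Python works. But
that isn't really your problem, is it?

The BOM on Windows is rather annoying, and Python doesn't seem to be able
to handle it automatically. You could try checking the first character
by hand.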
--
Best regards,
lilydjwg

Linux Vim Python My blog
http://bit.ly/lilydjwg or http://goo.gl/y4Gsy

limodou

Apr 11, 2011, 3:08:57 AM

to pyth...@googlegroups.com

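As 依云 said, you probably created the Chinese file with something like Notepad, which adds a BOM automatically, and the open function does not strip it for you. You need to check for it in your program, because in many cases the BOM will not be there.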
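
A minimal sketch of the in-program check described above, assuming the file is UTF-8 and may or may not start with a BOM. The helper names `strip_bom` and `read_lines` are made up for illustration and do not come from the thread.

```python
BOM = "\ufeff"  # how a UTF-8 BOM shows up after decoding with 'utf8'


def strip_bom(text):
    """Return text without a leading BOM character, if one is present."""
    if text.startswith(BOM):
        return text[len(BOM):]
    return text


def read_lines(path):
    """Read a UTF-8 file and return its lines without newlines or BOM."""
    with open(path, encoding="utf8") as fp:
        lines = fp.read().splitlines()
    if lines:
        # Only the very first line of the file can carry the BOM.
        lines[0] = strip_bom(lines[0])
    return lines
```

With this, `for line in read_lines('中文.txt'): print(line)` no longer hands '\ufeff' to print. Because the check is conditional, the same code should work for files saved with or without a BOM.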

--
I like python!
UliPad <<The Python Editor>>: http://code.google.com/p/ulipad/
UliWeb <<simple web framework>>: http://uliwebproject.appspot.com
My Blog: http://hi.baidu.com/limodou

pansz

Apr 11, 2011, 3:31:55 AM

to pyth...@googlegroups.com

2011/4/11 adream307 <adre...@gmail.com>:

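> Judging by the error message, print tries to encode '\ufeff中文' as 'gbk', and that is what triggers the error. Is it possible to make print use UTF-8 instead?
> My environment is Windows XP + Python 3.1.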

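The print function has to convert text into the target encoding of the tty it
writes to. If you run it in cmd on Chinese Windows, the target encoding is gbk,
so to change print's behaviour you have to change the encoding Windows uses.

To change the Windows encoding temporarily, use the chcp command: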
C:\Windows\system32>chcp
Active code page: 1251

C:\Windows\system32>chcp 65001
Active code page: 65001 <------- UTF-8

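To change the Windows encoding permanently, go to the Control Panel.

Of course, a simple solution that does not require changing the Windows encoding
is to find a tty that supports utf-8 and run Python inside it. For example, the
cygwin command line can be set to utf-8 directly through LANG.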
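
If the goal is only to stop print from raising, another option is to keep the console's own encoding but replace the characters it cannot represent. The sketch below assumes `sys.stdout.buffer` is available, as it is for the standard text-mode stdout in Python 3. The helper name `tolerant_writer` is hypothetical.

```python
import io


def tolerant_writer(binary_stream, encoding, errors="replace"):
    """Wrap a binary stream in a text layer that substitutes '?' for
    characters the encoding cannot represent, instead of raising
    UnicodeEncodeError."""
    return io.TextIOWrapper(binary_stream, encoding=encoding,
                            errors=errors, line_buffering=True)
```

It would be installed with `import sys` followed by `sys.stdout = tolerant_writer(sys.stdout.buffer, sys.stdout.encoding)`. On a gbk console, the '\ufeff' character should then come out as '?' while the Chinese text is still printed. This hides the BOM rather than removing it, so stripping it at read time is still the cleaner fix.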

adream

Apr 11, 2011, 6:33:18 AM

to pyth...@googlegroups.com

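> The BOM on Windows is rather annoying, and Python doesn't seem to be able
> to handle it automatically. You could try checking the first character
> by hand.

Thanks for the tip. When I use a UTF-8 file without a BOM, the program runs fine.

Thanks again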

adream

Apr 11, 2011, 6:38:04 AM

to pyth...@googlegroups.com

> To change the Windows encoding temporarily, use the chcp command:
> C:\Windows\system32>chcp
> Active code page: 1251
>
> C:\Windows\system32>chcp 65001
> Active code page: 65001 <------- UTF-8

I tried your method. After switching with chcp 65001, typing python gives this error:

Fatal Python error: Py_Initialize: can't initialize sys standard streams
LookupError: unknown encoding: cp65001
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information

Thanks

ablo...@gmail.com

Apr 12, 2011, 1:18:45 AM

to pyth...@googlegroups.com

I recall that Python has some way of handling the BOM automatically, but I've forgotten what it is. You could check the manual.
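
The mechanism being recalled here is probably the `utf-8-sig` codec, which skips a leading BOM when decoding if one is present and otherwise behaves like plain UTF-8. A small sketch, with `read_lines_sig` as a made-up helper name:

```python
def read_lines_sig(path):
    """Read a UTF-8 text file, transparently dropping a leading BOM."""
    with open(path, encoding="utf-8-sig") as fp:
        return fp.read().splitlines()
```

In the original script this amounts to changing `encoding='utf8'` to `encoding='utf-8-sig'` in the call to open.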

--
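From: python-cn`CPyUG` Chinese Python User Group (Chinese-language Python technical mailing list)
Post: pyth...@googlegroups.com
Unsubscribe: python-cn+...@googlegroups.com (send a blank message to this address to unsubscribe)
Details: http://code.google.com/p/cpyug/wiki/PythonCn
Important: understand the list! Ask smart questions! http://wiki.woodpecker.org.cn/moin/AskForHelp
Strongly recommended reading: How to Report Bugs Effectively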
http://www.chiark.greenend.org.uk/%7Esgtatham/bugs-cn.html

--
http://abloz.com My technical blog

pansz

Apr 12, 2011, 1:32:55 AM

to pyth...@googlegroups.com

2011/4/11 adream <adre...@gmail.com>:

> I tried your method. After switching with chcp 65001, typing python gives this error:
> Fatal Python error: Py_Initialize: can't initialize sys standard streams
> LookupError: unknown encoding: cp65001
> This application has requested the Runtime to terminate it in an unusual
> way.
> Please contact the application's support team for more information
> Thanks

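This problem comes from a mismatch:
1. Windows can only express encodings as code pages, so utf-8 is identified as cp65001. You cannot set the Windows encoding to utf-8 directly; you can only set it to some code page (cp).
2. Python only recognizes the name "utf-8" and does not automatically treat cp65001 as utf-8.

To solve this, Python needs to be told that "CP65001 is UTF-8". That can be done
by adding a few lines of Python code. There are some solutions to this online; have a look if you are interested.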
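
One shape the "few lines of Python" could take is a codec search function registered through `codecs.register`, sketched below. The helper name `make_alias_search` is hypothetical. Note the assumption involved: these lines have to run before anything looks up the alias, and the error quoted above occurs while the interpreter is still starting, so on Python 3.1 this may not be early enough by itself. Setting the PYTHONIOENCODING environment variable before launching Python is another thing to try.

```python
import codecs


def make_alias_search(alias, target):
    """Build a codec search function that maps one extra name onto an
    encoding Python already knows."""
    alias = alias.lower()

    def search(name):
        # codecs passes in the normalized, lowercased encoding name.
        if name.lower() == alias:
            return codecs.lookup(target)
        return None  # let the other registered search functions try

    return search


# Teach Python that the Windows code page name means UTF-8.
codecs.register(make_alias_search("cp65001", "utf-8"))
```

Returning None for every other name keeps the normal lookup behaviour for all remaining encodings.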

CHEN Xing

Apr 12, 2011, 3:20:09 AM

to pyth...@googlegroups.com, adream307

I suggest
import codecs
It can solve most encoding problems when reading files.
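
As one illustration of what the `codecs` module offers for this thread's problem, it exposes the UTF-8 BOM as the byte constant `codecs.BOM_UTF8`, so the check can be done on raw bytes before decoding. This is a sketch only; `decode_utf8` and `read_text` are hypothetical helper names.

```python
import codecs


def decode_utf8(raw):
    """Decode UTF-8 bytes, dropping a leading BOM if there is one."""
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    return raw.decode("utf-8")


def read_text(path):
    """Read a whole file in binary mode and decode it as UTF-8."""
    with open(path, "rb") as fp:
        return decode_utf8(fp.read())
```

Checking the bytes rather than the decoded string makes it explicit that the BOM is three bytes on disk (EF BB BF), even though it appears as the single character '\ufeff' after decoding.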

CHEN, Xing / 陈醒
