How do you count words in mixed Chinese and English text?

batfree

May 6, 2007, 11:34:35 AM

to pyth...@googlegroups.com

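Given a Unicode string that contains English as well as Chinese, Japanese, Korean and so on, how can I count its words?
That is, each English word counts as one word, each CJK character counts as one word, and each punctuation
mark counts as one word.
What approach do you all use for this?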

batfree

May 6, 2007, 3:30:08 PM

to pyth...@googlegroups.com

batfree wrote:

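I fiddled with this for a whole evening and finally got it down to a few lines. I'm not sure whether it has problems, so I'd like to discuss it with everyone.
import re
# any single character from U+1100 to U+FFFD (CJK text and full-width punctuation)
cjkReg = re.compile(u'[\u1100-\uFFFD]')
def countWords(inputString):  # wrapper name is illustrative
    # replace each CJK character with the stand-alone word "a"
    trimedCJK = cjkReg.sub(' a ', inputString)
    # every whitespace-separated token now counts as one word
    return len(trimedCJK.split())
I ran some tests and the totals match the word count in Word, but I don't know whether there are still problems.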
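
To make the behaviour of the substitution approach concrete, here is a small Python 3 sketch run on a few samples. The function name `count_words` and the sample strings are made up for illustration. Note that ASCII punctuation stays attached to the neighbouring English word, while full-width punctuation such as 。 falls inside the U+1100..U+FFFD range and is counted on its own.

```python
import re

# One character anywhere in U+1100..U+FFFD (the range used in the post above).
CJK_CHAR = re.compile('[\u1100-\uFFFD]')

def count_words(text):
    """Isolate every character in the range as its own token, then count tokens."""
    spaced = CJK_CHAR.sub(' a ', text)   # each matched character becomes the word "a"
    return len(spaced.split())           # split() collapses runs of whitespace

print(count_words("hello world"))      # 2
print(count_words("你好世界"))          # 4
print(count_words("Python 很好用。"))   # 5: "Python" + 3 characters + full-width period
print(count_words("Hello, world!"))    # 2: ASCII punctuation is not counted separately
```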

simohayha

May 7, 2007, 9:07:13 PM

to python.cn

Heh, I don't see any problem with it.

Zoom.Quiet

May 7, 2007, 10:16:35 PM

to pyth...@googlegroups.com, python-chinese list, pyth...@googlegroups.com, cpug-ea...@googlegroups.com

Added to the MicroProj collection! Thanks for sharing!
http://wiki.woodpecker.org.cn/moin/MicroProj/2007-05-08


batfree

May 8, 2007, 10:59:22 AM

to python-...@lists.python.cn

fdu.x...@gmail.com wrote:

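> It seems that in GBK a Chinese character takes two bytes and an English character takes one,
> so you can simply compare the length of the GBK-encoded string with the length of the Unicode string.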
>
> >>> s = 'abc你好你好def你好你好'
> >>> len(s)
> 22
> >>> u = s.decode('gbk')
> >>> len(u)
> 14
> >>> length_a = len(u) # total number of characters
> >>> length_c = len(s) - len(u) # number of Chinese characters
> >>> length_e = 2*len(u) - len(s) # number of English characters
> >>> length_a, length_c, length_e
> (14, 8, 6)
> >>> a = u.encode('gbk')
> >>> a
> 'abc\xc4\xe3\xba\xc3\xc4\xe3\xba\xc3def\xc4\xe3\xba\xc3\xc4\xe3\xba\xc3'
> >>> len(a)
> 22
>
>
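
In Python 3, where `str` is already Unicode, the same byte-length arithmetic can be sketched as below. `gbk_counts` is a hypothetical helper name, and the sketch assumes every character in the input can be encoded in GBK. Like the session quoted above, it counts characters, not words.

```python
def gbk_counts(text):
    """Return (total, double_byte, single_byte) character counts for `text`.

    Relies on GBK using one byte for ASCII and two bytes for everything else.
    Raises UnicodeEncodeError if a character has no GBK encoding.
    """
    encoded = text.encode('gbk')
    total = len(text)                    # number of Unicode characters
    double_byte = len(encoded) - total   # each two-byte character adds one extra byte
    single_byte = total - double_byte
    return total, double_byte, single_byte

print(gbk_counts('abc你好你好def你好你好'))  # (14, 8, 6)
```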
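I need to count English words, not characters, and the text may also contain French, German and so on.
I looked up the Unicode tables and solved it by substitution, but there may well be a better solution, so
I'm raising it here for discussion to see whether there is a more efficient and accurate method.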
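
One possible alternative, offered only as a sketch and not checked against Word: tokenize directly with `re.findall` instead of substituting. This version also counts ASCII punctuation marks separately, as the original question asks, and treats accented Latin letters (French, German) as part of a word. The name `count_tokens` is hypothetical, and the U+1100..U+FFFD range is taken from the thread as-is; it also covers some non-CJK blocks (for example Latin Extended Additional), so words using those letters would be split.

```python
import re
import unicodedata

# A token is one of:
#   1. a single character in U+1100..U+FFFD,
#   2. a run of word characters outside that range (letters, digits, underscore),
#   3. a single remaining non-space character (ASCII punctuation, symbols).
TOKEN = re.compile(r'[\u1100-\uFFFD]|[^\W\u1100-\uFFFD]+|[^\w\s]')

def count_tokens(text):
    """Count tokens in `text` under the three rules above."""
    # NFC composes "e" + combining accent into one letter so it stays inside the word.
    text = unicodedata.normalize('NFC', text)
    return len(TOKEN.findall(text))

print(count_tokens("Hello, world!"))            # 4
print(count_tokens("Python 很好用。"))           # 5
print(count_tokens("Stra\u00dfe \u00fcber"))    # 2
```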

