[node.io] Garbled text when scraping GBK pages, looking for a solution

阿彪

May 15, 2011, 12:25:31 AM
to cnodejs
As the title says.

阿彪

May 15, 2011, 12:26:22 AM
to cnodejs
var nodeio = require('node.io');
var options = {timeout: 10, max: 5};

exports.job = new nodeio.Job(options, {
    input: './input.txt',
    run: function (keyword) {
        var self = this, results = [];
        this.getHtml('http://www.baidu.com/s?wd=' + encodeURIComponent(keyword), function (err, $) {
            //try{
            // Collect the text of every link in the related-searches block.
            $('#rs a').each(function (a) {
                results.push(a.fulltext);
            });
            self.emit(keyword + '\t' + results.join('\t'));
            //}catch(e){}
        });
    },
    output: './output.txt'
});

For example, when scraping a Baidu page like this, the text that comes back is garbled.


mashihua

May 15, 2011, 3:49:09 AM
to cno...@googlegroups.com
Most sites in China send charset=gb2312 in their response headers, but node.io decodes the body as utf8; see line 317 of https://github.com/chriso/node.io/blob/master/lib/node.io/request.js.
The callback should receive err, $, data, headers and response as its arguments, so you should be able to get a correctly decoded string with new Buffer(data).toString('gb2312');
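To make the mismatch concrete, here is a minimal sketch, assuming a recent Node.js where Buffer.from exists (it did not in 2011). The four bytes are the GBK encoding of "中文"; decoding them as utf8, which is roughly what a hard-coded setEncoding('utf8') amounts to, yields replacement characters instead of text.

```javascript
// GBK/GB2312 bytes for the two characters "中文" (0xD6D0, 0xCEC4).
const gbkBytes = Buffer.from([0xd6, 0xd0, 0xce, 0xc4]);

// These bytes are not valid utf8, so the decoder substitutes
// U+FFFD replacement characters for them.
const garbled = gbkBytes.toString('utf8');

console.log(JSON.stringify(garbled));
console.log(garbled.includes('\uFFFD'));
```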



MK2

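May 15, 2011, 4:22:20 AM
to cno...@googlegroups.com
Ouch. node.io is handling the response incorrectly: it should either let you set the character encoding or hand back the raw Buffer.
Otherwise both text pages that are not utf8-encoded and non-text resources (images, files and so on) will break.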



阿彪

May 15, 2011, 5:48:02 AM
to cnodejs
-------- code --------
data = new Buffer(data).toString('gb2312');

-------- error --------
buffer.js:366
    throw new Error('Unknown encoding');
    ^
Error: Unknown encoding
    at Buffer.toString (buffer.js:366:13)


toString does not support gb2312. Sadly, the official documentation quoted below lists only a handful of supported encodings.

Converting between Buffers and JavaScript string objects requires an
explicit encoding method. Here are the different string encodings;

'ascii' - for 7 bit ASCII data only. This encoding method is very
fast, and will strip the high bit if set.
'utf8' - Multi byte encoded Unicode characters. Many web pages and
other document formats use UTF-8.
'ucs2' - 2-bytes, little endian encoded Unicode characters. It can
encode only BMP(Basic Multilingual Plane, U+0000 - U+FFFF).
'base64' - Base64 string encoding.
'binary' - A way of encoding raw binary data into strings by using
only the first 8 bits of each character. This encoding method is
deprecated and should be avoided in favor of Buffer objects where
possible. This encoding will be removed in future versions of Node.
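
One way to check this up front, sketched under the assumption of a later Node.js release that has Buffer.isEncoding and Buffer.from (neither appears in the documentation quoted above): ask Buffer whether it knows the encoding before calling toString. The helper name safeToString is made up for illustration.

```javascript
// Hypothetical helper: decode only if Buffer itself knows the encoding,
// otherwise return null so the caller can fall back to a converter such as iconv.
function safeToString(buffer, encoding) {
  if (!Buffer.isEncoding(encoding)) {
    return null;
  }
  return buffer.toString(encoding);
}

const sample = Buffer.from('abc', 'ascii');
console.log(safeToString(sample, 'utf8'));
console.log(safeToString(sample, 'gb2312'));
```

The second call returns null instead of throwing, because gb2312 is not one of Buffer's built-in encodings.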

I also tried converting with iconv:
var iconv = require('iconv');
new iconv.Iconv('GBK', 'UTF-8//IGNORE').convert(data).toString()
but sadly the result is wrong in both directions, utf-8 -> gbk and gbk -> utf-8.
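
One possible reason the conversion fails, offered as a guess rather than a diagnosis: if data has already been decoded as utf8 by the time the callback sees it, the original GBK bytes are gone and no converter can bring them back. The sketch below assumes a recent Node.js (Buffer.from, Buffer.prototype.equals) and uses only built-in calls.

```javascript
// Original GBK bytes for "中文".
const original = Buffer.from([0xd6, 0xd0, 0xce, 0xc4]);

// Step 1: the bytes are (wrongly) decoded as utf8 into a string.
const asString = original.toString('utf8');

// Step 2: turning that string back into bytes does not restore the input,
// because the invalid bytes were replaced by U+FFFD during decoding.
const roundTripped = Buffer.from(asString, 'utf8');

console.log(original.length, roundTripped.length);
console.log(original.equals(roundTripped));
```

If this is what happens, the conversion has to run on the raw Buffer chunks, before any string decoding takes place.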



阿彪

May 15, 2011, 6:18:30 AM
to cnodejs
----- https://github.com/chriso/node.io/blob/master/lib/node.io/request.js -----
line: 317
response.setEncoding('utf8');


----- https://github.com/joyent/node/blob/master/lib/http.js -----
line: 243
IncomingMessage.prototype.setEncoding = function(encoding) {
  var StringDecoder = require('string_decoder').StringDecoder; // lazy load
  this._decoder = new StringDecoder(encoding);
};
line: 114
parser.onBody = function(b, start, len) {
  // TODO body encoding?
  var slice = b.slice(start, start + len);
  if (parser.incoming._decoder) {
    var string = parser.incoming._decoder.write(slice);
    if (string.length) parser.incoming.emit('data', string);
  } else {
    parser.incoming.emit('data', slice);
  }
};


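This call, response.setEncoding('utf8'), really should not be hard-coded; the http module accepts the encoding as a parameter.


Decoding here goes through string_decoder, but the official documentation does not cover the string_decoder module, perhaps because it is not yet stable.
There is also a comment in there, "// TODO body encoding?", haha, so maybe a future version will decode automatically based on the encoding declared in the HTTP response.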
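
For readers who have not met string_decoder, the sketch below (assuming a recent Node.js, where the module is documented) shows what the _decoder in the excerpt is for. A multi-byte character can be split across two 'data' chunks, and StringDecoder holds back the incomplete bytes until the rest arrives.

```javascript
const { StringDecoder } = require('string_decoder');

const decoder = new StringDecoder('utf8');

// "中" is E4 B8 AD in utf8; pretend the network split it across two chunks.
const first = decoder.write(Buffer.from([0xe4, 0xb8]));
const second = decoder.write(Buffer.from([0xad]));

console.log(JSON.stringify(first));  // the incomplete character is held back
console.log(JSON.stringify(second)); // emitted once the last byte arrives
```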




mashihua

May 15, 2011, 6:39:50 AM
to cno...@googlegroups.com
The source tree does have a string_decoder.js file in the lib directory.

阿彪

May 15, 2011, 6:40:20 AM
to cnodejs
Scraping non-utf8 Chinese pages with node.io really is painful right now. Such a great tool and I cannot use it as it is; I will have to hack it myself.


阿彪

May 15, 2011, 7:06:33 AM
to cnodejs
https://github.com/joyent/node/blob/master/lib/string_decoder.js
line: 33
  // If not utf8...
  if (this.encoding !== 'utf8') {
    return buffer.toString(this.encoding);
  }

string_decoder.js really does very little: for now it only has special handling for utf8. Whether other encodings can be supported depends on the buffer module.

Also, I just emailed the node.io author, Chris O'Hara. He made a change and exposed an option for this:
https://github.com/chriso/node.io/commit/86f3e1b414bc897b08cf283b7c0289ee354a917c

His reply:
> I've just pushed a commit which will allow you to set the request encoding. Keep in mind that NodeJS currently only
> supports 3 encodings, "utf8", "ascii" and "binary". If you're using something else, you'll have to use something
> like iconv bindings to handle the encoding.

> I haven't used any other encodings, so I can't really help you out here.


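However, Buffer does not support encodings such as gbk:

  // (excerpt from Buffer's toString, per the stack trace earlier in the thread)
  switch (encoding) {
    case 'hex':
      return this.hexSlice(start, end);

    case 'utf8':
    case 'utf-8':
      return this.utf8Slice(start, end);

    case 'ascii':
      return this.asciiSlice(start, end);

    case 'binary':
      return this.binarySlice(start, end);

    case 'base64':
      return this.base64Slice(start, end);

    case 'ucs2':
    case 'ucs-2':
      return this.ucs2Slice(start, end);

    default:
      throw new Error('Unknown encoding');
  }
};

So the conclusion is that even with the option the author added to node.io, gbk still cannot be decoded, because Node itself does not support it. I will have to find another way.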



MK2

May 15, 2011, 8:22:13 AM
to cnodejs
So Node cannot handle pages that are not utf8-encoded?

MK2

May 15, 2011, 9:26:49 AM
to cnodejs
@阿彪

I ran a test and iconv works fine:


var http = require('http');

var options = {
  host: 'www.baidu.com',
  port: 80,
  path: '/'
};

var Iconv = require('iconv').Iconv;
var gb2312_to_utf8_iconv = new Iconv('GB2312', 'UTF-8');

http.get(options, function(res) {
  console.log("Got response: " + res.statusCode);
  // No setEncoding call here, so each 'data' chunk is a raw Buffer.
  var buffers = [], size = 0;
  res.on('data', function(buffer) {
    buffers.push(buffer);
    size += buffer.length;
  });
  res.on('end', function() {
    // Join all chunks into one Buffer before converting, so a multi-byte
    // character split across two chunks is not corrupted.
    var buffer = new Buffer(size), pos = 0;
    for (var i = 0, len = buffers.length; i < len; i++) {
      buffers[i].copy(buffer, pos);
      pos += buffers[i].length;
    }
    var utf8_buffer = gb2312_to_utf8_iconv.convert(buffer);
    console.log(utf8_buffer.toString());
  });
}).on('error', function(e) {
  console.log("Got error: " + e.message);
});
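
On later Node.js versions the manual copy loop can be replaced by Buffer.concat, and a Node.js build that ships full ICU data can decode GBK through the built-in TextDecoder. This is a sketch under those assumptions, not what the code above does; decodeChunks is an illustrative name, and the chunks are hard-coded here instead of coming from a response.

```javascript
// Illustrative helper: join the collected chunks and decode them in one go.
// Decoding after joining avoids corrupting a character split across chunks.
function decodeChunks(chunks, charset) {
  const body = Buffer.concat(chunks);
  // TextDecoder throws a RangeError if this Node.js build lacks the charset.
  return new TextDecoder(charset).decode(body);
}

// GBK bytes for "中文", deliberately split in the middle of the first character.
const chunks = [Buffer.from([0xd6]), Buffer.from([0xd0, 0xce, 0xc4])];

console.log(decodeChunks(chunks, 'gbk'));
```

The iconv module used above remains the option when the runtime has no built-in support for the charset.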

MK2

May 15, 2011, 9:48:38 AM
to cno...@googlegroups.com
You can work out the page's encoding from res.headers['content-type'].

If res.headers['content-type'] is missing, you will have to parse the “<meta http-equiv="Content-Type" content="text/html; charset=xxxx"/>” tag in the HTML instead.
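
A minimal sketch of that lookup order, using plain string matching only. The helper names charsetFromContentType and detectCharset are illustrative, and the regular expression is a simplification that will not cover every way a page can declare its charset.

```javascript
// Illustrative helper: pull the charset out of a Content-Type value,
// e.g. "text/html; charset=gb2312". Returns null when none is declared.
function charsetFromContentType(value) {
  const match = /charset\s*=\s*["']?([\w-]+)/i.exec(value || '');
  return match ? match[1].toLowerCase() : null;
}

// Illustrative helper: prefer the HTTP header, then fall back to scanning
// the start of the HTML for a charset declaration in a <meta> tag.
function detectCharset(headers, htmlHead) {
  return charsetFromContentType(headers['content-type']) ||
         charsetFromContentType(htmlHead);
}

const meta = '<meta http-equiv="Content-Type" content="text/html; charset=GBK"/>';
console.log(detectCharset({ 'content-type': 'text/html; charset=gb2312' }, meta));
console.log(detectCharset({ 'content-type': 'text/html' }, meta));
console.log(detectCharset({}, '<html></html>'));
```

Because the meta tag itself is plain ASCII, the first bytes of the body can be read as ascii just to find the charset, and the whole body decoded properly afterwards.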

阿彪

May 15, 2011, 11:22:42 AM
to cnodejs
You are using http directly, not node.io. With http this can be solved, but node.io wraps http, as shown here:

@see https://github.com/chriso/node.io/blob/master/lib/node.io/request.js
Line: 309 - 393

fengmk2

May 15, 2011, 2:28:54 PM
to cno...@googlegroups.com
Right, that is node.io. You could fork it and help the author fix this problem.


------------------------
Taobao EDP 苏千
Simple is better, MK2



阿彪

May 15, 2011, 10:30:47 PM
to cnodejs
How do I modify the node.io code locally and have the change take effect?

Are there any articles on this? A beginner's guide would help.


Shawn Meng

May 15, 2011, 11:01:52 PM
to cno...@googlegroups.com
Did you install node.io with npm?

--
Frontend Engineer @ imeigu
Sent with Sparrow

阿彪

May 17, 2011, 1:46:20 AM
to cnodejs
Yes, with npm, but I could not find node.io's JS source. Was it compiled?

Shawn Meng

May 17, 2011, 1:56:03 AM
to cno...@googlegroups.com
Look in the .npm directory under npm's install directory.
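
Another way to find where a module's source lives, sketched with Node's built-in require.resolve, which returns the path of the file that require would load. locateModule is an illustrative name, and the call only finds node.io if it is installed where the script can see it.

```javascript
// Illustrative helper: return the path of the file that require(name)
// would load, or null if the module cannot be found from this script.
function locateModule(name) {
  try {
    return require.resolve(name);
  } catch (err) {
    return null;
  }
}

console.log(locateModule('node.io'));
```

Editing the files under the printed path changes what the next require('node.io') loads.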

--
Frontend Engineer @ imeigu
Sent with Sparrow
