setEncoding other than utf8, ascii or base64


Néstor

Sep 14, 2010, 6:41:35 PM
to nodejs
Hi

Is it possible to set the encoding of an http response to something other than
utf8, ascii or base64? I'm having the following problem: I send a POST
request to a site that answers with an html page encoded in
windows-1252. I suppose it wouldn't be a problem to read the response as
ISO-8859-1, but I don't see a way of reading it without
garbled characters like "Colegio de Educaci�n".

I tried using node-iconv to see if that would help, but
there is an issue with that module on Mac OS.

http://github.com/bnoordhuis/node-iconv/issues/#issue/2

Thanks!

Ben Noordhuis

Sep 14, 2010, 10:11:15 PM
to nod...@googlegroups.com
I just pulled a patch from Juan Alvarez's fork that should fix the
issue if you compile with --libiconv=/usr/local (or wherever your
libiconv is installed). Could you give it another spin?

Néstor

Sep 15, 2010, 5:27:32 AM
to nodejs
It seems that building the library with the new option fixes the
missing symbols, but I'm having some issues using it.

As I said, my problem is that the html in the response is
in windows-1252 and I can't set that encoding, so I decided to set UTF8
and try to convert it back to ISO-8859-1 (or windows-1252). This is
the code:

request.on('response', function (response) {
  var responseBody = '';
  sys.debug('STATUS= ' + response.statusCode);
  sys.debug('HEADERS= ' + JSON.stringify(response.headers));
  response.setEncoding('utf-8');

  response.on('data', function(chunk){
    responseBody += chunk;
  });

  response.on('end', function(){
    if ( 200 == response.statusCode ) {
      var iconv = new Iconv('UTF-8', 'ISO-8859-1');
      var latinBuf = iconv.convert(responseBody);
      htmlemit.emit('done', latinBuf);
    }
  });
});

But I'm getting this error:

/Users/nlafon/Dropbox/code/njs/sacacoles.js:42
var latinBuf = iconv.convert(responseBody);
               ^
Error: EILSEQ, Illegal character sequence.
    at IncomingMessage.<anonymous> (/Users/nlafon/Dropbox/code/njs/sacacoles.js:42:34)
    at IncomingMessage.emit (events:43:20)
    at HTTPParser.onMessageComplete (http:107:23)
    at Client.ondata (http:882:22)
    at IOWatcher.callback (net:494:29)
    at node.js:764:9

I also tried creating a buffer first, var latinBuf = iconv.convert(new
Buffer(responseBody)); , but I got the same error.

Am I missing anything?

Thanks

Ben Noordhuis

Sep 15, 2010, 5:56:50 AM
to nod...@googlegroups.com
Assuming the response is valid Windows-1252, then you probably need to
call response.setEncoding('binary') and aggregate the chunks into a
buffer before calling Iconv.convert(). Note that encoding=binary means
your data callback will receive Buffer objects, not strings.
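
In code, roughly (untested sketch; it assumes the http request is already set up, Iconv is in scope, and your libiconv knows the WINDOWS-1252 charset name):

request.on('response', function (response) {
  var chunks = [], total = 0;

  response.setEncoding('binary');

  response.on('data', function (chunk) {
    // normalise to a Buffer whether the chunk arrives as a Buffer or a binary string
    var buf = Buffer.isBuffer(chunk) ? chunk : new Buffer(chunk, 'binary');
    chunks.push(buf);
    total += buf.length;
  });

  response.on('end', function () {
    // stitch the chunks into one contiguous buffer, then recode it in one go
    var body = new Buffer(total), offset = 0;
    chunks.forEach(function (buf) {
      buf.copy(body, offset, 0);
      offset += buf.length;
    });
    var iconv = new Iconv('WINDOWS-1252', 'UTF-8');
    var utf8 = iconv.convert(body);  // Buffer holding the recoded page
  });
});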

Néstor

Sep 15, 2010, 7:17:45 AM
to nodejs
The lack of encodings for the response is quite annoying; I'm just
learning javascript and nodejs and I'm stuck on the always-tedious
encoding stuff. Anyway, less complaining and more "working". I tried
what you said and I bet I'm missing something; I paste the new code
below (ignore the commented lines):

request.on('response', function (response) {
  //var responseBody = '';
  var resBuf = new Buffer(10485760);
  var resIdx = 0;
  sys.debug('STATUS= ' + response.statusCode);
  sys.debug('HEADERS= ' + JSON.stringify(response.headers));
  //response.setEncoding('utf8');
  response.setEncoding('binary');

  response.on('data', function(chunk){
    resIdx += resBuf.write(chunk, resIdx);
    //responseBody += chunk;
  });

  response.on('end', function(){
    if ( 200 == response.statusCode ) {
      sys.debug(resIdx);
      sys.debug(sys.inspect(resBuf.toString('binary', resIdx - 10240, resIdx)));
      var iconv = new Iconv('ISO-8859-1', 'UTF8');
      var latinBuf = iconv.convert(resBuf);
      htmlemit.emit('done', latinBuf);
      //htmlemit.emit('done', responseBody);
    }
  });
});

Using binary and having to allocate a buffer of a fixed size ahead of
time is not a very good thing... that is the reason I used a 10MB buffer. In
the on('end') callback I dump the last 10KB of that string; it seems
to contain some data (good data) and prints UTF characters like
"\u00c3\u00b3".

But after that I get an error from iconv:

/Users/nlafon/Dropbox/code/njs/sacacoles.js:48
var iconv = new Iconv('ISO-8859-1', 'UTF8');
            ^
Error: EINVAL, Conversion not supported.
    at IncomingMessage.<anonymous> (/Users/nlafon/Dropbox/code/njs/sacacoles.js:48:25)
    at IncomingMessage.emit (events:43:20)
    at HTTPParser.onMessageComplete (http:107:23)
    at Client.ondata (http:882:22)
    at IOWatcher.callback (net:494:29)
    at node.js:764:9


Thanks again.... I really hate (and don't understand) this encoding
"crap".

Ben Noordhuis

Sep 15, 2010, 10:21:08 AM
to nod...@googlegroups.com
> Using binary and having to allocate a buffer of fixed size ahead is
> not a very good thing... that is the reason I used a 10MB buffer. In

Agreed. I'll probably add a streaming API in the next release.

> /Users/nlafon/Dropbox/code/njs/sacacoles.js:48
> var iconv = new Iconv('ISO-8859-1', 'UTF8');
> ^
> Error: EINVAL, Conversion not supported.

Your libiconv does not know how to convert from latin1 to utf8. Make
sure the proper character sets are compiled in.

Néstor Lafón-Gracia

Sep 15, 2010, 11:00:27 AM
to nod...@googlegroups.com
Ummm, it seems that is the problem. I just tried the same code on linux (ubuntu 10.04) and didn't hit that problem there; I will see if I can rebuild the library later at home, but...

on linux the script starts using 95% CPU for the conversion and never finishes.



Ben Noordhuis

Sep 15, 2010, 11:21:49 AM
to nod...@googlegroups.com
> Ummm it seems that is the problem, I just tried the same code on linux
> (ubuntu 10.04) and I didn't hit that problem, I will see if I can rebuild
> the library later at home but...

I might just add GNU libiconv as a static dependency to avoid all this
craziness.

> on linux the script starts using 95% CPU for the conversion and never
> finishes.

I wager that the 10 MB buffer isn't full. You are passing it in whole,
so iconv is trying to recode uninitialized data.
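
Slicing it down to the bytes that were actually written should help, e.g. (adapting your snippet, untested):

response.on('end', function () {
  if (200 == response.statusCode) {
    // resIdx is how many bytes were written, so only convert that much
    var filled = resBuf.slice(0, resIdx);
    var iconv = new Iconv('ISO-8859-1', 'UTF-8');
    htmlemit.emit('done', iconv.convert(filled));
  }
});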

Néstor Lafón-Gracia

Sep 21, 2010, 8:41:39 AM
to nod...@googlegroups.com
Hi Ben

I do slice the buffer before converting... not sure why it hangs (95% cpu).

I saw that you changed the module to have the library statically linked; I will give it a try later on Linux and Mac OS, but I think that is not the issue.

Thanks



Néstor Lafón-Gracia

Sep 21, 2010, 9:24:37 AM
to nod...@googlegroups.com
I tried it on Linux, and with the new module I get the same error I was getting when I built my own library...

var iconv = new Iconv('ISO-8859-1', 'UTF8');
                        ^
Error: EINVAL, Conversion not supported.
    at IncomingMessage.<anonymous> (/home/lafonne/kk:37:25)
    at IncomingMessage.emit (events:43:20)
    at HTTPParser.onMessageComplete (http:107:23)
    at Client.ondata (http:901:22)
    at IOWatcher.callback (net:494:29)
    at node.js:765:9

I put the code I'm testing here (a different gist than before): http://gist.github.com/589675 What I want to do would be really simple if the page were encoded in utf-8, but that is not the case.

Please let me know if you are able to run that code... I'm considering dumping node.js for another server-side language just because of its lack of encoding support (I'm from Spain and ISO-8859-1 is widely used).

Thanks again!


Ben Noordhuis

Sep 21, 2010, 9:43:21 AM
to nod...@googlegroups.com
Hi Néstor,

I'll see what I can do. Could you perhaps do a `make distclean` and
post the output of `make install` and `nm iconv.node`?

Thanks!

Ben

Néstor Lafón-Gracia

Sep 21, 2010, 10:25:29 AM
to nod...@googlegroups.com
I grep'd the output, I suppose you are looking for these:

$ nm iconv.node | grep 8859
00023a00 r iso8859_10_2uni
00006851 t iso8859_10_mbtowc
00023ac0 r iso8859_10_page00
0000689c t iso8859_10_wctomb
0000691c t iso8859_11_mbtowc
00006971 t iso8859_11_wctomb
00023ba0 r iso8859_13_2uni
000069cb t iso8859_13_mbtowc
00023c60 r iso8859_13_page00
00023d40 r iso8859_13_page20
00006a16 t iso8859_13_wctomb
00023d60 r iso8859_14_2uni
00006aae t iso8859_14_mbtowc
00023e20 r iso8859_14_page00
00023e80 r iso8859_14_page01_0
00023ea0 r iso8859_14_page01_1
00023ec0 r iso8859_14_page1e_0
00023f48 r iso8859_14_page1e_1
00006af9 t iso8859_14_wctomb
00023f60 r iso8859_15_2uni
00006c0c t iso8859_15_mbtowc
00023fa0 r iso8859_15_page00
00023fc0 r iso8859_15_page01
00006c5d t iso8859_15_wctomb
00024000 r iso8859_16_2uni
00006d21 t iso8859_16_mbtowc
000240c0 r iso8859_16_page00
000241a0 r iso8859_16_page02
000241a8 r iso8859_16_page20
00006d6c t iso8859_16_wctomb
00005feb t iso8859_1_mbtowc
0000600a t iso8859_1_wctomb
00022ea0 r iso8859_2_2uni
0000602e t iso8859_2_mbtowc
00022f60 r iso8859_2_page00
00023040 r iso8859_2_page02
00006079 t iso8859_2_wctomb
00023060 r iso8859_3_2uni
00006111 t iso8859_3_mbtowc
00023120 r iso8859_3_page00
00023180 r iso8859_3_page01
000231f8 r iso8859_3_page02
00006174 t iso8859_3_wctomb
00023200 r iso8859_4_2uni
00006236 t iso8859_4_mbtowc
000232c0 r iso8859_4_page00
000233a0 r iso8859_4_page02
00006281 t iso8859_4_wctomb
000233c0 r iso8859_5_2uni
00006319 t iso8859_5_mbtowc
00023480 r iso8859_5_page00
000234a0 r iso8859_5_page04
00006364 t iso8859_5_wctomb
00023500 r iso8859_6_2uni
0000640b t iso8859_6_mbtowc
000235c0 r iso8859_6_page00
000235e0 r iso8859_6_page06
0000646e t iso8859_6_wctomb
00023640 r iso8859_7_2uni
00006506 t iso8859_7_mbtowc
00023700 r iso8859_7_page00
00023720 r iso8859_7_page03
00023778 r iso8859_7_page20
00006569 t iso8859_7_wctomb
000237a0 r iso8859_8_2uni
00006649 t iso8859_8_mbtowc
00023860 r iso8859_8_page00
000238c0 r iso8859_8_page05
000238e0 r iso8859_8_page20
000066ac t iso8859_8_wctomb
00023900 r iso8859_9_2uni
0000676e t iso8859_9_mbtowc
00023960 r iso8859_9_page00
000239a0 r iso8859_9_page01
000067b9 t iso8859_9_wctomb

I put the full listing in http://pastebin.com/ubU0LjQ0




Ben Noordhuis

Sep 21, 2010, 5:48:41 PM
to nod...@googlegroups.com
Thanks, Néstor. I found the cause: the target encoding should be
'UTF-8', not 'UTF8'. Pretty lame, I'll see if I can fix it.

Ben Noordhuis

Sep 21, 2010, 6:58:58 PM
to nod...@googlegroups.com
On Tue, Sep 21, 2010 at 23:48, Ben Noordhuis <in...@bnoordhuis.nl> wrote:
> Thanks, Néstor. I found the cause: the target encoding should be
> 'UTF-8', not 'UTF8'. Pretty lame, I'll see if I can fix it.

I've pushed a commit[1] that applies some heuristics to the
character encoding name, massaging it into something libiconv likes.

[1] http://github.com/bnoordhuis/node-iconv/commit/9012b03b441ced4349b8a7590369ec02ca9bac87
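
The heuristic is basically along these lines (a simplified illustration, not the literal code from the commit):

// 'utf8' / 'utf_8' / 'UTF 8' all become 'UTF-8' before being handed to libiconv
function fixupCharset(name) {
  return name.toUpperCase().replace(/^(UTF|UCS)[\s_-]*(\d+)/, '$1-$2');
}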

Néstor Lafón-Gracia

Sep 22, 2010, 7:46:20 AM
to nod...@googlegroups.com
Thanks, Ben. I also found a problem in my code: I was slicing a buffer but not assigning the result, so I was still using the whole buffer.

Anyway, I'm still not able to do what I want to do... I'm starting to think it is not related to iconv itself but more to the way I use it. I updated the code in http://gist.github.com/589675 to be simpler and easier to test (small output).

I tried to do something similar with files and it looks like it works... I pasted a run of my test here: http://gist.github.com/591543

I don't understand what I'm doing wrong (in this case the page is in ISO-8859-15 but the problem is the same): I get EducaciÃ³n when I want to see Educación.

Could you give it a run?

Thanks for all your time!

PS: Let me know if you want to take the discussion off the list; I think this is still useful for others but it is getting repetitive.


Ben Noordhuis

Sep 22, 2010, 3:05:05 PM
to nod...@googlegroups.com
Quick heads-up as I'm strapped for time right now, but yes, the issue
appears to be in your code. If you save the 404 page to disk and
execute the snippet below, the result is valid (and desired)
UTF-8.

var fs = require('fs'), Iconv = require('iconv').Iconv;
fs.readFile('404.html', function(ex, data) {
  var iv = new Iconv('iso-8859-15', 'utf-8');
  console.log(iv.convert(data).toString());
});

Sidney San Martín

Oct 2, 2010, 1:12:10 PM
to nod...@googlegroups.com
Not to bring this thread back from the deathbed, but I was just playing around with node-iconv. While streaming support would be awesome, a better way to deal with the buffers coming in from ondata might be to store them separately in an array, then copy them all to a new buffer at once (warning: untested):

request.on('response', function (response) {
  // no setEncoding(): the 'data' callback receives raw Buffer objects
  var responseBuffers = [];

  response.on('data', function (chunk) {
    responseBuffers.push(chunk);
  });

  response.on('end', function () {
    var totalLength = 0, index = 0, currentBuffer, concatBuffer, outputBuffer;
    var iconv = new Iconv('ISO-8859-1', 'UTF-8');

    // first pass: work out how big the final buffer has to be
    for (var i = 0; i < responseBuffers.length; i++) {
      totalLength += responseBuffers[i].length;
    }

    // second pass: copy every chunk into one contiguous buffer
    concatBuffer = new Buffer(totalLength);
    while ((currentBuffer = responseBuffers.shift())) {
      currentBuffer.copy(concatBuffer, index, 0);
      index += currentBuffer.length;
    }

    try {
      outputBuffer = iconv.convert(concatBuffer);
    } catch (e) {
      return errback('Trouble converting text: ' + e);
    }

    callback(outputBuffer.toString('utf8'));
  });
});

Ben Noordhuis

Oct 2, 2010, 8:10:21 PM
to nod...@googlegroups.com
On Sat, Oct 2, 2010 at 19:12, Sidney San Martín <s...@sidneysm.com> wrote:
> Not to bring this thread back from the deathbed, but I was just playing around with node-iconv, and while streaming support would be awesome, a better way to deal with the buffers coming in from ondata might be to store them separately in an array, then copy them all to a new buffer at once

Yes, this is the preferred method. Encoding the chunks one-by-one may
not work as expected with stateful or multi-byte character sets. This
is something of a trap door; I'll update the documentation.
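
A contrived example of what can go wrong: é is two bytes in UTF-8, and feeding iconv only the first byte leaves it with half a character.

var Iconv = require('iconv').Iconv;
var iconv = new Iconv('UTF-8', 'ISO-8859-1');
var eAcute = new Buffer('\u00e9', 'utf8');  // two bytes: 0xC3 0xA9

iconv.convert(eAcute);              // fine: one complete character
iconv.convert(eAcute.slice(0, 1));  // errors out: the character was cut in half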

Side note: I've added a buffertools.concat() method to
node-buffertools that lets you concatenate buffers (and strings) as a
one-liner.

buffertools.concat(a, b, 'foo', new Buffer('bar'));

I'll probably add something similar to iconv.convert():

// decode a+b+c into a single result buffer
r = iconv.convert(a, b, c);
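
So, assuming the response chunks have been collected as Buffers in an array (chunks) and iconv is set up as in the earlier snippets, the recode step shrinks to something like this sketch:

var buffertools = require('buffertools');

response.on('end', function () {
  // concat() takes any number of buffers/strings, so spread the array over it
  var body = buffertools.concat.apply(buffertools, chunks);
  var utf8 = iconv.convert(body);
});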

Jorge

Oct 2, 2010, 9:06:21 PM
to nod...@googlegroups.com

I've had an idea:

An analogy: when you are adding 15 + 17 you first do 5+7 that gives 2 and a carry of one so that the next sum becomes carry+1+1.

In the same manner, chunks that arrive ondata encoded in utf8 could generate a carry to be prepended to the next chunk on the next ondata.

That does not sound too difficult to do, and would help to avoid the need to buffer the chunks.

Maybe the `new Buffer()`s should be built with a few free bytes of padding prepended, whose only use would be to accommodate the carry -if any- from a previous chunk (the "0" of the buffer would have to be reset accordingly).

I might be perfectly wrong, though, very likely.
--
Jorge.

Jorge Chamorro

Oct 2, 2010, 9:16:31 PM
to nod...@googlegroups.com
On 03/10/2010, at 03:06, Jorge wrote:
>
> I've had an idea:
>
> An analogy: when you are adding 15 + 17 you first do 5+7 that gives 2 and a carry of one so that the next sum becomes carry+1+1.
>
> In the same manner, chunks that arrive ondata encoded in utf8 could generate a carry to be prepended to the next chunk on the next ondata.
>
> That does not sound too difficult to do, and would help to avoid the need to buffer the chunks.
>
> Maybe the `new Buffer()`s should be built with a few free bytes of padding prepended, whose only use would be to accommodate the carry -if any- from a previous chunk (the "0" of the buffer would have to be reset accordingly).
>
> I might be perfectly wrong, though, very likely.

BTW, in order to tell if there's a carry, one would just need to sniff the last few bytes of the buffer (4 at most), it seems to me, not the whole buffer.
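
Something like this, I mean (an untested sketch for plain UTF-8; utf8Carry is a made-up helper, chunks are assumed to arrive as Buffers, and buffertools.concat is the one from Ben's side note):

// How many trailing bytes form an incomplete UTF-8 sequence?
// Only the last 4 bytes ever need to be inspected.
function utf8Carry(buf) {
  var limit = Math.min(4, buf.length);
  for (var i = 1; i <= limit; i++) {
    var b = buf[buf.length - i];
    if ((b & 0x80) === 0) return 0;          // ASCII byte: nothing pending
    if ((b & 0xC0) === 0xC0) {               // lead byte of a multi-byte char
      var expected = (b & 0xE0) === 0xC0 ? 2 :
                     (b & 0xF0) === 0xE0 ? 3 : 4;
      return i < expected ? i : 0;           // pending if the tail is too short
    }
    // otherwise a continuation byte (10xxxxxx): keep walking backwards
  }
  return 0;
}

// ondata: chop the carry off this chunk and glue it onto the next one
var pending = null;
response.on('data', function (chunk) {
  if (pending) {
    chunk = buffertools.concat(pending, chunk);
    pending = null;
  }
  var carry = utf8Carry(chunk);
  if (carry) {
    pending = chunk.slice(chunk.length - carry);
    chunk = chunk.slice(0, chunk.length - carry);
  }
  // chunk now ends on a character boundary and can be converted on its own
});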
--
Jorge.

Ben Noordhuis

Oct 3, 2010, 3:14:22 PM
to nod...@googlegroups.com
This is more or less the idea for the streaming API. It's pretty
straightforward for single-byte and multi-byte character sets, but
less so for stateful encodings (the ISO-2022 dialects).

Jorge

Oct 4, 2010, 3:46:20 AM
to nod...@googlegroups.com

Just make the carry an object that includes the state as well... (?)
--
Jorge.

Ben Noordhuis

Oct 4, 2010, 6:37:04 AM
to nod...@googlegroups.com
On Mon, Oct 4, 2010 at 09:46, Jorge <jo...@jorgechamorro.com> wrote:
> Just make the carry an object that includes the state as well... (?)

Easy in theory, somewhat harder in practice. :-)
