setEncoding other than utf8, ascii or base64


Néstor

Sep 14, 2010, 6:41:35 PM
to nodejs
Hi

Is it possible to set the encoding of an http response to something other than
utf8, ascii or base64? I'm having the following problem: I send a POST
request to a site that answers with an html page encoded in
windows-1252. I suppose it wouldn't be a problem to read the response as
ISO-8859-1, but I don't see a way of reading it without
garbled characters like "Colegio de Educaci�n".

I tried using node-iconv to see if that would help, but
there is an issue with that module on Mac OS.

http://github.com/bnoordhuis/node-iconv/issues/#issue/2

Thanks!

Ben Noordhuis

Sep 14, 2010, 10:11:15 PM
to nod...@googlegroups.com
I just pulled a patch from Juan Alvarez's fork that should fix the
issue if you compile with --libiconv=/usr/local (or wherever your
libiconv is installed). Could you give it another spin?

Néstor

Sep 15, 2010, 5:27:32 AM
to nodejs
It seems that building the library with the new option fixes the
missing symbols, but I'm having some issues using it.

As I said, my problem is that the html in the response is
in windows-1252 and I can't set that encoding, so I decided to set UTF8
and try to convert it back to ISO-8859-1 (or windows-1252). This is
the code:

request.on('response', function (response) {
  var responseBody = '';
  sys.debug('STATUS= ' + response.statusCode);
  sys.debug('HEADERS= ' + JSON.stringify(response.headers));
  response.setEncoding('utf-8');

  response.on('data', function(chunk){
    responseBody += chunk;
  });

  response.on('end', function(){
    if ( 200 == response.statusCode ) {
      var iconv = new Iconv('UTF-8', 'ISO-8859-1');
      var latinBuf = iconv.convert(responseBody);
      htmlemit.emit('done', latinBuf);
    }
  });
});

But I'm getting this error:

/Users/nlafon/Dropbox/code/njs/sacacoles.js:42
var latinBuf = iconv.convert(responseBody);
               ^
Error: EILSEQ, Illegal character sequence.
    at IncomingMessage.<anonymous> (/Users/nlafon/Dropbox/code/njs/sacacoles.js:42:34)
    at IncomingMessage.emit (events:43:20)
    at HTTPParser.onMessageComplete (http:107:23)
    at Client.ondata (http:882:22)
    at IOWatcher.callback (net:494:29)
    at node.js:764:9

I also tried creating a buffer first, var latinBuf = iconv.convert(new
Buffer(responseBody)); , but I got the same error.

Am I missing anything?

Thanks

Ben Noordhuis

Sep 15, 2010, 5:56:50 AM
to nod...@googlegroups.com
Assuming the response is valid Windows-1252, then you probably need to
call response.setEncoding('binary') and aggregate the chunks into a
buffer before calling Iconv.convert(). Note that encoding=binary means
your data callback will receive Buffer objects, not strings.
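
In code, roughly (untested sketch; it assumes the http request is already set up, Iconv is in scope, and your libiconv knows the WINDOWS-1252 charset name):

request.on('response', function (response) {
  var chunks = [], total = 0;

  response.setEncoding('binary');

  response.on('data', function (chunk) {
    // normalise to a Buffer whether the chunk arrives as a Buffer or a binary string
    var buf = Buffer.isBuffer(chunk) ? chunk : new Buffer(chunk, 'binary');
    chunks.push(buf);
    total += buf.length;
  });

  response.on('end', function () {
    // stitch the chunks into one contiguous buffer, then recode it in one go
    var body = new Buffer(total), offset = 0;
    chunks.forEach(function (buf) {
      buf.copy(body, offset, 0);
      offset += buf.length;
    });
    var iconv = new Iconv('WINDOWS-1252', 'UTF-8');
    var utf8 = iconv.convert(body);  // Buffer holding the recoded page
  });
});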

Néstor

Sep 15, 2010, 7:17:45 AM
to nodejs
The lack of encodings for the response is quite annoying; I'm just
learning javascript and nodejs and I'm stuck on the always-tedious
encoding stuff. Anyway, less complaining and more "working". I tried
what you said and I bet I'm missing something; I paste the new code
below (ignore the commented lines):

request.on('response', function (response) {
  //var responseBody = '';
  var resBuf = new Buffer(10485760);
  var resIdx = 0;
  sys.debug('STATUS= ' + response.statusCode);
  sys.debug('HEADERS= ' + JSON.stringify(response.headers));
  //response.setEncoding('utf8');
  response.setEncoding('binary');

  response.on('data', function(chunk){
    resIdx += resBuf.write(chunk, resIdx);
    //responseBody += chunk;
  });

  response.on('end', function(){
    if ( 200 == response.statusCode ) {
      sys.debug(resIdx);
      sys.debug(sys.inspect(resBuf.toString('binary', resIdx - 10240, resIdx)));
      var iconv = new Iconv('ISO-8859-1', 'UTF8');
      var latinBuf = iconv.convert(resBuf);
      htmlemit.emit('done', latinBuf);
      //htmlemit.emit('done', responseBody);
    }
  });
});

Using binary and having to allocate a buffer of a fixed size ahead of
time is not a very good thing... that is the reason I used a 10MB buffer. In
the on('end') callback I dump the last 10KB of that string; it seems
to contain some data (good data) and prints UTF characters like
"\u00c3\u00b3".

But after that I get an error from iconv:

/Users/nlafon/Dropbox/code/njs/sacacoles.js:48
var iconv = new Iconv('ISO-8859-1', 'UTF8');
            ^
Error: EINVAL, Conversion not supported.
    at IncomingMessage.<anonymous> (/Users/nlafon/Dropbox/code/njs/sacacoles.js:48:25)
    at IncomingMessage.emit (events:43:20)
    at HTTPParser.onMessageComplete (http:107:23)
    at Client.ondata (http:882:22)
    at IOWatcher.callback (net:494:29)
    at node.js:764:9


Thanks again.... I really hate (and don't understand) this encoding
"crap".

Ben Noordhuis

Sep 15, 2010, 10:21:08 AM
to nod...@googlegroups.com
> Using binary and having to allocate a buffer of fixed size ahead is
> not a very good thing... that is the reason I used a 10MB buffer. In

Agreed. I'll probably add a streaming API in the next release.

> /Users/nlafon/Dropbox/code/njs/sacacoles.js:48
> var iconv = new Iconv('ISO-8859-1', 'UTF8');
> ^
> Error: EINVAL, Conversion not supported.

Your libiconv does not know how to convert from latin1 to utf8. Make
sure the proper character sets are compiled in.

Néstor Lafón-Gracia

Sep 15, 2010, 11:00:27 AM
to nod...@googlegroups.com
Ummm, it seems that is the problem. I just tried the same code on linux (ubuntu 10.04) and didn't hit that problem there; I will see if I can rebuild the library later at home, but...

on linux the script starts using 95% CPU for the conversion and never finishes.



Ben Noordhuis

Sep 15, 2010, 11:21:49 AM
to nod...@googlegroups.com
> Ummm it seems that is the problem, I just tried the same code on linux
> (ubuntu 10.04) and I didn't hit that problem, I will see if I can rebuild
> the library later at home but...

I might just add GNU libiconv as a static dependency to avoid all this
craziness.

> on linux the script starts using 95% CPU for the conversion and never
> finishes.

I wager that the 10 MB buffer isn't full. You are passing it in whole,
so iconv is trying to recode uninitialized data.
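
Slicing it down to the bytes that were actually written should help, e.g. (adapting your snippet, untested):

response.on('end', function () {
  if (200 == response.statusCode) {
    // resIdx is how many bytes were written, so only convert that much
    var filled = resBuf.slice(0, resIdx);
    var iconv = new Iconv('ISO-8859-1', 'UTF-8');
    htmlemit.emit('done', iconv.convert(filled));
  }
});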

Néstor Lafón-Gracia

Sep 21, 2010, 8:41:39 AM
to nod...@googlegroups.com
Hi Ben

I do slice the buffer before converting... not sure why it hangs (95% cpu).

I saw that you changed the module to have the library statically linked; I will give it a try later on Linux and Mac OS, but I think that is not the issue.

Thanks



Néstor Lafón-Gracia

Sep 21, 2010, 9:24:37 AM
to nod...@googlegroups.com
I tried it on Linux, and with the new module I get the same error I was getting when I built my own library...

var iconv = new Iconv('ISO-8859-1', 'UTF8');
                        ^
Error: EINVAL, Conversion not supported.
    at IncomingMessage.<anonymous> (/home/lafonne/kk:37:25)
    at IncomingMessage.emit (events:43:20)
    at HTTPParser.onMessageComplete (http:107:23)
    at Client.ondata (http:901:22)
    at IOWatcher.callback (net:494:29)
    at node.js:765:9

I put the code I'm testing here (a different gist than before): http://gist.github.com/589675 What I want to do would be really simple if the page were encoded in utf-8, but that is not the case.

Please let me know if you are able to run that code... I'm considering dumping node.js for another server-side language just because of its lack of encoding support (I'm from Spain and ISO-8859-1 is widely used).

Thanks again!


Ben Noordhuis

Sep 21, 2010, 9:43:21 AM
to nod...@googlegroups.com
Hi Néstor,

I'll see what I can do. Could you perhaps do a `make distclean` and
post the output of `make install` and `nm iconv.node`?

Thanks!

Ben

Néstor Lafón-Gracia

Sep 21, 2010, 10:25:29 AM
to nod...@googlegroups.com
I grep'd the output, I suppose you are looking for these:

$ nm iconv.node | grep 8859
00023a00 r iso8859_10_2uni
00006851 t iso8859_10_mbtowc
00023ac0 r iso8859_10_page00
0000689c t iso8859_10_wctomb
0000691c t iso8859_11_mbtowc
00006971 t iso8859_11_wctomb
00023ba0 r iso8859_13_2uni
000069cb t iso8859_13_mbtowc
00023c60 r iso8859_13_page00
00023d40 r iso8859_13_page20
00006a16 t iso8859_13_wctomb
00023d60 r iso8859_14_2uni
00006aae t iso8859_14_mbtowc
00023e20 r iso8859_14_page00
00023e80 r iso8859_14_page01_0
00023ea0 r iso8859_14_page01_1
00023ec0 r iso8859_14_page1e_0
00023f48 r iso8859_14_page1e_1
00006af9 t iso8859_14_wctomb
00023f60 r iso8859_15_2uni
00006c0c t iso8859_15_mbtowc
00023fa0 r iso8859_15_page00
00023fc0 r iso8859_15_page01
00006c5d t iso8859_15_wctomb
00024000 r iso8859_16_2uni
00006d21 t iso8859_16_mbtowc
000240c0 r iso8859_16_page00
000241a0 r iso8859_16_page02
000241a8 r iso8859_16_page20
00006d6c t iso8859_16_wctomb
00005feb t iso8859_1_mbtowc
0000600a t iso8859_1_wctomb
00022ea0 r iso8859_2_2uni
0000602e t iso8859_2_mbtowc
00022f60 r iso8859_2_page00
00023040 r iso8859_2_page02
00006079 t iso8859_2_wctomb
00023060 r iso8859_3_2uni
00006111 t iso8859_3_mbtowc
00023120 r iso8859_3_page00
00023180 r iso8859_3_page01
000231f8 r iso8859_3_page02
00006174 t iso8859_3_wctomb
00023200 r iso8859_4_2uni
00006236 t iso8859_4_mbtowc
000232c0 r iso8859_4_page00
000233a0 r iso8859_4_page02
00006281 t iso8859_4_wctomb
000233c0 r iso8859_5_2uni
00006319 t iso8859_5_mbtowc
00023480 r iso8859_5_page00
000234a0 r iso8859_5_page04
00006364 t iso8859_5_wctomb
00023500 r iso8859_6_2uni
0000640b t iso8859_6_mbtowc
000235c0 r iso8859_6_page00
000235e0 r iso8859_6_page06
0000646e t iso8859_6_wctomb
00023640 r iso8859_7_2uni
00006506 t iso8859_7_mbtowc
00023700 r iso8859_7_page00
00023720 r iso8859_7_page03
00023778 r iso8859_7_page20
00006569 t iso8859_7_wctomb
000237a0 r iso8859_8_2uni
00006649 t iso8859_8_mbtowc
00023860 r iso8859_8_page00
000238c0 r iso8859_8_page05
000238e0 r iso8859_8_page20
000066ac t iso8859_8_wctomb
00023900 r iso8859_9_2uni
0000676e t iso8859_9_mbtowc
00023960 r iso8859_9_page00
000239a0 r iso8859_9_page01
000067b9 t iso8859_9_wctomb

I put the full listing in http://pastebin.com/ubU0LjQ0




Ben Noordhuis

Sep 21, 2010, 5:48:41 PM
to nod...@googlegroups.com
Thanks, Néstor. I found the cause: the target encoding should be
'UTF-8', not 'UTF8'. Pretty lame, I'll see if I can fix it.

Ben Noordhuis

Sep 21, 2010, 6:58:58 PM
to nod...@googlegroups.com
On Tue, Sep 21, 2010 at 23:48, Ben Noordhuis <in...@bnoordhuis.nl> wrote:
> Thanks, Néstor. I found the cause: the target encoding should be
> 'UTF-8', not 'UTF8'. Pretty lame, I'll see if I can fix it.

I've pushed a commit[1] that applies some heuristics to the
character encoding name, massaging it into something libiconv likes.

[1] http://github.com/bnoordhuis/node-iconv/commit/9012b03b441ced4349b8a7590369ec02ca9bac87
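
The heuristic is basically along these lines (a simplified illustration, not the literal code from the commit):

// 'utf8' / 'utf_8' / 'UTF 8' all become 'UTF-8' before being handed to libiconv
function fixupCharset(name) {
  return name.toUpperCase().replace(/^(UTF|UCS)[\s_-]*(\d+)/, '$1-$2');
}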

Néstor Lafón-Gracia

Sep 22, 2010, 7:46:20 AM
to nod...@googlegroups.com
Thanks, Ben. I also found a problem in my code: I was slicing a buffer but not assigning the result, so I was still using the whole buffer.

Anyway, I'm still not able to do what I want to do... I'm starting to think it is not related to iconv itself but more to the way I use it. I updated the code in http://gist.github.com/589675 to be simpler and easier to test (small output).

I tried to do something similar with files and it looks like it works... I pasted a run of my test here: http://gist.github.com/591543

I don't understand what I'm doing wrong (in this case the page is in ISO-8859-15 but the problem is the same): I get EducaciÃ³n when I want to see Educación.

Could you give it a run?

Thanks for all your time!

PS: Let me know if you want to take the discussion off the list; I think this is still useful for others but it is getting repetitive.


Ben Noordhuis

Sep 22, 2010, 3:05:05 PM
to nod...@googlegroups.com
Quick heads-up as I'm strapped for time right now, but yes, the issue
appears to be in your code. If you save the 404 page to disk and
execute the snippet below, the result is valid (and desired)
UTF-8.

var fs = require('fs'), Iconv = require('iconv').Iconv;
fs.readFile('404.html', function(ex, data) {
  var iv = new Iconv('iso-8859-15', 'utf-8');
  console.log(iv.convert(data).toString());
});

Sidney San Martín

Oct 2, 2010, 1:12:10 PM
to nod...@googlegroups.com
Not to bring this thread back from the deathbed, but I was just playing around with node-iconv. While streaming support would be awesome, a better way to deal with the buffers coming in from ondata might be to store them separately in an array, then copy them all to a new buffer at once (warning: untested):

request.on('response', function (response) {
  // no setEncoding(): the 'data' callback receives raw Buffer objects
  var responseBuffers = [];

  response.on('data', function (chunk) {
    responseBuffers.push(chunk);
  });

  response.on('end', function () {
    var totalLength = 0, index = 0, currentBuffer, concatBuffer, outputBuffer;
    var iconv = new Iconv('ISO-8859-1', 'UTF-8');

    // first pass: work out how big the final buffer has to be
    for (var i = 0; i < responseBuffers.length; i++) {
      totalLength += responseBuffers[i].length;
    }

    // second pass: copy every chunk into one contiguous buffer
    concatBuffer = new Buffer(totalLength);
    while ((currentBuffer = responseBuffers.shift())) {
      currentBuffer.copy(concatBuffer, index, 0);
      index += currentBuffer.length;
    }

    try {
      outputBuffer = iconv.convert(concatBuffer);
    } catch (e) {
      return errback('Trouble converting text: ' + e);
    }

    callback(outputBuffer.toString('utf8'));
  });
});

Ben Noordhuis

Oct 2, 2010, 8:10:21 PM
to nod...@googlegroups.com
On Sat, Oct 2, 2010 at 19:12, Sidney San Martín <s...@sidneysm.com> wrote:
> Not to bring this thread back from the deathbed, but I was just playing around with node-iconv, and while streaming support would be awesome, a better way to deal with the buffers coming in from ondata might be to store them separately in an array, then copy them all to a new buffer at once

Yes, this is the preferred method. Encoding the chunks one-by-one may
not work as expected with stateful or multi-byte character sets. This
is something of a trap door; I'll update the documentation.
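
A contrived example of what can go wrong: é is two bytes in UTF-8, and feeding iconv only the first byte leaves it with half a character.

var Iconv = require('iconv').Iconv;
var iconv = new Iconv('UTF-8', 'ISO-8859-1');
var eAcute = new Buffer('\u00e9', 'utf8');  // two bytes: 0xC3 0xA9

iconv.convert(eAcute);              // fine: one complete character
iconv.convert(eAcute.slice(0, 1));  // errors out: the character was cut in half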

Side note: I've added a buffertools.concat() method to
node-buffertools that lets you concatenate buffers (and strings) as a
one-liner.

buffertools.concat(a, b, 'foo', new Buffer('bar'));

I'll probably add something similar to iconv.convert():

// decode a+b+c into a single result buffer
r = iconv.convert(a, b, c);
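
So, assuming the response chunks have been collected as Buffers in an array (chunks) and iconv is set up as in the earlier snippets, the recode step shrinks to something like this sketch:

var buffertools = require('buffertools');

response.on('end', function () {
  // concat() takes any number of buffers/strings, so spread the array over it
  var body = buffertools.concat.apply(buffertools, chunks);
  var utf8 = iconv.convert(body);
});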

Jorge

Oct 2, 2010, 9:06:21 PM
to nod...@googlegroups.com

I've had an idea:

An analogy: when you are adding 15 + 17 you first do 5+7 that gives 2 and a carry of one so that the next sum becomes carry+1+1.

In the same manner, chunks that arrive ondata encoded in utf8 could generate a carry to be prepended to the next chunk on the next ondata.

That does not sound too difficult to do, and would help to avoid the need to buffer the chunks.

Maybe the `new Buffer()`s should be built with a few free bytes of padding prepended, whose only use would be to accommodate the carry -if any- from a previous chunk (the "0" of the buffer would have to be reset accordingly).

I might be perfectly wrong, though, very likely.
--
Jorge.

Jorge Chamorro

Oct 2, 2010, 9:16:31 PM
to nod...@googlegroups.com
On 03/10/2010, at 03:06, Jorge wrote:
>
> I've had an idea:
>
> An analogy: when you are adding 15 + 17 you first do 5+7 that gives 2 and a carry of one so that the next sum becomes carry+1+1.
>
> In the same manner, chunks that arrive ondata encoded in utf8 could generate a carry to be prepended to the next chunk on the next ondata.
>
> That does not sound too difficult to do, and would help to avoid the need to buffer the chunks.
>
> Maybe the `new Buffer()`s should be built with a few free bytes of padding prepended, whose only use would be to accommodate the carry -if any- from a previous chunk (the "0" of the buffer would have to be reset accordingly).
>
> I might be perfectly wrong, though, very likely.

BTW, in order to tell if there's a carry, one would just need to sniff the last few bytes of the buffer (4 at most), it seems to me, not the whole buffer.
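
Something like this, I mean (an untested sketch for plain UTF-8; utf8Carry is a made-up helper, chunks are assumed to arrive as Buffers, and buffertools.concat is the one from Ben's side note):

// How many trailing bytes form an incomplete UTF-8 sequence?
// Only the last 4 bytes ever need to be inspected.
function utf8Carry(buf) {
  var limit = Math.min(4, buf.length);
  for (var i = 1; i <= limit; i++) {
    var b = buf[buf.length - i];
    if ((b & 0x80) === 0) return 0;          // ASCII byte: nothing pending
    if ((b & 0xC0) === 0xC0) {               // lead byte of a multi-byte char
      var expected = (b & 0xE0) === 0xC0 ? 2 :
                     (b & 0xF0) === 0xE0 ? 3 : 4;
      return i < expected ? i : 0;           // pending if the tail is too short
    }
    // otherwise a continuation byte (10xxxxxx): keep walking backwards
  }
  return 0;
}

// ondata: chop the carry off this chunk and glue it onto the next one
var pending = null;
response.on('data', function (chunk) {
  if (pending) {
    chunk = buffertools.concat(pending, chunk);
    pending = null;
  }
  var carry = utf8Carry(chunk);
  if (carry) {
    pending = chunk.slice(chunk.length - carry);
    chunk = chunk.slice(0, chunk.length - carry);
  }
  // chunk now ends on a character boundary and can be converted on its own
});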
--
Jorge.

Ben Noordhuis

Oct 3, 2010, 3:14:22 PM
to nod...@googlegroups.com
This is more or less the idea for the streaming API. It's pretty
straightforward for single-byte and multi-byte character sets, but
less so for stateful encodings (the ISO-2022 dialects).

Jorge

Oct 4, 2010, 3:46:20 AM
to nod...@googlegroups.com

Just make the carry an object that includes the state as well... (?)
--
Jorge.

Ben Noordhuis

Oct 4, 2010, 6:37:04 AM
to nod...@googlegroups.com
On Mon, Oct 4, 2010 at 09:46, Jorge <jo...@jorgechamorro.com> wrote:
> Just make the carry an object that includes the state as well... (?)

Easy in theory, somewhat harder in practice. :-)
