unicode support, nodejs and chrome browser


~flow

Aug 9, 2011, 2:31:11 PM
to v8-users
hi, i would like to cross-post a question:

http://stackoverflow.com/questions/6985851/how-to-render-32bit-unicode-characters-in-google-v8-and-nodejs

the whole issue boils down to this: when using javascript to write
strings with 32-bit unicode characters into an HTML page displayed
inside google chrome, i take it that the string must pass through that
browser's javascript engine, so it's passed through V8---correct? i
observe that such characters are rendered correctly.

however, doing the equivalent using nodejs 0.4.10 and the
console.log() method, all i get is those annoying ����. my sources are
in utf-8, my terminal is correctly configured (have been doing this
stuff for years using python, so i can be sure about that).

my understanding is that unicode support in javascript is deeply
flawed and the V8 team is committed to sticking to the standard as
closely as possible, for understandable reasons. but somehow the
people who built chrome the browser must have found a way to minimize
the impact of this difficult ECMA legacy---what does document.write()
do that console.log() can't?

i just can't believe that a platform so finely crafted will go on
mangling every single character outside the unicode BMP...

Marcel Laverdet

Aug 10, 2011, 12:18:27 AM
to v8-u...@googlegroups.com
Yeah, this is just the way JavaScript is for now. There is an issue open for it.

You don't run into the problem with HTML because that goes straight through WebKit; it never hits v8.

~flow

Aug 10, 2011, 10:04:23 AM
to v8-u...@googlegroups.com
> You don't run into the problem with HTML because that goes straight through WebKit; it never hits v8

well that's why i did

<script>document.write( "𡥂" );</script>

to make sure that string does go through v8... you mean to say that while we're using v8 to execute the javascript, document.write() is in fact an exposed webkit method?

Marcel Laverdet

Aug 10, 2011, 6:24:43 PM
to v8-u...@googlegroups.com
Ah I see. These are handled internally with surrogate pairs. You can observe this by doing this in Chrome or Safari's web console:
> var mua = '𩄎';
> mua.length
2
> mua.charCodeAt(0)
55396
> mua.charCodeAt(1)
56590

v8 thinks it's writing out 2 characters, but `document.write` knows it's encoding to utf-8 (or whatever the target encoding is) and can interpret the pairs correctly. I don't see a way to convert a NodeJS utf-8 buffer with SMP characters into a string with surrogate pairs. Seems like the Node team will need to put some work into getting around JS's unicode limitations.
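To make the pairing explicit, here is a small sketch of the arithmetic (the helper name is mine, not from the thread or any API) that reconstructs the code point from the two units the console session above reports:

```javascript
// Reconstruct an astral code point from its UTF-16 surrogate pair.
// Formula: 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00).
// (codePointFromPair is a hypothetical helper for illustration only.)
function codePointFromPair(high, low) {
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

// The two units from the console session above:
console.log(codePointFromPair(55396, 56590).toString(16)); // "2910e", i.e. U+2910E
```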

--

~flow

Aug 11, 2011, 1:05:42 PM
to v8-u...@googlegroups.com
so i went and put the same javascript into an HTML page to be displayed by chrome and into a standalone js snippet to be run using nodejs:

var f = function( text ) {
  document.write( '<h1>',  text,                                '</h1>'  );
  document.write( '<div>', text.length,                         '</div>' );
  document.write( '<div>0x', text.charCodeAt(0).toString( 16 ), '</div>' );
  document.write( '<div>0x', text.charCodeAt(1).toString( 16 ), '</div>' );
  console.log( '<h1>',  text,                                 '</h1>'  );
  console.log( '<div>', text.length,                          '</div>' );
  console.log( '<div>0x', text.charCodeAt(0).toString( 16 ),  '</div>' );
  console.log( '<div>0x', text.charCodeAt(1).toString( 16 ),  '</div>' ); };

f( '𩄎' );
f( String.fromCharCode( 0x2910e ) );
f( String.fromCharCode( 0xd864, 0xdd0e ) );

in function f(), those document.write() calls are only present in the HTML document, not the standalone.

i want to show here that something more fundamental must be different between javascript running inside google chrome and javascript running inside nodejs. because, you see, the output i get inside chrome looks like this:

𩄎

2
0xd864
0xdd0e

1
0x910e
0xNaN

𩄎

2
0xd864
0xdd0e

the second character is silently truncated (notice how the char code is reported as 0x910e where it should be 0x2910e), which is sad, but both using a string literal and a numerical surrogate pair work---both in the HTML page and in chrome's console output! conversely, in nodejs, this is what i get:

<h1> � </h1>
<div> 1 </div>
<div>0x fffd </div>
<div>0x NaN </div>
<h1> 鄎 </h1>
<div> 1 </div>
<div>0x 910e </div>
<div>0x NaN </div>
<h1> �����</h1>
<div> 2 </div>
<div>0x d864 </div>
<div>0x dd0e </div>

the silver lining here is that v8 inside nodejs does preserve the surrogate pair, even though it fails to output it correctly. however, the console.log() method gets it completely wrong. may i add that the analog in python 3.1 does work---since i use a 'narrow' python build, it also reports a string '𩄎' as being two characters long, and manages to print it out correctly, which seems to tell me that my ubuntu gnome terminal knows how to handle surrogate pairs.

i could perfectly live with those surrogate pairs---they're a nuisance but i know how to deal with them from years of experience with python. the really sad thing here is that nodejs's v8 seems to fall short on something that v8 can be demonstrated to do correctly when running inside chrome.

that said, let me add that i sometimes worry about the unneeded complexity that goes into implementations. why can't people just use a 32-bit wide character datatype? instead they make users jump through all kinds of gratuitous hoops.

Marcel Laverdet

Aug 11, 2011, 1:57:05 PM
to v8-u...@googlegroups.com
console.log() and document.write() are not part of v8. They are host functions and have different implementations in Chrome and in NodeJS. Chrome seems to have a very robust implementation of both, which is aware of surrogate pairs and the target encoding. NodeJS, on the other hand, fails to respect surrogate pairs.

Your examples don't show much other than the fact that String.fromCharCode() will not generate surrogate pairs and can therefore only generate characters with a 16-bit code point. You'll see the same results in NodeJS.

'𩄎' === String.fromCharCode(0xd864, 0xdd0e)
true
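For completeness, the inverse direction can be sketched too: splitting an astral code point into the pair of 16-bit units that String.fromCharCode() does accept (the helper name is hypothetical, not from the thread):

```javascript
// Split a code point above U+FFFF into its UTF-16 surrogate pair.
// (pairFromCodePoint is a hypothetical helper for illustration only.)
function pairFromCodePoint(cp) {
  var offset = cp - 0x10000;
  return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)];
}

var pair = pairFromCodePoint(0x2910E);
console.log(pair[0].toString(16), pair[1].toString(16)); // "d864" "dd0e"
console.log(String.fromCharCode(pair[0], pair[1]) === '\ud864\udd0e'); // true
```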

Koichi Kobayashi

Aug 11, 2011, 10:38:20 PM
to v8-u...@googlegroups.com
Hi,

console.log() of Node uses v8::String::WriteUtf8() internally.
Unfortunately it supports only the BMP.

http://code.google.com/p/v8/issues/detail?id=761
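For reference, a correct conversion would emit the 4-byte UTF-8 form for supplementary-plane code points; a quick sketch of what those bytes look like (my own illustration, not code from v8 or Node):

```javascript
// UTF-8-encode a single code point above U+FFFF (the 4-byte form).
// Illustration only: these are the bytes a full conversion would need
// to emit for supplementary-plane characters.
function utf8BytesAboveBMP(cp) {
  return [
    0xF0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3F),
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F)
  ];
}

// U+2910E should come out as f0 a9 84 8e:
console.log(utf8BytesAboveBMP(0x2910E).map(function (b) { return b.toString(16); }).join(' '));
```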


--
{
name: "Koichi Kobayashi",
mail: "koi...@improvement.jp",
blog: "http://d.hatena.ne.jp/koichik/",
twitter: "@koichik"
}

~flow

Aug 12, 2011, 8:35:49 AM
to v8-u...@googlegroups.com
i had a look at the nodejs source (0.4.10; includes the v8 sources) but was unable to locate the definition of console.log(). does anyone have a pointer for me? searching (case-insensitively) for log\(.*\{$ didn't turn up anything of interest.

Bryan White

Aug 12, 2011, 9:27:34 AM
to v8-u...@googlegroups.com

I am not familiar with node.js, but the actual function may not be
named log. I would search for "log" (the word log in double quotes)
to find where it maps a function name to a function pointer in
the console object.
--
Bryan White

Koichi Kobayashi

Aug 12, 2011, 10:00:00 AM
to v8-u...@googlegroups.com
Hi,

console.log is defined in lib/console.js,
but it just passes the string to the stream.

You should look at Buffer::Utf8Write in src/node_buffer.cc.
It converts a JS string (UCS-2) to a byte array (UTF-8)
using v8::String::WriteUtf8().

https://github.com/joyent/node/blob/v0.4.10/src/node_buffer.cc#L471
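A user-land workaround along these lines could merge the surrogate pairs itself and UTF-8-encode by hand, so the string never has to round-trip through v8::String::WriteUtf8. This is a sketch of my own (an assumption, not code from Node's source tree):

```javascript
// Walk a JS (UCS-2) string, merge surrogate pairs into code points,
// and UTF-8-encode by hand. Illustrative sketch only.
function toUtf8Bytes(str) {
  var bytes = [];
  for (var i = 0; i < str.length; i++) {
    var c = str.charCodeAt(i);
    var next = str.charCodeAt(i + 1); // NaN past the end; comparisons below stay false
    // Merge a high surrogate with the low surrogate that follows it.
    if (c >= 0xD800 && c <= 0xDBFF && next >= 0xDC00 && next <= 0xDFFF) {
      c = 0x10000 + ((c - 0xD800) << 10) + (next - 0xDC00);
      i++;
    }
    if (c < 0x80) bytes.push(c);
    else if (c < 0x800) bytes.push(0xC0 | (c >> 6), 0x80 | (c & 0x3F));
    else if (c < 0x10000) bytes.push(0xE0 | (c >> 12), 0x80 | ((c >> 6) & 0x3F), 0x80 | (c & 0x3F));
    else bytes.push(0xF0 | (c >> 18), 0x80 | ((c >> 12) & 0x3F), 0x80 | ((c >> 6) & 0x3F), 0x80 | (c & 0x3F));
  }
  return bytes;
}

// '\ud864\udd0e' is U+2910E; its UTF-8 bytes are f0 a9 84 8e.
console.log(toUtf8Bytes('\ud864\udd0e').map(function (b) { return b.toString(16); }).join(' '));
```

The resulting bytes could then be handed straight to the stream (e.g. via a Buffer in Node 0.4), bypassing the BMP-only conversion entirely.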
