decodeURIComponent vs. binary Buffer.toString() + UTF-8 replacement chars


Taylor Hughes

Jul 20, 2012, 2:48:31 PM
to nod...@googlegroups.com
Hi nodejs group!

I was just wrestling with a bug in our app (an iPhone emoji sent via multipart POST to a node.js backend, decoded with the formidable library) and came across the following Interesting Case™.

The bug was: emoji chars POSTed from an iPhone as part of a multipart request were being converted into \ufffd (the Unicode replacement character), whereas with form-encoded POSTs they were not.

From this behavior I isolated the following interesting snippet:

// This is an emoji character POSTed by an iPhone:
var binary = '\u00f0\u009f\u008d\u0094';
// The same binary string, urlencoded byte for byte (what you get with a form-encoded POST of the same thing):
var urlencoded = '%F0%9F%8D%94';

// Convert from the binary string
var utf8 = new Buffer(binary, 'binary').toString('utf-8');

// Convert from the urlencoded version of the same thing
var utf8uri = decodeURIComponent(urlencoded);

// Results are not the same:
utf8 == utf8uri // false

// utf8    => "\ufffd" (Unicode replacement character)
// utf8uri => "\ud83c\udf54" (characters the iPhone can understand as the original emoji)


(Note that normal multibyte UTF-8 characters go through both the same way, and seem to come out fine in both cases.)

I'm mostly curious about why this happens — namely why decodeURIComponent() is seemingly more permissive with UTF-8 decoding than other mechanisms like StringDecoder() and Buffer.toString() — and if there's a way to preserve strange UTF-8 characters using those mechanisms too.

Thanks!
Taylor

Marcel Laverdet

Jul 20, 2012, 6:48:26 PM
to nod...@googlegroups.com
What version of node? This is what I get:

> var moji1 = (new Buffer('\xf0\x9f\x8d\x94', 'binary')).toString('utf-8');
> var moji2 = (new Buffer('\u00f0\u009f\u008d\u0094', 'binary')).toString('utf-8');
> var moji3 = decodeURIComponent('%F0%9F%8D%94');
> moji1 == moji2
true
> moji2 == moji3
true
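
The same check can be sketched for a current Node.js release. This assumes a modern runtime, where `Buffer.from` replaces the deprecated `new Buffer(...)` constructor used above; the helper name `codeUnits` is made up for illustration.

```javascript
// Assumption: a modern Node.js, where Buffer.from replaces `new Buffer(...)`.
var binary = '\u00f0\u009f\u008d\u0094';
var urlencoded = '%F0%9F%8D%94';

// The 'binary' (latin1) encoding maps each char code 0-255 to a single byte.
var bytes = Buffer.from(binary, 'binary');
var moji1 = bytes.toString('utf8');
var moji3 = decodeURIComponent(urlencoded);

// Hypothetical helper: list a string's UTF-16 code units as hex.
function codeUnits(str) {
  return str.split('').map(function (ch) {
    return ch.charCodeAt(0).toString(16);
  });
}

console.log(bytes.toString('hex')); // f09f8d94
console.log(codeUnits(moji1));      // [ 'd83c', 'df54' ]
console.log(moji1 === moji3);       // true
```

Both paths should produce the surrogate pair \ud83c\udf54 rather than \ufffd.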



Taylor Hughes

Jul 20, 2012, 6:54:24 PM
to nod...@googlegroups.com
v0.6.18 — we haven't bumped our project up to 0.8 yet.

Just tried with 0.8.3 and you're right — looks good in 0.8. Didn't think an upgrade would change this particular behavior so I didn't try it out. :)

Amazing. Thanks!

-t

Daniel Rinehart

Jul 21, 2012, 8:52:46 AM
to nod...@googlegroups.com
I faced a similar issue recently with 0.6.x that was fixed in 0.8.x
since 0.8.x includes a newer version of V8 which addressed some issues
with Unicode handling such as:

https://github.com/joyent/node/issues/2686

The linked V8 issue has more details.
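
For context, the bytes F0 9F 8D 94 are a 4-byte UTF-8 sequence, which encodes a code point above U+FFFF; a JavaScript string has to hold such a code point as a surrogate pair. Below is a rough sketch of that arithmetic, not the actual V8 or node implementation. The function names are invented for illustration and the decoder does no validation.

```javascript
// Illustrative only: decode one 4-byte UTF-8 sequence by hand (no validation).
function decodeFourByteUtf8(b) {
  return ((b[0] & 0x07) << 18) | // 3 payload bits from the lead byte
         ((b[1] & 0x3f) << 12) | // 6 payload bits from each continuation byte
         ((b[2] & 0x3f) << 6) |
          (b[3] & 0x3f);
}

// Split a code point above U+FFFF into a UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  var offset = codePoint - 0x10000;
  var high = 0xd800 + (offset >> 10);  // top 10 bits
  var low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return String.fromCharCode(high, low);
}

var codePoint = decodeFourByteUtf8([0xf0, 0x9f, 0x8d, 0x94]);
var pair = toSurrogatePair(codePoint);

console.log(codePoint.toString(16));                      // 1f354
console.log(pair === '\ud83c\udf54');                     // true
console.log(pair === decodeURIComponent('%F0%9F%8D%94')); // true
```

A decoder that handles only 1- to 3-byte sequences would have nothing valid to emit for these bytes, which fits the \ufffd output reported on 0.6.x.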

-- Daniel R. <dan...@neophi.com> [http://danielr.neophi.com/]