Hi nodejs group!
I was just wrestling with a bug in our app — an iPhone emoji sent in a multipart POST to a Node.js backend (decoded with the formidable library) — and came across the following Interesting Case™.
The bug was: emoji characters POSTed from an iPhone as part of a multipart request were being converted into \ufffd (the Unicode replacement character), whereas with form-encoded POSTs they were not.
From this behavior I isolated the following interesting snippet:
// The raw UTF-8 bytes of an emoji POSTed by an iPhone, as a binary string:
var binary = '\u00f0\u009f\u008d\u0094';
// The same binary string, urlencoded byte for byte (what you get with a form-encoded POST of the same thing):
var urlencoded = '%F0%9F%8D%94';
// Convert from the binary string
var utf8 = new Buffer(binary, 'binary').toString('utf-8');
// Convert from the urlencoded version of the same thing
var utf8uri = decodeURIComponent(urlencoded);
// Results are not the same:
utf8 == utf8uri // false
// utf8 => "\ufffd" (the Unicode replacement character)
// utf8uri => "\ud83c\udf54" (the UTF-16 surrogate pair for the original emoji, which the iPhone renders correctly)
(Note that normal multibyte UTF-8 characters — ones within the Basic Multilingual Plane — go through both paths the same way, and seem to come out fine in both cases.)
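For comparison, here's a minimal sketch of what I mean, using é (U+00E9, whose UTF-8 encoding is the two bytes C3 A9) — both paths agree for a character like this:

```javascript
// é (U+00E9) is encoded in UTF-8 as the two bytes C3 A9.
var binaryBmp = '\u00c3\u00a9'; // the raw bytes as a binary string
var urlencodedBmp = '%C3%A9';   // the same bytes, urlencoded

var utf8Bmp = new Buffer(binaryBmp, 'binary').toString('utf-8');
var utf8uriBmp = decodeURIComponent(urlencodedBmp);

utf8Bmp == utf8uriBmp // true — both come out as '\u00e9'
```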
I'm mostly curious about why this happens — namely, why decodeURIComponent() is seemingly more permissive in its UTF-8 decoding than mechanisms like StringDecoder and Buffer.prototype.toString() — and whether there's a way to preserve strange UTF-8 characters using those mechanisms too.
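One workaround I've found (a sketch, not necessarily the idiomatic fix): escape() percent-encodes each byte of a binary string, so chaining it with decodeURIComponent() pushes the same bytes through the more permissive decoding path:

```javascript
// The same emoji bytes from the original snippet:
var binary = '\u00f0\u009f\u008d\u0094';
// escape() turns each byte >= 0x80 into a %XX escape ('%F0%9F%8D%94' here),
// and decodeURIComponent() then decodes that byte sequence as UTF-8:
var fixed = decodeURIComponent(escape(binary));

fixed == '\ud83c\udf54' // true — the surrogate pair survives
```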
Thanks!
Taylor