UTF8 encoding issue


arpit shah

Sep 30, 2015, 10:53:09 AM
to nodejs
Hi All,

I have been facing an issue with encoding/decoding in the Node.js Express framework.
Brief background: I store PDF files in a MySQL database, in a LONGBLOB column with the latin1 charset. On the server side, I need to send the binary data UTF-8 encoded, because my client only knows how to decode UTF-8.
I have tried every solution I could find on Google.

For example: new Buffer(mySqlData).toString('utf8');
I also tried the "utf8" module and its utf8.encode(mySqlData) function.

I also tried base64 encoding on the server and base64 decoding on the client. It works just fine, but I need UTF-8 encoding. Base64 also increases the size.

Please help guys.

Matt

Oct 1, 2015, 1:12:25 AM
to nod...@googlegroups.com
I think you've really misunderstood the PDF format. PDF is a binary format (it can contain images with zero bytes, for example). Nothing that reads PDFs expects them to be in UTF-8.

You may have destroyed the PDFs completely by storing them as Latin-1, but given how flexible Latin-1 is, they may have survived. Just deliver them as binary, set the correct Content-Type on delivery (res.type('pdf')), and it should work.
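A minimal sketch of "deliver them as binary", assuming the PDF is already in a Buffer. The helper name sendPdf is hypothetical, and it uses only Node's core http response methods (which Express's res also exposes), so it is an illustration rather than the exact code from this thread:

```javascript
// Hypothetical helper: sends a PDF held in a Buffer as raw bytes.
// `res` is an http.ServerResponse (Express's `res` extends it).
function sendPdf(res, pdfBuffer) {
  res.writeHead(200, {
    'Content-Type': 'application/pdf',
    'Content-Length': pdfBuffer.length
  });
  // Pass the Buffer itself: no toString(), so no character encoding is applied.
  res.end(pdfBuffer);
}
```

In an Express route, res.type('pdf').send(pdfBuffer) should have the same effect, since res.send() writes a Buffer out unchanged.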


Jimb Esser

Oct 1, 2015, 1:12:25 AM
to nodejs
A PDF is a binary blob and contains byte sequences that are not valid UTF-8. new Buffer(mySqlData).toString('utf8') is going to replace every invalid byte sequence with the replacement character U+FFFD.
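To see the data loss concretely, here is a small sketch with made-up sample bytes (the "%PDF" prefix followed by two bytes that are never valid in UTF-8). It uses Buffer.from, the current replacement for the new Buffer(...) constructor:

```javascript
// Sample data: "%PDF" followed by 0xFF and 0xFE, which can never appear in valid UTF-8.
const bytes = Buffer.from([0x25, 0x50, 0x44, 0x46, 0xff, 0xfe]);

// Decoding as UTF-8 turns each invalid byte into U+FFFD.
const asUtf8 = bytes.toString('utf8');
console.log(asUtf8 === '%PDF\ufffd\ufffd'); // true

// Re-encoding does not restore the original bytes: U+FFFD is 3 bytes in UTF-8.
const roundTrip = Buffer.from(asUtf8, 'utf8');
console.log(bytes.length, roundTrip.length); // 6 10
```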

What exactly do you want on the client?  Since you said you used base64 encoding successfully, I'm guessing you want a string in which the Nth character's charCode is equal to the Nth byte of your binary data.  The UTF-8 encoding of this string won't look like your binary data, but that should be fine.  To get that, first construct such a string on your server, then send it to the client.  UTF-8 should be fine for the transfer: for example, the character 0xFF is encoded as the two bytes 0xC3 0xBF.  The result will be larger than sending the binary directly, because every byte of 0x80 or above becomes two bytes; whether it ends up smaller than base64 depends on how many of your bytes fall in that range.
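A rough way to check the size trade-off is to measure it. The sketch below uses two made-up inputs: an ASCII-only header, and one byte of every value from 0x00 to 0xFF as a stand-in for evenly distributed binary data. 'latin1' is Node's other name for the 'binary' encoding.

```javascript
// Size of a one-char-per-byte string once it is UTF-8 encoded for transfer.
function utf8TransferSize(buf) {
  return Buffer.byteLength(buf.toString('latin1'), 'utf8');
}

// Size of the same bytes as base64 text.
function base64TransferSize(buf) {
  return buf.toString('base64').length;
}

// ASCII-only data: every byte stays one byte in UTF-8.
const asciiOnly = Buffer.from('%PDF-1.4', 'latin1');
console.log(utf8TransferSize(asciiOnly), base64TransferSize(asciiOnly)); // 8 12

// One of every byte value: half of them (0x80-0xFF) double in size.
const allBytes = Buffer.alloc(256);
for (let i = 0; i < 256; i++) allBytes[i] = i;
console.log(utf8TransferSize(allBytes), base64TransferSize(allBytes)); // 384 344
```

Real PDFs with compressed streams tend to look more like the second case, so it is worth measuring on actual files before choosing.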

You should be able to construct such a string on your server with either of these (depending on what exactly "mySqlData" is, I'm assuming an array of bytes or a Buffer):
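  var assert = require('assert');
  // Option 1: append one character per byte, using the byte value as the char code
  for (var s1='', ii=0; ii<mySqlData.length; ++ii) { s1 += String.fromCharCode(mySqlData[ii]); }
  // Option 2: let Buffer do the same thing with the 'binary' encoding
  var s2 = new Buffer(mySqlData).toString('binary');
  assert.equal(s1, s2);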
Now you have a string that represents your bytes and can be sent losslessly through UTF-8 encoding.

On the client, if you actually want an array of bytes again, then loop through the string and call charCodeAt.  Note that making a new Buffer or Uint8Array from this kind of string (without specifying this "binary" encoding which is not particularly a standard) will not work, because that will get you the UTF-8 encoded bytes, not your source bytes.
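The client-side loop could look roughly like this. The function name binaryStringToBytes is made up for illustration, and the sketch assumes the string really does hold one character per original byte:

```javascript
// Hypothetical helper: rebuilds the original bytes from a string whose
// Nth char code is the Nth byte.
function binaryStringToBytes(str) {
  const out = new Uint8Array(str.length);
  for (let i = 0; i < str.length; i++) {
    // Mask to 8 bits; a char code above 0xFF would mean the input is not this kind of string.
    out[i] = str.charCodeAt(i) & 0xff;
  }
  return out;
}

const received = '%PDF\u00ff';               // 5 characters, last one is char code 0xFF
const restored = binaryStringToBytes(received);
console.log(restored.length);                // 5

// The pitfall described above: encoding the string as UTF-8 yields different bytes.
const wrong = new TextEncoder().encode(received);
console.log(wrong.length);                   // 6 (0xFF became 0xC3 0xBF)
```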

None of this sounds particularly efficient; the best option would be to simply send the file as binary, without any extra encoding.

Hope this helps,
  Jimb