handling binary strings

Brian Craft

Oct 18, 2015, 11:31:18 PM
to nodejs
I'm trying to persist a string to disk, but every mechanism I've tried loses bits. The string is holding binary data generated by a third-party lib.

Searching online has turned up lots of recommendations to use the Buffer class; however, I'm also unable to round-trip the data through Buffer without losing bits. That is, in general

str !== (new Buffer(str, 'binary')).toString('binary')

How can I persist a string without losing bits?

Are there docs anywhere on how the String class encodings work, and how to move data in and out of String without losing bits?

Aria Stewart

Oct 18, 2015, 11:38:14 PM
to nod...@googlegroups.com
Well, strings are sequences of Unicode text (UTF-16 code units, in JavaScript's case), not unrestricted binary data -- so you can't put arbitrary binary into one without already having encoded it somehow.

Buffers are the 8-bit-clean array-of-bytes interface you're looking for -- but knowing what this data is and how to get it out of a string is the key part -- if they're encoding binary data in strings, it's a bit of a guess.
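
A minimal sketch of the "8-bit-clean" point, assuming the bytes are already in a Buffer and that a scratch file in the OS temp directory is acceptable. The file name `payload.bin` and the sample bytes are made up for illustration, and `Buffer.from` is the current replacement for `new Buffer`:

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Arbitrary bytes, including values that are not valid UTF-8 on their own.
const data = Buffer.from([0x00, 0x80, 0xff, 0xd8, 0x00, 0x7f]);

// Hypothetical scratch location; any writable path would do.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'bin-demo-'));
const file = path.join(dir, 'payload.bin');

// No encoding argument: the Buffer's bytes go to disk unchanged.
fs.writeFileSync(file, data);

// No encoding argument on read either, so a Buffer comes back, not a string.
const readBack = fs.readFileSync(file);
const intact = data.equals(readBack);

// Clean up the scratch file and directory.
fs.unlinkSync(file);
fs.rmdirSync(dir);
```

As long as no string encoding is involved on either side, nothing gets a chance to reinterpret the bytes.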

Why did you choose the 'binary' encoding above? It's a bit quirky -- it's only one of several ways to get binary data into and out of strings, it keeps just the low 8 bits of each character, and it's deprecated since it's a hack.
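
A rough sketch of why a 'binary' round trip can lose bits, assuming the third-party string contains code units above 0xFF (the sample strings here are invented). In current Node, 'binary' is an alias for 'latin1', and `Buffer.from` stands in for `new Buffer`:

```javascript
// 'binary' (alias 'latin1') stores one byte per UTF-16 code unit,
// so anything above 0xFF is truncated to its low byte.
const roundTrip = (s) => Buffer.from(s, 'binary').toString('binary');

const lowOnly = '\u00ff\u0041'; // every code unit fits in 8 bits
const wide = '\u0141\u00ff';    // first code unit is 0x0141

const lowOk = roundTrip(lowOnly) === lowOnly; // survives
const wideOk = roundTrip(wide) === wide;      // does not survive

// 0x0141 was cut down to 0x41 when it was written into the Buffer.
const wideBytes = Buffer.from(wide, 'binary');
```

So the round trip only holds if the library limits itself to one byte per character.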

Aria

Brian Craft

Oct 19, 2015, 12:55:47 AM
to nodejs
How would I pick one? Are the encodings documented? Or listed somewhere?

I eventually worked out with much pain that passing through Buffer and back sometimes swaps byte order in the words, for reasons I don't understand. The ucs2 encoding, which I only found by looking at the source code, avoids this.

Aria Stewart

Oct 19, 2015, 1:03:45 AM
to nod...@googlegroups.com

> On Oct 19, 2015, at 12:03 AM, Brian Craft <craft...@gmail.com> wrote:
>
> How would I pick one? Are the encodings documented? Or listed somewhere?
>
> I eventually worked out with much pain that passing through Buffer and back sometimes swaps byte order in the words, for reasons I don't understand. The ucs2 encoding, which I only found by looking at the source code, avoids this.

Well, generally one avoids ever storing binary data in strings that can't carry the full range of things -- that's why things like Base64 were invented, to reduce arbitrary bytes down to a character set that's easy to manipulate as text.
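
For illustration, a small Base64 sketch, assuming the data starts out as bytes in a Buffer rather than in a string (the sample bytes are arbitrary):

```javascript
// Arbitrary bytes that would not survive being treated as text.
const bytes = Buffer.from([0x00, 0xff, 0xd8, 0x00, 0x80]);

// Encode to a plain-ASCII string that is safe to store or pass around as text.
const text = bytes.toString('base64');

// Decode back to the original bytes.
const back = Buffer.from(text, 'base64');
const same = bytes.equals(back);
```

The cost is size: every 3 bytes become 4 characters.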

Unicode's a tricky beast -- what's allowed and not is pretty lumpy and uneven in places -- surrogate characters, handling of nulls, replacement with the replacement character for things that are out of bounds. You don't want the mess of all human languages mixed with the exacting representation of binary data.
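
A quick sketch of the replacement-character behaviour, using 'utf8' as the example encoding; the inputs are contrived to trigger it:

```javascript
// A lone high surrogate is not valid Unicode text on its own.
const lone = '\ud800';

// Encoding it as UTF-8 substitutes U+FFFD (bytes ef bf bd).
const viaUtf8 = Buffer.from(lone, 'utf8');
const decoded = viaUtf8.toString('utf8');
const survived = decoded === lone;

// Going the other way: a byte that is not valid UTF-8 also decodes to U+FFFD.
const badByte = Buffer.from([0xff]).toString('utf8');
```

In both directions the original value is gone once the replacement happens.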

So it sounds like you found ucs2, which is a naive mapping of the basic plane of Unicode onto 16-bit code units -- there are two ways to map those to bytes, of course -- low byte first or high byte first -- so byte swapping can be an issue. It's not documented because it pretty much should never be used -- though it sounds like you've been forced into it since you're interfacing with a library that squishes binary data into strings that way.
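
A sketch of the ucs2 round trip and the byte-order issue, assuming the library packs 16 bits into each character. The sample string is invented; 'utf16le' is the same encoding as 'ucs2' in Node, and `swap16` needs a reasonably recent version:

```javascript
// Includes a value above 0xFF, a lone surrogate, a NUL, and 0xFFFF.
const s = '\u0141\ud800\u0000\uffff';

// Each code unit becomes two bytes, low byte first (little-endian).
const le = Buffer.from(s, 'utf16le');
const ok = le.toString('utf16le') === s;

// The same code units with the high byte first: copy, then swap each pair.
const be = Buffer.from(le).swap16();

// Reading big-endian bytes as little-endian yields different characters.
const misread = be.toString('utf16le');
const misreadDiffers = misread !== s;
```

Whichever order is written to disk, reading it back has to use the same one.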

Good luck!

Aria