Single bytes convert to multi-byte strings?

Skyler Hawthorne

Nov 16, 2012, 4:17:06 AM
to golan...@googlegroups.com
I'm working on a cryptography assignment in which we are supposed to analyze ciphertexts for letter frequency. However, I noticed that if you iterate over a byte array and convert each byte to a string, many of the bytes convert to two-byte strings.

I put together an example here: http://play.golang.org/p/xZbFTsBW00

Why is this happening? How could a single byte be converted to a two-byte string?

chris dollin

Nov 16, 2012, 4:55:03 AM
to Skyler Hawthorne, golan...@googlegroups.com
string(aninteger) doesn't convert how you think it would. It delivers
a string whose bytes form the UTF-8 representation of the argument
value, i.e., it makes a one-rune string from a rune [== Unicode code point].
Such a string may be more than one byte long.

http://golang.org/ref/spec#Conversions
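
To make the difference concrete, here is a minimal sketch (not the code from either playground link; the helper names asRune and asRawByte are made up for illustration). It compares converting a byte as a code point with wrapping the same byte in a []byte first:

```go
package main

import "fmt"

// asRune converts b the way string(aninteger) does: b is treated as a
// Unicode code point and the result holds that code point's UTF-8 encoding.
// (Hypothetical helper name, used only for this illustration.)
func asRune(b byte) string { return string(rune(b)) }

// asRawByte keeps b as the only byte of a one-byte string.
// (Hypothetical helper name, used only for this illustration.)
func asRawByte(b byte) string { return string([]byte{b}) }

func main() {
	// Values below 0x80 stay one byte either way; values from 0x80 up
	// grow to two bytes under the code point conversion.
	for _, b := range []byte{0x41, 0x7F, 0x80, 0xE9, 0xFF} {
		fmt.Printf("0x%02X: as rune %d byte(s), as raw byte %d byte(s)\n",
			b, len(asRune(b)), len(asRawByte(b)))
	}
}
```

Only the values above 0x7F differ between the two conversions, which would explain why only some of the ciphertext bytes came out as two-byte strings.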

Chris

--
Chris "allusive" Dollin

Dan Kortschak

Nov 16, 2012, 5:30:10 AM
to chris dollin, Skyler Hawthorne, golan...@googlegroups.com
http://play.golang.org/p/4yvIy5MX0W demonstrates explicitly.

Kyle Lemons

Nov 16, 2012, 5:30:16 PM
to Skyler Hawthorne, golang-nuts
When you decode the ciphertext, you're getting a string of raw bytes.  Many of these (if it's a good crypto algorithm) will be >127, which is not valid ASCII, so converting such a byte to a string gives you its UTF-8 encoding (which is two bytes long).  If you're looking for a histogram of the bytes in the ciphertext, just use a map[byte]int:




Kyle Lemons

Nov 16, 2012, 5:31:08 PM
to Skyler Hawthorne, golang-nuts
apparently I keyed send instead of paste: http://play.golang.org/p/XpqFpETnz0
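
For readers who cannot open the playground link, a rough sketch of the map[byte]int idea might look like the following. This is not the linked code; the function name byteHistogram and the sample bytes are made up for illustration.

```go
package main

import "fmt"

// byteHistogram counts how often each byte value occurs in data.
// Keying the map by byte avoids any string conversion entirely.
// (Hypothetical helper name, used only for this illustration.)
func byteHistogram(data []byte) map[byte]int {
	counts := make(map[byte]int)
	for _, b := range data {
		counts[b]++
	}
	return counts
}

func main() {
	// Made-up ciphertext bytes; a real program would decode its input first.
	ciphertext := []byte{0x9A, 0x41, 0x9A, 0xFF, 0x41, 0x9A}

	// Map iteration order is unspecified, so the lines print in no fixed order.
	for b, n := range byteHistogram(ciphertext) {
		fmt.Printf("0x%02X: %d\n", b, n)
	}
}
```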

Kyle Lemons

Nov 16, 2012, 7:11:55 PM
to Skyler, golang-nuts
You can use a [2]byte as a map key.


On Fri, Nov 16, 2012 at 3:14 PM, Skyler <skylerh...@gmail.com> wrote:

Well, this is what I did at first, but after counting single characters, I want to count the digrams and trigrams, and I can't think of a sane way to do that besides just encapsulating them in a string.

Kyle Lemons

Nov 16, 2012, 8:11:33 PM
to Skyler, golang-nuts
Luckily, [2]byte is not a slice :)



On Fri, Nov 16, 2012 at 5:03 PM, Skyler <skylerh...@gmail.com> wrote:

Slices aren't allowed as map keys
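
The distinction matters because arrays are comparable in Go while slices are not, so a [2]byte (or a [3]byte for trigrams) can be a map key. A minimal sketch of the digram count, assuming overlapping pairs are wanted (the function name digramCounts is made up for illustration):

```go
package main

import "fmt"

// digramCounts counts every overlapping pair of adjacent bytes in data.
// The key type [2]byte is an array, which is comparable, so it is a
// valid map key; a []byte slice would be rejected by the compiler.
// (Hypothetical helper name, used only for this illustration.)
func digramCounts(data []byte) map[[2]byte]int {
	counts := make(map[[2]byte]int)
	for i := 0; i+1 < len(data); i++ {
		counts[[2]byte{data[i], data[i+1]}]++
	}
	return counts
}

func main() {
	// Made-up input; map iteration order is unspecified.
	for pair, n := range digramCounts([]byte{0x9A, 0x41, 0x9A, 0x41}) {
		fmt.Printf("%02X %02X: %d\n", pair[0], pair[1], n)
	}
}
```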

Bryan Mills

Nov 18, 2012, 4:29:30 PM
to golan...@googlegroups.com, Skyler
Alternatively, if you want to remove the fixed length, you can put the byte in a singleton []byte and then convert that to a string:

That allows you to distinguish between single bytes and digraphs with zero as the second byte, at the cost of marginally higher runtime overhead for storing the lengths.
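
A minimal sketch of that approach, generalized to any window length (the function name ngramCounts and its parameter n are assumptions made for illustration, not code from the post):

```go
package main

import "fmt"

// ngramCounts counts every overlapping n-byte window of data, using
// string(window) as the key. Converting a []byte to a string copies the
// bytes unchanged, so the key is exactly n bytes long with no UTF-8
// re-encoding. (Hypothetical helper name, used only for this illustration.)
func ngramCounts(data []byte, n int) map[string]int {
	counts := make(map[string]int)
	if n <= 0 {
		return counts
	}
	for i := 0; i+n <= len(data); i++ {
		counts[string(data[i:i+n])]++
	}
	return counts
}

func main() {
	// Made-up input containing a zero byte.
	data := []byte{0xE9, 0x00, 0xE9}

	// %x prints the raw bytes of each key in hex; order is unspecified.
	for key, n := range ngramCounts(data, 1) {
		fmt.Printf("single % x: %d\n", key, n)
	}
	for key, n := range ngramCounts(data, 2) {
		fmt.Printf("digram % x: %d\n", key, n)
	}
}
```

Because the string carries its own length, the single byte 0xE9 and the digram 0xE9 0x00 are different keys and could even share one map.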