encoding/json treats []byte as b64 encoded. Could it not?

Brian Picciano

Dec 14, 2013, 3:25:18 PM
to golan...@googlegroups.com
I'm sorry if this has been brought up already, I haven't been able to find anything on it in my searching. I also know this would be a fairly significant change and would break backwards compatibility, but it is a fairly annoying "feature" that I think is more of a hindrance than a help.

Basically the current behavior is that if you have a struct with a []byte field that you pass into the json marshaler, it will represent that in the output json string as the base64 encoded version of what you put in, and if you're unmarshaling into a []byte it will try to base64 decode the json string first. I can understand why this might be thought to be "the right way", since it forces you to use string as a string and raw binary data as []byte. But it's a bit presumptuous to assume that there is no legitimate reason anyone should pass a string through to a []byte and work with them that way.

Currently, if I want my destination struct to be something like:

type MyStruct struct {
    A, B []byte
}

And have that be filled by the json string: `{"A":"foo","B":"bar"}`, then I would first have to make a temporary struct like:

type MyStructStr struct {
    A, B string
}

And copy/convert each field over individually. Same goes if I want to convert from MyStruct back into JSON. This adds a lot of extra code and data copies. In encoding/json the data is initially passed in and (un)quoted as []byte, where it is then converted to string. So now I'm converting back to []byte. This is unnecessary and penalizes the common case.
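
For illustration, here's a minimal sketch of both the default behavior and the copy/convert workaround described above (the values are made up; this is just one way to wire it up):

package main

import (
    "encoding/json"
    "fmt"
)

type MyStruct struct {
    A, B []byte
}

type MyStructStr struct {
    A, B string
}

func main() {
    // Marshaling []byte fields produces base64, not plain JSON strings.
    out, _ := json.Marshal(MyStruct{A: []byte("foo"), B: []byte("bar")})
    fmt.Println(string(out)) // {"A":"Zm9v","B":"YmFy"}

    // To accept {"A":"foo","B":"bar"} as plain strings, unmarshal into the
    // string-typed temporary struct and convert each field over individually.
    var tmp MyStructStr
    if err := json.Unmarshal([]byte(`{"A":"foo","B":"bar"}`), &tmp); err != nil {
        panic(err)
    }
    m := MyStruct{A: []byte(tmp.A), B: []byte(tmp.B)}
    fmt.Printf("%s %s\n", m.A, m.B) // foo bar
}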

I've hacked a version of encoding/json where I took out the b64 stuff. It works just fine and is actually less code than it used to be. So there's no technical reason it has to stay (to my knowledge). Again, I know this probably won't make it in for anything in versions 1.*, but for 2 I think it should be considered. Also, if this is the wrong place to post this please let me know, I'll happily move it.
 

Jsor

Dec 15, 2013, 5:11:08 AM
to golan...@googlegroups.com, bgpic...@gmail.com
I can understand working with a string as bytes. That's why (as far as I know) the string->[]byte conversion is supported. There are plenty of legitimate reasons. What I'm less clear on is exactly what your use case is for encoding it as a string if you're dealing with it exclusively as bytes. I mean, if you're using it as both, surely you're making the copy anyway at some point. Wouldn't it make more sense to encode/decode as whatever your program uses it as?

The only time I can think of where this would be useful is if you were manually editing the JSON to change the strings, or if you have two different programs sharing serialization files where one treats something as a string and one as bytes. I could certainly see an argument for the former, but I think it's sufficiently esoteric that it's not really worth changing the API for it.

Damian Gryski

Dec 15, 2013, 6:51:02 AM
to golan...@googlegroups.com, bgpic...@gmail.com


On Saturday, December 14, 2013 9:25:18 PM UTC+1, Brian Picciano wrote:

Basically the current behavior is that if you have a struct with a []byte field that you pass into the json marshaler, it will represent that in the output json string as the base64 encoded version of what you put in, and if you're unmarshaling into a []byte it will try to base64 decode the json string first. 
 
  JSON defines strings to be UTF-8 encoded, and as such is not suitable for storing binary data.  Encoding an unknown []byte with base64 eliminates problems with the data being accidentally modified in transit and simplifies interoperating with other JSON implementations.  Bytes should be used for binary data.  If it's textual data, then string is the correct type.

   Damian

egon

Dec 15, 2013, 7:32:48 AM
to golan...@googlegroups.com, bgpic...@gmail.com
You can use:

type MyStruct struct {
    A, B json.RawMessage
}

json.RawMessage is defined as type RawMessage []byte. Alternatively you can make your own type that implements MarshalJSON and UnmarshalJSON.

The reason it does base64 is that it is "the correct way" to represent a byte array in JSON. In other words, by default it is safe, but you can override the behavior by using RawMessage or a custom Marshaler.
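
For example, a minimal sketch of the RawMessage approach (the main function is just there to show what comes out):

package main

import (
    "encoding/json"
    "fmt"
)

type MyStruct struct {
    A, B json.RawMessage
}

func main() {
    var m MyStruct
    if err := json.Unmarshal([]byte(`{"A":"foo","B":"bar"}`), &m); err != nil {
        panic(err)
    }
    // RawMessage keeps the raw JSON token verbatim, surrounding quotes included.
    fmt.Printf("%s %s\n", m.A, m.B) // "foo" "bar"
}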

+egon
Message has been deleted

Brian Picciano

Dec 15, 2013, 5:05:48 PM
to golan...@googlegroups.com
I'm going to compress my three responses to one.

> What I'm less clear on is exactly what your use case is for encoding it as a string if you're dealing with it exclusively as bytes. I mean, if you're using it as both, surely you're making the copy anyway at some point.

That's my point, I don't want to use string EVER in my application (I have no reason to for this particular one). But with encoding/json I have to, because I can't directly get []byte out of it for a JSON string value. So I have to convert.

> JSON defines strings to be UTF-8 encoded, and as such is not suitable for storing binary data. Encoding an unknown []byte with base64 eliminates problems

I think that's a decision the coder should make. If I am worried that my binary data can't be encoded into a UTF-8 string then I can encode it into hex or base64 or whatever I like. But if I KNOW that my data is coming in as a proper JSON string and going out the other end without being changed in between there's no reason I should be forced to pay the penalty of four extra copies ([]byte -> string (inside encoding/json) -> []byte -> app -> string -> []byte (inside encoding/json)).

> If it's textual data, then string is the correct type.

That's true if I am actually interacting with the data. If I'm just carrying the data along and spitting it back out somewhere else then I don't really care what it is, and what I really need to optimize for is speed and memory. Four copies aren't helping.

> Alternatively you can make your own type that implements MarshalJSON and UnmarshalJSON.

The problem with doing this (and with RawMessage) is that you skip the unicode (un)escaping step which encoding/json does for strings (internally, it actually does it while they're still []byte, so it's pretty trivial to have it do it for []byte fields too). I could just pass along the []byte untouched, with the backslashes and all still in there, and send it out the other end as a JSON string and no-one would be any wiser. But what if that other end isn't JSON? What if it's some custom binary interface? They're going to be receiving different data than was passed in.
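
A small sketch of the difference I mean, using an escaped newline as the input (the struct names are just for illustration):

package main

import (
    "encoding/json"
    "fmt"
)

func main() {
    in := []byte(`{"A":"line1\nline2"}`)

    // Decoding into a string performs the JSON unescaping...
    var s struct{ A string }
    json.Unmarshal(in, &s)
    fmt.Println(s.A) // prints two lines: the \n became a real newline

    // ...while RawMessage hands the token back verbatim, backslash and quotes included.
    var r struct{ A json.RawMessage }
    json.Unmarshal(in, &r)
    fmt.Println(string(r.A)) // "line1\nline2"
}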

Jsor

Dec 15, 2013, 6:02:53 PM
to golan...@googlegroups.com, bgpic...@gmail.com
I'm confused, are you using manually-created or edited JSON files as config files or something? I'm simply trying to understand exactly why you have/need/want it encoded as a string, but to decode it as a []byte if you're explicitly "only passing it along" (and thus never deal with it as a string in the application). It seems that storing it as a []byte should be sufficient.

The only use case I can really think of is:

Manually created JSON config file -> load into application as []byte -> serve data over network (or some other way) to another service that treats it as a string.

(And the other way around; client sends string -> read as []byte -> write to JSON as editable string)

If so, I think that's sufficiently esoteric that it's acceptable to make the user define their own marshalling function on the type.

Brian Picciano

Dec 15, 2013, 8:22:29 PM
to Jsor, golan...@googlegroups.com
Defining your own marshaling function doesn't work though, because like I said you lose the escaping/unescaping of the data that encoding/json does for the string type. As far as I can see encoding/json doesn't provide a function to do that manually, but that could be another solution too.

My application that I have in mind (although this has been a problem in the past) is essentially re-implementing redis-cluster, but with some added capability. So there is a struct with a command field, a key field, an auth field, and extra arguments. The command, key and arguments need to be []byte because that's what the redis library I'm using wants (and because that's what makes sense), and the auth field needs to be []byte because that's what the crypto package wants. The data in args (and the key, for that matter) needs to be unescaped, because it won't necessarily be retrieved through a JSON interface later.

Like I said in the beginning, I don't really think this would even be considered for changing till go2 rolls around since it's very much non-compatible. But I thought it needed throwing out there. I've gone ahead and made the change to get rid of the base64 related code in the following repo: https://github.com/mediocregopher/gojson and will be using that for my applications. Everyone can do what they like.
Message has been deleted

Brian Picciano

Dec 15, 2013, 9:51:48 PM
to Islan Dberry, golan...@googlegroups.com
> Where does JSON fit in with redis-cluster?

It's not a drop-in replacement. It's not even a replacement, that's just an easy way to describe it. And it's going to have a JSON interface, among potentially others.

> base64 or some other encoding is required to represent arguments as JSON strings

Except that that's only true if my data isn't normal (UTF-8) text. If it is, then going back and forth from JSON is no problem. And if it's not, then the client (to my service) can decide on what encoding it would like to use, or it could just decide to use a different format altogether that supports binary data better than JSON does. Yes, there's potential for a screwup if someone tries to encode/decode something with invalid data, but the package can just return an error if that happens. I assume it's already doing that for encoding/decoding strings, since those can just as well have binary data in them.


On Sun, Dec 15, 2013 at 9:40 PM, Islan Dberry <island...@gmail.com> wrote:
On Sunday, December 15, 2013 5:22:29 PM UTC-8, Brian Picciano wrote:
My application that I have in mind (although this has been a problem in the past) is essentially re-implementing redis-cluster, but with some added capability.

Where does JSON fit in with redis-cluster?  I would have expected requests and responses to be encoded using the "unified request protocol".
 
 The data in args (and the key, for that matter) needs to be unescaped, because it won't necessarily be retrieved through a JSON interface later.

Because Redis command arguments are binary data, base64 or some other encoding is required to represent arguments as JSON strings.

Message has been deleted

Brian Picciano

Dec 15, 2013, 10:15:02 PM
to Islan Dberry, golan...@googlegroups.com
I agree that they can't. But there's nothing to say the destination for the decoded data can't be a []byte instead of a string, after all the un-escaping and replacing and checking whatever else is done. All of that checking and unescaping is done on the original []byte that is passed in (internally, inside encoding/json). The casting to string is done afterwards.


On Sun, Dec 15, 2013 at 10:10 PM, Islan Dberry <island...@gmail.com> wrote:
On Sunday, December 15, 2013 6:51:48 PM UTC-8, Brian Picciano wrote:
I assume it's already doing that for encoding/decoding strings, since those can just as well have binary data in them.

JSON strings are UTF-8; they cannot contain arbitrary binary data.

The json package coerces strings to valid UTF-8 by replacing invalid bytes with the Unicode replacement rune. 
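
A quick sketch of the coercion Islan describes (the expected output is noted in the comment):

package main

import (
    "encoding/json"
    "fmt"
)

func main() {
    // "\xff" is not valid UTF-8.
    out, err := json.Marshal("\xff")
    fmt.Println(string(out), err) // "\ufffd" <nil> -- the invalid byte became the replacement rune
}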


egon

Dec 16, 2013, 2:12:47 AM
to golan...@googlegroups.com, Jsor, bgpic...@gmail.com
I still can't understand why you would modify the original lib instead of just implementing a custom type... e.g. http://play.golang.org/p/r6-Z62qAkw ?

+ egon

Kyle Lemons

Dec 16, 2013, 1:37:30 PM
to Brian Picciano, golang-nuts
On Sun, Dec 15, 2013 at 2:05 PM, Brian Picciano <bgpic...@gmail.com> wrote:
I'm going to compress my three responses to one.

> What I'm less clear on is exactly what your use case is for encoding it as a string if you're dealing with it exclusively as bytes. I mean, if you're using it as both, surely you're making the copy anyway at some point.

That's my point, I don't want to use string EVER in my application (I have no reason to for this particular one). But with encoding/json I have to, because I can't directly get []byte out of it for a JSON string value. So I have to convert.

JSON defines strings to be UTF-8 encoded, and as such is not suitable for storing binary data.  Encoding an unknown []byte with base64 eliminates problems

I think that's a decision the coder should make. If I am worried that my binary data can't be encoded into a UTF-8 string then I can encode it into hex or base64 or whatever I like. But if I KNOW that my data is coming in as a proper JSON string and going out the other end without being changed in between there's no reason I should be forced to pay the penalty of four extra copies ([]byte -> string (inside encoding/json) -> []byte -> app -> string -> []byte (inside encoding/json)).

I assume you've benchmarked this and found that the extra copies are a bottleneck?  If not, don't assume that they are without some hard data, especially if you're doing a lot of I/O (as I would expect of a networked service).  The JSON library does string/[]byte and []byte/string conversions itself in some places.
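
(For what it's worth, a hypothetical micro-benchmark along these lines, reusing the MyStruct/MyStructStr types from earlier in the thread, might look roughly like the following; it's a sketch, not a measurement.)

package jsonbench

import (
    "encoding/json"
    "testing"
)

type MyStruct struct{ A, B []byte }
type MyStructStr struct{ A, B string }

var data = []byte(`{"A":"foo","B":"bar"}`)

// BenchmarkViaStringStruct measures the "decode into strings, then convert
// each field to []byte" path whose extra copies are under discussion.
func BenchmarkViaStringStruct(b *testing.B) {
    for i := 0; i < b.N; i++ {
        var tmp MyStructStr
        if err := json.Unmarshal(data, &tmp); err != nil {
            b.Fatal(err)
        }
        _ = MyStruct{A: []byte(tmp.A), B: []byte(tmp.B)}
    }
}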
 
> If it's textual data, then string is the correct type.

That's true if I am actually interacting with the data. If I'm just carrying the data along and spitting it back out somewhere else then I don't really care what it is, and what I really need to optimize for is speed and memory. Four copies aren't helping.

Alternatively you can make your own type that implements MarshalJSON and UnmarshalJSON.

The problem with doing this (and with RawMessage) is that you skip the unicode (un)escaping step which encoding/json does for strings (internally, it actually does it while they're still []byte, so it's pretty trivial to have it do it for []byte fields too). I could just pass along the []byte untouched, with the backslashes and all still in there, and send it out the other end as a JSON string and no-one would be any wiser. But what if that other end isn't JSON? What if it's some custom binary interface? They're going to be receiving different data than was passed in.

The implementations are probably pretty easy to write: `js, err := json.Marshal(string(b))` etc.
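
For illustration, a minimal sketch of what those implementations might look like (the type name byteString is made up here; note it still goes through a string conversion internally):

package main

import (
    "encoding/json"
    "fmt"
)

// byteString is a []byte that marshals to and from a plain JSON string.
type byteString []byte

func (b byteString) MarshalJSON() ([]byte, error) {
    return json.Marshal(string(b))
}

func (b *byteString) UnmarshalJSON(data []byte) error {
    var s string
    if err := json.Unmarshal(data, &s); err != nil {
        return err
    }
    *b = byteString(s)
    return nil
}

func main() {
    var v struct{ A, B byteString }
    if err := json.Unmarshal([]byte(`{"A":"foo","B":"bar"}`), &v); err != nil {
        panic(err)
    }
    out, _ := json.Marshal(v)
    fmt.Println(string(out)) // {"A":"foo","B":"bar"}
}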
 
On Sun, Dec 15, 2013 at 7:32 AM, egon <egon...@gmail.com> wrote:
You can use:

type MyStruct struct {
    A, B json.RawMessage
}

json.RawMessage is defined as type RawMessage []byte. Alternatively you can make your own type that implements MarshalJSON and UnmarshalJSON.

The reason it does base64 is that it is "the correct way" to represent a byte array in JSON. In other words, by default it is safe, but you can override the behavior by using RawMessage or a custom Marshaler.

+egon

On Saturday, December 14, 2013 10:25:18 PM UTC+2, Brian Picciano wrote:
I'm sorry if this has been brought up already, I haven't been able to find anything on it in my searching. I also know this would be a fairly significant change and would break backwards compatibility, but it is a fairly annoying "feature" that I think is more of a hindrance than a help.

Basically the current behavior is that if you have a struct with a []byte field that you pass into the json marshaler, it will represent that in the output json string as the base64 encoded version of what you put in, and if you're unmarshaling into a []byte it will try to base64 decode the json string first. I can understand why this might be thought to be "the right way", since it forces you to use string as a string and raw binary data as []byte. But it's a bit presumptuous to assume that there is no legitimate reason anyone should pass a string through to a []byte and work with them that way.

Currently, if I want my destination struct to be something like:

type MyStruct struct {
    A, B []byte
}

And have that be filled by the json string: `{"A":"foo","B":"bar"}`, then I would first have to make a temporary struct like:

type MyStructStr struct {
    A, B string
}

And copy/convert each field over individually. Same goes if I want to convert from MyStruct back into JSON. This adds a lot of extra code and data copies. In encoding/json the data is initially passed in and (un)quoted as []byte, where it is then converted to string. So now I'm converting back to []byte. This is unnecessary and penalizes the common case.

I've hacked a version of encoding/json where I took out the b64 stuff. It works just fine and is actually less code than it used to be. So there's no technical reason it has to stay (to my knowledge). Again, I know this probably won't make it in for anything in versions 1.*, but for 2 I think it should be considered. Also, if this is the wrong place to post this please let me know, I'll happily move it.
 


Mauro Lacy

Jun 25, 2024, 9:01:57 AM
to golang-nuts