[jansson-users] Escaping bytes 128 - 255


Twisol

Apr 21, 2010, 3:35:08 AM
to Jansson users
Hello,

According to the Jansson documentation, code points U+0001 through
U+10FFFF are allowed in strings. However, I can't figure out how to get
Jansson to encode bytes 128 - 255. Jansson handles decoding ["\u00FF"]
to byte 255 just fine, but I don't see any easy way to do the reverse:
json_string() always errors out.

Any tips?

~Jonathan Castello

--
Jansson users mailing list
jansso...@googlegroups.com
http://groups.google.com/group/jansson-users

Petri Lehtinen

Apr 21, 2010, 3:44:15 AM
to jansso...@googlegroups.com
Twisol wrote:
> Hello,
>
> According to the Jansson documentation, code points U+0001 through
> U+10FFFF are allowed in strings. However, I can't figure out how to get
> Jansson to encode bytes 128 - 255. Jansson handles decoding ["\u00FF"]
> to byte 255 just fine, but I don't see any easy way to do the reverse:
> json_string() always errors out.

json_string() requires a UTF-8 encoded string as the argument. Byte
255 is not valid UTF-8. The UTF-8 encoded representation of 255
consists of two bytes: 195 and 191.

To exploit Jansson so that you can use arbitrary binary data in
strings, you must encode your data in UTF-8 before JSON encoding, and
decode the UTF-8 strings back to raw binary after JSON decoding. You
must also escape zero bytes somehow, as Jansson doesn't allow them
(even though JSON does).

Petri
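
A minimal sketch of the round trip described above, under stated
assumptions: each raw byte b is treated as code point U+00b and written
as UTF-8, and a zero byte is mapped to U+0100 so the result contains no
embedded NUL. The names bytes_to_utf8 and utf8_to_bytes and the U+0100
escape are hypothetical choices for illustration, not part of Jansson.

```c
#include <stddef.h>

/* Encode n raw bytes as UTF-8, treating byte b as code point b.
 * Byte 0 becomes U+0100 (an assumption of this sketch) so the output
 * never contains a zero byte. out must hold at least 2*n + 1 bytes.
 * Returns the output length, excluding the terminating NUL. */
size_t bytes_to_utf8(const unsigned char *in, size_t n, char *out)
{
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned int cp = in[i] ? in[i] : 0x100;
        if (cp < 0x80) {
            out[j++] = (char)cp;                   /* one-byte form */
        } else {
            out[j++] = (char)(0xC0 | (cp >> 6));   /* lead byte */
            out[j++] = (char)(0x80 | (cp & 0x3F)); /* continuation */
        }
    }
    out[j] = '\0';
    return j;
}

/* Reverse of bytes_to_utf8. out must hold at least strlen(in) bytes.
 * Returns the number of raw bytes written, or (size_t)-1 if the input
 * is not something bytes_to_utf8 could have produced. */
size_t utf8_to_bytes(const char *in, unsigned char *out)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t j = 0;
    while (*p) {
        unsigned int cp;
        if (*p < 0x80) {
            cp = *p++;
        } else if ((*p & 0xE0) == 0xC0 && (p[1] & 0xC0) == 0x80) {
            cp = ((unsigned int)(*p & 0x1F) << 6) | (p[1] & 0x3F);
            p += 2;
            if (cp < 0x80)
                return (size_t)-1;                 /* overlong form */
        } else {
            return (size_t)-1;                     /* 3+ byte or stray byte */
        }
        if (cp == 0x100)
            cp = 0;                                /* undo the zero escape */
        else if (cp > 0xFF)
            return (size_t)-1;
        out[j++] = (unsigned char)cp;
    }
    return j;
}
```

The string written by bytes_to_utf8 is NUL-terminated and valid UTF-8,
so it should be acceptable to json_string(). With this mapping, byte
255 comes out as the two bytes 195 and 191.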

Jonathan Castello

Apr 21, 2010, 3:57:13 AM
to jansso...@googlegroups.com
On Wed, Apr 21, 2010 at 12:44 AM, Petri Lehtinen <pe...@digip.org> wrote:
> json_string() requires a UTF-8 encoded string as the argument. Byte
> 255 is not valid UTF-8. The UTF-8 encoded representation of 255
> consists of two bytes: 195 and 191.

Hmm, I see. Apparently that goes for characters 128 through 254 as well.

> To exploit Jansson so that you can use arbitrary binary data in
> strings, you must encode your data in UTF-8 before JSON encoding, and
> decode the UTF-8 strings back to raw binary after JSON decoding. You
> must also escape zero bytes somehow, as Jansson doesn't allow them
> (even though JSON does).

I meant to mention that too, actually. Jansson claims full UTF-8
support, yet it doesn't directly support U+0000. Is this something to
look forward to in Jansson 2.0?

Also, I notice that there's a utf8_encode function defined in utf.c,
but either it's not used anywhere or Intellisense has bailed on me
again. If I called that before using json_string(), would that encode
128 - 255 properly?

~Jonathan Castello

Petri Lehtinen

Apr 21, 2010, 4:11:07 AM
to jansso...@googlegroups.com
Jonathan Castello wrote:
> On Wed, Apr 21, 2010 at 12:44 AM, Petri Lehtinen <pe...@digip.org> wrote:
> > json_string() requires a UTF-8 encoded string as the argument. Byte
> > 255 is not valid UTF-8. The UTF-8 encoded representation of 255
> > consists of two bytes: 195 and 191.
>
> Hmm, I see. Apparently that goes for characters 128 through 254 as well.
>
> > To exploit Jansson so that you can use arbitrary binary data in
> > strings, you must encode your data in UTF-8 before JSON encoding, and
> > decode the UTF-8 strings back to raw binary after JSON decoding. You
> > must also escape zero bytes somehow, as Jansson doesn't allow them
> > (even though JSON does).
>
> I meant to mention that too, actually. Jansson claims full UTF-8
> support, yet it doesn't directly support U+0000. Is this something to
> look forward to in Jansson 2.0?

I'll see if it's easy to implement support for strings with embedded
zero bytes. Other people have requested this, too.

Actually, I know it's quite easy, but I'm only willing to do it if
I manage to come up with a smart API for it.

> Also, I notice that there's a utf8_encode function defined in utf.c,
> but either it's not used anywhere or Intellisense has bailed on me
> again. If I called that before using json_string(), would that encode
> 128 - 255 properly?

utf8_encode() is used in src/load.c to convert \u escapes to UTF-8. It
would do the trick for you, yes, although the Windows APIs might have
functions to do UTF-8 encoding for whole strings at once.
utf8_encode() encodes a single Unicode code point (or byte) at a time.


Petri
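
For readers who want to see what encoding a single code point involves,
here is a standalone sketch. The function name encode_code_point and
its signature are hypothetical; this is not Jansson's internal
utf8_encode(), whose exact interface is not shown in this thread.

```c
#include <stddef.h>

/* Write the UTF-8 encoding of code point cp into out (up to 4 bytes).
 * Returns the number of bytes written, or 0 if cp is a surrogate or
 * lies beyond U+10FFFF. No terminating NUL is written. */
size_t encode_code_point(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* U+0000..U+007F: 1 byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* U+0080..U+07FF: 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)      /* surrogates are not encodable */
        return 0;
    if (cp < 0x10000) {                    /* U+0800..U+FFFF: 3 bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* U+10000..U+10FFFF: 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling it once per input byte and concatenating the results gives the
UTF-8 string for bytes 128 - 255; for example, code point 255 yields
the two bytes 195 and 191.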