unable to decode byte ...why, when it was originally encoded with jansson?


Doug Hellinger

Jun 4, 2012, 4:16:19 AM
to jansso...@googlegroups.com

Hello,


I am having difficulty understanding why I am unable to decode JSON data which was encoded with jansson in the first place and then dumped into hex bytes (using xxd Hex Dumper). I am getting a jansson error trying to load JSON formatted data which has strings in German language (i.e. with characters outside of the basic ASCII character set). The same error occurs whether loading from string or buffer source.


The string element reads something like:


“unable to decode byte 0xf6”


I have seen the error occur on a 0xdf too, and I assume it will on any character above 0x7f.


When encoding using json_dumps(), if I pass the JSON_ENSURE_ASCII flag, then all of the bytes are in standard ASCII range and the json_loads() function works okay. Since I want to load the JSON in UTF-8 anyway, why should I have to escape all of the extended characters?


Why can't I seem to load JSON that is UTF-8 encoded?


Happy to hear your suggestions or advice...


Doug




Petri Lehtinen

Jun 4, 2012, 4:47:09 AM
to jansso...@googlegroups.com
Doug Hellinger wrote:
> Hello,
>
>
> I am having difficulty understanding why I am unable to decode JSON data which
> was encoded with jansson in the first place and then dumped into hex bytes
> (using xxd Hex Dumper). I am getting a jansson error trying to load JSON
> formatted data which has strings in German language (i.e. with characters
> outside of the basic ASCII character set). The same error occurs whether
> loading from string or buffer source.

I don't understand. Why do you dump it with xxd? xxd's output is like this:

00000000: c3a4 0a0a

This is not valid JSON, so you cannot decode it :)

> The string element reads something like:
>
>
> “unable to decode byte 0xf6”
>
>
> I have seen the error occur on a 0xdf too, and I assume it will on any
> character above 0x7f.
>
>
> When encoding using json_dumps(), if I pass the JSON_ENSURE_ASCII flag, then
> all of the bytes are in standard ASCII range and the json_loads() function
> works okay. Since I want to load the JSON in UTF-8 anyway, why should I have to
> escape all of the extended characters?
>
>
> Why can't I seem to load JSON that is UTF-8 encoded?

The only reason is that your input is not UTF-8 after all. Are you
sure it isn't ISO-8859-1? It seems to me that both 0xf6 and 0xdf are
valid German language ISO-8859-1 characters (ö and ß).

0xf6 and 0xdf are valid first bytes of UTF-8 code units, too, but 0xf6
shouldn't appear in German text (all Unicode code points whose UTF-8
representation starts with 0xf6 are outside the BMP).

Petri

Doug Hellinger

Jun 4, 2012, 5:53:36 AM
to jansso...@googlegroups.com

Sorry. I should clarify that I call xxd with the -i option to output in C include file style. This outputs a C header file which looks like this...


unsigned char info_json[] = {
0x5b, 0x0d, 0x0a, 0x20, 0x20, 0x7b, 0x0d, 0x0a, 0x20, 0x20, 0x20, 0x20,
0x22, 0x65, 0x6e, 0x74, 0x72, 0x79, 0x5f, 0x69, 0x64, 0x22, 0x3a, 0x20,
// cut a lot of bytes...
};
unsigned int info_json_len = 179314;

 


I'm fairly sure this does nothing to change the text in any way, it is just a hex representation of the JSON data.


> 0xf6 and 0xdf are valid first bytes of UTF-8 code units, too, but 0xf6
> shouldn't appear in German text (all Unicode code points whose UTF-8
> representation starts with 0xf6 are outside the BMP).


What do you mean here by outside the BMP? Is 0xf6 not a valid UTF-8 code? It is certainly in the German text I have.

Petri Lehtinen

Jun 4, 2012, 6:24:33 AM
to jansso...@googlegroups.com
Doug Hellinger wrote:
> 0xf6 and 0xdf are valid first bytes of UTF-8 code units, too, but 0xf6
> shouldn't appear in German text (all Unicode code points whose UTF-8
> representation starts with 0xf6 are outside the BMP).
>
> What do you mean here by outside the BMP? Is 0xf6 not a valid UTF-8
> code? It is certainly in the German text I have.

Neither 0xdf nor 0xf6 is valid UTF-8 on its own. To be part of a
valid UTF-8 stream, 0xdf must be followed by exactly one continuation
byte, and 0xf6 by exactly three continuation bytes. Any byte in the
0x80-0xbf range is a continuation byte.

See http://en.wikipedia.org/wiki/UTF-8 for how UTF-8 works.
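
The lead-byte and continuation-byte rule described above can be sketched
in C as a small structural check. This is only an illustration, not how
Jansson validates its input: the names utf8_trail_count and
utf8_first_bad_byte are hypothetical, and the check ignores overlong
forms and surrogates. It also assumes the stricter RFC 3629 rule, which
caps UTF-8 at U+10FFFF, so lead bytes 0xf5 and above (including 0xf6)
are rejected outright rather than treated as the start of a 4-byte
sequence.

```c
#include <assert.h>
#include <stddef.h>

/* Number of continuation bytes a lead byte requires, or -1 if the
 * byte cannot start a UTF-8 sequence (assumption: RFC 3629 limits). */
static int utf8_trail_count(unsigned char lead)
{
    if (lead < 0x80) return 0;                    /* plain ASCII */
    if (lead >= 0xC2 && lead <= 0xDF) return 1;   /* 2-byte sequence */
    if (lead >= 0xE0 && lead <= 0xEF) return 2;   /* 3-byte sequence */
    if (lead >= 0xF0 && lead <= 0xF4) return 3;   /* 4-byte sequence */
    return -1;                                    /* 0x80-0xC1, 0xF5-0xFF */
}

/* Hypothetical helper: offset of the first byte that breaks the
 * lead/continuation structure, or len if the whole buffer passes.
 * Structural check only; overlong forms and surrogates are not caught. */
size_t utf8_first_bad_byte(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    assert(buf != NULL || len == 0);
    while (i < len) {
        int trail = utf8_trail_count(buf[i]);

        /* Invalid lead byte, or sequence truncated by end of buffer. */
        if (trail < 0 || (size_t)trail > len - i - 1)
            return i;

        /* Every following byte must look like 10xxxxxx. */
        for (int k = 1; k <= trail; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return i;
        }
        i += (size_t)trail + 1;
    }
    return len;
}
```

Calling utf8_first_bad_byte(info_json, info_json_len) on an array like
the one xxd -i generates would return the offset of the first suspicious
byte, which can then be compared with the byte named in the error
message.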

With the BMP reference I was talking about Unicode planes. All western
characters (including the German characters) are in the Basic
Multilingual Plane (BMP), i.e. the first 65536 code points of
Unicode. If a UTF-8 sequence starts with 0xf6, it encodes a character
whose code point is at least 1572864 (0x180000).
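
The 1572864 figure follows from the bit layout of a 4-byte sequence
(11110xxx 10xxxxxx 10xxxxxx 10xxxxxx): the three payload bits of the
lead byte sit above 18 bits contributed by the continuation bytes. A
minimal sketch of that arithmetic, with min_code_point_4byte as a
hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Smallest code point that a 4-byte sequence with this lead byte could
 * encode by bit layout alone, i.e. with every continuation payload bit
 * set to zero. Hypothetical helper for illustration only. */
uint32_t min_code_point_4byte(unsigned char lead)
{
    assert((lead & 0xF8) == 0xF0);          /* must match 11110xxx */
    return (uint32_t)(lead & 0x07) << 18;   /* 3 payload bits above 18 */
}
```

For 0xf6 the payload bits are 110 (6), and 6 << 18 is 1572864. That is
far above the BMP limit of 0xFFFF, and also above U+10FFFF, the highest
code point Unicode defines.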

Did you check that your data really is valid UTF-8 with something other
than Jansson? My best guess still is that it's really ISO-8859-1. For
example, you can call iconv on it:

$ iconv -f utf-8 -t utf-8 < /path/to/latin1.txt
iconv: illegal input sequence at position 0

If you see an error, it's not valid UTF-8.

Petri

Petri Lehtinen

Jun 4, 2012, 6:26:57 AM
to jansso...@googlegroups.com
Petri Lehtinen wrote:
> Did you check that your data really is valid UTF-8 with something other
> than Jansson? My best guess still is that it's really ISO-8859-1. For
> example, you can call iconv on it:
>
> $ iconv -f utf-8 -t utf-8 < /path/to/latin1.txt
> iconv: illegal input sequence at position 0
>
> If you see an error, it's not valid UTF-8.

Forgot to say:

With iconv, you can also convert your data from ISO-8859-1 to UTF-8:

$ iconv -f iso-8859-1 -t utf-8 < data.txt > data_as_utf8.txt
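
If the data is already embedded in a C program and running iconv is
inconvenient, the same ISO-8859-1 to UTF-8 conversion is simple to do by
hand, because every ISO-8859-1 byte value equals its Unicode code point.
A minimal sketch, assuming the input really is ISO-8859-1;
latin1_to_utf8 is a hypothetical helper, not part of Jansson or iconv:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: convert len bytes of ISO-8859-1 to UTF-8.
 * The caller must provide an output buffer of at least 2 * len bytes.
 * Returns the number of bytes written; no terminator is appended. */
size_t latin1_to_utf8(const unsigned char *in, size_t len,
                      unsigned char *out)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < len; i++) {
        unsigned char c = in[i];

        if (c < 0x80) {
            out[n++] = c;                                   /* ASCII as-is */
        } else {
            /* U+0080..U+00FF become a 2-byte sequence 110xxxxx 10xxxxxx. */
            out[n++] = (unsigned char)(0xC0 | (c >> 6));
            out[n++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return n;
}
```

With this conversion the ISO-8859-1 byte 0xf6 (ö) becomes 0xc3 0xb6 and
0xdf (ß) becomes 0xc3 0x9f, which are the byte pairs a UTF-8 decoder
expects.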