decoding JSON payloads with pathological text encodings


mhcom...@gmail.com

Apr 25, 2018, 9:51:15 PM
to jackson-user
Hello,

I wanted to ask about a bizarre situation I've run into lately, decoding a uniquely weird kind of JSON. I have JSON documents coming from a web-crawling service, which fetches webpages from all over the world. These pages can be formatted in various insane text encodings, which I wish had never existed in the first place, such as Latin-1, the other Latin-X variants, the Windows-125x family, and, in my current case, EUC-JP and Shift-JIS.

The web crawler generates JSON out of this and feeds it into my system, and that JSON contains lots of hostile, sort-of-illegal input: corrupted, unpaired, or otherwise invalid surrogates, and other such byte sequences in the UTF-8.

Technically, Jackson can deserialize this "just fine", except not really: now you have a whole tree full of Java String instances with bogus / unknown / invalid / illegal characters inside, and tools farther downstream of me, using my APIs, are exploding when they try to deal with these insane bytes, which I need to clean up first. I could write something that walks the tree and un-corrupts all the Strings, but once the damage is inside a String it is very hard to get back to the original raw bytes and fix them the way they should be.

The good news is that Mozilla and some open-source hackers have made a library for dealing with these mangled inputs: https://github.com/albfernandez/juniversalchardet . However, every String in a single JSON input from the crawler can potentially be in a different encoding. So, instead of trying to guess the encoding of the entire raw JSON, I need to try to guess the encoding of each String before deserializing.

So, I wanted to ask: will the system let me create a custom StdDeserializer that takes over deserialization of String, even though it's a kind-of-magic built-in Java type and not a regular POJO? Then I could pass each String through the encoding detector and un-corrupt it, so that when Jackson assembles the whole structure, the corrupt Strings have been eliminated as far as possible and re-encoded into proper UTF-8, the way they always should have been.

Thanks,
Matthew.


Tatu Saloranta

Apr 25, 2018, 10:43:44 PM
to jackson-user
On Wed, Apr 25, 2018 at 6:51 PM, <mhcom...@gmail.com> wrote:
> Hello,
>
> I wanted to inquire about a bizarre situation I've run into, with decoding a
> certain uniquely weird kind of JSON lately. I have some JSONs which come
> from a web crawling service, which is fetching webpages from all over the
> world. These pages can be formatted in various insane text encodings, which
> I wish had never existed in the first place, such as Latin1, LatinX,
> Windows-12XX, and in my current case, EUC-JP and Shift-JIS.
>
> The web crawler is generating some JSON out of this, coming into my system,
> which contains lots of hostile and sort-of illegal inputs, such as
> corrupted, unpaired, or otherwise invalid surrogates and other such byte
> sequences in the UTF-8.

Right. This does happen, alas. :-/

> Technically, Jackson can deserialize this "just fine", except not really,
> because now you have a whole ton of Java String instances in this tree,
> which have bogus / unknown / invalid / illegal bytes inside, and some tools
> farther downstream from me, trying to use my APIs, are exploding when they
> are trying to deal with these insane bytes which I need to clean up first. I
> could try to make something that goes through and un-corrupts all the
> Strings in the tree, but it's very hard to try to access the original raw
> bytes from inside these damaged Strings and fix them the way they should be.
>
> The good news is that Mozilla and some open-source hackers have made a
> library for dealing with these mangled Strings:
> https://github.com/albfernandez/juniversalchardet . However, there is the
> possibility that every String in a single JSON input from the crawler can
> have some different encoding, So, instead of trying to guess the encoding on
> the entire raw JSON, I need to try and guess the encoding on each String
> before deserializing.
>
> So, I wanted to ask if the system will let me create a custom
> StdDeserializer, which steals the deserialization of String, even though
> it's a kind-of magic builtin Java type and not a regular POJO, so I can pass
> each String through the encoding detector and un-corrupt it, so that when
> Jackson assembles the whole structure, all of the corrupt Strings have been
> eliminated as much as possible, and re-encoded into proper UTF-8, the way
> they always should have been.

At the point where deserializers handle things, decoding has already been done, and information has potentially been lost or corrupted. But if we go down to a lower level, the decoder (`JsonParser`) is responsible for tokenization and is in a better position.

I would probably approach this from the perspective of using another library to detect the encoding and construct an `InputStreamReader` for that encoding (the library may offer that integration out of the box too), and then use the resulting reader to create the parser:

    JsonParser p = jsonFactory.createParser(reader);

which may then be given as the input source to `ObjectMapper` (or `ObjectReader`).

Jackson does not really have to know about the potential complexity of detecting the encoding and attempting to fix possible Unicode errors.
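In code, this suggestion might look roughly like the following. This is only a sketch: it assumes juniversalchardet's `UniversalDetector` API and a Jackson 2.x `JsonFactory`, and it guesses one encoding for the whole document.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.mozilla.universalchardet.UniversalDetector;

public class DetectThenParse {
    public static JsonNode parse(byte[] json) throws Exception {
        // Guess the encoding of the whole document first.
        UniversalDetector detector = new UniversalDetector(null);
        detector.handleData(json, 0, json.length);
        detector.dataEnd();
        String name = detector.getDetectedCharset();
        Charset cs = (name != null) ? Charset.forName(name) : StandardCharsets.UTF_8;

        // Build a Reader for the detected encoding and hand that to Jackson,
        // so Jackson itself never has to guess the encoding.
        Reader reader = new InputStreamReader(new ByteArrayInputStream(json), cs);
        JsonParser p = new JsonFactory().createParser(reader);
        return new ObjectMapper().readTree(p);
    }
}
```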

-+ Tatu +-

>
> Thanks,
> Matthew.
>

mhcom...@gmail.com

Apr 25, 2018, 11:50:54 PM
to jackson-user
On Wednesday, April 25, 2018 at 7:43:44 PM UTC-7, Tatu Saloranta wrote:
At the point where deserializers handle things, decoding has already been done, and information has potentially been lost or corrupted. But if we go down to a lower level, the decoder (`JsonParser`) is responsible for tokenization and is in a better position.

I would probably approach this from the perspective of using another library to detect the encoding and construct an `InputStreamReader` for that encoding (the library may offer that integration out of the box too), and then use the resulting reader to create the parser:

    JsonParser p = jsonFactory.createParser(reader);

which may then be given as the input source to `ObjectMapper` (or `ObjectReader`).

Jackson does not really have to know about the potential complexity of detecting the encoding and attempting to fix possible Unicode errors.

-+ Tatu +-

Yes, this would certainly be the preferable solution if I always knew which encoding to use for the entire JSON document, but sadly it can vary per String-valued field. Within a single document, every String value could potentially be in a different encoding.

So, instead of trying to guess the encoding of the entire raw JSON, I need to hook in and try to guess the encoding of each String-valued field while the String value for that field is being constructed.

So I am trying to understand: what is the right place to intercept the creation of the String for every String-valued field? There I could call the encoding guesser and construct the String or CharSequence for the field myself, doing some tricks to un-mangle the bytes.
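One standard trick for the un-mangling step itself (a sketch, nothing Jackson-specific): ISO-8859-1 decoding is lossless in Java, since every byte 0x00-0xFF maps to exactly one char. So if the damaged String was produced by a Latin-1-style decode, `getBytes(ISO_8859_1)` recovers the original byte sequence exactly, and those bytes can be re-decoded with the detected charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetRepair {
    /**
     * Re-decode a string whose bytes were (mis)read as ISO-8859-1.
     * Note this does NOT work if the bytes went through a lossy decode
     * (e.g. UTF-8 with replacement characters), because the original
     * bytes are already gone in that case.
     */
    public static String repair(String mojibake, Charset actual) {
        byte[] originalBytes = mojibake.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, actual);
    }

    public static void main(String[] args) {
        // Simulate Shift_JIS bytes that were wrongly decoded as Latin-1.
        Charset sjis = Charset.forName("Shift_JIS");
        String garbled = new String("日本語".getBytes(sjis), StandardCharsets.ISO_8859_1);
        System.out.println(repair(garbled, sjis)); // prints 日本語
    }
}
```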

Matthew.

mhcom...@gmail.com

Apr 26, 2018, 5:50:36 PM
to jackson-user
I could really use a hand with this one, if somebody has advice on how to intercept Jackson's String instance creation.

I found a few places where `new String(...)` is invoked but I am not sure which one is the right one for my purposes.

Matthew. 

Tatu Saloranta

Apr 26, 2018, 6:16:13 PM
to jackson-user
Deserializers ask the `JsonParser` for text, either via `getText()` or one of its variants (`nextTextValue()`). Decoding is handled by the parser, possibly eagerly (when `nextToken()` is called), possibly lazily (implementation dependent). Converting the byte stream into tokens is the parser's job, and not something deserializers have any direct effect on.

So you would need to reimplement the parser to make it flexible enough, and figure out how to pass information on the alternate encoding(s) somehow.

-+ Tatu +-

Matthew Hall

Apr 30, 2018, 7:23:00 PM
to jackso...@googlegroups.com
On Thu, Apr 26, 2018 at 3:16 PM, Tatu Saloranta <ta...@fasterxml.com> wrote:
Deserializers ask the `JsonParser` for text, either via `getText()` or one of its variants (`nextTextValue()`). Decoding is handled by the parser, possibly eagerly (when `nextToken()` is called), possibly lazily (implementation dependent). Converting the byte stream into tokens is the parser's job, and not something deserializers have any direct effect on.

So you would need to reimplement the parser to make it flexible enough, and figure out how to pass information on the alternate encoding(s) somehow.

-+ Tatu +-

Tatu,

This advice is extremely helpful.

Based on this, I first created a copy of ReaderBasedJsonParser with all of the "final" method modifiers stripped off, so that I can override those methods with hacked versions. Then I made a subclass of that, with extra logic hooking everything that creates or returns Strings from buffers or elsewhere, so that I can insert the dynamic encoding-detection trickery.

I have just one question left, about how to test or integrate this. The constructors for these classes are somewhat internal, with various mysterious Jackson objects in them:

public CustomJsonParser(IOContext ctxt, int features, Reader r, ObjectCodec codec, CharsToNameCanonicalizer st,
    char[] inputBuffer, int start, int end, boolean bufferRecyclable) {
    super(ctxt, features, r, codec, st, inputBuffer, start, end, bufferRecyclable);
}

public CustomJsonParser(IOContext ctxt, int features, Reader r, ObjectCodec codec, CharsToNameCanonicalizer st) {
    super(ctxt, features, r, codec, st);
}

I wanted to ask where I should look for examples of how to construct this custom parser on some test input, or how to hook it up to the existing exterior classes (ObjectMapper / JsonFactory and friends), so that I can run some input through my new code and learn how it will crash. ;)

Sincerely,
Matthew.

Tatu Saloranta

Apr 30, 2018, 7:34:39 PM
to jackson-user
Right, these are not really defined as extension points.

What might work better is defining an `InputDecorator`, which you can register with `JsonFactory`. It allows dynamic wrapping of the passed-in `Reader` or `InputStream`, and then you don't have to worry about the internal details of actually constructing the parser instance. The approach may also work with other formats, whereas extending the factory directly would not.
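Roughly, such a decorator might look like this. It is a sketch only: it assumes Jackson 2.x's `com.fasterxml.jackson.core.io.InputDecorator` plus the juniversalchardet detector, and it buffers the whole stream in memory, which is only reasonable for modest document sizes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

import com.fasterxml.jackson.core.io.IOContext;
import com.fasterxml.jackson.core.io.InputDecorator;
import org.mozilla.universalchardet.UniversalDetector;

public class TranscodingDecorator extends InputDecorator {
    @Override
    public InputStream decorate(IOContext ctxt, InputStream in) throws IOException {
        // Buffer the whole stream, detect the charset, and hand Jackson
        // clean UTF-8 bytes instead of the original mystery encoding.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        for (int n; (n = in.read(chunk)) != -1; ) {
            buf.write(chunk, 0, n);
        }
        byte[] raw = buf.toByteArray();

        UniversalDetector detector = new UniversalDetector(null);
        detector.handleData(raw, 0, raw.length);
        detector.dataEnd();
        String name = detector.getDetectedCharset();
        Charset cs = (name != null) ? Charset.forName(name) : StandardCharsets.UTF_8;

        byte[] utf8 = new String(raw, cs).getBytes(StandardCharsets.UTF_8);
        return new ByteArrayInputStream(utf8);
    }

    @Override
    public InputStream decorate(IOContext ctxt, byte[] src, int offset, int length) throws IOException {
        return decorate(ctxt, new ByteArrayInputStream(src, offset, length));
    }
}
```

Registration would then be a one-liner: `jsonFactory.setInputDecorator(new TranscodingDecorator());`.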

I hope this helps,

-+ Tatu +-

Matthew Hall

Apr 30, 2018, 7:40:38 PM
to jackso...@googlegroups.com
On Mon, Apr 30, 2018 at 4:34 PM, Tatu Saloranta <ta...@fasterxml.com> wrote:

Right, these are not really defined as extension points.

What might work better is defining an `InputDecorator`, which you can register with `JsonFactory`. It allows dynamic wrapping of the passed-in `Reader` or `InputStream`, and then you don't have to worry about the internal details of actually constructing the parser instance. The approach may also work with other formats, whereas extending the factory directly would not.

I hope this helps,

-+ Tatu +-

I'd love to do it that way for sure, but I believe that this input JSON is too broken to employ that sort of approach.

Each String value inside the JSON can have a different encoding! Imagine something truly, impressively awful, like this:

{ "utf8": "SOME_UTF8_STRING", "ascii": "SOME_ASCII_STRING", "shift-jis": "SOME_SHIFT_JIS_STRING", "latin1": "SOME_LATIN1_STRING", ... }

Therefore trying to fix the encoding of this whole input at the Reader or InputStream level isn't enough. I have to check it each time a new String-valued field is created in the parser itself, I think.

I have a class which should be close to doing this, but I need to understand how to construct the inputs to this parser so I can test it and see what is going to happen, I mean how my code is going to crash first. ;)

Matthew.

Tatu Saloranta

Apr 30, 2018, 7:42:04 PM
to jackson-user
Ok then you are unfortunately on your own.

You can try sub-classing and it may work (or may break at some point).

Good luck!

-+ Tatu +-

Matthew Hall

Apr 30, 2018, 7:44:55 PM
to jackso...@googlegroups.com
On Mon, Apr 30, 2018 at 4:42 PM, Tatu Saloranta <ta...@fasterxml.com> wrote:
Ok then you are unfortunately on your own.

You can try sub-classing and it may work (or may break at some point).

Good luck!

-+ Tatu +-

Totally understood: it can fail, and if it does, by all means I blame the JSON, not Jackson itself, because this input is crazy. :)

I am just wondering where I should look to see some examples of how those special constructors are invoked, to make the Parser object.

Then I can report back what kind of hilarious results this JSON might produce.

Matthew.

Tatu Saloranta

Apr 30, 2018, 8:04:40 PM
to jackson-user
Factory methods are called by ObjectMapper and ObjectReader; those are probably the best examples. It is possible to overload only some of the internal methods that these factory methods delegate to (2 or 3, instead of a dozen).

But it sounds like you would then also need to override many other accessors for textual data, and/or `nextToken()` and other methods.
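A sketch of that wiring, for illustration. It assumes Jackson 2.x internals: names like `_createParser`, `_parserFeatures`, `_objectCodec`, and `_rootCharSymbols.makeChild(...)` exist in 2.x but their exact signatures vary between minor versions, and `CustomJsonParser` here is the custom parser with the five-argument constructor shown earlier in the thread.

```java
import java.io.IOException;
import java.io.Reader;

import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.io.IOContext;

public class CustomJsonFactory extends JsonFactory {
    @Override
    protected JsonParser _createParser(Reader r, IOContext ctxt) throws IOException {
        // Return the hacked parser instead of ReaderBasedJsonParser.
        // The makeChild(...) argument list is version-dependent; check
        // what the JsonFactory you extend actually passes.
        return new CustomJsonParser(ctxt, _parserFeatures, r, _objectCodec,
                _rootCharSymbols.makeChild(_factoryFeatures));
    }
}
```

With that in place, `new ObjectMapper(new CustomJsonFactory())` routes all `Reader`-based parsing through the custom parser.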

-+ Tatu +-

mhcom...@gmail.com

May 14, 2018, 10:32:39 PM
to jackson-user
On Monday, April 30, 2018 at 5:04:40 PM UTC-7, Tatu Saloranta wrote:
Factory methods are called by ObjectMapper and ObjectReader; those are probably the best examples. It is possible to overload only some of the internal methods that these factory methods delegate to (2 or 3, instead of a dozen).

But it sounds like you would then also need to override many other accessors for textual data, and/or `nextToken()` and other methods.

-+ Tatu +-

Dear Tatu,

Your advice on this bizarre challenge was absolutely invaluable, and in the end it worked very nicely. I've managed to create a blasphemous JSON parser which is able to properly decode illegal inputs like this:

{ "key1_in_utf8": "value1_in_weird_encoding_1", "key2_in_utf8": "value2_in_weird_encoding_2", ... }


I have attached this special parser here, in case you want to see how it works, and especially in case you have any feedback on it. This code could serve as the basis for a future Jackson class or an extra / bonus module for handling JSON with text-encoding bugs:

1) the more common case of a whole JSON document in a wrong encoding (by running the detection at a low level, as ByteSourceJsonBootstrapper does, except supporting a much bigger set of encodings, at the cost of being more ugly and slow)

2) the rare, and never before seen (by me at least), case of JSON with mixed wrong encodings (which is what this parser handles now, at the cost of being exponentially more ugly and slow, of course)

The parser depends on a popular open-source encoding-detection library for Java: https://github.com/albfernandez/juniversalchardet , which is tri-licensed under MPL 1.1, GPL 2 or later, and LGPL 2.1 or later; that should be reasonably compatible with Jackson's licensing for downstream users who need this.

Hopefully this effort will assist other users who run into similar issues with insane JSONs.

Matthew.
RawReader.java
EncodingJsonFactory.java
CustomReaderBasedJsonParser.java
CustomJsonParser.java

Tatu Saloranta

May 14, 2018, 11:24:00 PM
to jackson-user
Very cool. Thank you for sharing it -- who knows? There are many lenient libs for HTML (TagSoup et al), so perhaps there is a need. And the first step often is to have source material to peek into, and maybe get ideas of how a new approach might work.