On Wed, Apr 25, 2018 at 6:51 PM, <mhcom...@gmail.com> wrote:
> Hello,
>
> I wanted to inquire about a bizarre situation I've run into, with decoding a
> certain uniquely weird kind of JSON lately. I have some JSONs which come
> from a web crawling service, which is fetching webpages from all over the
> world. These pages can be formatted in various insane text encodings, which
> I wish had never existed in the first place, such as Latin1, LatinX,
> Windows-12XX, and in my current case, EUC-JP and Shift-JIS.
>
> The web crawler is generating some JSON out of this, coming into my system,
> which contains lots of hostile and sort-of illegal inputs, such as
> corrupted, unpaired, or otherwise invalid surrogates and other such byte
> sequences in the UTF-8.
Right. This does happen, alas. :-/
> Technically, Jackson can deserialize this "just fine", except not really,
> because now you have a whole ton of Java String instances in this tree,
> which have bogus / unknown / invalid / illegal bytes inside, and some tools
> farther downstream from me, trying to use my APIs, are exploding when they
> are trying to deal with these insane bytes which I need to clean up first. I
> could try to make something that goes through and un-corrupts all the
> Strings in the tree, but it's very hard to try to access the original raw
> bytes from inside these damaged Strings and fix them the way they should be.
>
> The good news is that Mozilla and some open-source hackers have made a
> library for dealing with these mangled Strings:
>
> https://github.com/albfernandez/juniversalchardet . However, there is the
> possibility that every String in a single JSON input from the crawler can
> have a different encoding. So, instead of trying to guess the encoding of
> the entire raw JSON, I need to try to guess the encoding of each String
> before deserializing.
>
> So, I wanted to ask if the system will let me create a custom
> StdDeserializer, which steals the deserialization of String, even though
> it's a kind-of magic builtin Java type and not a regular POJO, so I can pass
> each String through the encoding detector and un-corrupt it, so that when
> Jackson assembles the whole structure, all of the corrupt Strings have been
> eliminated as much as possible, and re-encoded into proper UTF-8, the way
> they always should have been.
At the point where deserializers handle things, decoding has already
been done, and information has potentially been lost and/or corrupted.
But if we go down to a lower level, the decoder (`JsonParser`) is
responsible for tokenization, and is in a better position.
I would probably approach this from the perspective of using another
library to detect the encoding and construct an `InputStreamReader`
for that encoding (the library may offer that integration out of the
box too), and then use the resulting reader to create the parser:

    JsonParser p = jsonFactory.createParser(reader);

which may then be given as the input source to `ObjectMapper` (or `ObjectReader`).
Jackson then does not really have to know about the potential
complexity of detecting the encoding and attempting to fix possible
Unicode errors.
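To make the suggestion concrete, here is a minimal, self-contained sketch of the "validate/detect encoding, then hand Jackson a Reader" idea, using only the JDK's `CharsetDecoder`. The class name `EncodingAwareReader` and the hard-coded fallback list (strict UTF-8 first, then Shift_JIS) are illustrative assumptions; a real setup would likely plug in a detector library such as juniversalchardet instead of trial decoding.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingAwareReader {

    // Try each candidate charset with a *strict* decoder; the first one
    // that decodes the bytes without error wins. (Illustrative stand-in
    // for a real detector such as juniversalchardet's UniversalDetector.)
    static Reader readerFor(byte[] raw) {
        Charset[] candidates = {
            StandardCharsets.UTF_8,
            Charset.forName("Shift_JIS")  // example legacy fallback
        };
        for (Charset cs : candidates) {
            try {
                CharsetDecoder dec = cs.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT);
                // Decode eagerly so malformed input fails fast here,
                // not later inside the JSON parser.
                String text = dec.decode(ByteBuffer.wrap(raw)).toString();
                return new StringReader(text);
            } catch (CharacterCodingException e) {
                // Not this charset; try the next candidate.
            }
        }
        // Last resort: lenient UTF-8 decode, replacing bad sequences.
        return new StringReader(new String(raw, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        // Shift_JIS-encoded JSON payload, invalid as strict UTF-8.
        byte[] sjis = "{\"msg\":\"日本語\"}".getBytes("Shift_JIS");
        BufferedReader r = new BufferedReader(readerFor(sjis));
        System.out.println(r.readLine());
        // The resulting Reader would then be handed to Jackson, e.g.:
        //   JsonParser p = jsonFactory.createParser(reader);
        //   JsonNode tree = objectMapper.readTree(p);
    }
}
```

The point of the eager decode is that by the time Jackson sees anything, every byte has already been converted to a valid Java `char` sequence, so no corrupt Strings can surface downstream.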
-+ Tatu +-
>
> Thanks,
> Matthew.
>