JsonParseException: Invalid UTF-8 middle byte 0x28

ericd...@gmail.com

Dec 27, 2017, 6:36:44 PM
to jackson-user
I'm trying to parse a byte array (retrieved from an HttpServletRequest POST body) into a JsonNode object. Some clients use a dirty encoding that results in a JsonParseException: Invalid UTF-8 middle byte 0x28. If I convert the byte array to a String and then parse it using the same ObjectMapper, I don't have a problem. But this adds extra execution time to the service (small, but not insignificant for the application I have). The dirty encoding occurs in only two fields, and it is acceptable for the application to use null for these fields when that happens.

The question: is there a flag to configure Jackson to use null for fields that throw this exception? (Following the code suggests there is none.) Or is it possible to implement a custom parser or encoder for these fields that supports the required behavior?

Sample code to reproduce: 

import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

ObjectMapper objectMapper = new ObjectMapper();
objectMapper.configure(JsonParser.Feature.ALLOW_UNQUOTED_CONTROL_CHARS, true);
objectMapper.configure(JsonParser.Feature.ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER, true);
String s = "{\"id\":\"33226\",\"name\":\"Some Name\"}";
byte[] bytes = s.getBytes();

// simulating the dirty encoding
bytes[9] = (byte) 0xc3;
bytes[10] = (byte) 0x28;

// converting bytes to String and then parsing it works fine
System.out.println("Parsing String:");
JsonNode tree1 = objectMapper.readTree(new String(bytes, "UTF8"));
System.out.println(tree1);

try {
    // but parsing the dirty byte array directly throws an Exception
    System.out.println("Parsing byte array:");
    JsonNode tree2 = objectMapper.readTree(bytes);
    System.out.println(tree2);
} catch (Exception e) {
    e.printStackTrace();
}



Output:
Parsing String:
{"id":"33�(6","name":"Some Name"}
Parsing byte array:
com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 middle byte 0x28
 at [Source: (byte[])"{"id":"33�(6","name":"Some Name"}"; line: 1, column: 12]
 at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1798)
 at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:663)
 at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidOther(UTF8StreamJsonParser.java:3544)
 at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidOther(UTF8StreamJsonParser.java:3551)
 at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._decodeUtf8_2(UTF8StreamJsonParser.java:3324)
 at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2456)
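
The difference between the two paths in the sample above comes down to how malformed bytes are handled before parsing. The following is a minimal, JDK-only sketch (no Jackson involved) of what the String conversion does to the same bytes: `new String(byte[], Charset)` is documented to replace malformed input with the charset's replacement character rather than throw. The class and method names (`LenientDecodeDemo`, `dirtyBytes`, `decodeLeniently`) are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class LenientDecodeDemo {

    // Rebuilds the bytes from the sample: bytes 9 and 10 are overwritten.
    public static byte[] dirtyBytes() {
        byte[] bytes = "{\"id\":\"33226\",\"name\":\"Some Name\"}"
                .getBytes(StandardCharsets.UTF_8);
        bytes[9] = (byte) 0xc3;  // lead byte that announces a two-byte sequence
        bytes[10] = (byte) 0x28; // '(' is not a continuation byte (0x80..0xBF)
        return bytes;
    }

    // The String constructor never throws on malformed input; it substitutes
    // U+FFFD for the bad lead byte and then decodes '(' normally.
    public static String decodeLeniently(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String decoded = decodeLeniently(dirtyBytes());
        System.out.println(decoded);
        System.out.println("replacement char at index " + decoded.indexOf('\uFFFD'));
    }
}
```

So the String path does not recover the original character; it silently swaps in U+FFFD, which is why the parsed `id` contains `�`.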

Tatu Saloranta

Dec 27, 2017, 8:25:45 PM
to jackson-user
No. Invalid input is invalid input, and there's no way to fix that
since it is literally not known what character there might be.
Consider that UTF-8 encoding allows different byte lengths, so it is
not just one character that could be wrong but also one or two
of the following ones.

What you can do is, in order of preference:

1) Fix (or tell someone else to fix) whatever is producing invalid
content. That code is broken.
2) Use a single-byte encoding like ISO-8859-1 (latin-1) -- or one of
the other 8859-x encodings -- to decode.
You have to construct your own `InputStreamReader` with the encoding
of your choice. You can then pass that content to Jackson.
3) Pre-process content using some other mechanism (custom
InputStreamReader; use JDK's mechanisms that may opt to
speculatively give the U+FFFD replacement marker) to work around the problem.
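
Options 2 and 3 can be sketched with the JDK alone. The example below is an illustration under assumptions, not a recommended implementation: the class and method names are hypothetical, and the Jackson call is left out so the block has no dependencies. In a real service the returned `Reader` would be handed to `ObjectMapper.readTree(Reader)` instead of being drained into a String.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ReaderWorkarounds {

    // Option 2: ISO-8859-1 maps every byte to exactly one character, so
    // decoding cannot fail (but non-ASCII UTF-8 text will come out mangled).
    public static Reader latin1Reader(byte[] body) {
        return new InputStreamReader(
                new ByteArrayInputStream(body), StandardCharsets.ISO_8859_1);
    }

    // Option 3: decode as UTF-8 but explicitly ask the decoder to substitute
    // U+FFFD for malformed sequences instead of reporting an error.
    public static Reader replacingUtf8Reader(byte[] body) {
        return new InputStreamReader(
                new ByteArrayInputStream(body),
                StandardCharsets.UTF_8.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPLACE)
                        .onUnmappableCharacter(CodingErrorAction.REPLACE));
    }

    // Demo helper only: reads the whole Reader into a String.
    public static String drain(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[256];
        try (Reader r = reader) {
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] body = "{\"id\":\"33226\"}".getBytes(StandardCharsets.UTF_8);
        body[9] = (byte) 0xc3;
        body[10] = (byte) 0x28;
        System.out.println(drain(latin1Reader(body)));
        System.out.println(drain(replacingUtf8Reader(body)));
    }
}
```

Neither reader restores the lost character: the first turns byte 0xC3 into `Ã`, the second into `�`. Both only keep the parser from failing.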

I hope this helps,

-+ Tatu +-


Dawid Weiss

Dec 28, 2017, 2:45:49 AM
to jackso...@googlegroups.com
> 1) Fix (or tell someone else to fix) whatever is producing invalid
> content. That code is broken.

Strong +1 to this one, and thank you for stating this. We sometimes get
the same request from customers who have invalid UTF-8 (or XML files
with valid UTF-8 but illegal characters). I always try to redirect them
to fix the problem at the core; there is no better way.

> 2) Use a single-byte encoding like ISO-8859-1 (latin-1) -- or one of
> other 8859-x encodings -- to decode.

While technically a solution, this sets you back 30 years to the world
of byte-based codepages. If you dodge the problem now, it's going to
bite you in the future (somebody will complain sooner or later, and
it'll be even harder to diagnose where the illegal characters come from).

Fix the problem further up the processing chain. If this is not possible,
report and omit invalid input files.
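
One way to "report and omit" is to check the payload with a strict UTF-8 decoder before it goes any further. This is only a sketch with hypothetical names (`Utf8Gate`, `isValidUtf8`); it uses the JDK's `CharsetDecoder` with `CodingErrorAction.REPORT`, and it costs one extra pass over the bytes. Where the bytes go straight to Jackson's byte-based parser, catching its `JsonParseException` serves the same purpose.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Gate {

    // Returns true only if the whole buffer decodes as well-formed UTF-8.
    public static boolean isValidUtf8(byte[] body) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(body));
            return true;
        } catch (CharacterCodingException e) {
            // Malformed sequence found: the caller should log and reject.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] ok = "{\"id\":\"33226\"}".getBytes(StandardCharsets.UTF_8);
        byte[] bad = ok.clone();
        bad[9] = (byte) 0xc3;
        bad[10] = (byte) 0x28;
        System.out.println("clean payload valid: " + isValidUtf8(ok));
        System.out.println("dirty payload valid: " + isValidUtf8(bad));
    }
}
```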

Dawid

Tatu Saloranta

Dec 28, 2017, 4:10:09 PM
to jackson-user
Nothing much to add. I agree. Adding work-arounds often ends up
causing more work for everyone involved, without making anyone's
life easier.

-+ Tatu +-
