value string canonicalization?


Viktor Szathmáry

Mar 27, 2023, 5:15:10 PM
to jackson-user
Hi,

Is there a good way of applying the equivalent of INTERN_FIELD_NAMES but for field values? I'm guessing a custom deserializer could be applied to these fields, but different implementations would need to happen depending on the type of the underlying deserializer buffer (eg bytes or chars). What would be a reasonable design for this?

Thanks,
  Viktor

Tatu Saloranta

Mar 27, 2023, 8:02:01 PM
to jackso...@googlegroups.com
On Mon, Mar 27, 2023 at 2:15 PM Viktor Szathmáry <phra...@gmail.com> wrote:
>
> Hi,
>
> Is there a good way of applying the equivalent of INTERN_FIELD_NAMES but for field values? I'm guessing a custom deserializer

There is no such functionality, although I think it has been requested.
But the design is more complicated because different fields would
typically need different handling (that is, only some should be
canonicalized, and possibly as distinct value sets).

> could be applied to these fields, but different implementations would need to happen depending on the type of the underlying deserializer buffer (eg bytes or chars). What would be a reasonable design for this?

Right, it depends on which optimizations are important: if it is
important to avoid intermediate String generation then things get more
complicated. If not, then it could be handled by the deserializer itself
(accessing the String, canonicalizing as a second step).
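The two-step approach can be sketched without any Jackson types. This is a minimal sketch, assuming a hypothetical `ValuePool` helper (not part of Jackson) that a custom deserializer would call with the result of `p.getText()`. It pools values but still allocates the intermediate String on every call.

```java
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper: canonicalizes already-decoded Strings through a shared pool. */
final class ValuePool {
    // Maps each distinct value to its canonical instance; safe to share across threads.
    private final ConcurrentHashMap<String, String> pool = new ConcurrentHashMap<>();

    /** Returns the pooled instance equal to {@code value}, adding it on first sight. */
    String canonicalize(String value) {
        if (value == null) {
            return null;
        }
        // putIfAbsent returns the previously stored instance, or null if we just added ours.
        String existing = pool.putIfAbsent(value, value);
        return (existing != null) ? existing : value;
    }

    /** Number of distinct values currently pooled. */
    int size() {
        return pool.size();
    }
}
```

Because the map is concurrent, one pool instance could be shared by several deserializer instances. The pool is unbounded in this sketch, so it only suits fields with a small, known set of values.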

But there are different approaches even without String allocation:
decoding from input can be done (and generally is) into an intermediate
`char[]` buffer, so a callback could be used to get access to that
buffer (and offset, length), to avoid allocation but without having to
deal with UTF-8 (etc.) decoding.
That would probably require adding something like

String getText(DecodingCallback cb);

to JsonParser; it would need to be implemented by the various backends,
although it could default to something simple (actually calling
`getText()` or whatever).
But that obviously would be a change to `jackson-core` and not
something that can be done from outside.
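To make the callback idea concrete, here is a rough sketch in plain Java. Everything in it is hypothetical: `DecodingCallback` does not exist in jackson-core, the method name `onText` is invented, and `ToyParser` is only a stand-in for a parser that has already decoded a value into a `char[]` buffer.

```java
/** Hypothetical callback; no such type exists in jackson-core today. */
interface DecodingCallback {
    /** Receives the decoded characters in place and returns the String to use. */
    String onText(char[] buffer, int offset, int length);
}

/** Small helper for comparing a char[] region with a String without allocating. */
final class CharRegions {
    private CharRegions() { }

    static boolean regionEquals(char[] buf, int offset, int length, String s) {
        if (s.length() != length) {
            return false;
        }
        for (int i = 0; i < length; i++) {
            if (s.charAt(i) != buf[offset + i]) {
                return false;
            }
        }
        return true;
    }
}

/** Toy stand-in for a parser whose current value sits decoded in a char[] buffer. */
final class ToyParser {
    private final char[] buffer;
    private final int offset;
    private final int length;

    ToyParser(char[] buffer, int offset, int length) {
        this.buffer = buffer;
        this.offset = offset;
        this.length = length;
    }

    /** Simple default: always allocates a new String. */
    String getText() {
        return new String(buffer, offset, length);
    }

    /** Callback variant: the caller decides whether a String gets allocated. */
    String getText(DecodingCallback cb) {
        return cb.onText(buffer, offset, length);
    }
}
```

A caller that expects a known value could pass a callback that compares the region against that value and returns the existing instance on a match, falling back to `new String(...)` otherwise.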

-+ Tatu +-


Viktor Szathmáry

Mar 28, 2023, 2:11:14 AM
to jackso...@googlegroups.com
Hi,


> On Mar 28, 2023, at 02:01, Tatu Saloranta <ta...@fasterxml.com> wrote:
>
> On Mon, Mar 27, 2023 at 2:15 PM Viktor Szathmáry <phra...@gmail.com> wrote:
>
> Right, it depends on what are important optimizations: if it is
> important to avoid intermediate String generation then things get more
> complicated. If not, then it could be handled by deserializer itself
> (accessing String, canonicalizing as second step).

Indeed, the objective here is two-fold:
1) pull strings from a pool (improving locality/hashcode/equals down the line)
2) avoid string (or other intermediate buffer) allocation altogether (reducing GC pressure)

Achieving #1 is easy with a custom deserializer, I’m more interested in #2.
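As an illustration of goal #2, the sketch below resolves a `char[]` region to a pooled String and allocates only the first time a value is seen. It is a hypothetical, stdlib-only class (`CharSlicePool` is an invented name, not a Jackson type), it is not thread-safe, and it has no size limit.

```java
/** Hypothetical pool that resolves a char[] region to a canonical String. Not thread-safe. */
final class CharSlicePool {
    private String[] table = new String[64]; // power-of-two capacity, linear probing
    private int size;

    /** Same polynomial hash as String.hashCode(), computed over the region. */
    static int hash(char[] buf, int offset, int length) {
        int h = 0;
        for (int i = offset, end = offset + length; i < end; i++) {
            h = 31 * h + buf[i];
        }
        return h;
    }

    /** Returns the pooled String for the region; allocates only on first occurrence. */
    String find(char[] buf, int offset, int length) {
        int mask = table.length - 1;
        int i = hash(buf, offset, length) & mask;
        while (table[i] != null) {
            if (matches(table[i], buf, offset, length)) {
                return table[i]; // hit: no String is created
            }
            i = (i + 1) & mask;
        }
        String added = new String(buf, offset, length);
        table[i] = added;
        if (++size * 2 > table.length) {
            grow(); // keep the load factor at or below 0.5
        }
        return added;
    }

    int size() {
        return size;
    }

    private static boolean matches(String s, char[] buf, int offset, int length) {
        if (s.length() != length) {
            return false;
        }
        for (int i = 0; i < length; i++) {
            if (s.charAt(i) != buf[offset + i]) {
                return false;
            }
        }
        return true;
    }

    private void grow() {
        String[] old = table;
        table = new String[old.length * 2];
        int mask = table.length - 1;
        for (String s : old) {
            if (s == null) {
                continue;
            }
            int i = s.hashCode() & mask; // equals hash(...) of the same characters
            while (table[i] != null) {
                i = (i + 1) & mask;
            }
            table[i] = s;
        }
    }
}
```

The region would come from wherever the decoded characters live; the pool itself never needs an intermediate String for values it has already seen.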

Does jackson avoid creating intermediate strings when deserializing enum values?

> But there are different approaches even without String allocation:
> decoding from input can be done (and generally is) into intermediate
> `char[]` buffer, so callback could be used to get access to that
> buffer (and offset, length), to avoid allocation but without having to
> deal with UTF-8 (etc) decoding.
> For that there'd probably need to be addition of something like
>
> String getText(DecodingCallback cb);

Assuming my underlying buffers are byte arrays, is there a way to access that from a custom deserializer (without copying) and apply a ByteQuadsCanonicalizer?

Thanks,
Viktor

Tatu Saloranta

Mar 28, 2023, 2:51:37 PM
to jackso...@googlegroups.com
On Mon, Mar 27, 2023 at 11:11 PM Viktor Szathmáry <phra...@gmail.com> wrote:
>
> Hi,
>
>
> > On Mar 28, 2023, at 02:01, Tatu Saloranta <ta...@fasterxml.com> wrote:
> >
> > On Mon, Mar 27, 2023 at 2:15 PM Viktor Szathmáry <phra...@gmail.com> wrote:
> >
> > Right, it depends on what are important optimizations: if it is
> > important to avoid intermediate String generation then things get more
> > complicated. If not, then it could be handled by deserializer itself
> > (accessing String, canonicalizing as second step).
>
> Indeed, the objective here is two-fold:
> 1) pull strings from a pool (improving locality/hashcode/equals down the line)
> 2) avoid string (or other intermediate buffer) allocation altogether (reducing GC pressure)
>
> Achieving #1 is easy with a custom deserializer, I’m more interested in #2.

Ok, yes, that makes sense.

> Does jackson avoid creating intermediate strings when deserializing enum values?

Not currently, no (with JSON).

But as I said, decoding from bytes to chars does not by itself require
allocations.
So `parser.getTextCharacters()`, for example, typically avoids
allocations (except at buffer boundaries,
for long strings, etc.).

> > But there are different approaches even without String allocation:
> > decoding from input can be done (and generally is) into intermediate
> > `char[]` buffer, so callback could be used to get access to that
> > buffer (and offset, length), to avoid allocation but without having to
> > deal with UTF-8 (etc) decoding.
> > For that there'd probably need to be addition of something like
> >
> > String getText(DecodingCallback cb);
>
> Assuming my underlying buffers are byte arrays, is there a way to access that from a custom deserializer (without copying) and apply a ByteQuadsCanonicalizer?

No. It is not easy to link, either, as it is rather specialized decoding.

For text value handling I'd probably start with something other than
ByteQuadsCanonicalizer / CharsToNameCanonicalizer,
although the latter could probably work better.

Jackson 3.0 has yet another mechanism geared at name canonicalization,
in which BeanDeserializer sort of owns the lookup table.
That approach would probably work better for Enums etc. But, then
again, Jackson 3.0 is still not about to be released
(and this particular technique is difficult to backport). You can check
out the `JsonParser` methods in the `master` branch;
the class/interface is "PropertyNameMatcher". It is effectively a callback
and allows the caller to pass a Dictionary.
That approach might generalize to String values, not just property names.

-+ Tatu +-



Viktor Szathmáry

Mar 28, 2023, 6:24:52 PM
to jackso...@googlegroups.com


On Mar 28, 2023, at 20:51, Tatu Saloranta <ta...@fasterxml.com> wrote:

But as I said, decoding from bytes to chars does not by itself require
allocations.
So `parser.getTextCharacters()`, for example, typically avoids
allocations (except at buffer boundaries,
for long strings, etc.).

So would the following work as intended?

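import java.io.IOException;

import com.fasterxml.jackson.core.JacksonException;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.sym.CharsToNameCanonicalizer;
import com.fasterxml.jackson.databind.DeserializationContext;
import com.fasterxml.jackson.databind.JsonDeserializer;

public class JsonInterningDeserializer extends JsonDeserializer<String> {

    private final CharsToNameCanonicalizer can = CharsToNameCanonicalizer.createRoot();

    @Override
    public String deserialize(JsonParser p, DeserializationContext ctxt) throws IOException, JacksonException {
        // Read the decoded characters in place instead of calling getText()
        char[] buf = p.getTextCharacters();
        int start = p.getTextOffset();
        int len = p.getTextLength();
        // Look the character range up in the symbol table
        int hash = can.calcHash(buf, start, len);
        return can.findSymbol(buf, start, len, hash);
    }

}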

Thanks,
  Viktor


Tatu Saloranta

Mar 28, 2023, 6:58:25 PM
to jackso...@googlegroups.com
Impressive!

Yes, I think that shows a simple idea that could work as intended (in
this case, for `String`-valued properties).

-+ Tatu +-

Viktor Szathmáry

Mar 28, 2023, 7:30:15 PM
to jackso...@googlegroups.com



On Mar 29, 2023, at 00:58, Tatu Saloranta <ta...@fasterxml.com> wrote:

Yes, I think that shows a simple idea that could work as intended (in
this case, for `String`-valued properties).


Seems to work after creating a child with the canonicalize flag set (the root throws and without the flag no actual interning seems to happen).

    private final static CharsToNameCanonicalizer can = CharsToNameCanonicalizer
        .createRoot()
        .makeChild(JsonFactory.Feature.CANONICALIZE_FIELD_NAMES.getMask());

Since there are multiple fields with the custom deserializer applied (thus multiple deserializer instances) but I want them to share a string pool, I have also made it static. Is this thread-safe? I’m not sure I fully understand the parent/child relationship in CharsToNameCanonicalizer...

Also, what’s the expected behavior in case the string pool gets large?

Thanks,
 Viktor

Tatu Saloranta

Mar 28, 2023, 7:46:52 PM
to jackso...@googlegroups.com
Good questions. I may have spoken too soon -- I forgot that the
thread-safety is due to JsonParser instances having "child" instances,
and only syncing when "returning" internal state to the parent instance.

This is very different from stateless deserializers where there does
not exist a similar life-cycle (for parsers, the life-cycle is for
reading a single document).
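The parent/child life-cycle described here can be illustrated with a small stdlib-only sketch. It is loosely modeled on the description in this thread and is not the actual `CharsToNameCanonicalizer` implementation; `SharedSymbols`, `Child`, and `release()` are invented names. Each child works on a private view without locking and only synchronizes when it hands its additions back.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical root table shared between threads; children do the actual lookups. */
final class SharedSymbols {
    // Never mutated after publication; replaced wholesale under the lock.
    private Map<String, String> published = new HashMap<>();

    /** Hands out a child bound to one unit of work (for a parser: one document). */
    synchronized Child makeChild() {
        return new Child(this, published);
    }

    synchronized int size() {
        return published.size();
    }

    /** Called when a child is released; this is the only step that synchronizes. */
    private synchronized void mergeFrom(Map<String, String> childTable) {
        Map<String, String> merged = new HashMap<>(published);
        for (Map.Entry<String, String> e : childTable.entrySet()) {
            merged.putIfAbsent(e.getKey(), e.getValue()); // first instance merged wins
        }
        published = merged;
    }

    /** Single-threaded view; must be released by its owner and not reused afterwards. */
    static final class Child {
        private final SharedSymbols parent;
        private final Map<String, String> snapshot; // shared, read-only
        private Map<String, String> local;          // private copy, created on first add

        Child(SharedSymbols parent, Map<String, String> snapshot) {
            this.parent = parent;
            this.snapshot = snapshot;
        }

        String canonicalize(String value) {
            Map<String, String> current = (local != null) ? local : snapshot;
            String existing = current.get(value);
            if (existing != null) {
                return existing;
            }
            if (local == null) {
                local = new HashMap<>(snapshot); // copy on first write
            }
            local.put(value, value);
            return value;
        }

        void release() {
            if (local != null) {
                parent.mergeFrom(local);
                local = null;
            }
        }
    }
}
```

The sketch shows why the pattern needs a clear end-of-work point: a stateless deserializer has no natural place to call `release()`. Note also that two children running concurrently may each create their own instance of the same new value, so identity is only stable for values already merged into the parent.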

So I am not sure this approach actually works (or even can work).

Behavior is to clear the whole cache if it reaches a critical size; it's
not designed as an LRU cache but more for a bounded set of names
(so clearing is a defensive mechanism for an unexpected situation).
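The clear-on-overflow behavior can be mimicked in a few lines. This is a hypothetical sketch (`FlushingPool` and its limit parameter are invented, and the real size threshold in Jackson is not stated in this thread); it is not thread-safe.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical bounded pool: drops everything once a size limit is hit. Not thread-safe. */
final class FlushingPool {
    private final int maxEntries;
    private final Map<String, String> pool = new HashMap<>();
    private int flushes;

    FlushingPool(int maxEntries) {
        if (maxEntries < 1) {
            throw new IllegalArgumentException("maxEntries must be >= 1");
        }
        this.maxEntries = maxEntries;
    }

    /** Returns the pooled instance, clearing the whole pool first if it is full. */
    String canonicalize(String value) {
        String existing = pool.get(value);
        if (existing != null) {
            return existing;
        }
        if (pool.size() >= maxEntries) {
            pool.clear(); // no LRU eviction: start over from empty
            flushes++;
        }
        pool.put(value, value);
        return value;
    }

    int size() {
        return pool.size();
    }

    /** How many times the pool has been cleared. */
    int flushes() {
        return flushes;
    }
}
```

With this policy, values pooled before a flush lose their canonical identity, so callers can rely on `equals()` but not on `==` across a flush.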

-+ Tatu +-