Engine with YAML 1.2 support is out!


Andrey Somov

May 12, 2018, 10:02:23 AM
to snakeya...@googlegroups.com
Hi all,
There is a new project whose goal is to support YAML 1.2.


This is a successor of SnakeYAML. The idea is to split SnakeYAML into smaller projects.
The first is the Engine. It will be responsible for the spec and will provide a low-level API to create basic Java structures. Other projects will deliver high-level APIs to create JavaBeans or any other custom instances (for JDK, for Android, etc.).

The project is at a very early stage. There is no release yet, and it is not yet published in Maven Central.
Feel free to check out the source, review the code and make proposals to meet your expectations. What does your project need?
After the first release it will be more difficult to introduce a backwards-incompatible change. Be quick!

Cheers,
The SnakeYAML team

Joachim Durchholz

May 12, 2018, 1:37:50 PM
to snakeya...@googlegroups.com
Hi team,

I like the general approach a lot, just three nits to pick.

What's "basic Java structures"? Classes with public fields?
Rejecting Javabeans is slightly odd, unless you really mean public
fields. Even then, the Engine's parser needs to be able to construct
arbitrary classes directly, otherwise you force a translation layer
between basic data structures and whatever the application is using.
I.e. Snakeyaml needs to offer a way to explicitly map YAML constructs to
Java constructor calls / setter calls / attribute writes.

If the Engine is supposed to be internal, talking about the character
set representation does not make much sense: For better or for worse,
Java Strings are UTF-16; conversion to and from other character sets
(including UTF-8 and -16) happens when talking to streams.
My main fear is that direct support for the latter two will slow down
the code paths for UTF-16.
So if you plan to add character support at the engine level, I'd want to
know the reasoning.

What's missing is security considerations. In particular, *in default
mode*, the YAML input stream must not be able to instruct the Engine to
call arbitrary constructors, because that's a security hole wide enough
to drive arbitrary payload through.
At the very least, the parser configuration code needs to be able to
explicitly whitelist constructors; it may be necessary to whitelist
setter calls and attribute writes as well.
(The most direct approach that I can come up with is telling the YAML
engine to whitelist methods and fields that carry a given annotation.
Which annotation: that should be configurable; I can imagine situations
where you have multiple parsers, and want to allow different sets of
types for deserialization.)
I'm pretty sure you guys have security on the radar, and it just didn't
make it into the Overview page. It *should* be mentioned there though,
otherwise people will be unsure whether it's suitable.
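
The annotation-based whitelist suggested above could look roughly like the following. This is only a sketch under stated assumptions: `YamlConstructible`, `WhitelistingFactory`, `Point` and `Dangerous` are hypothetical names invented for illustration and are not part of SnakeYAML or the Engine. The marker annotation is passed in as a parameter, so different parsers could use different markers. Only reference-typed constructor parameters are handled.

```java
import java.lang.annotation.Annotation;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Constructor;

// Hypothetical marker annotation (an assumption, not a SnakeYAML type).
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.CONSTRUCTOR)
@interface YamlConstructible {}

// A type the application explicitly opens up for deserialization.
class Point {
    final int x;
    final int y;

    @YamlConstructible
    Point(Integer x, Integer y) {
        this.x = x;
        this.y = y;
    }
}

// A type that is not whitelisted: no constructor carries the marker.
class Dangerous {
    Dangerous(String command) {}
}

// Calls only constructors that carry the configured marker annotation.
class WhitelistingFactory {
    private final Class<? extends Annotation> marker;

    WhitelistingFactory(Class<? extends Annotation> marker) {
        this.marker = marker;
    }

    Object construct(Class<?> type, Object... args) {
        for (Constructor<?> candidate : type.getDeclaredConstructors()) {
            if (candidate.isAnnotationPresent(marker)
                    && accepts(candidate.getParameterTypes(), args)) {
                try {
                    return candidate.newInstance(args);
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException("Construction failed", e);
                }
            }
        }
        // Default is to refuse: unmarked constructors are never invoked.
        throw new SecurityException("No whitelisted constructor for " + type.getName());
    }

    // Simplified matching: reference-typed parameters only, no primitives.
    private static boolean accepts(Class<?>[] parameterTypes, Object[] args) {
        if (parameterTypes.length != args.length) {
            return false;
        }
        for (int i = 0; i < args.length; i++) {
            if (args[i] != null && !parameterTypes[i].isInstance(args[i])) {
                return false;
            }
        }
        return true;
    }
}
```

With this, `new WhitelistingFactory(YamlConstructible.class).construct(Point.class, 1, 2)` builds a `Point`, while the same call for `Dangerous.class` throws a `SecurityException`.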

Hope this helps.
Regards,
Jo
> --
> You received this message because you are subscribed to the Google
> Groups "SnakeYAML" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to snakeyaml-cor...@googlegroups.com
> <mailto:snakeyaml-cor...@googlegroups.com>.
> To post to this group, send email to snakeya...@googlegroups.com
> <mailto:snakeya...@googlegroups.com>.
> Visit this group at https://groups.google.com/group/snakeyaml-core.
> For more options, visit https://groups.google.com/d/optout.

Andrey Somov

May 12, 2018, 5:05:25 PM
to snakeya...@googlegroups.com
Hi Jo,
thank you for your comments.

1)
>What's "basic Java structures"? Classes with public fields?
I fixed it. There is more explanation on the wiki: String, List<Integer>, Map<String, Boolean>

2)
>Rejecting Javabeans is slightly odd, unless you really mean public fields.
It is not rejected. It is explicitly not supported out of the box. The user has to write code or use another library (which will come later and will depend on the Engine).
It looks like a number of players do not use the Javabeans support of SnakeYAML (Jackson, Spring Boot, Liquibase, RAML, etc.).
The Engine does not prohibit creating custom instances, it just does not support it without explicit coding.

3)
>Snakeyaml needs to offer a way to explicitly map YAML constructs to Java constructor calls / setter calls / attribute writes.
The question is: how should it be implemented? What should the API be?
Feel free to join and make a proposal.

4)
>If the Engine is supposed to be internal, talking about the character set representation does not make much sense
I am sorry, I do not get it.

5)
>For better or for worse, Java Strings are UTF-16, conversion to and from other character sets (including UTF-8 and -16) happen when talking to streams.
Well, it is more complex now. Before Unicode 2.0, Java Strings were indeed UTF-16. With the introduction of surrogates in Unicode 2.0, things changed.
In version 1.18 this was fixed (issue 323) at the price of a significant performance penalty, which we are trying to fix in 1.21.

6)
>I'm pretty sure you guys have security on the radar, and it just didn't make it into the Overview page. It *should* be mentioned there though, otherwise people will be unsure whether it's suitable.
Thank you. Security is mentioned on the overview page.

Cheers,
Andrey






--
Andrey Somov

Joachim Durchholz

May 13, 2018, 2:27:22 AM
to snakeya...@googlegroups.com
Hi Andrey,

thanks for the updates, things are clearer now.

> The Engine does not prohibit creating custom instances, it just does
> not support it without explicit coding.

I guess what's missing is how Snakeyaml is going to support user-defined
types.

> 3)
> > Snakeyaml needs to offer a way to explicitly map YAML constructs to Java
> > constructor calls / setter calls / attribute writes.
> The question is: how should it be implemented?

Multiple approaches are possible:
- Reflection. Comparatively slow, but Java does a decent job at
optimizing the standard use cases, and Snakeyaml's use case is pretty
standard.
- Adapter code that translates parse events to object updates, à la SAX.
The code could be manually written, generated at compile time, or
generated at runtime.

> What should the API be?

Again, multiple options:
- Annotations. Easy to use, particularly if you have Javabeans. Very
efficient to pick up. Difficult to cover all forms of use cases.
Probably too high-level because it can be built on top of other approaches.
- Publish a stream of parse events. Well-suited for a low-level Engine.
Exposes a low-level API that cannot be easily changed anymore.
- Publish the parse tree when it's done, à la DOM. Again, higher-level
than the event-stream approach.

Personally, I think it's the stream API.
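
To make the event-stream option concrete, here is a rough sketch of what such an API could look like and how a consumer could build basic Java structures on top of it. All names (`ParseEventHandler`, `ToyFlowParser`, `ListBuilder`) are hypothetical and are not the Engine's API. The toy parser only understands nested flow sequences of plain scalars such as `[a, [b, c], d]`, drops all whitespace, and assumes the input starts with `[`; it stands in for a real YAML parser purely to drive the events.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical low-level event API, in the spirit of SAX.
interface ParseEventHandler {
    void startSequence();
    void endSequence();
    void scalar(String value);
}

// Toy event producer: handles only flow sequences of plain scalars.
class ToyFlowParser {
    static void parse(String input, ParseEventHandler handler) {
        StringBuilder scalar = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '[' || c == ']' || c == ',') {
                // A structural character ends the scalar collected so far.
                if (scalar.length() > 0) {
                    handler.scalar(scalar.toString());
                    scalar.setLength(0);
                }
                if (c == '[') {
                    handler.startSequence();
                } else if (c == ']') {
                    handler.endSequence();
                }
            } else if (!Character.isWhitespace(c)) {
                scalar.append(c);
            }
        }
    }
}

// Consumer that turns the event stream into nested java.util.List instances.
class ListBuilder implements ParseEventHandler {
    private final Deque<List<Object>> open = new ArrayDeque<>();
    private List<Object> result;

    @Override
    public void startSequence() {
        open.push(new ArrayList<>());
    }

    @Override
    public void endSequence() {
        List<Object> done = open.pop();
        if (open.isEmpty()) {
            result = done;          // outermost sequence finished
        } else {
            open.peek().add(done);  // attach to the enclosing sequence
        }
    }

    @Override
    public void scalar(String value) {
        open.peek().add(value);
    }

    List<Object> result() {
        return result;
    }
}
```

A DOM-like tree or an annotation-driven mapper could be written as just another `ParseEventHandler`, which is the sense in which those options sit on top of the event stream.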

IDEs and similar tools want the ability to keep comments and formatting
everywhere except where they update things. I.e. any API needs to
preserve the raw input bytes.
To avoid memory use, don't store the bytes themselves, just store the
offset and length of each parsed Character sequence (length is necessary
to identify skipped content). It is the responsibility of the caller to
keep the input data in memory and map Character offsets to line/column
numbers if it needs to do this.
Also, Snakeyaml would have to be able to ignore encoding-invalid byte
sequences in the input, particularly if they are inside a comment.
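
A minimal sketch of the offset/length idea, assuming offsets are counted in Java chars rather than raw bytes and that lines are separated by `\n`. `Span` and `LineIndex` are hypothetical names for illustration. The parser would hand out only `Span` values; the caller keeps the input and builds the line index itself if it needs line/column numbers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// What the parser would report: a position in the input, not a copy of it.
class Span {
    final int offset;
    final int length;

    Span(int offset, int length) {
        this.offset = offset;
        this.length = length;
    }

    // The caller still owns the input and can recover the raw text on demand.
    CharSequence textIn(CharSequence input) {
        return input.subSequence(offset, offset + length);
    }
}

// Caller-side helper that maps offsets to 1-based line/column numbers.
class LineIndex {
    private final int[] lineStarts;

    LineIndex(CharSequence input) {
        List<Integer> starts = new ArrayList<>();
        starts.add(0);
        for (int i = 0; i < input.length(); i++) {
            if (input.charAt(i) == '\n') {
                starts.add(i + 1);
            }
        }
        lineStarts = starts.stream().mapToInt(Integer::intValue).toArray();
    }

    int lineOf(int offset) {
        int found = Arrays.binarySearch(lineStarts, offset);
        // Not found: binarySearch returns -(insertionPoint) - 1,
        // and the line containing the offset starts one entry earlier.
        int index = found >= 0 ? found : -found - 2;
        return index + 1;
    }

    int columnOf(int offset) {
        return offset - lineStarts[lineOf(offset) - 1] + 1;
    }
}
```

Gaps between consecutive spans are the skipped content (comments, whitespace), which a tool can copy through unchanged when it rewrites only selected values.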

> 4)
> > If the Engine is supposed to be internal, talking about the character
> > set representation does not make much sense
> I am sorry, I do not get it
>
> 5)
> > For better or for worse, Java Strings are UTF-16, conversion to and from
> > other character sets (including UTF-8 and -16) happen when talking to
> > streams.
> Well, it is more complex now. Before Unicode 2.0, Java Strings were indeed
> UTF-16. With the introduction of surrogates in Unicode 2.0, things changed.

Erm... actually, UTF-16 is the encoding that defined surrogate pairs.
UTF-16 didn't exist before Unicode 2.0.

Java added surrogate pair support at the character level but nothing
beyond that, leaving full Unicode support to libraries like ICU4J.
Not that I find any of this surprising. Surrogate pairs suck, but going
to 32-bit characters in the JVM would have sucked as well.

> In version 1.18 this was fixed (issue 323) at the price of a significant
> performance penalty, which we are trying to fix in 1.21.

Yeah, I guess things should be optimized for handling BMP characters
well, and do the surrogate-pair dance only where necessary.

Unfortunately, it does not seem to be easy. E.g. you still have to check
for every Java char whether it's part of a surrogate pair or not (AND with
0xf800 and check whether the result is 0xd800), and while the check is
simple, it needs to be repeated for every friggin' byte pair from the input.
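
The per-char check can be sketched as follows. Surrogates occupy 0xD800 to 0xDFFF, so the mask that isolates the range is 0xF800 and the expected result is 0xD800. `SurrogateScan` is a hypothetical name; `countCodePoints` is just one example of a loop that stays on the cheap path for BMP characters and only looks at a second char when a surrogate shows up.

```java
class SurrogateScan {
    // True for any char in 0xD800..0xDFFF (high or low surrogate).
    static boolean isSurrogate(char c) {
        return (c & 0xF800) == 0xD800;
    }

    // Counts code points; an unpaired surrogate counts as one code point.
    static int countCodePoints(CharSequence s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            // Fast path: BMP characters need no further inspection.
            if (isSurrogate(c)
                    && Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                i++; // skip the low half of a well-formed pair
            }
            count++;
        }
        return count;
    }
}
```

The JDK offers the same test as `Character.isSurrogate(char)`, and `String.codePointCount` performs the equivalent count.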


Hm. Reviewing the things I wrote above, I'm starting to wonder if String
is even the right data type for the Engine. Maybe it must be byte[].
I dimly recall there's an "unsafe" package in Java that has things like
buffers that can be cast between arrays of bytes and arrays of 16-bit
words; maybe that's helpful for implementing Snakeyaml using byte[].
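
For what it's worth, the supported way to get such a view without reaching for anything "unsafe" is `java.nio`: a `ByteBuffer` can be viewed as a `CharBuffer` without copying. The sketch below assumes the bytes are already UTF-16 big-endian with no BOM; `Utf16View` is a hypothetical helper name, and whether this would actually help the Engine is an open question.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.CharBuffer;

class Utf16View {
    // Returns a CharSequence view over the bytes; no data is copied,
    // so later changes to the array are visible through the view.
    static CharBuffer view(byte[] utf16BigEndian) {
        return ByteBuffer.wrap(utf16BigEndian)
                .order(ByteOrder.BIG_ENDIAN)
                .asCharBuffer();
    }
}
```

Since `CharBuffer` implements `CharSequence`, code written against `CharSequence` can consume the view directly. UTF-8 input would still need decoding first.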

Okay, enough for now. Hope this helps!

Regards,
Jo