Hi Andrey,
thanks for the updates, things are clearer now.
> The Engine does not prohibit to create custom instances, it just does
> not support it without explicit coding.
I guess what's missing is how Snakeyaml is going to support user-defined
types.
> 3)
>> Snakeyaml needs to offer a way to explicitly map YAML constructs to Java
>> constructor calls / setter calls / attribute writes.
> The question is : how it should be implemented ?
Multiple approaches are possible:
- Reflection. Comparatively slow, but Java does a decent job at
optimizing the standard use cases, and Snakeyaml's use case is pretty
standard.
- Adapter code that translates parse events to object updates, à la SAX.
The code could be manually written, generated at compile time, or
generated at runtime.
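To make the reflection option a bit more concrete, here is a minimal sketch. It is not Snakeyaml code: ReflectiveMapper, populate and the Point bean are names I made up, and I assume the YAML mapping has already been parsed into a plain Map.

```java
import java.lang.reflect.Method;
import java.util.Map;

/** Hypothetical helper, not part of Snakeyaml: fills a bean from an already-parsed mapping. */
final class ReflectiveMapper {

    /** Creates an instance via the no-arg constructor, then calls setFoo(value) for each key "foo". */
    static <T> T populate(Class<T> type, Map<String, ?> mapping) {
        try {
            T bean = type.getDeclaredConstructor().newInstance();
            for (Map.Entry<String, ?> e : mapping.entrySet()) {
                String key = e.getKey();
                String setter = "set" + Character.toUpperCase(key.charAt(0)) + key.substring(1);
                Method m = findSetter(type, setter);
                if (m == null) {
                    throw new IllegalArgumentException("no setter " + setter + " on " + type.getName());
                }
                m.invoke(bean, e.getValue()); // reflection unboxes Integer to int as needed
            }
            return bean;
        } catch (ReflectiveOperationException ex) {
            throw new IllegalArgumentException("cannot populate " + type.getName(), ex);
        }
    }

    /** Looks up a public one-argument method by name; returns null if there is none. */
    private static Method findSetter(Class<?> type, String name) {
        for (Method m : type.getMethods()) {
            if (m.getName().equals(name) && m.getParameterCount() == 1) {
                return m;
            }
        }
        return null;
    }
}

/** Example JavaBean, for illustration only. */
class Point {
    private int x;
    private int y;
    public Point() {}
    public void setX(int x) { this.x = x; }
    public void setY(int y) { this.y = y; }
    public int getX() { return x; }
    public int getY() { return y; }
}
```

Something like ReflectiveMapper.populate(Point.class, Map.of("x", 3, "y", 4)) would then give a Point with both fields set. Constructor arguments and direct attribute writes would need more machinery than this.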
> What should be the API ?
Again, multiple options:
- Annotations. Easy to use, particularly if you have Javabeans. Very
efficient to pick up. Difficult to cover all forms of use cases.
Probably too high-level because it can be built on top of other approaches.
- Publish a stream of parse events. Well-suited for a low-level Engine.
Exposes a low-level API that cannot be easily changed anymore.
- Publish the parse tree when it's done, à la DOM. Again, higher-level
than the event-stream approach.
Personally, I think the stream API is the right choice for the Engine.
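A rough sketch of what an event-stream API could look like, SAX-style. All names here (ParseEventHandler, EventLog, FakeParser) are hypothetical, and the real set of events would surely be richer (anchors, tags, document boundaries, comments).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical callback interface; a real Engine API may look quite different. */
interface ParseEventHandler {
    void startMapping();
    void endMapping();
    void startSequence();
    void endSequence();
    void scalar(String value);
}

/** Example consumer: records events as text, standing in for adapter code that updates objects. */
final class EventLog implements ParseEventHandler {
    final List<String> events = new ArrayList<>();

    @Override public void startMapping() { events.add("+MAP"); }
    @Override public void endMapping() { events.add("-MAP"); }
    @Override public void startSequence() { events.add("+SEQ"); }
    @Override public void endSequence() { events.add("-SEQ"); }
    @Override public void scalar(String value) { events.add("VAL " + value); }
}

/** Stand-in for a parser: pushes the events for the document {name: [a, b]} to a handler. */
final class FakeParser {
    static void emitExample(ParseEventHandler handler) {
        handler.startMapping();
        handler.scalar("name");
        handler.startSequence();
        handler.scalar("a");
        handler.scalar("b");
        handler.endSequence();
        handler.endMapping();
    }
}
```

The point of the design is that annotations and a DOM-like tree can both be layered on top of such a handler, while the reverse is not true.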
IDEs and similar tools want the ability to keep comments and formatting
everywhere except where they update things. I.e. any API needs to
preserve the raw input bytes.
To avoid memory use, don't store the bytes themselves, just store the
offset and length of each parsed character sequence (length is necessary
to identify skipped content). It is the responsibility of the caller to
keep the input data in memory and to map character offsets to line/column
numbers if it needs them.
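Roughly what I have in mind, as a sketch. Span and LineIndex are hypothetical names, and I assume offsets count Java chars (UTF-16 code units) and that lines end with '\n'; byte offsets would work the same way.

```java
import java.util.Arrays;

/** Hypothetical value type: where a parsed token sits in the caller-owned input. */
final class Span {
    final int offset; // index of the first character
    final int length; // number of characters

    Span(int offset, int length) {
        this.offset = offset;
        this.length = length;
    }

    /** The caller keeps the input, so the text is recovered on demand. */
    CharSequence textIn(CharSequence input) {
        return input.subSequence(offset, offset + length);
    }
}

/** Caller-side helper that maps character offsets to 1-based line/column numbers. */
final class LineIndex {
    private final int[] lineStarts;

    LineIndex(CharSequence input) {
        int[] starts = new int[16];
        int count = 1; // line 1 starts at offset 0
        for (int i = 0; i < input.length(); i++) {
            if (input.charAt(i) == '\n') {
                if (count == starts.length) {
                    starts = Arrays.copyOf(starts, count * 2);
                }
                starts[count++] = i + 1;
            }
        }
        lineStarts = Arrays.copyOf(starts, count);
    }

    /** Returns {line, column}, both 1-based, for a non-negative offset. */
    int[] lineAndColumn(int offset) {
        int idx = Arrays.binarySearch(lineStarts, offset);
        if (idx < 0) {
            idx = -idx - 2; // index of the last line start that is <= offset
        }
        return new int[] { idx + 1, offset - lineStarts[idx] + 1 };
    }
}
```

Anything between the end of one span and the start of the next is skipped content (whitespace, comments), which is why the length has to be stored along with the offset.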
Also, Snakeyaml would have to be able to ignore encoding-invalid byte
sequences in the input, particularly if they are inside a comment.
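One way to read "ignore" is to decode leniently and substitute U+FFFD for bad bytes instead of throwing. A sketch using the standard java.nio.charset decoder (LenientDecoding is a made-up class name, and I assume UTF-8 input):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: decodes UTF-8 without failing on invalid byte sequences. */
final class LenientDecoding {
    static String decodeUtf8Leniently(byte[] input) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)       // bad bytes become U+FFFD
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            return decoder.decode(ByteBuffer.wrap(input)).toString();
        } catch (CharacterCodingException e) {
            // Should not happen with REPLACE configured for both error kinds.
            throw new IllegalStateException(e);
        }
    }
}
```

Note that the decoded String alone no longer round-trips to the original bytes, so a tool that wants to write the file back unchanged still needs the raw input and the offsets.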
> 4)
>> If the Engine is supposed to be internal, talking about the character
>> set representation does not make much sense
> I am sorry, I do not get it
>
> 5)
>> For better or for worse, Java Strings are UTF-16, conversion to and from
>> other character sets (including UTF-8 and -16) happen when talking to
>> streams.
> Well, it more complex now. Before Unicode 2.0 indeed Java String were
> UTF-16. With introduction of surrogates in Unicode 2.0 things changed.
Erm... actually, UTF-16 is the encoding that defined surrogate pairs.
UTF-16 didn't exist before Unicode 2.0.
Java added surrogate pair support at the character level but nothing
beyond that, leaving full Unicode support to libraries like ICU4J.
Not that I find any of this surprising. Surrogate pairs suck, as would
going to 32-bit characters in the JVM.
> In version 1.18 it was fixed (issue 323) with the price of a significant
> performance penalty. Which we try to fix in 1.21
Yeah, I guess things should be optimized for handling BMP characters
well, and do the surrogate-pair dance only where necessary.
Unfortunately, it does not seem to be easy. E.g. you still have to check
every Java char whether it's part of a surrogate pair or not (AND with
0xF800 and check whether the result is 0xD800), and while the check is
simple, it needs to be repeated for every friggin' char from the input.
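For reference, surrogate code units occupy 0xD800..0xDFFF, so the test is a single mask-and-compare on the top five bits. A small sketch (the class name Surrogates is mine; the JDK's Character.isSurrogate gives the same answer):

```java
/** Sketch of a BMP-first scan that does the surrogate-pair dance only where necessary. */
final class Surrogates {

    /** True if c is any surrogate code unit (0xD800..0xDFFF). */
    static boolean isSurrogate(char c) {
        return (c & 0xF800) == 0xD800;
    }

    /** Counts code points; a valid high/low pair counts once, a lone surrogate counts as one. */
    static int countCodePoints(CharSequence s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (isSurrogate(c)
                    && Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                i++; // skip the low half of the pair
            }
            count++;
        }
        return count;
    }
}
```

The cheap check still runs once per char; only the pair handling is kept off the fast path.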
Hm. Reviewing the things I wrote above, I'm starting to wonder if String
is even the right data type for the Engine. Maybe it should be byte[].
I dimly recall there's an "unsafe" package in Java that has things like
buffers that can be cast between arrays of bytes and arrays of 16-bit
words; maybe that's helpful for implementing Snakeyaml using byte[].
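Whatever it was that I'm half-remembering, plain java.nio can do this kind of reinterpretation: a ByteBuffer wrapped around a byte[] can be viewed as a CharBuffer without copying. A sketch, assuming the bytes are UTF-16 big-endian (CharView is a made-up class name):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.CharBuffer;

/** Hypothetical helper: views a byte[] as 16-bit chars without copying. */
final class CharView {

    /** The returned buffer shares storage with the array; a trailing odd byte is not visible. */
    static CharBuffer asUtf16Chars(byte[] utf16be) {
        return ByteBuffer.wrap(utf16be)
                .order(ByteOrder.BIG_ENDIAN)
                .asCharBuffer();
    }
}
```

This only helps if the input really is UTF-16; for UTF-8 input the Engine would have to work on the bytes directly.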
Okay, enough for now. Hope this helps!
Regards,
Jo