JRuby now using SnakeYAML for Ruby 1.9 YAML support!


Charles Oliver Nutter

Dec 6, 2010, 5:03:05 AM
to SnakeYAML
I posted some time ago about eventually attempting to implement the
Ruby 1.9 YAML support ("Psych") with SnakeYAML, and tonight I've
finally done it.

The process was almost painless; the SnakeYAML API matches libyaml
closely, and Psych was implemented mostly in Ruby with only a small
core wrapping libyaml's exported functions. There have been only a few
places where things did not match libyaml which I'll summarize below.

I would like to congratulate Andrey on an excellent job combining the
libyaml API and his own implementation.

I hope to post more once I start running tests on Psych, but so far
it's definitely functional.

So, here are the issues I ran into:

* YAML is supposed to always be Unicode, so supporting arbitrary
encodings is not a big deal. However, SnakeYAML uses all Java strings,
which means it always pays the cost of decoding byte[] to char[], and
when going back out to Ruby we pay the cost to encode char[] to
byte[]. I don't think there's a way around this, but I have a concern
that we might have to make a fork of SnakeYAML in the future that can
deal with byte[] directly.

* I could see no way to update an Emitter's settings after it has been
created, as in the libyaml functions yaml_emitter_set_canonical,
yaml_emitter_set_indent, etc. For now, I'm updating the DumperOptions
I created the Emitter with and hoping they propagate.

* There was no way to specify encodings, presumably because all
incoming YAML data comes from a Reader and all outgoing YAML goes to a
Writer. This was a gap from the libyaml API, but it's more a challenge
of being on the JVM...we always work with UTF-16.

* Event should have a getID, so that the ID enum can be used in a
switch. This would be faster than chaining if (event.is(...)) calls,
which is what I had to do for now.
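
The decode/encode round trip from the first point, and the Reader/Writer
boundary from the third, can be sketched with plain JDK classes. This is
only an illustration under assumptions: the class name EncodingRoundTrip
and the helpers decode and encode are made up for this sketch, it does
not touch SnakeYAML or JRuby, and UTF-8 is assumed as the byte encoding
on the Ruby side.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
public class EncodingRoundTrip {

    // Inbound: bytes are decoded to Java chars (UTF-16) through a Reader.
    // The charset is chosen here, when the Reader is created.
    public static String decode(byte[] input, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(input), charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Outbound: chars are encoded back to bytes through a Writer.
    public static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Assumed input: UTF-8 bytes, as a Ruby string might hold them.
        byte[] rubyBytes = "name: \u0420\u0443\u0431\u0438\n".getBytes(StandardCharsets.UTF_8);
        String chars = decode(rubyBytes, StandardCharsets.UTF_8);   // byte[] -> char[]
        byte[] back = encode(chars, StandardCharsets.UTF_8);        // char[] -> byte[]
        System.out.println(Arrays.equals(rubyBytes, back));
    }
}
```

Each pass makes a char-based copy on the way in and a new byte[] on the
way out, which is the overhead a byte[]-based parser would avoid.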

That's about it. Thanks again! Please let me know if you have any
questions.

Andrey

Dec 6, 2010, 8:03:38 AM
to SnakeYAML
Hi Charles,
Nice to know you use SnakeYAML!
The fact that SnakeYAML matches libyaml (and PyYAML) is no
coincidence. This is done to share the experience (and it works for
you!) and also to be sure that the same document fails or succeeds in
all the parsers. Even the problem areas are respected
(http://pyyaml.org/wiki/YAMLColonInFlowContext).

I will try to answer your questions.

1) This is a complex issue. It is only trivial for ASCII. But for
Russian, Chinese and others there are many ways to convert
byte[] -> char[] and back. In Python 2 strings were also represented as
byte[], but this was changed in Python 3 and it helps to solve lots of
problems. I would like to see the code which does the conversion to
understand what you need and how SnakeYAML can help you (to avoid
maintaining a fork).

2) Yes, you are right. DumperOptions was introduced to change
Emitter's settings.

3) Indeed SnakeYAML does not explicitly define incoming or outgoing
encodings. But it does not mean that there is a gap from libyaml API.
When you create a Reader (or InputStream) you have to specify the
encoding, which means that any encoding can be used. Please be aware
that only UTF-8 and UTF-16 are allowed by the specification. If you
create a YAML document in a different encoding and send it to another
party there is no guarantee it works properly there.

4) Please create an issue to introduce Event.getID(). Then we can try
to implement it in the coming release, 1.8.
It is not present because it cannot be used in a 'switch' context in
either PyYAML or SnakeYAML.
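
To make the request concrete, here is a minimal sketch of the two
dispatch styles. The ID enum and Event class below are simplified,
hypothetical stand-ins written for this example, not SnakeYAML's own
classes; only the is(...) check and the proposed getID() accessor are
taken from the discussion above.

```java
// Hypothetical stand-in types, for illustration only.
public class EventDispatch {

    public enum ID { StreamStart, DocumentStart, Scalar, DocumentEnd, StreamEnd }

    public static final class Event {
        private final ID id;

        public Event(ID id) { this.id = id; }

        // The existing style of check: compare against one ID at a time.
        public boolean is(ID other) { return id == other; }

        // The accessor being requested, so callers can switch on the enum.
        public ID getID() { return id; }
    }

    // Chained if (event.is(...)) calls: one comparison per candidate ID.
    public static String describeWithChain(Event e) {
        if (e.is(ID.StreamStart)) return "stream start";
        if (e.is(ID.DocumentStart)) return "document start";
        if (e.is(ID.Scalar)) return "scalar";
        if (e.is(ID.DocumentEnd)) return "document end";
        if (e.is(ID.StreamEnd)) return "stream end";
        throw new IllegalStateException("unknown event");
    }

    // The same dispatch written as a switch over getID().
    public static String describeWithSwitch(Event e) {
        switch (e.getID()) {
            case StreamStart:   return "stream start";
            case DocumentStart: return "document start";
            case Scalar:        return "scalar";
            case DocumentEnd:   return "document end";
            case StreamEnd:     return "stream end";
            default: throw new IllegalStateException("unknown event");
        }
    }

    public static void main(String[] args) {
        Event e = new Event(ID.Scalar);
        System.out.println(describeWithChain(e));
        System.out.println(describeWithSwitch(e));
    }
}
```

Both methods return the same result for every ID; the switch form only
becomes possible once the event exposes its ID directly.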

Feel free to ask more questions.

Good luck,
Andrey