---------- Forwarded message ----------
From: "Gary Yngve" <gary....@gmail.com>
Date: Jan 17, 2009 10:54 AM
Subject: ZAML
To: <hall...@gmail.com>
Hi,
First I want to commend you and your team on ZAML. The readability over YAML+Syck is AMAZING. I was getting frustrated at waiting literally an hour (and burning gigs of RAM) to write merely a few hundred thousand records. Didn't want JSON because of issues with large strings and poor readability. YAML appeared to have options for fast_generate, etc., but couldn't figure out how to unlock it, and the project seems to have gone stale.
Second, I have a bug to report. Some of the small string cases work as strings but not written to files because they do not append a newline to the end.
>> YAML.dump('>')
=> "--- \">\"\n"
>> ZAML.dump('>')
=> "--- >"
>> File.open('foo','w'){|f| f.puts YAML.dump('>')}
=> nil
>> YAML.load_file('foo')
=> ">"
>> File.open('foo','w'){|f| f.puts ZAML.dump('>')}
=> nil
>> YAML.load_file('foo')
=> ""
The problem is that the FileIO automatically adds the newline, so you need to be more aggressive on escaping (btw, you're more robust than YAML on escaping!).
Another thing I noticed was that the indenting was a little off on the first level. I think that's because ideally you'd want to initialize the indenting at "-1*tail".. my solution is
something like:
    def initialize
      ...
      @indents = ["\n","\n"]
      @indent_idx = 0
      ...
    def nested(tail=' ')
      @indent_idx += 1
      @indent = @indents[@indent_idx] || (@indents[@indent_idx] = @indents[@indent_idx-1]+tail)
      yield
      @indent_idx -= 1
    end
though I haven't tested it thoroughly.
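A minimal, self-contained sketch of that cached-indent idea follows. The class name IndentCache is hypothetical, the two-space default tail is an assumption, and restoring @indent on the way out is an addition that the fragment above does not have; none of this is ZAML's actual code.

```ruby
# Hypothetical stand-in for the emitter's indentation state (not ZAML itself).
class IndentCache
  attr_reader :indent

  def initialize
    # Levels 0 and 1 share the same prefix, so the first nesting level
    # is not indented (the "-1*tail" starting point described above).
    @indents = ["\n", "\n"]
    @indent_idx = 0
    @indent = @indents[0]
  end

  # Assumption: tail defaults to two spaces. The cache is keyed by depth
  # only, so it presumes the same tail is used at a given depth.
  def nested(tail = '  ')
    @indent_idx += 1
    @indent = (@indents[@indent_idx] ||= @indents[@indent_idx - 1] + tail)
    yield
  ensure
    # Restore the outer level even if the block raises.
    @indent_idx -= 1
    @indent = @indents[@indent_idx]
  end
end

cache = IndentCache.new
seen = [cache.indent]
cache.nested do
  seen << cache.indent
  cache.nested { seen << cache.indent }
  seen << cache.indent
end
seen << cache.indent
p seen
```

Each prefix string is built once per depth and reused afterwards, which avoids a string allocation on every nested call.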
Anyway, where I'm going with it is trying to make it as FAST as JSON, which means not handling circular references and object identity stuff (just as JSON does).. probably call it something like SloppyYAML..
On a large testfile I have (lots of large arrays of complex hashes), my current benchmarks so far are something like:
YAML: over a minute
ZAML: about 40 sec
"SloppyYAML": about 10 sec
JSON: about 5 sec
Thanks,
Gary
I'm glad you are finding ZAML so useful! And thank you for the bug
reports. We will definitely have to include some file and other IO
stream tests in our test suite. And I will look at that indentation
issue too.
If you feel inclined to send us patches from SloppyYAML, we might
consider a faster, less featureful dumping option in ZAML. We could
put it in a 'fast_dump' method. The speedup you reported is
impressive. Do you happen to have more ideas about how to close the
remaining 2x gap with JSON?
I encourage you to join our mailing list if you want to talk about
more ZAML or SloppyYAML stuff: http://groups.google.com/group/zaml
Thanks,
Jesse
> Second, I have a bug to report. Some of the small string cases work as
> strings but not written to files because they do not append a newline to the
> end....so you need to be more aggressive on escaping (btw, you're more robust
> than YAML on escaping!).
I'm actually running an exhaustive (multi-week) test for cases such as
this at this very moment. We've had a fair amount of discussion about
the issue, and plan an update with much more robust string-escaping
logic.
Initially, we quoted/escaped everything, since that always works. We
then went (based on user feedback) to a more relaxed approach, and got bit
by a few special words (such as "yes") that must be quoted.
After that, we played with matching YAML's output on everything, only to
find that there were a host of cases that YAML didn't handle correctly.
So now we're doing a test-driven pass, doing things like testing all
possible strings up to a certain length, all dictionary words, larger
(in some cases much larger) random strings from selected sub-sets of the
possible characters, and so on.
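A rough sketch of what such an exhaustive round-trip check can look like. The helper names each_string and round_trip_failures are made up for illustration, the "naive" dumper is a deliberately under-quoting stand-in rather than ZAML, and the loader is whatever YAML library ships with the running Ruby.

```ruby
require 'yaml'

# Enumerate every string of length 0..max_len over the given alphabet.
def each_string(alphabet, max_len)
  return enum_for(:each_string, alphabet, max_len) unless block_given?

  (0..max_len).each do |len|
    alphabet.repeated_permutation(len) { |chars| yield chars.join }
  end
end

# Return the strings that do not survive dump -> load unchanged.
def round_trip_failures(strings, dumper)
  strings.reject do |s|
    begin
      YAML.load(dumper.call(s)) == s
    rescue StandardError
      false # a parse error also counts as a failure
    end
  end
end

# Hypothetical dumper that never quotes anything.
naive = ->(s) { "--- #{s}\n" }

candidates = each_string(%w[y e s], 3).to_a
failures   = round_trip_failures(candidates, naive)
puts "#{failures.size} of #{candidates.size} strings failed to round-trip"
```

Swapping in a real dumper (for example ZAML.method(:dump), if it is loaded) and a larger alphabet gives the kind of brute-force coverage described above, at the cost of exponential growth in the number of candidates.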
> >> YAML.dump('>')
> => "--- \">\"\n"
> >> ZAML.dump('>')
> => "--- >"
> >> File.open('foo','w'){|f| f.puts YAML.dump('>')}
> => nil
> >> YAML.load_file('foo')
> => ">"
> >> File.open('foo','w'){|f| f.puts ZAML.dump('>')}
> => nil
> >> YAML.load_file('foo')
> => ""
> The problem is that the FileIO automatically adds the newline,
Actually, the problem is with "puts", which modifies the data as it's
writing it--and thus shouldn't be used to output any sort of
protocol-specific data; if you use "print" or "write" it will work just
fine. But I agree it's needlessly brittle.
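To make the difference concrete, here is a small sketch that uses StringIO in place of a real file (an assumption made only to keep the example self-contained):

```ruby
require 'stringio'

payload = '--- >' # no trailing newline, as in the report above

via_puts  = StringIO.new
via_write = StringIO.new

via_puts.puts(payload)   # puts appends "\n" when the string lacks one
via_write.write(payload) # write emits the bytes unchanged

p via_puts.string  # => "--- >\n"
p via_write.string # => "--- >"
```

The appended newline turns the plain scalar ">" into a folded-block indicator followed by an empty body, which is why the file loads back as an empty string.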
> Another thing I noticed was that the indenting was a little off on the first
> level. I think that's because ideally you'd want to initialize the
> indenting at "-1*tail"..
We had done that at one point, matching YAML, but then found cases where
it wasn't loaded back correctly (the spec also has optional indentation
for certain nesting patterns--with the right combination you get to
where you can't tell which indentation was omitted). So now we are
always indenting, even in optional cases, and haven't found a
circumstance where it fails to load correctly.
Are you seeing a failure to load properly, or just noting that the
output is different than what YAML produces?
> Anyway, where I'm going with it is trying to make it as FAST as JSON, which
> means not handling circular references and object identity stuff (just as
> JSON does).. probably call it something like SloppyYAML..
That's probably a bad idea. For any interesting data structures you run
the risk of producing much larger output (because you are reserializing
shared structures time and again), taking much longer (ditto), or even
hanging/crashing with a stack overflow.
Also, you have the problem that what you get back will be a subtly
different structure than what you saved. For many use cases, this would
be a very bad form of data corruption.
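A short sketch of both failure modes, using Ruby's bundled json and yaml libraries as stand-ins (not ZAML or SloppyYAML); the variable names are illustrative only:

```ruby
require 'json'
require 'yaml'

shared = { 'name' => 'x' }
data   = [shared, shared] # the same Hash object referenced twice

# JSON has no notion of identity, so the shared hash is written twice
# and comes back as two separate objects.
json_back = JSON.parse(JSON.generate(data))

# Newer Psych versions only resolve aliases through unsafe_load;
# older ones do so through load.
yaml_load = YAML.respond_to?(:unsafe_load) ? YAML.method(:unsafe_load) : YAML.method(:load)
yaml_back = yaml_load.call(YAML.dump(data))

puts json_back[0].equal?(json_back[1]) # false: identity lost
puts yaml_back[0].equal?(yaml_back[1]) # true: anchor/alias preserved it

# A self-referencing array cannot be serialized without reference tracking.
cyclic = []
cyclic << cyclic
json_cycle_error =
  begin
    JSON.generate(cyclic)
    nil
  rescue JSON::NestingError => e
    e.class
  end
puts json_cycle_error.inspect
```

Mutating one element of json_back after loading no longer affects the other, which is the "subtly different structure" described above.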
> On a large testfile I have (lots of large arrays of complex hashes), my
> current benchmarks so far are something like:
> ZAML: about 40 sec
> "SloppyYAML": about 10 sec
> JSON: about 5 sec
Whoa, 75% savings? That's way more than I'd have expected. Our
back-reference detection shouldn't be taking that long. Either you're
doing something really clever or we're doing something needlessly dumb.
Care to share your code & test cases?
> Thanks,
> Gary
Thank you!
-- Markus
And without this patch it just saves the Date since DateTime is a
sub-class of Date, which we handle by producing a readable date (and
nothing else).
Hmmm. We talked about this sort of thing at one point, but I don't
recall the resolution if there was one.
The issue is, do we want to require libraries just to extend their
classes with a to_zaml method? Some applications might object to our
dragging in libraries they didn't otherwise need. But those that
subclass a class (such as Date) that we're already giving special
treatment to might not work right.
I can think of a few monkey patching ideas, including special casing
DateTime in Date#to_zaml, but I'd be interested in hearing what the rest
of you think.
-- Markus
P.S. As I was about to hit send I realized we could (and should) fix
this case trivially by replacing:
    class Date
      def to_zaml(z)
        z.emit(strftime('%Y-%m-%d'))
      end
    end
with
    class Date
      def to_zaml(z)
        z.emit(to_s)
      end
    end
which would then work for Dates and DateTimes. But the general question
is still interesting, as well as the ancillary questions it suggests
(are there other cases like this, how do we deal with user extensions of
a similar nature, etc.).
We played around with that interpretation of POLS, but it breaks down if
you push it too far. For example, yaml.rb segfaults on some data; would
anyone argue that zaml should do so too, as this would cause "the
least surprise"?
I think it's important to keep in mind zaml's priority hierarchy:
1. Produce correct results
2. Do so quickly
3. Prefer readable results
4. Match yaml.rb's results
...in that order. While POLS is important in many regards, no one has
yet offered a convincing use-case where it matters that zaml produce the
same output as yaml.rb does.
Conversely, we have a number of compelling arguments for always
producing correct results and doing so quickly--our users care about
that. Likewise, we have a small contingent that wants the output to be
human readable / editable wherever possible. Note too that these are
also examples of POLS, in that people expect the correct results, and
don't expect it to take unreasonably long to get them.
Though other considerations, such as size of output or exact matching of
yaml.rb's output, have been offered theoretically, no one has stepped
forward with a compelling explanation of why they matter.
> YAML may possibly also have something special for Struct.
Quite possibly. Anybody know?
-- Markus
> btw, this hints at how YAML and JSON implement serialization for
> DateTime (and likely Date) :)
YAML doesn't work with DateTime:
irb> YAML.load(DateTime.now.to_yaml).class
=> Time
Date and Time are treated internally as timestamps (tag !timestamp).
DateTime is derived from Date, but the content format doesn't look
like a year/month/date and is hence treated as Time.
By adding a custom to_s you've interfered with DateTime content parsing
(by syck?) and hence it just writes it as !timestamp (since it still
knows that this class is a derivative of Date).
> YAML may possibly also have something special for Struct.
Treats struct like a special builtin type
irb> Post = Struct.new(:a, :b)
irb> p = Post.new("hello", "world")
=>
--- !ruby/struct:Post
a: hello
b: world
--Cheers
--Ragav
I'm working on rolling a few changes and bug fixes into zaml, and hit a
series of problems with the proposed DateTime fix.
In our last episode:
    Gary Yngve noted that DateTime objects, being a subclass of
    Date, serialized without the time portion. He also pointed out
    that when yaml serialized them it just emitted their "to_s."
    I proposed changing Date#to_zaml to use to_s instead of
    strftime, thus automatically handling DateTime without
    messing up Date.
On trying this, however, I discovered two problems:
1. The format yaml produces for DateTime objects isn't properly
recognized by yaml as a DateTime object; it comes back as a Time
object.
2. If we monkey patch to catch this and convert the resulting Time
object to a DateTime, we still don't have the right answer
because the conversion is lossy--DateTime#to_s doesn't represent
the fractional seconds.
So instead of the above simple and elegant solution I'm going to go with
the following semi-klunky implementation for DateTime:
    class DateTime
      def to_zaml(z)
        z.emit("!ruby/datetime #{strftime('%FT%T.%N%:z')}")
      end
      #
      # Monkey patch for buggy DateTime restore in YAML
      #
      unless YAML.load(ZAML.dump(now)).is_a? DateTime
        YAML.add_ruby_type( 'datetime' ) { |type, val|
          DateTime.parse(val.to_s)
        }
      end
    end
...which produces something that looks like so:
--- !ruby/datetime 2009-02-07T20:11:19.738720000+00:00
This isn't as pretty, but it produces the correct result quickly, so by
our criteria it should be what we use. If anybody's got a major
objection or a better idea, shout it out now.
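For reference, a small sketch of the lossiness in question, using only the standard date library (no ZAML or YAML involved) and a hand-picked timestamp as an example value:

```ruby
require 'date'

dt = DateTime.parse('2009-02-07T20:11:19.73872+00:00')

plain   = dt.to_s                        # drops fractional seconds
precise = dt.strftime('%FT%T.%N%:z')     # keeps them, to nanoseconds

puts plain   # 2009-02-07T20:11:19+00:00
puts precise # 2009-02-07T20:11:19.738720000+00:00

puts DateTime.parse(plain) == dt   # false: the fraction is gone
puts DateTime.parse(precise) == dt # true: exact round trip
```

The %N directive is what carries the sub-second part; without it, any DateTime with a non-zero fraction comes back as a different value.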
-- Markus
P.S. I also realized in the process that rationals aren't coming back
correctly, but that's a separate problem.