Fwd: ZAML

Jesse Hallett

Jan 17, 2009, 2:20:00 PM
to za...@googlegroups.com

---------- Forwarded message ----------
From: "Gary Yngve" <gary....@gmail.com>
Date: Jan 17, 2009 10:54 AM
Subject: ZAML
To: <hall...@gmail.com>

Hi,

First I want to commend you and your team on ZAML.  The readability over YAML+Syck is AMAZING.  I was getting frustrated at waiting literally an hour (and burning gigs of RAM) to write merely a few hundred thousand records.  Didn't want JSON because of issues with large strings and poor readability.  YAML appeared to have options for fast_generate, etc., but I couldn't figure out how to unlock it, and the project seems to have gone stale.

Second, I have a bug to report.  Some of the small string cases work as strings but not written to files because they do not append a newline to the end.

>> YAML.dump('>')
=> "--- \">\"\n"
>> ZAML.dump('>')
=> "--- >"
>> File.open('foo','w'){|f| f.puts YAML.dump('>')}
=> nil
>> YAML.load_file('foo')
=> ">"
>> File.open('foo','w'){|f| f.puts ZAML.dump('>')}
=> nil
>> YAML.load_file('foo')
=> ""

The problem is that the FileIO automatically adds the newline, so you need to be more aggressive on escaping (btw, you're more robust than YAML on escaping!).

Another thing I noticed was that the indenting was a little off on the first level.  I think that's because ideally you'd want to initialize the indenting at "-1*tail".. my solution is
something like:

  def initialize
...
    @indents = ["\n","\n"]
    @indent_idx = 0
...
 
  def nested(tail='  ')
    @indent_idx += 1
    @indent = @indents[@indent_idx] || (@indents[@indent_idx] = @indents[@indent_idx-1]+tail)
    yield
    @indent_idx -= 1
  end

though I haven't tested it thoroughly.
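
A minimal runnable sketch of the indent cache described above, for illustration only. `IndentSketch` and `indent_trace` are hypothetical names, not part of ZAML, and the `ensure` clause that restores `@indent` on the way out is an added assumption, since the snippet above leaves the surrounding code elided.

```ruby
# Hypothetical stand-in for the emitter; only the indent bookkeeping is modeled.
class IndentSketch
  attr_reader :indent

  def initialize
    @indents = ["\n", "\n"]   # levels 0 and 1 share the same unindented prefix
    @indent_idx = 0
    @indent = @indents[0]
  end

  def nested(tail = '  ')
    @indent_idx += 1
    # Build the prefix for this depth once, then reuse the cached string.
    @indents[@indent_idx] ||= @indents[@indent_idx - 1] + tail
    @indent = @indents[@indent_idx]
    yield
  ensure
    # Assumption: restore the previous level's prefix when the block exits.
    @indent_idx -= 1
    @indent = @indents[@indent_idx]
  end
end

# Record the indent prefix seen at depths 0, 1, 2, 3 and again after unwinding.
def indent_trace
  s = IndentSketch.new
  trace = [s.indent]
  s.nested do
    trace << s.indent
    s.nested do
      trace << s.indent
      s.nested { trace << s.indent }
    end
  end
  trace << s.indent
  trace
end
```

With this setup the first nesting level carries no extra indentation and each deeper level adds one `tail`.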

Anyway, where I'm going with it is trying to make it as FAST as JSON, which means not handling circular references and object identity stuff (just as JSON does).. probably call it something like SloppyYAML..

On a large testfile I have (lots of large arrays of complex hashes), my current benchmarks so far are something like:
YAML: over a minute
ZAML: about 40 sec
"SloppyYAML": about 10 sec
JSON: about 5 sec

Thanks,
Gary

Jesse Hallett

Jan 17, 2009, 7:56:01 PM
to Gary Yngve, za...@googlegroups.com
Gary,

I'm glad you are finding ZAML so useful! And thank you for the bug
reports. We will definitely have to include some file and other IO
stream tests in our test suite. And I will look at that indentation
issue too.

If you feel inclined to send us patches from SloppyYAML, we might
consider a faster, less featureful dumping option in ZAML. We could
put it in a 'fast_dump' method. The speedup you reported is
impressive. Do you happen to have more ideas about how to close the
remaining 2x gap with JSON?

I encourage you to join our mailing list if you want to talk about
more ZAML or SloppyYAML stuff: http://groups.google.com/group/zaml

Thanks,
Jesse

Markus

Jan 17, 2009, 9:57:23 PM
to za...@googlegroups.com
Gary,

> Second, I have a bug to report. Some of the small string cases work as
> strings but not written to files because they do not append a newline to the
> end....so you need to be more aggressive on escaping (btw, you're more robust
> than YAML on escaping!).

I'm actually running an exhaustive (multi-week) test for cases such as
this at this very moment. We've had a fair amount of discussion about
the issue, and plan an update with much more robust string-escaping
logic.

Initially, we quoted/escaped everything, since that always works. We
then went (based on user feedback) to a more relaxed approach, and got
bitten by a few special words (such as "yes") that must be quoted.

After that, we played with matching YAML's output on everything, only to
find that there were a host of cases that YAML didn't handle correctly.
So now we're doing a test-driven pass, doing things like testing all
possible strings up to a certain length, all dictionary words, larger
(in some cases much larger) random strings from selected sub-sets of the
possible characters, and so on.
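
A rough sketch of what such an exhaustive round-trip pass could look like. The helper names (`all_strings`, `round_trip_failures`) are hypothetical and this is not the actual ZAML test suite; the dumper and loader are passed in as callables so that any pair can be checked, for example `ZAML.dump` against `YAML.load`.

```ruby
# Enumerate every string over `alphabet` with length 0..max_len.
def all_strings(alphabet, max_len)
  (0..max_len).flat_map do |n|
    alphabet.repeated_permutation(n).map(&:join)
  end
end

# Return the inputs that do not survive dump followed by load.
def round_trip_failures(strings, dump, load)
  strings.reject { |s| load.call(dump.call(s)) == s }
end

# A deliberately naive dumper/loader pair, used only to show that the
# harness catches strings that come back changed (here: edge whitespace).
NAIVE_DUMP = ->(s)   { "--- " + s }
NAIVE_LOAD = ->(doc) { doc.sub(/\A--- /, "").strip }
```

Usage would be along the lines of `round_trip_failures(all_strings(%w[> : - a] + [" "], 3), dump, load)`, growing the alphabet and length as far as run time allows.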

> >> YAML.dump('>')
> => "--- \">\"\n"
> >> ZAML.dump('>')
> => "--- >"
> >> File.open('foo','w'){|f| f.puts YAML.dump('>')}
> => nil
> >> YAML.load_file('foo')
> => ">"
> >> File.open('foo','w'){|f| f.puts ZAML.dump('>')}
> => nil
> >> YAML.load_file('foo')
> => ""

> The problem is that the FileIO automatically adds the newline,

Actually, the problem is with "puts", which modifies the data as it's
writing it--and thus shouldn't be used to output any sort of
protocol-specific data; if you use "print" or "write" it will work just
fine. But I agree it's needlessly brittle.
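
To make the difference concrete, here is a small sketch using only the standard library (`bytes_written` is a hypothetical helper). It only shows what bytes reach the IO; it does not call ZAML or a YAML loader.

```ruby
require 'stringio'

# Return the bytes an IO receives when `text` is sent with the given method.
def bytes_written(text, method_name)
  io = StringIO.new
  io.public_send(method_name, text)
  io.string
end

# puts appends "\n" when the text does not already end with one,
# so the one-line document "--- >" reaches the file as "--- >\n".
# write passes the text through unchanged.
```

So `bytes_written("--- >", :puts)` differs from the dumped string, while `bytes_written("--- >", :write)` does not. A string that already ends in a newline, such as the YAML.dump output quoted above, is left alone by `puts`.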

> Another thing I noticed was that the indenting was a little off on the first
> level. I think that's because ideally you'd want to initialize the
> indenting at "-1*tail"..

We had done that at one point, matching YAML, but then found cases where
it wasn't loaded back correctly (the spec also has optional indentation
for certain nesting patterns--with the right combination you get to
where you can't tell which indentation was omitted). So now we are
always indenting, even in optional cases, and haven't found a
circumstance where it fails to load correctly.

Are you seeing a failure to load properly, or just noting that the
output is different than what YAML produces?

> Anyway, where I'm going with it is trying to make it as FAST as JSON, which
> means not handling circular references and object identity stuff (just as
> JSON does).. probably call it something like SloppyYAML..

That's probably a bad idea. For any interesting data structures you run
the risk of producing much larger output (because you are reserializing
shared structures time and again), taking much longer (ditto), or even
hanging/crashing with a stack overflow.

Also, you have the problem that what you get back will be a subtly
different structure than what you saved. For many use cases, this would
be a very bad form of data corruption.
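
A small sketch of the "subtly different structure" problem, using the YAML and JSON libraries that ship with current Ruby (Psych, not Syck or ZAML, so this is an analogy rather than a test of either). The helper names are hypothetical.

```ruby
require 'yaml'
require 'json'

# Newer Psych versions disable aliases in YAML.load, so use unsafe_load
# when it is available and fall back to load on older versions.
def yaml_round_trip(obj)
  text = YAML.dump(obj)
  YAML.respond_to?(:unsafe_load) ? YAML.unsafe_load(text) : YAML.load(text)
end

def json_round_trip(obj)
  JSON.parse(JSON.generate(obj))
end

# Two slots that point at the same Hash object.
def shared_pair
  shared = { "k" => "v" }
  [shared, shared]
end
```

After `yaml_round_trip(shared_pair)` the two elements are still one object, so mutating one is visible through the other. After `json_round_trip(shared_pair)` they are equal but separate copies, which is the change in structure described above.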

> On a large testfile I have (lots of large arrays of complex hashes), my
> current benchmarks so far are something like:

> ZAML: about 40 sec
> "SloppyYAML": about 10 sec
> JSON: about 5 sec

Whoa, 75% savings? That's way more than I'd have expected. Our
back-reference detection shouldn't be taking that long. Either you're
doing something really clever or we're doing something needlessly dumb.
Care to share your code & test cases?

> Thanks,
> Gary

Thank you!

-- Markus


Gary Yngve

Jan 18, 2009, 3:30:48 PM
to za...@googlegroups.com
On Sat, Jan 17, 2009 at 6:57 PM, Markus <mar...@reality.com> wrote:
> I'm actually running an exhaustive (multi-week) test for cases such as
> this at this very moment.  We've had a fair amount of discussion about
> the issue, and plan an update with much more robust string-escaping
> logic.

Here are a few other things I found:

For the HASH in the testcases, test on HASH.invert too.
(nil can be a key, and it is represented by ~)
The case of requiring a \n at the end of a one-line yaml introduces some more escaping
issues regarding strings as keys.
 
> Actually, the problem is with "puts", which modifies the data as it's
> writing it--and thus shouldn't be used to output any sort of
> protocol-specific data; if you use "print" or "write" it will work just
> fine.  But I agree it's needlessly brittle.

You are correct. :)
 
> Are you seeing a failure to load properly, or just noting that the
> output is different than what YAML produces?

No failure to load properly (I haven't found a case where my version loads improperly, but I'm targeting my use cases),
just doesn't look the way YAML looks.
 
> That's probably a bad idea.  For any interesting data structures you run
> the risk of producing much larger output (because you are reserializing
> shared structures time and again), taking much longer (ditto), or even
> hanging/crashing with a stack overflow.

Yes, for some apps.  But mostly I'm doing 'boring' stuff, so data tends to be almost always
nested arrays/hashes of different strings and numbers, and anything "recursive" in nature, e.g. a graph-like structure,
is often already represented as references/foreign keys.

JSON likewise crashes on:
>> ((a=[])<<a).to_json
SystemStackError: stack level too deep

I'd use JSON for its speed, but some detractors are:
- not as readable by default, and JSON.pretty_generate has too many blank lines (YAML'ed arrays are easy to manipulate with grep, cut, tr, etc.)
- doesn't differentiate btwn strings and symbols
- crashes on weird characters that need to be encoded a priori

> Whoa, 75% savings?  That's way more than I'd have expected.  Our
> back-reference detection shouldn't be taking that long.  Either you're
> doing something really clever or we're doing something needlessly dumb.
> Care to share your code & test cases?

I did some more analysis.  75% was an extreme case due to the GC kicking in extra hard.

For one large example, prior to serialization, ruby was taking up about 60-70 MB.
YAML jumped it up to over 300MB!
ZAML was around 140.
Mine was around 80. 
Even though keeping a hash of seen objects is incredibly cheap time-complexity-wise, the downside is that it uses up a lot of space.
If it is enough space that it causes ruby to ask for more memory from the OS, it will trigger the GC, which burns a lot of CPU.  So big slowdowns aren't necessarily from the code per se, but rather the GC.  In fact, if I run YAML/ZAML a second time in the same process, they are significantly faster because they already have sufficient memory and the GC is quiet.
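
As a rough illustration of where that space goes, here is a sketch of the bookkeeping a dumper with back-reference support has to do. This is not ZAML's code; `tracked_containers` and `sample_records` are hypothetical, and only Arrays and Hashes are tracked to keep it short.

```ruby
# Walk a nested Array/Hash structure, recording every container in an
# identity-keyed table so that a second visit can be detected.
def tracked_containers(obj, seen = {}.compare_by_identity)
  return seen if seen.key?(obj)   # already visited: would become a back-reference
  case obj
  when Array
    seen[obj] = true
    obj.each { |e| tracked_containers(e, seen) }
  when Hash
    seen[obj] = true
    obj.each do |k, v|
      tracked_containers(k, seen)
      tracked_containers(v, seen)
    end
  end
  seen
end

# n small records, each a Hash holding one nested Array.
def sample_records(n)
  Array.new(n) { |i| { "id" => i, "tags" => ["a", "b"] } }
end
```

The table ends up with one entry per container and has to stay alive until the dump finishes, so for a million small hashes it grows to millions of entries on top of the data itself. A dumper that skips the table avoids that allocation, at the cost of not terminating on cyclic input.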

I'm often running jobs on commodity machines where I'm pushing the memory limits of the machine (2-4G).  I've had cases where it's taken in excess of an hour to serialize (with yaml) an array of a million hashes because there wasn't spare memory and the GC was burning nonstop.  I don't want to use Marshal because I specifically want readability for my sanity.  Having a serialization tool with a minimal memory footprint is very important to me.

My version is running maybe 20% faster than ZAML currently if the GC is not involved. Still playing with refactoring.
Some of the things I did/am doing:
- by not needing to check for object identity, you can stream out the data as you read it, rather than waiting till the end because it might be referenced
- for String.to_zaml, the when case with all the ors is now a single regex (it also had the other string comparisons in that when)
- there's a lot of specific logic for structured keys that every codepath has to go through.  99% of the time, keys are not structured, so I'm trying to refactor that logic to just when structured keys are for sure involved.
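
To illustrate the second item, here is a sketch of replacing a multi-way `when` with one precompiled regex. The word list is a hypothetical example (only "yes" is named in this thread) and the method names are made up; this is not the actual String#to_zaml code.

```ruby
# Hypothetical list of scalars that must be quoted; the real list is not given here.
SPECIAL_WORDS = %w[yes no true false null ~ on off].freeze

# One anchored, case-insensitive alternation built once at load time.
SPECIAL_RE = /\A(?:#{SPECIAL_WORDS.map { |w| Regexp.escape(w) }.join('|')})\z/i

# Original style: a when clause listing every alternative.
def special_by_case?(s)
  case s.downcase
  when 'yes', 'no', 'true', 'false', 'null', '~', 'on', 'off' then true
  else false
  end
end

# Refactored style: a single regex match, with no downcased copy of the string.
def special_by_regex?(s)
  !!(SPECIAL_RE =~ s)
end
```

Both methods should agree on every input; whether the regex version is actually faster depends on the Ruby version and the data, so it is worth benchmarking on real records.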

-Gary

Gary Yngve

Jan 18, 2009, 4:20:00 PM
to za...@googlegroups.com
One more note..

DateTime needs a to_zaml (DateTime is used a lot in DBs)

something that seems to work is:

class DateTime
  def to_zaml(z)
    z.emit(to_s)
  end
end

-Gary

Markus

Jan 18, 2009, 5:32:26 PM
to za...@googlegroups.com

> DateTime needs a to_zaml (DateTime is used a lot in DBs)
>
> something that seems to work is:
>
> class DateTime
>   def to_zaml(z)
>     z.emit(to_s)
>   end
> end

And without this patch it just saves the Date since DateTime is a
sub-class of Date, which we handle by producing a readable date (and
nothing else).

Hmmm. We talked about this sort of thing at one point, but I don't
recall the resolution if there was one.

The issue is, do we want to require libraries just to extend their
classes with a to_zaml method? Some applications might object to our
dragging in libraries they didn't otherwise need. But those that
subclass a class (such as Date) that we're already giving special
treatment to might not work right.

I can think of a few monkey patching ideas, including special casing
DateTime in Date#to_zaml, but I'd be interested in hearing what the rest
of you think.

-- Markus

P.S. As I was about to hit send I realized we could (and should) fix
this case trivially by replacing:

class Date
  def to_zaml(z)
    z.emit(strftime('%Y-%m-%d'))
  end
end

with

class Date
  def to_zaml(z)
    z.emit(to_s)
  end
end

which would then work for Dates and DateTimes. But the general question
is still interesting, as well as the ancillary questions it suggests
(are there other cases like this, how do we deal with user extensions of
a similar nature, etc.).
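
For reference, a small sketch of why `to_s` covers both classes while the fixed `strftime` format does not. It uses only the standard `date` library; `date_formats` is a hypothetical helper and nothing here calls ZAML.

```ruby
require 'date'

# Compare what each formatting choice yields for a Date and a DateTime.
def date_formats
  d  = Date.new(2009, 1, 18)
  dt = DateTime.new(2009, 1, 18, 16, 56, 51, '-08:00')
  {
    date_to_s:         d.to_s,                  # date only
    datetime_strftime: dt.strftime('%Y-%m-%d'), # inherited format drops the time
    datetime_to_s:     dt.to_s                  # date, time and offset
  }
end
```

Date#to_s gives the same text as the `'%Y-%m-%d'` format, so switching to `to_s` leaves Date output unchanged while DateTime picks up its time and offset.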

Gary Yngve

Jan 18, 2009, 7:58:34 PM
to za...@googlegroups.com
On Sun, Jan 18, 2009 at 2:32 PM, Markus <mar...@reality.com> wrote:
> which would then work for Dates and DateTimes.  But the general question
> is still interesting, as well as the ancillary questions it suggests
> (are there other cases like this, how do we deal with user extensions of
> a similar nature, etc.).

By the principle of least surprise, if YAML and maybe JSON work in a certain way, ZAML should work that way too.

btw, this hints at how YAML and JSON implement serialization for DateTime (and likely Date) :)

>> DateTime.now.to_yaml
=> "--- 2009-01-18T16:56:51-08:00\n"
>> class DateTime
>> def to_s; 'foo'; end
>> end
=> nil
>> DateTime.now.to_json
=> "\"foo\""
>> DateTime.now.to_yaml
=> "--- !timestamp foo\n"
(where did that !timestamp come from, and why wasn't it there originally?)

YAML may possibly also have something special for Struct.

-Gary

Markus

Jan 18, 2009, 9:50:57 PM
to za...@googlegroups.com

> By the principle of least surprise, if YAML and maybe JSON work in a
> certain way, ZAML should work that way too.

We played around with that interpretation of POLS, but it breaks down if
you push it too far. For example, yaml.rb segfaults on some data; would
anyone argue that zaml should do so too, as this would cause "the
least surprise"?

I think it's important to keep in mind zaml's priority hierarchy:

1. Produce correct results
2. Do so quickly
3. Prefer readable results
4. Match yaml.rb's results

...in that order. While POLS is important in many regards, no one has
yet offered a convincing use-case where it matters that zaml produce the
same output as yaml.rb does.

Conversely, we have a number of compelling arguments for always
producing correct results and doing so quickly--our users care about
that. Likewise, we have a small contingent that wants the output to be
human readable / editable wherever possible. Note too that these are
also examples of POLS, in that people expect the correct results, and
don't expect it to take unreasonably long to get them.

Though other considerations, such as size of output or exact matching of
yaml.rb's output, have been offered theoretically, no one has stepped
forward with a compelling explanation of why they matter.

> YAML may possibly also have something special for Struct.

Quite possibly. Anybody know?

-- Markus


Ragav Satish

Jan 19, 2009, 9:18:53 AM
to za...@googlegroups.com
On Sun, 2009-01-18 at 16:58 -0800, Gary Yngve wrote:

> btw, this hints at how YAML and JSON implement serialization for
> DateTime (and likely Date) :)

YAML doesn't work with DateTime

irb> YAML.load(DateTime.now.to_yaml).class =>Time

Date and Time are treated internally as timestamps (tag !timestamp).
DateTime is derived from Date, but the content format doesn't look
like a year/month/date and is hence treated as Time.

By adding a custom to_s you've interfered with DateTime content parsing
(by syck?) and hence it just writes it as !timestamp (since it still
knows that this class is a derivative of Date).

> YAML may possibly also have something special for Struct.

Treats struct like a special builtin type

irb> Post = Struct.new(:a, :b)
irb> p = Post.new("hello", "world")

=>
--- !ruby/struct:Post
a: hello
b: world

--Cheers
--Ragav


Markus

Feb 7, 2009, 3:25:46 PM
to za...@googlegroups.com
All --

I'm working on rolling a few changes and bug fixes into zaml, and hit a
series of problems with the proposed DateTime fix.

In our last episode:

Gary Yngve noted that DateTime objects, being a subclass of
Date, serialized without the time portion. He also pointed out
that when yaml serialized them it just emitted their "to_s."

I proposed changing Date#to_zaml to use to_s instead of
strftime, thus automatically handling DateTime without
messing up Date.

On trying this, however, I discovered two problems:

1. The format yaml produces for DateTime objects isn't properly
recognized by yaml as a DateTime object; it comes back as a Time
object.
2. If we monkey patch to catch this and convert the resulting Time
object to a DateTime, we still don't have the right answer
because the conversion is lossy--DateTime#to_s doesn't represent
the fractional seconds.

So instead of the above simple and elegant solution I'm going to go with
the following semi-klunky implementation for DateTime:

class DateTime
  def to_zaml(z)
    z.emit("!ruby/datetime #{strftime('%FT%T.%N%:z')}")
  end
  #
  # Monkey patch for buggy DateTime restore in YAML
  #
  unless YAML.load(ZAML.dump(now)).is_a? DateTime
    YAML.add_ruby_type( 'datetime' ) { |type, val|
      DateTime.parse(val.to_s)
    }
  end
end

...which produces something that looks like so:

--- !ruby/datetime 2009-02-07T20:11:19.738720000+00:00

This isn't as pretty, but it produces the correct result quickly, so by
our criteria it should be what we use. If anybody's got a major
objection or a better idea, shout it out now.
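
The lossy-conversion point can be checked with the standard `date` library alone. The sketch below is only an illustration with hypothetical helper names; it does not use ZAML or the YAML type hook above, it just compares the two text formats.

```ruby
require 'date'

# A DateTime with a fractional second, matching the sample output above.
def sample_datetime
  DateTime.new(2009, 2, 7, 20, 11, 19 + Rational(73_872, 100_000))
end

# Round trip through DateTime#to_s, which has whole-second resolution.
def via_to_s(dt)
  DateTime.parse(dt.to_s)
end

# Round trip through the format used in to_zaml, which keeps %N (nanoseconds).
def via_strftime(dt)
  DateTime.parse(dt.strftime('%FT%T.%N%:z'))
end
```

The `to_s` path comes back with the fraction truncated to zero, while the `strftime` path should compare equal to the original value.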

-- Markus

P.S. I also realized in the process that rationals aren't coming back
correctly, but that's a separate problem.
