>> I'm a bit more worried about what will happen after a document is
>> finished. Is it guaranteed that it won't try to parse the next document
>> just to establish that the previous document is done?
>> I think that might actually happen if you parse a YAML file that doesn't
>> have start-of-document or end-of-document markers. I'm seeing Snakeyaml
>> returning the first object in the stream, but having read everything
>> that came after it.
>>
> Yes, it is guaranteed. A document is only parsed when it is explicitly
> requested.
Okay.
> A YAML document contains ScalarTokens (strings) which do not have size
> limitations.
[...]
> As I already said, SnakeYAML has a buffer which is read in a single step.
> It may read much more than is required to finish a document. The size of
> the buffer is not configurable. (It has never been needed.)
Won't be needed for this case either. If Snakeyaml is still reading the
current document and hits the limit, that's okay - the document is too
long, regardless of what the parser was trying to do.
The question is whether anything unbounded can happen between documents.
That is:
When the iterator from loadAll hits an end-of-document token, does it
read the next token?
If it does read the next token:
Can that token be of unbounded size?
I control the definition of well-formed files, so I can impose a
requirement that start-of-document markers be used. I understand that
the start-of-document marker is always just three dashes ("---"), so
that's bounded - is that correct?
If Snakeyaml reads a token, does it also eat any whitespace or comments
that may follow it?
If yes, I'll have to rethink how to deal with a malicious file that has
a gazillion empty lines between the first and second documents.
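For concreteness, this is the kind of stream I'm worried about ("---" is
the start-of-document marker, "..." the end-of-document marker):

---
first: document
...
# imagine a gazillion blank or comment lines here
---
second: document
...

If consuming the first document's "..." also eats the padding after it,
that padding had better count against the limit too.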
Observations:
Stylistic nitpick: Line 184 is missing a 'this.' before 'data'.
Printability is determined in two different ways, in far-apart code
sections: a regex for nonprintable characters, and a range check for
printable ones. Anybody adapting the definition of printability risks
introducing a subtle bug by making the two definitions inconsistent.
(That's a worry for the future case of the YAML spec adapting to
changed definitions in the Unicode spec. Not going to happen very
often, but possible.)
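To make that concrete, I'd derive both checks from a single predicate.
A minimal sketch (the names are mine, not Snakeyaml's; the ranges follow
the YAML 1.1 c-printable production, restricted to the BMP):

public final class Printability {
    private Printability() {}

    // The one and only definition of a printable YAML character.
    public static boolean isPrintable(char c) {
        return c == '\t' || c == '\n' || c == '\r'
                || (c >= '\u0020' && c <= '\u007E')
                || c == '\u0085'
                || (c >= '\u00A0' && c <= '\uD7FF')
                || (c >= '\uE000' && c <= '\uFFFD');
    }

    // The nonprintable check is the negation, never a second definition.
    public static boolean isNonPrintable(char c) {
        return !isPrintable(c);
    }
}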
getEncoding will crash if the Reader passed to the Yaml constructor is
not a UnicodeReader.
Not that the function is likely to ever be called. Callers could simply
hold on to the UnicodeReader and ask it for the encoding and other
properties directly; this is outside of Snakeyaml's domain.
Eclipse tells me it's not being used by Snakeyaml itself, so it could be
deprecated.
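If it's kept around, a defensive version could check the type before
casting. A sketch (I'm assuming the field is called 'stream'; I haven't
checked the actual field name):

public String getEncoding() {
    if (this.stream instanceof UnicodeReader) {
        return ((UnicodeReader) this.stream).getEncoding();
    }
    return null; // unknown for a caller-supplied Reader
}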
I'm not sure why the code is reinventing BufferedReader.
It complicates the code a lot with all the offset calculations.
Consider this (ByteStreams is Guava's com.google.common.io.ByteStreams):

InputStream inputStream = new FileInputStream(filename);
inputStream = new BufferedInputStream(inputStream);      // buffer raw bytes
inputStream = ByteStreams.limit(inputStream, limit);     // cap total bytes read
return new Yaml().loadAll(inputStream).iterator();
I can decide whether I want the limit applied before or after buffering
by swapping the BufferedInputStream and ByteStreams.limit lines.
It's already doing buffering. If I need a Reader-level buffer, I can
construct a Reader (and make an informed decision whether that's worth
making the limit less precise).
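For the Reader-level variant, the same composition works one layer up.
A sketch, assuming UTF-8 input (the byte limit is applied before
decoding, so it stays exact):

InputStream inputStream = new FileInputStream(filename);
inputStream = ByteStreams.limit(inputStream, limit);   // exact byte cap
Reader reader = new InputStreamReader(inputStream, StandardCharsets.UTF_8);
reader = new BufferedReader(reader);                   // buffer decoded chars
return new Yaml().loadAll(reader).iterator();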
I can mix and match the various buffering and other options and
benchmark each against my data mix, my JVM and/or other Java compiler
technology.
Am I overlooking something here?