Reading multiple documents, one at a time


toolforger

Dec 26, 2012, 7:48:15 AM
to snakeya...@googlegroups.com
Hi all,

I need to read multiple documents but I can't use loadAll because I need to do checks before I decide whether I want to read the remaining documents.
When I use loadAs, I'm getting "expected a single document in the stream; but found another document".

I.e. I want to read a document, do some processing, read the next document, do some more processing, etc.
What's the best way to do that with Snakeyaml?

Regards,
Jo

Andrey

Dec 27, 2012, 4:46:35 AM
to snakeya...@googlegroups.com
Hi Jo,
as far as I remember the method Yaml.loadAll() is exactly what you are looking for.
According to the JavaDoc:
Parse all YAML documents in a String and produce corresponding Java objects. The documents are parsed only when the iterator is invoked.
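A minimal sketch of that lazy iteration (assuming SnakeYAML on the classpath; the readUntil helper and the inline YAML are made up for illustration — the point is that each loop step parses exactly one more document, and breaking out leaves the rest unparsed):

```java
import org.yaml.snakeyaml.Yaml;
import java.util.ArrayList;
import java.util.List;

public class LoadAllDemo {
    // Collects documents one at a time; the parse happens inside the iterator,
    // so breaking early means the remaining documents are never parsed.
    static List<Object> readUntil(String yamlSource, int maxDocs) {
        Yaml yaml = new Yaml();
        List<Object> docs = new ArrayList<Object>();
        for (Object doc : yaml.loadAll(yamlSource)) {
            docs.add(doc);
            if (docs.size() >= maxDocs) {
                break; // checks between documents go here
            }
        }
        return docs;
    }

    public static void main(String[] args) {
        // Three documents in the stream, but only two are ever parsed.
        System.out.println(readUntil("a: 1\n---\nb: 2\n---\nc: 3\n", 2));
    }
}
```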

Cheers,
Andrey

Joachim Durchholz

Dec 27, 2012, 7:00:00 AM
to snakeya...@googlegroups.com
Am 27.12.2012 10:46, schrieb Andrey:
> Hi Jo,
> as far as I remember the method Yaml.loadAll() is exactly what you are
> looking for.
> According to the JavaDoc:
> *Parse all YAML documents in a String and produce corresponding Java
> objects. The documents are parsed only when the iterator is invoked.

Ah, I see. I had overlooked the last sentence because I was skimming the
docs for the right function to use and had mentally dismissed loadAll as "aww,
that's parsing everything"; the "all" in "parse all documents"
strengthened that impression further.

How about placing that information more prominently?
Suggestion:

* Returns an iterator that, on every next() call, parses another YAML
document from {@code source} and produces a corresponding Java object.

Thanks, I'm a happy user now.
At least until I overlook the next obvious thing, of course :-)

Regards,
Jo

Joachim Durchholz

Dec 27, 2012, 11:06:32 AM
to snakeya...@googlegroups.com
Am 27.12.2012 13:00, schrieb Joachim Durchholz:
> Thanks, I'm a happy user now.
> At least until I overlook the next obvious thing, of course :-)

Well, darn, I'm still in trouble.

Here's the scenario:

I have a YAML file of three chunks, consisting of two headers and a body.
The headers are designed to be short and supply enough information that
processors can reject inappropriate ones. To enforce that, I need to limit
the input size - I'm using Guava's LimitInputStream for that.
To give you an idea, assume the limits are 10K, 10M, and 10G for header
1, header 2, and body.

So my first attempt looked like this:
- Set up a pipeline like this:
FileInputStream->BufferedInputStream->LimitInputStream->Yaml
- Yaml.loadAll (returning iterator)
- Set limit to 10K
- iterator.next()
- Set limit to 10M
- iterator.next()
- Set limit to 10G
- iterator.next()
Except that LimitInputStream does not offer a way to modify the limit.
I can't "set limit to".
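For what it's worth, a wrapper with an adjustable limit is easy to hand-roll against plain java.io (the class name and setLimit method below are my own invention, not Guava's — and note it still wouldn't solve the buffering problem discussed later in this thread, since the parser may read ahead past a document boundary before the limit is raised):

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical variant of a limiting stream whose budget can be raised
// between documents; once the budget is exhausted, reads report end-of-stream.
class AdjustableLimitInputStream extends FilterInputStream {
    private long remaining;

    AdjustableLimitInputStream(InputStream in, long limit) {
        super(in);
        this.remaining = limit;
    }

    // Sets a new budget for subsequent reads.
    void setLimit(long limit) {
        this.remaining = limit;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) {
            return -1; // limit exhausted
        }
        int b = super.read();
        if (b != -1) {
            remaining--;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (remaining <= 0) {
            return -1;
        }
        int n = super.read(buf, off, (int) Math.min(len, remaining));
        if (n > 0) {
            remaining -= n;
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        InputStream raw = new ByteArrayInputStream("0123456789".getBytes("US-ASCII"));
        AdjustableLimitInputStream in = new AdjustableLimitInputStream(raw, 3);
        byte[] buf = new byte[10];
        System.out.println(in.read(buf, 0, buf.length)); // 3 bytes under the first budget
        in.setLimit(4);                                   // raise the budget
        System.out.println(in.read(buf, 0, buf.length)); // 4 more bytes
    }
}
```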

I was told by the Google folks to use multiple instances of
LimitInputStream. The idea is to piggyback a LimitInputStream on top
of an InputStream that has already been read from. (I didn't expect that
to work, but on second thought it seems entirely valid.)
So I'd have this:
- Pipeline: FileInputStream -> BufferedInputStream
- Add ->LimitInputStream(10K)->Yaml
- iterator.next()
- Drop ->LimitInputStream(10K)->Yaml
- Add ->LimitInputStream(10M)->Yaml
- iterator.next()
- Drop ->LimitInputStream(10M)->Yaml
- Add ->LimitInputStream(10G)->Yaml
- iterator.next()
Unfortunately, now I can't use loadAll because the Yaml object has no
way to switch to a different InputStream between documents.

I'll try to talk with the Guava guys, but they are very hard to
convince (with good reason).

What else should I try?

Regards,
Jo

Andrey Somov

Dec 27, 2012, 4:46:48 PM
to snakeya...@googlegroups.com
What you are asking for has nothing to do with SnakeYAML.
You should first clearly understand how it works. When you request a chunk of data from a BufferedInputStream or another InputStream, you may get much more than you need to finish a document. SnakeYAML will keep the surplus in its own buffer to feed the next document. I am afraid you have to look at the task from a helicopter view (to find another solution). I do not have any proposal.

Cheers,
Andrey





--
You received this message because you are subscribed to the Google Groups "SnakeYAML" group.
To post to this group, send email to snakeyaml-core@googlegroups.com.
To unsubscribe from this group, send email to snakeyaml-core+unsubscribe@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/snakeyaml-core?hl=en.




--
Andrey Somov

Joachim Durchholz

Dec 27, 2012, 5:22:08 PM
to snakeya...@googlegroups.com
Am 27.12.2012 22:46, schrieb Andrey Somov:
> What you are asking for has nothing to do with SnakeYAML.

Well, depends on your definition of "has nothing to do".
Could be "it's impossible" (I suspect that's not true, see below).
Could be "you don't want to bother", which would be fine if you say so.
Could be "it would be conceptually wrong to handle in Snakeyaml", which is
just not correct - stopping reading at the exact last byte would be
conceptually much cleaner than potentially reading way past the end of
the document.

> You shall first clearly understand how it works. When you ask a chunk of
> data from BufferedInputStream or another InputStream, you may get much more
> then you need to finish a document. SnakeYAML will keep it in its own
> buffer to feed the next document.

That issue did in fact cross my mind.

However, I don't think that's strictly true. If the caller holds on to
the InputStream passed into Snakeyaml, it can continue reading at the
exact point where Snakeyaml stopped. BufferedInputStream should not be
an issue with that: even if more bytes were physically read, the
BufferedInputStream's current position would still be at the last byte
that Snakeyaml processed.

It may be that Snakeyaml needs to read ahead to determine end-of-document.
If that's left as is, it's impossible to read streams that contain Yaml
documents followed by something else, so it's a bit less interoperable.
If the readahead can be limited in size, a PushbackInputStream would
cover that.

> I do not have any proposal.

Well, I'm glad I could come up with one :-)

> I am afraid, you have to look at the task
> from the helicopter view (to find another solution).

I have put in a stopgap solution.
Getting a later chunk means reading and parsing the earlier chunks again.
It's too ugly and shouldn't be let live, but at least it works.

Is there a way to ask Snakeyaml about the last byte of the final
document read? I.e. the "current byte position in the input stream".
(Obviously, won't work for something set up with a Reader.)

Regards,
Jo

Andrey Somov

Dec 27, 2012, 5:39:03 PM
to snakeya...@googlegroups.com
Yes, of course SnakeYAML needs to read ahead to determine end-of-document.

When the conversation is more than a couple of e-mails, it is not easy to follow the idea. If you feel that SnakeYAML can be improved, feel free to make a proposal. You can open an issue.

Cheers,
Andrey


Joachim Durchholz

Jan 2, 2013, 4:22:52 AM
to snakeya...@googlegroups.com
Am 27.12.2012 23:39, schrieb Andrey Somov:
> When the conversation is more then a couple of e-mails, it is not
> easy to follow the idea. If you feel that SnakeYAML can be improved,
> feel free to make a proposal. You can open an issue.

I've had some more time to think about this, and I probably will open an
issue, but there's another detail I'd like to check first.

Here's the scenario again, streamlined a bit:
Users use my software to generate files, which are sent to my server for
processing. As always, there's that small fraction of malicious users
who manipulate the files before the server sees them.
The file is partitioned into three documents, with maximum sizes of,
say, 10K, 10M, and 10G, respectively.
Processing goes like this:
- read a document
- check whether the file is eligible for processing
- abort if no, continue reading the next document if yes.

> Yes, of course SnakeYAML needs to read ahead to determine
> end-of-document.

Is the readahead bounded or unbounded?

If it is bounded or can be made bounded via assumptions like "this code
should not accept files that don't have end-of-document markers anyway",
then it can be done by progressively increasing the limit.

If it is unbounded, a malicious user could force Snakeyaml into reading
the full 10G even if it should read only the first 10K, wasting memory
and CPU.

Andrey

Jan 2, 2013, 8:33:49 AM
to snakeya...@googlegroups.com

Feel free to study ParserImpl.java

Joachim Durchholz

Jan 2, 2013, 10:12:02 AM
to snakeya...@googlegroups.com
Am 02.01.2013 14:33, schrieb Andrey:
>
>> Am 27.12.2012 23:39, schrieb Andrey Somov:
>>
>> Is the readahead bounded or unbounded?
>
> Feel free to study ParserImpl.java

Wow.

You seriously suggest that I'm better equipped to find the answer to a
nontrivial question about a grammar property than the man who has
written the parser?
Silly little me, who hasn't touched a parser in more than a decade?

Seriously.
Wow.
Not impressed.

Jo

Andrey Somov

Jan 2, 2013, 10:59:57 AM
to snakeya...@googlegroups.com
I have not touched the parser for more than 4 years.

I am always impressed when a non-profit Open Source project is supported many years.

I think your question is only relevant for an LALR parser (http://en.wikipedia.org/wiki/LALR_parser).
SnakeYAML uses an LL(1) parser.

TO AVOID ANY MISUNDERSTANDING WITH TERMINOLOGY, it is much better to have a look at the code. That is why the source is open.

Andrey

--
You received this message because you are subscribed to the Google Groups "SnakeYAML" group.

Joachim Durchholz

Jan 2, 2013, 12:44:33 PM
to snakeya...@googlegroups.com
Am 02.01.2013 16:59, schrieb Andrey Somov:
> I have not touched the parser for more than 4 years.

Good to know.
That's more current than my parser knowledge ;-)

> I think your question is only relevant for the LALR parser (
> http://en.wikipedia.org/wiki/LALR_parser).
> SnakeYAML is using LL(1) parser.

Readahead is relevant for both. The k in LL(k) is the number of symbols
that it looks ahead to determine what the next production is.

If it's indeed LL(1), then the lookahead is 1 token.

I'm a bit more worried about what will happen after a document is
finished. Is it sure that it won't try to parse the next document to
establish that the previous document is done?
I think that might actually happen if you parse a yaml file that doesn't
have start-of-document or end-of-document markers. I'm seeing Snakeyaml
returning the first object in the stream, but having read everything
that came after it.

> TO AVOID ANY MISUNDERSTANDING WITH TERMINOLOGY, it is much better to have a
> look at the code. That is why the source is open.

Normally, I'd agree, but recursive-descent parsers can exhibit implicit
and emergent behaviour that's not just in the code, so it's not a panacea.

Andrey

Jan 2, 2013, 1:43:19 PM
to snakeya...@googlegroups.com


On Wednesday, January 2, 2013 6:44:33 PM UTC+1, toolforger wrote:
> Am 02.01.2013 16:59, schrieb Andrey Somov:
> > I have not touched the parser for more than 4 years.
>
> Good to know.
> That's more current than my parser knowledge ;-)
>
> > I think your question is only relevant for the LALR parser (
> > http://en.wikipedia.org/wiki/LALR_parser).
> > SnakeYAML is using LL(1) parser.
>
> Readahead is relevant for both. The k in LL(k) is the number of symbols
> that it looks ahead to determine what the next production is.
>
> If it's indeed LL(1), then the lookahead is 1 token.

A YAML document contains ScalarTokens (strings) which do not have size limitations.


> I'm a bit more worried about what will happen after a document is
> finished. Is it sure that it won't try to parse the next document to
> establish that the previous document is done?
> I think that might actually happen if you parse a yaml file that doesn't
> have start-of-document or end-of-document markers. I'm seeing Snakeyaml
> returning the first object in the stream, but having read everything
> that came after it.

Yes, it is sure. The document is only parsed when it is explicitly asked for.
As I already said, SnakeYAML has a buffer which is read in a single step. It may read much more than is required to finish a document. The size of the buffer is not configurable. (It was never needed.)
See the source:
StreamReader.update()

http://code.google.com/p/snakeyaml/source/browse/src/main/java/org/yaml/snakeyaml/reader/StreamReader.java#180

Joachim Durchholz

Jan 2, 2013, 4:05:37 PM
to snakeya...@googlegroups.com
>> I'm a bit more worried about what will happen after a document is
>> finished. Is it sure that it won't try to parse the next document to
>> establish that the previous document is done?
>> I think that might actually happen if you parse a yaml file that doesn't
>> have start-of-document or end-of-document markers. I'm seeing Snakeyaml
>> returning the first object in the stream, but having read everything
>> that came after it.
>>
> Yes, it is sure. The document is only parsed when it is explicitly asked
> to.

Okay.

> A YAML document contains ScalarTokens (strings) which do not have size
> limitations.
[...]
> As I already said, SnakeYAML has a buffer which is read in a single step.
> It may read much more then it is required to finish a document. The size of
> the buffer is not configurable. (never needed)

Won't be needed for this case either. If Snakeyaml is still reading the
current document and hits the limit, that's okay - the document is too
long, regardless of what the parser was trying to do.

The question is whether anything unbounded can happen between documents.
That is:
When the iterator from loadAll hits an end-of-document token, does it
read the next token?

If it does read the next token:
Can that token be of unbounded size?
I control the definition of well-formed files, so I can impose a
requirement that start-of-document tokens be used. I understand that
start-of-document is always just three dashes ("---"), so that's bounded - is
that correct?

If Snakeyaml reads a token, does it also eat any whitespace or comments
that may follow it?
If yes, I'll have to rethink how to deal with a malicious file that has
a gazillion of empty lines between first and second document.
Observations:

Stylistic nitpick: Line 184 is missing a 'this.' before 'data'.

Printability is determined in two different ways, in far-away code
sections. (A regex for nonprintable, and a range check for printable).
Anybody adapting the definition of printability runs a risk of
introducing a subtle bug by making the two definitions inconsistent.
(That's a worry for the future case of the Yaml specs adapting to
changed definitions in the Unicode specs. Not going to happen very
often, but possible.)

getEncoding will crash if the Reader passed to the Yaml constructor is
not a UnicodeReader.
Not that that function is ever likely to be called. Callers could simply
hold on to the UnicodeReader and ask it for encoding and other
properties directly; this is outside of Snakeyaml's domain.
Eclipse tells me it's not being used by Snakeyaml, so it could be
deprecated.

I'm not sure why the code is reinventing the BufferedReader.
It's complicating the code a lot with all the offset calculations.
Consider this:
InputStream inputStream = new FileInputStream(filename);
inputStream = new BufferedInputStream(inputStream);
inputStream = ByteStreams.limit(inputStream, limit);
return new Yaml().loadAll(inputStream).iterator();
I can decide whether I want the limit applied before or after buffering
by swapping lines 2 and 3.
It's already doing buffering. If I need a Reader-level buffer, I can
construct a Reader (and make an informed decision whether that's worth
making the limit less precise).
I can mix and match the various buffering and other options and
benchmark each against my data mix, my JVM and/or other Java compiler
technology.
Am I overlooking something here?

Andrey

Jan 3, 2013, 4:25:28 AM
to snakeya...@googlegroups.com


On Wednesday, January 2, 2013 10:05:37 PM UTC+1, toolforger wrote:
>> I'm a bit more worried about what will happen after a document is
>> finished. Is it sure that it won't try to parse the next document to
>> establish that the previous document is done?
>> I think that might actually happen if you parse a yaml file that doesn't
>> have start-of-document or end-of-document markers. I'm seeing Snakeyaml
>> returning the first object in the stream, but having read everything
>> that came after it.
>>
> Yes, it is sure. The document is only parsed when it is explicitly asked
> to.

> Okay.

> A YAML document contains ScalarTokens (strings) which do not have size
> limitations.
[...]
> As I already said, SnakeYAML has a buffer which is read in a single step.
> It may read much more then it is required to finish a document. The size of
> the buffer is not configurable. (never needed)

> Won't be needed for this case either. If Snakeyaml is still reading the
> current document and hits the limit, that's okay - the document is too
> long, regardless of what the parser was trying to do.
>
> The question is whether anything unbounded can happen between documents.
> That is:
> When the iterator from loadAll hits an end-of-document token, does it
> read the next token?

No, it does not. This is the specification: (http://yaml.org/spec/1.1/#id897596)

When YAML is used as the format of a communication channel, it is useful to be able to indicate the end of a document without closing the stream, independent of starting the next document. Lacking such a marker, the YAML processor reading the stream would be forced to wait for the header of the next document (that may be long time in coming) in order to detect the end of the previous one. To support this scenario, a YAML document may be terminated by an explicit end line denoted by “...”, followed by optional comments. To ease the task of concatenating YAML streams, the end marker may be repeated.
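To illustrate the spec text above, a stream of two explicitly terminated documents might look like this (a made-up example, not taken from the spec):

```yaml
# first document, terminated explicitly with "..." so the reader
# does not have to wait for the next document's header
header:
  type: job
...
---
# second document, introduced with "---"
payload: some data
...
```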


UnicodeReader is taken from another project. Some redundant code is left unchanged. We had issues in the past when developers contacted us and asked to keep their original code and copyright notices.
 

> I'm not sure why the code is reinventing the BufferedReader.
> It's complicating the code a lot with all the offset calculations.
> Consider this:
>      InputStream inputStream = new FileInputStream(filename);
>      inputStream = new BufferedInputStream(inputStream);
>      inputStream = ByteStreams.limit(inputStream, limit);
>      return new Yaml().loadAll(inputStream).iterator();
> I can decide whether I want the limit applied before or after buffering
> by swapping lines 2 and 3.
> It's already doing buffering. If I need a Reader-level buffer, I can
> construct a Reader (and make an informed decision whether that's worth
> making the limit less precise).
> I can mix and match the various buffering and other options and
> benchmark each against my data mix, my JVM and/or other Java compiler
> technology.
> Am I overlooking something here?

Contributions are welcome!!!

Any investigation is also welcome. Feel free to experiment!
 

Joachim Durchholz

Jan 3, 2013, 4:47:30 AM
to snakeya...@googlegroups.com
Am 03.01.2013 10:25, schrieb Andrey:
>
>> When the iterator from loadAll hits and end-of-document token, does it
>> read the next token?
>
> No it does not.

Aaah. That's excellent!
Then I have boundedness for well-formed files and can reject overflowing
files as invalid without having to worry whether that's an artifact from
the parser.

> This is the specification:

Hm... yes, okay, I just didn't know how precise Snakeyaml is about
implementing the spec.

>> getEncoding will crash if the Reader passed to the Yaml constructor is
>> not a UnicodeReader.
>> Not that that function is likely to be ever called. Callers could simply
>> hold on to the UnicodeReader and ask it for encoding and other
>> properties directly, this is outside of Snakeyaml's domain.
>> Eclipse tells me it's not being used by Snakeyaml, so it could be
>> deprecated.
>
> UnicodeReader is taken from another project.

You mean StreamReader? 'cause UnicodeReader is part of the JDK.

> Some redundant code is left
> unchanged. We had issues in the past when developers contacted us to keep
> their original code and copyright.

I understand.

> Contributions are welcome !!!
>
> Any investigation is also welcome. Feel free to experiment !

Heh. I'd really like to clean that up, sounds like a small fun contribution.
I'm short on time, unfortunately. My free time is being eaten up by
another project (the one I'm using Snakeyaml for right now), and there's
a lingering promise I made to another project.
I'll see what I can do.

Regards,
Jo

Andrey Somov

Jan 3, 2013, 5:56:32 AM
to snakeya...@googlegroups.com
On Thu, Jan 3, 2013 at 10:47 AM, Joachim Durchholz <j...@durchholz.org> wrote:

> This is the specification:

> Hm... yes, okay, I just didn't know how precise Snakeyaml is about implementing the spec.

Any deviation from the specification is considered to be a bug, except for a few things (kept the same as in PyYAML, so that they fail or succeed together):
http://code.google.com/p/snakeyaml/wiki/Documentation#Deviations_from_the_specification

> You mean StreamReader? 'cause UnicodeReader is part of the JDK.

UnicodeReader is part of SnakeYAML. It is included only to work around a strange limitation of the JDK - it is unable to read UTF-encoded files with a BOM. The UTF-8 BOM is considered to be part of the content. There is a long-standing bug there.

http://code.google.com/p/snakeyaml/source/browse/src/main/java/org/yaml/snakeyaml/reader/UnicodeReader.java
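For comparison, the caller-side workaround is small: a character-level wrapper that drops a single leading U+FEFF. This is my own sketch, unrelated to SnakeYAML's UnicodeReader (which works at the byte level and also detects UTF-16/32 encodings from the BOM bytes):

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;

class BomSkipper {
    // Returns a Reader positioned after a single leading U+FEFF, if present;
    // otherwise the first character is pushed back and the stream is untouched.
    static Reader skipLeadingBom(Reader in) throws IOException {
        PushbackReader r = new PushbackReader(in, 1);
        int first = r.read();
        if (first != -1 && first != '\uFEFF') {
            r.unread(first); // no BOM: put the character back
        }
        return r;
    }

    public static void main(String[] args) throws IOException {
        Reader r = skipLeadingBom(new StringReader("\uFEFFa: 1"));
        StringBuilder sb = new StringBuilder();
        for (int c = r.read(); c != -1; c = r.read()) {
            sb.append((char) c);
        }
        System.out.println(sb); // the BOM is gone: a: 1
    }
}
```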

-
Cheers,
Andrey

Joachim Durchholz

Jan 3, 2013, 11:31:19 AM
to snakeya...@googlegroups.com
Am 03.01.2013 11:56, schrieb Andrey Somov:
> On Thu, Jan 3, 2013 at 10:47 AM, Joachim Durchholz <j...@durchholz.org> wrote:
>
>>
>>> This is the specification:
>>
>> Hm... yes, okay, I just didn't know how precise Snakeyaml is about
>> implementing the spec.
>
>
> Any deviation from the specification is considered to be a bug. Except some
> things (they are the same as for PyYAML to fail or succeed together):
> http://code.google.com/p/snakeyaml/wiki/Documentation#Deviations_from_the_specification
>
>>
>> You mean StreamReader? 'cause UnicodeReader is part of the JDK.
>
> UnicodeReader is part of SnakeYAML.

Ah, OK. I didn't notice that there was no import of java.io.UnicodeReader.

> It is included only to avoid a strange
> limitation of the JDK - it is unable to read UTF encoded files with BOM.
> UTF-8 BOM is considered to be a part of the content. There is a long
> standing bug there.

Hm... code using a Reader might want to know whether the stream starts
with a BOM or not. On the other hand, code that wants to skip a BOM will
see it as \uFEFF and can handle it as desired - discard it, record that
there was one, whatever.
So, if the JDK passes the BOM through, it's actually doing the Right
Thing because it allows for more use cases. Admittedly, the use case
that wants to see the BOM is rarer, but it's valid (a UTF-8 BOM does not
define byte order, but it can serve as an encoding indicator and is in
active use for that purpose).

Anyway, Snakeyaml won't benefit from a BOM-skipping Reader.

Either the Reader skips just the first BOM, then it doesn't deal with
BOMs between documents as http://yaml.org/spec/1.2/spec.html#id2771184
demands:

> To make it easier to concatenate streams, byte order marks may appear
> at the start of any document.

Or the Reader skips all BOMs, then Snakeyaml would fail to produce this
error message:

> ERROR: A BOM must not appear inside a document.

Snakeyaml needs to define the BOM as a token and handle it as a document
boundary.

Andrey Somov

Jan 3, 2013, 12:48:20 PM
to snakeya...@googlegroups.com
On Thu, Jan 3, 2013 at 5:31 PM, Joachim Durchholz <j...@durchholz.org> wrote:
> Hm... code using a Reader might want to know whether the stream starts with a BOM or not. On the other hand, code that wants to skip a BOM will see it as \uFEFF and can handle it as desired - discard it, record that there was one, whatever.
> So, if the JDK passes the BOM through, it's actually doing the Right Thing because it allows for more use cases. Admittedly, the use case that wants to see the BOM is rarer, but it's valid (a UTF-8 BOM does not define byte order, but it can serve as an encoding indicator and is in active use for that purpose).
>
> Anyway, Snakeyaml won't benefit from a BOM-skipping Reader.


a) To understand the problem:
1) create a UTF-8 document with BOM
2) open it properly (Windows Notepad :) and observe no BOM
3) open it with JDK and observe the BOM
4) create a UTF-16 document with BOM and open it with JDK -> no BOM

Why is the behaviour asymmetrical? Why should users get different content when they give the very same UTF-8 YAML document to different parsers (other languages handle this properly)?

b)
1) try to change SnakeYAML, run the tests and see the failures (it has 97% coverage)
2) if you think it can be improved -> contributions are welcome

The BOM inside the document is not supported on purpose - to have compatibility with other YAML parsers. (Python, Ruby)

Cheers,
Andrey

Joachim Durchholz

Jan 3, 2013, 2:18:41 PM
to snakeya...@googlegroups.com
Am 03.01.2013 18:48, schrieb Andrey Somov:
> On Thu, Jan 3, 2013 at 5:31 PM, Joachim Durchholz <j...@durchholz.org> wrote:
>
>> Hm... code using a Reader might want to know whether the stream starts
>> with a BOM or not. On the other hand, code that wants to skip a BOM will
>> see it as \uFEFF and can handle it as desired - discard it, record that
>> there was one, whatever.
>> So, if the JDK passes the BOM through, it's actually doing the Right Thing
>> because it allows for more use cases. Admittedly, the use case that wants
>> to see the BOM is rarer, but it's valid (a UTF-8 BOM does not define byte
>> order, but it can serve as an encoding indicator and is in active use for
>> that purpose).
>>
>> Anyway, Snakeyaml won't benefit from a BOM-skipping Reader.
>>
> a) To understand the problem:
> 1) create a UTF-8 document with BOM
> 2) open it properly (Windows Notepad :) and observe no BOM

Yes, it's removing the BOM when reading, and (I think) it's writing the
BOM on output.
Nothing surprising here, that's well-known behaviour in the Windows world.

> 3) open it with JDK and observe the BOM
> 4) create a UTF-16 document with BOM and open it with JDK -> no BOM
>
> Why is the behaviour asymmetrical?

No idea; maybe UTF-16 support in the JDK is wrong, maybe Windows
software doesn't provide a UTF-16 BOM.
I dimly recall that the JDK is inconsistent here, but if anything, the
bug isn't in the variants that keep the BOM, it's in those that drop it.
IMHO, anyway.

Snakeyaml shouldn't care much anyway; it should simply strip the BOM in
the scanner. No need to do any special character-set handling - just
strip it at the Unicode level, where it's uniformly seen as \uFEFF.

> Why the users shall get different
> content when they give the very same UTF-8 YAML document to different
> parsers (other languages work properly)?

Eh? The users will get the same content, since the Yaml parsing process
strips the BOMs anyway.

> b)
> 1) try to change SnakeYAML, run the tests and see the failures (it has 97%
> coverage)
> 2) if you think it can be improved -> contributions are welcome
>
> The BOM inside the document is not supported on purpose - to have
> compatibility with other YAML parsers. (Python, Ruby)

a) Then these parsers are in violation of the 1.2 spec.
b) I don't see any practically relevant incompatibilities arise if
Snakeyaml starts accepting those optional BOMs between documents.

In my book, practically relevant incompatibilities would be one of these:

* Snakeyaml writes something that other parsers can't read. *
Inapplicable, this is about reading. The emitter shouldn't write a BOM
(except at the start of a stream, but even that isn't so clear as
Snakeyaml can't know whether the BOM wasn't already written).

* Snakeyaml cannot read something that the other parsers accept. *
Since Snakeyaml is getting slightly more liberal in what it accepts,
this isn't going to happen.

* Snakeyaml cannot read something that the other libraries emit. *
Again, Snakeyaml is getting more liberal, so if anything, this
compatibility aspect should improve.

Andrey Somov

Jan 3, 2013, 2:54:47 PM
to snakeya...@googlegroups.com
On Thu, Jan 3, 2013 at 8:18 PM, Joachim Durchholz <j...@durchholz.org> wrote:
> a) Then these parsers are in violation of the 1.2 spec.

Since we change the topics so often, it is not easy for me to follow you. I think it is better to stop now. 
We can start a new topic any time.

The 1.2 specification is not yet supported by SnakeYAML (nor by many other YAML parsers).

Joachim Durchholz

Jan 3, 2013, 3:34:49 PM
to snakeya...@googlegroups.com
Am 03.01.2013 20:54, schrieb Andrey Somov:
> On Thu, Jan 3, 2013 at 8:18 PM, Joachim Durchholz <j...@durchholz.org> wrote:
>
>> a) Then these parsers are in violation of the 1.2 spec.
>
> Since we change the topics so often, it is not easy for me to follow you. I
> think it is better to stop now.
> We can start a new topic any time.

Feel free to do so whenever you think it's appropriate.
I don't always notice a topic change.
I have changed the Subject to reflect the current discourse.

> 1.2 specification is not supported by SnakeYAML yet (as well as by many
> other YAML parsers).

Ah, okay. Somehow I came under the impression that Snakeyaml is on 1.2.

However, 1.1 has just the same requirement for this, see
http://yaml.org/spec/1.1/current.html#id898785 :

> To ease the task of concatenating character streams, following
> documents may begin with a byte order mark and comments
[...]
> [106] l-next-document ::=
> c-byte-order-mark? l-comment* l-directive* l-explicit-document

Andrey Somov

Jan 4, 2013, 6:20:18 AM
to snakeya...@googlegroups.com
I think I get your point now: instead of stripping the BOM at the IO level, strip it in the scanner.
Then we can deprecate (and later remove) UnicodeReader.
I wonder what the advantages are for the end user?
We can create an issue (to keep track of this proposal).

Cheers,
Andrey

Joachim Durchholz

Jan 5, 2013, 7:02:53 AM
to snakeya...@googlegroups.com
Am 04.01.2013 12:20, schrieb Andrey Somov:
> I think, I get your point now. Instead of stripping BOM at IO, to strip it
> in the scanner.
> Then we can deprecate (and later remove) UnicodeReader.
> I wonder what are the advantages for the end user ?

The spec already mentions it: If multiple streams (files, sockets,
whatever) are interleaved, the caller doesn't need to strip BOMs.

My own observation is that as systems grow larger, it's getting more and
more bothersome to propagate logical dependencies such as "this data is
ultimately going to a Yaml parser, so I need to strip the BOM".

> We can create an issue (to keep track of this proposal)

Will do if you think it should be done.

Andrey Somov

Jan 7, 2013, 2:55:46 AM
to snakeya...@googlegroups.com
On Sat, Jan 5, 2013 at 1:02 PM, Joachim Durchholz <j...@durchholz.org> wrote:
> Am 04.01.2013 12:20, schrieb Andrey Somov:
>
>> I think, I get your point now. Instead of stripping BOM at IO, to strip it
>> in the scanner.
>> Then we can deprecate (and later remove) UnicodeReader.
>> I wonder what are the advantages for the end user ?
>
> The spec already mentions it: If multiple streams (files, sockets, whatever) are interleaved, the caller doesn't need to strip BOMs.
>
> My own observation is that as systems grow larger, it's getting more and more bothersome to propagate logical dependencies such as "this data is ultimately going to a Yaml parser, so I need to strip the BOM".

I think this is exactly what we have now - no need to worry about the BOM.



>> We can create an issue (to keep track of this proposal)
>
> Will do if you think it should be done.

 
If you do not wish to drive the change and contribute the code, then there is no need to create an issue.

Cheers,
Andrey

Joachim Durchholz

Jan 7, 2013, 2:22:07 PM
to snakeya...@googlegroups.com
Am 07.01.2013 08:55, schrieb Andrey Somov:
> On Sat, Jan 5, 2013 at 1:02 PM, Joachim Durchholz <j...@durchholz.org> wrote:
>
>> Am 04.01.2013 12:20, schrieb Andrey Somov:
>>
>> I think, I get your point now. Instead of stripping BOM at IO, to strip it
>>> in the scanner.
>>> Then we can deprecate (and later remove) UnicodeReader.
>>> I wonder what are the advantages for the end user ?
>>>
>>
>> The spec already mentions it: If multiple streams (files, sockets,
>> whatever) are interleaved, the caller doesn't need to strip BOMs.
>>
>> My own observation is that as systems grow larger, it's getting more and
>> more bothersome to propagate logical dependencies such as "this data is
>> ultimately going to a Yaml parser, so I need to strip the BOM".
>
>
> I think this is exactly what we have now - no need to worry about the BOM.

Is Snakeyaml skipping inter-document BOMs?

Andrey Somov

Jan 8, 2013, 1:54:26 AM
to snakeya...@googlegroups.com


>> I think this is exactly what we have now - no need to worry about the BOM.
>
> Is Snakeyaml skipping inter-document BOMs?


Skipping inter-document BOMs is not implemented in any existing YAML parser. Such a feature should be implemented by everyone; otherwise it only creates misunderstanding.
It does not mean that SnakeYAML should not implement it.

Andrey

Joachim Durchholz

Jan 8, 2013, 2:58:38 AM
to snakeya...@googlegroups.com
Am 08.01.2013 07:54, schrieb Andrey Somov:
>>
>>> I think, this exactly what we have now - no need to worry about BOM.
>>>
>>
>> Is Snakeyaml skipping inter-document BOMs?
>>
>>
> Skipping inter-document BOMs is not implemented on any existing YAML
> parser.

Too bad.

> Such a feature should be implemented by everyone, otherwise it only
> creates misunderstanding.

As I already explained, it should not cause any problems for any parser
that implements it.

> It does not mean that SnakeYAML should not implement it.

In my book, it would make the parser less dependent on outside parties
doing something (stripping BOMs properly), so it would expand
Snakeyaml's applicability to those projects where, for some reason, the
outside parties can't be made to do it.
In other words, it would extend Snakeyaml's applicability a bit,
particularly into really large projects where such issues can become
serious problems.

I don't have a large project, so it's not my use case and I can live
with Snakeyaml's current BOM handling. Since my time is really scarce
currently, I won't be able to do it.
Since you're not going to do it either, I guess opening an issue isn't
going to help, so I'm leaving the issue as it is.

Regards,
Jo

Andrey Somov

Jan 8, 2013, 3:30:00 AM
to snakeya...@googlegroups.com


> I don't have a large project, so it's not my use case and I can live with Snakeyaml's current BOM handling. Since my time is really scarce currently, I won't be able to do it.
> Since you're not going to do it either, I guess opening an issue isn't going to help, so I'm leaving the issue as it is.


I think it can be done once we need it. The information is there:

http://code.google.com/p/snakeyaml/wiki/Documentation#Deviations_from_the_specification

Thank you for your time.

Cheers,
Andrey