ANN: Another binary parser combinator - this time for Java's streams

Steffen

Jan 29, 2014, 12:32:03 PM

to clo...@googlegroups.com

Hello Clojure community,

there are already two excellent libraries for reading, writing and manipulating binary data: Zach's Gloss and ClojureWerkz' Buffy for Java's ByteBuffers. I would like to offer another library, this one for Java's Input/OutputStreams. It is inspired by Gloss but not compatible with its syntax, I'm sorry.

The focus is on

- read/write performance,
- no external dependencies,
- working with java.io.*Stream.

If you use Leiningen please add the following to your dependencies:

[org.clojars.smee/binary "0.2.4"]

The link to the source code and README is https://github.com/smee/binary.

Demo code to parse Bitcoin blocks (including scripts): https://github.com/smee/binary/blob/master/src/org/clojars/smee/binary/demo/bitcoin.clj

Demo code for MP3 ID3v2 tags (work in progress): https://github.com/smee/binary/blob/master/src/org/clojars/smee/binary/demo/mp3.clj

Apart from the README and the docstrings there is no further documentation yet. Please refer to the demos and the unit tests for now.

Thanks,

Steffen Dienst

Michael Gardner

Jan 29, 2014, 4:49:56 PM

to clo...@googlegroups.com

Looks good! A few questions:

1) Is it possible to specify a byte length for a 'repeated codec, rather than a number of objects?

2) Would you consider an enum type, for convenience? Something like:

(defn enum [type m]
  (compile-codec type m
                 (clojure.set/map-invert m)))

3) In the mp3.clj demo, the flags seem to be listed in the wrong order. Or does the 'bits function actually take its arguments LSB-first?

> --
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clo...@googlegroups.com
> Note that posts from new members are moderated - please be patient with your first post.
> To unsubscribe from this group, send email to
> clojure+u...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> ---
> You received this message because you are subscribed to the Google Groups "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to clojure+u...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.

Steffen

Jan 30, 2014, 7:36:36 AM

to clo...@googlegroups.com

Please see below.

On Wednesday, January 29, 2014 at 17:49:56 UTC+1, Michael Gardner wrote:

> Looks good! A few questions:

Thanks.

> 1) Is it possible to specify a byte length for a 'repeated codec, rather than a number of objects?

If your 'object' is a byte, sure:

(repeated :byte :length 1234)

If you would like to use a specific codec other than :byte or :ubyte but also restrict the number of bytes read, this only works if you expect some kind of optional padding after your objects, like:

(padding inner-codec 4096).

> 2) Would you consider an enum type, for convenience? Something like:
>
> (defn enum [type m]
>   (compile-codec type m
>                  (clojure.set/map-invert m)))

So m would be a map from, for example, keywords to a native datatype like int, letting you represent a fixed number of things with distinct binary representations? Looks good to me. What do you think the behaviour should be for an unspecified value (one not in m)?

> 3) In the mp3.clj demo, the flags seem to be listed in the wrong order. Or does the 'bits function actually take its arguments LSB-first?

Currently the index in the vector is the index of the bit. Yes, that means LSB-first.
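To make the LSB-first convention concrete, here is a minimal sketch in plain Clojure. It does not use the library; `decode-flags` and `encode-flags` are hypothetical helper names, and the only assumption taken from the thread is that the position of a flag name in the vector is the index of its bit.

```clojure
;; Hypothetical helpers, not part of the library: they only illustrate
;; "index in the vector = index of the bit" (LSB-first).

(defn decode-flags
  "Returns the set of flag names whose bit is set in the integer b.
  A nil entry in flag-names marks an unused bit."
  [flag-names b]
  (set (keep-indexed (fn [i flag]
                       (when (and flag (bit-test b i)) flag))
                     flag-names)))

(defn encode-flags
  "Returns the integer that has bit i set for every name at index i
  of flag-names that is contained in the set flags."
  [flag-names flags]
  (reduce (fn [acc [i flag]]
            (if (contains? flags flag) (bit-set acc i) acc))
          0
          (map-indexed vector flag-names)))

;; Bit 0 is the first vector entry, so 2r101 sets :a and :c.
(decode-flags [:a :b :c] 2r101)     ; => #{:a :c}
(encode-flags [:a :b :c] #{:a :c})  ; => 5
```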

Michael Gardner

Jan 30, 2014, 1:05:07 PM

to clo...@googlegroups.com

On Jan 30, 2014, at 01:36 , Steffen <steffen...@gmail.com> wrote:

> If you would like to use a specific codec other than :byte or :ubyte but also restrict the number of bytes read this would only work if you expected to have some kind of optional padding after your objects, like:
>
> (padding inner-codec 4096).

Yes, that's exactly what I need. I didn't try 'padding because the docs seemed to say that it works only when encoding.

My only problem is that when decoding, I don't know how many objects to expect before the padding (this is for parsing ID3v2 tags). Ideally I'd like to say something like (padding (repeated frame-codec) byte-count), with the padding taking over once the inner codec fails to parse the next available bytes (but see the next point).

>> (defn enum [type m]
>>   (compile-codec type m
>>                  (clojure.set/map-invert m)))
> So m would be a map from, for example, keywords to a native datatype like int, letting you represent a fixed number of things with distinct binary representations? Looks good to me. What do you think the behaviour should be for an unspecified value (one not in m)?

I'd expect an exception to be thrown in case of an unspecified value. But when decoding, it would be nice if the exception were (optionally?) swallowed when occurring inside a 'padding construct, to allow something like the above example. Though I don't know how many other binary formats would require something like that; I imagine most aren't as dumb as ID3v2.
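The enum idea with the "throw on unspecified value" behaviour can be sketched without the library. The sketch below only builds the two conversion functions that the proposed `enum` would pass to `compile-codec` (the forward map and its inverse). The name `enum-fns` and the keys `:pre-encode` and `:post-decode` are made up for illustration.

```clojure
(require '[clojure.set :as set])

(defn enum-fns
  "Returns a map with the two conversion functions an enum codec needs:
  :pre-encode maps a symbolic value to its binary value, :post-decode
  maps it back. Both throw on a value that is not in m.
  (Hypothetical helper, not part of the library.)"
  [m]
  (let [strict (fn [table direction]
                 (fn [x]
                   (if (contains? table x)
                     (get table x)
                     (throw (ex-info "unspecified enum value"
                                     {:direction direction :value x})))))]
    {:pre-encode  (strict m :encode)
     :post-decode (strict (set/map-invert m) :decode)}))

(let [{:keys [pre-encode post-decode]} (enum-fns {:mono 0 :stereo 1})]
  [(pre-encode :stereo) (post-decode 0)])  ; => [1 :mono]
```

Throwing an `ex-info` with the offending value attached leaves it to the caller (or to an enclosing construct) to decide whether the failure is fatal.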

> Currently the index in the vector is the index of the bit. Yes, that means LSB-first.

Then the docs seem to be wrong (or at least confusing), since the example code for 'bits says the first item corresponds to the "highest" bit.

Steffen Dienst

Jan 30, 2014, 2:10:40 PM

to clo...@googlegroups.com

On Thursday, January 30, 2014 at 14:05:07 UTC+1, Michael Gardner wrote:

> On Jan 30, 2014, at 01:36 , Steffen <steffen...@gmail.com> wrote:
>
>> If you would like to use a specific codec other than :byte or :ubyte but also restrict the number of bytes read this would only work if you expected to have some kind of optional padding after your objects, like:
>>
>> (padding inner-codec 4096).
>
> Yes, that's exactly what I need. I didn't try 'padding because the docs seemed to say that it works only when encoding.

My bad. I changed the README. Padding will always read the given number of bytes before using the inner codec on those bytes. When writing, it adds as many padding bytes as needed to ensure that the expected number of bytes is written.

> My only problem is that when decoding, I don't know how many objects to expect before the padding (this is for parsing ID3v2 tags). Ideally I'd like to say something like (padding (repeated frame-codec) byte-count), with the padding taking over once the inner codec fails to parse the next available bytes (but see the next point).

That's exactly what padding is designed to do: Let's say you know there is a run of bytes with a known length (from a header field maybe) and you want to parse an unbounded number of objects within this area. You could use

(padding (repeated inner-codec) 1024)

Another example: Let's assume an input stream with these bytes: [11 5 0 0 0 9 0 0 0 0x99 0x99 0x99]

;; The padding length is determined by the byte header; within those 11 bytes
;; the inner codec `repeated` can only read two integers (8 bytes).
(header :byte #(padding (repeated :int-le) % 0x99) (constantly 11))

=> [5 9] ; now the input stream is empty
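For readers who want to trace the byte example by hand, the following sketch reproduces the same read using only java.io and java.nio, without the library. `read-padded-ints` is a hypothetical name. The sketch assumes that the leading byte is the length of the padded region and that anything too short to form a 4-byte little-endian integer counts as padding.

```clojure
(import '(java.io ByteArrayInputStream DataInputStream)
        '(java.nio ByteBuffer ByteOrder))

(defn read-padded-ints
  "Reads one length byte, consumes exactly that many bytes from the
  stream and decodes as many little-endian 32-bit integers from them
  as fit. Leftover bytes are treated as padding and dropped.
  (Hypothetical helper, not the library's implementation.)"
  [^java.io.InputStream in]
  (let [din (DataInputStream. in)
        len (.readUnsignedByte din)
        buf (byte-array len)]
    (.readFully din buf)
    (let [bb (.order (ByteBuffer/wrap buf) ByteOrder/LITTLE_ENDIAN)]
      (loop [acc []]
        (if (>= (.remaining bb) 4)
          (recur (conj acc (.getInt bb)))
          acc)))))

(def example-bytes
  ;; unchecked-byte is needed because 0x99 does not fit a signed byte
  (byte-array (map unchecked-byte [11 5 0 0 0 9 0 0 0 0x99 0x99 0x99])))

(read-padded-ints (ByteArrayInputStream. example-bytes))  ; => [5 9]
```

The whole padded region is consumed up front, which is why nothing is left in the stream afterwards, regardless of how many integers were decoded.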

>>> (defn enum [type m]
>>>   (compile-codec type m
>>>                  (clojure.set/map-invert m)))
>> So m would be a map of for example keywords to a native datatype like int that would allow you to represent a fixed number of things with distinct binary representations? Looks good to me. What do you think should be the behaviour in case of an unspecified value (not in m)?

> I'd expect an exception to be thrown in case of an unspecified value. But when decoding, it would be nice if the exception were (optionally?) swallowed when occurring inside a 'padding construct, to allow something like the above example. Though I don't know how many other binary formats would require something like that; I imagine most aren't as dumb as ID3v2.

Currently codecs don't know about their context. That means a codec can't behave differently depending on whether it is used within a padding or not, sorry.

>> Currently the index in the vector is the index of the bit. Yes, that means LSB-first.

> Then the docs seem to be wrong (or at least confusing), since the example code for 'bits says the first item corresponds to the "highest" bit.

Thanks, I fixed the documentation.

Michael Gardner

Jan 30, 2014, 2:54:55 PM

to clo...@googlegroups.com

On Jan 30, 2014, at 08:10 , Steffen Dienst <steffen...@gmail.com> wrote:

> That's exactly what padding is designed to do: Let's say you know there is a run of bytes with a known length (from a header field maybe) and you want to parse an unbounded number of objects within this area. You could use
>
> (padding (repeated inner-codec) 1024)

Excellent.

> Currently codecs don't know about their context, that means, I can't behave differently depending on whether a codec is used within a padding or not, sorry.

It could work the other way around, with 'padding catching certain types of exceptions thrown by its inner codecs.

For example, when parsing something like (padding (repeated (constant 0x99)) len pad-byte), padding could catch the exception thrown by the constant codec and then use pad-byte to parse the remaining bytes.

But I can live without this, if it's too niche or too hard to implement.
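The proposal above (the enclosing construct catches the inner parser's exception and treats the rest of the region as padding) can be sketched independently of the library. This is only an illustration of the idea, not something the library is known to do. `parse-until-failure` and `parse-frame` are hypothetical names, and the frame layout (a tag byte that must be 1, followed by one value byte) is invented for the example.

```clojure
(import '(java.nio ByteBuffer))

(defn parse-until-failure
  "Applies parse-one to a bounded region until it throws. The position
  is rewound to where the failed attempt started, and the number of
  remaining bytes is reported as padding."
  [parse-one ^ByteBuffer region]
  (loop [items []]
    (let [start (.position region)
          item  (try (parse-one region)
                     (catch Exception _ ::failed))]
      (if (= item ::failed)
        (do (.position region start)
            {:items items :padding-bytes (.remaining region)})
        (recur (conj items item))))))

(defn parse-frame
  "Invented frame format: tag byte 1 followed by one value byte.
  Throws on any other tag and on a truncated frame."
  [^ByteBuffer bb]
  (let [tag (long (.get bb))]
    (when-not (= tag 1)
      (throw (ex-info "unknown frame tag" {:tag tag})))
    {:value (long (.get bb))}))

(parse-until-failure parse-frame (ByteBuffer/wrap (byte-array [1 7 1 8 0 0 0])))
;; => {:items [{:value 7} {:value 8}], :padding-bytes 3}
```

Catching every `Exception` is the bluntest possible choice. A real implementation would probably restrict this to a dedicated exception type so that genuine bugs in the inner parser are not silently swallowed.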

>> Then the docs seem to be wrong (or at least confusing), since the example code for 'bits says the first item corresponds to the "highest" bit.
> Thanks, I fixed the documentation.

A couple other things about the README:

The docs for 'header say that body->header should produce a codec that will be used to encode the header, but in testing I've had to make it return the header directly (which does make more sense).

Also, the expression #{:a :b:last} in the 'bits section is missing a space.

Thanks for all the help, by the way!

Steffen Dienst

Jan 31, 2014, 8:12:23 AM

to clo...@googlegroups.com

Thanks, I fixed the documentation issues. Feel free to share your ID3 tag parser, if you like :) You can see that mine is still stuck at the very beginning...

Stathis Sideris

Feb 3, 2014, 4:13:13 PM

to clo...@googlegroups.com

Hello,

Is it possible to use 'repeated with a dynamic size if the length-defining prefix does not directly precede the content? For example, see PNG chunks:

http://en.wikipedia.org/wiki/Portable_Network_Graphics#.22Chunks.22_within_the_file

The codec would be:

(def chunk
  (b/ordered-map
    :length :int-be
    :type   (b/repeated :byte :length 4)
    :data   (b/repeated :byte :length ???)
    :crc    (b/repeated :byte :length 4)))

What do I put in place of "???"?

Thanks,

Stathis

Steffen Dienst

Feb 3, 2014, 4:50:12 PM

to clo...@googlegroups.com

I would use header for this:

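(def chunk
  (b/header :int-be
            ;; header->body: build the body codec from the length that was read
            #(b/ordered-map
               :type (b/repeated :byte :length 4)
               :data (b/repeated :byte :length %)
               :crc  (b/repeated :byte :length 4))
            ;; body->header: compute the length from the body when writing
            #(count (:data %))))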

The resulting data structure would not contain the length field in this case. The length only gets used to configure the inner codec for the body (the map with :type, :data and :crc). You can read this codec as: "Read a big-endian integer, then use this value to construct a new codec to read the body. When writing, count the :data field, write the length using :int-be and then write the body".
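The read and write flow described in that last sentence can be spelled out with plain java.io streams, which are big-endian by default. This is a sketch of the flow only, not of how the library implements header. `read-bytes`, `read-chunk` and `write-chunk` are hypothetical names.

```clojure
(import '(java.io ByteArrayInputStream ByteArrayOutputStream
                  DataInputStream DataOutputStream))

(defn read-bytes
  "Reads exactly n bytes and returns them as a vector."
  [^DataInputStream din n]
  (let [buf (byte-array n)]
    (.readFully din buf)
    (vec buf)))

(defn read-chunk
  "Reads the big-endian length first, then uses it to size :data.
  The length itself is not part of the result."
  [^DataInputStream din]
  (let [len  (.readInt din)
        type (read-bytes din 4)
        data (read-bytes din len)
        crc  (read-bytes din 4)]
    {:type type :data data :crc crc}))

(defn write-chunk
  "Derives the length from :data, writes it as a big-endian int,
  then writes the body fields in order."
  [^DataOutputStream dout chunk]
  (.writeInt dout (count (:data chunk)))
  (doseq [b (concat (:type chunk) (:data chunk) (:crc chunk))]
    (.writeByte dout (int b))))

;; Round trip: the length prefix 0 0 0 3 is computed, never stored in the map.
(let [chunk {:type [73 72 68 82] :data [1 2 3] :crc [0 0 0 0]}
      out   (ByteArrayOutputStream.)]
  (write-chunk (DataOutputStream. out) chunk)
  (read-chunk (DataInputStream. (ByteArrayInputStream. (.toByteArray out)))))
```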

Steffen

Stathis Sideris

Feb 4, 2014, 10:35:30 AM

to clo...@googlegroups.com

Thanks, header seems very useful and relevant to what I was doing, but I ended up doing something slightly different because I needed to include the information retrieved using the chunk header codec in the final result (specifically, the type of the chunk). Here is some code:

https://gist.github.com/stathissideris/8801295

select-codec is almost identical to header (didn't bother with writing in this case), but it also merges the result of the "decision-codec" with the result of the selected codec. Of course it's less generic than header because it makes the assumption that we're dealing with maps. Also, note the use of core.match to decide on what codec to use.
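The contents of the gist are not reproduced here, so the following is only a guess at the shape of the idea, written against plain java.io: read a "decision" map first, pick a body reader based on it, and merge both results. All names (`select-read`, `read-decision`, `decision->reader`) and the one-byte type tag are invented for illustration, and `case` stands in for core.match.

```clojure
(import '(java.io ByteArrayInputStream DataInputStream))

(defn select-read
  "Reads a decision map, uses it to choose a body reader, and returns
  the decision merged with the body. Assumes both results are maps."
  [read-decision decision->reader ^DataInputStream in]
  (let [decision (read-decision in)
        body     ((decision->reader decision) in)]
    (merge decision body)))

(defn read-decision
  "Invented header: a single type byte."
  [^DataInputStream in]
  {:type (case (long (.readUnsignedByte in))
           1 :num
           2 :flag
           :unknown)})

(defn decision->reader
  "Chooses how to read the body based on the :type of the decision."
  [decision]
  (case (:type decision)
    :num  (fn [^DataInputStream in] {:value (.readInt in)})
    :flag (fn [^DataInputStream in] {:value (.readBoolean in)})
    (fn [_] {})))

(select-read read-decision decision->reader
             (DataInputStream. (ByteArrayInputStream. (byte-array [1 0 0 0 42]))))
;; => {:type :num, :value 42}
```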

Stathis
