[ANN] byte-streams: a rosetta stone for all the byte representations the jvm has to offer

Zach Tellman

Jun 29, 2013, 1:57:58 PM
to clo...@googlegroups.com
I've recently been trying to pull out useful pieces from some of my more monolithic libraries.  The most recent result is 'byte-streams' [1], a library that figures out how to convert between different byte representations (including character streams), and how to efficiently transfer bytes between various byte sources and sinks.  The net result is that you can do something like:

  (byte-streams/convert (File. "/tmp/foo") String {:encoding "utf-8"})

and get a string representation of the file's contents.  Of course, this is already possible using 'slurp', but you could also convert it to a CharSequence, or lazy sequence of ByteBuffers, or pretty much anything else you can imagine.  This is accomplished by traversing a graph of available conversions (don't worry, it's memoized), so simply defining a new conversion from some custom type to (say) a ByteBuffer will transitively allow you to convert it to any other type.

As an aside, this sort of conversion mechanism isn't limited to just byte representations, but I'm not sure if there's another large collection of mostly-isomorphic types out there that would benefit from this.  If anyone has ideas on where else this could be applied, I'd be interested to hear them.

Zach

Dan Burkert

Jun 30, 2013, 11:46:47 PM
to clo...@googlegroups.com
Very cool, I've got a couple of questions: the README references optimized transfers; what qualifies as an optimized transfer?  Also, would it be possible for byte-streams to give an estimate of the number of memory copies that happen in a given conversion (maybe this is as simple as the number of steps...)?  Thanks for releasing this!

-- Dan

Zach Tellman

Jul 1, 2013, 2:19:54 PM
to clo...@googlegroups.com
The library defines two protocols, ByteSource and ByteSink.  They expose a 'take-bytes!' and 'send-bytes!' method, respectively.  An unoptimized transfer involves shuttling bytes between these two methods, which in most cases means one copy and one allocation (ByteBuffer can act as a ByteSource and requires neither).
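In plain Java terms, the default shuttle Zach describes amounts to the classic chunked copy loop. This is only a sketch of the shape of `take-bytes!`/`send-bytes!` traffic, not byte-streams' actual code; the buffer size and names are illustrative:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class DefaultTransfer {
    // Generic fallback: repeatedly take a chunk from the source and send it
    // to the sink. Each iteration costs one copy into `buf`, which is the
    // "one copy and one allocation" overhead of the unoptimized path.
    static void transfer(InputStream source, OutputStream sink) throws IOException {
        byte[] buf = new byte[8192];
        int n;
        while ((n = source.read(buf)) != -1) {
            sink.write(buf, 0, n);
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("hello, bytes".getBytes("UTF-8"));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transfer(in, out);
        System.out.println(out.toString("UTF-8"));
    }
}
```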

An optimized transfer is anything which short-circuits this default mechanism.  Unfortunately, exactly how optimal this is depends heavily on what you're transferring.  Copying a File to a Channel representing a file descriptor will often use OS-provided zero-copy mechanisms, which makes it a lot more "optimal" than the optimized transfer between an InputStream and an OutputStream, which just involves reusing an array for shuttling data between them.  Basically, 'optimized-transfer?' returning true means that a special function exists which is as optimal as I (or whoever wrote the def-transfer) know how to make it.
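The file-to-channel case maps onto java.nio's `FileChannel.transferTo`, which is where the OS zero-copy path (e.g. sendfile on Linux) comes from. A minimal self-contained illustration, not byte-streams' implementation:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopy {
    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("zc-src", ".bin");
        Path dst = Files.createTempFile("zc-dst", ".bin");
        Files.write(src, "some file contents".getBytes("UTF-8"));

        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.WRITE)) {
            // transferTo lets the OS move bytes kernel-side where it can,
            // instead of copying them through a user-space buffer. It may
            // transfer fewer bytes than requested, hence the loop.
            long pos = 0, size = in.size();
            while (pos < size) {
                pos += in.transferTo(pos, size - pos, out);
            }
        }
        System.out.println(new String(Files.readAllBytes(dst), "UTF-8"));
    }
}
```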

The number of steps is definitely not the number of copies.  Wrapping an array in a ByteArrayInputStream, for instance, is just allocating a wrapper.  Some are more complicated though; creating an array from a ByteBuffer is only zero-copy if the ByteBuffer is non-direct *and* the position and limit haven't been moved.  Right now my traversal is treating each transform as equally complex, but an obvious improvement is to allow each transform to be annotated with some sort of overhead metric, and minimize based on that.  A lot of the low-level optimizations are implementation-specific, though, so it's always going to be a little fuzzy.
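The ByteBuffer-to-array condition can be made concrete with java.nio's own predicates. The helper below is a sketch of the check Zach describes (hand back the backing array only when the buffer's visible window is the whole array; otherwise pay for one copy), not byte-streams' actual transform:

```java
import java.nio.ByteBuffer;

public class BufferToArray {
    // Zero-copy only when the buffer is heap-backed and its position, limit,
    // and array offset still cover the entire backing array; otherwise copy.
    static byte[] toArray(ByteBuffer buf) {
        if (buf.hasArray()
                && buf.arrayOffset() == 0
                && buf.position() == 0
                && buf.limit() == buf.capacity()) {
            return buf.array();           // zero-copy: the wrapper's own array
        }
        byte[] copy = new byte[buf.remaining()];
        buf.duplicate().get(copy);        // one copy; original buffer untouched
        return copy;
    }

    public static void main(String[] args) {
        ByteBuffer whole = ByteBuffer.wrap(new byte[] {1, 2, 3, 4});
        System.out.println(toArray(whole) == whole.array());  // true: shared

        ByteBuffer moved = ByteBuffer.wrap(new byte[] {1, 2, 3, 4});
        moved.position(1);                 // window no longer covers the array
        System.out.println(toArray(moved).length);  // 3: copied remainder
    }
}
```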

If you have ideas on any of the above, please speak up.

Zach

Mikera

Jul 2, 2013, 6:19:39 AM
to clo...@googlegroups.com
This is cool, thanks Zach!

Another set of mostly-isomorphic types that this could be applied to is the different matrix/array types in core.matrix. core.matrix already has generic conversion mechanisms, but they probably aren't as efficient as they could be. I'll take a look and see if the same techniques might be applicable.

Quick question for you and the crowd: does there already exist, or should we build, a standard immutable byte data representation for Clojure?

I think this is often needed: ByteBuffers and byte[] arrays work well enough but are mutable. Byte sequences are nice and idiomatic but have a lot of overhead, so people are often forced to resort to a variety of other techniques. And it would be nice to support some higher level operations on such types, e.g. production of efficient (non-copying) immutable subsequences.

From a data structure perspective, I'm imagining something like a persistent data structure with byte[] data arrays at the lowest level.
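The leaf-level trick behind that idea (immutability makes sharing the underlying byte[] safe, so subsequences become O(1) views) can be sketched in a few lines of Java. This is a toy illustration of the concept, not a proposal for the actual structure; a real persistent version would hang a tree over chunks like this:

```java
import java.util.Arrays;

// Immutable view over a byte[]: the array is copied once on entry and never
// mutated afterwards, so subsequences can share it without copying.
public final class ByteSeq {
    private final byte[] data;
    private final int offset, length;

    private ByteSeq(byte[] data, int offset, int length) {
        this.data = data;
        this.offset = offset;
        this.length = length;
    }

    public static ByteSeq copyOf(byte[] src) {
        // The defensive copy here is the only copy this type ever makes.
        return new ByteSeq(Arrays.copyOf(src, src.length), 0, src.length);
    }

    public int length() { return length; }

    public byte get(int i) {
        if (i < 0 || i >= length) throw new IndexOutOfBoundsException();
        return data[offset + i];
    }

    // Non-copying subsequence: just a narrower window onto the same array.
    public ByteSeq subSequence(int from, int to) {
        if (from < 0 || to > length || from > to) throw new IndexOutOfBoundsException();
        return new ByteSeq(data, offset + from, to - from);
    }

    public static void main(String[] args) {
        ByteSeq s = ByteSeq.copyOf(new byte[] {10, 20, 30, 40, 50});
        ByteSeq mid = s.subSequence(1, 4);
        System.out.println(mid.length());  // 3
        System.out.println(mid.get(0));    // 20
    }
}
```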

Given the amount of data-processing stuff people are doing, it seems like a reasonable thing to have in contrib at least? 

Thomas

Jul 2, 2013, 7:51:47 AM
to clo...@googlegroups.com
I have already used this library and it is really really useful. Thanks Zach.

Thomas

Ben Smith-Mannschott

Jul 2, 2013, 9:42:07 AM
to clo...@googlegroups.com
Ropes?

Ben
--
This message was sent via electromagnetism.
--
You received this message because you are subscribed to the Google Groups "Clojure" group.
To post to this group, send email to clo...@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to clojure+u...@googlegroups.com
For more options, visit http://groups.google.com/group/clojure?hl=en

Zach Tellman

Jul 2, 2013, 7:51:36 PM
to clo...@googlegroups.com
Hey Mike,

Please feel free to appropriate or adapt any code you think might be useful.  I've signed a CA, so it should all be kosher.

As far as an immutable byte-data type, I'm a little skeptical it would be useful in a wide variety of situations, since a dense array/matrix is going to be much faster and more predictable than pretty much anything else.  If you want to guarantee read-only semantics, you have ByteBuffer.asReadOnlyBuffer().  If you want a ByteBuffer-like interface atop an unbounded stream with memory-efficient subsequences, you can check out the chunked-byte-seqs in Vertigo [1], which is due for release soon.  Something which behaves like a Clojure vector but is built atop byte[] seems like it would only be optimal for a very narrow set of use-cases.  Maybe I'm misunderstanding what you're proposing, though.  Please correct me if that's the case.
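For reference, `asReadOnlyBuffer` guards the view's own mutation API but shares content with the original buffer, so it's read-only semantics rather than true immutability. A small demonstration of both halves of that caveat:

```java
import java.nio.ByteBuffer;
import java.nio.ReadOnlyBufferException;

public class ReadOnlyDemo {
    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.wrap(new byte[] {1, 2, 3});
        ByteBuffer view = buf.asReadOnlyBuffer();
        try {
            view.put(0, (byte) 9);  // writes through the read-only view fail
        } catch (ReadOnlyBufferException e) {
            System.out.println("read-only view rejected the write");
        }
        // But the view is only a guard on its own API: mutating the original
        // buffer is still visible through it.
        buf.put(0, (byte) 9);
        System.out.println(view.get(0));
    }
}
```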

Zach


