Better way to read binary file...

Scott S.

Mar 19, 2012, 10:46:18 AM
to scala-user
Hi all,

New Scala user here, coming from years of Java. Great language so
far. However, I can't seem to find a pattern to read a file into a
byte array in a "Scala" way that doesn't take for freaking ever to
complete, computationally speaking. Running some benchmarks,

import java.io.ByteArrayOutputStream

def bufferedJavaTest(): Unit = {
  val in = this.getClass().getClassLoader().getResourceAsStream(res_name)
  val out = new ByteArrayOutputStream()
  val buf = new Array[Byte](4096)
  var len = in.read(buf)

  // read returns -1 at end of stream
  while (len != -1) {
    out.write(buf, 0, len)
    len = in.read(buf)
  }
  out.flush()
  out.close()
  in.close()
}

This method was the fastest for reading a reasonably large file: it
took around 220 ms for a 9.9 MB file.
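
The benchmark above throws the bytes away. A minimal sketch of the same loop that hands back the array and closes the stream even on failure, assuming any InputStream as the source; `BufferedRead` and `readFully` are illustrative names, not part of any library:

```scala
import java.io.{ByteArrayOutputStream, InputStream}

object BufferedRead {
  // Hypothetical helper: drains `in` into a byte array, one chunk at a time.
  def readFully(in: InputStream, chunkSize: Int = 4096): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val buf = new Array[Byte](chunkSize)
    try {
      var len = in.read(buf)
      while (len != -1) {        // read returns -1 at end of stream
        out.write(buf, 0, len)
        len = in.read(buf)
      }
    } finally in.close()         // close the source even if a read fails
    out.toByteArray
  }
}
```

The chunk size of 4096 simply mirrors the benchmark; it is not a tuned value.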

A more Scala-ish solution, as posted on Stack Overflow, looked like
this:

def scalaStream(): Unit = {
  val in4 = this.getClass().getClassLoader().getResourceAsStream(res_name)
  // One read() call per byte; -1 marks end of stream
  val stream = Iterator.continually(in4.read()).takeWhile(_ != -1).map(_.toByte).toArray
}

and took well over 1000 ms to complete due to iterating over the
stream multiple times. Is there any way that I can buffer the "Scala"
method so that it rivals the java streams way in terms of
performance? If there isn't a way, I'm just as happy using the java
wrapper way.

Thanks in advance

√iktor Ҡlang

Mar 19, 2012, 11:05:01 AM
to Scott S., scala-user
On Mon, Mar 19, 2012 at 3:46 PM, Scott S. <scott.s...@gmail.com> wrote:
> Hi all,
>
> New Scala user here, coming from years of Java.  Great language so
> far.  However, I can't seem to find a pattern to read a file into a
> byte array in a "Scala" way that doesn't take for freaking ever to
> complete, computationally speaking.

Why do you need to cache the file in memory?

Cheers,



--
Viktor Klang

Akka Tech Lead
Typesafe - The software stack for applications that scale

Twitter: @viktorklang

Scott Spillmann

Mar 19, 2012, 11:08:37 AM
to scala-user
Certain PDF libraries (iText) need the entire file as a byte array to initialize their data structures, so we have to read the whole file and pass it to the iText functions.  We also have proprietary files that are currently under 10 MB but could grow in the future, and that we need to work with atomically.

√iktor Ҡlang

Mar 19, 2012, 11:13:35 AM
to Scott Spillmann, scala-user
On Mon, Mar 19, 2012 at 4:08 PM, Scott Spillmann <scott.s...@gmail.com> wrote:
> Certain pdf libraries (iText) require reading the entire file in from a byte array to initialize the data structures.  So, we need to be able to read the whole file and pass it to the iText functions.  We also have proprietary files that are not over 10MB currently, but could potentially grow in the future that we need to work with atomically.

I don't understand that part. What does a byte array have to do with atomicity?

Cheers,

Scott Spillmann

Mar 19, 2012, 11:18:41 AM
to scala-user
So the java nio is probably the way to go?  From what I understand of Java streams, they use the relevant parts of nio under the covers especially in later versions (6 and 7)

What I meant by "atomically" was as a whole unit, not pieces thereof, so reading only parts of the file would not work.  It would need to be the full file contents in memory at the very least.
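
For a file on disk, the NIO.2 API that shipped with Java 7 can load the full contents in a single call. A minimal sketch, assuming the data is a regular file rather than a classpath resource like the one in the original benchmark; `NioRead` and `readWhole` are illustrative names:

```scala
import java.nio.file.{Files, Path}

object NioRead {
  // Reads the whole file into one byte array (requires Java 7 or later).
  def readWhole(path: Path): Array[Byte] = Files.readAllBytes(path)
}
```

This only covers the "whole unit in memory" requirement; it has not been benchmarked against the stream loop from the first post.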

√iktor Ҡlang

Mar 19, 2012, 11:22:45 AM
to Scott Spillmann, scala-user
On Mon, Mar 19, 2012 at 4:18 PM, Scott Spillmann <scott.s...@gmail.com> wrote:
> So the java nio is probably the way to go?  From what I understand of Java streams, they use the relevant parts of nio under the covers especially in later versions (6 and 7)

Blocking IO should be avoided, since it doesn't play nice with multithreaded code.

> What I meant by "atomically" was as a whole unit, not pieces thereof, so reading only parts of the file would not work.  It would need to be the full file contents in memory at the very least.

That sounds very API specific, because you can most definitely operate "atomically" on a file without having it all in memory at the same point in time.

√iktor Ҡlang

Mar 19, 2012, 11:25:06 AM
to Scott Spillmann, scala-user
You could definitely also look into: http://jesseeichar.github.com/scala-io-doc/index.html

Cheers,



Dennis Haupt

Mar 19, 2012, 11:27:28 AM
to "√iktor Ҡlang", scott.s...@gmail.com, scala...@googlegroups.com
is there a reason you don't throw "Source.fromFile" at it?

Daniel Sobral

Mar 19, 2012, 11:36:55 AM
to Dennis Haupt, "√iktor Ҡlang", scott.s...@gmail.com, scala...@googlegroups.com
On Mon, Mar 19, 2012 at 12:27, Dennis Haupt <h-s...@gmx.de> wrote:
> is there a reason you don't throw "Source.fromFile" at it?

They are binary files. Source.fromFile reads text files.
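
To illustrate the point: Source decodes bytes into characters through a charset, so raw bytes only survive if the charset maps every byte to exactly one character. A sketch of that workaround, assuming ISO-8859-1 is used for the decoding; `SourceBytes` and `viaSource` are illustrative names, and this still reads one character at a time, so it is not a fast option:

```scala
import java.nio.file.Path
import scala.io.Source

object SourceBytes {
  // ISO-8859-1 maps each byte 0x00..0xFF to the char with the same code point,
  // so converting each char back with toByte recovers the original bytes.
  // Other charsets may reject or alter arbitrary binary data.
  def viaSource(path: Path): Array[Byte] = {
    val src = Source.fromFile(path.toFile, "ISO-8859-1")
    try src.map(_.toByte).toArray
    finally src.close()
  }
}
```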

--
Daniel C. Sobral

I travel to the future all the time.

Daniel Sobral

Mar 19, 2012, 11:47:43 AM
to Scott S., scala-user

Have you tried this?

val in4 = new java.io.BufferedInputStream(this.getClass().getClassLoader().getResourceAsStream(res_name))

>    var stream = Iterator continually in4.read takeWhile (-1 !=) map
> (_.toByte) toArray

Yeah, reading one byte at a time just to then convert into a single
array is not going to be efficient. In particular because you don't
know beforehand what that array size will be, since you are working
with an Iterator. That means the array will be resized, which means it
will be copied multiple times as it grows.

There are many ways in which you could optimize this, but, in the end,
it comes down to whether it matters or not. If speed is not crucial,
then you might keep this functional version. If it is of utmost
importance, then you should go with the fastest version.
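
One way to keep the Iterator style while avoiding the per-byte cost described above is to pull a whole chunk per step and let ByteArrayOutputStream handle the growth. A minimal sketch, not a measured result; `ChunkedRead` and `readAll` are illustrative names:

```scala
import java.io.{ByteArrayOutputStream, InputStream}

object ChunkedRead {
  // Iterator-based read that fetches up to chunkSize bytes per step.
  def readAll(in: InputStream, chunkSize: Int = 4096): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val buf = new Array[Byte](chunkSize)
    try {
      Iterator
        .continually(in.read(buf))          // bytes read this step, or -1 at end of stream
        .takeWhile(_ != -1)
        .foreach(n => out.write(buf, 0, n)) // buf is consumed before the next read reuses it
    } finally in.close()
    out.toByteArray
  }
}
```

Because the iterator is lazy, each chunk is written out before the next read overwrites the shared buffer.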

Tony Morris

Mar 19, 2012, 11:49:10 AM
to scala...@googlegroups.com
You could use an iteratee, which will scale to any size file and perform
much better, if you were willing to forego the iText library and
implement something useful yourself.

--
Tony Morris
http://tmorris.net/


√iktor Ҡlang

Mar 19, 2012, 11:59:07 AM
to tmo...@tmorris.net, scala...@googlegroups.com
On Mon, Mar 19, 2012 at 4:49 PM, Tony Morris <tonym...@gmail.com> wrote:
> You could use an iteratee, which will scale to any size file and perform
> much better, if you were willing to forego the iText library and
> implement something useful yourself.

Trust me Tony, you wouldn't want to implement your own PDF library. I know, I tried.
The ROI is infinite negative ;-)

Cheers,




√iktor Ҡlang

Mar 19, 2012, 12:03:58 PM
to tmo...@tmorris.net, scala...@googlegroups.com


2012/3/19 Tony Morris <tonym...@gmail.com>
> I've done it recently. It's relatively easy with appropriate general
> library support -- much easier than using iText, which I last had the
> displeasure of doing about 3 or 4 years ago.

Yeah? Would love to see the sauce. I ended up switching from iText to a proprietary 3rd party lib because it had support for AcroForms.

Cheers,

Scott Spillmann

Mar 19, 2012, 12:05:59 PM
to √iktor Ҡlang, tmo...@tmorris.net, scala...@googlegroups.com
Thanks all for the responses.  Good stuff here.  I would most certainly not try to rewrite iText given the sheer expansiveness of that library.

@Daniel Sobral, I do my own buffering there, so I'm assuming the BufferedInputStream would be redundant.  I'm not sure whether that affects the outcome, though.

I will look at FileChannel to see whether newer versions of iText, and the other libraries we use (both in-house and third-party), will take a memory-mapped file in their constructors.  It sounds like it would be best to stick with the Java way until something better comes along.  Thanks guys.

√iktor Ҡlang

Mar 19, 2012, 12:14:20 PM
to tmo...@tmorris.net, scala...@googlegroups.com


2012/3/19 Tony Morris <tonym...@gmail.com>
> On 20/03/12 03:03, √iktor Ҡlang wrote:
>> 2012/3/19 Tony Morris <tonym...@gmail.com>
>>
>>> I've done it recently. It's relatively easy with appropriate general
>>> library support -- much easier than using iText, which I last had the
>>> displeasure of doing about 3 or 4 years ago.
>>>
>> Yeah? Would love to see the sauce. I ended up switching from iText to a
>> proprietary 3rd party lib because it had support for AcroForms.
>
> The company I work for gets cold feet over open-sourcing our work. I've
> spent the last few months trying to get a very basic library to OSS, not
> that it is too useful, but to get over this artificial barrier. Once I
> do that, maybe I can look at more...

It's cool man, I understand.

Cheers,
 


Tony Morris

Mar 19, 2012, 12:01:17 PM
to √iktor Ҡlang, scala...@googlegroups.com
I've done it recently. It's relatively easy with appropriate general
library support -- much easier than using iText, which I last had the
displeasure of doing about 3 or 4 years ago.


--
Tony Morris
http://tmorris.net/


Tony Morris

Mar 19, 2012, 12:12:32 PM
to √iktor Ҡlang, scala...@googlegroups.com
On 20/03/12 03:03, √iktor Ҡlang wrote:
> 2012/3/19 Tony Morris <tonym...@gmail.com>
>
>> I've done it recently. It's relatively easy with appropriate general
>> library support -- much easier than using iText, which I last had the
>> displeasure of doing about 3 or 4 years ago.
>>
> Yeah? Would love to see the sauce. I ended up switching from iText to a
> proprietary 3rd party lib because it had support for AcroForms.

The company I work for gets cold feet over open-sourcing our work. I've
spent the last few months trying to get a very basic library to OSS, not
that it is too useful, but to get over this artificial barrier. Once I
do that, maybe I can look at more...

--

Tony Morris
http://tmorris.net/


Razvan Cojocaru

Mar 19, 2012, 12:26:41 PM
to √iktor Ҡlang, scala-user

What??? I thought blocking IO is the reason we have threads in the first place… :)

How’s this for a thought: if EVERYTHING was asynchronous – would we need threads?

√iktor Ҡlang

Mar 19, 2012, 12:29:01 PM
to Razvan Cojocaru, scala-user


2012/3/19 Razvan Cojocaru <p...@razie.com>

> What??? I thought blocking IO is the reason we have threads in the first place… :)

What would make you think that?

> How’s this for a thought: if EVERYTHING was asynchronous – would we need threads?

In what context?

Tony Morris

Mar 19, 2012, 12:30:23 PM
to Scott Spillmann, √iktor Ҡlang, scala...@googlegroups.com
To be clear, a "rewrite of iText" in such a way as to be useful, looks
nothing like iText and so it is not a good comparison in terms of
measuring required effort. Implementing only the essentials can be
unrecognisable if you don't also repeat the ceremony.


Johannes Rudolph

Mar 19, 2012, 12:47:03 PM
to tmo...@tmorris.net, Scott Spillmann, √iktor Ҡlang, scala...@googlegroups.com
On Mon, Mar 19, 2012 at 5:30 PM, Tony Morris <tonym...@gmail.com> wrote:
> To be clear, a "rewrite of iText" in such a way as to be useful, looks
> nothing like iText and so it is not a good comparison in terms of
> measuring required effort. Implementing only the essentials can be
> unrecognisable if you don't also repeat the ceremony.

Of course, a complete rewrite is mostly out of scope. But even a
library which is just able to read the basic PDF file structure (which
is kind of a filesystem in itself) would already be immediately
useful.

--
Johannes

-----------------------------------------------
Johannes Rudolph
http://virtual-void.net

Razvan Cojocaru

Mar 19, 2012, 12:51:34 PM
to √iktor Ҡlang, scala-user

Well – because you want to occupy the CPU with something else while the user is waiting for the… mouse to move or the keys to be accumulated. At the same time you want to simplify programming, so it all looks synchronous and stupid, hence you invent threads, which are nothing but mini-processes or a suspendable state of a path carved through code, all the while the I/O in fact works via hardware interrupts – not really blocking :)

I remember Windows 3 was process-based cooperative multitasking, and it worked relatively fine if you remembered to call yield() from all events :) Unix-style preemption was black-magic voodoo stuff :) … until the advent of the i386, I think, which had dedicated TAS and task-switching instructions and made it simple?

IF you didn’t have to deal with blocking I/O, i.e. waiting for stuff – if ALL programming was reactive, workflow or actor-based, would we need threads? If everything, down to AX+DX (or whatever the registers are called), was asynchronous units of work – would we need threads? If all programming was in the form of “WHEN this DO that”, what is a thread?

All a thread is, is a suspendable state of execution as seen by a dumb lower-level OS that doesn’t know what the heck is going on upstairs. But if the upstairs was all reactive… what then?

Yeah – time for the second coffee of the day.

√iktor Ҡlang

Mar 19, 2012, 1:03:03 PM
to Razvan Cojocaru, scala-user


2012/3/19 Razvan Cojocaru <p...@razie.com>

> Well – because you want to occupy the CPU with something else while the user is waiting for the… mouse to move or the keys to be accumulated. At the same time you want to simplify programming, so it all looks synchronous and stupid, hence you invent threads, which are nothing but mini-processes or suspend-able state of a path carved through code, all the while the I/O in fact works via hardware interrupts – not really blocking :)

I thought the idea behind blocking IO was to block progression of the current thread of execution until enough data had arrived/departed from the system.

> I remember Windows 3 was process-based cooperative multitasking and it worked relatively fine if you remembered to call yield() from all events :) Unix-style preemptiveness was black magic woodoo stuff :) … until the advent of the i386 I think, which had dedicated TAS and task switching instructions and made it simple?
>
> IF you didn’t have to deal with blocking I/O, i.e. waiting for stuff – if ALL programming was reactive, workflow or actor-based, would we need threads?

Of course not. Threads are an implementation detail. A bad one at that. Concurrency we can have with just one hardware thread of execution, as single-core CPUs have shown. However, to get true parallelism we need multiple hardware threads of execution.

> If everything, down to the AX+DX (or whatever the registries are called) was asynchronous units of work – would we need threads? If all programming was in the form of “WHEN this DO that”, what is a thread?
>
> All a thread is, is a suspendable state of execution as seen by a dumb lower level OS that doesn’t know what the heck is going on upstairs. But if the upstairs was all reactive… what then?


My point is that as soon as you introduce blocking, you have to ask yourself "How many Threads of execution do I need to avoid deadlocking/starvation?" This problem is "solved" in Node.js by not allowing blocking. Now, if you have to roll your own concurrency by splitting things into smaller event-handlers, you'll have to trust all code in the system, if one event-handler goes into some never ending calculation, you're pretty SOL. If you go the Erlang route, no haywire process will end up eating all CPU, since they employ their own timeslicing.

So, coming back to the question: If everything was reactive, we'd need no more than one thread of execution to have concurrency, but to be able to scale up, we'd need to be able to take advantage of multiple threads of execution, which is fairly easy if you use immutable data.

I think we strayed from the topic though.

Geir Hedemark

Mar 19, 2012, 1:27:17 PM
to Razvan Cojocaru, √iktor Ҡlang, scala-user
On 2012, Mar 19, at 5:51 PM, Razvan Cojocaru wrote:
> Well – because you want to occupy the CPU with something else while the user is waiting for the… mouse to move or the keys to be accumulated. At the same time you want to simplify programming, so it all looks synchronous and stupid, hence you invent threads, which are nothing but mini-processes or suspend-able state of a path carved through code, all the while the I/O in fact works via hardware interrupts – not really blocking :)

Yeah, but the hardware interrupt you are describing as an unblocking, asynchronous gadget is really a clocked, synchronous design.

yours
Geir

sreque

Mar 19, 2012, 3:04:31 PM
to scala-user
If you're interested in performance, it might help to cut out the
intermediate array copy by writing your own custom class like the
following. I only did minimal testing on it by observing that it could
copy one binary file (a zip file) to another such that the files are
identical.

class DynamicByteArray(initialSize: Int) {
  var data = new Array[Byte](initialSize)
  var pos: Int = 0

  def this() = this(1024)

  // Appends all of `bytes`, growing the backing array if necessary.
  def append(bytes: Array[Byte]): Unit = {
    growIfNeeded(pos + bytes.length)
    // Copy bytes.length elements (the source length, not data.length)
    java.lang.System.arraycopy(bytes, 0, data, pos, bytes.length)
    pos += bytes.length
  }

  // Reads up to `amount` bytes straight into the backing array.
  // Returns the number of bytes read, or -1 at end of stream.
  def read(stream: java.io.InputStream, amount: Int): Int = {
    growIfNeeded(pos + amount)
    val amountRead = stream.read(data, pos, amount)
    if (amountRead > 0) {
      pos += amountRead
    }
    amountRead
  }

  def resize(newSize: Int): Unit = {
    data = java.util.Arrays.copyOf(data, newSize)
  }

  // Grows to at least lengthNeeded, doubling when that is larger.
  def growIfNeeded(lengthNeeded: Int): Unit = {
    if (lengthNeeded > data.length) {
      resize(scala.math.max(lengthNeeded, data.length * 2))
    }
  }
}

object Main extends App {
  import java.io._

  // Copy the file named by args(0) to args(1) through the dynamic buffer.
  val buffer = new DynamicByteArray
  val input = new FileInputStream(args(0))
  while (buffer.read(input, 8192) >= 0) { }
  input.close()

  val output = new FileOutputStream(args(1))
  output.write(buffer.data, 0, buffer.pos)
  output.close()
}

Josh Suereth

Mar 19, 2012, 3:16:24 PM
to sreque, scala-user
You should probably also be using the blocking nio interface and direct-byte-buffers.  You'll see a speed increase.  I've been toying with them myself, but I don't have anything I'd recommend for production yet.
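
A rough sketch of what the blocking NIO route with a direct buffer might look like, assuming the source is a regular file and Java 7's FileChannel.open is available; `ChannelRead` and `readAll` are illustrative names. Whether it actually beats the stream loop is not measured here, and copying out of the direct buffer into a byte[] (which iText needs) gives back some of the gain:

```scala
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.{Path, StandardOpenOption}

object ChannelRead {
  def readAll(path: Path): Array[Byte] = {
    val channel = FileChannel.open(path, StandardOpenOption.READ)
    try {
      val size = channel.size()
      require(size <= Int.MaxValue, "file too large for a single byte array")
      val buffer = ByteBuffer.allocateDirect(size.toInt)
      // A single read may fill only part of the buffer, so loop until full or end of file.
      while (buffer.hasRemaining() && channel.read(buffer) != -1) {}
      buffer.flip()
      val bytes = new Array[Byte](buffer.remaining())
      buffer.get(bytes)          // copy out of the direct buffer
      bytes
    } finally channel.close()
  }
}
```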

Basically, I/O in Java (and therefore in Scala) is less than ideal.  Some of the new paradigms of doing streaming can increase composability and such, but their supporting infrastructure to do real I/O just isn't there yet.  E.g. Iteratees exist in several implementations, but we don't have a good low-level API to hit files and network streams yet.  You have to roll your own.

If you really want to use iteratees, feel free to read this (http://jsuereth.com/scala/2012/02/29/iteratees.html) and this (https://github.com/jsuereth/scalaz/blob/scalaz-nio2/example/src/main/...).  Again, I don't recommend these yet for anything critical.

If I were you, I'd look at what Jesse Eichar has done in Scala I/O.  It's stable and he's been improving speed.  It's most likely the strongest Scala I/O library and will continue to be so.

- Josh

sreque

Mar 19, 2012, 3:38:51 PM
to scala-user
The problem with byte buffers is that they have a fixed size and they
don't work directly with java.io.InputStream. So, if your requirements
are that your input is a java.io.InputStream and you want to form a
byte array of variable size with all the contents from that stream, it
doesn't look like ByteBuffer can be of any help.
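
For completeness, java.nio.channels.Channels.newChannel can wrap an InputStream so that it reads into a ByteBuffer, but the fixed buffer size still forces a separate growable sink, which supports the point above. A sketch under those assumptions; `StreamToBuffer` and `readAll` are illustrative names:

```scala
import java.io.{ByteArrayOutputStream, InputStream}
import java.nio.ByteBuffer
import java.nio.channels.Channels

object StreamToBuffer {
  def readAll(in: InputStream, chunkSize: Int = 8192): Array[Byte] = {
    val channel = Channels.newChannel(in)       // channel view of the stream
    val chunk = ByteBuffer.allocate(chunkSize)  // fixed-size heap buffer, reused each pass
    val out = new ByteArrayOutputStream()       // growable sink for the final array
    try {
      while (channel.read(chunk) != -1) {
        chunk.flip()                            // switch from filling to draining
        out.write(chunk.array(), chunk.arrayOffset() + chunk.position(), chunk.remaining())
        chunk.clear()                           // ready for the next read
      }
    } finally channel.close()
    out.toByteArray
  }
}
```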

Of course, if you know you are accessing a file and you can get access
to a FileInputStream, then you can use things like MappedByteBuffer
(http://docs.oracle.com/javase/1.4.2/docs/api/java/nio/MappedByteBuffer.html).
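
A minimal sketch of the memory-mapped route, assuming a FileInputStream is at hand and the file fits in a single array; `MappedRead` and `readAll` are illustrative names. Copying the mapped region into a byte[] is only needed for consumers, such as iText, that insist on an array:

```scala
import java.io.FileInputStream
import java.nio.channels.FileChannel

object MappedRead {
  def readAll(in: FileInputStream): Array[Byte] = {
    try {
      val channel = in.getChannel()
      val size = channel.size()
      require(size <= Int.MaxValue, "file too large for a single byte array")
      // Map the whole file read-only; the OS pages it in on demand.
      val mapped = channel.map(FileChannel.MapMode.READ_ONLY, 0L, size)
      val bytes = new Array[Byte](size.toInt)
      mapped.get(bytes)          // copy the mapped region into the array
      bytes
    } finally in.close()         // closing the stream also closes its channel
  }
}
```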

Also, here is an interesting post I found on performance differences
between byte[], streams, and buffers: http://www.evanjones.ca/software/java-bytebuffers.html.

On Mar 19, 2:16 pm, Josh Suereth <joshua.suer...@gmail.com> wrote:
> You should probably also be using the blocking nio interface and
> direct-byte-buffers.  You'll see a speed increase.  I've been toying with
> them myself, but I don't have anything I'd recommend for production yet.
>
> Basically, I/O in Java (and therefore in Scala) is less than ideal.  Some
> of the new paradigms of doing streaming can increase composibility and
> such, but their supporting infrastructure to do real I/O just isn't there
> yet.  E.g. Iteratees exist in several implementations, but we don't have a
> good low-level API to hit files and network streams yet.  You have to roll
> your own.
>
> If you really want to use iteratees, feel free to read
> this<http://jsuereth.com/scala/2012/02/29/iteratees.html>and
> this<https://github.com/jsuereth/scalaz/blob/scalaz-nio2/example/src/main/...>.