slow raw io

cageface

Jun 23, 2010, 7:17:32 PM
to Clojure
Not sure if this is a Clojure issue or something else, but I'm seeing
surprisingly slow I/O on large text files.

For example, on a unix machine try this:

1. create a large file
rm -f words; for x in $(seq 300); do cat /usr/share/dict/words >> words; done

2. create a clj file that just slurps it
#!/usr/bin/env clj

(slurp *in*)

3. time the slurp
cat words | cat.clj

On my machine this takes 17 seconds.

If I make a simple java program:

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class Cat {

    public static void main(String[] args) throws Exception {
        BufferedReader reader;
        reader = new BufferedReader(new InputStreamReader(System.in));
        StringBuffer fileData = new StringBuffer();

        char[] buffer = new char[4096];
        int numRead = 0;

        while ((numRead = reader.read(buffer)) != -1) {
            fileData.append(buffer, 0, numRead);
        }
        reader.close();
    }
}

and try this:
cat words | java -Xmx3G Cat

It only takes 2.3 seconds.

I thought this might have to do with the -Xmx max heap flag, but I
added it to the java invocation in my clj wrapper script and it
didn't seem to make any difference. Line-oriented I/O seems similarly
slow.

cageface

Jun 23, 2010, 7:42:52 PM
to Clojure
For the record, this program runs in 3.3 seconds so I guess that
points to the implementation of slurp:

(import '[java.io BufferedReader InputStreamReader])

(let [reader (BufferedReader. (InputStreamReader. System/in))
      file-data (StringBuffer.)
      buffer (char-array 4096)]
  (loop [total 0]
    (let [num-read (.read reader buffer)]
      (if (not= -1 num-read)
        (do
          (.append file-data buffer 0 num-read)
          (recur (+ total num-read)))
        (println total)))))

Stuart Halloway

Jun 23, 2010, 9:20:26 PM
to clo...@googlegroups.com
I am seeing more like 1.8 seconds for the raw version, vs. 2.8 seconds for slurp (master branch). Can you post a complete example (including the clj script you use, and what version of Clojure), so we can be apples-to-apples?

Stu

> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clo...@googlegroups.com
> Note that posts from new members are moderated - please be patient with your first post.
> To unsubscribe from this group, send email to
> clojure+u...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en

cageface

Jun 23, 2010, 9:39:20 PM
to Clojure
Sure. Here's my clj script:
#!/bin/sh

if [ -z "$1" ]; then
    exec java -server jline.ConsoleRunner clojure.main
else
    SCRIPT=$(dirname $1)
    export CLASSPATH=$SCRIPT/*:$SCRIPT:$CLASSPATH
    exec java -Xmx3G -server clojure.main "$1" "$@"
fi

(Usually I don't have the -Xmx flag there. I added it for this test)

I'm using a fresh build of clojure and clojure-contrib from the github
1.2.0-MASTER branch. If I try to run it with the 1.1.0 jars I get this
error:
Exception in thread "main" java.lang.ClassCastException:
clojure.lang.LineNumberingPushbackReader cannot be cast to
java.lang.String (cat.clj:0)
at clojure.lang.Compiler.eval(Compiler.java:4658)
at clojure.lang.Compiler.load(Compiler.java:4972)
at clojure.lang.Compiler.loadFile(Compiler.java:4939)
at clojure.main$load_script__7405.invoke(main.clj:213)
at clojure.main$script_opt__7442.invoke(main.clj:265)
at clojure.main$main__7466.doInvoke(main.clj:346)
at clojure.lang.RestFn.invoke(RestFn.java:426)
at clojure.lang.Var.invoke(Var.java:363)
at clojure.lang.AFn.applyToHelper(AFn.java:175)
at clojure.lang.Var.applyTo(Var.java:476)
at clojure.main.main(main.java:37)

How big is your words file? 100 copies of /usr/share/dict/words on
this machine is 267 megabytes.

cageface

Jun 23, 2010, 10:16:47 PM
to Clojure
Another example. I'm running this on an Ubuntu 10.04 laptop with this
java:

java version "1.6.0_18"
OpenJDK Runtime Environment (IcedTea6 1.8) (6b18-1.8-0ubuntu1)
OpenJDK 64-Bit Server VM (build 14.0-b16, mixed mode)

and this command line:
java -Xmx3G -server clojure.main cat2.clj


(require '[clojure.java.io :as jio])

(defn- normalize-slurp-opts
  [opts]
  (if (string? (first opts))
    (do
      (println "WARNING: (slurp f enc) is deprecated, use (slurp f :encoding enc).")
      [:encoding (first opts)])
    opts))

(defn slurp2
  "Reads the file named by f using the encoding enc into a string
  and returns it."
  {:added "1.0"}
  ([f & opts]
     (let [opts (normalize-slurp-opts opts)
           data (StringBuffer.)
           buffer (char-array 4096)]
       (with-open [#^java.io.Reader r (apply jio/reader f opts)]
         (loop [c (.read r buffer)]
           (if (neg? c)
             (str data)
             (do
               (.append data buffer 0 c)
               (recur (.read r buffer)))))))))

(time
  (with-open [f (java.io.FileReader. "words")]
    (println (count (slurp f)))))

(time
  (with-open [f (java.io.FileReader. "words")]
    (println (count (slurp2 f)))))

I get this output:

$ java -Xmx3G -server clojure.main cat2.clj
279440100
"Elapsed time: 17094.007487 msecs"
279440100
"Elapsed time: 5233.097287 msecs"

So at least in my environment there seems to be a big difference
between slurp2 with an explicit buffer and the core/slurp one, which
appears to be reading a character at a time from a BufferedReader
stream.

Stuart Halloway

Jun 23, 2010, 11:22:25 PM
to clo...@googlegroups.com
On my laptop (Mac) the biggest difference here has nothing to do with buffering in slurp. It is whether you use System/in (fast) or *in* (slow). The latter is a LineNumberingPushbackReader.

Can you check and confirm? When I slurp System/in it is more than twice as fast as slurping *in*.

I believe the next-biggest perf issue is how StringBuilders grow. I suspect that the 4096 buffer is making them grow more efficiently.
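[A minimal sketch of the StringBuilder growth effect described above, in plain Java. The class and method names are illustrative, not from the thread: when capacity runs out, a StringBuilder grows its backing array and copies the old contents, so the per-call bookkeeping is paid once per 4096-char chunk rather than once per character.]

```java
// Sketch: appending the same data char-by-char vs. in 4096-char chunks.
// Both produce identical strings; only the number of append calls (and
// the bookkeeping done inside each) differs.
public class GrowthDemo {
    public static String byChar(char[] src) {
        StringBuilder sb = new StringBuilder();
        for (char c : src) sb.append(c);          // one call per char
        return sb.toString();
    }

    public static String byChunk(char[] src, int chunk) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < src.length; i += chunk) {
            int n = Math.min(chunk, src.length - i);
            sb.append(src, i, n);                 // one call per chunk
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        char[] data = new char[1 << 20];          // 1M chars
        java.util.Arrays.fill(data, 'x');
        long t0 = System.nanoTime();
        String a = byChar(data);
        long t1 = System.nanoTime();
        String b = byChunk(data, 4096);
        long t2 = System.nanoTime();
        System.out.println("byChar:  " + (t1 - t0) / 1000000 + " ms");
        System.out.println("byChunk: " + (t2 - t1) / 1000000 + " ms");
        if (!a.equals(b)) throw new AssertionError("results differ");
    }
}
```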

Stu

cageface

Jun 24, 2010, 12:00:11 AM
to Clojure
Interesting. Here are the times I get:

LINUX:
slurp, *in* 18.8 seconds
slurp, System/in 18.2 seconds
slurp2, *in* 6.7 seconds
slurp2, System/in 5.7 seconds

I have an Intel iMac here too, running 10.6.4:
slurp, *in* 20.4 seconds
slurp, System/in 19.0 seconds
slurp2, *in* 7.2 seconds
slurp2, System/in 6.0 seconds

For another data point, I wrote a simple java version:

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class Cat {
    public static void main(String[] args) throws Exception {
        BufferedReader reader;
        reader = new BufferedReader(new InputStreamReader(System.in));
        StringBuilder fileData = new StringBuilder();
        char[] buffer = new char[4096];
        int numRead = 0;
        long totalRead = 0;
        int c;

        if (args[0].equals("single")) {
            // single-char read
            while ((c = reader.read()) != -1) {
                totalRead++;
                fileData.append((char)c);
            }
        }
        else {
            // buffered read
            while ((numRead = reader.read(buffer)) != -1) {
                totalRead += numRead;
                fileData.append(buffer, 0, numRead);
            }
        }
        System.out.println(totalRead);
    }
}

Running this like so:

$ javac Cat.java && cat words | time java -Xmx3G -cp . Cat single
gives 14.5 seconds on linux and 17.1 seconds on the mac

running it like this:
$ javac Cat.java && cat words | time java -Xmx3G -cp . Cat buffered
gives 2.5 seconds on linux and 5.4 seconds on the mac

I'm not sure it's totally analogous but it does seem to indicate that
the single-char reads are slower on both platforms.

cageface

Jun 24, 2010, 3:34:51 PM
to Clojure
Has anyone else had a chance to try this? I'm surprised to see manual
buffering behaving so much better than the BufferedReader
implementation but it does seem to make quite a difference.

cageface

Jun 24, 2010, 6:45:29 PM
to Clojure
I put a self-contained test up here:
http://gist.github.com/452095

To run it, copy it to slurptest.clj and run these commands:

java clojure.main slurptest.clj makewords 100
(100 seems good for Macs, 300 for Linux)

java -Xmx3G -Xms3G clojure.main slurptest.clj slurp|slurp2

trying either slurp or slurp2. I see big timing differences on both
Macs and Linux machines, but it would be interesting to see if other
people do too.

Peter Schuller

Jun 25, 2010, 4:42:30 AM
to clo...@googlegroups.com

Looking at core's slurp, the problem is that it reads one character at
a time from the reader. Whether or not the underlying reader is
buffered, reading one character at a time is not good for performance.
The attached patch brings it back down to the speed of slurp2. (How do
I actually create a ticket on Assembla? I couldn't find a way to do
that; I can only browse individual tickets. I can't change tickets
either; perhaps editing is not publicly allowed?)

Anyway, some performance numbers for FreeBSD 8/x86-64:

openjdk6: 15 seconds slurp, 3.0 seconds slurp2
openjdk7 (fastdebug): 14.5 seconds slurp, 2.0 seconds slurp2

And slurp2 as a function of buffer size (single run each):

1: 17.8 seconds
128: 2.92 seconds
1024: 2.88 seconds
4096: 3.12 seconds

--
/ Peter Schuller

slurp-buffer-size.patch

Peter Schuller

Jun 25, 2010, 5:07:38 AM
to clo...@googlegroups.com
And reading the thread history I realize the problem was already
identified (sorry), however:

> Has anyone else had a chance to try this? I'm surprised to see manual
> buffering behaving so much better than the BufferedReader
> implementation but it does seem to make quite a difference.

Not really that surprising looking at the implementation. Each call to
read(), in addition to the obvious buffer arithmetic etc, results in:

* A lock being acquired.
* ensureOpen() being called, which checks whether the stream was already closed.

One might argue about whether streams should be thread-safe or not,
but in any case I think it is good default behavior to always strive
to do I/O chunk-wise, regardless of whether the expectation is that
the underlying stream is fast (for some definition of fast).
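[For reference, a minimal sketch of the chunk-wise pattern described above, under the thread's assumptions: one Reader.read(char[]) call per 4096 chars rather than one lock acquisition and ensureOpen() check per character. The helper name readAll is illustrative, not from any post.]

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Chunk-wise reading: the per-call overhead inside BufferedReader.read()
// is paid once per buffer fill instead of once per character.
public class ReadAllDemo {
    public static String readAll(Reader r) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = r.read(buf)) != -1) {
            sb.append(buf, 0, n);   // append the n chars actually read
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String s = readAll(new StringReader("hello, world"));
        System.out.println(s.length()); // prints 12
    }
}
```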

--
/ Peter Schuller

Stuart Halloway

Jun 25, 2010, 8:14:13 AM
to clo...@googlegroups.com
Hi Peter,

You are on the contributors list, so I just need to know your account name on Assembla to activate your ability to add tickets, patches, etc. Let me know your account name (which needs to be some permutation of your real name, not a nick).

Thanks,
Stu


Peter Schuller

Jun 25, 2010, 8:18:02 AM
to clo...@googlegroups.com, stuart....@gmail.com
> You are on the contributors list, so I just need to know your account name on Assembla to activate your ability to add tickets, patches, etc. Let me know your account name (which needs to be some permutation of  your real name, not a nick).

When I read up before submitting the contributor agreement I did see
the preference for real name derived account names, but I went with my
usual 'scode' anyway since I normally use the same name *everywhere*
and I didn't get the impression it was a strict requirement.

I can register another account (no problem), but what implications are
there from the fact that I wrote 'scode' on the contributor agreement I
mailed Rich?

--
/ Peter Schuller

Peter Schuller

Jun 25, 2010, 8:19:46 AM
to clo...@googlegroups.com, stuart....@gmail.com
> I can register another account (no problem), but what implications are
> there on the fact that I wrote 'scode' on the contributor agreement I
> mail:ed Rich?

I just registered "peterschuller".

--
/ Peter Schuller

cageface

Jun 25, 2010, 10:43:33 AM
to Clojure
Thanks Stuart & Peter for following up on this. Now I can get back to
plowing through this mountain of ldiff data with Clojure!

cageface

Aug 6, 2010, 8:31:03 PM
to Clojure
Any chance of getting this in before 1.2?

Stuart Halloway

Aug 7, 2010, 8:02:07 AM
to clo...@googlegroups.com
No. We want to collect more information and do more comparisons before moving away from the recommended Java buffering.

Stu

Peter Schuller

Aug 7, 2010, 8:43:39 AM
to clo...@googlegroups.com
> No. We want to collect more information and do more comparisons before moving away from the recommended Java buffering.

Interesting. Why do you consider it recommended to read one character
at a time in a case like this? Maybe there is such a recommendation
that I don't know about, but in general I would consider it contrary
to expected practice when doing I/O if performance is a concern.

Even discounting the ensureOpen() and lock acquisition that seem to be
involved as per my previous post, even the most efficient
implementation I can think of (do some index checking and bump a
positional pointer) would still generally be expected to be slower
when invoked one character/byte at a time than when larger chunks are
copied with e.g. System.arraycopy() (though as with all things, within
reason; cranking the size up too much will of course have other effects
such as GC overhead, poorer cache locality, etc.).
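[A rough sketch of the amortization argument above, using an in-memory reader so no disk is involved. The class name and the specific sizes are illustrative (they mirror the buffer sizes measured earlier in the thread); this is a demonstration of the call-count effect, not a rigorous benchmark.]

```java
import java.io.CharArrayReader;
import java.io.IOException;
import java.io.Reader;

// The total chars read is identical for every buffer size; only the
// number of read() calls (and thus per-call overhead) changes.
public class BufferSizeDemo {
    public static long countChars(Reader r, int bufSize) throws IOException {
        char[] buf = new char[bufSize];
        long total = 0;
        int n;
        while ((n = r.read(buf)) != -1) total += n;
        return total;
    }

    public static void main(String[] args) throws IOException {
        char[] data = new char[1 << 22]; // ~4M chars of in-memory "file"
        for (int size : new int[] {1, 128, 1024, 4096}) {
            long t0 = System.nanoTime();
            long total = countChars(new CharArrayReader(data), size);
            long ms = (System.nanoTime() - t0) / 1000000;
            System.out.println("buf " + size + ": " + total + " chars, " + ms + " ms");
        }
    }
}
```

As in Peter's numbers, most of the win comes from going from 1 to ~128; beyond that the per-call overhead is already well amortized.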

(Note that I'm not arguing the point of whether or not it should be
committed before 1.2, but I'm genuinely interested in why not reading
one character (or byte) at a time would be a controversial change.)

--
/ Peter Schuller

j-g-faustus

Aug 7, 2010, 9:59:23 AM
to Clojure
On Aug 7, 2:02 pm, Stuart Halloway <stuart.hallo...@gmail.com> wrote:
> > No. We want to collect more information and do more comparisons before moving away from the recommended Java buffering.

Maybe this comparison can be of interest?
http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly

Somebody tests 13 different ways to read a 100MB file with Java 1.5,
and gets timings ranging from half a second to 10 minutes.

jf

David Powell

Aug 9, 2010, 6:20:40 AM
to clo...@googlegroups.com

On Sat 07/08/10 14:02, "Stuart Halloway" stuart....@gmail.com sent:


> No. We want to collect more information and do more comparisons before
> moving away from the recommended Java buffering.

> Stu

This isn't an issue with the buffering; it is an issue with the massive overhead of doing character-at-a-time I/O, which is something you really
should never do. I'd say something somewhere doing character-at-a-time I/O is probably the number one cause of crippling performance problems
that I've seen in Java.

--
Dave

David Powell

Aug 9, 2010, 6:27:13 AM
to clo...@googlegroups.com

> This isn't an issue with the buffering, it is an issue with the massive
> overhead of doing character at a time i/o - it is something that you really
> should never ever do. I'd say something somewhere doing character at a
> time i/o is probably the number one cause of crippling performance problems
> that I've seen in Java.

Just to add: the implementation really ought to use the stream-copy stuff from clojure.java.io, which copies streams using a 1k buffer by default.

--
Dave

cageface

Aug 9, 2010, 2:49:00 AM
to Clojure
On Aug 7, 5:43 am, Peter Schuller <peter.schul...@infidyne.com> wrote:
> Interesting. Why do you consider it recommended to read one character
> at a time in a case like this? Maybe there is such a recommendation
> that I don't know about, but in general I would consider it contrary
> to expected practice when doing I/O if performance is a concern.

That's not a practice I'm familiar with either. Everything I've ever
read on I/O tuning suggests that reading a single char at a time is
worst-case performance, which seems borne out here.

Maybe this seems like a low-priority issue but I think slurp is likely
to be very commonly used. For instance, the Riak tutorial just posted
to Hacker News uses it:
http://mmcgrana.github.com/2010/08/riak-clojure.html

--
miles

David Powell

Aug 9, 2010, 9:25:29 AM
to clo...@googlegroups.com

> Maybe this seems like a low-priority issue but I think slurp is likely
> to be very commonly used. For instance, the Riak tutorial just posted
> to Hacker News uses it:
> http://mmcgrana.github.com/2010/08/riak-clojure.html

In the past I've steered clear of using slurp because it didn't handle character encodings, so I had
to write my own version (which did do block-at-a-time copying), but now that the API has been
fixed, I'd hope to be able to use the version in core, but only if the implementation is usable.

--
Dave
