Dear All,
I'm writing some code that reads a bunch of file names from an input
file and then runs a Linux program on each of them in parallel. I am
using scala.sys.process.Process to create a ProcessBuilder, and a
custom ProcessLogger to pluck the console output I need out of the
Linux program's output. The relevant bits of my code look like this:
import java.io.File
import scala.actors.Futures.future
import scala.sys.process.{Process, ProcessLogger}

val accInfo = new Array[Float](numConformations)

final class DsspOutputParser(conformationID: Int) {
  var numLinesRead = 0
  var residueIdx = 0

  @inline
  def processLine(line: String): Unit = {
    numLinesRead += 1
    if (numLinesRead > 25) {  // skip the header lines of the output
      val acc = line.substring(34, 38).trim.toInt
      accInfo(conformationID) = acc
      residueIdx += 1
    }
  }
}
val tasks = withBufferedReader(new File(config.datasetDir, "conformation_filenames.txt")) { br =>
  Iterator.continually(br.readLine()).takeWhile(_ != null).toList.zipWithIndex.map {
    case (conformationFilename, i) =>
      future {
        println((i + 1) + " of " + numConformations)
        val pb = Process(dsspCommand + " " + conformationFilename)
        val outputParser = new DsspOutputParser(i)
        val procLog = ProcessLogger(outputParser.processLine(_))
        pb.!(procLog)  // run the command, feeding each output line to the parser
      }
  }
}
tasks.grouped(10).foreach { group =>
  scala.actors.Futures.awaitAll(Long.MaxValue / 2L, group: _*)
}
The problem is that after running for a while, the program barfs with
the following error:
<function0>: caught java.io.IOException: Cannot run program "dssp-2-linux-amd64": java.io.IOException: error=24, Too many open files
java.io.IOException: Cannot run program "dssp-2-linux-amd64": java.io.IOException: error=24, Too many open files
  at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
  at scala.sys.process.ProcessBuilderImpl$Simple.run(ProcessBuilderImpl.scala:68)
  at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.run(ProcessBuilderImpl.scala:99)
  at scala.sys.process.ProcessBuilderImpl$AbstractBuilder$$anonfun$runBuffered$1.apply(ProcessBuilderImpl.scala:147)
  at scala.sys.process.ProcessBuilderImpl$AbstractBuilder$$anonfun$runBuffered$1.apply(ProcessBuilderImpl.scala:147)
  at scala.sys.process.ProcessLogger$$anon$1.buffer(ProcessLogger.scala:64)
  at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.runBuffered(ProcessBuilderImpl.scala:147)
  at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.$bang(ProcessBuilderImpl.scala:113)
[et cetera]
So it looks like something isn't cleaning up after itself, or the
garbage collector isn't aggressive enough; either way, the process
streams don't appear to be getting closed. The output of lsof lists a
bunch of lines like this:
java 23555 harveywi 7693w FIFO 0,8 0t0 866166108 pipe
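A quick Linux-only way to confirm from inside the JVM that the
descriptor count keeps growing would be something like this
(openFdCount is just a throwaway name, and this assumes /proc is
available):

def openFdCount(): Int = {
  // Each entry in /proc/self/fd is one open descriptor of this JVM,
  // so leaked pipes from spawned processes show up in this count.
  val fds = new java.io.File("/proc/self/fd").list()
  if (fds == null) -1 else fds.length
}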
I ran into a similar problem a few years back (using
scala.io.Source.fromFile(_) in small parallel batches), and the
solution then was to invoke System.gc() periodically so that the
offending streams got closed and cleaned up. However, that
workaround no longer seems to be effective.
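Concretely, that old workaround would look something like this
applied to the batching loop above (just a sketch):

tasks.grouped(10).foreach { group =>
  scala.actors.Futures.awaitAll(Long.MaxValue / 2L, group: _*)
  System.gc()  // nudge finalizers so leaked pipes get closed; no longer helps here
}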
I am using Sun Java 1.6.0_29 with the VM arguments "-Xmx1G -server".
I am using the default garbage collector, which might be the problem.
Am I doing something really silly? If it is up to me to close the
streams manually, how do I do that? I poked through the Scala
standard library source and didn't see anything obvious (is
something like the sketch below the right idea?). If it's not my
responsibility to close the streams, and the Scala standard library
is properly closing them, do you have any good ideas on how I might
get around this issue?
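For concreteness, here is the sort of manual cleanup I have in mind,
dropping down to java.lang.ProcessBuilder (an untested sketch;
runAndParse is a hypothetical replacement for the pb.!(procLog) call
above):

import java.io.{BufferedReader, InputStreamReader}

// Untested sketch: run the command via java.lang.ProcessBuilder and
// close every stream explicitly instead of relying on scala.sys.process.
def runAndParse(cmd: Seq[String], parser: DsspOutputParser): Int = {
  val pb = new java.lang.ProcessBuilder(cmd: _*)
  pb.redirectErrorStream(true)  // fold stderr into stdout: one pipe to manage
  val proc = pb.start()
  val br = new BufferedReader(new InputStreamReader(proc.getInputStream))
  try {
    var line = br.readLine()
    while (line != null) {
      parser.processLine(line)
      line = br.readLine()
    }
  } finally {
    br.close()                   // our end of the stdout pipe
    proc.getOutputStream.close() // child's stdin
    proc.getErrorStream.close()  // stderr (already redirected, but just in case)
  }
  proc.waitFor()                 // reap the child; returns its exit code
}

Is that the right general idea, or is there a cleaner way to do it
within scala.sys.process?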
Thank you!
-William Harvey
http://www.cse.ohio-state.edu/~harveywi