Using sqldf on an HPC

Skip to first unread message

Chris Davis

Jan 16, 2014, 8:53:17 AM1/16/14
I'm writing about an issue I encountered when running sqldf on an HPC and the solution I found for it. It took me forever to debug this, so I hope this is useful for any one who runs into the same problem.

Basically I'm running a few hundred R programs in parallel on a computer cluster. Each program needs a single cpu, no shared memory, nothing really complicated. Input and output files are all on the local node. The only library being used is sqldf which is used to run queries over data frames generated within the program.

The issue I ran into is that as more instances of R are running on the cluster, the programs all collectively grind to a halt and there is an immense amount of traffic generated to the head node. I tried installing R locally on the nodes, but this didn't make the problem go away. I used the linux command "lsof" to check for open files on the head node, but nothing came up that (to me) indicated i/o from the nodes or libraries being repeatedly loaded.

I was able to finally fix it by updating the code to close any existing sqldf connections and create a new one:

# close an old connection if it exists
if (!is.null(getOption("sqldf.connection"))){
# create a new connection

In the original code, no connection was initiated beyond just loading the library and then directly running queries on the data frames.

I'm not sure if what I encountered was expected behavior or not. The code runs fine on a single computer, and as I understand, sqldf should just load the database into memory, so no network traffic should be necessary.

Best regards,


Gabor Grothendieck

Jan 21, 2014, 9:48:36 AM1/21/14
Thanks. Have added this to the Troubleshooting section oin the sqldf home page.
Reply all
Reply to author
0 new messages