Problems with gfServer setup.

Alan Hoyle

Oct 2, 2013, 11:48:21 AM

to Gen...@soe.ucsc.edu

Hey all,

We're using gfClient/gfServer as part of one of our analyses and we're running the analyses via qsub on an SGE cluster. For now, we're only using gfServer for a single 2bit file in a single analysis, so we haven't set up a dedicated, stand-alone server for it.

What we'd tried to do was something like the following on the cluster node for each job:

gfServer start localhost 8000 blah.2bit &   (i.e., start in the background)
sleep 60 && run_job.pl --blahblahblah && gfServer stop localhost 8000   (i.e., wait a bit for the server to spin up, run the job, then stop the server)
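As a possible alternative to the fixed `sleep 60`, here is a minimal sketch that polls the port until something accepts a TCP connection. The function name `wait_for_port` and its parameters are hypothetical, and it is an assumption that gfServer tolerates a bare connect-and-close probe; running `gfServer status` in a loop may be a gentler check. A successful probe only says that some server is listening on the port, not that it belongs to this job.

```python
import socket
import time


def wait_for_port(host, port, timeout=300.0, interval=1.0):
    """Poll until a TCP connection to (host, port) succeeds or timeout expires.

    Returns True once something accepts the connection, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means a listener is up on that port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused or timed out: wait and try again.
            time.sleep(interval)
    return False
```

A job script could call `wait_for_port("localhost", 8000)` after launching the server and only start the analysis when it returns True.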

The two-command approach seems to work for the first job that lands on each node. However, the second job that lands on the same node loads the 2bit file and then spews the following to STDERR:

Couldn't bind socket to 8000: Address already in use

Error accepting the connection

[....]

The "Error accepting the connection" repeats literally millions of times. The largest log file we deleted from this was approximately 120 GB of almost exclusively that message.

We've tried writing a wrapper script to check dynamically whether the port is available at runtime (through either netstat or gfServer status) and to find an open port if it is not. Both approaches fail because gfServer doesn't actually grab the port until after it has finished loading the 2bit file. We think it might be possible to work around this by using lock files to say "gfServer is trying to use port XXXX" on a per-node basis, but that seems suboptimal.
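The per-node lock-file idea could be sketched roughly as follows. This is an illustration under stated assumptions, not a tested recipe for SGE: `claim_port`, `release_port`, and the lock-file naming are hypothetical, and the lock directory is assumed to be on node-local storage (for example under /tmp), since `flock` is not reliable on every network filesystem. The job would hold the lock from before `gfServer start` until after the server has stopped.

```python
import fcntl
import os


def claim_port(lock_dir, ports):
    """Take an exclusive, non-blocking lock on one file per candidate port.

    Returns (port, fd) for the first port whose lock file was free, or
    (None, None) if every candidate is already claimed. The claim lasts
    until fd is closed or the process exits.
    """
    os.makedirs(lock_dir, exist_ok=True)
    for port in ports:
        path = os.path.join(lock_dir, "gfserver-%d.lock" % port)
        fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            # Another job on this node already holds this port's lock.
            os.close(fd)
            continue
        return port, fd
    return None, None


def release_port(fd):
    """Release a claim made by claim_port (closing the fd drops the lock)."""
    os.close(fd)
```

Because the kernel drops the lock when the holding process exits, a crashed job does not leave a stale claim behind, which is the main reason to prefer `flock` here over checking whether a file exists.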

I have a couple of suggestions that would improve gfServer for this use:

Have gfServer try to grab the port almost immediately on startup, and return an error message saying that it's loading the 2bit file until it's ready.
If it can't grab the port, perhaps try again every second, but certainly stop spewing that error ad infinitum.
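To illustrate the behaviour these two suggestions describe, here is a generic sketch of "bind first, retry a bounded number of times, then give up with a single error". It is not gfServer's code (gfServer is written in C), and `bind_with_retry` and its parameters are hypothetical names chosen for the example.

```python
import socket
import time


def bind_with_retry(host, port, attempts=5, delay=1.0):
    """Bind and listen on (host, port), retrying a bounded number of times.

    Returns the listening socket, or raises OSError after the last attempt.
    """
    last_error = None
    for _ in range(attempts):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            sock.listen(5)
            return sock
        except OSError as err:
            # Port busy (or otherwise unusable): clean up, pause, retry.
            sock.close()
            last_error = err
            time.sleep(delay)
    raise OSError("could not bind %s:%d after %d attempts: %s"
                  % (host, port, attempts, last_error))
```

A server built this way would claim the port before loading its data, so a second instance on the same node fails quickly with one message instead of looping.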

(Note that we're a bit lucky on this. If we had started gfServer with the -canStop parameter, a later job sharing the node might have been only partially finished when the first job stopped the server.)

We have a couple of other ideas for how to handle this.

Currently, we have a dedicated gfServer running on one node all the time. We don't like this long term as it gives a single point of failure and it wastes resources all the time, not just when the gfServer is needed. Also, it's slightly slower as the traffic has to hop across the network instead of running all on the localhost.

Another idea would be to have gfServers running on all of the nodes all the time. We don't like this as it wastes resources all the time on all the nodes.

Does anyone else have suggestions for dealing with this kind of use case?

-alan

--
- Alan Hoyle - al...@alanhoyle.com - http://www.alanhoyle.com/ -

Luvina Guruvadoo

Oct 2, 2013, 7:02:21 PM

to Alan Hoyle, Gen...@soe.ucsc.edu

Hi Alan,

One of our engineers had this to say:

Regarding the message:

 Couldn't bind socket to 8000: Address already in use
 Error accepting the connection
 Error accepting the connection
 [...]

This seems to indicate that a gfServer is still running on the port. To stop gfServer from the command line, it must have been started with the -canStop option. I believe the problem of failing to connect over and over has been fixed. Are you using an old version of BLAT?

But really that will not matter at all because you should not be using blat this way. With a cluster you should be using stand-alone blat and not gfClient/gfServer. Genome data is often "embarrassingly parallelizable", and that certainly applies to alignment.

What we do is split the target and query into multiple parts and then run jobs that blat each combination of parts Qi against Tj. The simplest way to split them up is to have one job per chromosome. A more sophisticated method splits chromosomes into chunks (possibly overlapping), runs standalone blat on each combination of pieces on the cluster, lifts the results back into place, and chains them back together. If the queries are already many small pieces, like genes or RNAs, you can still split your sequences into multiple input files and then have a cluster job call blat for each input file against some target chromosome (or chunk).
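The splitting scheme described above might look something like the following sketch. It is only an illustration: the helper names, the round-robin split, and the file names are hypothetical, and the lifting and chaining steps are not shown. The generated commands use the basic standalone form `blat database query output.psl`.

```python
import itertools


def read_fasta(lines):
    """Yield (header, sequence_lines) records from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, seq
            header, seq = line, []
        elif header is not None and line:
            seq.append(line)
    if header is not None:
        yield header, seq


def split_records(records, n_chunks):
    """Distribute records round-robin into at most n_chunks non-empty lists."""
    chunks = [[] for _ in range(n_chunks)]
    for i, rec in enumerate(records):
        chunks[i % n_chunks].append(rec)
    return [c for c in chunks if c]


def job_lines(target_files, query_files):
    """Build one standalone blat command per (target Tj, query Qi) pair."""
    jobs = []
    pairs = itertools.product(target_files, query_files)
    for j, (target, query) in enumerate(pairs):
        jobs.append("blat %s %s out_%04d.psl" % (target, query, j))
    return jobs
```

Each chunk would be written to its own query file, and the resulting list of commands could then be handed to whatever scheduler is in use (qsub, parasol, and so on), one command per cluster job.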

You can find many detailed examples of using BLAT on clusters in our make docs in our source tree. Look under kent/src/hg/makeDb/doc/. Grep the .txt files for "blat". Many of the examples will show the use of parasol to run cluster jobs. Parasol was created by Jim Kent. Note that Jim Kent is the author and owner of BLAT.

I hope this helps. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

- - -
Luvina Guruvadoo
UCSC Genome Bioinformatics Group

--

Alan Hoyle

Oct 3, 2013, 10:42:07 AM

to Luvina Guruvadoo, Gen...@soe.ucsc.edu

Luvina,

It looks like the blat/gfServer that produced the huge logs is v34, though we are slowly moving to a new cluster that has a freshly compiled install of v35.

The reason we're using gfServer as opposed to stand-alone blat is that we're using Bellerophon, which seems to make many, many small calls using gfClient, and its authors recommend using gfServer.

Quoting http://cbc.case.edu/Bellerophon/README.txt:


***NOTE: Be sure to download the gfClient and gfServer programs. Bellerophon uses the Blat server and not the Blat executable. This results in a considerable increase in speed.

--
- Alan Hoyle - al...@alanhoyle.com - http://www.alanhoyle.com/ -

Jonathan Casper

Oct 4, 2013, 2:54:58 PM

to Alan Hoyle, Gen...@soe.ucsc.edu

Hello Alan,

You will have to check with the authors of Bellerophon for the specific performance issues they encountered with command-line BLAT versus a gfServer/gfClient setup; it is possible their tool makes queries in a manner better handled by gfServer. I can say that users on our cluster have never reported a BLAT performance issue that was resolved by switching to the client/server version. If you are interested in trying command-line BLAT with Bellerophon for comparison, you may be able to make that change relatively easily - command-line BLAT accepts many of the same parameters as gfServer/Client.

Please let us know if you continue to have issues with the updated version of BLAT.

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. Questions sent to that address will be archived in a publicly-accessible forum for the benefit of other users. If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.