Homology searching against nr or nt database using blat

maryam moazam

Jul 20, 2015, 5:30:39 PM

to gen...@soe.ucsc.edu

Hi there,

I'm working on a plant RNA-seq analysis and plan to check my transcriptome assembly against the whole nr or nt database to detect common contamination such as Homo sapiens and Escherichia coli DNA, mitochondrial and chloroplast sequences, and rRNA. As you know, running BLAST against nr or nt takes a long time, so I would prefer to use BLAT. I downloaded the tool for a 64-bit Linux machine. Could you please give me a clear example of how to build the database, search against it, and finally extract the unmapped sequences? I have read the documentation available on the UCSC site, but as a biology student who is new to this field I did not find it very helpful. I would really appreciate a quick and clear guide for doing this.

Thanks a lot in advance.

Mary

Jonathan Casper

Jul 24, 2015, 4:02:53 PM

to maryam moazam, gen...@soe.ucsc.edu

Hello Mary,

Thank you for your question about setting up a database to run BLAT against nr with your sequence data. The BLAT set of tools requires a database in the .2bit or .nib format (see http://genome.ucsc.edu/FAQ/FAQformat.html), so the first step of this process would be to convert nr into that format. You can do this by downloading nr in FASTA format from NCBI, and then running our faToTwoBit conversion tool. A precompiled version of the faToTwoBit tool for 64-bit Linux can be found on our download server at http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/.

The next step would be to decide whether you want to run BLAT as a service (a standalone server that you send queries to) or as a separate instance for each query that you run. It sounds like you want to run many sequences against nr, so a persistent standalone server might make more sense. This means downloading and running the gfServer program described at http://genome.ucsc.edu/goldenPath/help/blatSpec.html, and then using the gfClient or webBlat program to send queries to the server.

Example faToTwoBit command:

   faToTwoBit nr.fa nr.2bit

Example of starting gfServer command:

   gfServer start localhost 12345 nr.2bit

Example of sending a query to the server using gfClient, where mydata.fa contains one or more FASTA transcripts and the third argument is the directory that holds nr.2bit (here the current directory, "."):

   gfClient localhost 12345 . mydata.fa output.psl
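
The original question also asked how to pull out the unmapped sequences afterwards. One possible approach, sketched below, relies on the fact that column 10 of a PSL line is the query name and prints every FASTA record whose name never appears there. The function name unmapped_fasta and the file names are hypothetical, and the sketch treats any alignment at all as a hit; in practice you would probably want to filter the PSL on score or identity first.

```shell
# Hypothetical helper: print the FASTA records in $1 that have no line in PSL file $2.
# Assumes the query name is the first word after ">" and PSL column 10 is qName.
unmapped_fasta() {
    awk -v psl="$2" '
        BEGIN {
            # Collect query names from PSL data lines; header lines are skipped
            # because their first field is not a plain number.
            while ((getline line < psl) > 0) {
                n = split(line, f, "\t")
                if (n >= 10 && f[1] ~ /^[0-9]+$/) hit[f[10]] = 1
            }
            close(psl)
        }
        /^>/ {
            id = substr($0, 2)
            sub(/[ \t].*$/, "", id)   # keep only the first word of the header
            keep = !(id in hit)
        }
        keep { print }
    ' "$1"
}
```

For example, after defining the function, `unmapped_fasta mydata.fa output.psl > unmapped.fa` would write the records without any alignment to unmapped.fa.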

For more information about the options available to the gfServer and gfClient programs, please see the program descriptions provided at http://genome.ucsc.edu/goldenPath/help/blatSpec.html. You can also find those usage messages by running the gfClient and gfServer programs without any arguments. Please note that you should contact your system administrators about what is permitted when setting up a server. They may wish to know details about the software and port(s) chosen.

One difficulty that you may encounter in this process is the large size of the nr database. Even when compressed into .2bit format, nr is very large. BLAT was not originally designed to run on such a big database, and may crash or run poorly. If that is the case, then you will need to split the nr FASTA file into several sections and convert those into separate .2bit files. You can then set up a separate gfServer instance for each .2bit file and run your queries against each server in turn.
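
As a rough illustration of the splitting step, the sketch below cuts a FASTA file into chunks of a fixed number of records with awk and then converts each chunk with faToTwoBit. The function names (split_fasta, chunks_to_2bit), the record-count chunking rule, and the output file naming are assumptions made for this example and are not part of the BLAT tools; splitting by total sequence length may balance the chunks better.

```shell
# Hypothetical helper: split FASTA $1 into files of at most $2 records each,
# named <prefix>.001.fa, <prefix>.002.fa, ... where <prefix> is $3.
split_fasta() {
    awk -v n="$2" -v prefix="$3" '
        /^>/ {
            # Start a new output file every n records.
            if (count % n == 0) {
                if (out != "") close(out)
                out = sprintf("%s.%03d.fa", prefix, ++part)
            }
            count++
        }
        out != "" { print > out }
    ' "$1"
}

# Hypothetical helper: convert every chunk written by split_fasta to .2bit.
# Requires faToTwoBit on the PATH.
chunks_to_2bit() {
    for fa in "$1".*.fa; do
        [ -e "$fa" ] || continue
        faToTwoBit "$fa" "${fa%.fa}.2bit"
    done
}
```

Calling `split_fasta nr.fa 1000000 nr_part` followed by `chunks_to_2bit nr_part` would leave one .2bit file per chunk, each of which can then be given to its own gfServer instance on a different port.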

You may also be interested in the following wiki page and question from our mailing list archives: http://genomewiki.ucsc.edu/index.php/Blat-FAQ, https://groups.google.com/a/soe.ucsc.edu/d/topic/genome/rBtE-sKgoVk/discussion.

As an alternative to this project, there are some online resources that already provide a similar service and might save you the trouble of setting up your own server. RIKEN, for example, hosts a "MetaBin" server at http://metabin.riken.jp/.

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu or genome...@soe.ucsc.edu. Questions sent to those addresses will be archived in publicly-accessible forums for the benefit of other users. If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.

--
Jonathan Casper
UCSC Genome Bioinformatics Group

Jonathan Casper

Jul 27, 2015, 6:56:14 PM

to maryam moazam, gen...@soe.ucsc.edu

Hello Mary,

One of our engineers has pointed out that your download of nr in FASTA format will very likely contain amino acid sequence instead of nucleotide sequence. That is a problem, as both the faToTwoBit tool and BLAT's gfServer program expect DNA input. To get around this issue, you might try to set up a BLAT server for your transcriptome assembly and then submit query sequences from nr. This would require turning your transcriptome assembly into a .2bit file (again using faToTwoBit). You could then run gfServer on that assembly along with the -trans option, which prepares the server for protein sequence queries. Finally, you would then run gfClient on pieces of the nr FASTA file using the -q=prot and -t=dnax options. Alternatively, you could try running the standalone "blat" tool without using gfServer or gfClient. "blat" itself is able to run on a plain FASTA protein database.
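
Before converting a download, it may help to check whether it holds protein or nucleotide sequence. The sketch below is only a heuristic: it samples the first sequence lines and calls the file protein if more than 10% of the residues are letters other than A, C, G, T, U, or N. The function name fasta_type, the 1000-line sample, and the 10% threshold are assumptions chosen for illustration. The trailing comments restate the translated-search steps from this message; the "." directory argument and the file names are likewise assumptions.

```shell
# Hypothetical helper: guess whether FASTA file $1 is "protein" or "nucleotide".
fasta_type() {
    awk '
        /^>/ { next }               # skip header lines
        seen >= 1000 { exit }       # sample at most 1000 sequence lines
        {
            seen++
            s = toupper($0)
            total += length(s)
            gsub(/[ACGTUN]/, "", s) # whatever remains is not a plain nucleotide code
            other += length(s)
        }
        END {
            if (total == 0) print "empty"
            else if (other / total > 0.1) print "protein"
            else print "nucleotide"
        }
    ' "$1"
}

# Sketch of the translated search described above (not run here):
#   faToTwoBit transcriptome.fa transcriptome.2bit
#   gfServer start localhost 12345 -trans transcriptome.2bit
#   gfClient -t=dnax -q=prot localhost 12345 . nr_part.fa output.psl
```

Running `fasta_type nr.fa` on a protein download should print "protein", which signals that the server has to be built from the transcriptome rather than from nr.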

--
Jonathan Casper
UCSC Genome Bioinformatics Group
