Hello Mary,
Thank you for your question about setting up a database to run BLAT against nr with your sequence data. The BLAT set of tools requires a database in the .2bit or .nib format (see http://genome.ucsc.edu/FAQ/FAQformat.html), so the first step of this process would be to convert nr into that format. You can do this by downloading nr in FASTA format from NCBI, and then running our faToTwoBit conversion tool. A precompiled version of the faToTwoBit tool for 64-bit Linux can be found on our download server at http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/.
The next step would be to decide whether you want to run BLAT as a service - a standalone server that you send queries to - or as a separate instance for each query that you run. It sounds like you want to run many sequences against nr, so a persistent standalone server might make more sense. This means downloading and running the gfServer program described at http://genome.ucsc.edu/goldenPath/help/blatSpec.html, and then using the gfClient or webBlat program to send queries to the server.
Example commands:
faToTwoBit nr.fa nr.2bit
gfServer start localhost 12345 nr.2bit
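gfClient localhost 12345 . mydata.fa output.psl
(The third gfClient argument is the directory that contains the .2bit file; "." here means the current directory.)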
One difficulty that you may encounter in this process is the large size of the nr database. Even when compressed into .2bit format, nr is very large. BLAT was not originally designed to run on such a big database, and may crash or run poorly. If that is the case, then you will need to split the nr FASTA file into several sections and convert those into separate .2bit files. You can then set up a separate gfServer instance for each .2bit file and run your queries against each server in turn.
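If you do need to split the FASTA file, a few lines of awk are enough. The following is only a minimal sketch, not one of our tools: the function name split_fasta, the chunk size, and the nr_part prefix are made up for illustration. It starts a new output file every N records, so no sequence is cut in half.

```shell
#!/bin/sh
# split_fasta INPUT RECORDS_PER_CHUNK PREFIX
# Writes PREFIX_1.fa, PREFIX_2.fa, ... each holding at most
# RECORDS_PER_CHUNK complete FASTA records.
# (Hypothetical helper, not part of the BLAT suite.)
split_fasta() {
    awk -v n="$2" -v prefix="$3" '
        /^>/ {
            # A header line starts a new record; switch to a new
            # chunk file every n records.
            if (count % n == 0) {
                if (out != "") close(out)
                chunk++
                out = prefix "_" chunk ".fa"
            }
            count++
        }
        # Write every line (header or sequence) to the current chunk.
        # Lines before the first header are ignored.
        out != "" { print > out }
    ' "$1"
}

# Example (placeholder file name and chunk size):
# split_fasta nr.fa 1000000 nr_part
```

Each resulting file can then be handled on its own, as described above. The right chunk size depends on the memory available on your machine and would need some experimentation.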
You may also be interested in the following wiki page and question from our mailing list archives: http://genomewiki.ucsc.edu/index.php/Blat-FAQ, https://groups.google.com/a/soe.ucsc.edu/d/topic/genome/rBtE-sKgoVk/discussion.
As an alternative to this project, there are some online resources that already provide a similar service and might save you the trouble of setting up your own server. RIKEN, for example, hosts a "MetaBin" server at http://metabin.riken.jp/.
I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu or genome...@soe.ucsc.edu. Questions sent to those addresses will be archived in publicly-accessible forums for the benefit of other users. If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.
--
Jonathan Casper
UCSC Genome Bioinformatics Group
--
Hello Mary,
One of our engineers has pointed out that your download of nr in FASTA format will very likely contain amino acid sequence instead of nucleotide sequence. That is a problem, as both the faToTwoBit tool and BLAT's gfServer program expect DNA input. To get around this issue, you might try to set up a BLAT server for your transcriptome assembly and then submit query sequences from nr. This would require turning your transcriptome assembly into a .2bit file (again using faToTwoBit). You could then run gfServer on that assembly along with the -trans option, which prepares the server for protein sequence queries. Finally, you would then run gfClient on pieces of the nr FASTA file using the -q=prot and -t=dnax options. Alternatively, you could try running the standalone "blat" tool without using gfServer or gfClient. "blat" itself is able to run on a plain FASTA protein database.
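The gfServer/gfClient route described above might look roughly like the sketch below. The file names (transcriptome.fa, nr_part_*.fa), the port number, and the run and query_chunks helpers are placeholders chosen for illustration; only faToTwoBit, gfServer, gfClient and the -trans, -t=dnax and -q=prot options come from the BLAT suite. Setting DRY_RUN=1 prints the gfClient commands instead of running them, which is a convenient way to check them first.

```shell
#!/bin/sh
# Sketch of a translated search: protein queries against a DNA database.
# HOST, PORT, file names and the helpers below are assumptions.
HOST=localhost
PORT=12345

# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# query_chunks SEQDIR CHUNK...
# Sends each protein FASTA chunk to the server and writes CHUNK.psl.
# SEQDIR is the directory that holds the .2bit file.
query_chunks() {
    seqdir=$1
    shift
    for chunk in "$@"; do
        run gfClient -t=dnax -q=prot "$HOST" "$PORT" "$seqdir" \
            "$chunk" "${chunk%.fa}.psl"
    done
}

# One-time setup (uncomment to use; wait for the server to finish
# indexing before sending queries):
# faToTwoBit transcriptome.fa transcriptome.2bit
# gfServer start "$HOST" "$PORT" -trans transcriptome.2bit &
# query_chunks . nr_part_*.fa
```

Because the server holds the much smaller transcriptome rather than nr, the large nr file only has to be read in pieces on the client side.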
--
Jonathan Casper
UCSC Genome Bioinformatics Group