How To Download Blast Nt Database


Divina Hujer

Jan 5, 2024, 1:26:13 AM, to rucucordia

The best way to obtain BLAST databases is to download them from NCBI or from a cloud provider (currently Google Cloud Platform and Amazon Web Services). These are the same databases available via the public BLAST web service, are updated regularly, and have taxonomic information built in. They can also serve as a source of biological sequence data (see below).


This command downloads the compressed nr BLAST database from NCBI into the current working directory and decompresses it. Subsequent identical invocations in the same directory will download data only if its time stamp differs from that of NCBI's copy.
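The command itself is not quoted in the message; it is presumably the update_blastdb.pl script that ships with BLAST+, invoked roughly like this:

```shell
# Fetch the pre-formatted nr volumes into the current directory and
# decompress them; re-running only re-fetches volumes whose time stamps
# differ from NCBI's copies.
perl update_blastdb.pl --decompress nr
```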

NCBI has a large number of databases available on the web service. You'll use only a few of them today. You select databases using the pull-down lists and radio buttons in the 'Choose Search Set' section of the BLAST submission form. Both the nucleotide page and the protein page have standard and experimental databases.

The nucleotide and protein pages have default databases called nr/nt and nr, respectively. Other databases include useful subsets of these, as well as separate databases with different content. Today you'll search the default databases as well as the NCBI RefSeq subsets. You'll also search a whale genome assembly that's available through a separate Genome BLAST page.

I'm working with a fungus that has no sequence information in GenBank. I've used Sanger sequencing to generate sequence data from a number of cultures and used those sequences to build a custom BLAST database.

I am not sure, but shouldn't you add the actual database name to your path? Right now it points to a folder rather than a database alias or index.
So instead of /apps/galaxy/galaxy_staging1/tool-data/app_db/mfas/ I think it should be /apps/galaxy/galaxy_staging1/tool-data/app_db/mfas/mfas

It is odd that you were not able to decompress the downloaded files. Yes, they all need to be decompressed and must stay in one directory. The nr database is on the order of 300-400 GB of data, so the decompression job will run for a while and will need that much free space available. You should not need to capture logs unless your download has somehow been corrupted and is generating error messages; in that case you will need to re-download the data.

First, nr is a protein database, so it will never work with blastn. You need to use either blastp or blastx, or, if you want a nucleotide database, download nt instead. Second, once downloaded, the databases are simply called nr and nt. So check which one you need first.
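A minimal illustration of matching the program to the database type (the query file name here is hypothetical):

```shell
# Nucleotide query against the protein db nr: blastx translates the query
blastx -db nr -query contigs.fna -outfmt 6 -out contigs_vs_nr.tsv

# Nucleotide query against the nucleotide db nt: plain blastn
blastn -db nt -query contigs.fna -outfmt 6 -out contigs_vs_nt.tsv
```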

I downloaded the nt database, then submitted a job with -db nt (not the path to nt), and it ran. Thank you. Not specifying the path to the database and just writing 'nt' was non-obvious (to me, at least).

Repeat that for all the files (a short script will automate this) and you will have the database unpacked. You may want to delete the archives after unpacking unless you know for sure that disk space is not an issue.
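One way to script the unpacking, as a minimal sketch:

```shell
# Unpack every BLAST database volume (*.tar.gz) in the current directory,
# deleting each archive after a successful extraction to reclaim space.
unpack_volumes() {
    for f in *.tar.gz; do
        [ -e "$f" ] || continue   # no archives present; nothing to do
        tar -xzf "$f" && rm -f "$f"
    done
}
```

Run it from the directory holding the downloaded volumes; drop the rm to keep the archives.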

Downloading and maintaining local copies of BLAST databases has a substantial learning curve, as you are discovering. You may wish to try BIRCH, which has a complete set of automated tools for BLAST databases, run through an easy-to-use graphical interface. BIRCH's blastdbkit generates disk usage reports for BLAST databases; downloads, verifies and decompresses the files; and has an update mechanism that only downloads database files newer than those currently installed. Because database files are huge, downloads can fail. Simply restart blastdbkit and the download will pick up where it left off.

Hey, so for the past two days I have been trying to install and run a standalone BLAST (ncbi-blast-2.2.30+) on a CentOS system. I managed to download the nr reference set from the NCBI FTP site using wget and extracted the file using the command tar -xvpf nr.gz. I got an nr file of 33 GB, but when I try to format the file using the command

You need a z in your tar command for gzip files: use -xvzf. The file is probably not unzipped correctly. But also, the nr database is protein ("Non-Redundant" peptides), so you will be creating a protein database; think about whether that is what you want and why. I believe that if you ran makeblastdb correctly it would tell you that you are mixing up protein and nucleotide sequences. If you want the whole nucleotide database, it is called nt ("Non-Translated" nucleotides). I can't remember if those are the official expansions or just what my brain uses to keep them the right way around.

OK, I think I found the problem. When making a BLAST db, or when using any masker algorithm like DUST, WindowMasker, etc., an error is raised if the program finds an empty record. This means a record of the form:
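The example record appears to have been lost from the post; an "empty record" here is presumably a FASTA defline with no sequence before the next defline, e.g.:

```
>record_with_no_sequence some description
>next_record
ACGTACGT
```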

Now use the Perl script to download the database of your choice. The --decompress option automatically decompresses the tar.gz files. Depending on which database you choose and your internet speed, this can be a lengthy process.

This kind of database can be generated from an existing corpus of sequence data (as I will show later in this post), but NCBI conveniently offers a comprehensive set of them for download. These preformatted databases are available through their FTP site at the following path:

After a couple of seconds you should find an all_sp.fa file with all the sequences from the specified database stored in FASTA format. You can check this by running head all_sp.fa; the first few lines should look like:
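The command that produces all_sp.fa is not shown above; presumably it is a blastdbcmd dump of every entry, roughly like this (the database name swissprot is a guess from the file name):

```shell
# Dump every entry of the database to a single FASTA file
blastdbcmd -db swissprot -entry all -out all_sp.fa
```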

You can use the -entry parameter in a different way: to filter which entries to extract from your database. Instead of writing all, you can supply an identifier that is matched against the entries, returning only those that comply with it. For instance, this is how you would query the database to get the sequence data for the protein named LEC_LUFAC:
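A sketch of that query (database name assumed, as above):

```shell
# Extract a single entry by its identifier; output is FASTA by default
blastdbcmd -db swissprot -entry LEC_LUFAC
```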

By default, the output of blastdbcmd follows the FASTA format. But you can change this as well, using the -outfmt parameter.
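For example, a tabular summary instead of FASTA (database name assumed as before):

```shell
# Print accession, sequence length and title, one line per entry
blastdbcmd -db swissprot -entry LEC_LUFAC -outfmt "%a %l %t"
```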

Thanks to the many different files that the BLAST suite uses to store its data, your $BLASTDB directory can quickly become a mess. blastdbcmd can print a list of the databases currently available there via the -list parameter:
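A sketch of that listing:

```shell
# List the databases found under $BLASTDB, with molecule type and title
blastdbcmd -list "$BLASTDB" -list_outfmt "%f %p %t"
```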

First of all, we are going to download some raw sequence data to serve as a starting point for creating the database. You could use your own, of course, but for this tutorial we will use the full set of nucleotide sequences from the Drosophila melanogaster genome.
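Once the FASTA file is on disk (the file and output names below are hypothetical), building the database is a single makeblastdb call:

```shell
# Build a nucleotide BLAST database from the downloaded FASTA;
# -parse_seqids keeps the original identifiers retrievable via blastdbcmd
makeblastdb -in dmel_genome.fna -dbtype nucl -parse_seqids \
    -out dmel_nt -title "D. melanogaster nucleotide sequences"
```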

Picture the following scenario: you and your team are continuously getting new data from an ongoing sequencing project. You receive the first batch and set up a BLAST instance to work with it. You tell your team (or set up your scripts) to use the database called Cancer_NT_Jan_2016, which you have just built with all the data available at this moment.

The BLAST suite helps you solve this kind of problem with aliases: placeholder names that look and work like BLAST databases but point to one (or many) underlying data sources. Using an alias, you can tell your team to use Cancer_NT_Latest as their database name, and they will always have the most up-to-date data available, regardless of which database sources you add to or remove from it.
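A sketch of creating such an alias with blastdb_aliastool (the second batch name is hypothetical):

```shell
# Create an alias "Cancer_NT_Latest" that searches the listed databases;
# re-run with an updated -dblist whenever a new batch arrives
blastdb_aliastool -dbtype nucl -out Cancer_NT_Latest \
    -title "Cancer NT, latest batches" \
    -dblist "Cancer_NT_Jan_2016 Cancer_NT_Feb_2016"
```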

I want to use a local blastn command to BLAST a multi-FASTA file of 700 sequences using the following command: blastn -db nt -query fasta_all.fasta -num_alignments 2 -out fasta_blasted.txt, but I receive the error: BLAST Database error: Error: Not a valid version 4 database.

I use a local nt database which I downloaded today (11th of May 2020) by running update_blastdb.pl --decompress nt [*]. I did not do any post-processing; just the "raw" downloaded files are in the folder (see the image below, which shows only part of the files).

So both the database and the blastn version should be the latest. It probably has something to do with the version of either the database or the software, but as both are the latest versions, I find it strange that they would not be compatible. Any ideas what causes this error?
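One quick way to see which format the local files actually are (assuming BLAST+ is on the PATH and BLASTDB points at the download directory):

```shell
# Print metadata for the local nt database, including its format version
blastdbcmd -db nt -info
```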

BLAST databases are pre-formatted to work with the commands from BLAST and BLAST+. These databases are equivalent to what you could build from FASTA files with the makeblastdb command. Each set is located in a directory with the name pattern /db/ncbiblast/year-month-date; for example, databases downloaded on February 01, 2022 are available in /db/ncbiblast/20220201. The following databases are currently available, stored together with the other databases downloaded at the same time: nr, nt, taxdb, refseq_rna, refseq_protein, mouse_genome, human_genome and swissprot. The database refseq_genomic is no longer available for download as a pre-formatted database.

NCBI BLAST datasets can be loaded just like software modules. This allows users to reproduce results by always being able to use a time-stamped version of a database; the time stamp corresponds to the download date of the dataset. To search for the available NCBI BLAST dataset modules, run the command:
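The command itself is missing from the post; on an Lmod-style system it is presumably something like (the module name ncbiblast is an assumption based on the paths above):

```shell
module spider ncbiblast
# or, on plain environment-modules systems:
module avail ncbiblast
```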

Loading this module sets the environment variable BLASTDB to /db/ncbiblast/20210616, which lets BLAST+ search that directory for databases. You can then refer to the database you want by name alone (nt, nr, etc.). Here is an example of a shell script to run BLAST+ on the batch queue using the nt database:
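The example script is also missing from the post; a minimal SLURM-style sketch (the scheduler, module name, resource values and file names are all assumptions):

```shell
#!/bin/bash
#SBATCH --job-name=blastn_nt
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G

# Load a time-stamped database module; this sets $BLASTDB
module load ncbiblast/20210616

# With BLASTDB set, the database can be referenced by name alone
blastn -db nt -query query.fasta -num_threads 8 \
    -outfmt 6 -out query_vs_nt.tsv
```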

An old version of the v5 database (downloaded in June 2019) is located in a directory with the name pattern /db/ncbiblast.v5/dbname, where dbname is the name of the database, such as nr, nt, taxdb, refseq_rna, swissprot, etc.

Here the main goal is to obtain a class-specific database from a pre-formatted NCBI database.
In this example we want to create an EST database for the class Hydrozoa using an accession list retrieved from NCBI.
Once we have these files, we will create the taxonomic mapping file that can then be used with the Make Blast Database feature within OmicsBox.
