SPAdes (St. Petersburg genome assembler) is an assembly toolkit that contains assembly pipelines for different types of sequencing data. It was originally developed for de novo assembly of bacterial and viral genomes from single-cell or isolate samples, but it has been extended to support metagenomic, plasmid, transcriptomic, and biosynthetic gene cluster assembly as well. SPAdes can also perform hybrid assembly that combines short reads (Illumina or IonTorrent) with long reads (PacBio, Oxford Nanopore, or Sanger). It is one of the most widely used assemblers in the field.
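In this article, I will show you how to download and use the SPAdes assembler for your own genome assembly projects. I will cover downloading, installing, and testing SPAdes, preparing the input data, choosing the pipeline and command line options, and reading the output files.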
By the end of this article, you should be able to perform de novo genome assembly using SPAdes with confidence and ease. Let's begin!
The first step is to download SPAdes from its official website: http://cab.spbu.ru/software/spades/. You can choose to download either the pre-compiled binaries or the source code, depending on your operating system and preference. At the time of writing, the latest version of SPAdes is 3.15.5, which was released on July 14th, 2022 under the GPLv2 license.
If you are using a Linux system (64-bit only), you can download the pre-compiled binaries from the website. The file name is SPAdes-3.15.5-Linux.tar.gz. You can use the following command to download it:
wget http://cab.spbu.ru/files/release3.15.5/SPAdes-3.15.5-Linux.tar.gz
Alternatively, you can use a web browser to download it manually. After downloading, you need to extract the file using the following command:
tar -xzf SPAdes-3.15.5-Linux.tar.gz
This will create a folder named SPAdes-3.15.5-Linux, which contains the executable files and other resources for SPAdes.
If you are using a Mac system (64-bit only), you can download the pre-compiled binaries from the website as well. The file name is SPAdes-3.15.5-Darwin.tar.gz. You can use the following command to download it:
wget http://cab.spbu.ru/files/release3.15.5/SPAdes-3.15.5-Darwin.tar.gz
Alternatively, you can use a web browser to download it manually. After downloading, you need to extract the file using the following command:
tar -xzf SPAdes-3.15.5-Darwin.tar.gz
This will create a folder named SPAdes-3.15.5-Darwin, which contains the executable files and other resources for SPAdes.
If you prefer to compile SPAdes from source code, or if you are using a different operating system, you can download the source code from the website as well. The file name is SPAdes-3.15.5.tar.gz. You can use the following command to download it:
wget http://cab.spbu.ru/files/release3.15.5/SPAdes-3.15.5.tar.gz
Alternatively, you can use a web browser to download it manually. After downloading, you need to extract the file using the following command:
tar -xzf SPAdes-3.15.5.tar.gz
This will create a folder named SPAdes-3.15.5, which contains the source code and other resources for SPAdes.
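The three downloads above differ only in the archive name, so the URL can be derived from the version and platform. The following is a minimal sketch, assuming the release URL pattern shown for the Mac archive also holds for the Linux and source archives; `spades_url` is a hypothetical helper name, not something shipped with SPAdes.

```shell
# Build the download URL for a SPAdes release archive.
# $1 = version (e.g. 3.15.5), $2 = platform: Linux, Darwin, or source
# Assumption: all archives live under http://cab.spbu.ru/files/release<version>/
spades_url() {
    version=$1
    platform=$2
    base="http://cab.spbu.ru/files/release${version}"
    case "$platform" in
        Linux|Darwin) echo "${base}/SPAdes-${version}-${platform}.tar.gz" ;;
        source)       echo "${base}/SPAdes-${version}.tar.gz" ;;
        *)            echo "unknown platform: $platform" >&2; return 1 ;;
    esac
}

# Usage (needs network access), left as comments so nothing is downloaded here:
#   url=$(spades_url 3.15.5 Linux)
#   wget "$url" && tar -xzf "$(basename "$url")"
```

Keeping the version in one place makes it easier to repeat the same steps when a newer release appears.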
To compile SPAdes from source code, you need to have some prerequisites installed on your system, such as CMake, GCC, Python 2 or 3, zlib, bzip2, and Boost libraries. You can check the detailed instructions on how to install these prerequisites on the SPAdes website: http://cab.spbu.ru/software/spades/#prereq. Once you have installed the prerequisites, you can use the following commands to compile SPAdes:
cd SPAdes-3.15.5
./spades_compile.sh
This will create an executable file named spades.py in the bin folder.
After downloading and extracting (or compiling) SPAdes, you need to install it on your system. The installation process is very simple and straightforward. You just need to add the bin folder of SPAdes to your system's PATH variable, so that you can run SPAdes from any directory.
If you are using a Linux system, you can add the bin folder of SPAdes to your PATH variable by editing your .bashrc file (or equivalent) in your home directory. You can use the following command to open the file with a text editor (such as nano):
nano ~/.bashrc
Then, add the following line at the end of the file (replace /path/to/SPAdes-3.15.5-Linux/bin with the actual path of your SPAdes bin folder):
export PATH=$PATH:/path/to/SPAdes-3.15.5-Linux/bin
Save and close the file, and then run the following command to apply the changes:
source ~/.bashrc
You can now run SPAdes from any directory by typing spades.py.
If you are using a Mac system, you can add the bin folder of SPAdes to your PATH variable by editing your .bash_profile file (or equivalent) in your home directory. You can use the following command to open the file with a text editor (such as nano):
nano ~/.bash_profile
Then, add the following line at the end of the file (replace /path/to/SPAdes-3.15.5-Darwin/bin with the actual path of your SPAdes bin folder):
export PATH=$PATH:/path/to/SPAdes-3.15.5-Darwin/bin
Save and close the file, and then run the following command to apply the changes:
source ~/.bash_profile
You can now run SPAdes from any directory by typing spades.py.
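A plain export line appends the folder again every time the startup file is sourced, so PATH slowly fills with duplicates. As an optional refinement, here is a small sketch that adds the folder only when it is missing. The function name `add_to_path` is hypothetical, and the folder is the same placeholder used above.

```shell
# Append a directory to PATH only if it is not already listed.
add_to_path() {
    dir=$1
    case ":$PATH:" in
        *":$dir:"*) ;;                # already present, nothing to do
        *) PATH="$PATH:$dir" ;;       # not present, append it
    esac
    export PATH
}

# Placeholder location; replace it with the actual path of your SPAdes bin folder.
add_to_path "/path/to/SPAdes-3.15.5-Linux/bin"
```

Wrapping PATH in colons before matching means the check compares whole entries, so a folder that merely shares a prefix with an existing entry is not mistaken for a duplicate.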
After installing SPAdes, you should verify that it works properly on your system. You can do this by running a self-test that comes with SPAdes. The self-test will run SPAdes on a small dataset and check if the output matches the expected results.
To run the self-test, call spades.py with the --test option. Because the bin folder is now on your PATH, you can do this from any directory in which you have write permission:
spades.py --test
This will launch SPAdes in test mode and run it on a small dataset of E. coli reads. The test takes a few minutes at most, and it will generate some output files in a folder named spades_test inside the current directory. You should see something like this at the end of the test:
========= TEST PASSED CORRECTLY.
This means that SPAdes ran successfully and produced the correct output. If you see any errors or warnings, you should check the log file (spades_test/spades.log) for more details and troubleshoot the problem.
Now that you have installed and verified SPAdes, you are ready to use it for your own genome assembly projects. To run SPAdes, you need to provide some input data and some command line options for different assembly pipelines.
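The input data for SPAdes are sequencing reads from one or more samples. SPAdes can handle various types of reads, including paired-end (PE), mate-pair (MP), and unpaired short reads, as well as PacBio, Oxford Nanopore, and Sanger long reads.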
You need to specify the type and format of your input reads using different command line options. The most common options are:
| Option | Description |
|---|---|
| -1 <filename> | The file name with forward PE reads (in FASTQ or FASTA format). |
| -2 <filename> | The file name with reverse PE reads (in FASTQ or FASTA format). |
| --s1 <filename> | The file name with unpaired reads (in FASTQ or FASTA format). |
| --pacbio <filename> | The file name with PacBio SMRT reads (in FASTQ or FASTA format). |
| --nanopore <filename> | The file name with Oxford Nanopore reads (in FASTQ or FASTA format). |
| --sanger <filename> | The file name with Sanger reads (in FASTQ or FASTA format). |
| --pe1-12 <filename> | The file name with interlaced forward and reverse PE reads (in FASTQ or FASTA format). |
| --mp1-12 <filename> | The file name with interlaced forward and reverse MP reads (in FASTQ or FASTA format). |
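When forward and reverse reads come in separate files (the -1 and -2 options), the two files should contain the same number of reads in the same order. A quick check before a long run can save time. The sketch below assumes uncompressed FASTQ files with exactly four lines per record; the helper names `count_fastq_reads` and `check_pairs` are hypothetical.

```shell
# Count records in an uncompressed FASTQ file (assumes 4 lines per record).
count_fastq_reads() {
    lines=$(wc -l < "$1")
    echo $((lines / 4))
}

# Succeed only if both files exist, are non-empty, and hold the same number of reads.
check_pairs() {
    [ -s "$1" ] && [ -s "$2" ] || { echo "missing or empty input file" >&2; return 1; }
    n1=$(count_fastq_reads "$1")
    n2=$(count_fastq_reads "$2")
    if [ "$n1" -ne "$n2" ]; then
        echo "read count mismatch: $n1 vs $n2" >&2
        return 1
    fi
    echo "$n1 read pairs"
}

# Usage with the example file names from this article:
#   check_pairs illumina_pe_1.fastq illumina_pe_2.fastq
```

This only compares counts. It does not verify that read names match between the two files.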
You can use multiple options to provide reads from different sources or libraries. For example, if you have PE reads from Illumina and SMRT reads from PacBio, you can use the following options:
-1 illumina_pe_1.fastq -2 illumina_pe_2.fastq --pacbio pacbio_smrt.fastq
You can also use the --dataset <filename> option to provide a YAML file that describes your input data in more detail. For example, you can specify the library type and read orientation for each library and list the read files that belong to it. You can find more information on how to create a YAML file on the SPAdes website: http://cab.spbu.ru/software/spades/#dataset.
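To make the YAML option more concrete, here is a sketch of a dataset file that describes the same two libraries as the command above: one paired-end library and one PacBio library. It follows the layout used in the SPAdes manual as I understand it. The output name dataset.yaml is my own choice, and absolute paths to the read files are the safer option in a real project.

```shell
# Write a dataset description for one paired-end library and one PacBio library.
# The read file names reuse the example names from this article.
cat > dataset.yaml <<'EOF'
[
  {
    orientation: "fr",
    type: "paired-end",
    left reads: ["illumina_pe_1.fastq"],
    right reads: ["illumina_pe_2.fastq"]
  },
  {
    type: "pacbio",
    single reads: ["pacbio_smrt.fastq"]
  }
]
EOF

# The file would then be passed to SPAdes like this (not executed here):
#   spades.py --dataset dataset.yaml -o my_project
```

A YAML file pays off mostly when you have several libraries of the same type, because each library gets its own entry with its own file lists.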
The next step is to choose the appropriate command line options for the assembly pipeline that suits your data and goal. SPAdes has several assembly pipelines for different types of data, such as:
- --sc: Single-cell assembly pipeline for bacterial genomes from single-cell (MDA) samples.
- --meta: Metagenomic assembly pipeline for mixed microbial communities.
- --plasmid: Plasmid assembly pipeline for plasmid detection and extraction.
- --rna: Transcriptomic assembly pipeline for RNA-Seq data.
- --isolate: Isolate assembly pipeline for bacterial or viral genomes from isolate samples.
- --moleculo: Moleculo assembly pipeline for long synthetic reads from Moleculo technology.
- --bio: Biosynthetic gene cluster assembly pipeline for secondary metabolite gene clusters.
You can use one of these options to run the corresponding pipeline, or you can omit them to run the default pipeline, which is suitable for most cases. For example, if you want to assemble a bacterial genome from single-cell data, you can use the following option:
--sc
If you want to assemble a metagenomic sample from mixed reads, you can use the following option:
--meta
If you want to assemble a transcriptome from RNA-Seq data, you can use the following option:
--rna
In addition to these pipeline options, you can also use some other options to customize your assembly process, such as:
- -k <value>: The k-mer size to use for assembly. You can specify a single value (e.g. -k 21) or a comma-separated list of values (e.g. -k 21,33,55). All values must be odd, less than 128, and listed in ascending order. The default value is auto, which means that SPAdes will choose the k-mer sizes based on your data.
- -t <value>: The number of threads to use for assembly. The default value is 16.
- -m <value>: The memory limit for assembly in GB. The default value is 250. SPAdes stops if it reaches this limit.
- --careful: The option to run SPAdes in careful mode, which will reduce the number of mismatches and short indels in the resulting assembly.
- --only-assembler: The option to run only the assembly module of SPAdes, skipping read error correction.
- --continue: The option to resume a previously interrupted run of SPAdes from the last available checkpoint.
You can find more information on the available command line options on the SPAdes website: http://cab.spbu.ru/software/spades/#manual.
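A mistyped k-mer list is an easy way to lose a run at startup, so it can help to check the list before launching. The sketch below encodes the rules the SPAdes manual gives for -k (odd values, below 128, ascending order). `validate_kmers` is a hypothetical helper, and the thread and memory values in the usage part are arbitrary examples.

```shell
# Check a comma-separated k-mer list: every value must be an odd integer
# below 128, and the values must be in strictly ascending order.
validate_kmers() {
    prev=0
    for k in $(echo "$1" | tr ',' ' '); do
        case "$k" in
            ''|*[!0-9]*) echo "not a number: $k" >&2; return 1 ;;
        esac
        if [ $((k % 2)) -eq 0 ] || [ "$k" -ge 128 ] || [ "$k" -le "$prev" ]; then
            echo "invalid k-mer size: $k" >&2
            return 1
        fi
        prev=$k
    done
    [ "$prev" -gt 0 ]   # fail when the list was empty
}

# Dry run: print the command instead of executing it.
kmers="21,33,55"
if validate_kmers "$kmers"; then
    echo "spades.py -k $kmers -t 8 -m 32 --careful -1 illumina_pe_1.fastq -2 illumina_pe_2.fastq -o my_project"
fi
```

Printing the command first is a cheap way to review the full option set before committing hours of compute to it.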
After running SPAdes, you will get some output files and statistics in the output folder that you specify with the -o option. For example, if you run SPAdes with the following command:
spades.py -1 illumina_pe_1.fastq -2 illumina_pe_2.fastq --pacbio pacbio_smrt.fastq -o my_project
You will get a folder named my_project, which contains the following files and subfolders:
| File or subfolder | Description |
|---|---|
| spades.log | The log file that records the progress and status of SPAdes. |
| params.txt | The file that contains the parameters and options used for SPAdes. |
| dataset.info | The file that contains the information about the input data. |
| corrected/ | The subfolder that contains the error-corrected reads. |
| mismatch_corrector/ | The subfolder that contains the mismatch-corrected contigs and scaffolds. |
| K21/ K33/ K55/ .../ | The subfolders that contain the intermediate assemblies for each k-mer size. |
| scaffolds.fasta | The final assembly file that contains the scaffolds (sequences with gaps). |
| contigs.fasta | The final assembly file that contains the contigs (sequences without gaps). |
| assembly_graph.fastg | The final assembly graph file in FASTG format. |
| scaffolds.paths | The file that contains the paths of contigs in scaffolds. |
| contigs.paths | The file that contains the paths of edges in contigs. |
| spades.yaml | The file that contains the summary statistics and quality metrics of the final assembly. |
To evaluate the quality and accuracy of your assembly, you can look at some of these output files and at the statistics derived from them.
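As one possible starting point, the sketch below computes three common summary statistics from a FASTA file such as contigs.fasta: the number of sequences, their total length, and the N50. The choice of these three metrics is mine and is not taken from the SPAdes documentation. `assembly_stats` is a hypothetical helper built from standard awk and sort.

```shell
# Print sequence count, total length, and N50 for a FASTA file.
# N50 here is the length of the sequence at which the running sum of
# lengths (longest first) reaches at least half of the total length.
assembly_stats() {
    awk '
        /^>/ { if (len > 0) print len; len = 0; next }   # header line: flush the previous sequence
        { len += length($0) }                           # sequence line: accumulate its length
        END { if (len > 0) print len }                  # flush the last sequence
    ' "$1" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }
        END {
            half = total / 2
            for (i = 1; i <= NR; i++) {
                run += lens[i]
                if (run >= half) { n50 = lens[i]; break }
            }
            printf "contigs=%d total=%d N50=%d\n", NR, total, n50
        }
    '
}

# Usage with the example output folder from this article:
#   assembly_stats my_project/contigs.fasta
```

Fewer, longer sequences and a higher N50 generally point to a more contiguous assembly, but these numbers say nothing about correctness on their own.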
Q: How can I get help or report a bug for SPAdes?
A: You can get help or report a bug for SPAdes by contacting the developers via email or GitHub. The email address is spades....@cab.spbu.ru. The GitHub repository is https://github.com/ablab/spades. You can also check the FAQ section on the SPAdes website for some common questions and answers: http://cab.spbu.ru/software/spades/#faq.
Q: How can I update SPAdes to the latest version?
A: You can update SPAdes to the latest version by downloading the new binaries or source code from the SPAdes website: http://cab.spbu.ru/software/spades/. To see which version you currently have installed, run spades.py --version and compare the result with the latest release listed on the website.
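Q: How can I uninstall SPAdes?
A: You can uninstall SPAdes from your system by deleting the SPAdes folder and removing its bin folder from your PATH variable (that is, deleting the export line you added to .bashrc or .bash_profile). You can also delete any output files or folders that you have created with SPAdes.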