Hi Hazel,
stacks should be the first tool you try out. It's a suite of programmes that in principle can do everything from data cleaning and splitting by barcode to calling SNP's, outputting mappable markers in
joinmap format and assembling paired end reads into contigs for blasting (if you did standard RAD a la Baird et al.).
Getting all pipeline components of
stacks to run can be a painful and frustrating process. I recommend that you install stacks on a Linux distro with a package management system and where you have administrator rights. I can post an amended README file for stacks that gives additional info for the installation of dependencies on Ubuntu.
Then you could go through the stacks tutorials with the example data and try to understand how stacks works by reading:
Catchen, J. M.; Amores, A.; Hohenlohe, P.; Cresko, W. &
Postlethwait, J. H.
Stacks: Building and Genotyping Loci De Novo From
Short-Read Sequences
G3: Genes, Genomes, Genetics, 2011,
1, 171-182
I also have a perl script which does some additional filtering of the output of '
export_sql.pl' which is one of
stacks programmes that exports genotypes from the mysql database but does not include enough filters.
The fact that I said that
stacks has a somewhat inferior SNP calling algorithm to
samtools and
GATK shouldn't put you off using it. As long as it is good enough to answer your question, there is no problem with it. It's important though that you understand how SNP's are called by any programme you use and play around with filters to get a reasonable compromise between sensitivity and specificity. For instance, in a small family of a genetic cross, genotype quality (i. e. specificity) is much more important than detecting as many SNP's as possible including many false positives. More SNP markers would not make your genetic map any finer. If you are doing a genome scan for divergence or a GWAS instead, you want as many SNP's as you can get and incorporate SNP allele frequency and genotype uncertainty in downstream analysis, for example with
piMASS.
One issue with
stacks' de novo assembly algorithm is that it doesn't
allow for gaps, which nowadays become more prevalent with the increasing
read lengths achieved by illumina, etc. Possible alternatives to
stacks are for example:
rtd from Brant Peterson:
Peterson, B. K.; Weber, J. N.; Kay, E. H.; Fisher, H.
S. & Hoekstra, H. E.
Double Digest RADseq: An Inexpensive Method
for De Novo SNP Discovery and Genotyping in Model and Non-Model
Species
PLoS ONE, Public Library of Science, 2012, 7,
e37135
Unfortunately, I couldn't get
rtd to run on my Ubuntu compute server. Maybe you have more luck and some python skills :-), since it's a clever way of de-novo assembly.
or custom de novo assembly with programmes like
velvet (with
velvetoptimizer) or
IDBA-UD:
Peng, Y.; Leung, H. C. M.; Yiu, S. M. & Chin, F. Y. L.
IDBA-UD:
a de novo assembler for single-cell and metagenomic sequencing data with
highly uneven depth.
Bioinformatics, Department of Computer
Science, The University of Hong Kong, Pokfulam Road, Hong Kong., 2012,
28, 1420-1428
... followed by mapping of reads against the reference sequence thus created with
stampy, followed by SNP calling with
samtools/bcftools/vcftools or
GATK.
Another SNP caller worth looking into (which I haven't done yet though) is
ANGSD:
Nielsen, R.; Korneliussen, T.; Albrechtsen, A.; Li, Y.
& Wang, J.
SNP calling, genotype calling, and sample allele
frequency estimation from New-Generation Sequencing data.
PLoS One, BGI-Shenzhen,
Shenzhen, China., 2012, 7,
e37558
These SNP callers can include base call and mapping uncertainty into SNP calls which get a Bayesian probability score. One feature of SNP caller which should now be incorporated into
samtools/bcftools,
GATK and
ANGSD (don't take my word on that, I am also still trying to understand these programmes) is the incorporation of allele frequency estimated from all individuals in a population to inform the genotype call of a focal individual:
Li, H.
A statistical framework for SNP calling,
mutation discovery, association mapping and population genetical parameter
estimation from sequencing data.
Bioinformatics, Medical Population
Genetics Program, Broad Institute, 7 Cambridge Center, Cambridge, MA
02142, USA., 2011, 27,
2987-2993
Stacks can't do that as far as I am aware.
These are the options for RAD data analysis that I am aware of so far.
claudius