Bioinformatics / Pipelines

Clemens

Oct 10, 2012, 12:06:17 PM

to NGS...@googlegroups.com

Post anything about the Bioinformatics aspect of NGS

a.westram

Oct 30, 2012, 1:04:06 PM

to NGS...@googlegroups.com

Hi,

If you don't know much about computing, bioinformatics, etc. yet, you might like this book:
http://practicalcomputing.org/
It's written specifically for biologists and focuses on the computing we really need to do when analyzing NGS data. You will learn some Python and shell scripting, for example.

cheers, Anja

claudiuskerth

Nov 2, 2012, 10:23:47 AM

to NGS...@googlegroups.com

Also, definitely check out the tutorial "Unix & Perl Primer for Biologists":

http://korflab.ucdavis.edu/Unix_and_Perl/index.html

I went through it two years ago, starting with absolutely zero prior knowledge. It's succinct, relevant for biologists and totally free!

This tutorial quickly teaches you a lot of useful Unix and Perl skills, but have no illusions: it's still only a primer. It is, however, all you need to get going, even if you later need to do more advanced stuff. For instance, so-called "complex data structures" in Perl (which are crucial for more complex tasks) are introduced only briefly at the end. For those you have the very good documentation that ships with Perl. The quality of this documentation (mostly in the form of tutorials) is one of Perl's many strengths.

If you don't want to learn Perl, then you can just go through the first part of the tutorial about Unix.

Happy programming :-)

claudius

Paul

Nov 2, 2012, 11:09:31 AM

to NGS...@googlegroups.com

It's worth knowing that the primer served as the basis for Bradnam and Korf's new book "UNIX and Perl to the Rescue!", which is worth a look too.

hazelperry88

Apr 10, 2013, 10:44:46 AM

to NGS...@googlegroups.com

Hi,
In about five weeks I will get the data from some RAD sequencing that is being done for me externally. Before the workshop my plan had been to do all the analysis with Stacks; however, during the workshop it was mentioned that Stacks uses an inferior method of calling SNPs. As my initial plan is to use the RAD data to call SNPs and ultimately to map these to the reference genome (I don't have access to this yet), I am now unsure whether Stacks is the best option. I have very little experience with any type of code and am still trying to find out which programs I have access to (although I think most of what I need can be downloaded from the internet for free).
So my question is: with my limited coding experience, would I be better off using Stacks for all of my analysis (cleaning data, separating individual samples, calling SNPs etc.), or using individual programs for the individual steps? If the latter, which programs would you recommend?

Thanks

Hazel

claudiuskerth

Apr 12, 2013, 9:19:23 AM

to NGS...@googlegroups.com

Hi Hazel,

Stacks should be the first tool you try out. It's a suite of programmes that can in principle do everything: data cleaning, splitting by barcode, calling SNPs, outputting mappable markers in JoinMap format, and assembling paired-end reads into contigs for BLASTing (if you did standard RAD à la Baird et al.).
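
To give an idea of what "splitting by barcode" means in practice, here is a minimal Python sketch. It is not how Stacks implements it; the function name, the barcodes and the sample names are all made up for illustration, and it assumes inline barcodes of equal length at the start of each read with exact matching only (no error correction).

```python
def split_by_barcode(reads, barcodes):
    """Assign each read to a sample by its leading inline barcode.

    reads:    iterable of (name, sequence, quality) tuples.
    barcodes: dict mapping barcode -> sample name (hypothetical values).
    Returns a dict mapping sample name -> list of reads with the barcode
    trimmed off. Reads without an exact barcode match are stored under None.
    """
    bins = {}
    for name, seq, qual in reads:
        sample = None
        for barcode, sample_name in barcodes.items():
            if seq.startswith(barcode):
                sample = sample_name
                # Trim the barcode from both the sequence and its qualities.
                seq = seq[len(barcode):]
                qual = qual[len(barcode):]
                break
        bins.setdefault(sample, []).append((name, seq, qual))
    return bins


# Toy data: two known barcodes, three reads (all values invented).
barcodes = {"ACGT": "ind1", "TTAG": "ind2"}
reads = [
    ("r1", "ACGTTGCAGGAA", "IIIIIIIIIIII"),
    ("r2", "TTAGTGCAGGCC", "IIIIIIIIIIII"),
    ("r3", "GGGGTGCAGGCC", "IIIIIIIIIIII"),
]
bins = split_by_barcode(reads, barcodes)
for sample, sample_reads in bins.items():
    print(sample, len(sample_reads))
```

Real data would also need tolerance for sequencing errors in the barcode and a check for the restriction site, which this sketch leaves out.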

Getting all pipeline components of Stacks to run can be a painful and frustrating process. I recommend that you install Stacks on a Linux distro that has a package management system and on which you have administrator rights. I can post an amended README file for Stacks that gives additional info on installing the dependencies on Ubuntu.

Then you could go through the Stacks tutorials with the example data and try to understand how Stacks works by reading:
Catchen, J. M.; Amores, A.; Hohenlohe, P.; Cresko, W. & Postlethwait, J. H.
Stacks: Building and Genotyping Loci De Novo From Short-Read Sequences
G3: Genes, Genomes, Genetics, 2011, 1, 171-182

I also have a Perl script that does some additional filtering of the output of 'export_sql.pl', one of the Stacks programmes, which exports genotypes from the MySQL database but does not offer enough filters.

The fact that I said Stacks has a somewhat inferior SNP calling algorithm compared to samtools and GATK shouldn't put you off using it. As long as it is good enough to answer your question, there is no problem with it. It is important, though, that you understand how SNPs are called by any programme you use and that you play around with the filters to get a reasonable compromise between sensitivity and specificity. For instance, in a small family from a genetic cross, genotype quality (i.e. specificity) is much more important than detecting as many SNPs as possible, many of which would be false positives. More SNP markers would not make your genetic map any finer. If you are doing a genome scan for divergence or a GWAS instead, you want as many SNPs as you can get, and you would incorporate SNP allele frequency and genotype uncertainty in the downstream analysis, for example with piMASS.
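
The sensitivity/specificity trade-off can be illustrated with a small Python sketch. The field names ("gq" for genotype quality, "depth" for read depth), the thresholds and the toy calls are assumptions for illustration only, not the output format of any particular programme.

```python
def filter_genotypes(calls, min_gq, min_depth):
    """Keep only genotype calls that pass both thresholds.

    calls: list of dicts with hypothetical keys "gq" (genotype quality)
           and "depth" (number of reads covering the site).
    """
    return [c for c in calls if c["gq"] >= min_gq and c["depth"] >= min_depth]


# Invented example calls.
calls = [
    {"locus": "loc1", "genotype": "A/G", "gq": 45, "depth": 20},
    {"locus": "loc2", "genotype": "C/C", "gq": 12, "depth": 4},
    {"locus": "loc3", "genotype": "T/C", "gq": 28, "depth": 9},
]

# Strict filters: fewer markers, but more reliable (e.g. for a genetic map).
strict = filter_genotypes(calls, min_gq=30, min_depth=10)
# Lenient filters: more markers, but more false positives (e.g. for a genome scan).
lenient = filter_genotypes(calls, min_gq=10, min_depth=3)
print(len(strict), len(lenient))
```

Trying a range of thresholds like this and watching how many markers survive is a simple way to get a feel for a caller's filters.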

One issue with Stacks' de novo assembly algorithm is that it doesn't allow for gaps, which become more prevalent as the read lengths achieved by Illumina etc. increase. Possible alternatives to Stacks include:

rtd from Brant Peterson:
Peterson, B. K.; Weber, J. N.; Kay, E. H.; Fisher, H. S. & Hoekstra, H. E.
Double Digest RADseq: An Inexpensive Method for De Novo SNP Discovery and Genotyping in Model and Non-Model Species
PLoS ONE, 2012, 7, e37135

Unfortunately, I couldn't get rtd to run on my Ubuntu compute server. Maybe you will have more luck (and some Python skills :-)), since it's a clever approach to de novo assembly.

Or custom de novo assembly with programmes like Velvet (with VelvetOptimiser) or IDBA-UD:
Peng, Y.; Leung, H. C. M.; Yiu, S. M. & Chin, F. Y. L.
IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth.
Bioinformatics, 2012, 28, 1420-1428

... followed by mapping the reads with Stampy against the reference sequence thus created, followed by SNP calling with samtools/bcftools/vcftools or GATK.

Another SNP caller worth looking into (I haven't done so yet) is ANGSD:
Nielsen, R.; Korneliussen, T.; Albrechtsen, A.; Li, Y. & Wang, J.
SNP calling, genotype calling, and sample allele frequency estimation from New-Generation Sequencing data.
PLoS ONE, 2012, 7, e37558

These SNP callers can incorporate base-call and mapping uncertainty into SNP calls, which are given a Bayesian probability score. One feature that should by now be available in samtools/bcftools, GATK and ANGSD (don't take my word for it, I am still trying to understand these programmes myself) is the use of allele frequencies estimated from all individuals in a population to inform the genotype call of a focal individual:
Li, H.
A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data.
Bioinformatics, 2011, 27, 2987-2993

Stacks can't do that as far as I am aware.
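
The general idea of letting the population allele frequency inform an individual's genotype call can be sketched in a few lines of Python. This is only an illustration under the assumption of Hardy-Weinberg proportions, not the actual implementation in samtools, GATK or ANGSD; the function name and the numbers are made up.

```python
def genotype_posterior(likelihoods, alt_freq):
    """Posterior genotype probabilities for one individual at one site.

    likelihoods: (L_RR, L_RA, L_AA), the probability of the individual's
                 reads given each genotype (R = reference, A = alternative).
    alt_freq:    population frequency of the alternative allele, used as
                 a Hardy-Weinberg prior (an assumption of this sketch).
    """
    p = alt_freq
    prior = ((1 - p) ** 2, 2 * p * (1 - p), p ** 2)
    unnormalised = [lik * pr for lik, pr in zip(likelihoods, prior)]
    total = sum(unnormalised)
    if total == 0:
        raise ValueError("likelihoods and prior are incompatible")
    return [u / total for u in unnormalised]


# Ambiguous reads: the heterozygote is only slightly favoured by the data.
likelihoods = (0.4, 0.5, 0.1)

# If the alternative allele is common, the heterozygote call stands.
common = genotype_posterior(likelihoods, alt_freq=0.5)
# If the alternative allele is rare, the homozygous reference call wins.
rare = genotype_posterior(likelihoods, alt_freq=0.05)

print([round(x, 3) for x in common])
print([round(x, 3) for x in rare])
```

With the same read data, the most probable genotype changes from heterozygote to homozygous reference once the allele is known to be rare in the population, which is the effect described above.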

These are the options for RAD data analysis that I am aware of so far.

claudius
