Mapping vs De novo

Amanda Charbonneau

Apr 16, 2017, 9:36:54 PM
to Stacks
Sorry if this has already been discussed, I tried searching the questions but didn't find anything...

I am getting far fewer markers than I expected/hoped for, and I'm trying to narrow down whether this is inherent to the reads, the genome, or something I'm doing wrong. Unfortunately, everything for my organism is a little sketchy, so I'm having difficulty assigning blame.

I have sequence for several hundred individuals that are related to each other by varying degrees. For this discussion, I'm only using the 80 individuals from the 5 populations that are most closely related. Reads are all high quality.

I mapped them using Bowtie2, using the defaults for an end-to-end, sensitive alignment, then threw out any alignment with a phred-scaled mapping quality <30. I'm mapping to the best available genome, which was made from one of my 5 populations, but the genome is a hot mess: 64,732 contigs, N50 = 10,186, and an assembled length of 253 MB (actual genome size ~500 MB). I only get ~50% mapping, which seems reasonable, as I only have ~50% of the genome.
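Concretely, the mapping-and-filtering step was something like this (index name and exact paths are illustrative, not my literal script; Bowtie2's defaults already give an end-to-end, sensitive alignment):

    bowtie2-build genome.fa genome_idx
    bowtie2 --end-to-end --sensitive -x genome_idx -U 14796/QTL_F2_13_E01.fq -S 14796/QTL_F2_13_E01.fq.sam
    # keep only alignments with phred-scaled mapping quality >= 30
    samtools view -h -q 30 14796/QTL_F2_13_E01.fq.sam > 14796/QTL_F2_13_E01.fq_q30.sam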

Anyway, I ran Stacks without the wrapper, using all default settings for pstacks, cstacks, and sstacks, then ran populations with -p 5 (locus must be in all 5 pops), -r 0.5 (locus must be in >50% of individuals per pop), and --write_random_snp. This pipeline gives me 1542 markers, which is too few for my intended analysis. (A sketch of the calls is below.)
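In sketch form (batch ID, directories, and the pop map file name are placeholders):

    # one pstacks run per aligned individual, each with a unique -i
    pstacks -t sam -f 14796/QTL_F2_13_E01.fq_q30.sam -o stacks_ref -i 1 -m 3
    # build the catalog (one -s per individual), then match samples back to it
    cstacks -b 1 -o stacks_ref -s stacks_ref/QTL_F2_13_E01.fq_q30 -p 8
    sstacks -b 1 -c stacks_ref/batch_1 -s stacks_ref/QTL_F2_13_E01.fq_q30 -o stacks_ref
    # locus in all 5 pops, in at least half the individuals per pop, one random SNP per locus
    populations -b 1 -P stacks_ref -M popmap.tsv -p 5 -r 0.5 --write_random_snp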

I guessed that the ~50% mapping rate was the problem, so I ran the pipeline again with USTACKS rather than mapping. To keep it as consistent as possible, I set the USTACKS minimum depth of coverage required to create a stack to 3 (the same as the default for PSTACKS) and used all the same parameters as the mapped pipeline. This gives 1521 markers.
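That is, the de novo run swaps the pstacks step for something like (again illustrative; -M and -N left at their defaults):

    ustacks -t fastq -f 14796/QTL_F2_13_E01.fq -o stacks_denovo -i 1 -m 3 -p 8
    # cstacks/sstacks/populations then run exactly as above, pointed at stacks_denovo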

It seems weird to me that I get slightly fewer markers doing a de novo analysis than I do when mapping to a terrible genome. Is there a logical reason for this? Or am I just doing something really wrong?

Any suggestions are appreciated.

-amanda





Julian Catchen

Apr 17, 2017, 1:01:28 PM
to stacks...@googlegroups.com, fan...@gmail.com
Hi Amanda,

The first thing to focus on would be your depth of coverage for your
assembled stacks in both the ustacks and pstacks data sets. These
programs will both output a final depth of coverage after executing. Are
your depths <10x or are they >20x? This will make a big difference in
your results. How many loci are the two programs assembling? If your
hypothesis is correct about the reference being in bad shape, you should
have many fewer loci being built by pstacks than by ustacks. You won't
be able to tell the difference if your depth of coverage is far too low.
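(If the runs were logged, e.g. via a shell redirect, you can pull those lines straight out; the .oe log names here are just examples:

    grep -H "coverage depth" stacks_denovo/*.oe stacks_ref/*.oe

ustacks prints both an initial and a merged coverage figure, so expect two lines per sample.)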

Once you have established that the amount of sequence you are getting is
good enough, you should work on selecting the best assembly parameters.
For example, I would not filter at phred score 30 if you stick with the
aligned reads; I believe that is far too high.

I would recommend you look at our recent paper that discusses many of
these issues and how to select the right assembly parameters:

http://onlinelibrary.wiley.com/doi/10.1111/2041-210X.12775/abstract

Best,

julian

Amanda Charbonneau

Apr 17, 2017, 2:08:20 PM
to Stacks, fan...@gmail.com, jcat...@illinois.edu
I had originally thought that the problem might be depth-related, since we only have ~3 million reads per individual for a 500 MB genome; however, the people who paid for the sequencing didn't like that answer, so I've been trying to triple-check that it's not something fixable.

I do have those numbers though. For a randomly chosen individual:

USTACKs pipeline: 19684 loci. Minimum depth: 1, Maximum depth: 773, Median: 9.0, Mean: 51.3
PSTACKs pipeline: 6420 loci. Minimum depth: 1, Maximum depth: 3074, Median: 19.0, Mean: 72.6

Spot checking suggests that this individual is pretty typical. The difference in the number of loci makes sense, and the depth is generally low, so it seems reasonable that we just don't have enough coverage. But I have been having trouble coming up with a reason the depth is so variable. I had thought it might be due to multimapping: this plant has been through 2-3 whole-genome duplication events, so mapping gets weird. That's why I was using such a high phred score cutoff.

I can certainly lower my mapping cutoff, and I'll go through that paper.

Thanks!

-amanda

Julian Catchen

Apr 17, 2017, 2:22:06 PM
to stacks...@googlegroups.com, fan...@gmail.com
Hi Amanda,

Where did you get the depth numbers you list in your message? ustacks
will report the depth after it excludes very high coverage stacks, which
should give us a more useful mean. A mean depth of >50x is more than
sufficient, but the median suggests you have many loci with very small
depths. Since you used -m 3, those loci should not show up in your depth
calculations, but your stats list a minimum depth of 1, which suggests
you are not posting the numbers from ustacks/pstacks?

You could pick a 'typical' sample and post the full output from ustacks
and pstacks.

julian

--
Julian M Catchen, Ph.D.
Assistant Professor
Department of Animal Biology
University of Illinois, Urbana-Champaign
--
jcat...@illinois.edu; @jcatchen

Amanda Charbonneau

Apr 17, 2017, 2:35:48 PM
to Stacks, jcat...@illinois.edu
I pulled those from a matches.tsv file (from sstacks)... I've never seen pstacks/ustacks report those numbers. I'm running through a queue system on an HPC, though, so if they normally get dumped to the terminal, they might be getting lost. I'll see if I can find them.
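For the record, the stats above came from something like this (assuming the v1.x matches.tsv layout, with the sample locus ID in column 5 and the stack depth in column 7; worth verifying against your own files):

    # sum haplotype depths per locus, then summarize the per-locus depths
    awk -F'\t' '{ depth[$5] += $7 } END { for (l in depth) print depth[l] }' \
        stacks_denovo/QTL_F2_13_E01*.matches.tsv | sort -n |
    awk '{ d[NR] = $1; sum += $1 }
         END { printf "n=%d min=%d max=%d median=%.1f mean=%.1f\n",
               NR, d[1], d[NR], (NR%2 ? d[(NR+1)/2] : (d[NR/2]+d[NR/2+1])/2), sum/NR }'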

It is weird that my minimum is 1 with -m 3, though. That hadn't occurred to me. My submission script definitely has the -m 3.

-amanda

Amanda Charbonneau

Apr 17, 2017, 2:52:38 PM
to Stacks, jcat...@illinois.edu

Found it! Here's the output from one sample, both pipelines.

USTACKS:

Min depth of coverage to create a stack: 3
Max distance allowed between stacks: 2
Max distance allowed to align secondary reads: 4
Max number of stacks allowed per de novo locus: 3
Deleveraging algorithm: disabled
Removal algorithm: disabled
Model type: SNP
Alpha significance level for model: 0.05
Parsing 14796/QTL_F2_13_E01.fq
Loaded 2006410 RAD-Tags; inserted 314495 elements into the RAD-Tags hash map.
0 reads contained uncalled nucleotides that were modified.
Mean coverage depth is 53; Std Dev: 133.814 Max: 5078
Coverage mean: 53; stdev: 133.814
Deleveraging trigger: 187; Removal trigger: 321
Calculating distance between stacks…
Distance allowed between stacks: 2
Using a k-mer length of 29
Number of kmers per sequence: 63
Minimum number of k-mers to define a match: 5
Merging stacks, maximum allowed distance: 2 nucleotide(s)
32146 stacks merged into 19153 stacks; deleveraged 0 stacks; removed 1434 stacks.
Mean merged coverage depth is 88.4577; Std Dev: 215.6; Max: 9975
Merging remainder radtags
312179 remainder sequences left to merge.
Distance allowed between stacks: 4
Using a k-mer length of 17
Number of kmers per sequence: 75
Minimum number of k-mers to define a match: 7
Matched 134481 remainder reads; unable to match 177698 remainder reads.
Number of utilized reads: 1828712
Writing loci, SNPs, and alleles to ‘Rrr_20170416_5/.5/‘…

PSTACKS:

Min depth of coverage to report a stack: 3
Model type: SNP
Alpha significance level for model: 0.05
Parsing 14796/QTL_F2_13_E01.fq_q30.sam
Analyzed 471427 sequence reads; Identified 57838 unique stacks from those reads.
Merged 57838 unique Stacks into 9421 loci.
Identifying polymorphic sites and calling consensus sequences…done.
Number of utilized reads 471427
Mean coverage depth is 84.6716; Std Dev: 141.634; Max: 3074
Writing loci, SNPs, alleles to ‘Rrr_20170402_5/.5/…’

Julian Catchen

Apr 17, 2017, 3:34:13 PM
to stacks...@googlegroups.com, fan...@gmail.com
Hi Amanda,

You can get more accurate depth-of-coverage numbers if you upgrade
Stacks to the newest version (what version is this?). Also, there are
new and useful options in pstacks for inputting data from aligners such
as Bowtie2.

From the ustacks output, I can see that you start with 2M reads, which
break down to 314,495 stacks of exactly matching reads, with an initial
coverage of 53x. Then, stacks that have very high depth are removed, and
we can't see how many were removed (the newer version will tell you this
explicitly). What we can see is that during merging your data goes from
32K stacks (putative alleles) down to 19K merged loci with >80x
coverage. So, a lot of loci were removed.

I can't tell from this version's output whether that was because 1) you
had lots of 1- or 2-read loci, which could have been spurious reads or
not enough sequencing, 2) your genome is very repetitive and many
high-depth loci were dropped for that reason, or 3) some of both. You
can figure out #1 by playing with the -m flag to ustacks (a sketch is
below), and #2 we can learn more about from the newer version's output.
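For the -m sweep, something along these lines would do (directory names and the log redirect are just one way to set it up):

    for m in 2 3 4 5 6; do
        mkdir -p stacks_m${m}
        ustacks -t fastq -f 14796/QTL_F2_13_E01.fq -o stacks_m${m} -i 1 -m ${m} -p 8 &> stacks_m${m}/ustacks.oe
        grep "merged coverage depth" stacks_m${m}/ustacks.oe
    done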

I would look into optimizing your assembly parameters and stick with
the de novo data set. Once you have the assembly working well, you can
align the consensus reads from the de novo stacks to your reference to
get positional data for them (this is described in the paper I
mentioned).
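That last step is roughly (column positions assume the v1.x catalog tags.tsv layout, with record type in column 7, locus ID in column 3, and sequence in column 10; check your own file before relying on this):

    # pull the consensus sequence of each catalog locus into a FASTA file
    awk -F'\t' '$7 == "consensus" { print ">locus_" $3 "\n" $10 }' \
        stacks_denovo/batch_1.catalog.tags.tsv > catalog_consensus.fa
    # map the consensus loci back to the reference (-f = FASTA input)
    bowtie2 --end-to-end --sensitive -f -x genome_idx -U catalog_consensus.fa -S catalog_consensus.sam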

julian


Amanda Charbonneau

Apr 17, 2017, 3:44:25 PM
to Stacks, jcat...@illinois.edu
I'm using stacks/1.35 ...it's the newest version on my institution's HPC. I can try cajoling them into upgrading, but I'm not super optimistic. I should probably just woman-up and learn to write Linuxbrew recipes.

-amanda

Ivania Ceron Souza

Apr 19, 2017, 10:01:38 AM
to Stacks, fan...@gmail.com, jcat...@illinois.edu
Hi Julian!
I have been following this discussion, and I just read the paper you suggested (Paris et al. 2017). So I have two questions:
1) What is the name of the shell script you developed to extract the information from the ustacks log files to construct the graphics for each parameter iteration? And is there an example of how to run this script, and also of how to execute the Python script for the catalog, "count_fixed_catalog_snps.py"?
2) Since which version do the ustacks log files include the information about how many stacks were removed? Is it ver. 1.45?

Thanks!

Ivania

tarid purisotayo

Mar 8, 2018, 4:34:06 PM
to Stacks
Hi Ivania,

This article should give you an answer: https://www.nature.com/articles/nprot.2017.123

An example of using the shell redirection operator was provided in step 8.

The suffixes 'o' and 'e' stand for output and error:

"&> process_radtags.lane1.oe"


Tarid

