de novo assembly of RADseq data with overrepresented sequences


DAVID VENDRAMI

Jun 11, 2021, 8:53:11 AM
to Stacks
Good afternoon Stacks users,

I am currently trying to run the denovo_map.pl script (using the latest version of Stacks) on a set of RADseq data generated for a marine gastropod. My problem is that the pipeline gets stuck at the ustacks step for days, without throwing any error message. Specifically, the ustacks run starts by assembling stacks in the first sample and then stalls: it keeps running (it does not abort or report an error), but it produces no output and never proceeds to the next sample. Here is what the log file says so far:

denovo_map.pl version 2.58 started at 2021-06-08 15:21:53
denovo_map.pl --samples /prj/mar-in-gen/fastq_f --popmap /prj/mar-in-gen/sam_map.txt -o /prj/mar-in-gen/St_out -X populations: --vcf -T 4 --paired --rm-pcr-duplicates

ustacks
==========

Sample 1 of 30 'Pr_ML_12'
----------
ustacks -t fastq -f /prj/mar-in-gen/fastq_f/Pr_ML_12.1.fq -o /prj/mar-in-gen/St_out -i 1 --name Pr_ML_12 -p 4

I believe the problem is caused by the large number of overrepresented sequences in the relevant fastq files (present in all samples; see the attached FastQC report for a representative sample). Initially I thought this was indicative of a failed library preparation, but I have reasons to believe that this is not the case:

1. The same overrepresented sequences are present in all samples in similar proportions.
2. The same pattern appears in fastq files derived from two independent sequencing runs of two independently prepared RADseq libraries (prepared by different laboratories, from different samples of the same species).
3. There are almost no overrepresented sequences in the corresponding paired reads (FastQC report attached). If the overrepresented sequences were caused by a failed library preparation in which only a small number of RAD loci made it into the library, I would expect to see similar amounts of overrepresented sequences in both the .1 and .2 fq files.

That said, I believe that the data are genuine and that I am dealing with a species whose genome is rich in repetitive elements.

My questions then are: what do you think about this? And how would you suggest improving ustacks performance when assembling stacks from input files containing highly repetitive reads? I have already tried many different combinations of -M and -n, but this does not help.

Many thanks in advance for your help,
All the best,
David


M_SCS_19_12_2_fastqc.html
M_SCS_19_12_1_fastqc.html
M_SCS_19_12_1_fastqc.zip
M_SCS_19_12_2_fastqc.zip

Julian Catchen

Jun 14, 2021, 5:42:07 PM
to stacks...@googlegroups.com, DAVID VENDRAMI
Hi David,

The first step is to run ustacks on its own on the first sample and see
how far the software is getting. The log tells you the command to run:

ustacks -t fastq -f /prj/mar-in-gen/fastq_f/Pr_ML_12.1.fq -o /prj/mar-in-gen/St_out -i 1 --name Pr_ML_12 -p 4

You could add more threads if you have them. What does the ustacks output tell you? What stage does it reach?

julian

DAVID VENDRAMI wrote on 6/11/21 7:53 AM:

DAVID VENDRAMI

Jun 16, 2021, 4:26:21 AM
to Julian Catchen, stacks...@googlegroups.com
Hi Julian,

Thank you very much for your answer. Following your suggestion, I have now run ustacks on its own on a single sample using 8 threads (-p 8). After approximately 24 hours, the log file looks like this:

ustacks parameters selected:
  Input file: '/prj/mar-in-gen/fastq_f/Pr_ML_12.1.fq'
  Sample ID: 1
  Min depth of coverage to create a stack (m): 3
  Repeat removal algorithm: enabled
  Max distance allowed between stacks (M): 2
  Max distance allowed to align secondary reads: 4
  Max number of stacks allowed per de novo locus: 3
  Deleveraging algorithm: disabled
  Gapped assembly: enabled
  Minimum alignment length: 0.8
  Model type: SNP
  Alpha significance level for model: 0.05

Loading RAD-Tags...1M...2M...3M...4M...

Loaded 4235683 reads; formed:
  457909 stacks representing 3750157 primary reads (88.5%)
  417043 secondary stacks representing 485526 secondary reads (11.5%)

Stack coverage: mean=8.19; stdev=18.77; max=4504; n_reads=3750157(88.5%)
Removing repetitive stacks: cov > 65 (mean+3*stdev)...
  Blacklisted 15061 stacks.
Coverage after repeat removal: mean=7.73; stdev=4.60; max=65; n_reads=3423732(80.8%)

Assembling stacks (max. dist. M=2)...
  Assembled 442848 stacks into 384880; blacklisted 1362 stacks.
Coverage after assembling stacks: mean=8.36; stdev=5.03; max=100; n_reads=3203751(75.6%)

Merging secondary stacks (max. dist. N=4 from consensus)...
  Merged 241291 out of 485526 secondary reads (49.7%), 15664 merged with gapped alignments.
Coverage after merging secondary stacks: mean=8.99; stdev=5.31; max=114; n_reads=3445042(81.3%)

Assembling stacks, allowing for gaps (min. match length 80.0%)...
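
As a side note, the repeat-removal cutoff reported above follows directly from the logged coverage statistics; a quick check (plain Python, not Stacks output):

```python
# Quick check: reproduce the ustacks repeat-removal cutoff
# "cov > 65 (mean+3*stdev)" from the coverage statistics in the log above.
mean_cov = 8.19
stdev_cov = 18.77

cutoff = mean_cov + 3 * stdev_cov
print(round(cutoff, 2))  # 64.5, which ustacks rounds up to the reported 65
```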

Then, some more news.
1) ustacks has now finished processing the first sample in my denovo_map.pl run (see previous email). It therefore took about a week to process one sample (using 4 threads).

2) I tried to normalize read coverage using the kmer_filter program while also removing reads with abundant kmers (--abundant, with default parameters). Different kmer sizes have, unsurprisingly, a strong impact on the number of reads that are retained. Specifying a kmer size of 30 (--k_len 30) while normalizing to a read depth of 15 (--normalize 15) lets me keep about 50% of the reads (which, according to the FastQC reports, still contain a substantial amount of overrepresented sequences). Using the resulting files as input speeds things up a lot: the entire pipeline ran on 2 test samples in less than 24 hours. Do you think this would be a reasonable alternative way of proceeding?
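
For reference, my kmer_filter invocation looked roughly like this (the paired-input options and paths are illustrative; please check `kmer_filter -h` for the exact option spellings in your Stacks version):

```shell
# Normalize coverage to ~15x using 30-mers and drop reads carrying
# abundant kmers. Paths and the paired-input options (-1/-2) are
# placeholders; verify against `kmer_filter -h` for your Stacks version.
kmer_filter \
    -1 /prj/mar-in-gen/fastq_f/Pr_ML_12.1.fq \
    -2 /prj/mar-in-gen/fastq_f/Pr_ML_12.2.fq \
    -o /prj/mar-in-gen/fastq_filt \
    --k_len 30 \
    --normalize 15 \
    --abundant
```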

Many thanks for your support,
Best wishes,
David 

Julian Catchen

Jun 21, 2021, 12:49:21 PM
to DAVID VENDRAMI, stacks...@googlegroups.com
Hi David,

Using the kmer filtering is akin to randomly sampling a subset of reads.
So, you are halving the amount of data that you are pushing into the
pipeline, and this will result in less robust SNP calls down the line
(since your ustacks output shows a mean read depth of ~9x, you would
push this down to 4-5x, too low to call most SNPs).

I would recommend increasing the number of threads, or better, running
the individual ustacks runs independently on a cluster, assuming you are
submitting jobs to a cluster job scheduler. You could also disable
gapped alignments, which will speed things up, but again, it will cause
you to miss assembling some alleles into their respective loci.
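
As a sketch, a per-sample array job under SLURM could look like the following. The sample list file, resource numbers, and the --disable-gapped spelling are assumptions to adapt to your setup; check `ustacks -h` and your scheduler's documentation:

```shell
#!/bin/bash
#SBATCH --array=1-30
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
# Hypothetical SLURM array job: one independent ustacks run per sample.
# Assumes samples.txt lists one sample name per line (30 samples, matching
# the popmap). -i must be a unique integer per sample, so the array task
# ID is reused for it.

sample=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)

ustacks -t fastq \
    -f /prj/mar-in-gen/fastq_f/${sample}.1.fq \
    -o /prj/mar-in-gen/St_out \
    -i "${SLURM_ARRAY_TASK_ID}" \
    --name "${sample}" \
    -p "${SLURM_CPUS_PER_TASK}"
# Optionally add --disable-gapped (check the spelling in your ustacks
# version) to skip gapped assembly, at the cost of missing some alleles.
```

Once all samples finish, the remaining pipeline components (cstacks, sstacks, tsv2bam, gstacks, populations) can be run by hand, as described in the Stacks manual.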

Best,

julian

DAVID VENDRAMI wrote on 6/16/21 3:26 AM: