Error: different sequence lengths (Stacks v2.60)

274 views

Skip to first unread message

agathe leveque

unread,

Mar 3, 2022, 8:19:43 AM3/3/22

to Stacks

Dear Julian, Nicolas, and Stacks users,

I’m having trouble running denovo_map.pl on my ddRADseq samples (NovaSeq 6000 Sequencing), and I hope you could help me.

I have sequenced several individuals (split into 4 pools) marked with combinations of inline-inline barcodes (barcodes of different lengths 5 to 8bp on one side and 5 to 7bp on the other side).

I demultiplex my data using process_radtags without noticeable problems using:
mod="inline_inline"
for i in 1 2 3 4
do
nohup process_radtags -1 agrion${i}_R1.fastq.gz -2 agrion${i}_R2.fastq.gz -b barcode/2-barcode_stacks_agrion${i}.txt -o test_demult_NO_BACKUP/${mod} --renz_1 pstI --renz_2 mspI --${mod} -c -q -r > 2-barcode_agrion${i}_${mod}.trace 2>&1 && mv test_demult_NO_BACKUP/${mod}/process_radtags.log test_demult_NO_BACKUP/${mod}/2-agrion${i}-${mod}-process_radtags.log&
done

I then proceed to run denovo_map.p (keeping separate R1 and R2 files) and I get an error in the log file:

ustacks

==========

Sample 1 of 20 'Cam1-13-m20'

----------

/eep/softwares/miniconda/envs/stacks-2.60/bin/ustacks -t gzfastq -f /eep/data1/genomic/reads/agrion-ddRADseq/demultiplexage/inline_inline/Cam1-13-m20.1.fq.gz -o /home/agathe.leveque/working_dir/Agrion-ddRADseq/Results_data/denovo/denovo.test -i 1 --name Cam1-13-m20 -p 10 -M 1

ustacks parameters selected:

Input file: '/eep/data1/genomic/reads/agrion-ddRADseq/demultiplexage/inline_inline/Cam1-13-m20.1.fq.gz'

Sample ID: 1

Min depth of coverage to create a stack (m): 3

Repeat removal algorithm: enabled

Max distance allowed between stacks (M): 1

Max distance allowed to align secondary reads: 3

Max number of stacks allowed per de novo locus: 3

Deleveraging algorithm: disabled

Gapped assembly: enabled

Minimum alignment length: 0.8

Model type: SNP

Alpha significance level for model: 0.05

Loading RAD-Tags...1M...2M...3M...4M...5M...6M...7M...8M...9M...10M...11M...12M...13M...14M...15M...16M...17M...18M...

Error: different sequence lengths detected, this will interfere with Stacks algorithms, trim reads to uniform length (override this check with --force-diff-len).

denovo_map.pl: Aborted because the last command failed (1).

So, I first looked at the Fast QC files for the raw dataset (1per pool) and it showed that both the R1 & R2 read lengths had different lengths (ex : R1 :35-251 and 35-251)

It seems that after process_radtags, the R1 files get shortened to 120, and the R2 files this variability remain and maybe exacerbate with the difference in barcode size

I also check the length of my sequences before and after demultiplexing using: zcat file.fq.gz | sed -n '2~4p' | awk '{print length($0)}' | sort -n | uniq -c and it seems that these variations are found both before and after the demultiplexing.

I assume if I force the different lengths it will cause issues for my downstream analysis and SNP calling (which I have seen, you mention in other Stacks threads).

Do stacks allow a certain threshold of variability between read lengths? Should I trim my raw data (and lost information) to be able to run stacks? Or use the trimming parameter -t in process_radtags?

Many thanks and I apologize if this question has been asked before.

Best,

Agathe

paige...@ucsb.edu

unread,

Apr 7, 2022, 4:32:53 PM4/7/22

to Stacks

Hello Agathe,

I can't answer all of your questions as I'm no expert on stacks2, but I had the exact problem you are discussing with process_radtags earlier this year. I spent some time trying to fix the problem. I tried running using the -t option, but it actually did not lead to all the reads being the same length (I'm not sure what happened), some were still 1 or 2 bases different and when I tried to run the de novo pipeline with the process_radtags output using --force-diff-len, it would cause an error and not complete. I could not find any work around and did not get a timely answer on this group (stopped waiting after a week). So, I used BBTools and trimmed them all to the same length (after completing the process_radtags step), which meant just losing at most 5 bases per read on the 3' end.. That worked great for me. Just letting you know, in case it can save you some time!