Dear Julian, Nicolas, and Stacks users,
I’m having trouble running denovo_map.pl on my ddRADseq
samples (NovaSeq 6000 Sequencing), and I hope you could help me.
I have sequenced several individuals (split into 4 pools) marked with combinations of inline-inline barcodes (barcodes of different lengths 5 to 8bp on one side and 5 to 7bp on the other side).
I demultiplex my data using process_radtags without noticeable problems using:
mod="inline_inline"
for i in 1 2 3 4
do
nohup process_radtags -1 agrion${i}_R1.fastq.gz -2 agrion${i}_R2.fastq.gz -b barcode/2-barcode_stacks_agrion${i}.txt -o test_demult_NO_BACKUP/${mod} --renz_1 pstI --renz_2 mspI --${mod} -c -q -r > 2-barcode_agrion${i}_${mod}.trace 2>&1 && mv test_demult_NO_BACKUP/${mod}/process_radtags.log test_demult_NO_BACKUP/${mod}/2-agrion${i}-${mod}-process_radtags.log&
done
I then proceed to run denovo_map.p (keeping separate R1 and R2 files) and I get an error in the log file:
ustacks
==========
Sample 1 of 20 'Cam1-13-m20'
----------
/eep/softwares/miniconda/envs/stacks-2.60/bin/ustacks -t gzfastq -f /eep/data1/genomic/reads/agrion-ddRADseq/demultiplexage/inline_inline/Cam1-13-m20.1.fq.gz -o /home/agathe.leveque/working_dir/Agrion-ddRADseq/Results_data/denovo/denovo.test -i 1 --name Cam1-13-m20 -p 10 -M 1
ustacks parameters selected:
Input file: '/eep/data1/genomic/reads/agrion-ddRADseq/demultiplexage/inline_inline/Cam1-13-m20.1.fq.gz'
Sample ID: 1
Min depth of coverage to create a stack (m): 3
Repeat removal algorithm: enabled
Max distance allowed between stacks (M): 1
Max distance allowed to align secondary reads: 3
Max number of stacks allowed per de novo locus: 3
Deleveraging algorithm: disabled
Gapped assembly: enabled
Minimum alignment length: 0.8
Model type: SNP
Alpha significance level for model: 0.05
Loading RAD-Tags...1M...2M...3M...4M...5M...6M...7M...8M...9M...10M...11M...12M...13M...14M...15M...16M...17M...18M...
Error: different sequence lengths detected, this will interfere with Stacks algorithms, trim reads to uniform length (override this check with --force-diff-len).
denovo_map.pl: Aborted because the last command failed (1).
So, I first looked at the Fast QC files for the raw dataset (1per pool) and it showed that both the R1 & R2 read lengths had different lengths (ex : R1 :35-251 and 35-251)
It seems that after process_radtags, the R1 files get shortened to 120, and the R2 files this variability remain and maybe exacerbate with the difference in barcode size
I also check the length of my sequences before and after demultiplexing using: zcat file.fq.gz | sed -n '2~4p' | awk '{print length($0)}' | sort -n | uniq -c and it seems that these variations are found both before and after the demultiplexing.
I assume if I force the different lengths it will cause issues for my downstream analysis and SNP calling (which I have seen, you mention in other Stacks threads).
Do stacks allow a certain threshold of variability between read lengths? Should I trim my raw data (and lost information) to be able to run stacks? Or use the trimming parameter -t in process_radtags?
Many thanks and I apologize if this question has been asked before.
Best,
Agathe