process_radtags output files where the sequence and quality length do not match

1,572 views
Skip to first unread message

Carol

unread,
Jun 14, 2017, 4:40:30 PM6/14/17
to Stacks
Dear stacks users,

I really need urgent help with my process_radtag output, this was the command I ran

Stacks v 1.45
process_radtags -P -1 ./lib1_R1_adaprem.1.fq.gz -2 ./lib1_R2_adaprem.2.fq.gz -e pstI -b /home/carol/RADseq/RAD_Big-dataset/process_RADtags_txt-files/barcodes_library1 -r --barcode_dist_1 1 -E phred33 -o . --retain_header -D &>> RAD_Big-dataset-adaprem-process-RADtag.log

Everything seemed very good until I tried to run trimmomatic on my process_radtag output to remove low quality bases and crop my reads to the same length, then I got the following warning: 

Exception in thread "Thread-0" java.lang.RuntimeException: Sequence and quality length don't match: 'TGCAG' vs 'FFFF'


I verify myself some of the sequences and indeed few had different number of characters for its sequence and its quality score.

This, of course, affects greatly my downstream analysis because it leads to a loss of almost 70% of the reads 


has someone experience a similar problem. If so how have you deal with it?


Please help me!


Carol

Julian Catchen

unread,
Jun 14, 2017, 5:08:34 PM6/14/17
to stacks...@googlegroups.com, carol.b...@gmail.com
Hi Carol,

This usually happens when you have listed the same barcode for more than
one sample in the barcodes file. This causes two or more sets of data to
be written to the same file, giving weird overlaps between different
records.

Also, why are you running trimmomatic after process_radtags? If you plan
to trim the reads, process_radtags can do that for you directly.

Best,

julian

Carol Buitrago

unread,
Jun 15, 2017, 3:55:33 AM6/15/17
to Julian Catchen, stacks...@googlegroups.com
Hi Julian,

Thanks a lot for your quick reply. 

The thing is that I first demultiplexed my reads with the process_radtags module, then I did clone_filter to remove pcr duplicates and at last, I used the trimmomatic to crop the reads and remove low quality bases. So I don't do the cropping nor quality trimming with process_radtags because I don't want to modify my reads before removing the PCR duplicates.

I noticed that after trimmomatic I was loosing almost 70% of my sequences, so I checked the trimmomatic log file and noticed the problem about "sequence and quality length do not match". Then, to check if the problem was from the clone_filter, I assessed if trimmomatic would work normally on the output of the process_radtags, but it didn't. On the other hand if I ran trimmomatic on my raw data (before demultiplexing) it doesn't pop up that error. So the problem comes after process_radtags

Yesterday I also notice that I had forgotten the flag -i gzfastq so I included it, hopeful that might solve the problem. This was the code I used:

process_radtags -P i gzfastq -1 ./lib1_R1_adaprem.1.fq.gz -2 ./lib1_R2_adaprem.2.fq.gz -e pstI -b /home/carol/RADseq/RAD_Big-dataset/process_RADtags_txt-files/barcodes_library1 -r --barcode_dist_1 1 -E phred33 -o . --retain_header -D &>> Lib1-process-RADtag.lo


Unfortunately, I kept on having the same problem when afterward I used trimmomatic: "sequence and quality length do not match : 'TGCAG' vs 'FFFF'"

I re-checked my barcode file and I don't have any case where the same barcode is used for two or more samples. Indeed, my barcodes have hamming distances of at least 3 nucleotides. 

I'm really overwhelmed cause I don't know how to solve this problem.

Any suggestion is welcome,

Carol  

Carol Buitrago

unread,
Jun 19, 2017, 2:33:37 PM6/19/17
to Julian Catchen, stacks...@googlegroups.com
Hi Julian and stacks users,

After a million of trials of using process_radtags with the resulting output of my previous illumina-adapter-sequences trimming (with trimmomatic) I found out that:

1. if the resulting sequence length distribution of my fastq file (after the illumina-adapter-sequences trimming) is too wide (spread), process_radtags looses control and for some reason (still unknown to me) the sequence and quality score do not correspond in length.

2. However, if while trimming the adapter sequences you also limit the length of the remaining sequences e.g.
trimmomatic-0.36.jar PE -threads 30 -phred33 -trimlog trimlog-lib1.txt ./lib2_R1.fastq.gz ./lib2_R2.fastq.gz -baseout lib2_qtrim.fq.gz ILLUMINACLIP:/home/carol/RADseq/tools/Truseq2_adapters_trimmomatic/TruSeq2-PE.fa:2:30:10 MINLEN:110
then, the resulting fastq file has not a widely spread sequences length distribution and, therefore, process_radtag can proceed without generating weird mismatches between sequence and quality score length.

Hope this can help anyone else facing the same problem!

Cheers,

Carol 

2017-06-15 20:14 GMT+03:00 Carol Buitrago <carol.b...@gmail.com>:
The distribution of my raw file was

218581358 126


2017-06-15 19:35 GMT+03:00 Carol Buitrago <carol.b...@gmail.com>:
Hi Julian,

Ok. that is the distribution of the reads length after removing adapter sequences with bbduk. 
I'm running right now the command for the distribution of the real raw sequences (it takes few minutes :S )

Carol  

2017-06-15 19:30 GMT+03:00 Carol Buitrago <carol.b...@gmail.com>:
inspecting manually one of my file I found for instance this 

@D00658:7:C9N98ANXX:1:2116:5231:75289 1:N:0:1

TGCAGTGTCGCTCAGTGACATTACGCCCACGGCTCCTCACAGGGGGTTGGAGAGTAAACGACAGGGGAATGTGCTGTTAAAGAACTTAACCAGCATTCGTTCACCCGTGACGGTTCGCCA

+

FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFB^Z


Where for instance the quality score has two extra characters.

Attached you can see the correspondent files for this sample.​​​

2017-06-15 19:03 GMT+03:00 Carol Buitrago <carol.b...@gmail.com>:
Hi Julian,

sorry for my delay, it took quite long time for my computer to process the command

attached the length distribution of R1 and R2 raw reads

2017-06-15 18:36 GMT+03:00 Julian Catchen <jcat...@illinois.edu>:
Hi Carol,

What are the distribution of read lengths in your three raw, unprocessed files?

This command will tell you: (all one line)

zcat ./path/to/raw/file.fq.gz | sed -n '2~4p' | awk '{print length}' | sort -n | uniq -c

julian


Carol Buitrago wrote:
Hi Julian

Thanks again for your quick reply. Attached you can find the list of all
my barcodes

To find out why is this happening I performed two tests with different
process_radtag parameters for only three of my samples

test1
process_radtags -i gzfastq -P -1 ./lib1_R1_adaprem.1.fq.gz -2
./lib1_R2_adaprem.2.fq.gz -o . -b
/home/carol/RADseq/RAD_Big-dataset/process_RADtags_txt-files/barcode_test --barcode_dist_1
1 -E phred33 --retain_header -D -e pstI -r -c -q &>>
test1-notruncate-process-RADtag.log

carol@symbiomics[Lane1] cat test1-notruncate-process-RADtag.log

Processing paired-end data.

Using Phred+33 encoding for quality scores.

Found 1 paired input file(s).

Searching for single-end, inlined barcodes.

Loaded 3 barcodes (6bp).

Will attempt to recover barcodes with at most 1 mismatches.

Processing file 1 of 1 [lib1_R1_adaprem.1.fq.gz]

Reading data from:

./lib1_R1_adaprem.1.fq.gz and

./lib1_R2_adaprem.2.fq.gz

Processing
RAD-Tags...1M...2M...3M...4M...5M...6M...7M...8M...9M...10M...11M...12M...13M...14M...15M...16M...17M...18M...19M...20M...21M...22M...23M...24M...25M...26M...27M...28M...29M...30M...31M...32M...33M...34M...35M...36M...37M...38M...39M...40M...41M...42M...43M...44M...45M...46M...47M...48M...49M...50M...51M...52M...53M...54M...55M...56M...57M...58M...59M...60M...61M...62M...63M...64M...65M...66M...67M...68M...69M...70M...71M...72M...73M...74M...75M...76M...77M...78M...79M...80M...81M...82M...83M...84M...85M...86M...87M...88M...89M...90M...91M...92M...93M...94M...95M...96M...97M...98M...99M...100M...101M...102M...103M...104M...105M...106M...107M...108M...109M...110M...111M...112M...113M...114M...115M...116M...117M...118M...119M...120M...121M...122M...123M...124M...125M...126M...127M...128M...129M...130M...131M...132M...133M...134M...135M...136M...137M...138M...139M...140M...141M...142M...143M...144M...145M...146M...147M...148M...149M...150M...151M...152M...153M...154M...155M...
156M...157M...158M...159M...160M...161M...162M...163M...164M...165M...166M...167M...168M...169M...170M...171M...172M...173M...174M...175M...176M...177M...178M...179M...180M...181M...182M...183M...184M...185M...186M...187M...188M...189M...190M...191M...192M...193M...194M...195M...196M...197M...198M...199M...200M...201M...202M...203M...204M...205M...206M...207M...208M...209M...210M...211M...212M...213M...214M...215M...216M...217M...

434211680 total reads; -360317814 ambiguous barcodes; -885218 ambiguous
RAD-Tags; +1699831 recovered; -122111 low quality reads; 72886537
retained reads.

Closing files, flushing buffers...

Outputing details to log: './process_radtags.Lane1.log'

434211680 total sequences

360317814 ambiguous barcode drops (83.0%)

122111 low quality read drops (0.0%)

885218 ambiguous RAD-Tag drops (0.2%)

72886537 retained reads (16.8%)



test2
process_radtags -i gzfastq -P -1 ./lib1_R1_adaprem.1.fq.gz -2
./lib1_R2_adaprem.2.fq.gz -o ./test2 -b
/home/carol/RADseq/RAD_Big-dataset/process_RADtags_txt-files/barcode_test --barcode_dist_1
1 -E phred33 --retain_header -D -e pstI -r -c -q -t 100 &>>
test2-truncate-process-RADtag.log

carol@symbiomics[Lane1] cat test2-truncate-process-RADtag.log *[ 6:20PM]*


Processing paired-end data.

Using Phred+33 encoding for quality scores.

Reads will be truncated to 100bp

Found 1 paired input file(s).

Searching for single-end, inlined barcodes.

Loaded 3 barcodes (6bp).

Will attempt to recover barcodes with at most 1 mismatches.

Processing file 1 of 1 [lib1_R1_adaprem.1.fq.gz]

Reading data from:

./lib1_R1_adaprem.1.fq.gz and

./lib1_R2_adaprem.2.fq.gz

Processing
RAD-Tags...1M...2M...3M...4M...5M...6M...7M...8M...9M...10M...11M...12M...13M...14M...15M...16M...17M...18M...19M...20M...21M...22M...23M...24M...25M...26M...27M...28M...29M...30M...31M...32M...33M...34M...35M...36M...37M...38M...39M...40M...41M...42M...43M...44M...45M...46M...47M...48M...49M...50M...51M...52M...53M...54M...55M...56M...57M...58M...59M...60M...61M...62M...63M...64M...65M...66M...67M...68M...69M...70M...71M...72M...73M...74M...75M...76M...77M...78M...79M...80M...81M...82M...83M...84M...85M...86M...87M...88M...89M...90M...91M...92M...93M...94M...95M...96M...97M...98M...99M...100M...101M...102M...103M...104M...105M...106M...107M...108M...109M...110M...111M...112M...113M...114M...115M...116M...117M...118M...119M...120M...121M...122M...123M...124M...125M...126M...127M...128M...129M...130M...131M...132M...133M...134M...135M...136M...137M...138M...139M...140M...141M...142M...143M...144M...145M...146M...147M...148M...149M...150M...151M...152M...153M...154M...155M...
156M...157M...158M...159M...160M...161M...162M...163M...164M...165M...166M...167M...168M...169M...170M...171M...172M...173M...174M...175M...176M...177M...178M...179M...180M...181M...182M...183M...184M...185M...186M...187M...188M...189M...190M...191M...192M...193M...194M...195M...196M...197M...198M...199M...200M...201M...202M...203M...204M...205M...206M...207M...208M...209M...210M...211M...212M...213M...214M...215M...216M...217M...

434211680 total reads; -360317814 ambiguous barcodes; -880835 ambiguous
RAD-Tags; +1695662 recovered; -306420 low quality reads; 72706611
retained reads.

Closing files, flushing buffers...

Outputing details to log: './test2/process_radtags.Lane1.log'


434211680 total sequences

360317814 ambiguous barcode drops (83.0%)

306420 low quality read drops (0.1%)

880835 ambiguous RAD-Tag drops (0.2%)

72706611 retained reads (16.7%)


Interestingly enough when I try to run trimmomatic in any of the three
samples from test1 I get the error :"Exception in thread "Thread-0"

java.lang.RuntimeException: Sequence and quality length don't match:
'TGCAG' vs 'FFFF'"
but when I run it for the samples resulting from my test2 then
trimmomatic works perfectly well (I know that if I truncate the reads
with process_radtags, I wont have to later use trimmomatic).

However, if I later use the Clone_filter on the resulting samples of my
demultiplexing I loose more reads in test2 than in test1

Clone_filter results
test1
        pairs of reads input    pairs of reads output   total paired-end reads
discarded pairs of reads        percentage %
SMAQ-R1-1       8004937 5847902 11695804        2157035 26.95
SMAQ-R1-2       11970412        8931507 17863014        3038905 25.39
SMAQ-R1-3       15982942        11675276        23350552        4307666 26.95


Clone_filter results
test2
        pairs of reads input    pairs of reads output   total paired-end reads
discarded pairs of reads        percentage %
SMAQ-R1-1       7962673 5676830 11353660        2285843 28.71
SMAQ-R1-2       11949218        8697902 17395804        3251316 27.21
SMAQ-R1-3       15951253        11358116        22716232        4593137 28.79


What can I do ??

Thanks for your help



2017-06-15 17:36 GMT+03:00 Julian Catchen <jcat...@illinois.edu
<mailto:jcat...@illinois.edu>>:


    Hi Carol,

    Can you send me your barcodes file off-list?

    julian
        <mailto:jcat...@illinois.edu>
        <mailto:jcat...@illinois.edu <mailto:jcat...@illinois.edu>>>:
    --
    Julian Catchen, Ph.D.
    Assistant Professor
    Department of Animal Biology
    University of Illinois, Urbana-Champaign
    --
    jcat...@illinois.edu <mailto:jcat...@illinois.edu>; @jcatchen



--
Julian M Catchen, Ph.D.

Assistant Professor
Department of Animal Biology
University of Illinois, Urbana-Champaign
--
jcat...@illinois.edu; @jcatchen






Reply all
Reply to author
Forward
0 new messages