Exception in thread "Thread-0" java.lang.RuntimeException: Sequence and quality length don't match: 'TGCAG' vs 'FFFF'
I verify myself some of the sequences and indeed few had different number of characters for its sequence and its quality score.
This, of course, affects greatly my downstream analysis because it leads to a loss of almost 70% of the reads
has someone experience a similar problem. If so how have you deal with it?
Please help me!
Carol
process_radtags -P i gzfastq -1 ./lib1_R1_adaprem.1.fq.gz -2 ./lib1_R2_adaprem.2.fq.gz -e pstI -b /home/carol/RADseq/RAD_Big-dataset/process_RADtags_txt-files/barcodes_library1 -r --barcode_dist_1 1 -E phred33 -o . --retain_header -D &>> Lib1-process-RADtag.lo
The distribution of my raw file was218581358 126
2017-06-15 19:35 GMT+03:00 Carol Buitrago <carol.b...@gmail.com>:Hi Julian,Ok. that is the distribution of the reads length after removing adapter sequences with bbduk.I'm running right now the command for the distribution of the real raw sequences (it takes few minutes :S )Carol2017-06-15 19:30 GMT+03:00 Carol Buitrago <carol.b...@gmail.com>:inspecting manually one of my file I found for instance this@D00658:7:C9N98ANXX:1:2116:5231:75289 1:N:0:1
TGCAGTGTCGCTCAGTGACATTACGCCCACGGCTCCTCACAGGGGGTTGGAGAGTAAACGACAGGGGAATGTGCTGTTAAAGAACTTAACCAGCATTCGTTCACCCGTGACGGTTCGCCA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFB^Z
Where for instance the quality score has two extra characters.Attached you can see the correspondent files for this sample.2017-06-15 19:03 GMT+03:00 Carol Buitrago <carol.b...@gmail.com>:Hi Julian,sorry for my delay, it took quite long time for my computer to process the commandattached the length distribution of R1 and R2 raw reads
2017-06-15 18:36 GMT+03:00 Julian Catchen <jcat...@illinois.edu>:
Hi Carol,
What are the distribution of read lengths in your three raw, unprocessed files?
This command will tell you: (all one line)
zcat ./path/to/raw/file.fq.gz | sed -n '2~4p' | awk '{print length}' | sort -n | uniq -c
julian
Carol Buitrago wrote:
Hi Julian156M...157M...158M...159M...160M...161M...162M...163M...164M...165M...166M...167M...168M...169M...170M...171M...172M...173M...174M...175M...176M...177M...178M...179M...180M...181M...182M...183M...184M...185M...186M...187M...188M...189M...190M...191M...192M...193M...194M...195M...196M...197M...198M...199M...200M...201M...202M...203M...204M...205M...206M...207M...208M...209M...210M...211M...212M...213M...214M...215M...216M...217M...
Thanks again for your quick reply. Attached you can find the list of all
my barcodes
To find out why is this happening I performed two tests with different
process_radtag parameters for only three of my samples
test1
process_radtags -i gzfastq -P -1 ./lib1_R1_adaprem.1.fq.gz -2
./lib1_R2_adaprem.2.fq.gz -o . -b
/home/carol/RADseq/RAD_Big-dataset/process_RADtags_txt-files/barcode_test --barcode_dist_1
1 -E phred33 --retain_header -D -e pstI -r -c -q &>>
test1-notruncate-process-RADtag.log
carol@symbiomics[Lane1] cat test1-notruncate-process-RADtag.log
Processing paired-end data.
Using Phred+33 encoding for quality scores.
Found 1 paired input file(s).
Searching for single-end, inlined barcodes.
Loaded 3 barcodes (6bp).
Will attempt to recover barcodes with at most 1 mismatches.
Processing file 1 of 1 [lib1_R1_adaprem.1.fq.gz]
Reading data from:
./lib1_R1_adaprem.1.fq.gz and
./lib1_R2_adaprem.2.fq.gz
Processing
RAD-Tags...1M...2M...3M...4M...5M...6M...7M...8M...9M...10M...11M...12M...13M...14M...15M...16M...17M...18M...19M...20M...21M...22M...23M...24M...25M...26M...27M...28M...29M...30M...31M...32M...33M...34M...35M...36M...37M...38M...39M...40M...41M...42M...43M...44M...45M...46M...47M...48M...49M...50M...51M...52M...53M...54M...55M...56M...57M...58M...59M...60M...61M...62M...63M...64M...65M...66M...67M...68M...69M...70M...71M...72M...73M...74M...75M...76M...77M...78M...79M...80M...81M...82M...83M...84M...85M...86M...87M...88M...89M...90M...91M...92M...93M...94M...95M...96M...97M...98M...99M...100M...101M...102M...103M...104M...105M...106M...107M...108M...109M...110M...111M...112M...113M...114M...115M...116M...117M...118M...119M...120M...121M...122M...123M...124M...125M...126M...127M...128M...129M...130M...131M...132M...133M...134M...135M...136M...137M...138M...139M...140M...141M...142M...143M...144M...145M...146M...147M...148M...149M...150M...151M...152M...153M...154M...155M...
156M...157M...158M...159M...160M...161M...162M...163M...164M...165M...166M...167M...168M...169M...170M...171M...172M...173M...174M...175M...176M...177M...178M...179M...180M...181M...182M...183M...184M...185M...186M...187M...188M...189M...190M...191M...192M...193M...194M...195M...196M...197M...198M...199M...200M...201M...202M...203M...204M...205M...206M...207M...208M...209M...210M...211M...212M...213M...214M...215M...216M...217M...carol@symbiomics[Lane1] cat test2-truncate-process-RADtag.log *[ 6:20PM]*
434211680 total reads; -360317814 ambiguous barcodes; -885218 ambiguous
RAD-Tags; +1699831 recovered; -122111 low quality reads; 72886537
retained reads.
Closing files, flushing buffers...
Outputing details to log: './process_radtags.Lane1.log'
434211680 total sequences
360317814 ambiguous barcode drops (83.0%)
122111 low quality read drops (0.0%)
885218 ambiguous RAD-Tag drops (0.2%)
72886537 retained reads (16.8%)
test2
process_radtags -i gzfastq -P -1 ./lib1_R1_adaprem.1.fq.gz -2
./lib1_R2_adaprem.2.fq.gz -o ./test2 -b
/home/carol/RADseq/RAD_Big-dataset/process_RADtags_txt-files/barcode_test --barcode_dist_1
1 -E phred33 --retain_header -D -e pstI -r -c -q -t 100 &>>
test2-truncate-process-RADtag.log
Processing paired-end data.
Using Phred+33 encoding for quality scores.
Reads will be truncated to 100bp
Found 1 paired input file(s).
Searching for single-end, inlined barcodes.
Loaded 3 barcodes (6bp).
Will attempt to recover barcodes with at most 1 mismatches.
Processing file 1 of 1 [lib1_R1_adaprem.1.fq.gz]
Reading data from:
./lib1_R1_adaprem.1.fq.gz and
./lib1_R2_adaprem.2.fq.gz
Processing
RAD-Tags...1M...2M...3M...4M...5M...6M...7M...8M...9M...10M...11M...12M...13M...14M...15M...16M...17M...18M...19M...20M...21M...22M...23M...24M...25M...26M...27M...28M...29M...30M...31M...32M...33M...34M...35M...36M...37M...38M...39M...40M...41M...42M...43M...44M...45M...46M...47M...48M...49M...50M...51M...52M...53M...54M...55M...56M...57M...58M...59M...60M...61M...62M...63M...64M...65M...66M...67M...68M...69M...70M...71M...72M...73M...74M...75M...76M...77M...78M...79M...80M...81M...82M...83M...84M...85M...86M...87M...88M...89M...90M...91M...92M...93M...94M...95M...96M...97M...98M...99M...100M...101M...102M...103M...104M...105M...106M...107M...108M...109M...110M...111M...112M...113M...114M...115M...116M...117M...118M...119M...120M...121M...122M...123M...124M...125M...126M...127M...128M...129M...130M...131M...132M...133M...134M...135M...136M...137M...138M...139M...140M...141M...142M...143M...144M...145M...146M...147M...148M...149M...150M...151M...152M...153M...154M...155M...
434211680 total reads; -360317814 ambiguous barcodes; -880835 ambiguous
RAD-Tags; +1695662 recovered; -306420 low quality reads; 72706611
retained reads.
Closing files, flushing buffers...
Outputing details to log: './test2/process_radtags.Lane1.log'
434211680 total sequences
360317814 ambiguous barcode drops (83.0%)
306420 low quality read drops (0.1%)
880835 ambiguous RAD-Tag drops (0.2%)
72706611 retained reads (16.7%)
Interestingly enough when I try to run trimmomatic in any of the three
samples from test1 I get the error :"Exception in thread "Thread-0"
java.lang.RuntimeException: Sequence and quality length don't match:
'TGCAG' vs 'FFFF'"
<mailto:jcat...@illinois.edu>>:but when I run it for the samples resulting from my test2 then
trimmomatic works perfectly well (I know that if I truncate the reads
with process_radtags, I wont have to later use trimmomatic).
However, if I later use the Clone_filter on the resulting samples of my
demultiplexing I loose more reads in test2 than in test1
Clone_filter results
test1
pairs of reads input pairs of reads output total paired-end reads
discarded pairs of reads percentage %
SMAQ-R1-1 8004937 5847902 11695804 2157035 26.95
SMAQ-R1-2 11970412 8931507 17863014 3038905 25.39
SMAQ-R1-3 15982942 11675276 23350552 4307666 26.95
Clone_filter results
test2
pairs of reads input pairs of reads output total paired-end reads
discarded pairs of reads percentage %
SMAQ-R1-1 7962673 5676830 11353660 2285843 28.71
SMAQ-R1-2 11949218 8697902 17395804 3251316 27.21
SMAQ-R1-3 15951253 11358116 22716232 4593137 28.79
What can I do ??
Thanks for your help
2017-06-15 17:36 GMT+03:00 Julian Catchen <jcat...@illinois.edu
Hi Carol,
Can you send me your barcodes file off-list?
julian
<mailto:jcat...@illinois.edu <mailto:jcat...@illinois.edu>>>:<mailto:jcat...@illinois.edu>
jcat...@illinois.edu <mailto:jcat...@illinois.edu>; @jcatchen--
Julian Catchen, Ph.D.
Assistant Professor
Department of Animal Biology
University of Illinois, Urbana-Champaign
--
--
Julian M Catchen, Ph.D.
Assistant Professor
Department of Animal Biology
University of Illinois, Urbana-Champaign
--
jcat...@illinois.edu; @jcatchen