Q scores in split libraries


Ajinkya Kulkarni

Aug 22, 2016, 12:03:52 PM
to Qiime 1 Forum

Hello everyone,
I am seeking help with a problem that came up while running my sequence analysis.
I ran 100 samples (2×300) on a MiSeq run using bacterial and archaeal 16S primers (Bac357F/785R, Arc344F/806R). The samples were prepared following the strategy of Herbold et al. 2015, with a new header sequence attached to the barcode and primer.

The MiSeq run gave 10 million reads. The R1 and R2 files were devoid of adapter sequences. I performed the following procedures on the output files:
1) I extracted the barcodes with the unique header sequences using

"extract_barcodes.py -f Friegrich-01_S1_L001_R1_001.fastq -r Friegrich-01_S1_L001_R2_001.fastq -c barcode_paired_end --bc1_len 24 --bc2_len 24 -o processed_seqs"

2) I joined the reads using

 "join_paired_ends.py -f reads1.fastq -r reads2.fastq -b barcodes.fastq -j 10 -o joined.fastq"

3) I prepared my own mapping file and ran the split-libraries command twice, as follows:

split_libraries_fastq.py -i fastqjoin.join.fastq -b fastqjoin.join_barcodes.fastq -o slout/ -m Mappingfile.txt --store_demultiplexed_fastq -r 0 -q 19 -n 100 --barcode_type 24 --phred_offset 33


split_libraries_fastq.py -i fastqjoin.join.fastq -b fastqjoin.join_barcodes.fastq -o slout/ -m Mappingfile.txt --store_demultiplexed_fastq -r 0 -q 0 -n 100 --barcode_type 24

The only difference between the two runs was the -q value:
-q 19 gave me an output of:
Total number of input sequences: 3891531
Barcode not in mapping file: 2776834
Read too short after quality truncation: ~704000

whereas,
-q 0 gave me an output of:
Total number of input sequences: 3891531
Barcode not in mapping file: 2776834
Read too short after quality truncation: 10381

As you can see, I am not sure which -q value I should use. Clearly the latter gives me more sequences per sample than the former. Can somebody help me understand the implications of using one or the other in terms of quality control for the sequence analysis?

The -q 0 command was the one recommended by the BR microbiome protocols, which have a published set of commands for 16S analysis using qiime (http://www.brmicrobiome.org/#!16s-profiling-pipeline-illumina/czxl).

Any help with this confusion would be much appreciated.
I am a little bit new to this, so please bear with my understanding and responses.
Awaiting your thoughts and suggestions.
Regards,
Ajinkya

Colin Brislawn

Aug 22, 2016, 3:04:21 PM
to Qiime 1 Forum
Good morning Ajinkya,

Thank you for your detailed question and explaining the primer design. I think that this is part of the problem, and part of the solution. 

Clearly the latter gives me more sequences per sample than the former. Can somebody help me understand the implications of using one or the other in terms of quality control for the sequence analysis?
This is a tradeoff between quality and quantity; you can choose to have fewer high-quality reads, or more reads of mixed quality. There is no single solution to this balance, and I'll let you decide what is best for your study.

Based on your detailed notes, I think something else may be going wrong, causing you to lose more reads. If we solve this problem, there will be plenty of reads and you can easily use the high-quality filtering parameters. 
-q 19 gave me an output of  :
Barcode not in mapping file: 2776834
-q 0 gave me an output of:
Barcode not in mapping file: 2776834
It looks like a large number of reads in the run have barcodes that are NOT in the mapping file. Given the unique barcoding method, I'm guessing qiime is having a problem finding all these barcodes. Maybe we could experiment with settings and see if something decreases the number of missing barcodes.

Let me know what you find,
Colin

PS I also noticed that --phred_offset 33 was used with -q 19, but not -q 0. Why change this setting? 

Ajinkya Kulkarni

Aug 22, 2016, 4:14:13 PM
to Qiime 1 Forum
Hi Colin,
Thanks for your instant reply.
To answer your queries:
1) The large number of reads are from the PhiX DNA (if I have the nomenclature right), which was not removed. I didn't worry too much about it, because I thought the split-libraries step would only process the reads whose barcodes are in the mapping file; I think that's why I see so many sequences being rejected. Correct me if I am wrong here. While extracting barcodes from the R1 and R2 files, I had already observed barcode sequences that did not belong to the barcodes I used. When I contacted the person who performed the run, he said they didn't remove the externally added PhiX control DNA, which is why I concluded that those are these non-barcoded sequences.

2) I actually forgot to use the --phred_offset option in the second run. The last time I did this analysis, the split-libraries step wouldn't work without it, so I just re-ran the same command without thinking. When I changed the -q value the second time, I tried running it without --phred_offset and it worked. I had to leave after that, so I couldn't re-run the first command without --phred_offset. I will try it again.

3) To check for trade off between quality and quantity here are some thoughts from my side:

a) I have 100 samples. When I checked the split-libraries log, the distribution of sequences is clearly very uneven. For example, with the first command, at least 10 samples (roughly) get fewer than 100 sequences, whereas the top 10 samples have more than 10,000 sequences. The same samples reached a 5- to 10-fold higher sequence number after changing the -q value to 0. So I wasn't sure what the value 0 actually signifies in terms of quality checking (the way the value 19 means roughly 99% base-call accuracy, and hence the loss of so many sequences).

b) This was a MiSeq run intended to give me at least 20 to 25 million reads. What I got was 10 million reads. Furthermore, the fastq files were about 6 GB, and after joining the paired ends (command stated above, minimum overlap of 10 bases) I obtained a joined fastq file of 3.3 GB, which suggests I lost some reads there too. I am not sure how to check this. In any case, you can see that I don't know at which steps I am really losing my sequences. Is there any way to assess the situation?

Your thoughts and knowledge on this will be much appreciated.
Regards,
Ajinkya

Colin Brislawn

Aug 22, 2016, 7:31:03 PM
to Qiime 1 Forum
Hello Ajinkya,

Glad I could help.

1) Yeah, PhiX is added to MiSeq runs that are mostly 16S amplicons and can make up 5-20% of the reads. These will indeed be filtered out during demultiplexing. Based on the numbers you posted, 71% of your reads were not barcoded amplicons! While these could be PhiX, that would mean way too much was spiked in. It's also possible that these are amplicons with slightly misformatted barcodes. Given the unique nature of your barcodes, I would guess that qiime may be having trouble processing some reads with valid barcodes.

2) Glad I caught that --phred_offset! I'm not sure what's the right setting for your fastq files, but if you are comparing -q settings, it's important to keep other settings the same. 

3) Tradeoffs
a) Unfortunately, the number of reads per sample is not usually even. I expect about 10% of samples to fail to sequence (only tens of reads) while others sequence too well (many thousands of reads). Raising the -q score should affect all samples roughly equally; doubling the -q score may remove an additional 20% of the reads from each sample.
b) I find that 'good' MiSeq runs often give 10-15M amplicons. The marketing numbers are best-case estimates, and shotgun genomics yields more reads than amplicon runs do.
fastq files were about 6Gb and after joining the paired ends (command stated above, minimum overlap of 10 bases) I obtained a fastq joined file of 3.3 Gb which means I clearly lost some reads there too.
Or maybe not! Keep in mind that, after overlapping reads, there is less redundant information to encode, so the files should be smaller. And it's pretty easy to check:

You probably already know this, but fastq files are just big text files. This means you can view and count them using normal linux commands:
View the start of fastq file:
head example.fastq
Get the total number of lines:
wc -l example.fastq
Because each read in a fastq file takes up 4 lines, you can divide that number by 4 to get the total number of reads.

Qiime says that your file fastqjoin.join.fastq has 3891531 input reads in it. This means that if you run this command...
wc -l fastqjoin.join.fastq
...you should get 15566124.

You could also search for lines that start with the @ sign, to estimate the number of reads in your fastq file directly.
grep -c '^@' fastqjoin.join.fastq
That command should return roughly 3891531, the number of input reads qiime reports. Note it can over-count, because quality-score lines can also begin with '@'.
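To make the two counting approaches concrete, here is a minimal sketch on a toy two-read fastq file (the file and read names are invented for illustration; real files follow the same four-lines-per-read layout). It also shows why the `grep -c '^@'` count can be misleading: in Phred+33 encoding, '@' is a legal quality character (Q31), so quality lines can start with it too. Counting every fourth line with awk avoids that pitfall:

```shell
# Build a toy two-read fastq file; note read1's quality string starts with '@'
cat > example.fastq <<'EOF'
@read1
ACGTACGT
+
@IIIIIII
@read2
TTTTACGT
+
IIIIIIII
EOF

# Reads = total lines / 4
lines=$(wc -l < example.fastq)
echo $((lines / 4))                       # prints 2

# Naive header count over-counts: a quality line starts with '@' as well
grep -c '^@' example.fastq                # prints 3, not 2

# Safer: count every 4th line starting from line 1 (the header lines)
awk 'NR % 4 == 1' example.fastq | wc -l   # prints 2
```

On a multi-gigabyte file the awk variant is still a single pass, so it should finish in a few minutes at most.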


You have been asking all the right questions. Let me know what you find! 
Colin

Ajinkya Kulkarni

Aug 23, 2016, 6:25:12 PM
to Qiime 1 Forum
Hi Colin,
Sorry, I am just a bit busy with a course. I will check this in the coming day or two and get back to you. Hope that is fine.
Regards,
Ajinkya

Colin Brislawn

Aug 23, 2016, 8:07:23 PM
to Qiime 1 Forum
Oh that's fine. I'm here to help you :-)
Colin

Ajinkya Kulkarni

Aug 25, 2016, 5:13:12 PM
to Qiime 1 Forum
Dear Colin,
So I tried a few things

1) I ran the other command with the --phred_offset 33 option, and the output was the same in the end.

2) Next I ran the wc command, and it returned a line count that, divided by 4, matched the number of reads in the end.
The one I couldn't get an output from was the grep command. I waited 30 minutes and got no result; not sure why.
Also, I don't know how to run this check on the discarded sequences, i.e. the ~2.7M reads that were not barcoded, as I don't know where that file is.

3) Since I have a really long barcode + header sequence (24 bases), I thought it might be better to use the --max_barcode_errors option to allow more mismatches: the default is 1.5, and I thought I could increase it to 3 and see what happens (longer barcode sequence, more chance of sequencing errors in the barcode, hence more barcodes not matched to the mapping file?).

4) "Unfortunately, the number of reads per sample is not usually even. I expect about 10% of samples to fail to sequence (only tens of reads) while other sequence too well (many thousands of reads). Making the -q score greater should affect all samples equally; doubling the -q score may remove an additional 20% of the reads from each sample. "
I think I see a similar effect in my sample set: about 28 samples out of 100 have fewer than 1000 sequences at -q 19, and about 20 samples fall below the same threshold at -q 0. Of course, -q 0 yields ~1.2M reads while -q 19 yields ~400K reads.

I will now try to run split_libraries with a higher value in barcode error to see what happens.
Awaiting your thoughts on the same.
Regards,
Ajinkya


Colin Brislawn

Aug 25, 2016, 5:25:43 PM
to Qiime 1 Forum, jai.r...@gmail.com
Hi Ajinkya,

Thanks for trying all these things and reporting back. I think the issue is related to the super-long barcodes, but I'm not really sure how best to tackle it. I would recommend continuing to experiment with the settings you are exploring. Increasing --max_barcode_errors is a great start, as we would expect errors to accumulate over a long barcode.

I've cc'ed a qiime dev who may know more, or have a suggestion about dealing with long barcodes. 

Colin

PS Have you considered contacting the original sequencing core to see if they have a protocol for demultiplexing, or contacting the authors who originally created these barcodes? Presumably the authors were able to demultiplex successfully.

Ajinkya Kulkarni

Aug 25, 2016, 5:29:11 PM
to Qiime 1 Forum
Dear Colin,
Sorry for the spam.
So I tried adding more allowed errors in the barcode (--max_barcode_errors 3, as described in point 3 of my previous message: longer barcode sequence, more chance of errors in the barcode, hence more barcodes not matching the mapping file?).

But the outcome was the same.

So I am a bit out of options as to what more I could check. Are there better options for re-running the joining (although I had already set an overlap of about 10 bp)?

Any help would be appreciated.
Again sorry to bombard you with so many things.
Will wait for your response.
Regards,
Ajinkya


Ajinkya Kulkarni

Aug 25, 2016, 5:37:58 PM
to Qiime 1 Forum, jai.r...@gmail.com
Hi Colin,
Thanks for your immediate reply.
The thing is, the barcodes are only 8 bp long. But the way we prepared the amplicons was something like this:

<adapter><barcode (8bp)><header sequences (16bp)><primer sequence>

So when I extracted the barcodes, I extracted the barcode + header sequence (24 bp) and used the same in the mapping file.
I was not sure if I could get away with keeping the headers in the file.

Thoughts?
Regards
Ajinkya

Colin Brislawn

Aug 25, 2016, 5:38:52 PM
to Qiime 1 Forum
Hello Ajinkya,

Thanks for trying all this stuff. 

But the outcome was the same.
... And that makes me think that something is happening internally in qiime that is tripping us up. Maybe --max_barcode_errors only works with Golay codes, or the padding is doing something strange.

So I am a bit out of options as to what I could do to check more? Some better options to rerun the joining (although I had already set the overlap of about 10 bp)
I think joining is fine, and something about demultiplexing is at fault. Either a qiime dev or one of the authors who made (and used!) these long codes could help more than me.

Thanks for keeping in touch. If you can get in contact with those original authors, let me know! 

Colin

 

Colin Brislawn

Aug 25, 2016, 5:44:58 PM
to Qiime 1 Forum, jai.r...@gmail.com
Oh! That mapping file should have a column for barcodes (8bp) and a separate column for LinkerPrimerSequence (16bp). When you run the qiime scripts, your barcode length will be 8bp. Not sure if this will work, but it will be more similar to the qiime defaults. 

Why do you want the header sequence in the file? Are the headers different for different samples?

Colin

Ajinkya Kulkarni

Aug 25, 2016, 5:47:32 PM
to Qiime 1 Forum
Hey Colin,
Thanks for asking the other qiime developers about this; it would really help me tackle the issue. I will ask the authors of the paper what they did, since they just mention "we ran the split_libraries.py code for demultiplexing" without giving the parameters.
I will try to contact them.
Thanks once again and wish you a nice evening ahead.
Regards,
Ajinkya


TonyWalters

Aug 25, 2016, 5:49:45 PM
to Qiime 1 Forum
Hello Ajinkya,

You are losing a fair number of reads to the quality truncation, which you might be able to mitigate by increasing the -r parameter or reducing the -p parameter of split_libraries_fastq.py from their default settings. When reads are joined, the low-quality section tends to fall in the middle of the joined read rather than at the end, causing truncated reads to be short relative to their full length.

However, I think you want to focus on the barcode extraction step. From looking at the publication, they used an 8 base pair barcode, rather than the 24 bases you extracted (your new message just came in stating that you used 8 bp barcodes as well). It might help if you look in your raw sequence file (the one before running extract_barcodes.py), perhaps via these commands:
head Friegrich-01_S1_L001_R1_001.fastq > r1_reads.fastq
head Friegrich-01_S1_L001_R2_001.fastq > r2_reads.fastq
so you get small files (r1_reads.fastq and r2_reads.fastq) that are easy to open. We need to see whether the reads start with the adapters or with the barcodes. If they start with the barcodes, we want to rerun extract_barcodes.py and specify an 8 bp length, and make sure the BarcodeSequence values in your mapping file reflect the 8 base pair lengths (and are in the right orientation, i.e., not the reverse complement of the barcodes in the actual reads).
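To check this at scale rather than by eye, one hedged option (standard Unix tools only; the toy file below is invented for illustration) is to tabulate the most common leading 8-mers across all sequence lines. If demultiplexing should work, your barcodes should dominate this list:

```shell
# Toy R1 file: two reads share the prefix CAGTTCAG, one starts with something else
cat > r1_toy.fastq <<'EOF'
@r1
CAGTTCAGAAAACCCC
+
IIIIIIIIIIIIIIII
@r2
CAGTTCAGGGGGTTTT
+
IIIIIIIIIIIIIIII
@r3
TTTTTTTTTTTTTTTT
+
IIIIIIIIIIIIIIII
EOF

# Sequence lines are lines 2, 6, 10, ... (NR % 4 == 2); tally their first 8 bases
awk 'NR % 4 == 2 { print substr($0, 1, 8) }' r1_toy.fastq \
  | sort | uniq -c | sort -rn | head -20
```

On the real data you would swap in Friegrich-01_S1_L001_R1_001.fastq; if the valid barcodes do not top this table, demultiplexing cannot succeed no matter what parameters are used.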

Once this gets sorted out, I think we can get the demultiplexing moving along.

Ajinkya Kulkarni

Aug 25, 2016, 5:51:28 PM
to Qiime 1 Forum, jai.r...@gmail.com
Hmm, I thought the same initially. Maybe I should try that out. The header sequences are the same for all samples. Good thought; maybe qiime will work better with those sequences added to the linker column.
Good, good. On my way to trying that out :D
Cheers,
Ajinkya


Colin Brislawn

Aug 25, 2016, 5:58:37 PM
to Qiime 1 Forum, jai.r...@gmail.com
Thanks for jumping in Tony! Those are great ideas.

Ajinkya, let me know what you find. 

Colin

Ajinkya Kulkarni

Aug 25, 2016, 6:13:39 PM
to qiime...@googlegroups.com
Hello Tony,
Thanks for assisting me with this. I am attaching the output of the commands you suggested. One thing I would like to mention is that the person who ran the sequencer said he removed the Illumina adapters but not the PhiX sequences.
So some barcodes don't match the barcodes I have here, because I presume they belong to the PhiX sequences (at least I think so).
Anyway, have a look and let me know what you think.
Regards,
Ajinkya

r1_reads.fastq
r2_reads.fastq
Mapping file.xlsx

TonyWalters

Aug 26, 2016, 10:49:23 AM
to Qiime 1 Forum
Hello again,

I looked at the R1 file and the mapping file, and only the third read had an initial 8 base pairs that matched a barcode: CAGTTCAG.
The other two reads did not start with a sequence that matched a barcode, although they did hit 16S when I blasted the complete reads against NCBI.

It doesn't seem like there should be any sort of phasing in the barcode position according to the article. You might try digging through more of the reads and see if the starting 8 base pairs match the barcodes (first 8 bp of the strings in your BarcodeSequence values).

I would create another mapping file at this point, that just contains the left 8 base pairs as the BarcodeSequence values instead of using the barcode + head region.

Maybe this approach will work (if more of the reads start with the 8 base pair barcode) after trimming down the barcode in the mapping file:
1. Run extract_barcodes.py one time but just extract the 8 base pair barcode from the forward read:
extract_barcodes.py -f Friegrich-01_S1_L001_R1_001.fastq --bc1_len 8  -o barcodes_only
2. Then run extract_barcodes.py on the output reads file (saving the barcodes_only file for later use with join_paired_ends.py) to cut off the rest of the PCR constructs:
extract_barcodes.py -f X -r Friegrich-01_S1_L001_R2_001.fastq -c barcode_paired_end --bc1_len 12 --bc2_len 20 -o processed_seqs_2nd_step
where X is the path to the reads file from step 1. I'm guessing the length to cut off is 20, based upon where the NCBI blast hits stopped hitting 16S data, but feel free to change this back to 16/24 if you find different positions in other reads.
3. Then do join_paired_ends.py as you did previously, using the barcodes output from step 1 as -b input and the reads output of step 2 as the r1/r2 input.
4. Run split_libraries_fastq.py, using barcode length 8.
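Stitched together, the four steps might look like the sketch below. The directory layout and the file names inside each output directory (reads.fastq, barcodes.fastq, reads1.fastq, reads2.fastq, and the mapping file Mappingfile_barcodeonly.txt) are assumptions based on extract_barcodes.py's usual output names; check what the scripts actually write on your system and adjust the paths:

```shell
# 1. Pull just the 8 bp barcode off the forward read
extract_barcodes.py -f Friegrich-01_S1_L001_R1_001.fastq --bc1_len 8 -o barcodes_only

# 2. Trim the remaining header/primer construct from both reads
#    (reads.fastq is assumed to be the trimmed-reads output of step 1)
extract_barcodes.py -f barcodes_only/reads.fastq -r Friegrich-01_S1_L001_R2_001.fastq \
  -c barcode_paired_end --bc1_len 12 --bc2_len 20 -o processed_seqs_2nd_step

# 3. Join the trimmed pairs, carrying the step-1 barcodes along
join_paired_ends.py -f processed_seqs_2nd_step/reads1.fastq \
  -r processed_seqs_2nd_step/reads2.fastq \
  -b barcodes_only/barcodes.fastq -j 10 -o joined

# 4. Demultiplex with the 8 bp barcodes and a barcode-only mapping file
split_libraries_fastq.py -i joined/fastqjoin.join.fastq \
  -b joined/fastqjoin.join_barcodes.fastq \
  -m Mappingfile_barcodeonly.txt -o slout/ \
  --store_demultiplexed_fastq --barcode_type 8 --phred_offset 33
```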


Ajinkya Kulkarni

Aug 28, 2016, 1:02:34 PM
to Qiime 1 Forum
Hi Tony,
I will try this out tomorrow and report back to you accordingly.
Thanks for the suggestions.
Regards,
Ajinkya

Ajinkya Kulkarni

Aug 29, 2016, 10:53:38 AM
to Qiime 1 Forum
Hello Tony,
So I did try the following as you mentioned:
1) extract_barcodes.py -f Friegrich-01_S1_L001_R1_001.fastq --bc1_len 8  -o barcodes_only
2) extract_barcodes.py -f reads.fastq -r Friegrich-01_S1_L001_R2_001.fastq -c barcode_paired_end --bc1_len 12 --bc2_len 20 -o processed_seqs_2nd_step
(Note: I also tried --bc1_len 16 and --bc2_len 24)
3) join_paired_ends.py -f reads1.fastq -r reads2.fastq -b ../barcodes.fastq -j 10 -o joined.fastq
The barcode file was from the first step whereas the reads 1 and 2 from the second step.
4) split_libraries_fastq.py -i fastqjoin.join.fastq -b fastqjoin.join_barcodes.fastq -o slout/ -m Mappingfile_barcodeonly.txt --store_demultiplexed_fastq -r 0 -q 19 -n 100 --barcode_type 8 --phred_offset 33
The barcode file was the one generated by the join_paired_ends.py step. The mapping file had the linker sequence column empty and contained only the 8 bp barcode sequences.

The output is as follows:
Quality filter results
Total number of input sequences: 3914775
Barcode not in mapping file: 2789740
Read too short after quality truncation: 714587
Count of N characters exceeds limit: 0
Illumina quality digit = 0: 0
Barcode errors exceed max: 0

As you can see, after demultiplexing I still find that about 2.7M reads do not have barcodes matching the mapping file. The result was the same when I used the other barcode lengths (16 and 24) in the second step.
Just out of curiosity, using grep, I tried to count the number of times a barcode appeared in a particular file.
I chose the sample with the most reads, i.e. 15639 (after demultiplexing):
1) I counted the number of times the corresponding barcode+header sequence appeared in the raw R1 file: 107910 (I used the header here to minimize counts from within the reads themselves)
2) I counted just the barcode itself in the barcode file generated by extracting the barcodes from the R1 file (i.e. step 1 above): 108914 (more than before)
3) I counted the barcode in the final barcode file generated by joining the paired ends (step 3): 37307 (only about 35% of the reads survived joining)
4) Finally, demultiplexing at -q 19 yielded 15639 sequences, i.e. only ~42% of the joined reads survived; the rest either had low quality or had no barcode matching the mapping file.
5) To check where the sequences were actually being lost, I ran the demultiplexing at -q 0 to relax quality control. After this, the sample yielded 37036 sequences, i.e. a loss of less than 1% of the joined sequences.
(FYI the summary for -q 0 demultiplexing:
Quality filter results
Total number of input sequences: 3914775
Barcode not in mapping file: 2789740
Read too short after quality truncation: 11829
Count of N characters exceeds limit: 0
Illumina quality digit = 0: 0
Barcode errors exceed max: 0)

Just to check that this is also happening in other samples, I took the sample with the lowest sequence output after demultiplexing, i.e. 94 reads, and performed the same counting steps as above.
The numbers are as follows:
1) In raw reads: 1627
2) Extracting from R1: 1641
3) After joining: 299
4) Demultiplexing: -q 19 : 94; -q 0 : 291

Correct me if I am wrong, but from what I observe in these two examples, most of the sequences are lost to the quality controls and to joining, not because the barcode is not in the mapping file.
And I did this for some more samples within this data set.
So I am not sure how to attribute the loss of 2.7M reads during demultiplexing, as the grep counts tell me something else, unless I am completely off track and doing something wrong.
I also used grep to check the position of the barcodes and headers in the raw files (for some barcodes in both R1 and R2; for the latter I reverse-complemented before checking), and the sequences always start with the barcode+header sequence. What I did realise is that the reverse reads contained more short sequences, which explains the loss of sequences after joining.
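For reference, the kind of grep counting described above can be sketched on a toy file like this (file and barcode are invented for illustration). Anchoring the pattern with '^' and restricting to sequence lines matters, because an unanchored grep also counts barcodes that happen to occur in the middle of a read:

```shell
# Toy fastq: read r1 starts with the barcode, read r2 contains it mid-read
cat > toy.fastq <<'EOF'
@r1
CAGTTCAGACGTACGT
+
IIIIIIIIIIIIIIII
@r2
AAAACAGTTCAGTTTT
+
IIIIIIIIIIIIIIII
EOF

# Unanchored count: also hits the barcode embedded mid-read in r2
grep -c 'CAGTTCAG' toy.fastq                          # prints 2

# Sequence lines only (NR % 4 == 2), anchored to line start: true barcode starts
awk 'NR % 4 == 2 && /^CAGTTCAG/' toy.fastq | wc -l    # prints 1
```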

Hope this helps you in understanding my problem and hope that you can guide me in the right direction with such an assessment.
Thanking you in advance and awaiting your reply in patience.
Regards,
Ajinkya


Colin Brislawn

Aug 29, 2016, 1:08:00 PM
to Qiime 1 Forum
Hello Ajinkya,

I'm still reading through your detailed response... Thanks for sharing all these clues with us!

I have a quick question: do you have a single forward, reverse, and barcode read file, or do you have a pair of files (forward and reverse) for each of your samples? I ask because 'Friegrich-01_S1_L001_R1_001.fastq' looks like it might be an R1 for a single sample.

I'll review your question later this afternoon,
Colin

Ajinkya Kulkarni

Aug 29, 2016, 2:07:31 PM
to Qiime 1 Forum
Hi Colin,
Take your time looking into all this. The R1 and R2 files are what I got from the person who performed the run. All the samples are in these two files.
Hope that answers your query :)
Cheers,
Ajinkya

TonyWalters

Aug 29, 2016, 3:18:12 PM
to Qiime 1 Forum
Hello,

It seems that the joining step and the demultiplexing step are losing most of the reads. You might try altering the parameters of join_paired_ends.py to see if more reads can be retained at that step, and for the split_libraries_fastq.py step I would try parameters like these:

--max_bad_run_length 6 --min_per_read_length_fraction 0.50

and see if that helps retain more reads at the end.

Another option might be to only use the barcodes and R1 reads (barcodes from the first extract_barcodes.py call, and the R1 reads from the second extract_barcodes.py call) and see how those demultiplex with split_libraries_fastq.py if the joining step is being uncooperative.

-Tony


Ajinkya Kulkarni

Sep 1, 2016, 2:34:23 AM
to Qiime 1 Forum
Hi Tony,
I tried using just the R1 reads and the barcodes from the first extraction step to see the demultiplexing.

Here is the outcome:
Quality filter results
Total number of input sequences: 10739119
Barcode not in mapping file: 7006870
Read too short after quality truncation: 2672240

Count of N characters exceeds limit: 0
Illumina quality digit = 0: 0
Barcode errors exceed max: 0

In the end I get more quality-filtered sequences, though of shorter length, from this assessment. I feel that the R2 reads are the ones causing the general problem, for whatever reason, as they seem to be fewer and shorter (I checked a few samples) than the R1 reads per sample, which may be why I lose reads after joining the paired ends.
But even demultiplexing only the R1 reads, I still lose 7M reads to barcodes not matching the mapping file. I will attach the mapping file so that you can have a look and check that I am not doing something wrong in creating it.

Further, for the paired-end analysis I tried the two options you mentioned during demultiplexing, as follows:
split_libraries_fastq.py -i fastqjoin.join.fastq -b fastqjoin.join_barcodes.fastq -o slout_joined/ -m Mappingfile_barcodeonly.txt --store_demultiplexed_fastq --max_bad_run_length 6 --min_per_read_length_fraction 0.50 -q 19 -n 100 --barcode_type 8 --phred_offset 33

This really helped prevent the loss of more sequences through the quality check.
Quality filter results
Total number of input sequences: 3891531
Barcode not in mapping file: 2767927
Read too short after quality truncation: 186

Count of N characters exceeds limit: 0
Illumina quality digit = 0: 0
Barcode errors exceed max: 0

I guess changing those parameters did help retain more reads in the end, but is this still after a Q20 quality check?
I am a bit confused here.
For now, until you reply, I'll process the data further using just the R1 reads that I demultiplexed.
Awaiting your reply in patience.
Regards
Ajinkya



Mappingfile_barcodeonly.txt

Colin Brislawn

Sep 1, 2016, 2:31:21 PM
to Qiime 1 Forum
Hello Ajinkya,

Either way, there are a lot of reads with undetected barcodes.
Barcode not in mapping file: 7006870
Barcode not in mapping file: 2767927

Either these are truly non-barcoded reads from the PhiX spike-in, or some of them are reads that qiime is missing. I wish I knew more, but I'm not sure what other software inside qiime we could try.

Colin

Ajinkya Kulkarni

Sep 2, 2016, 8:47:22 AM
to Qiime 1 Forum
Hi Colin and Tony,
Thanks for all your help with everything; I really appreciate your efforts. I decided to go ahead and use the R1 reads to process the data, since I get more output that way, which is more useful to me. We will work on refining our sequencing run, especially the barcoding steps, and hopefully sometime in the future I can come back to you with a better output.

One more question:
Is there any way I can store all the sequences that are not barcoded during demultiplexing?
@Colin: I had found a post of yours from a few days ago explaining this, but I am not able to find it anymore. One last bit of help would be really appreciated.
Thanks in advance and thanks a ton.
Wish you a nice weekend.
Regards,
Ajinkya

TonyWalters

Sep 2, 2016, 8:59:00 AM
to Qiime 1 Forum
Hi Ajinkya,

You can use the --retain_unassigned_reads option with split_libraries_fastq.py to write the unassigned reads (they will be in a SampleID called "Unassigned").
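As a sketch building on your earlier -q 19 command (the output directory name slout_unassigned/ is just a placeholder), that would look like:

```shell
# Same demultiplexing run, but keep reads whose barcodes match nothing
split_libraries_fastq.py -i fastqjoin.join.fastq -b fastqjoin.join_barcodes.fastq \
  -o slout_unassigned/ -m Mappingfile_barcodeonly.txt \
  --store_demultiplexed_fastq -q 19 -n 100 --barcode_type 8 --phred_offset 33 \
  --retain_unassigned_reads
```

The unassigned reads then appear in the output under the SampleID "Unassigned", so you can pull them out and inspect what they actually are (PhiX, unbarcoded amplicons, or valid barcodes that are being missed).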

Ajinkya Kulkarni

Sep 5, 2016, 8:10:39 AM
to Qiime 1 Forum
Hello Colin and Tony,
Since you helped me a lot with this assessment, I just wanted to let you know that the reason we probably see so many non-barcoded sequences is that during the barcoding PCR step we had loads of PCR products that didn't get barcoded but still made it into the sequencing run. We will test this in another run and see the outcome.
With your help I was finally able to process the R1 reads, at least to have a result from this run.
Thanks a lot for helping me.
Regards,
Ajinkya

