Barcode not in mapping file


Matías Di Paola

Sep 7, 2016, 7:38:19 PM
to Qiime 1 Forum
Hi Qiime developers and users,


My problem is in the demultiplexing step using split_libraries_fastq.py. Almost 70% of my input reads have no matching barcode in the mapping file.

Here is my pipeline:

// generating barcode file
extract_barcodes.py -f SAM1-25_S8_L001_R1_001.fastq -r SAM1-25_S8_L001_R2_001.fastq -c barcode_paired_end --bc1_len 8 --bc2_len 8 -o barcode
output: barcodes_20lines.fastq

// validating mapping file
validate_mapping_file.py -m mapping_file.txt -o check_id_map
output: mapping_file_corrected.txt

// join_paired_ends
join_paired_ends.py -f SAM1-25_S8_L001_R1_001.fastq -r SAM1-25_S8_L001_R2_001.fastq -b barcodes.fastq -o fastq-join_joined
output: fastqjoin.join_20lines.fastq fastqjoin.join_barcodes_20lines.fastq

// demultiplexing
split_libraries_fastq.py -i fastqjoin.join.fastq -b fastqjoin.join_barcodes.fastq -o demultiplex_NO_barCodes/ -m mapping_file_corrected.txt -q 19 --barcode_type 8

First lines from split_library_log.txt:

Mapping filepath: mapping_file_corrected.txt (md5: 2ebdc2c6e3cc86da635db8c3a537a407)
Sequence read filepath: fastqjoin.join.fastq (md5: 09dc1774afbbf8146598a44c650c02f4)
Barcode read filepath: fastqjoin.join_barcodes.fastq (md5: 5462eec2bc663960c2ae3c5f996bf3ef)

Quality filter results
Total number of input sequences: 1197320
Barcode not in mapping file: 811160
Read too short after quality truncation: 84368
Count of N characters exceeds limit: 3
Illumina quality digit = 0: 0
Barcode errors exceed max: 0

Result summary (after quality filtering)
Median sequence length: 523.00
P1b.18.9    17472
yucra.2a    15875


I have already read https://groups.google.com/forum/#!topic/qiime-forum/XLRdwnpILBs, and it seems that the reverse complement is not the problem.
Any idea what is happening? Am I making a mistake in the previous steps? Is there a problem with the mapping file, such as missing barcodes?

Thanks for your help

Matias

barcodes_20lines.fastq
fastqjoin.join_20lines.fastq
fastqjoin.join_barcodes_20lines.fastq
mapping_file_corrected.txt

Embriette

Sep 8, 2016, 11:44:57 AM
to Qiime 1 Forum
Hi Matias,

It looks like your problem originates from extract_barcodes.py. You are using read 1 and read 2 to extract barcodes, with the option -c barcode_paired_end. Per the help on this command, this will result in the barcode for fastq1 being written first, followed by the barcode from fastq2. If you look at your barcodes file, you have barcodes that are 16bp long, which makes sense given the argument you passed for extract_barcodes.py. This is incorrect, given that the barcodes in your mapping file are 8bp long. Did you add barcodes to both ends of your DNA fragment, or are they only on one end?

Did your sequencing center provide you with an index file (i.e., a barcodes file)? If they didn't, ask whether they have one. Otherwise, I would recommend re-running extract_barcodes.py using the appropriate read, checking the output barcodes file to make sure the barcodes are 8bp long and match the barcodes in your mapping file, and then re-running split_libraries_fastq.py. Let me know how it goes!
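One quick way to check the barcode lengths is to tally them directly from the barcodes FASTQ. The following is only a minimal sketch: the function names are hypothetical (they are not part of QIIME), the two records are made up for illustration, and it assumes a plain four-line-per-record FASTQ.

```python
from collections import Counter
import io


def read_fastq_sequences(handle):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:  # line 2 of every record holds the sequence
            yield line.strip()


def barcode_length_counts(handle):
    """Count how many barcode reads there are of each length."""
    return Counter(len(seq) for seq in read_fastq_sequences(handle))


# Made-up 16bp records, standing in for a barcodes file written with
# -c barcode_paired_end --bc1_len 8 --bc2_len 8.
example = io.StringIO(
    "@read1\nGTGTTACGAACCGGTT\n+\nCCCCCGGGCCCCCGGG\n"
    "@read2\nCGTCGCATTTGGCCAA\n+\nCCCCCGGGCCCCCGGG\n"
)
lengths = barcode_length_counts(example)
expected_length = 8  # barcode length used in the mapping file
print(lengths)
print("all barcodes have the expected length:", set(lengths) == {expected_length})
```

For a real file, pass an open file handle (for example `open("barcodes.fastq")`) instead of the `io.StringIO` object.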

Embriette


Matías Di Paola

Sep 8, 2016, 5:34:45 PM
to Qiime 1 Forum
Hi rose,

Thank you for your response. I agree with you: I certainly have a problem with that step. I used that flag (-c barcode_paired_end) because when I grepped my .fastq files (Fw and Rv) for the barcodes, I found matches at the beginning of the lines in both files, so I assumed that extract_barcodes.py needed both files. But you are right, it generates 16bp barcodes.

So I changed the command to:

extract_barcodes.py -f SAM1-25_S8_L001_R1_001.fastq -c barcode_single_end --bc1_len 8 -o barcode_fw

output:

@M02542:101:000000000-AHH2T:1:1101:10795:1039 1:N:0:8
GTGTTACG
+
CCCCCGGG
@M02542:101:000000000-AHH2T:1:1101:10977:1040 1:N:0:8
GTTTTACA
+
CCCCCGGG
@M02542:101:000000000-AHH2T:1:1101:11562:1040 1:N:0:8
CGTCGCAT
+
...

So now I have 8bp barcodes, and they seem correct (I grepped them against the barcodes in the mapping file).
I then continued with the same pipeline, changing only the barcodes.fastq file:

join_paired_ends.py -f SAM1-25_S8_L001_R1_001.fastq -r SAM1-25_S8_L001_R2_001.fastq -b barcodes_NEW.fastq -o fastq-join_joined_fw

split_libraries_fastq.py -i fastqjoin.join_NEW.fastq -b fastqjoin.join_barcodes_NEW.fastq -o demultiplex_fw -m mapping_file_corrected.txt -q 19 --barcode_type 8

But I got the same results (first lines from split_library_log.txt):

Input file paths

Mapping filepath: mapping_file_corrected.txt (md5: 2ebdc2c6e3cc86da635db8c3a537a407)
Sequence read filepath: fastqjoin.join_NEW.fastq (md5: 09dc1774afbbf8146598a44c650c02f4)
Barcode read filepath: fastqjoin.join_barcodes_NEW.fastq (md5: 8e9b084387d3a30c7dd13338a7d91e9b)


Quality filter results
Total number of input sequences: 1197320
Barcode not in mapping file: 811160
Read too short after quality truncation: 84368
Count of N characters exceeds limit: 3
Illumina quality digit = 0: 0
Barcode errors exceed max: 0

So I obtained essentially the same results.

The sequencing was outsourced to "Mr.DNA". They provided the full.fasta and full.qual files, already processed (but not demultiplexed), and the mapping file. On the other hand, Illumina sent the .fastq files (fw and rv). Here is the protocol they follow:

"(..)Methods of MiSeq!

The 16S rRNA gene V4 variable region PCR primers 515/806 (OR OTHER PRIMER SELECTED) with barcode on the forward primer were used in...

To process the r1 and r2 files for amplicons >300bp and <570bp this is the process summarized that we use at MR DNA.
a. Join the reads together after q25 trimming of the ends (there are many publicly available softwares that join illumina paired end reads).
b. Look for barcodes at the 5’ end, also find reverse complement barcodes at the 3’ end of the joined reads.
c. Reverse complement the sequences containing barcodes at the 3’ end.
d. The resulting file is our full.fasta and full.qual .. this has the joined reads all in the same 5’-3’ orientation.. this is raw data just joined and reoriented.
e. Please note you can convert the full.fasta and full.qual back to fastq using our free software www.mrdnafreesoftware.com "
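Steps b and c of the quoted protocol can be sketched roughly as follows. This is only an illustration of the idea, not Mr. DNA's actual code: the function names are hypothetical, it requires exact barcode matches, and the example read is made up (only the barcode CGTCGCAT comes from the output shown above).

```python
# Translation table for complementing DNA; other characters pass through unchanged.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def orient_read(seq, barcodes):
    """Return (barcode, read in 5'-3' orientation), or None if no barcode is found.

    Step b: look for a barcode at the 5' end, or its reverse complement
    at the 3' end. Step c: reverse complement reads of the second kind.
    """
    seq = seq.upper()
    for bc in barcodes:
        if seq.startswith(bc):
            return bc, seq
    for bc in barcodes:
        if seq.endswith(reverse_complement(bc)):
            return bc, reverse_complement(seq)
    return None


barcodes = ["CGTCGCAT"]
forward = "CGTCGCAT" + "ACGTACGTAA"    # barcode at the 5' end
reverse = reverse_complement(forward)  # same read in the opposite orientation

print(orient_read(forward, barcodes))
print(orient_read(reverse, barcodes))
```

Both calls should return the same oriented read. A read whose barcode sits at the 3' end as a reverse complement would not be found by a search that only looks at the start of each read.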

Another possibility is, as they say in e), to convert the .fasta and .qual files to fastq and continue the pipeline from there. But I really want to know what is going on and compare my results.

Thank you for your time!

Embriette

Sep 9, 2016, 10:44:56 AM
to Qiime 1 Forum
Hi Matias,

Can you please send me your entire log file from split_libraries_fastq.py?

Thanks!

Embriette

Matías Di Paola

Sep 9, 2016, 12:08:45 PM
to Qiime 1 Forum
Sure, here they are!
histograms.txt
split_library_log.txt

Embriette

Sep 10, 2016, 1:36:51 PM
to Qiime 1 Forum
Hi Matias,

Based on your log file, everything looks fine. You have at least 7000 seqs in each sample, with the highest just above 17000. Were your samples sequenced together with other samples? Oftentimes multiple projects are sequenced together, and in those cases, the barcodes that don't match your mapping file belong to the samples from other project(s). 
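To see whether the unmatched barcodes look like samples from another project, one option is to tally them by frequency. The sketch below is hypothetical helper code, not a QIIME function; the observed barcodes are taken from the output posted earlier in this thread, and the one-entry mapping list is an assumption made purely for illustration.

```python
from collections import Counter


def tally_barcodes(barcode_seqs, mapping_barcodes):
    """Split observed barcodes into those found in the mapping file and the rest."""
    known = set(mapping_barcodes)
    matched, unmatched = Counter(), Counter()
    for bc in barcode_seqs:
        if bc in known:
            matched[bc] += 1
        else:
            unmatched[bc] += 1
    return matched, unmatched


# Assumed mapping file content, for illustration only.
mapping_barcodes = ["CGTCGCAT"]
observed = ["GTGTTACG", "GTTTTACA", "CGTCGCAT", "GTGTTACG"]

matched, unmatched = tally_barcodes(observed, mapping_barcodes)
print("matched:", sum(matched.values()))
print("most common unmatched:", unmatched.most_common(5))
```

If the unmatched reads are dominated by a small number of high-count barcodes, that is consistent with other samples sharing the run. If they are spread over many low-count sequences, barcode errors or a wrong extraction position would be more plausible explanations.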

Thanks!

Embriette

Ajinkya Kulkarni

Sep 13, 2016, 6:49:56 AM
to Qiime 1 Forum
Hi Matias,
If you look at the first two barcodes you showed, namely:

@M02542:101:000000000-AHH2T:1:1101:10795:1039 1:N:0:8
GTGTTACG
+
CCCCCGGG
@M02542:101:000000000-AHH2T:1:1101:10977:1040 1:N:0:8
GTTTTACA
+
CCCCCGGG

they are not in your mapping file (I checked the mapping file you used for your first assessment). The third barcode in the list does correspond to one in your mapping file, though. So I am not sure that the raw reads contain only your samples. In my experience with Mr. DNA, the full.fasta and full.qual files contain just your samples and not the others, but I am not sure what is in the raw reads. As Embriette mentioned, you probably have sequences from other data sets.

Also, I have worked with Mr. DNA before. I converted the fasta and qual files to fastq and then demultiplexed with the forward-end barcodes and the mapping file provided by Mr. DNA. It worked just fine: I only lost sequences to quality truncation, and none to the aforementioned error.
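For reference, the fasta plus qual to fastq conversion can be sketched in a few lines. This is a rough illustration under assumptions, not the tool I actually used: the function names are hypothetical, the records are made up, and it assumes both files list the same reads in the same order with Phred+33 output encoding. QIIME 1 also ships a convert_fastaqual_fastq.py script, which would be the more usual route.

```python
import io


def parse_records(handle):
    """Yield (read_id, list of body lines) from FASTA-style text."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, chunks
            header, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, chunks


def fastaqual_to_fastq(fasta_handle, qual_handle, offset=33):
    """Yield FASTQ records built from paired FASTA and QUAL records."""
    pairs = zip(parse_records(fasta_handle), parse_records(qual_handle))
    for (seq_id, seq_chunks), (qual_id, qual_chunks) in pairs:
        seq = "".join(seq_chunks)
        scores = [int(s) for s in " ".join(qual_chunks).split()]
        if seq_id != qual_id or len(seq) != len(scores):
            raise ValueError("FASTA and QUAL records disagree for %s" % seq_id)
        quality = "".join(chr(s + offset) for s in scores)
        yield "@%s\n%s\n+\n%s\n" % (seq_id, seq, quality)


# Made-up single record for illustration.
fasta = io.StringIO(">read1\nCGTCGCAT\n")
qual = io.StringIO(">read1\n34 34 34 34 34 38 38 38\n")
records = list(fastaqual_to_fastq(fasta, qual))
print(records[0], end="")
```

Note that `zip` stops at the shorter input, so a real conversion should also confirm that both files contain the same number of records.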
Cheers,
Ajinkya

