Variable barcode lengths in one mapping file

Jeongsu Kim

Jan 13, 2017, 1:49:36 AM
to qiime...@googlegroups.com
I have various lengths of barcode in a mapping file.

My mapping file is:

#SampleID  BarcodeSequence  LinkerPrimerSequence   Treatment     ReversePrimer      Wk  Description
CON4W.4    AGAGCTG          ACGAGTTTGATCMTGGCTCAG  Control4      WTTACCGCGGCTGCTGG  7   CON_4W_4
CON4W.6    TCGTCAT          ACGAGTTTGATCMTGGCTCAG  Control4      WTTACCGCGGCTGCTGG  7   CON_4W_6
RES4W.6    TCAGATG          ACGAGTTTGATCMTGGCTCAG  Restriction4  WTTACCGCGGCTGCTGG  7   RES_4W_6
RES4W.7    TCGAGTAG         ACGAGTTTGATCMTGGCTCAG  Restriction4  WTTACCGCGGCTGCTGG  7   RES_4W_7

My barcodes are either 7 bp or 8 bp long.
When I ran split_libraries.py, I got a ValueError:

ValueError: Mapping file has variable length barcodes.  If this is intended, specifiy variable lengths with the -b variable_length option.

How can I solve this problem?
How do I use the -b option when one mapping file contains two different barcode lengths?

Thank you in advance.
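
A quick way to confirm which samples carry which barcode length is to group the sample IDs by barcode length. The following is a minimal Python sketch, not part of QIIME; it assumes a tab-delimited mapping file whose second column is BarcodeSequence, and the function name barcode_lengths is made up for illustration. The example rows are shortened to the first three columns.

```python
from collections import defaultdict


def barcode_lengths(mapping_lines):
    """Group sample IDs by the length of their barcode.

    Assumes tab-delimited lines with SampleID in column 1 and
    BarcodeSequence in column 2 (an assumption based on the header above).
    """
    by_length = defaultdict(list)
    for line in mapping_lines:
        line = line.strip()
        # Skip blank lines and the '#SampleID' header or other comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        sample_id, barcode = fields[0], fields[1]
        by_length[len(barcode)].append(sample_id)
    return dict(by_length)


rows = [
    "#SampleID\tBarcodeSequence\tLinkerPrimerSequence",
    "CON4W.4\tAGAGCTG\tACGAGTTTGATCMTGGCTCAG",
    "RES4W.7\tTCGAGTAG\tACGAGTTTGATCMTGGCTCAG",
]
print(barcode_lengths(rows))
```

If more than one length shows up in the result, the mapping file has variable-length barcodes in the sense of the error message above.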

Stefan Janssen

Jan 13, 2017, 11:41:15 AM
to Qiime 1 Forum
Just add '-b variable_length' to your existing split_libraries.py call.

Jeongsu Kim

Jan 13, 2017, 7:48:58 PM
to Qiime 1 Forum
How do I add the -b option in this case?
For example, if I have 7 bp and 8 bp barcodes in one mapping file, should I just add -b 7 -b 8?
I'm new to QIIME and need help... :(

Stefan Janssen

Jan 13, 2017, 8:12:33 PM
to Qiime 1 Forum
You don't write a number like 7, and you don't specify multiple numbers. All you do is write -b variable_length, literally.

Jeongsu Kim

Jan 13, 2017, 11:57:17 PM
to qiime...@googlegroups.com

Thanks for your help! :D

Jeongsu Kim

Jan 14, 2017, 12:46:33 AM
to qiime...@googlegroups.com
I have another problem.

I ran split_libraries.py as shown below.
==================================================================================================

split_libraries.py -m Co1_Mapping.txt -f combined1_seqs.fna -q combined1_seqs.qual -b variable_length -o Split_Library_Run1_Output/ -n 1000000

==================================================================================================

But the resulting seqs.fna file was empty.
I have no idea what is causing this.
The log file is attached below.
Thanks again!

+++ Split library log.txt +++

Mean qual score below minimum of 25 1236
Max homopolymer run exceeds limit of 6 394
Num mismatches in primer exceeds limit of 0: 0

Sequence length details for all sequences passing quality filters:
No sequences passed quality filters for writing.

Barcodes corrected/not 0/0
Uncorrected barcodes will not be written to the output fasta file.
Corrected barcodes will be written with the appropriate barcode category.
Corrected but unassigned sequences will not be written unless --retain_unassigned_reads is enabled.

Total valid barcodes that are not in mapping file 0
Sequences associated with valid barcodes that are not in the mapping file will not be written.

Barcodes in mapping file
Sample Sequence Count Barcode
RES1W.1 0 TCTGCAG
CON1W.2 0 TCGTCAT
CON1W.4 0 TCAGATG
RES4W.3 0 TAGCTACG
RES1W.5 0 TACAGCAG
RES4W.4 0 CTACACAG
CON1W.5 0 CGATGAG
RES1W.4 0 ATGCTGAG
CON4W.1 0 ATCGTGTG
RES1W.3 0 AGCGATG
CON1W.3 0 AGAGCTG

Total number seqs written 0

Stefan Janssen

Jan 14, 2017, 1:19:56 AM
to Qiime 1 Forum
What is the length of your sequences?
You might need to lower the quality threshold: --min_qual_score

Jeongsu Kim

Jan 14, 2017, 9:50:18 AM
to qiime...@googlegroups.com
I tried -s 20 and -s 18, but neither worked (the seqs.fna file was still empty).
Actually, I'm confused. Is there a generally accepted cutoff for min_qual_score?


++ split_library_log:
Number raw input seqs 107508

Length outside bounds of 200 and 1000 8926
Num ambiguous bases exceeds limit of 6 2
Missing Qual Score 56699
Mean qual score below minimum of 18 10
Max homopolymer run exceeds limit of 6 409

Stefan Janssen

Jan 14, 2017, 12:06:54 PM
to Qiime 1 Forum
Have a look at this line: Length outside bounds of 200 and 1000 8926
8926 of your sequences are either too long or too short.
You should get results after adjusting these two parameters:
-l, --min_seq_length
    Minimum sequence length, in nucleotides [default: 200]
-L, --max_seq_length
    Maximum sequence length, in nucleotides [default: 1000]
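
To choose values for -l and -L, it can help to look at the read lengths first. The following is a minimal Python sketch, assuming plain FASTA input; the helper names are hypothetical and the toy reads are made up. It counts the full read as stored in the file and treats the bounds as inclusive, which may differ from how split_libraries.py itself measures length (for example, with respect to barcode and primer).

```python
def read_fasta_lengths(lines):
    """Yield (label, sequence_length) for each record in FASTA lines."""
    label, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if label is not None:
                yield label, length
            # The label is the first whitespace-delimited token of the header.
            parts = line[1:].split()
            label, length = (parts[0] if parts else ""), 0
        elif line and label is not None:
            length += len(line)
    if label is not None:
        yield label, length


def count_outside_bounds(lines, min_len=200, max_len=1000):
    """Return (too_short, too_long) counts; bounds are treated as inclusive."""
    too_short = too_long = 0
    for _, length in read_fasta_lengths(lines):
        if length < min_len:
            too_short += 1
        elif length > max_len:
            too_long += 1
    return too_short, too_long


# Made-up reads of 240, 40 and 1200 bases.
toy_reads = [
    ">read_1 some description", "ACGT" * 60,
    ">read_2", "ACGT" * 10,
    ">read_3", "ACGT" * 300,
]
print(count_outside_bounds(toy_reads))
print(count_outside_bounds(toy_reads, min_len=20))
```

An open file object can be passed in place of the list, since it also iterates over lines.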


Jeongsu Kim

Jan 14, 2017, 9:59:29 PM
to qiime...@googlegroups.com
I really appreciate your help!

I tried this:

split_libraries.py -m Co1_Mapping.txt -f combined1_seqs.fna -q combined1_seqs.qual -b variable_length -l 20 -s 10 -H 8 -o Split_Library_Run1_Output/ -n 1000000

but I still cannot get a usable fna file.

++ split_library_log:
Number raw input seqs 107508

Length outside bounds of 20 and 1000 0
Num ambiguous bases exceeds limit of 6 3
Missing Qual Score 61757
Mean qual score below minimum of 10 0
Max homopolymer run exceeds limit of 8 6

Jeongsu Kim

Jan 15, 2017, 5:46:31 AM
to qiime...@googlegroups.com
I thought the quality score file might need to be filtered, so I tried the following.

---  quality_scores_plot.py -q combined1_seqs.qual -o quality_histograms/ 

--- truncate_fasta_qual_files.py -f combined1_seqs.fna -q combined1_seqs.qual -b 421 -o filtered_seqs1/



But there was an error:
--- ValueError: Fasta label RES4W.3_1068241 not found in quality score file.

What can I do about this problem? :(
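
The ValueError says that a label in the FASTA file has no counterpart in the quality score file. The following is a minimal Python sketch for listing all such labels. It assumes both files use '>' header lines and that the label is the first whitespace-delimited token of the header; the function names, the second read label, and the sequence and score values are made up for illustration.

```python
def read_labels(lines):
    """Return the label of every '>' header line, in file order."""
    labels = []
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            labels.append(parts[0] if parts else "")
    return labels


def labels_missing_from_qual(fasta_lines, qual_lines):
    """Return FASTA labels that do not occur in the quality score file."""
    qual_labels = set(read_labels(qual_lines))
    return [label for label in read_labels(fasta_lines)
            if label not in qual_labels]


# Toy data: the first read has no entry in the quality file.
fasta_lines = [
    ">RES4W.3_1068241 extra info", "ACGT",
    ">RES4W.3_1068242", "TTGA",
]
qual_lines = [
    ">RES4W.3_1068242", "30 30 28 25",
]
print(labels_missing_from_qual(fasta_lines, qual_lines))
```

A long list of missing labels would suggest that the fna and qual files were not produced from the same set of reads.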



Stefan Janssen

Jan 15, 2017, 12:15:26 PM
to Qiime 1 Forum
Would you mind sending your fasta and qual files to sjan...@ucsd.edu?
It's hard to track down the problem without having those files in hand.

Jeongsu Kim

Jan 15, 2017, 10:01:27 PM
to Qiime 1 Forum
I sent you an email.
Thanks a lot :)

Stefan Janssen

Jan 18, 2017, 12:56:09 AM
to Qiime 1 Forum
Hi,
I received your mapping file and the SFF files. To me, it looks like there are only 4 samples in the mapping file, and running process_sff.py produces 4 fna and 4 qual files. Thus, I assume that there is no need to run split_libraries.py at all, because the samples were never multiplexed!
Try applying the quality truncation separately to each of the four samples.

Jeongsu Kim

Jan 24, 2017, 2:20:45 AM
to qiime...@googlegroups.com
Hi Stefan,
I think I'm lost... :(

Let me explain my situation.
First, I have two 454 runs with 15 samples in total (the first run with 11 samples, the second with 4).
Some barcode sequences were used in both runs, so I worked with two separate mapping files when running the QIIME commands listed below.

The QIIME commands I used are:
1. Converting sff files to fna and qual files: process_sff.py
2. Filtering fna and qual files: truncate_fasta_qual_files.py
3. Combining individual fna/qual files into a single fna/qual file: add_qiime_labels.py

So I now have two different combined fna files and qual files.
Now I need to combine those two fna files into a single fna file to continue with the next analysis steps (OTU picking etc.).

How can I generate a combined fna/qual file that contains the sequences and quality scores of all 15 samples?
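
Since both combined files are already labeled per sample, one option is plain concatenation with a check that no read label appears twice. The following is a minimal Python sketch under that assumption; combine_fasta is a hypothetical helper rather than a QIIME script, and the reads shown are made up. Quality files that use the same '>' header layout could be handled the same way.

```python
def combine_fasta(*sources):
    """Concatenate FASTA line sources, refusing duplicate read labels."""
    seen = set()
    combined = []
    for lines in sources:
        for line in lines:
            line = line.rstrip("\n")
            if not line:
                continue
            if line.startswith(">"):
                # The label is the first whitespace-delimited header token.
                parts = line[1:].split()
                label = parts[0] if parts else ""
                if label in seen:
                    raise ValueError("Duplicate label: %s" % label)
                seen.add(label)
            combined.append(line)
    return combined


# Made-up reads standing in for the two runs.
run1 = [">RES4W.3_1 first run", "ACGTACGT"]
run2 = [">RES4W.7_1 second run", "TTGACCAA"]
for line in combine_fasta(run1, run2):
    print(line)
```

Open file objects can be passed directly as sources, since they iterate over lines.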


Stefan Janssen

Jan 24, 2017, 12:03:50 PM
to Qiime 1 Forum
Hi Jeongsu,

I would use the following command for OTU picking, where SEQ.FNA is your fasta file for all samples (you don't need quality files for OTU picking). The file given to -r is the Greengenes representative set, and it will almost certainly be at a different location on your machine. Run print_qiime_config.py to check the location of pick_otus_reference_seqs_fp.

pick_otus.py -m sortmerna -i SEQ.FNA -r lib/python2.7/site-packages/qiime_default_reference/gg_13_8_otus/rep_set/97_otus.fasta

The file SEQ.FNA must be organized so that it is clear which sequence belongs to which of your 15 samples. Let's assume four of your 15 sample IDs are CON4W4, CON4W6, RES4W6, and RES4W7. All sequences for CON4W4 must come first. Their header lines must start with >CON4W4_x yyy, where x is some number for the read and everything after the first whitespace, i.e. yyy, can be anything. The sample ID is everything following the > symbol and to the left of the very first _ symbol.
I think add_qiime_labels.py is there to format the single SEQ.FNA in exactly this way, but in your case, since the sequences are already demultiplexed and you have two runs with overlapping barcode sequences (so you need two separate demultiplexing runs), you can manually compose a SEQ.FNA following the format described above. Once you have this file, you can continue with the standard QIIME analysis.
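
The labeling rule described above can be sketched in Python as follows. This follows the description in this post (everything after '>' and before the first underscore); it is not QIIME's own parsing code, and the function name is hypothetical.

```python
from collections import Counter


def sample_id_from_header(header):
    """Return the sample ID encoded in a FASTA header line."""
    # Drop the leading '>' and keep only the label before the first whitespace.
    parts = header.lstrip(">").split()
    label = parts[0] if parts else ""
    # The sample ID is everything left of the first underscore.
    return label.split("_", 1)[0]


# Made-up headers in the >SampleID_x yyy layout.
headers = [">CON4W4_1 yyy", ">CON4W4_2 anything here", ">RES4W7_1"]
reads_per_sample = Counter(sample_id_from_header(h) for h in headers)
print(reads_per_sample)
```

Counting reads per sample this way is a quick check that every one of the 15 sample IDs actually shows up in SEQ.FNA.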
Good luck!

Jeongsu Kim

Jan 27, 2017, 8:26:36 AM
to Qiime 1 Forum
Hi Stefan,

So you mean I do not need a single fna file for OTU picking?
I also did not understand what you mean by "manually" composing an fna file.

Thanks, as always, for your advice :-)

Stefan Janssen

Jan 30, 2017, 12:52:18 PM
to Qiime 1 Forum
Forget about the manual creation of one big .fna file.
Yes, you can do OTU picking for each sample independently and merge the resulting .biom files into one file holding all samples.

Jeongsu Kim

Jan 31, 2017, 5:16:58 AM
to Qiime 1 Forum
thanks! :)