Various barcode lengths in one mapping file

Jeongsu Kim

Jan 13, 2017, 1:49:36 AM

to qiime...@googlegroups.com

I have barcodes of different lengths in one mapping file.

My mapping file is:

#SampleID	BarcodeSequence	LinkerPrimerSequence	Treatment	ReversePrimer	Wk	Description

CON4W.4	AGAGCTG	ACGAGTTTGATCMTGGCTCAG	Control4	WTTACCGCGGCTGCTGG	7	CON_4W_4
CON4W.6	TCGTCAT	ACGAGTTTGATCMTGGCTCAG	Control4	WTTACCGCGGCTGCTGG	7	CON_4W_6
RES4W.6	TCAGATG	ACGAGTTTGATCMTGGCTCAG	Restriction4	WTTACCGCGGCTGCTGG	7	RES_4W_6
RES4W.7	TCGAGTAG	ACGAGTTTGATCMTGGCTCAG	Restriction4	WTTACCGCGGCTGCTGG	7	RES_4W_7

The barcodes are either 7 bp or 8 bp long.

When I ran split_libraries.py, it raised a ValueError:

ValueError: Mapping file has variable length barcodes. If this is intended, specifiy variable lengths with the -b variable_length option.

How can I solve this problem?

How do I use the -b option when one mapping file contains two different barcode lengths?

Thank you in advance.
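
A quick way to confirm which barcode lengths a mapping file actually contains is a small standalone script. The sketch below is not part of QIIME: the function name barcode_length_counts is made up for illustration, and it assumes a tab-separated file whose first line is the header and includes a BarcodeSequence column. The example rows are taken from the mapping file above, with some columns left out for brevity.

```python
from collections import Counter


def barcode_length_counts(mapping_text):
    """Count how many samples use each barcode length.

    Assumes a tab-separated mapping file whose first non-empty line is the
    header (starting with '#SampleID') and contains 'BarcodeSequence'.
    """
    rows = [line for line in mapping_text.splitlines() if line.strip()]
    header = rows[0].lstrip("#").split("\t")
    barcode_col = header.index("BarcodeSequence")

    lengths = Counter()
    for line in rows[1:]:
        if line.startswith("#"):  # skip any further comment lines
            continue
        fields = line.split("\t")
        lengths[len(fields[barcode_col])] += 1
    return dict(lengths)


# Rows from the mapping file in this thread (columns trimmed).
example = "\n".join([
    "#SampleID\tBarcodeSequence\tLinkerPrimerSequence\tDescription",
    "CON4W.4\tAGAGCTG\tACGAGTTTGATCMTGGCTCAG\tCON_4W_4",
    "CON4W.6\tTCGTCAT\tACGAGTTTGATCMTGGCTCAG\tCON_4W_6",
    "RES4W.6\tTCAGATG\tACGAGTTTGATCMTGGCTCAG\tRES_4W_6",
    "RES4W.7\tTCGAGTAG\tACGAGTTTGATCMTGGCTCAG\tRES_4W_7",
])

counts = barcode_length_counts(example)
print(counts)  # three 7 bp barcodes and one 8 bp barcode
```

If the result has more than one key, the mapping file mixes barcode lengths, which is the situation the ValueError describes.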

Stefan Janssen

Jan 13, 2017, 11:41:15 AM

to Qiime 1 Forum

Just add the string '-b variable_length' to your existing split_libraries.py call.

Jeongsu Kim

Jan 13, 2017, 7:48:58 PM

to Qiime 1 Forum

How do I add the -b option in this case?

For example, if I have 7 bp and 8 bp barcodes in a mapping file, should I just add -b 7 -b 8 like this?

I'm new to QIIME and need help... :(

Stefan Janssen

Jan 13, 2017, 8:12:33 PM

to Qiime 1 Forum

You don't write a number like 7, and you don't specify multiple numbers either. All you do is write -b variable_length

Jeongsu Kim

Jan 13, 2017, 11:57:17 PM

to qiime...@googlegroups.com

thanks for your help! :D

Jeongsu Kim

Jan 14, 2017, 12:46:33 AM

to qiime...@googlegroups.com

I have another problem.

I ran split_libraries.py as shown below.

==================================================================================================

split_libraries.py -m Co1_Mapping.txt -f combined1_seqs.fna -q combined1_seqs.qual -b variable_length -o Split_Library_Run1_Output/ -n 1000000

==================================================================================================

But the resulting seqs.fna file was empty.

I have no idea what is causing this.

I have attached the log file below.

Thanks again!

+++ Split library log.txt +++

Mean qual score below minimum of 25 1236

Max homopolymer run exceeds limit of 6 394

Num mismatches in primer exceeds limit of 0: 0

Sequence length details for all sequences passing quality filters:

No sequences passed quality filters for writing.

Barcodes corrected/not 0/0

Uncorrected barcodes will not be written to the output fasta file.

Corrected barcodes will be written with the appropriate barcode category.

Corrected but unassigned sequences will not be written unless --retain_unassigned_reads is enabled.

Total valid barcodes that are not in mapping file 0

Sequences associated with valid barcodes that are not in the mapping file will not be written.

Barcodes in mapping file

Sample Sequence Count Barcode

RES1W.1 0 TCTGCAG

CON1W.2 0 TCGTCAT

CON1W.4 0 TCAGATG

RES4W.3 0 TAGCTACG

RES1W.5 0 TACAGCAG

RES4W.4 0 CTACACAG

CON1W.5 0 CGATGAG

RES1W.4 0 ATGCTGAG

CON4W.1 0 ATCGTGTG

RES1W.3 0 AGCGATG

CON1W.3 0 AGAGCTG

Total number seqs written 0

Stefan Janssen

Jan 14, 2017, 1:19:56 AM

to Qiime 1 Forum

What is the length of your sequences?
You might need to decrease the quality threshold: --min_qual_score

Jeongsu Kim

Jan 14, 2017, 9:50:18 AM

to qiime...@googlegroups.com

I tried -s 20 and -s 18, but neither worked (the seqs.fna file was still empty).

Actually, I'm confused. Is there a generally accepted cutoff for min_qual_score?

++ split_library_log:

Number raw input seqs 107508

Length outside bounds of 200 and 1000 8926

Num ambiguous bases exceeds limit of 6 2

Missing Qual Score 56699

Mean qual score below minimum of 18 10

Max homopolymer run exceeds limit of 6 409

Stefan Janssen

Jan 14, 2017, 12:06:54 PM

to Qiime 1 Forum

Have a look at this line: Length outside bounds of 200 and 1000 8926
These 8926 sequences are either too long or too short.
You should get results after adjusting these two parameters:

-l, --min_seq_length: Minimum sequence length, in nucleotides [default: 200]
-L, --max_seq_length: Maximum sequence length, in nucleotides [default: 1000]
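
Before picking values for -l and -L, it can help to look at the read lengths in the fasta file directly. The following is a minimal sketch in plain Python, independent of QIIME. The helper names fasta_lengths and count_outside_bounds are hypothetical, and it counts the raw length of each record as stored in the file; whether split_libraries.py measures length in exactly the same way, or treats the bounds as inclusive, is not something this sketch verifies.

```python
def fasta_lengths(lines):
    """Yield (label, length) for each record in FASTA-formatted lines."""
    label, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if label is not None:
                yield label, length
            parts = line[1:].split()
            label, length = (parts[0] if parts else ""), 0
        elif line and label is not None:
            length += len(line)
    if label is not None:
        yield label, length


def count_outside_bounds(lines, min_len=200, max_len=1000):
    """Count records shorter than min_len or longer than max_len."""
    return sum(
        1 for _, n in fasta_lengths(lines) if n < min_len or n > max_len
    )


# Tiny made-up example; for a real file, pass an open file handle instead,
# e.g. `with open("combined1_seqs.fna") as handle: ...`
toy = [">read1 extra", "ACGTACGT", "ACGT", ">read2", "ACG"]
lengths = list(fasta_lengths(toy))
too_short_or_long = count_outside_bounds(toy, min_len=5, max_len=20)
print(lengths, too_short_or_long)
```

Sorting the lengths, or printing the minimum and maximum, gives a rough idea of which bounds would keep most reads.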

Jeongsu Kim

Jan 14, 2017, 9:59:29 PM

to qiime...@googlegroups.com

I really appreciate your help!

I tried this: split_libraries.py -m Co1_Mapping.txt -f combined1_seqs.fna -q combined1_seqs.qual -b variable_length -l 20 -s 10 -H 8 -o Split_Library_Run1_Output/ -n 1000000

But I'm still having trouble getting an fna file.

++ split_library_log:

Number raw input seqs 107508

Length outside bounds of 20 and 1000 0

Num ambiguous bases exceeds limit of 6 3

Missing Qual Score 61757

Mean qual score below minimum of 10 0

Max homopolymer run exceeds limit of 8 6

Jeongsu Kim

Jan 15, 2017, 5:46:31 AM

to qiime...@googlegroups.com

Since I thought the quality score file needed to be filtered, I tried the following.

--- quality_scores_plot.py -q combined1_seqs.qual -o quality_histograms/

--- truncate_fasta_qual_files.py -f combined1_seqs.fna -q combined1_seqs.qual -b 421 -o filtered_seqs1/

But there was an error.

--- ValueError: Fasta label RES4W.3_1068241 not found in quality score file.

What can I do about this problem now? :(
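
The "Fasta label ... not found in quality score file" error, like the large "Missing Qual Score" counts in the logs above, suggests that the fasta and qual files do not contain the same set of record labels. One way to check this outside of QIIME is sketched below. The function names are hypothetical, and the sketch assumes both files use FASTA-style headers where the label is the text between ">" and the first whitespace. The label RES4W.3_1068241 comes from the error message above; the second label is invented for the example.

```python
def record_labels(lines):
    """Return the set of labels (text after '>' up to the first whitespace)."""
    labels = set()
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                labels.add(parts[0])
    return labels


def labels_missing_from_qual(fasta_lines, qual_lines):
    """Return sorted fasta labels that have no matching record in the qual file."""
    return sorted(record_labels(fasta_lines) - record_labels(qual_lines))


# Toy data; with real files, pass open file handles, e.g.
# open("combined1_seqs.fna") and open("combined1_seqs.qual").
fasta = [">RES4W.3_1068241 some comment", "ACGT", ">RES4W.3_1068242", "ACGT"]
qual = [">RES4W.3_1068242", "30 30 30 30"]

missing = labels_missing_from_qual(fasta, qual)
print(missing)
```

A long list of missing labels would point to the two files having been built from different inputs or relabeled differently.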

Stefan Janssen

Jan 15, 2017, 12:15:26 PM

to Qiime 1 Forum

Would you mind sending your fasta and qual files to sjan...@ucsd.edu?
It's hard to track the problem down without having those files at hand.

Jeongsu Kim

Jan 15, 2017, 10:01:27 PM

to Qiime 1 Forum

I sent you an email.

Thanks a lot :)

Stefan Janssen

Jan 18, 2017, 12:56:09 AM

to Qiime 1 Forum

Hi,
I received your mapping file and the SFF files. To me, it looks like there are only 4 samples in the mapping file, and running process_sff.py produces 4 fna and 4 qual files. Thus, I assume that there is no need to run split_libraries.py at all, because the samples were never multiplexed!
Try to apply the quality truncation separately for each of the four samples.

Jeongsu Kim

Jan 24, 2017, 2:20:45 AM

to qiime...@googlegroups.com

Hi Stefan,

I think I'm lost... :(

Let me explain my situation.

First, I have two 454 runs with 15 samples in total (11 samples in the first run and 4 in the second).

Some barcode sequences were used in both runs, so I worked with two separate mapping files when running the QIIME commands listed below.

The QIIME commands I used are:

1. Converting sff files to fna and qual files: process_sff.py

2. Filtering fna and qual files: truncate_fasta_qual_files.py

3. Combining individual fna/qual files into a single fna/qual file: add_qiime_labels.py

So I now have two combined fna files and two combined qual files.

Now I need to combine those two fna files into a single fna file for the next analysis steps (OTU picking, etc.).

How can I generate a combined fna/qual file that contains the sequences and quality scores of all 15 samples?

Stefan Janssen

Jan 24, 2017, 12:03:50 PM

to Qiime 1 Forum

Hi Jeongsu,

I would use the following command for OTU picking, where SEQ.FNA is your fasta file for all samples (you don't need quality files for OTU picking). The file passed to -r is the Greengenes representative set, which will almost certainly be at a different location on your machine. Run print_qiime_config.py to check the location of pick_otus_reference_seqs_fp on your system.

pick_otus.py -m sortmerna -i SEQ.FNA -r lib/python2.7/site-packages/qiime_default_reference/gg_13_8_otus/rep_set/97_otus.fasta

The file SEQ.FNA must be sorted in such a way that it is clear which sequence belongs to which of your 15 samples. Let's assume four of your 15 sample IDs are CON4W4, CON4W6, RES4W6, and RES4W7. All sequences for CON4W4 must come first. Their header lines must start with >CON4W4_x yyy, where x is some number for the read and everything after the first whitespace, i.e. yyy, can be anything. The sample ID is everything after the > symbol and to the left of the very first _ symbol.
I think add_qiime_labels.py is there to format the single SEQ.FNA in exactly this way, but in your case, since the sequences are already demultiplexed and you have two runs with overlapping barcode sequences (so you would need two separate demultiplexing runs), you can manually compose a SEQ.FNA following the format described above. Once you have this file, you can continue with the standard QIIME analysis.
Good luck!
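
The header rule described above (sample ID = everything after ">" and before the first "_") can be checked on an existing fna file with a few lines of Python. This is only a sketch of that rule as stated in this post, not QIIME's own parser, and the names sample_id_from_header and reads_per_sample are hypothetical.

```python
from collections import Counter


def sample_id_from_header(header):
    """Return the text after '>' and before the first '_' of the label."""
    if not header.startswith(">"):
        raise ValueError("not a FASTA header: %r" % header)
    parts = header[1:].split()
    if not parts:
        raise ValueError("empty FASTA header")
    # Only the label (up to the first whitespace) is considered.
    sample_id, sep, _read = parts[0].partition("_")
    if not sep:
        raise ValueError("no '_' in label: %r" % parts[0])
    return sample_id


def reads_per_sample(lines):
    """Count how many FASTA records belong to each sample ID."""
    return Counter(
        sample_id_from_header(line) for line in lines if line.startswith(">")
    )


# Toy records in the format described above.
toy = [
    ">CON4W4_1 yyy", "ACGT",
    ">CON4W4_2 yyy", "ACGT",
    ">RES4W7_3", "ACGT",
]
per_sample = reads_per_sample(toy)
print(dict(per_sample))
```

Running reads_per_sample over a combined file is a quick sanity check that every expected sample ID appears and none is misspelled.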

Jeongsu Kim

Jan 27, 2017, 8:26:36 AM

to Qiime 1 Forum

Hi Stefan,

So you mean I do not need a single fna file for OTU picking?

I also did not understand what you meant by "manually" composing an fna file.

Thanks, as always, for your advice :-)

Stefan Janssen

Jan 30, 2017, 12:52:18 PM

to Qiime 1 Forum

Forget about the manual creation of one big .fna file.
Yes, you can do OTU picking for each sample independently and merge the resulting .biom files into one holding all samples.

Jeongsu Kim

Jan 31, 2017, 5:16:58 AM

to Qiime 1 Forum

thanks! :)
