mapping file construction

pcorral

Aug 15, 2012, 11:13:20 PM

to qiime...@googlegroups.com

Hello QIIME community,

I am constructing the mapping file to enter into the QIIME workflow, but it's not clear to me how the fields in this file relate to what the sequencing center returned to me.

The sequencing provider gave me two FASTQ files, one with the forward reads and one with the reverse reads (as far as I know this was a paired-end run). After some in-house cleaning (keeping only the reads that I could assign to a sample, quality filtering, and converting FASTQ to FASTA and QUAL), I ended up with FASTA entries that look like this:

>MISEQ_0005_FC:1:1101:13831:1939#ACACCTC

TGGTCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTACCCGGATTTACTGGGTGTAAAGGGCGTGTAGGCGGTTTCTCAAGTCCGATGCTAAAG

What follows the '#' in the header is my sample barcode, which must correspond to BarcodeSequence in the mapping file. In my case, a single FASTA file has entries with different sample barcodes after the '#' sign. I really don't care about the ID of each sequence; it could be replaced by any other "alphanumeric or dot" name. Also, within the sequence, I know there is a plate barcode of 5 nucleotides (TGGTC), exactly the same in all the FASTA entries, followed by 19 nucleotides that were used as a linker (GTGCCAGCAGCCGCGGTAA) and that vary along the FASTA file, I guess due to sequencing errors. This must therefore correspond to LinkerPrimerSequence.

Could anyone suggest how to format the FASTA file and the mapping file to fill in the fields SampleID, BarcodeSequence, and LinkerPrimerSequence?

Many thanks in advance!!

Pau

Tony Walters

Aug 15, 2012, 11:21:20 PM

to qiime...@googlegroups.com

Hello Pau,

The filtering that's been done makes it slightly tricky to run split_libraries.py, but probably the easiest way to approach this would be to parse the FASTA file and copy the barcode to the beginning of the sequence.

So

>MISEQ_0005_FC:1:1101:13831:1939#ACACCTC

TGGTCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTACCCGGATTTACTGGGTGTAAAGGGCGTGTAGGCGGTTTCTCAAGTCCGATGCTAAAG

would become

>MISEQ_0005_FC:1:1101:13831:1939#ACACCTC

ACACCTCTGGTCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTACCCGGATTTACTGGGTGTAAAGGGCGTGTAGGCGGTTTCTCAAGTCCGATGCTAAAG

You would want to put all of the 6 base pair barcodes under the BarcodeSequence column, along with whatever SampleID you desired for each. Finally, you would want your LinkerPrimerSequence fields to be: ACACCTCTGGTCGTGCCAGCMGCCGCGGTAA

(usually there's the M degenerate character at that position so I added it in this example).

Then when you run split_libraries.py, use -b 6 to specify 6 base pair barcodes.
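The relabeling step described above can be sketched in Python. This is only a minimal sketch and not part of QIIME: the function names `prepend_barcodes` and `_with_barcode` are hypothetical, and it assumes every header ends in `#BARCODE` with nothing after the barcode.

```python
def _with_barcode(header, sequence):
    """Copy the barcode that follows '#' in the header onto the sequence."""
    if "#" not in header:
        raise ValueError("no '#barcode' suffix in header: %r" % header)
    # Assumption: everything after the last '#' is the sample barcode.
    barcode = header.rsplit("#", 1)[1]
    return header, barcode + sequence


def prepend_barcodes(fasta_lines):
    """Yield (header, sequence) pairs with the barcode prepended."""
    header, chunks = None, []
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield _with_barcode(header, "".join(chunks))
            header, chunks = line, []
        else:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield _with_barcode(header, "".join(chunks))


example = [
    ">MISEQ_0005_FC:1:1101:13831:1939#ACACCTC",
    "TGGTCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTACCCGGATTTACTGGGTGTAAAGGGCGTGTAGGCGGTTTCTCAAGTCCGATGCTAAAG",
]
for header, sequence in prepend_barcodes(example):
    print(header)
    print(sequence)
```

Reading the barcode from the header, rather than hard-coding its length, means the same sketch works whatever the barcode length turns out to be.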

Hope this helps,

Tony

--

pcorral

Aug 17, 2012, 3:24:00 PM

to qiime...@googlegroups.com

Hi Tony,

It worked. Thank you.

Just a small note on using split_libraries.py: the -b option had to be 7, not 6 as you mentioned.

Pau

pcorral

Aug 17, 2012, 4:52:28 PM

to qiime...@googlegroups.com

Hello,

I am following up on this thread because I have encountered something that I don't understand in the output of split_libraries.py with the mapping file I am constructing.

My mapping file looks like this (following Tony's advice in this thread):

#SampleID BarcodeSequence LinkerPrimerSequence Description

seq1 ACACCTC ACACCTCTGGTCGTGCCAGCMGCCGCGGTAA --

seq2 GACATCA GACATCATGGTCGTGCCAGCMGCCGCGGTAA --

seq3 TAAGGGA TAAGGGATGGTCGTGCCAGCMGCCGCGGTAA --

seq4 ACCTCCC ACCTCCCTGGTCGTGCCAGCMGCCGCGGTAA --

And this is the FASTA related to that Mapping file:

>seq1

ACACCTCTGGTCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTACCCGGATTTACTGGGTGTAAAGGGCGTGTAGGCGGTTTCTCAAGTCCGATGCTAAAG

>seq2

GACATCATGGTCGTGCCAGCAGCCGCGGTAAGACAGAGGGGGCAAGCGTTGTCCGGAGTCACTGGGCGTAAAGCGCGCGCAGGCGGCTGCCTAAGTGTCGTGTGAAAG

>seq3

TAAGGGATGGTCGTGCCAGCCGCCGCGGTGATACAGAGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGTAGGCGGATTTGCAAGTCGGGGGTTAAAG

>seq4

ACCTCCCTGGTCGTGCCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGTTGTGTGTGTCTATTGTCAAAG

The command I issue for split_libraries.py is:

split_libraries.py -m smll_map.txt -f smll.fasta -b 7 -l 60 -M 26

Here I had to raise -M (maximum number of primer mismatches [default 0]) to 26 to be able to retrieve the 4 sequences in 'seqs.fna'. What is more intriguing to me is why seqs.fna only contains FASTA sequences 70 nucleotides long. It has trimmed 7 nucleotides (TACGGAG) that I would consider real sequenced nucleotides, not linker or primer.

This is what seq1 looks like in seqs.fna:

>seq1_1 seq1 orig_bc=ACACCTC new_bc=ACACCTC bc_diffs=0

GGTGCGAGCGTTACCCGGATTTACTGGGTGTAAAGGGCGTGTAGGCGGTTTCTCAAGTCCGATGCTAAAG

Could someone explain to me the reason for that?

Pau

Tony Walters

Aug 17, 2012, 4:55:52 PM

to qiime...@googlegroups.com

Hello Pau,

It looks like you put in the barcode twice: once in the BarcodeSequence column, and again at the beginning of the LinkerPrimerSequence.

Try changing this:

#SampleID BarcodeSequence LinkerPrimerSequence Description

seq1 ACACCTC ACACCTCTGGTCGTGCCAGCMGCCGCGGTAA --

...

to this:

#SampleID BarcodeSequence LinkerPrimerSequence Description

seq1 ACACCTC TGGTCGTGCCAGCMGCCGCGGTAA --

and see if that works better.
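The arithmetic behind the 70-nucleotide output can be checked with a short Python sketch. It is only a model that matches the output shown in this thread, not QIIME's actual code: it assumes the barcode is removed first and that a stretch as long as LinkerPrimerSequence is removed next. The helper names `count_mismatches` and `trim` are hypothetical.

```python
# IUPAC codes needed for this primer; M stands for A or C.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "M": "AC"}


def count_mismatches(primer, read_start):
    """Count positions where the read base is not allowed by the primer."""
    return sum(base not in IUPAC[p] for p, base in zip(primer, read_start))


def trim(sequence, barcode, primer):
    """Drop the barcode, then as many bases as the primer is long."""
    rest = sequence[len(barcode):]
    return rest[len(primer):]


BARCODE = "ACACCTC"
READ = ("TGGTCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTACCCGGATTTACTGGGTG"
        "TAAAGGGCGTGTAGGCGGTTTCTCAAGTCCGATGCTAAAG")
SEQ = BARCODE + READ  # sequence as stored in the rebuilt FASTA file

good_primer = "TGGTCGTGCCAGCMGCCGCGGTAA"  # plate barcode + linker only
bad_primer = BARCODE + good_primer         # barcode repeated in the primer

print(len(trim(SEQ, BARCODE, bad_primer)))   # 70: TACGGAG is lost as well
print(len(trim(SEQ, BARCODE, good_primer)))  # 77: TACGGAG is kept
print(count_mismatches(good_primer, READ))   # 0: M matches the A in the read
# The longer primer is compared out of register with the read, so it
# collects many mismatches, which is why -M had to be raised so far.
print(count_mismatches(bad_primer, READ))
```

Under this model the 7 extra nucleotides removed are exactly the 7 barcode bases that were added to the front of LinkerPrimerSequence.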

-Tony

--

pcorral

Aug 17, 2012, 8:07:00 PM

to qiime...@googlegroups.com

Hello Tony,

Fine surgery this time!! I could even capture sequences with the default -M=0 in split_libraries.py

Thanks for the wise advice.

-Pau

pcorral

Aug 20, 2012, 12:44:26 PM

to qiime...@googlegroups.com

Hello,

It seems that problems arise when I move from a small test sample to the real data, and although I see what the problem is, I don't know how to adapt the mapping file and the FASTA file so that split_libraries.py and the rest of the workflow run. The problem is that I have repeated barcodes in the mapping file. I'll describe all the tips that the QIIME community has given me, to see whether any of them can be replaced or extended.

The mapping file construction starts with sequences that look like this:

>MISEQ_0005_FC:1:1101:13831:1939#ACACCTC

TGGTCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTACCCGGATTTACTGGGTGTAAAGGGCGTGTAGGCGGTTTCTCAAGTCCGATGCTAAAG

What follows "#" is my sample barcode, and of course, in the FASTA file many sequences share one sample barcode.

Following Tony's advice, I have rebuilt the FASTA file so that the barcode is at the beginning of the sequence, like this (note that the space in the sequence is only there for clarity):

>seq1

ACACCTC TGGTCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTACCCGGATTTACTGGGTGTAAAGGGCGTGTAGGCGGTTTCTCAAGTCCGATGCTAAAG

With this sequence layout, the MAP file looks like this:

#SampleID BarcodeSequence LinkerPrimerSequence Description

seq1 ACACCTC TGGTCGTGCCAGCMGCCGCGGTAA --

However, when I use a dataset with repeated barcodes, the MAP file reflects this, and the BarcodeSequence field contains repeated strings, as in:

#SampleID BarcodeSequence LinkerPrimerSequence Description

seq1 ACACCTC TGGTCGTGCCAGCMGCCGCGGTAA --

...

seq12 ACACCTC TGGTCGTGCCAGCMGCCGCGGTAA --

...

So far, check_id_map.py warns me about these repetitions, but as far as I can see the suggested corrected MAP file is exactly the same, repetitions included. split_libraries.py does not output results because of these repetitions.

The reason to use split_libraries.py is two-fold:

1) I want a MAP file to be constructed to be used in the downstream workflow

2) I want to have my FASTA sequences trimmed of barcode + Linker

I could do this trimming manually and enter the workflow directly at pick_reference_otus_through_otu_table.py, but then what would the layout of the MAP file be?

I hope there is a solution for this barcode repetition problem.

Thanks

-Pau

Tony Walters

Aug 20, 2012, 1:07:37 PM

to qiime...@googlegroups.com

Hello Pau,

Are those duplicate barcodes only present if you combine the two starting fasta files together? If all of the barcodes are unique with the files separated, you could make one mapping file for each fasta file (so each will have unique barcodes), run split_libraries.py on each, and cat the resulting seqs.fna files.

With this approach you would also want to use -n 1000000 on the first split_libraries.py and -n 2000000 on the second to make sure you get unique numbers following the output labels.

-Tony

--

pcorral

Aug 20, 2012, 1:43:58 PM

to qiime...@googlegroups.com

Hello Tony,

I don't think I understand your email, or it might be that I haven't explained myself clearly enough.

I have just 1 FASTA file, the forward file of a paired-end (PE) experiment, and I guess I am constructing just one MAP file.

What we noticed is that the layout in which the sequences came from the sequencing center makes it difficult to enter them into QIIME.

My forward FASTA file has entries like this before I manipulate it:

>MISEQ_0005_FC:1:1101:13831:1939#ACACCTC

TGGTCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTACCCGGATTTACTGGGTGTAAAGGGCGTGTAGGCGGTTTCTCAAGTCCGATGCTAAAG

I have up to 90 codes of the form "#xxxxxxx", and as you can imagine each one is repeated many times in the single forward FASTA file.

I call them sample barcodes because the samples were taken in different places, so each sequence has to be grouped with the others from the same sample.

I applied the suggestion you made earlier in this thread (put "xxxxxxx" at the beginning of the sequence), but that is when the MAP file ends up with repeated strings in BarcodeSequence.

-Pau

Tony Walters

Aug 20, 2012, 1:53:19 PM

to qiime...@googlegroups.com

Hello Pau,

This gets a bit trickier if there are multiple repeats of the same barcode in the sequences that need to be demultiplexed to different SampleIDs (if I'm understanding this correctly).

Is there some way to tell from the fasta label that sequences with the same #(barcode sequence) at the end of the label should be assigned to different SampleIDs?

And just to clarify: in the above example, seq1 and seq12 came from two *different* samples, but they have the same barcode sequence?

-Tony

--

Pau Corral

Aug 20, 2012, 3:21:06 PM

to qiime...@googlegroups.com

Hello Tony,

I saw where the problem was. I mistook SampleID for a sequence ID in the MAP file, and that's why I was trying to construct a MAP file with as many lines as there are sequences in the FASTA file. Error!!

What I should have done from the beginning is fill in one line per sample barcode, as in the example below, and, as you said, modify the FASTA sequences by putting this barcode at the beginning of each sequence.

#SampleID BarcodeSequence LinkerPrimerSequence Description

sample1 ACACCTC TGGTCGTGCCAGCMGCCGCGGTAA --

sample2 TTACCGG TGGTCGTGCCAGCMGCCGCGGTAA --

...

sample96 AACTGTA TGGTCGTGCCAGCMGCCGCGGTAA --

This way I have been able to see the heatmap plot.
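For anyone landing on this thread later, a one-line-per-barcode mapping file like the one above could be generated from the original FASTA headers with a few lines of Python. This is a sketch under assumptions: the helper name `build_mapping_rows` and the automatic `sampleN` names are hypothetical, the columns are joined with tabs because the mapping file is tab-delimited, and the real SampleID for each barcode still has to come from your own sample records.

```python
LINKER_PRIMER = "TGGTCGTGCCAGCMGCCGCGGTAA"  # plate barcode + linker from this thread


def build_mapping_rows(headers, linker_primer=LINKER_PRIMER):
    """Return mapping-file lines with one row per distinct barcode."""
    seen = set()
    barcodes = []
    for header in headers:
        # Assumption: the barcode is whatever follows the last '#'.
        barcode = header.rsplit("#", 1)[-1].strip()
        if barcode not in seen:  # keep each barcode once, in first-seen order
            seen.add(barcode)
            barcodes.append(barcode)
    rows = ["\t".join(["#SampleID", "BarcodeSequence",
                       "LinkerPrimerSequence", "Description"])]
    for number, barcode in enumerate(barcodes, start=1):
        # Placeholder SampleIDs; replace with the real sample names.
        rows.append("\t".join(["sample%d" % number, barcode, linker_primer, "--"]))
    return rows


headers = [
    ">MISEQ_0005_FC:1:1101:13831:1939#ACACCTC",
    ">made_up_read_2#TTACCGG",  # hypothetical label for illustration
    ">made_up_read_3#ACACCTC",  # same barcode as the first read, same sample
]
for row in build_mapping_rows(headers):
    print(row)
```

However many reads share a barcode, they collapse to a single row, which avoids the repeated BarcodeSequence entries that check_id_map.py complained about.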

Thanks!

-Pau

--
