Message from discussion mapping file construction

MIME-Version: 1.0
In-Reply-To: <83452311-d675-467a-8a19-af752990846b@googlegroups.com>
References: <308cb5a0-6b52-454c-bfdc-652887d1b247@googlegroups.com>
	<059faa37-be8e-4d26-b460-705efe18c134@googlegroups.com>
	<4138acbb-75a4-4df5-81e8-ec05baaacf03@googlegroups.com>
	<CAEkSiekGpphFZb=xPke9RN=WtKBg_aR2bfQ8Dy8X9wGDg=F...@mail.gmail.com>
	<2902dd51-6bad-4fd1-9933-78d3eaa67023@googlegroups.com>
	<082b072b-779d-407a-b8b4-6c5aabc93eba@googlegroups.com>
	<CAEkSie=x0b27qdg5gpjhRaMiHs5ski9z9nz_ABB2PpSUuUc...@mail.gmail.com>
	<83452311-d675-467a-8a19-af752990846b@googlegroups.com>
Date: Mon, 20 Aug 2012 11:53:19 -0600
Message-ID: <CAEkSie=i3G3jEec1R=N+ASvt-8j1brctQ+MSQAw=rj+cfBE...@mail.gmail.com>
Subject: Re: [qiime-forum 1.5.0] Re: mapping file construction
From: Tony Walters <william.a.walt...@gmail.com>
To: qiime-forum@googlegroups.com
Content-Type: multipart/alternative; boundary=f46d04478b955c1c7304c7b62f7f

--f46d04478b955c1c7304c7b62f7f
Content-Type: text/plain; charset=ISO-8859-1

Hello Pau,

This gets a bit trickier if there are multiple repeats of the same barcode
in the sequences that need to be demultiplexed to different SampleIDs (if
I'm understanding this correctly).

Is there some way to tell from the fasta label that sequences with the
same #(barcode sequence) at the end of the label should be assigned to
different SampleIDs?

And just to clarify: in the above example, seq1 and seq12 came from two
*different* samples, but they have the same barcode sequence?
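
If it turns out that reads sharing a barcode belong to the same sample (an
assumption; the thread has not confirmed this), the mapping file would need
one row per distinct barcode rather than one row per read. A minimal Python
sketch of that idea follows. The helper name mapping_rows, the sample1,
sample2 naming and the two short read labels are made up for illustration;
the linker/primer string and the "--" description are the ones already used
in this thread.

```python
# Sketch only: one mapping-file row per distinct barcode found in the headers.
LINKER_PRIMER = "TGGTCGTGCCAGCMGCCGCGGTAA"  # taken from the mapping file in this thread


def mapping_rows(headers, linker_primer=LINKER_PRIMER):
    """Return tab-separated mapping lines, one per unique '#BARCODE' suffix."""
    barcodes = []
    for header in headers:
        barcode = header.rsplit("#", 1)[1].strip()
        if barcode not in barcodes:  # keep first-seen order, skip repeats
            barcodes.append(barcode)
    rows = ["#SampleID\tBarcodeSequence\tLinkerPrimerSequence\tDescription"]
    for number, barcode in enumerate(barcodes, start=1):
        # "sample<N>" is a hypothetical naming scheme
        rows.append(f"sample{number}\t{barcode}\t{linker_primer}\t--")
    return rows


headers = [
    ">MISEQ_0005_FC:1:1101:13831:1939#ACACCTC",  # header from the thread
    ">read2#ACACCTC",  # made-up label, same barcode as above
    ">read3#GACATCA",  # made-up label, different barcode
]
rows = mapping_rows(headers)
print("\n".join(rows))
```

With these three headers the sketch emits a header line plus two sample rows,
because the two ACACCTC reads collapse into a single row.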

-Tony

On Mon, Aug 20, 2012 at 11:43 AM, pcorral <pau.cor...@gmail.com> wrote:

> Hello Tony,
>
> I don't think I understand your email, or it might be that I haven't
> explained myself clearly enough.
> I just have 1 FASTA file, considered the Forward file in a PE experiment,
> and I guess I am constructing just one MAP file.
>
> What we noticed is that the layout in which the sequences came from
> the sequencing center makes it difficult to enter them into QIIME.
> My forward FASTA file has entries like this before manipulating it,
>
> >MISEQ_0005_FC:1:1101:13831:1939#ACACCTC
> TGGTCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTACCCGGATTTACTGGGT
> GTAAAGGGCGTGTAGGCGGTTTCTCAAGTCCGATGCTAAAG
>
> I have up to 90 codes in "#xxxxxxx", and as you can imagine they are
> repeated several times in the single Forward FASTA file.
> I call them sample barcodes, as they identify samples taken in different
> places, so each sequence has to cluster with the ones in the same sample.
>
> I applied the suggestion you mentioned in previous mails in this thread
> (put "xxxxxxx" at the beginning of the sequence). But that is when the MAP
> file ends up with repeated strings in BarcodeSequence.
>
>
> -Pau
>
>
>
>
>
> On Monday, August 20, 2012 1:07:37 PM UTC-4, TonyWalters wrote:
>
>> Hello Pau,
>>
>> Are those duplicate barcodes only present if you combine the two starting
>> fasta files together?  If all of the barcodes are unique with the files
>> separated, you could make one mapping file for each fasta file (so each
>> will have unique barcodes), run split_libraries.py on each, and cat the
>> resulting seqs.fna files.
>>
>> With this approach you would also want to use -n 1000000 on the first
>> split_libraries.py and -n 2000000 on the second to make sure you get unique
>> numbers following the output labels.
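
The reason for the two different -n values can be illustrated with a small
Python sketch. It assumes that the number after the underscore in a seqs.fna
label (as in seq1_1) is a running counter that starts at the -n value; the
helper names and the sample IDs A, B and C are made up, and nothing here calls
QIIME itself.

```python
from collections import Counter


def fake_labels(sample_ids, start):
    """Mimic the SampleID_number label pattern, counting up from `start`."""
    return [f"{sample_id}_{start + i}" for i, sample_id in enumerate(sample_ids)]


def duplicated(labels):
    """Return the labels that occur more than once, sorted."""
    return sorted(label for label, count in Counter(labels).items() if count > 1)


# Sample assignment of three reads in each of two hypothetical runs.
run1 = ["A", "B", "A"]
run2 = ["A", "C", "B"]

# Both runs numbered from 1: the concatenated labels can collide.
same_start = fake_labels(run1, 1) + fake_labels(run2, 1)
# Runs numbered from 1000000 and 2000000: the counters never overlap.
offset = fake_labels(run1, 1000000) + fake_labels(run2, 2000000)

print(duplicated(same_start))
print(duplicated(offset))
```

A collision like this only arises when the same SampleID appears in both
mapping files; a check of this kind on the concatenated file is a cheap way to
confirm that the labels stayed unique.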
>>
>> -Tony
>>
>>
>> On Mon, Aug 20, 2012 at 10:44 AM, pcorral <pau.c...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> It seems that when I move from a small test sample to the real sample
>>> data, problems arise, and although I see what the problem is, I don't know
>>> how to adapt the mapping file and the FASTA file for split_libraries.py and
>>> the rest of the workflow to work. The problem is that I have repeated
>>> barcodes in the mapping file. I'll describe all the tips that the QIIME
>>> community has given me, to see whether any of them can be replaced or
>>> supplemented.
>>>
>>> So the mapping file construction starts with sequences that look like
>>> this:
>>>
>>> >MISEQ_0005_FC:1:1101:13831:1939#ACACCTC
>>> TGGTCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTACCCGGATTTACTGGGT
>>> GTAAAGGGCGTGTAGGCGGTTTCTCAAGTCCGATGCTAAAG
>>>
>>> What follows "#" is my sample barcode, and of course, in the FASTA file
>>> many sequences share one sample barcode.
>>>
>>> Following Tony's advice I have rebuilt the FASTA file in order to have
>>> the barcode at the beginning of the sequence, like this (note that the
>>> space in the sequence is for clarity only):
>>>
>>> >seq1
>>> ACACCTC TGGTCGTGCCAGCAGCCGCGGTAATACGGA
>>> GGGTGCGAGCGTTACCCGGATTTACTGGGTGTAAAGGGCGTGTAGGCGGTTTCTCAAGTC
>>> CGATGCTAAAG
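
The rebuilding step described above can be sketched in Python as follows.
This is only an illustration under the assumption that every header ends in
"#" followed by the barcode; the function name prepend_barcodes is made up,
and the seq1, seq2, ... labels simply follow the example in this message.

```python
def prepend_barcodes(fasta_lines):
    """Move the '#BARCODE' suffix of each header to the front of its sequence.

    Returns a list of (new_label, barcode + sequence) tuples.
    """
    records = []
    label, chunks = None, []
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if label is not None:
                records.append((label, "".join(chunks)))
            label, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may be wrapped over several lines
    if label is not None:
        records.append((label, "".join(chunks)))

    rebuilt = []
    for number, (label, sequence) in enumerate(records, start=1):
        barcode = label.rsplit("#", 1)[1]
        rebuilt.append((f"seq{number}", barcode + sequence))
    return rebuilt


example = [
    ">MISEQ_0005_FC:1:1101:13831:1939#ACACCTC",
    "TGGTCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTACCCGGATTTACTGGGT",
    "GTAAAGGGCGTGTAGGCGGTTTCTCAAGTCCGATGCTAAAG",
]
rebuilt = prepend_barcodes(example)
for new_label, sequence in rebuilt:
    print(f">{new_label}\n{sequence}")
```

The 101-nucleotide read becomes a 108-nucleotide record that starts with the
7-nucleotide barcode.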
>>>
>>> With this sequence layout, the MAP file looks like this:
>>>
>>> #SampleID BarcodeSequence LinkerPrimerSequence Description
>>> seq1 ACACCTC   TGGTCGTGCCAGCMGCCGCGGTAA --
>>>
>>> However, when I use a dataset with repeated barcodes, the MAP file
>>> reflects this, with repeated strings in the BarcodeSequence field, as in
>>>
>>> #SampleID BarcodeSequence LinkerPrimerSequence Description
>>> seq1 ACACCTC   TGGTCGTGCCAGCMGCCGCGGTAA --
>>> ...
>>> seq12 ACACCTC   TGGTCGTGCCAGCMGCCGCGGTAA --
>>> ...
>>>
>>> So far, check_id_map.py warns me about these repetitions, but as far as I
>>> can see, the suggested corrected MAP file is exactly the same, with
>>> repetitions. split_libraries.py does not output results because of these
>>> repetitions.
>>>
>>> The reason to use split_libraries is two-fold:
>>> 1) I want a MAP file to be constructed to be used in the downstream
>>> workflow
>>> 2) I want to have my FASTA sequences trimmed of barcode + Linker
>>>
>>> I could do this trimming manually and enter the workflow directly at
>>> pick_reference_otus_through_otu_table.py, but then what would the layout
>>> of the MAP file be?
>>>
>>> I hope there is a solution for this barcode repetition problem,
>>>
>>> Thanks
>>>
>>> -Pau
>>>
>>>
>>>
>>> On Friday, August 17, 2012 8:07:00 PM UTC-4, pcorral wrote:
>>>>
>>>> Hello Tony,
>>>>
>>>> Fine surgery this time!! I could even capture sequences with the
>>>> default -M=0 in split_libraries.py
>>>>
>>>> Thanks for the wise advice.
>>>>
>>>> -Pau
>>>>
>>>>
>>>> On Friday, August 17, 2012 4:55:52 PM UTC-4, TonyWalters wrote:
>>>>>
>>>>> Hello Pau,
>>>>>
>>>>> It looks like you put in the BarcodeSequence twice, once in the header
>>>>> column, and again at the beginning of the LinkerPrimerSequence.
>>>>> Try changing this:
>>>>> #SampleID BarcodeSequence LinkerPrimerSequence Description
>>>>> seq1 ACACCTC    ACACCTCTGGTCGTGCCAGCMGCCGCGGTAA --
>>>>> ...
>>>>>
>>>>> to this:
>>>>> #SampleID BarcodeSequence LinkerPrimerSequence Description
>>>>> seq1 ACACCTC    TGGTCGTGCCAGCMGCCGCGGTAA --
>>>>>
>>>>> and see if that works better.
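
The effect of the doubled barcode can be checked with a little arithmetic in
Python. The sketch assumes that split_libraries.py first removes the barcode
(7 nucleotides with -b 7) and then removes a stretch as long as the
LinkerPrimerSequence; that assumption matches the 70-nucleotide output
reported in this thread, but the trim function below is a made-up stand-in,
not QIIME's code.

```python
BARCODE_LEN = 7  # the -b 7 value used in the thread

# seq1 from the FASTA file in this thread (barcode + read, 108 nucleotides).
read = (
    "ACACCTC"
    "TGGTCGTGCCAGCAGCCGCGGTAATACGGA"
    "GGGTGCGAGCGTTACCCGGATTTACTGGGT"
    "GTAAAGGGCGTGTAGGCGGTTTCTCAAGTC"
    "CGATGCTAAAG"
)


def trim(sequence, barcode_len, linker_primer):
    """Drop the barcode, then drop as many bases as the linker/primer is long."""
    after_barcode = sequence[barcode_len:]
    return after_barcode[len(linker_primer):]


doubled = "ACACCTC" + "TGGTCGTGCCAGCMGCCGCGGTAA"  # barcode repeated in the primer column
corrected = "TGGTCGTGCCAGCMGCCGCGGTAA"            # linker/primer only

trimmed_doubled = trim(read, BARCODE_LEN, doubled)
trimmed_corrected = trim(read, BARCODE_LEN, corrected)

print(len(doubled), len(trimmed_doubled), trimmed_doubled[:10])
print(len(corrected), len(trimmed_corrected), trimmed_corrected[:10])
```

With the 31-character entry the remaining read is 70 nucleotides and starts at
GGTGCGAGCG; with the 24-character entry it is 77 nucleotides and still begins
with TACGGAG, the seven nucleotides that went missing.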
>>>>>
>>>>> -Tony
>>>>>
>>>>>
>>>>> On Fri, Aug 17, 2012 at 2:52 PM, pcorral <pau.c...@gmail.com> wrote:
>>>>>
>>>>>>  Hello,
>>>>>>
>>>>>> I am following this thread because I encountered something that I don't
>>>>>> understand in the output of split_libraries.py and the mapping file I am
>>>>>> constructing.
>>>>>>
>>>>>> So my mapping file looks like this (following Tony's advice in this
>>>>>> mail thread):
>>>>>>
>>>>>> #SampleID BarcodeSequence LinkerPrimerSequence Description
>>>>>> seq1 ACACCTC    ACACCTCTGGTCGTGCCAGCMGCCGCGGTAA --
>>>>>> seq2 GACATCA    GACATCATGGTCGTGCCAGCMGCCGCGGTAA --
>>>>>> seq3 TAAGGGA    TAAGGGATGGTCGTGCCAGCMGCCGCGGTAA --
>>>>>> seq4 ACCTCCC    ACCTCCCTGGTCGTGCCAGCMGCCGCGGTAA --
>>>>>>
>>>>>> And this is the FASTA related to that Mapping file:
>>>>>>
>>>>>> >seq1
>>>>>> ACACCTCTGGTCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTACCCGGATTT
>>>>>> ACTGGGTGTAAAGGGCGTGTAGGCGGTTTCTCAAGTCCGATGCTAAAG
>>>>>> >seq2
>>>>>> GACATCATGGTCGTGCCAGCAGCCGCGGTAAGACAGAGGGGGCAAGCGTTGTCCGGAGTC
>>>>>> ACTGGGCGTAAAGCGCGCGCAGGCGGCTGCCTAAGTGTCGTGTGAAAG
>>>>>> >seq3
>>>>>> TAAGGGATGGTCGTGCCAGCCGCCGCGGTGATACAGAGGTGGCAAGCGTTGTCCGGATTT
>>>>>> ACTGGGTGTAAAGGGTGCGTAGGCGGATTTGCAAGTCGGGGGTTAAAG
>>>>>> >seq4
>>>>>> ACCTCCCTGGTCGTGCCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATG
>>>>>> ATTGGGCGTAAAGCGTCCGCAGGTGGTTGTGTGTGTCTATTGTCAAAG
>>>>>>
>>>>>> The command I issue for split_libraries.py is:
>>>>>> split_libraries.py -m smll_map.txt -f smll.fasta -b 7 -l 60 -M 26
>>>>>>
>>>>>> Here I have had to raise -M (maximum number of primer mismatches
>>>>>> [default 0]) to 26 to be able to retrieve the 4 sequences in 'seqs.fna',
>>>>>> but what is more intriguing to me is why seqs.fna only contains FASTA
>>>>>> sequences 70 nucleotides long. It has trimmed 7 nucleotides (TACGGAG) that I
>>>>>> would consider real sequenced nucleotides, and not a linker or primer ...
>>>>>>
>>>>>> This is what seq1 looks like in seqs.fna:
>>>>>>
>>>>>> >seq1_1 seq1 orig_bc=ACACCTC new_bc=ACACCTC bc_diffs=0
>>>>>> GGTGCGAGCGTTACCCGGATTTACTGGGTGTAAAGGGCGTGTAGGCGGTTTCTCAAGTCC
>>>>>> GATGCTAAAG
>>>>>>
>>>>>> Could someone explain to me the reason for that?
>>>>>>
>>>>>> Pau
>>>>>>
>>>>>>
>>>>>> On Friday, August 17, 2012 3:24:00 PM UTC-4, pcorral wrote:
>>>>>>>
>>>>>>> Hi Tony,
>>>>>>>
>>>>>>> It worked. Thank you.
>>>>>>>
>>>>>>> Just a small note when using split_libraries.py. The -b option was 7
>>>>>>> not 6 as you mentioned.
>>>>>>>
>>>>>>> Pau
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wednesday, August 15, 2012 11:13:20 PM UTC-4, pcorral wrote:
>>>>>>>>
>>>>>>>> Hello QIIME community,
>>>>>>>>
>>>>>>>> I am constructing the mapping file to enter into the QIIME
>>>>>>>> workflow, but it's not clear to me how the fields in this file relate to
>>>>>>>> what the sequencing center returned to me.
>>>>>>>>
>>>>>>>> The sequencing provider gave me two FASTQ files, one representing
>>>>>>>> the forward read and the other the reverse read (as far as I know this was
>>>>>>>> a paired-end sequencing run). After some in-house cleaning (keeping only
>>>>>>>> reads that I could assign to a sample, quality filtering, and a FASTQ to
>>>>>>>> FASTA and QUAL conversion), I ended up with FASTA entries that looked
>>>>>>>> like this:
>>>>>>>>
>>>>>>>> >MISEQ_0005_FC:1:1101:13831:1939#ACACCTC
>>>>>>>> TGGTCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTACCCGGATTTACTGGGT
>>>>>>>> GTAAAGGGCGTGTAGGCGGTTTCTCAAGTCCGATGCTAAAG
>>>>>>>>
>>>>>>>> What follows the '#' at the header is my sample barcode which must
>>>>>>>> correspond to BarcodeSequence in the mapping files. In my case, a
>>>>>>>> single FASTA file has different entries with different sample barcodes
>>>>>>>> after the '#' sign.  I really don't care about the ID the sequence has,
>>>>>>>> this could be substituted by any other "alphanumeric or dot" name. Also, in
>>>>>>>> the sequence, I know there is a plate barcode (exactly the same in all the
>>>>>>>> FASTA file entries) of 5 nucleotides (TGGTC) and after this, 19 nucleotides
>>>>>>>> (which vary along the FASTA file, I guess due to sequencing errors) that
>>>>>>>> were used as a linker (GTGCCAGCAGCCGCGGTAA), therefore, this must
>>>>>>>> correspond to LinkerPrimerSequence.
>>>>>>>>
>>>>>>>> Could anyone suggest how to format the FASTA file and the
>>>>>>>> mapping file to fill in the fields SampleID, BarcodeSequence and
>>>>>>>> LinkerPrimerSequence?
>>>>>>>>
>>>>>>>> Many thanks in advance!!
>>>>>>>>
>>>>>>>> Pau

--f46d04478b955c1c7304c7b62f7f--