http://qiime.org/documentation/file_formats.html#metadata-mapping-files
http://qiime.org/documentation/file_formats.html#demultiplexed-sequences
Greg
John, another option would be to use the data that we presented in the
initial QIIME Nature Methods paper. That is available here:
http://bmf.colorado.edu/QIIME/QIIME_NM_2010.tgz
Greg
Interesting, I hadn't heard that. Will you follow up and let us know
how it goes (i.e., if it's easy to get)?
Greg
tar -xvzf QIIME_NM_2010.tgz
Let us know if that doesn't work.
Greg
Great, that's good to know. Thanks!
> pick_otus_blast.py, which I think is equivalent to "pick_otus.py -m
> blast", I think?
Yes, that's correct. I'd recommend using:
pick_otus.py -m uclust_ref
rather than BLAST for reference-based OTU picking. Reference-based
uclust computes pairwise identity over the full length of the read,
while BLAST computes pairwise identity over the alignable region only
(subject to a minimum alignment length). The former gives better OTUs
(in my opinion), because every sequence in an OTU is at least 97%
identical to the seed over its full length.
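To make the difference concrete, here is a toy sketch in Python. It is not how uclust or BLAST compute identity internally; the two function names are made up for illustration, and it assumes you already have a gapped pairwise alignment of the read against the seed.

```python
def identity_over_aligned_region(read_aln: str, ref_aln: str) -> float:
    """Matches divided by the number of columns where both sequences have a base.

    Hypothetical helper: roughly the 'alignable region' view of identity.
    """
    pairs = [(a, b) for a, b in zip(read_aln, ref_aln) if a != "-" and b != "-"]
    matches = sum(1 for a, b in pairs if a == b)
    return matches / len(pairs)


def identity_over_full_read(read_aln: str, ref_aln: str, read_length: int) -> float:
    """Matches divided by the full read length.

    Hypothetical helper: bases of the read that fall outside the alignment
    still count in the denominator, so they lower the identity.
    """
    matches = sum(1 for a, b in zip(read_aln, ref_aln) if a == b and a != "-")
    return matches / read_length


# A 100-base read of which only 60 bases align to the seed, all of them matching.
aligned_read = "ACGT" * 15
aligned_seed = "ACGT" * 15

print(identity_over_aligned_region(aligned_read, aligned_seed))  # 1.0
print(identity_over_full_read(aligned_read, aligned_seed, 100))  # 0.6
```

Under the aligned-region measure this read would pass a 97% threshold, while under the full-length measure it would not, which is the behaviour described above.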
The most recent versions of our greengenes reference OTUs are here:
Check the notes.txt file in that zip for information on which
sequences should be used for what -- basically you'll want the
unaligned rep_set sequences for OTU picking.
Greg
> It looked like the otu_map from the folder otus was actually generated
> by "pick_otus.py -m uclust -i /rep_set/gg_97_otus_29nov2010.fasta"?
No, OTUs were picked against a larger input set, in this case a
previous version of our greengenes OTUs (6oct2010.fasta). This is all
in the notes.txt file.
> Are you suggesting that the gg_97_otus_29nov2010.fasta file can be
> used as reference sequence set for OTU picking by uclust_ref?
Yes. Of course, as with any reference set, there are limitations on
what taxa are present, so you may lose sequences from taxa that are
missing from it if you use reference-only OTU picking
(pick_otus.py -m uclust_ref -C).
> if so, my question is, why don't we eliminate those sequences that map
> to the same OTU? for instance, OTU 154, we only need 107891 in the ref
> set, and can remove 389548 and 449142.
Only 107891 is in the ref set:
caporaso@saguaro gg_otus_29nov2010> egrep 107891 rep_set/gg_97_otus_29nov2010.fasta
>107891
caporaso@saguaro gg_otus_29nov2010> egrep 389548 rep_set/gg_97_otus_29nov2010.fasta
caporaso@saguaro gg_otus_29nov2010> egrep 449142 rep_set/gg_97_otus_29nov2010.fasta
Do you get a different result running the above commands?
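If you'd rather check this from Python, something like the following sketch should do the same job. The function name is hypothetical, and it assumes the sequence ID is the first whitespace-delimited field on each ">" header line. Unlike a plain egrep, it only counts exact ID matches on header lines, so an ID such as 1078911 would not be mistaken for 107891.

```python
from typing import Dict, Iterable


def ids_present_in_fasta(fasta_path: str, ids: Iterable[str]) -> Dict[str, bool]:
    """Report, for each requested ID, whether it appears as a FASTA header ID.

    Assumption: the ID is the first whitespace-delimited token after '>'.
    """
    wanted = set(ids)
    found = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields and fields[0] in wanted:
                    found.add(fields[0])
    return {seq_id: seq_id in found for seq_id in wanted}
```

For example, calling ids_present_in_fasta("rep_set/gg_97_otus_29nov2010.fasta", ["107891", "389548", "449142"]) from the gg_otus_29nov2010 directory returns a dictionary mapping each ID to True or False.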
You might try allowing for new clusters -- this isn't possible with
the parallel version of uclust_ref, but is the default for the serial
version. In this case you won't get any failures because the sequences
that don't hit reference sequences will form new clusters.
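Here is a toy Python sketch of why allowing new clusters removes the failures. It is not uclust_ref's algorithm: the identity function is a naive position-by-position comparison, reads are assigned to the first seed that passes the threshold rather than the best one, and the names pick_otus_sketch, allow_new_clusters and the "New0"-style cluster IDs are made up for illustration.

```python
from typing import Dict, List, Tuple


def identity(a: str, b: str) -> float:
    """Naive identity: matching positions divided by the longer sequence length."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))


def pick_otus_sketch(
    reads: Dict[str, str],
    refs: Dict[str, str],
    threshold: float = 0.97,
    allow_new_clusters: bool = True,
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Assign reads to reference seeds; non-hits either seed new clusters or fail."""
    seeds = dict(refs)  # reference seeds, plus any new seeds added along the way
    otus: Dict[str, List[str]] = {ref_id: [] for ref_id in refs}
    failures: List[str] = []
    new_count = 0
    for read_id, seq in reads.items():
        hit = next(
            (sid for sid, s in seeds.items() if identity(seq, s) >= threshold),
            None,
        )
        if hit is not None:
            otus[hit].append(read_id)
        elif allow_new_clusters:
            # The read becomes the seed of a new cluster, so it is not a failure.
            new_id = f"New{new_count}"
            new_count += 1
            seeds[new_id] = seq
            otus[new_id] = [read_id]
        else:
            # Reference-only picking: the read is reported as a failure.
            failures.append(read_id)
    return otus, failures
```

With allow_new_clusters=False, every read that misses the reference ends up in the failures list; with the default of True, the failures list stays empty because those reads form their own clusters.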
Greg