I received 16S sequence data from a sequencing provider, who ran the raw sequences through their quality-control pipeline and provided me with the following files:
align.txt = UTAX classification alignments
tax.tsv = UTAX classification table
otu.fasta = final OTU centroid sequences (representative OTUs)
otu.fasta.obs = final OTU centroid observations per sample table
contam.otu.fasta = filtered contaminant centroid sequences
contam.otu.fasta.obs = filtered contaminant centroid obs table
unk.otu.fasta = unclassified OTU centroid sequences
unk.otu.fasta.obs = unclassified OTU obs table
log.txt = summary report
mapping.txt = QIIME mapping file
otu.tsv = OTU table of abundances, including taxonomy
otu.biom = OTU abundance+tax table in BIOM format
msa.fasta = multiple sequence alignment of final centroids
otu.tre = phylogenetic tree of final centroids
core_diversity_analyses/ = folder containing QIIME core diversity analyses output
Note: I have not been provided with the raw .fasta file containing all of my sequences, or with the OTU map generated by pick_otus.py.
My workflow is as follows:
# Reclassify:
assign_taxonomy.py -i otu.fasta -r /ROOT/SILVA_128_QIIME_release/rep_set/rep_set_all/97/97_otus.fasta -t /ROOT/SILVA_128_QIIME_release/taxonomy/taxonomy_all/97/taxonomy_all_levels.txt
# Make a BIOM file with the output from assign_taxonomy.py:
make_otu_table.py -i otu_map.tsv -t /ROOT/uclust_assigned_taxonomy/otu_tax_assignments.txt -o reclassified.biom
# Convert the BIOM file to .tsv:
biom convert -i reclassified.biom -o reclassified.txt --table-type "OTU table" --to-tsv --header-key taxonomy --output-metadata-id "ConsensusLineage"
I can successfully reclassify my otu.fasta file using SILVA. However, I have not been provided with the otu_map.tsv file that make_otu_table.py requires. I can't use the otu.tsv file from my sequencing provider, as it contains the old taxonomic classifications, so I need to generate a new OTU map file using pick_otus.py. I am unsure how to do this: can otu.fasta be used as input for pick_otus.py, given that it contains only the representative sequences for my data?
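Since the only thing wrong with otu.tsv is its taxonomy column, one workaround I am considering is to keep its counts and swap in the new assignments, which would avoid needing an OTU map at all. Below is a minimal sketch of that idea, not something I have validated. It assumes that otu.tsv is a classic tab-separated OTU table whose header lines start with # and whose last column is the taxonomy, and that the assign_taxonomy.py output has the OTU ID in column 1 and the new lineage in column 2. The function name swap_taxonomy and the output file name are my own placeholders.

```shell
# swap_taxonomy NEW_TAX OLD_TABLE
# Prints OLD_TABLE with its last column replaced by the lineage from NEW_TAX.
# Assumptions (hypothetical helper, not part of QIIME):
#   NEW_TAX   = tab-separated, OTU ID in column 1, lineage in column 2
#   OLD_TABLE = tab-separated, header lines start with #, taxonomy in last column
swap_taxonomy() {
  awk -F'\t' -v OFS='\t' '
    NR == FNR { tax[$1] = $2; next }   # first file: remember OTU ID -> new lineage
    /^#/      { print; next }          # second file: pass header lines through
    { $NF = ($1 in tax) ? tax[$1] : "Unassigned"; print }
  ' "$1" "$2"
}

# Example call (output file name is a placeholder):
#   swap_taxonomy uclust_assigned_taxonomy/otu_tax_assignments.txt otu.tsv > reclassified.tsv
```

If that is a sound approach, I assume the result could be converted back to BIOM with biom convert -i reclassified.tsv -o reclassified.biom --table-type "OTU table" --to-hdf5 --process-obs-metadata taxonomy, but I would appreciate confirmation before relying on it.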
Thank you in advance for your help. Please let me know if I can provide any more information.