Formatting an OTU table

edglucksman

Apr 22, 2013, 4:15:21 PM

to qiime...@googlegroups.com

Hi,

I would like to run some basic analyses on a dataset that has already been curated.

My raw 454 sequences were already taxonomically assigned and clustered into OTUs, leaving me with a spreadsheet that lists each OTU as a number (first column), followed by the number of times that OTU was detected in each of the many environmental samples we are studying (column 2 onward). The final column on the far right shows the taxonomic assignment obtained by blasting each representative sequence, also already done for me and written as Kingdom;Phylum;Class etc.

Example of my spreadsheet:

OTU NCarolinaBeach1 NCarolinaBeach2 NCarolinaBeach3 Taxonomy

1 346 290 130 Kingdom;Phylum;Class...

I also have a separate file listing each environmental sample (rows, example: NCarolinaBeach1) and associated environmental metadata (columns, example: temperature).

I would like to run these data through summarize_taxa_through_plots.py but I am confused as to how to create the input files.

a) How do I properly format an OTU table that incorporates not only each OTU, but also the number of times that particular OTU was found within a set of environmental samples?

b) What would my map (-m) file have to look like, and how can I include the metadata recorded at each environmental location/sample?

For starters, I wish to see bar graphs showing the taxonomic makeup as a percentage (y-axis) within each environmental sample (x-axis).

Thanks in advance!

All the best,

Ed

Will Van Treuren

Apr 22, 2013, 4:23:41 PM

to qiime...@googlegroups.com

Hi Ed,

If QIIME is installed, converting the file you have now to the .biom table format that the QIIME scripts use for analysis is straightforward.

Add a pound sign (#) to the start of the first line of your spreadsheet so that it looks like the following:

#OTU NCarolinaBeach1 NCarolinaBeach2 NCarolinaBeach3 Taxonomy

1 346 290 130 Kingdom;Phylum;Class...

Then use the convert_biom.py script to convert this 'classic' OTU table to the biom format. The command would look something like this (you may have to vary the parameters; check the script's help):

convert_biom.py -i spreadsheet.txt --header_key='Taxonomy' --process_obs_metadata='sc_separated' --biom_table_type='otu table' -o the_name_of_your_new_biom_file.biom
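
If you would rather not edit the header by hand, the pound-sign step can be sketched in a few lines of Python. This is only an illustration and not part of QIIME: the function name add_header_marker is made up for this example, and the rows are the ones from the example in this thread, assumed to be tab-delimited.

```python
def add_header_marker(lines):
    """Return the table lines with '#' prefixed to the header row if it is missing."""
    if not lines:
        return []
    header, rest = lines[0], lines[1:]
    if not header.startswith("#"):
        header = "#" + header
    return [header] + rest


# Tab-delimited rows copied from the example spreadsheet in this thread.
table = [
    "OTU\tNCarolinaBeach1\tNCarolinaBeach2\tNCarolinaBeach3\tTaxonomy",
    "1\t346\t290\t130\tKingdom;Phylum;Class...",
]

fixed = add_header_marker(table)
print(fixed[0])
```

Running the function on a table that already has the marker leaves it unchanged, so it is safe to apply more than once.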

To figure out what your mapping file needs to look like please refer to: http://qiime.org/documentation/file_formats.html

You will then be able to use summarize_taxa_through_plots.py.
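
As a rough illustration of the mapping file layout described on that page, here is a minimal Python sketch that writes one. It assumes the QIIME 1 convention that the header starts with #SampleID, BarcodeSequence and LinkerPrimerSequence and ends with Description, with the environmental metadata columns in between. The function name build_mapping and the temperature values are hypothetical placeholders, not data from this thread. The barcode and primer fields are left empty because the sequences were already processed, so the mapping file validator may need its barcode and primer checks relaxed (see its -h output).

```python
import csv
import io


def build_mapping(samples, metadata_columns):
    """Return tab-delimited mapping-file text with one row per sample."""
    header = ["#SampleID", "BarcodeSequence", "LinkerPrimerSequence"]
    header += list(metadata_columns) + ["Description"]
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    for sample_id, values in samples:
        row = [sample_id, "", ""]  # no barcode/primer: data were already processed
        row += [values[col] for col in metadata_columns]
        row.append(sample_id)  # placeholder description
        writer.writerow(row)
    return out.getvalue()


# Sample IDs must match the column names in the OTU table.
# The temperature values below are made up for illustration only.
samples = [
    ("NCarolinaBeach1", {"Temperature": "20.0"}),
    ("NCarolinaBeach2", {"Temperature": "21.0"}),
    ("NCarolinaBeach3", {"Temperature": "22.0"}),
]

mapping_text = build_mapping(samples, ["Temperature"])
print(mapping_text)
```

The text can then be saved to a file and passed with -m.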

Hope this helps,

Will

edglucksman

Apr 22, 2013, 4:41:42 PM

to qiime...@googlegroups.com

Hi Will,

Thanks for the tips!

A couple of follow-up questions, to help me understand the process:

- What do --process_obs_metadata='sc_separated' and --header_key='Taxonomy' mean?

- To create the -i file, would I simply copy/paste the spreadsheet from Excel into a text file and save it (adding the # before the first line)?

All the best,

Ed

Will Van Treuren

Apr 22, 2013, 4:51:36 PM

to qiime...@googlegroups.com

Hi Ed,

You should save the file in tab-delimited format (it's an option in the Save As menu in Excel). You can add the # sign before or after you do this. Excel will raise warnings about saving in the tab-delimited format, but you can safely ignore these.

Reading the help documentation for the scripts is a good way to get an idea of what those options do. To access it, pass -h. For instance, convert_biom.py -h gives output that contains the following (just a subset, the parts that matter for your case):

--header_key=HEADER_KEY
    Pull this key from observation metadata within a biom file when writing a classic table [default: no observation metadata will be written]

--process_obs_metadata=PROCESS_OBS_METADATA
    Process metadata associated with observations when converting from a classic table. Must be one of: taxonomy, naive, sc_separated [default: naive]

The --header_key option tells convert_biom.py to take the column in your table labeled 'Taxonomy' (or whatever you pass to --header_key) and use it as the metadata for the new biom file.

The --process_obs_metadata option tells convert_biom.py that your metadata is in the form of a single string separated by semicolons. I assumed this was the case based on your first post, but if your metadata is in some other format, for instance a list, you would pass a different value here.
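
To make "a single string separated by semicolons" concrete, here is a small Python sketch of how one row of the classic table breaks down into an OTU ID, per-sample counts and a list of taxonomy levels. It is not the code convert_biom.py runs, only an illustration; parse_classic_row is a made-up name, and the taxonomy string uses the placeholder levels from the first post.

```python
def parse_classic_row(header_line, row_line):
    """Split one tab-delimited OTU table row into ID, counts and taxonomy levels."""
    columns = header_line.lstrip("#").rstrip("\n").split("\t")
    fields = row_line.rstrip("\n").split("\t")
    otu_id = fields[0]
    # Every column between the OTU ID and the last column is a sample count.
    counts = {name: int(value) for name, value in zip(columns[1:-1], fields[1:-1])}
    # The last column is one string; splitting on ';' yields one entry per level.
    taxonomy = [level.strip() for level in fields[-1].split(";")]
    return otu_id, counts, taxonomy


header = "#OTU\tNCarolinaBeach1\tNCarolinaBeach2\tNCarolinaBeach3\tTaxonomy"
row = "1\t346\t290\t130\tKingdom;Phylum;Class"

otu_id, counts, taxonomy = parse_classic_row(header, row)
print(otu_id, counts, taxonomy)
```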

Hope this helps,

Will

edglucksman

Apr 22, 2013, 5:08:58 PM

to qiime...@googlegroups.com

Hi Will, thanks again for such a fast response.

A final question: How can I ensure that the analysis will take my taxonomic groupings into account? In the past, the bar charts simply showed K as the first level, I as the second, N as the third, G as the fourth, and so on. In other words, how do I make sure it recognizes the classification structure I have in my in-file? Is there a particular format it needs to be in (i.e. Kingdom; Phylum; Class; Order etc.)? Do the semicolons/spaces matter for each taxonomic level to be picked up independently and, if so, what is the correct way to format these?

All the best,

Ed

Will Van Treuren

Apr 22, 2013, 5:47:50 PM

to qiime...@googlegroups.com

Hi Ed,

My guess is that it was showing 'K', 'i', 'n'... because, in the conversion from classic to biom table, the --process_obs_metadata option was left unspecified, so convert_biom.py kept the default value ('naive') and interpreted the taxonomy incorrectly. This is something we have seen rather frequently. The correct formatting is simply to have entries like the following in your Taxonomy column:

'k__Bacteria;p__something;c__something_else...'

The important part is the semicolon separator and the order. I believe the '__' characters are not required.
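
The 'K', 'i', 'n' symptom is what you would expect if the taxonomy is kept as one string and then indexed level by level. The Python sketch below shows that effect, assuming (this is my guess, not confirmed QIIME internals) that downstream code picks level i by position. level_names is a made-up helper. Stripping spaces around each level is a cautious normalization on my part, not a documented requirement.

```python
def level_names(taxonomy, depth):
    """Return the first `depth` entries of a taxonomy given as a string or a list."""
    return [taxonomy[i] for i in range(min(depth, len(taxonomy)))]


raw = "Kingdom; Phylum; Class"

# Split on the semicolon and drop surrounding spaces so each level is one entry.
as_list = [level.strip() for level in raw.split(";")]

print(level_names(raw, 3))      # ['K', 'i', 'n']
print(level_names(as_list, 3))  # ['Kingdom', 'Phylum', 'Class']
```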

Best,

Will

edglucksman

Apr 23, 2013, 9:04:27 AM

to qiime...@googlegroups.com

Hi Will, thanks a lot, everything works smoothly now with the taxonomic composition summaries.

Given my situation, how would you suggest I proceed with computing alpha diversity? The computation requires a rep_set.tre file, which I don't have since I never had to carry out those initial steps. I do, however, have a single FASTA file with all my sequences. Is there a 'shortcut' to getting such a .tre file?

All the best,

Ed

Will Van Treuren

Apr 23, 2013, 9:40:19 AM

to qiime...@googlegroups.com

Hi Ed,

How were your OTUs picked? If they were picked against greengenes, then the rep_set.tre you are looking for will come from the greengenes collection. Without knowing how your OTUs were picked (e.g. against which greengenes release at what similarity) I can't tell you which greengenes file to download. You can find them all here though. If your OTUs were picked de novo or with open reference picking then you will have to construct a tree file using your Fasta file. Look at the steps outlined in the QIIME tutorial here (particularly step 6) as those should guide you through the process of creating a tree.

The other option is to pass a different set of metrics to alpha_diversity.py. I believe it defaults to PD_whole_tree,chao1,observed_species, but you can pass a whole list of others (find out which with the -s option) that do not require a tree.
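
To show why some metrics need no tree, here is a minimal Python sketch of two count-based ones, observed species and the Shannon index, computed from a single sample's OTU counts. This is not QIIME's implementation. The counts are hypothetical, and log base 2 is an assumption made for this example, so values may differ from what alpha_diversity.py reports.

```python
import math


def observed_species(counts):
    """Number of OTUs seen at least once in the sample."""
    return sum(1 for c in counts if c > 0)


def shannon(counts, base=2):
    """Shannon diversity index of one sample's OTU counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total  # relative abundance of this OTU
            h -= p * math.log(p, base)
    return h


# Hypothetical counts for one sample (one value per OTU), for illustration only.
sample_counts = [346, 0, 12, 5]

print(observed_species(sample_counts))
print(shannon(sample_counts))
```

Both functions use only the counts in one column of the OTU table, which is why metrics of this kind can be computed without a rep_set.tre file.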