Manipulating QIIME data in R

3,151 views
Skip to first unread message

Andrew Krohn

unread,
Dec 8, 2015, 5:40:42 PM12/8/15
to Qiime 1 Forum
Hi everyone,

I have a general question which some may think to be too general, but I think would be useful nonetheless.  Quick background, I am excellent in bash, I can fumble through R, and if you give me all day I can work out a rather basic python script.  I work on a (Debian-based) Ubuntu 14.04 system.

There are many cool analysis and plotting tools beyond the canon of QIIME scripts, many of which are available through R.  Getting R to play nice with biom tables can be a challenge, and there are now several packages for interfacing with them there.  Package 'biomformat' seems to have been released yesterday and enables parsing of hdf5-formatted biom (https://www.bioconductor.org/packages/3.3/bioc/manuals/biomformat/man/biomformat.pdf) but is not yet at biocLite, so I installed it using devtools according to the github instructions (https://github.com/joey711/biomformat/).  I can get my biom table loaded without issue, but then I feel a bit lost.  I am working through some R tutorials to increase my competence there, but it seems useful to have a discussion on this forum describing simple steps for parsing a biom table according to a qiime mapping file.  The 'phyloseq' package seems to do this, but I'm not sure it has been updated to use 'biomformat' over the 'biom' package yet.  Interesting tools not present in QIIME that I would like to play with include 'BEST' (bayesian estimation as alternative to t-test), 'ancom.R' (analysis of composition of microbiomes as a more appropriate handling of OTU tables with non-observances - zeros), and 'indicspecies' (indicator species which is supposed to correlate treatment types with taxa in your analysis).  As well I would like to try my hand at plotting outputs using ggplot2 as an alternative to the native QIIME plots.

Rather than wait for others to comment, I'll start here.  Here is a script you can use to update your R environment and add in some goodies at the same time (gist: https://gist.github.com/alk224/be2a8aa10dbfd3af34a1).  Copy the contents into a file, check the commented out sections (install commands for 'rube' and 'ancom.R') and execute it using sudo power with the Rscript command: sudo Rscript r_updates.r

To load an hdf5-type biom table into R, use the following:

> library(biomformat)
> read_biom("03_table_hdf5.biom")
biom object. 
type: OTU table 
matrix_type: dense 
10 rows and 15 columns 
Warning message:
In strsplit(msg, "\n") : input string 1 is invalid in this locale

That warning message doesn't seem to matter.  Something similar can be accomplished with phyloseq (instructions here: http://joey711.github.io/phyloseq/import-data#import_qiime), but I cannot get my own data to cooperate with this tutorial for an unknown reason (I am using a text version of the OTU table with the)

Here's where I am lost.  I know that for differential abundance testing with 'BEST' I need to supply that command with vectors for comparison (OTU abundances according to treatment types), but I'm not certain how to parse my biom table according to my map file.

For 'ancom.R' the package seems to have a very strict definition of how the data are formatted.  This is accomplished easily via bash manipulation or others may prefer python.

'indicspecies' appears to have even more complicated input requirements, but I've seen others curious about this so maybe someone has a workflow log somewhere that can help keep me (or anyone else) from wasting too much time in setup.

Thanks in advance to any interested people who can provide useful feedback and let's keep this conversation going.  To QIIME, and beyond!!

Andrew Krohn

unread,
Dec 20, 2015, 7:32:23 PM12/20/15
to Qiime 1 Forum
If anyone finds this thread useful, I am still working on things in my spare time.  I found this post useful for getting data into phyloseq: https://groups.google.com/forum/#!topic/qiime-forum/jtiwnTrSkoY

Since I enforce hdf5 output form my qiime workflows, I had to first convert my otu table to json.  Probably just fine for most people but could present problems when data sets become monstrously huge.

I opened R and issued the following:

## Loading libraries
> library(phyloseq)
> library(ggplot2)
> library(scales)
> library(grid)

## Load mapping file
> mapfile="/home/enggen/RAID5/2013-04-15/garden_reanalysis/r0/map.gardenexperiment.postESDtest.SLdropped.matched.nodead.csv"
> map=import_qiime_sample_data(mapfile)

## Load tree file
> tree=read_tree("/home/enggen/RAID5/2013-04-15/garden_reanalysis/r0/swarm_otus_d3/mafft_alignment/fasttree_phylogeny.tre")

## Load OTU table
> otufile="test.json.biom"
> biomfile=import_biom(otufile,parseFunction=parse_taxonomy_default)

## Merge the three objects together in phyloseq
> testdata=merge_phyloseq(biomfile,tree,map)
> print(testdata)
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 271 taxa and 35 samples ]
sample_data() Sample Data:       [ 35 samples by 33 sample variables ]
tax_table()   Taxonomy Table:    [ 271 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 271 tips and 250 internal nodes ]

There is a section on then exporting such data to other R tools near the bottom of this page: http://joey711.github.io/phyloseq-demo/phyloseq-demo.html

Meanwhile, it is fun to generate different plots with phyloseq:
## Make a tree
> plot_tree(testdata, color="Drought", size = "abundance")

## Make a network
> ig=make_network(testdata, type = "samples", distance = "bray", max.dist=0.85)
> plot_network(ig, testdata, color = "Drought")

I'm also getting ancom to work via the shiny interface for now (ancom.R).  This allows for OTU tables that are either in wide or tall formats which is to say that in txt format your otus are in rows or columns.  Default for biom files converted to txt is OTUs in rows, but the command line interface for ancom.R requires OTUs in columns.  It is handy to sort your OTU table before conversion to txt.  Just do the default sort with sort_otu_table.py.  Next you want to sort samples in your mapping file (natural sort only).  Open map file and txt-converted biom file in spreadsheet, copy a single factor (without header line) from mapping file and paste it with "transpose" to replace your sample names with factor levels.  Then replace the #OTU ID cell (A1) with Group.  If you didn't already, delete the first line which might say "# Constructed from biom file."  I added the following to my biom conversion script to avoid this tedium in the future:

sed -i '/#Constructed from biom file/d' $output.txt

Now you should be able to start the shiny interface and load your table for analysis:
> library(shiny)
> library(ancom.R)
Loading required package: foreach
foreach: simple, scalable parallel programming from Revolution Analytics
Use Revolution R for scalability, fault tolerance and more.
Loading required package: doParallel
Loading required package: iterators
Loading required package: parallel
Warning messages:
1: replacing previous import by ‘DT::dataTableOutput’ when loading ‘ancom.R’ 
2: replacing previous import by ‘DT::renderDataTable’ when loading ‘ancom.R’ 
> shiny_ancom()





Allison

unread,
Mar 6, 2016, 11:15:19 PM3/6/16
to Qiime 1 Forum
Hi Andrew,

Could you possibly post the first few lines of your mapping file as an example?  I'm having trouble reading mine into R and am getting the same error message as in the post you linked to:

> mapfile = system.file("extdata", "Map_Sed_Test.txt", package = "phyloseq")
> map=import_qiime_sample_data(mapfile)
Error in read.table(file = mapfilename, header = TRUE, sep = "\t", comment.char = "") : 
  no lines available in input
In addition: Warning message:
In file(file, "rt") :
  file("") only supports open = "w+" and open = "w+b": using the former

My mapping file works fine in Qiime so I'm not sure what I'm missing.

Thanks!
Allison

Andrew Krohn

unread,
Mar 7, 2016, 10:17:55 AM3/7/16
to Qiime 1 Forum
Hi Allison,

Nothing special about my mapping file. It seems possible that something on the way into phyloseq won't agree with all of the conventions found in qiime mapping files. From the QIIME page on mapping files:

All metadata must be composed of only alphanumeric, underscore (“_”), period (”.”), minus sign (“-”), plus sign (“+”), percentage (“%”), space (” ”), semicolon (”;”), colon (”:”), comma (”,”), and/or forward slash (“/”) characters. For missing data, write “NA”; do not leave blanks.

When I get to my other computer I'll try to remember to post a few lines anyway.

*****

I actually wanted to update this thread as I have done quite a bit of playing since my last post. In all, I guess I'm not super happy with ggplot2 or the ggplot-based outputs in phyloseq, but it is nice knowing I can interface with another language OK. I found the post here comparing different plotting functions interesting: http://pbpython.com/visualization-tools-1.html  And this one "defending" matplotlib: https://climateecology.wordpress.com/2014/07/15/in-defense-of-matplotlib/

I love QIIME for all of it's rather simple automation, but I still don't love the plots or the customizations available for output. For things I want to get out in the near future, I have settled on Veusz for actually making plots (open source, cross platform, yes!! http://home.gna.org/veusz/). For colors, I am using the colorblind palette suggested by Wong (2011) after learning that several of my close colleagues have color deficient vision. Really changes the way you think about trying to communicate your data...

Wong, B. (2011). Color blindness. Nature Methods, 8(6), 441–441. doi:10.1038/nmeth.1618

Meanwhile, I have a number of R scripts to generate some of the phyloseq plots without even really thinking. You can check out the code here (look for the scripts starting with "phyloseq"): https://github.com/alk224/akutils-v1.2/tree/master/scripts

Also there are scripts for making taxa plots in ggplot (I hate the slashes through the legend) and for running ancom.R without the shiny interface, or for converting a text-based OTU table for use in shiny.

While I didn't like the phyloseq or ggplot output enough to make documentation (yet), the ancom scripts I have found to be useful and have been incorporated into my wiki on github: https://github.com/alk224/akutils-v1.2/wiki

Use my installer script (https://github.com/alk224/akutils_ubuntu_installer) in a VM or a stand-alone system to get all the goodies (qiime, phyloseq, necessary R packages etc) in place. If you want to do it yourself or have to because you work on a cluster, I have instructions for that here: https://github.com/alk224/akutils-v1.2

If you want to play on your own with moving things into R, you probably already know that some packages simply don't like the way an OTU table or a qiime mapping file is constructed. I used to use a series of awk and sed commands to shape the data as necessary, but that can be really tedious. I have found the program datamash (https://www.gnu.org/software/datamash/) to be very helpful in command-line manipulation of text files, and it has many other functions (including an example involving genome analysis at the homepage).

If you need to parse a mapping file for a column by name (as you might when scripting with variables), you can use this one-liner to store the column as a value in the variable $factorcol:

factorcol=`awk -v factor="$factor" -v map="$map" '{ for(i=1;i<=NF;i++){if ($i == factor) {print i}}}' $map`

Finally, I mentioned above that I'm not super happy with phyloseq. Phyloseq is great. I really like it, and that group does a super job at documentation and updates and communicating. It's R that I struggle with more. From the ggplot legend slashes (no way to remove them) to more difficult issues where one command will pass a variable while another will not, or more complicated problems such as controlling the order of repeating color units in taxa plots, R feels to me like a place to be if you want to spend all of your time entering almost the same command. I realize that QIIME was trying to solve this issue by making customizable plots via matplotlib, but I'm just not willing to use them in publication either. While I aim to become proficient at matplotlib in the future or some derivative because this package really seems to have the ultimate flexibility, I am enjoying Veusz in the meantime. Because it has it's own command-line interface, I will probably try to create some templates and see if I can use the command line to script some automated plots to life.

For everyone else, maybe you will find my script code useful in generating your own plots in R. Most of my scripts are written such that a bash script (.sh) is called, which then calls the R script (.r) via Rscript once all the major text formatting steps are completed. Maybe you will find that useful as well as it requires passing variables from script to script.

Andrew Krohn

unread,
Mar 7, 2016, 10:36:17 AM3/7/16
to Qiime 1 Forum
Hi Allison,

Attached is the mapping file I was using. It is for my test data set which I also have available on github as a mini dataset for working out scripts or testing my own workflows.


map.test.txt

Allison

unread,
Mar 7, 2016, 7:58:00 PM3/7/16
to Qiime 1 Forum
Thanks Andrew!

Unfortunately that file gave me the same error message so I guess it's back to the drawing board.    I'll post back here if I ever figure out the problem!

Best,
Allison

Andrew Krohn

unread,
Mar 7, 2016, 8:06:10 PM3/7/16
to Qiime 1 Forum
Hmm, maybe it's just all the extra arguments you are passing when defining your mapping file. The function in phyloseq knows what a qiime-formatted map looks like so no need for the system.file or sep="\t" or anything like that.  See below from my phyloseq phylum-level tree script:



## Load libraries
library(phyloseq)
library(ggplot2)
library(scales)
library(grid)
## Recieve input files from bash
args <- commandArgs(TRUE)
otufile=(args[1])
mapfile=(args[2])
treefile=(args[3])
factor=(args[4])
## Load data into phyloseq
map=import_qiime_sample_data(mapfile)
tree=read_tree(treefile)
otus=import_biom(otufile,parseFunction=parse_taxonomy_greengenes)
mergedata=merge_phyloseq(otus,tree,map)
## Make phylum-colored tree
phylumtree = plot_tree(mergedata, color = "Phylum", label.tips = "taxa_names", plot.margin = 0.5, ladderize = "left", nodelabf = nodeplotboot())
## Output pdf graphic
pdf("Phylum_tree.pdf")
plot(phylumtree)
dev.off()
## Make detailed tree
detailtree = plot_tree(mergedata, color = factor, label.tips = "Species", plot.margin = 0.5, ladderize = "left", nodelabf = nodeplotboot(), size = "abundance", base.spacing = 0.03, shape = "Class")
## Output pdf graphic
pdf(paste0(factor, "_detail_tree.pdf"))
plot(detailtree)
dev.off()
## Change pdf resolution like this (doesnt change text size):
#pdf("network.pdf", height = 12, width = 12)
## .png output instead
#png('network.png', height="12")
#plot(networkout)
#dev.off()

Andrew Krohn

unread,
Mar 7, 2016, 8:10:21 PM3/7/16
to Qiime 1 Forum
Because that script is using args to take an input from bash, that example might not be the most clear. Just use setwd to define the location where all your data resides, and define the mapfile directly:

mapfile="my.qiime.formatted.map.file.txt"

Then use the phyloseq command to import it:

map=import_qiime_sample_data(mapfile)

Define your other files similarly, and then load your data into phyloseq. Make sure your biom table is json formatted before starting as I don't think it handles hdf5 yet.

Allison

unread,
Mar 7, 2016, 10:00:58 PM3/7/16
to Qiime 1 Forum
That worked!!!  Thank you so much - I had been following the phyloseq demo which is why I had those extra arguments.

Thanks again - really appreciate the help!

Best,
Allison

Jay T

unread,
Mar 14, 2016, 4:12:11 PM3/14/16
to Qiime 1 Forum
Andrew -

Thank you for posting this guide. The phyloseq tutorials seem like a completely foreign language. I don't think it's  written all that well personally, but it may just be me. I am curious if you have tried working with shiny-phyloseq. It has a GUI but I can't seem to get the darn thing to work. It keeps saying I have uploaded my data but I never see it listed. Could you take a look at it when you get a chance? In the meantime I am going to keep working with your guide. With some luck I will have a nice tree when I am done.

Thanks Again!

Jay T

unread,
Mar 14, 2016, 6:05:20 PM3/14/16
to Qiime 1 Forum
Hey guys I just wanted to add to the discussion that I have got shiny-phyloseq to work properly.

The simplest way to interface with the GUI is to save your phyloseq object as a .Rdata file which you can upload to phyloseq.

I simply used the command(s):

> testdata=merge_phyloseq(biomfile,tree,map)
print(testdata)

## Save phyloseq object - used to export to shiny-phyloseq

> save(testdata, file = ("myphyseq1.RData"))

I have found that the tree tab in shiny-phyloseq doesn't seem to color properly. I'll have to contact Joey on Github about this.

Andrew Krohn

unread,
Mar 14, 2016, 7:04:46 PM3/14/16
to Qiime 1 Forum
Excellent to hear. I hadn't tried shiny for phyloseq, but I have used it with ancom.

Sachia

unread,
Mar 21, 2016, 1:56:15 PM3/21/16
to Qiime 1 Forum
Hi guys,

can I ask what the benefits are using the .biom table over just the .txt version? I tried some time back with phyloseq and never got it to work, since then I have imported the OTU tables as .txt into R and doing whatever I needed.

Thanks for the guide, will have to try it out soon.

Cheers,

Sachia

Colin Brislawn

unread,
Mar 21, 2016, 4:22:14 PM3/21/16
to Qiime 1 Forum
Hello Sachia,

The information contained in the OTU table should be consistent regardless of its encoding (flat text, json, HDF5). I know phyloseq supports text and json, and support is on the way for HDF5. You can choose whatever makes your software happy. 

Colin

Sachia

unread,
Mar 22, 2016, 12:59:53 PM3/22/16
to Qiime 1 Forum
OK, thanks for clarifying Colin :)

Cheers,
Sachia

Jay T

unread,
Mar 24, 2016, 1:12:45 AM3/24/16
to qiime...@googlegroups.com
Hey Colin -

I am trying to merge my biom file (I tried a .txt version of the otu table and it didn't like it) into a phyloseq object but it keeps spitting out this error:

Error in validObject(.Object) : invalid class “phyloseq” object:
 Component taxa/OTU names do not match.
 Taxa indices are critical to analysis.
 Try taxa_names()

Here is my code:

## Loading libraries - getting started
library(phyloseq)
library(ggplot2)
library(scales)
library(grid)

## Creating Phyloseq Object from 3 files

## Load mapping file
mapfile = "amoa_mapping2.txt"
map = import_qiime_sample_data(mapfile)
print(map)

## Load tree file
tree=read_tree("filtered_tree5.tre")


## Load biom file
biom="otu_no_singletons_95_json.biom"
biomfile=import_biom(biom)
print(biomfile)



## Merge the three objects together in phyloseq
testdata=merge_phyloseq(biomfile,tree,map)
print(testdata)


This same code worked previously on another .biom file so I am not sure what the problem is. Does it just not like the format of the biom file?

Thanks,
JT
otu_no_singletons_95_json.biom

Colin Brislawn

unread,
Mar 24, 2016, 1:08:38 PM3/24/16
to Qiime 1 Forum
Hello Jay,

While I use and love phyloseq, this forum is more focused on support for qiime. Perhaps the Phyloseq issue tracker would be a better fit for this question:

(But since you asked, the 'Component taxa/OTU names do not match' error happens when the OTU IDs do not match in the files: biomfile,tree,map. Try opening them manually and seeing if the IDs are exactly the same format.) 

Thanks Jay,
Colin

Aditya Bandla

unread,
Oct 28, 2016, 1:32:32 PM10/28/16
to Qiime 1 Forum
Hi All

My question is with regard to interfacing between QIIME output and Phyloseq

I observed in Andrew's earlier post in this thread, that the Phylogenetic Tree from QIIME consists of 250 internal nodes, while having 271 tips. I have a similar situation. 

I am always running into a warning, when I try computing the unifrac metric in phyloseq (I would assume a similar warning would pop up in QIIME as well). Specifically, the warning is 

In matrix(tree$edge[order(tree$edge[, 1]), ][, 2], byrow = TRUE,  :
data length [27299] is not a sub-multiple or multiple of the number of rows [13650]

After some diagnostic tests, I think it might be do with the relation between the number of tips and number of nodes. Is a strict n tips to n-1 nodes relation required for the unifrac to work properly?

Cheers,
Aditya

Jamie Morton

unread,
Oct 31, 2016, 11:32:57 AM10/31/16
to Qiime 1 Forum
Hi Aditya,

Unifrac is designed to work with multifurcating trees - so that shouldn't be the problem.

Best,
Jamie
Reply all
Reply to author
Forward
0 new messages