UniFrac Question

Stephen B. Cox

May 9, 2011, 3:58:20 PM

to qiime...@googlegroups.com

I am having trouble tracking down a basic issue with QIIME's UniFrac calculations. As I understand it, UniFrac distances represent distances between each pair of samples, ignoring all other samples in the set. This seems to be consistent with my reading of the description of the calculations for the UniFrac distance matrix in the 2005 Lozupone and Knight paper.

However, when I extract a subset (say 20 samples) of distances from a unifrac matrix that was calculated using a total of 100 samples, the distances are not the same as if those 20 samples were analyzed separately. I have a hunch that the distance between samples is considering the root of the entire tree, not the tree based on only taxa present in the two samples.

So, two questions.

1) is this indeed what Qiime is doing?

and

2) is this consistent with how folks believe it should be done (I am currently trying to get the original python code from the Lozupone paper up and running)?

Jeff Werner

May 9, 2011, 4:34:32 PM

to qiime...@googlegroups.com

Hi Stephen,

You're right that UniFrac is a pairwise distance, and you should get the same answers for pairwise distances if they are calculated for twenty samples, or if those 20 are compared simultaneously along with 100 other samples. However, the results of doing principal coordinates analysis (PCoA) on those distances may look different, because the new samples will change the total amount of variation in the data set, and the variation displayed in PC1, for example, may show you different patterns. Are you seeing that the pairwise distances changed (in the distance matrix), or that the PCoA results look different (in a 3D plot)?

If it is indeed the 3D plot from PCoA that has changed, then that would be expected. However, if the actual distances in the distance matrix have changed, then I would guess that something differed in the process used for calculating the distances (were the samples rarefied, etc.?). If you are using a different tree (i.e., you built two trees, one with just sequences from 20 samples, and one with sequences from all 100 samples), then the difference may be due to the different phylogenetic trees. As you mentioned, the root will be a little different, and the model of evolution will change a little with fewer sequences, etc. The results should show similar patterns though, right?

Cheers,
Jeff

Stephen B. Cox

May 10, 2011, 9:03:08 AM

to qiime...@googlegroups.com

Many thanks for the reply, Jeff. I am comparing the distances directly (not the PCoA results). With respect to the different trees, I thought that the trees were reconstructed for each pair but, I guess, even then there could be an issue with the root. (As you can tell, I am a bit new to UniFrac. Background in biostats, but refreshing myself on phylogenetic methods.)

Yes, similar patterns, but there are some differences. More distant pairs seem to exhibit the biggest changes (both smaller and larger), which, I guess, is to be expected. However, in general, the distances based on the full set tend to have a bias towards larger distances.

Regards

Stephen

Jackson Lee

May 10, 2011, 1:32:35 PM

to Qiime Forum

Hi,
We are seeing a similar issue in our lab when building trees for
UniFrac analysis of a subsample of a larger dataset. If membership
influences tree construction (and therefore the final distance
values), we were trying to figure out whether a tree with more
members, fewer members, or the same members as the analysis is more
suitable. Ultimately we decided that if long-branch attraction is
the issue, then a tree with more members should be more complete
and, taking that to its ultimate conclusion, some sort of reference
tree (such as from SILVA) should be used. Does this seem right?

Jeff Werner

May 11, 2011, 11:19:46 AM

to qiime...@googlegroups.com

There is no need to reconstruct phylogenetic trees for subsamples. The UniFrac algorithm will only look at branches that contain OTUs present in the two samples you are comparing. As Jackson mentioned, more sequences generally can give you better trees, since the model of evolution has more information to go by for branching and rooting. So, if you have a large set of samples, and you want to re-analyze smaller subsets of samples, the best thing to do is to build one big phylogenetic tree for the full set of OTU representative sequences and keep using that one tree in all of your separate analyses. That way the phylogeny on which you base the UniFrac distances isn't changing between different subgroup analyses.

E.g., if you're using beta_diversity.py, you can always pass it the same tree through the -t option, even if you are passing it OTU tables that you've trimmed, filtered, or whatever.
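
To illustrate the point that a pairwise UniFrac distance depends only on the tree and the two samples being compared, here is a minimal Python sketch of unweighted UniFrac on a toy tree. The tree, OTU names, and sample contents are made up for illustration, and this is not QIIME's implementation. It assumes the convention that a branch counts if it leads to an OTU found in either of the two samples, with the same tree used throughout.

```python
# Hypothetical rooted tree: child -> (parent, branch_length).
TREE = {
    "n1": ("root", 1.0),
    "n2": ("root", 2.0),
    "otu1": ("n1", 1.0),
    "otu2": ("n1", 1.0),
    "otu3": ("n2", 0.5),
    "otu4": ("n2", 1.5),
}

# Hypothetical samples: sample ID -> set of OTUs observed in it.
SAMPLES = {
    "A": {"otu1", "otu3"},
    "B": {"otu1", "otu2"},
    "C": {"otu4"},
}


def branches_to_root(tip):
    """Return the branches (named by their child node) from a tip up to the root."""
    branches = set()
    node = tip
    while node in TREE:
        branches.add(node)
        node = TREE[node][0]
    return branches


def covered_branches(otus):
    """Return every branch that leads to at least one OTU in the sample."""
    covered = set()
    for otu in otus:
        covered |= branches_to_root(otu)
    return covered


def branch_length(branches):
    return sum(TREE[b][1] for b in branches)


def unweighted_unifrac(otus_a, otus_b):
    """Length of branches unique to one sample / length of branches in either."""
    a, b = covered_branches(otus_a), covered_branches(otus_b)
    total = branch_length(a | b)
    return branch_length(a ^ b) / total if total else 0.0


def distance_matrix(samples):
    """All pairwise distances as a dict keyed by (sample_i, sample_j)."""
    ids = sorted(samples)
    return {(i, j): unweighted_unifrac(samples[i], samples[j])
            for i in ids for j in ids}


if __name__ == "__main__":
    full = distance_matrix(SAMPLES)
    subset = distance_matrix({k: SAMPLES[k] for k in ("A", "B")})
    # With a fixed tree and fixed OTU contents, dropping sample C
    # leaves the A-B distance unchanged.
    print(full[("A", "B")], subset[("A", "B")])
```

Under these assumptions, a distance that changes between the full and the subset analysis points to a difference in the tree or in the OTU table rather than in the set of samples compared.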

justink

May 11, 2011, 3:25:30 PM

to Qiime Forum

I agree, more sequences are more likely to generate a more accurate
tree. Large reference trees are thus wonderful, but there are other
effects to consider when e.g. blasting sequences against a reference
tree, as no reference tree will contain perfect coverage of all the
taxa in a novel study.

Jeff Werner

May 11, 2011, 4:16:00 PM

to qiime...@googlegroups.com

Since building a tree with FastTree is relatively quick, would it make sense to trim the Greengenes database to your primer region, and always build a tree using your experimental sequences as well as, e.g., Greengenes core sequences? The extra reference sequences could just be thrown in as an extra sample, called "GGcore" or something of the sort...

Stephen B. Cox

May 11, 2011, 5:34:32 PM

to qiime...@googlegroups.com

OK, so, I agree with what has been said with regard to the value of a larger tree.

As a test, we just calculated UniFrac distances on a full set of samples. Next, we calculated UniFrac distances on a subset of those samples, but supplied the tree from the full set (all using beta_diversity.py). According to my understanding (and if I am following Jeff's logic), the distances between samples in the subset should be equal to those same distances in the full set. Is this correct? (In any case, they do not match very well.)

(Part of the motivation behind this line of questioning is this: if we are interested in writing up a manuscript on a subset of samples that were run as part of a larger set, can we extract the UniFrac distances from the full set's matrix and use those for our multivariate analysis? My impression is that, whatever is done, the manuscript needs an explicit description of the entire set of samples used to construct the tree behind any UniFrac calculation!)
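
The comparison described above (subset distances versus the matching entries of the full matrix) can be sketched in a few lines of Python. This assumes the distance matrices are tab-delimited text with a header row of sample IDs and one labelled row per sample. The function names and the inline example values are hypothetical, not output from a real run.

```python
import io


def read_distance_matrix(handle):
    """Parse a tab-delimited square matrix into (sample_ids, {(row, col): value})."""
    lines = [line.rstrip("\n") for line in handle if line.strip()]
    ids = lines[0].split("\t")[1:]  # first header cell is empty
    distances = {}
    for line in lines[1:]:
        fields = line.split("\t")
        row_id = fields[0]
        for col_id, value in zip(ids, fields[1:]):
            distances[(row_id, col_id)] = float(value)
    return ids, distances


def max_abs_difference(full, subset_ids, subset):
    """Largest |full - subset| over every pair of samples in the subset."""
    return max(abs(full[(a, b)] - subset[(a, b)])
               for a in subset_ids for b in subset_ids)


# Made-up matrices standing in for the full-set and subset outputs.
FULL_TEXT = "\tA\tB\tC\nA\t0.0\t0.5\t0.7\nB\t0.5\t0.0\t0.6\nC\t0.7\t0.6\t0.0\n"
SUBSET_TEXT = "\tA\tB\nA\t0.0\t0.55\nB\t0.55\t0.0\n"

if __name__ == "__main__":
    _, full = read_distance_matrix(io.StringIO(FULL_TEXT))
    subset_ids, subset = read_distance_matrix(io.StringIO(SUBSET_TEXT))
    print("max |full - subset| =", max_abs_difference(full, subset_ids, subset))
```

A maximum difference at the level of floating-point noise would suggest the two runs agree; anything larger is worth tracing back to the inputs.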

justink

May 11, 2011, 6:37:21 PM

to Qiime Forum

Yes, the pairwise unifrac distance shouldn't depend on other samples
analyzed at the same time, as long as the same tree is used.

Hmm, if you're using the same tree, and the same samples, say by
removing samples from the otu table with filter_by_metadata.py, and
you're getting different values for beta unifrac between samples, that
sounds wrong.

It may be a bug, though it's not one I've been able to reproduce
thus far. If you're getting different UniFrac values when changing only
which samples are included, could you submit a bug report here?
http://sourceforge.net/tracker/?group_id=272178&atid=1157164

As for requiring an explicit description of the entire set of
samples used to construct the tree: to reproduce a research result
exactly, you would need both the tree inference algorithm/program
and the details of each sample used to infer the tree.
However, in my anecdotal experience, the patterns of community
similarity are fairly robust to details of tree building, including
additional samples used when inferring the tree.

Stephen B. Cox

May 12, 2011, 11:40:13 AM

to qiime...@googlegroups.com

As an update, I just re-ran all analyses using Fast UniFrac (using trees and OTU tables generated in QIIME) and the results were identical to what we got using beta_diversity.py. Not yet convinced that this is a bug, but it is puzzling. I am currently checking into our methods for generating the trees, and various other places in the pipeline where we could have made a misstep.

Stephen B. Cox

May 13, 2011, 11:56:29 AM

to qiime...@googlegroups.com

OK. We seem to be getting some resolution on what is going on (and please forgive me if some of this is just a result of my own ignorance). We are indeed getting different UniFrac results when changing only which samples are included (keeping the same tree). However, instead of just removing samples from the OTU tables, we are re-starting the pipeline from the fasta files. Hence, even when we use the same tree (the full tree), the OTU tables for the subset are a bit different.

At this point, I am working through figuring out why the OTU table for a subset of samples (created independently) differs from the relevant subset of the full OTU table (extracting only the subset samples).
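
One way to start on that comparison is sketched below in Python. It assumes each OTU table has already been loaded into a dict of sample ID -> {OTU ID: count}. The table contents and function names here are hypothetical. Because OTU IDs from two independent runs may not be comparable, the sketch only compares per-sample totals and the number of OTUs observed.

```python
def summarize(table):
    """Map each sample to (total sequence count, number of OTUs observed)."""
    return {
        sample: (sum(counts.values()),
                 sum(1 for c in counts.values() if c > 0))
        for sample, counts in table.items()
    }


def compare_tables(full, subset):
    """Report shared samples whose totals or OTU counts differ between tables."""
    full_summary, subset_summary = summarize(full), summarize(subset)
    report = {}
    for sample in sorted(set(full_summary) & set(subset_summary)):
        if full_summary[sample] != subset_summary[sample]:
            report[sample] = {"full": full_summary[sample],
                              "subset": subset_summary[sample]}
    return report


# Made-up tables: the subset run splits sample A's sequences into more OTUs.
FULL_TABLE = {
    "A": {"otu1": 10, "otu2": 5, "otu3": 0},
    "B": {"otu1": 3, "otu3": 7},
    "C": {"otu2": 4},
}
SUBSET_TABLE = {
    "A": {"x1": 10, "x2": 3, "x3": 2},
    "B": {"x1": 3, "x4": 7},
}

if __name__ == "__main__":
    for sample, values in compare_tables(FULL_TABLE, SUBSET_TABLE).items():
        print(sample, values)
```

Samples that show the same totals but a different number of OTUs would point at the OTU picking step rather than at sequence filtering.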
