"nan" in rarefaction table and help with rarefaction plot

572 views
Skip to first unread message

juliesm...@gmail.com

unread,
Aug 25, 2015, 7:25:47 PM8/25/15
to Qiime Forum

Hi, 


I am new to alpha diversity analysis and need help to understand how to interpret the alpha rarefaction plot and table generated by Qiime with my data. I would appreciate any feedback on this. 


I performed alpha diversity analysis on my 1 sample. The corresponding OTU table (txt and biom) are attached. I am also attaching the rarefaction plot and the table generated by qiime for this sample. 


What I noticed is that i see "nan" values in the table. Does this mean that something is wrong? I read in previous messages on this forum that I may have to rarify  my biom file to your maximum depth if I want to get rid of nan values. Would this be the right approach? If so, what should be the max depth looking at my OTU table, how do i determine it?


OR

Would it be ok to skip the rarifying of the biom file step?  Basically, would the results I currently have still make sense? How do I interpret these results? 



Thank you very much in advance for your help and any guidance on how to interpret this data. 

Julie


ps: These are the command that generated the attached plot and table from the attached OTU.biom file:  

Executing commands.


# Alpha rarefaction command

multiple_rarefactions.py -i OTU.biom -m 10 -x 44087 -s 4407 -o /rarefaction/


Stdout:


Stderr:


# Alpha diversity on rarefied OTU tables command

alpha_diversity.py -i /rarefaction/ -o /alpha_div/ --metrics shannon,chao1,simpson,observed_species


Stdout:


Stderr:


# Collate alpha command

collate_alpha.py -i /alpha_div/ -o /alpha_div_collated/


Stdout:


Stderr:


# Removing intermediate files command

rm -r /rarefaction/ /alpha_div/


Stdout:


Stderr:


# Rarefaction plot: All metrics command

make_rarefaction_plots.py -i /alpha_div_collated/ -m /map.txt -o /alpha_rarefaction_plots/


Stdout:


Stderr:


RarefactionPlot.png
RarefactionTable.png
OTU.biom
OTU.txt

Colin Brislawn

unread,
Aug 26, 2015, 1:34:20 PM8/26/15
to Qiime Forum
Good morning Julie,

Good questions! I'll take them one at a time.
  • "I see "nan" values in the table"
    When qiime calculates the alpha diversity, some of the metrics have the potential to divide by zero metrics for a given sample. When a formula divides by zero, 'nan' is returned. I'm pretty sure this is related to analyzing one sample. (Do you have more than one sample?)

  • "I performed alpha diversity analysis on my 1 sample."
  • I'm not sure that will help you very much. When I perform alpha diversity analysis, I use many samples and compare the alpha diversity between groups. For example, my results could tell me that samples in the group "no antibiotics" are more diverse than samples in the group "many antibiotics". 
    Your results, after running this one one samples, tell me you have 19 OTUs in that sample.
    (How many reads are in this sample? Do your other samples have more reads?)

  • "Would it be ok to skip the rarifying of the biom file step?"
  • I want to make sure we are talking about the same step: 'alpha rarefaction', which is used to estimate alpha diversity, is often confused with 'rarifying', which is a normalization method. (The names are way too similar and this is a common mistake.)
    To answer your question: When estimating alpha diversity, you should use a non-rarified .biom table. When calculating beta diversity, using a .biom table what has been normalized is probably a good option. To normalize, you could use normalize_table.py or single_rarefaction.py. 


I hope my answers make sense. Let me know if you have any other questions!

Colin

julie smith

unread,
Aug 26, 2015, 5:18:48 PM8/26/15
to qiime...@googlegroups.com

Good morning Colin.  First of all, thank you very much for your detailed and immediate response. I really appreciate your time and suggestions. 

I will first also to answer your questions in your response to me and I still have a few more follow-up questions if you don't mind. I am very new to the area, so please excuse my ignorance in this topic. Below are the responses, I numbered them to make it easier to follow-up: 


1. "I see "nan" values in the table" When qiime calculates the alpha diversity, some of the metrics have the potential to divide by zero metrics for a given sample. When a formula divides by zero, 'nan' is returned. I'm pretty sure this is related to analyzing one sample. (Do you have more than one sample?) 

My answer: I have only 1 sample. What  I mean by 1 sample is: This 1 sample is actually "a mock bacterial community" which has 19 bacterial species in it as you can see in the OTU.biom table that I had sent in my first email as attachment.  My goal for using the qiime alpha diversity script was to measure to species richness in this 1 mock bacterial community. Therefore, I ran the qiime alpha diversity commands on the OUT.biom table and got the plot and table I had attached in my first email.  Does this approach make sense? Also, could you please help me to understand what the plot means and the values in the table mean? For example, the plot rises and then stays at same level. How do I interpret this? Also, in the results table, For example, I see that observed species ave  = 6.5  when seq/sample is 10 and then it is 19 or 18 for the rest seq/samples. What do these numbers tell about this community?  


2.   "I performed alpha diversity analysis on my 1 sample."

I'm not sure that will help you very much. When I perform alpha diversity analysis, I use many samples and compare the alpha diversity between groups. For example, my results could tell me that samples in the group "no antibiotics" are more diverse than samples in the group "many antibiotics". 

Your results, after running this one one samples, tell me you have 19 OTUs in that sample.(How many reads are in this sample? Do your other samples have more reads?)


My answer:  what I mean by 1 sample, is actually 1 bacterial community. As you also had noticed,  this sample has 19 OTUs which is represented by the OUT.biom table that I had attached in my first email. So, this OTU.biom is the only input I have to the qiime alpha diversity analysis. The numbers represent the number of reads of each OUT in the given mock community sample. Does this clarify your question? 



3.  "Would it be ok to skip the rarifying of the biom file step?"

I want to make sure we are talking about the same step: 'alpha rarefaction', which is used to estimate alpha diversity, is often confused with 'rarifying', which is a normalization method. (The names are way too similar and this is a common mistake.)

To answer your question: When estimating alpha diversity, you should use a non-rarified .biom table. When calculating beta diversity, using a .biom table what has been normalized is probably a good option. To normalize, you could use normalize_table.py or single_rarefaction.py. 


My answer: Thank you very much for your explanation. For alpha diversity analysis, I use the alpha_rarefaction.py script (http://qiime.org/scripts/alpha_rarefaction.html 


For beta diversity: I did beta diversity analysis on another biom file which has 4 samples (4 bacterial communities). I am attaching the OTU_todoBetaDiversity.biom file to this email. For beta diversity analysis, I am using the following qiime script:  http://qiime.org/scripts/beta_diversity_through_plots.html . Is it possible that this script does the normalization, or do I have to normalize the data before giving it as an input to this script? 


Thank you very much for all your help, 

Julie 

ps: This is the command I ran in order to do beta diversity analysis. 

# Beta Diversity (euclidean) command

beta_diversity.py -i OTU_todoBetaDiversity.biom -o outputDirectory --metrics euclidean


--

---
You received this message because you are subscribed to a topic in the Google Groups "Qiime Forum" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/qiime-forum/YWXFEYUeKfE/unsubscribe.
To unsubscribe from this group and all its topics, send an email to qiime-forum...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

OTU_todoBetaDiv.txt
OTU_todoBetaDiv.biom

Colin Brislawn

unread,
Aug 27, 2015, 2:47:53 AM8/27/15
to Qiime Forum
Hello Julie,

I'm glad I can help! Some of the analysis approaches of qiime are a bit confusing at first and I hope I can help get you started. Once I understood the methods, I found many of qiime's approaches to be really intuitive. 


Let's talk about alpha diversity and the purpose of making rarefaction curves. 

When doing an 16S study, one question you have to ask is 'did I observe all the microbial diversity in my community? In practice, this question becomes 'did I sequence deeply enough to see every microbe'. For example, If you have 20 reads per sample and 200 microbes in your community, you could not possibly have sequenced all 200 of them with your 20 reads. If you have 200 reads per sample and 200 microbes, it's also super unlikely that you captured all of them (unless you happen to get one read from each microbe). But what if you have 2,000 reads from a community of 200 microbes? What if you have 10,000 reads from 200 microbes? How do you know when you have sequenced deeply enough? How can you convince yourself you don't need more read depth per sample?


One way to show that you got all the microbes is by making rarefaction curves. The approach works like this:
Take your full community. Let's say this community contains 200 microbes.
Subsample at different read depths, from very low to very high.
Plot this, with sample depth on the x axis, and microbial richness on the y axis.
10 reads = 7 microbes
100 reads = 65 microbes
1000 reads = 187 microbes
10,000 reads = 199 microbes
100,000 reads = 199 microbes
Note how at low depths, more reads means more observed microbes. But at a certain point, you see the same microbes even if you sequence more deeply. At the point where you are now longer seeing new microbes, you can say your sequencing depth is 'good enough'. 

Note that in this example I give, I'm not seeing all 200 microbes even at a crazy high read depth of 100,000 reads per sample. This is expected; you wont catch everything. This method shows me that a few thousand reads per sample will capture most of my microbes. I'm OK with that.

Making rarefaction curves in your mock community let's you ask, 'how many reads do I need to capture all 19 of my microbes'. I can see that your curve stops increasing by 4417 reads per sample, leading me to conclude that 4417 reads (or maybe less) is enough to capture 19 microbes.

I would remake these curves using these parameters to see if you can get away with a lower number of reads to capture all 19 OTUs.
multiple_rarefactions.py -i OTU.biom -m 50 -x 2000 -s 100 -o /rarefaction_mod
This will subsample at 50, 150, 250, 350, etc. I'm guessing that you will find all your OTUs at a pretty low read depth because there is only 19 if this sample. Keep in mind that many microbial communities will include far more than 19 microbes and so will need many more reads to get all of them.



You could also run these alpha diversity metrics on your 4 samples. I think that would be an excellent way to see how this technique works with multiple samples.

I hope this helps! If you have more questions, maybe about beta diversity, feel free to make more threads in the forum! 
Colin

julie smith

unread,
Aug 27, 2015, 7:17:56 PM8/27/15
to qiime...@googlegroups.com
Hello Colin, 

Thank you so much for your detailed explanation of the alpha diversity! I now have a much better understanding of how to interpret the numbers I see in my results. I have a couple of other questions if you don’t mind: 

You had suggested that I run the following command "multiple_rarefactions.py -i OTU.biom -m 50 -x 2000 -s 100 -o /rarefaction_mod” 

The way, I did alpha diversity analysis is by using the alpha_rarefaction.py workflow script.  http://qiime.org/scripts/alpha_rarefaction.html  which calls multiple_rarefactions.py automatically. 

I noticed that when the workflow calls the  multiple_rarefactions.py, it automatically assigns values to -m, -x and -s parameters as below:    multiple_rarefactions.py -i OTU.biom -m 10 -x 44087 -s 4407 -o outputDir

Could you please let me know where the 44087 and 4407 numbers came from? I had a hunch that the 44087 could be the number of total reads in the OTU input file. So, I summed up all the number of reads in the OTU file and it was 44087 as I suspected.  It seems like the python code adds all the reads and assigns the "sum of all reads" to the –x parameter. Is there a reason for using the sum of all reads? 

Also, how does the number 4407 gets calculated  for -s parameter and where does it come from?  I looked through the tutorials, but couldn’t find how –s value is automatically assigned by the workflow script.  

Thank you so much again for all your help! 
Julie 


--

Colin Brislawn

unread,
Aug 27, 2015, 7:45:19 PM8/27/15
to Qiime Forum
Hello Julie,

I'm glad I can help!

I had a hunch that the 44087 could be the number of total reads in the OTU input file
Good detective work! In this case, max depth is equal to the sum of your your OTUs. However, that's not exactly how qiime is calculating the maximum depth. (Hint: what would have happened if you had more than one sample?)
 
Could you please let me know where the 44087 and 4407 numbers came from?
Also, how does the number 4407 gets calculated  for -s parameter and where does it come from?
You were so close to the answer! All these parameters are described in the official documentation, in the link that you just sent to me.


Maybe try running this command on your .biom file with the 19 OTUs.
alpha_rarefaction.py ... -a -O 4 --min_rare_depth 50 -e 2000

Keep up the good work!
Colin

julie smith

unread,
Aug 31, 2015, 3:15:40 PM8/31/15
to qiime...@googlegroups.com
Thanks again for your reply Colin. I looked at the website URL that you sent, however I couldn't find the explanation for  "-s" and "-x" parameters. Is there another URL where these are explained perhaps? 

Also, could you please let me know how qiime comes up with the number 4407 for the -s parameter? 

thank you!
Julie 


--

Colin Brislawn

unread,
Aug 31, 2015, 4:44:50 PM8/31/15
to Qiime Forum
Hello Julie,

I sent you to the wrong page! I'm sorry about that. The correct page is: http://qiime.org/scripts/multiple_rarefactions.html 

The script alpha_rarefaction.py makes use of multple_rarefactions.py, and I got them mixed up.


how qiime comes up with the number 4407 for the -s parameter? 
Take a look at these paramaters from alpha_rarefaction.py
-n, --num_steps
Number of steps (or rarefied OTU table sizes) to make between min and max counts [default: 10]
--min_rare_depth
The lower limit of rarefaction depths [default: 10]
-e, --max_rare_depth
The upper limit of rarefaction depths [default: median sequence/sample count]

Alpha_rarefaction.py is a workflow script, which calls several other scripts including multiple_rarefactions.py. When it calls these other scripts, it has to give them input parameters. In this case, the -s of multiple_rarefactions.py is set by alpha_rarefaction.py using the --min_rare_depth, --max_rare_depth, and --num_steps. I think the formula for -s is (max_rare_depth - min_rare_depth)/num_steps. In your case, (44087 - 10)/10 = 4407 = -s. 


By the way, all the current qiime script are listed on: http://qiime.org/scripts/. I have trouble remember their names and parameters, so I use this page as a referance. 

I hope that helps!
Colin


julie smith

unread,
Sep 4, 2015, 11:51:37 AM9/4/15
to qiime...@googlegroups.com
Hi Colin,

Thank you so much for your clear explanation and quick replies to my questions. I really appreciate it. I have couple of more questions if you don't mind. I ran the alpha diversity analysis for more than one community as you suggested and got the attached plot and table. These are 4 different bacterial communities which I labelled them as Sample1, Sample2, Sample3 and Sample4. I also attached the OTU table.  Here are my questions. 


Q1. What does the attached plot tell? This is how i interpreted it based on your earlier explanations (is there anything else we can say)?: "Sample 4 (S004) has about 18 species and Samples1, 2 and 3 have more species around 22. Even if I increase my sequence depth after some point, the number of species wont increase. 


Q2. What does the attached table tell? I see bunch of "nan" values, why are these values nan? Are the results in the table still valid? 


Q3. Attached is the OTU table. I am also giving all the qiime commands ran below. I see that the command for multiple_rarefaction is run with the parameter given as below. How is the number 110685 for -x calculated by qiime? You had mentioned that for multiple samples this number is not the total reads. Could you please clarify? 

multiple_rarefactions.py -i OTU.biom -m 10 -x 110685 -s 11067 -o outputDir


Thanks again for all your help!
Julie
ps: These are the commands ran that generated the attached plot and table: 


alpha_rarefaction.py -i OTU.biom -m map.txt -p alphaDivparameters.txt -o outputDir -f
Calculating alpha diversity ..


Executing commands.

# Alpha rarefaction command
multiple_rarefactions.py -i OTU.biom -m 10 -x 110685 -s 11067 -o outputDir

Stdout:

Stderr:

# Alpha diversity on rarefied OTU tables command
alpha_diversity.py -i outputDir/rarefaction/ -o outputDir/alpha_div --metrics shannon,chao1,simpson,observed_species

Stdout:

Stderr:

# Collate alpha command
collate_alpha.py -i outputDir/alpha_div/ -o outputDir/alpha_div_collated/

Stdout:

Stderr:

# Removing intermediate files command
rm -r outputDir/rarefaction/ outputDir/alpha_div/

Stdout:

Stderr:

# Rarefaction plot: All metrics command
make_rarefaction_plots.py -i outputDir/alpha_div_collated/ -m map.txt -o outputDir/alpha_rarefaction_plots/

Stdout:

Stderr:




OTU4_Communities.txt
Plot4Communities.png
Table4SCommunities.png

Colin Brislawn

unread,
Sep 4, 2015, 12:30:19 PM9/4/15
to Qiime Forum
Good morning Julie,

Q1 "Sample 4 (S004) has about 18 species and Samples1, 2 and 3 have more species around 22. Even if I increase my sequence depth after some point, the number of species wont increase."
Well said!
I would phrase the second part as 'A depth of 20K reads per sample sufficiently captures microbial diversity in our four samples'. (Note that 20K is the threshold for these samples. If you sequence new samples, you will have to run this analysis again to see at what depth the diversity plateau's. It may be different from 20K.)
If Samples 1, 2, and 3 were from cancer patients while 4 was from a control, you could say 'alpha diversity is higher in cancer patients than in controls'. This is where you connect the observation (differences in alpha diversity) with biological significance (in my example, cancer). 

Q2 The data in the table is the same as in the graph. You don't have to worry about those 'nan's. They are appearing because you are comparing single samples against each other. If you comparing groups (like cancer vs control) there would be numbers there.

Q3 "How is the number 110685 for -x calculated by qiime"
multiple_rarefactions.py -i OTU.biom -m 10 -x 110685 -s 11067 -o outputDir
Note how the -x 110685 is passed to the multiple_rarefactions command? That means that you can find the answer in the multiple_rarefactions documentation. 
Let me know when you find the answer. (I sent you the right link this time!)



Colin

julie smith

unread,
Sep 10, 2015, 2:41:01 PM9/10/15
to qiime...@googlegroups.com
Thank you very much Colin for your replies to my questions. 

I read the documentation on multiple rarefactions link that you had sent to me in your email, however, couldn't figure out where the number 110685 came from. Is it a random number due to subsampling of the OTU table? If you could provide some more insight on this, I would really appreciate it.  

Thanks,
Julie 


Colin Brislawn

unread,
Sep 10, 2015, 4:40:00 PM9/10/15
to Qiime Forum
Sure! The number 110,685 is the median sequence/sample in your input .biom table. 

The value of -x is either set manually by you, or by the script which calls multiple_rarefactions.py. In your example, the script alpha_rarefaction.py calls multiple_rarefactionspy, so the value of -x is set by the script alpha_rarefaction.py.

The script alpha_rarefaction.py has an optional input of -e or --max_rare_depth which will be passed to multiple_rarefactions.py as -x.

By default, the value of alpha_rarefaction.py -e is "median sequence/sample count"

So you run alpha_rarefaction.py with the default -e value.
To calculate the -e, alpha_rarefaction.py looks at your biom file. It finds sequences per sample for every sample, then takes the median.
This is 110685. So alpha_rarefaction.py -e = 110685.
alpha_rarefaction.py takes this number, and makes it the value of -x when it calls the script multiple_rarefactions.py.
So now multiple_rarefactions.py -x = 110685


Does that make more sense? This is kind of confusing because you have two scripts which use a different flag for the same number. (multiple_rarefactions.py uses -x while alpha_rarefaction.py uses -e) It would probably be a lot easier if both scripts used the same letter when they set the maximum rarefaction depth.

Colin


julie smith

unread,
Sep 16, 2015, 6:48:21 PM9/16/15
to qiime...@googlegroups.com
Hi Colin, 

Thanks you so much! I appreciate your quick response and your detailed explanation which helps a lot!  

If you don't mind, I also have some questions about beta diversity analysis. I have posted the question to qiime list a couple of times earlier, but didn't get a response yet. If you could help me with the questions on beta diversity or guide me, I would really appreciate it. Thanks again for all your great help!

I have run beta diversity workflow script on my input biom file. There are 4 different metagenomics bacterial communities and each community has around 20 species.  I didn't group the 4 metagenomics communities with each other, i just wanted to see the differences among these 4 communities. These 4 communities are labelled as Sample1, Sample2, Sample3, Sample4 in the emperor plots/screenshots attached. 

I decided to do a beta diversity analysis in these 4 samples using qiime. After installing qiime, I prepared the input files by studying the tutorials and examples given on the qiime website which was very helpful.  

Attached is the input OTU txt file that I prepared and its biom version. The numbers in the OTU file represent the number of read per species in each sample. I ran the beta diversity through plots workflow:


beta_diversity_through_plots.py -i OTU.biom -m map.txt -p Parameter.txt -o output -f


I am also pasting the commands that were run in the below. My questions are listed below.  I would appreciate any feedback/comments. 

1. What does the PCOA plot(attached) tell about these 4 samples (attached plot)? 
2. What does the PCOA distances tell about these samples (attached file)? 
3. How do I interpret the PC matrix? What does this tell (also file is attached)?
4. What do the numbers in the distance metric tell (attached)? 

Basically, I currently need some help to understand how to interpret these results. I would very much appreciate any insights you could provide to me.   

Thank you,
Julie


# Beta Diversity (euclidean) command 
beta_diversity.py -i OTU_species.biom -o outputDir --metrics euclidean 

Stdout:

Stderr:
/usr/local/lib/python2.7/dist-packages/numpy/core/fromnumeric.py:2507: VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.
  VisibleDeprecationWarning)

# Rename distance matrix (euclidean) command 
mv /euclidean_OTU_species.txt  euclidean_dm.txt

Stdout:

Stderr:

# Principal coordinates (euclidean) command 
principal_coordinates.py -i euclidean_dm.txt -o euclidean_pc.txt 

Stdout:

Stderr:
/usr/local/lib/python2.7/dist-packages/skbio/stats/ordination/_principal_coordinate_analysis.py:107: RuntimeWarning: The result contains negative eigenvalues. Please compare their magnitude with the magnitude of some of the largest positive eigenvalues. If the negative ones are smaller, it's probably safe to ignore them, but if they are large in magnitude, the results won't be useful. See the Notes section for more details. The smallest eigenvalue is -5.97764983814e-08 and the largest is 322496773.33.
  RuntimeWarning

# Make emperor plots, euclidean) command 
make_emperor.py -i euclidean_pc.txt -o /euclidean_emperor_pcoa_plot/ -m map.txt 

Stdout:

Stderr:


Logging stopped at 16:18:16 on 25 Aug 2015



OTU_species.txt
OTU_species.biom.txt
euclidean_dm.txt
euclidean_pc_table.txt
map.txt
Parameter.txt
DistancePlot.png
PCOAPlot_1.png

Colin Brislawn

unread,
Sep 16, 2015, 7:52:48 PM9/16/15
to Qiime Forum
Hello Julie,

I'm happy to help you use the scripts that qiime offers, but it sounds like you are asking for something different.

4. What do the numbers in the distance metric tell (attached)?
These numbers are Euclidean distance between samples. 

1. What does the PCOA plot(attached) tell about these 4 samples (attached plot)? 
2. What does the PCOA distances tell about these samples (attached file)? 
3. How do I interpret the PC matrix? What does this tell (also file is attached)?
Basically, if samples are similar, they will be placed close together in the plot. If they are different, they will be placed further away. The process by which samples are 'placed' in space is called Principal Coordinate Analysis (PCoA). There is a pretty good explanation of how this works, here: http://occamstypewriter.org/boboh/2012/01/17/pca_and_pcoa_explained/

I think that Antoino gave an excelent answer here: https://groups.google.com/d/msg/qiime-forum/oboEutau8YQ/jiN97N0yCQAJ

Colin

julie smith

unread,
Sep 17, 2015, 2:55:54 PM9/17/15
to qiime...@googlegroups.com
Hi Colin, 

Thank you very much again for your quick reply. I just noticed Antonio's answer post today after you had sent me the link since I didn't receive an email earlier.  Thank you and Antonio both for your help and insights. I will take a look at the links you both posted and will let you know if I have any other questions. 

Best,
Julie 



Reply all
Reply to author
Forward
0 new messages