support for Krona visualization

Daniel Brami

unread,

Aug 1, 2012, 3:36:16 PM8/1/12

to metaphl...@googlegroups.com

Any possibility that an output format could be directly fed into krona-tools?

http://sourceforge.net/p/krona/wiki/Importing%20text%20and%20XML%20data/
Perhaps a simple text export that conforms to use of 'ktImportText' ?

Simple example:
http://krona.sourceforge.net/examples/text.txt

Nicola Segata

unread,

Aug 6, 2012, 9:51:08 AM8/6/12

to metaphl...@googlegroups.com

Hi Daniel,

this is definitely a good idea. We were actually thinking to provide export files for graphical visualization and Krona would be a very nice option.

However, we're currently busy with the development of a new and improved database for MetaPhlAn 2.0 and given that I'm not very familiar with the Krona input format I'm not sure we'll be able to implement your suggestion anytime soon.

Obviously, if anyone would like to implement a conversion script from MetaPhlAn to Krona we would be more than happy to integrate it into the official release and adequately acknowledge him/her on the website and in the source files. If anyone is interested we already have a conversion from MetaPhlAn to PhyloXML that may be useful. The conversions available from other metagenomic profiling software should also be a good starting point...

thanks

Nicola

Daniel Brami

unread,

Aug 7, 2012, 3:00:18 PM8/7/12

to metaphl...@googlegroups.com

Thanks for reply;

So here's a sample output from metaphlan:
==> Sample_SGI-01.bt2out.txt <==
GWZHISEQ01:97:C0LR6ACXX:6:1201:3639:1943 100146888
GWZHISEQ01:97:C0LR6ACXX:6:1201:6666:1990 100167158
GWZHISEQ01:97:C0LR6ACXX:6:1201:14657:1973 100039919
GWZHISEQ01:97:C0LR6ACXX:6:1201:20597:1960 100146591
GWZHISEQ01:97:C0LR6ACXX:6:1201:1706:2116 100146741
GWZHISEQ01:97:C0LR6ACXX:6:1201:2426:2049 100146876
GWZHISEQ01:97:C0LR6ACXX:6:1201:2354:2072 100170358
GWZHISEQ01:97:C0LR6ACXX:6:1201:2327:2104 100167880
GWZHISEQ01:97:C0LR6ACXX:6:1201:4470:2158 100140746
GWZHISEQ01:97:C0LR6ACXX:6:1201:5043:2188 100167880

Krona can be given a simple text input, such as following:
(http://krona.sourceforge.net/examples/text.txt)
2 Fats Saturated fat
3 Fats Unsaturated fat Monounsaturated fat
3 Fats Unsaturated fat Polyunsaturated fat
13 Carbohydrates Sugars
4 Carbohydrates Dietary fiber
21 Carbohydrates
5 Protein
4

Here is the problem is given this format:"If a hierarchy has more than one listing, the quantities will be added."
Since the 2nd metaphlan output is very thorough and gives relative abundance at every taxa, i can't traverse it and simply add the counts (based on converting from % to count).
It should be possible to do if we had a mapping to the taxonomic levels (which is what I suspect each number in column 2 of example 1 is).

Any thoughts?

Nicola Segata

unread,

Aug 7, 2012, 3:24:50 PM8/7/12

to metaphl...@googlegroups.com

Hi Daniel,

thanks for looking into this.

If I understood correctly the Krona input format, I think that what you need from the MetaPhlAn output are only the lines corresponding to taxonomic terminal nodes. This is the equivalent to say that you need the species-level labels (denoted with s__) and the "unclassified" terminals.

If you considering the sample MetaPhlAn output attached (LC1.mpa.txt) you can extract the terminals with:

grep -P "s__|unclassified\t" LC1.mpa.txt > LC1.mpa.terminals.txt

which results in the other attached file (LC1.mpa.terminals.txt). You can of course perform the same operation internally to a script instead of using "grep".

All you should need to do at this point is substitute vertical bars "|" with tabs and move the abundance column as first column. But let me know if I misunderstood something.

I hope this helps... and let us know how it goes :)

thanks

Nicola

LC1.mpa.txt

LC1.mpa.terminals.txt

Daniel Brami

unread,

Aug 7, 2012, 3:41:02 PM8/7/12

to metaphl...@googlegroups.com

Does this mean that every read is always mapped down to the species level (if given proper -a or -s flag)?

Nicola Segata

unread,

Aug 7, 2012, 3:59:13 PM8/7/12

to metaphl...@googlegroups.com

Every read is mapped down to the most specific taxonomic level possible which usually means species level.

If a read (actually enough reads for having statistical robustness) is mapped at genus level only, and there are no hits for any species of that genus (let's call it genus A), you would see:

k__Bacteria|p__Phylumname|[....]|g__A_unclassified

And "k__Bacteria|p__Phylumname|[....]|g__A_unclassified" is a terminal taxonomic node even if is not a species.

In other cases you may have something like that:

k__Bacteria|p__Phylumname|[....]|g__A_unclassified 4%

k__Bacteria|p__Phylumname|[....]|g__A|s_A1 7%

k__Bacteria|p__Phylumname|[....]|g__A|s_A2 12%

This means that there are two species in genus A at 7% and 12% and there is an additional 4% of organisms in that genus that are not classified in none of the known species (A1 and A2 in particular). In total, the abundance of A (denoted as k__Bacteria|p__Phylumname|[....]|g__A is 4%+7%+12%)

Selecting only terminal nodes (for example using the "grep" command above) you will always end up with taxa that sum up to 100%, so there is no redundancy or taxa counted more than once.

Does it clarify the output format?

thanks

Nicola

Daniel Brami

unread,

Aug 8, 2012, 3:18:43 PM8/8/12

to metaphl...@googlegroups.com

Hi Nicola,

I am now producing Krona visualizations from your result files; the code for each script is below;
But there are two shortcomings to this method of creating the krona output:
- we do not obtain the list of read ID's associated with each taxa (I imagine that's what the number in second column of *.bt2out.txt represents)
- there is no confidence value associated with the assignments

Here's how I generate my plots:

for a in `ls *.bt2out.txt|sort`;do
B=`basename ${a} .bt2out.txt`
metaphlan2krona.py \
-m ${a} -p ${B}.profiling_output.txt -k ${B}
/bioinformatics/asm/bio_bin/KronaTools-2.2/bin/ktImportText -o ${B}.metaphlan.krona.html ${B}.tmp
done

And here is the code for metaphlan2krona.py:

#!/usr/bin/env python2.7

import sys
import optparse
import re
import subprocess
import tempfile

def file_len(fname):
p = subprocess.Popen(['wc', '-l', fname], stdout=subprocess.PIPE,
stderr=subprocess.PIPE)
result, err = p.communicate()
if p.returncode != 0:
raise IOError(err)
return int(result.strip().split()[0])

def runCommandTMPLost(myCommand):
try:
tmp_stdout = open( tempfile.NamedTemporaryFile().name, 'w' )
tmp_stderr = open( tempfile.NamedTemporaryFile().name, 'w' )
proc = subprocess.Popen(args=myCommand, shell=True, stdout=tmp_stdout, stderr=tmp_stderr, bufsize=4096)
tmp_stdout.close()
tmp_stderr.close()
returncode = proc.wait()
except Exception, e:
sys.stderr.write( "runCommand error: %s\n"%e )

def main():
#Parse Command Line
parser = optparse.OptionParser()
parser.add_option( '-p', '--profile', dest='profile', default='', action='store', help='the metaPhLan .profile taxonomic output file' )
parser.add_option( '-m', '--map', dest='map', default='', action='store', help='the metaPhLan read to taxa mapping file' )
parser.add_option( '-k', '--krona', dest='krona', default='krona.out', action='store', help='the Kron output file name' )
( options, spillover ) = parser.parse_args()

TotalReads = file_len(options.map)

re_candidates = re.compile(r"s__|unclassified\t")
re_replace = re.compile(r"\w__")
re_bar = re.compile(r"\|")

metaPhLan = list()
with open(options.profile,'r') as f:
metaPhLan = f.readlines()
f.close()

krona_tmp = options.krona + '.tmp'
metaPhLan_FH = open(krona_tmp, 'w')

for aline in (metaPhLan):
if(re.search(re_candidates, aline)):
x=re.sub(re_replace, '\t', aline)
x=re.sub(re_bar, '', x)

x_cells = x.split('\t')
lineage = '\t'.join(x_cells[0:(len(x_cells) -1)])
abundance = float(x_cells[-1].rstrip('\n')) * (float(TotalReads) / float(100))

metaPhLan_FH.write('%s\n'%(str(int(abundance)) + '\t' + lineage))

metaPhLan_FH.close()

if __name__ == '__main__':
main()

I hope this helps some people out.

Ciao

Nicola Segata

unread,

Aug 8, 2012, 4:32:48 PM8/8/12

to metaphl...@googlegroups.com

Hi Daniel,

that's fantastic, thanks a lot!

Regarding the number of reads, MetaPhlAn provides directly the relative abundance of organisms rather than the fraction of reads matching each organisms. This means that you don't need to normalize the results based on the total number of reads because the algorithm already takes care of that.

Let me provide an example for explaining the difference. Suppose you have 1 million cells of bug A and 1 million cells of bug B in the sample. Suppose that the genome of A is twice as long as the genome of B. If you metagenomically sample this community, MetaPhlAn will correctly predict that A is at 50% and B is at 50%. The other existing methods will instead tell you that A is at 66% and B is at 33% because there are many more reads matching the genome of A, but just because A has a longer genome.

All this introduction to say that you need only the main MetaPhlAn output file for converting the results to Krona. In other words you can already provide the percentages that otherwise Krona will compute.

I thus slightly modified your code (attached the file). I commented the changes.

Here how I applied it on the synthetic community (http://huttenhower.sph.harvard.edu/sites/default/files/LC1.fna).

metaphlan.py LC1.fna --bowtie2db bowtie2db/mpa > LC1.mpa.txt

python metaphlan2krona.py -p LC1.mpa.txt -k LC1.krona.txt

ktImportText LC1.krona.txt -o LC1.krona.html

resulting in the following image below.

Thanks a lot Daniel. If you have a chance to have a look to the small modifications I made into the metaphlan2krona.py and give me your feeback I will then upload the script into the MetaPhlAn repository.

thank you very much

Nicola

LC1.mpa.txt

metaphlan2krona.py

LC1.krona.txt

text.krona.html

LC1.krona.png

Daniel Brami

unread,

Aug 8, 2012, 5:29:13 PM8/8/12

to metaphl...@googlegroups.com

On Wednesday, August 1, 2012 12:36:16 PM UTC-7, Daniel Brami wrote:

Your changes make sense.
The only inconvenience I can see is that you dont get a count of the number of classified reads from your sample.
We still need to so a 'wc -l sample.mpa.txt'; Perhaps a log with various metrics would be beneficial? somewhat similar to the bowtie2 mapping metrics?

Anyways, thanks for helping me get the visualization I wanted for MetaPhlan.

Nicola Segata

unread,

Aug 9, 2012, 8:28:42 AM8/9/12

to metaphl...@googlegroups.com

Hi Daniel,

great! I just pushed your script (metaphlan2krona.py) and another conversion script for obtaining PhyloXML trees (metaphlan2phyloxml.py) on the MetaPhlAn repository, and updated the wiki and webpage accordingly.

Given that we are using unique markers in order to avoid ambiguous reads-to-genome hits, it doesn't make sense to count the fraction of mapped reads for MetaPhlAn. In other words, we are focusing on the average coverage for the marker genes (a proxy for the relative abundance of cells) rather than the total number of reads in each genome.

thanks again for the useful script!

ciao

Nicola

Reply all

Reply to author

Forward