how to get strain name after panphlan_profile.py


Ming Liao

Apr 8, 2016, 7:56:39 PM
to MetaPhlAn-users
Hello all,

This is probably not the best place to ask my question, but I think someone in this group may have had similar questions. Also, the PanPhlAn Google group seems to be closed, so I came here.

I am trying PanPhlAn with our own data and have reached the final step, panphlan_profile.py.
The column names in the panphlan_profile.py output look like g00001...; how can I map them to KEGG gene names such as K05711?


I referred to the latest published paper that uses PanPhlAn:

Metagenomic Sequencing with Strain-Level Resolution Implicates Uropathogenic E. coli in Necrotizing Enterocolitis and Mortality in Preterm Infants

It says the following:

Metagenomic MLST Analysis

We developed a metagenomic approach to exploit the MLST strategy commonly used in cultivation-based typing assays (Maiden et al., 1998). Reads were mapped with Bowtie2 against a database of the known E. coli MLST sequences corresponding to distinct alleles of seven genes: adk, fumC, gyrB, icd, mdh, purA, and recA (parameters -D 20 -R 3 -N 0 -L 20 -i S,1,0.50). A consensus sequence for each loci was constructed considering the nucleotide with the highest frequency in each position. All samples where all loci obtained a minimum breadth of coverage of at least 90% were confidently mapped. For the small fraction of loci with low or non-complete coverage (2.11% of the loci in the positive samples), the best-matching reference allele from the MLST database was used to fill the uncovered positions. Reconstructed consensus alleles were used to determine the most abundant MLST (ST) profile in a sample based on known E. coli ST profiles: 3,895 known profiles from the University of Warwick Medical School MLST database, http://mlst.warwick.ac.uk/mlst/ (Wirth et al., 2006).
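The consensus step described in the quoted methods (most frequent nucleotide per position, breadth of coverage, filling uncovered positions from a reference allele) can be sketched roughly as below. This is only an illustration and not the paper's code: the function name, the `pileup` dictionary layout, and the `fill` argument are assumptions made for the example.

```python
from collections import Counter


def consensus_and_breadth(pileup, length, fill=None):
    """Build a majority-vote consensus for one locus.

    pileup: dict mapping a 0-based position to the bases observed there
            (hypothetical layout, e.g. {0: "AAAC", 1: "GG"}).
    length: length of the reference allele.
    fill:   optional reference allele used for uncovered positions.
    Returns (consensus sequence, breadth of coverage as a fraction).
    """
    consensus = []
    covered = 0
    for pos in range(length):
        bases = pileup.get(pos, "")
        if bases:
            covered += 1
            # Take the nucleotide with the highest frequency at this position.
            consensus.append(Counter(bases).most_common(1)[0][0])
        elif fill is not None:
            # Uncovered position: borrow the base from the reference allele.
            consensus.append(fill[pos])
        else:
            consensus.append("N")
    return "".join(consensus), covered / length


example_pileup = {0: "AAAC", 1: "GG", 3: "TTA"}
sequence, breadth = consensus_and_breadth(example_pileup, 4, fill="ACGT")
print(sequence, breadth)
```

A locus would then count as confidently covered when `breadth >= 0.9`, matching the 90% threshold in the quoted text.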

I ran our raw data (fastq) through Bowtie2 with the same parameters (-D 20 -R 3 -N 0 -L 20 -i S,1,0.50), but the output file was in SAM format, and I am confused about how to interpret the results.
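For context, SAM is the tab-delimited text alignment format that Bowtie2 writes: header lines start with `@`, and in each alignment line the second field is a bitwise flag (bit 0x4 set means the read is unmapped) and the third field is the reference name. A minimal sketch for counting mapped reads per reference sequence, assuming the SAM lines are already loaded as strings; the allele names in the sample data are made up:

```python
from collections import Counter


def count_mapped_reads(sam_lines):
    """Count mapped reads per reference name from SAM-formatted lines."""
    counts = Counter()
    for line in sam_lines:
        # Skip empty lines and header lines such as @HD, @SQ, @PG.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        reference = fields[2]
        # Flag bit 0x4 marks an unmapped read; "*" means no reference.
        if flag & 0x4 or reference == "*":
            continue
        counts[reference] += 1
    return counts


example_sam = [
    "@SQ\tSN:adk_1\tLN:536",
    "r1\t0\tadk_1\t10\t42\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t16\tadk_1\t25\t42\t4M\t*\t0\t0\tACGT\tIIII",
    "r4\t0\tfumC_4\t7\t40\t4M\t*\t0\t0\tACGT\tIIII",
]
print(count_mapped_reads(example_sam))
```

Per-allele counts like these are only a first sanity check; the MLST procedure quoted above goes further and builds a consensus sequence per locus.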

To put my question simply: how can I use the PanPhlAn output to get the strain names of E. coli, as shown in the abstract of the above paper?
...Metagenomic multilocus sequence typing analysis further defined NEC-associated strains as sequence types often associated with urinary tract infections, including ST69, ST73, ST95, ST127, ST131, and ST144. ...

Our preliminary analysis of our own data also detected E. coli, and our next analysis relies heavily on PanPhlAn to get down to the strain level.

Any comments or advice are welcome. I really appreciate your help. Thanks

Ming

Matthias Scholz

Apr 10, 2016, 11:12:39 AM
to MetaPhlAn-users
Hi Ming,

For PanPhlAn, to get the gene-name, sequence, and KEGG annotation, take a look here:
https://bitbucket.org/CibioCM/panphlan/wiki/wiki_FAQ_get_KEGG_annotation

cheers,
Matthias

Ming Liao

Apr 10, 2016, 5:38:30 PM
to MetaPhlAn-users
Thanks Matthias,

What good news! There is a wiki FAQ that I had never looked at!
I wasted two days following the instructions in the PanPhlAn paper:

"KEGG Orthology (KO) identifiers are obtained by a BLASTn mapping of the representative sequence of each gene family against the KEGG nucleotide database (overall e-value threshold at 1e-30, best hit strategy)."

and found this KEGG database (ftp://ftp.genome.jp/pub/db/mgenes/), and then this alignment tool, GHOSTX (http://www.bi.cs.titech.ac.jp/ghostx/).
So far I have found that the K numbers are not in the nucleotide database named "meta.nuc", but I am still downloading the other two databases. I am not sure this is the right approach.
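The "best hit strategy" with an e-value threshold mentioned in the quoted sentence can be sketched as follows, assuming the BLASTn results are in the standard 12-column tabular format (`-outfmt 6`: query, subject, identity, length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score). The function name, the default threshold of 1e-30, and the subject IDs in the sample data are assumptions for illustration, not the PanPhlAn authors' code.

```python
def best_hits(blast_lines, max_evalue=1e-30):
    """Keep the highest-scoring hit per query among hits passing the e-value cut.

    Returns a dict: query id -> (subject id, e-value, bit score).
    """
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        query, subject = cols[0], cols[1]
        evalue, bitscore = float(cols[10]), float(cols[11])
        # Discard hits weaker than the e-value threshold.
        if evalue > max_evalue:
            continue
        # Keep only the hit with the highest bit score for each query.
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


example_hits = [
    "g00001\teco:b0001\t97.0\t300\t9\t0\t1\t300\t1\t300\t1e-140\t500",
    "g00001\teco:b0002\t99.0\t300\t3\t0\t1\t300\t1\t300\t1e-150\t540",
    "g00002\teco:b0003\t80.0\t60\t12\t0\t1\t60\t1\t60\t1e-5\t50",
]
print(best_hits(example_hits))
```

The subject ID of each best hit would still have to be translated into a K number through a separate gene-to-KO table, which this sketch does not cover.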

Thanks again for your quick response. It really helps.

Ming



Ming Liao

Apr 13, 2016, 2:12:47 PM
to MetaPhlAn-users
Hi, Matthias

Your information about KEGG annotation really helps. But we have detected hundreds of significant gene IDs, and it is tough work to annotate them by copying and pasting each sequence into the web form (http://www.genome.jp/tools/blast/), which does not support multiple sequences.
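For a batch search instead of pasting sequences one at a time, the sequences of the selected gene IDs could first be collected into a single multi-FASTA text. A minimal sketch, assuming the gene-family sequences are available as a FASTA file whose header lines begin with the gene ID (the IDs and sequences below are invented for the example):

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def subset_fasta(lines, wanted_ids):
    """Return FASTA text holding only records whose ID is in wanted_ids."""
    records = []
    for header, sequence in read_fasta(lines):
        # The ID is taken to be the first whitespace-delimited token.
        seq_id = header.split(maxsplit=1)[0] if header else ""
        if seq_id in wanted_ids:
            records.append(f">{header}\n{sequence}")
    return "\n".join(records)


example_fasta = [
    ">g00001 hypothetical gene family",
    "ATGAAA",
    "CCCGGG",
    ">g00002",
    "ATGTTT",
    ">g00003",
    "ATGCCC",
]
print(subset_fasta(example_fasta, {"g00001", "g00003"}))
```

The resulting text can be written to one file and used as the query for a local search, provided a suitable database is available.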

I spent a couple of days figuring out how this online annotation works. Initially I found http://www.kegg.jp/ghostkoala/ and thought the GHOSTX built into it would help, but unfortunately it only works for amino acid sequences. Next, I turned to BLASTn at NCBI. So far, my guess is that the KEGG annotation uses BLASTn first and then GHOSTX. Either way, I think the seed database is the key.

Is the KEGG database used to assign K numbers to a user's sequence data really not free? I am not sure about it. I just want to be certain about this commercial issue so that I can decide whether to learn BLASTn and GHOSTX. I don't want to spend the time and then find out that the database is not freely available.

I really appreciate your time and kindness. Thanks

Ming

Ming Liao

Apr 14, 2016, 2:00:27 PM
to MetaPhlAn-users
Hello there,

I think I should make my question more straightforward:

Is the KEGG database used to assign K numbers to a user's sequence data really not free?

Thanks