custom database

606 views
Skip to first unread message

James Doonan

unread,
Nov 25, 2015, 10:57:19 AM11/25/15
to MetaPhlAn-users
Hi all,

I've been trying to integrate two bacterial genomes to make them part of the humann2 pipeline.

https://groups.google.com/forum/#!topic/humann-users/Lj71_nno3Oo

However I would like to run metaphlan2 with my genomes included. The tutorial on the metaphlan Wiki site gives a tutorial on this. But I'm not sure how to prepare my data to add to the markers.fasta file. Should I pull out the same marker genes from my genomes that are in the markers.fasta file? Would my markers have the same score as their homologs? I've tried to modify the metaphlan2.py script as follows;

####modified###JD
import cPickle as pickle
import bz2

with open(args['mpa_pkl'], 'rb') as ifile:
db = pickle.loads(bz2.decompress(ifile.read()))

# Add the taxonomy of the new genomes
db['k__Bacteria']['g__Gibbsiella.s__quercinecans'] = 5548506
db['k__Bacteria']['g__Brenneria.s__goodwinii'] = 5395301

# Add the information of the new marker as the other markers
db['mpa_v21_m200'][new_marker_name] = {
'bacteria': the clade that the marker belongs to,
'ext': {the name of the first external genome where the marker appears,
the name of the second external genome where the marker appears,
},
'len': length of the marker,
'score': score of the marker,
'taxon': the taxon of the marker}
# To see an example, try to print the first marker information:
# print db['markers'].items()[0]

# Save the new mpa_pkl file
ofile = bz2.BZ2File('metaphlan2/db_v21/mpa_v21_m200.pkl', 'w')
pickle.dump(db, ofile, pickle.HIGHEST_PROTOCOL)
ofile.close()

I'm not sure about this, should I add new information for each marker from my new genomes to the new marker section of the script?

Thanks,

James

Duy Tin Truong

unread,
Nov 26, 2015, 2:50:16 AM11/26/15
to James Doonan, MetaPhlAn-users
Hi James,

The line:
db['k__Bacteria']['g__Gibbsiella.s__quercinecans'] = 5548506
should be changed to:
db['taxonomy']['k__Bacteria|p__..|c_..|o__...|f__...|g__Gibbsiella|s__quercinecans'] = 5548506

'taxonomy' is a fixed key word of the database.
You can see an example by:
print db['taxonomy'].items()[0]

db['mpa_v21_m200'][new_marker_name] = {...}
should be:
db['markers'][new_marker_name] = {...}
'markers' is a fixed keyword, just like 'taxonomy'.

new_marker_name is the name of your marker from the genome you would like to add.
The markers must be common (or core genes) and unique for that species, i.e. they should not be present in other species.

The core score is defined as follows:

import scipy.stats as st
 1.0-st.binom.sf(ok,tot,pr)

where ok is the number of taxa in the clade with the marker, tot is the total number of taxa in the clade, and pr is the success_rate (i.e. 1 - the estimation of the probability of not seeing a marker in a taxa due to problems with the assembly, gene calling, etc). We usually set pr to 0.95.  The uniqueness score is defined as the inverse of the number of external hits nomalized in the [0-1] interval. The marker score is a basically the combination (by multiplication) of the two with some additional special cases for clades with unclear taxonomy or a high probability of genome mislabelling.

In addition, you have to extract the marker sequences in fasta format and add them the markers.fasta file:
bowtie2-inspect metaphlan2/db_v20/mpa_v20_m200 > metaphlan2/markers.fasta
cat new_marker.fasta >> metaphlan2/markers.fasta
mkdir metaphlan2/db_v21/mpa_v21_m200
bowtie2-build metaphlan2/markers.fasta metaphlan2/db_v21/mpa_v21_m200

Thanks,
Tin

--
You received this message because you are subscribed to the Google Groups "MetaPhlAn-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to metaphlan-use...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Hussein Shokry

unread,
Nov 26, 2015, 11:53:32 PM11/26/15
to MetaPhlAn-users, clydea...@gmail.com
hi,

I am doing the same thing as the thread poster, however, with thousands of sequences (will do 2 databases actually, for bacteria & viruses). So my question become does the 'taxonomy' suffice as information for the metaphlan to identify the sequences (since i'll just input the headers as a dict with the lengths) or do i have to enter in the other fields 'bacteria', 'taxon'?

Yours,
Hussein

Duy Tin Truong

unread,
Nov 27, 2015, 4:24:06 AM11/27/15
to Hussein Shokry, MetaPhlAn-users, clydea...@gmail.com
Hi Hussein,

Only db['taxonomy'] is enough but you have to enter all intermediate fields k__ ...| p__...| ... |s__...|t... into the db['taxonomy'].
In addition, the markers must be the core and unique genes in each clade, i.e. if you rebuild a completely new database, you should have the markers for families, genus and species as well.
Extracting such markers may be the most difficult part. We are developing a pipeline for doing this task automatically but it may take a while.

Cheers,
Tin

Duy Tin Truong

unread,
Nov 27, 2015, 4:25:28 AM11/27/15
to Hussein Shokry, MetaPhlAn-users, clydea...@gmail.com
Hi James,

Sorry that the line in the previous email:
db['taxonomy']['k__Bacteria|p__..|c_..|o__...|f__...|g__Gibbsiella|s__quercinecans'] = 5548506

should be:
db['taxonomy']['k__Bacteria|p__..|c_..|o__...|f__...|g__Gibbsiella|s__quercinecans|t__genome_id'] = 5548506

Cheers,
Tin

Hussein Shokry

unread,
Nov 27, 2015, 4:37:41 AM11/27/15
to MetaPhlAn-users, hut...@gmail.com, clydea...@gmail.com
Thank you kindly for the fast reponse.

for the virus this is the format i have (basically the headers):

gnl|BL_ORD_ID|0 gb|AB261990|Strain|M2|Description|Lymphocytic choriomeningitis virus genomic RNA, segment S, nearly complete sequence including cds, strain: M2.|Country||Gi|108743534|vipr-id|284567

I was going to use the Gi Id's for the core markers or should i use the Id given by the curated group (vipr-id), and how should i format it into the format you mentioned ( k__ ...| p__...| ... |s__...|t...)?

for the bacteria though the header look like this : (prn pertactin precursor), the first part makes up unique markers (prn) which contain the information to an outward csv file containing the genus, species & strain, is that marker unique enough to get the information, the genus, species & strain don't have their own uqniue markers though?

Thanks in advance.

Duy Tin Truong

unread,
Nov 29, 2015, 2:32:12 PM11/29/15
to Hussein Shokry, MetaPhlAn-users, clydea...@gmail.com
Hi Hussein,

If you know the k__, p__, ..., s__..., t... of a strain than you should put it there. Otherwise, you can put p__Viruses_noname, etc. 
For example:
k__Viruses|p__Viruses_noname|c__Viruses_noname|o__Mononegavirales|f__Rhabdoviridae|g__Nucleorhabdovirus|s__Maize_mosaic_virus|t__PRJNA14920

The main idea here is that metaphlan2 will sum up the lower (children) level abundance of a clade to report its abundance.
To know if a maker is unique for a clade, we have to map it against the reference genomes in all other clades and make sure that there is no hit. Similarly, to know if a marker is common for that clade, we have to map it against all reference genomes in that clade and make sure that it hits all such reference genomes. And a marker of clade must be unique and common for that clade.

Thanks,
Tin




Message has been deleted

James Doonan

unread,
Nov 29, 2015, 6:13:33 PM11/29/15
to MetaPhlAn-users
Hi Tin,

Thanks for the reply.
I’ve build a bowtie2 database with my new markers as you described. However I don’t know python and I’m having trouble modifying the script. I’m getting syntax errors, here’s what I’ve got at the minute, sorry if it’s very basic;

import cPickle as pickle
import bz2
import scipy.stats as st
1.0-st.binom.sf(3,3,0.95)

with open(args['mpa_pkl'], 'rb') as ifile:
db = pickle.loads(bz2.decompress(ifile.read()))

# Add the taxonomy of the new genomes
db['taxonomy']['k__Bacteria|p__Proteobacteria|c_Gammaproteobacteria|o__Enterobacteriales|f__Enterobacteriaceae|g__Gibbsiella|s__quercinecans|t__genome_id'] = 5548506
db['taxonomy']['k__Bacteria|p__Proteobacteria|c_Gammaproteobacteria|o__Enterobacteriales|f__Enterobacteriaceae|g__Brenneria.s__goodwinii|t__genome_id'] = 5395301

# Add the information of the new marker as the other markers
db['markers'][new_marker_name] = {
'clade':
'ext'
'len'
'score'
'taxon'
}

ofile = bz2.BZ2File('metaphlan2/db_v21/mpa_v21_m200.pkl', 'w')
pickle.dump(db, ofile, pickle.HIGHEST_PROTOCOL)
ofile.close()

current error;

with open(args['mpa_pkl'], 'rb') as ifile:
^
SyntaxError: invalid syntax

Thanks,

James

Duy Tin Truong

unread,
Nov 30, 2015, 5:41:53 PM11/30/15
to James Doonan, MetaPhlAn-users
I answer you as below:

import cPickle as pickle
import bz2
import scipy.stats as st
 
1.0-st.binom.sf(3,3,0.95)
This line should be used to compute the score for each marker. I don't know why you put it here.
 
with open(args['mpa_pkl'], 'rb') as ifile:
        db = pickle.loads(bz2.decompress(ifile.read()))
This is correct to load the database.
 
# Add the taxonomy of the new genomes
db['taxonomy']['k__Bacteria|p__Proteobacteria|c_Gammaproteobacteria|o__Enterobacteriales|f__Enterobacteriaceae|g__Gibbsiella|s__quercinecans|t__genome_id'] = 5548506
db['taxonomy']['k__Bacteria|p__Proteobacteria|c_Gammaproteobacteria|o__Enterobacteriales|f__Enterobacteriaceae|g__Brenneria.s__goodwinii|t__genome_id'] = 5395301
t_genome_id should be replace by the id of the reference genome.
Example:
db['taxonomy'][''k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Pseudomonadales|f__Moraxellaceae|g__Acinetobacter|s__Acinetobacter_baumannii|t__GCF_000222265'] = 4063283

 
# Add the information of the new marker as the other markers
db['markers'][new_marker_name] = {
                                   'clade':
                                   'ext'
                                   'len'
                                   'score'
                                   'taxon'
}
Here, you should replace new_marker_name by the name of your marker and other values. Example:
db['markers']['gi|483970126|ref|NZ_KB891629.1|:c6456-5752'] = {
{'clade': 's__Streptomyces_sp_KhCrAH_244',
  'ext': {'GCF_000226995', 'GCF_000355695', 'GCF_000373585'},
  'len': 705,
  'score': 3.0,
  'taxon': 'k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Streptomycetaceae|g__Streptomyces|s__Streptomyces_sp_KhCrAH_244'}
}

means that this marker clade is s__Streptomyces_sp_KhCrAH_244, and its can be present in some other genomes 'GCF_000226995', 'GCF_000355695', 'GCF_000373585' (which can be traced back by looking at db['taxonomy']), etc.
 
ofile = bz2.BZ2File('metaphlan2/db_v21/mpa_v21_m200.pkl', 'w')
pickle.dump(db, ofile, pickle.HIGHEST_PROTOCOL)
ofile.close()
This is correct to save the new database.
 
current error;

with open(args['mpa_pkl'], 'rb') as ifile:
       ^
SyntaxError: invalid syntax
I am not sure why you need this line as you already read the database with this line from the beginning.

I also need to emphasize that to make sure a marker is unique and common for a clade, you need to map it again all other reference genomes in other clades and in that clade as I mentioned in the previous email for Hussein. Otherwise, the result will not be correct. An example is that 16S appears everywhere, so it is not a marker for any clade, but if it is  wrongly selected as a marker for a species X, then a lot of reads can be mapped to it, therefore that species abundance will be very high.
The genome ids can be traced from db['taxonomy'].

Thanks,
Tin
 

Thanks,

James

James Doonan

unread,
Nov 30, 2015, 6:54:31 PM11/30/15
to MetaPhlAn-users, clydea...@gmail.com
Hi Tin,

Thanks for that, I have a few marker genes but it will likely take time to verify a large number, I understand the requirement for clade specific makers, the difficulty I have is amending the python code as I have no experience with this.

Thanks,

James

Hussein Shokry

unread,
Jan 13, 2016, 8:14:20 AM1/13/16
to MetaPhlAn-users, clydea...@gmail.com
> Hi,

Thanks for the responses, it makes that much easier through the customization process. But in the case that i have solely marker genes and no reference genomes. What input should i use for the reference genome so as not to cause errors?

Yours,
Hussein

Duy Tin Truong

unread,
Jan 19, 2016, 3:44:21 AM1/19/16
to Hussein Shokry, MetaPhlAn-users, clydea...@gmail.com
Hi Hussein,

We need to know at least which genome a marker comes from, e.g. marker1 from genome1, marker2 from genome1, and marker3 from genome2. Based on that information in the mpa_pkl file, metaphlan can identify a set of markers for each genome and determine its abundance.

Cheers,
Tin
Reply all
Reply to author
Forward
0 new messages