error searching functions in v4 pangenome

251 views
Skip to first unread message

Nastassia Patin

unread,
May 1, 2018, 9:49:01 AM5/1/18
to Anvi'o
Hi anvi'o team,

I am running a pangenome in anvi'o v4, and it was going very well until a recent hiccup. Briefly, my work flow was: 

  1. Download draft genomes from NCBI and use those with my own MAGs, which were in fasta format.
  2. Reformat fastas file using anvi-script-reformat-fasta
  3. Generate gene calls and annotations using Prokka.
  4. Parse the gff files using gff_parser.py.
  5. Generate contigs database with external gene calls and import annotations using anvi-import-functions.
  6. Generate hmms and blank profiles for each contigs database.
  7. Create a genomes storage database for each genome
  8. Make the pangenome!
I ran it in interactive mode and all 30 genomes displayed beautifully. Then I tried to search for a functional annotation using anvi-search-functions, and it gave me the following output and error: 

(anvio4) manager@bl8vbox[Genomes] anvi-search-functions -p OV-PAN-GENOMES/OV-PAN-GENOMES-PAN.db -g OV-GENOMES.db --search-terms cyanophycin --annotation-sources Prokka:Prodigal                   [ 1:35PM]
Searching in Genomes Storage .................: OV-GENOMES.db
Genomes storage .............................................: Initialized (storage hash: 8d08f06b)                                                                                                          
Num genomes in storage ......................................: 30
Num genomes will be used ....................................: 30
Pan DB ......................................................: Initialized: OV-PAN-GENOMES/OV-PAN-GENOMES-PAN.db (v. 8)
                                                                                                                                                                                                             
* Gene clusters are initialized for all 8777 gene clusters in the database.

Search terms ................................................: 1 found
Genomes storage .............................................: Initialized (storage hash: 8d08f06b)                                                                                                          
Num genomes in storage ......................................: 30
Num genomes will be used ....................................: 30
[01 May 18 13:36:47 Search functions] Searching for term "cyanophycin"                                                                                                                                      Traceback (most recent call last):
  File "/home/manager/anaconda3/envs/anvio4/bin/anvi-search-functions", line 186, in <module>
    SearchResultReporter(args).process()
  File "/home/manager/anaconda3/envs/anvio4/bin/anvi-search-functions", line 61, in process
    self.get_search_results()
  File "/home/manager/anaconda3/envs/anvio4/bin/anvi-search-functions", line 90, in get_search_results
    self.matching_item_names_dict, self.verbose_output = self.db.search_for_gene_functions(self.search_terms, requested_sources=self.annotation_sources, verbose = self.verbose)
  File "/home/manager/anaconda3/envs/anvio4/lib/python3.6/site-packages/anvio/dbops.py", line 1679, in search_for_gene_functions
    gene_cluster_id = self.gene_callers_id_to_gene_cluster[genome_name][gene_caller_id]
KeyError: 570

Just to check, I ran anvi-search-functions --list-annotation-sources and sure enough it gave me:
Searching in Genomes Storage .................: OV-GENOMES.db
Genomes storage .............................................: Initialized (storage hash: 8d08f06b)                                                                                                          
Num genomes in storage ......................................: 30
Num genomes will be used ....................................: 30
Pan DB ......................................................: Initialized: OV-PAN-GENOMES/OV-PAN-GENOMES-PAN.db (v. 8)
                                                                                                                                                                                                             
* Gene clusters are initialized for all 8777 gene clusters in the database.

                                                                                                                                                                                                             
AVAILABLE FUNCTIONS (1 FOUND)
==============================================================
* Prokka:Prodigal

So importing the Prokka annotations seems to have worked, but it will not perform the search. Any ideas why this is happening?

Thanks,
Nastassia


A. Murat Eren

unread,
May 1, 2018, 10:13:03 AM5/1/18
to Anvi'o
Hi Nastassia,

This is the second time we are hearing about this Prokka during the last couple of days. We will take a look at this workflow soon. Apologies for the inconvenience!


Best wishes,

--

A. Murat Eren (meren)
http://merenlab.org :: twitter :: gpg

--
Anvi'o Paper: https://peerj.com/articles/1319/
Project Page: http://merenlab.org/projects/anvio/
Code Repository: https://github.com/meren/anvio
---
You received this message because you are subscribed to the Google Groups "Anvi'o" group.
To unsubscribe from this group and stop receiving emails from it, send an email to anvio+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/anvio/0357c622-bddd-40e7-98f9-2c2fad97cf37%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

A. Murat Eren

unread,
May 2, 2018, 5:15:16 PM5/2/18
to Anvi'o, Özcan ESEN
Hi Nastassia,

Ozcan has been trying to reproduce this error, but we couldn't reproduce it neither with the master repo, nor with the stable v4 :(

I am not sure where to go from here. If you share your files privately with me and Ozcan (CC'd), i.e., by providing a Dropbox link to your contigs DBs you used to generate the genomes storage, and exact commands you used for each step, we could try to get to the bottom of this.


Best,

--

A. Murat Eren (meren)
http://merenlab.org :: twitter :: gpg

Nastassia Patin

unread,
May 3, 2018, 11:43:34 AM5/3/18
to Anvi'o, Özcan ESEN
Thanks Meren and Ozcan. I'm uploading the data to a shared Dropbox folder now. I'll also load my workflow as a text file so you have the exact command I used.

-Nastassia

You received this message because you are subscribed to a topic in the Google Groups "Anvi'o" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/anvio/6qGq5euuv-s/unsubscribe.
To unsubscribe from this group and all its topics, send an email to anvio+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/anvio/CAK5_LacrsQDN71SmqOFgKWcVxiHARqG42fDFfbkyy2JFLjT2Ew%40mail.gmail.com.

For more options, visit https://groups.google.com/d/optout.



--

Dr. Nastassia Patin

Postdoctoral Researcher

School of Biology

Georgia Institute of Technology

Özcan Esen

unread,
May 3, 2018, 3:00:39 PM5/3/18
to Nastassia Patin, Anvi'o
Hi Nastassia,

Thank you for the example data, I could finally able to reproduce the error. You are getting this error because of the `--min-occurrence` parameter. We store functions in the genomes storage, but if filtering parameters used during the pangenome analysis, some gene callers ids in genomes storage may get filtered and do not appear in pangenome database, so function search was having a hard time to find matching gene cluster id for those results. It seems we already fixed this issue on the master to skip those search results, but now we changed the behavior. Instead of skipping, now it keeps those search results but puts 'n/a' as a gene cluster id since those results are still may be useful to go back to contigs database or tweak the filtering parameters in future pangenome analyses.

This is the relevant part in `dbops.py` https://github.com/merenlab/anvio/blob/master/anvio/dbops.py#L1724-L1750  if you want you can find your `dbops.py` in your installation and modify this part. You can get the path of your `dbops.py` file using the command python -c "import anvio.dbops; print(anvio.dbops.__file__)"

Best,
Ozcan

Nastassia Patin

unread,
May 4, 2018, 10:57:54 AM5/4/18
to Özcan Esen, Anvi'o
Thank you very much Ozcan!! Great to have that figured out!

-Nastassia

elizabe...@gmail.com

unread,
Jul 16, 2018, 3:05:11 PM7/16/18
to Anvi'o
Hi Nastassia and Anvi'o team, 

I am also running the pangenomics workflow using Prokka annotations. I've followed the above steps but seem to have trouble displaying SCGs. The only thing I can decipher that is different from when I've run other pangenomics workflows is when I run the anvi-run-hmms script, I get the following warning: "you did not provide any gene caller ids. as a result, anvi'o will give you back sequences for every 4731 gene call stored in the contigs database." I imported my external gene calls and functional annotations successfully using the gff parser script. Everything displays nicely and I can make searches based on the annotations, but the SCGs field is empty. 

Additionally I had a question about how you made your genome storage database using the --gene-caller flag. Prokka gives gene calls for more sources than just Prodigal, but I've only figured out how to give the --gene-caller flag one argument, such as just Prodigal. So I'm not sure if that is also contributing to my issue since it's not taking into account genes that were called with other gene callers. 

Thanks for your help in advance, 

Elizabeth

A. Murat Eren

unread,
Jul 17, 2018, 3:13:24 PM7/17/18
to Anvi'o
Hi Elizabeth,

By design (and by definition), pangenomes should only consider a single gene caller even if there are multiple gene callers stored in the contigs database. If you are certain that you are on top of what you are doing, you can always take the Prokka output, change the gene caller to the same gene caller id prior to importing them into anvi'o. But if there are multiple descriptors of the same gene open reading frame, it will likely cause some issues in your pangenome with inflated number of paralogs in the least.

I hope this makes sense :)

Best wishes,
--

A. Murat Eren (meren)
http://merenlab.org :: twitter :: gpg

--
Anvi'o Paper: https://peerj.com/articles/1319/
Project Page: http://merenlab.org/projects/anvio/
Code Repository: https://github.com/meren/anvio
---
You received this message because you are subscribed to the Google Groups "Anvi'o" group.
To unsubscribe from this group and stop receiving emails from it, send an email to anvio+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/anvio/f8324374-99f0-420c-a68a-b7ce19b813b4%40googlegroups.com.

elizabe...@gmail.com

unread,
Jul 17, 2018, 3:46:49 PM7/17/18
to Anvi'o
Hi Meren, 

That makes sense, and when creating the genomes storage, I've provided it with the --gene-caller Prodigal flag. So for now I'm just looking at annotations from Prokka:Prodigal. 

However, I don't think this is still addressing my problem of not getting SCGs. I've attached a screen shot of my interactive pangenome view, where SCGs aren't displaying. I've run anvi-run-hmms and that's where I get the "you did not provide any gene caller ids. as a result, anvi'o will give you back sequences for every 4731 gene call stored in the contigs database" warning, even though I believe the gene caller IDs are included in the -gene-calls.txt file outputted by the gff_parser.py script. I've also checked the output of anvi-summarize and there are no clusters that are core to all 13 of my genomes there. 

Usually (and expectedly) the SCGs field is empty when I've forgotten to run the anvi-run-hmms script on my contigs databases. And when creating the genomes storage file it gives me a warning that I haven't run that script. But when I run the anvi-run-hmms, I don't get any downstream warnings, but no SCGs displaying in the pangenome. 

Thanks,

Elizabeth
Screen Shot 2018-07-17 at 2.35.44 PM.png

A. Murat Eren

unread,
Jul 17, 2018, 4:41:49 PM7/17/18
to Anvi'o
On Tue, Jul 17, 2018 at 3:46 PM <elizabe...@gmail.com> wrote:
However, I don't think this is still addressing my problem of not getting SCGs.

Based on what I'm seeing in this pangenome, ​I don't think you have any SCGs in this pangenome (SCGs in this display are simply gene clusters that occur in every genome and have no paralogs). That's why the SCGs layer is blank.​

But the error you are getting from anvi-run-hmms seems to be a bug. Can you send one of your contigs databases as an example so I we can try to run HMMs and see what is going on?


Best,

elizabe...@gmail.com

unread,
Jul 17, 2018, 6:21:48 PM7/17/18
to Anvi'o
I've attached one of my contigs databases. 

It seems highly unlikely to me that these sets of MAGs wouldn't have any SCGs shared among them, but I'll play around with my parameters and see what happens. 

Thanks,

Elizabeth 
2556921090-contigs.db

A. Murat Eren

unread,
Jul 18, 2018, 8:23:12 AM7/18/18
to Anvi'o
Hi Elizabeth,

I just run anvi-run-hmms on your contigs database, and got no errors using v5.1. So I am not sure how to reproduce the error you mentioned.

I was looking at this output;

(anvio-master) meren ~/Downloads $ anvi-script-get-collection-info -c 2556921090-contigs.db

WARNING
===============================================
You did not provide a collection name. Anvi'o will assume that what is in your
contigs database is a a single genome (or genome bin).

Contigs DB ...................................: Initialized: 2556921090-contigs.db (v. 12)

* Completion and redundancy estimates. PC: Percent completion; PR: Percent
redundancy; N: Number of splits; L: Length (total number of nucleotides); D:
Domain for single-copy core genes; C: Domain confidence.


Genome in "2556921090-contigs.db"
===============================================
2556921090-contigs :: PC: 30.22%, PR: 1.44%, N: 316, L: 3,097,332, D: bacteria (C: 0.29)


The lack of single-copy core genes in the pangenome at the amino acid level would not surprise me if the rest of the MAGs in that analysis are also comparable to this one with respect to their level of completion.



Best,
--

A. Murat Eren (meren)
http://merenlab.org :: twitter :: gpg

--
Anvi'o Paper: https://peerj.com/articles/1319/
Project Page: http://merenlab.org/projects/anvio/
Code Repository: https://github.com/meren/anvio
---
You received this message because you are subscribed to the Google Groups "Anvi'o" group.
To unsubscribe from this group and stop receiving emails from it, send an email to anvio+un...@googlegroups.com.

elizabe...@gmail.com

unread,
Jul 19, 2018, 11:54:54 PM7/19/18
to Anvi'o
Hi Meren, 

I think I've found the problem. I don't know how Anvi'o is computing completion estimates, but none of my bins are below 75% completion from CheckM estimates. I ran `anvi-script-get-collection-info` on all of my contigs databases for this particular project. The output for that is attached in the file accum-bins-contigsdb-stats.txt. For comparison, I've also attached the CheckM estimates for these bins in the file accum-bins-checkm-stats.txt. I've also discovered this problem in a situation where I am trying to perform the pangenomics workflow using complete genomes downloaded from NCBI. The stats for the contigs databases are listed in the file meth-bac-contigdb-stats.txt. Just for sanity sake, even though these are designated "complete genomes" on NCBI, I also ran CheckM on those isolates, which are summarized in meth-genomes-checkm-stats.txt. 

In both of these situations, I have run Prokka on the raw genome files, and then the gff_parser.py script to input the gene calls and annotations into Anvi'o. To generate the contigs database, I've used the following command: 

for fasta in */*.fa; do
f=$(basename $fasta .fa);
anvi-gen-contigs-database -f $fasta -o $f-contigs.db --external-gene-calls $f/$f.gff-gene-calls.txt -n "$f" --ignore-internal-stop-codons;
done

So I think the problem is with ignoring internal stop codons. I've used the --process-all flag as well for parsing the GFFs. So the part at the end of the blog post "Importing Prokka Anvi'o annotations into Anvi'o" that says there may be problems with the pangenomics workflow using these flags is correct, but it's causing problems at the anvi-gen-contigs-database step. From a test run, of the above command on just one file, it says that 35% of the total genes have internal stop codons, and they will be replaced with an "X". For that particular genome, the stats from the getting the collection info say that it's around 65% complete, so it's throwing out those genes. But even when I don't use the --process-all flag for the gff_parser, I believe I still get a warning about internal stop codons and such. 

I'm probably doing something stupid and not seeing it, but I'm not sure where to go from here. Other than I'm not dealing with bins that are 30% complete. In the meantime, I'll also try annotating my bins and genomes with another tool to compare. 

Thanks, 

Elizabeth
accum-bins-contigsdb-stats.txt
accum-bins-checkm-stats.txt
meth-bac-contigdb-stats.txt
meth-genomes-checkm-stats.txt

elizabe...@gmail.com

unread,
Jul 19, 2018, 11:56:01 PM7/19/18
to Anvi'o
Oh sorry, I also meant to post the Anvi'o version I'm using: 

Anvi'o version ...............................: margaret (vunknown)
Profile DB version ...........................: 29
Contigs DB version ...........................: 12
Pan DB version ...............................: 12
Genome data storage version ..................: 6
Auxiliary data storage version ...............: 2
Structure DB version .........................: 1

And I installed it with conda. 

A. Murat Eren

unread,
Jul 20, 2018, 9:23:28 AM7/20/18
to Anvi'o
Probably low estimates are emerging from the fact that a lot of the gene calls in the external gene calls file contain internal stop codons and thrown away as you suspect. We never see such aberrant estimates of completion of isolate genomes with anvi'o.

Can you please also try to not use Prokka and generate your contigs databases without an external gene calls flag just to compare? This way you will let anvi'o do the gene calling. Then you can run `anvi-run-ncbi-cogs` or `anvi-run-pfams` to annotate your genes with functions for pangenomics later.


Best,
--

A. Murat Eren (meren)
http://merenlab.org :: twitter :: gpg

elizabe...@gmail.com

unread,
Jul 20, 2018, 3:13:27 PM7/20/18
to Anvi'o
Hi Meren, 

Just generated contigs databases without using my external gene calls. Quick test of the pangenome without functional annotations, and looks much better and as expected with the SCGs, as attached. From this test and issues with the complete isolates, I think Prokka gene calls/annotations won't work with the pangenomics workflow. 

Thanks, 

Elizabeth
Screen Shot 2018-07-20 at 2.04.28 PM.png

A. Murat Eren

unread,
Jul 20, 2018, 3:39:23 PM7/20/18
to Anvi'o
Thank you very much for checking.

Maybe we should put a notice on our Prokka workflow tutorial.

--

A. Murat Eren (meren)
http://merenlab.org :: twitter :: gpg

elizabe...@gmail.com

unread,
Jul 20, 2018, 10:53:44 PM7/20/18
to Anvi'o
Hi Meren, 

I think I've found the source of the issues. In the gff_parser.py script, there is a bug that is assigning the orientation of every gene call to reverse. 

I wrote a script that parses the genbank output from Prokka to fit Anvi'o's format for external gene calls and annotations: https://github.com/elizabethmcd/pub-scripts/blob/master/genbank-parser.py
I ran this through the workflow to get to a pangenome and it's comparable to what Anvi'o's `anvi-gen-contigs-database` does by default. With this fix, I also don't get warnings about internal stop codons, so I think the necessity for that flag originates from everything being inputted in reverse orientation. I also saw somewhere somebody wanted a parser for genbank files, so this script will mostly work for any genbank file with some modifications for the annotation source I think. I have yet to test this on impartial gene calls, but I have plans to implement that soon. 

Elizabeth

A. Murat Eren

unread,
Jul 21, 2018, 7:45:34 AM7/21/18
to Anvi'o
This is excellent, Elizabeth! 

Thank you. I am very happy to hear that you could identify the problem, and even happier to see that you solved it :) Kudos to you! I am now wondering what is the best strategy to go forward. Antti already have a post on importing Prokka annotations here:


I am sure he would be happy to hear that you fixed an important issue in his script. I think it would have been amazing if you were to consider updating that post in such a way that not only it leads people to the script that works better now, but also gives credit to you as well :) The source code of the page is here if you are interested suggesting a pull request: https://github.com/merenlab/web/blob/master/_posts/anvio/2017-05-18-working-with-prokka.md

This is my attempt to find a way to make your solution is available to the community in the most straightforward and appropriate way possible, but I don't assume you to have the time or interest to do anything about it. If you don't do anything, I will update things online myself when I am done traveling. Thank you either ways for going after this!


​Best,​
--

A. Murat Eren (meren)
http://merenlab.org :: twitter :: gpg

elizabe...@gmail.com

unread,
Jul 25, 2018, 12:43:03 PM7/25/18
to Anvi'o
Hi Meren, 

Sure, I'd be more than happy to update the post. I'll try to get to it within the next week or so. 
Elizabeth

Sirbius

unread,
Sep 21, 2020, 10:03:19 AM9/21/20
to Anvi'o
Hi there,

I'm doing a pangenome analysis and I would like to import Prokka annotation but got exactly the same error as @Elisabeth (no SCG at all). 
However, when I run the pangenome without Prokka annotation everything works smootly.
But still, I would like to know how she finally solved the problem and I think it would be better to write that on the prokka tutorial page for the others.

Thanks a lot for your efforts guys!
Silvia

Elizabeth McDaniel

unread,
Sep 21, 2020, 11:29:06 AM9/21/20
to Anvi'o
Hi Sirbius, 

There was a bug in the original GFF parser script that reverse orients a lot of genes I believe. I can't remember how I fixed the original GFF parser, but instead I made my own Genbank parser that works for Prokka annotations with the Genbank output. It can be found here: https://github.com/elizabethmcd/genomes-MAGs-database/blob/master/scripts/genbank-parser.py 

I was asked by Meren to update the blog post (2 years ago now, sigh) but I never got around to it, sorry. Maybe this will be the final push to update that with the fix. Apologies for the confusion, hopefully this parser helps you out. 

Best,
Elizabeth

Silvia Turco

unread,
Sep 21, 2020, 12:02:42 PM9/21/20
to an...@googlegroups.com
Thank you very much Elizabeth!
That's what I was hoping for because your old github link was not working anymore...
I'm going to try your script immediately and get back with the results.
Best,

Silvia

You received this message because you are subscribed to a topic in the Google Groups "Anvi'o" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/anvio/6qGq5euuv-s/unsubscribe.
To unsubscribe from this group and all its topics, send an email to anvio+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/anvio/5254dd76-3b60-4de0-bb32-d9f211b0d65bn%40googlegroups.com.


--
Silvia Turco, PhD

Ana Belen Garcia Martin

unread,
Sep 21, 2020, 4:21:42 PM9/21/20
to an...@googlegroups.com
HI Silvia,

you should be able to run anvi-run-ncbi-cogs, anvi-run-hmms and also import your prokka annotation as it is explained in this link: http://merenlab.org/2017/05/18/working-with-prokka/ by Antti Karkman. 
Even, if you have your NCBI_PGAP annotations you can also import them using the Bioinformatics Tools (bit) package https://github.com/AstrobioMike/bioinf_tools#bioinformatics-tools-bit of Mike Lee.

I have done also pangenome analysis using either using prokka or ncbi-pgap annotations. And, I usually have imported first the external annotations and then run the programs anvi-run-ncbi-cogs and anvi-run-hmms. If I followed another order while running the commands, I got warnings, or missing flags and so on ... I am not sure why and if this was always like that ... but for me it worked better as I said above.

Well, I hope this helps.
Best,

Ana


Ana Belen Garcia Martin

unread,
Sep 21, 2020, 4:22:14 PM9/21/20
to an...@googlegroups.com
Hi Elizabeth,

thanks for this link ;).
Best,

Ana

Silvia Turco

unread,
Sep 22, 2020, 7:24:31 AM9/22/20
to an...@googlegroups.com
Hi Ana, 
Thank you for your email.
I did exactly what is written in Antii's tutorial, but at the end I got no SCG (as happened Elizabeth before):
Screenshot at 2020-09-18 12-54-55.png

So, I repeated the tutorial but then I used Elizabeth's gbk_parser script and now everything looks how I expected (they are different pathovars of the same species)!
Screenshot at 2020-09-22 11-50-13.png

It's amazing when the community collaborates like this, thank you very much! :)
Silvia

You received this message because you are subscribed to a topic in the Google Groups "Anvi'o" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/anvio/6qGq5euuv-s/unsubscribe.
To unsubscribe from this group and all its topics, send an email to anvio+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/anvio/CAJEjGmKdk00%2BgU2UveJO03OQm8GUS5G07Z2_9i9ckqY%2BdPSqFQ%40mail.gmail.com.


--
Silvia Turco, PhD

Ana Belen Garcia Martin

unread,
Sep 22, 2020, 11:09:58 AM9/22/20
to an...@googlegroups.com
HI Silvia,

Great you have what you want finally ;)!
Yes, right, many people here collaborates and helped me a lot too. Really cool!!

Best,
Ana

Reply all
Reply to author
Forward
0 new messages