Troubleshooting PanGenome using --external-gene-calls

414 views
Skip to first unread message

Jarrod

unread,
Jan 27, 2017, 12:38:14 PM1/27/17
to Anvi'o
Hi all, 

Wondering if anyone can help me troubleshoot an error I get in the panGenome workflow. My workflow starts with external gene calls (from IMG). I can run the same pipeline, same contig files but without external gene calls and all works great. So I am pretty sure it has something to do with external gene calls--perhaps either my gene-list files and/or the commands I am running. 

The error occurs during anvi-pan-genome step, specifically during MCL protein clustering. Up until that point all looks good. For example, db's are created fine, HMM fine, COG fine, and I can import interproscan gene calls into db's. The log file just terminates at protein clustering without explanation. 

MCL output ...................................: ...pureIMG/mcl-clusters.txt
Number of protein clusters ...................: 6,269
[27 Jan 17 12:30:34 Aligning genes in protein sequences] 0 of 6269

Config Error: Drivers::Muscle: Something went wrong with this run :/ The output does not look
              alright. You can find the output in this log file.

My starting files for each genome are a) a contigs file and b) a gene list. 

gene_callers_id contig start stop direction partial source version
2624135806 Ga0067690_101 1676 2588 f 0 IMG v1.0
2624135807 Ga0067690_101 2715 3720 f 0 IMG v1.0
2624135808 Ga0067690_101 3843 4605 f 0 IMG v1.0
2624135819 Ga0067690_101 16295 16372 f 1 IMG v1.0


here is my workflow:

for g in SOME GENOMES

do 
    anvi-gen-contigs-database -f $g.IMG.fna -o $g.IMG.db --external-gene-calls $g.gene.list
    anvi-run-hmms -c $g.IMG.db -T 10
    anvi-run-ncbi-cogs -c $g.IMG.db --search-with blastp  -T 10

done

anvi-gen-genomes-storage -e external_genomes.txt -o pureIMG-GENOMES.h5

anvi-pan-genome -g pureIMG-GENOMES.h5 -o pureIMG --num-threads 10 --min-occurrence 1 --mcl-inflation 2.0 --maxbit 0.5 --use-ncbi-blast -J pureIMG

Jarrod

unread,
Jan 27, 2017, 1:14:40 PM1/27/17
to Anvi'o
FYI: tried it on stable and dev versions.

Anvi'o version ...............................: 2.1.1.dev0
Profile DB version ...........................: 17
Contigs DB version ...........................: 8
Pan DB version ...............................: 4
Samples information DB version ...............: 2
Genome data storage version ..................: 1
Auxiliary data storage version ...............: 3
Anvi'server users data storage version .......: 1

Anvi'o version ...............................: 2.1.0
Profile DB version ...........................: 17
Contigs DB version ...........................: 8
Pan DB version ...............................: 4
Samples information DB version ...............: 2
Genome data storage version ..................: 1
Auxiliary data storage version ...............: 3
Anvi'server users data storage version .......: 1

A. Murat Eren

unread,
Jan 27, 2017, 2:39:38 PM1/27/17
to an...@googlegroups.com
Hmm. Interesting.

It seems everything is working, and it is only the alignment step that fails. There is a flag "--skip-alignments", I would try that and see whether it works or not. If it works without the flag, then I think it may have something to do with,

  • The min / max length of the genes. I suggest you take a look at your genes and make sure you don't have things like 3 amino acid long or 1,000,000 amino acids. You can use anvi-get-aa-sequences-for-gene-calls to get the AA sequences.
  • Muscle version or a muscle-specific bug? It is very unlikely because muscle is so widely used. My muscle is v3.8.1551. 

Best,

--

A. Murat Eren (meren)
http://merenlab.org :: gpg

--
Anvi'o Paper: https://peerj.com/articles/1319/
Project Page: http://merenlab.org/projects/anvio/
Code Repository: https://github.com/meren/anvio
---
You received this message because you are subscribed to the Google Groups "Anvi'o" group.
To unsubscribe from this group and stop receiving emails from it, send an email to anvio+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/anvio/5339bfea-272a-4ddb-94db-f9ab9f168d04%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Jarrod Scott

unread,
Jan 27, 2017, 3:34:52 PM1/27/17
to an...@googlegroups.com
Thanks Meren. I will try these out and report back. Jarrod
To view this discussion on the web visit https://groups.google.com/d/msgid/anvio/CAK5_LaeUmszHsWbzAKFVm6jys_DpoZz-tweaXc2%3DE579OPUP9w%40mail.gmail.com.

Antti Karkman

unread,
Apr 21, 2017, 10:25:30 AM4/21/17
to Anvi'o
Hi Jarrod and Meren,
I ran into the same problem. Any update on this?
The log.txt file doesn't give any obvious reason. Shows the MCL output and number of protein clusters, but nothing after that.

I'm using the flag --exclude-partial-gene-calls, because otherwise the making of blast DB fails. 
The partial gene calls don't have any AA sequences, which probably causes the failure in makeblastdb command.

And everything works when skipping the alignments, so probably something wrong with the muscle step. 

Using anvio v. 2.2.3. in anaconda2 virtual environment.
My muscle version is 3.8.31. 

-Antti

Jarrod Scott

unread,
Apr 21, 2017, 11:46:44 AM4/21/17
to an...@googlegroups.com
Hello Antii, 

It looks like I said I would report back and never did :) 

The short answer is I did not get it to work. I tried the same troubleshooting you mentioned and it did not work. With the release of 2.3.0 I am re-running the analysis to see if works. So this time I WILL report back. 

My interest in using IMG pre-called genes is so that I can go back and look at what gene neighborhood a particular PC is found in. Have you found a good way to do this with anvi'o?

Jarrod

--
Anvi'o Paper: https://peerj.com/articles/1319/
Project Page: http://merenlab.org/projects/anvio/
Code Repository: https://github.com/meren/anvio
---
You received this message because you are subscribed to the Google Groups "Anvi'o" group.
To unsubscribe from this group and stop receiving emails from it, send an email to anvio+unsubscribe@googlegroups.com.

Antti Karkman

unread,
Apr 21, 2017, 2:09:06 PM4/21/17
to Anvi'o
Hello Jarrod,
let's see what you get. 

I'm still testing the pipeline, so only using a small test data set and haven't gone further yet. And I'm using PROKKA for my annotations
But I can also report back when I get there if I find some solution.

-Antti

perjantai 21. huhtikuuta 2017 17.46.44 UTC+2 Jarrod Scott kirjoitti:
Hello Antii, 

It looks like I said I would report back and never did :) 

The short answer is I did not get it to work. I tried the same troubleshooting you mentioned and it did not work. With the release of 2.3.0 I am re-running the analysis to see if works. So this time I WILL report back. 

My interest in using IMG pre-called genes is so that I can go back and look at what gene neighborhood a particular PC is found in. Have you found a good way to do this with anvi'o?

Jarrod
On Fri, Apr 21, 2017 at 9:25 AM, Antti Karkman <antti....@gmail.com> wrote:
Hi Jarrod and Meren,
I ran into the same problem. Any update on this?
The log.txt file doesn't give any obvious reason. Shows the MCL output and number of protein clusters, but nothing after that.

I'm using the flag --exclude-partial-gene-calls, because otherwise the making of blast DB fails. 
The partial gene calls don't have any AA sequences, which probably causes the failure in makeblastdb command.

And everything works when skipping the alignments, so probably something wrong with the muscle step. 

Using anvio v. 2.2.3. in anaconda2 virtual environment.
My muscle version is 3.8.31. 

-Antti

--
Anvi'o Paper: https://peerj.com/articles/1319/
Project Page: http://merenlab.org/projects/anvio/
Code Repository: https://github.com/meren/anvio
---
You received this message because you are subscribed to the Google Groups "Anvi'o" group.
To unsubscribe from this group and stop receiving emails from it, send an email to anvio+un...@googlegroups.com.

Jarrod Scott

unread,
Apr 23, 2017, 1:16:07 PM4/23/17
to an...@googlegroups.com
Hello Antti, 

using the new version of anvi'o I still get the same error unless I use the --skip-alignments flag. I am going to continue playing around to see if I can identify the issue. Please let me know if you make any progress.

Jarrod


Config Error: Drivers::Muscle: Something went wrong with this run :/ The output does not look
              alright. You can find the output in this log file.
To unsubscribe from this group and stop receiving emails from it, send an email to anvio+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/anvio/f9e0ccf8-30e6-4b0f-991a-d56745e52080%40googlegroups.com.

Antti Karkman

unread,
Apr 24, 2017, 5:15:17 AM4/24/17
to Anvi'o
Hi again,
I had a look at the log file in the temp folder and there were warnings (several of the same) from muscle:

# DATE: 24 Apr 17 10:37:15
# CMD LINE: muscle -quiet
# THIS IS THE OUTPUT YOU ARE LOOKING FOR:


*** WARNING *** Invalid character '*' in FASTA sequence data, ignored

...

And after that several aligned AA sequences in fasta format.

Don't know if this causes the crash.

-Antti

A. Murat Eren

unread,
Apr 24, 2017, 9:18:35 AM4/24/17
to an...@googlegroups.com
​Jarrod, can you please send me a small test set so I can reproduce this error? :(​ I feel like it is somewhat easy to address, but the lack of example data makes it very hard to find out what really causes it. I thought we managed to address it, but clearly we failed again :(

--

A. Murat Eren (meren)
http://merenlab.org :: gpg

To unsubscribe from this group and stop receiving emails from it, send an email to anvio+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/anvio/bf105635-735c-42aa-9038-d0399128b2ad%40googlegroups.com.

A. Murat Eren

unread,
May 8, 2017, 4:32:16 PM5/8/17
to an...@googlegroups.com, Karkman, Antti V, Jarrod J Scott
Dear Jarrod and Antti,

I found a bug in the code that was affecting external gene calls, and fixed it in the master repository. I am now running Jarrod's test, and everything seems to be going fine so far.

I apologize for the delay, and for this silly bug.


Best,

--

A. Murat Eren (meren)
http://merenlab.org :: gpg

A. Murat Eren

unread,
May 8, 2017, 6:04:52 PM5/8/17
to Karkman, Antti V, Jarrod J Scott, an...@googlegroups.com
Done. I run Jarrod's test file (only with 3 genomes for efficiency), and I get good results with the current master (which took more than I imagined, but I am happy this is now fully addressed). Here is your unedited pangenome, Jarrod:

Inline image 1

So now we know that IMG genomes can go directly into anvi'o without any trouble. I guess all we need is a recipe from Jarrod ;)


Best,

--

A. Murat Eren (meren)
http://merenlab.org :: gpg

On Mon, May 8, 2017 at 3:43 PM, A. Murat Eren <a.mura...@gmail.com> wrote:
Wait! I just found another bug. I will fix that, and let you know again. Sorry.

It turns out some of the genes coming from IMG external gene calls contains multiple stop codons within genes. Like this one, Jarrod, if you would like to check:

In otu18_DIS1, gene callers id 2572236632 translates to this:

VDVAQW*SPGL*CRWLGVRVPSSTP

This is the entry in the external genes file:

$ grep 2572236632 otu18_DIS1.gene.list
2572236632 DIS1DRFT_DIS1-NEB_NODE_1 253370 253445 r 0 IMG v1.0


I will simply replace STOP codons with X's and warn the user.


Best,

--

A. Murat Eren (meren)
http://merenlab.org :: gpg

Jarrod J Scott

unread,
May 8, 2017, 6:12:06 PM5/8/17
to A. Murat Eren, Karkman, Antti V, an...@googlegroups.com
Beautiful Meren! Looks like you brought in a much needed gatekeeper. It is a bit odd that the annotations have stop codons--I think that some microbes will read through stop codons? Anyway, good eye. I curated those genomes but missed those dang stop codons. 

And a handsome output to boot!

thanks again Meren

Antti Karkman

unread,
May 11, 2017, 8:55:47 AM5/11/17
to Anvi'o, a.mura...@gmail.com, antti....@helsinki.fi
Thanks Meren!
I got it working with my prokka annotated genomes too.
Not sure, but I guess my internal stop codons come from non-coding genes (e.g. rRNAs, tRNAs etc.), which I should remove before importing them to anvio.
Need to work on my annotation and/or parsing step.

-Antti

A. Murat Eren

unread,
May 11, 2017, 9:43:52 AM5/11/17
to an...@googlegroups.com, Karkman, Antti V
Do you think it would have been useful if anvi'o optionally removed them during import?

--

A. Murat Eren (meren)
http://merenlab.org :: gpg

To unsubscribe from this group and stop receiving emails from it, send an email to anvio+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/anvio/53ab3a2f-5f1a-4c59-b970-e99ae9f37db4%40googlegroups.com.

Antti Karkman

unread,
May 11, 2017, 10:03:35 AM5/11/17
to Anvi'o, antti....@helsinki.fi
Maybe, depends how it's implemented. 
Could it be done specifying the annotation sources to include? I have different source for RNAs, ORFs and CRISPR regions coming from prokka. 
If it is only based on the internal stop codons, some annotations might still stay there if they don't have any.

And if I really get to wish, a GFF3 parser to add functional annotation would be nice. If it is possible. 

Thanks,
-Antti

A. Murat Eren

unread,
May 11, 2017, 5:38:50 PM5/11/17
to an...@googlegroups.com, Karkman, Antti V
Hi Antti,

On Thu, May 11, 2017 at 9:03 AM, Antti Karkman <antti....@gmail.com> wrote:
And if I really get to wish, a GFF3 parser to add functional annotation would be nice. If it is possible. 

The short answer is: not today, but maybe later.

This long answer is like this 'simple' GFF3 file:

##gff-version 3.2.1
##sequence-region ctg123 1 1497228
ctg123 . gene               1000  9000  .  +  .  ID=gene00001;Name=EDEN
ctg123 . TF_binding_site    1000  1012  .  +  .  ID=tfbs00001;Parent=gene00001
ctg123 . mRNA               1050  9000  .  +  .  ID=mRNA00001;Parent=gene00001;Name=EDEN.1
ctg123 . five_prime_UTR     1050  1200  .  +  .  Parent=mRNA00001
ctg123 . CDS                1201  1500  .  +  0  ID=cds00001;Parent=mRNA00001
ctg123 . CDS                3000  3902  .  +  0  ID=cds00001;Parent=mRNA00001
ctg123 . CDS                5000  5500  .  +  0  ID=cds00001;Parent=mRNA00001
ctg123 . CDS                7000  7600  .  +  0  ID=cds00001;Parent=mRNA00001
ctg123 . three_prime_UTR    7601  9000  .  +  .  Parent=mRNA00001
ctg123 . cDNA_match         1050  1500  5.8e-42  +  . ID=match00001;Target=cdna0123+12+462
ctg123 . cDNA_match         5000  5500  8.1e-43  +  . ID=match00001;Target=cdna0123+463+963
ctg123 . cDNA_match         7000  9000  1.4e-40  +  . ID=match00001;Target=cdna0123+964+2964
##FASTA
>ctg123
cttctgggcgtacccgattctcggagaacttgccgcaccattccgccttg
tgttcattgctgcctgcatgttcattgtctacctcggctacgtgtggcta
tctttcctcggtgccctcgtgcacggagtcgagaaaccaaagaacaaaaa
aagaaattaaaatatttattttgctgtggtttttgatgtgtgttttttat
aatgatttttgatgtgaccaattgtacttttcctttaaatgaaatgtaat
cttaaatgtatttccgacgaattcgaggcctgaaaagtgtgacgccattc
gtatttgatttgggtttactatcgaataatgagaattttcaggcttaggc
ttaggcttaggcttaggcttaggcttaggcttaggcttaggcttaggctt
aggcttaggcttaggcttaggcttaggcttaggcttaggcttaggcttag
aatctagctagctatccgaaattcgaggcctgaaaagtgtgacgccattc
...
>cnda0123
ttcaagtgctcagtcaatgtgattcacagtatgtcaccaaatattttggc
agctttctcaagggatcaaaattatggatcattatggaatacctcggtgg
aggctcagcgctcgatttaactaaaagtggaaagctggacgaaagtcata
tcgctgtgattcttcgcgaaattttgaaaggtctcgagtatctgcatagt
gaaagaaaaatccacagagatattaaaggagccaacgttttgttggaccg
tcaaacagcggctgtaaaaatttgtgattatggttaaagg
Just looking at it makes me want to leave everything behind and become a farmer and grow eggplants. Everyone likes eggplants. Eggplants are healthy. They are the result of hundreds of thousands of years of evolution, and almost everything about them communicates elegance. When done properly, an eggplant can turn the neural network of an adult's brain into a christmas tree. GFF3 is no eggplant. Couldn't have been any farther from it, in fact. It is a result of "intelligent design", but comes from times where we were not so intelligent about complex data, files, and file formats for complex data. I would choose an eggplant over GFF3 any day.

I know there are libraries to parse GFF3. But I don't want to condone the use of this file format. Just like the way I wouldn't grow parsley if I were to be a farmer. There is not really a good reason to not do it, but there is no good reason for parsley to imitate cilantro, either. But it does it and gets away with it, doesn't it. GFF3 is like parsley: a bad version of something everyone needs.

I wish I had enough time and energy to suggest another standard. But I am sure people who are much more able than I am already did it, yet here we are. So why bother.

Maybe one day in the future someone will tell me/us to suck it up and do it, and we will do it. Or maybe someone will write an `anvi-import-gff3-annotations` program, in which case we will happily test it and put it in the codebase. But unfortunately we don't have the person power to deal with it today.


Best, 

Antti Karkman

unread,
May 12, 2017, 8:41:29 AM5/12/17
to Anvi'o, antti....@helsinki.fi
Hi Meren,
I understand you. No one wants to be a gff parser, but sometimes you end up as one. 
But I'm not expecting you to include this in anvio, since I have a script based on gffutils that works on prokka gff output.
If someone really needs one, maybe I could share it. But only after polishing and still it probably won't work for all gff outputs. 

Thanks a lot for fixing the other issues and the great work in making all these nice tools.
-Antti

Nastassia Patin

unread,
Jul 17, 2017, 11:36:03 AM7/17/17
to Anvi'o, a.mura...@gmail.com, antti....@helsinki.fi
Hey all, I am having the exact same problem here with a pan-genome of three internal MAGs. I'm not sure how I should get rid of the stop codons in my prokka annotations - is something that needs to be done manually? I am running anvi'o v2.3.2

-Nastassia

A. Murat Eren

unread,
Jul 17, 2017, 11:48:00 AM7/17/17
to Nastassia Patin, Anvi'o, Karkman, Antti V
How did you import your Prokka annotations into anvi'o?

Did you follow this from Antti:



Best,

--

A. Murat Eren (meren)
http://merenlab.org :: gpg

Nastassia Patin

unread,
Jul 17, 2017, 11:50:32 AM7/17/17
to A. Murat Eren, Anvi'o, Karkman, Antti V
Yes, I followed that exactly! And did not modify the Prokka script so there should be no partial gene calls included in the annotation.
--

Dr. Nastassia Patin

Postdoctoral Researcher

School of Biology

Georgia Institute of Technology

A. Murat Eren

unread,
Jul 17, 2017, 11:53:41 AM7/17/17
to Anvi'o, Karkman, Antti V
Interesting. What happens if you take the FASTA files for your MAGs after anvi-summary, and treat them as external genomes?

--

A. Murat Eren (meren)
http://merenlab.org :: gpg

To unsubscribe from this group and stop receiving emails from it, send an email to anvio+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/anvio/CABcLjY0_3iNj0gvaUCc37yakgtP09A1u2-1oXZ8L__AVgZvA0g%40mail.gmail.com.

Nastassia Patin

unread,
Jul 17, 2017, 12:32:02 PM7/17/17
to Anvi'o, Karkman, Antti V
When I tried to generate the contigs database for the first MAG (after running anvi-summarize, then running Prokka on the bin contigs fasta file, then running gff_parser.py) I got the following error:

Config Error: Oops. Anvi'o run into an amino acid sequence (that corresponds to the gene callers id '1') which had an internal stop codon :/ This usually indicates your external gene calls have problems....

etc. 

But I didn't have any problems with the original metagenome contigs...

You received this message because you are subscribed to a topic in the Google Groups "Anvi'o" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/anvio/QIn2TKTutJM/unsubscribe.
To unsubscribe from this group and all its topics, send an email to anvio+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/anvio/CAK5_LacH7HD4GRBeJGNfP9kKtXGCoNyfVVLpP7%2BEKio0JTfD8w%40mail.gmail.com.

For more options, visit https://groups.google.com/d/optout.

A. Murat Eren

unread,
Jul 17, 2017, 12:35:10 PM7/17/17
to Anvi'o, Karkman, Antti V
Would you mind sending me the gzipped MAG, and prokka files so we can try to recapitulate it?

Maybe Antti would suggest something too.


Best,

--

A. Murat Eren (meren)
http://merenlab.org :: gpg

Antti Karkman

unread,
Jul 17, 2017, 12:41:24 PM7/17/17
to Anvi'o, antti....@helsinki.fi
Hi, 
seems that my message didn't go thru by email, so another try.


Hi Nastassia,
Have you tried to use the `--ignore-internal-stop-codons` option in the pangenomics pipeline? That might help. 

Or then you need to re-run the metagenomes and exclude all gene calls with internal stop codons already in that step. 

best
Antti
You received this message because you are subscribed to a topic in the Google Groups "Anvi'o" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/anvio/QIn2TKTutJM/unsubscribe.
To unsubscribe from this group and all its topics, send an email to anvio+un...@googlegroups.com.



--

Dr. Nastassia Patin

Postdoctoral Researcher

School of Biology

Georgia Institute of Technology

--
Anvi'o Paper: https://peerj.com/articles/1319/
Project Page: http://merenlab.org/projects/anvio/
Code Repository: https://github.com/meren/anvio
---
You received this message because you are subscribed to the Google Groups "Anvi'o" group.
To unsubscribe from this group and stop receiving emails from it, send an email to anvio+un...@googlegroups.com.

Nastassia Patin

unread,
Jul 17, 2017, 12:42:37 PM7/17/17
to Karkman, Antti V, Anvi'o
Thanks Antti, I didn't know about that option! That will likely solve the problem. 

Meren do you still want those files? It's no problem to send them.

On Mon, Jul 17, 2017 at 12:37 PM, Karkman, Antti V <antti....@helsinki.fi> wrote:
Hi Nastassia,
Have you tried to use the `--ignore-internal-stop-codons` option in the pangenomics pipeline? That might help. 

Or then you need to re-run the metagenomes and exclude all gene calls with internal stop codons already in that step. 

best
Antti

Nastassia Patin

unread,
Jul 17, 2017, 12:46:27 PM7/17/17
to Karkman, Antti V, Anvi'o
Ah...anvi'o also doesn't seem to know about that option ("unrecognized argument"). I see the --skip-alignments option, which would probably work, but ideally I would like to have alignments of the three MAGs.


A. Murat Eren

unread,
Jul 17, 2017, 1:11:43 PM7/17/17
to Anvi'o, Karkman, Antti V
​The option Antti mentions is​ for anvi-gen-contigs-database :) 

--

A. Murat Eren (meren)
http://merenlab.org :: gpg

Nastassia Patin

unread,
Jul 17, 2017, 1:21:58 PM7/17/17
to Anvi'o, Karkman, Antti V
Oh got it! Sorry. Will go ahead with that.


For more options, visit https://groups.google.com/d/optout.
Reply all
Reply to author
Forward
0 new messages