Phylogenomic analyses on my MAGs with genomes from NCBI resource

Jitesh

Apr 1, 2020, 2:43:30 PM
to Anvi'o
Hi Meren,

I am new to anvi'o and have managed to store my own MAGs in an anvi'o collection. Now I want to run a phylogenomic analysis that combines my MAGs with 3,000 genomes from NCBI.
I would appreciate it if you could point me to a detailed tutorial. My main concern is how to put my MAGs and the external genomes together.

Regards,
Jitesh 

A. Murat Eren

Apr 1, 2020, 3:51:03 PM
to Anvi'o
Hi Jitesh,

For your external genomes you will use the 'external-genomes' file format, and for your MAGs you should use the 'internal-genomes' file format. Both are described here:

Generating an anvi’o genomes storage

An anvi’o genomes storage is a special database that stores information about genomes. A genomes storage can be generated only from external genomes, only from internal genomes, or it can contain both types. Before we go any further, here are some definitions to clarify things:

  • An external genome is anything you have in a FASTA file format (i.e., a genome you have downloaded from NCBI, or obtained through any other way).

  • An internal genome is any genome bin you stored in an anvi’o collection at the end of your metagenomic analysis (if you are not familiar with the anvi’o metagenomic workflow please take a look at this post).

Converting FASTA files into anvi’o contigs databases: Working with internal genomes is quite straightforward since you already have an anvi’o contigs and an anvi’o profile database for them. But if all you have is a bunch of FASTA files, this workflow will require you to convert each of them into an anvi’o contigs database. There is a lot of information about how to create an anvi’o contigs database here, but if you feel lazy, you can also use the script anvi-script-FASTA-to-contigs-db, which requires a single parameter: the FASTA file path. Power users should consider taking a look at the code, and create their own batch script with improvements on those lines based on their needs (for instance, increasing the number of threads when running HMMs, etc). Also, you may want to run anvi-run-ncbi-cogs on your contigs database to annotate your genes.

You can create a new anvi’o genomes storage using the program anvi-gen-genomes-storage, which will require you to provide descriptions of genomes to be included in this storage. File formats for external genome and internal genome descriptions differ slightly. For instance, this is an example --external-genomes file:

name       contigs_db_path
Name_01    /path/to/contigs-01.db
Name_02    /path/to/contigs-02.db
Name_03    /path/to/contigs-03.db
(…)        (…)

and this is an example file for --internal-genomes:

name       bin_id       collection_id    profile_db_path                contigs_db_path
Name_01    Bin_id_01    Collection_A     /path/to/profile.db            /path/to/contigs.db
Name_02    Bin_id_02    Collection_A     /path/to/profile.db            /path/to/contigs.db
Name_03    Bin_id_03    Collection_B     /path/to/another_profile.db    /path/to/another/contigs.db
(…)        (…)          (…)              (…)                            (…)

For names in the first column please use only letters, digits, and the underscore character.
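
As a rough illustration of the external-genomes format described above, the Python sketch below builds such a table from a list of contigs database paths. It assumes the columns are tab-delimited, as anvi'o text files generally are. The function names (`safe_name`, `external_genomes_table`) and the example paths are hypothetical, and replacing disallowed characters with underscores is a choice made here to satisfy the naming rule, not something anvi'o does for you.

```python
import re
from pathlib import Path


def safe_name(stem):
    """Replace anything other than letters, digits, and underscores with '_'."""
    return re.sub(r"[^A-Za-z0-9_]", "_", stem)


def external_genomes_table(db_paths):
    """Return the text of an external-genomes file for the given contigs databases."""
    lines = ["name\tcontigs_db_path"]
    seen = set()
    for path in sorted(db_paths):
        name = safe_name(Path(path).stem)
        if name in seen:
            # Two files collapsing to the same name would be ambiguous.
            raise ValueError(f"duplicate genome name: {name}")
        seen.add(name)
        lines.append(f"{name}\t{path}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    paths = ["/path/to/contigs-01.db", "/path/to/contigs-02.db"]
    print(external_genomes_table(paths), end="")
```

An internal-genomes file could be written the same way, with the five columns shown in the second example above.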

--

A. Murat Eren (Meren)
http://merenlab.org :: twitter


--
Anvi'o Paper: https://peerj.com/articles/1319/
Project Page: http://merenlab.org/projects/anvio/
Code Repository: https://github.com/meren/anvio
---
You received this message because you are subscribed to the Google Groups "Anvi'o" group.
To unsubscribe from this group and stop receiving emails from it, send an email to anvio+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/anvio/3b815048-2394-4e52-8b1b-1fbe885e7fa5%40googlegroups.com.

Jitesh

Apr 2, 2020, 4:08:16 PM
to Anvi'o
Thanks, Meren, for your kind reply.

What I understood is that I should create a genomes storage (combining internal and external genomes) using
"anvi-gen-genomes-storage [-h] [-e FILE_PATH] [-i FILE_PATH] [--gene-caller GENE-CALLER] -o GENOMES_STORAGE"

This may sound odd, but how can the output GENOMES_STORAGE.db be used for phylogenomic analysis? The only examples I can find use either internal genomes or external genomes, not both. Maybe I am a little confused!

Your help is appreciated.

Jitesh

A. Murat Eren

Apr 2, 2020, 4:10:25 PM
to Anvi'o
Hi Jitesh,

Did you follow this tutorial?



Best,

Jitesh

Apr 3, 2020, 8:14:55 AM
to Anvi'o
Hi Meren,
Thanks for the link. I had only been following the http://merenlab.org/2017/06/07/phylogenomics/ tutorial. Things became clear once I looked through the all programs and ...... list
(http://merenlab.org/software/anvio/vignette/#anvi-get-sequences-for-hmm-hits), specifically anvi-get-sequences-for-hmm-hits (INPUT OPTION #3: INT/EXTERNAL GENOMES FILE),
which describes exactly how to use external and internal genomes files to get HMM hits that can then be used for the rest of the phylogeny steps.
I hope I am on the right track.
Thanks for the quick reply. Cheers to the anvi'o team!
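
For readers following along, here is a minimal Python sketch of how the combined call might be assembled. It only builds and prints the command line; it does not run anvi'o. The file names, the HMM source (`Bacteria_71`), and the gene names are placeholders that depend on your data and your installed anvi'o version, and the helper name `hmm_hits_command` is hypothetical.

```python
import shlex


def hmm_hits_command(external_genomes, internal_genomes, hmm_source, gene_names, output_fasta):
    """Build the argument list for anvi-get-sequences-for-hmm-hits."""
    return [
        "anvi-get-sequences-for-hmm-hits",
        "--external-genomes", external_genomes,   # genomes downloaded as FASTA
        "--internal-genomes", internal_genomes,   # MAGs stored in a collection
        "--hmm-source", hmm_source,
        "--gene-names", ",".join(gene_names),
        "--concatenate-genes",
        "--return-best-hit",
        "--get-aa-sequences",
        "-o", output_fasta,
    ]


# Placeholder values, for illustration only.
cmd = hmm_hits_command(
    "external-genomes.txt",
    "internal-genomes.txt",
    "Bacteria_71",
    ["Ribosomal_L1", "Ribosomal_L2"],
    "concatenated-proteins.fa",
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The concatenated FASTA file written by that command is what the phylogenomics tutorial then passes to anvi-gen-phylogenomic-tree.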

Jitesh

Apr 3, 2020, 10:11:08 AM
to Anvi'o
Meren, is it possible to set the number of threads for the script below? It is running slowly because I have 3,000 genome FASTA files.

for i in *fa
do
    anvi-script-FASTA-to-contigs-db "$i"
done

A. Murat Eren

Apr 3, 2020, 10:15:34 AM
to Anvi'o
No. Actually no one should use anvi-script-FASTA-to-contigs-db and I will remove it from the codebase now :(

You should use our snakemake workflows:


The Contigs Workflow will take care of it for you (we did process tens of thousands of FASTA files with that approach). 
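
The contigs workflow reads its inputs from the file named under "fasta_txt" in the config. The Python sketch below shows one way to generate that file for a large set of FASTA files. It assumes a tab-delimited table with `name` and `path` columns, which should be checked against the workflow documentation for your anvi'o version. The function name `fasta_txt` and the example paths are hypothetical.

```python
import re
from pathlib import Path


def fasta_txt(fasta_paths):
    """Return the text of a fasta.txt table listing one FASTA file per row."""
    rows = ["name\tpath"]
    for path in sorted(fasta_paths):
        # Keep names to letters, digits, and underscores (an assumption made
        # here, mirroring the naming rule for genomes files).
        name = re.sub(r"[^A-Za-z0-9_]", "_", Path(path).stem)
        rows.append(f"{name}\t{path}")
    return "\n".join(rows) + "\n"


if __name__ == "__main__":
    # Hypothetical directory; replace with wherever the genomes live.
    paths = [str(p) for p in Path("genomes").glob("*.fa")]
    Path("fasta.txt").write_text(fasta_txt(paths))
```

Parallelism then comes from snakemake rather than from a shell loop; extra snakemake arguments such as `--cores` can be passed through `anvi-run-workflow` with `--additional-params`.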

Best,

Jitesh

Apr 3, 2020, 10:27:54 AM
to Anvi'o
Ohh! Thanks for this information. I am gonna use that.


Jitesh

Apr 5, 2020, 4:13:04 AM
to Anvi'o
Hi Meren,

I tried to run the workflow on workflow_tutorial_data with two FASTA files (G01 and G02). A contigs database was produced for G02, but there was no output for G01.
Here is the command's progress, with the error at the end:

(anvio5) jitesh@jitesh:~/Desktop/WORKFLOW_TUTORIAL_DATA$ anvi-run-workflow -w contigs -c config-contigs.json 

WARNING
===============================================
If you publish results from this workflow, please do not forget to cite
snakemake (doi:10.1038/nmeth.3176)


WARNING
===============================================
We are initiating parameters for the contigs workflow

[05 Apr 20 07:12:22 Bleep bloop] Quick dry run for an initial sanity check ...

WARNING
===============================================
We are initiating parameters for the contigs workflow


Shell programs for the workflow
===============================================
Needed .......................................: gunzip, anvi-script-reformat-fasta, anvi-script-reformat-fasta, anvi-gen-contigs-database, anvi-import-functions, anvi-run-hmms, anvi-run-pfams, anvi-run-ncbi-cogs, anvi-get-sequences-for-gene-calls, touch
Missing ......................................: None

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Unlimited resources: nodes
Job counts:
    count    jobs
    2        annotate_contigs_database
    2        anvi_gen_contigs_database
    2        anvi_run_hmms
    2        anvi_run_ncbi_cogs
    2        anvi_script_reformat_fasta
    2        anvi_script_reformat_fasta_prefix_only
    1        generate_and_annotate_contigs_db
    13

[Sun Apr  5 07:12:22 2020]
rule anvi_script_reformat_fasta_prefix_only:
    input: three_samples_example/G02-contigs.fa
    output: 01_FASTA_contigs_workflow/G02/G02-contigs-prefix-formatted-only.fa, 01_FASTA_contigs_workflow/G02/G02-reformat-report.txt
    log: 00_LOGS_contigs_workflow/G02-anvi_script_reformat_fasta_prefix_only.log
    jobid: 11
    wildcards: group=G02
    resources: nodes=1

anvi-script-reformat-fasta three_samples_example/G02-contigs.fa -o 01_FASTA_contigs_workflow/G02/G02-contigs-prefix-formatted-only.fa -r 01_FASTA_contigs_workflow/G02/G02-reformat-report.txt --prefix G02   --simplify-names >> 00_LOGS_contigs_workflow/G02-anvi_script_reformat_fasta_prefix_only.log 2>&1
Write-protecting output file 01_FASTA_contigs_workflow/G02/G02-contigs-prefix-formatted-only.fa.
[Sun Apr  5 07:12:23 2020]
Finished job 11.
1 of 13 steps (8%) done

[Sun Apr  5 07:12:23 2020]
rule anvi_script_reformat_fasta:
    input: 01_FASTA_contigs_workflow/G02/G02-contigs-prefix-formatted-only.fa
    output: 01_FASTA_contigs_workflow/G02/G02-contigs.fa
    log: 00_LOGS_contigs_workflow/G02-anvi_script_reformat_fasta.log
    jobid: 9
    wildcards: group=G02
    resources: nodes=1

anvi-script-reformat-fasta 01_FASTA_contigs_workflow/G02/G02-contigs-prefix-formatted-only.fa -o 01_FASTA_contigs_workflow/G02/G02-contigs.fa  >> 00_LOGS_contigs_workflow/G02-anvi_script_reformat_fasta.log 2>&1
[Sun Apr  5 07:12:24 2020]
Finished job 9.
2 of 13 steps (15%) done

[Sun Apr  5 07:12:24 2020]
rule anvi_gen_contigs_database:
    input: 01_FASTA_contigs_workflow/G02/G02-contigs.fa
    output: 02_CONTIGS_contigs_workflow/G02-contigs.db
    log: 00_LOGS_contigs_workflow/G02-anvi_gen_contigs_database.log
    jobid: 3
    wildcards: group=G02
    resources: nodes=1

anvi-gen-contigs-database -f 01_FASTA_contigs_workflow/G02/G02-contigs.fa -o 02_CONTIGS_contigs_workflow/G02-contigs.db   --project-name G02        >> 00_LOGS_contigs_workflow/G02-anvi_gen_contigs_database.log 2>&1
Removing temporary output file 01_FASTA_contigs_workflow/G02/G02-contigs.fa.
[Sun Apr  5 07:12:28 2020]
Finished job 3.
3 of 13 steps (23%) done

[Sun Apr  5 07:12:28 2020]
rule anvi_run_hmms:
    input: 02_CONTIGS_contigs_workflow/G02-contigs.db
    output: 02_CONTIGS_contigs_workflow/anvi_run_hmms-G02.done
    log: 00_LOGS_contigs_workflow/G02-anvi_run_hmms.log
    jobid: 4
    wildcards: group=G02
    resources: nodes=5

anvi-run-hmms -c 02_CONTIGS_contigs_workflow/G02-contigs.db -T 1   >> 00_LOGS_contigs_workflow/G02-anvi_run_hmms.log 2>&1
Touching output file 02_CONTIGS_contigs_workflow/anvi_run_hmms-G02.done.
[Sun Apr  5 07:12:34 2020]
Finished job 4.
4 of 13 steps (31%) done

[Sun Apr  5 07:12:34 2020]
rule anvi_run_ncbi_cogs:
    input: 02_CONTIGS_contigs_workflow/G02-contigs.db
    output: 02_CONTIGS_contigs_workflow/anvi_run_ncbi_cogs-G02.done
    log: 00_LOGS_contigs_workflow/G02-anvi_run_ncbi_cogs.log
    jobid: 5
    wildcards: group=G02
    resources: nodes=5

anvi-run-ncbi-cogs -c 02_CONTIGS_contigs_workflow/G02-contigs.db -T 1     >> 00_LOGS_contigs_workflow/G02-anvi_run_ncbi_cogs.log 2>&1
[Sun Apr  5 07:12:35 2020]
Error in rule anvi_run_ncbi_cogs:
    jobid: 5
    output: 02_CONTIGS_contigs_workflow/anvi_run_ncbi_cogs-G02.done
    log: 00_LOGS_contigs_workflow/G02-anvi_run_ncbi_cogs.log

RuleException:
CalledProcessError in line 375 of /home/jitesh/anaconda3/envs/anvio5/lib/python3.6/site-packages/anvio/workflows/contigs/Snakefile:
Command ' set -euo pipefail;  anvi-run-ncbi-cogs -c 02_CONTIGS_contigs_workflow/G02-contigs.db -T 1     >> 00_LOGS_contigs_workflow/G02-anvi_run_ncbi_cogs.log 2>&1 ' returned non-zero exit status 255.
  File "/home/jitesh/anaconda3/envs/anvio5/lib/python3.6/site-packages/anvio/workflows/contigs/Snakefile", line 375, in __rule_anvi_run_ncbi_cogs
  File "/home/jitesh/anaconda3/envs/anvio5/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/jitesh/Desktop/WORKFLOW_TUTORIAL_DATA/.snakemake/log/2020-04-05T071222.813427.snakemake.log


config-contigs.json
fasta.txt

A. Murat Eren

Apr 5, 2020, 10:47:52 AM
to Anvi'o
You are almost there, Jitesh. You need to run anvi-setup-ncbi-cogs and anvi-setup-scg-databases to set up the necessary databases.

Alternatively, you can turn off the steps that require them:

{
    "fasta_txt": "fasta.txt",
    "anvi_run_ncbi_cogs": {
        "run": false
    },
    "anvi_run_scg_taxonomy": {
        "run": false
    },
    "output_dirs": {
        "FASTA_DIR": "01_FASTA_contigs_workflow",
        "CONTIGS_DIR": "02_CONTIGS_contigs_workflow",
        "LOGS_DIR": "00_LOGS_contigs_workflow"
    }
}
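
A small Python sketch of the same idea, for anyone who prefers to edit the config programmatically. It also rejects repeated keys, because Python's `json` module silently keeps only the last of two identical keys, so a rule that was meant to be switched off can be lost without any warning. The helper names (`load_config`, `disable_rules`) are hypothetical; the rule names are the ones that appear in this thread.

```python
import json


def reject_duplicate_keys(pairs):
    """object_pairs_hook that raises if a JSON object repeats a key."""
    keys = [key for key, _ in pairs]
    dupes = sorted({key for key in keys if keys.count(key) > 1})
    if dupes:
        raise ValueError("duplicate keys in config: " + ", ".join(dupes))
    return dict(pairs)


def load_config(text):
    """Parse config JSON, failing loudly on repeated keys."""
    return json.loads(text, object_pairs_hook=reject_duplicate_keys)


def disable_rules(config, rule_names):
    """Return a copy of the config with "run": false set for each rule."""
    updated = dict(config)
    for rule in rule_names:
        settings = dict(updated.get(rule, {}))
        settings["run"] = False
        updated[rule] = settings
    return updated


config = load_config('{"fasta_txt": "fasta.txt"}')
config = disable_rules(config, ["anvi_run_ncbi_cogs", "anvi_run_scg_taxonomy"])
print(json.dumps(config, indent=4))
```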


I will update the tutorial to make sure it is clear.


Best,

Jitesh

Apr 5, 2020, 1:49:04 PM
to Anvi'o
Yes Meren, a more detailed tutorial would be helpful.
The current config-contigs.json file contains the steps listed below, but I could not find a description of their parameters anywhere, unlike the individual anvi'o programs, which are all explained very nicely.
So does every database (for centrifuge, HMMs, COGs, emapper) need to be set up, and each step turned on, if I need it? How do I set things up for creating a contigs database? I hope my question is clear. Thanks in advance.

{
    "fasta_txt": "fasta.txt",
    "anvi_gen_contigs_database": {
        "--project-name": "{group}",
        "--description": "",
        "--skip-gene-calling": "",
        "--external-gene-calls": "",
        "--ignore-internal-stop-codons": "",
        "--skip-mindful-splitting": "",
        "--contigs-fasta": "",
        "--split-length": "",
        "--kmer-size": "",
        "--prodigal-translation-table": "",
        "threads": "10"
    },
    "centrifuge": {
        "threads": 10,
        "run": "",
        "db": ""
    },
    "anvi_run_hmms": {
        "run": true,
        "threads": 10,
        "--installed-hmm-profile": "",
        "--hmm-profile-dir": ""
    },
    "anvi_run_ncbi_cogs": {
        "run": true,
        "threads": 10,
        "--cog-data-dir": "",
        "--sensitive": "",
        "--temporary-dir-path": "",
        "--search-with": ""
    },
    "anvi_script_reformat_fasta": {
        "run": true,
        "--keep-ids": "",
        "--exclude-ids": "",
        "--min-len": "",
        "threads": "10"
    },
    "emapper": {
        "--database": "bact",
        "--usemem": true,
        "--override": true,
        "path_to_emapper_dir": "",
        "threads": "10"
    },
    "anvi_script_run_eggnog_mapper": {
        "--use-version": "0.12.6",
        "run": "",
        "--cog-data-dir": "",
        "--drop-previous-annotations": "",
        "threads": "10"
    },
    "gen_external_genome_file": {
        "threads": ""
    },
    "export_gene_calls_for_centrifuge": {
        "threads": "10"
    },
    "anvi_import_taxonomy": {
        "threads": "10"
    },
    "annotate_contigs_database": {
        "threads": "10"
    },
    "anvi_get_sequences_for_gene_calls": {
        "threads": "10"
    },
    "gunzip_fasta": {
        "threads": "10"
    },
    "reformat_external_gene_calls_table": {
        "threads": ""
    },
    "reformat_external_functions": {
        "threads": "10"
    },
    "import_external_functions": {
        "threads": "10"
    },
    "anvi_run_pfams": {
        "run": "",
        "--pfam-data-dir": "",
        "threads": "10"
    },
    "output_dirs": {
        "FASTA_DIR": "01_FASTA",
        "CONTIGS_DIR": "02_CONTIGS",
        "LOGS_DIR": "00_LOGS"
    },
    "max_threads": "10"
}

A. Murat Eren

Apr 5, 2020, 3:13:25 PM
to Anvi'o
If you do what I pointed out in my previous e-mail (running those two setup programs), you will not need to do anything else.


Best,

Jitesh

Apr 7, 2020, 4:15:49 PM
to Anvi'o
Hi Meren,
I did as you suggested. These are the errors:
$ anvi-run-workflow -w contigs -c config-contigs.json
Config Error: Config files must include a config_version. If this is news to you, and/or you
              don't know what version your config should be, please run in your terminal the
              command `anvi-migrate config-contigs.json` to upgrade your config file.
$ anvi-migrate config-contigs.json
Config Error: Your config must include a workflow_name. For example if this config file is
              used for the metagenomics workflow then add '"workflow_name": "metagenomics"' to
              your config.
Please help me resolve this issue.

A. Murat Eren

Apr 7, 2020, 6:33:41 PM
to Anvi'o
I think it would be good practice to read anvi'o error messages carefully, Jitesh.

Jitesh

Apr 8, 2020, 8:42:28 AM
to Anvi'o
Thanks Meren,
It worked after including the following in the config:

    },
    "config_version": "1",
    "anvi_gen_contigs_database": {
        "--project-name": "{group}"
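
The two error messages earlier in the thread ask for a config_version and a workflow_name in the config file. A minimal Python sketch of adding them when they are missing is shown below. The value "1" for config_version is the one reported to work in this thread; "contigs" for workflow_name is an assumption based on the `-w contigs` argument, and newer anvi'o releases may expect a different version string. The helper name `add_required_keys` is hypothetical.

```python
import json


def add_required_keys(config, workflow_name, config_version="1"):
    """Return a copy of the config with workflow_name and config_version set if absent."""
    updated = dict(config)
    # setdefault leaves any value the user already provided untouched.
    updated.setdefault("workflow_name", workflow_name)
    updated.setdefault("config_version", config_version)
    return updated


# A stripped-down config, for illustration only.
config = {"fasta_txt": "fasta.txt"}
print(json.dumps(add_required_keys(config, "contigs"), indent=4))
```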