Queries regarding BRUCELLOME metagenomics to estimate Prevelance and Abundance

Praharshit Sharma

unread,

Jul 20, 2021, 3:27:49 AM7/20/21

to bioc...@googlegroups.com

Dear BioClues Member, I am seeking Metagenomics/ Microbial classification Help as below

This is BioClues Member Praharshit Sharma from ICAR: Indian Council of Agricultural Research, India. Could any Metagenomics or Microbial-Omics Expert in BioClues resolve below Query?

Thank you!

========================================================================

My Lab is currently doing the following Metagenomics tasks, specifically pertaining to BRUCELLOME metagenomics classification. For sake of Open Access reproducibility of results to cross-check, I am mentioning my Scripting here

For your ease of reference, I started off with Kraken2 installation...

$


    conda install -c bioconda kraken2

Step1: Installation of kraken2, wrt https://anaconda.org/bioconda/kraken2

$ kraken2-build --download-taxonomy --db kraken2

Step2: Installing the Taxonomy, wrt

https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown


$ kraken2-build --download-library bacteria --db kraken2

Step3: I am interested in the Bacteria custom DataBase, since Brucella is a bacterial genus.
wrt, https://en.wikipedia.org/wiki/Brucella

$ cd kraken2

$ kraken2-build --threads 24 --build --db . --kmer-len 47 --minimizer-len 31 --minimizer-spaces 5

Step4: I had built the kraken2 database, wrt Section-3 of "Custom Databases" part of

https://github.com/DerrickWood/kraken2/wiki/Manual#custom-databases

After this Step, I got the k2d files, located in /kraken2 Directory , + seqid2taxid.map File Too.

hash.k2d: Contains the minimizer to taxon mappings
opts.k2d: Contains information about the options used to build the database
taxo.k2d: Contains taxonomy information used to build the database

$ kraken2 --db . brucellome.fa

Step5: I had classified a Multi-FastA file of 277 sequences called " Brucellome.fa " , wrt

https://github.com/DerrickWood/kraken2/wiki/Manual#classification

[ Where, Brucellome.FA was concatenated as per below Scripts...

$ pwd

( Same as kraken2 directory/ subdirectory ( . ) above),

$ ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt

$ awk -F "\t" '$0 ~ /Brucella/ && $12=="Complete Genome" && $11=="latest"{print $20}' assembly_summary.txt | tee ftpdirpaths

$ wc -l ftpdirpaths

140

( Therefore, 140 Complete Genomes of various Brucella Species (Spp.) belonging to Brucella Genus were downloaded, which became in terms of sequence statistics, 277

$ seqkit stats brucellome.fa
file format type num_seqs sum_len min_len avg_len max_len
brucellome.fa FASTA DNA 277 469,732,267 61,780 1,695,784.4 4,165,132

Because, many individual FastA files that went into concatenation of Brucellome.FA comprised

of >= Chromosomes). There was 4 more intermediate steps to arrive at Brucellome.fa (below))

$ awk 'BEGIN{FS=OFS="/";filesuffix="genomic.fna.gz"}{ftpdir=$0;asm=$10;file=asm"_"filesuffix;print ftpdir,file}' ftpdirpaths | tee ftpfilepaths

$ cat ftpfilepaths | parallel -j 20 --verbose --progress "cd all && curl -O {}"

$ gunzip -c *.fna.gz

$ cat *.fna | tee brucellome.fa

] # End of Scripts sub-section [...] required to construct " brucellome.fa "

Step-6: The STDOUT from Step-6 above was refined (To Print classified sequences to file)

$ kraken2 --db . brucellome.fa --classified-out bruce47.out

Step-7: To Print a report with aggregrate counts/clade to file, I did

$ kraken2 --db . --report bruce47.txt bruce47.out

The output of kraken-report, as we Know, is tab-delimited, with one line per taxon. The fields of the output, from left-to-right, are as follows:

Percentage of reads covered by the clade rooted at this taxon
Number of reads covered by the clade rooted at this taxon
Number of reads assigned directly to this taxon
A rank code, indicating (U)nclassified, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. All other ranks are simply '-'.
NCBI taxonomy ID
indented scientific name

The scientific names are indented using spaces, according to the tree structure specified by the taxonomy.

NOTE: Above, "47" indicates kMer-size = 47 ( < min_len of Brucellome.FA = 61,780 ).

I think I am justified in using --kmer-len 47 option ( It was an Arbitrary choice),

By default, the values of k and l are 35 and 31, respectively (or 15 and 12 for protein databases). These values can be explicitly set with the --kmer-len and minimizer-len options, however. Note that the minimizer length must be no more than 31 for nucleotide databases, and 15 for protein databases. Additionally, the minimizer length l must be no more than the k-mer length. There is no upper bound on the value of k, but sequences less than k bp in length cannot be classified.

Kraken 2 also utilizes a simple spaced seed approach to increase accuracy. A number s < l/4 can be chosen, and s positions in the minimizer will be masked out during all comparisons. Masked positions are chosen to alternate from the second-to-last position in the minimizer; e.g., s = 5 and l = 31 will result in masking out the 0 positions shown here:

111 1111 1111 1111 1111 1101 0101 0101

By default, s = 7 for nucleotide databases, and k = 0 for protein databases. This can be changed using the --minimizer-spaces option along with the --build task of kraken2-build. REFERENCE-

https://github.com/DerrickWood/kraken2/wiki/Manual#custom-databases [ Section-3].

Dear Member, it is after Step-7 that I am stuck. How do I do

(1) Binning, using Sourmash for instance.

(2) Generate Heatmaps

(3) Detect Prevelance?

(4) Estimate relative Abundances?

Also, is it technically (and biologically speaking), correct to inflate 140 -> 277 Sequences in a Multi-FastA file by merging ALL the chromosomes from individual Brucella species?

I sincerely seek your Help to proceed further.

Also, as a purely computational exploration, I wish to do Trial & Error based on,

(1) various kMer lengths, apart from 47

(2) Simulate Whole genome Shotgun metagenomic reads using

dwgsim Tool, besides Brucella isolates FastQ files from DDBJ-DRA

https://ddbj.nig.ac.jp/DRASearch/query?keyword=Brucella+metagenome&show=20

https://davetang.org/wiki/tiki-index.php?page=DWGSIM

Investigating the diversity of Brucella isolates ,

that Corresponds to the NCBI- BioProject,

https://www.ncbi.nlm.nih.gov/bioproject/?term=Investigating%20the%20diversity%20of%20Brucella%20isolates

The Typical results for 47-mer Brucellome classification is duly enclosed for Reference.

--

Thanking you. Sincerely yours ,

Praharshit Sharma https://www.linkedin.com/in/bioinformaticsharma/

+91 93 511 755 63 "He who can, Does. He who cannot, Teaches."

Bruce47.png

Prash

unread,

Jul 20, 2021, 4:09:29 AM7/20/21

to bioc...@googlegroups.com

Dea Harsh

You could use R ( "Clustvis") to generate heatmaps and kraken output as an input would make it easy. Pl message me in person if you pose any problems

Regards

Prash

--
Bioclues is India's largest bioinformatics society working for bridging mentor mentee relationships. We are an affiliate of APBioNet.org and ISCB.org
http://www.bioclues.orgFacebook: https://www.facebook.com/groups/bioclues/Linkedin: http://www.linkedin.com/groups?gid=1339327

We respect your privacy. Should you feel you are not interested, please unsubscribe from the googlegroup. Alternatively, you can send a request to the moderator. Aside, we would highly appreciate if you respect open access.
---
You received this message because you are subscribed to the Google Groups "Bioclues" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bioclues+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bioclues/CALrM1gAU7ZYThdfQgRqKxDBVoGq1CnySSznGze2SgbQ8epUNJQ%40mail.gmail.com.

--

Prashanth N Suravajhala, Ph.D.
Senior Scientist, Systems Biology
Department of Biotechnology and Bioinformatics
Birla Institute of Scientific Research
Statue Circle, Jaipur 302001 RJ, India
Telephone: (work) +91-141-2385094. Extension 308
E mail: prash[AT]bisr[DOT]res[DOT]in
Home page: http://www.bioinformatics.org/wiki/Prash

Twitter: @prashbio

"Each problem that I solved became a rule, which served afterwards to solve other problems." ~ Rene Descartes

Giri Polavarapu

unread,

Jul 20, 2021, 4:36:00 AM7/20/21

to bioc...@googlegroups.com

Dear Dr. Sharma

Please contact Dr. Rajendran at MKU

9442879993

He will help.

Regards,

Giri

On Tue, Jul 20, 2021 at 12:58 PM Praharshit Sharma <praharsh...@gmail.com> wrote:

This message is eligible for Automatic Cleanup! (praharsh...@gmail.com) Add cleanup rule | More info

--

Thank you and best regards,

Rathnagiri Polavarapu, Ph.D.,

President & CEO

Genomix Biotech Inc

2620 Braithwood Road, Atlanta, GA 30345; USA

770-934-6300 (USA)

770-842-9727 (US Cell)

&

Genomix Molecular Diagnostics (P) Ltd,

5-36/207 Prasanthnagar

Kukatpally, Hyderabad 500 072, AP, India;

09849114520 ( India Cell); 040-64581236 (Phone); 770-783-2191 (India Vonage)

www.genomixbiotech.com

gi...@genomixbiotech.com

Skype: Giri.polavarapu

Praharshit Sharma

unread,

Jul 20, 2021, 4:46:31 AM7/20/21

to bioc...@googlegroups.com, gi...@genomixbiotech.com, Prash

Thank you for the information, Professor Giri. Thank you Prash,

I had already contacted Dr Rajendran @ Madurai Kamaraj University, Tamil Nadu regarding BrucellaBase: https://pubmed.ncbi.nlm.nih.gov/27164438/

Shall get in touch with him again, as suggested. Sincerely yours, PS.

To view this discussion on the web visit https://groups.google.com/d/msgid/bioclues/CAMG%2BiGqa_Ri2kg45rkXXy48kntKQL4cqBx8ufqsuQ-dQEsnCfA%40mail.gmail.com.

Reply all

Reply to author

Forward