Queries regarding BRUCELLOME metagenomics to estimate Prevelance and Abundance

Skip to first unread message

Praharshit Sharma

unread,
Jul 20, 2021, 3:27:49 AM7/20/21
to bioc...@googlegroups.com
Dear BioClues Member, I am seeking Metagenomics/ Microbial classification Help as below
This is BioClues Member Praharshit Sharma from ICAR: Indian Council of Agricultural Research, India. Could any Metagenomics or Microbial-Omics Expert in BioClues resolve below Query?
Thank you!
========================================================================
My Lab is currently doing the following Metagenomics tasks, specifically pertaining to BRUCELLOME metagenomics classification. For sake of Open Access reproducibility of results to cross-check, I am mentioning my Scripting here

For your ease of reference, I started off with Kraken2 installation...

$ conda install -c bioconda kraken2
Step1: Installation of kraken2, wrt https://anaconda.org/bioconda/kraken2

$ kraken2-build --download-taxonomy --db kraken2
Step2: Installing the Taxonomy, wrt

$ kraken2-build --download-library bacteria --db
kraken2
Step3: I am interested in the Bacteria custom DataBase, since Brucella is a bacterial genus.
wrt, https://en.wikipedia.org/wiki/Brucella

$ cd kraken2
$ kraken2-build --threads 24 --build --db . --kmer-len 47 --minimizer-len 31 --minimizer-spaces 5
Step4: I had built the kraken2 database, wrt Section-3 of "Custom Databases" part of
After this Step, I got the k2d files, located in /kraken2 Directory , + seqid2taxid.map File Too.
  • hash.k2d: Contains the minimizer to taxon mappings
  • opts.k2d: Contains information about the options used to build the database
  • taxo.k2d: Contains taxonomy information used to build the database

$ kraken2 --db . brucellome.fa
Step5: I had classified a Multi-FastA file of 277 sequences called " Brucellome.fa " , wrt

[ Where, Brucellome.FA was concatenated as per below Scripts...

$ pwd
( Same as kraken2 directory/ subdirectory ( . ) above),
$ awk -F "\t" '$0 ~ /Brucella/ && $12=="Complete Genome" && $11=="latest"{print $20}' assembly_summary.txt | tee ftpdirpaths

$ wc -l ftpdirpaths
140
( Therefore, 140 Complete Genomes of various Brucella Species (Spp.) belonging to Brucella Genus were downloaded, which became in terms of sequence statistics, 277
$ seqkit stats brucellome.fa
file           format  type  num_seqs      sum_len  min_len      avg_len    max_len
brucellome.fa  FASTA   DNA        277  469,732,267   61,780  1,695,784.4  4,165,132

Because, many individual FastA files that went into concatenation of Brucellome.FA comprised
of >= Chromosomes). There was 4 more intermediate steps to arrive at Brucellome.fa (below))

$ awk 'BEGIN{FS=OFS="/";filesuffix="genomic.fna.gz"}{ftpdir=$0;asm=$10;file=asm"_"filesuffix;print ftpdir,file}' ftpdirpaths | tee ftpfilepaths

$ cat ftpfilepaths | parallel -j 20 --verbose --progress "cd all && curl -O {}"
$ gunzip -c *.fna.gz
$ cat *.fna | tee brucellome.fa
] # End of Scripts sub-section [...] required to construct " brucellome.fa "

Step-6: The STDOUT from Step-6 above was refined (To Print classified sequences to file)
$ kraken2 --db . brucellome.fa --classified-out bruce47.out
Step-7: To Print a report with aggregrate counts/clade to file, I did
$ kraken2 --db . --report bruce47.txt bruce47.out

The output of kraken-report, as we Know, is tab-delimited, with one line per taxon. The fields of the output, from left-to-right, are as follows:
  1. Percentage of reads covered by the clade rooted at this taxon
  2. Number of reads covered by the clade rooted at this taxon
  3. Number of reads assigned directly to this taxon
  4. A rank code, indicating (U)nclassified, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. All other ranks are simply '-'.
  5. NCBI taxonomy ID
  6. indented scientific name

The scientific names are indented using spaces, according to the tree structure specified by the taxonomy.

NOTE: Above, "47" indicates kMer-size = 47 ( < min_len of Brucellome.FA = 61,780 ).
I think I am justified in using --kmer-len 47 option ( It was an Arbitrary choice),

By default, the values of k and l are 35 and 31, respectively (or 15 and 12 for protein databases). These values can be explicitly set with the --kmer-len and minimizer-len options, however. Note that the minimizer length must be no more than 31 for nucleotide databases, and 15 for protein databases. Additionally, the minimizer length l must be no more than the k-mer length. There is no upper bound on the value of k, but sequences less than k bp in length cannot be classified.

Kraken 2 also utilizes a simple spaced seed approach to increase accuracy. A number s < l/4 can be chosen, and s positions in the minimizer will be masked out during all comparisons. Masked positions are chosen to alternate from the second-to-last position in the minimizer; e.g., s = 5 and l = 31 will result in masking out the 0 positions shown here:

111 1111 1111 1111 1111 1101 0101 0101

By default, s = 7 for nucleotide databases, and k = 0 for protein databases. This can be changed using the --minimizer-spaces option along with the --build task of kraken2-build. REFERENCE-

https://github.com/DerrickWood/kraken2/wiki/Manual#custom-databases    [ Section-3].

Dear Member, it is after Step-7 that I am stuck. How do I do

(1) Binning, using Sourmash for instance.

(2) Generate Heatmaps

(3) Detect Prevelance?

(4) Estimate relative Abundances?

Also, is it technically (and biologically speaking), correct to inflate 140 -> 277 Sequences in a Multi-FastA file by merging ALL the chromosomes from individual Brucella species?

I sincerely seek your Help to proceed further.

Also, as a purely computational exploration, I wish to do Trial & Error based on,

(1) various kMer lengths, apart from 47

(2) Simulate Whole genome Shotgun metagenomic reads using

dwgsim Tool, besides Brucella isolates FastQ files from DDBJ-DRA

https://ddbj.nig.ac.jp/DRASearch/query?keyword=Brucella+metagenome&show=20

https://davetang.org/wiki/tiki-index.php?page=DWGSIM

Investigating the diversity of Brucella isolates ,

that Corresponds to the NCBI- BioProject,

https://www.ncbi.nlm.nih.gov/bioproject/?term=Investigating%20the%20diversity%20of%20Brucella%20isolates

The Typical results for 47-mer Brucellome classification is duly enclosed for Reference.


Bruce47.png

--

Thanking you. Sincerely yours ,

Praharshit Sharma                https://www.linkedin.com/in/bioinformaticsharma/

+91 93 511 755 63                  "He who can, Does. He who cannot, Teaches."

Bruce47.png

Prash

unread,
Jul 20, 2021, 4:09:29 AM7/20/21
to bioc...@googlegroups.com
Dea  Harsh

You could use  R ( "Clustvis")  to generate heatmaps and kraken output as an input would make it easy.  Pl message me in person if you pose any problems

Regards
Prash

--
Bioclues is India's largest bioinformatics society working for bridging mentor mentee relationships. We are an affiliate of  APBioNet.org and ISCB.org
http://www.bioclues.orgFacebook: https://www.facebook.com/groups/bioclues/Linkedin: http://www.linkedin.com/groups?gid=1339327
 
We respect your privacy. Should you feel you are not interested, please unsubscribe from the googlegroup. Alternatively, you can send a request to the moderator. Aside, we would highly appreciate if you respect open access.
---
You received this message because you are subscribed to the Google Groups "Bioclues" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bioclues+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bioclues/CALrM1gAU7ZYThdfQgRqKxDBVoGq1CnySSznGze2SgbQ8epUNJQ%40mail.gmail.com.


--
Prashanth N Suravajhala, Ph.D.
Senior Scientist, Systems Biology
Department of Biotechnology and Bioinformatics
Birla Institute of Scientific Research
Statue Circle, Jaipur 302001  RJ, India 
Telephone: (work) +91-141-2385094. Extension 308
E mail: prash[AT]bisr[DOT]res[DOT]in
Home page: http://www.bioinformatics.org/wiki/Prash 
Twitter: @prashbio

 

"Each problem that I solved became a rule, which served afterwards to solve other problems."    Rene Descartes

Giri Polavarapu

unread,
Jul 20, 2021, 4:36:00 AM7/20/21
to bioc...@googlegroups.com
Dear Dr. Sharma
Please contact Dr. Rajendran at MKU
9442879993

He will help.
Regards,
Giri

On Tue, Jul 20, 2021 at 12:58 PM Praharshit Sharma <praharsh...@gmail.com> wrote:
Boxbe This message is eligible for Automatic Cleanup! (praharsh...@gmail.com) Add cleanup rule | More info


--
Thank you and best regards,
 
 
Rathnagiri Polavarapu, Ph.D.,

President & CEO

Genomix Biotech Inc
2620 Braithwood Road, Atlanta, GA 30345; USA
 
770-842-9727 (US Cell)
&
Genomix Molecular Diagnostics (P) Ltd,
5-36/207 Prasanthnagar
Kukatpally, Hyderabad 500 072, AP, India;
09849114520 ( India Cell); 040-64581236 (Phone); 770-783-2191 (India Vonage)
Skype: Giri.polavarapu
 

Praharshit Sharma

unread,
Jul 20, 2021, 4:46:31 AM7/20/21
to bioc...@googlegroups.com, gi...@genomixbiotech.com, Prash
Thank you for the information, Professor Giri. Thank you Prash,
I had already contacted Dr Rajendran @ Madurai Kamaraj University, Tamil Nadu regarding BrucellaBase: https://pubmed.ncbi.nlm.nih.gov/27164438/

Shall get in touch with him again, as suggested. Sincerely yours, PS.

Reply all
Reply to author
Forward
0 new messages