Using bedtools, How to calculate gene wise coverage?

263 views
Skip to first unread message

Smriti Arora

unread,
Jun 12, 2023, 10:31:14 AM6/12/23
to aqui...@genetics.utah.edu, bedtools...@googlegroups.com
Warm greetings

Sir Aaron

My name is Smriti Arora and I am a Data Analyst cum Immunology researcher at Division of Immunology and Infectious Disease Biology, a lab at Council of Scientific and Industrial Research - Institute of Genomics and Integrative Biology, Delhi, India. Our lab performs whole genome and single cell transcriptomic analysis on infectious diseases. 

Fascinated by a paper "A draft human pangenome reference" published in Nature (2023). Attaching a link to save your time.  https://doi.org/10.1038/s41586-023-05896-x and a relative interest of our lab, and particularly mine to dig into transcripts coverage fraction to check for 1. if they transcript coverage percentages may have an impact on disease severity and other is to comment on should RNA-seq data be used for such general purpose... I mean my interest lied in exploring how far I can go with transcript coverage data. In short, the research question is how much of the gene body is covered by our sequencing reads... by what coverage percentages?  

Among several papers including one mentioned above, I came across bedtools being specifically employed for such purposes. They have given your reference too there and I have also found you being highly active in commenting on queries related to bedtools, Unfortunately, I couldn't use it fully to solve my question. and here I am to seek guidance from its creator. 

I have sample wise .bam files (generated after pair-end sequencing & mapping), one .bed file and one .gtf file. Both can be taken as reference. I have taken .gtf as it has annotations like gene id, name, transcript id, name, other features info. I have first sorted my .bam file and indexed it and got .bam.bai file. Next, I sorted the .bed file. I converted sample1_sorted.bam to sample1.bed. Now as I wanted to look for gene-wise coverage, I employed bedtools coverage option. I want to see the coverage of features in sample1.bed file on features in .gtf file which has annotations too. (annotations were not there like gene name etc in ref.bed file so i didn't use it further) I did an intersection after the coverage was found. I thought intersection might help me get the gene annotations instead of reads and for particularly those features or intervals that got overlapped. So, now my expectations were to get coverage and other info only of particular intervals (overlapped features) with GENE NAMES/GENE=-IDs but to my surprise I still got reads that were in my .bam file and not gene.IDs. Why is it so? Also, why are almost all reads repeated twice/5x/or even more than 15x. (highlighted in yellow and green in "what the output has come like")

I expected outcome to be like:

chr  |      start   |     end              |   geneID   |               feature           | MQ | strand | N.OFs | depth  |   MB   |   coverage
1       1457724 1457875   ENSG00000WXYZ   gene/transcript/exon     3         +           6             2         151     1.0000000
1       1457724 1457875   ENSG00000WXYZ   gene/transcript/exon     3         +          15            2         151     1.0000000
1       1473372 1473496   ENSG00000WXYZ   gene/transcript/exon     3         +          15            3         124     1.0000000
1       1473393 1473496   ENSG00000WXYZ   gene/transcript/exon     3         +           5             3         103     1.0000000

but it came out to be like:

chr  |      start   |  end |                          read                                                  | MQ | strand | depth | MB | FL |     coverage
1       1457724 1457875 VH00677:1:AAALHLFM5:1:1303:29327:51963/1        3       +       2       151     151     1.0000000
1       1457724 1457875 VH00677:1:AAALHLFM5:1:1303:29327:51963/2        3       +       2       151     151     1.0000000
1       1473372 1473496 VH00677:1:AAALHLFM5:1:2311:25862:32369/1        3       +       3       124     124     1.0000000
1       1473393 1473496 VH00677:1:AAALHLFM5:1:2311:25862:32369/2        3       +       3       103     103     1.0000000

* MB - mapped bases; MQ - mapping quality; FL - feature length; N.OFs - number of overlapping features

The bash code that I have been employing is:

1. samtools sort sample 1.bam -o sample1_sorted.bam
2. samtools index sample1_sorted.bam
3. sort -k1,1 -k2,2n ref_bed > ref_sort.bed
4. bedtools bamtobed -i sample1_sorted.bam -o sample1.bed
5. bedtools coverage -a sample1.bed -b Homo_sapiens.GRCh38.106.gtf > genecov_sample1.bed
6. bedtools intersect -abam sample1.bed -b Homo_sapiens.GRCh38.106.gtf > int_sample1.bed

(I also tried the -g option here in line 4 but it says no valid entries were found... Is there any specific file format for this option
i used with: bedtools coverage -a sample1.bed -g Homo_sapiens.GRCh38.106.gtf > genecov_sample1.txt)

Queries:
  1. Why am I getting either 1.0000000 coverage or 0.0000000 (and thus 100 or 0%)?
  2. When I view my .bam file in IGV, it shows hardly more than 3 depth coverage (I interpret it as a stack of 3 reads) but in analysis it shows 7, 13, and so on. Though it was nowhere mentioned about the desired headers of the columns, I could interpret it by reading bedtools documentation. 
  3. if -g option can get me the annotations (gene name or IDs in place of reads) then could you also please make me understand why the -g option is not running... where am I applying it wrong?
  4. can't the coverage and -c/-counts and -d option be used together - so to get the coverage fraction, depth of reads and number of overlapping features in one row? this is because when i used -c or -d option, to get the count of overlapping features and depth, it failed to give me coverage as it was giving earlier when i didn't add any of these options 
  5. to get the histograms, you have mentioned to use -hist option but to me, it didn't generate any such output. (I do these analyses in my office server.. shall i change the interface)
  6. for genome coverage also, I am getting interval wise coverage, why so? bedtools genomecov' s algorithm should be calculating (mapped bases* read count)/total length ...right? If not, then what potential difference does the option "bedtools genomecov" and "bedtools coverage" hold?
  7. In my output, I can see repetitions of specific intervals (sometimes, twice  or seven times or even 20 times). Why so? Is it because I have pair end reads in my bam --> converted bed file (input file). Please specify
These are the bedtools sources that I have been reading and trying to implement.  
4. http://quinlanlab.org/tutorials/bedtools/bedtools.htmlThough I have a few queries to you. Hope you will answer me back. 

I am in NO doubt to ask all of these my queries to you. Please consider this email. It might take your good time but I would request you to please guide me.

Waiting for your reply. 

Warm regards
Smriti Arora


Reply all
Reply to author
Forward
0 new messages