Missing some introns in Table Browser Gencode Track

Dilek Cansu Gürer

Feb 23, 2024, 1:20:23 PM
to gen...@soe.ucsc.edu
Hello,

I am using the Table Browser to get an intron BED file for GENCODE. I recently realized that some introns of GENCODE genes are missing from the list I retrieved from the Table Browser. Here are the steps I followed to get the intron BED file:

- clade: Mammal
- genome: Human
- assembly: Dec. 2013 (GRCh38/hg38)
- group: Genes and Gene Predictions
- track: All GENCODE V44
- table: Basic (wgEncodeGencodeBasicV44)
- region: genome
- output format: BED

After selecting "get output", I select "Introns plus 0 bases at each end", then hit "get BED". When I performed my analysis, I realized that some intronic regions present in the GENCODE V44 GTF file were absent from this BED file, so I could not analyze them.
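For reference, an intron in this kind of output is the gap between two consecutive exons of the same transcript, so an exon that is absent from the table also removes the introns next to it. A minimal sketch of that relationship, assuming 0-based, half-open BED coordinates; `introns_from_exons` is a hypothetical helper for illustration, not the Table Browser's own code, and the exon coordinates are those reported for ENST00000456328.2 in this thread:

```python
def introns_from_exons(exons):
    """Return intron intervals for one transcript.

    exons: list of (start, end) pairs, 0-based and half-open as in BED.
    Each intron spans from the end of one exon to the start of the next.
    """
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end:  # skip touching or overlapping exons
            introns.append((prev_end, next_start))
    return introns


# Exons of ENST00000456328.2 as given in the Table Browser BED output
exons = [(11868, 12227), (12612, 12721), (13220, 14409)]
print(introns_from_exons(exons))  # [(12227, 12612), (12721, 13220)]
```

A transcript with a single exon yields no introns, which is why single-exon items never appear in an intron BED file.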

After that, I generated a second BED file by selecting table: Comprehensive (wgEncodeGencodeCompV44), but the same introns were still missing.

Then I downloaded the exon BED file to check whether the corresponding exons are present (I assume intron coordinates are derived from the exons), following exactly the same steps as above but selecting "Exons plus 0 bases at each end". The exons flanking my missing introns are also absent from the exon BED file. I then extracted exon locations from the GENCODE V44 GTF file manually and compared them with the exon BED file from the UCSC Table Browser: the missing exons are present in the file I extracted from the original GTF but absent from the one I downloaded from UCSC. Here are the first 10 lines of each file (sorted):

UCSC Table browser Gencode V44 exon BED file after sort -k1,1V -k2,2n -k3,3n

chr1 11868 12227 ENST00000456328.2_exon_0_0_chr1_11869_f 0 +
chr1 12612 12721 ENST00000456328.2_exon_1_0_chr1_12613_f 0 +
chr1 13220 14409 ENST00000456328.2_exon_2_0_chr1_13221_f 0 +
chr1 17368 17436 ENST00000619216.1_exon_0_0_chr1_17369_r 0 -
chr1 29553 30039 ENST00000473358.1_exon_0_0_chr1_29554_f 0 +
chr1 30266 30667 ENST00000469289.1_exon_0_0_chr1_30267_f 0 +
chr1 30365 30503 ENST00000607096.1_exon_0_0_chr1_30366_f 0 +
chr1 30563 30667 ENST00000473358.1_exon_1_0_chr1_30564_f 0 +
chr1 30975 31097 ENST00000473358.1_exon_2_0_chr1_30976_f 0 +
chr1 30975 31109 ENST00000469289.1_exon_1_0_chr1_30976_f 0 +

The exon list that I extracted from the original GTF file using "grep exon gencode.v44.annotation.gtf | cut -f1,4,5,3,7 | sort -k1,1V -k3,3n -k4,4n -u":

chr1 exon 11869 12227 +
chr1 exon 12010 12057 +
chr1 exon 12179 12227 +
chr1 exon 12613 12697 +
chr1 exon 12613 12721 +
chr1 exon 12975 13052 +
chr1 exon 13221 13374 +
chr1 exon 13221 14409 +
chr1 exon 13453 13670 +
chr1 exon 14404 14501 -

I am aware that the BED file from UCSC has zero-based coordinates and the GTF has one-based coordinates, but I can still see that exons are missing from the UCSC file.
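To make the two listings directly comparable, the GTF exon rows can be converted to BED-style coordinates by subtracting 1 from the start and leaving the end unchanged. The sketch below is one way to do that; `gtf_exon_to_bed` is a hypothetical helper, and the attribute column in the example line is a placeholder rather than a real GENCODE record. It tests the feature column (column 3) instead of matching the word "exon" anywhere on the line, because CDS and UTR rows in a GENCODE GTF also carry attributes such as `exon_number`.

```python
def gtf_exon_to_bed(line):
    """Convert one GTF line to (chrom, start, end, strand) in BED coordinates.

    Returns None for comments and for features other than 'exon'.
    """
    if line.startswith("#"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8 or fields[2] != "exon":
        return None
    chrom, start, end, strand = fields[0], int(fields[3]), int(fields[4]), fields[6]
    # GTF is 1-based and inclusive; BED is 0-based with an exclusive end,
    # so only the start shifts by one.
    return (chrom, start - 1, end, strand)


# Placeholder attribute column; coordinates taken from the first GTF row above
example = 'chr1\tHAVANA\texon\t11869\t12227\t.\t+\t.\tgene_id "placeholder";'
print(gtf_exon_to_bed(example))  # ('chr1', 11868, 12227, '+')
```

The result matches the first row of the Table Browser listing (chr1 11868 12227).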

Then I searched the UCSC Genome Browser for the region that I suspected was missing and found that it is also missing from the GENCODE V44 annotation shown in the browser. I am attaching an IGV screenshot containing the intron and exon BED files that I downloaded from the UCSC Table Browser and the exons I extracted from the GTF, and a UCSC Genome Browser screenshot showing the region with the GENCODE V44 annotation. In the IGV screenshot, the aqua blue lines are from the UCSC intron BED file, the pink lines are from the UCSC exon BED file, and the red lines are the manually extracted exons:

IGV_Screenshot.png


UCSC Genome Browser Screenshot.png

In the browser, it says "1 items filtered out" in the GENCODE V44 annotation. Could this be the gene that I am looking for? This example shows the gene with ID ENSG00000227232.5, but it is not the only such case. Is there any way to retrieve the intronic coordinates and intron IDs of these missing genes using the Table Browser?

Thank you very much in advance.
Kind regards.


--
Dilek Cansu GÜRER, PhD Candidate
Izmir Institute of Technology
Department of Molecular Biology and Genetics
Non-coding RNA Lab 
http://ncrna.iyte.edu.tr/
Room K-209
35430 Gulbahce Campus, Urla, Izmir
Phone no: +90 5547989458

Luis Nassar

Feb 26, 2024, 8:10:58 PM
to Dilek Cansu Gürer, gen...@soe.ucsc.edu

Hello,

Thank you for taking the time to write in and for including a detailed example of your query.

The items you see filtered out, or missing, from the hgTracks display image are pseudogenes in the GENCODE v44 track. These are not displayed by default. To enable them, you can right-click the track and configure, or go to the item description page:

clipboard-202402261643-zaetq.png

Those are indeed the ones you are seeing on IGV. As for the Table Browser output, repeating the same query on the Table Browser (hg38 + GENCODEv44 (knownGene table) + output BED + exons0), I do see the items you refer to from the GTF, once you account for the 0 vs 1 based position. Here is an example of the first few lines of the file. I've marked with asterisks the first 3 items that you listed that were missing from your file:

chr1    11868    12227    ENST00000456328.2_exon_0_0_chr1_11869_f    0    +
chr1    12612    12721    ENST00000456328.2_exon_1_0_chr1_12613_f    0    +
chr1    13220    14409    ENST00000456328.2_exon_2_0_chr1_13221_f    0    +
*chr1    12009    12057    ENST00000450305.2_exon_0_0_chr1_12010_f    0    +
*chr1    12178    12227    ENST00000450305.2_exon_1_0_chr1_12179_f    0    +
*chr1    12612    12697    ENST00000450305.2_exon_2_0_chr1_12613_f    0    +
chr1    12974    13052    ENST00000450305.2_exon_3_0_chr1_12975_f    0    +
chr1    13220    13374    ENST00000450305.2_exon_4_0_chr1_13221_f    0    +
chr1    13452    13670    ENST00000450305.2_exon_5_0_chr1_13453_f    0    +
chr1    14403    14501    ENST00000488147.1_exon_10_0_chr1_14404_r    0    -
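The comparison described above can be sketched as a set difference on (chrom, start, end, strand) after shifting the GTF starts to 0-based. This is only an illustration using the ten rows from each listing in this thread; `bed_keys` and `gtf_keys` are hypothetical helper names, and item names are dropped from the BED rows because only the coordinates and strand are compared.

```python
def bed_keys(text):
    """Collect (chrom, start, end, strand) keys from whitespace-separated rows."""
    keys = set()
    for line in text.strip().splitlines():
        chrom, start, end, strand = line.split()
        keys.add((chrom, int(start), int(end), strand))
    return keys


def gtf_keys(text):
    """Same keys from 'chrom feature start end strand' rows (1-based, inclusive)."""
    keys = set()
    for line in text.strip().splitlines():
        chrom, _feature, start, end, strand = line.split()
        keys.add((chrom, int(start) - 1, int(end), strand))
    return keys


# Coordinates and strand from the Table Browser rows listed above
BED_ROWS = """
chr1 11868 12227 +
chr1 12612 12721 +
chr1 13220 14409 +
chr1 12009 12057 +
chr1 12178 12227 +
chr1 12612 12697 +
chr1 12974 13052 +
chr1 13220 13374 +
chr1 13452 13670 +
chr1 14403 14501 -
"""

# Rows extracted from the GTF earlier in the thread
GTF_ROWS = """
chr1 exon 11869 12227 +
chr1 exon 12010 12057 +
chr1 exon 12179 12227 +
chr1 exon 12613 12697 +
chr1 exon 12613 12721 +
chr1 exon 12975 13052 +
chr1 exon 13221 13374 +
chr1 exon 13221 14409 +
chr1 exon 13453 13670 +
chr1 exon 14404 14501 -
"""

missing = gtf_keys(GTF_ROWS) - bed_keys(BED_ROWS)
print(sorted(missing))  # [] : every GTF exon has a match in the BED rows
```

Running the same check against a file that lacks the pseudogene items would list the unmatched exons instead of an empty result.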

It is possible that somewhere in there our pseudogene logic is removing items. If you are still missing these from our output file, could you attach the hgTables exons output file you are getting (or at least just chr1) as well as the steps to generate the file on the Table Browser?

I hope this is helpful. Please include gen...@soe.ucsc.edu in any replies to ensure visibility by the team. All messages sent to that address are archived on our public forum. If your question includes sensitive information, you may send it instead to genom...@soe.ucsc.edu.

Lou Nassar
UCSC Genomics Institute


Dilek Cansu Gürer

Feb 27, 2024, 12:06:57 PM
to Luis Nassar, gen...@soe.ucsc.edu
Hi Lou,

Thanks a lot for your detailed explanation! Yes, when I enabled pseudogenes in the Genome Browser, the Table Browser output for the knownGene table included the pseudogenes as well. I had no idea about that before, since I don't have much experience with UCSC tracks. For the previous GENCODE versions (I also needed version 40), I separately downloaded the comprehensive and pseudogene tables, since the knownGene table is not available for them, and now it seems I have all I need.

Best,
Cansu.

On Tue, Feb 27, 2024 at 04:10, Luis Nassar <lrna...@ucsc.edu> wrote: