UCSC hg38 datasets


Vorechovsky I.

Feb 16, 2017, 10:12:56 AM
to gen...@soe.ucsc.edu

Hi there

 

We tried to download all human introns from your table “known genes” (hg38). Their total size exceeded the size of the human genome and their number was 562,218, also well above expectations. The number of exons was 629,924. Not sure how to reconcile these figures with published data.

 

I would be most grateful for step-by-step advice on how to obtain non-redundant sequences (and corresponding BED files) of currently known human exons.

 

Thank you.

Dr. I. Vorechovsky

Principal Research Fellow

 

University of Southampton

HDH, MP808

Tremona Road

Southampton SO16 6YD

United Kingdom

 

Tel. +44 (0) 2381 206425

Fax +44 (0) 2381 204264

Email: ig...@soton.ac.uk

 

Cath Tyner

Feb 22, 2017, 1:43:01 PM
to Vorechovsky I., gen...@soe.ucsc.edu
Hello Dr. I. Vorechovsky,

Thank you for using the UCSC Genome Browser and for inquiring about the best approach to find genome-wide annotations. Our support team has identified various approaches which can assist you, but I would first like to clarify your question so that we can provide the best method.

To address your initial attempt, summarized as:
  • downloaded all human introns from the table “known genes” (hg38)
  • number of introns = 562,218, their total size exceeded the size of the human genome
  • number of exons = 629,924.

I believe that the query described above included annotations for all transcripts. Because there are multiple transcripts per gene (due to alternative splicing, etc.), a query for all introns (or all exons) returns the introns and exons of every transcript. This explains why the total intron length exceeded the size of the genome and why the counts were so high: alternate isoforms contribute many redundant, overlapping annotations.

We can assist you in finding non-redundant annotations. There are a few approaches to this.

Method 1. We can filter the transcripts from the knownGene table to the subset contained in the knownCanonical table. The knownCanonical table generally includes 1 transcript per gene, and thus is sometimes used as a non-redundant set. You can read about the knownCanonical set on the GENCODE v24 track description page:
http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&g=knownGene

knownCanonical: This set identifies the canonical isoform of each cluster ID or gene using the ENSEMBL gene IDs to define each cluster. The canonical transcript is chosen using the APPRIS principal transcript when available. If no APPRIS tag exists for any transcript associated with the cluster, then a transcript in the BASIC set is chosen. If no BASIC transcript exists, then the longest isoform is used.
For example, if counting exons and introns for the knownCanonical set, we would ignore all other transcripts:
KC ########------------##############---------------#######
Total exons = 3
Total introns = 2
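
The selection order quoted above can be written down as a small decision rule. The following Python sketch is only an illustration: the Transcript class, the tag strings "appris_principal" and "basic", and the use of length to break ties among equally tagged transcripts are assumptions made for this example, not the actual UCSC build code.

```python
from dataclasses import dataclass, field


@dataclass
class Transcript:
    # Hypothetical record: a transcript name, its length, and its tags.
    name: str
    length: int
    tags: set = field(default_factory=set)


def pick_canonical(cluster):
    """Pick one transcript per cluster: APPRIS first, then BASIC, then longest."""
    for tag in ("appris_principal", "basic"):
        tagged = [t for t in cluster if tag in t.tags]
        if tagged:
            # Assumption: the longest transcript wins among equally tagged ones.
            return max(tagged, key=lambda t: t.length)
    # No tagged transcript at all: fall back to the longest isoform.
    return max(cluster, key=lambda t: t.length)


cluster = [
    Transcript("txA", 3000),
    Transcript("txB", 2000, {"basic"}),
    Transcript("txC", 1500, {"appris_principal", "basic"}),
]
chosen = pick_canonical(cluster)
```

With this toy cluster the shortest transcript is chosen because it carries the APPRIS tag, which shows how a widely expressed exon in a longer isoform can be absent from the canonical set.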

Method 2. Another approach is to include all transcripts for knownGene, but don't count overlapping regions more than once.

For example, if counting exons and introns for the knownGene set while counting overlapping regions only once, a region shared by several transcripts (TX) contributes a single exon or intron:

TX1 ########------------##############---------------#######
TX2 ########------------##############-----####------#######    
Total exons = 4
Total introns = 3
(In your original query, you would have counted 7 exons and 5 introns for the example above).
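
The two ways of counting can be reproduced on the toy example. This is a minimal Python sketch, assuming coordinates read off the ASCII diagram (one character = one base, half-open intervals); the function names are hypothetical and the merging step is only a stand-in for what a tool would do genome-wide.

```python
def merge(intervals):
    """Collapse overlapping (or touching) half-open intervals into covered regions."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def introns_of(exons):
    """Return the gaps between consecutive exons."""
    exons = sorted(exons)
    return [(a_end, b_start) for (_, a_end), (b_start, _) in zip(exons, exons[1:])]


# Exon coordinates taken from the TX1 / TX2 diagram above.
tx1 = [(0, 8), (20, 34), (49, 56)]
tx2 = [(0, 8), (20, 34), (39, 43), (49, 56)]

# Per-transcript counting (the original query): every isoform counts again.
naive_exons = len(tx1) + len(tx2)
naive_introns = len(introns_of(tx1)) + len(introns_of(tx2))

# Non-redundant counting: merge exons first, then take the gaps between them.
unique_exons = merge(tx1 + tx2)
unique_introns = introns_of(unique_exons)

print(naive_exons, naive_introns, len(unique_exons), len(unique_introns))
```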

Can you respond to this mailing list and clarify the following:

1. Which method would you like to use?
2. Do you need instructions to get a total count and also bed files for introns, exons, or both?


  * Post to the Public Help Forum: Email gen...@soe.ucsc.edu or search the Public Archives
  * Post to the Mirror Help Forum: Email genome...@soe.ucsc.edu or search the Mirror Archives
  * Confidential/private help: Email genom...@soe.ucsc.edu

UCSC Genome Browser Announcements List (email alerts for new data & software):
  * Subscribe: Email genome-announce+subscribe@soe.ucsc.edu 
  * Unsubscribe: Email genome-announce+unsubscribe@soe.ucsc.edu

Join us on Social Media! Facebook, Twitter, Wordpress Blog, YouTube

​Enjoy,​
Cath
. . .
Cath Tyner
UCSC Genome Browser, Software QA & User Support
UC Santa Cruz Genomics Institute



Cath Tyner

Mar 7, 2017, 1:45:38 PM
to UCSC Genome Browser Public Help Forum
Hello again Dr. I. Vorechovsky​,

Thank you again for your patience! As I'm sure you know, there is no one "right answer" to your question, and answers will vary based on methods and gene sets that are used. Our support team has tried and discussed several options, and I will include our suggested methods here (which have been revised since my last response to you). This is a long response, but you can read the introductions to each method and select your preference. Below are 3 methods for finding exonic regions, and 1 method for finding introns. 

  Finding ​Exons Only  

For the three exon methods below, you can click on this session to view the regions from the BED files for all three methods. This will help illustrate the differences between them. Note that the session also contains a custom track called "introns", which is hidden by default (view=hide); you can show that track to see the introns from the "Intron Method 1" BED file.

Exon Method 1: Use the Table Browser to query for exons of knownGene transcripts limited to the knownCanonical subset.
The count of exons for this method is 279,162.

The knownGene table includes the same transcripts as the "comprehensive set" (wgEncodeGencodeCompV24). We will then filter those transcripts, obtaining only those found in the knownCanonical table. In some regions there may be overlapping knownCanonical transcripts, so some duplicate exons may appear in the output (see Exon Method 3 below to remove the duplicates from a local file).
 
clade: Mammal, genome: Human, assembly: hg38
group: Genes and Gene Predictions, track: GENCODE v24
table: knownGene
region: genome

filter: click "create," scroll down to the section "Linked Tables" and check the checkbox for "hg38. knownCanonical." Next, scroll down and click the button, "allow filtering using fields in checked tables." Then, at the filter page, under the section, "hg38.knownCanonical based filters," type a "1" (without the quotes) into the field for "Free-form query" and click "submit."

output format: BED (or, here you can select "sequence")
output file: leave blank to see in browser, or type in a file name to download.
 
click the button: get output
click the radio button, Create one BED record per: "Exons Plus" 
click the button, "get BED."
 

​Exon Method 2: ​Use a command-line utility "featureBits" to obtain non-redundant exonic regions from knownGene. 

featureBits collapses a table's overlapping items down into covered regions, discarding the identities of items. If featureBits is given two table names, then it finds the regions covered by both tables. It also has many command line options for more advanced usage, but in this case the query can be simple. This method results in a higher exon count, because we are including all non-redundant exons from knownGene - more transcripts are involved, thus more exons in unique regions. 

After clicking on your OS, you can download the "featureBits" utility. 

In order to use featureBits, you will also need to set up a user-read-only ~/.hg.conf file that points to genome-mysql.
See this resource: Downloading data using MySQL, specifically the section "Using the MySQL server with our utilities".
To see the usage statement, simply type "featureBits" on the command line.

You can use the following commands:

% featureBits hg38 knownGene -bed=exon.bed
125648934 bases of 3049335806 (4.121%) in intersection
The count of exons for this method is 303,387.

To intersect with knownCanonical:
% featureBits hg38 knownGene knownCanonical -bed=exonAndCanonical.bed
118515854 bases of 3049335806 (3.887%) in intersection
The count of exons for this method is 291,616.
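
To make the "bases ... in intersection" lines easier to read, here is a minimal Python sketch of intersecting two sets of covered regions and reporting the covered fraction. It is not the featureBits implementation; the function names and the toy regions are hypothetical, and both inputs are assumed to be sorted, non-overlapping, half-open intervals on a single chromosome.

```python
def intersect(a, b):
    """Intersect two sorted lists of non-overlapping half-open intervals."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def covered_bases(regions):
    """Total number of bases covered by a list of half-open intervals."""
    return sum(end - start for start, end in regions)


def percent(bases, genome_size):
    """Covered fraction as a percentage, rounded to three decimals."""
    return round(100.0 * bases / genome_size, 3)


table_a = [(0, 8), (20, 34), (39, 43), (49, 56)]  # toy regions
table_b = [(5, 25), (40, 60)]                     # toy regions
both = intersect(table_a, table_b)
print(both, covered_bases(both))

# The percentages reported above follow from the base counts:
print(percent(125648934, 3049335806), percent(118515854, 3049335806))
```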

Exon Method 3: Use a script to sort the bed file created in "Exon Method 1" (created by the filter with knownCanonical) by position and remove exons with identical start and end​. 
The count of exons for this method is 275,119​.​

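From the command line, save your file (exons = 279,162) from "Exon Method 1" and name it "exonsWithDups.txt". You can then paste the following script into your command shell. As written, with a trailing backslash on every line, the script appears to target csh/tcsh; in bash or sh, keep the backslash after the sort command but delete the ones inside the single-quoted Perl code. Let me know if this doesn't work for you; if needed, this can be converted to a bash script (*.sh).


sort -k1,1 -k2n,2n -k3n,3n exonsWithDups.txt \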
| perl -we 'while(<>) { \
              @row = split; \
              if (/^#/ || /^\s*$/) { \
                print; # pass through comment or blank lines \
              } elsif (! @prevRow) { \
                # first row in file -- save for comparison with next row \
                @prevRow = @row; \
              } elsif ($row[0] ne $prevRow[0] || \
                       $row[1] != $prevRow[1] || \
                       $row[2] != $prevRow[2]) { \
                # this row is not a duplicate -- print it out \
                print join("\t", @prevRow) . "\n"; \
                @prevRow = @row; \
              } \
            } \
            if (@prevRow) { \
              # last non-duplicate row in file \
              print join("\t", @prevRow) . "\n"; \
            }' 
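
For readers who would rather avoid shell quoting, the same sort-and-deduplicate idea can be sketched in Python. This is an illustrative equivalent, not a tested drop-in replacement: the function name is hypothetical, comment and blank lines are skipped rather than passed through, and chromosome names are sorted as plain strings.

```python
def dedup_bed(lines):
    """Sort BED lines by chrom/start/end and keep one row per identical region."""
    rows = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # unlike the Perl script, drop comments and blank lines
        fields = line.split()
        rows.append((fields[0], int(fields[1]), int(fields[2]), fields))

    # Stable sort: among identical regions, the first row in the input wins.
    rows.sort(key=lambda r: r[:3])

    out, prev = [], None
    for chrom, start, end, fields in rows:
        if (chrom, start, end) != prev:
            out.append("\t".join(fields))
            prev = (chrom, start, end)
    return out


example = [
    "chr1\t10\t20\texA\t0\t+",
    "chr1\t5\t8\texB\t0\t+",
    "chr1\t10\t20\texC\t0\t+",
    "# comment",
    "",
]
unique_rows = dedup_bed(example)
```

To run it on the downloaded file, something like `dedup_bed(open("exonsWithDups.txt"))` would do, writing the returned lines to a new file.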



  Finding ​Introns​ Only  

Intron Method 1​:​ Use the Table Browser to query for ​introns only in the knownGene table, filtered by knownCanonical transcripts.
The count of introns for this method is 229,214.

Follow the same procedure as for Exon Method 1 above, except at this step:
click the radio button, Create one BED record per: "Exons Plus"
where you will instead select "Introns Plus."
Note: The script to remove duplicate regions (in Exon Method 3) can also be used for this intron bed file (reducing introns to 225,261).

Please respond to this list if you have further questions!

Thank you again for your inquiry and for using the UCSC Genome Browser.

Cath

Cath Tyner

Mar 10, 2017, 3:35:51 PM
to Vorechovsky I., UCSC Genome Browser Public Help Forum
H​i​ Dr. I. Vorechovsky​,

​You can bypass APPRIS by not restricting the exon set to knownCanonical. ​To do this, you can get all exons from knownGene​ (without filtering for the transcripts in knownCanonical)​, and​ you can then​ post-process to keep the set that suits your needs. 

​I would like to note that if you feel that APPRIS tags are not applied correctly, it would be beneficial to the community if you could ​contact the GENCODE/APPRIS authors​ with your concern. 

For more information about the GENCODE Basic set:
As previously mentioned, there is some information on our track description page, and t​he "Methods" section describes the criteria used for including a transcript in the GENCODE Basic set. ​

GENCODE describes this Basic annotation set as:

What is the "basic" annotation in the GTF/GFF3?
http://www.gencodegenes.org/faq.html

The transcripts tagged as "basic" form part of a subset of representative transcripts for each gene. This subset prioritises full-length protein coding transcripts over partial or non-protein coding transcripts within the same gene, and intends to highlight those transcripts that will be useful to the majority of users.

The GENCODE track description page also contains information on the selection criteria used for the Basic annotation set. For more details, it would be best to contact GENCODE support.


You also asked: "Finally, would you be able to adjust the method 3 script to extract exon sequences together with flanking 100 nt of intronic sequences on each side? The ideal output would be: 275,119 FASTA contiguous genomic sequences, each containing 100 nt upstream intronic sequences (in lower case) followed by EXON SEQUENCE (UPPER case) followed by downstream intron (lc)."

This goal is probably achievable with a scripted solution, but non-trivial scripting advice is beyond the scope of the support forum.

It would be possible for a script to post-process the fasta to lowercase the first and last 100 bases. One issue to consider is accounting for exons that are at/near the beginning or ends of chrom/alt sequences - in those cases it's not possible to get 100 bases of padding. ​For example, see the 6 regions below, which were found by filtering the custom track "scriptSorted" to keep items with chromStart < 100. You can see the regions below in this session. There may also be exons at/near the ends of some chroms/alts​ (which would require a different filtering method):​

#chrom    chromStart    chromEnd    name    score    strand
chr5_KI270791v1_alt    0    155    uc063ldn.1_exon_0_0_chr5_KI270791v1_alt_1_r    0    -
chr6_KI270798v1_alt    0    2668    uc064aqv.1_exon_0_0_chr6_KI270798v1_alt_1_f    0    +
chr7_KI270806v1_alt    67    202    uc284qgx.1_exon_0_0_chr7_KI270806v1_alt_68_f    0    +
chr12_GL383553v2_alt    0    112    uc058vpx.1_exon_0_0_chr12_GL383553v2_alt_1_f    0    +
chr17_JH159146v1_alt    67    128    uc060mkv.1_exon_0_0_chr17_JH159146v1_alt_68_f    0    +
chr17_KI270861v1_alt    0    5793    uc032gmt.3_exon_0_0_chr17_KI270861v1_alt_1_r    0    -
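
A check for both ends of a sequence could look like the following Python sketch. It assumes BED-style half-open coordinates and a dictionary of chromosome sizes; the function name, the toy chromosome "chrT" and its size are made up for illustration. Note that "left" and "right" here are genomic directions, not 5' and 3'.

```python
def short_padding(bed_rows, chrom_sizes, pad=100):
    """Return (name, left, right) for items that cannot get full padding."""
    flagged = []
    for chrom, start, end, name in bed_rows:
        left = min(pad, start)                        # bases available before chromStart
        right = min(pad, chrom_sizes[chrom] - end)    # bases available after chromEnd
        if left < pad or right < pad:
            flagged.append((name, left, right))
    return flagged


chrom_sizes = {"chrT": 1000}  # hypothetical sequence length
rows = [
    ("chrT", 0, 155, "a"),
    ("chrT", 67, 202, "b"),
    ("chrT", 500, 600, "c"),
    ("chrT", 950, 990, "d"),
]
flagged = short_padding(rows, chrom_sizes)
```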

The Table Browser sequence output does specify, in the FASTA header, how many padding bases were actually added; for example, see the "pad=0" and "pad=67" parts of these headers:

>hg38_ct_scriptSorted_2749_uc063ldn.1_exon_0_0_chr5_KI270791v1_alt_1_r range=chr5_KI270791v1_alt:1-255 5'pad=100 3'pad=0 strand=- repeatMasking=none
>hg38_ct_scriptSorted_2749_uc064aqv.1_exon_0_0_chr6_KI270798v1_alt_1_f range=chr6_KI270798v1_alt:1-2768 5'pad=0 3'pad=100 strand=+ repeatMasking=none
>hg38_ct_scriptSorted_2749_uc284qgx.1_exon_0_0_chr7_KI270806v1_alt_68_f range=chr7_KI270806v1_alt:1-302 5'pad=67 3'pad=100 strand=+ repeatMasking=none
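
The padding counts in headers like these are enough to drive the lowercasing. Below is a minimal Python sketch under two assumptions that should be verified on real output: each header contains both a 5'pad and a 3'pad value, and the sequence is printed so that the 5' padding comes first. The function names are hypothetical.

```python
import re

PAD_RE = re.compile(r"5'pad=(\d+)\s+3'pad=(\d+)")


def recase(header, seq):
    """Lowercase the padding bases and uppercase the exon, using the header counts."""
    match = PAD_RE.search(header)
    if match is None:
        raise ValueError("no pad fields in header: " + header)
    pad5, pad3 = int(match.group(1)), int(match.group(2))
    end = len(seq) - pad3
    return seq[:pad5].lower() + seq[pad5:end].upper() + seq[end:].lower()


def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


toy = [
    ">ex1 range=chrT:1-10 5'pad=2 3'pad=3 strand=+ repeatMasking=none",
    "ACGTA",
    "CGTAC",
]
records = [(h, recase(h, s)) for h, s in read_fasta(toy)]
```

Because the counts come from the header, items near the start or end of a sequence (pad=0 or pad=67 above) are handled without special cases.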

​In summary,​ a script could get the number of padding bases from the header and then lowercase accordingly.​ If that sounds like the best solution for you, I advise finding scripting support within your institution if possible.

Please respond to this list if you have further questions!

Thank you again for your inquiry and for using the UCSC Genome Browser.

Cath


On Thu, Mar 9, 2017 at 2:23 AM, Vorechovsky I. <ig...@soton.ac.uk> wrote:

Dear Cath,

 

Thank you so much.

 

I attach a browser shot showing an example of an alternatively spliced exon in the U2AF1 gene (denoted by a red arrow). This exon was not picked up by any of the three methods, yet the exon is widely expressed, with mean exon inclusion levels of ~30% (roughly 30% of mRNAs). Obviously, the problem is here:

 

knownCanonical identifies the canonical isoform of each cluster ID or gene using the ENSEMBL gene IDs to define each cluster. The canonical transcript is chosen using the APPRIS principal transcript when available. If no APPRIS tag exists for any transcript associated with the cluster, then a transcript in the BASIC set is chosen. If no BASIC transcript exists, then the longest isoform is used

-----

Can you bypass APPRIS? How exactly do you define a ‘BASIC’ set?

 

Nevertheless, the overall accuracy is not bad, we will get there!

 

Finally, would you be able to adjust the method 3 script to extract exon sequences together with flanking 100 nt of intronic sequences on each side? The ideal output would be: 275,119 FASTA contiguous genomic sequences, each containing  100 nt upstream intronic sequences (in lower case) followed by EXON SEQUENCE (UPPER case) followed by downstream intron (lc).

 

Thank you again.

 

Best wishes

 

Dr I Vorechovsky

 

 

 

 

 

 

 

 
