hg38 RefSeq

Elisheva Javasky

Apr 27, 2020, 3:27:09 PM

to gen...@soe.ucsc.edu

Hi,

Using the Table Browser, I downloaded the RefSeq exon coordinates for hg38 (Genes and Gene Predictions, NCBI RefSeq) to use as a target file for variant calling on exome data. I noticed that the RefSeq exons in hg38 (after removing the alt, random, fix, and chrUn regions) cover double the number of bp that the hg19 RefSeq exons cover, and ~30% of the hg38 file is not covered at all by the exome data. I downloaded the curated RefSeq set just to check whether it was any different and got the same results. Do you know why this is? Is the hg38 RefSeq exome really double the size of the hg19 one?

Any help would be greatly appreciated.

Thank you,

Elisheva Javasky

Brian Lee

Apr 30, 2020, 12:46:32 PM

to Elisheva Javasky, UCSC Genome Browser Mailing List

Dear Elisheva,

Thank you for using the UCSC Genome Browser and your question about the differences between hg19 RefSeq exons and hg38 RefSeq exons.

The short answer is that the hg19 ncbiRefSeq table does not include predicted transcripts, so it has far fewer annotations. The accessions of these predicted transcripts start with the letter X, so adding the condition name not like 'X%' to a MySQL query on the hg38 and hg19 databases excludes them and gives transcript counts suited to an apples-to-apples comparison:

$ mysql --user=genome --host=genome-mysql.soe.ucsc.edu -NAe "select count(*) from hg19.ncbiRefSeq where chrom not like '%Un%' and chrom not like '%random%' and chrom not like '%fix%' and chrom not like '%alt%' and chrom not like '%hap%' and name not like 'X%';" 
+-------+
| 70414 |
+-------+

$ mysql --user=genome --host=genome-mysql.soe.ucsc.edu -NAe "select count(*) from hg38.ncbiRefSeq where chrom not like '%Un%' and chrom not like '%random%' and chrom not like '%fix%' and chrom not like '%alt%' and chrom not like '%hap%' and name not like 'X%';" 
+-------+
| 70661 |
+-------+
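
For a local copy of the table, the same exclusions can be approximated with awk. This is only a minimal sketch: it assumes a hypothetical two-column, tab-separated input (accession, then chromosome), the function name filter_primary_curated is made up for illustration, and the accessions in the toy input are invented.

```shell
# Hypothetical helper: keep non-predicted accessions on primary chromosomes.
# Assumed input: tab-separated lines of "accession<TAB>chromosome".
filter_primary_curated() {
  awk -F'\t' '
    $2 ~ /Un|random|fix|alt|hap/ { next }   # skip chrUn, random, fix, alt, and hap sequences
    $1 ~ /^X/ { next }                      # skip predicted accessions (XM_, XR_)
    { print }
  '
}

# Toy input with made-up accessions; only the first and last rows should survive.
printf '%s\t%s\n' \
  NM_000001 chr1 \
  XM_000002 chr1 \
  NR_000003 chr1_KI270706v1_random \
  NM_000004 chrUn_GL000195v1 \
  NR_000005 chrX |
  filter_primary_curated
```

The chromosome test is a plain substring match, mirroring the '%...%' patterns in the queries above, so it would also drop any sequence whose name merely contains one of those strings.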

A better choice than ncbiRefSeq would be ncbiRefSeqCurated, which in essence had the above exclusion of X-named items applied when the table was built (note that the query below does not include name not like 'X%'). The ncbiRefSeqCurated set includes only those annotations whose accessions begin with NM, NR, NP or YP (NP and YP are used only for protein-coding genes on the mitochondrion; YP is used for human only).

$ mysql --user=genome --host=genome-mysql.soe.ucsc.edu -NAe "select count(*) from hg38.ncbiRefSeqCurated where chrom not like '%Un%' and chrom not like '%random%' and chrom not like '%fix%' and chrom not like '%alt%' and chrom not like '%hap%';" 
+-------+
| 70661 |
+-------+
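
To see which accession types a downloaded set actually contains, the prefixes can be tallied. The following is a rough sketch, assuming a hypothetical tab-separated input whose first column is the accession; count_by_prefix is an illustrative name, not a UCSC tool, and the toy accessions are invented.

```shell
# Hypothetical helper: count accessions by the prefix before the first underscore.
# Assumed input: tab-separated lines whose first column is the accession.
count_by_prefix() {
  awk -F'\t' '
    { split($1, parts, "_"); count[parts[1]]++ }       # "NM_000001" -> "NM"
    END { for (p in count) print p "\t" count[p] }
  ' | LC_ALL=C sort
}

# Toy input with made-up accessions.
printf '%s\n' NM_000001 NM_000002 NR_000003 XM_000004 XM_000005 XR_000006 |
  count_by_prefix
```

Any XM or XR rows in the tally would indicate that predicted transcripts are still present in the file.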

Note that for hg19, ncbiRefSeq and ncbiRefSeqCurated give the same count, since predicted items are not included in the hg19 ncbiRefSeq table.

$ mysql --user=genome --host=genome-mysql.soe.ucsc.edu -NAe "select count(*) from hg19.ncbiRefSeqCurated where chrom not like '%Un%' and chrom not like '%random%' and chrom not like '%fix%' and chrom not like '%alt%' and chrom not like '%hap%';" 
+-------+
| 70414 |
+-------+

$ mysql --user=genome --host=genome-mysql.soe.ucsc.edu -NAe "select count(*) from hg19.ncbiRefSeq where chrom not like '%Un%' and chrom not like '%random%' and chrom not like '%fix%' and chrom not like '%alt%' and chrom not like '%hap%';" 
+-------+
| 70414 |
+-------+

Part of the explanation for why hg19 does not include these items is that XM_ and XR_ annotations are predicted by Gnomon, a gene predictor (https://www.ncbi.nlm.nih.gov/genome/annotation_euk/gnomon/), and for hg19 it was reported that NCBI cannot run the process without the EST and mRNA alignments and all the other inputs Gnomon requires.

Thank you again for your inquiry and for using the UCSC Genome Browser. If you have any further public questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

All the best,

--

---
You received this message because you are subscribed to the Google Groups "UCSC Genome Browser Public Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to genome+un...@soe.ucsc.edu.
To view this discussion on the web visit https://groups.google.com/a/soe.ucsc.edu/d/msgid/genome/CALfxY1B7FtQpJb4YT0Cd8CeWCiXVYjQUsdfF01BCAS%2BEQ9R_iQ%40mail.gmail.com.

--

Brian Lee, QA Manager

UCSC Genome Browser - UC Santa Cruz Genomics Institute


Elisheva Javasky

May 1, 2020, 12:37:01 PM

to gen...@soe.ucsc.edu

Hi,

I am interested in using RefSeq exon coordinates as a target file for variant calling on WES data. I noticed that the hg38 exon coordinates cover a lot more of the genome than hg19, and sent you an email asking why that would be.

Brian Lee (please thank Brian for his help) explained that this may be because the hg38 set includes XM and XR transcripts, and that excluding these by using the curated RefSeq set should solve the problem. After downloading the RefSeq curated set and extracting the exonic coordinates (excluding all random, fix, alt, and chrUn sequences), I am still seeing that the hg38 exome covers ~80 million bp, which is approximately double what the hg19 RefSeq file covers.

Do you know why this would be the case? Approximately 30% of these regions are not covered at all by my WES data so I am just trying to understand what exactly is included in the set.

Thank you for your help,

Elisheva Javasky

Matthew Speir

May 1, 2020, 7:40:01 PM

to Elisheva Javasky, gen...@soe.ucsc.edu

Hello, Elisheva.

Thank you for your question about RefSeq data in the UCSC Genome Browser.

Can you tell us how you are calculating this figure of 80 million bases? Using our command-line tool featureBits, we can calculate the number of bases covered by items in the RefSeq Curated track. This tool accounts for the fact that a gene can have multiple transcripts and counts a base covered by several transcripts only once. Using this tool on both assemblies, we see a much smaller difference of 1,416,602 bases:

$ featureBits hg19 ncbiRefSeqCurated
93720294 bases of 2991710746 (3.133%) in intersection

$ featureBits hg38 ncbiRefSeqCurated
95136896 bases of 3110768607 (3.058%) in intersection
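
The effect of counting each base only once can be illustrated with a small sketch. This is not featureBits itself, just a toy comparison under stated assumptions: the input is a hypothetical tab-separated BED-like file (chrom, start, end), and the names naive_bases and merged_bases are made up for illustration.

```shell
# Hypothetical helper: sum interval lengths, counting overlapping bases repeatedly.
naive_bases() {
  awk -F'\t' '{ total += $3 - $2 } END { print total + 0 }'
}

# Hypothetical helper: merge overlapping intervals first, so each base counts once.
merged_bases() {
  LC_ALL=C sort -k1,1 -k2,2n | awk -F'\t' '
    # Start a new merged block on the first line, a new chromosome, or a gap.
    NR == 1 || $1 != chrom || $2 + 0 > end {
      if (NR > 1) total += end - start
      chrom = $1; start = $2 + 0; end = $3 + 0
      next
    }
    # Otherwise the interval overlaps the current block; extend it if needed.
    $3 + 0 > end { end = $3 + 0 }
    END { if (NR > 0) total += end - start; print total + 0 }
  '
}

# Toy exons: two overlapping intervals on chr1 and one on chr2.
toy_bed=$(printf '%s\t%s\t%s\n' chr1 100 200 chr1 150 250 chr2 0 50)
printf '%s\n' "$toy_bed" | naive_bases    # prints 250
printf '%s\n' "$toy_bed" | merged_bases   # prints 200
```

Summing exon lengths across every transcript of a gene, as naive_bases does, can greatly inflate a total relative to the merged count.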

Thank you!

-----

Matthew Speir

UCSC Cell Browser, Quality Assurance and Data Wrangler

Human Cell Atlas, User Experience Researcher

UCSC Genome Browser, User Support

UC Santa Cruz Genomics Institute

Revealing life’s code.



To view this discussion on the web visit https://groups.google.com/a/soe.ucsc.edu/d/msgid/genome/CALfxY1CC0a9N%3DWoaJ4hLhpSfTOk0eKLtiVOr4ybLcBSjcJ41wA%40mail.gmail.com.

Elisheva Javasky

May 4, 2020, 12:22:02 PM

to Matthew Speir, gen...@soe.ucsc.edu

Thank you so much for your help.

Now I do see that the two RefSeq Curated sets are similar in size. It seems I was comparing the hg38 RefSeq Curated set to an older hg19 set that I have of gene-coding regions.

If I am interested in looking just at gene-coding regions in hg38, what would you say is the best set to use? Are all of the exonic regions in the RefSeq Curated set coding regions? It seems that 30% of these regions (just the exonic regions, excluding the alt, random, fix, etc. sequences) are not covered by my exome sequencing data, so I am trying to figure out which regions in the RefSeq Curated set are not actually coding regions.

Thank you again,

Elisheva Javasky

Daniel Schmelter

May 6, 2020, 8:54:13 PM

to Elisheva Javasky, Matthew Speir, UCSC Genome Browser Discussion List

Hello Elisheva,

We are glad you resolved the differences in coverage between your two RefSeq data sets.

As you probably know, whole-exome sequencing covers all of the exons (including UTRs) and not necessarily just the coding regions. The exonic regions in RefSeq Curated include UTRs. We have a few different resources for coding sequences, but you can certainly use Table Browser's BED format filter options to obtain a BED file of the coding regions of every RefSeq Curated item. This requires you to select the output format as BED, hit submit, and then select the "Coding Exons" selection bubble. This will remove the UTR regions and non-coding RNAs from the output and leave you with a list of coding regions in BED format. You can go to this link to have those settings pre-filled:
http://genome.ucsc.edu/cgi-bin/hgTables?hgS_doOtherUser=submit&hgS_otherUserName=dschmelt&hgS_otherUserSessionName=RefSeqCur_CodingBED
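
For anyone working from a downloaded table instead, the idea behind keeping only coding exons can be sketched as clipping each exon to the transcript's CDS range. This is a rough illustration and not the Table Browser's implementation. It assumes a hypothetical tab-separated, genePred-like input with the columns name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds (no leading bin column), and it assumes non-coding transcripts are marked by cdsStart equal to cdsEnd. The function name coding_exons and the toy records are made up.

```shell
# Hypothetical helper: print the coding part of each exon as BED (chrom, start, end, name).
# Assumed columns: name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds
coding_exons() {
  awk -F'\t' '{
    cdsStart = $6 + 0; cdsEnd = $7 + 0
    if (cdsStart >= cdsEnd) next             # no CDS: treat as a non-coding transcript
    n = split($9, starts, ","); split($10, ends, ",")
    for (i = 1; i <= n; i++) {
      if (starts[i] == "") continue          # the lists end with a trailing comma
      s = starts[i] + 0; e = ends[i] + 0
      if (s < cdsStart) s = cdsStart         # trim UTR before the CDS
      if (e > cdsEnd) e = cdsEnd             # trim UTR after the CDS
      if (s < e) print $2 "\t" s "\t" e "\t" $1
    }
  }'
}

# Toy records: one coding transcript with UTRs and one non-coding transcript.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  NM_TOY1 chr1 + 100 1000 250 800 3 100,400,700, 300,500,1000, \
  NR_TOY2 chr1 + 2000 2500 2500 2500 1 2000, 2500, |
  coding_exons
```

On the toy input, the three exons of NM_TOY1 are trimmed to 250-300, 400-500, and 700-800, and NR_TOY2 produces no output.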

You may also be interested in hg38's Gencode v32 track, based on the knownGene table. This dataset contains splicing variants, so it will contain some redundant entries. This may be more desirable if you want to survey a wider set of regions. You can use the same BED coding filter above to get coding regions.

I hope this was helpful. If you have any more questions, please reply-all to gen...@soe.ucsc.edu. All messages sent to that address are publicly archived. If your question includes sensitive data, please reply-all to genom...@soe.ucsc.edu.
All the best,

Daniel Schmelter
UCSC Genome Browser

To view this discussion on the web visit https://groups.google.com/a/soe.ucsc.edu/d/msgid/genome/CALfxY1Aoyj16R1E6J31nZCCrfBxL7fQd%2BCK6NQnXWDpcwEhipQ%40mail.gmail.com.
