refGene cdsStart/end txStart/end exonStarts/end relation

wan...@genetics.ac.cn

Oct 10, 2012, 6:21:31 AM
to gen...@soe.ucsc.edu

Dear Friends, 


I read the table description and checked the FAQs but could not find an answer to my questions.

I noticed that in some refGene records the [cdsStart, cdsEnd] interval is covered entirely by the first exon, even though the transcript includes more than one exon. Is this a rare phenomenon, or have I misunderstood the meaning of the fields? Does it have something to do with the exonFrames field?

I also noticed that cdsStart and cdsEnd can be identical in some records. How should this be interpreted?

My third question: what do the values of the cdsStartStat and cdsEndStat fields mean?

My fourth question is about the score field: what does it mean?

Is there a more detailed and systematic description of the refGene table?

Thank you! 

Best Wishes, 

Yi


Database: hg19    Primary Table: refGene    Row Count: 43,726
Format description: A gene prediction with some additional info.

field         example                         SQL type                               info    description
bin           612                             smallint(5) unsigned                   range   Indexing field to speed chromosome range queries.
name          NM_006781                       varchar(255)                           values  Name of gene (usually transcript_id from GTF)
chrom         chr6_apd_hap1                   varchar(255)                           values  Reference sequence chromosome or scaffold
strand        -                               char(1)                                values  + or - for strand
txStart       3614545                         int(10) unsigned                       range   Transcription start position
txEnd         3654041                         int(10) unsigned                       range   Transcription end position
cdsStart      3614545                         int(10) unsigned                       range   Coding region start
cdsEnd        3653868                         int(10) unsigned                       range   Coding region end
exonCount     14                              int(10) unsigned                       range   Number of exons
exonStarts    3614545,3617947,3618422,361...  longblob                                       Exon start positions
exonEnds      3614566,3617968,3618443,361...  longblob                                       Exon end positions
score         0                               int(11)                                range   score
name2         C6orf10                         varchar(255)                           values  Alternate name (e.g. gene_id from GTF)
cdsStartStat  incmpl                          enum('none', 'unk', 'incmpl', 'cmpl')  values  enum('none','unk','incmpl','cmpl')
cdsEndStat    cmpl                            enum('none', 'unk', 'incmpl', 'cmpl')  values  enum('none','unk','incmpl','cmpl')
exonFrames    1,1,1,1,1,1,1,1,2,1,1,1,1,0,    longblob                                       Exon frame {0,1,2}, or -1 if no frame for exon



Brooke Rhead

Oct 11, 2012, 9:01:17 PM
to wan...@genetics.ac.cn, gen...@soe.ucsc.edu
Hi Yi,

I'll try to answer each of your questions:

> I noticed that the [cdsStart, cdsEnd] interval could be just covered
> by first exon alone in some records in refGene, whose transcript
> actually include more than one exon. I thought this should be a
> rare phenomenon, or I misunderstood the field meaning? Does it have
> something to do with exonFrames field?

It is possible for only the first exon to contain coding sequence; the
other exons can consist entirely of untranslated sequence. The exonFrames
field gives the reading frame (0, 1, or 2) of each exon, or -1 for an
exon that has no coding sequence.
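
To illustrate, here is a minimal Python sketch (not taken from the UCSC
code base) that intersects each exon with the [cdsStart, cdsEnd) interval.
It assumes the 0-based, half-open coordinates used in UCSC database tables;
the function names and the three-exon record are made up for illustration.

```python
def parse_blob(text):
    """Turn a comma-separated refGene list such as '100,300,500,' into ints."""
    return [int(x) for x in text.split(",") if x]


def coding_segments(exon_starts, exon_ends, cds_start, cds_end):
    """Return the coding part of each exon as (start, end), or None if the
    exon contains no coding sequence (it is entirely UTR)."""
    segments = []
    for start, end in zip(exon_starts, exon_ends):
        lo = max(start, cds_start)  # clip the exon to the CDS interval
        hi = min(end, cds_end)
        segments.append((lo, hi) if lo < hi else None)
    return segments


# Hypothetical transcript: three exons, CDS entirely inside the first one.
starts = parse_blob("100,300,500,")
ends = parse_blob("200,400,600,")
segments = coding_segments(starts, ends, 120, 180)
print(segments)
```

Exons reported as None here are the ones that are entirely untranslated.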

> I also noticed that the cdsStart and cdsEnd could actually be the
> same in some records. How to understand this?

This means that the record is for a non-coding gene.
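
As a rough sketch of how one might pick out such records from a
tab-separated dump of the table, the snippet below compares the two
fields directly. The column order follows the field list in your message;
the two sample rows, their names, and the helper function are invented
for illustration and are not real refGene records.

```python
FIELDS = [
    "bin", "name", "chrom", "strand", "txStart", "txEnd", "cdsStart",
    "cdsEnd", "exonCount", "exonStarts", "exonEnds", "score", "name2",
    "cdsStartStat", "cdsEndStat", "exonFrames",
]

# Two made-up rows: the first has cdsStart == cdsEnd, the second does not.
SAMPLE = (
    "585\tNR_EXAMPLE1\tchr1\t+\t1000\t2000\t2000\t2000\t2\t"
    "1000,1500,\t1200,2000,\t0\tGENEA\tunk\tunk\t-1,-1,\n"
    "585\tNM_EXAMPLE2\tchr1\t+\t5000\t9000\t5100\t8900\t2\t"
    "5000,8000,\t6000,9000,\t0\tGENEB\tcmpl\tcmpl\t0,0,\n"
)


def noncoding_names(lines):
    """Return the names of records whose cdsStart equals cdsEnd."""
    names = []
    for line in lines:
        row = dict(zip(FIELDS, line.rstrip("\n").split("\t")))
        if int(row["cdsStart"]) == int(row["cdsEnd"]):
            names.append(row["name"])
    return names


names = noncoding_names(SAMPLE.splitlines())
print(names)
```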

> The third question is what the value of the cdsStartStat and
> cdsEndStat fields mean?

The information comes from the CDS field of the Genbank records (such as
this one: http://www.ncbi.nlm.nih.gov/nuccore/NM_000454?report=GenBank).
Here is an explanation from NCBI of the notation used in the CDS field:
http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord#CDSB
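
In that notation, a "<" in a feature location marks a CDS that is partial
at its 5' end and a ">" marks one that is partial at its 3' end. The
sketch below only detects those two markers in a simple location string;
it does not reproduce how the refGene table is actually built, and the
function name and example locations are assumptions made for
illustration. Reading 'cmpl' as complete and 'incmpl' as incomplete is
likewise my interpretation of the enum labels.

```python
def cds_partial_flags(location):
    """Return (start_partial, end_partial) for a simple GenBank CDS
    location such as '<1..206' or '435..>915'.

    Only the '<' and '>' markers are inspected; complement() and join()
    locations are outside the scope of this sketch.
    """
    start_partial = "<" in location
    end_partial = ">" in location
    return start_partial, end_partial


print(cds_partial_flags("<1..206"))    # partial at the start only
print(cds_partial_flags("435..>915"))  # partial at the end only
print(cds_partial_flags("149..613"))   # neither end marked as partial
```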

> The fourth question is about the score. what does this mean?

The score field is not used in this table.

> Is there more detailed and systematic description about the refGene
> table?

Our documentation of the track is here:
http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=refGene

The data comes from The RefSeq project:
http://www.ncbi.nlm.nih.gov/RefSeq/

I hope this is helpful. If you have further questions for UCSC, please
contact us again at gen...@soe.ucsc.edu.

--
Brooke Rhead
UCSC Genome Bioinformatics Group

