some questions about UCSC Genes track, knownGene table


seirana.hashemi

Feb 13, 2015, 2:47:12 PM
to gen...@soe.ucsc.edu

Hello,

Dear Sir/Madam,

I emailed you before with some questions about GRCh37; your answers were very useful, and I really appreciate them. I now have some further questions:

The knownGene table of the UCSC Genes track has txStart and txEnd columns. What are these? There are also cdsStart and cdsEnd columns; are they abbreviations for "start of coding region" and "end of coding region"? Another question: how can I convert minus-strand coordinates to the plus strand? I have to work only on the plus strand. Is this correct: chromosome size - txStart on the minus strand = txStart on the plus strand?

Looking forward to receiving your kind reply at your earliest convenience.

Yours sincerely,

 Seirana Hashemi

Steve Heitner

Feb 13, 2015, 5:43:51 PM
to seirana.hashemi, gen...@soe.ucsc.edu

Hello, Seirana.

Thank you for your kind words.  We certainly always try to be as helpful as possible.  :)

The txStart and txEnd fields indicate the start and end of transcription, which includes both coding and non-coding regions.  As you properly identified, the cdsStart and cdsEnd fields indicate the start and end of the coding regions only.

Note that even transcripts on the - strand have their txStart, txEnd, cdsStart and cdsEnd fields defined in terms of the + strand.  If your goal is to get everything in terms of + strand coordinates, then most of the work is already done for you.  The only important thing to understand is that because everything is defined in terms of the + strand, what we list as txStart for - strand items is actually the end of transcription and what we list as txEnd for - strand items is actually the start of transcription.  It is organized in this manner for purposes of drawing the items in our display.
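
As a minimal sketch of the point above (the function name is hypothetical, not part of any UCSC tool), the biological start and end of transcription can be read from a knownGene row by swapping the two fields for - strand items. The values stay in + strand coordinates, so no "chromosome size minus position" conversion is applied. The sketch only swaps the stored values; it does not adjust for the 0-based start convention used in the database tables.

```python
def transcription_span(strand, tx_start, tx_end):
    """Return (start_of_transcription, end_of_transcription).

    Both values remain in + strand coordinates exactly as stored in the
    table; for - strand items the two fields are simply swapped.
    """
    if strand == "+":
        return tx_start, tx_end
    if strand == "-":
        # On the - strand, transcription begins at txEnd and ends at txStart.
        return tx_end, tx_start
    raise ValueError(f"unexpected strand: {strand!r}")


# Example with a - strand row taken from this thread.
start, end = transcription_span("-", 1138887, 1142089)
print(start, end)
```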

Please contact us again at gen...@soe.ucsc.edu if you have any further questions. 
All messages sent to that address are archived on a publicly-accessible Google Groups forum.  If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

---
Steve Heitner
UCSC Genome Bioinformatics Group

--

Steve Heitner

Feb 18, 2015, 5:12:14 PM
to seirana.hashemi, gen...@soe.ucsc.edu

Hello, Seirana.

Many of the genes in our gene tracks have multiple transcript variants. The variants belong to the same gene, which is why the start and stop coordinates repeat, but each variant has its own unique identifier (the contents of the “name” column).  In your Table Browser query, if you select “selected fields from primary and related tables” as your output type and then include hg19.kgXref.geneSymbol as one of your output fields, you will see that many gene symbols actually have several transcript variants.
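
To see the repetition directly, here is a small sketch that groups the three rows quoted in the original question by their coordinates. The function name and the tuple layout are assumptions made for illustration. Sharing a span only suggests that rows are variants of one gene; the gene symbol from hg19.kgXref.geneSymbol is the more reliable way to confirm it.

```python
from collections import defaultdict

# (name, chrom, strand, txStart, txEnd) for the rows quoted in the question.
rows = [
    ("NM_004195", "chr1", "-", 1138887, 1142089),
    ("NM_148902", "chr1", "-", 1138887, 1142089),
    ("NM_148901", "chr1", "-", 1138887, 1142089),
]


def group_by_span(rows):
    """Map each (chrom, strand, txStart, txEnd) to the transcript names sharing it."""
    groups = defaultdict(list)
    for name, chrom, strand, tx_start, tx_end in rows:
        groups[(chrom, strand, tx_start, tx_end)].append(name)
    return dict(groups)


for span, names in group_by_span(rows).items():
    print(span, names)
```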

For an explanation of the meaning of some of the columns, on the “table” line of the Table Browser, there is a “describe table schema” button which will often answer questions such as these.  In cases where it does not, viewing a track’s description page will often also describe what some of the columns represent.  In this case, some of these questions remain unanswered, so I will explain them for you.

In UCSC Genes, the “alignId” column is redundant with the “name” column.

In RefSeq Genes, the “exonFrame” column tells you whether an exon starts cleanly with the start of a new codon or continues a codon that was started in a previous exon.  A value of -1 indicates that an exon contains no coding sequence.  A value of 0 indicates that an exon begins with the start of a new codon.  A value of 1 or 2 indicates that an exon begins with the second or third base of a codon that was started in a previous exon.  To illustrate this, look at the hg19 NOX4 gene in the Browser.  The exon coordinates below decrease from exon 3 to exon 4, so this gene is transcribed from right to left in the display.  If you look at chr11:89,182,548-89,182,699 (exon 3), you will see that the leftmost codon (G) is not a full codon; only one of its bases is there.  If you then look at chr11:89,177,296-89,177,447 (exon 4), you will see that the rightmost codon is also not a full codon; only two of its bases are there.  The G that was started at the end of exon 3 is completed at the beginning of exon 4.  The 1 in the exonFrame column for exon 4 indicates that 1 base of the codon that starts this exon was actually contained in the previous exon.  You can also find a discussion of this in the previously-answered mailing list question at https://groups.google.com/a/soe.ucsc.edu/d/msg/genome/U-w4b_ZS2j0/MVog73mS2W0J.
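
The frame convention described above can be sketched in a few lines. This is an illustration under stated assumptions, not the code UCSC uses to build the table: the function name is hypothetical, coordinates are treated as half-open intervals in + strand terms, and the coding region is assumed to be complete with no frameshifts. Values in the real table may differ for transcripts that break those assumptions.

```python
def exon_frames(strand, exon_starts, exon_ends, cds_start, cds_end):
    """Compute one frame value per exon, listed in + strand order.

    -1   : the exon contains no coding bases
    0-2  : number of bases of the exon's first codon that were already
           supplied by earlier exons (earlier in transcription order)
    """
    frames = [-1] * len(exon_starts)
    order = range(len(exon_starts))
    if strand == "-":
        # Transcription runs right to left, so visit exons in reverse.
        order = reversed(order)
    coding_bases = 0  # coding bases seen so far in transcription order
    for i in order:
        start = max(exon_starts[i], cds_start)
        end = min(exon_ends[i], cds_end)
        if start >= end:
            continue  # no overlap with the coding region
        frames[i] = coding_bases % 3
        coding_bases += end - start
    return frames


# Made-up transcript: four exons, coding region from 6 to 45.
starts = [0, 20, 40, 60]
ends = [10, 30, 50, 70]
print(exon_frames("+", starts, ends, 6, 45))
print(exon_frames("-", starts, ends, 6, 45))
```

The coordinates in the usage example are invented to keep the arithmetic easy to follow; they are not taken from NOX4 or any real gene.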

For an explanation of cdsStartStat and cdsEndStat, please see https://groups.google.com/a/soe.ucsc.edu/d/msg/genome/Uz5ozC9vkCQ/1Zl5z8m8ADwJ.

Please contact us again at gen...@soe.ucsc.edu if you have any further questions.  Questions sent to that address will be archived in a publicly-accessible forum for the benefit of other users.  If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.



---
Steve Heitner
UCSC Genome Bioinformatics Group

 

From: seirana.hashemi [mailto:seirana...@ut.ac.ir]
Sent: Sunday, February 15, 2015 3:49 AM
To: st...@soe.ucsc.edu
Subject: RE: [genome] some questions about UCSC Genes track, knownGene table

 

Dear Steve,

Hi,

My background is in computer science, so I have a lot of questions about these data, which is why I am emailing you again.

In the UCSC Genes track, knownGene table:

The first column is “#name”. Are these the names of the genes on each chromosome? If so, why do some genes have the same txStart and txEnd? For example:

#bin  name       chrom  strand  txStart  txEnd    cdsStart  cdsEnd
593   NM_004195  chr1   -       1138887  1142089  1139223   1141951
593   NM_148902  chr1   -       1138887  1142089  1139223   1141951
593   NM_148901  chr1   -       1138887  1142089  1138970   1141951

What are the “proteinID” and “alignID” columns?

In the RefSeq Genes track, refGene table:

What is “exonFrame”?

And in the “cdsStartStat” and “cdsEndStat” columns, what do the values “cmpl”, “incmpl” and “unk” mean?

Best regards,

Seirana 
