mRNA sequence

Doron Lemze

Feb 8, 2016, 11:14:55 AM

to gen...@soe.ucsc.edu

Dear UCSC Browser personnel,

First, thank you for this tool and data!

Second, I'm working with the mm10 assembly and I have two small questions.

1) When I look at Pck1 (uc008odn.1), it looks like an additional "a" is added to the mRNA sequence at the end. Why is that?

2) For 3110001I22Rik (uc007yge.2), the mRNA sequence looks partial. The beginning seems to be missing. Is this true?

Many thanks,

Doron

Cath Tyner

Feb 11, 2016, 2:25:29 PM

to Doron Lemze, UCSC Genome Browser Public Help Forum

Hello Doron,

Thank you for using the UCSC Genome Browser and for submitting your question regarding mRNA sequence differences. Please see answers below to your two questions:

1) Because poly-A tails vary in length and can be quite long, they are always trimmed down to two bases in the browser. In the case of transcript uc008odn.1, three "A's" have been trimmed to our standard two-A poly-A tail.

2) UCSC Genes are built using a multi-step pipeline that takes various kinds of evidence into account for annotations. You can read more about the methods here:
http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=mm10&g=knownGene

In this case, the UCSC Genes pipeline started with the RefSeq mRNA NM_025653. It used information from other RefSeq mRNAs and ESTs at this position to adjust its prediction. The final predicted transcript, uc007yge.2, included a slightly longer 5' UTR than the starting mRNA. However, when you request the mRNA for this transcript from UCSC Genes, it provides you with the unchanged starting sequence of NM_025653.

Thank you again for your inquiry and for using the UCSC Genome Browser. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Enjoy,

Cath

Cath Tyner

UC Santa Cruz Genomics Institute

UCSC Genome Browser: Public Help Forum, Suggestions, Contact

Cath Tyner

Feb 26, 2016, 1:57:14 PM

to Doron Lemze, UCSC Genome Browser Public Help Forum

On Thu, Feb 25, 2016, Doron Lemze wrote:

Dear Cath,

Thank you very much for your reply and sorry for my late one. Please look at my questions/comments below your answers.

1) Because poly-A tails vary in length and can be quite long, they are always trimmed down to two bases in the browser. In the case of transcript uc008odn.1, three "A's" have been trimmed to our standard two-A poly-A tail.

Ok, I understand that the poly-A tail is trimmed to two bases. However, it is still not clear to me why in one case there are 2 "A's" and in the other 3 "A's". The following part of the sentence is not clear to me: "three "A's" have been trimmed to our standard two-A poly-A tail." What did you mean?
In addition, if the poly-A tail is always trimmed to two (or three?) bases, how come some transcripts have a much longer poly-A chain? For example, uc007hdm.2 has 26 "A's".

2) UCSC Genes are built using a multi-step pipeline that takes various kinds of evidence into account for annotations. You can read more about the methods here:
http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=mm10&g=knownGene
In this case, the UCSC Genes pipeline started with the RefSeq mRNA NM_025653. It used information from other RefSeq mRNAs and ESTs at this position to adjust its prediction. The final predicted transcript, uc007yge.2, included a slightly longer 5' UTR than the starting mRNA. However, when you request the mRNA for this transcript from UCSC Genes, it provides you with the unchanged starting sequence of NM_025653.

I think I understood this, but it is very confusing that the sum of the exon lengths is not the same as the length of the transcript. What do you think?
Thanks!
Doron

Question 1

Ok, I understand that the poly-A tail is trimmed to two bases. However, it is still not clear to me why in one case there are 2 "A's" and in the other 3 "A's". The following part of the sentence is not clear to me: "three "A's" have been trimmed to our standard two-A poly-A tail." What did you mean?

My apologies for the confusing first response regarding the trimmed Poly-A tails! Please disregard my original response about this. I now have more information from one of our engineers - clarification below:

In this example (uc008odn.1), as seen in "Link 1" (below), the mRNA simply had a 3-A poly-A tail when it was sequenced. That poly-A tail is reflected correctly in "Link 1." When we align this mRNA to the genome, part of our alignment process involves trimming the poly-A tail down to the last two "A's" in order to find the best alignment in the genome. Poly-A tails generally don't appear in the DNA sequence alignments (biologically, poly-A tails are added after transcription, i.e. they generally aren't in the genome), so the tail is trimmed before alignment so that our alignment methods produce the best match, which yields the DNA sequence seen in "Link 2" below. If the poly-A tail were not trimmed before being aligned to the genome, RNAs with long poly-A tails might not align properly, as our process would look for those matching A's in the genome, risking misalignment.

"Link 1" (below) displays whatever poly-A tail the mRNA had when it was sequenced, and that tail length is not trimmed in this output. The tail is only trimmed as part of our alignment methods, to prepare it for the best match in the genome.

"Link 2" (below) displays the predicted sequence alignment to the genome, based on the mRNA (the best sequence match prior to the poly-A tail addition during RNA processing).

Example:
From the "Mouse Gene Pck1 (uc008odn.1) Description and Page Index" - navigate to this by: 1) from the mm10 browser, search for "uc008odn.1" and then 2) click on the highlighted transcript in the browser.
On that page, under the section "Sequence and Links to Tools and Databases" you can click on the following two links:

Link 1: "mRNA (may differ from genome)" - this sequence ends with "AAA" (same as RefSeq NM_011044: http://www.ncbi.nlm.nih.gov/nuccore/NM_011044?report=GenBank). This is the actual mRNA poly-A tail; no trimming was done. This output always reflects the actual poly-A tail length for that particular transcript, without trimming.

Link 2: "Genomic Sequence (chr2:173,153,073-173,159,253)" > Press Submit - this sequence ends with "AA", but this does not reflect a poly-A tail, as poly-A tails generally don't appear in the DNA sequence alignments. In order to map the mRNA to the genome, the poly-A tail is trimmed so that we can retain the best alignment. Thus, the tail trimming is only part of our alignment methods, and is not seen in the output.

So in this case, the alignment would look like this:

- last part of mRNA with full poly-A tail:
...tacaaaataaa

- RNA with trimmed poly-A tail, used to align to genome:

...tacaaaataa

- aligns to the genome here:

...TACAAAATAA
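
As a rough illustration of the trimming step described above, here is a minimal Python sketch. It is not the actual UCSC alignment code; the function name `trim_poly_a` and the `keep` parameter are made up for this example, and it only assumes what is stated above (a run of trailing A's is cut down to its last two bases before alignment).

```python
def trim_poly_a(seq: str, keep: int = 2) -> str:
    """Cut a trailing run of A's down to at most `keep` bases.

    Hypothetical helper for illustration only; it mimics the idea of
    trimming a poly-A tail before aligning an mRNA to the genome.
    """
    # Remove every trailing 'a'/'A' to find where the tail starts.
    body = seq.rstrip("aA")
    tail_len = len(seq) - len(body)
    # Tails that are already short enough are left untouched.
    if tail_len <= keep:
        return seq
    # Keep only `keep` bases of the original tail (preserving case).
    return body + seq[len(body):len(body) + keep]


mrna_end = "tacaaaataaa"            # last part of the mRNA, 3-A tail
trimmed = trim_poly_a(mrna_end)     # sequence used for alignment
print(trimmed)
print(trimmed.upper() == "TACAAAATAA")  # compare with the genomic end
```

The same helper applied to a transcript ending in a long run of A's (such as the 26 "A's" mentioned for uc007hdm.2) would also leave two trailing A's for alignment, while the "mRNA" link would still show the full tail.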

Question 2

In this case, the UCSC Genes pipeline started with the RefSeq mRNA NM_025653. It used information from other RefSeq mRNAs and ESTs at this position to adjust its prediction. The final predicted transcript, uc007yge.2, included a slightly longer 5' UTR than the starting mRNA. However, when you request the mRNA for this transcript from UCSC Genes, it provides you with the unchanged starting sequence of NM_025653.

I think I understood this, but it is very confusing that the sum of the exon lengths is not the same as the length of the transcript. What do you think?

To elaborate on the previous answer, take a look at the following browser view (screenshot below). The important consideration is that UCSC Genes is based on multiple criteria for making an alignment prediction, and based on those criteria, there are degrees of variation.

You have pointed out a case where the curated RefSeq mRNA sequence (NM_025653) does not match the UCSC Genes algorithm-predicted transcript (uc007yge.2). Because RefSeq provides us with manually curated, evidence-based mRNA annotations and sequences, and because UCSC Genes transcripts are generated differently via algorithm prediction, the two differently generated transcripts will not always match. If other transcripts in that same region have a longer sequence, UCSC Genes considers this as evidence toward choosing the longer length to display; thus uc007yge.2 has more sequence (based on our alignment methods) than RefSeq's shorter NM_025653. In the attached screenshot, you can see such "evidence" (other RefSeq transcripts and Mouse ESTs) in the blue rectangles for the longer sequence chosen for UCSC Genes transcript uc007yge.2.

RefSeq provides multiple transcripts for the same gene that differ from one another; these are isoforms, arising from alternative splicing. If a region is variable, having a single reference sequence can be problematic. Setting the "RefSeq Genes" track to "full" will display the other 2 transcripts, which are different from the shorter NM_025653 that you have observed.
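
To make the length mismatch concrete, here is a small Python sketch that sums exon lengths from genePred-style `exonStarts`/`exonEnds` lists (comma-separated, 0-based starts, exclusive ends) and compares the total with the length of an mRNA sequence. The coordinates and the mRNA length below are invented for illustration and are not the real values for uc007yge.2 or NM_025653; the function name `exon_length_sum` is also hypothetical.

```python
def exon_length_sum(exon_starts: str, exon_ends: str) -> int:
    """Sum exon lengths from comma-separated start/end lists.

    Assumes genePred-style coordinates: 0-based starts, exclusive ends,
    with an optional trailing comma in each list.
    """
    starts = [int(x) for x in exon_starts.split(",") if x]
    ends = [int(x) for x in exon_ends.split(",") if x]
    if len(starts) != len(ends):
        raise ValueError("exonStarts and exonEnds must have the same count")
    return sum(end - start for start, end in zip(starts, ends))


# Made-up annotation: three exons of 150, 120 and 100 bases.
annotated_len = exon_length_sum("100,500,900,", "250,620,1000,")

# Made-up length of the source mRNA returned by the "mRNA" link.
mrna_len = 340

# A positive difference means the annotated transcript covers more
# genomic sequence (for example, a longer 5' UTR) than the mRNA.
print(annotated_len, mrna_len, annotated_len - mrna_len)
```

In this toy case the annotation is 30 bases longer than the mRNA, which is the kind of difference you see when the predicted transcript has been extended beyond the starting RefSeq sequence.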

Inline image 1

Enjoy,

Cath

On Thu, Feb 25, 2016 at 12:31 PM, Cath Tyner <ca...@ucsc.edu> wrote:

Hi Doron,

Can you please post your follow-up question to our mailing list, gen...@soe.ucsc.edu? That way we can provide the best answer from our team. In addition, others with similar questions in the future will be able to search and find the answer. Once the question is posted there, our team looks forward to providing an answer!

Cath

