Simple sequence repeats and your UCSC protein alignments

raffaele iennaco

Jan 31, 2014, 11:57:22 AM
to gen...@soe.ucsc.edu, Pauline Fujita

Dear UCSC staff,

I am a researcher in the Cattaneo Lab in Italy.
We have seen the new 100-species conservation track published on the UCSC
Genome Browser. While investigating CAG repeats in protein-coding regions
such as Huntingtin (HTT) or Androgen Receptor (AR) in different taxa, we
realized that it is possible to obtain CDSs in a very elegant and fast way
using your alignment and the UCSC query. However, while most of the protein
region is generally conserved among vertebrates and so is very well
represented in the alignment, the CDSs are truncated in the polyQ region in
a manner that we know does not represent the correct number of repeats for a
given species. This is probably due to the alignment process itself, which
does not treat repeats, or at least parts of them, as conserved sites.
Thus, we were wondering whether it is possible to use your dataset to obtain
the missing polyQ portion of the exon in an automated fashion, and whether
you could help us achieve this.

All the very best,

Raffaele 


Raffaele Iennaco, Dr

Department of Biosciences and Centre for Stem Cell Research – University of Milano

Via Viotti 3/5

20133 Milano (Italy) 

email: raffaele...@unimi.it

www.cattaneolab.it

Jonathan Casper

Feb 5, 2014, 1:28:05 PM
to raffaele iennaco, gen...@soe.ucsc.edu, Pauline Fujita

Hello Raffaele,

Thank you for your question about polyQ regions in the exons of different species. Unfortunately, you are right: the alignments we provide are not a good way to get sequence in repetitive parts of the genome such as polyQ regions.

We suggest instead that you make a list of the coordinates of the exons containing your polyQ regions in the human genome. You can then use our liftOver tool (http://genome.ucsc.edu/cgi-bin/hgLiftOver) to find the coordinates of similar regions in other species. Using those coordinates, you can then get sequence from the other species and look for CAG repeats.

For example, the first exon of the HTT gene in the human hg19 assembly has the coordinates chr4:3,076,407-3,076,815. You can use the following steps to obtain sequence from the region most closely aligned to that exon in the rat Mar. 2012 (RGSC 5.0/rn5) assembly.

1. Open the UCSC LiftOver tool by visiting http://genome.ucsc.edu/cgi-bin/hgLiftOver (also available by going to the Tools menu of the UCSC Genome Browser and choosing LiftOver).
2. Select Human Feb. 2009 (GRCh37/hg19) as the original genome/assembly, and select Rat Mar. 2012 (RGSC 5.0/rn5) as the new genome/assembly.
3. Enter the coordinates chr4:3,076,407-3,076,815 into the data box.
4. Click Submit.
5. After some processing, the results will be displayed on the page as a link labeled "View Conversions". The link goes to a BED file. Download and open the BED file.
6. The BED file contains the following coordinates in the rn5 genome: chr14:81941727-81942079.
7. Open the UCSC Genome Browser Gateway page at http://genome.ucsc.edu/cgi-bin/hgGateway, select the Rat Mar. 2012 (RGSC 5.0/rn5) assembly, and enter those coordinates. Click Submit.
8. The browser will now display that region of the rat rn5 genome. To obtain the sequence for that region, go to the View menu and choose DNA.
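
If you have many exons to convert, it may be easier to upload them to LiftOver as a BED file than to paste positions one at a time. The following is only a minimal sketch of that preparation step: the function names are hypothetical and are not part of any UCSC tool. It assumes the usual convention that browser positions such as chr4:3,076,407-3,076,815 are 1-based and inclusive, while BED intervals are 0-based and half-open.

```python
import re


def position_to_bed(position):
    """Convert a browser position like 'chr4:3,076,407-3,076,815'
    (1-based, inclusive) to a BED interval (0-based, half-open)."""
    match = re.fullmatch(r"([^:\s]+):([\d,]+)-([\d,]+)", position.strip())
    if match is None:
        raise ValueError(f"unrecognized position: {position!r}")
    chrom = match.group(1)
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    if start < 1 or end < start:
        raise ValueError(f"invalid coordinates in: {position!r}")
    # Only the start shifts by one; the end stays the same.
    return chrom, start - 1, end


def bed_line(position):
    """Format a browser position as one tab-separated BED line."""
    chrom, start, end = position_to_bed(position)
    return f"{chrom}\t{start}\t{end}"


def region_length(position):
    """Number of bases covered by a browser position."""
    _, start, end = position_to_bed(position)
    return end - start


if __name__ == "__main__":
    human_exon = "chr4:3,076,407-3,076,815"   # hg19, from the example above
    rat_region = "chr14:81941727-81942079"    # rn5, from step 6
    print(bed_line(human_exon))
    print(region_length(human_exon), region_length(rat_region))
```

Comparing the lengths of the original and converted regions is a quick sanity check: a large difference between species in an otherwise conserved exon may point to a difference in repeat length, or to a partial conversion that is worth inspecting in the browser.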

Note that the HTT gene appears on the - strand of the rat genome, so you will need to check both strands for polyQ data.
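
Once you have the DNA for a region, checking both strands can be scripted. The sketch below is one simple way to do it, not a UCSC utility: the function names are hypothetical, and it assumes the sequence has already been pasted in as a plain string of A, C, G, and T (other characters, such as N, are left unchanged by the complement step). It counts the longest uninterrupted run of CAG on the sequence as given and on its reverse complement.

```python
import re

# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_repeat(seq, unit="CAG"):
    """Return the number of copies of `unit` in the longest
    uninterrupted tandem run found in `seq` (case-insensitive)."""
    runs = re.findall(f"(?:{re.escape(unit)})+", seq.upper())
    return max((len(run) // len(unit) for run in runs), default=0)


def longest_repeat_both_strands(seq, unit="CAG"):
    """Report the longest run of `unit` on each strand of `seq`."""
    return {
        "+": longest_repeat(seq, unit),
        "-": longest_repeat(reverse_complement(seq), unit),
    }


if __name__ == "__main__":
    # Toy sequence, not real genomic data: four CTG units on the
    # given strand read as four CAG units on the opposite strand.
    toy = "AACTGCTGCTGCTGTT"
    print(longest_repeat_both_strands(toy))
```

Under this simple definition, any interruption ends a run. CAA also encodes glutamine, so a polyQ tract containing CAA codons will be reported as shorter than the protein-level repeat; translating the sequence and counting Q residues would be the alternative if that matters for your analysis.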

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. Questions sent to that address will be archived in a publicly accessible forum for the benefit of other users. If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.

--
Jonathan Casper
UCSC Genome Bioinformatics Group


