Multiz Alignments of 100 Vertebrates Question


Shane Giles

Jun 2, 2016, 5:35:28 PM
to gen...@soe.ucsc.edu
I have a question regarding the "Multiz Alignments of 100 Vertebrates" tracks displayed in the UCSC browser. In particular, I am interested in the files behind the tracks that display the conservation of amino acids across species. Which files from the FTP site are the best to use to obtain this information for a given gene in the genome? I have looked at the following files from …/hg19/database/:

multiz100wayFrames.txt
multiz100waySummary.txt
multiz100way.txt

None of them seems to have the amino acid alignment information. I was able to find the refGene.exonAA.fa.gz file under …/hg19/multiz100way/alignments/. I was disappointed to find that it is not a BED file but a FASTA file. Is this the file the browser uses to display the information? I suspect I will need to reformat the file to get a position-indexed file. If you don't have one, could you please let me know the format of the FASTA header? Here is what I see:

>NM_001918_hg19_2_11 41 0 1 chr1:100706317-100706440-
ICVRYFQTCGNVHVLKPNYVCFFGYPSFKYSHPHHFLKTTA
>NM_001918_panTro4_2_11 41 0 1 chr1:101069219-101069342-
ICVRYFQTCGNVHVLKPNYVCFFGYPSFKYSHPHHFLKTTA
>NM_001918_gorGor3_2_11 41 0 1 chr1:102957120-102957243-
ICVRYFQTCGNVHVLKPNYVCFFGYPSFKYSHPHHFLKTTA
>NM_001918_ponAbe2_2_11 41 0 1 chr1:128258433-128258556+

I could not find a description of the header anywhere. I just want to make sure I am not making any stupid assumptions.

Thanks,
Shane Giles

Cath Tyner

Jun 6, 2016, 8:14:41 PM
to Shane Giles, UCSC Genome Browser Public Help Forum
Hello Shane,

Thank you for using the UCSC Genome Browser and for inquiring about obtaining amino acid conservation data across species.

The files that you would like to download are not available; the amino acid sequences displayed in the Genome Browser Multiz alignments are generated on the fly using information from the multizNwayFrames table.

In this previously answered mailing list question, the question is asked:

I want alignment only for the specific amino acid which I am interested at. The full gene alignment is too much amino acid to look for. I want something like the following but instead of nucleotide it has to be amino acid.

The answer:

You can use the NCBI protein multiple alignment tool, COBALT, to find common protein sequences:
http://www.ncbi.nlm.nih.gov/tools/cobalt/cobalt.cgi

You also asked about the format of the FASTA header, which is described here:

http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html#FASTA

Explanation of CDS FASTA header format:
  * Whole gene format: geneName_assemblyName peptideLength location
  * Exon format: geneName_assemblyName_exonNum_totalExons exonLength inFrame outFrame location
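As an illustration of the exon header format above, here is a minimal Python sketch that splits one header line into named fields and writes a BED-style line. The names ExonHeader, parse_exon_header and to_bed_line are made up for this example and are not part of any UCSC tool. The sketch assumes that the location coordinates are 1-based and inclusive (as in Genome Browser position strings) and that assembly names contain no underscores; please check both against your data before relying on it.

```python
import re
from typing import NamedTuple


class ExonHeader(NamedTuple):
    """Fields of one exon-format CDS FASTA header (hypothetical container)."""
    gene: str
    assembly: str
    exon_num: int
    total_exons: int
    exon_length: int
    in_frame: int
    out_frame: int
    chrom: str
    start: int
    end: int
    strand: str


# location looks like chr1:100706317-100706440- (strand is the last character)
_LOCATION = re.compile(r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)(?P<strand>[+-])$")


def parse_exon_header(line: str) -> ExonHeader:
    """Parse '>gene_assembly_exonNum_totalExons exonLength inFrame outFrame location'."""
    name, length, in_frame, out_frame, location = line.strip().lstrip(">").split()
    # Gene names such as NM_001918 contain underscores, so split from the right.
    # Assumption: the assembly name itself has no underscore.
    gene, assembly, exon_num, total_exons = name.rsplit("_", 3)
    match = _LOCATION.match(location)
    if match is None:
        raise ValueError(f"unrecognized location field: {location!r}")
    return ExonHeader(
        gene, assembly, int(exon_num), int(total_exons),
        int(length), int(in_frame), int(out_frame),
        match["chrom"], int(match["start"]), int(match["end"]), match["strand"],
    )


def to_bed_line(h: ExonHeader) -> str:
    """Format a header as a 6-column BED line.

    Assumption: header coordinates are 1-based inclusive, so the BED start
    (0-based, half-open) is start - 1.
    """
    name = f"{h.gene}_{h.assembly}_{h.exon_num}_{h.total_exons}"
    return "\t".join([h.chrom, str(h.start - 1), str(h.end), name, "0", h.strand])


header = parse_exon_header(">NM_001918_hg19_2_11 41 0 1 chr1:100706317-100706440-")
print(header)
print(to_bed_line(header))
```

In the records quoted in this thread, the exonLength field matches the number of residues on the sequence line (41 residues for a 124-base span), so in the AA file it appears to count amino acids rather than bases.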

Of possible interest:
Thank you again for your inquiry and for using the UCSC Genome Browser.
Please send new and follow-up questions to one of our UCSC Genome Browser mailing lists below:

  * Post to the Public Help Forum: Email gen...@soe.ucsc.edu or search the Public Archives
  * Post to the Mirror Help Forum: Email genome...@soe.ucsc.edu or search the Mirror Archives
  * Confidential/private data help: Email genom...@soe.ucsc.edu

UCSC Genome Browser Announcements List (email alerts for new data & software):
  * Subscribe: Email genome-annou...@soe.ucsc.edu 
  * Unsubscribe: Email genome-announ...@soe.ucsc.edu

Please respond to this list if you have further questions.

Enjoy,
Cath
Cath Tyner
UCSC Genome Browser, Software QA & User Support
UC Santa Cruz Genomics Institute



Shane Giles

Jun 17, 2016, 6:42:30 PM
to gen...@soe.ucsc.edu
I have a few questions regarding the "Multiz Alignments of 100 Vertebrates" tracks displayed in the UCSC browser. I see that the file date for the FASTA files is "2/14/14". Will UCSC be updating these anytime soon? Is there an update schedule?

Is there any possibility of getting the full RefSeq IDs used in the generation of the file? The IDs listed in the FASTA file are version-less.

Lastly, I noticed a few discrepancies between the FASTA file and the data presented in the browser. Here is the human entry for the first exon of NM_000348 (SRD5A2):

>NM_000348_hg19_1_5 93 0 1 chr2:31805690-31805969-
MQVQCQQSPVLAGSATLVALGALALYVAKPPATGSTRRAZSRRLPACQPAPPGSCRSCLPSRCPRGSSPGSPSPSSGHLGRYFWASSAYITST

The peptide sequence in the FASTA file matches what is shown in the browser for the first 29 amino acid residues. After that, none of the residues match. It appears there is a two-base insertion in the reference that throws the residues in the FASTA file out of phase. I don't see anything in the refGene.exonAA.fa file that would let me know this. What does the browser use to know it needs to fix a frame shift?

Thanks in advance for your reply.

Shane


Brian Lee

Jun 23, 2016, 12:51:48 PM
to Shane Giles, gen...@soe.ucsc.edu

Dear Shane,

Thank you for using the UCSC Genome Browser and for sharing the details around the NM_000348 (SRD5A2) entry in the refGene.exonAA.fa.gz file.

We plan to update this file and will include a method to also have a versioned RefSeq ID such as NM_000348.3. Our engineers explain that there is a one base deletion in the reference with respect to the mRNA that has caused this issue in the generation of the earlier file for the exon lines for this gene.

The rebuild process would correct the mapped AA sequences by creating two separate exons. An example of what it should look like can be seen by clicking the "CDS FASTA alignment" from multiple alignment link for RefSeq Gene SRD5A2, then checking the box "Separate into exons" and narrowing the species selection:

>NM_000348_hg19_1_6 30 0 2 chr2:31805881-31805969-
MQVQCQQSPVLAGSATLVALGALALYVAKP

>NM_000348_hg19_2_6 64 0 2 chr2:31805690-31805880-
SGYGKHTESLKPAATRLPARAAWFLQELPSFAVPAGILARQPLSLFGPPGTVLLGLFCLHYFHR

Versus the issue reported in refGene.exonAA.fa.gz:

>NM_000348_hg19_1_5 93 0 1 chr2:31805690-31805969-
MQVQCQQSPVLAGSATLVALGALALYVAKPPATGSTRRAZSRRLPACQPAPPGSCRSCLPSRCPRGSSPGSPSPSSGHLGRYFWASSAYITST
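To see where the old record goes out of phase, one can compare it with the two rebuilt exon records joined together. The following is only a sketch using the sequences quoted above; shared_prefix_length is a helper name made up for this example, not something the browser uses.

```python
def shared_prefix_length(a: str, b: str) -> int:
    """Return how many leading residues two peptide strings have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


# Old single-exon record from refGene.exonAA.fa.gz (NM_000348_hg19_1_5)
old_exon = (
    "MQVQCQQSPVLAGSATLVALGALALYVAKPPATGSTRRAZSRRLPACQPAPPGSCRSCLPSRC"
    "PRGSSPGSPSPSSGHLGRYFWASSAYITST"
)

# Rebuilt records (NM_000348_hg19_1_6 and NM_000348_hg19_2_6)
rebuilt_exon_1 = "MQVQCQQSPVLAGSATLVALGALALYVAKP"
rebuilt_exon_2 = "SGYGKHTESLKPAATRLPARAAWFLQELPSFAVPAGILARQPLSLFGPPGTVLLGLFCLHYFHR"
rebuilt = rebuilt_exon_1 + rebuilt_exon_2

shared = shared_prefix_length(old_exon, rebuilt)
print("residues in common:", shared)
print("old continues with:    ", old_exon[shared:shared + 10])
print("rebuilt continues with:", rebuilt[shared:shared + 10])
```

For these two records the peptides diverge exactly where the rebuilt first exon ends, which is consistent with the frame shift being handled by splitting the exon at that point.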


I will send a follow-up email when the file has been updated at http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz100way/alignments/refGene.exonAA.fa.gz

Thank you again for your message and helping improve the UCSC Genome Browser. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

All the best,

Brian Lee
UCSC Genomics Institute




Brian Lee

Jun 28, 2016, 1:40:11 PM
to Shane Giles, gen...@soe.ucsc.edu
Dear Shane,

We have updated the refGene.exonAA.fa.gz and refGene.exonNuc.fa.gz files available at: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz100way/alignments/

Thank you again for your message and helping improve the UCSC Genome Browser. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

All the best,

Brian Lee
UCSC Genomics Institute