Multiz Alignments of 100 Vertebrates Question

Shane Giles

Jun 2, 2016, 5:35:28 PM

to gen...@soe.ucsc.edu

I have a question regarding the "Multiz Alignments of 100 Vertebrates" tracks displayed in the UCSC browser. In particular, I am interested in the files behind the tracks that display the conservation of amino acids across species. Which files from the FTP site are best to use to obtain this information for a given gene in the genome? I have looked at the following files from …/hg19/database/:

multiz100wayFrames.txt

multiz100waySummary.txt

multiz100way.txt

None of them seems to have the amino acid alignment information. I was able to find the refGene.exonAA.fa.gz file under …/hg19/multiz100way/alignments/. I was disappointed to find out that it is not a BED file but a FASTA file. Is this the file the browser uses to display the information? I suspect I will need to reformat the file to get a position-indexed file. If you don't have such a file, could you please let me know the format of the FASTA header? Here is what I see:

>NM_001918_hg19_2_11 41 0 1 chr1:100706317-100706440-

ICVRYFQTCGNVHVLKPNYVCFFGYPSFKYSHPHHFLKTTA

>NM_001918_panTro4_2_11 41 0 1 chr1:101069219-101069342-

ICVRYFQTCGNVHVLKPNYVCFFGYPSFKYSHPHHFLKTTA

>NM_001918_gorGor3_2_11 41 0 1 chr1:102957120-102957243-

ICVRYFQTCGNVHVLKPNYVCFFGYPSFKYSHPHHFLKTTA

>NM_001918_ponAbe2_2_11 41 0 1 chr1:128258433-128258556+

I could not find a description of the header anywhere. I just want to make sure I am not making any stupid assumptions.

Thanks,

Shane Giles

Cath Tyner

Jun 6, 2016, 8:14:41 PM

to Shane Giles, UCSC Genome Browser Public Help Forum

Hello Shane,

Thank you for using the UCSC Genome Browser and for inquiring about obtaining amino acid conservation data across species.

The files that you would like to download are not available; the amino acid sequences displayed in the Genome Browser Multiz alignments are generated on the fly using information from the multizNwayFrames table.

In this previously answered mailing list question, the question is asked:

I want alignment only for the specific amino acid which I am interested at. The full gene alignment is too much amino acid to look for. I want something like the following but instead of nucleotide it has to be amino acid.

The answer:

You can use the NCBI protein multiple alignment tool 'COBALT' to find common protein sequences:
http://www.ncbi.nlm.nih.gov/tools/cobalt/cobalt.cgi

You also asked about the format of the FASTA header, which is described here:

http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html#FASTA

Explanation of CDS FASTA header format
— Whole gene format: geneName_assemblyName peptideLength location
— Exon format: geneName_assemblyName_exonNum_totalExons exonLength inFrame outFrame location
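
The exon format above can be split into its fields with a few lines of Python. This is only a minimal sketch: the names `ExonHeader` and `parse_exon_header` are hypothetical, it assumes the gene name may itself contain an underscore (as in NM_001918) while the assembly name does not, and it assumes the location always ends in a strand character, as in the headers quoted earlier in this thread. Whether the coordinates are 0- or 1-based is not stated here, so they are passed through unchanged.

```python
import re
from typing import NamedTuple


class ExonHeader(NamedTuple):
    gene: str
    assembly: str
    exon_num: int
    total_exons: int
    exon_length: int
    in_frame: int
    out_frame: int
    chrom: str
    start: int
    end: int
    strand: str


# Location looks like "chr1:100706317-100706440-" (trailing strand character).
_LOCATION = re.compile(
    r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)(?P<strand>[+-])$"
)


def parse_exon_header(line: str) -> ExonHeader:
    """Parse one exon-format header line (with or without the leading '>')."""
    name, length, in_frame, out_frame, location = line.lstrip(">").split()
    # Split from the right so underscores inside the gene name are preserved.
    gene, assembly, exon_num, total_exons = name.rsplit("_", 3)
    match = _LOCATION.match(location)
    if match is None:
        raise ValueError(f"unrecognized location field: {location!r}")
    return ExonHeader(
        gene=gene,
        assembly=assembly,
        exon_num=int(exon_num),
        total_exons=int(total_exons),
        exon_length=int(length),
        in_frame=int(in_frame),
        out_frame=int(out_frame),
        chrom=match["chrom"],
        start=int(match["start"]),
        end=int(match["end"]),
        strand=match["strand"],
    )
```

For example, `parse_exon_header(">NM_001918_hg19_2_11 41 0 1 chr1:100706317-100706440-")` gives gene `NM_001918`, assembly `hg19`, exon 2 of 11, and a location on the minus strand of chr1.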

Of possible interest:

Table Browser Instructions for Multiz-100 amino acids sequence

Thank you again for your inquiry and for using the UCSC Genome Browser.

Please send new and follow-up questions to one of our UCSC Genome Browser mailing lists below:

* Post to the Public Help Forum: Email gen...@soe.ucsc.edu or search the Public Archives
* Post to the Mirror Help Forum: Email genome...@soe.ucsc.edu or search the Mirror Archives
* Confidential/private data help: Email genom...@soe.ucsc.edu

UCSC Genome Browser Announcements List (email alerts for new data & software):

* Subscribe: Email genome-annou...@soe.ucsc.edu
* Unsubscribe: Email genome-announ...@soe.ucsc.edu

Please respond to this list if you have further questions.

Enjoy,

Cath

Cath Tyner

UCSC Genome Browser, Software QA & User Support

UC Santa Cruz Genomics Institute

UCSC Genome Browser


Shane Giles

Jun 17, 2016, 6:42:30 PM

to gen...@soe.ucsc.edu

I have a few questions regarding the "Multiz Alignments of 100 Vertebrates" tracks displayed in the UCSC browser. I see that the file date for the FASTA files is "2/14/14". Will UCSC be updating these anytime soon? Is there an update schedule?

Is there any possibility of getting the full RefSeq IDs used in the generation of the file? The IDs listed in the FASTA file are version-less.

And last, I noticed a few discrepancies between the fasta file and the data presented on the browser. Here is the human entry for the first exon of NM_000348 (SRD5A2).

>NM_000348_hg19_1_5 93 0 1 chr2:31805690-31805969-

MQVQCQQSPVLAGSATLVALGALALYVAKPPATGSTRRAZSRRLPACQPAPPGSCRSCLPSRCPRGSSPGSPSPSSGHLGRYFWASSAYITST

The peptide sequence in the fasta file matches what is shown in the browser for the first 29 amino acid residues. After that none of the residues match. It appears there is a two base insertion in the reference that throws the residues in the fasta file out of phase. I don't see anything in the refGene.exonAA.fa file that would let me know this. What does the browser use to know it needs to fix a frame shift?

Thanks in advance for your reply.

Shane

Brian Lee

Jun 23, 2016, 12:51:48 PM

to Shane Giles, gen...@soe.ucsc.edu

Dear Shane,

Thank you for using the UCSC Genome Browser and for sharing the details around the NM_000348 (SRD5A2) entry in the refGene.exonAA.fa.gz file.

We plan to update this file and will include a way to also show a versioned RefSeq ID, such as NM_000348.3. Our engineers explain that there is a one-base deletion in the reference with respect to the mRNA, which caused this issue with the exon lines for this gene when the earlier file was generated.

The rebuild process will correct the mapped AA sequences by creating two separate exons. An example of what it should look like can be seen by clicking the "CDS FASTA alignment from multiple alignment" link for RefSeq Gene SRD5A2, then checking the box "Separate into exons" and narrowing the species selection:

>NM_000348_hg19_1_6 30 0 2 chr2:31805881-31805969-
MQVQCQQSPVLAGSATLVALGALALYVAKP

>NM_000348_hg19_2_6 64 0 2 chr2:31805690-31805880-
SGYGKHTESLKPAATRLPARAAWFLQELPSFAVPAGILARQPLSLFGPPGTVLLGLFCLHYFHR

Compare this with the entry reported in refGene.exonAA.fa.gz:

>NM_000348_hg19_1_5 93 0 1 chr2:31805690-31805969-
MQVQCQQSPVLAGSATLVALGALALYVAKPPATGSTRRAZSRRLPACQPAPPGSCRSCLPSRCPRGSSPGSPSPSSGHLGRYFWASSAYITST
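
To locate where the two versions of the human record stop agreeing, the sequences quoted above can be compared directly. This is only an illustrative sketch: the helper name `common_prefix_length` is hypothetical, and joining the two rebuilt exons end to end is an assumption made purely for the comparison.

```python
def common_prefix_length(a: str, b: str) -> int:
    """Return the number of leading characters that a and b share."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


# Single-exon record from the earlier refGene.exonAA.fa.gz (quoted above).
old_exon = (
    "MQVQCQQSPVLAGSATLVALGALALYVAKPPATGSTRRAZSRRLPACQPAPPGSCRSCLPSRCPRGSS"
    "PGSPSPSSGHLGRYFWASSAYITST"
)
# The two exons shown by the browser's CDS FASTA alignment (quoted above).
new_exon_1 = "MQVQCQQSPVLAGSATLVALGALALYVAKP"
new_exon_2 = "SGYGKHTESLKPAATRLPARAAWFLQELPSFAVPAGILARQPLSLFGPPGTVLLGLFCLHYFHR"

shared = common_prefix_length(old_exon, new_exon_1 + new_exon_2)
print(f"records agree for the first {shared} residues")
print(f"old continues with {old_exon[shared:shared + 5]}, "
      f"rebuilt continues with {new_exon_2[:5]}")
```

The point of divergence coincides with the end of the first rebuilt exon, which is where the split was introduced.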

I will send a follow-up email when the file has been updated at http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz100way/alignments/refGene.exonAA.fa.gz

Thank you again for your message and helping improve the UCSC Genome Browser. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

All the best,

Brian Lee
UCSC Genomics Institute


Brian Lee

Jun 28, 2016, 1:40:11 PM

to Shane Giles, gen...@soe.ucsc.edu

Dear Shane,

We have updated the refGene.exonAA.fa.gz and refGene.exonNuc.fa.gz files available at: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/multiz100way/alignments/
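
Once one of these files has been downloaded, the records for a single transcript can be pulled out with the Python standard library. This is a minimal sketch, not a description of how the browser reads the data: the function names `iter_fasta` and `records_for_transcript` are hypothetical, the local file path is whatever you saved the download as, and matching on the `<transcript>_<assembly>_` prefix assumes the header layout quoted earlier in this thread.

```python
import gzip
from typing import Iterable, Iterator, List, Tuple


def iter_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; blank lines between records are skipped."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def records_for_transcript(
    path: str, transcript: str, assembly: str
) -> List[Tuple[str, str]]:
    """Collect records whose header starts with '<transcript>_<assembly>_'."""
    prefix = f"{transcript}_{assembly}_"
    # "rt" opens the gzip-compressed file as text.
    with gzip.open(path, "rt") as handle:
        return [(h, s) for h, s in iter_fasta(handle) if h.startswith(prefix)]
```

For example, `records_for_transcript("refGene.exonAA.fa.gz", "NM_000348", "hg19")` would return the human exon records for SRD5A2 from a local copy of the file.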


All the best,

Brian Lee

UCSC Genomics Institute
