Question regarding 100 species MAF and selecting Refseq protein versions


Adam Chamberlin

Jan 3, 2018, 5:02:34 PM
to gen...@soe.ucsc.edu
Hello,
I have been using the Genome Browser off and on to generate the 100-species alignments for particular genes of interest through the web interface. When I tried to generate the alignment for SLC16A2, the Genome Browser returned the alignment for the more recent NP_006508.2; however, I am using the older NCBI NP_006508.1 sequence. Is it possible to obtain the alignment for the older version?

Christopher Lee

Jan 16, 2018, 11:34:42 AM
to Adam Chamberlin, UCSC Genome Browser Discussion List

Hello Adam,

Thank you for your question about generating alignments for SLC16A2. The short answer is no, it is not possible to use the old version of the transcript to generate alignments.

This is not possible because of the way we make the RefSeq and GenBank tracks. For RefSeq (and GenBank) data, roughly once per week we download and align the most recent transcripts, mRNAs, etc., and we do not keep previous versions around for download.

The longer answer is that it is possible to get these alignments if you have access to a Unix-style (Mac or Linux) command line. The process involves using BLAT (http://genome.ucsc.edu/cgi-bin/hgBlat) to align the FASTA of your protein sequence of interest, converting the resulting PSL file to a Gene Prediction (genePred), and then using the mafGene utility to extract the multiple alignment from the 100-way for the genePred file you created. Here are the utilities you will need to download from our directory of utilities, http://hgdownload.soe.ucsc.edu/admin/exe/:
pslToBed
mafGene
bedToGenePred
blat (only if you will be running this procedure for a large number of genes)

And here is the procedure:
1. Use BLAT (http://genome.ucsc.edu/cgi-bin/hgBlat) to align your protein sequence of interest. Choose the output option psl (with or without the header is fine).
2. Save the entire resulting line of interest into a file, for example np_006508.1.blat.psl:

613 0   0   0   0   0   5   108296  ++  NP_006508.1 613 0   613 chrX    156040895   74421415    74531550    6   216,49,151,46,78,73,    0,216,265,416,462,540,  74421415,74520985,74524356,74525749,74529206,74531331,

3. Make a species.list file, with the UCSC style names of all the assemblies you want MAF output for. For the list of species in the 100-way that you were using, please see the following file: http://genome-test.soe.ucsc.edu/~chmalee/species.list
4. Then run the following commands:
$ pslToBed np_006508.1.blat.psl np_006508.1.blat.bed
$ bedToGenePred np_006508.1.blat.bed np_006508.1.genePred
$ mafGene -useFile hg38 multiz100way np_006508.1.genePred species.list maf.out

5. The resulting maf.out file will contain the same FASTA output as generated by the CDS FASTA output option from the Table Browser.
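
One practical snag with step 2 is that a PSL line copied out of a web page can lose its tabs or pick up header lines. The following is a minimal sketch, not part of the UCSC tools, of a helper that rewrites the saved line with tab separators and checks it before it goes to pslToBed. The function name normalize_psl is hypothetical, and the checks rest on two assumptions: a PSL record has 21 fields, and field 18 (blockCount) equals the number of entries in field 19 (blockSizes).

```shell
#!/usr/bin/env bash
# normalize_psl FILE
# Hypothetical helper: prints PSL records from FILE tab-separated, skipping
# header lines, and fails if a record does not look like a 21-column PSL line.
normalize_psl() {
    awk 'BEGIN { OFS = "\t" }
    /^[0-9]/ {                       # data lines start with the match count
        $1 = $1                      # force awk to rebuild the line using OFS
        if (NF != 21) {
            print "expected 21 PSL fields, got " NF > "/dev/stderr"
            exit 1
        }
        sizes = $19
        sub(/,$/, "", sizes)         # drop the trailing comma before counting
        n = split(sizes, blocks, ",")
        if (n != $18) {
            print "blockCount " $18 " does not match " n " block sizes" > "/dev/stderr"
            exit 1
        }
        print
    }' "$1"
}
```

For example, if the pasted line was saved as raw.psl (a hypothetical file name), running normalize_psl raw.psl > np_006508.1.blat.psl would produce the input for the pslToBed command in step 4. This only checks the shape of the line; it does not confirm that the alignment itself is the one you want.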

Lastly, is it possible to explain the process you were using for generating these alignments, such as which links or selections you are making to output the alignments?

Thanks,

Christopher Lee
UCSC Genomics Institute


Thank you again for your inquiry and using the UCSC Genome Browser. If
you have any further questions, please reply to gen...@soe.ucsc.edu.
All messages sent to that address are archived on a
publicly-accessible forum. If your question includes sensitive data,
you may send it instead to genom...@soe.ucsc.edu.



Christopher Lee

Jan 16, 2018, 11:40:53 AM
to Adam Chamberlin, UCSC Genome Browser Discussion List
Hello again Adam,

I forgot to mention that you will also need to set up an .hg.conf file in the home directory of your system in order to use the mafGene utility.

This page lists a sample .hg.conf file you can copy and paste:

The only thing is you will need to add the following line to your file so mafGene can find the 100-way data:

After you have that line, you can then give the .hg.conf file 600 permissions:
chmod 600 ~/.hg.conf

And you should be good to go with the commands I previously sent.

Thanks,

Christopher Lee
UCSC Genomics Institute

Adam Chamberlin

Jan 22, 2018, 12:01:17 PM
to Christopher Lee, UCSC Genome Browser Discussion List
Thank you very much Christopher,
That was very helpful. I ended up slightly tweaking the method: since I already had the exon boundaries, I simply generated a BED12 file and fed that in. Using BLAT for this purpose got many of my cases right but missed on some others. Thank you again for the help.
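
The thread does not show how the BED12 file was generated. As a rough sketch of one way to do it, the hypothetical function below (exons_to_bed12, not a UCSC utility) builds a single BED12 line from comma-separated exon starts and ends. It assumes the starts are 0-based and the ends exclusive, as in BED, that the exons are sorted by position, and that the boundaries cover coding sequence only, so thickStart and thickEnd are set to the full span.

```shell
#!/usr/bin/env bash
# exons_to_bed12 CHROM NAME STRAND "start1,start2,..." "end1,end2,..."
# Hypothetical helper: prints one BED12 line for the given exon boundaries.
exons_to_bed12() {
    awk -v chrom="$1" -v name="$2" -v strand="$3" \
        -v starts="$4" -v ends="$5" 'BEGIN {
        n = split(starts, s, ",")
        m = split(ends, e, ",")
        if (n != m) {
            print "number of exon starts and ends differ" > "/dev/stderr"
            exit 1
        }
        sizes = ""; offsets = ""
        for (i = 1; i <= n; i++) {
            sizes   = sizes   (e[i] - s[i]) ","   # blockSizes
            offsets = offsets (s[i] - s[1]) ","   # blockStarts, relative to chromStart
        }
        # Assumption: every exon is coding, so thickStart/thickEnd = chromStart/chromEnd
        printf "%s\t%d\t%d\t%s\t0\t%s\t%d\t%d\t0\t%d\t%s\t%s\n",
               chrom, s[1], e[n], name, strand, s[1], e[n], n, sizes, offsets
    }'
}
```

With made-up coordinates, exons_to_bed12 chrX demo + "100,300" "160,390" > demo.bed writes a two-exon record that can then take the place of the pslToBed output in the bedToGenePred and mafGene commands from the earlier reply.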