Extract gene alignments from MAF files with blocks intact

183 views

Skip to first unread message

XiaoJu Zhang

unread,

Jun 23, 2015, 3:15:07 PM6/23/15

to gen...@soe.ucsc.edu, genome...@soe.ucsc.edu

Dear UCSC Genome Browser team,

(This is probably a duplicate email, but I cannot find my previous email appearing on the mailing list)

I've been trying to figure out the methods to do an alignment extraction for days, and I hope you can help. Here is my project: I have a list of human genes (total ~ 19,000 genes), with corresponding chromosome ID and gene’s range.
#gene_symbol #chr_id #start #end
MRPL10 chr17 45898638 45910907
OR10V1 chr11 59478389 59483318
PTPN12 chr7 77165352 77271388
…
…

Now, I want to extract the alignment for each of the gene (including introns, 2000bp upstream and 2000bp downstream regions) from hg19 multiz100way maf files (http://hgdownload.soe.ucsc.edu/goldenPath/hg19/multiz100way/maf/), and eventually get the alignment as fasta format, in which each block presents 100 species’ sequences, and the blocks of output alignment should have the same coordinates with their source MAF, even for the ending blocks.

After some research on the mailing list, I found a couple of very useful instructions:
https://groups.google.com/a/soe.ucsc.edu/d/msg/genome/F_YjGiYMcDY/1Sk_3yxRVxcJ suggesting using a tool named "mafsInRegion"; and another discussion from https://groups.google.com/a/soe.ucsc.edu/d/msg/genome/GJ7iKzJ2e0k/oBdKkalta5cJ mentioned a faster way to do that, which need a sorted .bed input file.

However, I am still have some questions.

The tool mafsInRegion seems not work for my purpose since what my alignment extraction needs the start matches the alignment block’s start. In other words, take “MRPL10 chr17 45898638 45910907” for example, if the block covers the gene range’s start position 45898638 is 45898000, what it need is to take 45898000 as start, and keep the whole alignment fragment intact, and same with the end. In this case, what will be the best tool to do so?
I assume the output from the extraction will be in maf format, and my goal is to have the final result in a 100 species all present fatsta format. I tried galaxy server for some test, and it works fine, but what will be the best solution to convert the maf into fasta locally? A command line tool will be preferred.