MAF file aligned coordinates incorrect

485 views
Skip to first unread message

swaraj basu

unread,
Aug 7, 2014, 11:25:12 AM8/7/14
to gen...@soe.ucsc.edu
Dear All,

I am parsing the 8 way multiple genome alignment file (MAF format) with zebrafish (Zv9/danRer7) as the reference genome (http://hgdownload.soe.ucsc.edu/goldenPath/danRer7/multiz8way/multiz8way.maf.gz). I wanted to extract the sub-alignment for particular genomic region of zebrafish from the MAF file (chr8:44990148-44990248). I used the following command

mafsInRegion region.bed region.maf multiz8way.maf

where
- mafsInRegion is UCSC utility (http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/mafsInRegion).
- region.bed contains the coordinates of the region of interest in the zebrafish genome (chr8:44990148-44990248).
- region.maf is the output file for the multiple alignment in the particular region.
- multiz8way.maf is the whole genome alignment file of 8 species with zebrafish as the reference.

From the sub-alignment (region.maf) I wanted to extract the corresponding region aligned in the human genome (hg19). I noticed that the human coordinates are given as chr20:58142414-58142315 (hg19) while if I take the aligned sequence and BLAT it against the human genome (hg19) I get the single hit with coordinate chr20:4883107-4883206. Since I have to perform such an analysis genome wide, I am not sure what is the frequency of such errors in the whole genome MAF alignment file. Can someone kindly help in providing a solution for this issue. Please find attached the extracted sub-alignment in MAF format.

-regards
--
Swaraj Basu
Post-doctoral fellow (Comparative Genomics)
Ecology and Evolution of Plankton
Stazione Zoologica Anton Dohrn
Naples
region.maf

Matthew Speir

unread,
Aug 15, 2014, 6:30:21 PM8/15/14
to swaraj basu, gen...@soe.ucsc.edu
Hi Swaraj,

Thank you for your question about the danRer7 MAF files. The BLAT results are describing the position using
positive strand coordinates, while the MAF files are using negative strand coordinates. You can see this for
yourself by doing some simple calculations.

First you can see that the chr20 size minus BLAT qStart coordinate is equal to the MAF coordinate plus the
region size of 100:
    63025520 - 4883106 = 58142314 + 100

Working from the opposite direction, you can see that subtracting the negative strand based coordinates from
the MAF file will get you the positive strand based BLAT coordinates:
    63025520 - 58142314 = 4883206 = tEnd = chr20 pos-strand coordinate of start
    63025520 - 58142314 - 100 = 4883106 = tStart = chr20 pos-strand coordinate of end

Note that for BLAT, the PSL results of non-translated DNA queries, the strand will either be + or -. If it is -, it
just means that the query was reverse-complemented before searching. The results will always be in positive
coordinates for the tStart and tEnd, but the meaning of start and end are reversed.

If you are just interested in the alignments between hg19 and danRer7, you can just use the pairwise alignment
between the two directly. One of our engineers shared the following strategy for extracting these from the
chain files themselves:

    Obtain the files hg19.2bit, danRer7.2bit and hg19.danRer7.all.chain.gz from hgdownload:
    Generate the hg19 and danRer7 chromosome sizes files:
        twoBitInfo hg19.2bit stdout | sort -k2,2n > hg19.chrom.sizes
        twoBitInfo danRer7.2bit stdout | sort -k2,2n > danRer7.chrom.sizes


    Then use the chainToAxt and axtToMaf utilities to convert the pairwise alignment from
    the chain format to the MAF format:

        chainToAxt hg19.danRer7.all.chain.gz hg19.2bit danRer7.2bit stdout \
    | axtToMaf stdin hg19.chrom.sizes danRer7.chrom.sizes hg19.danRer7.maf

    The hg19.danRer7.maf file will contain all the alignments of danRer7 to hg19 in maf format.
    Be aware that this hg19.danRer7.maf file will likely be over 10 GB in size. You can then use
    the mafsInRegion utility on this hg19.danRer7.maf file to extract any alignments from danRer7
    to hg19 in specific locations.

    If you want a more restrictive set of alignments, use the liftOver chain file hg19ToDanRer7.over.chain.gz
    instead of all the alignments. You can find the liftOver chain file here:
        http://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/hg19ToDanRer7.over.chain.gz

    In general, all the different alignment formats can be converted into each other via a pipeline of:
         aTob | cTod | ... | xToy > wasA.nowY

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages
sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes
sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group
--


Reply all
Reply to author
Forward
0 new messages