question about pairwise alignment files

Alireza Fotuhi Siahpirani

unread,

Aug 21, 2014, 8:06:29 PM8/21/14

to gen...@soe.ucsc.edu

hi,

my name is Alireza Fotuhi Siahpirani and I am a graduate student in computer sciences department of UW-Madison.

my question is that if I want to match human regions to mouse regions, which pairwise alignment files I should use and if there is a program that I can use for mapping.

as a part of a project, I am trying to map 1Mb length regions from human genome (hg19) to 1Mb length regions from mouse genome (mm9). (I have pre-defined regions, for example chr1_0M-1M, chr1_1M_2M,... in human and the same in mouse).

I was looking at liftOver, but they mentioned that it is not designed to map from one organism to another organism.
if I used default setting it wouldn't find any region in mouse;
if I decrease the -minMatch it gives me some matching but it isn't unique.
I'm not completely sure how to select the threshold and I didn't completely understood how they work;
so I decided to write my own script to aggregate alignments in my bins.
right now I am using the liftover chain files (from here: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/liftOver/) to aggregate the alignments in my bins.
but I realised there are also other pairwise alignment files here: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/vsMm9/

these are my questions:

which of these files I should use?

what the score in the alignment file means. should I use all the alignments in the chain file, or should I use a threshold and ignore the chains with score lower than the threshold?

is it better to use my own script, or should I just use liftOver (thought they say it's not made for this purpose)?

I really appreciate your help.

thanks,
Alireza

Matthew Speir

unread,

Aug 22, 2014, 7:08:45 PM8/22/14

to Alireza Fotuhi Siahpirani, gen...@soe.ucsc.edu

Hi Alireza,

Thank you for your question about using the mm9 and hg19 pairwise
alignment files. Please see this previously answered mailing list
question for instructions on how to extract the alignments for your
regions of interest here:
https://groups.google.com/a/soe.ucsc.edu/d/msg/genome/tOEkoCYem3c/3FMCjaxVQGUJ.
While that answer is dealing with danRer7 and hg19, you can easily
replace any reference to a danRer7 file with an equivalent file in mm9.
You can find the mm9 2bit file here:
http://hgdownload.soe.ucsc.edu/goldenPath/mm9/bigZips/mm9.2bit, and the
hg19/mm9 pairwise alignment file here:
http://hgdownload.soe.ucsc.edu/goldenPath/hg19/vsMm9/hg19.mm9.all.chain.gz.
You can find the twoBitInfo, chainToAxt, axtToMaf, and mafsInRegion
utilities referenced in those instructions under the directory for your
system here: http://hgdownload.soe.ucsc.edu/admin/exe/.

I hope this is helpful. If you have any further questions, please reply
to gen...@soe.ucsc.edu. All messages sent to that address are archived
on a publicly-accessible Google Groups forum. If your question includes
sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group

> --
>

Matthew Speir

unread,

Aug 26, 2014, 3:02:24 PM8/26/14

to Alireza Fotuhi Siahpirani, gen...@soe.ucsc.edu

Hi Alireza,

Sorry for this late update, but I realized that I neglected to answer a few of your questions. First, here is an answer to a previous mailing list question that gives an excellent explanation of the scoring method used in chain files: http://groups.google.com/a/soe.ucsc.edu/d/msg/genome/C7Wxn01IzQk/wHpOkr9mWAoJ. You can find the matrices referenced in that answer on the mm9/hg19 chain/net track description page: http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=mm9&g=chainNetHg19.

Second, which files should you use? To answer this, I should explain the differences between the two files, and then you can decide which file best fits your needs. The pairwise alignment files, such as the hg19.mm9.all.chain.gz file in the vsMm9 directory, are many-to-many and any sufficiently high scoring chain is kept in these files. The LiftOver files, on the other hand, have been filtered using the "nets", which are an attempt to identify the best alignment for each position in the target genome (hg19 in this case).

Third, it was also pointed out to me by another Genome Browser engineer here that you could use the LiftOver to find the corresponding regions in human. While we say that LiftOver wasn't designed to find corresponding regions in other species, we have been using for years for both lifting data sets from one species to another and converting data sets to new assemblies. It is up to you whether or not you want to write your own script or use LiftOver. You can try LiftOver, and if you are unhappy with the results, you can try creating your own script.

Lastly, there are many other groups out there who have made attempts to identify large orthologous regions between assemblies. You may want to talk to these groups and see what they are using to identify these regions. For example, at UW-Madison there is Colin Dewey: https://www.biostat.wisc.edu/~cdewey/.

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group

On 8/21/14, 4:45 PM, Alireza Fotuhi Siahpirani wrote:

--

Alireza Fotuhi Siahpirani

unread,

Sep 5, 2014, 12:52:46 PM9/5/14

to Matthew Speir, gen...@soe.ucsc.edu

thank you for your respond.

I have another basic question.

in the explanation of chain format it says:

When the strand value is "-", position coordinates are listed in terms of the reverse-complemented sequence.

I assumed it means the start and end positions are relative to the end of the positive strand. so if I want the aligned sequence I should look at reverse complement of the sequence in (tlen - tend, tlen - tstart] in positive strand.

I looked at the sequence at those coordinates and it seems to match the sequence in the other species.

but my understanding was that in the gtf files, when a gene is on the negative strand, the positions are still relative to the positive strand. for example if the coordinates of a gene on the negative strand is 1000 to 2000, the 1Kb upstream of the gene would be (2000,3000). and if it was on positive strand it would have been (0,1000).

so if I understand it correctly, when I am reading a gtf file, I shouldn't change the coordinates for negative strand; but when I am reading the alignment for the negative strand, I should subtract the coordinates from the length of the chromosome.

is it correct?

I am a little confused because I was expecting the meaning of the coordinates to be the same in all file formats, but it seems like it is different here.

thanks,
Alireza

Jonathan Casper

unread,

Sep 5, 2014, 2:11:49 PM9/5/14

to Alireza Fotuhi Siahpirani, gen...@soe.ucsc.edu

Hello Alireza,

You are absolutely right. When looking at a chain file, the coordinates for things on the - strand are changed around. To find the item on the + strand, you will need to use the coordinates (tSize-tEnd, tSize-tStart]. When using a GTF file, you do not need to change the coordinates for items on the - strand.

It is unfortunate that differences like this exist, which can be confusing for people who need to work with many file formats. That is part of why we try to provide tools for working with these file formats and converting between them, so that these differences are hidden away.

A word of warning: you may also need to work with the PSL data format (http://genome.ucsc.edu/FAQ/FAQformat.html#format2) at some point. PSL has its own way of describing items on the - strand. The start and stop position for each alignment use + strand coordinates, but the list of start positions of the blocks within each alignment uses - strand coordinates. This can be confusing.

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu or genome...@soe.ucsc.edu. Questions sent to those addresses will be archived in publicly-accessible forums for the benefit of other users. If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.