psl alignment format and qstarts- qends calculation

lenis vasilis

Jan 30, 2015, 12:28:23 PM
to gen...@soe.ucsc.edu

Hello everybody,

I'm trying to build an alignment algorithm that quickly generates an approximate alignment between two whole mammalian genomes. At the moment, in order to evaluate my results, I want to convert my output into psl format.

My problem is that I cannot understand how the qStarts and tStarts values (the last two values) are calculated for the negative strand. On the positive strand, qStarts is the same as the starting position in the query, and tStarts is the starting position in the target.

For the negative strand, consider the following example:

132     22      0       0       0       0       0       0       -       chr24   61756751        49146459        49146613        chr22   61598339        1042089 1042243 1       154, 12610138,       1042089,

How are the last two numbers calculated?

I have read the example given on the psl explanation page, but I still cannot understand it.
If you could explain it to me in more detail, I would appreciate it.

Thank you very much in advance,

Vasilis.

Jonathan Casper

Feb 11, 2015, 4:42:36 PM
to lenis vasilis, gen...@soe.ucsc.edu

Hello Vasilis,

Thank you for your question about the qStarts and tStarts fields in the psl file format. When items are on the negative strand, the starts values are counted backward from the end of the positive strand or, equivalently, forward from the beginning of the negative strand (which, as the reverse complement of the positive strand, starts on the opposite side).

Here is a piece of the example from http://genome.ucsc.edu/FAQ/FAQformat.html#format2 to look at more closely:

0         1         2         3         4         5         6 tens position in query positive strand coordinates
0123456789012345678901234567890123456789012345678901234567890 ones position in query positive strand coordinates
                      ++++++++++++++                          plus strand item
                      --------------------                    minus strand item
0987654321098765432109876543210987654321098765432109876543210 ones position in query negative strand coordinates
6         5         4         3         2         1         0 tens position in query negative strand coordinates

First, note that this entire segment has 61 base positions (from 0 to 60, inclusive). That will be useful to remember a bit later. Coordinates for the positive strand start at the left end with 0, and go up to 60 at the right end. Those are the numbers at the top. When counting in terms of the negative strand, however, the coordinates begin at the right edge. That is what the numbers down below are for: 0 at the right edge, and 60 at the left edge. In the example above, position 10 on the positive strand (the top set) lines up with position 50 on the negative strand (the bottom set). Position 30 on the positive strand lines up with position 30 on the negative strand.
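As a small illustration of the two rulers in the diagram, here is a minimal Python sketch. The function name `flip_position` is made up for this example and is not part of any UCSC tool; it only assumes 0-based positions, as in the diagram above.

```python
def flip_position(pos, size):
    """Map a 0-based position on one strand to the 0-based position
    of the same base on the opposite strand.

    `size` is the total number of bases in the sequence (61 in the diagram).
    """
    return size - 1 - pos


# Values read off the diagram above (61 bases, positions 0 to 60).
print(flip_position(10, 61))  # positive 10 lines up with negative 50
print(flip_position(30, 61))  # the middle base, 30, maps to itself
```

Applying the function twice returns the original position, which matches the fact that the two rulers are mirror images of each other.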

There are two alignments in the above example, each with only one block. The first alignment is the top one and is on the positive strand. You can line up the left edge of the block of '+'s with the top set of numbers to see that it begins at position 22 on the positive strand. It also lines up with position 38 on the negative strand (using the coordinates below), but that isn't important because this particular item is on the positive strand. As a result, this track item with one block has a qStart of 22, and a qEnd of 36. If you would like to know why qEnd is 36 (one past the end of the alignment) instead of 35, you can find more information about our "0-based, half-open" coordinate system at http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms.

This item is on the positive strand, so we calculate its block starts using positive strand coordinates. That means that its qStarts is "22,", which is the same as the qStart (the beginning of the entire alignment), because the first block starts at the beginning of the entire alignment.

Now look at the second aligned item: the one represented by a string of '-'s. This is an item on the negative strand. For calculating the qStart and qEnd values, we still use positive strand coordinates even though the item is on the negative strand. Using the top set of numbers, we can see that this alignment has a qStart of 22 (the same as the '+' item), and a qEnd of 42. So far so good. Now we want to calculate the qStarts values (the start of each of the alignment blocks), and this is where the negative strand starts to matter.

Even though we still use positive strand coordinates for the qStart and qEnd of an alignment to the negative strand, we use negative strand coordinates for the individual blocks. Note that this means that we have to use negative strand coordinates for the "beginning" of each block of the alignment. Because we are using negative strand coordinates, which start at the right edge and end at the left edge, the beginning of each block is the side closest to the right edge. Using the numbers on the bottom, we can see that the "beginning" of the block of '-'s is at position 19 in negative strand coordinates. Thus, the qStarts for the second alignment would be "19,". For negative strand alignments, the qStarts values will generally not be the same as the overall qStart.

If it helps, here is an alternative way to calculate the qStarts and tStarts values for items on the negative strand. First, pretend the item is on the positive strand and calculate the qStarts and blockSizes values. For the '-' alignment above, that would be a qStarts of "22," and a blockSizes of "20,". Then for each block, subtract the qStarts and blockSizes values from the size of the sequence to get the negative strand coordinates. In the above example, the entire sequence has 61 bases. 61 (sequence size) - 22 (start of the block in positive coordinates) - 20 (size of the block) = 19. Thus, the qStarts value for that block on the negative strand is 19.
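The subtraction described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic for a single block; the function name `minus_strand_block_start` is hypothetical, and the sketch does not address how multiple blocks should be ordered in the output.

```python
def minus_strand_block_start(plus_start, block_size, seq_size):
    """Convert one block start from positive strand coordinates to
    negative strand coordinates:
    sequence size - positive strand start - block size.
    """
    return seq_size - plus_start - block_size


# The '-' item from the diagram: 61 bases, block starts at 22, size 20.
print(minus_strand_block_start(22, 20, 61))  # 19

# The single-block record from the original question:
# qSize 61756751, qStart 49146459, blockSizes "154,".
print(minus_strand_block_start(49146459, 154, 61756751))  # 12610138
```

The second result matches the qStarts value "12610138," in the record from the original question. In that same record the tStarts value "1042089," is equal to tStart, that is, the target block start is given in positive strand coordinates.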

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu or genome...@soe.ucsc.edu. Questions sent to those addresses will be archived in publicly-accessible forums for the benefit of other users. If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.

--
Jonathan Casper
UCSC Genome Bioinformatics Group

