Why is the phastCons file shorter than the length of the chromosome?

50 views
Skip to first unread message

Heather Machado

unread,
Oct 19, 2015, 1:18:30 PM10/19/15
to gen...@soe.ucsc.edu
I have download the phastCons scores for alignments of 14 insects with the Release 5 D. melanogaster genome (http://hgdownload.soe.ucsc.edu/goldenPath/dm3/phastCons15way/). I have noticed that the phastCons files are often shorter than the lengths of the chromosomes, despite starting at the first position. It is important that I understand the cause of this discrepancy in order to correctly assign each phastCon score to each genomic position. 

For example, for chr2L, the phastCons files has 22989840 sites, whereas the chr2L length is 23011544 long (starting at position 1, per the header "fixedStep chrom=chr2L start=1 step=1"). 

Does this mean that the last 21,704 bp are omitted from the dm3 2L phastCons file? 

Best,
Heather

Matthew Speir

unread,
Oct 21, 2015, 4:20:04 PM10/21/15
to Heather Machado, gen...@soe.ucsc.edu
Hi Heather,

Thank you for your question about the phastCons files for the dm3 multiple alignment. This discrepancy between the number of positions covered by the phastCons file and the length of the chromosome is likely caused by gaps in the multiple alignment. These gaps could result from gaps in the reference assembly or variable regions that do not align well between species. Anywhere the multiple alignment is lacking data, phastCons cannot be used to calculated conservation scores.

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group
--


Heather Machado

unread,
Oct 21, 2015, 4:28:36 PM10/21/15
to gen...@soe.ucsc.edu
Hi Matthew,

Thank you for getting back to me. The phastCons file does not have any position information... is there a way to figure out which genomic positions these scores refer to?

Best,
Heather

Matthew Speir

unread,
Oct 23, 2015, 12:38:17 PM10/23/15
to Heather Machado, gen...@soe.ucsc.edu
Hi Heather,

You can convert the fixedStep wiggle data to a format like bedGraph to get a clearer idea of how these scores correspond to their genomic positions.

To convert your fixedStep wiggle to bedGraph, you can use the following perl script available in the Genome Browser source tree:

    http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob;f=src/hg/makeDb/hgLoadWiggle/fixStepToBedGraph.pl;hb=master

The fixedStep format does contain position information, but it is a little more cryptic than something like bedGraph. You can read about the different wiggle data formats, including fixedStep, here: http://genome.ucsc.edu/goldenPath/help/wiggle.html. And, you can read about the bedGraph format here: http://genome.ucsc.edu/goldenPath/help/bedgraph.html.


I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group


--


Heather Machado

unread,
Oct 23, 2015, 12:52:15 PM10/23/15
to Matthew Speir, gen...@soe.ucsc.edu
Hi Matthew,

I think I'm not being clear about my question. 

I'm working with one particular file:
chr2L.pp.gz
in the folder:

which is supposed to be the phastCons scores for alignments of 14 insects with the Release 5 D. melanogaster genome for the chromosome 2L.

This is file one line per bp, consisting only of the phastCons score. There is no position information. In this file, which positions do each line correspond to? Since it is of a slightly different length than the chromosome, it cannot be, as advertised, one position per bp of the chromosome. 

Heather

Matthew Speir

unread,
Oct 23, 2015, 3:09:16 PM10/23/15
to Heather Machado, gen...@soe.ucsc.edu
Hi Heather,

I think you are misinterpreting the fixedStep format. I believe the key piece that you are missing is that there are often multiple declaration lines throughout the fixedStep file. From the wiggle format help page I linked previously, the format of these fixedStep declaration lines is as follows:

    fixedStep  chrom=chrN  start=position  step=stepInterval

In the chr2L.pp file you are dealing with, the "chrom" specification will always be "chrom=chr2L". The "start" specification indicates the starting base position of the following lines, and the step indicates how many bases you are moving forward in each subsequent line.

For example, let's look at the first declaration line in chr2L.pp and the 4 lines that follow it:

    fixedStep chrom=chr2L start=1 step=1
    0.771
    0.780
    0.772
    0.785

What the declaration line is saying is that we're starting at position one on chr2L and each following line is moving forward on chr2L by one base. So, we can interpret these lines as

        chr2L base 1, score: 0.771
        chr2L base 2, score: 0.780
        chr2L base 3, score: 0.772
        chr2L base 4, score: 0.785

There are 1015 of these declaration lines throughout the chr2L.pp file. Here are the first 5 of these lines:

    fixedStep chrom=chr2L start=1 step=1

    fixedStep chrom=chr2L start=5243 step=1
    fixedStep chrom=chr2L start=105962 step=1
    fixedStep chrom=chr2L start=114467 step=1
    fixedStep chrom=chr2L start=130960 step=1

I've excluded them, but each of these declaration lines is followed by score lines that correspond to the bases starting from the position indicated in the "start" specification. It should not be assumed that the bases following one of these declaration lines continue right up until the start indicated in the next declaration line. You should actually assume the opposite. You should use the start of a declaration as an indication that there is some gap in the phastCons scores in the preceding region.

For example, take the first two declaration lines in the chr2L.pp file:

    fixedStep chrom=chr2L start=1 step=1
    fixedStep chrom=chr2L start=5243 step=1

We cannot assume that we have scores for the 5242 bases between the start position in first declaration line and the start position in the second. We can assume, however, that there is some gap in the phastCons scores that precedes position 5243 on chr2L. If we look at the conservation track in the Genome Browser, we can see that this is true and that there is a ~38 base gap that precedes the start position in the second declaration line:

http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=Matt%20Speir&hgS_otherUserSessionName=dm3_phastConsGap

I've highlighted base 5243 on chr2L in blue. Additionally in this session, you can see how the phastCons track is influenced by the multiple alignment. The gap in the phastCons scores matches up with the gap in the multiple alignment.

The purpose of converting the chr2L.pp file from fixedStep to bedGraph was to give you a file format this is more explicit in stating which base each line corresponds to and what the score is for that position. You can read the bedGraph help page I linked my previous message to learn more about how bedGraph files are formatted. If, after reading that page, you decide that the bedGraph format fits your needs better, then you can convert the chr2L.pp file from fixedStep to bedGraph using the perl script I linked.

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group


Heather Machado

unread,
Oct 23, 2015, 4:02:06 PM10/23/15
to Matthew Speir, gen...@soe.ucsc.edu
I see. Thank you for the explanation!

Best,
Heather
Reply all
Reply to author
Forward
0 new messages