Hi Heather,
I think you are misinterpreting the fixedStep format. I believe the
key piece that you are missing is that there are often multiple
declaration lines throughout the fixedStep file. From the wiggle
format help page I linked previously, the format of these fixedStep
declaration lines is as follows:
fixedStep chrom=chrN start=position step=stepInterval
In the chr2L.pp file you are dealing with, the "chrom" specification
will always be "chrom=chr2L". The "start" specification indicates
the starting base position of the following lines, and the step
indicates how many bases you are moving forward in each subsequent
line.
For example, let's look at the first declaration line in chr2L.pp
and the 4 lines that follow it:
fixedStep chrom=chr2L start=1 step=1
0.771
0.780
0.772
0.785
What the declaration line is saying is that we're starting at
position one on chr2L and each following line is moving forward on
chr2L by one base. So, we can interpret these lines as
chr2L base 1, score: 0.771
chr2L base 2, score: 0.780
chr2L base 3, score: 0.772
chr2L base 4, score: 0.785
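If it helps to see that interpretation spelled out, here is a minimal Python sketch of it. The function name expand_fixed_step is my own, and I am assuming every declaration line carries chrom, start and step, as the ones in chr2L.pp do. It is only meant to illustrate the bookkeeping, not to be a full wiggle parser.

```python
def expand_fixed_step(lines):
    """Yield (chrom, position, score) for every score line in fixedStep text.

    Hypothetical helper for illustration: it assumes each declaration
    line has chrom=, start= and step= fields.
    """
    chrom = pos = step = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("fixedStep"):
            # A declaration line resets the current chromosome and position.
            fields = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = fields["chrom"]
            pos = int(fields["start"])
            step = int(fields["step"])
        else:
            if chrom is None:
                raise ValueError("score line found before any declaration line")
            yield chrom, pos, float(line)
            pos += step  # each score line moves forward by "step" bases


example = """fixedStep chrom=chr2L start=1 step=1
0.771
0.780
0.772
0.785""".splitlines()

for chrom, pos, score in expand_fixed_step(example):
    print(f"{chrom} base {pos}, score: {score:.3f}")
```

Running this on the five lines above prints the same four "chr2L base N, score: ..." lines listed in the interpretation.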
There are 1015 of these declaration lines throughout the chr2L.pp
file. Here are the first 5 of these lines:
fixedStep chrom=chr2L start=1 step=1
fixedStep chrom=chr2L start=5243 step=1
fixedStep chrom=chr2L start=105962 step=1
fixedStep chrom=chr2L start=114467 step=1
fixedStep chrom=chr2L start=130960 step=1
I've excluded them, but each of these declaration lines is followed
by score lines that correspond to the bases starting from the
position indicated in the "start" specification. It should not be
assumed that the bases following one of these declaration lines
continue right up until the start indicated in the next declaration
line. You should actually assume the opposite. You should use the
start of a declaration as an indication that there is some gap in
the phastCons scores in the preceding region.
For example, take the first two declaration lines in the chr2L.pp
file:
fixedStep chrom=chr2L start=1 step=1
fixedStep chrom=chr2L start=5243 step=1
We cannot assume that we have scores for the 5242 bases between the
start position in the first declaration line and the start position in
the second. We can assume, however, that there is some gap in the
phastCons scores that precedes position 5243 on chr2L. If we look at
the conservation track in the Genome Browser, we can see that this
is true and that there is a ~38 base gap that precedes the start
position in the second declaration line:
http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=Matt%20Speir&hgS_otherUserSessionName=dm3_phastConsGap
I've highlighted base 5243 on chr2L in blue. Additionally, in this
session, you can see how the phastCons track is influenced by the
multiple alignment. The gap in the phastCons scores matches up with
the gap in the multiple alignment.
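If you would rather find these gaps from the file itself than from the browser, a rough Python sketch like the following could do it. The function name block_gaps is my own, and I am assuming step=1 and no span setting, as in chr2L.pp. The toy input is made up for illustration (the scores in its second block are not from chr2L.pp).

```python
def block_gaps(lines):
    """Return (chrom, gap_start, gap_end) tuples, 1-based inclusive, for the
    unscored bases between consecutive fixedStep blocks on one chromosome.

    Hypothetical helper for illustration; assumes step=1 and no span.
    """
    gaps = []
    chrom = None
    next_pos = None  # first base NOT covered by the blocks read so far
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("fixedStep"):
            fields = dict(item.split("=", 1) for item in line.split()[1:])
            start = int(fields["start"])
            if fields["chrom"] == chrom and start > next_pos:
                # The new block starts past the end of the previous one.
                gaps.append((chrom, next_pos, start - 1))
            chrom = fields["chrom"]
            next_pos = start
        else:
            if chrom is None:
                raise ValueError("score line found before any declaration line")
            next_pos += 1  # one score line covers one base when step=1
    return gaps


# Toy data, NOT taken from chr2L.pp: 4 scored bases, then a block at base 10.
toy = """fixedStep chrom=chr2L start=1 step=1
0.771
0.780
0.772
0.785
fixedStep chrom=chr2L start=10 step=1
0.100
0.200""".splitlines()

print(block_gaps(toy))  # [('chr2L', 5, 9)]
```

In the toy input the first block covers bases 1 through 4, so the gap reported before the second block is bases 5 through 9. Run on the real file, the same bookkeeping should report the gap that ends at base 5242.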
The purpose of converting the chr2L.pp file from fixedStep to
bedGraph was to give you a file format that is more explicit in
stating which base each line corresponds to and what the score is
for that position. You can read the bedGraph help page I linked in my
previous message to learn more about how bedGraph files are
formatted. If, after reading that page, you decide that the bedGraph
format fits your needs better, then you can convert the chr2L.pp
file from fixedStep to bedGraph using the Perl script I linked.
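To give a sense of what that conversion involves, here is a minimal Python sketch. It is my own illustration, not the script I linked, and the function name fixed_step_to_bedgraph is hypothetical. The main thing to notice is the coordinate shift: fixedStep positions are 1-based, while bedGraph uses zero-based, half-open chromStart/chromEnd columns, so base 1 becomes the interval 0 to 1. I am assuming every declaration line has chrom, start and step, and that span defaults to 1 when absent.

```python
def fixed_step_to_bedgraph(lines):
    """Yield tab-separated bedGraph lines (chrom, start, end, score).

    Hypothetical helper for illustration, not the linked conversion script.
    """
    chrom = pos = step = span = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("fixedStep"):
            fields = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = fields["chrom"]
            pos = int(fields["start"])
            step = int(fields["step"])
            span = int(fields.get("span", 1))
        else:
            if chrom is None:
                raise ValueError("score line found before any declaration line")
            # 1-based position -> zero-based, half-open interval.
            yield f"{chrom}\t{pos - 1}\t{pos - 1 + span}\t{line}"
            pos += step


example = """fixedStep chrom=chr2L start=1 step=1
0.771
0.780
0.772
0.785""".splitlines()

for row in fixed_step_to_bedgraph(example):
    print(row)
```

Each output row names its own position, so there is no need to count lines back to the nearest declaration, and a gap simply shows up as a jump between the end of one row and the start of the next.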