Fwd: Conservation score discrepancy between table browser and downloaded data

XiaoJu Zhang

Sep 8, 2015, 1:20:09 PM
to gen...@soe.ucsc.edu

Dear colleagues and team,

I have a question about a discrepancy I encountered when retrieving conservation scores (phastCons scores) from the UCSC Genome Browser using two approaches: querying the online Table Browser and looking up values in the downloaded .wigFix data. Details are appended at the end of this mail.

Here are my questions; I would really appreciate your help in understanding them:

  1. I assume that chr17.phastCons46way.placental.wigFix lists the conservation scores in order along the chromosome coordinates, as the header "start=1 step=1" indicates. Is that right?
  2. How would you explain the obvious differences between these two methods? What did I miss?
  3. Is there a downloadable data set with explicit position information included?


The chromosome I used for the test is chr17, for which I defined the region as "chr17 0 10".

In the table browser, I chose 
Clade: Mammal; Genome: Human; assembly: hg19; group: Comparative Genomics; track: Conservation; table: phastCons100way

Following is the output 
variableStep chrom=chr17 span=1 
1 0.0991496 
2 0.0929528 
3 0.0557717 
4 0.0495748 
5 0.0185906 
6 0.0123937 
7 0.00619685 
8 0.00619685 
9 0 
10 0

For the downloaded chr17.phastCons46way.placental.wigFix 
$ head chr17.phastCons46way.placental.wigFix 
fixedStep chrom=chr17 start=1 step=1 
0.096 
0.089 
0.071 
0.060 
0.023 
0.016 
0.010 
0.010 
0.010 
0.010

Thank you very much.

Ju


XiaoJu Zhang

Sep 8, 2015, 4:42:52 PM
to gen...@soe.ucsc.edu
My apologies for the wrong information in my last mail. I opened the wrong data file (46 species placental instead of 100 species), but the values still don't match. It does not look like a rounding issue to me.

Thanks,

Correction:
For the downloaded chr17.phastCons100way.wigFix

$ head -11 chr17.phastCons100way.wigFix
fixedStep chrom=chr17 start=1 step=1
0.102
0.093
0.059
0.050
0.021
0.013
0.011
0.008
0.003
0.002

Luvina Guruvadoo

Sep 16, 2015, 3:12:42 PM
to XiaoJu Zhang, gen...@soe.ucsc.edu
Hello Ju,

Thanks for your question. The data you retrieved from our Table Browser is in variableStep format, whereas the downloaded file is in fixedStep format. You can read more about the difference between these two formats in our help documentation here:
http://genome.ucsc.edu/goldenPath/help/wiggle.html
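
To make the format difference concrete: in a fixedStep file the position of each value is implied by the header (start, then start + step, and so on), whereas variableStep writes the position on every line. The following is a minimal Python sketch, not a UCSC tool, that expands fixedStep lines into explicit 1-based positions. The function name expand_fixed_step and the sample data (the first values quoted earlier in this thread) are only illustrative.

```python
def expand_fixed_step(lines):
    """Yield (chrom, position, score) tuples from fixedStep wiggle lines.

    Positions are 1-based, as in the wiggle format. Each new
    'fixedStep' header resets the chromosome, position, and step.
    """
    chrom, pos, step = None, None, 1
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("fixedStep"):
            # Header tokens look like key=value, e.g. chrom=chr17 start=1 step=1
            fields = dict(tok.split("=", 1) for tok in line.split()[1:])
            chrom = fields["chrom"]
            pos = int(fields["start"])
            step = int(fields.get("step", 1))
            continue
        if chrom is None:
            raise ValueError("data line found before any fixedStep header")
        yield chrom, pos, float(line)
        pos += step


# Sample input: the first lines of the wigFix output quoted in this thread.
sample = """fixedStep chrom=chr17 start=1 step=1
0.102
0.093
0.059
""".splitlines()

rows = list(expand_fixed_step(sample))
for chrom, pos, score in rows:
    # Prints in a variableStep-like "position score" layout.
    print(pos, score)
```

For a real download, the same function could be fed the lines of a decompressed wigFix file, for example via gzip.open(path, "rt").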

If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

- - -
Luvina Guruvadoo
UCSC Genome Bioinformatics Group




XiaoJu Zhang

Sep 16, 2015, 3:50:37 PM
to Luvina Guruvadoo, gen...@soe.ucsc.edu
Thank you, Luvina, for your reply.

For local queries of the conservation scores, I ended up downloading the bigWig dataset (http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/hg19.100way.phastCons.bw), and I found it to be more consistent with the Table Browser results; it seems to me that this is the data file you are referring to. The per-chromosome wigFix.gz files (http://hgdownload.soe.ucsc.edu/goldenPath/hg19/phastCons100way/hg19.100way.phastCons/), however, seem to be off by much more than a precision issue. For example, the total number of data points in chrY.phastCons100way.wigFix.gz is roughly 3/5 of the size of hg19 chrY, which is still a mystery to me, and I am sure it is something I just missed. I am hoping you can help me understand it.

Best regards,

Ju

On Wed, Sep 16, 2015 at 3:32 PM, Luvina Guruvadoo <luv...@soe.ucsc.edu> wrote:
Hello again, Ju.

A colleague of mine pointed out something on the help page I referred you to:

(paragraph 2):
For speed and efficiency, wiggle data is compressed and stored internally in 128 unique bins. This compression means that there is a minor loss of precision when data is exported from a wiggle track (i.e., with output format "data points" or "bed format" within the table browser). The bedGraph format should be used if it is important to retain exact data when exporting.

I hope this helps clarify things.


- - -
Luvina Guruvadoo
UCSC Genome Bioinformatics Group

Matthew Speir

Sep 23, 2015, 12:59:30 PM
to XiaoJu Zhang, Luvina Guruvadoo, gen...@soe.ucsc.edu
Hi Ju,

Thank you for your questions about the differences in conservation scores between the Table Browser and those on our download server. One of our engineers notes that this difference is not just the difference between variable step and fixed step wiggle files, but is really the difference between the original scores (provided in the wigFix file) and the heavily compressed version (output by the Table Browser).

She adds that if you look at the scores from both the variable step and fixed step files side-by-side, you can see how the scores can differ based on their rounding:

Table Browser                    Downloaded wigFix file
variableStep chrom=chr17 span=1  fixedStep chrom=chr17 start=1 step=1
1 0.0991496                      0.102                               
2 0.0929528                      0.093                               
3 0.0557717                      0.059                               
4 0.0495748                      0.050                               
5 0.0185906                      0.021                               
6 0.0123937                      0.013                               
7 0.00619685                     0.011                               
8 0.00619685                     0.008                               
9 0                              0.003  


Note how bases 7 and 8 have the same value in the Table Browser output (0.00619685), but different values in http://hgdownload.cse.ucsc.edu/goldenPath/hg19/phastCons100way/hg19.100way.phastCons/chr17.phastCons100way.wigFix.gz (0.011, 0.008). That can definitely be a little confusing. What is happening is that the Table Browser, since it only has the lossy-compressed version of the data, groups the scores for bases 7 and 8 (0.011 and 0.008) into the same wiggle bin, which causes the scores for these bases to differ from their original scores in the wigFix file.
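
The binning effect can be reproduced with a minimal Python sketch. It rests on an assumption inferred from the Table Browser output quoted in this thread, not taken from the UCSC source code: every Table Browser value shown is an integer multiple of 0.00619685, so the sketch truncates each original score down to the nearest multiple of that step and limits the bin index to 0..127 (128 bins). The names STEP, N_BINS, quantize, and decode are hypothetical.

```python
STEP = 0.00619685  # assumed bin width, inferred from the Table Browser output above
N_BINS = 128       # the help page mentions 128 unique bins


def quantize(score, lower=0.0, step=STEP):
    """Map a score to a bin index by truncation (assumed behaviour)."""
    idx = int((score - lower) / step)
    return max(0, min(N_BINS - 1, idx))


def decode(idx, lower=0.0, step=STEP):
    """Turn a bin index back into the value that would be exported."""
    return lower + idx * step


# Original scores for bases 1..10 from chr17.phastCons100way.wigFix
original = [0.102, 0.093, 0.059, 0.050, 0.021, 0.013, 0.011, 0.008, 0.003, 0.002]

bins = [quantize(s) for s in original]
decoded = [decode(b) for b in bins]

for base, (orig, dec) in enumerate(zip(original, decoded), start=1):
    print(base, orig, round(dec, 7))
```

Under this assumption, bases 7 and 8 (0.011 and 0.008) both fall into bin 1 and decode to the same value, and bases 9 and 10 fall into bin 0, which matches the pattern in the Table Browser output.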

If you are looking for the original phastCons scores, I would recommend using the data files available from our downloads server here:
http://hgdownload.soe.ucsc.edu/goldenPath/hg38/phastCons100way/

As for the missing data on chrY, if you look in the Genome Browser, you will notice that much of chrY is covered by gaps. Conservation scores cannot be calculated for gap regions, so these regions are excluded from the files. You can see this in the browser here:
http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=mspeir&hgS_otherUserSessionName=hg19_chrYgaps
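
One way to check this locally is to count how many bases a wigFix file actually scores and compare that with the chromosome length. The sketch below is only an illustration: it assumes step=1 and span=1, as in the phastCons headers shown in this thread, and the function name scored_blocks, the sample data, and the chromosome size of 10 are all made up for the example.

```python
def scored_blocks(lines):
    """Return (start, end) ranges, 1-based inclusive, of scored bases.

    Assumes every fixedStep block uses step=1 and span=1, so a block
    with n values starting at 'start' covers start .. start + n - 1.
    """
    blocks = []  # each entry is [start, number_of_values]
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("fixedStep"):
            fields = dict(tok.split("=", 1) for tok in line.split()[1:])
            blocks.append([int(fields["start"]), 0])
        elif blocks:
            blocks[-1][1] += 1
    return [(start, start + n - 1) for start, n in blocks if n > 0]


# Made-up example: two scored blocks separated by an unscored region.
sample = """fixedStep chrom=chrY start=1 step=1
0.1
0.2
0.3
fixedStep chrom=chrY start=8 step=1
0.4
0.5
""".splitlines()

chrom_size = 10  # hypothetical chromosome length for this toy example
blocks = scored_blocks(sample)
covered = sum(end - start + 1 for start, end in blocks)
fraction = covered / chrom_size
print(blocks, covered, fraction)
```

In a real file, each new fixedStep header marks the start of a scored stretch, and the jumps between the end of one block and the start of the next correspond to regions without scores, such as assembly gaps.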

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group