You know how I've been mentioning how I keep getting rare residuals
that don't match what they should be (way off -- not computer
tolerance issues), but that I couldn't replicate the problem reliably?
When I would calculate the residuals genome wide, the problem would
occur, but if I just specified specific units then it wouldn't happen.
I think I have found a possible solution. I think that the 'core' cdf
that Ken created for the exon arrays assigns some probesets (and thus
their probes) to multiple units. As you warned with Mark's cdf for
Ensembl, this will create a problem for storing residuals since they
will override. Perhaps Mark commented on this already(?) but I've been
away from aroma.affymetrix for several months so I haven't been
following. But I feel much better having found a plausible reason for
it.
I think we need to update the exon cdfs that Ken made anyway (since
the Affy annotation changed in March so they are no longer current)
but now I am perhaps more motivated to delve into the world of
creating cdfs.
Best,
Elizabeth
PS I have posted this to the group as well, but wanted to send it
directly to you since it was a follow up from a previous
conversation.
Example: the probeset "2436147" exists as part of the transcript
cluster (unit) "2436147", but also its own transcript cluster
"2436147".
> readUnits(cdfCore,unit=indexOf(cdfCore,"2436132"))[[1]]$groups[["2436147"]]
$x
[1] 2348 1782 512 1328
$y
[1] 1182 1329 1159 469
$pbase
[1] "T" "G" "T" "G"
$tbase
[1] "A" "C" "A" "C"
$expos
[1] 0 1 2 3
$direction
[1] 1
> readUnits(cdfCore,unit=indexOf(cdfCore,"2436147"))[[1]]$groups
$`2436147`
$`2436147`$x
[1] 2348 1782 512 1328
$`2436147`$y
[1] 1182 1329 1159 469
$`2436147`$pbase
[1] "T" "G" "T" "G"
$`2436147`$tbase
[1] "A" "C" "A" "C"
$`2436147`$expos
[1] 0 1 2 3
$`2436147`$direction
[1] 1
good that you found out. It cannot see a straightforward solution to
the issue where multiple unit groups refer to a common set of probes.
This is a problem because it breaks the underlying assumption in the
design that there is a one-to-one map of "input" and "output" cells.
The result is, as you say, that probe affinity estimates, residuals
and weights are overwritten. In order to deal with this some of the
basic framework needs to be updates and a more generic file structures
is needed. This is hard.
For the record, to find out if a CDF has multiple references to the
same cell do:
> cdf <- AffymetrixCdfFile$fromChipType("MoEx-1_0-st-v1")
> cells <- getCellIndices(cdf, unlist=TRUE, useNames=FALSE)
> str(cells)
int [1:5266159] 5994477 2030429 4309592 5663393 340560 115236
967170 2214061 977049 3384045 ...
t <- table(cells)
reps <- t[t > 1]
> reps
1
1023
So, in this case the cell with index 1 is replicated 1023 times. With
the CDF that Elizabeth is referring to there will be more replicates.
/Henrik
There are 522 core probesets (containing 1994 probes) that are assigned
to multiple units. Each of the 522 probesets shows up as an 'singleton'
unit (i.e. a unit with only that probeset as a group) and also in a
longer unit with multiple probesets. There are only 73 units involved,
so some units have a large number of these probesets (as many as 41
probesets in 1 unit with this problem! see table below). If anyone would
like the list of probesets or units that are involved, I can make that
available on the help pages.
As a side note, there was some discussion before about these singleton
units and what should be the group name (whether the group name is just
blank ("") or if the group name is the same as the unit name). In the
core cdf, both seem to occur (not because of my 522 'special' ones above
either); of the 4736 singleton units, 4129 are blank group names and 607
have the group name as the unit name. I don't know if there is a
purposeful distinction here or not.
Best,
Elizabeth
Table of distribution of probesets/groups per unit:
Numb. Pbsets per Unit Numb. Units
1 14
2 8
3 6
4 5
5 6
6 1
7 7
8 1
9 6
10 1
11 4
12 1
13 3
14 1
15 2
16 2
18 1
19 1
22 1
26 1
41 1