My problem with the residuals -- solved??

23 views
Skip to first unread message

epurdom

unread,
Oct 19, 2007, 1:17:53 AM10/19/07
to aroma.affymetrix, h...@stat.berkeley.edu, mrob...@wehi.edu.au, ken.m....@gmail.com
Hi Henrik,

You know how I've been mentioning how I keep getting rare residuals
that don't match what they should be (way off -- not computer
tolerance issues), but that I couldn't replicate the problem reliably?
When I would calculate the residuals genome wide, the problem would
occur, but if I just specified specific units then it wouldn't happen.
I think I have found a possible solution. I think that the 'core' cdf
that Ken created for the exon arrays assigns some probesets (and thus
their probes) to multiple units. As you warned with Mark's cdf for
Ensembl, this will create a problem for storing residuals since they
will override. Perhaps Mark commented on this already(?) but I've been
away from aroma.affymetrix for several months so I haven't been
following. But I feel much better having found a plausible reason for
it.

I think we need to update the exon cdfs that Ken made anyway (since
the Affy annotation changed in March so they are no longer current)
but now I am perhaps more motivated to delve into the world of
creating cdfs.

Best,
Elizabeth

PS I have posted this to the group as well, but wanted to send it
directly to you since it was a follow up from a previous
conversation.

Example: the probeset "2436147" exists as part of the transcript
cluster (unit) "2436147", but also its own transcript cluster
"2436147".

> readUnits(cdfCore,unit=indexOf(cdfCore,"2436132"))[[1]]$groups[["2436147"]]
$x
[1] 2348 1782 512 1328

$y
[1] 1182 1329 1159 469

$pbase
[1] "T" "G" "T" "G"

$tbase
[1] "A" "C" "A" "C"

$expos
[1] 0 1 2 3

$direction
[1] 1

> readUnits(cdfCore,unit=indexOf(cdfCore,"2436147"))[[1]]$groups
$`2436147`
$`2436147`$x
[1] 2348 1782 512 1328

$`2436147`$y
[1] 1182 1329 1159 469

$`2436147`$pbase
[1] "T" "G" "T" "G"

$`2436147`$tbase
[1] "A" "C" "A" "C"

$`2436147`$expos
[1] 0 1 2 3

$`2436147`$direction
[1] 1

Henrik Bengtsson

unread,
Oct 19, 2007, 1:41:54 AM10/19/07
to aroma-af...@googlegroups.com
Hi,

good that you found out. It cannot see a straightforward solution to
the issue where multiple unit groups refer to a common set of probes.
This is a problem because it breaks the underlying assumption in the
design that there is a one-to-one map of "input" and "output" cells.
The result is, as you say, that probe affinity estimates, residuals
and weights are overwritten. In order to deal with this some of the
basic framework needs to be updates and a more generic file structures
is needed. This is hard.

For the record, to find out if a CDF has multiple references to the
same cell do:

> cdf <- AffymetrixCdfFile$fromChipType("MoEx-1_0-st-v1")
> cells <- getCellIndices(cdf, unlist=TRUE, useNames=FALSE)
> str(cells)
int [1:5266159] 5994477 2030429 4309592 5663393 340560 115236
967170 2214061 977049 3384045 ...
t <- table(cells)
reps <- t[t > 1]
> reps
1
1023

So, in this case the cell with index 1 is replicated 1023 times. With
the CDF that Elizabeth is referring to there will be more replicates.

/Henrik

Elizabeth Purdom

unread,
Oct 19, 2007, 3:23:16 PM10/19/07
to aroma-af...@googlegroups.com
Hi,
I think for the exon arrays that you don't really want to have a common
set of probes. In what I discovered, this is a mistake (I don't know if
it's from Ken creation of the cdf or Affy's original annotation). For
Mark's situation, it's a bit more questionable since some probes map to
regions of the genome that refseq (or ensemble, etc.) says actually are
parts of multiple genes. But I guess I would argue that you are not
going to be able to easily disintangle this confounding of the two
expression levels (and moving to refseq annotation instead of Affy's is
partly to be able to avoid the problems this creates), so those probes
will mess up your single-gene analysis. So for me (so far) I don't see
any need to actually address this problem other than to perhaps give
warnings/checks of the cdfs to let the user know. Maybe some automatic
way to throw them out of the cdf, but really it's not fair to expect
aroma.affymetrix to deal with the problem. Or perhaps we (Mark and I?)
should develop something specific to exon arrays to ignore replicated
probes, since it seems like it might be rather specific, but perhaps
common, to the exon arrays right now.
Best,
Elizabeth

Elizabeth Purdom

unread,
Oct 19, 2007, 3:46:43 PM10/19/07
to aroma-af...@googlegroups.com
Hi,
So I have dug a bit and I was just going to characterize the depth of
the multiple probes problem with the core cdf in case there is anyone
else out there that would like to know. (I haven't looked at the full
set of probesets, though my code would be the same)

There are 522 core probesets (containing 1994 probes) that are assigned
to multiple units. Each of the 522 probesets shows up as an 'singleton'
unit (i.e. a unit with only that probeset as a group) and also in a
longer unit with multiple probesets. There are only 73 units involved,
so some units have a large number of these probesets (as many as 41
probesets in 1 unit with this problem! see table below). If anyone would
like the list of probesets or units that are involved, I can make that
available on the help pages.

As a side note, there was some discussion before about these singleton
units and what should be the group name (whether the group name is just
blank ("") or if the group name is the same as the unit name). In the
core cdf, both seem to occur (not because of my 522 'special' ones above
either); of the 4736 singleton units, 4129 are blank group names and 607
have the group name as the unit name. I don't know if there is a
purposeful distinction here or not.

Best,
Elizabeth

Table of distribution of probesets/groups per unit:

Numb. Pbsets per Unit Numb. Units
1 14
2 8
3 6
4 5
5 6
6 1
7 7
8 1
9 6
10 1
11 4
12 1
13 3
14 1
15 2
16 2
18 1
19 1
22 1
26 1
41 1

Reply all
Reply to author
Forward
0 new messages