On Sat, Jul 19, 2008 at 5:26 AM, Karl Dykema <karl....@gmail.com> wrote:
>
> We have recently started a small project using the custom canine SNP
> arrays developed by the Broad Institute. (http://www.broad.mit.edu/
> mammals/dog/caninearrayfaq.html - array version 2.) Making chromosome
> copy number calls is our first goal. We don't have any 'normal' copy
> number samples at this point, can we even make the calls without this
> reference?
I'm no expert on SNP array analysis, but I understand that using the
robust average of the data set as a reference sample may be an option
for you. Please see the vignette on Total Copy Number Analysis for
the 10K, 100K, 500K human chips:
http://groups.google.com/group/aroma-affymetrix/web/total-copy-number-analysis.
> They give the proper CDF file but as I understand it we will also need
> the UGP and UFL files. The problem is that I don't think they exist
> because this is a custom chip. Do we actually need them and if so
> could they be created from the other files available on their web
> page? (AffyV2_all_probeset.txt and v2.feature-response.txt)
As far as I know, you will definitely need UGP and UFL files. If you
go to the main group page (same as the above URL, but without the bit
following the last /) you can do a search for UFL and/or UGP and you
should be able to find several discussions or documentation pages
which should give you enough information to build your own.
Hope this helps,
Ken
On Sun, Jul 20, 2008 at 4:58 AM, Ken Simpson <ken.m....@gmail.com> wrote:
>
> Hi Karl,
>
> On Sat, Jul 19, 2008 at 5:26 AM, Karl Dykema <karl....@gmail.com> wrote:
>>
>> We have recently started a small project using the custom canine SNP
>> arrays developed by the Broad Institute. (http://www.broad.mit.edu/
>> mammals/dog/caninearrayfaq.html - array version 2.) Making chromosome
>> copy number calls is our first goal. We don't have any 'normal' copy
>> number samples at this point, can we even make the calls without this
>> reference?
>
> I'm no expert on SNP array analysis, but I understand that using the
> robust average of the data set as a reference sample may be an option
> for you. Please see the vignette on Total Copy Number Analysis for
> the 10K, 100K, 500K human chips:
> http://groups.google.com/group/aroma-affymetrix/web/total-copy-number-analysis.
Yes, as Ken says, if you have enough arrays and you can assume that at
any locus more than 50% (theoretically) are copy neutral, the robust
average (median) at that locus will be an estimator of the
copy-neutral mean.
I don't know how many arrays you've got, but the more you have the
better, not only for the above issue but also got get less noise
estimates. A bit more than two years ago I was asked to do CN
analysis on 3(!) 500K arrays. The obvious was to go and download the
public HapMap samples and use those as a pooled reference. However,
it turned out that when more *in-house* hybridizations were produced
they were much better suited as a reference than the outside-produced
HapMap data. Already 10-15 in-house "unknown" (CN aberrant) arrays
were better than 270 HapMap samples as a reference! So, collect as
much Canine SNP hybridizations as possible from the same microarray
facility.
>
>> They give the proper CDF file but as I understand it we will also need
>> the UGP and UFL files. The problem is that I don't think they exist
>> because this is a custom chip. Do we actually need them and if so
>> could they be created from the other files available on their web
>> page? (AffyV2_all_probeset.txt and v2.feature-response.txt)
>
> As far as I know, you will definitely need UGP and UFL files. If you
> go to the main group page (same as the above URL, but without the bit
> following the last /) you can do a search for UFL and/or UGP and you
> should be able to find several discussions or documentation pages
> which should give you enough information to build your own.
You will need the UGP file for the segmentation, because they contain
(chromosome, position) information. However, you can do without the
UFL file, which contains fragment-length information. The UFL file is
used for the fragment-length normalization, and we have shown that it
boost the signal, but you will still get decent results without it.
So, I had a look at the DogSty06m520431 (aka "Canine SNP v2") chip
type and what annotation data is available. I could not locate any
fragment-length information, so for now you have to live without the
UFL file/FragmentLengthNormalization. However, the
'AffyV2_all_probeset.txt' file contains (chromosome, position)
information, so I create UGP files for both the full size CDF and the
"Platinum" CDF. I've made them available on the new chip type page
'DogSty06m520431' at:
http://groups.google.com/group/aroma-affymetrix/web/dogsty06m520431
Note the comment that you should rename the "Platinum" CDF. FYI,
there is also a code example at the bottom on how to create the UGP
files from scratch.
Let us know how it goes.
Henrik
>
> Hope this helps,
>
> Ken
>
> >
>
First of all, there is a typo in the error message: it should read
that "object 'gi' is [not] compatible with the CDF". The check is
currently by comparing the number of units in the located UGP file and
the CDF and they don't match [FYI, next release will report the reason
for why it thinks it is not compatible]. This makes me think that
somehow your CDF is not the correct one. Be careful to compare your
output to what is listed on online Page 'DogSty06m520431'
[http://groups.google.com/group/aroma-affymetrix/web/dogsty06m520431].
Here is what I get for the "complete" CDF. You should get the same:
> cdf <- AffymetrixCdfFile$byChipType("DogSty06m520431");
> print(cdf);
AffymetrixCdfFile:
Path: annotationData/chipTypes/DogSty06m520431
Filename: DogSty06m520431.cdf
Filesize: 84.63MB
Chip type: DogSty06m520431
RAM: 0.00MB
File format: v4 (binary; XDA)
Dimension: 1612x1612
Number of cells: 2598544
Number of units: 127201
Cells per unit: 20.43
Number of QC units: 0
> print(getChecksum(cdf));
[1] "4977ab60affc25feafa74602da5d5c0f"
If you get the same, then confirm that the following also works for you:
> gi <- getGenomeInformation(cdf);
> gi
UgpGenomeInformation:
Name: DogSty06m520431
Tags: Broad20080721,HB20080721
Pathname: C:/Documents and Settings/hb/My Documents/My Data/annotationData/chipT
ypes/DogSty06m520431/DogSty06m520431,Broad20080721,HB20080721.ugp
File size: 621.49kB
RAM: 0.00MB
Chip type: DogSty06m520431
> print(getChecksum(gi));
[1] "01814716affae7720a6e9600a2648fe7"
> nbrOfUnits(gi);
[1] 127201
Now, if that works, check that your "monocell" CDF is valid:
> cdfM <- getMonocellCdf(cdf);
> print(cdfM);
AffymetrixCdfFile:
Path: annotationData/chipTypes/DogSty06m520431
Filename: DogSty06m520431,monocell.CDF
Filesize: 57.24MB
Chip type: DogSty06m520431,monocell
RAM: 0.00MB
File format: v4 (binary; XDA)
Dimension: 714x713
Number of cells: 509082
Number of units: 127201
Cells per unit: 4.00
Number of QC units: 0
> print(getChecksum(cdfM));
[1] "ea6d31089a3c65cc46b70035c33c2790"
Next validate that you can get the same UGP file via the monocell CDF:
> giM <- getGenomeInformation(cdfM);
> stopifnot(equals(giM, gi));
If you in any of the above get reported that the number of units is
'49732', then you have a file that is for the "Platinum" CDF and not
the complete one. It is important that you do not mix these up. If
it turns out that your monocell CDF is for the "Platinum" one, the
remove it and redo the above.
>
> The vignette says that the monocell CDF file should be created at a
> previous fit() command but that does not appear be happening.
It is created "quietly" in the background and if you study the verbose
output you will see it was created. Note that it is only created once
[of the life-time of your R installation; not per session], so you
won't see it the next time you call fit(). From the above error
message reporting "DogSty06m520431,monocell" we know that it has been
created.
> I am
> hoping you might see something wrong with my script which I will
> include:
>
>
> cs <- AffymetrixCelSet$fromName(arraySet, chipType=chipType)
> acc <- AllelicCrosstalkCalibration(cs)
> csC <- process(acc, verbose=verbose)
> plm <- AvgCnPlm(csC, mergeStrands=TRUE, combineAlleles=TRUE, shift=
> +300)
> fit(plm, verbose=verbose)
> ces <- getChipEffectSet(plm)
> #fln <- FragmentLengthNormalization(ces)
> #cesN <- process(fln, verbose=verbose)
> #print(cesN)
> cesN <- ces
> cbs <- CbsModel(cesN)
> rawCNs <- extractRawCopyNumbers(cbs, array=1, chromosome=1)
Please do not exclude the most important information such as what
'arraySet' and 'chipType' is. As you see from my answer above, it is
very useful to also see the value of print(<object>) of the different
steps.
Otherwise the above looks correct to me.
/Henrik
good that you located the CDF issue.
I think your most recent problem will be solved if you delete all files in:
plmData/duesDog,ACC,-XY,AVG,+300,A+B/DogSty06m520431/
That directory contains chip effect CEL files that was generated for
the previous incorrect monocell CDF. Now it tries to reuse those for
the new monocell CDF and therefore the error message.
Cheers
/Henrik
that error is due to an assertion that test for "too many" missing
values, which would indicate that something went wrong in the earlier
step of the processing. This is useful when you batch process you
data.
The default threshold is 20% missing values. You can change this by
specifying argument 'maxNAFraction', e.g.
rawCNs <- extractRawCopyNumbers(cbs, array=1, chromosome=1, maxNAFraction=3/5);
More importantly, you get > 50% missing values while processing your
arrays. That's quite a bit, especially if you expect Chr1 to be a
copy neutral chromosome. For humans, you can get quite a few missing
values for females on ChrY because there you have CN=0 and your
estimated signals are close to zero or even non-positive which turn
into NAs on the log-scale.
The allelic crosstalk estimates and corrects for offset and crosstalk.
This correction introduce non-positive probe signals. This cause a
problem on the log scale. To protect against this, we shift probe
signals shift=+300 before doing the summarization. The summarization
in turn lower the risk for non-positive estimates, but not completely.
Thus, some non-positive signals will remain and show up as
non-positive chip effect estimates if you use the AvgPlm and NAs if
you use RmaPlm. For a detailed discussion on this, see the CRMA
paper:
H. Bengtsson; R. Irizarry; B. Carvalho; T.P. Speed, Estimation and
assessment of raw copy numbers at the single locus level,
Bioinformatics, 2008. [pmid: 18204055] [doi:
10.1093/bioinformatics/btn016]
As a first run, you could replace the AllelicCrosstalkCalibration with
a QuantileNormalization. That will you a "lower bound" on what kind
of raw CN estimates you can get.
Note, don't expect too much with only five arrays (I am working on
methods to improve this, but that's far in the future). I recommend
you to get more references. If they are from the same facility that
is even better. Have Broad check with other group who run
DogSty06m520431 and see if they are (also) willing to share their data
anonymously. It's a win-win situation for everyone.
Cheers
Henrik