Custom Canine SNP

17 views
Skip to first unread message

Karl Dykema

unread,
Jul 18, 2008, 3:26:07 PM7/18/08
to aroma.affymetrix
Hello All,

We have recently started a small project using the custom canine SNP
arrays developed by the Broad Institute. (http://www.broad.mit.edu/
mammals/dog/caninearrayfaq.html - array version 2.) Making chromosome
copy number calls is our first goal. We don't have any 'normal' copy
number samples at this point, can we even make the calls without this
reference?

They give the proper CDF file but as I understand it we will also need
the UGP and UFL files. The problem is that I don't think they exist
because this is a custom chip. Do we actually need them and if so
could they be created from the other files available on their web
page? (AffyV2_all_probeset.txt and v2.feature-response.txt)

Thanks in advance for any help, advice or tips!

-karl

Ken Simpson

unread,
Jul 20, 2008, 7:58:30 AM7/20/08
to aroma-af...@googlegroups.com
Hi Karl,

On Sat, Jul 19, 2008 at 5:26 AM, Karl Dykema <karl....@gmail.com> wrote:
>
> We have recently started a small project using the custom canine SNP
> arrays developed by the Broad Institute. (http://www.broad.mit.edu/
> mammals/dog/caninearrayfaq.html - array version 2.) Making chromosome
> copy number calls is our first goal. We don't have any 'normal' copy
> number samples at this point, can we even make the calls without this
> reference?

I'm no expert on SNP array analysis, but I understand that using the
robust average of the data set as a reference sample may be an option
for you. Please see the vignette on Total Copy Number Analysis for
the 10K, 100K, 500K human chips:
http://groups.google.com/group/aroma-affymetrix/web/total-copy-number-analysis.

> They give the proper CDF file but as I understand it we will also need
> the UGP and UFL files. The problem is that I don't think they exist
> because this is a custom chip. Do we actually need them and if so
> could they be created from the other files available on their web
> page? (AffyV2_all_probeset.txt and v2.feature-response.txt)

As far as I know, you will definitely need UGP and UFL files. If you
go to the main group page (same as the above URL, but without the bit
following the last /) you can do a search for UFL and/or UGP and you
should be able to find several discussions or documentation pages
which should give you enough information to build your own.

Hope this helps,

Ken

Henrik Bengtsson

unread,
Jul 21, 2008, 2:49:31 PM7/21/08
to aroma-af...@googlegroups.com
Hi Karl,

On Sun, Jul 20, 2008 at 4:58 AM, Ken Simpson <ken.m....@gmail.com> wrote:
>
> Hi Karl,
>
> On Sat, Jul 19, 2008 at 5:26 AM, Karl Dykema <karl....@gmail.com> wrote:
>>
>> We have recently started a small project using the custom canine SNP
>> arrays developed by the Broad Institute. (http://www.broad.mit.edu/
>> mammals/dog/caninearrayfaq.html - array version 2.) Making chromosome
>> copy number calls is our first goal. We don't have any 'normal' copy
>> number samples at this point, can we even make the calls without this
>> reference?
>
> I'm no expert on SNP array analysis, but I understand that using the
> robust average of the data set as a reference sample may be an option
> for you. Please see the vignette on Total Copy Number Analysis for
> the 10K, 100K, 500K human chips:
> http://groups.google.com/group/aroma-affymetrix/web/total-copy-number-analysis.

Yes, as Ken says, if you have enough arrays and you can assume that at
any locus more than 50% (theoretically) are copy neutral, the robust
average (median) at that locus will be an estimator of the
copy-neutral mean.

I don't know how many arrays you've got, but the more you have the
better, not only for the above issue but also got get less noise
estimates. A bit more than two years ago I was asked to do CN
analysis on 3(!) 500K arrays. The obvious was to go and download the
public HapMap samples and use those as a pooled reference. However,
it turned out that when more *in-house* hybridizations were produced
they were much better suited as a reference than the outside-produced
HapMap data. Already 10-15 in-house "unknown" (CN aberrant) arrays
were better than 270 HapMap samples as a reference! So, collect as
much Canine SNP hybridizations as possible from the same microarray
facility.

>
>> They give the proper CDF file but as I understand it we will also need
>> the UGP and UFL files. The problem is that I don't think they exist
>> because this is a custom chip. Do we actually need them and if so
>> could they be created from the other files available on their web
>> page? (AffyV2_all_probeset.txt and v2.feature-response.txt)
>
> As far as I know, you will definitely need UGP and UFL files. If you
> go to the main group page (same as the above URL, but without the bit
> following the last /) you can do a search for UFL and/or UGP and you
> should be able to find several discussions or documentation pages
> which should give you enough information to build your own.

You will need the UGP file for the segmentation, because they contain
(chromosome, position) information. However, you can do without the
UFL file, which contains fragment-length information. The UFL file is
used for the fragment-length normalization, and we have shown that it
boost the signal, but you will still get decent results without it.

So, I had a look at the DogSty06m520431 (aka "Canine SNP v2") chip
type and what annotation data is available. I could not locate any
fragment-length information, so for now you have to live without the
UFL file/FragmentLengthNormalization. However, the
'AffyV2_all_probeset.txt' file contains (chromosome, position)
information, so I create UGP files for both the full size CDF and the
"Platinum" CDF. I've made them available on the new chip type page
'DogSty06m520431' at:

http://groups.google.com/group/aroma-affymetrix/web/dogsty06m520431

Note the comment that you should rename the "Platinum" CDF. FYI,
there is also a code example at the bottom on how to create the UGP
files from scratch.

Let us know how it goes.

Henrik

>
> Hope this helps,
>
> Ken
>
> >
>

Karl Dykema

unread,
Jul 23, 2008, 4:33:35 PM7/23/08
to aroma.affymetrix
Ken and Henrik, Thanks a lot for the help and for the UGP files.
Unfortunately we only have a small number of samples (5 so far) so I
do not know if it will work to use them as the 'normal reference.'
Only one way to find out though, right? I also am trying to obtain
some 'normal' samples from the Broad.

I am attempting to follow along with the "Total Copy Number Analysis
(5.0 & 6.0)" vignette (without the fragment length normalization bit),
but I must be doing something wrong because it is giving me an error
about the new UGP file:


> rawCNs <- extractRawCopyNumbers(cbs, array=1, chromosome=1)
Error in list("extractRawCopyNumbers(cbs, array = 1, chromosome = 1)"
= <environment>, :

[2008-07-23 16:21:54] Exception: Cannot set genome information. The
UgpGenomeInformation object 'gi' is compatible with the CDF:
DogSty06m520431 != DogSty06m520431,monocell
at throw(Exception(...))
at throw.default("Cannot set genome information. The ", class(gi)
[1], " object
at throw("Cannot set genome information. The ", class(gi)[1], "
object 'gi' is
at setGenomeInformation.AffymetrixCdfFile(this, gi)
at setGenomeInformation(this, gi)
at getGenomeInformation.AffymetrixCdfFile(X[[1]], ...)
at FUN(X[[1]], ...)
at lapply.default(cdfList, getGenomeInformation, verbose =
less(verbose))
at lapply(cdfList, getGenomeInformation, verbose = less(verbose))
at getListOfGenomeInformations.ChromosomalModel(this)
at getListOfGenomeInformations(this)
at getChromosomes.ChromosomalModel(this)
at getChromosomes(this)
at extractRawCopyNumbers.CopyNumberChromosomalModel(cbs, array = 1,
chromosome
at extractRawCopyNumbers(cbs, array = 1, chromosome = 1)

The vignette says that the monocell CDF file should be created at a
previous fit() command but that does not appear be happening. I am
hoping you might see something wrong with my script which I will
include:


cs <- AffymetrixCelSet$fromName(arraySet, chipType=chipType)
acc <- AllelicCrosstalkCalibration(cs)
csC <- process(acc, verbose=verbose)
plm <- AvgCnPlm(csC, mergeStrands=TRUE, combineAlleles=TRUE, shift=
+300)
fit(plm, verbose=verbose)
ces <- getChipEffectSet(plm)
#fln <- FragmentLengthNormalization(ces)
#cesN <- process(fln, verbose=verbose)
#print(cesN)
cesN <- ces
cbs <- CbsModel(cesN)
rawCNs <- extractRawCopyNumbers(cbs, array=1, chromosome=1)

Thanks again for the help!

-karl

Henrik Bengtsson

unread,
Jul 23, 2008, 6:24:15 PM7/23/08
to aroma-af...@googlegroups.com
Hi Karl.

First of all, there is a typo in the error message: it should read
that "object 'gi' is [not] compatible with the CDF". The check is
currently by comparing the number of units in the located UGP file and
the CDF and they don't match [FYI, next release will report the reason
for why it thinks it is not compatible]. This makes me think that
somehow your CDF is not the correct one. Be careful to compare your
output to what is listed on online Page 'DogSty06m520431'
[http://groups.google.com/group/aroma-affymetrix/web/dogsty06m520431].

Here is what I get for the "complete" CDF. You should get the same:

> cdf <- AffymetrixCdfFile$byChipType("DogSty06m520431");
> print(cdf);
AffymetrixCdfFile:
Path: annotationData/chipTypes/DogSty06m520431
Filename: DogSty06m520431.cdf
Filesize: 84.63MB
Chip type: DogSty06m520431
RAM: 0.00MB
File format: v4 (binary; XDA)
Dimension: 1612x1612
Number of cells: 2598544
Number of units: 127201
Cells per unit: 20.43
Number of QC units: 0
> print(getChecksum(cdf));
[1] "4977ab60affc25feafa74602da5d5c0f"


If you get the same, then confirm that the following also works for you:

> gi <- getGenomeInformation(cdf);
> gi
UgpGenomeInformation:
Name: DogSty06m520431
Tags: Broad20080721,HB20080721
Pathname: C:/Documents and Settings/hb/My Documents/My Data/annotationData/chipT
ypes/DogSty06m520431/DogSty06m520431,Broad20080721,HB20080721.ugp
File size: 621.49kB
RAM: 0.00MB
Chip type: DogSty06m520431
> print(getChecksum(gi));
[1] "01814716affae7720a6e9600a2648fe7"
> nbrOfUnits(gi);
[1] 127201

Now, if that works, check that your "monocell" CDF is valid:

> cdfM <- getMonocellCdf(cdf);
> print(cdfM);
AffymetrixCdfFile:
Path: annotationData/chipTypes/DogSty06m520431
Filename: DogSty06m520431,monocell.CDF
Filesize: 57.24MB
Chip type: DogSty06m520431,monocell
RAM: 0.00MB
File format: v4 (binary; XDA)
Dimension: 714x713
Number of cells: 509082
Number of units: 127201
Cells per unit: 4.00
Number of QC units: 0
> print(getChecksum(cdfM));
[1] "ea6d31089a3c65cc46b70035c33c2790"

Next validate that you can get the same UGP file via the monocell CDF:
> giM <- getGenomeInformation(cdfM);
> stopifnot(equals(giM, gi));

If you in any of the above get reported that the number of units is
'49732', then you have a file that is for the "Platinum" CDF and not
the complete one. It is important that you do not mix these up. If
it turns out that your monocell CDF is for the "Platinum" one, the
remove it and redo the above.

>
> The vignette says that the monocell CDF file should be created at a
> previous fit() command but that does not appear be happening.

It is created "quietly" in the background and if you study the verbose
output you will see it was created. Note that it is only created once
[of the life-time of your R installation; not per session], so you
won't see it the next time you call fit(). From the above error
message reporting "DogSty06m520431,monocell" we know that it has been
created.

> I am
> hoping you might see something wrong with my script which I will
> include:
>
>
> cs <- AffymetrixCelSet$fromName(arraySet, chipType=chipType)
> acc <- AllelicCrosstalkCalibration(cs)
> csC <- process(acc, verbose=verbose)
> plm <- AvgCnPlm(csC, mergeStrands=TRUE, combineAlleles=TRUE, shift=
> +300)
> fit(plm, verbose=verbose)
> ces <- getChipEffectSet(plm)
> #fln <- FragmentLengthNormalization(ces)
> #cesN <- process(fln, verbose=verbose)
> #print(cesN)
> cesN <- ces
> cbs <- CbsModel(cesN)
> rawCNs <- extractRawCopyNumbers(cbs, array=1, chromosome=1)

Please do not exclude the most important information such as what
'arraySet' and 'chipType' is. As you see from my answer above, it is
very useful to also see the value of print(<object>) of the different
steps.

Otherwise the above looks correct to me.

/Henrik

Karl Dykema

unread,
Jul 24, 2008, 11:38:29 AM7/24/08
to aroma.affymetrix
Hi Henrick,

I think I figured out my previous problem. I had renamed the wrong CDF
file which then spawned an incorrect monocel CDF. Now I have that
straightened and my monocel matches the CDF I am running into a
different problem. It says that my CEL files have more cells than the
CDF. This is the error I get when I try to fit the model:

"The specified CDF structure ('DogSty06m520431,monocell') is not
compatible with the chip type ('DogSty06m520431,monocell') of the CEL
file. The number of cells do not match: 509082 != 207480."

I am not sure what this new problem might be. Does it look like
anything you've seen before? Thanks a lot!

-karl



> arraySet <- c("duesDog")
> chipType <- c("DogSty06m520431")
>
> cdf <- AffymetrixCdfFile$byChipType("DogSty06m520431");
> print(cdf);
AffymetrixCdfFile:
Path: annotationData/chipTypes/DogSty06m520431
Filename: DogSty06m520431.cdf
Filesize: 84.63MB
Chip type: DogSty06m520431
RAM: 0.00MB
File format: v4 (binary; XDA)
Dimension: 1612x1612
Number of cells: 2598544
Number of units: 127201
Cells per unit: 20.43
Number of QC units: 0
> gi <- getGenomeInformation(cdf);
> gi
UgpGenomeInformation:
Name: DogSty06m520431
Tags:
Pathname: annotationData/chipTypes/DogSty06m520431/DogSty06m520431.ugp
File size: 621.49kB
RAM: 0.00MB
Chip type: DogSty06m520431
> nbrOfUnits(gi);
[1] 127201
> cdfM <- getMonocellCdf(cdf);
> print(cdfM);
AffymetrixCdfFile:
Path: annotationData/chipTypes/DogSty06m520431
Filename: DogSty06m520431,monocell.CDF
Filesize: 57.24MB
Chip type: DogSty06m520431,monocell
RAM: 0.00MB
File format: v4 (binary; XDA)
Dimension: 714x713
Number of cells: 509082
Number of units: 127201
Cells per unit: 4.00
Number of QC units: 0
> print(getChecksum(cdfM));
[1] "ea6d31089a3c65cc46b70035c33c2790"
> giM <- getGenomeInformation(cdfM);
> equals(giM, gi)
[1] TRUE
>
> cs <- AffymetrixCelSet$fromName(arraySet, chipType=chipType)
> acc <- AllelicCrosstalkCalibration(cs)
> csC <- process(acc, verbose=verbose)
Calibrating data set for allelic cross talk...
Already calibrated
Calibrating data set for allelic cross talk...done
> plm <- AvgCnPlm(csC, mergeStrands=TRUE, combineAlleles=TRUE, shift=+300)
> fit(plm, verbose=verbose)
Fitting model of class AvgCnPlm:...
AvgCnPlm:
Data set: duesDog
Chip type: DogSty06m520431
Input tags: ACC,-XY
Output tags: ACC,-XY,AVG,+300,A+B
Parameters: (probeModel: chr "pm"; shift: num 300; flavor: chr
"median"; mergeStrands: logi TRUE; combineAlleles: logi TRUE).
Path: plmData/duesDog,ACC,-XY,AVG,+300,A+B/DogSty06m520431
RAM: 0.00MB
Identifying non-estimated units...
Error in list("fit(plm, verbose = verbose)" = <environment>,
"fit.ProbeLevelModel(plm, verbose = verbose)" = <environment>, :

[2008-07-24 11:31:58] Exception: The specified CDF structure
('DogSty06m520431,monocell') is not compatible with the chip type
('DogSty06m520431,monocell') of the CEL file. The number of cells do
not match: 509082 != 207480
at throw(Exception(...))
at throw.default("The specified CDF structure ('", getChipType(cdf),
"') is not compatible with the chip type
at throw("The specified CDF structure ('", getChipType(cdf), "') is
not compatible with the chip type ('", get
at setCdf.AffymetrixCelFile(res, cdf)
at setCdf(res, cdf)
at method(static, ...)
at clazz$fromDataFile(df, path = path, name = name, cdf = cdf, ...,
verbose = less(verbose))
at method(static, ...)
at clazz$fromDataSet(dataSet = ds, path = getPath(this), cdf =
cdfMono, verbose = less(verbose))
at getChipEffectSet.ProbeLevelModel(this, verbose = verbose)
at NextMethod("getChipEffectSet", this, ...)
at getChipEffectSet.SnpPlm(this, verbose = verbose)
at NextMethod("getChipEffectSet", this,
Identifying non-estimated units...done
Fitting model of class AvgCnPlm:...done

Henrik Bengtsson

unread,
Jul 24, 2008, 12:16:43 PM7/24/08
to aroma-af...@googlegroups.com
Hi,

good that you located the CDF issue.

I think your most recent problem will be solved if you delete all files in:

plmData/duesDog,ACC,-XY,AVG,+300,A+B/DogSty06m520431/

That directory contains chip effect CEL files that was generated for
the previous incorrect monocell CDF. Now it tries to reuse those for
the new monocell CDF and therefore the error message.

Cheers

/Henrik

Karl Dykema

unread,
Jul 24, 2008, 1:40:28 PM7/24/08
to aroma.affymetrix
Henrik,

Thanks, your suggestion fixed that problem and now I can successfully
get up to the point where I attempt to extract the raw copy numbers.
Unfortunately I am still having troubles; now it is telling me that
there are too many non-finite values. I have tried to extract
different chromosomes and for each it comes back with a different
percentage of non-finite values, but that number always seems to be
more than ~50%. Do you know what might be causing this? Thanks yet
again...


> rawCNs <- extractRawCopyNumbers(cbs, array=1, chromosome=1)
Error in list("extractRawCopyNumbers(cbs, array = 1, chromosome = 1)"
= <environment>, :

[2008-07-24 13:20:09] Exception: Something is wrong with the data. Too
many non-finite values: 3089 (50.8% > 20.0%)
at throw(Exception(...))
at throw.default(sprintf("Something is wrong with the data. Too many
non-finite values: %d (%.1f%% > %.1f%%)", a
at throw(sprintf("Something is wrong with the data. Too many non-
finite values: %d (%.1f%% > %.1f%%)", as.intege
at getRawCnData.CopyNumberChromosomalModel(this, ceList = ceList,
refList = rfList, chromosome = chromosome, ...
at getRawCnData(this, ceList = ceList, refList = rfList, chromosome
= chromosome, ..., verbose = less(verbose))
at extractRawCopyNumbers.CopyNumberChromosomalModel(cbs, array = 1,
chromosome = 1)
at extractRawCopyNumbers(cbs, array = 1, chromosome = 1)


-karl

Henrik Bengtsson

unread,
Jul 24, 2008, 2:07:07 PM7/24/08
to aroma-af...@googlegroups.com
Hi,

that error is due to an assertion that test for "too many" missing
values, which would indicate that something went wrong in the earlier
step of the processing. This is useful when you batch process you
data.

The default threshold is 20% missing values. You can change this by
specifying argument 'maxNAFraction', e.g.

rawCNs <- extractRawCopyNumbers(cbs, array=1, chromosome=1, maxNAFraction=3/5);

More importantly, you get > 50% missing values while processing your
arrays. That's quite a bit, especially if you expect Chr1 to be a
copy neutral chromosome. For humans, you can get quite a few missing
values for females on ChrY because there you have CN=0 and your
estimated signals are close to zero or even non-positive which turn
into NAs on the log-scale.

The allelic crosstalk estimates and corrects for offset and crosstalk.
This correction introduce non-positive probe signals. This cause a
problem on the log scale. To protect against this, we shift probe
signals shift=+300 before doing the summarization. The summarization
in turn lower the risk for non-positive estimates, but not completely.
Thus, some non-positive signals will remain and show up as
non-positive chip effect estimates if you use the AvgPlm and NAs if
you use RmaPlm. For a detailed discussion on this, see the CRMA
paper:

H. Bengtsson; R. Irizarry; B. Carvalho; T.P. Speed, Estimation and
assessment of raw copy numbers at the single locus level,
Bioinformatics, 2008. [pmid: 18204055] [doi:
10.1093/bioinformatics/btn016]

As a first run, you could replace the AllelicCrosstalkCalibration with
a QuantileNormalization. That will you a "lower bound" on what kind
of raw CN estimates you can get.

Note, don't expect too much with only five arrays (I am working on
methods to improve this, but that's far in the future). I recommend
you to get more references. If they are from the same facility that
is even better. Have Broad check with other group who run
DogSty06m520431 and see if they are (also) willing to share their data
anonymously. It's a win-win situation for everyone.

Cheers

Henrik

Karl Dykema

unread,
Aug 14, 2008, 11:21:33 AM8/14/08
to aroma.affymetrix
Hi Henrik,

Thanks again for your help. We are trying to get some normal reference
and I think I am close to getting it right, but I am still having a
minor problem displaying my results in the chromosome explorer.
Everything is still in the scale relative to Human chromosomes and it
does have links to view chromosomes 24-39. I have created my
Canine,chromosomes.txt file and placed it in the proper \annotationData
\genomes\Canine directory. I am pretty sure that the file is formatted
properly because when I was cobbling it together I was previously
getting an error message about duplicate row names but when I changed
the "X" to a "39" it started working. I also think the GladModel() is
accepting the genome argument correctly because it is generating
the .png images for the chromosomes, just not incorporating them into
the chromosome explorer html for some reason.

> glad <- GladModel(ces,genome="Canine")
> print(glad)
GladModel:
Name: duesDog
Tags: ACC,-XY,RMA,A+B
Chip type (virtual): DogSty06m520431
Path: gladData/duesDog,ACC,-XY,RMA,A+B/DogSty06m520431
Number of chip types: 1
Chip-effect set & reference file pairs:
Chip type #1 of 1 ('DogSty06m520431'):
Chip-effect set:
CnChipEffectSet:
Name: duesDog
Tags: ACC,-XY,RMA,A+B
Path: plmData/duesDog,ACC,-XY,RMA,A+B/DogSty06m520431
Platform: Affymetrix
Chip type: DogSty06m520431,monocell
Number of arrays: 5
Names: Duesbery_CanineSNP_01_mw-15_07112008,
Duesbery_CanineSNP_02_mw-52_07112008, ...,
Duesbery_CanineSNP_05_mw-116_07112008
Time period: 2008-08-13 16:36:42 -- 2008-08-13 16:36:43
Total file size: 24.28MB
RAM: 0.01MB
Parameters: (probeModel: chr "pm", mergeStrands: logi TRUE,
combineAlleles: logi TRUE)
Reference file:
<average across arrays>
RAM: 0.00MB
> getGenome(glad)
[1] "Canine"

Reply all
Reply to author
Forward
0 new messages