Specify reference normals for each batch/condition

Cheng Li

Jan 1, 2009, 10:45:01 PM
to dChip Software
lch Posted: 20 Jul 2007 07:24 pm Post subject: Specify reference
normals for each batch/condition

--------------------------------------------------------------------------------
dChip update 7/22/07: In copy number analysis, samples of different
batches or experiment conditions can use their own normal reference
samples. Specify a "RefBatch" column in sample information file with
the same value for samples in the same batch or condition. "Ploidy
(numeric)" and trimming will work within each "RefBatch".
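
As a rough illustration of what per-batch reference normals mean (a sketch only, not dChip's actual code): within each "RefBatch", the per-SNP mean signal of the samples marked as normal is taken to represent two copies, and every sample in that batch is scaled against it. The function name, the dictionary layout, and the use of a plain mean instead of dChip's trimming are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

def raw_copy_by_refbatch(signals, refbatch, ploidy):
    """signals: {sample: [signal per SNP]}; refbatch: {sample: batch label};
    ploidy: {sample: 2 for reference normals, None otherwise}.
    All names and the data layout are hypothetical."""
    # Group the reference normals (ploidy == 2) by their RefBatch value.
    normals = defaultdict(list)
    for sample, batch in refbatch.items():
        if ploidy.get(sample) == 2:
            normals[batch].append(sample)

    raw = {}
    for sample, values in signals.items():
        refs = normals.get(refbatch[sample])
        if not refs:
            raise ValueError(f"no reference normals in RefBatch {refbatch[sample]!r}")
        # The per-SNP mean of this batch's normals stands for two copies.
        ref_mean = [mean(signals[r][i] for r in refs) for i in range(len(values))]
        raw[sample] = [2.0 * v / m for v, m in zip(values, ref_mean)]
    return raw
```

With this layout, a tumor in batch "A" is compared only to the normals of batch "A", so a brightness shift shared by the whole batch cancels out of the ratio.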


lch Posted: 20 Jul 2007 07:28 pm Post subject: Question regarding
combining arrays

--------------------------------------------------------------------------------
Thu Mar 16, 2006 1:32 pm

If a tumor and its paired normal are in the same batch, you can put
them in the standardize group and check “Options/Chromosome/use paired
normal as reference”. In this case the normal samples still use their
average to get raw copy numbers (which may not be good because of batch
effects), but each tumor sample is adjusted for both batch and
individual effects by its paired normal sample.

If not, and each batch has its own normal samples, it’s best to analyze
the samples of each batch together (the array list file contains only
samples of the same batch). After obtaining raw copy numbers, use
“Chromosome/Export SNP data” to export them. Finally, combine the raw
copy number files of the batches, format the result in the dChip
“External data” file format, and analyze the combined file in a
different dChip session (use “Get external data” to read this file).

Cheng Li
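
The combining step described above could be sketched as follows. This assumes each batch was exported as a tab-delimited table with a header row, the SNP ID in the first column, and one raw copy number column per sample; that layout is hypothetical, so check the real "Export SNP data" output and the "External data" format in the dChip manual before relying on it.

```python
import csv
import io

def combine_batches(handles):
    """Merge per-batch tables on the SNP ID in the first column.
    The table layout is an assumption, not dChip's documented format."""
    header = ["SNP"]
    per_batch = []
    for handle in handles:
        reader = csv.reader(handle, delimiter="\t")
        header.extend(next(reader)[1:])  # sample names of this batch
        per_batch.append({row[0]: row[1:] for row in reader if row})
    # Keep only SNPs present in every batch, in the order of the first batch.
    shared = [snp for snp in per_batch[0] if all(snp in b for b in per_batch[1:])]
    rows = [[snp] + [v for b in per_batch for v in b[snp]] for snp in shared]
    return header, rows

# Two tiny made-up exports; only SNP_A-2 occurs in both.
batch1 = io.StringIO("SNP\tN1\tT1\nSNP_A-1\t2.0\t3.1\nSNP_A-2\t2.1\t1.2\n")
batch2 = io.StringIO("SNP\tN2\tT2\nSNP_A-2\t1.9\t2.0\nSNP_A-3\t2.0\t2.2\n")
header, rows = combine_batches([batch1, batch2])
```

Values are kept as strings so the numbers are written back out exactly as exported.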



From: Arupa Ganguly
Sent: Thursday, March 16, 2006 1:18 PM
To: Cheng Li
Subject: [dChip] Question regarding combining arrays

I am going to do that in a minute. But before that I have another
question -

I was able to work with dCHIP.2005 version for this analysis - but I
find that I am clearly seeing a batch effect in the copy number
analysis - some CHIPs hybridized on a different date have a higher copy
number for the normal tissue as well as the tumor tissues. Can you help?

In the array list, normal sample is followed by tumor sample and then
a separator - but how do I indicate different dates of hybridizations
for normal-tumor pairs?


lch Posted: 24 Oct 2007 04:01 pm Post subject: unpaired Affy 500K
copy number analysis

--------------------------------------------------------------------------------
From: Cheng Li
Sent: Wednesday, October 24, 2007 3:58 PM
To: 'Joshua Herbeck'
Subject: RE: unpaired Affy 500K copy number analysis

You may try this 'refbatch' method, and for each batch do a trimmed
mean analysis.
http://www.dchip.org/copy.htm#refbatch

I think you don't have to worry about the batch effect before you are
sure it exists. You can do copy number analysis of all samples
together, use "tools/array list file" to order samples by batch, and
check whether samples in the same batch have a similar noise pattern in
the copy number view (if so, it indicates a batch effect).

cheng
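
Outside dChip, the same visual check could be approximated numerically on exported raw copy numbers: if samples from the same batch share a noise pattern, their per-SNP profiles should correlate more strongly within a batch than between batches. This is only a sketch under that assumption; the function names and data layout are made up, and it is not a dChip feature.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length lists (fails on constant input)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def within_vs_between(raw, batch):
    """raw: {sample: [raw copy number per SNP]}; batch: {sample: batch label}.
    Returns (mean within-batch correlation, mean between-batch correlation)."""
    within, between = [], []
    for s, t in combinations(sorted(raw), 2):
        r = pearson(raw[s], raw[t])
        (within if batch[s] == batch[t] else between).append(r)

    def avg(values):
        return sum(values) / len(values) if values else float("nan")

    return avg(within), avg(between)
```

A within-batch average clearly above the between-batch average would be consistent with a batch effect, although real copy number changes shared by related samples can produce the same picture.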

From: Joshua Herbeck
Sent: Wednesday, October 24, 2007 2:55 PM
To: Cheng Li
Subject: Re: unpaired Affy 500K copy number analysis

So, is it possible to adjust for batch effects if I don't have
identical samples run in different batches?


From: Cheng Li
Sent: Wednesday, October 24, 2007 2:54 PM
To: 'Joshua Herbeck'
Subject: RE: unpaired Affy 500K copy number analysis

> So I should normalize and run MBEI for my samples, then do the same,
> separately, for the HapMap normals?

Yes. You don't have to use HapMap samples in your analysis if the CNVs
in them are not your immediate interest.

> For my samples, should I normalize and run MBEI for _each_ separate
> batch?

Yes.

> And this applies to the chip batch (Affymetrix chip batch #), or the
> date of chip processing batch (which can be different for chips from
> the same Affy batch)?

Processing batch could have bigger effect. But I am not sure.

> I'm not going to combine my sub-arrays, but for adjusting for batch
> effects, why would normalizing and performing MBEI for all my samples
> together not be sufficient to adjust for batch effects?

Normalization adjusts for overall array brightness, which can be only
part of batch effect.

Cheng
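
A toy example of this point, using plain global scaling as a stand-in for normalization (dChip's invariant-set normalization is more elaborate, and the numbers below are invented): after the second array is scaled to the same overall brightness, the SNP-specific part of the batch difference is still there.

```python
def scale_to_mean(array, target):
    """Rescale an array so that its mean signal equals `target`."""
    factor = target / (sum(array) / len(array))
    return [v * factor for v in array]

# Reference array from batch 1 (made-up signals).
batch1 = [100.0, 100.0, 100.0, 100.0]
# Batch 2 is 1.5x brighter overall, and the first and last SNPs carry an
# extra SNP-specific shift in opposite directions.
batch2 = [180.0, 150.0, 150.0, 120.0]

normalized = scale_to_mean(batch2, target=100.0)
# What is left after removing the overall brightness difference.
residual = [n - b for n, b in zip(normalized, batch1)]
```

Here the residual is about +20 for the first SNP and about -20 for the last one, which is the part of the batch effect that brightness normalization alone cannot remove.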


From: Joshua Herbeck
Sent: Wednesday, October 24, 2007 2:46 PM
To: Cheng Li
Subject: Re: unpaired Affy 500K copy number analysis

Hi Cheng,

> Are your 200 samples normal or cancer samples?

My 200 samples are normal samples.

> Usually you don't have to normalize your sample with hapmap normals,
> due to batch effect between different datasets.

So I should normalize and run MBEI for my samples, then do the same,
separately, for the HapMap normals? For my samples, should I
normalize and run MBEI for _each_ separate batch? And this applies
to the chip batch (Affymetrix chip batch #), or the date of chip
processing batch (which can be different for chips from the same Affy
batch)?

> It's better to analyze sub arrays separately for their agreement
> before combining them.

I'm not going to combine my sub-arrays, but for adjusting for batch
effects, why would normalizing and performing MBEI for all my samples
together not be sufficient to adjust for batch effects?

Thanks a lot for your help!

Josh


From: Cheng Li
Sent: Wednesday, October 24, 2007 2:24 PM
To: 'Joshua Herbeck'
Subject: RE: unpaired Affy 500K copy number analysis

Are your 200 samples normal or cancer samples?

Usually you don't have to normalize your sample with hapmap normals,
due to batch effect between different datasets.

It's better to analyze sub arrays separately for their agreement
before combining them.

Filtering SNPs may not be necessary for good quality samples.

cheng


From: Joshua Herbeck
Sent: Tuesday, October 23, 2007 11:18 AM

Subject: unpaired Affy 500K copy number analysis


Hi Cheng,

I have some questions about unpaired copy number analysis in dChip,
using Affy 500K chip data. If you could help me through them, I would
really appreciate it. I have ~200 samples of CEU ancestry individuals,
genotyped with the Nsp and Sty chips. I also have the HapMap normals,
and 3 sets of normal-tumor pairs that I wish to use as positive
controls. I have loaded all CEL and TXT files, and have normalized and
performed model-based expression for ALL samples together--my 200
together with the HapMap normals and the normal-tumor pairs. I have not
combined the subarrays, because when I do this and then try to infer
copy number, the program consistently says it is out of memory. So,
looking at just Nsp and then just Sty, I first filter SNPs (not genes)
for <= 600 bp fragments, 90% call rate across samples, and then infer
copy number for those filtered SNPs. I use trimmed analysis with %
trimmed at -1, so that the HapMap normals with Ploidy(numeric) = 2 are
recognized as normals, and the Hidden Markov Model option for inferring
copy number (SNP marker Allele A frequency = Caucasian; HMM length =
1000). I then export the Nsp and Sty inferred copy number output, and
row combine, accepting only those copy number variations that are found
in adjacent SNPs spanning 5 kb or greater.

Questions: Is this a reasonable approach to infer copy number variation
in the data set? These samples were run over several different batches;
but I have no local normals, so I can not batch correct as your manual
suggests. Does inferring copy number for Nsp and Sty separately, and
then combining, and only accepting copy number variation seen in
adjacent Nsp and Sty SNPs correct for the potential batch variation?

Thanks a lot,

Josh



ccyen Posted: 17 Jun 2008 07:01 am Post subject: Batch effect

--------------------------------------------------------------------------------
Dear Dr. Li:

About the batch effect, I noticed that you suggested that we can
obtain raw copy numbers for each batch first, then use "Get external
data" to put all the batches together. For an individual batch, do we
need to normalize it separately? (I mean: batch 1 -> normalize ->
probe signal -> raw CN, then Batch 20 -> normalize -> probe signal ->
raw CN, then combine them together.)

Chueh-Chuan Yen


lch Posted: 17 Jun 2008 06:02 pm Post subject:

--------------------------------------------------------------------------------
You don't need to normalize batches separately. You can use the
RefBatch function in the latest version.



Arupa Ganguly Posted: 31 Aug 2007 02:12 pm Post subject: Refbatch

--------------------------------------------------------------------------------
What is the maximum number of RefBatch values that can be used?
When will dCHIP be able to handle SNP 6.0 copy number probes?


lch Posted: 10 Sep 2007 01:15 pm Post subject:

--------------------------------------------------------------------------------
The maximum is 50 in the 8/30/07 version and later, and 5 in earlier
versions.
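
Before loading a sample information file, the number of distinct RefBatch values can be checked against this limit. The sketch below is a standalone helper, not part of dChip; the function name is made up, and the strict comparison is inferred from the wording of the reported error ("NumRefBatch >= MAX_REF_BATCH"), so whether a count exactly equal to the limit is accepted is an assumption.

```python
def check_refbatch_count(refbatch_values, max_ref_batch=50):
    """Return (number of distinct RefBatch values, whether it is under the limit).
    max_ref_batch: 50 for the 8/30/07 version and later, 5 for earlier ones."""
    n = len(set(refbatch_values))
    # Assumption: dChip ignores RefBatch when the count reaches the limit.
    return n, n < max_ref_batch
```

For example, seven batches labelled 1-7 pass with the limit of 50 but not with the older limit of 5.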

If you use GenomeWideSNP_6.cdf in dchip, there are about 1853020 probe
sets. I assume the CNVs are organized into probe sets containing a
single probe. If Affy can make a CDF file grouping CNV probes
according to the restriction fragments, it can be used directly in
dchip (I may do this but do not have specific timeline).



mitzi Posted: 09 Aug 2007 04:05 pm Post subject: adjust batch
effects in latest version - runtime error

--------------------------------------------------------------------------------
I have been using dchip2006, build date April 29, 2006
to analyze 500K SNP data.
I need to use the adjust batch effects for some of my data.

Recently I downloaded the latest version of dchip2006.exe,
and was unable to adjust batch effects for my dataset.
I didn't load probe data into memory. I tried upping the memory
to 1500MB, and that blew out too.

So I rolled back to the April 29 version, and this worked just fine.

cheers,

Mitzi


lch Posted: 09 Aug 2007 05:43 pm Post subject:

--------------------------------------------------------------------------------
Which menu functions are you referring to? Could you attach dchip
outputs or screenshots to explain more details?


mitzi Posted: 16 Aug 2007 03:13 pm Post subject:

--------------------------------------------------------------------------------
I downloaded the August 12th 2007 build, and
this worked for me.

For the record, I have a data set where I have a
biological sample that was processed in 2 batches,
with 2 replicate chips in each batch, which shows
very strong batch effects.

Following the instructions in the manual
I have created an array list file with standardize
separators for the 2 batches.
After normalization and MBEI, I choose the
menu item "Adjust Batch Effect" from the Tools menu.
If successful, there is a message in the Analysis log,
and I check my data either through running a report,
or exporting the expression data and checking the data.


lch Posted: 16 Aug 2007 03:21 pm Post subject:

--------------------------------------------------------------------------------
This function is for expression arrays and hasn't been tested on SNP
array data. It adjusts the batch effect by simple scaling, so that for
each gene, the mean of each batch is the same as the mean of the first
batch. Therefore it requires more arrays (e.g. 10) in each batch to
work reasonably.
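
The scaling described here can be sketched in a few lines. This illustrates the stated idea only (per gene, rescale every batch so that its mean matches the mean of the first batch); it is not dChip's code, and the function name and data layout are assumptions.

```python
def adjust_batch_effect(expr, batches):
    """expr: {gene: [expression value per array]};
    batches: batch label of each array, in the same order as the values."""
    first = batches[0]
    adjusted = {}
    for gene, values in expr.items():
        # Mean of this gene within each batch.
        means = {}
        for label in dict.fromkeys(batches):
            vals = [v for v, b in zip(values, batches) if b == label]
            means[label] = sum(vals) / len(vals)
        # Scale each value so its batch mean equals the first batch's mean.
        # A batch mean of zero would raise ZeroDivisionError.
        adjusted[gene] = [v * means[first] / means[b] for v, b in zip(values, batches)]
    return adjusted
```

With only a couple of arrays per batch, each batch mean is estimated from very few values, which is why the adjustment needs more arrays per batch to behave well.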


mitzi Posted: 16 Aug 2007 03:38 pm Post subject:

--------------------------------------------------------------------------------
we're conducting an ongoing experiment.
the January batch contains 15 arrays total
for 3 biological samples, and the batch from Feb
contains 20 arrays total from 4 biological samples,
of which 3 are the same as the Jan batch.

the data set at present contains 92 arrays, from
7 different batches, but only the Jan and Feb batches
were run on the same biological samples.

when you say that this function is for expression array,
do you mean that the rescaling isn't appropriate for SNP data?
or that it is untested? if the latter, what would be a good way
to test it? I have made jpegs of before and after adjustments -
would these be helpful?

cheers,

Mitzi


lch Posted: 25 Aug 2007 07:14 pm Post subject:

--------------------------------------------------------------------------------
Yes the figures will be helpful. For SNP array it's not tested, but
you may consider these functions:

http://www.dchip.org/copy.htm#batch_effect



mitzi Posted: 30 Aug 2007 04:59 pm Post subject:

--------------------------------------------------------------------------------
Attached are screenshots of a subset of my data
showing before and after Tools/Adjust Batch Effects
for one of the samples which was processed in 2 batches.

I tried to add a "RefBatch" column to my sample information file.
there are 7 batches, so I used values 1-7.
But this gave the following error:
Error: NumRefBatch >= MAX_REF_BATCH, ignored reference batch
information


lch Posted: 30 Aug 2007 06:19 pm Post subject:

--------------------------------------------------------------------------------
You can try this 8/30 version which has larger (50) MAX_REF_BATCH
limit:
http://www.hsph.harvard.edu/~cli/dchip2006.exe



lch Posted: 02 Jul 2007 01:47 pm Post subject: Batch effect when
merging two datasets

--------------------------------------------------------------------------------
RE: [dChip] merging of two datasets

Szesing,


This is probably a typical batch effect between two different datasets.
You don’t have to normalize again after combining, but you may try a
standardize separator for “Analysis/clustering”:

http://biosun1.harvard.edu/complab/dchip/array%20list%20file.htm#standardize

Cheng



From: dc...@yahoogroups.com [mailto:dc...@yahoogroups.com] On Behalf
Of szesing
Sent: Tuesday, January 16, 2007 10:32 PM
To: c...@hsph.harvard.edu; dc...@yahoogroups.com
Subject: [dChip] merging of two datasets



Hi,

I have obtained some Early Access 100K Affymetrix data from GEO and
combined them with my own cell line samples done on Affymetrix 100K
SNP arrays. I did invariant set normalization and also the PM-only
model (as the PM/MM model was not working) separately on both my data
and the EA data. Then, I merged the two datasets using the common probe
set file obtained from this forum by merging the exported dChip files
in Excel.

After obtaining the merged file, I ran dChip again and did
normalization again. I did not do the modeling this time. After
running analysis/chromosome, I obtained the copy number for my
datasets.

Using the exported copy number data, I ran hierarchical clustering on
a combined Hind and Xba set of around 45000 SNPs that have copy
number gain. Strangely, I found that my own datasets are clustered
separately from the downloaded EA datasets. I would have expected to
see one or two cell lines cluster together with the EA datasets.

Would this be caused by the merging of the two different arrays?
Would the merging and subsequent renormalization cause the changes
in the copy number? What should be the correct steps in trying to
merge 2 different datasets from different array types?

Thanks a lot.



Re: merging of two datasets

Hi Cheng,

I am aware of the possible batch effect, hence the normalization after
the combine step. As I use another clustering program for my data, I
would need to have an unbiased exported copy number file. Is there
another way to overcome the batch effect?

As for using analysis/clustering, I do not have a gene list file,
as I want to do unsupervised clustering on all the SNPs in the dataset.
How could I generate the file?

Many thanks.