minfi - remove samples and CpG sites in the MethylSet object

4,225 views
Skip to first unread message

Guangju Zhai

unread,
May 30, 2013, 3:16:16 PM5/30/13
to epigenom...@googlegroups.com
Hi, all

Does anyone know how to remove samples and CpGs from the MethylSet object in Minfi? there is a function called "dropMethylationLoci", but it seems it only does removing probes measuring SNPs and CH probes. I would like to remove some samples as well as CpG sites which fail the QCs.

Thanks for your help in advance!

Guangju

Elmar Tobi

unread,
May 31, 2013, 4:57:35 AM5/31/13
to epigenom...@googlegroups.com
Hi,
 
The minfi object samples can simply be removed as if it was a normal matrix. All objects in the S4 object than follow this removal of a sample.
The probes (rows) are a bit trickier. We currently just set the values as NA as computing beta values becomes difficult when rows (=probes) have been removed. The function refering to the manifest to piece together the beta values does not like probesets missing from the matrix. Most functions, including various normalizations do handle NAs however. So it might be an easier work around.
best wishes
Elmar
 
Op donderdag 30 mei 2013 21:16:16 UTC+2 schreef Guangju Zhai het volgende:

Guangju Zhai

unread,
May 31, 2013, 9:06:00 AM5/31/13
to epigenom...@googlegroups.com
thanks, Elmar

do you have a script to remove samples in minfi?

thanks,

Guangju

Rob

unread,
Jun 1, 2013, 8:57:31 AM6/1/13
to epigenom...@googlegroups.com
If you have followed the document in Bioconductor e.g.

RGset <- read.450k.exp(base = baseDir, targets = targets)

Then you can simply

RGset=RGset[,-c(1,3,4)]

Which will remove samples 1,3,4 then carry on as normal. The negative sign can be used to drop row or columns in matrices in R.

Rob

Guangju Zhai

unread,
Jun 1, 2013, 9:11:50 AM6/1/13
to epigenom...@googlegroups.com
thanks, Rob

much appreciated!

Guangju

Brad

unread,
Apr 16, 2014, 11:38:44 AM4/16/14
to epigenom...@googlegroups.com
Hi there -
I know this is a very old thread but I'm struggling with a related issue.
I'm trying to use dmpFinder and bumphunter in minfi to assess changes in a list of ~30 genes (corresponding to ~1000 probes) of interest to my hypothesis, and I'm unable to figure out how to restrict the analysis to this list. 
My experience with R and programming in general is minimal.
Is there a way to remove a large list of probes from a methylset or genomicratioset? Is it possible to remove objects from within these objects (i.e. gRatioSet.quantile$rowData) and then replace edited versions?
If anyone can point me towards the best approach for my goal I'd really appreciate it.
Thanks,
Brad

Celine Bourdon

unread,
May 12, 2014, 5:36:35 PM5/12/14
to epigenom...@googlegroups.com
Hi
I am also kinda new at this stuff, ... and have been struggling with the same kind of issue,
I came across this... maybe it is usefull for you?

http://www.bioconductor.org/packages/release/bioc/manuals/DMRcate/man/DMRcate.pdf

I will try and get this to work ... but
if, in the meantime,  you have found a solution, ... let me know...

:)

celine !

Celine Bourdon

unread,
May 14, 2014, 3:31:17 PM5/14/14
to epigenom...@googlegroups.com
Hi Brad or any other interested,

i can remove a list of probes or a list of samples from a methylset ... it might not be the most elegant, i am new at this too,...
here is what i would do:

I d make a ListFile.csv containing the list of probe_IDs that i want to remove,  read.csv and create a vector with them...

vector.of.probes.to.remove <- ListFile$probes

then, make a new methylSet:
New.MethylSet <-MethylSet[ , !colnames(MethylSet)  %in% vector.of.probes.to.remove ]

for sample removal, you exclude the elements of your vector on the rows, instead of columns
New.MethylSet <-MethylSet [ !rownames(MethylSet) %in% vector.of.samples.to.remove, ]

 
I am not yet sure how to remove objects (ex: b-values) within a Methylset and replace them (with ex: a "batch corrected" set of b-values)...
still working on these basics...

catch you around,
celine




Le mercredi 16 avril 2014 11:38:44 UTC-4, Brad a écrit :

Rob

unread,
May 15, 2014, 4:26:10 AM5/15/14
to epigenom...@googlegroups.com
In this case it may be easier to just ask for the probes and samples your interested in:

e.g.

You will need your probes in a character vector. You could define this in R or even read in a file but here ill just create one for example for 3 probes:

probes=c("cg00035864","cg00050873","cg00061679")

New.MethylSet <-MethylSet[probes,]

For samples you will need to know the colnames of the MethylSet

colnames(MethylSet)

Then again produce a character vector (for this example ill select samples 1,5 and 10):

samples=colnames(MethylSet)[c(1,5,10)

New.MethylSet <-MethylSet[,samples]


You can put them together at once as well:

New.MethylSet <-MethylSet[probes,samples]

Rob

unread,
May 15, 2014, 4:27:44 AM5/15/14
to epigenom...@googlegroups.com
Can't figure out how to edit the above but I missed a bracket:

samples=colnames(MethylSet)[c(1,5,10)]

Celine Bourdon

unread,
May 15, 2014, 9:22:55 AM5/15/14
to epigenom...@googlegroups.com
Cool Rob!!!
thank you!!   I love getting more stream-lined ways of doing things, ... I also made a mistake in the above where probes are in the rows and samples in colnames...
I'm getting there slowly....

celine, orange belt R-ninja

swa...@gmail.com

unread,
May 20, 2014, 10:02:28 AM5/20/14
to epigenom...@googlegroups.com
Hi there,

I used illumina 450k with 80 samples on minfi package. I would like to remove the probes with p-value>0.01 after loading the dataset.

>RGset <- read.450k.exp(base = baseDir, targets = targets)

I found a function to detect P-value, detectionP()
>detectP<-detectionP(RGset, type = "m+u")
>failed <- detectP>0.01
> dim(failed)
[1] 485512     80

I assume that "failed" only contains the probes with P-value>0.01, why it includes all of the probes?

And most importantly, could you let me know how to remove the probes with p-value >0.01?

I really appreciate your help.

Thanks a lot,
Li

Celine Bourdon

unread,
May 21, 2014, 9:52:14 AM5/21/14
to epigenom...@googlegroups.com
Hi Li,

I ll paste the code  I use,,...  in case it helps you ...

### detection "p-value" is the probability that target signal is different from background
### commun practice threashold is "failed position" if p-value >0.01
detP<-detectionP(RGset) # this gets the p-values
failed.01<-detP > 0.01  #set detection cutoff

### LIST Bad Probes
failedProbes <-rownames(failed.01)[rowMeans(failed.01)>0.2]     #list of probes that failed in more than 20% of the sample
sum(rowMeans(failed.01)>0.2) # how many probes failed in more than 20% of samples

(make sure probe-ids are the same in the vector as in your RGset (sometimes I have to remove the "cg" ...in the probe names for this to work, depending if i am removing from the mehtylset or the RGset...  )
### create new vector without the "cg" in the CpG-probe ID
failedP<-str_replace_all(failedProbes, "cg", "")
failedP # new vector list of failed probes without "cg"

# then you remove these
RGset_FX <- RGset[!rownames(RGset)%in% failedP]

#check dimensions of objects... make sure probes are "gone"
dim(RGset)
dim(RGset_FX)

Hope it helps, and if i am doing it wrong, let me know :)
catch you around

swa...@gmail.com

unread,
May 21, 2014, 2:24:23 PM5/21/14
to epigenom...@googlegroups.com
Thanks Celine! That works well.

swa...@gmail.com

unread,
May 23, 2014, 10:37:17 AM5/23/14
to epigenom...@googlegroups.com
Hi Celine,

Everything seems good until I tried to run preprocessRaw() to generate methylationSet.

> MSet.raw <- preprocessRaw(RGset_FX)
Error in getGreen(rgSet)[getProbeInfo(rgSet, type = "II")$AddressA, ] :
  subscript out of bounds

But it does work when I used RGset
> MSet.raw <- preprocessRaw(RGset)

I also checked on each of set

> RGset_FX
RGChannelSet (storageMode: lockedEnvironment)
assayData: 622394 features, 80 samples
  element names: Green, Red
phenoData
  sampleNames: 7810920100_R05C01 7810920100_R04C01 ...
    8691803146_R02C02 (80 total)
  varLabels: SampleName Smaple_Well ... filenames (23 total)
  varMetadata: labelDescription
Annotation
  array: IlluminaHumanMethylation450k
  annotation: ilmn12.hg19

> RGset
RGChannelSet (storageMode: lockedEnvironment)
assayData: 622399 features, 80 samples
  element names: Green, Red
phenoData
  sampleNames: 7810920100_R05C01 7810920100_R04C01 ...
    8691803146_R02C02 (80 total)
  varLabels: SampleName Smaple_Well ... filenames (23 total)
  varMetadata: labelDescription
Annotation
  array: IlluminaHumanMethylation450k
  annotation: ilmn12.hg19

There are 550 failed probes, but regards to the number of features, RGset_FX has 5 short comparing to RGset.

Do you, by any chance, know what the error could be?

Thanks a lot and do appreciate it.

Li

Celine Bourdon

unread,
May 23, 2014, 11:44:46 AM5/23/14
to epigenom...@googlegroups.com
okay, ...
I got this error too, but what i figured was, that preprocessRaw does not like to have probes removed,... before.
he does not mind loosing samples, but probes cannot be removed at this stage,
I am sure there is a work around... but... i m not good enough yet. and have not time to figure it out

but... the cludge i found was ...
make the RGset --> make the mSet ... do your QC on the mSet --> ID all the bad smaples and probes...
remove them and the other probes you want to exclude and proceed. 

NOW ... having said this,..
... i had the similar issue with SWAN normalisation,
 it wants a full RGset (with all probes) which it matches to a QC-"ed" Mset (with probes removed)...
 (he does not enjoy having less probes than he is expecting... in the RGset)

So i only managed by:
feeding SWAN a full RGset ... and a QC"ed"-mSet (minus bad probes, bad samples, sex probes, etc)
and i get back a SWANed-MethylSet that has bad probes and samples removed.

and I moved on from there..
Hope it help,
let me know if you understand what i mean,
:)

celine


--
You received this message because you are subscribed to the Google Groups "Epigenomics forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to epigenomicsfor...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

swa...@gmail.com

unread,
May 23, 2014, 2:09:15 PM5/23/14
to epigenom...@googlegroups.com
Hi Celine,

I think I know how to remove the failed probes. continue with your results on failedProbes.

failedProbes <-rownames(failed.01)[rowMeans(failed.01)>0.2] 

You need to do preprocessRaw with RGset
MSet.raw <- preprocessRaw(RGset)
 
##### then to remove the failed probes with p-value >0.01
Mset_FP=MSet.raw[!rownames(MSet.raw)%in% failedProbes]

move along with then further analysis, it goes well at mine.

Thanks,
Li

Celine Bourdon

unread,
May 23, 2014, 2:26:46 PM5/23/14
to epigenomicsforum
yes! hahah we came up with the same solution
:)
good luck !

swa...@gmail.com

unread,
May 27, 2014, 3:53:57 PM5/27/14
to epigenom...@googlegroups.com
Great! Good to know.

Celine, I don't know if you experienced another issue when generating the GenomeRatioSet.

> gRatioSet <- mapToGenome(ratioSet, mergeManifest = TRUE)
Error in .Call2("solve_user_SEW0", start, end, width, PACKAGE = "IRanges") :
  solving row 1: range cannot be determined from the supplied arguments (too many NAs)

Do you happen to know what's wrong?

Thanks a lot
LE

Celine Bourdon

unread,
May 27, 2014, 5:27:34 PM5/27/14
to epigenomicsforum

Hey,

!
humm, nope I managed to create the Genomic_ratioSet... I got another error but not yours...

,... i am wondering, ... how you created the Ratioset... or get your b-values...
if you are using illumina, you have to say

getBeta (SWANed_Mset, type="Illumina")  or offset=100

this adds 100 to all the beta values, to avoid getting small numbers or zeros ...
which become a problem when doing the logit transforms down the line...

maybe?  ...  I am only talking through my hat cause i am not sure...

or I would look into threashold arguments, ... the beta values, if you calculated the m-logit values,..

i would look into that,..? that is all my brain storm came up with for now, if you get more info or clues, let me know

...for my part...
 I did get the following error
> gRatioSet <- mapToGenome(ratioSet, mergeManifest = TRUE)
Error in `elementMetadata<-`(`*tmp*`, value = <S4 object of class "DataFrame">) :
  names of metadata columns cannot be one of  "seqnames", "ranges", "strand", "seqlevels", "seqlengths",
"isCircular", "start", "end", "width", "element"
and this tread helped me, ... i set the mergeManifest to FALSE and it worked...
http://comments.gmane.org/gmane.science.biology.informatics.conductor/52820

hope it helps,
let me know if it does.
c

Celine Bourdon

unread,
May 28, 2014, 9:23:34 AM5/28/14
to epigenom...@googlegroups.com
Hey L,
check out the
minfi release 1.8  thread in the forum, Emma is having the same issue as you.

maybe if you get together... you 'll figure it out.
c
To unsubscribe from this group and stop receiving emails from it, send an email to epigenomicsforum+unsubscribe@googlegroups.com.

Christian Trolle

unread,
Jun 4, 2014, 4:19:13 PM6/4/14
to epigenom...@googlegroups.com
Hi 

If you are still having issues with filtering out bad performing samples, maybe then The watermelon function pfilter could do the trick.

CT

Stephanie Shiau

unread,
Mar 26, 2015, 5:05:14 PM3/26/15
to epigenom...@googlegroups.com
Hi all, 

Are there any updated systematic methods on how best to remove problematic probes in minfi?  Including 1) probes displaying high detection p-value 2) cross-reactive probes, 3) probes containing common SNPs, and 4) probes with a high average intensity?

Is it suggested to move from minfi to another package? If so, what package?  How do you transform the object?

Thank you!!

Kasper Daniel Hansen

unread,
Mar 26, 2015, 8:54:36 PM3/26/15
to epigenom...@googlegroups.com
These CpGs /probes should be completely straightforward to remove once a MethylSet has been constructed, usually post-normalization. 

Multiple users have wanted to remove them prior to normalization.  I am not sure I understand why, but of course it is possible.  The problem is that if you remove a probe of which has a pair you need to remove both of them.

--
You received this message because you are subscribed to the Google Groups "Epigenomics forum" group.

Tim Triche, Jr.

unread,
Mar 27, 2015, 5:38:58 PM3/27/15
to epigenom...@googlegroups.com
Here is a suggestion. 

Suppose you have IDATs (you should) and you want to keep the SNP probes in order to detect label swaps (you should). 
Suppose further that you'd like to keep CNV calls, and you'd like to mask probes with p > 0.01.
Lastly, suppose you have functions called...
  processIDATs (which calls minfi::read.450k.exp() with sensible defaults), 
  processMeth (which calls minfi::preprocessFunnorm(), minfi::detectionP(), and minfi::getSnpBeta(), stuffs the SNPs into exptData, and masks off at 0.01 using is.na(assays(grSet)$Beta[which(assays(grSet)$pval > pcutoff)]) <- TRUE),
  processCNV (which calls CopyNumber450k::CNV450kSet and computeSignificance(segmentize(normalize(dropSNPprobes(cnvSet)))) as the poor man's GISTIC). 

Then you could do something like this...

  rgSet <- processIDATs(targets, name) ## pulls in SNPs and detection p-values
  grSet <- processMeth(rgSet, name) ## masks betas by pval > 0.01 & keeps SNPs
  exptData(grSet)$CNVs <- processCNV(rgSet, name)
  saveRDS(grSet, grSetName)

And all of your masked, label-verified, CNV-matchable samples end up in a container amenable to further analysis. 
Obviously there are some judgment calls to be made here, but eventually you arrive at the least worst option.

Somebody should write a package like that ;-)

Smeeta Shrestha

unread,
Apr 8, 2015, 8:21:33 AM4/8/15
to epigenom...@googlegroups.com
Hi !


I read a review  paper by Dr Morris on overall pipelines for 450K analysis. Below is a para from the paper

"Briefly, raw data (IDAT files) is imported using the illuminaio [16] tool implemented in minfi. Then a number of quality control metrics are examined to determine the success of the bisulphite conversion and subsequent array hybridisation. Probe filtering is performed to remove probes that have failed to hybridise (detection p-value) and that are not represented by a minimum of 3 beads on the array."


Could anyone please help me as HOW CAN I REMOVE "detection p-value" and "Nbeads" using the minfi script.

I know watermelon can do but then post "pfilter" command in watermelon i get methyset and lose the RGset because of which I am not able to do normalisation and further SWAN correction.

Thank you so so much in advance.

Smeeta

Celine Bourdon

unread,
Apr 8, 2015, 4:05:25 PM4/8/15
to epigenom...@googlegroups.com
Hi Smeeta,

I have part of your answer, it is probably not the most elegant but it works... . What i have done is:
 
detP<-detectionP(RGset) # this gets the p-values
failed.01<-detP > 0.01  #establish your quality cut-off so my probes must be significantly above background level (significance cut off 0.01, some us 0.05)  
# this gets you a list of failed positions ... save it for record
 save(failed.01, file ="2015failed.01_object.RData")

#get list of failed probes:  
failedProbes <-rownames(failed.01)[rowMeans(failed.01)>0.2] 
sum(rowMeans(failed.01)>0.2)  # how many probes failed in more than 20% of samples 
failedProbes

# do your preprocessing to get your mset
Mset <-preprocessIllumina(RGset, bg.correct=TRUE, normalize = c("controls"), reference =1) 

# run QC on the Mset ... 
 minfiQC_wOO<-minfiQC(Mset, fixOutliers =TRUE, verbose=TRUE)
plotQC(minfiQC_wOO$qc) # ID bad samples
 plotSex(minfiQC_wOO$qc, id=Mset@phenoData@data$Gender)


gSet.q<-preprocessQuantile(Mset, fixOutliers=TRUE, removeBadSamples=TRUE,
                           badSampleCutoff=10.5, quantileNormalize=TRUE,
                           stratified=TRUE, mergeManifest=TRUE, sex=NULL)


#make list of other bad samples you might want to remove... 
badSamples.1.2.3 <- c("5771955078_R03C02", "5771955073_R03C01", "5771955077_R06C02") 
badSamples.1.2.3

gSet.q<-gSet.q[,!colnames(gSet.q) %in% badSamples.1.2.3]
dim(gSet.q)


### remove Bad probes
rownames(gSet.q.146) 
gSet.q.146.xP <- gSet.q.146[!rownames(gSet.q.146)%in% failedProbes,] 


###remove Sex probes
sex.probes<- read.csv(XYSites.csv", header=FALSE)
dim(sex.probes)
sex.p <- sex.probes[,1]
length(sex.p) # 11646

gSet.q.146.xP <- gSet.q.146.xP[!rownames(gSet.q.146.xP)%in% sex.p,] 
dim(gSet.q.146.xP) 


crossHyb.probes<- read.csv(file="CrossReactiveProbes.csv", header=FALSE)
dim(crossHyb.probes)
crossHyb.p <- crossHyb.probes[,1]
length(crossHyb.p) # 29233

gSet.q.146.xP <- gSet.q.146.xP[!rownames(gSet.q.146.xP)%in% crossHyb.p,] 
dim(gSet.q.146.xP) #features.460350 samples.146 

annot<- getAnnotation(gSet.q.146.xP)
names(annot)


snps.q.146.xP<-getSnpInfo(gSet.q.146.xP)
gSet.q.146.xP<-addSnpInfo(gSet.q.146.xP)

gSet.q.146.xP.xSNP<-dropLociWithSnps(gSet.q.146.xP, snps=c("SBE","CpG"), maf=0)
dim(gSet.q.146.xP.xSNP) 


Hope it helps, all you need to figure out is how to drop the Bead count minimum of 3 beads  ... if you get a list of these, you are good to go.
If you want to take these out in a mSet (do the same subsetting proceedure, but replace the gSet.q with your mset object)

as always, double check the logic for yourself, I am not fool proof :)

c

Smeeta Shrestha

unread,
Dec 24, 2015, 3:57:49 AM12/24/15
to Epigenomics forum
Dear Celine,

Apologies for a really late reply. But thank you so much for the information. Yes I recently tried your method and it works well.

regards
smeeta

Hrishikesh Lokhande

unread,
Jan 31, 2017, 7:23:05 PM1/31/17
to Epigenomics forum

Celine,

Thank you for posting your code here. I know this is a very old post but I was using lines from your code to remove failed probes from the RGset. 
The code does identify the failed probes but it doesn't remove them from the RGset_FX object. 
Here is what I am doing: 

RGset

RGChannelSet (storageMode: lockedEnvironment)

assayData: 622399 features, 85 samples 

  element names: Green, Red 

An object of class 'AnnotatedDataFrame'

  sampleNames: 200517420015_R03C01 200517420102_R06C01 ...

    200400320002_R04C02 (85 total)

  varLabels: Sample_Name Sample_Well ... filenames (13 total)

  varMetadata: labelDescription

Annotation

  array: IlluminaHumanMethylation450k

  annotation: ilmn12.hg19

> detP<-detectionP(RGset) 

> failed.01<-detP > 0.01 

> failedProbes <-rownames(failed.01)[rowMeans(failed.01)>0.2]  

> length(failedProbes)

[1] 352

> failedP<-str_replace_all(failedProbes, "cg", "") 

> length(failedP)

[1] 352

> RGset_FX <- RGset[!rownames(RGset)%in% failedP] 

> dim(RGset)

Features  Samples 

  622399       85 

> dim(RGset_FX) 

Features  Samples 

  622397       85


So instead of removing 352 probes it removes only 2. Would really appreciate your help on this. 

Thanks

Tim Triche, Jr.

unread,
Feb 1, 2017, 12:50:09 PM2/1/17
to Epigenomics forum
set your failed probes to NA and impute them later (this allows you to pool information across samples and recover some of it):

rgSet <- read.metharray.exp(targets=targets)
grSet <- ratioConvert(mapToGenome(preprocessNoob(rgSet)))

pCutoff <- 0.01
pvals <- detectionP(rgSet)
# "if a (probe, sample) tuple exceeds the p-value cutoff, set it to NA"
is.na(assays(grSet)$Beta) <- (pvals[rownames(grSet), colnames(grSet)] >= pCutoff)

Also, use the rest of the information in the rgSet to your advantage:

library(ComplexHeatmap)
SNPs <- getSnpBeta(rgSet) # cluster on these to try and catch label swaps
Heatmap(t(SNPs), name="MAF", row_names_side="left", row_title="sample",
               column_title="SNP", column_title_side="bottom")

As you can imagine, due to CpG SNPs on the platform, this can be extended to guess at ethnicity, germline SNPs, etc. if needs be. See https://academic.oup.com/nar/article/doi/10.1093/nar/gkw967/2290930/Comprehensive-characterization-annotation-and for more.

Once you're certain that your samples are who/what you think they are, you can cluster by ethnicity (or not) and use kNN to impute failed probes or high-MAF CpG site SNPs with empirical evidence of a trimodal distribution.  I don't do "EWAS" studies but I will say that identifying biomarkers or functionally interesting sites is easier when you call DMRs using sensibly imputed data. They tend to replicate.

Best,

--t
Reply all
Reply to author
Forward
0 new messages