RnBeads Combine & Estimate Proportions

171 views
Skip to first unread message

Peter Mcerlean

unread,
Apr 15, 2020, 9:38:45 AM4/15/20
to Epigenomics forum
Dear RnBeads team,

Apologies for combining two issues in one post but I believe they are related.

I'm trying to compare my EPIC data with some BluePrint WGBS references to potentially see relationships/contribution of cell types.

To start, I have been using the 'combine' function to join my EPIC and WGBS RnB.set files. I can import and read in each data set separately but get an error when trying to combine them using the 'type= "common" option (code at end). 

I excluded the 'type' option and the files combine, but include all CpGs, not the Union between sets. This is evidenced by the numbers and when I brought this combined set through exploratory analysis and samples separate on PCA by assay.  

I also tried do.call (combine) and 'merge' and 'union' options to no avail, the latter two I believed are masked?

Is this somehow due to the RnBeadSet < RnBeadRawSet < RnBiseqSet hierarchy? 

Regardless of this, I wanted to see about cell type contribution and ran the inference module (GUI) but only to get presented with "Found only one sample per each cell type". I also ran the rnb.execute.ct.estimation individually with the same result. 

Is there an issue with my annotation? (image below).

Kind regards,

Peter


Screen Shot 2020-04-15 at 2.29.30 PM.png




Combine sample output:

Samples

Object of class RnBeadRawSet

       8 samples

  782028 probes

                  of which: 782028 CpG, 0 CpH, and 0 rs

Region types:

                    237462 regions of type tiling

                     32035 regions of type genes

                     41684 regions of type promoters

                     24870 regions of type cpgislands

Intensity information is present

Detection p-values are present

Bead counts are present

Quality control information is present

Summary of normalization procedures:

                  The methylation data was normalized with method wm.dasen.

                  No background correction was performed.

 

BluePrint

Object of class RnBiseqSet

       8 samples

24790847 methylation sites

Region types:

                    533764 regions of type tiling

                     50336 regions of type genes

                     53767 regions of type promoters

                     26065 regions of type cpgislands

Coverage information is present

 

Combined <- combine(BluePrint,Samples, type= "common")

Error in do.call(combine, list(y, ...)) :

  argument "y" is missing, with no default


Combined <- combine(BluePrint,Samples)

 

Combined

Object of class RnBiseqSet

      16 samples

24820689 methylation sites

Region types:

                    533990 regions of type tiling

                     50413 regions of type genes

                     53913 regions of type promoters

                     26182 regions of type cpgislands

Coverage information is absent

Michael Scherer

unread,
Apr 16, 2020, 2:40:03 AM4/16/20
to Epigenomics forum
Hi Peter,

Thanks for reporting this issue, I will try to help you as good as I can. First, for combining a BeadArray with a bisulfite sequencing dataset, I typically use this code: 

library(RnBeads)
rnb.set <- load.rnb.set("path_to_RnBiseqSet")
ref.set <- load.rnb.set("path_to_RnBeadRawSet")
anno.rnb <- annotation(rnb.set)
anno.rnb <- GRanges(Rle(anno.rnb$Chromosome),IRanges(start=anno.rnb$Start,end=anno.rnb$End),strand=anno.rnb$Strand)
anno.ref <- annotation(rnb.set)
anno.ref <- GRanges(Rle(anno.ref$Chromosome),IRanges(start=anno.ref$Start,end=anno.ref$End),strand=anno.ref$Strand)
op <- findOverlaps(anno.rnb,anno.ref)
anno.new.rnb <- anno.ref[subjectHits(op)]
anno.new.rnb <- data.frame(chromosome=seqnames(anno.new.rnb),position=start(anno.new.rnb),strand=strand(anno.new.rnb))
new.meth <- meth(rnb.set)[queryHits(op),]
rnb.set <- RnBiseqSet(pheno(rnb.set),sites=anno.new.rnb,meth=new.meth,assembly="hg19")
rnb.options(identifiers.column="sample_id")
rnb.set <- combine(rnb.set,ref.set)

For the second issue you raised: RnBeads requires multiple replicates of a cell type to compute the proportions. In the first step, cell-type specific CpGs are determined similar to a differential analysis, this is why multiple samples per group (here cell type) are required. As far as I can see it in your sheet, you have some cell types with only a single sample. Do you have access to more samples?

Hope that helps a bit.

Best,

Michael

Peter Mcerlean

unread,
Apr 18, 2020, 5:30:56 AM4/18/20
to Epigenomics forum
Dear Micheal, 

Thanks for you help!

Good news first, the code you provided combined my data sets and with a revised annotation sheet I was able to get the cell-type composition output. Thank you! (FYI - I believe there was a typo in the code - red text below)

However, when I did the exploratory analysis I was still getting separation by assay on PCA. The clustering heat-maps revealed that there is a whole bunch of missing values, in both datasets?!

Not sure why this has happened because looking at the numbers in the combined rnb.set it seems that they all should be there (output below)?

I neglected to say in my previous post that I downloaded the WGBS data from the methylome resource page and converted them to hg19 using LifOver (Galaxy) and then preprocessed the WGBS and EPIC data set separately. 

I've included the preprocessing log files in case this has introduced an issue. 

Kind regards,

Peter


> library(RnBeads)

> ref.set <- load.rnb.set("/Users/…/EPIC_rnbSet_preprocessed")

Warning messages:

1: In FUN(X[[i]], ...) : NOTE: did not overwrite file 'ff141261f3d829.ff'

2: In FUN(X[[i]], ...) : NOTE: did not overwrite file 'ff141262ff46cbf.ff'

> ref.set

Object of class RnBeadRawSet

      12 samples

  784977 probes

                  of which: 784977 CpG, 0 CpH, and 0 rs

Region types:

                    237785 regions of type tiling

                     32055 regions of type genes

                     41697 regions of type promoters

                     24889 regions of type cpgislands

Intensity information is present

Detection p-values are present

Bead counts are present

Quality control information is present

Summary of normalization procedures:

                  The methylation data was normalized with method wm.dasen.

                  No background correction was performed.

> rnb.set <- load.rnb.set("/Users/…/rnbSet_WGBS_preprocessed.zip")

> rnb.set

Object of class RnBiseqSet

      12 samples

24775763 methylation sites

Region types:

                    533515 regions of type tiling

                     53668 regions of type promoters

                     50281 regions of type genes

                     26017 regions of type cpgislands

Coverage information is present

> anno.rnb <- annotation(rnb.set)

> anno.rnb <- GRanges(Rle(anno.rnb$Chromosome),IRanges(start=anno.rnb$Start,end=anno.rnb$End),strand=anno.rnb$Strand)

> anno.ref <- annotation(ref.set) <- I think this was a typo in the original post

> anno.ref <- GRanges(Rle(anno.ref$Chromosome),IRanges(start=anno.ref$Start,end=anno.ref$End),strand=anno.ref$Strand)

> op <- findOverlaps(anno.rnb,anno.ref)

> anno.new.rnb <- anno.ref[subjectHits(op)]

> anno.new.rnb <- data.frame(chromosome=seqnames(anno.new.rnb),position=start(anno.new.rnb),strand=strand(anno.new.rnb))

> new.meth <- meth(rnb.set)[queryHits(op),]

> rnb.set <- RnBiseqSet(pheno(rnb.set),sites=anno.new.rnb,meth=new.meth,assembly="hg19")

> rnb.options(identifiers.column="Patient_ID")

> rnb.set.combined <- combine(rnb.set,ref.set)

> rnb.set.combined

Object of class RnBiseqSet

      24 samples

  784977 methylation sites

Region types:

                    237785 regions of type tiling

                     41697 regions of type promoters

                     32055 regions of type genes

                     24889 regions of type cpgislands

Coverage information is absent

> save.rnb.set(rnb.set.combined, path=file.path("/Users/…/RnBeads/, “Combine.test.170420”))

heatmapLoc_1_euclidean_average_1_1_1000.png
heatmapLoc_4_correlation_average_1_1_1000.png
heatmapLoc_4_manhattan_complete_1_2_1000_high_resolution.png
heatmapQuant_1_correlation_complete_1.png
WGBS_preprocess_analysis.log
EPIC_preprocess_analysis.log

Michael Scherer

unread,
Apr 19, 2020, 5:07:40 AM4/19/20
to Epigenomics forum
Dear Peter,

Finding out the particular problem that you report is pretty challenging without having the dataset at hand. I would suggest that you remove all the sites that have missing values using something like this code:

meth.data <- meth(rnb.set.combined)
has.missing <- apply(meth.data,1,function(x)any(is.na(x)))
rnb.set.combined <- remove.sites(rnb.set.combined,has.missing)

and then re-perform the analysis. Just some further thoughts on your analysis:

  • I have made some bad experience with liftOver, since many sites are simply not mapped at all. You might loose some a substantial amount of information.
  • Combining WGBS and EPIC data still has many caveats: For instance, bisulfite sequencing data will always be "more" bimodal than EPIC data, simply due to technical issue. It's hard to overcome this.
  • If the proportions obtained using the cell type estimation make sense to you, I think it's fine to use them even though there might be a technical bias.
I hope that helps a bit,

Michael

Peter Mcerlean

unread,
Apr 27, 2020, 6:08:48 AM4/27/20
to Epigenomics forum
Hi Michael,

So I played around with this a lot and it seems that the coverage between the reference sets and my data is the issue. 

I was expecting there to be 100% overlap between EPIC probes and WGBS, as in have at least some read coverage in the reference sets but I do observe loads of missing values. Perhaps imputation is needed?

I think these missing values are also due to the fact I'm using reference data from different cells. So after removing the missing values (or not!) and doing the overlap, RnBeads is giving me a type of union of the best represented EPIC probes across all cells. 

At the end of the day I had 297,199 methylation sites across my samples and reference data (below) and while I did see separation by assay for PC1, the PC2/PC3 plots looked better suggesting biological variance was there. 

I was thinking that when doing the cell estimate analysis using this set would give me some funky results - 50K most variable out of ~300K? But honestly the data looks good and makes sense biologically, confirming the biological variance captured in the overlap was sufficient. 

Thanks again for your help!

Best,

Peter


Object of class RnBiseqSet
      71 samples
  297199 methylation sites
Region types:
  139251 regions of type tiling
   26039 regions of type genes
   32219 regions of type promoters
   19730 regions of type cpgislands
Coverage information is absent




Michael Scherer

unread,
Apr 28, 2020, 2:15:01 AM4/28/20
to Epigenomics forum
Hey Peter,

Thanks for this detailed update; I think it is really valuable for the community.

This is also my experience; although the overlap seems low, a low number of CpGs is sufficient to characterize the cell type identity and perform the cell type estimation.

Best,
 Michael

Peter Mcerlean

unread,
Jun 17, 2020, 6:25:31 AM6/17/20
to Epigenomics forum
Dear Michael, 

I'm trying to plot the methylation profiles for the CpG's used in the cell composition calculations using the merged EPIC and BluePrint datasets as outlined in the posts above. 

However, I can't seem to find what the $marker output represents? 

I thought it was the row.names for the merged set but it starts at '12641' and excludes some of the lower markers identified (e.g. 3272 - below). 

I then thought it may have been one of either the EPIC or BlueP annotations but theses numbers aren't there either?

Does the $marker output represent the row number in the EPIC annotation? 

Additionally, is there a way of exporting individual components of the CellTypeInferenceResult object?

Kind regards,

Peter

Cell.comp <-rnb.execute.ct.estimation(rnb.set.combined, cell.type.column = “CellType”, test.max.markers = 50000, top.markers = 500, method = "houseman1", verbose = TRUE)

 

>Cell.comp

….

$most.variable

   [1] 156150…87176…41285 …

 

$coef.ests

                  M2                   Class_Mono   M1                   M0                  Mono

30451247 0.020204082 0.865195332 0.006535948 0.000000000 0.985714286

17829019 0.043181818 0.906926554 0.049203468 0.015873016 0.960284281

1653259  0.012500000 0.932564103 0.000000000 0.000000000 0.803571429

 

$markers

  [1]  59752…3272…8379…

 

>annot.sites <- annotation(either - rnb.set/ref.set/rnb.set.combined)

 

>annot.sites.rnb.set

Chromosome  Start    End Strand Strand.1 AddressA AddressB Design Color Context Random

cg26928153       chr1  10848  10849      -        - 91693541 47784201      I   Grn      CG  FALSE

cg16269199       chr1  10850  10851      -        - 82663207  3701821      I   Grn      CG  FALSE

cg13869341       chr1  15865  15866      +        +  2665852 39757192      I   Red      CG  FALSE

 

>annot.sites.ref.set

      Chromosome  Start    End Strand CpG GC CGI Relation SNPs HumanMethylation27 HumanMethylation450

13          chr1  10525  10526      +   8 67     Open Sea <NA>                 NA                  NA

129         chr1  10848  10849      +  15 74     Open Sea <NA>                 NA                  NA

131         chr1  10850  10851      +  15 74     Open Sea <NA>                 NA                  NA

 

>annot.sites.rnb.set.combined

      Chromosome  Start    End Strand CpG GC CGI Relation SNPs HumanMethylation27 HumanMethylation450

12641       chr1 710097 710098      +   2 28        Shelf <NA>                 NA                  34

13677       chr1 752696 752697      +   2 38     Open Sea <NA>                 NA                  NA

14583       chr1 778272 778273      +   5 60     Open Sea <NA>                 NA                  NA

 

Michael Scherer

unread,
Jun 17, 2020, 12:14:21 PM6/17/20
to Epigenomics forum
Hi Peter,

Yes, indeed the $markers output represents the row number of the CpGs in the set that you used (rnb.set.combined). Notably, the number of rows for meth(rnb.set.combined) and annotation(rnb.set.combined) are the same, so you can obtain the information from the annotation. Please also note that there is no CpG identifier associated with the sites of WGBS/RRBS, which is why it is lost in the combined object.

Since the CellTypeInferenceResult is merely a list of objects, which you described above, there is no dedicated way to export the information other than standard R in- or output.

Hope that helps,

Michael

Ayesha Tariq

unread,
Jun 17, 2020, 4:03:14 PM6/17/20
to epigenom...@googlegroups.com
Can some one please guide me that how would I know if the region I am looking at is hyper or hypomethyalted in RnBeads results. Do I have to look at mean.mean.qout of log2?
Best,
Ayesha

--
You received this message because you are subscribed to the Google Groups "Epigenomics forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to epigenomicsfor...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/epigenomicsforum/86d0bf3e-c2e3-4b3c-8830-b4028db34054o%40googlegroups.com.

Michael Scherer

unread,
Jun 18, 2020, 2:17:49 AM6/18/20
to Epigenomics forum
Hi Ayesha,

You will have to look at the mean.mean.diff value. In case it is negative, the first group has higher methylated (i.e. hypermethylated) and in case it is negative, the first group is lower methylated (i.e. hypomethylated).

Best,

Michael

Ayesha Tariq

unread,
Jun 18, 2020, 3:03:17 AM6/18/20
to epigenom...@googlegroups.com
Thank you.

Sent from my iPhone
--
You received this message because you are subscribed to the Google Groups "Epigenomics forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to epigenomicsfor...@googlegroups.com.

Peter Mcerlean

unread,
Jun 18, 2020, 10:57:21 AM6/18/20
to Epigenomics forum
Hi Michael,

Ahhh I see!

Thanks again for your continued support!

Best,

Peter

Ayesha Tariq

unread,
Jun 29, 2020, 5:17:30 AM6/29/20
to epigenom...@googlegroups.com
Hi Michael,
Negative mean.mean.diff is hyper methylated?

On 18-Jun-2020, at 11:17 AM, 'Michael Scherer' via Epigenomics forum <epigenom...@googlegroups.com> wrote:

--
You received this message because you are subscribed to the Google Groups "Epigenomics forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to epigenomicsfor...@googlegroups.com.

Ayesha Tariq

unread,
Jun 29, 2020, 5:48:51 AM6/29/20
to epigenom...@googlegroups.com
Hi,
I have DMR output data from RnBeads in a csv format. I want to plot DMR for chromosomes. I am unable to figure out the code for that. Can anyone please help?
My CSV file has following columns

id Chromosome Start End symbol entrezID mean.mean.T1 mean.mean.T2 mean.mean.diff mean.mean.quot.log2 comb.p.val comb.p.adj.fdr combinedRank num.sites mean.num.na.T1 mean.num.na.T2 mean.mean.covg.T1 mean.mean.covg.T2 mean.nsamples.covg.thres5.T1 mean.nsamples.covg.thres5.T2
ENSG00000227232 chr1 14363 29806 WASH7P 653635 0.37379381 0.347063 0.02673081 -0.0065004 0.00013973 0.00034358 27172 5 0 0 12.15 12.3935484 15.8 30.8
Best, 
Ayesha


--
You received this message because you are subscribed to the Google Groups "Epigenomics forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to epigenomicsfor...@googlegroups.com.

s9ml...@googlemail.com

unread,
Jun 30, 2020, 2:18:26 AM6/30/20
to Epigenomics forum
Hi Ayesha,

This depends on the setting that you have and what you compare . In the example below, the DMR is hypermethylated in the first group (T1), i.e. hypomethylated in the second (T2).

Best,

Michael

Ayesha Tariq

unread,
Jun 30, 2020, 2:17:03 PM6/30/20
to epigenom...@googlegroups.com
Thanks so much Michael.

Best,
Ayesha

On 30-Jun-2020, at 11:18 AM, 's9ml...@googlemail.com' via Epigenomics forum <epigenom...@googlegroups.com> wrote:

Hi Ayesha,

This depends on the setting that you have and what you compare . In the example below, the DMR is hypermethylated in the first group (T1), i.e. hypomethylated in the second (T2).

Best,

Michael

atar...@ole.augie.edu schrieb am Montag, 29. Juni 2020 um 11:17:30 UTC+2:
Hi Michael,
Negative mean.mean.diff is hyper methylated?

On 18-Jun-2020, at 11:17 AM, 'Michael Scherer' via Epigenomics forum <epigenom...@googlegroups.com> wrote:

Hi Ayesha,

You will have to look at the mean.mean.diff value. In case it is negative, the first group has higher methylated (i.e. hypermethylated) and in case it is negative, the first group is lower methylated (i.e. hypomethylated).

Best,

Michael

--
You received this message because you are subscribed to the Google Groups "Epigenomics forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to epigenomicsfor...@googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "Epigenomics forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to epigenomicsfor...@googlegroups.com.

Ayesha Tariq

unread,
Jul 10, 2020, 2:07:58 PM7/10/20
to epigenom...@googlegroups.com
Hi all,
I am working with Gviz R package and getting the following error


> itrack <- IdeogramTrack(genome = gen, chromosome = chr)
Error in eval(expression, envir = callEnv) : 
  'NA' is not a valid UCSC genome.

Although there aren’t  any NAs in my data.
 Can anyone please help?
Best,
 Ayesha 
Reply all
Reply to author
Forward
0 new messages