RnBeads Combine & Estimate Proportions

Peter Mcerlean

unread,

Apr 15, 2020, 9:38:45 AM4/15/20

to Epigenomics forum

Dear RnBeads team,

Apologies for combining two issues in one post but I believe they are related.

I'm trying to compare my EPIC data with some BluePrint WGBS references to potentially see relationships/contribution of cell types.

To start, I have been using the 'combine' function to join my EPIC and WGBS RnB.set files. I can import and read in each data set separately but get an error when trying to combine them using the 'type= "common" option (code at end).

I excluded the 'type' option and the files combine, but include all CpGs, not the Union between sets. This is evidenced by the numbers and when I brought this combined set through exploratory analysis and samples separate on PCA by assay.

I also tried do.call (combine) and 'merge' and 'union' options to no avail, the latter two I believed are masked?

Is this somehow due to the RnBeadSet < RnBeadRawSet < RnBiseqSet hierarchy?

Regardless of this, I wanted to see about cell type contribution and ran the inference module (GUI) but only to get presented with "Found only one sample per each cell type". I also ran the rnb.execute.ct.estimation individually with the same result.

Is there an issue with my annotation? (image below).

Kind regards,

Peter

Screen Shot 2020-04-15 at 2.29.30 PM.png

Combine sample output:

Samples

Object of class RnBeadRawSet

8 samples

782028 probes

of which: 782028 CpG, 0 CpH, and 0 rs

Region types:

237462 regions of type tiling

32035 regions of type genes

41684 regions of type promoters

24870 regions of type cpgislands

Intensity information is present

Detection p-values are present

Bead counts are present

Quality control information is present

Summary of normalization procedures:

The methylation data was normalized with method wm.dasen.

No background correction was performed.

BluePrint

Object of class RnBiseqSet

8 samples

24790847 methylation sites

Region types:

533764 regions of type tiling

50336 regions of type genes

53767 regions of type promoters

26065 regions of type cpgislands

Coverage information is present

Combined <- combine(BluePrint,Samples, type= "common")

Error in do.call(combine, list(y, ...)) :

argument "y" is missing, with no default

Combined <- combine(BluePrint,Samples)

Combined

Object of class RnBiseqSet

16 samples

24820689 methylation sites

Region types:

533990 regions of type tiling

50413 regions of type genes

53913 regions of type promoters

26182 regions of type cpgislands

Coverage information is absent

Michael Scherer

unread,

Apr 16, 2020, 2:40:03 AM4/16/20

to Epigenomics forum

Hi Peter,

Thanks for reporting this issue, I will try to help you as good as I can. First, for combining a BeadArray with a bisulfite sequencing dataset, I typically use this code:

library(RnBeads)
rnb.set <- load.rnb.set("path_to_RnBiseqSet")
ref.set <- load.rnb.set("path_to_RnBeadRawSet")
anno.rnb <- annotation(rnb.set)
anno.rnb <- GRanges(Rle(anno.rnb$Chromosome),IRanges(start=anno.rnb$Start,end=anno.rnb$End),strand=anno.rnb$Strand)
anno.ref <- annotation(rnb.set)
anno.ref <- GRanges(Rle(anno.ref$Chromosome),IRanges(start=anno.ref$Start,end=anno.ref$End),strand=anno.ref$Strand)
op <- findOverlaps(anno.rnb,anno.ref)
anno.new.rnb <- anno.ref[subjectHits(op)]
anno.new.rnb <- data.frame(chromosome=seqnames(anno.new.rnb),position=start(anno.new.rnb),strand=strand(anno.new.rnb))
new.meth <- meth(rnb.set)[queryHits(op),]
rnb.set <- RnBiseqSet(pheno(rnb.set),sites=anno.new.rnb,meth=new.meth,assembly="hg19")
rnb.options(identifiers.column="sample_id")
rnb.set <- combine(rnb.set,ref.set)

For the second issue you raised: RnBeads requires multiple replicates of a cell type to compute the proportions. In the first step, cell-type specific CpGs are determined similar to a differential analysis, this is why multiple samples per group (here cell type) are required. As far as I can see it in your sheet, you have some cell types with only a single sample. Do you have access to more samples?

Hope that helps a bit.

Best,

Michael

Peter Mcerlean

unread,

Apr 18, 2020, 5:30:56 AM4/18/20

to Epigenomics forum

Dear Micheal,

Thanks for you help!

Good news first, the code you provided combined my data sets and with a revised annotation sheet I was able to get the cell-type composition output. Thank you! (FYI - I believe there was a typo in the code - red text below)

However, when I did the exploratory analysis I was still getting separation by assay on PCA. The clustering heat-maps revealed that there is a whole bunch of missing values, in both datasets?!

Not sure why this has happened because looking at the numbers in the combined rnb.set it seems that they all should be there (output below)?

I neglected to say in my previous post that I downloaded the WGBS data from the methylome resource page and converted them to hg19 using LifOver (Galaxy) and then preprocessed the WGBS and EPIC data set separately.

I've included the preprocessing log files in case this has introduced an issue.

Kind regards,

Peter

> library(RnBeads)

> ref.set <- load.rnb.set("/Users/…/EPIC_rnbSet_preprocessed")

Warning messages:

1: In FUN(X[[i]], ...) : NOTE: did not overwrite file 'ff141261f3d829.ff'

2: In FUN(X[[i]], ...) : NOTE: did not overwrite file 'ff141262ff46cbf.ff'

> ref.set

Object of class RnBeadRawSet

12 samples

784977 probes

of which: 784977 CpG, 0 CpH, and 0 rs

Region types:

237785 regions of type tiling

32055 regions of type genes

41697 regions of type promoters

24889 regions of type cpgislands

Intensity information is present

Detection p-values are present

Bead counts are present

Quality control information is present

Summary of normalization procedures:

The methylation data was normalized with method wm.dasen.

No background correction was performed.

> rnb.set <- load.rnb.set("/Users/…/rnbSet_WGBS_preprocessed.zip")

> rnb.set

Object of class RnBiseqSet

12 samples

24775763 methylation sites

Region types:

533515 regions of type tiling

53668 regions of type promoters

50281 regions of type genes

26017 regions of type cpgislands

Coverage information is present

> anno.rnb <- annotation(rnb.set)

> anno.rnb <- GRanges(Rle(anno.rnb$Chromosome),IRanges(start=anno.rnb$Start,end=anno.rnb$End),strand=anno.rnb$Strand)

> anno.ref <- annotation(ref.set) <- I think this was a typo in the original post

> anno.ref <- GRanges(Rle(anno.ref$Chromosome),IRanges(start=anno.ref$Start,end=anno.ref$End),strand=anno.ref$Strand)

> op <- findOverlaps(anno.rnb,anno.ref)

> anno.new.rnb <- anno.ref[subjectHits(op)]

> anno.new.rnb <- data.frame(chromosome=seqnames(anno.new.rnb),position=start(anno.new.rnb),strand=strand(anno.new.rnb))

> new.meth <- meth(rnb.set)[queryHits(op),]

> rnb.set <- RnBiseqSet(pheno(rnb.set),sites=anno.new.rnb,meth=new.meth,assembly="hg19")

> rnb.options(identifiers.column="Patient_ID")

> rnb.set.combined <- combine(rnb.set,ref.set)

> rnb.set.combined

Object of class RnBiseqSet

24 samples

784977 methylation sites

Region types:

237785 regions of type tiling

41697 regions of type promoters

32055 regions of type genes

24889 regions of type cpgislands

Coverage information is absent

> save.rnb.set(rnb.set.combined, path=file.path("/Users/…/RnBeads/, “Combine.test.170420”))

heatmapLoc_1_euclidean_average_1_1_1000.png

heatmapLoc_4_correlation_average_1_1_1000.png

heatmapLoc_4_manhattan_complete_1_2_1000_high_resolution.png

heatmapQuant_1_correlation_complete_1.png

WGBS_preprocess_analysis.log

EPIC_preprocess_analysis.log

Michael Scherer

unread,

Apr 19, 2020, 5:07:40 AM4/19/20

to Epigenomics forum

Dear Peter,

Finding out the particular problem that you report is pretty challenging without having the dataset at hand. I would suggest that you remove all the sites that have missing values using something like this code:

meth.data <- meth(rnb.set.combined)

has.missing <- apply(meth.data,1,function(x)any(is.na(x)))

rnb.set.combined <- remove.sites(rnb.set.combined,has.missing)

and then re-perform the analysis. Just some further thoughts on your analysis:

I have made some bad experience with liftOver, since many sites are simply not mapped at all. You might loose some a substantial amount of information.
Combining WGBS and EPIC data still has many caveats: For instance, bisulfite sequencing data will always be "more" bimodal than EPIC data, simply due to technical issue. It's hard to overcome this.
If the proportions obtained using the cell type estimation make sense to you, I think it's fine to use them even though there might be a technical bias.

I hope that helps a bit,

Michael

Peter Mcerlean

unread,

Apr 27, 2020, 6:08:48 AM4/27/20

to Epigenomics forum

Hi Michael,

So I played around with this a lot and it seems that the coverage between the reference sets and my data is the issue.

I was expecting there to be 100% overlap between EPIC probes and WGBS, as in have at least some read coverage in the reference sets but I do observe loads of missing values. Perhaps imputation is needed?

I think these missing values are also due to the fact I'm using reference data from different cells. So after removing the missing values (or not!) and doing the overlap, RnBeads is giving me a type of union of the best represented EPIC probes across all cells.

At the end of the day I had 297,199 methylation sites across my samples and reference data (below) and while I did see separation by assay for PC1, the PC2/PC3 plots looked better suggesting biological variance was there.

I was thinking that when doing the cell estimate analysis using this set would give me some funky results - 50K most variable out of ~300K? But honestly the data looks good and makes sense biologically, confirming the biological variance captured in the overlap was sufficient.

Thanks again for your help!

Best,

Peter

Object of class RnBiseqSet

71 samples

297199 methylation sites

Region types:

139251 regions of type tiling

26039 regions of type genes

32219 regions of type promoters

19730 regions of type cpgislands

Coverage information is absent

Michael Scherer

unread,

Apr 28, 2020, 2:15:01 AM4/28/20

to Epigenomics forum

Hey Peter,

Thanks for this detailed update; I think it is really valuable for the community.

This is also my experience; although the overlap seems low, a low number of CpGs is sufficient to characterize the cell type identity and perform the cell type estimation.

Best,

Michael

Peter Mcerlean

unread,

Jun 17, 2020, 6:25:31 AM6/17/20

to Epigenomics forum

Dear Michael,

I'm trying to plot the methylation profiles for the CpG's used in the cell composition calculations using the merged EPIC and BluePrint datasets as outlined in the posts above.

However, I can't seem to find what the $marker output represents?

I thought it was the row.names for the merged set but it starts at '12641' and excludes some of the lower markers identified (e.g. 3272 - below).

I then thought it may have been one of either the EPIC or BlueP annotations but theses numbers aren't there either?

Does the $marker output represent the row number in the EPIC annotation?

Additionally, is there a way of exporting individual components of the CellTypeInferenceResult object?

Kind regards,

Peter

Cell.comp <-rnb.execute.ct.estimation(rnb.set.combined, cell.type.column = “CellType”, test.max.markers = 50000, top.markers = 500, method = "houseman1", verbose = TRUE)

>Cell.comp

….

$most.variable

[1] 156150…87176…41285 …

$coef.ests

M2 Class_Mono M1 M0 Mono

30451247 0.020204082 0.865195332 0.006535948 0.000000000 0.985714286

17829019 0.043181818 0.906926554 0.049203468 0.015873016 0.960284281

1653259 0.012500000 0.932564103 0.000000000 0.000000000 0.803571429

$markers

[1] 59752…3272…8379…

>annot.sites <- annotation(either - rnb.set/ref.set/rnb.set.combined)

>annot.sites.rnb.set

Chromosome Start End Strand Strand.1 AddressA AddressB Design Color Context Random

cg26928153 chr1 10848 10849 - - 91693541 47784201 I Grn CG FALSE

cg16269199 chr1 10850 10851 - - 82663207 3701821 I Grn CG FALSE

cg13869341 chr1 15865 15866 + + 2665852 39757192 I Red CG FALSE

>annot.sites.ref.set

Chromosome Start End Strand CpG GC CGI Relation SNPs HumanMethylation27 HumanMethylation450

13 chr1 10525 10526 + 8 67 Open Sea <NA> NA NA

129 chr1 10848 10849 + 15 74 Open Sea <NA> NA NA

131 chr1 10850 10851 + 15 74 Open Sea <NA> NA NA

>annot.sites.rnb.set.combined

Chromosome Start End Strand CpG GC CGI Relation SNPs HumanMethylation27 HumanMethylation450

12641 chr1 710097 710098 + 2 28 Shelf <NA> NA 34

13677 chr1 752696 752697 + 2 38 Open Sea <NA> NA NA

14583 chr1 778272 778273 + 5 60 Open Sea <NA> NA NA

Michael Scherer

unread,

Jun 17, 2020, 12:14:21 PM6/17/20

to Epigenomics forum

Hi Peter,

Yes, indeed the $markers output represents the row number of the CpGs in the set that you used (rnb.set.combined). Notably, the number of rows for meth(rnb.set.combined) and annotation(rnb.set.combined) are the same, so you can obtain the information from the annotation. Please also note that there is no CpG identifier associated with the sites of WGBS/RRBS, which is why it is lost in the combined object.

Since the CellTypeInferenceResult is merely a list of objects, which you described above, there is no dedicated way to export the information other than standard R in- or output.

Hope that helps,

Michael

Ayesha Tariq

unread,

Jun 17, 2020, 4:03:14 PM6/17/20

to epigenom...@googlegroups.com

Can some one please guide me that how would I know if the region I am looking at is hyper or hypomethyalted in RnBeads results. Do I have to look at mean.mean.qout of log2?

Best,

Ayesha

--
You received this message because you are subscribed to the Google Groups "Epigenomics forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to epigenomicsfor...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/epigenomicsforum/86d0bf3e-c2e3-4b3c-8830-b4028db34054o%40googlegroups.com.

Michael Scherer

unread,

Jun 18, 2020, 2:17:49 AM6/18/20

to Epigenomics forum

Hi Ayesha,

You will have to look at the mean.mean.diff value. In case it is negative, the first group has higher methylated (i.e. hypermethylated) and in case it is negative, the first group is lower methylated (i.e. hypomethylated).

Best,

Michael

Ayesha Tariq

unread,

Jun 18, 2020, 3:03:17 AM6/18/20

to epigenom...@googlegroups.com

Thank you.

Sent from my iPhone

--

You received this message because you are subscribed to the Google Groups "Epigenomics forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to epigenomicsfor...@googlegroups.com.

To view this discussion on the web, visit https://groups.google.com/d/msgid/epigenomicsforum/abb2595e-9c98-420a-be72-2138fd2b79a8o%40googlegroups.com.

Peter Mcerlean

unread,

Jun 18, 2020, 10:57:21 AM6/18/20

to Epigenomics forum

Hi Michael,

Ahhh I see!

Thanks again for your continued support!

Best,

Peter

Ayesha Tariq

unread,

Jun 29, 2020, 5:17:30 AM6/29/20

to epigenom...@googlegroups.com

Hi Michael,

Negative mean.mean.diff is hyper methylated?

On 18-Jun-2020, at 11:17 AM, 'Michael Scherer' via Epigenomics forum <epigenom...@googlegroups.com> wrote:

--
You received this message because you are subscribed to the Google Groups "Epigenomics forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to epigenomicsfor...@googlegroups.com.

To view this discussion on the web, visit https://groups.google.com/d/msgid/epigenomicsforum/abb2595e-9c98-420a-be72-2138fd2b79a8o%40googlegroups.com.

Ayesha Tariq

unread,

Jun 29, 2020, 5:48:51 AM6/29/20

to epigenom...@googlegroups.com

Hi,

I have DMR output data from RnBeads in a csv format. I want to plot DMR for chromosomes. I am unable to figure out the code for that. Can anyone please help?

My CSV file has following columns

id	Chromosome	Start	End	symbol	entrezID	mean.mean.T1	mean.mean.T2	mean.mean.diff	mean.mean.quot.log2	comb.p.val	comb.p.adj.fdr	combinedRank	num.sites	mean.num.na.T1	mean.num.na.T2	mean.mean.covg.T1	mean.mean.covg.T2	mean.nsamples.covg.thres5.T1	mean.nsamples.covg.thres5.T2
ENSG00000227232	chr1	14363	29806	WASH7P	653635	0.37379381	0.347063	0.02673081	-0.0065004	0.00013973	0.00034358	27172	5	0	0	12.15	12.3935484	15.8	30.8

Best,

Ayesha

--
You received this message because you are subscribed to the Google Groups "Epigenomics forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to epigenomicsfor...@googlegroups.com.

To view this discussion on the web, visit https://groups.google.com/d/msgid/epigenomicsforum/28e4b2fd-c764-4642-90ff-8c4e2b144838o%40googlegroups.com.

s9ml...@googlemail.com

unread,

Jun 30, 2020, 2:18:26 AM6/30/20

to Epigenomics forum

Hi Ayesha,

This depends on the setting that you have and what you compare . In the example below, the DMR is hypermethylated in the first group (T1), i.e. hypomethylated in the second (T2).

Best,

Michael

Ayesha Tariq

unread,

Jun 30, 2020, 2:17:03 PM6/30/20

to epigenom...@googlegroups.com

Thanks so much Michael.

Best,

Ayesha

On 30-Jun-2020, at 11:18 AM, 's9ml...@googlemail.com' via Epigenomics forum <epigenom...@googlegroups.com> wrote:

Hi Ayesha,

This depends on the setting that you have and what you compare . In the example below, the DMR is hypermethylated in the first group (T1), i.e. hypomethylated in the second (T2).

Best,

Michael

atar...@ole.augie.edu schrieb am Montag, 29. Juni 2020 um 11:17:30 UTC+2:
Hi Michael,
Negative mean.mean.diff is hyper methylated?

On 18-Jun-2020, at 11:17 AM, 'Michael Scherer' via Epigenomics forum <epigenom...@googlegroups.com> wrote:

Hi Ayesha,

You will have to look at the mean.mean.diff value. In case it is negative, the first group has higher methylated (i.e. hypermethylated) and in case it is negative, the first group is lower methylated (i.e. hypomethylated).

Best,

Michael

--
You received this message because you are subscribed to the Google Groups "Epigenomics forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to epigenomicsfor...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/epigenomicsforum/abb2595e-9c98-420a-be72-2138fd2b79a8o%40googlegroups.com.

--
You received this message because you are subscribed to the Google Groups "Epigenomics forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to epigenomicsfor...@googlegroups.com.

To view this discussion on the web, visit https://groups.google.com/d/msgid/epigenomicsforum/dc3680b4-ac3b-435c-98f6-cff8da81c931n%40googlegroups.com.

Ayesha Tariq

unread,

Jul 10, 2020, 2:07:58 PM7/10/20

to epigenom...@googlegroups.com

Hi all,

I am working with Gviz R package and getting the following error

> itrack <- IdeogramTrack(genome = gen, chromosome = chr)

Error in eval(expression, envir = callEnv) :

'NA' is not a valid UCSC genome.

Although there aren’t any NAs in my data.

Can anyone please help?

Best,

Ayesha

Reply all

Reply to author

Forward