DB files partially read when converted into methylKit object

280 views
Skip to first unread message

Johan Zicola

unread,
Jan 25, 2021, 7:56:45 AM1/25/21
to methylkit_discussion
Hello,

It seems that the conversion from methylBaseDB or methylRawListDB to methylBase or methylRawList  using the function as() is no working well in R 4.0.3, methylKit 1.14.2, Windows 10 (see details session at the end).

The issue is that for methylKit objects of a certain size (the error does not seem to occur below 200 Mb), the transfer from DB files (.bgz) to the RAM is only partial, with the beginning of the object being correct and the end with NAs instead of the values from the compressed DB file.

I first thought the issue may come from a RAM management problem in my R installation but it does not seem to be the case as I can read without issues the bgzipped files using the function "gzfile"
> zz <- gzfile("file.txt.bgz", "rt")
> df <- read.csv(zz, header=F, sep='\t', comment.char = "#")

Also, the problem is not systematic and I sometimes recover the whole object when I restart my R session. So it seems somehow to be connected with some RAM cluttering although functions as "gzfile" work in any case, as far as I have tested.

I put here some pictures to illustrate the issues:

Capture.PNG

You can see that the methylBase object for CpG displays correclty the last rows while the methylBase for CHH displays NAs (I put a screenshot of the tail of actual DB file to show that the issue is not there).

I don't know what is behind the as() function and what is used to read the bgzipped files but I think the problem comes from there.


Here me info session:
infoSession.PNG

Thanks for your help!
Best,
Johan

Alexander Blume

unread,
Jan 26, 2021, 7:45:07 AM1/26/21
to methylkit_discussion
Hi Johan, 

Thank you for the detailed reporting of your error, that helps a lot. Right now I do not have any explanation why this error could occur, but with your help, we might figure this out. 

The internal call stack of the `as()` function is quite similar to the `getData()` function, as both are extracting all data from the tabix file using the internal function `methylkit:::fread.gzipped()`. There we are using `R.utils::gunzip` to uncompress the tabix .bgz file and `data.table::fread()` to read the data into memory.  
However, I checked our specific settings (https://github.com/al2na/methylKit/blob/cfbc32ed9dd490e9e8070d33cacaa6ebecb56d76/R/backbone.R#L7) and I think the default is to perform caching of uncompressed files, i.e. reuse uncompressed file if already exists.  
Maybe you could try whether `methylkit:::fread.gzipped("bgz.filepath", skipDecompress =FALSE )` works as expected, to check whether the caching is the problem. 

Best,
Alex

Johan Zicola

unread,
Jan 26, 2021, 9:37:38 AM1/26/21
to methylkit_discussion
Dear Alex,

Thank you for your quick answer!

So I tried `methylKit:::fread.gzipped` and it works properly, I get 3 data.table objects without NAs issues and the size of the objects is realistic.

image2.PNGimage3.PNG


I also tried with `skipDecompress =TRUE` and it also works (no NAs) but I get exactly the same object when I do the function a second time, as R was stuck to the same cache file:
image1.PNG


I had this problem before and I now understand why. R seems to read what is already in cache instead of extracting the new file given in the function. Maybe it is because the file names are identical, although the paths are different. I tested this hypothesis by renaming one of my methylBaseDB file and its associated tabix index file and it is the case:
image5.PNG

I had a whole pipeline running smoothly with an earlier methylKit version, where I was using similar name files in different folders. It does not work anymore and I assume it is because of a new version of ` R.utils::decompressFile` you mention line 22 in https://github.com/al2na/methylKit/blob/cfbc32ed9dd490e9e8070d33cacaa6ebecb56d76/R/backbone.R#L9

Now, I restart R to empty the cache and I start loading the mCHH methylBase object first (with the original name methylBase_.txt.bgz) and it works. The object is big and without NAs so the argument `skipDecompress=TRUE` is the definitive culprit for the misreading of cache files with the same names. It makes sense if they are located in the same "temporary folder or RAM slot" but this problem was not there in earlier versions of methylKit.

image4.PNG

That leads me to think that these cache files are also responsible for these NAs lines: I thought the issue was coming from the RAM as my first object was always CpG and was also the smallest in size. But the thing is that it was the first object to go to the uncompressed cache file and I guess the CHG and CHH were overwriting part of this file but not all. It makes sense since all consecutive loaded files have the same beginning (see below) but a different end (see below). It seems that for mCHH, where we see NAs, the object is extended (more lines as shown with the dim() ) function but not filled with the correct value. However, the data.table objects seem to have a correct ending, even with `skipDecompress=TRUE`. It seems to me that these NAs occur then when the data.table is converted into a methylKit object. I guess R extends the object based on the tabix file, which is probably not uncompressed, but uses the wrong uncompressed file to fill in. Somehow data.table objects deal with it but not methylKit objects.

image6.PNG

I think I could solve my issue locally by changing the argument  of the option `skipDecompress=FALSE`  for `methylKit:::fread.gzipped`  but I am not sure how I should do it?
Can you tell me how to convert the data.table objects into methylKit objects, then I could basically do a manual loading with `methylKit:::fread.gzipped` and a conversion to methylKit object and skip altogether the obscure function  `as()`?

Thanks,
Best regards,
Johan

Alexander Blume

unread,
Jan 26, 2021, 11:15:19 AM1/26/21
to methylkit_...@googlegroups.com
Hi Johan,

Great you figured this out! 

To circumvent the as() function you could implement your own version. Basically all you are missing is to rename the columns and create a new object of class “methylBase”: 

```
convert2MethylBase <- function(x) {
  
  df <- methylKit:::fread.gzipped(x@dbpath, stringsAsFactors = FALSE, data.table = FALSE,skipDecompress=FALSE)
  methylKit:::.setMethylDBNames(df,"methylBaseDB")
  
  new("methylBase",df[i,],
      sample.ids = x...@sample.ids
      assembly = x@assembly,
      context = x@context,
      treatment=x@treatment,
      destranded=x@destranded,
      resolution =x@resolution)
}
```

```
data("methylKit")
obj = methylKit::makeMethylDB(methylBase.obj)
convert2MethylBase(obj)
```

Please let me know if this does not work for you. 

Best, 
Alex


On 26. Jan 2021, at 15:37, Johan Zicola <johan....@gmail.com> wrote:

Dear Alex,

Thank you for your quick answer!

So I tried `methylKit:::fread.gzipped` and it works properly, I get 3 data.table objects without NAs issues and the size of the objects is realistic.

<image2.PNG><image3.PNG>



I also tried with `skipDecompress =TRUE` and it also works (no NAs) but I get exactly the same object when I do the function a second time, as R was stuck to the same cache file:
<image1.PNG>


I had this problem before and I now understand why. R seems to read what is already in cache instead of extracting the new file given in the function. Maybe it is because the file names are identical, although the paths are different. I tested this hypothesis by renaming one of my methylBaseDB file and its associated tabix index file and it is the case: 
<image5.PNG>

I had a whole pipeline running smoothly with an earlier methylKit version, where I was using similar name files in different folders. It does not work anymore and I assume it is because of a new version of `R.utils::decompressFile` you mention line 22 in https://github.com/al2na/methylKit/blob/cfbc32ed9dd490e9e8070d33cacaa6ebecb56d76/R/backbone.R#L9

Now, I restart R to empty the cache and I start loading the mCHH methylBase object first (with the original name methylBase_.txt.bgz) and it works. The object is big and without NAs so the argument `skipDecompress=TRUE` is the definitive culprit for the misreading of cache files with the same names. It makes sense if they are located in the same "temporary folder or RAM slot" but this problem was not there in earlier versions of methylKit.

<image4.PNG>

That leads me to think that these cache files are also responsible for these NAs lines: I thought the issue was coming from the RAM as my first object was always CpG and was also the smallest in size. But the thing is that it was the first object to go to the uncompressed cache file and I guess the CHG and CHH were overwriting part of this file but not all. It makes sense since all consecutive loaded files have the same beginning (see below) but a different end (see below). It seems that for mCHH, where we see NAs, the object is extended (more lines as shown with the dim() ) function but not filled with the correct value. However, the data.table objects seem to have a correct ending, even with `skipDecompress=TRUE`. It seems to me that these NAs occur then when the data.table is converted into a methylKit object. I guess R extends the object based on the tabix file, which is probably not uncompressed, but uses the wrong uncompressed file to fill in. Somehow data.table objects deal with it but not methylKit objects.

-- 
You received this message because you are subscribed to a topic in the Google Groups "methylkit_discussion" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/methylkit_discussion/UruFjvX89B4/unsubscribe.
To unsubscribe from this group and all its topics, send an email to methylkit_discus...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/methylkit_discussion/e95f651c-e44a-44da-9f53-6f1332294bd3n%40googlegroups.com.
<image3.PNG><image6.PNG><image2.PNG><image5.PNG><image4.PNG><image1.PNG>

Johan Zicola

unread,
Jan 27, 2021, 9:22:39 AM1/27/21
to methylkit_discussion
Hi Alex,

It seems there is an issue with your function. What represents the index i? this variable is not initialized anywhere
image1.PNG


image2.PNG
Also, I could not find the role of the ellipsis (three dots) that you use in the function `new()` to retrieve some variables from the methylBaseDB object in the R documentation or the web. Why do you need these instead of using directly `x...@sample.ids` for instance?

Thanks,
Best,
Johan
Message has been deleted
Message has been deleted

Alexander Blume

unread,
Jan 27, 2021, 10:00:08 AM1/27/21
to methylkit_...@googlegroups.com
Hi Johan,

I am sorry for the confusion, the mistake happened on my side. I forgot to fix the function in the mail after testing it in Rstudio. 
Please find the correct code snipped below and attached as script. 


```
#' convert2MethylBase
#'
#' Replacement of the as(x,"methylBase") function to handle
#' conversion of methylBaseDB to methylBase objects.
#'
#' NOTE: this function works only methylKit version > 1.13.2 and
#' will likely fail for earlier versions.
#'
#' @param x methylBaseDB object
#'
#' @return a methylBase object
#' @export
#'
#' @examples
#' library(methylKit)
#' data("methylKit")
#' obj = methylKit::makeMethylDB(methylBase.obj,dbdir = tempdir())
#' convert2MethylBase(obj)
#'
convert2MethylBase <- function(x) {

    df <- methylKit:::fread.gzipped(x@dbpath,
                                    stringsAsFactors = FALSE,
                                    data.table = FALSE,
                                    skipDecompress = FALSE)
    methylKit:::.setMethylDBNames(df,"methylBaseDB")

    new("methylBase",
        df,
        sample.ids = x...@sample.ids,
        assembly = x@assembly,
        context = x@context,
        treatment = x@treatment,
        coverage.index = x...@coverage.index,
        numCs.index = x...@numCs.index,
        numTs.index = x...@numTs.index,
        destranded = x@destranded,
        resolution = x@resolution
    )
}

```

Best,
Alex

convert2MethylBase.R
Message has been deleted
Message has been deleted

Johan Zicola

unread,
Jan 28, 2021, 6:28:36 AM1/28/21
to methylkit_...@googlegroups.com
Hi Alex,
Sorry but it is the 6th time I sent this email. It was bugging when I was trying from google groups so I send it from gmail now.
I created a different version to make it work with the 3 types of objects as the cache uncompressed file issue is the same whatever the object type (see enclosed R file).

I tested this function for the 3 objects and it works with your toy dataset.

```{r}
data("methylKit")

# Test for methylBase

# Test Function of Alex (just made for methylBase objects)
MethylBase_DB = methylKit::makeMethylDB(methylBase.obj)
MethylBase_DB_RAM1 <- convert2MethylBase(MethylBase_DB)
# Works

# New function to take any methylKit object classes
MethylBase_DB_RAM2 <- methylKitDB_to_methylKit_conversion(MethylBase_DB, type="methylBase")
# Works

# Test for methylDiff

MethylDiff_DB = methylKit::makeMethylDB(methylDiff.obj)
MethylDiff_DB_RAM <- methylKitDB_to_methylKit_conversion(MethylDiff_DB, type="methylDiff")
# Works

# Test for methylRawList
# Use lapply as we process a list of methylRaw objects
MethylRawList_DB = methylKit::makeMethylDB(methylRawList.obj)
MethylDiff_DB_RAM <- lapply(MethylRawList_DB, methylKitDB_to_methylKit_conversion, type="methylRaw")
# Works but I had to fix a bug in .setMethylDBNames
```

Note that I had to fix a bug in `.setMethylDBNames` line 1 (methylRawDB should be instead of methylDB). Can you fix it for the next version of MethylKit?

Thanks,
Best,
Johan

--
You received this message because you are subscribed to a topic in the Google Groups "methylkit_discussion" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/methylkit_discussion/UruFjvX89B4/unsubscribe.
To unsubscribe from this group and all its topics, send an email to methylkit_discus...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/methylkit_discussion/67ABF5D6-447F-4A2E-947A-65430A2423EE%40gmail.com.

 
On 27. Jan 2021, at 15:26, Johan Zicola <johan....@gmail.com> wrote:

Sorry, I had a typo in my last sentence:
Also, I could not find the role of the ellipsis (three dots) that you use in the function `new()` to retrieve some variables from the methylBaseDB object in the R documentation or the web. Why do you need these instead of using directly `x...@sample.ids` for instance?


On Wednesday, 27 January 2021 at 15:22:39 UTC+1 Johan Zicola wrote:
Hi Alex,

It seems there is an issue with your function. What represents the index i? this variable is not initialized anywhere
image1.PNG


image2.PNG
Also, I could not find the role of the ellipsis (three dots) that you use in the function `new()` to retrieve some variables from the methylBaseDB object in the R documentation or the web. Why do you need these instead of using directly `x...@sample.ids` for instance?

Thanks,
Best,
Johan


On Tuesday, 26 January 2021 at 17:15:19 UTC+1 Alexander Blume wrote:
Hi Johan,

Great you figured this out! 

To circumvent the as() function you could implement your own version. Basically all you are missing is to rename the columns and create a new object of class “methylBase”: 

```
convert2MethylBase <- function(x) {
  
  df <- methylKit:::fread.gzipped(x@dbpath, stringsAsFactors = FALSE, data.table = FALSE,skipDecompress=FALSE)
  methylKit:::.setMethylDBNames(df,"methylBaseDB")
  
  new("methylBase",df[i,],
      sample.ids = x...@sample.ids
      assembly = x@assembly,
      context = x@context,

--
You received this message because you are subscribed to a topic in the Google Groups "methylkit_discussion" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/methylkit_discussion/UruFjvX89B4/unsubscribe.
To unsubscribe from this group and all its topics, send an email to methylkit_discus...@googlegroups.com.
methylKitDB_to_methylKit_conversion.R

Alexander Blume

unread,
Jan 28, 2021, 8:02:44 AM1/28/21
to methylkit_...@googlegroups.com
Hi Johan, 

thank you for pointing out the bug in .setMethylDBNames, I’ll fix this as soon as possible. I’ll also disable skipDecompress for now, as it seems to cause unexpected behaviour and thereby more harm than good. 

Since you seem to be quite experienced with methylKit, may I ask you If there is something else that you might be missing about the functionality of tabix files? 

Best,
Alex  


Johan Zicola

unread,
Jan 28, 2021, 9:25:31 AM1/28/21
to methylkit_discussion
Hi Alex,

Nothing come to my mind about the tabix files but I have developped many functions to allow methylKit to deal with >100 samples without having RAM issues (hence the importance of flat DB files to me).
I have also developped add-on functions to plot average weighted methylation derived from your function `getCoverageStats()` and a couple of other things built around ggplot() and your functions. Maybe we could Zoom or Skype so I can show you what are these functions and their output and whether you and Akalin wish to integrate them as built-in functions in next versions of methylKit. You can email privately about this not to spam this thread.

Best,
Johan

Altuna Akalin

unread,
Jan 28, 2021, 9:58:36 AM1/28/21
to methylkit_...@googlegroups.com
Hi Johan,
I suggest opening issues on github for the functionality, and discuss/organize there. What you did, especially on 100+ samples, sounds very interesting, depending on how easy to integrate the functions you wrote we might explore merging them with methylKit giving you credit of course . I support if Alex and you explore this further.

Best,
Altuna

You received this message because you are subscribed to the Google Groups "methylkit_discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to methylkit_discus...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/methylkit_discussion/e5d8b335-9845-486e-8437-888f12e092ean%40googlegroups.com.
Reply all
Reply to author
Forward
0 new messages