I am trying to use methylkit::unite() on a very large dataset (1000 WGBS samples) and running into various issues. I am running my code on an HPC and giving it 2TB of memory, 64 cores and 96h of time to run. I am running R version 4.4.1-foss-2022b.The current problem I am having is with the MethylKit::unite() function for which I am getting the error:
methylationCpGs <- methylKit::unite(methylationDB,
destrand = TRUE,
mc.cores = 64,
save.db = TRUE,
chunk.size = 500000,
dbdir = "MethylKit_temp/FULL_DB",
suffix = "FullFilteredDB",
min.per.group = 500L)
Error in paste(tabixRes[[1]], collapse = "\n") :
result would exceed 2^31-1 bytes
Calls: <Anonymous> ... getTabixByChunk -> tabix2df -> fread -> paste0 -> paste
In addition: Warning message:
In mclapply(chrs, mergeTbxByChr, tabixList, dir, filename2, parallel = TRUE, :
scheduled core 2 did not deliver a result, all values of the job will be affected
Execution halted
I believe this means that R is unable to deal with the list length produced by the filter mincov per group which is the point in which unite stops. I have tried reducing the chunk size to 500,000 and still get the same error.Hi Lucy,
Unfortunately, we already know the problem where fetching data from the tabix files triggers the error where “result would exceed 2^31-1 bytes”
As you already mentioned, this is a limitation posed by R, which we currently worked around by using chunked and mostly per-chromosome processing. This worked fine so far, but since the number of samples per analysis keeps rising it was just a matter of time when this will not work anymore :D.
There are multiple problems here: methylkit:::applyTbxByChr operates only on a single tabix file, while methylKit::unite will merge multiple Tabix files, so this will unfortunately not work. As a side note, the function called by methylkit:::applyTbxByChr should get and return a data.frame, this is why you got the Error: unable to find an inherited method for function ‘unite’ for signature ‘object = "data.frame"` . The second error occurs since methylkit:::applyTbxByChr expects a single tabix file.
But now back to your original issue. You are right that error occurs at the filtering step, since the the call stack shows methylkit:::getTabixByChunk which is called by methylkit:::applyTbxByChunk. I think your approach of reducing the chunk size is correct, but you have not been strict enough.
You can use this code to check how much you need to reduce the chunk size:
library(methylKit) test_overflow <- function(n_samples, chunk_size) { # simulate 10 samples res <- dataSim(replicates = 10, sites = 1e3, treatment = rep(1, 10)) |> makeMethylDB(dbdir = tempdir()) |> # convert to tabix getDBPath() |> TabixFile(yieldSize = 100) |> open() |> Rsamtools::scanTabix() |> paste(collapse = "\n") |> paste0(collapse = "\n") |> object.size() * (n_samples / 10) * (chunk_size / 100) > 2^31 - 1 ifelse(res, yes = "result would exceed 2^31-1 bytes", no = "result is fine") } test_overflow(n_samples = 1000, chunk_size = 500000) test_overflow(n_samples = 1000, chunk_size = 50000)I hope this helps.
Best,
Alex