Very Large Data Unite()

12 views
Skip to first unread message

Lucy R Barnard

unread,
Aug 11, 2025, 6:53:05 AMAug 11
to methylkit_discussion
Hi

I am trying to use methylkit::unite() on a very large dataset (1000 WGBS samples) and running into various issues. I am running my code on an HPC and giving it 2TB of memory, 64 cores and 96h of time to run. I am running R version 4.4.1-foss-2022b.The current problem I am having is with the MethylKit::unite() function for which I am getting the error:

methylationCpGs <- methylKit::unite(methylationDB,

                                    destrand = TRUE,

                                    mc.cores = 64,

                                    save.db = TRUE,

                                    chunk.size = 500000,

                                    dbdir = "MethylKit_temp/FULL_DB",

                                    suffix = "FullFilteredDB",

                                    min.per.group = 500L)


Error in paste(tabixRes[[1]], collapse = "\n") :

  result would exceed 2^31-1 bytes

Calls: <Anonymous> ... getTabixByChunk -> tabix2df -> fread -> paste0 -> paste

In addition: Warning message:

In mclapply(chrs, mergeTbxByChr, tabixList, dir, filename2, parallel = TRUE,  :

  scheduled core 2 did not deliver a result, all values of the job will be affected

Execution halted

I believe this means that R is unable to deal with the list length produced by the filter mincov per group which is the point in which unite stops. I have tried reducing the chunk size to 500,000 and still get the same error. 

As a work around I am trying out the methylkit:::applyTbxByChr but seem to be stuck in the circular problem of needing to provide it with a tabix file which I am unable to provide without running the unite function. When providing the tabix file however, the enclosed unite function doesn’t work. I have recreated the problem with the internal package data:

# for a filepath to a tbx file 
> byChr <- methylKit:::applyTbxByChr(tbxFile = filepath, # at tabix file object
chrs = "chr21",#chromosome names
return.type = "tabix",# return type for the function
FUN = methylKit::unite, #function to apply to the chr
... = list(myobjDB), # parameters to be passed to FUN
dir = "TARGET/", # directory to create temporary files and resulting tabix
filename = "TESTBYCHR", # just the file name for the resulting tabix
mc.cores = 1, # number of cores to use in parallel
tabixHead = "TESTBYCHR", # OPTIONAL header to add to the file
)

Error: unable to find an inherited method for function ‘unite’ for signature ‘object = "data.frame"’ 
# for the original file.list created in the package
byChr <- methylKit:::applyTbxByChr(tbxFile = file.list, # at tabix file object
chrs = "chr21",#chromosome names
return.type = "data.frame",# return type for the function
FUN = methylKit::unite, #function to apply to the chr
... = list(myobjDB), # parameters to be passed to FUN
dir = "TARGET/", # directory to create temporary files and resulting tabix
filename = "TESTBYCHR", # just the file name for the resulting tabix
mc.cores = 1, # number of cores to use in parallel
tabixHead = "TESTBYCHR", # OPTIONAL header to add to the file
)
Error: TabixFile: invalid 'file' argument 

Is there any alternative to either using MethylKit:::applyTbxByChr(), or to reduce the list length more in methylkit::unite() itself? Or another way to cope with large datasets?

Thanks
Lucy

alex....@gmail.com

unread,
Aug 12, 2025, 2:57:48 PMAug 12
to methylkit_discussion

Hi Lucy,

Unfortunately, we already know the problem where fetching data from the tabix files triggers the error where “result would exceed 2^31-1 bytes”
As you already mentioned, this is a limitation posed by R, which we currently worked around by using chunked and mostly per-chromosome processing. This worked fine so far, but since the number of samples per analysis keeps rising it was just a matter of time when this will not work anymore :D.

There are multiple problems here: methylkit:::applyTbxByChr operates only on a single tabix file, while methylKit::unite will merge multiple Tabix files, so this will unfortunately not work. As a side note, the function called by methylkit:::applyTbxByChr should get and return a data.frame, this is why you got the Error: unable to find an inherited method for function ‘unite’ for signature ‘object = "data.frame"` . The second error occurs since methylkit:::applyTbxByChr expects a single tabix file.

But now back to your original issue. You are right that error occurs at the filtering step, since the the call stack shows methylkit:::getTabixByChunk which is called by methylkit:::applyTbxByChunk. I think your approach of reducing the chunk size is correct, but you have not been strict enough.

You can use this code to check how much you need to reduce the chunk size:

library(methylKit) test_overflow <- function(n_samples, chunk_size) { # simulate 10 samples res <- dataSim(replicates = 10, sites = 1e3, treatment = rep(1, 10)) |> makeMethylDB(dbdir = tempdir()) |> # convert to tabix getDBPath() |> TabixFile(yieldSize = 100) |> open() |> Rsamtools::scanTabix() |> paste(collapse = "\n") |> paste0(collapse = "\n") |> object.size() * (n_samples / 10) * (chunk_size / 100) > 2^31 - 1 ifelse(res, yes = "result would exceed 2^31-1 bytes", no = "result is fine") } test_overflow(n_samples = 1000, chunk_size = 500000) test_overflow(n_samples = 1000, chunk_size = 50000)

I hope this helps.

Best,
Alex

alex....@gmail.com

unread,
Aug 12, 2025, 3:19:53 PMAug 12
to methylkit_discussion

Lucy R Barnard

unread,
Aug 13, 2025, 3:25:03 AMAug 13
to methylkit_discussion
Hi Alex, 

Thank you! Will test that out and fingers crossed!

Lucy

Reply all
Reply to author
Forward
0 new messages