Very Large Data Unite()

Lucy R Barnard

Aug 11, 2025, 6:53:05 AM

to methylkit_discussion

Hi

I am trying to use methylKit::unite() on a very large dataset (1000 WGBS samples) and am running into various issues. I am running my code on an HPC cluster with 2 TB of memory, 64 cores and a 96 h time limit, using R version 4.4.1-foss-2022b. The current problem is with the methylKit::unite() function, which gives the following error:

methylationCpGs <- methylKit::unite(methylationDB,
    destrand = TRUE,
    mc.cores = 64,
    save.db = TRUE,
    chunk.size = 500000,
    dbdir = "MethylKit_temp/FULL_DB",
    suffix = "FullFilteredDB",
    min.per.group = 500L)

Error in paste(tabixRes[[1]], collapse = "\n") :
  result would exceed 2^31-1 bytes
Calls: <Anonymous> ... getTabixByChunk -> tabix2df -> fread -> paste0 -> paste
In addition: Warning message:
In mclapply(chrs, mergeTbxByChr, tabixList, dir, filename2, parallel = TRUE, :
  scheduled core 2 did not deliver a result, all values of the job will be affected
Execution halted

I believe this means that R is unable to deal with the list length produced by the minimum-coverage-per-group filter, which is the point at which unite() stops. I have tried reducing the chunk size to 500,000 and still get the same error.

As a workaround I am trying out methylKit:::applyTbxByChr, but I seem to be stuck in a circular problem: it needs a tabix file, which I cannot provide without first running unite(). When I do provide a tabix file, however, the enclosed unite() call doesn't work. I have recreated the problem with the package's internal data:

# for a filepath to a tbx file
byChr <- methylKit:::applyTbxByChr(tbxFile = filepath, # a tabix file object
    chrs = "chr21", # chromosome names
    return.type = "tabix", # return type for the function
    FUN = methylKit::unite, # function to apply to the chr
    ... = list(myobjDB), # parameters to be passed to FUN
    dir = "TARGET/", # directory to create temporary files and resulting tabix
    filename = "TESTBYCHR", # just the file name for the resulting tabix
    mc.cores = 1, # number of cores to use in parallel
    tabixHead = "TESTBYCHR", # OPTIONAL header to add to the file
)

Error: unable to find an inherited method for function ‘unite’ for signature ‘object = "data.frame"’

# for the original file.list created in the package
byChr <- methylKit:::applyTbxByChr(tbxFile = file.list, # a tabix file object
    chrs = "chr21", # chromosome names
    return.type = "data.frame", # return type for the function
    FUN = methylKit::unite, # function to apply to the chr
    ... = list(myobjDB), # parameters to be passed to FUN
    dir = "TARGET/", # directory to create temporary files and resulting tabix
    filename = "TESTBYCHR", # just the file name for the resulting tabix
    mc.cores = 1, # number of cores to use in parallel
    tabixHead = "TESTBYCHR", # OPTIONAL header to add to the file
)

Error: TabixFile: invalid 'file' argument

Is there an alternative to using methylKit:::applyTbxByChr(), or a way to reduce the list length further in methylKit::unite() itself? Or is there another way to cope with large datasets?

Thanks
Lucy

alex....@gmail.com

Aug 12, 2025, 2:57:48 PM

to methylkit_discussion

Hi Lucy,

Unfortunately, this is a known problem: fetching data from the tabix files can trigger the error "result would exceed 2^31-1 bytes".
As you already mentioned, this is a limitation imposed by R, which we have so far worked around by using chunked and mostly per-chromosome processing. This has worked fine until now, but since the number of samples per analysis keeps rising, it was just a matter of time until it stopped working :D.
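To make the limit concrete: a single R character string cannot hold more than 2^31-1 bytes, so collapsing all lines of a chunk into one string fails once the chunk gets large enough. Below is a minimal base-R sketch of checking the size before pasting. The helper name collapsed_bytes and the two example lines are made up for illustration and are not part of methylKit.

```r
# Hypothetical helper (not part of methylKit): number of bytes that
# paste(lines, collapse = "\n") would produce, computed without
# actually building the collapsed string.
collapsed_bytes <- function(lines) {
  if (length(lines) == 0L) return(0)
  # bytes of every line plus one newline between consecutive lines;
  # as.numeric() avoids integer overflow when summing very large inputs
  sum(as.numeric(nchar(lines, type = "bytes"))) + (length(lines) - 1)
}

# R's limit for a single string is the largest 32-bit signed integer.
max_string_bytes <- .Machine$integer.max  # 2^31 - 1

# Two made-up tab-separated records standing in for lines read from a tabix file.
lines <- c("chr21\t100\t100\t+\t12\t3\t9",
           "chr21\t200\t200\t+\t20\t5\t15")

n_bytes <- collapsed_bytes(lines)
fits <- n_bytes <= max_string_bytes
```

With real data the same check would be applied to the lines of one chunk; if it fails, the chunk has to be smaller.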

There are multiple problems here: methylKit:::applyTbxByChr operates only on a single tabix file, while methylKit::unite merges multiple tabix files, so this will unfortunately not work. As a side note, the function called by methylKit:::applyTbxByChr should take and return a data.frame, which is why you got the error "unable to find an inherited method for function 'unite' for signature 'object = "data.frame"'". The second error occurs because methylKit:::applyTbxByChr expects a single tabix file.
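The first error is ordinary S4 dispatch: the generic is handed a data.frame, and no method exists for that class. Here is a toy base-R sketch of the same situation. The class sampleList and the generic combineSamples are invented stand-ins, not methylKit code.

```r
library(methods)

# Toy stand-ins, not methylKit code: a generic with a method for one class only.
setClass("sampleList", representation(samples = "list"))
setGeneric("combineSamples",
           function(object, ...) standardGeneric("combineSamples"))
setMethod("combineSamples", "sampleList", function(object, ...) {
  # stack the per-sample data.frames on top of each other
  do.call(rbind, object@samples)
})

# Works: the argument has a class the generic has a method for.
ok <- combineSamples(new("sampleList",
                         samples = list(data.frame(x = 1), data.frame(x = 2))))

# Fails: no method is defined for data.frame, which is the kind of object
# a per-chromosome helper hands to FUN.
err <- tryCatch(combineSamples(data.frame(x = 1)),
                error = function(e) conditionMessage(e))
```

Here err holds an "unable to find an inherited method" message of the same form as the one in the original post.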

But now back to your original issue. You are right that the error occurs at the filtering step, since the call stack shows methylKit:::getTabixByChunk, which is called by methylKit:::applyTbxByChunk. I think your approach of reducing the chunk size is correct, but you have not reduced it far enough.

You can use this code to check how much you need to reduce the chunk size:

library(methylKit)

test_overflow <- function(n_samples, chunk_size) {
  # simulate 10 samples
  res <- dataSim(replicates = 10, sites = 1e3, treatment = rep(1, 10)) |>
    makeMethylDB(dbdir = tempdir()) |> # convert to tabix
    getDBPath() |>
    Rsamtools::TabixFile(yieldSize = 100) |> # read 100 records per chunk
    open() |>
    Rsamtools::scanTabix() |>
    paste(collapse = "\n") |>
    paste0(collapse = "\n") |>
    # scale the 10-sample, 100-record measurement up to the requested size
    object.size() * (n_samples / 10) * (chunk_size / 100) > 2^31 - 1
  ifelse(res, yes = "result would exceed 2^31-1 bytes", no = "result is fine")
}

test_overflow(n_samples = 1000, chunk_size = 500000)
test_overflow(n_samples = 1000, chunk_size = 50000)
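The same idea can be turned around to estimate the largest chunk size that still fits, given an average record size. This is only a rough base-R sketch: the function name max_chunk_size and the byte counts are illustrative assumptions, not measurements, so the bytes per record should be measured on the real data.

```r
# Rough estimate, assuming every record (one line of the tabix file) has
# about the same size. The function name is hypothetical.
max_chunk_size <- function(bytes_per_record, limit = .Machine$integer.max) {
  # each record contributes its own bytes plus one newline separator
  floor(limit / (bytes_per_record + 1))
}

# Illustrative assumption only: about 25 bytes for chr/start/end/strand and
# about 12 bytes per sample (coverage, numCs, numTs plus separators).
n_samples <- 1000
bytes_per_record <- 25 + 12 * n_samples

largest_safe_chunk <- max_chunk_size(bytes_per_record)
```

Since the collapsed string is not the only memory needed, it is probably wise to stay well below this estimate rather than right at it.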

I hope this helps.

Best,
Alex

alex....@gmail.com

Aug 12, 2025, 3:19:53 PM

to methylkit_discussion

Updated version of the above code: https://gist.github.com/alexg9010/a71269078b6613edaeab138db2b62945

Lucy R Barnard

Aug 13, 2025, 3:25:03 AM

to methylkit_discussion

Hi Alex,

Thank you! Will test that out and fingers crossed!

Lucy
