Memory issues

Matt Morgan

May 25, 2026, 11:30:14 AM

to Biociphers

Hi there,

I'm having a memory issue with voila modulize.

- Running a fairly large dataset: ~1000 samples
- Allocating 250 GB of memory to the job
- Running Voila v3.0.20.dev1+gcd116a77c

voila modulize \
-j 1 --changing-between-group-dpsi 0.1 \
--show-read-counts \
-d "$basedir/output/het_modulized_mem" \
/majiq/output/build/sg.zarr \
majiq/output/het/normal_tumour.hetcov \
/majiq/output/sgc/*.sgc

I get the following error:

2026-05-25 13:04:00,426 (PID:3977105) - ERROR - Unable to allocate 2.31 GiB for an array with shape (579, 1070386) and data type float32
Traceback (most recent call last):
File "/mnt/homes/hmg/envs/majiq/lib/python3.12/site-packages/rna_voila/run_voila.py", line 541, in main
args.func()
File "/mnt/homes/hmg/envs/majiq/lib/python3.12/site-packages/rna_voila/classify.py", line 29, in __init__
config = ClassifyConfig()
^^^^^^^^^^^^^^^^
File "/mnt/homes/hmg/envs/majiq/lib/python3.12/site-packages/rna_voila/config.py", line 749, in __new__
files, settings = _getInputFilesSet(config_parser, cov_multiarray=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/homes/hmg/envs/majiq/lib/python3.12/site-packages/rna_voila/config.py", line 531, in _getInputFilesSet
files['cov_zarr'][cov_file] = open_cov_wrapper(cov_file, preload=zarr_preload)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/homes/hmg/envs/majiq/lib/python3.12/site-packages/rna_voila/api/view_matrix.py", line 119, in open_cov_wrapper
cov = nm.HeterogenDataset.from_zarr(path, preload=preload)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/homes/hmg/envs/majiq/lib/python3.12/site-packages/rna_majiq/voila/HeterogenDataset.py", line 268, in from_zarr
return HeterogenDataset(df, events_df)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/homes/hmg/envs/majiq/lib/python3.12/site-packages/rna_majiq/voila/HeterogenDataset.py", line 202, in __init__
].sel(prefix=df["prefix_grp"] == grp),
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/homes/hmg/envs/majiq/lib/python3.12/site-packages/xarray/core/dataset.py", line 3140, in sel
result = self.isel(indexers=query_results.dim_indexers, drop=drop)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/homes/hmg/envs/majiq/lib/python3.12/site-packages/xarray/core/dataset.py", line 2973, in isel
return self._isel_fancy(indexers, drop=drop, missing_dims=missing_dims)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/homes/hmg/envs/majiq/lib/python3.12/site-packages/xarray/core/dataset.py", line 3029, in _isel_fancy
new_var = var.isel(indexers=var_indexers)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/homes/hmg/envs/majiq/lib/python3.12/site-packages/xarray/core/variable.py", line 1033, in isel
return self[key]
~~~~^^^^^
File "/mnt/homes/hmg/envs/majiq/lib/python3.12/site-packages/xarray/core/variable.py", line 800, in __getitem__
data = indexing.apply_indexer(indexable, indexer)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/homes/hmg/envs/majiq/lib/python3.12/site-packages/xarray/core/indexing.py", line 1027, in apply_indexer
return indexable.oindex[indexer]
~~~~~~~~~~~~~~~~^^^^^^^^^
File "/mnt/homes/hmg/envs/majiq/lib/python3.12/site-packages/xarray/core/indexing.py", line 367, in __getitem__
return self.getter(key)
^^^^^^^^^^^^^^^^
File "/mnt/homes/hmg/envs/majiq/lib/python3.12/site-packages/xarray/core/indexing.py", line 1504, in _oindex_get
return self.array[key]
~~~~~~~~~~^^^^^
numpy._core._exceptions._ArrayMemoryError: Unable to allocate 2.31 GiB for an array with shape (579, 1070386) and data type float32

I have tried running with the lazy-load flag, but that run is still going after more than 2 days.

Is it simply that my dataset size is too great to run normally? I would've hoped that 250GB would be enough to tackle it.
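For scale, the size in the error message can be reproduced from the array shape and dtype it reports. This is only a back-of-the-envelope sketch; the helper name `array_gib` is made up for illustration. The failed request is for a single array that is small next to 250 GB, which suggests (but does not prove) that the job had already filled most of its allocation with similar arrays by the time this one was requested.

```python
import math


def array_gib(shape, itemsize):
    """Size in GiB of a dense array with the given shape and bytes per element."""
    return math.prod(shape) * itemsize / 2**30


# Shape and dtype copied from the error message; float32 is 4 bytes per element.
size = array_gib((579, 1070386), itemsize=4)
print(f"{size:.2f} GiB")  # prints "2.31 GiB"
```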

Thank you

San Jewell

May 26, 2026, 2:46:15 PM

to Biociphers

Hi Matt,

It is a known issue that lazy loading of zarr files is slow in voila. I currently have a fix that I have been experimenting with, which changes the chunk size/shape in which majiq writes the output files that voila uses. However, majiq clin also uses these data structures, so it is difficult to optimize performance for both tools at the same time. I am still working out the best way to release and document this, but it will be in the next minor version bump.

In the meantime, you can find a middle ground between 100% in-memory and 100% lazy loading. Copy some subset of the input files into /dev/shm/subfolder on your system, which acts as a RAM drive, keeping the total comfortably below 250 GB. Then run the command with a mix of on-disk and in-memory zarr files and specify the --lazy-load-zarr option. It would also help to use more than one thread (-j) if you are using lazy loading.
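The staging step could be scripted roughly as follows. This is a minimal sketch using only the Python standard library, not part of MAJIQ or voila; the function names and the budget value are hypothetical, and it assumes each input is either a plain file or a directory-style zarr store with a unique base name.

```python
import shutil
from pathlib import Path


def tree_size(path: Path) -> int:
    """Total size in bytes of a file, or of all files under a directory."""
    if path.is_file():
        return path.stat().st_size
    return sum(p.stat().st_size for p in path.rglob("*") if p.is_file())


def stage_subset(inputs, ram_dir, budget_bytes):
    """Copy inputs into ram_dir until the byte budget is used up.

    Returns (paths, used): paths has one entry per input, pointing at the
    RAM-drive copy when it was staged and at the original otherwise.
    """
    ram_dir = Path(ram_dir)
    ram_dir.mkdir(parents=True, exist_ok=True)
    used = 0
    paths = []
    for src in map(Path, inputs):
        size = tree_size(src)
        if used + size <= budget_bytes:
            dst = ram_dir / src.name  # assumes base names do not collide
            if src.is_dir():
                shutil.copytree(src, dst, dirs_exist_ok=True)
            else:
                shutil.copy2(src, dst)
            used += size
            paths.append(dst)
        else:
            paths.append(src)  # leave on disk; voila reads it lazily
    return paths, used
```

Calling `stage_subset(sorted(Path("/majiq/output/sgc").glob("*.sgc")), "/dev/shm/subfolder", 150 * 2**30)` would return the mixed list of paths to put on the voila command line; the 150 GiB budget here is only an example of staying well under the 250 GB allocation.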

Let me know if that method proves helpful at all.

Thanks,

-San

Matt Morgan

May 28, 2026, 3:09:42 PM

to Biociphers

Hi San,

Thanks for your reply.

That does not seem to have made a dent in the running speed, despite my putting all the files I was using into memory on /dev/shm and running with 40 threads.

Unfortunately, this makes MAJIQ somewhat untenable for this dataset, which I find quite surprising. If you have any other suggestions, please let me know.

San Jewell

May 28, 2026, 3:50:24 PM

to Biociphers

Hi Matt,

I am unsure why the run would be the same speed from a RAM drive as from disk, but putting that aside: if you have time, I would like you to test a solution with me. I have sent out an update which adds two new flags to voila, --psicov-nchunks-ec-idx and --psicov-nchunks-prefix. For the new sizes to work, the files must first be re-chunked into smaller pieces. (The slowdown happens because the original design, from MAJIQ CLIN, reads across many genes at once for one experiment at a time, whereas voila reads many experiments but only one gene at a time.)
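The mismatch between chunk layout and access pattern can be illustrated with plain arithmetic. The sketch below counts how many chunks a single gene-sized read overlaps under two layouts. All the numbers are illustrative assumptions: 1000 experiments, a gene covering 40 consecutive ec_idx entries, and an "old" layout of one experiment by 8192 ec_idx per chunk, which is not confirmed to be what MAJIQ actually writes.

```python
def chunks_touched(start, stop, chunk_len):
    """Number of chunks along one axis overlapped by the half-open range [start, stop)."""
    return (stop - 1) // chunk_len - start // chunk_len + 1


def read_cost(prefix_range, ec_range, prefix_chunk, ec_chunk):
    """Return (chunks read, upper bound on values loaded) for one rectangular read."""
    n_chunks = chunks_touched(*prefix_range, prefix_chunk) * chunks_touched(
        *ec_range, ec_chunk
    )
    # Upper bound: chunks on the array edge may hold fewer values.
    return n_chunks, n_chunks * prefix_chunk * ec_chunk


# One gene across all experiments: 1000 prefixes x 40 ec_idx = 40,000 values needed.
gene_read = ((0, 1000), (5000, 5040))

# Hypothetical per-experiment layout: each chunk spans 1 prefix x 8192 ec_idx.
print(read_cost(*gene_read, prefix_chunk=1, ec_chunk=8192))   # (1000, 8192000)

# Layout suited to voila: each chunk spans all 1000 prefixes x 500 ec_idx.
print(read_cost(*gene_read, prefix_chunk=1000, ec_chunk=500))  # (1, 500000)
```

Under these assumptions the per-experiment layout opens 1000 chunks and loads about 8 million values to obtain 40,000, while the re-chunked layout opens a single chunk.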

To do this, open a Python terminal in your Python environment ($ python).

Here is a small script to run:

import os
import rna_majiq as nm

# Directory containing the original .psicov files
psicov_path = '/some/path/on/your/system/where/psicov/files/are/under/'

# Collect every .psicov file in that directory
files = []
for fname in os.listdir(psicov_path):
    if fname.endswith('.psicov'):
        files.append(os.path.join(psicov_path, fname))

# Open all files lazily, then write one combined, re-chunked file
cov = nm.PsiCoverage.from_zarr(files, preload=False)
cov.to_zarr('/some/place/to/write/optimized/psicov.psicov', show_progress=True, ec_idx_chunks=500, prefix_chunks=1000)

Note that you should change prefix_chunks=1000 to the exact number of experiments you have, which should just be len(cov.prefixes).

After that, try running voila again with the flags --psicov-nchunks-ec-idx 500 --psicov-nchunks-prefix 1000 (or whatever the exact number is). Specify only the new psicov file, not any of the original ones, and run with --lazy-load-zarr.

If this works much better in your case, I'll work on a more official way to choose output styles between clin usage and voila usage. I'm not going into too much detail here because I'm not sure of your experience level, but if you need any more help, feel free to ask.

Thanks,

-San

Matt Morgan

Jun 2, 2026, 5:35:42 AM

to Biociphers

Hi San,

Thank you very much for this. I will give this a try, going straight from .psicov files to modulize.

Just to clarify, I think hosting the files in RAM did make a difference; however, it still ran very slowly.

However, on the original run that wasn't working for me, I was using the .hetcov file and the .sgc files (no .psicov files). Am I missing something, or does the above solution just relate to the use of .psicov files?

Thanks again,

Matt

San Jewell

Jun 4, 2026, 2:57:16 PM

to Biociphers

Hi Matt,

You are correct. I apologize for the long gaps between my replies; our lab has taken on a significant number of projects recently.

In any case, the fix was originally designed for PsiCoverage, but I have just adapted it for Heterogen as well (please pull the update). One difference: instead of gathering all files into one output file, as in the PsiCoverage example above, here you must iterate through the files and save each comparison separately:

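import os
import rna_majiq as nm

# Directory containing the original .hetcov files
hetcov_path = '/some/path/on/your/system/where/hetcov/files/are/under/'
# Directory to write the re-chunked copies into
out_path = '/some/place/to/write/optimized/hetcov'

# Re-chunk each comparison separately, one output file per input file
for fname in os.listdir(hetcov_path):
    if fname.endswith('.hetcov'):
        cov = nm.HeterogenDataset.from_zarr(os.path.join(hetcov_path, fname), preload=False)
        cov.to_zarr(os.path.join(out_path, fname), show_progress=True, ec_idx_chunks=500, prefix_chunks=1000)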

As before, prefix_chunks should be set to the number of experiments. The arguments to voila are the same as above.

I look forward to hearing whether this gives good results, as I have not yet run a v3 heterogen analysis at a large enough scale to check for a meaningful difference.

-San

Matt Morgan

Jun 11, 2026, 11:01:11 AM

to Biociphers

Hi San,

Not to worry at all. Thank you very much for this update!

I have tried 2 runs of the above.

One used 40 threads (I think the parallelism throttled the I/O), and the other used 10 threads with all files stored locally on my nodes.

Unfortunately, even the quickest run (local, -j 10) is still very slow. After 3 days, the largest of the event files have probably processed and output significant events for ~30 genes, and the majority have output only ~10.

Sorry it's not better news!

Matt

San Jewell

Jun 11, 2026, 5:52:22 PM

to Biociphers

Hi Matt,

I understand, and I really do appreciate you working with me on improving this! I want to double-check: you did actually re-chunk the files using the Python script above, and not just run voila with the new switches, right? Would you be willing to share some subset of your data that you can time (i.e. with this many samples of this data, the run takes about an hour), so that I can verify what is happening and why you see no improvement when I do see one in other cases? There might be something fishy going on with your run or your environment that I'm not thinking of.
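One way to check whether a file really was re-chunked is to read the chunk shape recorded in its metadata. The sketch below assumes the output is a Zarr v2 directory store, where each array directory holds a `.zarray` JSON file with `shape` and `chunks` keys; whether MAJIQ writes that format is an assumption here, and a Zarr v3 store (which uses `zarr.json` instead) would simply yield an empty result. The function name is hypothetical.

```python
import json
from pathlib import Path


def list_chunking(store):
    """Map each array in a Zarr v2 directory store to its (shape, chunks)."""
    store = Path(store)
    found = {}
    for meta in sorted(store.rglob(".zarray")):
        info = json.loads(meta.read_text())
        name = str(meta.parent.relative_to(store))  # array path inside the store
        found[name] = (tuple(info["shape"]), tuple(info["chunks"]))
    return found
```

Running `list_chunking` on both the original and the rewritten .hetcov path and comparing the `chunks` tuples would show whether the requested sizes were actually applied along each dimension.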

Thanks!

-San

Matt Morgan

Jun 12, 2026, 6:30:47 PM

to Biociphers

Hi San,

No bother at all.

Yes, I ran the following script to process the .hetcov file, based on your script above:

####################

import os
import rna_majiq as nm
from dask.distributed import Client, LocalCluster

old_HETCOV = "/mnt/cgs-fs8a/mjm280/present_data/majiq/output/het/normal_tumour.hetcov"
new_HETCOV = "/mnt/cgs-fs8a/mjm280/present_data/majiq/output/het/normal_tumour_memory.hetcov"

EC_IDX_NCHUNKS = 500
PREFIX_NCHUNKS = 1066

N_WORKERS = 5
THREADS_PER_WORKER = 4
MEMORY_LIMIT = "20GB"

def main():
    # Skip the rewrite if the output already exists
    if os.path.exists(new_HETCOV):
        print("exit")
        return

    cov = nm.HeterogenDataset.from_zarr(old_HETCOV, preload=False)
    cov.to_zarr(
        new_HETCOV,
        show_progress=True,
        ec_idx_nchunks=EC_IDX_NCHUNKS,
        prefix_nchunks=PREFIX_NCHUNKS,
    )
    print(f"finished {new_HETCOV}")

if __name__ == "__main__":
    cluster = LocalCluster(
        n_workers=N_WORKERS,
        threads_per_worker=THREADS_PER_WORKER,
        memory_limit=MEMORY_LIMIT,
    )
    client = Client(cluster)
    try:
        main()
    finally:
        client.close()
        cluster.close()

###########################

It took about 12 minutes to run, but I multi-threaded it as you can see.

I then passed this to voila using the following command:

###########################

voila modulize \
-j 10 --changing-between-group-dpsi 0.1 \
--show-read-counts --show-per-sample-psi \
--lazy-load-zarr --psicov-nchunks-ec-idx 500 --psicov-nchunks-prefix 1066 \
-d "$basedir/majiq_out" \
"$basedir/sg.zarr" \
$basedir/sgc/*.sgc \
"$basedir/normal_tumour_memory.hetcov"

###########################

I have sent you a link sharing the files via email.

Thanks again for your help,

Matt
