Memory issues


Matt Morgan

May 25, 2026, 11:30:14 AM
to Biociphers
Hi there, 

I'm running into a memory issue with voila modulize.
- Running a fairly large dataset (~1000 samples)
- Allocating 250GB of memory to the job

Running Voila v3.0.20.dev1+gcd116a77c

voila modulize \
-j 1 --changing-between-group-dpsi 0.1 \
--show-read-counts \
-d "$basedir/output/het_modulized_mem" \
/majiq/output/build/sg.zarr \
majiq/output/het/normal_tumour.hetcov \
/majiq/output/sgc/*.sgc

I get the following error:
2026-05-25 13:04:00,426 (PID:3977105) - ERROR - Unable to allocate 2.31 GiB for an array with shape (579, 1070386) and data type float32
Traceback (most recent call last):
  File "/mnt/homes/hmg/envs/majiq/lib/python3.12/site-packages/rna_voila/run_voila.py", line 541, in main
    args.func()
  File "/mnt/homes/hmg/envs/majiq/lib/python3.12/site-packages/rna_voila/classify.py", line 29, in __init__
    config = ClassifyConfig()
             ^^^^^^^^^^^^^^^^
  File "/mnt/homes/hmg/envs/majiq/lib/python3.12/site-packages/rna_voila/config.py", line 749, in __new__
    files, settings = _getInputFilesSet(config_parser, cov_multiarray=True)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/homes/hmg/envs/majiq/lib/python3.12/site-packages/rna_voila/config.py", line 531, in _getInputFilesSet
    files['cov_zarr'][cov_file] = open_cov_wrapper(cov_file, preload=zarr_preload)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/homes/hmg/envs/majiq/lib/python3.12/site-packages/rna_voila/api/view_matrix.py", line 119, in open_cov_wrapper
    cov = nm.HeterogenDataset.from_zarr(path, preload=preload)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/homes/hmg/envs/majiq/lib/python3.12/site-packages/rna_majiq/voila/HeterogenDataset.py", line 268, in from_zarr
    return HeterogenDataset(df, events_df)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/homes/hmg/envs/majiq/lib/python3.12/site-packages/rna_majiq/voila/HeterogenDataset.py", line 202, in __init__
    ].sel(prefix=df["prefix_grp"] == grp),
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/homes/hmg/envs/majiq/lib/python3.12/site-packages/xarray/core/dataset.py", line 3140, in sel
    result = self.isel(indexers=query_results.dim_indexers, drop=drop)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/homes/hmg/envs/majiq/lib/python3.12/site-packages/xarray/core/dataset.py", line 2973, in isel
    return self._isel_fancy(indexers, drop=drop, missing_dims=missing_dims)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/homes/hmg/envs/majiq/lib/python3.12/site-packages/xarray/core/dataset.py", line 3029, in _isel_fancy
    new_var = var.isel(indexers=var_indexers)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/homes/hmg/envs/majiq/lib/python3.12/site-packages/xarray/core/variable.py", line 1033, in isel
    return self[key]
           ~~~~^^^^^
  File "/mnt/homes/hmg/envs/majiq/lib/python3.12/site-packages/xarray/core/variable.py", line 800, in __getitem__
    data = indexing.apply_indexer(indexable, indexer)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/homes/hmg/envs/majiq/lib/python3.12/site-packages/xarray/core/indexing.py", line 1027, in apply_indexer
    return indexable.oindex[indexer]
           ~~~~~~~~~~~~~~~~^^^^^^^^^
  File "/mnt/homes/hmg/envs/majiq/lib/python3.12/site-packages/xarray/core/indexing.py", line 367, in __getitem__
    return self.getter(key)
           ^^^^^^^^^^^^^^^^
  File "/mnt/homes/hmg/envs/majiq/lib/python3.12/site-packages/xarray/core/indexing.py", line 1504, in _oindex_get
    return self.array[key]
           ~~~~~~~~~~^^^^^
numpy._core._exceptions._ArrayMemoryError: Unable to allocate 2.31 GiB for an array with shape (579, 1070386) and data type float32

I have tried running with the -lazy-load flag, but that job is still going and has already taken more than 2 days.

Is it simply that my dataset size is too great to run normally? I would've hoped that 250GB would be enough to tackle it. 
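
For scale, the 2.31 GiB in the error message can be reproduced from the array shape alone. This is only a back-of-the-envelope check in plain Python (no MAJIQ calls), and the variable names are illustrative.

```python
# Shape and dtype copied from the error message above
n_rows, n_cols = 579, 1070386
bytes_per_value = 4  # float32 uses 4 bytes per value

total_bytes = n_rows * n_cols * bytes_per_value
total_gib = total_bytes / 2**30  # 1 GiB = 2**30 bytes

print(f"{total_bytes} bytes = {total_gib:.2f} GiB")
```

A single array of this size is small next to 250GB, so the failure presumably came after many similar arrays had already been preloaded. That is an inference from the message, not something the traceback states.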

Thank you

San Jewell

May 26, 2026, 2:46:15 PM
to Biociphers
Hi Matt, 

It is a known issue that lazy loading of zarr is slow in voila. I currently have a fix I have been experimenting with, which changes the chunk size/shape in which majiq writes the output files that voila uses. However, because majiq clin also uses these data structures, it is difficult to optimize performance for both tools at the same time. I am working out the best way to release and document this, but it will be in the next minor version bump.

In the meantime, you can get a middle ground between 100% in memory and 100% lazy load by copying some subset of the input files into /dev/shm/subfolder on your system, which acts as a RAM drive. Make sure you keep this comfortably below 250GB. Then you can run the command reading from a mix of on-disk and in-memory zarr files, with the --lazy-load-zarr option. It would also help to use more than one thread (-j) if you are using lazy loading.
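
As a rough sketch of how the subset could be chosen, the helper below picks inputs, in the order given, until a byte budget is reached. It uses only the Python standard library; the function names and the budget are hypothetical, and it assumes each input is either a single file or a zarr directory tree.

```python
import os

def tree_size(path):
    """Total size in bytes of a file, or of all files under a directory."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    for root, _dirs, names in os.walk(path):
        for name in names:
            total += os.path.getsize(os.path.join(root, name))
    return total

def pick_for_ramdisk(paths, budget_bytes):
    """Greedily split inputs into those that fit the budget and the rest."""
    chosen, remaining, used = [], [], 0
    for path in paths:
        size = tree_size(path)
        if used + size <= budget_bytes:
            chosen.append(path)
            used += size
        else:
            remaining.append(path)
    return chosen, remaining, used
```

The chosen inputs could then be copied into /dev/shm/subfolder (for example with shutil.copytree for directories), and the run command given the copied paths plus the remaining on-disk paths.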

Let me know if that method proves helpful at all. 

Thanks, 
-San

Matt Morgan

May 28, 2026, 3:09:42 PM
to Biociphers
Hi San, 

Thanks for your reply. 

That does not seem to have made a dent in the running speed, even though I put all the files I was using into memory on /dev/shm and ran with 40 threads.

Unfortunately this makes MAJIQ somewhat untenable for this dataset, which I find quite surprising. If you have any other suggestions, please let me know.

San Jewell

May 28, 2026, 3:50:24 PM
to Biociphers
Hi Matt, 

I am unsure why the run would be the same speed from a RAM drive as from disk, but putting that aside, if you have time I would like you to test a solution with me. I have pushed an update which adds two new flags to voila: --psicov-nchunks-ec-idx and --psicov-nchunks-prefix. For the new sizes to work, the files must first be rechunked into smaller pieces. (The slowdown happens because the original design, from MAJIQ CLIN, reads across many genes at once for one experiment at a time, whereas voila reads many experiments but only one gene at a time.)
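
To illustrate why the chunk shape matters for this access pattern, here is a small stand-alone calculation. It is a sketch under assumptions: the 1 x 65536 "experiment-major" layout is made up for comparison and is not MAJIQ's actual default, the gene's index range is hypothetical, and the array is assumed to be laid out as (prefix, ec_idx). Only the 579 x 500 layout mirrors the chunk sizes suggested in this thread.

```python
def chunks_touched(chunk_shape, prefix_range, ec_idx_range):
    """Number of chunks overlapped by a rectangular read (half-open ranges)."""
    def span(index_range, chunk_len):
        start, stop = index_range
        return (stop - 1) // chunk_len - start // chunk_len + 1
    return span(prefix_range, chunk_shape[0]) * span(ec_idx_range, chunk_shape[1])

def max_bytes_read(chunk_shape, n_chunks, itemsize=4):
    """Upper bound on bytes read, since whole chunks are loaded (float32)."""
    return n_chunks * chunk_shape[0] * chunk_shape[1] * itemsize

n_prefixes = 579
# voila-style read: all experiments, one hypothetical gene spanning 50 ec_idx
gene_read = ((0, n_prefixes), (10_000, 10_050))

layouts = {
    "experiment-major (1 x 65536, assumed)": (1, 65_536),
    "gene-major (579 x 500)": (n_prefixes, 500),
}
for name, chunk_shape in layouts.items():
    n = chunks_touched(chunk_shape, *gene_read)
    print(name, "->", n, "chunks, at most", max_bytes_read(chunk_shape, n), "bytes")
```

Under these assumptions the per-gene read touches one chunk per experiment in the first layout and a single chunk in the second, which is the kind of difference the rechunking is meant to exploit.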

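To rechunk the files, open a Python interpreter in your environment ($ python) and run this small script:

import os
import rna_majiq as nm

# Directory that contains the original .psicov files
psicov_path = '/some/path/on/your/system/where/psicov/files/are/under/'

# Collect every .psicov file in that directory
files = []
for fname in os.listdir(psicov_path):
    if fname.endswith('.psicov'):
        files.append(os.path.join(psicov_path, fname))

# Open all files lazily as one combined PsiCoverage object
cov = nm.PsiCoverage.from_zarr(files, preload=False)
# Write a single rechunked copy that voila will read instead of the originals
cov.to_zarr('/some/place/to/write/optimized/psicov.psicov', show_progress=True, ec_idx_chunks=500, prefix_chunks=1000)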

Note that you should change prefix_chunks=1000 to the exact number of experiments you have, which should just be len(cov.prefixes).

After that, try running voila again with the flags --psicov-nchunks-ec-idx 500 --psicov-nchunks-prefix 1000 (or whatever the exact number is). Specify only the new psicov file, not any of the original ones, and run with --lazy-load-zarr.
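
For reference, the resulting command line could be assembled along these lines. This is only a sketch: the helper name and all paths are placeholders, the argument order simply follows the modulize command from the first message, and the flags are the ones named in this thread.

```python
import shlex

def build_modulize_command(splicegraph, cov_files, out_dir,
                           ec_idx_chunks, prefix_chunks, threads=1):
    """Build the argument list for a lazy-loading voila modulize run."""
    return [
        "voila", "modulize",
        "-j", str(threads),
        "--lazy-load-zarr",
        "--psicov-nchunks-ec-idx", str(ec_idx_chunks),
        "--psicov-nchunks-prefix", str(prefix_chunks),
        "-d", out_dir,
        splicegraph,
        *cov_files,
    ]

# Placeholder paths; prefix_chunks should be the real number of experiments
cmd = build_modulize_command(
    "/majiq/output/build/sg.zarr",
    ["/some/place/to/write/optimized/psicov.psicov"],
    "/some/output/dir",
    ec_idx_chunks=500,
    prefix_chunks=1000,
    threads=8,
)
print(shlex.join(cmd))
```

Any other modulize options from the original run would need to be added to the list by hand.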

If this works much better in your case, I'll work on a more official way to choose output styles between clin usage and voila usage. I'm not going into too much detail here because I'm not sure of your experience level, but if you need any more help feel free to ask.

Thanks, 
-San

Matt Morgan

Jun 2, 2026, 5:35:42 AM
to Biociphers
Hi San, 

Thank you very much for this. I will give it a try, going straight from .psicov files to modulize.

Just to clarify, I think hosting the files in RAM did make a difference, but it still ran very slowly.

However, on the original run that wasn't working for me, I was using the .hetcov file and the .sgc files (no .psicov files). Am I missing something, or does the above solution just relate to the use of .psicov files?

Thanks again,
Matt

San Jewell

Jun 4, 2026, 2:57:16 PM
to Biociphers
Hi Matt, 

You are correct. I apologize for the long delay between my replies; our lab has taken on a significant number of projects recently.

In any case, the fix was originally designed for PsiCoverage, but I have just adapted it for Heterogen as well (please pull the update). One difference: instead of gathering all files into one output file, as in the PsiCoverage example above, here you must iterate through the files and save one comparison each:

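import os
import rna_majiq as nm

# Directory that contains the original .hetcov files
hetcov_path = '/some/path/on/your/system/where/hetcov/files/are/under/'
# Directory to write the rechunked copies into (one per input file)
out_path = '/some/place/to/write/optimized/hetcov/'

# Rechunk each comparison separately, keeping the original file name
for fname in os.listdir(hetcov_path):
    if fname.endswith('.hetcov'):
        cov = nm.HeterogenDataset.from_zarr(os.path.join(hetcov_path, fname), preload=False)
        cov.to_zarr(os.path.join(out_path, fname), show_progress=True, ec_idx_chunks=500, prefix_chunks=1000)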

As before, prefix_chunks should be set to the number of experiments. The arguments to voila would be the same as above.

I look forward to hearing whether this gives good results, as I have not yet run a v3 heterogen analysis at a large enough scale to check for a meaningful difference.

-San
