IndexError in rna_majiq.src.io.get_coverage_mat_lsv


Yihao ZHANG

Jul 31, 2024, 10:48:16 PM
to Biociphers
Hello, 

I have a test dataset containing five normal and five tumor samples. I first ran `majiq build` on each of the 10 samples individually:
majiq --license {} build {} -c {} -o {} -j {} --min-experiments {} \
--simplify {} --min-denovo 3

Then, I executed the following command: 
majiq --license $license_dir heterogen -j 4 -o $output_dir/heterogen \
-n tumor normal \
-grp1 $output_dir/tumor/majiq/*/*.majiq \
-grp2 $output_dir/normal/majiq/*/*.majiq

It parses the required data, but then fails with the error message below:
```
2024-08-01 11:27:17,483 (PID:1902070) - INFO - Group normal: 34480 LSVs
2024-08-01 11:27:17,500 (PID:1902070) - INFO - Number quantifiable LSVs: 25408
2024-08-01 11:27:17,608 (PID:1902070) - INFO - Sampling from PSI
2024-08-01 11:27:17,627 (PID:1902070) - INFO - Group tumor performing at least 501 psi samples per experiment to bound visualization error
2024-08-01 11:27:17,627 (PID:1902070) - INFO - (group 1/2, experiment 1/5) Group tumor sampling PSI from '/data/TCGA/output/tumor/majiq/295c896a-e9c1-467b-ae56-15b18aad97e8/ef697db7-3b14-4999-9e9b-85c37718f615.rna_seq.genomic.gdc_realn.majiq'
Traceback (most recent call last):
  File "rna_majiq/src/io.pyx", line 298, in rna_majiq.src.io.get_coverage_mat_lsv
IndexError: index 5 is out of bounds for axis 0 with size 5
Exception ignored in: 'rna_majiq.src.io.get_coverage_mat_lsv'
Traceback (most recent call last):
  File "rna_majiq/src/io.pyx", line 298, in rna_majiq.src.io.get_coverage_mat_lsv
IndexError: index 5 is out of bounds for axis 0 with size 5
Segmentation fault (core dumped)
```

May I ask: 
1. Does the [experiment] group name in the configuration .ini used at the build step affect behavior at the quantification step? Should the group name in the .ini exactly match the group arguments (-n) used at the quantification step? 
2. If not, could you help me find any other possible cause of this error? 

Thank you very much!

Best,
Yihao

Yihao ZHANG

Jul 31, 2024, 11:27:33 PM
to Biociphers
I tested the group name. Changing the group name in the .ini still gives this error. 

I also checked the logs from the build step; all samples have > 30000 LSVs, so the input should be fine, I guess. 

```
    for fidx, fname in enumerate(file_list):
        with open(fname, 'rb') as fp:
            fl = np.load(fp)
            data = fl['coverage']
            info = fl['junc_info']
            msamples = data.shape[1]
            idx = -1
            for row in info:
                idx += 1
                lsv_id = row[0]
                if result.count(lsv_id) > 0:

                    if prev_lsvid != lsv_id:
                        if prev_lsvid != empty_string:
                            # update add coverage/set status for previous LSV
                            result[prev_lsvid].add(<np.float32_t *> cov.data, msamples)
                            result[prev_lsvid].set_bool(bflt or result[prev_lsvid].is_enabled())
                        # if fltr, LSV will be used no matter what
                        bflt = fltr or (row[3] >= minreads and row[4] >= minnonzero)
                        prev_lsvid = lsv_id
                        prev_juncidx = -1


                        njunc = result[lsv_id].get_num_ways()
                        cov = np.zeros(shape=(njunc, msamples), dtype=np.float32)
                    else:
                        bflt = bflt or (row[3] >= minreads and row[4] >= minnonzero)
                    prev_juncidx += 1
                    # IndexError when this file carries more junction rows for
                    # the LSV than the njunc rows allocated in cov above
                    cov[prev_juncidx] = data[idx]
            # make sure to add coverage/set status for final LSV
            result[prev_lsvid].add(<np.float32_t *> cov.data, msamples)
            result[prev_lsvid].set_bool(bflt or result[prev_lsvid].is_enabled())
```

I tried to investigate how it happens, but I still don't fully understand what causes this error. 

The reason I did not build all required samples at the same time is that I'm working on a large dataset (> 10000 samples) and storage is limited, so I have to run the pipeline on subsets first. Is it correct to perform the build step on samples individually and then move to the quantification step? 

San Jewell

Aug 7, 2024, 12:45:21 PM
to Biociphers
Hi Yihao,

The reason for this error is that one of the .majiq files has an LSV with six junctions, while another .majiq file has the same LSV with fewer than six ("index 5 is out of bounds for axis 0 with size 5"). As you seem to indicate, this comes from running the quantification step on .majiq files from separate builds, which can make the same LSV have different sizes in different files, something the quantifier does not expect.
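
The size mismatch can be reproduced in miniature with plain NumPy (the junction and sample counts below are made up for illustration): the coverage buffer is allocated from one file's junction count for the LSV, and indexing it with a row index from a file that recorded more junctions overflows axis 0.

```python
import numpy as np

# Hypothetical sizes for illustration: the coverage buffer is allocated
# from a build that recorded this LSV with 5 junctions...
njunc = 5        # rows allocated, analogous to get_num_ways()
msamples = 30    # bootstrap samples per junction
cov = np.zeros((njunc, msamples), dtype=np.float32)

# ...but a .majiq file from a *different* build carries 6 junction rows
# for the same LSV id, so the copy loop indexes one row too far.
rows_in_other_build = 6
try:
    for juncidx in range(rows_in_other_build):
        cov[juncidx] = 1.0
except IndexError as err:
    print(err)  # index 5 is out of bounds for axis 0 with size 5
```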

You should run the build on all RNA-seq data at the same time to avoid this. If you would like to avoid re-processing BAM files when updating a build, you can use the --incremental option and set sjdirs=/somepath in the config.
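
A minimal sketch of what that could look like (all paths, the genome name, and the sample names below are hypothetical placeholders for your own setup):

```ini
[info]
bamdirs=/data/TCGA/bams
sjdirs=/data/TCGA/sj
genome=hg38

[experiments]
tumor=tumor_sample_1,tumor_sample_2
normal=normal_sample_1,normal_sample_2
```

With --incremental on the build command, SJ files already present under sjdirs should be reused instead of re-parsing the corresponding BAMs, so you can grow one unified build in batches without redoing earlier samples.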

Please make your build unified and let me know if this resolves your issue.

Thanks,
-San