Index across multiple independent samples and combined junction counts

Dan Sprague

Apr 9, 2025, 1:13:21 PM

to Biociphers

Hi!

We’ve been enjoying using MAJIQ + VOILA. I have a question about analyzing a large number of samples at once. We’re looking at some public data and trying to treat each individual SRR separately (independently). My understanding is that the config file should be set up like this:

[info]
bamdirs=path/to/bams
genome=hg38

[experiments]
SRRXXX=SRRXXX_Aligned.out_sorted
SRRXX=SRRXXX_Aligned.out_sorted
SRRXXX=SRRXXX_Aligned.out_sorted
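
With many SRRs, a config in this layout could be generated rather than typed by hand. Below is a minimal sketch that only mirrors the section and key names shown above. The function name `build_majiq_config`, the sample IDs, and the `_Aligned.out_sorted` suffix default are placeholders for illustration, not part of MAJIQ.

```python
def build_majiq_config(bam_dir, genome, sample_ids, bam_suffix="_Aligned.out_sorted"):
    """Return config text with one [experiments] entry per sample.

    Hypothetical helper: it only reproduces the layout shown above,
    where each SRR is listed under its own name (one experiment each).
    """
    lines = [
        "[info]",
        f"bamdirs={bam_dir}",
        f"genome={genome}",
        "",
        "[experiments]",
    ]
    for sample_id in sample_ids:
        # Left side: the name given to the sample; right side: BAM file prefix.
        lines.append(f"{sample_id}={sample_id}{bam_suffix}")
    return "\n".join(lines) + "\n"


# Placeholder accessions, not real data.
config_text = build_majiq_config(
    "path/to/bams", "hg38", ["SRR0000001", "SRR0000002", "SRR0000003"]
)
print(config_text)
```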

I have two questions from this.

1) In VOILA, a "combined" splicegraph is shown. I have found a dropdown menu where I can add on splice graphs from individual SRRs. My question: What do the numbers above the junction arcs represent in the "combined" graph? They are less than the numbers from some individual experiments. Is this a count regression across all the samples of some kind?
2) We are looking for a very specific gene. When I ran all these samples as shown in the config file, VOILA says that the gene of interest is not in the index. However, we know for a fact that this gene is highly expressed in at least some of these samples. Interestingly, when I ran only a single high-expressing SRR in an entirely separate run of MAJIQ+VOILA, the gene of interest was in the VOILA index and had quite high junction counts. What is causing this behavior? An average across the samples? Is there a way to change this behavior?

Thank you!

Dan

San Jewell

Apr 9, 2025, 1:37:26 PM

to Biociphers

Hi Dan,

Thanks for reaching out!

1) A simple median across all samples is what is shown as the read count in the "combined" splicegraph. Perhaps this can be better articulated in the documentation somewhere.
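
To illustrate why the combined number can be lower than what some individual experiments show, here is a small sketch of a median over per-sample read counts for one junction. The sample names and counts are made up for illustration, and details such as rounding or tie handling in VOILA are not covered here.

```python
from statistics import median

# Hypothetical read counts for a single junction, one value per sample.
junction_reads = {
    "SRR_A": 120,
    "SRR_B": 4,
    "SRR_C": 0,
    "SRR_D": 35,
    "SRR_E": 2,
}

# The "combined" value is the median across samples, so a few high
# expressers do not pull it up the way a mean or a sum would.
combined = median(junction_reads.values())
highest = max(junction_reads.values())

print(f"combined (median): {combined}, highest single sample: {highest}")
```

Here the median is 4 even though one sample has 120 reads on the junction, which matches the pattern of a combined count that is lower than some individual experiments.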

2) The VOILA index is across LSVs rather than genes as a whole. Information such as read counts is always stored, even if there are no LSVs, so you can manually navigate to localhost:<port>/gene/<gene_id> if you want to view the splicegraph for a gene that doesn't have quantified LSVs. This can be the case even for highly expressed, high read count genes, just without any significant alternative splicing detected. In your case, it means that one or more LSVs in that gene are quantified in the run with one sample, but not in the run with many samples. This can be due to various settings / thresholds specified in the builder, for example --min-experiments, --min-reads, --min-pos. I think the specific behavior you are seeing comes from the --min-experiments flag. From the help text:

--min-experiments MIN_EXP
Threshold for group filters. This specifies the fraction (value < 1) or absolute number (value >= 1) of experiments passing per-experiment filters (i.e. minreads, minpos, etc.) that must pass individually in order to pass an LSV or junction. If greater than the total number of experiments in a group, requires all experiments to pass individually. [Default: 0.5]

So by default, half of the experiments are required to pass minreads and minpos in order to pass that LSV/junction. If you want to be more permissive, you can either reduce minreads/minpos, or set --min-experiments to, for example, 1, which would mean that only one experiment needs to pass minreads and minpos for that LSV/junction to pass.
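
As a rough sketch of the rule described in the help text, the check can be modelled as below. This is not MAJIQ's code: the function names are hypothetical, only a minimum-reads filter is modelled (minpos is left out), the read counts and the `min_reads=3` value are made up, and rounding a fraction up to a whole number of experiments is an assumption.

```python
import math


def required_experiments(min_experiments, n_experiments):
    """How many experiments must pass individually (hypothetical model)."""
    if min_experiments < 1:
        # Fraction of the experiments; rounding up is an assumption.
        needed = math.ceil(min_experiments * n_experiments)
    else:
        # Absolute number of experiments.
        needed = int(min_experiments)
    # If greater than the number of experiments, require all of them.
    return max(1, min(needed, n_experiments))


def junction_passes(reads_per_experiment, min_reads, min_experiments):
    """True if enough experiments individually reach min_reads."""
    passing = sum(1 for reads in reads_per_experiment if reads >= min_reads)
    needed = required_experiments(min_experiments, len(reads_per_experiment))
    return passing >= needed


# Made-up counts: one high expresser among ten experiments.
reads = [150, 0, 1, 2, 0, 3, 0, 1, 0, 0]

default_result = junction_passes(reads, min_reads=3, min_experiments=0.5)
permissive_result = junction_passes(reads, min_reads=3, min_experiments=1)
single_result = junction_passes([150], min_reads=3, min_experiments=0.5)

print(default_result, permissive_result, single_result)
```

Under these assumptions, the junction fails with the default of 0.5 (only 2 of 10 experiments reach the read threshold, but 5 are needed), passes with --min-experiments set to 1, and passes when the high expresser is run on its own.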

Let me know if it makes sense.

Thanks!

-San
