MAJIQ & Paired-end reads...

Chris Khoury

Dec 11, 2023, 2:19:46 AM

to Biociphers

Greetings MAJIQ Team,

We have two very quick questions (we hope) on RNA sequencing data and the MAJIQ algorithm.

1. How does MAJIQ handle paired-end reads whose mates overlap (read-through) when the overlap spans a splice junction? Is the junction counted once or twice?

If the junction is counted twice, do you have any recommendations, from your experience, on post-alignment steps to obtain an accurate representation of LSVs?

2. Is "min-pos" affected by quality trimming (hard clipping) at the head and tail of RNA-seq reads?

Thank you,
Samantha/Chris.

San Jewell

Dec 11, 2023, 12:26:03 PM

to Biociphers

Hi Samantha,

The answer to both questions is basically that the trimming/aligning steps that occur before MAJIQ are usually the determining factor. MAJIQ detects usage of different junctions/retained introns directly from the split/unsplit alignments of each read. It does not use any information linking paired reads, so it does not need to detect or handle differences between single-end and paired-end data. In general, the paired reads are processed into a single BAM file with an aligner such as STAR, which comes with its own suite of options for resolving these reads. We generally haven't needed to filter the data further after these steps to obtain accurate LSV representation. For question 2, again, we expect most trimming/cleaning to be performed before alignment, so by the time the data reaches MAJIQ, min-pos is not affected by it.
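To make "directly from split alignments per read" concrete, here is a minimal sketch of per-record junction counting from CIGAR strings. It is not MAJIQ's code; the function name `junctions_from_alignment`, the read names, and the coordinates are made up for illustration. It only relies on the SAM convention that an `N` operation marks a skipped reference region (an intron).

```python
import re
from collections import Counter

# One CIGAR element: a length followed by a SAM operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def junctions_from_alignment(pos, cigar):
    """Return (intron_start, intron_end) for every N operation.

    pos is the 1-based leftmost aligned reference position (SAM POS).
    Returned coordinates are 1-based and inclusive.
    Hypothetical helper, not part of MAJIQ.
    """
    junctions = []
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            junctions.append((ref, ref + length - 1))
        # Only these operations consume reference bases.
        if op in "MDN=X":
            ref += length
    return junctions

# Two mates of one hypothetical fragment, fully overlapping.
# Each BAM record is handled on its own, with no mate lookup.
records = [
    ("frag1/1", 1000, "50M200N100M"),
    ("frag1/2", 1000, "50M200N100M"),
]

counts = Counter()
for name, pos, cigar in records:
    counts.update(junctions_from_alignment(pos, cigar))

print(counts)
```

Because each record is treated independently in this sketch, the same junction is tallied once per mate.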

Let me know if it helps.

-San

Chris Khoury

Dec 11, 2023, 5:08:09 PM

to Biociphers

Good evening San,

Thanks - so in summary, with a little bit more qualification, could you confirm that:

a. If you have a paired-end read with 100 percent overlap, i.e. we sequenced 150 bp and our fragment insert is 150 bp, both mates contain the same splice junction when aligned. The resulting BAM thus contains two individual lines, one per mate, both containing the one annotated splice junction. This annotated splice junction will be counted twice when MAJIQ parses the CIGAR string.

and

b. For min-pos: we have two single-end reads, both with a splice junction starting 50 bp in from the fragment start point. Single-end read 1 has been head-trimmed by 10 bases, and single-end read 2 has not. Therefore single-end read 1 now has a start coordinate within chromosome 'x' that is shifted by plus 10 compared to single-end read 2, but the splice junction coordinate remains the same. Will these two reads be deemed as satisfying the MAJIQ parameter '--minpos 2'?

Cheers,

Samantha/Chris

Caleb Radens

Dec 13, 2023, 6:59:59 PM

to Biociphers

Hi Samantha/Chris,

Ex-Barash lab member here :)

A) Paired reads are counted twice, which is a feature, not a bug :). Usually this isn't a problem, because read lengths are 50-150 bp and inserts are way more than 300 bp. But if you're concerned that your library fragment insert sizes are less than 2x the read length and you're seeing lots of overlapping paired reads, I guess you could try trimming one of your mates so that there isn't overlapping information. AFAIK, MAJIQ doesn't do anything like that under the hood. I haven't personally come across a library with inserts that small, however, so I might have other concerns about the sample if the fragments are so short :/
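The "less than 2x the read length" rule of thumb can be written as a small calculation. This is only a back-of-the-envelope sketch: `mate_overlap` is a hypothetical helper, and it assumes both mates have the same length and that any adapter read-through has already been trimmed.

```python
def mate_overlap(read_len, fragment_len):
    """Number of fragment bases covered by both mates.

    Assumes equal-length mates sequenced from opposite fragment ends.
    If the fragment is shorter than the read, each mate covers the
    whole fragment, so the overlap is the fragment length.
    """
    covered = min(read_len, fragment_len)
    return max(0, 2 * covered - fragment_len)

# 150 bp reads against a few hypothetical fragment sizes.
for fragment_len in (150, 250, 300, 400):
    print(fragment_len, mate_overlap(150, fragment_len))
```

With 150 bp reads, a 150 bp fragment gives complete overlap (the case in question a), while fragments of 300 bp or more give none.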

B) If I understand correctly, you're wondering what happens if you have two fragments whose inserts share the same start, but when they were sequenced, one had low-quality base calls at the start and the other didn't. Then the reads were trimmed, or the aligner doesn't map their starts the same? In this case, yes, MAJIQ would see two distinct start positions. I'm not sure there is any way to determine that these two reads came from two fragments with inserts sharing the same start. I'm also not sure if this is a common scenario. Usually, reads have higher-quality base calls at the start, and by the end of the read the quality goes down a bit. If there is systematically low base call quality, I'd again be a bit worried about the underlying sample or sequencing run.
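The scenario in B can be sketched as follows. This is a simplified reading of `--minpos` as "number of distinct read start positions supporting a junction", which is an assumption based on this thread and not MAJIQ's actual implementation. The helper `passes_minpos` and all coordinates are hypothetical.

```python
from collections import defaultdict

def passes_minpos(reads, minpos):
    """Map each junction to True if it has >= minpos distinct read starts.

    reads is an iterable of (aligned_start, junction) pairs.
    Simplified illustration only, not MAJIQ's filtering code.
    """
    starts = defaultdict(set)
    for start, junction in reads:
        starts[junction].add(start)
    return {j: len(s) >= minpos for j, s in starts.items()}

junction = ("chrX", 1050, 1249)

# Read 2 is untrimmed and starts at 1000; read 1 was head-trimmed by
# 10 bases, so its alignment starts at 1010. Same junction for both.
trimmed_case = [(1010, junction), (1000, junction)]

# Without trimming, both reads would start at the same position.
untrimmed_case = [(1000, junction), (1000, junction)]

print(passes_minpos(trimmed_case, 2))
print(passes_minpos(untrimmed_case, 2))
```

Under this simplified model the trimmed pair counts as two positions, while the untrimmed pair counts as one.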

These are fun questions! Makes me miss the good ole days of drawing reads on the lab windows

Caleb Matthew Radens

Chris Khoury

Dec 19, 2023, 11:51:12 PM

to Biociphers

Hi Caleb, Thank you for the information and walk down memory lane - since they are fun questions we will keep them coming! Did you move to a bigger window???

Trimming was necessary for the dataset at hand because of read-through, not because of systematically poor base call qualities (150 bp RNA sequencing). On the NovaSeq 6000 we actually suffer from very good base calls. LOL
We apply a globally uniform trim to avoid introducing multiple positions for MAJIQ's --minpos.

Can you please advise how MAJIQ --minpos handles soft-clipping on paired-end reads? (Obviously STAR does a lot of soft-clipping, and we assume you encounter this in your work.)

In our case, we soft-clip the read-through overlap at the 5' or 3' end of RNA-seq reads using clipOverlap: https://genome.sph.umich.edu/wiki/BamUtil:_clipOverlap

While we have you online, have you been engaged with MAJIQ PSI and MAJIQ HET? We have an open post regarding uncorrelated PSI values in the two programs. https://groups.google.com/g/majiq_voila/c/odlDmmJGwkE

Samantha/Chris

Caleb Radens

Jan 12, 2024, 5:03:22 PM

to Biociphers

Someone please correct me if I'm wrong, but I think soft clipping is the same as trimming from MAJIQ's point of view. So if the read starts at position X, but the first 3 base calls get soft clipped, it will then start at position X+3, and MAJIQ counts that read towards the X+3 position.
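The X versus X+3 shift follows from the SAM format itself: the POS field records the first aligned base, so leading soft-clipped bases are not included in it. A minimal sketch, with hypothetical helper names and coordinates, and saying nothing about what MAJIQ does internally:

```python
import re

# Leading soft clip, optionally preceded by a hard clip (e.g. "5H3S92M").
LEADING_SOFT_CLIP = re.compile(r"^(?:\d+H)?(\d+)S")

def leading_soft_clip(cigar):
    """Number of soft-clipped bases at the left end of the alignment."""
    match = LEADING_SOFT_CLIP.match(cigar)
    return int(match.group(1)) if match else 0

def unclipped_start(pos, cigar):
    """Reference position the read would start at without soft clipping."""
    return pos - leading_soft_clip(cigar)

# Two hypothetical reads from fragments sharing the same start (X = 1000).
# The second has its first 3 bases soft clipped, so its POS is X + 3.
for pos, cigar in [(1000, "100M"), (1003, "3S97M")]:
    print(pos, cigar, unclipped_start(pos, cigar))
```

Any tool that uses POS as the read start sees the two reads at different positions, even though the soft clip length is still recoverable from the CIGAR string.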
