Hi Alex,
Regarding the implementation of the EM algorithm, could you please confirm whether the following is accurate?
STARsolo assigns reads to cell barcodes before gene mapping. Therefore, the EM implementation will distribute multi-gene multimapped reads on a per-cell basis (based on the uniquely mapped reads in that cell, rather than the uniquely mapped reads across all cells in the sample).
Example: If a cell has uniquely mapped reads in Gene X but not Gene Y, then a multimapped read aligning to both X and Y will get assigned to Gene X but not Gene Y (even if other cells have reads that uniquely map to Gene Y). Hope that makes sense.
I believe Alevin's order is similar to this, whereas Kallisto-Bustools aligns first and assigns reads to CBs second. In our example, that would distribute the read between Gene X and Gene Y based on the sample-wide distribution of reads aligned to X and Y.
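To make the example concrete, here is a minimal Python sketch of the difference I have in mind. It is only my assumption of how a per-cell versus sample-wide EM would behave, not STARsolo's (or any other tool's) actual code. The function name `em_distribute` and the toy counts are made up for illustration.

```python
def em_distribute(unique_counts, multi_reads, n_iter=100):
    """Toy EM: split each multi-gene read across its candidate genes.

    unique_counts: dict mapping gene -> number of uniquely mapped reads
    multi_reads:   list of tuples, each holding the genes one read maps to
    Returns a dict mapping gene -> estimated count (unique + fractional multi).
    """
    genes = set(unique_counts)
    for read in multi_reads:
        genes.update(read)
    # Start from a uniform abundance so that no gene is excluded a priori.
    abundance = {g: 1.0 for g in genes}
    for _ in range(n_iter):
        # Unique reads are fixed; only multi-gene reads get redistributed.
        estimate = {g: float(unique_counts.get(g, 0)) for g in genes}
        for read in multi_reads:
            total = sum(abundance[g] for g in read)
            if total == 0:
                continue  # no information at all for this read
            for g in read:
                estimate[g] += abundance[g] / total
        abundance = estimate
    return abundance


# Hypothetical data: cell A has unique reads only in X, cell B only in Y.
cell_a_unique = {"X": 3}
cell_b_unique = {"Y": 3}
cell_a_multi = [("X", "Y")]  # one read in cell A that maps to both genes

# Per-cell: only cell A's own unique reads inform the split.
per_cell = em_distribute(cell_a_unique, cell_a_multi)

# Sample-wide: unique reads from all cells are pooled before the split.
pooled_unique = {"X": 3, "Y": 3}
sample_wide = em_distribute(pooled_unique, cell_a_multi)

print(per_cell)     # X close to 4, Y close to 0
print(sample_wide)  # X and Y both 3.5
```

In the per-cell case the multimapped read ends up essentially entirely in Gene X, while in the pooled case it is split evenly between X and Y.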
And a related question:
Is there a way to keep multi-gene multimapped reads only if there are 'supporting' uniquely mapped reads in the corresponding genes, and to throw them out when there are no uniquely mapped reads to support them? In the example above, if a cell has no unique reads in either Gene X or Gene Y, then the multimapped reads aligning to X and Y would be thrown out for that cell.
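In case it helps, the rule I am describing could be sketched like this in Python. This is a hypothetical post-processing filter, not an existing STARsolo option; `keep_supported` and the toy data are names I made up, and I am assuming that unique support in at least one of the candidate genes is enough to keep the read.

```python
def keep_supported(unique_counts, multi_reads):
    """Keep a multi-gene read only if at least one of its candidate genes
    has uniquely mapped reads in the same cell (assumed rule)."""
    return [
        read
        for read in multi_reads
        if any(unique_counts.get(g, 0) > 0 for g in read)
    ]


# Hypothetical cell: unique reads in X only.
cell_unique = {"X": 3}
cell_multi = [("X", "Y"), ("Y", "Z")]

kept = keep_supported(cell_unique, cell_multi)
print(kept)  # [('X', 'Y')]
```

The read mapping to X and Y is kept because X has unique support in this cell, while the read mapping to Y and Z is dropped because neither gene has any.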
Thank you for your time,
Jesse