Large difference in mismatch rate per base between samples from the same experiment


Nathanael Walker-Hale

May 4, 2022, 5:53:03 PM
to rna-star
Hi Alex,

I'm using STAR to map plant RNA-seq data from a time course experiment with 5 time points, with control and treatment samples from each time point. I've observed that some samples (the first 26) have what seem to me to be concerning mismatch rates, e.g.

Started job on | May 03 14:49:33
Started mapping on | May 03 14:49:35
Finished on | May 03 14:56:00
Mapping speed, Million of reads per hour | 428.13

Number of input reads | 45785933
Average input read length | 200
UNIQUE READS:
Uniquely mapped reads number | 43373737
Uniquely mapped reads % | 94.73%
Average mapped length | 198.51
Number of splices: Total | 18686245
Number of splices: Annotated (sjdb) | 0
Number of splices: GT/AG | 18441405
Number of splices: GC/AG | 212140
Number of splices: AT/AC | 4163
Number of splices: Non-canonical | 28537
Mismatch rate per base, % | 0.75%
Deletion rate per base | 0.01%
Deletion average length | 1.48
Insertion rate per base | 0.01%
Insertion average length | 1.21
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 1367946
% of reads mapped to multiple loci | 2.99%
Number of reads mapped to too many loci | 23257
% of reads mapped to too many loci | 0.05%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 872178
% of reads unmapped: too short | 1.90%
Number of reads unmapped: other | 148815
% of reads unmapped: other | 0.33%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%

However, a little over halfway through the samples (samples 27-44), the mismatch rate dramatically improves, e.g.

Started job on | May 03 15:28:49
Started mapping on | May 03 15:28:51
Finished on | May 03 15:35:20
Mapping speed, Million of reads per hour | 445.69

Number of input reads | 48158944
Average input read length | 200
UNIQUE READS:
Uniquely mapped reads number | 45423573
Uniquely mapped reads % | 94.32%
Average mapped length | 198.99
Number of splices: Total | 21217528
Number of splices: Annotated (sjdb) | 0
Number of splices: GT/AG | 20963673
Number of splices: GC/AG | 218053
Number of splices: AT/AC | 5120
Number of splices: Non-canonical | 30682
Mismatch rate per base, % | 0.20%
Deletion rate per base | 0.01%
Deletion average length | 1.48
Insertion rate per base | 0.01%
Insertion average length | 1.26
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 1524512
% of reads mapped to multiple loci | 3.17%
Number of reads mapped to too many loci | 20480
% of reads mapped to too many loci | 0.04%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 983893
% of reads unmapped: too short | 2.04%
Number of reads unmapped: other | 206486
% of reads unmapped: other | 0.43%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%

I noticed also that for these latter samples, the average mapped length is a little closer to the average input length and the proportion of uniquely mapping reads is maybe the tiniest bit elevated, suggesting higher quality.

I conducted all the extractions myself over the course of about 5 days, meaning that some samples stayed at -80 longer than others. However, if this were the issue, I would expect later extractions to have higher mismatch rates, and this is not what I see. Our sequencing provider (BGI) is supposed to have processed them all together, to avoid batch effects. I suppose that my questions are i) do you think this indicates a library quality-related batch effect and ii) are the elevated mismatch rates of the first 26 samples a concern for alignment quality and read quantification for DEG analysis?

Many thanks for the help,

Best wishes,

Nathanael

Alexander Dobin

May 13, 2022, 2:32:58 PM
to rna-star
Hi Nathanael,

A higher mismatch error rate could indicate poorer sequencing quality or library prep issues, but, in principle, could also be biological in nature.
You can look at the quality score distributions to see if the higher error rate library had lower scores, i.e. poorer sequencing quality.
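
One way to look at the quality score distributions is a small script like the one below. This is only a sketch, assuming plain 4-line FASTQ records with Phred+33 encoded qualities; the function names are made up for illustration, and a dedicated QC tool would give a more complete picture.

```python
import gzip
from collections import defaultdict


def open_fastq(path):
    """Open a FASTQ file as text, transparently handling .gz (hypothetical helper)."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def mean_quality_per_position(fastq_lines, offset=33):
    """Return the mean Phred score at each read position.

    Assumes 4-line FASTQ records and Phred+33 encoding (offset=33).
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for i, line in enumerate(fastq_lines):
        if i % 4 != 3:  # the quality string is the 4th line of each record
            continue
        for pos, ch in enumerate(line.rstrip("\n")):
            totals[pos] += ord(ch) - offset
            counts[pos] += 1
    return [totals[p] / counts[p] for p in sorted(counts)]


# Tiny in-memory demo with two toy reads; for a real file, something like
#   with open_fastq("sample_1.fq.gz") as fh:   # hypothetical file name
#       profile = mean_quality_per_position(fh)
demo = ["@r1", "ACGT", "+", "IIII", "@r2", "ACG", "+", "5I5"]
print(mean_quality_per_position(demo))
```

Running this on one library from the high-mismatch group and one from the low-mismatch group and comparing the two profiles would show whether the first group has systematically lower scores.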

It's interesting that the other mapping statistics do not change significantly, which argues against a sequencing quality issue.
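
To make that comparison concrete, the two Log.final.out excerpts from this thread can be put side by side. The sketch below assumes each statistic is written as `name | value` on its own line, as in the pasted logs; the function names are hypothetical and the embedded values are copied from the two samples above.

```python
def parse_star_log(text):
    """Parse 'name | value' lines from a Log.final.out-style text into a dict."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:  # skip section headers such as 'UNIQUE READS:'
            continue
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def to_float(value):
    """Convert '0.75%' or '198.51' to a float; return None for non-numeric values."""
    try:
        return float(value.rstrip("%"))
    except ValueError:
        return None


def compare_logs(a, b, keys):
    """Return {key: (value_a, value_b, value_b - value_a)} for numeric keys."""
    rows = {}
    for key in keys:
        x, y = to_float(a.get(key, "")), to_float(b.get(key, ""))
        if x is not None and y is not None:
            rows[key] = (x, y, round(y - x, 4))
    return rows


# Values copied from the two logs quoted in this thread.
early = parse_star_log("""Uniquely mapped reads % | 94.73%
Average mapped length | 198.51
Mismatch rate per base, % | 0.75%""")
late = parse_star_log("""Uniquely mapped reads % | 94.32%
Average mapped length | 198.99
Mismatch rate per base, % | 0.20%""")

for key, (x, y, diff) in compare_logs(early, late, list(early)).items():
    print(f"{key}: {x} -> {y} (change {diff:+})")
```

Pointing `parse_star_log` at the full text of every sample's Log.final.out would give a table in which the mismatch rate can be checked against sample order, extraction date, or sequencing lane.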

Biological explanations require some imagination... e.g., some samples have more expression from more variable regions of the genome.

Cheers
Alex