SMART-SEQ scRNA alignment


User_new

Apr 23, 2020, 5:34:50 PM
to rna-star
I have SMART-Seq data from a single-cell experiment and used STAR to map the reads. After trimming the Nextera adapter sequences, most reads were still unmapped (65.42% "too short" in the log below). I have checked, and this is not contamination.

STAR --runThreadN 8 --genomeDir $TRANS_DATA --readFilesIn <(gunzip -c ${names[${SLURM_ARRAY_TASK_ID}]}_R1_001_val_1.fq.gz) <(gunzip -c ${names[${SLURM_ARRAY_TASK_ID}]}_R2_001_val_2.fq.gz)  --outSAMtype BAM SortedByCoordinate --outFileNamePrefix ${names[${SLURM_ARRAY_TASK_ID}]}_noextraparam --quantMode GeneCounts

                                 Started job on |       Apr 22 23:19:11
                             Started mapping on |       Apr 22 23:19:39
                                    Finished on |       Apr 23 00:01:11
       Mapping speed, Million of reads per hour |       38.02

                          Number of input reads |       26318495
                      Average input read length |       183
                                    UNIQUE READS:
                   Uniquely mapped reads number |       8665045
                        Uniquely mapped reads % |       32.92%
                          Average mapped length |       178.45
                       Number of splices: Total |       37848
            Number of splices: Annotated (sjdb) |       1071
                       Number of splices: GT/AG |       25908
                       Number of splices: GC/AG |       1456
                       Number of splices: AT/AC |       31
               Number of splices: Non-canonical |       10453
                      Mismatch rate per base, % |       0.36%
                         Deletion rate per base |       0.06%
                        Deletion average length |       1.15
                        Insertion rate per base |       0.01%
                       Insertion average length |       1.18
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       291859
             % of reads mapped to multiple loci |       1.11%
        Number of reads mapped to too many loci |       45962
             % of reads mapped to too many loci |       0.17%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.00%
                 % of reads unmapped: too short |       65.42%
                     % of reads unmapped: other |       0.37%
                                  CHIMERIC READS:
                       Number of chimeric reads |       0
                            % of chimeric reads |       0.00%

I went through a couple of posts and added the following parameters, which raised the uniquely mapped reads to 65%. How would this affect read counting? I would like to know before proceeding to downstream analysis.

STAR --runThreadN 8 --genomeDir $TRANS_DATA --readFilesIn <(gunzip -c ${names[${SLURM_ARRAY_TASK_ID}]}_R1_001_val_1.fq.gz) <(gunzip -c ${names[${SLURM_ARRAY_TASK_ID}]}_R2_001_val_2.fq.gz) --outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0 --outFilterMatchNmin 0 --outFilterMismatchNmax 2 --outSAMtype BAM SortedByCoordinate --outFileNamePrefix ${names[${SLURM_ARRAY_TASK_ID}]}_extraparam --quantMode GeneCounts

       Mapping speed, Million of reads per hour |       38.20

                          Number of input reads |       27081369
                      Average input read length |       182
                                    UNIQUE READS:
                   Uniquely mapped reads number |       17671953
                        Uniquely mapped reads % |       65.26%
                          Average mapped length |       105.20
                       Number of splices: Total |       45787
            Number of splices: Annotated (sjdb) |       1221
                       Number of splices: GT/AG |       17506
                       Number of splices: GC/AG |       2289
                       Number of splices: AT/AC |       93
               Number of splices: Non-canonical |       25899
                      Mismatch rate per base, % |       0.55%
                         Deletion rate per base |       0.05%
                        Deletion average length |       1.16
                        Insertion rate per base |       0.01%
                       Insertion average length |       1.22
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |       8788785
             % of reads mapped to multiple loci |       32.45%
        Number of reads mapped to too many loci |       502928
             % of reads mapped to too many loci |       1.86%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.00%
                 % of reads unmapped: too short |       0.00%
                     % of reads unmapped: other |       0.43%
                                  CHIMERIC READS:
                       Number of chimeric reads |       0
                            % of chimeric reads |       0.00%
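
The two summaries above are easier to judge side by side. The following is a minimal sketch, not part of the original post: `star_metric` and `compare_star_runs` are hypothetical helper names, and it assumes the `label | value` layout of `Log.final.out` shown in the logs above.

```shell
# Print the value of one metric from a STAR Log.final.out file.
# Usage: star_metric "<label>" <path/to/Log.final.out>
star_metric() {
    awk -F'|' -v want="$1" '
        {
            key = $1; val = $2
            gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim whitespace around the label
            gsub(/^[ \t]+|[ \t]+$/, "", val)   # trim whitespace around the value
            if (key == want) { print val; exit }
        }' "$2"
}

# Print label, value in run A, value in run B (tab-separated)
# for the metrics most relevant to the "too short" question.
compare_star_runs() {
    for label in \
        "Uniquely mapped reads %" \
        "Average mapped length" \
        "% of reads mapped to multiple loci" \
        "% of reads unmapped: too short"
    do
        printf '%s\t%s\t%s\n' "$label" \
            "$(star_metric "$label" "$1")" \
            "$(star_metric "$label" "$2")"
    done
}
```

It would be called as `compare_star_runs <run1>Log.final.out <run2>Log.final.out`, with the two paths replaced by the actual output files. For the two runs in this post, the interesting rows are the average mapped length (178.45 vs 105.20) and the multi-mapping percentage (1.11% vs 32.45%).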

Alexander Dobin

Apr 27, 2020, 7:53:26 PM
to rna-star
Hi @User_new

In your second run, you basically removed all "mapping quality" filtering, allowing alignments of any length.
The danger with this approach is that many of the short alignments may be wrong, which may skew the quantification.
I think the best approach is to try to understand why the reads do not map. It looks like you have already checked for contamination.
Other probable causes are:
(i) poor sequencing quality
(ii) presence of adapter sequences at the read ends - have you trimmed the adapter sequences before mapping?

Cheers
Alex
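
One way to see whether the relaxed filters skew the quantification is to compare the per-gene counts of the two runs. This is a minimal sketch, not something posted in the thread: `compare_gene_counts` is a hypothetical helper, and it assumes the `ReadsPerGene.out.tab` files written by `--quantMode GeneCounts` (tab-separated, gene ID in column 1, unstranded counts in column 2, summary rows starting with `N_` at the top).

```shell
# List genes whose unstranded count (column 2) differs between two
# ReadsPerGene.out.tab files. Output: gene, count in run 1, count in run 2.
# Usage: compare_gene_counts <run1 ReadsPerGene.out.tab> <run2 ReadsPerGene.out.tab>
compare_gene_counts() {
    awk -F'\t' '
        /^N_/ { next }                        # skip N_unmapped, N_multimapping, ...
        NR == FNR { first[$1] = $2; next }    # first file: remember each gene count
        ($1 in first) && first[$1] != $2 {    # second file: report changed genes
            print $1 "\t" first[$1] "\t" $2
        }
    ' "$1" "$2"
}
```

Genes whose counts jump sharply in the relaxed run are the ones worth inspecting, since the extra reads come from alignments that the default filters rejected as too short. If the library is stranded, column 3 or 4 would be the one to compare instead of column 2.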

User_new

Apr 27, 2020, 11:38:03 PM
to rna-star
Yes, the Nextera sequences were trimmed before mapping. I will also check by mapping just one mate of each read pair on its own.

Alexander Dobin

Apr 30, 2020, 6:43:54 PM
to rna-star
That's a good check!
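
A minimal sketch of that check, reusing the options of the first command in this thread but passing a single FASTQ file so that STAR treats the reads as single-end. The names `map_single_mate` and `DRY_RUN` are hypothetical and introduced only for this illustration; `--readFilesCommand gunzip -c` stands in for the process substitution used in the original command, and `$TRANS_DATA` is the genome directory variable from the original post.

```shell
# Map one mate by itself as single-end reads.
# Usage: map_single_mate <reads.fq.gz> <output prefix>
# Set DRY_RUN=1 to print the command instead of running it.
map_single_mate() {
    fq="$1"
    prefix="$2"
    # Build the STAR command line as the positional parameters.
    set -- STAR --runThreadN 8 --genomeDir "$TRANS_DATA" \
        --readFilesIn "$fq" --readFilesCommand gunzip -c \
        --outSAMtype BAM SortedByCoordinate \
        --outFileNamePrefix "$prefix" --quantMode GeneCounts
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$@"    # show what would be run
    else
        "$@"         # run STAR
    fi
}
```

Running it once on the R1 file and once on the R2 file, each with its own output prefix, and then comparing the "too short" percentages in the two `Log.final.out` files would show whether one mate is responsible for most of the unmapped pairs.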