Low unique maping when aligning smRNA-seq data using STAR

106 views
Skip to first unread message

Lauren Fletcher

unread,
Oct 12, 2023, 3:34:39 PM10/12/23
to rna-star

Hello,

I am working with a small RNA-seq data set that was run on Illumina Next seq, the reads are single end, 75bp. I have used TrimGalore to perform trimming and all looks good. However, when it comes to index generation and aligning with STAR I am struggling to find an option that works with my chosen Ensembl datasets (I am working with pig as my species, so cannot use ENCODE). I have done the indexing an aligning in two different ways and am having issues with both methods, even when looking at some of the solutions that have been posted to this group here. I will provide the code and Log.out files below. The main issue appears that I am having very low unique mapping, with a lot of reads unmapped.

 

1.     Using the entire genome and gtf

a.     Index generation

                                               i.     STAR --runMode genomeGenerate

                                              ii.      --genomeDir whole_genome_index_redo/generated_index

                                             iii.     --genomeFastaFiles Susscrofa_ensembl_genome_primary_assembly/Sus_scrofa.full_genome.primary_assembly.fa

                                             iv.     --sjdbGTFfile ensemblgenomeSTAR/Sus_scrofa.Sscrofa11.1.110.gtf

                                              v.      --sjdbOverhang 74

                                             vi.     --genomeChrBinNbits 15

b.     Alignment:

                       i.   STAR -- runMode alignReads

                     ii.   --genomeDir whole_genome_index_redo/generated_index

                    iii.   --readFilesCommand zcat

                     iv.   --readFilesIn trim_galore/48_INF_S2_R1_001_trimmed.fq.gz

                       v.   --outFileNamePrefix star_ensembl/L48genomic

                     vi.   --runThreadN 16

                    vii.   --sjdbGTFfile ensemblgenomeSTAR/Sus_scrofa.Sscrofa11.1.110.gtf

                   viii.   --alignEndsType EndToEnd

                     ix.   --outFilterMismatchNmax 1

                       x.   --outFilterMultimapScoreRange 0

                     xi.   --quantMode TranscriptomeSAM GeneCounts

                    xii.   --outReadsUnmapped Fastx

                   xiii.   --outSAMtype BAM Unsorted SortedByCoordinate

                    xiv.   --outFilterMultimapNmax 10

                     xv.   --outSAMunmapped Within

                    xvi.   --outFilterScoreMinOverLread 0

                   xvii.   --outFilterMatchNminOverLread 0

                  xviii.   --outFilterMatchNmin 16

                    xix.   --alignSJDBoverhangMin 100000

                     xx.   --alignIntronMax 1

                    xxi.   --outWigType wiggle

                   xxii.   --outWigStrand Stranded

                  xxiii.   --outWigNorm RPM 

c.  Alignment Log.final.out file:

Number of input reads |       1271129

                      Average input read length |       30

                                    UNIQUE READS:

                   Uniquely mapped reads number |       86400

                        Uniquely mapped reads % |       6.80%

                          Average mapped length |       26.02

                       Number of splices: Total |       0

            Number of splices: Annotated (sjdb) |       0

                       Number of splices: GT/AG |       0

                       Number of splices: GC/AG |       0

                       Number of splices: AT/AC |       0

               Number of splices: Non-canonical |       0

                      Mismatch rate per base, % |       1.12%

                         Deletion rate per base |       0.00%

                        Deletion average length |       1.00

                        Insertion rate per base |       0.00%

                       Insertion average length |       1.53

                             MULTI-MAPPING READS:

        Number of reads mapped to multiple loci |       32590

             % of reads mapped to multiple loci |       2.56%

        Number of reads mapped to too many loci |       190729

             % of reads mapped to too many loci |       15.00%

                                  UNMAPPED READS:

  Number of reads unmapped: too many mismatches |       933772

       % of reads unmapped: too many mismatches |       73.46%

            Number of reads unmapped: too short |       9026

                 % of reads unmapped: too short |       0.71%

                Number of reads unmapped: other |       18612

                     % of reads unmapped: other |       1.46%

                                  CHIMERIC READS:

                       Number of chimeric reads |       0

                            % of chimeric reads |       0.00%

 

2.     Using the ncRNA fasta file with no gtf

a.     Index generation

                                               i.     STAR --runThreadN 16

                                              ii.     --runMode genomeGenerate –

                                             iii.     -genomeDir ensemblgenomeSTAR/star_ensembl_index_nogtf

                                             iv.     --genomeFastaFiles ensemblgenomeSTAR/Sus_scrofa.Sscrofa11.1.ncrna.fa

b.     Alignment

                                               i.     STAR --genomeDir ensemblgenomeSTAR/star_ensembl_index_nogtf

                                              ii.     --readFilesCommand zcat

                                             iii.     --readFilesIn trim_galore/48_INF_S2_R1_001_trimmed.fq.gz

                                             iv.     --outFileNamePrefix alignedfiles/L48ncRNAnogtf

                                              v.     --runThreadN 16

c.     Log.final.out file:

Number of input reads |       1271129

                      Average input read length |       30

                                    UNIQUE READS:

                   Uniquely mapped reads number |       187182

                        Uniquely mapped reads % |       14.73%

                          Average mapped length |       23.37

                       Number of splices: Total |       1589

            Number of splices: Annotated (sjdb) |       0

                       Number of splices: GT/AG |       1560

                       Number of splices: GC/AG |       16

                       Number of splices: AT/AC |       0

               Number of splices: Non-canonical |       13

                      Mismatch rate per base, % |       2.49%

                         Deletion rate per base |       0.00%

                        Deletion average length |       1.16

                        Insertion rate per base |       0.00%

                       Insertion average length |       1.00

                             MULTI-MAPPING READS:

        Number of reads mapped to multiple loci |       346488

             % of reads mapped to multiple loci |       27.26%

        Number of reads mapped to too many loci |       57097

             % of reads mapped to too many loci |       4.49%

                                  UNMAPPED READS:

  Number of reads unmapped: too many mismatches |       0

       % of reads unmapped: too many mismatches |       0.00%

            Number of reads unmapped: too short |       679573

                 % of reads unmapped: too short |       53.46%

                Number of reads unmapped: other |       789

                     % of reads unmapped: other |       0.06%

                                  CHIMERIC READS:

                       Number of chimeric reads |       0

                            % of chimeric reads |       0.00%

 

If there is any insight it would be much appreciated!!! I have been struggling with this issue for quite some time.

Thanks!!

Lauren

Alexander Dobin

unread,
Oct 12, 2023, 3:44:43 PM10/12/23
to rna-star
Hi Lauren,

in the 1st run, you restricted the number of mismatches to  --outFilterMismatchNmax 1 with EndToEnd alignment, so most reads could not map because they had >1 mismatch.
In the 2nd run, you mapped only to ncRNA sequences, and a decent % of reads were mapped uniquely, and even more as multimappers. Still, the mismatch error rate of 2.49% is pretty high, which indicates either a poor quality of sequencing, or a divergent genome.
I would recommend mapping to the whole genome with default parameters which gives you a better idea of mappability.
Reply all
Reply to author
Forward
0 new messages