Low unique maping when aligning smRNA-seq data using STAR

228 views

Skip to first unread message

Lauren Fletcher

unread,

Oct 12, 2023, 3:34:39 PM10/12/23

to rna-star

Hello,

I am working with a small RNA-seq data set that was run on Illumina Next seq, the reads are single end, 75bp. I have used TrimGalore to perform trimming and all looks good. However, when it comes to index generation and aligning with STAR I am struggling to find an option that works with my chosen Ensembl datasets (I am working with pig as my species, so cannot use ENCODE). I have done the indexing an aligning in two different ways and am having issues with both methods, even when looking at some of the solutions that have been posted to this group here. I will provide the code and Log.out files below. The main issue appears that I am having very low unique mapping, with a lot of reads unmapped.

1. Using the entire genome and gtf

a. Index generation

i. STAR --runMode genomeGenerate

ii. --genomeDir whole_genome_index_redo/generated_index

iii. --genomeFastaFiles Susscrofa_ensembl_genome_primary_assembly/Sus_scrofa.full_genome.primary_assembly.fa

iv. --sjdbGTFfile ensemblgenomeSTAR/Sus_scrofa.Sscrofa11.1.110.gtf

v. --sjdbOverhang 74

vi. --genomeChrBinNbits 15

b. Alignment:

i. STAR -- runMode alignReads

ii. --genomeDir whole_genome_index_redo/generated_index

iii. --readFilesCommand zcat

iv. --readFilesIn trim_galore/48_INF_S2_R1_001_trimmed.fq.gz

v. --outFileNamePrefix star_ensembl/L48genomic

vi. --runThreadN 16

vii. --sjdbGTFfile ensemblgenomeSTAR/Sus_scrofa.Sscrofa11.1.110.gtf

viii. --alignEndsType EndToEnd

ix. --outFilterMismatchNmax 1

x. --outFilterMultimapScoreRange 0

xi. --quantMode TranscriptomeSAM GeneCounts

xii. --outReadsUnmapped Fastx

xiii. --outSAMtype BAM Unsorted SortedByCoordinate

xiv. --outFilterMultimapNmax 10

xv. --outSAMunmapped Within

xvi. --outFilterScoreMinOverLread 0

xvii. --outFilterMatchNminOverLread 0

xviii. --outFilterMatchNmin 16

xix. --alignSJDBoverhangMin 100000

xx. --alignIntronMax 1

xxi. --outWigType wiggle

xxii. --outWigStrand Stranded

xxiii. --outWigNorm RPM

c. Alignment Log.final.out file:

Number of input reads | 1271129

Average input read length | 30

UNIQUE READS:

Uniquely mapped reads number | 86400

Uniquely mapped reads % | 6.80%

Average mapped length | 26.02

Number of splices: Total | 0

Number of splices: Annotated (sjdb) | 0

Number of splices: GT/AG | 0

Number of splices: GC/AG | 0

Number of splices: AT/AC | 0

Number of splices: Non-canonical | 0

Mismatch rate per base, % | 1.12%

Deletion rate per base | 0.00%

Deletion average length | 1.00

Insertion rate per base | 0.00%

Insertion average length | 1.53

MULTI-MAPPING READS:

Number of reads mapped to multiple loci | 32590

% of reads mapped to multiple loci | 2.56%

Number of reads mapped to too many loci | 190729

% of reads mapped to too many loci | 15.00%

UNMAPPED READS:

Number of reads unmapped: too many mismatches | 933772

% of reads unmapped: too many mismatches | 73.46%

Number of reads unmapped: too short | 9026

% of reads unmapped: too short | 0.71%

Number of reads unmapped: other | 18612

% of reads unmapped: other | 1.46%

CHIMERIC READS:

Number of chimeric reads | 0

% of chimeric reads | 0.00%

2. Using the ncRNA fasta file with no gtf

a. Index generation

i. STAR --runThreadN 16

ii. --runMode genomeGenerate –

iii. -genomeDir ensemblgenomeSTAR/star_ensembl_index_nogtf

iv. --genomeFastaFiles ensemblgenomeSTAR/Sus_scrofa.Sscrofa11.1.ncrna.fa

b. Alignment

i. STAR --genomeDir ensemblgenomeSTAR/star_ensembl_index_nogtf

ii. --readFilesCommand zcat

iii. --readFilesIn trim_galore/48_INF_S2_R1_001_trimmed.fq.gz

iv. --outFileNamePrefix alignedfiles/L48ncRNAnogtf

v. --runThreadN 16

c. Log.final.out file:

Number of input reads | 1271129

Average input read length | 30

UNIQUE READS:

Uniquely mapped reads number | 187182

Uniquely mapped reads % | 14.73%

Average mapped length | 23.37

Number of splices: Total | 1589

Number of splices: Annotated (sjdb) | 0

Number of splices: GT/AG | 1560

Number of splices: GC/AG | 16

Number of splices: AT/AC | 0

Number of splices: Non-canonical | 13

Mismatch rate per base, % | 2.49%

Deletion rate per base | 0.00%

Deletion average length | 1.16

Insertion rate per base | 0.00%

Insertion average length | 1.00

MULTI-MAPPING READS:

Number of reads mapped to multiple loci | 346488

% of reads mapped to multiple loci | 27.26%

Number of reads mapped to too many loci | 57097

% of reads mapped to too many loci | 4.49%

UNMAPPED READS:

Number of reads unmapped: too many mismatches | 0

% of reads unmapped: too many mismatches | 0.00%

Number of reads unmapped: too short | 679573

% of reads unmapped: too short | 53.46%

Number of reads unmapped: other | 789

% of reads unmapped: other | 0.06%

CHIMERIC READS:

Number of chimeric reads | 0

% of chimeric reads | 0.00%

If there is any insight it would be much appreciated!!! I have been struggling with this issue for quite some time.

Thanks!!

Lauren

Alexander Dobin

unread,

Oct 12, 2023, 3:44:43 PM10/12/23

to rna-star

Hi Lauren,

in the 1st run, you restricted the number of mismatches to --outFilterMismatchNmax 1 with EndToEnd alignment, so most reads could not map because they had >1 mismatch.

In the 2nd run, you mapped only to ncRNA sequences, and a decent % of reads were mapped uniquely, and even more as multimappers. Still, the mismatch error rate of 2.49% is pretty high, which indicates either a poor quality of sequencing, or a divergent genome.

I would recommend mapping to the whole genome with default parameters which gives you a better idea of mappability.

Reply all

Reply to author

Forward

0 new messages