Hello,
I am working with a small RNA-seq data set that was run on Illumina Next seq, the reads are single end, 75bp. I have used TrimGalore to perform trimming and all looks good. However, when it comes to index generation and aligning with STAR I am struggling to find an option that works with my chosen Ensembl datasets (I am working with pig as my species, so cannot use ENCODE). I have done the indexing an aligning in two different ways and am having issues with both methods, even when looking at some of the solutions that have been posted to this group here. I will provide the code and Log.out files below. The main issue appears that I am having very low unique mapping, with a lot of reads unmapped.
1. Using the entire genome and gtf
a. Index generation
i. STAR --runMode genomeGenerate
ii. --genomeDir whole_genome_index_redo/generated_index
iii. --genomeFastaFiles Susscrofa_ensembl_genome_primary_assembly/Sus_scrofa.full_genome.primary_assembly.fa
iv. --sjdbGTFfile ensemblgenomeSTAR/Sus_scrofa.Sscrofa11.1.110.gtf
v. --sjdbOverhang 74
vi. --genomeChrBinNbits 15
b. Alignment:
i. STAR -- runMode alignReads
ii. --genomeDir whole_genome_index_redo/generated_index
iii. --readFilesCommand zcat
iv. --readFilesIn trim_galore/48_INF_S2_R1_001_trimmed.fq.gz
v. --outFileNamePrefix star_ensembl/L48genomic
vi. --runThreadN 16
vii. --sjdbGTFfile ensemblgenomeSTAR/Sus_scrofa.Sscrofa11.1.110.gtf
viii. --alignEndsType EndToEnd
ix. --outFilterMismatchNmax 1
x. --outFilterMultimapScoreRange 0
xi. --quantMode TranscriptomeSAM GeneCounts
xii. --outReadsUnmapped Fastx
xiii. --outSAMtype BAM Unsorted SortedByCoordinate
xiv. --outFilterMultimapNmax 10
xv. --outSAMunmapped Within
xvi. --outFilterScoreMinOverLread 0
xvii. --outFilterMatchNminOverLread 0
xviii. --outFilterMatchNmin 16
xix. --alignSJDBoverhangMin 100000
xx. --alignIntronMax 1
xxi. --outWigType wiggle
xxii. --outWigStrand Stranded
xxiii. --outWigNorm RPM
c. Alignment Log.final.out file:
Number of input reads | 1271129
Average input read length | 30
UNIQUE READS:
Uniquely mapped reads number | 86400
Uniquely mapped reads % | 6.80%
Average mapped length | 26.02
Number of splices: Total | 0
Number of splices: Annotated (sjdb) | 0
Number of splices: GT/AG | 0
Number of splices: GC/AG | 0
Number of splices: AT/AC | 0
Number of splices: Non-canonical | 0
Mismatch rate per base, % | 1.12%
Deletion rate per base | 0.00%
Deletion average length | 1.00
Insertion rate per base | 0.00%
Insertion average length | 1.53
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 32590
% of reads mapped to multiple loci | 2.56%
Number of reads mapped to too many loci | 190729
% of reads mapped to too many loci | 15.00%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 933772
% of reads unmapped: too many mismatches | 73.46%
Number of reads unmapped: too short | 9026
% of reads unmapped: too short | 0.71%
Number of reads unmapped: other | 18612
% of reads unmapped: other | 1.46%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
2. Using the ncRNA fasta file with no gtf
a. Index generation
i. STAR --runThreadN 16
ii. --runMode genomeGenerate –
iii. -genomeDir ensemblgenomeSTAR/star_ensembl_index_nogtf
iv. --genomeFastaFiles ensemblgenomeSTAR/Sus_scrofa.Sscrofa11.1.ncrna.fa
b. Alignment
i. STAR --genomeDir ensemblgenomeSTAR/star_ensembl_index_nogtf
ii. --readFilesCommand zcat
iii. --readFilesIn trim_galore/48_INF_S2_R1_001_trimmed.fq.gz
iv. --outFileNamePrefix alignedfiles/L48ncRNAnogtf
v. --runThreadN 16
c. Log.final.out file:
Number of input reads | 1271129
Average input read length | 30
UNIQUE READS:
Uniquely mapped reads number | 187182
Uniquely mapped reads % | 14.73%
Average mapped length | 23.37
Number of splices: Total | 1589
Number of splices: Annotated (sjdb) | 0
Number of splices: GT/AG | 1560
Number of splices: GC/AG | 16
Number of splices: AT/AC | 0
Number of splices: Non-canonical | 13
Mismatch rate per base, % | 2.49%
Deletion rate per base | 0.00%
Deletion average length | 1.16
Insertion rate per base | 0.00%
Insertion average length | 1.00
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 346488
% of reads mapped to multiple loci | 27.26%
Number of reads mapped to too many loci | 57097
% of reads mapped to too many loci | 4.49%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 679573
% of reads unmapped: too short | 53.46%
Number of reads unmapped: other | 789
% of reads unmapped: other | 0.06%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
If there is any insight it would be much appreciated!!! I have been struggling with this issue for quite some time.
Thanks!!
Lauren