STAR --runMode genomeGenerate --genomeDir ~/db/hg38/ \
  --genomeFastaFiles ~/db/hg38/hg38.fa --sjdbGTFfile ~/db/hg38/hg38.gtf \
  --runThreadN 30 --sjdbOverhang 89
STAR --genomeDir ~/db/hg38/ --readFilesIn sample1.R1.fastq.gz sample1.R2.fastq.gz \
  --readFilesCommand zcat --outSAMunmapped Within --outFileNamePrefix sample1. --runThreadN 30
STAR --runMode genomeGenerate --genomeDir ~/db/hg38/SJ_Index/ \
  --genomeFastaFiles ~/db/hg38/SJ_Index/hg38.fa --sjdbGTFfile ~/db/hg38/SJ_Index/hg38.gtf \
  --runThreadN 30 --sjdbOverhang 89 --sjdbFileChrStartEnd SJ_out/*.SJ.out.tab
STAR --genomeDir ~/db/hg38/SJ_Index/ \
  --readFilesIn sample1.R1.fastq.gz sample1.R2.fastq.gz \
  --readFilesCommand zcat --outSAMunmapped Within --outFileNamePrefix sample1. --runThreadN 30
1. Filter out junctions on chrM, which are most likely to be false. That means removing all chrM records.
2. Filter out non-canonical junctions (column5 == 0). That means keeping column5 > 0.
3. Filter out junctions supported only by multi-mappers (column7 == 0). That means keeping column7 > 0.
4. Filter out junctions supported by too few reads (e.g. column7 <= 2). That means keeping column7 > 2.
One more question: should I merge the junction files of the disease and healthy samples into one, or do I need to build separate indexes for disease and healthy samples for the 2-pass run?
cat *.tab | awk '($5 > 0 && $7 > 2 && $6==0)' | cut -f1-6 | sort | uniq | wc -l
cat *.tab | awk '($5 > 0 && $7 > 2 && $6==0)' | cut -f1-6 | sort | uniq > SJ.filtered.tab
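For reference, here is a minimal sketch that applies all four filters from the numbered list in one pass, with the chrM check written out explicitly. The function name filter_sj and the file locations in the usage line are assumptions for illustration only. Note that column7 > 2 already implies column7 > 0, so filters 3 and 4 collapse into one test. To keep only unannotated junctions as well, `&& $6 == 0` could be added to the awk condition.

```shell
# Hypothetical helper: merge SJ.out.tab files and keep junctions that pass
# the four filters. Columns follow STAR's SJ.out.tab layout:
#   $1 chromosome, $5 intron motif (0 = non-canonical),
#   $7 number of uniquely mapped reads crossing the junction.
filter_sj() {
    cat "$@" \
      | awk -F'\t' '$1 != "chrM" && $5 > 0 && $7 > 2' \
      | cut -f1-6 \
      | sort -u
}

# Usage (paths are assumptions):
# filter_sj SJ_out/*.SJ.out.tab > SJ.filtered.tab
```

Wrapping the pipeline in a function keeps the input glob separate from the output file, so a previously written SJ.filtered.tab is not accidentally picked up by a broad pattern such as *.tab on a rerun.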