Hello,
I have ~500M 200bp paired-end RNA-seq reads from mouse. I am interested in doing a de novo transcript reconstruction using Cufflinks. For the alignment step I have decided on the following criteria:
a) Consider only those reads where both ends align concordantly [no chimeric reads, and no pairs where one end aligns and the other doesn't]
b) If a read spans a splice junction, the overhang should be at least 40bp on either side [I have ~500M reads, so at least a few of them should satisfy this constraint for most splice junctions]
c) If the splice overhang is less than 40bp on one of the sides, soft clip the read
d) Keep only uniquely mapping paired-end reads in the output SAM file
e) Allow mismatches of up to 3% of the read length, to account for the Illumina error rate
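To make criterion (e) concrete, here is the arithmetic as a rough sketch. It assumes that "200bp paired end" means 2 x 200bp mates, and that STAR evaluates the mismatch ratio for a pair against the combined length of both mates; both are assumptions on my part, and the variable names are only for illustration.

```shell
# Assumption: each mate is 200bp; set mate_len=100 if the run was 2 x 100bp.
mate_len=200

# Assumption: for a pair, the mismatch ratio is taken over both mates together.
pair_len=$(( mate_len * 2 ))

# 3% of the pair length, in integer arithmetic.
max_mismatches=$(( pair_len * 3 / 100 ))

echo "mismatch budget per pair: ${max_mismatches}"
```

If the ratio is in fact taken over the mapped length rather than the full read length, soft-clipped alignments would get a smaller budget than this.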
To achieve this I used the following command:
STAR \
--runThreadN 16 \
--genomeDir ~/mouse/star \
--readFilesIn ~/fq1.fastq.gz \
~/fq2.fastq.gz \
--readFilesCommand zcat \
--outFileNamePrefix ./ \
--outFilterIntronMotifs RemoveNoncanonical \
--outSJfilterReads Unique \
--alignSJoverhangMin 40 \
--alignSJDBoverhangMin 40 \
--outFilterMultimapNmax 1 \
--outFilterMismatchNoverLmax 0.03 \
--outStd SAM | samtools view -bS - > star.bam
I think the above command achieves my goals [any advice to improve it is most welcome], but I could not figure out how to tell STAR to report only 'properly' paired alignments. Can someone help me with this? Thanks.
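The only workaround I can think of is a post-filter on the SAM FLAG field rather than a STAR option. Below is a minimal sketch of that idea: it keeps header lines and any record whose FLAG (column 2) has the 0x2 "properly paired" bit set. The function name and the three toy records are hypothetical and only show the first few SAM columns.

```shell
# Keep SAM header lines and records whose FLAG has bit 0x2 (properly paired) set.
# int(FLAG / 2) % 2 extracts that bit without needing bitwise operators in awk.
keep_proper_pairs() {
  awk -F '\t' '/^@/ { print; next } int($2 / 2) % 2 == 1 { print }'
}

# Toy input (hypothetical): flags 99 and 147 are the two mates of a proper pair,
# flag 73 is a read whose mate is unmapped.
printf 'r1\t99\tchr1\t100\nr1\t147\tchr1\t300\nr2\t73\tchr1\t500\n' | keep_proper_pairs
```

On the real output I assume the same filter could be done with `samtools view -b -f 2 star.bam > star.proper.bam`, but I would prefer to have STAR do it directly if that is possible.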
Regards,
Rahul