Hi Alex,
Thanks a lot for your reply!
1. Now it is clear. Yes, only a small portion of yeast genes are alternatively spliced, so in most cases only one isoform is observed. I will try --quantMode GeneCounts. Thank you!
2. Here are the first three records from the Unmapped.out.mate1 file:
@D00733:162:CADM2ANXX:2:2204:6092:26744 00
CTCGTATCATGACCCACTTGACACGCCTTGGTAATCTTAGTAAATGGGCA
+
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
@D00733:162:CADM2ANXX:2:2204:7437:26609 00
CAAGAAGCAGACAAAGCGTAAGCACCGTCAGCAGTCAAAGTACAGTCTTG
+
CCCCCGFGGGGGGGGGGGGGGGFGGEGGGGGGGGGGGGGGGGGGCBDGEG
@D00733:162:CADM2ANXX:2:2204:8253:26526 00
CCGGCAACAGAGTTGTAGACCAAACCGAAACCAACAGACTTACCACCACC
+
CCCCCDGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
I have found that these reads match a different strain of S. cerevisiae than the one I am using. Could this be the reason?
Here is my STAR command line:
STAR --runThreadN 8 --genomeDir ${GENOME_DIR} --sjdbGTFfile ${GTF} --readFilesIn ../../S_cer*REP${i}_*read1.* ../../S_cer*REP${i}_*read2.* --readFilesCommand zcat --outSAMtype BAM Unsorted --outReadsUnmapped Fastx --quantMode TranscriptomeSAM --outTmpDir ~/TMP/TMPs
Here are the reference genome and GFF file:
http://downloads.yeastgenome.org/sequence/S288C_reference/genome_releases/S288C_reference_genome_Current_Release.tgz (.fsa and .gff, respectively)
There was a FASTA sequence at the end of the GFF; I removed it and renamed the chromosomes in the FASTA file so that they match the GFF.
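The two preprocessing steps can be sketched roughly as follows. This is only a minimal sketch: it assumes the GFF marks its embedded sequence with the standard `##FASTA` directive, and the function names and the two-column, tab-separated map file (old name, new name) are hypothetical.

```shell
# Print a GFF3 file up to, but not including, the ##FASTA directive.
strip_gff_fasta() {
    awk '/^##FASTA/ { exit } { print }' "$1"
}

# Rename FASTA headers. $1 = map file (old name <TAB> new name), $2 = FASTA file.
# The first word of each header is looked up in the map; headers without
# a match are printed unchanged.
rename_fasta_headers() {
    awk -F'\t' '
        NR == FNR { map[$1] = $2; next }      # first file: load the name map
        /^>/ {
            split(substr($0, 2), parts, " ")  # first word after ">" is the sequence ID
            if (parts[1] in map) print ">" map[parts[1]]
            else print
            next
        }
        { print }                             # sequence lines pass through
    ' "$1" "$2"
}
```

Usage would be along the lines of `strip_gff_fasta annotation.gff > annotation.nofasta.gff` and `rename_fasta_headers names.tsv genome.fsa > genome.renamed.fa`, where all file names are placeholders.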
Then I converted the GFF to GTF using gffread (which gives me a strange result, though: the third column contains only CDS. Is that normal?).
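To make the "CDS only" observation easy to reproduce, the feature types in the third column can be tallied as below. This is a minimal sketch; the function name and file names are placeholders, and the gffread call in the comment is the usual GFF-to-GTF form rather than necessarily the exact command used here.

```shell
# Conversion step (placeholder file names; -T requests GTF output, -o names the output file):
#   gffread annotation.gff -T -o annotation.gtf

# Count how often each feature type (column 3) occurs in a GTF/GFF file,
# skipping comment lines. Output: feature type <TAB> count, sorted by type.
count_feature_types() {
    awk -F'\t' '
        !/^#/ && NF >= 3 { n[$3]++ }
        END { for (t in n) print t "\t" n[t] }
    ' "$1" | sort
}
```

Running `count_feature_types annotation.gtf` on the converted file shows at a glance whether any `exon` or `transcript` lines are present alongside `CDS`.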
I have spike-ins in my RNA-seq, so I also appended the spike-in sequences and annotations to the FASTA and GTF files, respectively.
3. Let me add another question (please let me know if I should post it as a separate topic).
Besides gene-level DE analysis using raw counts, I also need to calculate TPM values for genes (not transcripts). For this, I tried to use RSEM, feeding it the BAM files generated by --quantMode TranscriptomeSAM. However, the resulting BAM files are very small (ca. 40 MB) and apparently do not contain most of the data. So is my data (yeast RNA-seq with almost no splicing) suitable for the STAR-RSEM pipeline for calculating gene-level TPMs, or should I calculate TPMs in a different way? I would very much appreciate it if you could point out what I am doing wrong or missing.
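To make "a different way" concrete: the fallback I have in mind is the standard TPM formula applied directly to gene counts, i.e. TPM_i = 10^6 * (c_i / l_i) / sum_j (c_j / l_j), where c is the read count and l the gene length. A minimal sketch, assuming a hypothetical headerless, tab-separated file with gene ID, gene length in bp, and read count (it uses plain gene length, not the effective length that RSEM estimates):

```shell
# Compute TPM from a headerless TSV: gene ID <TAB> length (bp) <TAB> read count.
# Lines with a non-positive length are skipped.
counts_to_tpm() {
    awk -F'\t' '
        $2 > 0 { n++; id[n] = $1; rate[n] = $3 / $2; total += rate[n] }
        END {
            for (i = 1; i <= n; i++)
                printf "%s\t%.2f\n", id[i], (total > 0 ? 1e6 * rate[i] / total : 0)
        }
    ' "$1"
}
```

For example, two genes with lengths 1000 and 2000 bp and counts 10 and 60 have length-normalised rates of 0.01 and 0.03, so they would get 250000 and 750000 TPM, respectively.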
Kind regards,
Grant