Unmapped reads and multi-mappers for S. cerevisease

Grant

unread,

Dec 30, 2016, 2:33:31 AM12/30/16

to rna-star

Dear Alex,

First let me thank you for STAR, which very efficient tool, and for your time that you spend for supporting users.

My question is following,

I use STAR to align RNAseq (50 bp paired end high quality Illumina reads) for S. cereviseae. All the procedure is quite standard, I do not change most of the default parameters for neither indexing, nor mapping (I could provide details on procedure if necessary).

In the end I get quite good results (according to your previous posts and suggestions) - 92% of unique mapping, 5.44% reads are mapping to multiple loci, and 2.5% of unmapped reads (too short). Those are numbers for 1 replicate, for the other reps results are almost the same.

However, I am curious about two things:

1. Is it possible that this amount multi-mappings might affect DE analysis (I will use featurecounts/htseq with DESeq2)?

2. I have tried to blast unmapped reads, and for 10-15 randomly chosen reads blast found ideal matches with S. cereviseae genome. What might be the reason that STAR could not map those reads?

Thank you in advance for your time,
Best,
Grant

Alexander Dobin

unread,

Jan 6, 2017, 4:04:33 PM1/6/17

to rna-star

Hi Grant,

1. The multimapping reads may indeed affect the DE analysis. The obvious case is where a gene is covered with multimappers only.

If a gene contains only one isoform, the multimapping will simply reduce the reads counts, so the confidence in the DE will be lower.

The worst case is when a gene has multiple isoforms, and different isoforms contain different proportion of multi-mappers - in this case, the counting may be significantly screwed up.

However, I suspect the latter cases should be rare for yeast.

In principle, the "transcript qunatification" tools, such as RSEM, Cufflinks etc., should be able to "rescue" the multimappers. In practice, they have significant problem as well.

Note that STAR has the --quantMode GeneCounts option that will produce the unique mapper counts per gene the same way as htseq.

2. I will need to check these reads - please send me these few reads, as well as STAR mapping parameters and link to the genome assembly and annotations.

Cheers

Alex

Grant

unread,

Jan 9, 2017, 10:51:44 AM1/9/17

to rna-star

Hi Alex,

Thanks a lot for reply!

1. Now it is clear. Yes, only a small portion of genes in yeasts are alternatively spliced, so in most of the cases only one isoform is observed. I will try --quantMode GeneCounts. Thank you!

2.
Here are first three records form Unmapped.out.mate1 file:
@D00733:162:CADM2ANXX:2:2204:6092:26744    00
CTCGTATCATGACCCACTTGACACGCCTTGGTAATCTTAGTAAATGGGCA
+
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
@D00733:162:CADM2ANXX:2:2204:7437:26609    00
CAAGAAGCAGACAAAGCGTAAGCACCGTCAGCAGTCAAAGTACAGTCTTG
+
CCCCCGFGGGGGGGGGGGGGGGFGGEGGGGGGGGGGGGGGGGGGCBDGEG
@D00733:162:CADM2ANXX:2:2204:8253:26526    00
CCGGCAACAGAGTTGTAGACCAAACCGAAACCAACAGACTTACCACCACC
+
CCCCCDGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

I have found out that reads are matching to a different strain of S. cereviseae than I am using. Maybe this is a reason?

Here is my STAR command line:
STAR --runThreadN 8 --genomeDir ${GENOME_DIR} --sjdbGTFfile ${GTF} --readFilesIn ../../S_cer*REP${i}_*read1.* ../../S_cer*REP${i}_*read2.* --readFilesCommand zcat --outSAMtype BAM Unsorted --outReadsUnmapped Fastx --quantMode TranscriptomeSAM --outTmpDir ~/TMP/TMPs

Here is the reference genome and gff file:
http://downloads.yeastgenome.org/sequence/S288C_reference/genome_releases/S288C_reference_genome_Current_Release.tgz (.fsa and .gff, respectively)

There was a fasta sequence in the end of gff, I have removed it and renamed chromosome names in fasta file so they match to gff.
Then I have converted gff to gtf using gffread ( which gives me a strange results though where the third column contains CDS only, is it normal?).
I have spike-ins in my RNAseq, so I also appended seqs and annotations of spike-ins to fasta and gtf file, respectively.

3. Let me add another question (please let me know if I should add it as a separate topic)
Besides of doing gene DE analysis using raw counts, I also need to calculate TPM values for genes (not transcripts). For this, I was trying to use RSEM and feed it with bam files generated by --quantMode TranscriptomeSAM. However, produced bam files are very small (ca 40 MB) and apparently do not contain most of the data. So is my data (yeast RNAseq with almost no splicing) suitable for STAR-RSEM pipeline for calculating gene-level TPMs or should I calculate TPMs in a different way? I will appreciate very much if you can point out what am I doing wrong or missing.

Kind regards,
Grant

Alexander Dobin

unread,

Jan 10, 2017, 4:34:05 PM1/10/17

to rna-star

Hi Grant,

The column 3 in the GFF (and converted GTF) have three different entries:

7058 CDS

47 exon

484 noncoding_exon

Normally, the GTF file has to list all exons (CDS+UTR, non-coding) as simply "exon" lines.

I am not sure whether UTRs are listed at all in this file, which would create problems for quantification.

If you cannot find a standard annotations file, I guess you could just rename all these lines as "exon".

Hopefully, after this change the problem #3 will disappear. Please send me the Log.out file of the genome generation step.

>>>I have found out that reads are matching to a different strain of S. cereviseae than I am using. Maybe this is a reason?

Please try to generate the genome for that other strain and remap these reads - if they map there, than it means that some genomic variants prevent them from mapping to the first strain.

Cheers

Alex

Grant

unread,

Jan 13, 2017, 6:30:54 AM1/13/17

to rna-star

Dear Alex,

Thanks a lot for clarifications.

I have found another gtf file from ensemble which seems to be fine.
Here is the link to it ftp://ftp.ensembl.org/pub/release-87/gtf/saccharomyces_cerevisiae/

I have generated genome again and did the mapping (using gtf on mapping step).
The only thing I am still suspicious about is the file for RSEM, because when I do samtools flagstat on genome bam file and transcriptome bam files, there are ca 38 and 30 mln reads respectively, so I am missing 8 mln reads in transcriptome file.

Is this normal or I do you think something goes wrong again?

Log of genome generation and log after mapping are attached (rep1).

Thanks in advance for your time,
Grant

Log.out

Log_rep1.out

Alexander Dobin

unread,

Jan 13, 2017, 4:53:17 PM1/13/17

to rna-star

Hi Grant,

From the Log.out file, STAR found in the GTF file

7218 transcripts

7645 exons (non-collapsed)

411 collapsed junctions

Please check that these numbers are reasonable.

The missed reads in the transcriptome file may

1. Map to the intergenic space, i.e. do not overlap gene. This is quite common in mammals, but should be less so in yeast, I guess. You can check this by calculating the number of reads that overlap genes with --quantMode TranscriptomeSAM GeneCounts

2. Reads that have soft-clipping at the ends or (less likely) indels. By default, the soft-clipping is prohibited in the Transcriptome output to conform with RSEM requirements. You can allow it with

--quantTranscriptomeBan Singleend option and see the change in the number of transcriptomic reads.

Cheers

Alex

Grant

unread,

Jan 18, 2017, 11:29:45 AM1/18/17

to rna-star

Dear Alex,

Sorry for replying late.

Yes, the numbers seem to be fine.

1. I have done you suggestion and obtained number ca 15.7 mln counts (excluding multimaps, unmaped, etc) in the 4th column. In total there are 35 mln reads, and if GeneCounts makes the counting as htseq, 15.7 mln seems very close to truth.
2. This option did not change the results much, still missing 6 mln reads.

Thanks,
Hrant

Alexander Dobin

unread,

Jan 18, 2017, 4:22:44 PM1/18/17

to rna-star

Hi Grant,

for paired-end reads, 15.7M reads will corresponds to 31.4M lines in the BAM file, which is close to the 30M lines counted by flagstat in the transcriptome BAM.

It seems that the next possible explanation is that you have highly expressed unannotated ("intergenic") loci. In mammals this is often caused by unannotated rRNA loci, especially in the total (not A+ selected) RNA samples. To hunt for these loci, you could generate the signal (wiggle bedGraph) files (STAR has an option to do that, or you can use bedtools), and sum the signal over small bins in the intergenic space.

Cheers

Alex

Grant

unread,

Jan 25, 2017, 8:53:00 AM1/25/17

to rna-star

Dear Alex,

Thanks a lot for your very helpful suggestions!
And let me please ask one last question.
I am analyzing now a different yeast (S. uvarum) which has a quite weird GFF file. Available through this link http://www.saccharomycessensustricto.org/current/Sbay/Sbay.gff

I have tried --sjdbGTFfeatureExon CDS --sjdbGTFtagExonParentTranscript Parent
but getting an error "Fatal INPUT FILE error, no exon lines in the GTF file:"

What I have done is converted it to gtf by gffread and replaced all CDS in third column by exons. Do you think it is a legitimate approach or are there any others?

Thank you in advance,
Grant

Alexander Dobin

unread,

Jan 27, 2017, 4:15:38 PM1/27/17

to rna-star

Hi Grant,

I think the problem with the original file is that it has the 2nd field made of multiple words separated by spaces (Scannell_and Zill 2011),

so STAR counts the fields wrongly. How does it look in the converted GTF file? If you can replace the 2nd field with any single word, it should work fine.

Converting to GTF is the best option to deal with GFF files at the moment.

Cheers

Alex

Grant

unread,

Feb 22, 2017, 6:42:16 AM2/22/17

to rna-star

Dear Alex,

Sorry for replying so late!
Thank you a lot, you are very helpful!!

Reply all

Reply to author

Forward