STAR +RSEM

Alexander Czerny

unread,

Aug 20, 2013, 10:18:50 AM8/20/13

to rsem-...@googlegroups.com

HI folks,

iam actually doing some inhouse comparison on data and therefore this weird question. I use STAR to map data to a transcriptome made by rsem. Afterwards i want to use RSEm -calculate-expression with the --sam Option.

And he is complaining that he cant use gapped alignments. Then i filter just reads which are matching completly. So far so good. That works.
My question is: what is gapeless for RSEM:

just matches (M) ?
also missmatches (M+N)?
ever more before he complains ?
because i want to take as much reads as possible not just the perfect matched. according to cigar string (\*|([0-9]+[MIDNSHPX=])+) in sam files there are much more possiblitys.

If i can just take matched (M) reads i have to life with but i know it for sure ^^.

Thx in advance, Alex.

bli

unread,

Aug 21, 2013, 3:13:56 AM8/21/13

to rsem-...@googlegroups.com

Hi Alex,

Thanks for your interest in using RSEM! I'm very happy to answer your
question.

Currently RSEM only allows alignment match (M, =, or X) appear in the
cigar string. I think that N stands for skipped region. Since for
Illumina sequencer insertion/deletion errors rarely happen, allowing
alignment match only in theory will capture most of alignments and
therefore good enough for providing a robust and accuracy overall
expression estimation.

Hope it helps,
Bo

On 2013-08-20 07:18, Alexander Czerny wrote:
> HI folks,
>
> iam actually doing some inhouse comparison on data and therefore this
> weird question. I use STAR to map data to a transcriptome made by
> rsem. Afterwards i want to use RSEm -calculate-expression with the
> --sam Option.
>
> And he is complaining that he cant use gapped alignments. Then i
> filter just reads which are matching completly. So far so good. That
> works.

> MY QUESTION IS: WHAT IS GAPELESS FOR RSEM:
>
> * just matches (M) ?
> * also missmatches (M+N)?
>
> * ever more before he complains ?
> * because i want to take as much reads as possible not just the
> perfect matched. according to cigar string (*|([0-9]+[MIDNSHPX=])+) in

> sam files there are much more possiblitys.
>
> If i can just take matched (M) reads i have to life with but i know it
> for sure ^^.
>
> Thx in advance, Alex.
>

> --
> RSEM website: http://deweylab.biostat.wisc.edu/rsem/ [1]
> ---
> You received this message because you are subscribed to the Google
> Groups "RSEM Users" group.
> To unsubscribe from this group and stop receiving emails from it,
> send an email to rsem-users+...@googlegroups.com.
> To post to this group, send email to rsem-...@googlegroups.com.
> Visit this group at http://groups.google.com/group/rsem-users [2].
>
>
> Links:
> ------
> [1] http://deweylab.biostat.wisc.edu/rsem/
> [2] http://groups.google.com/group/rsem-users

Alexander Czerny

unread,

Nov 12, 2013, 7:18:57 AM11/12/13

to rsem-...@googlegroups.com

As i got an answer from Alex Dobin for converting his STAR sam-output into rsem readable format i want to share it with the RSEM users:

here is what you can do to make STAR work with RSEM;
1. Run STAR with --alignIntronMax 1 --alignIntronMin 2 --scoreDelOpen -10000 --scoreInsOpen -10000
This should prevent junctions and indels - however, it will still contain soft-clipping
2. Run this awk script to remove soft-clipping:
awk 'BEGIN {OFS="\t"} {split($6,C,/[0-9]*/); split($6,L,/[SMDIN]/); if (C[2]=="S") {$10=substr($10,L[1]+1); $11=substr($11,L[1]+1)}; if (C[length(C)]=="S") {L1=length($10)-L[length(L)-1]; $10=substr($10,1,L1); $11=substr($11,1,L1); }; gsub(/[0-9]*S/,"",$6); print}' Aligned.out.sam > Aligned.noS.sam

Hopefully the resulting file will be compatible with RSEM.
Please share your experience if you try it out
Cheers
Alex

Shawn Driscoll

unread,

Nov 24, 2013, 12:18:32 AM11/24/13

to rsem-...@googlegroups.com

Does this alignment mode report all alignments for reads? I thought there was at least one other option that needs to be set so STAR will do that.

Alexander Czerny

unread,

Nov 26, 2013, 7:23:11 AM11/26/13

to rsem-...@googlegroups.com

As stated in the Manual of STAR there is an option to increase the number of multimapped reads to be outputed, everything higher then this is stated as "to many" in the final log, i think:

outFilterMultimapNmax
10
int: read alignments will be output only if the read maps fewer than this
value, otherwise no alignments will be output

Alexander Czerny

unread,

Jan 6, 2014, 6:29:28 AM1/6/14

to rsem-...@googlegroups.com

Greetings all,

since RSEM does not allow for different readlength at multiple mapped loci from one read, u cant apply the awk command. Alex Dobin implemented an endtoend-alignment in his new version for STAR. I am right now testing it and let you know how it works.

The additional option is: -alignEndsType EndToEnd , and should therefore not include any "S" in the cigarstring u still need to forbid any indels or splicing of course.

Message has been deleted

Alexander Czerny

unread,

Jan 22, 2014, 3:40:39 AM1/22/14

to rsem-...@googlegroups.com

I tried the new "EndToEnd" option and it reported no soft-clipping anymore and RSEM went through nicely.

pbczyd

unread,

Mar 14, 2014, 10:01:49 AM3/14/14

to rsem-...@googlegroups.com

Dear all,

since some of you have already successfully managed to use STAR + RSEM, I want to know whether you compared the results between <STAR + RSEM> and <Bowtie +RSEM>.

Did you get similar results from rsem-calculate-expression after STAR/Bowtie ?

Regards,

Sheng

Alexander Predeus

unread,

Jul 26, 2014, 1:06:20 PM7/26/14

to rsem-...@googlegroups.com

Hello all,

I'm also experimenting with RNA-seq quantification using STAR alignment to transcriptome and following RSEM quantification.

I'm seeing some strange behaviour with the flag corresponding to strand specificity (--forward-prob), would anybody be able to explain this?

So I've aligned left and right reads of a strand-specific RNA-Seq paired-end experiment separately (as if they are each a separate strand-specific single-end experiment).

For R2, which is the read that goes according to the strand (--stranded=yes option in HTSeq for example), the following happens:
1) if you use --forward-prob 1.0, the reads are counted (correctly, as far as I can tell)
2) if you use --forward-prob 0.5, the reads are counted and the sum is slightly higher (which makes sense, see flag distribution below)
3) if you use --forward-prob 0.0, most counts are 0.

However, for R1, which aligns to the opposite strand (--stranded=reverse in HTSeq), all three above-mentioned possibilities produce ~ 0 counts.

I've looked at the actual SAM/BAM files, they both (R1 and R2) have flags 256 and 272 in 2nd field, with R2 mostly having 256 (99%+), and R1 is almost all 272.

The actual command I'm using is as follows:

rsem-calculate-expression -p 8 --bam --no-bam-output --forward-prob 1 --estimate-rspd <BAM> <REFERENCE> <TAG>

and here's the command I've used to prepare the reference:

rsem-prepare-reference -gtf ../Gencode/mouse_v2/gencode.vM2.all_exon.mm10G.gtf ../Gencode/mouse_v2/GRCm38.p2.genome.fa rsem_mm10G_gencode_vM2

(I took exon-only portion of gencode vM2 GTF file, because there's some confusion with transcript_id and gene_id in other entry types)

Thank you in advance!

Alexander Predeus

unread,

Jul 26, 2014, 2:13:11 PM7/26/14

to rsem-...@googlegroups.com

To be more precise, here are some stats from 6 runs (--forward-prob of 0.0, 0.5, and 1.0 for R1 and R2):

run                                           #of non-zero read counts average read count             stdev read count
test_p0.5_R1.genes.results                     1938                                                  8.97149                       44.0204
test_p0.5_R2.genes.results                   11552                                                58.0327                        311.64
test_p0_R1.genes.results 3                                         4.66667 5.18545
test_p0_R2.genes.results                               2                                            1.5                                             0.5
test_p1_R1.genes.results                        1938                                                     8.97033                                 44.0334
test_p1_R2.genes.results                      11565       57.9754    311.48

Here's what the typical BAM file looks like:

C17VNACXX121020:6:1101:10000:94393    272    ENSMUST00000001008.5    691    255    25M    *    0    0    AAGTCATGAACTGACAAATGTACAA    JJJJJJJIJHJJHHHHHFFFFFCCC    NH:i:1    HI:i:1
C17VNACXX121020:6:1101:10001:78337    272    ENSMUST00000149940.1    869    255    25M    *    0    0    CTGAACAGGTTGGCCCAGGTCTGCA    JJIJJJJJJJJJHHHHHFFFFFCCC    NH:i:1    HI:i:1
C17VNACXX121020:6:1101:10003:8152    272    ENSMUST00000038038.7    425    255    25M    *    0    0    CCTGAATTCGGGAGGTGGAGCAATC    JJJJJJJGGJJIHGHHHFFFFFB@C    NH:i:1    HI:i:1
C17VNACXX121020:6:1101:10004:46606    272    ENSMUST00000126101.1    325    0    25M    *    0    0    GTGCAGTGCTGGCTGTGCATCTCAA    JIIJJJJJJJJJHHHHHFFFFFCCC    NH:i:9    HI:i:1

Alexander Predeus

unread,

Jul 28, 2014, 7:03:05 PM7/28/14

to rsem-...@googlegroups.com

Hello again,

a little something to add:

In R2 alignment, most reads have aligned with the flag 256, with a small fraction being 272.

In R1, the picture is exactly the opposite.

When I switch flags (256/272) manually in aligned R1 file, the counts are calculated properly (~ 12k expressed genes, with average and stdev similar to R2).