STAR input to RSEM

1,691 views
Skip to first unread message

manu tiwari

unread,
Feb 25, 2014, 4:56:14 AM2/25/14
to rna-...@googlegroups.com
Hi

I am using STAR to alingn my paired-end reads and use the resulting sam files as input to RSEM.
However, I find the sam files generated by STAR to be incompatible with RSEM.
Based on the RSEM group discussions, I ran my STAR with the following options:

--alignIntronMax 1 --alignIntronMin 2  --scoreDelOpen -10000  --scoreInsOpen -10000 --alignEndsType EndToEnd

(I used the patched version of STAR for running the same- STAR2.3.1).

This because RSEM requires input that has been mapped to the transcriptome instead of the genome.

However, I still cannot get RSEM to accept these files as the input. It says:
.sam -t 3 -tag XM
[samopen] SAM header is present: 13 sequences.
Number of transcripts does not match! Please check if the reads are aligned to a transcript set (instead of a genome)!

I checked my sam files using the RSEM-validator, and its output says that the sam files are verified.

On the other hand, If I give the transcripts.fa file (generated by rsem-prepare-reference) as input to STAR for mapping, it does not accept this file.

Can someone kindly suggest a solution.

Thanks.

Best
M

Alexander Dobin

unread,
Feb 25, 2014, 4:04:21 PM2/25/14
to rna-...@googlegroups.com
Hi Manu,

how did you generate the genome for this run? Since RSEM needs alignments to transcriptome, your "genome" should be a fasta file with sequences of all the transcripts.
I believe the transcripts.fa file (generated by rsem-prepare-reference) is exactly the file you need - feed it to STAR for genome generation with 
--runMode genomeGenerate  --genomeFastaFiles  transcripts.fa 
and then use this genomeDir for mapping.

Cheers
Alex

manu tiwari

unread,
Feb 26, 2014, 7:58:34 AM2/26/14
to rna-...@googlegroups.com
Hi Alex

Many thanks for your reply!

This is actually what I did initially. But giving the transcripts.fa (from rsem-prepare-reference) as input to STAR gives me an error:

 ..... Started STAR run
 ... Starting to generate Genome files
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)


Secondly, sometimes when I give the (regular) genome.fa file as the input to STAR, it hangs my PC.

Also, if I give the sam file (after STAR mapping) as input to htseq-count, it again shows an error:

Error occured when reading first line of sam file.
  ("SAM optional field with illegal type letter ':'", 'line 15 of file Bam1Aligned.out.sam')
  [Exception type: ValueError, raised in _HTSeq.pyx:1180]

I did not index or process the (raw) sam files generated by STAR, and gave them directly as input to htseq-count.
Is this because while mapping I gave --samAttributes All as an option?
Do I need to index/sort these sam files?

I have an 8 GB RAM on a 4-core machine running Ubuntu 13.10.

Any suggestions are welcome.

Thanks again.

Best
Manu

manu tiwari

unread,
Feb 28, 2014, 11:50:52 AM2/28/14
to rna-...@googlegroups.com
Hi Alex

The sam file (first 100 lines) is attached.

I have tried generating .transcripts.fa file using rsem-preprare-reference (RSEM) using the following pairs of files:
1. Ensembl genome.fa and Ensembl gtf (ERROR generating the file)
2. UCSC genome.fa and UCSC gtf (ERROR generating the file)
3. Ensembl Individual chromosomes.fa and Ensembl gtf (ERROR generating the file)
4. UCSC Individual chromosomes.fa and UCSC gtf (transcripts.fa IS generated).

The error in 1, 2 , and 3 corresponds to some transcripts not found. If I delete those from gtf (so as to match everything chromosome-wise in the .fa and gtf files), then also it does not work.

If I give the transcripts.fa file generated in 4 as input to STAR using genomeGenerate, it hangs my system and I have to reboot (RAM usage and SWAP immediately shoot to 100% as soon as STAR starts).
Interestingly, upon looking at the log.out file, the process always stops at a particular stage ( I ran it with STAR 2.3.0 and 2.3.1, and compared the log files):
.................transcripts.fa : chr # 31425  "NM_140963" chrStart: 8237875200

Can you suggest something?
Thanks for all your help.
Best
Manu




On Thursday, February 27, 2014 6:19:56 PM UTC+1, Alexander Dobin wrote:
Hi Manu,

you have to be careful about RAM usage, since STAR requires ~10*GenomeSize bytes of RAM for both genome generation and mapping. For instance, the full human genome will require~30GB of RAM. There is an option to reduce it to 16GB, but it will not work with 8GB of RAM.

The transcriptome size is much smaller, and 8GB should be sufficient. At the genome generation step, please try --limitGenomeGenerateRAM 7000000000.

HTseq should work fine with STAR .sam file. In principle, HTseq authors require that the file is sorted by read name, however, I think STAR output is OK even without sorting.
For the HTseq error, please send the first 100 lines of your SAM file.

Cheers
Alex
Sample1Sam.txt

manu tiwari

unread,
Feb 28, 2014, 12:10:33 PM2/28/14
to rna-...@googlegroups.com
Hi Alex
Just to add to your suggestion for RAM
It get the following error:


/home/Programs/STAR_2.3.0e.Linux_x86_64/STAR --runMode genomeGenerate --genomeDir /STAR_cDNA/ --genomeFastaFiles /STAR_cDNA/sample.transcripts.fa --limitGenomeGenerateRAM 7000000000 --runThreadN 8

..... Started STAR run
... Starting to generate Genome files

EXITING because of FATAL PARAMETER ERROR: limitGenomeGenerateRAM=6is too small for your genome
SOLUTION: please specify limitGenomeGenerateRAM not less than31 GB and make that much RAM available

Thanks.
Best
Manu

Alexander Dobin

unread,
Feb 28, 2014, 3:57:15 PM2/28/14
to rna-...@googlegroups.com
Hi Manu,

1. SAM input to HTseq:
I believe this is caused by the jM:B:c, jI:B:i, SAM attributes, please see this post. The solution is not to use --outSAMattributes All options.
With the latest STAR patch, you can specify which attributes you need in the SAM output:
--outSAMattributes      a string of desired SAM attributes, in the order desired for the output SAM
                                NH HI AS nM NM MD jM jI XS
                                Standard    : NH HI AS nM
                                All         : NH HI AS nM NM MD jM jI
                                None

2. I am not sure why RSEM fails to generate the transcript sequences, it's better to ask about it on the RSEM mailing list. I have used it once with ENSEMBL, like this:
rsem-prepare-reference --no-polyA --no-bowtie --no-ntog --gtf Homo_sapiens.GRCh37.66.chrOnly.gtf fullGenome.fa RSEMtr,
and it worked fine.

What is the size of the UCSC transcriptome file? It should be ~300MB.
Please try this option: --genomeChrBinNbits 14 (if it still does not work, go even lower to 13 or 12). Keep --limitGenomeGenerateRAM 7000000000, as it prevents you machine from swapping.

Cheers
Alex

manu tiwari

unread,
Mar 5, 2014, 12:12:48 PM3/5/14
to rna-...@googlegroups.com
Hi Alex

Finally, I have successfully mapped using STAR and used the output as input to RSEM.
However, I would like to point out few observations:

1. Genome.fa vs transcripts.fa
There appears to be distinction in the way STAR handles the genome.fa file (from ENSEMBL or UCSC) and the transcripts.fa file (generated by rsem-prepare-reference) during generation of GenomeDir.
With genome.fa (and a corresponding annotation (gtf) file), the process goes on smoothly with default parameters.
With cDNA.fa (without any annotation (gtf) file), the process goes on smoothly with default parameters.
With transcripts.fa (generated by rsem-prepare-reference), the default parameters cause an immediate spike in RAM usage and it hangs my system.
The transcripts.fa file only works when I specify --genomeChrBinNbits 14, IRRESPECTIVE of whether or not I use the --limitGenomeGenerateRAM option.

2. Log.out file
It appears that I always get similar (default) log.out files whether I change some parameters or not.
For instance, running with the default options results in:
 
##### DEFAULT parameters:
versionSTAR                       20201
versionGenome                     20101   20200  
parametersFiles                   -  
runMode                           alignReads
runThreadN                        1
genomeDir                         ./GenomeDir/
genomeLoad                        NoSharedMemory
genomeFastaFiles                  -  
genomeSAindexNbases               14
genomeChrBinNbits                 18
genomeSAsparseD                   1
readFilesIn                       Read1   Read2  
readFilesCommand                  -  
readMatesLengthsIn                NotEqual
limitGenomeGenerateRAM            31000000000
.....
I get the same parameters in my log.out file when I change genomeChrBinNbits or limitGenomeGenerateRAM.

3. limitGenomeGenerateRAM
During generation of genomeDir, the --limitGenomeGenerateRAM option does not appear to have any effect on my RAM or SWAP usage.
Even if I limit it to 7000000000 (as you suggested), the RAM and SWAP immediately spike to 100%. This happens only with the transcripts.fa file and not the genome.fa or cDNA.fa files.

As I mentioned, the thing that worked for me with transcripts.fa file was limiting genomeChrBinNbits to 14.

I thank you for all your help and suggestions.

Best
Manu

Alexander Dobin

unread,
Mar 6, 2014, 12:32:43 PM3/6/14
to rna-...@googlegroups.com
Hi Manu,

your observations are correct. 
STAR default parameter are optimized for mapping to whole genome with annotations, not to transcriptomes.
--genomeChrBinNbits  and  --limitGenomeGenerateRAM  control two different aspects RAM usage. The former one defines the size of the "padding" (i.e. minimum size) for "chromosomes". Since you have a large number of "chromosomes" ( whihc are transcripts in your case), you need to reduce it, which reduces significantly the size of the "Genome" file in the genome directory. On the other hand,  --limitGenomeGenerateRAM  affects the RAM used for suffix array generation. The max amount of memory for generating suffix array is 16*genomeSize bytes (it's much bigger than the final size of ~8*genomeSize), however, it can be done in chunks limiting the memory by  --limitGenomeGenerateRAM. In your case the size of the genome is probably ~300Mb, so the max RAM required was less than available 8GB. However, generally, it is advisable to set   --limitGenomeGenerateRAM to the max RAM available.

2. Log.out file first lists just the DEFAULT parameters, however, if you scroll down, you will see "##### Final parameters after user input" - these will be modified according to your command line.

Cheers
Alex

Alexander Dobin

unread,
Feb 27, 2014, 12:19:56 PM2/27/14
to rna-...@googlegroups.com
Hi Manu,

you have to be careful about RAM usage, since STAR requires ~10*GenomeSize bytes of RAM for both genome generation and mapping. For instance, the full human genome will require~30GB of RAM. There is an option to reduce it to 16GB, but it will not work with 8GB of RAM.

The transcriptome size is much smaller, and 8GB should be sufficient. At the genome generation step, please try --limitGenomeGenerateRAM 7000000000.

HTseq should work fine with STAR .sam file. In principle, HTseq authors require that the file is sorted by read name, however, I think STAR output is OK even without sorting.
For the HTseq error, please send the first 100 lines of your SAM file.

Cheers
Alex
Reply all
Reply to author
Forward
0 new messages