low alignment rate

3,116 views
Skip to first unread message

RK

unread,
Nov 11, 2012, 8:07:36 AM11/11/12
to rsem-...@googlegroups.com
Hi Bo,

I set up RSEM run for 20M paired-end Illumina data from human sample and got the following numbers which seem very low. I used masked chromosome files from UCSC downloads and  GTF for hg19 build for preparing RSEM reference. I got the GTF using the instructions provided at UCSC wiki: http://genomewiki.ucsc.edu/index.php/Genes_in_gtf_or_gff_format

Could you please give some insights on possible reasons as to why this is so low? I would appreciate your input. The alignment rate is ~80% when I run bowtie outside RSEM.

# reads processed: 20739697
# reads with at least one reported alignment: 4279601 (20.63%)
# reads that failed to align: 16460095 (79.37%)
# reads with alignments suppressed due to -m: 1 (0.00%)
Reported 12607311 paired-end alignments to 1 output stream(s)

Thanks,
Ram

b...@cs.wisc.edu

unread,
Nov 11, 2012, 1:20:11 PM11/11/12
to rsem-...@googlegroups.com
Hi Ram,

Two questions. 1) Are chromosome files hard masked or soft masked? 2) Do
you align to the transcript sequences or the genome?

Best,
Bo

RK

unread,
Nov 11, 2012, 5:12:51 PM11/11/12
to rsem-...@googlegroups.com, b...@cs.wisc.edu
Hi Bo,

I used chromFa.tar.gz from UCSC downloads where the repeats are indicated in lower case - so, soft-masked I suppose. I ran rsem-calculate-expression after running rsem-prepare-reference by supplying the GTF. So, I guess I aligned to the transcript sequences.

I have run RSEM for a different genome with similar set up and reported 80% reads with at least one alignment.

Thanks.

Erik Aronesty

unread,
Nov 11, 2012, 7:30:15 PM11/11/12
to rsem-...@googlegroups.com, b...@cs.wisc.edu
Yep, 80% is normal for bowtie/rsem on human.   Did you try pasting a sampling of 20 or so unaligned reads into ncbi blast?   Very often this can reveal issues.   The program "fastq-stats" provides a terse summary that can reveal most problems with RNA seq data... or at least rule out a gross/major problem with the reads (http://code.google.com/p/ea-utils/).

RK

unread,
Nov 11, 2012, 9:42:45 PM11/11/12
to rsem-...@googlegroups.com, b...@cs.wisc.edu
Actually, I get only 20% of the reads aligned when I set up a RSEM run (please see below). I ran fastqc on the paired-end reads and it looks good. I ran bowtie1 outside of RSEM and got 80% of the reads with at least one reported alignment.

My question why am I getting a poor alignment rate when RSEM uses bowtie1?

Shawn Driscoll

unread,
Nov 12, 2012, 12:33:28 AM11/12/12
to rsem-...@googlegroups.com, b...@cs.wisc.edu
it really doesn't make sense. during the RSEM run do you see that percentage statistic as it is echoing out info during the run or are you finding that percentage yourself at the end of the run by checking the BAM file? you can check out the bam file generated from their bowtie call if you cancel the process when it starts the EM stage. it'll be in the temp folder. maybe you can see what's going on in there and compare it to the bam you generate by aligning yourself.

RK

unread,
Nov 12, 2012, 10:15:55 AM11/12/12
to rsem-...@googlegroups.com, b...@cs.wisc.edu
Hi Bo,

So, here is the details:

First when I ran rsem with the inbuilt bowtie aligner, it only used s very small fraction of reads. This was in the stdout file:


# reads processed: 20739697
# reads with at least one reported alignment: 4279601 (20.63%)
# reads that failed to align: 16460095 (79.37%)
# reads with alignments suppressed due to -m: 1 (0.00%)
Reported 12607311 paired-end alignments to 1 output stream(s)


Secondly, I used reference_name.idx.fa to build indices with bowtie2 followed by alignment using bowtie2 using those indices. I used '--gbar 100', ie., disallow gaps within 100 positions of the beginning or end of the read (100 was read length). I also set 10000 as gap opening and extension penalty so make sure that I dont get gapped alignment. I also diasllowed singleton alignments. Paired-end reads were alignment only as a pair. Inspite of that when I say:

$ ~/Software/rsem-1.2.0/rsem-sam-validator  bowtie2_noGap_noMixed.SAM 
[samopen] SAM header is present: 80922 sequences.
.
Only find one mate for paired-end read SN603_WBP007_5_1101_67.10_93.20_1!
The input file is not valid!

When I say

$~/Software/rsem-1.2.0/convert-sam-for-rsem bowtie2_noGap_noMixed.SAM rsem_compatible.SAM
it gives the error:
Number of first and second mates in read SN603_WBP007_5_1101_1000.00_1494.70_1's full alignments (both mates are aligned) are not matched!
"/home/rk/Software/rsem-1.2.0/rsem-scan-for-paired-end-reads rsem_compatible.SAM.bam.temp/tmp.sam rsem_compatible.SAM.bam" failed! Plase check if you provide correct parameters/options for the pipeline!

Then I used bowtie1 for this whole thing: used bowtie1 to build indices using reference_name.idx.fa and used bowtie1 to align using the same settings as rsem:
 bowtie -q --phred33-quals -n 2 -e 99999999 -l 25 -I 1 -X 1000  -a -m 200 /home/rk/data/hg19/rsem_ref_bowtie1/hg19_rsem_ref_bowtie1  -1 ../../T1_S1L5_R1_end1.fq -2 ../../T1_S1L5_R1_end3.fq -p 10  -S bowtie1.SAM 2> T1_R1_bowtie1.log &

and when I checked it, it gave the error:

$ ~/Software/rsem-1.2.0/rsem-sam-validator bowtie1.SAM 
[samopen] SAM header is present: 80922 sequences.
.
Only find one mate for paired-end read SN603_WBP007_5_1101_60.90_98.90_1!
The input file is not valid!

conversion did not work either.

$  ~/Software/rsem-1.2.0/convert-sam-for-rsem bowtie1.SAM compatible.SAM
grep ^@ bowtie1.SAM > compatible.SAM.bam.temp/tmp.sam

grep ^[^@] bowtie1.SAM | sort -k 1,1 -s >> compatible.SAM.bam.temp/tmp.sam

~/Software/rsem-1.2.0/rsem-scan-for-paired-end-reads compatible.SAM.bam.temp/tmp.sam compatible.SAM.bam
[samopen] SAM header is present: 80922 sequences.
.
Number of first and second mates in read SN603_WBP007_5_1101_1000.00_1494.70_1's full alignments (both mates are aligned) are not matched!
"~/Software/rsem-1.2.0/rsem-scan-for-paired-end-reads compatible.SAM.bam.temp/tmp.sam compatible.SAM.bam" failed! Plase check if you provide correct parameters/options for the pipeline!

Could you tell me which chromosome files and GTFs to use? Thanks.

On Sunday, November 11, 2012 1:20:13 PM UTC-5, b...@cs.wisc.edu wrote:

Erik Aronesty

unread,
Nov 12, 2012, 11:17:30 AM11/12/12
to rsem-...@googlegroups.com
Again, if you're getting 80% alignment to the genome ... that's different than 80% alignment to the transcriptome (which is what rsem is doing).   Can you paste a random sample of 40 or so unaligned reads to the forum?   That will make it easy to diagnose.

RK

unread,
Nov 13, 2012, 10:18:31 AM11/13/12
to rsem-...@googlegroups.com
Hi,

I have attached 100 unaligned reads from fastqs 1 and 2. I would appreciate some help in troubleshooting this. I ran again on a different set of fastq files and got similar results - of the 20M paired-end reads, only 20% of reads had a reported alignment. 

For human, could anyone tell me which chromosome and GTF files they normally use for hg19? I used the soft-masked chromosome fasta files and the GTF I got from using GenePredToGtf utility and the hg19 database.

Thanks.
unalreads.zip

Erik Aronesty

unread,
Nov 13, 2012, 12:25:26 PM11/13/12
to rsem-...@googlegroups.com
Summary:

- I reproduced your results... no alignment to transcriptome, some alignment to genome
- Assuming this was NOT poly-A selected, and was treated, instead with ribo-zero or some similar kit
- There is a large amount of mitochondrial DNA, which is supposed to be selected out by the newer Ribo-Zero kits, but was not in the older kits.
- Normal alignment numbers with Ribo-zero kits is about 40%.   With older kits it could be lower.
- Some of these reads look like DNA, not RNA.   IE: They align uniquely to intronic and extragenic regions.     (Example: TAGGCAGGAGTTAGCCTGATAGGGAGGAAAAGGCTAACTGGCATTATGGGCACCACATGGGCAAAGCTTTAGAGGCAGGCAAAGATAATGGCACAGTAA)  .   Perhaps an insufficient DNAse waiting time, or higher levels DNA contamination than expected.

Also, from the fastq-stats:

Typical RNA profile has a signature at the front due to random hexamer bias::
Inline image 2

Your RNA profile (although I understand this was a very small sampling, and was severely skewed because of that) doesn't exhibit this:

Inline image 1
image.png
image.png

Lakshmanan Iyer

unread,
Nov 13, 2012, 12:31:53 PM11/13/12
to rsem-...@googlegroups.com
Disclaimer: Have not read the thread but commenting after looking at the plots:
Did you check for adaptor contamination. You will have a lot of it if you did not choose the appropriate fragment size for sequencing.
-best
-Lax
image.png
image.png

Erik Aronesty

unread,
Nov 13, 2012, 12:43:08 PM11/13/12
to rsem-...@googlegroups.com
Your right... 


After clipping adaptors, you get an extra 4% rna alignment, and 16% of the unaligned reads were discarded after running fastq-mcf with default params:

Adapter TruSeq_Universal_Adapter_rc (AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT): counted 27 at the 'end' of 'T1_un_2.fq', clip set to 1
Adapter TruSeq_Index_Adapter_5p (GATCGGAAGAGCACACGTCTGAACTCCAGTCAC): counted 0 at the 'start' of 'T1_un_1.fq', clip set to 1
Adapter TruSeq_Index_Adapter_3p (ATCTCGTATGCCGTCTTCTGCTTG): counted 5 at the 'end' of 'T1_un_1.fq', clip set to 3
Adapter ILMN RT_primer_rc (TCGTATGCCGTCTTCTGCTTG): counted 4 at the 'end' of 'T1_un_1.fq', clip set to 4


And that  gets you up to 40% transcriptome alignment.

Assuming this is ribo-zero, you're good...
image.png
image.png

RK

unread,
Nov 13, 2012, 1:25:55 PM11/13/12
to rsem-...@googlegroups.com
Hi Erik and Lax,

Thank you for your suggestions.

I'll remove adapters and clean the fastqs before running RSEM. Did you use fastx-toolkit for clipping adapters? 

Erik - how do you calculate what fraction of reads have mitochondrial origin and how do you quantify DNA contamination. Also, when you say normal alignment numbers with Ribo-zero kits is about 40%, do you mean transcriptome alignment?

Thanks guys!

Erik Aronesty

unread,
Nov 13, 2012, 2:22:51 PM11/13/12
to rsem-...@googlegroups.com
For clipping, I used fastq-mcf in auto-tuning mode "-l"  .... the list of adapters I used was pasted in the output

I never calculated the exact percentage of mitochondrial RNA-seeming reads.   But when I pasted the unaligned reads into BLAST, I got a lot of mitochondrial... that's pretty indicative of a non-poly-A situation.

RK

unread,
Nov 13, 2012, 2:51:06 PM11/13/12
to rsem-...@googlegroups.com

Thanks Erik!

Will keep posted on what numbers I get after removing adapters.

Lakshmanan Iyer

unread,
Nov 13, 2012, 4:07:24 PM11/13/12
to rsem-...@googlegroups.com
That was an easy one !
So remember, this means that on filtering adapter contamination you are going to have reads of variable length!
-best
-Lax

b...@cs.wisc.edu

unread,
Nov 13, 2012, 6:05:58 PM11/13/12
to rsem-...@googlegroups.com
Hi Ram,

Sorry for late reply.

Can you show me the commands which gave you 80% alignments using bowtie1?
I cannot find it anywhere. If you use reference_name.idx.fa to build your
bowtie indices, there is no reason that you will get more alignments using
bowtie alone. RSEM also use 'reference_name.idx.fa' to build bowtie
indices.

I have looked at the 100 unaligned reads you sent and found the reason why
both rsem-sam-validator and convert-sam-for-rsem failed. In your two mate
files, the two mates of a same read have different file names. Therefore,
there is no way for RSEM to tell the two mates are from a same read. To
avoid this, you can either change the names the same or change the last
underscore '_' to slash '/'.

Best,
Bo

> Hi Bo,
>
> So, here is the details:
>
> First when I ran rsem with the inbuilt bowtie aligner, it only used s very
> small fraction of reads. This was in the stdout file:
>
> # reads processed: 20739697
> # reads with at least one reported alignment: 4279601 (20.63%)
> # reads that failed to align: 16460095 (79.37%)
> # reads with alignments suppressed due to -m: 1 (0.00%)
> Reported 12607311 paired-end alignments to 1 output stream(s)
>
>
> [samopen] SAM%

RK

unread,
Nov 14, 2012, 10:12:34 AM11/14/12
to rsem-...@googlegroups.com, b...@cs.wisc.edu
Hi Bo,

No worries. Thank you for replying.

The 80% was when using the genome. I got 20% when using the transcriptome indexes that RSEM created. I got very useful comments from Erik and Lax in the forum and working on fixing those now.

Reg, the format for read IDs, I did notice that the fastq I was using did not have a "/" before the pair indicator. Will change the "_" to "/" and try again. 

Do you know if there is any advantage in treating the two paired-end fastq files as single-end would increase read depth as bowtie1 would use otherwise discordantly mapping reads? 

Also, could you comment on using bowtie2 but obtaining ungapped alignment. Basically, I used '--gbar 100' option, ie., disallow gaps within 100 positions of the beginning or end of the read (100 was read length). I also set 10000 as gap opening and extension penalty so make sure that I don't get gapped alignment.

Thanks.

RK

unread,
Nov 14, 2012, 10:15:43 AM11/14/12
to rsem-...@googlegroups.com, liy...@tufts.edu
Thanks. Could you highlight the consequences of using reads with variable length? Did you mean syncing the two fastq files after removing adapters? Are there any bowtie1 params that I need to use?

RK

unread,
Nov 14, 2012, 5:01:32 PM11/14/12
to rsem-...@googlegroups.com
I ran RSEM again after removing adapters and still got only 20% of the reads with a reported alignment to the transcriptome. I used fastq-mcf and supplied the following adapters along with two other illumina PE adapters that the people who prepared the library gave me. I synced the two fastq files before running RSEM.

I also have 7 nt barcodes that was used for demultiplexing. Should I be using those as well as part of the list of the adaptors? I didn't given they are short and likely to trim reads with identical sequences that are not barcodes. I would appreciate any help on this.

Here's the output from fastq-mcf

Scale used: 2.2
Phred: 64
Threshold used: 251 out of 100000
Adapter Adapter_TruSeq_Universal_Adapter_rc (AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT): counted 22335 at the 'end' of 'T1_R1_r2.fq', clip set to 1
Adapter Adapter_TruSeq_Index_Adapter_5p (GATCGGAAGAGCACACGTCTGAACTCCAGTCAC): counted 16824 at the 'end' of 'T1_R1_r1.fq', clip set to 2
Adapter Adapter_TruSeq_Index_Adapter_3p (ATCTCGTATGCCGTCTTCTGCTTG): counted 1972 at the 'end' of 'T1_R1_r1.fq', clip set to 4
Adapter Adapter_ILMN_RT_primer_rc (TCGTATGCCGTCTTCTGCTTG): counted 1669 at the 'end' of 'T1_R1_r1.fq', clip set to 5
Files: 2
Total reads: 20739697
Too short after clip: 18666
Clipped 'end' reads (T1_R1_r1.fq): Count 7805422, Mean: 21.20, Sd: 16.56
Trimmed 606327 reads (T1_R1_r1.fq) by an average of 1.33 bases on quality < 7
Clipped 'end' reads (T1_R1_r2.fq): Count 11641059, Mean: 14.79, Sd: 16.96
Trimmed 382978 reads (T1_R1_r2.fq) by an average of 1.32 bases on quality < 7

Thanks.

Erik Aronesty

unread,
Nov 14, 2012, 5:06:55 PM11/14/12
to rsem-...@googlegroups.com
I'm assuming that your barcodes were removed during demultiplexing.... if that's not the case, you need to remove them.   And 20% is not unheard of.   If you have an FFPE sample with ribo-zero - for example... 

RK

unread,
Nov 16, 2012, 7:20:33 AM11/16/12
to rsem-...@googlegroups.com
Thanks Erik!

I have been getting a consistent ~20% on the samples I ran so far. I have a question on what actually constitutes transcriptome for a UCSC derived GTF.

I used the script read_distribution.py provided by the RSeQC suite and found that a significant number of reads map to introns and 3'UTRs. When RSEM calculates reference transcriptome, does it mask repetitive regions and obtain sequences for exons? There is no UTR annotation in the GTF that I am using and I am wondering how it excludes the UTRs. Does the masking account for this mostly? The number of reported alignment should be more than 50% if it uses UTRs. There are ~5K non-coding RNA in the current GTF and I am interested in getting accurate expression levels for those genes.

I would appreciate any response.

Thanks

Songhui Zhao

unread,
Nov 4, 2015, 9:05:33 AM11/4/15
to RSEM Users
Hi Ram,
Have you got the answer ? Now i have the same problem when i using rsem-calculate-expression.

Thanks
Songhui

在 2012年11月11日星期日 UTC+8下午9:07:36,RK写道:

Bo Li

unread,
Nov 5, 2015, 3:59:05 AM11/5/15
to rsem-...@googlegroups.com
Hi Songhui,

If you align your reads using Bowtie outside RSEM, do you have the same
low alignment rate?

Best,
Bo

On 2015-11-04 06:05, Songhui Zhao wrote:
> Hi Ram,
> Have you got the answer ? Now i have the same problem when i using
> rsem-calculate-expression.
>
> Thanks
> Songhui
>
> 在 2012年11月11日星期日 UTC+8下午9:07:36,RK写道:
>
>> Hi Bo,
>>
>> I set up RSEM run for 20M paired-end Illumina data from human sample
>> and got the following numbers which seem very low. I used masked
>> chromosome files from UCSC downloads and  GTF for hg19 build for
>> preparing RSEM reference. I got the GTF using the instructions
>> provided at UCSC
>>
> wiki: http://genomewiki.ucsc.edu/index.php/Genes_in_gtf_or_gff_format
>> [1]
>>
>> Could you please give some insights on possible reasons as to why
>> this is so low? I would appreciate your input. The alignment rate is
>> ~80% when I run bowtie outside RSEM.
>>
>> # reads processed: 20739697
>> # reads with at least one reported alignment: 4279601 (20.63%)
>> # reads that failed to align: 16460095 (79.37%)
>> # reads with alignments suppressed due to -m: 1 (0.00%)
>> Reported 12607311 paired-end alignments to 1 output stream(s)
>>
>> Thanks,
>> Ram
>
> --
> RSEM website: http://deweylab.biostat.wisc.edu/rsem/ [2]
> ---
> You received this message because you are subscribed to the Google
> Groups "RSEM Users" group.
> To unsubscribe from this group and stop receiving emails from it,
> send an email to rsem-users+...@googlegroups.com.
> To post to this group, send email to rsem-...@googlegroups.com.
> Visit this group at http://groups.google.com/group/rsem-users [3].
>
>
> Links:
> ------
> [1] http://genomewiki.ucsc.edu/index.php/Genes_in_gtf_or_gff_format
> [2] http://deweylab.biostat.wisc.edu/rsem/
> [3] http://groups.google.com/group/rsem-users

vomi...@umn.edu

unread,
Feb 2, 2017, 12:06:06 PM2/2/17
to RSEM Users
Hi Bo
I also have near the same problem as others. very low alignment (0.54%) rate using RSEM while got a good alignment (>90%) using Bowtie outside RSEM.
I was following the forum, but couldn't find the final conclusion what is contributing to this?
your input is highly appreciated.
Reply all
Reply to author
Forward
0 new messages