Strand-specific results from unstranded BAM

467 views
Skip to first unread message

Andrew Jaffe

unread,
Jun 22, 2015, 1:27:13 PM6/22/15
to sub...@googlegroups.com, Nikolay Ivanov
Hi, we encountered a somewhat strange output of featureCounts with regard to the strand-specific options. While the RNA-seq was generated with a strand specific protocol, a preliminary TopHat run had not indicated a strand-specific alignment, so I would assume that the strand information in the resulting BAM files is random. However, when featureCounts was run on Ensembl v75 using the Illumina stranded flag (`-s 2`) we actually get MORE reads assigned to genes than leaving out this flag. I would have assumed that >1/2 of the reads would not be assigned to genes, given the strand information of the reads is random, but not random in the annotation. So, for example, if there were 100 reads aligned to a gene, I would have expected ~50 reads on the plus and ~50 reads on the minus strands on average, and then only 50 would be assigned to a gene on, say the plus strand. Is the strand information ignored if the annotation is not ambiguous at a given locus?

### with strand information 
[ajaffe@compute-085 Fibroblasts]$ featureCounts -A /dcs01/lieber/ajaffe/Brain/DLPFC_PolyA/Counts/Ensembl/chrAliases_GRCh37_to_hg19.csv -a /dcs01/lieber/ajaffe/Brain/DLPFC_PolyA/Counts/Ensembl/Homo_sapiens.GRCh37.75.gtf -o tmp_revstrand.counts -s 2 -p TopHat/2328-Dura/accepted_hits.bam


//================================= Running ==================================\\
||                                                                            ||
|| 25 chromosome name aliases are loaded.                                     ||
|| Load annotation file /dcs01/lieber/ajaffe/Brain/DLPFC_PolyA/Counts/Ens ... ||
||    Number of features is 1306656                                           ||
||    Number of meta-features is 63677                                        ||
||    Number of chromosomes is 265                                            ||
||                                                                            ||
|| Process BAM file TopHat/2328-Dura/accepted_hits.bam...                     ||
||    Assign fragments (read pairs) to features...                            ||
||    Each fragment is counted once.                                          ||
||    Found reads that are not properly paired.                               ||
||    (missing mate or the mate is not the next read)                         ||
||    1830975 reads have missing mates.                                       ||
||    Input was converted to a format accepted by featureCounts.              ||
||    Total number of fragments is : 14393505                                 ||
||    Number of successfully assigned fragments is : 12708887 (88.3%)         ||
||    Running time : 4.71 minutes                                             ||
||                                                                            ||
||                         Read assignment finished.                          ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

### without strand info
[ajaffe@compute-085 Fibroblasts]$ featureCounts -A /dcs01/lieber/ajaffe/Brain/DLPFC_PolyA/Counts/Ensembl/chrAliases_GRCh37_to_hg19.csv -a /dcs01/lieber/ajaffe/Brain/DLPFC_PolyA/Counts/Ensembl/Homo_sapiens.GRCh37.75.gtf -o tmp_unstrand.counts -p TopHat/2328-Dura/accepted_hits.bam

//================================= Running ==================================\\
||                                                                            ||
|| 25 chromosome name aliases are loaded.                                     ||
|| Load annotation file /dcs01/lieber/ajaffe/Brain/DLPFC_PolyA/Counts/Ens ... ||
||    Number of features is 1306656                                           ||
||    Number of meta-features is 63677                                        ||
||    Number of chromosomes is 265                                            ||
||                                                                            ||
|| Process BAM file TopHat/2328-Dura/accepted_hits.bam...                     ||
||    Assign fragments (read pairs) to features...                            ||
||    Each fragment is counted once.                                          ||
||    Found reads that are not properly paired.                               ||
||    (missing mate or the mate is not the next read)                         ||
||    1830975 reads have missing mates.                                       ||
||    Input was converted to a format accepted by featureCounts.              ||
||    Total number of fragments is : 14393505                                 ||
||    Number of successfully assigned fragments is : 12379037 (86.0%)         ||
||    Running time : 4.64 minutes                                             ||
||                                                                            ||
||                         Read assignment finished.                          ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

Wei Shi

unread,
Jun 22, 2015, 7:17:09 PM6/22/15
to sub...@googlegroups.com, niva...@gmail.com
It seems to me that you do have a strand-specific dataset, since your stranded counting and unstranded counting gave very similar percentages (only ~2% difference). I believe the reason why your stranded counting had more counts than unstranded counting is due to the overlap of some genes in your annotation. The stranded counting resolved the ambiguity arising from the overlapping genes that are located on different strands. But for unstranded counting, such ambiguity cannot be resolved and consequently reads mapping to those ambiguous regions were not counted.

Best,
Wei

Mathias Lesche

unread,
Jun 23, 2015, 4:05:41 AM6/23/15
to sub...@googlegroups.com, niva...@gmail.com
Hey Andrew,

did you do a strand-specific paired-end sequencing? Because with paired-end data the second read of the pair goes always on the opposite strand of your gene. if you do strand-specific sequencing with dUTP, then the first read will be on the opposite strand and the second on the strand of a gene. featureCounts will take care of such things by reading the SAM flags.

Andrew Jaffe

unread,
Jun 23, 2015, 11:21:10 AM6/23/15
to sub...@googlegroups.com, niva...@gmail.com
We did do strand-specific paired end sequencing, and I agree the slightly higher % of assigned reads is due to using the strand information. however, again note that we aligned the reads accidentally using the default TopHat parameters the first time around, which ignores the strand-specific information in the reads:

--library-typeThe default is unstranded (fr-unstranded). If either fr-firststrand or fr-secondstrand is specified, every read alignment will have an XS attribute tag as explained below. Consider supplying library type options below to select the correct RNA-seq protocol.

therefore, I am a little confused why the proportion of reads assigned to the genes is so high to begin with, as I would have assumed it would be ~1/2 of what we are seeing now (if the reads are assigned to strands randomly in unstranded data)

I would therefore guess that featureCounts only use strand-specific information when the annotation is ambiguous...can you confirm? That would be my guess based on the results of these featureCounts runs...

-a

Mathias Lesche

unread,
Jun 23, 2015, 11:42:36 AM6/23/15
to sub...@googlegroups.com, niva...@gmail.com
Please look at the definition of fr-unstranded in the manual of TopHat (http://ccb.jhu.edu/software/tophat/manual.shtml): Reads from the left-most end of the fragment (in transcript coordinates) map to the transcript strand, and the right-most end maps to the opposite strand.
I think with paired-end and Illumina it's not even possible to do unstranded paired-end sequencing. So either the first or the second read of your fragment will go on the gene depending if it is dUTP or not. FeatureCounts will use the Sam flag information and if one read goes on your gene and the other is on the opposite strand, of course it will be counted to the gene because this read is from the same fragment. I would assume you did dUTP treatment during library prep. otherwise I couldn't explain the higher percentage in the counts.

Andrew Jaffe

unread,
Jun 23, 2015, 12:27:01 PM6/23/15
to sub...@googlegroups.com, niva...@gmail.com
Yes, we had done dUTP treatment in the library prep step prior to sequencing. I would assume that running TopHat would assign both reads to the same strand when passing the --fr-firststrand option whereas, as you point out, the left read gets one strand and the right read gets the other when the reads are aligned using an unstranded option. Therefore, when assigning either fragments or reads to genes, I still don't understand why we are not seeing 1/2 the assignment rate we are getting as only 1 of the reads in the pair should be getting assigned to the gene on its strand...

Wei Shi

unread,
Jun 23, 2015, 8:22:52 PM6/23/15
to sub...@googlegroups.com, niva...@gmail.com
Hi Andrew,

featureCounts performs stranded counting only when you provided the '-s' option. The ambiguity in gene annotation does not trigger the stranded counting.

You are correct that half of your reads should have the same strand with that of genes and the other half should not. But you are not counting reads. You are counting fragments because you specified the '-p' option in your command. When counting fragments for stranded data, you will get a high assignment rate for both unstranded counting and stranded counting with -s 2 (for your data). But you will almost get zero counts if you count the fragments with -s 1.

Best,
Wei

Andrew Jaffe

unread,
Jun 23, 2015, 8:59:49 PM6/23/15
to sub...@googlegroups.com, niva...@gmail.com
Hi, I had gotten the same results when counting reads (in addition to the fragments with the -p flag):

[ajaffe@compute-083 Fibroblasts]$ featureCounts -A /dcs01/lieber/ajaffe/Brain/DLPFC_PolyA/Counts/Ensembl/chrAliases_GRCh37_to_hg19.csv -a /dcs01/lieber/ajaffe/Brain/DLPFC_PolyA/Counts/Ensembl/Homo_sapiens.GRCh37.75.gtf -o tmp_unstrand_single.counts TopHat/2328-Dura/accepted_hits.bam

        ==========     _____ _    _ ____  _____  ______          _____
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
        v1.4.3-p1

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           o TopHat/2328-Dura/accepted_hits.bam             ||
||                                                                            ||
||             Output file : tmp_unstrand_single.counts                       ||
||             Annotations : /dcs01/lieber/ajaffe/Brain/DLPFC_PolyA/Count ... ||
||   Chromosome alias file : /dcs01/lieber/ajaffe/Brain/DLPFC_PolyA/Count ... ||
||                                                                            ||
||                 Threads : 1                                                ||
||                   Level : meta-feature level                               ||
||              Paired-end : no                                               ||
||         Strand specific : no                                               ||
||      Multimapping reads : not counted                                      ||
|| Multi-overlapping reads : not counted                                      ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\
||                                                                            ||
|| 25 chromosome name aliases are loaded.                                     ||
|| Load annotation file /dcs01/lieber/ajaffe/Brain/DLPFC_PolyA/Counts/Ens ... ||
||    Number of features is 1306656                                           ||
||    Number of meta-features is 63677                                        ||
||    Number of chromosomes is 265                                            ||
||                                                                            ||
|| Process BAM file TopHat/2328-Dura/accepted_hits.bam...                     ||
||    Assign reads to features...                                             ||
||    Total number of reads is : 26989737                                     ||
||    Number of successfully assigned reads is : 23026481 (85.3%)             ||
||    Running time : 1.32 minutes                                             ||
||                                                                            ||
||                         Read assignment finished.                          ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

[ajaffe@compute-083 Fibroblasts]$ featureCounts -A /dcs01/lieber/ajaffe/Brain/DLPFC_PolyA/Counts/Ensembl/chrAliases_GRCh37_to_hg19.csv -a /dcs01/lieber/ajaffe/Brain/DLPFC_PolyA/Counts/Ensembl/Homo_sapiens.GRCh37.75.gtf -o tmp_strand_single.counts -s 2 TopHat/2328-Dura/accepted_hits.bam

        ==========     _____ _    _ ____  _____  ______          _____
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
        v1.4.3-p1

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           o TopHat/2328-Dura/accepted_hits.bam             ||
||                                                                            ||
||             Output file : tmp_strand_single.counts                         ||
||             Annotations : /dcs01/lieber/ajaffe/Brain/DLPFC_PolyA/Count ... ||
||   Chromosome alias file : /dcs01/lieber/ajaffe/Brain/DLPFC_PolyA/Count ... ||
||                                                                            ||
||                 Threads : 1                                                ||
||                   Level : meta-feature level                               ||
||              Paired-end : no                                               ||
||         Strand specific : inversed                                         ||
||      Multimapping reads : not counted                                      ||
|| Multi-overlapping reads : not counted                                      ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\
||                                                                            ||
|| 25 chromosome name aliases are loaded.                                     ||
|| Load annotation file /dcs01/lieber/ajaffe/Brain/DLPFC_PolyA/Counts/Ens ... ||
||    Number of features is 1306656                                           ||
||    Number of meta-features is 63677                                        ||
||    Number of chromosomes is 265                                            ||
||                                                                            ||
|| Process BAM file TopHat/2328-Dura/accepted_hits.bam...                     ||
||    Assign reads to features...                                             ||
||    Total number of reads is : 26989737                                     ||
||    Number of successfully assigned reads is : 23839724 (88.3%)             ||
||    Running time : 1.33 minutes                                             ||
||                                                                            ||
||                         Read assignment finished.                          ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

Indeed, there is a low mapping rate when "-s 1" is specified:

[ajaffe@compute-083 Fibroblasts]$ featureCounts -A /dcs01/lieber/ajaffe/Brain/DLPFC_PolyA/Counts/Ensembl/chrAliases_GRCh37_to_hg19.csv -a /dcs01/lieber/ajaffe/Brain/DLPFC_PolyA/Counts/Ensembl/Homo_sapiens.GRCh37.75.gtf -o tmp_strand_single.counts -s 1 TopHat/2328-Dura/accepted_hits.bam

        ==========     _____ _    _ ____  _____  ______          _____
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
        v1.4.3-p1

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           o TopHat/2328-Dura/accepted_hits.bam             ||
||                                                                            ||
||             Output file : tmp_strand_single.counts                         ||
||             Annotations : /dcs01/lieber/ajaffe/Brain/DLPFC_PolyA/Count ... ||
||   Chromosome alias file : /dcs01/lieber/ajaffe/Brain/DLPFC_PolyA/Count ... ||
||                                                                            ||
||                 Threads : 1                                                ||
||                   Level : meta-feature level                               ||
||              Paired-end : no                                               ||
||         Strand specific : yes                                              ||
||      Multimapping reads : not counted                                      ||
|| Multi-overlapping reads : not counted                                      ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\
||                                                                            ||
|| 25 chromosome name aliases are loaded.                                     ||
|| Load annotation file /dcs01/lieber/ajaffe/Brain/DLPFC_PolyA/Counts/Ens ... ||
||    Number of features is 1306656                                           ||
||    Number of meta-features is 63677                                        ||
||    Number of chromosomes is 265                                            ||
||                                                                            ||
|| Process BAM file TopHat/2328-Dura/accepted_hits.bam...                     ||
||    Assign reads to features...                                             ||
||    Total number of reads is : 26989737                                     ||
||    Number of successfully assigned reads is : 964570 (3.6%)                ||
||    Running time : 1.30 minutes                                             ||
||                                                                            ||
||                         Read assignment finished.                          ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

Do you think this could be somehow related to the alignment? again, I'm surprised the "-s 2" assignment rate is not around 1/2 of what it is here...As you can see from the BAM header, the TopHat run  was unstranded/default:

@PG     ID:TopHat       VN:2.0.13       CL:/jhpce/shared/community/core/tophat/2.0.13/bin/tophat -p 3 -G /amber2/scratch/jleek/iGenomes-index/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf -o /dcs01/lieber/ajaffe/Fibroblasts/TopHat/2328-Dura -r 160 --mate-std-dev 30 /amber2/scratch/jleek/iGenomes-index/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome /dcs01/lieber/ajaffe/Fibroblasts/FASTQ/Sample_R2328-Dura_C41CPACXX/R2328-Dura_C41CPACXX_ATCACG_L001_R1_001.fastq.gz /dcs01/lieber/ajaffe/Fibroblasts/FASTQ/Sample_R2328-Dura_C41CPACXX/R2328-Dura_C41CPACXX_ATCACG_L001_R2_001.fastq.gz

And here are the first four reads:

[ajaffe@compute-083 Fibroblasts]$ samtools view TopHat/2328-Dura/accepted_hits.bam | head -4
HWI-ST1038:141:C41CPACXX:1:1308:9737:14515      163     chr1    13485   3       100M    =       13525   140     TGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACAAAGGTGA   2CCCFFFFFHHHHHJJJJIJIJJHHHIJJJJJJJJJJJIJJIIJJIJJJGGJGIJACEFFFFFEEEDDCDDDDDDDDDDDDDDDDDDDDDCCCDACCCCC   AS:i:-3 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:0A99       YT:Z:UU XS:A:+  NH:i:2  CC:Z:chr2       CP:i:114357432 HI:i:0
HWI-ST1038:141:C41CPACXX:1:1308:9737:14515      83      chr1    13525   3       100M    =       13485   -140    GCCCGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACAAAGGTGAAACCCAGGAGAGTGTGGAGTCCAGAGTGTTGCCAGGACCC   BDBBCCDDDDDBDBDDDDDDEDECCFDECCHHJJJIJHEIJJIJJJIIIJIFIJGJJIIIHGJJJJJIIHGIGGEJJJJJJIJJJJJHHHHHFFFFFCCC   AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:100        YT:Z:UU XS:A:+  NH:i:2  CC:Z:chr2       CP:i:114357392 HI:i:0
HWI-ST1038:141:C41CPACXX:1:1208:3317:99803      99      chr1    14407   3       100M    =       14469   162     CGGCTCAGTTCTTTATTGATTGGTGTGCCGTTTTCTCTGGAAGCCTCTTAAGAACACTGTGGCGCAGGCTGGGTGGAGCCGTCCCCCCATGGAGCACAGG   C@CFFFDEFFFFHICGHBEHIGGEHBFHHHHIIIGHIJGGGIIJJJJJJGHHIJIJJJJGIJJJJHHFFFFDD@B;?BCDDDD@BBDD3?CDCDDDDDC<   AS:i:-11        XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:1T55A42    YT:Z:UU XS:A:-  NH:i:2  CC:Z:chr12    CP:i:91119       HI:i:0
HWI-ST1038:141:C41CPACXX:1:2211:8394:77327      99      chr1    14407   3       100M    =       14469   162     CGGCTCAGTTCTTTATTGATTGGTGTGCCGTTTTCTCTGGAAGCCTCTTAAGAACACTGTGGCGCAGGCTGGGTGGAGCCGTCCCCCCATGGAGCACAGG   CCCFFFFFFHHGHJGIJIEJJJJFHHGGIJIJJJIJJJIIJIJJJJJJJJJJJJJJJJJGIIIJJHHFFFFDDCDBBDDDDDDDDDDDDCDACDDDDDDB   AS:i:-11        XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:1T55A42    YT:Z:UU XS:A:-  NH:i:2  CC:Z:chr12    CP:i:91119       HI:i:0

All I can think of was that using the Ensembl annotation above resulted in properly stranding the reads that aligned in the first step of aligning to the transcriptome, and indeed, it looks like 81.88% of the left reads and 79.49% of the right reads had mapped in that step...Sorry to ramble a bit and I guess this is beyond the scope of featureCounts, but I was wondering if you had seen this before...

Thanks,
Andrew

Mathias Lesche

unread,
Jun 24, 2015, 4:26:40 AM6/24/15
to sub...@googlegroups.com, niva...@gmail.com
Hey Andrew,

paired-end will always be like this. one read goes to one strand and the other read to the other strand, because it is the best and correct alignment. it wouldn't make sense to assign both reads to the same strand.

how ever I do understand your problem now and I think only Wei can really explain what happens if you do read counting with paired end data.
I have the following example (see attached picture). The gene is on the minus strand and we did paired-end sequencing with dUTP. So reads on the plus Strand should be assigned when using the -s 2 option and only read counting. It's an easy example because one can count it by looking at the alignments.

I ran featureCounts in six settings and here are the results:
count reads with -s 0 -> Successfully assigned reads : 16 (88.9%)
count reads with -s 1 -> Successfully assigned reads : 0 (0.0%)
count reads with -s 2 -> Successfully assigned reads : 16 (88.9%)
count fragments with -s 0 and -p -B -> Successfully assigned fragments : 8 (88.9%)
count fragments with -s 1 and -p -B -> Successfully assigned fragments : 0 (0.0%)
count fragments with -s 2 and -p -B -> Successfully assigned fragments : 8 (88.9%)

For me everything but two don't make sense. I don't understand why counting reads with -s 2 assigns 16 reads because it should assign only 8 and counting reads with -s 1 should assign 8 as well.

So is featureCounts still check the paired flag even though one wants to counts reads and not fragments?



test1.png

Mathias Lesche

unread,
Jun 24, 2015, 4:28:57 AM6/24/15
to sub...@googlegroups.com, niva...@gmail.com
I forgot, I'm using featureCounts v1.4.6-p3.

best
Mathias

Wei Shi

unread,
Jun 24, 2015, 7:29:52 PM6/24/15
to sub...@googlegroups.com, niva...@gmail.com
Hi Andrew and Mathias,

Thanks for providing detailed data to illustrate the problem. It turned out that it is a bug in featureCounts that caused the incorrect counting of reads when paired end reads were provided. For paired end reads, when you count reads instead of fragments featureCounts still checks if each read is the left read or the right read in the pair and it then flipped the strand of the read according to the value of -s. When -s is 2, it flipped the strand of the left read and this caused the problems you have seen.

We will fix this bug and release a patched version soon.

Best wishes,
Wei

Wei Shi

unread,
Jun 24, 2015, 10:43:07 PM6/24/15
to sub...@googlegroups.com, niva...@gmail.com
Bug fixed. Please update to the latest version 1.4.6-p4.

Best,
Wei
Reply all
Reply to author
Forward
0 new messages