Low successfully assigned fragments

1,139 views
Skip to first unread message

Elizabeth

unread,
Sep 1, 2016, 7:06:44 PM9/1/16
to Subread
Hello,

I am new to RNAseq analysis and this is my first time using featureCounts. I have some sequences (PE100 from Illumina HiSeq4000), prepared with the TruSeq stranded mRNA kit and want to do differential expression analysis. I aligned my reads to the genome using TopHat2 (consistently around 75% alignment to genome), and am using the resulting accepted_hits.bam as my input into featureCounts. 

This is my command for featurecounts (subread-1.5.0): 
featureCounts -T 8 \
-p -F GTF -t exon -g Parent -s 2 \
-a Csinensis_154_gene.gtf \
-G Csinensis_154.fa \
-o NC1-8_featurecounts_output.txt \
accepted_hits.bam

My output shows very low successfully assigned fragments- around 20%. I also have changed -B and -M settings, but neither provide much improvement. -s 2 is correct over -s 1. Example output:

//========================== featureCounts setting ===========================\\
||                                                                            ||
||             Input files : 1 BAM file                                       ||
||                           P accepted_hits_NC1-8_R1.bam                     ||
||                                                                            ||
||             Output file : NC1-8_featurecounts_output.txt                   ||
||                 Summary : NC1-8_featurecounts_output.txt.summary           ||
||              Annotation : /home/ubuntu/workspace/rnaseq/phyto_genome/v ... ||
||                                                                            ||
||                 Threads : 8                                                ||
||                   Level : meta-feature level                               ||
||              Paired-end : yes                                              ||
||         Strand specific : reversely stranded                               ||
||      Multimapping reads : not counted                                      ||
|| Multi-overlapping reads : not counted                                      ||
||                                                                            ||
||          Chimeric reads : counted                                          ||
||        Both ends mapped : not required                                     ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file /home/ubuntu/workspace/rnaseq/phyto_genome/versio ... ||
||    Features : 259657                                                       ||
||    Meta-features : 34329                                                   ||
||    Chromosomes/contigs : 1750                                              ||
||                                                                            ||
|| Loading FASTA contigs : /home/ubuntu/workspace/rnaseq/phyto_genome/ver ... ||
||    12574 contigs were loaded                                               ||
||                                                                            ||
|| Process BAM file accepted_hits_NC1-8_R1.bam...                             ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Read re-ordering is performed.                                          ||
||                                                                            ||
||    Total fragments : 34002467                                              ||
||    Successfully assigned fragments : 7239403 (21.3%)                       ||
||    Running time : 0.89 minutes                                             ||
||                                                                            ||
||                         Read assignment finished.                          ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

It seems like this is due to many ambiguous reads: 

Status accepted_hits_NC1-8_R1.bam
Assigned 7239403
Unassigned_Ambiguity 18514271
Unassigned_MultiMapping 576445
Unassigned_NoFeatures 7672348
Unassigned_Unmapped 0
Unassigned_MappingQuality 0
Unassigned_FragmentLength 0
Unassigned_Chimera 0
Unassigned_Secondary 0
Unassigned_Nonjunction 0
Unassigned_Duplicate 0

Adding the -O to allow ambiguous reads increases the number of assigned fragments, (75%), but I know this is not recommended for RNAseq data. --largestOverlap doesn't improve the number much (in absence of -O). 

The header of my gtf and bam look like they match. I did not modify the annotation file, but did obtain this gtf by converting the original gff3 by using gffread from Cufflinks: 

head Csinensis_154_gene.gtf
# gffread Csinensis_154_gene.gff3 -o Csinensis_154_gene.gtf
##gff-version 3
scaffold00001   phytozome8_0    mRNA    6848    8905    .       -       .       ID=PAC:18137364;geneID=orange1.1g009555m.g;gene_name=orange1.1g009555m.g
scaffold00001   phytozome8_0    exon    6848    7034    .       -       .       Parent=PAC:18137364
scaffold00001   phytozome8_0    exon    7149    7550    .       -       .       Parent=PAC:18137364

head accepted_hits_NC1-8_R1.sam

J00113:145:HF3W3BBXX:2:1107:14641:48878 97      scaffold00001   6838    50      100M    =   7011     387     CTGGTAGAGAGAAATGAACAGTTATATTAAACAGCTCAGCTCACGTGATATTTCTAATTAGTAGGACGATCGCCTTCTCGGCTTCTCCAATCCTAATAGG AAFFFJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:100    YT:Z:UU  NH:i:1  XS:A:-

I'm really new to RNAseq analysis, so I am unsure if there is something that I am missing, because it seems like there are so few fragments successfully assigned. I would greatly appreciate any advice or suggestions you have for troubleshooting. 

Thanks!
Liz

Wei Shi

unread,
Sep 2, 2016, 2:16:57 AM9/2/16
to Subread
Hi Liz,

Firstly you do not need to specify '-G' since this option is only useful when '-J' is specified. However this doesn't affect your assignment percentage.

Having a large number of reads in category "Unassigned_Ambiguity" suggests that in your annotation many genes overlap with each other. You may try to use a different annotation to see if assignment rate improves.

Cheers,

Wei

Elizabeth

unread,
Sep 2, 2016, 1:09:09 PM9/2/16
to Subread
I will try this. 
Thanks for your suggestion!
Reply all
Reply to author
Forward
0 new messages