featureCounts on sorted bam (Rsubread)

2,305 views
Skip to first unread message

Davide Cittaro

unread,
Dec 3, 2014, 6:03:34 AM12/3/14
to sub...@googlegroups.com
Dear list, 
I've recently started using featureCounts in my RNA-seq experiments, all in R session via Rsubread package (Rsubread_1.16.1).
I've noticed that featureCounts creates a sorted sam file for each input bam file (already sorted). Is there a way to avoid bam to sam conversion and, more important, read sorting? It is an incredible waste of time (and disk space as well).
Best, 

d

Wei Shi

unread,
Dec 4, 2014, 5:06:54 PM12/4/14
to sub...@googlegroups.com
Hi Divide,

featureCounts will only start the sorting process when it finds that the reads provided in the input were not properly sorted. We found that for paire-end read data, when some read pairs were reported multiple times in a BAM file, samtools may not sort them correctly, ie. reads from the same pair were not placed next to each other after sorting. I guess your data are paired end data and you allowed multi-mapping reads to be reported multiple times in the mapping output.

featureCounts uses name of the read and mapping location of its mate read (can be found in the same record of the current read) to correctly pair up reads from the same pair, and this solves the sorting issue relating to the multiple occurrences of the same read pair in the BAM file.

Although it is fairly straightforward to provide an option to disable the featureCounts sorting process, I think it is better to let featureCounts check if your bam files are properly sorted and re-sort it if needed, to ensure the read counting is correctly conducted.

Cheers,
Wei

Mathias Lesche

unread,
Dec 17, 2014, 10:05:11 AM12/17/14
to sub...@googlegroups.com
Dear Wei,

I'm using for my alignments the GSNAP written by Thomas Wu. If a read is divided into two parts because it aligns to distant locations both fragments of the read are printed on separate lines in the sam/bam file. The reads are marked in the XO flag with

CT (concordant translocation), HT (halfmapping translocation), UT (unpaired translocation). Here is an example:


HWI-ST678:296:C437KACXX:7:1101:12820:30889      99      chr13   45970104        40      47M54H  chr12   6687256 209   ... MD:Z:47 NH:i:1  HI:i:1  NM:i:0  SM:i:40 XQ:i:40 X2:i:0  XO:Z:CT XS:A:-  XT:Z:GT-AG,2.00,2.00,-chr12@6687195..-chr13@45970104
HWI-ST678:296:C437KACXX:7:1101:12820:30889      99      chr12   6687195 40      47H54M  =       6687256 209  .... MD:Z:54 NH:i:1  HI:i:1  NM:i:0  SM:i:40 XQ:i:40 X2:i:0  XO:Z:CT XS:A:-  XT:Z:GT-AG,2.00,2.00,-chr12@6687195..-chr13@45970104
HWI-ST678:296:C437KACXX:7:1101:12820:30889      147     chr12   6687256 40      48M271N53M      =       6687195 -209 .... MD:Z:101        NH:i:1  HI:i:1  NM:i:0  SM:i:40 XQ:i:40 X2:i:0  XO:Z:CT XS:A:-  XG:Z:A


Even though the sam/bam file is sorted by queryname, featureCounts will always try to resort the file. My temporary solution is, I filter out these reads and run featureCounts on the rest.

I don't do the filtering for small files because featureCounts is rather quick, does the sorting and gives me counts, but for big files it's a pain to do it ...

Do you think there could be a solution for this problem? If you want I can send you some test data.


best

Mathias

Wei Shi

unread,
Apr 6, 2015, 7:51:50 PM4/6/15
to sub...@googlegroups.com
Apologies for the late reply. We will try to add a new option to allow users to turn off read sorting in the next release.

Best,
Wei

Mathias Lesche

unread,
Jun 23, 2015, 4:59:59 AM6/23/15
to sub...@googlegroups.com
Hey Wei,

sry I was absent for a while. I testet your option with some pair alignment that I'm dealing right now. I used the original alignment and featureCounts with the "--donotsort" option. That's the summary:

Status  L7701.sam
Assigned        23292803
Unassigned_Ambiguity    7284086
Unassigned_MultiMapping 4216790
Unassigned_NoFeatures   12770376
Unassigned_Unmapped     825995
Unassigned_MappingQuality       0
Unassigned_FragmentLength       0
Unassigned_Chimera      187441

Next, I did was I usually do with paired-end alignments from gsnap. I remove reads which are marked in the XO flag with CT (concordant translocation), HT (halfmapping translocation), UT (unpaired translocation). Now my alignment file only has alignment which have two lines per fragment. I did the counting with featureCounts again but this time without the "--donotsort". Here is the summary:

Status  L7701.sam
Assigned        26297471
Unassigned_Ambiguity    173031
Unassigned_MultiMapping 2854068
Unassigned_NoFeatures   18224727
Unassigned_Unmapped     825060
Unassigned_MappingQuality       0
Unassigned_FragmentLength       0
Unassigned_Chimera      184116

The other options were: -s 2 -p -B -C -Q 1

So what does featureCounts do with the "--donotsort" option when it encounters more than two lines per read id? For me it seems it disregards the whole entries and continues with the next read id.

Wei Shi

unread,
Jun 24, 2015, 7:50:40 PM6/24/15
to sub...@googlegroups.com
Hi Mathias,

When you count fragments ("-p") and turn on the "--donotsort" option, featureCounts will take two reads at a time from the input and treat them as the reads from the same pair. If there are for example three lines having the same read id (ie. from the same fragment/pair), then the last line will be used together with the read after it for counting. This will result in incorrect fragment counting because reads were incorrectly paired.

Regards,
Wei

Kristen Dang

unread,
Apr 11, 2016, 2:28:40 PM4/11/16
to Subread
Dear Wei,

How does featurecounts handle the example from Mathias' email if allowed to do the sorting? Is it able to properly count read pairs where the read for one end is reported on two lines? (not multiple mapping locations, but a chimeric/split read)?

I am also dealing with GSNAP alignments that including CT/HT/UT mapping situations.

Regards,
Kristen

Wei Shi

unread,
Apr 11, 2016, 7:58:53 PM4/11/16
to Subread
Hi Kristen,

Yes featureCounts can count such read pairs. The read that does not pair with its mate will be counted separately.

Best wishes,

Wei

Wei Shi

unread,
Apr 11, 2016, 9:06:52 PM4/11/16
to Subread
I also want to mention that the speed of read sorting in featureCounts was much improved from version 1.5.0: it takes <30 seconds to sort 30 million read pairs.

Kristen Dang

unread,
Apr 12, 2016, 1:03:06 PM4/12/16
to Subread
Dear Wei,
Thanks for your quick response. I'm sorry I didn't clarify which example I meant, so I have copied again below. This is a case where both ends of a read pair are aligned and paired, but one end is a chimeric/translocated alignment so it is reported on two lines. 

In this case, will featurecounts know to count these three lines together for the read pair? Or will the three lines be discarded? 

HWI-ST678:296:C437KACXX:7:1101:12820:30889      99      chr13   45970104        40      47M54H  chr12   6687256 209   ... MD:Z:47 NH:i:1  HI:i:1  NM:i:0  SM:i:40 XQ:i:40 X2:i:0  XO:Z:CT XS:A:-  XT:Z:GT-AG,2.00,2.00,-chr12@6687195..-chr13@45970104
HWI-ST678:296:C437KACXX:7:1101:12820:30889      99      chr12   6687195 40      47H54M  =       6687256 209  .... MD:Z:54 NH:i:1  HI:i:1  NM:i:0  SM:i:40 XQ:i:40 X2:i:0  XO:Z:CT XS:A:-  XT:Z:GT-AG,2.00,2.00,-chr12@6687195..-chr13@45970104
HWI-ST678:296:C437KACXX:7:1101:12820:30889      147     chr12   6687256 40      48M271N53M      =       6687195 -209 .... MD:Z:101        NH:i:1  HI:i:1  NM:i:0  SM:i:40 XQ:i:40 X2:i:0  XO:Z:CT XS:A:-  XG:Z:A

Regards,
Kristen


Wei Shi

unread,
Apr 14, 2016, 1:04:28 AM4/14/16
to Subread
Hi Kristen,

featureCounts will pair up the second and third reads and count them as one read pair. It will count the first read separately.

featureCounts uses read name, mapping location of the current read, mapping location of the mate read and HI tag to pair up reads. The first read can not be paired up with any other reads.

Cheers,
Wei
Reply all
Reply to author
Forward
0 new messages