featureCounts running time read and fragment mode, pre-sorting

275 views
Skip to first unread message

Christian Arnold

unread,
Nov 7, 2017, 6:24:58 PM11/7/17
to Subread
Hi, I am using featureCounts for optimized performance, and I noticed the following and wanted to confirm this:
1. For paired-end data, the "normal" mode (without -p) is very fast indeed and does not require additional sorting, but things become slower by a factor of at least 5 with -p and the same data (with additional sorting, which I presume cannot be parallelized). Is the latter a combination of (a) sorting and (b) inherently more difficult and therefore computationally more expensive? What are the individual contributions of (a) and (b), any idea?
2. Because I have to run featureCounts for the same set of BAM files literally hundreds of times, I'd like to pre-sort properly beforehand even though sorting just takes 30-60 seconds or so. However, with "repair", I get a segmentation fault. How should I proceed here?

Thanks for Subread, superb piece of software!

Wei Shi

unread,
Nov 7, 2017, 11:24:32 PM11/7/17
to Subread
Hi Christian,

The repair function has a bug and a fix should be released soon.

Counting read pairs might be slower than counting individual reads, but we would not expect it to be slower by a factor of 5. How are your bam files generated? Are they name sorted or location sorted? Is there a lot of singleton reads reported in your mapping results?

Best wishes,

Wei

Christian Arnold

unread,
Nov 8, 2017, 11:18:14 AM11/8/17
to Subread
Hi, thanks for getting back to me!

They are location-sorted! The statistics look fine, this is one example from samtools flagstat:

13549271 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
13549271 + 0 mapped (100.00% : N/A)
13549271 + 0 paired in sequencing
6782337 + 0 read1
6766934 + 0 read2
13475515 + 0 properly paired (99.46% : N/A)
13499234 + 0 with itself and mate mapped
50037 + 0 singletons (0.37% : N/A)
996 + 0 with mate mapped to a different chr
996 + 0 with mate mapped to a different chr (mapQ>=5)

So, only 0.37% singletons, 

The x 5 estimate was a crude guess, I could provide more accurate ones, but the times that featureCounts gives in the end totally reflect that difference also. Not sure, though, if this encompasses all steps before or just the intersection steps (i.e, excluding the time for sorting)

Wei Shi

unread,
Nov 8, 2017, 5:24:45 PM11/8/17
to Subread
Thanks for the info. Which aligner did you use for mapping? Did you do any post-processing on the bam files?

Christian Arnold

unread,
Nov 12, 2017, 11:17:23 AM11/12/17
to Subread
They have been post-processed with various tools, yes. Bowtie2 was used for aligning, for example, and then MarkDuplicates. SOme custom filters have been applied also, like removing reads with low MAPQ values and reads with INDELS, adjusting red start sites by a few base pairs. Do you have an idea which post-processing might explain such a difference? I could also send you one of the files if that helps (privately, for data security reasons)

Wei Shi

unread,
Nov 12, 2017, 5:19:35 PM11/12/17
to Subread
It will help to understand what caused the increase in counting time if we can take a look at the data ... thanks

Christian Arnold

unread,
Nov 14, 2017, 12:53:49 PM11/14/17
to Subread
OK, what is the best way to send you the files? I could provide a Dropbox link, but I'd prefer to not post it publicly here but rather in a private message?

Wei Shi

unread,
Nov 14, 2017, 5:19:32 PM11/14/17
to Subread
Dropbox is fine. You can include the link in a private reply (choose 'Reply privately to author' from the drop down list when you reply).

Christian Arnold

unread,
Nov 22, 2017, 5:26:11 AM11/22/17
to Subread
I sent you a private message with a link to the files, please let me know if you received it and whether the download works. Thanks!

Christian Arnold

unread,
Nov 23, 2017, 11:12:23 AM11/23/17
to Subread
Hi, do you have any idea when the fix for repair is going to be released? Thanks!

Wei Shi

unread,
Nov 23, 2017, 10:20:52 PM11/23/17
to Subread
Hi Christian, this bug has been fixed in the latest release 1.6.0.

Wei Shi

unread,
Nov 23, 2017, 10:30:03 PM11/23/17
to Subread
Thanks for the data. We have successfully downloaded it. We took a close look at the data and found that mapping locations of mate reads were not properly reported in the bam file, which caused the extended counting time. Below is an example:

NB501508:15:H22NNBGXY:1:12306:16596:2074        83      chr1    3032682 34      52M     =       3032687 EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEAAAAA    MD:Z:52 PG:Z:MarkDuplicates     XG:i:0  NM:i:0  XM:i:0  XN:i:0  XO:i:0  AS:i:0  XS:i:-15        YS:i:0       YT:Z:CP YC:f:2.19       YN:i:2  YG:i:32


NB501508
:15:H22NNBGXY:1:12306:16596:2074        163     chr1    3032691 34      52M     =       3032687 AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE    MD:Z:52 PG:Z:MarkDuplicates     XG:i:0  NM:i:0  XM:i:0  XN:i:0  XO:i:0  AS:i:0  XS:i:-15        YS:i:0       YT:Z:CP YC:f:2.13       YN:i:2  YG:i:34


 

These two reads are from the same pair but the reporting of mapping location of mate read in each read is incorrect. In the first read, the mapping location of its mate read is reported as 3032687 but this is different from the mapping location of the second read (3032691).

Christian Arnold

unread,
Nov 24, 2017, 4:23:45 AM11/24/17
to Subread
Hi, thanks a lot for taking a close look. Actually, yes, I see what happened - as this is ATAC-Seq data, the read start sites have been adjusted by 4 or 5 bp depending on the strand, but the positions of the next mates have apparently not been adjusted accordingly.
I have two questions related to this:
1. It might make sense for featureCounts to abort with an error if that happens, or doesn't it matter too much because these positions are not used anyway in featureCounts?
2. Is this affecting the output of featureCounts?

Thanks a lot, your help is really appreciated!

Christian Arnold

unread,
Nov 24, 2017, 4:31:41 AM11/24/17
to Subread
OK I see, so maybe I used a previous version of it. I'll check. If I am just using the BAM files for featureCounts, repairing them with the -t option should be fine right and will reduce the files sizes without consequences for featureCounts?

Christian Arnold

unread,
Nov 24, 2017, 4:42:25 AM11/24/17
to Subread
Could or should this be something that "repair" is also addressing?

Christian Arnold

unread,
Nov 24, 2017, 10:25:36 AM11/24/17
to Subread
UPDATE: Sorry for all the messages, but I do not see any edit button, so I have to create a new post all the time...

I fixed the BAM file, so that coordinates are now correctly referring to their mates (FixMateInformation from PICARD). I then used repair to prepare the file for featureCounts. However, when I start featureCounts, it still prints the "WARNING: reads from the same pair were found not adjacent to each other" message. Any idea why? 

Wei Shi

unread,
Nov 26, 2017, 5:46:58 PM11/26/17
to Subread
We will have to take a look at your new bam file. Could you please send it again? Or you may just post your commands for generating this bam file so we can reproduce it from our end. We also need to have your 'repair' command. Thanks.

Christian Arnold

unread,
Nov 28, 2017, 10:10:27 AM11/28/17
to Subread
I used one of the files as an example, do you still have them? I already deleted the download, but can reactivate it I guess. I simply used repair with -c and -t, and subsequently featureCounts with -F SAF -p -B -d 0 -D 2000 -C -Q 10. If that is not enough, I can provide a more concrete example, let me know!

Yang LIAO

unread,
Dec 5, 2017, 11:52:54 PM12/5/17
to Subread
Hi Christian,

Than you for the bug report. I can reproduce the bug in the repair program: the "dummy" reads that were generated for the read-pairs that had only one end reported in the BAM file had a format issue. The "mate mapping position" was incorrect for those "dummy" reads. This will only result in the warning message you have seen. It will not cause error in the read-counting results.

The bug will be fixed in the next release.

Cheers,
Yang

Christian Arnold

unread,
Dec 6, 2017, 8:46:08 AM12/6/17
to Subread
Ok thanks a lot for checking, good to know you guys carefully check the reported issues ;-). So it is actually not resorted despite the print that it does, therefore the running time does not increase because of this bug?

Thanks,
Christian

Wei Shi

unread,
Dec 7, 2017, 8:43:45 PM12/7/17
to Subread
Hi Christian, we will improve the screen output of featureCounts regarding the processing of paired end bam files. For the running time, the bug only causes a very small increase of running time. 
Reply all
Reply to author
Forward
0 new messages