Hi Wei,
Thanks for responding-- the trouble with the --donotsort option is that the common way of presorting, using samtools, will sometimes create invalid files. So as a user or a builder of a pipeline I can't enable that option if I'm going to be sorting with samtools, I have to resort my files just to make featureCounts work. To make sure featureCounts works we basically have to sort with the subtools tool.
The subtools tool does sort the files, but it has a couple of problems:
1) it does not work in parallel, so it is very slow. Part of what makes featureCounts preferable over something like htseq-count is that it is really really fast, but having to do a slow sort with subtools hurts that speed.
2) it creates a ton of temporary files in the current directory. usually you would like temporary files like that to go to TMP so that you can have TMP be a local disk and not be sorting temporary files onto something like a NFS mounted filesystem. Sorting in the current directory works fine if you want to do tens of samples but if you try to do hundreds of samples simultaneously then you can't write all of those files to your NFS mount or it will crush it.
I agree, samtools sometimes won't put the pairs right next to each other if they are multimappers, but it will put all of the alignments from the same read name in a group. Right now featureCounts assumes in a presorted file the read pairs are right next to each other. Could featureCounts instead assume just that all reads with the same read name are next to each other, read them in until a different read name is encountered and then process all of the reads with the same name as a block and figure out which one are pairs by position/tags/etc? Then it wouldn't matter if there are records out of order as long as they were sorted by read name.
Thanks again for responding so quick.