Change to allow for samtools sorting?

Rory Kirchner

unread,

Sep 14, 2015, 3:48:43 PM9/14/15

to Subread

Hi Wen,

I was wondering if it would be possible for you to fix the featureCounts algorithm so it works on samtools/sambamba queryname sorted files. Those files should have all of the read names in a clump, even if they are not paired up exactly the way featureCounts expects. Is there a way you could read in batches, where you read in all of the alignments from a read pair at once and then put them in the order featureCounts expects?

The reason why I am asking is that although featureCounts will sort the files, the sorting algorithm in subread is really slow. It isn't multithreaded and it runs in the working directory rather than using a local temporary directory so you can nuke your shared filesystem as well if you are running hundreds of samples simultaneously. It would be awesome if featureCounts could just use the presorted files from one of the other tools, is that a complete no-go.

https://github.com/chapmanb/bcbio-nextgen/issues/1013 is where we have been talking about this a little bit.

If you had other suggestions as well that would be helpful.

Thanks!

Best,

Rory

Wei Shi

unread,

Sep 14, 2015, 9:51:53 PM9/14/15

to Subread

Dear Rory,

Firstly, featureCounts has a new option '--donotsort' since version 1.4.6-p3. featureCounts will not re-sort input reads when this option is turned on.

Secondly, if you want to pre-sort your bam files and then run featureCounts on them, you can use subtools command included in Subread package (under bin/utilities directory) to do so:

./subtools -i your_file.bam -o your_file_name_sorted.bam --informat BAM --outformat BAM --sort byname

featureCounts will not resort the name-sorted bam files generated by subtools.

We found that samtools might not sort paired-end reads properly when they were reported multiple times in the mapping output (multi-mapping read pairs). This could result in incorrect read counting. You should either run subtools to resort it or let featureCounts resort it.

Cheers,

Wei

Rory Kirchner

unread,

Sep 14, 2015, 11:21:43 PM9/14/15

to Subread

Hi Wei,

Thanks for responding-- the trouble with the --donotsort option is that the common way of presorting, using samtools, will sometimes create invalid files. So as a user or a builder of a pipeline I can't enable that option if I'm going to be sorting with samtools, I have to resort my files just to make featureCounts work. To make sure featureCounts works we basically have to sort with the subtools tool.

The subtools tool does sort the files, but it has a couple of problems:

1) it does not work in parallel, so it is very slow. Part of what makes featureCounts preferable over something like htseq-count is that it is really really fast, but having to do a slow sort with subtools hurts that speed.

2) it creates a ton of temporary files in the current directory. usually you would like temporary files like that to go to TMP so that you can have TMP be a local disk and not be sorting temporary files onto something like a NFS mounted filesystem. Sorting in the current directory works fine if you want to do tens of samples but if you try to do hundreds of samples simultaneously then you can't write all of those files to your NFS mount or it will crush it.

I agree, samtools sometimes won't put the pairs right next to each other if they are multimappers, but it will put all of the alignments from the same read name in a group. Right now featureCounts assumes in a presorted file the read pairs are right next to each other. Could featureCounts instead assume just that all reads with the same read name are next to each other, read them in until a different read name is encountered and then process all of the reads with the same name as a block and figure out which one are pairs by position/tags/etc? Then it wouldn't matter if there are records out of order as long as they were sorted by read name.

Thanks again for responding so quick.

Wei Shi

unread,

Sep 16, 2015, 1:11:26 AM9/16/15

to Subread

Hi Rory,

We will add multi-threading support to subtools to speed it up. We will also add a parameter to it to allow users to specify where the temporary files should be saved to.

We will release a new version fairly soon.

Thanks,

Wei

Mathias Lesche

unread,

Sep 16, 2015, 4:23:13 AM9/16/15

to Subread

Hi Rory

Am Dienstag, 15. September 2015 05:21:43 UTC+2 schrieb Rory Kirchner:

I agree, samtools sometimes won't put the pairs right next to each other if they are multimappers, but it will put all of the alignments from the same read name in a group. Right now featureCounts assumes in a presorted file the read pairs are right next to each other. Could featureCounts instead assume just that all reads with the same read name are next to each other, read them in until a different read name is encountered and then process all of the reads with the same name as a block and figure out which one are pairs by position/tags/etc? Then it wouldn't matter if there are records out of order as long as they were sorted by read name.

Thanks again for responding so quick.

I'm having more or less the same problem. I use GSNAP as aligner and there just some things which are not covered in the sam/bam specs. For example if parts of a read align equally good, the aligner can't just write one line in the output. instead two are written. If you use paired-end data, you'll end up with three lines for this read id. 2xR1 and 1xR2 or vice versa. Even though the file is sorted by name, featureCounts will stumble about it and try to resort the file. So I asked Wei to introduce the no-sorting option. This options disregards these reads (if I'm not mistaken). I would prefer to have them in there and get counts for them, so I wrote a small script with the help of the picard-tools and remove one of the R1 or R2.

best

mathias

Rory Kirchner

unread,

Sep 16, 2015, 8:19:00 AM9/16/15

to Subread

Hi Wei,

That is good news about speeding the subtools sorting up and allowing for it to write to temporary disk as a fallback mechanism, but it still requires a third special sort of a BAM file just to use featureCounts. There's no way that featureCounts could use files queryname presorted by samtools or sambamba and have an internal mechanism to handle the pairs that might be out of sync? The sorting from samtools and sambamba sticks the read names next to each other so they exist in a block. Reading in the block and pairing up the alignments internally can't happen to make featureCounts compatible with samtools?

Best,

Rory

Wei Shi

unread,

Sep 16, 2015, 7:47:28 PM9/16/15

to Subread

Hi Rory,

Thanks for your suggestion. But this will incur a lot of changes to featureCounts and it may significantly increase the running time of featureCounts as well.

If samtools can correctly sort multi-mapping read pairs by read names, then this wouldn't cause problems you have encountered. This issue with samtools is probably not only causing problem for featureCounts, but also causing problems for other tools. I would suggest you contact the samtools authors and let them fix it.

In the meantime, we have a new idea to speed up the read sorting. We are thinking of actually not doing name sorting, but just pairing reads up. If this works, it will tremendously reduce the amount of time for preparing bam files for featureCounts counting.

Best wishes,
Wei

Wei Shi

unread,

Sep 16, 2015, 8:01:28 PM9/16/15

to Subread

Hi Mathias,

I'm having more or less the same problem. I use GSNAP as aligner and there just some things which are not covered in the sam/bam specs. For example if parts of a read align equally good, the aligner can't just write one line in the output. instead two are written. If you use paired-end data, you'll end up with three lines for this read id. 2xR1 and 1xR2 or vice versa. Even though the file is sorted by name, featureCounts will stumble about it and try to resort the file. So I asked Wei to introduce the no-sorting option. This options disregards these reads (if I'm not mistaken). I would prefer to have them in there and get counts for them, so I wrote a small script with the help of the picard-tools and remove one of the R1 or R2.

No, the --donotsort option does not disregard those reads. Those reads will be disregarded by featureCounts if they have a NH tag value greater than 1 (ie. multi-mapping reads) and you instruct featureCounts not to count multi-mapping reads. But the --donotsort option does not disregard them.

The re-sorting process in featureCounts will add a dummy pair for the second alignment so you will get 2xR1 and 2xR2, and then featureCounts will consider both alignments for this fragment in its counting. If you have such reads in your mapping results, you will need to let featureCounts to resort it otherwise you will get incorrect or incomplete counts.

Finally, you may consider using the Subread/Subjunc aligners which are fast, accurate and compatible with featureCounts ...

Best regards,

Wei

Wei Shi

unread,

Oct 28, 2015, 9:35:34 PM10/28/15

to Subread

featureCounts can now use multiple threads to speed up read re-sorting. It takes only half a minute to re-sort 30 million read pairs with 8 threads.

A new utility program 'repair' is also included in the new release (v1.5.0) and it can be used to pre-sort paired-end bam reads (re-sorted reads are saved to a bam file). The generated bam file will not need to be re-sorted by featureCounts.

Wei

Reply all

Reply to author

Forward