Using unix sort with pybedtools

153 views
Skip to first unread message

Dario Beraldi

unread,
Sep 23, 2015, 12:35:26 PM9/23/15
to bedtools-discuss
Dear pybedtools developers,

Could you suggest a good strategy to use Unix sort in a pybedtools pipeline?

I have the following command (what it does is not relevant I guess):

tab= pybedtools.BedTool(regions).intersect(b= scoreBedGraph, wa= True, wb= True, sorted= True) \
    .each(prepareForSortGroup) \
    .sort() \
    .groupby(g= [1], c= [1, 8], o= ['count', 'sum'])

Instead of pybedtools sort() I'd like to use the OS gnu sort. What I'm thinking to do is to write the output of pybedtools to a temp file, sort it using python subprocess module and read the sorted file back for further manipulations with pybedtools. However, I guess in this way I would miss the advantage of working with a single stream as above.

(Is there any plan to integrate Unix sort in pybedtools so that one could do something like bedtools.intersect(...).unix_sort(...).merge(...)?)

Thanks for the great suite!

Dario





Ryan Dale

unread,
Sep 23, 2015, 1:01:12 PM9/23/15
to bedtools...@googlegroups.com
Hi Dario -

Your idea of calling GNU sort in a subprocess should work fine.

Without the "stream=True" arguments, your `.intersect()` and `.sort()` calls here already create temp files on disk. So it's not actually a single stream being operated on. This just means that compared to what you're already doing, you shouldn't lose performance by operating on files using GNU sort.

Here's a quick way to do what you're asking, using tempfiles that will be cleaned up when Python exits:

import pybedtools, os
x = pybedtools.example_bedtool('c.gff')
tmpfile = x._tmp()
os.system('sort {0} -k1,1 -nk4,4 > {1}'.format(x.fn, tmpfile))
y = pybedtools.BedTool(tmpfile)
assert x.sort() == y

It might be interesting to integrate sort into pybedtools along with sed, awk, grep et al, but it will take some thought into how to implement it. Suggestions welcome!

-ryan
--
You received this message because you are subscribed to the Google Groups "bedtools-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bedtools-discu...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Dario Beraldi

unread,
Sep 24, 2015, 8:08:06 AM9/24/15
to bedtools-discuss
Thanks Ryan- I'll go for the subprocess solution and thanks for pointing out the use of stream=True. I think integrating programs like sed and grep would be nice, of course, but these can be well emulated with the .each() method.

Dario
Reply all
Reply to author
Forward
0 new messages