groupBy with full line output


Assaf Gordon

Dec 10, 2010, 5:45:56 PM

to bedtools...@googlegroups.com

Hi Aaron,

We're using groupBy a lot, very useful tool.
There's one feature that we need, and I'd like to submit a patch for it:
Currently, the output of groupBy consists only of the grouped columns (plus the operation results), in the order of their grouping rather than the order in which they appear in the file.
This is useful for many purposes (and mimics SQL's "group by" behavior), but sometimes we very much need all the fields from the input file, preferably in their original order (which makes post-processing much easier).

The attached patch provides this feature.

Quick example:
=================
$ cat genes.txt
AAAACCTT 10 Box Patient1
TTAAACGTG 1 Box Patient1
AGGAACTTT 2 Edl3 Patient1
TAGAAAGCCC 50 Edl3 Patient1
AAAGGATCC 4 Edl4 Patient1
TTGTAAGC 4 Edl4 Patient1

## Currently, grouping by gene name and summing column 2 produces the following output:

$ groupBy -g 3 -o sum -c 2 -i genes.txt
Box 11
Edl3 52
Edl4 8

## Useful, but some necessary information from the input file is lost, and needs to be re-joined.

## With the following patch, adding "-full" will output the original line, followed by the groupBy operation(s):

$ ./groupBy -full -g 3 -o sum -c 2 -i genes.txt
AAAACCTT 10 Box Patient1 11
AGGAACTTT 2 Edl3 Patient1 52
AAAGGATCC 4 Edl4 Patient1 8
==================

Obviously, fields that are not 'grouped by' will show only one value (out of the many possible values in the same group); this is the intended behavior.
With this patch, it is always the *first* line of each group that is printed in full (this can be further used to take specific values from a group; I can provide more examples if needed).
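
To make the intended semantics concrete, here is a minimal Python sketch of the behavior described above. It is not the patch itself and not bedtools code: the function name group_by_full and its parameters are made up for illustration, it supports only the "sum" operation on integer values, and it assumes whitespace-separated input that is already ordered so that lines of the same group are adjacent (as in genes.txt).

```python
from itertools import groupby


def group_by_full(lines, group_col, op_col, full=False):
    """Sum op_col within each run of equal group_col values.

    Hypothetical helper for illustration only. Columns are 1-based,
    matching the -g / -c options in the example. Lines of the same
    group are assumed to be adjacent in the input.
    """
    rows = [line.split() for line in lines if line.strip()]
    out = []
    for key, members in groupby(rows, key=lambda r: r[group_col - 1]):
        members = list(members)
        total = sum(int(r[op_col - 1]) for r in members)
        # With full=True, keep the *first* line of the group in its
        # original column order; otherwise print only the group key.
        prefix = members[0] if full else [key]
        out.append(" ".join(prefix + [str(total)]))
    return out


genes = [
    "AAAACCTT 10 Box Patient1",
    "TTAAACGTG 1 Box Patient1",
    "AGGAACTTT 2 Edl3 Patient1",
    "TAGAAAGCCC 50 Edl3 Patient1",
    "AAAGGATCC 4 Edl4 Patient1",
    "TTGTAAGC 4 Edl4 Patient1",
]

for line in group_by_full(genes, group_col=3, op_col=2, full=True):
    print(line)
```

Run on the genes.txt data, this prints the same three lines as the "-full" example above; with full=False it prints the current three-line output (Box 11, Edl3 52, Edl4 8).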