How to sort 2-D array by multiple columns?

sgih...@gmail.com

unread,

Mar 28, 2017, 4:31:11 PM3/28/17

to

Can someone suggestion an algorithm to sort a two dimensional array of data
in selected sort order by column one, then by column two, then by column three . . . then by column n?

Essentially I am looking for a routine to duplicate the
sorting capability of Microsoft Excel. Here is an example which has three
columns that need to be sorted. The algorithm should be able to handle up to N
columns.

The following table was sorted as follows:

Sort by Column 1 by value, Smallest to Largest
Then by Column 2 by value, Smallest to Largest
Then by Column 3 by value, Smallest to Largest

Each row retains all its values, only the position of the row is changed
during sort.

Thanks....

Column 1 Column 2 Column 3 Column 4
Row 1 3.0 20.0 31.0 1.0
Row 2 3.0 20.0 33.0 2.0
Row 3 3.0 20.0 35.0 3.0
Row 4 3.0 20.0 36.0 4.0
Row 5 3.0 22.0 31.0 5.0
Row 6 3.0 22.0 33.0 6.0
Row 7 3.0 22.0 35.0 7.0
Row 8 3.0 22.0 36.0 8.0
Row 9 7.0 20.0 31.0 9.0
Row 10 7.0 20.0 33.0 10.0
Row 11 7.0 20.0 35.0 11.0
Row 12 7.0 20.0 36.0 12.0
Row 13 7.0 22.0 31.0 13.0
Row 14 7.0 22.0 33.0 14.0
Row 15 7.0 22.0 35.0 15.0
Row 16 7.0 22.0 36.0 16.0
Row 17 8.0 20.0 31.0 17.0
Row 18 8.0 20.0 33.0 18.0
Row 19 8.0 20.0 35.0 19.0
Row 20 8.0 20.0 36.0 20.0
Row 21 8.0 22.0 31.0 21.0
Row 22 8.0 22.0 33.0 22.0
Row 23 8.0 22.0 35.0 23.0
Row 24 8.0 22.0 36.0 24.0

Beliavsky

unread,

Mar 28, 2017, 4:51:32 PM3/28/17

to

On Tuesday, March 28, 2017 at 4:31:11 PM UTC-4, sgih...@gmail.com wrote:
> Can someone suggestion an algorithm to sort a two dimensional array of data
> in selected sort order by column one, then by column two, then by column three . . . then by column n?
>
> Essentially I am looking for a routine to duplicate the
> sorting capability of Microsoft Excel. Here is an example which has three
> columns that need to be sorted. The algorithm should be able to handle up to N
> columns.
>
> The following table was sorted as follows:
>
> Sort by Column 1 by value, Smallest to Largest
> Then by Column 2 by value, Smallest to Largest
> Then by Column 3 by value, Smallest to Largest
>
> Each row retains all its values, only the position of the row is changed
> during sort.

A quick-and-dirty approach that could work, depending on the ranges of the columns, is to create a column that gives much more weight to columns that should be sorted first, for example 1e6*column_1 + 1e3*column2 + column3, and sort by it. It would be more careful to compute ranks of each columns and create a weighted rank.

The best approach is to create subsets based on column 1, then create sub-subsets based on column 2, and to sort the sub-subsets baesd on column 3.
It is not conceptually hard but requires bookkeeping. I wonder if a recursive subroutine could be used.

sgih...@gmail.com

unread,

Mar 28, 2017, 5:09:20 PM3/28/17

to

On Tuesday, March 28, 2017 at 1:51:32 PM UTC-7, Beliavsky wrote:
> On Tuesday, March 28, 2017 at 4:31:11 PM UTC-4, sgih...@gmail.com wrote:
> > Can someone suggestion an algorithm to sort a two dimensional array of data
> > in selected sort order by column one, then by column two, then by column three . . . then by column n?
> >
> > Essentially I am looking for a routine to duplicate the
> > sorting capability of Microsoft Excel. Here is an example which has three
> > columns that need to be sorted. The algorithm should be able to handle up to N
> > columns.
> >
> > The following table was sorted as follows:
> >
> > Sort by Column 1 by value, Smallest to Largest
> > Then by Column 2 by value, Smallest to Largest
> > Then by Column 3 by value, Smallest to Largest
> >
> > Each row retains all its values, only the position of the row is changed
> > during sort.
>
> A quick-and-dirty approach that could work, depending on the ranges of the columns, is to create a column that gives much more weight to columns that should be sorted first, for example 1e6*column_1 + 1e3*column2 + column3, and sort by it. It would be more careful to compute ranks of each columns and create a weighted rank.

I had the same idea but could not come up with a reliable way to calculate a value for a sort column that would work for all data values.

Beliavsky

unread,

Mar 28, 2017, 5:41:44 PM3/28/17

to

If r2 is the range of column 2 (the largest value minus the smallest value) and r3 is then range of column 3, you could sort by

x1*(r2+1)*(r3+1) + x2*(r3+1)

sgih...@gmail.com

unread,

Mar 28, 2017, 6:07:56 PM3/28/17

to

>
> If r2 is the range of column 2 (the largest value minus the smallest value) and r3 is then range of column 3, you could sort by
>
> x1*(r2+1)*(r3+1) + x2*(r3+1)

Thanks how would you expand this to sort five columns?

sgih...@gmail.com

unread,

Mar 28, 2017, 6:26:28 PM3/28/17

to

I tried your algorithm and it does not appear to work with all ranges of data

herrman...@gmail.com

unread,

Mar 28, 2017, 7:26:08 PM3/28/17

to

One way is to use a stable sort algorithm. A stable sort
does not change the order of items with the same value.
You then sort the whole file in the least significant field,
then the next, and finally the most significant.

However, the faster sort algorithms, such as Quicksort,
are not stable. C's qsort(), which may or may not actually
use quicksort, is not meant to imply a stable sort.

The Java sort routines are documented as stable, and commonly
use a merge sort algorithm that is stable, and also fast.
(It is O(n log n), though maybe not as fast as Quicksort.)

The other way is to use a sort routine where you supply the
comparison function, and have the function compare elements
as appropriate. Returning a value to indicate greater,
equal, or lesser based first on the most significant, and
then if equal, lesser significant members.

michael siehl

unread,

Mar 28, 2017, 7:30:14 PM3/28/17

to

Personally, I use IMSL Fortran routine SROWR for such sorting, resp. grouping of data:
http://docs.roguewave.com/imsl/fortran/6.0/stat/default.htm?turl=srowr.htm
See the example there which does exactly what you are looking for (except that they use only three columns and do only sort by columns (keys) 2 and 3, which can easily be changed by the INDKEY argument). If you don't have access to the IMSL Fortran library and don't want to implement a similar algorithm yourself, look for a similar routine in one of the free Fortran numerical libraries. Maybe someone else can point to some similar routine?

best regards

Ron Shepard

unread,

Mar 28, 2017, 9:17:15 PM3/28/17

to

On 3/28/17 3:31 PM, sgih...@gmail.com wrote:
> Can someone suggestion an algorithm to sort a two dimensional array of data
> in selected sort order by column one, then by column two, then by column three . . . then by column n?

The way I usually do this is to perform an indexed sort, where you
rearrange integer index values rather than the underlying data. This
eliminates potentially expensive swaps when the number of columns is
large. With an indexed sort, the only data the sort routine actually
sees and rearranges is the integer array. You also pass in a function
argument that does the comparisons between just two elements, given
their row indexes. You write this function any way you want. In the case
you have here, you would compare the first elements, then the second
elements, and so on. The function typically returns -1, 0, or +1 as a
result of the comparison to indicate how the two values are ordered.

The underlying data does not change during this operation. Upon return,
the integer index array determines the sorted order. You can go
backwards or forwards through this array as appropriate. If necessary,
you can use the index array to reorder the actual data. This deferred
operation only requires O(m) memory accesses, rather than the
O(m*log(m)) or O(m^2) memory accesses if the data were sorted directly
rather than through the index array.

This approach works fine all the way back to f77 and before. With modern
fortran, the same approach could be done with, for example, pointers
rather than integers.

$.02 -Ron Shepard

herrman...@gmail.com

unread,

Mar 28, 2017, 9:43:41 PM3/28/17

to

On Tuesday, March 28, 2017 at 6:17:15 PM UTC-7, Ron Shepard wrote:

> On 3/28/17 3:31 PM, sgihal..@gmail.com wrote:
> > Can someone suggestion an algorithm to sort a two dimensional array of data
> > in selected sort order by column one, then by column two, then by column three . . . then by column n?

> The way I usually do this is to perform an indexed sort, where you
> rearrange integer index values rather than the underlying data. This
> eliminates potentially expensive swaps when the number of columns is
> large. With an indexed sort, the only data the sort routine actually
> sees and rearranges is the integer array. You also pass in a function
> argument that does the comparisons between just two elements, given
> their row indexes. You write this function any way you want. In the case
> you have here, you would compare the first elements, then the second
> elements, and so on. The function typically returns -1, 0, or +1 as a
> result of the comparison to indicate how the two values are ordered.

Yes, many do that with C's qsort. Often enough, I find the problem
I have is small enough, and the indirection enough extra work, to
just sort the array.

> The underlying data does not change during this operation. Upon return,
> the integer index array determines the sorted order. You can go
> backwards or forwards through this array as appropriate. If necessary,
> you can use the index array to reorder the actual data. This deferred
> operation only requires O(m) memory accesses, rather than the
> O(m*log(m)) or O(m^2) memory accesses if the data were sorted directly
> rather than through the index array.

Yes it is O(m), and fast if random access is fast.

As Knuth notes in "Sorting and Searching" that isn't true
for disk files. Once you have the sorted index, moving the
data records around is not easy.

> This approach works fine all the way back to f77 and before. With modern
> fortran, the same approach could be done with, for example, pointers
> rather than integers.

As with integration and minimization routines, the old favorite
was to have the routine call a comparison routine, and pass
the data through COMMON.

Donald Arseneau

unread,

Apr 3, 2017, 11:40:31 PM4/3/17

to

sgih...@gmail.com writes:

> Can someone suggestion an algorithm to sort a two dimensional array of data
> in selected sort order by column one, then by column two, then by
> column three .

You should be more explicit with what you mean by "then"!
Your example doen't suggest any meaning to me, and this:

> Each row retains all its values, only the position of the row is changed
> during sort.

is incomprehensible to me.

Taken literally, the answer is to sort by colum1. Then sort by
column 2 etc. But then why would you ask the question?

I will guess that you want to sort the rows of data, based
first on column 1 values, but in the case of ties, sort those
based on couln 2 values; etc. As if each row was a word, and
each column contained a letter of that word.

If that's the case, you want to sort the rows four times in
succession, using a STABLE sorting algorithm. Sort the rows
first based on column 4, THEN sort the result based on column
3, then on column 2, and finally sort based on column 1. Use
an array of index numbers to sort the rows without actually
moving the values around in memory. A good stable sorting
algorithm (https://en.wikipedia.org/wiki/Sorting_algorithm)
is the merge sort.

--
Donald Arseneau as...@triumf.ca

campbel...@gmail.com

unread,

Apr 4, 2017, 8:13:53 PM4/4/17

to

You should try an indexed sort approach and create a logical function that compares the order of 2 rows "Logical function is_I_gt_J ( I, J, ...)"
Replace the order test in the indexed sort and you have your solution.

Providing a flexible sort approach will depend on how you can describe the rules for your logical function. I have applied this sort approach to a text file editor, describing the sort as multiple triples of column ranges and direction, such as:
4,7,0
11,15,-1
-1
This would first sort the rows of a file by the characters in columns 4 to 7, then a sub sort of characters in columns 11 to 15 in reverse.
Once you have defined the rule for the sort and the number of rules (triples), the logical function can be defined and the sort proceed.
I applied this approach to an indexed quick sort, but unless you have a large data set, any indexed sort approach will work.

campbel...@gmail.com

unread,

Apr 4, 2017, 8:17:38 PM4/4/17

to

I should have included; If the 2 rows are identical, I finally test if I > J, to retain the original sort order, as it is annoying with quick sort when sorting multiple values that are the same that their order is random.