algorithm for detecting sort order

Ben Pfaff

unread,

Dec 17, 2006, 4:44:53 PM12/17/06

to

I have a matrix with C columns and R rows. I suspect that the
rows in the matrix are sorted lexicographically on all C columns
of the array, but I do not know the order of the columns in the
lexicographical ordering, nor do I know whether each column is
sorted into ascending or descending order.

Question 1: What is an efficient way to determine the sort order?
There are C! * 2**C possibilities, so testing each possibility is
not a good idea. Also, it's not always possible to determine the
ordering uniquely: for example, if R is 1 or if all the rows are
identical, we cannot eliminate any possibilities at all.

Question 2: If it's cheaper than answering question 1, is there
an efficient way to determine whether the matrix is actually
sorted at all? That is, can we eliminate any possible sort order
as possibilities?

We can eliminate many possibilities relatively cheaply. If we
have a pair of records in which a variable increases as we move
the matrix, then the lexicographical ordering cannot start with
that column in descending order. Similarly, if we have a pair of
records for a column in which a variable decreases down the
matrix, the ordering can't start with that column in ascending
order. If we have examples of both for a column, that column
can't be the first in the ordering at all.

The actual problem I am trying to solve is this. I have a
program that accepts matrix-shaped data that is supposed to be
grouped on a set of variables, so that all the rows with
particular data values are together. Typically, users do this by
pre-sorting the data file on those variables. Occasionally some
user will forget to sort the data, and I would like to provide a
warning to those users that the data file does not appear to be
sorted. It would be nice to be able to do this without requiring
the user to tell me the ordering of the variables used for
grouping, or the direction of sort for each variable.

Because of this application, it would be fine if the algorithm I
come with sometimes answers "I don't know" instead of "sorted" or
"not sorted".

The rows are dealt with one at a time, so an online algorithm
would be ideal.

I think I can probably come up with some kind of recursion-based
algorithm to solve this, but I'm curious to see whether this is a
well known or previously solved problem. My simple searches on
Google didn't turn anything up, but perhaps I'm not using the
right terms.
--
"A computer is a state machine.
Threads are for people who cant [sic] program state machines."
--Alan Cox

Thad Smith

unread,

Dec 18, 2006, 2:46:00 AM12/18/06

to

Ben Pfaff wrote:
> I have a matrix with C columns and R rows. I suspect that the
> rows in the matrix are sorted lexicographically on all C columns
> of the array, but I do not know the order of the columns in the
> lexicographical ordering, nor do I know whether each column is
> sorted into ascending or descending order.
>
> Question 1: What is an efficient way to determine the sort order?
> There are C! * 2**C possibilities, so testing each possibility is
> not a good idea. Also, it's not always possible to determine the
> ordering uniquely: for example, if R is 1 or if all the rows are
> identical, we cannot eliminate any possibilities at all.

Here is my approach, Ben, assuming I understand the problem.

Find all consistently ordered columns. They are candidates for the primary key.

For each candidate ordering of the first n keys, find all columns which
consistently order tie-breakers. These are candidates for the key n+1, given the
chosen values for the first n. Of course, it could be that there are no rows tied
in the first n keys, so there is no way to know what key n+1 was, but this means
that the key was never used.

The result could be that
1) There are no possible primary key columns
2) There are multiple candidates for primary key and no way to resolve which is
primary.
3) There is a single candidate for primary key, but no way to resolve lesser keys.

Take this down as far as you can. Unless you have many rows, it is likely that
you will not uniquely resolve the key ordering.

The most probable results, I think, will be
1) knowing what the first few (say 1, 2, or 3) keys columns are.
2) or knowing that there is no primary key.

If you have a sufficient amount of data you might be able to determine the first
few keys and then determine that a lesser key cannot exist, but that probably
requires lots of rows with column ties.

--
Thad

dco...@connx.com

unread,

Dec 18, 2006, 5:14:14 AM12/18/06

to

One possiblity is to sample a set of (say) 30 elements from each column
and see if they are sorted (you do not have to choose a consecutive
set, but the positions must be ascending). This should help you to
discard the majority of unsorted columns quickly.

For those columns where the elements are not sorted, no need to examine
further. For those which are sorted, you can choose a larger set or
test the full columns for sort.

When you know which columns are sorted, then when you get a group by
command, you can examine the column list for the 'group by' and see if
any columns are out of sort.

Instead of rejecting the request if you find that the columns are
unsorted, I suggest that you sort them for the user. After all, you
know _exactly_ what is wanted. I would post a message about the
discrepancy that you found and the corrective action taken.

dco...@connx.com

unread,

Dec 18, 2006, 5:19:21 AM12/18/06

to

Sort order detection for a vector:

/* Test to see if segment [Lo, ... Hi] is sorted already */
long ARRAYISSORTED(Etype A[], unsigned long Lo, unsigned
long Hi)
{
unsigned long i;
unsigned int ascending;
if (Lo == Hi)
return 1;
if ((ascending = (LE(A[Lo], A[Lo + 1]))))
for (i = Lo + 1; i < Hi; i++) {
/* Is the next item bigger than this one? */
if (GT(A[i], A[i + 1])) {
/* Were all of them the same size up to here? */
if (EQ(A[i], A[Lo]))
/* Maybe the partition is reversed? */
if (ARRAYISREVERSED(A, i, Hi)) {
/* It was reversed. Invert it and we are done!
*/
REVERSEARRAY(A, Lo, Hi);
return 1;
} else { /* Nope. The partition is not reversed
*/
return 0;
}
else
return 0;
}
}
/* not ascending */
else if (ARRAYISREVERSED(A, Lo, Hi)) {
/* It was reversed. Invert it and we are done! */
REVERSEARRAY(A, Lo, Hi);
return 1;
} else
return 0;
return 1;
}

/* Test to see if segment [Lo, ... Hi] is reverse sorted already */
long ARRAYISREVERSED(Etype A[], unsigned long Lo, unsigned
long Hi)
{
unsigned long i;
for (i = Lo; i < Hi; i++) {
if (LE(A[i], A[i + 1]))
return 0;
}
return 1;
}
/*
Reverse the order of a partition.
*/
void REVERSEARRAY(Etype * A, unsigned long Lo, unsigned long
Hi)
{
Etype *r = A + Lo;
for (r = A + (Hi - Lo); A < r; A++, r--) {
Etype Tmp = *A;
*A = *r;
*r = Tmp;
}
}

Ben Pfaff

unread,

Dec 18, 2006, 11:54:26 AM12/18/06

to

Thad Smith <Thad...@acm.org> writes:

R> Ben Pfaff wrote:
>> I have a matrix with C columns and R rows. I suspect that the
>> rows in the matrix are sorted lexicographically on all C columns
>> of the array, but I do not know the order of the columns in the
>> lexicographical ordering, nor do I know whether each column is
>> sorted into ascending or descending order.
>>
>> Question 1: What is an efficient way to determine the sort order?
>> There are C! * 2**C possibilities, so testing each possibility is
>> not a good idea. Also, it's not always possible to determine the
>> ordering uniquely: for example, if R is 1 or if all the rows are
>> identical, we cannot eliminate any possibilities at all.
>
> Here is my approach, Ben, assuming I understand the problem.
>
> Find all consistently ordered columns. They are candidates for the primary key.
>
> For each candidate ordering of the first n keys, find all columns
> which consistently order tie-breakers. These are candidates for the
> key n+1, given the chosen values for the first n. Of course, it could
> be that there are no rows tied in the first n keys, so there is no way
> to know what key n+1 was, but this means that the key was never used.

This looks like a viable approach to the problem as stated. I wish,
though, that I'd stated more prominently that I'm looking for an
online solution (that is, one that only examines one pair
of rows at a time, in sequential order, and only gets one pass
over the data).

> [...] Unless you have many rows, it is likely that you will not

> uniquely resolve the key ordering.

The program I'm working on is designed to process arbitrary
amounts of data, so that R >= 1e7 or even 1e8 is possible
(although some runs have only a few). Also, most of the time I'd
guess that C <= 3, and I know for sure that C <= 8. Overall, I'd
suspect that there's usually a unique ordering.

dco...@connx.com writes:

> One possiblity is to sample a set of (say) 30 elements from each column
> and see if they are sorted (you do not have to choose a consecutive
> set, but the positions must be ascending). This should help you to
> discard the majority of unsorted columns quickly.
>
> For those columns where the elements are not sorted, no need to examine
> further. For those which are sorted, you can choose a larger set or
> test the full columns for sort.

That's an interesting idea too, although again it doesn't work
well with my requirement for online solution.

I think that these answers add up to what I suspected: there's no
"magic" solution, but plenty of viable approaches. I don't think
I'll have trouble with the problem.

> When you know which columns are sorted, then when you get a group by
> command, you can examine the column list for the 'group by' and see if
> any columns are out of sort.
>
> Instead of rejecting the request if you find that the columns are
> unsorted, I suggest that you sort them for the user. After all, you
> know _exactly_ what is wanted. I would post a message about the
> discrepancy that you found and the corrective action taken.

That is an interesting approach, but for my situation I think it
is problematic. First, by the time that I've figured out that
something is out of order, I've already created some output
(perhaps a substantial amount) and might have difficulty backing
out of that situation. Second, I don't think I want to presume
that the user was wrong; perhaps he merely grouped together his
data in an unconventional manner, instead of sorting it. I'd
rather alert him to the possibility of a problem, so that he can
insert the proper sorting instruction into the command sequence
if necessary.
--
"Ho ho ho. I _so_ enjoy making a fool out of myself."
--Linus, on Christmas Day, 2000

dco...@connx.com

unread,

Dec 18, 2006, 1:02:45 PM12/18/06

to

I think that the problem is intractible.

ColA ColB ColC
------- ------- -------
1 A aa
2 B aa
3 C cc
3 C dd
4 G dd

In the above set, is the set ordered by Column A first or by Column C?
If by column C, is the second most significant column B or A?
Is the data ordered by Column B as the most significant, but out of
order?

I think that they must tell you what the ordering is supposed to be or
there is no unambiguous way to decide.

Also, you need to know if the ordering is ascending or descending for
each of the columns.

Ben Pfaff

unread,

Dec 18, 2006, 1:50:32 PM12/18/06

to

dco...@connx.com writes:

> I think that the problem is intractible.
>
> ColA ColB ColC
> ------- ------- -------
> 1 A aa
> 2 B aa
> 3 C cc
> 3 C dd
> 4 G dd
>
> In the above set, is the set ordered by Column A first or by Column C?
> If by column C, is the second most significant column B or A?
> Is the data ordered by Column B as the most significant, but out of
> order?

Sure it's intractable if you look at it in the most general way.
There are 2**C * C! possibilities, which is superexponential.
But it's not necessary to solve it in the most general way. If I
can just determine in *common* cases that the data cannot be
sorted according to any possible sort order, then that will allow
me to provide a useful warning to my users.

Here's one simple idea that I'm considering. For each column,
categorize its values as ascending, descending, constant, or
other. An ascending column is one whose values do not decrease
as we examine successive rows; a descending column is one whose
values do not increase; a constant column is both ascending and
descending, that is, its values do not change at all; all other
patterns are "other".

Then: the data *may* be sorted if there is at least one ascending
column, or if there is at least one descending column, or if all
the columns are constant. Otherwise, the data is not sorted and
a warning should be emitted.

This won't catch all the non-sorted cases, but it should catch
many of them. It also will be easy to implement and won't ever
warn incorrectly. Right?
--
I love deadlines.
I love the whooshing noise they make as they go by.
--Douglas Adams

dco...@connx.com

unread,

Dec 18, 2006, 8:16:52 PM12/18/06

to

That seems reasonable. I think it would be better if you knew the
column precedence, along with whether each column is supposed to be
ascending, descending or "don't care".

On the other, other hand an educated guess is probably better than
reckless abandon.
Otherwise it can never

Ben Pfaff

unread,

Dec 20, 2006, 12:57:15 PM12/20/06

to

dco...@connx.com writes:

> That seems reasonable. I think it would be better if you knew the
> column precedence, along with whether each column is supposed to be
> ascending, descending or "don't care".

I came up with a more advanced approach. Here's the code and
test program:

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* We are portably limited to 16 columns because of the
minimum number of bits in an unsigned int. */
#define MAX_COLUMNS 16

/* Used to check whether a set of data is lexicographically
sorted on a specified set of up to MAX_COLUMNS columns. */
struct order_checker
{
/* Element 0 in each array represents the most-sigificant
sort column, element 1 the next most-significant, and so
on.

A 1 in bit position X in the increasing (resp. decreasing)
array indicates that variable X may be decreasing
(resp. increasing), that is, that we have not yet observed
it increase (resp. decrease) from one case to the next.
If both bits in a given position are 1, then the variable
is thus far constant; if neither bit is 1, then the
variable has been observed to both increase and
decrease. */
unsigned int decreasing[MAX_COLUMNS];
unsigned int increasing[MAX_COLUMNS];
size_t n;
};

/* Output from the order checker. */
enum order_checker_output
{
MAYBE_SORTED, /* The data may be sorted. */
NOT_SORTED /* The data is not sorted. */
};

/* Initializes OC to check the order of N-column data. */
void
order_checker_init (struct order_checker *oc, size_t n)
{
if (n > 0 && n <= MAX_COLUMNS)
{
size_t i;
for (i = 0; i < n; i++)
oc->decreasing[i] = oc->increasing[i] = (1u << n) - 1;
}
oc->n = n;
}

/* Returns true if exactly one bit is set in A and B, taken
together, false otherwise. */
static inline bool
exactly_one_bit_set (unsigned int a, unsigned int b)
{
unsigned int c = a | b;
return (a == 0 || b == 0) && (c & (c - 1)) == 0;
}

/* Adds the given ROW to order checker OC. Each of the N values
in CHANGES indicates that the given column was observed to
increase, stay constant, or decrease from one row to a later
row according to whether the given value is positive, zero, or
negative. Returns MAYBE_SORTED if the data could be sorted or
NOT_SORTED if it is definitely not sorted. */
enum order_checker_output
order_checker_iterate (struct order_checker *oc, int *row, size_t n)
{
/* We use progressive elimination of possible sort orders. We
first eliminate possibilities for the most significant sort
column. Then, when only one possibility is left, we
eliminate possibilities for the next most significant
column, and so on. If every possibility for any given
position is eliminated, then the data cannot be sorted.

For n columns, there are n! * 2**n possible sort orders,
which quickly becomes impractical as n increase. This
algorithm uses O(n) time per room and O(n) total space for n
columns. It will not flag every unordered set as unsorted,
but in practice it seems to work pretty well. */
assert (n == oc->n);
if (n > 0 && n <= MAX_COLUMNS)
{
unsigned int increasing_mask;
unsigned int decreasing_mask;
unsigned int constant_mask;
size_t i;

/* Form masks to eliminate possibilities.
If the value increased, it's not a decreasing column.
If the value decreased, it's not an increasing column. */
increasing_mask = decreasing_mask = 0;
for (i = 0; i < n; i++)
{
if (row[i] >= 0)
increasing_mask |= 1u << i;
if (row[i] <= 0)
decreasing_mask |= 1u << i;
}
constant_mask = increasing_mask & decreasing_mask;

for (i = 0; i < n; i++)
{
unsigned int inc;
unsigned int dec;
unsigned int sort_column;

/* Eliminate possibilities. */
inc = oc->increasing[i] &= increasing_mask;
dec = oc->decreasing[i] &= decreasing_mask;

/* If there are no possible sort orders, the input is
not sorted. */
if (inc == 0 && dec == 0)
return NOT_SORTED;

/* If there is exactly one sort order left, then we can
consider sorting the next most significant column;
otherwise, we bail to avoid requiring more than
O(columns) memory or time. */
if (!exactly_one_bit_set (inc, dec))
break;

/* We only learn something about sorting the next
most significant column if the current column is
constant. */
sort_column = inc | dec;
if (!(sort_column & constant_mask))
break;

/* Make sure that less significant columns do not
consider the current column. */
increasing_mask &= ~sort_column;
decreasing_mask &= ~sort_column;
}
}
return MAYBE_SORTED;
}

/* Prints the internal state of order checker OC, which has N
columns. */
void
order_checker_print (const struct order_checker *oc, size_t n)
{
assert (n == oc->n);
if (n > 0 && n <= MAX_COLUMNS)
{
unsigned int ignore;
size_t i;

ignore = 0;
for (i = 0; i < n; i++)
{
unsigned inc = oc->increasing[i];
unsigned dec = oc->decreasing[i];
size_t j;

for (j = 0; j < n; j++)
{
unsigned int column = 1u << j;
const char *name;

if (ignore & column)
name = "";
else if (inc & column)
name = dec & column ? "constant" : "increasing";
else
name = dec & column ? "decreasing" : "mixed";

printf ("%12s", name);
}

if (inc == 0 && dec == 0)
{
printf (" (inconsistent)\n");
break;
}
printf ("\n");

if (!exactly_one_bit_set (inc, dec))
break;
ignore |= inc | dec;
}
}
else
printf ("(no internal state)\n");
}

/* Test program. */

#include <stdlib.h>
#include <string.h>

size_t
read_numbers (int output[MAX_COLUMNS])
{
char buffer[128];
char *p;
size_t n;

if (fgets (buffer, sizeof buffer, stdin) == NULL)
return 0;

for (p = strtok (buffer, " \t\n"), n = 0; p != NULL && n < MAX_COLUMNS;
p = strtok (NULL, " \t\n"), n++)
{
output[n] = strtol (p, NULL, 10);
printf ("%12d", output[n]);
}
printf ("\n");

return n;
}

int
main (void)
{
struct order_checker oc;
int previous[MAX_COLUMNS];
size_t n;

n = read_numbers (previous);
order_checker_init (&oc, n);

for (;;)
{
int current[MAX_COLUMNS];
int changes[MAX_COLUMNS];
enum order_checker_output output;
size_t i, n2;

n2 = read_numbers (current);
if (n2 == 0)
break;
else if (n != n2)
{
printf ("wrong number of columns, try again\n");
continue;
}

for (i = 0; i < n; i++)
changes[i] = (current[i] > previous[i] ? 1
: current[i] < previous[i] ? -1
: 0);
output = order_checker_iterate (&oc, changes, n);

order_checker_print (&oc, n);
if (output == NOT_SORTED)
{
printf ("not sorted\n");
return 0;
}
printf ("\n");

memcpy (previous, current, sizeof current);
}
}

Example output (the output embeds the input, for readability):

1 1 2
1 1 3
constant constant increasing

1 2 3
constant increasing increasing

1 2 2
constant increasing mixed

1 1 2
constant mixed mixed

1 3 4
constant mixed mixed

2 3 4
increasing mixed mixed
constant constant

2 5 6
increasing mixed mixed
increasing increasing

2 3 6
increasing mixed mixed
mixed increasing
decreasing

2 4 6
increasing mixed mixed
mixed increasing
mixed (inconsistent)
not sorted

This does pretty well. I generated a file with 3 columns in
sorted order, using the Perl program
for $i(0...5){for $j(0...1){for $k(0...2){print "$i $j $k\n"}}}
and then repeatedly tried it after swapping just a single pair of
rows with another little program:
#! /usr/bin/perl
my (@lines) = <STDIN>;
my ($n) = defined ($ARGV[0]) ? $ARGV[0] : 1;
for (1...$n) {
my ($i) = int (rand (scalar (@lines)));
my ($j) = int (rand (scalar (@lines)));
($lines[$i], $lines[$j]) = ($lines[$j], $lines[$i]);
}
print @lines;
The great majority of the time, it reported the data as
inconsistently sorted.

I'm feeling clever now.
--
Ben Pfaff
email: b...@cs.stanford.edu
web: http://benpfaff.org