Pipelined sorting algorithms...

Eric Crabill

unread,

Mar 22, 2002, 12:45:34 PM3/22/02

to

Hi,

I am looking to build a circuit for sorting small data sets,
and am hoping someone here might have some pointers for where
I should start looking for algorithms to do it.

My desire is to build a circuit to do the following:

Every cycle, 16 pieces of data are provided as input.
Some number of cycles later, the data set is provided
at the output, sorted...

The latency isn't a big issue, as long as it is constant,
and I can provide a new data set to the circuit every cycle.

Thanks,
Eric

John_H

unread,

Mar 22, 2002, 2:26:43 PM3/22/02

to

Sorts aren't easy.

Do a google search for sorting algorithms and you'll find all sorts of
sorts out there mostly optimized for computing. The parallel algorithms
may be harder to find - I haven't gotten a good reference yet.

What's the design data set size and clock frequency? For 16 data sets
in and 16 data sets out, that's a bunch of pins! Then do you just want
the data output or the indicies to know what the channel order was?
It's a mess in the general sense.

A quick stab for me would be an insertion sort (easy to understand from
a quick google search) if your clock rate is too fast for multiplexing
the comparisons. The insertion sort would need about 14 SRL sets,
(16*15)/2=120 comparators, 30 20-1 and 105 3-1 registered multiplexers.
Uuuuuhgly. But it may be better than the bubble sort's 113 comparators
(est) and 113 2-1 swap registers (complementary 2-1 multiplex registers)
with 14 register sets mixed in the middle.

Have fun with this!

Eric Crabill

unread,

Mar 22, 2002, 3:01:17 PM3/22/02

to John_H

Hi,

Each data set consists of 16 pieces of data, each piece of data is
32-bits
(but I'm only sorting on a specific 4-bit field in the data, the rest of
the data just goes along for the ride...)

When I initially searched the web, I found all sorts of exotic methods,
I'm just looking for something simple to implement, and fast (~100 MHz),
and don't care too much about the latency or size of the resulting
design.

Thanks!
Eric

Fabien Todescato

unread,

Mar 22, 2002, 3:11:51 PM3/22/02

to

The book "Introduction to Algorithms" has a special section devoted to
sorting networks. You may find the course based on the book on the MIT
web pages at http://web.mit.edu/is/courseweb/courses.html.

Best regards, Fabien

Kelly Hall

unread,

Mar 22, 2002, 5:35:54 PM3/22/02

to

On Fri, 22 Mar 2002 12:01:17 -0800, Eric Crabill <eric.c...@xilinx.com> wrote:
> Each data set consists of 16 pieces of data, each piece of data is
> 32-bits
> (but I'm only sorting on a specific 4-bit field in the data, the rest of
> the data just goes along for the ride...)

If you're sorting on a such a narrow field (4-bit sounds narrow to me),
why not go with a radix sort? The 4-bit field indexes one of 16 FIFOs,
where the data value is inserted (16 pieces, so 16 or 32 clocks). The
following state then dequeues the FIFOs one at a time and stores the
values into memory (or whereever they go) (maybe 48 clocks).

Kelly

John_H

unread,

Mar 22, 2002, 5:56:30 PM3/22/02

to

Sounds like the best approach for the parts he's dealing with. The sort algorithms I
looked at suggest the one-step sort would be a bin sort (sorting into bins, not
"binary" sort) as opposed to a multistep approach that illustrated the radix sort
mechanism (e.g. first sort into 4 bins based on two least significant bits then
resort these in order based on the two most significant bits). It seems like one
would need different sets of FIFOs needed for each incoming set to run in parallel
until the sort is complete - can't quite use one FIFO set to cover a new set ever 10
ns.

A more parallel approach might give similar results to the radix without the serial
nature: generate a field of 16x16 bits where each input identifies its rank out of
16 as a bit select in a column. The indicies are then ordered scanning the rows in
order. It's not exceedingly simple to grab 16 indicies based on 256 bits, but it's
conceptually straightforward.

Kevin Neilson

unread,

Mar 22, 2002, 7:51:26 PM3/22/02

to

I'm not sure how to sort the values, but I'd manipulating only the 4-bit
fields and sending the rest of the data through SRL delay lines while you're
sorthing the fields. That will save some gates.

"Eric Crabill" <eric.c...@xilinx.com> wrote in message
news:3C9B8D8D...@xilinx.com...

Philip Freidin

unread,

Mar 23, 2002, 4:11:07 AM3/23/02

to

On Fri, 22 Mar 2002 09:45:34 -0800, Eric Crabill <eric.c...@xilinx.com>
wrote:
>

The following link will take you to some research on using
Genetic Programming, on FPGAs, to create sorting systems.

While this is not what you want to do, the paper (select
the "cached" PDF in the top right corner of the page) has
good background information.

Then, get out your Volume 3 of Knuth's "The Art of Computer
Programming (Sorting and Searching)", and turn to page 231,
and there you will see an n=16 sort network, with 61
comparators, and a worstcase chain depth of 9. Read the
adjacent 722 pages for a good introduction to the subject.

Philip Freidin

Philip Freidin
Fliptronics

Christopher Saunter

unread,

Mar 25, 2002, 6:17:48 AM3/25/02

to

Eric Crabill (eric.c...@xilinx.com) wrote:

: Hi,

: I am looking to build a circuit for sorting small data sets,
: and am hoping someone here might have some pointers for where
: I should start looking for algorithms to do it.

: Every cycle, 16 pieces of data are provided as input.

: Some number of cycles later, the data set is provided
: at the output, sorted...

< stated elsewhere - 32 bit words, sorted off a nibble >

Hi,

I got thinking about sorting, and this is what I came up with.
It should sort data at high speed, and accepts all words in
a single cycle in parallel, and spits them out in parallel,
sorted, a fixed number of cycles later. I think with a bit
of effort in layout it would run very fast as well. However,
it makes rather less efficient use of a Xilinx device than
the FIFO solution, as it uses flip-flops for data storage, not
LUTs.

Please lay into any problems it has, I'm out to learn!

---
The design consists of a number of pipeline stages, each of which
partially sorts the data by sorting pairs of values.

We have N words. Conceptually these are split into N/2 pairs
of words, each of which are sorted based on the 4 bit key.
This could be done with a 4 bit magnitude comparitor and two 2:1
mux's - see my noddy diagram at
http://www.dur.ac.uk/christopher.saunter/sort1.bmp

So the first stage of the sorting pipeline is as follows:

w0 > larger (w0,w1) = v0
w1 > smaller(w0,w1) = v1

w2 > larger (w2,w3) = v2
w3 > smaller(w2,w3) = v3

w4 > larger (w4,w5) = v4
w5 > smaller(w4,w5) = v5

w6 > larger (w4,w5) = v6
w7 > smaller(w4,w5) = v7

The next stage is similar, although it sorts pairs consisting of
(v1, v2), (v3,v4), (v5,v6) and optionally (v0,v7) (I say optinally
- this sort of folds the flat rectangular shape of the pipeline
into a cylinder - I think doing this will save you one pipeline stage
but it might make getting a fast design out of P&R harder?)

So if we watch the effect of this on some worst case data...
e represents a sort of (w0, w1), (w2,w3) etc and o represents
a sort of (w1,w2), (w3,w4) etc:

e o e o e o e
1 > 2 > 7 > 7 > 7 > 7 > 7 > 8
2 > 1 > 4 > 4 > 6 > 6 > 8 > 7
3 > 4 > 1 > 6 > 4 > 8 > 6 > 6
4 > 3 > 6 > 1 > 8 > 4 > 5 > 5
5 > 6 > 3 > 8 > 1 > 5 > 4 > 4
6 > 5 > 8 > 3 > 5 > 1 > 3 > 3
7 > 8 > 5 > 5 > 3 > 3 > 1 > 2
8 > 7 > 2 > 2 > 2 > 2 > 2 > 1

So if I have this right, you should be able to sort N words in
N-1 pipeline stages, requiring N(N-1) word registers.

You want 32 bit words, 16 words, so thats 32.16.15 = 7680 registers.
That's doable in larger devices.

If you can run the sorter faster that the data rate, then
you can time mux it, by reducing the number of stages and looping
datra back through it repeatedly. Infact, using 3:1 mux's and a bit
of logic, you could probably reduce it to just one stage.
However, you said 100MHz clock...

This approach should sort data at a scary rate, although getting that much
data into and off the chip might be interesting....

Cheers,
Chris Saunter

: The latency isn't a big issue, as long as it is constant,

Jonathan Bromley

unread,

Mar 25, 2002, 7:27:08 AM3/25/02

to

In article <3C9BB6A0...@mail.com>, John_H <johnha...@mail.com>
writes

[a radix sort]

>Sounds like the best approach for the parts he's dealing with.

[snip]

Radix sorts are good if the key is small (as is yours) but in the
general case of a larger key there are other hardware-friendly
methods.

The classic source for this kind of material is Donald Knuth's
masterly work "The Art of Computer Programming"; volume 3 covers
sorting and searching. He describes a method due to Batcher, which
Knuth calls "merge-exchange" sort. It maps really nicely on to
hardware because:
* it uses lots of instances of a single, simple logic element
(compare two values, swap them if they're out of order)
* it is very pipeline-friendly; the software algorithm
naturally suggests a pipelined hardware implementation
* although it's fairly extravagant of hardware, as any
fully-parallel sort is sure to be, it is not as expensive
as some of the better known methods

I have a little C program that "designs" Batcher sorting
networks for arbitrary numbers of input words: source code
by email on request! Here's its output for a network to sort
six numbers into order. The data comes in at the top and
falls out, sorted, at the bottom. The "|-|-|-|-|-|" lines
represent places where a pipeline register could go.
Strings like "O======O" represent a compare/swap module.
The numbers on the left represent steps in the algorithm
given by Knuth, and use the same notation - they only make
sense if you read his description.

View this in a monospaced font!!!!
=========================================
p,q,r,d = 4, 4, 0, 4 |-|-|-|-|-|
O=======O |
| O=======O
p,q,r,d = 2, 4, 0, 2 |-|-|-|-|-|
O===O | | |
| O===O | |
p,q,r,d = 2, 2, 2, 2 |-|-|-|-|-|
| | O===O |
| | | O===O
p,q,r,d = 1, 4, 0, 1 |-|-|-|-|-|
O=O O=O O=O
p,q,r,d = 1, 2, 1, 3 |-|-|-|-|-|
| O=====O |
p,q,r,d = 1, 1, 1, 1 |-|-|-|-|-|
| O=O O=O |

Sorted 6 elements using:
6 levels of logic;
12 compare/exchange modules.
=========================================
--
Jonathan Bromley
DOULOS Ltd.
Church Hatch, 22 Market Place, Ringwood, Hampshire BH24 1AW, United Kingdom
Tel: +44 1425 471223 Email: jonathan...@doulos.com
Fax: +44 1425 471573 Web: http://www.doulos.com

**********************************
** Developing design know-how **
**********************************

This e-mail and any attachments are confidential and Doulos Ltd. reserves
all rights of privilege in respect thereof. It is intended for the use of
the addressee only. If you are not the intended recipient please delete it
from your system, any use, disclosure, or copying of this document is
unauthorised. The contents of this message may contain personal views which
are not the views of Doulos Ltd., unless specifically stated.

Christopher Saunter

unread,

Mar 25, 2002, 7:59:10 AM3/25/02

to

Okay, so playing around with my sugestion in a -4 virtex, it'll take a lot
of work to get it up to 100MHz. I suppose the 100MHz data rate could be
split into 2x50MHz streams, processed in parallel.

Eric Crabill

unread,

Mar 25, 2002, 12:54:58 PM3/25/02

to

Thank you to all who replied! I value your
insight into the problem. I'm going to see
where I can get a copy of Knuth's book to
see for myself...

Thanks again,
Eric

glen herrmannsfeldt

unread,

Mar 25, 2002, 7:02:25 PM3/25/02

to

Christoph...@durham.ac.uk (Christopher Saunter) writes:
>Eric Crabill (eric.c...@xilinx.com) wrote:

>: I am looking to build a circuit for sorting small data sets,
>: and am hoping someone here might have some pointers for where
>: I should start looking for algorithms to do it.

(snip)

>I got thinking about sorting, and this is what I came up with.
>It should sort data at high speed, and accepts all words in
>a single cycle in parallel, and spits them out in parallel,
>sorted, a fixed number of cycles later. I think with a bit
>of effort in layout it would run very fast as well. However,
>it makes rather less efficient use of a Xilinx device than
>the FIFO solution, as it uses flip-flops for data storage, not
>LUTs.

>Please lay into any problems it has, I'm out to learn!

>---
>The design consists of a number of pipeline stages, each of which
>partially sorts the data by sorting pairs of values.

>We have N words. Conceptually these are split into N/2 pairs
>of words, each of which are sorted based on the 4 bit key.
>This could be done with a 4 bit magnitude comparitor and two 2:1
>mux's - see my noddy diagram at
>http://www.dur.ac.uk/christopher.saunter/sort1.bmp

>So the first stage of the sorting pipeline is as follows:

(snip)

Fast sorting algorithms are O(n log n) comparisons. Here you are
doing m comparisons at a time, maybe even n comparisons. I don't
know if there is an O(n log n) parallel sort algorithm.

I did once implement a maximum of two 16 bit numbers in XC4000
series. It is convenient in that the carry logic does the compare
and the LUT's do the select, so it can be done in 9 CLBs. It seems
that newer Xilinx FPGA's don't allow this separate carry logic
operation.

I would look into systolic array algorithms, as they tend to be
good for doing this. According to my systolic algorithms book
bubble sort is about the most efficient systolic sort algorithm,
which is pretty much what you were describing. You need n cells
and n cycles to sort n elements.

-- glen

Christopher Saunter

unread,

Mar 26, 2002, 5:25:12 AM3/26/02

to

A couple of other thaughts:

1) It's pretty inefficient to carry all 32 bits of each word through the
sorting logic. How about haveing 16 32bit fifos of equal length to the
sorting pipeline, clock the 16 words into them each cycle. Extract the 4
bit key from the word and attatch a 4 bit pointer to the fifo containing
the 'parent' word, and just sort this. You would then use the pointers
emerging from the sort to reorder the data emerging from the fifos. This
would need a lot of logic in itself (or a lot of tristates) but would
probably save some. Also, cutting the size of the sort algorthim by a
factor of 4 might make it easier to speed up.

2) It's not the case that the 4 bit keys are all unique is it? Probably
a daft question.

Cheers,
Chris Saunter

John_H

unread,

Mar 27, 2002, 7:44:07 PM3/27/02

to

Ohhhhh, I've figured out what Batcher was all about. The "bitonic sort"
algorithm is a very nice way to do things! Rather than the n-1 = 15 cycles of
pipeline to do the sort with 2-input compare/swap elements, the whole 16 values
can be sorted with 10 compare/swap elements. Wow. (I prototyped the system in
Excel and it's clean).

If you had a zero-cost method of determining what index corresponded to ranks
0-15, it'd take 16-1 multiplexers (composed of 2 stages of 4-1 muxes for 10
LUTs per bit) 32 bits wide to accomplish the ordering. The 10-stage
compare/swap uses... 10 LUTs per bit without the cost of finding the index
order!

I'd love to extend the method to 4-input compare/order elements instead of the
2-input items in an attempt to save on resources further, but the direct
bitonic sort is such a pretty method and the complexities are so low.

This post is late but the topic's been on my mind for a few days. A good sort
can be a nice tool in my verilog toolbox so I wouldn't let it die.