Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.

Dismiss

[Haskell-cafe] Space Efficiency When Sorting a List of Many Lists

3 views

Skip to first unread message

Peter Green

unread,

Dec 30, 2009, 11:39:41 PM12/30/09

to haskel...@haskell.org

I'm a Haskell neophyte, so may be missing something obvious in the
problem outlined below. I'm fairly proficient in Python + have some
limited experience in OCaml and F#, so know just enough to be be
dangerous, but not nearly enough to really know what I'm doing here.

OK, I have text files containing 'compressed lists' Compressed lists
look like this:

8+12,11,7+13,10
1+2+3,1+9,3+6,4
.
.

Sublists are comma-delimited, and sublist elements are separated by
'+' character.

I parse these to look like so:

[[[8,12],[11],[7,13],[10]],
[[1,2,3],[1,9],[3,6],[4]],
.
.
]

I need to explode these and produce a lex-sorted list of exploded lists:

[[1,1,3,4],[1,1,6,4],[1,9,3,4],[1,9,6,4],[2,1,3,4],[2,1,6,4],[2,9,3,4],
[2,9,6,4],[3,1,3,4],[3,1,6,4],[3,9,3,4],[3,9,6,4],
[8,11,7,10],[8,11,13,10],[12,11,7,10],[12,11,13,10]]

I then output this data as comma-delimited lists:

1,1,3,4
1,1,6,4
.
.
12,11,13,10

I can do all this in my (no doubt fairly naive) program shown below.
In real life, one of my test data files contains compressed lists
which all contain 7 sublists. This data (correctly) explodes to a list
containing ~3,700,000 exploded lists. All good as far as correctly
transforming input to output goes. However, sorting the data uses a
lot of memory:

Partial output from ./explode +RTS -sstderr:

540 MB total memory in use (4 MB lost due to fragmentation)

If I do not do any sorting on the exploded lists, i.e. modify the main
function to be
main = interact (unlines . toCSV . explode . fromCSV . lines)

I then see this partial output from ./explode +RTS --stderr:

2 MB total memory in use (0 MB lost due to fragmentation)

I can guess that there might be be less laziness and more
instantiation when sorting is introduced, but my questions are:
(1) Am I doing anything terribly stupid/naive here?
(2) If so, what can I do to improve space efficiency?

TIA!

import Data.List (sort)
import Data.List.Split (splitOn)

-- Cartesian Product over a List of Lists
-- cp [[1,2],[3],[4,5,6]] --> [[1,3,4],[1,3,5],[1,3,6],[2,3,4],[2,3,5],
[2,3,6]]
cp :: [[a]] -> [[a]]
cp [] = [[]]
cp (xs:xss) = [y:ys | y <- xs, ys <- cp xss]

-- fromCSV ["8+12,11,7+13,10", "1+2+3,1+9,3+6,4"] -->
-- [[[8,12],[11],[7,13],[10]],[[1,2,3],[1,9],[3,6],[4]]]
fromCSV :: [String] -> [[[Int]]]
fromCSV = map parseOneLine
where parseOneLine = map parseGroup . splitOn ","
where parseGroup = map read . splitOn "+"

-- explode [[[1,2],[3],[4,5,6]], [[1, 2], [14,15], [16]]] --> [[1,3,4],
[1,3,5],
-- [1,3,6],[2,3,4],[2,3,5],[2,3,6],[1,14,16],[1,15,16],[2,14,16],
[2,15,16]]
explode :: [[[a]]] -> [[a]]
explode = concatMap cp

-- toCSV [[8,11,7,10,12],[8,11,7,10,12],[8,11,7,10,12]] -->
-- ["8,11,7,10,12","8,11,7,10,12","8,11,7,10,12"]
toCSV :: (Show a) => [[a]] -> [String]
toCSV = map $ tail . init . show
--toCSV = map (intercalate "," . map show)

main = interact (unlines . toCSV . sort . explode . fromCSV . lines)

_______________________________________________
Haskell-Cafe mailing list
Haskel...@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Alexander Dunlap

unread,

Dec 30, 2009, 11:58:16 PM12/30/09

to Peter Green, haskel...@haskell.org

I'm not sure about your problem specifically but the external-sort
package on Hackage may be of interest.

Alex

Luke Palmer

unread,

Dec 31, 2009, 5:39:14 AM12/31/09

to Peter Green, haskel...@haskell.org

On Wed, Dec 30, 2009 at 9:39 PM, Peter Green <kinc...@me.com> wrote:
> I can guess that there might be be less laziness and more instantiation when
> �sorting is introduced,

Yes, by a lot. Sorting requires keeping the entire list in memory.
And Haskell lists, unfortunately, are not that cheap in terms of space
usage (I think [Int] uses 3 words per element).

> but my questions are:
> � � � �(1) Am I doing anything terribly stupid/naive here?
> � � � �(2) If so, what can I do to improve space efficiency?
>
> TIA!
>
>
> import Data.List (sort)
> import Data.List.Split (splitOn)
>
> -- Cartesian Product over a List of Lists
> -- cp [[1,2],[3],[4,5,6]] -->
> [[1,3,4],[1,3,5],[1,3,6],[2,3,4],[2,3,5],[2,3,6]]
> cp � � � � �:: [[a]] -> [[a]]
> cp [] � � � = �[[]]
> cp (xs:xss) = �[y:ys | y <- xs, ys <- cp xss]

This cartesian product varies in its tail faster than its head, so
every head gets its own unique tail. If you reverse the order of the
bindings so that it varies in its head faster, then tails are shared.
If my quick and dirty reasoning is correct, it improves the space
usage by a factor of the number of sublists.

cp' [] = [[]]
cp' (xs:xss) = [y:ys | ys <- cp' xss, y <- xs]

But if you're serious, you can probably do better than just generating
them all and passing them to sort. I get the impression that there is
some structure here that can be taken advantage of.

Daniel Fischer

unread,

Dec 31, 2009, 6:09:54 AM12/31/09

to haskel...@haskell.org, Peter Green

Am Donnerstag 31 Dezember 2009 11:38:51 schrieb Luke Palmer:
> This cartesian product varies in its tail faster than its head, so
> every head gets its own unique tail. �If you reverse the order of the
> bindings so that it varies in its head faster, then tails are shared.
> If my quick and dirty reasoning is correct, it improves the space
> usage by a factor of the number of sublists.

That reasoning is supported by
http://www.haskell.org/pipermail/haskell-cafe/2009-December/070184.html

However, that concerns only the generation of the lists to be sorted, as far as I can see
(that will be faster and use less memory).
The main problem here is the space usage of the sorting, I think. For that, probably an
external sort is the best solution.

Heinrich Apfelmus

unread,

Dec 31, 2009, 11:28:15 AM12/31/09

to haskel...@haskell.org

Felipe Lessa wrote:

> Luke Palmer wrote:
>> But if you're serious, you can probably do better than just generating
>> them all and passing them to sort. I get the impression that there is
>> some structure here that can be taken advantage of.
>

> Isn't what he wants a trie? In particular, a Patricia trie?

Peter, this is a very nice problem. Is this a programming exercise or
did you encounter it in the "real world"?

There is indeed a structure that can be taken advantage of and it
involves tries.

The key point is that thanks to the lexicographic ordering, you can
*interleave* exploding and sorting the rows. In other word, we can
exploit the fact that for example

(sort . cartesian) ([8,12]:[11]:[7,13]:[10]:[])
= [8 : (sort . cartesian) ([11]:[7,13]:[10]:[]) ]
++ [12 : (sort . cartesian) ([11]:[7,13]:[10]:[]) ]

where cartesian denotes the cartesian product. The code will mainly
work with functions like

type Row a = [a]

headsIn :: [Row a] -> [(a, [Row a])]

which groups rows by their first element. The result type is best
understood as a finite map from a to [Row a]

headsIn :: [Row a] -> Map a [Row a]

And unsurprisingly, the fixed point of the (Map a) functor is the
trie for [a] .

Without much explanation, here the full formulation in terms of
catamorphisms and anamorphisms.

{-# LANGUAGE NoMonomorphismRestriction #-}
import qualified Data.Map
import Control.Arrow (second)

-- the underlying structure
newtype Map a b = Map { unMap :: [(a,b)] } deriving (Eq, Show)

-- category theory: bananas and lenses
instance Functor (Map a) where
fmap f = Map . (map . second) f . unMap

newtype Fix f = In { out :: f (Fix f) }

cata f = In . fmap (cata f) . f
ana f = f . fmap (ana f) . out

-- very useful type synonym to keep track of rows and colums
type Row a = [a]

-- grouping and "ungrouping" by the first elements of each row
headsIn :: [Row a] -> Map a [Row a]
headsIn xss = Map [(x,[xs]) | x:xs <- xss]

headsOut :: Map a [Row a] -> [Row a]
headsOut (Map []) = [[]]
headsOut xxs = [x:xs | (x,xss) <- unMap xxs, xs <- xss]

-- cartesian product
cartesian1 :: Row [a] -> Map a (Row [a])
cartesian1 [] = Map []
cartesian1 (xs:xss) = Map [(x,xss) | x <- xs]

cartesian = ana headsOut . cata cartesian1

-- sorting
sort1 :: Ord a => Map a [b] -> Map a [b]
sort1 = Map . Data.Map.toList . Data.Map.fromListWith (++) . unMap

sortRows = ana headsOut . cata (sort1 . headsIn)

-- and cold fusion!
-- sortCartesian = sortRows . cartesian
-- best written as hylomorphism
sortCartesian = ana headsOut . cata (sort1 . cartesian1)

This is readily extended to handle the explode function as well. And
thanks to lazy evaluation, I expect this to run with a much better
memory footprint.

Regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com

Peter Green

unread,

Jan 2, 2010, 9:33:40 AM1/2/10

to haskel...@haskell.org

Thank you everyone for the very helpful suggestions so far!

I think I should re-state the problem and provide more background
info. In answer to the poster who suggested using a Trie, there *is* a
real world application for what I'm attempting to do. I'll also hint
at that below:

I have text files containing 'compressed lists' Compressed lists
look like this:

8+12,11,7+13,10
1+2+3,1+9,3+6,4
.
.

Sublists are comma-delimited, and sublist elements are separated by

'+' character. Sublists contain integers in the range 1..20.

I parse these to look like so:

[[[8,12],[11],[7,13],[10]],

[[1,2,3],[1,9],[3,6],[4]],
.
.
]

I need to explode these and produce a lex-sorted list of exploded lists:

[[1,1,3,4],[1,1,6,4],[1,9,3,4],[1,9,6,4],[2,1,3,4],[2,1,6,4],[2,9,3,4],
[2,9,6,4],[3,1,3,4],[3,1,6,4],[3,9,3,4],[3,9,6,4],
[8,11,7,10],[8,11,13,10],[12,11,7,10],[12,11,13,10]]

I then output this data as comma-delimited lists:

1,1,3,4
1,1,6,4
.
.
12,11,13,10

It is a property of the underlying problem 'structure' that none of
these comma-delimited lists in the exploded data is repeated. i.e. we
can never see two instances of (say) 1,1,3,4.

{ Begin Aside on Why I Would Want to do Such a Silly Thing:

'Compressed lists' are in fact compressed wager combinations for multi-
leg exotic horse racing events. 'Exploded lists' are the single wager
combinations covered by the grouped combinations.

Why use compressed combinations? Well, it's a lot easier to submit
5,000 compressed tickets (whether physically or electronically) than
1M single tickets for the individual combinations we have somehow
managed to represent by the 5,000 compressed tickets.

However, single combinations are the currency of the realm. These are
the underlying wagers and compression is merely a convenience/
necessity to enable single wagers to be placed.

It's not uncommon to generate large numbers of single combinations:
Imagine 6 races with 15 runners in each race, and a wager requiring
one to select first place getters in all 6 legs. The number of
potential outcomes is 15^6 = 11,390,625. One might decide (somehow)
that it makes sense in a positive expectation way to wager (say) 500K
of these potential outcomes.

Getting from theoretically desirable single combinations to optimally
compressed wagers is actually NP-Complete and another story for
another day. Suffice it to say that one might use some nasty hacks and
heuristics to arrive at compressed wagers - but in the process doing
violence to one's precise coverage of the theoretically desirable
single combinations. That's one reason why I need to recover (by
'exploding') the *actual* wagered single combinations from the
compressed/ wagers.

I need to explode these compressed wagers back to single combinations
because, I need to (eventually) do set comparisons on collections of
single combinations. i.e. given File A and File B of compressed
wagers, I need to be able to answer questions like:

(1) How many single combinations are common to File A and File B
(2) How many single combinations in File A are missing from File B
.
.
etc. i.e a few of the common set comparison operations.

And the dumbest, quickest and dirtiest and most idiot-proof way of
doing this as a baseline approach before I start using maps, tries,
etc... is to explode Files A and B into lex sorted single combinations
order and then use diff -y with judicious grepping for '>' and <''

End Aside }

I'm probably going to go with an external sort for starters to keep
memory usage down. For most use-cases, I can get away with my current
in-memory sort - just have to beware edge cases.

However, later on, I'm keen to do everything with set theoretic
operations and avoid writing large files of single combinations to disk.

So, I guess the things I'd like to get a feel for are:

(1) Is it realistic to expect to do in-memory set comparisons on sets
with ~1M elements where each element is a (however-encoded) list of
(say) 6 integers? I mean realistic in terms of execution time and
space usage, of course.

(2) What would be a good space-efficient encoding for these single
combinations? I know that there will never be more than 8 integers in
a combination, and none of these values will be < 1 or > 20. So
perhaps I should map them to ByteString library Word8 strings?
Presumably sorting a big list of ByteStrings is going to be faster
than sorting a big list of lists of int?

Heinrich Apfelmus

unread,

Jan 4, 2010, 7:51:59 AM1/4/10

to haskel...@haskell.org

Peter Green wrote:

> Luke Palmer wrote:
>> Yes, by a lot. Sorting requires keeping the entire list in memory.
>> And Haskell lists, unfortunately, are not that cheap in terms of space
>> usage (I think [Int] uses 3 words per element).
>>
>>> but my questions are:
>>> (1) Am I doing anything terribly stupid/naive here?
>>> (2) If so, what can I do to improve space efficiency?
>

> I think I should re-state the problem and provide more background info.
> In answer to the poster who suggested using a Trie, there *is*
> a real world application for what I'm attempting to do.
> I'll also hint at that below:
>

> { Begin Aside on Why I Would Want to do Such a Silly Thing:
>
> 'Compressed lists' are in fact compressed wager combinations for

> multi-leg exotic horse racing events. 'Exploded lists' are the single

> wager combinations covered by the grouped combinations.
>

> [...]

Thanks for describing the background of this problem in detail! I was
mainly asking because I'm always looking for interesting Haskell topics
that can be turned into a tutorial of sorts, and this problem makes a
great example.

Concerning optimization, the background also reveals some additional
informations, namely

* The single wager combinations are short lists
* and each list element is small, i.e. `elem` [1..20]
* No duplicate single wagers
* You are free to choose another ordering than lexicographic

> I need to explode these compressed wagers back to single combinations
> because, I need to (eventually) do set comparisons on collections of
> single combinations. i.e. given File A and File B of compressed wagers,
> I need to be able to answer questions like:
>
> (1) How many single combinations are common to File A and File B
> (2) How many single combinations in File A are missing from File B
> ..

> ..

> etc. i.e a few of the common set comparison operations.
>
> And the dumbest, quickest and dirtiest and most idiot-proof way of doing
> this as a baseline approach before I start using maps, tries, etc... is
> to explode Files A and B into lex sorted single combinations order and
> then use diff -y with judicious grepping for '>' and <''

Looks good to me, sorted lists are a fine data structure for set operations.

> I'm probably going to go with an external sort for starters to keep
> memory usage down. For most use-cases, I can get away with my current
> in-memory sort - just have to beware edge cases.

Using an out-of-the box external sort library is probably the path of
least resistance, provided the library works as it should.

However, your problem has a very special structure; you can interleave
explode and sort and turn the algorithm into one that is partially
on-line, which will save you a lot of memory. I've tried to show how to
do this in my previous post; the resulting code will look something like
this

{-# LANGUAGE NoMonomorphismRestriction #-}
import qualified Data.Map
import Control.Arrow (second)

newtype Map a b = Map { unMap :: [(a,b)] } deriving (Eq, Show)

instance Functor (Map a) where

fmap f = Map . (map . second) f . unMap

hylo :: (Map a b -> b) -> (c -> Map a c) -> (c -> b)
hylo f g = f . fmap (hylo f g) . g

type Row a = [a]

headsOut :: Map a [Row a] -> [Row a]
headsOut (Map []) = [[]]
headsOut m = [x:row | (x,rows) <- unMap m, row <- rows]

explode1 :: [Row [a]] -> Map a [Row [a]]
explode1 rows = Map [(x,[row]) | (xs:row) <- rows, x <- xs]

sort1 :: Ord a => Map a [b] -> Map a [b]
sort1 = Map . Data.Map.toList . Data.Map.fromListWith (++) . unMap

sortExplode = hylo headsOut (sort1 . explode1)

example = [[[8,12],[11],[7,13],[10]], [[1,2,3],[1,9],[3,6],[4]]]
test = sortExplode example

main = interact (unlines . toCSV . sortExplode . fromCSV . lines)

In other words, sortExplode will "stream" the sorted single wagers on
demand, unlike the monolithic sort . explode which has to produce all
single wagers before sorting them.

> However, later on, I'm keen to do everything with set theoretic
> operations and avoid writing large files of single combinations to disk.

Note that there's also the possibility of not expanding the compressed
wagers at all, and perform set operations directly. For instance, it is
straightforward to intersect two such sets of size n in O(n^2) time.
Since n ~ 5000 , n^2 is about the same ballpark as the exploded single
wager combinations.

> So, I guess the things I'd like to get a feel for are:
>
> (1) Is it realistic to expect to do in-memory set comparisons on sets
> with ~1M elements where each element is a (however-encoded) list of
> (say) 6 integers? I mean realistic in terms of execution time and space
> usage, of course.
>
> (2) What would be a good space-efficient encoding for these single
> combinations? I know that there will never be more than 8 integers in a
> combination, and none of these values will be < 1 or > 20. So perhaps I
> should map them to ByteString library Word8 strings? Presumably sorting
> a big list of ByteStrings is going to be faster than sorting a big list
> of lists of int?

The overhead for lists and sets are quite high, it's not uncommon for 1M
elements to occupy 10M-100M of memory.

Not to mention that your elements are, in fact, small lists, introducing
another factor of 10-100. But this can indeed be ameliorated
considerably by representing your elements in a packed format like
ByteStrings or a Word64.

Regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com

_______________________________________________

Peter Green

unread,

Jan 4, 2010, 11:34:52 AM1/4/10

to haskel...@haskell.org

Heinrich, thanks for some great help and food for thought!

(More on performance of your solution below)

> Thanks for describing the background of this problem in detail! I was
> mainly asking because I'm always looking for interesting Haskell
> topics
> that can be turned into a tutorial of sorts, and this problem makes a
> great example.

Another interesting problem is starting from a file of single wagers
and trying to compress them (i.e. inverse of 'explosion'. I believe
this problem is analogous to Set Cover and therefore NP-Complete.
Heuristics are the order of the day here.

(OT, but might also make an interesting multi-part tutorial question/
topic:
A simpler (fairly unrelated problem) is doing exact k-means clustering
of a list of n items (i.e. single dimension data) via the k-
segmentations of a (sorted by values one is clustering on) list of n
items. The number of k-segmentations of an n-list is (n-1) Choose
(k-1). One generates all the k-segmentations and each set of k-
segments constitutes a candidate clustering solution. One selects the
candidate solution which minimises the 'goodness of fit criterion'
used in k-means clustering (easily found in the literature). Needless
to say, this is only feasible for smallish n and for values of k
closer to the LHS and RHS of Pascal's triangle. K-means clustering in
more than one dimension is NP-Complete.)

Seems to be some issue with the external-sort package. As it currently
exists, it blows up on my [[Int]] data and exhibits more memory usage
than vanilla sort! I'm keen to look into fixing this as might be one
small way I can give back, but still taking baby steps with Haskell.

Thank you *very* much for this code! I will try to get my head around
it. I understand the broad outline you posted elsewhere, but will take
me a while to fully grasp the above as I'm only up to ~p200 in Real
World Haskell :).

As for performance of your code above on my file of compressed wagers
which expands to 3.7M single wagers:

(My original version posted here and using vanilla sort)
541 MB total memory in use (5 MB lost due to fragmentation)
INIT time 0.00s ( 0.01s elapsed)
MUT time 13.98s ( 15.82s elapsed)
GC time 8.69s ( 9.64s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 22.67s ( 25.47s elapsed)

(Your much improved version)
10 MB total memory in use (1 MB lost due to fragmentation)
INIT time 0.00s ( 0.00s elapsed)
MUT time 7.61s ( 9.38s elapsed)
GC time 3.48s ( 3.58s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 11.08s ( 12.95s elapsed)

Very impressive and thanks again!

>> However, later on, I'm keen to do everything with set theoretic
>> operations and avoid writing large files of single combinations to
>> disk.
>
> Note that there's also the possibility of not expanding the compressed
> wagers at all, and perform set operations directly. For instance, it
> is
> straightforward to intersect two such sets of size n in O(n^2) time.
> Since n ~ 5000 , n^2 is about the same ballpark as the exploded single
> wager combinations.

Not sure I quite understand you here. In my mind, set elements *are*
single combinations. It is possible for two quite different-looking
files of compressed wagers to contain exactly the same single wager
elements. So I'm not sure how to compare without some notion of
explosion to single combinations.

>> So, I guess the things I'd like to get a feel for are:
>>
>> (1) Is it realistic to expect to do in-memory set comparisons on sets
>> with ~1M elements where each element is a (however-encoded) list of
>> (say) 6 integers? I mean realistic in terms of execution time and
>> space
>> usage, of course.
>>
>> (2) What would be a good space-efficient encoding for these single
>> combinations? I know that there will never be more than 8 integers
>> in a
>> combination, and none of these values will be < 1 or > 20. So
>> perhaps I
>> should map them to ByteString library Word8 strings? Presumably
>> sorting
>> a big list of ByteStrings is going to be faster than sorting a big
>> list
>> of lists of int?
>
> The overhead for lists and sets are quite high, it's not uncommon
> for 1M
> elements to occupy 10M-100M of memory.
>
> Not to mention that your elements are, in fact, small lists,
> introducing
> another factor of 10-100. But this can indeed be ameliorated
> considerably by representing your elements in a packed format like
> ByteStrings or a Word64.

I haven't tried packing ByteStrings yet in this context, but *did* get
some improvement in external-sort performance when using ByteStrings
instead of [[Int]]. Presumably because it somewhat ameliorated the
effects of a bug in external-sort where blocks are fully compared
rather than just the heads of blocks.

I'll be interested to play around with your version more and see if it
benefits much from ByteString. It's already very good now in terms of
space usage, but perhaps can speed up the I/O (output dominates, of
course).

I'll be learning from your post for a long time. So thanks again!

Heinrich Apfelmus

unread,

Jan 9, 2010, 12:48:48 PM1/9/10

to haskel...@haskell.org

Peter Green wrote:
> Heinrich, thanks for some great help and food for thought!

My pleasure. :)

>> Thanks for describing the background of this problem in detail! I was
>> mainly asking because I'm always looking for interesting Haskell topics
>> that can be turned into a tutorial of sorts, and this problem makes a
>> great example.
>
> Another interesting problem is starting from a file of single wagers and
> trying to compress them (i.e. inverse of 'explosion'. I believe this
> problem is analogous to Set Cover and therefore NP-Complete. Heuristics
> are the order of the day here.

Interesting indeed! What kind of heuristics do you employ there? It
seems quite difficult to me, somehow.

>> [...]

>> main = interact (unlines . toCSV . sortExplode . fromCSV . lines)
>>

> Thank you *very* much for this code! I will try to get my head around
> it. I understand the broad outline you posted elsewhere, but will take
> me a while to fully grasp the above as I'm only up to ~p200 in Real
> World Haskell :).
>
> As for performance of your code above on my file of compressed wagers
> which expands to 3.7M single wagers:
>
> (My original version posted here and using vanilla sort)
> 541 MB total memory in use (5 MB lost due to fragmentation)
> INIT time 0.00s ( 0.01s elapsed)
> MUT time 13.98s ( 15.82s elapsed)
> GC time 8.69s ( 9.64s elapsed)
> EXIT time 0.00s ( 0.00s elapsed)
> Total time 22.67s ( 25.47s elapsed)
>
> (Your much improved version)
> 10 MB total memory in use (1 MB lost due to fragmentation)
> INIT time 0.00s ( 0.00s elapsed)
> MUT time 7.61s ( 9.38s elapsed)
> GC time 3.48s ( 3.58s elapsed)
> EXIT time 0.00s ( 0.00s elapsed)
> Total time 11.08s ( 12.95s elapsed)
>
> Very impressive and thanks again!

Category theory saves the day. ;)

The code is based on three observations:
1) Sorting and exploding can be interleaved to create version that can
stream results in an on-line fashion.
2) There is a standard formulation of this pattern in terms of folds and
unfolds (= catamorphisms and anamorphisms).
3) Lazy evaluation can be used to make this look like an off-line algorithm.

To my surprise, I don't have good references for these, but maybe the
following can help a bit. Something similar to 1) can be found in

Graham Hutton. The countdown problem.
http://www.cs.nott.ac.uk/~gmh/bib.html#countdown

and

Richard Bird. A program to solve Sudoku.
http://www.cs.tufts.edu/~nr/comp150fp/archive/richard-bird/sudoku.pdf

The standard reference for 2) is

Meijer et al. Functional programming with bananas,
lenses, envelopes and barbed wire.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.125

and 3) is folklore, maybe

John Hughes. Why functional programming matters.
http://www.cs.chalmers.se/~rjmh/Papers/whyfp.html

could serve as an introduction to lazy evaluation.

>> Note that there's also the possibility of not expanding the compressed
>> wagers at all, and perform set operations directly. For instance, it is
>> straightforward to intersect two such sets of size n in O(n^2) time.
>> Since n ~ 5000 , n^2 is about the same ballpark as the exploded single
>> wager combinations.
>
> Not sure I quite understand you here. In my mind, set elements *are*
> single combinations. It is possible for two quite different-looking
> files of compressed wagers to contain exactly the same single wager
> elements. So I'm not sure how to compare without some notion of
> explosion to single combinations.

What I mean is that there are two ways to represent sets of wagers:

* A list of single combinations -> [Row a]
* A list of compressed wagers -> [Row [a]]

To perform set operations, you first convert the latter into the former.
But the idea is that there might be a way of performing set operations
without doing any conversion.

Namely, the compressed wagers are cartesian products which behave well
under intersection: the intersection of two compressed wagers is again a
compressed wager

intersectC :: Eq a => Row [a] -> Row [a] -> Row [a]
intersectC [] [] = []
intersectC (xs:xrow) (ys:yrow) =
intersect xs ys : intersectC xrow yrow

In formulas:

intersect (cartesian x) (cartesian y)
= cartesian (intersectC x y)

for compressed wagers x, y :: Row [a] with

intersect = list intersection, for instance Data.List.intersect
cartesian = converts a compressed wager to a list of
single combinations

This allows you to intersect two lists of compressed wagers directly

intersect' :: [Row [a]] -> [Row [a]] -> [Row [a]]
intersect' xs ys = filter (not . null) $
[intersectC x y | x<-xs, y<-ys]

Unfortunately, the result will have quadratic size; but if you have
significantly less compressed wagers than single combinations, this may
be worth a try. Not to mention that you might be able to compress the
result again.

Mathematically, this exploits the equations

(A × B) ∩ (C × D) = (A ∩ C) × (B ∩ D)

"The intersection of two rectangles is another rectangle"

and

(A ∪ B) ∩ (C ∪ D) = (A ∩ C) ∪ (A ∩ D)
∪ (B ∩ C) ∪ (B ∩ D)

where A,B,C,D are sets.

0 new messages