[Haskell-cafe] library for set of heterogeneous values?


Anthony Clayden

Aug 11, 2016, 12:38:08 AM
to haskel...@haskell.org
I have a (let's call it) database of heterogeneous records.

They're not Haskell records, but anonymous/extensible type-labelled rows. (Could be tuples, could be HLists, could be Lens-like, could be something fancier.)

There's a small number (dozens) of distinct row types, each with a large number (thousands) of rows. The variety of row-types is not predictable in advance. And indeed a row might 'morph' over time with fields added/removed.

So the obvious answer of putting the lot into a giant HList (each element of the list being a row) isn't going to scale. I could have a type-indexed HList in which each element is a Set of homogeneous rows. But performance still suffers from scanning along the list to find the right type index.
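
To make that concrete, here's a minimal sketch of the type-indexed-list-of-Sets idea (Repo, HasRows, rowsOf are made-up names, not from any package):

    {-# LANGUAGE DataKinds, GADTs, KindSignatures, TypeOperators,
                 MultiParamTypeClasses, FlexibleInstances #-}
    import Data.Kind (Type)
    import Data.Set (Set)

    -- One Set per row type; the type index tracks which row types are present.
    data Repo (ts :: [Type]) where
      RNil  :: Repo '[]
      RCons :: Set t -> Repo ts -> Repo (t ': ts)

    -- Fetching the Set for a given row type walks the spine at runtime --
    -- exactly the linear scan complained about above.
    class HasRows t ts where
      rowsOf :: Repo ts -> Set t
    instance {-# OVERLAPPING #-} HasRows t (t ': ts) where
      rowsOf (RCons s _) = s
    instance HasRows t ts => HasRows t (u ': ts) where
      rowsOf (RCons _ rest) = rowsOf rest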

Is there something better? On hackage there are two packages called HSet, neither giving much help about its suitability:

* `hset` (lower case) [AlekseyUymanov] seems isomorphic to a type-indexed HList,
      i.e. the type of each element must be unique (it could be a Set type, I guess).

* `HSet` (upper case) [athanclark], "Faux heterogeneous sets", seems a lot meatier.
      Why the "Faux"? It's built over hashtables in the ST monad.

Has anybody used these? Can anyone give guidance on what they can and can't do?

Bonus questions:

Given a filter specifying a restriction on (some) fields of rows, I want to get a heterogeneous subset:
* all rows with at least those fields, matching those restrictions.
* the restriction might be merely "has field labelled L".

Given a candidate row for insertion, I want first to scan for quasi-duplicates:
* any existing row with a subset of the given fields, and the same value at those fields.
* any existing row with a superset of the given fields, and the same value at those fields in common.
* ignore records with only a partial overlap of fields.

One possible data structure: a "vertical store" (sketched below).
Give each row a Globally Unique Id (GUId).
Keep a separate index for each possible field,
mapping each field value (the key) to the set of GUIds of records with that value.

Then I have a different bonus question:
* how to retrieve all field values for a given GUId?
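
A rough sketch of that shape over Data.Map -- GUId, Field and Value here are placeholder types; real code would need a common representation for field values:

    import           Data.Map.Strict (Map)
    import qualified Data.Map.Strict as Map
    import           Data.Set (Set)
    import qualified Data.Set as Set

    type GUId  = Int     -- stand-in for a real globally unique id
    type Field = String  -- field label
    type Value = String  -- assume all values share one key representation

    -- One index per field: value -> GUIds of the records with that value there.
    type VerticalStore = Map Field (Map Value (Set GUId))

    -- All records that have field f at all:
    rowsWithField :: Field -> VerticalStore -> Set GUId
    rowsWithField f store =
      maybe Set.empty (Set.unions . Map.elems) (Map.lookup f store)

    -- The bonus question, answered naively: scan every field's index.
    -- A practical design would keep a row-major index alongside.
    fieldsOf :: GUId -> VerticalStore -> Map Field Value
    fieldsOf g = Map.mapMaybe findIn
      where
        findIn byValue =
          case [v | (v, gs) <- Map.toList byValue, g `Set.member` gs] of
            (v:_) -> Just v
            []    -> Nothing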


Thanks
AntC

David Feuer

Aug 11, 2016, 2:58:33 PM
to Anthony Clayden, haskel...@haskell.org
I don't know if they're at all suitable for your purposes, but you
should look (at least for inspiration) at both vinyl and
dependent-map.


Яшин Сергей

Aug 11, 2016, 4:59:56 PM
to Haskell-cafe, haskel...@haskell.org, anthony...@clear.net.nz

Anthony Clayden

Aug 11, 2016, 8:17:57 PM
to David Feuer, haskel...@haskell.org
> I don't know if they're at all suitable for your purposes,
> but you should look (at least for inspiration) at both
> vinyl and dependent-map.

Thanks David,

vinyl is a record 'solution', not a set-of-records repository.

Really there are now too many bewildering choices for records; representing the records is not the difficult part.

dependent-map, and vault [Heinrich Apfelmus; thank you Sergei for the suggestion], seem to be (key, value) stores, so I'm not seeing how I 'query'/filter the values (the whole records) if I don't already know the keys for those records.
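
For instance, as far as I understand the dependent-map API (the Key type here is made up), lookup is driven by a typed key:

    {-# LANGUAGE GADTs #-}
    import           Data.Dependent.Map (DMap)
    import qualified Data.Dependent.Map as DMap
    import           Data.Dependent.Sum (DSum (..))
    import           Data.Functor.Identity (Identity (..))
    import           Data.GADT.Compare (GCompare (..), GEq (..), GOrdering (..))
    import           Data.Type.Equality ((:~:) (..))

    -- The key type carries the type of the value stored under it.
    data Key a where
      Name :: Key String
      Age  :: Key Int

    instance GEq Key where
      geq Name Name = Just Refl
      geq Age  Age  = Just Refl
      geq _    _    = Nothing

    instance GCompare Key where
      gcompare Name Name = GEQ
      gcompare Name Age  = GLT
      gcompare Age  Name = GGT
      gcompare Age  Age  = GEQ

    row :: DMap Key Identity
    row = DMap.fromList [Name :=> Identity "AntC", Age :=> Identity 42]

    -- Fine if you already hold the key; but there's no way to ask
    -- "all values matching this predicate" without enumerating keys.
    age :: Maybe Int
    age = runIdentity <$> DMap.lookup Age row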

Contrast that with Data.Set [*]: you could stream the contents to a list and filter. (Which would be horribly inefficient for my purposes.)

[*] Data.Set seems to have been swallowed into the new collections library, which is not currently maintained :-(

Doug Burke

Aug 13, 2016, 3:06:22 AM
to anthony...@clear.net.nz, David Feuer, haskel...@haskell.org
Frames - http://acowley.github.io/Frames/ - is built on top of vinyl and provides one way to access data, either as an array of structures or as a structure of arrays. I'm on my phone (hence the use of top posting) so I can't productively check whether it allows the kind of filtering you were asking for.

Doug

Anthony Clayden

Aug 14, 2016, 6:53:06 AM
to david...@gmail.com, libr...@haskell.org, haskell-cafe
Firstly, to correct myself:

> Contrast that with Data.Set [*], you could stream the contents to a list and filter.
> (Which would be horribly inefficient for my purposes.) 

Thank you Libraries team. Data.Set seems to have been seriously revamped since I last looked, and now includes a 'proper' filter function. (It used to stream the Set to a List; filter the List; build a new Set.)
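
Roughly the difference:

    import qualified Data.Set as Set

    -- now: a proper filter, working within the tree
    evens :: Set.Set Int
    evens = Set.filter even (Set.fromList [1..10])

    -- the old behaviour, more or less:
    evens' :: Set.Set Int
    evens' = Set.fromList (filter even (Set.toList (Set.fromList [1..10])))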

> [*] Data.Set seems to have been swallowed into the new collections library,
> which is not currently maintained 

Uhh. That was a dead link (of ~10 years ago, but Google still took me to it). Data.Set is in the **containers** library supported on hackage.

But I have some questions about the optimisations:

The hedge-union algorithm seems to have been taken out as inefficient (per treeowl on github). But the code on hackage still uses hedge?? So the docs are out of date (and so is quite a bit of commentary on StackOverflow). Was it really not more efficient?

The Set data type's constructors got reordered (Bin is now first, Tip is second), on the grounds that it "improves the benchmark by up to 10%".
Are you sure it makes that much difference?
Are you sure it makes any difference at all?
The culprit seems to be pattern matching. It makes me feel there's something seriously wrong with GHC's code generator.
(And the semantics of ADTs is that the first constructor is ordered first if you go `deriving (Ord)`. So there's usually a design reason driving the ordering of constructors -- eg Nothing ordered before Just; [] ordered before (:), which makes String ordering do the right thing.)
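
A quick illustration (Priority is a made-up example type):

    -- Derived Ord follows declaration order: earlier constructors compare least.
    data Priority = Low | Medium | High deriving (Eq, Ord, Show)

    ordExamples :: Bool
    ordExamples = Low < High          -- True
               && Nothing < Just 'a'  -- True: Nothing is declared first
               && "" < "ab"           -- True: [] is declared before (:)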

Set is a purely functional data structure: no in-situ updating of nodes, no pointer semantics, unlike imperative/low-level implementations of balanced binary trees.
That means Tips (empty end-points) genuinely get allocated/built, accessed and garbage-collected on insert -- or does all that strictness annotation/INLINing alter that?

Contrast that with an imperative/low-level model, where they'd be null pointers, nothing allocated.
Note the number of Tips is equal to the number of values in the tree, + 1.
Would it save construction/destruction/access overhead to have three extra constructors (see the sketch just below)?
| both sub-trees are empty
| left (only) is empty
| right (only) is empty
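
Something like this, say -- a hypothetical variant, just transcribing the idea:

    type Size = Int
    data Set a
      = Bin !Size !a !(Set a) !(Set a)  -- value with both subtrees present
      | Leaf !a                         -- both sub-trees empty
      | BinR !Size !a !(Set a)          -- left empty: value + right subtree
      | BinL !Size !a !(Set a)          -- right empty: value + left subtree
      | Tip                             -- the wholly empty set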

If it's true per above that each constructor adds a performance penalty just doing the pattern matching, I guess the answer is 'no' :-(

Thanks
AntC

David Feuer

Aug 14, 2016, 8:05:43 AM
to Anthony Clayden, haskel...@haskell.org

I don't really think we need to bring *two* mailing lists into this. I'm treeowl. The hedge algorithms were usually slower on benchmarks and never much faster. Their theoretical performance has never been understood well. Divide and conquer is fast, and its theoretical performance bound is now understood to be good. The new code is not on Hackage because I haven't made a release yet. containers doesn't usually release a major version more than once a year. This year it may get two, but there's no need to rush.

The Tip constructor is not allocated over and over. Nullary constructors of any given type are shared. So there's just one Tip somewhere. GHC's pointer tagging means that the pointers to Tip are almost never followed. Pattern matching on a Tip value ends up just inspecting the two (32-bit) or three (64-bit) tag bits on the pointer to it.

Yes, constructor order makes a difference, weird as that seems. When GHC desugars a pattern match, it always reorders the case branches into constructor order. Sometimes this is obviously good; certain large case expressions are turned into jump tables. Sometimes this is not so good. One of my predecessors (probably Milan Straka, but I'm not sure) ran a lot of benchmarks and picked the constructor order that was usually best.

Milan Straka wrote a paper that, among other things, recommended adding an additional singleton constructor. I don't actually know why he didn't do that; I can try to ask him. It's possible that it made some things better and others worse.

Anthony Clayden

Aug 15, 2016, 4:15:30 AM
to David Feuer, haskell-cafe
Thanks David,

> Yes, constructor order makes a difference, weird as that seems. When GHC desugars a pattern  match, it always reorders the case branches into constructor order. Sometimes this is obviously good; certain large case expressions are turned into jump tables. Sometimes this is not so good. One of my predecessors (probably Milan Straka, but I'm not sure) ran a lot of benchmarks and picked the constructor order that was usually best.

Yes, it was Milan [1]. [2] refers to [3], which has the explanation. (And yes, it seems weird.)

"indirect jumps are particularly expensive on a modern processor architecture, because they fox the branch prediction hardware, leading to a stall of 10 or more cycles depending on the length of the pipeline." [Introduction. The paper goes on to estimate a penalty around 7 to 9 % in general GHC code.]

Milan benchmarked a penalty around 10% for Data.Set. The prediction hardware doesn't seem very clever: it "predicts" any test-and-jump will not jump but will barrel on through the straight-line code. GHC's test is for the constructor not being the first in the decl. So the optimal straight-line code comes from declaring first your most likely constructor.

That seems to me worrying for case/constructor branching code in general. Do we pay that cost for every pattern-match on a list or a Maybe, for example? Their definitions in the Prelude put the constructors in the 'wrong' order. (Although presumably those two get optimised to smithereens.)

> Milan Straka wrote a paper [that's 2] that, among other things, recommended adding an additional singleton constructor. I don't actually know why he didn't do that; I can try to ask him. It's possible that it made some things better and others worse.

Performance seemed somewhat better for certain functions. (And none were worse.) OTOH you would pay a penalty of another indirect jump. So perhaps the extra code complexity just wasn't worth it. Perhaps the combination of pointer tagging and inlining the recursive call wrung the best performance out of it. (Inlining works, despite recursion, because every 'visible' function has a helper `go` that isn't inlined.)
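
The idiom, roughly (a sketch, not the actual containers source):

    data Set a = Bin !Int !a !(Set a) !(Set a) | Tip  -- as in containers, roughly

    member :: Ord a => a -> Set a -> Bool
    member x0 s0 = go x0 s0
      where
        -- 'go' carries the recursion; the non-recursive wrapper 'member'
        -- can be inlined and specialised at its call sites.
        go _ Tip = False
        go x (Bin _ y l r) = case compare x y of
          LT -> go x l
          GT -> go x r
          EQ -> True
    {-# INLINABLE member #-}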


[1] Milan Straka, "The Performance of the Haskell containers Package", Haskell Symposium 2010.
[2] Milan Straka, "Adams' Trees Revisited", 2011.
[3] Simon Marlow, Alexey Rodriguez Yakushev, Simon Peyton Jones, "Faster Laziness Using Dynamic Pointer Tagging", ICFP 2007.


AntC

Anthony Clayden

Aug 15, 2016, 6:02:04 AM
to Doug Burke, David Feuer, haskel...@haskell.org
> Frames - http://acowley.github.io/Frames/ - is built on
> top of vinyl and provides one way to access data either as
> an array of structures or a structures of arrays.

Thanks Doug,

Nice ability to define column types on the fly, and to interpret them from incoming CSV. Yes, and to treat the data as a structure of arrays.

AFAICT Frames keeps each set of records distinct, and wants each set to be the same type: same columns and types. So there's no heterogeneous repository.

> whether it allows the kind of filtering you were asking for.

There are the standard filtering features you'd get with Set or List. I'm not seeing any ability to match/join across different datasets. That is, you'd have to write your own code.

wren romano

Aug 23, 2016, 1:52:39 AM
to Anthony Clayden, haskell-cafe
On Mon, Aug 15, 2016 at 1:15 AM, Anthony Clayden
<anthony...@clear.net.nz> wrote:
> "indirect jumps are particularly expensive on a modern processor
> architecture, because they fox the branch prediction hardware, leading to a
> stall of 10 or more cycles depending on the length of the pipeline."
> [Introduction. The paper goes on to estimate a penalty around 7 to 9 % in
> general GHC code.]

Indeed, this is the point I was about to bring up (but I've been AWOL
the last few weeks).

The cost model here makes sense if you think about a recursive
function over something like lists. To compile down to a tight C-like
loop we'll choose the conditional statement to be "is this nil" and
assume it's always false, so that code falls through straightline to
do whatever loop body, and then does an unconditional jump back up to
the top of the loop. "Almost always" our assumption that the condition
fails will be correct, as it's correct in the recursive case and only
fails in the basis case. (This notion of "almost always" is assuming
we don't have most of our input arguments be the empty list; with
felicity degrading as nil inputs begin to dominate.)
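
For instance, the shape I mean (sumList/go are made-up names for a sketch):

    {-# LANGUAGE BangPatterns #-}

    sumList :: [Int] -> Int
    sumList = go 0
      where
        go !acc xs = case xs of
          (y:ys) -> go (acc + y) ys  -- hot path: predicted fall-through
          []     -> acc              -- cold path: taken once, at the end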

Regardless of how GHC does things, those basics of branch prediction
and primop ordering will remain the same. The question is just about
how we get there, how we go from English descriptions like "the
recursive case" and "the basis case" (or the "common" and "uncommon"
cases) and get to the actual byte code.

In practice GHC does the naive thing of ordering branches to be in
order of declaration in the data type definition. This is good in as
much as it doesn't matter how users textually order their case
analyses, since GHC will reorder them. But it's bad in that the
assumed ordering is not always very smart— especially when the order
of data constructors in a data type's definition is also used for
other magic, like deriving Ord. Really, we'd like to make GHC smarter
here about how it orders things; but that gets complicated very
quickly.


> That seems to me worrying for case/constructor branching code in general. Do
> we pay that cost for every pattern-match on a list or a Maybe, for example?

In principle, yes. In practice, not so much. Because it's not
recursive, you gotta figure you're just going to be wrong half the
time— but you don't know which half. The only stable solutions here
are (a) use dynamic instrumentation to determine which is the hot path
and then recompile, or (b) distinguish the (Nothing | Some a) type
from the (Just a | Everything) type— which we often want anyways for
other things like Ord and Monoid instances, but no one other than me
seems to care very much. In addition to these, there are also various
unstable solutions: (c) assume the user-provided case-analysis order
gives the most likely constructor first (but do we treat Maybe as a
special case? do we over-serialize everything?), (d) add a new pragma
to specify for each case analysis which constructor should be
considered hot (but how much should we actually trust the user to get
that right?), etc...
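
To sketch option (b) -- all names made up; the point is just the constructor order:

    -- 'None' is least: suits min-with-identity style uses.
    data MaybeFirst a = None | Some a
      deriving (Eq, Ord, Show)

    -- 'Everything' is greatest: suits max-with-identity style uses.
    data MaybeLast a = Just' a | Everything
      deriving (Eq, Ord, Show)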

--
Live well,
~wren

Anthony Clayden

Aug 26, 2016, 3:06:52 AM
to wren romano, haskell-cafe
> On Mon, Aug 15, 2016 at 1:15 AM, Anthony Clayden wrote:
> > "indirect jumps are particularly expensive on a modern
> > processor architecture, because they fox the branch
> > prediction hardware, leading to a stall of 10 or more
> > cycles depending on the length of the pipeline."
> > [Introduction. The paper goes on to estimate a penalty
> > around 7 to 9 % in general GHC code.]

> The cost model here makes sense if ...

Thanks Wren. Is that 7 to 9 % figure unduly alarmist?
(It is consistent with Milan's benchmarking.)
For example, would strictifying or inlining have much
greater benefits than worrying about constructor order?
And does it only affect processor-intensive applications?

> ...
>
> Regardless of how GHC does things, those basics of branch
> prediction and primop ordering will remain the same. ...
>
> In practice GHC does the naive thing of ordering branches
> to be in order of declaration in the data type definition.
> This is good in as much as it doesn't matter how users
> textually order their case analyses, since GHC will
> reorder them. ...
>
> ...the order of data constructors in a data type's
> definition is also used for other magic, like deriving Ord. ...

Hmm, I'd hardly describe deriving Ord as "magic".
It's what the language report says [Chapter 11].

> Really, we'd like to make GHC smarter here about
> how it orders [case branching code];
> but that gets complicated very quickly.

I can see that. I gave some possible rules of thumb
in an earlier message.
(In particular, recursive constructors are more likely.)

> ... no one other than me seems to care very much. ...

I guess not many know. We kinda expect GHC will be much
smarter than we could achieve by hand-crafting the object
code.

Those who could write assembler code must be dying out ;-)
Judging by the 2007 paper [3], part of the difficulty
seems to be that the C code GHC generates is none too
idiomatic for what a C optimiser is tuned for.

AntC
