Stream conversion for Hashtbl: dirty or lazy way?

6 views
Skip to first unread message

Philippe Veber

unread,
Dec 22, 2012, 4:05:00 AM12/22/12
to Biocaml
Hi guys,

I'd like to add conversion functions from Hashtbl to Stream, but I cannot do it properly without having access to the internal representation of core's hash tables, which is hidden. So I see two (non satisfying options):
- either I use a conversion to list first, but this breaks the main interest of streams (i.e. not building the whole collection to use it)
- either I copy the type definition from Core and use a bit of black magic.

I'd like to try the second option, but not if anyone is strongly against the use of [Obj.magic] in biocaml (which I'd understand of course)
Maybe someone has a better strategy for this?

Cheers,
  ph.

Malcolm Matalka

unread,
Dec 22, 2012, 6:07:13 AM12/22/12
to bio...@googlegroups.com

The first version sounds like the best option. The reasons yo use streams are because a computation is expensive or consumes too much memory. If your data is in a hash table then neither of those are a concern.

--
 
 

Sebastien Mondet

unread,
Dec 22, 2012, 12:41:15 PM12/22/12
to bio...@googlegroups.com
Hi

I also prefer the first option:
- converting hashtables to streams does not seem useful enough to take
the risk of Obj.magic: a hash-table does not keep any intersseting
order, so anyway if order matters we'll end up doing Hashtbl.to_list
|! List.sort, If order does not matter Hashtbl.fold is enough
- one always ends regretting Obj.magic :)

Maybe there is a real use case of ('a, 'b) Hashtbl.t -> ('a, 'b)
Stream.t that I could not think of?
> --
>
>

Ashish Agarwal

unread,
Dec 22, 2012, 1:28:28 PM12/22/12
to bio...@googlegroups.com
Is the use of Obj.magic for efficiency? I can see that using it might avoid an extra pass over the items being put into the stream. As for memory, I agree with Malcolm that it's not an issue because you already have the items in memory.

--



Ashish Agarwal

unread,
Dec 22, 2012, 1:37:44 PM12/22/12
to bio...@googlegroups.com
And on a related note, did we decide to use Core's hashtbl? Philippe, I thought you wanted to stick to the Stdlib's hashtbl since the two are incompatible, and we thought Biocaml's exposed types should be Stdlib compatible wherever possible.

Philippe Veber

unread,
Dec 23, 2012, 3:37:35 AM12/23/12
to bio...@googlegroups.com
Thanks all for your answers. Yes, my motivation was to save the time spent in the extra pass and in allocating memory. But you're probably right the gain is probably not much, if compared to the danger of using [Obj.magic]. I'll stick to your recommandation :o).

My use case for hash table to stream transformation is to sort a set of objects into categories with a hash table and stream on the categories for further calculations (see Biocaml_accu). Now I could as well use lists, but following the context, sometimes an array was more convenient. So I used streams (in fact, enum at that time) as a proxy for any other suitable datastructure.

As for the use of Core's hashtbl, I still think we should stick to Stdlib's type and in particular in this case. As of now, none of our module declares a value using Hashtbl, and we started to use Core's hashtbl in the internal library only. Now if you find more convenient to use stdlib's hashtbl everywhere for the time someone will have an exposed function returning a hash table, I can make the change.

Thanks again for the feedback!
ph.

2012/12/22 Ashish Agarwal <agarw...@gmail.com>
--
 
 

Ashish Agarwal

unread,
Dec 23, 2012, 8:33:59 AM12/23/12
to bio...@googlegroups.com
You're right, regardless of the memory benefits, supporting Streamable is useful for data structure conversions. We should support it whenever possible.

Before we commit to the Core vs Stdlib hashtbl, does anyone know what the difference is? Why did Core implement an incompatible type for theirs?

--
 
 

Sebastien Mondet

unread,
Dec 23, 2012, 10:14:25 AM12/23/12
to bio...@googlegroups.com
The stdlib Hashtable is an array of lists
Core's Hashtable is an array of avl-trees, so the worst case is O(ln
n) instead of O(n)
→ it's much better when the hash function is pathologically bad or
when the input is crafted against it [1]

The semantics are different also, Caml.Hashtbl accepts duplicates for
a given key, Core_hashtbl does not (there is Hashtbl.add_multi which
creates a hash-table of lists...).



[1] Denial of service via algorithmic complexity attacks
http://dl.acm.org/citation.cfm?id=1251356
> --
>
>

Ashish Agarwal

unread,
Dec 24, 2012, 7:54:13 PM12/24/12
to bio...@googlegroups.com
So should we just use Core's Hashtbl internally? We can revisit what to do if we ever need to expose a Hashtbl in the API, which hasn't come up yet. I could go either way though, so if anyone else feels strongly either way please say so.

--



Francois Berenger

unread,
Dec 24, 2012, 8:15:48 PM12/24/12
to bio...@googlegroups.com
On 12/23/2012 05:37 PM, Philippe Veber wrote:
> Thanks all for your answers. Yes, my motivation was to save the time
> spent in the extra pass and in allocating memory. But you're probably
> right the gain is probably not much, if compared to the danger of using
> [Obj.magic]. I'll stick to your recommandation :o).
>
> My use case for hash table to stream transformation is to sort a set of
> objects into categories with a hash table and stream on the categories
> for further calculations (see Biocaml_accu).

I sniff you may need to use a Set.

> Now I could as well use
> lists, but following the context, sometimes an array was more
> convenient. So I used streams (in fact, enum at that time) as a proxy
> for any other suitable datastructure.
>
> As for the use of Core's hashtbl, I still think we should stick to
> Stdlib's type and in particular in this case. As of now, none of our
> module declares a value using Hashtbl, and we started to use Core's
> hashtbl in the internal library only. Now if you find more convenient to
> use stdlib's hashtbl everywhere for the time someone will have an
> exposed function returning a hash table, I can make the change.
>
> Thanks again for the feedback!
> ph.
>
> 2012/12/22 Ashish Agarwal <agarw...@gmail.com
> <mailto:agarw...@gmail.com>>
>
> And on a related note, did we decide to use Core's hashtbl?
> Philippe, I thought you wanted to stick to the Stdlib's hashtbl
> since the two are incompatible, and we thought Biocaml's exposed
> types should be Stdlib compatible wherever possible.
>
>
> On Sat, Dec 22, 2012 at 1:28 PM, Ashish Agarwal
> <agarw...@gmail.com <mailto:agarw...@gmail.com>> wrote:
>
> Is the use of Obj.magic for efficiency? I can see that using it
> might avoid an extra pass over the items being put into the
> stream. As for memory, I agree with Malcolm that it's not an
> issue because you already have the items in memory.
>
>
> On Sat, Dec 22, 2012 at 12:41 PM, Sebastien Mondet
> <sebastie...@gmail.com <mailto:sebastie...@gmail.com>>
> wrote:
>
> Hi
>
> I also prefer the first option:
> - converting hashtables to streams does not seem useful
> enough to take
> the risk of Obj.magic: a hash-table does not keep any
> intersseting
> order, so anyway if order matters we'll end up doing
> Hashtbl.to_list
> |! List.sort, If order does not matter Hashtbl.fold is enough
> - one always ends regretting Obj.magic :)
>
> Maybe there is a real use case of ('a, 'b) Hashtbl.t -> ('a, 'b)
> Stream.t that I could not think of?
>
>
>
> On Sat, Dec 22, 2012 at 6:07 AM, Malcolm Matalka
> <mmat...@gmail.com <mailto:mmat...@gmail.com>> wrote:
> > The first version sounds like the best option. The
> reasons yo use streams
> > are because a computation is expensive or consumes too
> much memory. If your
> > data is in a hash table then neither of those are a concern.
> >
> > On Dec 22, 2012 10:05 AM, "Philippe Veber"
> <philipp...@gmail.com <mailto:philipp...@gmail.com>>
> --
>
>

Francois Berenger

unread,
Dec 24, 2012, 8:19:50 PM12/24/12
to bio...@googlegroups.com
On 12/23/2012 10:33 PM, Ashish Agarwal wrote:
> You're right, regardless of the memory benefits, supporting Streamable
> is useful for data structure conversions. We should support it whenever
> possible.
>
> Before we commit to the Core vs Stdlib hashtbl, does anyone know what
> the difference is? Why did Core implement an incompatible type for theirs?

They expose the hash function in their type signature I think.
This way they can detect at compile time if they try to mix hashtables
which are in some way incompatible.

Currently, I always avoid core's Sets and Hashtables. I find their
signature annoying (and getting in my way) and don't have at all the
same safety requirements than Janestreet.

Sebastien mentioned some performance-related differences.
Usually, if I need performance, I use a Set instead of a Hashtable.

> On Sun, Dec 23, 2012 at 3:37 AM, Philippe Veber
> <philipp...@gmail.com <mailto:philipp...@gmail.com>> wrote:
>
> Thanks all for your answers. Yes, my motivation was to save the time
> spent in the extra pass and in allocating memory. But you're
> probably right the gain is probably not much, if compared to the
> danger of using [Obj.magic]. I'll stick to your recommandation :o).
>
> My use case for hash table to stream transformation is to sort a set
> of objects into categories with a hash table and stream on the
> categories for further calculations (see Biocaml_accu). Now I could
> as well use lists, but following the context, sometimes an array was
> more convenient. So I used streams (in fact, enum at that time) as a
> proxy for any other suitable datastructure.
>
> As for the use of Core's hashtbl, I still think we should stick to
> Stdlib's type and in particular in this case. As of now, none of our
> module declares a value using Hashtbl, and we started to use Core's
> hashtbl in the internal library only. Now if you find more
> convenient to use stdlib's hashtbl everywhere for the time someone
> will have an exposed function returning a hash table, I can make the
> change.
>
> Thanks again for the feedback!
> ph.
>
> 2012/12/22 Ashish Agarwal <agarw...@gmail.com
> <mailto:agarw...@gmail.com>>
>
> And on a related note, did we decide to use Core's hashtbl?
> Philippe, I thought you wanted to stick to the Stdlib's hashtbl
> since the two are incompatible, and we thought Biocaml's exposed
> types should be Stdlib compatible wherever possible.
>
>
> On Sat, Dec 22, 2012 at 1:28 PM, Ashish Agarwal
> <agarw...@gmail.com <mailto:agarw...@gmail.com>> wrote:
>
> Is the use of Obj.magic for efficiency? I can see that using
> it might avoid an extra pass over the items being put into
> the stream. As for memory, I agree with Malcolm that it's
> not an issue because you already have the items in memory.
>
>
> On Sat, Dec 22, 2012 at 12:41 PM, Sebastien Mondet
> <sebastie...@gmail.com
> --
>
>

Malcolm Matalka

unread,
Dec 25, 2012, 4:09:57 AM12/25/12
to bio...@googlegroups.com

I don't understand, hashtables and sets have fairly different use cases. Why use one when you want the other?   What about cores has gotten in your way too?


--


Philippe Veber

unread,
Dec 25, 2012, 5:16:13 AM12/25/12
to bio...@googlegroups.com


2012/12/25 Ashish Agarwal <agarw...@gmail.com>

So should we just use Core's Hashtbl internally?
That's what I meant: let's say it doesn't hurt to use it internally.
 
We can revisit what to do if we ever need to expose a Hashtbl in the API, which hasn't come up yet. I could go either way though, so if anyone else feels strongly either way please say so.
Same line here.
 
--
 
 

Philippe Veber

unread,
Dec 25, 2012, 5:22:19 AM12/25/12
to bio...@googlegroups.com


2012/12/25 Francois Berenger <bere...@riken.jp>

On 12/23/2012 05:37 PM, Philippe Veber wrote:
Thanks all for your answers. Yes, my motivation was to save the time
spent in the extra pass and in allocating memory. But you're probably
right the gain is probably not much, if compared to the danger of using
[Obj.magic]. I'll stick to your recommandation :o).

My use case for hash table to stream transformation is to sort a set of
objects into categories with a hash table and stream on the categories
for further calculations (see Biocaml_accu).

I sniff you may need to use a Set.
Yes it may happen: when creating an accumulator, I specify an accumulation function which can for example only count the number of objects in each category, build the list of occurrences or a set to avoid repetitions.
 
        <sebastie...@gmail.com <mailto:sebastien.mondet@gmail.com>>

        wrote:

            Hi

            I also prefer the first option:
            - converting hashtables to streams does not seem useful
            enough to take
            the risk of Obj.magic: a hash-table does not keep any
            intersseting
            order, so anyway if order matters we'll end up doing
            Hashtbl.to_list
            |! List.sort, If order does not matter Hashtbl.fold is enough
            - one always ends regretting Obj.magic :)

            Maybe there is a real use case of ('a, 'b) Hashtbl.t -> ('a, 'b)
            Stream.t that I could not think of?



            On Sat, Dec 22, 2012 at 6:07 AM, Malcolm Matalka
            <mmat...@gmail.com <mailto:mmat...@gmail.com>> wrote:
             > The first version sounds like the best option. The
            reasons yo use streams
             > are because a computation is expensive or consumes too
            much memory. If your
             > data is in a hash table then neither of those are a concern.
             >
             > On Dec 22, 2012 10:05 AM, "Philippe Veber"
            <philipp...@gmail.com <mailto:philippe.veber@gmail.com>>

--



Philippe Veber

unread,
Dec 25, 2012, 5:25:27 AM12/25/12
to bio...@googlegroups.com
Maybe François meant Map instead of Set? If so, last time I tried (with the stdlib), Hashtbl was violently faster than Map for my accumulation problem.

2012/12/25 Malcolm Matalka <mmat...@gmail.com>
--
 
 

Francois Berenger

unread,
Dec 25, 2012, 8:03:07 PM12/25/12
to bio...@googlegroups.com
On 12/25/2012 07:25 PM, Philippe Veber wrote:
> Maybe Fran�ois meant Map instead of Set? If so, last time I tried (with
> the stdlib), Hashtbl was violently faster than Map for my accumulation
> problem.

Interesting.

I mean, sometimes you use a List or a Hashtabl as the initial
implementation, until you realize that you really needed a Set.

> 2012/12/25 Malcolm Matalka <mmat...@gmail.com <mailto:mmat...@gmail.com>>
>
> I don't understand, hashtables and sets have fairly different use
> cases.

Set, Map and Hashtables have quite some overlap in their interface.

> Why use one when you want the other? What about cores has
> gotten in your way too?

I don't use their Hashtbl and Set.

I use Caml.Set and Caml.Hashtbl because they are trivial to use
and I know them well.

> On Dec 25, 2012 2:19 AM, "Francois Berenger" <bere...@riken.jp
> <mailto:bere...@riken.jp>> wrote:
>
> On 12/23/2012 10:33 PM, Ashish Agarwal wrote:
>
> You're right, regardless of the memory benefits, supporting
> Streamable
> is useful for data structure conversions. We should support
> it whenever
> possible.
>
> Before we commit to the Core vs Stdlib hashtbl, does anyone
> know what
> the difference is? Why did Core implement an incompatible
> type for theirs?
>
>
> They expose the hash function in their type signature I think.
> This way they can detect at compile time if they try to mix
> hashtables
> which are in some way incompatible.
>
> Currently, I always avoid core's Sets and Hashtables. I find
> their signature annoying (and getting in my way) and don't have
> at all the same safety requirements than Janestreet.
>
> Sebastien mentioned some performance-related differences.
> Usually, if I need performance, I use a Set instead of a Hashtable.
>
> On Sun, Dec 23, 2012 at 3:37 AM, Philippe Veber
> <philipp...@gmail.com <mailto:philipp...@gmail.com>
> <mailto:philippe.veber@gmail.__com
> <mailto:agarw...@gmail.com
> <mailto:agarw...@gmail.com>>__>
>
> And on a related note, did we decide to use Core's
> hashtbl?
> Philippe, I thought you wanted to stick to the
> Stdlib's hashtbl
> since the two are incompatible, and we thought
> Biocaml's exposed
> types should be Stdlib compatible wherever possible.
>
>
> On Sat, Dec 22, 2012 at 1:28 PM, Ashish Agarwal
> <agarw...@gmail.com
> <mailto:agarw...@gmail.com> <mailto:agarw...@gmail.com
> <mailto:agarw...@gmail.com>>__> wrote:
>
> Is the use of Obj.magic for efficiency? I can
> see that using
> it might avoid an extra pass over the items
> being put into
> the stream. As for memory, I agree with Malcolm
> that it's
> not an issue because you already have the items
> in memory.
>
>
> On Sat, Dec 22, 2012 at 12:41 PM, Sebastien Mondet
> <sebastie...@gmail.com
> <mailto:sebastie...@gmail.com>
> <mailto:sebastien.mondet@__gmail.com
> <mailto:sebastie...@gmail.com>>> wrote:
>
> Hi
>
> I also prefer the first option:
> - converting hashtables to streams does not
> seem useful
> enough to take
> the risk of Obj.magic: a hash-table does
> not keep any
> intersseting
> order, so anyway if order matters we'll end
> up doing
> Hashtbl.to_list
> |! List.sort, If order does not matter
> Hashtbl.fold is
> enough
> - one always ends regretting Obj.magic :)
>
> Maybe there is a real use case of ('a, 'b)
> Hashtbl.t ->
> ('a, 'b)
> Stream.t that I could not think of?
>
>
>
> On Sat, Dec 22, 2012 at 6:07 AM, Malcolm
> Matalka
> <mmat...@gmail.com
> <mailto:mmat...@gmail.com> <mailto:mmat...@gmail.com
> <mailto:mmat...@gmail.com>>> wrote:
> > The first version sounds like the best
> option. The
> reasons yo use streams
> > are because a computation is expensive
> or consumes
> too much memory. If your
> > data is in a hash table then neither of
> those are a
> concern.
> >
> > On Dec 22, 2012 10:05 AM, "Philippe Veber"
> <philipp...@gmail.com
> <mailto:philipp...@gmail.com>
> <mailto:philippe.veber@gmail.__com
> --
>
>

Reply all
Reply to author
Forward
0 new messages