Efficient idiom for converting key-set of a hash-map into a hash-set

72 views
Skip to first unread message

Thomas Dickerson

unread,
Oct 25, 2019, 10:28:40 AM10/25/19
to Racket Users
For reasons that are unclear to me, hash-keys and dict-keys both return a list? instead of a set? (frankly, this seems like a bug in the API...).

Is there a something I can do that's more efficient than just calling list->set?

David Storrs

unread,
Oct 25, 2019, 4:19:42 PM10/25/19
to Thomas Dickerson, Racket Users
You could do '(apply set list-of-keys)', although I'm not sure if
that's more efficient or not. You'd need to benchmark it.


>
> --
> You received this message because you are subscribed to the Google Groups "Racket Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to racket-users...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/racket-users/36eb8bd7-45eb-4ebb-be30-ebdfd89023fe%40googlegroups.com.

Sorawee Porncharoenwase

unread,
Oct 25, 2019, 4:21:58 PM10/25/19
to David Storrs, Thomas Dickerson, Racket Users
It also might be worth trying `(for/set ([(k v) (in-hash h)]) k)`.

Stephen Chang

unread,
Oct 25, 2019, 4:27:47 PM10/25/19
to Sorawee Porncharoenwase, David Storrs, Thomas Dickerson, Racket Users
I don't know of a more efficient way to convert, but depending on what
operations you want, you might be able to compute it directly on the
hash, e.g., see hash-has-key?, hash-keys-subset?, hash-union, etc. (I
didnt know about some of the functions until recently.)
> To view this discussion on the web visit https://groups.google.com/d/msgid/racket-users/CADcuegvk8wqkJsdJNqkQXSHi5FgYGZ4afRODYoZD8FnYmOcxVg%40mail.gmail.com.

Jon Zeppieri

unread,
Oct 25, 2019, 4:37:59 PM10/25/19
to Stephen Chang, Sorawee Porncharoenwase, David Storrs, Thomas Dickerson, Racket Users
From a little test I just ran, building a set of hash keys, using a
hash with 100,000 k/v pairs and generating the key set 100 times each:

"list->set"
cpu time: 3147 real time: 3148 gc time: 1137

"apply"
cpu time: 3205 real time: 3206 gc time: 1146

"in-hash"
cpu time: 2791 real time: 2791 gc time: 974

"in-hash-keys"
cpu time: 2748 real time: 2749 gc time: 981
> To view this discussion on the web visit https://groups.google.com/d/msgid/racket-users/CAFfiA1JXu_rEOP3V7vYO8oLW%2BfgOJrjK9%3DxXhr-WRYNvmVr6fQ%40mail.gmail.com.

Jon Zeppieri

unread,
Oct 25, 2019, 5:12:47 PM10/25/19
to Stephen Chang, Sorawee Porncharoenwase, David Storrs, Thomas Dickerson, Racket Users
Oh -- forgot to mention: if you're using an immutable hash, then
`in-immutable-hash-keys` is significantly faster than `in-hash-keys`.
I mean:

(for/set ([k (in-immutable-hash-keys h)]) k)

Matthew Butterick

unread,
Oct 26, 2019, 2:40:16 AM10/26/19
to Thomas Dickerson, Racket Users

On Oct 25, 2019, at 7:28 AM, Thomas Dickerson <thomas_d...@alumni.brown.edu> wrote:

For reasons that are unclear to me, hash-keys and dict-keys both return a list? instead of a set?(frankly, this seems like a bug in the API...).


A dict doesn't guarantee unique keys. So `dict-keys` returns a list, not a set. 

#lang racket
(require racket/dict)
(define kvs '((a 1) (a 2)))
(dict? kvs) ; #t
(dict-keys kvs) ; '(a a)

Why does `hash-keys` also return a list? For consistency, I suppose, because every hash is a dict.

Jack Firth

unread,
Oct 26, 2019, 4:14:56 AM10/26/19
to Racket Users
A dict doesn't guarantee unique keys, but it does guarantee that dict-ref returns a single value and dict-set accepts a single value, which is really strange to guarantee if you're not also going to guarantee unique keys.

Honestly, I think association-lists-as-dicts were a mistake. If you want to allow multiple values per key, then use a collection interface that's actively designed around that (such as multidicts). Racket needs to grow beyond using lists as a bunch of different ad-hoc collection types.

Thomas Dickerson

unread,
Oct 29, 2019, 1:34:37 PM10/29/19
to Racket Users
Hi All,

Thanks for the information + various ideas.
The various suggest constructs provide a helpful view on different corners of the language, but all appear to have the characteristic of throwing out the existing hashtable structure from the map and then reconstructing it from scratch for the new set (which is fine in a big-O sense, but feels like it can be improved upon in principle).

Responding specifically to Stephen:
I don't know of a more efficient way to convert, but depending on what
operations you want, you might be able to compute it directly on the
hash, e.g., see hash-has-key?, hash-keys-subset?, hash-union, etc. (I
didnt know about some of the functions until recently.)

I agree that if I just needed to query an existing notional set, this would be a good approach without having to copy at all.
Unfortunately, I need to be able to update the set over time without modifying the original map, and maintaining a second map when I only need the keys seems even more wasteful than the intermediate conversion to list.

Best,
~Thomas

Simon Schlee

unread,
Oct 30, 2019, 7:52:10 AM10/30/19
to Racket Users
Hi Thomas,

it seems to me that you may have a misleading intuition for racket's hash tables.
Racket has mutable and immutable hash tables these are implemented differently.
Because as a user of hash tables you have to differentiate between mutable and immutable,
there are also different hash operations some of which can only be used with mutable and other that are used with immutable hash tables.

Immutable hash tables are excellent for your use case, because you can easily keep around the original version just by keeping a variable/reference to it.
As its name implies it is immutable/not changed, so when you "update" the hash that you are using as a set you aren't really updating it, instead you create a new updated hash from it, while the original version remains as it was. Because the immutable hash is implemented as a persistent data-structure with structural sharing, this is efficient,
because it doesn't copy everything but instead only copies a few small internal nodes, those which need to be changed, the rest refers to the old/original version.
Depending on the racket version there are different internal implementations of immutable hashes,
but you can expect operations like hash-set to create a small extended version of your hash (if you are using racket's immutable hashes) that doesn't copy the complete hash.

Immutable hash tables actually provide O(log N) access and update. Since N is limited by the address space so that log N is limited to less than 30 or 62 (depending on the platform), log N can be treated reasonably as a constant.

If you want to use the original hash as if it were a set you can do that by using hash operations instead of set operations like Stephen suggested.
Another possibility is to create a custom-set implementation which wraps the hash to appear like a set, here is an example implementation:

#lang racket

(provide (contract-out
          [proxy-set (-> (and/c hash? immutable?) generic-set?)])
         proxy-set->set
         proxy-set->seteq
         proxy-set->seteqv)

(require racket/struct)

(struct proxy-set [hash]
  ;; #:transparent
  #:methods gen:set
  [(define (set-member? s k)
     (hash-has-key? (proxy-set-hash s) k))
   (define (set-add s v)
     (proxy-set (hash-set (proxy-set-hash s) v #f)))
   (define (set-remove s v)
     (proxy-set (hash-remove (proxy-set-hash s) v)))
   (define (set-empty? s)
     (hash-empty? (proxy-set-hash s)))
   (define (set-count s)
     (hash-count (proxy-set-hash s)))
   (define (set->stream s)
     (sequence->stream (in-immutable-hash-keys (proxy-set-hash s))))]

  #:methods gen:custom-write
  [(define write-proc
     (make-constructor-style-printer
      (lambda (s) 'proxy-set)
      (lambda (s) (in-immutable-hash-keys (proxy-set-hash s)))))])

;; these functions are in case you eventually don't need the proxy anymore
;; and want to convert to a plain racket set
(define/contract (proxy-set->set s)
  (-> proxy-set? (and/c generic-set? set-equal? set?))
  (for/set ([k (in-immutable-hash-keys (proxy-set-hash s))])
    k))

(define/contract (proxy-set->seteq s)
  (-> proxy-set? (and/c generic-set? set-eq? set?))
  (for/seteq ([k (in-immutable-hash-keys (proxy-set-hash s))])
    k))

(define/contract (proxy-set->seteqv s)
  (-> proxy-set? (and/c generic-set? set-eqv? set?))
  (for/seteqv ([k (in-immutable-hash-keys (proxy-set-hash s))])
    k))

(define-syntax-rule (show x)
  (displayln (format "~a: ~a" (quote x) x)))

(define (example)
  (define original-map (hasheq 'a 4 'b 7 'c 1))
  (show original-map)

  (define extended-map (hash-set original-map 'another 42))
  (show extended-map)

  (define derived-set (proxy-set original-map))
  (show derived-set)

  (define extended-set (set-union derived-set (seteq 'd 'e)))
  (show extended-set)

  (displayln "after all that the original map remains unchanged")
  (show original-map)
  (displayln "and the extended set does not modify the derived set")
  (show derived-set)

  (displayln "you also can merge the new keys from the extended-map")
  (displayln "with the extended-set at a later point")
  (define merged-set (set-union extended-set (proxy-set extended-map)))
  (show merged-set)

  (displayln "or convert the proxy-set to a plain racket set")
  (define plain-set (proxy-set->seteq merged-set))
  (show plain-set))

(module+ main
  (example))

;; output
;; original-map: #hasheq((a . 4) (b . 7) (c . 1))
;; extended-map: #hasheq((a . 4) (another . 42) (b . 7) (c . 1))
;; derived-set: #<proxy-set: a c b>
;; extended-set: #<proxy-set: a e d c b>
;; after all that the original map remains unchanged
;; original-map: #hasheq((a . 4) (b . 7) (c . 1))
;; and the extended set does not modify the derived set
;; derived-set: #<proxy-set: a c b>
;; you also can merge the new keys from the extended-map
;; with the extended-set at a later point
;; merged-set: #<proxy-set: a e another d c b>
;; or convert the proxy-set to a plain racket set
;; plain-set: #<seteq: a e another d c b>

Thomas Dickerson

unread,
Oct 30, 2019, 11:06:29 AM10/30/19
to Simon Schlee, Racket Users
Hi Simon -

Thanks for the detailed reply, my responses are in-line below.

~Thomas

On Wed, Oct 30, 2019 at 7:52 AM Simon Schlee <schle...@gmail.com> wrote:
Racket has mutable and immutable hash tables these are implemented differently.
Because as a user of hash tables you have to differentiate between mutable and immutable,
I dug into the source a little bit here: for posterity, the immutable hash "tables" are actually HAMTs.
There's no particular reason why the mutable hash tables couldn't also be HAMTs, but the fact that they're not suggests a good reason at least for why there wouldn't be an efficient direct conversion from mutable maps and sets to immutable maps and sets.
 
Immutable hash tables are excellent for your use case, because you can easily keep around the original version just by keeping a variable/reference to it.
As its name implies it is immutable/not changed, so when you "update" the hash that you are using as a set you aren't really updating it, instead you create a new updated hash from it, while the original version remains as it was. Because the immutable hash is implemented as a persistent data-structure with structural sharing, this is efficient,
I'm reasonably familiar with the semantics of persistent data structures =)
 
because it doesn't copy everything but instead only copies a few small internal nodes, those which need to be changed, the rest refers to the old/original version.
Depending on the racket version there are different internal implementations of immutable hashes,
but you can expect operations like hash-set to create a small extended version of your hash (if you are using racket's immutable hashes) that doesn't copy the complete hash.
In my use case, persistence will result in O(N log N) unnecessary allocations, since the root-to-leaf path will be rewritten N times as I drain the set.
Threading the state in and out of the function calls implementing my loop will also make my code a lot more verbose for no particular gain, as opposed to just wrapping a mutable set in a closure.
 
Immutable hash tables actually provide O(log N) access and update. Since N is limited by the address space so that log N is limited to less than 30 or 62 (depending on the platform), log N can be treated reasonably as a constant.
This is actually six years out of date and should be updated: a HAMT is typically bounded to depth 6 or 7.
This is a nice feature of the Racket libraries, thanks for illustrating it.
 

Simon Schlee

unread,
Oct 30, 2019, 2:26:28 PM10/30/19
to Racket Users
Hi Tomas,

you say you drain the set, do you remove elements one by one from it?
Do you need more from that set than only getting successive elements from it?

It seems you are implementing something, which might require going lower level or using different data structures.
I am still not really able to fully grasp your use-case.
Just thinking out loud here:
It could be that the internal structure of the implementation would allow to add some operation that lets you directly do what you want?

A long time ago before I found racket, I used python and before that c++, this thread reminds me of playing around with an early version of this library https://www.boost.org/doc/libs/1_71_0/libs/multi_index/doc/index.html 
Maybe racket should have something similar which lets you create custom data-types with multiple access semantics in a generative programming family of types way, through some kind of dsl that describes/configures the datatype that is being generated.

But currently I don't work on lowlevel / highly optimized code, although I find it very interesting.
Reply all
Reply to author
Forward
0 new messages