Hi,
This is an interesting question.
To analyze it, let's observe what happens to the cursor when
iterating. Assume a hash table of size 2^N. On every call the next
cursor is determined by incrementing the bit-wise reverse (while
observing the size of the hash table, being 2^N). For a hash table of
size 8, the cursor sequence is [0, 4, 2, 6, 1, 5, 3, 7], or in binary
[000, 100, 010, 110, 001, 101, 011, 111]. As you can see, for a hash
table with a stable size, the most significant bit gets flipped on
every call. This means that the cursor alternates between the ranges
[0, 2^(N-1)-1] and [2^(N-1), 2^N-1], or: the first half and the second
half. When the hash table is shrunk by half to size 2^(N-1), the most
significant bit of the hashes of the elements is removed and
effectively the second half of the original range is overlaid on top
of the first half of the original range. If the cursor was pointing to
the first half before this happened, we cannot observe duplicates
since it will not be revisiting slots it has already visited. If the
cursor was pointing to the second half before this happened, we may
observe duplicates since the slot it is pointing to will be merged
with the slot in the first half of the table *that the cursor
visited in the previous call*. Think about the hash table of size 8:
shrinking it to 4 means that the following slots will be merged:
(0,4), (2,6), (1,5), and (3,7). If the cursor points to slot 2, the
algorithm will just continue where it left off regardless of the
resize operation. If the cursor points to slot 6, the algorithm *must*
revisit slot 2 to ensure all elements are emitted, even though this can
cause elements that were originally in slot 2 to be re-emitted. So
far, it looks like duplicate elements can only be emitted in a
subsequent call to SCAN. However...
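To make the cursor progression concrete, here is a small Python sketch of
the reverse-binary increment (my own illustration, not the actual C
implementation; the string round-trip through format() just makes the bit
reversal easy to read):

```python
def next_cursor(v, n):
    """Return the cursor that follows v for a table of size 2**n,
    by incrementing v's bit-reversed representation."""
    # Reverse the low n bits of v.
    r = int(format(v, f'0{n}b')[::-1], 2)
    # Increment, wrapping around at 2^n.
    r = (r + 1) % (2 ** n)
    # Reverse back to get the next cursor.
    return int(format(r, f'0{n}b')[::-1], 2)

cursor, seq = 0, []
for _ in range(8):
    seq.append(cursor)
    cursor = next_cursor(cursor, 3)
print(seq)  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Note how the printed sequence matches the one above: with a stable table
size, consecutive cursors differ only in the most significant bit.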
This works for the case where you shrink a hash table from size 2^N to
2^(N-1). However, it breaks down when you shrink the hash table from
2^N to 2^(N-2) or further. Instead of worrying about the first half
and second half of the hash table, we need to worry about 4 quarters
of the hash table. If your cursor points to quarter #0, the algorithm
will continue where it left off and the user will not see any
duplicates. If the cursor points to quarter #2, the algorithm may
return elements already seen from quarter #0. If the cursor points to
quarter #1, the algorithm may return elements already seen from
quarters #0 and #2. If the cursor points to quarter #3, the algorithm
may return elements already seen from quarters #0, #2, and #1. It is
relatively easy to see how you can extrapolate to shrinking a hash
table from 2^N to 2^(N-M). The larger the factor by which the hash
table is shrunk, the more results from calls to SCAN you need to
retain to filter duplicates on the client side. In fact, it seems that
you would need to retain 2^M-1 results.
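The quarter case can be simulated with a toy model (again my own sketch,
not the Redis code): assume elements 0..7 hash to slot (element mod size),
scan a size-8 table for two calls, then shrink it to size 2 (M=2) while
the cursor points into quarter #1, and continue:

```python
def next_cursor(v, n):
    # Reverse-binary increment over n bits (see earlier sketch).
    r = int(format(v, f'0{n}b')[::-1], 2)
    r = (r + 1) % (2 ** n)
    return int(format(r, f'0{n}b')[::-1], 2)

def slot_contents(slot, size):
    # Elements whose (toy) hash maps to this slot.
    return [e for e in range(8) if e % size == slot]

emitted, cursor = [], 0
# Two calls against the size-8 table: visits slots 0 and 4.
for _ in range(2):
    emitted += slot_contents(cursor, 8)
    cursor = next_cursor(cursor, 3)
# Cursor is now 2 (quarter #1). Shrink 8 -> 2: mask to the new size.
cursor &= 1          # 2 & 1 == 0: back to slot 0 of the small table
while True:
    emitted += slot_contents(cursor, 2)
    cursor = next_cursor(cursor, 1)
    if cursor == 0:
        break
print(emitted)       # [0, 4, 0, 2, 4, 6, 1, 3, 5, 7]
duplicates = sorted(e for e in set(emitted) if emitted.count(e) > 1)
print(duplicates)    # [0, 4]
```

The duplicates are exactly the elements already seen from quarters #0 and
#2, as described above for a cursor in quarter #1.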
Since any amount of time can pass between calls to SCAN, M can be
greater than 1. To get to a state where you don't need to filter for
duplicates, you would need to make sure all active cursors point to
the first half before shrinking it down further. This requires keeping
state on the server and can cause independent cursors to be blocked on
each other. While certainly possible, I don't think this is a good
idea given the lightweight nature of the current implementation.
To answer your other question: hash collisions can cause multiple
elements to be present in the same hash slot, so multiple elements can
be duplicated.
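A tiny illustration of that point, using the same toy hash (value mod
size): two colliding elements land in one slot, so a slot that gets
revisited re-emits all of its elements, not just one.

```python
size = 8
slots = {}
for e in (3, 11):            # 3 % 8 == 11 % 8 == 3: a collision
    slots.setdefault(e % size, []).append(e)
print(slots)                 # {3: [3, 11]}
# A SCAN that must revisit slot 3 after a shrink re-emits [3, 11],
# i.e. every element chained in the slot.
```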
I hope this clarifies a few things.
Cheers,
Pieter