http://www.cs.rice.edu/~scrosby/hash/CrosbyWallach_UsenixSec2003/index.html
I'm drawing a blank on the following statement:
"However, if each element hashes to the same bucket, the hash table
will also degenerate to a linked list, and it will take O(n*n) time to
insert n elements. "
I thought it took constant time (ie O(1) ) to search a hash table and
linear time to search a linked list. How do they get n^2 for the
insertion time?
Chad
If all your inputs hash to the same bucket, then you might as well have
a one-bucket hash table --- the other unused buckets don't need to be
there.
A one-bucket [chained] hash table is a linked list.
Insertion of one element into a linked list, in order, takes O(n).
Therefore insertion of n elements takes O(n*n).
Why do we insert in order? Well, we could just push the new element onto
the front of the list in constant time, but that would put the elements in
random order, so our lookups would have to scan to the end --- doubling
the cost of each and every lookup. Since insertions are less frequent than
lookups, it makes sense to choose the in-order solution... in this
degenerate case. In a nice evenly distributed hash table, insertion at
the head of the list should be perfectly all right.
HTH,
-Arthur
A common convention for symbol tables is to disallow duplicate keys:
if the key already exists in the symbol table, you overwrite its value
with the new one. Under this scenario, it would take Theta(k) to insert
into an unsorted list of size k, and Theta(n^2) to do n insertions. If
all elements hash to the same bucket, the hash table degenerates
to an unsorted list.
Linear time is O(n). Doing this for each of n items results in
O(n*n). However there is no need to handle overflows in this crude
way.
--
"If you want to post a followup via groups.google.com, don't use
the broken "Reply" link at the bottom of the article. Click on
"show options" at the top of the article, then click on the
"Reply" at the bottom of the article headers." - Keith Thompson
More details at: <http://cfaj.freeshell.org/google/>
Also see <http://www.safalra.com/special/googlegroupsreply/>
I don't see the logic here. A lookup would be O(n) regardless of the
ordering of the list elements?
If the bucket's sorted, you can do a binary search to lookup within the
bucket, which is better than 0(n).
Normally you wouldn't sort the buckets though, because you're already
using a hashtable and so the idea is each bucket shouldn't be that long.
Which is what Arthur said:
>> [...] In a nice evenly distributed hash table, insertion at the head
Well, binary search on a linked list is tricky, I think :) (although
algorithms for doing it, sort of, have been given in this newsgroup in
the past).
I meant exactly what I said: "our lookups would have to scan to the
end --- doubling the cost of each and every lookup." Scanning the entire
list every time is exactly twice as expensive as scanning (on average)
half of a sorted list. Big-O had nothing to do with my statement.
N.B.: My reason for scanning the list was to speed up lookups.
Googmeister gives a better reason for insertion to scan the list:
to avoid duplicates. And if you're scanning the list anyway, you
might as well keep it sorted.
-Arthur
Sorry, you're quite right, I didn't think it through.
>>> I don't see the logic here. A lookup would be O(n) regardless of the
>>> ordering of the list elements?
>> Normally you wouldn't sort the buckets though, because you're already
>> using a hashtable and so the idea is each bucket shouldn't be that long.
>
> Right. No argument there. I just can't see how sorting it changes
> anything.
>
> Arthur says that linear searching in a sorted list (for a match) is
> quicker than sorting a randomly ordered list. I cannot see how this is
> so. In either case, you will on average check half the list.
With an unsorted linked list, you have to check every element of the
list every time, so you average 100%. With a sorted linked list, you
will average 50%[1].
- Logan
[1] Getting back to the concept of security, you will only average
50% if it's an even distribution. Someone could feed you
numbers in the reverse order that you'd sort them in, and
that would have you going to the end of the list every time.
Luckily, that'd only be twice as bad, so probably not that
effective a denial of service attack...
I think I'm missing the concept of ordering a list. Say I have a list
of words like:
Apples
Oranges
Cherries
If I push each onto the font (starting with Apple), I get
Cherries, Oranges, Apples
Or as I understand it, random order.
Now, I just insert at the end. This becomes
Apples, Oranges, Cherries.
But if I sorted the list, it becomes
Apples, Cherries, Oranges.
The only thing I see is that by inserting at the end, we would preserve
the order of the original list. What did I miss here?
Chad
Right.
> Now, I just insert at the end. This becomes
> Apples, Oranges, Cherries.
>
> But if I sorted the list, it becomes
> Apples, Cherries, Oranges.
Right.
> The only thing I see is that by inserting at the end, we would preserve
> the order of the original list. What did I miss here?
Nothing, yet. But notice the following:
Search for "Cherries" in the unsorted list:
is Cherries == Cherries? yes, return
Search for "Cherries" in the sorted list:
is Apples == Cherries? no, continue
is Cherries == Cherries? yes, return
Search for "Bananas" in the unsorted list:
is Cherries == Bananas? no, continue
is Apples == Bananas? no, continue
is Oranges == Bananas? no, continue
return
Search for "Bananas" in the sorted list:
is Apples >= Bananas? no, continue
is Cherries >= Bananas? yes, return
Half the search time, on average, if the item is not in the list ---
and no worse, if the item is in the list.
-Arthur
Maybe I need to sleep on this some more, but let's go back to the
original list.
Apples
Oranges
Cherries
If I would insert this into the list so that it is:
Apples, Oranges,Cherries
Would this list be considered sorted for your above example to work?
Chad
I failed to see Arthur's point initially. But the principle behind what
he's saying is, that you can stop looking in a sorted list once you
reach a value greater than your key. Assuming a uniform distribution of
values and keys, this will happen on average halfway through the list.
The optimisation is not available on an unsorted list.
>
> Chad