Hashing
Computing the location of data instead of traversing a datastructure, without key comparisons.
With hashing, data is stored in an array-like structure called a hash table.
A hash function assigns a hash address (=index in the hash table) to each key , .
Since the number of possible keys is usually much larger than the number of hash addresses, multiple keys may map to the same address (a collision): is generally not injective.often consists of two steps:
A hash code maps each key to an integer, and
A compression function maps the integer to a valid hash address.A good hash function is easy to calculate, distributes keys as evenly as possible over the hash addresses, and minimizes collisions.
… load factor (stored items / hash table size)
… are synonyms (collide)
Integer = 32-bit signed
For 64 bit long integers, a simple hash code function is:
int hashCode(long i) {
return (int)((i >> 32) + (int)i); // upper 32 + lower 32 bits
}This works because the upper and lower bits of long integers are often not correlated.
For strings, a simple character sum would fail, things like “abc” and “cab” would collide.
Instead, we can treat characters as polynomial coefficients in some base :
Or computed via horner’s method
Horner’s method is more efficient, requiring only multiplications and additions.
With , there are only 6 collisions in the english dictionary (50k words).
To compress the hash code to a valid address, we can use the modulo operator:
Even are bad if the last bit encodes something (→ even/odd bias if there are many even/odd keys).
is even worse, as it only uses the lower bits of the hash code, losing information from the higher bits.
Empircally, the best choice for is a prime number.
Turan Sos Theorem
Let be an irrational number. If you plance points
in the interval , the resulting intervals have at most three different lengths.
No matter how many points you place this way, you will never have more than 3 distinct gap sizes.When you place the next point, it always lands inside one of the largest gaps, splitting it.
Of all numbers , the most uniform distribution is achieved by choosing , the fractional part of the golden ratio.
This is very convenient for hashing: New keys automatically go to the emptiest regions.
Universal Hashing
Problem: For any fixed , an adversary can construct key sets with arbitrarily many collisions.
Solution: Pick randomly from a family of hash functions.is universal if for any :
The fraction says “what proportion of my hash functions cause a collision between and “.
The universal property says that for any two distinct keys, at most of the hash functions cause a collision.
So the family is at least as good as random guessing.Equivalently: