Hashing

Computing the location of data instead of traversing a datastructure, without key comparisons.
With hashing, data is stored in an array-like structure called a hash table.
A hash function assigns a hash address (=index in the hash table) to each key , .
Since the number of possible keys is usually much larger than the number of hash addresses, multiple keys may map to the same address (a collision): is generally not injective.

often consists of two steps:
A hash code maps each key to an integer, and
A compression function maps the integer to a valid hash address.

A good hash function is easy to calculate, distributes keys as evenly as possible over the hash addresses, and minimizes collisions.

… load factor (stored items / hash table size)
are synonyms (collide)

Integer = 32-bit signed
For 64 bit long integers, a simple hash code function is:

int hashCode(long i) {
	return (int)((i >> 32) + (int)i);  // upper 32 + lower 32 bits
}

This works because the upper and lower bits of long integers are often not correlated.

For strings, a simple character sum would fail, things like “abc” and “cab” would collide.
Instead, we can treat characters as polynomial coefficients in some base :

Or computed via horner’s method

Horner’s method is more efficient, requiring only multiplications and additions.
With , there are only 6 collisions in the english dictionary (50k words).

To compress the hash code to a valid address, we can use the modulo operator:

Even are bad if the last bit encodes something (→ even/odd bias if there are many even/odd keys).
is even worse, as it only uses the lower bits of the hash code, losing information from the higher bits.
Empircally, the best choice for is a prime number.

Turan Sos Theorem

Let be an irrational number. If you plance points

in the interval , the resulting intervals have at most three different lengths.
No matter how many points you place this way, you will never have more than 3 distinct gap sizes.

When you place the next point, it always lands inside one of the largest gaps, splitting it.

Of all numbers , the most uniform distribution is achieved by choosing , the fractional part of the golden ratio.

This is very convenient for hashing: New keys automatically go to the emptiest regions.

Universal Hashing

Problem: For any fixed , an adversary can construct key sets with arbitrarily many collisions.
Solution: Pick randomly from a family of hash functions.

is universal if for any :

The fraction says “what proportion of my hash functions cause a collision between and “.
The universal property says that for any two distinct keys, at most of the hash functions cause a collision.
So the family is at least as good as random guessing.

Equivalently:


https://claude.ai/chat/e6329d19-820e-4bb6-88f5-1a3f837faad8