Abstract

This article presents a theoretical analysis of hashing using chaining and universal hashing. All results are derived using probability theory and elementary number theory. The focus is on understanding both idealized probabilistic models and their implications for practical hash function design.

Prerequisites

This article assumes that the reader has basic familiarity with the concept of hashing as used in data structures and algorithms.

In particular, the reader is expected to know:

What a hash table is and how it is used to implement dictionary operations such as insertion, search, and deletion.
The role of a hash function in mapping keys to table indices.
The concept of collisions, where multiple keys map to the same table slot.
Common collision resolution strategies, including chaining and open addressing.

No prior knowledge of probabilistic analysis or advanced hashing theory is assumed; all such concepts are developed within the article.

Model, notation, and assumptions

We begin by formally specifying the hashing model and the notation used throughout the analysis.

Let:

be the universe of keys.
be the set of stored keys.
be the number of hash table slots.
be a hash function.

Load factor

The load factor of the hash table is defined as

Intuitively, the load factor measures the average occupancy of the table and plays a central role in the performance analysis of hashing schemes.

It represents the expected number of keys per slot.

Simple Uniform Hashing (SUH) assumption

To enable tractable probabilistic analysis, we adopt the Simple Uniform Hashing (SUH) assumption.

For every key , and hash values of distinct keys are mutually independent.

This is an assumption, not a property of real hash functions. All expectations derived below are conditional on this assumption and represent idealized average-case behavior.

Expected number of keys in a slot (chaining)

We first analyze the expected distribution of keys across slots when chaining is used to resolve collisions.

Fix a slot .

Define indicator random variables:

The number of keys stored in slot is

Taking expectation and using linearity of expectation,

Since is an indicator variable,

Substituting,

Thus, under SUH, the expected chain length in every slot equals the load factor.

Collision probability between two distinct keys

We now quantify the likelihood of collisions between pairs of distinct keys, which will be used in the analysis of search cost.

Let be two distinct keys.

Define

Then

Using the law of total probability,

By independence under SUH,

Therefore,

Expected cost of successful search (chaining)

Using the collision analysis above, we now derive the expected cost of a successful search in a hash table with chaining.

Assume that each key is inserted at the head of its chain.

Keys are inserted in the order

When searching for , only keys inserted after may appear before it in the chain.

Thus, the cost of searching for is

Taking expectation,

Averaging over all keys,

Using ,

Expected cost of unsuccessful search (chaining)

A complete analysis of chaining must also consider the cost of an unsuccessful search.

An unsuccessful search for a key examines all keys stored in the slot . Under the Simple Uniform Hashing assumption, the expected number of keys in any slot equals the load factor .

Therefore, the expected cost of an unsuccessful search is

Including the constant-time hash computation, unsuccessful searches take expected time which serves as a baseline comparison for the successful search bound.

Division method and structural failure

The preceding analysis relies critically on the Simple Uniform Hashing assumption, which models hash functions as ideal random mappings. In practice, however, hash functions are deterministic and may violate this assumption.

We now examine the division method as a concrete example illustrating how poor design choices can lead to systematic clustering.

Consider the division method

If , then depends only on the lowest bits of .

Hence, any regularity in lower-order bits of keys produces systematic clustering.

More generally, for keys of the form the sequence covers only distinct slots.

Choosing prime ensures for all , preventing such collapse.

Universal hashing

To overcome the limitations of deterministic hashing schemes, we introduce the concept of universal hashing.

Definition

A family of hash functions is universal if

Universal hash family construction

Let be a prime such that .

Choose:

Define

Proof of universality

Fix .

Let

For fixed , as ranges uniformly, the pair is uniformly distributed over .

A collision occurs when

For a fixed , the number of satisfying this congruence is at most .

Thus,

Since ,

Strictly speaking, the above bound yields which is sometimes referred to as almost universal hashing. In standard treatments, is chosen sufficiently large compared to , making the additional term negligible. Under this convention, the family satisfies the universal hashing requirement for all practical and theoretical purposes.

Final conclusions

We summarize the main theoretical insights obtained in this analysis.

The expected chain length equals the load factor.
Successful search time under chaining is .
Unsuccessful search time under chaining is also .
Poor modulus choice causes deterministic clustering.
Universal hashing guarantees bounded collision probability independent of input distribution.

Continuation: Open Addressing and Advanced Topics

This article has focused on hashing with chaining under idealized probabilistic assumptions and on the design of universal hash families. In Part II of this article, we will study open-address hashing, where all keys are stored directly within the hash table itself and collisions are resolved by systematic probing sequences.

The next part will develop both algorithmic techniques and their rigorous analysis, with particular emphasis on probabilistic bounds and worst-case guarantees.

Topics to be covered in Part II

Part II will cover the following topics in detail:

Open addressing and probe sequences
Linear probing
Quadratic probing
Double hashing
Expected-time analysis of open-address hashing
Longest-probe bound for hashing
Slot-size bounds for chaining
Perfect hashing
Hashing and authentication

Hashing: Chaining, Collision Analysis, and Universal Hashing