Hashing
Literature references and annotations by Dick Grune, dick@dickgrune.com.
Last update: Sat Sep 19 08:19:12 2009.
These references and annotations were originally intended
for personal use and are presented here only in the hope
that they may be useful to others.
There is no claim to completeness or even correctness.
Each annotation represents my understanding of the text
at the moment I wrote the annotation.
No guarantees given; comments and content criticism welcome.
Peter K. Pearson,
Fast hashing of variable-length text strings,
Commun. ACM,
vol. 33,
#6,
June 1990,
pp. 677-680.
Technical correspondence in CACM, Nov 1991.
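The entry carries no annotation, but the paper's scheme (now commonly called Pearson hashing) is simple enough to sketch: the hash is driven through a 256-entry permutation table, one byte at a time. A minimal Python sketch; the permutation below is randomly generated for illustration, not the table from the paper:

```python
import random

random.seed(1990)                # fixed seed so the sketch is reproducible
T = list(range(256))
random.shuffle(T)                # the 256-entry auxiliary permutation table

def pearson_hash(s: str) -> int:
    """One pass over the bytes, each step indexing the permutation table."""
    h = 0
    for c in s.encode():
        h = T[h ^ c]
    return h
```

Because T is a permutation, two strings differing only in their last byte are guaranteed different hash values; the result is a single byte, so ranges larger than 256 need several passes with modified tables.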
B.J. McKenzie,
R. Harries,
T. Bell,
Selecting a hashing algorithm,
Software: Practice and Experience,
vol. 20,
#2,
Feb. 1990,
pp. 209-224.
Statistical tests are applied to 8 hash functions extracted from 8 actual
compilers.
Five are linear congruence functions, and these are the ones that perform best.
The very best is
$ H = h sub n ~ bold mod ~ 2 sup 30 ~ bold mod ~ 1008 $, with
$ h sub i = 613 h sub { i - 1 } + c sub i $.
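A minimal Python rendering of this winning function, assuming the $ c sub i $ are the byte codes of the characters and the hash starts at zero (neither of which the annotation specifies):

```python
def hash613(s: str, table_size: int = 1008) -> int:
    # h_i = 613 * h_{i-1} + c_i, accumulated mod 2**30,
    # then reduced modulo the table size (1008 in the paper).
    h = 0
    for c in s.encode():
        h = (613 * h + c) % (1 << 30)
    return h % table_size
```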
M.V. Ramakrishna,
P.Å. Larson,
File organization using composite perfect hashing,
ACM Trans. Database Syst.,
vol. 14,
#2,
June 1989,
pp. 231-263.
Authors' abstract:
Perfect hashing refers to hashing with no overflows.
We propose and analyze a composite perfect hashing
scheme for large external files. The scheme guarantees
retrieval of any record in a single disk access.
Insertions and deletions are simple, and the file size
may vary considerably without adversely affecting the
performance. A simple variant of the scheme supports
efficient range searches in addition to being a
completely dynamic file organization scheme. These
advantages are achieved at the cost of a small amount
of additional internal storage and increased cost of
insertions.
R.N. Horspool,
G.V. Cormack,
Hashing as a compaction technique for LR parser tables,
Software: Practice and Experience,
vol. 17,
#6,
June 1987,
pp. 413416.
Remarkably, shows that hashing is at most a bearable alternative to
graph colouring and other compression techniques, its most attractive
feature being its ease of implementation.
The possibility of applying perfect hashing is mentioned only in
passing.
G.V. Cormack,
R.N. Horspool,
M. Kaiserswerth,
Practical perfect hashing,
Computer J.,
vol. 28,
#1,
1985,
pp. 54-58.
Describes open hashing in which the second level is implemented as hash
tables with a possibly different hash function for each second-level
hash table.
For each entry, the primary hash table records the position of the
secondary hash table plus the function used.
The secondary tables are kept free from hash duplicates; this ensures
retrieval in constant time, which is the authors' interpretation of
the term "perfect hashing".
If a hash duplicate occurs during insertion, the old (secondary) hash
function is abandoned and the contents of the entire secondary hash table
are rehashed to a larger array, using another hash function, one that
does not create duplicates.
This function is found by exhaustive search.
As an implementation detail, the secondary hash tables are kept in a
single array; this is important because the algorithm is designed to use
secondary storage for the secondary tables.
Two variants are discussed, both based on the heuristic that if the
primary hash table is large enough, the secondary tables are small; here
"large enough" means having roughly one entry for each element.
The first variant uses perfect hash functions for the secondary tables,
thus achieving optimal packing.
The functions are again found by exhaustive search, during insertion.
It is proved that insertion can be done in constant time, but some
insertions can be hundreds of times more expensive than others.
In the second variant, a secondary set of r elements is stored in a
hash table of size $ r sup 2 $.
Under this regimen, the variation in the insertion time is no more than
a factor of 10.
Detailed statistics are given.
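The second variant can be sketched in a few dozen lines. The sketch below is mine, not the authors': the class and method names are illustrative, and the parameterized hash family (a seeded use of Python's built-in `hash`) stands in for the family of functions the paper searches. On every insertion into a bucket, the whole bucket of r keys is rehashed into a table of size $ r sup 2 $ with a seed found by exhaustive search, so lookups probe exactly one secondary slot:

```python
class PerfectHashTable:
    """Two-level hashing: a primary table of buckets, each backed by a
    collision-free secondary table of size r*r for its r keys."""

    def __init__(self, num_buckets=16):
        self.m = num_buckets
        self.buckets = [[] for _ in range(self.m)]   # keys per primary slot
        self.secondary = [None] * self.m             # (seed, table) per slot

    def _h(self, key, seed, size):
        # illustrative parameterized family; the paper searches its own
        return hash((seed, key)) % size

    def _rebuild(self, b):
        """Find, by exhaustive search, a seed that hashes the bucket's r
        keys into r*r slots without a duplicate."""
        keys = self.buckets[b]
        size = max(1, len(keys) ** 2)
        for seed in range(1, 10000):
            table = [None] * size
            for k in keys:
                i = self._h(k, seed, size)
                if table[i] is not None:
                    break                            # duplicate: try next seed
                table[i] = k
            else:
                self.secondary[b] = (seed, table)
                return
        raise RuntimeError("no collision-free seed found")

    def insert(self, key):
        b = hash(key) % self.m
        if key not in self.buckets[b]:
            self.buckets[b].append(key)
            self._rebuild(b)                         # rehash the whole bucket

    def contains(self, key):
        b = hash(key) % self.m
        if self.secondary[b] is None:
            return False
        seed, table = self.secondary[b]
        return table[self._h(key, seed, len(table))] == key
```

The quadratic secondary size is what bounds the rebuild cost: with r*r slots a random function is collision-free on r keys with probability better than 1/2, so the expected number of seeds tried per rebuild is small.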
W.C. Chen,
J.S. Vitter,
Analysis of new variants of coalesced hashing,
ACM Trans. Database Syst.,
vol. 9,
#4,
December 1984,
pp. ??????.
See also vol. 10, #1, March 1985.
Ronald Fagin,
Jurg Nievergelt,
Nicholas Pippenger,
H. Raymond Strong,
Extendible hashing -- a fast access method for dynamic files,
ACM Trans. Database Syst.,
vol. 4,
#3,
Sept. 1979,
pp. 315-344.
The hash table has $ 2 sup d $ entries pointing to leaves. Each leaf has a
local depth $ d prime $, with $ d prime <= d $. Entry k points to a leaf
which contains between $min$ and $max$ records; for each record in this leaf
we have
$ h ~ mod ~ 2 sup d prime = k ~ mod ~ 2 sup d prime $,
where h is the record's hash value.
This arrangement means that if for a leaf $ d prime < d $,
$ 2 sup { d - d prime } $ entries in the hash table will point to that leaf.
When a record with hash value h is to be inserted, it is put in the leaf
pointed to by entry $ h ~ mod ~ 2 sup d prime $.
If, as a result, this leaf is now too full, there are two possibilities:
$ d prime < d $ and $ d prime = d $.
If $ d prime < d $ the leaf is pointed to by more than one entry; the leaf
is split into two leaves.
If $ d prime = d $ the hash table is doubled by duplicating it; now
$ d prime < d $ and the previous case applies.
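The split-and-double mechanics described above can be sketched as follows. The leaf capacity, the use of the low-order bits of $h$ as directory index, and keying records by their hash value alone are assumptions of this sketch, not details from the paper (which also keeps leaves between $min$ and $max$ records; only the maximum is enforced here):

```python
class Leaf:
    def __init__(self, depth):
        self.depth = depth            # local depth d'
        self.records = {}             # hash value -> record

BUCKET_MAX = 4                        # assumed leaf capacity

class ExtendibleHash:
    def __init__(self):
        self.d = 0                    # global depth: 2**d directory entries
        self.dir = [Leaf(0)]

    def _leaf(self, h):
        return self.dir[h % (1 << self.d)]

    def insert(self, h, record):
        self._leaf(h).records[h] = record
        while len(self._leaf(h).records) > BUCKET_MAX:
            self._split(self._leaf(h))

    def _split(self, leaf):
        if leaf.depth == self.d:      # d' = d: double the directory first
            self.dir = self.dir * 2
            self.d += 1
        bit = leaf.depth              # now d' < d; bit d' separates the halves
        leaf.depth += 1
        sibling = Leaf(leaf.depth)
        # records whose distinguishing bit is 1 move to the new leaf
        for h in [k for k in leaf.records if (k >> bit) & 1]:
            sibling.records[h] = leaf.records.pop(h)
        # redirect the directory entries that should now point at the sibling
        for i in range(len(self.dir)):
            if self.dir[i] is leaf and (i >> bit) & 1:
                self.dir[i] = sibling

    def lookup(self, h):
        return self._leaf(h).records.get(h)
```

Note that doubling is just list duplication: with low-bit indexing, entry $ i $ of the doubled directory must point where entry $ i ~ bold mod ~ 2 sup d $ pointed before, which is exactly what copying the directory after itself achieves.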
Renzo Sprugnoli,
Perfect hashing functions: a single probe retrieving method for static sets,
Commun. ACM,
vol. 20,
#11,
Nov. 1977,
pp. 841-850.
[ Pages and pages of number theory ]
G. Schay,
W.G. Spruth,
Analysis of a file addressing method,
Commun. ACM,
vol. 5,
#8,
Aug. 1962,
pp. 459-462.
The authors consider open hashing of records in a sequential file on
disk and the point is to reduce (sequential) search time (this is 1962).
(They agree that chained hashing is preferable for random access
memory.)
Suppose a new record R hashes (the authors don't use the word "hash")
to address A.
Rather than storing R in the first empty bin after A (allowing
buffer circularity), which can be far away, they store it in the first
bin that holds a record S that hashes to a higher address than A (or
of course in an empty bin if they meet that first).
The evicted record S is then inserted in the same fashion, starting from
the next bin, and so on, until an empty bin is found and the eviction
process ends.
This constructs a sorted list of records with hash values $ >= A $
starting from each address A in the file.
This makes sure that the records with hash value A are never far away
from address A.
An exemplary analysis of the algorithm is given, resulting in a function
describing the expected displacement of records as a function of the
load factor.
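The displacement insertion and the resulting search cut-off can be sketched as follows. This rendering is mine: slots hold (home address, record) pairs, the `key_of` accessor is a hypothetical way to get a record's key, and buffer circularity is omitted (the table is assumed large enough that no wrap-around occurs):

```python
def ordered_insert(table, record, home):
    """Walk forward from the home address; the first slot that is empty or
    holds a record with a larger home address receives the incoming record,
    and an evicted record is re-inserted the same way."""
    cur = (home, record)
    i = home
    while table[i] is not None:
        if table[i][0] > cur[0]:
            table[i], cur = cur, table[i]   # evict the later-homed record
        i += 1                              # no wrap-around in this sketch
    table[i] = cur

def search(table, key, home, key_of):
    """Scan from the home address; stop at an empty slot or as soon as the
    home addresses encountered exceed the sought one."""
    i = home
    while table[i] is not None and table[i][0] <= home:
        if table[i][0] == home and key_of(table[i][1]) == key:
            return table[i][1]
        i += 1
    return None
```

The early stop in `search` is what the sorted chains buy: a record with home address A can only lie beyond slots occupied by records with home addresses at most A, so the first larger home address proves absence.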
L.R. Johnson,
An indirect method for addressing on secondary keys,
Commun. ACM,
vol. 4,
#??,
??? 1961,
pp. 218-222.
[First publication on chained hashing?]
