Hashing

> Literature references and annotations by Dick Grune, dick@dickgrune.com.
Last update: Fri Sep 06 15:53:23 2024.

These references and annotations were originally intended for personal use and are presented here only in the hope that they may be useful to others. There is no claim to completeness or even correctness. Each annotation represents my understanding of the text at the moment I wrote the annotation.
> No guarantees given; comments and content criticism welcome.

* Peter K. Pearson, Fast hashing of variable-length text strings, Commun. ACM, vol. 33, #6, June 1990, pp. 677-680.
Technical correspondence in CACM, Nov 1991.

* B.J. McKenzie, R. Harries, T. Bell, Selecting a hashing algorithm, Software -- Practice and Experience, vol. 20, #2, Feb. 1990, pp. 209-224.
Statistical tests are applied to 8 hash functions extracted from 8 actual compilers. Five are linear congruence functions, and these are the ones that perform best. The very best is $ H = h sub n ~ bold mod ~ 2 sup 30 ~ bold mod ~ 1008 $, with $ h sub i = 613 h sub { i - 1 } + c sub i $.
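The winning function is easy to state in code. A minimal Python sketch, assuming the recurrence starts at $ h sub 0 = 0 $ and that the $ c sub i $ are the character codes of the string:

```python
def mckenzie_hash(s: str, table_size: int = 1008) -> int:
    """Linear-congruence string hash: h_i = 613*h_{i-1} + c_i,
    then H = h_n mod 2^30 mod table_size.
    Starting value h_0 = 0 is an assumption, not from the paper."""
    h = 0
    for ch in s:
        h = 613 * h + ord(ch)
    return (h % 2**30) % table_size
```

Python's unbounded integers make the intermediate products exact; an implementation in C would let $ h sub i $ overflow modulo the word size, which does not harm the final double reduction.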

* M.V. Ramakrishna, P.-Å. Larson, File organization using composite perfect hashing, ACM Trans. Database Syst., vol. 14, #2, June 1989, pp. 231-263.
Authors' abstract: Perfect hashing refers to hashing with no overflows. We propose and analyze a composite perfect hashing scheme for large external files. The scheme guarantees retrieval of any record in a single disk access. Insertions and deletions are simple, and the file size may vary considerably without adversely affecting the performance. A simple variant of the scheme supports efficient range searches in addition to being a completely dynamic file organization scheme. These advantages are achieved at the cost of a small amount of additional internal storage and increased cost of insertions.

* R.N. Horspool, G.V. Cormack, Hashing as a compaction technique for LR parser tables, Software -- Practice and Experience, vol. 17, #6, June 1987, pp. 413-416.
Remarkably, shows that hashing is at most a bearable alternative to graph colouring and other compression techniques, its most attractive feature being its ease of implementation. The possibility of applying perfect hashing is mentioned only in passing.

* G.V. Cormack, R.N. Horspool, M. Kaiserswerth, Practical perfect hashing, Computer J., vol. 28, #1, 1985, pp. 54-58.
Describes open hashing in which the second level is implemented as hash tables with a possibly different hash function for each second-level hash table. For each entry, the primary hash table records the position of the secondary hash table plus the function used. The secondary tables are kept free from hash duplicates; this ensures retrieval in constant time, which is the authors' interpretation of the term "perfect hashing". If a hash duplicate occurs during insertion, the old (secondary) hash function is abandoned and the contents of the entire secondary hash table are rehashed to a larger array, using another hash function, one that does not create duplicates. This function is found by exhaustive search. As an implementation detail, the secondary hash tables are kept in a single array; this is important because the algorithm is designed to use secondary storage for the secondary tables.
     Two variants are discussed, both based on the heuristic that if the primary hash table is large enough, the secondary tables are small; here "large enough" means having roughly one entry for each element. The first variant uses perfect hash functions for the secondary tables, thus achieving optimal packing. The functions are again found by exhaustive search, during insertion. It is proved that insertion can be done in constant time, but some insertions can be hundreds of times more expensive than others. In the second variant, a secondary set of r elements is stored in a hash table of size $ r sup 2 $. Under this regimen, the variation in the insertion time is no more than a factor of 10. Detailed statistics are given.
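     The second variant (a set of r elements stored in a table of size $ r sup 2 $, with a collision-free function found by search) can be sketched as follows. The universal-hash family, the prime P, and all names are illustrative, not taken from the paper:

```python
import random

P = (1 << 61) - 1   # a large prime for the hash family (illustrative choice)

def build_two_level(keys):
    """Sketch: a primary table with one entry per key; each secondary set
    of r keys is stored collision-free in a table of size r*r, the hash
    multiplier being found by (random) search."""
    m = max(1, len(keys))
    buckets = [[] for _ in range(m)]
    for k in keys:
        buckets[hash(k) % m].append(k)
    rng = random.Random(0)
    secondary = []
    for b in buckets:
        size = len(b) ** 2
        while True:                      # search for a collision-free function
            a = rng.randrange(1, P)
            slots = [None] * size
            if all(_place(slots, a, k) for k in b):
                secondary.append((a, slots))
                break
    return m, secondary

def _place(slots, a, key):
    i = (a * hash(key)) % P % len(slots)
    if slots[i] is not None:
        return False                     # duplicate: abandon this function
    slots[i] = key
    return True

def lookup(table, key):
    m, secondary = table
    a, slots = secondary[hash(key) % m]
    return bool(slots) and slots[(a * hash(key)) % P % len(slots)] == key
```

With a table of size $ r sup 2 $ a randomly chosen function is collision-free with probability better than one half, so the search loop terminates after a few tries on average; this is what bounds the variation in insertion cost.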

* W.C. Chen, J.S. Vitter, Analysis of new variants of coalesced hashing, ACM Trans. Database Syst., vol. 9, #4, Dec. 1984, pp. ???-???.
See also vol. 10, #1, March 1985.

* Ronald Fagin, Jurg Nievergelt, Nicholas Pippenger, H. Raymond Strong, Extendible hashing -- a fast access method for dynamic files, ACM Trans. Database Syst., vol. 4, #3, Sept. 1979, pp. 315-344.
The hash table has $ 2 sup d $ entries pointing to leaves. Each leaf has a local depth $ d prime $, with $ d prime <= d $. Entry k points to a leaf which contains between $min$ and $max$ records; for each record in this leaf we have $ h ~ mod ~ 2 sup d prime = k ~ mod ~ 2 sup d prime $, where h is the record's hash value. This arrangement means that if for a leaf $ d prime < d $, $ 2 sup { d - d prime } $ entries in the hash table will point to that leaf.
     When a record with hash value h is to be inserted, it is put in the leaf pointed to by entry $ h ~ mod ~ 2 sup d $. If as a result this leaf is now too full, there are two possibilities: $ d prime < d $ and $ d prime = d $. If $ d prime < d $ the leaf is pointed to by more than one entry; the leaf is split into two leaves, and the entries are divided between them. If $ d prime = d $ the hash table is doubled by duplicating it; now $ d prime < d $ and the previous case applies.
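     A minimal sketch of this directory-doubling scheme, using the low-order bits of the hash value (the paper's leaves are disk pages; here they are in-memory dicts, and all names are illustrative):

```python
class ExtendibleHash:
    """Sketch of extendible hashing: a directory of 2**d entries points to
    leaves; a leaf of local depth dp <= d is shared by 2**(d-dp) entries
    and holds at most `capacity` records."""

    def __init__(self, capacity=4):
        self.d = 0                            # global depth
        self.capacity = capacity
        self.dir = [{"depth": 0, "records": {}}]

    def _entry(self, h):
        return h % (2 ** self.d)              # low-order d bits

    def insert(self, key, value):
        h = hash(key)
        leaf = self.dir[self._entry(h)]
        leaf["records"][key] = value
        while len(leaf["records"]) > self.capacity:
            if leaf["depth"] == self.d:       # d' = d: double the directory
                self.dir = self.dir + self.dir
                self.d += 1
            dp = leaf["depth"] + 1            # d' < d: split the leaf in two
            low = {"depth": dp, "records": {}}
            high = {"depth": dp, "records": {}}
            for k, v in leaf["records"].items():
                (high if (hash(k) >> (dp - 1)) & 1 else low)["records"][k] = v
            for i in range(2 ** self.d):      # redistribute the entries
                if self.dir[i] is leaf:
                    self.dir[i] = high if (i >> (dp - 1)) & 1 else low
            leaf = self.dir[self._entry(h)]   # re-check: split may repeat

    def get(self, key):
        return self.dir[self._entry(hash(key))]["records"].get(key)
```

Duplicating the directory by concatenation gives exactly the required sharing: after doubling, entries $ k $ and $ k + 2 sup d $ point to the same leaf.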

* Renzo Sprugnoli, Perfect hashing functions: a single probe retrieving method for static sets, Commun. ACM, vol. 20, #11, Nov. 1977, pp. 841-850.
[ Pages and pages of number theory ]

* G. Schay, W.G. Spruth, Analysis of a file addressing method, Commun. ACM, vol. 5, #8, Aug. 1962, pp. 459-462.
The authors consider open hashing of records in a sequential file on disk and the point is to reduce (sequential) search time (this is 1962). (They agree that chained hashing is preferable for random access memory.)
     Suppose a new record R hashes (the authors do not use the word "hash") to address A. Rather than storing R in the first empty bin at or after A (wrapping around at the end of the file), which can be far away, they store it in the first bin that holds a record S that hashes to a higher address than A (or, of course, in an empty bin if that is met first). The evicted record S is then stored in the next bin, and so on, until an empty bin is found and the eviction process ends.
     This constructs a sorted list of records with hash values $ >= A $ starting from each address A in the file. This makes sure that the records with hash value A are never far away from address A.
     An exemplary analysis of the algorithm is given, resulting in a function describing the expected displacement of records as a function of the load factor.
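     The insertion step can be sketched as follows (names are illustrative; the wrap-around subtlety in comparing addresses across the end of the file is ignored for clarity):

```python
def insert_schay_spruth(table, key, home):
    """Sketch of the described insertion: `table` is a list with None
    marking empty bins, `home` maps a record to its hash address.
    Assumes at least one empty bin remains."""
    n = len(table)
    i = home(key) % n
    # walk past residents whose home address is not higher than ours
    while table[i] is not None and home(table[i]) % n <= home(key) % n:
        i = (i + 1) % n
    # evict and shift the chain one bin along until an empty bin ends it
    while key is not None:
        table[i], key = key, table[i]
        i = (i + 1) % n
```

Each run of occupied bins thus stays sorted by home address, so a lookup scanning from address A can stop as soon as it sees a record with a higher home address, not only at an empty bin.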

* L.R. Johnson, An indirect method for addressing on secondary keys, Commun. ACM, vol. 4, #??, ??? 1961, pp. 218-222.
[First publication on chained hashing?]