Hashing

Literature references and annotations by Dick Grune, dick@dickgrune.com. Last update: Sat Sep 19 09:19:12 2009.
These references and annotations were originally intended for personal use and are presented here only in the hope that they may be useful to others. There is no claim to completeness or even correctness. Each annotation represents my understanding of the text at the moment I wrote the annotation.
No guarantees given; comments and content criticism welcome.

* Maurizio Panti, Salvatore Valenti, "A modulus-oriented hash function for the construction of minimal perfect tables", ACM SIGPLAN Notices, 27, #11, pp. 33-38. Nov. 1992,

* Chin-Chen Chang, Tzong-Chen Wu, "A letter-oriented perfect hash scheme based upon sparse table compression", Softw. Pract. Exper., 21, #1, Jan. 1991, pp. 35-49.

* F.J. Burkowski, G.V. Cormack, "Use of perfect hashing in a paged memory management unit", in 1990 International Conference on Parallel Processing, Illinois, 1990, pp. 96-100.

* Peter K. Pearson, "Fast hashing of variable-length text strings", Commun. ACM, 33, #6, June 1990, pp. 677-680.
Technical correspondence in CACM, Nov 1991.

* Vijay Kumar, "Concurrent operations on extendible hashing and its performance", Commun. ACM, 33, #6, June 1990, pp. 681-694.

* B.J. McKenzie, R. Harries, T. Bell, "Selecting a hashing algorithm", Software -- Practice and Experience, 20, #2, Feb. 1990, pp. 209-224.
Statistical tests are applied to 8 hash functions extracted from 8 actual compilers. Five are linear congruence functions and these are the ones that perform best. The very best is H = H sub n   bold mod   2 sup 30   bold mod   1008 , with h sub i = 613 h sub { i - 1 } + c sub i .

* M.V. Ramakrishna, P.-Å. Larson, "File organization using composite perfect hashing", ACM Trans. Database Syst., 14, #2, June 1989, pp. 231-263.
Authors' abstract: Perfect hashing refers to hashing with no overflows. We propose and analyze a composite perfect hashing scheme for large external files. The scheme guarantees retrieval of any record in a single disk access. Insertions and deletions are simple, and the file size may vary considerably without adversely affecting the performance. A simple variant of the scheme supports efficient range searches in addition to being a completely dynamic file organization scheme. These advantages are achieved at the cost of a small amount of additional internal storage and increased cost of insertions.

* R.J. Enbody, H.C. Du, "Dynamic hashing schemes", ACM Computing Surveys, 20, #2, June 1988, pp. 85-115.

* Per-Åke Larson, "Dynamic hash tables", Commun. ACM, 31, #4, April 1988, pp. 446-457.

* R.N. Horspool, G.V. Cormack, "Hashing as a compaction technique for LR parser tables", Software: Practice and Experience, 17, #6, June 1987, pp. 413-416.
Remarkably, shows that hashing is at most a bearable alternative for graph colouring and other compression techniques, its most attractive feature being its ease of implementation. The possibility of applying perfect hashing is mentioned only in passing.

* E. Lodi, F. Luccio, "Split sequence hash search", Inform. Proc. Letters, 20, pp. 131-136. 1985,

* G.V. Cormack, R.N. Horspool, M. Kaiserswerth, "Practical perfect hashing", Computer J., 28, #1, 1985, pp. 54-58.
Describes open hashing in which the second level is implemented as hash tables with a possibly different hash function for each second-level hash table. For each entry, the primary hash table records the position of the secondary hash table plus the function used. The secondary tables are kept free from hash duplicates; this ensures retrieval in constant time, which is the authors' interpretation of the term "perfect hashing". If a hash duplicate occurs during insertion, the old (secondary) hash function is abandoned and the contents of the entire secondary hash table are rehashed to a larger array, using another hash function, one that does not create duplicates. This function is found by exhaustive search. As an implementation detail, the secondary hash tables are kept in a single array; this is important because the algorithm is designed to use secondary storage for the secondary tables.
Two variants are discussed, both based on the heuristic that if the primary hash table is large enough, the secondary tables are small; here "large enough" means having roughly one entry for each element. The first variant uses perfect hash functions for the secondary tables, thus achieving optimal packing. The functions are again found by exhaustive search, during insertion. It is proved that insertion can be done in constant time, but some insertions can be hundreds of times more expensive than others. In the second variant, a secondary set of r elements is stored in a hash table of size r sup 2 . Under this regimen, the variation in the insertion time is no more than a factor of 10. Detailed statistics are given.

* J.K. Mullin, "Spiral storage: efficient dynamic hashing with constant performance", Comput. J., 28, #3, 1985, pp. 330-334.

* T.J. Sager, "A polynomial time generator for minimal perfect hash functions", Commun. ACM, 28, #5, May 1985, pp. 523-532.

* E. Veklerov, "Analysis of dynamic hashing with deferred splitting", ACM Trans. Database Syst., 10, #1, March 1985, pp. 90-96.

* P.-Å. Larson, "Linear hashing with overflow-handling by linear probing", ACM Trans. Database Syst., 10, #1, March 1985, pp. 75-89.

* W.C. Chen, J.S. Vitter, "Analysis of new variants of coalesced hashing", ACM Trans. On Database Sys., 9, #4, pp. ???-???. December 1984,
See also 10 (1), March 1985

* P. Pritchard, "The study of an ordered minimal perfect hashing scheme", Letter to the Editor, Commun. ACM, 27, #11, pp. ???-???. Nov. 1984,

* K. Ramamohanarao, R. Sacks-Davis, "Recursive linear hashing", ACM Trans. Database Syst., 9, #3, Sept. 1984, pp. 369-391.

* U. Bechtold, K. Kuspert, "On the use of extendible hashing without hashing", Info. Proc. Lett., 19, #1, pp. ???-???. July 1984,

* C.C. Chang, "The study of an ordered minimal perfect hashing scheme", Commun. ACM, 27, #4, pp. ???-???. Apr. 1984,

* W.C. Chen, J.S. Vitter, "Analysis of early-insertion standard coalesced hashing", SIAM J. Computing, 12, #4, 1983, pp. 667-676.

* S. Nishihara, K. Ikeda, "Reducing the retrieval time of hashing method by using predictors", Commun. ACM, 26, #12, pp. ???-???. Dec. 1983,

* R.C. Bell, B. Floyd, "A Monte Carlo study of Cichelli hash-function solvability", Commun. ACM, 26, #11, pp. ???-???. Nov. 1983,

* P.A. Larson, "Analysis of uniform hashing", J. ACM, 30, #4, pp. 805-819. Oct. 1983,

* M.W. Du, T.M. Hsieh, K.F. Jea, D.W. Shieh, "The study of a new perfect hash scheme", IEEE Trans. Software Eng., SE-9, #3, pp. ???-???. May 1983,

* J.S. Vitter, "Analysis of the search performance of coalesced hashing", J. ACM, 30, #2, April 1983, pp. 231-258.

* P. Flajolet, "On the performance evaluation of extendable hashing and trie searching", Acta Inform., 20, 1983, pp. 345-369.

* D.B. Lomet, "Bounded index exponential hashing", ACM Trans. Database Syst., 8, #1, March 1983, pp. 136-165.

* K. Ramamohanarao, J.W. Lloyd, "Dynamic hashing schemes", Comput. J., 25, #4, 1982, pp. 478-485.

* J.S. Vitter, "Implementations for coalesced hashing", Commun. ACM, 25, #12, Dec. 1982, pp. 911-926.

* P.-Å. Larson, "Performance analysis of linear hashing with partial extensions", ACM Trans. Database Syst., 7, #4, Dec. 1982, pp. 556-587.

* C.R. Cook, R.R. Oldehoeft, "A letter-oriented minimal perfect hash function", ACM SIGPLAN Notices, 17, #9, pp. 18-27. Sept. 1982,

* P.A. Larson, "A single-file version of linear hashing with partial expansions", Proc. 8th Conf. Very Large Data bases, pp. 300-309. ACM, New York, Sept. 1982,

* J.S. Vitter, "Deletion algorithms for hashing that preserve randomness", J. Algorithms, 3, Sept. 1982, pp. 261-275.

* J.K. Mullin, "Tightly controlled linear hashing without separate overflow storage", BIT, 21, pp. 390-400. 1981,

* G. Jaeschke, "Reciprocal hashing: a method for generating minimal perfect hashing functions", Commun. ACM, 24, #12, pp. 829-833. Dec. 1981,

* J.S. Vitter, "A shared-memory implementation of coalesced hashing", Inform. Process. Lett., 13, #2, Nov. 1981, pp. 77-79.

* M. Scholl, "New file organizations based on dynamic hashing", ACM Trans. Database Syst., 6, #1, March 1981, pp. 194-211.

* Witold Litwin, "Linear hashing: A new tool for file and table addressing", in Sixth International Conference on Very Large Data Bases, ed. by Frederick H. Lochovsky and ?. Taylor, IEEE, 1980, pp. 212-223.

* P.A. Larson, "Linear hashing with partial expansions", in Proc. 6th Conf. Very Large Data bases, pp. 224-232. ACM, New York, 1980,

* G. Jaeschke, G. Osterburg, "On Cichelli's minimal perfect hash function method", Commun. ACM, 23, #??, pp. 728-729. 1980,

* D.U. Sarwate, "A note on universal classes of hash functions", Info. Proc. Lett., 10, #1, pp. 41-45. Feb. 1980,

* R.J. Cichelli, "Minimal perfect hash functions made simple", Commun. ACM, 23, #1, pp. 17-19. Jan. 1980,

* C.H. Papadimitriou, P.A. Bernstein, "On the performance of balanced hashing functions when the keys are not equiprobable", ACM Trans. Prog. Lang. and Sys., 2, #1, pp. ???-???. Jan. 1980,

* K. Raiha, S.H. Zweben, "An optimal insertion algorithm for one-sided height-balanced binary search trees", Commun. ACM, 22, #9, pp. ???-???. Sept. 1979,

* Ronald Fagin, Jurg Nievergelt, Nicholas Pippenger, H. Raymond Strong, "Extendible hashing -- a fast access method for dynamic files", ACM Trans. Database Syst., 4, #3, Sept. 1979, pp. 315-344.
The hash table has 2 sup d entries pointing to leaves. Each leaf has a local depth d prime , with d prime <= d . Entry k points to a leaf which contains between min and max records; for each record in this leaf we have h   mod   2 sup d prime = k   mod   2 sup d prime , where h is the record's hash value. This arrangement means that if for a leaf d prime < d , 2 sup { d - d prime } entries in the hash table will point to that leaf.
When a record with hash value h is to be inserted, it is put in the leaf pointed to by entry h   mod   2 sup d prime . If as a result this leaf is now too full there are two possibilities: d prime < d and d prime = d . If d prime < d the leaf is pointed to by more than one entry; the leaf is split into two leaves. If d prime = d the hash table is doubled by duplicating it; now d prime < d and see above.

* G.H. Gonnet, J.I. Munro, "Efficient ordering of hash tables", SIAM J. Comp., 8, #3, pp. ???-???. Aug. 1979,

* J.L. Carter, M.N. Wegman, "Universal classes of hash functions", J. Comput. Syst. Sci., 18, #2, April 1979, pp. 143-154.

* M.R. Anderson, M.G. Anderson, "Comments on perfect hashing functions -- A single probe retrieving method for static sets", Commun. ACM, 22, #2, pp. ???-???. Feb. 1979,

* R.L. Rivest, "Optimal arrangement of keys in a hash table", Journal of the ACM, 25, pp. 200-209. 1978,

* P.-Å. Larson, "Dynamic hashing", BIT, 18, #2, 1978, pp. 184-201.

* L.J. Guibas, "The analysis of hashing techniques that exhibit k-ary clustering", Journal of the ACM, 25, pp. 544-55. 1978,

* L.J. Guibas, E. Szemeredi, "The analysis of double hashing", J. Comp. Sys. Sci., 16, pp. 226-74. 1978,

* C. Halatsis, G. Philokyprou, "Pseudo-chaining in hash tables", Commun. ACM, 21, #7, pp. ???-???. July 1978,

* Renzo Sprugnoli, "Perfect hashing functions: a single probe retrieving method for static sets", Commun. ACM, 20, #11, Nov. 1977, pp. 841-850.
[ Pages and pages of number theory ]

* I.F. Blake, A.G. Konhein, "Big buckets are (are not) better", J. ACM, 24, #4, Oct. 1977, pp. 591-606??.

* A.L. Rosenberg, L.J. Stockmayer(sp?), "Hashing schemes for extendible arrays", J. ACM, 24, #2, April 1977, pp. 199-221.

* E. Goto, T. Ida, "Parallel hashing algorithms", Inform. Process. Letters, 6, #1, pp. 8-13. Feb. 1977,

* D. Severance, R. Duhne, "A practitioner's guide to addressing algorithms", Commun. ACM, 19, #6, June 1976, pp. 314-326.

* O. Amble, D.E. Knuth, "Ordered hash tables", Computer J., 18, pp. 135-42. 1975,

* S.P. Ghosh, V.Y. Lum, "Analysis of collisions when hashing by division", Inf. Syst., 1, pp. 15-22. 1975,

* G.O. Kno??t, "Hashing functions", Computer Journal, 18, pp. ???-???. Aug. 1975,

* V. Batagelj, "The quadratic hash method when the table size is not a prime number", Commun. ACM, 18, #4, pp. ???-???. Apr. 1975,

* W.D. Maurer, T.G. Lewis, "Hash table methods", Computing Surveys, 7, #1, pp. 5-19. 1975,

* A.E. Ackerman, "Quadratic search for hash tables of size p sup n ", Commun. ACM, 17, #3, pp. ???-???. Mar. 1974,

* R. Loeser, "Some performance tests of 'quicksort' and descendants", Commun. ACM, 17, #3, pp. ???-???. Mar. 1974,

* V.Y. Lum, "General performance analysis of key to address transformation methods", Commun. ACM, 16, #10, Oct. 1973, pp. ???-???.

* R.P. Brent, "Reducing the retrieval time of scatter storage techniques", Commun. ACM, 16, #2, Feb. 1973, pp. 105-109.

* B. Carter, "The reallocation of hash-coded tables", Commun. ACM, 16, #1, Jan. 1973, pp. 11-14.

* F.R.A. Hopgood, J. Davenport, "The quadratic hash method where the table size is a power of 2", Comptr. J., 15, #4, pp. ???-???. 1972,

* M.C. Harrison, "Implementation of the substring test by hashing", Commun. ACM, 14, #12, Dec. 1971, pp. 777-779.

* J.R. Bell, C.H. Kaman, "The linear quotient hash code", Commun. ACM, 13, #11, pp. 675-677. Nov. 1970,

* J.R. Bell, "The quadratic quotient method -- A hash code eliminating secondary clustering", Commun. ACM, 13, #2, pp. ???-???. Feb. 1970,

* C.E. Radke, "The use of quadratic residue research", Commun. ACM, 13, #2, pp. ???-???. Feb. 1970,

* F.R.A. Hopgood, "A solution to the table overflow problem for hash tables", Comp. Bull. (???), 11, #??, 1968, pp. 297-???.

* R. Morris, "Scatter storage techniques", Commun. ACM, 11, #1, pp. 38-44. 1968,

* W.D. Maurer, "An improved hash code for scatter storage", Commun. ACM, 11, #1, pp. 35-37. 1968,

* G. Schay, W.G. Spruth, "Analysis of a file addressing method", Commun. ACM, 5, #8, Aug. 1962, pp. 459-452.
The authors consider open hashing of records in a sequential file on disk and the point is to reduce (sequential) search time (this is 1962). (They agree that chained hashing is preferable for random access memory.)
Suppose a new record R hashes (the authors don't use the word "hash") to address A. Rather than storing R in the first empty bin after A (allowing buffer circularity), which can be far away, they store it in the first bin that holds a record S that hashes to a higher address than A (or of course in an empty bin if they meet that first). They store the evicted record S in the next bin, etcetera, until an empty bin is found and the eviction process ends.
This constructs a sorted list of records with hash values >= A starting from each address A in the file. This makes sure that the records with hash value A are never far away from address A.
An exemplary analysis of the algorithm is given, resulting in a function describing the expected displacement of records as a function of the load factor.

* L.R. Johnson, "An indirect method for addressing on secondary keys", Commun. ACM, 4, #??, ??? 1961, pp. 218-222.
[First publication on chained hashing?]

* W.W. Peterson, "Addressing for random-access storage", IBM Jour. Res. Dev., 1, #2, pp. 249-253 (130-146???, or both?). 1957,