Now that we have the solution for searching in a list of numbers, can we use it for non-integer objects? We can if we find a way to turn objects into numbers for indexing.
We don't need a perfect one-to-one correspondence between objects and integers. In fact, it is totally fine if two unrelated objects are turned into the same number — we can use linear probing to resolve this collision anyway!
However, if we turn all objects into the same number, for example, 42, our hash table would work, but its performance would severely degrade. So, for performance reasons, it is usually desirable to get distinct numbers for distinct objects.
The transformation also needs to be completely deterministic, i.e. we need to always get the same value for the same object. In other words, something like random() would not work, because we would "forget" where we placed our objects and we wouldn't be able to locate them during a search.
Functions that do this kind of transformation are called hash functions. Since it is not required to preserve any order in the input domain, a typical hash function "mixes up" its input domain, hence the name "hash".
In Python, there are built-in implementations of hash functions for many built-in types. They are all available through a single interface: the function hash(). This function can take any Python object as an input and call the appropriate implementation (if it exists).
Here are some examples of hash function values for some of the built-in types:
Strings: hash(…) = 468330442038187741
Integers: hash(42) = 42
Floats: hash(42.5) = 1426259968
Tuples: hash(("Hi", 11)) = 4421265786515608844
None: hash(None) = -9223372036581563745
In the case of strings, hash() returns fairly unpredictable results, as it should. However, for small integers, hash(x) == x. This fact may seem surprising to people familiar with hash functions, but it is a deliberate design decision by Python core developers.
For "long" integers Python uses a different algorithm. Try typing a relatively big number, for example, 12345678901234567890
to see this.
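Here is a quick way to check (assuming a typical 64-bit CPython build, where a big positive integer is hashed by reducing it modulo the Mersenne prime 2**61 - 1):

big = 12345678901234567890
print(hash(big) == big)                # False: the value is too big for hash(x) == x
print(hash(big) == big % (2**61 - 1))  # True on typical 64-bit builds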
Another fact: hash() never returns -1, because -1 is used internally as an indicator of an error. That's why hash(-1) is -2.
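A quick demonstration (note the small collision this creates):

print(hash(-1))  # -2
print(hash(-2))  # -2 as well, so -1 and -2 collide (linear probing handles this)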
This chapter and the next two chapters will use the hash() implementation from Python 3.2 (running on an x86_64 system). So, if you run Python 3.2 on your x86_64 system, you should see the same hash values for integers and strings (and the same data structure states). One exception is hash(None): it changes between runs, but in this explanation hash(None) is always -9223372036581563745.
Why Python 3.2? Because the dict implementation has changed over time, but Python 3.2's dict implements most of the major ideas, which makes it a good starting point. Later versions of Python extend (rather than completely replace) Python 3.2's implementation. Eventually, we will get to these implementations as well.
Not all types are hashable. For example, lists aren't: if you call hash(["some", "values"]), you will get TypeError: unhashable type: 'list'. Why can't we use the same hash function for lists as for tuples? The answer is that lists are mutable and tuples are not.
We could still define a hash function for lists and other mutable objects. However, changing a list would change the value of the hash function as well, and therefore we would not be able to locate the mutated list! Hashing lists and using them as keys in dicts would lead to many accidental bugs, so the core developers of Python chose not to allow this.
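The standard workaround, for the record, is to hash an immutable snapshot of the data, for example, a tuple built from the list:

key = ["some", "values"]
print(hash(tuple(key)))  # works: the tuple's contents are hashed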
Because hash tables use hash functions, and because they mix up inserted elements, they are called hash tables. Sometimes people shorten "hash table" to just "hash". The output of a hash function is sometimes called a "hash value" or "hash code", but very often it is also shortened to just "hash". Also, Python's built-in hash function is called hash().
Because people like to shorten things, several different (but related) concepts end up having the same shortened name. This can get a bit confusing sometimes.
Recall that we started with a simple problem: searching efficiently in a list of distinct numbers. Now, let's make this problem harder: our hash table needs to handle duplicates, support types other than integers, and support removing and adding keys (and therefore resizing). We will see how to handle values in the next chapter, but for now let's assume we only need to search for keys.
How does using hash functions change the insertion algorithm?
Obviously, we have to use the hash() function to convert objects into integers for indexing.
Because None is hashable too, we will need to use some other value as a placeholder for an empty slot. The cleanest way to do this is to create a new type and use a value of this type. In Python, this is quite simple:
class EmptyValueClass(object):
pass
EMPTY = EmptyValueClass()
We will now use EMPTY to denote an empty slot. After we do this, we will be able to insert None into the hash table safely.
But here is one critical thing: checking objects for equality can be expensive. For example, comparing strings of length 10000 may require up to 10000 comparison operations, one per pair of corresponding characters. And we may end up doing several equality checks while linear probing.
When we only had integers, we didn't have this problem, because comparing integers is cheap. But there is a neat trick we can use to improve performance in the case of arbitrary objects: we still get numbers from hash functions, so we can save these numbers and compare them before comparing the actual objects. When comparing hashes, there are two possible outcomes. First, the hashes are different; in this case, we can safely conclude that the objects are different as well. Second, the hashes are equal; in this case, there is a good chance that the objects are equal, but there is still a possibility that two distinct objects have the same hash, so we have to compare the actual objects.
This optimization is a space-time tradeoff. We spend extra memory to make the algorithm faster.
In this chapter, we'll allow duplicates in the input. Remember how search works in the previous chapter? We retrace the steps necessary to insert the element and check whether any slot along the way contains it. We also do all of these steps when actually inserting an element, which means we're effectively doing a search while inserting. So, handling duplicates is straightforward: we can terminate the insertion process if we find the element. And if we hit an empty slot without finding the element, then the element is not in the table, and we can safely insert it.
Now, let's see this algorithm in action. We'll use a separate list called hash_codes for caching the values of hash functions. Let's say we have a mixed list of integers and strings, and let's build a hash table from it.
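Here is a minimal sketch of the insertion code, using the EMPTY placeholder defined earlier and a second list called keys for the keys themselves (the sample data below is hypothetical; the original example data isn't reproduced here):

def insert(hash_codes, keys, key):
    hash_code = hash(key)
    idx = hash_code % len(keys)
    while keys[idx] is not EMPTY:
        # Compare cached hashes (cheap) before comparing objects (possibly expensive).
        if hash_codes[idx] == hash_code and keys[idx] == key:
            return  # the key is already in the table: nothing to do
        idx = (idx + 1) % len(keys)
    hash_codes[idx] = hash_code
    keys[idx] = key

mixed_list = [42, "python", 1, "hello", 42]  # hypothetical input with a duplicate
table_size = 2 * len(mixed_list)             # keep plenty of empty slots
hash_codes = [EMPTY] * table_size
keys = [EMPTY] * table_size
for key in mixed_list:
    insert(hash_codes, keys, key)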
The search algorithm hasn't changed much: we get the hash() value for the object and do linear probing. Just like in the first chapter, we essentially have a bunch of "islands" separated by empty slots. Each island usually contains only a few elements, and a linear scan over a small "island" isn't expensive.
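A matching search function might look like this (again a sketch; the name has_key is ours):

def has_key(hash_codes, keys, key):
    hash_code = hash(key)
    idx = hash_code % len(keys)
    while keys[idx] is not EMPTY:
        # A matching cached hash is necessary, but not sufficient, for equality.
        if hash_codes[idx] == hash_code and keys[idx] == key:
            return True
        idx = (idx + 1) % len(keys)
    return False

print(has_key(hash_codes, keys, "python"))  # True: it was inserted above
print(has_key(hash_codes, keys, "ruby"))    # False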
If we removed a key without a trace, it'd leave a hole, and this would break the search algorithm: probing stops at the first empty slot, so any elements stored beyond the hole would become unreachable. Then how do we remove a key?
The answer is that if we can't remove a key without a trace, we should leave a trace. When removing a key, we replace it with a "dummy" object (another term for this object is "tombstone"). This object acts as a placeholder that indicates we shouldn't stop probing during a search.
In code, we can create this placeholder object just like we created EMPTY:
class DummyValueClass(object):
pass
DUMMY = DummyValueClass()
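With DUMMY in place, removal can be sketched as follows. Note that the search loop above keeps probing past dummy slots: a dummy slot is neither EMPTY (so the loop doesn't stop) nor equal to the key being looked up (so it is never a false match). Whether a missing key is an error is a design choice; raising KeyError here is our assumption:

def remove(hash_codes, keys, key):
    hash_code = hash(key)
    idx = hash_code % len(keys)
    while keys[idx] is not EMPTY:
        if hash_codes[idx] == hash_code and keys[idx] == key:
            keys[idx] = DUMMY  # leave a tombstone so later probes continue past this slot
            return
        idx = (idx + 1) % len(keys)
    raise KeyError(key)

A more careful insert could also reuse dummy slots instead of always claiming a fresh empty one.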
Removing a lot of objects may lead to a table being filled with these dummy objects. What if a table overflows with dummy objects? There is a way to clean them up. But first, let's see what happens if a table overflows with ordinary objects.
How do we resize a hash table? The index of each element depends on the table size, so it may change if the size of a table changes. Moreover, because of collisions and linear probing, each index may depend on the indexes of other objects (which, in turn, also depend on the size of a table and the indexes of other objects). This is a tangled mess.
There is a way to disentangle this Gordian Knot, however. We can create a new, bigger table and re-insert all the elements from the smaller table (skipping dummy placeholders). This may sound expensive. And it is expensive. But the thing is, we don't have to resize the table after every operation. If we make the new table size 1.5x, 2x or even 4x the size of the old table, it'll take a while until it fills up again. We will do resizes rarely enough that their heavy cost amortizes (spreads out) over many insertions and deletions.
Since we're re-building the table, the code is fairly similar to the code for building it from scratch. The differences are that we skip empty slots and dummy placeholders, and that we don't need to check for duplicates, because all keys already in the table are distinct (see the sketch below).
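A sketch of the resizing code, under the same assumptions as before (doubling here, but the growth factor is a tunable choice):

def resize(hash_codes, keys):
    new_size = 2 * len(keys)
    new_hash_codes = [EMPTY] * new_size
    new_keys = [EMPTY] * new_size
    for hash_code, key in zip(hash_codes, keys):
        if key is EMPTY or key is DUMMY:
            continue  # skip holes and tombstones
        idx = hash_code % new_size
        # No equality checks needed: the keys in the old table are already distinct.
        while new_keys[idx] is not EMPTY:
            idx = (idx + 1) % new_size
        new_hash_codes[idx] = hash_code
        new_keys[idx] = key
    return new_hash_codes, new_keys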
There is still one more important question. What condition should trigger the resizing operation? If we postpone resizing until a table is nearly full, the performance will severely degrade. If we resize a table when it is still sparse, we will waste memory. Typically, a hash table is resized when it is around 66% full.
The number of non-empty slots (including dummy/tombstone slots) is called the fill. The ratio between the fill and the table capacity is called the load factor (also sometimes: fill factor or fill ratio). So, using the new terms, we can say that a hash table is resized when the load factor reaches 66%. When a table is resized, its size usually goes up, but sometimes it can actually go down, if there are a lot of dummy slots.
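In code, the trigger might look like this (a sketch; the fill counter would be maintained by the insert and remove code, and the helper name maybe_resize is ours):

def maybe_resize(hash_codes, keys, fill):
    # Resize once the load factor (fill / capacity) reaches about two thirds.
    if fill * 3 >= len(keys) * 2:
        return resize(hash_codes, keys)
    return hash_codes, keys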
To implement this efficiently, we need to track the load factor, which means keeping two counters: one for the fill and one for "real" usage. With the current code structure, tracking these counters would be messy, because we would need to pass them to and from every function. A much cleaner solution is to use classes. More on that in the next chapter.