Now that we have the solution for searching in a list of numbers, can we use it for non-integer objects? We can if we find a way to turn objects into numbers for indexing.
We don't need a perfect one-to-one correspondence between objects and integers. In fact, it is totally fine if two unrelated objects are turned into the same number — we can use linear probing to resolve this collision anyway!
However, if we turn all objects into the same number, for example, 42, our hash table would work, but its performance would severely degrade. So, for performance reasons, it is usually desirable to get distinct numbers for distinct objects.
The transformation also needs to be completely deterministic, i.e. we need to always get the same value for the same object. In other words, something like random() would not work, because we would "forget" where we placed our objects and we wouldn't be able to locate them during a search.
Functions that do this kind of transformation are called hash functions. Since it is not required to preserve any order in the input domain, a typical hash function "mixes up" its input domain, hence the name "hash".
In Python, there are built-in implementations of hash functions for many built-in types. They are all available through a single interface: the function hash(). This function can take any Python object as an input and call the appropriate implementation (if it exists).
Here are some examples of hash function values for some of the built-in types:
Strings: hash(…) = 468330442038187741
Integers: hash(42) = 42
Floats: hash(42.5) = 1426259968
Tuples: hash(("Hi", 11)) = 4421265786515608844
None: hash(None) = -9223372036581563745
In the case of strings, hash() returns fairly unpredictable results, as it should. However, for small integers, hash(x) == x. This fact may seem surprising to people familiar with hash functions, but it is a deliberate design decision by Python core developers.
For "long" integers Python uses a different algorithm. Try typing a relatively big number, for example, 12345678901234567890
to see this.
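Here is a quick way to check (assuming a typical 64-bit CPython build, where a big positive integer is hashed by reducing it modulo the Mersenne prime 2**61 - 1):

big = 12345678901234567890
print(hash(big) == big)                # False: the value is too big for hash(x) == x
print(hash(big) == big % (2**61 - 1))  # True on typical 64-bit builds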
Another fact: hash() never returns -1, because -1 is used internally as an indicator of an error. That's why hash(-1) is -2.
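A quick demonstration (note the small collision this creates):

print(hash(-1))  # -2
print(hash(-2))  # -2 as well, so -1 and -2 collide (linear probing handles this)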
This chapter and the next two chapters will use the hash() implementation from Python 3.2 (running on an x86_64 system). So, if you run Python 3.2 on your x86_64 system, you should see the same hash values for integers and strings (and the same data structure states). One exception is hash(None): it changes between runs, but in this explanation hash(None) is always -9223372036581563745.
Why Python 3.2? Because the dict implementation has changed over time, but Python 3.2's dict implements most of the major ideas, which makes it a good starting point. Later versions of Python extend (rather than completely replace) Python 3.2's implementation. Eventually, we will get to these implementations as well.
Not all types are hashable. For example, lists aren't: if you call hash(["some", "values"]), you will get TypeError: unhashable type: 'list'. Why can't we use the same hash function for lists as for tuples? The answer is that lists are mutable and tuples are not.
We could still define a hash function for lists and other mutable objects. However, changing a list would change the value of the hash function as well, and therefore we would not be able to locate the mutated list! Hashing lists and using them as keys in dicts would lead to many accidental bugs, so the core developers of Python chose not to allow this.
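The standard workaround, for the record, is to hash an immutable snapshot of the data, for example, a tuple built from the list:

key = ["some", "values"]
print(hash(tuple(key)))  # works: the tuple's contents are hashed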
Because hash tables use hash functions, and because they mix up inserted elements, they are called hash tables. Sometimes people shorten "hash table" to just "hash". The output of a hash function is sometimes called a "hash value" or "hash code", but very often it is also shortened to just "hash". Also, Python's built-in hash function is called hash().
Because people like to shorten things, several different (but related) concepts end up having the same shortened name. This can get a bit confusing sometimes.
Recall that we started with a simple problem: searching efficiently in a list of distinct numbers. Now, let's make this problem harder: our hash table needs to handle duplicates, support types other than integers, and support removing and adding keys (and therefore resizing). We will see how to handle values in the next chapter, but for now let's assume we only need to search for keys.
How does using hash functions change the insertion algorithm?
Obviously, we have to use the hash() function to convert objects into integers for indexing.
Because None is hashable too, we will need to use some other value as a placeholder for an empty slot. The cleanest way to do this is to create a new type and use a value of this type. In Python, this is quite simple:
class EmptyValueClass(object):
pass
EMPTY = EmptyValueClass()
We will now use EMPTY to denote an empty slot. After we do this, we will be able to insert None into the hash table safely.
But here is one critical thing: checking objects for equality can be expensive. For example, comparing strings of length 10000 may require up to 10000 comparison operations, one per pair of corresponding characters. And we may end up doing several equality checks while linear probing.
When we only had integers, we didn't have this problem, because comparing integers is cheap. But there is a neat trick we can use to improve performance in the case of arbitrary objects: we still get numbers from hash functions, so we can save these numbers and compare them before comparing the actual objects. When comparing hashes, there are two possible outcomes. First, the hashes are different; in this case, we can safely conclude that the objects are different as well. Second, the hashes are equal; in this case, there is a good chance that the objects are equal, but there is still a possibility that two distinct objects have the same hash, so we have to compare the actual objects.
This optimization is a space-time tradeoff. We spend extra memory to make the algorithm faster.
In this chapter, we'll allow duplicates in the input. Remember how search works in the previous chapter? We retrace the steps necessary to insert the element and check whether any slot along the way contains it. We also do all of these steps when actually inserting an element, which means we're effectively doing a search while inserting. So, handling duplicates is straightforward: we can terminate the insertion process if we find the element. And if we hit an empty slot without finding the element, then the element is not in the table, and we can safely insert it.
Now, let's see this algorithm in action. We'll use a separate list called hash_codes for caching the values of hash functions. Let's say we have a mixed list of integers and strings, and let's build a hash table from it.
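Here is a minimal sketch of the insertion code, using the EMPTY placeholder defined earlier and a second list called keys for the keys themselves (the sample data below is hypothetical; the original example data isn't reproduced here):

def insert(hash_codes, keys, key):
    hash_code = hash(key)
    idx = hash_code % len(keys)
    while keys[idx] is not EMPTY:
        # Compare cached hashes (cheap) before comparing objects (possibly expensive).
        if hash_codes[idx] == hash_code and keys[idx] == key:
            return  # the key is already in the table: nothing to do
        idx = (idx + 1) % len(keys)
    hash_codes[idx] = hash_code
    keys[idx] = key

mixed_list = [42, "python", 1, "hello", 42]  # hypothetical input with a duplicate
table_size = 2 * len(mixed_list)             # keep plenty of empty slots
hash_codes = [EMPTY] * table_size
keys = [EMPTY] * table_size
for key in mixed_list:
    insert(hash_codes, keys, key)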
The search algorithm hasn't changed much: we get the hash() value for the object and do linear probing. Just like in the first chapter, we essentially have a bunch of "islands" separated by empty slots. Each island usually contains only a few elements, and a linear scan over a small "island" isn't expensive.
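A matching search function might look like this (again a sketch; the name has_key is ours):

def has_key(hash_codes, keys, key):
    hash_code = hash(key)
    idx = hash_code % len(keys)
    while keys[idx] is not EMPTY:
        # A matching cached hash is necessary, but not sufficient, for equality.
        if hash_codes[idx] == hash_code and keys[idx] == key:
            return True
        idx = (idx + 1) % len(keys)
    return False

print(has_key(hash_codes, keys, "python"))  # True: it was inserted above
print(has_key(hash_codes, keys, "ruby"))    # False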
If we removed a key without a trace, it'd leave a hole, and this would break the search algorithm: probing stops at the first empty slot, so any elements stored beyond the hole would become unreachable. Then how do we remove a key?
The answer is that if we can't remove a key without a trace, we should leave a trace. When removing a key, we replace it with a "dummy" object (another term for this object is "tombstone"). This object acts as a placeholder that indicates we shouldn't stop probing during a search.
In code, we can create this placeholder object just like we created EMPTY:
class DummyValueClass(object):
pass
DUMMY = DummyValueClass()
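With DUMMY in place, removal can be sketched as follows. Note that the search loop above keeps probing past dummy slots: a dummy slot is neither EMPTY (so the loop doesn't stop) nor equal to the key being looked up (so it is never a false match). Whether a missing key is an error is a design choice; raising KeyError here is our assumption:

def remove(hash_codes, keys, key):
    hash_code = hash(key)
    idx = hash_code % len(keys)
    while keys[idx] is not EMPTY:
        if hash_codes[idx] == hash_code and keys[idx] == key:
            keys[idx] = DUMMY  # leave a tombstone so later probes continue past this slot
            return
        idx = (idx + 1) % len(keys)
    raise KeyError(key)

A more careful insert could also reuse dummy slots instead of always claiming a fresh empty one.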
Removing a lot of objects may lead to a table being filled with these dummy objects. What if a table overflows with dummy objects? There is a way to clean them up. But first, let's see what happens if a table overflows with ordinary objects.
How do we resize a hash table? The index of each element depends on the table size, so it may change if the size of a table changes. Moreover, because of collisions and linear probing, each index may depend on the indexes of other objects (which, in turn, also depend on the size of a table and the indexes of other objects). This is a tangled mess.
There is a way to disentangle this Gordian Knot, however. We can create a new, bigger table and re-insert all the elements from the smaller table (skipping dummy placeholders). This may sound expensive. And it is expensive. But the thing is, we don't have to resize the table after every operation. If we make the new table size 1.5x, 2x or even 4x the size of the old table, it'll take a while until it fills up again. We will do resizes rarely enough that their heavy cost amortizes (spreads out) over many insertions and deletions.
Since we're re-building the table, the code is fairly similar to the code for building it from scratch. The differences are that we skip empty slots and dummy placeholders, and that we don't need to check for duplicates, because all keys already in the table are distinct (see the sketch below).
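A sketch of the resizing code, under the same assumptions as before (doubling here, but the growth factor is a tunable choice):

def resize(hash_codes, keys):
    new_size = 2 * len(keys)
    new_hash_codes = [EMPTY] * new_size
    new_keys = [EMPTY] * new_size
    for hash_code, key in zip(hash_codes, keys):
        if key is EMPTY or key is DUMMY:
            continue  # skip holes and tombstones
        idx = hash_code % new_size
        # No equality checks needed: the keys in the old table are already distinct.
        while new_keys[idx] is not EMPTY:
            idx = (idx + 1) % new_size
        new_hash_codes[idx] = hash_code
        new_keys[idx] = key
    return new_hash_codes, new_keys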
There is still one more important question. What condition should trigger the resizing operation? If we postpone resizing until a table is nearly full, the performance will severely degrade. If we resize a table when it is still sparse, we will waste memory. Typically, a hash table is resized when it is around 66% full.
The number of non-empty slots (including dummy/tombstone slots) is called the fill. The ratio between the fill and the table capacity is called the load factor (also sometimes: fill factor or fill ratio). So, using the new terms, we can say that a hash table is resized when the load factor reaches 66%. When a table is resized, its size usually goes up, but sometimes it can actually go down, if there are a lot of dummy slots.
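In code, the trigger might look like this (a sketch; the fill counter would be maintained by the insert and remove code, and the helper name maybe_resize is ours):

def maybe_resize(hash_codes, keys, fill):
    # Resize once the load factor (fill / capacity) reaches about two thirds.
    if fill * 3 >= len(keys) * 2:
        return resize(hash_codes, keys)
    return hash_codes, keys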
To implement this efficiently, we need to track the load factor, which means keeping two counters: one for the fill and one for "real" usage. With the current code structure, tracking these counters would be messy, because we would need to pass them to and from every function. A much cleaner solution is to use classes. More on that in the next chapter.