Inside python dict — an explorable explanation

JavaScript code is loading...

Searching efficiently in a list

Why are hash tables called hash tables?

Putting it all together to make an "almost"-python-dict

How python dict really works internally

Introduction

Hi! This is an explorable explanation of Python dictionaries. This page is dynamic and interactive — you can plug in your data and see how the algorithms work on it. (Once the javascript loads)

To start with, let's say we have a simple list of distinct integers (change it if you want - the page will update):

Python lists are actually arrays — contiguous chunks of memory. The name "list" may be misleading to people who know about double-linked lists but are unfamiliar with Python. You can picture a Python list as a row of slots, where each slot can hold a Python object:

To check if an element is present in a list, we can use the in operator like this: number in simple_list, which returns either True or False. Under the hood this short snippet does a linear scan. This can be a lot of work. To see this, let's reimplement it in Python.

Let's say we're looking for the following number in the original list:

simple_list

def simple_search(simple_list, key):
    idx = 0                            ~ Start from the beginning of the list
    while idx < len(simple_list):      ~ [Try #6] 5 < 8, so some elements have not been processed yet, and the number may be there
        if simple_list[idx] == key:    ~ 25 == 25 — the wanted number is found
            return True                ~ so return True
        idx += 1                    
    return False                    

(The visualization is interactive. The buttons allow you to step through the code. Notice here that the time slider is draggable - feel free to rewind the time or move it forward. Also, feel free to mess with the input and the original list - the visualization will update automatically)

What's not so great about linear scans? If we have a million distinct numbers, in the worst case scenario, we may need to scan the whole list. But scanning over a few elements is no big deal. We need to have some order and predictability to make the search fast. We need to have some idea of where the searched element is located.

A Python dict implementation is basically a scan of a list (but a pretty weird scan). We'll build the actual algorithm and data structure inside Python dictionary step by step, starting with the code above, which is intentionally verbose.

Chapter 1: searching efficiently in a list

A Python dict is a collection of key-value pairs. And, the most important part of it is handling keys. Keys need to be organized in such a way that efficient searching, inserting and deleting is possible.

In this chapter, to keep things simple, we won't have any values, and "keys" will just be plain integers. So, the simplified problem is to check if a number is present in a list, but we have to do this fast. We'll tackle the real problem in the following chapters. But for now, bear with me.

Accessing a single element by index is very fast. Accessing only a few elements would be fast too. We don't want to be doing a linear scan over the whole list every time we look up a number, so we need to organize our data in a clever way.

Here's how.

Let's begin by creating a new list of slots. Each slot will either hold a number from the original list or be empty (empty slots will hold None). We'll use the number itself to compute an index of a slot. The simplest way to do this is to take the remainder of number divided by len(the_list): number % len(the_list) and put our number in slot with this index. To check if the number is there we could compute the slot index again and see if it is empty.

Would this approach work, however? Not entirely. For example, 50 will get the same slot index (3) as 2, and it will be overwritten. Situations like these are called collisions.

original_list

new_list

def build_not_quite_what_we_want(original_list):
    new_list = [None] * len(original_list)         ~ Create a new list of 8 empty slots
                                                
    for number in original_list:                   ~ [8/8] The number to insert is 4
        idx = number % len(new_list)               ~ Compute the slot index: 4 == 4 % 8
        new_list[idx] = number                     ~ Collision of 4 with 44 in slot 4 - the number is overwritten
    return new_list                                ~ Return created list with some numbers missing: 50, 1, 25, 44

To make this approach viable, we need to somehow resolve collisions. Let's do the following: if the slot is already occupied by some other number, we'll just check the slot that comes right after it. And if that slot is empty, we'll put the number there. But, what if that slot is also occupied? Once again, we'll go ahead and check the next slot. We'll keep repeating this process until we finally hit an empty slot. This process is called probing. And because we do it linearly, it is called linear probing. In code, we would write this as (idx + 1) % len(simple_list), so it wraps around back to the beginning at the last index:

If we make the new list the same size as the original list, we'll have too many collisions. If we make it 10x larger, we'll have very few collisions, but we'll waste a lot of memory. So what size should it be? We want to hit the sweet spot where we don't use up too much memory but also don't have too many collisions. Twice the size of the original list is reasonable.

Let's transform the original list using this method (when reading this code, keep in mind that original_list is a list of distinct numbers, so we don't need to handle duplicates just yet).

original_list

new_list

def build_insert_all(original_list):            
    new_list = [None] * (2 * len(original_list))   ~ Create a new list of 16 empty slots
                                                
    for number in original_list:                   ~ [8/8] The number to insert is 4
        idx = number % len(new_list)               ~ Compute the slot index: 4 == 4 % 16
        while new_list[idx] is not None:           ~ After 1 collision, an empty slot (at 5) is found: the collision is successfully resolved
            idx = (idx + 1) % len(new_list)     
        new_list[idx] = number                     ~ Put 4 in slot 5
    return new_list                                ~ Return created list with all original numbers present

To search for a number, we retrace all the steps necessary to insert it: we start from the slot number % len(new_list) and do linear probing. We either end up finding the number or hitting an empty slot. The latter situation means that the number is not present.

Let's say we want to search for

new_list

def has_number(new_list, number):      
    idx = number % len(new_list)          ~ Compute the slot index: 2 == 2 % 16
    while new_list[idx] is not None:      ~ [Try #2] Slot 3 is occupied, so check it
        if new_list[idx] == number:       ~ The number is found: 2 == 2
            return True                   ~ Now simply return True
        idx = (idx + 1) % len(new_list)
    return False                       

Calculating an index based on the value of the number and resolving collisions by linear probing is an incredibly powerful idea. What we've just implemented is a simple hash table (more about the term in the next chapter). Python dict uses a hash table internally, albeit a more complex variant.

We still haven't discussed adding more elements (what happens if a table overflows?), removing elements (removing an element without a trace would cause a hole to appear, wouldn't this cause the search algorithm to stop prematurely in many cases?), and perhaps most importantly, handling objects other than integers - strings, tuples, floats. We'll do this in the next chapters.

Collision resolution via separate chaining

There is a different method of collision resolution, called separate chaining. It is also a good strategy which is commonly used. But that's not how Python resolves collision in dicts, so it is beyond the scope of this explanation.

A couple of the notes about the explanation

First, this explanation discusses dict as it is implemented in CPython — the "default" and most common implementation of the Python language (if you are not sure what implementation you use, it is almost certainly CPython). Some other implementations are PyPy, Jython and IronPython. The way dicts works in each of these implementations may be similar to CPython (in the case of PyPy) or very different from CPython (in the case of Jython).

Second, even though dict in CPython is implemented in C, this explanation uses Python for code snippets. The goal of this page is to help you understand the algorithms and the underlying data structures, not the minutiae of the C code (these details are interesting too - they are just are beyond the scope of this explanation).

Inside python dict — an explorable explanation

Contents

Searching efficiently in a list

Why are hash tables called hash tables?

Putting it all together to make an "almost"-python-dict

How python dict really works internally

Introduction

Chapter 1: searching efficiently in a list

Collision resolution via separate chaining

A couple of the notes about the explanation

Next: Why are hash tables called hash tables?

Inside python dict — an explorable explanation

Contents

Searching efficiently in a list

Why are hash tables called hash tables?

Putting it all together to make an "almost"-python-dict

How python dict *really* works internally

Introduction

Chapter 1: searching efficiently in a list

Collision resolution via separate chaining

A couple of the notes about the explanation

Next: Why are hash tables called hash tables?

How python dict really works internally