Now it is (finally!) time to explore how the dict works in Python!
This explanation is about the dict in CPython (the most popular, "default" implementation of Python). Most of CPython is implemented in C, and the dict implementation is written in C too. This explanation reimplements everything in Python, because its main focus is the algorithms. Nevertheless, the state of dicts on this page should match the state of dicts inside CPython 3.2: the hash() codes of keys should match, the probing results should match, and as a result the slots match as well.
In other words, this is an actual reimplementation of Python dict in Python that mirrors all of the aspects of the underlying data structure. And the visualization might as well be the visualization of the state produced from the C code.
Why CPython 3.2? It's a good starting point: later versions expand the implementation rather than replace it. So this chapter focuses on CPython 3.2's dict, and future chapters will discuss changes in the later versions (3.3 - 3.7).
The central difference between the almost-Python-dict from the third chapter and real Python dicts is the probing algorithm. This probing algorithm has stayed the same in all versions (at least up until 3.7, which is the latest version at the time of writing).
The problem with simple linear probing is that it doesn't mix up the keys well for many real-world data patterns. Real-world data tends to be regular, and a pattern like 16, 0, 1, 2, 3, 4... would lead to many collisions.
Linear probing is prone to clustering: once you get a "clump" of keys, the clump tends to grow, which causes more collisions, which cause the clump to grow further, which causes even more collisions. This is detrimental to performance.
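To see the clustering effect concretely, here is a toy linear-probing insert (a sketch for illustration, not the chapter's actual code). Small integers hash to themselves in Python, so 16 lands in slot 0 of an 8-slot table, and every following key from the pattern above collides with the growing clump:

```python
def linear_probe_insert(slots, key):
    """Insert `key` into an open-addressing table using linear probing.
    Returns (slot index, number of probes). Assumes free slots exist."""
    size = len(slots)
    idx = hash(key) % size  # hash(x) == x for small integers
    probes = 1
    while slots[idx] is not None:
        idx = (idx + 1) % size
        probes += 1
    slots[idx] = key
    return idx, probes

slots = [None] * 8
probe_counts = [linear_probe_insert(slots, key)[1] for key in [16, 0, 1, 2, 3, 4]]
# 16 takes slot 0; after that, every key collides with the clump
# and lands one slot past its "home" slot.
```

Every insert after the first one takes two probes instead of one, and the clump only keeps growing as more keys arrive.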
One way to address this problem is by using a better hash function, in particular when it comes to integers (hash(x) == x for small integers in Python). Another way is to use a different probing algorithm - and this is what the CPython developers decided to do.
There are two requirements for a probing algorithm: it must be deterministic, and it must (eventually) hit every slot.
Let's take a look at linear probing first. If we repeatedly run its recurrence (idx = (idx + 1) % size) until we end up hitting a slot twice, we get the following picture:
It does not matter which slot we start from; the picture will look exactly the same. Linear probing is very regular and predictable. Now, let's change the recurrence to idx = (5 * idx + 1) % size (note the 5):
idx = (5 * idx + 1) % size is still guaranteed to eventually hit every possible slot if size is a power of two (the proof of this fact is outside the scope of this page). Also, the algorithm is obviously deterministic. So, both requirements for a probing algorithm are satisfied. This algorithm scrambles the order of indexes a bit. It is certainly less regular, but it is still prone to clustering.
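We can check the full-cycle property of both recurrences with a quick sketch (the helper name here is made up for illustration):

```python
def cycle(start, size, step):
    """Follow a probing recurrence from `start` until an index repeats."""
    seen, idx = [], start
    while idx not in seen:
        seen.append(idx)
        idx = step(idx) % size
    return seen

size = 8
linear = cycle(0, size, lambda i: i + 1)         # perfectly regular order
scrambled = cycle(0, size, lambda i: 5 * i + 1)  # less regular order
# Both recurrences visit every slot exactly once before repeating.
assert set(linear) == set(scrambled) == set(range(size))
```

Starting from slot 0, the scrambled recurrence visits the slots in the order 0, 1, 6, 7, 4, 5, 2, 3 - every slot is hit, just not in a straight line.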
The probing algorithm in CPython takes this recurrence and adds a ton of scrambling to it: idx = ((5 * idx) + 1 + perturb) % size. What is this perturb weirdness though? In the C code, it is initialized as basically this: size_t perturb = hash_code. Then, in every iteration, it is right-shifted by 5 bits (perturb >>= 5).
This probing algorithm uses some "randomness" in the form of bits from the hash code - but the probing is still fully deterministic, because hash functions are by their nature deterministic. perturb eventually reaches zero, and the recurrence becomes idx = ((5 * idx) + 1) % size, which is guaranteed to hit every slot (eventually).
We can reimplement this algorithm in pure Python. However, Python has no unsigned (logical) bit shifts, and there is also no built-in way to convert a 64-bit signed integer to a 64-bit unsigned integer. The solution is to do the conversion with the following one-liner: 2**64 + hash_code if hash_code < 0 else hash_code, and then use regular bit shifts (i.e. >> or >>=).
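Putting the pieces together, the full probing sequence can be sketched as a pure-Python generator (a sketch under the assumptions above; the actual chapter code may organize this differently):

```python
def probe_indices(hash_code, size):
    """Yield the CPython 3.2 probing sequence for a power-of-two `size`."""
    # Reinterpret a signed 64-bit hash as unsigned
    perturb = 2**64 + hash_code if hash_code < 0 else hash_code
    idx = perturb % size
    while True:
        yield idx
        idx = (5 * idx + 1 + perturb) % size
        perturb >>= 5
```

Once perturb shifts down to zero (after at most 13 shifts for a 64-bit hash), the sequence degenerates into the plain 5 * idx + 1 recurrence, so it is guaranteed to cover every slot eventually.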
Arrows are color-coded: green means perturb != 0 and blue means perturb == 0.
Adding noise (from perturb) makes things slower when a hash table is full, so the worst-case scenario becomes even worse (compared to (5 * idx) + 1). However, in practice we keep dicts sparse by ensuring the load factor never goes above 2/3, so naturally there are many chances to hit an empty slot.
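The 2/3 threshold can be checked without floating-point arithmetic. Here is a hypothetical helper illustrating the check (CPython actually counts fill = used + DUMMY slots, not just the number of live items):

```python
def needs_resize(fill, size):
    """True when more than two thirds of the slots are taken.
    `fill` counts occupied + DUMMY slots, mirroring CPython's check."""
    return fill * 3 >= size * 2
```

For an 8-slot table, a sixth occupied slot (6 * 3 = 18 >= 16) is what triggers the resize.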
If you are interested in more subtleties and technical details, you can check Objects/dictnotes.txt and the comments near the top of Objects/dictobject.c.
There are a couple more changes to almost-python-dict, but they are small.
When you type a dict literal in your code, for example:
Python actually knows the number of key-value pairs and tries to guess the optimal hash table size, possibly avoiding some or all resizes. In most cases, the resulting hash table ends up being the same size or smaller. However, in some cases the resulting hash table may actually be larger, if the literal contains a lot of repeated keys (e.g. {1: 1, 1: 2, 1: 3, 1: 4, 1: 5, 1: 6, 1: 7, 1: 8, 1: 9}).
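The repeated-key behavior is easy to verify: later values in a literal simply overwrite earlier ones, so the final dict holds a single pair even though Python initially sized the table for nine:

```python
d = {1: 1, 1: 2, 1: 3, 1: 4, 1: 5, 1: 6, 1: 7, 1: 8, 1: 9}
# Only one key survives, and the last value wins.
assert len(d) == 1 and d[1] == 9
```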
So in __init__ we use the familiar find_closest_size() function to find the power of two greater than the number of items. This guarantees that there will be no more than one resize.
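find_closest_size() can be sketched like this (an assumed reconstruction based on the description above; the chapter-3 version may differ in details):

```python
def find_closest_size(minused):
    """Smallest power of two strictly greater than `minused`,
    never smaller than the minimum table size of 8."""
    size = 8
    while size <= minused:
        size *= 2
    return size
```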
The code for the resize is a bit different, because a resize operation can increase the size of the hash table by as much as 4x. __setitem__ is the same, except for the probing algorithm, and it also tries to recycle slots containing tombstones (DUMMY placeholders), just like the insert operation in the previous chapter did. Although in the case of __init__ no recycling is going to happen, as we start with an empty table containing no dummy slots.
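The tombstone-recycling logic in __setitem__ can be illustrated with a simplified sketch (linear probing is used here for brevity, and DUMMY is a stand-in name; the real code uses the perturb-based probe):

```python
DUMMY = "<dummy>"  # stand-in tombstone placeholder

def set_item(slots, key, value):
    """Insert or overwrite (key, value), reusing the first DUMMY slot seen."""
    size = len(slots)
    idx = hash(key) % size
    recycle = None  # first tombstone encountered along the probe chain
    while True:
        slot = slots[idx]
        if slot is None:
            # Key is absent: prefer the recycled tombstone slot if we saw one
            slots[recycle if recycle is not None else idx] = (key, value)
            return
        if slot == DUMMY:
            if recycle is None:
                recycle = idx
        elif slot[0] == key:
            slots[idx] = (key, value)  # key already present: overwrite in place
            return
        idx = (idx + 1) % size
```

The important detail is that probing does not stop at a DUMMY slot: the key might still be further along the chain, so the tombstone is only remembered and reused once we are sure the key is absent.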
How much difference do the probing algorithm and the other changes make? How different is the resulting dict compared to the almost-python-dict from chapter 3? Here are the two versions side by side:
The code for removing an item stays mostly the same. Again, a different probing algorithm is used, but conceptually it is the exact same algorithm: try to find a key, and if it is there, overwrite the item with a DUMMY placeholder.
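Removal can be sketched the same way (again with linear probing for brevity; DUMMY is a stand-in name):

```python
DUMMY = "<dummy>"  # stand-in tombstone placeholder

def del_item(slots, key):
    """Find `key` and replace its slot with DUMMY, keeping probe chains intact."""
    size = len(slots)
    idx = hash(key) % size
    while slots[idx] is not None:
        slot = slots[idx]
        if slot != DUMMY and slot[0] == key:
            slots[idx] = DUMMY  # tombstone, not None: later lookups keep probing
            return
        idx = (idx + 1) % size
    raise KeyError(key)
```

Writing DUMMY instead of None is what keeps lookups for other keys working: a None would prematurely terminate any probe chain that passed through this slot.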
Resizes are expensive, so it is better to have fewer of them, and Python sometimes quadruples the size of a table. Quadrupling is reserved for smaller hash tables: when the number of items is below 50,000, Python 3.2 quadruples the size; above that, it aims to double it. Although, just like in the previous chapter, a resize operation can shrink the table or keep the size the same (while dropping the unnecessary dummy slots) if too many slots are wasted by DUMMY placeholders.
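The growth rule can be sketched as follows (a hypothetical helper mirroring the rule described above, with the 50,000-item cutoff from Python 3.2):

```python
def new_table_size(used):
    """Pick a new power-of-two table size: quadruple small dicts,
    double dicts with more than 50,000 items (Python 3.2's rule)."""
    minused = used * (2 if used > 50000 else 4)
    size = 8  # minimum table size
    while size <= minused:
        size *= 2
    return size
```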
While building the dict from the original pairs, no resize operation was run, because Python correctly guessed the number of slots needed. To see a resize in action, let's insert some additional pairs: [("tee", 1), ("wc", 2)].
After running __setitem__ multiple times for these pairs, we can take a look at the resize in depth:
This is the last chapter currently available. Developing and debugging the engine for this explorable explanation took way, way longer than I expected, so I am releasing the first few chapters before the following ones.
Did interactive visualizations help you understand something? Did you discover something interesting because of the interactivity and animations? Did it help you build an intuition for hash tables? I am really curious about this and I'd love to hear from you, so drop me a message. And if you find a bug, please open a new issue on GitHub.
Is the dict in Python 3.7 similar to the dict in Python 3.2? I'd say yes. It's still an open addressing hash table with the weird perturb-based probing. The hash function for strings changed a couple of times, some memory sharing for keys was introduced, and dicts became ordered.
3.3 introduced changes to the internal structure of dicts (the "Key-Sharing Dictionary") that improved memory consumption in certain cases. 3.3 also randomizes the seed for hash functions, so that hash() return values are less predictable from the outside. This is a security-related change; object hashes are still stable within the same "run" of the Python interpreter.
In 3.4, the hash function for strings was changed to a more secure algorithm which is more resistant to hash collision attacks.
In 3.6, the internal structure of dicts became more compact, and dicts became "ordered".