Inside python dict — an explorable explanation

JavaScript code is loading...

Chapter 4. How Python dict *really* works internally

Now it is (finally!) time to explore how the dict works in Python!

This explanation is about the dict in CPython (the most popular, "default", implementation of Python). Most of CPython is implemented in C, and the dict implementation is written in C too. This explanations reimplements everything in Python, because the main focus of this explanation is algorithms. Nevertheless, the state of dicts on this page should match the state of dicts inside CPython 3.2. I.e. the hash() codes of keys should match, the probing result should match, and as a result the slots match as well.

In other words, this is an actual reimplementation of Python dict in Python that mirrors all of the aspects of the underlying data structure. And the visualization might as well be the visualization of the state produced from the C code.

Why CPython 3.2? It's a good starting point, later versions expand the implementation (rather than replace it), so this chapter focuses on CPython 3.2's dict, and future chapters will discuss changes in the later versions (3.3 - 3.7).

The central difference between almost-Python-dict from the third chapter and real Python dicts is the probing algorithm. This probing algorithm stayed the same in all versions (at least up until 3.7, which is the latest version at the time of writing)

The probing algorithm

The problem with the simple linear probing is that it doesn't mix up the keys well in many real-world data patterns. Real world data patterns tend to be regular, and a pattern like 16, 0, 1, 2, 3, 4... would lead to many collisions.

Linear probing is prone to clustering: once you get a "clump" of keys, the clump tends to grow, which causes more collisions, which cause the clump to grow further, which causes even more collisions. This is detrimental to performance.

One way to address this problem by using a better hash function, in particular when it comes to integers (hash(x) == x for small integers in Python). Another way to address this problem is by using a different probing algorithm - and this is what CPython developers decided.

There are two requirements for a probing algorithm:

  1. It should be deterministic.
  2. It should always hit an empty slot eventually (even if it takes many steps). We need it to work even in the worst possible scenario: when there is a collision in every non-empty slot.

Let's take a look at linear probing first. If we repeatedly run its recurrence (idx = (idx + 1) % size) until we end up hitting a slot twice, we get the following picture:

It does not matter what slot we start from, the picture will look exactly the same. Linear probing is very regular and predictable. Now, let's change the recurrence to idx = (5 * idx + 1) % size (note the 5):

idx = (5 * idx + 1) % size still guarantees to eventually hit every possible slot if size is a power of two (the proof of this fact is outside the scope of this page). Also, the algorithm is obviously deterministic. So, both requirements for a probing algorithm are satisfied. This algorithm scrambles the order of indexes a bit. It certainly less regular but it is still prone to clustering.

The probing algorithm in CPython takes this recurrence and adds a ton of scrambling to it: idx = ((5 * idx) + 1 + perturb) % size. What is this perturb weirdness though? In C code, it is initialized as basically this: size_t perturb = hash_code. Then, in every iteration, it is right-shifted by 5 bits (perturb >>= 5).

This probing algorithm uses some "randomness" in the form of bits from the hash code - but the probing is still fully deterministic because hash functions by their nature are deterministic. perturb eventually reaches zero, and the recurrence becomes idx = (5 * idx) + 1, which is guaranteed to hit every slot (eventually).

We can reimplement this algorithm in pure Python. However, in Python there are no unsigned (logical) bit shifts, and there is also no built-in way to convert a 64-bit signed integer to a 64-bit unsigned integer. The solution is to do the conversion with the following one-liner: 2**64 + hash_code if hash_code < 0 else hash_code and then use regular bit shifts (i.e. >> or >>=)

Let's see how the algorithm works for the following key:

Arrows are color-coded: green means perturb != 0 and blue means perturb == 0

PERTURB_SHIFT = 5                                              

def probe_all(key, slots_count=8):                             
   ~ Called with the key "hello"

    hash_code = hash(key)                                      
   ~ Compute the hash code: 840651671246116861

    perturb = 2**64 + hash_code if hash_code < 0 else hash_code
   ~ perturb is 840651671246116861, the same as hash (because it is positive)

    idx = hash_code % slots_count                              
   ~ Compute the starting slot index: 840651671246116861 % 8 == 5

    visited = set()                                            
   ~ Initialize the set of visited slots

    while len(visited) < slots_count:                          
   ~ all 8 / 8 slots are visited — stop

        visited.add(idx)                                       

        idx = (idx * 5 + perturb + 1) % slots_count            

        perturb >>= PERTURB_SHIFT                              

Adding noise (from perturb) makes things slower when a hash table is full, the worst case scenario becomes even worse (compared to (5 * idx) + 1). However, in practice, we keep dicts sparse, by ensuring the load factor never goes above 2/3), so naturally there are many chances to hit an empty slot.

If you are interested in more subtleties and technical details, you can check Objects/dictnotes.txt and comments near the top of Objects/dictobject.c

Python 3.2's dict

There are a couple more changes to almost-python-dict, but they are small.

When you type a dict literal in your code, for example:

Python actually knows the number of key-value pairs and tries to guess the optimal hash table size to possibly avoid some or all resizes. In most cases, the resulting hash table ends up being the same size or smaller. However, in some cases the resulting hash table may actually be larger if there are a lot of repeated keys in the literal (e.g. {1: 1, 1: 2, 1: 3, 1: 4, 1: 5, 1: 6, 1: 7, 1: 8, 1: 9})

So in __init__ we use the familiar find_closest_size() function to find the power of two greater than the number of items. This guarantees that there will be no more than one resize.

The code for the resize is a bit different, because a resize operation can increase the size of the hashtable by as much as 4x. The __setitem__ is the same, except for the probing algorithm, and it also tries to recycle slots containing tombstones (DUMMY placeholders), just like the insert operation in the previous chapter did. Although, in case of__init__ recycling is not going to happen, as we start with an empty table containing no dummy slots.

pairs[*][0]
pairs[*][1]
slots[*].key
slots[*].value
slots[*].hashCode
git
/
su
0
du
-h
cut
-f 1
-27
42
stash
cp
a.py b.py
ed
36
uniq
-c
ls
git
stash
-530

8861
ls
/
1382

4725
uniq
-c
6468

5819
ed
36
1292

6553
cp
a.py b.py
1267

4267
cut
su
0
1472

2910
-27
42
-27
-f 1
-799

7437
du
-h
1280

5547
@staticmethod                                           

def signed_to_unsigned(hash_code):                      

    if hash_code < 0:                                   

        return 2**64 + hash_code                        

    else:                                               

        return hash_code                                

                                                        

def __init__(self, pairs=None):                         

    if pairs:                                           
   ~ If there are any pairs (and there are 9 pairs)

        start_size = self.find_closest_size(len(pairs)) 
   ~ Find the power of two > 8 and > 9 . It is 16

    else:                                               

        start_size = 8                                  

    self.slots = [Slot() for _ in range(start_size)]    
   ~ Start by creating a list of empty slots of size 16

    self.fill = 0                                       
   ~ Set fill to 0, because there are no items (yet)

    self.used = 0                                       
   ~ Set used to 0, because there are no items (yet)

    if pairs:                                           
   ~ Check if there are pairs to insert: there are 9 pairs

        for k, v in pairs:                              
   ~ [9/9] The current pair is -27 and 42

            self[k] = v                                 
   ~ Call self.__setitem__(-27, 42)

                                                        

PERTURB_SHIFT = 5                                       

                                                        

def __setitem__(self, key, value):                      

    hash_code = hash(key)                               
   ~ Compute the hash code: -27

    idx = hash_code % len(self.slots)                   
   ~ Compute the starting slot index: -27 % 16 == 5

    perturb = self.signed_to_unsigned(hash_code)        
   ~ Compute perturb by converting the hash code to unsigned: 18446744073709551589

    target_idx = None                                   
   ~ Initialize target_idx - this is the index of the slot where we'll put the item

    while self.slots[idx].key is not EMPTY:             
   ~ After 1 collision, an empty slot (at 15) is found: the collision is successfully resolved

        if self.slots[idx].hash_code == hash_code and\  

           self.slots[idx].key == key:                  

            target_idx = idx                            

            break                                       

        if target_idx is None and \                     

           self.slots[idx].key is DUMMY:                

            target_idx = idx                            

        idx = (idx * 5 + perturb + 1) % len(self.slots) 

        perturb >>= self.PERTURB_SHIFT                  

                                                        

    if target_idx is None:                              
   ~ target_idx is None, it means we haven't encountered a DUMMY slot or a slot with the key itself

        target_idx = idx                                
   ~ So we'll put the item in the current slot (15), which is empty

    if self.slots[target_idx].key is EMPTY:             
   ~ If we're putting the item in an empty slot (and we are)

        self.used += 1                                  
   ~ then we need to increment self.used, which makes it 9

        self.fill += 1                                  
   ~ and increment fill, which makes it 9

    elif self.slots[target_idx].key is DUMMY:           

        self.used += 1                                  

                                                        

    self.slots[target_idx] = Slot(hash_code, key, value)
   ~ Put the item in slot 15

    if self.fill * 3 >= len(self.slots) * 2:            
   ~ fill factor (56%) < 2/3: 9 * 3 (== 27) < 16 * 2 (== 32), so no need to run resize()

        self.resize()                                   

                                                        

How much difference the probing algorithm and the other changes make? How different is the resulting dict compared to almost-python-dict from chapter 3? Here are the two versions side by side:

almost-python-dict
python 3.2 dict
git
stash
-530

8861
cut
-f 1
-799

7437
ls
/
1382

4725
-27
42
-27
ed
36
1292

6553
cp
a.py b.py
1267

4267
uniq
-c
6468

5819
du
-h
1280

5547
su
0
1472

2910
git
stash
-530

8861
ls
/
1382

4725
uniq
-c
6468

5819
ed
36
1292

6553
cp
a.py b.py
1267

4267
cut
su
0
1472

2910
-27
42
-27
-f 1
-799

7437
du
-h
1280

5547

The code for removing an item stays mostly the same. Again, a different probing algoithm is used, but conceptually it is the same exact algoirthm: try to find a key, and if it is there, overwrite the item with a DUMMY placeholder.

Deleting
slots[*].key
slots[*].value
slots[*].hashCode
git
stash
-530

8861
ls
/
1382

4725
uniq
-c
6468

5819
ed
36
1292

6553
cp
a.py b.py
1267

4267
cut
su
0
1472

2910
-27
42
-27
-f 1
-799

7437
DUMMY
1280

5547
def lookdict(self, key):                               

    hash_code = hash(key)                              
   ~ Compute the hash code: 12800076900115547

    idx = hash_code % len(self.slots)                  
   ~ Compute the starting slot index: 12800076900115547 % 16 == 11

    perturb = self.signed_to_unsigned(hash_code)       
   ~ Compute perturb by converting the hash code to unsigned: 12800076900115547

    while self.slots[idx].key is not EMPTY:            
   ~ [Try #3] Slot 2 is occupied, so check it

        if self.slots[idx].hash_code == hash_code and \
   ~ 12800076900115547 == 12800076900115547, so the slot might be occupied by the same key

           self.slots[idx].key == key:                 
   ~ "du" == "du", so the key is found

            return idx                                 
   ~ Return 2, the index of the current slot

                                                       

        idx = (idx * 5 + perturb + 1) % len(self.slots)

        perturb >>= self.PERTURB_SHIFT                 

                                                       

    raise KeyError()                                   

                                                       

def __delitem__(self, key):                            

    idx = self.lookdict(key)                           
   ~ Call self.lookdict("du")

                                                       

    self.used -= 1                                     
   ~ We're about to put the DUMMY placeholder in the slot, so set the counter of used slots to 8

    self.slots[idx].key = DUMMY                        
   ~ Replace key at 2 with the DUMMY placeholder

    self.slots[idx].value = EMPTY                      
   ~ Replace value at 2 with EMPTY

The search is also conceptually the same, with the only difference being — you probably guessed it at this point — the probing algorithm. For example, let's say we want to get the following key
slots[*].key
slots[*].value
slots[*].hashCode
git
stash
-530

8861
ls
/
1382

4725
uniq
-c
6468

5819
ed
36
1292

6553
cp
a.py b.py
1267

4267
cut
su
0
1472

2910
-27
42
-27
-f 1
-799

7437
DUMMY
1280

5547
def lookdict(self, key):                               

    hash_code = hash(key)                              
   ~ Compute the hash code: 6468739353497665819

    idx = hash_code % len(self.slots)                  
   ~ Compute the starting slot index: 6468739353497665819 % 16 == 11

    perturb = self.signed_to_unsigned(hash_code)       
   ~ Compute perturb by converting the hash code to unsigned: 6468739353497665819

    while self.slots[idx].key is not EMPTY:            
   ~ [Try #3] Slot 8 is occupied, so check it

        if self.slots[idx].hash_code == hash_code and \
   ~ 6468739353497665819 == 6468739353497665819, so the slot might be occupied by the same key

           self.slots[idx].key == key:                 
   ~ "uniq" == "uniq", so the key is found

            return idx                                 
   ~ Return 8, the index of the current slot

                                                       

        idx = (idx * 5 + perturb + 1) % len(self.slots)

        perturb >>= self.PERTURB_SHIFT                 

                                                       

    raise KeyError()                                   

                                                       

def __getitem__(self, key):                            

    idx = self.lookdict(key)                           
   ~ Call self.lookdict("uniq")

                                                       

    return self.slots[idx].value                       
   ~ Return "-c", value from slot 8

Resize

Resizes are expensive, so it is better to have less of them. So Python sometimes quadruples the size of a table. This is more reasonable for smaller hash tables, when the number of items is smaller than 50000. When the number of items is greater than 50 thousand, Python 3.2 aims to double the size. Although, just like in previous chapter, a resize operation can shrink the table or keep the size the same (while dropping the unnecessary dummy slots), if too many slots are wasted by DUMMY placeholders.

While building the dict from the original pairs, no resize operation was run, because Python correctly guessed the number of slots needed. To see a resize in action, let's insert some additional pairs: [("tee", 1), ("wc", 2)]

After running __setitem__ multiple times for these pairs, we can take a look at the resize in-depth:

oldSlots[*].key
oldSlots[*].value
oldSlots[*].hashCode
slots[*].key
slots[*].value
slots[*].hashCode
git
stash
-530

8861
wc
2
1523

7476
ls
/
1382

4725
uniq
-c
6468

5819
ed
36
1292

6553
cp
a.py b.py
1267

4267
cut
tee
1
-149

9139
su
0
1472

2910
-27
42
-27
-f 1
-799

7437
DUMMY
1280

5547
cut
-f 1
-799

7437
-27
42
-27
wc
su
2
0
1472

2910
1523

7476
ls
/
1382

4725
ed
36
1292

6553
tee
1
-149

9139
git
stash
-530

8861
cp
a.py b.py
1267

4267
uniq
-c
6468

5819
def resize(self):                                                       

    old_slots = self.slots                                              
   ~ Copy the reference to slots (no actual copying is done)

    minused = self.used * (4 if self.used <= 50000 else 2)              

    new_size = self.find_closest_size(minused)                          
   ~ Find the smallest power of two greater than 40. It is 64

    self.slots = [Slot() for _ in range(new_size)]                      
   ~ Create a new list of empty slots of size 64

    self.fill = self.used                                               
   ~ Set fill to 10, since we will skip all slots with the DUMMY placeholder

    for slot in old_slots:                                              
   ~ [16/16] The current slot's key is -27 and its hash is -27

        if slot.key is not EMPTY and slot.key is not DUMMY:             
   ~ The current slot contains a normal item

            idx = slot.hash_code % len(self.slots)                      
   ~ Compute the starting slot index: -27 % 64 == 37

            perturb = self.signed_to_unsigned(slot.hash_code)           
   ~ Compute perturb by converting the hash code to unsigned: 18446744073709551589

            while self.slots[idx].key is not EMPTY:                     
   ~ Slot 37 is empty: no need to do collision resolution

                idx = (idx * 5 + perturb + 1) % len(self.slots)         

                perturb >>= self.PERTURB_SHIFT                          

                                                                        

            self.slots[idx] = Slot(slot.hash_code, slot.key, slot.value)
   ~ Put the item in slot 37

                                                                        

More chapters to come

This is the last chapter currently available. Developing and debugging the engine for this explorable explanation took way, way longer than I expected, so I am releasing the first few chapters before the following ones.

Did interactive visualizations help you understand something? Did you discover something interested because of interactivity and animations? Did it help building an intuition for hash tables? I am really curious about this, I'd love to hear from you, so drop me a message. And if you find a bug, please open a new issue on Github.

Is dict in Python 3.7 similar to dict Python 3.2? I'd say yes. It's still an open addressing hash table with weird perturb-based probing. The hash function for strings changed a couple of times, some memory sharing for keys was introduced, and dicts became ordered.

A very brief history of changes in versions 3.3 - 3.7

3.3 introduced changes to internal structure of dicts ("Key-Sharing Dictionary") that improved memory consumption in certain cases. 3.3 also randomizes seed for hash functions, so that hash() return values are less predictable from the outside. This is a security-related change, and object hashes are still stable within the same "run" of Python interpreter.

In 3.4, the hash function for strings was changed to a more secure algorithm which is more resistant to hash collision attacks.

In 3.6 the dict internal structure became more compact and the dict became "ordered".

Contents