Inside python dict — an explorable explanation

JavaScript code is loading...

Searching efficiently in a list

Why are hash tables called hash tables?

Putting it all together to make an "almost"-python-dict

How python dict really works internally

Chapter 3. Putting it all together to make an "almost"-Python-dict

In this chapter, we'll make a class that supports the basic interface of a Python dict. The class will keep track of the counters necessary for computing the load factor and auto-resize the hash table. On the inside, it'll work differently from Python dict, and we'll discuss these differences in the next chapter.

The almost-Python-dict class needs to support values. To handle values we could add another list (in addition to hash_codes and keys). This would totally work. Another alternative is to bundle hash_code, key, value corresponding to each slot in a single object. To do this, we'll need to create a class:

class Slot(object):
    def __init__(self, hash_code=EMPTY, key=EMPTY, value=EMPTY):
        self.hash_code = hash_code
        self.key = key
        self.value = value

How do we initialize an empty hash table? In previous chapters, we based the initial size of hash tables on the original list. Since we now know how to resize tables, we can start with an empty table and grow it. Hash tables inside Python dictionaries are size 8 when they are empty, so let's use 8 as the initial size.

Python hash table sizes are powers of 2, so we will also use powers of 2. even though nothing prevents us from using "non-round" values. The primary reason Python uses "round" powers of 2 is efficiency: computing % 2**n can be implemented using bit operations (& (1 << n)).

Here is the overview of the interface of the class:

class AlmostDict(object):
    def __init__(self, pairs=None):
        self.slots = [Slot() for _ in range(8)]
        self.fill = 0
        self.used = 0
        # Insert all initial pairs
        if pairs:
            for k, v in pairs:
                # This is syntactic sugar for self.__setitem__(k, v)
                self[k] = v

    def __setitem__(self, key, value):
        # Allows us to set a value in a dict-like fashion
        # d = Dict()
        # d[1] = 2 and d.__setitem__(1, 2) gets called
        <implementation goes here>

    def __getitem__(self, key):
        # Allows us to get a value from a dict, for example:
        # d = Dict()
        # d[1] = 2
        # d[1] calls d.__getitem__(1) and returns 2
        <implementation goes here>

    def __delitem__(self, key):
        # Allows us to use "del" in a dict-like fashion, for example:
        # d = Dict()
        # d[1] = 2
        # del d[1] calls d.__delitem__(1)
        # d[1] or d.__getitem__(1) raises KeyError now
        <implementation goes here>

__setitem__, __getitem__ and __delitem__ are special methods, often called magic methods. They are called this way because you don't have to invoke them directly: Python does it for you behind the scenes, when you use square brackets (e.g. d[1]). Other than that, they are normal methods.

Some methods are going to update fill (the number of non-empty slots including dummy slots) and used (the number of normal items in the dictionary, same as the dictionary size). Fill factor is the ratio between fill and len(slots). When it reaches 2/3, the table gets resized. The new size is based on the number of normal items in the table, which is self.used, and typically it is 2x of original size.

Let's take a look at the code, starting with the __init__ method. We're creating the dict from the following pairs:

The code in __init__ also assumes that the dict contents are passed as a list of pairs (not an actual dict — which we are reimplementing).

Compared to the previous chapter, the differences in building hash tables are:

inserting an individual item is now in a dedicated __setitem__ method;
The fill and used counters are incremented if necessary;
most importantly, resize() gets called after inserting an element if the fill factor gets too high.

pairs[*][0]

pairs[*][1]

slots[*].key

slots[*].value

slots[*].hashCode

git

-h

cut

-f 1

-27

stash

a.py b.py

uniq

-c

git

stash

-530
⋯
8861

cut

-f 1

-799
⋯
7437

1382
⋯
4725

-27

1292
⋯
6553

a.py b.py

1267
⋯
4267

uniq

-c

6468
⋯
5819

-h

1280
⋯
5547

1472
⋯
2910

def __init__(self, pairs=None):                       
    self.slots = [Slot() for _ in range(8)]              ~ Start by creating a list of empty slots of size 8
    self.fill = 0                                        ~ Set fill to 0, because there are no items (yet)
    self.used = 0                                        ~ Set used to 0, because there are no items (yet)
    if pairs:                                            ~ Check if there are pairs to insert: there are 9 pairs
        for k, v in pairs:                               ~ [9/9] The current pair is -27 and 42
            self[k] = v                                  ~ Call self.__setitem__(-27, 42)
                                                      
def __setitem__(self, key, value):                    
    hash_code = hash(key)                                ~ Compute the hash code: -27
    idx = hash_code % len(self.slots)                    ~ Compute the starting slot index: 5 == -27 % 16
    while self.slots[idx].key is not EMPTY:              ~ After 1 collision, an empty slot (at 6) is found: the collision is successfully resolved
        if self.slots[idx].hash_code == hash_code and\
           self.slots[idx].key == key:                
            break                                     
        idx = (idx + 1) % len(self.slots)             
                                                      
    if self.slots[idx].key is EMPTY:                     ~ If we're putting the item in an empty slot (and we are)
        self.used += 1                                   ~ then we  need to increment self.used, which makes it 9
        self.fill += 1                                   ~ and increment fill, which makes it 9
                                                      
    self.slots[idx] = Slot(hash_code, key, value)        ~ Put the item in slot 6
    if self.fill * 3 >= len(self.slots) * 2:             ~ fill factor (56%) < 2/3:  9 * 3 (== 27) < 16 * 2 (== 32), so no need to run resize()
        self.resize()                                 
                                                      

When resizing a hash table, how do we find a new optimal size? We find a valid table size greater than 2 * self.used. A valid hash table size is a power of two which is at least 8.

def find_closest_size(self, minused):
    new_size = 8
    while new_size <= minused:
        new_size *= 2

    return new_size

The code only uses self.used (the number of "useful" non-empty slots). It does not depend on self.fill (the total number of non-empty slots) in any way. The idea is to double the size of the table if there are very few dummy slots and shrink it if there are too many dummy slots (so that the memory is saved).

find_closest_size(2 * self.used) == find_closest_size(2 *

) == find_closest_size(22) == 32

While the original items were being inserted, a resize happened. Let's look at it in depth:

oldSlots[*].key

oldSlots[*].value

oldSlots[*].hashCode

slots[*].key

slots[*].value

slots[*].hashCode

git

stash

-530
⋯
8861

a.py b.py

1267
⋯
4267

uniq

-c

6468
⋯
5819

1382
⋯
4725

1472
⋯
2910

1292
⋯
6553

git

stash

-530
⋯
8861

1382
⋯
4725

1292
⋯
6553

a.py b.py

1267
⋯
4267

uniq

-c

6468
⋯
5819

1472
⋯
2910

def resize(self):                                                       
    old_slots = self.slots                                                 ~ Copy the reference to slots (no actual copying is done)
    new_size = self.find_closest_size(self.used * 2)                       ~ Find the smallest power of two greater than 6 * 2. It is 16
    self.slots = [Slot() for _ in range(new_size)]                         ~ Create a new list of empty slots of size 16
    self.fill = self.used                                                  ~ Set fill to 6, since we will skip all slots with the DUMMY placeholder
    for slot in old_slots:                                                 ~ [8/8] The current slot's key is "su" and its hash is 14720088435132910
        if slot.key is not EMPTY and slot.key is not DUMMY:                ~ The current slot contains a normal item
            idx = slot.hash_code % len(self.slots)                         ~ Compute the starting slot index: 14 == 14720088435132910 % 16
            while self.slots[idx].key is not EMPTY:                        ~ Slot 14 is empty: no need to do collision resolution
                idx = (idx + 1) % len(self.slots)                       
                                                                        
            self.slots[idx] = Slot(slot.hash_code, slot.key, slot.value)   ~ Put the item in slot 14
                                                                        

In the previous chapter, the code for removing and the code for searching were very similar, because, to remove an element, we need to find it first. We can reorganize the code so that the removing and searching functions share much of the same code. The common function's name will be lookdict().

Other than that, removing a key will look pretty much the same as in the previous chapter. __delitem__ magic method is now used for realism so that we can do del almost_dict[42]. And we decrement the self.used counter if we end up finding the element and removing it.

For example, let's say we want to remove

slots[*].key

slots[*].value

slots[*].hashCode

git

stash

-530
⋯
8861

cut

-f 1

-799
⋯
7437

1382
⋯
4725

-27

1292
⋯
6553

a.py b.py

1267
⋯
4267

uniq

-c

6468
⋯
5819

DUMMY

1280
⋯
5547

1472
⋯
2910

def lookdict(self, key):                               
    hash_code = hash(key)                                 ~ Compute the hash code: 12800076900115547
    idx = hash_code % len(self.slots)                     ~ Compute the starting slot index: 11 == 12800076900115547 % 16
    while self.slots[idx].key is not EMPTY:               ~ [Try #3] Slot 13 is occupied, so check it
        if self.slots[idx].hash_code == hash_code and \   ~ 12800076900115547 == 12800076900115547, so the slot might be occupied by the same key
           self.slots[idx].key == key:                    ~ "du" == "du", so the key is found
            return idx                                    ~ Return 13, the index of the current slot
                                                       
        idx = (idx + 1) % len(self.slots)              
                                                       
    raise KeyError()                                   
                                                       
def __delitem__(self, key):                            
    idx = self.lookdict(key)                              ~ Call self.lookdict("du")
                                                       
    self.used -= 1                                        ~ We're about to put the DUMMY placeholder in the slot, so set the counter of used slots to 8
    self.slots[idx].key = DUMMY                           ~ Replace key at 13 with the DUMMY placeholder
    self.slots[idx].value = EMPTY                         ~ Replace value at 13 with EMPTY

After using the new lookdict() function, search function __getitem__ also gets very short.

Searching for

slots[*].key

slots[*].value

slots[*].hashCode

git

stash

-530
⋯
8861

cut

-f 1

-799
⋯
7437

1382
⋯
4725

-27

1292
⋯
6553

a.py b.py

1267
⋯
4267

uniq

-c

6468
⋯
5819

DUMMY

1280
⋯
5547

1472
⋯
2910

def lookdict(self, key):                               
    hash_code = hash(key)                                 ~ Compute the hash code: 6468739353497665819
    idx = hash_code % len(self.slots)                     ~ Compute the starting slot index: 11 == 6468739353497665819 % 16
    while self.slots[idx].key is not EMPTY:               ~ [Try #2] Slot 12 is occupied, so check it
        if self.slots[idx].hash_code == hash_code and \   ~ 6468739353497665819 == 6468739353497665819, so the slot might be occupied by the same key
           self.slots[idx].key == key:                    ~ "uniq" == "uniq", so the key is found
            return idx                                    ~ Return 12, the index of the current slot
                                                       
        idx = (idx + 1) % len(self.slots)              
                                                       
    raise KeyError()                                   
                                                       
def __getitem__(self, key):                            
    idx = self.lookdict(key)                              ~ Call self.lookdict("uniq")
                                                       
    return self.slots[idx].value                          ~ Return "-c", value from slot 12

So we now have a class that emulates the basic part of the dict interface. Before we move on to the next chapter, let's discuss a neat trick for inserting new items.

Recycling dummy keys.

Dummy keys are used as placeholders. The only purpose of a dummy slot is to prevent a probing algorithm from breaking. The algorithm will work as long as the "deleted" slot is occupied by something, be it a dummy placeholder or a normal item.

This means that while inserting an item, if we end up hitting a dummy slot, we can put the item in that slot (if the key does isn't already inserted).

So, we still need to do a full look up, but we will also save an index of the first dummy slot to target_idx (if we encounter it). If we find that a key already exists, we save the index to target_idx and break. If we find neither a dummy slot nor the key, then we insert it in the first empty slot - as we did before.

In the absence of dummy slots, the code works the same. So, even though we built the table with a simpler version of __setitem__, it would look exactly the same as if we built it applying this optimization.

Remember that we removed the key

Now, what happens after if we insert another item to the modified hash table:

key =

value =

slots[*].key

slots[*].value

slots[*].hashCode

git

stash

-530
⋯
8861

cut

-f 1

-799
⋯
7437

1382
⋯
4725

-27

1292
⋯
6553

a.py b.py

1267
⋯
4267

uniq

-c

6468
⋯
5819

recycling

499

-787
⋯
9349

1472
⋯
2910

def __setitem__(self, key, value):                      
    hash_code = hash(key)                                  ~ Compute the hash code: -7876116100209519349
    idx = hash_code % len(self.slots)                      ~ Compute the starting slot index: 11 == -7876116100209519349 % 16
    target_idx = None                                      ~ Initialize target_idx - this is the index of the slot where we'll put the item
    while self.slots[idx].key is not EMPTY:                ~ After 4 collisions, an empty slot (at 15) is found: the collisions are successfully resolved
        if self.slots[idx].hash_code == hash_code and\  
           self.slots[idx].key == key:                  
            target_idx = idx                            
            break                                       
        if target_idx is None and \                     
           self.slots[idx].key is DUMMY:                
            target_idx = idx                            
        idx = (idx + 1) % len(self.slots)               
                                                        
    if target_idx is None:                                 ~ target_idx is not None, and this means we already know where to put the item
        target_idx = idx                                
    if self.slots[target_idx].key is EMPTY:                ~ If we're putting the item in an empty slot (and we aren't)
        self.used += 1                                  
        self.fill += 1                                  
    elif self.slots[target_idx].key is DUMMY:              ~ If we're putting the item in a DUMMY slot (and we are)
        self.used += 1                                     ~ then we only need to increment self.used, which makes it 9
                                                        
    self.slots[target_idx] = Slot(hash_code, key, value)   ~ Put the item in slot 13
    if self.fill * 3 >= len(self.slots) * 2:               ~ fill factor (56%) < 2/3:  9 * 3 (== 27) < 16 * 2 (== 32), so no need to run resize()
        self.resize()                                   
                                                        

After we inserted it, a DUMMY slot got recycled — as expected. However, when there is no DUMMY slot encountered, this version of __setitem__ would work exactly like the previous one. For example, if we instead tried to insert an item with a key "head", no dummy slot would get recycled, and the item would be inserted in an empty slot.

Inside python dict — an explorable explanation

Searching efficiently in a list

Why are hash tables called hash tables?

Putting it all together to make an "almost"-python-dict

How python dict *really* works internally

Chapter 3. Putting it all together to make an "almost"-Python-dict

Recycling dummy keys.

Next: How python dict *really* works internally

How python dict really works internally

Next: How python dict really works internally