Inside python dict — an explorable explanation

JavaScript code is loading...

Chapter 3. Putting it all together to make an "almost"-Python-dict

In this chapter, we'll make a class that supports the basic interface of a Python dict. The class will keep track of the counters necessary for computing the load factor and auto-resize the hash table. On the inside, it'll work differently from Python dict, and we'll discuss these differences in the next chapter.

The almost-Python-dict class needs to support values. To handle values we could add another list (in addition to hash_codes and keys). This would totally work. Another alternative is to bundle hash_code, key, value corresponding to each slot in a single object. To do this, we'll need to create a class:

class Slot(object):
    def __init__(self, hash_code=EMPTY, key=EMPTY, value=EMPTY):
        self.hash_code = hash_code
        self.key = key
        self.value = value

How do we initialize an empty hash table? In previous chapters, we based the initial size of hash tables on the original list. Since we now know how to resize tables, we can start with an empty table and grow it. Hash tables inside Python dictionaries are size 8 when they are empty, so let's use 8 as the initial size.

Python hash table sizes are powers of 2, so we will also use powers of 2. even though nothing prevents us from using "non-round" values. The primary reason Python uses "round" powers of 2 is efficiency: computing % 2**n can be implemented using bit operations (& (1 << n)).

Here is the overview of the interface of the class:

class AlmostDict(object):
    def __init__(self, pairs=None):
        self.slots = [Slot() for _ in range(8)]
        self.fill = 0
        self.used = 0
        # Insert all initial pairs
        if pairs:
            for k, v in pairs:
                # This is syntactic sugar for self.__setitem__(k, v)
                self[k] = v

    def __setitem__(self, key, value):
        # Allows us to set a value in a dict-like fashion
        # d = Dict()
        # d[1] = 2 and d.__setitem__(1, 2) gets called
        <implementation goes here>

    def __getitem__(self, key):
        # Allows us to get a value from a dict, for example:
        # d = Dict()
        # d[1] = 2
        # d[1] calls d.__getitem__(1) and returns 2
        <implementation goes here>

    def __delitem__(self, key):
        # Allows us to use "del" in a dict-like fashion, for example:
        # d = Dict()
        # d[1] = 2
        # del d[1] calls d.__delitem__(1)
        # d[1] or d.__getitem__(1) raises KeyError now
        <implementation goes here>

__setitem__, __getitem__ and __delitem__ are special methods, often called magic methods. They are called this way because you don't have to invoke them directly: Python does it for you behind the scenes, when you use square brackets (e.g. d[1]). Other than that, they are normal methods.

Some methods are going to update fill (the number of non-empty slots including dummy slots) and used (the number of normal items in the dictionary, same as the dictionary size). Fill factor is the ratio between fill and len(slots). When it reaches 2/3, the table gets resized. The new size is based on the number of normal items in the table, which is self.used, and typically it is 2x of original size.

Let's take a look at the code, starting with the __init__ method. We're creating the dict from the following pairs:

The code in __init__ also assumes that the dict contents are passed as a list of pairs (not an actual dict — which we are reimplementing).

Compared to the previous chapter, the differences in building hash tables are:

  • inserting an individual item is now in a dedicated __setitem__ method;
  • The fill and used counters are incremented if necessary;
  • most importantly, resize() gets called after inserting an element if the fill factor gets too high.
pairs[*][0]
pairs[*][1]
slots[*].key
slots[*].value
slots[*].hashCode
git
/
su
0
du
-h
cut
-f 1
-27
42
stash
cp
a.py b.py
ed
36
uniq
-c
ls
git
stash
-530

8861
cut
-f 1
-799

7437
ls
/
1382

4725
-27
42
-27
ed
36
1292

6553
cp
a.py b.py
1267

4267
uniq
-c
6468

5819
du
-h
1280

5547
su
0
1472

2910
def __init__(self, pairs=None):                       

    self.slots = [Slot() for _ in range(8)]           
   ~ Start by creating a list of empty slots of size 8

    self.fill = 0                                     
   ~ Set fill to 0, because there are no items (yet)

    self.used = 0                                     
   ~ Set used to 0, because there are no items (yet)

    if pairs:                                         
   ~ Check if there are pairs to insert: there are 9 pairs

        for k, v in pairs:                            
   ~ [9/9] The current pair is -27 and 42

            self[k] = v                               
   ~ Call self.__setitem__(-27, 42)

                                                      

def __setitem__(self, key, value):                    

    hash_code = hash(key)                             
   ~ Compute the hash code: -27

    idx = hash_code % len(self.slots)                 
   ~ Compute the starting slot index: 5 == -27 % 16

    while self.slots[idx].key is not EMPTY:           
   ~ After 1 collision, an empty slot (at 6) is found: the collision is successfully resolved

        if self.slots[idx].hash_code == hash_code and\

           self.slots[idx].key == key:                

            break                                     

        idx = (idx + 1) % len(self.slots)             

                                                      

    if self.slots[idx].key is EMPTY:                  
   ~ If we're putting the item in an empty slot (and we are)

        self.used += 1                                
   ~ then we need to increment self.used, which makes it 9

        self.fill += 1                                
   ~ and increment fill, which makes it 9

                                                      

    self.slots[idx] = Slot(hash_code, key, value)     
   ~ Put the item in slot 6

    if self.fill * 3 >= len(self.slots) * 2:          
   ~ fill factor (56%) < 2/3: 9 * 3 (== 27) < 16 * 2 (== 32), so no need to run resize()

        self.resize()                                 

                                                      

When resizing a hash table, how do we find a new optimal size? We find a valid table size greater than 2 * self.used. A valid hash table size is a power of two which is at least 8.

def find_closest_size(self, minused):
    new_size = 8
    while new_size <= minused:
        new_size *= 2

    return new_size

The code only uses self.used (the number of "useful" non-empty slots). It does not depend on self.fill (the total number of non-empty slots) in any way. The idea is to double the size of the table if there are very few dummy slots and shrink it if there are too many dummy slots (so that the memory is saved).

find_closest_size(2 * self.used) == find_closest_size(2 *
) == find_closest_size(22) == 32

While the original items were being inserted, a resize happened. Let's look at it in depth:

oldSlots[*].key
oldSlots[*].value
oldSlots[*].hashCode
slots[*].key
slots[*].value
slots[*].hashCode
git
stash
-530

8861
cp
a.py b.py
1267

4267
uniq
-c
6468

5819
ls
/
1382

4725
su
0
1472

2910
ed
36
1292

6553
git
stash
-530

8861
ls
/
1382

4725
ed
36
1292

6553
cp
a.py b.py
1267

4267
uniq
-c
6468

5819
su
0
1472

2910
def resize(self):                                                       

    old_slots = self.slots                                              
   ~ Copy the reference to slots (no actual copying is done)

    new_size = self.find_closest_size(self.used * 2)                    
   ~ Find the smallest power of two greater than 6 * 2. It is 16

    self.slots = [Slot() for _ in range(new_size)]                      
   ~ Create a new list of empty slots of size 16

    self.fill = self.used                                               
   ~ Set fill to 6, since we will skip all slots with the DUMMY placeholder

    for slot in old_slots:                                              
   ~ [8/8] The current slot's key is "su" and its hash is 14720088435132910

        if slot.key is not EMPTY and slot.key is not DUMMY:             
   ~ The current slot contains a normal item

            idx = slot.hash_code % len(self.slots)                      
   ~ Compute the starting slot index: 14 == 14720088435132910 % 16

            while self.slots[idx].key is not EMPTY:                     
   ~ Slot 14 is empty: no need to do collision resolution

                idx = (idx + 1) % len(self.slots)                       

                                                                        

            self.slots[idx] = Slot(slot.hash_code, slot.key, slot.value)
   ~ Put the item in slot 14

                                                                        

In the previous chapter, the code for removing and the code for searching were very similar, because, to remove an element, we need to find it first. We can reorganize the code so that the removing and searching functions share much of the same code. The common function's name will be lookdict().

Other than that, removing a key will look pretty much the same as in the previous chapter. __delitem__ magic method is now used for realism so that we can do del almost_dict[42]. And we decrement the self.used counter if we end up finding the element and removing it.

For example, let's say we want to remove
slots[*].key
slots[*].value
slots[*].hashCode
git
stash
-530

8861
cut
-f 1
-799

7437
ls
/
1382

4725
-27
42
-27
ed
36
1292

6553
cp
a.py b.py
1267

4267
uniq
-c
6468

5819
DUMMY
1280

5547
su
0
1472

2910
def lookdict(self, key):                               

    hash_code = hash(key)                              
   ~ Compute the hash code: 12800076900115547

    idx = hash_code % len(self.slots)                  
   ~ Compute the starting slot index: 11 == 12800076900115547 % 16

    while self.slots[idx].key is not EMPTY:            
   ~ [Try #3] Slot 13 is occupied, so check it

        if self.slots[idx].hash_code == hash_code and \
   ~ 12800076900115547 == 12800076900115547, so the slot might be occupied by the same key

           self.slots[idx].key == key:                 
   ~ "du" == "du", so the key is found

            return idx                                 
   ~ Return 13, the index of the current slot

                                                       

        idx = (idx + 1) % len(self.slots)              

                                                       

    raise KeyError()                                   

                                                       

def __delitem__(self, key):                            

    idx = self.lookdict(key)                           
   ~ Call self.lookdict("du")

                                                       

    self.used -= 1                                     
   ~ We're about to put the DUMMY placeholder in the slot, so set the counter of used slots to 8

    self.slots[idx].key = DUMMY                        
   ~ Replace key at 13 with the DUMMY placeholder

    self.slots[idx].value = EMPTY                      
   ~ Replace value at 13 with EMPTY

After using the new lookdict() function, search function __getitem__ also gets very short.

Searching for
slots[*].key
slots[*].value
slots[*].hashCode
git
stash
-530

8861
cut
-f 1
-799

7437
ls
/
1382

4725
-27
42
-27
ed
36
1292

6553
cp
a.py b.py
1267

4267
uniq
-c
6468

5819
DUMMY
1280

5547
su
0
1472

2910
def lookdict(self, key):                               

    hash_code = hash(key)                              
   ~ Compute the hash code: 6468739353497665819

    idx = hash_code % len(self.slots)                  
   ~ Compute the starting slot index: 11 == 6468739353497665819 % 16

    while self.slots[idx].key is not EMPTY:            
   ~ [Try #2] Slot 12 is occupied, so check it

        if self.slots[idx].hash_code == hash_code and \
   ~ 6468739353497665819 == 6468739353497665819, so the slot might be occupied by the same key

           self.slots[idx].key == key:                 
   ~ "uniq" == "uniq", so the key is found

            return idx                                 
   ~ Return 12, the index of the current slot

                                                       

        idx = (idx + 1) % len(self.slots)              

                                                       

    raise KeyError()                                   

                                                       

def __getitem__(self, key):                            

    idx = self.lookdict(key)                           
   ~ Call self.lookdict("uniq")

                                                       

    return self.slots[idx].value                       
   ~ Return "-c", value from slot 12

So we now have a class that emulates the basic part of the dict interface. Before we move on to the next chapter, let's discuss a neat trick for inserting new items.

Recycling dummy keys.

Dummy keys are used as placeholders. The only purpose of a dummy slot is to prevent a probing algorithm from breaking. The algorithm will work as long as the "deleted" slot is occupied by something, be it a dummy placeholder or a normal item.

This means that while inserting an item, if we end up hitting a dummy slot, we can put the item in that slot (if the key does isn't already inserted).

So, we still need to do a full look up, but we will also save an index of the first dummy slot to target_idx (if we encounter it). If we find that a key already exists, we save the index to target_idx and break. If we find neither a dummy slot nor the key, then we insert it in the first empty slot - as we did before.

In the absence of dummy slots, the code works the same. So, even though we built the table with a simpler version of __setitem__, it would look exactly the same as if we built it applying this optimization.

Remember that we removed the key
?
Now, what happens after if we insert another item to the modified hash table:
key =

value =
slots[*].key
slots[*].value
slots[*].hashCode
git
stash
-530

8861
cut
-f 1
-799

7437
ls
/
1382

4725
-27
42
-27
ed
36
1292

6553
cp
a.py b.py
1267

4267
uniq
-c
6468

5819
recycling
499
-787

9349
su
0
1472

2910
def __setitem__(self, key, value):                      

    hash_code = hash(key)                               
   ~ Compute the hash code: -7876116100209519349

    idx = hash_code % len(self.slots)                   
   ~ Compute the starting slot index: 11 == -7876116100209519349 % 16

    target_idx = None                                   
   ~ Initialize target_idx - this is the index of the slot where we'll put the item

    while self.slots[idx].key is not EMPTY:             
   ~ After 4 collisions, an empty slot (at 15) is found: the collisions are successfully resolved

        if self.slots[idx].hash_code == hash_code and\  

           self.slots[idx].key == key:                  

            target_idx = idx                            

            break                                       

        if target_idx is None and \                     

           self.slots[idx].key is DUMMY:                

            target_idx = idx                            

        idx = (idx + 1) % len(self.slots)               

                                                        

    if target_idx is None:                              
   ~ target_idx is not None, and this means we already know where to put the item

        target_idx = idx                                

    if self.slots[target_idx].key is EMPTY:             
   ~ If we're putting the item in an empty slot (and we aren't)

        self.used += 1                                  

        self.fill += 1                                  

    elif self.slots[target_idx].key is DUMMY:           
   ~ If we're putting the item in a DUMMY slot (and we are)

        self.used += 1                                  
   ~ then we only need to increment self.used, which makes it 9

                                                        

    self.slots[target_idx] = Slot(hash_code, key, value)
   ~ Put the item in slot 13

    if self.fill * 3 >= len(self.slots) * 2:            
   ~ fill factor (56%) < 2/3: 9 * 3 (== 27) < 16 * 2 (== 32), so no need to run resize()

        self.resize()                                   

                                                        

After we inserted it, a DUMMY slot got recycled — as expected. However, when there is no DUMMY slot encountered, this version of __setitem__ would work exactly like the previous one. For example, if we instead tried to insert an item with a key "head", no dummy slot would get recycled, and the item would be inserted in an empty slot.