In this chapter, we'll make a class that supports the basic interface of a Python dict. The class will keep track of the counters necessary for computing the load factor and auto-resize the hash table. On the inside, it'll work differently from Python dict, and we'll discuss these differences in the next chapter.
The almost-Python-dict class needs to support values. To handle values we could add another list (in addition to hash_codes
and keys
). This would totally work. Another alternative is to bundle hash_code
, key
, value
corresponding to each slot in a single object. To do this, we'll need to create a class:
class Slot(object):
def __init__(self, hash_code=EMPTY, key=EMPTY, value=EMPTY):
self.hash_code = hash_code
self.key = key
self.value = value
How do we initialize an empty hash table? In previous chapters, we based the initial size of hash tables on the original list. Since we now know how to resize tables, we can start with an empty table and grow it. Hash tables inside Python dictionaries are size 8 when they are empty, so let's use 8 as the initial size.
Python hash table sizes are powers of 2, so we will also use powers of 2. even though nothing prevents us from using "non-round" values. The primary reason Python uses "round" powers of 2 is efficiency: computing % 2**n
can be implemented using bit operations (& (1 << n)
).
Here is the overview of the interface of the class:
class AlmostDict(object):
def __init__(self, pairs=None):
self.slots = [Slot() for _ in range(8)]
self.fill = 0
self.used = 0
# Insert all initial pairs
if pairs:
for k, v in pairs:
# This is syntactic sugar for self.__setitem__(k, v)
self[k] = v
def __setitem__(self, key, value):
# Allows us to set a value in a dict-like fashion
# d = Dict()
# d[1] = 2 and d.__setitem__(1, 2) gets called
<implementation goes here>
def __getitem__(self, key):
# Allows us to get a value from a dict, for example:
# d = Dict()
# d[1] = 2
# d[1] calls d.__getitem__(1) and returns 2
<implementation goes here>
def __delitem__(self, key):
# Allows us to use "del" in a dict-like fashion, for example:
# d = Dict()
# d[1] = 2
# del d[1] calls d.__delitem__(1)
# d[1] or d.__getitem__(1) raises KeyError now
<implementation goes here>
__setitem__
, __getitem__
and __delitem__
are special methods, often called magic methods. They are called this way because you don't have to invoke them directly: Python does it for you behind the scenes, when you use square brackets (e.g. d[1]
). Other than that, they are normal methods.
Some methods are going to update fill
(the number of non-empty slots including dummy slots) and used
(the number of normal items in the dictionary, same as the dictionary size). Fill factor is the ratio between fill
and len(slots)
. When it reaches 2/3, the table gets resized. The new size is based on the number of normal items in the table, which is self.used
, and typically it is 2x of original size.
Let's take a look at the code, starting with the __init__
method. We're creating the dict from the following pairs:
The code in __init__
also assumes that the dict contents are passed as a list of pairs (not an actual dict — which we are reimplementing).
Compared to the previous chapter, the differences in building hash tables are:
__setitem__
method;fill
and used
counters are incremented if necessary;resize()
gets called after inserting an element if the fill factor gets too high.When resizing a hash table, how do we find a new optimal size? We find a valid table size greater than 2 * self.used
. A valid hash table size is a power of two which is at least 8.
def find_closest_size(self, minused):
new_size = 8
while new_size <= minused:
new_size *= 2
return new_size
The code only uses self.used
(the number of "useful" non-empty slots). It does not depend on self.fill
(the total number of non-empty slots) in any way. The idea is to double the size of the table if there are very few dummy slots and shrink it if there are too many dummy slots (so that the memory is saved).
find_closest_size(2 * self.used)
== find_closest_size(2 *
)
== find_closest_size(22)
== 32
While the original items were being inserted, a resize happened. Let's look at it in depth:
In the previous chapter, the code for removing and the code for searching were very similar, because, to remove an element, we need to find it first. We can reorganize the code so that the removing and searching functions share much of the same code. The common function's name will be lookdict()
.
Other than that, removing a key will look pretty much the same as in the previous chapter. __delitem__
magic method is now used for realism so that we can do del almost_dict[42]
. And we decrement the self.used
counter if we end up finding the element and removing it.
After using the new lookdict()
function, search function __getitem__
also gets very short.
So we now have a class that emulates the basic part of the dict interface. Before we move on to the next chapter, let's discuss a neat trick for inserting new items.
Dummy keys are used as placeholders. The only purpose of a dummy slot is to prevent a probing algorithm from breaking. The algorithm will work as long as the "deleted" slot is occupied by something, be it a dummy placeholder or a normal item.
This means that while inserting an item, if we end up hitting a dummy slot, we can put the item in that slot (if the key does isn't already inserted).
So, we still need to do a full look up, but we will also save an index of the first dummy slot to target_idx
(if we encounter it). If we find that a key already exists, we save the index to target_idx
and break. If we find neither a dummy slot nor the key, then we insert it in the first empty slot - as we did before.
In the absence of dummy slots, the code works the same. So, even though we built the table with a simpler version of __setitem__
, it would look exactly the same as if we built it applying this optimization.
After we inserted it, a DUMMY
slot got recycled — as expected. However, when there is no DUMMY
slot encountered, this version of __setitem__
would work exactly like the previous one. For example, if we instead tried to insert an item with a key "head"
, no dummy slot would get recycled, and the item would be inserted in an empty slot.