Now it is (finally!) time to explore how the dict works in Python!
This explanation is about the dict in CPython (the most popular, "default" implementation of Python). Most of CPython is implemented in C, and the dict implementation is written in C too. This explanation reimplements everything in Python, because its main focus is the algorithms. Nevertheless, the state of dicts on this page should match the state of dicts inside CPython 3.2: the hash() codes of keys should match, the probing results should match, and as a result the slots match as well.
In other words, this is an actual reimplementation of Python dict in Python that mirrors all of the aspects of the underlying data structure. And the visualization might as well be the visualization of the state produced from the C code.
Why CPython 3.2? It's a good starting point: later versions expand the implementation rather than replace it. So this chapter focuses on CPython 3.2's dict, and future chapters will discuss changes in the later versions (3.3 - 3.7).
The central difference between the almost-Python-dict from the third chapter and real Python dicts is the probing algorithm. This probing algorithm has stayed the same in all versions (at least up until 3.7, which is the latest version at the time of writing).
The problem with simple linear probing is that it doesn't mix up the keys well for many real-world data patterns. Real-world data tends to be regular, and a pattern like 16, 0, 1, 2, 3, 4... would lead to many collisions.
Linear probing is prone to clustering: once you get a "clump" of keys, the clump tends to grow, which causes more collisions, which cause the clump to grow further, which causes even more collisions. This is detrimental to performance.
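To see the clustering effect concretely, here is a toy linear-probing insert (a sketch for illustration, not the chapter's actual code). Small integers hash to themselves in Python, so 16 lands in slot 0 of an 8-slot table, and every following key from the pattern above collides with the growing clump:

```python
def linear_probe_insert(slots, key):
    """Insert `key` into an open-addressing table using linear probing.
    Returns (slot index, number of probes). Assumes free slots exist."""
    size = len(slots)
    idx = hash(key) % size  # hash(x) == x for small integers
    probes = 1
    while slots[idx] is not None:
        idx = (idx + 1) % size
        probes += 1
    slots[idx] = key
    return idx, probes

slots = [None] * 8
probe_counts = [linear_probe_insert(slots, key)[1] for key in [16, 0, 1, 2, 3, 4]]
# 16 takes slot 0; after that, every key collides with the clump
# and lands one slot past its "home" slot.
```

Every insert after the first one takes two probes instead of one, and the clump only keeps growing as more keys arrive.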
One way to address this problem is by using a better hash function, in particular when it comes to integers (hash(x) == x for small integers in Python). Another way is to use a different probing algorithm - and this is what the CPython developers decided to do.
There are two requirements for a probing algorithm: it must be deterministic, and it must (eventually) hit every slot.
Let's take a look at linear probing first. If we repeatedly run its recurrence (idx = (idx + 1) % size) until we end up hitting a slot twice, we get the following picture:
It does not matter which slot we start from; the picture will look exactly the same. Linear probing is very regular and predictable. Now, let's change the recurrence to idx = (5 * idx + 1) % size (note the 5):
idx = (5 * idx + 1) % size is still guaranteed to eventually hit every possible slot if size is a power of two (the proof of this fact is outside the scope of this page). Also, the algorithm is obviously deterministic. So, both requirements for a probing algorithm are satisfied. This algorithm scrambles the order of indexes a bit. It is certainly less regular, but it is still prone to clustering.
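We can check the full-cycle property of both recurrences with a quick sketch (the helper name here is made up for illustration):

```python
def cycle(start, size, step):
    """Follow a probing recurrence from `start` until an index repeats."""
    seen, idx = [], start
    while idx not in seen:
        seen.append(idx)
        idx = step(idx) % size
    return seen

size = 8
linear = cycle(0, size, lambda i: i + 1)         # perfectly regular order
scrambled = cycle(0, size, lambda i: 5 * i + 1)  # less regular order
# Both recurrences visit every slot exactly once before repeating.
assert set(linear) == set(scrambled) == set(range(size))
```

Starting from slot 0, the scrambled recurrence visits the slots in the order 0, 1, 6, 7, 4, 5, 2, 3 - every slot is hit, just not in a straight line.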
The probing algorithm in CPython takes this recurrence and adds a ton of scrambling to it: idx = ((5 * idx) + 1 + perturb) % size. What is this perturb weirdness though? In the C code, it is initialized as basically this: size_t perturb = hash_code. Then, in every iteration, it is right-shifted by 5 bits (perturb >>= 5).
This probing algorithm uses some "randomness" in the form of bits from the hash code - but the probing is still fully deterministic, because hash functions are by their nature deterministic. perturb eventually reaches zero, and the recurrence becomes idx = ((5 * idx) + 1) % size, which is guaranteed to hit every slot (eventually).
We can reimplement this algorithm in pure Python. However, Python has no unsigned (logical) bit shifts, and there is also no built-in way to convert a 64-bit signed integer to a 64-bit unsigned integer. The solution is to do the conversion with the following one-liner: 2**64 + hash_code if hash_code < 0 else hash_code, and then use regular bit shifts (i.e. >> or >>=).
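Putting the pieces together, the full probing sequence can be sketched as a pure-Python generator (a sketch under the assumptions above; the actual chapter code may organize this differently):

```python
def probe_indices(hash_code, size):
    """Yield the CPython 3.2 probing sequence for a power-of-two `size`."""
    # Reinterpret a signed 64-bit hash as unsigned
    perturb = 2**64 + hash_code if hash_code < 0 else hash_code
    idx = perturb % size
    while True:
        yield idx
        idx = (5 * idx + 1 + perturb) % size
        perturb >>= 5
```

Once perturb shifts down to zero (after at most 13 shifts for a 64-bit hash), the sequence degenerates into the plain 5 * idx + 1 recurrence, so it is guaranteed to cover every slot eventually.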
Arrows are color-coded: green means perturb != 0 and blue means perturb == 0.
Adding noise (from perturb) makes things slower when a hash table is full, so the worst-case scenario becomes even worse (compared to (5 * idx) + 1). However, in practice we keep dicts sparse by ensuring the load factor never goes above 2/3, so naturally there are many chances to hit an empty slot.
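The 2/3 threshold can be checked without floating-point arithmetic. Here is a hypothetical helper illustrating the check (CPython actually counts fill = used + DUMMY slots, not just the number of live items):

```python
def needs_resize(fill, size):
    """True when more than two thirds of the slots are taken.
    `fill` counts occupied + DUMMY slots, mirroring CPython's check."""
    return fill * 3 >= size * 2
```

For an 8-slot table, a sixth occupied slot (6 * 3 = 18 >= 16) is what triggers the resize.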
If you are interested in more subtleties and technical details, you can check Objects/dictnotes.txt and the comments near the top of Objects/dictobject.c.
There are a couple more changes to almost-python-dict, but they are small.
When you type a dict literal in your code, for example:
Python actually knows the number of key-value pairs and tries to guess the optimal hash table size, possibly avoiding some or all resizes. In most cases, the resulting hash table ends up being the same size or smaller. However, in some cases the resulting hash table may actually be larger, if the literal contains a lot of repeated keys (e.g. {1: 1, 1: 2, 1: 3, 1: 4, 1: 5, 1: 6, 1: 7, 1: 8, 1: 9}).
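The repeated-key behavior is easy to verify: later values in a literal simply overwrite earlier ones, so the final dict holds a single pair even though Python initially sized the table for nine:

```python
d = {1: 1, 1: 2, 1: 3, 1: 4, 1: 5, 1: 6, 1: 7, 1: 8, 1: 9}
# Only one key survives, and the last value wins.
assert len(d) == 1 and d[1] == 9
```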
So in __init__ we use the familiar find_closest_size() function to find the power of two greater than the number of items. This guarantees that there will be no more than one resize.
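find_closest_size() can be sketched like this (an assumed reconstruction based on the description above; the chapter-3 version may differ in details):

```python
def find_closest_size(minused):
    """Smallest power of two strictly greater than `minused`,
    never smaller than the minimum table size of 8."""
    size = 8
    while size <= minused:
        size *= 2
    return size
```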
The code for the resize is a bit different, because a resize operation can increase the size of the hash table by as much as 4x. __setitem__ is the same, except for the probing algorithm, and it also tries to recycle slots containing tombstones (DUMMY placeholders), just like the insert operation in the previous chapter did. Although in the case of __init__ no recycling is going to happen, as we start with an empty table containing no dummy slots.
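The tombstone-recycling logic in __setitem__ can be illustrated with a simplified sketch (linear probing is used here for brevity, and DUMMY is a stand-in name; the real code uses the perturb-based probe):

```python
DUMMY = "<dummy>"  # stand-in tombstone placeholder

def set_item(slots, key, value):
    """Insert or overwrite (key, value), reusing the first DUMMY slot seen."""
    size = len(slots)
    idx = hash(key) % size
    recycle = None  # first tombstone encountered along the probe chain
    while True:
        slot = slots[idx]
        if slot is None:
            # Key is absent: prefer the recycled tombstone slot if we saw one
            slots[recycle if recycle is not None else idx] = (key, value)
            return
        if slot == DUMMY:
            if recycle is None:
                recycle = idx
        elif slot[0] == key:
            slots[idx] = (key, value)  # key already present: overwrite in place
            return
        idx = (idx + 1) % size
```

The important detail is that probing does not stop at a DUMMY slot: the key might still be further along the chain, so the tombstone is only remembered and reused once we are sure the key is absent.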
How much difference do the probing algorithm and the other changes make? How different is the resulting dict compared to the almost-python-dict from chapter 3? Here are the two versions side by side:
The code for removing an item stays mostly the same. Again, a different probing algorithm is used, but conceptually it is the exact same algorithm: try to find a key, and if it is there, overwrite the item with a DUMMY placeholder.
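Removal can be sketched the same way (again with linear probing for brevity; DUMMY is a stand-in name):

```python
DUMMY = "<dummy>"  # stand-in tombstone placeholder

def del_item(slots, key):
    """Find `key` and replace its slot with DUMMY, keeping probe chains intact."""
    size = len(slots)
    idx = hash(key) % size
    while slots[idx] is not None:
        slot = slots[idx]
        if slot != DUMMY and slot[0] == key:
            slots[idx] = DUMMY  # tombstone, not None: later lookups keep probing
            return
        idx = (idx + 1) % size
    raise KeyError(key)
```

Writing DUMMY instead of None is what keeps lookups for other keys working: a None would prematurely terminate any probe chain that passed through this slot.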
Resizes are expensive, so it is better to have fewer of them, and Python sometimes quadruples the size of a table. Quadrupling is reserved for smaller hash tables: when the number of items is below 50,000, Python 3.2 quadruples the size; above that, it aims to double it. Although, just like in the previous chapter, a resize operation can shrink the table or keep the size the same (while dropping the unnecessary dummy slots) if too many slots are wasted by DUMMY placeholders.
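The growth rule can be sketched as follows (a hypothetical helper mirroring the rule described above, with the 50,000-item cutoff from Python 3.2):

```python
def new_table_size(used):
    """Pick a new power-of-two table size: quadruple small dicts,
    double dicts with more than 50,000 items (Python 3.2's rule)."""
    minused = used * (2 if used > 50000 else 4)
    size = 8  # minimum table size
    while size <= minused:
        size *= 2
    return size
```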
While building the dict from the original pairs, no resize operation was run, because Python correctly guessed the number of slots needed. To see a resize in action, let's insert some additional pairs: [("tee", 1), ("wc", 2)].
After running __setitem__ multiple times for these pairs, we can take a look at the resize in depth:
This is the last chapter currently available. Developing and debugging the engine for this explorable explanation took way, way longer than I expected, so I am releasing the first few chapters before the following ones.
Did interactive visualizations help you understand something? Did you discover something interesting because of the interactivity and animations? Did it help you build an intuition for hash tables? I am really curious about this and I'd love to hear from you, so drop me a message. And if you find a bug, please open a new issue on GitHub.
Is the dict in Python 3.7 similar to the dict in Python 3.2? I'd say yes. It's still an open addressing hash table with the weird perturb-based probing. The hash function for strings changed a couple of times, some memory sharing for keys was introduced, and dicts became ordered.
3.3 introduced changes to the internal structure of dicts (the "Key-Sharing Dictionary") that improved memory consumption in certain cases. 3.3 also randomizes the seed for hash functions, so that hash() return values are less predictable from the outside. This is a security-related change; object hashes are still stable within the same "run" of the Python interpreter.
In 3.4, the hash function for strings was changed to a more secure algorithm which is more resistant to hash collision attacks.
In 3.6, the internal structure of dicts became more compact, and dicts became "ordered".