The dangers of mutability in Python
#TIL that when using the * operator on lists in #Python, it does _not_ create a copy of the elements in the list, but rather references to them 🤯. Be careful if you're creating a binary mask this way... #pythontricks pic.twitter.com/LKYPudAmmy— Guillermo Carrasco (@guillemch) December 13, 2021
Disclaimer: This whole article’s insights are based on the most common implementation of Python: CPython.
A few weeks ago, I stumbled upon this weird behaviour: I was trying to solve one of the Advent Of Code problems using a binary matrix. I realised that, when updating a specific position in the matrix, I was updating other positions I did not intend to. The issue is very easy to reproduce, as you can see in the tweet above.
I did not understand what was happening, and this led me into a deep rabbit whole of Python mutability, how specific operators work, and how Python manages memory and references to objects.
The Object Realm
Before digging deeper into my specific issue, it’s important to understand the common sentence “In Python, everything is
an object”. What do we mean with that? Other languages like C or Java have primitive data types such as char, int
or float. In those languages, the mentioned types hold the true value of the char, int or float they represent, in bytes.
So when you declare a variable
int a = 2 in C, the variable
a is actually holding the binary representation of the integer
When you create a new variable with the same value, new space in memory is reserved to hold the same binary representation
of the number
This is not the case in Python. When you declare an integer variable in Python, that variable has properties and methods you can access. For example:
In : x = 2337 In : x.real Out: 2337 In : x.bit_length() Out: 12
In fact, technically Python does not declare variables, but rather names that point to objects. Those objects are called PyObjects, and they are the Python binding to the C representation of the data they are holding.
A PyObject is a structure that contains three information pieces:
- the data type is referencing – int, float, …
- the value
- a reference count
This illustration may help you visualise what a PyObject is:
Image extracted from The Real Python’s course on Pointers and Objects in Python
Generally when creating a new variable, a new PyObject is created and its reference count starts at 1. If more variables
reference the same object, the reference count is incremented. When no variables reference that object anymore, the object
is destroyed from memory. This process is automatic in Python and it’s called Garbage Collection. A way of knowing the
memory address of an object in Python is by using the built-in function
In : print(hex(id(a))) Out: 0x10d0fb7d0 In : b = a In : print(hex(id(b))) Out: 0x10d0fb7d0
(memory addresses are generally represented in hexadecimal)
As you can see, when creating a variable by using the assign operator
=, you are just creating another reference to the
same memory space (
0x10d0fb7d0 in this specific case, it will be different if you replicate this example).
When creating new variables/objects, new addresses are used:
In : c = 123 In : print(hex(id(c))) Out: 0x10d0fc1f0
Some special cases – Interning
While this is not really relevant for our particular case, I learned something very interesting while reading about memory management in Python. Look carefully at these lines of code:
In : a = 42 In : b = 42 In : c = 12345678 In : d = 12345678 In : print(hex(id(a))) Out: 0x10d0fb7d0 In : print(hex(id(b))) Out: 0x10d0fb7d0 In : print(hex(id(c))) Out: 0x7f998807eb30 In : print(hex(id(d))) Out: 0x7f998807ed10
If you didn’t notice, the addresses of
d are different, but the addresses of
b are the same?!
If every time we create a new variable, a new object is created… shouldn’t all 4 addresses be the different?
It turns out that, in order to save memory and gain efficiency, Python pre-loads in memory some of the most commonly used values.
Those are, apparently, integers in the range
[-5, 256], some strings, the
None object, and more.
This process is called interning, and you can read more about it in this blog post.
Be careful with your mutable objects
We are getting close to being able to explain the behaviour of my binary mask. But before that, one more concept. If you’ve worked with Python or any other programming language long enough, you are probably familiar with the concept of mutability. Simply put, an object is mutable if you can change its value, and immutable if you can’t.
Some examples of mutable objects in Python are lists, dictionaries or sets. Other objects like integers, strings or tuples are immutable.
Have you ever seen something like this?
In : l1 = [1, 2, 3] In : l2 = l1 In : l1 Out: [1, 2, 3] In : l2 Out: [1, 2, 3] In : l2.append(4) In : l1 Out: [1, 2, 3, 4]
l2, we are really only pointing to the same memory address as
l1. Since a list is a mutable object,
l2 is effectively the same as appending on
l1, and thus we are unintentionally modifying
If we would like to avoid this behaviour, we need to create two separate list objects, like this:
In : l1 = [1, 2, 3] In : l2 = [1, 2, 3] In : l1 Out: [1, 2, 3] In : l2 Out: [1, 2, 3] In : l2.append(4) In : l1 Out: [1, 2, 3] In : l2 Out: [1, 2, 3, 4]
(you can as well use the
copy() operator on
l1 to create an identical list)
This is the expected behaviour. I’ve seen a lot of novice Python developers stumble upon this issue. Default mutable arguments are another common source of nail-biting and hair pulling.
* operator for lists in Python
And finally, we are able to explain the behaviour of the binary matrix. The reason why I entered this rabbit whole of memory management and mutability…
It turns out that, contrary to what I thought, the repetition operator for lists (
*) in Python does not create new objects,
but rather creates references to the original one. In my case, I wanted to create a 3x5x5 matrix, but let’s simplify
for the sake of an example, and say I want to create a 3x3 boolean matrix. I would do:
In : l1 = [[False, False, False]]*3 In : l1 Out: [[False, False, False], [False, False, False], [False, False, False]]
and then I want to modify just the first position of the first row:
In : l1 = True In : l1 Out: [[True, False, False], [True, False, False], [True, False, False]]
(note that the first position of *all* the rows has been modified)
By now, you should quickly see what’s happening: The
* operator is not creating new lists, but rather referencing the
same one three times. So when updating the first position of the first row, I am really updating all other rows,
since they are the exact same object in memory.
How would you achieve the desired behaviour? Well, there are many ways. You could for example use the
which creates a true copy of the object rather than a reference:
In : row = [False, False, False] In : l1 = [row.copy(), row.copy(), row.copy()] In : l1 Out: [[False, False, False], [False, False, False], [False, False, False]] In : l1 = True In : l1 Out: [[True, False, False], [False, False, False], [False, False, False]]
or programmatically, use a list comprehension like this:
In : l1 = [[False, False, False] for _ in range(3)] In : l1 Out: [[False, False, False], [False, False, False], [False, False, False]] In : l1 = True In : l1 Out: [[True, False, False], [False, False, False], [False, False, False]]
However, this is rather cumbersome, and I would recommend going for numpy arrays instead.
Python memory management, while in most cases is very convenient and eases our lives greatly, can sometimes be the source of confusion and weird behaviour. It is thus very important for us to understand how objects and variables are created, if the objects we are using are mutable or not, what and how are we passing as function arguments and so on.
I also really recommend that when you see some strange behaviour, something you don’t understand, really try to get to the root of it. I learned a lot about Python memory management while investigating this issue, as I hope you did while reading this summary.
Have a nice one! And feel free to ask me any question in the comments
In automation, we trust.