Over the last few articles we have been covering all the different fundamental data structures that are provided by Python. We've covered the various types of lists and mapping types that are available. In this final article of the series, we will take a tour of some of the remaining types that are provided by Python.

Immutable Data

Some of the data types in python are designed to be immutable. That is, once the data is set, it cannot be changed. If you want to change the value, the only way is to create a new object with the new value.

Integers and Strings are the most common of the immutable data types. When you do an operation on an int or string, it does not change the existing object, but creates a new object with the output of the operation.

It may seem wasteful to create a new object every time a value changes, but there are some advantages too. For one, objects can be cached and reused when you need the same value. There is no risk that the object you cached will get its value changed later on in the program execution.

💡
Interning is a process where python caches certain frequently used integers (in the range -5 to 256) and certain strings. So everytime you do x + 1 python does not create a new integer object for the number 1. It just picks up the existing number object from the interned table and uses it. If the answer also falls in the interned table, then again python will pick up the exising object and return it.

Apart from integers and strings, tuples are another immutable data type. Once you set the values in a tuple, you can't change it. However it is important to note that you can always put mutable types like list and dict inside a tuple. And while you can't change the tuple itself, you can still mutate the data that is in the tuple.

Being immutable, the size of a tuple can never change, and this allows python to do another optimisation. Python preallocates object memory for different size tuple objects and keeps it available. When you create a tuple, python checks if it there is any free preallocated object and just reuses that. When the tuple is deleted, python doesn't free up the memory. It just adds it back to the preallocated list to be used again later.

a = (1, 2)
print(id(a)) # will print some memory address
del a
b = ("hi", "bye")
print(id(a)) # same memory address that a had previously

In the example above, we create a size-2 tuple a and then delete it. When we create another size-2 tuple b, python reuses the object that was created for a and we see that b occupies the same memory address that a was using.

Due to this optimisation, it is much faster to create and destroy a tuple object compared to a list.

# don't do this
if a in ['a', 'b', 'c']:
    do_something()
    
# prefer this
if a in ('a', 'b', 'c'):
    do_something()

namedtuple

Tuples are one of the most versatile python data types. You can use them as a lightweight way to represent objects like person = ('Anjali', 35) as well as grouping together few fields temporarily for a few lines so that they can be processed together.

They do have one drawback: tuple code can easily become very unreadable. It is easy to lose track of which index belongs to which field and you end up with code like the below

def process_item(item, extension):
    data = {}
    if extension in item[0]:
        for pair in item[1].split('&'):
            key, value = pair.split('=')
            data[key] = value
    return data
    
item = ('file.log', 'key=value&key1=value1')
process_item(item, 'log')

To solve this problem, Python provides us the namedtuple class in the collections module. namedtuple allows us to give names to each of the fields as well as the tuple itself. We can then access each of the fields using the field names which makes the code much more readable. Here is the same code written using namedtuple:

from collections import namedtuple

LogData = namedtuple('LogData', ('filename', 'content'))

def process_item(item, extension):
    data = {}
    if extension in item.filename:
        for pair in item.content.split('&'):
            key, value = pair.split('=')
            data[key] = value
    return data

item = LogData('file.log', 'key=value&key1=value1')
process_item(item, 'log')

By using namedtuple we get readable, low-overhead, immutable data objects without having to create a class. In Python you only need to reach for a class when you want to associate custom methods to your data.

Dataclasses

When you need more than just a namedtuple, python gives us another option in data classes. Data classes make it easy to create mutable or immutable value objects.

💡
Value objects are those that are defined by their values. For example, if we have a Point class with x and y coordinate and a = Point(1, 2) and b = Point(1, 2) then both the objects represent the same point. On the other hand, Entity objects have a separate identity from their values. Consider a Person class with a name and date_of_birth: two people could share the same name and date of birth but they should still be considered different objects. For entity objects, just because the values are the same does not mean the objects represent the same thing.

Creating a data class is as simple as defining the fields and using the @dataclass decorator on top

@dataclass
class Point:
    x: int
    y: int

a = Point(1, 2)
b = Point(x=1, y=2)
a == b # True

As you can see above, python automatically creates the __init__ method which takes the values as positional or keyword arguments. Python also defines a custom equality __eq__ which compares objects by value. Python also defines a helpful __str__ representation of the object.

By default these dataclasses are mutable. You can specify the frozen=True option on the decorator to make them immutable. You can then add methods to the class to make it easier to change the values.

@dataclass(frozen=True)
class Point:
    x: int
    y: int
    
    def add_x(self, delta_x):
        return Point(self.x + delta_x, self.y)
        
    def add_y(self, delta_y):
        return Point(self.x, self.y + delta_y)
        
a = Point(1, 2)
b = Point(x=1, y=2)
hash(a) == hash(b) # True
a.add_x(3) # Point(x=4, y=2)

When you make a frozen dataclass like above, python will also add in a __hash__ calculation based on the values. This means you can now use these objects as keys in dictionaries or add them to sets, both of which require the object to have a consistent hash() value.

Sorted Lists

In the article on lists, we mentioned that finding items in a list is slow: you need to go through every item one by one to check if an element is present or absent. We can make searching for items much faster if we keep the list sorted at all times. Now python does not directly give us a datatype for sorted lists, but it does give us the bisect module which makes it much easier for us to maintain a sorted list.

As the name implies, the bisect module implements the bisection algorithm, also called the Binary Search algorithm. The module takes an ordinary list that is sorted and has functions for inserting new data while maintaining the sorted order.

import bisect

l = [2, 13, 79, 151, 186, 212, 279]
bisect.insort(l, 100)
print(l) # [2, 13, 79, 100, 151, 186, 212, 279]

The functions also take a key parameter in case we are sorting based on a particular field or calculation. In the code below we have a list of (name, year) tuples that are sorted by year. We use the key parameter to tell bisect how to calculate the sort so that the new item is inserted in the proper place.

import bisect

l = [('Anjali', 1975), ('Anand', 1979), ('Aparna', 1985)]
bisect.insort(l, ('Varsha', 1975), key=lambda person: person[1])
print(l)
# [('Anjali', 1975), ('Varsha', 1975), ('Anand', 1979), ('Aparna', 1985)]

You will notice that Anjali and Varsha both have 1975 as the year, and in that case Varsha is inserted after Anjali. There is another function bisect.insort_left which will insert to the left of the existing items if they evaluate equal in the sort order.

The bisect module also provides two functions bisect and bisect_left for finding the index of the insertion point, without actually doing the insertion. Again they differ if they return the left side or right side when items are equal in the sort order. The bisect module does not provide a direct method to search items from the list, but we easily write our own as shown below

import bisect

def index(lst, item_key, key=None):
    """Locate the leftmost value exactly equal to the given key"""
    i = bisect.bisect_left(lst, item_key, key=key)
    if i != len(lst) and key(lst[i]) == item_key:
        return i
    raise ValueError
    
idx = index(l, 1979, key=lambda person: person[1]))
print(l[idx]) # ('Anand', 1979)

Heaps

A Priority Queue is like a sorted list, except we are not interested in sorting the entire list but we want to be able to quickly access the smallest element in the list. And when we remove that item from the list, then we want to access the new smallest element. We would go down the list processing them in their sort order one by one and removing them from the queue as we go along.

Generally, the data structure used for this kind of application is the heap, and python provides the heapq module that implements the heap algorithms.

💡
A heap is a tree data structure where the first item is the smallest (called min-heap) or the largest (max-heap) value in the dataset. Python's heapq module implements a min-heap where the data is arranged in such a way that for every node in the tree, it's value is smaller that all the values in its children.

Although heap algorithms are usually run on binary trees, python implements everything on top of a regular python list data structure. This is because of the benefits we saw before - fast random access to elements and the ability to get the entire data into the CPU cache.

Here is the heapq module in action:

import heapq

data = [13, 101, 58, 0, 7, -9, 12]
heapq.heapify(data)
print(data) # [-9, 0, 12, 101, 7, 58, 13]

The heapify function takes the data in the list and rearranges it in-place to form a heap. As we can see, it is not fully sorted, but the minimum value is in the first spot.

To remove the minimum item from the heap, we use heappop. This takes out the first item and rearranges everything else to maintain the heap conditions. Similarly, we can add an item to the heap using heappush.

import heapq

data = [-9, 0, 12, 101, 7, 58, 13]
heapq.heappop(data) # -9
print(data) # [0, 7, 12, 101, 13, 58]

As we discussed above, a common use for a heap like this is when you want to process items based on their priority. In this case, each item will be a tuple of the form (priority, value) with a smaller priority number indicating the item should be processed earlier. Here is an example from the python docs:

import heapq

h = []
heapq.heappush(h, (5, 'write code'))
heapq.heappush(h, (7, 'release product'))
heapq.heappush(h, (1, 'write spec'))
heapq.heappush(h, (3, 'create tests'))

heapq.heappop(h) # (1, 'write spec')

Another use case is when you want to process items in sorted order, but you don't want to sort the whole list at once. Maybe you want to calculate a top-5 and don't care about the order of the items beyond that. heapq is perfect for this kind of application. In fact, the module comes with two functions nlargest and nsmallest to quickly calculate these values.

Summary

Over the articles in this series, we have done a deep dive into the huge variety of data structures and algorithms that are provided out of the box by Python. From three different types of lists with different characteristics, many variations of dictionary types, immutable data types and implementations of bisection algorithms and heaps.

So the next time you want a data structure, keep in mind the different options provided by Python so you can use the right one for the task. 🎉

Did you like this article?

If you liked this article, consider subscribing to this site. Subscribing is free.

Why subscribe? Here are three reasons:

  1. You will get every new article as an email in your inbox, so you never miss an article
  2. You will be able to comment on all the posts, ask questions, etc
  3. Once in a while, I will be posting conference talk slides, longer form articles (such as this one), and other content as subscriber-only