In previous articles we saw that python provides many different list options - you have the regular list which is implemented using arrays. You also have the collections.deque class which uses a doubly linked list. And there is the array module (and the superior numpy.array) that gives us the ability to embed the data in the array.

Unlike lists, dictionary implementations are all based on hash tables. However, Python does give us many different mapping types with different features, which we will explore in this article.

Sets

The most basic of the non-dict mapping type is the set. A set is like a dict that has only keys without values.

Why might you want to store only keys without values? Remember that the hash table implementation of dicts makes it very fast to look up whether a certain item is present in the data structure - you just take the hash() and go directly to that particular index. So you can use a set to keep track of whether some values are present or absent.

Take a look at this piece of code. It goes through a dataset and processes each line containing a file name and some content. It stores a list of file names that have been processed, and if a particular filename has already been processed previously then it skips that line.

def process(data):
    filenames = []
    for line in data:
        name, content = line.split()
        if name not in filenames:
            process(name, content)
            filenames.append(name)

When you have a large amount of data, then this code will not perform well. This is because to check if an item is in a list, we have to go through every element in the list one by one and check. So the line if file not in filename: can get very slow once it starts to become really big. Each and every time it is going through the entire list.

This is the situation that is ideal for a set. Here is the code using set

def process(data):
    filenames = set()
    for line in data:
        name, content = line.split()
        if name not in filenames:
            process(name, content)
            filenames.add(name)

The code is almost the same. But because we are using set, the check if file not in filenames:  is very fast even when the set becomes huge.

Apart from this, the set also provides mathematical set operations which are not available on a dictionary - you can find the union, intersection and difference between two sets for example.

Another common use case for a set is to remove duplication from a list. Just like a dictionary, an item can be present only once in a set. Therefore taking a list and converting it to a set will end up keeping every value only once.

OrderedDict

Generally speaking, a dictionary is unordered. Because it uses a hash table, the first item might be stored somewhere in the middle of the hash table, the second item could be at the start and so on. This made the dictionary unordered.

There are many use cases where having an ordered mapping is useful and Python provided the collections.OrderedDict class that also kept track of the order of insertion.

However, in recent versions of python, the implementation of the dict was changed to use a separate index hash table and a separate entries list. Check out the previous article for the implementation details of a python dictionary

A Deep Dive into Python Dictionaries
In the previous post we took a look into the different list data structures that python provides us. Today, we will do the same for dictionary. A dictionary is implemented using a hash table, so lets take a moment to understand how a hash table works.

With this new design, the index hash table is unordered, but the entries list preserves the order in which entries were added into the dictionary. This makes the normal dictionary ordered by default and iterating a dictionary in a loop will give us every entry in the order it was put in. This makes collections.OrderedDict mostly unnecessary in the newer versions of Python.

One thing that the OrderedDict does have is a method move_to_end() to move an entry to either end (first or last) of the order. This makes it very useful for certain cases like for example a cache where you want to track how recently an item was updated

from collections import OrderedDict

class RecentlyUpdatedCache(OrderedDict):
    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        # last=False will move the item to the start of the order
        self.move_to_end(key, last=False)
        
cache = RecentlyUpdatedCache()
cache['key1'] = 'value1'
cache['key2'] = 'value2'
cache['key3'] = 'value3'
cache['key2'] = 'value4'

# print the cache items in the order they were most recently updated
for item in cache.items():
    print(item)
    
# ('key2', 'value4'), ('key3', 'value3'), ('key1', 'value1')

defaultdict

The collections module also provides another dictionary variant called defaultdict.

When you access an item that doesn't exist in the dictionary, defaultdict will use a factory function to create an initial value and add it to the dictionary.

This is useful when the dictionary is going to be accumulating some value.

In the example below, we are going to count how many times each letter appears in a sentence. Normally, the first time we encounter a letter, we would need to add it to the dictionary. By using defaultdict(int) it will call the int() factory function (which returns 0) to set the initial value when the letter does not exist. This way, we avoid having a special case when the letter is added to the dictionary for the first time.

from collections import defaultdict

def counter(sentence):
    count = defaultdict(int)
    for letter in sentence:
        count[letter] = count[letter] + 1
    return count

Along the same lines, we can use defaultdict to build "multi-value" dictionaries where a key might have more than one value. Here, we use set as the factory function and each key will have a set of possible values (you can use a list instead if you want to preserve duplicate values).

from collections import defaultdict

query_string = "name=python&name=playful&type=website"
query = defaultdict(set)
for entry in query_string.split('&'):
    key, value = entry.split('=')
    query[key].add(value)

ChainMap

The collections module comes with a third type of dictionary variant called the ChainMap. As the name implies, it is used to chain mulpitle dictionaries together.

Consider a configuration system, where there may be a global configuration, a local configuration and a user-level configuration. When we need to look up a configuration, we first see it there is any user-level value for it. If not, then we check in the local configuration, and finally in the global configuration.

This kind of lookup is easy to do with ChainMap

from collections import ChainMap

userconf = {'key1': 'user'}
localconf = {'key1': 'local', 'key2': 'local'}
globalconf = {'key1': 'global', 'key2': 'global', 'key3': 'global'}

config = ChainMap(userconf, localconf, globalconf)

print(config['key1']) # user
print(config['key2']) # local
print(config['key3']) # global

Counter

One of the features of the set data type is that each item can only be present once. What if we need to know how many times each item is present? In computer science terminology, a data structure that can preserve the count of duplicated items in a set is known as a "Bag" data type, or a "Multiset" data type.

At first glance, it seems like Python does not provide an inbuilt bag / multiset data type but if you look closely, you will find that it is present as the collections.Counter class. Although the class does not use the term Bag or Multiset, it performs the same role.

Consider the calculation of prime factorisation of a number. The prime factorisation of 120 is the multiset {2, 2, 2, 3, 5}. Here we do not care about order, the multiset {2, 3, 5, 2, 2} is the same multiset. The duplication is important though, {2, 3, 5} is a different multiset. These requirements make both list and set unsuitable, but Counter fits the bill.

def factorise_once(num):
    for divisor in range(2, num+1):
        if num % divisor == 0:
            return divisor, num/divisor
            
def prime_factorisation(num):
    factors = Counter()
    while num != 1:
        factor, num = factorise_once(num)
        factors.update([factor])
    return factors
    
print(prime_factorisation(120)) # Counter({2: 3, 3: 1, 5: 1})

As we can see, Counter keeps track of the individual items, as well as how many times they are present.

In addition, Counter has a few more methods to make it more suitable for multiset operation. This includes features to add or subtract two Counter objects together, handling negative counts and so on. If you have been looking for a multiset data type in Python, then Counter is what you want.

Summary

In this article we took a look at the many different mapping data types that python provides apart from the regular dict. We looked at

  • set
  • collections.OrderedDict
  • collections.defaultdict
  • collections.ChainMap
  • collections.Counter

While set is a built in data type, all the others are in the collections module. As we can see, python comes with many different variations that we can take advantage of in the appropriate time in our code. We have seen in this article what features are provided by each of these types and when we may want to use each of them. Using these can literally make a lot of potentially complex code into simple one-liners.

Did you like this article?

If you liked this article, consider subscribing to this site. Subscribing is free.

Why subscribe? Here are three reasons:

  1. You will get every new article as an email in your inbox, so you never miss an article
  2. You will be able to comment on all the posts, ask questions, etc
  3. Once in a while, I will be posting conference talk slides, longer form articles (such as this one), and other content as subscriber-only