The collections module

Where we look at two data structures - defaultdict and Counter - supplied in the Python standard library.

In the previous solution to the word count problem, we used the get method of the vanilla dict class to supply a default value, which kept conditional checks and error handling from cluttering the code.

But who says we should only use the vanilla dict class? Python comes with a rich standard library, part of its "batteries included" philosophy. One module in the standard library is collections, which contains quite a few interesting data structures. In this post, we will take a look at two of them: defaultdict and Counter.


defaultdict allows us to create a dictionary which has a default value. If the key exists, the dict returns the corresponding value as usual. However, if the key does not exist, it calls a factory function (with no arguments) to create the default value and stores that value in the dictionary under that key.

Here is an example of defaultdict in action:

from collections import defaultdict

def count_words(line):
    words = line.split()
    counts = defaultdict(int)
    for word in words:
        counts[word] += 1
    return counts

In line 5, we create the defaultdict, passing in int as the factory function. If we index a key which doesn't exist, the defaultdict calls int() (which returns 0) and stores the result as the value for that key.

In line 7, we use counts[word] as normal. If the key exists, its value is incremented. If it doesn't exist, int() is called, zero is stored as the value for the key, and then that value is incremented.

How does defaultdict differ from using the get method with a default value? The get method does not modify the dict in any way: if the key is missing, it just returns the default value and leaves the dict unchanged. defaultdict, on the other hand, stores the default value in the dict under the missing key, so if you check that key later, it will be present with that value.
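To make that difference concrete, here is a minimal sketch comparing the two approaches. The key names "spam" and "eggs" and the variable names are made up for illustration.

```python
from collections import defaultdict

# A plain dict: get returns the default but stores nothing.
plain = {}
value = plain.get("spam", 0)
print(value)            # 0
print("spam" in plain)  # False

# A defaultdict: indexing a missing key calls int() and stores the result.
dd = defaultdict(int)
value_dd = dd["spam"]
print(value_dd)         # 0
print("spam" in dd)     # True

# The get method of a defaultdict still behaves like dict.get:
# it does not call the factory and does not insert the key.
dd.get("eggs", 0)
print("eggs" in dd)     # False
```

One practical consequence is that merely reading a missing key from a defaultdict with square brackets makes it grow, which is worth keeping in mind if you later iterate over its keys.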


Would you believe that our word count problem can be solved in one line?

Yes! This is because the standard library already has built-in functionality for counting things. The data structure we want is Counter, and it is part of the collections module. It does exactly what we have been doing manually so far: count the number of times an item appears 😄

As a bonus, Counter can also return the items sorted by count (through its most_common method), in case you want to know the top 5 most common words or anything like that.

In general, Counter implements a multiset data structure. Everything that can be done with a multiset is a topic for another day. For now, here is the one-line solution to the word count problem:

from collections import Counter

def count_words(line):
    return Counter(line.split())
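As a quick illustration of the sorted-by-count behaviour mentioned earlier, here is a small sketch that calls count_words on a made-up sentence and asks for the two most common words with Counter's most_common method. The sample sentence is invented for this example.

```python
from collections import Counter

def count_words(line):
    return Counter(line.split())

# Sample input invented for illustration.
counts = count_words("the cat and the hat and the bat")

# most_common(n) returns the n highest counts as (item, count) pairs.
top_two = counts.most_common(2)
print(top_two)         # [('the', 3), ('and', 2)]

# Like defaultdict(int), a Counter reports 0 for items it has not seen,
# but it does so without inserting the key.
print(counts["dog"])   # 0
print("dog" in counts) # False
```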

Often we find that the problem we want to solve is already solved within Python's vast standard library. That's the beauty of Python 🎉

In the final article of this series, we will look at implementing the bonus functionality.