The collections module

In the previous solution to the word count problem, we used the get method of the vanilla dict class to supply a default value, which kept the code free of conditional checks and error handling.

But who says we should only use the vanilla dict class? Python comes with a rich standard library, part of its "batteries included" philosophy. One module in the standard library is collections, which contains quite a few interesting data structures. In this post, we will take a look at two of them: defaultdict and Counter.

defaultdict

defaultdict allows us to create a dictionary which has a default value. If the key exists, the dict returns the corresponding value as usual. However, if the key does not exist, it calls a factory function to create the default value, stores that value in the dictionary under that key, and returns it.

Here is an example of defaultdict in action:

from collections import defaultdict

def count_words(line):
    words = line.split()
    counts = defaultdict(int)
    for word in words:
        counts[word] = counts[word] + 1
    return counts

In line 5, we create the defaultdict, passing in int as the factory function. If we look up a key which doesn't exist, defaultdict calls int() (which returns 0) to produce the default value.

In line 7, we use counts[word] as normal. If the key exists, its value is incremented. If it doesn't exist, int() is called, zero is stored as the value for the key, and then that value is incremented.
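To make the factory behaviour concrete, here is a minimal sketch. The key "spam" is just a made-up example value.

```python
from collections import defaultdict

# Calling int with no arguments returns 0; this is the default value used below.
zero = int()

counts = defaultdict(int)
counts["spam"] += 1   # key is missing: int() supplies 0, then 1 is added
counts["spam"] += 1   # key now exists: a plain increment

result = dict(counts)
print(zero)     # 0
print(result)   # {'spam': 2}
```

Any callable that takes no arguments can serve as the factory; int is simply a convenient choice when the values are counts.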

How does defaultdict differ from using the get method with a default value? The get method does not modify the dict in any way. If the key is missing, it just returns the default value and leaves the dict unchanged. defaultdict, on the other hand, stores the default value under the missing key, so if you check that key later, it will be present with that value.
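The difference is easy to see side by side. The following is a small sketch; the key names are made up for illustration.

```python
from collections import defaultdict

plain = {}
from_get = plain.get("missing", 0)   # returns 0, plain stays empty

auto = defaultdict(int)
from_index = auto["missing"]         # returns 0 and stores the key

print(from_get, plain)               # 0 {}
print(from_index, dict(auto))        # 0 {'missing': 0}

# Note: get on a defaultdict behaves like get on a plain dict.
# It does not call the factory and does not store anything.
from_auto_get = auto.get("other")
print(from_auto_get, "other" in auto)   # None False
```

This matters if you later iterate over the dictionary or check its length, because a lookup with square brackets on a defaultdict can add keys as a side effect.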

Counter

Would you believe that our word count problem can be solved in one line?

Yes!! This is because there is already built-in functionality to count words. The data structure we want is Counter, and it is part of the collections module. It does exactly what we have been doing manually so far: count the number of times an item appears 😄

As a bonus, Counter can also return its items sorted by count, in case you want to know the top 5 most common words or anything like that.
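As a quick illustration of the sorted output, here is a sketch using Counter's most_common method. The sample sentence is made up for this example.

```python
from collections import Counter

line = "the cat and the dog and the bird"   # made-up sample text
counts = Counter(line.split())

# most_common(n) returns the n highest counts as (item, count) pairs.
top_two = counts.most_common(2)
print(top_two)   # [('the', 3), ('and', 2)]
```

Calling most_common() with no argument returns all the items, ordered from most to least common.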

In general, Counter implements a multiset data structure. Everything that can be done with a multiset is a topic for another day. For now, here is the one-liner for the word count problem.

from collections import Counter

def count_words(line):
    return Counter(line.split())
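Here is a short usage sketch of the function above, repeated so the snippet runs on its own. The input sentence is an arbitrary example.

```python
from collections import Counter

def count_words(line):
    return Counter(line.split())

counts = count_words("to be or not to be")
print(counts["to"])    # 2
print(counts["or"])    # 1

# A Counter returns 0 for an item it has never seen,
# and unlike defaultdict it does not store the missing key.
missing = counts["question"]
print(missing, "question" in counts)   # 0 False
```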

Many times we find that the problem we want to solve is already solved within Python's vast standard library. That's the beauty of Python 🎉

In the final article of this series, we will look at implementing the bonus functionality.

Siddharta Govindaraj
