In the previous solution to the word count problem, we used the get method of the vanilla dict class to supply a default value, so no conditional checks or error handling were needed to clutter the code.
But who says we should only use the vanilla dict class? Python comes with a rich standard library, part of its "batteries included" philosophy. One module in the standard library is collections, which contains quite a few interesting data structures. In this post, we will take a look at two of them: defaultdict and Counter.
defaultdict lets us create a dictionary that has a default value. If the key exists, the dict returns the corresponding value as usual. However, if the key does not exist, it calls a factory function to create the default value and stores that value in the dictionary under that key.
Here is an example of defaultdict in action:
```python
from collections import defaultdict

def count_words(line):
    words = line.split()
    counts = defaultdict(int)
    for word in words:
        counts[word] = counts[word] + 1
    return counts
```
In line 5 we create the defaultdict, passing in int as the factory function. If we index a key which doesn't exist, it calls int() (which returns 0) to set the default value.
In line 7, we use counts[word] as normal. If the key exists, its value is incremented. If it doesn't exist, int() is called, zero is set as the value for the key, and then it is incremented.
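To see those two steps in isolation, here is a minimal sketch. The key "the" is just an illustrative choice, not something from the word count problem itself.

```python
from collections import defaultdict

# int() called with no arguments returns 0; this becomes the default value
print(int())  # 0

counts = defaultdict(int)

# "the" is not in counts yet, so int() supplies 0 and we then add 1
counts["the"] = counts["the"] + 1
# now the key exists, so its current value (1) is incremented as usual
counts["the"] = counts["the"] + 1

print(counts["the"])  # 2
print(dict(counts))   # {'the': 2}
```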
How does defaultdict differ from using the get method with a default value? The get method does not modify the dict in any way. If the key is missing, it just returns the default value and leaves the dict itself unchanged. defaultdict, on the other hand, sets the value for the missing key in the dict, so if you check that key later it will have the value.
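The difference is easy to check for yourself. The sketch below uses a made-up key, "missing", and two illustrative variable names, plain and auto.

```python
from collections import defaultdict

plain = {}
from_get = plain.get("missing", 0)   # returns the default value, 0
# get leaves the dict untouched, so the key is still absent
print(from_get, "missing" in plain)  # 0 False

auto = defaultdict(int)
from_index = auto["missing"]         # int() is called and 0 is stored
# merely reading the key has inserted it into the dict
print(from_index, "missing" in auto) # 0 True
print(len(auto))                     # 1
```

One consequence worth keeping in mind: just reading a missing key from a defaultdict with square brackets adds that key, so the dict can grow even when you never assign to it.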
Would you believe that our word count problem can be solved in one line?
Yes! The standard library already has built-in functionality for counting. The data structure we want is Counter, and it is part of the collections module. It does exactly what we have been doing manually so far: count the number of times an item appears 😄
As a bonus, Counter can also return the data sorted by count, in case you want to know the top 5 most common words or anything like that.
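As a rough illustration, Counter's most_common method returns the items ordered from the highest count down. The sample sentence here is made up for the example.

```python
from collections import Counter

line = "the cat and the dog and the bird"
counts = Counter(line.split())

# most_common(n) returns the n highest counts as (item, count) pairs
print(counts.most_common(2))  # [('the', 3), ('and', 2)]
```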
In general, Counter implements a multi-set data structure. Everything that can be done with a multi-set is a topic for another day. For now, here is the one-liner for the word count problem:
```python
from collections import Counter

def count_words(line):
    return Counter(line.split())
```
Many times we find that the problem we want to solve is already solved within Python's vast standard library. That's the beauty of Python 🎉
In the final article of this series, we will look at implementing the bonus functionality.