Implementing word transformations

In the previous articles, we tried a variety of approaches to writing a function that counts the number of times each word appears in a sentence.

Now let us attempt the bonus part. This is what we said in the word count problem statement:

Assume that the whole string is in lowercase and there is no punctuation anywhere in the string. We will look into handling case and punctuation later.

Let us look at those two cases now:

The same word may be in mixed case in the sentence, e.g. "The quick dog and the quick fox". In this case "the" should have a count of 2, but our current code gives "the" a count of 1 and "The" a separate count of 1.
Some words may have a character of punctuation at the end, e.g. "The quick dog, and the quick fox, ... and the fox ran quick". Here "fox" and "fox," will be counted as two separate words.
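To see both problems concretely, here is a small sketch that runs the get-based counter from the earlier articles (the same base version shown under Method 1) on the two example sentences. The variable names mixed_case and punctuated are only for illustration.

```python
def count_words(line):
    # Base version: split on whitespace and count each token exactly as it appears
    counts = {}
    for word in line.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

mixed_case = count_words("The quick dog and the quick fox")
print(mixed_case["The"], mixed_case["the"])  # 1 1

punctuated = count_words("The quick dog, and the quick fox, ... and the fox ran quick")
print(punctuated["fox,"], punctuated["fox"])  # 1 1
```

In both cases a single word ends up split across two dictionary keys.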

Let's fix these two cases now.

It is not particularly complicated. These are the steps we need to follow:

Make all words lowercase
Remove all punctuation from the end of the word
Count the words

Removing Punctuation

Before we do that, let us write a short function to remove the punctuation. For now, let us assume that period, comma, colon, semicolon, question mark and exclamation mark are the only possible punctuation.

def strip_punc(word):
    if word and word[-1] in ".,:;?!":  # "word and" guards against an empty string
        return word[:-1]
    return word

If you remember the first article on negative indexing, this is where that code snippet comes from.

Line 2 features a very Pythonic way to check whether a value is one of several options by using the in operator. In other languages you might have to write a separate function for this, but Python expresses the idea very concisely.
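As a quick illustration of the in operator on its own (the variable name punctuation is just for this sketch): when the right-hand side is a string, in performs a substring test, which for a single character amounts to asking "is it one of these?".

```python
punctuation = ".,:;?!"

print("?" in punctuation)   # True
print("x" in punctuation)   # False

# With a string on the right, `in` is a substring test, so this is also True
print(":;" in punctuation)  # True

# A set makes the "one of these options" intent explicit
print("?" in {".", ",", ":", ";", "?", "!"})  # True
```

For single characters, as in strip_punc, the string form and the set form behave the same way.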

This function has a limitation though: it only removes a single character of punctuation from the end.

What if there are more characters, as in "what??!?!"? Well, it is a simple matter to write a loop and remove all the characters.
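A minimal sketch of such a loop might look like this. The name strip_punc_loop is hypothetical, chosen only to keep it apart from strip_punc.

```python
def strip_punc_loop(word):  # hypothetical name, for illustration only
    # Keep dropping the last character while it is punctuation.
    # The `word and` guard ends the loop once the string is empty.
    while word and word[-1] in ".,:;?!":
        word = word[:-1]
    return word

print(strip_punc_loop("what??!?!"))  # what
print(strip_punc_loop("fox,"))       # fox
print(repr(strip_punc_loop("...")))  # ''
```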

But there is a better way:

def strip_punc(word):
    return word.strip(".,:;?!")

In usual Python style, it is a concise one-liner 😊. The strip method doesn't just strip whitespace: it can optionally take a string of characters and strip those instead. Note that strip works on both ends of the word; use rstrip if you want to touch only the end. And since this is now a single method call, we don't need a separate function anymore.
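A few calls make the behaviour of strip with a character argument easier to see. This is only a sketch; the variable name chars is an assumption for the example.

```python
chars = ".,:;?!"

print("what??!?!".strip(chars))   # what
# strip removes matching characters from both ends...
print("...hello!".strip(chars))   # hello
# ...while rstrip only touches the right-hand end
print("...hello!".rstrip(chars))  # ...hello
# Punctuation in the middle of a word is left alone
print("e.g.".strip(chars))        # e.g
# A token made only of punctuation becomes an empty string
print(repr("...".strip(chars)))   # ''
```

The last case is worth remembering: a standalone "..." in a sentence turns into an empty string rather than disappearing.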

Now we can get to actually implementing the main code. We are going to implement this in two styles.

Method 1

Here is the version we wrote using the get dictionary method. We will start with this as the base.

def count_words(line):
    words = line.split()
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

We now need to add the code to lowercase the word and strip the punctuation before we do the counting:

def count_words(line):
    words = line.split()
    counts = {}
    for word in words:
        word = word.lower().strip(".,;:?!")
        counts[word] = counts.get(word, 0) + 1
    return counts

In this implementation we have added the word transformation code on line 5 inside the loop. This is a pretty typical way to implement it, but is it a good idea?

I personally don't like it, because by using the same loop it mixes up the code for the word transformation with the code for the word counting. I don't like to mix the code for different concepts together.

Why?

Firstly, we might want to add more transformations, like removing stop words or stemming the words, and all that code is going to make this loop really messy.

Also, we can no longer use the Counter class in the simple way we did in the previous article. That's because Counter does the looping internally, so there is no loop body of our own in which to put the transformation code.

Which brings us to the second way to implement it.

Method 2

def count_words(line):
    words = line.split()
    words = [word.lower() for word in words]
    words = [word.strip(".,;:?!") for word in words]
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

In this implementation, we have the transformation done separately on line 3 and line 4. This implementation keeps all three concepts separate: lowercasing the words, stripping the punctuation, and counting the words.

With this implementation, we can easily add / remove transformations or substitute the way the counting is done. It is now simple to use Counter to do the word counting:

from collections import Counter

def count_words(line):
    words = line.split()
    words = [word.lower() for word in words]
    words = [word.strip(".,;:?!") for word in words]
    return Counter(words)
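Because each step is its own line, adding another transformation is a one-line change. As a sketch, here is the same function with one extra step that is not part of the version above: dropping tokens that become empty after stripping. Without it, the standalone "..." in the example sentence is stripped down to an empty string and counted as a word.

```python
from collections import Counter

def count_words(line):
    words = line.split()
    words = [word.lower() for word in words]
    words = [word.strip(".,;:?!") for word in words]
    # Extra step (an addition for this sketch): drop tokens that were
    # nothing but punctuation and are now empty strings
    words = [word for word in words if word]
    return Counter(words)

counts = count_words("The quick dog, and the quick fox, ... and the fox ran quick")
print(counts["the"], counts["fox"], counts["quick"])  # 3 2 3
print("" in counts)  # False
```

Removing stop words or stemming would slot in the same way, as one more line between the split and the count.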

Some might argue that this implementation requires looping through all the words multiple times, so it is less efficient than the previous implementation, which looped only once. Well, yes, but there are two things to keep in mind:

Premature optimisation is the root of all evil, as the saying goes. Unless you have millions of words, it probably does not matter. If you do, profile your code first; if this turns out to be the bottleneck, it is simple enough to change the list comprehensions to generator expressions, in which case each word passes through all the steps in a single pass (I'll write more about generators in a future article).
In general, write code for readability and maintainability rather than performance.
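For reference, the generator-expression variant mentioned in the first point could look roughly like this. The name count_words_lazy is hypothetical, and this is a sketch rather than a measured optimisation.

```python
from collections import Counter

def count_words_lazy(line):  # hypothetical name, for illustration only
    words = line.split()
    # Parentheses instead of square brackets: nothing is computed yet
    words = (word.lower() for word in words)
    words = (word.strip(".,;:?!") for word in words)
    # Counter pulls each word through the whole chain, one at a time
    return Counter(words)

counts = count_words_lazy("The quick dog and the quick fox")
print(counts["the"], counts["quick"])  # 2 2
```

The code reads almost identically to the list version, but no intermediate lists are built for the lowercasing and stripping steps. Note that line.split() itself still creates a list of all the words.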

Summary

This brings us to the end of the word count problem. With this small problem as an example, we explored various coding styles and approaches that we can use to solve it.

In future articles we will further explore all these topics (and many more), and that will help you to write beautiful, expressive, readable Python code.


Siddharta Govindaraj
