Build Your Own Spelling Correction Tool
This challenge is to build your own spelling correction tool. We all probably use a tool with spelling correction built into it daily. Google offers spelling corrections for our searches, Google Docs, Notion or MS Office let us know when we might have spelt a word wrong as do many of the IDEs we use to write code.
Despite that, surprising few software engineers know how they work, which is a shame as being able to detect and correct spelling errors is incredibly useful for most of us.
If you’re building a website that allows users to enter data a spelling checker could improve the quality of the user experience or, improve the quality of the data you capture from users. If you’re working in data science, machine learning or the development and deployment of AI being able to detect and correct spelling errors can both improve the user experience and the quality and cleanliness of the data you capture to respond to or train upon.
Personally I’ve used spelling error detection and correct techniques to sanitise, and homogenise datasets from multiple sources that all contain the typical problems you see in user entered data:
- Inconsistent use of abbreviations - for example people use: ST, St., St, and Str for Street.
- Misspellings or typos - for example: Strete, Stret, Streat and Strt.
Alternately just do this project to have fun learning some new real-world algorithms!
The Challenge - Building A Spelling Correction Tool
In this coding challenge you’re going to implement a tool to suggest spelling corrections. For the sake of demonstration / testing the challenge I suggest building a command line tool. You might like to consider building your solution in two parts:
- A spelling correction library in your programming language of choice.
- A command line tool that demonstrates the use of the library.
Then you’ll be able to reuse your solution in future projects.
Step Zero
Coding Challenges indexes from zero! In this first step your goal is to setup a new project with your chosen programming language and development environment and then get some training data.
For the training data we want a set of words and the frequency they occur in a large body of text. You can get that by either scraping the data from a public data source directly and calculating it yourself or by accessing one of the public word frequency data sets.
For my solution I went with the English Word Frequency dataset from Kaggle. If you’re working in another language check your favourite search engine for alternate datasets (or scrape your own).
Step 1
In this step your goal is to parse the training data you’ve collected and create a word frequency table within your application. If you’ve collected the data yourself you may need to clean it too.
Step 2
In this step your goal is to check if the provided word is in your word frequency table. If it is your program should report it as being correctly spelled.
Step 3
In this step your goal is to look for the correct spelling of a word, assuming that there is just one character of the word that is incorrect or missing. In other words you want to generate a list of all the ‘words’ that are:
- A deletion - one letter removed from the provided word.
- An insertion - one letter added to the provided word.
- A transposition - two adjacent letters are swapped.
- A replacement - one letter in the word is replaced with another letter.
Bonus exercise: calculate the number of words that will be just one edit away from the provided word using Big O notation where n is the length of the word.
Once you have calculated all these “words” check if they are in the word frequency table, if not, then we can assume they’re not valid words. If they are, then suggest the most frequently occurring word as the correct spelling.
Step 4
In this step your goal is to expand your spelling corrector to consider words that have a Levenshtein distance of two from the supplied input word (in Step 3 we were considering words that had a Levenshtein distance of one).
The Levenshtein distance is the number of edits required to transform one word into another. You won’t need to calculate the Levenshtein distance for this Coding Challenge, but it’s a useful concept to know about if you’re ever faced with cleaning data or matching disparate data sets that have errors from human data entry. I’ve found it particularly useful for consolidating address data.
Once you have the words with a distance of two, again check if they exist in the word frequency table and if so, suggest the one that occurs with the highest frequency as the corrected spelling.
Do keep in mind that the correct spelling is likely to be the highest frequency word from the set of words with the lowest Levenshtein distance.
Step 5
In this optional step your goal is to track the time taken to suggest corrections to a list of words and present the correction and then calculate the time take and words per second, i.e.:
% ./ccspell speiling misteke executionw mekanism coding chalenges
speiling spelling
misteke mistake
executionw execution
mekanism mechanism
coding coding
chalenges challenges
Time : 20.026833ms 299.6 words per second
Going Further
If you want to take this further you could explore building your own Bloom filter and using it to speed up determining if a word is in the list of known words. Beyond that you could explore providing corrections with a Levenshtein distance of greater than two or adding support for more complex approaches to spelling correction, i.e.:
- Spelling correction without a word frequency table, i.e. using trigrams.
- Spelling correction using phonetic similarity.
- Spelling correction using context (i.e. the surrounding words).
Help Others by Sharing Your Solutions!
If you think your solution is an example other developers can learn from please share it, put it on GitHub, GitLab or elsewhere. Then let me know - ping me a message on the Discord Server, via Twitter or LinkedIn or just post about it there and tag me. Alternately please add a link to it in the Coding Challenges Shared Solutions Github repo.
Get The Challenges By Email
If you would like to receive the coding challenges by email, you can subscribe to the weekly newsletter on SubStack here: