awk Spell Checking
Spell checking
We create an AWK program for spell checking.
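BEGIN {
    count = 0
    i = 0
    # Load the dictionary, one word per line. Testing for "> 0" ends the
    # loop at end of file and also when the file cannot be read.
    while ((getline myword < "/usr/share/dict/words") > 0) {
        dict[i] = myword
        i++
    }
}

{
    # Collect every whitespace-separated word of the checked text.
    for (i = 1; i <= NF; i++) {
        field = $i
        # Strip one trailing punctuation mark; string positions start at 1.
        if (match(field, /[[:punct:]]$/)) {
            field = substr(field, 1, RSTART-1)
        }
        mywords[count] = field
        count++
    }
}

END {
    # Remove every word that is found in the dictionary.
    for (w_i in mywords) {
        for (w_j in dict) {
            if (mywords[w_i] == dict[w_j] ||
                tolower(mywords[w_i]) == dict[w_j]) {
                delete mywords[w_i]
                # Stop scanning; the word has already been found.
                break
            }
        }
    }
    # Whatever is left was not found; print it.
    for (w_i in mywords) {
        if (mywords[w_i] != "") {
            print mywords[w_i]
        }
    }
}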
The script compares the words of the provided text file against a dictionary. On many Unix-like systems, the /usr/share/dict/words file holds an English word list with one word per line.
BEGIN {
    count = 0
    i = 0
    while ((getline myword < "/usr/share/dict/words") > 0) {
        dict[i] = myword
        i++
    }
}
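Inside the BEGIN block, we read the words from the dictionary into the dict array. The getline myword <file form reads the next record from the given file and stores it in the myword variable; $0 is left unchanged. It returns 1 when a record was read, 0 at the end of the file, and -1 on error, so the loop condition should test for a result greater than 0; otherwise a missing dictionary file would make the loop run forever.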
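The return values of getline can be observed with a small experiment. The following is a minimal sketch, assuming a POSIX shell with awk and mktemp available; the load_words function name and the three-word temporary list are made up for illustration and stand in for the real dictionary.

```shell
# Hypothetical helper: print how many lines awk reads from the file named in $1.
load_words() {
    awk -v path="$1" 'BEGIN {
        n = 0
        # getline var < file yields 1 per record, 0 at end of file, -1 on error
        while ((getline word < path) > 0) {
            dict[n] = word
            n++
        }
        close(path)
        print n
    }'
}

# A throwaway word list stands in for /usr/share/dict/words.
tmp=$(mktemp)
printf 'apple\nbanana\ncherry\n' > "$tmp"
load_words "$tmp"            # prints 3
load_words "$tmp.missing"    # prints 0; the "> 0" test prevents an endless loop
rm -f "$tmp"
```

Without the comparison, the -1 returned for the missing file would count as true and the loop would never end.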
{
    for (i = 1; i <= NF; i++) {
        field = $i
        if (match(field, /[[:punct:]]$/)) {
            field = substr(field, 1, RSTART-1)
        }
        mywords[count] = field
        count++
    }
}
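In the main part of the program, we place the words of the file that we are spell checking into the mywords array. The match() function looks for a punctuation mark (like a comma or a dot) at the end of the word; if there is one, substr() keeps only the characters before it. String positions in AWK start at 1, so the substring should begin at position 1. Only a single trailing mark is removed.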
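To see what the match() and substr() pair does to individual words, here is a small sketch. The strip_punct function name and the sample sentence are illustrative assumptions; the AWK body mirrors the main block of the script, except that it prints each word instead of storing it.

```shell
# Hypothetical helper: print each input word with one trailing punctuation mark removed.
strip_punct() {
    awk '{
        for (i = 1; i <= NF; i++) {
            field = $i
            # match() sets RSTART to the position of the trailing mark
            if (match(field, /[[:punct:]]$/)) {
                field = substr(field, 1, RSTART - 1)
            }
            print field
        }
    }'
}

printf 'Hello, world. Done!!\n' | strip_punct
# Hello
# world
# Done!
```

The last word shows the limit of this approach: only the final mark is stripped, so a word followed by two marks keeps one of them.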
END {
    for (w_i in mywords) {
        for (w_j in dict) {
            if (mywords[w_i] == dict[w_j] ||
                tolower(mywords[w_i]) == dict[w_j]) {
                delete mywords[w_i]
                # Stop scanning; the word has already been found.
                break
            }
        }
    }
    ...
}
We compare the words from the mywords array against the dictionary array. If the word is in the dictionary, it is removed with the delete statement. Words that begin a sentence start with an uppercase letter; therefore, we also check for a lowercase alternative with the tolower() function.
for (w_i in mywords) {
    if (mywords[w_i] != "") {
        print mywords[w_i]
    }
}
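The remaining words have not been found in the dictionary; they are printed to the console. The empty-string check skips entries that hold no text, such as a field that consisted of a single punctuation mark.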
$ awk -f spellcheck.awk text
consciosness
finaly
We have run the program on a text file and found two misspelled words. Note that the program takes some time to finish, because the END block compares every word of the text with every entry of the dictionary.
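The run time is dominated by the nested loops. One possible speed-up, sketched here and not part of the original script, stores each dictionary word as an array index so that a single in test replaces the inner loop. The spellcheck_fast function name, the dictfile variable, and the small sample files are assumptions made for this example.

```shell
# Hypothetical usage: spellcheck_fast DICTIONARY_FILE TEXT_FILE
spellcheck_fast() {
    awk -v dictfile="$1" '
    BEGIN {
        # Use the words themselves as array indices.
        while ((getline word < dictfile) > 0) {
            dict[word] = 1
        }
        close(dictfile)
    }
    {
        for (i = 1; i <= NF; i++) {
            field = $i
            if (match(field, /[[:punct:]]$/)) {
                field = substr(field, 1, RSTART - 1)
            }
            if (field == "") continue
            # "in" tests membership without creating a new element.
            if (!(field in dict) && !(tolower(field) in dict)) {
                print field
            }
        }
    }' "$2"
}

# Tiny sample files stand in for the real dictionary and text.
dict=$(mktemp); text=$(mktemp)
printf 'the\ncat\nfinally\nsat\n' > "$dict"
printf 'The cat finaly sat.\n' > "$text"
spellcheck_fast "$dict" "$text"    # prints: finaly
rm -f "$dict" "$text"
```

With the real word list, this could be called as spellcheck_fast /usr/share/dict/words text. Unlike the original, this variant prints unknown words in the order in which they appear in the text.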