Computer Science and Software Engineering

Department of Computer Science and Software Engineering

CITS4407 Open Source Tools and Scripting

Exercise sheet 3 - sample solutions and discussion

These exercises should be undertaken using bash and the Terminal window application (under macOS, Linux, or Windows WSL).

....A very basic introduction to vi....
[Refresher] Given a file of simple plain text, such as unix-1969-1971.txt, develop a command sequence to uniquely list all words found in the file (list each unique word just once).

The most difficult part of this exercise is understanding how we use tr to remove all the non-alphabetic characters. tr translates one set of characters into another; if we're aiming for one word per line, then we'd like every non-alphabetic character to be replaced by a newline, and then to remove the resulting blank lines. So we might start with:

shell> tr 'A-Za-z' '\n' < unix-1969-1971.txt
.....
1969
.....
(mostly lots of empty lines)

.... but that replaces the alphabetic characters themselves and keeps everything else. So instead of replacing the alphabetic characters with a newline, we need to replace the non-alphabetic characters with a newline. tr supports a -c option to specify the complement of a set of characters, here the alphabetic characters:

shell> tr -c 'A-Za-z' '\n' < unix-1969-1971.txt
Unix
was
born
in
out
of
.....

Better, but we want to remove those multiple empty lines. There are a few ways to do this (grep and sed are two possibilities) but tr supports a -s option to squeeze each run of repeated characters (here, newlines) into a single one:

shell> tr -cs 'A-Za-z' '\n' < unix-1969-1971.txt
Unix
was
born
in
out
of
the
mind
of
a
computer
scientist
.....

Better, but the word 'of' appears multiple times. We finally use our well-known sort command, with its -u option, to provide each word only once:

shell> tr -cs 'A-Za-z' '\n' < unix-1969-1971.txt | sort -u
ASR
Bell
But
Computer
He
John
Ken
Laboratories
.....
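
The whole pipeline can be tried on a small throwaway file before running it on the real text. The following is only a sketch: the one-line sample stands in for unix-1969-1971.txt, and setting LC_ALL=C is our own assumption, added so that the A-Za-z ranges and the sort order behave the same on every system.

```shell
# A one-line stand-in for unix-1969-1971.txt (an assumption for this sketch).
sample=$(mktemp)
printf 'Unix was born in 1969 out of the mind of a computer scientist.\n' > "$sample"

# -c : operate on the complement of A-Za-z (every non-alphabetic character)
# -s : squeeze each run of the replacement character (newline) into one
# sort -u then lists each distinct word once.
words=$(LC_ALL=C tr -cs 'A-Za-z' '\n' < "$sample" | LC_ALL=C sort -u)
printf '%s\n' "$words"

rm -f "$sample"
```

In the C locale, uppercase letters sort before lowercase ones, so capitalised words appear at the top of the list.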
For this exercise, you're asked to modify your previous solution, and develop a (very) rudimentary spelling checker. Firstly, we'll need a "dictionary" of valid words.
- On Linux platforms, the file /usr/share/dict/words provides a collection of words collated from (many old) newspaper articles.
  There's also a copy here: usr-share-dict-words (like Linux software, data and even computer hardware can be open-source too!)
- On macOS, the file /usr/share/dict/web2 provides a collection of words found in the 1934 edition of Webster's International Dictionary.
View the dictionary on your system using less.
For this exercise, let's more rigorously define a "word" to be three or more lowercase characters. Now, with reference to the dictionary on your system, develop a command or shellscript that finds the words in the textfile (from the first exercise) that do not appear in the dictionary - potential spelling errors.

We may solve this easily using the comm command, which compares two sorted files containing one word per line. Note that we could redirect the output of the previous exercise to a temporary file, and then compare that file with a standard dictionary, but the comm command (like many, but not all, commands) accepts a filename of '-' to mean "read this file from my standard input", in this case through a pipe. The options -23 suppress comm's second and third output columns (words found only in the dictionary, and words found in both), leaving just the words found only in our textfile:

shell> tr -cs 'A-Za-z' '\n' < unix-1969-1971.txt | sort -u | comm -23 - /usr/share/dict/words
But
Computer
Laboratories
.....
cellphone
.....

We should also notice that some common, valid words "slip through" our command. This is not a fault of our logic, but shows that not every word is in the dictionary. Full spellchecking programs perform stemming to account for words' prefixes, suffixes, and origins.
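
The command above does not yet apply the stricter definition of a "word" (three or more lowercase characters). One reading of that definition is to keep only words made up entirely of three or more lowercase letters, which a grep filter can do. This is a sketch under that assumption, not the only possible answer; the two small files are stand-ins for the textfile and the system dictionary, and LC_ALL=C is our own addition so that grep, sort and comm all agree on character ranges and ordering.

```shell
# Stand-ins for the textfile and the dictionary (assumptions for this sketch).
text=$(mktemp); dict=$(mktemp); sorted=$(mktemp)
printf 'The cellphone was on a desk in Bell Laboratories.\n' > "$text"
printf 'the\ndesk\nwas\n' > "$dict"

# comm expects both inputs in the same sorted order, so sort the dictionary too.
LC_ALL=C sort -u "$dict" > "$sorted"

# Split into words, keep only those of 3+ lowercase letters, then list the
# words that appear in the text but not in the dictionary.
misspelt=$(LC_ALL=C tr -cs 'A-Za-z' '\n' < "$text" |
    LC_ALL=C grep -E '^[a-z]{3,}$' |
    LC_ALL=C sort -u |
    LC_ALL=C comm -23 - "$sorted")
printf '%s\n' "$misspelt"

rm -f "$text" "$dict" "$sorted"
```

With this filter, capitalised words such as Bell and short words such as on are never reported, because they are discarded before the comparison.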
Next, create a new textfile using vi, and write a shellscript to execute the command sequence from the previous exercise.

shell> vi myspell
... type in command sequence; write to disk; exit the editor ...
shell> chmod +x myspell
shell> ./myspell
But
Computer
Laboratories
.....
Now, extend the previous exercise so that your shellscript receives a command-line argument informing it which textfile it should spellcheck.
Inside your shellscript (file), you can access the provided command-line argument using the value of $1.

shell> vi myspell
... replace the fixed filename with $1 ; write to disk; exit the editor ...
shell> ./myspell classroom.txt
Classroom
Data
Dead
In
Performance
.....
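
The contents of such a script might look like the sketch below, written here as a shell function so that it can be pasted into a terminal and tried. The argument checks and the DICT variable (an optional override of the dictionary path) are our own additions and not part of the exercise. In a script file the body would be the same, with exit 1 in place of return 1.

```shell
# myspell FILE: print the words in FILE that are absent from the dictionary.
# DICT is a hypothetical override; by default the Linux path is used.
myspell() {
    if [ "$#" -ne 1 ]; then
        echo "Usage: myspell textfile" >&2
        return 1
    fi
    if [ ! -r "$1" ]; then
        echo "myspell: cannot read $1" >&2
        return 1
    fi
    # "$1" is quoted so that filenames containing spaces still work.
    tr -cs 'A-Za-z' '\n' < "$1" | sort -u |
        comm -23 - "${DICT:-/usr/share/dict/words}"
}
```

Quoting "$1" matters because an unquoted $1 would be split into several words by bash if the filename contained a space.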
🌶 When sorting a textfile containing both a header-line and data in multiple columns, we must be careful to not sort the header-line, too, else the header-line may end up in the "middle" of the lines of output.
Consider the textfile australian-universities.tsv
If we just sort it by its 1st field (state name), then the initial header line will be incorrectly positioned in the middle of the output.
Write a shellscript named sorttable to sort the textfile australian-universities.tsv by the number of its international students, while keeping the header-line at the top of the output.

This problem presents a challenge, but for the wrong reasons. Firstly, we need to save a copy of the file's header line in a temporary file:

shell> head -n 1 australian-universities.tsv > headerline

Next we need to sort the data by its third (numeric) field. This was easy for our .csv files, but here we have a .tsv file, so we need to indicate to sort that a tab character is the field delimiter. This is the challenge, because tab, along with space, separates command arguments in bash.
Searching the web gives this solution, which stores a tab character in a shell variable, and then provides it as a parameter to sort:

shell> TAB=`echo -e "\t"`
shell> sort -t "$TAB" -k3 -n australian-universities.tsv
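
As a variation, the tab character can also be produced with printf, which behaves more consistently across shells than echo -e. The sketch below uses three data rows taken from the sample output further down, written to a temporary file that stands in for australian-universities.tsv; the key -k3,3n is our own refinement, restricting the numeric comparison to the third field only.

```shell
# printf '\t' emits a single tab; command substitution keeps it intact.
TAB=$(printf '\t')

# A small stand-in for australian-universities.tsv (header plus three rows).
table=$(mktemp)
printf 'Name\tLocal\tInternational\tTotal\n' > "$table"
printf 'Charles Darwin University\t9687\t1161\t10848\n' >> "$table"
printf 'University of New England\t19833\t1079\t20912\n' >> "$table"
printf 'University of Notre Dame Australia\t10633\t327\t10960\n' >> "$table"

# -t sets the field delimiter; -k3,3n sorts numerically on field 3 alone.
sorted=$(sort -t "$TAB" -k3,3n "$table")
printf '%s\n' "$sorted"

rm -f "$table"
```

Because the header's third field is not a number, a numeric sort treats it as zero, so the header line comes out before every row with a positive count.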

We are now correctly sorting the data, but we're still including the header line in the sorted output (and we can't cheat by knowing what the header line looks like). Drawing on the previous exercise, we can filter the header line from this output using comm. Think hard about why this works!

shell> sort -t "$TAB" -k3 -n australian-universities.tsv | comm -23 - headerline

Finally, we need to add the header line back at the top of this output, and remember to remove the temporary file:

shell> sort -t "$TAB" -k3 -n australian-universities.tsv | comm -23 - headerline | cat headerline -
Name                                 Local  International  Total
University of Notre Dame Australia   10633  327            10960
University of New England            19833  1079           20912
Charles Darwin University            9687   1161           10848
.....
RMIT University                      30843  26590          57433
shell> rm headerline

As we'll be writing this in a new executable shellscript named sorttable, the only output we see is the results, and the shellscript 'silently' removes the temporary file.
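
For comparison, here is a different approach from the one above, sketched as a shell function: it prints the first line with head, and sorts only the remaining lines, selected with tail -n +2. It needs no temporary file and does not depend on where the header line would land in the sorted output. The function body is what a sorttable script might contain, with the filename passed as $1 (our assumption, following the previous exercise).

```shell
# sorttable FILE: print FILE's header line, then its data rows sorted
# numerically by the third tab-separated field.
sorttable() {
    head -n 1 "$1"                                       # the header line only
    tail -n +2 "$1" | sort -t "$(printf '\t')" -k3,3n    # everything after it
}
```

It would be run as sorttable australian-universities.tsv, and it reads the file twice, which is fine for a regular file but would not work if the data arrived through a pipe.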

Chris McDonald
April 2020.

The University of Western Australia
