- ....A very basic introduction to vi....
- [Refresher] Given a file of simple plain text,
develop a command sequence to uniquely list all words found in the
file (list each unique word just once).
The most difficult part of this exercise is understanding how we use
tr to remove all the non-alphabetic characters.
tr translates one set of characters into another;
if we're aiming for one word per line,
then we'd like all the non-alphabetic characters to be replaced by a newline,
and then to remove all of the resulting blank lines.
So we might start with:
shell> tr 'A-Za-z' '\n' < unix-1969-1971.txt
... mostly lots of empty lines ...
.... but that replaces the alphabetic characters themselves, leaving everything else untouched.
So instead of replacing the alphabetic characters with a newline,
we need to replace the non-alphabetic characters with a newline.
tr supports a
-c option to specify the complement
of the alphabetic characters:
shell> tr -c 'A-Za-z' '\n' < unix-1969-1971.txt
Better, but we want to remove those multiple empty lines.
There are a few ways to do this
(sed is one possibility),
but tr supports a
-s option to squeeze each run of repeated characters into a single one:
shell> tr -cs 'A-Za-z' '\n' < unix-1969-1971.txt
Better, but the word 'of' appears multiple times.
We finally use our well-known
sort command, with its -u option, to provide each word only once:
shell> tr -cs 'A-Za-z' '\n' < unix-1969-1971.txt | sort -u
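The finished pipeline can be tried without the original textfile. What follows is a minimal sketch: the sample sentence and the function name unique_words are made up for illustration, and LC_ALL=C is an added assumption so that the sort order is predictable (uppercase before lowercase).

```shell
#!/bin/sh
# unique_words: read text on standard input, print each distinct word once.
unique_words() {
    # -c complements the set (everything that is NOT a letter),
    # -s squeezes each run of resulting newlines into a single newline,
    # sort -u then orders the words and drops duplicates.
    LC_ALL=C tr -cs 'A-Za-z' '\n' | LC_ALL=C sort -u
}

# Made-up sample text standing in for unix-1969-1971.txt
printf 'The history of Unix: the start of it all.\n' | unique_words
# Prints, one word per line: The Unix all history it of start the
```

Note that 'The' and 'the' are listed separately, because the pipeline is case-sensitive.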
- For this exercise, you're asked to modify your previous solution,
and develop a (very) rudimentary spelling checker.
Firstly, we'll need a "dictionary" of valid words.
View the dictionary on your system using less.
- On Linux platforms, the file
provides a collection of words collated from (many old) newspaper articles.
There's also a copy here:
(like Linux software, data and even computer hardware can be open-source too!)
- On macOS, the file
provides a collection of words found in the 1934 edition of
Webster's International Dictionary.
For this exercise,
let's more rigorously define a "word" to be three or more lowercase characters.
Now, with reference to the dictionary on your system,
develop a command or shellscript that finds the words
in the textfile (from the first exercise)
that do not appear in the dictionary - potential spelling errors.
We may solve this easily using the
comm command to
compare two files with one word per line.
Note that we could redirect the output of the previous exercise to a temporary file
and then compare that file with a standard dictionary,
but the comm command (like many, but not all, commands)
accepts a filename of '-' to mean "read this file from my standard input",
in this case through a pipe:
shell> tr -cs 'A-Za-z' '\n' < unix-1969-1971.txt | sort -u | comm -23 - /usr/share/dict/words
We should also notice that some common, valid words "slip through".
This is not a fault of our logic,
but shows that not every word is in the dictionary.
Full spellchecking programs perform further analysis
to account for words' prefixes, suffixes, and origins.
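The command above does not yet enforce the stricter definition of a word (three or more lowercase characters), and comm expects both of its inputs to be sorted in the same collating order. Here is a hedged sketch of one way to handle both. The function name misspelled, the grep filter, the temporary sorted copy of the dictionary, and the use of LC_ALL=C are all illustrative assumptions rather than part of the exercise's solution.

```shell
#!/bin/sh
# misspelled TEXTFILE DICTIONARY
# Print the words of TEXTFILE (three or more lowercase letters)
# that do not appear in DICTIONARY.
misspelled() (
    # Run in a subshell so the locale change does not leak to the caller.
    LC_ALL=C; export LC_ALL

    # comm needs both inputs in the same sorted order, so sort a
    # private copy of the dictionary under the same locale.
    dict_sorted=$(mktemp) || exit 1
    sort -u "$2" > "$dict_sorted"

    # One word per line, keep only all-lowercase words of 3+ letters,
    # then list the words found only in the text (column 1 of comm).
    tr -cs 'A-Za-z' '\n' < "$1" |
        grep -E '^[a-z]{3,}$' |
        sort -u |
        comm -23 - "$dict_sorted"

    rm -f "$dict_sorted"
)

# Example use (both filenames appear in the exercise text):
# misspelled unix-1969-1971.txt /usr/share/dict/words
```

Words containing any uppercase letter, such as 'Unix', are skipped entirely under this definition rather than being reported.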
Finally, create a new textfile using vi,
and in it write a shellscript that executes your spell-checking command sequence:
shell> vi myspell
... type in command sequence; write to disk; exit the editor ...
shell> chmod +x myspell
- Now, extend the previous exercise so that your shellscript receives a
command-line argument informing it which textfile it should spellcheck.
Inside your shellscript (file),
you can access the provided command-line argument using the value of $1.
shell> vi myspell
... replace the fixed filename with $1 ; write to disk; exit the editor ...
shell> ./myspell classroom.txt
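For reference, the contents of myspell after this change might look like the following. This is only a sketch: the here-document stands in for typing the lines in vi, and the usage check and the overridable DICT variable are illustrative additions that the exercise does not require.

```shell
#!/bin/sh
# Create the script file; the quoted 'EOF' stops the shell from expanding
# $1, $0 and $# while the file is being written.
cat > myspell <<'EOF'
#!/bin/sh
# myspell: list words in the named textfile that are not in the dictionary.
# DICT may be overridden from the environment (an illustrative addition).
DICT=${DICT:-/usr/share/dict/words}

# Refuse to run without exactly one argument (an illustrative addition).
if [ $# -ne 1 ]; then
    echo "Usage: $0 textfile" >&2
    exit 1
fi

# "$1" is the first command-line argument; the quotes keep a filename
# containing spaces together as a single word.
tr -cs 'A-Za-z' '\n' < "$1" | sort -u | comm -23 - "$DICT"
EOF
chmod +x myspell
```

With the script in place, ./myspell classroom.txt checks that file, and running ./myspell with no argument prints the usage message instead of waiting on standard input.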
- 🌶 When sorting a textfile
containing both a header-line and data in multiple columns,
we must be careful to not sort the header-line, too,
else the header-line may end up in the "middle" of the lines of output.
Consider the textfile australian-universities.tsv.
If we just sort it by its 1st field (university name),
then the initial header line will be incorrectly positioned in the
middle of the output.
Write a shellscript named sorttable to sort the textfile,
by the number of its international students,
while keeping the header-line at the top of the output.
This problem presents a challenge, but for the wrong reasons.
Firstly, we need to save a copy of the file's header line
in a temporary file:
shell> head -1 australian-universities.tsv > headerline
Next we need to sort the data by its third (numeric) field.
This was easy for our .csv files,
but here we have a .tsv file,
so we need to indicate to sort
that a tab character is the field delimiter.
This is the challenge, because tab, along with space,
separates command arguments in the shell.
Searching the web gives this solution,
which stores a tab character in a shell variable,
and then provides it as a parameter to sort:
shell> TAB=`echo -e "\t"`
shell> sort -t "$TAB" -k3 -n australian-universities.tsv
We are now correctly sorting the data,
but we're still including the header line in the sorted output
(and we can't cheat by knowing what the header line looks like).
Drawing on the previous exercise,
we can filter the header line from this output using comm.
Think hard about why this works!
shell> sort -t "$TAB" -k3 -n australian-universities.tsv | comm -23 - headerline
Finally, we need to add the header line back at the top of this output,
and remember to remove the temporary file:
shell> sort -t "$TAB" -k3 -n australian-universities.tsv | comm -23 - headerline | cat headerline -
Name Local International Total
University of Notre Dame Australia 10633 327 10960
University of New England 19833 1079 20912
Charles Darwin University 9687 1161 10848
RMIT University 30843 26590 57433
shell> rm headerline
As we'll be writing this in a new executable shellscript named sorttable,
the only output we see is the sorted table,
and the shellscript 'silently' removes the temporary file.
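As a point of comparison, the header can also be split off with head and tail, which avoids both the temporary file and comm (comm is documented to expect lexically sorted input, which a numeric sort does not guarantee). The following is a minimal sketch of that swapped-in approach, not the solution developed above. It assumes the file has exactly one header line, and the here-document merely stands in for creating sorttable in vi.

```shell
#!/bin/sh
# Create the script file; the quoted 'EOF' keeps $1 and $TAB unexpanded
# while the file is being written.
cat > sorttable <<'EOF'
#!/bin/sh
# sorttable (sketch): print the TSV file named by $1, sorted numerically
# by its third field, with the header line kept at the top.

TAB=$(printf '\t')      # a literal tab, without relying on echo -e

head -n 1 "$1"          # the header line, passed through untouched

# tail -n +2 prints from line 2 onward, i.e. everything but the header;
# -k3,3n restricts the sort key to field 3 and compares it as a number.
tail -n +2 "$1" | sort -t "$TAB" -k3,3n
EOF
chmod +x sorttable
```

It would then be run as ./sorttable australian-universities.tsv. Because the header never passes through sort, the script does not need to know what the header looks like.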