- ....A very basic introduction to vi....
- [Refresher] Given a file of simple plain text,
such as
unix-1969-1971.txt,
develop a command sequence to uniquely list all words found in the
file (list each unique word just once).
The most difficult part of this exercise is understanding how we use
tr
to remove all the non-alphabetic characters.
tr
translates one set of characters into another;
if we're aiming for one word per line,
then we'd like all the non-alphabetic characters to be replaced by a newline,
and then to remove the resulting blank lines.
So we might start with:
shell> tr 'A-Za-z' '\n' < unix-1969-1971.txt
.....
1969
.....
mostly lots of empty lines
.... but that translates the alphabetic characters themselves,
leaving everything else (such as the digits in 1969) untouched.
So instead of replacing the alphabetic characters with a newline,
we need to replace the non-alphabetic characters with a newline.
tr
supports a -c
option to specify the complement
of the alphabetic characters:
shell> tr -c 'A-Za-z' '\n' < unix-1969-1971.txt
Unix
was
born
in
out
of
.....
Better, but we want to remove those multiple empty lines.
There are a few ways to do this
(grep
and sed
are two possibilities)
but tr
supports a -s
option to squeeze
each run of repeated newlines into one:
shell> tr -cs 'A-Za-z' '\n' < unix-1969-1971.txt
Unix
was
born
in
out
of
the
mind
of
a
computer
scientist
.....
Better, but the word 'of' appears multiple times.
We finally use our well-known
sort
command, with its -u (unique) option, to list each word only once:
shell> tr -cs 'A-Za-z' '\n' < unix-1969-1971.txt | sort -u
ASR
Bell
But
Computer
He
John
Ken
Laboratories
.....
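The stages above can be checked on a one-line sample before running them on the whole file. This is only a sketch: the function name unique_words and the sample sentence are made up for illustration, and LC_ALL=C is set on the assumption that we want uppercase words listed before lowercase ones, as in the output above.

```shell
#!/bin/bash
# unique_words FILE: list each word of FILE once, one word per line.
# (unique_words is a hypothetical name, not part of the exercise.)
unique_words() {
    # -c: complement the set, so every NON-letter becomes a newline
    # -s: squeeze each run of those newlines into a single newline
    tr -cs 'A-Za-z' '\n' < "$1" | LC_ALL=C sort -u
}

# A one-line sample in which the word "of" appears twice.
sample=$(mktemp)
printf 'Unix was born out of the mind of a computer scientist.\n' > "$sample"
unique_words "$sample"
rm -f "$sample"
```

In the listing this prints, "of" appears once, and "Unix" comes first because uppercase letters sort before lowercase letters in the C locale.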
- For this exercise, you're asked to modify your previous solution,
and develop a (very) rudimentary spelling checker.
Firstly, we'll need a "dictionary" of valid words.
- On Linux platforms, the file
/usr/share/dict/words
provides a collection of words collated from (many old) newspaper articles.
There's also a copy here:
usr-share-dict-words
(like Linux software, data and even computer hardware can be open-source too!)
- On macOS, the file
/usr/share/dict/web2
provides a collection of words found in the 1934 edition of
Webster's International Dictionary.
View the dictionary on your system using less.
For this exercise,
let's more rigorously define a "word" to be three or more lowercase characters.
Now, with reference to the dictionary on your system,
develop a command or shellscript that finds the words
in the textfile (from the first exercise)
that do not appear in the dictionary - potential spelling errors.
We may solve this easily using the comm
command to
compare two files with one-word-per-line.
Note that we could redirect the output of the previous exercise to a
temporary file,
and then compare that file with a standard dictionary,
but the comm
command (like many, but not all, commands)
accepts a filename of '-' to mean "read this file from my standard input",
in this case through a pipe:
shell> tr -cs 'A-Za-z' '\n' < unix-1969-1971.txt | sort -u | comm -23 - /usr/share/dict/words
But
Computer
Laboratories
.....
cellphone
.....
We should also notice that some common, valid words "slip through"
our command and are reported as potential spelling errors.
This is not a fault of our logic,
but shows that not every word is in the dictionary.
Full spellchecking programs perform
stemming
to account for words' prefixes, suffixes, and origins.
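The command above still reports capitalised and very short words. A possible refinement, sketched here under assumptions, applies the stricter definition of a "word": it reads "three or more lowercase characters" as "a word made up entirely of at least three lowercase letters", which is only one interpretation. The function name misspelt is hypothetical. Re-sorting the dictionary is a precaution, because comm expects both of its inputs to be sorted in the same order and we are not verifying how the dictionary file on any given system is ordered.

```shell
#!/bin/bash
# misspelt TEXTFILE DICTIONARY
# (misspelt is a hypothetical name; the body runs in a subshell so that
#  the LC_ALL setting does not leak into the calling shell.)
misspelt() (
    # Use one collation order for sort, grep and comm.
    export LC_ALL=C

    # Temporary, freshly sorted copy of the dictionary.
    sorted_dict=$(mktemp)
    sort -u "$2" > "$sorted_dict"

    # grep -x keeps only whole lines of three or more lowercase letters;
    # comm -23 then prints the lines found only in our word list.
    tr -cs 'A-Za-z' '\n' < "$1" |
        grep -E -x '[a-z]{3,}' |
        sort -u |
        comm -23 - "$sorted_dict"

    rm -f "$sorted_dict"
)
```

It might be invoked as misspelt unix-1969-1971.txt /usr/share/dict/words, substituting the dictionary path for your system.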
- Next,
create a new textfile using vi,
and write a shellscript to execute the command sequence from the
previous exercise.
shell> vi myspell
... type in command sequence; write to disk; exit the editor ...
shell> chmod +x myspell
shell> ./myspell
But
Computer
Laboratories
.....
- Now, extend the previous exercise so that your shellscript receives a
command-line argument informing it which textfile it should spellcheck.
Inside your shellscript (file),
you can access the provided command-line argument using the value of $1.
shell> vi myspell
... replace the fixed filename with $1 ; write to disk; exit the editor ...
shell> ./myspell classroom.txt
Classroom
Data
Dead
In
Performance
.....
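The contents of myspell are left to the reader above; one possible version is sketched here. The function name check_spelling and the DICT variable are additions made for this sketch, not part of the exercise, and the default dictionary path is the Linux one mentioned earlier.

```shell
#!/bin/bash
# myspell: report words of a textfile that do not appear in the dictionary.
# DICT is a hypothetical override added for this sketch; by default the
# Linux dictionary named in the exercise is used.
check_spelling() {
    tr -cs 'A-Za-z' '\n' < "$1" | sort -u |
        comm -23 - "${DICT:-/usr/share/dict/words}"
}

# $# is the number of command-line arguments, $1 is the first of them.
if [ $# -eq 1 ]; then
    check_spelling "$1"
fi
```

Quoting "$1" keeps the script working when the filename contains spaces. On macOS the script could be run as DICT=/usr/share/dict/web2 ./myspell classroom.txt.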
- 🌶 When sorting a textfile
containing both a header-line and data in multiple columns,
we must be careful to not sort the header-line, too,
else the header-line may end up in the "middle" of the lines of output.
Consider the textfile
australian-universities.tsv
If we just sort it by its 1st field (university name),
then the initial header line will be incorrectly positioned in the
middle of the output.
Write a shellscript named sorttable to sort the textfile,
australian-universities.tsv
by the number of its international students,
while keeping the header-line at the top of the output.
This problem presents a challenge, but for the wrong reasons.
Firstly, we need to save a copy of the file's header line
in a temporary file:
shell> head -1 australian-universities.tsv > headerline
Next we need to sort the data by its third (numeric) field.
This was easy for our .csv files,
but here we have a .tsv file,
so we need to indicate to sort
that a tab character is the field delimiter.
This is the challenge, because tab, along with space,
separates command arguments in bash.
Searching the web gives this solution,
which stores a tab character in a shell variable,
and then provides it as a parameter to sort:
shell> TAB=`echo -e "\t"`
shell> sort -t "$TAB" -k3 -n australian-universities.tsv
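There are other ways to obtain the tab character, and the sort key can be stated more tightly. The following is a sketch rather than a replacement for the commands above; the names TAB2 and sort_by_third are invented for illustration.

```shell
#!/bin/bash
# Two further ways to obtain a literal tab character.
TAB=$(printf '\t')      # printf interprets \t itself
TAB2=$'\t'              # bash's $'...' quoting, with no extra process

# sort_by_third FILE: sort tab-separated FILE numerically on field 3 only.
# (sort_by_third is a hypothetical helper name.)
sort_by_third() {
    # -k3,3 ends the key at field 3; a bare -k3 runs to the end of the line
    sort -t "$TAB" -k3,3n "$1"
}
```

Because the delimiter is a tab, a name such as "University of New England" remains a single field even though it contains spaces.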
We are now correctly sorting the data,
but we're still including the header line in the sorted output
(and we can't cheat by knowing what the header line looks like).
Drawing on the previous exercise,
we can filter the header line from this output using comm.
Think hard about why this works!
shell> sort -t "$TAB" -k3 -n australian-universities.tsv | comm -23 - headerline
Finally, we need to add the header line back at the top of this output,
and remember to remove the temporary file:
shell> sort -t "$TAB" -k3 -n australian-universities.tsv | comm -23 - headerline | cat headerline -
Name Local International Total
University of Notre Dame Australia 10633 327 10960
University of New England 19833 1079 20912
Charles Darwin University 9687 1161 10848
.....
RMIT University 30843 26590 57433
shell> rm headerline
As we'll be writing this in a new executable shellscript named sorttable,
the only output we see is the results,
and the shellscript 'silently' removes the temporary file.
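For comparison, here is a sketch of a different approach that needs no temporary file. It is not the comm-based method developed above: it assumes only that the header occupies exactly the first line of the file, and splits the file with head and tail instead. The function name sorttable_alt is hypothetical.

```shell
#!/bin/bash
# sorttable_alt FILE: print FILE's header line, then the remaining lines
# sorted numerically on the third tab-separated field.
# (sorttable_alt is a hypothetical name for this alternative sketch.)
sorttable_alt() {
    TAB=$(printf '\t')
    # The header line, untouched.
    head -n 1 "$1"
    # Every line from the second onwards, sorted on field 3.
    tail -n +2 "$1" | sort -t "$TAB" -k3,3n
}
```

It could be called as sorttable_alt australian-universities.tsv. Since comm is documented to expect sorted input, this variant also sidesteps the question of how comm treats data that has been sorted numerically rather than alphabetically.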
April 2020.