Computer Science and Software Engineering

Department of Computer Science and Software Engineering

CITS4407 Open Source Tools and Scripting

Exercise sheet 3 - for the week commencing 23rd March 2020

These exercises should be undertaken using bash and the Terminal window application (under macOS, Linux, or Windows WSL).

[file required: cits4407.txt]
A very basic introduction to vi
You'll want a simple vi Editor "Cheat Sheet" close by.
Now:
1. Edit the file cits4407.txt
2. Immediately stop editing (quit) the file.
3. Use the arrow keys to move to the last character on the last line.
4. Use the arrow keys to move to the word 'philosophy'. Delete the word and its following comma.
5. Now save the modified file to disk, but do not exit or quit.
6. Use a cursor navigation method other than the arrow keys to move to the word 'pipes'. Before 'pipes' insert the word 'communication'.
7. Now quit the editor without saving your changes. You should now be back at the shell prompt.
[Refresher] Given a file of simple plain text, such as unix-1969-1971.txt, develop a command sequence to uniquely list all words found in the file (list each unique word just once).
We'll define a "word" to be any sequence of one or more alphabetic characters. The words "Hello" and "hello" are distinct words, and all non-alphabetic characters should be ignored. The desired output is simply the list of words (each listed only once) in the file. For the given file, the output would begin with:
ASR
Bell
But
Computer
He
John
Ken
Laboratories
.....
ⓘ Helpful command for this exercise: tr.
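One possible sketch of this refresher, using tr and sort. The function name unique_words is hypothetical, and LC_ALL=C is an assumption made so that uppercase words sort before lowercase ones, as in the sample output above.

```shell
# List each distinct word of a text file exactly once.
# unique_words is a hypothetical name chosen for this sketch.
unique_words() {
    # tr -c complements the set and -s squeezes repeats, so every run of
    # non-alphabetic characters becomes a single newline (one word per line).
    tr -cs 'A-Za-z' '\n' < "$1" |
        grep . |                # drop the empty line produced by leading punctuation
        LC_ALL=C sort -u        # sort bytewise and keep one copy of each word
}
```

For example, unique_words unix-1969-1971.txt would print the word list for the file named in the exercise.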
For this exercise, you're asked to modify your previous solution, and develop a (very) rudimentary spelling checker. Firstly, we'll need a "dictionary" of valid words.
- On Linux platforms, the file /usr/share/dict/words provides a collection of words collated from (many old) newspaper articles.
  There's also a copy here: usr-share-dict-words.
- On macOS, the file /usr/share/dict/web2 provides a collection of words found in the 1934 edition of Webster's International Dictionary.
View the dictionary on your system using less.
For this exercise, let's more rigorously define a "word" to be three or more lowercase characters. Now, with reference to the dictionary on your system, develop a command or shellscript that finds the words in the textfile (from the first exercise) that do not appear in the dictionary - potential spelling errors.
ⓘ Helpful commands for this exercise: tr, comm, sort.
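A minimal sketch of one way to combine tr, sort, and comm here. It reads "three or more lowercase characters" as "keep only words made entirely of lowercase letters, at least three long", which is just one interpretation. The function name spellcheck and its optional second argument are hypothetical. Because comm expects both inputs to be sorted in the same order, both sides are sorted under LC_ALL=C.

```shell
# Print the words of a textfile that do not appear in a dictionary.
# spellcheck and its optional second argument are hypothetical names.
spellcheck() {
    textfile=$1
    dict=${2:-/usr/share/dict/words}    # on macOS, pass /usr/share/dict/web2

    # comm needs sorted input, so sort the dictionary the same way as the words.
    sorted_dict=$(mktemp)
    LC_ALL=C sort -u "$dict" > "$sorted_dict"

    tr -cs 'A-Za-z' '\n' < "$textfile" |    # one word per line
        grep -E '^[a-z]{3,}$' |             # all lowercase, 3 or more letters
        LC_ALL=C sort -u |
        LC_ALL=C comm -23 - "$sorted_dict"  # lines found only in the word list

    rm -f "$sorted_dict"
}
```

The -23 option suppresses comm's second and third columns, leaving only the words that are missing from the dictionary.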
Next, create a new textfile using vi, and write a shellscript to execute the command sequence from the previous exercise.
You'll want a simple vi Editor "Cheat Sheet" close by.
When finished creating your shellscript, exit the vi editor, make your shellscript executable, and test that it works.
Now, extend the previous exercise so that your shellscript receives a command-line argument informing it which textfile it should spellcheck.
Inside your shellscript (file), you can access the provided command-line argument using the value of $1.
Here are a few more textfiles: classroom.txt, debugging.txt, autonomous-vehicles.txt, and baseball.txt.
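The following sketch shows what such a shellscript might look like, and how it is made executable. The filename spellcheck.sh and the DICT environment variable are hypothetical, and the script is written with a here-document only so that the example is self-contained; in the exercise you would type the script's contents in vi.

```shell
# Write the script to disk (in the exercise, type these lines in vi instead).
cat > spellcheck.sh <<'EOF'
#!/usr/bin/env bash
# Usage: ./spellcheck.sh textfile
# DICT is a hypothetical environment variable for choosing another dictionary.

if [ $# -ne 1 ]; then
    echo "Usage: $0 textfile" >&2
    exit 1
fi

dict=${DICT:-/usr/share/dict/words}
sorted_dict=$(mktemp)
LC_ALL=C sort -u "$dict" > "$sorted_dict"

# $1 is the first command-line argument: the textfile to spellcheck.
tr -cs 'A-Za-z' '\n' < "$1" |
    grep -E '^[a-z]{3,}$' |
    LC_ALL=C sort -u |
    LC_ALL=C comm -23 - "$sorted_dict"

rm -f "$sorted_dict"
EOF

# Make the script executable so that it can be run by name.
chmod +x spellcheck.sh
```

It could then be run as ./spellcheck.sh classroom.txt, or as ./spellcheck.sh debugging.txt for a different file.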
🌶 When sorting a textfile containing both a header-line and data in multiple columns, we must be careful to not sort the header-line, too, else the header-line may end up in the "middle" of the lines of output.
Consider the textfile australian-universities.tsv
If we just sort it by its 1st field (state name), then the initial header line will be incorrectly positioned in the middle of the output.
Write a shellscript named sorttable to sort the textfile australian-universities.tsv by the number of its international students, while keeping the header-line at the top of the output.
ⓘ Helpful commands for this exercise: sort, head, tail.
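A sketch of the head and tail approach, written as a function so that it can be tried on any tab-separated file. The exercise does not say which column holds the international student count, so the column number is a parameter here, and its default of 3 is purely an assumption; check the header-line of the real file first.

```shell
# Print the header-line untouched, then the remaining lines sorted numerically.
# The default column (3) is an ASSUMPTION, not taken from the data file.
sorttable() {
    file=$1
    col=${2:-3}
    tab=$(printf '\t')                  # a literal TAB for sort's -t option

    head -n 1 "$file"                   # the header-line stays at the top
    tail -n +2 "$file" |                # every line from the second one onwards
        sort -t "$tab" -k "${col},${col}n"
}
```

Appending r to the key (giving "${col},${col}nr") would list the largest counts first.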

Additional exercises involving filtering text data

A few students have asked for some additional exercises in filtering plain-text data (similar to Exercise sheet 2).

ⓘ Helpful commands for these exercises: cut, sort, uniq, grep, head, tail, wc.

[file required: UWA-ENROLMENTS.tsv]
The file UWA-ENROLMENTS.tsv is a 42,000-line textfile. The first column provides (randomized) UWA student numbers, and the second column presents the units in which they were enrolled (data is not from this year).
1. How many distinct enrolments (lines) are there in the file?
2. How many distinct students are there in the file?
3. How many distinct units are there in the file?
4. How many distinct teaching periods (similar to semesters) are there?
5. How many (CITS) units are presented by Computer Science and Software Engineering?
6. Which unit(s) have the largest enrolment?
7. 🌶Which student(s) are taking the most units this year?
8. 🌶🌶Which units are offered in more than one teaching period?
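Several of these questions follow the same pattern: cut out a column, sort it, then count. The sketch below assumes tab-separated columns with no header-line, student numbers in column 1 and units in column 2 (as described above), and unit codes that begin with their four-letter prefix such as CITS. All function names are hypothetical. Teaching periods are not covered, because the description above does not say where they are stored.

```shell
# Count the distinct values found in one column of a tab-separated file.
count_distinct() {      # usage: count_distinct file column
    cut -f "$2" "$1" | sort -u | wc -l
}

# Count the distinct values in a column that begin with a given prefix.
count_prefixed() {      # usage: count_prefixed file column prefix
    cut -f "$2" "$1" | sort -u | grep -c "^$3"
}

# List the most frequent values in a column, each preceded by its count.
most_common() {         # usage: most_common file column [how-many]
    cut -f "$2" "$1" | sort | uniq -c | sort -rn | head -n "${3:-1}"
}
```

For instance, count_distinct UWA-ENROLMENTS.tsv 1 would count students, and most_common UWA-ENROLMENTS.tsv 2 5 would list the five largest units. Ties for first place are easier to spot when asking for more than one line.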
[files required: wificapture-1.txt and wificapture-2.txt]
The shorter file wificapture-1.txt provides details of 10,000 captured wireless Ethernet (WiFi) frames, as does the longer file wificapture-2.txt with 230,000 frames.
The contents of each frame have then been formatted to a textfile, providing details of each frame (one frame per line). Note that only the frame's header is captured, and none of its data-payload (which is likely encrypted, anyway). Thus, the only privacy concerns exposed by this data include which device was communicating with which other device, how often, how much, and at what time. No personal or private data is exposed.
The tab-separated fields of each line (frame) are:
- time-of-day (in seconds and microseconds),
- the transmitting device's distinct MAC (Media Access Control) address,
- the receiving device's distinct MAC address,
- the source device's distinct MAC address,
- the destination device's distinct MAC address,
- the length (in bytes) of the frame,
- the signal strength with which the frame was received, and
- a short English description of the frame.
Different frame types will appear to have different numbers of fields. Actually, all fields are present, and an 'empty' field will be represented by 2 TAB characters in a row. For the following exercises, we are only interested in the source and the destination MAC addresses, which will always be present.
While processing each frame (line) ignore all MAC addresses of the form ff:ff:ff:ff:ff:ff - the special broadcast address for frames transmitted to any device that can hear it.
Develop a number of short command sequences to find:
1. The single source device sending traffic most frequently,
2. 🌶The single source device sending the greatest volume of traffic,
3. 🌶🌶The 5 pairs of source and destination devices which collectively (considered pairwise) send the highest number of frames.
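A possible starting point for the first two questions, assuming the field order listed above (source MAC address in field 4, frame length in field 6). The function names are hypothetical. Both cut and awk -F '\t' treat two TABs in a row as an empty field, so the field numbers stay stable across frame types. Summing the lengths uses awk, which is not among the commands suggested so far.

```shell
# ASSUMPTION: field 4 is the source MAC address, field 6 the frame length.

# The source device that sent the most frames, preceded by its frame count.
top_source_by_frames() {
    cut -f 4 "$1" |
        grep -v -x -F 'ff:ff:ff:ff:ff:ff' |    # ignore the broadcast address
        sort | uniq -c | sort -rn | head -n 1
}

# The source device that sent the most bytes, preceded by its byte total.
top_source_by_bytes() {
    awk -F '\t' '$4 != "ff:ff:ff:ff:ff:ff" { bytes[$4] += $6 }
                 END { for (mac in bytes) print bytes[mac], mac }' "$1" |
        sort -rn | head -n 1
}
```

For example, top_source_by_frames wificapture-1.txt is a quick check on the shorter file before trying the longer one.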

Chris McDonald
March 2020.

The University of Western Australia

School of Computer Science and Software Engineering
