Computer Science and Software Engineering

Department of Computer Science and Software Engineering

CITS4407 Open Source Tools and Scripting

Exercise sheet 5 - for the week commencing 27th April 2020

These exercises should be undertaken using bash and the Terminal window application (under macOS, Linux, or Windows WSL).

In the lecture of Week-8 we introduced Regular Expressions (REs), strings of text (patterns) that allow you to match, locate, and manage text. Regular Expressions, often termed regexes, are supported in many contemporary programming languages, such as Python and Java, in most character- and GUI-based text editors, and many command-line programs on Linux and macOS systems.

The website RegexOne presents both a tutorial on Regular Expressions, and simple interactive examples that you can attempt to test your understanding. Place your mouse over the words "Interactive Tutorial" near the top of the page, and a popup will display the list of 1-page tutorials and exercises. Regular Expressions can get very complex, far more so than this website's examples, but for this Exercise Sheet 5 read and attempt the lessons from "Lesson 1: An Introduction, and the ABCs" through to "Lesson 10: Starting and ending".
If you'd like some more practice with Regular Expressions, the website Regex Golf (Classic) presents a number of fun challenges asking you to find Regular Expressions matching one set of words, but not matching another set.
[files required: articles-small.txt and articles-big.txt]

We sometimes wonder why we receive spam email, and how the spammers have found our email addresses. A very simple technique, termed email harvesting, involves locating email addresses on webpages and in textfiles, perhaps associating them with the content/genre of the source document, and then selling them by the millions.

Consider the two textfiles linked above (they are described in more detail in the next task). Think of what a 'standard' email address looks like - your email address, addresses of some friends, addresses of some well-known companies. Design a Regular Expression that matches the format of those email addresses.

Use the command grep to find email addresses in the small and big textfiles linked above. You may like to define a bash alias for grep:

shell> alias grep="grep --color=auto"

which requests that grep displays its matches in colour (probably red).
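As a starting point only, the following sketch shows how grep's -E (extended Regular Expressions) and -o (print only the matched text) options can be combined to list addresses. The pattern in email_re is a deliberately simplified assumption, not a complete description of valid email addresses, and the sample lines are made up so that the commands can be run without the exercise files.

```shell
# Hypothetical sample text; for the exercise, search articles-small.txt instead.
sample=$(mktemp)
printf '%s\n' \
  'From alice@example.com Mon Apr 27 09:00:00 2015' \
  'Please contact bob.smith@cs.example.edu.au for details.' \
  'This line has no address in it.' > "$sample"

# A simplified pattern (an assumption): a local part, an @ sign,
# then two or more dot-separated domain labels.
email_re='[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+'

# -E selects extended REs; -o prints each match on its own line.
found=$(grep -E -o "$email_re" "$sample")
echo "$found"

rm -f "$sample"
```

Try the same pattern on the small textfile, and look for addresses that it misses or matches incorrectly, then refine it.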

Can you develop different Regular Expressions to distinguish between UWA's student email addresses, and those of UWA staff?
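One possible approach is sketched below. It assumes, without confirmation from this sheet, that student addresses consist of digits followed by @student.uwa.edu.au, and that staff addresses have a name-like local part followed directly by @uwa.edu.au. Check both assumptions against the addresses you actually find in the textfiles; the sample addresses here are invented.

```shell
# Invented sample addresses, used only so the commands can be run as-is.
sample=$(mktemp)
printf '%s\n' \
  '12345678@student.uwa.edu.au' \
  'jane.citizen@uwa.edu.au' \
  'someone@example.com' > "$sample"

# Assumed formats (verify against the real data):
student_re='[0-9]+@student\.uwa\.edu\.au'       # digits, then the student domain
staff_re='[A-Za-z][A-Za-z0-9._-]*@uwa\.edu\.au' # name-like local part, staff domain

students=$(grep -E -o "$student_re" "$sample")
staff=$(grep -E -o "$staff_re" "$sample")

echo "students: $students"
echo "staff:    $staff"

rm -f "$sample"
```

Note that the dots in the domain names are escaped as \. so that they match a literal dot rather than any character.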
[files required: articles-small.txt and articles-big.txt]

awk is a simple (and limited) programming language based on Regular Expressions, with a syntax a little like the C programming language. Its unusual name is an acronym from its 3 authors' surnames - Alfred Aho, Peter Weinberger, and Brian Kernighan, all very famous Computer Scientists.

Like bash, awk supports variables, conditions, and loops, and awk programs are usually stored in textfiles named awk scripts. Small statement sequences are executed when lines of input text match Regular Expression patterns. awk should be installed, by default, on your Linux or macOS systems as it is often used for systems-administration tasks. Historically, awk has outlived a similar language named Perl, and now 'overlaps' with many application areas of Python.
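To illustrate the idea of statements being executed when a line matches a pattern, here is a minimal sketch using made-up input. The variable name total and the three input lines are hypothetical and chosen only for illustration.

```shell
# Each awk rule has the form:  pattern { action }
# The first rule runs only for lines beginning with "a";
# the END rule runs once, after all input has been read.
result=$(printf '%s\n' 'apple 3' 'banana 5' 'avocado 2' |
  awk '
    /^a/ { total += $2 }   # $2 is the second whitespace-separated field
    END  { print total }
  ')
echo "$result"
```

Only the lines for apple and avocado match /^a/, so only their second fields are added to total.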

Tutorialspoint presents an easy-to-read Introduction to awk. Unless you are interested, do not try to read the whole tutorial, but do skim its sections entitled Overview, Workflow, Basic Examples, Builtin Variables, Arrays, and Control Flow.

These textfiles contain some of the articles posted to the help2002 forum in 2015 (similar to our help4407 this year). Each textfile is an example of a standard email mailbox (if interested, see RFC 4155, although it is beyond the scope of this exercise). First examine the smaller file (the larger file just contains more articles). Each of its 6 articles begins with a line commencing with the pattern 'From ', and ends at the line before the next article, or at the end of the file.
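The article boundaries described above can be sketched in awk as follows. The tiny two-article mailbox is invented for illustration (its addresses and subjects are not taken from the real files), and the sketch assumes that no line inside an article body begins with 'From '.

```shell
# An invented mailbox with two articles; the real files have the same shape.
mbox=$(mktemp)
printf '%s\n' \
  'From alice@example.com Mon Apr 27 09:00:00 2015' \
  'Subject: first question' \
  '' \
  'Body of the first article.' \
  'From bob@example.com Mon Apr 27 10:00:00 2015' \
  'Subject: Re: first question' \
  '' \
  'Body of the second article.' \
  'A second body line.' > "$mbox"

# For each article, print its starting line number and its length in lines.
# NR is the awk builtin variable holding the current line number.
summary=$(awk '
  /^From / { if (start) print start, NR - start; start = NR }
  END      { if (start) print start, NR - start + 1 }
' "$mbox")
echo "$summary"

rm -f "$mbox"
```

The first article occupies lines 1 to 4 and the second occupies lines 5 to 9, so the sketch reports one line per article giving the start line and the length.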

Let's do some simple processing of the articles in each of the files, using the awk programming language. We can compare many of these results with those produced by (more complex) traditional bash command sequences.
1. How many lines are in the file? Ensure that both awk and wc -l produce the same results.
2. How many distinct posts are there in the file (firstly, determine what counts as a distinct post)?
3. Which person has posted the most articles?
4. 🌶 Which person has posted the greatest number of lines?
5. 🌶 A thread of discussion involves a number of articles with the same subject line. Which thread has the greatest number of replies in each file?
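As a hedged starting point for the first and third questions, the sketch below counts lines with both awk and wc -l, and then tallies posters with an awk array. It assumes that a poster is identified by the second field of each 'From ' line, which you should check against the real files (you may decide a different header line is a better choice). The mailbox contents are invented.

```shell
# An invented mailbox with three articles, two of them from the same poster.
mbox=$(mktemp)
printf '%s\n' \
  'From alice@example.com Mon Apr 27 09:00:00 2015' \
  'Subject: first question' \
  'From bob@example.com Mon Apr 27 10:00:00 2015' \
  'Subject: Re: first question' \
  'From alice@example.com Mon Apr 27 11:00:00 2015' \
  'Subject: Re: first question' \
  'Thanks!' > "$mbox"

# Question 1: in the END rule, NR holds the total number of lines read.
awk_lines=$(awk 'END { print NR }' "$mbox")
wc_lines=$(wc -l < "$mbox" | tr -d ' ')   # tr strips padding added by some wc versions

# Question 3 (assumption: field 2 of a "From " line names the poster).
top_poster=$(awk '
  /^From / { count[$2]++ }
  END {
    for (p in count)
      if (count[p] > max) { max = count[p]; top = p }
    print top, max
  }
' "$mbox")

echo "$awk_lines $wc_lines"
echo "$top_poster"

# A traditional pipeline for comparison:
grep '^From ' "$mbox" | cut -d' ' -f2 | sort | uniq -c | sort -rn | head -1

rm -f "$mbox"
```

If two posters are tied, the awk version reports whichever it happens to visit first, because the order of a for-in loop over an awk array is unspecified.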

Chris McDonald
April 2020.

The University of Western Australia

School of Computer Science and Software Engineering

CRICOS Code: 00126G

Written by: [email protected]