In the Week-8 lecture we introduced Regular Expressions (REs), strings of text (patterns) that allow you to match, locate, and manage text. Regular Expressions, often termed regexes, are supported in many contemporary programming languages, such as Python and Java, in most character- and GUI-based text editors, and in many command-line programs on Linux and macOS systems.
We sometimes wonder why we receive spam email, and how the spammers have found our email addresses. A very simple technique, termed email harvesting, involves locating email addresses on webpages and in textfiles, perhaps associating them with the content/genre of the source document, and then selling them by the millions.
Consider the two textfiles linked above (they are described in more detail in the next task). Think of what a 'standard' email address looks like - your email address, addresses of some friends, addresses of some well-known companies. Design a Regular Expression that matches the format of those email addresses.
Use the command grep to find email addresses in the small and big textfiles linked above. You may like to define a bash alias for grep:
which requests that grep displays its matches in colour (probably red).
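The alias line itself is not reproduced above. A commonly used definition, assumed here, is alias grep='grep --color=auto'. The sketch below pairs it with a deliberately loose email pattern, intended only as a starting point for your own design. The names EMAIL_RE and find_emails are hypothetical, and the pattern will both miss some valid addresses and accept some invalid ones.

```shell
# Assumed form of the alias: --color=auto highlights matches only when
# the output goes to a terminal.
alias grep='grep --color=auto'

# A loose starting pattern (hypothetical name): one or more "local"
# characters, an @, then two or more dot-separated domain labels.
EMAIL_RE='[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+'

# find_emails FILE: print every match on its own line.
#   -E  use extended regular expressions
#   -o  print only the matching text, not the whole line
find_emails() {
    grep -E -o "$EMAIL_RE" "$1"
}
```

For example, find_emails smallfile | sort -u would list each distinct address once, where smallfile stands for whichever textfile you downloaded.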
Can you develop different Regular Expressions to distinguish between UWA's student email addresses, and those of UWA staff?
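One possible approach is sketched below. It assumes that student addresses consist of a student number followed by @student.uwa.edu.au, and that staff addresses use a name followed directly by @uwa.edu.au. Both formats are assumptions to check against the addresses you actually find in the textfiles, and the names STUDENT_RE, STAFF_RE, find_students, and find_staff are hypothetical.

```shell
# ASSUMPTION: students look like 12345678@student.uwa.edu.au
STUDENT_RE='[0-9]+@student\.uwa\.edu\.au'

# ASSUMPTION: staff look like jane.doe@uwa.edu.au (no "student." subdomain)
STAFF_RE='[A-Za-z][A-Za-z0-9._-]*@uwa\.edu\.au'

# Each function prints the matching addresses, one per line.
find_students() {
    grep -E -o "$STUDENT_RE" "$1"
}

find_staff() {
    grep -E -o "$STAFF_RE" "$1"
}
```

Note that the dots in the domain are escaped as \. so that they match a literal dot rather than any character.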
awk is a simple (and limited) programming language based on Regular Expressions, with a syntax a little like the C programming language. Its unusual name is an acronym from its 3 authors' surnames - Alfred Aho, Peter Weinberger, and Brian Kernighan, all very famous Computer Scientists.
Like bash, awk supports variables, conditions, and loops, and awk programs are usually stored in textfiles named awk scripts. Small statement sequences are executed when lines of input text match Regular Expression patterns. awk should be installed, by default, on your Linux or macOS systems as it is often used to undertake systems-administration tasks. Historically, awk predates a similar language named Perl, and now 'overlaps' with many application areas of Python.
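To make the pattern-and-action idea concrete, here is a minimal sketch that uses a variable, a loop, and a condition in one short awk program. The function name count_at_words is hypothetical. It counts the whitespace-separated words that contain an @ character.

```shell
# count_at_words FILE: count the words in FILE that contain '@'.
count_at_words() {
    awk '
        BEGIN { total = 0 }               # runs once, before any input is read
        /@/ {                             # pattern: only lines containing "@"
            for (i = 1; i <= NF; i++)     # loop over the fields of this line
                if ($i ~ /@/)             # condition tested on a single field
                    total++
        }
        END { print total }               # runs once, after the last line
    ' "$1"
}
```

The same program could equally be saved in its own awk script and run with awk -f; wrapping it in a shell function here just keeps the example in one place.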
Tutorialspoint presents an easy-to-read Introduction to awk. Unless you are interested, do not try to read the whole tutorial, but do skim its sections entitled Overview, Workflow, Basic Examples, Builtin Variables, Arrays, and Control Flow.
These textfiles contain some of the articles posted to the help2002 forum in 2015 (similar to our help4407 this year). Each textfile is an example of a standard email mailbox (if interested, see RFC-4155, though it is beyond the scope of this exercise). First examine the smaller file (the larger file just contains more articles). Each of its 6 articles begins with a line commencing with the pattern 'From ', and ends at the line before the next article, or the end of the file.
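The article boundaries described above can be sketched in awk as follows. The function name article_lengths is hypothetical. For each article it prints the article's position in the file and its length in lines, counting the 'From ' line itself.

```shell
# article_lengths FILE: print "<article-number> <line-count>" per article.
article_lengths() {
    awk '
        /^From / {                        # a new article starts here
            if (n > 0) print n, lines     # report the article just finished
            n++
            lines = 0
        }
        n > 0 { lines++ }                 # count lines once inside an article
        END { if (n > 0) print n, lines } # report the final article
    ' "$1"
}
```

Run on the smaller file, this should print six lines if the file holds six articles as stated.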
Let's do some simple processing of the articles in each of the files, using the awk programming language. We can compare many of these results with those from (more complex) traditional bash command sequences.
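As one illustration of such a comparison, the sketch below counts the articles posted by each sender in two ways: once with an awk associative array, and once with a traditional pipeline. It assumes that the second field of each 'From ' line is the sender's address, as in the usual mailbox layout. The function names are hypothetical.

```shell
# awk version: count[] is an associative array indexed by sender address.
posts_per_sender_awk() {
    awk '
        /^From / { count[$2]++ }                      # ASSUMPTION: $2 is the sender
        END { for (s in count) print count[s], s }    # order of "for in" is arbitrary
    ' "$1" | sort -k1,1nr -k2                         # most frequent sender first
}

# Traditional pipeline: the same result from grep, cut, sort, and uniq.
posts_per_sender_bash() {
    grep '^From ' "$1" |
        cut -d' ' -f2 |        # keep only the sender field
        sort | uniq -c |       # count identical adjacent lines
        sed 's/^ *//' |        # strip the padding added by uniq -c
        sort -k1,1nr -k2
}
```

Both functions should print identical output for the same mailbox, which gives a simple way to check one against the other.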