Use the command grep to find email addresses in the small and big textfiles linked above.
An email address is assumed to include a username, the '@' symbol, and a hostname.
We need to define the set of characters that may appear before the '@',
and another set of characters that may appear after the '@'.
We'll settle for a simple solution, because
no single pattern matches every valid email address perfectly.
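One possible starting point is sketched below. The two character sets are our own choice, not a definitive answer: letters, digits and . _ % + - before the '@', and letters, digits, dots and hyphens after it. The function name find_emails and the sample lines are made up for illustration, and the -o option (print only the matching text) is provided by GNU and BSD grep but is not required by POSIX.

```shell
# Sketch: list every email-like string, one per line.
# Username set: letters, digits and . _ % + -
# Hostname set: letters, digits, dots and hyphens
find_emails() {
    grep -E -o '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+' "$@"
}

# Try it on a few made-up lines (pass a filename instead to scan a textfile).
printf '%s\n' \
    'From: alice@example.com (Alice)' \
    'no address on this line' \
    'cc bob.smith@cs.example.edu, carol_99@example.org' | find_emails
```

Being simple, this pattern also accepts some strings that are not real addresses; for example, a full stop that ends a sentence directly after an address is kept as part of the hostname.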
Can you develop different Regular Expressions to distinguish between UWA's student email addresses, and those of UWA staff?
UWA staff numbers begin with a '0', and are then followed by 7 more digits.
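We can specify this using a character range: '0[0-9]{7}' (with grep, the {7} repetition count requires the -E option)
or using a character-set shortcut (not universally supported; GNU grep, for example, accepts it only with the -P option): '0\d{7}'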
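A minimal sketch of how the two kinds of address might be separated follows. It rests on assumptions that are not stated above: that the number is the whole username before the '@', and that a student number is 8 digits beginning with a non-zero digit. The function names and the example.edu hostnames are placeholders only.

```shell
# Pull out every email-like string, one per line (same simple pattern as before).
all_emails() {
    grep -E -o '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+' "$@"
}

# Staff: username is a '0' followed by exactly 7 more digits.
staff_emails() {
    all_emails "$@" | grep -E '^0[0-9]{7}@'
}

# ASSUMPTION: student numbers are 8 digits that do not begin with '0'.
student_emails() {
    all_emails "$@" | grep -E '^[1-9][0-9]{7}@'
}

# Made-up sample lines; the hostnames are placeholders.
sample='contact 00012345@example.edu today
21234567@student.example.edu
alice@example.com
123456789@example.com'

echo 'staff:';    printf '%s\n' "$sample" | staff_emails
echo 'students:'; printf '%s\n' "$sample" | student_emails
```

Extracting the addresses first and then anchoring the second pattern with '^' means a 9-digit username such as 123456789 is rejected rather than partially matched.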
awk is a simple (and limited) programming language based on Regular Expressions, with a syntax a little like the C programming language.
Let's do some simple processing of the articles in each of the files, using the awk programming language. We can compare many of these results with those of (more complex) traditional bash command sequences.
We place this and all following solutions in a shell script which simply invokes awk, telling it to read the awk program from the remainder of the file. The input is read from standard input or from a filename passed on the command line.
We need to find the beginning of each article, and then find out who posted it. awk splits each line using its FS variable (which, conveniently, defaults to any run of spaces or tabs), so we can extract the poster's identity from the 2nd field on the line ($2). At the end of the input file, we examine the count of each poster's articles, find the maximum of these, and print the result.
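The steps above might be sketched as follows. This is not necessarily the intended solution: it assumes each article begins with an mbox-style line of the form "From poster date", which the text above does not spell out, and it wraps awk in a shell function (the hypothetical most_articles) rather than a standalone script. The sample articles are invented.

```shell
# Sketch: report the poster with the most articles, and how many.
most_articles() {
    awk '
    # ASSUMPTION: an article starts with a line like "From poster date".
    /^From / { count[$2]++ }        # $2 is the poster

    END {
        max = 0
        for (p in count)
            if (count[p] > max) { max = count[p]; best = p }
        if (max > 0) print best, max
    }' "$@"
}

# Invented sample input; pass a filename instead to process a real file.
most_articles <<'EOF'
From alice@example.com Mon Jan  1 10:00:00 2001
Subject: lab 3 help

first body line
From bob@example.com Mon Jan  1 11:00:00 2001
Subject: lab 3 help

a reply
From alice@example.com Mon Jan  1 12:00:00 2001
Subject: exam date

another body line
EOF
```

When two posters tie for the maximum, this sketch prints only one of them, and which one is not defined by awk.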
Similar to the previous solution, we first need to determine the poster of each article. Then for the remainder of the current article, we need to remember if we're in the article's header or the article's body. The header and body are separated by a blank line, and we just count the lines in the bodies of each poster's articles. At the end of the input file, we examine the line count of each poster's articles, find the maximum of these, and print the result.
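One way to track the header/body state is sketched below, under the same assumption that an article begins with an mbox-style "From poster date" line. The function name most_body_lines, the variable names and the sample articles are all illustrative.

```shell
# Sketch: report the poster whose articles contain the most body lines.
most_body_lines() {
    awk '
    # ASSUMPTION: an article starts with a line like "From poster date".
    /^From /           { poster = $2; in_header = 1; next }  # header begins
    in_header && /^$/  { in_header = 0; next }               # blank line ends header
    !in_header && poster != "" { lines[poster]++ }           # a body line

    END {
        max = 0
        for (p in lines)
            if (lines[p] > max) { max = lines[p]; best = p }
        if (max > 0) print best, max
    }' "$@"
}

# Invented sample input: alice posts 2 body lines in total, bob posts 3.
most_body_lines <<'EOF'
From alice@example.com Mon Jan  1 10:00:00 2001
Subject: lab 3 help

one line from alice
From bob@example.com Mon Jan  1 11:00:00 2001
Subject: lab 3 help

bob line 1
bob line 2
bob line 3
From alice@example.com Mon Jan  1 12:00:00 2001
Subject: exam date

another line from alice
EOF
```

The blank line separating header from body is not counted, but blank lines inside a body are.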
Similar to the previous solution, we need to determine the poster of each article, and whether we're in each article's header or body. Each article's Subject: line is found in the header, and the subject-title of each article begins at character column 10. At the end of the input file, we examine the counts of each different subject-title, find the maximum of these, and print the result (the number of replies is one fewer than the total count for that subject-title, because the first article with that title is the original posting).
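A possible sketch of the subject count follows, again assuming mbox-style "From poster date" lines start each article, and also assuming replies repeat the original subject-title unchanged (a "Re: " prefix would first have to be stripped). The function name most_replies and the sample articles are invented.

```shell
# Sketch: report the subject-title that attracted the most replies.
most_replies() {
    awk '
    # ASSUMPTION: an article starts with a line like "From poster date".
    /^From /          { in_header = 1; next }    # header begins
    in_header && /^$/ { in_header = 0; next }    # blank line ends header

    # "Subject: " is 9 characters, so the title starts at column 10.
    in_header && /^Subject: / { count[substr($0, 10)]++ }

    END {
        max = 0
        for (s in count)
            if (count[s] > max) { max = count[s]; best = s }
        if (max > 0) print max - 1, best     # replies = articles - 1
    }' "$@"
}

# Invented sample input: "lab 3 help" appears 3 times, so it has 2 replies.
most_replies <<'EOF'
From alice@example.com Mon Jan  1 10:00:00 2001
Subject: lab 3 help

original question
From bob@example.com Mon Jan  1 11:00:00 2001
Subject: lab 3 help

first reply
From carol@example.com Mon Jan  1 12:00:00 2001
Subject: exam date

unrelated article
From alice@example.com Mon Jan  1 13:00:00 2001
Subject: lab 3 help

second reply
EOF
```

Checking in_header before matching Subject: keeps a body line that happens to start with "Subject: " from being counted.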