Computer Science and Software Engineering

Department of Computer Science and Software Engineering

CITS4407 Open Source Tools and Scripting

Exercise sheet 5 - sample solutions and discussion

The website RegexOne presents both a tutorial on Regular Expressions.....
If you'd like some more practice with Regular Expressions, the website Regex Golf (Classic)...
[files required: articles-small.txt and articles-big.txt]

Use the grep command to find email addresses in the small and big text files linked above.

An email address is assumed to include a username, the '@' symbol, and a hostname. We need to define the set of characters that may appear before the '@', and another set of characters that may appear after the '@'.
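We'll settle for a simple solution, because no perfect solution exists. Inside a bracket expression a '.' needs no backslash, and the '-' is placed last so that it is taken literally rather than as part of a range.

shell> grep -E '[A-Za-z0-9._-]+@[A-Za-z0-9._-]+' < filename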
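To see such a pattern at work, the following minimal sketch runs it over a few made-up lines. The file sample.txt and every address in it are invented for illustration and are not taken from the article files. Each bracket expression is written with '-' last so that the hyphen is literal. The -o option (available in GNU and BSD grep) prints only the matching text, one match per line, instead of the whole line.

```shell
# Create a small, made-up sample file (hypothetical data, for illustration only)
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.txt" <<'EOF'
From: alice.smith@example.com (Alice)
No address on this line.
Reply to bob_jones@mail.example.org or carol-x@example.net today
EOF

# -E selects extended regular expressions; -o prints each match on its own line
found=$(grep -Eo '[A-Za-z0-9._-]+@[A-Za-z0-9._-]+' < "$tmpdir/sample.txt")
printf '%s\n' "$found"

# Tidy up the temporary directory
rm -rf "$tmpdir"
```

Without -o, grep would print the first and third lines in full, which is enough to locate addresses but less convenient when they are to be counted or sorted.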

Can you develop different Regular Expressions to distinguish between UWA's student email addresses, and those of UWA staff?

UWA staff numbers begin with a '0', and are then followed by 7 more digits.
We can specify this using a character range: '0[0-9]{7}'
or using a character-class shortcut, which grep -E does not support (GNU grep accepts it with the -P option): '0\d{7}'

shell> grep -E '0[0-9]{7}@[A-Za-z0-9._-]+' < filename
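One way to separate the two groups, sketched below under stated assumptions, is to extract every address first and then split the list with a second grep: addresses whose username is '0' followed by 7 digits are treated as staff, and grep -v collects everything else. The sample file, the example.edu hostnames, and the variable names addr, staff and others are all hypothetical. The sketch does not assume any particular format for student addresses.

```shell
# Made-up sample data (hypothetical addresses, for illustration only)
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.txt" <<'EOF'
Contact 00012345@example.edu for marking queries.
Submitted by 21234567@student.example.edu last week.
General enquiries: help-desk@example.edu
EOF

# The simple address pattern: username, '@', hostname
addr='[A-Za-z0-9._-]+@[A-Za-z0-9._-]+'

# Extract every address, then keep those whose username is '0' plus 7 digits
staff=$(grep -Eo "$addr" < "$tmpdir/sample.txt" | grep -E '^0[0-9]{7}@')

# -v inverts the match: every address that does NOT look like a staff number
others=$(grep -Eo "$addr" < "$tmpdir/sample.txt" | grep -Ev '^0[0-9]{7}@')

printf 'staff:\n%s\n' "$staff"
printf 'others:\n%s\n' "$others"

# Tidy up the temporary directory
rm -rf "$tmpdir"
```

Anchoring the second pattern with '^' matters here: once -o has put each address on its own line, '^0[0-9]{7}@' insists that the whole username is the 8-digit number, rather than merely ending in one.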
[files required: articles-small.txt and articles-big.txt]

awk is a simple (and limited) programming language based on Regular Expressions, with a syntax a little like the C programming language....

Let's do some simple processing of the articles in each of the files, using the awk programming language. We can compare many of these results with those from (more complex) traditional bash command sequences.
1. How many lines are in the file? Ensure that both awk and wc -l produce the same results.
  
  We place this and all following solutions in a shellscript which simply invokes awk, telling it to read the awk program from the remainder of the file. The input is read from standard-input or a filename passed on the command line.
  
  #!/usr/bin/awk -f

  # WE DON'T REALLY NEED A 'BEGIN' CLAUSE, AS ALL VARIABLES ARE INITIALISED TO ZERO
  BEGIN {
      lc = 0
  }

  # COUNT ALL LINES THAT HAVE A BEGINNING (ALL LINES!)
  /^/ {
      lc = lc + 1;
  }

  # AT THE END OF THE INPUT, PRINT THE LINE COUNT
  END {
      print lc
  }
2. How many distinct posts are there in the file? (First, determine what counts as a distinct post.)
  
  #!/usr/bin/awk -f

  BEGIN {
      lc = 0
  }

  # COUNT ALL LINES THAT INDICATE THE BEGINNING OF AN ARTICLE
  /^From / {
      lc = lc + 1;
  }

  END {
      print lc
  }
3. Which person has posted the most articles?
  
  We need to find the beginning of each article, and then find out who posted it. awk splits each line into fields using its FS variable (which, conveniently, defaults to splitting on runs of spaces and tabs), so we can extract the poster's identity from the 2nd field on the line ($2). At the end of the input file, we examine the count of each poster's articles, find the maximum of these, and print the result.
  
  #!/usr/bin/awk -f

  # NOTHING TO DO AT THE BEGINNING OF OUR SCRIPT (COULD OMIT)
  BEGIN {
  }

  # EACH TIME WE FIND THE BEGINNING OF AN ARTICLE, REMEMBER WHO POSTED IT
  /^From / {
      postedby[ $2 ] += 1;
  }

  # WHEN THE INPUT ENDS, WE FIND WHO POSTED THE MOST ARTICLES
  END {
      # INITIALISE OUR MAXIMUM POST COUNT WITH THE LOWEST 'IMPOSSIBLE' VALUE
      maxcount = -1;

      # ITERATE OVER THE COUNTS OF EACH POSTER
      for( who in postedby ) {
          # IS THIS A NEW MAXIMUM? REMEMBER NEW MAXIMUM AND WHO POSTED THEM
          if(postedby[who] > maxcount) {
              maxcount  = postedby[who];
              maxposter = who;
          }
      }

      # FINALLY, PRINT THE RESULT (FORMAT IS NOT IMPORTANT)
      print maxposter, "has posted", postedby[maxposter], "articles";
  }
4. 🌶 Which person has posted the greatest number of lines?
  
  Similar to the previous solution, we first need to determine the poster of each article. Then for the remainder of the current article, we need to remember if we're in the article's header or the article's body. The header and body are separated by a blank line, and we just count the lines in the bodies of each poster's articles. At the end of the input file, we examine the line count of each poster's articles, find the maximum of these, and print the result.
  
  #!/usr/bin/awk -f

  # USE THE inbody VARIABLE TO INDICATE IF WE'RE IN THE BODY OF AN ARTICLE
  BEGIN {
      false  = 0;
      true   = 1;
      inbody = false;
  }

  # EACH TIME WE FIND THE BEGINNING OF AN ARTICLE, REMEMBER WHO POSTED IT
  /^From / {
      who    = $2;
      inbody = false;
  }

  # FOUND A BLANK LINE - POSSIBLY SEPARATING HEADER FROM BODY
  /^$/ {
      if(inbody == false) {
          inbody = true;
      }
      else {
          linecount[ who ] += 1;
      }
  }

  # FOUND A NON-BLANK LINE; ARE WE IN A HEADER OR A BODY?
  # (/./ NEEDS AT LEAST ONE CHARACTER; /.*/ WOULD ALSO MATCH BLANK LINES AND COUNT THEM TWICE)
  /./ {
      if(inbody == true) {
          linecount[ who ] += 1;
      }
  }

  # WHEN THE INPUT ENDS, WE FIND WHO POSTED THE MOST LINES
  END {
      # INITIALISE OUR MAXIMUM LINE COUNT TO THE LOWEST 'IMPOSSIBLE' VALUE
      maxcount = -1;

      # ITERATE OVER THE COUNTS OF EACH POSTER
      for( who in linecount ) {
          # IS THIS A NEW MAXIMUM? REMEMBER NEW MAXIMUM AND WHO POSTED THEM
          if(linecount[who] > maxcount) {
              maxcount  = linecount[who];
              maxposter = who;
          }
      }

      # FINALLY, PRINT THE RESULT (FORMAT IS NOT IMPORTANT)
      print maxposter, "has posted", linecount[maxposter], "lines";
  }
5. 🌶 A thread of discussion involves a number of articles with the same subject line. Which thread has the greatest number of replies in each file?
  
  Similar to the previous solution, we need to know whether we're in each article's header or body. Each article's Subject: line is found in the header, and because the prefix "Subject: " occupies 9 characters, the subject-title of each article begins at character column 10. At the end of the input file, we examine the counts of each different subject-title, find the maximum of these, and print the result (the number of replies is one fewer than the total count of that subject-title).
  
  #!/usr/bin/awk -f

  # Roughly equivalent to (this script counts Subject: lines in headers only):
  #     grep '^Subject: ' < filename | sort | uniq -c | sort -n

  # USE THE inheader VARIABLE TO INDICATE IF WE'RE IN THE HEADER OF AN ARTICLE
  BEGIN {
      false    = 0;
      true     = 1;
      inheader = false;
  }

  # EACH TIME WE FIND THE BEGINNING OF AN ARTICLE....
  /^From / {
      inheader = true;
  }

  # A BLANK LINE ENDS THE HEADER; WITHOUT THIS, inheader WOULD NEVER BECOME FALSE AGAIN
  /^$/ {
      inheader = false;
  }

  # FOUND A Subject: LINE, AND ARE IN A HEADER, INCREMENT COUNT FOR THIS SUBJECT
  /^Subject: / {
      if(inheader == true) {
          subject = substr($0, 10);
          subjectcount[ subject ] += 1;
      }
  }

  # WHEN THE INPUT ENDS, WE FIND THE ARTICLE WITH THE MOST FREQUENT SUBJECT
  END {
      # INITIALISE OUR MAXIMUM SUBJECT COUNT TO THE LOWEST 'IMPOSSIBLE' VALUE
      maxcount = -1;

      # ITERATE OVER THE COUNTS OF EACH SUBJECT
      for( subject in subjectcount ) {
          # IS THIS A NEW MAXIMUM? REMEMBER NEW MAXIMUM AND SUBJECT
          if(subjectcount[subject] > maxcount) {
              maxcount   = subjectcount[subject];
              maxsubject = subject;
          }
      }

      # FINALLY, PRINT THE RESULT (FORMAT IS NOT IMPORTANT)
      print "\"" maxsubject "\" has " (subjectcount[maxsubject]-1) " replies";
  }
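The introduction to these questions suggests comparing the awk results with traditional command sequences. The following is a minimal sketch of that comparison for questions 1 to 3, run against a tiny made-up mailbox rather than the real article files. The sample articles, the example.com addresses and the shell variable names are hypothetical. For brevity the awk programs are given inline on the command line, and question 1 uses awk's built-in record counter NR in place of the lc variable.

```shell
# A tiny, made-up mailbox with three articles (hypothetical data)
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.txt" <<'EOF'
From alice@example.com Mon May  4 10:00:00 2020
Subject: awk question

How do I count lines?

From bob@example.com Mon May  4 11:00:00 2020
Subject: Re: awk question

Use NR in an END clause.

From alice@example.com Mon May  4 12:00:00 2020
Subject: Re: awk question

Thanks!
EOF
f="$tmpdir/sample.txt"

# Question 1: total lines, from awk (NR counts the records read) and from wc -l
awk_lines=$(awk 'END { print NR }' "$f")
wc_lines=$(wc -l < "$f" | tr -d ' ')

# Question 2: number of articles, from awk and from grep -c
awk_posts=$(awk '/^From / { n++ } END { print n }' "$f")
grep_posts=$(grep -c '^From ' "$f")

# Question 3: most frequent poster, using a traditional pipeline:
# take field 2 of each 'From ' line, count duplicates, sort by count, keep the top one
top_poster=$(grep '^From ' "$f" | cut -d' ' -f2 | sort | uniq -c | sort -rn |
             head -n 1 | awk '{ print $2 }')

echo "lines: awk=$awk_lines wc=$wc_lines"
echo "posts: awk=$awk_posts grep=$grep_posts"
echo "top poster: $top_poster"

# Tidy up the temporary directory
rm -rf "$tmpdir"
```

The pipeline for question 3 needs six commands to do what the awk script does with one associative array, which is the kind of trade-off the comparison is meant to expose.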

Chris McDonald
May 2020.

The University of Western Australia

School of Computer Science and Software Engineering
