Week 7 workshop exercises

Regular expressions

Download a copy of the ENROLMENTS-2017 file if you don’t have one handy.

Try answering the following questions:

  1. How many units start with ‘C’?
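    One possible approach, sketched on a tiny made-up sample rather than the real file, is to cut out the unit column, keep the codes beginning with ‘C’, and count the distinct ones. The sample records and the variable names are assumptions for illustration only; drop the sort -u stage to count enrolments instead of distinct unit/semester codes.

```shell
# Hypothetical sample standing in for ENROLMENTS-2017:
# one "student-number<TAB>unit/semester-code" record per line.
workdir=$(mktemp -d)
enrolments="$workdir/ENROLMENTS-2017"
printf '%s\t%s\n' \
    20000001 CITS4407-1 \
    20000001 CHEM1001-1 \
    20000002 CITS4407-1 \
    20000002 MATH1001-1 \
    20000003 CHEM1002-1 \
    20000004 PHYS1001-2 > "$enrolments"

# Field 2 is the unit/semester code; keep codes starting with C,
# collapse duplicates, and count what is left.
units_c=$(cut -f 2 "$enrolments" | grep '^C' | sort -u | wc -l)
echo "$units_c"
```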

  2. The start of a unit code (the portion in capital letters) indicates what school or department offers the unit.

    How many distinct schools/departments are there?

    Here is one solution, which relies on us assuming that school codes will always be 4 characters long:
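    A minimal sketch of such a pipeline, run on a made-up sample (the records and variable names below are assumptions, not the real data): the first cut selects the unit column, the second keeps its first 4 characters.

```shell
# Hypothetical sample standing in for ENROLMENTS-2017:
# one "student-number<TAB>unit/semester-code" record per line.
workdir=$(mktemp -d)
enrolments="$workdir/ENROLMENTS-2017"
printf '%s\t%s\n' \
    20000001 CITS4407-1 \
    20000001 CHEM1001-1 \
    20000002 CITS4407-1 \
    20000002 MATH1001-1 \
    20000003 CHEM1002-1 \
    20000004 PHYS1001-2 > "$enrolments"

# cut -f 2 picks the unit column (tab is cut's default delimiter);
# cut -c 1-4 keeps the 4-character school code.
schools=$(cut -f 2 "$enrolments" | cut -c 1-4 | sort | uniq | wc -l)
echo "$schools"
```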

    Here is another, which uses sed to rewrite each unit code – everything from the first digit onwards (the [0-9] and whatever follows it) is discarded (replaced with nothing).
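    A sketch of the sed version on a made-up sample (records and variable names are assumptions). We cut out the unit column first, since the student numbers themselves are made of digits and would otherwise be wiped out by the substitution.

```shell
# Hypothetical sample standing in for ENROLMENTS-2017:
# one "student-number<TAB>unit/semester-code" record per line.
workdir=$(mktemp -d)
enrolments="$workdir/ENROLMENTS-2017"
printf '%s\t%s\n' \
    20000001 CITS4407-1 \
    20000001 CHEM1001-1 \
    20000002 CITS4407-1 \
    20000002 MATH1001-1 \
    20000003 CHEM1002-1 \
    20000004 PHYS1001-2 > "$enrolments"

# s/[0-9].*// deletes the first digit and everything after it,
# so CITS4407-1 becomes CITS whatever the length of the prefix.
schools=$(cut -f 2 "$enrolments" | sed 's/[0-9].*//' | sort | uniq | wc -l)
echo "$schools"
```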

    We can also get rid of the call to uniq, because if we check the man page for sort, we see that sort -u has the same effect as sort | uniq:
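    The equivalence can be checked on a small made-up sample (the records and variable names are assumptions): both pipelines below should list the same school codes.

```shell
# Hypothetical sample standing in for ENROLMENTS-2017:
# one "student-number<TAB>unit/semester-code" record per line.
workdir=$(mktemp -d)
enrolments="$workdir/ENROLMENTS-2017"
printf '%s\t%s\n' \
    20000001 CITS4407-1 \
    20000001 CHEM1001-1 \
    20000002 CITS4407-1 \
    20000002 MATH1001-1 \
    20000003 CHEM1002-1 \
    20000004 PHYS1001-2 > "$enrolments"

# The same school-code list, once via sort | uniq and once via sort -u.
with_uniq=$(cut -f 2 "$enrolments" | sed 's/[0-9].*//' | sort | uniq)
with_u=$(cut -f 2 "$enrolments" | sed 's/[0-9].*//' | sort -u)
if [ "$with_uniq" = "$with_u" ]; then
    echo "sort | uniq and sort -u agree"
fi

# The shortened counting pipeline.
schools=$(cut -f 2 "$enrolments" | sed 's/[0-9].*//' | sort -u | wc -l)
echo "$schools"
```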

    Finally, we can make sort do nearly all the work:

    The -t$'\t' argument to sort says to use the “tab” character as a delimiter (so the student number will be considered field 1, and the unit/semester code field 2).
    The -k 2.1,2.4 argument to sort says to only sort on the first 4 characters of field 2.
    And finally, the -u argument says to only print out the first of a run of lines which compare as “equal”.
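    Putting the three arguments described above together gives something like the following sketch, again on a made-up sample (records and variable names are assumptions). The tab is stored in a variable so that the example also runs in shells without the $'\t' syntax; in bash, -t$'\t' does the same job. Note that sort still prints whole records, one per school code, so we only count the lines.

```shell
# Hypothetical sample standing in for ENROLMENTS-2017:
# one "student-number<TAB>unit/semester-code" record per line.
workdir=$(mktemp -d)
enrolments="$workdir/ENROLMENTS-2017"
printf '%s\t%s\n' \
    20000001 CITS4407-1 \
    20000001 CHEM1001-1 \
    20000002 CITS4407-1 \
    20000002 MATH1001-1 \
    20000003 CHEM1002-1 \
    20000004 PHYS1001-2 > "$enrolments"

# Portable way to get a literal tab character.
tab=$(printf '\t')

# Sort on characters 1-4 of field 2 only, and keep one line per distinct key.
schools=$(sort -t "$tab" -k 2.1,2.4 -u "$enrolments" | wc -l)
echo "$schools"
```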

  3. Suppose the unit code for CITS4407 got changed to CITS4417, and we needed to update the enrolments.

    Using sed, apply a “search and replace” to the enrolments file, replacing ‘CITS4407’ with ‘CITS4417’.
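    A minimal sketch of the substitution on a made-up sample (the records, the output file name and the variable names are assumptions). sed writes the rewritten records to standard output and leaves the original file untouched, so here the result is redirected to a new file.

```shell
# Hypothetical sample standing in for ENROLMENTS-2017:
# one "student-number<TAB>unit/semester-code" record per line.
workdir=$(mktemp -d)
enrolments="$workdir/ENROLMENTS-2017"
printf '%s\t%s\n' \
    20000001 CITS4407-1 \
    20000001 CHEM1001-1 \
    20000002 CITS4407-1 \
    20000002 MATH1001-1 > "$enrolments"

# Replace the old unit code with the new one on every line where it appears.
updated="$workdir/ENROLMENTS-2017.updated"
sed 's/CITS4407/CITS4417/' "$enrolments" > "$updated"
cat "$updated"
```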

    We can test that this works. Check how many occurrences of CITS4407 there are in the original file:

    Then check how many occurrences there are of CITS4407 after we pipe the file through sed:

    And how many occurrences of CITS4417:
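    The three checks above can be sketched as follows, on a made-up sample (records and variable names are assumptions). grep -c counts matching lines, which equals the number of occurrences here because each line holds a single unit code.

```shell
# Hypothetical sample standing in for ENROLMENTS-2017:
# one "student-number<TAB>unit/semester-code" record per line.
workdir=$(mktemp -d)
enrolments="$workdir/ENROLMENTS-2017"
printf '%s\t%s\n' \
    20000001 CITS4407-1 \
    20000001 CHEM1001-1 \
    20000002 CITS4407-1 \
    20000002 MATH1001-1 > "$enrolments"

# Occurrences of the old code in the original file.
before=$(grep -c 'CITS4407' "$enrolments")

# Occurrences of the old code after sed. grep exits non-zero when
# nothing matches, hence the "|| true".
after_old=$(sed 's/CITS4407/CITS4417/' "$enrolments" | grep -c 'CITS4407' || true)

# Occurrences of the new code after sed.
after_new=$(sed 's/CITS4407/CITS4417/' "$enrolments" | grep -c 'CITS4417')

echo "before: $before, old code after sed: $after_old, new code after sed: $after_new"
```

    If the replacement worked, the first and last counts match and the middle one is zero.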

  4. How would we list students taking both CHEM1001-1 and CHEM1002-1, using grep?

    First, work out which students are taking those units, and store their student numbers in two files:

    We can then use the join command (see chapter 20 of the Shotts textbook for more details) to find lines that appear in both files:

    Note that files passed to join as arguments must already be sorted.
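    The two steps above might look like this on a made-up sample (the records, the file names chem1001.txt and chem1002.txt, and the variable names are all assumptions). Each grep output is cut down to the student number and sorted before being saved, so that join receives sorted input.

```shell
# Hypothetical sample standing in for ENROLMENTS-2017:
# one "student-number<TAB>unit/semester-code" record per line.
workdir=$(mktemp -d)
enrolments="$workdir/ENROLMENTS-2017"
printf '%s\t%s\n' \
    20000001 CHEM1001-1 \
    20000001 CHEM1002-1 \
    20000002 MATH1001-1 \
    20000003 CHEM1001-1 \
    20000003 CHEM1002-1 \
    20000004 CHEM1001-1 > "$enrolments"

# Student numbers for each unit, sorted, one file per unit.
grep 'CHEM1001-1' "$enrolments" | cut -f 1 | sort > "$workdir/chem1001.txt"
grep 'CHEM1002-1' "$enrolments" | cut -f 1 | sort > "$workdir/chem1002.txt"

# join prints the student numbers that appear in both files.
both=$(join "$workdir/chem1001.txt" "$workdir/chem1002.txt")
printf '%s\n' "$both"
```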

    In fact, we can avoid the use of cut, because join by default matches lines on the first field of each file.

    Then:
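    A sketch of this cut-free variant on a made-up sample (records, file names and variable names are assumptions). The saved files now hold whole records; join matches them on field 1, the student number, and each output line starts with that number followed by the remaining fields from both files.

```shell
# Hypothetical sample standing in for ENROLMENTS-2017:
# one "student-number<TAB>unit/semester-code" record per line.
workdir=$(mktemp -d)
enrolments="$workdir/ENROLMENTS-2017"
printf '%s\t%s\n' \
    20000001 CHEM1001-1 \
    20000001 CHEM1002-1 \
    20000002 MATH1001-1 \
    20000003 CHEM1001-1 \
    20000003 CHEM1002-1 \
    20000004 CHEM1001-1 > "$enrolments"

# Keep the full matching records, sorted (student number comes first).
grep 'CHEM1001-1' "$enrolments" | sort > "$workdir/chem1001.txt"
grep 'CHEM1002-1' "$enrolments" | sort > "$workdir/chem1002.txt"

# With no options, join uses field 1 of each file as the join field.
both=$(join "$workdir/chem1001.txt" "$workdir/chem1002.txt")
printf '%s\n' "$both"
```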

    Finally, we could avoid some repetition in our commands by writing the following:
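    One way to remove the repetition, offered as a sketch and not necessarily the form originally intended, is a for loop over the two unit codes. The sample records, the per-unit file names and the variable names are assumptions.

```shell
# Hypothetical sample standing in for ENROLMENTS-2017:
# one "student-number<TAB>unit/semester-code" record per line.
workdir=$(mktemp -d)
enrolments="$workdir/ENROLMENTS-2017"
printf '%s\t%s\n' \
    20000001 CHEM1001-1 \
    20000001 CHEM1002-1 \
    20000002 MATH1001-1 \
    20000003 CHEM1001-1 \
    20000003 CHEM1002-1 \
    20000004 CHEM1001-1 > "$enrolments"

# Run the same grep | sort step once per unit, writing one file per unit.
for unit in CHEM1001-1 CHEM1002-1; do
    grep "$unit" "$enrolments" | sort > "$workdir/$unit.txt"
done

# Join the two per-unit files on the student number (field 1).
both=$(join "$workdir/CHEM1001-1.txt" "$workdir/CHEM1002-1.txt")
printf '%s\n' "$both"
```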