# Week 7 workshop exercises

## Regular expressions

Download a copy of the ENROLMENTS-2017 file if you don’t have one handy.

```
$ cut -f 2 ENROLMENTS-2017 | grep '^C' | wc -l
8298
```

2. The start of a unit code (the portion in capital letters) indicates what school or department offers the unit.

How many distinct schools/departments are there?

Here is one solution, which relies on us assuming that school codes will always be 4 characters long:

```
$ cut -f 2 ENROLMENTS-2017 | cut -c 1-4 | sort | uniq | wc -l
94
```
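The four-character assumption can be checked before relying on it. The following is a minimal sketch on a small made-up sample: the student numbers and the three-letter `LAW5101-2` code are hypothetical and are not taken from the real file. It counts unit codes that do not begin with exactly four capital letters followed by a digit.

```shell
# Hypothetical sample in the same layout: student number, a tab, then unit-semester code.
sample=$(mktemp)
printf '%s\t%s\n' \
    20000001 CITS4407-1 \
    20000002 CHEM1001-1 \
    20000003 LAW5101-2 > "$sample"

# -v inverts the match and -c counts, so this counts codes that break the assumption.
exceptions=$(cut -f 2 "$sample" | grep -vc '^[A-Z]\{4\}[0-9]' || true)
echo "$exceptions"
rm -f "$sample"
```

If the same check prints `0` on the real file, the `cut -c 1-4` solution is safe for that file.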

Here is another, which uses `sed` to rewrite each unit: everything from the first digit (the `[0-9]`) onwards is discarded (replaced with nothing). Unlike the `cut` solution, it does not assume a fixed code length.

```
$ cut -f 2 ENROLMENTS-2017 | sed 's/[0-9].*//' | sort | uniq | wc -l
94
```
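To see what the substitution does to individual codes, it can be run on a few lines typed in by hand. This is only a sketch: the codes below are illustrative, and `LAW5101-2` is a hypothetical three-letter example rather than something known to be in the file.

```shell
# Strip everything from the first digit onwards, then keep the distinct prefixes.
prefixes=$(printf '%s\n' CITS4407-1 CITS1001-2 LAW5101-2 | sed 's/[0-9].*//' | sort -u)
echo "$prefixes"
```

This prints `CITS` and `LAW` on separate lines, so a prefix shorter than four letters is handled without any change to the command.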

We can also get rid of the call to `uniq`, because if we check the man page for `sort`, we see that `sort -u` has the same effect as `sort | uniq`:

```
$ cut -f 2 ENROLMENTS-2017 | sed 's/[0-9].*//' | sort -u | wc -l
94
```

Finally, we can make `sort` do nearly all the work:

```
$ sort -t$'\t' -k 2.1,2.4 -u ENROLMENTS-2017 | wc -l
94
```

The `-t$'\t'` argument to `sort` says to use the “tab” character as a delimiter (so the student number will be considered field 1, and the unit/semester code field 2).
The `-k 2.1,2.4` argument to `sort` says to only sort on the first 4 characters of field 2.
And finally, the `-u` argument says to only print out the first of a run of lines which compare as “equal”.
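The effect of these three options can be tried on a tiny file. The sketch below uses made-up student numbers, and it builds the tab with `printf` instead of bash's `$'\t'` so that it also runs in a plain POSIX shell; both choices are assumptions made for the example.

```shell
tab=$(printf '\t')
sample=$(mktemp)
printf '%s\t%s\n' \
    20000001 CITS4407-1 \
    20000002 CHEM1001-1 \
    20000003 CITS1001-2 \
    20000004 CHEM1002-1 > "$sample"

# Compare lines only on characters 1-4 of field 2, and keep one line per distinct key.
kept=$(sort -t "$tab" -k 2.1,2.4 -u "$sample")
echo "$kept"
count=$(echo "$kept" | wc -l | tr -d ' ')
rm -f "$sample"
```

Four input lines reduce to two output lines, one for `CHEM` and one for `CITS`, which is why piping the result to `wc -l` counts the distinct prefixes.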

3. Suppose the unit code for CITS4407 got changed to CITS4417, and we needed to update the enrolments.

Using `sed`, apply a “search and replace” to the enrolments file: replace ‘CITS4407’ with ‘CITS4417’.

``sed s/CITS4407/CITS4417/ ENROLMENTS-2017``

We can test that this works. Check how many occurrences of CITS4407 there are in the original file:

```
$ grep CITS4407 ENROLMENTS-2017 | wc -l
58
```

Then check how many occurrences there are of CITS4407 after we pipe the file through `sed`:

```
$ sed s/CITS4407/CITS4417/ ENROLMENTS-2017 | grep CITS4407 | wc -l
0
```

And how many occurrences of CITS4417:

```
$ sed s/CITS4407/CITS4417/ ENROLMENTS-2017 | grep CITS4417 | wc -l
58
```
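Note that `sed` writes the edited text to standard output and leaves the input file unchanged, so the result has to be redirected somewhere if we want to keep it. A minimal sketch on made-up data follows; the file names `enrolments` and `enrolments.new` and the student numbers are hypothetical.

```shell
workdir=$(mktemp -d)
printf '%s\t%s\n' \
    20000001 CITS4407-1 \
    20000002 CHEM1001-1 \
    20000003 CITS4407-2 > "$workdir/enrolments"

# Save the edited copy under a new name; the original is not modified.
sed 's/CITS4407/CITS4417/' "$workdir/enrolments" > "$workdir/enrolments.new"

# grep -c counts matching lines (it exits non-zero when the count is 0, hence "|| true").
old_left=$(grep -c CITS4407 "$workdir/enrolments.new" || true)
new_count=$(grep -c CITS4417 "$workdir/enrolments.new" || true)
orig_count=$(grep -c CITS4407 "$workdir/enrolments" || true)
echo "$old_left $new_count $orig_count"
rm -rf "$workdir"
```

This prints `0 2 2`: no old codes remain in the new file, both were rewritten, and the original still contains its two `CITS4407` lines.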
4. How would we list students taking both `CHEM1001-1` and `CHEM1002-1`, using grep?

First, work out which students are taking those units, and store their student numbers in two files:

```
$ grep CHEM1001-1 ENROLMENTS-2017 | cut -f 1 | sort > CHEM1001-1-students
$ grep CHEM1002-1 ENROLMENTS-2017 | cut -f 1 | sort > CHEM1002-1-students
```

We can then use the `join` command (see the Shotts textbook, chapter 20, for more details) to find lines that appear in both files:

```
$ join CHEM*-students | wc -l
26
```

Note that files passed to `join` as arguments must already be sorted.
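Since the question asks about `grep`, here is a possible alternative for the final step that stays with `grep`: use one list of student numbers as a set of patterns to search the other. It is a sketch on made-up student numbers with hypothetical file names, and it runs `join` alongside for comparison. Unlike `join`, this approach does not need sorted input.

```shell
workdir=$(mktemp -d)
printf '%s\n' 20000001 20000002 20000004 > "$workdir/chem1001"
printf '%s\n' 20000002 20000003 20000004 > "$workdir/chem1002"

# -f FILE: read patterns from FILE; -F: treat them as fixed strings; -x: match whole lines only.
both=$(grep -Fxf "$workdir/chem1001" "$workdir/chem1002")
# The same intersection using join (the inputs here are already sorted).
joined=$(join "$workdir/chem1001" "$workdir/chem1002")
echo "$both"
rm -rf "$workdir"
```

Both commands report `20000002` and `20000004`, the two students who appear in both lists.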

In fact, we can avoid the use of `cut`, because `join` compares only one field of each line: the first by default, which `-j 1` makes explicit.

```
$ grep CHEM1001-1 ENROLMENTS-2017 | sort > CHEM1001-1-students
$ grep CHEM1002-1 ENROLMENTS-2017 | sort > CHEM1002-1-students
```

Then:

```
$ join -j 1 CHEM*-students | wc -l
26
```

Finally, we could avoid some repetition in our commands by writing the following:

```
$ for unit in CHEM1001-1 CHEM1002-1; do grep "$unit" ENROLMENTS-2017 | sort > "${unit}-students"; done; join -j 1 CHEM*-students | wc -l
26
```