Week 4 workshop exercises

Exercises in filtering data

Helpful commands: cut, sort, uniq, grep, head, tail, wc.

Task 1:

ENROLMENTS-2017

  1. How many distinct enrolments (lines) are there in the file?

  2. How many distinct students are there in the file?

    The file is tab-delimited. (A handy way to see if a file contains tabs is to pipe it through cat -A: tabs will be shown as ^I, and the end of each line will be marked with a dollar sign ($). If for some reason you need to write a tab character at the command line, you can do so with $'\t'. “Dollar-single quote” quotations make bash interpret escape sequences, like \t for “tab” and \n for “newline”.)
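
    As a quick illustration of both points, the sketch below builds a two-line sample with printf and inspects it. The file name sample.tsv and its contents are invented for the example, and it assumes bash and GNU cat (the -A option is a GNU extension).

```shell
# Create a tiny tab-delimited sample (hypothetical data, not the real file).
printf 'CITS1001\t20000001\nCITS1401\t20000002\n' > sample.tsv

# Show non-printing characters: tabs appear as ^I, line ends as $.
cat -A sample.tsv

# $'\t' expands to a literal tab; here grep counts the lines containing one.
tab_lines=$(grep -c $'\t' sample.tsv)
echo "$tab_lines"
```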

    So we can use:
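
    A minimal sketch of such a pipeline follows. The column layout is not stated here, so the example assumes the student number is field 2; adjust the -f value to match the real file. The sample data is made up so that the commands can be run as they stand.

```shell
# Hypothetical sample standing in for ENROLMENTS-2017 (unit, then student).
printf 'CITS1001_SEM-1\t20000001\n'  > sample.tsv
printf 'CITS1401_SEM-1\t20000001\n' >> sample.tsv
printf 'CITS1001_SEM-1\t20000002\n' >> sample.tsv
file=sample.tsv   # set file=ENROLMENTS-2017 for the real data

# Extract the student field, keep one copy of each value, count the lines.
# cut splits on tabs by default; field 2 is an assumption.
students=$(cut -f 2 "$file" | sort -u | wc -l)
echo "$students"
```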

    (Note that the ENROLMENTS-2017 file contains around 42,000 lines, and on most machines the above command should execute pretty much instantly. Until you get to around a million lines, sort should take very little time. Once you exceed a million lines, it can be helpful to use sort --parallel=8 (where 8 is the number of processors on your computer; replace as appropriate) to improve performance. --parallel is an option understood by the GNU implementation of sort, which is the one we are using; it divides the work of sorting between multiple processors.)
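
    Rather than hard-coding the processor count, one option is to ask nproc for it. This is only a sketch: it assumes GNU coreutils (for nproc, seq and sort --parallel), and numbers.txt is a made-up stand-in for a large file.

```shell
# Generate some unsorted lines to stand in for a large file.
seq 1000 -1 1 > numbers.txt

# nproc (GNU coreutils) reports the number of available processors.
cpus=$(nproc)

# Sort numerically, letting sort run up to that many sorts concurrently.
sort --parallel="$cpus" -n numbers.txt > sorted.txt
head -n 3 sorted.txt
```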

  3. How many distinct units are there in the file?

    If we interpret “unit” as meaning “a unit–semester combination”, then
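
    Under that reading, it is enough to count the distinct values of the whole unit field. The sketch below assumes that field 1 holds the unit together with its teaching period (for example CITS1001_SEM-1); both the field number and the sample values are assumptions.

```shell
# Hypothetical sample: unit_period in field 1, student number in field 2.
printf 'CITS1001_SEM-1\t20000001\n'  > sample.tsv
printf 'CITS1001_SEM-2\t20000002\n' >> sample.tsv
printf 'CITS1401_SEM-1\t20000001\n' >> sample.tsv
printf 'CITS1001_SEM-1\t20000003\n' >> sample.tsv
file=sample.tsv   # set file=ENROLMENTS-2017 for the real data

# Count the distinct unit-period combinations.
offerings=$(cut -f 1 "$file" | sort -u | wc -l)
echo "$offerings"
```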

    If we interpret it as meaning just the unit ID, then we must discard the semester part of each line.

    One way is:
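
    A sketch of one such approach is to apply cut twice: once to pick the unit field, and once more to keep only the part before the separator. It assumes the unit and teaching period are joined by an underscore in field 1 (for example CITS1001_SEM-1), which may not match the real file.

```shell
# Hypothetical sample: unit_period in field 1, student number in field 2.
printf 'CITS1001_SEM-1\t20000001\n'  > sample.tsv
printf 'CITS1001_SEM-2\t20000002\n' >> sample.tsv
printf 'CITS1401_SEM-1\t20000001\n' >> sample.tsv
file=sample.tsv   # set file=ENROLMENTS-2017 for the real data

# First cut keeps field 1; second cut splits on "_" and keeps the unit ID.
units=$(cut -f 1 "$file" | cut -d _ -f 1 | sort -u | wc -l)
echo "$units"
```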

  4. How many distinct teaching periods (similar to semesters) are there?

  5. How many (CITS) units are presented by Computer Science and Software Engineering?

  6. Which unit(s) have the largest enrolment?

    We can use the ‘-c’ flag to the uniq command to get a count of each unit:
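
    A sketch of how that might look is given below. uniq only merges adjacent duplicates, so the input is sorted first; a second, numeric reverse sort then puts the largest counts at the top. Field 1 as the unit field and the sample data are assumptions.

```shell
# Hypothetical sample: unit_period in field 1, student number in field 2.
printf 'CITS1001_SEM-1\t20000001\n'  > sample.tsv
printf 'CITS1401_SEM-1\t20000001\n' >> sample.tsv
printf 'CITS1001_SEM-1\t20000002\n' >> sample.tsv
printf 'CITS1001_SEM-2\t20000003\n' >> sample.tsv
printf 'CITS1001_SEM-1\t20000004\n' >> sample.tsv
file=sample.tsv   # set file=ENROLMENTS-2017 for the real data

# Count enrolments per unit, then order by count, largest first.
counts=$(cut -f 1 "$file" | sort | uniq -c | sort -rn)

# Look at the first few lines rather than only one, in case of ties.
echo "$counts" | head -n 3
```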

  7. Which student(s) are taking the most units this year?

  8. Which units are offered in more than one teaching period?

Task 2:

wificapture-1.txt
wificapture-2.txt


The shorter file wificapture-1.txt provides details of 10,000 captured wireless Ethernet (WiFi) frames, and the longer file wificapture-2.txt details of 230,000 frames. The contents of each frame have been formatted into a text file, one frame per line. Note that only each frame's header is captured, and none of its data payload (which is likely encrypted anyway). Thus, the only privacy concerns raised by this data are which device was communicating with which other device, how often, how much, and at what time. No personal or private data is exposed. The tab-separated fields of each line (frame) are:

Different frame types will appear to have different numbers of fields. Actually, all fields are present, and an 'empty' field is represented by two TAB characters in a row. For the following exercises, we are only interested in the source and destination MAC addresses, which will always be present. While processing each frame (line), ignore all MAC addresses of the form ff:ff:ff:ff:ff:ff, the special broadcast address for frames transmitted to any device that can hear them. Develop a number of short command sequences to find:

  1. The single source device sending traffic most frequently,
  2. The single source device sending the greatest volume of traffic,
  3. The 5 pairs of source and destination devices which collectively (considered pairwise) send the highest number of frames.
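
As a starting point, the sketch below shows two building blocks on a made-up capture: cut keeps empty fields, so field numbers stay stable across frame types, and grep -v can discard broadcast frames. The field positions (source MAC in field 2, destination in field 3, length in field 5) are placeholders, not the real layout, and dropping every frame that mentions the broadcast address is only one possible reading of "ignore".

```shell
# Hypothetical capture: time, source MAC, destination MAC, empty field, length.
printf '0.1\taa:aa:aa:aa:aa:01\tff:ff:ff:ff:ff:ff\t\t60\n'  > capture.txt
printf '0.2\taa:aa:aa:aa:aa:01\taa:aa:aa:aa:aa:02\t\t80\n' >> capture.txt
printf '0.3\taa:aa:aa:aa:aa:02\taa:aa:aa:aa:aa:01\t\t40\n' >> capture.txt
printf '0.4\taa:aa:aa:aa:aa:01\taa:aa:aa:aa:aa:02\t\t70\n' >> capture.txt

# The empty field 4 still counts as a field, so field 5 is the length.
first_len=$(cut -f 5 capture.txt | head -n 1)
echo "$first_len"

# Drop frames mentioning the broadcast address, then count frames per source.
sources=$(grep -v 'ff:ff:ff:ff:ff:ff' capture.txt | cut -f 2 | sort | uniq -c | sort -rn)
echo "$sources"
```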