The University of Western Australia
Department of Computer Science and Software Engineering

CITS4407 Open Source Tools and Scripting

Exercise sheet 2 - sample solutions and discussion

These exercises are performed using bash and the Terminal window application (under either Linux or macOS).


  1. The Western Australian Government requires all fuel companies to register their selling price for fuel by 4PM each day. Companies must sell their fuel the next day at the registered price (or below), and may not alter the price during the day. To provide some transparency, the FuelWatch Historical Data website records all daily price details, and the data may be downloaded for examination.

    1. Using your web-browser (Firefox), download the data for February 2020 to your desktop. Notice that the named file (its URL) ends in ".csv.zip". What do you think that means? What is the name of the file that your web-browser stores on your desktop?

      The filename has two extensions and, for their meaning, we read them from right to left. The first (rightmost) is .zip, signifying that the file has been compressed by the zip command, which appends the .zip extension. The second extension .csv is an acronym for Comma Separated Values, a common textfile format produced by Microsoft Excel and, as we shall see, many other applications. So, our data is in a textfile that has been compressed by the zip command.
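
      If ever unsure, the file command will inspect a file's actual contents, independent of its name. As a quick check (assuming the file has been downloaded to the current directory), the output will look something like:

      shell> file fuelwatch.csv.zip
      fuelwatch.csv.zip: Zip archive data, at least v2.0 to extract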

    2. The newly downloaded file has been compressed using the zip format, whose standard DEFLATE method combines Lempel-Ziv compression with Huffman coding. We'll just refer to this file as being 'a zip file', requiring decompression (expansion). Firstly, without uncompressing the file, use the unzip command to see what is inside the compressed file.

      shell> unzip -l fuelwatch.csv.zip
      Archive:  fuelwatch.csv.zip
        Length      Date    Time    Name
      ---------  ---------- -----   ----
        8288478  04-01-2020 15:48   fuelwatch.csv
      ---------                     -------
        8288478                     1 file

    3. Use the ls command to see the initial size of the zip file. Now, use unzip to decompress it, observe that its filename has changed, and check its new size.

      shell> ls -l fuelwatch.csv.zip
      -rw-------+ 1 chris admin   922672 Apr  1 15:49 fuelwatch.csv.zip
      shell> unzip fuelwatch.csv.zip
      shell> ls -l fuelwatch.csv
      -rw-r--r--@ 1 chris admin  8288478 Apr  1 15:50 fuelwatch.csv

      Note that unzip leaves the original fuelwatch.csv.zip in place; the extracted fuelwatch.csv is a new file, roughly nine times larger than the compressed archive.

  2. OK, we now have our data (file) in a format we can use for the following tasks.

    1. Use the wc command to determine the number of lines in the file.

      Without any options, wc will report line, word, and character counts. With the -l option, only the line count is reported.

      shell> wc fuelwatch.csv
        74528  606971 8288478 fuelwatch.csv
      shell> wc -l fuelwatch.csv
        74528 fuelwatch.csv

    2. Use the less command to view the contents of the data file....
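
      less is a 'pager' - the space-bar advances one screenful, /pattern searches forward, and q quits:

      shell> less fuelwatch.csv
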
    3. Observe what character is used to delimit (separate) the fields (columns) of data.
      Use the cut command to list all the service-station (garage, petrol station...) names present in the file.

      Fields are delimited by the comma character (remember 'csv'). We can use the cut command to 'break' the file into its fields. The default field delimiter for cut is a tab, so we need to override the default and specify the comma. We also only require the 2nd field.

      shell> cut -d, -f2 fuelwatch.csv
      TRADING_NAME
      53 Mile Roadhouse
      53 Mile Roadhouse
      53 Mile Roadhouse
      7-Eleven Ascot
      7-Eleven Ascot
      7-Eleven Ascot
      7-Eleven Ascot
      ......
      (74528 lines in total; the names are sorted but are repeated)

    4. 🌶 (Getting harder) How many distinct service-stations are represented in the file?

      Consider a list of anything. The easiest way to list each distinct item and eliminate duplicates is to first sort the items - then all identical items will appear consecutively. It's now easy to report the distinct items, and immediately remove any repeats.

      We can use the sort and uniq commands in combination to perform our task, first by using temporary files and input and output redirection, remembering to remove the temporary files afterwards:

      shell> cut -d, -f2 fuelwatch.csv > temporaryfile1
      shell> sort < temporaryfile1 > temporaryfile2
      shell> uniq < temporaryfile2
      53 Mile Roadhouse
      7-Eleven Ascot
      7-Eleven Balcatta
      7-Eleven Balga
      7-Eleven Banksia Grove
      7-Eleven Bassendean
      .....
      shell> rm temporaryfile1 temporaryfile2

      Better still, we can avoid the use of those temporary files by directly connecting the output of each command to the input of the next command. We use a sequence of communication pipes to build a command pipeline. We visualise the data flowing from left-to-right between the commands, with typically less data flowing through each successive pipe.

      shell> cut -d, -f2 fuelwatch.csv | sort | uniq
      53 Mile Roadhouse
      7-Eleven Ascot
      7-Eleven Balcatta
      7-Eleven Balga
      7-Eleven Banksia Grove
      7-Eleven Bassendean
      .....

      The command sequence ... | sort | uniq is so common that the actions of uniq have been 'built into' sort:

      shell> cut -d, -f2 fuelwatch.csv | sort -u
      (same output as above)
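
      To answer the actual question, we finish the pipeline with wc -l to count the lines. Remember that the file's TRADING_NAME header line survives the pipeline, so the number of distinct service-stations is one less than the reported count:

      shell> cut -d, -f2 fuelwatch.csv | sort -u | wc -l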

    5. 🌶🌶 (Even harder) Using the grep command, find all of the fuel prices (across the month) sold by 'Caltex StarMart Nedlands'. Notice that the service-station name has spaces in it.

      We don't wish the 3 words of our required service-station name to be interpreted (by the shell) as 3 distinct command arguments. We can keep all words of the name 'together' by enclosing them in single-quotes. The command grep (standing for global regular expression print) will find all lines matching the pattern given as its first argument. Once grep has found all matching lines, we pass its output to cut to extract just the 5th field (the prices).

      shell> grep 'Caltex StarMart Nedlands' fuelwatch.csv | cut -d, -f5
      136.9
      159.9
      150.9
      99.9
      158.9
      132.9
      .....
      (a total of 145 lines)
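
      To see why the quotes matter, consider the unquoted version: the shell splits the name into 3 separate arguments, so grep takes Caltex as its pattern and tries to open StarMart and Nedlands as input files:

      shell> grep Caltex StarMart Nedlands fuelwatch.csv
      grep: StarMart: No such file or directory
      grep: Nedlands: No such file or directory
      (followed by every line of fuelwatch.csv containing Caltex)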

    6. 🌶🌶🌶 (Brain busting) On what day was PULP (premium unleaded petrol) the cheapest at 'Caltex StarMart Nedlands'?

      In the previous exercise we 'threw away' too much data by only reporting the prices - we need the fuel type as well, so we extract fields 4 and 5 and keep only the PULP lines:

      shell> grep 'Caltex StarMart Nedlands' fuelwatch.csv | cut -d, -f4,5 | grep PULP
      PULP,150.9
      PULP,147.9
      PULP,143.9
      PULP,143.9
      PULP,170.9
      PULP,166.9
      .....
      (a total of 29 lines - 29 days in February 2020)

      Getting closer; now we need to sort the output by price. We also need to treat the (now) 2nd field as numeric, not just a string (else '101' comes before '13'). We inform sort that the comma is our field-separator, to use the 2nd field as the sort key, and to sort numerically.
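
      A quick demonstration of the difference, using printf to generate two sample lines:

      shell> printf '101\n13\n' | sort
      101
      13
      shell> printf '101\n13\n' | sort -n
      13
      101

      Returning to our fuel prices: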

      shell> grep 'Caltex StarMart Nedlands' fuelwatch.csv | cut -d, -f4,5 | grep PULP | sort -t, -k2 -n
      PULP,139.9
      PULP,140.9
      PULP,141.9
      PULP,143.9
      PULP,143.9
      .....
      (a total of 29 lines - 29 days in February 2020)

      There's the lowest price on the first line. We could finally extract it with:

      shell> grep 'Caltex StarMart Nedlands' fuelwatch.csv | cut -d, -f4,5 | grep PULP | sort -t, -k2 -n | cut -d, -f2 | head -1
      139.9

      Phew, fantastic! But we should really re-read the question:

      shell> grep 'Caltex StarMart Nedlands' fuelwatch.csv | cut -d, -f1,4,5 | grep PULP | sort -t, -k3 -n | cut -d, -f1 | head -1
      25/02/2020

      or

      shell> grep 'Caltex StarMart Nedlands' fuelwatch.csv | grep PULP | sort -t, -k5 -n | cut -d, -f1 | head -1
      25/02/2020

  3. Another similar example, from the Australian Bureau of Meteorology.

    This time we're seeking plain text versions of some monthly weather data. Download the weather data for both February 2019 and February 2020.

    1. The command-line program curl enables us to download files from websites without using a web-browser. You need to use a specific switch to curl to name the required output file, and to provide the URL of the required file (data) from the website. Find the URLs for the February 2019 and February 2020 datasets, and save them to different files using curl.

      shell> curl -o weather-feb2019.csv http://www.bom.gov.au/climate/dwo/201902/text/IDCJDW6111.201902.csv
      shell> curl -o weather-feb2020.csv http://www.bom.gov.au/climate/dwo/202002/text/IDCJDW6111.202002.csv

    2. Now, choose a few metrics such as maximum temperature, rainfall, or maximum wind gust, and determine if February 2019 or February 2020 was hotter, wetter, or windier.

      There's no single correct answer to this exercise, but let's find the hotter month by examining minimum (field 3) and maximum (field 4) temperatures. Note that we're only interested in the lines providing data, and that they all include dates in a regular format, which lets us select just the data lines by searching for the year and month. We'll ignore the fact that February 2020 had one extra day!

      shell> grep 2019-02 < weather-feb2019.csv | cut -d, -f4 | cut -c1 | sort | uniq -c
        10 2
        18 3

      We see 10 maximums in the twenties and 18 maximums in the thirties.

      shell> grep 2020-02 < weather-feb2020.csv | cut -d, -f4 | cut -c1 | sort | uniq -c
        10 2
        17 3
         2 4

      But 2020 had 2 maximums in the forties!

      No-one likes hot nights; perhaps we could add together all minimum temperatures across the month. Unfortunately, there's no well-known command to add a column of numbers, so let's search the web for bash add column of numbers.

      Many solutions employ the awk command, which we'll investigate later in the unit, but one article provides a solution using an uncommon command sequence (and one new to me!) combining 🌶 paste and bc. Here, paste -s joins all of its input lines into a single line, -d+ places a '+' between them, and bc then evaluates the resulting arithmetic expression:

      shell> grep 2019-02 < weather-feb2019.csv | cut -d, -f3 | paste -sd+ - | bc
      497.7
      shell> grep 2020-02 < weather-feb2020.csv | cut -d, -f3 | paste -sd+ - | bc
      584.8
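
      For comparison, here is a sketch of the sort of awk one-liner those web searches typically suggest (we'll meet awk properly later in the unit); it sums the 3rd comma-separated field of the matching lines, producing the same totals as above:

      shell> awk -F, '/2020-02/ { sum += $3 } END { print sum }' weather-feb2020.csv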

      So the sum of February 2020's minimums is 87 degrees more than 2019's, even allowing for 2020's extra day!


Chris McDonald
March 2020.
