The filename has two extensions and, for their meaning, we read them from right to left. The first (rightmost) is .zip, signifying that the file has been compressed by the zip command, which appends the .zip extension. The second extension, .csv, is an acronym for Comma Separated Values, a common textfile format produced by Microsoft Excel and, as we shall see, many other applications.

So, our data is in a textfile that has been compressed by the zip command.
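As a quick sketch (the real filename will differ - fuel.csv.zip is just a stand-in used throughout these examples), unzip recovers the original textfile:

    unzip fuel.csv.zip     # extracts fuel.csv into the current directory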
Without any options, wc will report line, word, and character counts. With the -l option, only the line count is reported.
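For example, with our stand-in filename:

    wc    fuel.csv      # lines, words, and characters
    wc -l fuel.csv      # just the line count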
Fields are delimited by the comma character (remember 'csv'). We can use the cut command to 'break' the file into its fields. The default field delimiter for cut is a tab, so we need to override the default and specify the comma. We also only require the 2nd field.
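A sketch with the same stand-in file:

    cut -d',' -f2 fuel.csv     # -d',' overrides the tab default, -f2 selects the 2nd field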
Consider a list of anything. The easiest way to list each distinct item and eliminate duplicates is to first sort the items - then all identical items will appear consecutively. It's now easy to report the distinct items and immediately remove any repeats.
We can use the sort and uniq commands in combination to perform our task, first by using temporary files and input and output file redirection, remembering to remove the temporary files afterwards:
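Something like the following, where the temporary filenames are arbitrary:

    cut -d',' -f2 fuel.csv  > /tmp/fields          # extract the field of interest
    sort < /tmp/fields      > /tmp/fields.sorted   # identical items now consecutive
    uniq < /tmp/fields.sorted                      # report each distinct item once
    rm /tmp/fields /tmp/fields.sorted              # tidy up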
Better still, we can avoid the use of those temporary files by directly connecting the output of each command to the input of the next command. We use a sequence of communication pipes to build a command pipeline. We visualise the data flowing from left-to-right between the commands, with typically less data flowing through each successive pipe.
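The same task as a single pipeline, with no temporary files to remember to remove:

    cut -d',' -f2 fuel.csv | sort | uniq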
The command sequence ... | sort | uniq is so common that the actions of uniq have been 'built in' to sort:
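That is, sort's -u option both sorts its input and removes the duplicates, making uniq unnecessary here:

    cut -d',' -f2 fuel.csv | sort -u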
We don't wish the 3 words of our required service-station name to be interpreted (by the shell) as 3 distinct command arguments. We can keep all words of the name 'together' by enclosing them in single-quotes.
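To illustrate with a made-up station name (the real name will differ):

    grep Blue Hills Garage fuel.csv      # wrong: 'Blue' becomes the pattern,
                                         # 'Hills' and 'Garage' become extra filenames
    grep 'Blue Hills Garage' fuel.csv    # right: one pattern, one filename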
The command grep (standing for global regular expression print!) will find all lines matching the pattern given as its first argument. Once grep has found all matching lines, we pass its output to cut to extract just the 5th field (the prices).
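Putting the two together, still with our made-up station name:

    grep 'Blue Hills Garage' fuel.csv | cut -d',' -f5     # just the prices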
In the previous exercise we 'threw away' too much data by only reporting the prices - we need the fuel type (PULP) as well:
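Assuming, for illustration, that the fuel type is held in the 1st field, we ask cut for both fields. Note that cut always outputs fields in their original order, so the price becomes the 2nd field of each output line:

    grep 'Blue Hills Garage' fuel.csv | cut -d',' -f1,5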
Getting closer; now we need to sort the output by price. We also need to treat the (now) 2nd field as numeric, not just a string (else '101' comes before '13'). We inform sort that the comma is our field-separator, to use the 2nd field as the sort key, and to sort numerically.
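Extending our sketch pipeline:

    # -t',' sets the field-separator, -k2 selects the sort key, -n requests numeric order
    grep 'Blue Hills Garage' fuel.csv | cut -d',' -f1,5 | sort -t',' -k2 -n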
There's the lowest price on the first line. We could finally extract it with:
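One way, of several, is with head:

    grep 'Blue Hills Garage' fuel.csv | cut -d',' -f1,5 | sort -t',' -k2 -n | head -n 1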
Phew, fantastic! But we should really re-read the question:
There's no single correct answer to this exercise, but let's find the hotter month by examining minimum (field 3) and maximum (field 4) temperatures. Note that we're only interested in the lines providing data, and that they all include dates in a regular format. We'll ignore the fact that February 2020 had one extra day!
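As a sketch, assuming a stand-in filename of weather.csv whose data lines begin with dates in YYYY-MM-DD format:

    grep '^2020-02' weather.csv | cut -d',' -f3     # February 2020's daily minimums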
No-one likes hot nights; perhaps we could add together all minimum temperatures across the month. Unfortunately, there's no well-known command to add a column of numbers, so let's search the web for bash add column of numbers.
Many solutions employ the awk command, which we'll investigate later in the unit, but this article provides a solution using an uncommon command sequence (and new to me!) employing 🌶 paste and bc:
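The idiom, adapted to our stand-in weather.csv: paste -s 'serialises' all input lines into one, -d'+' joins them with plus-signs, and bc evaluates the resulting arithmetic expression:

    grep '^2020-02' weather.csv | cut -d',' -f3 | paste -s -d'+' - | bc

Repeating the pipeline with '^2019-02' gives February 2019's total for comparison.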
So the sum of February 2020's minimums is 87 degrees more than 2019's, even allowing for its extra day!