CITS2003/CITS4407 OPEN SOURCE TOOLS AND SCRIPTING
Assignment 2 2023
Submission deadline: 11:59pm, Monday 22 May
This assignment will involve creating two Bash Shell scripts, which will use Unix tools, e.g. Sed, Awk,
and/or calls to other Bash Shell scripts.
The top level scripts are to be called preprocess and breaches_per_month. You need to package the scripts (and .git folder) into a single submission consisting of a directory that has been compressed with zip or tar, and submit the zip/tar file via cssubmit. No other method of submission will be accepted.
Revisiting Kaggle Catalogue of US Cybersecurity Breaches
This assignment will make use of the Catalogue of US Cybersecurity Breaches data that you first looked at in Assignment 1, although it will be a different version of that data-file. There are two parts to this assignment. The first will involve data-cleaning, which is a hugely important first step in almost every data analysis task. In the second part you will use cleaned data to analyse the distribution of incidents across months.
Data Cleaning
The Bash Shell script at the front of the data-cleaning task must be called preprocess, and will be given the name of a data-file as its only argument.
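As an illustration of the single-argument interface described above, the following is a minimal sketch of an argument check. The function name check_input_file and the wording of the messages are hypothetical, not part of the assignment specification.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name and messages are assumptions for illustration):
# succeeds only if exactly one argument was given and it names a readable,
# non-empty regular file. Errors go to standard error.
check_input_file() {
    if [ "$#" -ne 1 ]; then
        echo "Usage: preprocess <data-file>" >&2
        return 1
    fi
    if [ ! -f "$1" ]; then
        echo "Error: '$1' does not exist or is not a regular file" >&2
        return 1
    fi
    if [ ! -r "$1" ]; then
        echo "Error: '$1' is not readable" >&2
        return 1
    fi
    if [ ! -s "$1" ]; then
        echo "Error: '$1' is empty" >&2
        return 1
    fi
    return 0
}
```

In a script, such a helper would typically be called near the top as `check_input_file "$@" || exit 1`, so that nothing else runs on bad input.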
The primary data-file you will be using is Cyber_Security_Breaches_noym.tsv,
though other data-files will also be tested.
You can assume the interpretation of the fields will not vary.
To give you something to work toward, and as input for the second part of the assignment, the cleaned file corresponding to Cyber_Security_Breaches_noym.tsv is Cyber_Security_Breaches_clean.tsv (although I could have used any file name).
Data Analysis
You undertook some analyses of the Cyber Breaches data in Assignment 1; there are many more that could be done, e.g. to answer the question, "Is the nature of the breaches changing over time, and if so, how?"
The particular analysis you are asked to do for this assignment is to see if
there is a pattern to the breaches across months, for
the several years that are covered by the data-file.
The Bash Shell script at the head of this program must be called breaches_per_month.
The values for the median and MAD should be printed to standard output.
When you have the median and MAD, you are to print a table of the months where, against each
month, you list the number of incidents and then either "++", if the count is 1 median-absolute-deviation
or more above the median, or "--", if the count is 1 median-absolute-deviation
or more below the median. If the count is within 1 median-absolute-deviation of the median, don't
add anything.
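The flagging rule can be sketched as follows. This is only a sketch under stated assumptions: it uses the standard definitions (median of the monthly counts, and MAD as the median of the absolute deviations from that median), it assumes the per-month counts are already available as "Month count" lines on standard input, and the function names and the "Median:"/"MAD:" labels are hypothetical.

```shell
#!/usr/bin/env bash
# median: read numbers (one per line) on stdin and print their median.
median() {
    sort -n | awk '
        { v[NR] = $1 }
        END {
            if (NR == 0) { exit 1 }
            if (NR % 2) {
                print v[(NR + 1) / 2]
            } else {
                print (v[NR / 2] + v[NR / 2 + 1]) / 2
            }
        }'
}

# flag_months: read "Month count" lines on stdin (assumed input format),
# print the median and MAD, then one line per month with an optional flag.
flag_months() {
    local rows med mad
    rows=$(cat)
    med=$(printf '%s\n' "$rows" | awk '{ print $2 }' | median)
    # Absolute deviation of each count from the median, then its median.
    mad=$(printf '%s\n' "$rows" |
        awk -v m="$med" '{ d = $2 - m; print (d < 0 ? -d : d) }' | median)
    echo "Median: $med"
    echo "MAD: $mad"
    # Note: if the MAD is 0, a count equal to the median is flagged "++".
    printf '%s\n' "$rows" | awk -v m="$med" -v d="$mad" '{
        line = $1 "\t" $2
        if ($2 >= m + d) {
            line = line "\t++"
        } else if ($2 <= m - d) {
            line = line "\t--"
        }
        print line
    }'
}
```

For instance, `printf 'Jan 100\nFeb 180\nMar 200\nApr 220\nMay 300\n' | flag_months` reports a median of 200 and a MAD of 20, flags Jan and Feb with "--", Apr and May with "++", and leaves Mar unflagged.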
For example, the count for January may look like:
Jan 100 --
or
Jan 300 ++
or
Jan 200
Testing
As with Assignment 1, your submission will be tested automatically against a range of seen and unseen examples. However, a human marker will be assessing your program's outputs for the range of tests, which means that the output format your program uses does not much matter for the auto-testing. However, do be aware of the readability of your code and the program outputs. (See below for discussion of Style.)
Marking criteria
The program will be marked out of 20. Of the 20 marks, 15 will be awarded based on how the programs deal with different types of input, both input that conforms to expectations and error-state input that anti-bugging should catch. However, beyond that, error messages need to be as informative as possible. You therefore need to consider the ways users' inputs may not conform to what your system is expecting and add testing to catch those issues. You can assume that the cleaned data-file submitted to breaches_per_month is, in fact, clean.
Of the remaining 5 marks, there will be 1 mark for appropriate use of Git. This means you must have multiple commits at different times with appropriate/relevant commit messages. One commit at the start and one at the end of the project is not sufficient. The final 4 marks will be for style/maintainability. Programs are written as much for humans as for computers. As such, it is important that your code be readable and maintainable. Similarly, outputs should aim to be informative (but never verbose).
Style Rubric
Much of this has been discussed in classes, but includes comments, meaningful variable names for significant variables (i.e. not throw away variables such as loop variables), and sensible anti-bugging. It also includes making sure your program removes any temporary files that were created along the way.
For the style/maintainability mark, the rubric is:
Hints
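On the temporary-file point in the style section, one common Bash idiom is to create scratch files with mktemp and register a trap that removes them on exit. The sketch below is illustrative only: the function name count_data_rows and the assumption that the data-file has a single header line are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical example: count the data rows of a file (everything after an
# assumed single header line) using a scratch file that is always removed.
count_data_rows() {
    (
        # Run in a subshell so the EXIT trap fires when this block ends.
        scratch=$(mktemp) || exit 1
        trap 'rm -f "$scratch"' EXIT
        tail -n +2 "$1" > "$scratch"      # drop the header line
        wc -l < "$scratch" | tr -d ' '    # print the number of data rows
    )
}
```

In a standalone script the same trap can be set once at the top level, so the scratch file is removed however the script exits.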