CITS4407/CITS2003 OPEN SOURCE TOOLS AND SCRIPTING
Assignment 1 2022
Submission deadline: 11:59pm, Monday 2 May 2022.
This assignment involves creating Shell scripts, which can call other Shell or Sed scripts. Each top-level script has been given a name. Please make sure you use the specified name, as that is the name we will use to test your scripts. You need to package all of these into a single zip file and submit it via cssubmit. No other method of submission is allowed.

1. Exploring Malaria Incidence Data

Kaggle is a remarkable web-based data science resource which contains a huge number of different data sets and tutorials on tools. (Highly recommended.) One particular data set is the World Health Organisation statistics for 2020. We have downloaded for you data on the incidence of Malaria for a range of countries and done a little pre-processing, mainly to convert real-valued incidence per 1,000 population into integers (because Shell can only handle integer data). Your program should explicitly read from incedenceOfMalaria.csv, which will be placed in the same directory as your program.
Your program should be called malaria_incidence.

1.1. Examples

Here is an example session:
    % ./malaria_incidence Afghanistan
    For the country Afghanistan, the year with the highest incidence was 2002, with a rate of 104 per 1,000
    % ./malaria_incidence 2004
    For the year 2004, the country with the highest incidence was Solomon Islands, with a rate of 744 per 1,000
    % ./malaria_incidence "Solomon Islands"
    For the country Solomon Islands, the year with the highest incidence was 2004, with a rate of 744 per 1,000

Given that your submission will not be tested automatically, but assessed by a member of staff, slight variations in output format are acceptable.

1.2 Perhaps of Use

You will have noticed that names of countries have an initial capital letter, but common words such as "and" do not. Thus a country in Africa that is in the database is Sao Tome and Principe; although it is part of the name, the "and" is not capitalized. However, the first word is capitalized regardless, e.g. The. This is called title-case, i.e. the form of capitalisation used in titles of books, articles, etc. Implementing this is a bit painful in Shell, so you may find useful the Python program that you can download from here. Use it as you would any other Unix program. (You may have to make it executable using chmod.)
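To make the title-case rule concrete, here is a minimal sketch of it as a shell function that hands the work to awk. It is not the supplied Python program and may not behave identically: the function name title_case and the list of small words are assumptions chosen for illustration only.

```shell
#!/bin/sh
# title_case: print the arguments in title case.
# ASSUMPTION: the small-word list is illustrative, not the list used by
# the Python program supplied with the assignment.
title_case() {
    printf '%s\n' "$*" | awk '
        BEGIN { small = " a an and in of on the " }
        {
            out = ""
            for (i = 1; i <= NF; i++) {
                word = tolower($i)
                # The first word is always capitalised; later words are
                # capitalised unless they appear in the small-word list.
                if (i == 1 || index(small, " " word " ") == 0)
                    word = toupper(substr(word, 1, 1)) substr(word, 2)
                out = (i == 1) ? word : out " " word
            }
            print out
        }'
}

title_case "sao tome and principe"
title_case "SOLOMON ISLANDS"
```

With these inputs the function prints Sao Tome and Principe and Solomon Islands, so an argument typed in any mix of cases can be normalised before it is compared with the country names in the data.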
2. Common Words

This task is a development of the example that motivated several of the lectures in the unit: finding all the words in text, and from that, the most common word, etc. The program you are to write should be called common_words.
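As a reminder of the underlying idea, the following is a minimal sketch of a word-frequency pipeline for a single text. The function name word_counts is hypothetical, and the definition of a word used here (a run of ASCII letters, with case preserved) is an assumption; the version shown in lectures may differ.

```shell
#!/bin/sh
# word_counts: read text on standard input and print "count word" lines,
# most frequent word first.
# ASSUMPTION: a word is a run of ASCII letters and case is preserved.
word_counts() {
    # tr puts each word on its own line, sed drops any empty line,
    # uniq -c counts identical neighbours after sorting, and the final
    # sort orders by descending count, then alphabetically for ties.
    tr -cs 'A-Za-z' '\n' |
        sed '/^$/d' |
        LC_ALL=C sort |
        uniq -c |
        LC_ALL=C sort -k1,1nr -k2,2 |
        awk '{ print $1, $2 }'
}

printf 'The cat saw the dog, and the dog saw the cat.\n' | word_counts | head -n 1
```

The sample line prints 3 the, since "The" and "the" are counted separately here. Whether to fold case before counting is a design decision this sketch deliberately leaves open.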
With no arguments, the usage summary should be:
The program will implement two related functions, indicated by the optional arguments
If the optional argument
If the optional argument
If there are no options, then assume you are being asked for the word that is the most common across
the largest number of files, i.e.
You can assume that all the text files in the text-file directory have the suffix .txt. In the directory textfiles you can find a selection of texts from the wonderful Project Gutenberg archive of copyright-free books.

2.1 Examples

Here is an example session based on the set of 10 texts in the Gutenberg sample linked above:
    % ./common_words text_files
    The 1th most common word is "the" across 9 files
    % ./common_words -nth 2 text_files
    The 2th most common word is "and" across 5 files
    % ./common_words -w Alice text_files
    The most significant rank for the word Alice is 12 in file AliceInWonderland.txt
    % ./common_words -w I text_files
    The most significant rank for the word I is 1 in file ADollsHouse.txt

As in Part 1, the output format you use does not have to exactly match the one used here.

3. Marking criteria

The two programs are each worth 10 marks, but for convenience will be marked out of 20. Marking of programs will be on the basis of 80% for how the programs deal with different types of input, both input that conforms to expectations and error state input that anti-bugging should catch. The remaining 20% will be for style/maintainability.

3.1 Style Rubric

Much of this has been discussed in classes, but includes comments, meaningful variable names for meaningful variables (i.e. not throw-away variables such as loop variables), and sensible anti-bugging. It also includes making sure your program removes any temporary files that were created along the way.

For the style/maintainability mark, the rubric is:
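As an illustration of the temporary-file clean-up mentioned under 3.1, here is a minimal sketch of one common pattern: create the scratch file with mktemp and register its removal with trap. The helper max_of_args is hypothetical and exists only to give the temporary file something to do; it is not part of either assignment program.

```shell
#!/bin/sh
# Sketch: a scratch file that is removed however the script ends.

tmpfile=$(mktemp) || { echo "cannot create temporary file" >&2; exit 1; }

# Remove the scratch file on any exit. Common signals are turned into a
# normal exit so that the EXIT trap also runs when the script is interrupted.
trap 'rm -f "$tmpfile"' EXIT
trap 'exit 1' HUP INT TERM

# max_of_args (hypothetical helper): print the largest integer argument,
# using the scratch file as intermediate storage.
max_of_args() {
    printf '%s\n' "$@" > "$tmpfile"
    sort -n "$tmpfile" | tail -n 1
}

max_of_args 104 744 12
```

Registering the trap immediately after creating the file means there is no point at which the script can exit and leave the file behind.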