General Instructions
- Your lab sheets are structured with complementary information. The labs closely follow the structure of the book “Practical Data Science with R” by Nina Zumel and John Mount.
- For each lab, you are expected to answer all the questions that are presented with a question number.
Aims of the Lab
In this lab you will learn and practise how to apply data mining techniques to different applications. In particular, we will cover the following objectives:
- Prepare data for modelling
- Use models for outlier detection, association rule mining and regression
- Evaluate model quality
Outlier Detection
Outliers, or anomalies, are data points that deviate from the rest of the data, possibly because of errors in the data generation process. In statistical modelling, outliers are usually ‘treated’ (removed or replaced with a representative statistic such as the mean). However, in data mining, outliers may be the very ‘thing’ that the data scientist is looking for. For example, fraudulent electronic transactions may happen only once in a million transactions, yet it is important to detect them.
Q1 Create a dataset of 1000 data points drawn from the standard normal distribution. You can use the ‘rnorm()’ function. About 99.7% of the data of a standard normal distribution lie within three standard deviations of the mean. Treating the points outside that range as outliers, how many outlier points are there in your dataset? What are the values of these outliers?
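As a rough illustration of the three-standard-deviation rule in Q1, the sketch below draws the sample and flags the points that lie more than 3 away from the theoretical mean of 0. The seed and the variable names (x, is_outlier, n_outliers, outlier_values) are assumptions made for this example only. Because about 0.3% of a standard normal sample falls outside the range, roughly 3 of 1000 points would be expected, but the exact count changes from run to run.

```r
set.seed(42)  # arbitrary seed (an assumption), only so the run is repeatable

x <- rnorm(1000)            # rnorm() defaults to mean = 0 and sd = 1

# Standard normal: sd is 1 and mean is 0, so "3 sd from the mean" is |x| > 3
is_outlier <- abs(x) > 3

n_outliers <- sum(is_outlier)     # how many points were flagged
outlier_values <- x[is_outlier]   # the flagged values themselves

print(n_outliers)
print(outlier_values)
```

Using the sample mean and sample standard deviation instead of 0 and 1 is an equally reasonable reading of the question and may flag a slightly different set of points.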
Q2 Download the mean monthly solar exposure from 1990 to 2017 for ‘Perth’ (station number 9225) from the Bureau of Meteorology website (http://www.bom.gov.au/climate/data/). Load the data into a variable called ‘slrad’.
2.1 Plot the histogram of the mean monthly solar exposure for all the years. Does the distribution look like a normal distribution? If not, how would you find the outliers in this data?
2.2 We are interested in finding an anomalous year as well as the months of that year. You can use boxplots to visualise the answer to this question.
2.2.1 Draw a boxplot diagram of solar exposure vs year, where the boxplot of each year comprises the solar exposure values of its months.
2.2.2 Now draw a boxplot diagram of solar exposure vs month where the boxplot of each month contains data from every year.
2.2.3 In which month was the lowest solar exposure recorded? (Hint: Use the boxplots)
2.2.4 Which year had the highest median solar exposure?
2.2.5 In how many years did Perth receive solar exposure >30?
2.2.6 What anomalies can you find from the plots in 2.2.1 and 2.2.2?
2.3 Calculate the
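For the grouped boxplots in 2.2.1 and 2.2.2, one possible approach is the formula interface of boxplot(). The sketch below runs on made-up data, not on the Bureau of Meteorology download, and the column names year, month and exposure are assumptions; the real ‘slrad’ will probably need reshaping into this long form first. The value returned by boxplot() also carries the medians and the points beyond the whiskers (1.5 times the interquartile range by default), which can help with 2.2.3 to 2.2.6.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

# Made-up seasonal values in long form; NOT Bureau of Meteorology data.
# Column names year, month and exposure are assumptions.
demo <- expand.grid(month = 1:12, year = 1990:2017)
demo$exposure <- 18 + 10 * cos(2 * pi * (demo$month - 1) / 12) +
  rnorm(nrow(demo), sd = 1.5)

# Draw only in an interactive session; in batch mode just compute the statistics
draw <- interactive()

# 2.2.1: one box per year, each built from that year's 12 monthly values
by_year <- boxplot(exposure ~ year, data = demo, plot = draw,
                   xlab = "Year", ylab = "Solar exposure")

# 2.2.2: one box per month, each built from that month's value in every year
by_month <- boxplot(exposure ~ month, data = demo, plot = draw,
                    xlab = "Month", ylab = "Solar exposure")

# Row 3 of $stats holds the median of each box
median_by_year <- setNames(by_year$stats[3, ], by_year$names)

# $out lists the points beyond the whiskers; $group gives the box of each point
month_outliers <- data.frame(month = by_month$names[by_month$group],
                             exposure = by_month$out)

print(median_by_year)
print(month_outliers)
```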
Q3 Write a function ‘stdnormAnoms’ that takes a list of values and returns a list indicating true/false for each item. The function finds the mean and standard deviation of the input list. For each item in the list, check whether it lies three or more standard deviations away from the mean. If it does, mark the item as true; otherwise, mark it as false. In other words, this function implements the anomaly detection concept presented in Q1.
3.1 Use stdnormAnoms() to find anomalies in the data from Q1 and Q2.2.6. Do the results match your previous attempts?
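One way the function described in Q3 could look is sketched below. It is a minimal version under a few assumptions that the question does not state: the input is coerced to a numeric vector, missing values are ignored when computing the mean and standard deviation, and an input with no spread returns all FALSE rather than flagging everything.

```r
# Flag values that lie three or more standard deviations from the mean.
# Returns a logical vector with one TRUE/FALSE per input value.
stdnormAnoms <- function(values) {
  values <- as.numeric(unlist(values))   # accept a list or a vector
  centre <- mean(values, na.rm = TRUE)
  spread <- sd(values, na.rm = TRUE)

  # Assumption: with zero (or undefined) spread nothing counts as anomalous
  if (is.na(spread) || spread == 0) {
    return(rep(FALSE, length(values)))
  }

  abs(values - centre) >= 3 * spread
}

# Usage on a toy vector: twenty zeros and one extreme value
toy <- c(rep(0, 20), 100)
flags <- stdnormAnoms(toy)
print(which(flags))
```

Note that the extreme value itself inflates the mean and the standard deviation, so in very small samples a single outlier can hide from this rule.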
Regression
Regression analysis is used to find the relationship among variables. Regression analysis has a rich history of well-developed procedures for predicting values and finding trends in data.
Q4 We want to find out whether there is a trend in the maximum solar exposure in Perth, and whether there is a trend in the minimum solar exposure.
4.1 Find the maximum solar exposure value for every year and save it as ‘maxslrad’. Your data frame (maxslrad) should have two columns: one for the year and another for the maximum solar exposure.
4.2 Use linear regression to fit a linear model to maxslrad. What can you say about the maximum solar exposure in 2018?
4.3 How do you measure the accuracy of the results? One way of calculating the accuracy of regression results is to find the error between the observed values and the values predicted by the model, for example the root mean squared error (RMSE).
4.4 Find the minimum solar exposure value for every year, save it as ‘minslrad’, and find its trend.
4.5 What general conclusions can you draw from the trends of the minimum and maximum solar exposure values (4.2 and 4.4)?
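The steps in 4.1 to 4.3 can be sketched roughly as follows. The data here are made up and only stand in for ‘slrad’, and the names monthly, maxslrad_demo, fit, pred_2018 and rmse, as well as the column names, are assumptions for this example. The same pattern with FUN = min would cover 4.4.

```r
set.seed(7)  # arbitrary seed so the sketch is repeatable

# Made-up monthly values standing in for slrad; NOT Bureau of Meteorology data.
# Column names year, month and exposure are assumptions.
monthly <- expand.grid(month = 1:12, year = 1990:2017)
monthly$exposure <- 18 + 10 * cos(2 * pi * (monthly$month - 1) / 12) +
  rnorm(nrow(monthly), sd = 1.5)

# 4.1: one row per year holding that year's largest monthly value
maxslrad_demo <- aggregate(exposure ~ year, data = monthly, FUN = max)

# 4.2: straight-line fit of the yearly maximum against the year,
# then a prediction (with a prediction interval) for 2018
fit <- lm(exposure ~ year, data = maxslrad_demo)
pred_2018 <- predict(fit, newdata = data.frame(year = 2018),
                     interval = "prediction")

# 4.3: root mean squared error between observed and fitted values
rmse <- sqrt(mean((maxslrad_demo$exposure - fitted(fit))^2))

print(coef(fit))    # intercept and slope (change per year)
print(pred_2018)    # columns: fit, lwr, upr
print(rmse)
```

The slope returned by coef(fit) is the estimated change per year; summary(fit) additionally reports whether that slope is distinguishable from zero. The RMSE here is computed on the same data the model was fitted to, so it describes the fit rather than the accuracy on unseen years.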
Association Rule Mining
Association rule mining finds how frequently different items occur together. Association rule mining is heavily used in market basket analysis.
Q5 As a preliminary step, load the ‘arules’ library. Then load the information in data/bookdata.tsv into ‘bookdata’. The book data is in a special format called transactions. You can use the read.transactions() method. The data are separated by tabs. Give the column names ‘userId’ and ‘title’. Also, do not read in any duplicate transactions.
5.1 Inspect the number of transactions and the number of columns of ‘bookdata’. The columns of this transaction matrix represent the different book names. Each row represents a single transaction. For example, consider the record 100000…001: the first and last books are checked out (or in) in this transaction. The 0s represent the books that were not part of the transaction.
5.2 Explore the column names to find the book names.
5.3 Find the five most frequent books. How many times do they occur together?
5.4 Learn how to use the apriori() function.
5.4.1 Find the distribution of transaction sizes. Hint: use the size() method. Save the results in ‘basketSizes’.
5.4.2 Find the subset of bookdata where basketSizes > 1.
5.4.3 Use the apriori method to find the rules/patterns in the book data. Specify confidence = 0.75 and support = 0.002 as parameters.
5.4.4 Print the pairs of books sorted by their confidence. You can use the inspect() and sort() methods for this.
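To make the support and confidence parameters in 5.4.3 more concrete, the sketch below computes both by hand in base R on a tiny made-up set of baskets. It does not use ‘arules’ or the real bookdata.tsv; the basket contents and the helper functions support() and confidence() are hypothetical and written only for this illustration. The apriori() function searches for rules above such thresholds efficiently over all item combinations, which this toy version does not attempt.

```r
# Made-up baskets: one character vector of titles per user (NOT the real data)
baskets <- list(
  u1 = c("Book A", "Book B"),
  u2 = c("Book A", "Book B", "Book C"),
  u3 = c("Book A"),
  u4 = c("Book B", "Book C"),
  u5 = c("Book A", "Book B")
)

# Basket sizes, then keep only baskets with more than one book (as in 5.4.2)
basket_sizes <- lengths(baskets)
multi <- baskets[basket_sizes > 1]

# Support: fraction of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Confidence of the rule lhs -> rhs: support(lhs and rhs) / support(lhs)
confidence <- function(lhs, rhs, baskets) {
  support(c(lhs, rhs), baskets) / support(lhs, baskets)
}

print(support(c("Book A", "Book B"), multi))   # 0.75: 3 of the 4 baskets
print(confidence("Book A", "Book B", multi))   # 1: every basket with A has B
print(confidence("Book B", "Book A", multi))   # 0.75: 3 of 4 baskets with B have A
```

The last two lines show that confidence depends on the direction of the rule, which is why sorting rules by confidence in 5.4.4 ranks "A then B" and "B then A" separately.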