---
title: "CITS 4009 Lab 5 - Choosing and Evaluating Models"
output: html_notebook
---

### General Instructions
* Your labsheets are structured with complementary information. The labs closely follow the structure of the book "Practical Data Science with R" by Nina Zumel and John Mount.
* In each lab, you are expected to answer every question that is presented with a question number.

### Aims of the Lab
The previous labs introduced you to R and to how the preprocessing stage of a data mining application works. In this lab you will learn and practise how to evaluate a model for a given problem. In particular,
we will cover the following objectives:

* Writing functions 
* Splitting data for training and evaluating models
* Evaluating model quality




#### We will use the dataset at (http://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data). 

# Preliminaries
* **Q1** Load the above dataset into a variable named 'spamD' using the read.table() function. The data is comma separated and has no header. Set the column names of the data using the spamCols vector. Note: lab5-ex.R is not provided for this lab, so you have to create a new R script.

* **Q2** The 'spam' column labels whether the email is spam or not. However, this is encoded as a score (e.g. 0.7). Change the values in the spam column to the factor/text values 'spam' or 'non-spam' using a threshold of 0.5. E.g. if a record has the value 0.4 in the spam column of the original dataset, its new value should be 'non-spam'.
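
As a rough illustration of the kind of steps Q1 and Q2 involve, the sketch below reads a few comma-separated rows from an inline string instead of the real dataset. The three-column layout, the values, and the names `csvText`, `toy` and `toyCols` are all invented for this example, and treating scores strictly above 0.5 as 'spam' is an assumption.

```r
# Toy stand-in for a headerless, comma-separated file (values are made up).
csvText <- "0.10,0.00,1
0.00,0.32,0
0.21,0.28,1"

# read.table() can read inline text via `text=`; sep and header mirror Q1.
toy <- read.table(text = csvText, sep = ",", header = FALSE)

# Hypothetical column names, assigned the same way spamCols would be.
toyCols <- c("word.freq.a", "word.freq.b", "spam")
colnames(toy) <- toyCols

# Recode the numeric column into a two-level factor using a 0.5 threshold.
toy$spam <- factor(ifelse(toy$spam > 0.5, "spam", "non-spam"),
                   levels = c("non-spam", "spam"))
print(toy)
```

Fixing the factor levels explicitly keeps both labels available even if a small sample happens to contain only one class.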



```{r}
spamCols <- c('word.freq.make', 'word.freq.address', 'word.freq.all',
   'word.freq.3d', 'word.freq.our', 'word.freq.over', 'word.freq.remove',
   'word.freq.internet', 'word.freq.order', 'word.freq.mail',
   'word.freq.receive', 'word.freq.will', 'word.freq.people',
   'word.freq.report', 'word.freq.addresses', 'word.freq.free',
   'word.freq.business', 'word.freq.email', 'word.freq.you',
   'word.freq.credit', 'word.freq.your', 'word.freq.font',
   'word.freq.000', 'word.freq.money', 'word.freq.hp', 'word.freq.hpl',
   'word.freq.george', 'word.freq.650', 'word.freq.lab',
   'word.freq.labs', 'word.freq.telnet', 'word.freq.857',
   'word.freq.data', 'word.freq.415', 'word.freq.85',
   'word.freq.technology', 'word.freq.1999', 'word.freq.parts',
   'word.freq.pm', 'word.freq.direct', 'word.freq.cs',
   'word.freq.meeting', 'word.freq.original', 'word.freq.project',
   'word.freq.re', 'word.freq.edu', 'word.freq.table',
   'word.freq.conference', 'char.freq.semi', 'char.freq.lparen',
   'char.freq.lbrack', 'char.freq.bang', 'char.freq.dollar',
   'char.freq.hash', 'capital.run.length.average',
   'capital.run.length.longest', 'capital.run.length.total',
   'spam')

```

The following code draws from a uniform distribution to assign each row an integer between 0 and 99, stored in the rgroup variable.

```{r}
set.seed(2350290)
spamD$rgroup <- floor(100*runif(dim(spamD)[[1]]))

```



* **Q3** Read the documentation of the set.seed() function. What is the purpose of the number (2350290) passed to the function? Why is set.seed() called before creating the rgroup variable?

* **Q4** You need to split the spamD emails into two subsets: a training set (spamTrain) and a testing set (spamTest). Rows with rgroup values less than 10 are used for testing and the rest are used for training.
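
The idea behind a seeded random split can be sketched on a made-up data frame. Here `toyD`, `toyTest`, `toyTrain` and the seed value 42 are illustrative assumptions, not part of the lab data.

```r
# Toy data frame standing in for spamD (contents are made up).
toyD <- data.frame(x = 1:1000)

# Seeding makes the pseudo-random draws repeatable across runs.
set.seed(42)                              # arbitrary seed chosen for this sketch
g1 <- floor(100 * runif(nrow(toyD)))
set.seed(42)
g2 <- floor(100 * runif(nrow(toyD)))
print(identical(g1, g2))                  # same seed, same group numbers

# Use the group number to carve out roughly 10% of the rows for testing.
toyD$rgroup <- g1
toyTest  <- subset(toyD, rgroup < 10)
toyTrain <- subset(toyD, rgroup >= 10)
print(c(test = nrow(toyTest), train = nrow(toyTrain)))
```

Storing the group number as a column, rather than sampling row indices directly, means the same split can be recovered later from the data itself.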

Now you have training and testing data for modelling. Before going into modelling, we will learn to implement functions for evaluating a model. One way of evaluating the accuracy of a model for a classification problem (where we have labelled data for the final outcome) is to use a confusion matrix. A confusion matrix compares the predicted values with the observed values.

* **Q5** Read http://www.statmethods.net/management/userfunctions.html to see how functions are implemented in R. Now write a function named 'createConfMatrix(obs,pred,threshold)', where 'obs' and 'pred' are vectors of equal length and threshold is a value in the range [0,1]. Within the function you should implement the following steps.

    5.1 Compare the pred values with the threshold. This results in a vector of true and false values. True means we predicted the email to be spam and false means we predicted it to be non-spam.
    
    5.2 Now count how many rows predicted as spam are actually spam emails. This can be done by comparing the resultant vector from 5.1 with the obs vector (assume obs contains the values spam/non-spam). This count is the number of True Positives (TP) in the confusion matrix of the data.
    
    5.3 As you did in 5.2, count how many True Negatives (TN), False Positives (FP) and False Negatives (FN) are in our prediction.
    
    5.4 Return the results as a vector c(TP,TN,FP,FN) of four values.
    
    (Hint: you can perform the steps 5.1 to 5.4 in single step using table() command.)
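
The table() hint above can be sketched on made-up inputs. The names `toyObs`, `toyPred`, `toyThreshold`, `cm` and `counts` are hypothetical, and treating scores strictly above the threshold as spam is an assumption; this is one possible approach rather than the expected solution.

```r
# Toy inputs (made up): observed labels and predicted scores.
toyObs  <- c("spam", "non-spam", "spam", "non-spam", "spam")
toyPred <- c(0.9, 0.6, 0.2, 0.1, 0.7)
toyThreshold <- 0.5

# Fixing the factor levels guarantees a full 2x2 table even if a class is absent.
predLabel <- factor(ifelse(toyPred > toyThreshold, "spam", "non-spam"),
                    levels = c("non-spam", "spam"))
obsLabel  <- factor(toyObs, levels = c("non-spam", "spam"))

# Rows are predictions, columns are observations.
cm <- table(pred = predLabel, obs = obsLabel)

# Pick the four cells out of the table by name.
counts <- c(TP = cm["spam", "spam"],
            TN = cm["non-spam", "non-spam"],
            FP = cm["spam", "non-spam"],
            FN = cm["non-spam", "spam"])
print(counts)
```

Indexing the table by label name rather than by position avoids mixing up the cells if the level order ever changes.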
    
    
* **Q6** Now apply createConfMatrix to the obs and pred values in confdata, and check whether your implementation is correct using validateconf.
```{r}
obs<-c('spam','non-spam','non-spam','spam','non-spam','non-spam','non-spam','non-spam','non-spam','spam')
pred<-c(0.3,0.2,0.1,0.8,0.1,0.1,0.4,0.2,0.3,0.2)
threshold<-0.5
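# Keep obs and pred as separate columns; c(obs, pred) would coerce the scores to character
confdata<-data.frame(obs=obs,pred=pred,stringsAsFactors=FALSE)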
validateconf<-c('TP'=1,'TN'=7,'FP'=0,'FN'=2)
```




# Mapping business problem to machine learning tasks
* **Q7** Create a spam classification model called 'spamModel' using the training data (spamTrain). To do this:

    7.1 Select the names of all the variables (columns of spamD) except spam and rgroup. Hint: Use the setdiff() function. Save the selected names as spamVars.
        
    7.2 Use a generalised linear model (glm()) to create a model (name it spamModel) from the spamVars. glm() needs a formula to specify the variables, and a link function.
            
        7.2.1 Use as.formula() to specify the variables (spamVars).
            
        7.2.2 Use binomial(link='logit') as the family/link function.
            
    7.3 Apply spamModel to spamTrain to get predictions on the training data. You can use the predict() function (with type='response' it returns probabilities). Save the results to spamTrain$pred.
            
            
* **Q8** Apply spamModel to the testing data (spamTest), as you did in 7.3.
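
The formula-building and prediction steps can be sketched on R's built-in mtcars data, which stands in for the spam data here. The binary `am` column plays the role of the label; `toy`, `toyVars`, `f` and `toyModel` are hypothetical names used only in this example.

```r
# Built-in mtcars data as a stand-in; `am` (0/1) plays the role of the label.
toy <- mtcars[, c("am", "hp", "wt")]

# Every column except the label becomes an input variable.
toyVars <- setdiff(colnames(toy), "am")

# Build the formula am ~ hp + wt from the variable names.
f <- as.formula(paste("am", paste(toyVars, collapse = " + "), sep = " ~ "))

# Logistic regression: binomial family with a logit link.
toyModel <- glm(f, family = binomial(link = "logit"), data = toy)

# type = "response" returns probabilities rather than log-odds.
toy$pred <- predict(toyModel, newdata = toy, type = "response")
print(head(toy))
```

Building the formula from a character vector scales to dozens of variables without typing each name by hand.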

        

# Evaluating models


* **Q9** Calculate the confusion matrix of the test data using the createConfMatrix() function you implemented in Q5.

It would be easier to interpret the results if we knew what fraction of the emails classified as spam are actually spam, and what fraction of all the spam emails were classified as spam. Such evaluation measures give more insight into the model. Read the Wikipedia article (https://en.wikipedia.org/wiki/Precision_and_recall) on precision, recall and accuracy.
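
As a small worked example of these three measures, the sketch below starts from the counts in the validateconf vector used earlier in this lab. The variable names are illustrative only.

```r
# Counts taken from the validateconf example earlier in this lab.
counts <- c(TP = 1, TN = 7, FP = 0, FN = 2)

# Accuracy: fraction of all predictions that are correct.
accuracy  <- unname((counts["TP"] + counts["TN"]) / sum(counts))
# Precision: fraction of predicted spam that really is spam.
precision <- unname(counts["TP"] / (counts["TP"] + counts["FP"]))
# Recall: fraction of actual spam that was predicted as spam.
recall    <- unname(counts["TP"] / (counts["TP"] + counts["FN"]))

print(round(c(accuracy = accuracy, precision = precision, recall = recall), 3))
```

Note that precision is undefined (R returns NaN) when nothing is predicted as spam, because TP + FP is then zero.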

* **Q10** Implement functions that calculate accuracy, precision and recall. You can use the same 'obs', 'pred' and 'threshold' as input parameters for these functions.

* **Q11** Calculate the accuracy, precision and recall of the predictions on the spamTest data using the functions you implemented in Q10. Interpret each result in terms of the 



Although models that output a label (e.g. 'spam' or 'non-spam') can be evaluated using accuracy measures, models that output a score cannot be evaluated by simply counting outcomes in a confusion matrix. Hence we need other measures, such as root-mean-square error or absolute error.
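
Both error measures can be computed in a line or two of R. The sketch below uses made-up observed values `y` and model scores `score`; taking the mean of the absolute errors is one common way to summarise absolute error.

```r
# Made-up observed values and model scores.
y     <- c(3.0, 5.0, 2.5, 7.0)
score <- c(2.5, 5.0, 4.0, 8.0)

err  <- score - y
rmse <- sqrt(mean(err^2))   # root-mean-square error
mae  <- mean(abs(err))      # mean absolute error
print(c(rmse = rmse, mae = mae))
```

Because the errors are squared before averaging, RMSE penalises a few large errors more heavily than the mean absolute error does.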

Clustering models do not use labelled data. Instead, they often use distance information in their calculations.

* **Q12** Create a 100*2 matrix (d) where each column is created from a uniform distribution. Note: you should be able to reproduce d exactly the same each time you run the program.

* **Q13** Cluster the matrix d using the k-means algorithm (use the kmeans() function). You can specify the number of centres as 5. Save the cluster assignments (the cluster component of the kmeans() result, e.g. xx\$cluster) to d\$clus. Since d is a matrix, convert it to a data frame first so that the clus column can be added.

* **Q14** Visualise the clustering results and see how the clusters are separated and how their sizes vary.

* **Q15** Calculate the distances between every pair of cluster centres of d.
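
The overall clustering workflow can be sketched as follows. The names `m`, `km` and `mDF` and the seed value 1 are assumptions made for this example, and dist() is used with its default Euclidean metric.

```r
set.seed(1)                           # arbitrary seed so the sketch is repeatable
m  <- matrix(runif(200), ncol = 2)    # 100 x 2 matrix of uniform draws
km <- kmeans(m, centers = 5)

# A matrix cannot take a new column via `$`, so keep the labels in a data frame.
mDF <- as.data.frame(m)               # columns are named V1 and V2
mDF$clus <- km$cluster

print(table(mDF$clus))                # cluster sizes
print(dist(km$centers))               # pairwise distances between the 5 centres
```

A quick scatter plot such as `plot(mDF$V1, mDF$V2, col = mDF$clus)` shows how the clusters are separated. Since kmeans() picks its starting centres at random, the seed must be set before calling it to get the same clusters each run.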










