General Instructions

Aims of the Lab

The previous labs introduced you to R and to how the preprocessing stage of a data mining application works. In this lab you will learn and practise how to evaluate a model for a given problem. In particular, we will cover the following objectives:

Preliminaries

spamCols <- c('word.freq.make', 'word.freq.address', 'word.freq.all',
   'word.freq.3d', 'word.freq.our', 'word.freq.over', 'word.freq.remove',
   'word.freq.internet', 'word.freq.order', 'word.freq.mail',
   'word.freq.receive', 'word.freq.will', 'word.freq.people',
   'word.freq.report', 'word.freq.addresses', 'word.freq.free',
   'word.freq.business', 'word.freq.email', 'word.freq.you',
   'word.freq.credit', 'word.freq.your', 'word.freq.font',
   'word.freq.000', 'word.freq.money', 'word.freq.hp', 'word.freq.hpl',
   'word.freq.george', 'word.freq.650', 'word.freq.lab',
   'word.freq.labs', 'word.freq.telnet', 'word.freq.857',
   'word.freq.data', 'word.freq.415', 'word.freq.85',
   'word.freq.technology', 'word.freq.1999', 'word.freq.parts',
   'word.freq.pm', 'word.freq.direct', 'word.freq.cs',
   'word.freq.meeting', 'word.freq.original', 'word.freq.project',
   'word.freq.re', 'word.freq.edu', 'word.freq.table',
   'word.freq.conference', 'char.freq.semi', 'char.freq.lparen',
   'char.freq.lbrack', 'char.freq.bang', 'char.freq.dollar',
   'char.freq.hash', 'capital.run.length.average',
   'capital.run.length.longest', 'capital.run.length.total',
   'spam')

The following code draws from a uniform distribution to assign each row a whole number between 0 and 99, stored in the rgroup variable. Setting the seed first makes the assignment reproducible.

set.seed(2350290)
spamD$rgroup <- floor(100*runif(dim(spamD)[[1]]))
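The rgroup column can then be used to split the rows into training and testing sets. The following is a minimal sketch on a made-up data frame; the name toyD, its column x, and the cut-off of 10 (roughly a 90/10 split) are assumptions for illustration and are not taken from the lab data.

```r
# Hypothetical toy data standing in for spamD (assumption: 1000 rows, one feature)
set.seed(2350290)
toyD <- data.frame(x = rnorm(1000))

# Same idea as above: a whole number between 0 and 99 for every row
toyD$rgroup <- floor(100 * runif(nrow(toyD)))

# Assumed cut-off: rows with rgroup below 10 go to the test set (about 10%)
testCutoff <- 10
toyTrain <- toyD[toyD$rgroup >= testCutoff, ]
toyTest  <- toyD[toyD$rgroup <  testCutoff, ]

# Every row ends up in exactly one of the two sets
print(c(train = nrow(toyTrain), test = nrow(toyTest)))
```

Because the group number is stored in the data, the same split can be reproduced later, and a different cut-off gives a different train/test proportion without redrawing the random numbers.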

Now you have training and testing data for modelling. Before going into modelling, we will learn to implement functions for evaluating a model. One way of evaluating the accuracy of a classification model (where we have labelled data for the final outcome) is to use a confusion matrix. A confusion matrix compares the predicted values with the observed values. For a two-class problem it counts the true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN).

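# Observed labels and predicted spam scores for ten emails
obs <- c('spam','non-spam','non-spam','spam','non-spam','non-spam','non-spam','non-spam','non-spam','spam')
pred <- c(0.3,0.2,0.1,0.8,0.1,0.1,0.4,0.2,0.3,0.2)
# Cut-off for turning a score into a 'spam' prediction
threshold <- 0.5
# Keep labels and scores side by side; c(obs, pred) would coerce the scores to character
confdata <- data.frame(obs, pred, stringsAsFactors = FALSE)
# Expected confusion-matrix counts, for validating your own function
validateconf <- c('TP'=1,'TN'=7,'FP'=0,'FN'=2)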
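One possible way to count the four cells of the confusion matrix from these vectors is sketched below. The function name confusionCounts and its positive argument are hypothetical, and treating a score strictly greater than the threshold as 'spam' is an assumption (using >= gives the same counts for this data).

```r
obs <- c('spam','non-spam','non-spam','spam','non-spam','non-spam',
         'non-spam','non-spam','non-spam','spam')
pred <- c(0.3, 0.2, 0.1, 0.8, 0.1, 0.1, 0.4, 0.2, 0.3, 0.2)
threshold <- 0.5

# Hypothetical helper: counts TP, TN, FP and FN for a two-class problem
confusionCounts <- function(obs, pred, threshold, positive = 'spam') {
  predictedPositive <- pred > threshold   # assumption: strict comparison
  actualPositive <- obs == positive
  c('TP' = sum(predictedPositive & actualPositive),
    'TN' = sum(!predictedPositive & !actualPositive),
    'FP' = sum(predictedPositive & !actualPositive),
    'FN' = sum(!predictedPositive & actualPositive))
}

counts <- confusionCounts(obs, pred, threshold)
print(counts)

# The same information as a cross-tabulation of observed vs. predicted labels
predLabel <- ifelse(pred > threshold, 'spam', 'non-spam')
print(table(observed = obs, predicted = predLabel))
```

Comparing counts against validateconf (for example with all(counts == validateconf)) is a quick check that the function behaves as expected.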

Mapping business problem to machine learning tasks

5.2 Evaluating models

It would be easier to interpret the results if we knew how many of the emails classified as spam are actually spam (precision), and how many of the emails labelled as spam were classified as spam (recall). Such evaluation measures give more insight into the model than the raw counts alone. Read this wiki (https://en.wikipedia.org/wiki/Precision_and_recall) about precision, recall and accuracy.
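As a small worked example, the measures can be computed from the four confusion-matrix counts used earlier in this lab (TP=1, TN=7, FP=0, FN=2). This is only a sketch using the standard definitions; the variable names are chosen for illustration.

```r
# Confusion-matrix counts from the earlier example
counts <- c('TP' = 1, 'TN' = 7, 'FP' = 0, 'FN' = 2)

# Precision: of the emails predicted as spam, what fraction is really spam?
precision <- counts[['TP']] / (counts[['TP']] + counts[['FP']])

# Recall: of the emails that are really spam, what fraction did we catch?
recall <- counts[['TP']] / (counts[['TP']] + counts[['FN']])

# Accuracy: what fraction of all emails was classified correctly?
accuracy <- (counts[['TP']] + counts[['TN']]) / sum(counts)

print(c(precision = precision, recall = recall, accuracy = accuracy))
```

Here the accuracy is 0.8 and the precision is 1, yet the recall is only one third: the model never raises a false alarm, but it misses two of the three spam emails. Note that precision is undefined (NaN in R) when nothing is predicted as spam.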

Although models that output a label (e.g. ‘spam’ or ‘non-spam’) can be evaluated using accuracy measures, models that output a numeric score cannot be evaluated simply by counting the cells of a confusion matrix. Hence we need other measures, such as root-mean-square error or absolute error.
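The two error measures mentioned above can be sketched as follows. The actual and predicted values are made up for illustration and do not come from the lab data.

```r
# Hypothetical observed values and model scores (made-up numbers)
actual    <- c(3, 5, 2, 7)
predicted <- c(2.5, 5, 4, 8)

errors <- predicted - actual

# Root-mean-square error: penalises large errors more heavily
rmse <- sqrt(mean(errors^2))

# Mean absolute error: average size of the error, ignoring its sign
mae <- mean(abs(errors))

print(c(rmse = rmse, mae = mae))
```

Both measures are in the same units as the outcome. Because the errors are squared before averaging, the RMSE is never smaller than the mean absolute error, and the gap between the two grows when a few predictions are far off.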

Clustering models do not use labelled data. Instead, they often rely on distance information for their calculations, for example how close the points within a cluster are to one another.
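To illustrate how distance information can be used, the sketch below compares the average distance between points in the same cluster with the average distance between points in different clusters. The points and the cluster assignment are made up, and this particular comparison is just one simple option, not a measure prescribed by the lab.

```r
# Six hypothetical 2-D points and an assumed cluster assignment
pts <- matrix(c(0, 0,   0, 1,   1, 0,
                10, 10, 10, 11, 11, 10),
              ncol = 2, byrow = TRUE)
cluster <- c(1, 1, 1, 2, 2, 2)

# Pairwise Euclidean distances as a full matrix
d <- as.matrix(dist(pts))

# TRUE where a pair of points shares a cluster
sameCluster <- outer(cluster, cluster, "==")

# Use only the upper triangle so each pair is counted once
pairs <- upper.tri(d)

meanIntra <- mean(d[pairs & sameCluster])    # within-cluster distance
meanInter <- mean(d[pairs & !sameCluster])   # between-cluster distance

print(c(intra = meanIntra, inter = meanInter))
```

A clustering in which the within-cluster distance is much smaller than the between-cluster distance groups similar points together and keeps dissimilar points apart.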

