General Instructions
- Your lab sheets are structured with complementary information. The labs closely follow the structure of the book “Practical Data Science with R” by Nina Zumel and John Mount.
- In each lab you are expected to answer every question that has a question number.
Aims of the Lab
In this lab you will learn and practice how to get a sense of (explore) data using various visualization and summary statistics techniques before modeling. In particular, we will cover the following objectives:
- Using summary statistics to explore data
- Exploring data using visualization
- Finding problems and issues during data exploration
We will use the same dataset and example as the reference book (https://github.com/WinVector/zmPDSwR/tree/master/Custdata). A copy of the dataset is in the ‘data’ subfolder of your working directory. Open the data file in text
Preliminaries
The aim of the ‘custdata’ example is to build a model that predicts which customers do not have health insurance. The data that may help build a model for this goal has already been identified and collected for you.
- Q1 Read the data in custdata.tsv, stored in the ‘data’ subfolder, into a variable called ‘custdata’. Hint: use the read.table() function.
A dataset may contain inconsistencies and may not be ready to use for a given task as it is, because of missing values, outliers and irrelevant values. Hence we use data exploration to get an idea of what a dataset contains.
- Q2 Find the type of the ‘custdata’ variable using the class() function.
- Q3 In production environments, data are usually stored in SQL databases. Why does a typical data exploration task in R use R's built-in data structures rather than SQL queries?
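For Q1 and Q2, the general pattern of reading a tab-separated file and checking its type might look like the sketch below. It writes a tiny made-up file to a temporary location so that it runs anywhere; the file contents and the names `tsv_path` and `toy` are assumptions for illustration and are not part of custdata.

```r
# Hypothetical stand-in for a .tsv file: three made-up rows, tab-separated.
tsv_path <- tempfile(fileext = ".tsv")
writeLines(c(
  "id\tsex\tincome\tage",
  "1\tF\t21000\t49",
  "2\tM\t0\t40",
  "3\tM\t45000\t22"
), tsv_path)

# header = TRUE takes the column names from the first line;
# sep = "\t" splits each line on tab characters.
toy <- read.table(tsv_path, header = TRUE, sep = "\t")

print(class(toy))  # the type of the object returned by read.table()
str(toy)           # the column types inferred from the file contents
```

For the lab itself, the path would point at the file in the ‘data’ subfolder instead of a temporary file.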
1. Working with Summary Statistics
To get a quick understanding of a dataset we can use summary statistics.
- Q4 Use the following code snippet to get the summary of customer data.
    summary(custdata)

         custid          sex      is.employed         income
     Min.   :   2068     F:440    Mode :logical    Min.   : -8700
     1st Qu.: 345667     M:560    FALSE:73         1st Qu.: 14600
     Median : 693403              TRUE :599        Median : 35000
     Mean   : 698500              NA's :328        Mean   : 53505
     3rd Qu.:1044606                               3rd Qu.: 67000
     Max.   :1414286                               Max.   :615000

                 marital.stat    health.ins                         housing.type
     Divorced/Separated:155     Mode :logical    Homeowner free and clear    :157
     Married           :516     FALSE:159        Homeowner with mortgage/loan:412
     Never Married     :233     TRUE :841        Occupied with no rent       : 11
     Widowed           : 96     NA's :0          Rented                      :364
                                                 NA's                        : 56

     recent.move       num.vehicles
     Mode :logical    Min.   :0.000
     FALSE:820        1st Qu.:1.000
     TRUE :124        Median :2.000
     NA's :56         Mean   :1.916
                      3rd Qu.:2.000
                      Max.   :6.000
                      NA's   :56

          age              state.of.res    is.employed.fix
     Min.   :  0.0    California  :100    Length:1000
     1st Qu.: 38.0    New York    : 71    Class :character
     Median : 50.0    Pennsylvania: 70    Mode  :character
     Mean   : 51.7    Texas       : 56
     3rd Qu.: 64.0    Michigan    : 52
     Max.   :146.7    Ohio        : 51
                      (Other)     :600
- Q5 From the output of Q4, answer the following:
- 5.1 Are there invalid values in the ‘income’ summary? If so, which value is it? Give reasons for your conclusion.
- 5.2 How many missing values are there in ‘is.employed’? How significant is this number as a percentage of the data?
- 5.3 Comment on how to interpret the minimum, average and maximum age of a person using the summary statistics. Are they plausible values?
1.1 Problems revealed by summary statistics
- Missing Values
Q6 Which fields in custdata share the same number of missing values? Are the missing values significant as a percentage of the data?
Q7 Compare the percentage figures obtained from Q5.2 and Q6. What is your strategy to deal with missing values in each case?
- Outliers and Invalid Values
Outliers are data points that fall well outside the range you expect your data to be in.
Q8 Comment on what you observe in the summary of the income field.
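The two kinds of problem above, missing values and out-of-range values, can also be counted directly rather than read off the summary. The following is a minimal sketch on a made-up data frame; the values, the name `toy`, and the plausibility cut-offs are illustrative assumptions, not rules from the lab or values from custdata.

```r
# Made-up data frame with a few deliberately bad values.
toy <- data.frame(
  age         = c(34, 0, 51, 130, 62, 28),
  income      = c(42000, -500, 15000, 67000, NA, 35000),
  is.employed = c(TRUE, NA, NA, FALSE, TRUE, NA)
)

# Number of missing values in each column.
na_count <- colSums(is.na(toy))

# The same counts expressed as a percentage of all rows.
na_percent <- 100 * colMeans(is.na(toy))

# Rows with values outside an assumed plausible range.
# which() drops comparisons that evaluate to NA.
suspect <- toy[which(toy$age <= 0 | toy$age > 100 | toy$income < 0), ]

print(na_count)
print(round(na_percent, 1))
print(suspect)
```

Whether a given percentage of missing values is acceptable depends on the column and on the modeling task, so the numbers alone do not settle the strategy.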
2. Working with Visualizations
Spotting problems using graphics and visualisations
2.1 Visually checking distributions for a single variable.
The visualisations we discuss in this section can answer questions like
- What is the peak value of the distribution?
- How many peaks are there in the distribution (unimodality versus bimodality)?
- How normal (or lognormal) is the data?
- How much does the data vary? Is it concentrated in a certain interval or in a certain category?
Histograms
A histogram discretizes the range of a variable into bins and presents the frequency of each bin as a visualisation.
- Q11 Use the following code to generate the histogram of the age variable of the customer data.
- 11.1 What does the binwidth parameter mean?
- 11.2 Change the binwidth value to 2 and describe what happens to the histogram shape.
- 11.3 Change the binwidth value to 10 and describe what happens to the histogram shape.
- 11.4 Are there disadvantages of using histograms?

- Q12 Children under 5 years do not use healthcare and people rarely live over 100 years. Based on these statements and using the histogram in Q11, can you identify outliers and invalid values?
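As a rough illustration of how the bin width is passed to ggplot2, the sketch below draws the same synthetic variable with two different `binwidth` values. The data are random numbers generated for this example, and the names `toy`, `p_narrow` and `p_wide` are assumptions; this is not necessarily the code intended for Q11.

```r
library(ggplot2)

set.seed(1)
# Synthetic ages, roughly bell-shaped; purely illustrative.
toy <- data.frame(age = rnorm(500, mean = 50, sd = 15))

# Same data, bins that are 2 units wide on the x axis.
p_narrow <- ggplot(toy) +
  geom_histogram(aes(x = age), binwidth = 2, fill = "gray")

# Same data, bins that are 10 units wide on the x axis.
p_wide <- ggplot(toy) +
  geom_histogram(aes(x = age), binwidth = 10, fill = "gray")

p_narrow
p_wide
```

Comparing the two plots side by side is one way to see how much the apparent shape of a distribution depends on the chosen bin width.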
Density Plots
A density plot can be used to quickly get an idea of the distribution: whether the data is concentrated in one area or spread out.
- Q13 Use the following code to generate the density plot for the income variable.
- Give a rough estimate of the income range where most of the population is concentrated. If you want to expand this part of the population further, you can use a logarithmic scale (e.g. scale_x_log10).
- How many subpopulations can be found in the income data?
    library(ggplot2)
    library(scales)  # provides the dollar label formatter

    ggplot(custdata) +
      geom_density(aes(x = income)) +
      scale_x_continuous(labels = dollar)
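The logarithmic scale mentioned in Q13 could be applied along the lines of the sketch below. It uses synthetic, right-skewed incomes rather than custdata, and the break positions and the names `toy` and `p_log` are assumptions chosen for illustration.

```r
library(ggplot2)
library(scales)

set.seed(2)
# Synthetic, right-skewed incomes drawn from a lognormal distribution.
toy <- data.frame(income = rlnorm(500, meanlog = 10.5, sdlog = 0.8))

# scale_x_log10() places the x axis on a base-10 logarithmic scale,
# which stretches out the crowded low end of a skewed distribution.
p_log <- ggplot(toy) +
  geom_density(aes(x = income)) +
  scale_x_log10(breaks = c(1000, 10000, 100000, 1000000), labels = dollar)

p_log
```

Note that zero and negative values have no logarithm, so on real data ggplot2 would drop such rows from a log-scaled plot and issue a warning.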

Bar Charts
A bar chart is a histogram for discrete data.
- Q14 Use the following code snippet, along with the code from the earlier questions, to generate the bar plot for marital status. If you are going to use marital status as one of the modeling variables for health insurance, it helps to check that each category is well represented across the population.
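A bar chart of a single categorical variable might be produced as in the following sketch. The data frame `toy` holds made-up category values, so the counts bear no relation to custdata, and this is an assumed form of the plot rather than the lab's own snippet.

```r
library(ggplot2)

# Made-up marital status values; the counts are not those of custdata.
toy <- data.frame(
  marital.stat = factor(c("Married", "Married", "Never Married", "Widowed",
                          "Married", "Divorced/Separated", "Never Married"))
)

# geom_bar() counts the rows in each category and draws one bar per level.
p_bar <- ggplot(toy) +
  geom_bar(aes(x = marital.stat), fill = "gray")

# The same counts as a plain table.
counts <- table(toy$marital.stat)
print(counts)

p_bar
```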

2.2 Visually checking relationships between two variables
The visualisations we discuss in this section can answer questions like
- Is there a relationship between the two inputs age and income in custdata?
- What kind of relationship, and how strong?
- Is there a relationship between the input marital status and the output health insurance? How strong?
Line Plots
Line plots work best when the relationship between two variables is relatively clean.
- Q17 Use the following code to generate a line plot for abstract data.
    x <- runif(100)
    y <- x^2 + 0.2*x
    ggplot(data.frame(x=x, y=y), aes(x=x, y=y)) + geom_line()

Scatter Plots and Smoothing Curves
Sometimes the relationship between two variables is not as clean (that is, not as strongly correlated) as in the synthetic data we generated in Q17. We can use the correlation summary statistic to quantify the relationship between two variables. Scatter plots can provide further information.
- Q18 Use the following code to filter a sensible subset of data from custdata. Then find the correlation between age and income using the cor() function.
    custdata2 <- subset(custdata, (custdata$age > 0 & custdata$age < 100 & custdata$income > 0))
- Q19 Create a scatter plot to find the relationship between income and age using the following code. In addition to the scatter plot, the code draws a smoothing line that shows the (linear) relationship between the two variables.
- Is the smoothing curve helpful for seeing the relationship between the two variables in this example?
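The combination of a correlation coefficient and a scatter plot with a straight-line fit can be sketched as follows. The data are synthetic (a weak linear trend buried in noise), and the names `toy`, `r` and `p_lm` are assumptions for this example; the snippet is not presented as the lab's own Q19 code.

```r
library(ggplot2)

set.seed(3)
# Synthetic data: a weak linear trend plus a lot of noise.
age    <- runif(300, min = 18, max = 90)
income <- 20000 + 300 * age + rnorm(300, sd = 25000)

# Keep only rows with positive income, mirroring the idea of a sensible subset.
toy <- subset(data.frame(age = age, income = income), income > 0)

# Pearson correlation between the two columns.
r <- cor(toy$age, toy$income)
print(r)

# Scatter plot with a fitted straight line; method = "lm" requests a linear model.
p_lm <- ggplot(toy, aes(x = age, y = income)) +
  geom_point() +
  geom_smooth(method = "lm")

p_lm
```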

- Q20 Draw the scatter plot and the smoothing curve without specifying the smoothing function, as follows.
- What is the difference between the smoothing lines in Q19 and Q20?
- What does the (shaded) ribbon around the smoothing curve mean?
    ggplot(custdata2, aes(x=age, y=income)) +
      geom_point() +
      geom_smooth() +
      ylim(0, 200000)

- Q21 Change the scatter plot code in Q20 to plot health.ins and age. Comment on the shape and direction of the smoothing line.
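When one of the two variables is logical, it has to be converted to numbers before it can go on a continuous axis. The sketch below shows one way this might be done on synthetic data; the column name `has.ins`, the jitter amounts, and the way the data are generated are all assumptions made for illustration.

```r
library(ggplot2)

set.seed(4)
# Synthetic data: the chance that has.ins is TRUE is made to rise with age.
age     <- runif(300, min = 18, max = 90)
has.ins <- runif(300) < plogis((age - 40) / 15)
toy     <- data.frame(age = age, has.ins = has.ins)

# as.numeric() maps FALSE to 0 and TRUE to 1 so the column can go on the y axis.
# A small jitter separates points that would otherwise sit on top of each other.
p_ins <- ggplot(toy, aes(x = age, y = as.numeric(has.ins))) +
  geom_point(position = position_jitter(width = 0.05, height = 0.05)) +
  geom_smooth()

p_ins
```

With a 0/1 response, the height of the smoothing curve at a given x can be read as an estimate of the proportion of TRUE values near that x.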
Bar Charts for two categorical variables
We can use bar charts for two categorical variables to represent probabilities.
- Q22 Draw two types of bar chart to present health insurance and marital status using the following code.
- How do you interpret the height of the bars in each plot?
- What are the advantages of drawing bar charts for two categorical variables? (Use these charts in your explanation.)
    ggplot(custdata) + geom_bar(aes(x=marital.stat, fill=health.ins))
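Besides the default stacked layout, `geom_bar()` accepts a `position` argument that changes how the two variables are combined. The sketch below shows the side-by-side and filled variants on made-up data; the data frame `toy` and the column name `has.ins` are assumptions, and these are offered as possible chart types rather than the specific ones intended for Q22.

```r
library(ggplot2)

# Made-up values; not taken from custdata.
toy <- data.frame(
  marital.stat = c("Married", "Married", "Married", "Never Married",
                   "Never Married", "Widowed", "Widowed", "Married"),
  has.ins      = c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
)

# Side-by-side bars: one bar per (marital.stat, has.ins) combination.
p_dodge <- ggplot(toy) +
  geom_bar(aes(x = marital.stat, fill = has.ins), position = "dodge")

# Filled bars: every bar is rescaled to height 1, so segments show proportions.
p_fill <- ggplot(toy) +
  geom_bar(aes(x = marital.stat, fill = has.ins), position = "fill")

# Row-wise proportions corresponding to the segments of the filled bars.
tab <- prop.table(table(toy$marital.stat, toy$has.ins), margin = 1)
print(tab)

p_dodge
p_fill
```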


