General Instructions

Aims of the Lab

In this lab you will learn and practice how to load data from various source in to R. In particular we will cover the following objectives:

1. Working with Files or URLs

Structured data from a file/URL can be imported into R using an R table . More information about the table() function is at https://stat.ethz.ch/R-manual/R-devel/library/base/html/table.html.

Tip for getting help in R: When you want to access a help page from the console type ?nameofthefunction E.g. >?table will show the help page for R table in base package. Alternatively you can search the topic in the search bar on the help pane in R studio.

1.1 Working with Structured Data

  • Q1 Download the ‘lab1.zip’ folder from url and extract it. Open the ‘lab1-ex.R’ in RStudio. Set the working directory(https://stat.ethz.ch/R-manual/R-devel/library/base/html/getwd.html) to the extracted folder. Press ‘ALT+CRTL+R’ (this is the shortcut to execture an entire R script in RStudio). If there are no errors you can start on Q2, otherwise check the path you have assigned to the working directory. Note: paths in windows platforms are different to unix (mac or linux) platforms.

1.1.1 Familiarise with a dataset

  • Q2 Find the ‘car.data.csv’ file in the ‘data’ subfolder supplied with Lab 1. Open the file with Microsoft Excel (Please do not perform any suggested saving options by MS Excel). Note how the data and headers are given. Close the file. If you are in Mac environment, follow the steps in a spreadsheet application.

  • Q3 Open the ‘car.data.csv’ file with Microsoft Wordpad. Do you notice any difference in the view?

  • Q4 Create a duplicate of the file ‘car.data.csv’ and save it as ‘manuallyedit.car.data.csv’. Open this file change the first column name from ‘buying’ to ‘Buying’. Save the changes and close the file.

  • Q5 Categorize the data types in the file as quantitative or qualitiative.

1.1.2 Importing data into R

  • Q6 Use the following code snippet to load the content of the ‘car.data.csv’ into an R table data structure. Although the URL works, you have to provide the path to the data file in your code.
uciCar=read.table(file='http://www.win-vector.com/dfiles/car.data.csv',header = TRUE,sep = ',')
  • Q7 Import the same dataset into a different variable ‘uciCarWr’using the read.table() funnction. Set ’FALSE’ to the header parameter.

1.1.3 Examining the imported data

  • Q8 Use dim function to get the dimensions of the dataset. How many data entries are there in the dataset?

  • Q9 How many rows are there in the ‘uciCarWr’? Why the answer is different to the answer from Q7

  • Q10 Use the ‘class()’ function to find the type of the variables in ‘uciCar’. E.g. class(uciCar$buying) will return the type of the ‘buying’ variable. Are they all different or same? Write your observation as comment

  • Q11 Enter the following command. Describe the meaning of entries under ‘doors’.

summary(uciCar)
   buying      maint       doors     persons     lug_boot    safety      rating    
 high :432   high :432   2    :432   2   :576   big  :576   high:576   acc  : 384  
 low  :432   low  :432   3    :432   4   :576   med  :576   low :576   good :  69  
 med  :432   med  :432   4    :432   more:576   small:576   med :576   unacc:1210  
 vhigh:432   vhigh:432   5more:432                                     vgood:  65  

1.1.4 Working with other data formats

R has many adaptors that can be used to import data from a variety of sources.

1.2 Working with Less-Structured Data

Not all data are stored in a structured manner. Often data scientists use scripts to manipulate imported data to impose a structure, so that the data can be analysed conveniently.

  • Q12 Go to the URL ‘http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)’ which hosts the German Bank Credit Dataset. Download the dataset into your working directory’s data folder.

  • Q13 Use the following command to load the data into a variable called ‘d’. Use the downloaded file path and type in your code. What are the column names?

d <- read.table(paste('http://archive.ics.uci.edu/ml/',
'machine-learning-databases/statlog/german/german.data',sep=''),stringsAsFactors=F,header=F)
  • Q14 Change the column names of the dataset using the ‘colnames()’ function as follows. The ‘c()’ command construct a vector that contains the desired column names.
colnames(d) <- c('Status.of.existing.checking.account',
'Duration.in.month', 'Credit.history', 'Purpose',
'Credit.amount', 'Savings account/bonds',
'Present.employment.since','Installment.rate.in.percentage.of.disposable.income',
'Personal.status.and.sex', 'Other.debtors/guarantors',
'Present.residence.since', 'Property', 'Age.in.years',
'Other.installment.plans', 'Housing',
'Number.of.existing.credits.at.this.bank', 'Job',
'Number.of.people.being.liable.to.provide.maintenance.for',
'Telephone', 'foreign.worker', 'Good.Loan')

Using a data dictionary or a schema documentation you can transform data in R. A list in R can be used to create such a mapping structure. For example, meaning of the A-* codes can be decoded using a list like below.

mapping <- list(
'A40'='car (new)',
'A41'='car (used)',
'A42'='furniture/equipment',
'A43'='radio/television',
'A44'='domestic appliances'
)

The’ actual data can be decoded using the ‘mapping’ list as follows.

for(i in 1:(dim(d))[2]) {
    if(class(d[,i])=='character') {
        d[,i] <- as.factor(as.character(mapping[d[,i]]))
    }
}
  • Q15 Use the above code to transform the data. Then, use the following code to view the data
table(d$Purpose,d$Good.Loan)
                     
                        1   2
  car (new)           145  89
  car (used)           86  17
  domestic appliances   8   4
  furniture/equipment 123  58
  NULL                120  70
  radio/television    218  62

2. Working with Relational Databases

Relational databases can scale upto millions of records. And structured queries can be feteched easily with relational databases. In the following sections, we will examine, how to get data from a database into R for analysis.

2.1 Curating the Data

You need to record the source of the data properly. Go to the bereau of meteorology and get download climate data for july 2016 and july 2017 months.

  • Q16 Create a data provonence documentation for this dataset. Your note should include data access date, source url, how you navigate the data in the given url, data format and cryptographic hashes. ( Hint: You can find a similar document in chapter 2 of the reference book)

2.3 Loading Data from a Database into R

  • Q17 Explore the ‘julyclimate.db’ database in the ‘data/db’ folder using an SQL database viewer. How many records are there in the dataset?

  • Q18 Using the following code, import the data in the database into R.

require(DBI)
require(RSQLite)
DBFILENAME<-'C:\\lab1\\data\\db\\julyclimate.db'
thisdb <- dbConnect(RSQLite::SQLite(), DBFILENAME)
start_date <-'01-07-2016'
end_date   <-'31-07-2016'
#query = sprintf("SELECT * FROM july16 WHERE Date BETWEEN '%s' AND '%s'",start_date,end_date)
query= 'SELECT * FROM july16'
rawdata<-dbGetQuery(thisdb,query)
dbDisconnect(thisdb)
[1] TRUE
  • Q19 Open the database in your working folders (‘data.db’) in the database viewer. Find the other tables. Change the above code to import the july 2017 data into R and save it in ‘july17’ variable. Also change the variable name ‘rawdata’ as ‘july16’
july16<-rawdata
  • Q20 You can select a subset of data by manipulating the data frames (this is the data structure name). Execute the following code to find the days with minimum temperatures
mintemp_jul16<-head(sort(july16$`Minimumtemperature(�C)`))
mintemp_jul16
[1] "0.6"  "1.2"  "10.1" "10.6" "10.8" "10.9"
  • Q21 Find the lowest (minimum 5) temperatures in july 2017 data. Save the results in mintemp_july17 variable. Are there any variations in the minimum temperatures?

  • Q22 You can get a quick overview of a dataset using the ‘summary’ function. Explore information given by the summary function.

 summary(july16)
     Date           Minimumtemperature(�C) Maximumtemperature(�C) Rainfall(mm)      
 Length:31          Length:31                Length:31                Length:31         
 Class :character   Class :character         Class :character         Class :character  
 Mode  :character   Mode  :character         Mode  :character         Mode  :character  
 Evaporation(mm)    Sunshine(hours)    Directionofmaximumwindgust
 Length:31          Length:31          Length:31                 
 Class :character   Class :character   Class :character          
 Mode  :character   Mode  :character   Mode  :character          
 Speedofmaximumwindgust(km/h) Timeofmaximumwindgust 9amTemperature(�C)
 Length:31                    Length:31             Length:31           
 Class :character             Class :character      Class :character    
 Mode  :character             Mode  :character      Mode  :character    
 9amrelativehumidity(%) 9amcloudamount(oktas) 9amwinddirection   9amwindspeed(km/h)
 Length:31              Length:31             Length:31          Length:31         
 Class :character       Class :character      Class :character   Class :character  
 Mode  :character       Mode  :character      Mode  :character   Mode  :character  
 9amMSLpressure(hPa) 3pmTemperature(�C) 3pmrelativehumidity(%) 3pmcloudamount(oktas)
 Length:31           Length:31            Length:31              Length:31            
 Class :character    Class :character     Class :character       Class :character     
 Mode  :character    Mode  :character     Mode  :character       Mode  :character     
 3pmwinddirection   3pmwindspeed(km/h) 3pmMSLpressure(hPa)
 Length:31          Length:31          Length:31          
 Class :character   Class :character   Class :character   
 Mode  :character   Mode  :character   Mode  :character   

---
title: "CITS4009 Lab 2 - Loading Data into R"
output: html_notebook
---

### General Instructions
* Your labsheets will be structured with complementory information. The labs will closely follow the structure of "Practical Data Science with R" book by Nina Zumel and John Mount 
* From each lab you are expected to answer all the questions presented with a question number. 


### Aims of the Lab
In this lab you will learn and practice how to load data from various source in to R. In particular
we will cover the following objectives:

* Understanding R’s data frame structure
* Loading data into R from files and from relational databases
* Transforming data for analysis


# 1. Working with Files or URLs
Structured data from a file/URL can be imported into R using an **R table** . More information about the **table()** function is at  https://stat.ethz.ch/R-manual/R-devel/library/base/html/table.html. 

Tip for getting help in R: When you want to access a help page from the console type ?nameofthefunction E.g. >?table will show the help page for R table in base package. Alternatively you can search the topic in the search bar on the help pane in R studio.

## 1.1 Working with Structured Data
* **Q1** Download the 'lab1.zip' folder from *url* and extract it. Open the 'lab1-ex.R' in RStudio. Set the working directory(https://stat.ethz.ch/R-manual/R-devel/library/base/html/getwd.html) to the extracted folder. Press 'ALT+CRTL+R' (this is the shortcut to execture an entire R script in RStudio). If there are no errors you can start on **Q2**, otherwise check the path you have assigned to the working directory. Note: paths in windows platforms are different to unix (mac or linux) platforms.

### 1.1.1 Familiarise with a dataset

* **Q2** Find the 'car.data.csv' file in the 'data' subfolder supplied with Lab 1. Open the file with Microsoft Excel (Please do not perform any suggested saving options by MS Excel). Note how the data and headers are given. Close the file. If you are in Mac environment, follow the steps in a spreadsheet application. 

* **Q3** Open the 'car.data.csv' file with Microsoft Wordpad. Do you notice any difference in the view? 
<!-- If you are in Mac use the relavant program -->

* **Q4** Create a duplicate of the file 'car.data.csv' and save it as 'manuallyedit.car.data.csv'. Open this file change the first column name from 'buying' to 'Buying'. Save the changes and close the file.

* **Q5** Categorize the data types in the file as quantitative or qualitiative.

### 1.1.2 Importing data into R

* **Q6** Use the following code snippet to load the content of the 'car.data.csv' into an R table data structure. Although the URL works, you have to provide the path to the data file in your code.
```{r}
uciCar=read.table(file='http://www.win-vector.com/dfiles/car.data.csv',header = TRUE,sep = ',')
```

* **Q7** Import the same dataset into a different variable 'uciCarWr'using the read.table() funnction. Set 'FALSE' to the header parameter.


### 1.1.3 Examining the imported data

* **Q8** Use **dim** function to get the dimensions of the dataset. How many data entries are there in the dataset?
<!--# ```{r}
# dim(uciCar)
# ```-->

* **Q9** How many rows are there in the 'uciCarWr'? Why the answer is different to the answer from **Q7**

* **Q10** Use the 'class()' function to find the type of the variables in 'uciCar'. E.g. class(uciCar$buying) will return the type of the 'buying' variable. Are they all different or same? Write your observation as comment

* **Q11** Enter the following command. Describe the meaning of entries under 'doors'.
```{r}
summary(uciCar)
```
### 1.1.4 Working with other data formats
R has many adaptors that can be used to import data from a variety of sources. 

* XLS/XLSX—http://cran.r-project.org/doc/manuals/R-data.html#Reading-Excel-spreadsheets 
* JSON—http://cran.r-project.org/web/packages/rjson/index.html 
* XML—http://cran.r-project.org/web/packages/XML/index.html 
* MongoDB—http://cran.r-project.org/web/packages/rmongodb/index.html 
* SQL—http://cran.r-project.org/web/packages/DBI/index.html


## 1.2 Working with Less-Structured Data
Not all data are stored in a structured manner. Often data scientists use scripts to manipulate imported data to impose a structure, so that the data can be analysed conveniently. 

* **Q12** Go to the URL 'http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)' which hosts the German Bank Credit Dataset. Download the dataset into your working directory's data folder. 

* **Q13** Use the following command to load the data into a variable called 'd'. Use the downloaded file path and type in your code. What are the column names?
```{r}
d <- read.table(paste('http://archive.ics.uci.edu/ml/',
'machine-learning-databases/statlog/german/german.data',sep=''),stringsAsFactors=F,header=F)
```

* **Q14** Change the column names of the dataset using the 'colnames()' function as follows. The 'c()' command construct a vector that contains the desired column names.
```{r}
colnames(d) <- c('Status.of.existing.checking.account',
'Duration.in.month', 'Credit.history', 'Purpose',
'Credit.amount', 'Savings account/bonds',
'Present.employment.since','Installment.rate.in.percentage.of.disposable.income',
'Personal.status.and.sex', 'Other.debtors/guarantors',
'Present.residence.since', 'Property', 'Age.in.years',
'Other.installment.plans', 'Housing',
'Number.of.existing.credits.at.this.bank', 'Job',
'Number.of.people.being.liable.to.provide.maintenance.for',
'Telephone', 'foreign.worker', 'Good.Loan')
```

Using a **data dictionary** or a **schema documentation** you can transform data in R. A list in R can be used to create such a mapping structure. For example, meaning of the A-* codes can be decoded using a list like below.

```{r}
mapping <- list(
'A40'='car (new)',
'A41'='car (used)',
'A42'='furniture/equipment',
'A43'='radio/television',
'A44'='domestic appliances'
)
```

The' actual data can be decoded using the 'mapping' list as follows.

```{r}
for(i in 1:(dim(d))[2]) {
    if(class(d[,i])=='character') {
        d[,i] <- as.factor(as.character(mapping[d[,i]]))
    }
}
```
* **Q15** Use the above code to transform the data. Then, use the following code to view the data

```{r}
table(d$Purpose,d$Good.Loan)
```


# 2. Working with Relational Databases

Relational databases can scale upto millions of records. And structured queries can be feteched easily with relational databases. In the following sections, we will examine, how to get data from a database into R for analysis.

## 2.1 Curating the Data
You need to record the source of the data properly. Go to the bereau of meteorology and get download climate data for july 2016 and july 2017 months.


* **Q16** Create a data provonence documentation for this dataset. Your note should include data access date, source url, how you navigate the data in the given url, data format and cryptographic hashes. ( Hint: You can find a similar document in chapter 2 of the reference book)

## 2.3 Loading Data from a Database into R

* **Q17** Explore the 'julyclimate.db' database  in the 'data/db' folder using an SQL database viewer. How many records are there in the dataset?

* **Q18** Using the following code, import the data in the database into R.

```{r}
require(DBI)
require(RSQLite)

DBFILENAME<-'C:\\lab1\\data\\db\\julyclimate.db'

thisdb <- dbConnect(RSQLite::SQLite(), DBFILENAME)
start_date <-'01-07-2016'
end_date   <-'31-07-2016'
#query = sprintf("SELECT * FROM july16 WHERE Date BETWEEN '%s' AND '%s'",start_date,end_date)
query= 'SELECT * FROM july16'
rawdata<-dbGetQuery(thisdb,query)
dbDisconnect(thisdb)
```

* **Q19** Open the database in your working folders ('data\db\julyclimate.db') in the database viewer. Find the other tables. Change the above code to import the july 2017 data into R and save it in 'july17' variable. Also change the variable name 'rawdata' as 'july16' 

```{r}
july16<-rawdata
```


* **Q20** You can select a subset of data by manipulating the data frames (this is the data structure name). Execute the following code to find the days with minimum temperatures


```{r}
mintemp_jul16<-head(sort(july16$`Minimumtemperature(�C)`))
mintemp_jul16
```


* **Q21** Find the lowest (minimum 5) temperatures in july 2017 data. Save the results in mintemp_july17 variable. Are there any variations in the minimum temperatures?


* **Q22**
You can get a quick overview of a dataset using the 'summary' function. Explore information given by the summary function.

```{r}
 summary(july16)
```

<!-- * **Q23**  Run the following code to visualise the data in 2016. Similarly, generate a graph for 2017 july rainfall data.  -->


<!-- ```{r} -->
<!-- par(las=2) -->
<!-- par(mar=c(8,8,1,1)) -->
<!-- plot(july16$`Rainfall(mm)`,type = 'b',col='red',main='July 2016 Rainfall',xaxt=NULL,ylab='Rainfall (mm)',xlab='') -->
<!-- axis(side = 1,at = 1:length(july16$`Rainfall(mm)`),labels = july16$Date,) -->
<!-- ``` -->



