In-class exercise 01: Intro to R and ggplot2

15th Jan 2022 - Intro to R and ggplot2

Farah (https://sg.linkedin.com/in/farahfoo), SMU Masters in IT Business (Fintech and Analytics) (https://scis.smu.edu.sg/master-it-business)
2022-02-12

NOTES FROM LECTURE TODAY

Below is the data science workflow with the tidyverse, from R for Data Science:

Import -> Tidy -> Transform -> Visualise <-> Model -> Communicate

Use the R cheatsheets to look up syntax.

Pipe %>% - a very important connector: it passes the result on its left into the function on its right.
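
To make the pipe concrete, here is a minimal sketch that writes the same steps once as nested calls and once with %>%. It uses the built-in mtcars data rather than the exam data, and the names nested and piped are just illustrative.

```r
library(dplyr)

# Nested calls: must be read from the inside out
nested <- summarise(
  group_by(filter(mtcars, mpg > 20), cyl),
  mean_hp = mean(hp)
)

# Piped calls: read top to bottom, one step per line
piped <- mtcars %>%
  filter(mpg > 20) %>%
  group_by(cyl) %>%
  summarise(mean_hp = mean(hp))

# Both approaches should give the same table
all.equal(nested, piped)
```

The piped version is usually easier to extend, because a new step is just one more line at the end.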

Installing and loading required libraries

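# Packages used in this exercise (ggplot2 is also loaded as part of tidyverse)
packages <- c('ggplot2', 'tidyverse', 'ggrepel')

for (p in packages) {
  # Install the package only if it cannot be loaded
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}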

Importing data into R

exam_data <- read_csv("data/Exam_data.csv")
summary(exam_data)
      ID               CLASS              GENDER         
 Length:322         Length:322         Length:322        
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
     RACE              ENGLISH          MATHS          SCIENCE     
 Length:322         Min.   :21.00   Min.   : 9.00   Min.   :15.00  
 Class :character   1st Qu.:59.00   1st Qu.:58.00   1st Qu.:49.25  
 Mode  :character   Median :70.00   Median :74.00   Median :65.00  
                    Mean   :67.18   Mean   :69.33   Mean   :61.16  
                    3rd Qu.:78.00   3rd Qu.:85.00   3rd Qu.:74.75  
                    Max.   :96.00   Max.   :99.00   Max.   :96.00  
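
In the summary above, the character columns (ID, CLASS, GENDER, RACE) only report their length, which is not very informative. A possible way to see counts per category is sketched below. The toy tibble is a hypothetical stand-in with invented values, not the real exam data.

```r
library(dplyr)

# Hypothetical stand-in for a few rows of exam_data (values invented)
toy <- tibble(
  GENDER = c("Female", "Male", "Female", "Male", "Female"),
  MATHS  = c(70, 55, 88, 62, 91)
)

# Option 1: count the rows in each category
gender_counts <- toy %>% count(GENDER)
gender_counts

# Option 2: convert character columns to factors so that
# summary() reports counts per level instead of just the length
toy_fct <- toy %>% mutate(across(where(is.character), as.factor))
summary(toy_fct)
```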

Visualisation using ggplot2 package

hist(exam_data$MATHS)

Note that the ggplot2 version takes more code than base R's hist(), but it allows much more powerful customisation.

ggplot(data = exam_data,
       aes(x = MATHS)) +
  geom_histogram(bins = 10,
                 boundary = 100,
                 colour = "black",
                 fill = "grey") +
  ggtitle("Distribution of Maths scores")

GGPLOT2 elements

Note that geom_histogram() uses 30 bins by default.
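
A short sketch of how the bin settings change the histogram. The scores here are simulated (not the real exam data), and the object names are only for illustration.

```r
library(ggplot2)

set.seed(1)
# Simulated scores between 0 and 100, standing in for MATHS
scores <- data.frame(
  MATHS = pmin(pmax(round(rnorm(300, mean = 70, sd = 15)), 0), 100)
)

# Default: 30 bins (ggplot2 prints a message suggesting you pick a binwidth)
p_default <- ggplot(scores, aes(x = MATHS)) +
  geom_histogram()

# Fix the number of bins
p_bins <- ggplot(scores, aes(x = MATHS)) +
  geom_histogram(bins = 10)

# Or fix the width of each bin, with bin edges aligned to multiples of 5
p_width <- ggplot(scores, aes(x = MATHS)) +
  geom_histogram(binwidth = 5, boundary = 0)

# layer_data() returns one row per bin, so this compares the bin counts
nrow(layer_data(p_default))
nrow(layer_data(p_bins))
```

Setting either bins or binwidth explicitly is usually worth the extra argument, since the default of 30 is rarely the best choice for a given dataset.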

Adding a theme

ggplot(data = exam_data,
       aes(x = MATHS)) +
  geom_histogram(bins = 10,
                 boundary = 100,
                 colour = "black",
                 fill = "grey") +
  ggtitle("Distribution of Maths scores") +
  theme_dark()
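
Besides theme_dark(), ggplot2 ships other complete themes, and individual elements can be adjusted with theme(). The sketch below uses the built-in mpg data as a stand-in for the exam data. The title and the specific tweaks are arbitrary examples.

```r
library(ggplot2)

# Base plot on the built-in mpg data (stand-in for exam_data)
p <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(bins = 10, colour = "black", fill = "grey") +
  ggtitle("Distribution of highway mileage")

# Swap in a different complete theme
p_minimal <- p + theme_minimal()

# Start from a complete theme, then adjust individual elements
p_custom <- p +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", hjust = 0.5),
        panel.grid.minor = element_blank())
```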

Adding a facet

ggplot(data = exam_data,
       aes(x = MATHS)) +
  geom_histogram(bins = 10,
                 boundary = 100,
                 colour = "black",
                 fill = "blue") +
  ggtitle("Distribution of Maths scores") +
  facet_grid(cols = vars(RACE))
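
For comparison, here is a sketch of the main faceting variants, again on the built-in mpg data rather than the exam data. facet_grid() lays panels out in columns or rows, while facet_wrap() wraps a single variable into a grid, which tends to suit variables with many levels.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(bins = 10, colour = "black", fill = "grey")

# One panel per drive type, side by side
p_cols <- base + facet_grid(cols = vars(drv))

# One panel per drive type, stacked vertically
p_rows <- base + facet_grid(rows = vars(drv))

# One panel per vehicle class, wrapped into rows of four
p_wrap <- base + facet_wrap(vars(class), ncol = 4)
```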

Mapping the GENDER variable to fill

ggplot(data = exam_data,
       aes(x = MATHS,
           fill = GENDER)) +
  geom_histogram(bins = 10,
                 boundary = 100,
                 colour = "black") +
  ggtitle("Distribution of Maths scores") +
  facet_grid(cols = vars(RACE))
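
When fill is mapped to a variable, geom_histogram() stacks the groups by default. If the goal is to compare the two distributions directly, overlaying them with position = "identity" and some transparency is one alternative. The sketch below uses simulated scores. The toy data frame and its values are assumptions for illustration only.

```r
library(ggplot2)

set.seed(42)
# Simulated scores for two groups (not the real exam data)
toy <- data.frame(
  MATHS  = c(rnorm(100, mean = 65, sd = 12), rnorm(100, mean = 72, sd = 12)),
  GENDER = rep(c("Female", "Male"), each = 100)
)

# Default: bars for each group are stacked on top of each other
p_stack <- ggplot(toy, aes(x = MATHS, fill = GENDER)) +
  geom_histogram(bins = 10, colour = "black")

# Overlay: every bar starts at zero, transparency shows the overlap
p_overlay <- ggplot(toy, aes(x = MATHS, fill = GENDER)) +
  geom_histogram(bins = 10, colour = "black",
                 position = "identity", alpha = 0.5)
```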

Notched boxplot of MATHS by GENDER

ggplot(data = exam_data,
       aes(y = MATHS,
           x = GENDER)) +
  geom_boxplot(notch = TRUE)
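
The notch gives a rough interval around the median: per the geom_boxplot() documentation it extends 1.58 * IQR / sqrt(n) on either side, and notches that do not overlap suggest the medians differ. The sketch below recomputes the upper notch by hand on simulated data (the toy data frame is hypothetical) and compares it with what ggplot2 draws.

```r
library(ggplot2)

set.seed(7)
# Simulated scores (not the real exam data)
toy <- data.frame(
  GENDER = rep(c("Female", "Male"), each = 50),
  MATHS  = c(rnorm(50, mean = 72, sd = 10), rnorm(50, mean = 66, sd = 10))
)

p <- ggplot(toy, aes(x = GENDER, y = MATHS)) +
  geom_boxplot(notch = TRUE)

# Values ggplot2 computed for each box (Female is the first row)
built <- layer_data(p)

# Manual calculation for the Female group
female <- toy$MATHS[toy$GENDER == "Female"]
q <- quantile(female, c(0.25, 0.5, 0.75))
manual_upper <- q[[2]] + 1.58 * (q[[3]] - q[[1]]) / sqrt(length(female))

# Should match the notch drawn by ggplot2
all.equal(manual_upper, built$notchupper[1])
```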

Combining geom_boxplot() and geom_point()

ggplot(data = exam_data,
       aes(y = MATHS,
           x = GENDER)) +
  geom_boxplot() +
  geom_point(position = "jitter",
             size = 0.5)
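
geom_jitter() is a shortcut for geom_point(position = "jitter") and makes it easy to control how far the points are spread. A sketch on simulated data (the toy data frame is hypothetical): the points are jittered only horizontally so the scores themselves are not distorted, and the boxplot's own outlier points are hidden to avoid drawing them twice.

```r
library(ggplot2)

set.seed(3)
# Simulated scores (not the real exam data)
toy <- data.frame(
  GENDER = rep(c("Female", "Male"), each = 50),
  MATHS  = c(rnorm(50, mean = 72, sd = 10), rnorm(50, mean = 66, sd = 10))
)

p <- ggplot(toy, aes(x = GENDER, y = MATHS)) +
  geom_boxplot(outlier.shape = NA) +   # outliers are shown by the jittered points
  geom_jitter(width = 0.2,             # horizontal spread
              height = 0,              # no vertical noise: y values stay exact
              size = 0.5,
              alpha = 0.6)
```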

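A violin plot is a good complement to a boxplot. My first attempt failed with the error could not find function "geom_volin" because of a typo: the function is geom_violin().

# Layers are drawn in order, so the violin sits on top of the boxplot;
# add alpha to geom_violin() or swap the two layers if it hides the boxes
ggplot(data = exam_data,
       aes(y = MATHS, x = GENDER)) +
  geom_boxplot(alpha = 0.5) +
  geom_violin(fill = "light blue")

Note that the two snippets below give the same result even though the code is different.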
# Option 1: geom_point() with stat = "summary"
# (fun replaces fun.y, which is deprecated since ggplot2 3.3.0)
ggplot(data = exam_data,
       aes(y = MATHS, x = GENDER)) +
  geom_boxplot() +
  geom_point(stat = "summary",
             fun = "mean",
             colour = "red",
             size = 4)

# Option 2: stat_summary() with geom = "point"
ggplot(data = exam_data,
       aes(y = MATHS, x = GENDER)) +
  geom_boxplot() +
  stat_summary(geom = "point",
               fun = "mean",
               colour = "red",
               size = 4)
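
One way to check that the two forms really are equivalent is to compare the data each summary layer computes. The sketch below does this on simulated scores (the toy data frame is hypothetical) and uses the fun argument, which is the name used in ggplot2 3.3.0 and later in place of fun.y.

```r
library(ggplot2)

set.seed(11)
# Simulated scores (not the real exam data)
toy <- data.frame(
  GENDER = rep(c("Female", "Male"), each = 50),
  MATHS  = c(rnorm(50, mean = 72, sd = 10), rnorm(50, mean = 66, sd = 10))
)

base <- ggplot(toy, aes(x = GENDER, y = MATHS)) + geom_boxplot()

# Form 1: a geom with a summary stat
p_geom <- base +
  geom_point(stat = "summary", fun = "mean", colour = "red", size = 4)

# Form 2: a summary stat with a point geom
p_stat <- base +
  stat_summary(geom = "point", fun = "mean", colour = "red", size = 4)

# Layer 2 is the summary layer in both plots
means_geom <- layer_data(p_geom, 2)$y
means_stat <- layer_data(p_stat, 2)$y

# Group means computed directly, for reference
manual <- as.numeric(tapply(toy$MATHS, toy$GENDER, mean))

all.equal(means_geom, means_stat)
all.equal(means_geom, manual)
```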