15th Jan 2022 - Intro to R and ggplot
Below is the data science workflow with the Tidyverse, taken from R for Data Science:
Import -> Tidy -> Transform -> Visualise <-> Model -> Communicate
Use the R cheatsheets to look up the syntax.
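As a rough illustration of the first four steps, the sketch below runs a toy pipeline. The `scores` table and its column names are made up for this example and are not part of Exam_data.csv; in practice the import step would be a `read_csv()` call.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Import: a hypothetical wide table standing in for a file read with read_csv()
scores <- tibble::tibble(
  id      = c("S1", "S2", "S3"),
  maths   = c(70, 55, 90),
  science = c(65, 60, 80)
)

# Tidy: reshape to one row per student-subject pair
scores_long <- scores %>%
  pivot_longer(cols = c(maths, science),
               names_to = "subject",
               values_to = "score")

# Transform: mean score per subject
subject_means <- scores_long %>%
  group_by(subject) %>%
  summarise(mean_score = mean(score), .groups = "drop")

# Visualise: bar chart of the subject means
p <- ggplot(subject_means, aes(x = subject, y = mean_score)) +
  geom_col()
```

Printing `p` draws the chart; the model and communicate steps are left out of this sketch.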
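# Install any package that is missing, then attach it
packages <- c('ggplot2', 'tidyverse', 'ggrepel')
for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
# Import the exam scores (read_csv() comes from readr, loaded with tidyverse)
exam_data <- read_csv("data/Exam_data.csv")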
summary(exam_data)
      ID               CLASS              GENDER
 Length:322         Length:322         Length:322
 Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character

     RACE              ENGLISH          MATHS          SCIENCE
 Length:322         Min.   :21.00   Min.   : 9.00   Min.   :15.00
 Class :character   1st Qu.:59.00   1st Qu.:58.00   1st Qu.:49.25
 Mode  :character   Median :70.00   Median :74.00   Median :65.00
                    Mean   :67.18   Mean   :69.33   Mean   :61.16
                    3rd Qu.:78.00   3rd Qu.:85.00   3rd Qu.:74.75
                    Max.   :96.00   Max.   :99.00   Max.   :96.00
ggplot2 is used for visual data exploration.
Plotting a histogram with base R graphics versus with ggplot2:
hist(exam_data$MATHS)
Note that the ggplot2 version takes more code but allows far more customisation.
ggplot(data = exam_data,
       aes(x = MATHS)) +
  geom_histogram(bins = 10,
                 boundary = 100,
                 colour = "black",
                 fill = "grey") +
  ggtitle("Distribution of Maths scores")
Note that geom_histogram() uses 30 bins by default; the code above overrides this with bins = 10.
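To see the default in action, the sketch below counts the bins ggplot2 actually computes. It uses simulated scores in a made-up `toy` data frame rather than Exam_data.csv, and relies on `ggplot_build()`, whose first layer data holds one row per bin. When no `bins` or `binwidth` is given, ggplot2 also prints a message saying it is using 30 bins.

```r
library(ggplot2)

# Hypothetical data: 200 simulated scores between 0 and 100
set.seed(1)
toy <- data.frame(MATHS = round(runif(200, min = 0, max = 100)))

# Histogram with the default binning
p_default <- ggplot(toy, aes(x = MATHS)) +
  geom_histogram()

# Histogram with an explicit number of bins
p_ten <- ggplot(toy, aes(x = MATHS)) +
  geom_histogram(bins = 10)

# Each row of the built layer data corresponds to one bin
n_default <- nrow(ggplot_build(p_default)$data[[1]])
n_ten     <- nrow(ggplot_build(p_ten)$data[[1]])
```

Comparing `n_default` with `n_ten` shows how much the choice of `bins` changes the picture. Setting `binwidth` instead fixes the width of each bar in data units.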
# Same histogram with the dark theme applied
ggplot(data = exam_data,
       aes(x = MATHS)) +
  geom_histogram(bins = 10,
                 boundary = 100,
                 colour = "black",
                 fill = "grey") +
  ggtitle("Distribution of Maths scores") +
  theme_dark()
# One panel per RACE, laid out as columns
ggplot(data = exam_data,
       aes(x = MATHS)) +
  geom_histogram(bins = 10,
                 boundary = 100,
                 colour = "black",
                 fill = "blue") +
  ggtitle("Distribution of Maths scores") +
  facet_grid(cols = vars(RACE))
# Same facets, with the bars filled (and stacked) by GENDER
ggplot(data = exam_data,
       aes(x = MATHS,
           fill = GENDER)) +
  geom_histogram(bins = 10,
                 boundary = 100,
                 colour = "black") +
  ggtitle("Distribution of Maths scores") +
  facet_grid(cols = vars(RACE))
# Notched boxplot of Maths scores by gender
ggplot(data = exam_data,
       aes(y = MATHS,
           x = GENDER)) +
  geom_boxplot(notch = TRUE)
# Boxplot with the individual scores overlaid as jittered points
ggplot(data = exam_data,
       aes(y = MATHS,
           x = GENDER)) +
  geom_boxplot() +
  geom_point(position = "jitter",
             size = 0.5)
# Violin drawn first so the semi-transparent boxplot stays visible on top
ggplot(data = exam_data,
       aes(y = MATHS, x = GENDER)) +
  geom_violin(fill = "lightblue") +
  geom_boxplot(alpha = 0.5)
Note that the two snippets below give the same result although the code differs.
ggplot(data = exam_data,
       aes(y = MATHS, x = GENDER)) +
  geom_boxplot() +
  geom_point(stat = "summary",
             fun = "mean",      # fun.y is deprecated since ggplot2 3.3.0
             colour = "red",
             size = 4)
ggplot(data = exam_data,
       aes(y = MATHS, x = GENDER)) +
  geom_boxplot() +
  stat_summary(geom = "point",
               fun = "mean",    # fun.y is deprecated since ggplot2 3.3.0
               colour = "red",
               size = 4)
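One way to check that the two forms really agree is to compare the data ggplot2 computes for the second layer of each plot. This is a minimal sketch on a made-up `toy` data frame (not Exam_data.csv), written with `fun`, the current name of the summary argument, and using `layer_data()` to pull out the computed points.

```r
library(ggplot2)

# Hypothetical stand-in for exam_data
toy <- data.frame(
  GENDER = rep(c("Female", "Male"), each = 4),
  MATHS  = c(60, 70, 80, 90, 50, 55, 65, 70)
)

base <- ggplot(toy, aes(x = GENDER, y = MATHS)) +
  geom_boxplot()

# Form 1: a point geom that uses the summary stat
p_geom <- base +
  geom_point(stat = "summary", fun = "mean", colour = "red", size = 4)

# Form 2: the summary stat drawn with a point geom
p_stat <- base +
  stat_summary(geom = "point", fun = "mean", colour = "red", size = 4)

# Layer 2 holds the summary points; compare their y values
means_geom <- layer_data(p_geom, 2)$y
means_stat <- layer_data(p_stat, 2)$y
same_result <- identical(means_geom, means_stat)
```

Every geom has a default stat and every stat has a default geom, so the same layer can be described from either side; which form to use is mostly a matter of which reads more clearly.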