Week 2 Notes: Statistical Modelling and Data Visualisation PDF
2024
Summary
These notes cover week 2 of a statistical modelling and data visualisation course. The document focuses on data visualisation with ggplot2, explains why good visualisations matter, and introduces the different types of variables.
Full Transcript
KL5005: Statistical Modelling and Data Visualisation
Week 2: Data visualisation with ggplot
October 2024

Materials of the week
▶ Brief Recap on the importance of good visualisations
▶ Generic of the variables
▶ Data visualisations with ggplot
   Data mapping
▶ Visualising "Star Wars" data
▶ Visualising numerical data
▶ Workshop
▶ Homework
▶ Acknowledgment

Brief Recap on the importance of good visualisations

Week 2: tidyverse collection of packages
The tidyverse is an opinionated collection of R packages designed for data science. The most important ones are listed on its official web page at https://www.tidyverse.org/packages/. All packages share an underlying design philosophy, grammar, and data structures.

Week 2: Tips for effective data visualisation
"The simple graph has brought more information to the data analyst's mind than any other device." — John Tukey

Week 2: Importance of good visualisation
What makes a good visualisation?
Graphical excellence is the well-designed presentation of interesting data, a matter of substance and statistics and of design.
Graphical excellence consists of complex ideas communicated with clarity, precision and efficiency.
Graphical excellence is that which gives the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space.

Week 2: Importance of good visualisation
[Figure: bar charts of survey counts, built up in four steps. The first version shows count by region (Scotland, North, Midlands / Wales, Rest of South, London). The final version is titled "Was Britain right/wrong to vote to leave EU?", subtitled "YouGov Survey Results, 2-3 September 2019", with one panel per region, opinion (Wrong, Right, Don't know) against count. Source: bit.ly/2lCJZVg]
Keep your figure in a simple style
Draw attention by using colour
Tell a story
Produce the graph with a suitable shape, size and transparency

Week 2: Useful links
▶ Data import
   readr - readr.tidyverse.org
   readxl - readxl.tidyverse.org
   haven - haven.tidyverse.org
▶ Data manipulation
   tidyr - tidyr.tidyverse.org
   dplyr - dplyr.tidyverse.org
   data.table - rdatatable.gitlab.io/data.table
▶ R programming
   R for Data science - r4ds.had.co.nz
   Advanced R programming - adv-r.hadley.nz
   R packages - r-pkgs.org

Generic of the variables

Week 2: Human side
What is a variable?
Is it a numerical variable or categorical?
If numerical: is it continuous or discrete?
If categorical: is it nominal or ordinal?
Are there other variables in the study? (Univariate or multivariate?)

Type of variables:
Numerical variables can be classified as continuous or discrete based on whether the variable can take on any of an infinite number of values within a range or only non-negative whole numbers, respectively.
(Weight and height are continuous; the number of positive COVID-19 cases is discrete.)
If the variable is categorical, we can determine whether it is ordinal based on whether or not its levels have a natural ordering. (Sex is a nominal variable and Olympic rank is an ordinal variable.)

Number of variables involved:
Univariate data analysis - distribution of a single variable
Bivariate data analysis - relationship between two variables
Multivariate data analysis - relationship between many variables at once, usually focusing on the relationship between two while conditioning on the others
Matrix-variate data analysis - finding patterns in matrix variables (not in the scope of this module)

Week 2: Variables in R
Most important data types in R:
logical – boolean values TRUE and FALSE. In R, you can use T for TRUE and F for FALSE (T and F are ordinary variables that can be overwritten, so the full names are safer).
character – character strings, for example "world" and "hello". Character values are written in quotation marks.
double – floating-point numerical values (double-precision floating point). It is the default numerical type.
integer – integer numerical values (indicated with an L suffix, for example 3L)
list – a combination of elements of any type:
    mylist = list(A = "hello", B = 1:4, "knock knock" = "who's there?")

Find the type of variables with the functions typeof() and str().
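As a small illustration of the data types listed above, the sketch below calls typeof() and str() on a few values. The object names (flag, word, ratio, count, mylist) are hypothetical and chosen only for this example.

```r
# Hypothetical example values, one per data type discussed above
flag   <- TRUE                          # logical
word   <- "hello"                       # character
ratio  <- 3                             # double (the default numeric type)
count  <- 3L                            # integer (note the L suffix)
mylist <- list(A = "hello", B = 1:4)    # list mixing character and integer

# typeof() returns the storage type of each object as a character string
types <- c(flag   = typeof(flag),
           word   = typeof(word),
           ratio  = typeof(ratio),
           count  = typeof(count),
           mylist = typeof(mylist))
print(types)

# str() prints a compact summary of the structure of an object
str(mylist)
```

typeof() is convenient for a single value, while str() tends to be more useful for lists and data frames because it shows the type of every element.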
You can also check whether a variable has a specific type with the is.*() functions. For instance, is.integer(3) returns FALSE because 3 is stored as a double, whereas is.integer(3L) returns TRUE.
Switch between types with the as.*() functions where possible. For example, as.character(c(2, 3)) changes the type of the data to character ("2" "3").

Data visualisations with ggplot

Week 2: Data visualisations with ggplot
▶ The gg in "ggplot2" stands for Grammar of Graphics
▶ ggplot2 is tidyverse's data visualization package
▶ It is inspired by the book Grammar of Graphics by Leland Wilkinson
   1st edition in 1999
   A theoretical deconstruction of data graphics
   Foundation for many graphic applications (ggplot2, Polaris, Vega-Lite)

Week 2: Graphics
[Figure: the word "GRAPHICS" surrounded by chart types: box plot, pie chart, scatter plot, line chart, histogram, bar chart]

Verify that you have loaded the whole dataset
Check the type of the variable(s)
Look for any unusual (missing) information
The grammar requires a tidy format (though the grammar predates the notion of tidy data)

Allow generic datasets to be understood by the graphic system
Aesthetic mapping: link variables in the data to graphical properties in the geometry
Facet mapping: link variables in the data to panels in the facet layout

Even though data is tidy it may not represent the displayed values
Transform input variables to displayed values
— Count number of observations
   in each category for a bar chart
— Calculate summary statistics for a boxplot

Most data does not directly represent graphical properties
A scale translates back and forth between variable ranges and property ranges, e.g.
— Categorical ← Color
— Numbers ← Position
Scales imply a specific interpretation of values: discrete or continuous

How to interpret aesthetics as graphical representations
Is a progression of positional aesthetics a number of points, a line, a single polygon or something else entirely?
To a high degree this determines your plot type

Define the number of panels with equal logic and split data among them
Small multiples: allow you to look at small subsets of your data in separate plots
Panel layout may carry meaning

Positional aesthetics are special:
— Variables are mapped, scaled, applied to a geometry
— The position values are interpreted by a coordinate system
Defines the physical mapping of the aesthetics to the paper
Vaguely similar to color profile mapping for color aesthetics

None of the previous parts talked about the visual look of the plot
Theming spans every part of the graphic that is not linked to data

Week 2: Data visualisations with ggplot
Introduction to Data Science
Firstly, install the tidyverse package (if not done already) and then load the tidyverse library (it contains dplyr, readr, ggplot2 and tibble, whose functions are required for the visualisations)

    # install.packages("tidyverse")   # run once if the package is not installed
    library(tidyverse)

    glimpse(faithful)   # compact overview of the columns

    ?faithful           # help page for the dataset

    ggplot(data = faithful,
           mapping = aes(x = eruptions, y = waiting)) +
      geom_point()

[Figure: scatter plot of waiting against eruptions for the faithful data]

Data mapping

Week 2: Data mapping
    ggplot(data = faithful,
           mapping = aes(x = eruptions, y = waiting)) +
      geom_point()

Which dataset to plot?: data = faithful
Which columns to use for x and y?: mapping = aes(x = eruptions, y = waiting)
How to draw the plot?: geom_point()
'+' is used to combine 'ggplot' elements

    ggplot(faithful) +
      geom_point(aes(x = eruptions, y = waiting, colour = eruptions < 3))

    # or

    ggplot(data = faithful,
           mapping = aes(x = eruptions, y = waiting, colour = eruptions < 3)) +
      geom_point()

Mapping color
[Figure: scatter plot of waiting against eruptions, points coloured by eruptions < 3 (FALSE / TRUE)]

    ggplot(faithful, aes(x = eruptions, y = waiting)) +
      geom_density_2d() +
      geom_point()

Mapping contours: layers are stacked in the order in which they appear in the code
[Figure: scatter plot of waiting against eruptions with 2D density contours]

    ggplot(faithful, aes(x = eruptions, y = waiting, colour = eruptions < 3)) +
      geom_density_2d() +
      geom_point()

Mapping grouped contours
[Figure: as above, with points and contours coloured by eruptions < 3]

    ggplot(mpg, mapping = aes(x = displ, y = hwy)) +
      geom_line()

Simple line plot for fuel economy data from 1999 to 2008 for 38 popular models of cars.
[Figure: line plot of hwy against displ]

    ggplot(mpg, mapping = aes(x = displ, y = hwy)) +
      geom_line() +
      labs(x = "Engine Displacement (litres)",
           y = "Highway miles per gallon",
           title = "Fuel economy data from 1999 to 2008",
           subtitle = "38 popular models of cars")

Adding details to the plot
[Figure: the same line plot with title, subtitle and axis labels]

    ggplot(mpg,
           mapping = aes(x = displ, y = hwy, group = fl, color = fl)) +
      geom_line() +
      scale_color_manual(values = c("blue", "red", "orange", "green", "yellow")) +
      labs(x = "Engine Displacement (litres)",
           y = "Highway miles per gallon",
           title = "Fuel economy data from 1999 to 2008",
           subtitle = "38 popular models of cars")

Grouping by the fuel type, fl
[Figure: one line per fuel type (c, d, e, p, r)]

    ggplot(mpg) +
      geom_point(aes(x = displ, y = hwy)) +
      scale_x_continuous(breaks = c(3, 5, 6)) +
      scale_y_continuous(trans = "log10")

x and y axes are also controlled by scales.
[Figure: scatter plot of hwy (log scale) against displ with x-axis breaks at 3, 5 and 6]

WHAT'S IN A PIE
How does a pie chart fit into the grammar of graphics? What are:
— the geoms and stats?
— the coord?
— the data?

    ggplot(mpg) +
      geom_bar(aes(x = class)) +
      coord_polar()

A polar coordinate system interprets x and y as angle and radius (by default x is mapped to the angle)
[Figure: bar chart of count by class drawn in polar coordinates]

    ggplot(mpg) +
      geom_bar(aes(x = class)) +
      coord_polar(theta = "y") +
      expand_limits(y = 70)

Changing what is mapped to angle gives a very different plot
[Figure: concentric bars, one ring per class, with count mapped to the angle]

    mytable = table(mpg$class)
    labls = paste(names(mytable), "\n", mytable, sep = "")
    pie(mytable, labels = labls,
        main = "Pie Chart of Car Class")

A pie chart with data labels for categorical data from the mpg data frame concerning car class
[Figure: "Pie Chart of Car Class" with counts 2seater 5, compact 47, midsize 41, minivan 11, pickup 33, subcompact 35, suv 62]

    p1 = ggplot(msleep) +
      geom_boxplot(aes(x = sleep_total, y = vore, fill = vore))

    p2 = ggplot(msleep) +
      geom_bar(aes(y = vore, fill = vore))

    p3 = ggplot(msleep) +
      geom_point(aes(x = bodywt, y = sleep_total, colour = vore)) +
      scale_x_log10()

    p1
    p2
    p3

Mapping box, bar, and scatter plots
[Figures: box plot of sleep_total by vore; bar chart of count by vore; scatter plot of sleep_total against bodywt (log scale) coloured by vore]

Multi-panel plots

    ggpubr::ggarrange(p1, p2, p3, labels = c("A", "B", "C"), ncol = 3, nrow = 1)

[Figure: panels A (box plot), B (bar chart) and C (scatter plot) side by side]
Figures in the EPS format are saved as vector graphics, so they should be saved carefully and with the right dimensions (how big should the figure be?).
[Figure: panels A, B and C side by side]

Alternatively, and preferably, you can save your figures with the following code.

    # open an EPS graphics device; the path goes in quotes
    postscript('D://Dropbox/Underpreperation/Hazard.eps', width = 27, height = 27)

    ggpubr::ggarrange(p1, p2, p3, labels = c("A", "B", "C"), ncol = 3, nrow = 1)

    # close the device so that the file is written
    dev.off()

[Figure: panels A, B and C side by side, saved with the code above]

    ggpubr::ggarrange(p1, p2, p3,
                      labels = c("A", "B", "C"), ncol = 2, nrow = 2)

[Figure: panels A and B on the top row, panel C on the bottom row]
Arrange multiple ggplots on the same page.
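As a possible alternative to opening a postscript() device by hand, ggplot2's ggsave() can write a single plot to an EPS file with explicit dimensions. The sketch below is illustrative only: the plot object p_scatter, the temporary file name and the 6 by 4 inch size are assumptions made for this example, not values from the slides.

```r
library(ggplot2)

# A single plot to save (msleep ships with ggplot2)
p_scatter <- ggplot(msleep) +
  geom_point(aes(x = bodywt, y = sleep_total, colour = vore)) +
  scale_x_log10()

# Hypothetical output location: a file in the session's temporary directory
out_file <- file.path(tempdir(), "sleep_scatter.eps")

# ggsave() opens the device, draws the plot and closes the device again;
# width and height are given in inches here
ggsave(filename = out_file, plot = p_scatter,
       device = "eps", width = 6, height = 4, units = "in")

file.exists(out_file)
```

Stating width and height explicitly keeps the saved figure at the same size regardless of the current plotting window.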
    ggpubr::ggarrange(p3,
                      ggpubr::ggarrange(p1, p2, ncol = 2, labels = c("B", "C")),
                      nrow = 2, labels = "A")

Arrange multiple ggplots on the same page.
[Figure: panel A (scatter plot) on the top row, panels B (box plot) and C (bar chart) on the bottom row]

    ggplot(economics) +
      geom_line(aes(x = date, y = unemploy))

Line graph for time series.
[Figure: line plot of unemploy against date, 1970 to 2010]

    library(gganimate)
    ggplot(economics) +
      geom_line(aes(x = date, y = unemploy)) +
      transition_reveal(along = date)

Run the code to see the graph.
Make animation with "gganimate" package.

Figure: The 3D scatter animation, colored according to the clustering labels, with overlaid contours for the DLBCL data.

Visualising "Star Wars" data

Week 2: Visualising "Star Wars" data
Data terminology
Each row is an observation
Each column is a variable (character)

    library(dplyr)
    ?starwars

[Output: the starwars data frame]

What's in the Star Wars data?
    glimpse(starwars)

    nrow(starwars)
    # 87

    ncol(starwars)
    # 14

    dim(starwars)
    # 87 14

mass vs. height

    ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
      geom_point()

Warning message: Removed 28 rows containing missing values or values outside the scale range (`geom_point()`).
[Figure: scatter plot of mass against height]

What's that warning?
Not all characters have height and mass information (hence 28 of the 87 characters were not plotted)
It is important to find missing information in the data. Depending on the data recorder, missing information may be recorded as a very large number (for example 999 when all data are in the range 0 to 1) or be saved as NA.
The location of missing observations (for instance in NA format) can be found with the which() and is.na() functions.

    which(is.na(starwars$height))
    # 28 83 84 85 86 87

    which(is.na(starwars$mass))
    # 12 27 28 33 37 38 39 41 42 44 48 53 55 56 58 60 61 65 67 72 73 74 76 83 84 85 86 87

Adding information to the plot

    ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
      geom_point() +
      labs(title = "Mass vs. height of Star Wars characters",
           subtitle = "Missing observations are removed",
           x = "Height (cm)", y = "Weight (kg)")

[Figure: "Mass vs. height of Star Wars characters", subtitle "Missing observations are removed"; Weight (kg) against Height (cm)]

Do you observe any relationship between mass and height of Star Wars characters?
How could you describe this relationship?
Can we describe the trend of data points that don't follow the overall trend? Are those points the result of mis-imputation, miscalculation or mismeasurement?
What other variables would help us understand the trend of those points?
Who is the not so tall but really chubby character?

    which(starwars$mass > 1000)
    glimpse(starwars[16, ])

    My_plot = ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
      geom_point()
    library(cowplot)
    # draw_image() needs the magick package to be installed
    ggdraw() +
      draw_plot(My_plot) +
      draw_image("image_path/image_name.image_type", x = 0.23, y = 0.4, scale = .2)

[Figure: scatter plot of mass against height]

Additional graphic variables:
We can map additional graphic variables to various features of the plot
Aesthetics
— Color
— Size
— Shape
— Alpha (Transparency)
Faceting: small multiples displaying different subsets

Aesthetics options:
Visual characteristics of plotting characters that can be mapped to a specific variable in the data are
Color
Size
Shape
Alpha (Transparency)
Grouping information plays a key role in graph aesthetics.
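The aesthetics listed above can be combined in a single mapping. The following is a minimal sketch, assuming the starwars data from dplyr; the choice of gender for shape and birth_year for alpha is made only for illustration, and p_aes is a hypothetical object name.

```r
library(ggplot2)
library(dplyr)   # provides the starwars data

# Map a discrete variable to shape and a continuous variable to alpha
p_aes <- ggplot(data = starwars,
                mapping = aes(x = height, y = mass,
                              shape = gender, alpha = birth_year)) +
  geom_point(na.rm = TRUE)   # na.rm = TRUE drops rows with missing values silently

p_aes
```

Shape is paired with a discrete variable here because a continuous variable cannot be mapped to shape, while alpha can follow a continuous one.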
"mass vs height" + "gender"

    ggplot(data = starwars,
           mapping = aes(x = height, y = mass, color = sex)) +
      geom_point()

[Figure: scatter plot of mass against height, coloured by sex (female, hermaphroditic, male, none, NA)]

"mass vs height" + "gender" + "age"

    ggplot(data = starwars,
           mapping = aes(x = height, y = mass, color = sex, size = birth_year)) +
      geom_point()

[Figure: as above, with point size given by birth_year (200, 400, 600, 800)]

Please run the code

    ggplot(data = starwars,
           mapping = aes(x = height, y = mass, color = sex, size = birth_year)) +
      geom_point(size = 2)

[Figure: scatter plot of mass against height, coloured by sex, all points the same size]
This sets the size of all points to a fixed value instead of basing it on the values of a variable in the data. Note that size = 2 set in the geom takes precedence over the birth_year mapping.
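One way to check the precedence described above is to inspect the data that the point layer actually draws. This is a sketch, assuming the same starwars plot; p_set and drawn are hypothetical names, and layer_data() is the ggplot2 helper that returns the computed data for one layer.

```r
library(ggplot2)
library(dplyr)   # provides the starwars data

# size is mapped to birth_year in aes(), but also set to a constant in the geom
p_set <- ggplot(data = starwars,
                mapping = aes(x = height, y = mass,
                              colour = sex, size = birth_year)) +
  geom_point(size = 2, na.rm = TRUE)

# layer_data() returns the data frame used to draw the first layer
drawn <- layer_data(p_set, 1)

# Every point carries the constant size, not a value derived from birth_year
unique(drawn$size)
```

The colour column of the same data frame still varies by sex, because colour is only mapped and never set in the geom.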
Aesthetics summary:
Continuous variables are measured on a continuous scale.
Discrete variables are measured (or often counted) on a discrete scale.

Aesthetic   Discrete                          Continuous
Color       rainbow of colors                 gradient
Size        discrete steps                    continuous mapping between value and point size (area by default)
Shape       different shape for each level    doesn't work

Use aesthetics, inside aes(), to map features of a plot to a variable; define features in the "geom" for customisation that is not mapped to a variable.

Faceting:
Smaller plots that display different subsets of the data
Useful for exploring conditional relationships and large data

    ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
      facet_grid(. ~ sex) +
      geom_point() +
      labs(title = "Mass vs. height of Star Wars characters",
           subtitle = "Faceted by gender")

[Figure: "Mass vs. height of Star Wars characters", subtitle "Faceted by gender"; one column of panels per value of sex (female, hermaphroditic, male, none, NA); x-axis: height, y-axis: mass]

Faceting in short:
facet_grid()
- 2d grid
- rows ~ cols
- use "." for no split
facet_wrap(): 1d ribbon wrapped into 2d

Dive further...:
In the next few slides, describe what each plot displays. Think about how the code relates to the output.
The plots in the following slides do not have proper titles, axis labels, etc. You should always label your plots.

    ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
      geom_point() +
      facet_grid(sex ~ .)
[Figure: mass against height, one row of panels per value of sex (female, hermaphroditic, male, none, NA); x-axis: height, y-axis: mass]

    ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
      geom_point() +
      facet_grid(. ~ sex)

[Figure: mass against height, one column of panels per value of sex (female, hermaphroditic, male, none, NA); x-axis: height, y-axis: mass]

    ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
      geom_point() +
      facet_wrap(~ eye_color)

[Figure: mass against height, one panel per eye colour, wrapped into a grid (black; blue; blue-gray; brown; dark; gold; green, yellow; hazel; orange; pink; red; red, blue; unknown; white; yellow); x-axis: height, y-axis: mass]

Visualising numerical data

Describing shapes of numerical distributions
Shape
- Skewness: right-skewed, left-skewed, symmetric (skew is to the side of the longer tail)
- Modality: unimodal, bimodal, multimodal, uniform
Center: mean (mean), median (median), mode
Spread: range (range), standard deviation (sd), inter-quartile range (IQR)
Unusual observations

Histogram

    ggplot(data = starwars, mapping = aes(x = height)) +
      geom_histogram(binwidth = 10)

[Figure: histogram; x-axis: height, y-axis: count]

Density plot

    ggplot(data = starwars, mapping = aes(x = height)) +
      geom_density()

[Figure: density curve; x-axis: height, y-axis: density]

Box plot

    ggplot(data = starwars, mapping = aes(y = height)) +
      geom_boxplot()

[Figure: single box plot; y-axis: height]

Side-by-side box plots

    ggplot(data = starwars, mapping = aes(y = height, x = sex)) +
      geom_boxplot()

[Figure: one box plot per value of sex; x-axis: sex, y-axis: height]

Scatter plot

    ggplot(data = starwars, mapping = aes(y = height, x = sex)) +
      geom_point()

[Figure: points per value of sex; x-axis: sex, y-axis: height]

Violin plot

    ggplot(data = starwars, mapping = aes(y = height, x = sex)) +
      geom_violin()

[Figure: one violin per value of sex; x-axis: sex, y-axis: height]

Jitter plot

    ggplot(data = starwars, mapping = aes(y = height, x = sex)) +
      geom_jitter()

[Figure: jittered points per value of sex; x-axis: sex, y-axis: height]

Workshop
Analysing PENGUIN dataset

Workshop: Palmer penguins
Measurements for penguin species, island in Palmer Archipelago, size (flipper length, body mass, bill dimensions), and sex.
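As a warm-up before plotting, the centre and spread measures listed under "Describing shapes of numerical distributions" can be computed for one of the penguin measurements. This is a minimal sketch, assuming the palmerpenguins package is installed; the choice of `flipper_length_mm` and the use of `na.rm = TRUE` to skip missing values are illustrative assumptions, not taken from the slides.

```r
library(palmerpenguins)

# One numeric measurement from the penguins data frame
flipper <- penguins$flipper_length_mm

# Centre: mean and median, ignoring missing values
centre <- c(mean   = mean(flipper, na.rm = TRUE),
            median = median(flipper, na.rm = TRUE))

# Spread: range (as min and max), standard deviation, inter-quartile range
spread <- c(min = min(flipper, na.rm = TRUE),
            max = max(flipper, na.rm = TRUE),
            sd  = sd(flipper, na.rm = TRUE),
            iqr = IQR(flipper, na.rm = TRUE))

centre
spread
```

Comparing the mean with the median gives a first hint about skewness before any plot is drawn.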
Load the package and open the help page for the data:

    library(palmerpenguins)
    ?penguins

Bill depth and length

[Figure: "Bill depth and length", subtitle "Dimensions for Adelie, Chinstrap, and Gentoo Penguins"; x-axis: Bill depth (mm), y-axis: Bill length (mm); points coloured by Species (Adelie, Chinstrap, Gentoo)]

    ggplot(data = penguins,
           mapping = aes(x = bill_depth_mm,
                         y = bill_length_mm,
                         colour = species)) +
      geom_point() +
      labs(title = "Bill depth and length",
           subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
           x = "Bill depth (mm)", y = "Bill length (mm)",
           colour = "Species")

Coding out loud:

    # Starting with the penguins data frame
    ggplot(data = penguins)

    # Map bill depth to the x-axis
    ggplot(data = penguins, mapping = aes(x = bill_depth_mm))

    # Map bill depth to the x-axis and length to the y-axis
    ggplot(data = penguins,
           mapping = aes(x = bill_depth_mm, y = bill_length_mm))

[Figure: empty plot; x-axis: bill_depth_mm]

Coding out loud:

[Figure: empty plot; y-axis: bill_length_mm]

    # Starting with the penguins data frame
    ggplot(data = penguins)

    # Map bill depth to the x-axis
    ggplot(data = penguins, mapping = aes(x = bill_depth_mm))

    # Map bill depth to the x-axis and length to the y