Data Visualization & Reporting PDF

BACKGROUND
The information that is being visualized may not have any obvious visual manifestation.
The process of creating a mapping from the information to the visual representation is non-trivial.
What are the best visual variables for a particular set of information?

VISUALIZATION PIPELINE

BASIC VISUAL UNITS (MARKS)
Points: location, size, shape, color.
Lines: length, location. A change in thickness, texture, or color does not change the meaning of the line. Changing its location will change its meaning.
Areas: length, width. Changing length and width will change the meaning. Changing position, color, value, or texture does not change its meaning.
Surfaces: similar to areas, but they exist in 3D. Changing color or texture does not change the meaning. Changes in position, size, shape, or orientation will change the meaning.
Volumes: length, width, and height. Their size is their meaning. Changing position, color, or texture does not change the meaning. Changing size, shape, or orientation will change the meaning.

VISUAL UNITS AND VISUAL VARIABLES
Each visual unit may have multiple visual variables.
Each visual variable may have multiple characters.

CHARACTERS OF VISUAL VARIABLES
Selective: Is a change in this visual variable alone enough to allow us to select it from a group? How easy is it to spot an outlier?
Associative: Is a change in this visual variable enough to allow us to perceive marks as a group? How easy is it to see a cluster?
Quantitative: Is there a numerical reading obtainable from changes in this visual variable?
Order: Are changes in this visual variable perceived as ordered? How easy is it to spot a trend? How easy is it to rank things numerically?
Length: How many changes in value can still be recognized with confidence as separate? How big a range of data can this visual variable encode?
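The idea that one mark can carry several visual variables at once can be sketched in R with ggplot2. This is only an illustration under stated assumptions: the built-in mtcars data are used as a stand-in, and the choice of which column goes to which visual variable is arbitrary, not taken from the slides.

```r
library(ggplot2)

# One mark (a point) per car; four visual variables are used at the same time.
# The mapping of columns to visual variables is an illustrative assumption.
p_marks <- ggplot(mtcars, aes(x = wt, y = mpg)) +    # position (x and y)
  geom_point(aes(size   = hp,                        # size
                 shape  = factor(cyl),               # shape
                 colour = factor(gear))) +           # colour (hue)
  labs(x = "Weight", y = "Miles per gallon",
       size = "Horsepower", shape = "Cylinders", colour = "Gears")

print(p_marks)
```

Reading the result against the characters listed above: outliers are easiest to spot through position, while the three shapes require more effort to tell apart than the three hues.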
CHARACTERS OF VISUAL VARIABLES
Interpretation of symbolic meanings: Bertin didn’t discuss this. How easy is it to interpret the symbolic (not numeric) meaning of a visual variable? It greatly influences the experience of reading a visualization.
We will discuss the five characteristics of each visual variable.

CHARACTERISTICS OF EACH VISUAL VARIABLE
Selective
Associative
Quantitative
Order
Length

POSITION
EXAMPLE

SIZE
Numerical readings interpreted from changes in size alone are usually approximate and often less accurate.
Using size to represent a numerical variable should be done with caution.
EXAMPLE: HUMAN POVERTY INDEX
EXAMPLE: PIE CHART
EXAMPLE: SIZE OR POSITION?
EXAMPLE: LINE THICKNESS
Monsieur Minard’s visualization of Napoleon’s 1812-1813 invasion of Russia.

SHAPE
While changes in shape are distinguishable, this distinction can often require considerable interpretation effort.
Quick visual interpretation of all shapes is often difficult.
Shape is not a quantitative visual variable.
Shape is not an ordered visual variable.
The representation power of shape comes from its infinite length and from symbolic interpretation.
The link between the shape and the intended meaning must be explicit, to reduce the mental workload of symbolic interpretation.
But it is often difficult to memorize the meanings when many shapes are used.
EXAMPLE: CHERNOFF FACE
http://mathworld.wolfram.com/ChernoffFace.html
http://www.csun.edu/~hfgeg005/eturner/gallery/lifeinla.GIF

WORDS AND TEXT
We can see a word as a special case of shape.
Selective (?)
Associative (?)
Quantitative (No)
Order (No)
Texts often require serial processing. On the other hand, visual marks can often be processed in parallel.

NUMBERS
Again, we can see a number as a special case of shape.
Selective (?)
Associative (?)
Quantitative (yes!)
Order (it depends)
This is why spreadsheets are generally not good for spotting outliers, clusters, or trends.

VALUE
Changing a mark’s value is achieved by changing the darkness or lightness of the mark.
Color is divided into hue, saturation, and value. "Color" in the later slides actually refers to hue. Changes in saturation are not discussed.
Changes in value do not provide numerical readings. One grey may be seen as darker or lighter than another grey, but it will not be seen as 4 times as dark as the other grey.
Value (grey scale) is not quantitative.
EXAMPLE

COLOR
Color is not quantitative, since the relationship between two marks differing in color will not be read numerically.
Color is not ordered, since changes in color do not easily lend themselves to readings of greater or lesser.
The link between a color and its intended meaning is often not explicit, making it difficult to interpret, especially when many colors are used.
EXAMPLE

ORIENTATION
Numerical values, quantities or ratios are not associated with changes in orientation.
There seems to be some notion of order if the changes in orientation are progressive. If they are organized randomly, this sense of order does not exist.
While variation in orientation is theoretically infinite, in practice it may be wise to limit its use to four orientations: vertical, horizontal, and two opposing diagonals.
EXAMPLE: FLOW VISUALIZATION
http://web.cs.wpi.edu/~matt/courses/cs563/talks/flowvis/flowvis.html

GRAIN, PATTERN, AND TEXTURE
GRAIN
PATTERN: The characters of pattern are basically the same as those of shape.
TEXTURE

FLAGS AND LOGOS
EXAMPLE: FLAGS
EXAMPLE: LOGOS
Well-known flags and logos usually require a very low mental workload to interpret their symbolic meaning.
Famous flags and logos are selective and associative (e.g., when you go grocery shopping).
Obscure logos are generally not selective.

MOTION
Selective: probably
Associative: yes
Quantitative: no
Order: probably
Length: considerable variations

OTHER VISUAL VARIABLES
Bertin’s book did not include depth, occlusion, and transparency, which should be addressed.

WHAT VISUAL MARKS AND VISUAL VARIABLES ARE USED?
Based on the works of Drs. Jock Mackinlay, William Cleveland, et al.
RANKINGS OF VISUAL VARIABLES
Rank visual variables for different data types:
Quantitative (numerical) data
Ordinal data
Nominal data

Ranking by accuracy for quantitative data (most accurate first):
Position, Length, Angle, Slope, Area, Volume, Density, Color Saturation

Ranking by accuracy for ordinal data (most accurate first):
Position, Density, Color Saturation, Color Hue, Texture, Connection, Containment, Length, Angle, Slope, Area, Volume

Ranking by accuracy for nominal data (most accurate first):
Position, Color Hue, Texture, Connection, Containment, Density, Color Saturation, Shape, Length, Angle, Slope, Area, Volume

SUMMARY
Selective: all the visual variables are selective.
Associative: every one is associative except for shape.
Quantitative: position, yes; size, maybe.
Order: position, size, and value.
Length: all have theoretically infinite length, but they are limited by the resolution of the computer display.

INTERPRETATION OF SYMBOLIC MEANINGS
All visual variables require some mental effort to interpret their symbolic meanings.
This is also part of visual mapping.
Developers often pay less attention to this aspect of visual mapping.
However, the complexity of reading a visualization is largely influenced by the symbolic interpretation.

READINGS
M.S.T. Carpendale, “Considering Visual Variables as a Basis for Information Visualization”, Technical Report, Dept. of Computer Science, University of Calgary, 2001
http://innovis.cpsc.ucalgary.ca/innovis/uploads/Publications/Publications/Carpendale-VisualVariablesInformationVisualization.2003.pdf
Interview with Jacques Bertin
http://www.infovis.net/printMag.php?num=116&lang=2

REFERENCE
Jacques Bertin, “Semiology of graphics: diagrams, networks, maps.” University of Wisconsin Press, 1983 (first published in French in 1967.
Translated in 1983)

REFERENCE AND MATERIAL TO STUDY FOR THESE SLIDES
Carpendale M.S.T., “Considering Visual Variables as a Basis for Information Visualization”
https://uninadue.sharepoint.com/:b:/s/v.msteams_20240910204608/EXe_eFndKh9LnRbwNpTGhN8B0wchIc-dhvFAwEYvziU4Rg?e=txOP4a

DATA VISUALIZATION & REPORTING
Prof. Antonio Irpino
BASIC PLOTS

TYPES OF PLOTS FROM A GEOMETRICAL POINT OF VIEW
Plots based on Cartesian coordinate systems (X-Y, X-Y-Z)
Plots based on angles, or on polar coordinate systems (pie charts)
Plots based on topological spaces (maps, …)

(VERTICAL) BAR CHART
A bar chart (or bar diagram, or bar graph) is a plot based on a Cartesian coordinate system.
In general, the X-axis lists the objects to compare, while the Y-axis shows their values.
A vertical bar is associated with each object, and its height depends on the corresponding value.
Please consider "objects" and "values" as very generic terms: objects can be individuals, categories, or discrete values, while values can be intensities, frequencies, counts, etc.
NOTE: The Y-axis is always numeric! A horizontal bar chart is a vertical one where the X and Y axes are swapped.

BAR CHART: INPUT DATA
Use case  Type of variable       Type of tabulation  Objects (X)    Values (Y)
1         Quantitative           Series              Individuals    Intensity
2         Quantitative           Geo series          Places         Intensity
3         Quantitative           Time series         Time           Intensity
4         Quantitative discrete  Frequency table     Single values  Frequency
5         Qualitative            Frequency table     Categories     Frequency

BAR CHART: CASES 1 AND 2 (SERIES: X=INDIVIDUALS)
Some precipitations
City           prec
Mobile         67.0
Juneau         54.7
Phoenix         7.0
Little Rock    48.5
Los Angeles    14.0
Sacramento     17.2
San Francisco  20.7
Denver         13.0

In the example, we reported the cities (the statistical units) according to the level of precipitation (the observed variable).
In this case the X-axis is not a classical dimension (like in a Cartesian plot).
Indeed, it represents a set of objects that is not numeric and has no intrinsic or conventional order (i.e., there is no order between the objects if we consider only the list of objects).
When presenting such a plot, we can enrich its communication potential with something simple: for example, sorting the subjects according to the observed values.

SORTING BARS: CAN WE ALWAYS SORT?
Allowed: when the series is related to objects without a natural or conventional order. In this case, the communication of the graph improves!
Not allowed: otherwise! (We will see this next.)

BAR CHART: CASE 3 (TIME SERIES)
A time series is a set of values observed over time: a series of observations where each observation is associated with a time-stamp or a time period.
Some time-related patterns can be observed: trend, variation between periods.
However, the bar chart is not the optimal graph for time series. In general, line graphs are used for time series (we will see them later).
NOTE: In this case, the observations (the objects) are naturally ordered, so the X-axis represents time.

BAR CHART: CASE 4 (FREQUENCY TABLE FROM A NUMERIC DISCRETE VARIABLE)
A frequency table is a tabular organization of raw data in a compact form: it displays the unique values of a variable together with their frequencies, that is, the number of times each value occurs in the data set.
A frequency table typically consists of two columns: one for the values and one showing the frequency of each value in the data set.
In this case, the objects on the X-axis are the single values of the observed variable, while the values on the Y-axis are frequencies (counts, relative counts, percentages of counts, …).
Since the variable is numeric, the objects have a natural order.
IMPORTANT NOTE: When data are arranged in frequency tables, bar charts are allowed only for discrete variables. For continuous variables the graph is not a bar chart but, as we will see, a histogram.
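The sorting advice for series without a natural order (cases 1 and 2) can be sketched in R with ggplot2, reusing the precipitation values listed in the table of cases 1 and 2. The data frame and column names (prec_df, city, prec, city_sorted) are assumptions made for this illustration; geom_col() is used as the equivalent of geom_bar(stat = "identity").

```r
library(ggplot2)

# Precipitation series from the slides; object and column names are illustrative.
prec_df <- data.frame(
  city = c("Mobile", "Juneau", "Phoenix", "Little Rock",
           "Los Angeles", "Sacramento", "San Francisco", "Denver"),
  prec = c(67.0, 54.7, 7.0, 48.5, 14.0, 17.2, 20.7, 13.0)
)

# Unsorted: cities appear in alphabetical order (the default for character data).
p_unsorted <- ggplot(prec_df, aes(x = city, y = prec)) +
  geom_col() +
  labs(x = "City", y = "Precipitation")

# Sorted: reorder() builds a factor whose levels follow decreasing precipitation.
prec_df$city_sorted <- reorder(prec_df$city, -prec_df$prec)

p_sorted <- ggplot(prec_df, aes(x = city_sorted, y = prec)) +
  geom_col() +
  labs(x = "City", y = "Precipitation")

print(p_sorted)
```

The same reordering would not be appropriate for a time series or for the values of a discrete numeric variable, where the X-axis already has a natural order.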
FREQUENCY DISTRIBUTION OF SOME HOUSEHOLDS ACCORDING TO THEIR SIZE
Num. in household  Freq
1                  1244
2                  2156
3                  1357
4                  1208
5                   549
6                   193
7                    81
8                    46
9+                   42
Tot.               6876

BAR CHART: CASE 5 (FREQUENCY TABLE OF A CATEGORICAL VARIABLE)
A frequency table of a categorical variable typically consists of two columns: one for the categories and one showing the frequency of each category in the data set.
In this case, the objects on the X-axis are the categories of the observed variable, while the values on the Y-axis are frequencies (counts, relative counts, percentages of counts, …).

ORDERING CATEGORIES
If a variable is nominal, there is no natural or conventional order between categories (only the equivalence relation can be verified between observed values). In this case, it is possible to change the order of the categories (the X-axis objects) to improve the communication power of the graph. Variables like: “Gender”, “Marital status”, “Ethnicity”, …
If a variable is ordinal, there is a natural or conventional order between categories (an order relation, and not only equivalence, can be verified between observed values). In this case, it is not useful to change the order of the categories (the X-axis objects). Variables like: “Education level”, “Agreement with a sentence”, …

BAR CHART OF A FR. TAB. FROM A NOMINAL VARIABLE (UNSORTED)
Occupation               Freq
professional/managerial  2,333
sales                      617
laborer                    590
clerical/service           834
homemaker                  504
student                  1,128
military                   140
retired                    488
unemployed                 242
Tot.                     6,876

BAR CHART OF A FR. TAB. FROM A NOMINAL VARIABLE (SORTED)
Occupation               Freq
professional/managerial  2,333
student                  1,128
clerical/service           834
sales                      617
laborer                    590
homemaker                  504
retired                    488
unemployed                 242
military                   140
Tot.                     6,876

BAR CHART OF A FR. TAB.
FROM AN ORDINAL VARIABLE
Education  Freq
grade …

… %>% pipe operator

    new_mtcars %>%
      arrange(desc(drat)) %>%
      mutate(car_names = factor(car_names, levels = car_names)) %>%  # this trick updates the levels of the names
      ggplot(aes(x = car_names, y = drat)) +  # take care: here I have to map both x and y, and I must not set the data again
      geom_bar(stat = "identity")

file:///C:/Users/AntonioYOGA/Desktop/LAB_01/LAB_03-barcharts-and-pies.html 22/10/2020 LAB_03 Barcharts and Piecharts in GGplot

Maybe a horizontal bar chart is better:

    ggplot(new_mtcars, aes(x = reorder(car_names, -drat), y = drat)) +  # take care: here I have to map both x and y
      geom_bar(stat = "identity") +  # stat = "identity" takes the values as they are, without counting them
      xlab("Cars") +                 # I want this label
      coord_flip()                   # flip the plot

Drawing pies and donuts
The starting point for drawing a pie or a donut plot in ggplot is the stacked percentage bar chart.
(but horizontal)

    ggplot(new_mtcars, aes(y = 1,                      # y is fixed at a single point
                           fill = as.factor(gear))) +  # gear is turned into a categorical variable, so that position = "fill" can be used
      geom_bar(position = "fill")                      # draw a stacked percentage bar chart

To obtain a PIE CHART with ggplot you have to change the coordinate system from Cartesian to polar:

    ggplot(new_mtcars, aes(y = 1,                      # y is fixed at a single point
                           fill = as.factor(gear))) +  # gear is turned into a categorical variable, so that position = "fill" can be used
      geom_bar(position = "fill") +                    # draw a stacked percentage bar chart
      coord_polar() +                                  # change the coordinates; this also sets which variable gives the angle
      theme_void()                                     # to see only the pie

For the donut we have to add some tricks:

    ggplot(new_mtcars, aes(y = 1,                      # y is fixed at a single point
                           fill = as.factor(gear))) +  # gear is turned into a categorical variable, so that position = "fill" can be used
      geom_bar(position = "fill") +                    # draw a stacked percentage bar chart
      coord_polar() +                                  # change the coordinates; this also sets which variable gives the angle
      ylim(c(-1, 2)) +                                 # this is the trick for having donuts
      theme_void()                                     # to see only the donut

How to do some pretty stacked % bars, pies and donuts
I would like to add values close to the graph elements. To do this we have to change our perspective.
1. We need to use stat="identity" in the settings of geom_bar.
2. To do that we have to transform our data:
   1. We have to compute the frequency table (from a series).
   2. We have to compute where to put the label.
   3.
We have to write the label.

Computing the frequency table
Using dplyr, it is quite easy to compute a simple frequency table:

    my_new_freq_table <- new_mtcars %>%
      select(gear) %>%              # select a column
      group_by(gear) %>%            # group by the values
      summarize(Abs_freq = n())     # compute frequencies
    my_new_freq_table

    gear  Abs_freq
    3     15
    4     12
    5      5
    3 rows

Let’s compute the relative and the percentage frequencies:

    my_new_freq_table <- new_mtcars %>%
      select(gear) %>%                    # select a column
      group_by(gear) %>%                  # group by the values
      summarize(Abs_freq = n()) %>%       # compute frequencies
      mutate(Rel_freq = round(Abs_freq / sum(Abs_freq), digits = 4),  # compute the relative frequencies
             Perc_freq = Rel_freq * 100)                              # and the percentages
    my_new_freq_table

    gear  Abs_freq  Rel_freq  Perc_freq
    3     15        0.4688    46.88
    4     12        0.3750    37.50
    5      5        0.1562    15.62
    3 rows

Using my new frequency table I can draw a stacked bar (but note that now I use stat="identity", so geom_bar does not compute frequencies but uses the values of the data table as they are):

    my_new_freq_table %>%
      ggplot(aes(x = "", y = Perc_freq, fill = as.factor(gear))) +
      geom_bar(stat = "identity")

First of all, I want to put the percentages at the center of my bars but, remember, the bars are stacked.
To do this, we first create a variable containing the text we want to add:

    my_new_freq_table <- my_new_freq_table %>%
      mutate(Lab_text = paste0(Perc_freq, "%"))  # paste0 concatenates strings, thus Lab_text is a new chr variable
    my_new_freq_table

    gear  Abs_freq  Rel_freq  Perc_freq  Lab_text
    3     15        0.4688    46.88      46.88%
    4     12        0.3750    37.50      37.5%
    5      5        0.1562    15.62      15.62%
    3 rows

Now let’s place the text labels at the center of the bars:

    my_new_freq_table %>%
      ggplot(aes(x = "", y = Perc_freq, fill = as.factor(gear))) +
      geom_bar(stat = "identity") +
      geom_text(aes(label = Lab_text),                    # add a text for each row of the table
                position = position_stack(vjust = 0.5))   # vertically centered within each stacked segment

Now let’s make the pie (i.e. change the coordinate system), but carefully (here we deal with stat="identity"):

    my_new_freq_table %>%
      ggplot(aes(x = 1, y = Perc_freq, fill = as.factor(gear))) +
      geom_bar(stat = "identity") +
      geom_text(aes(label = Lab_text), position = position_stack(vjust = 0.5)) +
      coord_polar(theta = "y") +                   # here I have to specify which variable is used for computing the angle
      labs(fill = "# of forward gears") +          # change the legend title according to the aesthetic used for the legend
      ggtitle("Pie chart from mtcars dataset") +   # add a title
      theme_void()

… and the donut.

    my_new_freq_table %>%
      ggplot(aes(x = 1, y = Perc_freq, fill = as.factor(gear))) +
      geom_bar(stat = "identity") +
      geom_text(aes(label = Lab_text), position = position_stack(vjust = 0.5)) +
      xlim(c(-0.5, 1.5)) +  # the trick for making a hole in the pie (i.e.
      # how to make a donut in ggplot)
      coord_polar(theta = "y") +
      labs(fill = "# of forward gears") +
      ggtitle("Donut chart from mtcars dataset") +
      theme_void()

LAB_03 GGplot for continuous univariate variables

The data
In this talk we use the dataset that is on my Dropbox.

    # libraries
    library(tidyverse)
    library(devtools)
    install_github("JanCoUnchained/ggunchained")
    # How to read a csv from the web
    #file … %>%
      pivot_longer(cols = c("math.score", "reading.score", "writing.score"),  # which variables I have to take
                   names_to = "var",       # the new column holding the variable names
                   values_to = "score")    # the new column holding the values
    head(scores_bygender)

    gender  var            score
    female  math.score     72
    female  reading.score  72
    female  writing.score  74
    female  math.score     69
    female  reading.score  90
    female  writing.score  88
    6 rows

Now we can plot the boxplots, using the aesthetics carefully:

    ggplot(scores_bygender,                            # the new dataset
           aes(x = var, y = score, fill = gender)) +   # a new aesthetic
      geom_boxplot() +
      theme_bw()

You can combine jittering and boxplots, for example:

    ggplot(scores_bygender,                            # the new dataset
           aes(x = var, y = score, fill = gender)) +   # a new aesthetic
      geom_boxplot() +
      geom_jitter(alpha = 0.3, position = position_jitterdodge()) +  # the position is important here
      theme_bw()

Plots for densities

Beeswarms
For plotting beeswarms we will use a new library, ggbeeswarm (see examples here (https://cran.r-project.org/web/packages/ggbeeswarm/vignettes/usageExamples.pdf)).

    # We use a library created for swarm plots within the gg framework
    # install.packages("ggbeeswarm")  # uncomment this line to install it
    library(ggbeeswarm)  # load the library
    ggplot(data, aes(y = math.score, x = "")) +  # if you have only one variable, it must be put on the y aesthetic
      geom_beeswarm(colour = "red", alpha = 0.5) +
      coord_flip() +
      theme_bw()

You
can use different styles of beeswarming:

    ggplot(data, aes(y = math.score, x = "")) +
      geom_quasirandom(color = "darkgreen", alpha = 0.5) +  # another geom in ggbeeswarm
      coord_flip() +
      theme_bw()

    data %>%
      sample_n(200) %>%  # sampled data, to see better
      ggplot(aes(y = math.score, x = "")) +
      geom_quasirandom(method = "frowney", color = "red", size = 2, alpha = 0.5) +  # another method of the same geom
      coord_flip() +
      theme_bw()

    data %>%
      sample_n(200) %>%
      ggplot(aes(y = math.score, x = "")) +
      geom_quasirandom(method = "smiley", color = "blue", size = 2, alpha = 0.5) +  # another method of the same geom
      coord_flip() +
      theme_bw()

Also in this case we can use it for comparisons, for example:

    ggplot(onlyscores, aes(x = var, y = score, color = var)) +
      geom_beeswarm(alpha = 0.5) +
      theme_bw()

or

    ggplot(data,  # we have to use the original data
           aes(x = gender, y = math.score, colour = gender)) +
      geom_beeswarm(alpha = 0.7) +
      theme_bw()

Histograms
Histograms in ggplot use geom_histogram (see some examples here (http://www.sthda.com/english/wiki/ggplot2-histogram-plot-quick-start-guide-r-software-and-data-visualization)). Only equal-width histograms are allowed in ggplot.

    # Basic histogram
    ggplot(data, aes(x = math.score)) +  # the histogram has one variable
      geom_histogram(color = "black", fill = "green")

    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

    # Fixing the bins
    ggplot(data, aes(x = math.score)) +
      geom_histogram(color = "black", fill = "green", bins = 20)  # how many bins do you want?

    # Fixing the bin width
    ggplot(data, aes(x = math.score)) +
      geom_histogram(color = "black", fill = "green", binwidth = 5)  # how wide is a bin?

Also in this case you can use it (carefully) for comparisons, but take care of the “density”:

    ggplot(onlyscores, aes(x = score, fill = var)) +  # the histogram has one variable
      geom_histogram(alpha = 0.3, color = "grey30") +
      ggtitle("Bad histograms")

    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This happens because geom_histogram inherits some features of geom_bar; one of these is the default position="stack".
The good way:

    ggplot(onlyscores, aes(x = score, fill = var)) +  # the histogram has one variable
      geom_histogram(aes(y = ..density..), position = "identity",  # this is the right way! Plot densities, not counts
                     alpha = 0.3, color = "grey30") +
      ggtitle("The good histograms")

    ## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
    ## ℹ Please use `after_stat(density)` instead.
    ## This warning is displayed once every 8 hours.
    ## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
    ## generated.
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Is faceting better?

    ggplot(onlyscores, aes(x = score, fill = var)) +  # the histogram has one variable
      geom_histogram(aes(y = ..density..), position = "identity",  # this is the right way! Plot densities, not counts
                     alpha = 0.3, color = "grey30") +
      facet_grid(var ~ .)  # faceting by rows, on one variable

    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Using also gender:

    ggplot(scores_bygender, aes(x = score, fill = gender)) +  # the histogram has one variable
      geom_histogram(aes(y = ..density..), position = "identity",  # this is the right way! Plot densities, not counts
                     alpha = 0.3, color = "grey30") +
      facet_grid(var ~ gender)  # faceting by rows (var) and columns (gender)

    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

… or …

    ggplot(scores_bygender, aes(x = score, fill = gender)) +  # the histogram has one variable
      geom_histogram(aes(y = ..density..), position = "identity",  # this is the right way! Plot densities, not counts
                     alpha = 0.3, color = "grey30") +
      facet_grid(gender ~ var)  # faceting by rows (gender) and columns (var)

    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Which do you prefer?
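The deprecation warning printed above suggests after_stat(density) in place of ..density.. (this requires ggplot2 3.3.0 or later). A minimal sketch of the overlaid density histograms in that notation follows; since the lab dataset is not reproduced here, the built-in iris data are used as a stand-in, and the binwidth of 0.25 is an arbitrary choice for this illustration.

```r
library(ggplot2)

# Overlaid density histograms with the non-deprecated after_stat() notation.
# iris stands in for the lab data; Species plays the role of the grouping variable.
p_hist <- ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_histogram(aes(y = after_stat(density)),  # plot densities, not counts
                 position = "identity",         # overlay the groups instead of stacking them
                 binwidth = 0.25,               # setting the bin width avoids the `bins = 30` message
                 alpha = 0.3, color = "grey30") +
  ggtitle("Overlaid density histograms")

print(p_hist)
```

Because the density is computed separately for each group, the bars of every group cover a total area of 1, which is what makes groups of different sizes comparable.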
Kernel Density Estimation
With ggplot, the main geometry is geom_density:

    # Basic density
    ggplot(data, aes(x = math.score)) +  # the density has one variable
      geom_density(color = "black", fill = "green")

Also in this case you can use it (carefully) for comparisons, but take care of the “density”:

    ggplot(onlyscores, aes(x = score, fill = var)) +  # the density has one variable
      geom_density(alpha = 0.3, color = "grey30") +
      ggtitle("Densities with overplotting")

Is faceting better?

    ggplot(onlyscores, aes(x = score, fill = var)) +  # the density has one variable
      geom_density(alpha = 0.3, color = "grey30") +
      facet_grid(var ~ .)  # faceting by rows, on one variable

Using also gender:

    ggplot(scores_bygender, aes(x = score, fill = gender)) +  # the density has one variable
      geom_density(alpha = 0.3, color = "grey30") +
      facet_grid(var ~ gender)  # faceting by rows (var) and columns (gender)

… or …

    ggplot(scores_bygender, aes(x = score, fill = gender)) +  # the density has one variable
      geom_density(alpha = 0.3, color = "grey30") +
      facet_grid(gender ~ var)  # faceting by rows (gender) and columns (var)

… or ridging… (see here (https://cran.r-project.org/web/packages/ggridges/vignettes/introduction.html))

    library(ggridges)
    ggplot(scores_bygender, aes(x = score, y = var, fill = gender)) +  # the density has one variable
      geom_density_ridges(alpha = 0.3, color = "grey30")

    ## Picking joint bandwidth of 3.69

or by gender:

    library(ggridges)
    ggplot(scores_bygender, aes(x = score, y = gender, fill = var)) +  # the density has one variable
      geom_density_ridges(alpha = 0.3, color = "grey30")

    ## Picking joint bandwidth of 3.69

    scores_bygender %>%
      mutate(gend.x.subj = paste(gender, var, sep = "_")) %>%
      ggplot(aes(x = score, y = gend.x.subj, fill = gender)) +  # the density has one variable
      geom_density_ridges(alpha = 0.3, color = "grey30")

    ## Picking joint bandwidth of 3.69

    scores_bygender %>%
      mutate(gend.x.subj = paste(var, gender, sep = "_")) %>%
      ggplot(aes(x = score, y = gend.x.subj, fill = gender)) +  # the density has one variable
      geom_density_ridges(alpha = 0.3,
                          color = "grey30")

    ## Picking joint bandwidth of 3.69

Violins and more
We end with violin plots, a fusion of boxplot, beeswarm and KDE. In ggplot we can use the geom_violin function:

    # Basic violin
    ggplot(data, aes(x = math.score, y = "")) +  # violins need the y aesthetic
      geom_violin(color = "black", fill = "green")

Also in this case you can use it for comparisons:

    ggplot(onlyscores, aes(x = score, y = var, fill = var)) +
      geom_violin(alpha = 0.3, color = "grey30")

or vertical…

    ggplot(onlyscores, aes(y = score, x = var, fill = var)) +
      geom_violin(alpha = 0.3, color = "grey30")

… using aesthetics …

    ggplot(scores_bygender, aes(y = score, x = var, fill = gender)) +
      geom_violin(alpha = 0.5, color = "grey30")

… or …

    # NOT on CRAN: if you want to install it, follow these steps:
    # install.packages("remotes")
    # remotes::install_github("JanCoUnchained/ggunchained")
    library(ggunchained)
    ggplot(scores_bygender, aes(y = score, x = var, fill = gender)) +
      geom_split_violin(alpha = 0.5, color = "grey30")

Now you can play with the data as you want
Let’s build a complete pivoted table of our data:

    pivot_data <- data %>%
      pivot_longer(cols = c("math.score", "reading.score", "writing.score"),  # which variables I have to take
                   names_to = "var",       # the new column holding the variable names
                   values_to = "score")    # the new column holding the values
    head(pivot_data)

    gen…    race.ethnicity  parental.level.of.education  lunch     test.preparation.course
    female  group B         bachelor's degree            standard  none
    female  group B         bachelor's degree            standard  none
    female  group B         bachelor's degree            standard  none
    female  group C         some college                 standard  completed
    female  group C         some college                 standard  completed
    female  group C         some college                 standard  completed
    6 rows | 1-6 of 7 columns

    ggplot(pivot_data, aes(x = score, fill = parental.level.of.education)) +  # the density has one variable
      geom_density(alpha = 0.3, color = "grey30") +
      facet_grid(parental.level.of.education ~ var)  # faceting by rows (education) and columns (var)

Assignments
Use data or pivot_data for
discovering at least three interesting patterns using the techniques presented here. For each graph you produce, comment on the results.

DATA VISUALIZATION AND REPORTING
Comparing data
Prof. Antonio Irpino

MORE THAN ONE VARIABLE
So far, we have seen the main plots and charts for describing one variable at a time.
But, generally, data contain several phenomena registered on some units.
Aim: to describe the distribution of data of MULTIPLE variables visually.

In data analysis, the main aims of an analyst are:

To describe and/or measure the variability of the data:
  measuring the dispersion of data using indexes of variability (standard deviation, IQR, Gini or Shannon indexes of heterogeneity);
  measuring the shape of the variability: skewness and kurtosis;
  describing the distribution of data of a single variable visually: bar-plots, pie-charts, line charts, histograms, density plots, …

To explain the variability in the data at hand:
  A variable depends on another, or there is a correlation: we can measure association or correlation, or model causal relationships; we can express the association and the models visually.
  Data present a cluster structure (groups in data): we can describe the cluster structure.

DESCRIBING MORE THAN ONE VARIABLE AT A TIME
There are many ways of representing information arising from a dataset described by multiple and heterogeneous variables.
Adding aesthetics to existing charts: we know that we can use color, area, shapes, line widths, gradients, and transparency effects, also combining them, for showing groups, categories, values.
Existing charts and plots: for example, 2D or 3D scatter plots, mosaic plots.
Extending classical charts to multiple variables: for example, 3D bar charts, 3D histograms, 2D density.

Let’s start with plots for representing 2 variables, according to the nature of the two variables (both numerical, both categorical, mixed).

NUMERICAL VS NUMERICAL

SCATTER-PLOT
A scatterplot displays the relationship between 2 numeric variables.
For each data point, the value of its first variable is represented on the X-axis, the second on the Y-axis.
Here is an example considering the price of 1,460 apartments and their ground living area. This dataset comes from a Kaggle machine learning competition. You can download the data from here.

SCATTER-PLOT OF KAGGLE APARTMENT DATASET

A SCATTERPLOT IS MADE TO STUDY THE RELATIONSHIP BETWEEN 2 VARIABLES
Thus it is often accompanied by the calculation of a correlation coefficient, which usually tries to measure the linear relationship.
However, other types of relationship can be detected using scatterplots, and a common task consists of fitting a model explaining Y as a function of X.
Here are a few patterns you can detect with a scatterplot.

HOW TO ENRICH A SCATTERPLOT: SHOW MARGINAL DISTRIBUTIONS

SHOWING MULTIPLE VARIABLES USING A SCATTERPLOT
Even if a scatterplot is for two numeric variables, when the number of displayed points is not very high, it is possible to add colors, shapes, or sizes (of points) according to other variables:
if the variables are categorical, we can use color and shapes;
if the variables are numerical, we can also use size.
Let’s see some examples. Some are inspired by https://rkabacoff.github.io/datavis/
Plot: Salaries data from the carData package in R.

COMMON MISTAKES: 1) OVERPLOTTING
Overplotting is the most common mistake when the sample size is high, and a common issue in dataviz.
When your dataset is large, the dots of your scatterplot will tend to overlap, making the graphic unreadable. This issue is illustrated in the scatterplot below. A first look might lead to the conclusion that there is no obvious relationship between X and Y. We will see below how wrong this conclusion would be.
This post describes about 10 different workarounds to fix this issue.

COMMON MISTAKES: 2) GROUPS IN DATA
Don’t forget to show subgroups if you have some.
Indeed it can reveal important hidden patterns in your data, as in the case of Simpson's paradox.

Simpson's paradox
Simpson's paradox (or Simpson's reversal, Yule–Simpson effect, amalgamation paradox, or reversal paradox) is a phenomenon in probability and statistics in which a trend appears in several different groups of data but disappears or reverses when these groups are combined.
Figure: Example of Simpson's paradox

SOME WORKAROUNDS FOR OVERPLOTTING IN A SCATTERPLOT 1
1) Decreasing dot size
The easiest workaround is probably to reduce dot size. Depending on the quantity of overlap you have, it can give a really satisfying result. Here it appears clearly that 3 clusters are present, which was hidden in the previous figure.

SOME WORKAROUNDS FOR OVERPLOTTING IN A SCATTERPLOT 2
Transparency
In combination with decreasing dot size, using transparency also allows you to reveal patterns hidden by overplotting. Look at the figure more closely! What could we do?

SOME WORKAROUNDS FOR OVERPLOTTING IN A SCATTERPLOT: DENSITY
2D density
The 2D density chart basically counts the number of observations within a particular area of the 2D space and represents this count by a color. If you divide the space into several squares, you get a 2D histogram. If you use hexagons instead of squares, you get a hexbin plot. You can also calculate a density estimate and represent 2D density plots or contour plots.

SOME WORKAROUNDS FOR OVERPLOTTING IN A SCATTERPLOT: DENSITY AND MORE
More on 2D density plots
These graphics are basically extensions of the well known density plot and histogram. The global concept is the same for each variation. One variable is represented on the X axis, the other on the Y axis, like for a scatterplot. Then, the number of observations within a particular area of the 2D space is counted, a density is computed, and represented by a color gradient.
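The counting step described above can be sketched in a few lines. This is only a minimal illustration in plain Python, not the routine used to produce the figures of the course: the function name `histogram_2d`, its parameters and the sample points are assumptions made for the example.

```python
def histogram_2d(points, x_range, y_range, bins_x, bins_y):
    """Count how many points fall in each rectangular tile.

    Returns a list of rows (one per y-bin), each holding bins_x counts.
    Names and signature are hypothetical, chosen for this sketch.
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    tile_w = (x_max - x_min) / bins_x
    tile_h = (y_max - y_min) / bins_y
    counts = [[0] * bins_x for _ in range(bins_y)]
    for x, y in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # ignore points outside the plotting window
        # Points lying on the upper edge are assigned to the last tile.
        col = min(int((x - x_min) / tile_w), bins_x - 1)
        row = min(int((y - y_min) / tile_h), bins_y - 1)
        counts[row][col] += 1
    return counts


# Made-up points, for illustration only.
sample = [(0.1, 0.1), (0.2, 0.3), (0.9, 0.9), (1.0, 1.0), (0.6, 0.2)]
grid = histogram_2d(sample, (0.0, 1.0), (0.0, 1.0), bins_x=2, bins_y=2)
for row in grid:
    print(row)
```

Each count would then be mapped to a color of the gradient; overlapping dots no longer hide each other because every tile is drawn once.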
The shape can vary:
- squares or rectangles make 2d histograms;
- hexagons are often used, leading to a hexbin chart;
- it is also possible to compute a kernel density estimate to get 2d density plots or contour plots.
Here is an overview of these different possibilities.

2D HISTOGRAM PLOTS
Both the x- and y-axes are binned, so that a rectangular grid is laid over the 2d data. Each rectangular tile is coloured according to its density (or to its frequency count, if all tiles have the same area). The density of a tile is obtained by dividing the count of data in the tile (the frequency) by the tile area.

2D HEXBIN PLOTS
Hexagon binning is a form of bivariate histogram useful for visualizing the structure in datasets with large n. The underlying concept of hexagon binning is extremely simple:
1. the xy plane over the set (range(x), range(y)) is tessellated by a regular grid of hexagons;
2. the number of points falling in each hexagon is counted and stored in a data structure;
3. the hexagons with count > 0 are plotted using a color ramp or by varying the radius of the hexagon in proportion to the counts.
The underlying algorithm is extremely fast and effective for displaying the structure of datasets with $n > 10^6$. If the size of the grid and the cuts in the color ramp are chosen in a clever fashion, then the structure inherent in the data should emerge in the binned plots. The same caveats apply to hexagon binning as apply to histograms, and care should be exercised in choosing the binning parameters.
A tessellation of a flat surface is the tiling of a plane using one or more geometric shapes, called tiles, with no overlaps and no gaps. We consider only tessellations with regular polygons here!

Why hexagons?
There are many reasons for using hexagons, at least over squares. Hexagons have symmetry of nearest neighbors which is lacking in square bins.
The hexagon is the regular polygon with the maximum number of sides that can still tessellate the plane, so in terms of packing a hexagon is 13% more efficient for covering the plane than a square. This property translates into better sampling efficiency, at least for elliptical shapes. Hexagons are visually less biased for displaying densities than other regular tessellations. For instance, with squares our eyes are drawn to the horizontal and vertical lines of the grid.
Figure: Three tessellation possibilities

2D HISTOGRAMS AND HEXBIN PLOTS: CHOOSING THE BIN WIDTH

FROM BOXPLOTS TO BAG PLOTS
In 1999, Rousseeuw et al. introduced the bagplot, an extension of the box-plot to 2D data. A bagplot allows one to visualize the location, spread, skewness, and outliers of a data set.
A bagplot consists of three nested polygons, called the bag, the fence, and the loop.
- The inner polygon, called the bag, is constructed on the basis of Tukey depth, the smallest number of observations that can be contained by a half-plane that also contains a given point. It contains at most 50% of the data points.
- The outermost of the three polygons, called the fence, is not drawn as part of the bagplot, but is used to construct it. It is formed by inflating the bag by a certain factor (usually 3). Observations outside the fence are flagged as outliers.
- The observations that are not marked as outliers are surrounded by a loop, the convex hull of the observations within the fence.
An asterisk symbol near the center of the graph is used to mark the depth median, the point with the highest possible Tukey depth. The observations between the bag and the fence are marked by line segments connecting them to the bag, drawn along the line to the depth median.
The three-dimensional version consists of an inner and an outer bag. The outer bag must be drawn in transparent colors so that the inner bag remains visible.
Before looking at a bagplot, let's look at the concept of depth.
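To make the definition of Tukey depth used for the bag more concrete, here is a small sketch in plain Python. It is an illustrative brute-force computation for 2D data, not the algorithm used by bagplot software; the function name `tukey_depth`, the tolerance `1e-9` and the five sample points are assumptions of this example. The idea is that the count of observations in a closed half-plane through the point can only change at directions where some observation lies exactly on the boundary line, so it is enough to test one direction between each pair of consecutive critical directions.

```python
import math


def tukey_depth(p, points):
    """Smallest number of observations in a closed half-plane whose
    boundary line passes through p (halfspace depth, 2D only)."""
    px, py = p
    offsets = [(x - px, y - py) for x, y in points]
    at_p = sum(1 for d in offsets if d == (0, 0))  # always inside
    others = [d for d in offsets if d != (0, 0)]
    if not others:
        return at_p
    # Normal directions at which some observation sits on the boundary.
    critical = set()
    for dx, dy in others:
        a = math.atan2(dy, dx)
        critical.add((a + math.pi / 2) % (2 * math.pi))
        critical.add((a - math.pi / 2) % (2 * math.pi))
    angles = sorted(critical)
    best = len(points)
    for i, a in enumerate(angles):
        nxt = angles[(i + 1) % len(angles)]
        if i == len(angles) - 1:
            nxt += 2 * math.pi  # wrap around the circle
        if nxt - a < 1e-9:
            continue  # same direction up to rounding error
        mid = (a + nxt) / 2
        ux, uy = math.cos(mid), math.sin(mid)
        count = at_p + sum(1 for dx, dy in others if ux * dx + uy * dy > 0)
        best = min(best, count)
    return best


# Made-up cloud: a centre and four points around it.
cloud = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
depths = {q: tukey_depth(q, cloud) for q in cloud}
depth_median = max(cloud, key=lambda q: depths[q])
print(depths, depth_median)
```

In this toy cloud the central observation has the highest depth, so it plays the role of the depth median; points on the edge of the cloud have the lowest depth.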
THE DEPTH OF A POINT IN A SCATTER PLOT
First of all, let's consider the concept of median: the median of a 1D series is the value that separates the upper 50% from the lower 50% of the series of sorted values. It is the value having the minimum sum of Euclidean distances from all the other points. (The mean has the minimum sum of squared Euclidean distances.)
Extending the concept of median to multivariate data, the generalized median of a scatter plot is the point that is the most central to the point cloud: its centerpoint.
Given a set of points in d-dimensional space, a centerpoint of the set is a point such that any hyperplane that goes through that point divides the set of points into two roughly equal subsets. Like the median, a centerpoint need not be one of the data points. Every non-empty set of points (with no duplicates) has at least one centerpoint.
The centerpoint is considered the point with the highest depth. The depth of a point relative to a given data set measures how deep that point lies in the data cloud. The data depth concept provides a center-outward ordering of points in any dimension.
There exist several ways of measuring the depth; see, for example, the first part of this paper: https://fenix.tecnico.ulisboa.pt/downloadFile/395137801290/paper.pdf

LET'S SEE THE BAGPLOT

A LITTLE RECAP: 1D FUNCTIONS
A one-dimensional (1d) function $y = f(x)$ can be plotted on an x-y Cartesian plane.
Examples
1. straight line: general formula $y = bx + a$, example $y = 2x + 1$
2. the Gaussian density function: given $\mu$ and $\sigma$, general formulation
$$y = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

FUNCTIONS OF SEVERAL VARIABLES
The temperature T at a point on the surface of the earth at any given time depends on the longitude x and latitude y of the point. We can think of T as being a function of the two variables x and y, or as a function of the pair (x, y). We indicate this functional dependence by writing T = f(x, y).
The volume V of a circular cylinder depends on its radius r and its height h. In fact, we know that $V = \pi r^2 h$. We say that V is a function of r and h, and we write $V(r, h) = \pi r^2 h$.

FUNCTIONS OF SEVERAL VARIABLES
We often write z = f(x, y) to make explicit the value taken on by f at the general point (x, y). The variables x and y are independent variables and z is the dependent variable. [Compare this with the notation y = f(x) for functions of a single variable.]

EXAMPLE 2
In regions with severe winter weather, the wind-chill index is often used to describe the apparent severity of the cold. This index W is a subjective temperature that depends on the actual temperature T and the wind speed v. So W is a function of T and v, and we can write W = f(T, v).
Table 1 records values of W compiled by the National Weather Service of the US and the Meteorological Service of Canada. For instance, the table shows that if the temperature is -5°C and the wind speed is 50 km/h, then subjectively it would feel as cold as a temperature of about -15°C with no wind. So f(-5, 50) = -15.
Table 1: Wind-chill index as a function of air temperature and wind speed

GRAPHICAL REPRESENTATION OF FUNCTIONS WITH TWO VARIABLES
A 2d function $z = f(x, y)$ needs three coordinates; thus we can represent it in a 3d Cartesian space.
Examples
1. a plane: general formulation $z = ax + by + c$
2. the 2d Gaussian density function: general formulation

GRAPHS
Another way of visualizing the behavior of a function of two variables is to consider its graph. Just as the graph of a function f of one variable is a curve C with equation y = f(x), so the graph of a function f of two variables is a surface S with equation z = f(x, y). We can visualize the graph S of f as lying directly above or below its domain D in the xy-plane (see Figure).
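A function of two variables can also be explored numerically by evaluating it on a grid of (x, y) pairs, which gives a two-way table in the spirit of Table 1. The sketch below does this for the cylinder volume $V(r, h) = \pi r^2 h$ from the text; the helper name `tabulate` and the grid values are assumptions chosen only for illustration.

```python
import math


def cylinder_volume(r, h):
    """V(r, h) = pi * r^2 * h, the example given in the text."""
    return math.pi * r ** 2 * h


def tabulate(f, xs, ys):
    """Evaluate f on every pair of the grid: one row per x, one column per y."""
    return [[f(x, y) for y in ys] for x in xs]


radii = [1.0, 2.0, 3.0]   # illustrative values only
heights = [1.0, 2.0]      # illustrative values only
table = tabulate(cylinder_volume, radii, heights)
for r, row in zip(radii, table):
    print(r, [round(v, 2) for v in row])
```

Such a table is the raw material for every graphical representation of a function of two variables: a surface plot, a set of level curves or a colored tile map are just different ways of drawing the same grid of values.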
EXAMPLE 6
Sketch the graph of $g(x, y) = \sqrt{9 - x^2 - y^2}$.
Solution: The graph has equation $z = \sqrt{9 - x^2 - y^2}$. We square both sides of this equation to obtain $z^2 = 9 - x^2 - y^2$, or $x^2 + y^2 + z^2 = 9$, which we recognize as an equation of the sphere with center the origin and radius 3. But, since $z \ge 0$, the graph of g is just the top half of this sphere (see Figure).
Figure: Graph of $g(x, y) = \sqrt{9 - x^2 - y^2}$

GRAPHICAL REPRESENTATION OF FUNCTIONS WITH TWO VARIABLES
A 3d representation is not always effective on a plane (a sheet of paper or a screen), because of the projection onto a 2d surface.

LEVEL CURVES
A method borrowed from mapmakers is the contour map, on which points of constant elevation are joined to form contour lines, or level curves.
A level curve f(x, y) = k is the set of all points in the domain of f at which f takes on a given value k. In other words, it shows where the graph of f has height k. You can see from the figure the relation between level curves and horizontal traces.
Figure: Level curves

The level curves f(x, y) = k are just the traces of the graph of f in the horizontal plane z = k projected down to the xy-plane. So if you draw the level curves of a function and visualize them being lifted up to the surface at the indicated height, then you can mentally piece together a picture of the graph. The surface is steep where the level curves are close together. It is somewhat flatter where they are farther apart.

One common example of level curves occurs in topographic maps of mountainous regions, such as the map in the Figure. The level curves are curves of constant elevation above sea level. If you walk along one of these contour lines, you neither ascend nor descend.
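As a small numerical illustration of a level curve, the sketch below locates points where a function crosses a given height k along the horizontal lines of a grid, using linear interpolation between neighbouring grid nodes. It is applied to the top half of the sphere of radius 3, whose level curve at height k is the circle of radius $\sqrt{9 - k^2}$. The function names, the grid steps and the convention of returning 0 outside the domain are assumptions made for this example; real contouring routines are more elaborate.

```python
import math


def g(x, y):
    """Top half of the sphere of radius 3; set to 0 outside the domain
    (a convention adopted only for this sketch)."""
    s = 9 - x * x - y * y
    return math.sqrt(s) if s > 0 else 0.0


def level_curve_points(f, k, xs, ys):
    """Approximate points of the level curve f(x, y) = k.

    For each grid row y, look for consecutive x nodes where f - k changes
    sign and interpolate linearly between them."""
    pts = []
    for y in ys:
        for x0, x1 in zip(xs, xs[1:]):
            d0, d1 = f(x0, y) - k, f(x1, y) - k
            if (d0 >= 0) != (d1 >= 0):
                t = d0 / (d0 - d1)
                pts.append((x0 + t * (x1 - x0), y))
    return pts


xs = [-3 + 0.05 * i for i in range(121)]  # fine grid along x
ys = [-3 + 0.5 * i for i in range(13)]    # coarse grid along y
curve = level_curve_points(g, 2.0, xs, ys)
radii = [math.hypot(x, y) for x, y in curve]
print(len(curve), min(radii), max(radii))
```

All the points found lie, up to the interpolation error, at the same distance from the origin, which is what we expect for a level curve of this surface. Joining such points for several values of k gives the contour map.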
LEVEL CURVES
Another common example is the temperature at locations (x, y) with longitude x and latitude y. Here the level curves are called isothermals and join locations with the same temperature.
The Figure shows a weather map of the world indicating the average January temperatures. The isothermals are the curves that separate the colored bands.
Figure: World mean sea-level temperatures in January in degrees Celsius

For some purposes, a contour map is more useful than a graph. This is true, for instance, when estimating function values. The Figure shows some computer-generated level curves together with the corresponding computer-generated graphs.

LEVEL CURVES AND CONTOUR PLOT
We use color for understanding visually the level of the function.

GRAPHICAL REPRESENTATION OF FUNCTIONS WITH TWO VARIABLES
A 2d function $z = f(x, y)$ needs three coordinates; thus we can represent it in a 3d Cartesian space.
Examples
1. a plane: general formulation $z = ax + by + c$, example $z = 3 + 2x$
2. the 2d Gaussian density function: general formulation
But a 3d representation is not always effective on a plane (a sheet of paper or a screen), because of the projection onto a 2d surface.

2D DENSITY REPRESENTATION
If we want a smooth representation of the density, we can use the kernel density technique also for 2d data. Remember: a kernel is a density function. In this case the kernel is a symmetric 2d density function. A 2d kernel is centered on each observed point. The density estimated for a general point of coordinates (x, y) is given as the average of the kernels for each observed point.
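The averaging of kernels described above can be written down directly. The following is a minimal sketch, assuming a standard symmetric 2d Gaussian kernel and a single bandwidth `h` for both axes; the function names, the bandwidth value and the three observations are assumptions of the example, and statistical software typically chooses the bandwidth automatically.

```python
import math


def gaussian_kernel_2d(u, v):
    """Standard symmetric 2d Gaussian density evaluated at (u, v)."""
    return math.exp(-0.5 * (u * u + v * v)) / (2 * math.pi)


def kde_2d(x, y, data, h=1.0):
    """Kernel density estimate at (x, y): the average of the kernels
    centered on each observation, rescaled by the bandwidth h."""
    total = sum(gaussian_kernel_2d((x - xi) / h, (y - yi) / h)
                for xi, yi in data)
    return total / (len(data) * h * h)


# Made-up observations, for illustration only.
observations = [(0.0, 0.0), (0.5, 0.2), (3.0, 3.0)]
near = kde_2d(0.2, 0.1, observations, h=0.5)
far = kde_2d(10.0, 10.0, observations, h=0.5)
print(near, far)
```

Evaluating `kde_2d` on a regular grid and coloring each node by its value gives the 2d density plot; drawing the level curves of the same grid gives the contour plot.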
BUT IF YOU REALLY WANT TO GO 3D
Figure: 2D density plot; 3D density plot; 3D histograms

USING 3D PLOTS
3D plots are not always easily understandable because of:
- perspective
- angle of view
- camera position
- lights and shadows
- confusion and "Escherness"

DATA VISUALIZATION AND REPORTING
Avoid 3D in plots

DEPTH PERCEPTION: A SUMMARY
Depth perception is our ability to perceive the world in three dimensions (3D), and to estimate how far away an object is. This actually occurs at the visual cortex of our brain with information coming from our eyes. Such information includes cues that involve both eyes (binocular cues) and cues that involve only one eye (monocular cues).
Figure: Binocular view

BINOCULAR CUES: SEEING 3D WITH TWO EYES
There are two main binocular cues that help us to perceive depth:
- Stereopsis, or retinal (binocular) disparity, or binocular parallax: because our eyes are located at different lateral positions on the head, binocular vision results in two slightly different images of the same scene projected onto the retinas of the eyes. The differences are mainly in the relative positions of objects in the two images. Such positional differences are referred to as binocular disparities. The visual cortex of our brain processes the disparities to yield depth perception.
- Convergence: when the two eyes focus on the same object, they angle inwards towards each other (convergence). Depending on the distance of the object, the eyes converge more (for a close-up object) or less (for a distant object). The extra effort used by the muscles on the outside of each eyeball gives a clue to the brain about how far away the object is. Try holding your finger 10 cm in front of your eyes and focusing on it: you will feel the eye muscles at work much sooner than if your finger is 50 cm away.
3D movies make use of binocular disparity by providing each eye with a different image to create the 3D effect. However, the brain does not receive any cues from convergence as it normally would.
This is why some people may feel uncomfortable when watching 3D movies. These binocular cues are most effective for objects up to 6 m away. Beyond that, our eye separation does not give a great enough difference in images to be useful in depth perception. MONOCULAR CUES – 3D INFORMATION FROM A SINGLE EYE. With one eye closed, we can still perceive the world in 3D and move around without crashing into things. This is because of monocular cues that help us to gauge the distance of an object. Some of these monocular cues are as follows: Accommodation: The ciliary muscles inside the eye adjust the shape of the lens so that we can focus on an object. Depending on the distance of that object, the ciliary muscles contracts more (for close-up object) or less (for distant object) to focus. The effort required provides the brain with information about distance of the object. Linear perspective is the monocular cue provided by the convergence of lines toward a single point on the horizon. Looking down a set of railroad tracks is a good example. We know that the tracks do not converge; they are parallel throughout, but when we look down the tracks, it appears that they converge to a single point. Relative size: Growing, we have learned the normal sizes of various objects. So when we see an object whose size shouldn’t change becoming larger or smaller, we interpret that as a distance cue (closer or farther). Superposition/Interposition: Objects that are in front of other objects may partially block our view of the rearmost object. Because we know what the object should look like, yet we see only part of it, we interpret the obstructed object as being farther away. Shadows: The way that light falls onto an object and reflects off its surfaces, and the shadows created also provide effective cues for our brain to determine the shape of objects and their position in space. 
Motion parallax: If you move your head, objects that are close to you will appear to move faster and more than those objects that are further away. This is clearly noted when we sit in a moving car and watch things passing by outside. ADDITIONAL MONOCULAR CUES Sharp focus or blurry: If two objects are at the same distance, they will both appear to be in focus. Objects that are closer or further away will appear blurry. Definition and textures: Close objects will have a lot of detail and definition apparent. More distant objects will not show with as much detail. This is very noticeable when looking at a field of grass. Close up, the blades of grass will be noticeable. Further away, the grass is more of a sea of green. Texture of an object is also affected by our proximity to it. The closer one is to something, the more detail or texture one can see. For example, if I look at a wall from 20 feet away, it will look fairly smooth. But, as we move closer, we begin to notice the roughness and texture. Such correlation between distance and texture also provides depth information to our brain. Vividness of colours: distant objects often appear less bright and colourful. This is due to the scattering of light as it travels from that distant object. Having more of the atmosphere to travel through means that light will be scattered more, so the colours will not seem as bright. Note that the last three cues are not so obvious in the pictures taken by a camera, this is because modern cameras have complex lens that give much larger depth of view than our simple eyes (objects at different distances can be in focus simultaneously). With our eyes, if we focus on a close-up object, other objects at different distances become blurry and less defined. DON’T GO 3D 3D plots are quite popular in particular in business presentations but also among academics. They are also almost always inappropriately used. 
It is rare to see a 3D plot that couldn't be improved by turning it into a regular 2D figure. We will see why 3D plots have problems, why they generally are not needed, and in what limited circumstances 3D plots may be appropriate.

AVOID GRATUITOUS 3D
Many visualization software packages enable you to beautify your plots by turning the plots' graphical elements into three-dimensional objects. Most commonly, we see pie charts turned into disks rotated in space, bar plots turned into columns, and line plots turned into bands. Notably, in none of these cases does the third dimension convey any true and additional information. 3D is used simply to decorate and adorn the plot.
The problem with gratuitous 3D is that the projection of 3D objects into two dimensions for printing or display on a monitor distorts the data. The human visual system tries to correct for this distortion as it maps the 2D projection of a 3D image back into a 3D space. However, this correction can be only partial.

AN EXAMPLE: 3D PIES
As an example, let's take a simple pie chart with two slices, one representing 25% of the data and one 75%, and rotate this pie in space. As we change the angle at which we're looking at the pie, the size of the slices seems to change as well. In particular, the 25% slice, which is located in the front of the pie, looks much bigger than 25% when we look at the pie from a flat angle.
Figure: The same 3D pie chart shown from four different angles. Rotating a pie into the third dimension makes pie slices in the front appear larger than they really are and pie slices in the back appear smaller. Here, in parts (a), (b), and (c), the blue slice corresponding to 25% of the data visually occupies more than 25% of the area representing the pie. Only part (d) is an accurate representation of the data.

3D BARS
Similar problems arise for other types of 3D plot. The figure on the right shows the breakdown of Titanic passengers by class and gender using 3D bars.
The figure shows the numbers of female and male passengers on the Titanic traveling in 1st, 2nd, and 3rd class, shown as a 3D stacked bar plot. The total numbers of passengers in 1st, 2nd, and 3rd class are 322, 279, and 711, respectively. Because of the way the bars are arranged relative to the axes, the bars all look shorter than they actually are. For example, there were 322 passengers in total traveling in 1st class, yet the figure seems to suggest that the number was less than 300! This illusion arises because the columns representing the data are located at a distance from the two back surfaces on which the gray horizontal lines are drawn.

AVOID 3D POSITION SCALES
While visualizations with gratuitous 3D can easily be dismissed as bad, it is less clear what to think of visualizations using three genuine position scales (x, y, and z) to represent data. In this case, the use of the third dimension serves an actual purpose. Nevertheless, the resulting plots are frequently difficult to interpret, and in my opinion they should be avoided.
Consider a 3D scatter plot of fuel efficiency versus displacement and power for 32 cars. Here, we plot displacement along the x axis, power along the y axis, and fuel efficiency along the z axis, and we represent each car with a dot. Even though the 3D visualization is shown from four different perspectives, it is difficult to envision how exactly the points are distributed in space. For example, I think you will agree that part (d) is particularly confusing. It almost seems to show a different dataset, even though nothing has changed other than the angle from which we look at the dots.

A 3D SCATTERPLOT
Figure: Fuel efficiency versus displacement and power for 32 cars (1973–74 models). Each dot represents one car, and the dot color represents the number of cylinders of the car. The four panels (a)–(d) show exactly the same data but use different perspectives.

WHY DOES THIS EFFECT AFFECT VISUALIZATION?
The fundamental problem with such 3D visualizations is that they require two separate, successive data transformations. The first transformation maps the data from the data space into the 3D visualization space. The second one maps the data from the 3D visualization space into the 2D space of the final figure. (This second transformation obviously does not occur for visualizations shown in a true 3D environment, such as when shown as physical sculptures or 3D-printed objects. But here 3D visualizations are shown on 2D displays.)
The second transformation is non-invertible, because each point on the 2D display corresponds to a line of points in the 3D visualization space. Therefore, we cannot uniquely determine where in 3D space any particular data point lies.
Our visual system nevertheless attempts to invert the 3D to 2D transformation. However, this process is unreliable, carries a high risk of errors, and is highly dependent on appropriate cues in the image that convey some sense of three-dimensionality. When we remove these points of reference, the inversion becomes entirely impossible.

LET'S SEE THAT
This can be seen in the figure here, which is identical to the previous one except that all depth cues have been removed. The result is four random arrangements of points that we cannot interpret at all and that aren't even easily relatable to each other. Could you tell which points in part (a) correspond to which points in part (b)?

HOW TO COPE WITH THAT
Instead of applying two separate data transformations, one of which is non-invertible, it is generally better to just apply one appropriate, invertible transformation and map the data directly into 2D space. It is rarely necessary to add a third dimension as a position scale, since variables can also be mapped onto color, size, or shape scales.

BACK TO 3D BARS
You may wonder whether the problem with 3D scatter plots is that the actual data representation, the dots, do not themselves convey any 3D information.
What happens, for example, if we use 3D bars instead? The next figure shows a typical dataset that one might visualize with 3D bars: the mortality rates in 1940 Virginia, stratified by age group and by gender and housing location. We see that indeed the 3D bars help us interpret the plot. It is unlikely that one might mistake a bar in the foreground for one in the background or vice versa. Nevertheless, the problems discussed in the context of the Titanic figure exist here as well. It is difficult to judge exactly how tall the individual bars are, and it is also difficult to make direct comparisons. For example, was the mortality rate of urban females in the 65–69 age group higher or lower than that of urban males in the 60–64 age group?

TRELLIS PLOTS
In general, it is better to use Trellis plots instead of 3D visualizations. Trellis Graphics is a family of techniques for viewing complex, multi-variable data sets. The ideas have been around for a while, but were formalized by researchers at Bell Laboratories during the 1990s. The techniques were given the name Trellis because they usually result in a rectangular array of plots, resembling a garden trellis. A number of statistical software systems provide multi-panel conditioning plots under the name Trellis plots or Crossplots.
The Virginia mortality dataset requires only 4 panels. The figure looks clear and easy to interpret. It is immediately obvious that mortality rates were higher among men than among women, and also that urban males seem to have had higher mortality rates than rural males, unlike what we see for urban and rural females.

COMPARING THEM
Figure: 3D bars and Trellis plot of the Virginia mortality data

APPROPRIATE USE OF 3D VISUALIZATIONS
Visualizations using 3D position scales can sometimes be appropriate, however.
First, the issues described before are of lesser concern if the visualization is interactive and can be rotated by the viewer, or alternatively, if it is shown in a VR or augmented reality environment where it can be inspected from multiple angles. Second, even if the visualization isn’t interactive, showing it slowly rotating, rather than as a static image from one perspective, will allow the viewer to discern where in 3D space different graphical elements reside. The human brain is very good at reconstructing a 3D scene from a series of images taken from different angles, and the slow rotation of the graphic provides exactly these images. Finally, it makes sense to use 3D visualizations when we want to show actual 3D objects and/or data mapped onto them. For example, showing the topographic relief of a mountainous island is a reasonable choice. Similarly, if we want to visualize the evolutionary sequence conservation of a protein mapped onto its structure, it makes sense to show the structure as a 3D object. In either case, however, these visualizations would still be easier to interpret if they were shown as rotating animations. While this is not possible in traditional print publications, it can be done easily when posting figures on the web or when giving presentations. Corsica relief Patterns of evolutionary variation in a protein. The colored tube represents the backbone of the protein Exonuclease III from the bacterium Escherichia coli. The coloring indicates the evolutionary conservation of the individual sites in this protein, with dark coloring indicating conserved amino acids and light coloring indicating variable amino acids. DATA VISUALIZATION & REPORTING Correlograms and Heatmaps Prof.Antonio Irpino REPRESENTING A MULTIVARIATE DATASET We observed that 2D scatterplots are preferred to 3D ones when the plot is static. But, we know that by exploring a dataset we are interested in showing association/similarity patterns between variables and/or objects. 
How can we visually explore a dataset with $D > 3$ dimensions (namely, a dataset described by more than 3 numerical variables)?
A question: what does it mean that a set of $D$ variables (or objects) are similar? Some variables are similar if they are associated, correlated or fully dependent.
What are your opinions about the following statements?
- If $D$ objects are similar (or $D$ variables are perfectly associated), then each pair of objects is similar.
- If each pair of objects (in a set of $D$ objects) is similar, then all the $D$ objects are similar.
A measure of similarity/dissimilarity is required! For a value of similarity tending to its maximum value (or of dissimilarity tending to zero), objects tend to be equal.
What emerges is that we can get an idea of the relationships between items (objects or variables) by considering pairwise comparisons.

SCATTERPLOT MATRIX
A scatter plot matrix is a table of scatter plots. Each plot is small so that many plots can fit on a page. When you need to look at several plots, such as at the beginning of a multiple regression analysis, a scatter plot matrix is a very useful tool.
- It is a table containing $D \times D$ mini-scatterplots.
- It is a symmetric table, where plots are duplicated (above and below the main diagonal).
- Each mini-scatterplot can be enriched to let interesting patterns emerge.
Figure: Scatterplot matrix using Iris data and the ggpairs function in the GGally R package

ENRICHING SCATTERPLOT TABLES
Since plots are repeated, we can recover that space and enrich the plot. Actually, we have $D \times D = D^2$ slots in the table. However, the possible pairs (combinations of $D$ elements in pairs) are
$$\binom{D}{2} = \frac{D \times (D - 1)}{2} = \frac{D^2 - D}{2}.$$
Thus, in the grid we can use the $D$ slots on the main diagonal and the upper (or lower) triangular matrix (of $\frac{D^2 - D}{2}$ cells) for enriching the plot.
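The count of free slots can be checked with a few lines of code. The sketch below enumerates the distinct pairs with `itertools.combinations` from the Python standard library and compares them with the $D^2$ slots of the grid; the function name `scatterplot_matrix_layout` is hypothetical, and the four variable names are those of the numerical columns of the Iris data as they appear in R.

```python
from itertools import combinations


def scatterplot_matrix_layout(variables):
    """Summarize how the D x D grid of a scatterplot matrix is used."""
    d = len(variables)
    pairs = list(combinations(variables, 2))  # each unordered pair once
    return {
        "slots": d * d,                # all cells of the grid
        "diagonal": d,                 # one cell per variable
        "distinct_pairs": len(pairs),  # equals D(D - 1)/2
        "pairs": pairs,
    }


iris_vars = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]
layout = scatterplot_matrix_layout(iris_vars)
print(layout["slots"], layout["diagonal"], layout["distinct_pairs"])
```

With four variables the scatterplots only need one triangle of the grid, so the diagonal and the other triangle remain available for densities, correlation values or other summaries.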
Figure: Scatterplot matrix using Iris data and the ggpairs function in the GGally R package

ENRICHING 2
Figure: Scatterplot matrix
Figure: Scatterplot

WHAT IF CATEGORICAL VARIABLES ARE PRESENT
Figure: Scatterplot matrix: categories are used as grouping factors
Figure: Scatterplot matrix: we can see some details mapping colour to categories

LET'S SEE ANOTHER DATASET
Swiss Fertility and Socioeconomic Indicators (1888) Data
Description: standardized fertility measure and socio-economic indicators for each of 47 French-speaking provinces of Switzerland at about 1888.
A data frame with 47 observations on 6 variables, each of which is in percent, i.e., in [0, 100].
[,1] Fertility: Ig, 'common standardized fertility measure'
[,2] Agriculture: % of males involved in agriculture as occupation
[,3] Examination: % draftees receiving highest mark on army examination
[,4] Education: % education beyond primary school for draftees
[,5] Catholic: % 'catholic' (as opposed to 'protestant')
[,6] Infant.Mortality: live births who live less than 1 year
All variables but 'Fertility' give proportions of the population.

THE SCATTERPLOT MATRIX

COMPARING NUMERICAL VARIABLES
When comparing several numerical variables, we could be interested in "How many pairs of variables are similar?" What does similarity between variables mean?
- Different populations, different samples or different groups in the population/sample: two variables are similar if they have the same distribution, moments, characteristic function, etc. To observe similarity, we can center, rescale or transform the variables to make them comparable.
- Same population (or same sample): the more two variables are correlated, the more they are similar.

WHAT IS CORRELATION?
Correlation is a measure of inter-dependence between two variables.
Linear correlation ($r$, the Bravais-Pearson correlation index)
- pros: easy to understand thanks to linearity (the two variables are more or less proportional)
- cons: measures only linear (inter)dependence

Concordance or rank correlation ($\rho$, the Spearman correlation: two variables are related according to the order of the observed values)
- pros: nonlinear monotonic relationships are also taken into account; more robust in the presence of outliers
- cons: less intuitive with respect to the concept of "proportionality"

OTHER TYPES OF CORRELATION

Note that $r$ and $\rho$ both tell the strength and the direction of the association. Other, more advanced measures of correlation that try to capture more than linear (inter)dependence have been developed. Since they require deeper statistical knowledge, they are not presented here. See the file "Types of correlation.pdf" in the shared folder «PAPERS TO STUDY».
- Distance correlation
- Maximal correlation
- Other…

Remember: CORRELATION is not CAUSATION!!!

BRAVAIS VS SPEARMAN

$r$ (Bravais-Pearson):

$$r(X, Y) = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$$

where

$$\sigma_{XY} = N^{-1} \sum_{i=1}^{N} (x_i - \mu_X)(y_i - \mu_Y) \quad\text{or}\quad \sigma_{XY} = N^{-1} \sum_{i=1}^{N} x_i y_i - \mu_X \mu_Y$$

is the covariance between $X$ and $Y$, and $\sigma_X$ and $\sigma_Y$ are the standard deviations of $X$ and $Y$, respectively.

$\rho$ (Spearman):

$$\rho(X, Y) = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N (N^2 - 1)}$$

where $d_i$ is the difference between the rankings of the $i$-th individual with respect to the $X$ and $Y$ variables.

How it is obtained:
1. Replace the $x_i$ and $y_i$ data with their rankings with respect to the $X$ and the $Y$ variables. After this step, instead of the actual values you observe two vectors of rankings where, generally, $\min(X) = 1$ (resp. $\min(Y) = 1$) and $\max(X) = N$ (resp. $\max(Y) = N$). Remember to adjust for ties.
2. The $\rho(X, Y)$ formula is obtained by applying $r(X, Y)$ to the two vectors of rankings.
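The two definitions can be verified numerically. The following is a minimal base-R sketch; the choice of the mtcars variables mpg and wt and the small vectors u and v are illustrative assumptions, not part of the slides.

```r
# Sketch on the built-in mtcars data (illustrative choice of variables).
x <- mtcars$mpg
y <- mtcars$wt

# Pearson r from its definition: covariance over the product of the
# standard deviations. The 1/N (population) versions are used; the 1/N
# factors cancel in the ratio, so the result matches cor().
s_xy <- mean(x * y) - mean(x) * mean(y)
s_x  <- sqrt(mean(x^2) - mean(x)^2)
s_y  <- sqrt(mean(y^2) - mean(y)^2)
r_manual <- s_xy / (s_x * s_y)

# Spearman rho. Step 1: replace values by ranks (rank() gives average
# ranks to ties). Step 2: apply Pearson r to the two rank vectors.
rho_manual <- cor(rank(x), rank(y))

# Shortcut formula based on d_i; it is exact only when there are no ties,
# so it is shown on two small made-up vectors without ties.
u <- c(3.1, 0.2, 5.4, 1.7, 4.0)
v <- c(9.0, 1.5, 30.2, 2.2, 7.1)
d <- rank(u) - rank(v)
n <- length(u)
rho_shortcut <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))
```

The values can be compared with cor(x, y), cor(x, y, method = "spearman") and cor(u, v, method = "spearman"). With tied data, the rank-then-Pearson route is the safer one, which is why the shortcut is only applied to tie-free vectors here.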
Range and interpretation of the two indices:
- $-1 \le r(X, Y) \le +1$ and $-1 \le \rho(X, Y) \le +1$.
- $r(X, Y) = 0$: no linear relationship (the variables are linearly uncorrelated). $\rho(X, Y) = 0$: no agreement between rankings.
- $r(X, Y) = \pm 1$: perfect collinearity, or maximum direct/inverse proportionality (linear relationship). $\rho(X, Y) = \pm 1$: perfect positive/negative monotonic relationship between the variables.

LINEAR VS RANK CORRELATION

[Figures: a monotonic relationship; outliers in data. By Skbkekas - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=8778570]

CORRELOGRAMS (A PLOT FOR EXPLORING CORRELATION)

Visualizations of correlation coefficients are called correlograms. Correlograms are statistical charts based on the same concept as a grid of charts.

To illustrate the use of a correlogram, we will consider the car dataset (mtcars). We can display all correlations at once as a matrix of coloured tiles, where each tile represents one correlation coefficient. This correlogram allows us to quickly grasp trends in the data, such as that mpg is negatively correlated with cyl, disp, and hp, and that disp and wt have a strong positive correlation.

[Figure: correlogram drawn with the corrplot function of the corrplot library]

OTHER TYPES USING TILES OF DIFFERENT COLORS AND SHAPES

[Figures: correlogram using colors; correlogram using circles and colors]

ALSO IN THIS CASE MORE THAN HALF OF THE TABLE IS UNNECESSARY

[Figures: two correlograms using circles and colors]

MORE

[Figures: correlogram using two colors and the corrgram package; correlogram using ellipses]

Why ellipses?

WHY ELLIPSES? BECAUSE OF THE BIVARIATE NORMAL DISTRIBUTION ASSUMPTION!

The level curves (of the density) of a bivariate dataset that is a realization of a bivariate normal distribution are elliptical, and their eccentricity is related to the correlation coefficient.

A READING PROBLEM IN CORRELOGRAMS

A correlogram that uses the original order of the variables may be unclear. It is useful to sort columns and rows (variables) so that the plot is more readable.
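A minimal base-R sketch of the sorting idea, assuming the built-in mtcars data. Using 1 - |r| as the dissimilarity between variables and average-linkage clustering are choices made for this illustration; other orderings are possible.

```r
# Correlation matrix of the built-in mtcars data.
R <- cor(mtcars)

# Turn correlation into a dissimilarity: strongly correlated variables
# (of either sign) are close. This choice is an assumption of the sketch.
dis <- as.dist(1 - abs(R))
ord <- hclust(dis, method = "average")$order
R_sorted <- R[ord, ord]

# Draw the unsorted and sorted matrices as coloured tiles (base graphics).
pal <- colorRampPalette(c("blue", "white", "red"))(21)
op <- par(mfrow = c(1, 2))
# Columns are reversed so that the diagonal runs from top-left to bottom-right.
image(R[, ncol(R):1], zlim = c(-1, 1), col = pal, axes = FALSE,
      main = "Unsorted")
image(R_sorted[, ncol(R):1], zlim = c(-1, 1), col = pal, axes = FALSE,
      main = "Sorted by clustering")
par(op)
```

If the corrplot package is installed, a similar reordering should be obtainable through its order = "hclust" argument; the base-graphics version is used here only to keep the sketch free of dependencies.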
UNSORTED VS SORTED: AN EXAMPLE

[Figures: unsorted correlogram; correlogram sorted using clustering of variables]

BE CAREFUL WHEN USING CORRELOGRAMS

All correlograms have one important drawback: they are fairly abstract. While they show us important patterns in the data, they also hide the underlying data points and may cause us to draw incorrect conclusions. It is always better to visualize the raw data rather than abstract, derived quantities that have been calculated from it.

[Figure: scatterplots and correlogram with ggpairs]

But the origin of correlograms is related to the possibility of representing the cells of a table by colours instead of numbers. This family of plots is called heatmaps.

HEATMAPS (USING COLOUR INSTEAD OF VALUES)

Heatmaps are useful for representing tables of values. Values must be comparable row-wise and column-wise! This means that the table must contain variables measured in the same unit and over a common range.

Each cell takes a colour according to the value it contains and to a colour scale. The resulting plot is a matrix of coloured cells.

For example, heatmaps are widely used by biologists for representing microarray data, where each cell contains the level of expression of a gene.

Now you can understand why a correlogram can be considered a heatmap.

Heatmaps are sometimes confused with density maps (indeed, they have some features in common!).

[Figure: a microarray]

SOME USES OF HEATMAPS

Browsing the web and searching for the term "heatmap", different types of heatmaps appear:
- zones of presence of players on a game field
- zones most viewed by a user on a web page

This is because heatmaps are representations of a gridded space using a colour scale ranging from cool to warm colours, so a wide range of plots are considered heatmaps. A 2D density plot can also be regarded as a heatmap. In the two figures, the heatmap is constrained by the "background" structure (the game field or the page structure).

[Figures: some heatmaps]

HEATMAPS USE WITH DATA TABLES

Heatmaps are useful for data tables having the following characteristics:
- Individuals vs. variables: if the variables are homogeneous, namely, they have the same nature, the same unit, and the same range of variation. Examples are microarrays and multiple time series (each row is an individual, and each column a time stamp). Homogenization is also possible by scaling each column (dividing the cell values by the range of the variable) or normalizing it (subtracting the mean of the variable from each cell value and dividing by its standard deviation).
- Individuals vs. individuals: distance/dissimilarity matrices, or incidence matrices (binary matrices, namely matrices containing 0's and 1's, generally used for representing relational data or networks; we will see this kind of data in the future).
- Variables vs. variables (via correlograms).

[Figure: heatmap of the car dataset mtcars, after scaling each variable, using the heatmap function of R. Not so useful!]

HOW TO IMPROVE THE HEATMAP (GROUPING AND SORTING)

[Figure: UNSORTED heatmap of the car dataset mtcars]

[Figure: heatmap of the car dataset mtcars, after grouping rows and columns via a clustering algorithm. Some patterns are clearer now!]
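The homogenization and sorting steps described above can be sketched in base R as follows, assuming the built-in mtcars data; the object names Z, S, and rng are illustrative.

```r
X <- as.matrix(mtcars)

# Normalizing: subtract each column's mean and divide by its standard deviation.
Z <- scale(X)

# Scaling by range, as described above: divide each column by (max - min).
rng <- apply(X, 2, function(col) diff(range(col)))
S <- sweep(X, 2, rng, "/")

# Unsorted heatmap: Rowv = NA and Colv = NA suppress the dendrograms and
# keep rows and columns in their original order.
heatmap(Z, Rowv = NA, Colv = NA, scale = "none", main = "Unsorted")

# Sorted heatmap: by default heatmap() clusters rows and columns
# (hclust on Euclidean distances) and reorders them accordingly.
heatmap(Z, scale = "none", main = "Rows and columns clustered")
```

Passing scale = "none" avoids a second rescaling inside heatmap(), since the columns were already normalized by hand; the same calls with S in place of Z show the range-scaled variant.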
SOME MORE COMPLICATED EXAMPLES ON THE WEB

- Microarray: https://warwick.ac.uk/fac/sci/moac/people/students/peter_cock/r/heatmap
- Some more sophisticated examples (with code): https://www.datanovia.com/en/lessons/heatmap-in-r-static-and-interactive-visualization/

[Figure: complex heatmaps]

DATA VISUALIZATION & REPORTING
Plots for very large data tables, dimensionality reduction and biplots, parallel coordinates
Prof. Antonio Irpino

SHOWING MULTIDIMENSIONAL DATA IS NOT AN EASY TASK!!!

About correlograms we said: it is always better to visualize the raw data rather than abstract, derived quantities that have been calculated from it.

Fortunately, we can frequently find a middle ground between showing important patterns and showing the raw data by applying techniques of dimension reduction.

Dimension reduction

Dimension reduction relies on the key insight that most high-dimensional datasets consist of multiple correlated variables that convey overlapping information. Such datasets can be reduced to a smaller number of key dimensions without much loss of critical information.

As a simple, intuitive example, consider a dataset of multiple physical traits of people, including quantities such as each person’s height and weight, the lengths of the arms and legs, the circumferences of waist, hip, and chest, etc. We can understand immediately that all these quantities will relate first and foremost to the overall size of each person. All else being equal, a larger person will be taller, weigh more, have longer arms and legs, and have larger waist, hip, and chest circumferences. The next important dimension is going to be the person’s sex. Male and female measurements are substantially different for persons of comparable size. For example, a woman will tend to have a larger hip circumference than a man, all else being equal.

DIMENSION REDUCTION TECHNIQUES

There are many techniques for dimension reduction, depending also on the nature of the variables.
When the variables are all numeric, the most widely used technique is principal component analysis (PCA).

PCA allows us to summarize and visualize the information in a dataset containing individuals/observations described by multiple inter-correlated quantitative variables. Each variable can be considered a different dimension. If you have more than 3 variables in your dataset, it can be very difficult to visualize the resulting multi-dimensional space.

PCA is used to extract the important information from a multivariate data table and to express it as a small set of new variables called principal components. These new variables are linear combinations of the original ones. The number of principal components is less than or equal to the number of original variables.

The information in a given dataset corresponds to the total variation it contains. The goal of PCA is to identify the directions (principal components) along which the variation in the data is maximal.

In other words, PCA reduces the dimensionality of multivariate data to two or three principal components that can be visualized graphically, with minimal loss of information.

A VERY INTUITIVE EXPLANATION OF PCA

(a) The original data: the head-length and skull-size measurements from a birds dataset.
(b) As the first step in PCA, we standardize the original data values to zero mean and unit variance. Then we define new variables (the principal components, PCs) along the directions of maximum variation in the data.
(c) Finally, the data are projected onto the new coordinates.

SOME MINIMAL CONCEPTS FOR UNDERSTANDING PCA PHILOSOPHY

A dataset is described by a data matrix $X$ having $N$ rows (the individuals) and $P$ columns (the numeric variables). The generic row $i$ is a vector of $P$ numbers representing the coordinates of the $i$-th individual in the space $\mathbb{R}^P$ spanned by the $P$ variables.
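Steps (a) to (c) above can be sketched with the base-R function prcomp. The sketch assumes the built-in swiss data (N = 47 individuals, P = 6 numeric variables) rather than the birds dataset of the figure, which is not available here.

```r
# Sketch on the built-in swiss data: rows are individuals, columns variables.
X <- as.matrix(swiss)

# (b) Standardize to zero mean and unit variance, then find the directions
# of maximum variation; prcomp does both when center and scale. are TRUE.
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Share of the total variation captured by each principal component.
var_share <- pca$sdev^2 / sum(pca$sdev^2)

# (c) Project the individuals onto the first two components and plot them.
scores <- pca$x[, 1:2]
plot(scores, xlab = "PC1", ylab = "PC2",
     main = "swiss data on the first two principal components")
text(scores, labels = rownames(X), cex = 0.6, pos = 3)

# Each component is a linear combination of the standardized variables:
# the scores equal the standardized data times the loading matrix.
manual_scores <- scale(X) %*% pca$rotation[, 1:2]
```

Inspecting var_share indicates how much information is lost by keeping only two components, and manual_scores reproduces pca$x[, 1:2], which makes the "linear combination" statement concrete.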
Since the columns are also vectors, the generic $j$-th column is a vector of $N$ numbers representing the coordinates of the $j$-th variable in the space $\mathbb{R}^N$ spanned by the $N$ individuals (yes, this is less intuitive, but in matrix algebra row vectors have the same dignity as column vectors!).

Since, in general, P

> music = c(210, 194, 170, 110, 190, 406, 730, 290)
> dim(music) = c(2, 2, 2)
> dimnames(music) = list(Age = c("Old", "Young"), Education = c("High", "Low"), Listen = c("Yes", "No"))

Data Inspection

> music
, , Listen = Yes

       Education
Age     High Low
  Old    210 170
  Young  194 110

, , Listen = No

       Education
Age     High Low
  Old    190 730
  Young  406 290

Producing A Mosaic Plot

The R function which produces mosaic plots is called mosaicplot. The simplest way to produce a mosaic plot is:

> mosaicplot(music)

It is also easy to colour the plot and to add a title.

> mosaicplot(music, col = hcl(240), main = "Classical Music Listening")

[Figure: mosaic plot "Classical Music Listening"; Age (Old, Young) against Education (High, Low), split by Listen (Yes, No)]

Example: Survival on the Titanic

On Sunday, April 14th, 1912 at 11:40pm, the RMS Titanic struck an iceberg in the North Atlantic. Within two hours the ship had sunk. At best reckoning 705 survived the sinking, 1,523 did not.

The Data

There is very good documentation on who survived and who did not survive the sinking of the Titanic. R has a data set called “Titanic” which gives data on the passengers and crew of the Titanic, cross-classified by:
– Class: 1st, 2nd, 3rd, Crew.
– Sex: Male, Female.
– Age: Child, Adult.
– Survived: No, Yes.
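The cross-classification above maps directly onto R's built-in Titanic table, so a mosaic plot can be sketched as follows. The choice of which variables to display and the colours are illustrative assumptions; note also that the totals in R's table need not match the figures quoted above exactly.

```r
# Titanic is a 4-way contingency table shipped with R:
# Class x Sex x Age x Survived.
dims  <- dim(Titanic)
total <- sum(Titanic)

# Collapse over Sex and Age to obtain the Class by Survived counts.
class_surv <- margin.table(Titanic, c(1, 4))

# Row-wise proportions: share of survivors within each class.
surv_rate <- prop.table(class_surv, 1)[, "Yes"]

# Mosaic plot of the two-way table; tile areas are proportional to counts.
mosaicplot(class_surv, color = c("grey70", "steelblue"),
           main = "Survival on the Titanic by class")

# The formula interface selects variables directly from the full table.
mosaicplot(~ Class + Sex + Survived, data = Titanic, color = TRUE,
           main = "Survival on the Titanic by class and sex")
```

Starting from a two-way margin keeps the first plot easy to read; adding Sex in the second call shows how the mosaic splits each class tile further.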
