Data Visualization Outline PDF
Document Details
Uploaded by DaringAcropolis7487
Wilfrid Laurier University
Summary
This document is an outline on data visualization, covering topics like coordinate systems, color scales, and visualizations based on different data types, including amount, distribution, proportions, associations, and uncertainty. The document also discusses different data visualization methods, techniques and various examples.
Full Transcript
Data Visualization Outline
Visualizing data: Mapping data onto aesthetics
Coordinate systems and axes
Color scales
Visualization for different types of data
  Amount
  Distribution
  Proportions
  Associations
  Uncertainty

Visualizing data: Mapping data onto aesthetics
Commonly used aesthetics in data visualization:
  Position
  Shape
  Size
  Color
  Line width
  Line type
Which one(s) can only represent discrete data?

Scales map data values onto aesthetics
The mapping between data values and aesthetic values is created via scales.
A scale must be one-to-one: each data value corresponds to exactly one aesthetic value, and vice versa.

Data visualization is part art and part science.
Examples of ugly, bad, and wrong figures

Coordinate systems and axes
Position scales determine where in a graphic different data values are located.
2-D visualizations: two numbers are required to uniquely specify a point, so we need two position scales.
3-D visualizations: three position scales.

Cartesian coordinates
The most widely used coordinate system.
The axes run orthogonally to each other.
Data values are placed at even spacing along both axes.
Example: two axes representing two different units.
Example: same unit and change in unit.
Note: use equal grid spacing when both axes share the same unit.
Cartesian coordinate systems are invariant under linear transformations.

Nonlinear axes
What if we want to visualize highly skewed data?
On a nonlinear axis, even spacing in data units corresponds to uneven spacing in the visualization.
Example: log scale.

Curved axes
Polar coordinates: pole, radius, polar angle.
Geospatial data.
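
To make the log-scale example from the coordinate-systems material concrete, here is a minimal sketch of a position scale that places data values on a log-scaled axis. The function name `log_position` and the axis range 1 to 1000 are illustrative assumptions, not part of the slides.

```python
import math

def log_position(value, vmin, vmax, length=1.0):
    """Map a positive data value to a position on a log10-scaled axis.

    vmin and vmax are the data values at the two ends of the axis
    (hypothetical parameters chosen for this sketch).
    """
    if value <= 0 or vmin <= 0 or vmax <= vmin:
        raise ValueError("log scale needs positive values and vmin < vmax")
    span = math.log10(vmax) - math.log10(vmin)
    return length * (math.log10(value) - math.log10(vmin)) / span

# Each step multiplies the data value by 10, yet the steps
# are evenly spaced on the axis.
values = [1, 10, 100, 1000]
positions = [log_position(v, 1, 1000) for v in values]
```

Equal ratios in the data become equal distances on the axis, which is why a log scale spreads out highly skewed data.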

Color Scales
There are three fundamental use cases for color in data visualizations:
(i) distinguish groups of data from each other (discrete)
(ii) represent numerical data values (continuous)
(iii) highlight

Color as a tool to distinguish
We frequently use color to distinguish discrete items or groups that do not have an intrinsic order, such as different countries on a map or different manufacturers of a certain product.
In this case, we use a qualitative color scale. Such a scale contains a finite set of specific colors that are chosen to look clearly distinct from each other while also being equivalent to each other. The second condition requires that no one color should stand out relative to the others.

Color to represent numerical values
Color can also be used to represent continuous data values, such as income, temperature, or speed.
In this case, we use a sequential color scale. Such a scale contains a sequence of colors that clearly indicate (i) which values are larger or smaller than which other ones and (ii) how distant two specific values are from each other.
Sequential scales can be based on a single hue (e.g., from dark blue to light blue) or on multiple hues (e.g., from dark red to light yellow).

Color as a tool to highlight
There may be specific categories or values in the dataset that carry key information about the story we want to tell, and we can strengthen the story by emphasizing the relevant figure elements to the reader.
This effect can be achieved with accent color scales, which are color scales that contain both a set of subdued colors and a matching set of stronger, darker, and/or more saturated colors.
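
As a rough illustration of a single-hue sequential color scale, the sketch below maps a number onto a color between a light blue and a dark blue. The function name `sequential_color`, the two endpoint colors, and the use of plain linear interpolation in RGB are all assumptions made for this example; real palettes are usually designed in perceptually uniform color spaces.

```python
def sequential_color(value, vmin, vmax, low=(222, 235, 247), high=(8, 48, 107)):
    """Map a number in [vmin, vmax] to an RGB hex string.

    low/high are hypothetical light-blue and dark-blue endpoints.
    Values outside the range are clamped to the nearest endpoint.
    """
    if vmax <= vmin:
        raise ValueError("vmax must be greater than vmin")
    t = (value - vmin) / (vmax - vmin)
    t = min(1.0, max(0.0, t))  # clamp to [0, 1]
    rgb = [round(lo + t * (hi - lo)) for lo, hi in zip(low, high)]
    return "#{:02x}{:02x}{:02x}".format(*rgb)

lightest = sequential_color(0, 0, 100)    # smallest value gets the light end
darkest = sequential_color(100, 0, 100)   # largest value gets the dark end
```

Because the mapping is monotonic, a darker color always means a larger value, which is the ordering property a sequential scale has to provide.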

Visualizing amounts
We have a set of categories (e.g., brands of cars, cities, or sports) and a quantitative value for each category.
The standard visualization in this scenario is the bar chart. Variations include grouped and stacked bars.
Alternatives to the bar plot are the dot plot and the heatmap.

Bar plot/chart
Commonly visualized with vertical bars.
A bar plot/chart presents categorical data with rectangular bars whose heights or lengths are proportional to the values they represent.
One axis of the chart shows the specific categories being compared; the other axis represents a measured value.
The bars can be plotted vertically or horizontally.
Regardless of whether we place bars vertically or horizontally, we need to pay attention to the order in which the bars are arranged.
Whenever there is a natural ordering (i.e., when our categorical variable is an ordered factor), we should retain that ordering in the visualization.

Design: The principle of proportional ink
The principle of proportional ink (Bergstrom and West 2016): the sizes of shaded areas in a visualization need to be proportional to the data values they represent.
The principle applies whenever we use graphical elements such as bars, rectangles, shaded areas of arbitrary shape, or any other elements that have a clear visual extent.
Bars should always start at 0.
Visualizing negative values.

Grouped bars
When we are interested in two categorical variables at the same time, we can visualize the dataset with a grouped bar plot.
We first draw a group of bars at each position along the x axis, determined by one categorical variable.
Then we draw bars within each group according to the other categorical variable.

Stacked bars
Instead of drawing groups of bars side by side, it is sometimes preferable to stack bars on top of each other.
Stacking is useful when the sum of the amounts represented by the individual stacked bars is in itself a meaningful amount.
Stacked bar charts are designed to help you simultaneously compare totals and notice sharp changes at the item level that are likely to have the most influence on movements in category totals.

Dot plots and heatmaps
Bars are not the only option for visualizing amounts.
One important limitation of bars is that they need to start at zero, so that the bar length is proportional to the amount shown.
When starting at zero is impractical or would obscure key features of the data, we can instead indicate amounts by placing dots at the appropriate locations along the x or y axis.

Line graphs
Appropriate whenever the data points have a natural order that is reflected in the variable shown along the x axis.
Neighboring points can be connected with a line.

Heatmap
As an alternative to mapping data values onto positions via bars or dots, we can map data values onto colors. Such a figure is called a heatmap.
Heatmaps make it easy to visualize complex data and understand it at a glance.
Figure: Internet adoption over time, for select countries. Color represents the percent of internet users for the respective country and year. Countries were ordered by percent internet users in 2016. Data source: World Bank.
Figure: Internet adoption over time, for select countries. Countries were ordered by the year in which their internet usage first exceeded 20%. Data source: World Bank.
Figure: A click map of user clicks on web vs. mobile app.

A complete visualization should be:
Self-explanatory, with a title and annotations (e.g., axis labels, legend).
Free of redundant annotations.
Figure: Stock price over time for four major tech companies. The stock price for each company has been normalized to equal 100 in June 2012.
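
The normalization mentioned in the stock-price figure caption (each series rescaled to equal 100 at a base date) can be sketched as follows. The function name `normalize_to_base` and the price values are hypothetical and only illustrate the arithmetic.

```python
def normalize_to_base(series, base_key, base_value=100.0):
    """Rescale a {date: price} series so that series[base_key] becomes base_value."""
    base = series[base_key]
    if base == 0:
        raise ValueError("cannot normalize to a base price of zero")
    return {date: base_value * price / base for date, price in series.items()}

# Hypothetical prices for one company; not real market data.
prices = {"2012-06": 40.0, "2013-06": 50.0, "2014-06": 70.0}
indexed = normalize_to_base(prices, "2012-06")
```

After normalization every company starts at the same value, so the lines in one chart show relative growth and not absolute price levels.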

Visualizing distributions
Visualizing distributions: Histograms and density plots
  Visualizing a single distribution
  Visualizing multiple distributions at the same time
Visualizing distributions: Empirical cumulative distribution functions and q-q plots
  Empirical cumulative distribution functions
  Highly skewed distributions
  Quantile–quantile plots

Visualizing distributions: Histograms and density plots
How is a particular variable distributed in a dataset?
E.g., there were approximately 1300 passengers on the Titanic and we have reported ages for 756 of them. We might want to know how many passengers of what ages there were on the Titanic, i.e., how many children, young adults, middle-aged people, seniors, and so on.
We call the relative proportions of different ages among the passengers the age distribution of the passengers.
We can obtain the age distribution among the passengers by grouping all passengers into bins with comparable ages and then counting the number of passengers in each bin.

Histogram
A histogram displays the shape and spread of continuous sample data.
Bin widths: (a) one year; (b) three years; (c) five years; (d) fifteen years.

Density plot
Visualize the underlying probability distribution of the data by drawing an appropriate continuous curve.

Probability density
A random variable x has a probability distribution f(x).
The relationship between the outcomes of a random variable and its probability is referred to as the probability density, or simply the "density."
Note that we have two requirements on f(x): f(x) \ge 0 for all x, and \int_{-\infty}^{\infty} f(x) \, dx = 1.

Density estimation
All we have access to is a sample of observations.
We must assume a probability distribution.

Kernel density estimation
A nonparametric method that uses a dataset to estimate probabilities for new points [McLachlan, 1992, Silverman, 1998].
The height of the curve is scaled such that the area under the curve equals one.
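
To illustrate kernel density estimation and the unit-area scaling, here is a minimal sketch using a Gaussian kernel. The function name `gaussian_kde`, the sample of ages, and the integration grid are assumptions made for this example; it is not the procedure used to produce the figures in the slides.

```python
import math

def gaussian_kde(sample, bandwidth):
    """Return a function that estimates the density at x with a Gaussian kernel."""
    if bandwidth <= 0 or not sample:
        raise ValueError("need a non-empty sample and a positive bandwidth")
    n = len(sample)
    # Normalizing constant so that each kernel integrates to 1/n.
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density

# Hypothetical ages, for illustration only.
ages = [2, 19, 22, 24, 28, 30, 35, 41, 54, 63]
density = gaussian_kde(ages, bandwidth=2)

# Numerical check: a Riemann sum over a wide grid should be close to 1.
step = 0.1
grid = [-20 + step * i for i in range(1201)]  # covers -20 to 100
area = sum(density(x) for x in grid) * step
```

A smaller bandwidth gives a spikier curve and a larger one a smoother curve, but the area under the curve stays at one in both cases.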
The density estimate in the figure was performed with a Gaussian kernel and a bandwidth of 2.
Kernel and bandwidth choices: (a) Gaussian kernel, bandwidth = 0.5; (b) Gaussian kernel, bandwidth = 2; (c) Gaussian kernel, bandwidth = 5; (d) Rectangular kernel, bandwidth = 2.
Be careful with the tails.

Visualizing multiple distributions at the same time
Stacked histogram
An age pyramid
Density plot with multiple distributions

Histograms and density plots
Both are highly intuitive and visually appealing.
Both share the limitation that the resulting figure depends to a substantial degree on parameters the user has to choose, such as the bin width for histograms and the bandwidth for density plots.
Both have to be considered an interpretation of the data and not a direct visualization of the data itself.

Visualizing distributions: Empirical cumulative distribution functions and q-q plots
Aggregate methods that highlight properties of the distribution, not the individual data points.
Require no arbitrary parameter choices.
Show all of the data at once.
A little less intuitive.

Empirical cumulative distribution function (ECDF)
An ECDF is an estimator of the cumulative distribution function.
If you have a set of samples (X_1 < X_2 < … < X_n) from an observed random variable, then the ECDF is
\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{X_i \le x\},
that is, the fraction of samples less than or equal to x.
Assume our hypothetical class has 50 students, and the students just completed an exam on which they could score between 0 and 100 points.
We can rank all students by the number of points they obtained, in ascending order (so the student with the fewest points receives the lowest rank and the student with the most points the highest).
Then plot the rank versus the actual points obtained.
ECDF (not normalized): the y axis shows the rank.
ECDF (normalized): the y axis shows the rank divided by the number of students.

Quantile–quantile plots
Quantile–quantile (q-q) plots are a useful visualization when we want to determine to what extent the observed data points do or do not follow a given distribution.
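
A q-q plot pairs each observed value with the value a reference distribution would predict at the same rank. The sketch below computes those pairs for a normal reference distribution, using the mean of 10 and standard deviation of 3 from this section's example. The function name `normal_qq_points` and the plotting positions (i - 0.5) / n are assumptions for this sketch; other conventions for plotting positions exist.

```python
from statistics import NormalDist

def normal_qq_points(sample, mean, sd):
    """Pair each sorted observation with the normal quantile expected at its rank."""
    ref = NormalDist(mu=mean, sigma=sd)
    n = len(sample)
    ordered = sorted(sample)
    # (i - 0.5) / n keeps every probability strictly between 0 and 1.
    return [
        (ref.inv_cdf((i - 0.5) / n), x)
        for i, x in enumerate(ordered, start=1)
    ]

ref = NormalDist(mu=10, sigma=3)
q50 = ref.inv_cdf(0.50)  # expected position of the 50th percentile (the mean)
q84 = ref.inv_cdf(0.84)  # roughly one standard deviation above the mean

# Hypothetical observations; each tuple is (theoretical, observed).
points = normal_qq_points([7, 13, 10], mean=10, sd=3)
```

If the observations follow the reference distribution, the (theoretical, observed) pairs fall close to the line y = x.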
Example: Assume the data values have a mean of 10 and a standard deviation of 3.
Assuming a normal distribution, we would expect:
a data point ranked at the 50th percentile to lie at position 10 (the mean);
a data point at the 84th percentile to lie at position 13 (one standard deviation above the mean);
…

Visualizing multiple distributions at the same time
Boxplot
Violin plot
Strip chart
Sina plot

Boxplot
Example: visualize how temperature varies across different months while also showing the distribution of observed temperatures within each month.

Violin plots
Equivalent to the density estimates, rotated by 90 degrees and then mirrored.

Strip chart
Plot all the individual data points of the response variable directly.
Note: don't plot too many points on top of each other.
Jittering: a simple solution to overplotting is to spread out the points somewhat along the x axis, by adding some random noise in the x dimension.

Sina plot
A hybrid between a violin plot and jittered points.

Visualizing proportions
Pie chart
Stacked bar chart
Side-by-side bars

Visualizing nested proportions
Multiple grouping variables: break down a dataset by multiple categorical variables at once.
Example: There are 106 bridges in Pittsburgh. This dataset contains various pieces of information about the bridges, such as the material from which they are constructed (steel, iron, or wood) and the era in which they were erected (crafts, emerging, modern).

Mosaic plot
Whenever we have categories that overlap, it is best to show clearly how they relate to each other.
We begin by placing one categorical variable along the x axis and subdivide the x axis by the relative proportions of its categories. We then place the other categorical variable along the y axis and, within each x category, subdivide the y axis by the relative proportions that make up the categories of the y variable.

Treemap
A series of nested rectangles of sizes proportional to the corresponding data value.
A large rectangle represents a branch of a data tree, and it is subdivided into smaller rectangles that represent the size of each node within that branch.
We recursively nest rectangles inside each other.

Parallel sets
For proportions described by more than two categorical variables.
We show how the total dataset breaks down by each individual categorical variable.
Then we draw shaded bands that show how the subgroups relate to each other.

Visualizing associations
How do the numerical variables relate to each other?
To plot the relationship of just two such variables, e.g., height and weight, we will normally use a scatter plot.
To show more than two variables at once, we may plot a scatter plot matrix, or a correlogram, …

Scatter plot
Each instance of the dataset gets plotted as a point whose (x, y) coordinates relate to its values for the two variables.
Patterns or relationships in scatter plots indicate correlation between the variables.
Figure: Information on 123 blue jay birds: head length against body mass. Each dot corresponds to one bird.
In a scatter plot, when the y variable tends to increase as the x variable increases, we say there is a positive correlation between the variables.

Scatter plot matrix

Correlograms
Visualizations of correlation coefficients are called correlograms.
An abstract way to visualize associations.
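
A correlogram displays one correlation coefficient per pair of variables. As a minimal sketch, the code below computes Pearson correlation coefficients for every pair of columns. The function names `pearson_r` and `correlation_matrix` and the measurements are hypothetical, and Pearson's r is assumed here as the coefficient; the slides do not say which coefficient is used.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Return {name: {name: r}} for every pair of named columns."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical measurements, for illustration only.
birds = {
    "head_length": [52.0, 54.5, 55.1, 56.3, 58.0],
    "body_mass": [65.0, 69.0, 70.5, 73.0, 78.0],
    "skull_size": [30.2, 31.0, 30.8, 31.5, 32.1],
}
matrix = correlation_matrix(birds)
```

Each entry of `matrix` lies between -1 and 1; a correlogram would map these values to colors or symbol sizes in a grid.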

Uncertainty

Discrete outcome visualization
Creating a graph that emphasizes both the frequency aspect and the unpredictability of a random trial.

Numeric random variables
Probability distributions
Figure: Hypothetical prediction of an election outcome.

Visualizing the uncertainty of point estimates
How do we get the "best estimate" and the "margin of error"?
The total set of possible values is called the population.
The subset we polled is called the sample.
The number of individual observations in the sample is called the sample size.
Quantities that describe the population are called parameters.
Estimates: use a sample to make a guess about the true parameter values.

Types of error bars for uncertainty
Sample standard deviation
Standard error
Confidence interval

Confidence interval   Z
80%                   1.282
85%                   1.440
90%                   1.645
95%                   1.960
99%                   2.576
99.5%                 2.807
99.9%                 3.291

Whenever you visualize uncertainty with error bars, you must specify what quantity and/or confidence level the error bars represent.
Figure: Relationship between sample, sample mean, standard deviation, standard error, and confidence intervals, in an example of chocolate bar ratings.

Examples of uncertainty with other visualization methods
Figure: Median income versus median age for 67 counties in Pennsylvania. Error bars represent 90% confidence intervals.
Figure: Mean butterfat contents in the milk of four cattle breeds. Error bars indicate +/- one standard error of the mean.
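
The relationship between sample mean, standard deviation, standard error, and confidence interval can be sketched as below, using Z values from the table in this section. The function name `mean_confidence_interval` and the ratings are hypothetical. The interval uses the normal (Z) approximation as an assumption; for small samples a t-based interval would normally be wider.

```python
import math
from statistics import mean, stdev

# Z multipliers taken from the confidence-interval table in the slides.
Z = {0.80: 1.282, 0.90: 1.645, 0.95: 1.960, 0.99: 2.576}

def mean_confidence_interval(sample, level=0.95):
    """Return (mean, standard error, (lower, upper)) using a normal approximation."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    m = mean(sample)
    se = stdev(sample) / math.sqrt(n)  # sample standard deviation / sqrt(n)
    z = Z[level]
    return m, se, (m - z * se, m + z * se)

# Hypothetical ratings, for illustration only.
ratings = [3.0, 3.5, 2.5, 4.0, 3.0, 3.5, 2.75, 3.25]
m, se, (lower, upper) = mean_confidence_interval(ratings, level=0.95)
```

Error bars drawn from `se` alone are narrower than those drawn from the 95% interval, which is why a figure has to state which quantity its error bars show.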

Amounts
Distributions
Proportions
Associations
Uncertainty

Reference for visualization
Data Visualization with Python: Create an impact with meaningful data insights using interactive and engaging visuals. By Mario Dobler and Tim Großmann. (ISBN-13: 978-1789956467)
Interactive Visualization: Insight through Inquiry. By Bill Ferster and Ben Shneiderman. (ISBN-13: 978-0262018159)
Fundamentals of Data Visualization. By Claus O. Wilke. (ISBN-13: 978-1492031086)