Data Visualization Course Notes PDF
Dr. Abdulla Eid
Summary
These are course notes for a data visualization class, covering topics like data collection, scales, coordinate systems and color representation in data visualizations.
Full Transcript
STAT 288: Data Visualization
Course Overview and Key Concepts
Dr. Abdulla Eid

Welcome & Introduction to Data Visualization
Course Syllabus: overview of course content and structure; learning objectives and expectations

Data Science Cycle
Data Collection
Data Cleaning (Simple Data Mining)
Extract Info from Data
Open-Source data
Data Visualization (focus of this course)
Represent a large sample
Story Telling

Data Collection Examples
STUDENTS RESULTS
STOCK MARKETS & FINANCIAL INSTITUTIONS
DEMOGRAPHY
ELECTION DATA (2024 US PRESIDENTIAL ELECTION)
SPORT DATA (OLYMPICS FRANCE 2024, AMERICAN FOOTBALL)

Open Data Sources
Data.gov: science and research to manufacturing and climate
OpenDataNetwork.com: Robust data search engine
Census.gov: U.S. demographic data
Kaggle.com: Community-published datasets
NCES.ed.gov: Education data
HealthData.gov: Health-related databases
DBpedia.org: Databases similar to Wikipedia

What is Data Visualization?
Data visualization is part art and part science. The challenge is to get the art right without getting the science wrong and vice versa. A data visualization first and foremost has to accurately convey the data. It must not mislead or distort. If one number is twice as large as another, but in the visualization they look to be about the same, then the visualization is wrong. At the same time, a data visualization should be aesthetically pleasing. Good visual presentations tend to enhance the message of the visualization. If a figure contains jarring colors, imbalanced visual elements, or other features that distract, then the viewer will find it harder to inspect the figure and interpret it correctly.

Ugly, Bad, and Wrong…

3Es of Displaying Data:
Effective: Convey the message clearly
Efficient: Minimal use of resources
Ethical: Represent data truthfully

Tips for Effective Display
1. Place each visual close to the text that discusses it, and vice versa.
2. Visuals give readers opportunities to pause and consider the ideas in the text.
3. Graphics visually reinforce your argument; readers tend to trust what they can see.
4. Tell a simple story with your data.

Effective Display: Graphic via: NOAA National Weather Service semi-annual report on climate change. http://climate.nasa.gov/causes/

Tips for Efficient Display
1. When using color, aim for careful and minimal usage.
2. Don’t use color where black and white will work better.
3. Color can help establish visual patterns, but don’t overuse color. Readers can typically interpret only two or three colors at a time.

Which of these looks better?

Tips for Ethical Display
1. Be honest and do not exaggerate or inflate results.
2. Cite sources for data or graphics you didn't create.
4. Obtain permission to publish graphics you don't own.
5. Include all relevant data, even if unexplained.
6. Represent quantities accurately.
7. Avoid using tables to hide obvious data points.
8. Do not use color to mislead or misrepresent importance.

Conclusion & Next Steps
Review key points
Upcoming assignments and topics
Resources for further study

STAT 288 Data Visualization
Dr Abdulla Eid
Chapter 2, 3, and 4

CHAPTER 2 Data Visualization: Mapping Data Onto Aesthetics

Data Visualization
Convert data values in a systematic and logical way into the visual elements that make up the final graphic.
All data visualizations (pie, bar, histogram, etc.) map data values into quantifiable features of the resulting graphic. These features are called aesthetics.

Aesthetics and types of data
Aesthetics: All aesthetics fall into one of two groups: continuous and discrete.
Data Types:
quantitative/numerical continuous
quantitative/numerical discrete
qualitative/categorical unordered
qualitative/categorical ordered
date or time
text

Example of data type
Month  Day  Location      Station ID   Temperature
Jan    1    Chicago       USW00014819  25.6
Jan    1    San Diego     USW00093107  55.2
Jan    1    Houston       USW00012918  53.9
Jan    1    Death Valley  USC00042319  51.0
Jan    2    Chicago       USW00014819  25.5
Jan    2    San Diego     USW00093107  55.3
Jan    2    Houston       USW00012918  53.8
Jan    2    Death Valley  USC00042319  51.2
Jan    3    Chicago       USW00014819  25.3
Jan    3    San Diego     USW00093107  55.3
Jan    3    Death Valley  USC00042319  51.3
Jan    3    Houston       USW00012918  53.8
The table shows the first few rows of a dataset providing the daily temperature (average daily temperatures over a 30-year window) for four U.S. locations. This table contains five variables: month, day, location, station ID, and temperature (in degrees Fahrenheit).

2.2 Scales map data values onto aesthetics
To map data values onto aesthetics, we need to specify which data values correspond to which specific aesthetics values. For example, if our graphic has an x axis, then we need to specify which data values fall onto particular positions along this axis. Similarly, we may need to specify which data values are represented by particular shapes or colors. This mapping between data values and aesthetics values is created via scales.

Example: Map Data to Aesthetics
Another mapping
How many scales are we using?

Position Scale
It determines where in a graphic different data values are located. We cannot visualize data without placing different data points at different locations, even if we just arrange them next to each other along a line.
The combination of a set of position scales and their relative geometric arrangement is called a coordinate system.

CHAPTER 3 Coordinate Systems and Axes
3.1 Cartesian coordinates
What can you say about these three visualizations?
3.2 Nonlinear Scales
3.3 Coordinate systems with curved axes
Example of a Curved Coordinate System
Different Systems (Subject-specific)

CHAPTER 4 Color Scales
Color Scale
There are three fundamental use cases for color in data visualizations:
(i) we can use color to distinguish groups of data from each other
(ii) we can use color to represent data values
(iii) we can use color to highlight.
4.1 Color as a tool to distinguish
Qualitative Color Scale
We frequently use color as a means to distinguish discrete items or groups that do not have an intrinsic order, such as different countries on a map or different manufacturers of a certain product. In this case, we use a qualitative color scale. Such a scale contains a finite set of specific colors that are chosen to look clearly distinct from each other while also being equivalent to each other. The second condition requires that no one color should stand out relative to the others. In addition, the colors should not create the impression of an order, as would be the case with a sequence of colors that get successively lighter.

4.2 Color to represent data values
Sequential Color
Color can also be used to represent data values, such as income, temperature, or speed. In this case, we use a sequential color scale. Such a scale contains a sequence of colors that clearly indicate
(i) which values are larger or smaller than which other ones
(ii) how distant two specific values are from each other.

4.3 Color as a tool to highlight
Accent Colors
Color can also be an effective tool to highlight specific elements in the data. There may be specific categories or values in the dataset that carry key information about the story we want to tell. This effect can be achieved with accent color scales, which are color scales that contain both a set of subdued colors and a matching set of stronger, darker, and/or more saturated colors.

STAT 288
Week 3
Chapter 5 and 6
Dr Abdulla Eid

CHAPTER 5 DIRECTORY OF VISUALIZATIONS
5.1 AMOUNTS
5.2 DISTRIBUTIONS
5.3 PROPORTIONS
5.4 X–Y RELATIONSHIPS
5.5 GEOSPATIAL DATA
5.6 UNCERTAINTY
Error bars are meant to indicate the range of likely values for some estimate or measurement.
They extend horizontally and/or vertically from some reference point representing the estimate or measurement. Reference points can be shown in various ways, such as by dots or by bars.
Confidence strips provide a clear visual sense of uncertainty but are difficult to read accurately.
Eyes and half-eyes combine error bars with approaches to visualize distributions (violins and ridgelines, respectively), and thus show both precise ranges for some confidence levels and the overall uncertainty distribution.
A quantile dot plot can serve as an alternative visualization of an uncertainty distribution.

Chapter 6 Visualizing amounts
In many scenarios, we are interested in the magnitude of some set of numbers. For example, we might want to visualize the total sales volume of different brands of cars, or the total number of people living in different cities, or the age of Olympians performing different sports. In all these cases, we have a set of categories (e.g., brands of cars, cities, or sports) and a quantitative value for each category. We refer to these cases as visualizing amounts, because the main emphasis in these visualizations is on the magnitude of the quantitative values. The standard visualization in this scenario is the bar plot, which comes in several variations, including simple bars as well as grouped and stacked bars. Alternatives to the bar plot are the dot plot and the heatmap.

6.1 Bar Plot
To motivate the concept of a bar plot, consider the total ticket sales for the most popular movies on a given weekend. The table on the right shows the top-five weekend gross ticket sales on the Christmas weekend of 2017. The movie “Star Wars: The Last Jedi” was by far the most popular movie on that weekend, outselling the fourth- and fifth-ranked movies “The Greatest Showman” and “Ferdinand” by almost a factor of 10.

BAR PLOT (BAR CHART)
UGLY (WHY?)
BETTER SOLUTION
BAD (WHY?)
ORDER MATTERS (IN SOME SENSE)
BAD (WHY?)
Moral
Pay attention to the bar order. If the bars represent unordered categories, order them by ascending or descending data value.

6.2 Grouped and stacked bars
Example: The U.S. Census Bureau provides median income levels broken down by both age and race. We can visualize this dataset with a grouped bar plot. In a grouped bar plot, we draw a group of bars at each position along the x axis, determined by one categorical variable, and then we draw bars within each group according to the other categorical variable.
HARD TO READ!
A BIT EASIER?
ULTIMATE SOLUTION!
EXERCISE: CAN YOU DESCRIBE IT?

6.3 Dot plots and heatmaps
Bars are not the only option for visualizing amounts. One important limitation of bars is that they need to start at zero, so that the bar length is proportional to the amount shown.
DOT PLOTS
BAD (WHY?)
BAD (WHY?)
HEATMAP (DATA VALUES ONTO COLOR)

STAT 288
Dr Abdulla Eid
Week 4
Chapter 7 and 8

Chapter 7: Visualizing distributions: Histograms and density plots
Goal: understand how a particular variable is distributed in a dataset.
Example to be used: There were approximately 1300 passengers on the Titanic (not counting crew), and we have reported ages for 756 of them. We might want to know how many passengers of what ages there were on the Titanic, i.e., how many children, young adults, middle-aged people, seniors, and so on. We call the relative proportions of different ages among the passengers the age distribution of the passengers.

7.1 Visualizing a single distribution
Age range  Count
0–5        36
6–10       19
11–15      18
16–20      99
21–25      139
26–30      121
31–35      76
36–40      74
41–45      54
46–50      50
51–55      26
56–60      22
61–65      16
66–70      3
71–75      3

Histogram
We can visualize this table by drawing filled rectangles whose heights correspond to the counts and whose widths correspond to the width of the age bins (Figure below). Such a visualization is called a histogram. (Note that all bins must have the same width for the visualization to be a valid histogram.)
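As a rough illustration of the histogram idea, the following sketch draws one text bar per age bin, with bar length proportional to the count. The bin labels and counts are copied from the age table above; the function name `text_histogram` and the `max_width` parameter are made up for this example, and a plotting library would normally be used instead of text bars.

```python
# Age bins and counts copied from the Titanic age table above.
age_bins = ["0-5", "6-10", "11-15", "16-20", "21-25", "26-30", "31-35", "36-40",
            "41-45", "46-50", "51-55", "56-60", "61-65", "66-70", "71-75"]
counts = [36, 19, 18, 99, 139, 121, 76, 74, 54, 50, 26, 22, 16, 3, 3]


def text_histogram(labels, values, max_width=40):
    """Return one text line per bin; bar length is proportional to the count."""
    largest = max(values)
    lines = []
    for label, value in zip(labels, values):
        bar = "#" * round(value / largest * max_width)
        lines.append(f"{label:>5} | {bar} {value}")
    return lines


total = sum(counts)
for line in text_histogram(age_bins, counts):
    print(line)
print("passengers with a reported age:", total)
```

The counts add up to the 756 passengers with reported ages mentioned in the example, which is a quick consistency check on the table.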
Bin width is important

Density Plots
In a density plot, we attempt to visualize the underlying probability distribution of the data by drawing an appropriate continuous curve.
This curve needs to be estimated from the data, and the most commonly used method for this estimation procedure is called kernel density estimation.
In kernel density estimation, we draw a continuous curve (the kernel) with a small width (controlled by a parameter called bandwidth) at the location of each data point, and then we add up all these curves to obtain the final density estimate.

Bandwidth (Appearance)
What is the area under the curve in a density plot?
Take care of the tails!

Warning!
So should you use a histogram or a density plot to visualize a distribution? Heated discussions can be had on this topic. Some people are against density plots and believe that they are arbitrary and misleading. Others realize that histograms can be just as arbitrary and misleading. There is also the possibility of using neither and instead choosing empirical cumulative distribution functions or q-q plots (Chapter 8). Finally, I believe that density estimates have an inherent advantage over histograms as soon as we want to visualize more than one distribution at a time (see next section).

7.2 Visualizing multiple distributions at the same time
In many scenarios we have multiple distributions we would like to visualize simultaneously. For example, let’s say we’d like to see how the ages of Titanic passengers are distributed between men and women. Were men and women passengers generally of the same age, or was there an age difference between the genders? One commonly employed visualization strategy in this case is a stacked histogram, where we draw the histogram bars for women on top of the bars for men, in a different color.
Why?
As Density Plot?
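The kernel density estimation procedure from Section 7.1 (one kernel per data point, summed into a single curve) can be sketched in a few lines. This is a minimal illustration, assuming a Gaussian kernel; the ages below are hypothetical values chosen for the example and are not the Titanic data, and the helper names `gaussian_kernel` and `kde` are made up here.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(data, x, bandwidth):
    """Density estimate at x: the average of one scaled kernel per data point."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / bandwidth) for xi in data) / (n * bandwidth)


# Hypothetical ages, for illustration only (not the Titanic data).
ages = [2, 19, 22, 24, 25, 28, 30, 31, 36, 45, 52, 60]

step = 0.5
grid = [i * step for i in range(-80, 361)]  # evaluation points from -40 to 180
for bw in (2.0, 5.0, 10.0):
    density = [kde(ages, x, bw) for x in grid]
    area = sum(density) * step  # Riemann-sum approximation of the area
    print(f"bandwidth={bw:>4}: area under curve is about {area:.3f}, "
          f"density at age -1 is {kde(ages, -1.0, bw):.5f}")
```

Whatever the bandwidth, the area under the estimated curve stays close to 1, since each kernel integrates to 1. The estimate is also positive at age -1, an impossible value, which is one plausible reading of the caution about the tails.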
Better Solution
Ideal solution for two distributions
Example of more than two distributions

Chapter 8 Visualizing distributions: Empirical cumulative distribution functions and q-q plots
Histograms and density plots are highly intuitive and visually appealing. However, they both share the limitation that the resulting figure depends to a substantial degree on parameters the user has to choose, such as the bin width for histograms and the bandwidth for density plots. As a result, both have to be considered as an interpretation of the data rather than a direct visualization of the data itself.

Ecdfs and q-q plots
We could simply show all the data points individually, as a point cloud. However, this approach becomes unwieldy for very large datasets, and in any case there is value in aggregate methods that highlight properties of the distribution rather than the individual data points. To solve this problem, statisticians have invented empirical cumulative distribution functions (ecdfs) and quantile–quantile (q-q) plots. These types of visualizations require no arbitrary parameter choices, and they show all of the data at once. Unfortunately, they are a little less intuitive than a histogram or a density plot, and as a result they are not widely used. They are quite popular among statisticians, though, and anybody interested in data visualization should be familiar with these techniques.

8.1 Empirical cumulative distribution functions (ecdfs)
Example: A dataset of student grades. Consider the grades of 50 students in an exam (0–100 points). How can we best visualize the class performance, for example to determine appropriate grade boundaries?
We can rank all students by the number of points they obtained, in ascending order (so the student with the fewest points receives the lowest rank and the student with the most points the highest), and then plot the rank versus the actual points obtained.

Empirical cumulative distribution function
Descending Order
Normalized Ranking (y-axis)
For example, approximately a quarter of the students (25%) received less than 75 points. The median point value (corresponding to a cumulative frequency of 0.5) is 81. Approximately 20% of the students received 90 points or more.

Extract Information
Focus on 80. The jump there is caused by three students receiving 80 points on their exam while the next poorer-performing student received only 76. In this scenario, I might decide that everybody with a point score of 80 or more receives a B and everybody with 79 or less receives a C.

8.3 Quantile–quantile (q-q) plots
Quantile–quantile (q-q) plots are a useful visualization when we want to determine to what extent the observed data points do or do not follow a given distribution. Just like ecdfs, q-q plots are also based on ranking the data and visualizing the relationship between ranks and actual values. However, in q-q plots we don’t plot the ranks directly; we use them to predict where a given data point should fall if the data were distributed according to a specified reference distribution. Most commonly, q-q plots are constructed using a normal distribution as the reference.

Methodologies
To give a concrete example, assume the actual data values have a mean of 10 and a standard deviation of 3. Then, assuming a normal distribution, we would expect a data point ranked at the 50th percentile to lie at position 10 (the mean), a data point at the 84th percentile to lie at position 13 (one standard deviation above the mean), and a data point at the 2.3rd percentile to lie at position 4 (two standard deviations below the mean).
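The percentile-to-position arithmetic above can be checked with Python's standard library. The sketch below first reproduces the three reference values for a normal distribution with mean 10 and standard deviation 3, then pairs each observation in a small sample with its theoretical value. The scores are hypothetical, the helper name `qq_points` is made up for this example, and the plotting position (i + 0.5) / n is one common convention assumed here rather than something the notes specify.

```python
from statistics import NormalDist, mean, stdev

# Reference distribution from the example: mean 10, standard deviation 3.
reference = NormalDist(mu=10, sigma=3)
for p in (0.5, 0.84, 0.023):
    print(f"percentile {p:.3f} -> expected position {reference.inv_cdf(p):.2f}")


def qq_points(observed):
    """Pair each sorted observation with its theoretical normal value."""
    ordered = sorted(observed)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    # (i + 0.5) / n is one common plotting-position convention (assumed here).
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


# Hypothetical exam scores, for illustration only.
scores = [52, 61, 68, 72, 75, 76, 80, 80, 80, 84, 88, 91, 95, 100]
for theo, obs in qq_points(scores):
    print(f"theoretical {theo:6.1f}   observed {obs}")
```

Plotting the observed values against the theoretical ones gives the q-q plot; points near the line where the two are equal indicate agreement with the reference distribution.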
We can carry out this calculation for all points in the dataset and then plot the observed values (i.e., values in the dataset) against the theoretical values (i.e., values expected given each data point’s rank and the assumed reference distribution).

Example
Regression Line?
The solid line here is not a regression line but indicates the points where x equals y, i.e., where the observed values equal the theoretical ones. To the extent that points fall onto that line, the data follow the assumed distribution (here, normal). We see that the student grades mostly follow a normal distribution, with a few deviations at the bottom and at the top (a few students performed worse than expected on either end). The deviations from the distribution at the top end are caused by the maximum point value of 100 in the hypothetical exam; regardless of how good the best student is, he or she could at most obtain 100 points.

STAT 288: Data Visualization
Week 5
Dr Abdulla Eid
Chapter 9 and 10

Chapter 9 Visualizing many distributions at once
There are many scenarios in which we want to visualize multiple distributions at the same time. For example, consider weather data. We may want to visualize how temperature varies across different months while also showing the distribution of observed temperatures within each month. This scenario requires showing twelve temperature distributions at once, one for each month. None of the visualizations discussed in Chapters 7 or 8 work well in this case. Instead, viable approaches include boxplots, violin plots, and ridgeline plots.

9.1 Visualizing distributions along the vertical axis
The simplest approach to showing many distributions at once is to show their mean or median as points, with some indication of the variation around the mean or median shown by error bars.
But why is this bad?
First, by representing each distribution by only one point and two error bars, we are losing a lot of information about the data.
Second, it is not immediately obvious what the points represent, even though most readers would likely guess that they represent either the mean or the median.
Third, it is definitely not obvious what the error bars represent. (Do they represent the standard deviation of the data, the standard error of the mean, a 95% confidence interval, or something else altogether?)

Box Plot
Temperature Example

Violin Plots
Violin plots are equivalent to the density estimates discussed in Chapter 7, but rotated by 90 degrees and then mirrored.
Temperature in Violin Plots
Strip chart
Better (more points)
Jittering
Sina plot

9.2 Visualizing distributions along the horizontal axis

Chapter 10 Visualizing proportions
We often want to show how some group, entity, or amount breaks down into individual pieces that each represent a proportion of the whole. Common examples include the proportions of men and women in a group of people, the percentages of people taking certain majors, or the market shares of companies.

Pie Charts
Stacked Bar Charts
Side-by-Side Charts
Pie Charts
Rectangular Pie Charts (Stacked Bars)
Side-by-Side Bars

Table 10.1: Pros and cons of common approaches to visualizing proportions: pie charts, stacked bars, and side-by-side bars.
Columns: Pie chart, Stacked bars, Side-by-side bars
Criteria:
Clearly visualizes the data as proportions of a whole
Allows easy visual comparison of the relative proportions
Visually emphasizes simple fractions, such as 1/2, 1/3, 1/4
Looks visually appealing even for very small datasets
Works well when the whole is broken into many pieces
Works well for the visualization of many sets of proportions or time series of proportions

A case for side-by-side bars
A better way
10.3 A case for stacked bars and stacked densities
Heat Maps

STAT 288
Week 6
Dr Abdulla Eid
Chapter 11 and 12

Chapter 11: Break down a dataset by multiple categorical variables at once
Example: given people’s health status, we could ask how health status further breaks down by marital status.
These are nested proportions.

11.1 Nested proportions gone wrong
Consider a dataset of 106 bridges in Pittsburgh. This dataset contains various pieces of information about the bridges, such as the material from which they are constructed (steel, iron, or wood) and the year when they were erected. Based on the year of erection, bridges are grouped into distinct categories, such as crafts bridges that were erected before 1870 and modern bridges that were erected after 1940. Let’s assume we want to visualize both the fraction of bridges made from steel, iron, or wood and the fraction that are crafts or modern.
Why Wrong?
Why Bad?

11.2 Mosaic plots and treemaps
Mosaic
To draw a mosaic plot, we begin by placing one categorical variable along the x axis (here, era of bridge construction) and subdivide the x axis by the relative proportions that make up the categories. We then place the other categorical variable along the y axis (here, building material) and, within each category along the x axis, subdivide the y axis by the relative proportions that make up the categories of the y variable. The result is a set of rectangles whose areas are proportional to the number of cases representing each possible combination of the two categorical variables.
Treemaps

11.3 Nested pies

Chapter 12 Visualizing associations among two or more quantitative variables
Many datasets contain two or more quantitative variables, and we may be interested in how these variables relate to each other. For example, we may have a dataset of quantitative measurements of different animals, such as the animals’ height, weight, length, and daily energy demands. To plot the relationship of just two such variables, e.g. the height and weight, we will normally use a scatter plot. If we want to show more than two variables at once, we may opt for a bubble chart, a scatter plot matrix, or a correlogram.

12.1 Scatter plots
Example: A dataset of measurements performed on 123 blue jay birds.
The dataset contains information such as the head length, the skull size, and the body mass of each bird. We expect that there are relationships between these variables. For example, birds with longer heads would be expected to have larger skull sizes, and birds with higher body mass should have larger heads and skulls than birds with lower body mass.
All in All

12.2 Correlograms
We will consider a dataset of over 200 glass fragments obtained during forensic work. For each glass fragment, we have measurements about its composition, expressed as the percent in weight of various mineral oxides. There are seven different oxides for which we have measurements, yielding a total of 21 pairwise correlations (7 × 6 / 2 = 21).
Enhancement

12.4 Paired data

STAT 288
Dr Abdulla Eid
Chapter 14 and 26
Week 7

14 Visualizing trends
We are often more interested in the overarching trend of the data than in the specific detail of where each individual data point lies.
By drawing the trend on top of or instead of the actual data points, usually in the form of a straight or curved line, we can create a visualization that helps the reader immediately see key features of the data.
Methods:
Smooth the dataset (e.g., moving average)
Curve fitting

14.1 Smoothing
Example: the Dow Jones, a stock-market index representing the price of 30 companies, during 2009. (What happened then?)
Moving Average (After or at)
Observations from the previous figure:
The 20-day moving average only removes small, short-term spikes but otherwise follows the daily data closely.
The 100-day moving average, on the other hand, removes even fairly substantial drops or spikes that play out over a time span of multiple weeks. For example, the massive drop to below 7000 points in the first quarter of 2009 is not visible in the 100-day moving average, which replaces it with a gentle curve that doesn’t dip much below 8000 points. Similarly, the drop around July 2009 is completely invisible in the 100-day moving average.
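The effect of the window length can be reproduced on a toy series. The sketch below is a minimal trailing moving average, assuming each average is reported at the end of its window (one of several possible placements); the price series is hypothetical and far shorter than a year of daily index values, and the function name `moving_average` is made up for this example.

```python
def moving_average(values, window):
    """Trailing moving average: each output is the mean of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    averages = [running / window]
    for i in range(window, len(values)):
        # Slide the window by one: add the new value, drop the oldest one.
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages


# Hypothetical series: flat at 100 with a short, sharp drop to 60.
prices = [100] * 10 + [60] * 3 + [100] * 10

short = moving_average(prices, 3)
long = moving_average(prices, 10)
print("lowest value, raw data:        ", min(prices))
print("lowest value, 3-point average: ", min(short))
print("lowest value, 10-point average:", min(long))
```

The short window still reaches the bottom of the drop, while the long window only dips slightly, which mirrors how the 100-day average hides drops that the 20-day average keeps.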
Another Moving Average (LOESS)
LOESS (locally estimated scatterplot smoothing) fits low-degree polynomials to subsets of the data.
Importantly, the points in the center of each subset are weighted more heavily than points at the boundaries, and this weighting scheme yields a much smoother result than we get from a weighted average.

14.2 Showing trends with a defined functional form
We fit a function with a defined form, such as a linear or exponential function, to the data and draw it on the graph.
Log scale and Exp

Chapter 26: Don’t go 3D
3D plots are quite popular, in particular in business presentations but also among academics. They are also almost always inappropriately used. It is rare to see a 3D plot that couldn’t be improved by turning it into a regular 2D figure.

26.1 Avoid gratuitous 3D
Many visualization software packages enable you to spruce up your plots by turning the plots’ graphical elements into three-dimensional objects. Most commonly, we see pie charts turned into disks rotated in space, bar plots turned into columns, and line plots turned into bands. Notably, in none of these cases does the third dimension convey any actual data. 3D is used simply to decorate and adorn the plot. This use of 3D is gratuitous. It is unequivocally bad and should be erased from the visual vocabulary of data scientists.

Focus on the size of 25% in these four graphs
The problem with gratuitous 3D is that the projection of 3D objects into two dimensions for printing or display on a monitor distorts the data. The human visual system tries to correct for this distortion as it maps the 2D projection of a 3D image back into a 3D space. Let’s take a simple pie chart with two slices, one representing 25% of the data and one 75%, and rotate this pie in space. As we change the angle at which we’re looking at the pie, the size of the slices seems to change as well.
In particular, the 25% slice, which is located in the front of the pie, looks much bigger than 25% when we look at the pie from a flat angle.

Similar problems arise in 3D bar plots. Because of the way the bars are arranged relative to the axes, the bars all look shorter than they actually are. For example, there were 322 passengers total traveling in 1st class, yet the figure suggests that the number was less than 300. This illusion arises because the columns representing the data are located at a distance from the two back surfaces on which the gray horizontal lines are drawn.

26.2 Avoid 3D position scales
While visualizations with gratuitous 3D can easily be dismissed as bad, it is less clear what to think of visualizations using three genuine position scales (x, y, and z) to represent data. In this case, the use of the third dimension serves an actual purpose. Nevertheless, the resulting plots are frequently difficult to interpret, and in my mind they should be avoided.
Consider a 3D scatter plot of fuel efficiency versus displacement and power for 32 cars. Even though this 3D visualization is shown from four different perspectives, it is difficult to envision how exactly the points are distributed in space. I find part (d) particularly confusing. It almost seems to show a different dataset, even though nothing has changed other than the angle from which we look at the dots.
Better Solution
Another bad example of 3D. What do you suggest?
Solution

26.3 Appropriate use of 3D visualizations
Visualizations using 3D position scales can sometimes be appropriate. First, the issues described in the preceding section are of lesser concern if the visualization is interactive and can be rotated by the viewer, or alternatively, if it is shown in a VR or augmented reality environment where it can be inspected from multiple angles.
Second, even if the visualization isn’t interactive, showing it slowly rotating, rather than as a static image from one perspective, will allow the viewer to discern where in 3D space different graphical elements reside. The human brain is very good at reconstructing a 3D scene from a series of images taken from different angles, and the slow rotation of the graphic provides exactly these images.

Good use of 3D
Finally, it makes sense to use 3D visualizations when we want to show actual 3D objects and/or data mapped onto them. For example, showing the topographic relief of a mountainous island is a reasonable choice.

Good use of 3D
Similarly, to show the evolutionary sequence conservation of a protein mapped onto its structure, it makes sense to show the structure as a 3D object.

STAT 288
Dr Abdulla Eid
Week 8

15 Visualizing geospatial data
Many datasets contain information linked to locations in the physical world. For example, in an ecological study, a dataset may list where specific plants or animals have been found. Similarly, in a socioeconomic or political context, a dataset may contain information about where people with specific attributes (such as income, age, or educational attainment) live, or where man-made objects (e.g., bridges, roads, buildings) have been constructed. In all these cases, it can be helpful to visualize the data in their proper geospatial context, i.e., to show the data on a realistic map or alternatively as a map-like diagram.
Maps tend to be intuitive to readers but they can be challenging to design. We need to think about concepts such as map projections and whether for our specific application the accurate representation of angles or areas is more critical. A common mapping technique, the choropleth map, consists of representing data values as differently colored spatial areas. Choropleth maps can at times be very useful and at other times highly misleading.
As an alternative, we can construct map-like diagrams called cartograms, which may purposefully distort map areas or represent them in stylized form, for example as equal-sized squares.

15.1 Projections
The earth is approximately a sphere (more precisely, a spheroid that is slightly flattened along its axis of rotation). The two locations where the axis of rotation intersects with the spheroid are called the poles (north pole and south pole). We separate the spheroid into two hemispheres, the northern and the southern hemisphere, by drawing a line equidistant to both poles around the spheroid. This line is called the equator.

Location
To uniquely specify a location on the earth, we need three pieces of information: where we are located along the direction of the equator (the longitude), how close we are to either pole when moving perpendicular to the equator (the latitude), and how far we are from the earth’s center (the altitude). Longitude, latitude, and altitude are specified relative to a reference system called the datum. The datum specifies properties such as the shape and size of the earth, as well as the location of zero longitude, latitude, and altitude. One widely used datum is the World Geodetic System (WGS) 84, which is used by the Global Positioning System (GPS).

Longitude
While altitude is an important quantity in many geospatial applications, when visualizing geospatial data in the form of maps we are primarily concerned with the other two dimensions, longitude and latitude. Both longitude and latitude are angles, expressed in degrees. Degrees longitude measure how far east or west a location lies. Lines of equal longitude are referred to as meridians, and all meridians terminate at the two poles. The prime meridian, corresponding to 0° longitude, runs through the village of Greenwich in the United Kingdom.
The meridian opposite to the prime meridian lies at 180° longitude (also referred to as 180°E), which is equivalent to -180° longitude (also referred to as 180°W), near the international date line. Latitude Degrees latitude measure how far north or south a location lies. The equator corresponds to 0° latitude, the north pole corresponds to 90° latitude (also referred to as 90°N), and the south pole corresponds to -90° latitude (also referred to as 90°S). Lines of equal latitude are referred to as parallels, since they run parallel to the equator. All meridians have the same length, corresponding to half of a great circle around the globe. The longest parallel is the equator, at 0° latitude, and the shortest parallels lie at the north and south poles, 90°N and 90°S, and have length zero. Projection Technique The challenge in map-making is that we need to take the spherical surface of the earth and flatten it out so we can display it on a map. This process, called projection, necessarily introduces distortions, because a curved surface cannot be projected exactly onto a flat surface. Specifically, the projection can preserve either angles or areas but not both. A projection that does the former is called conformal and a projection that does the latter is called equal-area. Mercator Projection One of the earliest map projections in use, the Mercator projection, was developed in the 16th century for nautical navigation. It is a conformal projection that accurately represents shapes but introduces severe area distortions near the poles. The Mercator projection maps the globe onto a cylinder and then unrolls the cylinder to arrive at a rectangular map. Meridians in this projection are evenly spaced vertical lines, whereas parallels are horizontal lines whose spacing increases the further we move away from the equator.
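The behavior of meridians and parallels described above can be checked numerically. The following is a minimal sketch, assuming a perfectly spherical earth of unit radius (not the WGS 84 spheroid) and the standard spherical Mercator formulas x = R·λ, y = R·ln tan(π/4 + φ/2); the function names `mercator` and `area_inflation` are illustrative and not part of the course material.

```python
import math

def mercator(lon_deg, lat_deg, radius=1.0):
    """Project longitude/latitude (degrees) to Mercator x, y on a sphere."""
    if not -90.0 < lat_deg < 90.0:
        raise ValueError("Mercator y is infinite at the poles")
    lam = math.radians(lon_deg)
    phi = math.radians(lat_deg)
    x = radius * lam  # meridians: evenly spaced vertical lines
    y = radius * math.log(math.tan(math.pi / 4 + phi / 2))
    return x, y

def area_inflation(lat_deg):
    """Factor by which areas are enlarged at this latitude: sec^2(latitude)."""
    return 1.0 / math.cos(math.radians(lat_deg)) ** 2

# Parallels every 20 degrees: their spacing on the map grows toward the pole.
ys = [mercator(0.0, lat)[1] for lat in (0, 20, 40, 60, 80)]
gaps = [b - a for a, b in zip(ys, ys[1:])]
print([round(g, 3) for g in gaps])
print(round(area_inflation(60.0), 3))  # areas at 60 degrees are drawn about 4x too large
```

The growing gaps between equally spaced parallels are the stretching that keeps the projection conformal, and the squared secant is why high-latitude regions appear so large.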
The spacing between parallels increases in proportion to the extent to which they have to be stretched closer to the poles to keep meridians perfectly vertical. Focus on Hawaii and Alaska Is it good? Why bad? Ultimate solution 15.2 Layers To visualize geospatial data in the proper context, we usually create maps consisting of multiple layers showing different types of information. To demonstrate this concept, consider the locations of wind turbines in the San Francisco Bay Area. In the Bay Area, wind turbines are clustered in two locations. One location, the Shiloh Wind Farm, lies near Rio Vista, and the other lies east of Hayward near Tracy. What are the layers here? (There are four!) The figure consists of four separate layers. At the bottom, we have the terrain layer, which shows hills, valleys, and water. The next layer shows the road network. On top of the road layer sits a layer indicating the location of individual wind turbines. This layer also contains the two rectangles highlighting the majority of the wind turbines. Finally, the top layer adds the locations and names of cities. Separated Layers 15.3 Choropleth mapping We frequently want to show how some quantity varies across locations. We can do so by coloring individual regions in a map according to the data dimension we want to display. Such maps are called choropleth maps. As an example, consider the population density (persons per square kilometer) across the United States. We take the population number for each county in the U.S., divide it by the county’s surface area, and then draw a map where the color of each county corresponds to the ratio between population number and area. We can see how the major cities on the east and the west coast are the most densely populated areas of the U.S., the great plains and western states have low population densities, and the state of Alaska is the least densely populated of all. Why?
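The density computation behind such a choropleth can be sketched in a few lines. This is only an illustration under stated assumptions: the three regions and their numbers are hypothetical placeholders (not real county data), and the class breaks and hex colors are arbitrary choices made for the example.

```python
from bisect import bisect_right

# Hypothetical regions for illustration only: (name, population, area in km^2).
regions = [
    ("Region A", 1_200_000, 400.0),
    ("Region B", 50_000, 10_000.0),
    ("Region C", 300_000, 1_500.0),
]

# Assumed class breaks (persons per km^2) and one fill color per class,
# ordered from light (sparse) to dark (dense).
breaks = [10.0, 100.0, 1000.0]
palette = ["#ffffcc", "#a1dab4", "#41b6c4", "#225ea8"]

def density(population, area_km2):
    """Population density: the ratio the choropleth colors encode."""
    return population / area_km2

def color_for(value):
    # bisect_right counts how many breaks are <= value, i.e. the class index.
    return palette[bisect_right(breaks, value)]

fills = {name: color_for(density(pop, area)) for name, pop, area in regions}
for name, fill in fills.items():
    print(name, fill)
```

Note that Region B has the largest area but the lightest color: a large, sparsely populated region dominates the map visually while carrying the smallest density value.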
(Focus on Alaska) 15.4 Cartograms Not every map-like visualization has to be geographically accurate to be useful. For example, the problem with the previous figure is that some states take up a comparatively large area but are sparsely populated while others take up a small area yet have a large number of inhabitants. What if we deformed the states so their size was proportional to their number of inhabitants? Such a modified map is called a cartogram. We can still recognize individual states, yet we also see how the adjustment for population numbers has introduced important modifications. The east coast states, Florida, and California have grown a lot in size, whereas the other western states and Alaska have collapsed. Cartogram Heatmap Plots on the geographical location STAT 288 Dr Abdulla Eid Week 9 Chapter 28 Choosing the right visualization software How do we actually generate our figures? What tools should we use? This question can generate heated discussions, as many people have strong emotional bonds to the specific tools they are familiar with. Learning any new tool will require time and effort, and you will have to go through a painful transition period where getting things done with the new tool is much more difficult than it was with the old tool. Therefore, regardless of the pros and cons of different tools and approaches, the overriding principle is that you need to pick a tool that works for you. If you can make the figures you want to make, without excessive effort, then that’s all that matters. 28.1 Reproducibility and repeatability Reproducible A work is reproducible if the overarching scientific finding of the work will remain unchanged if a different research group performs the same type of study.
For example, if one research group finds that a new pain medication reduces perceived headache pain significantly without causing noticeable side effects and a different group subsequently studies the same medication on a different patient group and has the same findings, then the work is reproducible. Repeatable A work is repeatable if very similar or identical measurements can be obtained by the same person repeating the exact same measurement procedure on the same equipment. For example, if a doctor weighs her patient and finds she weighs 41 lbs and then weighs her again on the same scales and finds again that she weighs 41 lbs, then this measurement is repeatable. Reproducible Data Visualization With minor modifications, we can apply these concepts to data visualization. A visualization is reproducible if the plotted data are available and any data transformations that may have been applied are exactly specified. For example, if you make a figure and then send me the exact data that you plotted, then I can prepare a figure that looks substantially similar. We may be using slightly different fonts or colors or point sizes to display the same data, so the two figures may not be exactly identical, but your figure and mine convey the same message and therefore are reproductions of each other. Repeatable Data Visualization A visualization is repeatable if it is possible to recreate the exact same visual appearance, down to the last pixel, from the raw data. For random data, repeatability generally requires that we specify a particular random number generator for which we set and record a seed. Interactive Software Both reproducibility and repeatability can be difficult to achieve when we’re working with interactive plotting software. Many interactive programs allow you to transform or otherwise manipulate the data but don’t keep track of every individual data transformation you perform, only of the final product.
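A script, by contrast, records every step. As a minimal sketch of the seed requirement for random data mentioned above, assuming Python's standard `random` module (which uses the Mersenne Twister generator), the hypothetical helper `jittered_points` below adds random horizontal jitter of the kind used in strip charts; the seed value and data are made up for the example.

```python
import random

def jittered_points(values, seed, width=0.2):
    """Return (x, y) pairs with random horizontal jitter for each value."""
    rng = random.Random(seed)  # private generator; global state is untouched
    return [(rng.uniform(-width, width), v) for v in values]

SEED = 20241  # hypothetical value; record it alongside the figure script
data = [3.1, 4.7, 5.0, 2.8]

first = jittered_points(data, SEED)
second = jittered_points(data, SEED)
print(first == second)  # same generator + same seed gives identical positions
```

Because the generator and seed are both fixed in the script, anyone rerunning it obtains the same point positions, which is a precondition for a pixel-identical figure.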
So it is very difficult to maintain reproducibility and repeatability with interactive programs. Advice Try to stay away from interactive programs as much as possible. Make figures programmatically, by writing code (scripts) that generates the figures from the raw data. Programmatically generated figures will generally be repeatable by anybody who has access to the generating scripts and the programming language and specific libraries used. 28.2 Data exploration versus data presentation There are two distinct phases of data visualization, and they have very different requirements. The first is data exploration. Whenever you start working with a new dataset, you need to look at it from different angles and try various ways of visualizing it, just to develop an understanding of the dataset’s key features. In this phase, speed and efficiency are of the essence. You need to try different types of visualizations, different data transformations, and different subsets of the data. The faster you can iterate through different ways of looking at the data, the more you will explore, and the higher the likelihood that you will notice an important feature in the data that you might otherwise have overlooked. Data Presentation The second phase is data presentation. You enter it once you understand your dataset and know what aspects of it you want to show to your audience. The key objective in this phase is to prepare a high-quality, publication-ready figure that can be printed in an article or book, included in a presentation, or posted on the internet. Exploration Stage and Software In the exploration stage, whether the figures you make look appealing is secondary. It’s fine if the axis labels are missing, the legend is messed up, or the symbols are too small, as long as you can evaluate the various patterns in the data. What is critical, however, is how easy it is for you to change how the data are shown.
To truly explore the data, you should be able to rapidly move from a scatter plot to overlapping density distribution plots to boxplots to a heatmap. A well-designed data exploration tool will allow you to easily change which variables are mapped onto which aesthetics, and it will provide a wide range of different visualization options within a single coherent framework. However, many visualization tools (and in particular libraries for programmatic figure generation) are not set up in this way. Instead, they are organized by plot type, where each different type of plot requires somewhat different input data and has its own idiosyncratic interface. Such tools can get in the way of efficient data exploration, because it’s difficult to remember how all the different plot types work. Data Presentation and Software Once we have determined how exactly we want to visualize our data, what data transformations we want to make, and what type of plot to use, we will commonly want to prepare a high-quality figure for publication. At this point, we have several different avenues we can pursue. First, we can finalize the figure using the same software platform we used for initial exploration. Second, we can switch to a platform that provides us finer control over the final product, even if that platform makes it harder to explore. Third, we can produce a draft figure with visualization software and then manually post-process it with an image manipulation or illustration program such as Photoshop or Illustrator (please never do this). Fourth, we can manually redraw the entire figure from scratch, either with pen and paper or using an illustration program (highly inadvisable). 28.3 Separation of content and design A good visualization software should allow you to think separately about the content and the design of your figures.
By content, I refer to the specific data set shown, the data transformations applied (if any), the specific mappings from data onto aesthetics, the scales, the axis ranges, and the type of plot (scatter plot, line plot, bar plot, boxplot, etc.). Design, on the other hand, describes features such as the foreground and background colors, font specifications (e.g. font size, face, and family), symbol shapes and sizes, the placement of legends, axis ticks, axis titles, and plot titles, and whether or not the figure has a background grid. Themes The book uses ggplot2, where separation of content and design is achieved via themes. A theme specifies the visual appearance of a figure, and it is easy to take an existing figure and apply different themes to it (look at the figure to the right). Themes can be written by third parties and distributed as R or Python packages. Benefit of Separation Separation of content and design allows data scientists and designers to each focus on what they do best. Most data scientists are not designers, and therefore their primary concern should be the data, not the design of a visualization. Likewise, most designers are not data scientists, and they should be able to provide a unique and appealing visual language for figures without having to worry about specific data, appropriate transformations, and so on. Summary of the Chapter When choosing your visualization software, think about how easily you can reproduce figures and redo them with updated or otherwise changed datasets, whether you can rapidly explore different visualizations of the same data, and to what extent you can tweak the visual design separately from generating the figure content. Depending on your skill level and comfort with programming, it may be beneficial to use different visualization tools at the data exploration and the data presentation stages, and you may prefer to do the final visual tweaking interactively or by hand.
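To make the idea of tweaking design separately from content concrete, here is a minimal sketch in plain Python. It is not ggplot2's theme system; the `Theme` class and `bar_chart_svg` function are hypothetical names invented for this illustration, and the output is a bare-bones SVG string rather than a publication-ready figure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Theme:
    """Design only: nothing in here depends on the data."""
    background: str
    bar_color: str
    font_family: str
    font_size: int

def bar_chart_svg(values, labels, theme, width=300, height=150):
    """Content (values, labels, bar heights) rendered with a given theme."""
    top = max(values)
    slot = width / len(values)
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">',
        f'<rect width="{width}" height="{height}" fill="{theme.background}"/>',
    ]
    for i, (value, label) in enumerate(zip(values, labels)):
        bar_h = (height - 30) * value / top  # bar height encodes the data value
        x = i * slot + slot * 0.1
        parts.append(
            f'<rect x="{x:.1f}" y="{height - 20 - bar_h:.1f}" '
            f'width="{slot * 0.8:.1f}" height="{bar_h:.1f}" fill="{theme.bar_color}"/>'
        )
        parts.append(
            f'<text x="{x:.1f}" y="{height - 5}" font-family="{theme.font_family}" '
            f'font-size="{theme.font_size}">{label}</text>'
        )
    parts.append("</svg>")
    return "\n".join(parts)

# Two themes applied to the same content: geometry is identical, styling differs.
light = Theme("#ffffff", "#4477aa", "Helvetica", 10)
dark = Theme("#222222", "#ddaa33", "Georgia", 12)
values, labels = [3, 7, 5], ["Mon", "Tue", "Wed"]
svg_light = bar_chart_svg(values, labels, light)
svg_dark = bar_chart_svg(values, labels, dark)
```

Swapping `light` for `dark` changes colors and fonts but leaves every bar position and height untouched, which is the property that lets a designer work on the theme without touching the data code.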
If you have to make figures interactively, in particular with a software that does not keep track of all data transformations and visual tweaks you have applied, consider taking careful notes on how you make each figure, so that all your work remains reproducible. STAT 288 Data Visualization Week 10 Chapter 29: Story Telling 29 Telling a story and making a point Most data visualization is done for the purpose of communication. We have an insight about a dataset, and we have a potential audience, and we would like to convey our insight to our audience. To communicate our insight successfully, we will have to present the audience with a clear and exciting story. Story!! The need for a story may seem disturbing to scientists and engineers, who may equate it with making things up, putting a spin on things, or overselling results. However, this perspective misses the important role that stories play in reasoning and memory. We get excited when we hear a good story, and we get bored when the story is bad or when there is none. Benefit of Story Moreover, any communication creates a story in the audience’s minds. If we don’t provide a clear story ourselves, then our audience will make one up. In the best-case scenario, the story they make up is reasonably close to our own view of the material presented. However, it can be and often is much worse. The made-up story could be “this is boring,” “the author is wrong,” or “the author is incompetent.” Compelling Story! Your goal in telling a story should be to use facts and logical reasoning to get your audience interested and excited. Let me tell you a story about the theoretical physicist Stephen Hawking. He was diagnosed with motor neuron disease at age 21—one year into his PhD—and was given two years to live. Hawking did not accept this predicament and started pouring all his energy into doing science. 
Hawking ended up living to be 76, became one of the most influential physicists of his time, and did all of his seminal work while being severely disabled. This is a compelling story, and it is also entirely fact-based and true. 29.1 What is a story? A story is a set of observations, facts, or events, true or invented, that are presented in a specific order such that they create an emotional reaction in the audience. The emotional reaction is created through the build-up of tension at the beginning of the story followed by some type of resolution towards the end of the story. We refer to the flow from tension to resolution also as the story arc, and every good story has a clear, identifiable arc. Experienced writers know that there are standard patterns for storytelling that resonate with how humans think. For example, we can tell a story using the Opening–Challenge–Action–Resolution format. Recall the Stephen Hawking story! Other Formats Other story formats are also commonly used. Newspaper articles frequently follow the Lead–Development–Resolution format or, even shorter, just Lead–Development, where the lead gives away the main point up front and the subsequent material provides further details. Yet another format is Action–Background–Development–Climax–Ending, which develops the story a little more rapidly than Opening–Challenge–Action–Resolution but not as rapidly as Lead–Development. The goal of this chapter is not to describe these standard forms of storytelling in more detail. Instead, I want to discuss how we can bring data visualizations into the story arc. Most importantly, we need to realize that a single (static) visualization will rarely tell an entire story. A visualization may illustrate the opening, the challenge, the action, or the resolution, but it is unlikely to convey all these parts of the story at once. Complete story To tell a complete story, we will usually need multiple visualizations.
For example, when giving a presentation, we may first show some background or motivational material, then a figure that creates a challenge, and eventually some other figure that provides the resolution. Likewise, in a research paper, we may present a sequence of figures that jointly create a convincing story arc. It is, however, also possible to condense an entire story arc into a single figure. Such a figure must contain a challenge and a resolution at the same time, and it is comparable to a story arc that starts with a lead. Example of a Story Example: The growth of preprints in the biological sciences. Preprints are manuscripts in draft form that scientists share with their colleagues before formal peer review and official publication. Scientists have been sharing manuscript drafts for as long as scientific manuscripts have existed. However, in the early 1990s, with the advent of the internet, physicists realized that it was much more efficient to store and distribute manuscript drafts in a central repository. They invented the preprint server, a web server where scientists can upload, download, and search for manuscript drafts. arXiv The preprint server physicists developed and still use today is called arXiv.org. Shortly after it was established, arXiv.org started to branch out and become popular in related quantitative fields, including mathematics, astronomy, computer science, statistics, quantitative finance, and quantitative biology. Here, I am interested in the preprint submissions to the quantitative biology (q-bio) section of arXiv.org. The number of submissions per month grew exponentially from 2007 to late 2013, but then the growth suddenly stopped (Figure 29.1). Something must have happened in late 2013 that radically changed the landscape in preprint submissions for quantitative biology. What caused this drastic change in submission growth?
I will argue that late 2013 marks the point in time when preprints took off in biology, and ironically this caused the q-bio archive to slow its growth. In November 2013, the biology-specific preprint server bioRxiv was launched by Cold Spring Harbor Laboratory (CSHL) Press. CSHL Press is a publisher that is highly respected among biologists. The backing of CSHL Press helped tremendously with the acceptance of preprints in general and bioRxiv in particular among biologists. The same biologists that would have been quite suspicious of arXiv.org were much more comfortable with bioRxiv. As a result, bioRxiv quickly gained acceptance among biologists, to a degree that arXiv had never managed. In fact, soon after its launch, bioRxiv started experiencing rapid, exponential growth in monthly submissions, and the slowdown in q-bio submissions exactly coincides with the start of this exponential growth in bioRxiv. It appears to be the case that many quantitative biologists who otherwise might have deposited a preprint with q-bio decided to deposit it with bioRxiv instead. This is my story about preprints in biology. I purposefully told it with two figures, even though the first is fully contained within the second. I think this story has the strongest impact when broken into two pieces, and this is how I would present it in a talk. However, the second figure alone can be used to tell the entire story, and the single-figure version might be more suitable to a medium where the audience can be expected to have a short attention span, such as in a social media post.
29.2 Make a figure for the generals Help your audience to connect with your story and remain engaged throughout your entire story arc. First, and most importantly, you need to show your audience figures they can actually understand. Our hope is that the audience can see our figures and immediately infer the points we are trying to make, and that the audience can rapidly process complex visualizations and understand the key trends and relationships that are shown. Reality is different Neither of these assumptions is true. We need to do everything we can to help our readers understand the meaning of our visualizations and see the same patterns in the data that we see. This usually means less is more. Simplify your figures as much as possible. Remove all features that are tangential to your story. Only the important points should remain. I refer to this concept as “making a figure for the generals.” The figure shows the arrival delays for all flights departing out of the New York City area in 2013. Can you process it? Solution (Break it down) 29.3 Build up towards complex figures Sometimes, however, we do want to show more complex figures that contain a large amount of information at once. In those cases, we can make things easier for our readers if we first show them a simplified version of the figure before we show the final one in its full complexity. The same approach is also highly recommended for presentations. Never jump straight to a highly complex figure; first show an easily digestible subset. Simple plot first The figure shows the aggregate numbers of United Airlines departures out of Newark Airport (EWR) in 2013, broken down by weekday.
Once we have seen and digested this figure, seeing the same information for ten airlines and three airports at once is much easier to process. All plots 29.4 Make your figures memorable Simple and clean figures such as simple bar plots have the advantage that they avoid distractions, are easy to read, and let your audience focus on the most important points you want to bring across. However, the simplicity can come with a disadvantage: Figures can end up looking generic. They don’t have any features that stand out and make them memorable. If I showed you ten bar graphs in quick succession, you’d have a hard time keeping them apart and afterwards remembering what they showed. For example, if you take a quick look at the figure on the right, you will notice the visual similarity to one of the previous figures (number of flights). However, the two figures have nothing in common other than that they are bar charts. Neither figure has any element that helps you intuitively perceive what topic the figure covers, and therefore neither figure is particularly memorable. Make it memorable 29.5 Be consistent but don’t be repetitive It is important to use a consistent visual language for the different parts of a larger figure. The same is true across figures. If we make three figures that are all part of one larger story, then we need to design those figures so they look like they belong together. Using a consistent visual language does not mean, however, that everything should look exactly the same. On the contrary. It is important that figures describing different analyses look visually distinct, so that your audience can easily recognize where one analysis ends and another one starts. This is best achieved by using different visualization approaches for different parts of the overarching story. If you have used a bar plot already, next use a scatterplot, or a boxplot, or a line plot.
Otherwise, the different analyses will blur together in your audience’s mind, and they will have a hard time distinguishing one part of the story from another. Why? Too many repetitive figures Much better How many figures to add How many figures should you use to tell your story? The answer depends on the publication venue. For a short blog post or tweet, make one figure. For scientific papers, I recommend between three and six figures. If I have many more than six figures for a scientific paper, then some of them need to be moved into an appendix or supplementary materials section. A book or thesis will contain more than one story, and in fact may contain one story per chapter or section. In those scenarios, each distinct story-line or subplot should be presented with no more than three to six figures.