
Praise for Fundamentals of Data Visualization

Wilke has written the rare data visualization book that will help you move beyond the standard line, bar, and pie charts that you know and use. He takes you through the conceptual underpinnings of what makes an effective visualization and through a library of different graphs that anyone can utilize. This book will quickly become a go-to reference for anyone working with and visualizing data.
—Jonathan Schwabish, Senior Fellow, Urban Institute

In this well-illustrated view of what it means to clearly visualize data, Claus Wilke explains his rationale for why some graphs are effective and others are not. This incredibly useful guide provides clear examples that beginners can emulate as well as explanations for stylistic choices so experts can learn what to modify.
—Steve Haroz, Research Scientist, Inria

Wilke’s book is the best practical guide to visualization for anyone with a scientific disposition. This clear and accessible book is going to live at arm’s reach on lab tables everywhere.
—Scott Murray, Lead Program Manager, O’Reilly Media

Fundamentals of Data Visualization
A Primer on Making Informative and Compelling Figures

Claus O. Wilke

Beijing Boston Farnham Sebastopol Tokyo

Fundamentals of Data Visualization
by Claus O. Wilke

Copyright © 2019 Claus O. Wilke. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].
Editors: Mike Loukides and Melissa Potter
Production Editor: Kristen Brown
Copyeditor: Rachel Head
Proofreader: James Fraleigh
Indexer: Ellen Troutman-Zaig
Interior Designer: David Futato
Cover Designer: Karen Montgomery
Illustrator: Claus Wilke

March 2019: First Edition

Revision History for the First Edition
2019-03-15: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492031086 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Fundamentals of Data Visualization, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

The views expressed in this work are those of the author, and do not represent the publisher’s views. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-492-03108-6
[GP]

Table of Contents

Preface

1. Introduction
   Ugly, Bad, and Wrong Figures

Part I. From Data to Visualization

2. Visualizing Data: Mapping Data onto Aesthetics
   Aesthetics and Types of Data
   Scales Map Data Values onto Aesthetics
3. Coordinate Systems and Axes
   Cartesian Coordinates
   Nonlinear Axes
   Coordinate Systems with Curved Axes
4. Color Scales
   Color as a Tool to Distinguish
   Color to Represent Data Values
   Color as a Tool to Highlight
5. Directory of Visualizations
   Amounts
   Distributions
   Proportions
   x–y relationships
   Geospatial Data
   Uncertainty
6. Visualizing Amounts
   Bar Plots
   Grouped and Stacked Bars
   Dot Plots and Heatmaps
7. Visualizing Distributions: Histograms and Density Plots
   Visualizing a Single Distribution
   Visualizing Multiple Distributions at the Same Time
8. Visualizing Distributions: Empirical Cumulative Distribution Functions and Q-Q Plots
   Empirical Cumulative Distribution Functions
   Highly Skewed Distributions
   Quantile-Quantile Plots
9. Visualizing Many Distributions at Once
   Visualizing Distributions Along the Vertical Axis
   Visualizing Distributions Along the Horizontal Axis
10. Visualizing Proportions
   A Case for Pie Charts
   A Case for Side-by-Side Bars
   A Case for Stacked Bars and Stacked Densities
   Visualizing Proportions Separately as Parts of the Total
11. Visualizing Nested Proportions
   Nested Proportions Gone Wrong
   Mosaic Plots and Treemaps
   Nested Pies
   Parallel Sets
12. Visualizing Associations Among Two or More Quantitative Variables
   Scatterplots
   Correlograms
   Dimension Reduction
   Paired Data
13. Visualizing Time Series and Other Functions of an Independent Variable
   Individual Time Series
   Multiple Time Series and Dose–Response Curves
   Time Series of Two or More Response Variables
14. Visualizing Trends
   Smoothing
   Showing Trends with a Defined Functional Form
   Detrending and Time-Series Decomposition
15. Visualizing Geospatial Data
   Projections
   Layers
   Choropleth Mapping
   Cartograms
16. Visualizing Uncertainty
   Framing Probabilities as Frequencies
   Visualizing the Uncertainty of Point Estimates
   Visualizing the Uncertainty of Curve Fits
   Hypothetical Outcome Plots

Part II. Principles of Figure Design

17. The Principle of Proportional Ink
   Visualizations Along Linear Axes
   Visualizations Along Logarithmic Axes
   Direct Area Visualizations
18. Handling Overlapping Points
   Partial Transparency and Jittering
   2D Histograms
   Contour Lines
19. Common Pitfalls of Color Use
   Encoding Too Much or Irrelevant Information
   Using Nonmonotonic Color Scales to Encode Data Values
   Not Designing for Color-Vision Deficiency
20. Redundant Coding
   Designing Legends with Redundant Coding
   Designing Figures Without Legends
21. Multipanel Figures
   Small Multiples
   Compound Figures
22. Titles, Captions, and Tables
   Figure Titles and Captions
   Axis and Legend Titles
   Tables
23. Balance the Data and the Context
   Providing the Appropriate Amount of Context
   Background Grids
   Paired Data
   Summary
24. Use Larger Axis Labels
25. Avoid Line Drawings
26. Don’t Go 3D
   Avoid Gratuitous 3D
   Avoid 3D Position Scales
   Appropriate Use of 3D Visualizations

Part III. Miscellaneous Topics

27. Understanding the Most Commonly Used Image File Formats
   Bitmap and Vector Graphics
   Lossless and Lossy Compression of Bitmap Graphics
   Converting Between Image Formats
28. Choosing the Right Visualization Software
   Reproducibility and Repeatability
   Data Exploration Versus Data Presentation
   Separation of Content and Design
29. Telling a Story and Making a Point
   What Is a Story?
   Make a Figure for the Generals
   Build Up Toward Complex Figures
   Make Your Figures Memorable
   Be Consistent but Don’t Be Repetitive

Annotated Bibliography
Technical Notes
References
Index

Preface

If you are a scientist, an analyst, a consultant, or anybody else who has to prepare technical documents or reports, one of the most important skills you need to have is the ability to make compelling data visualizations, generally in the form of figures. Figures will typically carry the weight of your arguments. They need to be clear, attractive, and convincing. The difference between good and bad figures can be the difference between a highly influential or an obscure paper, a grant or contract won or lost, a job interview gone well or poorly. And yet, there are surprisingly few resources to teach you how to make compelling data visualizations.
Few colleges offer courses on this topic, and there are not that many books on this topic either. (Some exist, of course.) Tutorials for plotting software typically focus on how to achieve specific visual effects rather than explaining why certain choices are preferred and others not. In your day-to-day work, you are simply expected to know how to make good figures, and if you’re lucky you have a patient adviser who teaches you a few tricks as you’re writing your first scientific papers.

In the context of writing, experienced editors talk about “ear,” the ability to hear (internally, as you read a piece of prose) whether the writing is any good. I think that when it comes to figures and other visualizations, we similarly need “eye,” the ability to look at a figure and see whether it is balanced, clear, and compelling. And just as is the case with writing, the ability to see whether a figure works or not can be learned. Having eye means primarily that you are aware of a larger collection of simple rules and principles of good visualization, and that you pay attention to little details that other people might not.

In my experience, again just as in writing, you don’t develop eye by reading a book over the weekend. It is a lifelong process, and concepts that are too complex or too subtle for you today may make much more sense five years from now. I can say for myself that I continue to evolve in my understanding of figure preparation. I routinely try to expose myself to new approaches, and I pay attention to the visual and design choices others make in their figures. I’m also open to changing my mind. I might today consider a given figure great, but next month I might find a reason to criticize it. So with this in mind, please don’t take anything I say as gospel. Think critically about my reasoning for certain choices and decide whether you want to adopt them or not.

While the materials in this book are presented in a logical progression, most chapters can stand on their own, and there is no need to read the book cover to cover. Feel free to skip around, to pick out a specific section that you’re interested in at the moment, or one that covers a particular design choice you’re pondering. In fact, I think you will get the most out of this book if you don’t read it all at once, but rather read it piecemeal over longer stretches of time, try to apply just a few concepts from the book in your figure-making, and come back to read about other concepts or reread sections on concepts you learned about a while back. You may find that the same chapter tells you different things if you reread it after a few months have passed.

Even though nearly all of the figures in this book were made with R and ggplot2, I do not see this as an R book. I am talking about general principles of figure preparation. The software used to make the figures is incidental. You can use any plotting software you want to generate the kinds of figures I’m showing here. However, ggplot2 and similar packages make many of the techniques I’m using much simpler than other plotting libraries. Importantly, because this is not an R book, I do not discuss code or programming techniques anywhere in this book. I want you to focus on the concepts and the figures, not on the code. If you are curious about how any of the figures were made, you can check out the book’s source code at its GitHub repository.

Thoughts on Graphing Software and Figure-Preparation Pipelines

I have over two decades of experience preparing figures for scientific publications and have made thousands of figures. If there has been one constant over these two decades, it’s been the change in figure preparation pipelines. Every few years, a new plotting library is developed or a new paradigm arises, and large groups of scientists switch over to the hot new toolkit.
I have made figures using gnuplot, Xfig, Mathematica, Matlab, matplotlib in Python, base R, ggplot2 in R, and possibly others I can’t currently remember. My current preferred approach is ggplot2 in R, but I don’t expect that I’ll continue using it until I retire.

This constant change in software platforms is one of the key reasons why this book is not a programming book and why I have left out all code examples. I want this book to be useful to you regardless of which software you use, and I want it to remain valuable even once everybody has moved on from ggplot2 and is using the next new thing. I realize that this choice may be frustrating to some ggplot2 users who would like to know how I made a given figure. However, anybody who is curious about my coding techniques can read the source code of the book. It is available. Also, in the future I may release a supplementary document focused just on the code.

One thing I have learned over the years is that automation is your friend. I think figures should be autogenerated as part of the data analysis pipeline (which should also be automated), and they should come out of the pipeline ready to be sent to the printer, with no manual post-processing needed. I see a lot of trainees autogenerate rough drafts of their figures, which they then import into Illustrator for sprucing up. There are several reasons why this is a bad idea. First, the moment you manually edit a figure, your final figure becomes irreproducible. A third party cannot generate the exact same figure you did. While this may not matter much if all you did was change the font of the axis labels, the lines are blurry, and it’s easy to cross over into territory where things are less clear-cut. As an example, let’s say you want to manually replace cryptic labels with more readable ones. A third party may not be able to verify that the label replacement was appropriate.
Second, if you add a lot of manual post-processing to your figure-preparation pipeline, then you will be more reluctant to make any changes or redo your work. Thus, you may ignore reasonable requests for change made by collaborators or colleagues, or you may be tempted to reuse an old figure even though you’ve actually regenerated all the data. Third, you may yourself forget what exactly you did to prepare a given figure, or you may not be able to generate a future figure on new data that exactly visually matches your earlier figure. These are not made-up examples. I’ve seen all of them play out with real people and real publications.

For all these reasons, interactive plot programs are a bad idea. They inherently force you to manually prepare your figures. In fact, it’s probably better to autogenerate a figure draft and spruce it up in Illustrator than to make the entire figure by hand in some interactive plot program. Please be aware that Excel is an interactive plot program as well and is not recommended for figure preparation (or data analysis).

One critical component in a book on data visualization is the feasibility of the proposed visualizations. It’s nice to invent some elegant new type of visualization, but if nobody can easily generate figures using this visualization then there isn’t much use to it. For example, when Tufte first proposed sparklines nobody had an easy way of making them. While we need visionaries who move the world forward by pushing the envelope of what’s possible, I envision this book to be practical and directly applicable to working data scientists preparing figures for their publications. Therefore, the visualizations I propose in the subsequent chapters can be generated with a few lines of R code via ggplot2 and readily available extension packages. In fact, nearly every figure in this book, with the exception of a few figures in Chapters 26, 27, and 28, was autogenerated exactly as shown.
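
The idea of a figure that comes out of a script ready to use, with no manual post-processing, can be made concrete with a small sketch. The Python below is not taken from the book or its repository; it is a minimal illustration under stated assumptions. The function name `bar_chart_svg`, the three example values, the pixel sizes, and the fill color are all hypothetical choices, and SVG output was picked only because it can be written with the standard library.

```python
def bar_chart_svg(values, bar_width=40, gap=20, unit=30):
    """Return a simple SVG bar chart as a string.

    `values` maps a label to a non-negative number. Every bar starts at
    the same baseline and its height is value * unit pixels, so the drawn
    lengths stay proportional to the data. All sizes are illustrative
    assumptions, not recommendations.
    """
    tallest = max(values.values())
    height = tallest * unit + 30          # room for labels under the bars
    width = gap + len(values) * (bar_width + gap)
    baseline = height - 20                # y coordinate of the common baseline

    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">'
    ]
    for i, (label, value) in enumerate(values.items()):
        x = gap + i * (bar_width + gap)
        bar_height = value * unit
        parts.append(
            f'<rect x="{x}" y="{baseline - bar_height}" '
            f'width="{bar_width}" height="{bar_height}" fill="#56B4E9"/>'
        )
        parts.append(
            f'<text x="{x + bar_width / 2}" y="{height - 5}" '
            f'text-anchor="middle">{label}</text>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


def save_figure(values, path):
    """Write the chart to `path`; rerunning with the same data rewrites the same file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(bar_chart_svg(values))
```

Calling `save_figure({"A": 3, "B": 5, "C": 4}, "figure.svg")` at the end of an analysis script would regenerate the figure whenever the data change. Because the same input always yields the same text, a third party can reproduce the file exactly, which is the property that manual editing destroys.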

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic: Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width: Used to refer to program elements such as variable or function names, statements, and keywords.

This element signifies a tip or suggestion.
This element signifies a general note.
This element indicates a warning or caution.

Using Code Examples

Supplemental material is available for download at https://github.com/clauswilke/dataviz.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Fundamentals of Data Visualization by Claus O. Wilke (O’Reilly). Copyright 2019 Claus O. Wilke, 978-1-492-03108-6.”

You may find that additional uses fall within the scope of fair use (for example, reusing a few figures from the book). If you feel your use of code examples or other content falls outside fair use or the permission given above, feel free to contact us at [email protected].

O’Reilly Online Learning

For almost 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.
Our unique network of experts and innovators share their knowledge and expertise through books, articles, conferences, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, please visit http://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/fundamentals-of-data-visualization.

To comment or ask technical questions about this book, send email to bookques‐ [email protected].

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments

This project would not have been possible without the fantastic work the RStudio team has put into turning the R universe into a first-rate publishing platform. In particular, I have to thank Hadley Wickham for creating ggplot2, the plotting software that was used to make all the figures throughout this book. I would also like to thank Yihui Xie for creating R Markdown and for writing the knitr and bookdown packages. I don’t think I would have started this project without these tools ready to go. Writing R Markdown files is fun, and it’s easy to collect material and gain momentum.
Special thanks go to Achim Zeileis and Reto Stauffer for colorspace, Thomas Lin Pedersen for ggforce and gganimate, Kamil Slowikowski for ggrepel, Edzer Pebesma for sf, and Claire McWhite for her work on colorspace and colorblindr to simulate color-vision deficiency in assembled R figures.

Several people have provided helpful feedback on draft versions of this book. Most importantly, Mike Loukides, my editor at O’Reilly, and Steve Haroz have both read and commented on every chapter. I also received helpful comments from Carl Bergstrom, Jessica Hullman, Matthew Kay, Tristan Mahr, Edzer Pebesma, Jon Schwabish, and Hadley Wickham. Len Kiefer’s blog and Kieran Healy’s book and blog postings have provided numerous inspirations for figures to make and datasets to use. A number of people pointed out minor issues or typos, including Thiago Arrais, Malcolm Barrett, Jessica Burnett, Jon Calder, Antônio Pedro Camargo, Daren Card, Kim Cressman, Akos Hajdu, Thomas Jochmann, Andrew Kinsman, Will Koehrsen, Alex Lalejini, John Leadley, Katrin Leinweber, Mikel Madina, Claire McWhite, S’busiso Mkhondwane, Jose Nazario, Steve Putman, Maëlle Salmon, Christian Schudoma, James Scott-Brown, Enrico Spinielli, Wouter van der Bijl, and Ron Yurko.

I would also more broadly like to thank all the other contributors to the tidyverse and the R community in general. There truly is an R package for any visualization challenge one may encounter. All these packages have been developed by an extensive community of thousands of data scientists and statisticians, and many of them have in some form contributed to the making of this book.

Finally, I would like to thank my wife Stefania for patiently enduring many evenings and weekends during which I spent hours in front of the computer writing ggplot2 code, obsessing over minute details of certain figures, and fleshing out chapter details.

CHAPTER 1
Introduction

Data visualization is part art and part science.
The challenge is to get the art right without getting the science wrong, and vice versa. A data visualization first and foremost has to accurately convey the data. It must not mislead or distort. If one number is twice as large as another, but in the visualization they look to be about the same, then the visualization is wrong. At the same time, a data visualization should be aesthetically pleasing. Good visual presentations tend to enhance the message of the visualization. If a figure contains jarring colors, imbalanced visual elements, or other features that distract, then the viewer will find it harder to inspect the figure and interpret it correctly.

In my experience, scientists frequently (though not always!) know how to visualize data without being grossly misleading. However, they may not have a well-developed sense of visual aesthetics, and they may inadvertently make visual choices that detract from their desired message. Designers, on the other hand, may prepare visualizations that look beautiful but play fast and loose with the data. It is my goal to provide useful information to both groups.

This book attempts to cover the key principles, methods, and concepts required to visualize data for publications, reports, or presentations. Because data visualization is a vast field, and in its broadest definition could include topics as varied as schematic technical drawings, 3D animations, and user interfaces, I necessarily had to limit my scope. I am specifically covering the case of static visualizations presented in print, online, or as slides. The book does not cover interactive visuals or movies, except in one brief section in Chapter 16. Therefore, throughout this book, I will use the words “visualization” and “figure” somewhat interchangeably. The book also does not provide any instruction on how to make figures with existing visualization software or programming libraries.
The annotated bibliography at the end of the book includes pointers to appropriate texts covering these topics.

The book is divided into three parts. The first, “From Data to Visualization,” describes different types of plots and charts, such as bar graphs, scatterplots, and pie charts. Its primary emphasis is on the science of visualization. In this part, rather than attempting to provide encyclopedic coverage of every conceivable visualization approach, I discuss a core set of visuals that you will likely encounter in publications and/or need in your own work. In organizing this part, I have attempted to group visualizations by the type of message they convey rather than by the type of data being visualized. Statistical texts often describe data analysis and visualization by type of data, organizing the material by number and type of variables (one continuous variable, one discrete variable, two continuous variables, one continuous and one discrete variable, etc.). I believe that only statisticians find this organization helpful. Most other people think in terms of a message, such as how large something is, how it is composed of parts, how it relates to something else, and so on.

The second part, “Principles of Figure Design,” discusses various design issues that arise when assembling data visualizations. Its primary but not exclusive emphasis is on the aesthetic aspect of data visualization. Once we have chosen the appropriate type of plot or chart for our dataset, we have to make aesthetic choices about the visual elements, such as colors, symbols, and font sizes. These choices can affect both how clear a visualization is and how elegant it looks. The chapters in this second part address the most common issues that I have seen arise repeatedly in practical applications.

The third part, “Miscellaneous Topics,” covers a few remaining issues that didn’t fit into the first two parts.
It discusses file formats commonly used to store images and plots, provides thoughts about the choice of visualization software, and explains how to place individual figures into the context of a larger document.

Ugly, Bad, and Wrong Figures

Throughout this book, I frequently show different versions of the same figures, some as examples of how to make a good visualization and some as examples of how not to. To provide a simple visual guideline of which examples should be emulated and which should be avoided, I am labeling problematic figures as “ugly,” “bad,” or “wrong” (Figure 1-1):

Ugly: A figure that has aesthetic problems but otherwise is clear and informative
Bad: A figure that has problems related to perception; it may be unclear, confusing, overly complicated, or deceiving
Wrong: A figure that has problems related to mathematics; it is objectively incorrect

Figure 1-1. Examples of ugly, bad, and wrong figures. (a) A bar plot showing three values (A = 3, B = 5, and C = 4). This is a reasonable visualization with no major flaws. (b) An ugly version of part (a). While the plot is technically correct, it is not aesthetically pleasing. The colors are too bright and not useful. The background grid is too prominent. The text is displayed using three different fonts in three different sizes. (c) A bad version of part (a). Each bar is shown with its own y axis scale. Because the scales don’t align, this makes the figure misleading. One can easily get the impression that the three values are closer together than they actually are. (d) A wrong version of part (a). Without an explicit y axis scale, the numbers represented by the bars cannot be ascertained. The bars appear to be of lengths 1, 3, and 2, even though the values displayed are meant to be 3, 5, and 4.

I am not explicitly labeling good figures. Any figure that isn’t labeled as flawed should be assumed to be at least acceptable.
It is a figure that is informative, looks appealing, and could be printed as is. Note that among the good figures, there will still be differences in quality, and some good figures will be better than others.

I generally provide my rationale for specific ratings, but some are a matter of taste. In general, the “ugly” rating is more subjective than the “bad” or “wrong” rating. Moreover, the boundary between “ugly” and “bad” is somewhat fluid. Sometimes poor design choices can interfere with human perception to the point where a “bad” rating is more appropriate than an “ugly” rating. In any case, I encourage you to develop your own eye and to critically evaluate my choices.

PART I
From Data to Visualization

CHAPTER 2
Visualizing Data: Mapping Data onto Aesthetics

Whenever we visualize data, we take data values and convert them in a systematic and logical way into the visual elements that make up the final graphic. Even though there are many different types of data visualizations, and on first glance a scatterplot, a pie chart, and a heatmap don’t seem to have much in common, all these visualizations can be described with a common language that captures how data values are turned into blobs of ink on paper or colored pixels on a screen. The key insight is the following: all data visualizations map data values into quantifiable features of the resulting graphic. We refer to these features as aesthetics.

Aesthetics and Types of Data

Aesthetics describe every aspect of a given graphical element. A few examples are provided in Figure 2-1. A critical component of every graphical element is of course its position, which describes where the element is located. In standard 2D graphics, we describe positions by an x and y value, but other coordinate systems and one- or three-dimensional visualizations are possible. Next, all graphical elements have a shape, a size, and a color.
Even if we are preparing a black-and-white drawing, graphical elements need to have a color to be visible: for example, black if the background is white or white if the background is black. Finally, to the extent we are using lines to visualize data, these lines may have different widths or dash–dot patterns. Beyond the examples shown in Figure 2-1, there are many other aesthetics we may encounter in a data visualization. For example, if we want to display text, we may have to specify font family, font face, and font size, and if graphical objects overlap, we may have to specify whether they are partially transparent.

Figure 2-1. Commonly used aesthetics in data visualization: position, shape, size, color, line width, line type. Some of these aesthetics can represent both continuous and discrete data (position, size, line width, color), while others can usually only represent discrete data (shape, line type).

All aesthetics fall into one of two groups: those that can represent continuous data and those that cannot. Continuous data values are values for which arbitrarily fine intermediates exist. For example, time duration is a continuous value. Between any two durations, say 50 seconds and 51 seconds, there are arbitrarily many intermediates, such as 50.5 seconds, 50.51 seconds, 50.50001 seconds, and so on. By contrast, number of persons in a room is a discrete value. A room can hold 5 persons or 6, but not 5.5. For the examples in Figure 2-1, position, size, color, and line width can represent continuous data, but shape and line type can usually only represent discrete data.

Next we’ll consider the types of data we may want to represent in our visualization. You may think of data as numbers, but numerical values are only two out of several types of data we may encounter. In addition to continuous and discrete numerical values, data can come in the form of discrete categories, in the form of dates or times, and as text (Table 2-1).
When data is numerical we also call it quantitative and when it is categorical we call it qualitative. Variables holding qualitative data are factors, and the different categories are called levels. The levels of a factor are most commonly without order (as in the example of dog, cat, fish in Table 2-1), but factors can also be ordered, when there is an intrinsic order among the levels of the factor (as in the example of good, fair, poor in Table 2-1).

Table 2-1. Types of variables encountered in typical data visualization scenarios.

| Type of variable | Examples | Appropriate scale | Description |
|---|---|---|---|
| Quantitative/numerical continuous | 1.3, 5.7, 83, 1.5 × 10⁻² | Continuous | Arbitrary numerical values. These can be integers, rational numbers, or real numbers. |
| Quantitative/numerical discrete | 1, 2, 3, 4 | Discrete | Numbers in discrete units. These are most commonly but not necessarily integers. For example, the numbers 0.5, 1.0, 1.5 could also be treated as discrete if intermediate values cannot exist in the given dataset. |
| Qualitative/categorical unordered | dog, cat, fish | Discrete | Categories without order. These are discrete and unique categories that have no inherent order. These variables are also called factors. |
| Qualitative/categorical ordered | good, fair, poor | Discrete | Categories with order. These are discrete and unique categories with an order. For example, “fair” always lies between “good” and “poor.” These variables are also called ordered factors. |
| Date or time | Jan. 5 2018, 8:03am | Continuous or discrete | Specific days and/or times. Also generic dates, such as July 4 or Dec. 25 (without year). |
| Text | The quick brown fox jumps over the lazy dog. | None, or discrete | Free-form text. Can be treated as categorical if needed. |

To examine a concrete example of these various types of data, take a look at Table 2-2.
It shows the first few rows of a dataset providing the daily temperature normals (average daily temperatures over a 30-year window) for four US locations. This table contains five variables: month, day, location, station ID, and temperature (in degrees Fahrenheit). Month is an ordered factor, day is a discrete numerical value, location is an unordered factor, station ID is similarly an unordered factor, and temperature is a continuous numerical value.

Table 2-2. First 8 rows of a dataset listing daily temperature normals for four weather stations. Data source: National Oceanic and Atmospheric Administration (NOAA).

| Month | Day | Location | Station ID | Temperature (°F) |
|---|---|---|---|---|
| Jan | 1 | Chicago | USW00014819 | 25.6 |
| Jan | 1 | San Diego | USW00093107 | 55.2 |
| Jan | 1 | Houston | USW00012918 | 53.9 |
| Jan | 1 | Death Valley | USC00042319 | 51.0 |
| Jan | 2 | Chicago | USW00014819 | 25.5 |
| Jan | 2 | San Diego | USW00093107 | 55.3 |
| Jan | 2 | Houston | USW00012918 | 53.8 |
| Jan | 2 | Death Valley | USC00042319 | 51.2 |

Scales Map Data Values onto Aesthetics

To map data values onto aesthetics, we need to specify which data values correspond to which specific aesthetics values. For example, if our graphic has an x axis, then we need to specify which data values fall onto particular positions along this axis. Similarly, we may need to specify which data values are represented by particular shapes or colors. This mapping between data values and aesthetics values is created via scales. A scale defines a unique mapping between data and aesthetics (Figure 2-2). Importantly, a scale must be one-to-one, such that for each specific data value there is exactly one aesthetics value and vice versa. If a scale isn’t one-to-one, then the data visualization becomes ambiguous.

Figure 2-2. Scales link data values to aesthetics. Here, the numbers 1 through 4 have been mapped onto a position scale, a shape scale, and a color scale. For each scale, each number corresponds to a unique position, shape, or color, and vice versa.
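The one-to-one requirement on scales can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions, not a method used in this book: the helper `make_discrete_scale` and the shape names are hypothetical, and the shapes actually drawn in Figure 2-2 may differ. It builds a discrete scale as a pair of lookups, one from data values to aesthetic values and one back, and refuses any pairing that would be ambiguous.

```python
def make_discrete_scale(data_values, aesthetic_values):
    """Return (forward, inverse) lookups for a one-to-one discrete scale.

    Hypothetical helper for illustration only. Raises ValueError if the
    pairing would make the visualization ambiguous.
    """
    data_values = list(data_values)
    aesthetic_values = list(aesthetic_values)
    if len(data_values) != len(aesthetic_values):
        raise ValueError("need exactly one aesthetic value per data value")
    if len(set(data_values)) != len(data_values):
        raise ValueError("duplicate data value: scale would not be one-to-one")
    if len(set(aesthetic_values)) != len(aesthetic_values):
        raise ValueError("duplicate aesthetic value: scale would not be one-to-one")
    forward = dict(zip(data_values, aesthetic_values))
    inverse = {aesthetic: value for value, aesthetic in forward.items()}
    return forward, inverse


# Assumed shape names; the numbers 1 through 4 follow Figure 2-2.
shape_scale, shape_inverse = make_discrete_scale(
    [1, 2, 3, 4], ["square", "diamond", "circle", "triangle"]
)
```

Because both directions are checked, a reader of the resulting graphic could always recover the data value from the shape, which is the sense in which the text calls an ambiguous scale a problem.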
Let’s put things into practice. We can take the dataset shown in Table 2-2, map temperature onto the y axis, day of the year onto the x axis, and location onto color, and visualize these aesthetics with solid lines. The result is a standard line plot showing the temperature normals at the four locations as they change during the year (Figure 2-3).

Figure 2-3 is a fairly standard visualization for a temperature curve and likely the visualization most data scientists would intuitively choose first. However, it is up to us which variables we map onto which scales. For example, instead of mapping temperature onto the y axis and location onto color, we can do the opposite. Because now the key variable of interest (temperature) is shown as color, we need to show sufficiently large colored areas for the colors to convey useful information [Stone, Albers Szafir, and Setlur 2014]. Therefore, for this visualization I have chosen squares instead of lines, one for each month and location, and I have colored them by the average temperature normal for each month (Figure 2-4).

Figure 2-3. Daily temperature normals for four selected locations in the US. Temperature is mapped to the y axis, day of the year to the x axis, and location to line color. Data source: NOAA.

Figure 2-4. Monthly normal mean temperatures for four locations in the US. Data source: NOAA.

I would like to emphasize that Figure 2-4 uses two position scales (month along the x axis and location along the y axis), but neither is a continuous scale. Month is an ordered factor with 12 levels and location is an unordered factor with 4 levels. Therefore, the two position scales are both discrete. For discrete position scales, we generally place the different levels of the factor at an equal spacing along the axis. If the factor is ordered (as is here the case for month), then the levels need to be placed in the appropriate order.
If the factor is unordered (as is here the case for location), then the order is arbitrary, and we can choose any order we want. I have ordered the locations from overall coldest (Chicago) to overall hottest (Death Valley) to generate a pleasant staggering of colors. However, I could have chosen any other order and the figure would have been equally valid.

Both Figures 2-3 and 2-4 used three scales in total, two position scales and one color scale. This is a typical number of scales for a basic visualization, but we can use more than three scales at once. Figure 2-5 uses five scales (two position scales, one color scale, one size scale, and one shape scale), and each scale represents a different variable from the dataset.

Figure 2-5. Fuel efficiency versus displacement, for 32 cars (1973–74 models). This figure uses five separate scales to represent data: (i) the x axis (displacement); (ii) the y axis (fuel efficiency); (iii) the color of the data points (power); (iv) the size of the data points (weight); and (v) the shape of the data points (number of cylinders). Four of the five variables displayed (displacement, fuel efficiency, power, and weight) are numerical continuous. The remaining one (number of cylinders) can be considered to be either numerical discrete or qualitative ordered. Data source: Motor Trend, 1974.

CHAPTER 3 Coordinate Systems and Axes

To make any sort of data visualization, we need to define position scales, which determine where in a graphic different data values are located. We cannot visualize data without placing different data points at different locations, even if we just arrange them next to each other along a line. For regular 2D visualizations, two numbers are required to uniquely specify a point, and therefore we need two position scales. These two scales are usually but not necessarily the x and y axes of the plot.
We also have to specify the relative geometric arrangement of these scales. Conventionally, the x axis runs horizontally and the y axis vertically, but we could choose other arrangements. For example, we could have the y axis run at an acute angle relative to the x axis, or we could have one axis run in a circle and the other run radially. The combination of a set of position scales and their relative geometric arrangement is called a coordinate system.

Cartesian Coordinates

The most widely used coordinate system for data visualization is the 2D Cartesian coordinate system, where each location is uniquely specified by an x and a y value. The x and y axes run orthogonally to each other, and data values are placed in an even spacing along both axes (Figure 3-1). The two axes are continuous position scales, and they can represent both positive and negative real numbers. To fully specify the coordinate system, we need to specify the range of numbers each axis covers. In Figure 3-1, the x axis runs from –2.2 to 3.2 and the y axis runs from –2.2 to 2.2. Any data values between these axis limits are placed at the appropriate respective location in the plot. Any data values outside the axis limits are discarded.

Figure 3-1. Standard Cartesian coordinate system. The horizontal axis is conventionally called x and the vertical axis y. The two axes form a grid with equidistant spacing. Here, both the x and the y grid lines are separated by units of one. The point (2, 1) is located two x units to the right and one y unit above the origin (0, 0). The point (–1, –1) is located one x unit to the left and one y unit below the origin.

Data values usually aren’t just numbers, however. They come with units. For example, if we’re measuring temperature, the values may be measured in degrees Celsius or Fahrenheit.
Similarly, if we’re measuring distance, the values may be measured in kilometers or miles, and if we’re measuring duration, the values may be measured in minutes, hours, or days. In a Cartesian coordinate system, the spacing between grid lines along an axis corresponds to discrete steps in these data units. In a temperature scale, for example, we may have a grid line every 10 degrees Fahrenheit, and in a distance scale, we may have a grid line every 5 kilometers.

A Cartesian coordinate system can have two axes representing two different units. This situation arises quite commonly whenever we’re mapping two different types of variables to x and y. For example, in Figure 2-3, we plotted temperature versus days of the year. The y axis of Figure 2-3 is measured in degrees Fahrenheit, with a grid line every 20 degrees, and the x axis is measured in months, with a grid line at the first of every third month. Whenever the two axes are measured in different units, we can stretch or compress one relative to the other and maintain a valid visualization of the data (Figure 3-2). Which version is preferable may depend on the story we want to convey. A tall and narrow figure emphasizes change along the y axis and a short and wide figure does the opposite. Ideally, we want to choose an aspect ratio that ensures that any important differences in position are noticeable.

Figure 3-2. Daily temperature normals for Houston, TX. Temperature is mapped to the y axis and day of the year to the x axis. Parts (a), (b), and (c) show the same figure in different aspect ratios. All three parts are valid visualizations of the temperature data. Data source: NOAA.

On the other hand, if the x and y axes are measured in the same units, then the grid spacings for the two axes should be equal, such that the same distance along the x or y axis corresponds to the same number of data units.
As an example, we can plot the temperature in Houston, TX, against the temperature in San Diego, CA, for every day of the year (Figure 3-3a). Since the same quantity is plotted along both axes, we need to make sure that the grid lines form perfect squares, as is the case in Figure 3-3a.

Figure 3-3. Daily temperature normals for Houston, TX, plotted versus the respective temperature normals of San Diego, CA. The first days of the months January, April, July, and October are highlighted to provide a temporal reference. (a) Temperatures are shown in degrees Fahrenheit. (b) Temperatures are shown in degrees Celsius. Data source: NOAA.

You may wonder what happens if you change the units of your data. After all, units are arbitrary, and your preferences might be different from somebody else’s. A change in units is a linear transformation, where we add or subtract a number to or from all data values and/or multiply all data values with another number. Fortunately, Cartesian coordinate systems are invariant under such linear transformations. Therefore, you can change the units of your data and the resulting figure will not change as long as you change the axes accordingly. As an example, compare Figures 3-3a and 3-3b. Both show the same data, but in part (a) the temperature units are degrees Fahrenheit and in part (b) they are degrees Celsius. Even though the grid lines are in different locations and the numbers along the axes are different, the two data visualizations look exactly the same.

Nonlinear Axes

In a Cartesian coordinate system, the grid lines along an axis are spaced evenly both in data units and in the resulting visualization. We refer to the position scales in these coordinate systems as linear. While linear scales generally provide an accurate representation of the data, there are scenarios where nonlinear scales are preferred.
In a nonlinear scale, even spacing in data units corresponds to uneven spacing in the visualization, or conversely even spacing in the visualization corresponds to uneven spacing in data units.

The most commonly used nonlinear scale is the logarithmic scale, or log scale for short. Log scales are linear in multiplication, such that a unit step on the scale corresponds to multiplication with a fixed value. To create a log scale, we need to log-transform the data values while exponentiating the numbers that are shown along the axis grid lines. This process is demonstrated in Figure 3-4, which shows the numbers 1, 3.16, 10, 31.6, and 100 placed on linear and log scales. The numbers 3.16 and 31.6 may seem like strange choices, but they were selected because they are exactly halfway between 1 and 10 and between 10 and 100 on a log scale. We can see this by observing that $10^{0.5} = \sqrt{10} \approx 3.16$, and equivalently $3.16 \times 3.16 \approx 10$. Similarly, $10^{1.5} = 10 \times 10^{0.5} \approx 31.6$.

Figure 3-4. Relationship between linear and logarithmic scales. The dots correspond to the data values 1, 3.16, 10, 31.6, and 100, which are evenly spaced numbers on a logarithmic scale. We can display these data points on a linear scale, we can log-transform them and then show them on a linear scale, or we can show them on a logarithmic scale. Importantly, the correct axis title for a logarithmic scale is the name of the variable shown, not the logarithm of that variable.

Mathematically, there is no difference between plotting the log-transformed data on a linear scale or plotting the original data on a logarithmic scale (Figure 3-4). The only difference lies in the labeling for the individual axis ticks and for the axis as a whole. In most cases, the labeling for a logarithmic scale is preferable, because it places less mental burden on the reader to interpret the numbers shown as the axis tick labels.
There is also less of a risk of confusion about the base of the logarithm. When working with log-transformed data, we can get confused about whether the data was transformed using the natural logarithm or the logarithm to base 10. And it’s not uncommon for labeling to be ambiguous, as in log(x), which doesn’t specify a base at all. I recommend that you always verify the base when working with log-transformed data.

When plotting log-transformed data, always specify the base in the labeling of the axis.

Because multiplication on a log scale looks like addition on a linear scale, log scales are the natural choice for any data that has been obtained by multiplication or division. In particular, ratios should generally be shown on a log scale. As an example, I have taken the number of inhabitants in each county in Texas and divided it by the median number of inhabitants across all Texas counties. The resulting ratio is a number that can be larger or smaller than 1. A ratio of exactly 1 implies that the corresponding county has the median number of inhabitants. When visualizing these ratios on a log scale, we can see that the population numbers in Texas counties are symmetrically distributed around the median, and that the most populous counties have over 100 times more inhabitants than the median while the least populous counties have over 100 times fewer inhabitants (Figure 3-5).

Figure 3-5. Population numbers of Texas counties relative to their median value. Select counties are highlighted by name. The dashed line indicates a ratio of 1, corresponding to a county with median population number. The most populous counties have approximately 100 times more inhabitants than the median county, and the least populous counties have approximately 100 times fewer inhabitants than the median county. Data source: 2010 US Decennial Census.
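The symmetry of ratios on a log scale can be checked numerically. The following Python sketch is an illustration only: the function name `log_axis_position` and the round-number ratios are assumptions chosen for the example, not values taken from the Texas county data. It compares how far a ratio of 100 and a ratio of 1/100 lie from the reference value 1, first in linear units and then in base-10 log units.

```python
import math


def log_axis_position(ratio):
    """Position of a positive ratio on a base-10 log axis.

    A ratio of 1 sits at position 0. Hypothetical helper for illustration.
    """
    if ratio <= 0:
        raise ValueError("only positive values can be placed on a log scale")
    return math.log10(ratio)


# Illustrative ratios: 100 times fewer ... 100 times more than the reference.
ratios = [0.01, 0.1, 1, 10, 100]

# On a linear axis, distances from 1 are very unequal for 0.01 and 100.
linear_distance_from_one = [abs(r - 1) for r in ratios]

# On a log axis, "100 times fewer" and "100 times more" are equally far from 1.
log_distance_from_one = [abs(log_axis_position(r)) for r in ratios]
```

On the linear axis the ratio 0.01 lies less than one unit from 1 while the ratio 100 lies 99 units away; on the log axis both lie two units away, which is why a symmetric spread of ratios is only visible on the log scale.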
By contrast, for the same data, a linear scale obscures the differences between a county with median population number and a county with a much smaller population number than median (Figure 3-6).

Figure 3-6. Population sizes of Texas counties relative to their median value. By displaying a ratio on a linear scale, we have overemphasized ratios > 1 and have obscured ratios < 1. As a general rule, ratios should not be displayed on a linear scale. Data source: 2010 US Decennial Census.

On a log scale, the value 1 is the natural midpoint, similar to the value 0 on a linear scale. We can think of values greater than 1 as representing multiplications and values less than 1 divisions. For example, we can write 10 = 1 × 10 and 0.1 = 1/10. The value 0, on the other hand, can never appear on a log scale. It lies infinitely far from 1. One way to see this is to consider that log(0) = –∞. Or, alternatively, consider that to go from 1 to 0, it takes either an infinite number of divisions by a finite value (e.g., 1/10/10/10/10/10/10⋯ = 0) or one division by infinity (i.e., 1/∞ = 0).

Log scales are frequently used when the dataset contains numbers of very different magnitudes. For the Texas counties shown in Figures 3-5 and 3-6, the most populous one (Harris) had 4,092,459 inhabitants in the 2010 US Census while the least populous one (Loving) had 82. So, a log scale would be appropriate even if we hadn’t divided the population numbers by their median to turn them into ratios. But what would we do if there was a county with 0 inhabitants? This county could not be shown on the logarithmic scale, because it would lie at minus infinity. In this situation, the recommendation is sometimes to use a square-root scale, which uses a square-root transformation instead of a log transformation (Figure 3-7). Just like a log scale, a square-root scale compresses larger numbers into a smaller range, but unlike a log scale, it allows for the presence of 0.
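The difference between the two transformations at 0 can be seen in a short sketch. The Python below is a minimal illustration with hypothetical helper names (`sqrt_position`, `log_position`); returning `None` for values that cannot be placed on a log scale is an assumption made for this example, not a convention from any plotting library. The data values are the squares 0 through 49 used in Figure 3-7.

```python
import math


def sqrt_position(value):
    """Position of a non-negative value on a square-root scale."""
    if value < 0:
        raise ValueError("square-root scale requires non-negative values")
    return math.sqrt(value)


def log_position(value):
    """Position on a base-10 log scale, or None if the value cannot be shown."""
    if value <= 0:
        return None  # log(0) would lie at minus infinity
    return math.log10(value)


# Squares of the integers 0 to 7, as in Figure 3-7.
values = [0, 1, 4, 9, 16, 25, 36, 49]

sqrt_positions = [sqrt_position(v) for v in values]  # evenly spaced, includes 0
log_positions = [log_position(v) for v in values]    # first entry has no position
```

Both transformations pull large values closer together, but only the square-root version assigns a finite position to 0.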
Figure 3-7. Relationship between linear and square-root scales. The dots correspond to the data values 0, 1, 4, 9, 16, 25, 36, and 49, which are evenly spaced numbers on a square-root scale, since they are the squares of the integers from 0 to 7. We can display these data points on a linear scale, we can square-root-transform them and then show them on a linear scale, or we can show them on a square-root scale.

I see two problems with square-root scales. First, while on a linear scale one unit step corresponds to addition or subtraction of a constant value, and on a log scale it corresponds to multiplication with or division by a constant value, no such rule exists for a square-root scale. The meaning of a unit step on a square-root scale depends on the scale value at which we’re starting. Second, it is unclear how to best place axis ticks on a square-root scale. To obtain evenly spaced ticks, we would have to place them at squares, but axis ticks at, for example, positions 0, 4, 16, 36, and 64 (every second square) would be unintuitive. Alternatively, we could place them at linear intervals (10, 20, 30, etc.), but this would result in either too few axis ticks near the low end of the scale or too many near the high end. In Figure 3-7, I have placed the axis ticks at positions 0, 1, 5, 10, 20, 30, 40, and 50 on the square-root scale. These values are arbitrary but provide a reasonable covering of the data range.

Despite these problems with square-root scales, they are valid position scales and I do not discount the possibility that they have appropriate applications. For example, just like a log scale is the natural scale for ratios, one could argue that the square-root scale is the natural scale for data that comes in squares. One scenario in which data is naturally squares is in the context of geographic regions.
If we show the areas of geographic regions on a square-root scale, we are highlighting the regions’ linear extent from east to west or north to south. These extents could be relevant, for example, if we were wondering how long it might take to drive across a region. Figure 3-8 shows the areas of states in the US Northeast on both a linear and a square-root scale. Even though the areas of these states are quite different (Figure 3-8a), the relative time it will take to drive across each state is more accurately represented by the figure on the square-root scale (Figure 3-8b) than the figure on the linear scale (Figure 3-8a).

Figure 3-8. Areas of northeastern US states. (a) Areas shown on a linear scale. (b) Areas shown on a square-root scale. Data source: Google.

Coordinate Systems with Curved Axes

All the coordinate systems we have encountered so far have used two straight axes positioned at a right angle to each other, even if the axes themselves established a nonlinear mapping from data values to positions. There are other coordinate systems, however, where the axes themselves are curved. In particular, in the polar coordinate system, we specify positions via an angle and a radial distance from the origin, and therefore the angle axis is circular (Figure 3-9).

Polar coordinates can be useful for data of a periodic nature, such that data values at one end of the scale can be logically joined to data values at the other end. For example, consider the days in a year. December 31st is the last day of the year, but it is also one day before the first day of the year. If we want to show how some quantity varies over the year, it can be appropriate to use polar coordinates with the angle coordinate specifying each day. Let’s apply this concept to the temperature normals of Figure 2-3. Because temperature normals are average temperatures that are not tied to any specific year, Dec. 31st can be thought of as 365 days later than Jan.
1st (temperature normals include Feb. 29th) and also 1 day earlier. By plotting the temperature normals in a polar coordinate system, we emphasize this cyclical property they have (Figure 3-10). In comparison to Figure 2-3, the polar version highlights how similar the temperatures are in Death Valley, Houston, and San Diego from late fall to early spring. In the Cartesian coordinate system, this fact is obscured because the temperature values in late December and in early January are shown in opposite parts of the figure and therefore don’t form a single visual unit.

Figure 3-9. Relationship between Cartesian and polar coordinates. (a) Three data points shown in a Cartesian coordinate system. (b) The same three data points shown in a polar coordinate system. We have taken the x coordinates from part (a) and used them as angular coordinates and the y coordinates from part (a) and used them as radial coordinates. The circular axis runs from 0 to 4 in this example, and therefore x = 0 and x = 4 are the same locations in this coordinate system.

Figure 3-10. Daily temperature normals for four selected locations in the US, shown in polar coordinates. The radial distance from the center point indicates the daily temperature in Fahrenheit, and the days of the year are arranged counterclockwise starting with Jan. 1st at the 6:00 position. Data source: NOAA.

A second setting in which we encounter curved axes is in the context of geospatial data, i.e., maps. Locations on the globe are specified by their longitude and latitude. But because the earth is a sphere, drawing latitude and longitude as Cartesian axes is misleading and not recommended (Figure 3-11). Instead, we use various types of nonlinear projections that attempt to minimize artifacts and that strike different balances between conserving areas or angles relative to the true shape lines on the globe (Figure 3-11).
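To get a feeling for why plain longitude and latitude axes distort areas, a rough calculation helps. The sketch below is an illustration under stated assumptions, not a description of how the maps in Figure 3-11 were made: it treats the earth as a perfect sphere, and the function name `east_west_stretch` is hypothetical. On a sphere, one degree of longitude covers a ground distance proportional to the cosine of the latitude, whereas a Cartesian longitude and latitude map draws every degree of longitude with the same width. East-west distances, and with them areas, are therefore inflated by a factor of 1/cos(latitude).

```python
import math


def east_west_stretch(latitude_deg):
    """Factor by which a Cartesian longitude/latitude map stretches east-west
    distances (and inflates areas) at a given latitude, for a spherical earth.
    """
    if abs(latitude_deg) >= 90:
        raise ValueError("stretch is unbounded at the poles")
    return 1.0 / math.cos(math.radians(latitude_deg))


# Stretch factors at a few illustrative latitudes (equator to high latitudes).
stretch_by_latitude = {lat: east_west_stretch(lat) for lat in (0, 30, 60, 80)}
```

The factor is 1 at the equator, about 2 at a latitude of 60 degrees, and grows without bound toward the poles, which matches the visual impression that high-latitude land masses look oversized in the Cartesian panel.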
Figure 3-11. Map of the world, shown in four different projections. The Cartesian longitude and latitude system maps the longitude and latitude of each location onto a regular Cartesian coordinate system. This mapping causes substantial distortions in both areas and angles relative to their true values on the 3D globe. The interrupted Goode homolosine projection perfectly represents true surface areas, at the cost of dividing some land masses into separate pieces, most notably Greenland and Antarctica. The Robinson projection and the Winkel tripel projection both strike a balance between angular and area distortions, and they are commonly used for maps of the entire globe.

CHAPTER 4 Color Scales

There are three fundamental use cases for color in data visualizations: we can use color to distinguish groups of data from each other, to represent data values, and to highlight. The types of colors we use and the way in which we use them are quite different for these three cases.

Color as a Tool to Distinguish

We frequently use color as a means to distinguish discrete items or groups that do not have an intrinsic order, such as different countries on a map or different manufacturers of a certain product. In this case, we use a qualitative color scale. Such a scale contains a finite set of specific colors that are chosen to look clearly distinct from each other while also being equivalent to each other. The second condition requires that no one color should stand out relative to the others. Also, the colors should not create the impression of an order, as would be the case with a sequence of colors that get successively lighter. Such colors would create an apparent order among the items being colored, which by definition have no order.

Many appropriate qualitative color scales are readily available. Figure 4-1 shows three representative examples.
In particular, the ColorBrewer project provides a nice selection of qualitative color scales, including both fairly light and fairly dark colors [Brewer 2017].

Figure 4-1. Example qualitative color scales. The Okabe Ito scale is the default scale used throughout this book [Okabe and Ito 2008]. The ColorBrewer Dark2 scale is provided by the ColorBrewer project [Brewer 2017]. The ggplot2 hue scale is the default qualitative scale in the widely used plotting software ggplot2.

As an example of how we use qualitative color scales, consider Figure 4-2. It shows the percent population growth from 2000 to 2010 in US states. I have arranged the states in order of their population growth, and I have colored them by geographic region. This coloring highlights that states in the same regions have experienced similar population growth. In particular, states in the West and the South have seen the largest population increases, whereas states in the Midwest and the Northeast have grown much less.

Figure 4-2. Population growth in the US from 2000 to 2010. States in the West and South have seen the largest increases, whereas states in the Midwest and Northeast have seen much smaller increases (or even, in the case of Michigan, a decrease). Data source: US Census Bureau.

Color to Represent Data Values

Color can also be used to represent quantitative data values, such as income, temperature, or speed. In this case, we use a sequential color scale. Such a scale contains a sequence of colors that clearly indicate which values are larger or smaller than which other ones, and how distant two specific values are from each other. The second point implies that the color scale needs to be perceived to vary uniformly across its entire range.

Sequential scales can be based on a single hue (e.g., from dark blue to light blue) or on multiple hues (e.g., from dark red to light yellow) (Figure 4-3).
Multihue scales tend to follow color gradients that can be seen in the natural world, such as dark red, green, or blue to light yellow, or dark purple to light green. The reverse (e.g., dark yellow to light blue) looks unnatural and doesn’t make a useful sequential scale.

Figure 4-3. Example sequential color scales. The ColorBrewer Blues scale is a monochromatic scale that varies from dark to light blue. The Heat and Viridis scales are multihue scales that vary from dark red to light yellow and from dark blue via green to light yellow, respectively.

Representing data values as colors is particularly useful when we want to show how the data values vary across geographic regions. In this case, we can draw a map of the geographic regions and color them by the data values. Such maps are called choropleths. Figure 4-4 shows an example where I have mapped annual median income within each county in Texas onto a map of those counties.

In some cases, we need to visualize the deviation of data values in one of two directions relative to a neutral midpoint. One straightforward example is a dataset containing both positive and negative numbers. We may want to show those with different colors, so that it is immediately obvious whether a value is positive or negative as well as how far in either direction it deviates from zero. The appropriate color scale in this situation is a diverging color scale. We can think of a diverging scale as two sequential scales stitched together at a common midpoint, which usually is represented by a light color (Figure 4-5). Diverging scales need to be balanced, so that the progression from light colors in the center to dark colors on the outside is approximately the same in either direction. Otherwise, the perceived magnitude of a data value would depend on whether it fell above or below the midpoint value.

Figure 4-4. Median annual income in Texas counties.
The highest median incomes are seen in major Texas metropolitan areas, in particular near Houston and Dallas. No median income estimate is available for Loving County in West Texas, and therefore that county is shown in gray. Data source: 2015 Five-Year American Community Survey.

Figure 4-5. Example diverging color scales. Diverging scales can be thought of as two sequential scales stitched together at a common midpoint color. Common color choices for diverging scales include brown to greenish blue, pink to yellow-green, and blue to red.

As an example application of a diverging color scale, consider Figure 4-6, which shows the percentage of people identifying as white in Texas counties. Even though percentage is always a positive number, a diverging scale is justified here, because 50% is a meaningful midpoint value. Numbers above 50% indicate that whites are in the majority and numbers below 50% indicate the opposite. The visualization clearly shows in which counties whites are in the majority, in which they are in the minority, and in which whites and nonwhites occur in approximately equal proportions.

Figure 4-6. Percentage of people identifying as white in Texas counties. Whites are in the majority in North and East Texas but not in South or West Texas. Data source: 2010 US Decennial Census.

Color as a Tool to Highlight

Color can also be an effective tool to highlight specific elements in the data. There may be specific categories or values in the dataset that carry key information about the story we want to tell, and we can strengthen the story by emphasizing the relevant figure elements to the reader. An easy way to achieve this emphasis is to color these figure elements in a color or set of colors that vividly stand out against the rest of the figure.
This effect can be achieved with accent color scales, which are color scales that contain both a set of subdued colors and a matching set of stronger, darker, and/or more saturated colors (Figure 4-7).

Figure 4-7. Example accent color scales, each with four base colors and three accent colors. Accent color scales can be derived in several different ways: (top) we can take an existing color scale (e.g., the Okabe Ito scale, Figure 4-1) and lighten and/or partially desaturate some colors while darkening others; (middle) we can take gray values and pair them with colors; (bottom) we can use an existing accent color scale (e.g., the one from the ColorBrewer project).

As an example of how the same data can support differing stories with different coloring approaches, I have created a variant of Figure 4-2 where now I highlight two specific states, Texas and Louisiana (Figure 4-8). Both states are in the South, they are immediate neighbors, and yet one state (Texas) was the fifth fastest growing state within the US from 2000 to 2010 whereas the other was the third slowest growing.

Figure 4-8. From 2000 to 2010, the two neighboring southern states, Texas and Louisiana, experienced among the highest and lowest population growth across the US. Data source: US Census Bureau.

When working with accent colors, it is critical that the baseline colors do not compete for attention. Notice how drab the baseline colors are in Figure 4-8, yet they work well to support the accent color. It is easy to make the mistake of using baseline colors that are too colorful, so that they end up competing for the reader’s attention against the accent colors. There is an easy remedy, however: just remove all color from all elements in the figure except the highlighted data categories or points. An example of this strategy is provided in Figure 4-9.

Figure 4-9.
Track athletes are among the shortest and leanest of male professional athletes participating in popular sports. Data source: [Telford and Cunningham 1991].

CHAPTER 5
Directory of Visualizations

This chapter provides a quick visual overview of the various plots and charts that are commonly used to visualize different types of data. It is meant both to serve as a table of contents, in case you are looking for a particular visualization whose name you may not know, and as a source of inspiration, if you need to find alternatives to the figures you routinely make.

Amounts

The most common approach to visualizing amounts (i.e., numerical values shown for some set of categories) is using bars, either vertically or horizontally arranged (Chapter 6). However, instead of using bars, we can also place dots at the location where the corresponding bar would end (Chapter 6).

If there are two or more sets of categories for which we want to show amounts, we can group or stack the bars (Chapter 6). We can also map the categories onto the x and y axes and show amounts by color, via a heatmap (Chapter 6).

Distributions

Histograms and density plots (Chapter 7) provide the most intuitive visualizations of a distribution, but both require arbitrary parameter choices and can be misleading. Cumulative densities and quantile-quantile (q-q) plots (Chapter 8) always represent the data faithfully but can be more difficult to interpret.

Boxplots, violin plots, strip charts, and sina plots are useful when we want to visualize many distributions at once and/or if we are primarily interested in overall shifts among the distributions (see “Visualizing Distributions Along the Vertical Axis” on page 81).
Stacked histograms and overlapping densities allow a more in-depth comparison of a smaller number of distributions, though stacked histograms can be difficult to interpret and are best avoided (see “Visualizing Multiple Distributions at the Same Time” on page 64). Ridgeline plots can be a useful alternative to violin plots and are often useful when visualizing very large numbers of distributions or changes in distributions over time (see “Visualizing Distributions Along the Horizontal Axis” on page 88).

Proportions

Proportions can be visualized as pie charts, side-by-side bars, or stacked bars (Chapter 10). As for amounts, when we visualize proportions with bars, the bars can be arranged either vertically or horizontally. Pie charts emphasize that the individual parts add up to a whole and highlight simple fractions. However, the individual pieces are more easily compared in side-by-side bars. Stacked bars look awkward for a single set of proportions, but can be useful when comparing multiple sets of proportions.

When visualizing multiple sets of proportions or changes in proportions across conditions, pie charts tend to be space-inefficient and often obscure relationships. Grouped bars work well as long as the number of conditions compared is moderate, and stacked bars can work for large numbers of conditions. Stacked densities (Chapter 10) are appropriate when the proportions change along a continuous variable.

When proportions are specified according to multiple grouping variables, mosaic plots, treemaps, or parallel sets are useful visualization approaches (Chapter 11). Mosaic plots assume that every level of one grouping variable can be combined with every level of another grouping variable, whereas treemaps do not make such an assumption. Treemaps work well even if the subdivisions of one group are entirely distinct from the subdivisions of another.
Parallel sets work better than either mosaic plots or treemaps when there are more than two grouping variables.

x–y relationships

Scatterplots (Chapter 12) represent the archetypical visualization when we want to show one quantitative variable relative to another. If we have three quantitative variables, we can map one onto the dot size, creating a variant of the scatterplot called a bubble chart. For paired data, where the variables along the x and y axes are measured in the same units, it is generally helpful to add a line indicating x = y (see “Paired Data” on page 127). Paired data can also be shown as a slopegraph of paired points connected by straight lines.

For large numbers of points, regular scatterplots can become uninformative due to overplotting. In this case, contour lines, 2D bins, or hex bins may provide an alternative (Chapter 18). When we want to visualize more than two quantities, on the other hand, we may choose to plot correlation coefficients in the form of a correlogram instead of the underlying raw data (see “Correlograms” on page 121).

When the x axis represents time or a strictly increasing quantity such as a treatment dose, we commonly draw line graphs (Chapter 13). If we have a temporal sequence of two response variables we can draw a connected scatterplot, where we first plot the two response variables in a scatterplot and then connect dots corresponding to adjacent time points (see “Time Series of Two or More Response Variables” on page 138). We can use smooth lines to represent trends in a larger dataset (Chapter 14).

Geospatial Data

The primary mode of showing geospatial data is in the form of a map (Chapter 15). A map takes coordinates on the globe and projects them onto a flat surface, such that shapes and distances on the globe are approximately represented by shapes and distances in the 2D representation.
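To make the projection step concrete, here is a minimal Python sketch of two textbook ways to turn longitude and latitude into flat x and y coordinates, assuming a perfectly spherical Earth. The function names, the radius constant, and the choice of these two projections are assumptions made for illustration only; they are not taken from the text, and real mapping work would normally rely on a dedicated projection library.

```python
import math

# Mean Earth radius in kilometers (assumed value, spherical approximation).
EARTH_RADIUS_KM = 6371.0


def equirectangular(lon_deg, lat_deg, radius=EARTH_RADIUS_KM):
    """Scale longitude and latitude (in degrees) directly to x and y."""
    return radius * math.radians(lon_deg), radius * math.radians(lat_deg)


def mercator(lon_deg, lat_deg, radius=EARTH_RADIUS_KM):
    """Spherical Mercator: x as above, y stretched toward the poles."""
    lat = math.radians(lat_deg)
    x = radius * math.radians(lon_deg)
    y = radius * math.log(math.tan(math.pi / 4 + lat / 2))
    return x, y


# Compare both projections at the same longitude and increasing latitude.
for lat_deg in (0.0, 30.0, 60.0):
    print(lat_deg, equirectangular(30.0, lat_deg), mercator(30.0, lat_deg))
```

Both functions agree on x and differ only in y, where the Mercator value grows faster with latitude. That difference is one simple way to see why any flat map represents shapes and distances only approximately.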
In addition, we can show data values in different regions by coloring those regions in the map according to the data. Such a map is called a choropleth (see “Choropleth Mapping” on page 172). In some cases, it may be helpful to distort the different regions according to some other quantity (e.g., population number) or simplify each region into a square. Such visualizations are called cartograms (see “Cartograms” on page 176).

Uncertainty

Error bars are meant to indicate the range of likely values for some estimate or measurement. They extend horizontally and/or vertically from some reference point representing the estimate or measurement (Chapter 16). Reference points can be shown in various ways, such as by dots or by bars. Graded error bars show multiple ranges at the same time, where each range corresponds to a different degree of confidence. They are in effect multiple error bars with different line thicknesses plotted on top of each other.

To achieve a more detailed visualization than is possible with error bars or graded error bars, we can visualize the actual confidence or posterior distributions (Chapter 16). Confidence strips provide a visual sense of uncertainty but are difficult to read accurately. Eyes and half-eyes combine error bars with approaches to visualize distributions (violins and ridgelines, respectively), and thus show both precise ranges for some confidence levels and the overall uncertainty distribution. A quantile dot plot can serve as an alternative visualization of an uncertainty distribution (see “Framing Probabilities as Frequencies” on page 181). Because it shows the distribution in discrete units, the quantile dot plot is not as precise but can be easier to read than the continuous distribution shown by a violin or ridgeline plot.
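As a rough illustration of how a quantile dot plot breaks an uncertainty distribution into discrete units, the following Python sketch assumes the uncertainty is normally distributed and represents it with a fixed number of equally likely dots. The function names `quantile_dots` and `stack_dots`, the bin width, and the example mean and standard deviation are hypothetical choices made for this sketch; they do not come from the text.

```python
import math
from collections import Counter
from statistics import NormalDist


def quantile_dots(dist, n_dots):
    """Return n_dots quantiles of dist, one per equally likely slice.

    Dot i sits at probability (i + 0.5) / n_dots, so each dot stands
    for a 1 / n_dots chance.
    """
    return [dist.inv_cdf((i + 0.5) / n_dots) for i in range(n_dots)]


def stack_dots(values, bin_width):
    """Count the dots per bin; each count is the height of one dot stack."""
    return Counter(math.floor(v / bin_width) for v in values)


# Assumed example: an estimate of 10 with a standard deviation of 2.
BIN_WIDTH = 1.0
dist = NormalDist(mu=10.0, sigma=2.0)
dots = quantile_dots(dist, n_dots=20)
stacks = stack_dots(dots, bin_width=BIN_WIDTH)

# Crude text rendering: one row per bin, one "o" per dot in that bin.
for bin_index in sorted(stacks):
    print(f"{bin_index * BIN_WIDTH:5.1f}  " + "o" * stacks[bin_index])
```

With 20 dots, each dot corresponds to a 5% chance, so a reader can estimate probabilities by counting dots rather than judging areas under a curve. Fewer dots make counting easier at the cost of precision.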
For smooth line graphs, the equivalent of an error bar is a confidence band (see “Visualizing the Uncertainty of Curve Fits” on page 197). It shows a range of values the line might pass through at a given confidence level. Like with error bars, we can draw graded confidence bands that show multiple confidence levels at once. We can also show individual fitted draws in lieu of or in addition to the confidence bands.

CHAPTER 6
Visualizing Amounts

In many scenarios, we are interested in the magnitude of some set of numbers. For example, we might want to visualize the total sales volume of different brands of cars, or the total number of people living in different cities, or the age of Olympians performing different sports. In all these cases, we have a set of categories (e.g., brands of cars, cities, or sports) and a quantitative value for each category. I refer to these cases as visualizing amounts, because the main emphasis in these visualizations will be on the magnitude of the quantitative values. The standard visualization in this scenario is the bar plot, which has several variations, including simple bars as well as grouped and stacked bars. Alternatives to the bar plot are the dot plot and the heatmap.

Bar Plots

To motivate the concept of a bar plot, consider the total ticket sales for the most popular movies on a given weekend. Table 6-1 shows the top five highest-grossing films for the weekend before Christmas in 2017. Star Wars: The Last Jedi was by far the most popular movie on that weekend, outselling the fourth- and fifth-ranked movies, The Greatest Showman and Ferdinand, by almost a factor of 10.

Table 6-1. Highest-grossing movies for the weekend of December 22–24, 2017. Data source: Box Office Mojo. Used with permission.
Rank  Title                           Weekend gross
1     Star Wars: The Last Jedi        $71,565,498
2     Jumanji: Welcome to the Jungle  $36,169,328
3     Pitch Perfect 3                 $19,928,525
4     The Greatest Showman            $8,805,843
5     Ferdinand                       $7,316,746

This kind of data is commonly visualized with vertical bars. For each movie, we draw a bar that starts at zero and extends all the way to the dollar value for that movie’s weekend gross (Figure 6-1). This visualization is called a bar plot or bar chart.

Figure 6-1. Highest-grossing movies for the weekend of December 22–24, 2017, displayed as a bar plot. Data source: Box Office Mojo. Used with permission.

One problem we commonly encounter with vertical bars is that the labels identifying each bar take up a lot of horizontal space. In fact, I had to make Figure 6-1 fairly wide and space out the bars so that I could place the movie titles underneath. To save horizontal space, we could place the bars closer together and rotate the labels (Figure 6-2). However, I am not a big proponent of rotated labels. I find the resulting plots awkward and difficult to read. And, in my experience, whenever the labels are too long to place horizontally, they also don’t look good rotated.

Figure 6-2. Highest-grossing movies for the weekend of December 22–24, 2017, displayed as a bar plot with rotated axis tick labels. Rotated axis tick labels tend to be difficult to read and require awkward space use underneath the plot. For these reasons, I generally consider plots with rotated tick labels to be ugly. Data source: Box Office Mojo. Used with permission.

The better solution for long labels is usually to swap the x and y axes, so that the bars run horizontally (Figure 6-3). After swapping the axes, we obtain a compact figure in which all visual elements, including all text, are horizontally oriented. As a result, the figure is much easier to read than Figure 6-2 or even Figure 6-1.

Figure 6-3.
Highest-grossing movies for the weekend of December 22–24, 2017, displayed as a horizontal bar plot. Data source: Box Office Mojo. Used with permission.

Regardless of whether we place bars vertically or horizontally, we need to pay attention to the order in which the bars are arranged. I often see bar plots where the bars are arranged arbitrarily or by some criterion that is not meaningful in the context of the figure. Some plotting programs arrange bars by default in alphabetical order of the labels, and other similarly arbitrary arrangements are possible (Figure 6-4). In general, the resulting figures are more confusing and less intuitive than figures where bars are arranged in order of their size.

We should only rearrange bars, however, when there is no natural ordering to the categories the bars represent. Whenever there is a natural ordering (i.e., when our categorical variable is an ordered factor), we should retain that ordering in the visualization. For example, Figure 6-5 shows the median annual income in the US by age groups. In this case, the bars should be arranged in order of increasing age. Sorting by bar height while shuffling the age groups makes no sense (Figure 6-6).

Figure 6-4. Highest-grossing movies for the weekend of December 22–24, 2017, displayed as a horizontal bar plot. Here, the bars have been placed in descending order of the lengths of the movie titles. This arrangement of bars is arbitrary, doesn’t serve a meaningful purpose, and makes the resulting figure much less intuitive than Figure 6-3. Data source: Box Office Mojo. Used with permission.

Figure 6-5. 2016 median US annual household income versus age group. The 45-to-54-year age group has the highest median income. Data source: US Census Bureau.

Figure 6-6. 2016 median US annual household income versus age group, sorted by income.
While this order of bars looks visually appealing, the order of the age groups is now confusing. Data source: US Census Bureau.

Pay attention to the bar order. If the bars represent unordered categories, order them by ascending or descending data values.

Grouped and Stacked Bars

All the examples from the previous section showed how a quantitative amount varied with respect to one categorical variable. Frequently, however, we are interested in two categorical variables at the same time. For example, the US Census Bureau provides median income levels broken down by both age and race. We can visualize this dataset with a grouped bar plot (Figure 6-7). In a grouped bar plot, we draw a group of bars at each position along the x axis, determined by one categorical variable, and then we draw bars within each group according to the other categorical variable.

Grouped bar plots show a lot of information at once, and they can be confusing. In fact, even though I have not labeled Figure 6-7 as bad or ugly, I find it difficult to read. In particular, it is difficult to compare median incomes across age groups for a given racial group. So, this figure is only appropriate if we are primarily interested in the differences in income levels among racial groups, separately for specific age groups. If we care more about the overall pattern of income levels among racial groups, it may be preferable to show race along the x axis and show ages as distinct bars within each racial group (Figure 6-8).

Figure 6-7. 2016 median US annual household income versus age group and race. Age groups are shown along the x axis, and for each age group there are four bars, corresponding to the median income of Asian, white, Hispanic, and black people, respectively. Data source: US Census Bureau.

Figure 6-8. 2016 median US annual household income versus age group and race.
In contrast to Figure 6-7, now race is shown along the x axis, and for each race we show seven bars according to the seven age groups. Data source: US Census Bureau.

Both Figures 6-7 and 6-8 encode one categorical variable by position along the x axis and the other by bar color. And in both cases, the encoding by position is easy to read while the encoding by bar color requires more mental effort, as we have to mentally match the colors of the bars against the colors in the legend. We can avoid this added mental effort by showing four separate regular bar plots rather than one grouped bar plot (Figure 6-9). Which of these various options we choose is ultimately a matter of taste. I would likely choose Figure 6-9, because it circumvents the need for different bar colors.

Figure 6-9. 2016 median US annual household income versus age group and race. Instead of displaying this data as a grouped bar plot, as in Figures 6-7 and 6-8, we now show the data as four separate regular bar plots. This choice has the advantage that we don’t need to encode either categorical variable by bar color. Data source: US Census Bureau.

Instead of drawing groups of bars side-by-side, it is sometimes preferable to stack bars on top of each other. Stacking is useful when the sum of the amounts represented by the individual stacked bars is in itself a meaningful amount. So, while it would not make sense to stack the median income values of Figure 6-7 (the sum of two median income values is not a meaningful value), it might make sense to stack the weekend gross values of Figure 6-1 (the sum of the weekend gross values of two movies is the total gross for the two movies combined). Stacking is also appropriate when the individual bars re
