Summary

This lecture explores data visualization techniques. It discusses the concept of data visualization, including different types of graphical representations and the use of aesthetics such as color, shape, and size to represent data. It explains the role of scales in mapping data values to visual elements, along with coordinate systems and axis units. The lecture also works through examples based on weather-station temperature data and a car fuel-efficiency dataset.

Full Transcript

Data visualization means a graphical or pictorial representation of data using graphs, charts, and similar forms. The purpose of plotting data is to visualize variation or to show relationships between variables. Data visualization is part art and part science. The challenge is to get the art right without getting the science wrong, and vice versa.

A data visualization first and foremost has to convey the data accurately. If one number is twice as large as another, but in the visualization they look to be about the same, then the visualization is wrong. If a figure contains jarring colors, imbalanced visual elements, or other features that distract, then the viewer will find it harder to inspect the figure and interpret it correctly.

Figures with problems fall into three categories:

Ugly: A figure that has aesthetic problems but otherwise is clear and informative.
Bad: A figure that has problems related to perception; it may be unclear, confusing, overly complicated, or deceiving.
Wrong: A figure that has problems related to mathematics; it is objectively incorrect.

Figure: (a) A bar plot showing three values (A = 3, B = 5, and C = 4). (b) An ugly version of part (a). While the plot is technically correct, it is not aesthetically pleasing. The colors are too bright and not useful. The background grid is too prominent. The text is displayed using three different fonts in three different sizes. (c) A bad version of part (a). Each bar is shown with its own y axis scale. Because the scales don’t align, the figure is misleading. (d) A wrong version of part (a). Without an explicit y axis scale, the numbers represented by the bars cannot be ascertained. The bars appear to be of lengths 1, 3, and 2, even though the values displayed are meant to be 3, 5, and 4.

When we visualize data, we take data values and convert them in a systematic and logical way into the visual elements that make up the final graphic. All data visualizations map data values into quantifiable features of the resulting graphic. We refer to these features as aesthetics.
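
The bar-plot example above can be checked numerically. What follows is a minimal sketch, not a method from the lecture: the function name `lengths_match_values` and its tolerance are illustrative assumptions. It tests whether drawn bar lengths are proportional to the data values, which is the property that the wrong version in part (d) violates.

```python
import math

def lengths_match_values(values, drawn_lengths, rel_tol=1e-9):
    """Return True if every drawn length is the same multiple of its data value.

    Hypothetical helper for illustration; assumes all values are nonzero.
    """
    ratios = [length / value for value, length in zip(values, drawn_lengths)]
    return all(math.isclose(r, ratios[0], rel_tol=rel_tol) for r in ratios)

values = [3, 5, 4]            # A, B, C from part (a)
correct_lengths = [3, 5, 4]   # bars drawn to scale
wrong_lengths = [1, 3, 2]     # apparent bar lengths in part (d)

print(lengths_match_values(values, correct_lengths))  # True
print(lengths_match_values(values, wrong_lengths))    # False
```

In part (d), bar B appears three times as long as bar A, although 5 is less than twice 3, so the check fails.
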
Aesthetics describe every aspect of a given graphical element. A critical component of every graphical element is of course its position, which describes where the element is located. In standard 2D graphics, we describe positions by an x and y value, but other coordinate systems and one- or three-dimensional visualizations are possible. Next, all graphical elements have a shape, a size, and a color. Finally, if we want to display text, we may have to specify font family, font face, and font size, and if graphical objects overlap, we may have to specify whether they are partially transparent.

Figure: Commonly used aesthetics in data visualization: position, shape, size, color, line width, line type. Some of these aesthetics can represent both continuous and discrete data (position, size, line width, color), while others can usually only represent discrete data (shape, line type).

All aesthetics fall into one of two groups: those that can represent continuous data and those that cannot. Continuous data values are values for which arbitrarily fine intermediates exist. For example, time duration is a continuous value. Between any two durations, say 50 seconds and 51 seconds, there are arbitrarily many intermediates, such as 50.5 seconds, 50.51 seconds, 50.50001 seconds, and so on. By contrast, the number of persons in a room is a discrete value, because a room can hold 5 or 6 persons but not 5.5.

In addition to continuous and discrete numerical values, data can come in the form of discrete categories, in the form of dates or times, and as text. When data is numerical we also call it quantitative, and when it is categorical we call it qualitative. Variables holding qualitative data are factors, and the different categories are called levels. The levels of a factor are most commonly without order (as in the example of dog, cat, fish), but factors can also be ordered, when there is an intrinsic order among the levels of the factor (as in the example of good, fair, poor).
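
The distinction between unordered and ordered factors can be sketched in code. This is an illustrative sketch only: the `Factor` class and its `rank` method are hypothetical names, not part of the lecture or of any library, and the sketch assumes that the levels of an ordered factor are listed from lowest to highest.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Factor:
    """A qualitative variable: a fixed set of levels, optionally ordered.

    Hypothetical class for illustration. For an ordered factor, `levels`
    is assumed to be listed from lowest to highest.
    """
    levels: tuple
    ordered: bool = False

    def rank(self, level):
        """Position of a level within an ordered factor (0 = lowest)."""
        if not self.ordered:
            raise ValueError("levels of an unordered factor have no rank")
        return self.levels.index(level)

pets = Factor(levels=("dog", "cat", "fish"))                     # unordered
quality = Factor(levels=("poor", "fair", "good"), ordered=True)  # ordered

print(quality.rank("good") > quality.rank("fair"))  # True
```

Asking for the rank of a level in `pets` raises an error, which mirrors the point that comparing dog, cat, and fish by order is not meaningful.
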
The following table contains five variables: month, day, location, station ID, and temperature (in degrees Fahrenheit). Month is an ordered factor, day is a discrete numerical value, location is an unordered factor, station ID is similarly an unordered factor, and temperature is a continuous numerical value.

Table: First 8 rows of a dataset listing daily temperature normals for four weather stations.

To map data values onto aesthetics, we need to specify which data values correspond to which specific aesthetics values. For example, if our graphic has an x axis, then we need to specify which data values fall onto particular positions along this axis. Similarly, we may need to specify which data values are represented by particular shapes or colors. This mapping between data values and aesthetics values is created via scales. A scale defines a unique mapping between data and aesthetics. Importantly, a scale must be one-to-one, such that for each specific data value there is exactly one aesthetics value and vice versa. If a scale isn’t one-to-one, then the data visualization becomes ambiguous.

Figure: Scales link data values to aesthetics. Here, the numbers 1 through 4 have been mapped onto a position scale, a shape scale, and a color scale. For each scale, each number corresponds to a unique position, shape, or color, and vice versa.

As an example, we can map temperature onto the y axis, day of the year onto the x axis, and location onto color, and visualize these aesthetics with solid lines. The result is a standard line plot showing the temperature normals at the four locations as they change during the year.

Instead of mapping temperature onto the y axis and location onto color, we can do the opposite. Because now the key variable of interest (temperature) is shown as color, we need to show sufficiently large colored areas for the colors to convey useful information.
Therefore, for this visualization I have chosen squares instead of lines, one for each month and location, and I have colored them by the average temperature normal for each month. Month is an ordered factor with 12 levels and location is an unordered factor with 4 levels. Therefore, the two position scales are both discrete.

Figure: Monthly normal mean temperatures for four locations in the US.

For discrete position scales, we generally place the different levels of the factor at an equal spacing along the axis. If the factor is ordered (as is the case here for month), then the levels need to be placed in the appropriate order. If the factor is unordered (as is the case here for location), then the order is arbitrary, and we can choose any order we want. I have ordered the locations from overall coldest (Chicago) to overall hottest (Death Valley) to generate a pleasant staggering of colors.

Both figures use three scales in total: two position scales and one color scale. The following figure uses five scales (two position scales, one color scale, one size scale, and one shape scale), and each scale represents a different variable from the dataset.

Figure: Fuel efficiency versus displacement, for 32 cars. This figure uses five separate scales to represent data: (i) the x axis (displacement); (ii) the y axis (fuel efficiency); (iii) the color of the data points (power); (iv) the size of the data points (weight); and (v) the shape of the data points (number of cylinders). Four of the five variables displayed (displacement, fuel efficiency, power, and weight) are numerical continuous. The remaining one (number of cylinders) can be considered to be either numerical discrete or qualitative ordered.

Cartesian Coordinates

The most widely used coordinate system for data visualization is the 2D Cartesian coordinate system, where each location is uniquely specified by an x and a y value.
The x and y axes run orthogonally to each other, and data values are placed in an even spacing along both axes. The two axes are continuous position scales, and they can represent both positive and negative real numbers. To fully specify the coordinate system, we need to specify the range of numbers each axis covers.

Data values usually aren’t just numbers, however. They come with units. For example, if we’re measuring temperature, the values may be measured in degrees Celsius or Fahrenheit. Similarly, if we’re measuring distance, the values may be measured in kilometers or miles, and if we’re measuring duration, the values may be measured in minutes, hours, or days. In a Cartesian coordinate system, the spacing between grid lines along an axis corresponds to discrete steps in these data units. In a temperature scale, for example, we may have a grid line every 10 degrees Fahrenheit, and in a distance scale, we may have a grid line every 5 kilometers.

A Cartesian coordinate system can have two axes representing two different units. For example, we plotted temperature versus days of the year. The y axis is measured in degrees Fahrenheit, with a grid line every 20 degrees, and the x axis is measured in months, with a grid line at the first of every third month. Whenever the two axes are measured in different units, we can stretch or compress one relative to the other and maintain a valid visualization of the data. Which version is preferable may depend on the story we want to convey. A tall and narrow figure emphasizes change along the y axis and a short and wide figure does the opposite. Ideally, we want to choose an aspect ratio that ensures that any important differences in position are noticeable.

On the other hand, if the x and y axes are measured in the same units, then the grid spacings for the two axes should be equal, such that the same distance along the x or y axis corresponds to the same number of data units.
As an example, we can plot the temperature in Houston against the temperature in San Diego for every day of the year. Since the same quantity is plotted along both axes, we need to make sure that the grid lines form perfect squares.

A change in units is a linear transformation, where we add or subtract a number to or from all data values and/or multiply all data values by another number. Fortunately, Cartesian coordinate systems are invariant under such linear transformations. Therefore, you can change the units of your data and the resulting figure will not change, as long as you change the axes accordingly. As an example, compare Figures 3-3a and 3-3b. Both show the same data, but in part (a) the temperature units are degrees Fahrenheit and in part (b) they are degrees Celsius. Even though the grid lines are in different locations and the numbers along the axes are different, the two data visualizations look exactly the same.
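
The invariance under a change of units can be verified with a short calculation. The sketch below is an illustration under stated assumptions: the temperatures and axis limits are made-up numbers, and `axis_position` is a hypothetical helper that returns where a value falls along a linear axis, as a fraction of the axis length. Converting both the data and the axis limits from Fahrenheit to Celsius leaves every position unchanged.

```python
import math

def fahrenheit_to_celsius(f):
    """Linear unit change: subtract 32, then multiply by 5/9."""
    return (f - 32.0) * 5.0 / 9.0

def axis_position(value, axis_min, axis_max):
    """Relative position on a linear axis: 0.0 at axis_min, 1.0 at axis_max."""
    return (value - axis_min) / (axis_max - axis_min)

# Made-up example values, not taken from the lecture's dataset.
temps_f = [50.0, 68.0, 86.0]
axis_f = (40.0, 100.0)

# Change the units of the data AND of the axis limits.
temps_c = [fahrenheit_to_celsius(t) for t in temps_f]
axis_c = tuple(fahrenheit_to_celsius(a) for a in axis_f)

pos_f = [axis_position(t, *axis_f) for t in temps_f]
pos_c = [axis_position(t, *axis_c) for t in temps_c]

# Each point lands at the same place on the page in both unit systems.
print(all(math.isclose(pf, pc) for pf, pc in zip(pos_f, pos_c)))  # True
```

The positions agree because the added constant cancels in both differences and the multiplicative factor cancels in the ratio. If only the data were converted and the axis limits were left in Fahrenheit, the points would move.
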
