Data Visualization Level 2 Lecture PDF
Document Details
Uploaded by DeadCheapBrazilNutTree842
Summary
This document provides a lecture on data visualization, covering various types of scales, coordinate systems, and different visualization methods for different data types such as proportions and distributions. It explains concepts like linear and nonlinear scales, and how colors can be used to represent and highlight data values. The examples and explanations are suitable for an undergraduate-level course in data visualization or statistics.
Full Transcript
In a Cartesian coordinate system, the grid lines along an axis are spaced evenly both in data units and in the resulting visualization. We refer to the position scales in these coordinate systems as linear. In a nonlinear scale, even spacing in data units corresponds to uneven spacing in the visualization, or conversely, even spacing in the visualization corresponds to uneven spacing in data units.

On a log scale, for example, the values 3.16 and 31.6 are exactly halfway between 1 and 10 and between 10 and 100. We can see this by observing that 10^0.5 = √10 ≈ 3.16, and equivalently 3.16 × 3.16 ≈ 10. Similarly, 10^1.5 = 10 × 10^0.5 ≈ 31.6.

In the polar coordinate system, we specify positions via an angle and a radial distance from the origin, and therefore the angle axis is circular. Polar coordinates can be useful for data of a periodic nature, such that data values at one end of the scale can be logically joined to data values at the other end. For example, consider the days in a year. December 31st is the last day of the year, but it is also one day before the first day of the year.

There are three fundamental use cases for color in data visualizations: 1- to distinguish groups of data from each other, 2- to represent data values, and 3- to highlight.

Color as a Tool to Distinguish

We use color as a means to distinguish discrete items or groups that do not have an intrinsic order, such as different countries on a map or different manufacturers of a certain product. In this case, we use a qualitative color scale. A qualitative color scale must meet the following conditions:
1- It contains a finite set of specific colors that are chosen to look clearly distinct from each other while also being equivalent to each other.
2- No one color should stand out relative to the others.
3- The colors should not create the impression of an order, as would be the case with a sequence of colors that get successively lighter.
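As a minimal sketch of how a qualitative scale might be applied, the Python snippet below assigns one color to each unordered category. The function name `assign_colors`, the category names, and the palette are assumptions made for illustration; the hex codes are the widely used Okabe-Ito colors and are not taken from the lecture.

```python
# Illustrative qualitative palette (Okabe-Ito hex codes). This is an
# assumption for the example, not a palette prescribed by the lecture.
OKABE_ITO = [
    "#E69F00", "#56B4E9", "#009E73", "#F0E442",
    "#0072B2", "#D55E00", "#CC79A7", "#000000",
]


def assign_colors(categories, palette=OKABE_ITO):
    """Map each distinct category to its own color, in first-seen order."""
    distinct = list(dict.fromkeys(categories))  # drop repeats, keep order
    if len(distinct) > len(palette):
        # A qualitative scale is a finite set of colors; reusing a color
        # would make two groups indistinguishable.
        raise ValueError("more categories than colors in the palette")
    return dict(zip(distinct, palette))


# Hypothetical categories with no intrinsic order.
colors = assign_colors(["Toyota", "Ford", "Toyota", "Honda"])
print(colors)
```

Raising an error when the categories outnumber the colors reflects the first condition above: the scale is a finite set, and recycling a color would defeat its purpose.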
Colors that suggest an order would impose an apparent ranking on the items being colored, which by definition have no order.

Color can also be used to represent quantitative data values, such as income, temperature, or speed. In this case, we use a sequential color scale.
1- Such a scale contains a sequence of colors that clearly indicates which values are larger or smaller than which other ones, and how distant two specific values are from each other.
2- The color scale needs to be perceived to vary uniformly across its entire range.

Sequential scales can be based on a single hue (e.g., from dark blue to light blue) or on multiple hues (e.g., from dark red to light yellow). Multihue scales tend to follow color gradients that can be seen in the natural world, such as dark red, green, or blue to light yellow, or dark purple to light green. The reverse (e.g., dark yellow to light blue) looks unnatural and doesn’t make a useful sequential scale.

In some cases, we need to visualize the deviation of data values in one of two directions relative to a neutral midpoint. One straightforward example is a dataset containing both positive and negative numbers.
The appropriate color scale in this situation is a diverging color scale. We can think of a diverging scale as two sequential scales stitched together at a common midpoint, which usually is represented by a light color. Diverging scales need to be balanced, so that the progression from light colors in the center to dark colors on the outside is approximately the same in either direction. Otherwise, the perceived magnitude of a data value would depend on whether it fell above or below the midpoint value.

Color can also be an effective tool to highlight specific elements in the data. There may be specific categories or values in the dataset that carry key information about the story we want to tell, and we can strengthen the story by emphasizing the relevant figure elements to the reader.

Amounts

The most common approach to visualizing amounts (i.e., numerical values shown for some set of categories) is using bars, either vertically or horizontally arranged. However, instead of using bars, we can also place dots at the location where the corresponding bar would end. If there are two or more sets of categories for which we want to show amounts, we can group or stack the bars. We can also map the categories onto the x and y axes and show amounts by color, via a heatmap.

Histograms and density plots provide the most intuitive visualizations of a distribution, but both require arbitrary parameter choices and can be misleading. Cumulative densities and quantile-quantile (q-q) plots always represent the data faithfully but can be more difficult to interpret.
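The contrast between the two families of plots can be sketched in a few lines of Python. The data values and the helper names `histogram_counts` and `ecdf` are hypothetical, chosen only for illustration: the histogram counts depend on the number of bins, which is an arbitrary parameter, while the empirical cumulative distribution is computed from the data alone.

```python
from bisect import bisect_right


def histogram_counts(values, n_bins):
    """Count values in n_bins equal-width bins spanning min..max.

    Assumes max(values) > min(values), so the bin width is nonzero.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value belongs to the last bin, not a new one.
        i = min(int((v - lo) / width), n_bins - 1)
        counts[i] += 1
    return counts


def ecdf(values):
    """Return (x, F(x)) pairs, where F(x) is the fraction of points <= x."""
    xs = sorted(values)
    n = len(xs)
    return [(x, bisect_right(xs, x) / n) for x in sorted(set(xs))]


# Hypothetical sample with a gap between 4 and 7.
values = [1, 2, 2, 3, 4, 7, 8, 8, 9, 10]

print(histogram_counts(values, 3))  # one choice of bin count
print(histogram_counts(values, 5))  # another choice, different picture
print(ecdf(values))                 # no parameter to choose
```

With this sample, three bins hide the gap in the data while five bins expose an empty bin, so the impression a histogram gives depends on a choice the data do not dictate. The cumulative version involves no such choice, at the cost of being less immediately readable.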
Boxplots, violin plots, strip charts, and sina plots are useful when we want to visualize many distributions at once and/or if we are primarily interested in overall shifts among the distributions. Stacked histograms and overlapping densities allow a more in-depth comparison of a smaller number of distributions, though stacked histograms can be difficult to interpret and are best avoided. Ridgeline plots can be a useful alternative to violin plots and are often useful when visualizing very large numbers of distributions or changes in distributions over time.

Proportions can be visualized as pie charts, side-by-side bars, or stacked bars. As for amounts, when we visualize proportions with bars, the bars can be arranged either vertically or horizontally. Pie charts emphasize that the individual parts add up to a whole and highlight simple fractions. However, the individual pieces are more easily compared in side-by-side bars. Stacked bars look awkward for a single set of proportions, but can be useful when comparing multiple sets of proportions.

When visualizing multiple sets of proportions or changes in proportions across conditions, pie charts tend to be space-inefficient and often obscure relationships. Grouped bars work well as long as the number of conditions compared is moderate, and stacked bars can work for large numbers of conditions. Stacked densities are appropriate when the proportions change along a continuous variable.

When proportions are specified according to multiple grouping variables, mosaic plots, treemaps, or parallel sets are useful visualization approaches. Mosaic plots assume that every level of one grouping variable can be combined with every level of another grouping variable, whereas treemaps do not make such an assumption. Treemaps work well even if the subdivisions of one group are entirely distinct from the subdivisions of another.
Parallel sets work better than either mosaic plots or treemaps when there are more than two grouping variables.

Scatterplots represent the archetypical visualization when we want to show one quantitative variable relative to another. If we have three quantitative variables, we can map one onto the dot size, creating a variant of the scatterplot called a bubble chart. For paired data, where the variables along the x and y axes are measured in the same units, it is generally helpful to add a line indicating x = y. Paired data can also be shown as a slopegraph of paired points connected by straight lines.

For large numbers of points, regular scatterplots can become uninformative due to overplotting. In this case, contour lines, 2D bins, or hex bins may provide an alternative. When we want to visualize more than two quantities, we may choose to plot correlation coefficients in the form of a correlogram instead of the underlying raw data.
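A correlogram is built from pairwise correlation coefficients. As a minimal sketch, assuming the Pearson coefficient is the one of interest (the lecture does not specify which), the Python snippet below computes the values a correlogram would display. The function names, variable names, and data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant, so both spreads are nonzero.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(columns):
    """Return {(name_a, name_b): r} for every ordered pair of columns."""
    names = list(columns)
    return {
        (a, b): pearson(columns[a], columns[b])
        for a in names
        for b in names
    }


# Hypothetical data: three quantitative variables measured on five items.
data = {
    "height": [1.0, 2.0, 3.0, 4.0, 5.0],
    "weight": [2.0, 4.0, 6.0, 8.0, 10.0],  # rises in step with height
    "speed": [5.0, 4.0, 3.0, 2.0, 1.0],    # falls as height rises
}
matrix = correlation_matrix(data)
for pair, r in matrix.items():
    print(pair, round(r, 2))
```

Each coefficient would then be mapped to a color, typically on a diverging scale centered at zero, since correlations deviate in two directions from a neutral midpoint.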