Document Details


Uploaded by ConvincingAlder

Eindhoven University of Technology

Tags

visualization, data visualization, data analysis, information visualization

Summary

This document contains lecture notes on visualization, covering fundamental concepts and the main types of visualizations, with guidelines for data exploration, analysis, and presentation.

Full Transcript


Visualization Lecture Notes

Module 1

Visualization is typically used for data exploration and for making the unseen visible.
o Based on the human visual perception system
o Exploits the pattern-recognition capabilities of the human visual system
o The eyes act as a very high-bandwidth channel to the brain
o Often an intuitive step → graphical illustration
Numbers do not tell the whole story for data sets (plot graphs to see the real distribution).

Why Visualization?
o ~90% of information about our environment is received through the eyes
o ~50% of our brain neurons are involved in processing visual information
o The presence of pictures increases the desire to read the text by ~80%
o We remember 10% of what we hear, 20% of what we read, and 80% of what we see

Information Visualization: "The use of computer-supported, interactive, visual representations of abstract data to amplify cognition."
o Abstract data: data without a physical, real location in the world
o We use visualization to augment human capabilities → this is why we keep the human in the loop

Visualization pipeline:
o Visualization is NOT about making pretty pictures, BUT about making useful pictures

3 Goals of Visualization:
o To explore (nothing is known; used for data exploration)
o To analyse (there are hypotheses; used for verification or falsification)
o To present ("everything" is known about the data; used for communication of results)

Nested Model:
o Iterative refinement process
o Domain situation: understand the user, the data and the tasks
▪ User needs, workflow, limitations, user satisfaction
▪ Provide actionable knowledge (decisions to be made, relevant information)
▪ Danger: misunderstanding needs
▪ To understand the user → perform interviews to understand their needs
o Data/task abstraction:
▪ Data described in generic terms → table, hierarchy, sets
▪ Tasks described in generic terms → search, compare, see trend
▪ Danger: you're showing the wrong thing
o Visual encoding:
▪ Design space, select visual encodings, define interactions
▪ Danger: the way you show it doesn't work
o Algorithm:
▪ Layout algorithm, ordering, rendering
▪ Danger: your code is too slow
o A mistake at a higher level cannot be corrected at a lower level
o The nested model provides good factors to validate each phase

Visual Encoding Design:
o What is shown? → data abstraction
▪ Data types: items, attributes, links, positions, grids
▪ Data sets: Tables, Networks (for hierarchies), Geometry (spatial), Fields (continuous)
▪ Attributes are columns and items are rows; keys are what identify each row
▪ Attribute Types:
  Categorical (no order at all → fruits, colors, shapes, ID #)
  Ordered (has an intrinsic order → age, temperature)
   o Ordinal (discontinuous space → shirt sizes S, M, and L)
   o Quantitative (continuous space → height of a person)
o Ordering Direction:
▪ Sequential (only one direction of order → age, length, weight)
▪ Diverging (a point in the middle, with values in both directions → temperature); a sequential attribute can become diverging depending on the aim of the task
▪ Cyclic (values keep repeating themselves, with no clear start/end → weeks, months, years — anything related to time)

Dataset Availability:
o Static (does not change much or often)
o Dynamic (changes a lot, quite frequently)

Why is the user looking at it? → task abstraction
o Data, Tasks and Users
▪ Quantitative Data: describes a measurable physical dimension (temperature, age, weight)
▪ Ordinal Data: categorical variables with an implied order (small, medium, large)
▪ Nominal Data: categories with no ordering (single-player, FPS, sports)
o Define tasks as tuples of Actions and Targets:
▪ Action: how the visualization is used
▪ Target: the aspect of the data of interest
▪ Examples: Analyse → All data; Search → Attributes; Query → Network and Spatial data

Analyse:
o Consume
▪ Discover → find new knowledge, generate or verify hypotheses
▪ Present → communication of information
▪ Enjoy → casual encounters with a visualization
o Produce
▪ Annotate → add graphical or textual annotations to existing data/visualizations
▪ Record → save/capture visualization elements as persistent artifacts
▪ Derive → produce new data elements based on existing elements

Search:
o Location known, target known → Lookup
o Location known, target unknown → Browse
o Location unknown, target known → Locate
o Location unknown, target unknown → Explore

Query:
o Identify → identify characteristics of a single target
o Compare → compare multiple targets
o Summarize → summarize all possible targets + provide an overview

Targets – All Data
o Trends → high-level characterization of a pattern in the data: increases/decreases, peaks, valleys, plateaus
o Outliers → data that does not fit the backdrop of normal behavior
o Features → particular structures of interest; task- and data-dependent

Targets – Attributes
o One (distribution, extremes)
o Many (dependency, correlation, similarity)
▪ Depends on the scope of the action

Targets – Network and Spatial Data
o Network data: topology and paths (some targets pertain to specific dataset types such as network or spatial data)
o Spatial data: understanding and comparing geometric shape

Module 2

How is it shown? → visual encoding and interaction
o Marks → geometric primitives
▪ Points (0D)
▪ Lines (1D)
▪ Areas (2D)
▪ 3D marks: volumes, …
▪ These are marks for items
▪ Links as marks are used for containment (e.g., bubble charts for trees)
▪ Links as marks are also used for connection
o Visual Channels: control appearance
▪ Position (horizontal, vertical or both)
▪ Color
▪ Tilt (angle)
▪ Size
▪ Shape
▪ Motion
o Visual Encoding → analyse a chart as a combination of marks and channels showing data attributes (see the sketch below)
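As an illustration of marks and channels, here is a minimal sketch, assuming Python with matplotlib and an invented toy table: it encodes four attributes of the same items using point marks and the position, size, and color-hue channels.

```python
import matplotlib.pyplot as plt

# Hypothetical toy table: each row is an item, each column an attribute.
items = {
    "engine_size": [1.2, 1.6, 2.0, 3.0, 4.2],  # quantitative -> x position
    "mpg":         [52, 45, 38, 27, 18],        # quantitative -> y position
    "price":       [15, 22, 30, 55, 90],        # quantitative -> area (size)
    "fuel":        ["petrol", "petrol", "diesel", "diesel", "petrol"],  # categorical -> hue
}

colors = {"petrol": "tab:blue", "diesel": "tab:orange"}  # hue for a categorical attribute

plt.scatter(
    items["engine_size"],                  # channel: horizontal position
    items["mpg"],                          # channel: vertical position
    s=[p * 5 for p in items["price"]],     # channel: size (2D area)
    c=[colors[f] for f in items["fuel"]],  # channel: color hue
)
plt.xlabel("engine size (L)")
plt.ylabel("fuel economy (mpg)")
plt.show()
```

Note how each attribute gets exactly one channel, and the categorical attribute gets hue while the quantitative ones get position and size, matching the channel-to-data-type guidance above.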
How to select Visual Encodings?
o Show all of the data, but only what is in the data
o Match the channel/mark to the data characteristics
o Expressiveness principle: encode the information that is in your data set, and just that information → do not communicate something that is not in your data set
o Effectiveness principle (salience): encode the most important attributes with the highest-ranked channels

Visual Channel Rankings (Munzner's), in order of preference:
o For categorical data: spatial region → colour hue → motion → shape
o For ordered data: position (common scale) → position (unaligned scale) → length (1D size) → tilt/angle → area (2D size) → depth (3D position) → color luminance → color saturation → curvature → volume (3D size)

Rankings are based on: Accuracy
o Derived from user experiments → see on which channels users make more errors and test accuracy (humans are best at identifying length and position)
o Length as a visual channel is easier to read than angle (e.g., in pie charts)

Stevens' Psychophysical Power Law: S = I^n
o Based on the idea that not all visual channels behave equally in the relation between the quantity they represent and the way it is perceived
o Saturation → overestimated (n > 1); length → no bias (n ≈ 1); area → underestimated (n < 1)
o Can we correct the bias?
▪ Example: map a weight attribute W to an area A. By the power law S = I^n, for area n ≈ 0.7 (from the graph), so S = A^0.7. If A = W, then S = W^0.7, and the user will underestimate W. If instead A = W^(1/0.7), then S = (W^(1/0.7))^0.7 = W, so there is no bias.
▪ We see a clear difference in sizes with this perceptual scaling (see the sketch below)
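A minimal sketch of this correction, assuming Python with matplotlib and using the area exponent 0.7 quoted above: it draws the same values once with raw areas and once with perceptually scaled areas.

```python
import matplotlib.pyplot as plt

w = [1, 2, 4, 8]   # weights to encode as circle area
n = 0.7            # Stevens exponent for area (from the notes)

fig, axes = plt.subplots(2, 1, figsize=(6, 3))
for ax, title, areas in [
    (axes[0], "naive: area = W (perceived as W^0.7)", w),
    (axes[1], "corrected: area = W^(1/0.7) (perceived as W)", [v ** (1 / n) for v in w]),
]:
    # scatter's s parameter is in points^2, i.e., proportional to area
    ax.scatter(range(len(w)), [0] * len(w), s=[a * 100 for a in areas])
    ax.set_title(title)
    ax.set_yticks([])
plt.tight_layout()
plt.show()
```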
Discriminability
o Not about how accurately we can estimate values, but about how many different values we can distinguish in the visualization and compare with each other
o How many usable steps? → match the encoding to the attribute's number of levels/precision
o Line width, for example, has very few recognizable bins

Separability
o Separability vs integrality: some channel pairs can be read independently (separable), others are perceived as a whole (integral)

Popout
o Color (hue) or shape alone: pre-attentive
▪ The attentional system is not invoked
▪ Search speed is independent of the distractor count → parallel processing
o Combined hue and shape: not pre-attentive
▪ Requires attention
▪ Search speed is linear in the distractor count → serial search
o Most channels are pre-attentive (color, shape, orientation)
▪ Parallel processing: a sufficiently different item is noticed immediately, independent of distractors
o A few channels have no popout, e.g., a parallel pair among tilted pairs does not pop out
▪ Serial search: speed depends on the distractor count

How do we see?
o Rods
▪ Achromatic perception
▪ Low-light vision (night vision)
▪ Saturated in daylight vision
o Cones
▪ Chromatic perception (three types: Short, Medium, Long)
▪ 6 to 7 million in the retina (~3x full HD)
▪ Not used in night vision
o We are able to see ~10 million different colours; color is defined by the cones

Color Blindness
o Deficiencies in the functioning of the cones (e.g., two cone types instead of three)
o Red-green color blindness: 5-8% of men and 0.5% of women
▪ Red cone deficiency (protanomaly)
▪ Green cone deficiency (deuteranomaly)
o Blue-yellow color blindness (tritanopia): 1% of males and 1% of females

Color representation: 3 dimensions, since we have 3 cone types
o Device oriented: physical realization
▪ For screens: RGB space (3 lamps)
▪ For printers: CMY(K)
▪ Simple: each device has its own space
▪ But it is device-dependent (an RGB value of 100 can look different on each screen)
▪ Represents only part of the colors humans can see
o Semantic: intuitive properties of color, HSL/HSV
▪ Hue: the color wheel → basically the wavelength
▪ Saturation: how much grey → the strength of the color
▪ Luminance/Brightness/Lightness/Value: how much light

How do we represent all human-visible colors?
o CIE 1931 XYZ color space
▪ Represents all human-visible colors
▪ Built through experimentation
▪ Mathematical formulation
▪ Device independent
▪ Easy conversion
o Our visual system also has biases:
▪ Greens are slightly better distinguishable
▪ Blue perception is very weak
▪ We are very sensitive to luminance (changes of brightness)
o Device oriented (RGB/CMYK): poor choice for design; semantic (HSV/HSL): better choice

Color Representations by attribute type:
o Categorical (data without order): we usually use hue (hue has no natural order)
o Ordered, sequential (a starting point, increasing in one direction only)
▪ We usually use luminance for ordered sequential data
▪ Rainbow is a poor choice since it has no natural order
▪ A luminance/black-body colormap is much more detailed, with better discriminability
o Ordered, diverging (a starting point in the middle, with values on both sides)
▪ Colormap with a middle point and a different hue on each side, changing luminance towards the two ends

Colormap characteristics
o Rainbow (hue) is a poor default
▪ Problems: perceptually unordered; perceptually non-linear; perceptual borders that are not there
▪ Benefit: nameable regions
▪ Alternative: a few hues with monotonically increasing luminance for fine-grained detail (see the colormap sketch below)
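A minimal sketch of this comparison, assuming Python with matplotlib and its built-in colormap names: it renders the same ramp with a rainbow map ("jet") and with a map whose luminance increases monotonically ("viridis").

```python
import numpy as np
import matplotlib.pyplot as plt

gradient = np.linspace(0, 1, 256).reshape(1, -1)  # a simple 0..1 ramp

fig, axes = plt.subplots(2, 1, figsize=(6, 2))
for ax, cmap in [(axes[0], "jet"), (axes[1], "viridis")]:
    # 'jet' is a rainbow map (perceptually unordered and non-linear);
    # 'viridis' has monotonically increasing luminance, so order is perceived correctly
    ax.imshow(gradient, aspect="auto", cmap=cmap)
    ax.set_ylabel(cmap, rotation=0, ha="right", va="center")
    ax.set_xticks([])
    ax.set_yticks([])
plt.show()
```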
Visual Perception – Luminance
o Physical unit: light intensity per unit area
o Higher luminance = higher brightness, but perception is non-linear
o We are better at perceiving relative brightness changes in darker areas
o Humans can differentiate about 100 grey levels
o We are better at distinguishing luminance than hue → very sensitive to luminance
o Hues have different luminance values → e.g., yellow has higher luminance than blue in the image

Perception
o Colors are not evaluated independently; perception is complex
o Our brain interprets information according to its experience; context is relevant
o We evaluate things relatively, not absolutely, most of the time
▪ Weber's Law: the perceptual system mostly operates with relative judgements, not absolute ones
▪ The smallest perceivable change in a stimulus (ΔS) relative to the background stimulus (S) is constant (k): ΔS/S = k (e.g., if k = 0.02, a background of 100 units needs a change of 2 units to be noticed, while a background of 500 needs 10)
o Accuracy increases with a common frame/scale and alignment
▪ Image example: filled rectangles with ratio 1:9 → difficult judgement; white (unfilled) rectangles with ratio 1:4 → easy judgement

Module 3

Gestalt Principles → how human perception groups elements, sees patterns and simplifies information
o Proximity (emergence) → we group elements that are close to each other (position)
o Similarity → we tend to group elements with a similar appearance
o Common Region → we group elements that are in the same enclosed region
o Good Figure → objects grouped together tend to be perceived as a single figure
o Closure (reification) → we complete missing parts
o Continuity → we tend to form and group continuous lines from pieces
o Figure/Ground (multi-stability) → what we see depends on our perception of figure vs background

Tufte's Principles:
o Graphical integrity → avoid missing scales and scale distortion
o Design principles → maximize the data-ink ratio and avoid chart junk
▪ Data-Ink Ratio = (data ink)/(total ink used in the graphic)
▪ Data ink = ink used to represent data (use ink only for information that is relevant)
▪ Remove redundancy (avoid multiple representations of the same information)

Dangers of Depth
o The channel rankings hold for planar spatial position, not for depth
o We do not really see 3D, we see 2.05D
▪ We see 2D projections that the brain combines into depth
▪ Motion parallax: depth from viewpoint movement
▪ Occlusion is not resolved
Difficulties of 3D:
o Occlusion (we don't truly see 3D)
o Interaction complexity
o Poor text legibility
o Perspective distortion (interferes with all size-channel encodings, so the power of the plane is lost)
o Detailed comparisons are impossible to make
o 3D is legitimate for true 3D spatial data → shape understanding (e.g., brain structure)
o 3D needs very careful justification for non-spatial data (e.g., for trees)

Resolution Beats Immersion
o Resolution is much more important (pixels are the scarcest resource)
o Desktop is better for workflow integration
o Virtual Reality – immersion (VR is harder to embed into a workflow)
▪ Immersion can be useful when there is a lot to memorize in the data
▪ Very difficult to justify for non-spatial data
▪ We usually do not need a sense of presence or stereoscopic 3D

Eyes Beat Memory
o Long-term memory → unlimited
o Short-term (working) memory → limited (things you must remember to fulfil a task within a short period)
▪ Reaching the limit of short-term memory implies cognitive load
▪ Common test: memorize as many words as possible from a list; most people remember 5-9 words
o Attention: very limited for conscious visual search tasks
o Animation vs side-by-side:
▪ Side-by-side views are easy to compare (low cognitive load) by moving the eyes
▪ Animation is hard: you compare visible items to your memory of what you saw before
o Animation:
▪ Great for choreographed storytelling (a series of events = localized action)
▪ Great for transitions between two states/data sets
▪ Give the user control over time (replay, stop, pause)
▪ Change blindness → when focusing on a particular task, we tend to overlook big changes happening in the data set
▪ Poor for many states with changes everywhere
o Time vs space trade-off

Module 4

Visualization idioms (idiom = chart type)
o Choices to make on encoding and manipulation; the focus here is on visualization of tabular data
o Idioms put restrictions on tasks, so choose carefully
o Idioms are described by:
▪ Data: what is the data behind the chart?
  Number of categorical attributes, number of quantitative attributes
  Semantics of keys and values (e.g., key: person_id, value: height — the value of a cell in the table)
  Multidimensional table: multiple keys
  All attributes are columns and items are rows; the rows representing items are identified by an identifier called a key, and the value corresponding to that key is the value, so for each item we have a key-value pair
▪ Mark: which visual elements are used? Points, lines, glyphs, etc.
▪ Channels: how is the data encoded? (how are marks arranged and mapped to visual elements)
▪ Tasks: what are the supported tasks? Discover trends, outliers, distributions

Idioms with n (multiple) keys and 1 value:
o Bar Chart (see the sketch after the stacked bar chart below):
▪ Data: 1 categorical attribute (key) → x-axis; 1 quantitative attribute (value) → y-axis
▪ Mark: lines
▪ Channels: length to convey the quantitative value; spatial regions, one per mark
▪ Tasks: compare and look up values
  We can order a bar chart by key or by value: to quickly look up keys → order by key; to quickly look up which keys have a certain value → order by value
▪ Scalability → hundreds of levels for the key attribute
o Stacked Bar Chart:
▪ Data: 2 categorical attributes (keys), e.g., age and gender; 1 quantitative attribute (value)
▪ Mark: stack of line marks
▪ Channels: length and color hue; spatial regions, one per mark
  Aligned: first bar; unaligned: other bars
▪ Tasks: compare, look up values; part-to-whole relationships (due to having multiple keys)
▪ Scalability → hundreds of levels for the key attribute; for the stacked attribute, several to a dozen (the number of colors we can distinguish)
▪ Downside: only the first segment is aligned and easy to compare; the other keys are harder to compare. A diverging stacked bar chart aligns the value of interest; to compare all items, use a layered/grouped bar chart.
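A minimal sketch of these two idioms, assuming Python with matplotlib and invented counts: a bar chart with one key, and a stacked bar chart where only the bottom segment is aligned to a common baseline.

```python
import matplotlib.pyplot as plt

keys = ["A", "B", "C", "D"]   # categorical key attribute
values = [23, 17, 35, 29]     # quantitative value attribute
female = [12, 9, 20, 14]      # second (stacked) key level
male = [11, 8, 15, 15]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Bar chart: line marks, length channel, one spatial region per mark
ax1.bar(keys, values)
ax1.set_title("bar chart")

# Stacked bar chart: only the bottom (aligned) segment is easy to compare
ax2.bar(keys, female, label="female")
ax2.bar(keys, male, bottom=female, label="male")  # unaligned segment
ax2.set_title("stacked bar chart")
ax2.legend()
plt.show()
```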
o Line Chart:
▪ Data: 2 quantitative attributes (one key, one value)
▪ Mark: points, with lines connecting them (emphasizing the relation between points)
▪ Channels: aligned lengths to express the quantitative value; separated and ordered by the key attribute into horizontal regions
▪ Tasks: find trends; connection marks emphasize the ordering of items along the key axis → show relationships
▪ Scalability → dozens to hundreds of key-attribute levels
o Choosing bar vs line chart?
▪ Depends on the type of the key attribute: bar chart if categorical, line chart if ordered/quantitative
▪ Why: you cannot interpolate between categorical attribute levels
▪ Do not use line charts for a categorical key: it violates the expressiveness principle; the implication of a trend is so strong that it overrides semantics → e.g., "the more male a person is, the taller he/she is"
▪ Do not use bar charts for a quantitative key: this also violates the expressiveness principle
o Pie Chart / polar area chart:
▪ Data: 1 categorical attribute (key) → used as the slices; 1 quantitative attribute (value)
▪ Mark: separate colored areas
▪ Channels: color for the categorical attribute; angle for the quantitative attribute
▪ Tasks: part-to-whole judgement
▪ Scalability → about a dozen slices (since color encodes the key values)
▪ Not good for comparing precise values or observing trends. Why: we cannot compare angles well (angle is a low-ranked visual channel). With the same data in a bar chart we can compare much better, because length/position is one of the highest-ranked channels.
o Streamgraph (extension of the stacked bar chart to a quantitative key; emphasizes horizontal continuity):
▪ Data: 1 categorical key attribute (names); 1 ordered key attribute (time); 1 quantitative value attribute (counts)
▪ Marks and channels: derived geometry: layers whose width encodes the counts; color separates the keys, so for each key we can see how it evolves over time (grow, shrink, stay stable)
▪ Tasks: find trends, part-to-whole relationships
▪ Scalability → hundreds of time keys, dozens to hundreds of name keys → more than stacked bars, since most layers don't extend across the whole chart (a key does not have to be present at each timestamp)
o Heatmap (see the sketch after the summary below):
▪ Data: two categorical keys, one quantitative value
▪ Mark: separate areas, aligned in a 2D matrix indexed by the two key attributes (one on the x-axis, one on the y-axis)
▪ Channels: color for the quantitative attribute
▪ Tasks: find clusters, outliers, patterns
▪ Scalability → very scalable (the main advantage): ~1M items, hundreds of categorical levels, roughly 10 distinguishable quantitative levels
o Summary:
▪ Idioms put restrictions on tasks, so choose carefully
▪ Idioms are described by Data, Marks, Channels and Tasks
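A minimal heatmap sketch, assuming Python with numpy/matplotlib and random data standing in for a two-key table:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Hypothetical table with two keys (rows x columns) and one value per cell
matrix = rng.random((8, 12))

plt.imshow(matrix, cmap="viridis")  # color channel encodes the value
plt.colorbar(label="value")
plt.xlabel("key 1")
plt.ylabel("key 2")
plt.show()
```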
Statistical value idioms:
o Histogram:
▪ Data: 1 quantitative attribute
▪ Derived data: keys are bins, values are counts; the bin size is crucial
  Take the quantitative attribute and chop it into bins; for each bin, count how many data items fall into it (the frequency of each bin)
▪ Mark: lines
▪ Channel: length of the line encodes the frequency
▪ Tasks: understand the distribution of an attribute
▪ The number of bins you choose strongly influences what you observe in the histogram
o Boxplot:
▪ Data: 1 quantitative attribute
▪ Derived data: 5 quantitative attributes (median, min, max, lower/upper quartile); outliers shown explicitly (here defined as beyond 2 x the standard deviation)
▪ Marks: lines (+ box)
▪ Channels: length of the line encodes the derived values
▪ Tasks: understand distributions (same as the histogram)
▪ Downside: hides a lot of information (we aggregate and summarize the attribute into 5 statistics, losing detail) → the solution is the violin plot
o Violin Plot:
▪ Data: 1 quantitative attribute
▪ Derived data: the 5 quantitative attributes plus the density at each point (an addition to the boxplot that shows the distribution of the attribute)
▪ Mark: lines (+ box)
▪ Channels: length of lines encodes the derived values; width encodes the frequency of the attribute
▪ Tasks: understand distributions
o Boxplot vs Violin Plot:
▪ The violin plot shows the density at each value of the attribute, so we do not miss detail like the boxplot does (the true distribution; see the sketch below)
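A minimal sketch of the three distribution idioms side by side, assuming Python with numpy/matplotlib; the bimodal sample is invented to show what the boxplot hides.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# A bimodal sample: the boxplot summarizes it away, the violin reveals it
data = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(2, 0.5, 500)])

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(9, 3))
ax1.hist(data, bins=30)   # the bin count strongly affects what you see
ax1.set_title("histogram")
ax2.boxplot(data)         # the 5-number summary hides the two modes
ax2.set_title("boxplot")
ax3.violinplot(data)      # density at each value: the true distribution
ax3.set_title("violin plot")
plt.show()
```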
Why interaction?
o Too much data to show in one view (so we show the data in parts, e.g., along a timeline)
o Different audiences with different questions (different viewpoints on the same data)
o Increase engagement
o Potential issues: takes time to learn, takes time to use, getting lost

Visualization Pipeline:
o First step: filter the data (what data do we want to show the user?)
o Second step: decide how to show it (marks, channels)
o Third step: the view of the data (projection); then present it to the user
o We want the user to be able to interact at all of these steps

Different modalities (affect the design choices for implementing interaction): virtual reality, large displays, tabletop displays, tablets, many displays, physical visualization

Interaction principles:
o Result of interaction → low-latency visual feedback (the reaction should be immediate)
▪ Manipulate, e.g., selection and highlighting
▪ Why? Research suggests everything we see within 1/10th of a second is visually processed in the brain, and anything within 1 second feels like an immediate response; at 10 seconds we can still wait, but beyond 10 seconds people think the system is broken or hangs (so if a task takes longer than 10 seconds, show a progress bar to indicate the system is still working)
o Overview first, zoom and filter, details on demand — the Visual Information-Seeking Mantra → if you don't know much about the data and just want to explore it (which is typical), start with an overview that shows everything; then provide encoding and guidance so the user can focus on the items of interest and see global trends; for specific details, click on a particular item to see its values in detail.
o Manipulate:
▪ Change over time → this sets the digital medium apart from the version on paper
  Change encoding, parameters, viewpoints, aggregation, etc. (e.g., change a scatterplot into a bar chart)
  Within an encoding: change sorting order, rearrange the layout, etc.
  Consider animation to show the user changing parameters (preservation of the mental map)
▪ Select
  The basic operation for all interaction; design choice: click vs hover
  Highlight: change the visual encoding for items of interest (color, border, explicitly linked items)
  The selection mechanism itself (how big is the selection; box or lasso selection, etc. → depends on the task and data to support)
▪ Navigate
  Item reduction:
  o Zoom: geometric or semantic
    Geometric → camera metaphor (zooming in makes all visual elements larger) → zoom, translate, rotate
    Semantic → change the visual encoding depending on the zoom level (show more detail rather than bigger items → more items appear as you zoom in)
  o Pan/translate → camera metaphor (zoom, rotate, translate)
  o Constrained → constrain the camera to a specified area
  Attribute reduction:
  o Slice → show only items matching a specific value for a given attribute
  o Cut → show only items within some region or attribute range
  o Project → dimensionality reduction (e.g., 3D → 2D)

Coordinated multiple views
o Different visual encodings of the same data
o Show relations between items and attributes; provide different perspectives on the same items
o Many design choices: view count (few vs many), view visibility (popup vs side-by-side), view arrangement (manual vs automatic)

Linking and brushing:
o "We want to somehow connect all visualizations together"
o Linked view parameters; linked highlighting/selection — items highlighted/selected in one view are also selected in the other views
o Linking and brushing happen through selection (of the data the user wants to see)
o Brushing is a two-way interaction between multiple visualizations (e.g., items selected in a scatterplot are also selected in a treemap) → it lets you show different attributes of the data and see the correlations between attributes
o Brushing can be done with selection OR highlighting
o Brushing vs filtering:
▪ Brushing → we see the items of interest and how they behave in context
▪ Filtering → we throw away the context (making it difficult to see the relation between two different views)
o Types of multiple views:
▪ Multiple views: different visual encodings of the same data
▪ Small multiples: the same representation of the data, changing one attribute (same encoding, same data, one attribute shown differently)
▪ Overview and detail: same visual encoding, same data, different zoom level (Google Maps)
▪ Detail on demand: an extra view with more information about the selection

Focus + Context Visualization
o Show focus and context in a single visualization
o Different levels of detail integrated in the same view
▪ Show the area/items of interest (focus) in detail and the surroundings (context) in less detail
▪ Distortion techniques → fish-eye technique (we magnify the area we are interested in)

Taxonomy based on user intent — seven categories of interaction techniques:
1. Select
2. Explore
3. Reconfigure
4. Encode
5. Abstract/Elaborate
6. Filter
7. Connect

Module 5 – Multivariate Idioms

Scatterplots:
o Data: 2 quantitative attributes (no keys, only attribute values)
o Mark: points
o Channels: horizontal + vertical position
o Tasks: find trends, outliers, distributions, correlations, clusters
o Scalability → hundreds of items
o To extend to multivariate data → scatterplot matrix (SPLOM)

Parallel Coordinate Plots (PCPs):
o Scalability → dozens of attributes, millions of items (potentially)

SPLOMs vs PCPs:
o SPLOMs: primarily relationships between pairs of axes; limited scalability with standard rendering (~20 dimensions, ~500 to 1k samples); brushing is crucial
o PCPs: primarily relationships between adjacent axes; limited scalability with standard rendering (~50 dimensions, ~1 to 5k samples); interaction is crucial; axis ordering is important
o Positive correlation: SPLOM → diagonal from low to high; PCP → parallel line segments
o Negative correlation: SPLOM → diagonal from high to low; PCP → segments intersecting at the halfway point
o Uncorrelated values: SPLOM → scattered points; PCP → scattered crossings

Radar plots:
o We can easily compare attributes
o Flexible linked axes (each item is a polygon)

Icons/Glyphs:
o Map multidimensional data to properties of graphical objects

Data Reduction:
o Filtering → eliminate elements
▪ Filtering items: goal 1 — eliminate items based on their values with respect to specific attributes (the number of attributes does not change)
▪ Filtering attributes: goal 2 — eliminate attributes (the number of items does not change)
▪ Attribute ordering: determine a similarity measure (computational or visual) — a quantitative, ordered value for each attribute, based on all item values for that attribute; use that measure to order attributes, e.g., filter out highly correlated ones (and/or vice versa)
▪ Straightforward and intuitive, but: out of sight, out of mind (people tend to forget elements they don't see, or assume they don't exist)
o Aggregation → generate representatives
▪ Summarizes the dataset, but details are lost
▪ Substitute groups of elements with other elements derived from them, maintaining their aggregated characteristics
▪ Aggregating items: goal — merge groups of similar items; represent many data items with a single mark; also called clustering → define groups of similar items (many algorithms available)
▪ Aggregating attributes: goal — summarize attributes (the number of items does not change); either (1) establish a similarity measure, or (2) use dimensionality reduction

Dimensionality Reduction (DR) → preserve the "structure" of the n-dimensional space:
o Keeps the same number of elements, but fewer variables
o Preserve the meaningful structure of the data set while using fewer attributes to represent the items (e.g., n-dimensional to 2-dimensional)
o Find a combination of attributes to represent the n-dimensional space
o Minimize an objective function that measures the discrepancy between similarities in the data and similarities in the projection
Two types of DR methods (illustrated in the sketch at the end of this module):
o Linear: the resulting attributes are linear combinations of the existing attributes (interpretable)
▪ Principal Component Analysis (PCA) → preserves variation
▪ Linear Discriminant Analysis (LDA) → preserves class separation
o Non-linear: the resulting attributes have no straightforward relation to the original attributes
▪ Multi-Dimensional Scaling (MDS) → preserves distances
▪ t-Distributed Stochastic Neighbour Embedding (t-SNE) → preserves neighbourhoods
▪ UMAP → uses a Riemannian metric

Interpreting DR-ed data:
o Only relative distances matter
o Inspection is meant to identify large clusters; fine-grained structures might not be reliably represented
o 2D scatterplots (SPLOMs) are a safe idiom choice for inspecting DR-ed data

Which method to use?
o Depends on tasks and data
o Items: use derived elements (e.g., averages) → histograms, box plots, etc.
o Attributes: use derived attributes (e.g., via a similarity measure) → dimensionality reduction
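A minimal sketch comparing a linear and a non-linear DR method, assuming Python with scikit-learn available and using its bundled digits dataset:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)  # 64-dimensional items

# Linear DR: components are interpretable combinations of attributes
pca_2d = PCA(n_components=2).fit_transform(X)
# Non-linear DR: preserves neighbourhoods; only relative distances matter
tsne_2d = TSNE(n_components=2, random_state=0).fit_transform(X)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.scatter(pca_2d[:, 0], pca_2d[:, 1], c=y, s=5, cmap="tab10")
ax1.set_title("PCA (linear, preserves variation)")
ax2.scatter(tsne_2d[:, 0], tsne_2d[:, 1], c=y, s=5, cmap="tab10")
ax2.set_title("t-SNE (non-linear, preserves neighbourhoods)")
plt.show()
```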
Module 6

Maps:
o Purpose: understanding spatial relationships
o Advantages:
▪ Familiarity → people know where something is on a map (assuming they are familiar with the region)
▪ Maps act as an index from spatial to semantic information and vice versa
o Choropleth Map:
▪ Data: 1 quantitative attribute (a table with one quantitative attribute per region) → the quantitative attribute is the value and the region is the key; geographic geometry
▪ Marks: geometric areas
▪ Channels: colour
▪ Tasks: understanding spatial relationships
▪ Note: the size of the objects depends on the geography, not on the attributes — size does not reflect the quantity of the value you want to show. To prevent this → introduce glyphs instead of colouring the area (use a dot/point as the mark, and size as the channel to visualize the quantitative attribute)
o Cartogram:
▪ Size represents quantity, at the cost of familiarity due to distortion → here size does encode the quantitative value (e.g., continents drawn at different sizes)
o Dot Map:
▪ Introduce a point that represents a number of cases, and add multiple points to a region (denser region = more points = higher quantitative value)
o Density Map:
▪ Needs a data transformation to turn discrete data into continuous data, typically using a density estimation mechanism (e.g., Kernel Density Estimation; see the sketch at the end of this maps section)
o These techniques can also be used for other types of space (not only geographical data) → a basketball court, for example, can be the map
o Topographic Map:
▪ Data: a scalar spatial field (1 quantitative attribute per grid cell); geographic geometry; derived data: isoline geometry
▪ Mark: lines
▪ Channels: shape, position, colour
▪ Tasks: understand spatial relationships → especially good for field data where you can interpolate between the points

Caution with maps:
o A dataset may contain geographical information, yet creating a geographical visualization may not be relevant (do we really need the spatial arrangement for the task at hand?)
o Position is the most effective visual channel → do not waste it if it is not relevant
o A map is not always the best/only solution
o Absolute vs relative: showing totals per region gives a distorted image → normalize the data to get relative values (e.g., income per capita instead of total income)
o When populations are low, variations tend to be high (crime rates in regions with very low populations will look extreme)
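A minimal sketch of the density-map transformation mentioned above, assuming Python with numpy/scipy/matplotlib; the event coordinates are invented. Kernel Density Estimation turns the discrete points into a continuous field.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(2)
# Hypothetical discrete events (e.g., incident locations) in 2D
x = np.concatenate([rng.normal(0, 1, 150), rng.normal(3, 0.5, 50)])
y = np.concatenate([rng.normal(0, 1, 150), rng.normal(2, 0.5, 50)])

kde = gaussian_kde(np.vstack([x, y]))  # estimate a continuous density

# Evaluate the density on a grid and show it as a continuous field
xi, yi = np.mgrid[-4:6:100j, -4:5:100j]
density = kde(np.vstack([xi.ravel(), yi.ravel()])).reshape(xi.shape)

plt.imshow(density.T, origin="lower", extent=(-4, 6, -4, 5), cmap="viridis")
plt.scatter(x, y, s=3, c="white", alpha=0.5)  # the original discrete points
plt.show()
```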
Graphs (Static Networks)
o Data types → tables, multi-dimensional tables, networks, fields, trees, geometry
o Terminology: Graph → Network, Vertices → Nodes, Edges → Links
▪ Networks describe the relations between objects
▪ G = (V, E), with V the set of vertices and E ⊆ V × V the set of edges
▪ Networks can represent e-mail between people, financial transactions, etc.; they can be physical or non-physical
▪ Friendship → undirected; financial transaction → directed (source and destination bank account)
o Trees:
▪ A graph without cycles and with one root → acyclic
▪ Every other node has exactly one parent
▪ Single root: all other nodes are reachable from it
▪ Edges (E) = Vertices (V) − 1
▪ Hierarchy = "rooted tree"
o Network types:
▪ Static → structure
▪ Multivariate → extra information on nodes/edges
▪ Dynamic → evolving over time

Static Networks:
o Design choices for arrangement: node-link diagram (connection marks), adjacency matrix (derived table), enclosure (for trees)
o Structure types: radial and arc layouts; ordering plays an important role (it can hide patterns in the data)
o Node-link diagrams are easy for non-experts to understand
▪ The positioning of nodes is called the layout or embedding
▪ Computing layouts → graph drawing (e.g., force-directed methods)
▪ Readability and aesthetics criteria:
  Equal edge lengths
  Minimize crossings
  Non-overlapping nodes
  High-degree nodes in central positions
  Symmetry maximised
  Communities clearly visible
o Force-directed algorithms (see the sketch below):
▪ Based on mechanical laws: model edges as springs, and let nodes repel each other; numerically simulate until a stable state is reached
▪ Vertex force: a repelling force between vertices i and j, preventing vertices from coming too close to each other
▪ Edge force: spring forces on the edges, attracting the vertices connected by an edge and preventing them from getting too far apart
▪ Forces in action: compute the repulsion and attraction forces for all nodes; compute the net forces; move the vertices to their new positions
▪ Termination: a fixed number of iterations, total energy below some threshold, a local minimum, or user input
▪ Tasks: explore topology, locate paths, identify communities and node centrality
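A minimal sketch of a force-directed layout, assuming Python with numpy/matplotlib, a small invented edge list, simple repulsion and spring forces, and a fixed iteration count as the termination criterion:

```python
import numpy as np
import matplotlib.pyplot as plt

edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]  # hypothetical small graph
n = 5
rng = np.random.default_rng(3)
pos = rng.random((n, 2))              # random initial layout

for _ in range(200):                  # termination: fixed iteration count
    force = np.zeros((n, 2))
    # Vertex force: repulsion between every pair of vertices (~ 1/distance)
    for i in range(n):
        for j in range(n):
            if i != j:
                d = pos[i] - pos[j]
                dist = max(np.linalg.norm(d), 1e-2)
                force[i] += 0.01 * d / dist**2
    # Edge force: springs pull connected vertices together
    for i, j in edges:
        d = pos[j] - pos[i]
        force[i] += 0.05 * d          # attraction along the edge
        force[j] -= 0.05 * d
    pos += force                      # move vertices along the net force

for i, j in edges:                    # draw the resulting node-link diagram
    plt.plot([pos[i, 0], pos[j, 0]], [pos[i, 1], pos[j, 1]], "gray")
plt.scatter(pos[:, 0], pos[:, 1], s=200, zorder=3)
plt.show()
```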
Large Networks:
o Traditional (force-directed) network visualizations do not scale
o Network + hierarchy = compound network
▪ Hierarchical Edge Bundling: draw each edge as a curve that follows the hierarchy (Bézier splines), interpolating between control points
o Visual Adjacency Matrix:
▪ More scalable
▪ M(i,j): existence/weight of the edge i → j
▪ Maximises edge visibility; no crossing edges
▪ Cons: difficult to see paths and motifs
▪ Note: sorting is extremely important; ordering plays a role similar to layout for node-link diagrams → we want to see outliers and communities in the data
▪ Paths are no longer visible; use when the edges (communities) are what matters
o Compound graph; hierarchical clustering
o Clique = a fully connected group of nodes (each node is connected to every other node)

Graphs – Trees:
o A graph without cycles and one root
▪ Every other node has exactly one parent → acyclic
▪ Single root: all other nodes are reachable from there
▪ E = V − 1
▪ Hierarchy = "rooted tree"
o Arrangements for trees: node-link diagram, adjacency matrix (hardly used), enclosure
o Static tree visualizations:
▪ The typical node-link diagram has a lot of wasted white space; a radial node-link tree circumvents this using concentric circles
▪ Enclosure → inspired by the Venn diagram
  Treemap: a space-filling technique; long, thin rectangles can occur; a squarified layout generates nicer rectangles
▪ Icicle plot and sunburst diagram:
  No overlapping parent-child regions → attributes are easier to display
  Not as dense as treemaps, but better use of space than node-link trees
o Node-link vs treemap:
▪ Treemap (enclosure technique): scalable, very good usage of the available space → can show attributes; the hierarchy is implicit and harder to read
▪ Node-link diagram: intuitive; good at exposing the structure of the information; a lot of empty space

Dynamic Networks:
o Networks that change over time: each edge has a timestamp (when it occurs in the data)
o Structural properties: communities, motifs, more important nodes
o Temporal properties: trends, periodicity, temporal shifts, anomalies
o Discovery/exploration of states:
▪ Characterizing the evolution of the network
▪ Stable states, recurring states, outlier states, and the transitions between them
o Why not use automated methods (e.g., #nodes, #edges, degree distributions)?
▪ They give only a high-level summary of the network and an aggregation of results
▪ Loss of context; no exploration possible (they only confirm the expected, rather than discover the unknown unknowns)
o Time-varying network, time-stamped network, longitudinal network, evolving network, temporal network → all synonyms
o Animation → map time to time
▪ For each timestep, compute the layout
▪ Improvement: keep the variation between layouts as small as possible → preservation of the mental map; to achieve this: aggregation (over time) into a super-graph; add a timeline control
▪ Pros: simple to implement; easy to spot big changes; applicable to all methods that can visualize a single timestep
▪ Cons: need to focus on many moving or changing items simultaneously; keeping track of multiple changes over longer time periods relies on memory; change blindness
o Small multiples → split time into intervals
▪ Juxtaposed visualization using a filmstrip or grid layout
▪ Pros: independent of the visualization method used
▪ Cons: must decide on the number of multiples; there is a limit on that number; multiples might be far apart → difficult to spot patterns
o Integrated approaches → map time to space
▪ Provide a static overview of the entire time span of the network in one visualization; use automated methods for analysis and/or visualization
▪ Pros: a complete overview of the network in one visualization; global patterns can be identified directly
▪ Cons: specialized visual encodings are often difficult for non-experts to interpret; often restricted to specialized network types (acyclic, compound, etc.)
▪ Visualization: show the entire timespan; scalable in both the number of nodes and the number of timesteps; individual nodes and edges are no longer visible
▪ Automated method: cluster nodes and show the change of clusters over time
o What if the network becomes really large? → discovery of states: stable, recurring, outlier, and the transitions between them

Time series:
o Animation → map time to time; small multiples → split time into intervals
o Gantt chart:
▪ Data: 1 categorical attribute and 2 (related) quantitative attributes — one key, two values
▪ Mark: line, whose length encodes duration
▪ Channels: horizontal position for start/end times; horizontal length for duration
▪ Tasks: emphasize temporal overlaps and start/end dependencies between items
▪ Scalability → dozens of key levels, hundreds of value levels
o Other visualizations for time series: line chart, streamgraph, connected scatterplot, Gantt chart

Module 7 – "The Right Tool for the Job"

2 types of validation: downstream and upstream
o Algorithm validation:
▪ Downstream → analyse computational complexity
▪ Upstream → experimental study
o Visual encoding validation:
▪ Downstream → justification of the choices (an important validation step), derived directly from the task abstraction
▪ Upstream → different strategies

Informal usability study:
o Quantitative → measurable performance indicators (quality metrics are specific to idioms)
o Qualitative → non-measurable indicators
o ICE-T method / value equation: an "evaluation" of the value of the visualization
▪ T – the visualization's ability to minimize the total time needed to answer a wide variety of questions about the data
▪ I – the visualization's ability to spur and discover insights and/or insightful questions about the data
▪ E – the visualization's ability to convey an overall essence or take-away sense of the data
▪ C – the visualization's ability to generate confidence, knowledge and trust about the data, its domain and context
o Lab study:
▪ Quantitative → performance/accuracy of users
▪ Qualitative → user experience and/or opinions (requires time and effort)

Data/task abstraction validation:
o Threat: wrong task/data abstraction → misunderstood user requirements
o Validation: field user study → how people act in a real-world setting
o Case-study/insight-based validation: cases/scenarios where non-trivial discoveries can be achieved thanks to the visualization

Domain validation:
o Downstream: observe target users
o Upstream: measure adoption
o "Well designed is not always what is adopted"
