Midterm Study Guide - Lecture Slides.pdf
Uploaded by SpellbindingConstructivism
Purdue University
Midterm Study Guide - Lecture Slides

Data Sources, Data Types, and Data Collection

Data Sources
Secondary data is data already collected and available for use. Examples include reports, databases, and books.
○ Advantages include large sample sizes, lower cost, faster collection, and often no need for informed consent.
○ Disadvantages include a fixed study design, potential irrelevance to specific research questions, lack of depth, potential undervaluation in some fields, and the need for advanced statistical analysis.
Primary data is collected directly from the source for a specific purpose, such as through surveys, interviews, or experiments.
○ Advantages include collecting exactly the data needed, testing interventions purely, and controlling data collection for quality, minimized missing values, and instrument reliability assessment.
○ Disadvantages include high costs.

Data Types
Categorical/Nominal: Data with no intrinsic order (e.g., gender).
Ordinal: Data with a specific order but with differences between values that are not quantifiable (e.g., Likert scales).
Interval: Data with a meaningful order and consistent intervals between values but no true zero point (e.g., temperature in Celsius).
Ratio: Data with a meaningful order, consistent intervals, and a true zero point (e.g., height, weight, age, money).
Textual: Unstructured data in the form of words (e.g., survey comments).

Data Collection
Sample Selection:
○ Representation gap: Discrepancy between the sample and the broader population. Consider potential bias in sample selection.
Data Collection Reliability:
○ Consider who/what collected the data (human error, tool misuse, machine malfunction) and for whom/what/where (subjectivity, imprecision, bias).
○ Accuracy gap: Difference between collected data and the true values.

Data Collection Best Practices
Define Clear Objectives: Understand the purpose of data collection.
Choose the Right Data Collection Method: Surveys, direct measurement, focus groups, observation, diaries.
Ensure Data Quality: Minimize errors and biases.
○ Break down questions, use consistent formats, include a "prefer not to answer" option, provide examples, avoid open-ended fields (when possible), use automated validations and backend checks, and train data collectors.
Maintain Ethical Standards: Obtain informed consent and ensure data privacy.
Pilot Testing: Identify issues early.
Document Processes: Ensure consistency and transparency.

Data Cleaning and Transformation

Data Cleaning
Data cleaning ensures your data is consistent, accurate, and ready for analysis.
Data Separation: Divide data into relevant fields.
Data Type Check: Ensure data is in the correct format.
Range Check: Identify and handle values outside the expected range.
○ Imputation: Replace incorrect values with estimates (e.g., median).
○ Missing/Unknown: Flag incorrect values as missing.
Format Check: Ensure data consistency.
Handling Missing Data: Address missing values.
○ Deletion: Consider deleting the affected entries when less than 5% of the data for a variable is missing.
○ Placeholders: Use a placeholder such as "Unknown" for categorical data, or the mean or median for numerical data.
○ Interpolation: Estimate missing values for temporal or interval data.
Duplication: Remove duplicate entries.
Spelling Check: Correct spelling errors.
Data Standardization: Ensure consistency in text entries, formats, and units.
Best Practices: Make a copy of the data before cleaning and document the process.

Data Transformation
Data transformation techniques manipulate data to extract meaningful insights.
Selection and Aggregation: Selecting specific data subsets and applying calculations.
○ Count: Determine the frequency of occurrences.
○ Sum: Calculate the total of values.
○ Average/Mean: Find the central tendency, but consider the potential impact of outliers.
○ Median: A robust measure of central tendency less affected by outliers.
○ Percentage: Express proportions for comparisons.
Sorting and Filtering: Arrange data based on criteria and isolate specific subsets.
Data Reshaping: Changing the structure of the data.
○ Wide to Long (Unpivot): Converts data from a wide format with multiple columns to a long format with fewer columns but more rows. Long format is generally preferred for data analysis and visualization.
Mathematical Transformation: Applying mathematical operations to alter data distribution and reduce the impact of outliers and skewness.
○ Logarithmic: Addresses data spanning multiple orders of magnitude or with heavy right skewness.
○ Square root: Suitable for moderately skewed data or data with moderate right-skewed outliers.
○ Reciprocal: Effective when large values disproportionately influence the dataset or for right-skewed data.
○ Squaring/Cubing: Addresses left-skewed data.

Data Visualization and Exploration

What is Data Visualization?
Data visualization is the visual representation of data, using marks and attributes, to facilitate understanding.
Benefits include:
○ Uncovering trends in data.
○ Making it easier to extract meaning from data compared to raw numbers or text.
○ Leveraging preattentive attributes for faster processing (an often-cited figure is 60,000x faster than text).

Data Exploration
Examine the Data: Begin by understanding the data source and context.
Statistical Overview: Calculate descriptive statistics such as mean, median, minimum, maximum, and standard deviation to gain insights into the data distribution and identify potential outliers.
Visual Overview: Create initial visualizations to gain a visual understanding of the data. Explore different chart types to find the most effective representation.
Form Questions: Ask relevant questions based on initial observations to guide further exploration.
Visual Exploration: Utilize various visualizations and adjust granularity to answer your questions and uncover insights.
Examples:
○ Explore the distribution of collisions across boroughs using a bar chart.
○ Analyze trends in collisions over time using a line chart.
○ Compare the average number of people injured across boroughs using a bar chart.
○ Investigate the main contributing factors to collisions in different boroughs using a stacked bar chart.
Iterate: Data exploration is iterative; new questions may arise, leading to further visualization and analysis.

Data Visualization Background and History
Ancient and Early Examples
○ ~3000 BCE: Sumerian tablets with pictorial data.
○ ~1160 BCE: Turin Papyrus, an ancient Egyptian mining map.
○ ~100 CE: Ptolemy's maps and graphical data.
17th - 19th Century
○ 1620: Francis Bacon's emphasis on empirical data.
○ 1686: John Graunt's statistical charts.
○ 1786: William Playfair's invention of bar and line charts.
○ 1786: Joseph Priestley's visualization of rainfall.
○ 1801: William Playfair's creation of pie charts.
○ 1854: John Snow's cholera map.
○ 1858: Florence Nightingale's polar area diagrams.
○ 1869: Charles Minard's flow map depicting Napoleon's Russian Campaign.
20th - 21st Century
○ 1920s: Development of Pareto charts.
○ 1960s: John Tukey's contributions to exploratory data analysis and box plots.
○ 1985: Stephen Few's work on dashboard design.
○ 1987: Introduction of Microsoft Excel.
○ 1996: Ben Shneiderman's "Visual Information-Seeking Mantra".
○ 2000s: Emergence of Tableau for interactive visualizations.
○ 2007: Hans Rosling's Gapminder project.
○ 2010s: D3.js for web visualizations becomes popular.
○ 2020s: Increasing use of AI and machine learning in data visualization.

Visual Encoding and Chart Types

Visual Encoding
Marks: Basic visual elements like points, lines, or areas used to represent data items.
Attributes: Visual properties of marks that can be manipulated to encode data, including position, size, shape, color, and orientation.
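
To make marks and attributes concrete, here is a minimal sketch of the first exploration example (collisions across boroughs) as a text bar chart. The bar is the mark and its length is the attribute that encodes the count. The collision records, the function names, and the `max_width` parameter are made up for illustration and do not come from the lecture slides.

```python
from collections import Counter

# Hypothetical collision records (borough only); made-up data for illustration.
collisions = [
    "Brooklyn", "Queens", "Brooklyn", "Manhattan", "Bronx",
    "Brooklyn", "Queens", "Manhattan", "Brooklyn", "Queens",
]


def count_by_category(records):
    """Aggregation step: count how often each category occurs."""
    return Counter(records)


def encode_as_bars(counts, max_width=20):
    """Map each count to a bar (the mark) whose length (the attribute)
    is proportional to the count; the largest count fills max_width."""
    largest = max(counts.values())
    return {
        category: "#" * round(count / largest * max_width)
        for category, count in counts.items()
    }


counts = count_by_category(collisions)
bars = encode_as_bars(counts)

# Sort from most to fewest collisions so the ranking is easy to read.
for borough, count in counts.most_common():
    print(f"{borough:<10} {bars[borough]} {count}")
```

Because length starts from a common baseline, the eye can compare boroughs directly. Encoding the same counts with color or shape would make that comparison harder.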
Chart Taxonomy
The choice of chart type depends on the data type and the message to be conveyed. The lecture slides categorize charts as follows:

Categorical Charts: Useful for comparing categories and distributions.
Bar Chart: Displays data using rectangular bars, with lengths proportional to the values they represent.
Clustered Bar Chart: Compares values across multiple categories by grouping bars.
Bullet Chart: A bar chart with additional visual elements for context.
Waterfall Chart: Shows how a total value is composed of positive and negative changes.
Radar Chart: Displays multivariate data in a circular format with axes radiating from the center.
Polar Chart: Similar to a radar chart, using bars instead of lines.
Connected Dot Plot: Shows quantitative values for categories with secondary categorical breakdowns.
Pictogram: Uses icons or symbols to represent data, with the number of icons proportional to the value.
Proportional Symbol Chart: Represents data using symbols of different sizes, with area proportional to the values.
Word Cloud: Visualizes word frequency by size.
Matrix Chart & Heatmap: Display quantitative values across two dimensions using color intensity.
Dot Plot & Jitter Plot/Beeswarm Plot: Visualize data distribution and clustering, useful for small to medium datasets.
Histogram & Density Plot: Show the distribution of a single continuous variable. A histogram divides the range into bins and shows the frequency of data points within each bin; a density plot shows the same distribution as a smoothed curve.
Box-and-Whisker Plot: Displays data distribution through quartiles, highlighting outliers.

Hierarchical Charts: Represent part-to-whole relationships and hierarchical structures.
Pie Chart & Donut Chart: Show proportions of a whole for different categories.
Waffle Chart: A square grid where each square represents a unit of value.
Diverging Bar Chart & Stacked Bar Chart: Compare the contribution of different categories to a total value.
Marimekko Chart: A two-dimensional stacked bar chart where both bar height and width vary.
Treemap: Uses nested rectangles to represent hierarchical data.
Sunburst Chart & Dendrogram: Visualize hierarchical data with multiple tiers.
Venn Diagram: Illustrates relationships between sets using overlapping circles.

Relational Charts: Explore correlations and connections between variables.
Scatter Plot: Shows the relationship between two quantitative variables.
Bubble Plot: A variation of a scatter plot where a third variable is represented by the size of the bubble.
Network Diagram: Visualizes connections or relationships between entities.
Sankey/Alluvial Diagram & Chord Diagram: Show flows and connections between categories.

Temporal Charts: Visualize data over time.
Line Chart: Displays trends and changes over time.
○ Sparklines: Small, word-sized line charts.
Bump Chart/Ribbon Chart/Rank Chart: Tracks changes in ranking over time.
Slope Graph: Compares values at two points in time.
Connected Scatter Plot: Shows changes in two variables over time.
Area Chart: Displays the magnitude of change over time.
Stacked Area Chart: Shows the cumulative total of multiple categories over time.
Gantt Chart: Visualizes project schedules and task durations.
Instance Chart: Plots individual events or occurrences along a timeline.

Spatial Charts: Depict data geographically.
Choropleth: Displays data on shaded or patterned areas of a map.
Isarithmic Map/Contour Map: Shows continuous values across a geographic area using contour lines (e.g., elevation).
Proportional Symbol Map: Represents data using symbols of varying sizes at specific locations.
Dot Map: Uses dots to show the geographic distribution of data points.
Flow Map: Illustrates movement or connections between locations.
Cartograms (Distorted Maps): Spatial charts where the size of geographic areas is distorted to represent a variable.
Area Cartogram: Distorts area proportionally to data.
Dorling Cartogram: Represents each geographic area with a circle whose size encodes the data value, placed near the area's geographic location.
Grid Map: Divides geographic areas into a grid of uniform shapes (tiles), and uses color to represent a data value for each tile.

Visualizing Geospatial Data
Projections: Transforming 3D geographical data onto a 2D surface. Different projections preserve different aspects of the data (e.g., shape, area, distance, direction), and the choice of projection can impact the message conveyed. Common projections include Mercator, Equal Earth, Lambert Azimuthal, Winkel-Tripel, and Mollweide.
Layers: Combining multiple data layers on a single map for a comprehensive view.

Data Visualization Design Principles

Trustworthy Design
Data Integrity: Use accurate, consistent, complete, and reliable data.
Reasonable and Faithful Data Transformation: Avoid manipulating data to mislead; use appropriate transformations that maintain the data's integrity.
No Misleading Data Representation: Present data honestly and objectively; avoid distortion or manipulation that could lead to false interpretations.

Accessible Design
Relevant: Ensure visualizations are relevant to the audience and the message being conveyed.
Understandable: Make visualizations easy to comprehend; use clear labels, titles, legends, and annotations to guide interpretation.

Elegant Design
Eliminate the Arbitrary: Remove unnecessary elements; focus on clarity and conciseness.
Be Thorough: Pay attention to detail; ensure consistency in design choices.
Develop a Style: Maintain a consistent visual style for coherence.
Decoration Should Be Additive, Not Negative: Use visual embellishments to enhance understanding, not distract from the data.
Innovative: Explore new and effective ways to represent data.
Long-Lasting: Create visualizations that remain relevant and informative over time.
Environmentally Friendly: Consider the environmental impact of visualization choices.
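
One concrete case where a design choice affects how faithfully data is represented is the map projection mentioned under Visualizing Geospatial Data. The sketch below uses the standard spherical Mercator formulas to show how much the projection inflates areas at higher latitudes. The function names and the spherical-Earth radius are assumptions made for illustration, not part of the lecture slides.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; assumes a spherical Earth


def mercator(lat_deg, lon_deg, radius=EARTH_RADIUS_KM):
    """Project latitude/longitude in degrees to Mercator x, y (units of radius)."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    x = radius * lon
    y = radius * math.log(math.tan(math.pi / 4 + lat / 2))
    return x, y


def mercator_area_scale(lat_deg):
    """Factor by which Mercator inflates areas at a latitude: 1 / cos^2(latitude)."""
    return 1.0 / math.cos(math.radians(lat_deg)) ** 2


# Area inflation grows quickly away from the equator.
for lat in (0, 30, 60):
    print(lat, round(mercator_area_scale(lat), 2))
```

Mercator preserves local shape and direction, but at 60 degrees latitude it draws areas about four times too large. When the message depends on comparing areas, an equal-area projection such as Equal Earth or Mollweide is the safer choice.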
AI in Data Visualization
Role of AI: AI can automate tasks, assist in data analysis, and generate visualizations.
Human-AI Collaboration: While AI is a powerful tool, human expertise is still crucial for tasks such as:
○ Verification: Ensuring the accuracy and reliability of AI-generated results.
○ Prompting: Guiding AI systems with appropriate questions and instructions.
○ Solution Refinement: Interpreting and refining AI-generated visualizations for clarity and insight.
Importance of Fundamentals: Understanding the fundamentals of data visualization remains essential, even with the increasing use of AI.
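
As a closing worked example of those fundamentals, the range check, median imputation, and logarithmic transformation from the Data Cleaning and Data Transformation sections can be sketched as follows. The sample values, the valid age range, and the function names are hypothetical and chosen only for illustration.

```python
import math
import statistics

# Hypothetical survey ages; None marks a value that was never recorded.
raw_ages = [34, 29, None, 41, 230, 38, -5, 45]

VALID_AGE_RANGE = (0, 120)  # assumed plausible range for this illustration


def flag_out_of_range(values, low, high):
    """Range check: keep values inside [low, high], flag the rest as missing (None)."""
    return [v if v is not None and low <= v <= high else None for v in values]


def impute_with_median(values):
    """Imputation: fill missing entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def log_transform(values):
    """Base-10 log transform for strictly positive, right-skewed values."""
    return [math.log10(v) for v in values]


# Each step returns a new list, so raw_ages stays untouched as the backup copy.
checked = flag_out_of_range(raw_ages, *VALID_AGE_RANGE)
cleaned = impute_with_median(checked)

# Hypothetical incomes spanning several orders of magnitude.
incomes = [1_000, 10_000, 100_000, 1_000_000]
log_incomes = log_transform(incomes)

print(checked)
print(cleaned)
print(log_incomes)
```

The median is used for imputation because, as noted under Data Transformation, it is less affected by outliers than the mean. The log transform compresses the incomes onto an evenly spaced scale, which is the effect described for data spanning multiple orders of magnitude.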