Statistical Description and Data Visualization

Summary

This document provides an overview of statistical descriptions and data visualization techniques. It covers topics such as measures of central tendency, data dispersion, and examples of data visualization including charts and graphs. It also explains the concept of TF-IDF for textual data analysis.

Full Transcript

Statistical Description and Visualization

Lecture Outline
Using statistical summaries to understand the data.
Using visualization to understand the data.

Basic Statistical Descriptions of Data
Basic statistical descriptions: can be used to identify properties of the data and highlight which data values should be treated as noise or outliers.
Measures of central tendency: measure the location of the middle or center of a data distribution.
Dispersion of the data: how is the data spread out? The most common measures of dispersion are the range, quartiles, the five-number summary and boxplots, and the variance and standard deviation of the data.

Measuring the Central Tendency
Mean: average value.
Median: middle value.
Mode: most common value.
Midrange: the average of the largest and smallest values in the set.

Examples
We have the test scores of five students: 77, 65, 80, 91, 77
Mean = (77 + 65 + 80 + 91 + 77) / 5 = 390 / 5 = 78
Mode (most frequent value) = 77
Median: arrange the values in ascending or descending order: 65, 77, 77, 80, 91. The middle value is 77.

Symmetric vs. Skewed Data
Median, mean and mode of symmetric, positively skewed and negatively skewed data.
[Figure panels: positively skewed, negatively skewed, symmetric]
Data Mining: Concepts and Techniques, January 22, 2025

Measuring the Dispersion of Data
Quartiles, outliers and boxplots
Quartiles: Q1 (25th percentile), Q3 (75th percentile)
Interquartile range: IQR = Q3 - Q1
Five-number summary: min, Q1, median, Q3, max

Q1, Q2, Q3 and IQR Example
We have the test scores of nine students: 77, 65, 80, 91, 77, 87, 99, 50, 48
Order the values from least to greatest: 48, 50, 65, 77, 77, 80, 87, 91, 99
Identify the extremes and the median: min = 48, median (Q2) = 77, max = 99
Q1 is the median of the first half and Q3 is the median of the second half. With the median (77) counted in both halves, Q1 = 65 and Q3 = 87.
IQR = Q3 - Q1 = 87 - 65 = 22

Measuring the Dispersion of Data
Variance: measures data spread. It is the average squared deviation of each data point from the mean.
Standard deviation: s (or σ) is the square root of the variance s² (or σ²).

For Textual Data: TF-IDF Example
Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure used in text analysis to evaluate how important a word is to a document in a collection or corpus.
It is used in Natural Language Processing (NLP) and text analytics.
The importance of a word increases proportionally to the number of times it appears in a document (Term Frequency) but is offset by how often it appears in the entire collection of documents (Inverse Document Frequency).

Example TF-IDF
Assume we have three documents being processed together in a corpus:
1. Document 1: "The cat sat on the mat."
2. Document 2: "The dog sat on the log."
3. Document 3: "The cat lay on the rug."
To calculate TF-IDF, we first remove stop words (the, on, is, ...):
1. Document 1: ["cat", "sat", "mat"]
2. Document 2: ["dog", "sat", "log"]
3. Document 3: ["cat", "lay", "rug"]
To calculate TF: in Document 1, "cat" appears 1 time out of 3 total terms, so TF(cat, d1) = 1/3 ≈ 0.33
To calculate IDF: determine how rare each word is across all documents:
"cat" appears in 2 out of 3 documents
"mat" appears in 1 out of 3 documents
"mat" is therefore the rarer term and receives the higher IDF.

Data Visualization
Definition: data visualization is the graphical representation of information and data. It uses visual elements like charts, graphs, and maps to make complex data easy to understand and interpret.
Purpose: to uncover trends, patterns, and insights that might not be apparent in raw data.
Why it is important:
Makes large datasets understandable.
Enhances decision-making by presenting key insights.
Communicates complex ideas quickly and effectively.
Helps identify patterns, correlations, and anomalies.

Basic Charts

Line Chart
Mariam Elhussein, Samiha Brahimi, Abdullah Alreedy, Mohammed Alqahtani, Sunday O. Olatunji, "Google trends identifying seasons of religious gathering: applied to investigate the correlation between crowding and flu outbreak", Information Processing & Management, Volume 57, Issue 3, 2020, 102208, ISSN 0306-4573, https://doi.org/10.1016/j.ipm.2020.102208.

Histogram Analysis

Histograms Often Tell More
Two histograms may have the same boxplot representation, with the same values for min, Q1, median, Q3 and max, and yet have rather different data distributions.

Advanced Visuals

Scatter Plot
Provides a first look at bivariate data to see clusters of points, outliers, etc.
Each pair of values is treated as a pair of coordinates and plotted as a point in the plane.

Positively and Negatively Correlated Data
In the figure, the left half fragment is positively correlated and the right half is negatively correlated.

Uncorrelated Data
[Figure]

Boxplots
Boxplot: the ends of the box are the quartiles, the median is marked, whiskers are added, and outliers are plotted individually.

Boxplot Analysis
Source: https://statisticsbyjim.com/graphs/box-plot/

Tree-Map

Word Cloud
https://www.uxforthemasses.com/word-clouds/
Install the wordcloud package:
    conda install -c conda-forge wordcloud

Practical Part!

Thank you!
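
As a starting point for the practical part, the test-score examples from the lecture can be checked in Python. This is a minimal sketch using only the standard-library `statistics` module. The variable names are illustrative, and two choices are assumptions rather than something the slides state: `method="inclusive"` for the quartiles (it yields Q1 = 65 and Q3 = 87 for this data, the values used in the IQR example) and the population variance, which divides by n and so matches "average squared deviation".

```python
import statistics

# Five test scores from the central-tendency example.
scores = [77, 65, 80, 91, 77]

mean = statistics.mean(scores)                # (77+65+80+91+77) / 5 = 78
median = statistics.median(scores)            # middle of 65, 77, 77, 80, 91 = 77
mode = statistics.mode(scores)                # most frequent value = 77
midrange = (min(scores) + max(scores)) / 2    # (65 + 91) / 2 = 78.0

# Nine test scores from the quartile example.
nine_scores = [77, 65, 80, 91, 77, 87, 99, 50, 48]

# Assumption: the "inclusive" method, which gives 65, 77, 87 for this data.
# The default "exclusive" method would give different quartiles.
q1, q2, q3 = statistics.quantiles(nine_scores, n=4, method="inclusive")
iqr = q3 - q1                                 # 87 - 65 = 22
five_number_summary = (min(nine_scores), q1, q2, q3, max(nine_scores))

# Assumption: population variance (divide by n), i.e. the average squared
# deviation from the mean. statistics.variance() would divide by n - 1.
variance = statistics.pvariance(scores)       # 344 / 5 = 68.8
std_dev = statistics.pstdev(scores)           # square root of the variance

print("mean, median, mode, midrange:", mean, median, mode, midrange)
print("five-number summary:", five_number_summary)
print("IQR:", iqr)
print("variance, standard deviation:", variance, round(std_dev, 2))
```

Different tools use different quartile conventions, so it is worth checking which one a library applies before comparing its output with a hand calculation.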
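
The TF-IDF example with the three cat/dog documents can be sketched the same way. The slides give the term frequency and the document counts but no IDF formula, so this sketch assumes one common variant, IDF(t) = ln(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. The helper names `tf`, `idf` and `tf_idf` are hypothetical, and libraries often use smoothed variants that give different numbers.

```python
import math

# Documents from the lecture example, after removing stop words.
corpus = {
    "d1": ["cat", "sat", "mat"],
    "d2": ["dog", "sat", "log"],
    "d3": ["cat", "lay", "rug"],
}


def tf(term, tokens):
    """Term frequency: occurrences of term divided by total terms in the document."""
    return tokens.count(term) / len(tokens)


def idf(term, docs):
    """Inverse document frequency, assuming the plain ln(N / df) variant.

    Assumes the term occurs in at least one document (df > 0).
    """
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(len(docs) / df)


def tf_idf(term, doc_id, docs):
    """TF-IDF weight of term in the document identified by doc_id."""
    return tf(term, docs[doc_id]) * idf(term, docs)


print("TF(cat, d1) =", round(tf("cat", corpus["d1"]), 2))   # 1/3, about 0.33
print("IDF(cat) =", round(idf("cat", corpus), 3))           # ln(3/2), about 0.405
print("IDF(mat) =", round(idf("mat", corpus), 3))           # ln(3/1), about 1.099
print("TF-IDF(cat, d1) =", round(tf_idf("cat", "d1", corpus), 3))
print("TF-IDF(mat, d1) =", round(tf_idf("mat", "d1", corpus), 3))
```

Under this assumed formula, "mat" scores higher than "cat" in Document 1: both appear once there, but "mat" occurs in only one of the three documents.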