Learning from Data Lecture 6 PDF

Document Details


Uploaded by SportyDeciduousForest4462

University of Exeter

Dr Marcos Oliveira

Tags

data preprocessing, data cleaning, data transformation, data analysis

Summary

This document is a lecture on data preprocessing, specifically focusing on data cleaning, transformation, and visualization techniques. It details methods for handling missing values and outliers, along with approaches to addressing various data processing errors.

Full Transcript

Learning from Data, Lecture 6 (Dr Marcos Oliveira)

Preprocessing usually involves:
1. Data integration
2. Data cleaning
3. Data transformation
4. Data visualisation

Data Integration / Data Fusion
From now on, we assume that all the data can be retrieved from the same place.

Data Cleaning
It is the process of detecting and correcting corrupt or inaccurate records in our data. Consider this running example:

Weight (kg) | Height (m) | Age (years)
80          | 1.80       | 30
18          | 0.25       | 0.5
50          | 1.49       | 15
-           | 1.55       | 13
199         | 2.72       | 22

Data Cleaning: missing values
Data values that we expect to have but that are absent from our data set. In pandas, they are often represented as NaNs and can be inspected with pandas_df.info() or pandas_df[col].isna(), but they could be encoded in different ways (e.g., None, 9999, N/A).

Data Cleaning: missing values — Why?
Human error; respondents may refuse to answer a survey question; the person taking the survey does not understand the question; the provided value was an obvious error, so it was deleted; not enough time to respond to questions; lost records due to a lack of effective database management; among many others!

Data Cleaning: types of missing values
Missing values fall under three types: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). From MCAR to MNAR, missing values become more problematic and harder to deal with.

Missing Completely At Random (MCAR) occurs when data is missing purely by chance. The probability of a missing value is equal for all units and is unrelated to both the observed and the unobserved data. Example: an air quality sensor fails to communicate with its server due to random fluctuations in the internet connection; its missing values are of the MCAR type.

Missing At Random (MAR) occurs when some data objects are more likely to have missing values than others. The probability of a missing value is related to the observed data, but not to the missing data itself. Example: high wind speed might sometimes cause an air quality sensor to malfunction and render it unable to give a reading; the values missing during high winds are classed as MAR.

Missing Not At Random (MNAR) occurs when we can tell exactly which data objects will have missing values. The probability of a missing value is related to the actual missing data. Example: a power plant emits so much air pollutant that it tampers with the air quality sensor; the readings lost in this situation are classed as MNAR, because the missingness depends on the very values that would have been measured.

Missing values - What to do?
Four approaches to dealing with missing values: keep them as they are, remove rows, remove columns, or impute values.
Guiding principles: preserve data and information; minimise the introduction of bias.
Considerations: analytic goals (e.g., clustering, classification, visualisation), analytic tools, the cause of the missing values, and the missing-value type (MCAR, MAR, MNAR).
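Before choosing an approach, it helps to quantify how much is missing and where. A minimal pandas sketch, assuming a hypothetical input file and a hypothetical 9999 sentinel code:

import numpy as np
import pandas as pd

df = pd.read_csv("data.csv")                      # hypothetical input file
# Convert sentinel encodings such as 9999 or "N/A" into real NaNs first
df = df.replace({9999: np.nan, "N/A": np.nan})

df.info()                  # non-null counts and dtypes per column
print(df.isna().sum())     # number of missing values per column
print(df.isna().mean())    # fraction of missing values per column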
Approach 1: Keep the missing values as they are. When?
1. When sharing data with others, allowing them to determine how to address the missing values based on their own requirements.
2. When your tools and goals can handle missing values seamlessly, without the need for changes. For instance, the K-Nearest Neighbours algorithm can be fine-tuned to accommodate missing values without omitting data objects.

Approach 2: Remove the data objects (rows) with missing values.
⚠ Careful with that axe! ⚠ Omitting data can conflict with the goals of keeping information and avoiding the introduction of bias. For the MNAR and MAR types, avoid removal: it implies excluding a notably distinct subset of the dataset, thus risking bias. For MCAR, explore alternative handling methods first, considering removal only when other strategies are unfeasible.

Approach 3: Remove the attributes (columns) with missing values.
Consider attribute removal when a column exhibits a high incidence of missing values. If the attribute is crucial to the project, too many missing values may make the project infeasible; if it is not crucial, consider removal to maintain data integrity. A practical threshold: when the missing rate is higher than 25%, removal is often more viable than estimation, to avoid potential misrepresentations in the analysis.

Approach 4: Estimate and impute the missing values.
Use models and problem knowledge to replace missing values, being mindful of the potential introduction of bias. General imputation methods:
- Central tendency (mean, median, or mode): for MCAR, use the overall mean, median, or mode; for MAR, use the central tendency of a relevant data subgroup.
- Regression analysis: applicable (though not ideal) for MNAR-type missing values.
- Interpolation: used for time-series datasets with MCAR-type missing values.
Note that the goal of these methods is not the precise prediction of the missing values, but an imputation that minimises analytical bias.

Reference: Jafari, R. (2022). Hands-On Data Preprocessing in Python: Learn How to Effectively Prepare Data for Successful Data Analytics. Packt Publishing.
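These imputation methods are a few lines each in pandas. A minimal sketch, continuing the hypothetical data frame above (column and group names are likewise hypothetical):

import pandas as pd

# MCAR: impute with the overall central tendency
df["weight"] = df["weight"].fillna(df["weight"].median())

# MAR: impute with the central tendency of a relevant subgroup
# (the grouping column "age_group" is a hypothetical example)
df["height"] = df["height"].fillna(
    df.groupby("age_group")["height"].transform("median"))

# Time series with MCAR gaps: interpolate between neighbouring observations
df["pm25"] = df["pm25"].interpolate(method="linear")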
Data Cleaning: outliers
Outliers are data points that differ significantly from the other observations (e.g., the 199 kg / 2.72 m row in the table above). Outliers could be: data errors that require correction or removal; legitimate values that may skew model results; or fraudulent entries necessitating scrutiny.

One approach to detect outliers is to use the quartiles. We define as outliers the data points outside the interval

[Q1 - k(Q3 - Q1), Q3 + k(Q3 - Q1)], with k = 1.5,

where Q1 and Q3 are the first and third quartiles, and Q3 - Q1 is the interquartile range (IQR). In pandas:

Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)
IQR = Q3 - Q1
BM = (df['column'] < (Q1 - 1.5 * IQR)) | (df['column'] > (Q3 + 1.5 * IQR))
df[BM]   # boolean mask: selects the rows flagged as outliers

Data Cleaning: outliers — What to do?
1. Do nothing: the best approach when the modelling used is robust against outliers. Sometimes, merely detecting outliers is itself the analytic goal.
2. Replace with the upper or lower cap: ideal when the analysis is sensitive to outliers and retaining all data objects is vital.
3. Log transformation (more on this soon): applicable when the data is skewed, so that some data objects significantly deviate from the majority.
4. Remove the data objects with outliers: the worst option, due to the potential loss of information. It should be done only when the other methods are inapplicable and the data is correct but the outlier values are excessively distinct.

Data Cleaning: errors
An error is a discrepancy or deviation of the measured data from the actual or true values, which can arise from various factors during data collection or measurement.
Random errors: unavoidable fluctuations or inconsistencies in the data.
Systematic errors: consistent, repeatable errors that can be associated with particular sources or causes (e.g., inaccurate instruments, misconfigured hardware). Detecting systematic errors is challenging, and they often go unnoticed; outlier detection can help to identify potential systematic errors.

Data Transformation
The last stage of data preprocessing before using our analytic tools. It ensures our dataset meets key characteristics and is ready for analysis!

Data Transformation: different data ranges
We often want to adjust data ranges to make our dataset compatible with model assumptions, improve algorithm stability, and ensure a fair contribution from all features.
Standardisation: we rescale the data to have a mean of 0 and a standard deviation of 1. For every feature, subtract the mean and divide by the standard deviation.
Normalisation: we rescale the data to a common scale, typically [0, 1]. For every feature, subtract the minimum and divide by its full range.
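Both rescalings are one-liners in plain pandas. A minimal sketch, assuming the hypothetical numeric feature columns below:

num = df[["weight", "height"]]                             # hypothetical features

standardised = (num - num.mean()) / num.std()              # mean 0, std dev 1
normalised = (num - num.min()) / (num.max() - num.min())   # rescaled to [0, 1]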
Data Transformation: log transformation
To address skewness and extreme values, we can use a log transformation. A log scale is good for features that vary over orders of magnitude, or when you care about ratios more than differences. (A short code sketch appears after the resources list at the end of this transcript.)

Data Transformation: cat ↔ num
We often want to transform our data from a numerical representation to a categorical one, or vice versa: binary coding, ranking transformation, attribute construction, and discretization. (Likewise sketched at the end.)

Data Transformation: smoothing
We can use smoothing to eliminate noise or fluctuations in the data, providing a clearer view of the underlying trend or pattern. There are many approaches; the Moving Average is a widely used smoothing method that averages data points in successive subsets: we average a specified number of consecutive data points, so the smoothed value at each position is the average of its own and its neighbours' values. (Likewise sketched at the end.)

Data Visualisation
[Slide examples: line plots and scatter plots.]
Careful with pie charts: just use something else. [Slide example: a heatmap of airline passengers by month and year.]

Resources for inspiration
- D3: https://d3js.org/ (worth a look even if you are not planning to learn D3)
- Interactive Data Lab, University of Washington: http://idl.cs.washington.edu/
- Our World in Data: https://ourworldindata.org/
- Big Picture Group, Google Brain: https://research.google.com/bigpicture/
- ColorBrewer: http://colorbrewer2.org/
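The log transformation above, as a minimal sketch (np.log1p computes log(1 + x), so zero values stay finite; the column name is a hypothetical example):

import numpy as np

# Compresses large values; equal ratios become equal differences
df["income_log"] = np.log1p(df["income"])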
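A sketch of both cat ↔ num directions (the bin edges, labels, and column names are hypothetical):

import pandas as pd

# Discretization (num -> cat): bin a numeric attribute into ordered groups
df["age_group"] = pd.cut(df["age"], bins=[0, 18, 65, 120],
                         labels=["child", "adult", "senior"])

# Binary coding (cat -> num): one-hot encode the categorical attribute
df = pd.concat([df, pd.get_dummies(df["age_group"], prefix="age")], axis=1)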
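Moving-average smoothing, as a sketch (the window size of 7 and the column name are hypothetical):

# Each smoothed value is the average of the point itself and its neighbours
df["smoothed"] = df["reading"].rolling(window=7, center=True).mean()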
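The passengers-by-month/year heatmap on the slide appears to match seaborn's built-in flights dataset; a sketch of how such a heatmap can be drawn, under that assumption:

import matplotlib.pyplot as plt
import seaborn as sns

flights = sns.load_dataset("flights")   # monthly airline passengers, 1949-1960
table = flights.pivot(index="month", columns="year", values="passengers")
sns.heatmap(table)                      # rows: months, columns: years, colour: passengers
plt.show()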
