Outlier Detection - Business Intelligence & Data Analytics PDF
Algonquin College
Grace Pauly
Summary
This document, part of the CST8390 Business Intelligence & Data Analytics course from Algonquin College, explores outlier detection. It covers different types of outliers, the reasons for their presence, and methods like Z-score, LOF, and Isolation Forest for identifying them.
Full Transcript
CST8390 BUSINESS INTELLIGENCE & DATA ANALYTICS
Week 6
Prof. Grace Pauly

Outlier Detection

What are outliers in the data?
An outlier is an observation that is significantly different from all other observations.
https://www.analyticsvidhya.com/blog/2021/05/why-you-shouldnt-just-delete-outliers/
https://humansofdata.atlan.com/2017/10/how-to-find-outliers-data-set/

Common causes of outliers in datasets:
Human error while manually entering data, such as a typo.
Intentional errors, such as dummy outliers included in a dataset to test detection methods.
Measurement errors caused by instrument error.
Data processing errors that arise from data manipulation or unintended mutations of a dataset.

Why is detecting outliers important?
https://humansofdata.atlan.com/2017/10/how-to-find-outliers-data-set/

Applications of Outlier Detection
Financial fraud detection (banking, credit card, etc.)
Telecom fraud detection
Medical diagnosis
Web analytics

Types of Outliers
Three types:
Global outliers (point anomalies)
Contextual outliers (conditional outliers)
Collective outliers
Examples: https://towardsdatascience.com/outliers-analysis-a-quick-guide-to-the-different-types-of-outliers-e41de37e6bf6

Global Outlier
A data object that deviates significantly from the rest of the data set.

Contextual Outlier
A data object that deviates significantly within a selected context. Ex: a temperature in Ottawa judged in the context of June; a value that is normal in another month may be an outlier for June.

Collective Outliers
A subset of data objects that collectively deviates significantly from the whole data set, even if the individual data objects may not be outliers.

Identify Outliers: Global Outliers
https://victoriametrics.com/blog/victoriametrics-anomaly-detection-handbook-chapter-2/

Identify Outliers: Contextual Outliers

Identify Outliers: Collective Outliers

How Can You Identify Outliers?
Visual methods
Statistical methods
Machine learning methods

Z-Score
Outliers are found from z-score calculations by looking for data points whose z-scores are too far from 0 (the mean).
To calculate the Z-score, subtract the mean from each individual data point and divide the result by the standard deviation. E.g., with thresholds of +3 and -3, any data point with a z-score above +3 or below -3 is considered an outlier.

Limitation of Z-Score
https://www.scribbr.com/statistics/outliers/

Local Outlier Factor (LOF)
Local outlier factor (LOF) is an algorithm used for unsupervised outlier detection. It produces an anomaly score for each data point; points with high scores are treated as outliers. It does this by measuring the local density deviation of a given data point with respect to the data points near it.

Working of LOF:
Local density is estimated from the distances between a data point and its neighbors (the k nearest neighbors), so a local density can be calculated for each data point. Comparing these densities shows which data points have a density similar to that of their neighbors and which have a lower density than their neighbors. The points with lower density are considered outliers.
https://www.geeksforgeeks.org/local-outlier-factor/

Isolation Forest
Isolation Forest is an unsupervised machine-learning algorithm for anomaly detection. Isolation forests are a type of tree-based ensemble algorithm similar to random forests.

RapidMiner Demo
LOF
ISF

References
http://researchmining.blogspot.com/2012/10/types-of-outliers.html
http://scikit-learn.org/stable/modules/outlier_detection.html
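
The Z-score rule and the LOF procedure described in the transcript can be illustrated with a short Python sketch. It is not taken from the slides: the function names, the sample data, the use of the population standard deviation, and the choice of k = 3 are all assumptions made for illustration. The LOF part is a simplified version that takes exactly k neighbors per point and assumes there are no duplicate points. It uses only the Python standard library (Python 3.8 or later).

```python
import math
import statistics


def z_scores(values):
    """Return (x - mean) / standard deviation for every value."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std dev (an assumption)
    return [(x - mean) / std for x in values]


def z_score_outliers(values, threshold=3.0):
    """Return the values whose z-score is above +threshold or below -threshold."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]


def local_outlier_factors(points, k=3):
    """Return one LOF score per point; scores well above 1 suggest an outlier.

    Simplified sketch: exactly k neighbours per point, no duplicate points.
    """
    n = len(points)
    # k nearest neighbours of each point as (distance, index), excluding itself.
    neighbours = [
        sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)[:k]
        for i, p in enumerate(points)
    ]
    # Distance from each point to its k-th nearest neighbour.
    k_distance = [nbrs[-1][0] for nbrs in neighbours]

    def local_density(i):
        # Inverse of the mean reachability distance to the k neighbours.
        reach = [max(k_distance[j], d) for d, j in neighbours[i]]
        return len(reach) / sum(reach)

    density = [local_density(i) for i in range(n)]
    # LOF = average density of the neighbours divided by the point's own density.
    return [
        sum(density[j] for _, j in neighbours[i]) / (k * density[i])
        for i in range(n)
    ]


# Hypothetical sample data: 19 ordinary readings and one extreme value.
readings = [10, 12, 11, 9, 10, 13, 11, 10, 12, 9,
            11, 10, 12, 11, 10, 9, 13, 10, 11, 95]
print("Z-score outliers:", z_score_outliers(readings))

# Hypothetical 2-D data: a tight cluster plus one distant point.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2),
          (0.8, 0.9), (1.0, 0.8), (6.0, 6.0)]
for point, score in zip(points, local_outlier_factors(points, k=3)):
    print(point, round(score, 2))
```

On this sample data the Z-score rule flags only the value 95, and the distant point (6.0, 6.0) receives a LOF score far above 1 while the clustered points stay close to 1. With the population standard deviation, the largest possible z-score in a sample of n points is the square root of n - 1, so a ±3 threshold can never flag anything in a sample of 10 or fewer points.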