Full Transcript

Test yourself
testmoz.com/13653924

Meaning of data quality
"Data is like garbage. You'd better know what you are going to do with it before you collect it." (Mark Twain)
- Poor quality data can be useless or even dangerous.
- Trusting poor quality data is difficult.

The effects of data quality problems
- Incorrect prices in retail inventory databases [English 1999]: costs for consumers of 2.5 billion $; 80% of barcode scan errors are to the disadvantage of the consumer.
- IRS 1992: almost 100,000 tax refunds not deliverable (English 1999).
- 50% to 80% of computerized criminal records in the U.S. were found to be inaccurate, incomplete, or ambiguous (Strong et al. 1997a).
- US Postal Service: of 100,000 mass mailings, up to 7,000 are undeliverable due to incorrect addresses (Pierce 2004).

Cost of dirty data
- A.T. Kearney: 25%-40% of operative costs are due to poor data quality.
- Data Warehouse Institute: industry and administration in the US lose 600 billion USD annually.
- SAS study: only 18% of German companies trust their own data.
- 80% of all hospital records contain errors...

Data accuracy
Accuracy: closeness between the value in the data and the true value.
- Reasons for low accuracy of numerical attributes: noisy measurements, limited precision, wrong measurements, transposition of digits (when entered manually).
- Reasons for low accuracy of categorical attributes: erroneous entries, typos.

179 quality criteria and 15 dimensions

Some aspects of data quality
- Syntactic accuracy: violated if the entry is not in the domain. Examples: "fmale" in gender, text in numerical attributes, ... Can be checked quite easily.
- Semantic accuracy: violated if the entry is in the domain but not correct. Example: John Smith is female. Needs more information to be checked.
- Completeness: violated if an entry is missing some critical values. Example: complete records are missing, so the data is biased (a bank has rejected customers with low income).
- Unbalanced data: the data set might be biased extremely toward one type of record. Example: defective goods are a very small fraction of all goods.
- Timeliness: is the available data up to date?

Typical problems with data quality
- End of line is not recognized
- Footer/Preamble
- Separation of fields (comma, semicolon, tab)
- Incorrect value
- Missing values
- Incorrect format
- Incorrect title
- Superfluous characters
- The role of decimals
- Searching for “Britney Spears”
- Duplicate announcement in same newspaper on same day
- Direct Marketing by The Economist
- FIFA registration form 2010
- Transaction duplicate
- Product duplicate (1)
- Product duplicate (2)
- Reviewer complaint
- German Umlaute
- Status change from “booked” to “Booked”
- Employee list
- Death by typo
- Translation problems
- Filling an application form
- False pricing

What to do with outliers?
An outlier is a value or data object that is far away or very different from all or most of the other data.
Causes for outliers:
- Data quality problems (erroneous data coming from wrong measurements or typing mistakes)
- Exceptional or unusual situations/data objects
Outliers coming from erroneous data should be excluded from the analysis. Even if the outliers are correct (exceptional data), it is sometimes useful to exclude them from the analysis. For example, a single extremely large outlier can lead to a completely misleading mean value.

Visualizing the distribution
An easy way to find outliers is to visualize the distribution of the variables. A boxplot or histogram can be used for one variable, and a scatter plot for two variables.
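A boxplot marks as outliers the points that fall beyond its whiskers. The sketch below applies that idea numerically, assuming the common 1.5 × IQR whisker convention (the slides do not state a threshold). The function name `iqr_outliers` and the sample `readings` are hypothetical and chosen only for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR].

    k=1.5 is the usual boxplot convention; it is an assumption here,
    not a threshold taken from the slides.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical sample: nine plausible readings and one extreme value.
readings = [10, 11, 12, 12, 13, 13, 14, 14, 15, 500]

outliers = iqr_outliers(readings)
cleaned = [v for v in readings if v not in outliers]

print("outliers:", outliers)
print("mean with outlier:", statistics.mean(readings))
print("mean without outlier:", statistics.mean(cleaned))
print("median with outlier:", statistics.median(readings))
```

On this sample the single value 500 pulls the mean to 61.4 while the median stays at 13, which illustrates how one extreme outlier can make the mean misleading.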
Outliers in multidimensional data
- Scatter plots for (visually) detecting outliers with two attributes
- PCA or MDS plots for (visually) detecting outliers
- Cluster analysis techniques: outliers are those points that cannot be assigned to any cluster

Deletion, replacement and sub-sampling
- The first way to handle outliers is to delete them. If the outlier is caused by human error (e.g. a typo or an unrealistic response), we can delete the observation.
- The second way is replacement: replace the observations with other values (mean, etc.) instead of deleting them.
- The third way is sub-sampling: if the samples belonging to a category are outliers, it is advisable to analyze them separately from the rest.

Treatment of missing values
Deletion: delete all observations with missing values (delete all, listwise deletion, partial deletion). Deleting them all is easy, but it reduces the total number of observations, making the model less valid.

Example with sensor data
The zero values might come from a broken or blocked sensor and might be considered missing values.

Reasons for missing values
For some instances, values of single attributes might be missing.
Causes for missing values:
- broken sensors
- refusal to answer a question
- an attribute that is irrelevant for the corresponding object (pregnant (yes/no) for men)
A missing value might not necessarily be indicated as missing (instead: zero or default values).

Replace with summary statistics
Replace with other values (mean, mode, median). Example: if the mean male height is 173 and the mean female height is 158, a missing height for a male observation is replaced with 173. With this approach, the model may be distorted because the substituted value is chosen arbitrarily rather than observed.
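The height example can be sketched in a few lines. This is a minimal illustration, not a recommended pipeline: the function `impute_group_mean`, the field names, and the sample records are all hypothetical, and missing values are assumed to be encoded as `None`. The sample heights were picked so that the group means match the 173 and 158 used in the example.

```python
from statistics import mean


def impute_group_mean(records, group_key, value_key):
    """Return copies of the records with missing values (None) replaced
    by the mean of the observed values in the same group.

    Assumes every group has at least one observed value; otherwise the
    lookup below raises KeyError.
    """
    observed = {}
    for r in records:
        if r[value_key] is not None:
            observed.setdefault(r[group_key], []).append(r[value_key])
    group_means = {g: mean(vals) for g, vals in observed.items()}
    return [
        {**r, value_key: group_means[r[group_key]]} if r[value_key] is None else dict(r)
        for r in records
    ]


# Hypothetical records; None marks a missing height.
people = [
    {"sex": "male", "height": 170},
    {"sex": "male", "height": 176},
    {"sex": "male", "height": None},
    {"sex": "female", "height": 155},
    {"sex": "female", "height": 161},
    {"sex": "female", "height": None},
]

filled = impute_group_mean(people, "sex", "height")
for row in filled:
    print(row)
```

Note that every imputed male record receives exactly the same value, so the spread of the attribute shrinks artificially. That is one concrete way the distortion mentioned above can show up.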
Replace with predicted values
Insert predicted values: the missing values are predicted and assigned using statistical methods (regression modeling) or machine learning methods (clustering or supervised learning). This is better than replacement with summary statistics (because the subjectivity of the analyst drops out). However, the same limitation still exists, because the inserted value is not the actually observed value.

A checklist for data understanding
- Determine the quality of the data (e.g. syntactic accuracy).
- Find outliers (e.g. using visualization techniques).
- Detect and examine missing values, possibly hidden by default values.
- Discover new or confirm expected dependencies or correlations between attributes.
- Check specific application-dependent assumptions (e.g. the attribute follows a normal distribution).
- Compare statistics with the expected behavior.

Thank you for your attention!
