Data Cleaning with Pandas - Lecture Slides PDF
Document Details

Uploaded by GraciousPanther7106
Mount Royal University
Summary
This document provides lecture slides on the topic of data cleaning using Pandas, covering key concepts such as data objects, attributes, data quality, and handling missing values, outliers and duplicate data. The course also covers types of data.
Full Transcript
Data Cleaning with Pandas

Lecture Outline
Data Objects and Attribute Types
Basic Statistical Descriptions of Data
Data Visualization
Measuring Data Similarity and Dissimilarity

What Are Data Sets?
Data sets are made up of data objects.
A data object represents an entity.
An attribute is a property or characteristic of an object. Examples: eye color of a person, temperature, etc.
An attribute is also known as a dimension, feature, or variable.
A collection of attributes describes an object.
An object is also known as a sample, example, data point, or instance.

Example data set (columns are attributes, rows are objects):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Attribute Types
Nominal: categories, states, or “names of things”. Examples: Hair_color = {auburn, black, blond, brown, grey, red, white}; marital status; occupation.
Binary: a nominal attribute with only 2 states (0 and 1), e.g., diagnosis.
Ordinal: values have a meaningful order (ranking), but the magnitude between successive values is not known. Examples: Size = {small, medium, large}; grades {A+, A, A-…}.

Discrete vs. Continuous Attributes
Discrete Attribute
Has only a finite or countably infinite set of values.
E.g., zip codes, profession, or the set of words in a collection of documents.
Sometimes represented as integer variables.
Note: Binary attributes are a special case of discrete attributes.
Continuous Attribute
Has real numbers as attribute values.
E.g., temperature, height, or weight.
In practice, real values can only be measured and represented using a finite number of digits.
Continuous attributes are typically represented as floating-point variables.

Types of Data Sets
Datasets are collections of data used for analysis, insights, and decision-making in data analytics. They are categorized into different types based on their structure and format.
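Before turning to the individual data set types, the example table and the attribute types from the preceding slides can be sketched in pandas. This is only an illustration: the column names (MaritalStatus, TaxableIncomeK) and the choice of dtypes are assumptions made here, not something the slides prescribe.

```python
import pandas as pd

# The ten data objects (rows) from the example table; each column is an attribute.
# Column names are illustrative; incomes are stored in thousands (125K -> 125).
df = pd.DataFrame({
    "Tid": range(1, 11),
    "Refund": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
    "MaritalStatus": ["Single", "Married", "Single", "Married", "Divorced",
                      "Married", "Divorced", "Single", "Married", "Single"],
    "TaxableIncomeK": [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
    "Cheat": ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
}).set_index("Tid")

# Binary attributes: map the two states to True/False.
df["Refund"] = df["Refund"].map({"Yes": True, "No": False})
df["Cheat"] = df["Cheat"].map({"Yes": True, "No": False})

# Nominal attribute: categories without any order.
df["MaritalStatus"] = df["MaritalStatus"].astype("category")

# Ordinal attribute (the Size example from the slide, not a column of the table):
# an ordered categorical knows that small < medium < large.
size = pd.Categorical(["small", "large", "medium"],
                      categories=["small", "medium", "large"],
                      ordered=True)

print(df.dtypes)
print(size.min(), size.max())
```

Storing the ordinal attribute as an ordered categorical keeps the ranking without pretending that the distance between "small" and "medium" is known.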
Structured Data
Organized data stored in a predefined format (e.g., rows and columns).
Easy to search, query, and analyze.
Examples include:
Databases (SQL tables)
Spreadsheets (Excel)
CSV files
Common use cases: business reports, financial records, inventory systems.

Unstructured Data
Data that lacks a predefined format or structure.
Difficult to store and process using traditional databases.
Examples include:
Text (emails, social media posts)
Images, videos, and audio files
Sensor data
Common use cases: sentiment analysis, image recognition, multimedia analytics.

Semi-Structured Data
Data that doesn’t conform to a rigid structure but has some organizational properties.
Combines elements of both structured and unstructured data.
Examples include:
JSON, XML files
NoSQL databases
Common use cases: web data, APIs, and log files.

Data Quality
What kinds of data quality problems are there?
How can we detect problems with the data?
What can we do about these problems?
Examples of data quality problems:
Missing values
Outliers
Duplicate data

Missing Values
Reasons for missing values:
Information is not collected (e.g., people decline to give their age and weight).
Attributes may not be applicable to all cases (e.g., annual income is not applicable to children).
Handling missing values:
Eliminate data objects
Estimate missing values
Ignore the missing value during analysis
Replace with all possible values (weighted by their probabilities)

Outliers
Outliers are data objects with characteristics that are considerably different from those of most other data objects in the data set.

Duplicate Data
A data set may include data objects that are duplicates, or almost duplicates, of one another.
Examples:
The same person with multiple email addresses
More than one person with the same name and address
Data cleaning: the process of dealing with duplicate data issues.

Practical part
Titanic Dataset
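As a warm-up for the practical part, the three data quality problems above can be sketched in pandas on a toy table. Everything here is an assumption for illustration: the rows are made up (they are not taken from the Titanic data set), the column names merely resemble it, and median imputation and the 1.5 × IQR rule are common choices rather than methods stated in the slides.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows with Titanic-like column names; NOT the real data set.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "C", "D", "E"],
    "Age": [22.0, np.nan, 35.0, 35.0, 29.0, 31.0],
    "Fare": [7.25, 8.05, 13.00, 13.00, 512.33, np.nan],
})

# Detect missing values: count them per attribute.
missing_counts = df.isna().sum()

# Duplicate data: drop rows that are exact copies of an earlier row.
deduped = df.drop_duplicates()

# Missing values, option 1: eliminate data objects that have any missing value.
dropped = deduped.dropna()

# Missing values, option 2: estimate them (assumed choice: the column median).
filled = deduped.fillna({
    "Age": deduped["Age"].median(),
    "Fare": deduped["Fare"].median(),
})

# Outliers: flag fares outside 1.5 * IQR of the quartiles (a rule of thumb).
q1, q3 = filled["Fare"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (filled["Fare"] < q1 - 1.5 * iqr) | (filled["Fare"] > q3 + 1.5 * iqr)

print(missing_counts)
print(filled[is_outlier])
```

Whether a flagged row should be removed, capped, or kept depends on the analysis; the flag only marks objects that are considerably different from the rest.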