Data Wrangling Primer PDF
Summary
This document provides a primer on data wrangling, outlining the three key steps: data gathering, data assessment, and data cleaning. It explains the importance of data quality and tidiness, and provides examples of techniques to correct data issues. The process of data wrangling is crucial for effective data analysis and modeling.
Full Transcript
Data wrangling is the process of gathering data, assessing its quality, and cleaning it.

3 Steps in Data Wrangling

Raw data collected for a project from various sources usually comes in different formats and is not suitable for further analysis and modeling. Often the gathered data is neither clean nor well structured. Working with such data is difficult, which leads to mistakes, misleading insights, and wasted time.

It is often said that data wrangling takes 70% to 80% of the entire project time, leaving only 20% to 30% for exploration and modeling.

The data obtained at the end of the data wrangling process is then used for further analysis, visualization, or building models with machine learning.

Data wrangling transforms raw data into cleaned, well-structured data.

The three easy-to-understand pictures below give you an idea of the entire data wrangling process.

1. Data Gathering

Depending on your data science project, the required data may come from a single file or be spread over several resources in different formats. Sometimes data gathering can be a challenging task.

2. Data Assessment

After gathering, it is time to assess the collected data. But what exactly are you going to assess? The answer is quality and tidiness.

This is not data exploration; you are checking how dirty and messy your data is. Poor-quality data has issues with its contents, and such data is called dirty data. Messy data, by contrast, has issues with its structure, which is what tidiness refers to. Commonly observed quality issues are missing values, inconsistent data, incorrect data types, and duplicates.

Things to be checked in Data Assessment

Further, the data can be assessed both visually and programmatically. Visual assessment is simple: open the data in any application you like, scroll through it, and look for quality and tidiness issues.

3. Data Cleaning

Now it is time to clean and organize your data.
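Before any cleaning starts, the programmatic side of the assessment step can be sketched in a few lines. The example below is only an illustration, assuming Python with the pandas library; the sample table, its column names, and the helper `assess` are hypothetical and do not come from the original document. It counts missing values and duplicate rows and lists the stored data types, which covers most of the quality issues named above.

```python
import pandas as pd

# Hypothetical sample data; column names and values are invented for illustration.
df = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "amount": ["10.5", "20.0", "20.0", None],  # numbers stored as text, one missing
    "order_date": ["2021-01-05", "2021-01-06", "2021-01-06", "2021-01-08"],
})


def assess(frame: pd.DataFrame) -> dict:
    """Summarize common quality issues without modifying the data."""
    return {
        # Number of missing values in each column
        "missing_per_column": frame.isna().sum().to_dict(),
        # Number of rows that exactly repeat an earlier row
        "duplicate_rows": int(frame.duplicated().sum()),
        # Data type each column is currently stored as
        "dtypes": frame.dtypes.astype(str).to_dict(),
    }


report = assess(df)
print(report)
```

A report like this only points at problems; deciding how to fix them belongs to the cleaning step.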
The data cleaning step always focuses on solving the quality and tidiness issues found in the data assessment stage. As a wrangler, until now you have only collected the data and looked at its problems. Now it is time to act: round the data up and organize it well to get better outcomes.

Some data cleaning techniques

Depending on the problem and the data type, different data cleaning techniques are used.

One thing that can be checked quickly is the correlation between different variables. This can give you a partial idea of which variables are unimportant for your analysis and help you decide which columns to drop.

If you have multiple datasets of a similar type in the same project, it is a good idea to combine them into a single dataset. When such merging is performed, the final dataset may contain duplicate or repeated data points. These duplicate entries can simply be removed.

Another frequently observed problem is incorrect data types. While cleaning, you should ensure that numbers are stored as numeric values, that date and time values are stored as datetime objects, and so on.

Ultimately, dirty and messy data gets corrected by removing or replacing incorrect entries, removing irrelevant variables, and adding new variables.

Summing up

The quality and the structure of the data are the two important aspects for getting better insights from it.

Even after the data is cleaned, it is always a good idea to reassess it to ensure the required quality and tidiness. That’s why data wrangling is viewed as an iterative process in the first graphic.

Remember, good-quality and well-organized data powers further analysis, visualization, and modeling.
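The cleaning techniques described above (combining similar datasets, removing duplicates, fixing data types, and checking correlations) can be sketched as follows. This is a minimal illustration, assuming Python with the pandas library; the two small tables and their column names are hypothetical, and the choice to turn unparseable numbers into missing values is one possible policy, not the document's prescribed method.

```python
import pandas as pd

# Hypothetical datasets of the same shape; names and values are invented.
jan = pd.DataFrame({
    "order_id": [1, 2],
    "amount": ["10.5", "20.0"],
    "order_date": ["2021-01-05", "2021-01-06"],
})
feb = pd.DataFrame({
    "order_id": [2, 3],
    "amount": ["20.0", "n/a"],
    "order_date": ["2021-01-06", "2021-02-01"],
})

# Combine the similar datasets, then remove the repeated rows this introduces.
orders = pd.concat([jan, feb], ignore_index=True)
orders = orders.drop_duplicates().reset_index(drop=True)

# Store numbers as numeric values; unparseable entries such as "n/a" become NaN.
orders["amount"] = pd.to_numeric(orders["amount"], errors="coerce")

# Store dates as datetime objects instead of plain text.
orders["order_date"] = pd.to_datetime(orders["order_date"], format="%Y-%m-%d")

# Pairwise correlation between numeric columns, useful only as a rough hint
# when deciding which columns might be dropped.
correlations = orders[["order_id", "amount"]].corr()

print(orders.dtypes)
print(correlations)
```

Rerunning an assessment on `orders` after these steps is a simple way to follow the reassessment advice above.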