TYCS Data Science SEM-VI PDF
Megha Sharma
Summary
These lecture notes from a TYCS SEM-VI course provide an overview of data science. The document discusses what data science is, its applications, and its comparisons with other fields. Topics include image recognition, gaming, internet search, and healthcare. It also covers differences between data science, business intelligence, machine learning, and artificial intelligence. Note that while this document is from a course, it focuses on core data science principles, rather than course content or specific examples.
T.Y.C.S. SEM-VI DATA SCIENCE
Compiled By: Asst.Prof. MEGHA SHARMA
https://www.youtube.com/@omega_teched

Chapter-1
What is Data Science?
Definition and scope of Data Science, Applications and domains of Data Science, Comparison with other fields like Business Intelligence (BI), Artificial Intelligence (AI), Machine Learning (ML), and Data Warehousing/Data Mining (DW-DM)

Data Science:
Data science is the in-depth study of massive amounts of data. It involves extracting meaningful insights from raw, structured, and unstructured data using the scientific method, different technologies, and algorithms. It is a multidisciplinary field that uses tools and techniques to manipulate data so that you can find something new and meaningful.

Applications of Data Science:
o Image recognition and speech recognition: Data science is currently used for image and speech recognition. When you upload an image on Facebook, you start getting suggestions to tag your friends. This automatic tagging suggestion uses an image recognition algorithm, which is part of data science. When you say "Ok Google" or speak to Siri or Cortana, these assistants respond to your voice commands, which is made possible by speech recognition algorithms.
o Gaming: In the gaming world, the use of machine learning algorithms is increasing day by day. EA Sports, Sony, and Nintendo widely use data science to enhance the user experience.
o Internet search: When we want to search for something on the internet, we use search engines such as Google, Yahoo, Bing, Ask, etc. All these search engines use data science technology to improve the search experience, so you get results within a fraction of a second.
o Transport: Transport industries also use data science technology to create self-driving cars. Self-driving cars are expected to help reduce the number of road accidents.
o Healthcare: In the healthcare sector, data science provides many benefits. It is used for tumor detection, drug discovery, medical image analysis, virtual medical bots, etc.
o Recommendation systems: Companies such as Amazon, Netflix, and Google Play use data science technology to improve the user experience with personalized recommendations. For example, when you search for something on Amazon, you start getting suggestions for similar products. This is because of data science technology.
o Risk detection: Finance industries have always faced fraud and the risk of losses, and data science helps to reduce both. Most finance companies look for data scientists to avoid risk and losses while increasing customer satisfaction.

Difference between BI and Data Science:
BI stands for business intelligence, which is also used for data analysis of business information. The differences between BI and data science are:

Criterion | Business intelligence | Data science
Data source | Deals with structured data, e.g., a data warehouse. | Deals with structured and unstructured data, e.g., weblogs, feedback, etc.
Method | Analytical (historical data) | Scientific (goes deeper to know the reason for the data report)
Skills | Statistics and visualization are the two skills required for business intelligence. | Statistics, visualization, and machine learning are the required skills for data science.
Focus | Focuses on both past and present data. | Focuses on past data, present data, and also future predictions.

Difference between Data Science and Machine Learning:

Data Science | Machine Learning
It deals with understanding and finding hidden patterns or useful insights from the data, which helps to make smarter business decisions. | It is a subfield of data science that enables the machine to learn from past data and experiences automatically.
It is used for discovering insights from the data. | It is used for making predictions and classifying the result for new data points.
It is a broad term that includes various steps to create a model for a given problem and deploy the model. | It is used in the data modeling step of data science as a complete process.
A data scientist needs skills in big data tools like Hadoop, Hive, and Pig, statistics, and programming in Python, R, or Scala. | A Machine Learning Engineer needs skills such as computer science fundamentals, programming in Python or R, statistics and probability concepts, etc.
It can work with raw, structured, and unstructured data. | It mostly requires structured data to work on.
Data scientists spend lots of time handling the data, cleansing the data, and understanding its patterns. | ML engineers spend a lot of time managing the complexities that occur during the implementation of algorithms and the mathematical concepts behind them.

Difference between Data Science and AI:

Criterion | Data Science | Artificial Intelligence
Basics | Data science is a detailed process that mainly involves pre-processing, analysis, visualization, and prediction. | AI (short for artificial intelligence) is the implementation of a predictive model to forecast future events and trends.
Goals | Identifying the patterns that are concealed in the data is the main objective of data science. | Automation of the process and the granting of autonomy to the data model are the main goals of artificial intelligence.
Types of data | Data science works with a variety of data types, including structured, semi-structured, and unstructured data. | AI uses standardized data in the form of vectors and embeddings.
Scientific processing | It has a high degree of scientific processing. | It has a lot of high levels of complex processing.
Tools used | The tools utilized in data science are far more extensive than those used in AI. This is because data science entails several procedures for analyzing data and developing insights from it. | The tools used in AI are less extensive compared to data science.
Build | By using the concept of data science, we can build complex models about statistics and facts about data. | By using AI, we emulate cognition and human understanding to a certain level.
Technique used | It uses the techniques of data analysis and data analytics. | It uses a lot of machine learning techniques.
Use | Data science makes use of graphical representation. | Artificial intelligence makes use of algorithms and network node representation.
Knowledge | Its knowledge was established to find hidden patterns and trends in the data. | Its knowledge is all about imparting some autonomy to a data model.

Data Warehousing
A Data Warehouse (DW) is a relational database that is designed for query and analysis rather than transaction processing. It includes historical data derived from transaction data from single and multiple sources.
A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on providing support for decision-makers for data modeling and analysis.
A Data Warehouse is a collection of data that belongs to the entire organization, not only to a particular group of users.
It is not used for daily operations and transaction processing but for making decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
o It is a database designed for investigative tasks, using data from various applications.
o It supports a relatively small number of clients with relatively long interactions.
o It includes current and historical data to provide a historical perspective of information.
o Its usage is read-intensive.
o It contains a few large tables.

"Data Warehouse is a subject-oriented, integrated, and time-variant store of information in support of management's decisions."

Characteristics:

Subject-Oriented
A data warehouse focuses on the modeling and analysis of data for decision-makers. Therefore, data warehouses typically provide a concise and straightforward view around a particular subject, such as customer, product, or sales, instead of the organization's ongoing operations as a whole. This is done by excluding data that is not useful concerning the subject and including all data needed by the users to understand the subject.

Integrated
A data warehouse integrates various heterogeneous data sources like RDBMS, flat files, and online transaction records. It requires data cleaning and integration during data warehousing to ensure consistency in naming conventions, attribute types, etc., among different data sources.

Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even earlier from a data warehouse. This contrasts with a transaction system, where often only the most current data is kept.

Non-Volatile
The data warehouse is a physically separate data store, which is transformed from the source operational RDBMS. Operational updates of data do not occur in the data warehouse, i.e., update, insert, and delete operations are not performed. It usually requires only two procedures in data accessing: initial loading of data and access to data. Therefore, the DW does not require transaction processing, recovery, and concurrency capabilities, which allows for substantial speedup of data retrieval. Non-volatile means that once data has entered the warehouse, it should not change.

Goals of Data Warehousing
o To help reporting as well as analysis.
o Maintain the organization's historical information.
o Be the foundation for decision making.

Benefits of Data Warehouse
1. Understand business trends and make better forecasting decisions.
2. Data warehouses are designed to store enormous amounts of data.
3. The structure of data warehouses is easier for end-users to navigate, understand, and query.
4. Queries that would be complex in many normalized databases could be easier to build and maintain in data warehouses.
5. Data warehousing is an efficient method to manage demand for lots of information from lots of users.
6. Data warehousing provides the capabilities to analyze a large amount of historical data.

Difference between database and data warehouse:

No. | Database | Data Warehouse
1 | It is used for Online Transactional Processing (OLTP) but can be used for other objectives such as data warehousing. It records the data from the clients for history. | It is used for Online Analytical Processing (OLAP). It reads the historical information about customers for business decisions.
2 | The tables and joins are complicated since they are normalized for RDBMS. This is done to reduce redundant data and to save storage space. | The tables and joins are simple since they are denormalized. This is done to minimize the response time for analytical queries.
3 | Data is dynamic. | Data is largely static.
4 | Entity-relationship modeling procedures are used for RDBMS database design. | Data modeling approaches are used for the data warehouse design.
5 | Optimized for write operations. | Optimized for read operations.
6 | Performance is low for analysis queries. | High performance for analytical queries.
7 | The database is the place where the data is taken as a base and managed to provide fast and efficient access. | The data warehouse is the place where the application data is handled for analysis and reporting objectives.
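The contrast between normalized tables (which need joins) and a denormalized table (which can be queried directly) can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table names (customers, orders, sales_fact) and the sample rows are hypothetical and chosen only for illustration; a real warehouse would be far larger and would typically run on a dedicated system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized (OLTP-style) design: customer details are stored once,
# and each order refers to a customer by key.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, city TEXT)")
cur.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
cur.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Mumbai"), (2, "Pune")])
cur.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, 100.0), (2, 1, 50.0), (3, 2, 70.0)],
)

# Denormalized (warehouse-style) table: the city is repeated on every row,
# so an analytical query needs no join.
cur.execute("CREATE TABLE sales_fact (order_id INTEGER, city TEXT, amount REAL)")
cur.execute(
    """INSERT INTO sales_fact
       SELECT o.id, c.city, o.amount
       FROM orders o JOIN customers c ON c.id = o.customer_id"""
)

# Same business question ("total sales per city") asked of both designs.
normalized_totals = cur.execute(
    """SELECT c.city, SUM(o.amount)
       FROM orders o JOIN customers c ON c.id = o.customer_id
       GROUP BY c.city ORDER BY c.city"""
).fetchall()
denormalized_totals = cur.execute(
    "SELECT city, SUM(amount) FROM sales_fact GROUP BY city ORDER BY city"
).fetchall()
conn.close()

print(normalized_totals)
print(denormalized_totals)
```

Both queries return the same totals. The denormalized version trades extra storage (the repeated city value) for a simpler query, which is the trade-off the comparison describes.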

ETL (Extract, Transform, and Load) Process
The mechanism of extracting information from source systems and bringing it into the data warehouse is commonly called ETL, which stands for Extraction, Transformation, and Loading.
The ETL process requires active input from various stakeholders, including developers, analysts, testers, and top executives, and is technically challenging.
To maintain its value as a tool for decision-makers, a data warehouse needs to change as the business changes. ETL is a recurring activity (daily, weekly, monthly) of a data warehouse system and needs to be agile, automated, and well documented.

Extraction
o Extraction is the operation of extracting information from a source system for further use in a data warehouse environment. This is the first stage of the ETL process.
o The extraction process is often one of the most time-consuming tasks in ETL.
o The source systems might be complicated and poorly documented, and thus determining which data needs to be extracted can be difficult.
o The data has to be extracted periodically to supply all the changed data to the warehouse and keep it up to date.

Cleansing
The cleansing stage is crucial in a data warehouse system because it is supposed to improve data quality. The primary data cleansing features found in ETL tools are rectification and homogenization. They use specific dictionaries to rectify typing mistakes and to recognize synonyms, as well as rule-based cleansing to enforce domain-specific rules and define appropriate associations between values.

Transformation
Transformation is the core of the reconciliation phase. It converts records from their operational source format into a particular data warehouse format. If we implement a three-layer architecture, this phase outputs our reconciled data layer.

Loading
The load is the process of writing the data into the target database.
During the load step, it is necessary to ensure that the load is performed correctly and with as few resources as possible.
Loading can be carried out in two ways:
1. Refresh: Data warehouse data is completely rewritten. This means that older data is replaced. Refresh is usually used in combination with static extraction to populate a data warehouse initially.
2. Update: Only those changes applied to source information are added to the data warehouse. An update is typically carried out without deleting or modifying pre-existing data. This method is used in combination with incremental extraction to update data warehouses regularly.

Data Mining:
The process of extracting information from huge sets of data to identify patterns, trends, and useful data that allow the business to take data-driven decisions is called data mining.
We can say that data mining is the process of investigating hidden patterns of information from various perspectives and categorizing them into useful data. This data is collected and assembled in areas such as data warehouses for efficient analysis, data mining algorithms, and decision making, with the eventual aim of cutting costs and generating revenue.
Data mining is the act of automatically searching large stores of information to find trends and patterns that go beyond simple analysis procedures. Data mining utilizes complex mathematical algorithms to segment the data and evaluate the probability of future events. Data mining is also called Knowledge Discovery in Data (KDD).

Data mining can be performed on the following types of data:

Relational Database:
A relational database is a collection of multiple data sets formally organized by tables, records, and columns, from which data can be accessed in various ways without having to reorganize the database tables. Tables convey and share information, which facilitates data searchability, reporting, and organization.
Data warehouses:
A data warehouse is the technology that collects data from various sources within the organization to provide meaningful business insights. The huge amount of data comes from multiple places such as Marketing and Finance. The extracted data is utilized for analytical purposes and helps in decision-making for a business organization. The data warehouse is designed for the analysis of data rather than transaction processing.

Data Repositories:
The data repository generally refers to a destination for data storage. However, many IT professionals use the term more specifically to refer to a particular kind of setup within an IT structure, for example, a group of databases where an organization has kept various kinds of information.

Object-Relational Database:
A combination of an object-oriented database model and a relational database model is called an object-relational model. It supports classes, objects, inheritance, etc.

Transactional Database:
A transactional database refers to a database management system (DBMS) that can undo a database transaction if it is not performed appropriately. Although this was a unique capability a long time ago, today most relational database systems support transactional database activities.

Advantages of Data Mining
o The data mining technique enables organizations to obtain knowledge-based data.
o Data mining enables organizations to make lucrative modifications in operation and production.
o Compared with other statistical data applications, data mining is cost-efficient.
o Data mining helps the decision-making process of an organization.
o It facilitates the automated discovery of hidden patterns as well as the prediction of trends and behaviors.
o It can be introduced into new systems as well as existing platforms.
o It is a quick process that makes it easy for new users to analyze enormous amounts of data in a short time.

Disadvantages of Data Mining
o There is a possibility that organizations may sell useful customer data to other organizations for money. As per one report, American Express has sold credit card purchase data of its customers to other organizations.
o Much data mining analytics software is difficult to operate and needs advanced training to work with.
o Different data mining instruments operate in distinct ways due to the different algorithms used in their design. Therefore, selecting the right data mining tool is a very challenging task.
o Data mining techniques are not always precise, which may lead to severe consequences in certain conditions.

Data Mining Applications
Data mining is primarily used by organizations with intense consumer demands, such as retail, communication, financial, and marketing companies, to determine prices, consumer preferences, product positioning, and the impact on sales, customer satisfaction, and corporate profits. Data mining enables a retailer to use point-of-sale records of customer purchases to develop products and promotions that help the organization to attract customers.

Data Mining Techniques
Data mining includes the utilization of refined data analysis tools to find previously unknown, valid patterns and relationships in huge data sets. These tools can incorporate statistical models, machine learning techniques, and mathematical algorithms, such as neural networks or decision trees. Thus, data mining incorporates analysis and prediction.
Drawing on various methods and technologies from the intersection of machine learning, database management, and statistics, professionals in data mining have devoted their careers to better understanding how to process and draw conclusions from huge amounts of data. But what are the methods they use to make it happen?
In recent data mining projects, various major data mining techniques have been developed and used, including association, classification, clustering, prediction, sequential patterns, and regression.

Chapter Ends…

Chapter-2
Data Types and Sources
Data Types and Sources: Different types of data: structured, unstructured, semi-structured, Data sources: databases, files, APIs, web scraping, sensors, social media

Data can be structured data, semi-structured data, or unstructured data.

1. Structured data
Structured data is data whose elements are addressable for effective analysis. It has been organized into a formatted repository that is typically a database. It covers all data that can be stored in an SQL database in tables with rows and columns. Such data has relational keys and can easily be mapped into pre-designed fields. Today, structured data is the most commonly processed kind of data in development and the simplest way to manage information. Example: relational data.

2. Semi-structured data
Semi-structured data is information that does not reside in a relational database but that has some organizational properties that make it easier to analyze. With some processing, you can store it in a relational database (this could be very hard for some kinds of semi-structured data), but the semi-structured form exists to save space. Example: XML data.

3. Unstructured data
Unstructured data is data that is not organized in a predefined manner or does not have a predefined data model; thus, it is not a good fit for a mainstream relational database. For unstructured data, there are alternative platforms for storing and managing it. It is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications. Example: Word, PDF, text, media logs.

Differences between Structured, Semi-structured and Unstructured data:

Properties | Structured data | Semi-structured data | Unstructured data
Technology | It is based on relational database tables. | It is based on XML/RDF (Resource Description Framework). | It is based on character and binary data.
Transaction management | Matured transaction management and various concurrency techniques. | Transaction management is adapted from DBMS and is not matured. | No transaction management and no concurrency.
Version management | Versioning over tuples, rows, and tables. | Versioning over tuples or graphs is possible. | Versioned as a whole.
Flexibility | It is schema dependent and less flexible. | It is more flexible than structured data but less flexible than unstructured data. | It is more flexible and there is an absence of schema.
Scalability | It is very difficult to scale the DB schema. | Its scaling is simpler than that of structured data. | It is more scalable.
Robustness | Very robust. | New technology, not very widespread. | -

Data Types Based on Its Collection
Based on how data is collected, it can be divided into two categories: primary and secondary data. The key differences between these two types are reviewed in the following table:

Factor | Primary Data | Secondary Data
Definition | Primary data refers to the first-hand data collected by the team. It is collected based on the researcher's needs. | Secondary data has been collected by other teams in the past. It does not necessarily need to be aligned with the researcher's requirements.
Data | Real-time data | Historical data
Process | Time consuming | Quick and easy
Collection time | Long | Short
Available in | Raw and crude form | Refined form
Accuracy and reliability | Very high | Relatively less
Examples | Personal interviews, surveys, observations, etc. | Websites, articles, research papers, historical data, etc.

Types of Data:
The data in statistics is classified into four categories:
o Nominal data
o Ordinal data
o Discrete data
o Continuous data
A related classification by level of measurement uses four main types of data: nominal, ordinal, interval, and ratio. These types describe the nature of the data being collected or analyzed, and they help determine the appropriate statistical tests to use.

Qualitative Data (Categorical Data)
As the name suggests, qualitative data describes the features of the data in statistics. Qualitative data is also called categorical data because it sorts the data into various categories. Qualitative data includes data such as the gender of people, their family name, and others in a sample of population data.
Qualitative data is further divided into two categories:
o Nominal data
o Ordinal data

Nominal Data
Nominal data is a type of data that consists of categories or names that cannot be ordered or ranked. Nominal data is often used to categorize observations into groups, and the groups are not comparable. In other words, nominal data has no inherent order or ranking. Examples of nominal data include gender (male or female), race (White, Black, Asian), religion (Hinduism, Christianity, Islam, Judaism), and blood type (A, B, AB, O).
Nominal data can be represented using frequency tables and bar charts, which display the number or proportion of observations in each category. For example, a frequency table for gender might show the number of males and females in a sample of people.
Nominal data is analyzed using non-parametric tests, which do not make any assumptions about the underlying distribution of the data. Common non-parametric tests for nominal data include Chi-Squared Tests and Fisher's Exact Tests. These tests are used to compare the frequency or proportion of observations in different categories.
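
As a minimal sketch of the frequency table and chi-squared idea described above, the following Python example uses only the standard library. The blood-type sample and the null hypothesis (all four types equally likely) are made up for illustration, and the helper names frequency_table and chi_squared_statistic are hypothetical. The code computes only the test statistic. Obtaining a p-value would additionally require the chi-squared distribution (for example via a statistics library such as SciPy), and with expected counts this small an exact test would usually be preferred.

```python
from collections import Counter


def frequency_table(values):
    """Return {category: (count, proportion)} for nominal data."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, n / total) for cat, n in counts.items()}


def chi_squared_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum(
        (observed.get(cat, 0) - exp) ** 2 / exp for cat, exp in expected.items()
    )


# Hypothetical sample of 12 blood types (nominal data: no order among categories).
blood_types = ["A", "B", "O", "O", "A", "AB", "O", "B", "O", "A", "O", "A"]

table = frequency_table(blood_types)
observed = {cat: n for cat, (n, _) in table.items()}

# Assumed null hypothesis for illustration: all four types are equally likely.
expected = {cat: len(blood_types) / 4 for cat in ("A", "B", "AB", "O")}

chi2 = chi_squared_statistic(observed, expected)

for cat, (n, share) in sorted(table.items()):
    print(f"{cat}: count={n}, proportion={share:.2f}")
print(f"chi-squared statistic = {chi2:.3f}")
```

A larger statistic means the observed counts differ more from the counts expected under the null hypothesis.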

Ordinal Data
Ordinal data is a type of data that consists of categories that can be ordered or ranked. However, the distance between categories is not necessarily equal. Ordinal data is often used to measure subjective attributes or opinions, where there is a natural order to the responses. Examples of ordinal data include education level (Elementary, Middle, High School, College), job position (Manager, Supervisor, Employee), etc.
Ordinal data can be represented using bar charts or line charts. These displays show the order or ranking of the categories, but they do not imply that the distances between categories are equal.
Ordinal data is analyzed using non-parametric tests, which make no assumptions about the underlying distribution of the data. Common non-parametric tests for ordinal data include the Wilcoxon Signed-Rank test and the Mann-Whitney U test.

Quantitative Data (Numerical Data)
Quantitative data is the type of data that represents numerical values. It is also called numerical data. It is used to represent measurements such as height, weight, and length. Quantitative data is further classified into two categories:
o Discrete data
o Continuous data

Discrete Data
Discrete data is a type of data in statistics that takes only distinct, separate values. These values can be easily counted as whole numbers. Examples of discrete data are:
o Number of students in a class
o Marks of the students in a class test
o Number of members in a family, etc.

Continuous Data
Continuous data is the type of quantitative data that represents values in a continuous range. The variable in the data set can take any value within the range of the data set. Examples of continuous data are:
o Temperature range
o Salary range of workers in a factory
o Height and weight of people, etc.
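
One way to see the difference between the two kinds of quantitative data described above is in how they are tabulated: discrete values can be counted one by one, while continuous values are usually grouped into intervals first. The sketch below uses only the Python standard library. The sample values, the interval width, and the helper names ungrouped_frequency and grouped_frequency are assumptions made for illustration.

```python
from collections import Counter


def ungrouped_frequency(values):
    """Count how often each distinct value occurs (suits discrete data)."""
    return dict(sorted(Counter(values).items()))


def grouped_frequency(values, start, width, bins):
    """Count values in half-open intervals [start + i*width, start + (i+1)*width)."""
    counts = [0] * bins
    for v in values:
        i = int((v - start) // width)
        if 0 <= i < bins:  # values outside the covered range are ignored
            counts[i] += 1
    return {
        (start + i * width, start + (i + 1) * width): counts[i] for i in range(bins)
    }


# Discrete: number of siblings per student (hypothetical, countable whole numbers).
siblings = [0, 1, 1, 2, 0, 3, 1, 2]

# Continuous: temperatures in degrees Celsius (hypothetical, any value in a range).
temperatures = [21.4, 22.9, 25.1, 23.7, 28.3, 26.0, 21.9, 24.5]

discrete_table = ungrouped_frequency(siblings)
continuous_table = grouped_frequency(temperatures, start=20.0, width=3.0, bins=3)

print(discrete_table)
for (low, high), n in continuous_table.items():
    print(f"[{low}, {high}): {n}")
```

The interval width is a design choice: wider intervals give a smoother but coarser picture of continuous data, while discrete data needs no such choice.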

Difference between Quantitative and Qualitative Data
Quantitative and qualitative data differ considerably. The basic differences between them are listed in the table below:

Quantitative data | Qualitative data
Data is depicted in numerical terms. | Data is not depicted in numerical terms.
Can be shown in numbers and variables like ratio, percentage, and more. | Could be about the behavioral attributes of a person or thing.
Examples: 100%, 1:3, 123 | Examples: loud behavior, fair skin, soft quality, and more.

Difference between Discrete and Continuous Data
Discrete data and continuous data both come under quantitative data. The differences between them are listed in the table below:

Discrete Data | Continuous Data
The type of data that has clear spaces between values is discrete data. | This information falls into a continuous series.
Discrete data is countable. | Continuous data is measurable.
There are distinct or different values in discrete data. | Every value within a range is included in continuous data.
Discrete data is depicted using bar graphs. | Continuous data is depicted using histograms.
Ungrouped frequency distribution of discrete data is performed against a single value. | Grouped frequency distribution of continuous data is tabulated against a group of values.

Data Sources:
A data source is the location where the data being used originates. A data source may be the initial location where data is born or where physical information is first digitized; however, even the most refined data may serve as a source, as long as another process accesses and utilizes it.

Databases
A database is an organized collection of structured information, or data, typically stored electronically in a computer system. A database is usually controlled by a database management system (DBMS).
Types:
o Relational Database
o NoSQL Database

Files:
Data stored in files, which can be in various formats such as text files, CSV, Excel spreadsheets, and more.

APIs (Application Programming Interface)
API stands for Application Programming Interface. In the context of APIs, the word Application refers to any software with a distinct function. Interface can be thought of as a contract of service between two applications. This contract defines how the two communicate with each other using requests and responses.
Types:
o Web APIs: Allow access to data over HTTP (e.g. RESTful APIs) and usually return data in JSON or XML format.
o Library APIs: APIs provided by programming libraries to access specific functions and data.

Web Scraping
Web scraping is the process of using bots to extract content and data from a website. Unlike screen scraping, which only copies pixels displayed on screen, web scraping extracts the underlying HTML code and, with it, data stored in a database. The scraper can then replicate entire website content elsewhere.
Usage: Extracting news articles, product information, reviews, and more from websites.

Sensors
A sensor is a device that detects and responds to some type of input from the physical environment. The input can be light, heat, motion, moisture, pressure, or any number of other environmental phenomena. Sensors collect data from the environment or devices, providing valuable information for various applications and IoT projects. In the context of data science, sensor data is valuable for IoT applications, environmental monitoring, healthcare, manufacturing, and more.

Social Media
Social media platforms generate vast amounts of data daily, including text messages, videos, and user engagement metrics.
Usage: Analyzing trends, sentiments, user behavior, and engagement patterns.
__________________________________________________________________
Chapter Ends…

Chapter-3 Data Preprocessing
Data Preprocessing: Data cleaning: handling missing values, outliers, duplicates; Data transformation: scaling, normalization, encoding categorical variables; Feature selection: selecting relevant features/columns; Data merging: combining multiple datasets.

Data cleaning:
Data cleaning is a crucial step in the machine learning (ML) pipeline and plays a significant part in building a model. It involves identifying and removing any missing, duplicate, or irrelevant data. The goal of data cleaning is to ensure that the data is accurate, consistent, and free of errors, as incorrect or inconsistent data can negatively impact the performance of the ML model. Professional data scientists usually invest a very large portion of their time in this step because of the belief that "Better data beats fancier algorithms".
Data cleaning is essential because raw data is often noisy, incomplete, and inconsistent, which can negatively impact the accuracy and reliability of the insights derived from it. It involves the systematic identification and correction of errors, inconsistencies, and inaccuracies within a dataset, encompassing tasks such as handling missing values, removing duplicates, and addressing outliers. This process enhances the integrity of analyses, promotes more accurate modeling, and ultimately supports informed decision-making based on trustworthy, high-quality data.

Steps to Perform Data Cleaning
Performing data cleaning involves a systematic process to identify and rectify errors, inconsistencies, and inaccuracies in a dataset.
Removal of Unwanted Observations: Identify and eliminate irrelevant or redundant observations from the dataset. This step involves scrutinizing data entries for duplicate records, irrelevant information, or data points that do not contribute meaningfully to the analysis.
Removing unwanted observations streamlines the dataset, reducing noise and improving the overall quality.
Fixing Structural Errors: Address structural issues in the dataset, such as inconsistencies in data formats, naming conventions, or variable types. Standardize formats, correct naming discrepancies, and ensure uniformity in data representation. Fixing structural errors enhances data consistency and facilitates accurate analysis and interpretation.
Managing Unwanted Outliers: Identify and manage outliers, which are data points that deviate significantly from the norm. Depending on the context, decide whether to remove outliers or transform them to minimize their impact on analysis. Managing outliers is crucial for obtaining more accurate and reliable insights from the data.
Handling Missing Data: Devise strategies to handle missing data effectively. This may involve imputing missing values based on statistical methods, removing records with missing values, or employing advanced imputation techniques. Handling missing data ensures a more complete dataset, preventing biases and maintaining the integrity of analyses.

Handling missing values:
Identify the Missing Data Values
Most analytics projects will encounter three possible types of missing data values, depending on whether there is a relationship between the missing data and the other data in the dataset:
Missing completely at random (MCAR): In this case, there is no pattern as to why a column's data is missing. For example, survey data is missing because someone could not make it to an appointment, or an administrator misplaced the test results that were supposed to be entered into the computer. The reason for the missing values is unrelated to the data in the dataset.
Missing at random (MAR): In this scenario, the reason the data is missing in a column can be explained by the data in other columns.
For example, suppose a school assigns a grade only to students who score above the cutoff. A missing grade for a student can then be explained by the score column, which shows a score below the cutoff. The reason for these missing values can be described by data in another column.
Missing not at random (MNAR): Sometimes, the missing value is related to the value itself. For example, higher income people may not disclose their incomes. Here, there is a correlation between the missing values and the actual income. The missingness depends on the unobserved value itself rather than on other variables in the dataset.

Handling Missing Data Values
The first common strategy for dealing with missing data is to delete the rows with missing values. Typically, any row which has a missing value in any cell gets deleted. However, this often means many rows will get removed, leading to loss of information and data. Therefore, this method is typically not used when there are few data samples.
We can also impute the missing data. This can be based solely on information in the column that has missing values, or it can be based on other columns present in the dataset.
Finally, we can use classification or regression models to predict missing values.

1. Missing Values in Numerical Columns
The first approach is to replace the missing value with one of the following strategies:
o Replace it with a constant value. This can be a good approach when the value is chosen in discussion with a domain expert for the data we are dealing with.
o Replace it with the mean or median. This is a decent approach when the data size is small, but it does add bias.
o Replace it with values derived from information in other columns.

2. Predicting Missing Values Using an Algorithm
Another way to fill in missing values is to create a simple regression model that predicts them. The column to predict here is the Salary, using other columns in the dataset.
If there are missing values in the input columns, we must handle those conditions when creating the predictive model. A simple way to manage this is to choose only the features that do not have missing values, or to take only the rows that do not have missing values in any of the cells.

3. Missing Values in Categorical Columns
Dealing with missing data values in categorical columns is a lot easier than in numerical columns. Simply replace the missing value with a constant value or the most popular category. This is a good approach when the data size is small, though it does add bias.
For example, we have a column for Education with two possible values: High School and College. If there are more people with a college degree in the dataset, we can replace the missing value with College.

2. Handling Duplicates:
The simplest and most straightforward way to handle duplicate data is to delete it. This can reduce the noise and redundancy in our data, as well as improve the efficiency and accuracy of our models. However, we need to be careful and make sure that we are not losing any valuable or relevant information by removing duplicate data. We also need to consider the criteria and logic for choosing which duplicates to keep or discard. For example, we can use the df.drop_duplicates() method in pandas to remove duplicate rows, specifying the subset, keep, and inplace arguments.
Removing duplicates:
o In Python using pandas: df.drop_duplicates()
o In SQL: Use the DISTINCT keyword in the SELECT statement.

3. Outlier Detection and Treatment
Outlier detection is the process of detecting outliers, that is, data points that lie far away from the average. How they are treated depends on what we are trying to accomplish. Detecting and appropriately dealing with outliers is essential in data science to ensure that statistical analysis and machine learning models are not unduly influenced, and that the results are accurate and reliable.
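The imputation and duplicate-removal steps described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a definitive implementation: the toy records and the helper names impute and drop_duplicates are assumptions made for this example. In pandas, fillna() and df.drop_duplicates() play the corresponding roles.

```python
from statistics import median, mode

# Hypothetical toy records; None marks a missing value.
rows = [
    {"Salary": 30000, "Education": "College"},
    {"Salary": None, "Education": "High School"},
    {"Salary": 50000, "Education": None},
    {"Salary": 50000, "Education": None},
    {"Salary": 40000, "Education": "College"},
]

def impute(rows, column, fill_value):
    """Return new rows with missing entries in `column` replaced by `fill_value`."""
    return [
        {**row, column: fill_value if row[column] is None else row[column]}
        for row in rows
    ]

# Numerical column: fill gaps with the median of the observed values.
salary_median = median(r["Salary"] for r in rows if r["Salary"] is not None)
# Categorical column: fill gaps with the most frequent observed category.
education_mode = mode(r["Education"] for r in rows if r["Education"] is not None)

filled = impute(impute(rows, "Salary", salary_median), "Education", education_mode)

def drop_duplicates(rows):
    """Keep the first occurrence of each fully identical row."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

cleaned = drop_duplicates(filled)
print(len(rows), "rows before,", len(cleaned), "rows after")
```

Note the order of operations in this sketch: the two rows with a missing Education value only become exact duplicates after imputation, so removing duplicates before imputing would give a different result.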
Techniques used for outlier detection
A data scientist can use several techniques to identify outliers and decide if they are errors or novelties.

Numeric outlier
This is the simplest nonparametric technique, used where data is in a one-dimensional space. The data is divided by its three quartiles, and the interquartile range (IQR = Q3 - Q1) is computed. The range limits are then set as the upper and lower whiskers of a box plot, commonly at Q1 - 1.5 × IQR and Q3 + 1.5 × IQR. Data that falls outside those limits can then be removed.

Z-score
This parametric technique indicates how many standard deviations a certain data point is from the sample's mean. It assumes a Gaussian distribution (a normal, bell-shaped curve). If the data is not normally distributed, it can first be transformed to give it a more normal appearance. The z-score of each data point is then calculated, and a cut-off threshold in standard deviations is chosen using a heuristic (rule of thumb). Data points that lie beyond that threshold can be classified as outliers and removed. The Z-score is a simple, powerful way to remove outliers, but it is most useful with small to medium data sets and is not suitable for nonparametric data.

DBSCAN
DBSCAN stands for Density Based Spatial Clustering of Applications with Noise. It is a clustering algorithm that groups related points together based on the density of the data. DBSCAN divides data into core points, border points, and outliers. Core points have enough neighbors within a given radius to form the dense interior of a cluster, border points lie close enough to a core point to be considered part of its cluster, and outliers belong to no cluster at all and can be disregarded.

Isolation forest
This method is effective for finding novelties and outliers. It uses binary decision trees which are constructed using randomly selected features and a random split value.
The trees together form a forest, and the results are averaged over all the trees. Outlier scores can then be calculated, giving each data point a score from 0 to 1, with 0 being normal and 1 being more of an outlier.

Visualization for Outlier Detection
We can use the box plot, or box and whisker plot, to explore the dataset and visualize the presence of outliers. The points that lie beyond the whiskers are detected as outliers.

Handling Outliers
a. Removing Outliers
i) Listwise deletion: Remove rows with outliers.
ii) Trimming: Remove a certain percentage (for example 1% or 5%) of the most extreme values.
b. Transforming Outliers
i) Winsorization: Cap or replace outliers with values at a specified percentile.
ii) Log Transformation: Apply a log transformation to reduce the impact of extreme values.
c. Imputation
Impute outliers with a value derived from statistical measures (mean, median) or more advanced imputation methods.
d. Treating as Anomaly:
Treat outliers as anomalies and analyze them separately. This is common in fraud detection or network security.

Data Transformation:
Data transformation is the process of converting data from one format or structure into another format or structure. It is a fundamental aspect of most data integration and data management tasks such as data wrangling, data warehousing, data integration, and application integration.
Data transformation is part of an ETL process and refers to preparing data for analysis and modeling. This involves cleaning (removing duplicates, filling in missing values), reshaping (converting currencies, pivot tables), and computing new dimensions and metrics.
Data transformation techniques include scaling, normalization, and encoding categorical variables.

1. Scaling:
Scaling is the process of transforming the features of a dataset so that they fall within a specific range.
Scaling is useful when we want to compare two different variables on equal grounds. This is especially useful with models that use distance measures. For example, models that use Euclidean distance are sensitive to the magnitude of each feature, so scaling helps even out the weight of all the features. This is important because if one variable is more heavily weighted than the other, it introduces bias into our analysis.

Min-Max Scaling:
The objective of Min-Max scaling is to rescale the values of a column into a fixed range, usually [0, 1] or [-1, 1]. For the range [0, 1]: x' = (x - min) / (max - min). A drawback of bounding the data to a small, fixed range is that we will, in turn, end up with smaller standard deviations, which suppresses the weight of outliers in our data.

Standardization (Z-Score Normalization):
Standardization is used to compare features that have different units or scales. This is done by subtracting a measure of location (x - x̄) and dividing by a measure of scale (σ): z = (x - x̄) / σ.
This transforms the data so that the resulting distribution has a mean of 0 and a standard deviation of 1. This method is useful (in comparison to normalization) when we have important outliers in our data, and we don't want to remove them and lose their impact.

2. Normalization
Data normalization is a technique used in data mining to transform the values of a dataset into a common scale. This is important because many machine learning algorithms are sensitive to the scale of the input features and can produce better results when the data is normalized.
There are several different normalization techniques that can be used in data mining, including:
1. Min-Max normalization: This technique scales the values of a feature to a range between 0 and 1. This is done by subtracting the minimum value of the feature from each value, and then dividing by the range of the feature.
2. Z-score normalization: This technique scales the values of a feature to have a mean of 0 and a standard deviation of 1. This is done by subtracting the mean of the feature from each value, and then dividing by the standard deviation.
3. Decimal Scaling: This technique scales the values of a feature by dividing them by a power of 10, typically the smallest power that brings every absolute value below 1.
4. Logarithmic transformation: This technique applies a logarithmic transformation to the values of a feature. This can be useful for data with a wide range of values, as it can help to reduce the impact of outliers.
5. Root transformation: This technique applies a square root transformation to the values of a feature. This can be useful for data with a wide range of values, as it can help to reduce the impact of outliers.
It is important to note that normalization should be applied only to the input features, not the target variable, and that different normalization techniques may work better for different types of data and models.
Note: The main difference between normalizing and scaling is that in normalization you are changing the shape of the distribution, and in scaling you are changing the range of your data. Normalizing is a useful method when you know the distribution is not Gaussian. Normalization adjusts the values of your numeric data to a common scale, whereas scaling shrinks or stretches the data to fit within a specific range.

3. Encoding Categorical Variables
The process of encoding categorical data into numerical data is called "categorical encoding." It involves transforming categorical variables into a numerical format suitable for machine learning models.
1. Label Encoding: Label encoding is a technique that is used to convert categorical columns into numerical ones so that they can be fitted by machine learning models which only take numerical data.
It is an important preprocessing step in a machine-learning project.
Example of Label Encoding
Suppose we have a column Height in some dataset that has the elements Tall, Medium, and Short. To convert this categorical column into a numerical column, we apply label encoding to it. After applying label encoding, the Height column is converted into a numerical column with the elements 0, 1, and 2, where 0 is the label for tall, 1 is the label for medium, and 2 is the label for short height.

Height | Height (encoded)
Tall | 0
Medium | 1
Short | 2

2. One-Hot Encoding: One-hot encoding is a technique that we use to represent categorical variables as numerical values in a machine learning model. It is used when there is no ordinal relationship among the categories, and each category is treated as a separate independent feature. It creates a binary column for each category, where a "1" indicates the presence of the category and "0" its absence. This method is suitable for nominal variables.
Advantages:
▪ It allows the use of categorical variables in models that require numerical input.
▪ It can improve model performance by providing more information to the model about the categorical variable.
▪ It can help to avoid the problem of ordinality, where integer codes suggest an ordering (e.g. "small" < "medium" < "large") that the categories do not actually have.
One-Hot Encoding Examples
For a column with the labels Male and Female, one-hot encoding prepares a separate column for each label. So, wherever there is a Male, the value will be 1 in the Male column and 0 in the Female column, and vice-versa.
Let's understand with an example: Consider the data where fruits, their corresponding categorical values, and prices are given.
Fruit | Categorical value of fruit | Price
apple | 1 | 5
mango | 2 | 10
apple | 1 | 15
orange | 3 | 20

The output after applying one-hot encoding on the data is given as follows:

apple | mango | orange | price
1 | 0 | 0 | 5
0 | 1 | 0 | 10
1 | 0 | 0 | 15
0 | 0 | 1 | 20

The disadvantages of using one-hot encoding include:
1. It can lead to increased dimensionality, as a separate column is created for each category in the variable. This can make the model more complex and slower to train.
2. It can lead to sparse data, as most observations will have a value of 0 in most of the one-hot encoded columns.
3. It can lead to overfitting, especially if there are many categories in the variable and the sample size is relatively small.
One-hot encoding is a powerful technique to treat categorical data, but because of these drawbacks it is important to use it cautiously and consider other methods such as ordinal encoding or binary encoding.

3. Binary Encoding: Binary encoding combines elements of label encoding and one-hot encoding. It first assigns unique integer labels to each category and then represents these labels in binary form. It is especially useful when we have many categories, as it reduces the dimensionality compared to one-hot encoding.

4. Frequency Encoding (Count Encoding): Frequency encoding replaces each category with the count of how often it appears in the dataset. This can be useful when we suspect that the frequency of a category is related to the target variable.

5. Target Encoding (Mean Encoding): Target encoding is used when we want to encode categorical variables based on their relationship with the target variable. It replaces each category with the mean of the target variable for that category.
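The five encodings above can be compared side by side on the fruit table. The following is a minimal sketch in plain Python, under two assumptions made only for illustration: labels are assigned in order of first appearance (which reproduces the categorical values 1, 2, 3 in the table), and the price column is treated as the target variable for target encoding.

```python
from collections import Counter

# The fruit table from the one-hot example; prices double as a
# hypothetical target variable for target encoding.
fruits = ["apple", "mango", "apple", "orange"]
prices = [5, 10, 15, 20]

# Label encoding: one integer per category, in order of first appearance.
labels = {}
for fruit in fruits:
    labels.setdefault(fruit, len(labels) + 1)
label_encoded = [labels[f] for f in fruits]

# One-hot encoding: one binary column per category (apple, mango, orange).
categories = sorted(labels)
one_hot = [[1 if f == c else 0 for c in categories] for f in fruits]

# Binary encoding: write each integer label in binary, one bit per column.
width = max(labels.values()).bit_length()
binary_encoded = [[int(b) for b in format(labels[f], f"0{width}b")] for f in fruits]

# Frequency encoding: replace each category with how often it appears.
counts = Counter(fruits)
frequency_encoded = [counts[f] for f in fruits]

# Target encoding: replace each category with the mean target for that category.
totals = Counter()
for f, p in zip(fruits, prices):
    totals[f] += p
target_encoded = [totals[f] / counts[f] for f in fruits]

print(label_encoded, one_hot, binary_encoded, frequency_encoded, target_encoded)
```

With three categories, one-hot encoding needs three columns while binary encoding needs only two, which is the dimensionality saving mentioned above. The gap widens as the number of categories grows.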
Feature Selection
A feature is an attribute that has an impact on a problem or is useful for the problem, and choosing the important features for the model is known as feature selection. Each machine learning process depends on feature engineering, which mainly contains two processes: Feature Selection and Feature Extraction. Although feature selection and extraction may have the same objective, they are completely different from each other. The main difference between them is that feature selection is about selecting a subset of the original feature set, whereas feature extraction creates new features. Feature selection is a way of reducing the number of input variables for the model by using only relevant data, in order to reduce overfitting in the model.
We can define feature selection as, "It is a process of automatically or manually selecting the subset of most appropriate and relevant features to be used in model building." Feature selection is performed by either including the important features or excluding the irrelevant features in the dataset without changing them.

Feature Selection Techniques
There are mainly two types of feature selection techniques, which are:
o Supervised Feature Selection technique
Supervised feature selection techniques consider the target variable and can be used for labeled datasets.
o Unsupervised Feature Selection technique
Unsupervised feature selection techniques ignore the target variable and can be used for unlabeled datasets.

There are mainly three techniques under supervised feature selection:
1. Wrapper Methods
In wrapper methodology, selection of features is done by treating it as a search problem, in which different combinations are made, evaluated, and compared with other combinations. It trains the algorithm by using the subset of features iteratively.
Based on the output of the model, features are added or removed, and with this feature set, the model is trained again.
Some techniques of wrapper methods are:
o Forward selection - Forward selection is an iterative process, which begins with an empty set of features. In each iteration, it adds a feature and evaluates whether the addition improves the performance or not. The process continues until the addition of a new variable/feature does not improve the performance of the model.
o Backward elimination - Backward elimination is also an iterative approach, but it is the opposite of forward selection. This technique begins the process by considering all the features and removes the least significant feature. This elimination process continues until removing features does not improve the performance of the model.
o Exhaustive Feature Selection - Exhaustive feature selection is one of the best feature selection methods, which evaluates each feature set by brute force. This means the method tries every possible combination of features and returns the best performing feature set.
o Recursive Feature Elimination - Recursive feature elimination is a recursive greedy optimization approach, where features are selected by recursively taking a smaller and smaller subset of features. An estimator is trained with each set of features, and the importance of each feature is determined using the coef_ attribute or the feature_importances_ attribute.

2. Filter Methods
In the filter method, features are selected based on statistical measures. This method does not depend on the learning algorithm and chooses the features as a pre-processing step. The filter method filters out irrelevant features and redundant columns from the model by ranking them using different metrics.
The advantage of using filter methods is that they need low computational time and are less prone to overfitting the data.
Some common techniques of filter methods are as follows:
o Information Gain
o Chi-square Test
o Fisher's Score
o Missing Value Ratio
Information Gain: Information gain determines the reduction in entropy while transforming the dataset. It can be used as a feature selection technique by calculating the information gain of each variable with respect to the target variable.
Chi-square Test: The chi-square test is a technique to determine the relationship between categorical variables. The chi-square value is calculated between each feature and the target variable, and the desired number of features with the best chi-square values is selected.
Fisher's Score: Fisher's score is one of the popular supervised techniques of feature selection. It ranks the variables by Fisher's criterion in descending order. We can then select the variables with a large Fisher's score.
Missing Value Ratio: The missing value ratio can be used for evaluating the feature set against a threshold value. It is obtained by dividing the number of missing values in each column by the total number of observations. A variable whose ratio exceeds the threshold value can be dropped.

3. Embedded Methods
Embedded methods combine the advantages of both filter and wrapper methods by considering the interaction of features along with low computational cost. They are fast, like the filter method, but more accurate than it.
These methods are also iterative: they evaluate each iteration and find the features that contribute the most to training in that iteration.
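Of the filter methods listed earlier, the missing value ratio is simple enough to work through by hand. The sketch below applies the formula (missing values in a column divided by total observations) to a small made-up dataset; the column names, values, and the 0.5 threshold are assumptions chosen only for illustration.

```python
# Hypothetical dataset as column -> values; None marks a missing value.
data = {
    "age":    [25, None, 31, 40, None],
    "city":   ["Pune", "Mumbai", None, "Delhi", "Pune"],
    "income": [None, None, None, 52000, None],
}

def missing_value_ratio(values):
    """Number of missing values divided by the total number of observations."""
    return sum(v is None for v in values) / len(values)

THRESHOLD = 0.5  # assumed cut-off; in practice it is chosen per project

ratios = {column: missing_value_ratio(values) for column, values in data.items()}
kept = [column for column, ratio in ratios.items() if ratio <= THRESHOLD]
dropped = [column for column, ratio in ratios.items() if ratio > THRESHOLD]

print(ratios)
print("kept:", kept, "dropped:", dropped)
```

Here income is missing in 4 of 5 rows (ratio 0.8), so it exceeds the threshold and is dropped, while age (0.4) and city (0.2) are kept.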
Some techniques of embedded methods are:
o Regularization - Regularization adds a penalty term to the parameters of the machine learning model to avoid overfitting. This penalty term is applied to the coefficients; hence it shrinks some coefficients to zero. Features with zero coefficients can be removed from the dataset. The types of regularization techniques used here are L1 Regularization (Lasso Regularization) and Elastic Nets (L1 and L2 regularization).
o Random Forest Importance - Tree-based methods of feature selection provide feature importances, which give us a way of selecting features. Here, feature importance specifies which feature has more importance in model building or has a greater impact on the target variable. Random Forest is such a tree-based method: it is a type of bagging algorithm that aggregates many decision trees. It automatically ranks the nodes by their performance, or decrease in impurity (Gini impurity), over all the trees. Nodes are arranged according to their impurity values, which allows pruning of trees below a specific node. The remaining nodes create a subset of the most important features.

Data Merging: Combining Multiple Datasets
The most common method for merging data is through a process called "joining". There are several types of joins.
Inner join: Uses a comparison operator to match rows from two tables based on the values in common columns from each table. Only the matching rows are returned.
Left join / left outer join: Returns all the rows from the left table that is specified in the left outer join clause, not just the rows in which the columns match.
Right join / right outer join: Returns all the rows from the right table that is specified in the right outer join clause, not just the rows in which the columns match.
Full outer join: Returns all the rows in both the left and right tables.
Cross join (cartesian join): Returns all possible combinations of rows from two tables.
____________________________________________________________
Chapter Ends…

Chapter-4 Data Wrangling and Feature Engineering
Data Wrangling and Feature Engineering: Data wrangling techniques: reshaping, pivoting, aggregating; Feature engineering: creating new features, handling time-series data; Dummification: converting categorical variables into binary indicators; Feature scaling: standardization, normalization.

Data Wrangling and Feature Engineering
Data Wrangling:
A data wrangling process, also known as a data munging process, consists of reorganizing, transforming, and mapping data from one "raw" form into another to make it more usable and valuable for a variety of downstream uses including analytics.
Data wrangling can be defined as the process of cleaning, organizing, and transforming raw data into the desired format for analysts to use for prompt decision-making. Data wrangling enables businesses to tackle more complex data in less time, produce more accurate results, and make better decisions.

Data Wrangling Tools
o Spreadsheets / Excel Power Query
o OpenRefine
o Tabula
o Google Dataprep
o Data Wrangler

Reshaping Data:
Reshaping data involves changing the structure of the dataset. The shape of a data set refers to the way in which it is arranged into rows and columns, and reshaping data is the rearrangement of the data without altering its content. Reshaping data sets is a very frequent and cumbersome task in the process of data manipulation and analysis.
Common reshaping techniques include:
Merging (Joining): Combining multiple datasets by a common key or identifier. This is useful when we have data in different tables or sources.
Melting (Unpivoting): Transforming a dataset from wide format (many columns) to long format (fewer columns but more rows). This is useful when we have data with multiple variables in separate columns.

Pivoting
Data pivoting enables us to rearrange the columns and rows in a report so we can view data from different perspectives.
Common pivoting techniques include:
Pivot Tables: A PivotTable is an interactive way to quickly summarize large amounts of data. You can use a PivotTable to analyze numerical data in detail and answer unanticipated questions about your data. A PivotTable is especially designed for querying large amounts of data in many user-friendly ways.
Crosstabs (Contingency Tables): A contingency table (also known as a cross tabulation or crosstab) is a type of table in a matrix format that displays the multivariate frequency distribution of the variables. Crosstabs are heavily used in survey research, business intelligence, engineering, and scientific research.
Transpose: This simple operation flips rows and columns, making the data easier to work with in some cases.

Data Aggregation
Data aggregation is the process of compiling typically large amounts of information from a given database and organizing it into a more consumable and comprehensive medium.
A common statistical data aggregation is reducing a distribution of values to a mean and standard deviation. Another example of data reduction is frequency tables.
A histogram is an example of aggregation for exploration. Histograms count (aggregate) the number of observations that fall into bins. While some data is lost in this aggregation, it also provides a very useful visualization of the distribution of a set of values.

Feature Engineering: Creating New Features, Handling Time-Series Data
Feature engineering involves a set of techniques that enable us to create new features by combining or transforming the existing ones.
These techniques help to highlight the most important patterns and relationships in the data, which in turn helps the machine learning model to learn from the data more effectively. Feature engineering is the pre-processing step of machine learning, which is used to transform raw data into features that can be used for creating a predictive model using Machine learning or statistical Modeling. Feature engineering in machine learning aims to improve the performance of models. Feature engineering in ML contains mainly four processes: Feature Creation, Transformations, Feature Extraction, and Feature Selection. These processes are described as below: 1. Feature Creation: Feature creation is finding the most useful variables to be used in a predictive model. The process is subjective, and it requires human creativity and intervention. The new features are created by mixing existing features using addition, subtraction, and ration, and these new features have great flexibility. 2. Transformations: The transformation step of feature engineering involves adjusting the predictor variable to improve the accuracy and performance of the model. For example, it ensures that the model is flexible to take input of OMega TechEd 47 Compiled By: Asst.Prof. MEGHA SHARMA the variety of data; it ensures that all the variables are on the same scale, making the model easier to understand. It improves the model's accuracy and ensures that all the features are within the acceptable range to avoid any computational error. 3. Feature Extraction: Feature extraction is an automated feature engineering process that generates new variables by extracting them from the raw data. The main aim of this step is to reduce the volume of data so that it can be easily used and managed for data modeling. Feature extraction methods include cluster analysis, text analytics, edge detection algorithms, and principal components analysis (PCA). 4. 
Feature Selection: While developing a machine learning model, only a few variables in the dataset are useful for building the model, and the rest of the features are either redundant or irrelevant. If we input the dataset with all these redundant and irrelevant features, it may reduce the overall performance and accuracy of the model. Hence it is very important to identify and select the most appropriate features from the data and remove the irrelevant or less important ones, which is done with the help of feature selection. "Feature selection is a way of selecting the subset of the most relevant features from the original features set by removing the redundant, irrelevant, or noisy features."

Feature Engineering Techniques
Some of the popular feature engineering techniques include:
1. Imputation
Feature engineering deals with inappropriate data, missing values, human interruption, general errors, insufficient data sources, etc. Missing values within the dataset strongly affect the performance of the algorithm, and the "imputation" technique is used to deal with them. Imputation is responsible for handling irregularities within the dataset. For example, a row or column with a very high percentage of missing values can be removed entirely. But at the same time, to maintain the data size, the missing data often has to be imputed, which can be done as follows:
o For numerical data, a default value can be imputed in a column, or missing values can be filled with the mean or median of the column.
o For categorical data, missing values can be replaced with the most frequently occurring value in the column.
2. Handling Outliers
Outliers are deviating values or data points that lie so far away from the other data points that they badly affect the performance of the model. Outliers can be handled with this feature engineering technique.
This technique first identifies the outliers and then removes them. Standard deviation can be used to identify the outliers. For example, each value lies at some distance from the average, and if that distance is greater than a chosen threshold (commonly a multiple of the standard deviation), the value can be considered an outlier. The Z-score can also be used to detect outliers.
3. Log transform
Logarithm transformation, or log transform, is one of the commonly used mathematical techniques in machine learning. Log transform helps in handling skewed data, and it makes the distribution closer to normal after transformation. It also reduces the effect of outliers on the data: because differences in magnitude are compressed, a model becomes much more robust.
4. Binning
In machine learning, overfitting is one of the main issues that degrades the performance of the model, and it occurs due to a large number of parameters and noisy data. One of the popular techniques of feature engineering, "binning", can be used to smooth the noisy data. This process involves segmenting the values of a feature into bins.
5. Feature Split
As the name suggests, feature split is the process of splitting a feature into two or more parts to make new features. This technique helps the algorithms to better understand and learn the patterns in the dataset. The feature splitting process enables the new features to be clustered and binned, which results in extracting useful information and improving the performance of the data models.
6. One hot encoding
One hot encoding is a popular encoding technique in machine learning. It converts categorical data into a form that can be easily understood by machine learning algorithms, which helps them make good predictions. It enables grouping of categorical data without losing any information.
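As an illustration of several techniques in the list above (imputation, Z-score outlier detection, log transform, binning, and one hot encoding), here is a minimal sketch in plain Python. The function names, sample values, bin edges, and the Z-score threshold are hypothetical choices made for illustration and do not come from the notes; libraries such as Pandas and scikit-learn offer more complete tools for the same tasks.

```python
import math
import statistics


def impute_numeric(values, strategy="median"):
    """Fill None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    else:
        fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


def impute_categorical(values):
    """Fill None entries with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    fill = statistics.mode(observed)
    return [fill if v is None else v for v in values]


def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute Z-score exceeds the threshold (assumed cut-off)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []
    return [v for v in values if abs((v - mean) / std) > threshold]


def log_transform(values):
    """Apply log(1 + x), which is defined at zero and compresses large values."""
    return [math.log1p(v) for v in values]


def bin_value(value, edges, labels):
    """Return the label of the first bin whose upper edge is >= value.

    Expects one more label than edges; values above the last edge get the final label.
    """
    for edge, label in zip(edges, labels):
        if value <= edge:
            return label
    return labels[-1]


def one_hot(values):
    """Encode categories as 0/1 indicator columns, one column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Made-up sample data, for illustration only.
ages = [22, None, 35, 41, None, 29]
cities = ["Pune", "Mumbai", None, "Mumbai"]
incomes = [30, 32, 31, 29, 33, 30, 31, 32, 29, 30, 500]

ages_filled = impute_numeric(ages, strategy="median")
cities_filled = impute_categorical(cities)
outliers = zscore_outliers(incomes, threshold=2.5)
age_groups = [bin_value(a, [25, 40], ["young", "middle", "senior"]) for a in ages_filled]
categories, encoded = one_hot(cities_filled)

print(ages_filled)
print(cities_filled)
print(outliers)
print(age_groups)
print(categories, encoded)
```

Each helper handles one column at a time, which keeps the idea behind each technique visible; in a real project the same steps would usually be applied to whole tables at once.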
A typical approach to feature engineering in time series forecasting involves the following types of features.
Lagged variables: A lag variable is a variable based on the past values of the time series. By incorporating previous time series values as features, patterns such as seasonality and trends can be captured. For example, if we want to predict today's sales, using lagged variables like yesterday's sales can provide valuable information about the ongoing trend.
Moving window statistics: Moving statistics can also be called moving window statistics, rolling statistics, or running statistics. A predefined window around each observation is used to calculate various statistics (such as the mean) before moving to the next observation.
Time-based features: Features such as the day of the week, the month of the year, holiday indicators, seasonality, and other time-related patterns can be valuable for prediction. For instance, if certain products tend to have higher average sales on weekends, incorporating the day of the week as a feature can improve the accuracy of the forecasting model.

Dummification: Converting categorical variables into binary indicators. The word "dummy" refers to a stand-in or replica, and in data science it holds the same meaning: a dummy variable stands in for a category. Dummifying variables is the process of transforming categorical variables into a numerical representation. Example: one-hot encoding.

Feature Scaling: Standardization, Normalization
Feature scaling is a data preprocessing technique used to transform the values of features or variables in a dataset to a similar scale. The purpose is to ensure that all features contribute equally to the model and to avoid the domination of features with larger values. Feature scaling becomes necessary when dealing with datasets containing features that have different ranges, units of measurement, or orders of magnitude.
In such cases, the variation in feature values can lead to biased model performance or difficulties during the learning process. The two most common techniques for feature scaling are standardization and normalization (min-max scaling). These methods adjust the feature values while preserving their relative relationships and distributions. By applying feature scaling, the dataset's features can be transformed to a more consistent scale, making it easier to build accurate and effective machine learning models. Scaling facilitates meaningful comparisons between features, improves model convergence, and prevents certain features from overshadowing others based solely on their magnitude.

Normalization: Normalization is a data preprocessing technique that brings the values of the features in a dataset to a common scale. This reduces the influence of varying scales on machine learning models. In normalization, values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling. Here's the formula for normalization:

$$X' = \frac{X - X_{min}}{X_{max} - X_{min}}$$

Here, X_max and X_min are the maximum and the minimum values of the feature, respectively.
o When the value of X is the minimum value in the column, the numerator is 0, and hence X' is 0.
o When the value of X is the maximum value in the column, the numerator is equal to the denominator, and thus X' is 1.
o If the value of X is between the minimum and the maximum value, then X' is between 0 and 1.

Standardization: Standardization is another feature scaling method, where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero, and the resultant distribution has a unit standard deviation.
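Before the standardization formula, here is a minimal sketch of both scaling methods in plain Python. It assumes normalization is computed as (x - min)/(max - min) and standardization as (x - mean)/standard deviation, using the population standard deviation; the function names and the sample heights are illustrative and not part of the notes.

```python
import statistics


def min_max_normalize(values):
    """Rescale values to the range [0, 1] using (x - min) / (max - min)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant feature has no spread; map every value to 0.0 (assumed convention).
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


def standardize(values):
    """Center values on the mean and scale to unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Made-up sample feature, for illustration only.
heights_cm = [150, 160, 170, 180, 190]

normalized = min_max_normalize(heights_cm)
standardized = standardize(heights_cm)

print(normalized)
print(standardized)
```

The normalized values always stay between 0 and 1, while the standardized values are not bounded; they have mean 0 and standard deviation 1. In practice, scikit-learn's MinMaxScaler and StandardScaler apply the same two transformations to whole datasets.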
Here's the formula for standardization:

$$X' = \frac{X - \mu}{\sigma}$$

Here, μ is the mean of the feature values and σ is the standard deviation of the feature values. Note that, in this case, the values are not restricted to a particular range.

Normalization | Standardization
Rescales values to a range between 0 and 1 | Centers data around the mean and scales to a standard deviation of 1
Useful when the distribution of the data is unknown or not Gaussian | Useful when the distribution of the data is Gaussian or unknown
Sensitive to outliers | Less sensitive to outliers
Equation: (x - min)/(max - min) | Equation: (x - mean)/standard deviation

Both methods are linear rescalings, so both retain the shape of the original distribution and the relative order of the data points.
__________________________________________________________________
Chapter Ends…

Chapter-5 Tools and Libraries
Tools and Libraries: Introduction to popular libraries and technologies used in Data Science, like Pandas, NumPy, scikit-learn, etc.
1. TensorFlow: TensorFlow is a free and open-source software library for machine learning and artificial intelligence. It can be used across a range of tasks but has a particular focus on training and inference of deep neural networks. It was developed by the Google Brain team for Google's internal use in research and production.
2. Matplotlib: Matplotlib is a multi-platform data visualization library built on NumPy arrays and designed to work with the broader SciPy stack. It was introduced by John Hunter in 2002. One of the greatest benefits of visualization is that it presents huge amounts of data in easily digestible visuals. Matplotlib supports several plot types, such as line, bar, scatter, and histogram.
3.
Pandas: Pandas is a data analysis library built on top of NumPy, which it uses for mathematical operations; its plotting methods rely on Matplotlib for data visualization. Pandas acts as a wrapper over these libraries, allowing you to access many of Matplotlib's and NumPy's methods with less code.
4. NumPy: NumPy (Numerical Python) is an open-source Python library that's used in almost every field of science and engineering. It's the universal standard for working with numerical data in Python, and it's at the core of the scientific Python and PyData ecosystems.
5. SciPy: SciPy is an open-source Python library used across science and engineering for tasks such as optimization, statistics, and signal processing. Like NumPy, SciPy is open source, so we can use it freely. Travis Oliphant, the creator of NumPy, was one of SciPy's original authors.
6. Scrapy: Scrapy is a comprehensive open-source framework and is among the most powerful libraries used for web data extraction. Scrapy natively integrates functions for extracting data from HTML or XML sources using CSS and XPath expressions.
7. Scikit-learn: Scikit-learn, also known as sklearn, is a Python library for implementing machine learning models and statistical modeling. Through scikit-learn, we can implement various machine learning models for regression, classification, and clustering, along with statistical tools for analyzing these models.
8. PyGame: Pygame is a cross-platform set of Python modules designed for writing video games. It includes computer graphics and sound libraries designed to be used with the Python programming language.
__________________________________________________________
Chapter Ends…

Chapter 6 Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA): Data visualization techniques: histograms, scatter plots, box plots, etc., Descriptive statistics: mean, median, mode, standard deviation, etc., Hypothesis testing: t-tests, chi-square tests, ANOVA, etc.
Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods. It refers to the method of studying and exploring data sets to understand their main characteristics, discover patterns, locate outliers, and identify relationships between variables. EDA is normally carried out as a preliminary step before undertaking more formal statistical analyses or modeling.

The Main Goals of EDA
1. Data Cleaning: EDA involves examining the data for errors, missing values, and inconsistencies. It includes techniques such as data imputation, handling missing data, and identifying and removing outliers.
2. Descriptive Statistics: EDA uses descriptive statistics to understand the central tendency, variability, and distribution of variables. Measures like mean, median, mode, standard deviation, range, and percentiles are commonly used.
3. Data Visualization: EDA employs visual techniques to represent the data graphically. Visualizations such as histograms, box plots, scatter plots, line plots, heatmaps, and bar charts help in identifying patterns, trends, and relationships within the data.
4. Feature Engineering: EDA allows for the exploration of various variables and their transformations to create new features or derive meaningful insights. Feature engineering can include scaling, normalization, binning, encoding categorical variables, and creating interaction or derived variables.
5.
Correlation and Relationships: EDA helps discover relationships and dependencies between variables. Techniques such as correlation analysis, scatter plots, and cross-tabulations offer insights into the strength and direction of relationships between variables.
6. Data Segmentation: EDA can involve dividing the data into meaningful segments based on certain criteria or characteristics. This segmentation helps gain insights into specific subgroups within the data and can lead to more focused analysis.
7. Hypothesis Generation: EDA aids in generating hypotheses or research questions based on the preliminary exploration of the data. It helps form the foundation for further analysis and model building.
8. Data Quality Assessment: EDA permits assessing the quality and reliability of the data. It involves checking for data integrity, consistency, and accuracy to make sure the data is suitable for analysis.

Data Visualization Techniques
Histograms
Histograms are one of the most popular visualizations for analyzing the distribution of data. They show a numerical variable's distribution with bars. The hist function in Matplotlib is used to create a histogram. To build a histogram, the numerical data is first divided into several ranges or bins, and the frequency of occurrence in each range is counted. The horizontal axis shows the ranges, while the vertical axis represents the frequency or percentage of occurrences in each range. Histograms immediately show how a variable's distribution is skewed and where it peaks.

Box and whisker plots
Another great plot for summarizing the distribution of a variable is the boxplot. Boxplots provide an intuitive and compelling way to spot the following elements:
Median. The middle value of a dataset, where 50% of the data is less than the median and 50% of the data is higher than the median.
The upper quartile.
The 75th percentile of a dataset, where 75% of the data is less than the upper quartile and 25% of the data is higher than the upper quartile.
The lower quartile. The 25th percentile of a dataset, where 25% of the data is less than the lower quartile and 75% is higher than the lower quartile.
The interquartile range. The upper quartile minus the lower quartile.
The upper adjacent value. Or colloquially, the "maximum." It is the largest observation that does not exceed the upper quartile plus 1.5 times the interquartile range.
The lower adjacent value. Or colloquially, the "minimum." It is the smallest observation that is not below the lower quartile minus 1.5 times the interquartile range.
Outliers. Any values above the "maximum" or below the "minimum."

Scatter plots
Scatter plots are used to visualize the relationship between two continuous variables. Each point in the plot represents a single data point, and the position of the point on the x-axis and y-axis represents the values of the two variables. Scatter plots are often used in data exploration to understand the data and quickly surface potential correlations.

Heat maps
A heatmap is a common and beautiful matrix plot that can be used to graphically summarize the relationship between two variables. The degree of correlation between two variables is represented by a color code.

Mean, Median and Mode
Mean, Median, and Mode are measures of central tendency. These values are used to describe the various parameters of the given data set and give useful insights about the data studied. They are used to study any type of data, such as the average salary of employees in an organization, the median age of a class, the number of people who play cricket in a sports club, etc. A measure of central tendency is a single value that represents the values of the given data set.
There are various measures of central tendency, and the three most important are:
Mean (x̅ or μ)
Median (M)
Mode (Z)

Mean: The mean is the sum of all the values in the data set divided by the number of values in the data set. It is also called the Arithmetic Average. The mean is denoted as x̅ and is read as x bar.
Mean Formula
The formula to calculate the mean is:
Mean (x̅) = Sum of Values / Number of Values
If x1, x2, x3, ……, xn are the values of a data set, then the mean is calculated as:
x̅ = (x1 + x2 + x3 + …… + xn) / n

Median: The median is the middle value of sorted data. The data can be sorted in either ascending or descending order. The median divides the data into two equal halves.
If the number of values (n) in the data set is odd, then the formula to calculate the median is:
Median = [(n + 1)/2]th term
If the number of values (n) in the data set is even, then the formula to calculate the median is:
Median = [(n/2)th term + {(n/2) + 1}th term] / 2

Mode: The mode is the most frequent value or item of the data set. A data set can have one or more than one mode. If the data set has one mode, it is called "unimodal". Similarly, if the data set contains 2 modes, it is called "bimodal", and if the data set contains 3 modes, it is known as "trimodal". Any data set with more than one mode is known as "multimodal" (which includes bimodal and trimodal). There is no mode for a data set if every number appears only once.
Mode Formula
Mode = Highest Frequency Term

Standard Deviation
Standard deviation is a measure that shows how much variation (spread or dispersion) from the mean exists. The standard deviation indicates a "typical" deviation from the mean.
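Before going further into standard deviation, the measures defined above can be checked with a short sketch using Python's standard library. The sample values and the helper name modes are made up for illustration; the helper returns an empty list when every value appears only once, matching the rule stated in the notes, and it assumes a non-empty data set. The standard deviation here is the population version, which divides by the number of observations.

```python
import statistics
from collections import Counter


def modes(values):
    """Return all most-frequent values, or [] when every value appears only once."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []
    return [value for value, count in counts.items() if count == highest]


# Made-up sample data, for illustration only.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

mean_score = statistics.mean(scores)      # sum of values / number of values
median_score = statistics.median(scores)  # average of the 4th and 5th sorted values (n is even)
mode_scores = modes(scores)               # a single mode, so the data is unimodal
std_score = statistics.pstdev(scores)     # population standard deviation

print(mean_score, median_score, mode_scores, std_score)
print(modes([1, 1, 2, 2, 3]))  # two modes: bimodal
print(modes([1, 2, 3]))        # no mode
```

Because n = 8 is even, the median is the average of the two middle terms, which is why it need not be one of the original values.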
Standard deviation is a popular measure of variability because it is expressed in the original units of measure of the data set. As with the variance, if the data points are close to the mean, the standard deviation is small, whereas if the data points are highly spread out from the mean, it is large. Standard deviation measures the extent to which the values differ from the average. Standard deviation, the most widely used measure of dispersion, is based on all values. Therefore, a change in even one value affects the value of the standard deviation. It is independent of origin but not of scale: adding a constant to every value leaves it unchanged, but multiplying every value by a constant scales it. It is also useful in certain advanced statistical problems.
Standard Deviation Formula
The population standard deviation formula is given as:

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(X_i - \mu)^2}$$

Here,
σ = Population standard deviation
N = Number of observations in the population
Xi = ith observation in the population
μ = Population mean

Hypothesis Testing
Hypothesis testing is used to assess the plausibility of a hypothesis by using sample data. The test provides evidence concerning the plausibility of the hypothesis, given the data. Statistical analysts test a hypothesis by measuring and examining a random sample of the population being analyzed. An analyst performs hypothesis testing on a statistical sample to present evidence of the plausibility of the null hypothesis. Measurements and analyses are conducted on a random sample of the population to test a theory. Analysts use a random population sample to test two hypotheses: the null and alternative hypotheses. The null hypothesis is typically an equality hypothesis between population parameters; for example, a null hypothesis may claim that the population mean return is equal to zero. The alternative hypothesis is essentially the inverse of the null hypothesis (e.g., the population mean return is not equal to zero). As a result, they are mutu