UNIT-I

SYLLABUS: Major Tasks in Data Science: Data Collection, Storing Data, Data Processing, Exploratory Data Analysis, Data Modeling. Life cycle of Data Science: Business Understanding, Data Understanding, Data Preparation, Model Building, Model Evaluation, and Deployment; CRISP-DM methodology. Applications of data science: Finance, Healthcare, Business and Marketing, Manufacturing, Cyber security, Transportation, Social Media, Agriculture, etc.

1. Major Tasks in Data Science
✓ Data Collection
✓ Storing Data
✓ Data Processing
✓ Exploratory Data Analysis
✓ Data Modeling

1.1 Data Collection:
Data collection is a fundamental process in data science and analytics, involving the gathering of raw data from diverse sources to analyze and derive insights. Key methods gather raw data from sources such as:
✓ Databases
✓ APIs
✓ Web scraping
✓ Sensors

Databases:
SQL Databases: Structured data stored in relational databases like MySQL, PostgreSQL, SQL Server.
NoSQL Databases: Unstructured or semi-structured data stored in databases like MongoDB, Cassandra, Redis.

APIs (Application Programming Interfaces):
Access data provided by third-party services or platforms through their APIs. Examples include social media APIs (Twitter, Facebook), financial APIs (Alpha Vantage, Quandl), and public data APIs (OpenWeatherMap, NASA).

Web Scraping:
Tools and Libraries: Use tools like BeautifulSoup, Scrapy, or Selenium to extract data from websites.
Ethical Considerations: Ensure compliance with website terms of service and legal guidelines.

Sensors:
IoT Devices: Collect real-time data from Internet of Things (IoT) devices such as temperature sensors, motion detectors, and environmental sensors.
Wearable Devices: Gather data from health and fitness trackers.
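The database route listed above can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: the table name `readings`, its columns, and the helper `collect_rows` are hypothetical and do not come from the notes, and an in-memory database stands in for a real MySQL or PostgreSQL server.

```python
import sqlite3


def collect_rows(connection, query):
    """Run a SELECT query and return the rows as a list of dictionaries."""
    cursor = connection.execute(query)
    # cursor.description holds one tuple per column; item 0 is the column name.
    columns = [description[0] for description in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Hypothetical in-memory database standing in for a real data source.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE readings (sensor_id TEXT, temperature_c REAL)")
connection.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("s1", 21.5), ("s2", 19.0)],
)

rows = collect_rows(
    connection,
    "SELECT sensor_id, temperature_c FROM readings ORDER BY sensor_id",
)
connection.close()

for row in rows:
    print(row)
```

Returning dictionaries keeps the column names attached to each value, which makes the collected rows easy to hand over to later processing steps.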
Example program:

    import pandas as pd

    # The file has no header row, so the column names are supplied explicitly.
    url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
    columns = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'class']
    data = pd.read_csv(url, header=None, names=columns)
    data.head()

OUTPUT: the first five rows of the Iris data set, with the five columns named above.

1.2 Storing Data
Goal: Store the collected data so that it can be easily accessed and used for analysis.
Steps:
Choose Storage Solutions: Select databases or data warehouses (SQL, NoSQL, cloud storage solutions) based on data type, size, and access requirements.
Data Organization: Structure the data in a logical and accessible manner.
Data Security: Implement security measures to protect data from unauthorized access or breaches.
Backup and Recovery: Set up regular backups and recovery plans to prevent data loss.

1.3 Data Processing
The data received from the data retrieval phase is likely to be “a diamond in the rough.” Your task now is to sanitize and prepare it for use in the modeling and reporting phase.

1.3.1 Cleansing data
Data cleansing is a subprocess of the data science process that focuses on removing errors in data so that the data becomes a true and consistent representation of the processes it originates from. By “true and consistent representation” we imply that at least two types of errors exist. The first type is the interpretation error, such as when you take the value in your data for granted, like saying that a person’s age is greater than 300 years. The second type of error points to inconsistencies between data sources or against your company’s standardized values. An example of this class of errors is putting “Female” in one table and “F” in another when they represent the same thing: that the person is female. Another example is using Pounds in one table and Dollars in another. The table below gives an overview of the types of errors that can be detected with easy checks.

Table: An overview of common errors
General solution: Try to fix the problem early in the data acquisition chain or else fix it in the program.
Error description                    Possible solution
-----------------                    -----------------
Errors pointing to false values within one data set
  Mistakes during data entry         Manual overrules
  Redundant white space              Use string functions
  Impossible values                  Manual overrules
  Missing values                     Remove observation or value
  Outliers                           Validate and, if erroneous, treat as missing value (remove or insert)
Errors pointing to inconsistencies between data sets
  Deviations from a code book        Match on keys or else use manual overrules
  Different units of measurement     Recalculate
  Different levels of aggregation    Bring to same level of measurement by aggregation or extrapolation

Advanced solution: Regression
Sometimes you’ll use more advanced methods, such as simple modeling, to find and identify data errors; diagnostic plots can be especially insightful. For example, in Figure 2.5 we use a measure to identify data points that seem out of place. We do a regression to get acquainted with the data and detect the influence of individual observations on the regression line. When a single observation has too much influence, this can point to an error in the data, but it can also be a valid point. At the data cleansing stage, however, these advanced methods are rarely applied, and some data scientists regard them as overkill.

Regression captures the correlation between variables observed in a data set and quantifies whether those correlations are statistically significant or not. Regression allows researchers to predict or explain the variation in one variable based on another variable.

Regression in statistics: A measure of the relation between the mean value of one variable (e.g., output) and the corresponding values of other variables (e.g., time and cost).

Figure 2.5. The encircled point influences the model heavily and is worth investigating because it can point to a region where you don’t have enough data or might indicate an error in the data, but it also can be a valid data point.
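The idea of detecting the influence of individual observations on the regression line can be sketched as follows. The notes do not name the measure used in Figure 2.5, so this sketch assumes a simple leave-one-out comparison: refit the line without each point and record how much the slope changes. The data values and the helper names `fit_line` and `slope_influence` are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def slope_influence(xs, ys):
    """For each point, return the absolute change in slope when it is left out."""
    full_slope, _ = fit_line(xs, ys)
    changes = []
    for i in range(len(xs)):
        reduced_xs = xs[:i] + xs[i + 1:]
        reduced_ys = ys[:i] + ys[i + 1:]
        reduced_slope, _ = fit_line(reduced_xs, reduced_ys)
        changes.append(abs(full_slope - reduced_slope))
    return changes


# Made-up data: the first seven points lie close to y = 2x, the last one does not.
xs = [1, 2, 3, 4, 5, 6, 7, 20]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.1, 14.0, 10.0]

influence = slope_influence(xs, ys)
most_influential = influence.index(max(influence))
print("Most influential observation:", (xs[most_influential], ys[most_influential]))
```

A point flagged this way is only a candidate for investigation; as the text notes, it may be an error or a perfectly valid observation.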
Now that we’ve given the overview, it’s time to explain these errors in more detail.

Data entry errors
Data collection and data entry are error-prone processes. They often require human intervention, and because humans are only human, they make typos or lose their concentration for a second and introduce an error into the chain. But data collected by machines or computers isn’t free from errors either. Some errors arise from human sloppiness, whereas others are due to machine or hardware failure. Examples of errors originating from machines are transmission errors or bugs in the extract, transform, and load (ETL) phase.

For small data sets you can check every value by hand. When the variables you study don’t have many classes, you can detect data errors by tabulating the data with counts. When you have a variable that can take only two values, “Good” and “Bad”, you can create a frequency table and see if those are truly the only two values present. In Table 2.3, the values “Godo” and “Bade” point out that something went wrong in at least 16 cases.

Table 2.3. Detecting outliers on simple variables with a frequency table
Value    Count
Good     1598647
Bad      1354468
Godo     15
Bade     1

Most errors of this type are easy to fix with simple assignment statements and if-then-else rules:

    if x == "Godo":
        x = "Good"
    if x == "Bade":
        x = "Bad"

Redundant whitespace
Whitespaces tend to be hard to detect but cause errors like other redundant characters would. For example, suppose the cleaning during the ETL phase wasn’t well executed, and keys in one table contain a whitespace at the end of a string. This causes a mismatch of keys such as “FR” versus “FR ”, dropping the observations that can’t be matched. If you know to watch out for them, fixing redundant whitespaces is luckily easy enough in most programming languages. They all provide string functions that will remove the leading and trailing whitespaces. For instance, in Python you can use the strip() function to remove leading and trailing spaces.
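The frequency-table check and the two fixes described above (correcting known typos and stripping redundant whitespace) can be combined in a short sketch. The sample values, the `CORRECTIONS` mapping, and the helper `clean` are illustrative assumptions, not part of the original notes.

```python
from collections import Counter

# Made-up raw values containing typos and stray whitespace.
raw_values = ["Good", "Bad", "Godo", "Good ", "Bade", "Bad", " Good"]

# Known typos mapped to their intended values (assumed for this example).
CORRECTIONS = {"Godo": "Good", "Bade": "Bad"}


def clean(value):
    """Strip leading/trailing whitespace, then correct known typos."""
    stripped = value.strip()
    return CORRECTIONS.get(stripped, stripped)


# Frequency table before cleaning: unexpected keys reveal the errors.
raw_counts = Counter(raw_values)
print("Before:", dict(raw_counts))

# Frequency table after cleaning: only the two valid classes should remain.
clean_counts = Counter(clean(value) for value in raw_values)
print("After:", dict(clean_counts))
```

Stripping before looking up the correction means a value such as "Godo " would also be caught, since the whitespace is removed first.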
Fixing capital letter mismatches
Capital letter mismatches are common. Most programming languages make a distinction between “Brazil” and “brazil”. In this case you can solve the problem by applying a function that returns both strings in lowercase, such as .lower() in Python. "Brazil".lower() == "brazil".lower() should evaluate to True.

Impossible values and sanity checks
Sanity checks are another valuable type of data check. Here you check the value against physically or theoretically impossible values such as people taller than 3 meters or someone with an age of 299 years. Sanity checks can be directly expressed with rules:

check = 0
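A sanity check of this kind can be sketched as a simple range rule. The bounds below are illustrative assumptions: the 3-meter height limit follows the example in the text, while the upper age limit of 120 years is chosen only for this sketch. The records and the function name `is_plausible_person` are likewise hypothetical.

```python
def is_plausible_person(age_years, height_m):
    """Return True when both values fall inside the assumed plausible ranges."""
    age_ok = 0 <= age_years <= 120   # assumed upper bound for a human age
    height_ok = 0 < height_m <= 3.0  # nobody is taller than 3 meters
    return age_ok and height_ok


# Made-up records; B has an impossible age and C an impossible height.
records = [
    {"name": "A", "age": 34, "height_m": 1.75},
    {"name": "B", "age": 299, "height_m": 1.80},
    {"name": "C", "age": 41, "height_m": 3.4},
]

suspicious = [
    record["name"]
    for record in records
    if not is_plausible_person(record["age"], record["height_m"])
]
print("Records failing the sanity check:", suspicious)
```

Records that fail the rule are flagged rather than deleted, so they can be reviewed and either corrected or treated as missing values.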
