Summary

This document provides an overview of data science fundamentals. It covers the study process, data collection methods, data cleaning and wrangling, descriptive and inferential statistics, and supervised learning, along with the scope, components, and applications of data science. The content is aimed at anyone looking to understand the core principles of the field, including algorithms and model building.

Full Transcript

# Fundamentals of Data Science

## 1. Study Process

The process of conducting a data science study follows four key steps:

1. **Problem Definition** $\rightarrow$ Define the objectives of the study.
2. **Data Collection** $\rightarrow$ Gather and process relevant data.
3. **Analysis** $\rightarrow$ Extract useful information and identify patterns.
4. **Conclusion** $\rightarrow$ Make decisions and provide recommendations.

**Example Study: Proportion of Smokers in Sri Lanka**

* **Problem**: Determine the proportion of smokers in Sri Lanka.
* **Population**: The entire population of Sri Lanka.
* **Sample**: Instead of analyzing the entire population (which is costly), a smaller representative sample (e.g., 1000 people) is used for estimation.
* **Sampling Methods**: Different sampling techniques can improve the accuracy of the estimate.

## 2. Types of Statistics

Statistics is divided into two main categories:

1. **Descriptive Statistics**
   * Summarizes and describes data from a given sample.
   * Includes measures such as the mean, median, mode, and standard deviation.
2. **Inferential Statistics**
   * Uses sample data to draw inferences about a larger population.
   * Includes hypothesis testing, confidence intervals, and regression analysis.

## 3. Data Collection Methods

A. **Questionnaires (Surveys)**
   * Can be automated using digital tools.
   * Require technical devices and digital literacy.
   * May not always represent the entire population.

B. **Direct Observation**
   * Uses sensors, cameras, and scanners for data collection.
   * Efficient, but requires filtering out irrelevant (noisy) data.

C. **Interviews**
   * Can be conducted in person, over the phone, or through voice/video recordings.
   * Require transcription and text mining to analyze spoken information.

## 4. Structured vs. Unstructured Data

A. **Structured Data (Conventional Method)**
   * Organized in tables with rows (observations) and columns (variables).
   * Example: Data in a spreadsheet.

B. **Unstructured Data**
   * Includes speech, videos, images, and text.
   * Requires techniques like text mining and topic modeling to extract insights.
   * Example use cases: Analyzing speech from news channels to detect trending topics; extracting key insights from social media posts.

## 5. Data Cleaning

Data cleaning is the process of removing errors and inconsistencies from data.

**Steps in Data Cleaning:**

1. **Data Collection** $\rightarrow$ Raw data is gathered.
2. **Data Cleaning** $\rightarrow$ Errors, noise, and irrelevant data are removed.
3. **Data Analysis** $\rightarrow$ The refined dataset is analyzed for insights.

**Methods:**

* **Conventional Approach**: Collect only the necessary data to minimize cleaning effort.
* **Automated Approach**: Collect all available data, then perform extensive cleaning.

## 6. Data Wrangling vs. Data Cleaning

* **Data Cleaning**: Focuses on removing errors and inconsistencies from raw data.
* **Data Wrangling**: Involves restructuring and transforming data into a format suitable for analysis.
* A short pandas sketch after Section 7 illustrates the difference.

## 7. Handling Missing Values

Missing values occur for various reasons, such as respondents skipping sensitive questions (e.g., age, salary).

**Solutions for Handling Missing Values:**

1. **Deleting Entire Records** $\rightarrow$ Appropriate only when very few records have missing values.
2. **Replacing Missing Values** $\rightarrow$ Using estimation techniques such as:
   * **Mean Imputation**: Replacing missing values with the mean of the available data.
   * **K-Nearest Neighbors (K-NN) Imputation**: Filling missing values using the closest observations.
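To make the two replacement strategies from Section 7 concrete, here is a minimal sketch using pandas and scikit-learn. The toy DataFrame, the `age` and `salary` columns, and the choice of `n_neighbors=2` are illustrative assumptions rather than anything specified in the notes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy survey data; np.nan marks questions a respondent skipped.
df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 38],
    "salary": [40_000, np.nan, 52_000, 61_000, np.nan],
})

# Mean imputation: replace each missing value with its column mean.
mean_imputed = df.fillna(df.mean(numeric_only=True))

# K-NN imputation: fill each missing value from the k closest rows
# (here k = 2), based on the columns that are observed.
knn_imputed = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df),
    columns=df.columns,
)

print(mean_imputed)
print(knn_imputed)
```

In practice, columns are usually standardized before K-NN imputation so that a large-scale variable such as salary does not dominate the distance calculation.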
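As referenced in Section 6, the split between data cleaning and data wrangling can also be sketched in a few lines of pandas. The made-up survey rows, column names, and district-level summary below are assumptions chosen only to illustrate the distinction.

```python
import pandas as pd

# Made-up survey rows with typical problems: an exact duplicate row,
# inconsistent text values, and a numeric column stored as text.
raw = pd.DataFrame({
    "district": ["Colombo", "Colombo", "kandy ", "Kandy"],
    "smoker":   ["Yes", "Yes", "no", "No"],
    "age":      ["34", "34", "41", "29"],
})

# --- Data cleaning: remove errors and inconsistencies ---
clean = raw.drop_duplicates().copy()                      # drop the repeated row
clean["district"] = clean["district"].str.strip().str.title()
clean["smoker"] = clean["smoker"].str.title()
clean["age"] = pd.to_numeric(clean["age"], errors="coerce")

# --- Data wrangling: reshape into an analysis-ready structure ---
# One row per district with its smoking rate and average age.
wrangled = (
    clean.assign(is_smoker=lambda d: d["smoker"].eq("Yes"))
         .groupby("district", as_index=False)
         .agg(smoking_rate=("is_smoker", "mean"), mean_age=("age", "mean"))
)

print(clean)
print(wrangled)
```

Everything before the `groupby` fixes errors in individual values (cleaning); the `groupby` step changes the shape of the data for analysis (wrangling).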
## 8. Data Analysis

Data analysis is classified into:

A. **Descriptive Analysis**
   * Focuses on summarizing and visualizing data.
   * Includes tables, graphs, and summary statistics.

B. **Inferential Analysis**
   * Uses sample data to make generalizations about a population.
   * Includes:
     * Estimation
     * Predictive Analysis
     * Hypothesis Testing

**Hypothesis Testing:**

* A hypothesis is a statement about a population parameter.
* Hypothesis testing uses sample data to decide whether the hypothesis should be rejected (a small worked sketch appears after Section 9).

## 9. Statistical Learning

Statistical learning involves extracting patterns and insights from data. It is classified into:

1. **Supervised Learning**
2. **Unsupervised Learning**

**Supervised Learning**

* Involves training a model using labeled data.
* Example: Predicting whether a customer will continue using a network provider.

**Supervised Learning Model**

We define a function that relates the input $X$ to the output $y$: $y = f(X, \text{parameters}) + \epsilon$, where $\epsilon$ is a random error term.

**Goals of Supervised Learning:**

* Understand relationships between inputs and outputs.
* Predict future outcomes.

**Applications:**

* Email Spam Detection
* Medical Diagnosis
* Stock Price Prediction
* Customer Churn Prediction
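To make the supervised-learning formulation $y = f(X, \text{parameters}) + \epsilon$ from Section 9 concrete, here is a minimal scikit-learn sketch for the customer-churn example. The synthetic data, the two features, and the choice of logistic regression are assumptions for illustration; the notes do not prescribe a particular algorithm.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic customer records: X holds the inputs, y the label
# (1 = stays with the network provider, 0 = churns).
n = 500
usage = rng.normal(50, 15, n)        # monthly usage in GB
complaints = rng.poisson(1.0, n)     # support complaints in the last year
X = np.column_stack([usage, complaints])

# An assumed true relationship f(X, parameters) plus random error epsilon,
# thresholded to produce a 0/1 label for the simulation.
signal = 0.08 * usage - 1.2 * complaints + rng.normal(0.0, 1.0, n)
y = (signal > 2.0).astype(int)

# Estimate f(X, parameters) from a training split, then predict new customers.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

print("estimated parameters:", model.coef_, model.intercept_)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

The fitted coefficients play the role of the estimated parameters, and the held-out test accuracy indicates how well the model predicts future outcomes.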
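Returning to the hypothesis-testing step in Section 8, here is a minimal sketch that ties it back to the smokers example from Section 1, using SciPy's exact binomial test. The observed count of 230 smokers in a sample of 1000 and the 25% benchmark are hypothetical numbers, not figures from the notes.

```python
from scipy.stats import binomtest

# Hypothetical survey outcome: 230 smokers observed in a sample of 1000 people.
smokers, sample_size = 230, 1000

# H0: the true proportion of smokers is 0.25.  H1: it differs from 0.25.
result = binomtest(smokers, sample_size, p=0.25, alternative="two-sided")

print("sample proportion:", smokers / sample_size)
print("p-value:", result.pvalue)
# Reject H0 at the 5% significance level only if the p-value falls below 0.05.
print("reject H0 at the 5% level:", result.pvalue < 0.05)
```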
## 10. What is Data Science?

Data Science (DS) is an interdisciplinary field that combines:

* Mathematics
* Statistics
* Computer Science

**Purpose of Data Science:**

* Extracting knowledge and insights from structured and unstructured data.
* Using scientific methods, algorithms, and processes to analyze data.

**Example Use Cases:**

* Data Interpretation
* Graph Visualization
* Automated Data Collection

## 11. Components of Data Science

A. **Algorithms**
   * Modern data science replaces traditional statistical models with machine learning algorithms.

B. **Processes**
   * Data Collection
   * Big Data Processing

C. **Systems**
   * Big Data Storage and Data Management

## 12. Scope of Data Science

A. **Data Analysis and Visualization**
   * Data visualization helps interpret and communicate results effectively.

B. **Predictive Modeling**
   * Uses past data to predict future outcomes.
   * Example: Trend forecasting.

C. **Natural Language Processing (NLP)**
   * Enables computers to understand human language.
   * Applications: Text Analysis, Machine Translation, Speech Recognition, Summarization & Recommendations.

D. **Big Data Processing**
   * Focuses on storing, transforming, and analyzing large datasets.

E. **Automation & Decision Support**
   * AI-powered systems provide real-time predictions for decision-making.
   * Example: Fraud detection in banking using AI.

## 13. Applications of Data Science

Data Science is applied in various industries:

* Business Analytics & Decision Making
* Healthcare & Medical Research
* Financial Modeling
* Social Media Analysis
* Scientific Research
* Artificial Intelligence & Machine Learning

**Example:**

* Profit Prediction: Estimating next year's profit based on historical data.
* Involves: Statistical modeling, predictive analytics, and cost optimization through automation.

## 14. Key Components of Data Science

1. Data (Structured & Unstructured)
2. Tools & Technologies
3. Statistical Methods (Machine Learning & AI)
4. Domain Expertise
5. Communication & Visualization

**Tools & Technologies**

* Data Analysis Software: MINITAB, SAS, Excel, R, Python
* Big Data Tools: Jupyter Notebook, Power BI, Tableau
* Platforms: Hadoop, Spark, AWS, Google Cloud, Microsoft Azure

## 15. Characteristics of Big Data

1. Volume $\rightarrow$ Large-scale data.
2. Variety $\rightarrow$ Different data formats.
3. Velocity $\rightarrow$ Real-time data processing.
4. Veracity $\rightarrow$ Data accuracy and quality.

**Challenges of Big Data:**

* Noise, bias, and incomplete data affect decision-making.
