Summary

This document provides an overview of data science fundamentals. It covers the study process, data collection methods, data cleaning and wrangling, descriptive and inferential statistics, and supervised learning, along with the scope, components, and applications of data science. The content is aimed at anyone looking to understand the core principles of the field, including algorithms and model building.

Full Transcript

# Fundamentals of Data Science

## 1. Study Process

The process of conducting a data science study follows four key steps:

1. **Problem Definition** $\rightarrow$ Define the objectives of the study.
2. **Data Collection** $\rightarrow$ Gather and process relevant data.
3. **Analysis** $\rightarrow$ Extract useful information and identify patterns.
4. **Conclusion** $\rightarrow$ Make decisions and provide recommendations.

**Example Study: Proportion of Smokers in Sri Lanka**

* **Problem**: Determine the proportion of smokers in Sri Lanka.
* **Population**: The entire population of Sri Lanka.
* **Sample**: Instead of analyzing the entire population (which is costly), a smaller representative sample (e.g., 1000 people) is used for estimation.
* **Sampling Methods**: Different sampling techniques can improve the accuracy of the estimate.

## 2. Types of Statistics

Statistics is divided into two main categories:

1. **Descriptive Statistics**
   * Summarizes and describes data from a given sample.
   * Includes measures such as the mean, median, mode, and standard deviation.
2. **Inferential Statistics**
   * Uses sample data to draw inferences about a larger population.
   * Includes hypothesis testing, confidence intervals, and regression analysis.

## 3. Data Collection Methods

A. **Questionnaires (Surveys)**
   * Can be automated using digital tools.
   * Require technical devices and digital literacy.
   * May not always represent the entire population.

B. **Direct Observation**
   * Uses sensors, cameras, and scanners for data collection.
   * Efficient, but requires filtering out irrelevant (noisy) data.

C. **Interviews**
   * Can be conducted in person, over the phone, or through voice/video recordings.
   * Require transcription and text mining to analyze spoken information.

## 4. Structured vs. Unstructured Data

A. **Structured Data (Conventional Method)**
   * Organized in tables with rows (observations) and columns (variables).
   * Example: Data in a spreadsheet.

B. **Unstructured Data**
   * Includes speech, videos, images, and text.
   * Requires techniques like text mining and topic modeling to extract insights.
   * Example use cases: Analyzing speech from news channels to detect trending topics; extracting key insights from social media posts.

## 5. Data Cleaning

Data cleaning is the process of removing errors and inconsistencies from data.

**Steps in Data Cleaning:**

1. **Data Collection** $\rightarrow$ Raw data is gathered.
2. **Data Cleaning** $\rightarrow$ Errors, noise, and irrelevant data are removed.
3. **Data Analysis** $\rightarrow$ The refined dataset is analyzed for insights.

**Methods:**

* **Conventional Approach**: Collect only the necessary data to minimize cleaning effort.
* **Automated Approach**: Collect all available data, then perform extensive cleaning.

## 6. Data Wrangling vs. Data Cleaning

* **Data Cleaning**: Focuses on removing errors and inconsistencies from raw data.
* **Data Wrangling**: Involves restructuring and transforming data into a format suitable for analysis.
* A short pandas sketch after Section 7 illustrates the difference.

## 7. Handling Missing Values

Missing values occur for various reasons, such as respondents skipping sensitive questions (e.g., age, salary).

**Solutions for Handling Missing Values:**

1. **Deleting Entire Records** $\rightarrow$ Appropriate only when very few records have missing values.
2. **Replacing Missing Values** $\rightarrow$ Using estimation techniques such as:
   * **Mean Imputation**: Replacing missing values with the mean of the available data.
   * **K-Nearest Neighbors (K-NN) Imputation**: Filling missing values using the closest observations.
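To make the two replacement strategies from Section 7 concrete, here is a minimal sketch using pandas and scikit-learn. The toy DataFrame, the `age` and `salary` columns, and the choice of `n_neighbors=2` are illustrative assumptions rather than anything specified in the notes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy survey data; np.nan marks questions a respondent skipped.
df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 38],
    "salary": [40_000, np.nan, 52_000, 61_000, np.nan],
})

# Mean imputation: replace each missing value with its column mean.
mean_imputed = df.fillna(df.mean(numeric_only=True))

# K-NN imputation: fill each missing value from the k closest rows
# (here k = 2), based on the columns that are observed.
knn_imputed = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df),
    columns=df.columns,
)

print(mean_imputed)
print(knn_imputed)
```

In practice, columns are usually standardized before K-NN imputation so that a large-scale variable such as salary does not dominate the distance calculation.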
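As referenced in Section 6, the split between data cleaning and data wrangling can also be sketched in a few lines of pandas. The made-up survey rows, column names, and district-level summary below are assumptions chosen only to illustrate the distinction.

```python
import pandas as pd

# Made-up survey rows with typical problems: an exact duplicate row,
# inconsistent text values, and a numeric column stored as text.
raw = pd.DataFrame({
    "district": ["Colombo", "Colombo", "kandy ", "Kandy"],
    "smoker":   ["Yes", "Yes", "no", "No"],
    "age":      ["34", "34", "41", "29"],
})

# --- Data cleaning: remove errors and inconsistencies ---
clean = raw.drop_duplicates().copy()                      # drop the repeated row
clean["district"] = clean["district"].str.strip().str.title()
clean["smoker"] = clean["smoker"].str.title()
clean["age"] = pd.to_numeric(clean["age"], errors="coerce")

# --- Data wrangling: reshape into an analysis-ready structure ---
# One row per district with its smoking rate and average age.
wrangled = (
    clean.assign(is_smoker=lambda d: d["smoker"].eq("Yes"))
         .groupby("district", as_index=False)
         .agg(smoking_rate=("is_smoker", "mean"), mean_age=("age", "mean"))
)

print(clean)
print(wrangled)
```

Everything before the `groupby` fixes errors in individual values (cleaning); the `groupby` step changes the shape of the data for analysis (wrangling).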
## 8. Data Analysis

Data analysis is classified into:

A. **Descriptive Analysis**
   * Focuses on summarizing and visualizing data.
   * Includes tables, graphs, and summary statistics.

B. **Inferential Analysis**
   * Uses sample data to make generalizations about a population.
   * Includes:
     * Estimation
     * Predictive Analysis
     * Hypothesis Testing

**Hypothesis Testing:**

* A hypothesis is a statement about a population parameter.
* Hypothesis testing uses sample data to decide whether the hypothesis should be rejected (a small worked sketch appears after Section 9).

## 9. Statistical Learning

Statistical learning involves extracting patterns and insights from data. It is classified into:

1. **Supervised Learning**
2. **Unsupervised Learning**

**Supervised Learning**

* Involves training a model using labeled data.
* Example: Predicting whether a customer will continue using a network provider.

**Supervised Learning Model**

We define a function that relates the input $X$ to the output $y$: $y = f(X, \text{parameters}) + \epsilon$, where $\epsilon$ is a random error term.

**Goals of Supervised Learning:**

* Understand relationships between inputs and outputs.
* Predict future outcomes.

**Applications:**

* Email Spam Detection
* Medical Diagnosis
* Stock Price Prediction
* Customer Churn Prediction
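To make the supervised-learning formulation $y = f(X, \text{parameters}) + \epsilon$ from Section 9 concrete, here is a minimal scikit-learn sketch for the customer-churn example. The synthetic data, the two features, and the choice of logistic regression are assumptions for illustration; the notes do not prescribe a particular algorithm.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic customer records: X holds the inputs, y the label
# (1 = stays with the network provider, 0 = churns).
n = 500
usage = rng.normal(50, 15, n)        # monthly usage in GB
complaints = rng.poisson(1.0, n)     # support complaints in the last year
X = np.column_stack([usage, complaints])

# An assumed true relationship f(X, parameters) plus random error epsilon,
# thresholded to produce a 0/1 label for the simulation.
signal = 0.08 * usage - 1.2 * complaints + rng.normal(0.0, 1.0, n)
y = (signal > 2.0).astype(int)

# Estimate f(X, parameters) from a training split, then predict new customers.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

print("estimated parameters:", model.coef_, model.intercept_)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

The fitted coefficients play the role of the estimated parameters, and the held-out test accuracy indicates how well the model predicts future outcomes.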
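Returning to the hypothesis-testing step in Section 8, here is a minimal sketch that ties it back to the smokers example from Section 1, using SciPy's exact binomial test. The observed count of 230 smokers in a sample of 1000 and the 25% benchmark are hypothetical numbers, not figures from the notes.

```python
from scipy.stats import binomtest

# Hypothetical survey outcome: 230 smokers observed in a sample of 1000 people.
smokers, sample_size = 230, 1000

# H0: the true proportion of smokers is 0.25.  H1: it differs from 0.25.
result = binomtest(smokers, sample_size, p=0.25, alternative="two-sided")

print("sample proportion:", smokers / sample_size)
print("p-value:", result.pvalue)
# Reject H0 at the 5% significance level only if the p-value falls below 0.05.
print("reject H0 at the 5% level:", result.pvalue < 0.05)
```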
## 10. What is Data Science?

Data Science (DS) is an interdisciplinary field that combines:

* Mathematics
* Statistics
* Computer Science

**Purpose of Data Science:**

* Extracting knowledge and insights from structured and unstructured data.
* Using scientific methods, algorithms, and processes to analyze data.

**Example Use Cases:**

* Data Interpretation
* Graph Visualization
* Automated Data Collection

## 11. Components of Data Science

A. **Algorithms**
   * Modern data science replaces traditional statistical models with machine learning algorithms.

B. **Processes**
   * Data Collection
   * Big Data Processing

C. **Systems**
   * Big Data Storage and Data Management

## 12. Scope of Data Science

A. **Data Analysis and Visualization**
   * Data visualization helps interpret and communicate results effectively.

B. **Predictive Modeling**
   * Uses past data to predict future outcomes.
   * Example: Trend forecasting.

C. **Natural Language Processing (NLP)**
   * Enables computers to understand human language.
   * Applications: Text Analysis, Machine Translation, Speech Recognition, Summarization & Recommendations.

D. **Big Data Processing**
   * Focuses on storing, transforming, and analyzing large datasets.

E. **Automation & Decision Support**
   * AI-powered systems provide real-time predictions for decision-making.
   * Example: Fraud detection in banking using AI.

## 13. Applications of Data Science

Data Science is applied in various industries:

* Business Analytics & Decision Making
* Healthcare & Medical Research
* Financial Modeling
* Social Media Analysis
* Scientific Research
* Artificial Intelligence & Machine Learning

**Example:**

* Profit Prediction: Estimating next year's profit based on historical data.
* Involves: Statistical modeling, predictive analytics, and cost optimization through automation.

## 14. Key Components of Data Science

1. Data (Structured & Unstructured)
2. Tools & Technologies
3. Statistical Methods (Machine Learning & AI)
4. Domain Expertise
5. Communication & Visualization

**Tools & Technologies**

* Data Analysis Software: MINITAB, SAS, Excel, R, Python
* Big Data Tools: Jupyter Notebook, Power BI, Tableau
* Platforms: Hadoop, Spark, AWS, Google Cloud, Microsoft Azure

## 15. Characteristics of Big Data

1. Volume $\rightarrow$ Large-scale data.
2. Variety $\rightarrow$ Different data formats.
3. Velocity $\rightarrow$ Real-time data processing.
4. Veracity $\rightarrow$ Data accuracy and quality.

**Challenges of Big Data:**

* Noise, bias, and incomplete data affect decision-making.
