Introduction to Data Science.pdf

Data Science Fundamentals
Course Code: MDC 103

Data: Definition, Classification of Data, Little data vs Big data, Issues with Big data Analysis
Data Science: Definition, Prerequisite for Data Science, Application of Data Science, Data Science Cycle

Data: Definition and Classification of Data

Data refers to raw, unprocessed facts, figures, or any other information collected for analysis, decision-making, and generating insights. In the context of data science, data is the foundation used to derive patterns, make predictions, and inform decisions.

Classification of Data

Data can be classified based on different characteristics such as structure, source, or the type of analysis performed. The two main ways to classify data are by its “structure” and its “type”.

1. Classification by Structure
   - Structured data
   - Unstructured data
2. Classification by Type
   - Qualitative data
   - Quantitative data
3. Other Types of Data Classification
   - Time Series Data
   - Spatial Data
   - Text Data
   - Image Data

[Figure: classification tree. Data branches into Structure (structured, unstructured), Type (qualitative, quantitative), and Others (text, image, time series, spatial).]

I. Classification by Structure

1. Structured Data
Structured data refers to data that is organized into a predefined format, typically rows and columns in a database or spreadsheet. It is highly organized, making it easy to store, query, and analyze.
Examples:
   - Data in relational databases (SQL)
   - Spreadsheets (Excel)
   - Tables containing financial records, product inventories, and transaction logs
Characteristics:
   - Organized into tabular format
   - Easier to manage and query using SQL or similar tools
   - Can have relationships defined between different datasets (e.g., foreign keys in databases)

2. Unstructured Data
Unstructured data lacks a predefined format or structure. It includes a vast range of data types such as text, images, videos, audio, and more. Analyzing unstructured data requires specialized tools and techniques.
Examples:
   - Emails
   - Social media posts
   - Images and videos
   - Sensor data, log files
Characteristics:
   - Not organized in a tabular format
   - Harder to process and analyze using traditional tools
   - Requires techniques like Natural Language Processing (NLP) for text data or computer vision for image data

3. Semi-Structured Data
Semi-structured data lies between structured and unstructured data. It doesn’t conform to a strict tabular format but may contain some organizational properties, like tags or metadata, making it easier to process.
Examples:
   - JSON, XML files
   - HTML pages
   - Emails with metadata (subject, sender, receiver)
Characteristics:
   - Contains tags, markers, or other forms of organization
   - Often used for data interchange between systems (e.g., web data, API responses)
   - Requires specialized parsing and processing

II. Classification by Type

1. Qualitative (Categorical) Data
Qualitative data describes qualities, characteristics, or categories rather than numerical values. It can be further classified into two types: Nominal and Ordinal.

a. Nominal Data
Nominal data represents categories without any inherent order. It simply labels or names categories.
Examples:
   - Gender (Male, Female, Nonbinary)
   - Colors (Red, Blue, Green)
   - Yes/No answers
Characteristics:
   - No order or ranking among categories
   - Often analyzed using mode or frequency counts

b. Ordinal Data
Ordinal data represents categories with a meaningful order or ranking, but the intervals between categories are not necessarily equal.
Examples:
   - Customer satisfaction ratings (Satisfied, Neutral, Unsatisfied)
   - Educational levels (High school, Bachelor's, Master's, PhD)
   - Rankings (1st, 2nd, 3rd)
Characteristics:
   - Has a clear ordering
   - Differences between ranks are not always consistent or measurable

III. Other Types of Data Classification

1. Time Series Data
Time series data consists of data points collected or recorded at specific time intervals. The order of the data is important since the data is dependent on time.
Examples:
   - Stock prices over time
   - Weather data (e.g., daily temperature)
   - Heart rate monitoring data
Characteristics:
   - Sequentially ordered in time
   - Analyzed using techniques like time series forecasting, ARIMA models, etc.

2. Spatial Data
Spatial data refers to data that represents objects, events, or phenomena with a spatial aspect. It includes geographical or locational information.
Examples:
   - Geographic coordinates (latitude, longitude)
   - Maps and satellite images
   - Real estate locations
Characteristics:
   - Contains location-specific attributes
   - Often used in Geographic Information Systems (GIS) for mapping and analysis

What is Big Data?

Big Data refers to extremely large and complex datasets that are difficult to process, manage, and analyze using traditional data processing tools. These datasets often exceed the capacity of typical database software, requiring specialized technologies, architectures, and algorithms to store, manage, and extract meaningful insights from them.

Properties of Big Data:
   - Volume
   - Velocity
   - Variety
   - Veracity or Accuracy
   - Value

Key Characteristics of Big Data (The V's)

1. Volume: Volume refers to the vast amounts of data generated every second, from various sources like social media, sensors, transactions, and more.
   - Social media platforms like Facebook generate terabytes of data daily from posts, likes, comments, and videos.
   - Sensors on self-driving cars generate petabytes of data from cameras, LIDAR, and radar.

2. Velocity: Velocity refers to the speed at which data is generated, collected, and processed. Many systems today require real-time or near-real-time processing of data.
   - High-frequency trading systems analyze financial market data in milliseconds to execute trades.
   - Streaming platforms (like Netflix) generate data in real time on user viewing behavior.

3. Variety: Variety refers to the different types of data that Big Data encompasses. It includes structured, semi-structured, and unstructured data from diverse sources.
   - Structured data like transactional records (e.g., purchases).
   - Unstructured data like social media posts, images, and videos.
   - Semi-structured data like XML, JSON, and emails.

4. Veracity: Veracity refers to the uncertainty and inconsistency in the data. Big Data often contains noise, missing values, or inaccuracies, and dealing with these challenges is essential for meaningful analysis.
   - Social media data may contain irrelevant or misleading information (e.g., fake news, rumors).
   - Sensor data may be prone to errors or anomalies due to malfunctions.

5. Value: Value refers to the potential insights and benefits that can be extracted from Big Data. The goal of Big Data analytics is to derive valuable and actionable insights from vast amounts of data.
   - Retailers use data to personalize shopping experiences and optimize marketing strategies.
   - Healthcare organizations use data to improve patient outcomes through predictive analytics and personalized treatments.

Issues with Big Data

While Big Data provides immense opportunities, it also poses significant challenges:
1. Storage: Traditional databases may not handle the sheer volume of Big Data.
2. Processing Power: Big Data requires distributed computing (e.g., Hadoop clusters) to process massive datasets efficiently.
3. Data Quality: Big Data often contains incomplete or noisy data that needs cleaning.
4. Privacy Concerns: Handling sensitive data (e.g., personal, financial data) can raise ethical and legal concerns.
5. Scalability: Infrastructure must scale as data grows exponentially.
6. Skills Gap: Big Data technologies and analysis methods require specialized knowledge and skills, leading to a shortage of qualified data scientists and engineers.

Data Science & Application

Definition of Data Science:
Data Science is a scientific discipline which uses scientific methods, algorithms, systems and processes to extract useful information from structured and unstructured data.
Data science is used in application fields such as healthcare systems, speech recognition, advanced image recognition, search engines, banking, and forecasting.

Real-World Applications of Data Science
   - Business or financial data science
   - Manufacturing or industrial data science
   - Medical or health data science
   - IoT data science: Internet of Things (IoT)
   - Cybersecurity data science
   - Behavioral data science
   - Mobile data science
   - Multimedia data science
   - Smart cities or urban data science
   - Smart villages or rural data science

Prerequisite for Data Science
   - Programming languages: R, Python, MATLAB (coding in these languages)
   - Statistics
   - Algorithm and data structure handling
   - Linear Algebra
   - Machine Learning
   - Domain knowledge: the domain is specific to the individual, i.e., the topic or subject in which the scientist is working and the objective of the analysis they want to perform. Having deep knowledge of the subject or domain helps make the analysis more precise and improves feature selection in the machine learning step.

Applications of Big Data & Data Science

1. Healthcare
   - Predictive Analytics: Big Data helps in predicting patient outcomes, optimizing hospital resources, and personalizing treatments.
   - Genomics: Analyzing large-scale genetic data to identify disease markers and potential treatments.

2. Finance
   - Fraud Detection: Analyzing transaction data in real time to detect fraudulent activities.
   - Risk Management: Assessing risks in loan approvals and financial investments using large datasets.

3. Retail
   - Personalized Marketing: Leveraging customer data to offer personalized recommendations and targeted promotions.
   - Inventory Management: Using Big Data to forecast demand and optimize stock levels.

4. Transportation and Logistics
   - Route Optimization: Analyzing traffic data and delivery routes in real time to reduce costs and improve efficiency.
   - Predictive Maintenance: Using sensor data to predict when vehicles or equipment need maintenance.

5. Telecommunications
   - Network Optimization: Analyzing usage patterns and network traffic to optimize bandwidth allocation and improve service quality.

6. Government and Public Policy
   - Smart Cities: Using Big Data to manage resources (e.g., water, energy) more efficiently, reduce traffic congestion, and improve public safety.
   - Crime Prevention: Analyzing crime patterns and predicting areas of high risk for law enforcement deployment.

Data Science Cycle

The Data Science Life Cycle is a series of steps taken to solve data-driven problems.
1. Problem Definition: Clearly define the problem you're solving (e.g., "How to reduce customer churn?").
2. Data Collection: Gather relevant data from various sources like databases, APIs, or external datasets.
3. Data Cleaning and Preprocessing: Remove missing or corrupted data, normalize values, and standardize formats.
4. Data Exploration (EDA): Explore data using visualizations to identify patterns, trends, or outliers.
5. Data Modeling: Apply machine learning algorithms to predict outcomes or classify data (e.g., logistic regression for predicting churn).
6. Model Evaluation: Measure model performance using metrics like accuracy or precision and fine-tune it.
7. Model Deployment: Integrate the model into a production system (e.g., a website or app) for real-time predictions.
8. Communication and Visualization: Present the findings to stakeholders through dashboards, reports, or presentations, ensuring actionable insights are clear.
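
Steps 3 to 6 of the cycle above can be illustrated with a small, self-contained Python sketch. Everything in it is an assumption made for illustration: the churn records, the field names (usage_hours, churned), and the helper functions are hypothetical, and a one-feature nearest-centroid rule stands in for the logistic regression mentioned in step 5 so that the example needs no external libraries.

```python
# Illustrative sketch only: the records, field names, and the
# nearest-centroid model are assumptions, not part of the course notes.

def clean(records):
    """Step 3a: drop records with a missing feature value."""
    return [r for r in records if r["usage_hours"] is not None]

def min_max(records):
    """Return (lowest, highest) usage_hours, used for normalization."""
    values = [r["usage_hours"] for r in records]
    return min(values), max(values)

def normalize(records, lo, hi):
    """Step 3b: rescale usage_hours so the training range maps to [0, 1]."""
    return [
        {"usage": (r["usage_hours"] - lo) / (hi - lo), "churned": r["churned"]}
        for r in records
    ]

def class_means(records):
    """Steps 4 and 5: mean usage per class; these means are the 'model'."""
    means = {}
    for label in (0, 1):
        values = [r["usage"] for r in records if r["churned"] == label]
        means[label] = sum(values) / len(values)
    return means

def predict(means, usage):
    """Assign the class whose mean usage is closest."""
    return min(means, key=lambda label: abs(usage - means[label]))

def evaluate(y_true, y_pred):
    """Step 6: accuracy and precision for the positive class (churned = 1)."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(1 for t, p in pairs if t == p)
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    pred_pos = sum(1 for _, p in pairs if p == 1)
    accuracy = correct / len(pairs)
    precision = true_pos / pred_pos if pred_pos else 0.0
    return accuracy, precision

# Hypothetical raw data (step 2 would normally pull this from a database or API).
raw_train = [
    {"usage_hours": 2.0, "churned": 1},
    {"usage_hours": 3.0, "churned": 1},
    {"usage_hours": None, "churned": 0},  # missing value, removed by clean()
    {"usage_hours": 10.0, "churned": 0},
    {"usage_hours": 12.0, "churned": 0},
    {"usage_hours": 4.0, "churned": 0},
    {"usage_hours": 1.0, "churned": 1},
    {"usage_hours": 9.0, "churned": 1},
]
raw_test = [
    {"usage_hours": 2.5, "churned": 1},
    {"usage_hours": 11.0, "churned": 0},
    {"usage_hours": 1.5, "churned": 1},
    {"usage_hours": 5.0, "churned": 0},
]

train = clean(raw_train)
lo, hi = min_max(train)                  # scaling is fitted on training data only
means = class_means(normalize(train, lo, hi))

test = normalize(clean(raw_test), lo, hi)
y_true = [r["churned"] for r in test]
y_pred = [predict(means, r["usage"]) for r in test]
accuracy, precision = evaluate(y_true, y_pred)
print(f"accuracy={accuracy:.2f} precision={precision:.2f}")
```

The scaling bounds and class means are computed from the training records only and then reused on the held-out records, which keeps the evaluation in step 6 separate from the data the model was fitted on. In practice a library such as scikit-learn would usually replace the hand-written model and metrics.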
