Questions and Answers
Which of the following is NOT a primary category of data in data science?
- Structured Data
- Data Streams
- Unstructured Data
- Abstract Data (correct)
Which of the following is the MOST direct goal of data science?
- Building faster computer hardware.
- Extracting knowledge and insights from data. (correct)
- Creating complex mathematical formulas.
- Developing new programming languages.
Which of the following data types represents characteristics or attributes divided into distinct groups or categories?
- Categorical Data (correct)
- Quantitative Data
- Discrete Data
- Continuous Data
Which of the following is an example of ordinal data?
What distinguishes ratio scale data from interval scale data?
In the context of data measurement scales, what characteristic is present in ordinal, interval, and ratio scales, but NOT in nominal scales?
Which sampling method ensures every member of the population has an equal chance of being selected?
Which type of analysis is used to answer 'Why did this happen?' by identifying relationships and patterns?
Which of the following analysis seeks to provide recommendations for actions to achieve desired outcomes?
What is a primary goal of Exploratory Data Analysis (EDA)?
What is a key challenge associated with Predictive Data Analysis (PDA)?
Which of the following statements is a misconception about data science?
In the data science life cycle, which stage involves gathering relevant data from various sources, ensuring it aligns with the problem being addressed?
What does a probability of 1 indicate?
Two dice are rolled. Event A is that the first die shows a 2. Event C is that the die shows an even number. What is P(A ∩ C)?
What is the purpose of data preprocessing in data analysis and machine learning?
What is a common technique for handling missing values in a dataset?
What is the primary goal of data curation?
Which KDD step involves applying algorithms to extract patterns from prepared data?
The 68-95-99.7 rule applies to which distribution?
Flashcards
What is Data Science?
A multi-disciplinary field that extracts knowledge from various data formats, using computer science, mathematics, and statistics.
Sources of Data
Databases, web logs, social media, and sensors.
Data Processing
Cleaning, integrating, and transforming data to prepare it for analysis, including handling missing values and converting formats.
Data Analysis
Data Visualization
Real world Applications of Data Science
Structured Data
Semi-Structured Data
Unstructured Data
Data Streams
Categorical Data
Quantitative Data
Nominal Scale
Ordinal Scale
Interval Scale
Ratio Scale
Descriptive Analysis.
Diagnostic Analysis.
Predictive Analysis.
Prescriptive Analysis.
Study Notes
- Data Science is a multidisciplinary field extracting knowledge and insights from data by integrating computer science, mathematics, statistics, and domain expertise
- The goal of data science is data-driven decisions through pattern identification, prediction, and problem-solving
Key Aspects of Data Science
- Data Collection involves obtaining data from various sources in structured, semi-structured, or unstructured formats
- Data Processing includes cleaning, integrating, and transforming data
- Analysis and Interpretation uses statistical and computational tools for trend identification
- Visualization and Presentation involves communicating findings through charts, graphs, and dashboards using tools such as Tableau, Matplotlib, and Power BI
- Data Science Applications include predicting customer behaviour and supply chain optimization in Business
- Data Science Applications include disease prediction and health monitoring in Healthcare
- Data Science Applications involve fraud detection and risk analysis in Finance
- Data Science Applications include route optimization and logistics planning in Transport
- Data science uses machine learning, AI and big data tools
Data Types
- Structured Data is highly organized for easy searching and analysis, residing in fixed fields within records or files, used in CRM and OLTP systems
- Semi-Structured Data contains tags or markers to separate semantic elements and enforce hierarchies, serving as a middle ground between structured and unstructured
- Unstructured Data lacks a predefined format and requires specialized techniques such as NLP, computer vision, and machine learning for analysis
- Data Streams are continuous flows of real-time data that require immediate processing
Statistical Data Types
- Data is split into categorical and quantitative types so that appropriate analytical methods can be selected and results interpreted correctly
- Nominal Data is categorical data whose qualitative attributes fall into distinct, unordered groups, such as eye color or cuisine type
- Ordinal Data is categorical data whose categories have a meaningful order but inconsistent intervals, such as education levels or satisfaction ratings
- Quantitative Data consists of numerical values for counts or measurements
- Discrete Data takes countable integer values, such as the number of students in a class or cars in a lot
- Continuous Data can take any value within a range, for example height, weight, or temperature
Measurement of Data
- Measurement assigns values to variables according to set rules, which determines the statistical analyses that are appropriate
- Nominal Scale categorizes unordered data, such as assigning numbers to different fruits without implying any ranking
- Ordinal Scale categorizes data with a meaningful order but inconsistent intervals, like ranking runners in a race
- Interval Scale uses numerical data with consistent intervals but lacks a true zero, such as temperature in Celsius
- Ratio Scale is similar to the interval scale but also includes a true zero point, such as weight or height
Measurement Scales Characteristics
- Identity means each value has a unique meaning; for example, gender can be recorded as "F" for female and "M" for male
- Magnitude indicates values can be ranked or ordered, for example in a range from "low income" to "high income"
- Equality in Intervals means the difference between any two consecutive values is consistent
- Minimum or Zero Value means the scale has a true zero, where 0 indicates the absence of the quantity
Four Measurement Scales
- Nominal Scale has only identity; gender is an example
- Ordinal Scale has identity and magnitude, such as income categories from low to high
- Interval Scale has identity, magnitude, and equal intervals, such as temperature in Celsius
- Ratio Scale has identity, magnitude, equal intervals, and a minimum zero value, such as temperature in Kelvin
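The four scales and their characteristics can be tabulated in a few lines of Python. This is only an illustrative sketch: the `SCALES` dictionary and the `ratio_is_meaningful` helper are hypothetical names, not part of the notes. It also shows why "twice as hot" is meaningful in Kelvin (ratio scale) but not in Celsius (interval scale).

```python
# Characteristics each measurement scale has, as listed in the notes above.
SCALES = {
    "nominal":  {"identity"},
    "ordinal":  {"identity", "magnitude"},
    "interval": {"identity", "magnitude", "equal_intervals"},
    "ratio":    {"identity", "magnitude", "equal_intervals", "true_zero"},
}


def ratio_is_meaningful(scale: str) -> bool:
    """Statements like 'twice as much' need a true zero point."""
    return "true_zero" in SCALES[scale]


def celsius_to_kelvin(celsius: float) -> float:
    return celsius + 273.15


# 20 °C looks like "twice" 10 °C, but Celsius has no true zero.
naive_ratio = 20 / 10
# On the Kelvin scale the same two temperatures differ by only about 3.5 %.
kelvin_ratio = celsius_to_kelvin(20) / celsius_to_kelvin(10)

print(ratio_is_meaningful("interval"), ratio_is_meaningful("ratio"))
print(naive_ratio, round(kelvin_ratio, 3))
```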
Methods of Sampling
- Sampling involves selecting a subset from a population
- This makes inferences about the entire population practical and cost-effective
- Probability Sampling ensures every member of the population has a known, non-zero chance of being selected
- Probability Sampling includes Simple Random Sampling, where every member has an equal chance of selection
- Probability Sampling includes Systematic Sampling, where every nth member is chosen
- Probability Sampling includes Stratified Sampling, which divides the population into subgroups and randomly samples from each
- Probability Sampling includes Cluster Sampling, which divides the population into clusters, randomly selects some clusters, and then samples every member within them
- In Non-Probability Sampling, members of the population do not have an equal or known chance of being included
- Non-Probability Sampling can introduce bias into results
- Non-Probability Sampling includes Convenience Sampling, where the easiest-to-reach individuals are selected
- Non-Probability Sampling includes Quota Sampling, where the sample must meet certain quotas, such as 50% male and 50% female
- Non-Probability Sampling includes Purposive (Judgmental) Sampling, where individuals are selected deliberately, for example based on their expertise
- Non-Probability Sampling includes Snowball Sampling, where existing participants recruit further participants from among their acquaintances
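The four probability sampling methods listed above can be sketched with Python's standard `random` module. The population of 100 numbered IDs, the two strata, the ten clusters, and all function names are assumptions made for illustration only.

```python
import random


def simple_random_sample(population, k, rng):
    """Every member has an equal chance of being selected."""
    return rng.sample(population, k)


def systematic_sample(population, step, rng):
    """Pick a random starting point, then take every `step`-th member."""
    start = rng.randrange(step)
    return population[start::step]


def stratified_sample(strata, k_per_stratum, rng):
    """Randomly sample the same number of members from each subgroup."""
    return {name: rng.sample(members, k_per_stratum)
            for name, members in strata.items()}


def cluster_sample(clusters, n_clusters, rng):
    """Randomly choose whole clusters and keep every member inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]


rng = random.Random(42)            # fixed seed so the sketch is repeatable
population = list(range(1, 101))   # hypothetical population of 100 IDs
strata = {"A": population[:50], "B": population[50:]}
clusters = {f"c{i}": population[i * 10:(i + 1) * 10] for i in range(10)}

srs = simple_random_sample(population, 10, rng)
sys_sample = systematic_sample(population, 10, rng)
strat = stratified_sample(strata, 5, rng)
clus = cluster_sample(clusters, 2, rng)
```

Non-probability methods such as convenience or snowball sampling are not sketched here because they depend on how participants are reached rather than on a random draw.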
The Importance of Sampling
- It allows researchers to draw conclusions about a population without examining every individual
- The chosen sampling method must align with the research objectives to ensure validity and reliability
- Probability sampling methods are generally preferred for their ability to produce representative samples
Data Analysis Methods
- Data Analysis encompasses methods to extract insights from data
- These methods can be broadly categorized into four primary types: Descriptive, Diagnostic, Predictive, and Prescriptive
Descriptive Analysis
- Descriptive Analysis focuses on summarizing and understanding main dataset characteristics
- Descriptive Analysis summarizes data with measures such as the mean, median, mode, and standard deviation
- Descriptive Analysis also uses visualizations such as charts and graphs
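The summary measures named above are available in Python's standard `statistics` module. The exam scores below are made-up values used only to illustrate the calls.

```python
import statistics

scores = [72, 85, 85, 90, 68, 77, 85, 94]  # hypothetical exam scores

summary = {
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "mode": statistics.mode(scores),
    "stdev": statistics.stdev(scores),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```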
Diagnostic Analysis
- Diagnostic Analysis seeks to understand the reasons behind past outcomes
- Diagnostic analysis seeks to answer "why did this happen?"
- Drill-Down Analysis is used
- Data mining is used
- Correlation Analysis is employed
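Correlation analysis can be illustrated with a hand-written Pearson correlation coefficient. The `pearson_r` function and the advertising and sales figures are hypothetical; a strong correlation suggests a relationship worth investigating but does not by itself establish the cause.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


ad_spend = [10, 20, 30, 40, 50]   # hypothetical advertising spend
sales = [25, 41, 62, 79, 103]     # hypothetical sales figures

r = pearson_r(ad_spend, sales)
print(f"r = {r:.3f}")  # values near +1 or -1 indicate a strong linear relationship
```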
Predictive Analysis
- Predictive Analysis uses historical data to forecast future events
- Predictive Analysis answers "what is likely to happen?"
- Regression Analysis is employed
- Time Series Analysis is used
- Classification and clustering are employed
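As one example of regression analysis, a straight line can be fitted to historical data by ordinary least squares and then extended to forecast the next period. The `fit_line` helper and the monthly sales figures are assumptions for illustration; real data would rarely be this perfectly linear.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


months = [1, 2, 3, 4, 5]
sales = [110, 120, 130, 140, 150]  # hypothetical, perfectly linear history

slope, intercept = fit_line(months, sales)
forecast_month_6 = slope * 6 + intercept  # "what is likely to happen?"
print(slope, intercept, forecast_month_6)
```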
Prescriptive Analysis
- Prescriptive Analysis provides recommendations for achieving outcomes
- Prescriptive analysis answers "what should we do?"
- Optimization techniques are used
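A very small optimization sketch: given a model of how an outcome depends on a decision, search the candidate decisions for the best one. The linear demand model, the unit cost, and the price grid are all invented for illustration and are not from the notes.

```python
def expected_profit(price, unit_cost=4.0):
    """Profit under a hypothetical linear demand model."""
    demand = max(0.0, 100 - 10 * price)  # higher price, fewer units sold
    return (price - unit_cost) * demand


# Candidate prices from 4.0 to 12.5 in steps of 0.5.
candidate_prices = [p / 2 for p in range(8, 26)]

# "What should we do?": recommend the price with the highest expected profit.
best_price = max(candidate_prices, key=expected_profit)
print(best_price, expected_profit(best_price))
```

A grid search is used here only because it is easy to read; larger problems would typically call for a dedicated optimization method.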
Exploratory Data Analysis (EDA)
- EDA involves examining datasets to summarize their main characteristics
- EDA uncovers patterns, anomalies, relationships without assumptions
- EDA has four primary objectives: identifying patterns, spotting anomalies, testing hypotheses, and checking assumptions
- Common EDA techniques include data visualization, descriptive statistics, and data transformation
Inferential Analysis
- Inferential Analysis draws conclusions about a population from sample data
- Inference generalizes beyond the immediate data available
- Key components of Inferential Analysis: estimating parameters, hypothesis testing, and constructing confidence intervals
- Common inferential methods: t-tests, ANOVA, and regression analysis
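The sketch below estimates a population mean from a sample, computes a one-sample t statistic against a hypothesised mean, and builds an approximate 95% interval. The measurements and the hypothesised mean are made up. As a simplifying assumption, the interval uses the normal quantile from `statistics.NormalDist`; a t quantile with n - 1 degrees of freedom would give a slightly wider interval for a sample this small.

```python
import math
import statistics

sample = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.4, 5.1]  # hypothetical measurements
hypothesised_mean = 5.0

n = len(sample)
mean = statistics.mean(sample)                      # point estimate
std_err = statistics.stdev(sample) / math.sqrt(n)   # standard error of the mean

# One-sample t statistic: distance of the sample mean from the
# hypothesised mean, measured in standard errors.
t_stat = (mean - hypothesised_mean) / std_err

# Approximate 95 % interval (normal approximation, see the note above).
z = statistics.NormalDist().inv_cdf(0.975)
ci = (mean - z * std_err, mean + z * std_err)

print(f"mean={mean:.3f}, t={t_stat:.3f}, CI=({ci[0]:.3f}, {ci[1]:.3f})")
```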
EDA vs Inference
- EDA's purpose is to explore and visualize without formal modelling, whereas Inference sets out to make predictions
- EDA's approach is informal, whereas Inference is formal
- EDA aims to create understanding, whereas Inference relies on hypothesis testing
Predictive Data Analysis (PDA)
- PDA forecasts future outcomes by applying algorithms to data sets
- PDA applications include customer prediction, risk assessment, and optimization
- Key challenges in PDA include data quality, model complexity, and interpretability
Data Science Misconceptions
- Data Science is only about modeling - Data science also draws heavily on statistics
- Data roles are identical - Data engineers focus on building data infrastructure, while data analysts interpret data
- Solely for math experts - The field benefits from diverse skills and backgrounds
- Provides absolute answers - Results require context
- Quantity is better than quality - Data quality matters more than sheer volume
- Can be fully automated - Humans must be present to make decisions
Data Science Applications
- Medical image analysis and prediction
- Fraud detection
- Recommendation systems
- The Data Science Life Cycle helps to extract valuable data and insights
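Key Probability Definitions
- Probability is a numerical value between 0 and 1 reflecting the likelihood of an event
- For equally likely outcomes, the probability of an event is the number of favourable outcomes divided by the total number of outcomes in the sample space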
- Sample space represents all possible outcomes of the experiment
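For equally likely outcomes, the counting definition of probability can be checked directly by enumerating a sample space. The two-dice experiment and the `probability` helper are illustrative choices, not taken from the notes.

```python
from fractions import Fraction
from itertools import product

# Sample space for rolling two fair six-sided dice: 36 equally likely outcomes.
sample_space = list(product(range(1, 7), repeat=2))


def probability(event):
    """Favourable outcomes divided by total outcomes (equally likely case)."""
    favourable = [outcome for outcome in sample_space if event(outcome)]
    return Fraction(len(favourable), len(sample_space))


p_sum_seven = probability(lambda o: o[0] + o[1] == 7)
p_certain = probability(lambda o: True)    # an event that always occurs
p_impossible = probability(lambda o: sum(o) == 13)

print(p_sum_seven, p_certain, p_impossible)
```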
Key Definitions
- Independent Events are events where the outcome of one does not affect the other
- Disjoint Events are events that cannot occur together because they do not overlap
- Intersection of Events is the set of outcomes common to both events
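These three definitions can be checked with Python sets on a two-dice sample space. The events `A`, `B`, and `C` below are hypothetical examples chosen for illustration. Independence is tested with the standard criterion P(A ∩ B) = P(A) × P(B).

```python
from fractions import Fraction
from itertools import product

space = set(product(range(1, 7), repeat=2))  # two fair dice


def p(event):
    return Fraction(len(event), len(space))


A = {o for o in space if o[0] == 2}       # first die shows 2
B = {o for o in space if o[1] % 2 == 0}   # second die shows an even number
C = {o for o in space if o[0] == 5}       # first die shows 5

intersection = A & B                      # outcomes common to A and B
independent = p(A & B) == p(A) * p(B)     # A and B do not affect each other
disjoint = len(A & C) == 0                # A and C cannot both occur

print(sorted(intersection), independent, disjoint)
```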
Conditional Probability
- Conditional probability is the probability of an event given that another event has already occurred
- Formula: P(X|Y) = P(X∩Y)/P(Y)
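The formula can be applied by counting outcomes. In this sketch, X is "the two dice sum to 8" and Y is "the first die shows 3"; both events and the helper names are assumptions for illustration.

```python
from fractions import Fraction
from itertools import product

space = set(product(range(1, 7), repeat=2))  # two fair dice


def p(event):
    return Fraction(len(event), len(space))


def conditional(x, y):
    """P(X|Y) = P(X ∩ Y) / P(Y); undefined when P(Y) = 0."""
    if not y:
        raise ValueError("P(Y) must be greater than 0")
    return p(x & y) / p(y)


X = {o for o in space if sum(o) == 8}   # the dice sum to 8
Y = {o for o in space if o[0] == 3}     # the first die shows 3

p_x = p(X)
p_x_given_y = conditional(X, Y)
print(p_x, p_x_given_y)
```

Knowing that Y occurred changes the probability of X here, so these two events are not independent.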
Applying Bayes' Theorem:
- Bayes' theorem reverses conditional probabilities and follows the formula:
P(X|Y) = (P(Y|X) × P(X)) / P(Y)
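A common illustration of Bayes' theorem is a diagnostic test. Every number below (prevalence, true-positive rate, false-positive rate) is invented for the example, and P(Y) is obtained with the law of total probability.

```python
def bayes(p_y_given_x, p_x, p_y):
    """P(X|Y) = P(Y|X) * P(X) / P(Y)."""
    return p_y_given_x * p_x / p_y


# Hypothetical figures: X = "has the disease", Y = "test is positive".
p_disease = 0.01
p_pos_given_disease = 0.95
p_pos_given_healthy = 0.05

# P(Y) via the law of total probability.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

p_disease_given_pos = bayes(p_pos_given_disease, p_disease, p_pos)
print(f"P(disease | positive) = {p_disease_given_pos:.3f}")
```

With these assumed numbers the posterior stays well below the test's 95% true-positive rate, because the disease is rare to begin with.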
Random Variables Definition:
- A Random Variable maps the outcomes of an experiment to numerical values
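As a sketch of this definition, the function below maps each outcome of rolling two dice to a number (their sum), then tabulates the resulting distribution. The experiment and the names `total` and `distribution` are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))  # two fair dice


def total(outcome):
    """Random variable X: maps an outcome (d1, d2) to the number d1 + d2."""
    return outcome[0] + outcome[1]


counts = Counter(total(o) for o in outcomes)
distribution = {value: Fraction(n, len(outcomes))
                for value, n in sorted(counts.items())}

# Expected value: each possible value weighted by its probability.
expected_value = sum(value * prob for value, prob in distribution.items())
print(distribution[7], expected_value)
```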