Questions and Answers
What does the Data Science Methodology provide to a data scientist?
- Access to large datasets
- Tools for data visualization
- Training on the latest algorithms
- A framework for designing an AI Project (correct)
The Data Science Methodology is a linear process where each step is completed before moving to the next.
False (B)
Who put forward the Data Science Methodology?
- John Rollins (correct)
- Andrew Ng
- Geoffrey Hinton
- Yann LeCun
How many steps does the Data Science Methodology consist of?
Match the following modules of the Data Science Methodology with their corresponding stages:
What is the primary goal of the 'Business Understanding' stage in the Data Science Methodology?
The 5W1H Problem Canvas is a tool used to understand the problem of stakeholders.
Which framework is employed in the 'Problem to Approach' stage of the Data Science Methodology?
What type of questions are asked in the Analytic Approach phase?
Which type of data analytics answers the question: 'What happened?'
Diagnostic analytics focuses on predicting future outcomes based on historical data.
A company wants to forecast its sales for the next quarter. Which type of analytics would be most appropriate?
Providing the right course of action based on data analysis is called ______ Analytics.
Match the type of analytics with its focus:
What is the role of 5W1H in determining data requirements?
Structured data refers to data without a predefined structure, such as social media posts and images.
Which of the following is/are type(s) of data?
What are the two sources of data collection?
Which of the following is an example of a primary data source?
Data scientists should aim to exclude DBAs and programmers when extracting data from data sources.
In which stage should one assess the representativeness of the collected dataset?
Techniques such as ______ can be applied to the dataset, to assess the content, quality, and initial insights about the data.
What are the activities included in Data Preparation?
Data collection takes more time than Data Preparation.
What is raw data?
What is the primary focus during the 'AI modelling' stage?
In data modelling, the development aspect is usually iterative.
According to data modelling, identifying the best model for capstone projects is called ______.
______ is useful to understand what is happening within the data.
Summary statistics include which of the following?
In machine learning, what is a training set?
What is the role of the training set?
Evaluation in an AI project cycle is the process of underperforming of the model.
What measures the metrics like accuracy, precision, recall, or F1 score?
In Data Science methodology, the model evaluation can have ______ main phases.
Which data must scientists make the stakeholders familiar with?
What is the last stage in the methodology discussed?
The trained AI model is made available to the users in real-world applications in which stage?
Model validation involves which dataset?
The test dataset is used to fit the machine learning model.
The train-test split is a technique for evaluating the ______ of a machine learning algorithm.
Match the data:
Which type of problem does the train-test split address?
Flashcards
Data Science Methodology
A framework for designing an AI project for data scientists.
Data Science Methodology
A process with a prescribed sequence of iterative steps that data scientists follow to approach a problem and find a solution
Business Understanding
Breaking down a problem to understand customer needs and objectives.
Analytic Approach
Data Requirements
Data Collection
Data Understanding
Data preparation
Feature engineering
AI Modelling
Descriptive Modeling
Predictive modeling
Evaluation
Deployment
Feedback
Primary data source
Secondary data source
Structured data
Unstructured data
Semi-structured data
Descriptive Analytics
Diagnostic Analytics
Predictive Analytics
Prescriptive Analytics
Model Validation
Train Test Split
Cross Validation
Evaluation metrics
Accuracy
Confusion Matrix
Precision
Recall
MAE
MSE
RMSE
Study Notes
- The Unit focuses on Data Science Methodology as an analytic approach to a capstone project.
Objectives
- Understand major steps in addressing a data science problem.
- Define data science methodology and its importance.
- Demonstrate the steps in Data Science Methodology.
Learning Outcomes
- Integrate Data Science Methodology steps into a Capstone Project.
- Identify the best way to represent a problem's solution.
- Understand the importance of validating machine learning models.
- Use key evaluation metrics for various machine learning tasks.
Introduction to Data Science Methodology
- Methodology provides a framework for designing an AI Project.
- The framework helps the team decide on methods, processes, and strategies to obtain correct output from the AI Project.
- A methodology organizes the project systematically, which saves time and cost.
- Data Science Methodology prescribes steps for data scientists to approach a problem and find solutions.
- Data Science Methodology enables data handling and comprehension.
- John Rollins, an IBM Analytics Data Scientist, put forward a Data Science Methodology.
- The process involves 10 steps.
- This foundational methodology provides detailed guidance on how AI projects can be solved.
- There are five modules, each covering two stages and explaining the rationale for each stage.
- The five modules are:
- From Problem to Approach
- From Requirements to Collection
- From Understanding to Preparation
- From Modelling to Evaluation
- From Deployment to Feedback
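The module list above can be written down as a small lookup table. This is only an illustrative sketch: the pairing of modules to stages follows the section headings in these notes, and the names `MODULES` and `STAGES` are hypothetical.

```python
# Hypothetical lookup: each of the five modules paired with its two stages,
# following the section headings used in these notes.
MODULES = {
    "From Problem to Approach": ("Business Understanding", "Analytic Approach"),
    "From Requirements to Collection": ("Data Requirements", "Data Collection"),
    "From Understanding to Preparation": ("Data Understanding", "Data Preparation"),
    "From Modelling to Evaluation": ("Modelling", "Evaluation"),
    "From Deployment to Feedback": ("Deployment", "Feedback"),
}

# Flatten the modules into the ordered list of stages.
STAGES = [stage for pair in MODULES.values() for stage in pair]

for number, stage in enumerate(STAGES, start=1):
    print(number, stage)
```

Flattening the table gives the 10 steps in order, from Business Understanding to Feedback.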
From Problem to Approach
Business Understanding
- Understand the customer's problem by asking questions to comprehend specific requirements.
- Figure out objectives that support the customer's goal.
- This stage is known as "Problem Scoping and defining".
- The 5W1H Problem Canvas is used to deeply understand the issue.
- This stage also involves using the Design Thinking (DT) Framework.
- Understanding customer needs is critical to solving the problem.
- Ask relevant questions and engage in discussions with all stakeholders to identify specific requirements and create a list of business needs.
Analytic Approach
- Once the business problem is established, the data scientist defines the analytic approach to solve it.
- Seek clarification to choose the correct path or approach.
- Asking stakeholders more questions helps the AI Project team select the correct approach to solve the problem.
- Sample Questions:
- Is there a need to find out how much or how many? (Regression)
- Which category does the data belong to? (Classification)
- Can the data be grouped? (Clustering)
- Is there an unusual pattern in the data? (Anomaly detection)
- Which option should be offered to a customer? (Recommendation)
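The sample questions above amount to a mapping from question type to analytic approach. A minimal sketch of that mapping follows; the dictionary keys paraphrase the questions, and `APPROACH_BY_QUESTION` and `suggest_approach` are hypothetical names used only for illustration.

```python
# Illustrative lookup only; keys paraphrase the sample questions in the notes.
APPROACH_BY_QUESTION = {
    "how much or how many": "Regression",
    "which category": "Classification",
    "can the data be grouped": "Clustering",
    "is there an unusual pattern": "Anomaly detection",
    "which option to offer": "Recommendation",
}


def suggest_approach(question_type: str) -> str:
    """Return the analytic approach for a question type, or 'Unknown'."""
    return APPROACH_BY_QUESTION.get(question_type.strip().lower(), "Unknown")


print(suggest_approach("Which category"))
```

In practice the choice comes out of discussion with stakeholders; the lookup only records the correspondence listed above.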
Data Analytics
- Data Analytics is used to solve project problems.
- Types of Data Analytics:
- Descriptive Analytics
- Diagnostic Analytics
- Predictive Analytics
- Prescriptive Analytics
Descriptive Analytics
- Summarizes past data to understand what has happened.
- Describes trends and patterns using graphs, charts, and statistical measures like mean, median, and mode to understand the central tendency.
- Examines the spread of data using range, variance, and standard deviation.
- Example: Calculate average student marks or analyze previous year's sales data.
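As a minimal sketch of these descriptive measures, the following uses Python's standard `statistics` module. The marks are made-up values, and the population forms of variance and standard deviation are an assumption.

```python
import statistics

# Made-up student marks, for illustration only.
marks = [72, 85, 85, 90, 64, 78]

# Central tendency.
mean_mark = statistics.mean(marks)
median_mark = statistics.median(marks)
mode_mark = statistics.mode(marks)

# Spread.
mark_range = max(marks) - min(marks)
variance = statistics.pvariance(marks)  # population variance
std_dev = statistics.pstdev(marks)      # population standard deviation

print(mean_mark, median_mark, mode_mark, mark_range, variance, round(std_dev, 2))
```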
Diagnostic Analytics
- Helps understand why things have happened.
- Done by analyzing past data using techniques like root cause analysis, hypothesis testing, and correlation analysis.
- The main purpose is to identify the causes/factors that led to an outcome.
- Example: Investigate why sales dropped by asking "Is it due to poor customer service or low product quality?".
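Correlation analysis, one of the techniques named above, can be sketched with a hand-written Pearson coefficient. The complaint and sales figures are invented for illustration, and a strong correlation only flags a factor worth investigating; it does not prove the cause.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up monthly figures: customer complaints and units sold.
complaints = [2, 4, 5, 7, 9]
sales = [100, 90, 85, 70, 60]

r = pearson(complaints, sales)
print(round(r, 3))  # close to -1: sales fall as complaints rise
```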
Predictive Analytics
- Uses past data to predict future events or trends with techniques like regression, classification, and clustering.
- The purpose is to foresee future outcomes and make decisions based on those predictions.
- Example: Forecast sales, demand, inventory, and customer purchase patterns based on sales data.
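A sales forecast of this kind can be sketched with a simple least-squares line. The quarterly figures below are invented, and `fit_line` is a hypothetical helper written for this example rather than a library function.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up sales for four past quarters.
quarters = [1, 2, 3, 4]
sales = [110, 120, 130, 140]

slope, intercept = fit_line(quarters, sales)
forecast_q5 = slope * 5 + intercept  # predicted sales for the next quarter
print(forecast_q5)
```

Real sales data is rarely this linear, so the fitted line is a starting point rather than a reliable forecast.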
Prescriptive Analytics
- Recommends actions to achieve a desired outcome with techniques like optimization, simulation, and decision analysis.
- Guides decisions by suggesting the best course of action based on data analysis.
- Example: Design price strategies during seasonal events by analyzing past data to optimize pricing, marketing, and production.
- Thoroughly understanding the problem, clarifying the approach, and effective decision-making are vital to problem-solving.
- The initial stage sets the project's direction and ensures the focus stays on solving the problem effectively.
- Careful consideration and planning are needed to achieve significant results and stakeholder value.
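To make the pricing example under Prescriptive Analytics concrete, here is a minimal optimization sketch. The linear demand curve, its parameters, and the candidate prices are all assumptions made up for illustration; in a real project the demand model would be estimated from past data.

```python
def expected_revenue(price, base_demand=1000, sensitivity=10):
    """Revenue under an assumed linear demand curve (hypothetical numbers)."""
    demand = max(base_demand - sensitivity * price, 0)
    return price * demand


# Candidate prices to compare: 10, 15, ..., 120.
candidate_prices = range(10, 121, 5)

# Recommend the price with the highest expected revenue.
best_price = max(candidate_prices, key=expected_revenue)
print(best_price, expected_revenue(best_price))
```

The output is a recommended action (a price), which is what separates prescriptive from predictive analytics.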
From Requirements to Collection
Data Requirements
- Data requirements depend on the analytic approach chosen.
- The 5W1H questioning method can be used to determine data needs.
- Identify data content, format, and sources for initial data collection.
- Determine the specific information needed to analyze the project's needs:
- Identify data types needed, such as numbers, words, or images.
- Determine data structure (table, text file, or database).
- Identify data collection sources.
- Organize cleaning steps needed prior to analysis.
Data Types
- Defining data requirements and pre-processing are key to ensuring the data is usable and accurate.
- Data categorization includes three types:
- Structured Data: organized in tables (customer databases).
- Unstructured Data: lacks a predefined structure (social media).
- Semi-structured Data: has some organization but no rigid schema (emails, XML files).
- These data types are essential for effective data collection and project management.
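The three categories can be illustrated with small in-memory examples. All values are invented, and JSON stands in here for a semi-structured format (the notes name emails and XML files).

```python
import json

# Structured: every row has the same fixed columns, like a database table.
customers = [
    {"id": 1, "name": "Asha", "city": "Pune"},
    {"id": 2, "name": "Ravi", "city": "Delhi"},
]

# Semi-structured: fields are tagged, but records may differ in shape.
email = json.loads('{"from": "a@example.com", "subject": "Hi", "attachments": []}')

# Unstructured: free text with no predefined fields.
post = "Loved the new phone, battery lasts all day!"

columns = sorted(customers[0].keys())
print(columns, email["subject"], len(post.split()))
```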
Data Collection
- During data collection, the data requirements from the previous phase are revisited.
- Analysis requires data to be collected systematically.
- The two sources of data collection are:
- primary
- secondary
Primary Source
- Data is collected firsthand through observation, interviews and experimentation.
- This data is unbiased, raw, unprocessed, reliable and accurate.
- Some primary sources are marketing campaigns, IoT sensor data and feedback forms.
Secondary Source
- Secondary sources include data that is ready for use and already stored.
- Data from books, journals, websites, and transactional databases is reused for analysis.
- Social media tracking, web scraping, and satellite data tracking are methods of secondary data collection.
- Smart forms are a way to procure data online.
- DBAs and programmers work together to extract data from primary and secondary sources.
- Once the data is collected, data scientists have a clearer understanding of what they will be working with.
- Revisit the Data Collection stage if gaps in the data are found later.
From Understanding to Preparation
Data Understanding
- Data Understanding includes all activities related to constructing the dataset.
- Determine whether the collected data is representative of the problem to be solved.
- Techniques such as visualization and statistical analysis are applied to assess the dataset's content, quality, and initial insights.
Data Preparation
- Activities that build the dataset to be used.
- Steps include:
- Transforming raw data into a workable form.
- Cleaning the data, for example handling invalid values.
- Removing duplicate values.
- Combining data from multiple sources.
- Transforming the data into input variables for the model.
- Data Preparation includes Feature Engineering.
- Data Preparation is the most time-consuming stage of a data science project.
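Two of the preparation steps, handling invalid values and removing duplicates, can be sketched in plain Python. The rows, the 0 to 120 age range, and the choice to drop (rather than repair) bad rows are assumptions for illustration.

```python
# Made-up raw rows; ages arrive as text and some are unusable.
raw_rows = [
    {"id": 1, "age": "34", "city": "Pune"},
    {"id": 2, "age": "-5", "city": "Delhi"},    # invalid age
    {"id": 1, "age": "34", "city": "Pune"},     # duplicate of the first row
    {"id": 3, "age": "n/a", "city": "Mumbai"},  # invalid age
    {"id": 4, "age": "51", "city": "Chennai"},
]


def parse_age(text):
    """Return the age as an int, or None if it is not a plausible age."""
    try:
        age = int(text)
    except ValueError:
        return None
    return age if 0 <= age <= 120 else None


clean_rows = []
seen_ids = set()
for row in raw_rows:
    age = parse_age(row["age"])
    if age is None or row["id"] in seen_ids:
        continue  # drop rows with invalid ages and repeated ids
    seen_ids.add(row["id"])
    clean_rows.append({"id": row["id"], "age": age, "city": row["city"]})

print(clean_rows)
```

Dropping rows is the simplest policy; depending on the project, invalid values might instead be corrected or imputed.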
Feature Engineering
- Feature engineering is the process of selecting, creating, or modifying features from raw data.
- It improves the performance of machine learning models.
- Example:
- Raw data: the area of a home in square feet, used in a model to predict price.
- New feature: age of house = current year - year built.
- New feature: price per square foot = price of the house divided by its area.
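The house example can be sketched as follows, computing age as the current year minus the year built and price per square foot as price divided by area. The house records and the value of `CURRENT_YEAR` are assumptions for illustration.

```python
# Made-up raw records.
houses = [
    {"area_sqft": 1200, "year_built": 2005, "price": 240000},
    {"area_sqft": 1800, "year_built": 1990, "price": 450000},
]

CURRENT_YEAR = 2025  # assumed reference year


def add_features(house, current_year=CURRENT_YEAR):
    """Return a copy of the record with two engineered features added."""
    enriched = dict(house)
    enriched["age"] = current_year - house["year_built"]
    enriched["price_per_sqft"] = house["price"] / house["area_sqft"]
    return enriched


features = [add_features(h) for h in houses]
print(features)
```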
From Modelling to Evaluation
- The modelling stage uses the prepared dataset to develop a model that fits the chosen analytic approach.
- Modelling is iterative and allows adjustments to the data.
- Data scientists can test many algorithms to identify the best-suited model.
Data modelling
- Data modelling produces either descriptive or predictive models.
Descriptive Model
- Summarizes the major characteristics of the data; the focus is on describing the data rather than making decisions.
- Highlights the patterns and trends that show how the data behaves.
- Useful for understanding what is happening within the data and how a specific process behaves.
Common Descriptive Techniques
- Summary Statistics:
- Calculating measures of central tendency: mean, median, and mode.
- Calculating measures of spread: variance and standard deviation.
- Calculating the range (the difference between the highest and lowest values).
- Visualization: charting and graphing data, such as:
- scatter plots, box plots
- pie charts
- histograms
- bar charts
Predictive Modelling
- Uses algorithms to identify patterns and trends in the data.
- Creates a model that predicts future trends or outcomes.
- Regression is a common technique.
- It is a powerful tool for forecasting.
- Training sets provide the data used to build predictive models.
- The model's output is compared against the training set to decide whether the model needs adjustment.
- The data is also used to check that the selected variables are actually required.
Evaluation
- Evaluation assesses the model after training.
- It involves collecting metrics such as precision.
- It helps determine whether the model is effective and reliable.
Model Evaluation
- Model evaluation has two main phases: diagnostic measures and statistical significance tests.
- Diagnostic measures verify that the model is working as designed.
- The model is refined if necessary.
- Statistical significance tests verify that the model processes the data accurately, so that unnecessary processes are avoided.
From Deployment to Feedback
- The trained AI model is made available to users in real-world applications.
- Data scientists familiarize stakeholders with the tool produced, across different scenarios.
- Ensure the model works as expected once deployed.
Implementation Consideration
- The model may be deployed to a limited group of users or rolled out more broadly.
- Integration may require different skill sets and technology.
Feedback
- Feedback is the last stage of the methodology.
- It collects results on the deployed model's performance.
- It checks that the model continues to provide acceptable results.
Benefits of feedback
- Helps assess the model's performance.
- Helps refine and improve the model.
- Automating steps in this stage speeds up model refreshes and improves outcomes.
- Customer receipt and methods.
- The Data Science Methodology is cyclical: stages are revisited as needed.
Model Validation
- Model validation is the process of assessing a trained model using a testing dataset.
- It is a crucial part of model development, ensuring the model can predict accurately.
Benefits
- To enhance the model and reduce any errors
- Model validation techniques include the train-test split and K-fold cross-validation.
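Train-Test Split
- A technique for evaluating the performance of a machine learning algorithm.
- It can be used with any supervised learning algorithm.
- The dataset is split into two subsets: a training set and a test set.
- The training set is used to fit the model.
- The test set is used to evaluate the fitted model by comparing its predictions with the expected values.
- The aim is to estimate how the model performs on new data.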
Configuration to Split
- The split is configured by the percentage of data assigned to each subset.
- There is no single optimal split percentage.
- Considerations:
- Representativeness of the training set and the test set
- Computational cost of training
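A percentage-based split can be sketched with the standard library alone. `split_train_test` is a hypothetical helper written for this example (it is not a library function), and the 30% test fraction and fixed seed are arbitrary choices.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(10))  # stand-in for 10 labelled samples
train, test = split_train_test(data, test_fraction=0.3)
print(len(train), len(test))
```

Shuffling before splitting helps both subsets stay representative when the original data is ordered.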
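K-fold Cross-Validation
- A technique that splits the dataset into k subsets, called folds.
- The model is trained and tested k times, each time holding out a different fold for testing.
- This produces multiple trained models, one per fold.
- It gives a more reliable measure of model quality.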
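The fold bookkeeping behind this technique can be sketched as an index generator. `k_fold_indices` is a hypothetical helper; it assumes each fold serves as the test set exactly once and omits shuffling for brevity.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

A model would be fitted on each `train_idx` and scored on the matching `test_idx`, and the k scores averaged.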
Metrics
- Evaluation metrics help assess model performance and enable comparison between models.
- They allow models to be categorized by their performance.
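Confusion Matrix
- For binary classification, the confusion matrix is a 2x2 matrix used to evaluate the model's performance.
- True positives: the model predicts yes and the actual output is yes.
- True negatives: the model predicts no and the actual output is no.
- False positives: the model predicts yes, but the actual output is no.
- False negatives: the model predicts no, but the actual output is yes.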
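The four cells of the matrix can be counted directly from paired labels. This is a minimal sketch with invented labels; `confusion_counts` is a hypothetical helper, and "yes" is assumed to be the positive class.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count TP, FP, TN, and FN for a binary classification task."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted yes, actually yes
        elif p == positive:
            fp += 1  # predicted yes, actually no
        elif a == positive:
            fn += 1  # predicted no, actually yes
        else:
            tn += 1  # predicted no, actually no
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}


# Made-up labels for eight samples.
actual = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
predicted = ["yes", "no", "no", "yes", "yes", "no", "yes", "no"]

counts = confusion_counts(actual, predicted)
print(counts)
```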
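Precision and Recall
- Precision: of all positive predictions, the proportion that is actually positive.
- Recall (also called sensitivity, or the true positive rate): the proportion of actual positives that are correctly identified.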
Accuracy
- Accuracy is the proportion of all predictions that are correct.
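Precision, recall, and accuracy all follow from the four confusion-matrix counts. The sketch below uses invented counts and a hypothetical helper, `classification_metrics`; it also computes the F1 score, taken here as the harmonic mean of precision and recall.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute precision, recall, accuracy, and F1 from confusion counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}


# Made-up counts for 100 predictions.
metrics = classification_metrics(tp=8, fp=2, tn=85, fn=5)
for name, value in metrics.items():
    print(name, round(value, 3))
```

With these counts accuracy is high while recall is much lower, which is why accuracy alone can mislead when one class is rare.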