Data Science Methodology

Questions and Answers

What does the Data Science Methodology provide to a data scientist?

  • Access to large datasets
  • Tools for data visualization
  • Training on the latest algorithms
  • A framework for designing an AI Project (correct)

The Data Science Methodology is a linear process where each step is completed before moving to the next.

False (B)

Who put forward the Data Science Methodology?

  • John Rollins (correct)
  • Andrew Ng
  • Geoffrey Hinton
  • Yann LeCun

How many steps does the Data Science Methodology consist of?

10

Match the following modules of the Data Science Methodology with their corresponding stages:

  • Problem to Approach = Understanding the initial challenge and forming strategies
  • Requirements to Collection = Defining data needs and gathering necessary information
  • Understanding to Preparation = Ensuring data quality and suitability for analysis
  • Modelling to Evaluation = Creating and assessing predictive models
  • Deployment to Feedback = Implementing the model and gathering user input for refinement

What is the primary goal of the 'Business Understanding' stage in the Data Science Methodology?

To understand the customer's problem (A)

The 5W1H Problem Canvas is a tool used to understand the problem of stakeholders.

True (A)

Which framework is employed in the 'Problem to Approach' stage of the Data Science Methodology?

DT (Design Thinking) Framework (A)

What type of questions are asked in the phase of analytic approach?

Data-driven questions

Which type of data analytics answers the question: 'What happened?'

Descriptive Analytics (D)

Diagnostic analytics focus on predicting future outcomes based on historical data.

False (B)

A company wants to forecast its sales for the next quarter. Which type of analytics would be most appropriate?

Predictive Analytics (A)

Providing the right course of action based on data analysis is called ______ Analytics

Prescriptive

Match the type of analytics with its focus:

  • Descriptive Analytics = Summarizing historical data
  • Diagnostic Analytics = Understanding why certain events occurred
  • Predictive Analytics = Predicting future outcomes based on historical data patterns
  • Prescriptive Analytics = Determining the best course of action

What is the role of 5W1H in determining data requirements?

To define data requirements (A)

Structured data refers to data without a predefined structure, such as social media posts and images.

False (B)

Which of the following is/are type(s) of data?

All of the above (D)

What are the two sources of data collection?

Primary and Secondary

Which of the following is an example of a primary data source?

A marketing campaign's feedback forms (A)

Data scientists should aim to exclude DBAs and programmers when extracting data from data sources.

False (B)

In which stage should one assess the representativeness of the collected dataset?

Data Understanding (A)

Techniques such as ______ can be applied to the dataset, to assess the content, quality, and initial insights about the data.

Descriptive statistics

What are the activities included in Data Preparation?

All of the above (D)

Data collection takes more time than Data Preparation.

False (B)

What is raw data?

Area of the house

What is the primary focus during the 'AI modelling' stage?

Model Testing (C)

In data modelling, the development aspect is usually iterative.

True (A)

According to data modelling, identifying the best model for capstone projects is called

Determined Technique (A)

[Blank] is useful to understand what is happening within the data.

Descriptive modelling

Summary statistics includes which of the following?

All of the above (D)

In machine learning, what is a training set?

Historical data

What is the role of the training set?

Gauge to determine (C)

Evaluation in an AI project cycle is the process of underperforming of the model.

False (B)

What is used to measure metrics such as accuracy, precision, recall, or F1 score?

Test data (C)

In Data Science methodology, the model evaluation can have ______ main phases.

two

What must data scientists make the stakeholders familiar with?

Tool produced in different scenarios (D)

What is the last stage in the methodology discussed?

Feedback

The trained AI model is made available to the users in real-world applications in which stage?

Deployment (D)

Model validation involves which dataset?

Testing dataset (A)

Test dataset is used to fit the machine learning model.

False (B)

The train-test split is a technique for evaluating the ______ of a machine learning algorithm.

Performance

Match the data:

  • Divides the data into a training dataset and a testing dataset = Train-Test Split
  • Applied on small datasets = Cross Validation

Which type of problem does the train-test split address?

Both A and B (D)

Flashcards

Data Science Methodology

A framework for designing an AI project for data scientists.

Data Science Methodology

A process with a prescribed sequence of iterative steps that data scientists follow to approach a problem and find a solution

Business Understanding

Breaking down a problem to understand customer needs and objectives.

Analytic Approach

Defining the analytical approach to solve the problem

Data Requirements

Finding out the types of data required for the project

Data Collection

Systematic process of gathering observations or measurements.

Data Understanding

Checking whether the collected data represents the problem to be solved or not.

Data preparation

This stage covers all the activities to build the set of data that will be used in the modelling step.

Feature engineering

Selecting, modifying, or creating new features (variables) from raw data to improve the performance of machine learning models.

AI Modelling

Developing models and Data visualization

Descriptive Modeling

It is a concept in data science and statistics that focuses on summarizing and understanding the characteristics of a dataset without making predictions or decisions.

Predictive modeling

It involves using data and statistical algorithms to identify patterns and trends in order to predict future outcomes or values.

Evaluation

Assessing how well a model performs after training.

Deployment

The stage where the trained AI model is made available to the users in real-world applications.

Feedback

Includes results collected from the deployment of the model, such as feedback from users, which the data scientist can use to improve the model quickly.

Primary data source

Original source of data collected firsthand through direct observation, experimentation, surveys, interviews, or other methods.

Secondary data source

Data which is already stored and ready for use.

Structured data

Data organized in tables.

Unstructured data

Data without a predefined structure, e.g., social media posts, images

Semi-structured data

Data with some organization, e.g., emails, XML files

Descriptive Analytics

This summarizes past data to understand what has happened.

Diagnostic Analytics

It helps to understand the reason behind why things have happened.

Predictive Analytics

This uses the past data to make predictions about future events

Prescriptive Analytics

This recommends the action to be taken to achieve the desired outcome.

Model Validation

Model validation is the step conducted post Model Training, wherein the effectiveness of the trained model is assessed using a testing dataset.

Train Test Split

A technique for evaluating the performance of a machine learning algorithm. It can be used for classification or regression problems and can be used for any supervised learning algorithm.

Cross Validation

Divides a dataset into subsets (folds), trains the model on some folds, and evaluates its performance on the remaining data.

Evaluation metrics

Help assess the performance of a trained model on a test dataset, providing insights into its strengths and weaknesses.

Accuracy

Number of correct predictions / Total number of predictions

Confusion Matrix

A table used to evaluate the performance of a classification model.

Precision

"What proportion of predicted Positives is truly Positive?"

Recall

"What proportion of actual Positives is correctly classified?"

MAE

Mean (average) of the absolute differences between predictions and actual values.

MSE

Mean (average) of squared distances between our target variable and predicted values.

RMSE

Square root of the MSE; it can be read as the standard deviation of the residuals (prediction errors).
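
As a rough illustration of the three regression error metrics defined in the cards above, the sketch below computes them in plain Python. The function names and the sample values are made up for illustration and are not part of the lesson.

```python
import math

def mae(actual, predicted):
    """Mean absolute error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean squared error: average of (actual - predicted) squared."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: square root of the MSE."""
    return math.sqrt(mse(actual, predicted))

# Hypothetical target values and model predictions.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]

print("MAE :", mae(actual, predicted))
print("MSE :", mse(actual, predicted))
print("RMSE:", rmse(actual, predicted))
```

Because the errors are squared before averaging, MSE and RMSE penalize the single large error (2.0) more heavily than MAE does.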

Study Notes

  • The Unit focuses on Data Science Methodology as an analytic approach to a capstone project.

Objectives

  • Understand major steps in addressing a data science problem.
  • Define data science methodology and its importance.
  • Demonstrate the steps in Data Science Methodology.

Learning Outcomes

  • Integrate Data Science Methodology steps into a Capstone Project.
  • Identify the best way to represent a problem's solution.
  • Understand the importance of validating machine learning models.
  • Use key evaluation metrics for various machine learning tasks.

Introduction to Data Science Methodology

  • Methodology provides a framework for designing an AI Project.
  • The framework helps the team decide on methods, processes, and strategies to obtain correct output from the AI Project.
  • A methodology organizes the project systematically, which saves time and cost.
  • Data Science Methodology prescribes a sequence of steps that data scientists follow to approach a problem and find a solution.
  • It helps data scientists handle and comprehend data.
  • John Rollins, a data scientist at IBM Analytics, put forward the Data Science Methodology.
  • The process involves 10 steps.
  • This foundational methodology gives detailed guidance on how AI projects can be solved.
  • The steps are grouped into five modules of two stages each, and each module explains the rationale for its stages.
  • The five modules are:
    • From Problem to Approach
    • From Requirements to Collection
    • From Understanding to Preparation
    • From Modelling to Evaluation
    • From Deployment to Feedback

From Problem to Approach

Business Understanding

  • Understand the customer's problem by asking questions to comprehend specific requirements.
  • Figure out objectives that support the customer's goal.
  • This stage is known as "Problem Scoping and defining".
  • The 5W1H Problem Canvas is used to deeply understand the issue.
  • This stage also involves using the Design Thinking (DT) Framework.
  • Understanding customer needs is critical to solving the problem.
  • Ask relevant questions and engage in discussions with all stakeholders to identify specific requirements and create a list of business needs.

Analytic Approach

  • Once the business problem is established, the data scientist defines the analytic approach to solve it.
  • Seek clarification from stakeholders to choose the correct approach.
  • Asking stakeholders more questions helps the AI project team select the correct approach.
  • Sample questions:
    • Is there a need to find out how much or how many? (Regression)
    • Which category does the data belong to? (Classification)
    • Can the data be grouped? (Clustering)
    • Is there an unusual pattern in the data? (Anomaly detection)
    • Which option should be offered to a customer? (Recommendation)

Data Analytics

  • Data Analytics is used to solve project problems.
  • Types of Data Analytics:
    • Descriptive Analytics
    • Diagnostic Analytics
    • Predictive Analytics
    • Prescriptive Analytics

Descriptive Analytics

  • Summarizes past data to understand what has happened.
  • Describes trends and patterns using graphs, charts, and statistical measures like mean, median, and mode to understand the central tendency.
  • Examines the spread of data using range, variance, and standard deviation.
  • Example: Calculate average student marks or analyze previous year's sales data.
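
A minimal sketch of the measures listed above, using Python's standard `statistics` module. The list of student marks is invented for illustration, and population variance and standard deviation are used here as an assumption (the sample versions, `statistics.variance` and `statistics.stdev`, may be more appropriate for a sample).

```python
import statistics

# Hypothetical marks for six students.
marks = [72, 85, 85, 90, 64, 78]

# Central tendency
mean = statistics.mean(marks)
median = statistics.median(marks)
mode = statistics.mode(marks)

# Spread
data_range = max(marks) - min(marks)
variance = statistics.pvariance(marks)  # population variance
std_dev = statistics.pstdev(marks)      # population standard deviation

print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={data_range}, variance={variance}, std_dev={std_dev:.2f}")
```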

Diagnostic Analytics

  • Helps understand why things have happened.
  • Analyzes past data using techniques such as root cause analysis, hypothesis testing, and correlation analysis.
  • The main purpose is to identify the causes/factors that led to an outcome.
  • Example: Investigate why sales dropped by asking "Is it due to poor customer service or low product quality?".

Predictive Analytics

  • Uses past data to predict future events or trends with techniques like regression, classification, and clustering.
  • The purpose is to foresee future outcomes and make decisions based on those predictions.
  • Example: Forecast sales, demand, inventory, and customer purchase patterns based on sales data.
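
To make the sales-forecast example concrete, here is a small sketch that fits a straight line to past quarterly sales with ordinary least squares and extrapolates one quarter ahead. The sales figures and the helper name `fit_line` are hypothetical, and a simple linear trend is only one of many possible predictive techniques.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical historical data: sales for quarters 1 to 4.
quarters = [1, 2, 3, 4]
sales = [100.0, 120.0, 140.0, 160.0]

slope, intercept = fit_line(quarters, sales)

# Predict the next quarter from the fitted trend.
forecast_q5 = slope * 5 + intercept
print(f"Forecast for quarter 5: {forecast_q5:.1f}")
```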

Prescriptive Analytics

  • Recommends actions to achieve a desired outcome with techniques like optimization, simulation, and decision analysis.
  • Guides decisions by suggesting the best course of action based on data analysis.
  • Example: Design price strategies during seasonal events by analyzing past data to optimize pricing, marketing, and production.
  • Thoroughly understanding the problem, clarifying the approach, and effective decision-making are vital to problem-solving.
  • The initial stage sets the project's direction and ensures the focus stays on solving the problem effectively.
  • Careful consideration and planning help achieve significant results and stakeholder value.

From Requirements to Collection

Data Requirements

  • Data requirements depend on the analytic approach chosen.
  • The 5W1H questioning method can be used to determine data needs.
  • Identify data content, format, and sources for initial data collection.
  • Determine the specific information needed to meet the project's needs:
    • Identify the data types needed, such as numbers, words, or images.
    • Determine the data structure (table, text file, or database).
    • Identify data collection sources.
    • Organize the cleaning steps needed prior to analysis.

Data Types

  • Defining data requirements and pre-processing are key to ensuring the data is usable and accurate.
  • Data falls into three categories:
    • Structured data: organized in tables (e.g., customer databases).
    • Unstructured data: lacks a predefined structure (e.g., social media posts, images).
    • Semi-structured data: has some organization (e.g., emails, XML files).
  • Understanding these data types is essential for effective data collection and project management.
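
The short sketch below shows, under made-up sample values, how the three categories tend to look when loaded in Python: tabular rows for structured data, nested key-value records for semi-structured data, and free text for unstructured data. JSON stands in for semi-structured formats here; the notes mention emails and XML files.

```python
import csv
import io
import json

# Structured: rows and columns with a fixed schema (hypothetical customer table).
structured = "customer_id,name,city\n1,Asha,Pune\n2,Ravi,Delhi\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: some organization (keys, nesting) but no rigid table layout.
semi_structured = '{"from": "asha@example.com", "subject": "Order", "tags": ["urgent"]}'
message = json.loads(semi_structured)

# Unstructured: free text with no predefined structure.
unstructured = "Loved the new phone, but the battery drains too fast!"
word_count = len(unstructured.split())

print(rows[1]["city"], message["tags"], word_count)
```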

Data Collection

  • Data collection can lead to revising decisions made in the previous phases.
  • Analytics requires systematic data collection.
  • The two sources of data collection are:
    • Primary
    • Secondary

Primary Source

  • Data is collected firsthand through observation, interviews and experimentation.
  • This data is unbiased, raw, unprocessed, reliable and accurate.
  • Some primary sources are marketing campaigns, IoT sensor data and feedback forms.

Secondary Source

  • Secondary sources hold data that is already stored and ready for use.
  • Data in books, journals, websites, and transactional databases is reused for analysis.
  • Social media tracking, web scraping, and satellite data tracking are methods of secondary data collection.
  • Smart forms are a way to procure smart data online.
  • Programmers extract data from both primary and secondary sources.
  • Once the data is collected, data scientists understand what they have to work with.
  • If gaps are found after collection, revisit the data requirements and collection stages.

From Understanding to Preparation

Data Understanding

  • Data Understanding includes all activities related to constructing the dataset.
  • Determine whether the collected data represents the problem to be solved.
  • Use techniques such as descriptive statistics and visualization to assess the dataset's content, quality, and initial insights.

Data Preparation

  • Covers all the activities that build the dataset used in the modelling step.
  • Steps include:
    • Transforming raw data into a workable form.
    • Cleaning the data, for example handling invalid values.
    • Removing duplicate values.
    • Combining data from multiple sources.
    • Transforming data into input variables for the model.
  • Data preparation includes feature engineering.
  • Data preparation is usually the most time-consuming stage of a data science project.
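
The preparation steps above can be sketched in a few lines of plain Python. The record layout (`id`, `area_sqft`, `price`), the two sources, and the validity rule (values must be present and positive) are all assumptions chosen for illustration.

```python
def prepare(records_a, records_b):
    """Combine two sources, drop invalid rows, and remove exact duplicates."""
    combined = records_a + records_b  # combine data from multiple sources
    cleaned = []
    seen = set()
    for rec in combined:
        area = rec.get("area_sqft")
        price = rec.get("price")
        # Cleaning: skip rows with missing or non-positive values.
        if area is None or price is None or area <= 0 or price <= 0:
            continue
        # De-duplication: keep only the first copy of each identical row.
        key = (rec["id"], area, price)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned

# Hypothetical house listings from two sources.
source_a = [
    {"id": 1, "area_sqft": 1200, "price": 240000},
    {"id": 2, "area_sqft": -50, "price": 100000},   # invalid area
]
source_b = [
    {"id": 1, "area_sqft": 1200, "price": 240000},  # duplicate of id 1
    {"id": 3, "area_sqft": 900, "price": None},     # missing price
    {"id": 4, "area_sqft": 1500, "price": 330000},
]

dataset = prepare(source_a, source_b)
print([rec["id"] for rec in dataset])
```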

Feature Engineering

  • Feature engineering is the process of selecting, modifying, or creating new features (variables) from raw data.
  • It improves the performance of machine learning models.
  • Example: predicting house prices.
    • Raw data: area of the house in square feet.
    • New feature: age of the house = current year - year built.
    • New feature: price per square foot = price of the house divided by its area.
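
A small sketch of the house-price example: two new features are derived from the raw columns. The field names, the sample house, and the fixed `current_year` value are assumptions for illustration.

```python
def add_features(house, current_year):
    """Return a copy of the record with two engineered features added."""
    enriched = dict(house)
    # Age of the house: current year minus the year it was built.
    enriched["age"] = current_year - house["year_built"]
    # Price per square foot: price divided by area.
    enriched["price_per_sqft"] = house["price"] / house["area_sqft"]
    return enriched

# Hypothetical raw record.
house = {"area_sqft": 1500, "year_built": 2005, "price": 300000}

features = add_features(house, current_year=2025)
print(features["age"], features["price_per_sqft"])
```

Note that price per square foot uses the target value (price), so in a real price-prediction model it would need to come from comparable past sales rather than from the house being predicted.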

From Modelling to Evaluation

  • The modelling stage uses the prepared dataset to develop a model that fits the chosen analytic approach.
  • Modelling is iterative and allows adjustments to the data along the way.
  • Data scientists can test several algorithms to identify the best-suited model.

Data modelling

  • Data modelling produces either descriptive or predictive models.

Descriptive Model

  • Summarizes the characteristics of a dataset without making predictions or decisions.
  • Captures the main characteristics, patterns, and trends in the data.
  • Useful for understanding what is happening within the data and how a specific process behaves.

Common Descriptive Techniques

  • Summary statistics:
    • Mean, median, and mode, which describe the central tendency of the distribution.
    • Variance and standard deviation, which describe the spread.
    • Range, the difference between the highest and lowest values.
  • Visualization: charting and graphing data, such as:
    • Scatter plots and box plots
    • Pie charts
    • Histograms
    • Bar charts

Predictive Modelling

  • Uses data and statistical algorithms to identify patterns and trends in order to predict future outcomes or values.
  • Builds a model that predicts future trends.
  • Regression is one common technique.
  • It is a powerful tool for forecasting.
  • A training set is the historical data used to build a predictive model.
  • The training set acts as a gauge to determine whether the model needs adjustment.
  • The data is checked to make sure it meets the requirements of the selected model.

Evaluation

  • Evaluation assesses how well the model performs after training.
  • It involves collecting metrics such as accuracy, precision, recall, or F1 score.
  • It helps determine whether the model is effective and reliable.

Model Evaluations:

  • Model evaluation can have two main phases: diagnostic measures and statistical significance testing.
  • Diagnostic measures verify that the model is working as designed.
  • If the model does not align with the design, it is refined as necessary.
  • Statistical significance testing verifies that the data is handled accurately within the model and helps avoid unnecessary processes.

From Deployment to Feedback

  • The trained AI model is made available to users in real-world applications.
  • Data scientists familiarize stakeholders with the tool produced, in different scenarios.
  • Confirm that the deployed solution works.

Implementation Consideration

  • Deployment may be limited to a group of users or rolled out across the board.
  • Integration may require different skill sets and technology.

Feedback

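  • Feedback is the last stage of the methodology.
  • It includes results collected from the deployment of the model, including user feedback on the model's performance.
  • The aim is for the model to continue providing acceptable results.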

Benefits of feedback

  • Helps assess the model's performance.
  • Helps refine and improve the model.
  • Automating steps in the process speeds up model refresh and improves results.
  • Customer receipt and methods.
  • The Data Science Methodology is cyclical: stages are revisited as needed.

Model Validation

  • Model validation is the step after model training in which the effectiveness of the trained model is assessed using a testing dataset.
  • Validation is a crucial part of model development because it checks that the model can predict accurately.

Benefits

  • Enhances the model and reduces errors.
  • Model validation techniques include train-test split and cross-validation.

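Train-Test Split

  • A technique for evaluating the performance of a machine learning algorithm.
  • Can be used with any supervised learning algorithm, for classification or regression problems.
  • The dataset is split into two subsets: a training set and a test set.
  • The model is fit on the training set.
  • The test set is used to make predictions, which are compared with the expected values.
  • The aim is to estimate how the model performs on new data.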
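Configuring the Split

  • The split is configured as the percentage of the dataset assigned to the training and test sets.
  • There is no single optimal split percentage.
  • Considerations when choosing the split:
    • The training set and the test set
    • The computational cost of training the model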
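K-Fold Cross-Validation

  • Divides the dataset into k subsets (folds).
  • The model is trained on some folds and evaluated on the remaining fold, for a fixed number of rounds.
  • This produces multiple trained models, one per round.
  • Each round works on different subsets of the data.
  • The results are combined to measure the quality of the model.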
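
The sketch below generates the index splits for k-fold cross-validation without any library. The helper name `k_fold_indices` and the choice of 10 samples with 3 folds are hypothetical; the folds are taken in order, whereas real workflows often shuffle the data first.

```python
def k_fold_indices(n_samples, k):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]                # held-out fold
        train_idx = indices[:start] + indices[start + size:]  # remaining folds
        folds.append((train_idx, test_idx))
        start += size
    return folds

folds = k_fold_indices(10, 3)
for round_no, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"round {round_no}: train={train_idx} test={test_idx}")
```

Each sample lands in a test fold exactly once, so averaging the per-round scores uses every sample for evaluation.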

Metrics

  • Evaluation metrics help assess a trained model's performance on a test dataset and enable comparison between models.
  • Categorizes models based on their performance

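Confusion Matrix

  • A confusion matrix is a table used to evaluate the performance of a classification model; for binary classification it is a 2x2 matrix.
  • True positives: the model predicts yes and the actual output is yes.
  • True negatives: the model predicts no and the actual output is no.
  • False positives: the model predicts yes, but the actual output is no.
  • False negatives: the model predicts no, but the actual output is yes.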
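
As an illustration, the four cells of a binary confusion matrix can be counted directly from actual and predicted labels. The labels below are invented, and the function name `confusion_counts` is hypothetical.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count true/false positives and negatives for a binary classifier."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1   # predicted yes, actually yes
        elif p == positive:
            fp += 1   # predicted yes, actually no
        elif a == positive:
            fn += 1   # predicted no, actually yes
        else:
            tn += 1   # predicted no, actually no
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

# Hypothetical labels for eight test samples.
actual = ["yes", "yes", "no", "no", "yes", "no", "no", "yes"]
predicted = ["yes", "no", "no", "yes", "yes", "no", "no", "yes"]

counts = confusion_counts(actual, predicted)
print(counts)
```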

Precision and Recall

  • Precision: what proportion of predicted positives is truly positive?
  • Recall (sensitivity): what proportion of actual positives is correctly classified?

Accuracy

  • Accuracy is the number of correct predictions divided by the total number of predictions.
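
A short sketch of how precision, recall, accuracy, and the F1 score follow from confusion-matrix counts. The counts passed in are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not something the lesson specifies.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute common classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    # Precision: share of predicted positives that are truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that were correctly classified.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Accuracy: correct predictions divided by all predictions.
    accuracy = (tp + tn) / total if total else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}

# Hypothetical counts from a test set of eight samples.
metrics = classification_metrics(tp=3, tn=3, fp=1, fn=1)
print(metrics)
```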
