Data Science Process - Chapter 2

Created by
@KidFriendlyMoonstone1810

Questions and Answers

What is the first step in the data science process?

  • Applying the model on a dataset
  • Deploying and maintaining the models
  • Preparing the data samples
  • Understanding the problem (correct)

What is the primary goal of the data science process?

  • To address the analysis question (correct)
  • To collect as much data as possible
  • To develop new software tools
  • To calculate statistical measures

Which of the following frameworks is considered one of the most popular for developing data science solutions?

  • PDM-Cycle
  • DMAIC
  • SEMMA
  • CRISP-DM (correct)

In the context of data science, what does the 'Application' phase entail?

  • Testing the model on real-world data (correct)

Why is the objective of the problem considered the most important step in the data science process?

  • It determines the data sets needed for analysis (correct)

What is the purpose of the Business Understanding phase in the CRISP-DM process?

  • To understand the objectives and requirements of the project (correct)

What is meant by 'prior knowledge' in the data science process?

  • Information already known about a subject or problem (correct)

What does the acronym SEMMA stand for in data science frameworks?

  • Sample, Explore, Modify, Model, Assess (correct)

What role does understanding the subject area of the problem play in the data science process?

  • It helps to ignore spurious patterns (correct)

Which of the following is NOT a characteristic of the data science process?

  • It is a linear, step-by-step process (correct)

    Study Notes

    Fundamentals of Data Science

    • The methodical discovery of useful relationships and patterns in data is enabled by a set of iterative activities collectively known as the data science process.
    • The standard data science process involves:
      • Understanding the problem
      • Preparing data samples
      • Developing a model
      • Applying the model to a dataset
      • Deploying and maintaining the models
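
As a rough illustration, the five steps above can be traced in a few lines of Python. Everything here is hypothetical: the objective (predict y from x), the toy data, and the helper names `fit_line` and `predict` are assumptions made for this sketch, and a one-variable least-squares line merely stands in for whatever model a real project would use.

```python
# 1. Understand the problem: predict y from x (hypothetical objective).
# 2. Prepare the data samples: toy (x, y) pairs, split into train and test.
data = [(1, 3.0), (2, 5.0), (3, 7.0), (4, 9.0), (5, 11.0), (6, 13.0)]
train, test = data[:4], data[4:]


def fit_line(points):
    """3. Develop a model: least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, x):
    """5. Deploy: the fitted parameters wrapped in a reusable function."""
    slope, intercept = model
    return slope * x + intercept


model = fit_line(train)

# 4. Apply the model to data it has not seen and measure the error.
errors = [abs(predict(model, x) - y) for x, y in test]
mean_abs_error = sum(errors) / len(errors)
print(model, mean_abs_error)
```

In practice the steps are revisited iteratively: a poor error in step 4 typically sends the work back to data preparation or modeling.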

    Reference Books

    • Data Science: Concepts and Practice, by Vijay Kotu and Bala Deshpande, 2019
    • DATA SCIENCE: FOUNDATION & FUNDAMENTALS, by B. S. V. Vatika, L. C. Dabra, Gwalior, 2023

    Lecture 2

    Chapter 2: Data Science Process

    Prior Knowledge

    • Prior knowledge refers to information already known about a subject.
    • The prior knowledge step helps define the problem, its business context, and the data needed to solve it. Its components are:
      • Objective of the problem
      • Subject area of the problem
      • Data

    Why Is It Important?

    • Huge amounts of data are now widely available.
    • That data has to be transformed into useful information and knowledge.
    • Data mining is a natural evolution of information technology.

    Data science process frameworks

    • Cross Industry Standard Process for Data Mining (CRISP-DM) is a widely adopted framework for developing data science solutions.
    • Other frameworks include SEMMA (Sample, Explore, Modify, Model, and Assess) and DMAIC (Define, Measure, Analyze, Improve, and Control).

    CRISP-DM process

    • CRISP-DM is a process model whose six phases (Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment) describe the data science life cycle.
    • The Business Understanding phase focuses on understanding the project objectives and customer needs.
    • Data Understanding focuses on identifying, collecting, and analyzing data sets.

    Data Science Process (Generic Steps)

    • The fundamental objective of any data science process is to address the analysis question.
    • The technique used to answer the business question could be a learning algorithm, such as a decision tree or an artificial neural network, or a simple visualization, such as a scatterplot.
    • Software tools for developing and implementing data science algorithms include custom coding, RapidMiner, R, Weka, SAS, Oracle Data Miner, and Python.

    Data Preparation

    • Preparing the dataset to suit a data science task is the most time-consuming part of the process.
    • Data is rarely available in the required format.
    • Most data science algorithms expect data in a tabular format, with records in rows and attributes in columns.
    • Data in any other format must first be transformed into this tabular form.
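
To make the tabular requirement concrete, here is a minimal sketch that pivots a list of purchase events into one row per customer and one column per product. The event data and the choice of purchase counts as cell values are assumptions for illustration only.

```python
# Hypothetical raw data: one (customer, product) event per entry, not tabular.
events = [
    ("alice", "bread"),
    ("alice", "milk"),
    ("bob", "milk"),
    ("alice", "bread"),
    ("carol", "eggs"),
]

# Columns of the table: one per distinct product, in a stable order.
products = sorted({product for _, product in events})

# Rows of the table: one per customer, holding purchase counts per product.
table = {}
for customer, product in events:
    row = table.setdefault(customer, {p: 0 for p in products})
    row[product] += 1

# Print a header line followed by one line per record.
print(["customer"] + products)
for customer in sorted(table):
    print([customer] + [table[customer][p] for p in products])
```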

    Data Preparation Steps

    • Data Exploration
    • Data Quality
    • Handling Missing Values
    • Data Type Conversion
    • Transformation
    • Outliers
    • Feature Selection
    • Sampling

    1-Data Exploration

    • Data exploration, also known as exploratory data analysis, uses simple tools to achieve a basic understanding of the data.
    • Data exploration approaches involve computing descriptive statistics and visualization of data.
    • These approaches expose data structure, value distribution, extreme values, and inter-relationships within the dataset.
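
A minimal sketch of this kind of exploration, using only Python's standard `statistics` module. The `ages` values are made up for the example, and the text histogram is a stand-in for a proper plotting library.

```python
import statistics

# Hypothetical attribute values, e.g. ages of loan applicants.
ages = [23, 25, 31, 35, 35, 41, 44, 52, 58, 90]

# Descriptive statistics give a first picture of the attribute.
summary = {
    "count": len(ages),
    "min": min(ages),
    "max": max(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "stdev": statistics.stdev(ages),  # sample standard deviation
}
for name, value in summary.items():
    print(f"{name:>6}: {value}")

# A crude text histogram exposes the value distribution and the extreme value.
for low in range(20, 100, 20):
    in_bin = [a for a in ages if low <= a < low + 20]
    print(f"{low}-{low + 19}: {'#' * len(in_bin)}")
```

Here the mean sits well above the median, which already hints that the single value of 90 is pulling the distribution to the right.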

    2-Data Quality

    • Data quality is essential throughout the data collection, processing, and storage lifecycle.
    • The accuracy and reliability of data are key.

    3-Handling Missing Values

    • Missing attribute values are a common data quality problem.
    • Understanding why values are missing is critical for choosing a strategy. One option is imputation: replacing a missing value with, for example, the mean, minimum, or maximum of the attribute.
    • Another option is to drop records with missing values, which simplifies the problem at the cost of a smaller dataset.
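
Both strategies can be sketched in a few lines. The records below are hypothetical, `None` is assumed to mark a missing value, and mean imputation is just one of the choices mentioned above.

```python
import statistics

# Hypothetical records; None marks a missing income value.
records = [
    {"id": 1, "income": 40_000},
    {"id": 2, "income": None},
    {"id": 3, "income": 55_000},
    {"id": 4, "income": 61_000},
]

known = [r["income"] for r in records if r["income"] is not None]
mean_income = statistics.mean(known)

# Option 1: impute, replacing each missing value with the attribute mean.
imputed = [
    {**r, "income": mean_income if r["income"] is None else r["income"]}
    for r in records
]

# Option 2: drop, keeping only complete records (the dataset shrinks).
complete = [r for r in records if r["income"] is not None]

print(imputed[1], len(complete))
```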

    4-Data Type Conversion

    • Data attributes can be continuous numeric, integer numeric, or categorical.
    • Some algorithms accept only certain types; linear regression models, for example, require numeric input.
    • Categorical data may therefore need to be converted to numeric form before modeling.
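
One common way to turn a categorical attribute into numbers is one-hot encoding, sketched below. The attribute values are hypothetical, and one-hot encoding is an assumption chosen for illustration; the notes do not prescribe a particular encoding.

```python
# Hypothetical categorical attribute.
home_ownership = ["rent", "own", "mortgage", "rent"]

# Fix a stable column order: ['mortgage', 'own', 'rent'].
categories = sorted(set(home_ownership))

# One-hot encoding: one 0/1 column per category.
encoded = [
    [1 if value == category else 0 for category in categories]
    for value in home_ownership
]

for value, row in zip(home_ownership, encoded):
    print(value, row)
```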

    5-Transformation

    • Some algorithms, particularly distance-based ones, require numeric and normalized input.
    • Normalization rescales attributes to a comparable range so that an attribute with large values does not dominate distance calculations.
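
The effect on distance calculations can be seen in a small sketch. The applicant data and the helper `min_max_scale` are hypothetical, and min-max scaling is only one of several normalization schemes.

```python
import math

# Hypothetical applicants: (annual income in dollars, credit score).
a, b, c, d = (50_000, 700), (51_000, 400), (60_000, 700), (100_000, 550)
rows = [a, b, c, d]


def min_max_scale(rows):
    """Rescale every column to [0, 1]; assumes no column is constant."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]


# Raw Euclidean distances are dominated by income, whose values are far larger.
raw_ab, raw_ac = math.dist(a, b), math.dist(a, c)

# After scaling, the 300-point credit score gap between a and b counts again.
sa, sb, sc, _ = min_max_scale(rows)
scaled_ab, scaled_ac = math.dist(sa, sb), math.dist(sa, sc)

print(raw_ab < raw_ac, scaled_ab < scaled_ac)
```

On the raw values applicant `b` looks closest to `a`; after scaling, `c` does, because the large difference in credit score is no longer swamped by income.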

    6-Outliers

    • Outliers are abnormal data points in a dataset.
    • An outlier may be a correctly captured but unusual value, or the result of an error in data capture.
    • Either way, outliers need to be identified and understood, and they warrant special treatment.
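
One simple way to flag candidate outliers is the interquartile range (IQR) rule, sketched here. The rule, the conventional 1.5 multiplier, and the sample values are assumptions for illustration; the notes do not name a specific detection method.

```python
import statistics

# Hypothetical attribute values; 250 might be a typo or a genuine extreme.
heights_cm = [150, 155, 160, 162, 165, 168, 170, 172, 175, 250]

# Quartiles split the sorted values into four equal-sized groups.
q1, _, q3 = statistics.quantiles(heights_cm, n=4)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged for review.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [h for h in heights_cm if h < low_fence or h > high_fence]

print(outliers)
```

Flagging is only the first step: whether a flagged value is corrected, removed, or kept depends on whether it was captured correctly.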

    7-Feature Selection

    • A large number of attributes can increase the complexity of a model and negatively impact performance.
    • Not all attributes are equally important for prediction.
    • Attribute selection is a critical step.
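
As a hedged sketch of one simple filter approach, the code below ranks attributes by the absolute Pearson correlation with the target and keeps the top two. The data, the helper `pearson`, and the cutoff of two are all assumptions; correlation only captures linear relationships, and other selection methods exist.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical attributes and prediction target.
attributes = {
    "income": [30, 45, 50, 65, 80],
    "age": [25, 52, 31, 47, 38],
    "shoe_size": [42, 38, 44, 40, 41],
}
loan_amount = [10, 16, 18, 24, 30]

# Score each attribute by the strength of its linear relation to the target.
scores = {
    name: abs(pearson(values, loan_amount))
    for name, values in attributes.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)
selected = ranked[:2]  # keep the two highest-scoring attributes

print(ranked, selected)
```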

    8-Sampling

    • Sampling selects a representative subset of data.
    • Sampling can significantly speed up the process of building prediction models.
    • In theory, building a model on a sample rather than the full dataset introduces some error, but appropriate sampling techniques keep that error manageable.
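
A minimal sketch of simple random sampling with the standard `random` module. The synthetic population, the 5% sample size, and the fixed seed are assumptions for the example; other schemes, such as stratified sampling, may suit a given dataset better.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population of 10,000 numeric values.
population = [random.gauss(100, 15) for _ in range(10_000)]

# Simple random sample without replacement: 5% of the records.
sample = random.sample(population, k=500)

# A representative sample should have a mean close to the population mean.
print(statistics.mean(population), statistics.mean(sample))
```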

    Description

    This quiz covers the essential steps involved in the data science process as outlined in Chapter 2. You will explore how to understand problems, prepare data, develop models, and apply those models effectively. Test your knowledge on the methodical discovery of patterns and relationships in data!
