Data Science Process - Chapter 2

Questions and Answers

What is the first step in the data science process?

Applying the model on a dataset
Deploying and maintaining the models
Preparing the data samples
Understanding the problem (correct)

What is the primary goal of the data science process?

To address the analysis question (correct)
To collect as much data as possible
To develop new software tools
To calculate statistical measures

Which of the following frameworks is considered one of the most popular for developing data science solutions?

PDM-Cycle
DMAIC
SEMMA
CRISP-DM (correct)

In the context of data science, what does the 'Application' phase entail?

Testing the model on real-world data (correct)

Why is the objective of the problem considered the most important step in the data science process?

It determines the data sets needed for analysis (correct)

What is the purpose of the Business Understanding phase in the CRISP-DM process?

To understand the objectives and requirements of the project (correct)

What is meant by 'prior knowledge' in the data science process?

Information already known about a subject or problem (correct)

What does the acronym SEMMA stand for in data science frameworks?

Sample, Explore, Modify, Model, Assess (correct)

What role does understanding the subject area of the problem play in the data science process?

It helps to ignore spurious patterns (correct)

Which of the following is NOT a characteristic of the data science process?

It is a linear, step-by-step process (correct)

Flashcards

Data Science Process

A series of steps used in data science, regardless of the specific problem, algorithm, or tool. Its goal is to answer analysis questions.

Prior Knowledge

Information already known about a subject, which is critical for understanding the problem's context and the data needed.

Objective of the problem

The specific analysis question or business goal that the data science process aims to answer.

Subject Area of the Problem

The specific field or context of the problem, helping to understand the data and avoid misleading interpretations.

Data Understanding

The step in the data science process that involves identifying, collecting, and analyzing datasets.

Data Science Tools

Software used to develop and implement data science algorithms, ranging from custom coding to specialized applications.

Data Science Process

A set of iterative activities to discover useful relationships and patterns in data.

Understanding the Problem

The initial step in the data science process, focusing on defining the objectives and requirements of the project.

Data Preparation

Preparing the data samples for use in the data science project.

Model Development

The step of building a data model or solution to the problem.

Model Application

Applying the model to a dataset to see how the model will perform in the real world.

Model Deployment and Maintenance

Putting the model into action and ensuring its ongoing performance.

CRISP-DM

Cross Industry Standard Process for Data Mining; a popular data science process framework.

Why Data Science is Important

Huge amounts of data need to be turned into useful information and knowledge.

SEMMA

Sample, Explore, Modify, Model, and Assess; another data science framework.

DMAIC

Define, Measure, Analyze, Improve, and Control; used in Six Sigma practice; another data science framework.

Study Notes

Fundamentals of Data Science

The methodical discovery of useful relationships and patterns in data is enabled by a set of iterative activities collectively known as the data science process.
The standard data science process involves:
- Understanding the problem
- Preparing data samples
- Developing a model
- Applying the model to a dataset
- Deploying and maintaining the models

Reference Books

Data Science: Concepts and Practice, by Vijay Kotu and Bala Deshpande, 2019
Data Science: Foundation & Fundamentals, by B. S. V. Vatika, L. C. Dabra, Gwalior, 2023

Lecture 2

Chapter 2: Data Science Process

Data Science Process

The methodical discovery of patterns and relationships in data is enabled by iterative activities collectively known as the data science process.
The standard data science process steps are:
- Understanding the problem
- Preparing the data samples
- Developing the model
- Applying the model
- Deploying and maintaining models

Prior Knowledge

Prior knowledge refers to information that is already known about a subject.
The prior knowledge step helps define the problem, its business context, and the data needed to solve it. The step has three components:
- Objective of the problem
- Subject area of the problem
- Data

Why Is It Important?

- Huge amounts of data are now widely available.
- That data must be transformed into useful information and knowledge.
- Data mining is a natural evolution of information technology.

Data science process frameworks

Cross Industry Standard Process for Data Mining (CRISP-DM) is a widely adopted framework for developing data science solutions.
Other frameworks include SEMMA (Sample, Explore, Modify, Model, and Assess) and DMAIC (Define, Measure, Analyze, Improve, and Control).

CRISP-DM process

CRISP-DM is a process model that describes the data science life cycle in six phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment.
The Business Understanding phase focuses on understanding the project objectives and customer needs.
Data Understanding focuses on identifying, collecting, and analyzing data sets.

Data Science Process (Generic Steps)

The fundamental objective of any data science process is to address the analysis question.
The technique used to answer the business question could be a decision tree, an artificial neural network, or even a simple scatterplot.
Software tools for developing and implementing data science algorithms include custom coding, RapidMiner, R, Weka, SAS, Oracle Data Miner, and Python.

Data Preparation

Preparing the dataset to suit a data science task is the most time-consuming part of the process.
Data is rarely available in the required format.
Data science algorithms primarily require data in a tabular format.
Data that arrives in any other format must first be transformed into tabular form.
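
A minimal sketch of such a transformation follows. It assumes hypothetical nested customer records (the field names and values are invented for illustration) and flattens each one into a single row with a fixed set of columns.

```python
# Hypothetical nested records, e.g. as parsed from JSON (illustrative data only).
records = [
    {"id": 1, "profile": {"age": 34, "city": "Pune"}, "orders": [120.0, 80.0]},
    {"id": 2, "profile": {"age": 51, "city": "Delhi"}, "orders": [40.0]},
]


def to_row(record):
    """Flatten one nested record into a flat dict (one table row)."""
    orders = record["orders"]
    return {
        "id": record["id"],
        "age": record["profile"]["age"],
        "city": record["profile"]["city"],
        "order_count": len(orders),   # variable-length list becomes a count
        "order_total": sum(orders),   # ... and an aggregate
    }


table = [to_row(r) for r in records]
columns = list(table[0].keys())

print(columns)
for row in table:
    print([row[c] for c in columns])
```

Every row now has the same columns, which is the shape most algorithms expect as input.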

Data Preparation Steps

Data Exploration
Data Quality
Handling Missing Values
Data Type Conversion
Transformation
Outliers
Feature Selection
Sampling

1-Data Exploration

Data exploration, also known as exploratory data analysis, uses simple tools to achieve a basic understanding of the data.
Data exploration approaches involve computing descriptive statistics and visualization of data.
These approaches expose data structure, value distribution, extreme values, and inter-relationships within the dataset.
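
As a rough illustration, the sketch below computes a few descriptive statistics and a crude text histogram with Python's standard library. The `ages` values are made up for the example, and the bin width of 20 is an arbitrary choice.

```python
import statistics

# Hypothetical attribute values (illustrative only).
ages = [23, 25, 31, 35, 35, 42, 47, 52, 58, 90]

# Descriptive statistics: central tendency, spread, and extreme values.
summary = {
    "count": len(ages),
    "min": min(ages),
    "max": max(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "stdev": statistics.stdev(ages),
}
for name, value in summary.items():
    print(f"{name:>6}: {value}")

# A crude text histogram (bin width 20) to show the value distribution.
bins = {}
for a in ages:
    start = (a // 20) * 20
    bins[start] = bins.get(start, 0) + 1
for start in sorted(bins):
    print(f"{start:>3}-{start + 19:<3} {'#' * bins[start]}")
```

The mean sits well above the median here, and the histogram shows a single value far from the rest, which is the kind of extreme value exploration is meant to surface.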

2-Data Quality

Data quality is essential throughout the data collection, processing, and storage lifecycle.
The accuracy and reliability of data are key.

3-Handling Missing Values

Missing attribute values are a common data quality problem.
Understanding why values are missing is critical for choosing a strategy such as imputation (for example, substituting the mean, minimum, or maximum value of the attribute).
Alternatively, dropping records with missing values simplifies the problem, at the cost of a smaller dataset.
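
The two strategies can be sketched as follows, assuming a hypothetical `income` attribute in which `None` marks a missing value. Mean imputation is used here only as one of the options mentioned above.

```python
import statistics

# Hypothetical records; None marks a missing income value.
rows = [
    {"id": 1, "income": 50.0},
    {"id": 2, "income": None},
    {"id": 3, "income": 70.0},
    {"id": 4, "income": 60.0},
]

# Option 1: impute missing values with the mean of the observed values.
observed = [r["income"] for r in rows if r["income"] is not None]
mean_income = statistics.mean(observed)
imputed = [
    {**r, "income": mean_income if r["income"] is None else r["income"]}
    for r in rows
]

# Option 2: drop every record that has a missing value.
dropped = [r for r in rows if r["income"] is not None]

print(len(imputed), "rows after imputation;", len(dropped), "rows after dropping")
```

Imputation keeps all four records, while dropping keeps only three. Which trade-off is acceptable depends on why the values are missing.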

4-Data Type Conversion

Data attributes can be continuous, integer numeric, or categorical.
Linear regression models require numeric input.
Categorical data may need to be converted to continuous numeric form.
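
One common way to do this conversion is one-hot encoding, sketched below. The source does not name a specific technique, so treat this as one option among several. The `colors` attribute is hypothetical.

```python
# Hypothetical categorical attribute (illustrative only).
colors = ["red", "green", "blue", "green"]

# Fix the category order so every row gets the same column layout.
categories = sorted(set(colors))


def one_hot(value, categories):
    """Return a list with 1.0 at the position of `value` and 0.0 elsewhere."""
    return [1.0 if value == c else 0.0 for c in categories]


encoded = [one_hot(c, categories) for c in colors]

print(categories)
for original, vector in zip(colors, encoded):
    print(original, vector)
```

Each category becomes its own numeric column, so a model that requires numeric input can use the attribute without implying an order among the categories.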

5-Transformation

Some algorithms require numeric and normalized input.
Normalization can prevent one attribute from dominating distance calculations due to large values.
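
The effect can be seen in a small sketch that uses min-max scaling, one common normalization, chosen here as an assumption. The three customers and their values are invented for illustration.

```python
import math

# Hypothetical customers: (age in years, income in currency units).
a = (25, 50_000)
b = (60, 51_000)
c = (26, 58_000)


def min_max(values):
    """Rescale numbers to the range [0, 1]; assumes they are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


points = [a, b, c]
ages = min_max([p[0] for p in points])
incomes = min_max([p[1] for p in points])
scaled = list(zip(ages, incomes))

# Euclidean distances before and after scaling.
raw_ab = math.dist(a, b)
raw_ac = math.dist(a, c)
norm_ab = math.dist(scaled[0], scaled[1])
norm_ac = math.dist(scaled[0], scaled[2])

print(f"raw:    d(a,b)={raw_ab:.1f}  d(a,c)={raw_ac:.1f}")
print(f"scaled: d(a,b)={norm_ab:.3f}  d(a,c)={norm_ac:.3f}")
```

In raw units the 35-year age gap between `a` and `b` barely registers, because the distance is almost entirely the income difference. After scaling, both attributes lie in [0, 1] and contribute on comparable terms.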

6-Outliers

Outliers are abnormal data points in a dataset.
They may result from correct data capture (genuinely unusual values) or from errors in data capture.
Either way, outliers need to be identified and given special treatment.
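
A minimal sketch of outlier identification follows, using the 1.5 × IQR rule. This rule is a widely used convention rather than something the source prescribes, and the `values` list is invented for illustration.

```python
import statistics

# Hypothetical measurements with one suspicious value (illustrative only).
values = [12, 14, 14, 15, 16, 17, 18, 19, 20, 95]

# Quartiles and interquartile range (IQR).
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Flag anything more than 1.5 * IQR outside the middle 50% of the data.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower or v > upper]

print(f"bounds: [{lower}, {upper}]")
print("outliers:", outliers)
```

The code only flags candidates. Whether a flagged value is an error to correct or a genuine observation to keep is a judgment that depends on the subject area.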

7-Feature Selection

A large number of attributes can increase the complexity of a model and negatively impact performance.
Not all attributes are equally important for prediction.
Attribute selection is a critical step.
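
As one simple illustration, attributes can be ranked by the absolute value of their Pearson correlation with the target, keeping only the top few. This filter approach is an assumption made for the sketch, not a method named in the source, and the housing data is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical dataset: three candidate attributes and a target (price).
features = {
    "size_sqm": [50, 60, 80, 100, 120],
    "rooms": [2, 2, 3, 4, 4],
    "door_number": [7, 3, 9, 1, 5],
}
price = [150, 180, 240, 310, 360]

# Score each attribute, then keep the two most strongly correlated ones.
scores = {name: abs(pearson(col, price)) for name, col in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
selected = ranked[:2]

for name in ranked:
    print(f"{name:<12} {scores[name]:.3f}")
print("selected:", selected)
```

Correlation only captures linear, one-attribute-at-a-time relationships, so a ranking like this is a starting point rather than a final answer.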

8-Sampling

Sampling selects a representative subset of data.
Sampling can significantly speed up the process of building prediction models.
In theory, sampling introduces some error into the model, but this error is manageable with appropriate sampling techniques.
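
A minimal sketch of simple random sampling without replacement is shown below. The population, the 10% sample size, and the seed are arbitrary choices made for the example.

```python
import random

# Hypothetical population of 1,000 record values (illustrative only).
population = list(range(1, 1001))

# A seeded generator makes the sample reproducible.
rng = random.Random(42)
sample = rng.sample(population, k=100)  # 10% sample, no repeats

pop_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

print(f"population mean: {pop_mean}")
print(f"sample mean:     {sample_mean}")
```

Comparing a summary statistic of the sample with that of the full data is a quick check on how representative the sample is. When classes are imbalanced, a stratified sample may be a better fit than a simple random one.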
