Questions and Answers
What is the first step in the data science process?
What is the primary goal of the data science process?
Which of the following frameworks is considered one of the most popular for developing data science solutions?
In the context of data science, what does the 'Application' phase entail?
Why is the objective of the problem considered the most important step in the data science process?
What is the purpose of the Business Understanding phase in the CRISP-DM process?
What is meant by 'prior knowledge' in the data science process?
What does the acronym SEMMA stand for in data science frameworks?
What role does understanding the subject area of the problem play in the data science process?
Which of the following is NOT a characteristic of the data science process?
Study Notes
Fundamentals of Data Science
- The methodical discovery of useful relationships and patterns in data is enabled by a set of iterative activities collectively known as the data science process.
- The standard data science process involves:
  - Understanding the problem
  - Preparing data samples
  - Developing a model
  - Applying the model to a dataset
  - Deploying and maintaining the models
Reference Books
- Data Science: Concepts and Practice, by Vijay Kotu and Bala Deshpande, 2019
- DATA SCIENCE: FOUNDATION & FUNDAMENTALS, by B. S. V. Vatika, L. C. Dabra, Gwalior, 2023
Lecture 2
Chapter 2: Data Science Process
Data Science Process
- The methodical discovery of patterns and relationships in data is enabled by iterative activities collectively known as the data science process.
- The standard data science process steps are:
  - Understanding the problem
  - Preparing the data samples
  - Developing the model
  - Applying the model
  - Deploying and maintaining models
Prior Knowledge
- Prior knowledge refers to information already known about a subject.
- The prior knowledge step helps define the problem, its business context, and the data needed to solve it.
- The components of the prior knowledge step are:
  - Objective of the problem
  - Subject area of the problem
  - Data
Why Is It Important?
- Huge amounts of data are now widely available.
- That data has to be transformed into useful information and knowledge.
- Data mining is a natural evolution of information technology.
Data science process frameworks
- Cross Industry Standard Process for Data Mining (CRISP-DM) is a widely adopted framework for developing data science solutions.
- Other frameworks include SEMMA (Sample, Explore, Modify, Model, and Assess) and DMAIC (Define, Measure, Analyze, Improve, and Control).
CRISP-DM process
- CRISP-DM is a process model with six phases that naturally describes the data science life cycle.
- The Business Understanding phase focuses on understanding the project objectives and customer needs.
- Data Understanding focuses on identifying, collecting, and analyzing data sets.
Data Science Process (Generic Steps)
- The fundamental objective of any data science process is to address the analysis question.
- The technique used to answer the business question could be a learning algorithm such as a decision tree or an artificial neural network, or something as simple as a scatterplot.
- Software tools for developing and implementing data science algorithms include custom coding, RapidMiner, R, Weka, SAS, Oracle Data Miner, and Python.
Data Preparation
- Preparing the dataset to suit a data science task is the most time-consuming part of the process.
- Data is rarely available in the required format.
- Data science algorithms primarily require data in a tabular format.
- Data that arrives in any other format must first be transformed into a tabular structure.
Data Preparation Steps
- Data Exploration
- Data Quality
- Handling Missing Values
- Data Type Conversion
- Transformation
- Outliers
- Feature Selection
- Sampling
1-Data Exploration
- Data exploration, also known as exploratory data analysis, uses simple tools to achieve a basic understanding of the data.
- Data exploration approaches involve computing descriptive statistics and visualization of data.
- These approaches expose data structure, value distribution, extreme values, and inter-relationships within the dataset.
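As a minimal sketch of the descriptive-statistics side of data exploration, the snippet below summarizes a single numeric attribute using only Python's standard library. The `incomes` list and the `describe` helper are hypothetical, made up for illustration rather than taken from the lecture.

```python
import statistics

# Hypothetical attribute values (annual income in thousands); illustrative only.
incomes = [42, 38, 51, 47, 39, 44, 250, 41, 46, 43]


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


summary = describe(incomes)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

A mean that sits well above the median, as it does here, is the kind of signal that points to an extreme value worth inspecting.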
2-Data Quality
- Data quality is essential throughout the data collection, processing, and storage lifecycle.
- The accuracy and reliability of data are key.
3-Handling Missing Values
- Missing attribute values are a common data quality problem.
- Understanding why values are missing is critical for choosing a strategy such as imputation, in which a missing value is replaced with, for example, the mean, minimum, or maximum value of the attribute.
- Dropping records with missing values can simplify the problem.
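The two strategies above, imputing and dropping, could be sketched as follows. This is an illustration under assumptions: the `records` data and the helper names `impute_mean` and `drop_missing` are hypothetical, and mean imputation is only one of the options the notes mention.

```python
import statistics

# Hypothetical records; None marks a missing "age" value.
records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 3, "age": 28},
    {"id": 4, "age": None},
    {"id": 5, "age": 40},
]


def impute_mean(rows, key):
    """Return copies of rows with missing `key` values replaced by the mean of the known ones."""
    known = [row[key] for row in rows if row[key] is not None]
    fill = statistics.mean(known)
    return [{**row, key: fill if row[key] is None else row[key]} for row in rows]


def drop_missing(rows, key):
    """Return only the rows in which `key` is present."""
    return [row for row in rows if row[key] is not None]


imputed = impute_mean(records, "age")
dropped = drop_missing(records, "age")
print(len(imputed), "rows after imputation;", len(dropped), "rows after dropping")
```

Imputation keeps every record but introduces estimated values, while dropping keeps only observed values but shrinks the dataset.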
4-Data Type Conversion
- Data attributes can be continuous, integer numeric, or categorical.
- Linear regression models require numeric input.
- Categorical data may need to be converted to continuous numeric form.
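The notes do not say which conversion to use, so the sketch below shows two common options as assumptions: one-hot encoding for categories with no natural order, and an integer mapping for ordered categories. The `colors` and `ratings` data, the `rating_scale` values, and the `one_hot_encode` helper are all hypothetical.

```python
def one_hot_encode(values):
    """Encode each categorical value as a 0/1 vector with one position per category."""
    categories = sorted(set(values))
    encoded = [[1 if value == category else 0 for category in categories] for value in values]
    return categories, encoded


# Unordered categories: one-hot encoding avoids implying an order.
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)

# Ordered categories: a hand-chosen numeric scale (the numbers are an assumption).
rating_scale = {"poor": 1, "fair": 2, "good": 3}
ratings = ["good", "poor", "fair"]
ordinal = [rating_scale[rating] for rating in ratings]

print(categories, encoded)
print(ordinal)
```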
5-Transformation
- Some algorithms, particularly distance-based methods such as k-nearest neighbors, require numeric and normalized input.
- Normalization can prevent one attribute from dominating distance calculations due to large values.
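To make the point about distance calculations concrete, here is a small sketch that applies min-max scaling, one common normalization and an assumption here, since the notes do not name a method. The `people` records and the `min_max_normalize` helper are hypothetical.

```python
import math

# Hypothetical records: (age in years, income in dollars).
people = [(25, 40_000), (60, 41_000), (26, 90_000)]


def min_max_normalize(rows):
    """Rescale every column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        tuple(
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        )
        for row in rows
    ]


scaled = min_max_normalize(people)

# Euclidean distance between the first two records, before and after scaling.
raw_distance = math.dist(people[0], people[1])
scaled_distance = math.dist(scaled[0], scaled[1])
print(f"raw: {raw_distance:.1f}, scaled: {scaled_distance:.4f}")
```

In the raw data the distance between the first two records is driven almost entirely by the 1,000-dollar income gap, even though the 35-year age gap spans the whole age range. After scaling, the age gap accounts for most of the distance.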
6-Outliers
- Outliers are abnormal data points in a dataset.
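- An outlier may be a correctly captured but genuinely unusual value, or the result of an error in data capture.
- Outliers need to be identified and understood, because they usually call for special treatment before modeling.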
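One widely used way to flag candidate outliers is the interquartile range (IQR) rule, sketched below. The notes do not prescribe a detection method, so the rule itself, the multiplier `k=1.5`, the `readings` data, and the `iqr_outliers` helper are all assumptions made for illustration.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below the first or above the third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [value for value in values if value < low or value > high]


# Hypothetical attribute values; illustrative only.
readings = [42, 38, 51, 47, 39, 44, 250, 41, 46, 43]
outliers = iqr_outliers(readings)
print(outliers)
```

Flagging a value is only the first step. Whether it is then corrected, removed, or kept depends on whether it reflects a capture error or a real observation.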
7-Feature Selection
- A large number of attributes can increase the complexity of a model and negatively impact performance.
- Not all attributes are equally important for prediction.
- Attribute selection is a critical step.
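As a rough illustration of ranking attributes by importance, the sketch below scores each attribute by the absolute Pearson correlation with the target and keeps the top two. This simple filter is an assumption, not the method from the lecture, and it ignores redundancy between attributes. The `features` and `price` data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical housing data: three candidate attributes and a target.
features = {
    "size_sqm": [50, 70, 90, 110, 130],
    "rooms": [2, 3, 3, 4, 5],
    "door_number": [7, 3, 9, 1, 5],
}
price = [150, 200, 260, 310, 370]

# Rank attributes from most to least correlated with the target.
ranked = sorted(
    features,
    key=lambda name: abs(pearson(features[name], price)),
    reverse=True,
)
selected = ranked[:2]
print(ranked, selected)
```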
8-Sampling
- Sampling selects a representative subset of data.
- Sampling can significantly speed up the process of building prediction models.
- In theory, sampling introduces some error into the model, but that error is manageable with appropriate sampling techniques.
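A minimal sketch of simple random sampling without replacement follows, using the standard library's `random` module. The `population` data, the 10% fraction, and the `simple_random_sample` helper are hypothetical, and other schemes such as stratified sampling may be more appropriate for skewed data.

```python
import random
import statistics


def simple_random_sample(rows, fraction, seed=0):
    """Draw a reproducible random sample, without replacement, of the given fraction of rows."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    size = max(1, round(len(rows) * fraction))
    return rng.sample(rows, size)


# Hypothetical population of 1,000 numeric records.
population = list(range(1000))
sample = simple_random_sample(population, 0.1)

print("population mean:", statistics.mean(population))
print("sample mean:", statistics.mean(sample))
```

Comparing a summary statistic of the sample against the full dataset, as the two printed means do, is a quick check on how representative the sample is.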
Description
This quiz covers the essential steps involved in the data science process as outlined in Chapter 2. You will explore how to understand problems, prepare data, develop models, and apply those models effectively. Test your knowledge on the methodical discovery of patterns and relationships in data!