Data Science Fundamentals Lecture 2

Questions and Answers

What is the first step in the data science process?

  • Preparing the data samples
  • Developing the model
  • Applying the model on a dataset
  • Understanding the problem (correct)

    The CRISP-DM process is primarily focused on the data collection phase.

    False

    Name one other data science framework aside from CRISP-DM.

    SEMMA or DMAIC

    The __________ phase in CRISP-DM focuses on understanding the objectives and requirements of the project.

    Business Understanding

    Match the following data science frameworks with their descriptions:

    CRISP-DM = Cross Industry Standard Process for Data Mining
    SEMMA = Sample, Explore, Modify, Model, and Assess
    DMAIC = Define, Measure, Analyze, Improve, Control
    KDD = Knowledge Discovery in Databases

    What is the primary purpose of data mining in the context of data science?

    To discover useful relationships and patterns in data

    The data science process is a linear sequence of steps that does not require iteration.

    False

    What is the primary purpose of data exploration?

    To summarize and visualize data

    List the five main steps of the standard data science process.

    Understanding the problem, Preparing the data samples, Developing the model, Applying the model, Deploying and maintaining the models.

    Data cleansing is only required before storing data into a data warehouse.

    False

    What is one common data quality issue faced in datasets?

    Missing attribute values

    Descriptive statistics like mean, median, and mode provide a _______ summary of the data.

    readable

    Match the following data quality practices with their descriptions:

    Data Alerts = Notify users of abnormal data values
    Cleansing = Removal of duplicate records
    Transformation = Changing the format of data values
    Outliers handling = Quarantining data that exceed expected bounds

    Which statistical method helps highlight the relationship between credit score and interest rate?

    Scatterplot

    Data sourced from well-maintained data warehouses generally have lower quality.

    False

    What should be the first step in managing missing values?

    Understanding the reason behind missing values

    What is the primary objective of the data science process?

    To address the analysis question

    Prior knowledge in the data science process is not essential to define the problem.

    False

    What are some software tools used in the data science process?

    RapidMiner, R, Weka, SAS, Oracle Data Miner, Python

    The ________ area of the problem helps uncover hidden patterns in the dataset.

    subject

    Match the following components of the data science process with their descriptions:

    Objective of the problem = The need for data analysis
    Subject area = Domain knowledge relevant to the data
    Prior knowledge = Information that is already known about a topic
    Algorithms = Methods for solving a business question

    Which of the following is a common issue faced during the data science process?

    False or spurious signals

    The data science process is a fixed series of steps without any iterations.

    False

    What must be carefully defined as the first step in the data science process?

    The objective of the problem

    What is the purpose of assessing the available data in the data science process?

    To determine if new data needs to be sourced

    A dataset is always available in the format required by data science algorithms.

    False

    What is the term used for a single instance in a dataset?

    data point

    A collection of data with a defined structure is known as a __________.

    dataset

    Match the following terms with their definitions:

    Dataset = A collection of data with a defined structure
    Data point = A single instance in a dataset
    Identifier = Special attributes used for locating records
    Label = The special attribute to be predicted

    Which of the following factors should be considered when assessing data for a business question?

    Quality, quantity, and gaps in data

    Identifiers in a dataset are used to show the relationships between different attributes.

    False

    What is the most time-consuming part of the data science process?

    Data preparation

    What is one primary application of detecting outliers in data science?

    Fraud detection

    A large number of attributes in a dataset can improve the performance of a model.

    False

    What is the purpose of sampling in data analysis?

    To select a subset of records that represent the original dataset.

    What is one method used to handle missing credit score values?

    Replace with the maximum value in the dataset

    Sampling reduces the amount of data that needs to be processed and speeds up the __________ process of modeling.

    build

    Match the following terms with their definitions:

    Feature selection = Identifying important attributes for prediction
    Curse of dimensionality = Degradation of model performance due to too many features
    Outlier detection = Identifying data points that are significantly different from others
    Sampling = Selecting a subset of data to represent a larger dataset

    Ignoring records with missing values increases the size of the dataset.

    False

    What is the process called that converts numeric values to categorical data types?

    Binning

    Normalization helps prevent one attribute from dominating the _______ results.

    distance

    Which of the following is considered an outlier?

    A human height recorded as 1.73 cm instead of 1.73 m

    All data science algorithms require input attributes to be numeric.

    False

    What attribute type must be transformed for linear regression models?

    Categorical (it must be converted to numeric)

    Match the following terms with their definitions:

    Outliers = Anomalies in the dataset
    Normalization = Scaling attributes to a similar range
    Binning = Converting numeric data into categories
    Missing values = Absences of data points in a dataset

    Study Notes

    Fundamentals of Data Science

    • The methodical discovery of useful relationships and patterns in data is enabled by a series of iterative activities known as the data science process.
    • The standard data science process includes: understanding the problem, preparing data samples, developing a model, applying the model to a dataset to see how it works in the real world, deploying the model, and maintaining it.
    • Lecture 2 covers the data science process.

    Reference Books

    • Data Science: Concepts and Practice by Vijay Kotu and Bala Deshpande (2019)
    • Data Science: Foundation & Fundamentals by B. S. V. Vatika and L. C. Dabra (2023)

    Data Science Process Frameworks

    • CRISP-DM (Cross Industry Standard Process for Data Mining).
    • SEMMA (Sample, Explore, Modify, Model, and Assess).
    • DMAIC (Define, Measure, Analyze, Improve, and Control).
    • These are frameworks for data science solutions.

    CRISP-DM Process

    • CRISP-DM is a process model with six phases that naturally describes the data science life cycle.
    • Helps plan, organize, and implement data science (or machine learning) projects.
    • Business Understanding phase focuses on understanding the customer's needs.
    • Data Understanding phase focuses on identifying, collecting, and analyzing data sets.

    Data Science Process (General)

    • The fundamental objective of any data science process is to answer the analysis question.
    • The technique used to answer a business question can be a learning algorithm such as a decision tree or an artificial neural network, or a simpler analysis such as a scatterplot.
    • Software tools for developing and implementing data science algorithms include custom coding, RapidMiner, R, Weka, SAS, Oracle Data Miner, and Python.

    Prior Knowledge

    • Prior knowledge is information already known about a subject.
    • This step of the data science process helps define the problem, its context, and the data that is needed.
    • Essential factors to gain information on are:
      • Objective of the problem
      • Subject area of the problem
      • Data
    • The objective of the data science process starts with the need for analysis: a question or a business objective.
    • It's imperative to get this first step right.
    • Because the process is iterative, it is common to go back to previous steps and revise assumptions.
    • The process of data science uncovers hidden patterns and relationships.
    • Identifying and isolating the useful patterns from the large number of patterns uncovered is crucial.

    Data

    • Survey all the available data to determine what new data, if any, needs to be sourced.
    • Data quality is a crucial factor, along with data quantity, availability, and gaps; a lack of suitable data may force the business question to be revised.
    • Data can be collected, stored, transformed, reported, and used.
    • A dataset is a collection of data with a defined structure (like rows and columns).
    • A data point is a single instance, record, or object in a dataset.
    • Each row in the dataset is a data point.
    • Identifiers (primary keys) are attributes used for locating and providing context to individual records (e.g., names, accounts, employee IDs); a small sketch of these terms follows.
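
    A minimal sketch, using pandas (Python is one of the tools listed in these notes), of how the terms above map onto a tabular dataset; the table, column names, and values are hypothetical.

    ```python
    import pandas as pd

    # A dataset: a collection of data with a defined structure (rows and columns).
    # Each row is a data point (a single instance or record).
    loans = pd.DataFrame({
        "employee_id":   [101, 102, 103],   # identifier: locates the record, not used for modeling
        "credit_score":  [620, 710, 585],   # ordinary attribute
        "interest_rate": [7.2, 5.1, 8.4],   # label: the special attribute to be predicted
    })

    print(loans.shape)    # (3, 3): three data points, three attributes
    print(loans.iloc[0])  # a single data point (one row)
    ```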

    Data Preparation

    • Preparing the dataset is a critical, time-consuming part of the data science process.
    • Data often isn't in the required format for algorithms.
    • The data is typically structured in a tabular format.
    • Raw data needs to be converted to the required format using operations such as pivot, type conversion, join, and transpose (see the sketch below).
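
    A minimal sketch, assuming two small hypothetical tables, of the kinds of conversion mentioned above (type conversion, join, and pivot) using pandas; it is illustrative, not the lecture's own example.

    ```python
    import pandas as pd

    customers = pd.DataFrame({"customer_id": [1, 2], "region": ["north", "south"]})
    payments  = pd.DataFrame({"customer_id": [1, 1, 2], "amount": ["100.5", "99.0", "42.0"]})

    # Type conversion: amounts arrive as strings, but algorithms need numeric values.
    payments["amount"] = payments["amount"].astype(float)

    # Join: combine the two raw tables into one tabular dataset.
    dataset = payments.merge(customers, on="customer_id", how="left")

    # Pivot: total amount per customer, one column per region.
    summary = dataset.pivot_table(index="customer_id", columns="region",
                                  values="amount", aggfunc="sum")
    print(summary)
    ```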

    Data Preparation Details

    • Data exploration, data quality, handling missing values, data type conversion, transformation, outliers, feature selection, and sampling are essential steps.

    Data Exploration

    • Data exploration is also called exploratory data analysis (EDA).
    • Provides simple tools for basic data understanding.
    • Involves calculating descriptive statistics and creating visualizations to understand the data's structure, distributions, extremes, and interrelationships.
    • Descriptive statistics like mean, median, mode, standard deviation, range summarize key characteristics.
    • Scatterplots of pairs of variables can reveal relationships between them, as sketched below.
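
    A minimal data exploration sketch, assuming a hypothetical loans table with credit_score and interest_rate columns (echoing the credit score vs. interest rate example from the questions above): descriptive statistics plus a scatterplot.

    ```python
    import pandas as pd
    import matplotlib.pyplot as plt

    loans = pd.DataFrame({
        "credit_score":  [585, 620, 660, 710, 745, 790],
        "interest_rate": [8.4, 7.2, 6.5, 5.1, 4.6, 4.0],
    })

    # Descriptive statistics: mean, standard deviation, min/max, and quartiles in one readable summary.
    print(loans.describe())
    print("median:\n", loans.median())

    # Scatterplot: highlights the (negative) relationship between credit score and interest rate.
    loans.plot.scatter(x="credit_score", y="interest_rate")
    plt.show()
    ```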

    Data Quality

    • Data quality is important in all data collection, processing, and storage.
    • Data in a dataset must be accurate.
    • Data sourced from well-maintained data warehouses is generally of higher quality because of the controls applied during collection and storage.
    • Data cleansing techniques include eliminating duplicates, handling outliers, standardizing attribute values, replacing missing values.
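
    A minimal sketch of a few of the cleansing practices listed above (removing duplicates, standardizing attribute values, and quarantining out-of-bounds values), applied with pandas to a hypothetical table.

    ```python
    import pandas as pd

    raw = pd.DataFrame({
        "customer":     ["Ann", "Ann", "Bob", "Cat"],
        "risk":         ["LOW", "LOW", "Medium", "high"],
        "credit_score": [620, 620, 710, 9999],      # 9999 is outside the expected range
    })

    clean = raw.drop_duplicates().copy()            # eliminate duplicate records
    clean["risk"] = clean["risk"].str.lower()       # standardize attribute values
    in_bounds = clean["credit_score"].between(300, 850)
    quarantine = clean[~in_bounds]                  # quarantine values that exceed expected bounds
    clean = clean[in_bounds]
    print(clean)
    print(quarantine)
    ```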

    Handling Missing Values

    • One of the most common data quality issues is missing attribute values.
    • Techniques used to manage missing values include replacing them with values derived from the dataset (e.g., the mean, minimum, or maximum).
    • Alternatively, records with missing values can be ignored when building the model, although this reduces the size of the dataset (see the sketch below).
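
    A minimal sketch of the two options above (replace with a derived value, or drop the record), assuming a hypothetical credit_score column with one missing entry.

    ```python
    import numpy as np
    import pandas as pd

    loans = pd.DataFrame({"credit_score": [620, np.nan, 710, 585]})

    # Option 1: replace the missing value with a value derived from the dataset (here, the mean).
    filled = loans.fillna({"credit_score": loans["credit_score"].mean()})

    # Option 2: ignore records with missing values; note that the dataset shrinks.
    dropped = loans.dropna()

    print(filled)
    print(len(loans), "records ->", len(dropped), "records after dropping")
    ```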

    Data Type Conversion

    • Attributes in data sets can be various types (e.g., continuous numeric, integer numeric, categorical).
    • Different algorithms expect different data types.
    • Categorical data (e.g., low, med, high) needs conversion to numeric for models like linear regression.
    • Binning is a technique for grouping numeric values into categories (e.g., < 500 = low, 500 to 700 = med, > 700 = high).
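
    A minimal sketch of binning a numeric credit score into the low/med/high categories from the example above, using pandas.cut; the edge handling (whether 500 and 700 fall in the lower or upper bin) is an assumption.

    ```python
    import pandas as pd

    scores = pd.Series([450, 520, 680, 730, 810])

    # < 500 = low, 500 to 700 = med, > 700 = high
    categories = pd.cut(scores,
                        bins=[float("-inf"), 500, 700, float("inf")],
                        labels=["low", "med", "high"])
    print(pd.DataFrame({"score": scores, "category": categories}))
    ```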

    Transformation

    • Algorithms like k-nearest neighbor (k-NN) expect numeric and normalized input attributes.
    • Attributes with large value ranges can dominate distance calculations and distort the results.
    • Normalization converts ranges to a uniform scale (e.g., 0 to 1).
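
    A minimal sketch of min-max normalization to the 0 to 1 range, so that an attribute with a large range (income, in this hypothetical table) does not dominate the distance calculations used by k-NN.

    ```python
    import pandas as pd

    data = pd.DataFrame({
        "credit_score": [585, 660, 745],        # range of a few hundred
        "income":       [28000, 52000, 90000],  # much larger range; would dominate raw distances
    })

    # Min-max normalization: rescale every attribute to [0, 1].
    normalized = (data - data.min()) / (data.max() - data.min())
    print(normalized)
    ```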

    Outliers

    • Outliers are unusual data points.
    • Outliers may be genuine values or the result of erroneous data capture.
    • Their presence needs to be understood and may require special treatment.
    • Outliers are sometimes the focus of the analysis itself, as in fraud detection (see the sketch below).
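
    A minimal sketch of flagging outliers with a simple z-score rule (one common technique, not necessarily the one covered in the lecture); the mis-captured height mirrors the 1.73 cm vs. 1.73 m example from the questions above.

    ```python
    import pandas as pd

    heights_m = pd.Series([1.73, 1.80, 1.65, 1.91, 1.76,
                           1.68, 1.85, 1.70, 1.78, 0.0173])  # last value captured in cm by mistake

    # Flag points more than two standard deviations from the mean.
    z = (heights_m - heights_m.mean()) / heights_m.std()
    outliers = heights_m[z.abs() > 2]
    print(outliers)   # only the erroneous 0.0173 is flagged
    ```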

    Feature Selection

    • Datasets may have hundreds or thousands of attributes.
    • A large number of attributes increases model complexity and may degrade model performance (the curse of dimensionality).
    • Not all attributes are equally important for predicting the target variable; feature selection identifies the ones that are (a simple heuristic is sketched below).
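
    A minimal sketch of one simple feature-selection heuristic: ranking candidate attributes by their absolute correlation with the label. This is an illustrative approach with hypothetical columns, not necessarily the method discussed in the lecture.

    ```python
    import pandas as pd

    data = pd.DataFrame({
        "credit_score":  [585, 620, 660, 710, 745, 790],
        "age":           [25, 41, 33, 52, 47, 38],
        "postcode":      [1001, 2002, 3003, 4004, 5005, 6006],
        "interest_rate": [8.4, 7.2, 6.5, 5.1, 4.6, 4.0],   # the label to be predicted
    })

    # Rank attributes by |correlation| with the label and keep the strongest ones.
    correlations = data.corr()["interest_rate"].drop("interest_rate").abs()
    print(correlations.sort_values(ascending=False).head(2))
    ```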

    Sampling

    • Sampling involves selecting a subset of data for use in data analysis/modeling.
    • Samples should represent the original dataset (similar properties).
    • Sampling reduces processing needed and speeds up building models.
    • Sampling is often crucial for obtaining insights and building representative models.
    • In theory, a model built on a sample may be slightly less accurate than one built on the full dataset, but in practice the benefits usually outweigh this risk (see the sketch below).
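
    A minimal sketch of drawing a simple random sample with pandas; the 10% fraction and the random_state are arbitrary choices.

    ```python
    import pandas as pd

    full = pd.DataFrame({"credit_score": range(300, 850)})

    # Select a representative 10% subset to speed up the model-build process.
    sample = full.sample(frac=0.10, random_state=42)

    print(len(full), "records ->", len(sample), "records in the sample")
    # The sample should have properties similar to the original dataset:
    print(full["credit_score"].mean(), "vs", sample["credit_score"].mean())
    ```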

    Description

    This quiz assesses your understanding of the data science process, including key frameworks like CRISP-DM, SEMMA, and DMAIC. It covers the iterative activities involved in discovering patterns and relationships within data. Prepare to demonstrate your knowledge of these essential concepts in data science.
