CRISP-DM for Data Science Projects

Questions and Answers

What is a key distinction between traditional data mining processes and contemporary data science projects?

  • Data mining emphasizes data-driven approaches, while data science relies solely on business goals.
  • Data mining focuses on exploratory analysis, while data science is goal-oriented.
  • Data mining is exclusive to academic research, while data science is tailored for business applications.
  • Data mining starts from precise business goals, while data science projects are more exploratory. (correct)

Which of the following best describes the role of exploratory activities in data science projects?

  • They are applied to projects that lack clear objectives.
  • They are central, helping to discover new data sources and potential value. (correct)
  • They are exclusively used for data cleaning and preparation.
  • They are minimized to maintain focus on pre-defined business objectives.

In the context of data science trajectories (DST), what does the term 'backwards compatible' signify in relation to CRISP-DM?

  • DST incorporates all steps of CRISP-DM, while allowing for additional flexibility. (correct)
  • DST and CRISP-DM are the same.
  • CRISP-DM incorporates all steps of DST, while allowing for less flexibility.
  • DST is the foundational parent of CRISP-DM.

How does the Data Science Trajectories (DST) model aim to improve upon existing data mining methodologies?

  • By incorporating exploratory activities common in data science but not covered by CRISP-DM. (correct)

Which of the following is an example of how data may have multiple uses beyond its original collection context in contemporary data science?

  • Data collected by an electronic payment system is used by a multinational company to determine the location for a new store. (correct)

What does the 'data architecting' activity involve within the context of data management activities in data science?

  • Designing the logical and physical layout of data and integrating different data sources. (correct)

What is the significance of representing data science activities as a 'trajectory' rather than a linear process?

  • Trajectories show iterative or backtracking steps. (correct)

How can the Data Science Trajectories (DST) model assist project managers in planning data science projects?

  • By enabling clear separation between data management, CRISP-DM, and exploratory activities, each with different time and cost characteristics. (correct)

In the context of the tourism recommender system described, what is an example of the 'data value exploration' activity?

  • Deciding to use user location and history records as relevant data. (correct)

Within the Environmental Simulator example, which activity creates data that is difficult or expensive to collect in the real world?

  • Data simulation. (correct)

In comparing the discussed data science activities to the scientific method, what is a key difference highlighted in the text?

  • Data science is more flexible. (correct)

In the context of automation in data science, what is a noted limitation?

  • While some parts, such as modelling (AutoML), can be automated, many other parts escape automation. (correct)

What trend raises ethical concerns within data-driven initiatives?

  • The growing incentives behind many data science projects, which focus on monetizing data by exploring new products. (correct)

In the discussion of activity types for project management, what is recognized as requiring more expert data scientists and increased time/cost uncertainty?

  • Exploratory activities. (correct)

What is a potential benefit of utilizing Data Science Trajectories (DST) over solely relying on the CRISP-DM model?

  • DST supports designing new projects with added exploration, as well as examining current workflows. (correct)

Flashcards

What is CRISP-DM?

CRISP-DM is a methodology to catalogue and guide the most common steps in data mining projects.

Data science activities

Data science projects can involve data simulation, narrative exploration and more.

Data Science Trajectories model (DST)

A visual representation of activities in a data science project.

DST Map

A flexible model that identifies common routes of data science projects.

Knowledge Discovery in Databases (KDD)

The overall process of knowledge discovery from data, including how data is stored and accessed, how algorithms can be scaled to massive datasets and still run efficiently, and how results are interpreted and visualised.

Six steps of CRISP-DM

Business understanding, data understanding, data preparation, modelling, evaluation, and deployment.

SEMMA

Sample, Explore, Modify, Model and Assess.

6σ (Six Sigma) approach

Emphasises measurement and statistical control techniques for quality and excellence in management.

CRISP-DM

Incorporates principles and ideas from most of the aforementioned methodologies, while also forming the basis for many later proposals.

Goal exploration

Finding business goals which can be achieved in a data-driven way

Data source exploration

Finding new and valuable sources of data

Data value exploration

Finding out what value might be extracted from the data

Result exploration

Relating data science results to the business goals

Study Notes

  • CRISP-DM originated in the second half of the nineties.
  • It is still considered the de facto standard for data mining and knowledge discovery projects.
  • The term "data science" is now favored over "data mining".
  • The paper investigates whether, and in which contexts, CRISP-DM is still fit for data science projects.
  • The process model view still holds if the project is goal-directed and process-driven.
  • A more flexible model is called for when data science projects become more exploratory.
  • Seven real-life exemplars are examined where exploratory activities play an important role.
  • They are compared against 51 use cases extracted from the NIST Big Data Public Working Group.
  • The categorization can help project planning in terms of time and cost characteristics.

Introduction

  • Toward the end of the previous century, companies and institutions joined forces to identify good practices and common mistakes.
  • With funding from the European Union, a team of data mining engineers developed a broadly applicable data mining methodology.
  • In 1999 the first version of CRISP-DM was introduced, designed to catalogue and guide common steps in data mining projects.
  • Over the last two decades, the use of electronic devices and sensors, social networks, and data storage has grown.
  • This growth has increased the opportunities for extracting knowledge through data mining projects.
  • The diversity of data has increased in origin, format, and modalities.
  • The variety of techniques coming from machine learning, data management, visualization, and causal inference has increased.
  • Data can be monetized in many more ways, through new applications, interfaces, and business models.
  • The area of deriving value from data has grown exponentially in size and complexity.
  • In data science, unlike traditional data mining, data-driven and knowledge-driven stages interact.
  • Traditional data mining starts from business goals that translate into a clear data mining task.
  • New methodologies have been proposed to accommodate some of the changes.
  • IBM introduced ASUM-DM [3], and SAS introduced SEMMA [4].
  • The original CRISP-DM model can still be recognized in more recent proposals.
  • It remains focused on the traditional paradigm of a sequential list of stages from data to knowledge.
  • The central question: after twenty years, to what extent do the original CRISP-DM and the underlying data mining paradigm remain applicable to the much wider range of data science projects seen today?

Limitations of the original CRISP-DM

  • The limitations of the original CRISP-DM and other related methodologies are recognised, given the diversity of data science projects today.
  • Exploratory activities that are common in data science but not covered by CRISP-DM are identified.
  • Popular trajectories in this activity space are identified, describing common practices in data science.
  • Popular trajectories can be used as templates, making the DST model exemplary rather than prescriptive.
  • Some general suggestions are made on how the DST model can be coupled with project management methodologies.
  • The DST model can be customised to different organisations and contexts.

Organization of the Paper

  • CRISP-DM and other related variations are revisited.
  • The identification of new activities and the formulation of the DST map are included.
  • These trajectories are illustrated on real cases of data science projects.
  • A precise notation based on trajectory charts is used.
  • Data science project management is discussed.
  • Three kinds of activities are considered for the seven real cases plus 51 use cases from NIST.
  • The model is compared with software methodologies and the scientific method.
  • Finally, suggestions are made on how organisations can couple the model with new methodologies, and ethical issues and automation are discussed.

The Goal standard

  • CRISP-DM is revisited rather than abandoned.
  • It is crucial not to discard CRISP-DM.
  • It represents one of the common trajectories in data science.
  • It goes from data to knowledge when there is a goal that translates into a data mining task.
  • The DST model is "backwards compatible" with CRISP-DM while meeting the need for flexibility.
  • Some other trajectories that capture the common routes of data science projects are identified.
  • Section 2 gives a succinct description of the most used and cited data mining methodologies.
  • It provides an overview of their evolution, basis, and primary characteristics.
  • Knowledge Discovery in Databases (KDD) is the overall process of knowledge discovery from data.
  • It includes how the data is stored and accessed, and how algorithms can be scaled to massive datasets and still run efficiently.

Fayyad, Piatetsky-Shapiro and Smyth

  • In this definition, KDD also covers how results can be interpreted and visualised, and how the overall human-machine interaction can be modeled and supported.
  • Data mining is a step in this process, turning pre-processed data into patterns that can be turned into knowledge [7]. Data mining is often used as a synonym for KDD, and the paper does not distinguish between the two meanings.
  • CRISP-DM elaborates and extends the steps in the original KDD proposal into six steps.
  • The six steps are Business understanding, Data understanding, Data preparation, Modeling, Evaluation, and Deployment.
  • CRISP-DM also describes the way they are sequenced in a data mining application.
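
As a minimal illustration of the six-step sequence listed above, the sketch below encodes the phases as a Python enumeration in their nominal order. The names `CrispDmPhase` and `next_phase` are hypothetical helpers introduced here for illustration, not part of CRISP-DM or of any library, and the strictly linear ordering is a simplifying assumption.

```python
from enum import IntEnum
from typing import Optional


class CrispDmPhase(IntEnum):
    """The six CRISP-DM phases, numbered in their nominal order."""
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


def next_phase(phase: CrispDmPhase) -> Optional[CrispDmPhase]:
    """Return the phase that nominally follows `phase`, or None after deployment."""
    if phase is CrispDmPhase.DEPLOYMENT:
        return None
    return CrispDmPhase(phase + 1)


if __name__ == "__main__":
    # Print the phases in order, e.g. "1 Business Understanding".
    for phase in CrispDmPhase:
        print(phase.value, phase.name.replace("_", " ").title())
```

A fixed ordering like this suits goal-directed projects; the more exploratory projects described in these notes are the reason a more flexible model is proposed.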

Process models and methodologies:

  • Human-Centered Approach to Data Mining: Involves a holistic understanding of the entire Knowledge Discovery Process and interpretations.
  • SEMMA stands for Sample, Explore, Modify, Model and Assess.
  • Cabena's model: used in the marketing and sales domain.
  • Buchner's model is adapted to the development of web mining projects focused on an online customer.
  • Two Crows: takes insights from (first versions of) CRISP-DM (before release) and proposes a non-linear list of steps.
  • D3M, a domain-driven data mining approach.
  • The 5 A's Process, originally developed by SPSS, already included an “Automate” step.
  • Motorola developed the 6σ (Six Sigma) approach, which stresses measurement and statistical control techniques for quality and excellence in management.
  • KDD Roadmap is an iterative data mining methodology.
  • The arrows in the figure indicate that CRISP-DM incorporates principles and ideas from most of the aforementioned methodologies.
  • CRISP-DM is still considered the most complete data mining methodology in terms of meeting the needs of industrial projects.

Modernizing CRISP-DM

  • Data mining methodology has evolved along with the business application of data mining.
  • Several new methodologies have appeared as extensions of CRISP-DM.
  • The CRISP-DM 2.0 Special Interest Group (SIG) was established.
  • The SIG's goal was to modernize CRISP-DM without fundamental changes.
  • Cios et al.'s six-step discovery process adapts the CRISP-DM model to academia.
  • RAMSYS (RApid collaborative data Mining SYStem) is a methodology for developing collaborative DM and KD projects with geographically diverse groups.

CRISP-DM extensions and personalizations

  • ASUM-DM refines and extends CRISP-DM.
  • It adds infrastructure, operations, deployment and project management.
  • It has templates and guidelines, personalised for IBM's practices.
  • CASP-DM addresses specific challenges of machine learning and data mining.
  • HACE is a Big Data processing framework based on a three-tier structure: a Big Data mining platform (Tier I); challenges on information sharing and privacy, and Big Data application domains (Tier II); and Big Data mining algorithms (Tier III).

Business Understanding

  • Spend time with the business.
  • These methodologies devote a great deal of effort to gathering information before a data mining project starts.
  • The current data deluge, as well as the experimental and exploratory nature of data science projects, requires lightweight and flexible methods.
  • Similar lifecycles and methodologies have been introduced by IT companies for data science.
  • IBM released the Foundational Methodology for Data Science (FMDS).
  • Microsoft released TDSP, a data science methodology to deliver predictive analytics solutions and applications efficiently.
  • FMDS and TDSP closely resemble CRISP-DM.
  • Both are data mining methodologies that assume an identifiable goal, whereas data science requires an exploratory mindset.

Processes and Data

  • Data mining is like starting a mining operation at a place known for valuable minerals and metals.
  • In this metaphor, the valuable knowledge is what is extracted from the data.
  • Whenever this metaphor is applicable, CRISP-DM is suggested to remain a suitable methodology.
  • The term "data science" is now more common than "data mining" in the context of knowledge discovery, as a Google Trends query showed.
  • The term is now used very widely.
  • It has two broad senses: the science OF data, and applying the scientific method TO data (for example, testing machine learning models).
Data Science is Data Oriented and Exploratory

  • CRISP-DM is about processes and tasks: the process takes centre stage.
  • In data science, the data takes centre stage.
  • The questions become: what happens to the data, and what operations can be applied to it? Moving away from the process, the approach becomes more inquisitive and less prescriptive: it is about things you can do to data.
  • Continuing the mining metaphor: if data mining is like mining for precious metals, data science is like prospecting, searching for places where precious metals can be located.

Data prospecting

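  • Goal exploration: finding business goals that can be achieved in a data-driven way.
  • Data source exploration: discovering new and valuable sources of data.
  • Data value exploration: finding out what value might be extracted from the data.
  • Result exploration: relating data science results to the business goals.
  • Narrative exploration: extracting valuable stories from the data.
  • Product exploration: finding ways to turn the value extracted from the data into something valuable for users and customers.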
Data Science and CRISP-DM Example

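  • It is possible to see (weak) links between the business and the data.
  • The data used depends on the decisions of the data scientist.
  • A result may be unsatisfactory.
  • In that case, further data sources need to be explored.
  • With no data, source exploration comes before value exploration.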

Managing stages of projects

  • Exploration is combined with parts or steps of data mining.
  • All activities from start to deployment may be present.
  • Partial trajectories also happen.
  • In some projects there is work but no activity beyond data preparation.
  • The CRISP-DM sequence may be interrupted by exploration that yields new data sets or goals.
  • A successful plan and project is one like the one depicted in the paper.

What a Data Project Requires

  • The DST map is contrasted with CRISP-DM.
  • It has no arrows, because there is no predetermined order.
  • The project leader decides the next step.
  • It contains the CRISP-DM phases.
  • Exploratory activities can supply more information.
  • Take in science and
  • modelling will depend on carrying out over

Technical data

  • The “data” phases of CRISP-DM do not fully capture the data-related work.
  • The variety
  • Applications now Especially under the “business intelligence” extraction
  • Data may come from other sources or be produced by the organisation itself, which leads to the following activities, such as data acquisition and simulation.

The variety of models

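  • Data acquisition: obtaining or creating relevant data.
  • Data simulation: simulating systems in order to produce data.
  • Data architecting: designing the logical and physical layout of data and integrating different data sources.
  • Data release: making the data available, for example through a database.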
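
The exploration and data management activities named in these notes, together with the six CRISP-DM steps, can be grouped by kind, which is the separation the DST model offers for project planning. The sketch below is one hedged way to encode that grouping in Python. The dictionary `ACTIVITY_KINDS` and the helpers `kind_of` and `kind_counts` are hypothetical names introduced here, and the lower-case activity labels are an assumed spelling.

```python
from collections import Counter
from typing import Iterable

# Assumed labels: activity name -> kind of activity, as grouped in these notes.
ACTIVITY_KINDS = {
    # Exploration activities
    "goal exploration": "exploration",
    "data source exploration": "exploration",
    "data value exploration": "exploration",
    "result exploration": "exploration",
    "narrative exploration": "exploration",
    "product exploration": "exploration",
    # CRISP-DM activities
    "business understanding": "crisp-dm",
    "data understanding": "crisp-dm",
    "data preparation": "crisp-dm",
    "modeling": "crisp-dm",
    "evaluation": "crisp-dm",
    "deployment": "crisp-dm",
    # Data management activities
    "data acquisition": "data management",
    "data simulation": "data management",
    "data architecting": "data management",
    "data release": "data management",
}


def kind_of(activity: str) -> str:
    """Look up the kind of an activity, ignoring case and surrounding spaces."""
    return ACTIVITY_KINDS[activity.strip().lower()]


def kind_counts(activities: Iterable[str]) -> Counter:
    """Count how many planned activities fall into each kind."""
    return Counter(kind_of(a) for a in activities)


if __name__ == "__main__":
    plan = ["goal exploration", "data acquisition", "data preparation", "modeling"]
    print(kind_counts(plan))
```

A count per kind gives a rough view of how exploratory a plan is, which matters because the notes associate exploratory activities with greater time and cost uncertainty.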

Trajectories

  • With the activities introduced, trajectories through them can be described.
  • A trajectory is a graph rather than a simple sequence, so it can fork.
  • Give in the data
  • Way the exploration
  • knowledge value
  • Activities cleaning, model transformation
  • presented
  • Examples in next
  • Remove this activities Trajectorey

DST (data science trajectory)

  • A DST chart is a graph of activities and the connections between them.
  • Arrows are labelled according to the sequence of activities.
  • There cannot be loops.
  • The activity types are drawn as circles.
  • If two they take place as well
  • The trajectory can go through the same activity more than once.
  • Arrows are annotated with additional labels to show each further transition.
  • Each trajectory is drawn as a chart.
  • Together, the charts illustrate the model.
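
The chart notation described above (activities connected by arrows that are labelled in sequence, with activities that may be visited more than once) can be sketched as a small data structure. This is a minimal illustration under assumptions, not the paper's own notation: the helper names `build_transitions`, `revisited`, and `edge_labels` are hypothetical, and the example path is invented for demonstration.

```python
from collections import Counter
from typing import Dict, List, Tuple

# (step number, source activity, target activity)
Transition = Tuple[int, str, str]


def build_transitions(path: List[str]) -> List[Transition]:
    """Number each consecutive pair of activities in the order it is taken."""
    pairs = zip(path, path[1:])
    return [(step, src, dst) for step, (src, dst) in enumerate(pairs, start=1)]


def revisited(path: List[str]) -> List[str]:
    """Activities the trajectory passes through more than once, in first-seen order."""
    counts = Counter(path)
    seen: List[str] = []
    for activity in path:
        if counts[activity] > 1 and activity not in seen:
            seen.append(activity)
    return seen


def edge_labels(path: List[str]) -> Dict[Tuple[str, str], List[int]]:
    """Group step numbers by arrow, so a repeated arrow carries several labels."""
    labels: Dict[Tuple[str, str], List[int]] = {}
    for step, src, dst in build_transitions(path):
        labels.setdefault((src, dst), []).append(step)
    return labels


# Invented example path: exploration interleaved with CRISP-DM steps.
EXAMPLE_PATH = [
    "data source exploration",
    "data understanding",
    "data preparation",
    "data value exploration",
    "data preparation",
    "modeling",
]

if __name__ == "__main__":
    for step, src, dst in build_transitions(EXAMPLE_PATH):
        print(f"{step}: {src} -> {dst}")
    print("revisited:", revisited(EXAMPLE_PATH))
```

Storing the trajectory as an ordered path keeps the numbering implicit, and the arrow labels can be derived from it whenever a chart needs to be drawn.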

Types of data science trajectories

  • The examples are in no way complete; they serve to demonstrate the trajectories.
  • Data acquisition can be expensive and complex.
  • Simulation allows analysing different scenarios.
  • It also allows estimating alternatives.
  • Factors include weather, fuel, and byproducts.
  • Certain parts of the city are also simulated.
  • The aim is to predict pollution in the city.
  • Estimated using volume. Integrate a model

Insurance / refining policies

  • Using history to Clients data can be can get
  • Data for example The info can be to pay the data
  • Data where has the product for the customer enrich other data to what needs by means needed
