Data Mining Concepts

Flashcards

Data Mining

The process of discovering useful knowledge from large amounts of data.

Data Mining Process

A process that uses statistical, mathematical, and artificial intelligence techniques to extract and identify useful information and subsequent knowledge from large sets of data.

Patterns in Data Mining

Business rules, affinities, correlations, trends and prediction models found from data mining.

Nontrivial Process

Identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

Associations

Patterns find commonly co-occurring groupings of things.

Predictions

Patterns tell the nature of future occurrences of certain events based on what has happened in the past.

Clusters

Patterns identify natural groupings of things based on their known characteristics.

Sequential Relationships

Patterns discover time-ordered events.

CRISP-DM

Standard process for data mining proposed in the mid-1990s.

SEMMA

Sample, Explore, Modify, Model, and Assess

KDD

Knowledge Discovery in Databases (KDD)

Classification

A data mining method that learns patterns from past data to place new instances into groups.

Regression

Statistical technique to predict a numeric value.
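
As an illustration of predicting a numeric value, here is a minimal sketch of simple linear regression fitted by ordinary least squares. The function name `fit_line` and the hour/temperature observations are made up for this example; they are assumptions, not part of the lesson.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up past observations: hour of day vs. temperature in °F
hours = [6, 8, 10, 12]
temps = [58.0, 62.0, 66.0, 70.0]

slope, intercept = fit_line(hours, temps)
predicted = intercept + slope * 14  # numeric prediction for a new hour
print(round(predicted, 1))
```

The output of the model is a number (a temperature), which is what separates regression from classification.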

Classification-type Prediction

A common two-step methodology that consists of model development/training and model testing/deployment.

Factors to Consider - Assessing the Model

Accuracy, Speed, Robustness, Scalability, Interpretability
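
Accuracy, the first factor above, is usually estimated on a holdout sample by counting correct and incorrect predictions. The sketch below computes a few common measures for a two-class problem. The function name `confusion_metrics` and the labels are hypothetical and chosen only for illustration.

```python
def ratio(numerator, denominator):
    """Safe division: returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def confusion_metrics(actual, predicted, positive=1):
    """Basic accuracy measures for a two-class problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "true_positive_rate": ratio(tp, tp + fn),  # also called recall
        "true_negative_rate": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
    }

# Made-up holdout sample: true labels vs. model predictions
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

metrics = confusion_metrics(actual, predicted)
print(metrics)
```

The other factors (speed, robustness, scalability, interpretability) are not captured by these counts and have to be judged separately.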

Decision Tree

A diagram with nodes and branches that represent relationships.

Algorithm to Build Decision Tree

Create a root node and assign all the training data to it, select the best splitting attribute, and add a branch to the root node for each value of the split.

Decision Tree - Example Algorithms

Iterative Dichotomiser 3 (ID3), Classification and regression trees (CART), Chi-squared automatic interaction detector (CHAID)
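
Tree-building algorithms differ mainly in how they choose the attribute to split on. ID3 uses information gain, the reduction in entropy of the class labels after a split. The sketch below shows only that selection step, not a full tree builder. The function names and the toy weather rows are assumptions made for this example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target before the split minus weighted entropy after it."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

def best_split(rows, attributes, target):
    """Attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))

# Made-up training rows
rows = [
    {"outlook": "sunny", "windy": "no",  "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no",  "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]

best = best_split(rows, ["outlook", "windy"], "play")
print(best)
```

In this toy data `outlook` separates the classes perfectly while `windy` tells nothing about the label, so the root node would split on `outlook`.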

Cluster Analysis

Clustering items, events, or concepts into groupings called clusters.

K-Means Algorithm

The algorithm assigns each data point to the cluster whose center is nearest.

Association Rule Mining

Mining to find relationships (affinities) between variables (items).

Association Rule Mining - Sales Transactions

Retail product placement on sales floor

Association Rule Mining - Credit card transactions

Insight into likely purchases or fraud.

Association Rule Mining - Banking Services

Identifying the services used by customers.

Apriori Algorithm

The commonly used algorithm to discover association rules.

Privacy Obligations

To maintain the privacy and protection of individuals' rights, data mining professionals have ethical (and often legal) obligations.

De-identification

The process of removing identifying information from customer records prior to applying data mining applications, so that the records cannot be traced to an individual.
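
One possible way to de-identify records is to drop direct identifiers and replace the customer key with a salted hash, so rows for the same customer can still be linked. This is only a sketch under stated assumptions: the field names, the `de_identify` function, and the salt are hypothetical, and real de-identification may also need to handle indirect identifiers such as demographic fields.

```python
import hashlib

# Hypothetical field names treated as direct identifiers
DIRECT_IDENTIFIERS = {"name", "email", "phone"}

def de_identify(record, salt):
    """Drop direct identifiers and replace customer_id with a salted hash."""
    cleaned = {
        key: value
        for key, value in record.items()
        if key not in DIRECT_IDENTIFIERS and key != "customer_id"
    }
    raw = (salt + str(record["customer_id"])).encode("utf-8")
    # The same customer always maps to the same pseudonym for a given salt
    cleaned["pseudo_id"] = hashlib.sha256(raw).hexdigest()[:16]
    return cleaned

# Made-up customer record
record = {
    "customer_id": 1042,
    "name": "Jane Example",
    "email": "jane@example.com",
    "phone": "555-0100",
    "purchase_total": 87.5,
}

safe = de_identify(record, salt="keep-this-secret")
print(sorted(safe))
```

The salt should be kept separate from the mined data; otherwise the hash could be recomputed from a known customer ID.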

Data Mining Truth

Data mining is a multistep process that requires deliberate, proactive design and use.

Study Notes

Data Mining Concepts

  • Data mining is a term used to describe discovering or "mining" knowledge from large amounts of data.
  • Data mining uses statistical, mathematical, and artificial intelligence techniques to extract and identify useful information and subsequent knowledge or patterns from large sets of data.
  • It is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases.
  • The data is organized in records structured by categorical, ordinal, and continuous variables.
  • Data mining blends statistics, management science, information systems, database management, data warehousing, information visualization, artificial intelligence, machine learning, and pattern recognition.
  • Data is often buried deep in very large databases, sometimes spanning years.
  • Sophisticated tools, including advanced visualization tools, help unlock information buried in corporate files or public archives.
  • The data mining environment is usually a client/server architecture or a Web-based IS architecture.
  • The miner is often an end user, empowered by data drills and other powerful query tools to ask ad hoc questions and get answers quickly with little or no programming skill.
  • Data mining often involves finding an unexpected result and requires end users to think creatively throughout the process, including the interpretation of the findings.
  • Data mining tools are readily combined with spreadsheets and other software development tools.
  • It is sometimes necessary to use parallel processing for data mining.
  • Associations find commonly co-occurring groupings of things, such as beer and diapers appearing together in market-basket analysis.
  • Predictions tell the nature of future occurrences of certain events based on what has happened in the past.
  • Clusters identify natural groupings of things based on their known characteristics, such as assigning customers in different segments based on their demographics and past purchase behaviors.
  • Sequential relationships discover time-ordered events.

Tasks and Methods

  • Data mining has tasks and methods, including predictions, classification, regression, time series, association, market basket, link analysis, sequence analysis, segmentation, clustering, and outlier analysis.
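  • Supervised data mining algorithms include, for classification, decision trees, neural networks, support vector machines (SVM), k-nearest neighbor (kNN), naïve Bayes, and genetic algorithms (GA); for regression, linear/nonlinear regression, artificial neural networks (ANN), regression trees, SVM, kNN, and GA; and for time series, autoregressive methods, averaging methods, exponential smoothing, and ARIMA.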

Data Mining Applications

  • Common applications include customer relationship management, banking, retailing and logistics, manufacturing and production, brokerage and securities trading, insurance, computer hardware and software, government and defense, the travel industry, healthcare, medicine, the entertainment industry, homeland security and law enforcement, and sports.

Data Mining Process

  • CRISP-DM stands for Cross-Industry Standard Process for Data Mining.
  • It was proposed in the mid-1990s by a European consortium of companies to serve as a nonproprietary standard methodology for data mining.
  • Its steps are Business Understanding, Data Understanding, Data Preparation, Model Building, Testing and Evaluation, and Deployment.
  • SEMMA stands for Sample, Explore, Modify, Model, and Assess.
  • SEMMA was developed by SAS Institute.
  • KDD stands for Knowledge Discovery in Databases.
  • The KDD steps are Data Selection, Data Cleaning, Transformation, Data Mining, and Interpretation/Evaluation.

Data Mining Methods

  • Classification is perhaps the most frequently used data mining method for real-world problems.
  • Classification learns patterns from past data to place new instances (with unknown labels) into their respective groups or classes.
  • If what is being predicted is a class label, the prediction problem is called classification.
  • Example: the weather as sunny, rainy, or cloudy.
  • If what is being predicted is a numeric value, the prediction problem is called regression.
  • Example: the temperature, such as 68°F.
  • The two-step methodology of classification-type prediction is model development/training followed by model testing/deployment.
  • Model development/training starts with the collection of input data, including the class labels.
  • In model testing/deployment, the model is tested against a holdout sample to assess its accuracy.
  • Assessment factors include accuracy, speed, robustness, scalability, and interpretability.
  • The resulting model helps to predict classes of new data instances where the class label is unknown.
  • Measures for estimating the true accuracy of a classification model are based on counts of positive and negative outcomes and include the true positive rate, true negative rate, precision, and recall.
  • Techniques include decision tree analysis, statistical analysis (logistic regression and discriminant analysis), neural networks, case-based reasoning, Bayesian classifiers, genetic algorithms, and rough sets.
  • A decision tree recursively divides a training set until each division consists entirely or primarily of examples from one class.
  • Each non-leaf node of the tree contains a split point.
  • A general algorithm to build a decision tree creates a root node and assigns all of the training data to it, selects the best splitting attribute, and adds a branch to the root node for each value of the split. The same steps are then repeated on each branch until a stopping criterion is met.
  • Common algorithms are Iterative Dichotomiser 3 (ID3), C4.5 and C5, and Classification and regression trees (CART).
  • Cluster analysis is an essential data mining method for classifying items, events, or concepts into common groupings called clusters.
  • It is an exploratory data analysis tool for solving classification problems.
  • Cluster analysis results may be used to identify a classification scheme, indicate rules for assigning new cases to classes for identification, find typical cases to label and represent classes, decrease the size and complexity of the problem space for other data mining methods, and identify outliers in a specific domain.
  • The k-means algorithm (where k stands for the predetermined number of clusters) is arguably the most referenced clustering algorithm.
  • The algorithm assigns each data point to the cluster whose center (also called the centroid) is the nearest.
  • The center is calculated as the average of all points in the cluster.
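
The two k-means steps described above (assign each point to the nearest centroid, then recompute each centroid as the average of its points) can be sketched in plain Python. The function names, the naive initialization with the first k points, and the sample points are assumptions made for this example; practical implementations usually choose starting centroids more carefully.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return distances.index(min(distances))

def kmeans(points, k, max_iterations=100):
    centroids = list(points[:k])  # assumption: naive initialization
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster with the nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: each center becomes the average of its points
        updated = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
        if updated == centroids:  # no change means the clusters are stable
            break
        centroids = updated
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

# Made-up two-dimensional points forming two visible groups
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]

centroids, labels = kmeans(points, k=2)
print(labels)
```

Because k is fixed in advance, running the sketch with a different k on the same points yields a different grouping.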

Association Rule Mining

  • Association rule mining is commonly used to explain what data mining is and what it can do to a technologically less-savvy audience.
  • It aims to find interesting relationships between variables (items) in large databases.
  • The main idea is to identify strong relationships among different products purchased together.
  • Applications are sales transactions, credit card transactions, banking services, insurance service products, telecommunication services, and medical records.
  • The Apriori algorithm is the most commonly used algorithm to discover association rules.
  • Given a set of itemsets, such as retail transactions, it attempts to find subsets of items that are common to at least a minimum number of those itemsets.
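
The level-wise search behind Apriori can be sketched as follows: count single items, keep those that meet the minimum count, then build larger candidates only from the survivors. This is a simplified illustration, not a tuned implementation. The function name `frequent_itemsets`, the `min_count` parameter, and the baskets are made up for this example.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_count):
    """Return {itemset: count} for itemsets found in at least min_count transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    current = [frozenset([item]) for item in items]
    frequent = {}
    size = 1
    while current:
        # Count how many transactions contain each candidate
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(survivors)
        size += 1
        # Join step: combine survivors into candidates one item larger
        candidates = {a | b for a in survivors for b in survivors if len(a | b) == size}
        # Prune step: every smaller subset of a candidate must itself be frequent
        current = [
            c for c in candidates
            if all(frozenset(s) in survivors for s in combinations(c, size - 1))
        ]
    return frequent

# Made-up market baskets
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "chips"},
    {"diapers", "milk"},
    {"beer", "diapers", "milk"},
]

result = frequent_itemsets(baskets, min_count=3)
print(result)
```

With a minimum count of 3, only beer, diapers, and the pair of the two survive; chips and milk are dropped at the first level, so no itemset containing them is ever counted.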

Privacy Issues

  • Data that is collected, stored, and analyzed in data mining often contains information about real people.
  • This includes identification data, demographic data, financial data, purchase history, and other personal data.
  • Most of these data can be accessed through some third-party data providers.
  • Data mining professionals have ethical and legal obligations to maintain privacy.
  • This includes de-identification of the customer records before applying data mining applications.
  • Data mining is a multistep process and does not provide instant results.
  • The current state of the art is ready for almost any business type and/or size.
  • Newer Web-based tools enable managers of all educational levels to do data mining.
  • If the data accurately reflect the business or its customers, any company can use data mining.
