Flashcards
Data Mining
The process of discovering useful knowledge from large amounts of data.
Data Mining Process
A process that uses statistical, mathematical, and artificial intelligence techniques to extract and identify useful information and subsequent knowledge from large sets of data.
Patterns in Data Mining
Business rules, affinities, correlations, trends and prediction models found from data mining.
Nontrivial Process
Associations
Predictions
Clusters
Sequential Relationships
CRISP-DM
SEMMA
KDD
Classification
Regression
Classification-type Prediction
Factors to Consider - Assessing the Model
Decision Tree
Algorithm to Build Decision Tree
Decision Tree - Example Algorithms
Cluster Analysis
K-Means Algorithm
Association Rule Mining
Association Rule Mining - Sales Transactions
Association Rule Mining - Credit card transactions
Association Rule Mining - Banking Services
Apriori Algorithm
Privacy Obligations
De-identification
Data Mining Truth
Study Notes
Data Mining Concepts
- Data mining is a term used to describe discovering or "mining" knowledge from large amounts of data.
- Data mining uses statistical, mathematical, and artificial intelligence techniques to extract and identify useful information and subsequent knowledge or patterns from large sets of data.
- It is the nontrivial process of identifying patterns in data stored in structured databases.
- The data is organized in records structured by categorical, ordinal, and continuous variables.
- Data mining blends statistics, management science, information systems, database management, data warehousing, information visualization, artificial intelligence, machine learning, and pattern recognition.
- Data is often buried deep in very large databases, sometimes spanning several years of records.
- Sophisticated tools, including advanced visualization tools, help unlock information buried in corporate files or public archives.
- The data mining environment is usually a client/server architecture or a Web-based information systems architecture.
- The miner is often an end user, empowered by data drills and other powerful query tools to ask ad hoc questions and get answers quickly with little or no programming skill.
- Data mining often involves finding an unexpected result and requires end users to think creatively throughout the process, including the interpretation of the findings.
- Data mining tools are readily combined with spreadsheets and other software development tools.
- It is sometimes necessary to use parallel processing for data mining.
- Associations find the commonly co-occurring groupings such as beer and diapers going together in market-basket analysis.
- Predictions tell the nature of future occurrences of certain events based on what has happened in the past.
- Clusters identify natural groupings of things based on their known characteristics, such as assigning customers in different segments based on their demographics and past purchase behaviors.
- Sequential relationships discover time-ordered events.
Tasks and Methods
- Data mining has tasks and methods, including predictions, classification, regression, time series, association, market basket, link analysis, sequence analysis, segmentation, clustering, and outlier analysis.
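- Supervised data mining algorithms include, for classification, decision trees, neural networks, support vector machines (SVM), k-nearest neighbor (kNN), naïve Bayes, and genetic algorithms (GA); for regression, linear/nonlinear regression, artificial neural networks (ANN), regression trees, SVM, kNN, and GA; and for time series, autoregressive methods, averaging methods, exponential smoothing, and ARIMA.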
Data Mining Applications
- Common applications include customer relationship management, banking, retailing and logistics, manufacturing and production, brokerage and securities trading, insurance, computer hardware and software, government and defense, the travel industry, healthcare, medicine, the entertainment industry, homeland security and law enforcement, and sports.
Data Mining Process
- CRISP-DM stands for the Cross-Industry Standard Process for Data Mining.
- It was proposed in the mid-1990s by a European consortium of companies to serve as a nonproprietary standard methodology for data mining.
- Its steps are Business Understanding, Data Understanding, Data Preparation, Model Building, Testing and Evaluation, and Deployment.
- SEMMA is Sample, Explore, Modify, Model, and Assess.
- It was developed by the SAS Institute.
- KDD stands for Knowledge Discovery in Databases.
- Its steps are Data Selection, Data Cleaning, Transformation, Data Mining, and Interpretation/Evaluation.
Data Mining Methods
- Classification is perhaps the most frequently used data mining method for real-world problems.
- Classification learns patterns from past data to place new instances (with unknown labels) into their respective groups or classes.
- If what is being predicted is a class label, the prediction problem is called classification.
- Example: a weather forecast of sunny, rainy, or cloudy.
- If what is being predicted is a numeric value, the prediction problem is called regression.
- Example: a temperature forecast such as 68°F.
- Classification-type prediction follows a two-step methodology: model development/training and model testing/deployment.
- In model development/training, a collection of input data, including the actual class labels, is used to train the model.
- In model testing/deployment, the trained model is tested against a holdout sample for accuracy assessment.
- Factors to consider when assessing the model include predictive accuracy, speed, robustness, scalability, and interpretability.
- Once deployed, the model predicts the classes of new data instances whose class label is unknown.
- Estimating the true accuracy of a classification model starts from the counts of correctly and incorrectly classified positive and negative instances, from which the true positive rate, true negative rate, precision, and recall are derived.
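The accuracy measures named above can be sketched in a few lines. This is a minimal illustration for a two-class problem, assuming labels are encoded as 1 (positive) and 0 (negative); the function name `classification_metrics` and the sample labels are made up for the example and are not from the notes.

```python
def classification_metrics(actual, predicted, positive=1):
    """Count the confusion-matrix cells and derive the usual rates."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)  # true positives
    tn = sum(1 for a, p in pairs if a != positive and p != positive)  # true negatives
    fp = sum(1 for a, p in pairs if a != positive and p == positive)  # false positives
    fn = sum(1 for a, p in pairs if a == positive and p != positive)  # false negatives

    def ratio(numerator, denominator):
        # Avoid dividing by zero when a class never occurs in the holdout sample.
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "true_positive_rate": ratio(tp, tp + fn),  # same quantity as recall
        "true_negative_rate": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
    }


# Hypothetical holdout sample: actual labels versus the model's predictions.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = classification_metrics(actual, predicted)
print(metrics)
```

Here 3 of the 4 positives are found (true positive rate 0.75), while only 3 of the 5 positive predictions are correct (precision 0.6), which shows why a single accuracy figure can hide the kind of error a model makes.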
- Techniques include decision tree analysis, statistical analysis (logistic regression and discriminant analysis), neural networks, case-based reasoning, Bayesian classifiers, genetic algorithms, and rough sets.
- A decision tree recursively divides a training set until each division consists mostly of examples from one class.
- Each non-leaf node of the tree contains a split point.
- A general algorithm to build a decision tree: create a root node and assign all of the training data to it; select the best splitting attribute; add a branch to the root node for each value of the split, dividing the data into mutually exclusive subsets along those branches; then repeat the selection and branching steps for each leaf node until a stopping criterion is reached.
- Common algorithms are Iterative Dichotomiser 3 (ID3), C4.5 and C5, and Classification and Regression Trees (CART).
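The general tree-building procedure can be sketched as a short recursive function. This is an illustrative sketch only: it assumes categorical attributes, rows stored as dictionaries, and an entropy-based choice of the "best" attribute (in the spirit of ID3). The notes do not say which splitting criterion to use, and the toy weather rows are invented.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Impurity of a list of class labels (0 means the list is pure)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_attribute(rows, attributes, target):
    """Pick the attribute whose split leaves the lowest weighted entropy."""
    def split_entropy(attr):
        groups = {}
        for row in rows:
            groups.setdefault(row[attr], []).append(row[target])
        return sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return min(attributes, key=split_entropy)


def build_tree(rows, attributes, target):
    labels = [row[target] for row in rows]
    # Stopping criterion: the node is pure or no attributes remain.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, attributes, target)
    remaining = [a for a in attributes if a != attr]
    # One branch per value of the chosen attribute, built recursively.
    branches = {}
    for value in {row[attr] for row in rows}:
        subset = [row for row in rows if row[attr] == value]
        branches[value] = build_tree(subset, remaining, target)
    return {attr: branches}


def predict(tree, row):
    """Follow branches until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][row[attr]]
    return tree


# Hypothetical training set.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
tree = build_tree(rows, ["outlook", "windy"], "play")
print(tree)
print(predict(tree, {"outlook": "rainy", "windy": "yes"}))
```

On this data the root splits on `outlook`, and only the `rainy` branch needs a second split on `windy`. Production algorithms such as C4.5 or CART add handling for numeric attributes, unseen values, and pruning, none of which is attempted here.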
- Cluster analysis is an essential data mining method for classifying items, events, or concepts into common groupings called clusters.
- It is an exploratory data analysis tool for solving classification problems.
- Cluster analysis results may be used to identify a classification scheme, indicate rules for assigning new cases to classes for identification, find typical cases to label and represent classes, decrease the size and complexity of the problem space for other data mining methods, and identify outliers in a specific domain.
- K-means clustering algorithm:
- The k-means algorithm (where k stands for the predetermined number of clusters) is arguably the most referenced clustering algorithm.
- The algorithm assigns each data point to the cluster whose center (also called the centroid) is the nearest.
- The center is calculated as the average of all points in the cluster, and the assignment and averaging steps repeat until the assignments stop changing.
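The assign-then-average loop of k-means can be sketched as follows. This is a minimal sketch under stated assumptions: points are numeric tuples, distance is Euclidean, and the initial centroids are simply the first k points so that the run is repeatable (real implementations usually choose seeds randomly). The function name `k_means` and the sample points are illustrative.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def k_means(points, k, max_iterations=100):
    # Assumption: seed the centroids with the first k points.
    centroids = [tuple(p) for p in points[:k]]
    assignments = None
    for _ in range(max_iterations):
        # Step 1: assign each point to the cluster with the nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the result is stable
        assignments = new_assignments
        # Step 2: recompute each centroid as the average of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignments


# Hypothetical two-dimensional data with two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.0), (8.0, 9.0)]
centroids, assignments = k_means(points, k=2)
print(assignments)
print(centroids)
```

Even though both seeds start inside the lower-left group, the second centroid is pulled toward the upper-right points after the first averaging step, and the assignments settle after a few passes.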
Association Rule Mining
- Association rule mining is commonly used to explain what data mining is and what it can do to a technologically less-savvy audience.
- It aims to find interesting relationships between variables (items) in large databases.
- The main idea is to identify strong relationships among different products purchased together.
- Applications include sales transactions, credit card transactions, banking services, insurance service products, telecommunication services, and medical records.
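- The Apriori algorithm is the most commonly used algorithm to discover association rules.
- Given a set of itemsets (for example, sets of retail transactions), it attempts to find subsets that are common to at least a minimum number of the itemsets (the minimum support).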
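The level-wise search behind Apriori can be sketched as below. This is a simplified illustration, assuming transactions are sets of item names and that support is an absolute count. The function name `apriori`, the sample baskets, and the threshold of 3 are all made up for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets found in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates are the individual items.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    size = 1
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        size += 1
        # Join frequent itemsets into candidates one item larger, then prune
        # any candidate with an infrequent subset (the Apriori property).
        joined = {a | b for a in level for b in level if len(a | b) == size}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        }
    return frequent


# Hypothetical market baskets.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "chips"},
    {"diapers", "milk"},
    {"beer", "diapers", "milk"},
]
frequent = apriori(baskets, min_support=3)

# Confidence of the rule beer -> diapers: support(beer, diapers) / support(beer).
pair = frozenset({"beer", "diapers"})
confidence = frequent[pair] / frequent[frozenset({"beer"})]
print(frequent)
print(confidence)
```

With a minimum support of 3, only `beer`, `diapers`, and the pair `{beer, diapers}` survive, and chips and milk are dropped at the first level, so no larger itemset containing them is ever counted.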
Privacy Issues
- Data that is collected, stored, and analyzed in data mining often contains information about real people.
- This includes identification data, demographic data, financial data, purchase history, and other personal data.
- Most of these data can be accessed through some third-party data providers.
- Data mining professionals have ethical and legal obligations to maintain privacy.
- This includes de-identification of the customer records before applying data mining applications.
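One way to picture de-identification before mining is to drop direct identifiers and replace the customer key with a token. The sketch below is only an illustration under assumptions: the field names (`name`, `email`, `phone`, `customer_id`), the salt, and the choice of a salted SHA-256 token are hypothetical, and the notes do not prescribe a specific technique.

```python
import hashlib

# Hypothetical field names treated as direct identifiers.
DIRECT_IDENTIFIERS = {"name", "email", "phone"}


def de_identify(record, salt):
    """Drop direct identifiers and replace the customer key with a salted token."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    raw_key = (salt + str(record["customer_id"])).encode("utf-8")
    # The same customer always maps to the same token, so purchase
    # histories can still be grouped without exposing the real key.
    cleaned["customer_id"] = hashlib.sha256(raw_key).hexdigest()[:16]
    return cleaned


# Hypothetical customer record.
record = {
    "customer_id": 1042,
    "name": "Pat Example",
    "email": "pat@example.com",
    "phone": "555-0100",
    "age_band": "30-39",
    "purchases": ["beer", "diapers"],
}
safe_record = de_identify(record, salt="keep-this-secret")
print(safe_record)
```

The analytical fields (`age_band`, `purchases`) are kept, so the mining methods above still work on the output. Whether the remaining fields could re-identify a person in combination is a separate question that this sketch does not address.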
Data Mining Truths
- Data mining is a multistep process that requires deliberate design and use; it does not provide instant results.
- The current state of the art is ready for almost any business type and/or size.
- Newer Web-based tools enable managers of all educational levels to do data mining.
- If the data accurately reflect the business or its customers, any company can use data mining.