Data Mining Review and CRISP-DM Lifecycle

Questions and Answers

What method is used to identify the central tendency of a dataset that is not influenced by outliers?

  • Mode
  • Median (correct)
  • Mean
  • Standard Deviation

In the context of model evaluation, what does entropy generally measure?

  • Data variability
  • Average prediction error
  • Feature importance
  • Impurity in data (correct)

What is the primary purpose of hyperparameter tuning in machine learning models?

  • To standardize feature scales
  • To eliminate missing values
  • To optimize model performance (correct)
  • To increase dataset size

Which of the following metrics is typically used to evaluate the performance of a classification model?

Confusion Matrix

Which method is NOT used for identifying outliers in data?

Median Absolute Deviation

Which component is part of the general form for estimating a confidence interval?

Point Estimate +/- Margin of Error

What is a characteristic of multivariate statistical analysis?

It analyzes multiple dependent variables simultaneously.

What is one reason for the increase in data mining usage?

Commercialization of products

Which of the following is NOT a common task of data mining?

Interrogation

What is the main purpose of data preprocessing?

Ensure data is more complete and suitable for analysis

Which step follows the understanding of business and data in the CRISP-DM lifecycle?

Data preparation

Which of these options is considered a method of data cleaning?

Removing redundant entries

What does GIGO stand for in the context of data processing?

Garbage In, Garbage Out

Which aspect of data mining focuses specifically on making predictions from data?

Regression

Which of the following is a benefit of using NumPy over standard lists in Python?

NumPy is generally faster and more efficient in numerical calculations.

Which method would NOT be appropriate for identifying outliers in data?

Linear Regression

NumPy arrays can hold elements of different data types.

False

What is the purpose of a confidence interval estimate?

To provide a range of values that likely contain the true population parameter.

The formula for a confidence interval is: Point Estimate +/- __________.

Margin of Error

Match the following statistical metrics with their descriptions:

  • Mean Absolute Error = Average of absolute errors between predicted and actual values
  • Mean Squared Error = Average of squared errors between predicted and actual values
  • Root Mean Squared Error = Square root of the mean of squared errors
  • R2 Score = Determines the proportion of variance for the dependent variable

Which of the following statistical measures indicates the most frequently occurring value in a dataset?

Mode

Standardization and normalization are the same processes in data preprocessing.

False

What is the purpose of feature selection in machine learning?

To identify and select the most relevant features for model training.

The process of transforming categorical variables into a numerical format is known as ______.

feature encoding

Match the following statistical concepts with their definitions:

  • Mean = The average of a dataset
  • Skewness = Measure of asymmetry in a distribution
  • Z-Score = Number of standard deviations an element is from the mean
  • Confidence Interval = Range of values likely to contain the population parameter

External pressure is one of the reasons for the increase in data mining usage.

True

What does CRISP-DM stand for?

Cross-Industry Standard Process for Data Mining

Data cleaning involves removing _______.

entries

Match the following data mining tasks with their descriptions:

  • Estimation = Determining the value of a property based on available data
  • Prediction = Making forecasts about future outcomes based on past data
  • Classification = Assigning items to predefined categories
  • Clustering = Grouping similar items without predefined labels

What is one reason for data preprocessing?

To ensure raw data is complete and consistent

GIGO stands for 'Garbage In, Garbage Out'.

True

Name one task performed in the data preparation phase of CRISP-DM.

Data cleaning

Which evaluation metric is particularly useful when dealing with imbalanced data?

Recall

Using resampled data for evaluating models can lead to better generalization.

False

What is the purpose of using a Precision-Recall curve in model evaluation?

To identify the best threshold for the positive class.

The F1 score is the harmonic mean of ______ and ______.

precision, recall

Match the following evaluation metrics with their descriptions:

  • Accuracy = Can be misleading in imbalanced datasets
  • Precision = Measures correct positive predictions out of all positive predictions
  • Recall = Measures correct positive predictions out of actual positives
  • AUC of ROC = Treats both classes equally and is less sensitive to minority-class improvements

Which of the following is a method for generating synthetic examples in the context of imbalanced data?

SMOTE

Tomek links are a technique used to add synthetic examples to the minority class.

False

What is one potential drawback of random oversampling?

Overfitting to the minority class

In the context of data resampling techniques, SMOTE is specifically designed for __________.

oversampling

Match the following resampling techniques with their descriptions:

  • Random over-sampling = Copies examples of the minority class
  • Random under-sampling = Removes examples from the majority class
  • SMOTE = Creates synthetic examples from original ones
  • Tomek links = Removes majority class samples based on proximity

Which of the following measures of spread is preferable when dealing with extreme values?

Mean absolute deviation

In comparing two portfolios with the same measures of center, which observation about their spread could be inferred?

One portfolio may have outliers affecting its values.

Which statement accurately describes the relationship between measures of center and measures of spread?

Measures of spread can indicate how consistent the measures of center are.

What does the sample standard deviation represent in relation to the mean?

The typical distance between field values and the mean.

Which statement correctly describes the range of min-max normalization values?

Values are always between 0 and 1.

What does z-score standardization use to scale field values?

Field mean and standard deviation

What represents the minimum value when applying min-max normalization?

0

How is the Z-score calculated for a given data value?

Subtract the mean from the value and divide by the standard deviation

Which of the following statements is true about Z-scores?

Z-scores around zero indicate values near the mean

What is the purpose of decimal scaling in normalization?

To ensure data values lie between -1 and 1

What is a potential risk when using methods that replace missing values with constants?

It can result in a loss of valuable information if patterns of missing values are systematic.

Which method for handling missing data might lead to an overestimation of confidence levels in statistical inference?

Replacing missing values with the mean of the dataset.

What does replacing missing values with the mode or mean fail to address?

The systematic patterns of missingness that could impact analysis.

What is a common drawback of replacing missing values with the mode, specifically in categorical fields?

It may distort the frequency distribution of the dataset.

Which data mining task involves finding natural groupings in the data?

Clustering

What is a significant reason for the rise in data mining usage?

Commercialization of products

In the CRISP-DM lifecycle, after understanding business objectives, what is the next step?

Data Understanding

Which preprocessing issue relates to entries that are irrelevant or no longer needed?

Redundant fields

Why is minimizing GIGO crucial in data mining processes?

To improve data quality and outcomes

What characteristic of NumPy arrays enhances their efficiency over traditional lists?

NumPy arrays offer fixed data types and contiguous memory storage.

Which of the following statements best describes the concept of bias and variance in modeling?

Bias is the error due to approximating a real-world problem, while variance is the error due to sensitivity to fluctuations in the training set.

In the context of hypothesis testing, what does the 'null hypothesis' typically represent?

It asserts no effect or relationship exists between variables.

Which of the following concepts is primarily concerned with the evaluation of classification models?

Sensitivity and Specificity

Why is it important for the training set and the test set to be independent?

To validate that the model can be generalized to unseen data.

What is the purpose of examining the efficacy of a classification model using the test set?

To compare the predicted values against the true target variable.

What does cross-validation help guard against in model evaluation?

Spurious results that may arise from random variations in the data.

What is the next step after assessing the performance of a data mining model on the test set?

Adjusting the provisional model to minimize errors on the test set.

What distinguishes supervised methods from unsupervised methods in data mining?

Supervised methods require a specific target variable to guide the learning process.

Which of the following methods is classified as unsupervised data mining?

Clustering voter profiles based on demographics.

Why might statistical methods and data mining result in statistically significant results that lack practical significance?

Statistical methods often interpret large datasets in ways that can misrepresent the real-world impact.

Which statement accurately characterizes the role of clustering in unsupervised data mining?

Clustering functions by identifying data groups without any prior classification.

In data mining, what is a common misconception about unsupervised methods?

Unsupervised methods can operate fully autonomously without human guidance.

What is a key drawback of k-fold cross-validation?

It requires more computational resources than a single train-test split.

What primarily causes the degradation of generalizability in a model when its complexity is increased?

The model fits all available data rather than underlying trends.

Which statistical test should be used when validating a partition with a continuous target variable?

Two-sample t-test

Which statement correctly reflects the relationship between training error and test error as model complexity changes?

Training error decreases as test error increases.

In k-fold cross-validation, what aspect ensures that each record appears in the test set exactly once?

Partitioning the data into k subsets.

What indicates that a model is overfitting the training data?

High accuracy on the training set with low accuracy on the test set.

Which of the following would likely introduce bias into the results when partitioning into training and test sets?

Assigning a higher proportion of positive values to one set.

What is one key advantage of utilizing k-fold cross-validation?

It provides a more reliable estimate of model performance across different subsets.

At what point is the optimal model complexity achieved according to the discussion on error rates?

At the lowest point of the test set error rate.

What potential risk arises from using a model with zero training error?

There is a high chance the model is overfitting and memorizing the training set.

Flashcards

NumPy Array Locality

NumPy arrays store data contiguously in memory, making access and manipulation faster than lists due to locality of reference.

Univariate Statistical Analysis

Analyzing data with one variable at a time.

Bias-Variance Tradeoff

A model's ability to balance error from simplifying assumptions (bias) with error from random fluctuations in the training data (variance).

Mean Squared Error (MSE)

The average squared difference between predicted and actual values in a regression model.

Confidence Interval

A range of values that likely contains the true value of a population parameter.

Data Mining Common Tasks

Data mining involves tasks like estimation, prediction (via regression or classification), association, and clustering.

CRISP-DM Lifecycle

A standard process for data mining projects, including understanding the business problem, exploring the data, preparing the data, modeling the data, evaluating the results, and deploying the solution.

Data Preparation in Data Mining

This stage of data mining involves cleaning, transforming, and preparing raw data for analysis to improve quality and consistency.

Data Cleaning Tasks

Techniques used to improve the quality of data, including removing problematic entries, columns, or clusters.

Preprocessing Raw Data

The process of preparing data for use in data mining, handling inconsistencies, missing values, outliers, and other issues to ensure accuracy.

Why do we Preprocess Data?

Preprocessing reduces 'garbage in, garbage out' (GIGO) problems that lead to inaccurate results by dealing with incomplete, noisy, and poorly formatted data.

Data Mining Usage Increase Reasons

Factors impacting the growing use of data mining include product commercialization, ongoing technology improvements, and external influences.

Data Mining Objectives

Data mining objectives translate business requirements and goals into specific data mining problems that can be solved with the available data.

Mean

The average of a dataset. Calculated by summing all the values and dividing by the number of values.

Median

The middle value of a sorted dataset. If there are an even number of values, it's the average of the two middle values.

Mode

The most frequent value in a dataset.

Skewness

A measure of the asymmetry of a distribution. Positive skewness means the tail is longer on the right, negative skewness means the tail is longer on the left.

Normalization

Scaling data to a specific range, typically between 0 and 1. This ensures that all features have the same scale.

NumPy Array Advantage

NumPy arrays store data in a contiguous block of memory, allowing for faster access and manipulation compared to lists. This efficiency is due to the principle of locality of reference.

Univariate Analysis

Analyzing a single variable at a time to understand its characteristics and distribution. This helps identify patterns, trends, and outliers within that specific variable.

What is a Confidence Interval?

A range of values calculated from sample data that is likely to contain the true value of a population parameter with a certain level of confidence.

Box Plot: What does it show?

A graphical representation of data that summarizes its distribution through the minimum, maximum, median, and quartiles (25th and 75th percentiles).

Model Complexity: What does it mean?

The complexity of a model refers to its ability to fit the training data precisely. A model with more parameters and flexibility can capture intricate relationships, leading to high complexity.

Data Mining Tasks

Data mining focuses on extracting useful knowledge from data. Common tasks include estimation, prediction (through regression or classification), association, clustering, and finding patterns.

Data Preparation

This crucial stage of data mining involves cleaning, transforming, and preparing raw data to improve its quality and consistency for analysis. This ensures the data is useful for modeling.

Preprocessing Data Why?

Raw data often needs preprocessing to address issues like incomplete or inconsistent data. This helps ensure accuracy in analysis and minimizes 'garbage in, garbage out' (GIGO).

Data Cleaning Techniques

Data cleaning removes problematic entries, columns, or clusters from the dataset. This improves the quality and reliability of the data.

Data Mining Usage Increase

Data mining has grown in popularity due to factors like commercialization of data-driven products, rapid technological advancements, and external pressures to leverage data for informed decision-making.

Data Mining - Understanding Business & Data

The initial stage of the CRISP-DM process involves understanding the business problem and acquiring knowledge about the available data.

Imbalanced Data

A dataset where the classes are not equally represented. For example, in a dataset of customer reviews, there might be many more positive reviews than negative reviews.

Accuracy (Imbalanced Data)

A misleading metric for imbalanced datasets because it's heavily influenced by the performance on the majority class. For example, in a dataset with 90% positive cases, a model that predicts everything as positive would have 90% accuracy, but it's not a good model.

Precision

Out of all the positive predictions made by a model, how many were actually positive. It measures how precise the model is at identifying positive cases.

Recall

Out of all the actual positive cases, how many were correctly identified by the model. It measures how well the model can identify true positives.

F1 Score

A balanced metric that considers both precision and recall. It is the harmonic mean of the two. It's useful for balancing the trade-off between precision and recall, particularly in imbalanced datasets.

What is Resampling?

Resampling is a technique used to adjust the distribution of training data in order to minimize the impact of class imbalance.

Over-sampling

Over-sampling involves adding more examples to the minority class in an imbalanced dataset. This is done to increase the representation of the less frequent class and make the model more sensitive to it.

SMOTE: What is it?

SMOTE (Synthetic Minority Over-sampling Technique) is a method for generating synthetic examples of the rare class by combining existing examples. It uses a nearest-neighbor approach to create new data points.

Under-sampling

Under-sampling involves removing examples from the majority class to reduce its dominance. This helps to reduce the bias towards the majority class and improve the model's performance on the minority class.

Tomek Links: What are they used for?

Tomek Links are pairs of examples from opposite classes that are very close together. Under-sampling techniques like Tomek Links identify and remove majority class examples from these pairs, helping to clarify the decision boundary and improve model performance.

Missing Value Handling: Why is it important?

Handling missing data is crucial because it can significantly affect the accuracy of analyses. Removing records with missing values can lead to biased results, as the pattern of missing data might be systematic, losing valuable information.

Replacing Missing Values: Constant

This method replaces missing values with a predetermined constant, like 0.0 for numeric values or "Missing" for categorical fields.

Replacing Missing Values: Mean/Mode

Missing values are replaced with the mean (for numeric data) or the mode (for categorical data).

Mean/Mode Replacement: Drawbacks

While replacing missing values with the mean or mode seems plausible, it can lead to overconfident results: measures of spread are artificially lowered and the true variability of the data is obscured.

Data Imputation: A Better Approach

Data imputation methods provide more sophisticated techniques to handle missing data, often incorporating relationships between variables to estimate missing values.

Measures of Center

A single value that summarizes the central tendency of a dataset, indicating the typical value within the data distribution.

Z-score Standardization

A method of standardizing data by converting raw values to a standardized score, where the mean is 0 and standard deviation is 1.

Measures of Spread

Metrics describing how spread out the data values are, providing insight into the variability within a dataset.

Z-score Formula

(X - mean(X)) / SD(X)

Mean sensitive to outliers?

Yes, the mean is strongly influenced by extreme values (outliers) in a dataset, because it accounts for all values equally.

Range

The simplest measure of spread, calculated by subtracting the minimum value from the maximum value in a dataset.

Decimal Scaling

A normalization technique that scales data values by dividing by a power of 10, where the power is determined by the number of digits in the largest absolute value.

Standard Deviation

A measure of spread indicating how much data values typically deviate from the mean, taking into account the magnitude of deviations.

Decimal Scaling Formula

X* = X / 10^d

Why Normalize Data?

Normalization helps to ensure that all features have the same scale, preventing features with larger values from dominating the analysis.

Min-Max Normalization

A data scaling technique that transforms data values to a range between 0 and 1, by adjusting them based on the minimum and maximum values in the dataset. This ensures all features have the same scale.

What is 'X' in the Min-Max Formula?

'X' represents the original data value that you want to normalize. It is the individual value you are transforming to fit within the 0 to 1 range.

Why Normalize or Standardize Data?

Normalization or standardization is used to prepare data for analysis or machine learning algorithms. It helps to prevent features with larger scales from dominating, ensuring all features have equal influence on the model.

What is the 'standard deviation'?

The standard deviation measures how spread out the data points are from the mean. A higher standard deviation indicates greater variability in the data.

Why Preprocess Data?

Raw data often contains inconsistencies, missing values, and outliers. Preprocessing cleans and prepares data for analysis, minimizing errors and improving accuracy.

Data Cleaning

Data cleaning involves removing problematic entries, columns, or clusters from a dataset to improve its quality and reliability.

Supervised Learning

A type of machine learning where the algorithm is given labeled data (input and desired output) to learn a mapping between features and target variables.

Unsupervised Learning

A type of machine learning where the algorithm is given unlabeled data and must find patterns or structures without explicit guidance.

Clustering

An unsupervised learning technique that groups data points based on their similarity, creating clusters of related data.

Target Variable

The variable that the machine learning model aims to predict or understand in supervised learning.

Predictor Variables

The independent variables used to make predictions about the target variable in supervised learning.

What is cross-validation used for?

Cross-validation is a technique used to estimate the performance of a machine learning model on unseen data. It helps prevent overfitting by evaluating the model on a separate test set.

What is a spurious artifact?

A spurious artifact is a pattern in the training data that is not representative of the real world and could lead the model to make inaccurate predictions on new data.

What does a data analyst do to protect against spurious results?

A data analyst ensures that the training and test sets are independent, meaning they contain different samples of data, to reduce the likelihood of spurious patterns.

Why is model evaluation important?

Model evaluation is crucial to determine how well the model will generalize to new, unseen data. It helps identify areas where the model needs improvement and ensures its reliability.

What is the goal of model adjustment with cross-validation?

The goal is to minimize the error of the model on the test set, ensuring it makes accurate predictions on new data.

Why validate data partitions?

Ensuring the training and test sets have similar distributions of important features to avoid bias and improve model generalization. This prevents the model from performing well on the training data but poorly on unseen data.

What are the benefits of k-fold cross-validation?

It helps mitigate bias by training and testing on different folds of data, giving a more robust model evaluation. Each data point is used in the test set exactly once, making it efficient.

What is the purpose of data mining?

To uncover hidden patterns, insights, and valuable knowledge from large datasets to support decision-making, predict future trends, and improve business processes.

Why is handling missing data important?

Missing values can bias results and lead to inaccurate models. Removing records with missing values can lose valuable information, and replacing them with simple constants might be misleading.

What is the purpose of data normalization?

Scaling data to a similar range to ensure that all features have equal influence on the analysis and prevent features with larger values from dominating the learning process.

Overfitting

When a model learns the training data too well and fails to generalize to new data.

Underfitting

When a model is too simple and doesn't capture the underlying patterns in the data.

Optimal Model Complexity

The model complexity that minimizes error on the test set, balancing model accuracy with generalizability.

What is the goal of model complexity?

The goal is to find the sweet spot where the model is complex enough to capture the patterns in the data, but not too complex that it overfits and loses its ability to generalize.

Study Notes

Data Mining Review

  • Data mining involves extracting knowledge from data.
  • Common tasks in data mining include estimation and prediction.
  • Prediction tasks include regression and classification.
  • Other tasks include association and clustering.
  • Data mining usage is increasing due to commercialization of products, technological advancements, and external pressure.

CRISP-DM Lifecycle

  • CRISP-DM (Cross-Industry Standard Process for Data Mining) is a standard, repeatable process model for analytics projects.
  • The lifecycle includes business understanding, data understanding, data preparation, modeling, evaluation, and deployment.
  • Business understanding defines project requirements and objectives, translates them into a data mining problem definition, and prepares a preliminary strategy for meeting those objectives.
  • Comparing the data needed with the data available clarifies the necessary data preparation steps.
  • The modeling and evaluation phases use a training dataset, learning algorithms, test data, and accuracy metrics to train and evaluate models.

Data Preparation

  • Raw data is often unprocessed, incomplete, or noisy.
  • Data may contain obsolete, redundant fields, missing values, outliers, or be in an unsuitable form.
  • Data cleaning improves data quality by removing problematic entries, columns, or clusters (see the sketch below).
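As a concrete illustration, here is a minimal pandas sketch of these cleaning steps; the DataFrame and its column names are hypothetical.

```python
import pandas as pd
import numpy as np

# Hypothetical raw data with a duplicate row and missing values
df = pd.DataFrame({
    "age": [25, 25, np.nan, 40],
    "city": ["Oslo", "Oslo", "Paris", None],
})

df = df.drop_duplicates()                              # remove redundant entries
df["age"] = df["age"].fillna(df["age"].mean())         # impute numeric field with the mean
df["city"] = df["city"].fillna(df["city"].mode()[0])   # impute categorical field with the mode
print(df)
```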

Arrays in NumPy

  • NumPy arrays store data efficiently due to fixed types and contiguous memory allocation.
  • 1D arrays have one axis, 2D arrays have two axes, and 3D arrays have three axes.
  • NumPy provides slicing for accessing data subsets within arrays; a slice selects elements by start, end, and step indices (see the sketch below).
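A minimal sketch of array creation and slicing, assuming NumPy is installed:

```python
import numpy as np

a = np.array([10, 20, 30, 40, 50, 60])   # 1D array: one axis, fixed dtype
m = a.reshape(2, 3)                      # 2D array: two axes

print(a[1:5:2])   # slice with start=1, end=5, step=2 -> [20 40]
print(m[0, :])    # first row of the 2D array -> [10 20 30]
```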

Pandas

  • Pandas DataFrame is a 2D data structure for tabular data.
  • DataFrames can hold multiple data types (like ndarrays, lists, constants, series, or dictionaries).
  • DataFrames have row and column labels (index) for data organization.
  • The copy method creates deep copies of a DataFrame, so changes to the copy do not affect the original (see the sketch below).
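A small sketch with hypothetical labels and values, showing DataFrame construction from a dictionary and a deep copy:

```python
import pandas as pd

# Row labels (the index) and column labels organize the tabular data
df = pd.DataFrame(
    {"product": ["A", "B", "C"], "sales": [120, 95, 143]},
    index=["r1", "r2", "r3"],
)

deep = df.copy(deep=True)      # deep copy: edits to `deep` do not touch `df`
deep.loc["r1", "sales"] = 0
print(df.loc["r1", "sales"])   # still 120
```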

Statistical Analysis

  • Univariate analysis examines one variable.
  • Bivariate analysis examines the relationship between two variables.
  • Multivariate analysis examines the relationship between multiple variables.
  • Statistical analysis includes univariate, bivariate, and multivariate analyses.

Transformations for Normality

  • Many real-world datasets are not normally distributed.
  • Right-skewed data typically has a longer tail on the right side of the distribution.
  • Left-skewed data has a longer tail on the left side, often observed in test scores.
  • Transformations can be used to achieve normality in data, improving analysis and model performance.
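A short sketch of two common transformations for right-skewed data, run on a synthetic sample; assumes NumPy and SciPy are available:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)   # right-skewed sample

x_log = np.log1p(x)    # log transform pulls in the long right tail
x_sqrt = np.sqrt(x)    # square root is a milder alternative

print(skew(x), skew(x_log), skew(x_sqrt))   # skewness drops after transforming
```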

Outlier Identification

  • Methods for identifying outliers include Z-score standardization, Interquartile Range (IQR), and scatterplots.
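A minimal sketch of the Z-score and IQR rules on a toy array; the cutoffs used here are common illustrative choices, not fixed standards:

```python
import numpy as np

x = np.array([12, 14, 13, 15, 14, 90])   # 90 looks suspicious

# Z-score rule: the classic cutoff is |z| > 3; with this tiny sample we use 2
z = (x - x.mean()) / x.std()
print(x[np.abs(z) > 2])

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
print(x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)])
```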

Confidence Intervals

  • A confidence interval is a range of values that likely contains the true value of a population parameter.
  • It includes a confidence level indicating the probability of containing the parameter. 
  • The general form is Point Estimate +/- Margin of Error.
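A sketch of this general form for a sample mean, using a t-based margin of error on a made-up sample; assumes SciPy:

```python
import numpy as np
from scipy import stats

sample = np.array([4.1, 5.0, 4.7, 5.3, 4.9, 5.1, 4.6])

point_estimate = sample.mean()
# Margin of Error = critical t value * standard error, for 95% confidence
margin = stats.t.ppf(0.975, df=len(sample) - 1) * stats.sem(sample)

print(point_estimate - margin, point_estimate + margin)
```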

Box Plots

  • Box plots show the range across a group (or set).
  • Visualizes the median, quartiles, and outliers of a dataset.
  • Useful for comparing distributions and identifying outliers.

Frequency Heatmaps

  • Heatmaps depict a dataset visually by mapping values to colors.
  • Frequency heatmaps summarize data frequencies (counts) across categories.
  • They are effective for showing or comparing the distributions of multiple data subsets.

Model Complexity

  • Model complexity refers to a model's flexibility: how precisely it can fit the training data.
  • Measuring model complexity helps assess the risk of an overly simplistic or overly complicated model.

Bias and Variance

  • Bias is the error introduced by a model's simplifying assumptions: the gap between its average prediction and the true value.
  • Variance is the model's error that occurs because of its sensitivity to small fluctuations in the training data.
  • An overfitting model has high variance and low bias: it fits the training data too closely and generalizes poorly to new data.
  • An underfitting model has high bias and low variance: it is too simple to capture the patterns in the data.
  • A balanced model keeps bias and variance at appropriate levels; the sketch below illustrates the tradeoff.
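One way to see the tradeoff is to fit polynomials of increasing degree to noisy synthetic data; the degrees below are arbitrary examples. Training error keeps falling with complexity, while test error eventually rises as the model overfits:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)
x_tr, y_tr, x_te, y_te = x[::2], y[::2], x[1::2], y[1::2]

for degree in (1, 4, 10):   # underfit, balanced, overfit
    coeffs = np.polyfit(x_tr, y_tr, degree)
    tr_err = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    te_err = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    print(degree, round(tr_err, 3), round(te_err, 3))
```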

Hypothesis Testing

  • Hypothesis testing assesses whether evidence supports a particular claim.
  • The process involves stating the hypotheses, choosing a significance level, collecting data, and analyzing the evidence to decide whether to reject or fail to reject the null hypothesis.

Learning Models

  • Learning models encompass a range of algorithms, from simple linear regression to more complex methods such as support vector machines (SVMs), decision trees, random forests, and k-nearest neighbors (KNN).

Correlation Coefficients

  • Correlation measures the relationship strength (positive or negative) between two variables.
  • Correlation coefficients range from -1 to +1.
  • Values close to 0 indicate no correlation.
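A minimal sketch on made-up values; np.corrcoef returns the Pearson coefficient:

```python
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6])
score = np.array([52, 55, 61, 64, 70, 74])

r = np.corrcoef(hours, score)[0, 1]   # always in [-1, +1]
print(r)                              # close to +1: strong positive relationship
```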

Linear Regression

  • Linear regression models the relationship between a dependent variable and one or more independent variables using a linear equation.
  • The goal is to find the least squares fit by minimizing the sum of squared residuals (see the sketch below).
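A least squares fit on hypothetical one-feature data, assuming scikit-learn is installed:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])    # one independent variable
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])   # dependent variable

model = LinearRegression().fit(X, y)       # minimizes the sum of squared residuals
print(model.coef_, model.intercept_)       # slope near 2, intercept near 0
print(model.predict([[6]]))                # prediction for an unseen value
```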

Regression Evaluation Metrics

  • Metrics used to evaluate regression models include Mean Absolute Error, Mean Squared Error, Root Mean Squared Error, R-squared (Coefficient of Determination), and Adjusted R-squared. These metrics quantify how far predictions deviate from actual values (see the sketch below).
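A sketch computing these metrics with scikit-learn on made-up predictions:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 9.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.3])

mae = mean_absolute_error(y_true, y_pred)   # average absolute error
mse = mean_squared_error(y_true, y_pred)    # average squared error
rmse = np.sqrt(mse)                         # root mean squared error
r2 = r2_score(y_true, y_pred)               # proportion of variance explained
print(mae, mse, rmse, r2)
```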

Logistic Regression

  • Logistic regression is used for binary classification problems to determine the probability of an outcome.
  • The Sigmoid (S-curve) function models the probability outcomes.
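A minimal binary-classification sketch on toy data; predict_proba exposes the sigmoid probability outputs:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])    # binary target

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[2.0]]))   # [P(class 0), P(class 1)] from the S-curve
```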

K-Nearest Neighbors (KNN)

  • KNN classifies new data points by examining their proximity to existing data points.
  • Uses a distance measure between data points, such as Euclidean distance, for classification and decision-making (see the sketch below).
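A small sketch on toy 2D points, using Euclidean distance and three neighbors:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 1], [1, 2], [2, 1], [6, 5], [7, 7], [6, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean").fit(X, y)
print(knn.predict([[2, 2], [6, 7]]))   # each point takes the vote of its 3 nearest neighbors
```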

Support Vector Machines (SVM)

  • SVMs classify data points by attempting to find the best possible separation, maximizing the margin between classes. 

Decision Trees

  • Decision trees use a series of decision rules —if…then rules—to classify data points.
  • Decision trees represent a system of questions and answers for classification determination.

Random Forest

  • Random forests combine multiple decision trees.
  • This approach reduces variability of single decision tree predictions for a more accurate overall prediction.
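A sketch comparing a single tree with a forest on a synthetic dataset; exact scores will vary, but the forest's averaged prediction is typically more stable:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Averaging many trees usually reduces the variance of a single tree
print(cross_val_score(tree, X, y, cv=5).mean())
print(cross_val_score(forest, X, y, cv=5).mean())
```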

K-Means Clustering

  • Aims to partition data points into clusters of similar characteristics.
  • Clusters are represented by centroid points.
  • K-means iteratively assigns each data point to its nearest centroid and then recomputes each centroid as the mean of its assigned points, repeating until the assignments stabilize (see the sketch below).
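A minimal sketch on toy 2D points, assuming scikit-learn:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.5, 2], [2, 1], [8, 8], [8.5, 9], [9, 8]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # cluster assignment for each point
print(km.cluster_centers_)   # centroid of each cluster
```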

Agglomerative and Divisive Hierarchical Clustering

  • Agglomerative clustering builds clusters bottom-up from individual data points, while divisive clustering starts from a single all-inclusive cluster and splits it into smaller clusters.

Model Parameters

  • Model parameters are internal values learned from the training data.
  • These parameters control the model's behavior. 

Hyperparameters

  • Hyperparameters are set externally, before training begins, and control the learning process.
  • Examples include learning rate, number of epochs, and the number of estimators.
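A sketch of tuning one hyperparameter (the number of neighbors in KNN) with a cross-validated grid search on synthetic data; the parameter grid is an illustrative choice:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)

# n_neighbors is set before training and is never learned from the data
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```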

Miscellaneous Statistics

  • Other concepts reviewed include the mean, median, mode, skewness, normalization, standardization, Z-scores, confidence intervals, interquartile range (IQR), distance functions, and entropy (see the sketch below).
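A compact sketch of several of these measures on a toy array; assumes SciPy 1.9+ for the keepdims argument to stats.mode:

```python
import numpy as np
from scipy import stats

x = np.array([4, 8, 6, 5, 3, 9, 5])

print(np.mean(x), np.median(x), stats.mode(x, keepdims=False).mode)
print(stats.skew(x), stats.iqr(x))             # asymmetry and interquartile range

z = (x - x.mean()) / x.std()                   # z-score standardization
minmax = (x - x.min()) / (x.max() - x.min())   # min-max normalization to [0, 1]
decimal = x / 10 ** len(str(x.max()))          # decimal scaling into (-1, 1)
print(z, minmax, decimal, sep="\n")

print(stats.entropy([0.5, 0.5], base=2))       # entropy of a 50/50 split = 1 bit
```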

Project 2 Deliverable 2

  • The project involves justification, prediction techniques, performance comparison, feature engineering, feature selection, feature scaling, handling missing values, handling imbalanced data, feature encoding, model selection, hyperparameter tuning, best parameter selection, regression and classification evaluation metrics, unsupervised learning (clustering), and supervised learning (regression, classification).

Description

This quiz covers the fundamentals of data mining, including key tasks such as estimation, prediction, and clustering. It also explores the CRISP-DM lifecycle, guiding you through its phases from business understanding to deployment. Test your knowledge of these critical concepts in data analytics.
