Measures of Central Tendency Quiz
47 Questions

Questions and Answers

What is the purpose of measures of central tendency?

  • To analyze how well a model fits the data
  • To identify the most frequent value in the dataset
  • To summarize the spread of data values
  • To determine the average or center of the values in a dataset (correct)

How is the mean calculated in a dataset?

  • By identifying the most frequent value in the data
  • By selecting the middle value after sorting the data
  • By summing all values and dividing by the total count (correct)
  • By calculating the difference between the maximum and minimum values

What does the median represent in a set of numbers?

  • The highest value in the dataset
  • The value that divides the dataset into two equal halves (correct)
  • The average of all values in the dataset
  • The overall spreading of values in the dataset

Which statistical measure would be most appropriate to understand the spread of a dataset?

  • Dispersion measures (correct)

What is the significance of the distribution in data analysis?

  • It shows how often values occur within the dataset (correct)

Which statement regarding measures of central tendency is correct?

  • The median is less affected by outliers compared to the mean (correct)

What kind of issues can statistics help prevent in machine learning?

  • Underfitting and overfitting (correct)

Which of the following is a central tendency measure?

  • Mean (correct)

What type of variables commonly have increasing values over time?

  • Time-related variables (correct)

What is the distinction between data and information in a database context?

  • Data is the collection of facts, while information is the processed data that provides meaning. (correct)

What does knowledge represent in the context of the example given?

  • Indirect conclusions drawn from observations of customer behavior (correct)

Which of the following is NOT a part of data pre-processing?

  • Mining data to extract insights (correct)

What percentage of the knowledge discovery process is estimated to involve data pre-processing?

  • 70-80% (correct)

What could be a consequence of having records with missing values in a database?

  • Inability to draw complete conclusions (correct)

Which statement best describes outliers in data pre-processing?

  • They may skew the results if not addressed. (correct)

What is the primary focus of the Problem Understanding phase in the data analysis process?

  • Defining project aims from a business perspective (correct)

What is a potential issue with the format of data for machine learning models?

  • Incompatible formats may prevent model training. (correct)

Which of the following is NOT a key activity performed during the Data Understanding phase?

  • Joining several data sets (correct)

What crucial questions should be addressed during the Data Understanding phase?

  • Who collected the data and what collection methods were used? (correct)

Which of the following describes a primary task in the Data Preparation phase?

  • Removing anomalies and reformatting data (correct)

What is meant by 'reducing the number of variables' in the Data Preparation phase?

  • Eliminating irrelevant data to simplify analysis (correct)

During which phase do analysts primarily evaluate the outcomes of their models?

  • Evaluation (correct)

Which of the following is a goal during the Deployment phase?

  • Implementing the analysis results in real-world scenarios (correct)

What happens to the weight of a point if it is correctly classified during an iteration?

  • The weight decreases. (correct)

In the boosting algorithm, how is the weight of misclassified observations updated?

  • It is modified by the formula $e^{\alpha} \cdot \text{old weight}$. (correct)

Which aspect of the data is not typically explored during the Data Understanding phase?

  • The cleaning methodologies to be used later (correct)

What does the symbol ϵ represent in the boosting algorithm?

  • The total weight of misclassified observations. (correct)

Which of the following correctly describes the role of alpha (α) in the boosting algorithm?

  • Alpha is calculated based on the error rate and helps modify the weights of misclassified points. (correct)

If ϵ is calculated as 0.4, what would be the value of alpha (α) using the formula $\alpha = 0.5 \cdot \ln{\frac{1 - \epsilon}{\epsilon}}$?

  • 0.2027 (correct)

After updating weights for misclassified observations, the new value of ϵ is found to be 0.2225. What does this indicate?

  • The model still has errors in classification. (correct)

How do weights of observations initially start in the provided example?

  • They all start at 0.1. (correct)

What is the next step after the iteration process is finished in boosting?

  • The fitted model is obtained using the final weights. (correct)

What is the primary purpose of data visualization?

  • To represent information and data in graphical form. (correct)

Which library is known as the first Python data visualization library?

  • Matplotlib (correct)

What type of visualization is best suited for illustrating the difference between two or more items over time?

  • Boxplots (correct)

What do boxplots primarily visualize?

  • The distribution of a continuous feature against values of a categorical feature. (correct)

What is a characteristic feature of Seaborn in relation to Matplotlib?

  • It is a wrapper that allows access to Matplotlib's functionality with less code. (correct)

Which of the following libraries is used for creating interactive plots?

  • Plotly (correct)

What do the five summary statistics displayed in a boxplot represent?

  • Minimum, first quartile, median, third quartile, and maximum. (correct)

Which library is specifically noted for creating maps and geographical data plots?

  • Geoplotlib (correct)

What is the main advantage of using a pair plot for data analysis?

  • Pair plots facilitate the identification of trends and correlations within the data. (correct)

When analyzing a pair plot, what is one of the key findings you can potentially identify?

  • Finding clusters of data points with similar attributes. (correct)

What is NOT a common benefit of using a pair plot in data analysis?

  • Estimating the precision of machine learning models built on the data. (correct)

What is a central purpose of data cleaning as mentioned in the text?

  • Ensuring that the data used for analysis is accurate, consistent, and reliable. (correct)

Which of the following is NOT directly addressed by data cleaning as described in the text?

  • Identifying and analyzing the root cause of missing data points. (correct)

What is the purpose of the "Data Preparation" stage in the data science process?

  • Transforming and cleaning the raw data to make it suitable for analysis. (correct)

Which of the following tasks is NOT typically performed during the "Data Preparation" stage?

  • Identifying the underlying business needs and objectives. (correct)

Flashcards

Data Pre-processing

Data pre-processing is the process of preparing raw data for analysis and machine learning. It involves tasks like cleaning, transforming, and standardizing data to ensure its quality and suitability for the intended analysis.

Distribution

A distribution represents how often values appear within a dataset, showing the frequency or likelihood of each value.

Measures of Central Tendency

Measures of central tendency describe the central or typical value of a dataset. They indicate where most of the data points are clustered.

Mean

The mean is the average of a set of numbers, calculated by summing all values and dividing by the total number of values.

Median

The median is the middle value of a dataset when it's sorted in ascending order. It divides the data into two equal halves.

Measures of Spread

Measures of spread or dispersion describe how data points are distributed around the central tendency. They illustrate the variability or spread of values in a dataset.

Range

The range is the difference between the highest and lowest values in a dataset, indicating the spread of data points across the entire range.

Problem Understanding

The initial stage of data analysis where the project's overall objectives and requirements are defined from a business perspective, converting these goals into a data analysis problem statement and formulating a basic plan to achieve them.

Data Understanding

The process of acquiring a deep understanding of the data, including its origins, structure, and potential issues. It involves exploring the data to identify patterns, anomalies, and any data quality problems.

Data Preparation

The phase where raw data is transformed into a usable format for analysis. This involves tasks like combining datasets, selecting relevant variables, handling missing values, and addressing inconsistencies in the data.

Modeling

This stage involves applying various statistical and machine learning algorithms to extract insights and patterns from the prepared data. This could include building predictive models, clustering data, or identifying key relationships.

Evaluation

The process of evaluating the effectiveness of the chosen model or analysis techniques using various metrics and benchmarks. This helps determine how well the model satisfies the initial business objectives and identifies potential areas for improvement.

Deployment

The final stage where the insights derived from the analysis are deployed into real-world applications. This could involve creating dashboards, reports, or integrating the model into a business process to drive decision-making.

What is Data?

Data is raw, unprocessed facts and figures.

What is Information?

Information is organized, structured, and processed data, providing context and meaning.

What is Knowledge?

Knowledge is derived from information, involving interpretation, analysis, and understanding of relationships.

What is Data Pre-processing?

Data pre-processing is the essential step of cleaning and transforming data before applying data mining techniques.

Why is outdated Data a problem?

Out-of-date or outdated data can lead to inaccurate analysis and unreliable results.

Why is redundant Data a problem?

Redundant data consumes unnecessary storage space and can complicate analysis by repeating information.

Why are missing values a problem?

Missing values can hinder analysis, leading to biased or incomplete results.

Why is inconsistent Data a problem?

Data in incompatible or inconsistent formats can lead to analysis errors and difficulty in integrating different datasets.

What is data visualization?

Data visualization uses visual elements like charts, graphs, plots, and maps to represent information and data.

What are the benefits of data visualization?

Data visualization helps analysts understand patterns, trends, outliers, distributions, and relationships within data.

What is Matplotlib?

Matplotlib is a foundational library for data visualization in Python, providing a wide range of plotting options.

How do libraries like pandas and Seaborn relate to Matplotlib?

Libraries like pandas and Seaborn utilize Matplotlib to simplify data visualization tasks by offering high-level functions.

What is Seaborn?

Seaborn is a Python library built on top of Matplotlib, designed for creating aesthetically pleasing statistical graphics.

What is The Grammar of Graphics?

The Grammar of Graphics is a principle that states any data graphic can be created by combining data with visual components like axes, tick marks, and lines.

What is a boxplot used for?

Boxplots are used to visually compare the distribution of a numeric feature against the values of a categorical feature. They highlight five important statistics.

What is the purpose of a comparison visualization?

A comparison visualization helps illustrate the difference between two or more items at a specific point in time or over a period of time.

Data cleaning

The process of identifying and correcting errors in a dataset. This can include fixing incorrect values, removing duplicates, and handling missing data.

Pair plot

A type of data visualization that shows the relationships between multiple variables in a dataset. It creates a grid of scatter plots where each row and column represents a variable.

Visualize distributions

Visualizing how a single variable is distributed within a dataset. It helps understand the spread and frequency of values.

Identify relationships

Identifying patterns or trends between two or more variables. This includes spotting linear or non-linear relationships that may suggest predictability.

Detect anomalies

Spotting data points that deviate significantly from the rest. Outliers could indicate errors or unique insights.

Find trends

Analyzing data to discover recurring patterns or trends. This can help understand the predictability of relationships.

Find clusters

Identifying groups of data points that share similar characteristics. This hints at potential subpopulations within the dataset.

Find correlations

Measuring the strength and direction of the relationship between two variables. Positive correlation indicates a direct relationship, while negative correlation shows an inverse relationship.

Boosting

A technique for combining multiple weak models (e.g., decision trees) into a strong prediction model, where each model focuses on correcting the errors of the previous ones.

Weights (in Boosting)

Weights assigned to each data point in a training dataset, initially equal but adjusted during the boosting process.

Epsilon (ϵ)

The weighted fraction (total weight) of misclassified data points in a boosting iteration. It represents the model's error rate.

Alpha (α)

A value calculated based on epsilon (ϵ), determining the weight adjustment for each model in boosting.

Weight Updating

The process of updating weights for data points based on their classification results in each boosting iteration.

Final Model Prediction

The combined prediction of multiple weak models in a boosting algorithm.

Sum of Misclassified Weights

The sum of weights for misclassified data points in a boosting iteration.

Boosting Algorithm Process

The algorithm iteratively adjusts weights for data points, builds multiple models, and combines them to create a strong predictor.

Study Notes

Machine Learning

  • Machine learning (ML) is a computer science field that studies algorithms and techniques for automating solutions to complex problems.
  • ML was coined around 1960, combining "machine" (computer/robot) and "learning" (acquiring/discovering patterns).
  • Humans excel at discovering patterns; ML aims to replicate this skill in machines.

Bibliography

  • Larose, Daniel T.: Discovering Knowledge in Data (2014, Wiley)
  • Han Jiawei, Kamber Micheline: Data Mining: Concepts and Techniques (2006, Elsevier)
  • Pang-Ning Tan, Steinbach Michael, Vipin Kumar: Introduction to Data Mining (2014, Pearson)
  • G. James, D. Witten, T. Hastie, R. Tibshirani: An Introduction to Statistical Learning (2013, Springer)
  • Raschka Sebastian, Yuxi Liu, Vahid Mirjalili: Machine Learning with PyTorch and Scikit-Learn (2022, Packt)
  • Data Mining Map by Saed Sayad (http://www.saedsayad.com/data_mining_map.htm)
  • Analytics, Data Mining, and Data Science (http://www.kdnuggets.com/)
  • Kaggle (https://www.kaggle.com/datasets?fileType=csv)

Big Data Example

  • The University of Lodz Library contains approximately 2.8 million volumes.
  • If each document averages 1 MB, the library's data would occupy roughly 2.8 terabytes.
  • A logistics company's database of courier shipments is about 20 terabytes.

Big Data Units

  • SI (International System of Units) prefixes for bytes: KB (kilobytes), MB (megabytes), GB (gigabytes), TB (terabytes), PB (petabytes), EB (exabytes), ZB (zettabytes), YB (yottabytes).
  • IEC (International Electrotechnical Commission) prefixes for bytes: KiB (kibibytes), MiB (mebibytes), GiB (gibibytes), TiB (tebibytes), PiB (pebibytes), EiB (exbibytes), ZiB (zebibytes), YiB (yobibytes).

Data Analysis Process (CRISP-DM)

  • This approach is often used in data mining.
  • Problem Understanding/Business Understanding.
  • Data Understanding.
  • Data Preparation.
  • Modeling.
  • Evaluation.
  • Deployment.

What is Data Mining?

  • Data mining is the art and science of intelligent data analysis.
  • The goal is to uncover meaningful insights and knowledge from data.
  • Building models is commonly used.
  • A model summarizes the discovered knowledge for improved understanding or predictions.

Data Mining vs. Machine Learning

  • Data mining is a technique for uncovering patterns in data, with a focus on precise, new, and useful information.
  • ML entails an algorithm that learns from data, enhancing itself through experience.
  • ML often uses data mining techniques, and in the process helps generate models that can effectively predict future outcomes.

Machine Learning and Data Analysis: Non-OLAP

  • Data Mining (DM) and Machine Learning (ML) are not the same as OLAP (Online Analytical Processing).
  • OLAP is focused on query and analysis of existing data, not the discovery of new patterns or the creation of predictive models.

Common Data Analysis Questions (not DM/ML)

  • How many customers who bought a suit bought a shirt?
  • Which customers are not paying back the loan?
  • Which customers have not renewed their insurance policies?
  • What product did the customers who bought the suit buy?
  • What credit risk does the customer pose?
  • Which customers may leave for another company?

Data Analysis Process - Problem Understanding/Business Understanding

  • This initial step focuses on understanding the project's aims and requirements from a business perspective.
  • The objective is to translate that understanding into a data analysis problem definition and a preliminary plan.

Data Analysis Process - Data Understanding

  • A crucial stage is understanding the data: gaining familiarity with its characteristics and identifying quality problems.
  • Answering questions about the data's origin, collection methods, data meanings, and abbreviations is necessary.

Data Analysis Process - Data Preparation

  • This stage involves all activities that create the final dataset from the raw data.
  • Key activities include the following:
  • merging different data sets
  • reducing the number of variables to only those that are essential
  • cleaning the data, including handling missing values and reformatting it for later use

Data Analysis Process - Modeling

  • This step involves selecting appropriate modeling techniques and calibrating parameters to optimal levels.

Data Analysis Process - Evaluation

  • Assess the model or models to determine how well they meet the established criteria.
  • Before deciding to use the results, verify that all business/research objectives are reflected in the model.

Data Analysis Process - Deployment

  • The final step.
  • The knowledge that has been gained by the model should be organized and presented in a usable way for the customer. In short, use it.

Data Analysis Engineering Summary

  • This outlines how all the previous concepts relate to one another.
  • Programming, statistics, machine learning, data engineering, data science, and visualization are integral parts of this process.

Python's 4 main ML Libraries

  • NumPy: fundamental for scientific computing, handling arrays.
  • Pandas: essential for data manipulation and analysis (tables & time series).
  • Matplotlib: for producing high-quality plots and visualizations.
  • Scikit-learn: for a wide range of ML tasks (classification, regression, clustering, and more).
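
As a quick illustrative sketch (not from the lecture materials), the snippet below uses all four libraries together on made-up data:

```python
# Toy example combining the four libraries; the data is randomly generated.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)                      # NumPy: array handling
x = rng.uniform(0, 10, size=100)
y = 2.5 * x + rng.normal(0, 1, size=100)

df = pd.DataFrame({"x": x, "y": y})                 # Pandas: tabular data

model = LinearRegression().fit(df[["x"]], df["y"])  # Scikit-learn: fit a model
print("estimated slope:", model.coef_[0])           # close to 2.5

df.plot.scatter(x="x", y="y")                       # Matplotlib (via pandas)
plt.show()
```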

Data Analysis in a Graphically based Visual

  • This explains how data elements are represented and related, as well as resources used for data references and data mining.

Data Quality

  • Machine learning algorithms are very sensitive to data quality. Incorrect data can yield incorrect results.
  • Data quality principles include completeness, correctness, and actuality.

Data Noise

  • Label noise: A data point is labeled with a class other than its true class.
  • Inconsistent data: Observations from the same category carry conflicting labels or values.
  • Classification errors: Observations incorrectly assigned to a class.
  • Attribute noise: Inaccurate values in one or more attributes.
  • Missing/unknown values: Values that are absent or unknown, which a model may struggle to handle.

Types of Variables

  • Qualitative (categorical): non-measurable; cannot be uniquely characterized by numbers; examples include brand names, size, and gender.
  • Quantitative (numeric): can be compared, measured, or used in arithmetic operations; examples include height and income.

Quantitative Data Types

  • Discrete: finite or countable set of values (e.g., number of items sold).
  • Continuous: Infinite set of values (e.g., height).

Qualitative Data Types

  • Nominal: categories with no inherent order (e.g., colors).
  • Ordinal: categories with an inherent order (e.g., customer satisfaction ratings).

Transaction-Based Sets

  • Each transaction is a vector; components denote products or items.
  • Example set: {1: Bread, Coke, Milk}, {2: Beer, Bread}.

Data in Graph Form

  • Storing data as vertices connected by edges, with edges indicating relationships.

Normal Distribution (Gaussian Distribution)

  • A common distribution in machine learning used to model and analyze data with a bell-shaped curve. About 68% of the data falls within one standard deviation of the mean.

  • The mean (average) and the standard deviation are helpful for checking whether the data conforms to a normal distribution, an assumption used in many later models.

  • The shape of distributions can be visualized (skewed, normal, and others). Shape helps predict data patterns (e.g. outliers).
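
A minimal simulation (illustrative only, not from the lecture) showing the bell shape and the one-standard-deviation rule:

```python
# Simulate normally distributed data and inspect its mean, spread, and shape.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
data = rng.normal(loc=170, scale=10, size=10_000)   # mean 170, std dev 10

print("sample mean:", data.mean())                              # ≈ 170
print("sample std: ", data.std())                               # ≈ 10
print("share within one std dev:",
      np.mean(np.abs(data - data.mean()) < data.std()))         # ≈ 0.68

plt.hist(data, bins=50)                                          # bell-shaped curve
plt.title("Simulated normal distribution")
plt.show()
```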

Central Limit Theorem

  • A fundamental theorem in statistics used to make inferences about population parameters from sample data.

  • As the sample size increases, the sampling distribution of the mean approaches a normal distribution, regardless of the shape of the original data.

  • Sample means will be close to the population mean, and the variance of the sample mean diminishes as the sample size grows.

  • This theorem is essential for inferential statistics (hypotheses testing and confidence intervals).
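
A small simulation sketch (toy assumptions only): sample means drawn from a clearly non-normal population still look approximately normal.

```python
# Central Limit Theorem demo: means of samples from a skewed population.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)    # skewed, non-normal

sample_means = [rng.choice(population, size=50).mean() for _ in range(2_000)]

print("population mean:", population.mean())             # ≈ 2.0
print("mean of sample means:", np.mean(sample_means))    # close to the above

plt.hist(sample_means, bins=40)                           # roughly bell-shaped
plt.title("Sampling distribution of the mean (n = 50)")
plt.show()
```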

Sampling

  • A sample is a subset of a much larger population.
  • Sampling is the method or process of collecting samples from a population.
  • It is a crucial part of data collection because errors in sampling can affect findings.
  • Samples help infer population information while reducing data collection/management workload.

Data Visualization

  • Visualizing data is useful in many machine learning applications.
  • Visualizing data patterns, trends, outliers, distributions, and relationships are helpful insights into the data being analyzed.

Data Visualization Using Python

  • Python libraries (e.g. Matplotlib, Seaborn, Bokeh, Plotly, geoplotlib, missingno) are available to help construct visualizations of various datasets.

Data Visualization - Comparison

  • Box plots are comparison visualizations that represent the distribution of a continuous feature across different categories.
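
A hedged sketch using seaborn's "tips" example dataset (assumed available; seaborn fetches it on first use):

```python
# Boxplot of a continuous feature (total_bill) grouped by a categorical one (day).
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")                  # example dataset fetched by seaborn
sns.boxplot(data=tips, x="day", y="total_bill")  # min, Q1, median, Q3, max per day
plt.show()
```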

Data Visualization - Relationship

  • Scatter plots are useful for visualizing the relationship and correlation between two or more variables, helping establish whether variables are related.

Data Visualization - Distribution

  • Histograms visually represent the distribution of data, including insights into the data's spread (range) and skewness.

Data Visualization - Composition

  • Composition visualizations are useful in showing the percentage allocation between parts to the whole using methods that include stacked bar charts and pie charts.

Data Visualization - Heatmap

  • A heat map uses colors to show relationships or differing values across many parts or data partitions, and is particularly suited to visualizing a large number of parts at once.

Data Visualization - Pair Plot

  • A pair plot is a matrix of graphs, including histograms and scatter plots of every combination of variables to visualize patterns, relationships, and correlations among variables in a given dataset.
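
A minimal sketch with seaborn's pairplot on a toy DataFrame (column names are made up for illustration):

```python
# Pair plot: scatter plots for every pair of variables, histograms on the diagonal.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "height": rng.normal(170, 10, 200),
    "weight": rng.normal(70, 12, 200),
    "income": rng.lognormal(3, 0.5, 200),
})

sns.pairplot(df)
plt.show()
```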

Data Preprocessing

  • Data preprocessing is the essential initial process for data mining.
  • This covers all data-related techniques involved before the data can be used by the machine learning models and algorithms.
  • The following activities can be part of this process:
  • cleaning
  • removing incorrect, corrupted, inconsistently formatted, and redundant data
  • filling in or removing missing data
  • handling anomalies (outliers)
  • reshaping or normalizing data to meet the model requirement

Handling Missing Values

  • Methods include removal and imputation techniques used to fill in gaps in missing data values.
  • Removal: data partitions with missing values can be excluded.
  • Imputation: replacing missing values using a procedure like random imputation based on known data points, mean or median, or using a predictor model to estimate missing values.
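
A small pandas sketch (toy data, illustrative column names) contrasting removal and median imputation:

```python
# Removal vs. imputation of missing values with pandas.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 35],
                   "income": [3200, 2800, np.nan, 4100]})

dropped = df.dropna()                              # removal: exclude incomplete rows
imputed = df.fillna(df.median(numeric_only=True))  # imputation: fill with the median

print(dropped)
print(imputed)
```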

Handling Outliers

  • Outliers are those data points that are distant from most of the similar data points in a dataset.
  • Handling outliers typically means excluding them from a given subset or from the dataset entirely, while ensuring that crucial data points are not lost in the process.

Transforming Data

  • Transforming data is a crucial part of data preprocessing, required to meet the assumptions and specifications of the chosen machine learning algorithm.
  • Changing data type, normalizing, or using other math operations can be part of the process.
  • Standardization or normalizing values is a common practice.

Data Types for Machine Learning : Numeric

  • z-score standardization (zero mean normalization): Transforms data to have mean 0 and standard deviation 1.
  • Min-Max normalization: Transforms to the range [0, 1].
  • Log Transformation: Applied to variables with distributions that are not symmetric or have a wide range of values to better model relationships.
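
A hedged sketch of the three transformations with scikit-learn and NumPy (toy values):

```python
# z-score standardization, min-max normalization, and a log transformation.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[50.0], [60.0], [80.0], [200.0]])   # toy feature with a wide range

z_scored = StandardScaler().fit_transform(X)       # mean 0, std dev 1
min_maxed = MinMaxScaler().fit_transform(X)        # values in [0, 1]
logged = np.log1p(X)                               # compresses the long right tail

print(z_scored.ravel())
print(min_maxed.ravel())
print(logged.ravel())
```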

Feature Encoding

  • Converts categorical features into integer/numerical values, allowing models to interpret the data effectively by converting strings into numbers.

  • Label Encoding: Replace different categories by an integer representing the category's position within an ordered set of distinct categories.

  • One-Hot Encoding: Converting categories to numeric binary arrays; i.e., categories are represented as independent columns.
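
A brief sketch of both encodings on a toy categorical column (category names are arbitrary):

```python
# Label encoding vs. one-hot encoding.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

sizes = pd.Series(["small", "large", "medium", "small"])

labels = LabelEncoder().fit_transform(sizes)     # e.g. large -> 0, medium -> 1, small -> 2
print(labels)

one_hot = pd.get_dummies(sizes, prefix="size")   # one binary column per category
print(one_hot)
```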

Machine Learning Algorithm Types

  • Supervised learning (SL): Models learn the relationship between input features and a known target in the training data in order to make predictions.

  • Unsupervised learning (UL): No target value; focus is on identifying patterns and relationships within data points.

  • Supervised learning problems come in two types:

  • Classification problems: The model needs to predict a discrete outcome (e.g., spam/no spam).

  • Regression problems: The model predicts a continuous outcome (e.g., house prices).

Unsupervised Techniques

  • Clustering: Identify groups or clusters of similar items where there is no target.
  • Dimensionality reduction (e.g., PCA): Reduces the number of variables by combining highly correlated ones into fewer components.
  • Association rule mining: Identifies relationships among sets of items that frequently occur together (such a set is called an itemset).

Association Analysis

  • Association Rules: Models that predict whether an outcome is related to another outcome or a combination of outcomes, best represented as a series of rules of the form IF X THEN Y.
  • Support: The proportion of transactions in which an itemset appears.
  • Confidence: The proportion of transactions containing X that also contain Y.
  • Lift: Compares the confidence of the rule to the probability of Y occurring by chance.
  • Recommendation Engines: Suggest items similar to a given item based on patterns in the data, using association analysis.
  • Frequent itemsets are collections of items whose support is greater than or equal to the required threshold.
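
A worked toy example (transactions invented for illustration) computing support, confidence, and lift for the rule IF {Bread} THEN {Milk}:

```python
# Support, confidence, and lift computed by hand on a tiny transaction set.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Bread", "Milk"},
    {"Coke", "Milk"},
]
n = len(transactions)

support_bread = sum("Bread" in t for t in transactions) / n                  # 0.75
support_milk = sum("Milk" in t for t in transactions) / n                    # 0.75
support_bread_milk = sum({"Bread", "Milk"} <= t for t in transactions) / n   # 0.5

confidence = support_bread_milk / support_bread    # P(Milk | Bread) ≈ 0.67
lift = confidence / support_milk                   # < 1 here: no positive association

print(support_bread_milk, confidence, lift)
```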

Apriori Property

  • Apriori heuristic: All subsets of a frequent itemset are also frequent; in other words, if a combination of items occurs frequently, every constituent part of that combination also occurs frequently.

The Apriori Algorithm

  • A method for generating frequent itemsets.
  • Involves repeatedly calculating the support for itemsets of increasing size, using the Apriori property to eliminate candidate itemsets that cannot be frequent.
  • This algorithm is an efficient technique to generate rules through a cyclical process.

Hierarchical Clustering

  • Agglomerative clustering begins with each observation as its own cluster; clusters are then merged based on distance measures.

  • Divisive clustering begins with all observations in a single cluster and splits them recursively; the resulting hierarchy is represented as a tree called a dendrogram.

  • Common linkages in an agglomerative approach include:

    • Minimum (single) linkage
    • Maximum (complete) linkage
    • Mean (average) linkage
    • Centroid linkage
    • Ward's method
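
A minimal SciPy sketch (toy two-dimensional data) of agglomerative clustering with Ward's method and a dendrogram:

```python
# Agglomerative clustering: linkage() builds the merge history, dendrogram() plots it.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.5, (10, 2)),    # two well-separated toy groups
               rng.normal(5, 0.5, (10, 2))])

Z = linkage(X, method="ward")                  # Ward's method as the merge criterion
dendrogram(Z)                                  # hierarchical tree of merges
plt.show()
```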

Assessing Clustering Tendency

  • Visual methods: create ordered dissimilarity images (ODIs) that visually represent groupings/clusterings based on similarities between data points; useful for checking whether clusters are present.
  • Hopkins statistic: measures how likely it is that a given dataset was generated by a uniform distribution (i.e., has no clustering tendency).

k-Means Method

  • This method partitions data points into k clusters, each represented by the mean (centroid) of the points assigned to it.
  • The algorithm repeatedly reassigns points to the nearest cluster center and recomputes the centers.
  • The number of clusters, k, is provided a priori by the user.
  • Euclidean distance is typically used to measure closeness.
  • Random Initialization Trap: The initial cluster centers are chosen at random and influence the result, so multiple runs with different initializations are common to minimize this effect.
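
A hedged scikit-learn sketch on toy data; n_init runs several random initializations to soften the random-initialization trap:

```python
# k-means clustering with multiple random initializations.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),      # two toy groups
               rng.normal(8, 1, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)     # mean (centroid) of each cluster
print(km.labels_[:10])         # cluster assignment per observation
```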

Model Selection

  • Model Selection: The process of tuning hyperparameters and comparing different model settings to improve performance on unseen data.
  • A validation set is used to compare different hyperparameter settings, while the test set is held back for the final evaluation of the model.
  • Techniques like k-fold cross-validation and leave-one-out cross-validation are useful in model selection and are better than a single holdout split for smaller datasets.

Resampling

  • A method that repeatedly draws different samples from the same original dataset to train and validate the model, giving a better overall performance estimate. Two common resampling techniques are:
  • k-Fold cross-validation: Divides data into k parts and trains on k-1 parts, and tests on the remaining part in each iteration.
  • Leave-one-out cross-validation (LOOCV): A special case of k-fold cross validation where k equals the number of observations in the data set; i.e., every observation or data point is held out as the testing set once during each iteration.
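
A short scikit-learn sketch; the iris dataset and logistic regression are used here only as stand-ins:

```python
# 5-fold cross-validation and leave-one-out cross-validation.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

kfold_scores = cross_val_score(model, X, y, cv=5)              # 5 train/test splits
loocv_scores = cross_val_score(model, X, y, cv=LeaveOneOut())  # one split per observation

print(kfold_scores.mean(), loocv_scores.mean())
```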

Bootstrap Sampling

  • Creates different training datasets from the original data by repeatedly sampling with replacement, which helps overcome small datasets or bias.
  • A variant is the 0.632 bootstrap, in which the probability that a given data point appears in the training set is about 63.2%; the out-of-bag instances are used to test the model.

Model Performance Metrics

  • Accuracy: Percentage of correctly classified instances.
  • Error rate: Percentage of incorrectly classified instances.
  • Precision: Proportion of correct positive classifications relative to all positive classifications (predicted positives).
  • Recall/Coverage: Proportion of correctly classified positives relative to all positive instances in the original data.
  • Specificity: Proportion of correct negative classifications relative to all negative cases.
  • F1-score: Harmonic mean of precision and recall.
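
An illustrative sketch computing the listed metrics with scikit-learn on made-up labels:

```python
# Classification metrics on toy true/predicted labels.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
print("accuracy:   ", acc)
print("error rate: ", 1 - acc)
print("precision:  ", precision_score(y_true, y_pred))
print("recall:     ", recall_score(y_true, y_pred))
print("F1-score:   ", f1_score(y_true, y_pred))

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("specificity:", tn / (tn + fp))   # correct negatives / all negatives
```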

Kappa Statistic

  • Adjusts accuracy to account for chance predictions.
  • Kappa values typically range from 0 (no better than chance) to 1 (perfect agreement between predicted and actual outcomes); negative values indicate agreement worse than chance.

Neural Networks

  • Neural networks are an attempt to model how neurons in nature function as computational elements.

  • Consist of interconnected layers of nodes (neurons):

    • Input layer
    • Hidden layer(s)
    • Output layer
  • Activation functions, such as the sigmoid (outputs in [0, 1]) and the hyperbolic tangent (outputs in [-1, 1]), determine each neuron's output.

    • Link weights determine how the input variables inform the output of each layer/node/neuron of the neural network.
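
A minimal NumPy sketch of a forward pass through one hidden layer; the weights are arbitrary illustrative values:

```python
# Forward pass: input layer -> hidden layer -> output, with sigmoid activations.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))         # squashes values into [0, 1]

x = np.array([0.5, -1.2, 3.0])              # input layer (3 features)
W_hidden = np.array([[0.2, -0.4, 0.1],      # link weights: input -> hidden (2 neurons)
                     [0.7, 0.3, -0.5]])
W_output = np.array([1.5, -2.0])            # link weights: hidden -> output

hidden = sigmoid(W_hidden @ x)              # hidden layer activations
output = sigmoid(W_output @ hidden)         # output neuron, value in [0, 1]
print(output)
```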

Gradient Descent

  • A method to iteratively minimize a cost/loss function by nudging parameters in the direction of the negative gradient (i.e., by adjusting the weights).

  • Key steps: initialize parameters, compute the gradient of the cost/loss function, update parameters against the gradient using a step size (learning rate), and repeat until the cost function converges (i.e., until the change in parameters becomes insignificant).

  • Controlling the learning rate (step size) is critical to obtaining satisfactory results.
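
A minimal sketch of gradient descent for simple linear regression under a mean squared error loss; the learning rate and iteration count are arbitrary:

```python
# Gradient descent on toy data: fit y = w*x + b by minimizing the MSE.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = 3.0 * x + 1.0 + rng.normal(0, 0.1, 200)    # true slope 3, intercept 1

w, b = 0.0, 0.0          # initialize parameters
lr = 0.1                 # learning rate (step size)

for _ in range(2000):
    y_hat = w * x + b
    grad_w = 2 * np.mean((y_hat - y) * x)      # dMSE/dw
    grad_b = 2 * np.mean(y_hat - y)            # dMSE/db
    w -= lr * grad_w                           # step against the gradient
    b -= lr * grad_b

print(w, b)              # close to 3.0 and 1.0
```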

Gradient Boosting

  • An ensemble method that sequentially builds multiple models, such as decision trees, where each new model tries to correct errors from the previous one.
  • Minimizing the error in this way leads to an iterative, incremental refinement of the ensemble built from the previous models.
  • Gradient descent is sometimes used in combination with boosting; the output of the model is a weighted combination of the outputs of the individual models.
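
A hedged scikit-learn sketch on synthetic data, where each new tree corrects the ensemble's remaining error:

```python
# Gradient boosting for regression on synthetic data.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, random_state=0)
gbr.fit(X_train, y_train)                          # trees are added sequentially

print(mean_squared_error(y_test, gbr.predict(X_test)))
```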

Extreme Gradient Boosting (XGBoost)

  • XGBoost is a popular implementation of Gradient Boosting, which has additional features such as regularization to prevent overfitting.
  • Regularization adds penalties to the objective function, keeping model complexity in check and helping prevent overfitting.

Predicting Continuous Target Variables

  • Regression techniques are used to predict continuous values.
  • Simple linear regression models predict a relationship between a single dependent variable and one predictor variable.
  • Multiple linear regression models predict a relationship between one dependent variable and more than one predictor variable.

Regression Metrics

  • Mean Absolute Error (MAE): Average absolute difference between predicted and actual values.
  • Mean Squared Error (MSE): The average of the squared differences between predictions and actual values.
  • R-squared (R2): Indicates the proportion of variance in the dependent variable explained by the independent variables.
  • Root Mean Squared Error (RMSE): The square root of the MSE. It is easier to interpret as directly showing the typical error relative to the scale of the actual values.
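
A brief sketch fitting a simple linear regression on toy data and computing the listed metrics with scikit-learn:

```python
# Simple linear regression plus MAE, MSE, RMSE, and R-squared.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (100, 1))
y = 4.0 * X.ravel() + 2.0 + rng.normal(0, 1.5, 100)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

mse = mean_squared_error(y, y_pred)
print("MAE: ", mean_absolute_error(y, y_pred))
print("MSE: ", mse)
print("RMSE:", np.sqrt(mse))
print("R2:  ", r2_score(y, y_pred))
```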

Quantile-Quantile (Q-Q) Plots

  • Used to graphically assess whether a dataset follows a specific probability distribution, often used to examine normality of a dataset.
  • Points on the Q-Q plot should fall along a straight line if the data conforms to the assumed distribution.
  • Deviation from a straight line indicates deviation from normality.
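
A minimal SciPy sketch; swapping the normal sample for a skewed one would make the points bend away from the line:

```python
# Q-Q plot of a (simulated) sample against the normal distribution.
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
data = rng.normal(size=500)              # try rng.exponential(size=500) for contrast

stats.probplot(data, dist="norm", plot=plt)   # points near the line suggest normality
plt.show()
```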

Bias-Variance Tradeoff

  • A key concept in machine learning and other modeling areas: there is a trade-off between the model's bias (simplicity) and its variance (sensitivity).

  • A good/optimized model has low bias and low variance, while balancing these two factors is critical to avoiding the issues of overfitting or underfitting.

  • Underfitting refers to the case where the model is too simple and is not capturing the underlying data patterns (high bias).

  • Overfitting refers to the case where the model is too complex and captures noise/unnecessary features instead of the desired data patterns (high variance).

  • The trade-off is between fitting the underlying data pattern well and keeping the model's generalizations from being driven by noisy data points.

Related Documents

Machine Learning Lectures - PDF

Description

Test your knowledge on measures of central tendency and their significance in data analysis. This quiz covers concepts like mean, median, and the impact of these measures in understanding datasets. Perfect for statistics enthusiasts and students!
