Data Analysis Quiz: Ordinal and Quantitative Variables
48 Questions

Questions and Answers

Which of the following is NOT a characteristic of ordinal attributes?

  • They can be used to determine an exact difference between values. (correct)
  • They are often used in surveys for customer satisfaction ratings.
  • They measure subjective qualities.
  • They can be ranked or ordered.

What is the main difference between quantitative data and ordinal attributes?

  • Quantitative data is used for describing populations, while ordinal attributes are used for describing individuals.
  • Quantitative data can be discrete or continuous, while ordinal attributes are always discrete.
  • Quantitative data can be used for mathematical calculations, but ordinal attributes cannot. (correct)
  • Quantitative data can be measured objectively, while ordinal attributes are subjective.

Which of the following is an example of a discrete quantitative variable?

  • The number of cars passing a certain point on a highway in an hour. (correct)
  • The height of a student in centimeters.
  • The temperature of a room in degrees Celsius.
  • The weight of a person in kilograms.

Why are single-valued variables considered not useful for data analysis?

  • They carry no information about the variation within the data. (correct)

    What is the primary reason for avoiding the use of identifiers in predictive models?

  • They do not provide information about the relationships between variables. (correct)

    Why is it important to check if a variable is single-valued in the entire dataset, not just the sample?

  • To avoid losing valuable information about rare events that are not captured in the sample. (correct)

    Which of the following is an example of a monotonic variable?

  • The price of a product over time. (correct)

    Why are monotonic variables not useful for predictive models?

  • They do not provide information about the relationships between variables. (correct)

    What is the primary purpose of a pair plot?

  • To simplify the initial stages of data analysis by offering a comprehensive snapshot of potential relationships within the data. (correct)

    Which of the following can be achieved using a pair plot?

  • Visualize distributions, identify relationships, detect anomalies, find trends, find clusters, and find correlations. (correct)

    What happens to the weight of a data point if it is correctly classified during boosting iterations?

  • The weight is decreased. (correct)

    What type of relationships can be observed using pair plots?

  • Both linear and nonlinear relationships. (correct)

    In the boosting algorithm, which of the following is used to update the weights of misclassified observations?

  • The value of ϵ calculated from misclassifications. (correct)

    How is the value of α calculated in the boosting process?

  • $0.5 \times \log\left(\frac{1 - \epsilon}{\epsilon}\right)$ (correct)

    What is the primary purpose of data cleaning in the context of managing data?

  • To ensure data quality and accuracy by fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data. (correct)

    In which stage of data management is data cleaning typically performed?

  • Data Preparation (correct)

    What is the effect of increasing the weight of a misclassified observation in the boosting algorithm?

  • It enhances the focus of the model on the misclassified observation. (correct)

    What is high variance in a model indicative of?

  • The model captures noise and fluctuations in the training data. (correct)

    Which of these factors can contribute to the presence of missing values in a dataset?

  • All of the above. (correct)

    After calculating α, what is the next step for the weights of misclassified observations?

  • They are multiplied by the factor $e^{\alpha}$. (correct)

    Which of the following is NOT a benefit of data cleaning?

  • Reduced data storage space. (correct)

    If ϵ equals 0.4, what would be the new weight for a misclassified observation that initially had a weight of 0.1?

  • 0.1225 (correct)
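
A quick check of this arithmetic, following the AdaBoost-style update described in the surrounding questions (a minimal sketch):

```python
import math

epsilon = 0.4                                    # misclassification rate from the question
alpha = 0.5 * math.log((1 - epsilon) / epsilon)  # ≈ 0.2027
new_weight = 0.1 * math.exp(alpha)               # boost the misclassified observation's weight
print(round(new_weight, 4))                      # 0.1225
```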

    What is the goal of the bias-variance tradeoff?

  • Balance bias and variance to optimize errors. (correct)

    Which of the following is NOT a common reason for data cleaning?

  • To improve data visualization. (correct)

    During boosting, how is the initial weight assigned to each observation?

  • It is equal for all observations. (correct)

    Which of these is a reason for a model to underfit?

  • Inadequate training data size. (correct)

    What does a high value of ϵ indicate in the context of the boosting process?

  • A large number of misclassifications. (correct)

    What technique can help reduce underfitting?

  • Increasing model complexity. (correct)

    What contributes to overfitting in a model?

  • An overly complex model. (correct)

    How can increasing the training dataset size help with overfitting?

  • It improves the model's ability to generalize. (correct)

    Which scenario describes a model that underfits?

  • It cannot capture the complexity of the data. (correct)

    What is the difference between actual values and predicted values in machine learning?

  • Actual values are the original values, while predicted values are the values from the model. (correct)

    The relationship between underfitting and overfitting is characterized by:

  • They represent opposing errors impacting generalizability. (correct)

    Which type of errors is caused by factors that cannot be controlled or mitigated in a machine learning model?

  • Irreducible errors (correct)

    How does high bias in a machine learning model affect its performance?

  • It indicates that the model makes overly simplistic assumptions. (correct)

    Which aspect of variance reflects a model's learning from data noise?

  • Sensitivity to fluctuations in data (correct)

    What is meant by 'reducible errors' in machine learning?

  • These errors are caused by the model's output function not matching the desired output. (correct)

    What is the relationship between bias and training error?

  • Higher bias often leads to higher training error. (correct)

    Which statement correctly describes the effects of variance on a machine learning model?

  • High variance indicates the model is prone to overfitting. (correct)

    In machine learning, what is the key characteristic of irreducible errors?

  • They cannot be reduced due to unknown variables. (correct)

    What does the variable $x$ represent in the normal distribution formula?

  • Variable (correct)

    Which of the following statements is true regarding standard deviations in a normal distribution?

  • Standard deviations subdivide the area under the normal curve. (correct)

    What does the Empirical Rule state about data distribution in a normal curve?

  • 95% of data is within two standard deviations of the mean. (correct)

    How does the size of the standard deviation affect the shape of the normal distribution curve?

  • A larger standard deviation makes the curve wider and shorter. (correct)

    In a standard normal distribution, what are the values of the mean and standard deviation?

  • Mean = 0 and Standard Deviation = 1 (correct)

    Which of the following methods in machine learning typically assumes data is generated from a Gaussian distribution?

  • Gaussian Mixture Models (correct)

    What is the purpose of the Gaussian distribution as a prior distribution in Bayesian machine learning?

  • To represent uncertainty about parameters before data observation. (correct)

    In anomaly detection, what is the primary goal regarding data?

  • To identify rare events or outliers. (correct)

    Study Notes

    Machine Learning

    • A field of computer science that studies algorithms and techniques for automating solutions to complex problems.
    • Coined around 1960, combining "machine" (computer, robot) and "learning" (acquiring/discovering patterns).

    Bibliography

    • Includes various books and online resources on data mining, machine learning and related concepts.
    • Provides specific titles and URLs for further research.

    Big Data Characteristics

• The University of Lodz Library holds approximately 2.8 million volumes.
• Assuming an average document size of 1 MB, the library's holdings amount to roughly 3 terabytes of data.
    • Databases of courier shipments in a logistics company often exceed 20 terabytes.

    Big Data Units

    • The metric system provides decimal prefixes, such as kilo, mega, giga, tera, peta, exa, zetta, and yotta.
    • The IEC (International Electrotechnical Commission) standard uses binary prefixes, such as KiB, MiB, GiB, TiB, PiB, EiB, ZiB, and YiB.

    Machine Learning vs Data Mining

    • Data mining is a technique that discovers patterns in a dataset, while machine learning includes an algorithm that automatically improves through experience.
    • Data mining's origins are in databases and statistics; it's a subset of business analytics.

What Is Not Data Mining and Machine Learning?

    • Data mining and machine learning are not the same as OLAP (Online analytical processing).

    Data Analysis Process

• The Cross-Industry Standard Process for Data Mining (CRISP-DM) is a widely used process for building data mining models.
    • The process typically includes problem understanding, data understanding, data preparation, modelling, evaluation and deployment.

    Data Analysis Process - Problem/Business Understanding

    • The initial phase focuses on understanding the project aims and requirements from a business perspective.
    • It involves converting knowledge into a data analysis problem definition and a preliminary plan designed to achieve the aims.

    Data Analysis Process - Data Understanding

    • This phase begins with data collection and familiarizes users with the data.
    • Activities identify data quality problems and answer questions about the data's origin, collection methods, meaning of rows/columns, and obscure symbols/abbreviations.

    Data Analysis Process - Data Preparation

    • This phase covers activities to construct the final dataset from the raw data, and includes tasks like data cleaning, joining of multiple datasets and reducing the number of variables.

    Data Analysis Process - Modeling

• In this phase, various modeling techniques are selected, applied, and calibrated to optimal parameter values.

    Data Analysis Process - Evaluation

    • Evaluating the model or models involves determining whether they meet the assumptions of the first stage in terms of quality and efficiency.
    • This phase checks for important business or research objectives that may have not been taken into account during earlier phases.

    Data Analysis Process - Deployment

    • Model creation is usually not the final step of the process.
    • Even if the model is intended for gaining knowledge of the data, the knowledge must be organized and presented in a user-friendly way.

    Data Analysis Engineering - Summary

• This summary presents machine learning, data science, data engineering, and visualization as a structured diagram.

    Python for Machine Learning

    • Python is widely used for machine learning due to its readability, simplicity and abundant libraries.

    Python Libraries for Machine Learning

    • NumPy: Fundamental for scientific computing with support for large, multidimensional arrays and matrices, along with mathematical functions.
    • Pandas: Essential for data manipulation and analysis, providing data structures and operations for numerical tables and time series.
• Matplotlib: Produces publication-quality graphs and charts for interactive, static, and animated visualizations.
    • Scikit-learn: Provides a collection of supervised and unsupervised learning algorithms with a uniform interface.
• SciPy: Extends NumPy's capabilities with sophisticated routines for optimization, regression, interpolation, and eigenvalue decomposition.

    Data Analysis Process - Summary

    • This section summarizes a popular data mining approach, CRISP-DM (Cross-Industry Standard Process for Data Mining).

    Data

    • Plural of datum, used to describe quantitative and qualitative information.

    Data in Graph Form

    • A helpful technique for organizing data and highlighting interrelationships in a visual form.

    Data Quality

    • Machine learning algorithms are highly sensitive to the quality of the source data.
    • Following the "GIGO" (garbage in, garbage out) principle, incorrect data input will inherently lead to erroneous results.

    Data Properties

    • Completeness refers to the presence of all required values and content.
    • Correctness indicates accurate and valid data representations.
    • Actuality ensures data validity and timeliness.

    Noise in Data

    • Noise in a dataset can lead to inaccurate predictions.
    • Noise refers to label noise, inconsistent observations, and classification errors which cause incorrectly labeled or observed records.

    Types of Data Variables

    • Qualitative variables are non-measurable and represented by names or labels (e.g., categorical variables)
    • Quantitative variables are measurable and expressed by numbers (e.g., numerical variables)
    • Qualitative data can be:
      • Nominal
      • Ordinal
    • Quantitative data can be discrete (integer) or continuous (real numbers).

    Transaction-Based Sets

    • A dataset that records transactions (e.g., purchases) where each transaction is a vector representing the items.

    Data analysis - summary

    • The data analysis engineering summary provides a comprehensive overview of the relationship between programming, statistics, visualizations, machine learning and data engineering.

    Knowledge and Information

    • Data is unprocessed information that may lack context or meaning.
    • Information is data that's been processed and interpreted.
    • Knowledge is information that has been evaluated and used to draw conclusions.

    Data Pre-processing

• Often, a significant portion of the data analysis process is devoted to preparing the data through cleaning and transformation so that it is fit for data mining and machine learning purposes.
    • This section identifies common data characteristics in a dataset, such as incomplete or incorrect records, and data not in a usable format.

    Descriptive Statistics

    • The field of statistics helps us to quantify our dataset and its results.

    Statistical Measures - Central Tendency

    • Mean: The average of a dataset.
    • Median: The middle value in a sorted dataset.
    • Mode: The most frequent value in a dataset.

    Statistical Measures - Spread/Dispersion

    • Maximum describes the largest value.
    • Minimum describes the smallest value.
    • Range – the difference between maximum and minimum.
• Variance - the average squared deviation from the mean; measures the spread of data.
    • Standard deviation- the square root of the variance.
    • Quantiles/Quartiles - split the data into equal-sized groups.
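
As an illustration, these measures can be computed directly with pandas (a minimal sketch on a hypothetical sample):

```python
import pandas as pd

data = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])               # hypothetical sample

print(data.mean(), data.median(), data.mode()[0])         # central tendency
print(data.max(), data.min(), data.max() - data.min())    # maximum, minimum, range
print(data.var(), data.std())                             # sample variance and standard deviation
print(data.quantile([0.25, 0.5, 0.75]))                   # quartiles
```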

    Measures of central tendency - Summary

    • The summary distinguishes between symmetrical and asymmetrical distribution, and explains the significance of the mean, median & mode of a dataset in these distributions.

    Measures of spread or dispersion - Summary

    • The document identifies different measures to assess spread and dispersion in a dataset such as the maximum, minimum, range, variance, standard deviation and quantiles/quartiles

    Covariance

• Measures the degree to which a change in one variable is associated with a change in another variable.
• Ranges from −∞ to +∞; its magnitude depends on the variables' units, which makes raw covariance hard to compare across pairs.

    Correlation

    • Measures the relationship between two variables.
    • Ranges from -1 to 1, values closer to 1 have a strong positive correlation, values closer to -1 have a strong negative correlation.
    • Pearson, Kendall's Tau and Spearman's rank correlation coefficients identify the type of correlation within a dataset.
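
A minimal pandas sketch of these measures, on hypothetical columns:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4, 5],
                   "y": [2, 4, 5, 4, 5]})   # hypothetical data

print(df.cov())                    # covariance matrix (unbounded, unit-dependent)
print(df.corr(method="pearson"))   # Pearson correlation, in [-1, 1]
print(df.corr(method="spearman"))  # Spearman's rank correlation
print(df.corr(method="kendall"))   # Kendall's tau
```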

    Quantiles (Including Quartiles)

    • q-quantiles are values that partition a finite set of values into q subsets of (nearly) equal sizes.
    • Common quantiles have special names, such as quartiles (four groups), quintiles (five groups), deciles (ten groups), and percentiles (100 groups).

    Skewness

    • Measures the asymmetry of a distribution.
    • Values can be zero, positive or negative.

    Kurtosis

• Measures the tailedness (thickness of the tails) of a distribution compared to a normal distribution.
• Three types are mesokurtic, platykurtic, and leptokurtic.
• A normal distribution has zero excess kurtosis (a raw kurtosis of 3).
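
Both measures are available in SciPy; note that `scipy.stats.kurtosis` returns excess kurtosis (Fisher's definition) by default, matching the zero-for-normal convention above. A minimal sketch:

```python
import numpy as np
from scipy.stats import kurtosis, skew

rng = np.random.default_rng(0)
sample = rng.normal(loc=0, scale=1, size=10_000)  # hypothetical normal sample

print(skew(sample))      # close to 0 for a symmetric distribution
print(kurtosis(sample))  # excess kurtosis, close to 0 for a normal distribution
```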

    Data Analysis and Machine Learning - Summary

    • This summary describes the relationships between data analysis, statistics and machine learning.

    Normal Distribution

    • A continuous probability distribution often assumed in machine learning.
    • Characterized by its symmetrical bell shape.
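
For reference, the density behind the quiz questions above, with $\mu$ the mean and $\sigma$ the standard deviation, is:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

The standard normal distribution sets $\mu = 0$ and $\sigma = 1$.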

    Central Limit Theorem

    • A theorem stating that the sampling distribution of the mean approaches a normal distribution as the sample size increases.
    • The theorem aids in making inferences about population parameters based on sample data.

    Collecting Samples

    • Sampling is the process of collecting sample data from various sources.
• Sampling reduces survey cost and workload while still supporting valid interpretations of the population.

    Data Visualization

    • Data visualization is an important technique frequently used to present information and data in graphical format (e.g. charts, graphs, plots).
    • It can be used to understand data patterns, trends, outliers, distributions and relationships.

    Data Visualization with Python

    • Python relies on several libraries for advanced and sophisticated visualizations of data, such as matplotlib, Bokeh, Plotly, geoplotlib and missingno.

    Visualization - Comparison

• Comparison visualizations illustrate differences between two or more items, often over time.
• Boxplots are frequently used to display distributions of a continuous feature against the categories of another variable.
• Box plots provide a visual summary of the five-number statistics (minimum, 1st quartile, median, 3rd quartile, and maximum) and help identify outliers.

    Visualization - Relationship

• Relationships in data are visualized using a matrix of scatter plots, known as a scatterplot matrix.
• This method helps to visually assess the relationships between pairs of variables.

    Visualization - Distribution

    • Distribution visualizations show the statistical distribution and frequency of values within a dataset.
• Histograms are a common type of distribution visualization used to assess the spread and skewness of data.
    • A histogram represents data using a series of bars whose height indicates the count or frequency of values within the data set.

    Visualization - Composition

    • Compositional visualizations illustrate the component makeup of data
    • Stacked bar charts and pie charts are used to show how a total value has been divided into parts.
    • This illustrates the proportion of data within a given category.

    Visualization - Heatmap

• Heatmaps employ a color scale to depict the magnitude of different values in data.

    Visualization - Pair Plot

    • A pair plot, or scatterplot matrix, is a matrix of graphs that enables the visualization of the relationship between each pair of variables in a dataset.
    • They combine histogram and scatter plots and yield a unique overview of data distributions and correlations.
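
A minimal sketch using seaborn (assuming it is installed; the iris dataset ships with the library):

```python
import matplotlib.pyplot as plt
import seaborn as sns

iris = sns.load_dataset("iris")    # example dataset bundled with seaborn
sns.pairplot(iris, hue="species")  # scatter plots off the diagonal, distributions on it
plt.show()
```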

    Data Quality – Addressing issues

    • Impurity in data can lead to unsatisfactory results in the model fitting process.

    Data Cleaning Techniques

    • Data cleaning aims to improve a dataset for fitting the model.

    Handling Missing Values

    • Missing values can stem from various sources such as human error, methodological issues in data collection, bias and other sources.
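
A minimal pandas sketch of detecting and handling missing values (hypothetical data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31, 40],
                   "city": ["A", "B", None, "A"]})  # hypothetical records

print(df.isna().sum())                            # missing values per column
df["age"] = df["age"].fillna(df["age"].median())  # impute a numeric column
df = df.dropna(subset=["city"])                   # or drop rows missing a category
```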

    Handling Outliers

• Outliers are data points that lie far from the majority of similar points.
    • Outliers can lead to unexpected or erroneous results when fitting models.
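
One common (though not the only) way to flag outliers is the 1.5 × IQR rule; a minimal sketch on hypothetical values:

```python
import pandas as pd

values = pd.Series([11, 12, 12, 13, 12, 11, 14, 13, 95])  # 95 looks suspicious

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr         # the usual fences
print(values[(values < lower) | (values > upper)])    # flags 95
```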

    Transforming Data, Z-Score Standardization

    • Normalizing data improves the efficiency of machine learning algorithms.
• Z-score standardization (or zero-mean normalization) rescales values as $z = (x - \mu)/\sigma$, so that the normalized values have a mean of 0 and a standard deviation of 1.

    Transforming data, min-max normalization

• Min-max normalization transforms raw values into the interval [0, 1] via $x' = (x - x_{\min}) / (x_{\max} - x_{\min})$.

    Log Transformation

• A log transformation is applied to variables with skewed or very wide distributions in order to compress the input range.
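
A minimal NumPy sketch of the three transformations above, on a hypothetical feature:

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0, 3.0, 10.0])  # hypothetical skewed feature

z = (x - x.mean()) / x.std()              # z-score: mean 0, std 1
mm = (x - x.min()) / (x.max() - x.min())  # min-max: rescaled to [0, 1]
logged = np.log1p(x)                      # log(1 + x), safe at zero
```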

    Feature Encoding Techniques

• Categorical features need to be encoded as numerical values because most machine learning models require numeric inputs.
    • Label encoding replaces category names with integer values.
    • One-hot encoding transforms categorical variables into a series of binary variables.
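
A minimal pandas sketch of both encodings (hypothetical feature):

```python
import pandas as pd

df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Label encoding: one integer code per category
df["size_label"] = df["size"].astype("category").cat.codes

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["size"], prefix="size")
```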

    Types of Machine Learning Algorithms

    • Supervised learning involves a target variable to be predicted based on other input variables.
    • Supervised learning can be further divided into classification and regression techniques.
    • Unsupervised learning aims to recognize patterns or relations within data without a defined target variable, and it is used for tasks such as pattern discovery, and clustering.

    Supervised Learning - Classification

    • Classification is a machine learning task where the target variable is a category or discrete label (e.g. spam/not spam).

    Supervised Learning - Regression

    • Regression models are used to predict a continuous target variable, with numerical values (example: pricing of products, weather forecasts).

    Supervised learning - example

    • Application Examples: speech and text recognition or object identification on images.

    Unsupervised Learning - Summary

    • Unsupervised learning has two main techniques, clustering and pattern discovery.
    • Clustering is used to divide a dataset into a number of homogeneous clusters.

    Association Rules

    • Association analysis identifies the relationship between items.

    Association Rules - Historical background

    • Market basket analysis is a common application of these rules, which can uncover the relationship between products purchased by customers together.

    Association Rules - Support

    • It measures the frequency of the itemset in the dataset.

    Association Rules - Confidence

• Measures how often the rule's consequent appears in transactions that contain its antecedent.

    Association Rules - Lift

• Measures how much more often the antecedent and consequent occur together than would be expected by chance (i.e., if they were independent).
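
In formula form, for a rule $A \Rightarrow B$ over $N$ transactions, where $\text{support}(A \cup B)$ counts transactions containing both itemsets:

$$\text{support}(A \Rightarrow B) = \frac{|\{t : A \cup B \subseteq t\}|}{N}, \quad \text{confidence}(A \Rightarrow B) = \frac{\text{support}(A \cup B)}{\text{support}(A)}, \quad \text{lift}(A \Rightarrow B) = \frac{\text{confidence}(A \Rightarrow B)}{\text{support}(B)}$$

A lift of 1 indicates independence; values above 1 indicate a positive association.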

    Apriori Property

• All subsets of a frequent itemset are also frequent.

    Apriori Algorithm

• The Apriori algorithm identifies frequent itemsets by repeatedly evaluating increasingly large candidate itemsets.

    Hierarchical Clustering

    • Hierarchical clustering is an alternative to partitioning to find sets of objects based on their similarity.
    • The result of this approach is typically represented in a tree structure format, called a dendrogram.

Partitions

• Partitioning methods divide the dataset into k disjoint clusters (k-partitions).

    Clustering Methods

    • Partitioning algorithms - divide data into a given number of clusters (k).
    • Hierarchical algorithms - construct hierarchical clustering trees based on the dissimilarity or distance information.

    k-Means Clustering

• A partitioning algorithm in machine learning used to group data into k clusters by maximizing intra-cluster similarity and minimizing inter-cluster similarity.
    • It works by finding centroids in each of the groups calculated.

    K-Means method - How it works

• Randomly select k observations as initial cluster centres.
• Assign each observation to the nearest cluster centroid.
• Recalculate the mean (centroid) of each cluster.
• Iterate until the assignments converge (or a suitable convergence criterion has been met).

    K-Means Method - The k-means++ approach.

    • This approach aims to select initial cluster centers that are as diverse/distant from each other as possible (in order to improve the outcome)
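
A minimal scikit-learn sketch of the procedure above, using the k-means++ initialization (toy data from make_blobs):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # toy data

km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X)   # cluster assignment for each observation
print(km.cluster_centers_)   # the learned centroids
```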

    Determining the Optimal Number of Clusters: Elbow Method

    • The elbow method is based on the sum of squared errors (SSE) within each cluster.

    Determining the Optimal Number of Clusters - Average Silhouette Method

• Uses the average silhouette width to determine the suitable number of clusters (partitions); often a better approach than the elbow method.
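
A minimal sketch comparing the two criteria with scikit-learn (the SSE is exposed as `inertia_`):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # toy data

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k,
          km.inertia_,                      # SSE, for the elbow method
          silhouette_score(X, km.labels_))  # average silhouette width
```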

    The Gap Statistic

• This method compares within-cluster dispersion on the observed data against that on randomly generated reference datasets to choose the number of clusters.

    The Random Initialization Trap

• Randomly chosen initial cluster centres may not reflect the actual structure of the original dataset.
    • The final selection of clusters is sensitive to the initial setting of cluster centers.
    • Therefore, the result can be different each time we run the algorithm.

    Strengths and Weaknesses of K-means Clustering

• Strengths: the method can be configured with different values of k (which may lead to different outcomes), and the underlying mathematical principles and configuration are straightforward.

• Weaknesses: requires specifying the number of clusters k; works only for numerical data; struggles to model clusters with complex geometric shapes; and its reliance on randomly selected initial cluster centres introduces random variability in the results.

    Decision Trees

    • Use a tree-like structure to represent the relationship between input variables and potential outcome values.
• They can be used both for classifying discrete labels (classification trees) and for predicting continuous variables (regression trees).

    Divide and Conquer approach

    • Repeatedly splits the data into smaller groups until data within subsets are homogeneous to achieve a good fit.

    Purity of a Partition

    • A partition is considered pure if all values within it belong to one class (and to no others)
    • A highly impure partition represents data with more mixed class values.

Attribute Splits

• Decision tree induction algorithms provide splitting methods chosen according to the type of variable used to split observations into groups.
• Distinct split methods exist for binary, nominal, and ordinal variables.
• Continuous variables are typically split at a midpoint value between adjacent values in sorted (increasing) order.

    Measures of Impurity

• Information gain measures the reduction in entropy (the expected information required to classify observations) achieved by splitting on a variable.
• The Gini index is less computationally intensive than entropy; a lower value suggests improved purity within a partition.
• Classification error measures the probability of misclassifying minority observations; a lower classification error indicates a higher degree of homogeneity within the partition.

    Entropy

• Entropy is a measure of randomness (impurity) in a dataset; lower entropy indicates a more homogeneous partition. Comparing the entropy of candidate partitions helps determine which splits produce homogeneous groups.
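
A minimal sketch of both impurity measures for a vector of class probabilities:

```python
import numpy as np

def entropy(p):
    """Entropy in bits; 0 for a pure partition, maximal when classes are evenly mixed."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                     # skip zero probabilities (log(0) is undefined)
    return -(p * np.log2(p)).sum()

def gini(p):
    """Gini index; also 0 for a pure partition."""
    p = np.asarray(p, dtype=float)
    return 1.0 - (p ** 2).sum()

print(entropy([1.0, 0.0]), gini([1.0, 0.0]))  # pure: 0.0 0.0
print(entropy([0.5, 0.5]), gini([0.5, 0.5]))  # maximally mixed: 1.0 0.5
```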

    Error Made by the Classifier

    • Training Error, Test Error and Generalization Error.

    Pruning

• The process that reduces the size of a decision tree so it generalizes more effectively to future unseen data (avoiding overfitting).
• Pre-pruning: stops the splitting process before the observations or features are exhausted.
• Post-pruning: reduces complexity after the tree is built by simplifying the tree.
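
In scikit-learn, for example, both flavours map onto DecisionTreeClassifier parameters (a sketch of one option, not the only way to prune):

```python
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: stop splitting early via depth and leaf-size constraints
pre_pruned = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10)

# Post-pruning: grow the tree fully, then simplify via cost-complexity pruning
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01)
```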

    Model Evaluation

• This section describes metrics for evaluating classification models, such as accuracy, error rate, precision, recall, F1-score, and the kappa statistic.
• These metrics quantify a model's performance by comparing the true value against the predicted value for each observation.
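
A minimal scikit-learn sketch with hypothetical labels:

```python
from sklearn.metrics import accuracy_score, classification_report, cohen_kappa_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # hypothetical true labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # hypothetical predictions

print(accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred))  # precision, recall, F1 per class
print(cohen_kappa_score(y_true, y_pred))      # chance-corrected agreement (kappa)
```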

    Cross-validation and validation sets

• Assessing the accuracy of a model through resampling techniques: the dataset is split into new training, validation, and test sets.

    Resampling techniques

• Resampling techniques, such as k-fold cross-validation and the bootstrap method, evaluate a model's performance on different dataset splits, which improves the statistical reliability of the resulting accuracy estimate.
• The holdout method evaluates a model's generalization performance.
• k-fold cross-validation repeats training and evaluation across folds.

    Holdout Method

    • Splits original data into two partitions (training and testing) to improve the evaluation of a model's generalization performance.
    • This approach is suitable for evaluating large datasets.

    Validation Set

• Separates a part of the dataset into a validation set to aid model tuning when several attempts are made with different values of the model parameters.
• A 50/25/25 split between training, validation, and test sets is common; the partitions should be independent of one another and representative of the whole dataset.
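
A minimal sketch of a 50/25/25 split via two calls to scikit-learn's train_test_split (hypothetical arrays):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)  # hypothetical features
y = np.arange(20) % 2             # hypothetical labels

# 50% for training, then split the remainder evenly into validation and test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
```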

    Resampling

    • Repeatedly using different samples of the original data to train and validate a model.

    k-fold Cross validation

• The approach repeatedly trains and evaluates the model using k partitions (folds) of similar size.
    • The model's performance is estimated by averaging across the partitions.
    • The approach is better suited to smaller datasets.
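
A minimal scikit-learn sketch of 5-fold cross-validation:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores, scores.mean())  # one accuracy per fold, then the averaged estimate
```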

    Leave-one-out Cross validation (LOOCV)

    • A special case of k-fold cross-validation, where the number of folds is identical to the number of elements in the dataset.

    Bootstrap Sampling

    • A resampling method that samples from a dataset with replacement to create multiple training datasets.
    • This enables creating many datasets to train a classifier model.
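
A minimal NumPy sketch, sampling with replacement from a hypothetical dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)  # hypothetical dataset

# Each bootstrap sample matches the original size; some points repeat,
# others are left out (roughly 36.8% of points, on average)
samples = [rng.choice(data, size=len(data), replace=True) for _ in range(3)]
```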

    Measuring performance for classification

    • This section helps with understanding classification metrics and their implication.

    Model Evaluation - Measures of Model Quality (Including Precision)

    • Measures used for quantifying model quality in classification including positive predictive value (PPV), negative predictive value (NPV), recall (sensitivity), specificity, F1-score (balanced F-score).

    Class Imbalance Problem

• Classification problems in which the classes are skewed disproportionately towards one class; simply guessing the majority class can then yield deceptively high accuracy.

    The Kappa Statistic

    • The kappa statistic adjusts accuracy by accounting for the possibility of correct predictions by chance.
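
In formula form, with $p_o$ the observed accuracy and $p_e$ the accuracy expected from chance agreement:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$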

    Neural networks

• Algorithms that simulate networks of interconnected neurons.

    Neural Networks - Structure

• Consist of an input layer, hidden layer(s), and an output layer.
• The individual units within the layers are referred to as nodes (neurons).
• Inputs from the dataset are passed on by the input layer without modification.

    Neural Networks - How many nodes?

    • The analyst can configure both hidden layers and the number of nodes.
    • Excessive numbers of nodes may result in overfitting.

    Perceptron learning

• A linear learning algorithm for data with only two categories (binary classification).

    Activation Function

    • A calculation (the activation function) maps the weighted sum to a value representing an output in a neural network.
    • The activation function can be linear or non-linear

    Interpretation of weights

    • Weights in a neural network indicate the contribution or influence of the input variable on the outcome.
    • Lower values indicate weaker connections, while higher values indicate stronger, more impactful connections

    Perceptron - Example

    • A simple neural network consisting of one or more independent neurons, and an activation function used for supervised learning tasks.
    • Given an input vector [x1, x2, x3], the output or the activation value is calculated as a weighted sum of inputs plus bias.
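
A minimal sketch of that forward pass with a step activation (all numbers hypothetical):

```python
import numpy as np

def perceptron(x, w, b):
    """Fire (1) if the weighted sum of inputs plus bias is positive, else 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

x = np.array([1.0, 0.5, -1.0])   # input vector [x1, x2, x3]
w = np.array([0.4, 0.6, 0.2])    # hypothetical weights
print(perceptron(x, w, b=-0.1))  # 0.4 + 0.3 - 0.2 - 0.1 = 0.4 > 0 -> outputs 1
```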

    Gradient Descent

    • An optimization algorithm to find the minimum of a function when finding parameters for a given model.
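
A minimal sketch on a one-dimensional function (model fitting applies the same idea to each parameter):

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    """Repeatedly step against the gradient to approach a minimum."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3)
print(gradient_descent(lambda x: 2 * (x - 3), x0=0.0))  # converges toward 3
```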

    Gradient Boosting Method

    • Builds models sequentially to reduce error; the subsequent model improves in correcting errors/residuals of the prior model.

    Extreme Gradient Boosting

• An advanced form of gradient boosting that incorporates regularization.



    Description

    Test your understanding of key concepts in data analysis, focusing on ordinal attributes and quantitative data. This quiz covers important distinctions, examples, and best practices in handling different types of data for effective predictive modeling.
