Questions and Answers
What is the purpose of measures of central tendency?
- To analyze how well a model fits the data
- To identify the most frequent value in the dataset
- To summarize the spread of data values
- To determine the average or center of the values in a dataset (correct)
How is the mean calculated in a dataset?
- By identifying the most frequent value in the data
- By selecting the middle value after sorting the data
- By summing all values and dividing by the total count (correct)
- By calculating the difference between the maximum and minimum values
What does the median represent in a set of numbers?
- The highest value in the dataset
- The value that divides the dataset into two equal halves (correct)
- The average of all values in the dataset
- The overall spreading of values in the dataset
Which statistical measure would be most appropriate to understand the spread of a dataset?
What is the significance of the distribution in data analysis?
Which statement regarding measures of central tendency is correct?
What kind of issues can statistics help prevent in machine learning?
Which of the following is a central tendency measure?
What type of variables commonly have increasing values over time?
What is the distinction between data and information in a database context?
What does knowledge represent in the context of the example given?
Which of the following is NOT a part of data pre-processing?
What percentage of the knowledge discovery process is estimated to involve data pre-processing?
What could be a consequence of having records with missing values in a database?
Which statement best describes outliers in data pre-processing?
What is the primary focus of the Problem Understanding phase in the data analysis process?
What is a potential issue with the format of data for machine learning models?
Which of the following is NOT a key activity performed during the Data Understanding phase?
What crucial questions should be addressed during the Data Understanding phase?
Which of the following describes a primary task in the Data Preparation phase?
What is meant by 'reducing the number of variables' in the Data Preparation phase?
During which phase do analysts primarily evaluate the outcomes of their models?
Which of the following is a goal during the Deployment phase?
What happens to the weight of a point if it is correctly classified during an iteration?
In the boosting algorithm, how is the weight of misclassified observations updated?
Which aspect of the data is not typically explored during the Data Understanding phase?
What does the symbol ϵ represent in the boosting algorithm?
Which of the following correctly describes the role of alpha (α) in the boosting algorithm?
If ϵ is calculated as 0.4, what would be the value of alpha (α) using the formula $\alpha = 0.5 \cdot \log{\frac{1 - \epsilon}{\epsilon}}$?
After updating weights for misclassified observations, the new value of ϵ is found to be 0.2225. What does this indicate?
How do weights of observations initially start in the provided example?
What is the next step after the iteration process is finished in boosting?
What is the primary purpose of data visualization?
Which library is known as the first Python data visualization library?
What type of visualization is best suited for illustrating the difference between two or more items over time?
What do boxplots primarily visualize?
What is a characteristic feature of Seaborn in relation to Matplotlib?
Which of the following libraries is used for creating interactive plots?
What do the five summary statistics displayed in a boxplot represent?
Which library is specifically noted for creating maps and geographical data plots?
What is the main advantage of using a pair plot for data analysis?
When analyzing a pair plot, what is one of the key findings you can potentially identify?
What is NOT a common benefit of using a pair plot in data analysis?
What is a central purpose of data cleaning as mentioned in the text?
Which of the following is NOT directly addressed by data cleaning as described in the text?
What is the purpose of the "Data Preparation" stage in the data science process?
Which of the following tasks is NOT typically performed during the "Data Preparation" stage?
Flashcards
Data Pre-processing
Data pre-processing is the process of preparing raw data for analysis and machine learning. It involves tasks like cleaning, transforming, and standardizing data to ensure its quality and suitability for the intended analysis.
Distribution
A distribution represents how often values appear within a dataset, showing the frequency or likelihood of each value.
Measures of Central Tendency
Measures of central tendency describe the central or typical value of a dataset. They indicate where most of the data points are clustered.
Mean
Median
Measures of Spread
Range
Problem Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
What is Data?
What is Information?
What is Knowledge?
What is Data Pre-processing?
Why is outdated Data a problem?
Why is redundant Data a problem?
Why are missing values a problem?
Why is inconsistent Data a problem?
What is data visualization?
What are the benefits of data visualization?
What is Matplotlib?
How do libraries like pandas and Seaborn relate to Matplotlib?
What is Seaborn?
What is The Grammar of Graphics?
What is a boxplot used for?
What is the purpose of a comparison visualization?
Data cleaning
Pair plot
Visualize distributions
Identify relationships
Detect anomalies
Find trends
Find clusters
Find correlations
Boosting
Weights (in Boosting)
Epsilon (ϵ)
Alpha (α)
Weight Updating
Final Model Prediction
Sum of Misclassified Weights
Boosting Algorithm Process
Study Notes
Machine Learning
- Machine learning (ML) is a computer science field that studies algorithms and techniques for automating solutions to complex problems.
- ML was coined around 1960, combining "machine" (computer/robot) and "learning" (acquiring/discovering patterns).
- Humans excel at discovering patterns; ML aims to replicate this skill in machines.
Bibliography
- Larose, Daniel T.: Discovering Knowledge in Data (2014, Wiley)
- Han Jiawei, Kamber Micheline: Data Mining: Concepts and Techniques (2006, Elsevier)
- Pang-Ning Tan, Steinbach Michael, Vipin Kumar: Introduction to Data Mining (2014, Pearson)
- G. James, D. Witten, T. Hastie, R. Tibshirani: An Introduction to Statistical Learning (2013, Springer)
- Raschka Sebastian, Yuxi Liu, Vahid Mirjalili: Machine Learning with PyTorch and Scikit-Learn (2022, Packt)
- Data Mining Map by Saed Sayad (http://www.saedsayad.com/data_mining_map.htm)
- Analytics, Data Mining, and Data Science (http://www.kdnuggets.com/)
- Kaggle (https://www.kaggle.com/datasets?fileType=csv)
Big Data Example
- The University of Lodz Library contains approximately 2.8 million volumes.
- If each document averages 1 MB, the library's data would occupy roughly 2.8 terabytes.
- A logistics company's database of courier shipments is about 20 terabytes.
Big Data Units
- SI (International System of Units) prefixes for bytes: KB (kilobytes), MB (megabytes), GB (gigabytes), TB (terabytes), PB (petabytes), EB (exabytes), ZB (zettabytes), YB (yottabytes).
- IEC (International Electrotechnical Commission) prefixes for bytes: KiB (kibibytes), MiB (mebibytes), GiB (gibibytes), TiB (tebibytes), PiB (pebibytes), EiB (exbibytes), ZiB (zebibytes), YiB (yobibytes).
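The two prefix systems differ in base: SI prefixes are decimal (powers of 1000), while IEC prefixes are binary (powers of 1024). A quick sketch of the difference:

```python
# SI prefixes are powers of 1000; IEC prefixes are powers of 1024.
TB = 10**12    # 1 terabyte
TiB = 2**40    # 1 tebibyte

print(f"1 TB  = {TB:>16,} bytes")
print(f"1 TiB = {TiB:>16,} bytes")
print(f"difference: {(TiB - TB) / TB:.1%}")  # a TiB is roughly 10% larger than a TB
```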
Data Analysis Process (CRISP-DM)
- This approach is often used in data mining.
- Problem Understanding/Business Understanding.
- Data Understanding.
- Data Preparation.
- Modeling.
- Evaluation.
- Deployment.
What is Data Mining?
- Data mining is the art and science of intelligent data analysis.
- The goal is to uncover meaningful insights and knowledge from data.
- Building models is a common approach.
- A model summarizes the discovered knowledge for improved understanding or predictions.
Data Mining vs. Machine Learning
- Data mining is a technique for uncovering patterns in data, with a focus on precise, new, and useful information.
- ML entails an algorithm that learns from data, enhancing itself through experience.
- ML often uses data mining techniques and, in the process, generates models that can effectively predict future outcomes.
Machine Learning and Data Analysis: Non-OLAP
- Data Mining (DM) and Machine Learning (ML) are not the same as OLAP (Online Analytical Processing).
- OLAP is focused on query and analysis of existing data, not the discovery of new patterns or the creation of predictive models.
Common Data Analysis Questions (not DM/ML)
- How many customers who bought a suit bought a shirt?
- Which customers are not paying back the loan?
- Which customers have not renewed their insurance policies?
- What product did the customers who bought the suit buy?
- What credit risk does the customer pose?
- Which customers may leave for another company?
Data Analysis Process - Problem Understanding/Business Understanding
- This initial step focuses on understanding the project's aims and requirements from a business perspective.
- The objective is to translate this knowledge into a data analysis problem definition and a preliminary plan.
Data Analysis Process - Data Understanding
- A crucial stage is understanding the data: gaining familiarity with its characteristics and identifying quality problems.
- Answering questions about the data's origin, collection methods, data meanings, and abbreviations is necessary.
Data Analysis Process - Data Preparation
- This stage involves all activities creating the final dataset from the raw data.
- Key activities include the following:
- merging different data sets
- reducing the number of variables to only those that are essential
- cleaning the data, including handling missing values, and reformatting it for better use
Data Analysis Process - Modeling
- This step involves selecting appropriate modeling techniques and calibrating parameters to optimal levels.
Data Analysis Process - Evaluation
- Assess the model or models to determine whether they meet the established criteria.
- Verify that all business/research objectives are reflected in the model before deciding to use its results.
Data Analysis Process - Deployment
- The final step.
- The knowledge that has been gained by the model should be organized and presented in a usable way for the customer. In short, use it.
Data Analysis Engineering Summary
- This outlines how all of the previous concepts relate to one another.
- Programming, statistics, machine learning, data engineering, data science, and visualization are integral parts of this process.
Python's 4 main ML Libraries
- NumPy: fundamental for scientific computing, handling arrays.
- Pandas: essential for data manipulation and analysis (tables & time series).
- Matplotlib: for producing high-quality plots and visualizations.
- Scikit-learn: for a wide range of ML tasks (classification, regression, clustering, model selection).
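A minimal sketch of how these four libraries typically fit together in one small workflow; the data and column names here are invented purely for illustration:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# NumPy: generate a small synthetic dataset
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 2.5 * x + rng.normal(0, 1.5, size=100)

# Pandas: hold the data in a DataFrame
df = pd.DataFrame({"x": x, "y": y})

# Scikit-learn: fit a simple model
model = LinearRegression().fit(df[["x"]], df["y"])
print("estimated slope:", model.coef_[0])

# Matplotlib: visualize the data and the fit
plt.scatter(df["x"], df["y"], s=10)
plt.plot(df["x"], model.predict(df[["x"]]), color="red")
plt.xlabel("x"); plt.ylabel("y")
plt.show()
```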
Data Analysis Landscape (Visual Overview)
- A diagram showing how data elements are represented and related, along with resources used for data references and data mining.
Data Quality
- Machine learning algorithms are very sensitive to data quality: incorrect data can yield incorrect results.
- Data quality principles include completeness, correctness, and actuality.
Data Noise
- Label noise: A data point is incorrectly labeled to its true class.
- Inconsistent data: observations from the same category carry conflicting values or labels.
- Classification errors: Observations incorrectly assigned to a class.
- Attribute noise: Inaccurate values in one or more attributes.
- Missing/unknown values: gaps in the data that a model may struggle to handle or predict from.
Types of Variables
- Qualitative (categorical): non-measurable; cannot be uniquely characterized by numbers; examples include brand names, size, and gender.
- Quantitative (numeric): can be compared, measured, or used in arithmetic operations; examples include height and income.
Quantitative Data Types
- Discrete: finite or countable set of values (e.g., number of items sold).
- Continuous: Infinite set of values (e.g., height).
Qualitative Data Types
- Nominal: categories with no inherent order (e.g., colors).
- Ordinal: categories with an inherent order (e.g., customer satisfaction ratings).
Transaction-Based Sets
- Each transaction is a vector; components denote products or items.
- Example set: {1: Bread, Coke, Milk}, {2: Beer, Bread}.
Data in Graph Form
- Storing data as vertices (nodes) connected by edges, where the edges indicate relationships.
Normal Distribution (Gaussian Distribution)
- A common distribution in machine learning, used to model and analyze data with a bell-shaped curve; the majority of the data (about 68%) falls within one standard deviation of the mean.
- The mean (average) and the standard deviation help determine whether the data conform to a normal distribution, an assumption used in many models.
- The shape of a distribution can be visualized (skewed, normal, and others); shape helps anticipate data patterns (e.g., outliers).
Central Limit Theorem
- A fundamental theorem in statistics used to make inferences about population parameters from sample data.
- As the sample size increases, the sampling distribution of the mean approaches a normal distribution, regardless of the shape of the original data.
- Sample means cluster around the population mean, and the variance of the sample mean shrinks as the sample size grows.
- This theorem is essential for inferential statistics (hypothesis testing and confidence intervals).
Sampling
- A sample is a subset of a much larger population.
- Sampling is the method or process of collecting samples from a population.
- It is a crucial part of data collection because errors in sampling can affect findings.
- Samples help infer population information while reducing data collection/management workload.
Data Visualization
- Visualizing data is useful in many machine learning applications.
- Visualizing data patterns, trends, outliers, distributions, and relationships are helpful insights into the data being analyzed.
Data Visualization Using Python
- Python libraries (e.g. Matplotlib, Seaborn, Bokeh, Plotly, geoplotlib, missingno) are available to help construct visualizations of various datasets.
Data Visualization - Comparison
- Box plots are comparison visualizations visually representing the distribution of a continuous feature across different categories.
Data Visualization - Relationship
- Scatter plots are useful for visualizing the relationship and correlation between two or more variables, helping establish whether the variables are related.
Data Visualization - Distribution
- Histograms visually represent the distribution of data, including insights into the data's spread (range) and skewness.
Data Visualization - Composition
- Composition visualizations are useful in showing the percentage allocation between parts to the whole using methods that include stacked bar charts and pie charts.
Data Visualization - Heatmap
- A heat map uses colors to show relationships or differences in values across many parts or data partitions; it is most useful when there are too many categories to compare easily with simpler charts.
Data Visualization - Pair Plot
- A pair plot is a matrix of graphs, including histograms and scatter plots of every combination of variables to visualize patterns, relationships, and correlations among variables in a given dataset.
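A minimal sketch of these chart types using Seaborn; the "tips" dataset used here is just a convenient example dataset that Seaborn can fetch, not something the notes prescribe:

```python
import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset("tips")  # small example dataset fetched by Seaborn

# Comparison: distribution of a continuous feature across categories (boxplot)
sns.boxplot(data=tips, x="day", y="total_bill")
plt.show()

# Distribution: histogram of a single continuous feature
sns.histplot(data=tips, x="total_bill", bins=20)
plt.show()

# Relationships: pair plot of every numeric variable against every other
sns.pairplot(tips, hue="sex")
plt.show()
```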
Data Preprocessing
- Data preprocessing is the essential initial process for data mining.
- This covers all data-related techniques involved before the data can be used by the machine learning models and algorithms.
- The following activities can be part of this process:
- cleaning
- removing incorrect, corrupted, inconsistently formatted, and redundant data
- fixing or removing missing data
- handling anomalies (outliers)
- reshaping or normalizing data to meet model requirements
Handling Missing Values
- Methods include removal and imputation techniques used to fill in gaps in missing data values.
- Removal: rows (or columns) containing missing values can be excluded.
- Imputation: replacing missing values using a procedure like random imputation based on known data points, mean or median, or using a predictor model to estimate missing values.
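A short pandas/scikit-learn sketch of both strategies; the small DataFrame is made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 41, 35],
                   "income": [50_000, 62_000, np.nan, 58_000]})

# Removal: drop any row that contains a missing value
dropped = df.dropna()

# Imputation: replace missing values with the column median
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(dropped)
print(imputed)
```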
Handling Outliers
- Outliers are those data points that are distant from most of the similar data points in a dataset.
- Handling outliers involves either excluding them (from a subset or from the entire dataset) or adjusting them, while taking care not to lose crucial data points in the process.
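One standard way to flag outliers (a common choice, not one the notes name explicitly) is the interquartile-range rule; a minimal sketch:

```python
import pandas as pd

values = pd.Series([8, 9, 10, 10, 11, 12, 12, 13, 14, 95])  # 95 is an obvious outlier

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers)                       # flags the value 95
trimmed = values.clip(lower, upper)   # alternative to removal: cap extreme values
```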
Transforming Data
- Transforming data is a crucial part of data preprocessing required to meet the assumptions and specifications set in the machine learning process and algorithm.
- Changing data type, normalizing, or using other math operations can be part of the process.
- Standardization or normalizing values is a common practice.
Data Types for Machine Learning: Numeric
- z-score standardization (zero mean normalization): Transforms data to have mean 0 and standard deviation 1.
- Min-Max normalization: Transforms to the range [0, 1].
- Log Transformation: Applied to variables with skewed distributions or a wide range of values, to better model relationships (the sketch below illustrates all three transformations).
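A brief sketch of the three numeric transformations using scikit-learn and NumPy; the income column is invented for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

incomes = np.array([[28_000], [35_000], [52_000], [61_000], [250_000]], dtype=float)

# z-score standardization: mean 0, standard deviation 1
z_scores = StandardScaler().fit_transform(incomes)

# Min-Max normalization: rescale to the range [0, 1]
min_max = MinMaxScaler().fit_transform(incomes)

# Log transformation: compress a wide, skewed range of positive values
logged = np.log(incomes)

print(z_scores.ravel())
print(min_max.ravel())
print(logged.ravel())
```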
Feature Encoding
- Converts categorical features into integer/numerical values so that models can interpret the data effectively.
- Label Encoding: Replace each category with an integer representing the category's position within an ordered set of distinct categories.
- One-Hot Encoding: Convert categories to numeric binary arrays; i.e., each category is represented as an independent column.
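A minimal pandas/scikit-learn sketch of both encodings; the "size" and "color" columns are invented for illustration:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"size": ["small", "large", "medium", "small"],
                   "color": ["red", "blue", "red", "green"]})

# Label encoding: each distinct category becomes an integer
df["size_encoded"] = LabelEncoder().fit_transform(df["size"])

# One-hot encoding: each category becomes its own 0/1 column
one_hot = pd.get_dummies(df["color"], prefix="color")

print(df)
print(one_hot)
```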
Machine Learning Algorithm Types
- Supervised learning (SL): models learn relationships between the target and the training data in order to calculate predictions.
- Unsupervised learning (UL): no target value; the focus is on identifying patterns and relationships within the data points.
- Supervised learning problems fall into two groups:
- Classification problems: The model needs to predict a discrete outcome (e.g., spam/no spam).
- Regression problems: The model predicts a continuous outcome (e.g., house prices).
Unsupervised Techniques
- Clustering: Identify groups or clusters of similar items where there is no target.
- Dimensionality reduction (e.g., PCA): Reduces the number of variables by combining highly correlated ones into fewer components.
- Association rule mining: identifies relationships among sets of items (itemsets) that frequently occur together.
Association Analysis
- Association Rules: model whether an outcome is related to another outcome or a combination of outcomes; this is best represented as a series of rules of the form IF X THEN Y.
- Support: The fraction of transactions in which an itemset appears.
- Confidence: The fraction of transactions containing X in which Y also occurs.
- Lift: Compares the confidence of the rule to the probability of Y occurring by chance.
- Recommendation Engines: Suggest items similar to a given item based on patterns in the data, using association analysis.
- Frequent itemsets are itemsets (collections of items) whose support is greater than or equal to the required threshold (see the worked example below).
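A small worked sketch computing support, confidence, and lift for the rule {Bread} → {Milk}, extending the bread/coke/milk transactions above with a few made-up baskets:

```python
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Bread", "Milk"},
    {"Coke", "Milk"},
    {"Bread", "Coke", "Milk"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

sup_bread = support({"Bread"})
sup_both = support({"Bread", "Milk"})
sup_milk = support({"Milk"})

confidence = sup_both / sup_bread   # P(Milk | Bread)
lift = confidence / sup_milk        # > 1 means Bread and Milk co-occur more than by chance

print(f"support={sup_both:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```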
Apriori Property
- Apriori heuristic (for frequent itemsets): all subsets of a frequent itemset are also frequent; in other words, if a combination of items occurs frequently, its constituent parts also occur frequently.
The Apriori Algorithm
- A method for generating frequent itemsets.
- Involves repeatedly calculating the support for itemsets of increasing size, using the Apriori property to eliminate less-frequent itemsets.
- This algorithm is an efficient technique to generate rules through a cyclical process.
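For a ready-made implementation, one option (an assumption here, not something the notes prescribe) is the third-party mlxtend library, which provides an Apriori routine over a boolean item matrix:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

transactions = [["Bread", "Coke", "Milk"], ["Beer", "Bread"],
                ["Bread", "Milk"], ["Coke", "Milk"], ["Bread", "Coke", "Milk"]]

# Encode the transactions as a boolean item matrix (one column per item)
te = TransactionEncoder()
item_matrix = pd.DataFrame(te.fit(transactions).transform(transactions),
                           columns=te.columns_)

# Generate all itemsets whose support meets the chosen threshold
frequent = apriori(item_matrix, min_support=0.4, use_colnames=True)
print(frequent.sort_values("support", ascending=False))
```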
Hierarchical Clustering
- Agglomerative clustering begins with each observation as its own cluster; clusters are then merged based on distance measures.
- Divisive clustering begins with all observations in a single cluster and splits them recursively; the resulting hierarchy is represented as a tree called a dendrogram.
- Common linkages in the agglomerative approach include:
- Minimum (single) linkage
- Maximum (complete) linkage
- Mean (average) linkage
- Centroid linkage
- Ward's method
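A short sketch of agglomerative clustering with SciPy, using Ward's method on a tiny synthetic dataset:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(1)
# two loose groups of points in 2-D
data = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(5, 0.5, (10, 2))])

# Agglomerative clustering with Ward's linkage
Z = linkage(data, method="ward")

dendrogram(Z)
plt.title("Dendrogram (Ward's method)")
plt.show()
```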
Assessing Clustering Tendency
- Visual methods: create ordered dissimilarity images (ODIs) of the data that visually represent groupings/clusterings based on similarities between data points; useful for checking whether a clustering tendency exists.
- Hopkins statistic: estimates the probability that a given dataset was generated by a uniform distribution (i.e., that it contains no meaningful clusters).
k-Means Method
- This method partitions data points into k clusters, each represented by the mean (centroid) of the points assigned to it.
- The algorithm repeatedly reassigns points to the cluster whose center is closest and then recomputes the centers.
- The number of clusters, k, is provided a priori by the user.
- Euclidean distance is often used for the distance calculation.
- Random Initialization Trap: initial cluster centers are chosen at random and influence the output to some degree, so multiple runs are common to minimize the impact of random initialization on the model (see the sketch below).
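A minimal scikit-learn sketch on synthetic blobs; setting n_init greater than 1 runs the algorithm from several random initializations, which is the usual guard against the random initialization trap:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)
# three synthetic blobs of points
data = np.vstack([rng.normal(loc, 0.6, (50, 2)) for loc in ((0, 0), (5, 5), (0, 5))])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(data)

print("cluster centers:\n", kmeans.cluster_centers_)
print("inertia (within-cluster sum of squares):", round(kmeans.inertia_, 2))
```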
Model Selection
- Model Selection: the process of tuning hyperparameters and comparing different settings for models to improve performance on unseen data.
- A validation set is often used to compare different hyperparameter settings, while the test set is held back for evaluating the validity of the final model.
- Techniques like k-fold cross-validation and leave-one-out cross-validation are useful in model selection and are better than a simple holdout split for smaller datasets.
Resampling
- A method that uses various samples repeatedly from the same original data set to train and validate the model in order to have a better overall performance estimate. Two common resampling techniques include:
- k-Fold cross-validation: Divides data into k parts and trains on k-1 parts, and tests on the remaining part in each iteration.
- Leave-one-out cross-validation (LOOCV): A special case of k-fold cross validation where k equals the number of observations in the data set; i.e., every observation or data point is held out as the testing set once during each iteration.
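A brief scikit-learn sketch of both techniques; the dataset and model here are arbitrary choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# k-fold cross-validation: train on k-1 folds, test on the remaining fold, k times
kfold_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("5-fold accuracy:", kfold_scores.mean().round(3))

# LOOCV: k equals the number of observations, so every point is a test set once
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("LOOCV accuracy:", loo_scores.mean().round(3))
```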
Bootstrap Sampling
- Create different training datasets from the original data, repeatedly, using sampling with replacement to overcome issues of small datasets or bias.
- A variant is the 0.632 bootstrap, where the probability that a randomly chosen data point appears in the training (bootstrap) sample is about 63.2%; the out-of-bag instances are used to test the model.
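The 63.2% figure follows from sampling with replacement: with n observations, the chance that a particular point is never drawn is (1 − 1/n)^n, which approaches e^(−1) ≈ 0.368, so about 1 − 0.368 = 0.632 of the points appear in each bootstrap sample. A quick check:

```python
import math

for n in (10, 100, 10_000):
    p_in_sample = 1 - (1 - 1 / n) ** n
    print(f"n={n:>6}: P(point appears in bootstrap sample) = {p_in_sample:.4f}")

print("limit 1 - 1/e =", round(1 - math.exp(-1), 4))  # ≈ 0.6321
```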
Model Performance Metrics
- Accuracy: Percentage of correctly classified instances.
- Error rate: Percentage of incorrectly classified instances.
- Precision: Proportion of correct positive classifications relative to all predicted positive classifications.
- Recall/Coverage: Proportion of correctly classified positives relative to all actual positive instances in the original data.
- Specificity: Proportion of correct negative classifications relative to all negative cases.
- F1-score: Harmonic mean of precision and recall.
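A compact sketch computing these metrics with scikit-learn on made-up true/predicted labels:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("accuracy   :", accuracy_score(y_true, y_pred))   # (tp + tn) / total
print("precision  :", precision_score(y_true, y_pred))  # tp / (tp + fp)
print("recall     :", recall_score(y_true, y_pred))     # tp / (tp + fn)
print("specificity:", tn / (tn + fp))                    # tn / (tn + fp)
print("F1-score   :", f1_score(y_true, y_pred))
```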
Kappa Statistic
- Adjusts accuracy to account for chance predictions.
- Kappa values typically range from 0 to 1 (negative values are possible when agreement is worse than chance), with 1 representing perfect agreement between predicted and actual outcomes.
Neural Networks
- Neural networks are an attempt to model how neurons in nature function as computational elements.
- They consist of interconnected layers of nodes (neurons):
- Input layer
- Hidden layer(s)
- Output layer
- Activation functions, such as the sigmoid (output range (0, 1)) and the hyperbolic tangent (output range (-1, 1)), determine a neuron's output.
- Link weights determine the relationship between input variables and output: the weights determine how the inputs inform the output of each layer/node/neuron of the network.
Gradient Descent
- A method to optimize a cost/loss function iteratively, nudging parameters along a descending gradient by adjusting weights.
- Key steps: initialize the parameters, calculate the gradient of the cost/loss function, update the parameters along the gradient using a step size (learning rate), and repeat until the cost function converges (i.e., until the change in parameters becomes insignificant).
- Controlling the learning rate (step size) is critical for obtaining satisfactory results (see the sketch below).
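A minimal sketch of these steps, fitting a one-parameter linear model y ≈ w·x by gradient descent on mean squared error; the data and learning rate are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 200)
y = 3.0 * x + rng.normal(0, 1.0, 200)    # true slope is 3.0

w = 0.0                  # initialize the parameter
learning_rate = 0.01

for step in range(200):
    y_pred = w * x
    # gradient of MSE = mean((w*x - y)^2) with respect to w
    grad = 2 * np.mean((y_pred - y) * x)
    w -= learning_rate * grad             # move against the gradient
    if abs(learning_rate * grad) < 1e-6:  # stop when updates become insignificant
        break

print("estimated slope:", round(w, 3))
```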
Gradient Boosting
- An ensemble method that sequentially builds multiple models, such as decision trees, where each new model tries to correct errors from the previous one.
- Minimizing the error in this way leads to iterative improvement: each new model incrementally refines the combined predictions of the previous ones.
- Gradient descent is sometimes used in combination with boosting; the output of the ensemble is a weighted combination of the outputs of the individual models.
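The quiz questions earlier use an AdaBoost-style weighting scheme (a related boosting variant, not gradient boosting itself), where ϵ is the summed weight of misclassified observations and α = 0.5·log((1 − ϵ)/ϵ) is the current weak learner's vote weight. A small sketch of one weight update, assuming that scheme and natural logarithms:

```python
import math

# weights of 5 observations, initially equal
weights = [0.2] * 5
misclassified = [False, True, False, False, True]  # points the weak learner got wrong

epsilon = sum(w for w, m in zip(weights, misclassified) if m)  # weighted error
alpha = 0.5 * math.log((1 - epsilon) / epsilon)                # learner's vote weight
print(f"epsilon={epsilon:.2f}, alpha={alpha:.3f}")

# increase the weight of misclassified points, decrease the rest, then renormalize
updated = [w * math.exp(alpha if m else -alpha) for w, m in zip(weights, misclassified)]
total = sum(updated)
weights = [w / total for w in updated]
print("updated weights:", [round(w, 3) for w in weights])
```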
Extreme Gradient Boosting (XGBoost)
- XGBoost is a popular implementation of Gradient Boosting, which has additional features such as regularization to prevent overfitting.
- Regularization adds penalties to the objective function, discouraging overly complex models and thereby helping to prevent overfitting.
Predicting Continuous Target Variables
- Regression techniques are used to predict continuous values.
- Simple linear regression models predict a relationship between a single dependent variable and one predictor variable.
- Multiple linear regression models predict a relationship between one dependent variable and more than one predictor variable.
Regression Metrics
- Mean Absolute Error (MAE): Average absolute difference between predicted and actual values.
- Mean Squared Error (MSE): The average of the squared differences between predictions and actual values.
- R-squared (R2): Indicates the proportion of variance in the dependent variable explained by the independent variables.
- Root Mean Squared Error (RMSE): The square root of the MSE. Because it is expressed in the same units as the target, it directly shows the typical error relative to the scale of the actual values.
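A compact scikit-learn sketch fitting a simple linear regression and computing the four metrics on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rng = np.random.default_rng(5)
X = rng.uniform(0, 10, (100, 1))
y = 4.0 * X.ravel() + 7.0 + rng.normal(0, 2.0, 100)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

mae = mean_absolute_error(y, y_pred)
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y, y_pred)
print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  R^2={r2:.3f}")
```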
Quantile-Quantile (Q-Q) Plots
- Used to graphically assess whether a dataset follows a specific probability distribution, often used to examine normality of a dataset.
- Points on the Q-Q plot should fall along a straight line if the data conforms to the assumed distribution.
- Deviation from a straight line indicates deviation from normality.
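A short sketch producing Q-Q plots against the normal distribution with SciPy; the two synthetic samples show the straight-line and bent patterns described above:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(11)
normal_data = rng.normal(50, 5, 300)    # should hug the straight line
skewed_data = rng.exponential(5, 300)   # should bend away from it

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
stats.probplot(normal_data, dist="norm", plot=axes[0])
axes[0].set_title("Normal data")
stats.probplot(skewed_data, dist="norm", plot=axes[1])
axes[1].set_title("Skewed data")
plt.show()
```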
Bias-Variance Tradeoff
- A key concept in machine learning and other modeling areas: there is a trade-off between a model's bias (simplicity) and its variance (sensitivity to the training data).
- A good model has low bias and low variance; balancing these two factors is critical to avoiding overfitting or underfitting.
- Underfitting: the model is too simple and does not capture the underlying data patterns (high bias).
- Overfitting: the model is too complex and captures noise or unnecessary features instead of the underlying data patterns (high variance).
- In short, there is a trade-off between how well the model fits the underlying data pattern and how much its generalizations are driven by noisy data points.
Description
Test your knowledge on measures of central tendency and their significance in data analysis. This quiz covers concepts like mean, median, and the impact of these measures in understanding datasets. Perfect for statistics enthusiasts and students!