Performance in Machine Learning - SKEMA Business School - 2024
Document Details
SKEMA Business School
2024
Francis Wolinski
Summary
This presentation from Skema Business School in Lille, 2024, provides an overview of performance metrics in machine learning. It covers topics such as supervised machine learning, algorithms, underfitting/overfitting, regression, and binary classification. The presentation also discusses fairness metrics and concludes with perspectives on the field.
Full Transcript
Introducing the subject
The presentation opens with a short video on an AI that failed: "AI Camera Ruins Soccer Game For Fans After Mistaking Referee's Bald Head For Ball" (IFLScience; also on YouTube).

Performance Metrics in Machine Learning
1. What is Machine Learning?
2. What are the different kinds of algorithms in Machine Learning? Examples of use.
3. Supervised Machine Learning Methodology
4. Underfitting and Overfitting – Bias and Variance
5. Performance Metrics for Regression
6. Performance Metrics for Binary Classification
7. Fairness Metrics
8. Conclusion and Perspectives

What is Machine Learning?
Machine learning algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so. (Machine Learning Vs Traditional Programming – Avenga)

Example: Escape the Maze
In traditional programming, the data and a program (e.g., a hand-written heuristic) go into the computer, which produces the output. In machine learning, the data and the expected output go into the computer, which produces the program, i.e., the model. (Escape the Maze - Python Coding - Day 6 - Southernfolks Designs; qmaze (samyzaf.com))

Simplified history of AI, ML and DL
[Figure: timeline from AI to Machine Learning, Deep Learning with deep neural networks, and Generative AI with Large Language Models, driven by computation, knowledge, algorithms, data, hardware and software, and interrupted by the first and second winters of AI.]
(https://towardsdatascience.com/notes-on-artificial-intelligence-ai-machine-learning-ml-and-deep-learning-dl-for-56e51a2071c2)

When do you need Machine Learning?
(Hilarious Data Science Jokes on the Web to Lighten Your Mood, analyticsinsight.net)

Applications of Machine Learning

What are the different kinds of algorithms in Machine Learning?
(Machine Learning Algorithms In Layman's Terms, Part 1 | by Audrey Lorberfeld | Towards Data Science)

Supervised Machine Learning: Regression
Regression predicts a continuous-valued attribute associated with an object. A few examples of use (a minimal code sketch follows this list):
1. Stock market prediction. Machine learning regression algorithms can analyse historical stock data, market trends, and various factors influencing stock prices to forecast future values. This information can help investors make informed decisions about buying or selling stocks.
2. Sales forecasting. Regression models can be used to predict future sales based on historical sales data, advertising expenditures, pricing strategies, and other relevant factors. This information is valuable for businesses in planning inventory, managing resources, and setting sales targets.
3. Demand forecasting. Regression models can analyse historical data on product demand, weather patterns, economic indicators, and other variables to forecast future demand. This information is useful for supply chain management, production planning, and optimizing inventory levels.
4. Real estate price prediction. Regression models can consider factors such as location, property size, number of bedrooms, nearby amenities, and historical sales data to predict property prices. This information is valuable for real estate agents, buyers, and sellers to assess property values accurately.
5. Energy consumption prediction. Regression models can analyse historical energy consumption patterns, weather conditions, and other relevant variables to predict future energy consumption. This information can assist energy providers in optimizing resource allocation, managing peak loads, and improving energy efficiency.
6. Risk assessment in insurance. Regression models can be used to assess the risk associated with an insurance policyholder based on various factors such as age, health status, driving records, and previous claims. This information helps insurance companies set appropriate premiums and determine policy terms.
7. Credit scoring. Regression models can analyse a person's credit history, income, employment status, and other relevant factors to predict their creditworthiness. This information is crucial for banks and financial institutions when evaluating loan applications and determining interest rates.
8. Demand for ride-sharing services. Regression models can predict the demand for ride-sharing services based on historical trip data, time of day, weather conditions, and other factors. This information helps ride-sharing companies optimize driver allocation, estimate fare prices, and reduce passenger waiting times.
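To make the regression setting concrete, here is a minimal sketch on synthetic data; the features, target and model choice are illustrative placeholders, not taken from the presentation:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: two numeric features and a continuous target (a "price").
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))                        # feature matrix X
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 1.0, 100)  # continuous target y

model = LinearRegression().fit(X, y)   # train the model using X and y
y_hat = model.predict(X)               # ŷ = model(X), a continuous prediction
print(y_hat[:3])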
Linear Separation vs Non-Linear Separation
[Figure: linearly separable data vs data requiring a non-linear decision boundary.]

Supervised Machine Learning: Classification
Classification identifies which category an object belongs to. A few examples of use (a minimal code sketch follows this list):
1. Email spam filtering. Classification models can be trained to analyse email content and classify incoming emails as either spam or legitimate. The model learns from labelled examples and can accurately predict whether an email is spam, helping in managing email inboxes effectively.
2. Sentiment analysis. Classification algorithms can be used to analyse text data, such as customer reviews or social media posts, and classify them into positive, negative, or neutral sentiment categories. This information is valuable for understanding public opinion, brand monitoring, and customer feedback analysis.
3. Fraud detection. Classification models can be trained on historical data to identify patterns and indicators of fraudulent activities in credit card transactions, insurance claims, or online transactions. This helps in detecting and preventing fraudulent behaviour, saving businesses from financial losses.
4. Disease diagnosis. Classification algorithms can assist in medical diagnosis by analysing patient data, symptoms, and test results to classify individuals into different disease categories. This information can aid healthcare professionals in making accurate diagnoses and providing appropriate treatments.
5. Image recognition. Classification models can be trained on labelled image datasets to classify images into various categories or objects. This is useful in applications such as facial recognition, object detection, autonomous driving, and quality control in manufacturing.
6. Customer segmentation. Classification techniques can group customers based on their demographics, purchasing behaviour, or other relevant features. This segmentation helps businesses tailor their marketing strategies, personalize recommendations, and optimize customer experiences.
7. Document categorization. Classification models can automatically classify documents into different categories, such as news articles, legal documents, or research papers. This enables efficient organization and retrieval of large document collections.
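The classification setting differs only in the target: it is one of a finite set of classes. A minimal sketch, again on synthetic placeholder data:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: two features and a binary target (e.g., 0 = legitimate, 1 = spam).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # the target is one of two classes

clf = LogisticRegression().fit(X, y)      # train the classifier using X and y
y_hat = clf.predict(X)                    # ŷ = model(X), a predicted class
print(y_hat[:10])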
Supervised Machine Learning Methodology

Dataset: individuals are represented by rows of features and a target value. The features of all individuals form a matrix of values X; the targets form a vector of values y.
Regression model: the model predicts a target from the features; the target is a continuous variable, e.g., a physical value, a price, a score…
Binary classification model: the model predicts a target from the features; the target is one of two classes, e.g., healthy/diseased, won't churn/will churn, not fraud/fraud…

Methodology:
❶ Initial split: randomly split the dataset (100%) into a training set (~75%) and a test set (~25%).
❷ Train the model using the features X and the target y of the training set to build a model.
❸ Use the model on the test set to predict ŷ from its features X: ŷ = model(X).
❹ Evaluate the model by comparing y and ŷ on the test set.

Generalization in AI refers to an AI model's ability to apply what it has learned from training data to new, unseen data. [Figure: a model trained on labelled cat/dog images is applied to unseen data.]

Example: Real Estate dataset for regression (Real estate price prediction | Kaggle)
From the features X and the target y: ❷ train the regression model using X and y; ❸ use the model to predict ŷ = model(X); ❹ measure performance by comparing the numbers y and ŷ.
[Figure: Example of Failure in Real Estate]

Example: Titanic dataset for classification (Titanic - Machine Learning from Disaster | Kaggle)
From the features X and the target y: ❷ train the classification model using X and y; ❸ use the model to predict ŷ = model(X); ❹ measure performance by comparing the classes y and ŷ.
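The four-step methodology above can be sketched as follows; the synthetic data and the linear model are placeholders for the Kaggle datasets used in the slides:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1.0, 400)

# ❶ Randomly split the dataset into training (~75%) and test (~25%) sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# ❷ Train the model using the X and y of the training set.
model = LinearRegression().fit(X_train, y_train)

# ❸ Use the model to predict ŷ on the test set: ŷ = model(X).
y_hat = model.predict(X_test)

# ❹ Evaluate the model by comparing y and ŷ.
print(mean_squared_error(y_test, y_hat))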
Global Methodology: CRISP-DM
CRISP-DM: CRoss-Industry Standard Process for Data Mining.

Different Kinds of Data

Structured Data
Definition: highly organized data that fits into a predefined model or schema.
Examples: relational databases, Excel spreadsheets.
Characteristics: easily searchable and analyzable using SQL queries; stored in rows and columns (tabular format); best suited for numerical and categorical data.
Different types of structured data: nominal, ordinal, interval, and ratio measurement scales. (What is the difference between ordinal, interval and ratio variables? Why should I care? - GraphPad FAQ 1089; Types of data measurement scales: nominal, ordinal, interval, and ratio - mymarketresearchmethods.com)

Semi-Structured Data
Definition: data that does not conform to a fixed schema but still has some level of structure (e.g., tags, attributes).
Examples: JSON, XML, CSV, YAML, log files, Swift messages, graphs, etc.
Characteristics: flexible structure that can evolve; easier for machines to parse compared to unstructured data; common in web data and document storage.

Unstructured Data
Definition: data without any specific organization or schema.
Examples: text files, emails, images, videos, audio files, social media content.
Characteristics: lacks a pre-defined format; harder to analyze and process directly; requires advanced tools like Natural Language Processing (NLP) or machine learning for insights.
(What is Structured Data vs Unstructured Data? | Pecan AI)

What data scientists spend the most time doing
Data cleaning. Garbage in, garbage out. (Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says - forbes.com)

What are the advantages of pre-processing and cleaning data before modelling?
1. Data quality. High-quality data leads to better model performance. Data cleaning helps identify and rectify errors, outliers, and missing values, ensuring the data is accurate and reliable.
2. Model performance. Cleaned data leads to more robust and accurate models. Outliers and noisy data can mislead machine learning algorithms, resulting in subpar predictions.
3. Reduced bias. Biased data can lead to biased models. Data cleaning helps in mitigating biases, ensuring that the model provides fair and unbiased predictions.
4. Feature engineering. Data cleaning often involves feature engineering, where irrelevant or redundant features are removed, and relevant ones are transformed or created. This enhances the model's ability to extract meaningful patterns.
5. Computational efficiency. Cleaning data reduces its complexity, making it easier and faster for algorithms to process. This is especially important when dealing with large datasets.
6. Improved interpretability. Cleaned data is more interpretable. When stakeholders understand the data, they can better trust and use the model's output.
7. Generalization. Data cleaning improves a model's ability to generalize from the training data to new, unseen data. This ensures the model performs well in real-world scenarios.
8. Error reduction. By addressing data quality issues beforehand, the likelihood of errors in predictions is significantly reduced, leading to more reliable results.

Underfitting and Overfitting in Regression and in Classification
An underfitted model is too simple to be able to learn anything from the data; a well-fitted model can generalize properly from the data; an overfitted model is too complex and was able to learn the noise in the data. (Overfitting and Underfitting – The Correlation; Underfitting and Overfitting in Machine Learning – C#, Python, Data Structure, Algorithm and ML (wordpress.com))

Definitions
Bias: the bias measures how well the average prediction of a model matches the true value. It quantifies the systematic error of the model. Formula: Bias = E[ŷ] − y.
Variance: the variance measures the variability of the model's predictions for different training sets. It quantifies the model's sensitivity to changes in the training data. Formula: Variance = E[(ŷ − E[ŷ])²].
Geometric perspective: [Figure: darts on targets illustrating bias and variance. By Bernhard Thiery - Bias–variance tradeoff - Wikipedia]

Model perspective and performance
- Low bias, high variance: the model is accurate but inconsistent; good performance on the training set but poor performance on the test set. Overfitting.
- High bias, high variance: the model is inaccurate and inconsistent. Worst model.
- Low bias, low variance: the model is accurate and consistent; good performance on training and test sets. Appropriate fitting.
- High bias, low variance: the model is consistent but inaccurate; poor performance on training and test sets. Underfitting.

Bias-Variance Trade-off
A good model should be neither too simple nor too complex to provide good results on both training and test sets. [Figure: the error on the training set keeps decreasing as model complexity grows, while the error on the test set goes through a minimum between the underfitting and overfitting regions.] (Decoding Bias Variance Tradeoff. Bias — An error that on average tells… | by Saurav Agrawal | Jun, 2023 | Medium)
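The trade-off can be reproduced in a few lines: fitting polynomials of increasing degree to noisy data shows the training error shrinking while the test error rises again. The data and the degrees (1, 4, 15) are arbitrary choices for this sketch:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, 80)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

for degree in (1, 4, 15):   # too simple / about right / too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          mean_squared_error(y_train, model.predict(X_train)),   # training error
          mean_squared_error(y_test, model.predict(X_test)))     # test error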
How to reduce underfitting and overfitting
Underfitting (reducing bias):
- more data (enlarge the dataset)
- feature engineering (add features to the dataset)
- less regularization (−)
- increase model complexity
- cross-validation and hyperparameter tuning
- train longer
- …
Overfitting (reducing variance):
- more data for training (enlarge the train set)
- feature selection (remove features from the dataset)
- more regularization (+)
- decrease model complexity
- cross-validation and hyperparameter tuning
- early stop
- …

Performance Metrics for Regression
To evaluate a regression model, we need to compute the error due to the gap between y and ŷ. It is important to note that the choice of metric depends on the specific problem, the nature of the data, and the context in which the model will be used. It is recommended to consider the characteristics of your data and the objectives of your analysis to select the most appropriate metric for your regression task. We are going to deal with 3 metrics:
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- Mean Absolute Error (MAE)

Mean Squared Error (MSE)
MSE calculates the average of the squared differences between the predicted and actual values: MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)². It gives higher weights to larger errors due to the squaring operation. MSE is widely used as a general-purpose metric for regression and is sensitive to outliers. It is suitable when large errors are particularly important and need to be penalized heavily.

Root Mean Squared Error (RMSE)
RMSE is the square root of the MSE (RMSE = √MSE) and provides an interpretable measure in the same units as the target variable. It is commonly used when you want to report the average error in the original scale of the data and want to penalize large errors.

Mean Absolute Error (MAE)
MAE calculates the average of the absolute differences between the predicted and actual values: MAE = (1/n) Σᵢ |yᵢ − ŷᵢ|. Unlike MSE, it does not involve squaring, which makes it less sensitive to outliers. MAE is useful when you want a metric that is more robust to extreme values and when you want to focus on the magnitude of errors rather than their direction.

Example: Real Estate dataset
Sample of (y, ŷ) pairs: (43.2, 48.0), (40.8, 41.4), (27.7, 30.6), (34.2, 30.3), …

metrics | train | test 1
MSE     | 66.98 | 108.53
RMSE    |  8.18 |  10.42
MAE     |  6.08 |   6.31

The test set contains an outlier, the pair (117.5, 41.4). What is an outlier in real estate?

Example: Real Estate dataset without the outlier

metrics | train | test 1 | test 2 (without the outlier)
MSE     | 66.98 | 108.53 | 53.41
RMSE    |  8.18 |  10.42 |  7.31
MAE     |  6.08 |   6.31 |  5.63

A few other metrics in regression
- R-squared (R²) measures the proportion of the variance in the target variable that can be explained by the regression model. It ranges from 0 to 1, where 1 indicates a perfect fit. R-squared is often used to assess how well the model captures the variability of the data. However, it does not provide information about the magnitude or direction of errors.
- Mean Percentage Error (MPE) calculates the average percentage difference between the predicted and actual values. It is commonly used in forecasting problems and provides a measure of relative error. MPE is useful when you want to assess the average percentage deviation of the predictions from the actual values.
- Mean Absolute Percentage Error (MAPE) is like MPE but takes the absolute value of the percentage differences. It provides a measure of the average relative error and is commonly used when the scale of the data varies widely across different samples.
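These metrics are one-liners to compute. Below, they are applied only to the four sample pairs shown above (not the full Kaggle dataset, so the results will not match the tables); the MPE sign convention is one common choice, assumed here:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y     = np.array([43.2, 40.8, 27.7, 34.2])   # sample of actual values from the slide
y_hat = np.array([48.0, 41.4, 30.6, 30.3])   # corresponding predictions

mse  = mean_squared_error(y, y_hat)
rmse = np.sqrt(mse)                          # same unit as the target
mae  = mean_absolute_error(y, y_hat)
r2   = r2_score(y, y_hat)
mpe  = np.mean((y - y_hat) / y)              # assumed sign convention
mape = np.mean(np.abs((y - y_hat) / y))
print(mse, rmse, mae, r2, mpe, mape)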
It is recommended to use a combination of metrics to capture a comprehensive understanding of the model's strengths and weaknesses. The most used metrics are highlighted in green on the slide.

MSE – Mean Squared Error
+ penalizes large errors
- sensitive to outliers
- not in the original unit

RMSE – Root Mean Squared Error
+ penalizes large errors
+ in the original unit
- sensitive to outliers

MAE – Mean Absolute Error
+ less sensitive to outliers
+ in the original unit
- treats positive and negative errors equally

R² – R-squared
+ shows how well the model fits the data
- does not capture the accuracy of individual predictions
- can give false information when overfitting

MPE – Mean Percentage Error
+ computes the average magnitude of errors relative to the true values
- sensitive to outliers
- sensitive to near-zero actual values

MAPE – Mean Absolute Percentage Error
+ easy to interpret
+ used in forecasting and time series analysis
- sensitive to near-zero actual values

Example of other metrics with the real estate dataset:

metrics | train | test 1 | test 2
MSE     | 66.98 | 108.53 | 53.41
RMSE    |  8.18 |  10.42 |  7.31
MAE     |  6.08 |   6.31 |  5.63
R²      |  0.60 |   0.53 |  0.68
MPE     | -0.05 |  -0.05 | -0.05
MAPE    |  0.19 |   0.19 |  0.19

Performance Metrics for Binary Classification

Perfect classifier: on a dataset of positive and negative instances, a perfect classifier selects exactly the positives.
Standard classifier: a standard classifier produces true positives, false positives, false negatives and true negatives.

Contingency table:

                     | Observation: Positive | Observation: Negative
Prediction: Positive | True Positive (TP)    | False Positive (FP)
Prediction: Negative | False Negative (FN)   | True Negative (TN)

False Positive vs False Negative: false positives are the noise of the classifier; false negatives are its silence.

Accuracy
Accuracy is a common performance metric used to evaluate the performance of classification models. It measures the proportion of correctly predicted instances out of the total number of instances in a dataset: Accuracy = (TP + TN) / (TP + FP + FN + TN). Accuracy is particularly useful when the classes in the dataset are balanced, meaning they have roughly equal representation.

Precision
Precision is a performance metric used to evaluate the quality of a classification model, particularly in binary classification problems. Precision measures the proportion of correctly predicted positive instances (true positives) out of the total instances predicted as positive (true positives plus false positives): Precision = TP / (TP + FP). It provides an indication of how well the model predicts positive instances and how likely it is to correctly identify them.

Recall
Recall, also known as sensitivity or true positive rate, is a performance metric used to evaluate the effectiveness of a classification model, particularly in binary classification problems. Recall measures the proportion of correctly predicted positive instances (true positives) out of the total actual positive instances (true positives plus false negatives): Recall = TP / (TP + FN). It quantifies the ability of the model to identify all positive instances correctly.
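The three definitions reduce to simple ratios over the contingency table; the counts below are hypothetical:

# Hypothetical confusion-matrix counts.
TP, FP, FN, TN = 80, 20, 10, 90

accuracy  = (TP + TN) / (TP + FP + FN + TN)   # 0.85
precision = TP / (TP + FP)                    # 0.80
recall    = TP / (TP + FN)                    # ≈ 0.89
print(accuracy, precision, recall)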
Example: Titanic dataset with a very simple classifier
Let us implement a very simple classifier based on the feature gender: female → survived, male → not survived. Let us compute the metrics on the whole dataset, just to figure out the numbers.

                               | Observed: survived | Observed: not survived
Predicted: survived (female)   | 233                | 81
Predicted: not survived (male) | 109                | 468

Metrics   | Score 1
Accuracy  | 0.787
Precision | 0.742
Recall    | 0.681

Example: Titanic dataset with another simple classifier
Let us implement another simple classifier based on the features gender (same as above) and infancy (below 6 years old). Let us compute again the metrics on the whole dataset, just to figure out the numbers.

                                              | Observed: survived | Observed: not survived
Predicted: survived (female or infant)        | 248 (+15)          | 89 (+8)
Predicted: not survived (male and not infant) | 94 (-15)           | 460 (-8)

Metrics   | Score 1 | Score 2
Accuracy  | 0.787   | 0.795
Precision | 0.742   | 0.736
Recall    | 0.681   | 0.725

[Figure: density vs random selection. A Survey of Online Failure Prediction Methods (researchgate.net)]

A few other metrics in binary classification
- F1-Score is the harmonic mean of precision and recall, combining both metrics into a single value. It provides a balanced measure that considers both false positives and false negatives. The F1 score ranges from 0 to 1, where 1 indicates perfect precision and recall.
- Specificity, also known as true negative rate, measures the proportion of correctly predicted negative instances (true negatives) out of the total actual negative instances (true negatives plus false positives). It quantifies the ability of the model to correctly identify negative instances.
- Area Under the ROC Curve (AUC-ROC) is a performance metric that measures the area under the Receiver Operating Characteristic (ROC) curve. The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) at various classification thresholds. A higher AUC-ROC indicates better discrimination ability of the model.

It is recommended to use a combination of metrics to capture a comprehensive understanding of the model's performance and trade-offs. It is crucial to consider the consequences of misclassifications.

Accuracy
+ simple to understand and interpret
- sensitive to unbalanced classes

Precision
+ focuses on the correctness of positive predictions
- does not consider false negatives

Recall
+ focuses on the correctness of positive instances
- does not consider false positives

F1-Score
+ balances the trade-off between precision and recall
- implies equal weight on false positives and false negatives

Specificity (True Negative Rate)
+ focuses on the correctness of negative instances
- does not consider false negatives

AUC (Area Under the ROC Curve)
+ aggregated measure of the model's performance
+ focuses on the model's ability to rank instances correctly
- less interpretable
- sensitive to unbalanced classes

Example of other metrics with the Titanic dataset:

Metrics     | Score 1 | Score 2
Accuracy    | 0.787   | 0.795
Precision   | 0.742   | 0.736
Recall      | 0.681   | 0.725
F1-Score    | 0.710   | 0.730
Specificity | 0.852   | 0.838
AUC-ROC     | 0.770   | 0.780

[Figure: example of ROC/AUC curves with the Titanic dataset.]
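F1 and specificity can be recomputed directly from the Score 1 contingency table above (AUC-ROC cannot: it needs predicted scores rather than counts):

# Counts of the gender-only classifier on the Titanic dataset (from the table above).
TP, FP, FN, TN = 233, 81, 109, 468

precision   = TP / (TP + FP)                                  # ≈ 0.742
recall      = TP / (TP + FN)                                  # ≈ 0.681
f1          = 2 * precision * recall / (precision + recall)   # ≈ 0.710
specificity = TN / (TN + FP)                                  # ≈ 0.852
print(round(f1, 3), round(specificity, 3))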
Example of cost-sensitive learning
- Fraud detection in banking and finance: in fraud detection systems, missing a fraudulent transaction (false negative) is usually much costlier than falsely flagging a legitimate transaction as fraud (false positive). Cost-sensitive learning helps prioritize minimizing false negatives, possibly at the expense of increasing false positives, which can be reviewed manually.
- Diagnosis: in medical diagnosis, the cost of missing a serious condition (false negative) is extremely high because it can lead to no treatment and potentially the patient's death. On the other hand, a false positive might lead to unnecessary worry or further tests, which is generally less costly. Therefore, cost-sensitive learning is employed to minimize false negatives for critical conditions.
- Customer churn prediction: for businesses, especially in the subscription-based model, predicting which customers are likely to churn (cancel their subscription) is crucial. The cost of falsely identifying a loyal customer as at risk of churning (false positive) might be an unnecessary discount or offer, whereas failing to identify a real at-risk customer (false negative) could result in the permanent loss of that customer. Cost-sensitive learning can balance these costs to optimize retention strategies.
- Credit scoring: when financial institutions decide whom to give loans to, missing a potential default (false negative) could mean a significant financial loss, while denying a loan to someone who would have repaid it (false positive) means a lost opportunity for profit. Cost-sensitive learning can help in tuning the prediction models to minimize costly defaults.

Example with fraud detection
The contingency table (selected / not selected vs fraudulent / legitimate) is paired with a cost matrix that assigns a business cost to each cell:

Cost matrix  | Fraudulent     | Legitimate
Selected     | 0 €            | −10 €
Not selected | −amount + 20 € | 0.02 × amount

Flagging a fraud costs nothing, flagging a legitimate transaction costs a 10 € manual review, missing a fraud loses the transaction amount (offset by a fixed 20 € fee), and accepting a legitimate transaction earns a 2% margin on the amount.
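Cost-sensitive evaluation simply replaces accuracy with the total business cost of the contingency table. A sketch with hypothetical labels and simplified fixed unit costs (the slide's costs depend on the transaction amount, which is omitted here):

import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = fraudulent; prediction 1 = selected (flagged).
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 0])
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Simplified placeholder costs per cell: review cost for FP, lost amount for FN,
# margin for TN.
total_cost = tp * 0.0 + fp * (-10.0) + fn * (-100.0) + tn * 2.0
print(tn, fp, fn, tp, total_cost)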
High Precision Algorithms
High precision algorithms minimize false positives, which are visible to users in the selection (the noise).
1. Fraud detection. ML models can be trained to identify fraudulent activities, such as credit card fraud or online transaction fraud, by analyzing patterns and anomalies in the data. High precision is important in this case to minimize false positives and ensure that genuine transactions are not mistakenly flagged as fraudulent.
2. Medical diagnosis. ML models can assist in diagnosing medical conditions based on patient data, including symptoms, medical history, and test results. Achieving high precision is essential to avoid misdiagnosis and ensure accurate identification of diseases or conditions.
3. Image classification. ML models can classify images into different categories, such as recognizing objects, identifying faces, or detecting specific features. High precision is crucial to minimize misclassifications and ensure accurate categorization of images.
4. Sentiment analysis. ML models can analyze text data, such as customer reviews or social media posts, to determine the sentiment expressed (e.g., positive, negative, or neutral). High precision is important to accurately classify the sentiment and provide reliable insights for businesses.
5. Spam filtering. ML models can be trained to filter out spam emails or messages from legitimate ones. High precision is necessary to avoid mistakenly classifying important messages as spam, ensuring that the filtering process is accurate and reliable.
6. Autonomous vehicles. ML algorithms are employed in self-driving cars to interpret sensor data and make decisions while driving. High precision is critical to ensure the correct identification of objects, accurate lane detection, and precise decision-making to maintain safety on the road.
7. Quality control. ML models can be used to detect defects or anomalies in manufacturing processes, such as identifying faulty products on an assembly line. High precision is vital to minimize false positives and accurately identify defective items, improving overall quality control.

High Recall Algorithms
High recall algorithms minimize false negatives, which are not visible to users in the selection (the silence).
1. Disease screening. ML models can be utilized to screen individuals for various diseases or medical conditions based on symptoms, risk factors, or diagnostic tests. High recall is crucial in this scenario to ensure that as many affected individuals as possible are correctly identified, minimizing false negatives and preventing missed diagnoses.
2. Anomaly detection. ML algorithms can be employed to detect unusual patterns or anomalies in data, such as identifying fraudulent activities, network intrusions, or system failures. High recall is essential to minimize false negatives and ensure that potential threats or abnormalities are not overlooked.
3. Cancer detection. ML models can aid in the early detection of cancer by analyzing medical images, such as mammograms or CT scans, to identify suspicious lesions or tumors. High recall is critical in this case to minimize false negatives and ensure that cancer cases are not missed, enabling early intervention and improved treatment outcomes.
4. Customer churn prediction. ML algorithms can predict customer churn in various industries, such as telecommunications or subscription-based services. High recall is important to identify as many potential churners as possible, enabling proactive retention strategies and reducing customer attrition.
5. Security threat detection. ML models can analyze network traffic, log files, or user behavior to detect potential security threats, such as malware attacks or data breaches. High recall is vital to minimize false negatives and ensure that suspicious activities or indicators of compromise are not missed, enhancing overall system security.
6. Rare event prediction. ML algorithms can be employed to predict rare events, such as earthquakes, stock market crashes, or equipment failures. High recall is crucial in this context to detect as many instances of the rare event as possible, allowing for appropriate preventive measures or timely responses.
7. Emergency response systems. ML models can aid in emergency response systems by analyzing various data sources, such as social media feeds or sensor data, to detect and respond to incidents like natural disasters or public safety threats. High recall is important to ensure that relevant incidents are not missed, enabling swift and effective response efforts.

Information Retrieval
Information Retrieval refers to the process of efficiently and effectively retrieving relevant textual resources. With "documents" the whole collection, "selection" the set of retrieved documents and "topic" the set of relevant documents:
Density = |topic| / |documents|
Precision = |selection ∩ topic| / |selection| (the irrelevant part of the selection is the noise, which is visible)
Recall = |selection ∩ topic| / |topic| (the missed part of the topic is the silence, which is invisible)

Trade-off between precision and recall
The trade-off depends on the polymorphism of entities and the polysemy of words. Example with the entity "Caisse des Dépôts et Consignations": querying the full name gives P = 100% and R = 20% (silence, because the polymorphic entity appears under many other forms); querying the acronym "CDC" gives P = 40% and R = 60% (noise, because the acronym is polysemous). A code sketch of this trade-off follows.

A classifier with pretty good accuracy, precision and recall: is it actually good?
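The precision/recall trade-off also shows up within a single classifier as its decision threshold moves, which is how the curves behind AUC are built. A sketch on synthetic imbalanced data (all data and parameters are placeholders):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, weights=[0.8], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_te, scores)

# Raising the threshold trades recall for precision, and vice versa.
for p, r, t in zip(precision[::20], recall[::20], thresholds[::20]):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")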
Consider a classifier with pretty good overall metrics:

Metrics   | Value
Accuracy  | 91.7%
Precision | 92.5%
Recall    | 92.5%
F1-score  | 92.5%

A sensitive subgroup refers to a specific group of individuals or a demographic category that is designated as requiring protection against unfair or discriminatory treatment. Such a subgroup is defined based on certain attributes that are considered legally or socially protected from discrimination: gender, age, origin, opinion, condition, family, occupation.

Suppose that we have such a sensitive subgroup (marked in orange on the slide). When restricted to individuals not belonging to the sensitive subgroup, the performances might get even better:

Metrics (dataset without the sensitive subgroup) | Value
Accuracy  | 95.0%
Precision | 94.1%
Recall    | 97.0%
F1-score  | 95.5%

… but for the sensitive subgroup the performances might get much worse:

Metrics (dataset limited to the sensitive subgroup) | Value
Accuracy  | 75.0%
Precision | 83.3%
Recall    | 71.4%
F1-score  | 76.9%

In the opening video, we do have such a sensitive subgroup: bald referees or players (condition).

Fairness issues may lead to different types of harm ([2105.05595] An Introduction to Algorithmic Fairness):
- Allocation harm can be defined as an unfair allocation of opportunities, resources, or information. It occurs when some groups are selected less often than others, for instance by AI-based classifiers.
- Quality-of-service harm occurs when a system disproportionately fails for certain groups of people. It occurs when some groups are more often wrongly rejected than others.
- Stereotyping harm occurs when a system reinforces undesirable and unfair societal stereotypes. Generative AI may introduce stereotyping harms due to the lack of diversity in the training sets.
- Denigration harm refers to situations in which algorithmic systems are actively derogatory or offensive.
- Representation harm occurs when the development and usage of algorithmic systems over- or under-represents certain groups of people.
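Detecting such gaps only requires computing the same metrics per subgroup. A sketch on hypothetical data (the Fairlearn library cited below offers the same grouping via its MetricFrame helper):

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical predictions with a binary sensitive-subgroup flag.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 1])
group  = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])   # 1 = sensitive subgroup

for g in (0, 1):
    m = group == g
    print(g,
          accuracy_score(y_true[m], y_pred[m]),
          precision_score(y_true[m], y_pred[m]),
          recall_score(y_true[m], y_pred[m]))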
Fairness Metrics
Fairness metrics compare rates such as recall, specificity, or precision between subgroups. ([2102.08453] Towards the Right Kind of Fairness in AI (arxiv.org))

Comparing fairness metrics between the two subgroups (FDR = false discovery rate, FOR = false omission rate, PR/NR = positive/negative rate, TPR = true positive rate = recall, TNR = true negative rate = specificity, BR = base rate):

group = 0:
    | y=1 | y=0
ŷ=1 | 33  | 2     (FDR = 5.7%, PR = 58.3%)
ŷ=0 | 1   | 24    (FOR = 4.0%, NR = 41.7%)
TPR = 97.1%, TNR = 92.3%, BR = 56.7%

group = 1:
    | y=1 | y=0
ŷ=1 | 5   | 1     (FDR = 16.7%, PR = 50.0%)
ŷ=0 | 2   | 4     (FOR = 33.3%, NR = 50.0%)
TPR = 71.4%, TNR = 80.0%, BR = 58.3%

The base rates are similar: the percentages of both groups in the two classes are similar.

Base Rate is a fairness metric whose goal is to ensure that the ratio of positive examples in a dataset is independent of membership in a sensitive group.
- What it compares: the proportion of positives between different groups.
- Reason to use: if the input data are known to contain biases, base rate may be appropriate to measure fairness.
- Caveats: by only using the observed values, information is partial.
(Common fairness metrics — Fairlearn documentation)

Demographic Parity (or Statistical Parity) is a fairness metric whose goal is to ensure a machine learning model's predictions are independent of membership in a sensitive group.
- What it compares: predictions (positive and negative rates) between different groups.
- Reason to use: if the input data are known to contain biases, demographic parity may be appropriate to measure fairness.
- Caveats: by only using the predicted values, information is thrown away.
(Common fairness metrics — Fairlearn documentation)

Equalized Odds is a fairness metric whose goal is to ensure a machine learning model's predictions are independent of membership in a sensitive group.
- What it compares: true and false positive rates between different groups.
- Reason to use: if historical data does not contain measurement bias or historical bias that we need to take into account, and true and false positives are considered to be of the same importance, equalized odds may be useful.
- Caveats: if there are historical biases in the data, then the original labels may hold little value.
(Common fairness metrics — Fairlearn documentation)

Equal Opportunity is a relaxed version of equalized odds that only considers conditional expectations with respect to positive labels.
- What it compares: true positive rates between different groups.
- Reason to use: if historical data are useful, and extra false positives are much less likely to cause harm than missed true positives, equal opportunity may be useful.
- Caveats: if there are historical biases in the data, then the original labels may hold little value.
(Common fairness metrics — Fairlearn documentation)

Which fairness metric to use depends on the business case: for instance, equal representation vs. equal errors, or punitive vs. assistive settings.
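The per-group rates of the comparison above can be recomputed from the two contingency tables:

def rates(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    return {
        "PR":  (tp + fp) / total,   # positive (selection) rate
        "TPR": tp / (tp + fn),      # recall
        "TNR": tn / (tn + fp),      # specificity
        "BR":  (tp + fn) / total,   # base rate
    }

print(rates(33, 2, 1, 24))   # group = 0: PR 58.3%, TPR 97.1%, TNR 92.3%, BR 56.7%
print(rates(5, 1, 2, 4))     # group = 1: PR 50.0%, TPR 71.4%, TNR 80.0%, BR 58.3%

Fairlearn also exposes ready-made aggregates of such comparisons, for instance demographic_parity_difference.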
The Moment of Truth
(From Brainwave to Machine Learning Grave: Challenges Faced by ML Models from Idea to Production. (Part 1: Deployment Phase) | LinkedIn)

Sources of performance degradation after deploying a model:
- Data drift: changes in the distribution of input data over time.
- Concept drift: changes in the relationship between input features and the target variable.
- Model decay: changing patterns in the data or a shift in user behaviour.
- Scalability issues: model not scalable to handle increased load or a growing user base.
- Resource constraints: inadequate resources such as memory, CPU, or GPU for the deployed model.
- Security concerns: models may be susceptible to adversarial attacks or unintended use cases.
- Model interpretability: lack of interpretability, especially in sensitive applications where understanding model decisions is crucial.
- Latency and throughput: delays in model predictions or an inability to handle the required volume of requests.
- Regulatory compliance: failure to comply with industry or legal regulations regarding data privacy and model usage.
- Monitoring and logging: insufficient monitoring and logging of model performance, inputs, and outputs.

Metrics in Machine Learning are Crucial from a Business Perspective
- Performance evaluation: providing a clear and objective evaluation of a model.
- Benchmarking: comparing models' performances.
- Model selection: choosing the best model among various alternatives.
- Model monitoring: detecting drift in model performance across time.
- Continuous improvement: providing feedback on models' behaviour.
- ROI assessment: measuring how well a model achieves its intended goals.
- Regulatory compliance: ensuring that a model's outcomes meet business standards.

Conclusion and Perspectives
Specific metrics exist for Generative AI tasks provided by Large Language Models (LLMs). (Source: DataCamp)
There is also human evaluation based on the Elo rating system used in chess, a kind of Turing test: two random models are put in competition, and a human decides which one is the best. (LMSys Chatbot Arena Leaderboard - a Hugging Face Space by lmsys)
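As a closing detail, the Elo update behind such pairwise comparisons is a one-line formula; the K-factor below is a conventional choice, not from the slides:

def elo_update(r_a, r_b, score_a, k=32):
    """score_a is 1 if model A wins, 0 if it loses, 0.5 for a draw."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    return r_a + k * (score_a - expected_a)

print(elo_update(1000, 1000, 1))   # the winner of an even match gains 16 points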