Machine Learning Lecture Notes PDF
Document Details
Delft University of Technology
Summary
These lecture notes provide an introduction to machine learning, covering definitions, differences from statistical models, popular methods ranging from regression to more advanced models, and the role of ML within socio-technical systems. A summary of each lecture is included.
Full Transcript
Summary Lecture 1: What is Machine Learning?

1. Definition of Machine Learning (ML)
What is ML?
○ "The field that gives computers the ability to learn without being explicitly programmed." – Arthur Samuel, 1959.
○ Applications: email spam filters, chatbots, credit card fraud detection, recommendation systems, advertisement placement on websites.
Why so popular now?
○ Increase in large data sets (big data) and powerful computing.
○ Ability to work with unstructured data (images, video, text, audio).

2. Differences between Machine Learning and Statistical Models
Statistical models:
○ Focused on inference: determining whether a relationship exists and why.
○ Grounded in theory (e.g., the law of large numbers, the central limit theorem).
○ Parameters are interpretable.
○ Models are tractable and based on assumptions about the Data Generating Process (DGP).
Machine learning:
○ Focused on prediction: learning input-output relationships.
○ Less focus on theory; driven by data and generalization performance.
○ Large number of parameters, often not interpretable.
○ Works on associations, without assumptions about causality.

3. Popular ML methods
Regression models:
○ Linear and logistic regression.
Advanced models:
○ Decision Trees.
○ Random Forests.
○ Artificial Neural Networks.
○ Gradient Boosting.
○ Clustering (e.g., K-means, DBSCAN).
○ Bayesian Networks.

4. ML within the Context of Socio-Technical Systems
ML provides powerful tools for pattern recognition and prediction.
Limits:
○ Not suitable for all problems (e.g., simple DGPs).
○ Often no insight into causality, which is important for policy analysis.

5. Wrap-Up
Summary of this lecture:
○ Introduction to machine learning.
○ Differences between statistical models and ML.
Next session:
○ Delve deeper into learning from data, generalization, the bias-variance trade-off, and developing ML models.

Lecture 2

1. Machine Learning Fundamentals
Learning: the process by which a model learns a function g that maps an input to an output based on examples.
○ Supervised learning: works with labeled data (X, Y) and trains a model to replicate the correct answers.
○ Unsupervised learning: works with unlabeled data (X) and looks for structure in the data.
○ Reinforcement learning: based on rewards (positive/negative) received after decisions.

2. Generalization
Goal: develop a model that performs well on new data.
Key terms:
○ Overfitting: the model fits the training data too closely and performs poorly on new data.
○ Underfitting: the model is too simple and misses important patterns in the data.
○ Bias-variance trade-off: the balance between bias (errors due to assumptions) and variance (sensitivity to variations in the data).
Solution: split the data into training and test data (e.g., a 60%-40% split for small data sets).

3. Model development
An iterative process in 5 steps:
1. Study the phenomenon and clean the data.
2. Explore the data.
3. Explore relationships through graphs and correlations.
4. Train a basic model (e.g., linear regression, decision trees).
5. Evaluate the model with performance metrics such as R-squared, Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE).

4. Regression models
Objective: predict continuous values (e.g., income based on age).
Examples:
○ Linear regression.
○ Multiple regression (with multiple explanatory variables).
Statistical vs. machine learning approach:
○ Statistical: based on assumptions (linearity, constant variance, no perfect collinearity).
○ ML: fewer assumptions, but fewer interpretable parameters.
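The following minimal scikit-learn sketch ties the steps above together: a train/test split for generalization (section 2), training a basic linear regression (step 4), and evaluating it with R-squared, MAE, and RMSE (step 5). The data is synthetic and illustrative, not from the lecture:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic example: predict income from age (illustrative data only)
rng = np.random.default_rng(42)
age = rng.uniform(18, 67, size=500).reshape(-1, 1)
income = 1000 + 450 * age.ravel() + rng.normal(0, 3000, size=500)

# Split into training and test data (60%-40%, as suggested for small data sets)
X_train, X_test, y_train, y_test = train_test_split(age, income, test_size=0.4, random_state=0)

# Train a basic model (step 4)
model = LinearRegression().fit(X_train, y_train)

# Evaluate on held-out data (step 5)
y_pred = model.predict(X_test)
print("R-squared:", r2_score(y_test, y_pred))
print("MAE:      ", mean_absolute_error(y_test, y_pred))
print("RMSE:     ", np.sqrt(mean_squared_error(y_test, y_pred)))
```

Evaluating on the held-out 40% rather than the training data is what reveals overfitting: a model that scores much better on training data than on test data generalizes poorly.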
5. Geospatial Data
Data with a geographic component:
○ Vector data: points, lines, polygons.
○ Raster data: images, such as satellite imagery.
Commonly used tools: geopandas (Python), GIS software.
Projection methods:
○ Mercator: preserves angles, but distorts areas.
○ Equal-area: preserves area proportions.
○ Specific projection for the Netherlands: EPSG:28992 (Amersfoort / RD New).

Wrap-up
Discussed:
○ Machine learning: generalization, bias-variance trade-off.
○ Model development: cycles and evaluation.
○ Regression: linear and multiple.
○ Geospatial data: use and projections.
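A minimal geopandas sketch of loading vector data and reprojecting it to the Dutch EPSG:28992 system mentioned above; the file name is hypothetical:

```python
import geopandas as gpd

# Load vector data (points, lines, polygons); "municipalities.gpkg" is a hypothetical file
gdf = gpd.read_file("municipalities.gpkg")
print(gdf.crs)  # coordinate reference system of the source data

# Reproject to the Dutch national grid (Amersfoort / RD New)
gdf_rd = gdf.to_crs(epsg=28992)

# Area computations (for polygon data) are only meaningful in a projected CRS
print(gdf_rd.geometry.area.head())
```

Choosing the projection before computing distances or areas matters: in an angle-preserving projection such as Mercator, area comparisons across latitudes are misleading.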
Lecture 4: Decision Trees and Performance Metrics
Lecturer: Sander van Cranenburgh
Faculty: Technology, Policy, and Management, TU Delft

1. Decision Trees (DTs)
What are Decision Trees?
○ Commonly used ML models for classification and regression.
○ Advantages:
- Easy to understand, use, and interpret.
- Little data preprocessing required.
- Useful as an exploratory tool (identifying important features).
○ Disadvantages:
- Sensitive to overfitting.
- Unstable: results change with small changes in the data.
Structure of a decision tree:
○ Root node: contains the entire data set.
○ Split nodes: decisions based on features.
○ Branches: represent splits in the data.
○ Leaf nodes: represent the final classes or values.

2. How are Decision Trees built?
Process:
○ Based on entropy and information gain:
- Entropy measures the homogeneity of a data set.
- Information gain determines which split is the most informative.
○ Splits are added iteratively until:
- All nodes are "pure" (homogeneous classes), or
- No further splits are possible.
Examples:
○ Gender classification: based on hair length and height.
○ Diabetes classification: based on BMI and blood pressure; BMI turned out to be the most determining factor.
Greedy algorithm:
○ The tree chooses locally optimal splits (global optimality is computationally infeasible).

3. Avoiding overfitting
Problem:
○ DTs keep splitting until all data is fitted perfectly, leading to poor generalization.
Solutions (pruning):
○ Pre-pruning: prevent further splitting through conditions such as:
- Maximum depth.
- Minimum number of examples per leaf.
- Minimal reduction in impurity.
○ Post-pruning: removing non-significant branches afterwards (not covered in this course).
Effect on performance:
○ Pre-pruning can reduce overfitting and often results in a better balance between training and test performance.

4. Feature Importance
Decision trees can provide insight into the relative importance of features:
○ Based on the reduction of node impurity.
○ Not stable: results may vary per training run.

5. Model Performance Metrics
Based on the confusion matrix:
○ True Positive (TP): correctly classified as positive.
○ False Positive (FP): incorrectly classified as positive.
○ True Negative (TN): correctly classified as negative.
○ False Negative (FN): incorrectly classified as negative.
Key metrics:
○ Accuracy: correctly classified observations / total observations.
○ Precision: correct positive predictions / total positive predictions.
○ Recall (sensitivity): correct positive predictions / actual positive cases.
○ Specificity: correct negative predictions / actual negative cases.
○ Matthews correlation coefficient (MCC): summarizes the whole confusion matrix in a single number.
Examples:
○ Combinations of high/low precision, sensitivity, and specificity illustrate the advantages and disadvantages of different classifiers.
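A minimal scikit-learn sketch of the ideas above: a decision tree with pre-pruning conditions, evaluated with the confusion-matrix metrics just listed. The data is synthetic, not the lecture's diabetes set, and the pruning values are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             matthews_corrcoef, precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: cap the depth, require a minimum number of samples per leaf,
# and demand a minimal impurity decrease before splitting
tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=20,
                              min_impurity_decrease=0.001, random_state=0)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)

# Confusion matrix and the derived metrics
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("Accuracy:   ", accuracy_score(y_test, y_pred))
print("Precision:  ", precision_score(y_test, y_pred))
print("Recall:     ", recall_score(y_test, y_pred))
print("Specificity:", tn / (tn + fp))  # no built-in scorer in scikit-learn
print("MCC:        ", matthews_corrcoef(y_test, y_pred))

# Impurity-based feature importance (may vary between training runs)
print("Feature importances:", tree.feature_importances_)
```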
Lecture 5: Artificial Neural Networks (ANNs)

1. What are Artificial Neural Networks (ANNs)?
Definition:
○ Commonly used ML models, with applications in both classification and regression.
○ Important in deep learning (e.g., text-to-image, text-to-text).
Advantages:
○ Flexible and powerful, suitable for complex patterns.
○ Scale well to large data sets and non-linear relationships.
Disadvantages:
○ Difficult to interpret ("black box").
○ Require intensive training and tuning.
○ Do not always perform better than simpler models.

2. Construction of ANNs
Structure:
○ Input layer: the explanatory variables (features).
○ Hidden layers: contain nodes that process information.
- One hidden layer: a "shallow" network.
- Two or more hidden layers: a "deep" network.
○ Output layer: the target variable.
Connections:
○ Weights between nodes determine how information flows through the network.
○ Bias nodes: add a constant (intercept) term to a node.

3. Training ANNs
Loss function:
○ Measures the error between predicted and expected values.
○ The type depends on the task:
- Regression: MSE, RMSE.
- Classification: cross-entropy (log-loss).
Backpropagation:
○ Iterative process of optimizing the weights by minimizing the loss.
○ Uses gradient descent methods (e.g., Adam, RMSprop).
Preprocessing:
○ Scale features (e.g., StandardScaler, MinMaxScaler).
○ One-hot encode categorical variables.
○ Handle missing values (e.g., remove them or impute with averages).

4. Hyperparameter Tuning
Important hyperparameters:
○ Number of hidden layers and nodes.
○ Batch size: how much data is used per iteration.
○ Learning rate: step size during optimization.
○ Regularization: L1 (Lasso) or L2 (Ridge) to avoid overfitting.
Optimization:
○ Hyperparameter tuning can be done with tools such as GridSearchCV (scikit-learn).
○ Early stopping prevents overfitting by halting training when validation performance no longer improves.

5. K-Fold Cross-Validation
Goal:
○ A more robust evaluation of generalization performance than a single train/test split.
How it works:
○ Divide the data into K folds.
○ Train and evaluate the model several times on different combinations of folds.
○ Use the average performance to optimize hyperparameters.
Advantages:
○ Reduces the risk of overfitting to one specific data split.

6. Examples
Diabetes classification (Efron et al. 2004):
○ Impact of the number of layers and nodes on learning nonlinear relationships.
○ Empirical performance comparable to decision trees on simple datasets.

Wrap-up
ANNs are powerful but labor-intensive to train.
Hyperparameter tuning and techniques such as early stopping and cross-validation improve performance and generalization.
Next session: lab on mode choice prediction with MLPs:
○ Prepare data and train models.
○ Compare performance based on various metrics.
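A minimal sketch of these techniques with scikit-learn's MLPClassifier: feature scaling in a pipeline, early stopping, and a small grid search over layer sizes and L2 regularization evaluated with 5-fold cross-validation. All parameter values are illustrative, not the lecture's settings:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pipeline: scale features first, then train an MLP with early stopping
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("mlp", MLPClassifier(solver="adam", early_stopping=True,
                          max_iter=1000, random_state=0)),
])

# Grid search over the architecture and the L2 penalty,
# scored with 5-fold cross-validation
grid = GridSearchCV(
    pipe,
    param_grid={
        "mlp__hidden_layer_sizes": [(16,), (32, 16)],  # shallow vs. deeper network
        "mlp__alpha": [1e-4, 1e-2],                    # L2 regularization strength
    },
    cv=5,
)
grid.fit(X_train, y_train)

print("Best hyperparameters:", grid.best_params_)
print("Mean CV accuracy:    ", grid.best_score_)
print("Test accuracy:       ", grid.score(X_test, y_test))
```

Note that scaling lives inside the pipeline, so each cross-validation fold is scaled using only its own training portion, avoiding leakage from the validation fold.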
Lecture 7: Ensemble Models and Random Forests

1. What are Ensemble Models?
Definition:
○ An ensemble model combines multiple "weak" models into a stronger model.
○ Based on the "wisdom of the crowd" principle: diversity reduces bias and improves generalization.
Why use ensembles?
○ Conceptual: more independent sources of information lead to higher reliability.
○ Technical: different models make different mistakes; combining them reduces overfitting and improves generalization.

2. Random Forests (RF)
Definition:
○ An ensemble of decision trees, suitable for classification and regression.
○ Addresses the sensitivity of decision trees to overfitting through greater diversity and stability.
How are they built?
○ Bagging: generating multiple bootstrap datasets by random sampling with replacement.
○ Random patching: using randomly selected subsets of features for splits.
Hyperparameters:
○ n_estimators: number of trees (more trees improve performance up to a point).
○ max_features: number of features considered per split.
○ max_depth: maximum depth of the trees.
○ min_samples_leaf: minimum number of samples per leaf.
Advantages:
○ Good performance on a wide range of datasets.
○ Robust against overfitting thanks to the diversity of the trees.
Disadvantages:
○ Require more computing power than individual decision trees.
○ More difficult to interpret.

3. Boosting
Definition:
○ Boosting trains multiple models sequentially, with each model targeting the errors of the previous one.
Gradient Boosted Trees (GBTs):
○ Popular boosting technique in which decision trees are trained on residuals.
○ Advantages:
- Can learn complex non-linear relationships.
- Usually more robust than single models.
○ Hyperparameters:
- n_estimators: number of trees.
- learning_rate: amount of impact of each tree.
- max_depth, min_samples_leaf, etc., similar to RF.

4. Example: Diabetes classification
Random Forests:
○ Diverse trees result in robust predictions.
Gradient Boosted Trees:
○ The model gradually learns nonlinear relationships.
○ Better generalization and performance than single decision trees.

5. Comparison of Ensemble Methods
Random Forests:
○ Robust, stable, easy to implement.
Boosting:
○ More focused on correcting errors; better for complex data sets.
General disadvantages of ensembles:
○ Increased training complexity.
○ More difficult to interpret than simple models.

Wrap-up
Ensembles combine multiple models to make more powerful predictions.
Random Forests and Boosting are popular techniques with different applications and benefits.
Next session: focus on causality and predictions outside the training distribution.
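A minimal scikit-learn sketch comparing the two ensemble types discussed above; hyperparameter values are illustrative, not the lecture's settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Random Forest: bagging plus a random subset of features at each split
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            max_depth=8, min_samples_leaf=5, random_state=0)

# Gradient Boosted Trees: shallow trees trained sequentially on residual errors
gbt = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05,
                                 max_depth=3, random_state=0)

for name, model in [("Random Forest", rf), ("Gradient Boosting", gbt)]:
    model.fit(X_train, y_train)
    print(name, "test accuracy:", model.score(X_test, y_test))
```

The contrast mirrors the lecture: the forest's trees are grown independently and averaged, while boosting's shallow trees are added one after another, each correcting what the previous ones got wrong.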
Lecture 8: Embeddings, Causality, and Prediction Models

1. Embeddings
What are embeddings?
○ Representations of discrete data in a continuous, lower-dimensional vector space.
○ Applications: images, words, networks, users.
Why embeddings?
○ They make unstructured data suitable for algorithmic processing, such as classification, clustering, and similarity search.
Ways to create embeddings:
○ Supervised: the output of the last layer of a neural network.
○ Unsupervised: the bottleneck of an autoencoder.
○ Pre-trained models:
- For text: Word2Vec, BERT.
- For images: VGG, Inception.
Properties:
○ Semantic preservation: relationships between data points are preserved, measured for example with Euclidean distance or cosine similarity.

2. Causality
What is causality?
○ Relationships where a change in X leads to a change in Y (ceteris paribus).
Conditions for causality:
○ Association: X and Y must vary together.
○ Temporal order: the cause (X) must occur before the effect (Y).
○ No spurious relationships: the relationship must not be caused by a third variable.
Relevance in ML:
○ Machine learning builds on associations, but its output is sometimes misinterpreted as causal.
○ Policy predictions often require a causal approach.

3. In-Distribution vs. Out-of-Distribution Predictions
ML performs well:
○ For predictions within the distribution of the training data.
○ Examples: demand for groceries, predicting ICU needs.
ML performs poorly:
○ For predictions outside the distribution of the training data.
○ Examples: demand for a new product, COVID-related forecasts.
Causal models are better:
○ In out-of-distribution scenarios, such as predicting the effects of policy changes.

Part 2

Explainable AI: Part 1

1. The Risks of AI
Example: criminal facial recognition (Wu & Zhang, 2016):
○ The AI claimed to be able to recognize criminality from facial features.
○ Criticisms:
- Data bias (photos of criminals vs. non-criminals).
- Assumed causal links between facial features and criminal behavior.
- Ethical and methodological shortcomings.

2. What is Explainable AI (XAI)?
Definition:
○ The development of ML models that:
1. Provide explanations for their predictions.
2. Promote human understanding, trust, and stewardship.
Goals:
○ Building trust.
○ Making causality transparent.
○ Increasing reusability.
○ Exposing bias and privacy issues.
Difference between interpretability and explainability:
○ Interpretability: a passive property; how understandable a model is.
○ Explainability: an active property; explaining how and why a model makes a prediction.

3. Properties of Explanations
Quality criteria:
○ Accuracy: how well does the explanation predict unseen data?
○ Fidelity: how well does the explanation approximate the original model?
○ Consistency: are explanations similar for similar models?
○ Comprehensibility: are explanations understandable and useful to people?
○ Stability: are similar examples explained consistently?
○ Contrastivity: clarifies discrepancies (e.g., "Why this score instead of another?").

4. Transparency in ML
Transparent models include:
○ Linear regression.
○ Decision Trees.
○ K-Nearest Neighbors.
Properties of transparency:
○ Simulatability: comprehensibility of the model as a whole.
○ Decomposability: understandability of its individual components.
○ Algorithmic transparency: the prediction process is mathematically understandable.

5. Post-Hoc Explainability
Simplification:
○ Local explanations such as LIME (Local Interpretable Model-agnostic Explanations).
○ Anchors: high-precision rules that explain specific parts of the model's behavior.
Counterfactuals:
○ Alternative data points that would receive a different prediction.
○ Criteria:
- As close as possible to the original data point.
- Minimal changes to features.
- Practical and observable.

6. Case Study: COMPAS
What is COMPAS?
○ An algorithm that predicts recidivism risk in the US justice system.
○ Criticisms:
- Black-box model: little insight into its decision-making.
- Bias: discrimination against Black defendants in risk assessment.

7. Conclusion
Why Explainable AI?
○ Increases trust and the ethical use of ML.
○ Identifies and addresses biases in models.
○ Increases engagement of non-technical users.
Next session:
○ Exploration of causal explanations and their applications in XAI.

Explainable AI: Part 2

1. Global Model-Agnostic Methods
Partial Dependence Plots (PDPs):
○ Visualize the relationship between a feature and the predicted outcome.
○ Marginalize over the other features to isolate the effect of the target feature.
○ Advantages:
- Easy to understand and implement.
- Provide a causal effect (for the model, not for the real world).
○ Disadvantages:
- Limited to one or two features due to visualization limits.
- Can be misleading with highly correlated features.
Individual Conditional Expectation (ICE):
○ Shows predictions for individual data points instead of averages.

2. Local Explanations: LIME
What is LIME (Local Interpretable Model-agnostic Explanations)?
○ Explains predictions locally by analyzing the model's behavior around a specific data point.
○ Method:
1. Choose a data point to explain.
2. Perturb the data point and generate predictions.
3. Train a simple model (such as a linear regression) on the new data.
4. Use this simple model to explain the prediction.
○ Advantages:
- Works with any model type.
- Supports tabular data, text, and images.
- Easy to implement (e.g., the lime package in Python).
○ Disadvantages:
- Ignores correlations between features.
- Explanations can vary for similar data points.
- Choosing the right kernel width is difficult.

3. Feature Relevance: SHAP
What is SHAP (SHapley Additive exPlanations)?
○ Based on game theory; it divides the model's output among the input features to quantify their contributions.
○ Shapley values:
- Fairly distribute the contributions of features to a model prediction.
- Calculated from the marginal contribution of each feature across different feature combinations.
○ Properties:
- Efficiency: the contributions sum to the model output.
- Symmetry: equal contributions receive the same score.
- Consistency: a higher feature impact never leads to a lower score.
Benefits of SHAP:
○ Consistent and fair distribution of feature contributions.
○ Applicable to all machine learning models.
○ Available in practical libraries (e.g., shap in Python).
Visualization tools:
○ Waterfall plot: shows feature contributions for individual predictions.
○ Beeswarm plot: visualizes SHAP values across multiple data points.

4. Practical Examples
Bike sharing dataset:
○ PDPs show how temperature and season influence the number of bicycle rentals.
House price prediction (SHAP):
○ SHAP identifies median income and location as the most important drivers of house prices.

5. LIME vs. SHAP
LIME:
○ Local and fast, but prone to inaccuracy for nonlinear models.
SHAP:
○ Globally consistent and more accurate, but computationally more complex.

6. Conclusion
Why use XAI methods?
○ Increase understandability and trust.
○ Identify and address biases.
○ Provide fair and consistent model explanations.
Next session:
○ Deeper interpretation and applications of explanations in socio-technical systems.
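A minimal sketch of the SHAP workflow described above, assuming the shap package and a tree-based model; the data is synthetic, not the lecture's house price example:

```python
import numpy as np
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=6, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# TreeExplainer computes Shapley values efficiently for tree ensembles
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # shape: (n_samples, n_features)

# Efficiency property: base value plus the contributions equals the prediction
base = float(np.ravel(explainer.expected_value)[0])
print("Prediction:           ", model.predict(X[:1])[0])
print("Base value + SHAP sum:", base + shap_values[0].sum())

# Beeswarm-style summary plot of SHAP values across all data points
shap.summary_plot(shap_values, X)
```

The printed check makes the efficiency property from the notes concrete: the per-feature contributions for one observation add up (with the base value) to that observation's predicted value.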
Explainable AI: Part 3

1. Post-Hoc Explainability
Feature relevance methods:
○ Assess the influence or relevance of features on a model's predictions.
○ Methods:
- Sobol global sensitivity analysis: measures how variations in the model's input affect its output.
- Permutation feature importance: measures how predictions change when a feature's values are randomly shuffled (see the sketch at the end of these notes).
Visual explanation:
○ Use visual aids (such as ICE and PDP) to clarify model decisions.
○ Model-specific techniques:
- Saliency maps (for CNNs).
- Interactive visualizations (e.g., LLM visualization tools).

2. Responsible AI
Ethics and privacy:
○ Example: AI and sexual orientation (Wang & Kosinski):
- Methodological and ethical controversies: privacy violation, reinforcement of stereotypes, weak inferences with insufficient statistical evidence.
○ Privacy risks:
- Inappropriate processing of sensitive data.
- Example: Strava heatmaps and anonymized taxi data that turned out to be traceable.
Risks of Large Language Models (LLMs):
○ Challenges:
- Discrimination and the spread of misinformation.
- Privacy leaks.
- Environmental damage due to high computing power requirements.
○ Solutions:
- Representative training data.
- Privacy-protecting techniques such as differential privacy.
- Monitoring and responsible deployment.

3. AI and Climate Change
Applications:
○ Electricity networks: optimization of supply and demand.
○ Transport: improving logistics and fuel efficiency.
○ Buildings: supporting energy-efficient designs.
○ Policy analysis:
- Simulation of emission reductions.
- Downscaling of data to finer geographic scales.
- Detecting tipping points in techno-economic systems.

4. XAI and Future Directions
Guidelines for Explainable AI:
○ Consider contextual and domain-specific needs.
○ Use interpretable models where possible.
○ Ensure ethical, fair, and safe deployment of black-box models.
○ Tailor explanations to the audience.
Challenges and trade-offs:
○ Balance between explanatory power and model performance.
○ Difficulties in generalizing across domains.
○ Need for standards and regulation (e.g., the EU AI Act).

5. Wrap-Up
Key points:
○ Post-hoc explanations enhance understanding of black-box models.
○ Responsible AI is essential to address ethical challenges.
○ XAI plays a crucial role in climate and policy models.
Next steps:
○ Further integration of XAI methodologies into socio-technical systems.
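Closing sketch: permutation feature importance, as mentioned under post-hoc explainability in Part 3, using scikit-learn's implementation; the model and data are illustrative, not from the lectures:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=6,
                           n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn and measure the drop in test performance;
# a large drop means the model relies heavily on that feature
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=0)
for i, (mean, std) in enumerate(zip(result.importances_mean,
                                    result.importances_std)):
    print(f"feature {i}: {mean:.3f} +/- {std:.3f}")
```

Unlike the impurity-based importances of a single tree, this measure is computed on held-out data and works for any fitted model, which is why it is listed among the model-agnostic post-hoc methods.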