CDS 403 Machine Learning Cheatsheet PDF

Summary

This document is a cheatsheet for the CDS 403 Machine Learning final exam. It contains multiple-choice questions, scenario-based questions, and definition/list questions covering key concepts from the course materials. The questions cover a range of topics in machine learning, including supervised learning, unsupervised learning, and neural networks.

Full Transcript


CDS 403 Machine Learning Cheatsheet (Final Exam Material) This cheatsheet covers key material for your CDS 403 Machine Learning final exam. It contains multiple-choice questions (MCQs) with explanations for both correct and incorrect answers, scenario-based questions, and definition/list questions covering key concepts from the course materials. Accordingly, your exam will consist of three parts: 45 multiple-choice questions (45 points), 1 scenario-based question (10 points), and 1 definition/list question (10 points). The total will be 65 points, which will be converted to 25 points as your final grade (multiplying by 25/65). This conversion method will help maintain higher overall grades. Multiple Choice Questions Section 1.1: Introduction & ML Pipeline (Weeks 1-2) Question 1: According to the syllabus, what is the primary goal of the CDS 403 Machine Learning course? a) To teach advanced theoretical mathematics behind algorithms. b) To provide a broad introduction to machine learning concepts, algorithms, and practical applications. c) To focus exclusively on deep learning and neural networks. d) To cover software engineering best practices for deploying ML models. Answer: b) Explanation (Correct): The syllabus overview generally emphasizes introducing fundamental concepts, various algorithms (supervised, unsupervised), evaluation, and practical application using tools like Python and relevant libraries. This aligns with a broad introduction. Explanation (Incorrect a): While some theory is covered, the focus is broader than just advanced mathematics. Explanation (Incorrect c): Deep learning is a part of the course but not the exclusive focus; other methods like SVMs, Decision Trees, etc., are also covered. Explanation (Incorrect d): Deployment might be touched upon, but software engineering practices are not the primary goal of an introductory ML course. Question 2: Which of the following best defines Machine Learning? a) Programming computers to explicitly perform tasks. b) The field of study that gives computers the ability to learn without being explicitly programmed. c) Designing complex database systems for large datasets. d) Creating visually appealing data dashboards. Answer: b) Explanation (Correct): This is a classic definition attributed to Arthur Samuel. Machine learning focuses on algorithms that allow systems to learn patterns and make predictions from data, rather than relying solely on predefined rules. Explanation (Incorrect a): This describes traditional programming, not machine learning. Explanation (Incorrect c, d): Database design and dashboard creation are related fields but distinct from the core concept of machine learning. Question 3: What are the three main categories of machine learning discussed in the introductory materials? a) Regression, Classification, Clustering b) Supervised Learning, Unsupervised Learning, Reinforcement Learning c) Deep Learning, Ensemble Methods, Dimensionality Reduction d) Feature Engineering, Model Training, Model Evaluation Answer: b) Explanation (Correct): The course materials (and standard ML introductions) categorize ML into these three fundamental paradigms based on the type of data and learning signal available (labeled data, unlabeled data, or rewards/penalties). Explanation (Incorrect a): Regression and Classification are types of Supervised Learning; Clustering is a type of Unsupervised Learning. Explanation (Incorrect c): These are specific techniques or areas within ML, not the primary categories.
Explanation (Incorrect d): These are stages within the ML pipeline, not categories of ML itself. Question 4: In Supervised Learning, what is provided to the algorithm during training? a) Only input data without any corresponding outputs. b) Input data paired with the correct output labels. c) A reward signal indicating the quality of actions taken. d) Rules defined by human experts. Answer: b) Explanation (Correct): Supervised learning relies on labeled data. The algorithm learns a mapping from inputs to outputs by being shown examples where the correct output (label) is known. Explanation (Incorrect a): This describes Unsupervised Learning. Explanation (Incorrect c): This describes Reinforcement Learning. Explanation (Incorrect d): While domain knowledge is useful, supervised learning learns from data, not predefined rules. Question 5: Which task is an example of a Classification problem in supervised learning? a) Predicting the price of a house based on its features. b) Grouping similar customers based on purchase history. c) Identifying whether an email is spam or not spam. d) Learning to play chess through trial and error. Answer: c) Explanation (Correct): Classification involves assigning input data to predefined discrete categories. Identifying spam/not spam is a binary classification task. Explanation (Incorrect a): Predicting a continuous value like price is a Regression problem. Explanation (Incorrect b): Grouping similar items without predefined labels is Clustering (Unsupervised Learning). Explanation (Incorrect d): Learning through trial and error with rewards describes Reinforcement Learning. Question 6: Which task is an example of a Regression problem in supervised learning? a) Predicting whether a customer will click on an ad (Yes/No). b) Estimating the temperature tomorrow based on historical weather data. c) Segmenting images into different regions (e.g., sky, trees, road). d) Discovering hidden topics in a collection of documents. Answer: b) Explanation (Correct): Regression involves predicting a continuous numerical output. Temperature is a continuous variable. Explanation (Incorrect a): Predicting Yes/No is a Classification problem. Explanation (Incorrect c): Image segmentation can be viewed as pixel-level classification. Explanation (Incorrect d): Topic discovery is typically done using Unsupervised Learning (e.g., LDA). Question 7: What is the primary goal of Unsupervised Learning? a) To predict a specific target variable based on labeled inputs. b) To learn a policy that maximizes rewards in an environment. c) To discover patterns, structure, or relationships in unlabeled data. d) To minimize the error between predicted and actual values. Answer: c) Explanation (Correct): Unsupervised learning algorithms work with data where no explicit output labels are given. Their goal is to find inherent structures, such as clusters (groupings) or lower-dimensional representations (dimensionality reduction). Explanation (Incorrect a, d): These relate to Supervised Learning. Explanation (Incorrect b): This relates to Reinforcement Learning. Question 8: Clustering, such as K-Means, is a common technique in which category of machine learning? a) Supervised Learning b) Reinforcement Learning c) Unsupervised Learning d) Semi-Supervised Learning Answer: c) Explanation (Correct): Clustering algorithms aim to group similar data points together based on their features, without using any predefined labels. This falls under the umbrella of Unsupervised Learning. 
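For illustration only (not part of the original cheatsheet), a minimal K-Means sketch assuming scikit-learn (one of the course libraries) and a small made-up, unlabeled dataset:

```python
# Minimal K-Means sketch: group unlabeled 2-D points into 2 clusters.
import numpy as np
from sklearn.cluster import KMeans

# Tiny made-up dataset; note there is no target/label column.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment discovered for each point
print(kmeans.cluster_centers_)  # the two learned centroids
```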
Explanation (Incorrect a, b, d): These are different learning paradigms. Question 9: According to the ML Pipeline overview (e.g., from file 331072186_1.txt), what is typically the first step after defining the problem? a) Model Training b) Feature Engineering c) Data Collection / Ingestion d) Model Deployment Answer: c) Explanation (Correct): Once the problem is understood, the necessary data must be gathered from various sources. This raw data forms the basis for all subsequent steps. Explanation (Incorrect a, b, d): Model training, feature engineering, and deployment occur later in the pipeline after data has been collected and prepared. Question 10: What does the Data Preparation (or Preprocessing) stage of the ML pipeline typically involve? a) Training the final machine learning model. b) Evaluating the model performance using metrics like accuracy. c) Cleaning the data, handling missing values, and transforming features (e.g., scaling). d) Defining the business objective and success criteria. Answer: c) Explanation (Correct): Raw data is often messy. This stage focuses on making the data suitable for modeling by addressing issues like missing entries, errors, inconsistencies, and converting features into appropriate formats (e.g., numerical scaling, encoding categoricals). Explanation (Incorrect a, b): These are later stages (Model Training, Model Evaluation). Explanation (Incorrect d): This is part of Problem Definition. Question 11: Feature Engineering, a crucial step in the ML pipeline, involves: a) Selecting the best machine learning algorithm. b) Creating new, potentially more informative features from existing data or selecting the most relevant ones. c) Splitting the data into training and testing sets. d) Tuning the hyperparameters of the chosen model. Answer: b) Explanation (Correct): This step uses domain knowledge and data analysis to transform raw features or create new ones that can improve model performance. It can also involve selecting a subset of the most useful features. Explanation (Incorrect a): This is Model Selection. Explanation (Incorrect c): This is part of data splitting for evaluation. Explanation (Incorrect d): This is Hyperparameter Tuning. Question 12: What is the purpose of splitting data into training and testing sets? a) To increase the amount of data available for training. b) To evaluate the model's ability to generalize to new, unseen data. c) To speed up the model training process. d) To perform feature scaling more effectively. Answer: b) Explanation (Correct): A model might perform perfectly on the data it was trained on but fail on new data (overfitting). The testing set provides an unbiased estimate of how well the model is likely to perform on data it hasn't seen before. Explanation (Incorrect a): Splitting reduces the amount of data used strictly for training. Explanation (Incorrect c): It doesn't inherently speed up training, although training on a smaller subset is faster. Explanation (Incorrect d): Feature scaling is typically fitted on the training set and applied to both. Question 13: Which Python library, mentioned as required in the syllabus, forms the foundation for numerical computing and provides support for arrays and matrices? a) Pandas b) Matplotlib c) Scikit-learn d) NumPy Answer: d) Explanation (Correct): NumPy (Numerical Python) is the fundamental package for scientific computing in Python. 
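To make the NumPy description concrete, a brief illustrative sketch with made-up numbers (not from the course materials):

```python
import numpy as np

# N-dimensional array object: a 2x3 matrix of made-up numbers.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

print(A.shape)         # (2, 3)
print(A.mean(axis=0))  # column means: [2.5 3.5 4.5]
print(A @ A.T)         # linear algebra: 2x2 product of A with its transpose
```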
NumPy provides powerful N-dimensional array objects, linear algebra functions, Fourier transforms, and random number capabilities, forming the basis for many other data science libraries. Explanation (Incorrect a): Pandas builds on NumPy for data manipulation (DataFrames). Explanation (Incorrect b): Matplotlib is for plotting. Explanation (Incorrect c): Scikit-learn provides ML algorithms and tools. Question 14: Which required Python library provides tools for machine learning tasks like classification, regression, clustering, dimensionality reduction, model selection, and preprocessing? a) NumPy b) Pandas c) Scikit-learn d) TensorFlow Answer: c) Explanation (Correct): Scikit-learn is a comprehensive library offering efficient tools for data analysis and machine learning, implementing a wide range of algorithms and utility functions within a consistent interface. Explanation (Incorrect a, b): NumPy and Pandas are prerequisites for data handling but don't contain the ML algorithms themselves. Explanation (Incorrect d): TensorFlow is primarily focused on deep learning (though Scikit-learn integrates with it). Question 15: Exploratory Data Analysis (EDA) is a critical step involving: a) Deploying the model to production. b) Summarizing the main characteristics of the data, often using visualizations. c) Writing the final project report. d) Optimizing the model's learning rate. Answer: b) Explanation (Correct): EDA involves investigating the dataset to discover patterns, spot anomalies, test hypotheses, and check assumptions using summary statistics and graphical representations. It helps understand the data before formal modeling. Explanation (Incorrect a, c, d): These are other distinct stages or tasks in the ML lifecycle. Section 1.2: Evaluation & Fundamentals (Weeks 3-4) Question 16: What is the primary purpose of model evaluation in the machine learning pipeline? a) To select the most complex model possible. b) To assess how well a trained model generalizes to new, unseen data. c) To clean and preprocess the raw data. d) To visualize the relationships between features. Answer: b) Explanation (Correct): Evaluation aims to estimate the model's performance on future data it hasn't encountered during training. This helps in selecting the best model and understanding its reliability. Explanation (Incorrect a): Complexity is not the goal; generalization is. Overly complex models often perform poorly on unseen data (overfitting). Explanation (Incorrect c): Data preparation happens before evaluation. Explanation (Incorrect d): Visualization is part of EDA or communicating results, not the core of performance assessment. Question 17: In classification tasks, what does the term "True Positive" (TP) represent? a) An instance correctly predicted as negative. b) An instance incorrectly predicted as positive (False Alarm). c) An instance correctly predicted as positive. d) An instance incorrectly predicted as negative (Miss). Answer: c) Explanation (Correct): A True Positive occurs when the model correctly predicts the positive class for an instance that actually belongs to the positive class. Explanation (Incorrect a): This is a True Negative (TN). Explanation (Incorrect b): This is a False Positive (FP). Explanation (Incorrect d): This is a False Negative (FN). Question 18: Accuracy is a common evaluation metric, calculated as (TP + TN) / (TP + TN + FP + FN). When can accuracy be a misleading metric? a) When the dataset is perfectly balanced. 
b) When the dataset is highly imbalanced (one class significantly outnumbers the others). c) When using regression models instead of classification. d) When the model achieves very high precision. Answer: b) Explanation (Correct): In imbalanced datasets, a model can achieve high accuracy by simply predicting the majority class most of the time. For example, if 99% of data is negative, predicting negative always yields 99% accuracy but fails to identify any positive cases. Explanation (Incorrect a): Accuracy is generally reliable for balanced datasets. Explanation (Incorrect c): Accuracy is a classification metric, not typically used for regression. Explanation (Incorrect d): High precision doesn't inherently make accuracy misleading, although focusing only on accuracy might obscure issues with recall. Question 19: Precision is calculated as TP / (TP + FP). What question does precision answer? a) Out of all the actual positive instances, how many did the model correctly identify? b) Out of all the instances the model predicted as positive, how many were actually positive? c) What proportion of all predictions were correct? d) How many instances were incorrectly classified? Answer: b) Explanation (Correct): Precision focuses on the predictions made for the positive class. It measures the proportion of positive predictions that were correct. High precision means fewer false positives. Explanation (Incorrect a): This describes Recall (Sensitivity). Explanation (Incorrect c): This describes Accuracy. Explanation (Incorrect d): This relates to the total number of errors (FP + FN). Question 20: Recall (or Sensitivity) is calculated as TP / (TP + FN). What question does recall answer? a) Out of all the actual positive instances, how many did the model correctly identify? b) Out of all the instances the model predicted as positive, how many were actually positive? c) What is the overall error rate of the model? d) How often does the model predict the negative class correctly? Answer: a) Explanation (Correct): Recall focuses on the actual positive instances in the dataset. It measures the proportion of these actual positives that the model successfully identified. High recall means fewer false negatives. Explanation (Incorrect b): This describes Precision. Explanation (Incorrect c): Error rate is 1 - Accuracy. Explanation (Incorrect d): This relates to Specificity (True Negative Rate). Question 21: The F1-Score is the harmonic mean of precision and recall (2 * Precision * Recall / (Precision + Recall)). When is the F1-Score particularly useful? a) When you only care about minimizing false positives. b) When you only care about minimizing false negatives. c) When you want a single metric that balances both precision and recall, especially with imbalanced classes. d) When evaluating regression models. Answer: c) Explanation (Correct): The F1-score provides a balance between precision and recall. It's often preferred over accuracy for imbalanced datasets because it takes both false positives and false negatives into account. Explanation (Incorrect a): Focus on precision if minimizing FP is the priority. Explanation (Incorrect b): Focus on recall if minimizing FN is the priority. Explanation (Incorrect d): F1-score is a classification metric. Question 22: What is a Confusion Matrix used for in classification evaluation? a) To visualize the correlation between different features. 
b) To show the performance of a classification model by detailing the counts of True Positives, True Negatives, False Positives, and False Negatives. c) To plot the decision boundary of a classifier. d) To determine the optimal hyperparameters for a model. Answer: b) Explanation (Correct): A confusion matrix is a table that summarizes the performance of a classification algorithm. The rows typically represent the actual classes, and the columns represent the predicted classes (or vice versa). It provides the counts for TP, TN, FP, and FN, which are used to calculate various metrics like accuracy, precision, recall, etc. Explanation (Incorrect a): Correlation matrices visualize feature relationships. Explanation (Incorrect c): Decision boundary plots visualize how a classifier separates classes in the feature space. Explanation (Incorrect d): Hyperparameter tuning uses techniques like Grid Search or Randomized Search, often evaluating models based on metrics derived from the confusion matrix. Question 23: The ROC (Receiver Operating Characteristic) curve plots which two metrics against each other? a) Precision vs. Recall b) True Positive Rate (Recall) vs. False Positive Rate c) Accuracy vs. F1-Score d) True Negative Rate vs. False Negative Rate Answer: b) Explanation (Correct): The ROC curve illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. It plots the True Positive Rate (TPR, Recall, Sensitivity) on the Y-axis against the False Positive Rate (FPR, 1 - Specificity) on the X-axis. Explanation (Incorrect a): Precision-Recall curves plot precision vs. recall. Explanation (Incorrect c, d): These are not standard curve plots for evaluation. Question 24: What does the Area Under the ROC Curve (AUC-ROC) represent? a) The highest accuracy achievable by the model. b) The probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. c) The optimal threshold for classification. d) The computational cost of training the model. Answer: b) Explanation (Correct): AUC-ROC provides a single scalar value summarizing the performance across all classification thresholds. An AUC of 1.0 represents a perfect classifier, while an AUC of 0.5 represents a classifier performing no better than random chance. It measures the model's ability to distinguish between positive and negative classes. Explanation (Incorrect a): Accuracy depends on a specific threshold. Explanation (Incorrect c): The ROC curve helps visualize threshold choices, but AUC is threshold-independent. Explanation (Incorrect d): AUC measures predictive performance, not computational cost. Question 25: What is K-Fold Cross-Validation? a) A method for feature engineering. b) A technique for training K different models simultaneously. c) A resampling procedure used to evaluate machine learning models on a limited data sample, reducing variance in the performance estimate. d) An algorithm for clustering data into K groups. Answer: c) Explanation (Correct): In K-Fold CV, the data is split into K subsets (folds). The model is trained K times, each time using K-1 folds for training and the remaining fold for validation. The results are averaged across the K runs to provide a more robust estimate of the model's performance than a single train-test split. Explanation (Incorrect a): Feature engineering creates/selects features. 
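A minimal sketch, assuming scikit-learn and made-up labels, tying together the confusion-matrix metrics from Questions 17-24 and the K-Fold procedure from Question 25 (illustrative, not part of the original cheatsheet):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Made-up true labels and predictions for a binary problem.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))  # 2x2 table: rows = actual, columns = predicted
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall

# 5-fold cross-validation on a synthetic dataset: five F1 estimates, one per fold.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="f1")
print(scores.mean())  # averaged estimate of generalization performance
```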
Explanation (Incorrect b): While K models are trained, it's sequential for evaluation, not typically simultaneous training of different model types. Explanation (Incorrect d): This describes K-Means clustering. Question 26: What is the main advantage of using K-Fold Cross-Validation compared to a single train-test split? a) It is much faster to compute. b) It uses less data for training. c) It provides a more reliable and less biased estimate of the model's generalization performance. d) It automatically selects the best features. Answer: c) Explanation (Correct): By training and evaluating the model on different subsets of the data, cross-validation reduces the dependency on a particular random split, giving a more stable and reliable estimate of how the model will perform on unseen data. Explanation (Incorrect a): It is computationally more expensive as it involves training the model K times. Explanation (Incorrect b): Over the K folds, all data is used for both training and validation at some point. Explanation (Incorrect d): Feature selection is a separate process, although CV can be used within it. Question 27: The Bias-Variance Tradeoff is a key concept in model evaluation. A model with high bias is likely to: a) Fit the training data very closely, including noise. b) Perform poorly on both training and test data (underfitting). c) Perform well on training data but poorly on test data (overfitting). d) Be very sensitive to small changes in the training data. Answer: b) Explanation (Correct): High bias indicates that the model is too simple or makes overly strong assumptions, failing to capture the underlying patterns in the data. This leads to underfitting, characterized by poor performance on both the data it was trained on and new data. Explanation (Incorrect a, c): These describe high variance (overfitting). Explanation (Incorrect d): Sensitivity to training data changes is characteristic of high variance. Question 28: A model with high variance is likely to: a) Be too simple to capture the data's patterns. b) Perform poorly on both training and test data (underfitting). c) Perform well on training data but poorly on test data (overfitting). d) Make strong assumptions about the data distribution. Answer: c) Explanation (Correct): High variance indicates that the model is too complex and has learned the noise or specific details of the training data too well. This leads to overfitting, where the model performs excellently on the training set but fails to generalize to unseen data. Explanation (Incorrect a, b): These describe high bias (underfitting). Explanation (Incorrect d): Strong assumptions are characteristic of high bias. Question 29: Techniques like L1 and L2 regularization are often used to: a) Increase model bias and decrease variance (combat overfitting). b) Decrease model bias and increase variance (combat underfitting). c) Speed up the training process significantly. d) Handle missing data automatically. Answer: a) Explanation (Correct): Regularization adds a penalty term to the model's loss function based on the magnitude of the model parameters (weights). This discourages overly complex models with large weights, effectively simplifying the model, increasing bias slightly, but significantly reducing variance, thus helping to prevent overfitting. Explanation (Incorrect b): Regularization aims to reduce complexity/variance. Explanation (Incorrect c): It adds computation, potentially slowing training slightly. 
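As an illustrative sketch of L2 and L1 regularization in practice, assuming scikit-learn's Ridge and Lasso regressors on synthetic data (alpha values are arbitrary):

```python
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_regression

# Synthetic regression data; alpha is the regularization strength.
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

for model in (Ridge(alpha=1.0), Lasso(alpha=1.0)):
    # A larger alpha penalizes large weights more heavily: slightly more bias,
    # less variance, which is how regularization combats overfitting.
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(type(model).__name__, scores.mean())
```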
Explanation (Incorrect d): Missing data requires imputation or specific handling. Question 30: When choosing between models, if Model A has higher training accuracy but lower validation accuracy than Model B, what might this suggest? a) Model A is underfitting. b) Model B is overfitting. c) Model A is likely overfitting compared to Model B. d) Model B is computationally more expensive. Answer: c) Explanation (Correct): The large gap between high training accuracy and lower validation accuracy for Model A is a classic sign of overfitting. Model B, with potentially lower training accuracy but better validation accuracy, is generalizing better. Explanation (Incorrect a): High training accuracy contradicts underfitting. Explanation (Incorrect b): Model B shows better generalization, suggesting less overfitting than Model A. Explanation (Incorrect d): Performance metrics don't directly indicate computational cost. Section 1.3: Decision Trees (Weeks 5-6) Question 31: What is the fundamental idea behind Decision Tree algorithms for classification? a) To find the optimal hyperplane that separates different classes. b) To partition the feature space recursively into regions, assigning a class label to each region. c) To model the probability distribution of each class using Bayes' theorem. d) To combine the predictions of multiple weak learners sequentially. Answer: b) Explanation (Correct): Decision trees work by creating a tree-like structure where internal nodes represent tests on features, branches represent the outcomes of these tests, and leaf nodes represent the final class labels. The process involves recursively splitting the data based on feature values that best separate the classes. Explanation (Incorrect a): This describes Support Vector Machines (SVMs). Explanation (Incorrect c): This describes Naive Bayes or Bayesian networks. Explanation (Incorrect d): This describes Boosting ensemble methods. Question 32: In a decision tree, what do the internal nodes typically represent? a) The final predicted class label. b) A test or condition on a specific feature. c) The probability of belonging to a certain class. d) A cluster centroid. Answer: b) Explanation (Correct): Each internal node in a decision tree corresponds to a test on the value of one of the input features (e.g., "Is feature X > 5?"). Based on the outcome of the test, the data follows a specific branch to the next node. Explanation (Incorrect a): Leaf nodes represent the final predicted class label. Explanation (Incorrect c): While leaf nodes might store class probabilities or distributions, internal nodes represent tests. Explanation (Incorrect d): Cluster centroids are used in clustering algorithms like K-Means. Question 33: What do the leaf nodes (terminal nodes) in a decision tree represent? a) The feature that provides the most information gain. b) The final classification outcome or predicted value for instances reaching that leaf. c) The root of the tree. d) A condition used to split the data. Answer: b) Explanation (Correct): When an instance traverses the tree down to a leaf node, that node contains the final prediction. For classification, this is typically the majority class of the training instances that ended up in that leaf. For regression, it's often the average target value. Explanation (Incorrect a, d): Features and conditions are associated with internal nodes for splitting. Explanation (Incorrect c): The root is the topmost node where the first split occurs. 
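To make the node/leaf structure from Questions 31-33 concrete, a minimal scikit-learn sketch that prints a small tree's learned feature tests and leaf predictions (illustrative only):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a small tree: internal nodes are feature tests, leaves are class predictions.
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# export_text renders the tree as readable if/else rules, e.g. "petal width (cm) <= 0.80".
print(export_text(tree, feature_names=list(iris.feature_names)))
```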
Question 34: Common algorithms for building decision trees (like ID3, C4.5, CART) use criteria to decide the best feature to split on at each node. What do these criteria generally aim to maximize or minimize? a) The depth of the tree. b) The number of leaf nodes. c) The purity of the resulting child nodes (or minimize impurity). d) The correlation between features. Answer: c) Explanation (Correct): Splitting criteria like Information Gain (used in ID3, C4.5) or Gini Impurity (used in CART) measure how well a feature separates the data into distinct classes. The goal is to choose the feature split that results in child nodes that are as homogeneous (pure) as possible with respect to the class labels. Explanation (Incorrect a, b): While tree depth and size are related to complexity, they are not the direct criteria for choosing splits. Explanation (Incorrect d): Feature correlation is considered in other contexts (like multicollinearity) but not the primary splitting criterion. Question 35: What is Information Gain, often used as a splitting criterion in decision trees? a) The average value of a feature. b) The expected reduction in entropy (or impurity) achieved by partitioning the data based on a feature. c) The number of instances correctly classified by a split. d) The correlation between a feature and the target variable. Answer: b) Explanation (Correct): Entropy measures the impurity or disorder of a set of instances. Information Gain calculates the difference between the entropy of the parent node and the weighted average entropy of the child nodes resulting from a split. A higher information gain indicates a more effective split in separating classes. Explanation (Incorrect a, c, d): These describe other statistical measures or concepts. Question 36: What is Gini Impurity, another common splitting criterion used in CART decision trees? a) A measure of the probability of a randomly chosen element being incorrectly classified if it were randomly labeled according to the distribution of labels in the subset. b) The variance of the target variable within a node. c) The depth of the node in the tree. d) The number of features used in the tree. Answer: a) Explanation (Correct): Gini impurity measures the frequency at which any element from the set would be mislabeled if it was randomly labeled according to the distribution of labels in the set. A Gini impurity of 0 indicates perfect purity (all elements belong to one class). The CART algorithm aims to find splits that minimize Gini impurity in the child nodes. Explanation (Incorrect b): Variance is used as an impurity measure for regression trees. Explanation (Incorrect c, d): These relate to tree structure, not impurity. Question 37: Decision trees are prone to overfitting, especially if they are grown very deep. What is a common technique to prevent overfitting in decision trees? a) Increasing the number of features used. b) Using a more complex splitting criterion. c) Pruning the tree (pre-pruning or post-pruning). d) Training the tree on the entire dataset without a test set. Answer: c) Explanation (Correct): Pruning involves limiting the growth of the tree. Pre-pruning stops the tree from growing too deep by setting constraints (e.g., maximum depth, minimum samples per leaf). Post-pruning grows a full tree and then removes branches that provide little predictive power on a validation set. Explanation (Incorrect a): Increasing features can sometimes increase overfitting risk. 
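A short worked computation of the impurity measures from Questions 35 and 36, using a hypothetical node with 8 positive and 2 negative samples:

```python
import math

# Class proportions in a hypothetical node: 8 positives, 2 negatives.
p_pos, p_neg = 8 / 10, 2 / 10

# Entropy: -sum(p * log2(p)); 0 for a pure node, 1 for a 50/50 binary node.
entropy = -(p_pos * math.log2(p_pos) + p_neg * math.log2(p_neg))

# Gini impurity: 1 - sum(p^2); also 0 for a pure node (maximum 0.5 for binary).
gini = 1 - (p_pos**2 + p_neg**2)

print(round(entropy, 3))  # about 0.722
print(round(gini, 3))     # 0.32
# Information gain for a split = parent entropy minus the weighted average
# entropy of the resulting child nodes.
```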
Explanation (Incorrect b): Splitting criteria complexity isn't the primary overfitting control. Explanation (Incorrect d): Training without evaluation leads to overfitting. Question 38: What is an advantage of Decision Trees compared to some other models like SVMs or Neural Networks? a) They always provide the highest possible accuracy. b) They are inherently robust to outliers in the data. c) They are relatively easy to interpret and visualize. d) They require extensive feature scaling. Answer: c) Explanation (Correct): The tree structure, representing a series of explicit decisions based on feature values, is often considered more interpretable or "white-box" compared to more complex models. The decision path for any prediction can be easily followed. Explanation (Incorrect a): No single model type guarantees the highest accuracy in all situations. Explanation (Incorrect b): Decision trees can be sensitive to outliers, especially near decision boundaries. Explanation (Incorrect d): Decision trees are generally insensitive to the scale of features and do not require scaling. Question 39: What is a potential disadvantage of single Decision Trees? a) They are difficult to implement. b) They can be unstable, meaning small changes in the data can lead to a completely different tree structure. c) They cannot handle categorical features. d) They always underfit the data. Answer: b) Explanation (Correct): Decision trees can have high variance. A small change in the training data might result in a different feature being chosen for the top split, leading to a significantly different tree structure and potentially different predictions. Ensemble methods like Random Forests address this instability. Explanation (Incorrect a): They are relatively straightforward to implement. Explanation (Incorrect c): They handle categorical features naturally. Explanation (Incorrect d): They are prone to overfitting if not pruned, not underfitting. Question 40: How do Decision Trees typically handle categorical features? a) They require categorical features to be one-hot encoded. b) They cannot use categorical features directly. c) They can often handle categorical features directly by creating multi-way splits (one branch for each category) or binary splits based on category subsets. d) They convert categorical features to numerical values using label encoding first. Answer: c) Explanation (Correct): Many decision tree algorithms (like C4.5 and extensions of CART) can directly handle categorical features without requiring prior encoding. They can create splits with branches for each category or group categories to form binary splits. Explanation (Incorrect a, d): While encoding is possible, it's often not strictly necessary for the tree algorithm itself. Explanation (Incorrect b): They can use categorical features. Question 41: What does "pre-pruning" in the context of decision trees refer to? a) Growing the full tree first and then removing branches. b) Stopping the tree growth early based on certain criteria (e.g., max depth, min samples per leaf). c) Converting the decision tree into a set of rules. d) Using an ensemble of decision trees. Answer: b) Explanation (Correct): Pre-pruning (or early stopping) involves setting constraints before training begins to prevent the tree from becoming overly complex. Common constraints include limiting the maximum depth, requiring a minimum number of samples to split a node, or requiring a minimum number of samples in a leaf node. 
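These pre-pruning constraints map directly onto scikit-learn hyperparameters; a hedged sketch with arbitrary values:

```python
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: limit growth with constraints set before training begins.
pruned_tree = DecisionTreeClassifier(
    max_depth=4,           # cap the maximum depth of the tree
    min_samples_split=20,  # a node needs at least 20 samples to be split
    min_samples_leaf=5,    # every leaf must contain at least 5 samples
    random_state=0,
)
# pruned_tree.fit(X_train, y_train)  # X_train / y_train are placeholders here
```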
Explanation (Incorrect a): This describes post-pruning. Explanation (Incorrect c, d): These are different techniques. Question 42: What does "post-pruning" (or just "pruning") in the context of decision trees refer to? a) Stopping the tree growth early based on certain criteria. b) Growing the full tree first and then removing or collapsing nodes/subtrees that provide little improvement on a validation set. c) Selecting the best features before building the tree. d) Averaging the predictions of multiple trees. Answer: b) Explanation (Correct): Post-pruning allows the tree to grow fully (potentially overfitting the training data) and then simplifies it by removing branches or nodes that do not significantly improve performance on unseen data (validation set). This often leads to better generalization than pre-pruning. Explanation (Incorrect a): This describes pre-pruning. Explanation (Incorrect c, d): These are different techniques. Question 43: Decision trees can be used for both classification and regression tasks. How does a regression tree differ from a classification tree, particularly at the leaf nodes? a) Regression trees use Gini impurity for splitting, while classification trees use variance reduction. b) Regression trees predict a continuous value at the leaf node (e.g., the average target value of instances in the leaf), while classification trees predict a class label. c) Regression trees cannot be pruned. d) Regression trees only use numerical features. Answer: b) Explanation (Correct): The structure and splitting process are similar, but the prediction at the leaf differs. Classification leaves predict the majority class, while regression leaves predict a continuous value, typically the mean of the target variable for the training samples that fall into that leaf. Explanation (Incorrect a): The splitting criteria are reversed; regression trees typically use variance reduction (or MSE reduction), while classification trees use Gini or entropy. Explanation (Incorrect c): Regression trees can and should be pruned to prevent overfitting. Explanation (Incorrect d): Regression trees can handle categorical features similarly to classification trees. Question 44: Which impurity measure is commonly used for splitting in regression trees? a) Entropy b) Gini Impurity c) Variance Reduction (or Mean Squared Error Reduction) d) Information Gain Answer: c) Explanation (Correct): For regression, the goal is to create leaf nodes where the target values are as similar as possible. Splitting criteria aim to reduce the variance (or equivalently, the Mean Squared Error) of the target variable within the resulting child nodes compared to the parent node. Explanation (Incorrect a, b, d): Entropy, Gini Impurity, and Information Gain are typically used for classification trees. Question 45: If a decision tree has only one node (the root node), what does this imply? a) The dataset has only one feature. b) The model is perfectly accurate. c) No feature split improves the impurity significantly, or pre-pruning criteria stopped growth immediately. d) The dataset contains no instances. Answer: c) Explanation (Correct): A single-node tree (just the root) means no splits were made. This could happen if the initial dataset is already pure (all instances belong to the same class), if no possible split reduces the impurity according to the chosen criterion, or if pre-pruning rules (like minimum samples to split) prevent any splits. Explanation (Incorrect a): The number of features doesn't dictate the tree size. 
Explanation (Incorrect b): A single-node tree usually has low accuracy unless the data is trivial. Explanation (Incorrect d): If the dataset were empty, a tree wouldn't typically be built. Section 1.4: Support Vector Machines (Weeks 7-8) Question 46: What is the main objective of a Support Vector Machine (SVM) for classification? a) To find a hyperplane that maximally separates the different classes in the feature space. b) To recursively partition the data based on feature values. c) To model the probability of class membership using Bayes' theorem. d) To create an ensemble of decision trees. Answer: a) Explanation (Correct): SVMs aim to find the optimal hyperplane (a decision boundary) that has the largest margin between the closest data points of the different classes (the support vectors). This maximal margin is believed to lead to better generalization. Explanation (Incorrect b): This describes Decision Trees. Explanation (Incorrect c): This describes Naive Bayes or Bayesian classifiers. Explanation (Incorrect d): This describes ensemble methods like Random Forests or Boosting. Question 47: In the context of SVMs, what are "support vectors"? a) All data points used for training the SVM. b) The data points that lie farthest from the decision boundary. c) The data points that lie closest to the decision boundary (on the margin or incorrectly classified). d) The features used to define the hyperplane. Answer: c) Explanation (Correct): Support vectors are the critical data points from the training set that define the position and orientation of the optimal hyperplane. They are the points lying on the margin boundaries or within the margin (if using soft margins). Removing non-support vectors would not change the hyperplane. Explanation (Incorrect a, b): Only a subset of training points typically become support vectors. Explanation (Incorrect d): Features define the space, but support vectors are specific data instances. Question 48: What is the "margin" in an SVM? a) The distance between the two farthest points in the dataset. b) The number of misclassified points. c) The region around the separating hyperplane that is kept free of data points (in the hard-margin case). d) The complexity parameter of the SVM model. Answer: c) Explanation (Correct): The margin is the separation distance between the separating hyperplane and the closest data points (support vectors) from either class. SVMs aim to maximize this margin. Explanation (Incorrect a, b, d): These describe other concepts. Question 49: What is the difference between a hard-margin SVM and a soft-margin SVM? a) Hard-margin SVMs use linear kernels, while soft-margin SVMs use non-linear kernels. b) Hard-margin SVMs require the data to be perfectly linearly separable, while soft-margin SVMs allow for some misclassifications or margin violations. c) Hard-margin SVMs are faster to train than soft-margin SVMs. d) Hard-margin SVMs only work for binary classification, while soft-margin SVMs work for multi-class problems. Answer: b) Explanation (Correct): A hard-margin SVM assumes the data can be separated without any errors. A soft-margin SVM introduces slack variables and a penalty parameter (C) to allow some points to be within the margin or even on the wrong side of the hyperplane, making it applicable to non-linearly separable data or data with noise/outliers. Explanation (Incorrect a): Both can use linear or non-linear kernels. Explanation (Incorrect c): Training time depends on various factors, not just the margin type. 
Explanation (Incorrect d): Both can be extended to multi-class problems (e.g., using one-vs-rest or one-vs-one strategies). Question 50: In soft-margin SVMs, what is the role of the hyperparameter 'C'? a) It controls the width of the Gaussian kernel. b) It determines the degree of the polynomial kernel. c) It controls the trade-off between maximizing the margin and minimizing the classification errors (margin violations). d) It specifies the number of support vectors to use. Answer: c) Explanation (Correct): The parameter 'C' is a regularization parameter. A small 'C' allows for a wider margin but tolerates more margin violations (smoother decision boundary, potentially higher bias). A large 'C' aims for fewer margin violations, resulting in a narrower margin and potentially overfitting (more complex boundary, potentially higher variance). Explanation (Incorrect a, b): These relate to kernel parameters (gamma for Gaussian, degree for polynomial). Explanation (Incorrect d): The number of support vectors is determined during training, influenced by C and the data. Question 51: What is the "kernel trick" used in SVMs? a) A method for speeding up the training process. b) A technique to implicitly map data into a higher-dimensional space to find a linear separator, without explicitly computing the coordinates in that space. c) A way to automatically select the best features for the SVM. d) A method for handling missing data in SVMs. Answer: b) Explanation (Correct): When data is not linearly separable in its original space, SVMs can use kernels (like polynomial or Radial Basis Function - RBF/Gaussian) to compute the dot products between data points as if they were mapped to a higher-dimensional space where separation might be linear. The trick is that these dot products can be computed efficiently using the kernel function without ever explicitly performing the high-dimensional mapping. Explanation (Incorrect a, c, d): These describe other techniques. Question 52: Which of the following is a commonly used non-linear kernel function in SVMs? a) Linear Kernel b) Radial Basis Function (RBF) / Gaussian Kernel c) Dot Product Kernel d) Identity Kernel Answer: b) Explanation (Correct): The RBF (Gaussian) kernel (K(x, y) = exp(-gamma * ||x - y||^2)) is a popular choice for non-linear SVMs. It maps data into an infinite-dimensional space and is controlled by the hyperparameter gamma. Explanation (Incorrect a): The linear kernel performs classification in the original feature space. Explanation (Incorrect c, d): These are not standard SVM kernel names in this context. Question 53: In the RBF (Gaussian) kernel for SVMs, what does the hyperparameter 'gamma' typically control? a) The penalty for misclassification. b) The number of dimensions in the feature space. c) The influence of a single training example; low gamma means 'far' influence, high gamma means 'close' influence. d) The maximum margin width. Answer: c) Explanation (Correct): Gamma defines how much influence a single training example has. A low gamma value means points farther apart are considered similar, leading to a smoother, simpler decision boundary (potentially higher bias). A high gamma value means only points close to each other are considered similar, leading to a more complex, wiggly decision boundary that closely fits the training data (potentially higher variance/overfitting). Explanation (Incorrect a): This is controlled by 'C'. Explanation (Incorrect b): The RBF kernel maps to an infinite-dimensional space. 
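A minimal sketch of how C and gamma appear in scikit-learn's SVC, using toy data and arbitrary values (illustrative only):

```python
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# Non-linearly separable toy data.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# RBF kernel: C trades margin width against margin violations; gamma sets how far
# a single training example's influence reaches (high gamma -> wigglier boundary).
clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X_train, y_train)
print(clf.score(X_test, y_test))      # accuracy on held-out data
print(clf.support_vectors_.shape[0])  # number of support vectors found
```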
Explanation (Incorrect d): Margin width is influenced by C, gamma, and the data. Question 54: What is a potential disadvantage of using SVMs, especially with non-linear kernels? a) They are prone to underfitting. b) They are difficult to interpret compared to decision trees. c) They do not require hyperparameter tuning. d) They perform poorly on high-dimensional data. Answer: b) Explanation (Correct): While powerful, the decision boundary resulting from non-linear SVMs (especially in high-dimensional feature spaces created by kernels) can be complex and difficult to interpret directly in terms of the original features, unlike the explicit rules of a decision tree. Explanation (Incorrect a): They can overfit if hyperparameters (C, gamma) are not chosen carefully. Explanation (Incorrect c): They require careful tuning of C and kernel parameters (like gamma). Explanation (Incorrect d): SVMs often perform well on high-dimensional data, especially when the number of dimensions is greater than the number of samples. Question 55: Support Vector Machines are primarily designed for which type of task? a) Clustering b) Dimensionality Reduction c) Binary Classification (though extendable to multi-class) d) Reinforcement Learning Answer: c) Explanation (Correct): The core formulation of SVMs is for separating two classes. Extensions like one-vs-rest or one-vs-one strategies allow SVMs to be applied to multi-class problems, and variations exist for regression (Support Vector Regression - SVR). Explanation (Incorrect a, b, d): These are different ML tasks. Section 1.5: Bayesian Networks (Weeks 8-9) Question 56: What does a Bayesian Network graphically represent? a) The flow of computation in a neural network. b) A set of conditional dependencies among a set of random variables. c) The hierarchical structure of clusters in a dataset. d) The decision boundaries learned by an SVM. Answer: b) Explanation (Correct): A Bayesian Network (or Belief Network) is a probabilistic graphical model. It uses a Directed Acyclic Graph (DAG) where nodes represent random variables and directed edges represent conditional dependencies. Each node is associated with a conditional probability distribution given its parent nodes. Explanation (Incorrect a, c, d): These describe other models or concepts. Question 57: In a Bayesian Network graph, what does a directed edge from node A to node B typically signify? a) A and B are mutually exclusive. b) A and B are independent. c) A is a direct cause or influence on B (or B is conditionally dependent on A). d) A and B have the same probability distribution. Answer: c) Explanation (Correct): The directed edges represent direct probabilistic dependencies. An edge A -> B implies that the probability distribution of variable B depends directly on the value of variable A (given B's other parents, if any). It encodes the statement P(B | A, other_parents). Explanation (Incorrect a, b, d): These are incorrect interpretations of the directed edge. Question 58: What mathematical theorem forms the foundation for inference and learning in Bayesian Networks? a) Central Limit Theorem b) Law of Large Numbers c) Bayes' Theorem d) Pythagorean Theorem Answer: c) Explanation (Correct): Bayes' Theorem (P(H|E) = [P(E|H) * P(H)] / P(E)) is fundamental for updating beliefs (probabilities) about hypotheses (variables) given new evidence. Bayesian networks use this theorem extensively for probabilistic inference (calculating probabilities of interest given observed evidence). 
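A short worked application of Bayes' Theorem with made-up numbers (a diagnostic-test style illustration, not from the course materials):

```python
# Made-up numbers: P(disease) = 0.01, sensitivity P(pos | disease) = 0.95,
# false positive rate P(pos | no disease) = 0.05.
p_d = 0.01
p_pos_given_d = 0.95
p_pos_given_not_d = 0.05

# Evidence term P(pos) via the law of total probability.
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# Bayes' Theorem: P(disease | positive) = P(pos | disease) * P(disease) / P(pos).
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(round(p_d_given_pos, 3))  # about 0.161: a positive test is far from conclusive
```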
Explanation (Incorrect a, b, d): These are important theorems but not the core foundation for Bayesian networks. Question 59: What is probabilistic inference in the context of Bayesian Networks? a) Learning the structure (graph) of the network from data. b) Learning the parameters (conditional probabilities) of the network from data. c) Calculating the probability distribution of some query variables, given observed values (evidence) for other variables. d) Simplifying the network by removing redundant nodes. Answer: c) Explanation (Correct): Inference involves using the network structure and conditional probabilities to answer probabilistic queries. For example, given that a patient has certain symptoms (evidence), what is the probability they have a particular disease (query variable)? Explanation (Incorrect a, b): These are aspects of learning Bayesian networks. Explanation (Incorrect d): This relates to network simplification or structure learning. Question 60: What is a key advantage of using Bayesian Networks? a) They are always the most accurate models for any task. b) They can explicitly model uncertainty and dependencies between variables. c) They require very little data to train effectively. d) They are inherently linear models. Answer: b) Explanation (Correct): Bayesian Networks provide a framework for representing complex joint probability distributions and reasoning about dependencies and uncertainties in a structured way. They allow for incorporating prior knowledge and updating beliefs as new evidence becomes available. Explanation (Incorrect a): Accuracy depends on the problem and data. Explanation (Incorrect c): Learning complex networks can require significant data. Explanation (Incorrect d): They model probabilistic dependencies, which can be highly non-linear. Section 1.7: Neural Networks (Weeks 11-12) Question 61: What is the basic building block of an Artificial Neural Network (ANN)? a) A decision node b) A support vector c) A neuron (or perceptron) d) A cluster centroid Answer: c) Explanation (Correct): ANNs are inspired by biological neural networks. The fundamental processing unit is the artificial neuron (or perceptron), which receives inputs, applies weights, sums them up, applies an activation function, and produces an output. Explanation (Incorrect a): Decision nodes are part of decision trees. Explanation (Incorrect b): Support vectors are key data points in SVMs. Explanation (Incorrect d): Cluster centroids are representative points in clustering algorithms like K-Means. Question 62: In a typical feedforward neural network architecture, how is information processed? a) Information flows in cycles between layers. b) Information flows in one direction, from the input layer, through hidden layers, to the output layer. c) Information is processed only within a single layer. d) Information flow is determined randomly at each step. Answer: b) Explanation (Correct): Feedforward networks are characterized by a unidirectional flow of information. Connections go from neurons in one layer to neurons in the next layer, without forming cycles. Input data propagates forward through the network to produce the output. Explanation (Incorrect a): Networks with cycles are called Recurrent Neural Networks (RNNs). Explanation (Incorrect c): Information is passed between layers. Explanation (Incorrect d): The flow is defined by the network architecture, not random. Question 63: What is the role of an activation function in a neuron? 
a) To normalize the input data before it enters the neuron. b) To determine the number of neurons in the next layer. c) To introduce non-linearity into the network, allowing it to learn complex patterns. d) To calculate the initial weights for the neuron's connections. Answer: c) Explanation (Correct): If neurons only performed weighted sums, the entire network would behave like a single linear model, regardless of depth. Activation functions (like Sigmoid, Tanh, ReLU) introduce non-linearity, enabling the network to approximate complex, non-linear relationships in the data. Explanation (Incorrect a): Normalization is a preprocessing step. Explanation (Incorrect b): The number of neurons is part of the architecture design. Explanation (Incorrect d): Weights are typically initialized randomly and then learned. Question 64: Which of the following is a commonly used activation function, known for its simplicity and effectiveness in mitigating the vanishing gradient problem in deep networks? a) Linear function b) Sigmoid function c) Rectified Linear Unit (ReLU) d) Step function Answer: c) Explanation (Correct): ReLU (f(x) = max(0, x)) is computationally efficient and helps address the vanishing gradient problem that can occur with functions like sigmoid or tanh in deep networks, as its gradient is either 0 or 1 for positive inputs. Explanation (Incorrect a): A linear function would not introduce non-linearity. Explanation (Incorrect b): Sigmoid was popular but suffers from vanishing gradients in deep networks. Explanation (Incorrect d): The step function is non-differentiable at zero and less commonly used in modern deep learning. Question 65: What is the primary purpose of the backpropagation algorithm in training neural networks? a) To initialize the weights of the network randomly. b) To select the optimal number of hidden layers. c) To efficiently compute the gradients of the loss function with respect to the network's weights. d) To perform feature scaling on the input data. Answer: c) Explanation (Correct): Backpropagation is an algorithm that uses the chain rule of calculus to efficiently compute the gradient of the loss function (error) with respect to each weight and bias in the network. These gradients are then used by an optimization algorithm (like gradient descent) to update the weights and minimize the loss. Explanation (Incorrect a): Weight initialization happens before training starts. Explanation (Incorrect b): Network architecture design is typically done separately, often through experimentation or using established patterns. Explanation (Incorrect d): Feature scaling is a preprocessing step. Question 66: What does the term "Deep Learning" generally refer to? a) Any machine learning algorithm that uses probability. b) Machine learning using neural networks with multiple hidden layers (deep architectures). c) Ensemble methods that combine many weak learners. d) Unsupervised learning techniques for dimensionality reduction. Answer: b) Explanation (Correct): Deep Learning is a subfield of machine learning focused on using artificial neural networks with considerable depth (i.e., multiple hidden layers between the input and output layers) to learn complex patterns and hierarchical representations from data. Explanation (Incorrect a, c, d): These describe other areas or techniques within machine learning. Question 67: In the context of training neural networks, what is an "epoch"? a) A single forward pass of data through the network. 
b) A single backward pass (backpropagation) through the network. c) One complete pass of the entire training dataset through the network (both forward and backward). d) The number of neurons in the output layer. Answer: c) Explanation (Correct): An epoch represents one full iteration over the entire training dataset. During an epoch, the network processes all training examples, typically in mini-batches, performing both forward passes (to get predictions) and backward passes (to compute gradients and update weights). Explanation (Incorrect a, b): Forward and backward passes happen multiple times within an epoch, usually for each mini-batch. Explanation (Incorrect d): This is an architectural parameter. Question 68: Which type of neural network architecture is particularly well-suited for processing sequential data like text or time series? a) Convolutional Neural Networks (CNNs) b) Recurrent Neural Networks (RNNs) c) Feedforward Neural Networks (Multilayer Perceptrons) d) Autoencoders Answer: b) Explanation (Correct): RNNs have connections that form cycles, allowing them to maintain an internal state or memory. This makes them suitable for tasks where the order of input data matters, such as natural language processing, speech recognition, and time series analysis. Explanation (Incorrect a): CNNs excel at processing grid-like data, such as images. Explanation (Incorrect c): Standard feedforward networks do not inherently handle sequential dependencies. Explanation (Incorrect d): Autoencoders are typically used for dimensionality reduction or feature learning. Question 69: Convolutional Neural Networks (CNNs) are highly effective for tasks like image classification. What is the primary role of the convolutional layers in a CNN? a) To flatten the input data into a single vector. b) To perform the final classification using a softmax function. c) To apply learnable filters to input data, detecting spatial hierarchies of features (e.g., edges, textures, objects). d) To reduce the dimensionality of the data using pooling. Answer: c) Explanation (Correct): Convolutional layers use filters (kernels) that slide across the input image (or feature map) to detect local patterns. Early layers might detect simple features like edges, while deeper layers combine these to detect more complex features and objects, leveraging the spatial structure of the data. Explanation (Incorrect a): Flattening typically occurs after convolutional and pooling layers, before fully connected layers. Explanation (Incorrect b): The final classification is usually done by fully connected layers followed by a softmax (or similar) output layer. Explanation (Incorrect d): Pooling layers (e.g., Max Pooling) are often used after convolutional layers to reduce dimensionality and provide some translation invariance, but the primary feature detection happens in the convolutional layers. Question 70: What is the "vanishing gradient problem" in deep neural networks? a) When gradients become too large, causing unstable training. b) When the network weights converge to zero. c) When gradients become extremely small during backpropagation, preventing weights in earlier layers from being updated effectively. d) When the loss function has no gradients. Answer: c) Explanation (Correct): In deep networks, especially when using activation functions like sigmoid or tanh, the gradients calculated during backpropagation can become progressively smaller as they propagate backward through the layers. 
Question 70: What is the "vanishing gradient problem" in deep neural networks?
a) When gradients become too large, causing unstable training.
b) When the network weights converge to zero.
c) When gradients become extremely small during backpropagation, preventing weights in earlier layers from being updated effectively.
d) When the loss function has no gradients.
Answer: c)
Explanation (Correct): In deep networks, especially when using activation functions like sigmoid or tanh, the gradients calculated during backpropagation can become progressively smaller as they propagate backward through the layers. This can make the updates to weights in the initial layers negligible, effectively halting learning for those layers.
Explanation (Incorrect a): This describes the exploding gradient problem.
Explanation (Incorrect b, d): Weights don't necessarily converge to zero, and the loss function usually has gradients, but they become too small to be useful.
Question 71: Which optimization algorithm is commonly used to train neural networks by updating weights based on gradients computed from small subsets of the training data?
a) K-Means
b) Principal Component Analysis (PCA)
c) Stochastic Gradient Descent (SGD) or its variants (e.g., Adam, RMSprop)
d) Decision Tree Induction
Answer: c)
Explanation (Correct): Standard gradient descent computes gradients using the entire dataset, which is inefficient for large datasets. SGD updates weights using the gradient computed from a single example or a small mini-batch. Variants like Adam and RMSprop adapt the learning rate and often converge faster.
Explanation (Incorrect a, b, d): K-Means, PCA, and Decision Tree Induction are different types of machine learning algorithms, not optimizers for neural networks.
Question 72: According to the syllabus, which optional libraries are mentioned for the neural networks module, particularly for deep learning?
a) NumPy and Pandas
b) Scikit-learn and Matplotlib
c) TensorFlow and PyTorch
d) NLTK and SpaCy
Answer: c)
Explanation (Correct): The syllabus and content summary explicitly list TensorFlow and PyTorch as optional but relevant libraries for the neural networks module (Weeks 11-12), especially for deep learning implementations.
Explanation (Incorrect a, b): NumPy, Pandas, Scikit-learn, and Matplotlib are listed as core required tools.
Explanation (Incorrect d): NLTK and SpaCy are libraries for natural language processing.
Question 73: What is the purpose of a pooling layer (e.g., Max Pooling) in a Convolutional Neural Network?
a) To increase the number of feature maps.
b) To introduce non-linearity into the network.
c) To reduce the spatial dimensions (width and height) of the feature maps, making the representation more robust to small translations.
d) To apply filters for edge detection.
Answer: c)
Explanation (Correct): Pooling layers downsample the feature maps, reducing their spatial resolution. This decreases the number of parameters and computation in the network, and helps make the detected features somewhat invariant to their exact location in the input.
Explanation (Incorrect a): Convolutional layers typically increase the number of feature maps (channels).
Explanation (Incorrect b): Non-linearity is introduced by activation functions after convolutional layers.
Explanation (Incorrect d): Filter application for feature detection is done by convolutional layers.
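Questions 69-73 fit together naturally in a small convolutional network. Below is a minimal sketch assuming TensorFlow/Keras (one of the optional libraries from Question 72); the layer sizes, input shape, and 10-class output are illustrative assumptions, not a course-specified model:

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(28, 28, 1)),           # e.g., small grayscale images (assumption)
    layers.Conv2D(32, 3, activation="relu"),   # convolution: learnable filters detect local patterns
    layers.MaxPooling2D(),                     # pooling: downsample, adds translation robustness
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),                          # flatten only after the conv/pooling stages
    layers.Dense(10, activation="softmax"),    # final classification layer
])

model.compile(optimizer="adam",                    # an SGD variant with adaptive learning rates
              loss="categorical_crossentropy",     # multi-class, mutually exclusive labels
              metrics=["accuracy"])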
Question 74: When training a neural network for classification, what is a common loss function used for multi-class problems where classes are mutually exclusive?
a) Mean Squared Error (MSE)
b) Mean Absolute Error (MAE)
c) Binary Cross-Entropy
d) Categorical Cross-Entropy
Answer: d)
Explanation (Correct): Categorical Cross-Entropy is designed for multi-class classification problems where each input belongs to exactly one class. It measures the difference between the predicted probability distribution (usually from a softmax output layer) and the true distribution (one-hot encoded).
Explanation (Incorrect a, b): MSE and MAE are typically used for regression problems.
Explanation (Incorrect c): Binary Cross-Entropy is used for binary classification problems (two classes).
Question 75: What is dropout in the context of neural network training?
a) A method for selecting the best activation function.
b) A regularization technique where randomly selected neurons are ignored during training.
c) An algorithm for initializing network weights.
d) A technique for visualizing high-dimensional data.
Answer: b)
Explanation (Correct): Dropout is a regularization technique used to prevent overfitting. During each training iteration, a random fraction of neurons (and their connections) are temporarily removed or "dropped out". This forces the network to learn more robust features that are not overly reliant on any single neuron.
Explanation (Incorrect a, c, d): These describe other concepts unrelated to dropout.
Section 1.8: Advanced Topics (Week 13)
Question 76: Feature engineering involves transforming raw data into features that better represent the underlying problem for predictive models. Which of the following is an example of feature engineering?
a) Splitting data into training and testing sets.
b) Creating a new feature representing the ratio of two existing numerical features.
c) Tuning the hyperparameters of a Random Forest model.
d) Evaluating a model using K-Fold cross-validation.
Answer: b)
Explanation (Correct): Feature engineering is about creating new input variables from existing ones or transforming them to be more informative. Creating interaction terms, polynomial features, or ratios are common examples.
Explanation (Incorrect a, d): Splitting data and cross-validation are part of the model evaluation process.
Explanation (Incorrect c): Hyperparameter tuning is part of model optimization.
Question 77: Hyperparameter optimization is crucial for achieving good model performance. What distinguishes a hyperparameter from a model parameter?
a) Hyperparameters are learned from data during training, while parameters are set before training.
b) Parameters are learned from data during training, while hyperparameters are set before training and control the learning process.
c) Hyperparameters only exist in ensemble methods.
d) Parameters are only used in linear models.
Answer: b)
Explanation (Correct): Model parameters are internal variables learned by the algorithm from the training data (e.g., weights in a neural network, coefficients in linear regression). Hyperparameters are external configuration settings for the algorithm that are not learned from data but set beforehand (e.g., the learning rate, the number of trees in a random forest, the 'C' value in SVM).
Explanation (Incorrect a): This reverses the definitions.
Explanation (Incorrect c, d): Both parameters and hyperparameters exist in various model types.
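A minimal scikit-learn sketch tying together Question 76 (an engineered ratio feature) and Question 77 (hyperparameters set before training vs. parameters learned during training). The column names and values are made up for illustration:

import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({"income":  [40_000, 85_000, 30_000, 60_000],
                   "debt":    [10_000, 20_000, 15_000,  5_000],
                   "default": [1, 0, 1, 0]})

# Feature engineering (Question 76): a ratio of two existing numerical features
df["debt_to_income"] = df["debt"] / df["income"]

X, y = df[["income", "debt", "debt_to_income"]], df["default"]

# C is a hyperparameter: chosen before training, it controls regularization strength
clf = LogisticRegression(C=1.0, max_iter=1000).fit(X, y)

# coef_ and intercept_ are model parameters: learned from the data during fit()
print(clf.coef_, clf.intercept_)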
Question 78: Grid Search and Randomized Search are common techniques for hyperparameter optimization. What is the main difference?
a) Grid Search explores all possible combinations of specified hyperparameter values, while Randomized Search samples a fixed number of combinations randomly.
b) Randomized Search explores all combinations, while Grid Search samples randomly.
c) Grid Search is only for linear models, while Randomized Search is for non-linear models.
d) Grid Search requires less computation time than Randomized Search.
Answer: a)
Explanation (Correct): Grid Search exhaustively tries every combination of the hyperparameter values provided in a grid. Randomized Search, given a budget (number of iterations), samples random combinations from specified distributions or lists, often finding good hyperparameters more efficiently, especially when some hyperparameters are more important than others.
Explanation (Incorrect b): This reverses the definitions.
Explanation (Incorrect c): Both can be used for various model types.
Explanation (Incorrect d): Grid Search is often much more computationally expensive, especially with many hyperparameters or large value ranges.
Question 79: What is the primary goal of dimensionality reduction techniques like Principal Component Analysis (PCA)?
a) To increase the number of features in a dataset.
b) To reduce the number of features while preserving as much important information (variance) as possible.
c) To cluster data points into distinct groups.
d) To train a supervised learning model.
Answer: b)
Explanation (Correct): Dimensionality reduction aims to simplify data by reducing the number of input variables (features). PCA, for example, finds principal components (linear combinations of original features) that capture the maximum variance in the data, allowing representation in a lower-dimensional space with minimal information loss.
Explanation (Incorrect a): The goal is reduction, not increase.
Explanation (Incorrect c): This describes clustering.
Explanation (Incorrect d): While reduced data can be used for supervised learning, dimensionality reduction itself is often unsupervised.
Question 80: Ensemble methods combine multiple individual models (base learners) to produce a final prediction. What is a key reason for using ensemble methods?
a) They are always faster to train than single models.
b) They typically achieve better predictive performance and robustness than individual models.
c) They require less data than single models.
d) They are easier to interpret than single models.
Answer: b)
Explanation (Correct): By combining the predictions of several models, ensemble methods can reduce variance (e.g., Bagging, Random Forests) or bias (e.g., Boosting), often leading to improved accuracy and generalization compared to any single base learner.
Explanation (Incorrect a): Training multiple models usually takes longer.
Explanation (Incorrect c): They generally require the same or more data.
Explanation (Incorrect d): Ensembles are often less interpretable than simple models like decision trees or linear regression.
Question 81: Bagging (Bootstrap Aggregating) is an ensemble technique. How does it primarily work?
a) It trains models sequentially, with each model focusing on the errors of the previous one.
b) It trains multiple models independently on different bootstrap samples (random samples with replacement) of the training data and averages their predictions.
c) It assigns weights to training instances based on how difficult they are to classify.
d) It selects the single best-performing model from a pool of candidates.
Answer: b)
Explanation (Correct): Bagging involves creating multiple versions of the training set via bootstrapping, training a separate model (e.g., a decision tree) on each sample, and then combining their predictions (e.g., by averaging for regression or majority vote for classification). This helps reduce variance.
Explanation (Incorrect a): This describes Boosting.
Explanation (Incorrect c): Weighting instances is characteristic of Boosting.
Explanation (Incorrect d): Ensembles combine models, not just select one.
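A minimal sketch of Question 78, tuning a bagging-style ensemble (a Random Forest, as in Question 81) with Grid Search versus Randomized Search; the parameter values and synthetic data are illustrative only:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
rf = RandomForestClassifier(random_state=0)

param_grid = {"n_estimators": [100, 200], "max_depth": [3, 5, None]}

# Grid Search: tries every combination in the grid (2 x 3 = 6 candidates per CV fold)
grid = GridSearchCV(rf, param_grid, cv=3).fit(X, y)

# Randomized Search: samples a fixed budget of combinations (here 4) at random
rand = RandomizedSearchCV(rf, param_grid, n_iter=4, cv=3, random_state=0).fit(X, y)

print(grid.best_params_, rand.best_params_)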
Question 82: Random Forest is an extension of Bagging, primarily used with decision trees. What additional source of randomness does it introduce compared to standard Bagging with trees?
a) It randomly selects the target variable for each tree.
b) It randomly selects a subset of features to consider at each split point in the individual trees.
c) It uses different types of base learners in the ensemble.
d) It assigns random weights to the predictions of each tree.
Answer: b)
Explanation (Correct): In addition to training each tree on a bootstrap sample of the data (like Bagging), Random Forest also considers only a random subset of features when searching for the best split at each node. This further decorrelates the trees and often improves performance by reducing variance.
Explanation (Incorrect a, c, d): These are not standard components of the Random Forest algorithm.
Question 83: Boosting is another type of ensemble method. What is the general principle behind Boosting algorithms like AdaBoost or Gradient Boosting?
a) To train multiple models in parallel on different subsets of features.
b) To train models sequentially, where each new model attempts to correct the errors made by the previous models.
c) To average the predictions of many independently trained models.
d) To randomly sample data points and features for each model.
Answer: b)
Explanation (Correct): Boosting builds an ensemble sequentially. Each subsequent model focuses more on the training instances that were misclassified by the preceding models. The final prediction is a weighted combination of the predictions from all models.
Explanation (Incorrect a): Training in parallel on feature subsets relates more to Random Forests.
Explanation (Incorrect c): Averaging independent models is characteristic of Bagging.
Explanation (Incorrect d): Random sampling is used in Bagging/Random Forests.
Question 84: What is a potential disadvantage of Boosting methods compared to Bagging methods like Random Forests?
a) Boosting is generally less accurate.
b) Boosting is less prone to overfitting.
c) Boosting can be more sensitive to noisy data and outliers, and potentially more prone to overfitting if not carefully tuned (e.g., with learning rate, number of estimators).
d) Boosting cannot be used for regression tasks.
Answer: c)
Explanation (Correct): Because boosting focuses on correcting errors, noisy data or outliers that are hard to classify can receive increasing attention from subsequent models, potentially leading to overfitting. Careful tuning of hyperparameters (like learning rate, tree depth, number of estimators) is often required.
Explanation (Incorrect a): Boosting often achieves higher accuracy than Bagging, but requires more care.
Explanation (Incorrect b): Boosting can be more prone to overfitting than Random Forests if not regularized.
Explanation (Incorrect d): Boosting algorithms (like Gradient Boosting Regressor) exist for regression.
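A minimal sketch contrasting the extra feature-level randomness of Random Forests (Question 82) with the sequential, carefully tuned nature of boosting (Questions 83-84); the synthetic data and parameter values are illustrative only:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=12, random_state=1)

# Random Forest: bootstrap samples plus a random subset of features at each split
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=1)

# Gradient Boosting: trees built sequentially; learning_rate, n_estimators, and tree
# depth are the main knobs for keeping overfitting in check
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05,
                                max_depth=3, random_state=1)

print("RF CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())
print("GB CV accuracy:", cross_val_score(gb, X, y, cv=5).mean())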
Question 85: Model interpretability refers to the degree to which a human can understand the reasons behind a model's predictions. Which model type is generally considered more interpretable?
a) Deep Neural Networks
b) Random Forests
c) Linear Regression or simple Decision Trees (shallow)
d) Support Vector Machines with RBF kernel
Answer: c)
Explanation (Correct): Linear models have explicit coefficients for each feature, showing their contribution. Shallow decision trees have clear, traceable decision paths. These are often easier for humans to understand compared to the complex interactions in deep networks, ensembles, or kernelized SVMs.
Explanation (Incorrect a, b, d): These models are often treated as "black boxes" due to their complexity, although techniques exist to provide partial explanations (e.g., SHAP, LIME).
Question 86: Techniques like LIME (Local Interpretable Model-agnostic Explanations) and SHAP (SHapley Additive exPlanations) are used for:
a) Training machine learning models faster.
b) Automatically engineering new features.
c) Explaining the predictions of complex, often black-box, machine learning models.
d) Performing dimensionality reduction.
Answer: c)
Explanation (Correct): As models become more complex, understanding why they make specific predictions becomes harder but more important. LIME and SHAP are popular techniques designed to provide insights into individual predictions made by almost any type of machine learning model, enhancing interpretability.
Explanation (Incorrect a, b, d): These describe other ML tasks or techniques.
Question 87: What is the primary focus of ethical considerations in machine learning?
a) Maximizing model accuracy at all costs.
b) Ensuring models are fair, unbiased, transparent, and accountable, considering their potential societal impact.
c) Developing the most computationally efficient algorithms.
d) Using the largest possible datasets for training.
Answer: b)
Explanation (Correct): Ethical ML involves critically examining how models are built and deployed, addressing potential issues like bias amplification, lack of transparency, privacy violations, and ensuring that the technology benefits society equitably and responsibly.
Explanation (Incorrect a): Accuracy is important but should not override fairness or ethical concerns.
Explanation (Incorrect c): Efficiency is a technical goal, not the primary ethical focus.
Explanation (Incorrect d): Data size doesn't guarantee ethical outcomes; data quality and representativeness are crucial.
Question 88: Algorithmic bias can arise in machine learning models. What is a common source of such bias?
a) Using too many hidden layers in a neural network.
b) Biases present in the training data reflecting historical or societal inequalities.
c) Choosing a learning rate that is too high.
d) Not using cross-validation during evaluation.
Answer: b)
Explanation (Correct): If the data used to train a model contains biases (e.g., underrepresentation of certain groups, skewed labels reflecting past discrimination), the model is likely to learn and potentially amplify these biases, leading to unfair or discriminatory outcomes.
Explanation (Incorrect a, c, d): These relate to model architecture, training procedure, or evaluation methods, which are less direct sources of societal bias compared to the data itself.
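A minimal sketch of Question 86, assuming the third-party shap package is installed (it is not one of the required course tools); the synthetic data is illustrative only:

import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# SHAP decomposes each prediction into additive per-feature contributions,
# giving local explanations for an otherwise "black-box" ensemble
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:5])   # explanations for the first 5 rows
print(type(shap_values))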
Question 89: What does the concept of "fairness" in machine learning typically involve?
a) Ensuring the model achieves the same accuracy for all demographic groups.
b) Designing models that are completely free of any statistical bias.
c) Measuring and mitigating disparities in model performance or outcomes across different protected groups (e.g., based on race, gender).
d) Using only interpretable models like linear regression.
Answer: c)
Explanation (Correct): Fairness in ML is a complex and multifaceted concept with various mathematical definitions. It generally involves assessing whether a model's predictions or errors disproportionately affect certain groups and implementing strategies to reduce such disparities, aiming for equitable treatment.
Explanation (Incorrect a): Equal accuracy is one possible fairness metric (Equal Opportunity), but not the only one, and sometimes impossible to achieve simultaneously with others.
Explanation (Incorrect b): Statistical bias is inherent in modeling; the goal is to mitigate harmful or unfair bias.
Explanation (Incorrect d): Interpretability is related but distinct from fairness; complex models can sometimes be made fairer.
Question 90: When deploying a machine learning model, why is continuous monitoring important?
a) To ensure the model's code remains unchanged.
b) To detect potential degradation in performance over time due to changes in data distribution (concept drift) or other factors.
c) To satisfy regulatory requirements for documentation.
d) To train the model on new data continuously.
Answer: b)
Explanation (Correct): The real-world data a model encounters after deployment may change over time (data drift or concept drift). Continuous monitoring of the model's performance and input data characteristics is crucial to identify when the model may no longer be performing reliably and needs retraining or updating.
Explanation (Incorrect a): Code changes might be necessary for updates.
Explanation (Incorrect c): Documentation is important but distinct from performance monitoring.
Explanation (Incorrect d): Continuous training might be a response to monitoring results, but monitoring itself is the act of observing performance and data.
Scenario-Based Questions
Question 1: You are working on a medical diagnosis system to predict whether a patient has a rare disease based on various symptoms and test results. The dataset is highly imbalanced, with only 2% of patients having the disease. Your initial model achieves 98% accuracy, but you're concerned about its performance. What issue might be occurring, and what steps would you take to address it?
Ideal Answer: The high accuracy (98%) is misleading because it matches the percentage of negative cases in the dataset. The model might be simply predicting "no disease" for all patients, which would still achieve 98% accuracy but would be useless for identifying patients with the disease. To address this issue, I would:
1. Use more appropriate evaluation metrics like precision, recall, F1-score, or AUC-ROC that better handle imbalanced data
2. Apply resampling techniques such as oversampling the minority class (SMOTE), undersampling the majority class, or a combination
3. Use class weighting to penalize misclassification of the minority class more heavily
4. Consider ensemble methods like Random Forests that can handle imbalanced data better
5. Focus on optimizing the threshold for classification based on the cost of false negatives vs. false positives, since missing a disease diagnosis (false negative) is typically more costly than a false alarm (false positive)
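A minimal sketch of two of the steps suggested for Scenario Question 1 (class weighting and imbalance-aware metrics), using a synthetic 2%-positive dataset purely for illustration:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.98, 0.02], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" penalizes minority-class mistakes more heavily
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

# Report precision/recall/F1 and AUC instead of relying on raw accuracy
print(classification_report(y_te, clf.predict(X_te)))
print("ROC AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))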
Question 2: You've trained a Random Forest model to predict customer churn for a subscription service. The model performs well on your test set, but when deployed, its performance degrades significantly over time. Why might this be happening, and what would you do to mitigate this issue?
Ideal Answer: This is likely a case of concept drift, where the statistical properties of the target variable (customer churn patterns) are changing over time, making the original model less effective. This could be due to:
Changing customer behaviors
New competitors entering the market
Seasonal effects
Changes in the company's own product or pricing
To mitigate this issue, I would:
1. Implement a monitoring system to track model performance metrics over time
2. Set up automated alerts when performance drops below certain thresholds
3. Regularly retrain the model with more recent data (e.g., monthly or quarterly)
4. Consider using online learning algorithms that can adapt to changing patterns
5. Implement a sliding window approach where older data is gradually phased out
6. Add time-based features that might capture seasonal or trend effects
7. Consider developing an ensemble of models trained on different time periods
Question 3: You're building a recommendation system for an e-commerce platform. You have access to user demographics, browsing history, purchase history, and product details. What machine learning approach would you use and why?
Ideal Answer: For an e-commerce recommendation system, I would implement a hybrid approach combining:
1. Collaborative filtering: This captures patterns based on user behavior similarities. Users who purchased similar items in the past might have similar preferences in the future.
◦ Matrix factorization techniques like Singular Value Decomposition (SVD)
◦ Neural network-based approaches like Neural Collaborative Filtering
2. Content-based filtering: This uses product features and user preferences to make recommendations.
◦ Would leverage product details and user demographics
◦ Can help address the "cold start" problem for new products
3. Knowledge-based recommendations: For high-value items purchased infrequently
The hybrid approach addresses limitations of individual methods:
Collaborative filtering alone suffers from the cold-start problem for new users/items
Content-based filtering alone might miss unexpected but relevant recommendations
A hybrid system can provide more robust and diverse recommendations
I would evaluate the system using offline metrics (precision, recall, NDCG) and online A/B testing to measure actual user engagement and conversion rates.
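A minimal sketch of the monitoring-and-alerting idea from Scenario Question 2: score recent labeled batches and flag when performance drops. The threshold value, the batch source, and the assumption of a fitted classifier with predict_proba are all illustrative:

from sklearn.metrics import roc_auc_score

ALERT_THRESHOLD = 0.75  # assumed acceptable AUC, set from historical performance

def check_recent_performance(model, recent_batches):
    """recent_batches: iterable of (X, y_true) pairs collected after deployment."""
    alerts = []
    for i, (X, y_true) in enumerate(recent_batches):
        auc = roc_auc_score(y_true, model.predict_proba(X)[:, 1])
        if auc < ALERT_THRESHOLD:
            alerts.append((i, auc))   # candidate trigger for retraining on newer data
    return alerts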
Question 4: Your team has developed a machine learning model to predict housing prices. The model performs well on the training data but poorly on the validation set. When you examine the learning curves, you notice high training accuracy but significantly lower validation accuracy. What is likely happening, and how would you address it?
Ideal Answer: The model is likely overfitting to the training data. The high performance on training data coupled with poor performance on validation data indicates the model is learning the noise and specific patterns in the training set rather than generalizing well. To address overfitting, I would:
1. Simplify the model (reduce complexity):
◦ Decrease the depth of decision trees
◦ Reduce the number of parameters in neural networks
◦ Use fewer features
2. Add regularization:
◦ L1 or L2 regularization for linear models
◦ Dropout for neural networks
◦ Pruning for decision trees
3. Collect more training data if possible
4. Use ensemble methods like Random Forests or Gradient Boosting, which are more resistant to overfitting
5. Implement cross-validation to ensure the model generalizes well across different data subsets
6. Apply feature selection to remove irrelevant features that might contribute to noise
7. Early stopping during training when validation performance starts to degrade
Question 5: You're analyzing customer feedback data for a product and want to automatically categorize thousands of text reviews into themes like "user interface," "performance," "pricing," etc. What machine learning approach would you use and why?
Ideal Answer: For categorizing customer feedback into themes, I would use Natural Language Processing (NLP) techniques combined with either supervised or unsupervised learning approaches:
If I have labeled data (reviews already categorized):
1. I would use a supervised text classification approach:
◦ Preprocess text (tokenization, removing stop words, stemming/lemmatization)
◦ Convert text to numerical features using techniques like TF-IDF or word embeddings
◦ Train a classifier such as Naive Bayes, SVM, or a deep learning model like BERT
◦ Implement multi-label classification since reviews might belong to multiple categories
If I don't have labeled data:
1. I would use unsupervised learning techniques:
◦ Topic modeling approaches like Latent Dirichlet Allocation (LDA) to discover themes
◦ Clustering techniques like K-means on document embeddings
◦ Then manually review and label the discovered clusters
For either approach, I would:
Validate results with manual review of a sample
Implement a semi-supervised approach where I start with unsupervised learning to discover themes, manually verify a subset, then use that labeled data to train a supervised model
Consider sentiment analysis alongside categorization to understand positive and negative aspects within each theme
Question 6: Your company wants to implement a fraud detection system for credit card transactions. False positives (flagging legitimate transactions as fraudulent) cause customer frustration, while false negatives (missing actual fraud) result in financial losses. How would you approach building this system?
Ideal Answer: For a fraud detection system balancing false positives and false negatives, I would:
1. Frame it as an anomaly detection problem with these considerations:
◦ Use highly imbalanced classification techniques since fraudulent transactions are rare
◦ Implement cost-sensitive learning where the cost matrix reflects the different impacts of errors
◦ Focus on recall (catching fraud) while maintaining acceptable precision (minimizing false alarms)
2. Feature engineering would be critical:
◦ Create behavioral features (deviation from user's normal patterns)
◦ Time-based features (time of day, day of week)
◦ Location-based features (geographic anomalies)
◦ Transaction amount features relative to user history
◦ Network features (connections to known fraudulent entities)
3. Model selection:
◦ Ensemble methods like XGBoost or Random Forests work well for fraud detection
◦ Isolation Forests specifically designed for anomaly detection
◦ Deep learning approaches for complex pattern recognition
◦ Consider a two-stage approach: a fast model to flag suspicious transactions, followed by a more thorough model
4. Implementation strategy:
◦ Real-time scoring system with configurable thresholds
◦ Human review process for transactions above certain risk thresholds
◦ Feedback loop to continuously improve the model with new confirmed fraud cases
◦ A/B testing different models and thresholds to optimize the cost-benefit tradeoff
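A minimal sketch of the supervised text-classification route from Scenario Question 5 (TF-IDF features plus a Naive Bayes baseline); the reviews and theme labels are made up for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

reviews = ["The app crashes and feels slow",
           "Love the clean interface and layout",
           "Subscription price is too high"]
themes = ["performance", "user interface", "pricing"]

# TF-IDF turns raw text into numerical features; Naive Bayes is a simple baseline classifier
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(reviews, themes)

print(model.predict(["Great value for the monthly price"]))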
Question 7: You're working with time series data to forecast product demand for a retail chain. The data shows strong seasonal patterns and an overall upward trend. What machine learning approaches would be appropriate, and what special considerations would you need to address?
Ideal Answer: For forecasting retail product demand with seasonality and trend, I would consider:
1. Time series specific models:
◦ ARIMA/SARIMA to capture trend and seasonality
◦ Exponential smoothing methods (Holt-Winters) for trend and seasonal components
◦ Prophet (developed by Facebook) which handles seasonality and holidays well
2. Machine learning approaches:
◦ Gradient Boosting models (XGBoost, LightGBM) with engineered time features
◦ LSTM or other recurrent neural networks for capturing long-term dependencies
◦ Hybrid models combining statistical and ML approaches
Special considerations:
1. Feature engineering for time series:
◦ Create lag features (previous periods' demand)
◦ Add calendar features (day of week, month, holidays)
◦ Encode seasonal patterns (Fourier terms)
◦ Include external factors (promotions, pricing, weather)
2. Proper validation strategy:
◦ Use time-based cross-validation (not random)
◦ Ensure validation periods maintain the temporal order
◦ Test on multiple seasonal cycles
3. Multi-level forecasting:
◦ Consider hierarchical forecasting for different store locations
◦ Ensure forecasts are consistent across product categories
4. Handling special events:
◦ Account for holidays, promotions, and other irregular events
◦ Consider separate models for regular periods and special events
5. Evaluation metrics:
◦ Use scale-independent metrics like MAPE or SMAPE
◦ Consider asymmetric loss functions if over-forecasting has different costs than under-forecasting
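A minimal sketch of two considerations from Scenario Question 7: lag/calendar features and time-ordered (rather than random) cross-validation. The synthetic demand series is illustrative only:

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
dates = pd.date_range("2023-01-01", periods=200, freq="D")
demand = (100 + 0.2 * np.arange(200)                      # upward trend
          + 10 * np.sin(2 * np.pi * np.arange(200) / 7)   # weekly seasonality
          + rng.normal(0, 3, 200))                        # noise
df = pd.DataFrame({"demand": demand}, index=dates)

# Lag and calendar features
df["lag_1"] = df["demand"].shift(1)
df["lag_7"] = df["demand"].shift(7)
df["dayofweek"] = df.index.dayofweek
df = df.dropna()

X, y = df[["lag_1", "lag_7", "dayofweek"]], df["demand"]

# TimeSeriesSplit keeps temporal order: each fold trains on the past, tests on the future
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    model = GradientBoostingRegressor().fit(X.iloc[train_idx], y.iloc[train_idx])
    print("fold R^2:", model.score(X.iloc[test_idx], y.iloc[test_idx]))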
Question 8: You're building a model to predict which customers are likely to respond to a marketing campaign. You have a dataset with 50 features, but you suspect many might be irrelevant or redundant. How would you approach feature selection and why is it important in this context?
Ideal Answer: Feature selection is crucial in this marketing response prediction context because:
It reduces overfitting by removing noise from irrelevant features
It improves model interpretability, which is important for marketing insights
It reduces computational costs and model complexity
It can improve model performance by focusing on truly predictive signals
I would implement a comprehensive feature selection strategy:
1. Filter methods (feature-target relationship):
◦ Calculate correlation coefficients between each feature and the target
◦ Use statistical tests like chi-square for categorical features
◦ Apply information gain or mutual information metrics
◦ Remove features with near-zero variance
2. Wrapper methods (model performance-based):
◦ Recursive Feature Elimination (RFE)
◦ Forward or backward stepwise selection
◦ Use cross-validation to evaluate each feature subset
3. Embedded methods (built into model training):
◦ Use L1 regularization (Lasso) to shrink irrelevant feature coefficients to zero
◦ Leverage feature importance from tree-based models like Random Forest or XGBoost
◦ Gradient-based feature selection in neural networks
4. Dimensionality reduction:
◦ Principal Component Analysis (PCA) to handle correlated features
◦ Factor analysis for interpretable feature groups
5. Domain knowledge integration:
◦ Consult with marketing experts about which features they believe are most relevant
◦ Consider feature groups based on customer behavior, demographics, and past campaign responses
I would validate the selected feature subset using cross-validation and ensure the final model remains interpretable for marketing insights.
Question 9: Your team has collected a large dataset for a machine learning project, but you notice it contains a significant amount of missing values across different features. How would you handle this issue, and what factors would influence your approach?
Ideal Answer: When handling missing values, I would first analyze the nature and pattern of missingness:
1. Understand the missing data mechanism:
◦ Missing Completely At Random (MCAR): No pattern to missingness
◦ Missing At Random (MAR): Missingness related to observed data
◦ Missing Not At Random (MNAR): Missingness related to unobserved factors
2. Analyze the extent and pattern:
◦ Calculate percentage of missing values per feature and per sample
◦ Visualize patterns (e.g., missingness correlation heatmap)
◦ Check if missingness is related to the target variable
Based on this analysis, I would implement a strategy combining:
1. Feature-level decisions:
◦ Drop features with excessive missing values (e.g., >50%)
◦ For features with moderate missingness:
▪ Simple imputation (mean/median/mode) for MCAR data
▪ More sophisticated imputation (KNN, regression, MICE) for MAR data
▪ Consider adding "missing" indicators as new features
2. Sample-level decisions:
◦ Drop samples with too many missing values
◦ Consider the impact on class balance if removing samples
3. Advanced techniques:
◦ Use algorithms that handle missing values natively (e.g., XGBoost)
◦ Matrix factorization for datasets with many missing values
◦ Deep learning approaches for complex imputation
Factors influencing my approach would include:
The importance of each feature (domain knowledge)
Dataset size (can we afford to drop data?)
The ML algorithm to be used (some handle missing values better)
Computational resources available
Whether the missing values themselves contain information
I would validate my approach by comparing model performance with different imputation strategies.
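A minimal sketch of the imputation options discussed in Scenario Question 9, on a tiny made-up DataFrame; column names and values are illustrative only:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

df = pd.DataFrame({"age":    [25, np.nan, 40, 31],
                   "income": [50_000, 62_000, np.nan, 58_000]})

# Simple median imputation, plus indicator columns flagging which values were missing
simple = SimpleImputer(strategy="median", add_indicator=True)
print(simple.fit_transform(df))

# KNN imputation: fills gaps using the most similar rows (better suited to MAR data)
knn = KNNImputer(n_neighbors=2)
print(knn.fit_transform(df))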
Question 10: You've been asked to develop a system that automatically grades student essays. What machine learning approach would you use, what challenges might you face, and how would you ensure the system is fair and accurate?
Ideal Answer: For an automated essay grading system, I would implement:
1. NLP-based approach combining:
◦ Feature engineering: extracting linguistic features (grammar, vocabulary diversity, sentence complexity, coherence markers)
◦ Deep learning: using pre-trained language models like BERT or GPT to capture semantic meaning and context
◦ Supervised learning: training on essays with human-assigned scores
2. Key challenges to address:
◦ Subjectivity in grading criteria
◦ Capturing higher-order aspects like creativity and critical thinking
◦ Potential for gaming the system (e.g., using flowery language without substance)
◦ Handling diverse writing styles and topics
◦ Ensuring fairness across different demographic groups
3. To ensure fairness and accuracy:
◦ Train on diverse essay samples from various demographic groups
◦ Use multiple human graders for training data to reduce individual bias
◦ Implement regular bias audits to check for disparate impact across groups
◦ Create an ensemble of models trained on different feature sets or using different algorithms
◦ Provide explanations for grades using techniques like LIME or SHAP
◦ Implement a human review process for borderline cases or appeals
◦ Continuously monitor and update the model based on feedback and new data
Definition/List Questions
Question 1: Define the three main types of machine learning (Supervised, Unsupervised, Reinforcement) and provide one example application for each.
Ideal Answer:
1. Supervised Learning:
◦ Definition: Learning a mapping from input features to known output labels based on a labeled training dataset. The goal is to predict the output for new, unseen inputs.
◦ Example Application: Email spam detection (classifying emails as spam or not spam based on labeled examples).
2. Unsupervised Learning:
◦ Definition: Finding patterns, structures, or relationships within an unlabeled dataset without predefined outputs. The goal is to discover inherent groupings or representations in the data.
◦ Example Application: Customer segmentation (grouping customers with similar purchasing behaviors without prior labels).
3. Reinforcement Learning:
◦ Definition: An agent learns to make sequences of decisions by interacting with an environment. It receives rewards or penalties for its actions and aims to learn a policy that maximizes cumulative reward over time.
◦ Example Application: Training a bot to play a game like Chess or Go (learning optimal moves through trial and error).
Question 2: List and b