Graded Assignment - Week 8
Indian Institute of Technology, Madras
Graded assignment

Question 1

Statement
Consider two different generative model-based algorithms.
1. Model 1: the chance of a feature occurring is affected by the occurrence of other features; the model does not impose any additional condition of conditional independence on the features.
2. Model 2: the chance of a feature occurring is not affected by the occurrence of other features; the model therefore assumes that the features are conditionally independent given the label.
Which model has more independent parameters to estimate?

Options
(a) Model 1
(b) Model 2

Answer
(a)

Solution
In the first model, the features are not independent; therefore, we need to estimate the probability (or density) of each and every possible example given the labels. In Model 2, the features are conditionally independent, so we only need to estimate the pmf (or pdf) of each feature separately. Hence, Model 1 has more parameters to estimate.

Question 2

Statement
Which of the following statements is/are always correct in the context of the naive Bayes classification algorithm for binary classification with all binary features? Here, the estimates in question are the estimated probabilities that a feature of a data point takes a given value, given the label of the point.

Options
(a) If the estimate is zero for one label, then it is zero for the other label as well.
(b) The estimates are equal across labels for any feature.
(c) If the estimate is zero for one label, then the feature takes the other value for all examples of the other label.
(d) If the estimate is zero, no example with that label in the training dataset takes that feature value.
(e) None of the above

Answer
(d)

Solution
In general, these estimates are the parameters of different distributions for different features and different labels. Therefore, if an estimate is zero for one label, it is not mandatory that the corresponding estimate is zero for the other label, as they come from different distributions. For different features and labels the distributions are different, so equality between estimates is not necessary either. What we can say is this: if the estimate of a feature value given a label is zero, it implies that no training example with that label takes that feature value. It does not mean that the feature value is fixed for all examples of the other label. Hence, only option (d) is always correct.

Question 3

Statement
A naive Bayes model is trained on a dataset with binary labels. If a test point was predicted to have a particular label, which of the following expressions is sufficient for this prediction?

Options
(a)
(b)
(c)
(d) None of the above

Answer
(c)

Solution
If a test example is predicted to have a label, it implies that the posterior (prior times likelihood) of that label is at least as large as that of the other label.

Question 4

Statement
Consider a binary classification dataset containing only one feature, where the data points given the label follow the given distributions. If the decision boundary learned using the Gaussian naive Bayes algorithm is linear, what is the value of the unknown parameter?

Answer

Solution
Since the decision boundary is linear, both the variances must be the same; the unknown variance therefore equals the variance of the other class.

Question 5

Statement
Consider a binary classification dataset with two binary features. One feature takes the same value for all examples of one label, but the examples of the other label take both values for that feature. If we apply the naive Bayes algorithm on this dataset, what will be the prediction for the given test point?

Options
(a) Label 0
(b) Label 1
(c) Insufficient information to predict.

Answer
(c)

Solution
Given that the feature takes the same value for all examples of the first label, the corresponding class-conditional estimate is one. For the other label, the examples take both values of the feature, so its class-conditional estimates are strictly between zero and one. Still, the posterior for the first label can be zero if its remaining factor is zero. So we need the value of that remaining estimate to draw any conclusion.
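Questions 1 to 5 reason about the naive Bayes estimates without needing the full formulas. As a concrete reference, here is a minimal sketch of Bernoulli naive Bayes with MLE estimates; the toy data matrix and labels are invented for illustration, not taken from the assignment:

import numpy as np

# Toy binary dataset: rows are samples, columns are binary features.
# These values are illustrative only.
X = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
y = np.array([1, 1, 0, 0])

# Prior estimate: fraction of label-1 examples.
p_hat = y.mean()

# Class-conditional estimates: one Bernoulli parameter per (feature, label).
p_j1 = X[y == 1].mean(axis=0)
p_j0 = X[y == 0].mean(axis=0)

def predict(x):
    # Compare P(y=1) * prod_j P(x_j | y=1) against the label-0 counterpart.
    like1 = p_hat * np.prod(p_j1 ** x * (1 - p_j1) ** (1 - x))
    like0 = (1 - p_hat) * np.prod(p_j0 ** x * (1 - p_j0) ** (1 - x))
    return int(like1 > like0)

print(predict(np.array([1, 0])))  # predicted label for a test point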
Common data for questions 6 and 7

Statement
Consider the following binary classification dataset with two features. The data points given the labels follow the Gaussian distribution. The dataset is given as:

feature 1    feature 2    label
0.5          1.3          1
0.7          1.1          1
1.3          2.0          0
2.3          2.4          0

Question 6

Statement
What will be the value of the estimate for the requested parameter?

Answer

Solution

Question 7

Statement
What will be the value of the requested estimate?

Options
(a)
(b)
(c)
(d)

Answer
(a)

Solution
The estimate is a ratio of indicator-function counts over the four data points.

Question 8

Statement
Consider a binary classification dataset containing two features. The first feature is categorical and can take three values; the second feature is numerical and follows the Gaussian distribution. How many independent parameters must be estimated if we apply the naive Bayes algorithm on this dataset?

Answer
9

Solution
We need one parameter for the label prior, as the label takes only two values. For a given label (say label 0), the categorical feature can take three values, therefore we need two estimates; similarly, two estimates for the other label. For the Gaussian feature, we need a mean and a variance for each label, that is, four estimates. Therefore, total number of parameters = 1 + 2 + 2 + 4 = 9.

Common data for questions 9 and 10

Statement
A binary classification dataset has 1000 data points. A naive Bayes algorithm was run on the same dataset, resulting in the following estimates:
estimate for : 0.3
estimate for :
estimate for :
estimate for :
estimate for :

Question 9

Statement
What is the estimated value of the requested probability? Write your answer correct to two decimal places.

Answer
Range: [0.97, 0.99]

Solution

Question 10

Statement
What will be the predicted label for the given data point?

Answer
No range is required.

Solution
Since the posterior of one label exceeds that of the other, the point will be predicted with that label.

Graded

This document has questions.

Question-1

Statement
We have a dataset of points for a classification problem using the k-NN algorithm. Now consider the following statements:
S1: If , it is enough if we store any points in the training dataset.
S2: If , we need to store the entire dataset.
S3: The number of data points that we have to store increases as the value of k increases.
S4: The number of data points that we have to store is independent of the value of k.

Options
(a) S1 and S3 are true statements
(b) S2 and S4 are true statements
(c) S1 alone is a true statement
(d) S3 alone is a true statement
(e) S4 alone is a true statement

Answer
(b)

Solution
The entire training dataset has to be stored in memory, whatever the value of k. For predicting the label of a test point, we perform the following steps (sketched in code after this list):
1. Compute the distance of the test point from each training point.
2. Sort the training data points in ascending order of distance.
3. Choose the first k points in this sequence.
4. Return the label that garners the maximum vote among these k neighbors.
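The four steps above can be made concrete in code. A minimal k-NN sketch in NumPy; the training data and k = 3 below are illustrative assumptions, not values from the assignment:

import numpy as np

def knn_predict(X_train, y_train, x_test, k=3):
    # Step 1: distance of the test point from each training point.
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # Step 2: sort training points in ascending order of distance.
    order = np.argsort(dists)
    # Step 3: take the first k points in this sequence.
    top_k = y_train[order[:k]]
    # Step 4: return the majority label among these k neighbors.
    return np.bincount(top_k).argmax()

# Illustrative data only.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
y_train = np.array([0, 0, 0, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))  # -> 0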
Question-2

Statement
The blue and the red points belong to two different classes. Both of them are part of the training dataset. The black point at the first marked location also belongs to the training dataset, but its true color is hidden from our view. The black point at the second marked location is a test point. How should we recolor the black training point if the test point is classified as "red" without any uncertainty by a k-NN classifier with k = 4? Use the Euclidean distance metric for computing distances.

Options
(a) blue
(b) red
(c) Insufficient information

Answer
(b)

Solution
Since we are looking at the k-NN algorithm with k = 4, we need to look at the four nearest neighbors of the test point. The four points from the training dataset that are closest to the test point are: one black, one blue and two red. Each of them is at unit distance from the test point. From the problem statement, it is given that the test point is classified as "red" without any uncertainty.
Let us now consider two scenarios that concern the black training point.
Black training point is colored red: there are three red neighbors and one blue neighbor. Therefore, the test point will be classified as red, and there is no uncertainty in the classification. This is what we want. However, for the sake of completeness, let us look at the alternative possibility.
Black training point is colored blue: there will be exactly two neighbors that are blue and two that are red. In such a scenario, we cannot classify the test point without uncertainty; we could call it either red or blue.
This is one of the reasons why we choose an odd value of k for the k-NN algorithm: if k is odd, this kind of tie between the two classes is avoided.

Question-3

Statement
Consider the following feature vectors:
x_1 = [1, 2, 1, -1]
x_2 = [5, -3, -5, 10]
x_3 = [3, 1, 2, 4]
x_4 = [0, 1, 1, 0]
x_5 = [10, 7, -3, 2]
The labels of these points are given. If we use a k-NN algorithm with k = 3, what would be the predicted label for the test point x_test = [1, 1, 1, 1]?

Answer
1

Solution
The squared distances from x_test to x_1 through x_5 are 5, 149, 14, 2 and 134, respectively, so the three nearest neighbors are x_4, x_1 and x_3. Among these three, two have label 1 and one has label 0. Hence the predicted label is 1.
For those interested in a code for the same:

import numpy as np

x_1 = np.array([1, 2, 1, -1])
x_2 = np.array([5, -3, -5, 10])
x_3 = np.array([3, 1, 2, 4])
x_4 = np.array([0, 1, 1, 0])
x_5 = np.array([10, 7, -3, 2])

x_test = np.array([1, 1, 1, 1])

# Print the squared Euclidean distance from the test point to each point.
for x in [x_1, x_2, x_3, x_4, x_5]:
    print(round(np.linalg.norm(x_test - x) ** 2))

Comprehension Type (4 to 6)

Statement
Consider the following split at some node in a decision tree. The following is the distribution of data points and their labels:

Node    Num of points    Label
Q1      100              0
Q1      100              1
L1      50               0
L1      30               1
L2      50               0
L2      70               1

For example, L1 has 80 points, of which 50 belong to class 0 and 30 belong to class 1. Use log base 2 for all calculations that involve logarithms.

Question-4

Statement
If the algorithm is terminated at this level, then what are the labels associated with L1 and L2?

Options
(a) L1 : 0
(b) L1 : 1
(c) L2 : 0
(d) L2 : 1

Answer
(a), (d)

Solution
L1 has 80 data points, of which 50 belong to class 0 and 30 belong to class 1. Since the majority of the points belong to class 0, this node will have 0 as the predicted label.
L2 has 120 data points, of which 50 belong to class 0 and 70 belong to class 1. Since the majority of the points belong to class 1, this node will have 1 as the predicted label.

Question-5

Statement
What is the impurity in L1 if we use entropy as a measure of impurity? Report your answer correct to three decimal places.

Answer
Range: [0.94, 0.96]

Solution
If p represents the proportion of the samples that belong to class 1 in a node, then the impurity of this node using entropy as a measure is:

E(p) = -p log2(p) - (1 - p) log2(1 - p)

For L1, p = 30/80 = 3/8. So the impurity for L1 turns out to be E(3/8), which is approximately 0.954.
Code for reference:

import math
imp = lambda p: -p * math.log2(p) - (1 - p) * math.log2(1 - p)
print(imp(3 / 8))
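As an aside that anticipates Question-8 below, the entropy impurity defined above is largest at p = 1/2 and symmetric about it. A quick illustrative check:

import math
imp = lambda p: -p * math.log2(p) - (1 - p) * math.log2(1 - p)
for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(p, round(imp(p), 3))
# The impurity peaks at p = 0.5, where it equals 1.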
Question-6

Statement
What is the information gain for this split? Report your answer correct to three decimal places. Use at least three decimal places in all intermediate computations.

Answer
Range: [0.025, 0.035]

Solution
The information gain because of this split is equal to the decrease in impurity:

IG = E(Q1) - [(l_1 / n) * E(L1) + (l_2 / n) * E(L2)]

Here, l_1 and l_2 denote the cardinality of the leaves, and n is the total number of points before the split at node Q1. For this problem, the variables take on the following values:
n = 200, as there are 200 points in the Q1 node.
l_1 = 80, as there are 80 points in the L1 node.
l_2 = 120, as there are 120 points in the L2 node.
To calculate the entropy of the three nodes, we need the proportion of points that belong to class 1 in each of the three nodes. Let us call them p_0 for node Q1, p_1 for node L1 and p_2 for node L2:

p_0 = 100/200 = 1/2, p_1 = 30/80 = 3/8, p_2 = 70/120 = 7/12

Now we have all the data that we need to compute the three entropies: E(1/2) = 1, E(3/8) is approximately 0.954, and E(7/12) is approximately 0.980. Finally, the information gain is:

IG = 1 - (0.4 * 0.954 + 0.6 * 0.980), which is approximately 0.030

Code for reference:

import math
imp = lambda p: -p * math.log2(p) - (1 - p) * math.log2(1 - p)

p_0 = 1 / 2
p_1 = 3 / 8
p_2 = 7 / 12

n = 200
l_1 = 80
l_2 = 120

ig = imp(p_0) - ((l_1 / n) * imp(p_1) + (l_2 / n) * imp(p_2))
print(ig)

Question-7

Statement
Consider the following decision tree, where Q-i corresponds to a question and the leaves carry the labels. If a test point comes up for prediction, what is the minimum and maximum number of questions that it would have to pass through before being assigned a label?

Options
(a)
(b)
(c)
(d)
(e)

Answer
(b), (e)

Solution
Look at all the paths from the root to the leaves. The shortest path gives the minimum number of questions and the longest path gives the maximum.

Question-8

Statement
p is the proportion of points with label 1 in some node in a decision tree. Which of the following statements are true? [MSQ]

Options
(a) As the value of p increases from 0 to 1, the impurity of the node increases
(b) As the value of p increases from 0 to 1, the impurity of the node decreases
(c) The impurity of the node does not depend on p
(d) p = 1/2 corresponds to the case of maximum impurity

Answer
(d)

Solution
Options (a) and (b) are incorrect, as the impurity increases from p = 0 to p = 1/2 and then decreases from p = 1/2 to p = 1. Option (c) is incorrect for obvious reasons.

Question-9

Statement
Consider a binary classification problem in which all the data points lie in the plane. The red points belong to one class and the green points belong to the other. A linear classifier has been trained on this data; the decision boundary is given by the solid line. This classifier misclassifies four points. Which of the following could be a possible value for the weight vector?

Options
(a)
(b)
(c)
(d)

Answer
(b)

Solution
The weight vector is orthogonal to the decision boundary, so it will lie on the dotted line. This gives us two quadrants in which the vector can lie: the second or the fourth. In other words, we only need to figure out its direction. If it points into the second quadrant, then there will be four misclassifications. If it points into the fourth quadrant, then all but four points will be misclassified.

Question-10

Statement
Which of the following are valid decision regions for a decision tree classifier for data points in the plane? The question in every internal node compares a single feature against a threshold. Both features are positive real numbers.

Options
(a)
(b)
(c)
(d)

Answer
(a), (b), (d)

Solution
A question of this form can only result in one of these two lines:
a horizontal line
a vertical line
It cannot produce a slanted line as shown in option (c). Options (a) and (d) correspond to what are called decision stumps: a single node splitting into two child nodes.

Graded

This document has questions.

Question-1

Statement
Assume that for a certain linear regression problem involving 4 features, the following weight vectors produce an equal amount of mean squared error:

w_1 = [2, 2, 3, 1]
w_2 = [1, 1, 3, 1]
w_3 = [3, 2, 4, 1]
w_4 = [1, 2, 1, 1]

Which of the weight vectors is likely to be chosen by ridge regression?

Options
(a) w_1
(b) w_2
(c) w_3
(d) w_4

Answer
(d)

Solution
Total error = MSE + lambda * ||w||^2.
If the MSE for all the given weights is the same, the weight vector whose squared length is the least will be chosen by ridge regression. Here the squared lengths are 18, 12, 30 and 7, respectively, so w_4 is chosen; see the check below.
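A quick check of the squared lengths (the vectors are from the question; the code is just a reference sketch):

import numpy as np

ws = {
    "w1": np.array([2, 2, 3, 1]),
    "w2": np.array([1, 1, 3, 1]),
    "w3": np.array([3, 2, 4, 1]),
    "w4": np.array([1, 2, 1, 1]),
}
for name, w in ws.items():
    print(name, np.dot(w, w))  # squared length ||w||^2
# w4 has the smallest squared length (7), so ridge regression prefers it.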
Question-2

Statement
Assume that in the constrained version of the ridge regression optimization problem, the following weight vectors are to be considered, along with the mean squared error (MSE) produced by each:

w_1 = [2, 2, 3, 1], MSE = 3
w_2 = [1, 1, 3, 1], MSE = 5
w_3 = [3, 2, 4, 1], MSE = 8
w_4 = [1, 2, 1, 1], MSE = 9

If the value of theta is 13, which of the following weight vectors will be selected as the final weight vector by ridge regression?
Note: theta is as per the lectures; that is, the constraint is ||w||^2 <= theta.

Options
(a) w_1
(b) w_2
(c) w_3
(d) w_4

Answer
(b)

Solution
We need to minimize the MSE subject to ||w||^2 <= 13. Only w_2 (squared length 12) and w_4 (squared length 7) satisfy this constraint. However, the MSE for w_2 is lower than the MSE for w_4. Hence, w_2 will be chosen.

Question-3

Statement
Consider the piecewise function shown in the image. How many sub-gradients are possible at each of the four marked points?

Options
(a)
(b)
(c)
(d)

Answer
(c)

Solution
Two of the points lie on parts of the function that are differentiable. For a differentiable part, only one sub-gradient is possible, which is the gradient itself.
The other two points lie at the intersection of two pieces of the function. The function is not differentiable at these points (the left slope differs from the right slope). Hence there are multiple sub-gradients possible at each of them.

Question-4

Statement
For a data set with 1000 data points and 50 features, 10-fold cross-validation will perform validation of how many models?

Options
(a) 10
(b) 50
(c) 1000
(d) 500

Answer
(a)

Solution
In 10-fold cross-validation, the data is divided into 10 parts. In each of the ten iterations, a model is built using nine of these parts and the remaining part is used for validation. Hence, in total, ten models will be validated.

Question-5

Statement
For a data set with 1000 data points and 50 features, assume that you keep 80% of the data for training and the remaining 20% for validation during k-fold cross-validation. How many models will be validated during cross-validation?

Options
(a) 80
(b) 20
(c) 5
(d) 4

Answer
(c)

Solution
If 20% of the data is used for validation, then one-fifth of the data is used for validation in each fold, which means 5-fold cross-validation is being performed. In each iteration, one model is validated. Hence, a total of 5 models will be validated.

Question-6

Statement
For a data set with 1000 data points and 50 features, how many models will be trained during Leave-One-Out cross-validation?

Options
(a) 1000
(b) 50
(c) 5000
(d) 20

Answer
(a)

Solution
In leave-one-out cross-validation, only one data point is used for validation in each iteration, and the remaining n - 1 data points are used for training. Hence, a total of n = 1000 models will be trained.

Question-7

Statement
The mean squared error of the estimate will be small if

Options
(a) The eigenvalues of XX^T are small.
(b) The eigenvalues of (XX^T)^(-1) are large.
(c) The eigenvalues of XX^T are large.
(d) The eigenvalues of (XX^T)^(-1) are small.

Answer
(c), (d)

Solution
The mean squared error of the estimate is proportional to the trace of (XX^T)^(-1), and the trace of a matrix equals the sum of its eigenvalues. If the eigenvalues of XX^T are large, the eigenvalues of (XX^T)^(-1) will be small. Hence, the trace will be small and, in turn, the MSE will be small.
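The two facts used in the solution above (trace = sum of eigenvalues, and the eigenvalues of an inverse are the reciprocals) are easy to verify numerically. A small illustrative check with an arbitrary symmetric matrix:

import numpy as np

# An arbitrary symmetric positive-definite matrix for illustration.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

print(np.linalg.eigvalsh(A))                 # eigenvalues of A
print(np.linalg.eigvalsh(np.linalg.inv(A)))  # reciprocals of the above
# Trace of the inverse equals the sum of its eigenvalues:
print(np.trace(np.linalg.inv(A)), np.linalg.eigvalsh(np.linalg.inv(A)).sum())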
Question-8

Statement
The eigenvalues of a matrix A are 2, 5 and 1. What will be the eigenvalues of the matrix A^(-1)?

Options
(a) 4, 25, 1
(b) 2, 5, 1
(c) 0.5, 0.2, 1
(d) 0.6, 0.9, 0.1

Answer
(c)

Solution
If the eigenvalues of A are a, b and c, then the eigenvalues of A^(-1) will be 1/a, 1/b and 1/c. Here, these are 1/2 = 0.5, 1/5 = 0.2 and 1/1 = 1.

Graded Assignment

Note:
1. In the following assignment, X denotes the data matrix of shape d x n, where d and n are the number of features and samples, respectively.
2. x_i denotes the i-th sample and y_i denotes the corresponding label.
3. w denotes the weight vector (parameter) in the linear regression model.

Question 1

Statement
An ML engineer comes up with two different models for the same dataset. The performances of these two models on the training dataset and the test dataset are as follows:
Model 1: higher training error, much lower test error
Model 2: lower training error, much higher test error
Which model would you select?

Options
(a) Model 1
(b) Model 2

Answer
(a)

Solution
The test error of Model 1 is very low compared to that of Model 2, even though its training error is higher. We choose Model 1 as it works well on unseen data.

Question 2

Statement
Consider a model for given d-dimensional training data points and corresponding labels that always predicts the average of all the labels. Which of the following error functions will always give zero training error for the above model?

Options
(a)
(b)
(c)
(d)

Answer
(c)

Solution
The sum of squared errors and the sum of absolute errors give zero error only if the predicted values are the same as the actual values for all the examples. But the error function in option (c) sums the errors without squaring them or taking absolute values, and the deviations of the labels from their mean always sum to zero. This error function will therefore give zero error for the above model.

Common data for questions 3 and 4

Consider the following dataset with one feature x:

x     label (y)
-1    5
0     7
1     6

Question 3

Statement
We want to fit a linear regression model of the given form. Assume the given initial weight vector. What will be the weight after one iteration using the gradient descent algorithm, assuming the squared loss function? Assume the given learning rate.

Answer
No range is required.

Solution
At each iteration, the weight is updated by subtracting the learning rate times the gradient of the squared loss at the current weight. Substituting the dataset values into this update rule gives the new weight.

Question 4

Statement
If we stop the algorithm at the weight calculated in Question 3, what will be the prediction for the given data point?

Answer
8
No range is required.

Solution
Evaluating the model at the given point with the weight from Question 3 gives the prediction 8.

Question 5

Statement
Assume that w^t denotes the updated weight after the t-th iteration of stochastic gradient descent. At each step, a random sample of the data points is considered for the weight update. What will be the final weight after T iterations?

Options
(a)
(b)
(c)
(d) any of the above

Answer
(c)

Solution
The final weight is given by the average of all the weights computed across the iterations. That is why option (c) is correct.

Common data for questions 6 and 7

Statement
Kernel regression with a polynomial kernel of degree two is applied on a dataset. Let the weight vector be as given.

Question 6

Statement
Which data point plays the most important role in predicting the outcome for an unseen data point? Write the data point index as per the matrix X, assuming indices start from 1.

Answer
3
No range is required.

Solution
Since w is written as a linear combination of the (transformed) data points, the data point associated with the highest weight (coefficient) has the most importance. The third data point is associated with the highest coefficient; therefore, the third data point has the highest importance.
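For reference, in kernel regression the prediction for a test point is the coefficient-weighted sum of kernel evaluations against the stored training points. A minimal sketch, assuming the (1 + u.v)^2 form of the degree-2 polynomial kernel; the training points and coefficients here are invented for illustration, not the assignment's values:

import numpy as np

# Illustrative training points (rows) and dual coefficients (alpha).
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
alpha = np.array([0.2, -0.1, 0.5])

def poly_kernel(u, v, degree=2):
    # Polynomial kernel: k(u, v) = (1 + u.v)^degree
    return (1.0 + np.dot(u, v)) ** degree

def predict(x):
    # Prediction: sum_i alpha_i * k(x_i, x)
    return sum(a * poly_kernel(xi, x) for a, xi in zip(alpha, X))

print(predict(np.array([0.5, 0.5])))
# The point with the largest coefficient influences the prediction the most.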
Question 7

Statement
What will be the prediction for the given data point?

Answer
6.5
No range is required.

Solution
The polynomial kernel of degree 2 evaluates the similarity between each stored training point and the test point, and the prediction is the coefficient-weighted sum of these kernel values (as in the sketch above). Substituting the given coefficients and kernel values, the prediction works out to 6.5.

Question 8

Statement
Let w be the solution to the optimization problem of the linear regression model. Which of the following expressions is always correct?

Options
(a)
(b)
(c)
(d)

Answer
(b)

Solution
We know that the prediction vector is the projection of the label vector onto the subspace spanned by the features; that is, the residual will be orthogonal to that subspace. For details, check lecture 5.4.

Question 9

Statement
Gradient descent with a constant learning rate eta for a convex function starts oscillating around the local minimum. What should be the ideal response in this case?

Options
(a) Increase the value of eta
(b) Decrease the value of eta

Answer
(b)

Solution
One possible reason for the oscillation is that the weight vector jumps over the local minimum due to too large a step size. If we decrease the value of eta, the weight vector may no longer jump over the local minimum, and gradient descent will converge to it.

Question 10

Statement
Is the following statement true or false?
The error in the linear regression model is assumed to have constant variance.

Options
(a) True
(b) False

Answer
(a)

Solution
We make the assumption in the regression model that the error follows a Gaussian distribution with zero mean and a constant variance.

Practice assignment

Question 1

Statement
Consider a 3-class classification dataset with the given labels, where the data points belong to the given feature space. If we apply the generative model-based algorithm on this dataset, how many parameters need to be estimated? Assume that the features given the label are not independent.

Answer
80

Solution
We need to estimate the parameters of the label distribution. Since there are three labels, we need to estimate two parameters for the distribution of the label. For the distribution of the features given a label, there are 27 possible data points, and we need to estimate the probability of 26 such points, as the probabilities sum to one. Similarly, for the other two labels, we need 26 parameters each. Therefore, total parameters to estimate = 2 + 3 * 26 = 80.

Question 2

Statement
In Question 1, if the features are conditionally independent given the labels, how many parameters need to be estimated?

Answer

Solution
We again need two parameters for the distribution of the label. For a given label, we need to estimate the conditional distribution of each feature separately; similarly for the other two labels. The total is the two label parameters plus the per-feature parameters for each of the three labels.

Common data for questions 3, 4, and 5

Statement
Consider a naive Bayes model trained on the given data matrix of shape d x n and the corresponding label vector y. Estimates are defined for the label prior and for the conditional probability that the j-th feature takes a given value for a given label. These parameters are estimated using MLE.

Question 3

Statement
If a test point has the given label, what will be the probability that the point equals the given feature vector?

Options
(a)
(b)
(c)
(d)

Answer
(d)

Solution
By conditional independence, the required probability is the product of the per-feature conditional probabilities, taking each estimate (or its complement, when the feature value is 0) for the given label.

Question 4

Statement
What is the value of the requested estimate?

Answer
1

Solution
The estimate is a ratio of indicator counts: the number of examples with the given label whose feature takes the given value, divided by the number of examples with that label. Here, the first two examples belong to the label in question and the first feature value for both examples is the value in question; therefore, the estimate is 2/2 = 1.

Question 5

Statement
What will be the probability that the given test data point is labeled as indicated? Assume no smoothing of the data is done.

Answer

Solution
Multiply the label prior by the per-feature conditional estimates for the test point's feature values; normalizing over the two labels gives the required probability.
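The MLE estimates in Questions 3 to 5 are just counts. A sketch of the indicator-ratio computation on a made-up binary data matrix (illustrative values, not the assignment's matrix):

import numpy as np

# Rows are samples, columns are binary features; labels are binary.
X = np.array([[1, 0],
              [1, 1],
              [0, 1],
              [0, 0]])
y = np.array([1, 1, 0, 0])

# Estimate P(feature 1 = 1 | label = 1) as a ratio of indicator counts:
# count(x_1 = 1 and y = 1) / count(y = 1)
num = np.sum((X[:, 0] == 1) & (y == 1))
den = np.sum(y == 1)
print(num / den)  # -> 1.0 here, mirroring the answer to Question 4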
Question 6

Statement
Consider a spam classification problem that was modeled using naive Bayes. The features take a value of 1 or 0 depending on whether a word is present in the email or not. Assume that the probability of a mail being spam is 0.2. The following table gives the estimates of the conditional probabilities of some words given the label:

word        label        probability
Hurray!     spam         0.7
win         spam         0.2
exciting    spam         0.01
prizes      spam         0.3
Hurray!     non-spam     0.01
win         non-spam     0.02
exciting    non-spam     0.01
prizes      non-spam     0.1

Consider a mail with the following sentence: "Hurray! win exciting prizes"
With what probability will the mail be predicted spam? Assume that these are the only possible words (that is, there are four features) in a mail. Write your answer correct to two decimal places.

Answer
Range:

Solution
Here, P(spam) = 0.2 and P(non-spam) = 0.8. Denote spam as label 1 and non-spam as label 0. All four words are present, so

P(spam | mail) = (0.2 * 0.7 * 0.2 * 0.01 * 0.3) / (0.2 * 0.7 * 0.2 * 0.01 * 0.3 + 0.8 * 0.01 * 0.02 * 0.01 * 0.1)
= 0.000084 / (0.000084 + 0.00000016), which is approximately 0.998 and rounds to 1.00.

Question 7

Statement
A binary classification dataset contains only one feature, and the data points given each label follow Gaussian distributions whose means and variances are already estimated as given. What will be the decision boundary learned using the naive Bayes algorithm? Assume the given estimate of the label prior.
Hint: Solve for the feature value at which the two posteriors are equal.

Options
(a)
(b)
(c)
(d)

Answer
(a)

Solution
The decision boundary is obtained by equating the posterior (prior times Gaussian likelihood) of the two labels and solving for the feature value.

Common data for questions 8, 9 and 10

Statement
Consider the Gaussian naive Bayes algorithm run on the given dataset with two features (feature 1, feature 2) and a binary label.

Question 8

Statement
What will be the value of the requested estimate?

Answer
Range: [0.74, 0.76]

Solution

Question 9

Statement
What will be the value of the requested estimate?

Options
(a)
(b)
(c)
(d)

Answer
(c)

Solution
The estimate is a ratio of indicator-function counts over the data points.

Question 10

Statement
What will be the value of the requested estimate?

Options
(a)
(b)
(c)
(d)

Answer
(d)

Solution
As in the previous question, the estimate is a ratio of indicator-function counts over the data points.

Practice

This document has questions.

Question-1

Statement
You are given a training dataset of n data points for a classification task. You wish to use a k-NN classifier. d is a function that returns the Euclidean distance between two points u and v; calling d on a pair of points corresponds to a single distance computation. If you want to predict the label of a new test point, how many distances would you have to compute?

Answer
n

Solution
We would have to compute n distances: one call d(x_test, x_i) for each training point x_i, for i = 1, ..., n.

Common data for questions (2) to (3)

Statement
Consider the following data points in a binary classification problem. w is the weight vector corresponding to a linear classifier. The labels are the two classes of the problem.

Question-2

Statement
What is the predicted label for these five points?

Options
(a) All five points are predicted as one label.
(b) All five points are predicted as the other label.
(c) Some of the points are predicted as one label and the rest as the other.
(d) A different subset of the points is predicted as one label and the rest as the other.

Answer
(c)

Solution
For a linear classifier with weight vector w, the prediction for a test point is given by the sign of the dot product between w and the point. The geometric interpretation of this is as follows: a linear classifier divides the space into two half-spaces, and the decision is made just by looking at the sign of the dot product.
Half-space 1: the dot product is positive. This corresponds to those cases in which the data point makes an acute angle with the weight vector.
Half-space 2: the dot product is non-positive. This corresponds to those cases in which the data point makes either a right angle or an obtuse angle with the weight vector.
When a point lies on the decision boundary (at a right angle to the weight vector), it could be classified either way; as per convention, we assign it to the second half-space. A sketch of this rule follows Question-3 below.

Question-3

Statement
Which of the following statements are true?

Options
(a)
(b)
(c)
(d)
(e)

Answer
(a), (c), (e)

Solution
The farther away a data point is from the decision boundary, the larger the magnitude of its projection onto the weight vector. Since three of these points lie in the positive half-plane, their dot products with w are positive and we can remove the modulus sign. As for the point lying on the boundary, its projection onto w is zero, as it is orthogonal to w. Finally, the dot product for the remaining point is negative, as it lies in the negative half-plane.
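A minimal sketch of the sign-based prediction rule described in Question-2 (the weight vector and the test points are illustrative):

import numpy as np

w = np.array([1.0, 1.0])  # illustrative weight vector

def predict(x):
    # Positive dot product: acute angle with w, positive half-space.
    # Non-positive dot product: right or obtuse angle; boundary points
    # are assigned to this class by convention.
    return 1 if np.dot(w, x) > 0 else 0

for x in [np.array([2.0, 1.0]), np.array([-1.0, 1.0]), np.array([-2.0, -1.0])]:
    print(x, np.dot(w, x), predict(x))  # point, signed projection (times ||w||), label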
Comprehension Type (4 - 5)

Statement
Consider the following decision tree for a binary classification problem that has two features.

Question-4

Statement
Which of the following statements are true?

Options
(a) is classified as
(b) is classified as
(c) is classified as
(d) is classified as

Answer
(b), (d)

Solution
Start at the root of the decision tree and keep traversing the internal nodes until you end up at a prediction/leaf node.

Question-5

Statement
This decision tree partitions the feature space into four regions, as shown below. Assume that the stated bounds hold for all points. What can you say about the predicted labels of points that fall in the four regions?

Options
(a)
(b)
(c)
(d)

Answer
(b)

Solution
The tree partitions the space into four regions, and these four regions correspond to the four leaves from left to right.

Question-6

Statement
Consider the following training dataset for a binary classification task:

x     y
2     1
10    0
8     0
-4    1
0     1
9     0

The following decision tree cleanly separates the two classes, such that the resulting leaves are pure. Q-1 asks whether x is below an integer threshold. How many possible integer values can the threshold take?

Answer
6

Solution
The threshold can take any one of the values from the set {3, 4, 5, 6, 7}. It can also take the value 8, as the data point with x = 8 will then go to the right branch. That gives six possible integer values. (A brute-force check is sketched at the end of this document.)

Question-7

Statement
p is the proportion of points with label 1 in some leaf in a decision tree. For what values of p will this node be assigned the label 1? Select the most appropriate option.

Options
(a)
(b)
(c)

Answer
(b)

Solution
In a vote, we need more points to belong to class 1 than to class 0. Therefore, p > 1/2.

Question-8

Statement
Q is some internal node (question node) of a decision tree that splits into two leaf nodes (prediction nodes) L1 and L2. The proportion of data points belonging to class 1 in each of the three nodes is the same and is equal to p. What is the information gain for the question corresponding to node Q?

Answer
0

Solution
Let p be the proportion of ones in each node. Then all three nodes have the same entropy, as entropy depends only on p; call it E. If g is the ratio into which the original dataset is partitioned by this question, then:

IG = E - (g * E + (1 - g) * E) = 0

We gain nothing because of this split. Intuitively this makes sense, because in both leaf nodes the status quo prevails. To put it another way, neither of the child nodes has become purer than the parent node; they have the same level of impurity as their parent.

Question-9

Statement
Consider the following decision tree for a classification problem in which all the data points are constrained to lie in the unit square in the first quadrant; that is, both features lie between 0 and 1. If a point is picked at random from the unit square, what is the probability that the decision tree predicts this point as belonging to the indicated class?

Answer
Range: [0.74, 0.76]

Solution
The decision regions divide the unit square so that three-fourths of its area belongs to the indicated class; hence the required probability is 0.75.

Question-10

Statement
Consider the following dataset. By visually inspecting the data points, construct a decision tree.

Solution
The structure of the data seems clear. There are four blobs of red and green points. From left to right, we have blob-1, blob-2, blob-3 and blob-4; they take the colors red, green, red and green, respectively. Assume that the lines parallel to the y-axis separating consecutive blobs are given. Think about why we need only three such lines. One tree that we could construct first asks on which side of the middle separating line the point falls, and then, within each half, asks about the remaining separating line to decide between the two blobs in that half.
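Returning to Question-6 above: the count of valid integer thresholds can be verified by brute force. A reference sketch, assuming the question has the form "x < threshold" on the given dataset:

# Brute-force check of Question-6: which integer thresholds give pure leaves?
xs = [2, 10, 8, -4, 0, 9]
ys = [1, 0, 0, 1, 1, 0]

valid = []
for theta in range(min(xs), max(xs) + 1):
    left = [y for x, y in zip(xs, ys) if x < theta]
    right = [y for x, y in zip(xs, ys) if x >= theta]
    # Both leaves must be non-empty and pure (a single label each).
    if left and right and len(set(left)) == 1 and len(set(right)) == 1:
        valid.append(theta)
print(valid)       # [3, 4, 5, 6, 7, 8]
print(len(valid))  # 6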