Logistic Regression and k-Nearest Neighbor

Questions and Answers

In logistic regression, which of the following p-values would indicate a statistically significant independent variable?

  • 0.20
  • 0.56
  • 0.03 (correct)
  • 0.11

Based on logistic regression results, a negative coefficient for 'Online' necessarily indicates that being online significantly decreases the probability of personal loan acceptance.

False (B)

In k-NN, what is the primary consideration when choosing the value of 'k'?

balancing bias and variance

In k-NN, increasing the value of k generally ______ the model's sensitivity to noise.

decreases

Match the following feature selection techniques with their primary goal:

Filter Methods = Evaluate the relevance of features based on statistical measures, independent of any specific machine learning algorithm.
Wrapper Methods = Use a specific machine learning algorithm to evaluate the performance of different subsets of features.
Embedded Methods = Incorporate feature selection as part of the model training process.

Which of the following is a potential drawback of using wrapper methods for feature selection?

They are computationally expensive. (D)

Feature extraction methods aim to select a subset of the original features, while feature selection methods create new features from the original set.

False (B)

What is the primary goal of Principal Component Analysis (PCA)?

dimensionality reduction

In PCA, the principal components are ______ of each other.

orthogonal

Match the following concepts with their descriptions:

Variance = A measure of the spread or dispersion of a set of data points around their mean value.
Eigenvalue = A scalar value representing the amount of variance captured by a principal component.
Eigenvector = A vector that defines the direction of a principal component in the original feature space.

Which of the following distance metrics is more sensitive to outliers?

Euclidean distance (B)

Manhattan distance is calculated as the straight-line distance between two points in a multi-dimensional space.

False (B)

In clustering, what is the purpose of a distance measure?

quantify similarity

In K-means clustering, the algorithm aims to minimize the ______ of squared distances between data points and their cluster's centroid.

sum

Match the following clustering algorithm characteristics with their corresponding algorithm:

K-Means = Partitions data into k clusters, where each data point belongs to the cluster with the nearest mean (centroid).
Hierarchical Clustering = Builds a hierarchy of clusters, where each data point starts in its own cluster and clusters are successively merged based on a linkage criterion.

What is the primary difference between K-means clustering and hierarchical clustering?

K-means requires pre-defining the number of clusters, while hierarchical clustering does not. (D)

In hierarchical clustering with complete linkage, the distance between two clusters is defined as the minimum distance between any two points in the clusters.

False (B)

In hierarchical clustering, what is a dendrogram?

binary tree

The document term matrix represents the ______ of terms in a collection of documents.

frequency

Match the following text analytics concepts with their definitions:

Term Frequency (TF) = The number of times a term appears in a document.
Document Frequency (DF) = The number of documents in which a term appears.
TF-IDF = Term frequency-inverse document frequency; shows how important a word is to a document in a collection.

What does a high TF-IDF score for a term in a document indicate?

The term is frequent in the document but rare in other documents. (B)

In text analytics, stop words are typically removed to increase the dimensionality of the data.

False (B)

What is the purpose of stemming in text analytics?

reduce words to root form

In text analytics, the process of converting text into a numerical representation is called ______.

vectorization

Which of the following metrics balances precision and recall?

F1-score (A)

A high accuracy score always indicates a well-performing model, even if the classes are imbalanced.

False (B)

What is the purpose of cross-validation?

assess model generalization

The ROC curve plots the true positive rate against the ______ rate.

false positive

Match the evaluation metrics with their definitions:

Precision = The proportion of positive identifications that were actually correct.
Recall = The proportion of actual positives that were identified correctly.
F1-score = The harmonic mean of precision and recall.

Which method is most suitable when combining Arizona and Commonwealth?

Single Linkage (A)

The estimated coefficient of the independent variable 'Family' is 1.729.

False (B)

In logistic regression for personal loan acceptance, what does a star sign on the coefficients table indicate?

statistical significance

In the provided example, the final result of the agglomerative clustering groups {NY} with ______.

{Boston}

Match the distance type with its calculation method:

Euclidean Distance = √((x₂ − x₁)² + (y₂ − y₁)²)
Manhattan Distance = |x₂ − x₁| + |y₂ − y₁|

Which distance measure is generally MORE robust in high dimensional spaces?

Manhattan Distance (B)

With Euclidean distance, high dimensions are more informative.

False (B)

Using the provided example of K-Means Clustering, if we specify K = 2 and assign the initial records as nodes 1 and 5, then what will the new center point of cluster 1 be after the classification? (enter the coordinate)

(2.5, 2.5)

Given the two nodes (2,3) and (3,2), find the Euclidean distance between them. ______

1.414

Match what each part of text mining is:

TF (term frequency) = number of times a term appears in a document
DF (document frequency) = number of documents the term appears in
TF-IDF = combines term frequency and inverse document frequency

When is hierarchical clustering typically used?

Limited dimensionality (A)

Flashcards

What is Logistic Regression?

Predicts the probability of a binary outcome.

What are significant independent variables?

Variables with at least one star sign in the coefficients table.

What is the estimated coefficient of the 'Family' variable?

0.5476

Highest magnitude coefficient?

EducationGraduate

What are the 'Family' variable odds?

e^0.5476 = 1.729

What do odds mean?

Ratio of (Y=1; loan acceptance) to (Y=0; loan rejection).

Age increases loan probability? True/False

False

Income decreases loan probability? True/False

False

Experience insignificantly decreases loan probability? True/False

True

k-NN; error rate at k=1?

75%

k-NN; error rate at k=3?

25%

Features to select after decision tree analysis?

income, overage, underage, phone

PCA: # components for 99% variance?

4

PCA: Dominantly explained variable in PC1?

sodium

PCA: What to do if sodium rotation is too high?

Standardization first

Euclidean distance?

Distance between John and Emily is '√21'.

Manhattan distance?

Distance between John and Emily is '7'.

When is Manhattan distance more robust?

High-dimensional spaces

Nodes in cluster 1? (k-means, Euclidean)

Nodes 1 and 2

Nodes in cluster 2? (k-means, Euclidean)

Nodes 3, 4, and 5

Center point of cluster 1?

(2.5, 2.5)

Center point of cluster 2?

(26/3, 5/3)

Nodes in cluster 1? (Hierarchical, single linkage, Euclidean)

Nodes 1, 2, and 3

Nodes in cluster 2? (Hierarchical, single linkage, Euclidean)

Nodes 4 and 5

Node 3: within cluster distance?

10/2 = 5

Node 3: closest neighbor distance?

3

good/bad cluster?

bad cluster

Term frequency (tf) value of 'Cent' in doc 2?

0

Document frequency value of 'Cent'?

1

tf-idf value for 'Cent' in doc 1?

3 * log(5)

Study Notes

  • The final exam is a one-hour, closed-book, offline paper exam.

Other Common Classifiers

  • Logistic Regression and k-Nearest Neighbor (kNN) are common classifiers.
  • A binary indicator of the customer's acceptance of a personal loan is the focal dependent variable.
  • In logistic regression, variables with at least one star sign are considered significant.
  • The estimated coefficient of the independent variable ‘Family’ is 0.5476.
  • EducationGraduate is the significant independent variable with the highest absolute magnitude of the estimated coefficient.
  • The odds of the independent variable of ‘Family’ are 1.729 (e^0.5476).
  • Odds refer to the ratio of the probability of personal loan acceptance (Y=1) to the probability of personal loan rejection (Y=0).
  • An increase of one unit of family size multiplies the odds of acceptance of a personal loan by e^0.5476 (1.729).
  • An increase of one unit of family size is associated with an increase of 72.9% in the odds of acceptance of a personal loan.
  • Age does not significantly increase the probability of personal loan acceptance.
  • Income does not significantly decrease the probability of personal loan acceptance.
  • Experience insignificantly decreases the probability of personal loan acceptance.
  • In k-Nearest Neighbor algorithms, the error rate is 75% when k is specified as 1.
  • For k-Nearest Neighbor algorithms, the error rate is 25% when k is specified as 3.
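
The odds-ratio arithmetic above can be checked in a few lines. This is a minimal sketch: the coefficient 0.5476 is the 'Family' estimate from the lesson's regression output, and exponentiating it gives the odds ratio.

```python
import math

# 'Family' coefficient from the lesson's logistic regression output.
coef_family = 0.5476

# Exponentiating a logistic-regression coefficient gives the multiplicative
# change in the odds of acceptance per one-unit increase in the predictor.
odds_ratio = math.exp(coef_family)
print(round(odds_ratio, 3))      # 1.729
print(round(odds_ratio - 1, 3))  # 0.729, i.e. a 72.9% increase in the odds
```

This is why a one-unit increase in family size multiplies the odds by 1.729, equivalently a 72.9% increase.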

Dimension Reduction

  • Based on the decision-tree outcome, the income, overage, underage, and phone variables should be selected for subsequent business analytics during feature selection.
  • Use Principal Component Analysis (PCA) for feature extraction.
  • Four principal component variables are needed to explain at least 99% of the total variance of raw data information.
  • In PCA results, Sodium is dominantly explained in principal component 1.
  • A high rotation value of sodium in principal component 1 may be due to the unit of measurement affecting PCA, so standardization should be applied first.
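
The standardization point can be illustrated with a small pure-Python sketch. The data here is synthetic (the 1000-scale feature stands in for a variable measured in large units, like sodium in mg), and the helper functions are illustrative, not the lesson's code.

```python
import math
import random

random.seed(0)
f1 = [random.gauss(0, 1) for _ in range(200)]     # small-unit feature
f2 = [random.gauss(0, 1000) for _ in range(200)]  # large-unit feature (e.g. mg)

def standardize(x):
    """Z-score a feature: subtract the mean, divide by the std deviation."""
    n = len(x)
    m = sum(x) / n
    s = math.sqrt(sum((v - m) ** 2 for v in x) / n)
    return [(v - m) / s for v in x]

def pc1_share(x, y):
    """Fraction of total variance on PC1, via the 2x2 covariance eigenvalues."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    a = sum((v - mx) ** 2 for v in x) / n                    # var(x)
    c = sum((v - my) ** 2 for v in y) / n                    # var(y)
    b = sum((u - mx) * (v - my) for u, v in zip(x, y)) / n   # cov(x, y)
    rad = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    return ((a + c) / 2 + rad) / (a + c)  # largest eigenvalue / total variance

print(pc1_share(f1, f2))                            # nearly 1.0: raw units dominate PC1
print(pc1_share(standardize(f1), standardize(f2)))  # close to 0.5 after standardization
```

Without standardization, PC1 is essentially just the large-unit feature; after standardization, the two independent features share the variance roughly equally, which is why a suspiciously high rotation value for sodium suggests standardizing first.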

Introduction to Clustering

  • The Euclidean Distance between John and Emily is √21 when using age, number of credit cards and family size, as John is 38 with 5 credit cards and a family size of 4, while Emily is 34 with 7 credit cards and a family size of 3.
  • The Manhattan Distance between John and Emily is 7.
  • Manhattan Distance is more robust in high-dimensional spaces.
  • In high dimensions, Euclidean distances become less informative because all points tend to be similarly distant from each other.
  • Manhattan distance grows linearly with the number of dimensions, whereas Euclidean grows with the square root, making Manhattan more discriminative.
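
The two distance calculations for John and Emily can be verified directly, using the feature values given above (age, number of credit cards, family size).

```python
import math

# John: age 38, 5 credit cards, family size 4.
# Emily: age 34, 7 credit cards, family size 3.
john = (38, 5, 4)
emily = (34, 7, 3)

# Euclidean: square root of the sum of squared coordinate differences.
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(john, emily)))
# Manhattan: sum of absolute coordinate differences.
manhattan = sum(abs(a - b) for a, b in zip(john, emily))

print(euclidean)  # √21 ≈ 4.583
print(manhattan)  # 7
```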

Clustering

  • For K-means clustering with k = 2, suppose the initial records are assigned as nodes 1 and 5.
  • Nodes 1 and 2 will then be classified into cluster 1 after calculating distances with the Euclidean distance measure.
  • Nodes 3, 4, and 5 will be classified into cluster 2.
  • The two-dimensional center point of cluster 1 is (2.5, 2.5) using the Euclidean distance measure.
  • The two-dimensional center point of cluster 2 is (26/3, 5/3) using the Euclidean distance measure.
  • If the initial records are instead assigned as nodes 3 and 4, then nodes 1, 2, and 3 will be classified into cluster 1 after calculating distances with the Euclidean distance measure.
  • Nodes 4 and 5 will be classified into cluster 2.
  • The two-dimensional center point of cluster 1 is (12/3, 7/3) using the Euclidean distance measure.
  • The two-dimensional center point of cluster 2 is (19/2, 3/2) using the Euclidean distance measure.
  • For node 3, the within-cluster distance is 5 using the Manhattan distance measure.
  • For node 3, the closest neighbor distance is 3 using the Manhattan distance measure.
  • Considering the within-cluster distance and the closest-neighbor distance, node 3 in cluster 1 forms a bad cluster.
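
The assign-then-update loop of K-means can be sketched as below. The node coordinates here are illustrative, not the lesson's actual data; the structure (initialize two centroids, assign each point to the nearest one, recompute each centroid as the mean of its members, repeat) is the same.

```python
import math

# Illustrative 2-D nodes (not the lesson's data), k = 2, initialized at nodes 1 and 5.
points = [(1.0, 1.0), (2.0, 1.5), (8.0, 1.0), (9.0, 2.0), (9.0, 3.0)]
centroids = [points[0], points[4]]

def dist(p, q):
    """Euclidean distance between two 2-D points."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

for _ in range(10):  # a few passes suffice for this tiny example
    # Assignment step: each point joins the cluster of its nearest centroid.
    labels = [min(range(2), key=lambda c: dist(p, centroids[c])) for p in points]
    # Update step: each centroid becomes the mean of its assigned points.
    for c in range(2):
        members = [p for p, lab in zip(points, labels) if lab == c]
        centroids[c] = (sum(p[0] for p in members) / len(members),
                        sum(p[1] for p in members) / len(members))

print(labels)     # [0, 0, 1, 1, 1]
print(centroids)  # cluster means after convergence
```

The update step is what produces center points like (2.5, 2.5) in the lesson's example: each coordinate of a centroid is the average of that coordinate over the cluster's members.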

Hierarchical Clustering

  • Hierarchical clustering can be conducted based on nodes and two-dimension locational information using Euclidean distance measure and single linkage.
  • Draw a dendrogram with hierarchical clustering results.
  • Hierarchical clustering can be conducted based on nodes and two-dimension locational information using Manhattan distance measure and complete linkage.
  • An example applies agglomerative clustering to public utilities.
  • Five utilities and two measures (Sales and Fuel Cost) are used.
  • Join Arizona and Commonwealth, then recalculate the distance of four clusters: {Arizona, Commonwealth}, {Boston}, {Central}, {NY}.
  • Recalculate the distance of the four clusters using single linkage.
  • Consolidate {Central} with {Arizona, Commonwealth} clusters.
  • Recompute the distance matrix of the resulting three clusters: {Arizona, Commonwealth, Central}, {Boston}, {NY}.
  • Consolidate {NY} with {Boston}, resulting in two clusters: {Arizona, Commonwealth, Central}, {NY, Boston}.
  • Consolidate these two clusters and end the algorithm.

Text Analytics

  • The document frequency (df) value for the word 'Cent' is 1.
  • The term frequency-inverse document frequency (tf-idf) value for the word 'Cent' in document 1 is 3 × log(5).
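
The tf-idf arithmetic for 'Cent' follows the standard formula tf × log(N/df). A quick check, assuming (as the 3 × log(5) value implies) tf = 3 in document 1, N = 5 documents, and df = 1; natural log is used here, though some texts use base 10 or 2.

```python
import math

tf = 3       # 'Cent' appears 3 times in document 1
n_docs = 5   # assumed collection size, implied by the lesson's 3 * log(5)
df = 1       # 'Cent' appears in only 1 document

# tf-idf weights a term highly when it is frequent in this document
# but rare across the collection.
tf_idf = tf * math.log(n_docs / df)  # natural log assumed
print(tf_idf)  # 3 * log(5) ≈ 4.828
```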
