Machine Learning Challenges in AIMLB

ArdentVector

21 Questions

What is the name of the induction algorithm used to construct decision trees?

TDIDT

What does entropy measure in data?

Degree of randomness

Lower entropy implies lower predictability in data.

False

What is the purpose of Gain Ratio in attribute selection?

Normalize information gain

Pruning reduces the size of a decision tree by removing branches that provide little predictive power. Pruning is a ___ technique.

regularization

What are the two main methods of pruning in decision trees?

Pre-pruning and Post-pruning

How should missing values in datasets be handled during training time according to the content?

Set them to the most common value or the most probable value given the label

Which methods can be used for estimating a classifier's accuracy?

All of the above

Decision trees produce __ decision boundaries.

non-linear

Classifier precision and recall have an inverse relationship.

True

What is the Bayes Error in machine learning?

The lower limit of the error that can be achieved with any classifier.

What is Bias error in machine learning?

The systematic error of the model which measures how far the predicted value is from the true value.

Explain Variance error in machine learning.

Variance error is caused by sensitivity to small fluctuations in the training data set, so that predictions for the same input are dispersed around the target value when the model is trained on different training sets.

When does overfitting occur in machine learning?

When the model captures noise and outliers in the data along with the underlying pattern.

What characterizes underfitting in machine learning?

Inability to capture the underlying pattern of the data.

What is the main consideration in model selection in machine learning?

Suitability for the type of data, model accuracy, bias/variance balance, and ability to capture patterns without overfitting.

What is ML primarily about in terms of model training and validation?

Training, validation, and testing the model.

Machine Learning is all about training, validation, and testing ________ model.

the

What do credit risk models help banks predict?

Likelihood of default on a loan

What is the goal of classification in machine learning?

Determine the target attribute values of new examples.

How is a Decision Tree represented?

Through rules that can be understood by humans and used in knowledge systems.

Study Notes


Machine Learning Challenges

  • Prediction error has two reducible components, bias and variance, on top of the irreducible Bayes error.
  • Overfitting occurs when a model captures the noise and outliers in the data along with the underlying pattern, resulting in high variance and low bias.
  • Underfitting occurs when a model is unable to capture the underlying pattern of the data, resulting in low variance and high bias.
  • The bias-variance tradeoff is a critical challenge in machine learning: reducing one typically increases the other.
  • Model selection and tuning are crucial to achieve the right balance between bias and variance.

Credit Risk Assessment

  • Credit risk assessment is a critical application of machine learning in finance.
  • The goal is to determine whether a loan applicant is likely to default on a loan.
  • Factors considered in credit risk assessment include:
    • Credit history
    • Income
    • Loan terms
    • Personal information
  • Machine learning algorithms can be used to develop credit risk models that predict the likelihood of default.

Classification

  • Classification is a type of supervised learning where the target variable is categorical.
  • The goal of classification is to predict the class label of a new instance based on the attributes of the instance.
  • Examples of classification tasks include:
    • Spam vs. non-spam emails
    • Tumor cells as benign or malignant
    • Credit card transactions as legitimate or fraudulent
    • Sentiment analysis
  • Decision trees are a popular classification algorithm.

Decision Trees

  • Decision trees represent rules that can be understood by humans.
  • Decision trees are useful for knowledge representation and can be used in databases.
  • The goal of decision tree induction is to learn a model that maps each attribute set to one of the predefined class labels.
  • The process of decision tree induction involves:
    • Selecting the most informative attribute
    • Partitioning the data according to the attribute's values
    • Recursively constructing subtrees for the subsets of the data
  • Conditions for stopping partitioning include:
    • All samples for a given node belong to the same class
    • There are no remaining attributes for further partitioning
    • There are no samples left
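
The induction steps and stopping conditions above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes categorical attributes stored as dictionaries, uses entropy to pick the most informative attribute, and the names build_tree, partition, and classify are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())


def partition(rows, labels, attr):
    """Group rows and labels by the value each row takes on `attr`."""
    groups = {}
    for row, label in zip(rows, labels):
        sub_rows, sub_labels = groups.setdefault(row[attr], ([], []))
        sub_rows.append(row)
        sub_labels.append(label)
    return groups


def build_tree(rows, labels, attrs):
    """Return a class label (leaf) or a dict describing an internal node."""
    majority = Counter(labels).most_common(1)[0][0]
    # Stop: all samples belong to one class, or no attributes remain.
    if len(set(labels)) == 1 or not attrs:
        return majority

    def expected_entropy(attr):
        groups = partition(rows, labels, attr)
        return sum(len(ls) / len(labels) * entropy(ls) for _, ls in groups.values())

    # Lowest expected entropy after the split means highest information gain.
    best = min(attrs, key=expected_entropy)
    remaining = [a for a in attrs if a != best]
    children = {
        value: build_tree(sub_rows, sub_labels, remaining)
        for value, (sub_rows, sub_labels) in partition(rows, labels, best).items()
    }
    # "No samples left": values never seen for this node fall back to the majority class.
    return {"attr": best, "children": children, "default": majority}


def classify(tree, row):
    """Walk from the root to a leaf and return the class label found there."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attr"]], tree["default"])
    return tree


# Toy data (made up for illustration): "outlook" separates the classes, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]
tree = build_tree(rows, labels, ["windy", "outlook"])
```

The resulting nested dictionary reads like a set of if-then rules, which is the human-readable representation the notes refer to.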

Entropy and Information Gain

  • Entropy measures the degree of randomness in data.
  • Information gain is the expected reduction in entropy due to splitting on values of an attribute.
  • The best attribute is the one with the highest information gain.
  • Information gain is used to select the most informative attribute in decision tree induction.
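
As a small worked example of these definitions, the sketch below computes entropy in bits and the information gain of two candidate attributes. The data and the function names are made up for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of `labels` minus the weighted entropy after splitting on `values`."""
    total = len(labels)
    subsets = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


labels = ["yes", "yes", "no", "no"]
perfect = ["a", "a", "b", "b"]  # separates the classes completely
useless = ["a", "b", "a", "b"]  # each subset is still a 50/50 mix

print(entropy(labels))                    # 1.0
print(information_gain(perfect, labels))  # 1.0
print(information_gain(useless, labels))  # 0.0
```

The attribute that removes all uncertainty gains the full 1 bit, while the attribute that leaves each subset evenly mixed gains nothing.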

Best Attribute Selection

  • Best attribute selection is critical in decision tree induction.
  • Information gain is used to select the best attribute.
  • Gain ratio is used to overcome the limitation of information gain, which is biased towards multivalued attributes.
  • Gini impurity is an alternative to entropy for selecting attributes.
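
To illustrate the bias toward multivalued attributes, the sketch below compares an ID-like attribute (one distinct value per example) with a two-valued attribute. It assumes the common definition of gain ratio as information gain divided by the entropy of the attribute's own values (the split information), and it also shows Gini impurity as a drop-in alternative measure. All names and data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of a list of labels or attribute values."""
    total = len(items)
    return sum((n / total) * log2(total / n) for n in Counter(items).values())


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def impurity_decrease(values, labels, impurity):
    """Impurity before the split minus the weighted impurity after it."""
    total = len(labels)
    subsets = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)
    after = sum(len(s) / total * impurity(s) for s in subsets.values())
    return impurity(labels) - after


def gain_ratio(values, labels):
    """Information gain normalized by the split information of the attribute."""
    split_info = entropy(values)
    if split_info == 0:  # attribute has a single value, so it cannot split anything
        return 0.0
    return impurity_decrease(values, labels, entropy) / split_info


labels = ["yes", "yes", "no", "no"]
record_id = ["1", "2", "3", "4"]  # unique per example, useless for new data
flag = ["a", "a", "b", "b"]       # two values that match the classes

# Both attributes have the same information gain ...
gain_id = impurity_decrease(record_id, labels, entropy)
gain_flag = impurity_decrease(flag, labels, entropy)
# ... but gain ratio penalizes the many-valued one.
ratio_id = gain_ratio(record_id, labels)
ratio_flag = gain_ratio(flag, labels)
# The same split scored with Gini impurity instead of entropy.
gini_drop_flag = impurity_decrease(flag, labels, gini)
```

Here both attributes reach an information gain of 1 bit, but the ID-like attribute has a split information of 2 bits, so its gain ratio is halved.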

Gini Impurity

  • Gini impurity measures how often a randomly chosen example would be incorrectly labeled if it were labeled at random according to the class distribution of the node.
  • Gini impurity is used to select the best attribute in decision tree induction.
  • The best attribute is the one with the highest impurity decrease.

Decision Trees and Metrics

  • Pruning: a technique to reduce the size of a decision tree by removing branches that provide little predictive power, reducing overfitting.
  • Types of Pruning:
    • Pre-pruning: stops the tree building algorithm before it fully classifies the data.
    • Post-pruning: builds the complete tree, then replaces some non-leaf nodes with leaf nodes if it improves validation error.

Computing Information-Gain for Continuous-Valued Attributes

  • Sorting: sort the values of the continuous attribute in increasing order.
  • Midpoint: consider the midpoint between each pair of adjacent values as a possible split point.
  • Split: select the candidate with the minimum expected information requirement for the attribute (equivalently, the maximum information gain) as the split point.
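
The three steps above can be sketched as follows. This is an illustrative version for a binary split on a single numeric attribute; best_split_point is a hypothetical name and the ages are made-up data.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())


def best_split_point(values, labels):
    """Return (midpoint, expected information) of the best binary split."""
    pairs = sorted(zip(values, labels))  # step 1: sort by attribute value
    total = len(pairs)
    best_point, best_info = None, float("inf")
    for i in range(1, total):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:  # identical values give no midpoint between them
            continue
        midpoint = (low + high) / 2  # step 2: candidate split point
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        info = len(left) / total * entropy(left) + len(right) / total * entropy(right)
        if info < best_info:  # step 3: keep the minimum expected information
            best_point, best_info = midpoint, info
    return best_point, best_info


ages = [47, 25, 62, 32, 51]
bought = ["yes", "no", "yes", "no", "yes"]
point, info = best_split_point(ages, bought)
print(point, info)  # 39.5 0.0
```

Splitting at 39.5, halfway between 32 and 47, leaves both sides pure, so the expected information requirement drops to zero.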

Handling Missing Values

  • Handling missing values at training time:
    • Set them to the most common value.
    • Set them to the most probable value given the label.
    • Add a new instance for each possible value.
  • Handling missing values at inference time: follow every branch of the missing attribute and base the final prediction on a weighted vote of the leaf nodes reached.
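
The first two training-time strategies can be sketched as below. The sketch assumes missing values are marked with None, which is a convention chosen here for illustration, and the function names are hypothetical.

```python
from collections import Counter


def most_common(values):
    """Most frequent non-missing value in a list."""
    return Counter(v for v in values if v is not None).most_common(1)[0][0]


def impute_most_common(values):
    """Replace each missing value with the overall most common value."""
    fill = most_common(values)
    return [fill if v is None else v for v in values]


def impute_given_label(values, labels):
    """Replace each missing value with the most common value among rows of the same class."""
    by_label = {}
    for value, label in zip(values, labels):
        if value is not None:
            by_label.setdefault(label, []).append(value)
    overall = most_common(values)  # fallback when a class has no observed values
    return [
        (most_common(by_label[label]) if label in by_label else overall)
        if value is None
        else value
        for value, label in zip(values, labels)
    ]


outlook = ["sunny", "sunny", "sunny", "rainy", "rainy", None]
labels = ["play", "play", "play", "stay", "stay", "stay"]

filled_global = impute_most_common(outlook)
filled_by_class = impute_given_label(outlook, labels)
```

In this toy data the missing entry becomes "sunny" under the first strategy, because that is the most common value overall, and "rainy" under the second, because that is the most common value among "stay" examples.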

Decision Boundaries

  • Decision trees produce non-linear decision boundaries.

Model Evaluation and Selection

  • Evaluation metrics:
    • Accuracy: the fraction of test set tuples that are correctly classified.
    • Other metrics: consider precision, recall, F-measure, etc.
  • Methods for estimating a classifier's accuracy:
    • Holdout method.
    • Random subsampling.
    • Cross-validation.
    • Bootstrap.
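
The estimation methods listed above differ mainly in how they divide the examples into training and test sets. The following sketch generates index splits for each one under simple assumptions (a one-third test fraction for holdout, out-of-bag examples as the bootstrap test set). The function names and defaults are hypothetical.

```python
import random


def holdout_split(n, test_fraction=1 / 3, seed=0):
    """Holdout: shuffle once and reserve a fixed fraction for testing."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_splits(n, k, seed=0):
    """Cross-validation: each of the k folds serves as the test set exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def bootstrap_split(n, seed=0):
    """Bootstrap: sample n training indices with replacement; test on the rest."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]
    return train, test


# Random subsampling: repeat the holdout method with different shuffles
# and average the accuracy measured on each test set.
subsamples = [holdout_split(9, seed=s) for s in range(5)]
```

In each case the classifier would be trained on the first list of indices and scored on the second, with the per-split accuracies averaged to give the final estimate.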

Classifier Evaluation Metrics

  • Confusion Matrix:
    • A table used to evaluate the performance of a classifier.
    • Contains true positives, false negatives, false positives, and true negatives.
  • Accuracy: percentage of test set tuples that are correctly classified.
  • Error Rate: 1 - accuracy, or the percentage of misclassified tuples.
  • Sensitivity: true positive recognition rate.
  • Specificity: true negative recognition rate.
  • Precision: exactness - what percentage of tuples that the classifier labeled as positive are actually positive.
  • Recall: completeness - what percentage of positive tuples did the classifier label as positive.
  • F-measure: harmonic mean of precision and recall.
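
As a worked example, the metrics above can all be derived from the four confusion matrix counts. The sketch below uses made-up predictions for a binary problem and does not guard against division by zero (for instance, when no tuple is labeled positive). The function names are hypothetical.

```python
def confusion_counts(actual, predicted, positive):
    """Count true positives, false negatives, false positives, true negatives."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def evaluation_metrics(tp, fn, fp, tn):
    """Derive the standard metrics from the confusion matrix counts."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # recall and sensitivity are the same quantity
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "sensitivity": recall,
        "specificity": tn / (tn + fp),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
    }


actual = ["pos"] * 4 + ["neg"] * 6
predicted = ["pos", "pos", "pos", "neg"] + ["pos"] + ["neg"] * 5

counts = confusion_counts(actual, predicted, positive="pos")
scores = evaluation_metrics(*counts)
```

With 3 true positives, 1 false negative, 1 false positive, and 5 true negatives, accuracy is 0.8 while precision, recall, and the F-measure are all 0.75.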

This quiz covers the challenges faced in Machine Learning, including prediction error, bias, variance, overfitting, and underfitting, and the selection and tuning of a model.
