Questions and Answers

If a document is consistently failing to render correctly across different browsers, which approach would MOST effectively address the underlying issue?

  • Implement browser-specific CSS hacks to target each browser individually, ensuring the document appears as intended in each.
  • Validate the document's HTML and CSS against established web standards and correct any identified errors or inconsistencies. (correct)
  • Rewrite the entire document using only basic HTML tags and inline CSS styles to maximize compatibility.
  • Use a JavaScript library to dynamically modify the document's structure and styles at runtime based on browser detection.

What is the primary risk associated with modifying a document's structure and styles dynamically at runtime using JavaScript?

  • Improved SEO performance as search engines can better index dynamic content.
  • Reduced accessibility for users with disabilities or those using assistive technologies. (correct)
  • Enhanced security as the client-side modifications obfuscate the underlying code.
  • Increased server load due to the additional processing required to serve customized content.

When designing a website, what is the MOST important consideration regarding cross-browser compatibility?

  • Adhering to web standards and testing on a range of browsers and devices to provide a consistent user experience. (correct)
  • Focusing on achieving pixel-perfect visual consistency across every browser, regardless of the effort required.
  • Prioritizing support for the latest versions of all major browsers to ensure access to cutting-edge web technologies.
  • Limiting the use of CSS and JavaScript to minimize potential compatibility issues.

Which of the following is the LEAST effective strategy for ensuring accessibility in web development?

  • Relying solely on JavaScript to enhance interactivity and user experience. (correct)

What is the potential drawback of using browser-specific CSS prefixes for experimental features?

  • Possible rendering inconsistencies across browsers if the feature is standardized without aligning with a specific prefix implementation. (correct)

What is the most significant advantage of using a CSS preprocessor like Sass or Less in web development?

  • The ability to write more maintainable and modular CSS code through features like variables, mixins, and nesting. (correct)

Which of the following techniques is MOST effective for optimizing website performance for mobile devices?

  • Minifying CSS and JavaScript files, compressing images, and leveraging browser caching. (correct)

What is the MOST critical consideration when designing responsive web pages?

  • Prioritizing content and functionality based on the user's context and device capabilities. (correct)

Study Notes

Loss Function

  • Measures the error between predicted and actual values in a model.
  • The goal is to minimize the error by optimizing model parameters.

Likelihood and Probability

  • Likelihood measures how likely a specific set of parameters explains the observed data.
  • Probability measures the chance of observing a particular outcome given fixed parameters.
  • Probability deals with data given fixed parameters.
  • Likelihood measures the "fit" of parameters given the observed data.
  • For a set of data points X = {x_1, x_2, ..., x_n} and parameter θ, the likelihood is given by L(θ|X) = ∏_{i=1}^{n} P(x_i|θ).

Maximum Likelihood Estimation (MLE)

  • The purpose of MLE is to estimate the parameter θ that maximizes the likelihood of observing the given data.
  • θ_MLE = arg max_θ L(θ|X), where the maximization is taken over θ.

Negative Log-Likelihood (NLL)

  • Transforms the product of probabilities (likelihood) into a sum for easier optimization.
  • Minimizes NLL instead of maximizing the likelihood.
  • NLL is commonly used as a loss function in classification models, especially in probabilistic models.
  • Minimizing the NLL is equivalent to maximizing the likelihood.
  • min(-log L(θ|X)) is equivalent to max L(θ|X).
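
A minimal sketch of this equivalence, assuming hypothetical data drawn from a Gaussian with unknown mean and a known standard deviation of 1: minimizing the NLL numerically recovers the same estimate as the closed-form MLE (the sample mean).

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Hypothetical data: 100 draws from a Gaussian with unknown mean (std assumed known = 1).
rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.0, size=100)

# NLL(theta) = -log L(theta|X) = -sum_i log P(x_i|theta)
def nll(mu):
    return -norm.logpdf(X, loc=mu, scale=1.0).sum()

# Minimizing the NLL is equivalent to maximizing the likelihood.
mu_mle = minimize_scalar(nll).x
print(mu_mle, X.mean())   # for a Gaussian mean, the MLE equals the sample mean
```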

Maximum A Posteriori (MAP)

  • The purpose of MAP is to estimate the parameter θ that maximizes the posterior probability by incorporating prior knowledge.
  • Bayes' Rule defines the posterior: P(θ|X) = P(X|θ)P(θ) / P(X).
  • Posterior = likelihood × prior / evidence.
  • Use the posterior when wanting to incorporate prior knowledge about a parameter.
  • Solve for θ_MAP = arg max_θ P(θ|X) to find the parameter using the posterior.
  • A prior is useful when there is a small sample size, when there is real background knowledge not included in the data, and when the prior can serve as a regularizer (e.g., Lasso and Ridge regularization).
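
A minimal sketch of MAP estimation under an assumed standard-normal prior on the mean (the data and setup are hypothetical). Because the evidence P(X) does not depend on θ, it is omitted from the maximization; with a Gaussian prior the MAP estimate is pulled from the sample mean toward the prior mean, acting like a regularizer.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Small hypothetical sample, where a prior has a visible effect.
rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.0, size=10)

def neg_log_posterior(mu):
    log_likelihood = norm.logpdf(X, loc=mu, scale=1.0).sum()   # log P(X|theta)
    log_prior = norm.logpdf(mu, loc=0.0, scale=1.0)            # log P(theta): belief that mu is near 0
    return -(log_likelihood + log_prior)                       # P(X) is constant in mu, so it is omitted

mu_map = minimize_scalar(neg_log_posterior).x
print(mu_map, X.mean())   # the MAP estimate sits between the prior mean (0) and the sample mean
```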

Machine Learning Workflow

  • We have data
  • Assume the model has parameters
  • Optimize a loss function to fit the model to the data

Loss Functions

  • Four loss functions and when each is appropriate are described below (a short numeric sketch follows the descriptions):

Negative Log-Likelihood (NLL)

  • Used in probabilistic models or classification tasks.
  • Minimizes the difference between predicted and actual probability distributions.

Sum or Mean Absolute Error (MAE)

  • Appropriate for regression tasks where large deviations between actual and predicted values should not dominate the loss.
  • More robust to outliers than Mean Squared Error (MSE).

Lasso (L1 Loss/Regularization)

  • Used to promote sparsity in the model (forces some coefficients to zero).
  • Ideal for feature selection.

Ridge (L2 Loss/Regularization)

  • Penalizes large model coefficients without forcing them to zero.
  • Used to prevent overfitting in regression models.
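
NLL was already sketched numerically above; as a rough illustration of the remaining quantities, the snippet below computes MAE (with MSE for comparison) and the L1/L2 penalties on made-up numbers.

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])
w      = np.array([0.0, 1.5, -2.0, 0.3])     # hypothetical model coefficients

mae = np.mean(np.abs(y_true - y_pred))       # mean absolute error: robust to outliers
mse = np.mean((y_true - y_pred) ** 2)        # mean squared error, for comparison: penalizes large errors more
l1_penalty = np.sum(np.abs(w))               # Lasso (L1) penalty: pushes coefficients toward exactly zero
l2_penalty = np.sum(w ** 2)                  # Ridge (L2) penalty: shrinks coefficients smoothly
print(mae, mse, l1_penalty, l2_penalty)
```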

Other things to study

  • Homework 1 and the end of Notebook 2 would be worthwhile to review

Naive Bayes: Parameters, Features, and Labels

  • Parameters: The underlying values the model learns (e.g., mean and variance in Gaussian Naive Bayes).
  • Features: The input variables used for prediction.
  • Discrete Labels: The possible categories the model predicts (e.g., spam or not spam).

Starting Point of Naive Bayes Classifiers

  • Naive Bayes classifiers begin with Bayes' Rule: P(Y|X) = P(X|Y)P(Y) / P(X)
  • It conditions on the predictor, calculating the probability of the class given the observed features.

Categorical Naive Bayes

  • A main assumption of categorical Naive Bayes is that features are conditionally independent given the class label.
  • This assumption simplifies learning because it allows each feature's conditional probability to be computed independently, reducing computational complexity.
  • The steps for Categorical Naive Bayes are:
    • Calculate the base rates (priors): P(Y)
    • Compute the probability of each class/predictor: P(X|Y)
    • Divide the count of each feature value k within a class by that class's count to get the conditional probabilities P(X_k|Y).
  • An example to represent categorical Naive Bayes is the sleep deprivation and symptoms (mild, moderate, severe) example.
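
A minimal by-hand sketch of these steps on a made-up version of the sleep-deprivation example (one categorical feature with values mild/moderate/severe encoded as 0/1/2, and a binary label):

```python
import numpy as np

# Hypothetical encoded data: feature = symptom severity (0=mild, 1=moderate, 2=severe),
# label = sleep deprived (0=no, 1=yes).
X = np.array([0, 0, 1, 2, 2, 1, 0, 2])
y = np.array([0, 0, 0, 1, 1, 1, 0, 1])

classes, class_counts = np.unique(y, return_counts=True)
priors = class_counts / len(y)                              # step 1: base rates P(Y)

# steps 2-3: P(X=k | Y=c) = count(X=k, Y=c) / count(Y=c)
cond = np.zeros((len(classes), 3))
for c in classes:
    for k in range(3):
        cond[c, k] = np.sum((X == k) & (y == c)) / class_counts[c]

print("P(Y):", priors)
print("P(X|Y):", cond)   # note the zero entries, which motivate smoothing later
```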

Pros and Cons of Naive Bayes Classification

  • Advantages
    • It doesn't require extensive training.
    • Is relatively fast.
    • Works with both categorical and continuous data.
    • Is not sensitive to irrelevant data.
  • Disadvantages
    • It has a zero probability issue: if a feature value never appears with a class in the training data, the model assigns that combination zero probability.
    • Has a strong independence assumption: In reality, features are often correlated, which affects prediction accuracy.
    • Can have misleading probabilities: The actual values of computed probabilities are often incorrect.

When to Use Different Types of Naive Bayes

  • Categorical Naive Bayes (CategoricalNB):
    • Use when features are categorical (e.g., presence/absence of symptoms).
  • Gaussian Naive Bayes (GaussianNB):
    • Use when features are continuous and assumed to follow a normal distribution (e.g., height, weight, temperature).
  • Bernoulli Naive Bayes (BernoulliNB):
    • Use when features are binary (e.g., 0/1 values in text classification).
  • Multinomial Naive Bayes (MultinomialNB):
    • Use when dealing with count-based data, such as word frequencies in documents (e.g., spam detection).
  • Complement Naive Bayes (ComplementNB):
    • Used when class imbalances exist, meaning one class has significantly more examples than another (e.g., rare disease detection).
  • Optimal Naive Bayes (OptimalNB):
    • Used when an optimized form of Naive Bayes is needed, often tuned for specific datasets.
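
As a rough usage sketch on toy data (assuming scikit-learn), GaussianNB for continuous features and MultinomialNB for count features are shown below; BernoulliNB, CategoricalNB, and ComplementNB follow the same fit/predict interface.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

y = np.array([0, 1, 0, 1])

# Continuous features (e.g., height in cm, weight in kg) -> GaussianNB
X_cont = np.array([[170.0, 65.0], [180.0, 80.0], [160.0, 55.0], [175.0, 90.0]])
print(GaussianNB().fit(X_cont, y).predict([[172.0, 70.0]]))

# Count features (e.g., word frequencies per document) -> MultinomialNB
X_counts = np.array([[3, 0, 1], [0, 2, 4], [2, 1, 0], [0, 3, 5]])
print(MultinomialNB().fit(X_counts, y).predict([[1, 0, 2]]))
```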

Gaussian Naive Bayes (GNB) Assumption

  • Instead of categorical probabilities, GNB assumes the feature values follow a normal (Gaussian) distribution: P(X_j|Y) = N(X_j; μ_Y, σ²_Y), where X_j is the feature, μ_Y is the mean, and σ²_Y is the variance of the feature within class Y.
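
For a single feature this is just the normal density evaluated at the observed value, using the class's fitted mean and variance (the numbers below are made up):

```python
from scipy.stats import norm

mu_y, var_y = 170.0, 25.0    # hypothetical per-class mean and variance of a height feature
x_j = 175.0                  # observed feature value
print(norm.pdf(x_j, loc=mu_y, scale=var_y ** 0.5))   # P(X_j | Y) under the Gaussian assumption
```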

Numerical Issues: Underflow & Zero Probability

  • Underflow in Naive Bayes: When multiplying many small probabilities together, the result can be so small that it rounds to zero due to floating-point limitations.
  • Solve the underflow problem with the logarithm trick: instead of computing the product of probabilities, sum their logarithms and compare classes using log P(Y) + ∑_j log P(X_j|Y); the log-evidence term log P(X) is the same for every class, so it can be dropped.
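
A quick demonstration of the underflow problem and the logarithm trick (the probabilities are arbitrary stand-ins for class-conditional terms):

```python
import numpy as np

probs = np.full(500, 1e-5)     # 500 small class-conditional probabilities

print(np.prod(probs))          # 0.0: the true value 1e-2500 underflows in double precision
print(np.sum(np.log(probs)))   # about -5756.5: finite, and enough to compare classes
```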

Spam-Ham Example & Zero Probability Problem

  • The spam-ham example uses Multinomial Naive Bayes, because it models word counts.
  • The zero probability problem: If a word never appears in a given class in the training dataset, Naive Bayes assigns it zero probability, which eliminates that class from consideration.
  • If, in the spam-ham example, words that don't appear in a class's training data are simply ignored and only the observed words are used:
    • The model might be overconfident in its predictions and fail when encountering new words.

Solutions to the Zero Probability Problem

  • Additive Smoothing (Laplace Smoothing)
    • Add a small positive number α to all counts: P(X_j|Y) = (count(X_j, Y) + α) / Σ_k (count(X_k, Y) + α), where the sum in the denominator runs over all feature values k.
    • Prevents zero probabilities without significantly altering large counts.
  • Adjusting Smoothing Parameter α
    • Choosing α carefully (commonly α = 1) balances between handling zero probabilities and keeping real probability distributions intact.
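
A minimal sketch of additive smoothing on hypothetical word counts for one class; with α = 1 the zero count becomes a small non-zero probability while the large counts barely change.

```python
import numpy as np

counts = np.array([3, 0, 7, 1])     # hypothetical counts of a 4-word vocabulary within one class
alpha = 1.0                         # Laplace smoothing parameter

unsmoothed = counts / counts.sum()                    # contains a zero probability
smoothed = (counts + alpha) / (counts + alpha).sum()  # denominator = counts.sum() + alpha * vocabulary size

print(unsmoothed)   # [0.273 0.    0.636 0.091]
print(smoothed)     # [0.267 0.067 0.533 0.133] - no zeros, large counts nearly unchanged
```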

Final Notes: What to Review

  • Review Homework 1 and examples from the end of Notebook 2.
  • Ensure understanding of probability assumptions, categorical vs. Gaussian Naive Bayes, and smoothing techniques.

K-Nearest Neighbors (KNN)

  • Is a non-parametric, instance-based learning algorithm.
  • It classifies a new data point based on the majority class of its K nearest neighbors.
  • It memorizes the training data and makes decisions based on similarity.

Effects of Increasing/Decreasing K in KNN

  • Increasing K reduces variance, smooths decision boundaries, and can lead to underfitting.
  • Decreasing K increases variance, makes the model sensitive to noise, and can lead to overfitting.
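
A small scikit-learn sketch (on a synthetic two-class dataset) showing how K moves the model between overfitting and underfitting:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 15, 100):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, knn.score(X_train, y_train), knn.score(X_test, y_test))
# k=1: near-perfect training accuracy, lower test accuracy (high variance / overfitting)
# very large k: smoother decision boundary, but accuracy drops if it underfits
```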

Overfitting

  • Overfitting happens when a model learns noise in the training data rather than the actual pattern.
  • It performs well on the training set but poorly on unseen data.
  • The model memorizes noise rather than generalizing the underlying data distribution.

Underfitting

  • Underfitting happens when a model is too simple to capture the pattern in the data.
  • It leads to high bias and poor performance on both training and test data.

Test Error and Model Complexity

  • Test error is the error rate on unseen data and measures a model's generalization ability.
  • Overfitting: low training error, high test error (model memorizes training data but fails on new data).
  • Underfitting: high training error, high test error (model is too simple and fails to learn the pattern).

Cross-Validation (CV)

  • Cross-validation evaluates model performance by splitting data into multiple subsets and training/testing on different parts of the dataset.
  • CV prevents overfitting, ensures the model generalizes well to unseen data, and helps select hyperparameters.
  • Splitting data into training and test sets ensures that model evaluation is done on unseen data, giving a better measure of generalization.
  • Leave-one-out cross-validation (LOO-CV) is a special case of k-fold cross-validation where each data point is used as a test set exactly once.
    • It is useful for small datasets but is computationally expensive.
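
A brief scikit-learn sketch of k-fold and leave-one-out cross-validation on a small built-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: each fold is used once as the held-out test set.
print(cross_val_score(model, X, y, cv=5).mean())

# Leave-one-out CV: one test point per split; fine here, expensive for large datasets.
print(cross_val_score(model, X, y, cv=LeaveOneOut()).mean())
```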

Regularization

  • It prevents overfitting by adding a penalty to large weights in the model, controlling complexity.

L1 and L2 Regularization

  • L1 (Lasso) Regularization:
    • Encourages sparse features by setting some coefficients to zero (feature selection).
    • Suitable when many features are irrelevant.
  • L2 (Ridge) Regularization:
    • Shrinks weights smoothly (no zero coefficients).
    • Suitable when all features contribute but need smaller magnitudes.
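
A scikit-learn sketch on synthetic regression data where only a few features are informative; Lasso zeroes out many coefficients while Ridge only shrinks them (the penalty strengths are arbitrary).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 20 features, only 5 of which actually influence the target.
X, y = make_regression(n_samples=100, n_features=20, n_informative=5, noise=10, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso coefficients set to zero:", int(np.sum(lasso.coef_ == 0)))   # typically many
print("Ridge coefficients set to zero:", int(np.sum(ridge.coef_ == 0)))   # typically none
```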

Bias-Variance Tradeoff

  • Bias: error due to simplified assumptions, e.g., underfitting.
  • Variance: error due to sensitivity to noise, e.g., overfitting.
  • Tradeoff: increasing model complexity lowers bias but increases variance. The goal is to find an optimal balance.

Gradient Descent

  • The gradient is the rate of change of a function with respect to its parameters.
  • Mathematically, the gradient of a function $f(x)$ is: $\nabla f(x) = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, ..., \frac{\partial f}{\partial x_n}\right)$.
  • Gradients on a contour plot show the steepest ascent direction (gradient points in the direction of increasing function value).
  • Because the gradient points toward steepest ascent, moving along it increases the function value, which is the opposite of what is wanted in a minimization problem; gradient descent therefore moves against the gradient.
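
A tiny numeric check of the gradient definition on an assumed function f(x) = x₁² + 3x₂, whose analytic gradient is (2x₁, 3); finite differences recover it approximately.

```python
import numpy as np

def f(x):
    return x[0] ** 2 + 3 * x[1]     # analytic gradient: (2*x1, 3)

x = np.array([1.0, 2.0])
eps = 1e-6
numeric_grad = np.array([
    (f(x + np.array([eps, 0.0])) - f(x)) / eps,   # partial derivative w.r.t. x1
    (f(x + np.array([0.0, eps])) - f(x)) / eps,   # partial derivative w.r.t. x2
])
print(numeric_grad)   # approximately [2., 3.]
```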

Gradient Descent Algorithm

  • Algorithm:
  1. Initialize weights randomly.
  2. Compute the gradient of the loss function.
  3. Update parameters in the opposite direction of the gradient.
  4. Repeat until convergence.
  • Equation: Given a loss function J(θ): θ(t+1) = θ(t) – α∇J(θ)
  • α = learning rate
  • ∇J(θ) = gradient of the loss function
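
A minimal sketch of these four steps on the one-dimensional loss J(θ) = (θ − 3)², chosen so the answer is known in advance:

```python
# Minimize J(theta) = (theta - 3)^2, whose gradient is dJ/dtheta = 2 * (theta - 3).
def grad(theta):
    return 2.0 * (theta - 3.0)

theta = 10.0                       # step 1: initialize (arbitrary starting point)
alpha = 0.1                        # learning rate
for _ in range(100):               # step 4: repeat until (approximate) convergence
    theta -= alpha * grad(theta)   # steps 2-3: compute gradient, move against it
print(theta)                       # converges to the minimizer, theta = 3
```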

Learning Rate

  • The learning rate controls how much weights are updated at each step.
  • Too small leads to slow convergence.
  • Too large might overshoot and never converge.
  • Gradient descent is valid when the loss function is differentiable, so that gradient updates move the parameters toward a minimum.
  • How to choose the learning rate:
  1. Trial and error
  2. Cross-validation (CV)
  3. Adaptive methods (e.g., Adam, RMSProp)
  • The step size determines how far we move in the direction of the gradient.
  • A step size that is too large may miss the optimal point.
  • A step size that is too small will cause convergence to be slow.

Gradient of the Loss & Step Size

  • The gradient of the loss function tells how much and in which direction to adjust model parameters to minimize the error.
  • Mathematically, if the loss function is J(θ), the gradient is ∇J(θ) = ∂J/∂θ.
  • In gradient descent, parameters are updated using θ(t+1) = θ(t) – α∇J(θ).
  • Step size (learning rate α) controls how much the parameters are adjusted in the direction of the gradient.
    • Small step size leads to slow convergence but stable learning.
    • Large step size leads to faster learning but may overshoot and not converge.
    • Adaptive step sizes (e.g., Adam optimizer) adjust automatically.
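
Reusing the same toy loss from the gradient-descent sketch above, the effect of the step size can be seen directly (the specific values of α are arbitrary):

```python
def run_gd(alpha, steps=50, theta=10.0):
    for _ in range(steps):
        theta -= alpha * 2.0 * (theta - 3.0)   # gradient of (theta - 3)^2
    return theta

print(run_gd(alpha=0.01))   # too small: still far from 3 after 50 steps (slow convergence)
print(run_gd(alpha=0.5))    # well chosen: lands on 3 immediately for this quadratic
print(run_gd(alpha=1.1))    # too large: overshoots more each step and diverges
```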

Supervised vs. Unsupervised Learning

  • Supervised learning is learning from labeled data, meaning each input has a corresponding known output (label).
    • Examples: Classification (spam detection, image recognition) and Regression (predicting house prices).
  • Unsupervised learning is learning patterns from unlabeled data (i.e., no explicit output labels).
    • Examples: Clustering (grouping customers based on purchase behavior) and Dimensionality Reduction (PCA for feature extraction).
  • Supervised learning is used when needing to make predictions with labeled data.
  • Unsupervised learning is used when needing to find structure with unlabeled data.
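
A compact scikit-learn contrast: the supervised model is trained with labels, while the clustering model sees only the features (the dataset choice here is just for illustration).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Supervised: labels y are used during training to learn a predictor.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:3]))

# Unsupervised: only X is used; the algorithm finds 3 clusters on its own.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:3])
```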
