Data Science Basics

Questions and Answers

Which of the following best describes the purpose of the ROC curve?

  • To identify the optimal number of clusters in a dataset.
  • To graphically represent the performance of a classifier by plotting the true positive rate against the false positive rate. (correct)
  • To reduce the dimensionality of a dataset while preserving variance.
  • To visualize the distribution of a single feature in a dataset.

Which of the following statements is true regarding a p-value?

  • A p-value represents the probability that the null hypothesis is false.
  • A p-value measures the effect size of an experiment.
  • A p-value is used to eliminate outliers in a dataset.
  • A p-value measures the probability of obtaining test results at least as extreme as the observed results, assuming the null hypothesis is true. (correct)

Which of the following is NOT an assumption of linear regression?

  • Independence of errors
  • Homoscedasticity
  • Multicollinearity (correct)
  • Linearity

Which of the following methods can be used to detect multicollinearity in a regression model?

  • Variance Inflation Factor (VIF) (correct)

In the k-means clustering algorithm, what is the key objective that the algorithm tries to achieve?

  • Minimize the variance within each cluster (correct)

What is the primary way that a decision tree works when classifying or predicting outcomes?

  • It splits data into subsets based on input features until a decision is made at the leaf nodes. (correct)

What is the strategy employed by the Random Forest algorithm to improve accuracy and control overfitting?

  • Combining multiple decision trees using bootstrap sampling and random feature selection. (correct)

How does gradient boosting differ from bagging methods like Random Forest?

  • Gradient boosting builds models sequentially, where each new model corrects errors from previous models. (correct)

What is the primary goal of Principal Component Analysis (PCA)?

  • To reduce dimensionality while preserving the maximum variance in the data (correct)

What is the 'curse of dimensionality'?

  • The set of challenges that arise when analyzing and organizing data in high-dimensional spaces. (correct)

How does bagging differ from boosting in ensemble methods?

  • Bagging trains multiple models independently, while boosting trains them sequentially, focusing on correcting errors of previous models. (correct)

What is the key difference between L1 and L2 regularization?

  • L1 regularization promotes sparsity by adding the absolute value of coefficients, while L2 leads to smaller but nonzero coefficients by adding the squared value. (correct)

How do generative models contrast with discriminative models in machine learning?

  • Generative models learn the joint probability distribution of input features and output labels, while discriminative models learn the conditional probability of the output labels given the input features. (correct)

What is the backpropagation algorithm's primary function in neural networks?

  • To calculate the gradient of the loss function with respect to each weight and update the weights to minimize the loss. (correct)

What is the 'vanishing gradient' problem in deep learning, and why does it occur?

  • A problem where gradients used to update neural network weights become very small, causing slow or stalled training, often in deep networks with specific activation functions. (correct)

Which of the following techniques can be used to handle imbalanced datasets?

  • Resampling techniques, using different evaluation metrics, generating synthetic samples, and using algorithms designed for imbalanced data. (correct)

What is the function of convolutional layers in a Convolutional Neural Network (CNN)?

  • To extract features from structured grid data like images (correct)

What characteristics make Recurrent Neural Networks (RNNs) suitable for sequential data?

  • They use connections between nodes forming directed cycles, which enables them to capture temporal dependencies. (correct)

What is the primary goal of a Support Vector Machine (SVM)?

  • To find the hyperplane that best separates data points of different classes with the maximum margin (correct)

What are the two primary steps in the Expectation-Maximization (EM) algorithm?

  • Expectation (E-step) to estimate latent variables and Maximization (M-step) to maximize likelihood with respect to parameters. (correct)

What is the main purpose of the provided Python function mean_variance(data)?

  • To calculate the mean and variance of a list of numbers. (correct)

In the k-means clustering implementation provided, what is the purpose of the line centroids = X[np.random.choice(X.shape[0], k, replace=False)]?

  • To randomly select k data points from X as initial centroids (correct)

In the Python code for logistic regression using gradient descent, what does the sigmoid function accomplish?

  • It converts linear predictions into probabilities between 0 and 1. (correct)

What is the purpose of the function pca(X, num_components) in the given Python code?

  • To perform Principal Component Analysis to reduce the dimensionality of the dataset to num_components. (correct)

In the Python implementation of a decision tree, what is the role of the _grow_tree function?

  • To recursively build the decision tree by splitting the data based on the best feature at each node. (correct)

In the neural network implementation, what is the purpose of the sigmoid_derivative(x) function?

  • To calculate the derivative of the sigmoid function, used in the backpropagation step. (correct)

What is the purpose of the provided Python function f1_score(y_true, y_pred)?

  • To calculate the F1 score, which is the harmonic mean of precision and recall. (correct)

In the given k-NN implementation, what is the purpose of the euclidean_distance function?

  • To calculate the Euclidean distance between two data points, used to measure similarity. (correct)

What is the role of the enumerate function in the context of the _predict method within the NaiveBayes class?

  • To iterate through the classes, providing both the index and the class label for calculating posteriors. (correct)

According to the pseudocode for apriori, what is the general role of the function generate_candidates?

  • To generate candidate itemsets from the transactions, to later assess their support. (correct)

In the context of the hierarchical_clustering function in the provided Python code, what is the purpose of the linkage function from scipy.cluster.hierarchy?

  • To create a linkage matrix, which encodes the hierarchical clustering tree based on the input data. (correct)

In the provided silhouette_score implementation, what is the purpose of calculating intra_distances and inter_distances?

  • To calculate the distances within clusters and between clusters, respectively, for the silhouette score calculation. (correct)

In the provided code snippet, what is the purpose of the ParameterGrid class from sklearn.model_selection in the grid_search function?

  • To generate all possible combinations of hyperparameters from the given parameter grid for grid search. (correct)

What is the purpose of the cross_entropy function in the provided Python code?

  • To calculate the cross-entropy loss between true labels and predicted probabilities. (correct)

What is the mathematical role of the numerator in the matthews_corrcoef function?

  • It represents the difference between observed correct predictions and expected correct predictions under randomness. (correct)

Within the kmeans_plus_plus function, why are probabilities constructed?

  • To select centroids randomly, using weighted probabilities based on the distance from existing centroids. (correct)

In the provided entropy(y) function, what does the expression np.unique(y, return_counts=True) return?

  • The unique values in y and their raw counts. (correct)

Based on the Python code for the metropolis_hastings function, what is the purpose of the accept_prob variable?

  • To determine whether to accept or reject a new sample based on the Metropolis-Hastings acceptance criterion. (correct)

In the levenshtein_distance function, what is the significance of the previous_row variable?

  • It stores the Levenshtein distances calculated for the previous row, used to compute the current row's distances. (correct)

Using the provided code for the viterbi function, what is the purpose of both trans_p and emit_p?

  • They store the transition probabilities between states and emission probabilities of observations given states, respectively. (correct)

In a customer churn prediction case study, what is the importance of feature engineering?

  • It creates relevant features like usage patterns, duration of service, and interaction with support, thereby improving model performance. (correct)

During A/B testing, what metrics might be defined to measure the success of an e-commerce company's new recommendation algorithm?

  • Click-through rate, conversion rate, and average order value. (correct)

In fraud detection for a credit card company, what type of features would be useful for feature engineering?

  • Transaction amount, frequency, location, and time of day. (correct)

What would be the initial step in tackling a sales forecasting task for a retail company?

  • Gather historical sales data, including seasonal trends and external factors like holidays. (correct)

What type of data should be analyzed as an initial step to build a recommendation system for an online streaming service?

  • User behavior data, including watch history, ratings, and preferences. (correct)

Flashcards

What is Data Science?

An interdisciplinary field focused on extracting knowledge and insights from data.

Supervised vs. Unsupervised Learning

Training a model on labeled data versus training on data without labels.

Overfitting vs. Underfitting

Overfitting learns noise, underfitting misses patterns.

Bias-Variance Tradeoff

Balance between overly simplistic models causing bias and overly complex models causing variance.

Parametric vs. Non-parametric Models

Models assuming a specific form versus those that don't.

Cross-Validation

A technique to assess how a model will generalize.

Confusion Matrix

A table that evaluates classification model performance by showing counts of true/false positives and negatives.

Regularization

Preventing overfitting by adding a penalty to model complexity.

Central Limit Theorem

Distribution of sample means approaches a normal distribution as the sample size grows.

Precision

Ratio of true positives to the sum of true/false positives.

Recall

Ratio of true positives to the sum of true positives/false negatives.

ROC Curve and AUC

Graphical representation of a classifier's performance, plotting TPR against FPR.

P-Value

Probability of obtaining test results at least as extreme as those observed, assuming the null hypothesis is true.

Assumptions of Linear Regression

Linearity, independence, constant variance, normality of residuals, no multicollinearity.

Multicollinearity

Independent variables are highly correlated.

K-Means Clustering

Partitions data into k clusters by minimizing variance within each cluster.

Decision Tree

Flowchart-like structure for classification and regression.

Random Forest Algorithm

Combines multiple decision trees to improve accuracy and control overfitting.

Gradient Boosting

Ensemble technique that builds models sequentially.

PCA (Principal Component Analysis)

Dimensionality reduction technique transforming data to a new coordinate system.

Curse of Dimensionality

Challenges arising from high-dimensional spaces.

Bagging vs. Boosting

Bagging trains models independently; boosting trains sequentially.

Generative vs. Discriminative Model

Generative models learn the joint probability distribution; discriminative models learn the conditional probability.

Backpropagation

Algorithm to train neural networks by calculating the gradient of the loss function.

Vanishing Gradient Problem

Gradients become very small, causing slow or stalled training.

CNN

Neural network designed for structured grid data such as images.

SVM

Finds the hyperplane that separates classes with the maximum margin.

EM Algorithm

Iterative technique to find maximum likelihood estimates of parameters in probabilistic models with latent variables.

Calculate mean and variance.

A function to find the mean and variance of a list of numbers.

K-means from scratch.

A clustering method performed without libraries.

Study Notes

What is Data Science?

  • Data Science is an interdisciplinary field
  • It extracts knowledge and insights from data
  • It uses scientific methods, algorithms, and systems
  • Data Science combines aspects of statistics, computer science, and domain expertise

Supervised Learning vs. Unsupervised Learning

  • Supervised learning trains a model on labeled data
  • Unsupervised learning trains a model on data without labels to find hidden patterns

Overfitting vs. Underfitting

  • Overfitting occurs when a model learns the noise in the training data
  • It performs well on training data but poorly on new data
  • Underfitting occurs when a model is too simple to capture the underlying patterns in the data
  • It performs poorly on both training and new data.

Bias-Variance Tradeoff

  • The bias-variance tradeoff is the balance between two sources of error that affect model performance
  • Bias is the error due to overly simplistic models
  • Variance is the error due to a model's sensitivity to fluctuations in the training data, typical of overly complex models
  • A good model should balance between bias and variance

Parametric vs. Non-Parametric Models

  • Parametric models assume a specific form for the function that maps inputs to outputs
  • They have a fixed number of parameters
  • Non-parametric models do not assume a specific form
  • They can grow in complexity with the data

Cross-Validation

  • Cross-validation assesses how a predictive model will generalize to an independent dataset
  • It involves partitioning the data into subsets, training the model on some subsets, and validating it on the remaining subsets.
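
As an illustration, a minimal k-fold cross-validation loop might look like the following sketch (assuming NumPy arrays for X and y and any estimator object with fit and predict methods; the function name is illustrative):

```python
import numpy as np

def k_fold_cross_validate(model, X, y, k=5, seed=0):
    """Estimate generalization performance by averaging accuracy over k folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))            # shuffle once
    folds = np.array_split(indices, k)           # k roughly equal folds
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model.fit(X[train_idx], y[train_idx])    # train on k-1 folds
        preds = model.predict(X[val_idx])        # validate on the held-out fold
        scores.append(np.mean(preds == y[val_idx]))
    return np.mean(scores)
```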

Confusion Matrix

  • A confusion matrix is a table to evaluate the performance of a classification model
  • It shows the counts of true positives, true negatives, false positives, and false negatives.
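
A minimal sketch of those four counts for binary 0/1 labels (the function name confusion_counts is an illustrative choice):

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded as 0/1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))   # predicted positive, actually positive
    tn = np.sum((y_pred == 0) & (y_true == 0))   # predicted negative, actually negative
    fp = np.sum((y_pred == 1) & (y_true == 0))   # false alarm
    fn = np.sum((y_pred == 0) & (y_true == 1))   # missed positive
    return tp, tn, fp, fn
```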

Regularization

  • Regularization prevents overfitting by adding a penalty to the model's complexity
  • Common types include L1 (Lasso) and L2 (Ridge) regularization

Central Limit Theorem

  • The Central Limit Theorem states that the distribution of sample means approaches a normal distribution
  • This happens as the sample size becomes large
  • This holds regardless of the original distribution of the data

Precision and Recall

  • Precision is the ratio of true positives to the sum of true and false positives
  • Recall is the ratio of true positives to the sum of true positives and false negatives
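
A small sketch computing both metrics plus their harmonic mean, which is the F1 score referenced in the questions above (the 0/1 label coding and function name are assumptions):

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and their harmonic mean (F1) for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```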

ROC Curve and AUC

  • The ROC curve is a graphical representation of a classifier's performance
  • It plots the true positive rate against the false positive rate
  • AUC (Area Under the Curve) measures the entire two-dimensional area underneath the ROC curve

P-Value

  • A p-value measures the probability of obtaining test results at least as extreme as the observed results
  • This assumes that the null hypothesis is true
  • P-values help determine the statistical significance of the results
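
As a hedged illustration, a two-sample t-test with SciPy on simulated data (group sizes, means, and variable names are made up for the example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=10.0, scale=2.0, size=200)    # e.g. current variant
treatment = rng.normal(loc=10.4, scale=2.0, size=200)  # e.g. new variant

# Null hypothesis: both groups share the same mean.
t_stat, p_value = stats.ttest_ind(treatment, control)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # a small p-value is evidence against the null
```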

Assumptions of Linear Regression

  • Assumptions include linearity
  • Assumptions include independence
  • Assumptions include homoscedasticity (constant variance)
  • Assumptions include normality of residuals
  • Assumptions include no multicollinearity

Multicollinearity

  • Multicollinearity occurs when independent variables in a regression model are highly correlated
  • It can be detected using Variance Inflation Factor (VIF) or correlation matrices
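
A rough NumPy-only sketch of VIF: regress each feature on the others and compute 1 / (1 - R²). The function name and the least-squares approach here are illustrative choices, not a prescribed implementation; a rule of thumb often cited is that values above roughly 5-10 warrant attention.

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of X (rows = observations)."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])   # add intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)      # regress x_j on the rest
        residuals = y - A @ coef
        r2 = 1.0 - residuals.var() / y.var()              # R^2 of that regression
        vifs.append(1.0 / (1.0 - r2))                      # VIF_j = 1 / (1 - R_j^2)
    return np.array(vifs)
```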

K-Means Clustering Algorithm

  • K-means is an unsupervised learning algorithm
  • It partitions data into k clusters by minimizing the variance within each cluster
  • It iteratively assigns data points to the nearest centroid
  • It updates centroids based on the mean of the points in each cluster
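
A minimal from-scratch sketch in NumPy, mirroring the initialization line cited in the questions above (choosing k distinct data points as initial centroids); parameter names are illustrative:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means: assign points to the nearest centroid, then update centroids."""
    rng = np.random.default_rng(seed)
    # Randomly pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(X.shape[0], k, replace=False)]
    for _ in range(n_iters):
        # Distance from every point to every centroid, shape (n_samples, k).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)                  # nearest-centroid assignment
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)                               # keep old centroid if a cluster empties
        ])
        if np.allclose(new_centroids, centroids):           # converged
            break
        centroids = new_centroids
    return labels, centroids
```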

Decision Tree

  • A decision tree is a flowchart-like structure
  • It is used for classification and regression
  • It splits data into subsets based on the value of input features
  • This creates branches until a decision is made at the leaf nodes
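
Splits are typically scored with an impurity measure such as entropy. A short sketch follows, matching the entropy(y) helper mentioned in the questions above; the information_gain helper is added here purely for illustration:

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a label array, used to score candidate splits."""
    _, counts = np.unique(y, return_counts=True)   # unique labels and their raw counts
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y, left_mask):
    """Reduction in entropy from splitting y into y[left_mask] and y[~left_mask]."""
    y = np.asarray(y)
    left_mask = np.asarray(left_mask, dtype=bool)
    n_left = left_mask.sum()
    n_right = len(y) - n_left
    if n_left == 0 or n_right == 0:
        return 0.0
    child = (n_left * entropy(y[left_mask]) + n_right * entropy(y[~left_mask])) / len(y)
    return entropy(y) - child
```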

Random Forest Algorithm

  • Random forest is an ensemble learning method
  • It combines multiple decision trees to improve accuracy and control overfitting
  • It uses bootstrap sampling and random feature selection to build each tree

Gradient Boosting

  • Gradient boosting is an ensemble technique
  • It builds models sequentially
  • Each new model attempts to correct the errors of the previous ones
  • It combines weak learners to form a strong learner
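
For context, a hedged scikit-learn comparison of the two ensemble styles on synthetic data (assuming scikit-learn is installed; the dataset and hyperparameters are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging-style ensemble: independent trees on bootstrap samples with random feature subsets.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Boosting-style ensemble: shallow trees added sequentially to correct earlier errors.
gb = GradientBoostingClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

print("random forest accuracy:", rf.score(X_test, y_test))
print("gradient boosting accuracy:", gb.score(X_test, y_test))
```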

Principal Component Analysis (PCA)

  • PCA is a dimensionality reduction technique
  • It transforms data into a new coordinate system by projecting it onto principal components
  • Principal components are orthogonal and capture the maximum variance in the data
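
A compact sketch of PCA via eigendecomposition of the covariance matrix; this is one of several equivalent formulations (an SVD-based version is also common), and the signature mirrors the pca(X, num_components) function referenced in the questions:

```python
import numpy as np

def pca(X, num_components):
    """Project X onto the top principal components (directions of maximum variance)."""
    X_centered = X - X.mean(axis=0)               # center each feature
    cov = np.cov(X_centered, rowvar=False)        # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigh: covariance is symmetric
    order = np.argsort(eigvals)[::-1]             # sort by decreasing variance
    components = eigvecs[:, order[:num_components]]
    return X_centered @ components                # reduced representation
```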

Curse of Dimensionality

  • The curse of dimensionality refers to the challenges and issues that arise when analyzing and organizing data in high-dimensional spaces
  • As the number of dimensions increases, the volume of the space increases exponentially, making data sparse and difficult to manage
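
One way to see the effect numerically: as dimensionality grows, pairwise distances concentrate, so "near" and "far" points become hard to distinguish. A small simulation sketch, assuming SciPy is available:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.random((500, d))       # uniform points in the unit hypercube
    dists = pdist(points)                # all pairwise Euclidean distances
    # The relative spread of distances shrinks as d grows (distance concentration).
    print(f"d={d:5d}  std/mean of pairwise distances: {dists.std() / dists.mean():.3f}")
```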

Bagging vs. Boosting

  • Bagging (Bootstrap Aggregating) is an ensemble method that trains multiple models independently using different subsets of the training data
  • It averages their predictions
  • Boosting trains models sequentially
  • Each model focuses on correcting the errors of the previous ones

L1 vs. L2 Regularization

  • L1 regularization (Lasso) adds the absolute value of the coefficients as a penalty term, promoting sparsity
  • L2 regularization (Ridge) adds the squared value of the coefficients as a penalty term, leading to smaller but non-zero coefficients
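
A sketch showing how the two penalties enter a linear-regression loss; lam (the regularization strength) and the function name are illustrative:

```python
import numpy as np

def regularized_loss(w, X, y, lam=0.1, kind="l2"):
    """Mean squared error of a linear model plus an L1 or L2 penalty on the weights."""
    residuals = X @ w - y
    mse = np.mean(residuals ** 2)
    if kind == "l1":
        penalty = lam * np.sum(np.abs(w))    # Lasso: encourages exact zeros (sparsity)
    else:
        penalty = lam * np.sum(w ** 2)       # Ridge: shrinks weights toward zero
    return mse + penalty
```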

Generative vs. Discriminative Model

  • Generative models learn the joint probability distribution of input features and output labels
  • They can generate new data points
  • Discriminative models learn the conditional probability of the output labels given the input features, focusing on the decision boundary

Backpropagation Algorithm

  • Backpropagation is an algorithm used to train neural networks
  • It calculates the gradient of the loss function with respect to each weight
  • It updates the weights in the opposite direction of the gradient to minimize the loss
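
A minimal sketch of one training step for a one-hidden-layer network with sigmoid activations and squared error (biases are omitted for brevity, and names like train_step are illustrative); it also shows the sigmoid_derivative helper referenced in the questions above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(a):
    """Derivative of the sigmoid expressed in terms of its output a = sigmoid(z)."""
    return a * (1.0 - a)

def train_step(X, y, W1, W2, lr=0.1):
    """One forward/backward pass for a one-hidden-layer network with squared-error loss."""
    # Forward pass
    hidden = sigmoid(X @ W1)             # hidden activations
    output = sigmoid(hidden @ W2)        # predictions

    # Backward pass: propagate the error from the output layer back through the network.
    error = output - y
    delta_out = error * sigmoid_derivative(output)
    delta_hidden = (delta_out @ W2.T) * sigmoid_derivative(hidden)

    # Gradient descent: move weights opposite to the gradient of the loss.
    W2 -= lr * hidden.T @ delta_out
    W1 -= lr * X.T @ delta_hidden
    return W1, W2
```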

Vanishing Gradient Problem

  • The vanishing gradient problem occurs when the gradients used to update neural network weights become very small
  • This causes slow or stalled training
  • It is common in deep networks with certain activation functions like sigmoid or tanh

Handling Imbalanced Datasets

  • Techniques include resampling (oversampling the minority class or undersampling the majority class)
  • Techniques include using different evaluation metrics (e.g., precision-recall curve)
  • Techniques include generating synthetic samples (e.g., SMOTE)
  • Techniques include using algorithms designed for imbalanced data
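
The simplest resampling option, random oversampling of the minority class, might be sketched as follows (SMOTE or class-weighted losses are common alternatives; the function name is illustrative and the sketch assumes a binary problem):

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until both classes have equal counts."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[counts.argmin()]
    n_needed = counts.max() - counts.min()
    idx = np.where(y == minority)[0]
    extra = rng.choice(idx, size=n_needed, replace=True)   # sample with replacement
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])
```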

Convolutional Neural Network (CNN)

  • CNN is a type of neural network designed for processing structured grid data like images
  • It uses convolutional layers to extract features and pooling layers to reduce dimensionality
  • It is followed by fully connected layers for classification

Recurrent Neural Networks (RNN)

  • RNNs are neural networks designed for sequential data, where connections between nodes form directed cycles
  • Variants include Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU)
  • LSTM and GRU address the vanishing gradient problem and capture long-term dependencies

Support Vector Machine (SVM)

  • SVM is a supervised learning algorithm used for classification and regression
  • It finds the hyperplane that best separates data points of different classes with the maximum margin
  • It can handle non-linear data using kernel functions
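
A brief scikit-learn illustration on a non-linearly separable toy dataset, assuming scikit-learn is available; the RBF kernel and the C value are arbitrary choices for the example:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-circles; the RBF kernel lets the SVM separate them.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

clf = SVC(kernel="rbf", C=1.0)   # C trades margin width against misclassification
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```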

Expectation-Maximization (EM) Algorithm

  • EM is an iterative algorithm
  • It is used to find maximum likelihood estimates of parameters in probabilistic models with latent variables
  • It consists of two steps: Expectation (E-step) to estimate the expected value of the latent variables, and Maximization (M-step) to maximize the likelihood function with respect to the parameters

Customer Churn Prediction (Case 1)

  • Understand the data, check for missing values, and explore patterns
  • Create relevant features like usage patterns, duration of service, and interaction with support
  • Use models like logistic regression, decision trees, or ensemble methods like random forests or XGBoost
  • Use metrics like accuracy, precision, recall, and AUC-ROC for evaluation
  • Implement the model in a production environment and monitor performance

A/B Testing (Case 2)

  • Clearly state the null and alternative hypotheses
  • Determine the required sample size to achieve statistical significance
  • Randomly assign users to the control (current algorithm) and treatment (new algorithm) groups
  • Define success metrics such as click-through rate, conversion rate, and average order value
  • Use statistical tests to compare the performance of both groups
  • Draw conclusions based on the results and make recommendations

Fraud Detection (Case 3)

  • Analyze transaction data to identify patterns indicative of fraud
  • Create features such as transaction amount, frequency, location, and time of day
  • Use supervised learning models like logistic regression, decision trees, and anomaly detection methods like isolation forests
  • Evaluate using metrics like precision, recall, F1 score, and confusion matrix
  • Continuously monitor model performance and update the model as fraud patterns evolve

Sales Forecasting (Case 4)

  • Gather historical sales data, including seasonal trends and external factors like holidays
  • Identify patterns, trends, and anomalies in the data via Exploratory Data Analysis (EDA)
  • Create features such as moving averages, lagged values, and external indicators
  • Use time series models like ARIMA, exponential smoothing, or machine learning models like random forests and gradient boosting
  • Validate model performance using metrics like RMSE, MAE, and MAPE
  • Generate forecasts and provide actionable insights

Recommender Systems (Case 5)

  • Analyze user behavior data, including watch history, ratings, and preferences
  • Implement user-based or item-based collaborative filtering
  • Use metadata like genre, actors, and directors for content-based filtering
  • Combine collaborative and content-based filtering for better recommendations (hybrid approach)
  • Use metrics like precision, recall, and mean reciprocal rank (MRR) to evaluate the recommender system
  • Continuously update the model based on user interactions to improve recommendations

Sentiment Analysis (Case 6)

  • Gather customer reviews from various sources like social media, websites, and surveys
  • Clean and preprocess the text data, including tokenization, stop-word removal, and stemming/lemmatization
  • Use techniques like TF-IDF, word embeddings, or BERT for feature extraction
  • Use machine learning models like logistic regression, SVM, or deep learning models like LSTM and BERT
  • Evaluate model performance using metrics like accuracy, precision, recall, and F1 score
  • Analyze the results to provide actionable insights to the company

Anomaly Detection (Case 7)

  • Analyze the server logs to identify normal and abnormal behavior patterns
  • Create features like CPU usage, memory usage, request count, and error rates
  • Use unsupervised learning methods like clustering (e.g., DBSCAN), isolation forests, or autoencoders for anomaly detection
  • Validate the model using techniques like the ROC curve and precision-recall curves
  • Implement the model in a monitoring system to detect anomalies in real-time and alert the relevant teams

Image Classification (Case 8)

  • Gather a dataset of labeled X-ray images
  • Preprocess the images by resizing, normalization, and augmentation to increase the dataset size
  • Use convolutional neural networks (CNN) architectures like ResNet, VGG, or transfer learning models
  • Train the model using cross-validation to avoid overfitting
  • Use metrics like accuracy, precision, recall, F1 score, and AUC-ROC for evaluation
  • Implement the model in a clinical setting, ensuring it integrates with existing systems and provides explainable results

Natural Language Processing (NLP) (Case 9)

  • Gather a dataset of historical support tickets and their categories
  • Clean and preprocess the text data, including tokenization, stop-word removal, and stemming/lemmatization
  • Use techniques like TF-IDF, word embeddings, or BERT for feature extraction
  • Use classification models like logistic regression, SVM, or deep learning models like LSTM and BERT
  • Evaluate model performance using metrics like accuracy, precision, recall, and F1 score
  • Integrate the model into the support system to automatically categorize new tickets and continuously improve based on user feedback

Market Basket Analysis (Case 10)

  • Gather transaction data, including items purchased and transaction timestamps
  • Clean the data, removing any inconsistencies or missing values
  • Use algorithms like Apriori or FP-Growth to find frequent itemsets and generate association rules
  • Evaluate the rules using metrics like support, confidence, and lift
  • Analyze the results to identify patterns and provide recommendations to increase cross-selling and up-selling
  • Implement changes in the store layout, promotions, and marketing strategies based on the insights
