Data Science Section A Quiz

Questions and Answers

What is the measure of central tendency that represents the most frequently occurring value in a dataset?

  • Median
  • Mode (correct)
  • Mean
  • Range

If a dataset has an even number of observations, how is the median determined?

  • The maximum value
  • The mean of the two middle values (correct)
  • The last middle value
  • The first middle value

Which of the following is not a measure of dispersion?

  • Range
  • Variance
  • Mode (correct)
  • Standard deviation

What is the range of a dataset?

  • The difference between the highest and lowest values (correct)

Which measure of central tendency is most sensitive to extreme values?

  • Mean (correct)

What is the formula for calculating the variance?

  • (sum of squared deviations from the mean) / (number of values), which is the population variance (correct)

Which measure of spread is equal to the square root of the variance?

  • Standard deviation (correct)
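
The measures covered in the questions above can be checked with Python's standard statistics module. This is only a sketch: the sample values are hypothetical, and it uses the population forms (pvariance, pstdev) to match the variance formula that divides by the number of values.

```python
import statistics

# Hypothetical sample data, chosen only for illustration.
values = [2, 4, 4, 6, 8, 12]

mean = statistics.mean(values)            # sum of values / number of values
median = statistics.median(values)        # even count: mean of the two middle values
mode = statistics.mode(values)            # most frequently occurring value
value_range = max(values) - min(values)   # highest value minus lowest value

# Population variance: sum of squared deviations from the mean / number of values.
variance = statistics.pvariance(values)
# Standard deviation is the square root of the variance.
std_dev = statistics.pstdev(values)
```

For a sample rather than a full population, statistics.variance and statistics.stdev divide by one less than the number of values instead.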

What is a significant impact of Data Science on businesses?

  • Improved decision-making and efficiency (correct)

What are the three key components of Data Science?

  • Data, Statistics, and Visualization (correct)

Which of the following is a supervised learning technique?

  • Linear Regression (correct)

What is the difference between precision and recall?

  • Precision measures the share of predicted positives that are actually positive, while recall measures the share of actual positives that are correctly identified (correct)

Which of the following is a data visualization technique?

  • Box Plot (correct)

What is the goal of feature engineering?

  • To transform the features into a more suitable representation for a machine learning algorithm (correct)

What is the purpose of cross-validation?

  • To ensure that the model is not overfitting the data (correct)

What is the purpose of hypothesis testing in data science?

  • To determine if a sample statistic is significantly different from a population parameter (correct)

How can you define a function in Python that accepts an arbitrary number of positional arguments?

  • Using the *args parameter (correct)
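
As a minimal sketch of the *args answer above, the hypothetical function total below accepts any number of positional arguments; inside the function they arrive as a tuple.

```python
def total(*args):
    """Return the sum of any number of positional arguments."""
    # args is a tuple holding every positional argument that was passed.
    return sum(args)


no_arguments = total()           # args is the empty tuple ()
three_arguments = total(1, 2, 3)
```

The name args is only a convention; the leading asterisk is what collects the arguments.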

Which data structure is primarily used in NumPy for handling arrays?

  • ndarray (correct)

Which method is used to create a NumPy array of integers ranging from 0 to 9?

  • np.arange(10) (correct)

What is the default data type of elements in a NumPy array?

  • Float: creation functions such as np.zeros and np.ones default to float64, while np.array infers the type from its input (correct)

What will be the result of the operation np.array([1, 2, 3]) + np.array([4, 5, 6])?

  • [5, 7, 9] (correct)

How can you access the element at the second row and third column of a NumPy array arr?

  • arr[1, 2], because NumPy indices start at 0 (correct)
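
The NumPy questions above can be tried out with a short sketch. The 3x4 array arr is a hypothetical example, not one from the quiz, and the sketch assumes NumPy's usual zero-based indexing.

```python
import numpy as np

digits = np.arange(10)                                # integers 0 through 9
summed = np.array([1, 2, 3]) + np.array([4, 5, 6])    # element-wise addition

# Hypothetical 3x4 array used only to illustrate indexing:
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
arr = np.arange(12).reshape(3, 4)
second_row_third_col = arr[1, 2]    # zero-based: row index 1, column index 2
third_row_fourth_col = arr[2, 3]    # row index 2, column index 3

zeros_dtype = np.zeros(3).dtype     # creation functions default to float64
```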

What is the predicted value of y from the regression line y = 2x + 3 when x is 5?

  • 13 (correct)

Which of the following is NOT a common assumption of linear regression?

  • Multicollinearity (correct)

In logistic regression, what type of outcome does the dependent variable typically represent?

  • A binary outcome or category (correct)

What is the primary purpose of conducting residual analysis in regression models?

  • To identify outliers and assess the model's assumptions (correct)

When performing polynomial regression, what effect does increasing the degree of the polynomial generally have?

  • Overfitting the data (correct)

Which type of regression is typically preferred when dealing with multicollinearity among independent variables?

  • Lasso Regression (correct)

In the context of regression analysis, which of the following is an example of a dependent variable?

  • Sales (correct)

Given the function f(x) = x^3 + 3x^2 - 24*x + 7, what is true about x=2?

  • x = 2 gives a local minimum of f(x) (correct)
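
The claim about x = 2 can be checked with the second-derivative test. The sketch below hard-codes the derivatives of f(x) = x^3 + 3x^2 - 24x + 7 by hand; the helper name classify is hypothetical.

```python
def f(x):
    return x**3 + 3 * x**2 - 24 * x + 7


def f_prime(x):
    # First derivative: 3x^2 + 6x - 24 = 3(x + 4)(x - 2)
    return 3 * x**2 + 6 * x - 24


def f_double_prime(x):
    # Second derivative: 6x + 6
    return 6 * x + 6


def classify(x):
    """Classify a point with the second-derivative test."""
    if f_prime(x) != 0:
        return "not a critical point"
    if f_double_prime(x) > 0:
        return "local minimum"
    if f_double_prime(x) < 0:
        return "local maximum"
    return "inconclusive"


at_two = classify(2)          # f'(2) = 0 and f''(2) = 18 > 0
at_minus_four = classify(-4)  # f'(-4) = 0 and f''(-4) = -18 < 0
```

Because f is a cubic with a positive leading coefficient, it has no global minimum, so x = 2 is a local minimum only.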

What distinguishes linear regression from logistic regression?

  • Linear regression predicts a continuous outcome, while logistic regression predicts a binary outcome (correct)

Which of the following accurately describes the purpose of logistic regression?

  • To predict categorical variables (correct)

Which measure indicates how well the linear regression model fits the data?

  • R-squared (correct)

What does the correlation coefficient measure in regression analysis?

  • The strength and direction of the relationship (correct)

What is a primary objective of k-means clustering?

  • To minimize the distance within clusters (correct)

How do k-means clustering and hierarchical clustering primarily differ?

  • K-means partitions the data around a fixed number of centroids, while hierarchical clustering builds nested clusters by merging or splitting them based on distance (correct)

What is a limitation of using k-means clustering?

  • It requires a priori knowledge of the number of clusters (correct)

What is the function of a hyperparameter in the gradient descent algorithm?

  • To set the learning rate (correct)

What is a disadvantage of using a low learning rate in gradient descent?

  • The algorithm may converge slowly (correct)

What is the condition on a and b for which the given system of linear equations has no solution?

  • a ≠ 4, 2a + b − 6 = 0 (correct)

Which statement is true about the determinant of a matrix?

  • The determinant of a diagonal matrix is the product of its diagonal entries. (correct)

Using the provided confusion matrix for classification, how is accuracy calculated?

  • (True Positive + True Negative) / Total Predictions (correct)
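
The accuracy formula above can be sketched as a small function. The confusion matrix the question refers to is not reproduced on this page, so the counts used here are hypothetical.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (true positives + true negatives) / total predictions."""
    total = tp + tn + fp + fn
    return (tp + tn) / total


# Hypothetical confusion-matrix counts, for illustration only.
example_accuracy = accuracy(tp=40, tn=45, fp=5, fn=10)
```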

What distinguishes simple linear regression from multiple regression?

  • Simple linear regression involves only one independent variable, while multiple involves more than one. (correct)

What is the goal of multivariate optimization?

  • To find the minimum or maximum of a function with multiple variables. (correct)

Which method can be used to find the minimum of a function with multiple variables without derivatives?

  • A derivative-free (direct search) method such as Nelder-Mead; gradient descent does not qualify because it relies on the gradient (correct)

What does pruning in decision trees achieve?

  • Reduces the complexity of the tree by removing unnecessary branches. (correct)

Flashcards

Key components of Data Science

Data, statistics, and visualization are the core elements of data science.

Supervised Learning Technique

A type of machine learning where the model learns from labeled data.

Unsupervised Learning Technique

A type of machine learning where the model learns from unlabeled data.

Feature Engineering Goal

Transforming features into a useful representation for algorithms.

Data Visualization Technique

Presenting data in a visual format to better understand it.

Precision and Recall Difference

Precision measures the accuracy of positive predictions; recall measures the completeness of positive predictions.

Classification Algorithm Quality

F1 Score is a measure of a classification algorithm's effectiveness combining precision and recall.
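
A minimal sketch of how precision, recall, and the F1 score relate, using hypothetical counts of true positives (tp), false positives (fp), and false negatives (fn). F1 is written here as the harmonic mean of precision and recall.

```python
def precision(tp, fp):
    """Share of predicted positives that are actually positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that the model identified."""
    return tp / (tp + fn)


def f1_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)


# Hypothetical counts, for illustration only.
p = precision(tp=40, fp=10)   # 40 / 50
r = recall(tp=40, fn=40)      # 40 / 80
f1 = f1_score(p, r)
```

The sketch does not guard against zero denominators, which occur when a model makes no positive predictions or the data has no positives.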

Cross-validation Purpose

Ensuring a model generalizes well to unseen data by evaluating its performance on different subsets.
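
One common way to evaluate on different subsets is k-fold cross-validation. The sketch below only produces the index splits and assumes the data is already shuffled; the function name k_fold_indices is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
```

Each sample lands in the test set exactly once; a model would be trained on train and scored on test for every fold, and the scores averaged.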

Mode

The value that appears most frequently in a dataset.

Median (even dataset)

The average of the two middle values when the dataset has an even number of observations.

Measure of Central Tendency

A way to describe the typical value in a dataset.

Range

The difference between the highest and lowest values in a dataset.

Mean

The sum of all values divided by the number of values.

Python random integer

The function random.randint(a, b), from Python's random module, returns a pseudo-random integer between a and b (both endpoints inclusive).
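
A short sketch of the inclusive bounds, simulating a six-sided die. The seed value is an arbitrary choice made only so the run is repeatable.

```python
import random

random.seed(0)  # arbitrary fixed seed so the sketch is repeatable

# Both endpoints are inclusive, so 1 and 6 can each appear.
rolls = [random.randint(1, 6) for _ in range(1000)]
```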

Data Science

The study of data to extract meaningful insights.

Python built-in data type

List is a built-in data type in Python.

Arbitrary Positional Arguments

A function can accept a variable number of positional arguments using the '*args' parameter. This allows for flexibility in the number of inputs provided.

NumPy's Primary Array Structure

NumPy's core data structure for working with arrays is the 'ndarray' (n-dimensional array). It efficiently stores and manipulates numerical data.

Creating a NumPy Array

You can create a NumPy array containing integers from 0 to 9 using 'np.arange(10)'. This function generates a sequence of numbers with a specified step.

Default Data Type in NumPy Array

NumPy creation functions such as 'np.zeros' and 'np.ones' default to 'float64' as their element data type, while 'np.array' infers the data type from the values it is given.

Adding NumPy Arrays

Adding two NumPy arrays of the same size performs element-wise addition. Each corresponding element in the arrays is added together.

Accessing NumPy Array Elements

To access a specific element in a NumPy array, use square brackets and specify the row and column index. For example, 'arr[2, 3]' accesses the element at the third row and fourth column.

Universal Functions (ufuncs) in NumPy

Universal functions (ufuncs) in NumPy are functions that operate on arrays element-wise. Examples include 'sqrt' for square root and 'sin' for sine.
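
A brief sketch of the two ufuncs named above, applied to small hypothetical arrays. Each call returns a new array with the function applied to every element.

```python
import numpy as np

values = np.array([0.0, 1.0, 4.0, 9.0])
roots = np.sqrt(values)        # element-wise square root

angles = np.array([0.0, np.pi / 2])
sines = np.sin(angles)         # element-wise sine

is_ufunc = isinstance(np.sqrt, np.ufunc)
```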

Sample Variance Distribution

For N observations drawn independently from a normal distribution with variance σ², the sample variance S², scaled as (N-1)S²/σ², follows a chi-square distribution with N-1 degrees of freedom.

Linear Regression

Predicts a continuous value (like price or temperature) based on the relationship with one or more input variables. It aims to find a line that best fits the data points.
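
As an illustration of fitting a best-fit line, the sketch below uses np.polyfit with degree 1, which performs a least-squares fit. The data points are hypothetical and lie exactly on y = 2x + 3, the line used in the quiz, so the fit should recover that slope and intercept.

```python
import numpy as np

# Hypothetical noise-free points on the line y = 2x + 3.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 3

# Degree-1 least-squares fit; coefficients come back highest power first.
slope, intercept = np.polyfit(x, y, deg=1)

prediction = slope * 5 + intercept   # predicted y when x is 5
```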

Logistic Regression

Predicts the probability of a categorical outcome (like yes/no, true/false) based on input variables. It uses an S-shaped curve to map the input to a probability.
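
The S-shaped curve is the logistic (sigmoid) function. In the sketch below, weight and bias are hypothetical model parameters for a single input variable; a real model would learn them from data.

```python
import math


def sigmoid(z):
    """Map any real number z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, weight, bias):
    """Probability of the positive class for one input x."""
    return sigmoid(weight * x + bias)


# Hypothetical parameters, for illustration only.
probability = predict_probability(x=2.0, weight=1.5, bias=-3.0)  # sigmoid(0.0)
```

This simple form can overflow for very large negative z; it is kept plain here for readability.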

R-squared

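Indicates how well the regression line fits the data, as the proportion of variance in the dependent variable that the model explains. A value of 1 means the line perfectly predicts the data; 0 means the model explains none of the variance.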

Correlation Coefficient

Measures the strength and direction of the linear relationship between two variables. Values range from -1 to 1.
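
The correlation coefficient and R-squared can be computed side by side. The sketch below uses hypothetical, nearly linear data; for a simple linear regression with an intercept, R-squared equals the square of the correlation coefficient.

```python
import numpy as np

# Hypothetical data, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Pearson correlation coefficient, always between -1 and 1.
r = np.corrcoef(x, y)[0, 1]

# R-squared from the least-squares line: 1 - (residual SS / total SS).
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)
r_squared = 1 - np.sum(residuals**2) / np.sum((y - y.mean())**2)
```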

K-Means Clustering

An algorithm that groups similar data points into clusters by minimizing the distance within each cluster. It iteratively assigns data points to the nearest centroid.
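
The assign-and-update loop can be sketched in a few lines of NumPy. This is a simplified illustration, not a production implementation: it runs a fixed number of iterations, picks initial centroids at random from the data, and the function name k_means and the sample points are hypothetical.

```python
import numpy as np


def k_means(points, k, n_iters=10, seed=0):
    """Return (centroids, labels) after n_iters assign/update rounds."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: label each point with its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels


# Two hypothetical, well-separated groups of points.
points = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
centroids, labels = k_means(points, k=2)
```

Note that k must be supplied up front, which is the limitation of k-means mentioned in the quiz.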

Hierarchical Clustering

An algorithm that builds clusters by iteratively merging or splitting existing clusters based on their similarity or distance.

Gradient Descent

An optimization algorithm that iteratively adjusts the parameters of a model to minimize a cost function, which represents the error of the model's predictions.

Learning Rate

A hyperparameter in gradient descent that controls the step size for parameter adjustments. A higher learning rate makes larger jumps, while a lower rate makes smaller adjustments.
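
The effect of the learning rate can be seen on a toy problem. The sketch below minimizes a hypothetical cost function J(x) = (x - 3)^2, whose minimum is at x = 3, with the same number of steps and two different learning rates.

```python
def gradient_descent(grad, start, learning_rate, n_steps):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    x = start
    for _ in range(n_steps):
        x = x - learning_rate * grad(x)
    return x


def grad(x):
    # Gradient of the hypothetical cost J(x) = (x - 3)^2.
    return 2 * (x - 3)


slow = gradient_descent(grad, start=0.0, learning_rate=0.01, n_steps=50)
fast = gradient_descent(grad, start=0.0, learning_rate=0.1, n_steps=50)
```

After 50 steps the run with the lower rate is still well short of the minimum, while the run with the higher rate is very close to it. A rate that is too high can overshoot and fail to converge at all.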

Linear Regression Assumption

Linearity, independence of residuals, normality of residuals, and homoscedasticity are common assumptions of linear regression. Multicollinearity is NOT an assumption but a potential problem.

Logistic Regression Outcome

Logistic regression predicts a categorical outcome, typically a binary classification (e.g., yes/no, true/false).

Residual Analysis Purpose

Residual analysis helps identify outliers, assess model assumptions, and check if the model is a good fit for the data.

Polynomial Regression Degree Effect

Increasing the degree of a polynomial in regression can lead to overfitting by making the model too complex and fitting the noise in the data.

Multicollinearity Solution

Lasso regression is a technique that can handle multicollinearity (highly correlated independent variables) by shrinking some coefficients to zero.

Dependent Variable in Regression

A dependent variable (e.g., sales) in regression is the variable being predicted or explained by the model.

Linear vs. Logistic Regression

Linear regression predicts continuous variables, while logistic regression predicts categorical variables. Both techniques help understand the relationship between variables.

Tuple Slicing with Indexing

To extract a specific value from nested tuples in Python, use indexing and slicing. Use aTuple[index][sub-index] to access the desired value.
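
A short sketch using the nested tuple that appears in the study notes. The first index selects the inner tuple and the second selects a value inside it; the slice on the last line is an extra illustration.

```python
aTuple = ("Orange", (10, 20, 30), (5, 15, 25))

inner = aTuple[1]            # the inner tuple (10, 20, 30)
val = aTuple[1][1]           # second element of that inner tuple
tail = aTuple[1][1:]         # slicing returns a new tuple
```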

No Solution System

A system of linear equations has no solution when the equations are inconsistent, meaning they cannot be satisfied simultaneously. This occurs when the coefficients are related in a way that leads to a contradiction.

Determinant of a Diagonal Matrix

The determinant of a diagonal matrix is calculated by simply multiplying all the elements along the diagonal.
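
The rule can be checked numerically against NumPy's general determinant routine. The diagonal entries below are hypothetical.

```python
import numpy as np

# Hypothetical diagonal entries, for illustration only.
diag_entries = np.array([2.0, 3.0, 4.0])
matrix = np.diag(diag_entries)             # 3x3 matrix, zeros off the diagonal

det_from_numpy = np.linalg.det(matrix)     # general-purpose determinant
det_from_product = np.prod(diag_entries)   # product of the diagonal entries
```

The two results agree up to floating-point rounding.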

Accuracy

Accuracy in a classification model is the proportion of correctly classified instances out of the total instances.

Sensitivity

Sensitivity, also known as the true positive rate, measures the proportion of actual positive instances that are correctly identified as positive.

Multiple Linear Regression

A statistical technique used to predict a dependent variable based on the relationship with multiple independent variables.

Simple vs. Multiple Regression

Simple linear regression involves one independent variable, while multiple regression uses two or more independent variables to predict the dependent variable.

Multivariate Optimization

The process of finding the optimal values for multiple variables within a function, aiming to maximize or minimize its output.

Pruning Decision Trees

A technique used to simplify a decision tree by removing unnecessary or redundant branches, which reduces overfitting and can improve accuracy on unseen data.

Study Notes

Data Science Section A

  • The key components of Data Science are Data, Statistics, and Visualization
  • Linear Regression is a supervised learning technique
  • K-Means Clustering is an unsupervised learning technique
  • The goal of feature engineering is to transform features into a suitable representation for machine learning algorithms
  • A Box Plot is a data visualization technique
  • Precision is the share of predicted positives that are actually positive; recall is the share of actual positives that the model identifies
  • Measures of classification algorithm quality include Precision, Recall, and F1 Score
  • Cross-validation checks that the model generalizes to unseen data rather than overfitting
  • Logistic Regression and Random Forest are classification algorithms; Linear Regression predicts continuous values
  • Hypothesis testing purpose is to determine if a sample statistic is significantly different from a population parameter
  • Null hypothesis states that there is no significant difference between a sample statistic and a population parameter
  • Alternative hypothesis states there is a significant difference between a sample statistic and a population parameter
  • P-value is the probability of observing a sample statistic as extreme or more extreme, assuming the null hypothesis is true
  • Significance level is the probability of making a type I error
  • Type II error is failing to reject a false null hypothesis

Data Science Section B

  • Data Science is the study of data to extract meaningful insights; it is not merely a data storage domain, and it is not restricted to structured data

  • Impact of Data Science on businesses: improved decision-making and efficiency

  • Built-in data types in Python include lists

  • Result of "Hello, " + "World!" is "Hello, World!" in Python

  • Pseudo-random numbers are generated using the random module in Python

  • Primary data structure in NumPy for arrays is ndarray

  • A NumPy array containing integers from 0 to 9 can be created using np.arange(10)

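  • NumPy creation functions such as np.zeros and np.ones default to float64 elements; np.array infers the data type from its input

  • Element access in a NumPy array is done with arr[row, column], using zero-based indices

  • Universal functions (ufuncs) such as np.sqrt operate on arrays element-wise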

Data Science Section C

  • Gradient descent can converge to a local minimum rather than the global minimum when the cost function is non-convex
  • Correlation is generally a better metric than covariance for analyzing association, because it is unitless and bounded between -1 and 1
  • Linear regression minimizes the residual sum-of-squares (SSR)
  • Cross-validation techniques include Leave-One-Out Cross-Validation (LOOCV) and k-fold cross-validation
  • Classification problems include disease diagnosis; house price prediction is a regression problem
  • For the linear regression equation y = 2x + 3, the predicted y value when x = 5 is 13
  • Multicollinearity is not a common assumption of linear regression; linearity is

Additional Topics

  • The purpose of residual analysis in regression is to identify outliers and evaluate model assumptions
  • A type of regression suitable for multicollinearity is Lasso Regression

Tuple Slicing

  • To set val to 20 from the tuple aTuple = ("Orange", (10, 20, 30), (5, 15, 25)), use val = aTuple[1][1]
  • An example of a dependent variable is Sales
  • Difference between simple and multiple regression: simple regression has one independent variable, multiple regression has more than one independent variable
  • Multivariate optimization finds the minimum or maximum of a function with multiple variables
  • Finding a minimum of a multivariable function without derivatives calls for a derivative-free method such as Nelder-Mead; gradient descent relies on the gradient
  • Pruning reduces the size of a decision tree by removing unnecessary branches
  • Goal of a support vector machine (SVM) is to find the ideal decision boundary that separates the data into classes
