Mock Tests PDF
Summary
This document is a collection of mock exam questions on data analysis and statistical testing. Topics include hypothesis testing, data visualization, data import and manipulation in R, and big data concepts.
Full Transcript
Question 1: In statistical testing, what does it mean when the null hypothesis (H₀) is rejected?
a. The statistical test was not performed correctly.
b. The alternative hypothesis is supported by the data.
c. The null hypothesis is accepted, and the alternative hypothesis is rejected.
d. There is evidence to suggest that the null hypothesis is true.

Question 2: A bar graph is the best graph to use when:
a. You want to show ordered trends in your data.
b. Your independent and dependent variables are both continuous.
c. Your dependent variable was measured on at least a ratio scale.
d. Your independent variable is categorical.

Question 3: If you have discrete group data, such as months of the year, age groups, shoe sizes, or animals, which type of graph best displays it?
a. Scatter
b. Boxplot
c. Histogram
d. Bar

Question 4: What does the following code do?

    data %>% group_by(category) %>% filter(rank == min(rank))

a. Filters rows within each group where rank equals the group minimum.
b. Removes rows where rank is less than the minimum.
c. Filters rows globally where rank is the minimum.
d. Creates a new column that ranks the rows.

Question 5: In the data analysis process, which of the following is typically done during the data interpretation phase?
a. Developing a model to make predictions.
b. Cleaning the data to remove inconsistencies.
c. Analyzing data to derive insights and conclusions.
d. Collecting raw data from various sources.

Question 6: When cleaning a dataset with duplicate rows, which of the following is the most appropriate first step?
a. Sort the dataset by a key variable and remove rows that are identical in all columns.
b. Aggregate the duplicates by calculating summary statistics (e.g., mean, sum) for numerical columns.
c. Drop all rows that appear more than once without any further checks.
d. Check for duplicates only in key columns (such as ID or transaction date) and remove those.

Question 7: Which of the following methods would most likely be employed in diagnostic analysis to understand why a product's sales have declined over the past six months?
a. Performing a cohort analysis to identify changing customer preferences.
b. Utilizing time series analysis to assess sales trends.
c. Implementing A/B testing on different marketing strategies.
d. Conducting regression analysis to predict future sales.

Question 8: In hypothesis testing, if the p-value is less than the significance level (α), what is the correct decision?
a. Fail to reject the null hypothesis.
b. Reject the null hypothesis.
c. Reject the alternative hypothesis.
d. Increase the sample size to reduce the p-value.

Question 9: Which of the following best describes the purpose of data validation in the data analysis process?
a. To create visual representations of the data.
b. To ensure the data is visually appealing.
c. To verify the accuracy and quality of the data before analysis.
d. To perform statistical tests on the data.

Question 10: Which of the following would be the correct way to compute the sum of a column named sales in a dataset?
a. mutate(total_sales = sum(sales))
b. group_by(total_sales = sum(sales))
c. summarize(total_sales = sum(sales))
d. select(total_sales = sum(sales))

Question 11: In the realm of data analysis types, which of the following methodologies would most likely be utilized in diagnostic analysis to identify the underlying causes of a decrease in sales?
a. Optimization algorithms to suggest pricing strategies for maximizing sales.
b. Time series forecasting using ARIMA models to predict future sales trends.
c. Cluster analysis to segment customers based on purchasing behavior.
d. Exploratory data analysis (EDA) techniques, including correlation analysis and hypothesis testing, to identify relationships.
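Worked example (Question 4): a minimal sketch of the grouped-filter pattern, using a made-up data frame whose category and rank values are assumed purely for illustration.

    library(dplyr)

    # Made-up example data (values assumed for illustration)
    data <- data.frame(
      category = c("A", "A", "B", "B", "B"),
      rank     = c(2, 1, 3, 1, 1)
    )

    # Within each category, keep only the rows whose rank equals that
    # category's minimum; ties are all retained (answer a)
    data %>%
      group_by(category) %>%
      filter(rank == min(rank))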
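Worked example (Question 8): a small sketch of the p-value decision rule, using a one-sample t-test on made-up values.

    # Made-up sample (values assumed for illustration)
    x <- c(5.1, 4.8, 5.6, 5.3, 4.9, 5.4)
    result <- t.test(x, mu = 5)

    alpha <- 0.05
    if (result$p.value < alpha) {
      print("Reject the null hypothesis")          # p < alpha (answer b)
    } else {
      print("Fail to reject the null hypothesis")
    }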
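Worked example (Question 10): summarize() collapses a data frame to a single summary row, which is why it is the right verb for a column total; the sales values are made up.

    library(dplyr)

    # Made-up sales column (values assumed for illustration)
    df <- data.frame(sales = c(100, 250, 75))

    # Returns a one-row data frame with the column total (answer c)
    df %>% summarize(total_sales = sum(sales))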
Question 12: In the context of reading data from an SQL database into R, which function from the DBI package is typically used to execute SQL queries and retrieve data into R as a data frame?
a. dbWriteTable()
b. dbGetQuery()
c. dbSendQuery()
d. dbConnect()

Question 13: The pipe operator (%>%) in dplyr is used to:
a. Chain together multiple functions.
b. Group data by a column.
c. Filter rows of a dataset.
d. Assign a value to a variable.

Question 14: In a retail environment, which data analysis type would be most effective for developing a personalized marketing strategy based on customer behavior and purchase history?
a. Descriptive analysis, to summarize customer demographics.
b. Prescriptive analysis, to recommend specific marketing actions tailored to each customer.
c. Predictive analysis, to forecast future buying patterns.
d. Diagnostic analysis, to identify reasons for customer churn.

Question 15: Which of the following metrics is most relevant when evaluating the performance of a predictive model in the context of classification tasks?
a. F-statistic
b. Area Under the Receiver Operating Characteristic Curve (AUC-ROC)
c. R-squared value
d. Mean Squared Error (MSE)

Question 16: What does a high variance in data indicate?
a. Data is not normally distributed.
b. Data contains missing values.
c. Data points are clustered closely.
d. Data points are spread out widely.

Question 17: When importing a CSV file using read.csv() in R, which argument would you use to prevent automatic conversion of strings into factors?
a. as.is = TRUE
b. stringsAsFactors = FALSE
c. stringsAsFactor = FALSE
d. convertStrings = FALSE

Question 18: What is the purpose of na.omit() in R?
a. Remove rows containing missing values.
b. Add missing values.
c. Replace missing values with zeros.
d. Calculate missing value percentage.

Question 19: How does the challenge of Velocity in big data influence the design of data architecture, especially when considering the requirements for near-real-time analytics?
a. It promotes the implementation of event-driven architectures that can process streams of data as they arrive.
b. It leads to a focus on archival storage solutions for long-term historical data retention.
c. It necessitates the use of batch processing systems that handle data at scheduled intervals for consistency.
d. It encourages the integration of artificial intelligence only after data has been processed through traditional methods.

Question 20: What type of variable is measured on an ordinal scale?
a. Gender
b. Customer satisfaction rating (1 to 5)
c. Number of transactions
d. Temperature in Celsius

Question 21: Which of the following scenarios best illustrates the integration of both predictive and prescriptive analysis in decision-making?
a. Analyzing customer demographics and summarizing findings in a dashboard.
b. Reviewing previous marketing campaigns to summarize their success.
c. Forecasting demand for a new product and recommending optimal inventory levels to maximize profit.
d. Predicting customer churn based on historical data and developing a report on past customer behavior.

Question 22: When importing data from an Excel file using the readxl package's read_excel() function, which of the following is NOT true?
a. You can specify the sheet to be read by using the sheet argument.
b. read_excel() automatically detects the data types of each column.
c. It can read both .xls and .xlsx formats.
d. read_excel() can read data from a password-protected Excel file.
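Worked example (Question 12): a sketch of dbGetQuery(), which both sends the SQL and fetches the result as a data frame. It assumes the RSQLite package is available so the example can run against an in-memory database; the table and query are made up.

    library(DBI)

    # In-memory SQLite database (RSQLite assumed installed)
    con <- dbConnect(RSQLite::SQLite(), ":memory:")
    dbWriteTable(con, "sales", data.frame(id = 1:3, amount = c(10, 20, 30)))

    # Executes the query and returns the rows as a data frame (answer b)
    result <- dbGetQuery(con, "SELECT id, amount FROM sales WHERE amount > 15")
    print(result)

    dbDisconnect(con)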
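Worked example (Question 13): the pipe passes each result into the next call, so a chain reads top to bottom instead of inside out; the data frame is made up.

    library(dplyr)

    # Made-up data (values assumed for illustration)
    df <- data.frame(group = c("x", "x", "y"), value = c(1, 2, 3))

    # Equivalent to summarize(group_by(filter(df, value > 1), group), ...)
    df %>%
      filter(value > 1) %>%
      group_by(group) %>%
      summarize(mean_value = mean(value))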
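Worked example (Questions 17 and 18): reading a CSV without factor conversion, then dropping rows that contain missing values. "data.csv" is a placeholder path, not a file from this document.

    # stringsAsFactors = FALSE keeps character columns as character
    # ("data.csv" is a placeholder path)
    df <- read.csv("data.csv", stringsAsFactors = FALSE)

    # na.omit() returns the data frame with every row containing an NA removed
    clean_df <- na.omit(df)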
Question 23: What is the main purpose of descriptive analytics?
a. Predict future trends
b. Test hypotheses
c. Summarize and describe historical data
d. Clean and preprocess data

Question 24: When performing exploratory data analysis (EDA), which statistical method would you use to examine the relationship between two categorical variables in a large dataset, and how would you interpret the results?
a. ANOVA, to analyze the variance between multiple categorical groups and their respective means.
b. T-test, to compare the means of the two variables and determine if the difference is statistically significant.
c. Pearson correlation, as it captures the linear relationship between the two variables.
d. Chi-square test of independence, to assess whether an association exists between the two variables, with a significant result indicating dependence.

Question 25: In a large dataset with multiple categorical variables, which of the following techniques is most appropriate to handle high-cardinality categorical variables during data cleaning?
a. Drop categorical variables with high cardinality from the dataset.
b. Replace each category with its corresponding mean of the target variable.
c. Convert categorical variables to binary variables using one-hot encoding.
d. Group rare categories together into an "Other" category.

Question 26: You are using the scan() function to read data from a text file. Which of the following is TRUE about scan() compared to read.table()?
a. scan() is more flexible, allowing data to be read into vectors, lists, or matrices.
b. scan() is faster for reading large datasets with mixed data types than read.table().
c. scan() is optimized for reading data directly into data frames.
d. scan() is the preferred method for reading structured data with headers.

Question 27: When handling imbalanced classes in a dataset, which of the following is a proper data cleaning technique to address the imbalance?
a. Oversample the minority class or undersample the majority class.
b. Drop the target variable and focus only on the predictors.
c. Remove instances of the minority class to balance the dataset.
d. Normalize the target variable to reduce class imbalance.

Question 28: Which characteristic of Variety in big data best describes the challenge organizations face when integrating data from multiple sources?
a. The high speed at which data is generated across different platforms and devices.
b. The need to standardize data types for analysis regardless of the source format.
c. The requirement for real-time processing of streaming data from IoT devices.
d. The incorporation of unstructured data alongside structured data from various formats such as text, audio, video, and social media feeds.

Question 29: You have a dataset df with columns id, name, and sales. If you want to keep only the rows where sales is within the top 10 highest values, which code would you use?
a. df %>% filter(rank(desc(sales)) <= 10)
b. df %>% arrange(desc(sales)) %>% top_n(10)
c. df %>% arrange(desc(sales)) %>% slice_max(sales, n = 10)
d. df %>% filter(sales %in% max(sales))

Question 30: What is a key difference between a tibble and a traditional data frame in R?
a. Tibbles do not support row names, while data frames do.
b. Tibbles cannot store character data.
c. Tibbles automatically sort columns alphabetically.
d. Tibbles are slower to process compared to data frames.
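Worked example (Question 24): a chi-square test of independence on a made-up contingency table of two categorical variables; the counts are far too small for a reliable test and are assumed purely for illustration.

    # Made-up categorical data (assumed for illustration)
    tbl <- table(
      region  = c("N", "N", "S", "S", "N", "S", "N", "S"),
      churned = c("yes", "no", "yes", "yes", "no", "no", "yes", "no")
    )

    # A p-value below the significance level suggests the two
    # variables are associated (dependent), as in answer d
    chisq.test(tbl)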
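Worked example (Question 29): slice_max() keeps the n rows with the largest values of a column; the data frame below is made up and uses n = 3 so the small example stays readable.

    library(dplyr)

    # Made-up data frame with id, name, and sales (values assumed)
    df <- data.frame(
      id    = 1:5,
      name  = c("a", "b", "c", "d", "e"),
      sales = c(120, 95, 300, 180, 250)
    )

    # Keeps the 3 rows with the highest sales (answer c, which uses
    # n = 10 in the question itself)
    df %>% slice_max(sales, n = 3)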
Question 31: Which of the following is the most appropriate method for handling outliers when they are due to data entry errors rather than true variability?
a. Log transformation of the data to reduce the influence of outliers.
b. Remove the outliers based on a predefined threshold (e.g., Z-score or IQR).
c. Keep the outliers and run the analysis as usual.
d. Winsorization to cap extreme values at a certain percentile.

Question 32: Considering the challenges of Volume, what technique is most effective for ensuring that analytical systems can scale dynamically to handle fluctuating data loads?
a. Optimizing SQL queries for faster execution times without addressing data size concerns.
b. Utilizing cloud-based platforms that provide on-demand scalability and flexible resource allocation.
c. Restricting data collection to only essential metrics to keep storage requirements manageable.
d. Implementing monolithic database architectures that can manage larger data sets efficiently.

Question 33: What does str(df) do in R?
a. Provides a statistical summary of the data frame.
b. Returns the structure of the data frame.
c. Displays the first few rows of the data frame.
d. Converts the data frame to string format.

Question 34: When designing an experiment to test the impact of diet on health outcomes, which of the following would be a statistical hypothesis?
a. Healthy diets consist primarily of fruits and vegetables.
b. A diet high in sugar will have a negative impact on heart health.
c. The mean cholesterol level of participants on a high-fiber diet is different from that of participants on a low-fiber diet.
d. A healthy diet improves well-being.

Question 35: A Type I error in hypothesis testing occurs when:
a. A statistical test fails to detect a significant effect.
b. The null hypothesis is incorrectly rejected when it is actually true.
c. The null hypothesis is accepted when it is false.
d. The alternative hypothesis is accepted based on insufficient evidence.

Question 36: What does the term "data normalization" mean?
a. Reducing data dimensionality.
b. Removing duplicates from the dataset.
c. Sorting data values in ascending order.
d. Standardizing data to a common scale.

Question 37: In the data analysis process, which step typically involves transforming raw data into a more suitable format for analysis?
a. Data Wrangling
b. Data Collection
c. Data Interpretation
d. Data Transformation

Question 38: How can you check for outliers in a data set?
a. Histogram
b. Boxplot
c. Scatterplot
d. All of the above

Question 39: Which function from the readr package is designed to import a CSV file and returns the output as a tibble?
a. read.table()
b. read.csv()
c. fread()
d. read_csv()
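Worked example (Question 31): one common IQR rule for flagging outliers before deciding whether they are data entry errors; the values, and the 1.5 multiplier, are conventional choices assumed for illustration.

    # Made-up vector with one suspicious entry (95)
    x <- c(10, 12, 11, 13, 12, 95)

    q   <- quantile(x, c(0.25, 0.75))
    iqr <- IQR(x)

    # Flag points beyond 1.5 * IQR from the quartiles
    outliers <- x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
    x[!outliers]   # values kept after removing flagged points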
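Worked example (Questions 33 and 38): str() prints the structure of an object, and a boxplot flags points beyond the whiskers as potential outliers. The built-in mtcars dataset is used so the sketch runs anywhere.

    # Column names, types, and a preview of the values (Question 33, answer b)
    str(mtcars)

    # Points drawn beyond the whiskers are candidate outliers (Question 38)
    boxplot(mtcars$hp, main = "Horsepower")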
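Worked example (Question 39): read_csv() from readr parses a CSV and returns a tibble rather than a base data frame; "data.csv" is a placeholder path.

    library(readr)

    # Returns a tibble (answer d); "data.csv" is a placeholder path
    df <- read_csv("data.csv")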