Programming for Data Analysis Past Paper Mock 2

Summary

This is a mock past paper for a programming for data analysis course. The paper includes questions and answers on data validation, descriptive analytics, handling outliers, and more. Topics covered include CSV imports, SQL queries, hypothesis testing, and data visualization.

Full Transcript


Course Material: Programming for Data Analysis (092024-KLT)

Exam Details:
Start Time: Wednesday, 27 November 2024, 11:00 AM
End Time: Wednesday, 27 November 2024, 11:27 AM
Duration: 26 minutes 33 seconds
Status: Completed

Questions and Answers:

1. Purpose of Data Validation
Q: Which of the following best describes the purpose of data validation in the data analysis process?
Options:
a) To ensure the data is visually appealing
b) To verify the accuracy and quality of the data before analysis
c) To create visual representations of the data
d) To perform statistical tests on the data
Correct Answer: b) To verify the accuracy and quality of the data before analysis
Status: Correct

2. Purpose of Descriptive Analytics
Q: What is the main purpose of descriptive analytics?
Options:
a) Test hypotheses
b) Predict future trends
c) Summarize and describe historical data
d) Clean and preprocess data
Correct Answer: c) Summarize and describe historical data
Status: Incorrect

3. Data Variance
Q: What does a high variance in data indicate?
Options:
a) Data points are spread out widely
b) Data points are clustered closely
c) Data is not normally distributed
d) Data contains missing values
Correct Answer: a) Data points are spread out widely
Status: Incorrect

4. Handling Outliers
Q: Which of the following is the most appropriate method for handling outliers when they are due to data entry errors?
Options:
a) Log transformation of the data to reduce the influence of outliers
b) Remove the outliers based on a predefined threshold (e.g., Z-score or IQR)
c) Winsorization to cap extreme values at a certain percentile
d) Keep the outliers and run the analysis as usual
Correct Answer: b) Remove the outliers based on a predefined threshold (e.g., Z-score or IQR)
Status: Correct

5. CSV Import Function
Q: Which function from the readr package is designed to import a CSV file and returns the output as a tibble?
Options:
a) fread()
b) read.csv()
c) read.table()
d) read_csv()
Correct Answer: d) read_csv()
Status: Incorrect

6. SQL Database Query
Q: In the context of reading data from an SQL database into R, which function from the DBI package is typically used to execute SQL queries and retrieve data into R as a data frame?
Options:
a) dbConnect()
b) dbSendQuery()
c) dbWriteTable()
d) dbGetQuery()
Correct Answer: d) dbGetQuery()
Status: Correct

7. Hypothesis Testing Decision
Q: In hypothesis testing, if the p-value is less than the significance level (α), what is the correct decision?
Options:
a) Increase the sample size to reduce the p-value
b) Reject the alternative hypothesis
c) Reject the null hypothesis
d) Fail to reject the null hypothesis
Correct Answer: c) Reject the null hypothesis
Status: Incorrect

8. Handling Duplicate Rows
Q: When cleaning a dataset with duplicate rows, which of the following is the most appropriate first step?
Options:
a) Drop all rows that appear more than once without any further checks
b) Check for duplicates only in key columns (such as ID or transaction date) and remove those
c) Aggregate the duplicates by calculating summary statistics for numerical columns
d) Sort the dataset by a key variable and remove rows that are identical in all columns
Correct Answer: d) Sort the dataset by a key variable and remove rows that are identical in all columns
Status: Incorrect

9. Classification Model Evaluation
Q: Which of the following metrics is most relevant when evaluating the performance of a predictive model in the context of classification tasks?
Options:
a) Mean Squared Error (MSE)
b) Area Under the Receiver Operating Characteristic Curve (AUC-ROC)
c) F-statistic
d) R-squared value
Correct Answer: b) Area Under the Receiver Operating Characteristic Curve (AUC-ROC)
Status: Incorrect

10.
Big Data Velocity Challenge
Q: How does the challenge of Velocity in big data influence the design of data architecture, especially when considering the requirements for near-real-time analytics?
Options:
a) It leads to a focus on archival storage solutions for long-term historical data retention
b) It promotes the implementation of event-driven architectures that can process streams of data as they arrive
c) It necessitates the use of batch processing systems that handle data at scheduled intervals for consistency
d) It encourages the integration of artificial intelligence only after data has been processed through traditional methods
Correct Answer: b) It promotes the implementation of event-driven architectures that can process streams of data as they arrive
Status: Correct

11. Diagnostic Analysis Methods
Q: In the realm of data analysis types, which methodology would most likely be utilized in diagnostic analysis to identify the underlying causes of a decrease in sales?
Options:
a) Time series forecasting using ARIMA models to predict future sales trends
b) Cluster analysis to segment customers based on purchasing behavior
c) Exploratory data analysis (EDA) techniques, including correlation analysis and hypothesis testing
d) Optimization algorithms to suggest pricing strategies for maximizing sales
Correct Answer: c) Exploratory data analysis (EDA) techniques, including correlation analysis and hypothesis testing
Status: Correct

12. Big Data Volume Management
Q: Considering the challenges of Volume, what technique is most effective for ensuring that analytical systems can scale dynamically to handle fluctuating data loads?
Options:
a) Utilizing cloud-based platforms that provide on-demand scalability and flexible resource allocation
b) Restricting data collection to only essential metrics to keep storage requirements manageable
c) Optimizing SQL queries for faster execution times without addressing data size concerns
d) Implementing monolithic database architectures that can manage larger data sets efficiently
Correct Answer: a) Utilizing cloud-based platforms that provide on-demand scalability and flexible resource allocation
Status: Incorrect

13. Outlier Detection Methods
Q: How can you check for outliers in a data set?
Options:
a) Scatterplot
b) Histogram
c) Boxplot
d) All of the above
Correct Answer: d) All of the above
Status: Correct

14. Excel File Import
Q: When importing data from an Excel file using the readxl package's read_excel() function, which of the following is NOT true?
Options:
a) It can read both .xls and .xlsx formats
b) read_excel() automatically detects the data types of each column
c) You can specify the sheet to be read by using the sheet argument
d) read_excel() can read data from a password-protected Excel file
Correct Answer: d) read_excel() can read data from a password-protected Excel file
Status: Incorrect

15. Null Hypothesis Rejection
Q: In statistical testing, what does it mean when the null hypothesis (H₀) is rejected?
Options:
a) The alternative hypothesis is supported by the data
b) The statistical test was not performed correctly
c) There is evidence to suggest that the null hypothesis is true
d) The null hypothesis is accepted, and the alternative hypothesis is rejected
Correct Answer: a) The alternative hypothesis is supported by the data
Status: Incorrect

16. Data Filtering
Q: You have a dataset with columns id, name, and sales. If you want to keep only the rows where sales is within the top 10 highest values, which code would you use?
Options:
a) df %>% arrange(desc(sales)) %>% top_n(10)
b) df %>% filter(rank(desc(sales)) <= 10)
c) df %>% filter(sales %in% max(sales))
d) df %>% arrange(desc(sales)) %>% slice_max(sales, n = 10)
Correct Answer: d) df %>% arrange(desc(sales)) %>% slice_max(sales, n = 10)
Status: Incorrect

17. Analysis Integration
Q: Which of the following scenarios best illustrates the integration of both predictive and prescriptive analysis in decision-making?
Options:
a) Predicting customer churn based on historical data and developing a report on past customer behavior
b) Forecasting demand for a new product and recommending optimal inventory levels to maximize profit
c) Reviewing previous marketing campaigns to summarize their success
d) Analyzing customer demographics and summarizing findings in a dashboard
Correct Answer: b) Forecasting demand for a new product and recommending optimal inventory levels to maximize profit
Status: Incorrect

18. Discrete Data Visualization
Q: If you have discrete group data, such as months of the year, age groups, shoe sizes, and animals, which is the best chart to present it?
Options:
a) Scatter plot
b) Histogram
c) Boxplot
d) Bar chart
Correct Answer: d) Bar chart
Status: Correct

19. Ordinal Variables
Q: What type of variable is measured on an ordinal scale?
Options:
a) Customer satisfaction rating (1 to 5)
b) Number of transactions
c) Temperature in Celsius
d) Gender
Correct Answer: a) Customer satisfaction rating (1 to 5)
Status: Incorrect

20. Sales Decline Analysis
Q: Which of the following methods would most likely be employed in diagnostic analysis to understand why a product's sales have declined over the past six months?
Options:
a) Conducting regression analysis to predict future sales
b) Implementing A/B testing on different marketing strategies
c) Utilizing time series analysis to assess sales trends
d) Performing a cohort analysis to identify changing customer preferences
Correct Answer: d) Performing a cohort analysis to identify changing customer preferences
Status: Incorrect

21. Distribution Types
Q: A distribution has most scores collected about the center and is symmetrical about its midpoint. What type of distribution is this?
Options:
a) Functional
b) Bimodal
c) Normal
d) Monotonic
Correct Answer: c) Normal
Status: Incorrect

22. Data Reading Functions
Q: You are using the scan() function to read data from a text file. Which of the following is TRUE about scan() compared to read.table()?
Options:
a) scan() is more flexible, allowing data to be read into vectors, lists, or matrices
b) scan() is optimized for reading data directly into data frames
c) scan() is faster for reading large datasets with mixed data types than read.table()
d) scan() is the preferred method for reading structured data with headers
Correct Answer: a) scan() is more flexible, allowing data to be read into vectors, lists, or matrices
Status: Incorrect

23. Bar Graph Usage
Q: A bar graph is the best graph to use when:
Options:
a) Your dependent variable was measured on at least a ratio scale
b) You want to show ordered trends in your data
c) Your independent and dependent variables are both continuous
d) Your independent variable is categorical
Correct Answer: d) Your independent variable is categorical
Status: Incorrect

24. R Data Frame Structure
Q: What does str(df) do in R?
Options:
a) Returns the structure of the data frame
b) Provides a statistical summary of the data frame
c) Converts the data frame to string format
d) Displays the first few rows of the data frame
Correct Answer: a) Returns the structure of the data frame
Status: Incorrect

25.
Tibble vs Data Frame
Q: What is a key difference between a tibble and a traditional data frame in R?
Options:
a) Tibbles cannot store character data
b) Tibbles are slower to process compared to data frames
c) Tibbles do not support row names, while data frames do
d) Tibbles automatically sort columns alphabetically
Correct Answer: c) Tibbles do not support row names, while data frames do
Status: Correct

26. Categorical Variable Analysis
Q: When performing exploratory data analysis (EDA), which statistical method would you use to examine the relationship between two categorical variables in a large dataset?
Options:
a) Pearson correlation
b) Chi-square test of independence
c) T-test
d) ANOVA
Correct Answer: b) Chi-square test of independence
Status: Incorrect

27. Marketing Strategy Analysis
Q: In a retail environment, which data analysis type would be most effective for developing a personalized marketing strategy based on customer behavior and purchase history?
Options:
a) Prescriptive analysis
b) Descriptive analysis
c) Predictive analysis
d) Diagnostic analysis
Correct Answer: a) Prescriptive analysis
Status: Incorrect

28. Data Interpretation Phase
Q: In the data analysis process, which of the following is typically done during the data interpretation phase?
Options:
a) Analyzing data to derive insights and conclusions
b) Collecting raw data from various sources
c) Developing a model to make predictions
d) Cleaning the data to remove inconsistencies
Correct Answer: a) Analyzing data to derive insights and conclusions
Status: Correct

29. Pipe Operator Usage
Q: The pipe operator (%>%) in dplyr is used to:
Options:
a) Chain together multiple functions
b) Group data by a column
c) Assign a value to a variable
d) Filter rows of a dataset
Correct Answer: a) Chain together multiple functions
Status: Correct

30. Imbalanced Classes
Q: When handling imbalanced classes in a dataset, which of the following is a proper data cleaning technique?
Options:
a) Oversample the minority class or undersample the majority class
b) Normalize the target variable to reduce class imbalance
c) Remove instances of the minority class to balance the dataset
d) Drop the target variable and focus only on the predictors
Correct Answer: a) Oversample the minority class or undersample the majority class
Status: Incorrect

31. CSV Import Settings
Q: When importing a CSV file using read.csv() in R, which argument would you use to prevent automatic conversion of strings into factors?
Options:
a) stringsAsFactors = FALSE
b) convertStrings = FALSE
c) as.is = TRUE
d) stringsAsFactor = FALSE
Correct Answer: a) stringsAsFactors = FALSE
Status: Correct

32. Categorical Variables Handling
Q: In a large dataset with multiple categorical variables, which technique is most appropriate to handle high-cardinality categorical variables during data cleaning?
Options:
a) Drop categorical variables with high cardinality from the dataset
b) Convert categorical variables to binary variables using one-hot encoding
c) Group rare categories together into an "Other" category
d) Replace each category with its corresponding mean of the target variable
Correct Answer: c) Group rare categories together into an "Other" category
Status: Incorrect

33. Type I Error
Q: A Type I error in hypothesis testing occurs when:
Options:
a) A statistical test fails to detect a significant effect
b) The null hypothesis is accepted when it is false
c) The null hypothesis is incorrectly rejected when it is actually true
d) The alternative hypothesis is accepted based on insufficient evidence
Correct Answer: c) The null hypothesis is incorrectly rejected when it is actually true
Status: Correct

34. Data Summarization
Q: Which of the following would be the correct way to compute the sum of a column named sales in a dataset?
Options:
a) summarize(total_sales = sum(sales))
b) select(total_sales = sum(sales))
c) mutate(total_sales = sum(sales))
d) group_by(total_sales = sum(sales))
Correct Answer: a) summarize(total_sales = sum(sales))
Status: Incorrect

35. Data Filtering Code
Q: What does the following code do? data %>% group_by(category) %>% filter(rank == min(rank))
Options:
a) Creates a new column that ranks the rows
b) Removes rows where rank is less than the minimum
c) Filters rows globally where rank is the minimum
d) Filters rows within each group where rank equals the group minimum
Correct Answer: d) Filters rows within each group where rank equals the group minimum
Status: Incorrect

36. Statistical Hypothesis
Q: When designing an experiment to test the impact of diet on health outcomes, which of the following would be a statistical hypothesis?
Options:
a) A diet high in sugar will have a negative impact on heart health
b) Healthy diets consist primarily of fruits and vegetables
c) The mean cholesterol level of participants on a high-fiber diet is different from that of participants on a low-fiber diet
d) A healthy diet improves well-being
Correct Answer: c) The mean cholesterol level of participants on a high-fiber diet is different from that of participants on a low-fiber diet
Status: Correct

37. Big Data Variety
Q: Which characteristic of Variety in big data best describes the challenge organizations face when integrating data from multiple sources?
Options:
a) The requirement for real-time processing of streaming data from IoT devices
b) The incorporation of unstructured data alongside structured data from various formats
c) The high speed at which data is generated across different platforms and devices
d) The need to standardize data types for analysis regardless of the source format
Correct Answer: b) The incorporation of unstructured data alongside structured data from various formats
Status: Incorrect

38. Missing Values Handling
Q: What is the purpose of na.omit() in R?
Options:
a) Add missing values
b) Remove rows containing missing values
c) Calculate missing value percentage
d) Replace missing values with zeros
Correct Answer: b) Remove rows containing missing values
Status: Incorrect

39. Data Analysis Process
Q: In the data analysis process, which step typically involves transforming raw data into a more suitable format for analysis?
Options:
a) Data Collection
b) Data Interpretation
c) Data Transformation
d) Data Wrangling
Correct Answer: c) Data Transformation
Status: Incorrect

40. Data Normalization
Q: What does the term "data normalization" mean?
Options:
a) Standardizing data to a common scale
b) Sorting data values in ascending order
c) Reducing data dimensionality
d) Removing duplicates from the dataset
Correct Answer: a) Standardizing data to a common scale
Status: Incorrect

Final Score: 11/40 (27.5%)
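Several of the topics above can be illustrated with short R sketches. First, the outlier checks behind Q4 and Q13: a minimal example, using made-up sales figures, of flagging points with the IQR rule and with boxplot.stats() (which applies the same fence without drawing the plot):

```r
# Hypothetical sample with one obvious data-entry error (Q4/Q13).
sales <- c(12, 15, 14, 13, 16, 15, 14, 500)

# IQR rule: flag points beyond 1.5 * IQR from the quartiles.
q <- quantile(sales, c(0.25, 0.75))
iqr <- IQR(sales)
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
outliers <- sales[sales < lower | sales > upper]

# boxplot.stats() uses the same 1.5 * IQR fence (on fivenum hinges).
print(outliers)                  # 500
print(boxplot.stats(sales)$out)  # 500
```

A Z-score threshold (e.g. |z| > 3) is the other common rule named in Q4's answer.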
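The decision rule in Q7 (and the hypothesis wording in Q36) can be sketched with a two-sample t-test on simulated cholesterol values. The group means and sample sizes here are invented purely for illustration:

```r
# Hypothetical two-group comparison (Q7/Q36): reject H0 when p < alpha.
set.seed(42)
high_fiber <- rnorm(30, mean = 190, sd = 10)  # simulated cholesterol levels
low_fiber  <- rnorm(30, mean = 205, sd = 10)

result <- t.test(high_fiber, low_fiber)
alpha  <- 0.05

decision <- if (result$p.value < alpha) {
  "Reject the null hypothesis"        # Q7's correct answer (c)
} else {
  "Fail to reject the null hypothesis"
}
print(result$p.value)
print(decision)
```

Rejecting H₀ here means the data support the alternative hypothesis (Q15), and doing so when H₀ is actually true would be a Type I error (Q33).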
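Q8's answer d (sort by a key, then drop rows identical in all columns) can be done in base R with order() and duplicated(); the transactions table below is made up:

```r
# Hypothetical transactions with one fully duplicated row (Q8).
df <- data.frame(
  id    = c(1, 2, 2, 3),
  name  = c("Ann", "Bob", "Bob", "Cia"),
  sales = c(100, 250, 250, 75)
)

# Sort by a key column, then remove rows identical in all columns.
df_sorted <- df[order(df$id), ]
deduped   <- df_sorted[!duplicated(df_sorted), ]
print(nrow(deduped))  # 3
```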
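Q16's correct answer uses dplyr's slice_max(); a runnable sketch with invented sales figures (note slice_max() already returns the rows sorted by the ordering column, so the arrange() step in the answer is optional):

```r
library(dplyr)

# Hypothetical sales table (Q16): keep the rows with the 10 highest sales.
df <- data.frame(
  id    = 1:15,
  name  = paste0("store_", 1:15),
  sales = c(23, 87, 45, 12, 99, 56, 78, 34, 61, 90, 15, 72, 38, 84, 50)
)

top10 <- df %>% slice_max(sales, n = 10)
print(nrow(top10))  # 10
```

By default slice_max() keeps ties, so it can return more than n rows when values at the cutoff are equal.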
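The scan() vs read.table() contrast in Q22 is easy to see on a throwaway temp file: scan() returns the raw values as a vector, while read.table() parses the same file into a data frame.

```r
# Write a small temporary file to contrast scan() and read.table() (Q22).
path <- tempfile(fileext = ".txt")
writeLines(c("10 20 30", "40 50 60"), path)

v  <- scan(path)        # flexible: a plain numeric vector
df <- read.table(path)  # structured: a data frame

print(v)          # 10 20 30 40 50 60
print(class(df))  # "data.frame"
```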
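For Q26, a minimal chi-square test of independence on a made-up 2x2 contingency table (segment vs churn labels are invented):

```r
# Hypothetical contingency table of two categorical variables (Q26).
tab <- matrix(c(30, 10, 20, 40), nrow = 2,
              dimnames = list(segment = c("A", "B"),
                              churned = c("yes", "no")))

test <- chisq.test(tab)
print(test$p.value)  # small p-value -> evidence of an association
```

With table() you would build the same kind of contingency table directly from two categorical columns of a data frame.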
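Q31's stringsAsFactors argument, demonstrated on a temporary CSV (note that since R 4.0 the default is already FALSE, so passing it is mainly relevant for older code):

```r
# Demonstrate stringsAsFactors (Q31) on a temporary CSV.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:2, name = c("Ann", "Bob")),
          path, row.names = FALSE)

df <- read.csv(path, stringsAsFactors = FALSE)
print(class(df$name))  # "character", not "factor"
```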
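The dplyr snippets in Q34 and Q35 can be run together on a small invented table: summarize() computes the column sum, and group_by() plus filter() keeps, within each group, the rows whose rank equals the group minimum.

```r
library(dplyr)

# Hypothetical data for Q34/Q35.
data <- data.frame(
  category = c("a", "a", "b", "b"),
  rank     = c(2, 1, 3, 3),
  sales    = c(10, 20, 30, 40)
)

# Q35: per-group filter; ties at the group minimum are all kept.
best <- data %>% group_by(category) %>% filter(rank == min(rank))

# Q34: total of the sales column.
totals <- data %>% summarize(total_sales = sum(sales))

print(best)    # 3 rows: (a, 1) plus both (b, 3) rows
print(totals)  # total_sales = 100
```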
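Q38's na.omit() in one line, on a toy data frame with a single missing value:

```r
# na.omit() removes rows containing missing values (Q38, answer b).
df <- data.frame(id = 1:4, sales = c(100, NA, 250, 75))
clean <- na.omit(df)
print(nrow(clean))  # 3
```

Alternatives such as is.na() (to count or locate missings) or replacement with a summary statistic are used when dropping rows would lose too much data.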
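Finally, Q40's sense of normalization (standardizing to a common scale) is what scale() does by default: it converts a numeric vector to z-scores with mean 0 and standard deviation 1.

```r
# Q40: standardize to a common scale via z-scores, (x - mean) / sd.
x <- c(10, 20, 30, 40, 50)
z <- as.numeric(scale(x))

print(round(mean(z), 10))  # 0
print(round(sd(z), 10))    # 1
```

Min-max scaling to [0, 1] is the other common meaning of "normalization"; which one applies depends on context.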
