PFDA Mock PDF
Asia Pacific Institute of Information Technology (APIIT)
Summary
This document is a mock exam paper on data analysis. It briefly outlines ecosystem components, types of data analysis, hypothesis testing, data analysis techniques, and the visualizations relevant to the field, followed by practice questions.
Full Transcript
**ECOSYSTEM components**

- Sensing - evaluating the quality of data
- Collection - data must be collected
- Wrangling - transforming data into a more useful format
- Analysis
- Storage

**Types of Data Analysis**

- Descriptive - what happened?
- Diagnostic - why did it happen?
- Predictive - what will happen in the future?
- Prescriptive - what action should be taken?

**Hypothesis**

- Simple - the relationship between two variables
- Complex - the relationship among more than two variables
- Null - no difference or relationship between the variables
- Alternative - favors one outcome over another
- Logical - a proposed explanation based on limited evidence
- Statistical - an examination of a portion of a population or a statistical model

**Data Analysis Techniques**

**[Statistical Analysis]**

- Descriptive Analysis
- Dispersion Analysis
- Regression Analysis
- Factor Analysis
- Discriminant Analysis
- Time Series Analysis

**[AI and Machine Learning]**

- Artificial Neural Networks
- Decision Trees
- Evolutionary Programming
- Fuzzy Logic

**[Visualization]**

- Bar chart
- Line chart
- Area chart
- Pie chart
- Bubble chart
- Scatter plot

**Techniques based on Mathematics and Statistics**

- **Descriptive Analysis:** Descriptive Analysis considers historical data and Key Performance Indicators, and describes performance against a chosen benchmark. It takes into account past trends and how they might influence future performance.
  *Eg:* KPIs such as year-on-year percentage sales growth, revenue per customer, and the average time customers take to pay bills.
- **Dispersion Analysis:** Dispersion is the extent to which a data set is spread out. This technique allows data analysts to determine the variability of the factors under study.
  *Eg:* Variance, standard deviation, and interquartile range.
- **Regression Analysis:** This technique models the relationship between a dependent variable and one or more independent variables.
  *Eg:* Stock prediction, customer behavior prediction, sales forecasting.
- **Factor Analysis:** This technique helps determine whether any relationship exists among a set of variables. The process reveals latent factors that describe the patterns in the relationships among the original variables. Factor Analysis also serves as a stepping stone to clustering and classification procedures.
  *Eg:* Restaurant menus on a university campus versus in a high-end shopping complex.
- **Discriminant Analysis:** A classification technique in data mining. It distinguishes the points in different groups based on variable measurements; in simple terms, it identifies what makes two groups different from one another, which helps classify new items.
- **Time Series Analysis:** In this kind of analysis, measurements are taken across time, giving a collection of ordered data known as a time series.
  *Eg:* Weather reports, house prices over time, stock prices.

**Choosing a chart**

- Line chart / scatter plot / point chart: two numerical (continuous) variables
- Bar chart / box plot: one categorical and one numerical variable
- Pie chart: parts of a whole
- Histogram: one numerical variable
- Bubble chart: at least three numerical variables
- Density chart: numerical variables within a time period

**Question 1:**
*In statistical testing, what does it mean when the null hypothesis (H₀) is rejected?*

- a. The statistical test was not performed correctly.
- **b. The alternative hypothesis is supported by the data.**
- c. The null hypothesis is accepted, and the alternative hypothesis is rejected.
- d. There is evidence to suggest that the null hypothesis is true.

**Question 2:**
*A bar graph is the best graph to use when:*

- a. You want to show ordered trends in your data.
- b. Your independent and dependent variables are both continuous.
- c. Your dependent variable was measured on at least a ratio scale.
- **d. Your independent variable is categorical.**

**Question 3:**
*If you have discrete group data, such as months of the year, age groups, shoe sizes, and animals, which chart is best?*

- a. Scatter
- b. Boxplot
- c. Histogram
- **d. Bar**

**Question 4:**
*What does the following code do?*

    data %>%
      group_by(category) %>%
      filter(rank == min(rank))

- **a. Filters rows within each group where rank equals the group minimum.**
- b. Removes rows where rank is less than the minimum.
- c. Filters rows globally where rank is the minimum.
- d. Creates a new column that ranks the rows.

**Question 5:**
*In the data analysis process, which of the following is typically done during the data interpretation phase?*

- a. Developing a model to make predictions.
- b. Cleaning the data to remove inconsistencies.
- **c. Analyzing data to derive insights and conclusions.**
- d. Collecting raw data from various sources.

**Question 6:**
*When cleaning a dataset with duplicate rows, which of the following is the most appropriate first step?*

- **a. Sort the dataset by a key variable and remove rows that are identical in all columns.**
- b. Aggregate the duplicates by calculating summary statistics (e.g., mean, sum) for numerical columns.
- c. Drop all rows that appear more than once without any further checks.
- d. Check for duplicates only in key columns (such as ID or transaction date) and remove those.

**Question 7:**
*Which of the following methods would most likely be employed in diagnostic analysis to understand why a product's sales have declined over the past six months?*

- **a. Performing a cohort analysis to identify changing customer preferences.**
- b. Utilizing time series analysis to assess sales trends.
- c. Implementing A/B testing on different marketing strategies.
- d. Conducting regression analysis to predict future sales.
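The group-wise filter from Question 4 can be tried on a small example; the data frame below is made up for illustration:

```r
library(dplyr)

# Made-up data: two categories, each with ranked rows
data <- data.frame(
  category = c("A", "A", "B", "B", "B"),
  rank     = c(2, 1, 3, 1, 2)
)

# Within each category, keep only the row(s) whose rank equals that group's minimum
result <- data %>%
  group_by(category) %>%
  filter(rank == min(rank)) %>%
  ungroup()

print(result)  # one row per category, each with rank 1
```

Because `min(rank)` is evaluated per group under `group_by()`, each category keeps its own minimum-rank rows; without `group_by()`, the filter would use the global minimum instead (option c's behavior).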
**Question 8:**
*In hypothesis testing, if the p-value is less than the significance level (α), what is the correct decision?*

- a. Fail to reject the null hypothesis.
- **b. Reject the null hypothesis.**
- c. Reject the alternative hypothesis.
- d. Increase the sample size to reduce the p-value.

**Question 9:**
*Which of the following best describes the purpose of data validation in the data analysis process?*

- a. To create visual representations of the data.
- b. To ensure the data is visually appealing.
- **c. To verify the accuracy and quality of the data before analysis.**
- d. To perform statistical tests on the data.

**Question 10:**
*Which of the following would be the correct way to compute the sum of a column named sales in a dataset?*

- a. `mutate(total_sales = sum(sales))`
- b. `group_by(total_sales = sum(sales))`
- **c. `summarize(total_sales = sum(sales))`**
- d. `select(total_sales = sum(sales))`

**Question 11:**
*In the realm of data analysis types, which of the following methodologies would most likely be utilized in diagnostic analysis to identify the underlying causes of a decrease in sales?*

- a. Optimization algorithms to suggest pricing strategies for maximizing sales.
- b. Time series forecasting using ARIMA models to predict future sales trends.
- c. Cluster analysis to segment customers based on purchasing behavior.
- **d. Exploratory data analysis (EDA) techniques, including correlation analysis and hypothesis testing, to identify relationships.**

**Question 12:**
*In the context of reading data from an SQL database into R, which function from the DBI package is typically used to execute SQL queries and retrieve data into R as a data frame?*

- a. `dbWriteTable()`
- **b. `dbGetQuery()`**
- c. `dbSendQuery()`
- d. `dbConnect()`

**Question 13:**
*The pipe operator (`%>%`) in dplyr is used to:*

- **a. Chain together multiple functions.**
- b. Group data by a column.
- c. Filter rows of a dataset.
- d. Assign a value to a variable.
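Questions 10 and 13 can be seen together in a short sketch: the pipe chains the steps, and `summarize()` collapses the data to the column sum (the sales figures are invented):

```r
library(dplyr)

# Invented sales data
df <- data.frame(region = c("North", "South", "North"),
                 sales  = c(100, 250, 150))

# The pipe passes df into summarize(), which collapses it to a single total row
total <- df %>% summarize(total_sales = sum(sales))
print(total$total_sales)  # 500

# Chaining group_by() first yields one total per region
by_region <- df %>%
  group_by(region) %>%
  summarize(total_sales = sum(sales))
```

By contrast, `mutate(total_sales = sum(sales))` (option a of Question 10) would attach the same total to every row rather than collapsing the data frame.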
**Question 14:**
*In a retail environment, which data analysis type would be most effective for developing a personalized marketing strategy based on customer behavior and purchase history?*

- a. Descriptive analysis, to summarize customer demographics.
- **b. Prescriptive analysis, to recommend specific marketing actions tailored to each customer.**
- c. Predictive analysis, to forecast future buying patterns.
- d. Diagnostic analysis, to identify reasons for customer churn.

**Question 15:**
*Which of the following metrics is most relevant when evaluating the performance of a predictive model in the context of classification tasks?*

- a. F-statistic
- **b. Area Under the Receiver Operating Characteristic Curve (AUC-ROC)**
- c. R-squared value
- d. Mean Squared Error (MSE)

**Question 16:**
*What does a high variance in data indicate?*

- a. Data is not normally distributed.
- b. Data contains missing values.
- c. Data points are clustered closely.
- **d. Data points are spread out widely.**

**Question 17:**
*When importing a CSV file using `read.csv()` in R, which argument would you use to prevent automatic conversion of strings into factors?*

- a. `as.is = TRUE`
- **b. `stringsAsFactors = FALSE`**
- c. `stringsAsFactor = FALSE`
- d. `convertStrings = FALSE`

**Question 18:**
*What is the purpose of `na.omit()` in R?*

- **a. Remove rows containing missing values.**
- b. Add missing values.
- c. Replace missing values with zeros.
- d. Calculate missing value percentage.

**Question 19:**
*How does the challenge of Velocity in big data influence the design of data architecture, especially when considering the requirements for near-real-time analytics?*

- **a. It promotes the implementation of event-driven architectures that can process streams of data as they arrive.**
- b. It leads to a focus on archival storage solutions for long-term historical data retention.
- c. It necessitates the use of batch processing systems that handle data at scheduled intervals for consistency.
- d. It encourages the integration of artificial intelligence only after data has been processed through traditional methods.

**Question 20:**
*What type of variable is measured on an ordinal scale?*

- a. Gender
- **b. Customer satisfaction rating (1 to 5)**
- c. Number of transactions
- d. Temperature in Celsius

**Question 21:**
*Which of the following scenarios best illustrates the integration of both predictive and prescriptive analysis in decision-making?*

- a. Analyzing customer demographics and summarizing findings in a dashboard.
- b. Reviewing previous marketing campaigns to summarize their success.
- **c. Forecasting demand for a new product and recommending optimal inventory levels to maximize profit.**
- d. Predicting customer churn based on historical data and developing a report on past customer behavior.

**Question 22:**
*When importing data from an Excel file using the **readxl** package's `read_excel()` function, which of the following is NOT true?*

- a. You can specify the sheet to be read by using the `sheet` argument.
- b. `read_excel()` automatically detects the data types of each column.
- c. It can read both .xls and .xlsx formats.
- **d. `read_excel()` can read data from a password-protected Excel file.**

**Question 23:**
*What is the main purpose of descriptive analytics?*

- a. Predict future trends
- b. Test hypotheses
- **c. Summarize and describe historical data**
- d. Clean and preprocess data

**Question 24:**
*When performing exploratory data analysis (EDA), which statistical method would you use to examine the relationship between two categorical variables in a large dataset, and how would you interpret the results?*

- a. ANOVA, to analyze the variance between multiple categorical groups and their respective means.
- b. T-test, to compare the means of the two variables and determine if the difference is statistically significant.
- c. Pearson correlation, as it captures the linear relationship between the two variables.
- **d. Chi-square test of independence, to assess whether an association exists between the two variables, with a significant result indicating dependence.**

**Question 25:**
*In a large dataset with multiple categorical variables, which of the following techniques is most appropriate to handle high-cardinality categorical variables during data cleaning?*

- a. Drop categorical variables with high cardinality from the dataset.
- b. Replace each category with its corresponding mean of the target variable.
- c. Convert categorical variables to binary variables using one-hot encoding.
- **d. Group rare categories together into an "Other" category.**

**Question 26:**
*You are using the `scan()` function to read data from a text file. Which of the following is TRUE about `scan()` compared to `read.table()`?*

- **a. `scan()` is more flexible, allowing data to be read into vectors, lists, or matrices.**
- b. `scan()` is faster for reading large datasets with mixed data types than `read.table()`.
- c. `scan()` is optimized for reading data directly into data frames.
- d. `scan()` is the preferred method for reading structured data with headers.

**Question 27:**
*When handling imbalanced classes in a dataset, which of the following is a proper data cleaning technique to address the imbalance?*

- **a. Oversample the minority class or undersample the majority class.**
- b. Drop the target variable and focus only on the predictors.
- c. Remove instances of the minority class to balance the dataset.
- d. Normalize the target variable to reduce class imbalance.

**Question 28:**
*Which characteristic of Variety in big data best describes the challenge organizations face when integrating data from multiple sources?*

- a. The high speed at which data is generated across different platforms and devices.
- b. The need to standardize data types for analysis regardless of the source format.
- c. The requirement for real-time processing of streaming data from IoT devices.
- **d. The incorporation of unstructured data alongside structured data from various formats such as text, audio, video, and social media feeds.**

**Question 29:**
*You have a dataset df with columns id, name, and sales. If you want to keep only the rows where sales is within the top 10 highest values, which code would you use?*

- a. `df %>% filter(rank(desc(sales)) <= 10)`
- b. `df %>% arrange(desc(sales)) %>% top_n(10)`
- **c. `df %>% arrange(desc(sales)) %>% slice_max(sales, n = 10)`**
- d. `df %>% filter(sales %in% max(sales))`

**Question 30:**
*What is a key difference between a tibble and a traditional data frame in R?*

- **a. Tibbles do not support row names, while data frames do.**
- b. Tibbles cannot store character data.
- c. Tibbles automatically sort columns alphabetically.
- d. Tibbles are slower to process compared to data frames.

**Question 31:**
*Which of the following is the most appropriate method for handling outliers when they are due to data entry errors rather than true variability?*

- a. Log transformation of the data to reduce the influence of outliers.
- **b. Remove the outliers based on a predefined threshold (e.g., Z-score or IQR).**
- c. Keep the outliers and run the analysis as usual.
- d. Winsorization to cap extreme values at a certain percentile.

**Question 32:**
*Considering the challenges of Volume, what technique is most effective for ensuring that analytical systems can scale dynamically to handle fluctuating data loads?*

- a. Optimizing SQL queries for faster execution times without addressing data size concerns.
- **b. Utilizing cloud-based platforms that provide on-demand scalability and flexible resource allocation.**
- c. Restricting data collection to only essential metrics to keep storage requirements manageable.
- d. Implementing monolithic database architectures that can manage larger data sets efficiently.

**Question 33:**
*What does `str(df)` do in R?*

- a. Provides a statistical summary of the data frame.
- **b. Returns the structure of the data frame.**
- c. Displays the first few rows of the data frame.
- d. Converts the data frame to string format.

**Question 34:**
*When designing an experiment to test the impact of diet on health outcomes, which of the following would be a statistical hypothesis?*

- a. Healthy diets consist primarily of fruits and vegetables.
- b. A diet high in sugar will have a negative impact on heart health.
- **c. The mean cholesterol level of participants on a high-fiber diet is different from that of participants on a low-fiber diet.**
- d. A healthy diet improves well-being.

**Question 35:**
*A Type I error in hypothesis testing occurs when:*

- a. A statistical test fails to detect a significant effect.
- **b. The null hypothesis is incorrectly rejected when it is actually true.**
- c. The null hypothesis is accepted when it is false.
- d. The alternative hypothesis is accepted based on insufficient evidence.

**Question 36:**
*What does the term "data normalization" mean?*

- a. Reducing data dimensionality.
- b. Removing duplicates from the dataset.
- c. Sorting data values in ascending order.
- **d. Standardizing data to a common scale.**

**Question 37:**
*In the data analysis process, which step typically involves transforming raw data into a more suitable format for analysis?*

- a. Data Wrangling
- b. Data Collection
- c. Data Interpretation
- **d. Data Transformation**

**Question 38:**
*How can you check for outliers in a data set?*

- a. Histogram
- b. Boxplot
- c. Scatterplot
- **d. All of the above**

**Question 39:**
*Which function from the **readr** package is designed to import a CSV file and returns the output as a tibble?*

- a. `read.table()`
- b. `read.csv()`
- c. `fread()`
- **d. `read_csv()`**

**Question 1: Exploratory Data Analysis**

a) **What is the purpose of exploratory data analysis (EDA)?**
The purpose of EDA is to analyze datasets to summarize their main characteristics, often using visual methods. It helps identify patterns, detect anomalies, test hypotheses, and validate assumptions.

b) **List two common techniques used in EDA to detect outliers in a dataset.**

- Boxplots
- Z-score analysis

c) **What is a correlation matrix, and why is it useful in EDA?**
A correlation matrix is a table showing the correlation coefficients between variables. It is useful in EDA for understanding relationships and dependencies between variables.

d) **Why is data cleaning an essential step in EDA?**
Data cleaning ensures the accuracy and quality of the dataset by removing errors, missing values, and inconsistencies, all of which can significantly affect the results of data analysis.

**Question 2: Visualizations**

a) **What chart would you use to visualize the trend in daily passenger numbers on public transport over a year? Explain your choice.**
Use a **line chart**. A line chart is ideal for visualizing trends over time, such as changes in daily passenger numbers throughout the year.

b) **Suggest a suitable chart to compare the average travel time across different types of public transport (e.g., bus, metro, train). Explain your choice.**
Use a **bar chart**. A bar chart effectively compares average values across categorical groups like different types of public transport.

c) **Identify a chart to show the distribution of travel distances among users to see which distance ranges are most common. Justify.**
Use a **histogram**. A histogram visualizes the frequency distribution of numerical data such as travel distances.
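The outlier checks from Question 1(b) and the correlation matrix from Question 1(c) can be sketched in base R; all sample values below are invented:

```r
# Invented sample with one obvious outlier
x <- c(10, 12, 11, 13, 12, 11, 95)

# Z-score analysis: flag points far from the mean (a cutoff of 2 is used here;
# 3 is also common but too strict for a sample this small)
z <- (x - mean(x)) / sd(x)
z_outliers <- x[abs(z) > 2]

# Boxplot rule: flag points beyond 1.5 * IQR outside the quartiles
q <- quantile(x, c(0.25, 0.75))
fence <- 1.5 * (q[2] - q[1])
box_outliers <- x[x < q[1] - fence | x > q[2] + fence]
print(box_outliers)  # 95

# Correlation matrix between two invented numeric variables
m <- cor(data.frame(speed = c(4, 7, 8, 9),
                    dist  = c(2, 10, 16, 20)))
```

Both rules flag 95 here; the boxplot fences come straight from the quartiles, which is why boxplots are the usual visual companion to this check.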
**Question 3: Hypothesis**

a) **What is a hypothesis in the context of data analysis?**
A hypothesis is a testable statement or assumption about a population parameter, such as a mean or proportion.

b) **Explain with an example the difference between a null hypothesis (H₀) and an alternative hypothesis (Hₐ).**

- Null hypothesis (H₀): assumes no effect or no difference (e.g., "There is no difference in average sales between Region A and Region B").
- Alternative hypothesis (Hₐ): opposes the null hypothesis (e.g., "There is a significant difference in average sales between Region A and Region B").

c) **Explain what a p-value is and the concept of statistical significance in the context of hypothesis testing.**
The p-value is the probability of observing data at least as extreme as the sample, assuming the null hypothesis is true. If the p-value is below a chosen significance level (e.g., 0.05), the null hypothesis is rejected, indicating statistical significance.

**Question 4: R Code**

![Screenshot of the R code exercise](media/image2.jpeg)

The iris dataset in R contains 150 rows and 5 variables: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.
Complete the R code below to perform various data manipulation tasks using dplyr:

    # Load necessary library
    library(dplyr)

    # Import the dataset (iris is a built-in dataset in R)
    data("iris")

    # View the first six rows of the dataset
    [i) head(iris)]

    # Add a new column named "Sepal_Ratio" that is the ratio of Sepal.Length to Sepal.Width
    iris <- iris %>% [ii) mutate(] [iii) Sepal_Ratio = Sepal.Length / Sepal.Width)]

    # Remove the column Petal.Width from the dataset
    iris <- iris %>% [iv) select(] [v) -Petal.Width)]

    # Filter rows where Petal.Length is greater than 4
    iris_filtered <- iris %>% [vi) filter(] Petal.Length [vii) >] 4)

    # Calculate the total number of rows in the filtered dataset
    total_rows <- nrow(iris_filtered)

    # Calculate the average Sepal_Ratio for each Species
    species_summary <- iris_filtered %>% group_by(Species) %>% [xi) summarise] (Avg_Sepal_Ratio = [xii) mean](Sepal_Ratio))

    # Arrange the summary table in ascending order of Avg_Sepal_Ratio
    species_summary <- species_summary %>% [xiii) arrange] (Avg_Sepal_Ratio)

    # Count the number of observations for each Species in the filtered dataset
    species_count <- iris_filtered %>% [xiv) group_by] (Species) %>% [xv) summarise] (Count = n())
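A runnable version of the completed exercise follows; the `nrow()` step and the `group_by(Species)` pipeline are my assumptions for the blanks whose labels were lost in the transcript, while the rest comes from the bracketed answers:

```r
library(dplyr)

data("iris")
head(iris)

# i)-iii) add the Sepal_Ratio column
iris <- iris %>% mutate(Sepal_Ratio = Sepal.Length / Sepal.Width)

# iv)-v) drop Petal.Width
iris <- iris %>% select(-Petal.Width)

# vi)-vii) keep rows with Petal.Length greater than 4
iris_filtered <- iris %>% filter(Petal.Length > 4)

# Row count of the filtered data (assumed answer: nrow)
total_rows <- nrow(iris_filtered)

# xi)-xiii) average Sepal_Ratio per Species, in ascending order
species_summary <- iris_filtered %>%
  group_by(Species) %>%                    # assumed for the lost blank
  summarise(Avg_Sepal_Ratio = mean(Sepal_Ratio)) %>%
  arrange(Avg_Sepal_Ratio)

# xiv)-xv) observations per Species after filtering
species_count <- iris_filtered %>%
  group_by(Species) %>%
  summarise(Count = n())

print(total_rows)
```

Because every setosa flower has Petal.Length below 2, only versicolor and virginica survive the filter, so `species_count` has two rows.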