Summary

These are lecture slides on applied statistics as part of the CMSC 106 Research Methods for Computer Science course. The slides cover topics such as visualizing categorical and numerical data, quantitative data analysis, and hypothesis testing. The slides include topical outlines and examples.

Full Transcript

Research Methods for Computer Science (CMSC 106)
Applied Statistics – Part IIA
DEMELO M. LAO
Department of Computer Science

Topical Outline – 1 (F2F)
• RECAP
    - Visualizing Categorical and Numerical Data
    - Measures of Central Tendency and Assessing Its Reliability
    - 5-Number Summary
• Quantitative Data Analysis
• Inferential Statistics
    - Hypothesis Testing

Topical Outline – 2 (Asynchronous Remote Learning)
• Supplementary video lecture materials and Algorithm Analysis
• Assignment 3 – Answering Research Question by Hypothesis Testing

RECAP 1: Visualizing Categorical Data
• Presenting 1 variable
    - Pie chart: shows the contribution of each part to the whole
    - Bar chart: compares categories for emphasis
• Presenting >1 variables
    - Stacked bar chart: variant of the traditional bar chart that handles 2 or more categorical variables
    - Stacked area chart: portrays a part-to-whole relationship over time

RECAP 2: Visualizing Numerical Data
• Presenting 1 variable
    - Histogram: shows the shape/underlying distribution
    - Stem-and-leaf plot: variant of the histogram showing the individual values within each column-bar
    - Box-and-whiskers plot (boxplot): graphical representation of the 5-number summary
• Presenting >1 variables
    - Scatterplot: shows the relationship between 2 variables
    - Bubble chart: shows the relationship between 3 variables
    - Time-series plot (aka line graph): shows the trend over time

RECAP 3: Measures of Central Tendency & Assessing for Reliability
• Mean
    - Standard Deviation (SD)
    - When in different units: Coefficient of Variation (CV)
• Median
    - Inter-quartile Range (Q3 - Q1)
• Mode

RECAP 4: 5-Number Summary
Observations one can draw from viewing a boxplot
• 5-number summary
• Data symmetry
    - Median near Q1 → positively skewed
    - Median near Q3 → negatively skewed
    - Median ≈ Mean → roughly symmetric (as in a normal distribution)
• Data "tightness"
    - Wide box → dispersed
    - Narrow box → consistent
• Presence of outliers
Source: https://datavizcatalogue.com/methods/box_plot.html

Why do we visualize data?
Think!
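The recap measures (5-number summary, SD, CV, IQR, and the boxplot reading of skew and outliers) can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the response-time values are made up, the function names `describe`, `skew_hint`, and `find_outliers` are hypothetical, and the 1.5 × IQR outlier fence is a common boxplot convention that the slides do not state explicitly.

```python
import statistics


def describe(values):
    """Return the 5-number summary plus mean, SD, CV, and IQR of a sample."""
    # statistics.quantiles defaults to the "exclusive" method; other tools
    # may use a different quartile convention and give slightly different Q1/Q3.
    q1, median, q3 = statistics.quantiles(values, n=4)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 in the denominator)
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
        "mean": mean,
        "sd": sd,
        "cv_percent": 100 * sd / mean,  # unit-free spread, for comparing across units
        "iqr": q3 - q1,
    }


def skew_hint(summary):
    """Read skew from the position of the median inside the box."""
    lower = summary["median"] - summary["q1"]
    upper = summary["q3"] - summary["median"]
    if lower < upper:
        return "positive"   # median sits nearer Q1
    if lower > upper:
        return "negative"   # median sits nearer Q3
    return "symmetric"


def find_outliers(values, summary):
    """Flag points beyond 1.5 * IQR from the box (assumed convention)."""
    low = summary["q1"] - 1.5 * summary["iqr"]
    high = summary["q3"] + 1.5 * summary["iqr"]
    return [v for v in values if v < low or v > high]


# Hypothetical response times in milliseconds (illustrative data only).
response_ms = [12, 15, 11, 14, 13, 18, 35, 12, 16, 14]
summary = describe(response_ms)
outliers = find_outliers(response_ms, summary)

print(summary)
print("skew hint:", skew_hint(summary))
print("outliers:", outliers)
```

The skew hint is only a rough reading of the box, in the same spirit as eyeballing a boxplot; it does not replace plotting the data.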
Quantitative Data Analysis
• Systematic approach in research investigations that collects numerical data, which the researcher then transforms into actionable insights or information
• Often describes a situation/event, or answers research questions or study objectives
• Often concerned with finding evidence to either support or contradict a (research) idea or hypothesis

Quantitative Data Analysis: Rule of Thumb
• Use only descriptive statistics for descriptive research design studies
• Use inferential statistics for experimental, quasi-experimental, and correlational research studies

Inferential Statistics
• Subdivided into statistical tests that measure differences and statistical tests that measure relationships between variables of interest
    - Statistical tests determine whether an observed difference or relationship between variables of interest is statistically significant
• Statistical significance helps the researcher rule out one important threat to (internal) validity: that the result could be due to chance rather than to real differences in the (target) population

Inferential Statistics aka (Null) Hypothesis Testing-1
• Statistical method that uses sample data to evaluate the researcher's claims/conjectures (i.e., to test a research hypothesis) about a (target) population parameter
    - "What can we infer about the population based upon what we have learned from our sample?"
• Uses probability to determine whether it is likely that a particular sample (or test outcome) is representative of the (target) population
    - Determines whether the differences we observe are more likely due to random variation or due to the effect of the treatment/researcher's intervention

Hypothesis Testing-2
Questions to Ask before You Begin
• What are your variables (of interest)?
    - Independent variable (IV)
    - Dependent variable (DV)
• What do you want to know? (Research Objectives)
    - Differences between groups? Compare means
    - Relationships between variables? Test associations
    - Prediction/influence on DV by IV? Assess whether changes in IV predict/influence changes in DV
• What do your data look like?
    - What level of measurement are the DV data (NOIR/NODC)?
    - Have the assumptions been met?
    - How many IV groups do you have?
• What type of test will you use?
    - Directional or non-directional H1/Ha?
    - Is your desired alpha (α) 0.05 or 0.01?

Hypothesis Testing-3
Sample Hypothesis
• Mean Comparisons
    - There is a difference in learning gains before and after introducing an interactive learning object (ILO).
    - The total elapsed times in remote procedure calls (RPCs) differ according to OS.
    - The mean connection speed is 54 Mbps (as claimed by the ISP).
• Testing Associations
    - There is a relationship between the number of bugs in software and the software developer's experience.
    - Software quality is correlated with the number of checkpoints in software testing.
• Predictive
    - Newly-hired IT personnel's starting salary can be predicted by work experience and education level.
    - Online customer satisfaction influences revisit intention on an e-commerce website.
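
As an illustration of the mean-comparison case, the sample hypothesis "The mean connection speed is 54 Mbps (as claimed by ISP)" could be checked with a one-sample t-test. The slides do not name a specific test, so this is one common choice and assumes roughly normally distributed measurements. The speed readings below are made up, the helper `one_sample_t` is hypothetical, and the critical value 2.262 is taken from a standard t table for a non-directional test at α = 0.05 with 9 degrees of freedom.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """Return (t statistic, degrees of freedom) for H0: population mean == mu0."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)        # sample SD (n - 1 in the denominator)
    standard_error = sd / math.sqrt(n)
    return (mean - mu0) / standard_error, n - 1


# Hypothetical measured connection speeds in Mbps (illustrative data only).
speeds = [52.1, 53.4, 50.8, 54.2, 51.9, 52.7, 53.0, 51.5, 52.4, 53.1]

# H0: mean speed = 54 Mbps; Ha (non-directional): mean speed != 54 Mbps.
CLAIMED_MEAN = 54.0
t_stat, df = one_sample_t(speeds, CLAIMED_MEAN)

# Two-tailed critical value for alpha = 0.05 and df = 9, from a standard t table.
# A different sample size or alpha needs a different table entry.
T_CRITICAL = 2.262
reject_h0 = abs(t_stat) > T_CRITICAL

print(f"t = {t_stat:.3f}, df = {df}")
print("reject H0" if reject_h0 else "fail to reject H0")
```

A directional Ha (for example, "the mean speed is below 54 Mbps") would compare the signed t statistic against a one-tailed critical value instead of using the absolute value.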