Statistics: Descriptive and Correlational Methods

Summary

This document provides an overview of statistics, covering descriptive statistics, correlational statistics, and inferential statistics. It discusses data collection methods, sampling techniques, and potential sources of error in sampling. Formulae are included for different statistical calculations, along with explanations of correlation coefficients and linear regression analysis.

Full Transcript

Statistics
Statistics is the science of collecting, organizing, summarizing, and analyzing information to draw conclusions or answer questions.

Limitations of Statistics:
1. Not suitable for the study of qualitative phenomena
2. Does not study individuals
3. Laws are not exact
4. Tables may be misused
5. Only one of the methods of studying a problem

Types of Statistics
1. Descriptive Statistics
They define, describe, or give information about a set of data or a distribution.
2. Correlational Statistics
The study of the relationship between or among variables.
3. Inferential Statistics
Involves the study of samples for the purpose of making generalizations, conclusions, or inferences about the population from which the samples were taken.

Data
Factual information used as a basis for reasoning, discussion, or calculation.

Population
The total or entire group of individuals or observations from which information is desired by a researcher.

Individual
A person or object that is a member of the population being studied.

Sample
A subset of the population.

Variable
A characteristic or attribute of persons or objects which assumes different values for different objects under consideration.
1. Qualitative Variables
A qualitative variable yields categorical responses. Its value is a word or a code that represents a class or category.
2. Quantitative Variables
A quantitative variable takes on numerical values representing an amount or quantity.

Quantitative Variable
1. Discrete Variable
Has either a finite number of possible values or a countable number of possible values.
2. Continuous Variable
Has an infinite number of possible values that are not countable.

Levels of Measurement
- Ratio
- Interval
- Ordinal
- Nominal

Nominal Level
Nominal data identify, name, classify, or categorize objects or events.
-Method of payment
-Type of school
-Eye color

Ordinal Level
Involves data that can be categorized and ranked, but the differences between categories are not necessarily uniform.
-Food preferences
-Rank of a military officer
-Socioeconomic class

Interval Level
Interval data have ordered values and the additional property of equal distances or intervals between scale values. A value of zero does not mean the absence of the quantity.
-Temperature
-Trait anxiety
-IQ

Ratio Level
Ratio data identify, order, and represent equal distances between score values, and a value of zero means the absence of the quantity.
-Height
-Weight
-Number of words correctly spelled

Sample Size
Typically denoted by n; it is always a positive integer.

Criteria that need to be specified to determine the appropriate sample size:
1. Level of Precision
Also called sampling error, it is the range in which the true value of the population is estimated to be.
2. Level of Confidence
A statistical measure of the number of times out of 100 that results can be expected to be within a specified range.
3. Level of Variability
Depending upon the target population and the attributes under consideration, the degree of variability varies considerably.

Methods in determining the sample size

Estimating the mean or average
The sample size required to estimate the population mean with a given level of confidence and a specified margin of error is given by

$$n = \left(\frac{Z\sigma}{e}\right)^2$$

Where:
Z is the z-score corresponding to the level of confidence
e is the level of precision
σ is the population standard deviation

When σ is unknown, it is common practice to conduct a preliminary survey to determine s and use it as an estimate of σ, or to use results from previous studies to obtain an estimate of σ. When using this approach, the size of the sample should be at least 30. The formula for the sample standard deviation s is

$$s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}$$

Estimating proportion (Infinite Population)
The sample size required to obtain a confidence interval for p with specified margin of error e is given by

$$n = \frac{Z^2\,p(1 - p)}{e^2}$$

Where:
Z is the z-score corresponding to the level of confidence.
e is the level of precision.

The formula requires p, which is the quantity being estimated. There are 2 ways to solve this dilemma:
- Determine a preliminary value for p based on a pilot study or an earlier study.
- Simply replace p in the formula by 0.5. When p = 0.5, p(1 − p) reaches its maximum value of 0.25, or ¼. This is called the most conservative estimate, since it gives the largest possible estimate of n.

Infinite Population Correction
The conservative formula, using the strong law of large numbers:

$$n_0 = \frac{Z^2\,p\,q}{e^2}$$

Where:
n_0 = sample size
Z = z-value (1.96 for a 95% confidence level), the value for the selected alpha level
p = percentage picking a choice, expressed as a decimal (50% is used when computing the sample size needed); the estimated proportion of an attribute that is present in the population
q = 1 − p
pq = the estimate of variance
e (also written c) = confidence interval, expressed as a decimal (0.05 = ±5); the acceptable margin of error for the proportion being estimated

For example, with a confidence level of 95% (Z = 1.96), p = 0.5, and a level of precision of 0.05, n_0 = (1.96)²(0.5)(0.5) / (0.05)² = 384.16, which is rounded up to 385.

Finite Population Correction
If the population is small, the sample size can be reduced slightly:

$$n = \frac{n_0}{1 + \dfrac{n_0 - 1}{N}}$$

Where:
n_0 is Cochran's sample size recommendation
n (also written n_1) is the new sample size
N is the population size

Slovin's Formula
Used to calculate the sample size n given the population size and the error. It is computed as

$$n = \frac{N}{1 + N e^2}$$

Where:
N is the total population
e is the level of precision

A study's write-up of its sample size might read: The researchers used Cochran's formula to calculate the sample size. A margin of error of 5% (0.05) was used as the basis for the formula.

Sampling Technique/Sampling Strategies
A plan you set forth to be sure that the sample you use in your research study represents the population from which you drew your sample.

Sampling Frame
The list of the elements in your population; your sample is drawn from this list.

Sampling Bias
Problems in your sampling which reveal that your sample is not representative of your population.

Selection Bias:
1.) Deliberately or purposively selecting a "representative" sample.
2.) Misspecifying the target population.
3.) Failing to include all of the target population in the sampling frame, called undercoverage.
4.) Including population units in the sampling frame that are not in the target population, called overcoverage.
5.) Having multiplicity of listings in the sampling frame.
6.) Substituting a convenient member of a population for a designated member who is not readily available.
7.) Failing to obtain responses from all of the chosen sample (nonresponse).
8.) Allowing the sample to consist entirely of volunteers.

Advantages of Sampling Over Complete Enumeration:
-Less Labor
-Reduced Cost
-Greater Speed
-Greater Scope
-Greater Efficiency and Accuracy
-Convenience
-Ethical Considerations

Two Types of Samples
1. Probability Sample
-Samples are obtained using some objective chance mechanism, thus involving randomization.
-They require the use of a complete listing of the elements of the universe, called the Sampling Frame.
-The probabilities of selection are known.
-They are generally referred to as Random Samples.
-They allow drawing of valid generalizations about the universe/population.
2. Non-Probability Sample
-Samples are obtained haphazardly, selected purposively, or taken as volunteers.

Basic Sampling Technique of Probability Sampling:
- Simple Random Sampling
- Systematic Random Sampling
- Stratified Random Sampling
- Cluster Sampling
- Multi-stage Sampling

Simple Random Sampling
The most basic method of drawing a probability sample. It assigns equal probabilities of selection to each possible sample.

Systematic Random Sampling
Obtained by selecting every kth individual from the population. The first individual selected corresponds to a random number between 1 and k.

Stratified Random Sampling
Obtained by separating the population into non-overlapping groups called strata and then obtaining a simple random sample from each stratum. The individuals within each stratum should be homogeneous (or similar) in some way.
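As a rough illustration of the sample-size formulas and the first three probability sampling methods described above, the following Python sketch may help. It is a minimal sketch, not a definitive implementation: the function names, the seed, the frame of 1,000 numbered units, and the choice to round every sample size up are assumptions made for this example. The systematic sampler also assumes the requested sample size does not exceed the frame size, and it rounds the interval k down to a whole number.

```python
import math
import random


def sample_size_mean(z, sigma, e):
    """n = (Z * sigma / e)^2, rounded up to a whole unit."""
    return math.ceil((z * sigma / e) ** 2)


def cochran_n0(z=1.96, p=0.5, e=0.05):
    """n0 = Z^2 * p * (1 - p) / e^2, left unrounded for later correction."""
    return z ** 2 * p * (1 - p) / e ** 2


def finite_population_correction(n0, population_size):
    """n = n0 / (1 + (n0 - 1) / N), rounded up."""
    return math.ceil(n0 / (1 + (n0 - 1) / population_size))


def slovin(population_size, e):
    """n = N / (1 + N * e^2), rounded up."""
    return math.ceil(population_size / (1 + population_size * e ** 2))


def simple_random_sample(frame, n, rng):
    """Draw n distinct units; every possible sample is equally likely."""
    return rng.sample(frame, n)


def systematic_sample(frame, n, rng):
    """Take every k-th unit after a random start among the first k units."""
    k = len(frame) // n           # sampling interval, rounded down
    start = rng.randrange(k)      # 0-based index of the first selected unit
    return [frame[start + i * k] for i in range(n)]


def stratified_sample(strata, fraction, rng):
    """Draw a simple random sample of the given fraction from each stratum."""
    return {name: rng.sample(units, round(fraction * len(units)))
            for name, units in strata.items()}


if __name__ == "__main__":
    rng = random.Random(42)               # hypothetical seed for repeatability
    frame = list(range(1, 1001))          # hypothetical frame of N = 1000 unit IDs
    n0 = cochran_n0()                     # about 384.16 at 95% confidence, e = 0.05
    n = finite_population_correction(n0, len(frame))
    print(n, slovin(len(frame), 0.05))    # prints: 278 286
    print(systematic_sample(frame, n, rng)[:5])
```

For the same population of 1,000 and a 0.05 level of precision, the finite-population-corrected Cochran size (278) and the Slovin size (286) are close but not identical, which is worth keeping in mind when a write-up names one formula or the other.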
Cluster Sampling
You take the sample from naturally occurring groups in your population. The clusters are constructed such that the sampling units are heterogeneous within each cluster and homogeneous among the clusters.

Multi-stage Sampling
Selection of the sample is done in two or more steps or stages, with the sampling units varying at each stage.

For non-probability samples, the probabilities of selection are unknown, and the samples should not be used for statistical inference.

Basic Sampling Technique of Non-Probability Sampling:
- Accidental Sampling
- Quota Sampling
- Convenience Sampling
- Purposive Sampling
- Judgement Sampling

Cases wherein Non-Probability Sampling is Useful:
- Only a few are willing to be interviewed
- Extreme difficulties in locating or identifying subjects
- Probability sampling is more expensive to implement
- Cannot enumerate the population elements

Sources of Errors in Sampling
- Non-sampling Error
Errors that result from the survey process. Any errors that cannot be attributed to sample-to-sample variability.
-Non-response
-Interviewer Error
-Misrepresented Answers
-Data entry errors
-Questionnaire Design
-Wording of Questions
-Selection Bias
- Sampling Error
Errors that result from taking one sample instead of examining the whole population. Error that results from using sampling to estimate information regarding a population.

Data Collection
The process of gathering and measuring information on variables of interest, in an established systematic fashion that enables one to answer stated research questions, test hypotheses, and evaluate outcomes.

Importance of Data
- Empowers you to make informed decisions
- Helps you identify problems
- Allows you to develop accurate theories
- Will back up your arguments
- Helps you get your hands on funding
- Increases your return on assets
- Improves quality of life

Sources of Data
1. Primary Sources
Provide a first-hand account of an event or time period and are considered to be authoritative. They represent original thinking, report on discoveries or events, or share new information.

Primary Data
Data documented by the primary source. The data collectors documented the data themselves.

2. Secondary Sources
Offer an analysis, interpretation, or restatement of primary sources and are considered to be persuasive. They often involve generalisation, synthesis, interpretation, commentary, or evaluation in an attempt to convince the reader of the creator's argument.

Secondary Data
Data documented by the secondary source. The data collectors had the data documented by other sources.

Primary Data (collection methods)
1. Direct personal interviews
2. Indirect/Questionnaire Method
3. Focus Group
4. Experiment
5. Observation

[Figure: Open-Ended vs. Close-Ended Questionnaire]

Correlational Statistics
Types of Correlation:

1. Positive Correlation
There is a positive correlation between two variables when this condition exists: as the value in one variable (x) increases, the value in the other variable (y) also increases. In the scattergram of two variables which are positively correlated, the points have a tendency to form a straight line from the lower left-hand corner of the graph to the upper right-hand corner. X is usually the independent variable, while y is the dependent variable, which depends for its value on x. The correlation coefficients of positively correlated variables range from 0.01 to 1.00. If all the points in the graph fall on the line (called the line of best fit or the regression line), the two variables are said to have a perfect positive correlation, with a coefficient of correlation of exactly 1.

The nearer the points are to the line of best fit, the higher the correlation between the two variables. The farther the points are from the line, the lower the correlation.

2. Negative Correlation
There is a negative correlation between two variables when this condition exists: as the value in one variable (x) increases, the value in the other variable (y) decreases.

3. Zero Correlation
There is no correlation at all between two variables. In the scattergram of two variables which have a zero correlation, points are scattered all over the graph. There is no pattern or tendency of the points to form a straight line.

4. Curvilinear Correlation
At first, as the value in one variable (x) increases, the value in the other variable (y) also increases, until a point is reached when the value in the first variable continues to increase, but the value in the second variable stops increasing and then gradually decreases.

Three Ways to Interpret the Coefficient of Correlation:
1. Through the use of verbal equivalents.
2. Through the coefficient of determination, the square of the correlation coefficient (r²). The coefficient of determination is the amount of variation in the dependent variable y which is attributed to or accounted for by the independent variable/s.
3. Through the use of the Table of Significant Values of r, which is found in Statistics books. The degrees of freedom (df) are obtained by subtracting two (2) from the number of pairs.

Correlational Statistical Tools
- Regression Analysis
- Pearson Product Moment Correlation
- Spearman Rank Order Correlation
- Kendall's Coefficient of Concordance
- Point Biserial Correlation (derived from the Pearson Product Moment Correlation)
- Chi-Square Test of Association
- Partial Correlation

Partial Correlation
It is used to determine the extent of correlation between two variables which are both related to a third variable, when the effect of the third variable is removed or partialled out.

Linear Regression Analysis
It is used in prediction problems. In a prediction scheme, known measures in one or more variables (the independent variables, represented by X1, X2, …) are used to make estimates of the values of a second variable Y, the dependent variable.

Linear regression, or the prediction equation, has the formula Ŷ = a_yx + b_yx X, where a_yx is read as the a-coefficient, or the y-intercept, in predicting y in terms of x, and b_yx means the b-coefficient, or the slope, in predicting y in terms of x.

Steps in Linear Regression
1. Arrange the x and y scores in columns (columns 1 and 2).
2. Add columns 3, 4, and 5 to the table and label them x², y², and xy.
3. Get the values in column 3 by squaring the values of x in the first column.
4. Get the values in column 4 by squaring the values of y in the second column.
5. Compute the values in column 5 by multiplying each value of x in column 1 by the corresponding value of y in column 2.
6. Get the summation of each of the five columns: ∑x, ∑y, ∑x², ∑y², and ∑xy.
7. You will come up with the following table.
8. Solve for b_yx using the formula

$$b_{yx} = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$$

9. Solve for the means of x and y.
10. Solve for a_yx using the formula

$$a_{yx} = \bar{Y} - b_{yx}\bar{X}$$

11. Write the regression equation Ŷ = a_yx + b_yx X.
12. Prepare a table with columns X, Y, Ŷ, Ŷ − Y, and (Ŷ − Y)².
13. Solve for the predicted Y (Ŷ) values in column 3 by substituting each value of x in the regression equation.
14. The values in column 4 are the differences between the actual (Y) and the corresponding predicted (Ŷ) values of Y. These are called the errors of estimate.
15. Square each value in column 4 to get the values in column 5, the squared errors of estimate.
16. The sum of the squared errors of estimate is the amount of variation in Y that is not accounted for by X.
17. Solve for the variation of all y scores using the formula

$$\sum (Y - \bar{Y})^2 = \sum Y^2 - \frac{\left(\sum Y\right)^2}{n}$$

18. Prepare the regression analysis table.
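The regression steps above can be sketched in a few lines of Python. This is an illustrative sketch only: it uses the standard least-squares expressions for the slope and intercept, the function name is made up for this example, and the five (x, y) pairs are hypothetical data, not values from the original notes.

```python
def linear_regression_steps(xs, ys):
    """Follow the tabular steps: column sums, slope, intercept, errors of estimate."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)               # column 3 total
    sum_y2 = sum(y * y for y in ys)               # column 4 total
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # column 5 total

    # Slope b_yx and intercept a_yx from the column totals and the means.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = sum_y / n - b * (sum_x / n)

    # Predicted values, errors of estimate (predicted minus actual), and their squares.
    predicted = [a + b * x for x in xs]
    errors = [y_hat - y for y_hat, y in zip(predicted, ys)]
    unexplained = sum(e * e for e in errors)

    # Variation of all y scores, then the proportion accounted for by x.
    total_variation = sum_y2 - sum_y ** 2 / n
    r_squared = 1 - unexplained / total_variation
    return a, b, r_squared


if __name__ == "__main__":
    xs = [1, 2, 3, 4, 5]      # hypothetical x scores
    ys = [2, 4, 5, 4, 5]      # hypothetical y scores
    a, b, r2 = linear_regression_steps(xs, ys)
    print(round(a, 4), round(b, 4), round(r2, 4))   # prints: 2.2 0.6 0.6
```

For these made-up scores the regression equation is Ŷ = 2.2 + 0.6X, and 60% of the variation in Y is accounted for by X.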
The amount of variation, expressed as a proportion (for example, 0.8634), in the dependent variable (Y) that is accounted for by the independent variable (X) is called the coefficient of determination. It is also the square of the coefficient of correlation (r²). Extracting the square root of the coefficient of determination will yield the coefficient of correlation (r).

Correlation
- Measures the degree to which two variables are related or associated.
- It quantifies the strength and direction of the relationship between variables.
- It helps assess whether changes in one variable are related to changes in another variable.
- Does not imply causation, meaning that even if two variables are correlated, it does not necessarily mean that one causes the other.
- Examines the linear relationship of variables.
- Is often expressed as a correlation coefficient, with the Pearson coefficient being one of the most common methods. It ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no correlation.

Regression
A way of mathematically sorting out which of those variables does indeed have an impact.

Pearson Product Moment Correlation (Pearson r)
It is used to determine the extent of correlation between two variables (usually x and y) which are expressed in the interval scale. It is the most commonly used correlational statistical tool, and it is the most reliable because the magnitude of each score in the distribution is considered in the computation. To compute Pearson r, prepare a table with the following columns: x, y, x², y², and xy. After completing the table, get the summation of each column; n is the number of pairs. Then, use the following formula.

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$

Spearman Rank Order Correlation (Spearman rho)
It is used when data are expressed in the ordinal scale of measurement. It can also be used with interval data as an alternative to the Pearson r if there are only 30 cases or fewer.
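A minimal Python sketch of Spearman rho follows. It assumes the usual rank-difference formula, ρ = 1 − 6∑d² / (n(n² − 1)), which is valid only when neither variable has tied values. The function names and the five pairs of scores are hypothetical, chosen only for illustration.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman_rho(xs, ys):
    """Spearman rho from rank differences (no-ties formula)."""
    n = len(xs)
    if len(set(xs)) < n or len(set(ys)) < n:
        raise ValueError("tied values: the rank-difference formula does not apply")
    rank_x, rank_y = ranks(xs), ranks(ys)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


if __name__ == "__main__":
    math_scores = [10, 20, 30, 40, 50]   # hypothetical scores
    music_scores = [1, 3, 2, 5, 4]       # hypothetical scores
    print(spearman_rho(math_scores, music_scores))   # prints: 0.8
```

When ties are present, a common alternative is to assign average ranks and compute Pearson r on the ranks instead of using the shortcut formula.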
Chi-Square Test of Association
It is used to determine the significance of the relationship between two variables which are both expressed in the nominal scale.

Point Biserial Correlation (r_pb)
It is used when one of the variables is dichotomous, that is, it has only two values, while the other is continuous. It is especially useful in test construction and validation. It is derived from the Pearson Product Moment Correlation.

Correlation Coefficient
The strength of the correlation is determined by the correlation coefficient, which varies between -1 and +1.

Direction of the Correlation
1. Positive Correlation
When one variable increases, the other tends to increase as well, or when one variable decreases, the other decreases too.
-Height and weight (Taller people tend to be heavier.)
-Body size and shoe size (A large body size is usually associated with a large shoe size.)
2. Negative Correlation
When one variable increases, the other tends to decrease.
-As you climb the mountain (increase in height), it gets colder (decrease in temperature).
-The product price and the sales volume (If the price is higher, the sales volume goes down.)
3. No Correlation
There is no consistent relationship between the variables.
-As the temperature went up, there was no apparent effect on coffee sales.

Spearman's rank-order correlation
It is the nonparametric counterpart of the Pearson product-moment correlation. Spearman's correlation coefficient (ρ or r_s) measures the strength and direction of association between two ranked variables. Examples of paired variables:
- Achievement scores of students in math and music
- The number of movie releases that a motion picture studio put out and its gross receipts for the year
- The number of hospitals and pharmacies in each of ten randomly selected provinces

Assumptions for the test
- The two variables should be ordinal, interval, or ratio.
- The scores on one variable must be monotonically related to the other variable.

Spearman's Rank-Order Correlation Formula

$$\rho = 1 - \frac{6\sum d_i^2}{n\left(n^2 - 1\right)}$$

where d_i is the difference between the two ranks of the i-th pair and n is the number of pairs.

Pearson's Sample Correlation Coefficient
Pearson's sample correlation coefficient (also known as Pearson r), denoted by r, is a test statistic that measures the strength of the linear relationship between two variables. To find r, the following formula is used:

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2}}$$
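As a small illustration of Pearson r, the sketch below computes the coefficient from deviations about the means, one standard way of writing the formula. The function name and the five pairs of scores are hypothetical and are not taken from the original notes.

```python
import math


def pearson_r(xs, ys):
    """Pearson r from deviations of each score about its mean."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


if __name__ == "__main__":
    xs = [1, 2, 3, 4, 5]      # hypothetical x scores
    ys = [2, 4, 5, 4, 5]      # hypothetical y scores
    r = pearson_r(xs, ys)
    print(round(r, 4), round(r ** 2, 4))   # prints: 0.7746 0.6
```

Squaring r gives the coefficient of determination, so for these made-up scores 60% of the variation in y is accounted for by x. The function divides by zero if either variable is constant, which is consistent with r being undefined in that case.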