Introduction to Statistics
Document Details
Humboldt-Universität zu Berlin
2015
Wolfgang Karl Härdle, Sigbert Klinke, Bernd Rönz
Summary
Introduction to Statistics, by Wolfgang Karl Härdle, Sigbert Klinke, and Bernd Rönz, is a textbook for undergraduate students. It grew out of the lectures "Statistics I & II" at Humboldt-Universität zu Berlin. A predecessor was released on CD in several languages at the end of the last millennium; the current book uses modern web technology to provide web-based interactive examples.
Full Transcript
Wolfgang Karl Härdle · Sigbert Klinke · Bernd Rönz
Introduction to Statistics
Using Interactive MM*Stat Elements

Wolfgang Karl Härdle
C.A.S.E. Centre f. Appl. Stat. & Econ.
School of Business and Economics
Humboldt-Universität zu Berlin
Berlin, Germany

Sigbert Klinke
Ladislaus von Bortkiewicz Chair of Statistics
Humboldt-Universität zu Berlin
Berlin, Germany

Bernd Rönz
Department of Economics
Inst. for Statistics and Econometrics
Humboldt-Universität zu Berlin
Berlin, Germany

The quantlet codes in Matlab or R may be downloaded from http://www.quantlet.de or via a link on http://springer.com/978-3-319-17703-8

ISBN 978-3-319-17703-8
ISBN 978-3-319-17704-5 (eBook)
DOI 10.1007/978-3-319-17704-5
Library of Congress Control Number: 2015958919
Springer Cham Heidelberg New York Dordrecht London
© Springer International Publishing Switzerland 2015

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication.
Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.
Printed on acid-free paper
Springer International Publishing AG Switzerland is part of Springer Science+Business Media (www.springer.com)

Preface

Statistics is a science central to many disciplines: modern, big, and smart data analysis can only be performed with statistical scientific tools. This is the reason why statistics is fundamental and is taught in many curricula and used in many applications. The collection and analysis of data change the way we observe and understand real data. Nowadays, we are collecting more and more data, most of it less structured, which requires new analysis methods and challenges the classical ones. But even today, the ideas used for the development of the classical methods are the foundation of new and future methods.
At the Ladislaus von Bortkiewicz Chair of Statistics, School of Business and Economics at Humboldt-Universität zu Berlin, we are introducing students to this important science with the lectures “Statistics I & II.” The structure of these lectures and the methods used have changed over time, especially with the rise of the internet, but the topics taught are still the same.
At the end of the last millennium, we developed a set of hyperlinked web pages on CD, which covered even more than our lectures “Statistics I & II” in English, Spanish, French, Arabic, Portuguese, German, Indonesian, Italian, Polish, and Czech. This gave the students easy access to data and methods. An integral and important part of the CD were the interactive examples, with which the students could learn certain statistical facts by themselves. Using a wiki, we made a first version in German available on the internet (without interactive examples).
But modern web technology now allows much easier, better, and faster development of interactive examples than 15 years ago, which led to this SmartBook with web-based interactive examples.

Dicebat Bernardus Carnotensis nos esse quasi nanos gigantum umeris insidentes, ut possimus plura eis et remotiora videre, non utique proprii visus acumine, aut eminentia corporis, sed quia in altum subvehimur et extollimur magnitudine gigantea.
Bernard of Chartres used to compare us to [puny] dwarfs perched on the shoulders of giants. He pointed out that we see more and farther than our predecessors, not because we have keener vision or greater height, but because we are lifted up and borne aloft on their gigantic stature.
Johannes von Salisbury: Metalogicon 3,4,46–50

Therefore, we would like to thank all the colleagues and students who contributed to the development of the current book and its predecessors:
For the CDs: Gökhan Aydinli, Michal Benko, Oliver Blaskowitz, Thierry Brunelle, Pavel Čížek, Michel Delecroix, Matthias Fengler, Eva Gelnarová, Zděnek Hlávka, Kateřina Houdková, Paweł Jaskólski, Šárka Jakoubková, Jakob Jurdziak, Petr Klášterecký, Torsten Kleinow, Thomas Kühn, Salim Lardjane, Heiko Lehmann, Marlene Müller, Rémy Slama, Hizir Sofyan, Claudia Trentini, Axel Werwatz, Rodrigo Witzel, Adonis Yatchew, and Uwe Ziegenhagen.
For the Arabic and German wikis: Taleb Ahmad, Paul Giradet, Leonie Schlittgen, Dennis Uieß, and Beate Weidenhammer.
For the current SmartBook: Sarah Asmah, Navina Gross, Lisa Grossmann, Karl-Friedrich Israel, Wiktor Olszowy, Korbinian Oswald, Darya Shesternya, and Yordan Skenderski.
Berlin, Germany    Wolfgang Karl Härdle
Berlin, Germany    Sigbert Klinke
Berlin, Germany    Bernd Rönz

Structure of the Book

Each chapter covers a broad statistical topic, and each topic is divided into sections. In addition, you will find larger explained, enhanced, and interactive examples:
Explained examples are directly related to the content of the section or chapter.
Enhanced examples may require knowledge from earlier chapters to understand them.
Interactive examples let you use different datasets, choose between analysis methods, and/or play with the parameters of the chosen analysis method.
The web address of a specific interactive example can be found in the appropriate section. In the online version of the book at http://www.springer.com/de/book/9783319177038, you can also find a set of multiple choice exercises at the end of each chapter.

Contents

1 Basics
  1.1 Objectives of Statistics
    A Definition of Statistics
    Explained: Descriptive and Inductive Statistics
  1.2 Statistical Investigation
    Conducting a Statistical Investigation
    Sources of Economic Data
    Explained: Public Sources of Data
    More Information: Statistical Processes
  1.3 Statistical Element and Population
    Statistical Elements
    Population
    Explained: Statistical Elements and Population
  1.4 Statistical Variable
  1.5 Measurement Scales
  1.6 Qualitative Variables
    Nominal Scale
    Ordinal Scale
  1.7 Quantitative Variables
    Interval Scale
    Ratio Scale
    Absolute Scale
    Discrete Variable
    Continuous Variable
  1.8 Grouping Continuous Data
    Explained: Grouping of Data
  1.9 Statistical Sequences and Frequencies
    Statistical Sequence
    Frequency
    Explained: Absolute and Relative Frequency
2 One-Dimensional Frequency Distributions
  2.1 One-Dimensional Distribution
    2.1.1 Frequency Distributions for Discrete Data
      Frequency Table
    2.1.2 Graphical Presentation
      Explained: Job Proportions in Germany
      Enhanced: Evolution of Household Sizes
  2.2 Frequency Distribution for Continuous Data
    Frequency Table
    Graphical Presentation
    Explained: Petrol Consumption of Cars
    Explained: Net Income of German Nationals
  2.3 Empirical Distribution Function
    2.3.1 Empirical Distribution Function for Discrete Data
    2.3.2 Empirical Distribution Function for Grouped Continuous Data
      Explained: Petrol Consumption of Cars
      Explained: Grades in Statistics Examination
  2.4 Numerical Description of One-Dimensional Frequency Distributions
    Measures of Location
    Explained: Average Prices of Cars
    Interactive: Dotplot with Location Parameters
    Interactive: Simple Histogram
  2.5 Location Parameters: Mean Values - Harmonic Mean, Geometric Mean
    Harmonic Average
    Geometric Average
  2.6 Measures of Scale or Variation
    Range
    Interquartile Range
    Mean Absolute Deviation
    The Variance and the Standard Deviation
    Explained: Variations of Pizza Prices
    Enhanced: Parameters of Scale for Cars
    Interactive: Dotplot with Scale Parameters
  2.7 Graphical Display of the Location and Scale Parameters
    Boxplot (Box-Whisker-Plot)
    Explained: Boxplot of Car Prices
    Interactive: Visualization of One-Dimensional Distributions
3 Probability Theory
  3.1 The Sample Space, Events, and Probabilities
    Venn Diagram
  3.2 Event Relations and Operations
    Subsets and Complements
    Union of Sets
    Intersection of Sets
    Logical Difference of Sets or Events
    Disjoint Decomposition of the Sample Space
    Some Set Theoretic Laws
  3.3 Probability Concepts
    Classical Probability
    Statistical Probability
    Axiomatic Foundation of Probability
    Addition Rule of Probability
    More Information: Derivation of the Addition Rule
    More Information: Implications of the Probability Axioms
    Explained: A Deck of Cards
  3.4 Conditional Probability and Independent Events
    Conditional Probability
    Multiplication Rule
    Independent Events
    Two-Way Cross-Tabulation
    More Information: Derivation of Rules for Independent Events
    Explained: Two-Way Cross-Tabulation
    Explained: Screws
  3.5 Theorem of Total Probabilities and Bayes’ Rule
    Theorem of Total Probabilities
    Bayes’ Rule
    Explained: The Wine Cellar
    Enhanced: Virus Test
    Interactive: Monty Hall Problem
    Interactive: Die Rolling Sisters
4 Combinatorics
  4.1 Introduction
    Different Ways of Grouping and Ordering
    Use of Combinatorial Theory
  4.2 Permutation
    Permutations Without Repetition
    Permutations with Repetition
    Permutations with More Groups of Identical Elements
    Explained: Beauty Competition
  4.3 Variations
    Variations with Repetition
    Variations Without Repetition
    Explained: Lock Picking
  4.4 Combinations
    Combinations Without Repetition
    Combinations with Repetition
    Explained: German Lotto
  4.5 Properties of Euler’s Numbers (Combination Numbers)
    Symmetry
    Specific Cases
    Sum of Two Euler’s Numbers
    Euler’s Numbers and Binomial Coefficients
5 Random Variables
  5.1 The Definition
    More Information
    Explained: The Experiment
    Enhanced: Household Size I
  5.2 One-Dimensional Discrete Random Variables
    Discrete Random Variable
    Explained: One-Dimensional Discrete Random Variable
    Enhanced: Household Size II
  5.3 One-Dimensional Continuous Random Variables
    Density Function
    Distribution Function
    More Information: Continuous Random Variable, Density, and Distribution Function
    Explained: Continuous Random Variable
    Enhanced: Waiting Times of Supermarket Customers
  5.4 Parameters
    Expected Value
    Variance
    Standard Deviation
    Standardization
    Chebyshev’s Inequality
    Explained: Continuous Random Variable
    Explained: Traffic Accidents
  5.5 Two-Dimensional Random Variables
    Marginal Distribution
    The Conditional Marginal Distribution Function
    Explained: Two-Dimensional Random Variable
    Enhanced: Link Between Circulatory Diseases and Patient Age
  5.6 Independence
    Conditional Distribution
    More Information
    Explained: Stochastic Independence
    Enhanced: Economic Conditions in Germany
  5.7 Parameters of Two-Dimensional Distributions
    Covariance
    Correlation Coefficient
    More Information
    Explained: Parameters of Two-Dimensional Random Variables
    Enhanced: Investment Funds
6 Probability Distributions
  6.1 Important Distribution Models
  6.2 Uniform Distribution
    Discrete Uniform Distribution
    Continuous Uniform Distribution
    More Information
    Explained: Uniform Distribution
  6.3 Binomial Distribution
    More Information
    Explained: Drawing Balls from an Urn
    Enhanced: Better Chances for Fried Hamburgers
    Enhanced: Student Jobs
    Interactive: Binomial Distribution
  6.4 Hypergeometric Distribution
    More Information
    Explained: Choosing Test Questions
    Enhanced: Selling Life Insurances
    Enhanced: Insurance Contract Renewal
    Interactive: Hypergeometric Distribution
  6.5 Poisson Distribution
    More Information
    Explained: Risk of Vaccination Damage
    Enhanced: Number of Customers in Service Department
    Interactive: Poisson Distribution
  6.6 Exponential Distribution
    More Information
    Explained: Number of Defects
    Enhanced: Equipment Failures
    Interactive: Exponential Distribution
  6.7 Normal Distribution
    Standardized Random Variable
    Standard Normal Distribution
    Confidence Interval
    More Information
    Other Properties of the Normal Distribution
    Standard Normal Distribution
    Explained: Normal Distributed Random Variable
    Interactive: Normal Distribution
  6.8 Central Limit Theorem
    Central Limit Theorem
    More Information
    Explained: Application to a Uniform Random Variable
  6.9 Approximation of Distributions
    Normal Distribution as Limit of Other Distributions
    Explained: Wrong Tax Returns
    Enhanced: Storm Damage
  6.10 Chi-Square Distribution
    More Information
  6.11 t-Distribution (Student t-Distribution)
    More Information
  6.12 F-Distribution
    More Information
7 Sampling Theory
  7.1 Basic Ideas
    Population
    Sample
    Statistic
    More Information
    Explained: Illustrating the Basic Principles of Sampling Theory
  7.2 Sampling Distribution of the Mean
    Distribution of the Sample Mean
    More Information
    Explained: Sampling Distribution
    Enhanced: Gross Hourly Earnings of a Worker
  7.3 Distribution of the Sample Proportion
    Explained: Distribution of the Sample Proportion
    Enhanced: Drawing Balls from an Urn
  7.4 Distribution of the Sample Variance
    Distribution of the Sample Variance $S^2$
    Probability Statements About $S^2$
    More Information
    Explained: Distribution of the Sample Variance
8 Estimation
  8.1 Estimation Theory
    Point Estimation
    The Estimator or Estimating Function
    Explained: Basic Examples of Estimation Procedures
  8.2 Properties of Estimators
    Mean Squared Error
    Unbiasedness
    Asymptotic Unbiasedness
    Efficiency
    Consistency
    More Information
    Explained: Properties of Estimators
    Enhanced: Properties of Estimation Functions
  8.3 Construction of Estimators
    Maximum Likelihood
    Least Squares Estimation
    More Information
    Applications of ML
    Application of Least Squares
    Explained: ML Estimation of an Exponential Distribution
    Explained: ML Estimation of a Poisson Distribution
  8.4 Interval Estimation
  8.5 Confidence Interval for the Mean
    Confidence Interval for the Mean with Known Variance
    Confidence Interval for the Mean with Unknown Variance
    Explained: Confidence Intervals for the Average Household Net Income
    Enhanced: Confidence Intervals for the Lifetime of a Bulb
    Interactive: Confidence Intervals for the Mean
  8.6 Confidence Interval for Proportion
    Properties of Confidence Intervals
    Explained: Confidence Intervals for the Percentage of Votes
    Interactive: Confidence Intervals for the Proportion
  8.7 Confidence Interval for the Variance
    Properties of the Confidence Interval
    Explained: Confidence Intervals for the Variance of Household Net Income
    Interactive: Confidence Intervals for the Variance
  8.8 Confidence Interval for the Difference of Two Means
    1. Case: The Variances $\sigma_1^2$ and $\sigma_2^2$ of the Two Populations Are Known
    Properties of the Confidence Interval
    2. Case: The Variances $\sigma_1^2$ and $\sigma_2^2$ of the Two Populations Are Unknown
    Properties of Confidence Intervals When Variances Are Unknown
    Explained: Confidence Interval for the Difference of Car Gas Consumptions
    Enhanced: Confidence Intervals of the Difference of Two Mean Stock Prices
    Interactive: Confidence Intervals for the Difference of Two Means
  8.9 Confidence Interval Length
    (a) Confidence Interval for
    (b) Confidence Interval for
    Explained: Finding a Required Sample Size
    Enhanced: Finding the Sample Size for an Election Threshold
    Interactive: Confidence Interval Length for the Mean
9 Statistical Tests
  9.1 Key Concepts
    Formulating the Hypothesis
    Test Statistic
    Decision Regions and Significance Level
    Non-rejection Region of Null Hypothesis
    Rejection Region of Null Hypothesis
    Power of a Test
    OC-Curve
    A Decision-Theoretical View on Statistical Hypothesis Testing
    More Information: Examples
    More Information: Hypothesis Testing Using Statistical Software
  9.2 Testing Normal Means
    Hypotheses
    Test Statistic, Its Distribution, and Derived Decision Regions
    Calculating the Test Statistic from an Observed Sample
    Test Decision and Interpretation
    Power
    More Information: Conducting a Statistical Test
    Explained: Testing the Population Mean
    Enhanced: Average Life Time of Car Tires
    Hypothesis
    1st Alternative
    2nd Alternative
    3rd Alternative
    Interactive: Testing the Population Mean
    Interactive: Testing the Population Mean with Type I and II Error
  9.3 Testing the Proportion in a Binary Population
    Hypotheses
    Test Statistic and Its Distribution: Decision Regions
    Sampling and Computing the Test Statistic
    Test Decision and Interpretation
    Power Curve $P(\pi)$
    Explained: Testing a Population Proportion
    Enhanced: Proportion of Credits with Repayment Problems
    Interactive: Testing a Proportion in a Binary Population
  9.4 Testing the Difference of Two Population Means
    Hypotheses
    Test Statistic and Its Distribution: Decision Regions
    Sampling and Computing the Test Statistic
    Test Decision and Interpretation
    Explained: Testing the Difference of Two Population Means
    Enhanced: Average Age Difference of Female and Male Bank Employees
    1st Dispute
    2nd Dispute
    3rd Dispute
    Interactive: Testing the Difference of Two Population Means
  9.5 Chi-Square Goodness-of-Fit Test
    Hypothesis
    How Is $p_j$ Computed?
    Test Statistic and Its Distribution: Decision Regions
    Approximation Conditions
    Sampling and Computing the Test Statistic
    Test Decision and Interpretation
    More Information
    Explained: Conducting a Chi-Square Goodness-of-Fit Test
    Enhanced: Goodness-of-Fit Test for Product Demand
    1st Version
    2nd Version
  9.6 Chi-Square Test of Independence
    Hypothesis
    Test Statistic and Its Distribution: Decision Regions
    Sampling and Computing the Test Statistic
    Test Decision and Interpretation
    More Information
    Explained: The Chi-Square Test of Independence in Action
    Enhanced: Chi-Square Test of Independence for Economic Situation and Outlook
10 Two-Dimensional Frequency Distribution
  10.1 Introduction
  10.2 Two-Dimensional Frequency Tables
    Realizations m r
    Absolute Frequency
    Relative Frequency
    Properties
    Explained: Two-Dimensional Frequency Distribution
    Enhanced: Department Store
    Interactive: Example for Two-Dimensional Frequency Distribution
  10.3 Graphical Representation of Multidimensional Data
    Frequency Distributions
    Scatterplots
    Explained: Graphical Representation of a Two- or Higher Dimensional Frequency Distribution
    Interactive: Example for the Graphical Representation of a Two- or Higher Dimensional Frequency Distribution
  10.4 Marginal and Conditional Distributions
    Marginal Distribution
    Conditional Distribution
    Explained: Conditional Distributions
    Enhanced: Smokers and Lung Cancer
    Enhanced: Educational Level and Age
  10.5 Characteristics of Two-Dimensional Distributions
    Covariance
    More Information
    Explained: How the Covariance Is Calculated
  10.6 Relation Between Continuous Variables (Correlation, Correlation Coefficients)
    Properties of the Correlation Coefficient
    Relation of Correlation and the Scatterplot of X and Y Observations
    Explained: Relationship of Two Metrically Scaled Variables
    Interactive: Correlation Coefficients
  10.7 Relation Between Discrete Variables (Rank Correlation)
    Spearman’s Rank Correlation Coefficient
    Kendall’s Rank Correlation Coefficient
    Explained: Relationship Between Two Ordinally Scaled Variables
    Interactive: Example for the Relationship Between Two Ordinally Scaled Variables
  10.8 Relationship Between Nominal Variables (Contingency)
    Explained: Relationship Between Two Nominally Scaled Variables
    Interactive: Example for the Relationship Between Two Nominally Scaled Variables
11 Regression
  11.1 Regression Analysis
    The Objectives of Regression Analysis
  11.2 One-Dimensional Regression Analysis
    One-Dimensional Linear Regression Function
    Quality (Fit) of the Regression Line
    One-Dimensional Nonlinear Regression Function
    Explained: One-Dimensional Linear Regression
    Enhanced: Crime Rates in the US
    Enhanced: Linear Regression for the Car Data
    Interactive: Simple Linear Regression
  11.3 Multi-Dimensional Regression Analysis
    Multi-Dimensional Regression Analysis
12 Time Series Analysis
  12.1 Time Series Analysis
    Definition
    Graphical Representation
    The Objectives of Time Series Analysis
    Components of Time Series
  12.2 Trend of Time Series
    Method of Moving Average
    Least-Squares Method
    More Information: Simple Moving Average
    Explained: Calculation of Moving Averages
    Interactive: Test of Different Filters for Trend Calculation
  12.3 Periodic Fluctuations
    Explained: Decomposition of a Seasonal Series
    Interactive: Decomposition of Time Series
  12.4 Quality of the Time Series Model
    Mean Squared Dispersion (Estimated Standard Deviation)
    Interactive: Comparison of Time Series Models
A Data Sets in the Interactive Examples
  A.1 ALLBUS Data
    A.1.1 ALLBUS1992, ALLBUS2002, and ALLBUS2012: Economics
    A.1.2 ALLBUS1994, ALLBUS2002, and ALLBUS2012: Trust
    A.1.3 ALLBUS2002, ALLBUS2004, and ALLBUS2012: General
  A.2 Boston Housing Data
  A.3 Car Data
  A.4 Credit Data
  A.5 Decathlon Data
  A.6 Hair and Eye Color of Statistics Students
  A.7 Index of Basic Rent
  A.8 Normally Distributed Data
  A.9 Telephone Data
  A.10 Titanic Data
  A.11 US Crime Data
Glossary

Chapter 1
Basics

1.1 Objectives of Statistics

A Definition of Statistics

Statistics is the science of collecting, describing, and interpreting data, i.e., the tool box underlying empirical research. In analyzing data, scientists aim to describe our perception of the world.
Descriptions of stable relationships among observable phenomena in the form of theories are sometimes referred to as being explanatory. (Though one could argue that science merely describes how things happen rather than why.) Inventing a theory is a creative process of restructuring information embedded in existing (and accepted) theories and extracting exploitable information from the real world. (We are abstracting from purely axiomatic theories derived by logical deduction.) A first exploratory approach to groups of phenomena is typically carried out using methods of statistical description.

Descriptive Statistics

Descriptive statistics encompasses tools devised to organize and display data in an accessible fashion, i.e., in a way that doesn’t exceed the perceptual limits of the human mind. It involves the quantification of recurring phenomena. Various summary statistics, mainly averages, are calculated; raw data and statistics are displayed using tables and graphs.

© Springer International Publishing Switzerland 2015. W.K. Härdle et al., Introduction to Statistics, DOI 10.1007/978-3-319-17704-5_1

Statistical description can offer important insights into the occurrence of isolated phenomena and indicate associations among them. But can it provide results that can be considered laws in a scientific context? Statistics is a means of dealing with variations in characteristics of distinct objects. Isolated objects are thus not representative of the population of objects possessing the quantifiable feature under investigation. Yet variability can be the result of the (controlled or random) variation of other, underlying variables. Physics, for example, is mainly concerned with the extraction and mathematical formulation of exact relationships, not leaving much room for random fluctuations. In statistics such random fluctuations are modeled. Statistical relationships are thus relationships which account for a certain proportion of stochastic variability.
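One common way to write down this contrast is shown below. The notation is an assumption made for illustration and does not appear in the text: $f$ stands for the systematic part of the relationship between an explanatory quantity $x$ and a response $y$, and $\varepsilon$ for a random fluctuation.

```latex
\begin{align*}
\text{exact relationship:} \quad & y = f(x) \\
\text{statistical relationship:} \quad & y = f(x) + \varepsilon
\end{align*}
```

In this reading, the share of the variation in $y$ that is not captured by $f(x)$ is what the text calls stochastic variability.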

Inductive Statistics

In contrast to wide areas of physics, empirical relationships observed in the natural sciences, sociology, and psychology (and more eclectic subjects such as economics) are statistical. Empirical work in these fields is typically carried out on the basis of experiments or sample surveys. In either case, the entire population cannot be observed, either for practical or economic reasons. Inferring from a limited sample of objects to characteristics prevailing in the underlying population is the goal of inferential or inductive statistics. Here, variability is a reflection of variation in the sample and the sampling process.

Statistics and the Scientific Process

Depending on the stage of the scientific investigation, data are examined with varying degrees of prior information. Data can be collected to explore a phenomenon in a first approach, but it can also serve to statistically test (verify/falsify) hypotheses about the structure of the characteristic(s) under investigation. Thus, statistics is applied at all stages of the scientific process wherever quantifiable phenomena are involved. Here, our concept of quantifiability is sufficiently general to encompass a very broad range of scientifically interesting propositions. Take, for example, a proposition such as “a bumble bee is flying by.” By counting the number of such occurrences in various settings we are quantifying the occurrence of the phenomenon. On this basis we can try to infer the likelihood of coming across a bumble bee under specific circumstances (e.g., on a rainy summer day in Berlin).
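
The step from counting occurrences to a likelihood can be sketched in a few lines of Python. The counts are invented for illustration and are not data from the text; the names `sightings`, `observation_days`, and `relative_frequency` are hypothetical. The sketch simply takes the share of observed days with a sighting as a rough estimate of the likelihood in each setting.

```python
# Hypothetical counts: number of observed days on which a bumble bee
# was seen, per setting. The numbers are made up for illustration.
sightings = {"sunny summer day": 42, "rainy summer day": 9}
observation_days = {"sunny summer day": 60, "rainy summer day": 45}


def relative_frequency(hits: int, trials: int) -> float:
    """Share of observation days with at least one sighting."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    return hits / trials


# Estimated likelihood of a sighting in each setting.
estimates = {
    setting: relative_frequency(sightings[setting], observation_days[setting])
    for setting in sightings
}

for setting, estimate in estimates.items():
    print(f"{setting}: estimated likelihood {estimate:.2f}")
```

How far such an estimate can be trusted depends on how the days were sampled, which is the subject of inductive statistics.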

Table 1.1 Absolute frequencies of numbers in National Lottery

Number       1    2    3    4    5    6    7
Frequency  311  337  345  316  321  335  322

Number       8    9   10   11   12   13   14
Frequency  309  324  331  315  302  276  310

Number      15   16   17   18   19   20   21
Frequency  322  319  337  331  326  312  334

Number      22   23   24   25   26   27   28
Frequency  322  319  304  325  337  323  285

Number      29   30   31   32   33   34   35
Frequency  321  311  333  378  340  291  330

Number      36   37   38   39   40   41   42
Frequency  340  320  357  326  329  335  335

Number      43   44   45   46   47   48   49
Frequency  311  314  304  327  311  337  361

Fig. 1.1 Absolute frequencies of numbers in the National Lottery from 1955 to 2007

Explained: Descriptive and Inductive Statistics

Descriptive statistics provide the means to summarize and visualize data. Table 1.1, which contains the frequency distribution of numbers drawn in the National Lottery, provides an example of such a summary. Cursory examination suggests that some numbers occur more frequently than others (Fig. 1.1). Does this suggest bias in the way numbers are selected? As we shall see, statistical methods can also be used to test such propositions.

1.2 Statistical Investigation

Conducting a Statistical Investigation

Statistical investigations often involve the following steps:

1. Designing the investigation: development of the objectives, translation of theoretical concepts into observable phenomena (i.e., variables), environmental setting (e.g., determining which parameters are held constant), cost projection, etc.
2. Obtaining data
   Primary data: data collected by the institution conducting the investigation
   – Surveys: recording data without exercising control over environmental conditions which could influence the observations
     · observing all members of the population (census) or taking a sample (sample survey)
     · collecting data by interview or by measurement
     · documentation of data via questionnaires, protocols, etc.
     · personal vs. indirect observation (e.g., personal interview, questionnaires by post, telephone, etc.)
   – Experiments: actively controlling variables to capture their impact on other variables
   – Automated recording: observing data as it is being generated, e.g., within a production process
   Secondary data: using readily available data, either from internal or external sources.
3. Organizing the data
4. Analysis: applying statistical tools
5. Interpretation: which conclusions does the quantitative information generated by statistical procedures support?

Sources of Economic Data

– Public Statistics
– Private Statistics
– International Organizations

Figure 1.2 illustrates the sequence of steps in a statistical investigation.

Fig. 1.2 Overview of steps in statistical investigation (flow chart; box labels include Economic Theory, Theoretical Concepts, Judicial & Institutional Environment, Cost Consideration, Economic Policy, Values and Goals, Feasibility, Adjustment, Investigation, Judgement, Processing, Decision, Evaluation, Interpretation, Presentation of results)

Table 1.2 Data on animal populations of Berlin’s three major zoos

Animals in Berlin, 2013             Zoo and Aquarium     Tierpark
Mammals        Total population                 1044         1283
               Species                           169          199
Birds          Total population                 2092         2380
               Species                           319          356
Snakes         Total population                  357          508
               Species                            69          103
Lizards        Total population                  639           55
               Species                            54            3
Fish           Total population                 7629          938
               Species                           562          106
Invertebrate   Total population                 8604         2086
               Species                           331           79
Visitors                                     3059136      1035899

Data from: Financial reports 2013 of Zoologischer Garten Berlin AG, Tierpark Berlin GmbH and Amt für Statistik Berlin-Brandenburg

Explained: Public Sources of Data

The official body engaged in collecting and publishing Berlin-specific data is the Amt für Statistik Berlin-Brandenburg. For example, statistics on such disparate subjects as the animal populations of Berlin’s three major zoos and voter participation in general elections are available (Table 1.2). The data on voter participation cover the election to the 8th European Parliament in Berlin (25 May 2014).
The map displays voter participation in the election districts of Berlin (Fig. 1.3).

Fig. 1.3 Voter participation in Berlin (Data from: Amt für Statistik Berlin-Brandenburg 2014). Legend classes: 0–10%, 10–20%, 20–30%, 30–40%, 40–50%, 50–60%

More Information: Statistical Processes

A common objective of economic policy is to reduce the overall duration of unemployment in the economy. An important theoretical question is to what extent the level of unemployment benefits can account for variations in unemployment duration.

In order to make this question suitable for a statistical investigation, the variables must be translated into directly observable quantities. (For example, the number of individuals who are registered as unemployed is a quantity available from government statistics. While this may not include everyone who would like to be working, it is usually used as the unemployment variable in statistical analyses.)

By examining government unemployment benefit payments in different countries, we can try to infer whether more or less generous policies have an impact on the unemployment rate.

Prior to further investigation, the collected raw data must be organized in a fashion suitable for the statistical methods to be performed upon them. Exploring the data for extractable information and presenting the results in an accessible fashion by means of statistical tools lies at the heart of statistical investigation.

In interpreting quantitative statistical information, keys to an answer to the initial scientific questions are sought. Analogous to the general scientific process, conclusions reached in the course of statistical interpretation frequently give rise to further propositions, triggering the next iteration of the statistical process.

1.3 Statistical Element and Population

Statistical Elements

Objects whose attributes are observed or measured for statistical purposes are called statistical elements.
In order to identify all elements relevant to a particular investigation, one must specify their defining characteristics as well as temporal and spatial dimensions.

Example: Population Census in Germany
  defining characteristic: citizen of Germany
  spatial: permanent address in Federal Republic of Germany
  temporal: date of census

Population

The universe of statistical elements covered by a particular set of specifications is called the population. In general, increasing the number of criteria to be matched by the elements will result in a smaller and more homogeneous population. Populations can be finite or infinite in size.

In a census, all elements of the population are investigated. Recording information from a portion of the population yields a sample survey.

The stock of elements constituting a population may change over time, as some elements leave and others enter the population. This sensitivity of populations to time flow has to be taken into account when carrying out statistical investigations.

Explained: Statistical Elements and Population

We use the following questionnaire, developed at the Department of Statistics, to clarify the notions of statistical unit and population. This questionnaire was filled out by all participants in Statistics 1. The investigation was carried out at the first lecture of the summer semester in 1999. The population consists of all students taking part in Statistics 1 at Humboldt University during the summer semester 1999. The statistical unit is one student.

HUMBOLDT UNIVERSITY BERLIN
Department of Statistics - Statistics 1

QUESTIONNAIRE

Welcome to Statistics. Before we start, we would like to ask you a few questions. Your answers will help us to optimize the lectures. Furthermore, your answers will be statistically analyzed during the lecture. Everybody who fills in this questionnaire will have a chance to win the multimedia version of Statistics 1, which is worth 200,- DM.

1. Do you have access to the internet?
   [ ] Yes   [ ] No
If yes:
2. From where do you connect to the internet?
   [ ] home   [ ] university   [ ] internet cafe   [ ] friends   [ ] other (please, specify):
3. Which internet browser do you usually use?
   [ ] Netscape 4.5 or newer
   [ ] older version of Netscape
   [ ] Netscape, I do not know the version number
   [ ] Internet Explorer 4 or newer
   [ ] older version of Internet Explorer
   [ ] Internet Explorer, I do not know the version number
4. Do you have access to a multimedia computer (i.e., a computer which can be used to play audio and video files)?
   [ ] Yes   [ ] No
5. Have you previously studied Theory of Probability or Stochastics?
   [ ] Yes   [ ] No
6. What is the probability that the sum of numbers of two dice is seven?
7. In which state did you attend secondary school?
   [ ] Baden-Württemberg   [ ] Bayern   [ ] Berlin   [ ] Brandenburg
   [ ] Bremen   [ ] Hamburg   [ ] Hessen   [ ] Mecklenburg-Vorpommern
   [ ] Niedersachsen   [ ] Nordrhein-Westfalen   [ ] Rheinland-Pfalz   [ ] Saarland
   [ ] Sachsen   [ ] Sachsen-Anhalt   [ ] Schleswig-Holstein   [ ] Thüringen

Thank you. If you would like to win the multimedia CD, please complete the following entries:
Name:
ID number:

1.4 Statistical Variable

An observable characteristic of a statistical element is called a statistical variable. The actual values assumed by statistical variables are called observations, measurements, or data. The set of possible values a variable can take is called the sample space.

Variables are denoted by script capitals $X, Y, \ldots$, whereas corresponding realizations are written in lowercase: $x_1, x_2, \ldots$; the indices reflect the statistical elements sampled.

Variable  Observations
X         $x_1, x_2, x_3, \ldots$
Y         $y_1, y_2, y_3, \ldots$

It is useful to differentiate between variables used for identification and target variables.

Identification variables: By assigning a set of fixed values to these variables, the elements of the population are specified.
For example, restricting a statistical investigation to female persons involves setting the identification variable “sex” to “female.”

Target variables: These are the characteristics of interest, the phenomena that are being explored by means of statistical techniques, e.g., the age of persons belonging to a particular population.

Example: The objective of the statistical investigation is to explore Berlin’s socio-economic structure as of 31 December 1995. The identification variables are chosen to be:
  legal: citizen
  spatial: permanent address in Berlin
  temporal: 31 December 1995
Statistical element: a registered citizen of Berlin on 31 December 1995
Population: all citizens of Berlin on 31 December 1995
Possible target variables:

Symbol  Variable                Sample space
X       Age (rounded to years)  $\{0, 1, 2, \ldots\}$
S       Sex                     {female, male}
T       Marital status          {single, married, divorced}
Y       Monthly income          $[0, \infty)$

1.5 Measurement Scales

The values that random variables take can differ considerably in nature, as can be seen in the above table. They can be classified into quantitative, i.e., numerically valued (age and income) and qualitative, i.e., categorical (sex, marital status) variables. As numerical values are usually assigned to observations of qualitative variables, they may appear quantitative. Yet such synthetic assignments aren’t of the same quality as numerical measurements that naturally arise in observing a phenomenon. The crucial distinction between quantitative and qualitative variables lies in the properties of the actual scale of measurement, which in turn is crucial to the applicability of statistical methods. In developing new tools statisticians make assumptions about permissible measurement scales.

A measurement is a numerical assignment to an observation. Some measurements appear more natural than others.
By measuring the height of persons, for example, we apply a yardstick that ensures comparability between observations up to almost any desired precision, regardless of the units (such as inches or centimeters). School grades, on the other hand, represent a relatively rough classification indicating a certain ranking, yet putting many pupils into the same category. The values assigned to qualitative statements like “very good,” “average,” etc. are an arbitrary yet practical shortcut in assessing people’s achievements. As there is no conceptual reasoning behind a school grade scale, one should not try to interpret the “distances” between grades. Clearly, height measurements convey more information than school marks, as distances between measurements can consistently be compared. Statements such as “Tom is twice as tall as his son” or “Manuela is 35 centimeters shorter than her partner” are permissible.

As statistical methods are developed in mathematical terms, the applicable scales are also defined in terms of mathematical concepts. These concepts are the transformations that can be imposed on a scale without loss of information. The wider the range of permissible transformations, the less information the scale can convey. Table 1.3 lists common measurement scales in increasing order of information content. Scales carrying more information can always be transformed into less informative scales.

1.6 Qualitative Variables

Nominal Scale

The most primitive scale, one that is only capable of expressing whether two values are equal or not, is the nominal scale. It is purely qualitative. If an experiment’s sample space consists of categories without a natural ordering, the corresponding random variable is nominally scaled. The distinct numbers assigned to outcomes merely indicate whether any two outcomes are equal or not.
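
A small Python sketch can make this concrete. The answers and both numeric codings below are invented for illustration (the category names are borrowed from the marital status variable above, and `coding_a`, `coding_b`, and `equality_pattern` are hypothetical names). Any one-to-one recoding leaves the equality statements and the frequencies intact, while an average of the codes changes with the arbitrary numbering and therefore carries no meaning.

```python
from collections import Counter
from statistics import mean

# Hypothetical answers on a nominal variable (marital status).
answers = ["single", "married", "married", "divorced", "single", "married"]

# Two arbitrary numeric codings; both are one-to-one, hence permissible.
coding_a = {"single": 1, "married": 2, "divorced": 3}
coding_b = {"single": 30, "married": 7, "divorced": 12}

coded_a = [coding_a[x] for x in answers]
coded_b = [coding_b[x] for x in answers]


def equality_pattern(values):
    """For every pair of observations, record whether the two are equal."""
    return [v == w for i, v in enumerate(values) for w in values[i + 1:]]


# Equality statements and category frequencies survive the recoding ...
same_pattern = equality_pattern(coded_a) == equality_pattern(coded_b)
freq_a = sorted(Counter(coded_a).values())
freq_b = sorted(Counter(coded_b).values())

# ... whereas the average of the codes depends on the numbering chosen.
mean_a, mean_b = mean(coded_a), mean(coded_b)

print(same_pattern, freq_a == freq_b, mean_a != mean_b)
```

Only statistics that are unaffected by such recodings, for example frequencies, are meaningful on a nominal scale.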

Table 1.3 Measurement scales of random variables

Variable        Measurement scale   Statements                                            Permissible transformations
Qualitative     Nominal scale       Equivalence                                           Any equivalence preserving mapping
(categorical)   Ordinal scale       Equivalence, order                                    Any order preserving mapping
Quantitative    Interval scale      Equivalence, order, distance                          $y = \alpha x + \beta$, $\alpha > 0$
(metric)        Ratio scale         Equivalence, order, distance, ratio                   $y = \alpha x$, $\alpha > 0$
                Absolute scale      Equivalence, order, distance, ratio, absolute level   Identity function

For example, numbers assigned to different political opinions may be helpful in compiling results from questionnaires. Yet in comparing two opinions we can only relate them as being of the same kind or not. The numbers do not establish any ranking.

Variables with exactly two mutually exclusive outcomes are called binary variables or dichotomous variables. If the indicator numbers assigned convey information about the ranking of the categories, a binary variable might also be regarded as ordinally scaled.

If the categories (events) constituting the sample space are not mutually exclusive, i.e., one statistical element can correspond to more than one category, we call the variable cumulative. For example, a person might respond affirmatively to different categories of professional qualifications. But there cannot be more than one current full-time employment (by definition).

Ordinal Scale

If the numbers assigned to measurements express a natural ranking, the variable is measured on an ordinal scale.

The distances between different values cannot be interpreted; a variable measured on an ordinal scale is thus still essentially nonquantitative. For example, school marks reflect different levels of achievement.
There is, however, usually no reason to regard a work receiving a grade of “4” as twice as good as one that achieved a grade of “2.”

As the numbers assigned to measurements reflect their ranking relative to each other, they are called rank values.

There are numerous examples of ordinally scaled variables in psychology, sociology, business studies, etc. Scales can be designed attempting to measure such vague concepts as “social status,” “intelligence,” “level of aggression,” or “level of satisfaction.”

1.7 Quantitative Variables

Apart from possessing a natural ordering, measurements of quantitative variables can also be interpreted in terms of distances between observations.

Interval Scale

If distances between measurements can be interpreted meaningfully, the variable is measured on an interval scale. In contrast to the ratio scale, ratios of measurements don’t have a substantial meaning, for the interval scale doesn’t possess a natural zero value. For example, temperatures measured in degrees centigrade can be interpreted in order of higher or lower levels. Yet, a temperature of 20 degrees centigrade cannot be regarded to be twice as high as a temperature of 10 degrees. Think of equivalent temperatures measured in Fahrenheit. Converting temperatures from centigrade to Fahrenheit and vice versa involves shifting the zero point.

Ratio Scale

Values of variables measured on a ratio scale can be interpreted both in terms of distances and ratios. The ratio scale thus conveys even more information than the interval scale, in which only intervals (distances between observations) are quantitatively meaningful. The phenomena to be measured on a ratio scale possess a natural zero element, representing total lack of the attribute. Yet there isn’t necessarily a natural measurement unit. Prominent examples are weight, height, age, etc.

Absolute Scale

The absolute scale is a metric scale with a natural unit of measurement.
Absolute scale measurement is thus simply counting. It is the only measurement without alternative. Example: All countable phenomena such as the number of people in a room or number of balls in an urn.

Discrete Variable

A metric variable that can take a finite or countably infinite set of values is called discrete. Example: Monthly production of cars or number of stars in the universe.

Continuous Variable

A metric variable is called continuous if it can take on an uncountable number of values in any interval on the number line. Example: Petrol sold in a specific period of time.

In practice, many theoretically continuous variables are measured discretely due to limitations in the precision of physical measurement devices. Measuring a person’s age can be carried out to a certain fraction of a second, but not infinitely precisely. We regard a theoretically continuous variable, which we can measure with a certain sufficient precision, as effectively continuous. Similar reasoning applies to discrete variables, which we sometimes regard as quasi-continuous if there are enough values to suggest the applicability of statistical methods devised for continuous variables.

1.8 Grouping Continuous Data

Consider height data on 100 school boys. In order to gain an overview of the distribution of heights you start “reading” the raw data. But the typical person will soon discover that making sense of more than, say, 10 observations without some process of simplification is hardly practical. Intuitively, one starts to group individuals with similar heights. By focusing on the size of these groupings rather than on the raw data itself one gains an overview of the data. Even though one has set aside detailed information about exact heights, one has created a clearer overall picture.

Data sampled from continuous or quasi-continuous random variables can be condensed by partitioning the sample space into mutually exclusive classes.
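
Such a partition can be sketched in Python for the height example. The heights are simulated rather than real data, and the class boundaries (140, 150, and 160 cm) as well as the names `boundaries`, `labels`, and `counts` are assumptions made for illustration. Each boundary is assigned to the class above it, so that the classes are mutually exclusive.

```python
import random
from bisect import bisect_right

random.seed(1)  # fixed seed so that the sketch is reproducible

# Hypothetical heights (cm) of 100 school boys; simulated, not real data.
heights = [random.gauss(150.0, 8.0) for _ in range(100)]

# Assumed classes: below 140, 140 to < 150, 150 to < 160, 160 or more.
boundaries = [140.0, 150.0, 160.0]
labels = ["< 140", "140 to < 150", "150 to < 160", ">= 160"]

# bisect_right gives the index of the class that contains x; a value
# equal to a boundary falls into the class starting at that boundary.
counts = [0] * len(labels)
for x in heights:
    counts[bisect_right(boundaries, x)] += 1

for label, count in zip(labels, counts):
    print(f"{label:>14}: {count}")
```

Four counts now stand in for one hundred individual measurements, which is the simplification described above.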
Counting the number of realizations falling into each of these classes is a means of providing a descriptive summary of the data. Grouping data into classes can greatly enhance our ability to “see” the structure of the data, i.e., the distribution of the realizations over the sample space.

Table 1.4 Example for grouping of continuous variables

1st alternative
  Less than 10                                   $x < 10$
  10 to less than 12                             $10 \le x < 12$
  12 to less than 15                             $12 \le x < 15$
  15 or greater                                  $x \ge 15$
2nd alternative
  Less than or equal to 10                       $x \le 10$
  Greater than 10 to less than or equal to 12    $10 < x \le 12$
  Greater than 12 to less than or equal to 15    $12 < x \le 15$
  Greater than 15                                $x > 15$

Classes are nonoverlapping intervals specified by their upper and lower limits (class boundaries). Loss of information arises from replacing the actual values by the sizes and location of the classes into which they fall. If one uses too few classes, then useful patterns may be concealed. Too many classes may inhibit the expositional value of grouping.

Class boundaries: The upper and lower values of a class are called class boundaries. A class $j$ is fully specified by its lower boundary $x_j^l$ and upper boundary $x_j^u$ $(j = 1, \ldots, k)$, where $x_j^u = x_{j+1}^l$ $(j = 1, \ldots, k-1)$, i.e., the upper boundary of the $j$th class and the lower boundary of the $(j+1)$th class coincide. Either $x_j^l < x \le x_j^u$ or $x_j^l \le x < x_j^u$ $(j = 1, \ldots, k)$, i.e., the class boundary can be attributed to either of the classes it separates (Table 1.4).

When measurements of (theoretically) unbounded variables are being classified, left- and/or right-most classes extend to $-\infty$, $+\infty$, respectively, i.e., they form a semi-open interval.

Class width: Taking the difference between two boundaries of a class yields the class width (sometimes referred to as the class size). Classes need not be of equal width:

$\Delta x_j = x_j^u - x_j^l \quad (j = 1, \ldots, k)$

Class midpoint: The class midpoint $x_j$ can be interpreted as a representative value for the class, if the measurements falling into it are evenly or symmetrically distributed.

$x_j = \frac{x_j^l + x_j^u}{2} \quad (j = 1, \ldots, k)$

Table 1.5 Income distribution in Germany

Taxable income (marks)    Persons (1000)    Consolidated gross income (mio. marks)
1 – 4000                          1445.2                                    2611.3
4000 – 8000                       1455.5                                    8889.2
8000 – 12000                      1240.5                                   12310.9
12000 – 16000                     1110.7                                   15492.7
16000 – 25000                     2762.9                                   57218.5
25000 – 30000                     1915.1                                   52755.4
30000 – 50000                     6923.7                                  270182.7
50000 – 75000                     3876.9                                  234493.1
75000 – 100000                    1239.7                                  105452.9
100000 – 250000                    791.6                                  108065.7
250000 – 500000                     93.7                                   31433.8
500000 – 1 Mio.                     26.6                                   17893.3
1 Mio. – 2 Mio.                      8.6                                   11769.9
2 Mio. – 5 Mio.                      3.7                                   10950.8
5 Mio. – 10 Mio.                     0.9                                    6041.8
10 Mio. or more                      0.5                                   10749.8

Data from: Datenreport 1992, p. 255; Statistisches Jahrbuch der Bundesrepublik Deutschland 1993, p. 566

Explained: Grouping of Data

Politicians and political scientists are interested in the income distribution. In Germany, a large portion of the population has taxable income. The 1986 data, compiled from various official sources, display a concentration in small and medium income brackets. Relatively few individuals earned more than one million marks. Greater class widths have been chosen for higher income brackets to retain a compact exposition despite the skewness in the data (Table 1.5).

1.9 Statistical Sequences and Frequencies

Statistical Sequence

In recording data we generate a statistical sequence. The original, unprocessed sequence is called raw data. Given an appropriate scale level (i.e., at least an ordinal scale), we can sort the raw data, thus creating an ordered sequence.

Data collected at the same point in time or for the same period of time on different elements are called cross-section data. Data collected at different points in time or for different periods of time on the same element are called time series data.
The sequence of observations is ordered along the time axis.

Frequency
The number of observations falling into a given class is called the frequency. Classes are constructed to summarize continuous or quasi-continuous data by means of frequencies. In discrete data one regularly encounters so-called ties, i.e., two or more observations taking on the same value. Thus, discrete data may not require grouping in order to calculate frequencies.

Absolute Frequency
Counting the number of observations taking on a specific value yields the absolute frequency:
$$h(X = x_j) = h(x_j) = h_j$$
When data are grouped, the absolute frequencies of classes are calculated as follows:
$$h(x_j) = h(x_j^l \le X < x_j^u)$$
Properties:
$$0 \le h(x_j) \le n, \qquad \sum_j h(x_j) = n$$

Relative Frequency
The proportion of observations taking on a specific value or falling into a specific class is called the relative frequency, i.e., the absolute frequency standardized by the total number of observations:
$$f(x_j) = \frac{h(x_j)}{n}$$
Properties:
$$0 \le f(x_j) \le 1, \qquad \sum_j f(x_j) = 1$$

Frequency Distribution
By standardizing class frequencies for grouped data by their respective class widths, frequencies for differently sized classes are made comparable. The resulting frequencies can be compiled to form a frequency distribution.
$$\hat{h}(x_j) = \frac{h(x_j)}{x_j^u - x_j^l}, \qquad \hat{f}(x_j) = \frac{f(x_j)}{x_j^u - x_j^l},$$
where $x_j^l$, $x_j^u$ are the lower and upper class boundaries with $x_j^l < x \le x_j^u$.

Explained: Absolute and Relative Frequency
150 persons have been asked for their marital status: 88 of them are married, 41 single, and 21 divorced. The four conceivable responses have been assigned categories as follows:
  single: $x_1$
  married: $x_2$
  divorced: $x_3$
  widowed: $x_4$
The number of statistical elements is $n = 150$.
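As a minimal sketch, the definitions of absolute and relative frequency can be applied to this survey in Python. The list `responses` is a hypothetical reconstruction of the raw data from the counts stated in the example, and the names `categories`, `h`, and `f` are chosen here for illustration only.

```python
from collections import Counter

# Category labels follow the text: x1 = single, x2 = married,
# x3 = divorced, x4 = widowed.
categories = ["single", "married", "divorced", "widowed"]

# Hypothetical raw data: the 150 answers rebuilt from the stated counts.
responses = ["married"] * 88 + ["single"] * 41 + ["divorced"] * 21

n = len(responses)
counts = Counter(responses)

# Absolute frequency h(x_j); categories that were never observed get 0.
h = {c: counts.get(c, 0) for c in categories}

# Relative frequency f(x_j) = h(x_j) / n.
f = {c: h[c] / n for c in categories}

for c in categories:
    print(f"{c:<9} h = {h[c]:3d}   f = {f[c]:.2f}")
```

The properties of both frequencies can be checked on the result: `sum(h.values())` equals `n`, and `sum(f.values())` equals 1 up to floating-point rounding.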
The absolute frequencies given above are:
$$h(x_1) = 41, \quad h(x_2) = 88, \quad h(x_3) = 21, \quad h(x_4) = 0$$
Dividing by the sample size $n = 150$ yields the relative frequencies:
$$f(x_1) = 41/150 \approx 0.27, \quad f(x_2) = 88/150 \approx 0.59, \quad f(x_3) = 21/150 = 0.14, \quad f(x_4) = 0/150 = 0.00$$
Thus, 59 % of the persons surveyed are married, 27 % are single, and 14 % are divorced. No one is widowed.

Chapter 2
One-Dimensional Frequency Distributions

2.1 One-Dimensional Distribution
The collection of information about class boundaries and relative or absolute frequencies constitutes the frequency distribution. For a single variable (e.g., height) we have a one-dimensional frequency distribution. If more than one variable is measured for each statistical unit (e.g., height and