Introduction to Statistical Learning
Document Details
2022
Seppo Pynnönen
Summary
This presentation, by Seppo Pynnönen, introduces statistical learning, covering various techniques, including supervised and unsupervised learning. It details different methods of estimating and predicting data, including examples like wages, stock market prediction, and customer consumption habits.
Full Transcript
Part I: Introduction
As of Oct 31, 2022

Some of the figures in this presentation are taken from "An Introduction to Statistical Learning, with Applications in R" (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.

Seppo Pynnönen, Applied Multivariate Statistical Analysis

1 Introduction
  Understanding Data
  A Brief History of Statistical Learning
  What is Statistical Learning
  Prediction
  Inference
  Estimating f
  Prediction Accuracy and Model Interpretability
  Assessing Model Accuracy
  Measuring the Quality of Fit
  Bias–Variance trade-off
  Classification Setting

Introduction

In these classes we take a statistical learning perspective on statistical techniques, of which the various multivariate analyses are a part. According to James et al. (p. 1), “Statistical learning refers to a vast set of tools for understanding data.”

Understanding Data

Tools for understanding data can be broadly classified as supervised or unsupervised.

Supervised learning involves building a statistical model for predicting, or estimating, an output based on one or more inputs. The key is to learn from a training data set and to use the training results for prediction as new input data become available.
The learning problem consists of inferring from the training data set the function that can be used to map future inputs to predictions, i.e., given input variables x and output variable(s) y, find a function f such that f(x) ≈ y in a predictive way.

Tools:
  Regression methods: find the functional mapping of input variables to quantitative output variable(s) (e.g., how wage is related to background variables such as age, education, and gender).
  Classification methods: find the functional mapping of input variables to a discrete set of classes (e.g., how different financial ratios predict firm solvency {solvent, non-solvent}).

Unsupervised learning: there are inputs but no supervising output; from such data we can learn relationships and structures.

Tools: various clustering methods.

To illustrate some applications of statistical learning, consider the two supervised data examples given in the book by James et al. For the unsupervised learning example, see the book. These (and other) data sets are available in the R package ISLR that accompanies the book (install the package in R and see help(package = "ISLR")).

Example 1 (Supervised learning: Continuous output)

Wages of a group of men from the Atlantic region of the US. The interest is in the relation/effects of various background factors (like age, education, calendar year) on wage.

[Figure: three panels showing annual wage (1 000 USD) against age, year, and education level.]

There is considerable variability in wages. The trend in the left-hand panel shows that wages tend to rise up to age 45, followed by some decrease at older ages.
The middle panel shows some increase over the years, and the right-hand panel shows a clear incremental effect of education (1 = no high school diploma, 5 = advanced graduate degree).

Example 2 (Supervised learning: Categorical output)

Predict the next day's direction (Up or Down) of the German stock market on the basis of the direction over the past few days (daily returns from the beginning of 2012 until Oct 17, 2018).

[Figure: three panels with today's direction (Up or Down) on the vertical axis.]
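
The classification task in Example 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method used in the slides or in James et al.: the classifier is a plain frequency table (predict the direction that most often followed the same pattern of the previous days in the training data), the function names are hypothetical, and the toy sequence is made up rather than taken from German market data.

```python
from collections import Counter, defaultdict


def fit_direction_table(directions, lags=2):
    """Count which direction followed each pattern of the previous `lags` days."""
    table = defaultdict(Counter)
    for t in range(lags, len(directions)):
        history = tuple(directions[t - lags:t])
        table[history][directions[t]] += 1
    return table


def predict_direction(table, history, default="Up"):
    """Return the most frequent follower of `history`; fall back to `default`
    (an arbitrary choice) when the pattern never occurred in training."""
    counts = table.get(tuple(history))
    if not counts:
        return default
    return counts.most_common(1)[0][0]


def accuracy(table, directions, start, lags=2):
    """Share of correct predictions for days start, start+1, ... of `directions`."""
    hits = 0
    total = 0
    for t in range(max(start, lags), len(directions)):
        if predict_direction(table, directions[t - lags:t]) == directions[t]:
            hits += 1
        total += 1
    return hits / total if total else float("nan")


# Hypothetical toy sequence of daily directions (not real market data).
toy = ["Up", "Up", "Down", "Up", "Up", "Down",
       "Up", "Up", "Down", "Up", "Down", "Down"]

# Train on the first 8 days, evaluate on the remaining 4.
table = fit_direction_table(toy[:8], lags=2)

print(predict_direction(table, ["Up", "Up"]))   # prints Down
print(accuracy(table, toy, start=8, lags=2))    # prints 0.5
```

Here the inputs x are the previous two days' directions and the output y is the next day's direction, so the table plays the role of the estimated f. Evaluating on days that were not used for training matters: accuracy on the training days alone would overstate how well the rule predicts new data.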