Linear Regression Models in Python

Questions and Answers

What does linear regression primarily analyze?

  • The logarithmic relationship between two variables.
  • The exponential relationship between two variables.
  • The quadratic relationship between two variables.
  • The linear relationship between two variables. (correct)

In linear regression, the relationship between variables must always be perfectly linear in real-world scenarios.

False (B)

In the equation $\hat{y} = mX + b$, what does 'm' represent?

slope

In the equation $\hat{y} = mX + b$, 'b' is referred to as the ______ or intercept.

constant

What is the goal of linear regression in terms of errors?

To minimize the errors. (D)

In multiple linear regression (MLR), it is not possible to have hundreds or even thousands of independent variables.

False (B)

What Python module provides classes and functions for estimating different statistical models, performing statistical tests, and statistical data exploration?

statsmodels

The easiest way to install statsmodels is through the ______ package.

Anaconda

Which library is described as the gold standard in Python for machine learning?

Scikit-learn (A)

When using scikit-learn for linear regression, it is necessary to manually load the data into pandas DataFrames before analysis.

False (B)

When using statsmodels and running a regression, what function must you use to add a constant/intercept?

sm.add_constant

OLS stands for ______ Least Squares.

Ordinary

Match the following terms with their descriptions:

  • SLR = Regression based on data where the slope and intercept are derived from the data.
  • Residuals = Differences between the actual y value and the estimated y value.
  • MSE = Average of the squared errors.
  • MLR = Using multiple independent variables.

What does a high $R^2$ value in a linear regression model indicate?

The model explains a large proportion of the variance in the dependent variable. (C)

If linear regression is performed on a dataset without splitting it into training and test sets, what might happen?

The model might not be able to generalize well to new, unseen data. (C)

Flashcards

Linear Regression

A statistical model that examines the linear relationship between two or more variables.

Dependent Variable

The variable you are trying to predict or estimate in a regression model.

Independent Variables

Variables used to predict the dependent variable.

Statsmodels

A Python module providing classes and functions for estimating different statistical models.

Scikit-learn

A Python library for machine learning, offering tools for regression, classification, clustering, and dimensionality reduction.

Intercept

The point where the regression line intersects the y-axis; the value of y when x=0.

Ordinary Least Squares (OLS)

A way to find the best-fitting line by minimizing the sum of the squares of the differences between observed and predicted values.

R-squared

The proportion of the variance in the dependent variable that is predictable from the independent variable(s).

Adjusted R-squared

A measure of how well the model fits the data, adjusted for the number of independent variables used.

datasets.load_boston()

Loads the Boston housing dataset from scikit-learn's datasets module.

Pandas

A Python data analysis library providing data structures like DataFrames for efficient data manipulation and analysis.

sm.add_constant(X)

Adds a constant (intercept) to the predictor variables in a regression model, allowing the regression line to have a y-intercept.

sm.OLS(y, X).fit()

A method in statsmodels used to fit an ordinary least squares regression model to the data.

LinearRegression()

A scikit-learn class (in sklearn.linear_model) used to create a linear regression model.

lm.fit(X, y)

A scikit-learn method used to fit the linear regression model to the training data.

Study Notes

Linear Regression

  • An introduction to building linear regression models in Python
  • Focuses on linear regression concepts and their implementation in Python
  • Linear regression is a statistical model that examines the linear relationship between two or more variables
  • Simple linear regression involves one dependent variable and one independent variable
  • Multiple linear regression involves one dependent variable and several independent variables
  • A linear relationship means that the dependent variable changes at a constant rate as the independent variable(s) increase or decrease

Linear Relationships

  • Can be positive, where the dependent variable increases as the independent variable increases
  • Or negative, where the dependent variable decreases as the independent variable increases

Math Behind Linear Regression

  • Does not cover the math in depth; focuses on Python implementations
  • The relationship between variables y and X can be represented as: ŷ = mX + b
  • y represents the dependent variable
  • X represents the independent variable
  • m is the slope of the regression line reflecting the impact of X on y
  • b is the intercept, representing the value of y when X is 0
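
As a minimal sketch of the equation above, the snippet below computes ŷ for a few X values. The slope and intercept are made-up numbers chosen for illustration, and the helper name predict is hypothetical.

```python
def predict(x, m, b):
    """Return the predicted value y_hat = m * x + b."""
    return m * x + b

# Hypothetical slope and intercept, chosen only for illustration
m, b = 2.0, 1.0

# When x is 0 the prediction equals the intercept b
predictions = [predict(x, m, b) for x in [0, 1, 2, 3]]
print(predictions)  # [1.0, 3.0, 5.0, 7.0]
```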

Simple Linear Regression (SLR)

  • Builds a model based on data
  • Models the slope and intercept derived from the data
  • SLR models account for data errors, or residuals
  • Residuals: the differences between actual y values and predicted y values
  • Linear regression tries to reduce these errors
  • Achieved by finding the line of best fit, which minimizes the distance between the data points and that line
  • This involves minimizing the Mean Squared Error (MSE) or the Sum of Squared Errors (SSE), also known as the Residual Sum of Squares (RSS)
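
The quantities above can be sketched with NumPy. This is an illustrative example on made-up data, using the standard closed-form least squares estimates for a single predictor; it is not taken from the lesson itself.

```python
import numpy as np

# Made-up data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form least squares estimates of the slope and intercept
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()

y_hat = m * x + b            # predicted values on the fitted line
residuals = y - y_hat        # actual minus predicted
sse = np.sum(residuals ** 2) # sum of squared errors (same as RSS)
mse = sse / len(y)           # mean squared error

print(m, b, sse, mse)
```

Any other line, for example one with a slightly different slope, gives a larger SSE on the same data.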

Multiple Linear Regression (MLR)

  • Uses multiple independent variables, starting from two and extending to hundreds or theoretically thousands
  • The equation for MLR is similar to SLR but includes more variables: ŷ = b₀ + b₁X₁ + b₂X₂ + … + bₙXₙ
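
A rough sketch of fitting an MLR equation with two predictors follows. It uses NumPy's least squares solver on synthetic, noise-free data, so the coefficients (1.5, 2.0, and -3.0, all invented for this example) are recovered exactly.

```python
import numpy as np

# Synthetic data: two predictors and a noise-free target (illustrative values)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.5 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

# Prepend a column of ones so the first coefficient acts as the intercept b0
A = np.column_stack([np.ones(len(X)), X])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
b0, b1, b2 = coeffs

print(b0, b1, b2)
```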

Linear Regression in Python Overview

  • Statsmodels and scikit-learn are the two most common libraries for performing linear regression in Python
  • SciPy also has an implementation, but it is not as widely used as the other two libraries

Statsmodels

  • Statsmodels is a Python module with classes and functions for estimating statistical models, performing statistical tests, and statistical data exploration
  • Statsmodels can be installed through the Anaconda package
  • Statsmodels is imported into each script that uses it with: import statsmodels.api as sm

Using Statsmodels for Linear Regression

  • Use a dataset from scikit-learn
  • Specifically, load the Boston dataset from the datasets module
  • from sklearn import datasets  # imports datasets from scikit-learn
  • data = datasets.load_boston()  # loads the Boston dataset (load_boston was removed in scikit-learn 1.2, so this call requires an older version)

Boston Housing Prices Dataset Description

  • Contains information on housing prices in Boston
  • It is designed for testing machine learning models
  • Includes a description of the dataset
  • Run print(data.DESCR) to view it
  • This attribute is specific to scikit-learn's bundled datasets and will not work for arbitrary datasets
  • Can run data.feature_names and data.target to see the column names of the independent variables and the values of the dependent variable
  • scikit-learn has preconfigured the home price data as the target and the other 13 variables as predictors
  • Can load the data as a pandas DataFrame
  • The median home value (MEDV) is set as the target; this step uses import numpy as np and import pandas as pd

Pandas for Data Analysis

  • Data is loaded into pandas DataFrames
  • Predictors (independent variables) are stored in a DataFrame named df
  • The target variable (dependent variable), the value to be predicted, is stored in target
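
The setup described above can be sketched as follows. Because load_boston is not available in current scikit-learn releases, this example builds df and target from synthetic arrays; the column names RM, LSTAT, and MEDV follow the notes, but every value and the generating formula are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the dataset: feature names, a feature matrix, a target
rng = np.random.default_rng(42)
feature_names = ["RM", "LSTAT"]
features = np.column_stack([
    rng.normal(6.3, 0.7, size=100),  # RM-like values (rooms per dwelling)
    rng.uniform(2, 35, size=100),    # LSTAT-like values (percent)
])
prices = 5.0 * features[:, 0] - 0.6 * features[:, 1] + rng.normal(0, 2, size=100)

# Predictors go in df, the value to be predicted goes in target
df = pd.DataFrame(features, columns=feature_names)
target = pd.DataFrame(prices, columns=["MEDV"])

print(df.head())
print(target.head())
```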

Fitting a Linear Regression Model

  • Predictors are chosen by selecting variables that correlate well with the target
  • This can be done by reviewing the correlations between variables, data plots, etc.
  • RM (average number of rooms) and LSTAT (% lower status of the population) are selected
  • Statsmodels does not add a constant (intercept) by default
  • Code:
    • import statsmodels.api as sm
    • X = df["RM"]
    • y = target["MEDV"]
    • model = sm.OLS(y, X).fit()

Model Predictions

  • Predictions are made with the model.predict function
  • Ordinary Least Squares (OLS) uses the least squares method to find the regression line that minimizes the sum of squared distances to the data points
  • In the results table, Df is the degrees of freedom
  • The coefficient of 3.6534 means that each increase of 1 in RM increases the predicted MEDV by 3.6534
  • R-squared indicates the percentage of variance in the dependent variable that the model explains
  • RM is statistically significant; the 95% confidence interval for its coefficient is 3.548 to 3.759
  • To add a constant, use X = sm.add_constant(X)

Linear Regression With a Constant Term

  • import statsmodels.api as sm
  • X = df["RM"]
  • y = target["MEDV"]
  • X = sm.add_constant(X)
  • model = sm.OLS(y, X).fit()
  • Adding the constant changes the fitted coefficients, including the slope of the RM predictor

Setting Up More Than One Variable

  • X = df[["RM", "LSTAT"]]
  • y = target["MEDV"]
  • A higher R-squared indicates that the model explains more of the variance in the dependent variable. RM and LSTAT are both statistically significant.

Sklearn to Run Regression Models Directly

  • Must be imported before it can be used, specifically: from sklearn import linear_model
  • Uses the same data as the previous models, loaded with:
    • from sklearn import datasets
    • data = datasets.load_boston()
  • Follow the same pandas commands as before

Code for Declaring Independent and Dependent Variables

  • X = df
  • y = target["MEDV"]
  • This is followed by creating the model object: lm = linear_model.LinearRegression()
  • and then fitting it to the data: model = lm.fit(X, y)
  • Use lm.predict to view the target values the model predicts for a given set of predictor values

Running Sklearn's Regression Equations

  • scikit-learn does not produce the kind of summary table that statsmodels provides when a regression is run
  • Instead, built-in functions and attributes are used to print the score, the coefficients, and the intercept
  • Functions:
    • lm.score(X, y) returns the R-squared value
    • lm.coef_ holds the coefficients of the predictors
    • lm.intercept_ holds the value of the intercept
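
The scikit-learn workflow above can be sketched end to end. The DataFrames here are synthetic stand-ins for df and target, with invented coefficients, so the printed values are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Synthetic stand-ins for df and target (values are assumptions)
rng = np.random.default_rng(4)
df = pd.DataFrame({
    "RM": rng.normal(6.3, 0.7, size=300),
    "LSTAT": rng.uniform(2, 35, size=300),
})
target = pd.DataFrame(
    {"MEDV": 10.0 + 5.0 * df["RM"] - 0.6 * df["LSTAT"] + rng.normal(0, 1, size=300)}
)

X = df
y = target["MEDV"]

lm = linear_model.LinearRegression()  # fits an intercept by default
model = lm.fit(X, y)

predictions = lm.predict(X)
r_squared = lm.score(X, y)

print(r_squared)      # R-squared on the data used for fitting
print(lm.coef_)       # one coefficient per predictor column
print(lm.intercept_)  # fitted intercept
```

Unlike statsmodels, LinearRegression fits an intercept by default, so no separate constant column is needed.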

Splitting Datasets

  • Split the data into a training set (for training the model) and a test set (for testing the model)
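
One common way to do this split is scikit-learn's train_test_split. The notes do not name a specific function, so treat this as one possible approach; the data, the 80/20 ratio, and the random seed are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data with two predictors (values are assumptions)
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 0.5, size=200)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit on the training rows only, then score on both sets
lm = LinearRegression().fit(X_train, y_train)
train_r2 = lm.score(X_train, y_train)
test_r2 = lm.score(X_test, y_test)

print(train_r2, test_r2)
```

A test score far below the training score suggests the model does not generalize well to unseen data.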
