Summary

This document is a Python tutorial on machine learning, focusing on linear, polynomial, and multiple regression, plus feature scaling. The tutorial shows how to use Python libraries such as Matplotlib, SciPy, and scikit-learn to implement and visualize regression models, favoring practical application over theory. It includes detailed examples, code snippets, and visualizations to guide readers through building regression models in Python.

Full Transcript

Python ML Tutorial

LINEAR REGRESSION

Regression
The term regression is used when you try to find the relationship between variables. In Machine Learning, and in statistical modeling, that relationship is used to predict the outcome of future events.

Linear Regression
Linear regression uses the relationship between the data points to draw a straight line through all of them. This line can be used to predict future values. In Machine Learning, predicting the future is very important.

How Does it Work?
Python has methods for finding a relationship between data points and for drawing a line of linear regression. We will show you how to use these methods instead of going through the mathematical formula.

In the example below, the x-axis represents age, and the y-axis represents speed. We have registered the age and speed of 13 cars as they were passing a tollbooth. Let us see if the data we collected could be used in a linear regression.

Example - Start by drawing a scatter plot:

```python
import matplotlib.pyplot as plt

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

plt.scatter(x, y)
plt.show()
```

Example - Import scipy and draw the line of linear regression:

```python
import matplotlib.pyplot as plt
from scipy import stats

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

# Fit a straight line: y = slope * x + intercept
slope, intercept, r, p, std_err = stats.linregress(x, y)

def myfunc(x):
    return slope * x + intercept

mymodel = list(map(myfunc, x))

plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()
```

R for Relationship
It is important to know how strong the relationship between the values of the x-axis and the values of the y-axis is: if there is no relationship, linear regression cannot be used to predict anything. This relationship, the coefficient of correlation, is called r. The r value ranges from -1 to 1, where 0 means no relationship, and 1 (and -1) means 100% related.
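For readers curious about what linregress actually computes, the slope and intercept follow from the classic least-squares formulas. The following is a minimal sketch, added here for illustration (it is not part of the original tutorial), that verifies the fit by hand:

```python
import numpy as np
from scipy import stats

x = np.array([5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6])
y = np.array([99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86])

# Least-squares formulas:
#   slope     = cov(x, y) / var(x)
#   intercept = mean(y) - slope * mean(x)
slope_manual = np.cov(x, y, bias=True)[0, 1] / np.var(x)
intercept_manual = y.mean() - slope_manual * x.mean()

slope, intercept, r, p, std_err = stats.linregress(x, y)

print(slope, slope_manual)  # the two slopes agree
```

Note that `bias=True` and NumPy's default `var` both use the population (ddof=0) convention, so the ratio matches what linregress reports.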
Python and the SciPy module will compute this value for you; all you have to do is feed it with the x and y values.

Example - How well does my data fit in a linear regression?

```python
from scipy import stats

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

slope, intercept, r, p, std_err = stats.linregress(x, y)

print(r)
```

Note: The result -0.76 shows that there is a relationship, not perfect, but it indicates that we could use linear regression in future predictions.

Predict Future Values
Now we can use the information we have gathered to predict future values.

Example: Let us try to predict the speed of a 10 year old car. To do so, we need the same myfunc() function.

Example - Predict the speed of a 10 year old car:

```python
from scipy import stats

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

slope, intercept, r, p, std_err = stats.linregress(x, y)

def myfunc(x):
    return slope * x + intercept

speed = myfunc(10)
print(speed)
```

The example predicted a speed of 85.6, which we could also read from the diagram.

Bad Fit?
Let us create an example where linear regression would not be the best method to predict future values.

Example - These values for the x- and y-axis should result in a very bad fit for linear regression:

```python
import matplotlib.pyplot as plt
from scipy import stats

x = [89,43,36,36,95,10,66,34,38,20,26,29,48,64,6,5,36,66,72,40]
y = [21,46,3,35,67,95,53,72,58,10,26,34,90,33,38,20,56,2,47,15]

slope, intercept, r, p, std_err = stats.linregress(x, y)

def myfunc(x):
    return slope * x + intercept

mymodel = list(map(myfunc, x))

plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()
```

And the r for relationship?

Example - You should get a very low r value.
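As a sanity check (an addition to the tutorial, not part of it): the r reported by linregress is the ordinary Pearson correlation coefficient, which NumPy can also compute directly via corrcoef:

```python
import numpy as np
from scipy import stats

x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

r_scipy = stats.linregress(x, y).rvalue
r_numpy = np.corrcoef(x, y)[0, 1]  # off-diagonal entry of the correlation matrix

print(r_scipy, r_numpy)  # both around -0.76 for this data
```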
```python
from scipy import stats

x = [89,43,36,36,95,10,66,34,38,20,26,29,48,64,6,5,36,66,72,40]
y = [21,46,3,35,67,95,53,72,58,10,26,34,90,33,38,20,56,2,47,15]

slope, intercept, r, p, std_err = stats.linregress(x, y)

print(r)
```

The result 0.013 indicates a very bad relationship and tells us that this data set is not suitable for linear regression.

Polynomial Regression
If your data points clearly will not fit a linear regression (a straight line through all data points), the data might be suited for polynomial regression. Polynomial regression, like linear regression, uses the relationship between the variables x and y to find the best way to draw a line through the data points.

How Does it Work?
Python has methods for finding a relationship between data points and for drawing a line of polynomial regression. We will show you how to use these methods instead of going through the mathematical formula.

In the example on the next slide, we have registered 18 cars as they were passing a certain tollbooth. We have registered the car's speed and the time of day (hour) the passing occurred. The x-axis represents the hours of the day and the y-axis represents the speed.
Example - Start by drawing a scatter plot:

```python
import matplotlib.pyplot as plt

x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]

plt.scatter(x, y)
plt.show()
```

Example - Import numpy and matplotlib, then draw the line of polynomial regression:

```python
import numpy
import matplotlib.pyplot as plt

x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]

# Fit a 3rd-degree polynomial to the data
mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))

# Positions at which to evaluate the fitted curve
myline = numpy.linspace(1, 22, 100)

plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()
```

Example Explained
Import the modules you need, then create the arrays that represent the values of the x and y axis:

```python
import numpy
import matplotlib.pyplot as plt

x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]
```

NumPy has a method that lets us make a polynomial model. Then specify how the line will display; we start at position 1 and end at position 22:

```python
mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))
myline = numpy.linspace(1, 22, 100)
```

Draw the original scatter plot, draw the line of polynomial regression, and display the diagram:

```python
plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()
```

R-Squared
It is important to know how strong the relationship between the values of the x- and y-axis is: if there is no relationship, polynomial regression cannot be used to predict anything. The relationship is measured with a value called r-squared. The r-squared value ranges from 0 to 1, where 0 means no relationship, and 1 means 100% related. Python and the sklearn module will compute this value for you; all you have to do is feed it with the x and y arrays.

Example - How well does my data fit in a polynomial regression?
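To make the polyfit/poly1d pair less of a black box, here is a small sketch (an addition for illustration, using the same data) showing that polyfit returns the cubic's coefficients, highest power first, and that calling the model simply evaluates that cubic:

```python
import numpy as np

x = [1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 18, 19, 21, 22]
y = [100, 90, 80, 60, 60, 55, 60, 65, 70, 70, 75, 76, 78, 79, 90, 99, 99, 100]

coeffs = np.polyfit(x, y, 3)   # [a, b, c, d] for a*t**3 + b*t**2 + c*t + d
mymodel = np.poly1d(coeffs)

# Evaluate the cubic by hand at t = 17 and compare with mymodel(17)
a, b, c, d = coeffs
t = 17
manual = a * t**3 + b * t**2 + c * t + d

print(mymodel(t), manual)  # the same value, computed two ways
```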
```python
import numpy
from sklearn.metrics import r2_score

x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]

mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))

print(r2_score(y, mymodel(x)))
```

Note: The result 0.94 shows that there is a very good relationship, and we can use polynomial regression in future predictions.

Predict Future Values
Now we can use the information we have gathered to predict future values.

Example: Let us try to predict the speed of a car that passes the tollbooth at around 17:00. To do so, we need the same mymodel from the example above.

Example - Predict the speed of a car passing at 17:00:

```python
import numpy

x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]

mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))

speed = mymodel(17)
print(speed)
```

The example predicted a speed of 88.87, which we could also read from the diagram.

Bad Fit?
Let us create an example where polynomial regression would not be the best method to predict future values.

Example - These values for the x- and y-axis should result in a very bad fit for polynomial regression:

```python
import numpy
import matplotlib.pyplot as plt

x = [89,43,36,36,95,10,66,34,38,20,26,29,48,64,6,5,36,66,72,40]
y = [21,46,3,35,67,95,53,72,58,10,26,34,90,33,38,20,56,2,47,15]

mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))
myline = numpy.linspace(2, 95, 100)

plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()
```

And the r-squared value?

Example - You should get a very low r-squared value.
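r-squared can also be computed from its definition: 1 minus the ratio of the residual sum of squares to the total sum of squares. A sketch, added here for illustration, that reproduces what r2_score reports for the fitted cubic:

```python
import numpy as np

x = [1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 18, 19, 21, 22]
y = [100, 90, 80, 60, 60, 55, 60, 65, 70, 70, 75, 76, 78, 79, 90, 99, 99, 100]

mymodel = np.poly1d(np.polyfit(x, y, 3))

y_arr = np.array(y)
y_pred = mymodel(np.array(x))

ss_res = np.sum((y_arr - y_pred) ** 2)        # residual sum of squares
ss_tot = np.sum((y_arr - y_arr.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)  # about 0.94 for this data
```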
```python
import numpy
from sklearn.metrics import r2_score

x = [89,43,36,36,95,10,66,34,38,20,26,29,48,64,6,5,36,66,72,40]
y = [21,46,3,35,67,95,53,72,58,10,26,34,90,33,38,20,56,2,47,15]

mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))

print(r2_score(y, mymodel(x)))
```

The result 0.00995 indicates a very bad relationship and tells us that this data set is not suitable for polynomial regression.

Multiple Regression
Multiple regression is like linear regression, but with more than one independent value, meaning that we try to predict a value based on two or more variables. We can predict the CO2 emission of a car based on the size of the engine, but with multiple regression we can throw in more variables, like the weight of the car, to make the prediction more accurate.

How Does It Work?
In Python we have modules that will do the work for us. Start by importing the Pandas module; it allows us to read csv files and return a DataFrame object. Then make a list of the independent values and call this variable X, and put the dependent values in a variable called y.

Tip: It is common to name the list of independent values with an uppercase X, and the list of dependent values with a lowercase y.

We will use some methods from the sklearn module, so we will have to import that module as well. From the sklearn module we will use LinearRegression() to create a linear regression object. This object has a method called fit() that takes the independent and dependent values as parameters and fills the regression object with data that describes the relationship. Now we have a regression object that is ready to predict CO2 values based on a car's weight and volume.
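Since data_multReg.csv itself is not included in this transcript, the following sketch builds the X/y split from a small in-memory DataFrame with made-up rows, just to illustrate the column selection the text describes:

```python
import pandas

# Hypothetical stand-in for data_multReg.csv; the numbers are illustrative only
df = pandas.DataFrame({
    "Weight": [790, 1160, 929, 865, 1140],
    "Volume": [1000, 1200, 1000, 900, 1500],
    "CO2":    [99, 95, 95, 90, 105],
})

X = df[['Weight', 'Volume']]  # independent values: uppercase X by convention
y = df['CO2']                 # dependent value: lowercase y

print(X.shape, y.shape)  # (5, 2) (5,)
```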
Example - See the whole example in action:

```python
import pandas
from sklearn import linear_model

df = pandas.read_csv("data_multReg.csv")

X = df[['Weight', 'Volume']]
y = df['CO2']

regr = linear_model.LinearRegression()
regr.fit(X.values, y)

# Predict the CO2 emission of a car where the weight is 2300kg,
# and the volume is 1300cm3:
predictedCO2 = regr.predict([[2300, 1300]])

print(predictedCO2)
```

We have predicted that a car with a 1.3 liter engine and a weight of 2300 kg will release approximately 107 grams of CO2 for every kilometer it drives.

Coefficient
The coefficient is a factor that describes the relationship with an unknown variable. Example: if x is a variable, then 2x is x two times; x is the unknown variable, and the number 2 is the coefficient. In this case, we can ask for the coefficient value of weight against CO2, and for volume against CO2. The answers we get tell us what would happen if we increase, or decrease, one of the independent values.

Example - Print the coefficient values of the regression object:

```python
import pandas
from sklearn import linear_model

df = pandas.read_csv("data_multReg.csv")

X = df[['Weight', 'Volume']]
y = df['CO2']

regr = linear_model.LinearRegression()
regr.fit(X.values, y)

print(regr.coef_)
```

Result Explained
The result array represents the coefficient values of weight and volume:

Weight: 0.00755095
Volume: 0.00780526

These values tell us that if the weight increases by 1 kg, the CO2 emission increases by 0.00755095 g, and if the engine size (Volume) increases by 1 cm3, the CO2 emission increases by 0.00780526 g. That seems like a fair guess, but let's test it! We have already predicted that if a car with a 1300 cm3 engine weighs 2300 kg, the CO2 emission will be approximately 107 g. What if we increase the weight by 1000 kg?
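A linear-regression prediction is just the intercept plus the dot product of the coefficients with the feature values, which is exactly the identity the coefficient reasoning above relies on. A sketch verifying this with made-up data (the CSV is not included in this transcript):

```python
import numpy as np
from sklearn import linear_model

# Illustrative [weight, volume] -> CO2 rows; a stand-in for data_multReg.csv
X = np.array([[790, 1000], [1160, 1200], [929, 1000],
              [865, 900], [1140, 1500], [1365, 1500]])
y = np.array([99, 95, 95, 90, 105, 92])

regr = linear_model.LinearRegression().fit(X, y)

car = np.array([2300, 1300])
manual = regr.intercept_ + regr.coef_ @ car  # intercept + coef . features

print(manual, regr.predict([car])[0])  # identical predictions

# Adding 1000 kg shifts the prediction by exactly 1000 * weight coefficient
delta = regr.predict([[3300, 1300]])[0] - regr.predict([[2300, 1300]])[0]
print(delta, 1000 * regr.coef_[0])
```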
Example - Copy the example from before, but change the weight from 2300 to 3300:

```python
import pandas
from sklearn import linear_model

df = pandas.read_csv("data_multReg.csv")

X = df[['Weight', 'Volume']]
y = df['CO2']

regr = linear_model.LinearRegression()
regr.fit(X.values, y)

predictedCO2 = regr.predict([[3300, 1300]])

print(predictedCO2)
```

We have predicted that a car with a 1.3 liter engine and a weight of 3300 kg will release approximately 115 grams of CO2 for every kilometer it drives, which shows that the coefficient of 0.00755095 is correct:

107.2087328 + (1000 * 0.00755095) = 114.75968

Scale
Scale Features
When your data has different values, and even different measurement units, it can be difficult to compare them. What is kilograms compared to meters? Or altitude compared to time? The answer to this problem is scaling: we can scale data into new values that are easier to compare.

Take a look at the table found in data_scale.csv: it is the same data set that we used for multiple regression, but this time the Volume column contains values in liters instead of cm3 (1.0 instead of 1000). It can be difficult to compare the volume 1.0 with the weight 790, but if we scale them both into comparable values, we can easily see how much one value is compared to the other. There are different methods for scaling data; in this tutorial we will use a method called standardization.

The standardization method uses this formula:

z = (x - u) / s

where z is the new value, x is the original value, u is the mean, and s is the standard deviation.

If you take the Weight column from the data set above, the first value is 790, and the scaled value will be:

(790 - 1292.23) / 238.74 = -2.1

If you take the Volume column from the data set above, the first value is 1.0, and the scaled value will be:

(1.0 - 1.61) / 0.38 = -1.59

Now you can compare -2.1 with -1.59 instead of comparing 790 with 1.0.
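The StandardScaler result can be checked directly against the z = (x - u) / s formula. A sketch with made-up weight/volume rows (data_scale.csv is not included in this transcript):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative [weight (kg), volume (liters)] rows, standing in for data_scale.csv
X = np.array([[790, 1.0], [1160, 1.2], [929, 1.0], [865, 0.9], [1140, 1.5]])

scale = StandardScaler()
scaledX = scale.fit_transform(X)

# Manual standardization, column by column: z = (x - u) / s
u = X.mean(axis=0)
s = X.std(axis=0)  # population std (ddof=0), matching StandardScaler
manual = (X - u) / s

print(np.allclose(scaledX, manual))  # True
```

After standardization each column has mean 0 and standard deviation 1, which is what makes the two columns comparable.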
You do not have to do this manually; the Python sklearn module has a StandardScaler class which returns a scaler object with methods for transforming data sets.

Example - Scale all values in the Weight and Volume columns:

```python
import pandas
from sklearn.preprocessing import StandardScaler

scale = StandardScaler()

df = pandas.read_csv("data_scale.csv")

X = df[['Weight', 'Volume']]

scaledX = scale.fit_transform(X)

print(scaledX)
```

Note that the first two values of the result are -2.1 and -1.59, which corresponds to our calculations.

Predict CO2 Values
The task in the Multiple Regression chapter was to predict the CO2 emission from a car when you only knew its weight and volume.

Example - Predict the CO2 emission from a 1.3 liter car that weighs 2300 kilograms:

```python
import pandas
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler

scale = StandardScaler()

df = pandas.read_csv("data_scale.csv")

X = df[['Weight', 'Volume']]
y = df['CO2']

scaledX = scale.fit_transform(X)

regr = linear_model.LinearRegression()
regr.fit(scaledX, y)

# Scale the new observation with the same scaler before predicting
scaled = scale.transform([[2300, 1.3]])

predictedCO2 = regr.predict(scaled)

print(predictedCO2)
```

References

https://www.w3schools.com/python/python_ml_getting_started.asp
https://scikit-learn.org/
https://scikit-learn.org/stable/user_guide.html
