Advertising Dataset Analysis PDF

1/18/24, 12:57 PM Lesson-1 - HackMD (https://hackmd.io/0VoTWSCQT8ywjxRY7G82nA?view)

Let us consider a snapshot of the famous Advertising dataset (https://www.kaggle.com/datasets/purbar/advertising-data/data):

| TV | Radio | Newspaper | Sales |
|-------|-------|-----------|-------|
| 230.1 | 37.8 | 69.2 | 22.1 |
| 44.5 | 39.3 | 45.1 | 10.4 |
| 17.2 | 45.9 | 69.3 | 9.3 |
| 151.5 | 41.3 | 58.5 | 18.5 |
| 180.8 | 10.8 | 58.4 | 12.9 |

During an advertising campaign, the money spent (in thousands of dollars) on television ads, radio ads, and newspaper ads is recorded in the columns TV, Radio, and Newspaper respectively. The Sales generated as a consequence of a campaign are recorded in the Sales column.

Our goal is to examine the connection between Sales and the money spent on campaigns run on each of the three media. We begin with the simpler situation of examining the connection between Sales and TV, Sales and Radio, and Sales and Newspaper separately. Later, we shall analyse the connection when all three are taken together. We will also answer how strong the connection is and how to predict Sales on the basis of the money spent on ads in each medium.

The following matrix of correlations gives us an idea about the relationships between the various pairs of variables. In particular, we can see a strong correlation between Sales and TV ads (0.782) and a much lower correlation between Sales and Newspaper ads (0.228).

| | TV | Radio | Newspaper | Sales |
|-----------|-------|-------|-----------|-------|
| TV | 1.0 | 0.054 | 0.056 | 0.782 |
| Radio | 0.054 | 1.0 | 0.354 | 0.576 |
| Newspaper | 0.056 | 0.354 | 1.0 | 0.228 |
| Sales | 0.782 | 0.576 | 0.228 | 1.0 |

So the prediction of Sales on the basis of TV ads should be much more accurate than the prediction on the basis of Newspaper ads. Let us now focus on Sales vs TV ads:

| TV | Sales |
|-------|-------|
| 230.1 | 22.1 |
| 44.5 | 10.4 |
| 17.2 | 9.3 |
| 151.5 | 18.5 |
| 180.8 | 12.9 |

The above data could have been recorded at a particular time of the year or at a particular place.
What this means is that if the same campaign is run again and the same 230.1 thousand dollars are spent on TV ads, then the Sales generated will most probably differ from 22.1 thousand dollars. If we keep running campaigns with the same amount spent on TV ads, we will generally end up generating different numbers for Sales. Thus it is reasonable to regard Sales as a random variable for a fixed amount $x$ spent on TV ads.

So, in that case, how do we predict Sales for a given amount $x$ spent on TV ads? The answer is simple: if possible, run a large number of campaigns spending the same amount $x$ on TV ads and take the average of all the Sales generated. The obvious problem with this strategy is that it may not be feasible to run a large number of campaigns owing to logistic and financial constraints. Thus obtaining more and more datasets is out of the question.

The trick is to find an approximation to the average of Sales when $TV = x$. This means that we need to approximate the conditional expectation $E[\text{Sales} \mid TV = x]$. Recall from probability theory that

$$E[\text{Sales} \mid TV = x] = \int_{-\infty}^{\infty} y \, f(y \mid x) \, dy.$$

In this expression, the random variable Sales takes on values $y$ given that the amount $x$ is pumped into TV ads. For the sake of convenience we shall write Sales as $Y$ throughout. In general there is no control over the kind of values $y$ can take, so we allow $y$ to be any real number. The conditional probability density function $f(y \mid x)$ is generally not known, so a direct calculation of the integral is not possible. To simplify matters, we assume that $Y$ follows a Gaussian distribution with mean $\mu_x$ and constant variance $\sigma^2$ when $TV = x$. This forces $f(y \mid x)$ to take the form:

$$f(y \mid x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2}\left(\frac{y - \mu_x}{\sigma}\right)^2\right)$$

This yields $E[Y \mid TV = x] = \mu_x$. So the predicted value of $Y$ is $\mu_x$ when the amount spent on TV ads is $x$. In symbols we express this fact by writing

$$\hat{Y}(x) = \mu_x.$$

In simple linear regression we assume $\mu_x$ to be of the form $\alpha_0 + \alpha_1 x$. Another kind of linear regression is polynomial regression, in which $\mu_x$ is assumed to be of the form $\alpha_0 + \alpha_1 x + \cdots + \alpha_d x^d$. We will discuss the polynomial case in the next lesson. In this lesson we restrict ourselves to $\mu_x = \alpha_0 + \alpha_1 x$. Under this assumption, the conditional density $f(y \mid x)$ becomes

$$f(y \mid x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2}\left(\frac{y - \alpha_0 - \alpha_1 x}{\sigma}\right)^2\right)$$

The parameters $\alpha_0, \alpha_1$ will be estimated from the given dataset by maximizing the likelihood function:

$$L(\alpha_0, \alpha_1) = f(y_1 \mid x_1) \, f(y_2 \mid x_2) \cdots f(y_n \mid x_n) \tag{1}$$

This means that we are searching for those values of the parameters for which it is most likely that the amount $x_i$ invested in TV ads would yield sales of $y_i$. Maximizing (1) is equivalent to maximizing $\log L(\alpha_0, \alpha_1)$:

$$\log L(\alpha_0, \alpha_1) = \sum_{i=1}^{n} \log f(y_i \mid x_i) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \alpha_0 - \alpha_1 x_i)^2 \tag{2}$$

Since the first term and the positive factor $\frac{1}{2\sigma^2}$ do not depend on $\alpha_0$ or $\alpha_1$, this is equivalent to maximizing

$$-\sum_{i=1}^{n}(y_i - \alpha_0 - \alpha_1 x_i)^2$$

This, in other words, means that we are minimizing the residual sum of squares (RSS):

$$\sum_{i=1}^{n}(y_i - \alpha_0 - \alpha_1 x_i)^2 \tag{3}$$

In order to minimize (3) we invoke an elementary technique from multivariable calculus: finding critical points by equating the partial derivatives of RSS with respect to $\alpha_0$ and $\alpha_1$ to zero. We thus conclude that

$$0 = -2\sum_{i=1}^{n}(y_i - \alpha_0 - \alpha_1 x_i) \tag{4}$$

and

$$0 = -2\sum_{i=1}^{n}(y_i - \alpha_0 - \alpha_1 x_i)\,x_i \tag{5}$$

Solving the system of linear equations (4) and (5) we get:

$$\alpha_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \tag{6}$$

and

$$\alpha_0 = \bar{y} - \alpha_1 \bar{x}. \tag{7}$$

The symbols $\bar{x}$ and $\bar{y}$ stand for the averages of the respective columns in the dataset. We now have a procedure to predict the Sales corresponding to $x$ thousand dollars invested in TV ads by simply plugging $x$ into the equation:

$$\hat{Y}(x) = \alpha_0 + \alpha_1 x.$$

This equation is called the line of best fit. The important question, of course, is how much to trust this prediction.
Answering this question would need us to delve into the statistical properties of the parameters and also of the random variable Y. We relegate this discussion to the next lesson. For now we close this lesson with a straightforward Python implementation of finding and drawing the line of best fit. Based on the flow of this lesson, the implementation logically falls into 4 steps:

1. Read the data file into Python.

```python
import pandas as pd

df = pd.read_csv('Advertising.csv')
```

2. Extract only 2 columns, TV and Sales, from the dataset.

```python
x = df[['TV']]   # 2-D feature table, as scikit-learn expects
y = df['Sales']  # 1-D target column
```

3. Calculate the parameters α0, α1 either directly or through a library function.

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(x, y)
a_0 = model.intercept_  # intercept α0 (a scalar for a 1-D target)
a_1 = model.coef_[0]    # slope α1 (coef_ holds one entry per feature)
```

4. Generate a scatter chart showing all the data points along with the line of best fit.

```python
import matplotlib.pyplot as plt

plt.scatter(x, y, color='black')       # observed data points
plt.plot(x, a_0 + a_1 * x, color='red')  # line of best fit
plt.show()
```

You may be wondering how we generated the table of correlations in the early part of the lesson. It is just one line of code:

```python
df.corr()
```

Exercises

1. Derive the expressions for the parameters given in equations (6) and (7).
2. Study the scatter charts of Sales vs Radio and Sales vs Newspaper, and relate them to the correlations among these variables.
3. Find the lines of best fit for predicting Sales on the basis of Radio ads and Newspaper ads respectively.
4. Can you spot the link that correlation has with equation (6), and hence with the linear relationship between the 2 variables?
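
Step 3 of the implementation mentions that the parameters can also be calculated directly rather than through a library function. The following is a minimal sketch of that direct route, assuming NumPy is available and using only the five-row TV and Sales snapshot shown earlier instead of the full CSV file. The helper name `fit_line` is hypothetical and introduced here only for illustration; it evaluates equations (6) and (7) as written.

```python
import numpy as np

# Five-row snapshot of TV spend and Sales from the table in this lesson
# (not the full dataset, so the estimates differ from a full-data fit).
tv = np.array([230.1, 44.5, 17.2, 151.5, 180.8])
sales = np.array([22.1, 10.4, 9.3, 18.5, 12.9])


def fit_line(x, y):
    """Return (alpha_0, alpha_1) computed from equations (7) and (6).

    Hypothetical helper for illustration; not part of any library.
    """
    x_bar, y_bar = x.mean(), y.mean()
    # Equation (6): slope as a ratio of centered sums.
    alpha_1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # Equation (7): intercept from the column averages.
    alpha_0 = y_bar - alpha_1 * x_bar
    return alpha_0, alpha_1


alpha_0, alpha_1 = fit_line(tv, sales)
print("alpha_0 =", alpha_0, "alpha_1 =", alpha_1)

# Sanity check against equations (4) and (5): at the fitted parameters,
# both sums should be close to zero, up to floating-point error.
residuals = sales - alpha_0 - alpha_1 * tv
print("sum of residuals:", residuals.sum())
print("sum of residuals * x:", (residuals * tv).sum())
```

Passing the full `TV` and `Sales` columns (converted to NumPy arrays) to the same function should agree with the intercept and slope reported by `LinearRegression`, since both minimize the residual sum of squares in (3).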
