## 2. Derivation of the training rule

### 2.1 Preliminaries

#### Notation

* $x$ - input vector
* $y$ - desired output
* $w$ - weight vector
* $\eta$ - learning rate
* $E$ - error

### 2.2 The error function

For a single training example, the error is defined as:

$$E = \frac{1}{2}(y - w^T x)^2$$

This is a measure of the difference between the desired output $y$ and the actual output $w^T x$. The factor of $\frac{1}{2}$ is included for mathematical convenience.

### 2.3 Derivation of the gradient descent rule

We want to find the weight vector $w$ that minimizes the error $E$, and we do this using gradient descent. The gradient descent rule updates the weights in the direction opposite to the gradient of the error function:

$$w \leftarrow w - \eta \nabla E$$

where $\eta$ is the learning rate, which controls the step size.

To find the gradient of the error function, we compute the partial derivative of $E$ with respect to each weight $w_i$:

$$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2}(y - w^T x)^2$$

Applying the chain rule, and noting that $y$ does not depend on $w_i$, we have:

$$\frac{\partial E}{\partial w_i} = (y - w^T x) \frac{\partial}{\partial w_i} (-w^T x)$$

Since $w^T x = \sum_{j=1}^{n} w_j x_j$, we have:

$$\frac{\partial}{\partial w_i} (-w^T x) = -x_i$$

Therefore,

$$\frac{\partial E}{\partial w_i} = -(y - w^T x) x_i$$

The gradient of the error function is then:

$$\nabla E = \begin{bmatrix} \frac{\partial E}{\partial w_1} \\ \vdots \\ \frac{\partial E}{\partial w_n} \end{bmatrix} = -(y - w^T x) x$$

Substituting this into the update rule, the gradient descent update becomes:

$$w \leftarrow w + \eta (y - w^T x) x$$

This rule updates the weights by adding a fraction of the error, scaled by the input vector. The learning rate $\eta$ controls the size of the update.
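As a rough illustration of the derived rule, the sketch below applies the update $w \leftarrow w + \eta(y - w^T x)x$ repeatedly to a single training example and checks that the error $E$ shrinks. The function name `delta_rule_step`, the example data, and the choice $\eta = 0.1$ are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def delta_rule_step(w, x, y, eta):
    """One gradient-descent update: w <- w + eta * (y - w^T x) * x."""
    error = y - w @ x          # (y - w^T x), the prediction error
    return w + eta * error * x

# Illustrative single training example (values chosen arbitrarily).
x = np.array([1.0, 2.0, -1.0])   # input vector
y = 0.5                          # desired output
w = np.zeros(3)                  # initial weight vector
eta = 0.1                        # learning rate

for _ in range(50):
    w = delta_rule_step(w, x, y, eta)

print("output w^T x:", w @ x)                 # approaches the desired output y
print("error E:", 0.5 * (y - w @ x) ** 2)     # approaches 0
```

For this single example the error is scaled by the factor $(1 - \eta\, x^T x)$ at each step, so it decays geometrically provided $\eta$ is small enough; too large a learning rate makes the updates diverge, which matches the role of $\eta$ as the step size described above.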