Linear Algebra for Computer Scientists: Lecture Notes Part II (PDF)

Summary

These are lecture notes for Part II of a linear algebra course for computer scientists, taught at ETH Zürich by A. S. Bandeira and R. Weismantel. They continue Part I (taught by Prof. Bernd Gärtner) and cover orthogonality, projections and least squares, orthonormal bases and Gram-Schmidt, the pseudoinverse, projections of polyhedra and the Farkas lemma, determinants, eigenvalues and eigenvectors, and the singular value decomposition.

Full Transcript


Linear Algebra for Computer Scientists Lecture Notes Part II A. S. Bandeira and R. Weismantel ETH Zürich Last update on November 1, 2024 1 2 “R EAD ME ” FOR PART II These lecture notes serve as a continuation1 of Part I, taught by Prof. Bernd Gärtner, available at https://ti.inf.ethz.ch/ew/courses/LA23/notes_part_I.pdf. Please read the Preface there. Please note there may be some changes in notation. Furthermore, we will try to stay close to the notation in [Str23], but there also be some differences. [Str23] Gilbert Strang. Introduction to Linear Algebra. Wellesley - Cambridge Press, 6th ed., 2023. The course page has relevant information for the course: https://ti.inf.ethz.ch/ew/ courses/LA23. There are countless excellent Linear Algebra books with the material covered in this course. For Part II we will roughly continue to follow, in structure and content, [Str23], with some small devi- ations. I will try to keep the numbering of Chapters/Sections and Sections/Subsections consistent with [Str23] (as far as the deviations allow). See Appendix A for some important preliminaries and some remarks on notation. Throughout the notes, and the lectures, we will try to motivate some of the material with Guid- ing Questions. For students who would like to explore the topic further, I will include some Exploratory Challenges and Further Reading, these often will include difficult problems or topics. I will also take some opportunities to share some active Research Questions related to the topics we covered (we are still discovering new phenomena in Linear Algebra today and for many years to come!). After deriving a result, we will often do some Sanity Checks, and some things we will leave as a Challenge: these should be accessible and of difficulty comparable to homework questions, a ⋆ indicates a harder problem (but still within the scope). On the other hand, Exploratory Challenges are generally outside the scope of the course or of substantial higher difficulty. Linear Algebra is a beautiful topic, connecting Algebra with Geometry2, and has countless applications making it a key component of almost all quantitative pursuits. We sincerely hope you will enjoy the course as much as we enjoy teaching this beautiful subject! 1 If you are reading these notes and did not follow Part I, please read Appendix A. 2and Analysis, as you will likely see later in your academic life. For example, when Joseph Fourier invented Fourier Series to develop a theory of heat transfer he was essentially finding good orthonormal bases for functions. 3 We believe the Questions, Sanity Checks, Challenges, etc are very useful to learn the material, but when you want to review the material, or do a last read before the exam, you can focus on the Definitions, Propositions, Theorems, etc (and focus less on the blue parts). As your mathematical level matures over the semester, the notes will have less illustrations and more definitions and mathematical statements. Our recommendation is to read the notes with pen & paper next to you and to draw the picture yourself, this “translation” you will be doing — from mathematical statement to picture — will help you greatly in the learning of Mathematics! There are also countless high-quality videos and other content online about Linear Algebra, for example there is also an excellent series of videos by Gil Strang filmed ∼15 years ago: https: //www.youtube.com/playlist?list=PLE7DDD91010BC51F8. Strang actually retired just a few months ago, at almost 90 years of age! 
You can see his last lecture online: https://www.youtube.com/watch?v=lUUte2o2Sn8 Moreover, there are many excellent animations online giving lots of great intuition on several Linear Algebra topics and phenomena. While it is a great idea to take advantage of this, I would recommend first trying yourself to develop an intuition of the concept/phenomenon (e.g. by drawing a picture) and using these tools only after — use them to improve your intuition, not to create it! As these Lecture Notes are being continuously updated, and sometimes the discussion in lectures leads us into proving an extra result, or suggests a remark, etc, I will try to add then and not change the numbering of things downstream, I do this by numbering them with +1000. After each lecture, we post the handwritten notes from lecture on the course website https: //ti.inf.ethz.ch/ew/courses/LA23/index.html. My suggestion would be to use the Lecture Notes to review the material, not the handwritten notes (which are mainly meant to support my oral exposition). 4 C ONTENTS “Read me” for Part II 2 5. Orthogonality 5 5.1. Orthogonality of vectors and subspaces 5 5.2. Projections 9 5.3. Least Squares Approximation 13 5.4. Orthonormal Bases and Gram Schmidt 17 5.5. The Pseudoinverse, also known as Moore–Penrose Inverse 23 5.6. Projections of sets and the Farkas lemma 27 6. The determinant 32 7. Eigenvalues and Eigenvectors 41 7.0. Complex Numbers 42 7.1. Introduction to Eigenvalues and Eigenvectors 45 7.2. Diagonalizing a Matrix and Change of Basis of a Linear Transformation 56 7.3. Symmetric Matrices and the Spectral Theorem 59 8. Singular Value Decomposition; and some open questions in Linear Algebra 65 8.1. The Singular Value Decomposition 65 8.2. Vector and Matrix Norms 69 8.3. Some Mathematical Open Problems 70 Acknowledgements 72 Appendix A. Some Important Preliminaries and Remarks on Notation 72 Appendix B. A “Simple proof” of the Fundamental Theorem of Algebra 73 73 5 5. O RTHOGONALITY 5.1. Orthogonality of vectors and subspaces. Guiding Question 1. Our task is to explore orthogonality as a geometric and algebraic tool in order to be able to decompose a space into subspaces. Among many results our knowledge will then put us in position to understand how to solve systems of linear equations. Let us begin by introducing orthogonality. Definition 5.1.1. Two vectors v, w ∈ Rn are called orthogonal if vT w = ∑ni=1 vi wi = 0. Two sub- spaces V and W are orthogonal if for all v ∈ V and w ∈ W , the vectors v and w are orthogonal. In order to check whether two subspaces V and W are orthogonal it is enough to verify it for the vectors forming a basis of V and W , respectively. Lemma 5.1.2. Let v1 ,... , vk be a basis of subspace V. Let w1 ,... , wl be a basis of subspace W. V and W are orthogonal if and only if vi and w j are orthogonal for all i ∈ {1,... , k} and j ∈ {1,... , l}. Proof. Let us begin by proving the statement from left to right: Suppose V and W are orthogonal. Since vi ∈ V for all i ∈ {1,... , k} and w j ∈ W for all j ∈ {1,... , l}, we have that vTi w j = 0 for all i ∈ {1,... , k} and j ∈ {1,... , l}. For the converse direction, assume that vTi w j = 0 for all i ∈ {1,... , k} and j ∈ {1,... , l}. Let v ∈ V and w ∈ W. Then, there exist real multipliers such that k l v = ∑ λi vi and w = ∑ µ jv j. i=1 j=1 Then k k l T v w= ∑ λi vTi w = ∑ ∑ µ j λivTi w j = 0. i=1 i=1 j=1 □ Lemma 5.1.3. Let V and W be two orthogonal subspaces of Rn. Let v1 ,... , vk be a basis of subspace V. Let w1 ,... , wl be a basis of subspace W. 
The set of vectors {v1 ,... , vk , w1 ,... , wl } are linearly independent. 6 Proof. Consider the linear combination k l ∑ λivi + ∑ µ j w j = 0. i=1 j=1 We want to show that λi = 0 for all i ∈ {1,... , k} and µ j = 0 for all j ∈ {1,... , l}. Let v = ∑ki=1 λi vi. The linear combination is equivalent to v = − ∑lj=1 µ j w j. We obtain vT v = − ∑lj=1 µ j vT w j = 0. Hence, v = 0. This implies that λi = 0 for all i ∈ {1,... , k} since v1 ,... , vk is a basis of V. The same argument can be applied to show that w = ∑lj=1 µ j w j must equal the all-zero vector and hence, µ j = 0 for all j ∈ {1,... , l}. This is the result. □ Lemma 5.1.3 allows us to derive an important fact about orthogonal subspaces. Namely we can take bases of the two subspaces V and W and their union gives a basis for the subspace {λ v + µw | λ , µ ∈ R, v ∈ V, w ∈ W }. Corollary 5.1.4. Let V and W be orthogonal subspaces. Then V ∩ W = {0}. Moreover, if dim(V ) = k and dim(W ) = l, then dim({λ v + µw | λ , µ ∈ R, v ∈ V, w ∈ W }) = k + l ≤ n. 5.1.1. The orthogonal complement of a subspace. So far we have explored general subspaces V and W that are orthogonal. Next consider a subspace V. Then there is a special orthogonal subspace attached to V. Definition 5.1.5. Let V be a subspace of Rn. We define the orthogonal complement of V as V ⊥ = {w ∈ Rn | wT v = 0 for all v ∈ V }. Challenge 2. Prove that V ⊥ is a subspace of Rn. This concept of orthogonal subspaces allows us to decompose the space. Let us first see an important example of how this idea can be used. Theorem 5.1.6. Let A ∈ Rm×n be a matrix. N(A) = C(AT )⊥ = R(A)⊥. Proof. Let us show that N(A) ⊆ C(AT )⊥. 7 Let x ∈ N(A). Take any b ∈ R(A). By definition, b = AT y for some y ∈ Rm. Then bT x = yT Ax = 0. Hence, x ∈ C(AT ). Conversely, we want to show that C(AT )⊥ ⊆ N(A). To this end let x ∈ C(AT )⊥. Then bT x = 0 for all b ∈ C(AT ). Take y := Ax ∈ Rm Then b := AT y ∈ C(AT ) and hence, xT b = 0. We obtain 0 = xT b = xT AT y = xT AT Ax = ∥Ax∥2. This implies that Ax = 0, i.e., x ∈ N(A). □ From the lecture in Chapter 3.5 we know already that if r = dim(R(A)), then n − r = dim(N(A). This fact together with Theorem 5.1.6 allows us to prove a decomposition theorem of the space Rn. Theorem 5.1.7. Let V,W be orthogonal subspaces of Rn. The following statements are equivalent. (i) W = V ⊥. (ii) dim(V ) + dim(W ) = n. (iii) Every u ∈ Rn can be written as u = v + w with unique vectors v ∈ V , w ∈ W. Proof. Let v1 ,... , vk be a basis of V and w1 ,... , wl a basis of W. From Lemma 5.1.2 V and W are orthogonal if and only if vTi w j = 0 for all i ∈ {1,... , k} and j ∈ {1,... , l}. (i) implies (ii): Define A ∈ Rk×n to be the matrix with row vectors v1 ,... , vk. Then V = R(A) = C(AT ). Moreover, W = V ⊥ = N(A) from Theorem 5.1.6. From our remark before dim(V ) = k and hence, dim(W ) = n − k. (ii) implies (iii): From Lemma 5.1.3 the vectors in the set {v1 ,... , vk , w1 ,... , wl } are linearly independent. Since by assumption l = n − k, this set is a basis of Rn. Hence, every vector u ∈ Rn has a unique representation in form of k l u = ∑ λi vi + ∑ µ j w j , where λ1 ,... , λk , µ1 ,... , µl ∈ R. i=1 j=1 Define the unique vectors v := ∑ki=1 λi vi , w := ∑lj=1 µ j w j. This gives the statement. (iii) implies (i): We need to show that W = V ⊥. Note that W ⊆ V ⊥ since W is orthogonal to V. To show the reverse inclusion, take any vector u ∈ V ⊥ ⊆ Rn Hence, from our assumption in (iii) 8 we know that u = v + w where v ∈ V and w ∈ W. Then 0 = uT v = vT v + vT w = vT v = ∥v∥2. 
Hence, v = 0 and it follows that u = w ∈ W. □ Indeed Theorem 5.1.7 allows us to decompose the space Rn according to a given subspace V ⊆ Rn. We write Rn = V +V ⊥ = {v + w | v ∈ V, w ∈ V ⊥ }. This decomposition is symmetric in the sense that we can also take the subspace V ⊥ and then write Rn = V ⊥ + (V ⊥ )⊥ = V ⊥ +V. This follows from the following observation. Lemma 5.1.8. Let V be a subspace of Rn. Then V = (V ⊥ )⊥. Proof. Let v1 ,... , vk be a basis of V and w1 ,... , wl a basis of V ⊥. It follows that l = n − k. From Lemma 5.1.2 we conclude that vTi w j = 0 for all indices i and j. Then by definition (V ⊥ )⊥ = {x ∈ Rn | xT w j = 0 for all j = 1,... , n − k}. Since vTi w j = 0 for all j = 1,... , n − k we obtain that V ⊆ (V ⊥ )⊥. From Theorem 5.1.7 we obtain that dim((V ⊥ )⊥ ) = n − (n − k) = k. Since {v1 ,... , vk } ⊆ V ⊆ (V ⊥ )⊥ are linearly independent, they form a basis of (V ⊥ )⊥. Hence V = (V ⊥ )⊥. □ This lemma in combination with Theorem 5.1.6 allows us to conclude that for a matrix A we have that C(AT ) = N(A)⊥. Corollary 5.1.9. Let A ∈ Rm×n. N(A) = C(AT )⊥ and C(AT ) = N(A)⊥. 5.1.2. The set of all solutions to a system of linear equations. The machinery developed in the previous chapters allows us to understand the set of solutions to a system of linear equations over the reals. To make it precise, let A ∈ Rm×n. There are two important subspaces associated with A: 9 N(A) = {x ∈ Rn | Ax = 0} and R(A) = C(AT ) = {x ∈ Rn | ∃y ∈ Rm such that x = AT y}. We have learned that N(A) is the orthogonal complement of R(A). Vice versa, R(A) is the or- thogonal complement of N(A). Hence all of Rn can be written as the sum of two elements: one is from N(A) and the other one from R(A). In other words: ∀x ∈ Rn there exist x0 ∈ N(A) and x1 ∈ R(A) such that x = x0 + x1 and x1T x0 = 0. We summarize below how the set of all solutions to a system of equations is described. Theorem 5.1.10. {x ∈ Rn | Ax = b} = x1 + N(A) where x1 ∈ R(A) such that Ax1 = b. There is one final link between the nullspace of a matrix A and the nullspace of the matrix AT A that we will need in our analysis of projections later on. Lemma 5.1.11. Let A ∈ Rm×n. Then N(A) = N(AT A) and C(AT ) = C(AT A). Proof. If x ∈ N(A) then Ax = 0 and so A⊤ Ax = 0, thus x ∈ N(A⊤ A). The other implication is more interesting. If x ∈ N(A⊤ A) then A⊤ Ax = 0. This implies that x⊤ A⊤ Ax = x⊤ 0 = 0. But x⊤ A⊤ Ax = (Ax)⊤ (Ax) = ∥Ax∥2 so Ax must be a vector with norm 0 which implies that Ax = 0 and so x ∈ N(A). For the second statement we utilize Corollary 5.1.9. We have C(AT ) = N(A)⊥ = N(AT A)⊥ = C((AT A)T ) = C(AT A). 2 5.2. Projections. Guiding Question 3. If we have a system of linear equations that has no solution, how do we find the “solution” that has the smallest error? This question is central in countless applications3. 3as you will see later on, it is in a sense what Machine Learning is all about. 10 Before diving into systems of equations, we will study Projections of vectors in a subspace. Definition 5.2.1 (Projection of a vector onto a subspace). The projection of a vector b ∈ Rm on a subspace S (of Rm ) is the point in S that is closest to b. In other words (1) projS (b) = argmin ∥b − p∥. p∈S Sanity Check 4. This is only a proper definition if the minimum exists and is unique. Can you show it exists and is unique? (perhaps at the end of the lecture?) Let us build us some intuition by starting with one-dimensional subspaces. 5.2.1. The one-dimensional case. 
Let S be the subspace corresponding to the line that goes through the vector a ∈ Rm \ {0}, i.e. S = {λ a | λ ∈ R} = C(a). By drawing a two dimensional example one can see that the projection p is the vector in the subspace S such that the “error vector” e = b − p is perpendicular to a (i.e. b − p ⊥ a). This geometric intuition turns out to be correct. We will verify it algebraically. F IGURE 1. Projection on a line. Lemma 5.2.2. Let a ∈ Rm \ {0}. The projection of b ∈ Rm on S = {λ a | λ ∈ R} = C(a) is given by aaT projS (b) = T b. a a 11 Proof. Let p ∈ S, p = λ a for λ ∈ R. Then ∥b − p∥2 = (b − p)T (b − p) = bT b − 2bT p + pT p = ∥b∥2 − 2λ bT a + λ 2 ∥a∥2 = g(λ ). g is a convex, quadratic function in one variable λ. Hence, the minimizer is obtained at the point λ ∗ where the derivative vanishes. We obtain bT a g′ (λ ) = −2bT a + 2λ ∥a∥2 = 0 ⇐⇒ λ ∗ =. aT a Hence, we have shown that bT a aT b aaT projS (b) = λ ∗ a = a = a = b. aT a aT a aT a □ □ Let us next verify that our initial geometric understanding is indeed correct: the projection p should be the vector in the subspace S such that the “error vector” e = b − p is perpendicular to a, i.e., (b − projS (b)) ⊥ projS (b). Indeed by substituting what we just computed we obtain aaT T aaT bT aaT b T T aa aa T (aT b)2 bT aaT b (b − T b) T b = −b T b= T − = 0. a a a a aT a a a aT a a a aT a The projection of a vector that is already a multiple of a should be the identity operation. This is indeed true. Check that this is the case! 5.2.2. The general case. For general subspaces the idea is precisely the same as with dimension one. Let S be a subspace in Rm with dimension n. Let a1 ,... , an be a basis of S, meaning that S = span(a1 ,... , an ) = C(A) = {Aλ | λ ∈ Rn }, where A is the matrix with column vectors a1 ,... , an. Lemma 5.2.3. The projection of a vector b ∈ Rm to the subspace S = C(A) can be written as projS (b) = Ax̂, where x̂ satisfies the normal equations AT Ax̂ = AT b. 12 Proof. Let b ∈ Rm. The vector b can be written as b = p + e where p ∈ S and e ∈ S⊥ , i.e., pT e = 0. Now consider another point p′ ∈ S. Then p − p′ ∈ S and hence, eT (p − p′ ) = 0. This gives ∥p′ − b∥2 = ∥p′ − p + p − b∥2 = ∥p′ − p − e∥2 = ∥p′ − p∥2 + ∥e∥2 ≥ ∥e∥2 = ∥p − b∥2. Hence, we have shown that projS (b) = p = Ax̂ ∈ S where b = p + e with e ∈ S⊥. This shows us that (b − projS (b)) ⊥ ai for all i = 1,... , n ⇐⇒ aTi (b − projS (b)) = 0 for all i = 1,... , n. This is equivalent to saying that AT (b − projS (b)) = 0 ⇐⇒ AT (b − Ax̂) = 0 ⇐⇒ AT Ax̂ = AT b. □ □ −1 ⊤ If we can show that A⊤ A is invertible then we would have p = Ax̂ = A A⊤ A A b. Let’s make a detour to show that it is indeed invertible. Lemma 5.2.4. A⊤ A is invertible if and only if A has linearly independent columns. Proof. This follows essentially from Lemma 5.1.11 where we proved that A⊤ A and A have the same nullspace. This is enough because, since A⊤ A is a square matrix it is invertible if and only if its nullspace only has the 0 vector, and A has linearly independent columns if and only if its nullspace only has the 0 vector.4 2 Corollary 5.2.5. If A has linearly independent columns then A⊤ A is a square matrix, it is invertible and symmetric.5 Now back to deriving a formula for projections: Since the columns of A are a basis they are linearly independent and so A⊤ A is indeed invertible. We just proved the following. Theorem 5.2.6. Let S be a subspace in Rm and A a matrix whose columns are a basis of S. 
The projection of b ∈ Rm to S is given by projS (b) = Pb, 4We usually call a nullspace with only the zero vector, a trivial nullspace. 13 −1 where P = A A⊤ A A⊤ is the projection matrix. −1 The matrix P = A A⊤ A A⊤ is known as a Projection Matrix, it maps a vector b to its projection aa⊤ Pb on a subspace S. For the case of lines, P was given by P = a⊤ a = a a⊤1 a a⊤. Caution! 5. The matrix A (and A⊤ ) are not necessarily square, and so they don’t have inverses. −1 ⊤ −1 The expression A A⊤ A A cannot be simplified by expanding A⊤ A (which would yield m I = P, this would only make sense if S was all of R and note that, unsurprisingly, this would correspond exactly to the case when A is invertible). Let us summarize a few facts about the projection matrix P. P can be viewed as a mapping: for a given vector b its projection is given by projS (b) = Pb. Remark 5.2.7. If b ∈ Rm , then projS (projs (b)) = projS (b) by definition. This requires us to have that PPb = Pb, i.e., we should have P2 = P. Indeed   −1 2  −1  −1  −1 ⊤ 2 P = A A A A⊤ = A A⊤ A A⊤ A A⊤ A A⊤ = A A⊤ A A⊤ = P. Let S⊥ be the orthogonal complement of S and P the projection matrix onto the subspace S, i.e., projS (b) = Pb. Then I − P is the projection matrix that maps b ∈ Rm to projS⊥ (b). This follows since b = e + projS (b) = e + Pb where e ∈ S⊥. Hence, (I − P)b = b − Pb = e = projS⊥ (b). Note that – as it should be – we have that (I − P)2 = I − 2P + P2 = I − P. 5.3. Least Squares Approximation. We go back to the guiding question of what to do when we want to “solve” a linear system that does not have an exact solution. More precisely let us suppose we have a linear system Ax = b, for which no solution x exists (for example, with too many equations, which would happen if A ∈ Rm×n and m > n). A natural approach is to try to find x for which Ax is as close as possible to b (2) min ∥Ax̂ − b∥2. x̂∈Rn 14 Further Remark 6. This seemingly simple observation is key to countless technologies. Mea- surement systems often have errors and so it is impossible to find the target object/signal x that satisfies them all exactly, and we look for the one that satisfies them the best possible. In Data Science and Learning Theory we often want to find a predictor that best describes a set of training data, but usually no predictor described the data exactly, so we look for the best possible, etc etc. We’ll see a couple of applications later. We can solve this problem using the ideas we developed above. What we are looking for is a vector x̂ for which the error e = b − Ax̂ is as small as possible. Since the set of possible vectors y = Ax̂ is exactly C(A), Ax̂ is precisely the projection of b on C(A). As we saw in the Section above, this means that A⊤ (b − Ax̂) = 0. These are known as the normal equations and can be rewritten as (3) A⊤ Ax̂ = A⊤ b. Recall that we had shown in Lemma 5.1.11 that for any matrix A, C(A⊤ ) = C(A⊤ A). Hence, the system (3) always has a solution. We also know that if A has linearly independent columns, then A⊤ A is invertible and so we can write x̂ = (A⊤ A)−1 A⊤ b. We will address the case in which A has dependent columns shortly. Fact 5.3.1. A minimizer of (2) is also a solution of (3). When A has independent columns the unique minimizer x̂ of (2) is given by (4) x̂ = (A⊤ A)−1 A⊤ b 5.3.1. Linear Regression — fitting a line to data points. One of the most common tasks in data analysis is linear regression, to fit a line through data points. Let us consider data points (t1 , b1 ), (t2 , b2 ),... 
, (tm , bm ), perhaps representing some attribute b over time t. If the relation between t and b is (at least partly) explained by a linear relationship then it makes sense to search for constants α0 ∈ R and α1 ∈ R such that bk ≈ α0 + α1tk. 15 F IGURE 2. Fitting a line to points See Figure 2. In particular, it is natural to search for α0 and α1 that minimize the sum of squares of the errors (“least squares”), m min α0 ,α1 ∑ (bk − [α0 + α1tk ])2. k=1 In matrix-vector notation " # 2 α0 (5) min b − A , α0 ,α1 α1 where     b1 1 t1 b2 1 t2          b=..  and  A=....  . .   ..       bm−1   1 tm−1  bm 1 tm We can assume w.l.o.g. that A has independent columns, see Lemma 5.3.2. Hence, the solution to (5) is given by " # " #−1 " # m m α0 m ∑k=1 kt ∑ b k=1 k = (A⊤ A)−1 A⊤ b = m m 2 m α1 t ∑k=1 k ∑k=1 kt ∑k=1 tk bk 16 Lemma 5.3.2. The columns of the m × 2 matrix A defined before are linearly dependent if and only if ti = t j for all i ̸= j. Proof. Suppose that there are two indices i ̸= j such that ti ̸= t j. Let 1 be the all ones-vector in Rm and t the vector with components t1 ,... ,tm. Consider the system in variables λ , µ λ 1 + µt = 0. Since ti ̸= t j we can subtract row j from row i to obtain λ 0 + µ(ti − t j ) = 0 ⇐⇒ µ = 0 since ti − t j ̸= 0. This implies that λ = 0 and hence A has full column rank. Conversely, if ti = t j for all i and j, then t = t1 1. Then the two columns of A are linearly dependent. □ Remark 5.3.3. If the columns of A are pairwise orthogonal, then A⊤ A is a diagonal matrix, which is easy to invert. In this example, the columns of A being orthogonal corresponds to ∑m k=1 tk = 0. new 1 m We could simply do a change of variables to a new time tk = tk − m ∑i=1 ti to achieve this. If indeed ∑m k=1 tk = 0 then the equation above could be easily simplified: " # " #−1 " # " 1 #" # α0 m 0 ∑m b k=1 k m 0 ∑ m b k=1 k = = 1 α1 0 ∑m t k=1 k 2 m t ∑k=1 k kb 0 m ∑k=1 tk2 m ∑k=1 tk bk " # 1 m ∑ m k=1 bk =  , (∑m m k=1 k bk ) / ∑k=1 tk t 2 this is an instance where having orthogonal vectors is beneficial. In this next Section we will see how to build orthonormal basis for subspaces, and some of the many benefits they have. Challenge 7. Try to work out the actual change of variables that makes the tk ’s add up to zero and derive a formula for fitting a line to points without the assumption in Remark 5.3.3 Example 5.3.4 (Fitting a Parabola). We can use Linear Algebra to do fits of many other curves (or functions), not just lines. If we believe the relationship between tk and bk is quadratic we could attempt to fit a Parabola: bk ≈ α0 + α1tk + α2tk2. While this isn’t a linear function in tk , this is still a linear function on the coefficients α0 , α1 , and α2 , and this is what is important. Similarly as with linear regression, it is natural to attempt to 17 minimze   2 α0 (6) min b − A  α1  ,   α0 ,α1 ,α2 α2 where t12     b1 1 t1 b2 1 t2 t22         ..  ....  b= .   and A= .. ,    bm−1     1 tm−1 2  tm−1  bm 1 tm tm2 and we can use the technology we developed in this section to solve this problem as well. Challenge 8. Try to work out the example of fitting a parabola further. What is A⊤ A? When is A⊤ A diagonal? Further Reading 9. There is a whole (beautiful) area of Mathematics related to studying so- called Orthogonal Polynomials. The basic idea can be already hinted at from these examples: In the example of the parabola we wrote a function of t as a linear combination of the polynomials 1, t, and t 2. 
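As a concrete sketch of these polynomial least-squares fits (the line of Section 5.3.1 and the parabola of Example 5.3.4), one can set up the design matrices of (5) and (6) whose columns are the basis functions just mentioned and solve the normal equations (3) numerically. The snippet below uses Python with NumPy purely for illustration; the data values are made up and are not taken from the notes.

```python
import numpy as np

# Illustrative data points (t_k, b_k); these values are made up, not from the notes.
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
b = np.array([1.1, 1.9, 3.2, 3.9, 5.1])

# Line fit: the columns of A are the basis functions 1 and t, as in (5).
A_line = np.column_stack([np.ones_like(t), t])
x_line = np.linalg.solve(A_line.T @ A_line, A_line.T @ b)   # normal equations (3)

# Parabola fit: the columns of A are 1, t and t^2, as in (6).
A_par = np.column_stack([np.ones_like(t), t, t**2])
x_par = np.linalg.solve(A_par.T @ A_par, A_par.T @ b)

print("line coefficients     (alpha_0, alpha_1):         ", x_line)
print("parabola coefficients (alpha_0, alpha_1, alpha_2):", x_par)
```

The call np.linalg.lstsq(A, b, rcond=None) solves the same minimization directly, and is numerically preferable to forming A⊤A explicitly.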
But we could have picked other polynomials, we could have e.g. written something like b ≈ α0′ + α1′ (t − 2023) + α2 (t 2 +t), and a particularly good choice (that would depend on the distribution of the points tk ) might have resulted in a diagonal matrix A⊤ A... search “orthogonal polynomials” online to learn more. Further Reading 10. A lot of Machine Learning includes Linear Regression as a key component. The idea is to create, find, or learn features of the data points. Given n data points t1 ,...tn (which now can be perhaps pixel images, rather than just time points) we might want to do classification (for example, in the case of images, maybe we want a function that is large when the picture has a dog in it and small when it has a cat in it). It is hard to imagine that this can be done with a linear fit, but if we build good feature vectors ϕ(tk ) ∈ R p for very large p then the function can depend on all coordinates of ϕ(tk ) (the p features) and this is incredible powerful. There are several ways to construct features, a bit over a decade ago they were sometimes handmade, now they are often learned (this is in a sense what Deep Learning does). Another important way to build (or compute with) features are the so-called Kernel Methods. 5.4. Orthonormal Bases and Gram Schmidt. When we think of (or draw) a basis of a subspace, we tend to think of (or draw) vectors that are orthogonal (have an angle of 90◦ ) and that have the same length (length 1). Indeed, these bases 18 have many advantages, this section is about these bases, some of their advantages, and how to find them. Definition 5.4.1 (Orthonormal vectors). Vectors q1 ,... , qn ∈ Rm are orthonormal if they are or- thogonal and have norm 1. In other words, for all i, j ∈ {1,... , n} qTi q j = δi j , where δi j is the Kronecker delta ( 0 if i ̸= j (7) δi j = 1 if i = j. If Q is the matrix whose columns are the vectors qi ’s, then the condition that the vectors are orthonormal can be rewritten as Q⊤ Q = I. Caution! 11. Q may not be a square matrix, and so it is not necessarily the case that QQ⊤ = I. Example 5.4.2. A classical example of an orthonormal set of vectors is the canonical basis, e1 ,... , en ∈ Rn where ei is the vector with a 1 in the i-th entry and 0 in all other entries, i.e., (ei ) j = δi j. When Q is a square matrix then Q⊤ Q = I implies also that QQ⊤ = I and so Q−1 = Q⊤. We call such matrices orthogonal matrices. This corresponds to the case when the qi ’s are an orthonormal basis for all of Rn. Definition 5.4.3 (Orthogonal Matrix). A square matrix Q ∈ Rn×n is an orthogonal matrix when Q⊤ Q = I. In this case, QQ⊤ = I, Q−1 = Q⊤ , and the columns of Q form an orthonormal basis for Rn. Example 5.4.4. The 2 × 2 matrix Q that corresponds to rotating, counterclockwise, the plane by θ, " # cos θ − sin θ Rθ = sin θ cos θ is an orthogonal matrix. Indeed, " #" # cos θ sin θ cos θ − sin θ RTθ Rθ = = I. − sin θ cos θ sin θ cos θ Example 5.4.5. Permutation matrices are another example of orthogonal matrices. A permuta- tion is a map π : {1,... , n} 7→ {1,... , n} such that π(i) ̸= π( j) for i ̸= j. 19 The permutation matrix A ∈ Rn×n associated with π has entries Ai j = 1 if π(i) = j and Ai j = 0, otherwise. From this definition one can derive that AT is the permutation matrix associated with the permutation σ defined as σ ( j) = i for π(i) = j. Hence, AT A = I, i.e., A is an orthogonal matrix. Challenge 12 (⋆). Show that for every permutation matrix P there exists a positive integer k such that Pk = I. Proposition 5.4.6. 
Orthogonal matrices preserve norm and inner product of vectors. In other words, if Q ∈ Rn×n is orthogonal then, for all x, y ∈ Rn ∥Qx∥ = ∥x∥ and (Qx)⊤ (Qy) = x⊤ y Proof. To show the second inequality note that, for x, y ∈ Rn we have that (Qx)⊤ (Qy) = x⊤ Q⊤ Qy = x⊤ Iy = x⊤ y. To show the first equality note that, since for x ∈ Rn we have that ∥Qx∥ ≥ 0 and ∥x∥ ≥ 0, it suffices to show that the squares are equal and indeed ∥Qx∥2 = (Qx)⊤ (Qx) = x⊤ x = ∥x∥2. 2 5.4.1. Projections with Orthonormal Basis. One advantage of having access to an orthonormal basis is that projections become much simpler. The reason is easy to explain. When we discussed projections and least squares, many of the expressions we derived included A⊤ A, but in the case when A has orthonormal columns, these all simplify as A⊤ A = I. We collect these observations in the following proposition. Proposition 5.4.7. Let S be a subspace of Rm and q1 ,... , qn be h an orthonormal basis i for S. Let Q be the m × n matrix whose columns are the qi ’s; Q = q1 , · · · , qn. Then the Projection Matrix that projects to S is given by QQ⊤ and the Least Squares solution to Qx = b is given by x̂ = Q⊤ b. Remark 5.4.8. When Q is a square matrix then the projection QQ⊤ is simply the identity (corre- sponding to projecting to the entire ambient space Rn. Even in this seemingly trivial instance, it is useful to look closer at what this operation does: For a vector x ∈ Rn it gives       x = q1 q⊤ 1 x + q 2 q ⊤ 2 x + · · · + q n q⊤ n x. 20 It is writing x as a linear combination of the orthonormal basis {qi }ni=1 (as we will see later this is sometimes referred to as a change of basis).6 5.4.2. Gram-Schmidt Process. By now we have given some evidence that orthonormal bases are useful. Fortunately, there is a relatively simple process to construct orthonormal bases, that will also suggest a new matrix factorization. The idea is simple: If we have 2 linearly independent vectors a1 and a2 which span a subspace S, it is straightforward to transform them into an orthonormal basis of S: we first normalize a1 : q1 = ∥aa11 ∥ , then subtract from a2 a multiple of q1 so that it becomes orthogonal to q1 , followed by a normalization step: a2 − (a⊤2 q1 )q1 q2 = ⊤. a2 − (a2 q1 )q1 Let us check that indeed these vectors are orthonormal: By construction they have unit norm, and a2 − (a⊤ 2 q1 )q1 q⊤ ⊤ ⊤ 1 a2 − (a2 q1 )q1 q1 0 q⊤ ⊤ 1 q2 = q1 ⊤ = ⊤ = = 0. a2 − (a2 q1 )q1 a2 − (a2 q1 )q1 a2 − (a⊤ 2 q1 )q1 Note that the denominator is not zero because a1 and a2 are linearly independent; and that, since q1 has unit norm, (a⊤ 2 q1 )q1 = projSpan(q1 ) (a2 ). For more vectors, the idea is to apply this process recursively, by removing from a vector ak+1 the projection of it on the subspace spanned by the k vectors before it. More formally: Algorithm 5.4.9. [Gram-Schmidt Process] Given n linearly independent vectors a1 ,... , an that span a subspace S, the Gram-Schmidt process constructs q1 ,... qn in the following way: q1 = ∥aa11 ∥. For k = 2,... , n set q′k = ak − ∑k−1 ⊤ i=1 (ak qi )qi q′k qk = ∥q′k ∥. Theorem 5.4.10 (Correctness of Gram-Schmidt). Given n linearly independent vectors a1 ,... , an , the Gram-Schmidt process returns an orthonormal basis for the span of a1 ,... , an. 6There are countless instances in which doing this operation is beneficial, for example one of the most important algorithms, the Fast Fourier Transform, is an instance of this operation. 21 Proof. Let Sk be the subspace spanned by a1 ,... , ak. Then S = Sn. 
We will show, by induction, that q1 ,... , qk are an orthonormal basis for Sk. It is enough to show that they are orthonormal and are in Sk since orthonormality implies linearly independence and Sk has dimension k. For the base case, note that ∥q1 ∥ = 1 and q1 is a multiple of a1 and so q1 ∈ S1. Now we assume the hypothesis for i = 1,... k −1 and prove it for k. By the hypothesis q1 ,... , qk−1 are orthonormal, so we have to show that ∥qk ∥ = 1 and that q⊤ i qk = 0 for all 1 ≤ i ≤ k − 1. Since ak is linearly independent from the other original vectors it is not in Sk−1 and so q′k ̸= 0. Thus ∥qk ∥ = 1. By construction ak ∈ Sk and so qk ∈ Sk. Let 1 ≤ j ≤ k − 1. Since q1 ,... , qk−1 are orthonormal, we have ! k−1 k−1 q⊤j ak − ∑ (a⊤ k qi )qi = q⊤j ak − ∑ (a⊤ ⊤ ⊤ ⊤ k qi )q j qi = q j ak − (ak q j ) = 0, i=1 i=1 and q⊤j qk = 1 q⊤ q′ ∥q′k ∥ j k = 0. 2 Challenge 13. Try to do the Gram-Schmidt process for the columns of   1 2 3 0  0 4 5 6  .     0 0 7 8  0 0 0 9 Is it the case that the Gram-Schmidt process of the columns of an upper triangular matrix (with non-zero diagonal elements) is always a subset of the canonical basis? Can you come up with an example of a set of vectors for which Gram-Schmidt does not output elements of the canonical basis? Gram-Schmidt actually provides us with a new matrix factorization. Let A be an m × n matrix with linearly independent columns a1 ,... , an and Q the m×n matrix whose columns are q1 ,... , qn as outputted by Algorithm 5.4.9. Let R = Q⊤ A, since each qk is orthogonal to every ai for i < k we have that R is upper triangular. Q is not necessarily a square matrix, and so not necessarily invertible. But QQ⊤ is the projection on the span of the qi ’s and thus also on the ai ’s, this means that QQ⊤ A = A and so we have that QR = QQ⊤ A = A. We call A = QR the QR decomposition. 22 Definition 5.4.11 (QR decomposition). Let A be an m × n matrix with linearly independent columns. The QR decomposition is given by A = QR, where Q is an m × n matrix with orthonormal columns (they are the output of Gram Schmidt, Algorithm 5.4.9, on the columns of A) and R is an upper triangular matrix given by R = Q⊤ A. It requires us to show that indeed this is a proper definition. We need to convince ourselves that R is upper triangluar. Lemma 5.4.12. The matrix R defined in Definition 5.4.11 is upper triangular and invertible. Moreover, QQT A = A and hence, A = QR is well defined. Proof. qTk qi = 0 for all i = 1,... k − 1. Since q1 ,... , qk−1 and a1 ,... , ak−1 span the same subspace Sk−1 we have that qTk ai = 0 for all i = 1,... , k − 1. Hence R = QT A is upper triangular. Moreover, QQT is the projection onto the subspace C(Q) = C(A). Hence, for every index i, projSn (ai ) = ai = QQT ai. This is equivalent to QQT A = QR = A. Finally, N(A) = {0} and since A = QR, we must have that N(R) = {0}. Since R is an n by n matrix R is invertible and hence, RT as well. 2 Fact 5.4.13. The QR decomposition greatly simplifies calculations involving Projections and Least Squares. Since C(A) = C(Q) then projections on C(A) can be done with Q which means they are given by projC(A) (b) = QQ⊤ b. The least squares solution to Ax = b denoted by x̂ is defined as a solution of the normal equations (recall (3)) A⊤ Ax̂ = A⊤ b. Furthermore, A⊤ A = (QR)⊤ (QR) = R⊤ Q⊤ QR = R⊤ R, and so we can write (8) R⊤ Rx̂ = R⊤ Q⊤ b. 
Since R has independent columns (is full column rank) then N(R) = {0} and so we can simplify (8) to (9) Rx̂ = Q⊤ b, 23 which can be efficiently solved by back-substitution since R is a triangular matrix. 5.5. The Pseudoinverse, also known as Moore–Penrose Inverse. The goal of this Section is to construct an analogue to the inverse of a matrix A for matrices that have no inverse. Such an analogue is called a pseudoinverse, or the Moore-Penrose Inverse, and we will denote it by A†. It is also commonly denoted by A+. Guiding Question 14. While not all matrices are A invertible, we saw that we can still aim to find the (or a) vector x such that Ax is as close as possible to a target vector b. Can we develop this idea to define a “pseudoinverse” for any matrix A, a matrix that is, in a sense, closest to being an inverse for A? What should “closest to being an inverse” even mean? There are (at least) three issues we need to overcome to try to define a pseudoinverse for a non- invertible matrix A: (i) For some vectors b there might not be a vector x such that Ax = b, (ii) For some vectors b there may be more than one x such that Ax = b and we would have to pick one, and (iii) even if we make such choices, it is not clear that such an operation will correspond to multiplying by a matrix A†. Let A ∈ Rm×n be an m × n matrix. There are a couple of different ways we could try to define a pseudoinverse A† for a non-invertible matrix A. Let us start by building on what we discussed in Section 5.3 (Least Squares Approximations), if the columns of A are linearly independent that it would make sense to build A† such that A† b is the Least Squares Solution x̂ = (A⊤ A)−1 A⊤ b (the vector x̂ such that Ax̂ is as close as possible to b), and so for matrices A with independent columns we will define A† = (A⊤ A)−1 A⊤. This motivates the following definition. Definition 5.5.1 (Pseudoinverse for matrices of full column rank). For A ∈ Rm×n with rank(A) = n we define the pseudo-inverse A† ∈ Rn×m of A as A† = (A⊤ A)−1 A⊤. Proposition 5.5.2. For A ∈ Rm×n with rank(A) = n, the pseudoinverse A† is a left inverse of A, meaning that A† A = I. Proof. Since rank(A) = n, A⊤ A is invertible. Furthermore, A† A = (A⊤ A)−1 A⊤ A = I. 2 24 Let us know consider the case for which the rows are linearly independent (in other words, A ∈ Rm×n is full row rank; or equivalently rank(A) = m). One natural way to define a pseudoinverse is based on the observation that A⊤ has full column rank and to define A† as   ⊤    −1   !⊤  −1 ⊤ † ⊤ ⊤  −1 ⊤ ⊤ ⊤ ⊤ ⊤ A = A A A = AA A = A⊤ AA⊤. Definition 5.5.3 (Pseudoinverse for matrices of full row rank). For A ∈ Rm×n with rank(A) = m we define the pseudo-inverse A† ∈ Rn×m of A as A† = A⊤ (AA⊤ )−1. Lemma 5.5.4. For A ∈ Rm×n with rank(A) = m, the pseudoinverse A† is a right inverse of A, meaning that AA† = I. Proof. Since rank(A) = m, AA⊤ is invertible. Furthermore, AA† = AA⊤ (AA⊤ )−1 = I. 2 Let us try to understand what A† is achieving for full row rank matrices A. Since A is full row rank, for all b ∈ Rm , there exists x ∈ Rn such that Ax = b. The issue is that there are potentially many such vectors. A natural strategy in this case is to pick, among all such vectors, the one with smallest norm.7 In other words to solve (10) min ∥x∥2 x∈Rn s.t. Ax = b, where s.t. stands for “subject to” or “such that”. If x1 and x2 are vectors such that Ax1 = Ax2 = b then x1 − x2 ∈ N(A), and conversely, if Ax = b and y ∈ N(A) then A(x + y) = b. 
Thus, given one vector x1 such that Ax1 = b the set of solutions to Ax = b are all vectors of the form x1 + y where y ∈ N(A). So we  would like to  find the minimum ∥x1 + y∥ among all vectors y ∈ N(A). Let us write x1 = x1 − projN(A) (x1 ) + projN(A) (x1 ). Since y ∈ N(A) we have that     x1 − projN(A) (x1 ) ⊥ y + projN(A) (x1 ) and so, by Pythagoras,   2 2 ∥x1 + y∥ = x1 − projN(A) (x1 ) + projN(A) (x1 ) + y 2 2 = x1 − projN(A) (x1 ) + projN(A) (x1 ) + y , 7This idea, of picking the smallest (or simplest) solution among many possibilities goes far beyond Linear Algebra and is known as “regularization” in Statistics, Machine Learning, Signal Processing, and Image Processing, etc. It can be viewed as a mathematical version of the famous “Occam’s razor” principle in Philosophy. 25 and so picking y = − projN(A) (x1 ) yields the smallest norm solution. Since the vectors orthogonal to N(A) are precisely the vectors that are in C(A⊤ ), i.e., the row space of A. We just proved. Lemma 5.5.5. For any matrix A and a vector b ∈ C(A), the (unique) solution to (10) is given by the vector x̂ ∈ C(A⊤ ) that satisfies the constraint Ax̂ = b. A† is precisely the matrix that maps b to a point x̂ that corresponds to a solution of (10). Proposition 5.5.6. For a full row rank matrix A, the (unique) solution to (10) is given by the vector x̂ = A† b. Proof. By using Lemma 5.5.5 we just need to show that x̂ = A† b satisfies Ax̂ = b and that x̂ = A† b is in C(A⊤ ). Both these are easy to verify: Ax̂ = AA† b = AA⊤ (AA⊤ )−1 b = b and x̂ = A† b = A⊤ (AA⊤ )−1 b and so x̂ ∈ C(A⊤ ). 2  Our next task is to define A† for all matrices, not just full rank matrices. The idea is to write A as a product of two matrices, one which is of full column rank and one which is of full row rank. Recall that in Part I of the lecture we achieved this task by introducing the CR-decomposition. For A ∈ Rm×n , with rank(A) = r, the CR decomposition writes A = CR where C ∈ Rm×r has the first r linearly independent columns of A and R ∈ Rr×n is upper triangular. Note that C has full column rank and R full row rank. Definition 5.5.7 (Pseudoinverse for all matrices). For A ∈ Rm×n with rank(A) = r and CR de- composition A = CR we define the pseudoinverse A† as A† = R†C† , which can be rewritten as  −1  −1  −1  −1 A† = R⊤ RR⊤ C⊤C C⊤ = R⊤ C⊤CRR⊤ C⊤ = R⊤ C⊤ AR⊤ C⊤. The following lemma characterizes what the matrix A† achieves for us. 26 Lemma 5.5.8. Given A ∈ Rm×n and a vector b ∈ Rn , the (unique) solution to (11) min ∥x∥2 x∈Rn s.t. A⊤ Ax = A⊤ b, is given by x̂ = A† b. Proof. Let r be the rank of A and A = CR with C ∈ Rm×r and R ∈ Rr×n. Then x̂ = A† b = −1 ⊤ R⊤ C⊤ AR⊤ C b. Thus,  −1  −1 A⊤ Ax̂ = A⊤ AR⊤ C⊤ AR⊤ C⊤ b = R⊤C⊤ AR⊤ C⊤ AR⊤ C⊤ b = R⊤C⊤ b = A⊤ b. It remains to show that x̂ is the smallest norm solution. To verify this we use Lemma 5.5.5, i.e., we need to verify that x̂ ∈ C(A⊤ A). From Lemma 5.1.11 we conclude that C(AT A) = C(AT ). Hence it is enough to show that x̂ ∈ C(A⊤ ) and since C(A⊤ ) = C(R⊤ ) we have that −1 ⊤ x̂ = R⊤ C⊤ AR⊤ C b ∈ C(R⊤ ) from which the statement follows. 2 In this proof, the only property of the matrices CR we used is that A = CR and both C and R are full rank. So we have actually shown that we can compute the pseudoinverse from any full rank factorization, not just specifically the CR decomposition. We write it here as a proposition. Proposition 5.5.9. For A ∈ Rm×n , with rank(A) = r, and let S ∈ Rm×r and T ∈ Rr×n such that A = ST. Then, A† = T † S †. Remark 5.5.10. 
Note that If A = ST and rank(A) = r then rank(S) ≥ r and rank(T ) ≥ r and so the matrices ST in Proposition 5.5.9 are indeed full rank (either full column rank or full row rank). Let us finally summarize a few important properties about the matrix A and its pseudoinverse A†. Theorem 5.5.11. Let A ∈ Rm×n. (1) AA† A = A (2) A† AA† = A†. (3) AA† is symmetric. It is the projection matrix for projection on C(A), 27 (4) A† A is symmetric. It is the projection matrix for projection on C(A⊤ ). † ⊤ (5) A⊤ = A† , Proof. (1) We calculate AA† A = CRRT (CT CRRT )−1CT CR = CRRT (RRT )−1 (CT C)−1CT CR = CR = A. (3) To see that AA† is symmetric we calculate T AA† = CRRT (RRT )−1 (CT C)−1CT = C(CT C)−1CT = C(CT C)−1CT = (AA† )T. Since the column space of A, C(A) and the column space of C coincide and since C is a basis for C(A) Theorem 5.2.6 applies and shows us that AA† = C(CT C)−1CT is the projection matrix for projecting onto C(A). □ Challenge 15. Prove Statements (2), (4) and (5) of Theorem 5.5.11. Proposition 5.5.12. Let A ∈ Rm×n be a matrix and recall that C(A) and C(A⊤ ) denote re- spectively its column and row spaces. When A : x → Ax is viewed as a function from C(A⊤ ) to C(A) it is a bijection. In other words, for all b ∈ C(A) there is one and only one x ∈ C(A⊤ ) such that Ax = b. Challenge 16. Prove Proposition 5.5.12 5.6. Projections of sets and the Farkas lemma. We have so far seen that a single point can be projected to a subspace by solving a least squares problem. Now we want to consider entire sets of points and understand their projections. Guiding Question 17. Suppose we are given a set of linear inequalities in Rn. How can we certify that the set is nonempty? An answer to this question is absolutely fundamental and has numerous applications in other areas of science. In order to attack this question we remark that projections are a way to reduce the dimension of the initial question about feasibility of a set in Rn to a question about feasibility of a set in smaller dimension. We will not consider here arbitrary sets, but focus on sets described by linear inequalities whose coefficients are all rational. Let us make this precise 28 Definition 5.6.1. Let A ∈ Qm×n , b ∈ Qm and P = {x ∈ Rn | Ax ≤ b}. P is called a polyhedron. Let S = {1,... , s}. The projection of P on the subspace Rs associated with the variables in the subset S is projS (P) := {x ∈ Rs | ∃y ∈ Rn−s such that (x, y) ∈ P}.. From the definition it is clear that P ̸= 0/ if and only if projS (P) ̸= 0./ The question though is whether the set projS (P) also has a description in the form of a finite system of linear inequalities. If so, then we can indeed reduce the question of whether P ̸= 0/ to a question of the same form in smaller dimension. Let us build some intuition by analyzing the situation when we project a one-dimensional set of inequalities to dimension 0. Let a ∈ Qm , ai ̸= 0 for all i and b ∈ Qm. We consider P = {x ∈ R | ax ≤ b} ⊆ R. We first notice that we can rewrite the constrains in P as follows. Set bi bi u := min{ | ai > 0}, l := max{ | ai < 0}. ai ai Then bi bi P = {x ∈ R | x ≤ if ai > 0, x ≥ if ai < 0} = {x ∈ R | x ≤ u, x ≥ l}. ai ai Now we are in the position to clarify when P = 0. / Proposition 5.6.2. P ̸= 0/ ⇐⇒ l ≤ u ⇐⇒ 0 ≤ u − l ⇐⇒ 0 ≤ yT b for all y ≥ 0 such that yT a = 0. This idea can be applied in general. 5.6.1. Elimination of one variable. Let A ∈ Qm×n , b ∈ Qm and P = {x ∈ Rn | Ax ≤ b}. In what follows we denote the entries of the matrix A by ai j. 
Hence, row i gives us an inequality of the form n ∑ ai j x j ≤ bi. j=1 Let x̄ = (x1 ,... , xn−1 ) and define the matrix consisting of the first n − 1 columns by Ā = [A·1... A·n−1 ]. Consider now the algorithm that applies the following steps. 29 Step 1 Partition the indices M = {1,... , m} of the rows of the matrix A into three subsets M0 = {i ∈ M | ai,n = 0}, M+ = {i ∈ M | ai,n > 0} and M− = {i ∈ M | ai,n < 0}. 1 Step 2 – For every row with index i ∈ M+ multiply the corresponding constraint by ain. This gives a new representation of the i-th. row as follows bi ai j xn ≤ di + fiT x̄ for i ∈ M+ where di = , fi j = −. ain ain – Every row with index k ∈ M0 can be rewritten as 0 ≤ dk + fkT x̄ for k ∈ M0 where dk = bk , fk j = −ak j. 1 – For every row with index i ∈ M− multiply the corresponding constraint by ain. This gives a new representation of the i-th. row as follows bi ai j xn ≥ di + fiT x̄ for i ∈ M− where di = , fi j = −. ain ain Step 3 Return n Q= x̄ ∈ Rn−1 | 0 ≤ dk + fkT x̄ for all k ∈ M0 , o dl + flT x̄ ≤ di + fiT x̄ for all l ∈ M− , i ∈ M+. Theorem 5.6.3. The set Q returned in Step 3 is a polyhedron. Moreover, Q = projS (P), where S = {1,... , n − 1}. Proof. Q is a polyhedron, because we can define a matrix F ∈ Qk×n−1 and a right hand side vector δ such that n o Q = x̄ ∈ Rn−1 | F x̄ ≤ δ. Indeed, k = |M0 | + |M− ||M+ |. The rows of F contain all rows of A with index i ∈ M0 and the corresponding right hand side vector satisfies that δi = bi. The other rows of F are of the form ( fl − fi )T for indices l ∈ M− and i ∈ M+. The corresponding right hand side entry of δ is then δi = di − dl. Let us show next that projS (P) ⊆ Q. Take any x̄ ∈ projS (P). By definition there exists z ∈ R such that (x̄, z) ∈ P. Hence z satisfies the constraints presented in Step 2. In particular, dl + flT x̄ ≤ z ≤ di + fiT x̄ for all l ∈ M, i ∈ M+. This shows that x̄ ∈ Q. 30 To see why Q ⊆ projS (P), take any x̄ ∈ Q. It follows that 0 ≤ dk + fkT x̄ for all k ∈ M0 , dl + flT x̄ ≤ di + fiT x̄ for all l ∈ M− , i ∈ M+. Let L := max{dl + flT x̄ | l ∈ M− } and U := min{di + fiT x̄ | i ∈ M+ }. Take any value z ∈ [L,U]. Then (x̄, z) ∈ P. Hence, x̄ ∈ projS (P). □ Our next task is to use these projections repeatedly. Lemma 5.6.4. Let A ∈ Qm×n , b ∈ Qm and P = {x ∈ Rn | Ax ≤ b}. Let S1 = {1,... , n − 1} and S2 = {1,... , n − 2}. Then projS2 (P) = projS2 (projS1 (P)). Proof. Let z ∈ projS2 (P). Hence, there exist real values (xn−1 , xn ) such that (z, xn−1 , xn ) ∈ P. In particular, there exists a value xn such that (z, xn−1 , xn ) ∈ P. Hence, (z, xn−1 ) ∈ projS1 (P) and hence, z ∈ projS2 (projS1 (P). Conversely, take any z ∈ projS2 (projS1 (P)). By definition, there exists a real value xn−1 such that (z, xn−1 ) ∈ projS1 (P). Hence there also exists a real value xn such that (z, xn−1 , xn ) ∈ P. Hence, z ∈ projS2 (P). □ It is an inductive argument to show the following generalization of Lemma 5.6.4. This is left as an exercise. Challenge 18. Let A ∈ Qm×n , b ∈ Qm and P = {x ∈ Rn | Ax ≤ b}. Let S1 = {1,... , n − k} and S2 = {1,... , n − j} for indices 1 ≤ k < j < n. Then projS2 (P) = projS2 (projS1 (P)). 5.6.2. A compact algebraic description of projections and the Farkas Lemma. In view of Challenge 18 we can start with a polyhedron P in dimension n and eliminate variable by variable. At the end there is no variable left and we simply reduced the question of whether P is empty or not to a logical statement that is either true or false, see Proposition 5.6.2. 
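A minimal computational sketch of one such elimination step (Steps 1–3 of Section 5.6.1) is given below, written in Python with NumPy purely for illustration; the function name eliminate_last_variable and the example polyhedron are assumptions of the sketch, not part of the notes.

```python
import numpy as np

def eliminate_last_variable(A, b):
    """One elimination step (Steps 1-3 of Section 5.6.1): given {x : Ax <= b} in R^n,
    return (F, delta) with {xbar : F xbar <= delta} the projection onto the first n-1 variables."""
    M0 = [i for i in range(len(b)) if A[i, -1] == 0]
    Mplus = [i for i in range(len(b)) if A[i, -1] > 0]
    Mminus = [i for i in range(len(b)) if A[i, -1] < 0]
    rows, rhs = [], []
    for k in M0:                       # rows not involving x_n are kept unchanged
        rows.append(A[k, :-1])
        rhs.append(b[k])
    for l in Mminus:                   # each pair (l, i) gives d_l + f_l^T xbar <= d_i + f_i^T xbar
        for i in Mplus:
            f_l, d_l = -A[l, :-1] / A[l, -1], b[l] / A[l, -1]
            f_i, d_i = -A[i, :-1] / A[i, -1], b[i] / A[i, -1]
            rows.append(f_l - f_i)     # i.e. (f_l - f_i)^T xbar <= d_i - d_l
            rhs.append(d_i - d_l)
    return np.array(rows), np.array(rhs)

# Made-up example: P = {(x1, x2) : x2 <= 1, -x2 <= 1, x1 + x2 <= 2, -x1 <= 0}.
A = np.array([[0.0, 1.0], [0.0, -1.0], [1.0, 1.0], [-1.0, 0.0]])
b = np.array([1.0, 1.0, 2.0, 0.0])
F, delta = eliminate_last_variable(A, b)
print(F)        # rows describing the projection onto the x1-axis
print(delta)
```

Applying the function repeatedly eliminates one variable at a time; if at any stage a row with all-zero coefficients and a negative right hand side appears, the original polyhedron P must be empty.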
The question emerges how to describe this elimination process algebraically? This requires us to set up some terminology. 31 Definition 5.6.5. Let A ∈ Qm×n , b ∈ Qm and P = {x ∈ Rn | Ax ≤ b}. For k ∈ {1,... , j} let A( j) to be the submatrix of A with column vectors A·k. Let P(0) = P and C(0) = Rm +. Define for i ∈ {1,... , n} n o C(i) = y ∈ Rm + | yT A ·k = 0 for all k = n − i + 1,... , n. n o P(i) = x̄ ∈ Rn−i | yT A(n−i) x̄ ≤ yT b for all y ∈ C(i). With this notation we can now elegantly describe the projection of a polyhedron. Theorem 5.6.6. projSn−i (P) = P(i). Proof. We first verify that projSn−i (P) ⊆ P(i). Indeed, take any x̄ ∈ projSn−i (P). By definition there exists a real vector z of dimension i such that (x̄, z) ∈ P. Hence, (x̄, z) satisfies the following inequalities n−i n ∑ A·k x̄k + ∑ A·k zk ≤ b. k=1 k=n−i+1 This implies that for all y ∈ C(i) we obtain that n−i n n−i ∑ yT A·k x̄k + ∑ yT A·k zk = ∑ yT A·k x̄k = yT A(n−i)x̄ ≤ yT b. k=1 k=n−i+1 k=1 Hence, x̄ ∈ P(i). For the converse direction, we apply induction. Let i = 1. We have n o n o C(1) = y ∈ Rm + | yT A ·n = 0 and P(1) = x̄ ∈ Rn−1 | yT (n−1) A x̄ ≤ yT b for all y ∈ C (1). Our task is to show that P(1) ⊆ projSn−1 (P). This follows from Theorem 5.6.3. To see this, let ei be the i-th. unit vector in Rm. Take as y the unit vector ek for k ∈ M0. Then ek ∈ C(1). Moreover, pick two indices l ∈ M− and i ∈ M+. Then 1 1 − el + ei ∈ C(1). aln ain This gives precisely the description of Q in Theorem 3. This argument can be adapted to show the inductive step. The details are left as an exercise. □ Challenge 19. Prove that P(i) ⊆ projSn−i (P). We are now in the position to prove the famous Farkas Lemma. 32 Theorem 5.6.7 (The Farkas Lemma). Let A ∈ Qm×n , b ∈ Qm. Either there exists a vector x ∈ Rn such that Ax ≤ b or there exists a vector y ∈ Rm such that y ≥ 0, yT A = 0 and yT b < 0. Proof. We refer to the notation introduced in Definition 5.6.5. C(n) = {y ∈ Rm T T + | y A· j = 0 for all j = 1,... , n} = {y ≥ 0 | y A = 0}. P(n) = {0 ≤ yT b for all y ∈ C(n) }. We conclude with Proposition 5.6.2 that P ̸= 0/ ⇐⇒ P(1) ̸= 0/ ⇐⇒... ⇐⇒ P(n) ̸= 0/ ⇐⇒ for all y ≥ 0 with yT A = 0 we have yT b ≥ 0. This leads to the following conclusion. Either P ̸= 0/ or P = 0. / This is equivalent to saying: either there exists a vector x ∈ R such that Ax ≤ b or there exists a vector y ∈ Rm such that y ≥ 0, n yT A = 0 and yT b < 0. □ The power of Theorem 5.6.7 is that it allows us to give a nice certificate of when a polyhedron is empty. Indeed, if P ̸= 0, / then we can simply find a point in P that is feasible and convince ourselves that P ̸= 0. / However, when P = 0, / then it is not clear how to convince someone that this is indeed the case. In the CS lenz we will see several important applications of Theorem 5.6.7. 6. T HE DETERMINANT We will now introduce the notion of determinant det(A) of a square matrix A. While this has a somewhat involved definition for n × n matrices, it is useful to first discuss what the determinant geometrically corresponds to, and to focus on small matrices. In a nutshell, the determinant of a matrix is a number that corresponds to how much the associated linear transformation inflates space, it corresponds precisely to the volume (or area, in R2 ) of the image of the unit cube (the red square in the pictures above in R2 ); with a negative sign when the orientation changes (in the pictures above in R2 , when the order of the colored dots, on the red square, changed). 
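A quick numerical check of this picture in the 2 × 2 case (Python with NumPy, used here only for illustration; the example matrix is arbitrary and not taken from the notes): the signed area of the parallelogram spanned by the columns of A is exactly ad − bc, and swapping the columns reverses the orientation and flips the sign.

```python
import numpy as np

# An arbitrary 2x2 example (not taken from the notes).
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

v1, v2 = A[:, 0], A[:, 1]          # images of the unit square's edge vectors e1, e2

# Signed area of the parallelogram spanned by v1 and v2: the formula ad - bc.
signed_area = v1[0] * v2[1] - v1[1] * v2[0]
print(signed_area)                 # 5.0
print(np.linalg.det(A))            # also 5.0, up to floating point error

# Swapping the two columns reverses the orientation and flips the sign.
print(np.linalg.det(A[:, ::-1]))   # -5.0
```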
If we think about the determinant this way, then many of the properties we will list below can be intuitively understood (while it is hard to do so from the formula for the n × n determinant). For this reason, this section will be somewhat less proof-based, and rather focus on the most relevant properties of the determinant. 33 F IGURE 3. Calculation in 3Blue1Brown’s video (see Remark 6.0.1) computing the determinant of a 2 × 2 matrix as the area of the image of the unit square after a linear transformation (that does not change orientation). Remark 6.0.1. Grant Sanderson has a website https://www.3blue1brown.com/ and Youtube channel https://www.youtube.com/3blue1brown with excellent animation- heavy explanations of topics in Mathematics, including Linear Algebra. I particularly recom- mend the video on Determinants, it has also 3 dimensional visualizations that are harder to do on a static medium. You can find it here https://youtu.be/Ip3X9LOh2dk or here https://www.3blue1brown.com/lessons/determinant. See also Figure 3. A calculation of the area of the image of the unit square by left-multiplication by a 2 × 2 matrix shows (see Figure 3) that " # a b a b := det = ad − bc. c d c d Before we actually formally define the determinant for general n × n matrices we will first focus on the special case of 2 × 2- matrices to derive several important properties of the determinant. 6.0.1. The 2 × 2 - case. 34 Let us first understand how the determinant changes when we multiply matrices. To this end, let " # " # a c x z A= and W =. b d y w Next we multiply these two matrices and obtain an explicit representation of the coefficients " # ax + cy az + cw AW =. bx + dy bz + dw Using this representation we obtain the following result. Lemma 6.0.2. Let A,W ∈ R2×2. Then det(AW ) = det(A) det(W ). Proof. det(AW ) = (ax + cy)(bz + dw) − (az + cw)(bx + dy) = axbz + axdw + cybz + cydw − azbx − azdy − cwbx − cwdy = axdw + cybz − azdy − cwbx = ad(xw − zy) + cb(zy − xw) = det(A) det(W ). □ This computation allows us to derive a characterization of when a 2 × 2-matrix is invertible. Lemma 6.0.3. A matrix A ∈ R2×2 is invertible if and only if det(A) ̸= 0. Proof. Let " # a c A=. b d If A is invertible, then A−1 exists and hence, AA−1 = I implies together with the previous lemma that det(A) det(A−1 ) = 1. Hence, det(A) ̸= 0. Conversely, if det(A) ̸= 0, then a ̸= 0 or b ̸= 0. Without loss of generality we can assume that a ̸= 0. Consider now the system of linear equations AW = I. ax + cy = 1 implies that x = 1−cy a az + cw = 0 implies that z = −cw a. 35 By substituting these expressions into the other two equations bx + dy = 0 and bz + dw = 1 we obtain b cyb −b − + dy = 0 ⇐⇒ b + y(ad − bc) = 0 ⇐⇒ y =. a a det(A) −bcw a + dw = 1 ⇐⇒ −bcw + adw = a ⇐⇒ w =. a det(A) This then gives us a formula for the parameters z and x in form of cd −c 1 + det(A) d z= and x = =. det(A) a det(A) These calculations show that A−1 exists whenever det(A) ̸= 0. □ Notice that our calculations give us an explicit formula for the inverse of matrix A and its deter- minant: " # 1 d −c (12) A−1 =. det(A) −b a 6.0.2. The n × n - case. It turns out that what we have verified in dimension two carries over to general dimensions. It is, however much more involved to verify it algebraically. We now give the definition of a determinant for n × n matrices. This requires us to discuss permutations. Definition 6.0.4 (Sign of a permutation). Given a permutation σ : {1,... , n} → {1,... , n} of n elements, its sign sgn(σ ) can be 1 or −1. 
The sign counts the parity of the number of pairs of elements that are out of order (sometimes called inversions) after applying the permutation. In other words,   1 if |(i, j) ∈ {1,... , n} × {1,... , n} such that i < j and σ (i) > σ ( j)| is even, sgn(σ ) =  −1 if |(i, j) ∈ {1,... , n} × {1,... , n} such that i < j and σ (i) > σ ( j)| is odd. Example 6.0.5. Let n = 4. Consider the permutation π defined as π(1) = 1, π(2) = 3, π(3) = 2, π(4) = 4. The pairs (i, j) such that i < j are (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4). For all these listed pairs (i, j) we have that π(i) < π( j) except for the pair (2, 3). Hence, sgn(π) = −1. 36 Exploratory Challenge 20. The sign of a permutation has many nice properties. Try to prove a couple of them: (1) The sign of a permutation is multiplicative, i.e.: for two permutations σ , γ we have that sgn(σ ◦ γ) = sgn(σ )sgn(γ). (2) For all n ≥ 2, exactly half of the permutations have sign 1 and exactly half have sign −1. The identity has a sign of 1, the sign of a transposition (a permutation that only swaps two elements) is −1 and for two permutations σ , γ we have that sgn(σ ◦ γ) = sgn(σ )sgn(γ). We are now in position to introduce the general notion of a determinant of a square matrix. Definition 6.0.6. Given a square matrix A ∈ Rn×n the determinant det(A) is defined as n det(A) = ∑ sgn(σ ) ∏ Ai,σ (i) , σ ∈Πn i=1 where Πn is the set of all permutations of n elements. If A is a 1 × 1 matrix, since there is only one permutation of 1 element (the permutation σ (1) = 1, which has sign 1). It follows that det(A) = A. For 2 × 2 matrices we observe that here are two permutations. Let us call σ1 the identity permu- tation (that doesn’t move any element, which has sign 1) and σ2 the permutation that swaps the two elements (which has sign −1). Hence, for a 2 × 2 matrix A with entries Ai j we have 2 2 2 det(A) = ∑ sgn(σ ) ∏ Ai,σ (i) = (+1) ∏ Ai,σ1 (i) + (−1) ∏ Ai,σ2 (i) = A11 A22 − A12 A21. σ ∈Π2 i=1 i=1 i=1 This corresponds precisely to. a b = ad − bc. c d Definition 6.0.6 allows us to derive a few results. Proposition 6.0.7. Given a permutation matrix P ∈ Rn×n corresponding to a permutation σ , then det(P) = sgn(σ ). We sometimes also write sgn(P). Proposition 6.0.8. Given a triangular (either upper- or lower-) matrix T ∈ Rn×n we have n det(T ) = ∏ Tkk , k=1 37 in particular, det(I) = 1. Theorem 6.0.9. Given a matrix A ∈ Rn×n we have det(A⊤ ) = det(A). Proof. For a permutation σ let σ −1 denote the inverse permutation, i.e., σ (i) = j ⇐⇒ σ −1 ( j) = i for all i, j. From Challenge 20 it follows that sgn(σ ) = sgn(σ −1 ). The conclusion det(A⊤ ) = det(A) follows from observing n n n ∑ sgn(σ ) ∏ Ai,σ (i) = ∑ sgn(σ −1 ) ∏ Aσ −1 (i),i = ∑ sgn(σ ) ∏ Aσ (i),i. σ ∈Πn i=1 σ −1 ∈Πn i=1 σ ∈Πn i=1 □ The following is a consequence of the propositions above. Proposition 6.0.10. If Q ∈ Rn×n is an orthogonal matrix then det(Q) = 1 or det(Q) = −1. Proof. By Propositions 6.0.8 and 6.0.12 we have 1 = det(I) = det(Q⊤ Q) = det(Q⊤ ) det(Q). by Proposition 6.0.9 we have 1 = det(Q)2 and so det(Q) is 1 or -1. 2 Proposition 6.0.11. A matrix A ∈ Rn×n is invertible if and only if det(A) ̸= 0. We can also multiply matrices as we did in the 2 dimensional case and see the effect on their determinants. 38 Proposition 6.0.12. Given matrices A, B ∈ Rn×n we have det(AB) = det(A) det(B). Following the same line of argument we also have Proposition 6.0.13. Given a matrix A ∈ Rn×n such that det(A) ̸= 0, then A is invertible and 1 det(A−1 ) =. det(A) Example 6.0.14. 
Example 6.0.14. For 3 × 3 matrices there are 3! = 6 permutations, so there will be 6 terms. For A a 3 × 3 matrix, we can write its determinant as (where an empty entry corresponds to a zero entry)
\[
\det(A) = \begin{vmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{vmatrix}
= \begin{vmatrix} A_{11} & & \\ & A_{22} & \\ & & A_{33} \end{vmatrix}
+ \begin{vmatrix} & A_{12} & \\ A_{21} & & \\ & & A_{33} \end{vmatrix}
+ \begin{vmatrix} & A_{12} & \\ & & A_{23} \\ A_{31} & & \end{vmatrix}
+ \begin{vmatrix} & & A_{13} \\ & A_{22} & \\ A_{31} & & \end{vmatrix}
+ \begin{vmatrix} & & A_{13} \\ A_{21} & & \\ & A_{32} & \end{vmatrix}
+ \begin{vmatrix} A_{11} & & \\ & & A_{23} \\ & A_{32} & \end{vmatrix}
\]
\[
= A_{11}A_{22}A_{33} - A_{12}A_{21}A_{33} + A_{12}A_{23}A_{31} - A_{13}A_{22}A_{31} + A_{13}A_{21}A_{32} - A_{11}A_{23}A_{32}.
\]
There is another convenient way of writing this determinant:
\[
(13) \qquad \begin{vmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{vmatrix}
= A_{11} \begin{vmatrix} A_{22} & A_{23} \\ A_{32} & A_{33} \end{vmatrix}
- A_{12} \begin{vmatrix} A_{21} & A_{23} \\ A_{31} & A_{33} \end{vmatrix}
+ A_{13} \begin{vmatrix} A_{21} & A_{22} \\ A_{31} & A_{32} \end{vmatrix}.
\]
In general, the signed determinants multiplying the entries A_{1j} in (13) are called the co-factors of A.

Definition 6.0.15. Given A ∈ R^{n×n}, for each 1 ≤ i, j ≤ n let A^{ij} denote the (n − 1) × (n − 1) matrix obtained by removing row i and column j from A. Then we define the co-factors of A as
\[
C_{ij} = (-1)^{i+j} \det\!\left(A^{ij}\right).
\]

Just as in (13), the determinant can be written in terms of the co-factors.

Proposition 6.0.16. Let A ∈ R^{n×n}. For any 1 ≤ i ≤ n,
\[
\det(A) = \sum_{j=1}^{n} A_{ij} C_{ij}.
\]

The formula we derived above for the inverse of 2 × 2 matrices (Equation (12)) also has an analogue in n dimensions.

Proposition 6.0.17. Given A ∈ R^{n×n} with det(A) ≠ 0 we have
\[
A^{-1} = \frac{1}{\det(A)} C^{\top},
\]
where C is the n × n matrix with the co-factors of A as entries.

The formula in Proposition 6.0.17 can be rewritten as AC⊤ = det(A) I.

Remark 6.0.18. Computationally speaking, this is not a good way to compute the inverse, as it involves computing many determinants.

Challenge 21. Verify that Proposition 6.0.17 indeed corresponds to the formula we derived for A^{-1} when n = 2.

Exploratory Challenge 22. Try to prove Proposition 6.0.17 by showing that AC⊤ = det(A) I. Perhaps start with n = 3. You can also use Cramer's Rule (below) to prove this.

6.0.3. Cramer's Rule. The determinant also allows us to write a formula for the solution of a linear system of the type Ax = b when A ∈ R^{n×n} and det(A) ≠ 0. The idea is simple; we will illustrate it here for n = 3. If
\[
\begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}
= \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix},
\]
then we have
\[
\begin{bmatrix} A_{11} & A_{12} & A_{13} \\ A_{21} & A_{22} & A_{23} \\ A_{31} & A_{32} & A_{33} \end{bmatrix}
\begin{bmatrix} x_1 & 0 & 0 \\ x_2 & 1 & 0 \\ x_3 & 0 & 1 \end{bmatrix}
= \begin{bmatrix} b_1 & A_{12} & A_{13} \\ b_2 & A_{22} & A_{23} \\ b_3 & A_{32} & A_{33} \end{bmatrix}.
\]
Since the determinant is multiplicative, and the determinant of the second matrix in the expression is x_1, we have det(A) x_1 = det(B_1), where B_1 is the matrix obtained from A by replacing the first column of A with the vector b. Since we can do this for any of the columns, we have x_j = det(B_j)/det(A). In general:

Proposition 6.0.19 (Cramer's Rule). Let A ∈ R^{n×n} be such that det(A) ≠ 0 and let b ∈ R^n. Then the solution x ∈ R^n of Ax = b is given by
\[
x_j = \frac{\det(B_j)}{\det(A)},
\]
where B_j is the matrix obtained from A by replacing the j-th column of A with the vector b.

Remark 6.0.20. As with the formula for the inverse: computationally speaking, this is not a good way to solve linear systems, as it involves computing many determinants.
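Despite Remark 6.0.20, Cramer's rule is easy to state as code, and for tiny systems it is a nice check against a standard solver. The following Python sketch is not part of the original notes; it assumes NumPy is available, and the matrix A and vector b below are arbitrary illustrative values.

    import numpy as np

    def cramer_solve(A, b):
        # Solve Ax = b via Cramer's rule (Proposition 6.0.19); fine for tiny n, not for real use.
        n = A.shape[0]
        det_A = np.linalg.det(A)
        if np.isclose(det_A, 0.0):
            raise ValueError("det(A) = 0: Cramer's rule does not apply")
        x = np.empty(n)
        for j in range(n):
            B_j = A.copy()
            B_j[:, j] = b                      # replace the j-th column of A with b
            x[j] = np.linalg.det(B_j) / det_A
        return x

    A = np.array([[2.0, 1.0, 0.0],
                  [1.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])
    b = np.array([1.0, 2.0, 3.0])
    print(cramer_solve(A, b))
    print(np.linalg.solve(A, b))   # should agree up to floating-point error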
6.0.4. Several further comments on the determinant. The definition we used for the determinant of a square matrix involves a formula with n! terms. This formula is computationally infeasible even for moderate values of n (n! grows faster than exponentially; for example, 100! has almost 160 digits!). In practice, the determinant of a matrix A is computed via Gaussian Elimination and the matrix decomposition PA = LU (P is a permutation matrix, so det(P) = sgn(P); U is upper triangular; and L is lower triangular with only 1s on the diagonal, so det(L) = 1). This gives us the formula
\[
(14) \qquad \det(A) = \frac{1}{\det(P)} \det(L) \det(U) = \operatorname{sgn}(P) \det(U),
\]
and since U is a triangular matrix its determinant can be easily computed by Proposition 6.0.8.

Alternatively, one can also think of Gaussian Elimination as directly computing the determinant via the following two propositions.

Proposition 6.0.21. If A is an n × n matrix and P is a permutation that swaps two elements, meaning that PA corresponds to swapping two rows of A, then det(PA) = − det(A).

Proposition 6.0.22. The determinant is linear in each row (or each column). In other words, for any a_0, a_1, a_2, ..., a_n ∈ R^n and α_0, α_1 ∈ R we have
\[
\det \begin{bmatrix} \alpha_0 a_0^{\top} + \alpha_1 a_1^{\top} \\ a_2^{\top} \\ \vdots \\ a_n^{\top} \end{bmatrix}
= \alpha_0 \det \begin{bmatrix} a_0^{\top} \\ a_2^{\top} \\ \vdots \\ a_n^{\top} \end{bmatrix}
+ \alpha_1 \det \begin{bmatrix} a_1^{\top} \\ a_2^{\top} \\ \vdots \\ a_n^{\top} \end{bmatrix},
\]
and
\[
\det \begin{bmatrix} \alpha_0 a_0 + \alpha_1 a_1 & a_2 & \cdots & a_n \end{bmatrix}
= \alpha_0 \det \begin{bmatrix} a_0 & a_2 & \cdots & a_n \end{bmatrix}
+ \alpha_1 \det \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix}.
\]

Exploratory Challenge 23. The more mathematical way of presenting this material is to define a determinant as a function from n × n matrices to R with the following properties: (1) it is linear in each column, (2) det(I) = 1, and (3) det(A) = 0 whenever A has two identical columns. It is then possible to prove that the only function satisfying these three properties is the determinant as we defined it.
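To make the remarks above concrete, here is a short Python sketch (not part of the original notes, and assuming NumPy is available) that computes the determinant the way it is done in practice: it performs Gaussian elimination with row swaps, flips the sign for every swap (Proposition 6.0.21), and multiplies the pivots of the resulting triangular matrix (Proposition 6.0.8), in the spirit of formula (14).

    import numpy as np

    def det_by_elimination(A):
        # Determinant via Gaussian elimination with partial pivoting.
        U = A.astype(float)
        n = U.shape[0]
        swap_sign = 1.0
        for k in range(n):
            # Pivoting: bring the largest entry (in absolute value) of column k to the diagonal.
            p = k + np.argmax(np.abs(U[k:, k]))
            if np.isclose(U[p, k], 0.0):
                return 0.0                       # no nonzero pivot: the matrix is singular
            if p != k:
                U[[k, p]] = U[[p, k]]            # a row swap flips the sign (Proposition 6.0.21)
                swap_sign = -swap_sign
            for i in range(k + 1, n):
                U[i, k:] -= (U[i, k] / U[k, k]) * U[k, k:]
        # U is now upper triangular, so its determinant is the product of its diagonal (Prop. 6.0.8).
        return swap_sign * np.prod(np.diag(U))

    A = np.array([[0.0, 2.0, 1.0],
                  [1.0, 1.0, 0.0],
                  [3.0, 0.0, 2.0]])
    print(det_by_elimination(A))   # -7.0
    print(np.linalg.det(A))        # ~-7.0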
7. EIGENVALUES AND EIGENVECTORS

We are (almost) ready for one of the most important concepts (if not the most important one) in Linear Algebra: eigenvalues and eigenvectors. In a sense, it has all been building up to this!

Guiding Strategy 24. Given a square matrix A, as we will see below, an eigenvalue λ and an eigenvector v will be, respectively, a scalar and a non-zero vector satisfying Av = λv. This means that (A − λI)v = 0, so (A − λI) is not invertible, or equivalently det(A − λI) = 0. We can look for eigenvalues as solutions of det(A − λI) = 0, which is a polynomial^8 in λ; unfortunately, not all polynomials have real zeros.^9 For example, if
\[
A = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix},
\]
then det(A − λI) = 0 corresponds to λ² + 1 = 0, which only has solutions in C, the Complex Numbers. For this reason we will start this Chapter with a brief introduction to Complex Numbers. It all starts with asking for a number λ such that λ² + 1 = 0.

8 This is one of the main reasons we had to cover determinants.
9 A zero of a polynomial P is a point x such that P(x) = 0; this is also called a root of the polynomial. In German, it is a "Nullstelle". In fact, a (rather deep) multidimensional version of Theorem 7.0.3, and one of the most important facts in Algebraic Geometry, is called "Hilbert's Nullstellensatz".

Further Reading 25. Complex Analysis is a beautiful topic in Mathematics; what we will cover here is just a tiny peek at it, and there is a whole bookshelf of excellent books on this topic in our library. I have personally taught a course at ETH on Complex Analysis, and since it was during the COVID pandemic I made videos available online, which are still available at https://www.youtube.com/playlist?list=PLiud-28tsatLRRGqO_Eg_x0S4LVyxuV5p (in particular, the first lecture covers roughly the content here).

7.0. Complex Numbers. If we start with the natural numbers N and want to solve equations like x + 10 = 1, we need negative numbers. This motivates considering the integers Z. Similarly, rational numbers Q are needed to solve equations like 10x = 1, and real numbers R are needed to solve x² = 2.^10 Similarly, the Complex Numbers are needed to solve equations such as x² + 1 = 0. It starts with the introduction of an imaginary number i ∈ C such that i² = −1. You can think of i as √−1. The complex numbers are numbers of the form z = a + ib for a ∈ R and b ∈ R:
\[
\mathbb{C} = \{a + ib : a, b \in \mathbb{R}\}.
\]

10 If you have never seen the proof that there exists no x ∈ Q such that x² = 2, I highly recommend trying to do it: set x = a/b for a, b ∈ Z and try to count how many times 2 divides both a and b to find a contradiction.

Keeping in mind that i² = −1, we can do operations with complex numbers:
\[
(a + ib) + (x + iy) = (a + x) + i(b + y),
\]
\[
(a + ib)(x + iy) = ax + i(ay + bx) + i^2 by = ax + i(ay + bx) - by = (ax - by) + i(ay + bx),
\]
\[
(a + ib)(a - ib) = a^2 + b^2,
\]
\[
\frac{a + ib}{x + iy} = \frac{(x - iy)(a + ib)}{(x - iy)(x + iy)} = \frac{(ax + by) + i(bx - ay)}{x^2 + y^2} = \frac{ax + by}{x^2 + y^2} + i\,\frac{bx - ay}{x^2 + y^2}.
\]

Given z ∈ C with z = a + ib we have the following notation:
(15) ℜ(a + ib) := a, called the real part of z = a + ib,
(16) ℑ(a + ib) := b, called the imaginary part of z = a + ib,
(17) |z| := √(a² + b²), called the modulus of z = a + ib,
(18) \overline{a + ib} := a − ib, called the complex conjugate of z = a + ib.

Note that for z_1, z_2 ∈ C, we have |z|² = z\bar{z}, \overline{z_1 z_2} = \bar{z}_1 \bar{z}_2, \overline{z_1 + z_2} = \bar{z}_1 + \bar{z}_2, and 1/z = \bar{z}/|z|².

Fact 7.0.1 (Euler's Formula). Given θ ∈ R, we have
\[
(19) \qquad e^{i\theta} = \cos\theta + i\sin\theta.
\]
This means, in particular, that e^{iπ} = −1. This is usually written as e^{iπ} + 1 = 0.

Further Reading 26. In order to prove Euler's Formula, we need to first define what we mean by e^{iθ}; this can be done, for example, via the Taylor series of the exponential, but it is outside the scope of this course (see Further Reading 25).

Fact 7.0.2 (Polar Coordinates). A complex number z ∈ C can be written as
\[
(20) \qquad z = re^{i\theta},
\]
where r ≥ 0 is the modulus of z and θ ∈ R (we can restrict to θ ∈ [0, 2π[) is an angle, also called the argument of z.

Figure 4. A complex number z = 4 + 3i in the Complex plane.

The most important property of Complex Numbers, and what makes them a very natural mathematical object, is that any univariate polynomial equation with complex coefficients has a (complex) solution; in a certain sense we do not need to extend the numbers any further: C is algebraically closed.

Theorem 7.0.3 (Fundamental Theorem of Algebra). Any non-constant polynomial of degree n ≥ 1, P(z) = α_n z^n + α_{n−1} z^{n−1} + · · · + α_1 z + α_0 (with α_n ≠ 0), has a zero: there exists λ ∈ C such that P(λ) = 0.

Further Reading 27. As the name suggests, Theorem 7.0.3 is a central result in Complex Analysis. Proving it is outside the scope of this course.^11 Complex Analysis (which leads to the proof of this theorem) is a beautiful example of the interaction between analysis, algebra, and geometry. In a nutshell, the idea of the classical proof is that differentiable functions in the complex plane f : C → C are very special and, in a sense, need to behave like polynomials (this is a deep statement that needs a significant amount of background to properly state and prove). If a polynomial P(z) does not have a zero, then 1/P(z) is a differentiable function that cannot behave like a non-constant polynomial, because it does not grow sufficiently far away from zero; hence it must be a constant function, which means that P(z) had to be constant. So any non-constant polynomial has a zero. For more on Complex Analysis see Further Reading 25.
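Before moving on, here is a quick numerical illustration of the complex-number operations, Euler's formula, and polar coordinates above. It is a Python sketch using only the standard library (the cmath module), not part of the original notes, and the particular numbers are arbitrary.

    import cmath

    z1 = 4 + 3j                                    # the number shown in Figure 4

    # Modulus and conjugate: |z|^2 = z * conj(z)
    print(abs(z1))                                 # 5.0
    print(z1 * z1.conjugate())                     # (25+0j)

    # 1/z = conj(z) / |z|^2
    print(1 / z1, z1.conjugate() / abs(z1) ** 2)   # both (0.16-0.12j)

    # Euler's formula (19): e^{i*pi} + 1 = 0 (up to floating-point error)
    print(cmath.exp(1j * cmath.pi) + 1)

    # Polar coordinates (20): z = r e^{i*theta}
    r, theta = cmath.polar(z1)
    print(r, theta)                                # r = |z| = 5.0, theta = argument of z
    print(cmath.rect(r, theta))                    # back to approximately (4+3j)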
Further Remark 28. Once we have a zero λ of P(z), we can divide P(z) by (z − λ) to get P(z) = (z − λ)P_1(z), then use a zero of P_1 to reiterate, and so on. This argument (carried out carefully) gives the following corollary.

Corollary 7.0.4. Any non-constant polynomial of degree n ≥ 1, P(z) = α_n z^n + α_{n−1} z^{n−1} + · · · + α_1 z + α_0 (with α_n ≠ 0), has n zeros λ_1, ..., λ_n ∈ C, perhaps with repetitions, such that
\[
(21) \qquad P(z) = \alpha_n (z - \lambda_1)(z - \lambda_2) \cdots (z - \lambda_n).
\]
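The factorization in Corollary 7.0.4 is easy to explore numerically. The following Python sketch is not part of the original notes and assumes NumPy is available; numpy.roots finds the n (possibly complex) zeros of a polynomial given by its coefficients, and numpy.poly multiplies the factors (z − λ_j) back together.

    import numpy as np

    # P(z) = z^2 + 1 has no real zeros, but it has the two complex zeros i and -i.
    print(np.roots([1.0, 0.0, 1.0]))        # approximately [ 1j, -1j ]

    # P(z) = z^4 - 1 has the 4 zeros 1, -1, i, -i and factors as (z-1)(z+1)(z-i)(z+i).
    coeffs = [1.0, 0.0, 0.0, 0.0, -1.0]
    lambdas = np.roots(coeffs)
    print(lambdas)

    # Multiplying the factors (z - lambda_j) back together recovers the coefficients,
    # as in (21) with alpha_n = 1 (up to floating-point error).
    print(np.poly(lambdas))                 # approximately [1, 0, 0, 0, -1]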
