Algorithms, Data, and Security Lecture Notes PDF

Summary

This document is a set of lecture notes on Algorithms, Data, and Security. It covers fundamental concepts of algorithms, focusing on examples such as the sieve of Eratosthenes, and explores the concept of Big Data with discussions of its volume, variety, and velocity. It also introduces data structures and their manipulation in a programming context. The document provides an overview of core computer science topics and is likely intended for university-level courses.

Full Transcript


A.A. 2023/2024 Algorithms, Data, and Security V. Cardellini-Cioffi

Lecture 1

The three course pillars: Algorithms, Data, and Security. Why is the IT sector so important, and why is it possible?

Computer power refers to the capacity and capability of a computer system to perform various tasks and processes efficiently and effectively. Governing this pillar is the so-called Moore's law, which says that the number of transistors on a computer chip doubles approximately every 18 months. From the graph we can see that the number kept increasing, with the maximum reached in 2012. Since 2015-2017 the law has shown some signs of slowing down, but a new concept arose: multicore, meaning that a single computer has multiple cores. With time passing, computers generally get smaller, cheaper, faster, and more power efficient.

Network bandwidth refers to the maximum amount of data that can be transmitted over a network connection in a given amount of time, typically measured in bits per second (bps), kilobits per second (kbps), megabits per second (Mbps), or gigabits per second (Gbps). In this framework we have Nielsen's law, which states that users' bandwidth (the amount of data we can send over a network) grows by 50% per year. This means that after two years we can send roughly double the data in the same time (1.5^2 = 2.25). With 5G we have super-fast, low-latency internet everywhere. This type of connection is wireless and faster than a typical home Internet connection: it tops out at 10 gigabits per second (Gbps), about 100x the theoretical maximum speed of current 4G.

Connectivity value refers to the perceived or actual benefits and advantages derived from being connected to a network. For this pillar we have Metcalfe's law, which says that the value of a telecommunications network is proportional to the square of the number of connected users of the system.
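These growth laws are simple formulas; a quick sketch using the figures quoted above (the constant k in Metcalfe's law is an arbitrary illustrative factor, not something from the lecture):

```python
# Nielsen's law: users' bandwidth grows by 50% per year, so after
# two years it is 1.5^2 = 2.25 times larger, i.e. roughly double.
growth_after_two_years = 1.5 ** 2

# Metcalfe's law: the value of a network is proportional to the
# square of the number of connected users (k is an arbitrary constant).
def metcalfe_value(users, k=1.0):
    return k * users ** 2
```

For example, growing a network from 100 to 1,000 users (10x) multiplies its Metcalfe value by 100x.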
The larger the number of users in the system, the greater the value of the network.

What is an algorithm? The informal definition says that an algorithm is a sequence of instructions, given step by step, that can be executed "easily" so that they produce a desired result (output). The instructions that lead to the solution of a problem are executed in a specific order to produce the desired output or result. We can think of an algorithm as a recipe for making a cake: by following the instructions we produce the output, which is the cake. An everyday example of an algorithm: if you have questions about this class, please check the course website (this solves 90% of cases); if you do not find your answer on the website, ask the instructor during class.

A classical example is the sieve of Eratosthenes (275-195 BC), an ancient algorithm for finding all prime numbers up to a specified integer n; said differently, the output is all primes ≤ n. For example, if we want to find the prime numbers less than 100, the algorithm proceeds with the following steps:
1. Prepare a list of natural numbers: 2, 3, …, n.
2. Start from k = 2.
3. Mark all multiples of k between k^2 and n (they cannot be prime!).
4. Replace k with the smallest unmarked number > k.
5. If k^2 ≤ n, go back to step 3.
6. Output all unmarked numbers.
We have a control flow of instructions, since step 5 says to go back to step 3 if the condition still holds; when the condition is no longer satisfied, we do not go back but move on to the next and last step. Other examples are counting the number of objects that have distinct values (e.g., a, b, b, c, g, d, d, g), i.e.
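The six steps above translate directly into Python; a minimal sketch (not code from the lecture itself):

```python
# Sieve of Eratosthenes: output all primes <= n.
def sieve(n):
    marked = [False] * (n + 1)        # step 1: the list 2, 3, ..., n
    k = 2                             # step 2: start from k = 2
    while k * k <= n:                 # step 5: repeat while k^2 <= n
        for m in range(k * k, n + 1, k):
            marked[m] = True          # step 3: mark multiples of k
        k += 1                        # step 4: smallest unmarked number > k
        while k <= n and marked[k]:
            k += 1
    return [i for i in range(2, n + 1) if not marked[i]]  # step 6
```

For n = 100 this returns the 25 primes below 100.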
we want to find how many b's there are; or, more generally, listing objects and sorting them; searching, such as a search on Google; and finding the shortest path with Google Maps, using the shortest-path algorithm invented by Dijkstra. Finally, another example is AlphaGo. Go is a sophisticated, extremely complex board game, and AlphaGo is the computer program that plays Go and beat the world's top Go player. To do so, AlphaGo uses an algorithm that finds moves based on knowledge previously "learned" by machine learning, specifically by an artificial neural network.

An artificial neural network is a computing system inspired by the biological neural networks that constitute animal brains. Its goal is to "learn" to perform tasks by considering examples, generally without being programmed with task-specific rules. For example, consider the problem of classifying images: the neural network learns to identify images that contain (or are tagged with) cats or dogs by analyzing a set of images labeled as "cat" or "dog" and using the results to identify cats or dogs in new images. This means we need to provide a data set for this purpose, so that the system can learn. In the picture we can see the scheme of a simple neural network, in which we have 3 connected layers (input, hidden, and output). Each connection between neurons in adjacent layers is associated with a weight; these weights determine the strength of the connection. As we increase the number of layers, we call the network a deep-learning neural network. A deep network such as ChatGPT can have hundreds of billions of nodes, which means you need impressive computing power to make these links and a lot of space to store the information.
Transformer models, which are based on complex neural networks, are also called foundation models, because the sheer scale and scope of foundation models over the last few years have stretched our imagination of what is possible. We can generate new images and texts. Indeed, on the right we have the inputs: to teach the foundation model to, for example, write text, we have to train the neural network on text data.

Big Data

Big data refers to extremely large, complex, and diverse datasets that cannot be easily managed, processed, or analyzed using traditional data processing tools or methods. Every day we create a lot of data; for example, every minute of the day 5.9 million searches are made on Google. This is also driven by the fact that 66% of the world's population has an internet connection.
- The basic unit of data is the bit (binary digit), the smallest unit of data in computing and digital communications. It can have one of two values: 0 or 1. The reason why there are only two levels is that they can be understood by a machine, in which 0 is false and 1 is true.
- Eight bits form the so-called byte, which is used to represent characters, numbers, and other symbols in computer systems. Each byte can represent 256 (2^8) different values, ranging from 0 to 255.
- 1,024 bytes are equal to a kilobyte, used to measure small to moderate amounts of data, such as text files, images, or short documents.
- 1,024 kilobytes are equal to a megabyte, used to quantify the size of larger files, such as music files, photos, or small videos.
- 1,024 megabytes are equal to a gigabyte, used to measure the capacity of storage devices like hard drives, SSDs, and memory cards.
- 1,024 gigabytes are equal to a terabyte, used to measure large datasets.
Generally speaking, each unit is 1,024 times larger than the previous one.
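The unit relationships above can be checked with a few lines (a generic sketch, nothing lecture-specific):

```python
# Each data-size unit is 1,024 = 2^10 times the previous one.
BYTE = 1
KILOBYTE = 1024 * BYTE
MEGABYTE = 1024 * KILOBYTE
GIGABYTE = 1024 * MEGABYTE
TERABYTE = 1024 * GIGABYTE

byte_values = 2 ** 8      # a byte can represent 256 values (0..255)
```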
We have bigger measures still, such as the zettabyte. One zettabyte is equal to 2^70 B = (2^10)^7 B ≈ (10^3)^7 B = 10^21 B. The zettabyte is one of the largest units of measurement used to quantify data storage capacity in computing and digital communications. All this data is stored on computers; however, the trend is changing, since we store less data on our own computers and more on public clouds, such as Google and Amazon. We also have to consider that big data is growing fast: think about the IoT, social networks, and smartphones.

Types of data:
Unstructured data refers to information that doesn't have a predefined data model or organization. This type of data typically includes text documents, images, audio files, video files, social media posts, emails, and other types of human-generated content. More than 80% of data is unstructured.
Structured data refers to information that has a predefined organization, format, and schema. This type of data is typically stored in databases or structured file formats and is easily searchable and analyzable.

Some definitions of big data:
- Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.
- Big data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software.
- Big data is mostly about taking numbers and using those numbers to make predictions. The bigger the data set you have, the more accurate the predictions will be.

The most famous definition is the so-called 3V model, defined in 2001 by D. Laney.
1. Volume: data size challenging to store and process (how to index and retrieve).
2.
Variety: data heterogeneity, because of different data types (text, audio, video, records) and degrees of structure (structured, semi-structured, unstructured data).
3. Velocity: data generation rate and analysis rate.
Then we also have the extended (3+n) V model:
4. Value: big data can generate huge competitive advantages. "Big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis." (IDC, 2011) "The bigger the data set you have, the more accurate the predictions will be." (A. Goldbloom)
5. Veracity: uncertainty about the accuracy and authenticity of data.
6. Variability: data flows can be highly inconsistent, with periodic peaks.
7. Visualization: presentation of data in a pictorial and graphical format, because our brain processes images 60,000 times faster than text, e.g., a visualization of the number of flights in the world.

The downside of having this amount of data is the amount of energy we consume to store it. To have a clear idea: in 2022, the data centers that power all computers used about 1% to 1.3% of the world's electricity. Looking ahead, by 2027 AI servers could use between 85 and 134 terawatt-hours annually.

Data structures are structures to organize data. An example is a stack of data items, which has 3 basic operations:
- Insert ("push")
- Delete ("pop")
- Display the content of the top item ("peek")

Lecture 2

Fibonacci is a sequence of numbers that we find in nature, art, and graphics. We can find the Fibonacci sequence in the number of petals of a flower, or in paintings such as the proportions of the Gioconda. In the picture, we can see that in the inner part of the spiral we have a square of side one by one, then a square two by two, and so on up to 21 by 21.
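The stack described earlier, with its three basic operations, can be sketched as follows (a plain Python list provides the storage; this is an illustrative sketch, not lecture code):

```python
# A stack with the three basic operations from the notes:
# push (insert), pop (delete), and peek (display the top item).
class Stack:
    def __init__(self):
        self._items = []

    def push(self, item):      # insert an item on top
        self._items.append(item)

    def pop(self):             # delete and return the top item
        return self._items.pop()

    def peek(self):            # look at the top item without removing it
        return self._items[-1]
```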
These numbers are related to the Fibonacci sequence: each number is given by the sum of the preceding two numbers. But why do we see these numbers in nature? Because the goal of nature is to put the maximum number of things in the smallest space. Therefore, the Fibonacci numbers are the whole numbers that express the golden ratio, which corresponds to the angle that maximizes the number of items in the smallest space.

Who was Fibonacci? Leonardo of Pisa (aka Fibonacci), who lived around 1175-1250. He wrote Liber Abaci (1202), one of the first books to introduce the Hindu-Arabic decimal number system into Europe; on his travels he had seen the advantage of Hindu-Arabic numbers compared to Roman numerals. He was interested in many problems, including the rabbit problem. Indeed, Fibonacci was interested in the following population dynamics problem: how fast does a population of rabbits expand (subject to certain conditions)? In particular, if one starts from one rabbit pair on a deserted island, how many rabbit pairs will there be at year n? This scenario is quite improbable, but there is a real counterpart in Japan, where an island is populated only by rabbits and the numbers correspond to the Fibonacci sequence.

Assumptions of the rabbit problem: a rabbit pair gives birth to two little rabbits (one male, one female) per year; rabbits start reproducing two years after birth; rabbits do not die (this is unrealistic but helps us make the problem simpler). Now we assume that we are on the rabbits' island, and we have the following:
- Fn: number of rabbit pairs at year n
- F1 = 1 (only one pair)
- F2 = 1 (too young to reproduce; they start reproducing from the second year)
- F3 = 2 (the first pair of little rabbits, so we have two pairs: the first one and the little rabbits born from the first couple)
- F4 = 3 (second pair of little rabbits)
- F5 = 5 (first pair of grandchildren)
Each value in the sequence is given by the sum of the two preceding values. For example, the 21 pairs that we expect in the 8th year are given by 13 plus 8. Therefore, at year n there will be all pairs from the year before (year n-1), plus one new pair for each pair from two years before (year n-2). This yields the following recurrence (an equation that expresses each element of a sequence as a function of the preceding ones):

Fn = 1 if n = 1 or n = 2; Fn = Fn-1 + Fn-2 if n ≥ 3.

According to the Fibonacci recurrence, if n equals 1 or 2 the result is 1, while from year 3 on we have the sum of the two preceding Fibonacci numbers. Our (algorithmic) problem: how to compute Fn? There is a closed formula for the Fibonacci recurrence, which is the following:

Fn = (φ^n - ψ^n) / √5, with φ = (1 + √5)/2 ≈ 1.618 and ψ = (1 - √5)/2 ≈ -0.618.

Here φ is the GOLDEN RATIO, a mathematical concept that describes a special ratio found in various natural and artistic phenomena. The proportion of the Fibonacci sequence says that a+b is to a as a is to b.

Finding an algorithm for Fibonacci: to define the algorithm we will use pseudocode, which is an informal high-level description of the operating principle of an algorithm, or in other words a way to write the algorithm that can be easily translated into any specific programming language. In the picture we see the keyword Algorithm followed by the name, fibonacci1, then the input (the number n that the user provides) and the output. For the latter, we also state the data type of the output: in this case we ask our algorithm to return an integer number. So, in the first line we ask for n and declare an integer result; the second line starts with return, which is the value we return as output, and it corresponds to the formula above.
Let us ask if this first version is correct. To answer, we must consider the accuracy needed for φ and √5 to get a correct output. Computers do not have an infinite number of cells to store numbers, so when we consider a number with infinitely many digits, such as an irrational number, the computer only has a fixed number of cells. This is the reason why we must introduce some approximation to store our data. We are going to approximate the two values, as the picture shows. We observe that the rounded computation gives 2583; without the approximation, the exact Fibonacci number is 2584. So there is a difference of one pair, which means that once we introduce this approximation we no longer compute the exact Fibonacci number as n increases.

We must consider a second version of the algorithm, so that we always get a correct answer, since fibonacci1 does not always provide one. We want to put the mathematical form into the algorithm, using the recursive definition: if n equals 1 or 2, the result is 1; otherwise, the result is the sum of the calls with inputs n-1 and n-2, the two preceding numbers of the sequence. Comparing with fibonacci1, we notice that the first line is the same except for the name: it takes an integer n as input and returns an integer as output. The if ... then ... else construct is a branch, a conditional statement: we give a condition to the computer, and if the condition is true we follow the instruction written after "then"; if not, the algorithm follows the instruction written after "else". Is this solution efficient? No, because the algorithm is slow.
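The two versions can be sketched in Python; the rounded constants 1.618 and 0.4472 stand in for the approximated φ and 1/√5 (the exact constants used in the lecture slides are not shown, so these are assumptions):

```python
# fibonacci1: closed formula with rounded constants.
# Accurate for small n, but the rounding error eventually shows.
def fibonacci1(n):
    phi = 1.618                        # golden ratio, rounded
    return round(0.4472 * phi ** n)    # 0.4472 ~ 1 / sqrt(5)

# fibonacci2: the recursive definition; always correct but very slow.
def fibonacci2(n):
    if n == 1 or n == 2:
        return 1
    return fibonacci2(n - 1) + fibonacci2(n - 2)
```

With these constants fibonacci1(18) returns 2583, one pair short of the true F18 = 2584, matching the off-by-one effect described above.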
How much time does fibonacci2 require?
- We cannot measure it in seconds, because that depends on the hardware of the machine.
- Neither can we count the number of machine instructions, because that depends on the programming language and compiler. (A compiler is a software program used to translate the code that we write in a programming language into lower-level code that the machine can understand.)
We have to make some approximations:
- Count the number of lines in the pseudocode containing instructions.
- Assume each code line requires 1 microsecond (hardware and software independent).
With these assumptions we can say: for n ≤ 2 the algorithm executes 2 lines of code, so T(n) = 2. The case is different when n ≥ 3, since we have a similar formula, the same 2 at the beginning plus the cost of the two recursive calls: T(n) = 2 + T(n-1) + T(n-2). The solution to this recurrence grows like φ^n. An exponential running time means that fibonacci2 is very slow: if we want to compute Fibonacci for n = 1000, we will have enormous numbers, almost 1.6 elevated to 1000 instructions.

Why is it so slow? The problem is that we use recursion naively. Assume we want to calculate the Fibonacci number for n = 6. We know it is given by the Fibonacci numbers 5 and 4, but we must compute those values too. Therefore we calculate the same value multiple times: for example, we compute F(1) 3 times and F(2) 5 times. This is not efficient. The idea for improving the algorithm is to avoid computing the same thing multiple times. In conclusion, the second version provides the right number but is not efficient, since it takes too much time to do the computation. How can we avoid recomputing the solution to the same subproblem over and over again? Just store the solution somewhere the first time you find it.
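The repeated work can be made visible by counting how often each subproblem is computed (a small sketch using the recursive version):

```python
from collections import Counter

# Count how many times the naive recursion computes each F(k).
calls = Counter()

def fib(n):
    calls[n] += 1
    if n == 1 or n == 2:
        return 1
    return fib(n - 1) + fib(n - 2)

fib(6)   # computing F(6) recomputes small subproblems many times
```

For n = 6, F(1) is computed 3 times and F(2) 5 times, exactly as noted above.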
This technique is known as memoization. We will use an array, which is a sequence of values of the same data type, one after the other. This means we can store F1 in the first element of the array, F2 in the second, and so on. In the end we will have the following:

The third version of the Fibonacci algorithm is the most accurate and efficient for solving the problem.
1st line: only the name is new; the rest is the same as in the previous two versions. The arrow (→) means that we provide an output.
2nd line: Fib is a variable; written this way it means it stores numbers, and in particular we tell the computer that it contains n items. Then we store 1 in Fib[1] and Fib[2].
3rd line: for i equal to 3 to n (where n is the total number), we store the sum of the 2 previous numbers. To do so we tell the computer to execute the Fib[i] instruction multiple times, which is called a for loop. A for loop is a control flow statement for specifying iteration, which allows code to be executed repeatedly. The instruction of the for loop is given in two parts:
- The header H specifies the iteration. The header often declares an explicit loop counter (in our case i), which allows the body to know which iteration is being executed.
- The body B is executed once per iteration; it may be composed of many instructions, but in our case it is just one.
Example: writing out the algorithm for n = 6 would look like this.
So, to recap:
1. fibonacci1 is not accurate.
2. fibonacci2 is accurate but not efficient.
3. fibonacci3 is both accurate and efficient.

Lecture 3

Example of fibonacci3.
Algorithm fibonacci3 (integer n) → integer
We input a natural number and output a natural number. If n equals 3, the output will be the third number in the Fibonacci sequence.
Let Fib be an array of integers.
Fib[1] ← Fib[2] ← 1
Here we store 1 in the first two elements.
For i = 3 to n do
This tells us how many times we run the instructions that constitute the loop. We have a loop index, in this case i, which tells us which repetition we are on; here we repeat the loop from 3 to n, where n is given as input.
Fib[i] ← Fib[i-1] + Fib[i-2]
We store in element i the sum of the two preceding elements. If n = 6 we have an array Fib that contains six elements, identified by what we call the index of the array (the little number under the table). The elements from index 3 onward store the remaining values.
End for
This closes the loop and does not count as an instruction.
Return Fib[n]

Example: fibonacci3 with n = 6; let us write down the instructions.
fibonacci3(6)
Fib[1] ← Fib[2] ← 1. With this instruction we store 1 in the first two elements of the array.
For i = 3: Fib[3] ← Fib[3-1] + Fib[3-2]. When i = 3 we store in the third element the sum of the two previous elements: 3-1 = 2 and 3-2 = 1, so we take the values in those positions, which are 1 and 1. Therefore 1 + 1 = 2.
For i = 4: Fib[4] ← Fib[4-1] + Fib[4-2]. Now i is increased by 1, so i = 4. We store the sum of the previous two elements, the third one (given by 4-1) and the second one (given by 4-2). The values in those positions are 2 and 1, therefore the fourth element will be equal to 3 (2 + 1).
For i = 5: Fib[5] ← Fib[5-1] + Fib[5-2]. We store in the fifth position 5, which is given by 3 + 2 (the values of the fourth (5-1) and the third (5-2) elements).
For i = 6: Fib[6] ← Fib[6-1] + Fib[6-2]. In the sixth position we store the sum of what we have in the fifth and fourth positions: the result is 8 = 5 + 3.
For i = 7: in this case we jump out of the loop, since 7 exceeds the input n, and we run the next instruction, which returns Fib[6] = 8.

Memoization is the technique of remembering previously computed values, used in fibonacci3: never recompute a subproblem F(k), k ≤ n, if it has been computed before; just store it in the Fib array. Is fibonacci3 faster than fibonacci2? Yes. To see this we calculate the running time of the algorithm, counting how many instructions are executed when we run it. Lines 1, 2, and 5 are run only once, while lines 3 and 4 are run many times, about n times each. If n = 6, the loop runs 4 times (i = 3, 4, 5, 6). The bottleneck is the for loop, which runs for roughly n iterations. This means that our running time is the following:

T(n) ≈ 3 + 2n

For n = 6 we have 3 + 2x6, which is equal to 15. fibonacci3 is faster than fibonacci2: if, for example, we consider n = 8, fibonacci3 executes 19 instructions (3 + 2x8) while fibonacci2 executes 61. When n = 45, the number of instructions in fibonacci3 is 93, while for fibonacci2 it is more than 3 million. This is why we prefer the third version over the second. The running time of fibonacci3 is linear (it grows like n), while fibonacci2 runs in exponential time. Measuring running time T(n) as the number of lines of code executed is a rough approximation:
- because two different lines of code may have very different actual times,
- but it is sufficient for our purpose!
In the running time we do not consider low-level details, such as multiplicative or additive constants.
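The fibonacci3 pseudocode traced above translates directly into Python (a sketch; the max(n, 2) guard is added here so that small inputs also work):

```python
# fibonacci3: memoization with an array, following the pseudocode.
def fibonacci3(n):
    fib = [0] * (max(n, 2) + 1)   # Let Fib be an array of integers
    fib[1] = fib[2] = 1           # Fib[1] <- Fib[2] <- 1
    for i in range(3, n + 1):     # For i = 3 to n do
        fib[i] = fib[i - 1] + fib[i - 2]
    return fib[n]                 # Return Fib[n]
```

fibonacci3(6) returns 8, as in the trace, and n = 45 is instantaneous, unlike the recursive version.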
We are just interested in the growth rate of the running time. We will use asymptotic notation O( ), called Big-O notation. It gives an upper bound: the running time grows at most this much (but it could grow more slowly).

Example (not considering constants): for T(n) = 2n we drop the constant factor and just say T(n) = O(n); for T(n) = 3 + 2n we also drop the 3 and again write O(n); in the third case we have the golden ratio number φ ≈ 1.6, which we can round up and write O(2^n).

Big-O notation makes our life easier:
- It ignores low-level details.
- It lets us compare different algorithms easily.
- fibonacci3 runs in O(n): much better than fibonacci2, which runs in O(2^n).
fibonacci3 is the best of our algorithms for solving the Fibonacci problem. In the plot, the red curve is O( ) of the blue one if there is a point on the x-axis beyond which the red curve stays below the blue curve; past that point the red one cannot be higher than the blue one.

Tribonacci
For n = 0 the Tribonacci number is 0; for n = 1 or 2 the Tribonacci number is 1; for n ≥ 3 the Tribonacci number is given by the sum of the three previous numbers. Prof. Tribonacci claims that the following is the fastest algorithm for computing the n-th Tribonacci number:

Stable Marriage Algorithm

The Stable Marriage Algorithm is an algorithm used to solve matching problems, e.g., matching courses and professors who can rank the subjects, matching rooms and courses, or matching students and internships. The algorithm was proposed by Shapley, an American mathematician who won the Nobel prize in economic sciences in 2012 for his contributions to mathematical economics and especially game theory. To find a solution to the stable matching problem, the Gale-Shapley algorithm was developed. It addresses a well-known problem of matching the elements of two sets: there are n men and n women, all unmarried, and each has ranked all members of the opposite sex in order of preference.
Does there exist, and can we find, a stable matching (a stable opposite-sex marriage)? That is, a matching of men and women such that there is no pair of a man and a woman who both prefer each other over their current partner in the matching. For example: each one has a preference list, and we want to find a stable marriage. Consider a matching M of men and women. A pair (m, w), where m is a man and w is a woman, is a blocking pair if:
- m and w are not partners in M,
- but m prefers w to his partner in M and w prefers m to her partner in M.
A matching M is stable if it has no blocking pairs.

Consider the following preferences, with the association Alex and Ann, Bert and Betty, Carl and Cindy. This is not a stable matching, because there is a blocking pair, Alex and Betty: they are not matched even though they prefer each other over their assigned partners. However, if we consider the same preferences and the association Alex and Betty, Bert and Ann, Carl and Cindy, we have a stable matching, because there is no blocking pair. The solution depends on whether we consider women or men first.

The original application was the assignment of med-school students to hospitals (for internships) for the National Resident Matching Program in the U.S.: students list hospitals in order of preference, and hospitals list students in order of preference. However, the Gale-Shapley paper was written in 1962 to match college admissions and opposite-sex marriages: students list colleges in order of preference, and colleges list students in order of preference. Nowadays there is a wide variety of practical applications, such as matching resident doctors to hospitals, matching students to colleges, matching sailors to ships, and so on.
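The blocking-pair definition can be turned into a small checker. The preference lists below are illustrative assumptions (the lecture's actual lists are only in the slides), chosen so that they reproduce the two outcomes described above:

```python
# A pair (m, w) not matched together is "blocking" if each prefers
# the other to their current partner. Preferences are illustrative.
def prefers(pref_list, a, b):
    return pref_list.index(a) < pref_list.index(b)

def blocking_pairs(men_pref, women_pref, match):   # match: man -> woman
    man_of = {w: m for m, w in match.items()}
    pairs = []
    for m, prefs in men_pref.items():
        for w in prefs:
            if (w != match[m]
                    and prefers(men_pref[m], w, match[m])
                    and prefers(women_pref[w], m, man_of[w])):
                pairs.append((m, w))
    return pairs

men = {"Alex": ["Betty", "Ann", "Cindy"],
       "Bert": ["Ann", "Betty", "Cindy"],
       "Carl": ["Cindy", "Ann", "Betty"]}
women = {"Ann": ["Alex", "Bert", "Carl"],
         "Betty": ["Alex", "Bert", "Carl"],
         "Cindy": ["Carl", "Alex", "Bert"]}
```

With these lists, the matching Alex-Ann, Bert-Betty, Carl-Cindy has the blocking pair (Alex, Betty), while Alex-Betty, Bert-Ann, Carl-Cindy is stable.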
More generally, we can say that the algorithm applies to any two-sided market, that is, a market with 2 distinct groups of participants, each with their own preferences. Akamai, a content delivery network distributing much of the world's content on the web, delivers 15-30% of Internet traffic every day. A matching decides which Akamai server is best for a given user: the users' preferences are based on latency and packet loss, and the web servers' preferences are based on costs of bandwidth and co-location. The goal is to assign billions of users to web servers every 10 seconds, so the algorithm has to be very fast.

Two other techniques try to solve the marriage problem with a simple-minded approach, but without success: they do not provide a good solution here, though they can work well on other problems.

The first one follows a "greedy" approach: we try to solve an optimization problem by aiming for the best choice at each step.
- Greedy for women: the first woman chooses, then the second woman chooses, and so on.
- Greedy for men: the first man chooses, then the second man chooses, and so on.
This does not work. Indeed, with the following preferences we get two solutions, one for men and one for women:
Greedy matching for men: Alex and Ann, Bert and Betty, Carl and Cindy. This is NOT stable, since Ann and Carl prefer each other over their assigned partners (a blocking pair).
Greedy matching for women: Ann and Carl, Betty and Bert, Cindy and Alex. This is NOT stable, since Betty and Carl prefer each other over their assigned partners (a blocking pair).
A greedy algorithm is a simple, intuitive algorithm used to solve optimization problems. It makes the optimal choice at each stage as it attempts to find the overall optimal way to solve the entire problem.

The second way is the "local search" approach. It could also be called the Soap-Series Algorithm because, while there is a blocking pair, it switches the blocking pair.
However, it can go on forever (so it gives no solution for some inputs). It works by improving the current solution, looking at the local area around the current solution. For example, consider the following preferences: we start the matching by associating w3 with m3 and m2 with w2. Later we discover that w1 and m3 run away together, so they are a blocking pair. Therefore we have to switch, matching w1 and m3 together, while the other couples are matched as w2 and m2. However, w2 and m1 run away, so the matching changes again, since we have to match w2 and m1. At the end of the story the matches have changed back, and the situation is the same as at the beginning. In conclusion, the "local search" approach need not terminate: whenever there is a blocking pair, it switches the blocking pair, and this can go on forever, so we need something else. Local search algorithms move from solution to solution in the space of candidate solutions (the search space) by applying local changes, until a solution deemed optimal is found or a time bound has elapsed.

Lecture 4

The algorithm proposed by Gale-Shapley is a good solution for our problem, since it always finds a stable matching. The input is the list of men and women and their preference lists, and the output is a stable matching. Note that the solution will be optimal for only one of the two sets: depending on the set we choose, we find a solution that is optimal for that set but corresponds to the worst stable solution for the other set. The algorithm has a bias towards one set. For this algorithm we will not see pseudocode, but we will consider these instructions:
- Fix some ordering on the women.
This means that we have a list of women (the solution will be optimal for them) and we have to decide whom we are going to match first, second, and so on. For example, we will consider Ann, then Betty, and lastly Cindy. Our goal is to match every woman and man for whom we have preferences in such a way that there is no blocking pair.
o Repeat until everyone is matched:
- Let A be the first unmatched woman in the ordering.
- Find the man B such that B is the most desirable man in A's list for whom either B is unmatched, or B is currently matched to a woman C and B prefers A to C.
- Match A and B; possibly this causes C to become unmatched.
It can happen that women who are already matched become unmatched: if B is already matched to a different woman C, we unmatch them whenever B prefers A to C. We need to consider whether the man that the current woman prefers the most is free or already matched. If he is free, we can match the two together; otherwise, if he is already matched, we unmatch him only if he prefers the new woman.
Let us consider the algorithm on the following example, giving preference to women (the solution will be optimal for women):
1. Fix some order, in which we first consider Ann, then Betty, then Cindy.
2. We start the matching with Ann, and her best option is Carl, so we match them.
2.1 Now we consider Betty, who also prefers Carl, but he is already matched with Ann. What do we have to do in this case? We have to CONSIDER THE PREFERENCE OF THE MAN, Carl. From his list we see that he prefers Betty to Ann; therefore, we unmatch Ann and Carl and match Carl with Betty.
2.2 Now Ann is unmatched, so we consider her again. The second option for Ann is Bert, so we match Ann with Bert.
2.3 Now we match Cindy, who prefers Bert the most, but Bert is already matched with Ann.
So, as before, we have to check his preferences, and from the list we see that he prefers Ann to Cindy. So we cannot match Cindy and Bert, and we are forced to consider the second option for Cindy, who is Carl. Also in this case Carl is already matched (with Betty), so we have to look at his preferences. According to them, Carl prefers Betty to Cindy. Consequently, for Cindy we must consider the third option, Alex, so the last couple will be Cindy and Alex. We repeat the algorithm until the matching is complete, that is, until all the women are matched. From the previous example we obtain a stable matching that is optimal for women: Betty and Carl, Ann and Bert, and Cindy and Alex, so there is no blocking pair. If we consider a different ordering of the women, the solution is always the same.
Does this algorithm terminate? Once a man is matched, he never becomes unmatched (though his partner can change). Note that women can alternate between being free and being engaged. So, the algorithm terminates. How fast? As input we have two sets of preference lists, so we consider that the size of the input is in the order of n, the number of men and women. When the partner of a man changes, it is to a partner he prefers more: this can happen at most n-1 times. In the example of Ann and Carl, when we unmatched them to match Betty and Carl, we saw that Carl prefers Betty to Ann, so the change is preferable for him. In every step, either an unmatched man becomes matched, or a matched man changes partner: at most n² steps. This means that if we have 10 men and 10 women, we need at most 100 steps of the algorithm to find a stable matching. So the running time is O(n²), where n is the number of women (and of men). It is a fast solution to our problem. Does this algorithm give a stable matching?
To prove stability, we use a proof by contradiction.
This means that we assume that the final matching found by the algorithm is not stable, and we prove that this assumption is false. Let us consider this solution for the previous problem, and suppose that the final matching is not stable, taking into consideration the following facts:
- Alex is matched to Ann.
- Bert is matched to Betty.
- Alex prefers Betty to Ann.
- Betty prefers Alex to Bert.
It would mean that Betty comes before Ann in Alex's preference list, but Alex is not matched to Betty. Now we can have two cases:
When Betty considers Alex, he has a partner, say Carola, preferable to Betty: Carola is also preferable to Ann, but in the algorithm men can only get more preferable partners, a contradiction.
When Betty considers Alex, he is free, but Betty is later replaced by someone preferable to Betty. Again, Alex can never end up with Ann. This is again a contradiction.
By using logic, the solution is stable.
Theorem:
– All possible executions of the Gale-Shapley algorithm give the same stable matching.
– In this matching, the women have the best partner they can have in any stable matching.
– In this matching, the men have the worst partner they can have in any stable matching.
This means that the solution is the best one for women and the worst one for men. Indeed, in the matching that we found, Alex is matched with Cindy, who was the last preference of Alex. See the exercise in the notes. In this case, the solutions were the same (for men and women) because the sets were small.
To recap: A stable matching always exists. The simple-minded soap-opera algorithm does not necessarily terminate. A stable matching can be found efficiently (in linear time in the size of the input) using the Gale-Shapley algorithm. Controversy: the women-optimal Gale-Shapley algorithm is better for women. In the hospitals-residents application, the solution was more favorable to hospitals than to students.
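The instructions above can be sketched in Python. This is a minimal sketch (function and variable names are mine, not from the notes), with the women as the proposing side, matching the women-optimal version described here:

```python
def gale_shapley(women_prefs, men_prefs):
    """Women-optimal stable matching.

    women_prefs: dict mapping each woman to her list of men, most preferred first.
    men_prefs:   dict mapping each man to his list of women, most preferred first.
    Returns a dict {woman: man}.
    """
    # rank[m][w] = position of w in m's list (lower means more preferred)
    rank = {m: {w: i for i, w in enumerate(prefs)}
            for m, prefs in men_prefs.items()}
    match_w = {}                        # woman -> man
    match_m = {}                        # man -> woman
    next_choice = {w: 0 for w in women_prefs}   # next man each woman considers

    free_women = list(women_prefs)      # fixed ordering of the women
    while free_women:
        a = free_women.pop(0)           # first unmatched woman in the ordering
        b = women_prefs[a][next_choice[a]]
        next_choice[a] += 1
        if b not in match_m:            # B is unmatched: match A and B
            match_w[a], match_m[b] = b, a
        elif rank[b][a] < rank[b][match_m[b]]:
            c = match_m[b]              # B prefers A to his current partner C
            free_women.append(c)        # C becomes unmatched again
            del match_w[c]
            match_w[a], match_m[b] = b, a
        else:
            free_women.append(a)        # A stays free and will try her next man
    return match_w
```

On the preference lists of Exercise 2 below, this returns the matching w1-m1, w2-m3, w3-m2, i.e., the women-optimal solution given in the notes.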
Exercise 2, spoiler: for each side, the solution does not depend on the ordering chosen. We have two sets of women and men, with the following preference lists:
- m1: w1 w2 w3 - w1: m1 m2 m3
- m2: w2 w1 w3 - w2: m3 m1 m2
- m3: w3 w2 w1 - w3: m2 m1 m3
Solve for women: Solve for men:
- w1 and m1 - m1 and w1
- w2 and m3 - m2 and w2
- w3 and m2 - m3 and w3
Now let's change the order. For women we will have w2, w3, and w1, while for men m3, m1, and m2.
Solve for women: Solve for men:
- w2 and m3 - m3 and w3
- w3 and m2 - m1 and w1
- w1 and m1 - m2 and w2
Search and Sort Algorithms
Search and sort algorithms are fundamental concepts in computer science, essential for efficiently organizing and retrieving data. We use them when we want to order some items or find an item in a list.
Search algorithms: used for searching; there are two kinds:
Sequential/Linear Search: sequentially checks each element of the list until a match is found or the end of the list is reached. Time complexity: O(n) for an unsorted list. It is the simplest search algorithm.
Binary Search: requires a sorted array. Divides the search interval in half until the element is found or the interval is empty. Time complexity: O(log n).
Sort algorithms:
Selection Sort: divides the array into two parts, the sorted and the unsorted subarray. Finds the smallest element in the unsorted subarray and swaps it with the first unsorted element. Time complexity: O(n^2) in the worst case.
Insertion Sort: builds the sorted array one element at a time by repeatedly taking the next element and inserting it into the correct position in the already sorted part of the array. Time complexity: O(n^2) in the worst case.
Bubble Sort: repeatedly steps through the list, compares adjacent elements, and swaps them if they are in the wrong order. Time complexity: O(n^2) in the worst case.
Merge Sort: divides the array into two halves, sorts each half separately, and then merges them. Time complexity: O(n log n).
Lecture 5
Sequential search, also known as linear search, is the simplest search algorithm. We want to check whether a number or value is present in a list. We scan the list and check whether each element is equal to the target value. If they are equal, we have found the element; otherwise, we go on until the end. If we reach the end without finding it, the value is not in the list.
The pseudocode of the sequential algorithm starts by saying: SequentialSearch(item x, list L) → Boolean. It returns a Boolean, which is binary: 0 = false, the element is NOT in the list, or 1 = true, the element is in the list. n is how many items we have in the list (e.g., 100 items, so n = 100); that is, n is the length of the list, the number of elements: length(L). We need a loop to scan the sequence.
For example, our list is 3 2 10 4, and we want to find 10. In this example, n = 4 (the number of elements in the list). To proceed we ask: is this number equal to 10? No, so we continue to the second value.
x = 10, L = [3, 2, 10, 4]: is L1 = 10? No. Is L2 = 10? No. Is L3 = 10? Yes.
If we consider the same list but look for 9, we will have the following: is L1 = 9? No. Is L2 = 9? No. Is L3 = 9? No. Is L4 = 9? No. Since we have reached the end of the list, 9 is not in the list. Consequently, the algorithm returns 0 (false): the number is not in the list.
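The scan just described can be written as a short Python function (a sketch; note that the lecture's pseudocode is 1-based while Python lists are 0-indexed):

```python
def sequential_search(x, L):
    """Return True if x occurs in list L, scanning left to right."""
    for item in L:          # compare each element with the target value
        if item == x:
            return True     # found: stop immediately
    return False            # reached the end of the list: x is not in L
```

For L = [3, 2, 10, 4], searching for 10 returns True after three comparisons, while searching for 9 scans the whole list and returns False.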
In pseudocode we have the following: if we find item x, we terminate the algorithm. If we run through the list without finding item x, we return "NOT FOUND".
Considering this algorithm, let us do the following example:
L = [5, 4, 9, 12], x = 12 (the value we are looking for).
The length of the list, n, is equal to 4 because the list has four elements. Now we run the loop for i = 1 to 4.
- i = 1: is L[1] = 12? L[1] is 5, so the answer is NO (false).
- i = 2: is L[2] = 12? L[2] is 4, so we ask whether 4 equals 12. NO, so we continue.
- i = 3: is L[3] = 12? The third element of the list is 9, so it is NOT equal to 12 either. Consequently, the result is FALSE.
- i = 4: is L[4] = 12? This last element corresponds to x, so we return YES (true). At this point, we return FOUND.
Another example with the same list, but looking for 10. Therefore we can write x = 10 and L = [5, 4, 9, 12]:
- Is the first value (5) equal to 10? NO.
- Is the second value (4) equal to 10? NO.
- Is the third value (9) equal to 10? NO.
- Is the fourth value (12) equal to 10? NO.
We have reached the end of the loop, so we return NOT FOUND. This is what we call the sequential search algorithm.
Regarding the running time, sequential search requires O(n) time (comparisons) in the worst case. The worst case is the one in which our target value is not in the list: we have to scan all the elements, comparing the target value with the current element n times. The greatest number of comparisons needed to find the target value occurs when the target value is in the last slot of the list or is not in the list at all; the number of comparisons then equals the size of the list (say n). Can we do better?
YES, if the list L is sorted: we will see that we have a very fast algorithm (binary search) whose running time is log(n), which grows much more slowly than n, so the binary search algorithm is more efficient.
Binary search
Input: a sorted list (from the smallest to the largest value) and a target value (it is like searching a dictionary for a word).
Idea: each time, we compare the target value (the value we are searching for) to the middle element of the list; we eliminate half of the list and continue the search on the remaining half until we either find the target value or determine that it is not in the list. Given a target value and a sorted list (or array) L as input, find an index i such that L[i] = value, or report that no such index exists.
We will need two values: one telling us the lowest index of the current sublist (lo) and one for the highest (hi). We use mid, the midpoint of the sublist. The algorithm maintains the invariant L[lo] <= x <= L[hi] whenever x is in the list. For example, searching for x = 31 when the midpoint value is 16: the instruction "if L[mid] > 31 then return BinarySearch(x, L[1..mid-1])" is NOT the case (so we do not run this instruction but the other one); else, since L[mid] = 16 < 31, we consider only the second part of the list.
The running time of binary search. Let T(n) be the number of comparisons done by BinarySearch to find an item x in a sorted list of size n. We obtain the recurrence:
T(n) = 1 + T(n/2)
T(1) = 1 (if the list has just one element, only 1 comparison is needed).
For the complete solution of this recurrence, see the notes.
Lecture 6
Binary search with a sorted list. The idea is that we consider the midpoint of the list and compare the target value with it. If the target value is larger than the value at the midpoint, we can discard the left (lower) part of the list and continue with the right part. If the target value is smaller, we do the opposite.
Example with a sorted list, in binary search.
Assume that we have the following data: L = 3-6-7-10-15-21-29, x = 29 (target value).
The midpoint is 10. Are 10 and 29 the same? No. Is 10 greater than 29? No, so we apply the binary search algorithm again on the right part of the list. We re-apply the algorithm with the new list, and the midpoint is 21. Now we compare 21 and 29: are they the same? No. Is 21 greater than 29? No, which means we can discard the left part of the list and continue with the right part. Now we ask if the midpoint, 29, is equal to 29, and the answer is YES. So we have found the element in the list, and we return FOUND.
For another example, we look for the following data: L = 3-6-7-10-15-21-29, x = 5 (target value).
We find the midpoint, 10. Is 10 equal to 5? No. Is 10 greater than 5? Yes, which means we discard the right part of the list, since the value cannot be in that part. Now we apply the search again with a smaller list. The midpoint is 6, so we compare the target value with it: is 6 equal to 5? No. Is 6 greater than 5? Yes, so our target value cannot be in the right part, only in the left part. We continue with the remaining list, which contains only 3. Now we have just one element, which is our midpoint, and we ask if 3 is equal to 5. The answer is No, and we return NOT FOUND.
Let us see the running time, remembering that our goal is to show that binary search is more efficient than sequential search. We have a recurrence equation, whose details are explained in the notes. In the end we reach this result: T(n) = 1 + log2 n. In conclusion, the number of comparisons T(n) done by binary search to find an item in a sorted list of size n is described by the recurrence T(n) = T(n/2) + 1. The solution of this recurrence is T(n) = O(log n). Binary search requires running time O(log n). That is better than sequential search, which is O(n)!
In the picture, we have some of the notation for the O.
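The procedure traced above can be sketched recursively in Python (a sketch using 0-based indexing, unlike the 1-based pseudocode in the notes):

```python
def binary_search(x, L, lo=0, hi=None):
    """Return True if x is in sorted list L, using O(log n) comparisons."""
    if hi is None:
        hi = len(L) - 1
    if lo > hi:                 # empty interval: x is not in the list
        return False
    mid = (lo + hi) // 2        # midpoint of the current interval
    if L[mid] == x:
        return True             # FOUND
    elif L[mid] > x:
        return binary_search(x, L, lo, mid - 1)   # search the left half
    else:
        return binary_search(x, L, mid + 1, hi)   # search the right half
```

On L = [3, 6, 7, 10, 15, 21, 29], searching for 29 follows the midpoints 10, 21, and 29 and returns True, while searching for 5 ends on the one-element sublist [3] and returns False, exactly as in the traces above.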
- Constant algorithm – O(1): the fastest possible running time; the algorithm always takes the same amount of time to execute, regardless of the input size. Ideal, but rarely achievable.
- Logarithmic algorithm – O(log n): running time grows logarithmically in proportion to n. It is slower than constant, but faster than the others, e.g., binary search.
- Linear algorithm – O(n): running time grows directly in proportion to n. It is slower than constant and logarithmic, but better than quadratic, e.g., sequential search and fibonacci3.
- Quadratic or polynomial algorithm – O(n^c): running time grows more quickly than the previous ones. It is the slowest of the algorithms listed so far.
We can also have other cases (more O notation):
- Exponential algorithm – O(c^n): running time grows even faster than any polynomial algorithm, e.g., Fibonacci2, whose running time grows like 2^n; it is a slow algorithm.
- Superlinear algorithm – O(n log n): running time grows faster than n, but is still practical.
- Factorial algorithm – O(n!): running time grows the fastest, and the algorithm quickly becomes unusable even for small values of n.
Sorting Problems
This is one of the most common problems in algorithms, since we have to sort names, numbers, and so on. The goal is, given a set of items, to sort them in some order: from the highest to the lowest, in alphabetical order, and so on. Binary search requires a sorted list, so in theory we should first run a sorting algorithm and then the binary search. We will study 4 sorting algorithms; most of them have a quadratic running time, except one, which has a superlinear running time. All the algorithms we will consider are comparison-based, i.e., they sort by comparing elements. Three out of four have a quadratic running time.
The first is SelectionSort, in which we have a sequence of numbers and we have to put them in non-decreasing order, from the smallest to the largest. More formally, we extend the sorted subsequence from k to (k+1) items by selecting the minimum among the (n-k) unsorted items and moving it to position (k+1).
Let us consider this example, with the list 7 2 4 5 3 1:
- Is 7 the minimum value in the list? No, it is 1, so the two are exchanged.
- Is 2 the minimum among the remaining values? Yes.
- Is 4 the minimum among the remaining values? No, it is 3, so they are exchanged.
- Is 5 the minimum among the remaining values? No, it is 4, so they are exchanged.
- Is 5 the minimum among the remaining values? Yes.
- Is 7 the minimum among the remaining values? Yes.
How do we find the minimum value from a particular position onwards? We refer to that position as i. The "smallest" place marker keeps track of the position of the smallest value we have seen so far:
- Initialize smallest to i.
- Look at every value from i+1 onwards and update smallest only when you find a value smaller than the one at position smallest.
Example of finding the minimum value from a particular position (i = 1) onwards. From the picture we know that L[1] = 7, L[2] = 2, L[3] = 4, L[4] = 5, L[5] = 3, L[6] = 1, while n = 6, since there are six elements in the list.
o smallest = 1.
o For j = i+1 (which is 1+1 = 2) to n do (j is the position):
if L[j] < L[smallest] then smallest = j.
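The selection procedure above can be sketched in Python (a sketch; the notes' pseudocode is 1-based, while Python is 0-based):

```python
def selection_sort(L):
    """Sort list L in non-decreasing order, in place, in O(n^2) time."""
    n = len(L)
    for i in range(n):
        smallest = i                    # position of the minimum seen so far
        for j in range(i + 1, n):       # scan the unsorted part L[i+1 .. n-1]
            if L[j] < L[smallest]:
                smallest = j
        L[i], L[smallest] = L[smallest], L[i]   # move the minimum to position i
    return L
```

On the example list 7 2 4 5 3 1, this performs the exchanges traced above and yields 1 2 3 4 5 7.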
