Algorithms for BioMed (알고리즘) PDF
Document Details
Uploaded by MesmerizingGyrolite5380
Ajou University
Sael Lee
Tags
Summary
This document is a set of lecture slides on algorithms for biomedicine. It covers basic algorithms, analysis, and specific techniques like divide and conquer, dynamic programming, and sequence alignment. The slides are from Ajou University, focusing on the efficiency of distinct approaches and the concept of algorithm analysis within the context of biomedicine.
Full Transcript
의료인공지능 이슬 Algorithms for BioMed (알고리즘) Lecture slides in courtesy of Steven Skiena (The Algorithm Design Manual 2nd Ed.) and Richard Neapolitan (Foundations of Algorithms 5th Ed) Contents ❑ Algorithms Basics ❑ Algorithms Analysis Basics ❑ Divide & Conquer and Sorting ❑ Dynamic...
의료인공지능 이슬 Algorithms for BioMed (알고리즘) Lecture slides in courtesy of Steven Skiena (The Algorithm Design Manual 2nd Ed.) and Richard Neapolitan (Foundations of Algorithms 5th Ed) Contents ❑ Algorithms Basics ❑ Algorithms Analysis Basics ❑ Divide & Conquer and Sorting ❑ Dynamic Programming ❑ Edit Distance for Sequence Alignment Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee 2 Lecture slide courtesy of Prof. Steven Skiena What is a Problem? ❑ A problem is the question to which an answer is sought. ❑ Ex1> Sort a list S of n numbers in nondecreasing order. The answer is the numbers in sorted sequence. ❑ Parameters are variables that are not assigned specific values in the statement of the problem. They are often the input to the problem. ❑ Instance: a specific assignment of values to the input parameters ❑ Ex1> ❑ A solution to an instance of a problem is the answer to the question asked by the problem in that instance. ❑ Ex1> [5, 7, 8, 10, 11, 13]. Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee 3 Lecture slide courtesy of Prof. Steven Skiena What Is An Algorithm? ❑ An Algorithm is the step-by-step procedure that solves the Problem ❑ Algorithms are the ideas behind computer programs ❑ An algorithm is the thing which stays the same ❑ No matter what language (C++, Pascal, Python) or environment (Mac, Windows, Linux) etc. ❑ To be interesting, an algorithm has to solve a general specified problem. ❑ An algorithmic problem is specified by describing the ❑ set of instances it must work on and ❑ what desired properties the output must have. Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee 4 Lecture slide courtesy of Prof. Steven Skiena Example 1: Sorting ❑ Input: A sequence of N numbers a1 … an ❑ Output: the permutation (reordering) of the input sequence such as a1 ≤ a2 … ≤ a n. ❑ We seek algorithms which are ❑ correct and ❑ Must prove that it always returns the desired output for all legal instances of the problem. ❑ efficient. ❑A faster algorithm running on a slower computer will always win for sufficiently large instance. Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee 5 Lecture slide courtesy of Prof. Steven Skiena Example 2: Robot Tour Optimization ❑ Suppose you have a robot arm equipped with a tool, say a soldering iron. ❑ To enable the robot arm to do a soldering job, we must construct an ordering of the contact points, so the robot visits (and solders) the points in order. ❑ We seek the order which minimizes the testing time (i.e. travel distance) it takes to assemble the circuit board. Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee 6 Lecture slide courtesy of Prof. Steven Skiena Problem: Find the Shortest Robot Tour You are given the job to program the robot arm. Give me an algorithm to find the most efficient tour! Traveling salesperson problem (TSP) Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee 7 Lecture slide courtesy of Prof. Steven Skiena Greedy Approach 1: Nearest Neighbor Tour ❑ Starts at some point p0 and then walks to its nearest neighbor p 1 first, then repeats from p1, etc. until done. Pick and visit an initial point p0 p = p0 i=0 While there are still unvisited points i=i+1 Let pi be the closest unvisited point to pi-1 Visit pi Return to p0 from pi Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee 8 Lecture slide courtesy of Prof. Steven Skiena Demonstrating Incorrectness by Counter-examples ❑ Searching for counter-examples is the best way to disprove the correctness of a heuristic. ❑ Think small: Think about all small examples. ❑ Think exhaustively: There are only a small number of possibilities for the smallest nontrivial value of n. ❑ Go for a tie: Think about examples with ties on your decision criteria (e.g. pick the nearest point) ❑ Seek extremes: Think about examples with extremes of big and small. ❑ Hunt for the weakness: If a proposed algorithm is of the form “always take the biggest” (better known as the greedy algorithm), think about why that might prove to be the wrong thing to do. Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee 9 Lecture slide courtesy of Prof. Steven Skiena Nearest Neighbor Tour doesn’t return the shortest path! Wrong! Starting from the leftmost point will not fix the problem. Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee 10 Lecture slide courtesy of Prof. Steven Skiena Greedy Approach 2: Closest Pair Tour ❑ Repeatedly connect the closest pair of points whose connection will not cause a cycle or a three-way branch, until all points are in one tour. Let n be the number of points in the set d=∞ For i = 1 to n - 1 do For each pair of endpoints (x, y) of partial paths If dist(x; y) ≤ d then xm = x, ym = y, d = dist(x; y) Connect (xm, ym) by an edge Connect the two endpoints by an edge. Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee 11 Lecture slide courtesy of Prof. Steven Skiena Closest Pair Tour doesn’t return the shortest path! Although it works correctly on the previous example, other data causes trouble: Wrong! What the Algorithm does What the output should outputs output Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee 12 Lecture slide courtesy of Prof. Greedy Approach 3: Exhaustive Search Correct but Steven Skiena not efficient! Try all possible orderings of the points, then select the one which minimizes the total length: Efficient Search Approach is needed Let n be the number of points in the set d=∞ For each of the n! permutations, 𝛱𝑖 , of the n points If (cost(𝛱𝑖 ) ≤ d) then d = cost(𝛱𝑖 ) and Pmin = i Return Pmin Since all possible orderings are considered, we are guaranteed to end up with the shortest possible tour. Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee 13 Lecture slide courtesy of Prof. Steven Skiena Exhaustive Search is Slow! Because it tries all n! permutations, it is much too slow to use when there are more than 10-20 points. NOTE: No efficient, correct algorithm exists for the traveling salesperson problem, as we will see later. Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee 14 Lecture slide courtesy of Prof. Steven Skiena Stop and Think: Why Not Use a Supercomputer? ❑ A faster algorithm running on a slower computer will always win for sufficiently large instances, as we shall see. ❑ Usually, problems don’t have to get that large before the faster algorithm wins. ❑ We will soon learn about the asymptotic notations in analysis of algorithms. Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee 15 Analysis of Algorithm Efficiency Lecture slides in courtesy of Steven Skiena (The Algorithm Design Manual 2nd Ed.) and Richard Neapolitan (Foundations of Algorithms 5th Ed) The RAM Model of Computation Figure from https://s3.amazonaws.com/content.udacity-data.com/courses/gt-cs6505/churchturing.html Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee 17 Lecture slide courtesy of Prof. Steven Skiena Worst-Case Time Complexity The worst case (time) complexity of an algorithm is the function defined by the maximum number of steps taken on any instance of size n. Worst-case complexity: Our preferred measure of algorithm efficiency: Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee 18 Lecture slide courtesy of Prof. Steven Skiena Exact Analysis is Hard! ❑ Best, worst, and average are difficult to deal with because the precise function details are very complicated: ❑ It easier to talk about upper and lower bounds of the function. ❑ Asymptotic notation (𝑶, 𝚯, 𝛀) are as well as we can practically deal with complexity functions. Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee 19 Lecture slide courtesy of Prof. Steven Skiena Asymptotic Upper Bound ❑ Big-Oh: O(g(n)): At most as fast as g(n) ❑ Def: f(n) = O(g(n)) if there are positive constants n0 and c such that to the right of n0, the value of f(n) always lies on or below c · g(n), where c is constant independent of n. ❑ f(n) = O(g(n)) means c · g(n) is an upper bound on f(n). ❑ f(n) = O(g(n)) is read “f of n is Big Oh of g of n” f(n) ≤ c · g(n) for any 𝑛 > 𝑛0 The definitions imply a constant n0 beyond which they are satisfied. We do not care about small values of n. Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee 20 Lecture slide courtesy of Prof. Steven Skiena Asymptotic Lower Bound ❑ Big-Omega: 𝛀(g(n)) = At least as fast as g(n) ❑ Def: f(n) = 𝛀(g(n)) if there are positive constants n0 and c such that to the right of n0, the value of f(n) always lies on or above c · g(n)), where c is constant independent of n. ❑ f(n) = 𝛀(g(n)) means c · g(n) is a lower bound on f(n) ❑ f(n) = 𝛀(g(n)) is read “f of n is Big Omega of g of n” f(n) ≥ c · g(n) for any 𝑛 > 𝑛0 Ajou University Data Intelligence Lab, Sael Lee 21 Lecture slide courtesy of Prof. Steven Skiena Tight Bound ❑ Big-Theta:𝚯(g(n)) – At the rate of g(n) ❑ Def: f(n) = 𝚯 (g(n)) if there exist positive constants n0, c1, and c2 such that to the right of n0, the value of f(n) always lies between c1 · g(n) and c2 · g(n) inclusive, where c1, and c2 are all constants independent of n. ❑ f(n) = 𝚯(g(n)) means c1 · g(n) is an upper bound on f(n) and c2 · g(n) is a lower bound on f(n). (a.k.a. tight bound). ❑ f(n) = 𝚯(g(n)) is read “f of n is Big Theta of g 𝑐2 𝑔 𝑛 ≤ 𝑓 𝑛 ≤ 𝑐1 𝑔 𝑛 of n” for any 𝑛 > 𝑛0 Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee 22 Simple Programming Analysis Lecture slides in courtesy of Steven Skiena (The Algorithm Design Manual 2nd Ed.) and Richard Neapolitan (Foundations of Algorithms 5th Ed) Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee Lecture slide courtesy of Prof. Steven Skiena Reasoning About Efficiency: Selection Sort Ajou University Data Intelligence Lab, Sael Lee 24 Lecture slide courtesy of Prof. Steven Skiena Selection Sort Worst Case Analysis ❑ The outer loop goes around n times. ❑ The inner loop goes around at most n times for each iteration of the outer loop ❑ Thus selection sort takes at most n*n -> O(n2) time in the worst case. Ajou University Data Intelligence Lab, Sael Lee 25 Lecture slide courtesy of Prof. Steven Skiena More Careful Analysis ❑ An exact count of the number of times the if statement is executed is given by: 𝑛−1 𝑛−1 𝑛−1 𝑆 𝑛 = 1= 𝑛−𝑖−1 𝑖=0 𝑗=𝑖+1 𝑖=0 𝑛 𝑛−1 𝑆 𝑛 = 𝑛 − 1 + 𝑛 − 2 + …+ 1 + 0 = = Θ(𝑛2 ) 2 ❑ Thus exact running time is Θ(n2). ❑ Can we say 0(n2) or Ω(n2) ? Ajou University Data Intelligence Lab, Sael Lee 26 Lecture slide courtesy of Prof. Steven Skiena Reasoning About Efficiency: Insertion Sort Ajou University Data Intelligence Lab, Sael Lee 27 Lecture slide courtesy of Prof. Steven Skiena Insertion Sort Worst Case Analysis ❑ How often does the inner while loop iterate? ❑ Two different stopping conditions: ❑ One to prevent us from running off the bounds of the array (j > 0) ❑ One to mark when the element finds its proper place in sorted order (s[j] < s[j−1]). ❑ Since worst-case analysis seeks an upper bound on the running time, we ignore the early termination and assume that this loop always goes around i times. ❑ Insertion sort must be a quadratic-time algorithm, i.e. O(n2). Ajou University Data Intelligence Lab, Sael Lee 28 Divide-and-Conquer and Sorting Lecture slides adopted from Steven Skiena’s (The Algorithm Design Manual 2nd Ed.) and Richard Neapolitan’s (Foundations of Algorithms 5th Ed) Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee Divide and Conquer ❑ Divide and conquer is an important algorithm design technique used in various algorithms that utilizes the recurrence structure of the problem. ❑ Mergesort, binary search the fast Fourier transform (FFT), and Strassen’s matrix multiplication algorithm. ❑ We 1) divide the problem into smaller subproblems, 2) solve each subproblem recursively, and then 3) combine the two partial solutions into one solution for the full problem. ❑ When merging takes less time than solving the two subproblems, we get an efficient algorithm. Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee 30 Divide-and-Conquer & Recurrence Relations ❑ Many divide-and-conquer algorithms have time complexities that are naturally modeled by recurrence relations. ❑ Recurrence Relation is an equation that is defined in terms of itself. ❑ Example: The Fibonacci numbers are described by the recurrence relation Fn = Fn−1+Fn−2 ❑ Any polynomial can be represented by a recurrence linear function: an = an−1 + 1, a1 = 1 → an = n exponential: an = 2an−1, a1 = 1 → an = 2n−1 Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee 31 Mergesort ❑ Sort an array S of size n (for simplicity, let n be a power of 2) ❑ Divide S into 2 sub-arrays of size n/2 ❑ Conquer (solve) recursively sort each sub-array until array is sufficiently small (size 1) ❑ Combine merge the solutions to the sub-arrays into a single sorted array Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee 32 Mergesort Pseudo Code MergeSort(item_type s[], int low, int high) IF low < high middle = (low + high) / 2 MergeSort(s, low, middle) MergeSort(s, middle + 1, high) Merge(s, low, middle, high) Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee Lecture slide courtesy of Prof. Steven Skiena Mergesort Example 2 Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee Lecture slide courtesy of Prof. Steven Skiena Mergesort Analysis ❑ A linear amount of work is done merging along levels of the mergesort tree. ❑ The height of this tree is Θ(log 𝑛). ❑ Thus the worst case time is Θ (𝑛log 𝑛). https://math.oxford.emory.edu/site/cs171/mergeSortAnalysis/ Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee Quicksort ❑ In practice, the fastest internal sorting algorithm is Quicksort, which uses partitioning as its main idea where input array recursively divided into two partitions and recursively sorted ❑ Example: pivot about 10. ❑ Before: 17 12 6 19 23 8 5 10 ❑ After: 6 8 5 10 23 19 12 17 ❑ Partitioning Pivot divides the two sub-arrays ❑ All items < pivot placed in sub-array left of pivot ❑ All items >= pivot placed in sub-array right of pivot ❑ Note that the pivot element ends up in the correct place in the total order! Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee Best Case Recursion Tree ❑ The total partitioning on each level is θ(n) , and it take lgn levels of perfect partitions to get to single element subproblems. ❑ When we are down to single elements, the problems are sorted. ❑ Thus the total time in the best case Quicksort is θ (n lg n). Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee Worst Case for Quicksort ❑ Suppose instead our pivot element splits the array as unequally as possible. Thus instead of n/2 elements in the smaller half, we get zero, meaning that the pivot element is the biggest or smallest element in the array. Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee Average Case for Quicksort ❑ Suppose we pick the pivot element at random in an array of n keys. ❑ Half the time, the pivot element will be from the center half of the sorted array. ❑ Whenever the pivot element is from positions n/4 to 3n/4, the larger remaining subarray contains at most 3n/4 elements. 3 ℎ 4 ℎ ∗𝑛 =1 →𝑛 = or h = O(lgn) good partitions suffice to finish. 4 3 ❑ Average case analysis: O(nlgn) Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee Non-Comparison-Based Sorting: Bucketsort ❑ Suppose we are sorting n numbers from 1 to m, where we know the numbers are approximately uniformly distributed. ❑ We can set up n buckets, each responsible for an interval of m/n numbers from 1 to m ❑ Given an input number x, it belongs in bucket number 𝑥𝑛/𝑚. ❑ If we use an array of buckets, each item gets mapped to the right bucket in O(1) time. Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee 40 Divide and Conquer Summary ❑ Top-down approach to problem solving ❑ Blindly divide problem into smaller instances and solve the smaller instances ❑ Technique works efficiently for problems where smaller instances are unrelated ❑ Inefficient solution to problems where smaller instances are related ❑ Ex> Recursive solution to the Fibonacci sequence Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee 41 Intro to Dynamic Programming Lecture slides adopted from Steven Skiena’s (The Algorithm Design Manual 2nd Ed.) and Richard Neapolitan’s (Foundations of Algorithms 5th Ed) Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee Dynamic Programming ❑ Instance of problem divided into smaller instances ❑ Smaller instances solved first and stored for later use by solution to solve larger instances ❑ Look it up instead of re-compute ❑ Iterative solution to the Fibonacci Sequence Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee 43 Steps to Develop a Dynamic Programming Algorithm 1. Establish a recursive property that gives the solution to an instance of the problem. (Formulate the answer as a recurrence relation or recursive algorithm.) 2. Compute the value of an optimal solution in a bottom-up fashion by solving smaller instances first Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee 44 Avoiding Re-computation by Storing Partial Results ❑ The trick to dynamic program is to see that the naive recursive algorithm repeatedly computes the same subproblems over and over and over again. ❑ If so, storing the answers to them in a table instead of recomputing can lead to an efficient algorithm. ❑ Thus we must first hunt for a correct recursive algorithm – later we can worry about speeding it up by using a results matrix. Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee 45 Example 1: The Fibonacci Sequence Problem ❑ Initial condition and Recurrence relation in Fibonacci Sequence Finding nth Fibonacci Term ❑ Problem: Determine the nth term in the Fibonacci sequence. ❑ Inputs: a nonnegative integer n. ❑ Outputs: fib, the nth term of the Fibonacci sequence. Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee 46 The D&C Fibonacci Sequence ❑ D&C Recursive Algorithm: Direct application of the recurrence relation If T(n) is the number of steps in the Algorithm, then, for 𝑛 ≥ 𝑛 2, 𝑇(𝑛) > 2 2 Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee 47 Computing the Fibonacci Numbers via D&C Fn = Fn−1 + Fn−2 ; F0 = 0; F1 = 1 ❑ Implementing this as a recursive procedure is easy, but slow because we keep calculating the same value over and over. 𝐹𝑛 ≈ 1.6𝑛 computing Fn requires ≈ 1.6𝑛 calls! Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee 48 Dynamic Programming (DP) for Fibonacci Sequence We can calculate Fn in linear time by storing small values: ❑ 𝐹0 = 0 ❑ 𝐹1 = 1 ❑ for 𝑖 = 2 to 𝑛 ❑ 𝐹𝑖 = 𝐹𝑖−1 + 𝐹𝑖−2 Note: we traded space for time. Time: O(n) Space: O(n) *Space: O(1) if keeping only last two Ajou University Data Intelligence Lab (dilab.ajou.ac.kr), Sael Lee 49 Minimum Edit Distance Slide from: Speech and Language Processing (3rd ed) Dan Jurafsky and James H. Martin Jurafsky & Martin Jurafsky & Martin How similar are two strings? ❑ Spell correction ❑ Computational Biology ❑ Theuser typed “graffe” ❑ Align two sequences of nucleotides Which is closest? AGGCTATCACCTGACCTCCAGGCCGATGCCC ❑ graf TAGCTATCACGACCGCGGTCGATTTGCCCGAC ❑ graft ❑ Resulting alignment: ❑ grail ❑ giraffe -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Jurafsky & Martin Edit Distance ❑ The minimum edit distance between two strings ❑ Is the minimum number of editing operations ❑ Insertion ❑ Deletion ❑ Substitution ❑ Needed to transform one into the other Jurafsky & Martin Minimum Edit Distance ❑ Two strings and their alignment: Jurafsky & Martin Minimum Edit Distance ❑ If each operation has cost of 1 ❑ Distance between these is 5 ❑ If substitutions cost 2 (Levenshtein) ❑ Distance between them is 8 Jurafsky & Martin Alignment in Computational Biology ❑ Given a sequence of bases AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC ❑ An alignment: -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC ❑ Given two sequences, align each letter to a letter or gap Jurafsky & Martin How to find the Min Edit Distance? ❑ Searching for a path (sequence of edits) from the start string to the final string: ❑ Initial state: the word we’re transforming ❑ Operators: insert, delete, substitute ❑ Goal state: the word we’re trying to get to ❑ Path cost: what we want to minimize: the number of edits Jurafsky & Martin Minimum Edit as Search ❑ But the space of all edit sequences is huge! ❑ We can’t afford to navigate naïvely ❑ Lots of distinct paths wind up at the same state. ❑ We don’t have to keep track of all of them ❑ Just the shortest path to each of those revisted states. Jurafsky & Martin Defining Min Edit Distance ❑ For two strings ❑X of length n ❑ Y of length m ❑ We define D(i,j) ❑ the edit distance between X[1..i] and Y[1..j] ❑i.e., the first i characters of X and the first j characters of Y ❑ The edit distance between X and Y is thus D(n,m) Jurafsky & Martin Dynamic Programming for Minimum Edit Distance ❑ Dynamic programming: A tabular computation of D(n,m) ❑ Solving problems by combining solutions to subproblems. ❑ Bottom-up ❑ We compute D(i,j) for small i,j ❑ And compute larger D(i,j) based on previously computed smaller values ❑ i.e., compute D(i,j) for all i (0 < i < n) and j (0 < j < m) Jurafsky & Martin Defining Min Edit Distance (Levenshtein) ❑ Initialization D(i,0) = i D(0,j) = j ❑ Recurrence Relation: For each i = 1…M For each j = 1…N D(i-1,j) + 1 deletion D(i,j)= min D(i,j-1) + 1 insertion D(i-1,j-1) + 2; if X(i) ≠ substitution Y(j) 0; if X(i) = Y(j) ❑ Termination: D(N,M) is distance Jurafsky & Martin The Edit Distance Table Initialization N 9 D(i,0) = i O 8 D(0,j) = j I 7 T 6 N 5 E 4 T 3 N 2 I 1 # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N Jurafsky & Martin The Edit Distance Table N 9 O 8 I 7 T 6 N 5 E 4 T 3 N 2 I 1 # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N Jurafsky & Martin Edit Distance N 9 O 8 I 7 T 6 N 5 E 4 T 3 N 2 I 1 # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N Jurafsky & Martin The Edit Distance Table N 9 8 9 10 11 12 11 10 9 8 O 8 7 8 9 10 11 10 9 8 9 I 7 6 7 8 9 10 9 8 9 10 T 6 5 6 7 8 9 8 9 10 11 N 5 4 5 6 7 8 9 10 11 10 E 4 3 4 5 6 7 8 9 10 9 T 3 4 5 6 7 8 7 8 9 8 N 2 3 4 5 6 7 8 7 8 7 I 1 2 3 4 5 6 7 6 7 8 # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N Jurafsky & Martin Computing alignments ❑ Edit distance isn’t sufficient ❑ We often need to align each character of the two strings to each other ❑ We do this by keeping a “backtrace” ❑ Every time we enter a cell, remember where we came from ❑ When we reach the end, ❑ Trace back the path from the upper right corner to read off the alignment Jurafsky & Martin MinEdit with Backtrace Jurafsky & Martin Adding Backtrace to Minimum Edit Distance ❑ Base conditions: Termination: D(i,0) = i D(0,j) = j D(N,M) is distance ❑ Recurrence Relation: For each i = 1…M For each j = 1…N D(i-1,j) + 1 deletion D(i,j)= min D(i,j-1) + 1 insertion D(i-1,j-1) + 2; if X(i) ≠ Y(j)substitution 0; if X(i) = Y(j) LEFT insertion ptr(i,j)= DOWN deletion DIAG substitution Jurafsky & Martin The Distance Matrix x0 …………………… xN Every non-decreasing path from (0,0) to (M, N) corresponds to an alignment of the two sequences SLIDES ADAPTED FROM JURE LESKOVEC An optimal alignment is composed y0 ……………………………… yM of optimal subalignments Jurafsky & Martin Result of Backtrace ❑ Two strings and their alignment: Jurafsky & Martin Performance ❑ Time: O(nm) ❑ Space: O(nm) ❑ Backtrace O(n+m) Jurafsky & Martin Sequence Alignment AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC Jurafsky & Martin The Needleman-Wunsch Algorithm ❑ Initialization: D(i,0) = -i * d D(0,j) = -j * d ❑ Recurrence Relation: D(i-1,j) - d D(i,j)= min D(i,j-1) - d D(i-1,j-1) + s[x(i),y(j)] ❑ Termination: D(N,M) is distance Jurafsky & Martin The Needleman-Wunsch Matrix x1 ……………………………… xM y1 …………………… yN (Note that the origin is at the upper left.) SLIDES ADAPTED FROM JURE LESKOVEC Jurafsky & Martin A variant of the basic algorithm: ❑ Maybe it is OK to have an unlimited # of gaps in the beginning and end: ----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC GCGAGTTCATCTATCAC--GACCGC--GGTCG-------------- If so, we don’t want to penalize gaps at the ends Jurafsky & Martin Different types of overlaps Example: 2 overlapping“reads” from a sequencing project Example: Search for a mouse gene within a human chromosome SLIDES ADAPTED FROM JURE LESKOVEC SLIDES ADAPTED FROM JURE LESKOVEC The Overlap Detection variant Changes: x1 ……………………………… xM y1 …………………… yN 1. Initialization For all i, j, F(i, 0) = 0 F(0, j) = 0 2. Termination SLIDES ADAPTED FROM JURE LESKOVEC maxi F(i, N) FOPT = max maxj F(M, j) Jurafsky & Martin The Local Alignment Problem Given two strings x = x1……xM, y = y1……yN Find substrings x’, y’ whose similarity (optimal global alignment value) is maximum x = aaaacccccggggtta y = ttcccgggaaccaacc SLIDES ADAPTED FROM JURE LESKOVEC Jurafsky & Martin The Smith-Waterman algorithm Idea: Ignore badly aligning regions Modifications to Needleman-Wunsch: Initialization: F(0, j) = 0 F(i, 0) = 0 0 Iteration: F(i, j) = max F(i – 1, j) – d F(i, j – 1) – d F(i – 1, j – 1) + s(xi, yj) SLIDES ADAPTED FROM JURE LESKOVEC Jurafsky & Martin The Smith-Waterman algorithm Termination: 1. If we want the best local alignment… FOPT = maxi,j F(i, j) Find FOPT and trace back 2. If we want all local alignments scoring > t ?? For all i, j find F(i, j) > t, and trace back? Complicated by overlapping local alignments SLIDES ADAPTED FROM JURE LESKOVEC Lab: https://dilab.ajou.ac.kr/ Web: https://leesael.github.io/ Email: [email protected]