BIOC 3265 Lecture 3 - Algorithms PDF
Document Details
UWI Cave Hill
Dr. A. T Alleyne
Tags
Summary
This document is a lecture on algorithms, particularly focusing on concepts relevant to bioinformatics. It details different search algorithms like linear and binary search, and explores computational complexities captured by Big O notation. The material also includes discussions on Perl programming and command line prompts.
Full Transcript
Lecture 3: Algorithms BIOC 3265- Principles of Bioinformatics Dr. A. T Alleyne- UWI Cave Hill This Photo by Unknown Author is licensed under CC BY-SA 1 ...
Lecture 3: Algorithms BIOC 3265- Principles of Bioinformatics Dr. A. T Alleyne- UWI Cave Hill This Photo by Unknown Author is licensed under CC BY-SA 1 LEARNING OUTCOMES After this lecture you should be able to: 1. Define an algorithm and explain the properties of an algorithm 2. Define the “Big O” notation 3. Explain the differences between a naïve and a binary search 4. Explain the need for heuristic algorithms 5. Compare tractable and intractable solutions 6. Recognize the command line prompt 7. Recognize a perl statement 8. Display knowledge of simple computer terms e.g. Command line interface and GitHub 2 A sequence of instructions used to perform a specified task that leads to a precise solution. 3 Algorithm vs Computer programs Algorithm a structured procedure in a computer program program independent-can be computer program implemented in more than one a series of algorithms coded in programming language computer language implement algorithms 4 Two key Algorithm properties Correctness: proof that the algorithm correctly solves the given problem or task. Efficiency: speed at which the algorithm arrives at the solution to the task or problem. ̶ An algorithm’s efficiency is usually expressed in terms of the number of calculations or operations required for a task of size “N” ̶ The number of operations required as a function of the input length 5 Algorithm complexity ̶ Describes the time taken for an algorithm to run ̶ “Big –O” notation or Landau's symbol is a common measure of algorithm complexity ̶ Scanning an unordered list of N items, in “Big O” notation is written O(N) ̶ If it took N2 operations for the algorithm, then “Big O = O(N)2 6 Big “O” notation ̶ Describes the (efficiency) computer running time and space requirements for an algorithm, as the input size increases. ̶ time complexity is the number of steps it takes to complete an algorithmic task As the input size increases ̶ Space complexity is the amount O(log n) is better than O (n)2 of memory an algorithm uses- O(1) is constant O(log n) is logarithmic bytes O(n) is linear O(n2 ) is quadratic 7 Simple Big O applications Big O Class Example O(1) constant Complexity remains the same regardless of the size of the input. O(n) Linear Linear search. Complexity grows in a steady linear fashion O(log n) Logarithmic Binary Search. Complexity increases in small linear increments as the input grows exponentially larger and larger. O(2)n Exponential Intractable problem. Complexity doubles as the inputs grow exponentially. 8 Pseudocode: a description of the steps in an algorithm in plain language or text Example: 1: write the list down 2: Search until the number appears 9 Let us write a Pseudocode Task= how many males are in the class? Possible Pseudocode: 1. List the number of students in the class 2. Assign each person a gender ( M_males and F_females) 3. Count the number of M 4. Print number 5. End/ 10 Linear or Naïve search Example: Search for a number in an unordered list: 239876113732 If the list has N numbers, a linear or naïve search will search the entire list until the number is found. The average amount of time the search will take will be proportional to N. 11 Find the number 33 in this list 12 Example: Binary search An alternate approach ( Binary search- divide and conquer) 1. Step 1: Place the numbers in order ( build an array) 2. Step 2: pick a number in the middle of the list 3. Step 3: restrict the search to the half the contains your number 4. Step 4: Return to step 1 until you find your number Note: The time taken for this approach is proportional to logN 13 Binary Search half-interval search, logarithmic search, or binary chop, is a search algorithm that finds the position of a target value within a sorted array. A binary search works by finding the middle element of a sorted array 14 and comparing it to your target element. Example: Gene Name search Data (problem): You have a file containing a list of protein names (IDs) and the corresponding protein sequences ( see table). Solution (algorithm): Devise a search algorithm that will find the protein sequence for the user-specified protein. 15 Table 1 ID PROTEIN NAME PROTEIN SEQUENCE 1 GroEL MKDRGVVANLDQ.... 2 RAS MLENKAAWLFSST... 3 P53 MGGALNKLASTCI.... 4 CYP VANDLSSQELMN… 5 eiF PRIHDNGGLNNN.. * AATMNLNMMNM… N ACT 1 MAANFGIYCHINPR... 16 Naive Search Strategy: 1. Obtain protein name from user (username). 2. Use Naive Search Algorithm to scan through list of protein names (protein_name) to find match.* 3. Output: corresponding protein sequence (or error message if no match is found). 17 The Naive Search Algorithm *Implementation of step 2 in pseudo code: ̶ For each protein name in list ̶ {if (protein_ name is same as username) {print corresponding protein sequence}. ̶ If (protein_ name is not same as username) {error message}. 18 Analysis of Naive Search Algorithm Correctness: the algorithm checks each protein name for a match, and outputs the protein sequence of a correct match. Efficiency: The number of comparisons made on average using the algorithm For a list of N protein names, the average search will make N/2 comparisons (more if protein names not on list are frequently chosen) O(N) = efficiency 19 Algorithmic Search in Pseudocode Strategy: 1. Sort protein name (and sequence) entries in alphabetical order. 2. Obtain protein name from user (username). 3. * Use Binary Search Algorithm to search through list of protein names (protein_name) to find match. 4. Output corresponding protein sequence (or error message if no match is found). 20 Binary Search Algorithm in Pseudocode Implementation of step 3 in pseudocode: Initial conditions: left_ID = 1; right_ID = N ID PROTEIN PROTEIN SEQUENCE NAME 1. Find protein_name of mid_ID = (left_ID + right_ID)/2 from 1 GroEL MKDRGVVANLDQ.... the alphabetically sorted protein name list. 2 RAS MLENKAAWLFSST... 2. If (user_name is same as protein_name) {return 3 P53 MGGALNKLASTCI.... corresponding protein sequence} 4 CYP VANDLSSQELMN… 3. If (user_name occurs after protein name [alphabetically] 5 eiF PRIHDNGGLNNN.. 4. Repeat step 1 with left_ID = mid_ID + 1 and right_ID = * AATMNLNMMNM… right_ID} N ACT 1 MAANFGIYCHINPR... 5. If (username occurs before protein name) {Repeat step 1 with left_ID = left_ID and right_ID = mid_ID - 1 21 Analysis of Binary Search Algorithm Correctness: the algorithm will identify a protein name and print its sequence from an alphabetically sorted list of protein names. Efficiency: The number of comparisons made on average using this algorithm. Each comparison eliminates half of the possible protein names in the list. The maximum number of comparisons is equal to the number of times we must do (N/2) comparisons, until there is one protein name left. For a list of N protein names, the average search will make, at most, log2(N) comparisons Binary search has O(log N) efficiency 22 Comparison of Naive and Binary Search Algorithms For large lists of protein names, the Binary search algorithm works many orders of magnitude times faster than the Naive Search method. Number of Protein Names Naive Search Binary Search (N) 10 10 4 100 100 7 1,000 1,000 10 10,000 10,000 14 100,000 100,000 17 1,000,000 1,000,000 20 23 Binary Naive O(N) O(log 2 N) starts from the first point and begins from ends at the last the mid- point. point Can be an unordered array must list be sorted in an order Iterative divide and conquer 24 Intractable Algorithms Sequence alignment between two sequences of equal length Aligning two sequences of length of N requires finding a shortest path in a graph with N2 vertices. Essentially it requires visiting all vertices once, so its time complexity is O(N2). ̶ O(N2)- N is squared, where N is the length of the two sequences ̶ For sequences of varying length this can be written O(XN), where X is a constant problem* This type of algorithm, however, is intractable (difficult to solve) because a small N is required to be efficient and complete. These algorithms consume increasingly more memory as the input size grows 25 Graph coloring Question? : How many colors do you need to color a graph so that no two vertices are of the same colour? Solution? a systematic tracing of all paths, comparison of neighboring colors, backtracking, etc., resulting in exponential time complexity once again. As the size of the integers (i.e. the size of n) grows linearly, the size of the computations required to check all subsets and their respective sums grow exponentially 26 Intractable algorithms are Complex problems common in bioinformatics for example in complex tasks such as: ̶ Multiple sequence alignments ̶ Phylogenetic tree analysis Notation Time Solution ̶ Structure prediction problems complexity 2n Exponential Intractable This problem is solved by using n2 polynomial intractable very good approximate solutions n Linear tractable described as heuristic or approximate algorithm Log n Logarithmic tractable 27 Heuristic Algorithms heuristic algorithms are fast, but good approximate solutions for complex problems in bioinformatics. There are no polynomial time algorithms which can solve them. Exhaustive searching is impractical 28 Programming Languages: Basics Definition: a programming language is a tool that allows one to write algorithms and instructions for the computer. The source code--the instructions written in a programming language--is converted by the language compiler or interpreter into binary instructions that can be processed by the computer 29 Programming Languages Compiled languages: - C, C++, Java Scripting languages: - Perl, Python, Ruby Web languages: - html, PHP, Perl, JavaScript Database languages: - SQL 30 This Photo by Unknown Author is licensed under CC BY-NC Perl An interpreted language, which means that your code can be run without a compilation stage that creates a non portable executable program used in many bioinformatics programs A Perl program consists of Perl commands called Perl script File names end in.pl There are two basic steps in writing a Perl program 1. Write the program in a text file 2. Run the Perl interpreter to execute the written script 31 32 The Perl Statement All Perl programs begin with a statement ̶ This represents an instruction ̶ Each statement ends in a semi colon (;) A statement may consist of one line or more than one line of code ̶ It indicates where an execution begins and where it ends ̶ The second and third lines of each statement are indented A block consists of a set of statements in curly brackets -{$ total = 9} 33 Simple Perl operators Operator Effect #!/ Indicates a perl file *, /, +, - Multiplication, division, addition and subtraction $ Variable with only one value (scalar variable) ++, , -- Add 1 to a numeric value, subtracts 1 from a numeric value print Sends information to view output on computer screen \n Found at the end of print and moves the display to the next line of information 34 Example of Perl Statements or Scripts #!/user/bin/perl # print my name\n 35 Python Python - powerful object-oriented programming language, comparable to Perl, or Java. Python is an interactive interpreted language Used widely today for analytics and data science At the end of a code there are no special characters All variables or objects are named by using an Identifier Python Operators (w3schools.com) About Python | Python.org 36 Command Line prompt The CLI is a primary means of interaction with most computer systems, especially in Bioinformatics Precise Commands or instructive codes are typed into the window for execution Three popular operating systems include: UNIX, MacOs and Windows. Windows can be converted into a Command prompt screen for programming 37 GIT HUB GitHub provides a secure, collaborative environment to design, develop, and use any software It is also a repository for open-source programs and upgrades to these programs in real time Used online for development and improvement of computer programmes An open-source community for coding and improving computer codes (147) What is Git? Explained in 2 Minutes! - YouTube Uses a version control system 38 References 2001 Per Kraulis Some definitions for sequence alignments Fundamental concepts of Bioinformatics, Krane, D. E. and Raymer, M. L (2003) Benjamin Cummings, CA. USA Bioinformatics and functional Genomics 2nd ed. (2009), Johnathan Pevsner. Wiley and Sons 39