Bioinformatics Systems Lecture 1
Dr. Samar Hesham Ahmed
Summary
This document is a lecture on bioinformatics systems, covering an introduction and a molecular biology primer. It presents fundamental concepts of bioinformatics, including the role of computers in biological data analysis and the design of algorithms that solve biological problems.
Full Transcript
Bioinformatics Systems
Lecture 1
By Dr. Samar Hesham Ahmed

References
▪ Essential Bioinformatics, Jin Xiong, Texas A&M University, Cambridge University Press
▪ An Introduction to Bioinformatics Algorithms, by Neil C. Jones and Pavel A. Pevzner

Assessment (100 marks)
Course Work: 10
Quizzes: 20
Midterm: 20
Final: 50

1 Introduction

Bioinformatics (computational molecular biology) is an interdisciplinary research area between computer science and biological science. Bioinformatics involves the technology that uses computers for storage, retrieval, manipulation, and distribution of information related to biological macromolecules such as DNA, RNA, and proteins. The emphasis here is on the use of computers because most of the tasks in genomic data analysis are highly repetitive or mathematically complex. The use of computers is vital in mining genomes for information gathering and knowledge building.

The question of how much biology a good computer scientist turned bioinformatician has to know seems to be best answered with "enough to deeply understand the biological problem and to turn it into an adequate computational problem". This course provides a very brief introduction to biology.

In this course, we will learn how to design algorithms that solve biological problems: how popular bioinformatics algorithms work, and what principles drove their design. It is important to understand how an algorithm works in order to be confident in its results; it is even more important to understand an algorithm's design methodology in order to identify its potential weaknesses and fix them.

2 Molecular Biology Primer

What Is Life Made Of?
A great diversity of cells exists in nature, but they all have some common features. All cells have a life cycle: they are born, eat, replicate, and die. One can envision a cell as a complex mechanical system with many moving parts.
Not only does it store all the information necessary to make a complete replica of itself, it also contains all the machinery required to collect and manufacture its components, carry out the copying process, and kick-start its new offspring. Some organisms, such as bacteria, are unicellular (they consist of a single cell) and are built of prokaryotic cells. Other organisms, like flies or humans, are multicellular and are built of eukaryotic cells. The figure below illustrates the structure of a eukaryotic cell.
https://www.sciencetopia.net/biology/eukaryotic-cell-structure

Despite the complexity of a cell, there seem to be a few organizing principles that are conserved across all organisms. All life on this planet depends on three types of molecule: DNA, RNA, and proteins. DNA (deoxyribonucleic acid) holds a vast library describing how the cell works. It contains the genetic instructions used in the development and functioning of all living organisms, as well as the instructions needed to construct other components of the cell, such as proteins and RNA molecules. RNA (ribonucleic acid) acts to transfer certain short pieces of this library to different places in the cell, at which point those smaller volumes of information are used as templates to synthesize proteins. Proteins form enzymes that perform biochemical reactions, send signals to other cells, form the body's major components (like the keratin in our skin), and otherwise perform the actual work of the cell. DNA, RNA, and proteins are examples of strings written in either the four-letter alphabet of DNA and RNA or the twenty-letter alphabet of proteins.

What Is the Genetic Material?
▪ The nucleus is a membrane-enclosed organelle found in most eukaryotic cells. It contains most of the cell's genetic material, organized into chromosomes. A chromosome is a long, continuous piece of DNA, which contains many genes, regulatory elements, and other intervening nucleotide sequences.
The shape and number of chromosomes in living cells vary according to the organism, suggesting that they might carry information specific to each species. The human genome comprises 24 distinct chromosomes (22 autosomes plus the X and Y sex chromosomes); the nucleus of a typical human somatic cell carries 46 chromosomes, in 23 pairs.

What Is the Structure of DNA?
▪ DNA is a long polymer chain of simple units called nucleotides, which usually exists in a double-helical shape, as shown in the figure. There are four types of nucleotides, forming the alphabet of DNA: A (adenine), C (cytosine), G (guanine), and T (thymine). Within genes, these letters are read as a sequence of 3-letter words called codons. "A" is chemically attracted to "T" (and G to C), so the two strands of the double helix are complementary: ATGACC is complementary to TACTGG. Thus, the nucleotide string of one strand is completely defined by the nucleotide string of the other. This is, in fact, the key to DNA replication.
https://iitway.com/mod/book/tool/print/index.php?id=984&chapterid=3114

What Carries Information between DNA and Proteins?
▪ DNA is written in a four-letter alphabet while proteins are written in a twenty-letter alphabet. What is the code that translates texts written in a four-letter alphabet into texts written in a twenty-letter alphabet?
▪ There are two types of cells: those that encapsulate their DNA in a nucleus and those that do not. The former are referred to as eukaryotic cells and the latter as prokaryotic cells. All multicellular organisms (like flies or humans) are eukaryotic, while most unicellular organisms (like bacteria) are prokaryotic.
▪ In eukaryotes, DNA resides within the nucleus, whereas protein synthesis happens outside the nucleus, in the cytoplasm. DNA serves as a template used to copy (transcribe) a particular gene into messenger RNA (mRNA), which carries the gene's genetic information to the ribosome to make a particular protein. Protein synthesis in the cytoplasm happens with the help of certain large molecules called ribosomes.
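The base-pairing rule described above (A with T, G with C) can be sketched as a short function. This is a minimal illustration, not part of the lecture; the function name is my own:

```python
# Base-pairing rule for DNA: A pairs with T, and G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand: str) -> str:
    """Return the base-by-base complementary strand."""
    return "".join(PAIR[base] for base in strand)

print(complement("ATGACC"))  # TACTGG
```

Applying the function twice returns the original strand, which is exactly why one strand completely defines the other.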
The mRNA (responsible for the synthesis of a particular protein) is complementary to the DNA segment (i.e., the gene) that codes for the protein. The process is shown in the next figure.
▪ Chemically speaking, RNA, or ribonucleic acid, is almost the same as DNA. There are two main differences between RNA and DNA. First, there is no T (thymine) base in RNA; the similar base U (uracil) takes its place. Second, DNA is mostly inert and almost always double-stranded, helping it to serve as a static repository for information, whereas RNA is more chemically active and usually lives in a single-stranded form. The effect is that RNA can carry short messages from the DNA to the cellular machinery that builds proteins, and it can actively participate in important chemical reactions.
▪ The DNA segments that carry genetic information are called genes. Genes are segments of DNA that carry the hereditary characteristics of a living organism. They contain the information necessary to be transcribed into a functional RNA, which is in turn translated into protein in a process called gene expression.
▪ The process of turning mRNA into a protein is called translation, since it translates information from the RNA (written in a four-letter alphabet) into the protein (written in a twenty-letter alphabet). All proteins, including the ones necessary for this process, are produced by this process.
▪ Therefore, the gene expression process is divided into two phases: transcription and translation. This flow of information is referred to as the central dogma of molecular biology.

Central dogma in molecular biology

▪ Amino acids are linked together into linear chains to form proteins. The properties of proteins are defined by the composition and arrangement of their amino acids.
▪ The triplets of consecutive letters in DNA (called codons) are responsible for the amino acid sequence in a protein.
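The two phases can be sketched in a few lines. This is a minimal, illustrative sketch, not from the lecture: it assumes the template DNA strand is copied base by base (T pairs with A, A with U), and it uses a deliberately partial codon table, since the full genetic code has 64 entries:

```python
# Transcription: complement the template DNA strand, writing U instead of T.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

# Partial codon table, for illustration only (the full code has 64 codons).
CODON_TABLE = {
    "AUG": "Met", "GCG": "Ala", "CCG": "Pro", "AUA": "Ile",
    "ACG": "Thr", "GUC": "Val", "CUU": "Leu", "CCU": "Pro",
    "UGA": "Stop",
}

def transcribe(template: str) -> str:
    """Copy a template DNA strand into an mRNA string."""
    return "".join(DNA_TO_RNA[base] for base in template)

def translate(mrna: str) -> list[str]:
    """Read the mRNA codon by codon, stopping at a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "Stop":
            break
        protein.append(amino_acid)
    return protein

mrna = transcribe("TACCGCGGCTAT")
print(mrna)             # AUGGCGCCGAUA
print(translate(mrna))  # ['Met', 'Ala', 'Pro', 'Ile']
```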
A particular 30-base gene in DNA thus corresponds to ten codons, which determine a specific sequence of amino acids in a specific order (in the example below, nine amino acids followed by a stop signal). Ribosomes read consecutive codons and locate the corresponding amino acid for inclusion in the growing polypeptide chain. When one amino acid has been added, the ribosome shifts one codon to the right, and the process repeats.

DNA:     TAC CGC GGC TAT TAC TGC CAG GAA GGA ACT
RNA:     AUG GCG CCG AUA AUG ACG GUC CUU CCU UGA
Protein: Met Ala Pro Ile Met Thr Val Leu Pro Stop

▪ There are 4³ = 64 different codons formed from the four letters {A, C, G, U}, which is more than three times the number of amino acids. Three of the 64 codons are reserved to act as stop codons; the remaining 61 encode the 20 amino acids, so more than one codon (triplet of nucleotides) can map to the same amino acid. The genetic code mapping is shown in the next table.

How Can We Analyze DNA?
▪ There are some important techniques for copying, cutting, pasting, measuring, and probing DNA.

Copying DNA
▪ Most experimental techniques require many copies of the same DNA fragment. Since it is difficult to detect a single molecule, or even a hundred molecules, with modern instrumentation, amplifying DNA to yield millions or billions of identical copies is often a prerequisite of further analysis.
▪ One method, the polymerase chain reaction (PCR), is the printing press for DNA. PCR amplifies a short (100- to 500-nucleotide) DNA fragment and produces a large number of identical DNA strings. To use PCR, one must know a pair of short (20- to 30-letter) strings in the DNA flanking the area of interest and design two PCR primers, synthetic DNA fragments identical to these strings.
▪ Suppose we want to generate a billion copies of a DNA fragment of 500 nucleotides that we know happens to be flanked by the 20-mer nucleotide sequence X on the left and the 20-mer nucleotide sequence Y on the right.
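In the idealized case, each PCR cycle doubles the number of target fragments, so n cycles turn one copy into up to 2^n copies. A quick check (illustrative only) of how many cycles the billion copies above would need:

```python
# Idealized PCR: each cycle at best doubles the number of target fragments,
# so after n cycles one starting copy becomes up to 2**n copies.
copies, cycles = 1, 0
while copies < 1_000_000_000:
    copies *= 2
    cycles += 1
print(cycles, copies)  # 30 cycles give 2**30 = 1,073,741,824 copies
```

So roughly 30 cycles suffice for a billion copies, which is why PCR is so effective in practice.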
PCR repeats a cycle of three operations (denaturation, priming, and extension) to double the number of DNA fragments in every iteration.
▪ We also need a molecular machine that will copy an existing DNA strand to produce a new DNA strand; for this purpose, we use DNA polymerase. DNA polymerase has the ability to add a complementary copy to a single-stranded DNA as long as there is a primer (i.e., x̄ and Y) attached to the DNA strand and a sufficient supply of spare nucleotides.

The three main operations in PCR:
▪ Denaturation (top) is performed by heating the solution of DNA until the strands separate (double-stranded DNA into two single strands), which happens at around 95°C.
▪ Priming (middle) is cooling the solution down to allow the primers x̄ and Y to hybridize to their complementary positions in the DNA, which happens at around 55°C.
▪ In the extension step (bottom), DNA polymerase extends each primer to produce two double-stranded DNA copies from the single-stranded DNA and the excess free nucleotides that are added, which happens at around 72°C.
▪ By repeatedly performing these three steps, one achieves an exponential increase in the amount of DNA, as shown in the next figure: within three iterations we can go from one copy of the target DNA to eight copies.

BIOLOGICAL DATABASES
▪ Current biological databases use three different types of database structures: flat files, relational, and object-oriented. Despite the obvious drawbacks of using flat files in database management, many biological databases still use this format.
▪ Based on their contents, biological databases can be roughly divided into three categories: primary databases, secondary databases, and specialized databases.
▪ Primary databases contain original biological data. They are archives of raw sequence or structural data submitted by the scientific community. GenBank and the Protein Data Bank (PDB) are examples of primary databases.
▪ Secondary databases contain computationally processed or manually curated information, based on original information from primary databases. Translated protein sequence databases containing functional annotation belong to this category. Examples are SWISS-PROT and the Protein Information Resource (PIR), the successor of Margaret Dayhoff's Atlas of Protein Sequence and Structure.
▪ Specialized databases are those that cater to a particular research interest. For example, FlyBase, the HIV sequence database, and the Ribosomal Database Project are databases that specialize in a particular organism or a particular type of data.

3 Algorithms and Complexity

Algorithm
▪ An algorithm is a sequence of instructions that one must perform in order to solve a well-formulated problem. We will specify problems in terms of their inputs and outputs, and the algorithm will be the method of translating the inputs into the outputs. A well-formulated problem is unambiguous and precise.
▪ To understand how an algorithm works, we need some way of listing the steps that the algorithm takes, while being neither too vague nor too formal. We will use pseudocode, a language computer scientists often use to describe algorithms: it ignores many of the details that are required in a programming language.
▪ A problem describes a class of computational tasks. A problem instance is one particular input from that class.

Correct versus Incorrect Algorithms
We say that an algorithm is correct when it translates every input instance into the correct output. An algorithm is incorrect when there is at least one input instance for which the algorithm does not produce the correct output. So, unless you can justify that an algorithm always returns correct results, you should consider it to be wrong.

Fast versus Slow Algorithms
You could estimate the running time of an algorithm simply by taking the product of the number of operations and the time per operation.
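For instance (the numbers here are assumed purely for illustration), an algorithm performing a billion operations on a machine that executes one operation per nanosecond would take about one second:

```python
# Estimated running time = number of operations x time per operation.
operations = 1_000_000_000   # 10**9 operations (assumed workload)
seconds_per_op = 1e-9        # one nanosecond per operation (assumed machine)
print(operations * seconds_per_op)  # about 1 second
```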
However, computing devices are constantly improving, leading to a decreasing time per operation, so your notion of the running time would soon be outdated. Rather than computing an algorithm's running time on every computer, we rely on the total number of operations that the algorithm performs to describe its running time, since this is an attribute of the algorithm, and not an attribute of the computer you are using.

Unfortunately, determining how many operations an algorithm will perform is not always easy. If we know how to compute the number of basic operations that an algorithm performs, then we have a basis to compare it against a different algorithm that solves the same problem. Rather than tediously counting every multiplication and addition, we can perform this comparison by gaining a high-level understanding of the growth of each algorithm's operation count as the size of the input increases.

To illustrate this, suppose an algorithm A performs 11n³ operations on an input of size n, while B solves the same problem in 99n² + 7 operations. Which algorithm, A or B, is faster? Although A may be faster than B for some small n (e.g., for n between 0 and 9), B becomes faster for large n (in fact, for all n ≥ 10). Since n³ is, in some sense, a faster-growing function than n² with respect to n, the constants 11, 99, and 7 do not affect the competition between the two algorithms for large n. We refer to A as a cubic algorithm and to B as a quadratic algorithm, and say that A is less efficient than B because it performs more operations to solve the same problem when n is large.

Big-O Notation
▪ Computer scientists use Big-O notation to describe concisely the running time of an algorithm. If we say that the running time of an algorithm is quadratic, or O(n²), it means that the running time of the algorithm on an input of size n is bounded by a quadratic function of n.
That bound may be 99.7n², or 0.001n², or 5n² + 3.2n + 99993. The main factor that describes the growth rate of the running time is the term that grows the fastest with respect to n, for example n² when compared to terms like 3.2n or 99993. All functions with a leading term of n² have more or less the same rate of growth, so we lump them into one class, which we call O(n²). The difference in behavior between two quadratic functions in that class, say 99.7n² and 5n² + 3.2n + 99993, is negligible when compared to the difference in behavior between two functions in different classes, say 5n² + 3.2n and 1.2n³. Of course, 99.7n² and 5n² are different functions, and we would prefer an algorithm that takes 5n² operations to one that takes 99.7n². However, computer scientists typically ignore the leading constant and pay attention only to the fastest-growing term.

Algorithm Design Techniques
▪ We will illustrate the most common algorithm design techniques. Future examples can be categorized in terms of these design techniques.
▪ To illustrate the design techniques, we will consider a very simple problem. Suppose your cordless phone rings, but you have misplaced the handset somewhere in your home, and it is dark out, so you cannot rely solely on eyesight. How do you find it?

Exhaustive Search (Brute Force Algorithms)
▪ An exhaustive search, or brute force, algorithm examines every possible alternative to find one particular solution.
▪ For example, if you used the brute force algorithm to find the ringing telephone, you would ignore the ringing of the phone and simply walk over every square inch of your home, checking to see if the phone was present. You probably would not be able to answer the phone before it stopped ringing, unless you were very lucky, but you would be guaranteed to eventually find the phone no matter where it was.
▪ These are the easiest algorithms to design and understand, and sometimes they work acceptably for certain practical problems in biology. In general, though, brute force algorithms are too slow to be practical for anything but the smallest instances, and we will try to avoid brute force algorithms or to finesse them into faster versions.

Branch-and-Bound Algorithms
▪ In certain cases, as we explore the various alternatives in a brute force algorithm, we discover that we can omit a large number of alternatives, a technique that is often called branch-and-bound, or pruning.
▪ Suppose you were exhaustively searching the first floor and heard the phone ringing above your head. You could immediately rule out the need to search the basement or the first floor. What may have taken three hours may now take only one, depending on the amount of space that you can rule out.

Greedy Algorithms
▪ Many algorithms are iterative procedures that choose among a number of alternatives at each iteration; you can view the problem as a series of decisions you have to make. Some of these alternatives may lead to correct solutions while others may not. Greedy algorithms choose the "most attractive" alternative at each iteration, a strategy that, in general, may produce incorrect results.
▪ In the telephone example, the corresponding greedy algorithm would simply be to walk in the direction of the telephone's ringing until you found it. The problem here is that there may be a wall (or an expensive and fragile vase) between you and the phone, preventing you from finding it. Unfortunately, these sorts of difficulties frequently occur in most realistic problems. In many cases, a greedy approach will seem "obvious" and natural, but will be subtly wrong.

Dynamic Programming
▪ Some algorithms break a problem into smaller subproblems and use the solutions of the subproblems to construct the solution of the larger one.
During this process, the number of subproblems may become very large, and some algorithms solve the same subproblem repeatedly, needlessly increasing the running time. Dynamic programming organizes computations to avoid recomputing values that you already know, which can often save a great deal of time.

Divide-and-Conquer Algorithms
One big problem may be hard to solve, but two problems that are half the size may be significantly easier. Divide-and-conquer algorithms split the problem into smaller subproblems, solve the subproblems independently, and combine the solutions of the subproblems into a solution of the original problem. The situation is usually more complicated than this: after splitting one problem into subproblems, a divide-and-conquer algorithm usually splits these subproblems into even smaller sub-subproblems, and so on, until it reaches a point at which it no longer needs to recurse. A critical step in many divide-and-conquer algorithms is the recombining of solutions to subproblems into a solution for the larger problem. Often, this merging step can consume a considerable amount of time.

Machine Learning
Machine learning algorithms often base their strategies on the computational analysis of previously collected data. One way to solve the phone search problem with this approach is to collect statistics over the course of a year about where you leave the phone, learning where the phone tends to end up most of the time. If the phone was left in the living room 80% of the time, in the bedroom 15% of the time, and in the kitchen 5% of the time, then a sensible time-saving strategy would be to start the search in the living room, continue to the bedroom, and finish in the kitchen.

Randomized Algorithms
If you have a coin, then before even starting to search for the phone, you could toss it to decide whether you want to start your search on the first floor (if the coin comes up heads) or on the second floor (if the coin comes up tails).
Although tossing coins may be a fun way to search for the phone, it is certainly not the intuitive thing to do, nor is it at all clear whether it gives you any algorithmic advantage over a deterministic algorithm. Nevertheless, randomized algorithms help to solve many practical problems, and some of them have a competitive advantage over deterministic algorithms.
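As a minimal, self-contained illustration of the dynamic-programming idea described above (this classic Fibonacci example is my own addition, not from the lecture), compare a naive recursion that re-solves the same subproblems over and over with a version that stores each answer the first time it is computed:

```python
from functools import lru_cache

def fib_naive(n: int) -> int:
    """Recomputes the same subproblems repeatedly (exponential time)."""
    if n < 2:
        return n
    return fib_naive(n - 1) + fib_naive(n - 2)

@lru_cache(maxsize=None)
def fib_dp(n: int) -> int:
    """Each subproblem is solved once and its result reused (linear time)."""
    if n < 2:
        return n
    return fib_dp(n - 1) + fib_dp(n - 2)

print(fib_dp(30))  # 832040
```

Both functions return the same values, but the memoized version touches only 31 distinct subproblems for n = 30, while the naive one makes over a million recursive calls: the same reorganization of computation that dynamic programming applies to biological sequence problems.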