Bioinformatics Chapter 1 & 2 PDF

Summary

This document is an academic text about bioinformatics and genetic diseases. It covers the basics of exploring genetic data, using Python programming to manipulate biological sequences.

Full Transcript

Chapter 1 Exploring Bioinformatics Introduction In the late 19th century, Austrian monk George Mendel (1822 - 1884) quietly initiated a genetic revolution by providing science with its first tools for understanding how characteristics are inherited. He discovered the basi...

Chapter 1 Exploring Bioinformatics Introduction In the late 19th century, Austrian monk George Mendel (1822 - 1884) quietly initiated a genetic revolution by providing science with its first tools for understanding how characteristics are inherited. He discovered the basic principles of heredity (‫ )الوراثة‬through experiments in his garden. His experiments showed that the inheritance of certain traits in pea plants follows particular patterns, subsequently becoming the foundation of modern genetics and leading to the study of heredity. The identification of the DNA (deoxyribonucleic acid) as the genetic material in 1953 sent the revolution in a new direction, allowing researchers to directly correlate changes in genes with the workings of cells and organisms (‫ )الكائنات الحية‬in health and disease. The floodgates of genetic information truly opened in the mid-1990s, as large-scale DNA sequencing technology became capable of revealing the complete nucleotide sequence of entire genomes (3 billion base pairs make up the human genome which was completed in 2003). The challenge now is to extract the secrets that lie within the vast and over-increasing genetic information (Figure 1). Unfortunately, even with our tremendous technological advances and all the data we have amassed, the discovery process is slow and many fascinating and important problems will remain unsolved for years to come. 1 Figure 1: DNA: the genetic material of all living things Bioinformatics is a new science at the interface of molecular biology and computer science seeking to develop better ways to explore, analyze, and understand genetic data. 2 Bioinformatics is a field of computational science that has to do with the analysis of sequences of biological molecules. It usually refers to genes, DNA, RNA, or protein, and is particularly useful in comparing genes and other sequences in proteins and other sequences within an organism or between organisms, looking at evolutionary relationships between organisms, and using the patterns that exist across DNA and protein sequences to figure out what their function is. You can think about bioinformatics as essentially the linguistics part of genetics. That is, the linguistics people are looking at patterns in language, and that's what bioinformatics people do--looking for patterns within sequences of DNA or protein. - Christopher P. Austin, M.D., Director – National Institute of Health Typical Bioinformatics Scenario Nova Scotia (Canada) is highly affected by Nova Scotia Niemann-Pick disease (NS-NPD). NS- NPD is a neurodegenerative (‫ )التنكس العصب‬disorder that affects the nervous system and appears commonly in Children. The disease is believed to be a result of a single mutant gene. The residents are encouraged to marry from outside the community. However, in Stoccareddo, Italy 97% of the population are closely related but amazingly healthy and therefore, researchers are seeking to identify the gene that contributes to the well-being. In this case, the residents are encouraged to marry from inside the community. These two scenarios could lead researchers to answer the following questions:  What happened over the 800-year history of Stoccareddo that left the residents healthy?  What “good” genes contribute to the startling good health of the community?  Will the disease gene be weeded out over time?  Could modern technology bring similar benefits to others?  What can we do for those who have already inherited the disease? Bioinformatics Solutions: Computational approaches to biological problems Why does a biologist need a computer to solve biological related challenges?  The biological data is huge: Sequencing an entire genome for instance requires sequencing a huge number of individual segments and then reassembling them to give the 3 original sequence. The availability of inexpensive computing power and the development of efficient software solutions have made this process practical.  Computers are especially well-suited: The human genome can be thought of as a very long digital message.  Making the data widely accessible: More than 750 complete genomes of organisms ranging from bacteria to human are now in databases such as GenBank, a genetic sequence database maintained by the National Center for Biotechnology Information (NCBI), EMBL (European Molecular Biology Laboratory), and UniProt (Universal Protein Resources). Examples of biological questions whose solutions will include a key role for computer technology:  Why don’t all tumors of the same type respond to the same chemotherapies?  What are the causes of the hundreds of genetic diseases?  How should we define species, and how can we determine whether two different organisms belong to the same species?  How do bacteria and viruses cause disease, and how can a better understanding of their physiology (‫ )علم وظائف األعضاء‬improve prevention and treatment?  What is the risk that a child will inherit a genetic disease?  Why COVID-19 is severe for some patients and mild for others? 4 Class activities  Download and install Anaconda https://www.anaconda.com/products/individual#Downloads Jupyter notebook # Import the package import Bio from Bio.Seq import Seq # Define a DNA sequence seq = Seq("AGGTACACTGGT") print(seq) # Find the sequence complement print (seq.complement()) # Treat the sequence as string and create an index of each letter for index, letter in enumerate(seq): print("%i %s" % (index, letter)) # Get the length of a sequence print(len(seq)) # Return a letter using the index print(seq) #first letter print(seq[-1]) #last letter # How many times does a letter or a combination of letters appear in the sequence seq.count("C") seq.count("GG") 100 * float(seq.count("G") + seq.count("C"))/len(seq) # Slicing a sequence seq[4:7] seq[0::3] # Stride # Concatenating or adding sequences seq1 = Seq("AACCTG") seq2 = Seq("ACGT") seq1 + seq2 list_of_seqs = [Seq("ACGT"), Seq("AACC"), Seq("GGTT")] concatenated = Seq("") for s in list_of_seqs: concatenated += s print(concatenated) # Read a sequence from a file x from Bio import SeqIO for seq_record in SeqIO.parse("x.fasta", "fasta"): print(seq_record.id) print(repr(seq_record.seq)) print(len(seq_record)) 5  Install Orange: (It's included in Anaconda as well) Orange is an open-source data visualization, machine learning, and data mining toolkit. It features a visual programming front-end for explorative rapid qualitative data analysis and interactive data visualization. The process to download and install Orange and adding the bioinformatics package is shown in Figure 2. Figure 2: Install Orange and add the bioinformatics package 6 How many genes are involved in the Niemann-Pick disorder? What are these gene(s)? Working with sequences using Python: 7 Treat the sequences like strings and perform parsing: For some biological uses, you may want an overlapping count or slicing the sequence 8 Homework: 1. Web exploration: visit the National Center for Biotechnology Information (NCBI) website and through the Mendelian Inheritance in Man (OMIM) database, read more about the Nova Scotia Niemann-Pick disease (NS-NPD) and answer the following questions: a) Is NS-NPD caused by a defect in a single gene, or is more than one gene involved? b) What gene(s) is/are involved in NS-NPD? c) What chromosome(s) is/are the gene(s) located on? d) How many nucleotides are there in the gene responsible for NS-NPD? How many amino acids are there in the protein it encodes? Get the protein sequence in Fasta format and confirm the number of amino acids in the sequence. 2. The DNA is composed of two long chains or strands of nucleotide (A, C, G, and T), and that the two strands are held together by base-pairing: A base-pair with T, G base-pair with C, e.g. ACGTAC → determine the other strand. ACGTAC. Write an algorithm to find the second strand of the input DNA. 9 References 1. Caroline St. Clair and Jonathan Visick (2010) Exploring Bioinformatics: A Project-Based Approach. ISBN-10: 0-7637-5829-9. Jones and Bartlett. 2. Jeff Chang, Brad Chapman, Iddo Friedberg, Thomas Hamelryck, Michiel de Hoon, et al. (2021) Biopython tutorial and cookbook. 10 Chapter 2 The Central Dogma: Exploring Genetic Disease Objectives:  Understanding how DNA carries the information needed to make a protein and how to determine the sequence of the mRNA and protein encoded by a gene.  Gain familiarity with the representation of the DNA and the protein sequences as strings and how the sequences can be maintained.  Use web-based tools to find sequenced genes, manipulate sequences, identify coding sequences, convert coding sequences to protein sequences, and identify mutations.  Learn basic Python programming syntax and write programs to manipulate DNA and protein sequences.  Understand the value of sequence comparison in the detection of mutations and write a basic sequence comparison program. Introduction Mary and her husband would like to start a family. However, Mary carries the sickle-cell trait (‫ )فقر الدم المنجل‬and is concerned that she might have a child with the Sickle-cell disease- an inherited disorder causing red blood cells to take a sickle-like shape (Figure 1Figure 1). Misshapen (‫ )مشوه‬red blood cells can get stuck in small capillaries (‫)الشعيات الدموية‬, ‫ر‬ causing pain and depriving tissues of oxygen (‫األكسجي‬ ‫ر‬ ‫)حرمان األنسجة من‬. If Mary’s child inherits this inaccurate disease, he or she would suffer frequent, painful attacks and would probably die by middle age. Figure 1: “Sickled” red blood cells 1 Understanding the problem: Identifying key mutations The sickle-cell disease was the first genetic disease to be understood at the molecular level. It results from a change in a gene. This change, or mutation, alters the function of a protein: in this case, the oxygen-carrying protein (hemoglobin). Any change in DNA can be considered a mutation. However, not all mutations produce genetic diseases. When we know the specific mutation that produces a genetic disease, genetic testing can be done to determine if a particular individual carries that disease (like the blood test Mary wants her husband to do). The sickle-cell disease results from a defect in hemoglobin, the oxygen-carrying protein found in red blood cells. This essential protein is actually a complex structure made up of four individual protein subunits (Figure 2). One gene, HBA, located on human chromosome 16, encoded the α-globin protein, while gene, HBB, located on human chromosome 11, encodes the β-globin protein. Two α-globin and two β- globin molecular come together to make a complete hemoglobin protein. Figure 2: The hemoglobin molecule 2 Defective hemoglobin could be a result of a mutation in either HBA or HBB, or in any of the genes needed to make the heme1 ring (1Heme: An iron-containing molecule attached to hemoglobin protein that actually carries out the function of carrying oxygen) as illustrated in Figure 3. Figure 3: An individual has two copies of each gene: the same (homozygous - ‫ )متماثل‬or different (heterozygous). An allele is a variant form of a gene. Some genes have a variety of different forms, which are located at the same position, or genetic locus, on a chromosome. The role of genes All functions of the human body depend on proteins, such as the thousands of different enzymes that carry out the many chemical reactions required for life. Making each kind of protein requires gene-coded information carried in DNA. The DNA in the nucleus of every human cell carries the information to make every possible human protein. However, not all genes are turned on (expressed) in every cell. A liver cell for example expresses a different set of genes than a skin cell or a brain cell. But all of the possible genes – the entire human genome – are present in all three cell types. The question now is how the computer can be used to manipulate the digital information in DNA. The DNA is composed of four different subunits, called nucleotides, or more informally, bases: Adenine (A), guanine (G), cytosine (C), and thymine (T). These bases, illustrated in Figure 4, can be joined in any order to form a long DNA molecule (an average human chromosome is a single DNA molecule more than 100 million nucleotides long!) The DNA molecule is two nucleotide chains (double-stranded) that are complementary to each other. 3 Figure 4: DNA: the genetic material of all living things Bioinformatics solutions: Computational approaches to Genes When the first DNA molecular was sequences in 1973, reading 24 DNA bases was a laborious process and a major achievement. The amount of DNA sequence data available has expanded exponentially since that time, necessitating new computational tools to handle the flood of information. In this chapter, we explore how DNA and protein sequences can be manipulated and analyzed using computational tools. The web exploration will take you to NCBI/GenBank, to find the sequence of the wild-type HBB gene. You will then manipulate the sequence using web-based tools/ python codes and compare it with sequence information from a Sickle-cell patient to find the change that produced the disease. If we cannot find a tool to solve the genetic problem then we need to develop our own to perform specialized tasks, improve existing software, or identify new algorithmic solutions. 4 The central dogma All the functions of the human body depend on proteins. The DNA in the nucleus of every human cell carries the information to make every possible protein (Figure 5). Figure 5: Diagram showing how a single gene encoded in DNA is transcribed to make a complementary mRNA, then translated to produce a protein, with each three-nucleotide codon specifying an amino acid. A gene is a segment of DNA (typically 3000 to 10,000 nucleotides) that carries the information for one protein. To make the protein, the cell first copies the information by making a messenger RNA (mRNA) molecule. The mRNA is made by matching up nucleotides with one of the DNA strands. The mRNA is similar to the DNA but is single-stranded and composed of nucleotides A, C, G, and U (uracil), with U replacing T and paring with A. This process is called transcription. Proteins on the other hand are folded chains of amino acid subunits. There are 20 different amino acids used to make up proteins, and it turned out that each codon – a group of three nucleotides – in the mRNA represents one amino acid. The process of decoding nucleotide information to make a protein is known as translation. 5 Class activities Transcription To understand the strand issue, consider the sequence shown in Figure 5. from Bio.Seq import Seq coding_dna = Seq("ATGTCCACTGCGGTCCTGGAA") print(coding_dna) messenger_rna = coding_dna.transcribe() print(messenger_rna) mRNA_seq = coding_dna.replace("T", "U") Translation messenger_rna.translate() Translate directly from the coding strand DNA sequence coding_dna.translate(table="Vertebrate Mitochondrial") coding_dna.translate(table=2) Translation Table from Bio.Data import CodonTable standard_table = CodonTable.unambiguous_dna_by_name["Standard"] mito_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"] print(standard_table) print(mito_table) 6 Comparing Seq objects from Bio.Seq import Seq seq1 = Seq("ACGT") "ACGT" == seq1 MutableSeq objects from Bio.Seq import Seq my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA") my_seq = "G" # what happens here? It's treated as string! Let us convert it into a mutable sequence object from Bio.Seq import MutableSeq mutable_seq = MutableSeq(my_seq) mutable_seq from Bio.Seq import MutableSeq mutable_seq = MutableSeq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA") mutable_seq mutable_seq = "C" mutable_seq Web exploration  Use NCBI and get the HBB (hemoglobin subunit beta) gene sequence (or GenBank). What chromosome(s) is/are the gene(s) located on? How many bp in the sequence? How about “hemoglobin AND human [organism]” Or “hemoglobin [titl] AND human [organism]” 7  How many amino acids are expected to be translated?  Translate the HBB gene sequence into a protein sequence. How long is the sequence? 8 Appendix Homo sapiens hemoglobin subunit beta (HBB), mRNA NCBI Reference Sequence: NM_000518.5 GenBank Graphics >NM_000518.5 Homo sapiens hemoglobin subunit beta (HBB), mRNA ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTT ACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGCTGCTGGTGGTCTACCCTTGGACCCAGAG GTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCG GTGCCTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCAC GTGGATCCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCACCCCACCAGT GCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCACTAAGCTCGCTTTCTTGCTGTCCAAT TTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACTGGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTA ATAAAAAACATTTATTTTCATTGCAA >AKI70611.1 HBB, partial [synthetic construct] MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTF ATLSELHCDKLHVDPENFRLLGNVLVCVLDHHFGKEFTPPVQAAYQKVVAGVANALAHKYH 9 Chapter 3 Gene Alignments: Investigating Antibiotic Resistance Objectives:  Understand the value of aligning genes and some practical applications of this technique.  Gain familiarity with key web-based alignment tools and understand how they can be used to explore sequence similarity.  Understand how the use of a heuristic search can provide a practical means of dealing rapidly with a large amount of data.  Understand how the Needleman-Wunsch algorithm produces an optimal alignment of any two sequences. Introduction Alexander Fleming was, it seems, a bit disorderly in his work and accidentally discovered penicillin. Upon returning from a holiday in Suffolk in 1928, he noticed that a fungus had contaminated a culture plate of Staphylococcus bacteria he had accidentally left uncovered. The fungus had created bacteria- free zones wherever it grew on the plate. He found that P. notatum proved extremely effective even at very low concentrations, preventing Staphylococcus growth even when diluted 800 times, and was less toxic than the disinfectants used at the time. After early trials in treating human wounds, collaborations with British pharmaceutical companies ensured that the mass production of penicillin (the antibiotic chemical produced by P. notatum) was possible. Following a fire in Boston, Massachusetts, USA, in which nearly 500 people died, many survivors received skin grafts which are liable to infection by Staphylococcus. Treatment with penicillin was hugely successful, and the US government began supporting the mass production of the drug. By D-Day in 1944, penicillin was being widely used to treat troops for infections both in the field and in hospitals throughout Europe. By the end of World War II, penicillin was nicknamed “the wonder drug” and had saved many lives. 1 The discovery of life-saving antibiotics to curb human disease during the 20th century was a tremendous medical achievement. These “miracle drugs” have saved millions who might otherwise have died from bacterial diseases. Today, antibiotics are used not only medically, but also in agriculture to prevent disease and promote growth. While their benefits are enormous, the overuse and misuse of antibiotics has allowed for rapid development and spread of resistance strains (‫ )سالالت المقاومة‬of bacteria including “superbug”. Understanding the problem: Antibiotic Resistance Fifty years ago, many people believed that antibiotics would bring an end to infectious diseases usually caused by bacteria. However, bacterial diseases such as pneumonia, diarrheal disease, etc. remain important and in some cases increasing causing illness and death. Why? A gene encoding some protein that can make bacteria cells resistant to an antibiotic (arises by mutation). Once some bacteria in the environment have the resistance gene, they can pass copies of the gene to other bacteria, a phenomenon is known as “horizontal gene transfer (HGT)”. However, how to determine experimentally whether these resistant bacteria are an important source of resistance genes for human pathogens (‫البشي‬‫ ?)مسببات األمراض ر‬One way to attack this problem is to look for resistance genes in human pathogens that are highly similar to those found in bacteria from animals, providing evidence that HGT has taken place. Bacteria that were previously sensitive to a drug can become resistant to it. A bacterial cell may require resistance to an antibiotic through mutation. On the other hand, Vertical gene transfer (VGT) is the transfer of genetic information, including any genetic mutations, from a parent to its offspring (‫)النسل‬. In summary: Horizontal gene transfer (HGT) is the movement of genetic material between unicellular and/or multicellular organisms other than by the ("vertical") transmission of DNA from parent to offspring (reproduction). HGT is an important factor in the evolution of many organisms. Horizontal gene transfer is the primary mechanism for the spread of antibiotic resistance in bacteria and plays an important role in the evolution of bacteria and in the evolution, maintenance, and transmission of virulence (‫)خبث‬. Genes responsible for antibiotic resistance in one species of bacteria can be transferred to another species of bacteria through various mechanisms of HGT such as transformation, transduction, and conjugation* ( ‫التحول والتوصيل‬ ‫ر‬ ‫)واالقتان‬, subsequently arming the antibiotic-resistant genes' recipient against antibiotics. The rapid spread of antibiotic resistance genes in this manner is becoming medically challenging to deal with. *In transformation, the recipient bacterium takes up extracellular donor DNA. In transduction, donor DNA packaged in a bacteriophage infects the recipient bacterium. In conjugation, the donor bacterium transfers DNA to the recipient by mating. 2 Bioinformatics Solutions: Aligning genes Are resistant bacteria in agricultural animals the source of resistance genes found in human pathogens (‫البشية‬ ‫ ?)مسببات األمراض ر‬Bioinformatics provides tools that can be used to address this important question. But how do we compare two genes? We need some algorithm that will allow the alignment of the genes. We can then have a scoring system based on whether the nucleotides in the two genes match. The more similar two genes are the more closely related they are evolutionary. If one bacterium had shared a gene with another, we could expect them to be extremely similar indeed. We will focus on one specific antibiotic resistance gene, ermB, which encoded resistance to the antibiotic erythromycin. If the same gene is identified in a distantly related organism such as Bacteroides, this would suggest that the HGT had occurred, since it’s extremely unlikely that the same sequence would evolve independently in a different species. Similarity search with BLAST Basic Local Alignment Search Tool (BLAST) is a program that will allow you to compare DNA or protein sequences. It can align varying length sequences and take in consideration indels as well as mutations. BLAST is very efficient since it uses heuristics search. Heuristic search refers to a search strategy that attempts to optimize a problem by iteratively improving the solution based on a given heuristic function or a cost measure. 3 Class activities Let’s start by using BLASTN to compare a query nucleotide sequence to other nucleotide sequences. We will look for the sequences that contain genes similar to the ermB gene found in pneumonia.  Start at the NCBI website and search for the BLAST tool. 4 5  Get the sequence S. pneumonia ermB gene (gene accession number: AB111455). You may refer to earlier exercises to see how to retrieve the gene AB111455.  Set the database to nucleotide collection. Search the complete collection of nucleotide sequences. Leave the remaining default options and click Blast. 6 Interpreting BLAST results Sample results of a BLAST search for database sequences matching a nucleotide query sequence - e-value is a statistical measure of how likely it is that two genes would be this similar by chance. Sample results of a BLAST search for database sequences matching a nucleotide query sequence. Alignment 7 Sample results of a BLAST search for database sequences matching a nucleotide query sequence - graphical summary of results Questions: 1. How many bp are in the ermB gene sequence? 2. What is the protein associated with this gene? Get the accession number. How many amino acids are in the sequence? 3. How many different matching sequences did BLAST find? Do you think this represents all of the matching sequences in GenBank? 4. What organism had the best match to your query sequence? Why the result is expected? 5. What two other organisms (do not include plasmids or vectors) had the most similar sequences to S. pneumonia ermB? 8 Limiting BLAST results  Go back to the BLAST form where you entered your sequence and the database to search  You should see a place where you can enter an Entrez query to limit your results. Enter (erm*[ALL] OR erythromycin[ALL] NOT plasmid[TI] NOT vector[TI]).  Run BLAST again. How many matching sequences did you get? 9 Multiple alignments with ClustalW BLAST results are very helpful in deciding which genes are the most similar to a query sequence. But suppose you want to look at the ermB family of genes in more detail, or perhaps look for regions of the genes that are the most similar. It is sometimes very useful to align an entire group of sequences all at once. ClustalW is a web-based tool that can do this, and it includes tools to help find similar and dissimilar regions (Figure 1). Figure 1: Sample output from a ClustalW multiple sequence alignment 10 Class activities  Download the sequences of the ermB-like genes from each of the 5 species you previously selected from among the BLAST results.  For each species, copy the nucleotide sequence of the ermB-like gene to a text file. Put all of the genes in a single file, separated by a FASTA comment line identifying the species.  Add the original S. pneumonia sequence at the top of the file.  A good interface to the ClustalW program is found at the European Molecular Biology Laboratory (EMBL) site (https://www.ebi.ac.uk/Tools/msa/clustalo/ ).  Paste your set of sequences into the appropriate box and keep the default parameter.  Click Run to start the alignment.  Check the results: what are the asterisk (*) and the dash (-) represent?  Use JalView viewer what do you see?  What is the sequence at the bottom shows (consensus)?  List the sequences in order of their similarity to the S. pneumonia ermB, according to the ClustalW alignment. List the present identity scores for each one relative to the S. pneumonia sequence. Is the order the same as the order in the BLAST result list? Why do you think there might be differences? 11 Global Alignment Global alignment is a technique used to compare two sequences in their entirety. In other words, matches and mismatches along the entire length of both sequences are considered when determining the best alignment. Dynamic programming: The Needleman-Wunsch Algorithm One technique used in developing algorithms for these types of problems is dynamic programming-- a problem-solving technique that breaks down a problem into smaller overlapping sub-problems whose solutions are used to solve the original problem. In 1970, Saul Needleman and Christian Wunsch published a now-famous paper describing an algorithm for making global alignments. The Needleman-Wunsch algorithm finds the optimal alignment of two sequences and was one of the first algorithms to implement dynamic programming to solve an alignment problem. Why must we consider mutations and indels in constructing an alignment? At a particular position in two sequences, there are three possibilities: identical, different because of mutations, or different because of the indel. If we consider indels, we may find more optimal alignments (why?). For example, consider aligning the sequences ACGTGACT and ACGTACT. If we do not allow indels, then there are only two possible alignments. ACGTGACT ACGTACT - * * * * ACGT GACT - ACG T ACT * * * If gaps are allowed, then we could obtain better alignments. ACGTGACT ACGT - ACT * * * * * * * ACGTGACT AC - GT ACT * * * * * ACG T GACT A - CG T ACT * * * * How is our scoring metric affected by these variations? 12 To make our algorithm more versatile, we now should implement three scoring values to recognize the three types of comparisons: a match score when characters are identical, a mismatch score when characters differ (mutation), and a gap score (gap penalty) when either character is empty (indel). This gives us the flexibility to weight gaps and indels differently. For implementation, the user will be prompted for these three values. Building the scoring matrix Align the sequences: CGCA and CACGTAT: 1. Read into memory two sequences to be aligned. 2. Assume the Match score = 1, mismatch score = 0 and gap penalty = -1 (could be obtained from the user). 3. Create n by m matrix (n is the size of the first sequence +1 and m is the size of the second sequence +1). 4. Place the actual nucleotides from the sequences along the left and top of the matrix, starting at position 1. 5. Starting with the 0 at position 0, initialize the first row by adding the gap penalty to each successive cell. In this case, each position in the matrix represents a possible way to align part of the sequence. 6. Initialize the first column in the same manner For our example, the matrix should now look like this: C A C G T A T 0 -1 -2 -3 -4 -5 -6 -7 C -1 G -2 C -3 A -4 7. Fill in the remaining cells, working from the upper left to the lower right of the matrix in either row or column order. The values in the cells will represent partial alignment scores. The higher the score, the better the alignment. The score for each cell is the highest of three possible scores: o Match or mismatch: If a match then takes the cell diagonally above and to the left of the cell and add the match score (1). If not, add the mismatch score (0) instead. o Gap in sequence 1: To represent a gap in sequence 1 (the left sequence), take the value in the cell to the left and add the gap penalty (-1). o Gap in sequence 2: To represent a gap in sequence 1 (the top sequence), take the value in the cell just above and add the gap penalty (-1). 13 C A C G T A T 0 -1 -2 -3 -4 -5 -6 -7 C -1 1 0 -1 -2 -3 -4 -5 G -2 0 1 0 0 -1 -2 -3 C -3 -1 0 2 1 0 -1 -2 A -4 -2 0 1 2 1 1 0 8. Obtain the optimal alignment score: The value in the bottom-right cell of the matrix represents the optimal alignment score of these two sequences. What is the best alignment score in this case? 9. Identify alignment paths in the matrix using directional arrows: Now we know the optimal score, but we do not know what the alignment looks like. Of course, the user would like to see how the two sequences best align. To find the best alignment(s), you need to traverse the matrix backward. Starting at the lower-right position in the matrix, determine which of the three bordering cells (above, left, or above-left diagonal) could have been used to arrive at the score in the current cell. The completed scoring matrix showing three possible alignment paths is shown in Figure 2. Figure 2: Completed scoring matrix showing three possible alignment paths. 10. Develop directional strings: Path 1: HDHHDDD (D for diagonal, H for horizontal, and V for vertical). Path 2: HDDDHHD Path 3: HDDDDHH Path 1: HDHHDDD Alignment: CA CGTAT CGC- - A- Score = 1+0+1+(-1)+(-1)+1+(-1) = 0 14 Path 2: HDDDHHD Alignment: CACGT AT C- - GCA- Score = 1+(-1)+(-1)+1+0+1+(-1) = 0 Path 3: HDDDDHH Alignment: CACGT AT - - CGCA- Score = (-1)+(-1)+1+1+0+1+(-1) = 0 Class activities 1. M = 7 and N = 8 (the length of sequence #1 and sequence #2, respectively) A simple scoring scheme is assumed where:  Si,j = 1 if the residue at position i of sequence #1 is the same as the residue at position j of sequence #2 (match score); otherwise  Si,j = -3 (mismatch score)  w = -4 (gap penalty) What are the optimal alignments and the optimal alignment score? A C G G C T C 0 A T G G C C T C 15 More Alignment Examples (protein sequences) The alignment of the two sequences “SEND” and “AND” with the BLOSUM62 substitution matrix and gap opening penalty of 10 (no gap expansion): The cells of the score matrix are labeled 𝐶(𝑖, 𝑗) where 𝑖 = 1,2, … , 𝑁 and 𝑗 = 1,2, … , 𝑀 BLOSUM62 Substitution Matrix # Matrix made by matblas from blosum62.iij # * column uses minimum score # BLOSUM Clustered Scoring Matrix in 1/2 Bit Units # Blocks Database = /data/blocks_5.0/blocks.dat # Cluster Percentage: >= 62 # Entropy = 0.6979, Expected = -0.5209 A R N D C Q E G H I L K M F P S T W Y V B Z X * A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2 -1 0 -4 R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 -1 0 -1 -4 N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 3 0 -1 -4 D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 4 1 -1 -4 C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4 Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 3 -1 -4 E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4 G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 -1 -2 -1 -4 H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0 0 -1 -4 I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 -3 -3 -1 -4 L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 -4 -3 -1 -4 K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0 1 -1 -4 M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 -3 -1 -1 -4 F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 -3 -3 -1 -4 P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 -2 -1 -2 -4 S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0 0 0 -4 T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 -1 -1 0 -4 W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 -4 -3 -2 -4 Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 -3 -2 -1 -4 V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 -3 -2 -1 -4 B -2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 1 -1 -4 Z -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4 X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 -1 -1 -4 * -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1 S E N D 0 A N D 16 Local Alignment Global and semi-global alignments are quite useful, but there are other alignment problems that they don’t solve. Suppose we have found two different erythromycin resistance genes in distantly related organisms. Although these two genes evolved independently, they have the same function. Key regions of the two proteins are probably similar (Figure 3). In between these regions, however, many mutations may have occurred, so these genes might be very different in these areas. What is the limitation of using global alignment? A local alignment algorithm will solve this problem. Local alignments look for optimal partial or subsequence matches between the sequences. Figure 3: A local alignment will allow us to find the similarity in two sequences that are not similar along their entire length Consider the sequences “AAGCTCCGATCTCG” and “TAAGCAAGAATCCGA”. There are two similar regions in these sequences “AAGC” and “TCCGA”, separated by regions that are different. Run these sequences using your previous programs. What are your observations? While the “AAGC” alignment will be found, none of the alignments algorithms presented will correctly align the sequences so that the “TCCGA” sequences also match up. To find subregions of similarity, large gaps must be expected and should not adversely affect the alignment score. 17 Class activities Pairwise sequence alignment from Bio import pairwise2 from Bio.Align import substitution_matrices blosum62 = substitution_matrices.load("BLOSUM62") alignments = pairwise2.align.localds("LSPADKTNVKAA", "PEEKSAV", blosum62, -10, -1) print(pairwise2.format_alignment(*alignments)) alignments = pairwise2.align.localms("AGAACT", "GAC", 5, -4, -2, -0.5) print(pairwise2.format_alignment(*alignments)) Homework: Read the following two sequences from 2 different files and perform pairwise alignment (m = 5, mismatch -4, and gap of -2). What is the alignment score? >HBA_HUMAN MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALS ALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR >HBB_HUMAN MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLK GTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH 18 Chapter 4 Protein Alignments: Clues to Protein Function Introduction Drosophila melanogaster, the common fruit fly, has been a key experimental organism for genetic researchers since the early 1900s. It has been instrumental in understanding many important biological problems. Consider the gene known as patched, essential in fly development; mutating patched reserves some of the body segments of the developing embryo (‫)جنين‬. But what is the function of the gene encoded by the patched gene, and could it also be important in mammals? Figure 1: The humble fruit fly, which has provided valuable information for geneticist and developmental biologists for more than 100 years. Development geneticist studying patched mutants were eventually able to clone (‫ )استنساخ‬the patched gene and determine its nucleotide sequence. The amino acid sequence of the protein encoded by the patched could then be predicted. BLAST and other tools were used to identify similar genes that had been sequenced in other organisms (such as human). The human version of patched is found in the same chromosome region, the long arm of chromosome 9, as a gene involved in basal cell nevus syndrome (BCNS) ‐ a human genetic disease that results in the development of hundreds of skin cancers. 1 Understanding the problem: Protein function Understanding the function of the protein is essential: we cannot say that we understand what a gene does until we know the cellular role of the protein it encodes. For example, we know that mutating the patched gene in Drosophila affect how the body segments formed during development, but what exactly does the protein do that establishes the normal pattern of segment development. We also need this information for medical or other applications: for example, in BCNS, if we know that the patched gene product is a tumor in mammals, we can further study specifically how it prevents cancer and then move on to developing specific drugs that could restore or mimic its function. Bioinformatics Solution: Protein alignments Bioinformatics analysis was critical in identifying the function of the Cystic Fibrosis Transmembrane regulator (CFTR) protein. Cystic Fibrosis (‫ )التليف الكيسي‬is a disease that affects the respiratory (‫)تنفسي‬ and digestive (‫)الجهاز الهضمي‬ systems. As discussed earlier, identifying sequence similarity is one of the most important tools in bioinformatics. Let us imagine that we know the DNA sequence of the CFTR gene but we do not yet know the function of the CFTR protein. One of our first steps would be to search GenBank or any other relevant database for other proteins with similar sequences that might have similar functions. To accomplish this, a protein alignment algorithm relies on a substitution matrix – a set of values that will score the match between two amino acids. A substitution matrix is a 20 X 20 matrix with a row and column for each of the 20 amino acids used in proteins. The value is a particular cell – say {i,j} – gives a value for the likelihood of substitution of the amino acid in row i with the one in column j. Popular substitution matrices are PAM (Point Accepted Mutation) and BLOSUM (BLOcks Substitution Matrix). 2 Class activities  Obtain the amino acid sequence for the human CFTR protein (You should know how to find it by searching the NCBI). There are many CFTR proteins by now and therefore, you should limit your query to human CFTR. https://www.ncbi.nlm.nih.gov/gene/1080 FASTA format: https://www.ncbi.nlm.nih.gov/protein/NP_000483.3/  How many amino acids are in the CFTR protein sequence?  Use BLAST to search for similar protein sequences (remember to use BLAST‐P).  Run a BLAST search with the default BLOSUM62 matrix. 3 4  Try different substitution matrices, what do you get? and Why?  To see the conserved domains again, click Show Conserved Domains. What do you see? https://www.ncbi.nlm.nih.gov/Structure/cdd/cddsrv.cgi?uid=405144  Click on the conserved domain display to get more information. To simplify it, click Concise Result. Then click on the ABC_membrane domain to learn more about it; notice that there are two of these domains in the protein. What are their functions?  Go back to Concise Result, and now click the ABCC_CFTR bar. What additional functions can you hypothesize for your protein after investigating this region of the protein? You may want to use other links and perhaps an Internet search to learn more about this family of proteins. 5 Group Assignment: 1. How did your list change when you change the substitution matrix? Did the proteins that matched change, or just the order? How did the scores change? 2. Would you say the PAM30 matrix found closer matches to CFTR, or more distant ones? 3. Thinking about your results, why might you choose a substitution matrix other than the default BLOSUM62 for a BLAST search? 4. What do the proteins of known function that match the CFTR protein have in common (you may need to use the links to the Gene database to get more information about the matching proteins) 5. Suppose you are a researcher who has just sequenced the CFTR gene. What could you learn from these lists about possible functions of the CFTR protein? 6. Suppose a mutation changes a codon in a gene from GUA to GAA. What is the corresponding amino acid change? 7. What are the two ways in which this small change in DNA could produce a drastic change in the function of the protein encoded by this gene? 8. Why building specific substitution matrix is necessary despite the fact that standard matrices such as BLOSUM and PAM are available. 6 Chapter 5 Gene Prediction: Finding Genes in Human Genome Objectives  Understand the structure of a gene and which features are useful in developing computational methods for identifying genes.  Know the strength and limitations of alignment, sequence, and content‐based methods.  Use existing gene discovery tools.  Understand how pattern matching can be used in sequence‐based computational solutions. Introduction In Bioinformatics, gene prediction or gene finding refers to the process of identifying the regions of genomic DNA that encode genes. In late 2007, researchers had identified 21,541 protein‐coding genes, but at least 1,200 candidate genes remained under investigation. Furthermore, researchers using a novel computational method based on alignments with other mammalian genomes had recently identified 300 previously unknown protein‐coding genes and proposed extensions to hundreds of known genes. Understanding the problem: Gene discovery In order to cure a genetic disease or understand how a specialized cell type develops, we have to understand the functions of individual genes within the genome sequence. In order to understand their functions, we first have to find those genes through the process of gene discovery. Today, we have the complete nucleotide sequence of most if not all of the coding regions of the human genome. However, simply knowing the sequence in not enough: we have the pieces of the puzzle, but cannot see the full picture. Previously, we looked for AUG start codons and UAG, UGA, or UAA stop codons to identify a protein‐coding sequence, using a genetic code table (below) to find the amino‐acid equivalent for each three‐nucleotide codon in between. This coding sequence, from start codon to stop codon, is called an open reading frame (ORF). Cataloging every human gene requires the use of methods, such as looking for transcribed sequences based on the presence of promoters (Figure 1). Promoter is the region of a DNA molecular in which the RNA recognizes an appropriate site for transcription to begin. Figure 1: Schematic view of a typical prokaryotic gene and the mRNA transcribed from the gene. A major source of complexity in eukaryotic genomes is the presence of introns. The intron is a noncoding DNA segment inserted between portions of coding sequence and removed from the mRNA by splicing in order for the mRNA to be translated correctly (Figure 2). Figure 2: A eukaryotic mRNA is processed by splicing after transcription to remove introns and join exons into a continuous coding sequence Understanding the complexity of the gene structure is essential to understanding how you might employ computational methods to identify genes in a DNA sequence. In this chapter, we examine some key bioinformatics methods for gene discovery. Bioinformatics Solutions: Gene Prediction Identifying which DNA sequences within a complex genome represent genes is a difficult task, and no single method is likely to identify all of them. Commonly used computational approaches to this problem fall into three major categories of algorithms: alignment based, sequence based and content based. Alignment‐based algorithms You already know how to make DNA and protein alignments. If you had identified a particular gene in mice, zebrafish, fruit flies or even bacteria, there is a good chance you would be able to use sequence similarity to locate an ortholog (a gene that is similar in DNA sequence to a gene in another species) in human genome. Sequence‐based algorithms Search for ORF is an example of sequence‐based method. Content‐based algorithms Search for patterns such as nucleotide or codon frequency that are characteristic of coding sequence in a particular organism. Gene discovery in Prokaryotes (‫)بدائيات النوى‬ Because gene prediction is easier in Prokaryotes, we decided to limit the scope for this chapter to Prokaryotes gene discovery. The genomes of most bacteria consists of a single, circular chromosome ranging in size from 1 million to 10 million base‐pairs, in contrast to 46 chromosome and 6 billion base‐pairs of DNA in every human cell. Class Activities: Using a simple ORF finder 1. Find the ORF finder link within the NCBI website. 2. Rather than searching the entire genome, let us start with a simple DNA molecular the plasmid pACYC184. A plasmid is small, circular DNA molecular that can replicate semi independently within a bacteria cell. Antibiotic‐resistance genes are often encoded on plasmids. 3. Search the GenBank to find the DNA sequence of plasmid “Cloning vector pACYC184”. How many bp in the sequence? What is the accession number? (X06403) https://www.ncbi.nlm.nih.gov/nuccore/X06403 4245 bp 4. Copy the sequence or the accession number to the ORF finder and click OrfFind (Figure 3). https://www.ncbi.nlm.nih.gov/orffinder/ Figure 3: Output from NCBI’s ORF Finder program 5. By default, only ORFs longer than 75 nucleotides are shown, and they are sorted in descending order of length. 6. Does pACYC (a small plasmid of only 4245 bp) really have 39 genes? 7. How can we carry BLAST search for sequences similar to this ORF? 8. How many nucleotides are in the longest ORF in pACYC184? How many amino acids would it encode? 9. What is the function of the predicted protein it encodes? 10. Can you find one other real gene in pACYC184? Identifying open reading frames using Python As earlier mentioned, a very simplistic first step at identifying possible genes is to look for open reading frames (ORFs). Of course, to identify a gene you would also need to worry about the locating of the start codon and possible promoters. Moreover, in Eukaryotes there are introns as well to worry about. However, this approach is still useful when applied to identify coding regions in viruses and Prokaryotes. To show how you might approach this problem, we will use the bacterial plasmid(NC 005816.fna). This is a bacterial sequence, so we'll want to use the NCBI codon table 11. >>> from Bio import SeqIO >>> record = SeqIO.read("NC_005816.fna", "fasta") >>> table = 11 >>> min_pro_len = 100 >>> for strand, nuc in [(+1, record.seq), (-1, record.seq.reverse_complement())]: for frame in range(3): length = 3 * ((len(record)-frame) // 3) #Multiple of three for pro in nuc[frame:frame+length].translate(table).split("*"): if len(pro) >= min_pro_len: print("%s...%s - length %i, strand %i, frame %i" \ % (pro[:30], pro[-3:], len(pro), strand, frame)) GCLMKKSSIVATIITILSGSANAASSQLIP...YRF - length 315, strand 1, frame 0 KSGELRQTPPASSTLHLRLILQRSGVMMEL...NPE - length 285, strand 1, frame 1 GLNCSFFSICNWKFIDYINRLFQIIYLCKN...YYH - length 176, strand 1, frame 1 VKKILYIKALFLCTVIKLRRFIFSVNNMKF...DLP - length 165, strand 1, frame 1 NQIQGVICSPDSGEFMVTFETVMEIKILHK...GVA - length 355, strand 1, frame 2 RRKEHVSKKRRPQKRPRRRRFFHRLRPPDE...PTR - length 128, strand 1, frame 2 TGKQNSCQMSAIWQLRQNTATKTRQNRARI...AIK - length 100, strand 1, frame 2 QGSGYAFPHASILSGIAMSHFYFLVLHAVK...CSD - length 114, strand -1, frame 0 IYSTSEHTGEQVMRTLDEVIASRSPESQTR...FHV - length 111, strand -1, frame 0 WGKLQVIGLSMWMVLFSQRFDDWLNEQEDA...ESK - length 125, strand -1, frame 1 RGIFMSDTMVVNGSGGVPAFLFSGSTLSSY...LLK - length 361, strand -1, frame 1 WDVKTVTGVLHHPFHLTFSLCPEGATQSGR...VKR - length 111, strand -1, frame 1 LSHTVTDFTDQMAQVGLCQCVNVFLDEVTG...KAA - length 107, strand -1, frame 2 RALTGLSAPGIRSQTSCDRLRELRYVPVSL...PLQ - length 119, strand -1, frame 2 More Sophisticated Gene Prediction Using ORF finder, you found a long list of potential genes and had to narrow them down by hand (ORF to be 100 amino acid long). However, there is a possibility of missing genes. What if some of the short ORFs also encode functional genes? You can improve your ability to find authentic genes by determining whether the start codon is preceded by a Shine‐Dalgarno sequence (a sequence close to 5’AGGAGG located six or seven nucleotides before the start codon). This would still be a sequence‐based method of gene prediction. Or you can ask whether the codons in the ORF match the typical codon usage for the organism you are interested in. EasyGene is an example of a gene prediction program intended for finding genes in prokaryotic DNA. This program looks for potential ORFs, examines the region just before the putative start codon for a Shine‐Dalgarno sequence, and then compares the codons in each ORF with a training set taken from a particular prokaryotic genome. The user can specify an organism closely related to the one from which the sequence to be searched was obtained, and the program will then give each ORF a significance score representing the likelihood that it is a genuine gene. Only those ORFs scoring above a selected threshold are displayed.  Find the sequence pACYC184 in GenBank and display the FASTA format  Find the EasyGene server using the internet search engine. http://www.cbs.dtu.dk/services/EasyGene/  Paste the sequence, select Escherichia coli k12 at the organism to be making comparisons. Leave the R‐value (significance score) cutoff at 2. Click submit. What do you see?  How many ORFs did EasyGene return?  How does the EasyGene output compare with the ORF Finder?  Did EasyGene find two “real” genes you identified after working with ORF Finder? Does the score suggest that they are indeed genuine E. coli genes?  Did EasyGene find additional genes?  What do you think are the advantages and disadvantages of the two approaches? Promoter Prediction Both of the tools used so far depend on finding coding sequences. Looking for promoter sequences can find additional genes and provide additional information about your predicted genes.  Use a search engine to find the Neural Network Promoter Prediction site, part of the Berkeley Drosophila genome project site (http://www.fruitfly.org/seq_tools/promoter.html ).  Paste the pACYC184 sequence; be sure to indicate that this is a prokaryotic sequence.  Click submit and view your results. What do you see?  How many promoters were found in the pACYC184 sequence?  Which promoters do you think are the “real” promoters for the two genes you previously identified in this plasmid? ORF Finder Algorithm Before you think of writing the algorithm to find ORF in a DNA sequence, there few questions to be answered first: 1. How can our program identify a coding sequence within a sequence input by the user? 2. How will the program identify the beginning of the start codon if we can not assume it starts with the first nucleotide entered? 3. Is it possible for a particular sequence to have multiple start and stop codons? If so, how should the algorithm handle this situation? Bonus Assignment (5 points) – Project idea Write an algorithm for finding ORFs in a DNA sequence and then implement it in Python or any programming language you are comfortable with. Your program should do the following:  Read a DNA sequence from a file.  Identify coding region within a sequence  Identify the beginning of the start codon (if we don’t assume it starts with the first nucleotide entered).  How should the program handle a particular sequence with multiple starts and stop codons?  Try the program on the pACYC184 sequence and compare your output to what we have done so far. Chapter 6 RNA and Protein Structure: Structure prediction Objectives:  Understand how structure and function are related and why secondary and tertiary structure prediction are important.  Work with publicly available tools for examining the experimentally determined structure of proteins.  Use web-based tools to predict secondary structure.  Understand algorithmic solutions to the problem of predicting secondary structure from RNA and protein sequences. Introduction Because viruses replicate inside our won cells, viral (‫ )فيروسي‬infections resent special challenges for the development of therapeutic (‫ )العالجية‬pharmaceuticals. Despite decades of intensive effort, there are few effective antiviral drugs and no true cure for any viral disease. A detailed understanding of the three-dimensional (3D) structure of virus proteins may be one route to new breakthroughs in antiviral research. Understanding the problem: structure prediction Molecular biology has made great strides (‫ )خطوات‬in sequencing genes, identifying genes within genome, predicting amino-acid sequence of proteins, comparing sequences to obtain clues to function and evolutionary relatedness. However, the nucleotide sequence of a gene is only the primary structure (amino acid sequence) of protein it encodes. An actual cell is a 3D arena where molecules with specific structures interact (Figure 1). We cannot fully understand how the cell works until we can visualize nucleic acids and proteins as 3D entities. An enzyme must have the correct shape to bind a specific substrate ( ‫ )الركيزة‬and exclude non-substrate molecules (Figure 2). The standard method for experimentally determining the 3D structure of protein is X-ray crystallography. Other experimental technique is the Nuclear magnetic resonance (NMR). However, these techniques are presently limited in their resolution and the size of the molecules that they can examine. Practical limitations were also an issue. To date, there are no experimental methods for determining the structure of either proteins or nucleic acids that can keep up with the tremendous rate at which their primary sequences are becoming available. 1 Figure 1: The large subunit of the eukaryotic ribosome is a complex structure combining precisely folded proteins (dark regions) and RNA molecules (light regions) Figure 2: Protein Structure (Ref: https://theory.labster.com/protein-structure/) 2 Key Bio Concept Questions: 1. What are the reasons why determining the three-dimensional structure of a protein or nucleic acid is desirable? i Proteins that are not very similar in sequence may nonetheless be similar in structure and function. ii Structure can give clues to function. iii Structures can suggest possible binding sites for drugs that might modify the function of a protein. iv A functional region of a protein may result from folding of sequences scattered throughout its amino-acid sequence and therefore difficult to study at the sequence level. v Many genetic diseases result from changes in protein structure that may be due to very small changes in sequence. 2. Both secondary and tertiary structures of proteins are three-dimensional structures; what is the difference between the two? Secondary structure occurs when one region of an amino-acid chain folds; they are thought of as “local” structures. Tertiary structure can bring together secondary structures from many different parts of the amino-acid chain into an overall, “global” structure. 3. The amino-acid sequence of a protein clearly must determine what folded structures are possible for that protein. What complications arise in trying to predict the final, folded structure from the sequence? Sequences from distant regions of the amino-acid chain may come together in the final, folded structure. Some amino acids could participate in more than one kind of structure (say, an α-helix or a β-strand), so it can be hard to decide what kind of structure is actually found in each region. Folding also depends on the environment of the protein, which is not the same in every compartment of a cell or even every region of the cytoplasm (‫)السيتوبالزم‬. The protein may fold from one end to the other as it is synthesized (‫)توليف‬, or it can be held unfolded by a chaperone (a protein that assists other proteins in folding) so that a different structure forms. 4. What determines how an RNA molecule can fold? What kinds of criteria might be used to decide whether one potential folded structure is better than another? Folding happens to allow base-pairing between complementary bases along the RNA strand. Usually, base-paired structures are more stable than unpaired structures, so finding the structure that allows the most base-pairs to form is one criterion for a “good” structure. 3 Bioinformatics solutions: RNA and protein structure With huge numbers of nucleic acid and protein sequences at our disposal, the ability to accurately predict folded structures directly from primary sequence would open up a vast wealth of new knowledge. Bioinformatics algorithms use base-pairing and secondary structure rules to try to determine that kind of secondary structure a given nucleotide or amino acid is likely to be part of. But how do we decide which of the multiple possible structures is best? Our ability to effectively predict the tertiary structure is unfortunately quite limited at present; therefore, we will focus on the exploring techniques for secondary structure prediction. Web exploration: protein structure modeling and drug design Traditionally, new drugs have been discovered by doing initial testing of huge number of molecules that might possibly effect some process of interest. In this case imagine that you are working for a pharmaceutical company and are seeking new kind of drugs to treat cancer. In many cancers, normal cellular proteins that promote cell division are over-expressed or inappropriately expressed, leading to uncontrolled multiplication of cells. You would like to design a drug to inhibit some of these growth-promoting proteins. SRC is a tyrosine kinase – an enzyme that puts a phosphate group on tyrosine amino acids within target proteins. Phosphorylation (‫ )الفسفرة‬is a common mechanism of regulating the activity of proteins in eukaryotic cells; most often the process activates an inactive protein. Predicting the secondary structure of SRC This investigation will begin with an examination of the normal SRC protein from human cells. You can use PSIPRED, a secondary structure prediction program to look for regions of the protein that form α-helices (H), β-sheets (E), or random coils (C). 1. Download accession number NP_005408, the amino acid sequence from human SRC (GenBank). 2. Save the sequence as a text file in FASTA format. 3. Locate PSIPRED server (currently at http://bioinf.cs.ucl.ac.uk/psipred/). 4. Enter the SRC sequence into the box. 5. Choose predict the secondary structure. 6. Submit your request. 7. Explore the results (Figure 3). 8. List two regions of the SRC protein with α-helices structure. 4 Figure 3: Sample output from the PSIPRED program Examining the SRC Crystal Structure 1. Check the accession number NP_005408, the amino acid sequence from human SRC (GenBank). 2. From "Protein 3D Structure" check the 3D structure. 3. Click on “full-featured 3D viewer”. 4. Under “details” explore the 2D structure. 5. Explore the 3D structure, conserved domains, functional sites, binding sites, etc. Chou-Fasman algorithm for predicting the protein secondary structure Although the PSIPRED algorithm is a good solution, you probably found that its predirections fell short of a perfect match to the SRC crystal structure. In fact, developing a program that can accurately predict the secondary structure of a protein by identifying α–helices, β-strands, and β–turns is still an open problem in bioinformatics. In this section, we will illustrate how the Chou-Fasman algorithm attempts to predict the protein secondary structure. The Chou–Fasman algorithm is an empirical technique for the prediction of tertiary structures in proteins, originally developed in the 1970s by Peter Y. Chou and Gerald D. Fasman (https://en.wikipedia.org/wiki/Chou-Fasman_method). The method is based on analyses of the relative frequencies of each amino acid in alpha helices, beta sheets, and turns based on known protein structures solved with X-ray crystallography. From these frequencies a set of probability parameters were derived for the appearance of each amino acid in each secondary structure type, and these parameters are used to predict the probability that a given sequence of amino acids would form a helix, a beta strand, or a turn in a protein. The method is at most about 50–60% accurate in identifying correct secondary structures, which is significantly less accurate than the modern machine learning–based techniques. Please visit the link: http://cib.cf.ocha.ac.jp/bitool/MIX/ to test the method. 5 Algorithm Step 1: Alpha-helices a. Find a region of six continuous resdidues where at least 4 have P(a)>103. b. Extend the region until a set of 4 continuous residudes with P(a)103, the region’s length is >5, and ∑ 𝑃(𝑎) > ∑ 𝑃(𝑏) for the region, then that region is predicted to be a α–helix. Step 2: Beta-strands a. Find a region of 5 continuous residues where at least 3 have a P(b)>105 b. Extend the region until a set of 4 continuous residues with P(b)105, and ∑ 𝑃(𝑏) > ∑ 𝑃(𝑎) for the region, then that region is predicted to be a β–strand. Step 3: Beta-turns a. For each residue j, determine the turn propensity or P(t) for j as follows: 𝑃(𝑡)𝑗 = 𝑓(𝑖)𝑗 × 𝑓(𝑖 + 1)𝑗+1 × 𝑓(𝑖 + 2)𝑗+2 × 𝑓(𝑖 + 3)𝑗+3 b. A turn is predicted at position j if P(t)>0.00075, and the avregare P(turn) for residues j to j+1 >100, and the ∑ 𝑃(𝑎) < ∑ 𝑃(𝑡𝑢𝑟𝑛) > ∑ 𝑃(𝑏) ******************* Step 1: build a table in which you write the propensities of all amino acids to be in helices or strands Step 2: Separately, predict helices and strand: a. For α-helices: i Find a potential nucleation site: a stretch of 6 amino acids with at least 4 having a propensity to be in a helix greater than 1. ii Expend the nucleation site in both directions: add an amino acid X, and check if the average helical propensity over a window of 4 amino acids ending at X is greater than 1; if it is, add X to the current prediction, otherwise stop iii Check that the average propensity over the whole region predicted to be helical is greater than 1 (if not, prediction is discarded) b. For β-strands i Find a potential nucleation site: a stretch of 5 amino acids with at least 3 having a propensity to be in a strand greater than 1 ii Expend the nucleation site in both directions: add an amino acid X, and check if the average strand propensity of a window of 4 amino acids ending at X is greater than 1; if it is, add X to the current prediction, otherwise stop iii Check that the average propensity over the whole region predicted to be expanded is greater than 1 (if not, prediction is discarded) In cases in which the same region is predicted to be both helical and strand, pick the prediction with the greatest overall average. 6 Chou–Fasman parameters: Name P(a) P(b) P(turn) f(i) f(i+1) f(i+2) f(i+3) Alanine (A) 1.42 0.83 0.66 0.06 0.076 0.035 0.058 Arginine (R) 0.98 0.93 0.95 0.070 0.106 0.099 0.085 Aspartic Acid (D) 1.01 0.54 1.46 0.147 0.110 0.179 0.081 Asparagine (N) 0.67 0.89 1.56 0.161 0.083 0.191 0.091 Cysteine (C) 0.70 1.19 1.19 0.149 0.050 0.117 0.128 Glutamic Acid (E) 1.39 1.17 0.74 0.056 0.060 0.077 0.064 Glutamine (Q) 1.11 1.10 0.98 0.074 0.098 0.037 0.098 Glycine (G) 0.57 0.75 1.56 0.102 0.085 0.190 0.152 Histidine (H) 1.00 0.87 0.95 0.140 0.047 0.093 0.054 Isoleucine (I) 1.08 1.60 0.47 0.043 0.034 0.013 0.056 Leucine (L) 1.41 1.30 0.59 0.061 0.025 0.036 0.070 Lysine (K) 1.14 0.74 1.01 0.055 0.115 0.072 0.095 Methionine (M) 1.45 1.05 0.60 0.068 0.082 0.014 0.055 Phenylalanine (F) 1.13 1.38 0.60 0.059 0.041 0.065 0.065 Proline (P) 0.57 0.55 1.52 0.102 0.301 0.034 0.068 Serine (S) 0.77 0.75 1.43 0.120 0.139 0.125 0.106 Threonine (T) 0.83 1.19 0.96 0.086 0.108 0.065 0.079 Tryptophan (W) 1.08 1.37 0.96 0.077 0.013 0.064 0.167 Tyrosine (Y) 0.69 1.47 1.14 0.082 0.065 0.114 0.125 Valine (V) 1.06 1.70 0.50 0.062 0.048 0.028 0.053 How to calculate propensity: 𝐹𝐴 𝑃𝛼 = 𝐹𝛼 Where 𝐹𝐴 is the number of amino acid (A) in the α-helix reion divided by the number of the amino acid (A) in the sequence. 𝐹𝛼 is the number of amnio acids in the α-helix region divided by the total sequence residue. Similarly for the β: 𝐹𝐴 𝑃𝛽 = 𝐹𝛽. Example 1: Calculate P(a)  Amino acid (A) in a sequence = 2,000  Total sequence residues = 20,000  How long is a helix region = 4,000  How many amino acid A in the helix region = 500 Example 2: Calculate P(b)  Amino acid (A) in a sequence = 2  Total sequence residues = 12  How long is β-strand region = 5  How many amino acid A in the β-strand region = 1 7 Excerise: Using the Chou and Fasman probability values, predict the secondary structure of the following peptide sequences: (Secondary structures are given in one-letter code, and only three states are considered: H = Helix, E = Strand, and O = other) 1) WHGCITVYWMTV Select the correct structure: A) OOOOHHHHHHOO B) EEEEEEEEEEEE C) HHOOHHHHHHOH D) EHOEEEEEEEEE 2) CAENKLDHVRGP A) HHHHHHHHHHOO B) OOHHHHHHHHHH C) HHHHHHHHHHHH D) OOHHHHHHHHOO 3) TSPTAELMRSTG A) HHHHHHHHHHOO B) OOHHHHHHHHHH D) OOHHHHHHHHOO C) OOHHHHHHHHHO 8 Chapter 9 Microarrays and Gene Expression Objectives Understand the importance of measuring gene expression and advantage of global analysis using microarrays. Understand how microarrays work and how they measure gene expression. Work with publicly available data and analysis tools to investigate changes in gene expression under experimental conditions. Understand what data manipulations are necessary to obtain usable information from row microarray data. Learn how the large amount of data generated by a microarray can be mined to answer different questions. Develop programs to process microarray data, analyze gene expression, and examine the functions of genes with a particular expression pattern. 1. Introduction It is well known that some cancer patients respond better to chemotherapy (‫)ا ج ا‬ than others, and that patients who respond well to one drug may not necessarily respond well to another. For example, only about four of ten lymphomas (a cancer affecting a specific class of white blood cells) respond to the drug regimen in use. Why is this the case? Possible factors that could influence the cure rate include the genetics of the patient, gene expressed in the tumor, and so on. The development of microarray technology now permits researcher to examine the expression of every gene in the genome at once. 2. Understanding the problem: Gene expression All of the previous techniques (alignment, prediction, clustering) relied on a static snapshot of an organism’s genetic code. But how is that genetic information used in a living cell, non- static organism? How can different cells in the same organism perform different functions when they all share the same genetic code? And how do cells respond to changing environmental conditions? The answer to these questions is gene expression - the process of transcribing (‫ )اال‬and translating a gene to yield a protein product. Although all the cells in the human body have the same genome, only a fraction of the genes in the genome are expressed. In each cell type, a specific set of genes is activated, and the proteins they encode give the cell the characteristics of a muscle cell, bone cell, nerve cell, and so on. The set of proteins a cell makes is referred to as its proteome and the proteome is determined by which genes are transcribed, how they are spliced (), and which mRNAs are translated. Since the synthesis (‫ )آ‬of a protein requires an mRNA transcript, measuring transcription is another way to attack the problem of which the genes are expressed-and this has a number of practical advantages over proteomic techniques. 3. Bioinformatics solutions: Microarrays There are a number of ways to measure the expression of a gene such as Northern blotting, which was one of the first good ways to investigate expression. Today, reverse transcriptase (‫( )ا‬RT-PCR) is a commonly used technique for rapidly measuring how much of a particular mRNA is present in a cell. Or, we may want to look at how much protein is made. 3.1 What is a microarray? A DNA microarray consists of a large number of DNA samples attached to a solid surface, such as glass slide. Individual spots of known locations on the side contain specific DNA molecules. In a microarray representing human genome, for example, we would find a spot containing DNA from the HBB gene at one location, a spot representing the CFTR gene at another, and so on (Figure 9.1). (a) (b) Figure 9.1: (a) A DNA microarray; the enlargement shows that it consists of individual DNA "spots". (b) A DNA microarray; the enlargement shows that it consists of individual DNA "spots" Group Assignment: Each group should prepare a 10 minutes presentation to show how the DNA is attached to the microarray chip? 3.2 How does a microarray experiment work? In most cases, we cannot generate a number that represents the level of expression of a particular gene in a particular cell. Most often, microarrays measure relative expression: they compare gene expression in one cell type or under one condition with expression in a different type or under different condition. The results are thus expressed in ratio showing how much a gene is expressed under one condition relative to another. In the example in Figure 9.2, gene expressed in control cells (non-cancerous cells, for example), are compared to those expressed under some experimental conditions (such as lymphoma cells). Figure 9.2: Synthesis and labeling of cDNA for a microarray experiment. Each cDNA will base-pair with DNA in a specific spot on the chip-the spot containing DNA with a sequence of part of the gene from which that specific mRNA was transcribed. This process is called Hybridization (). 3.3 How is the microarray data gathered? The spots on a DNA chip are too small for the results to be observed directly. A laser can be passed over the chip, causing the fluorescent nucleotides to fluoresce ( !). The color and intensity of the fluorescence for each spot depend on whether there is a cDNA present in the sample for the first condition, the second condition, both, or either. If the gene is expressed at a low level under condition 1 but a high under condition 2, there will be a lot of “green” cDNA bound to that spot but little “red” cDNA. 3.4 How are microarray data analyzed? In order to analyze the microarray data, a computer takes the red fluorescent data and merges them with the data for green fluorescence. It then produces an image in which spots representing genes included under condition 2 appear green, spots representing genes repressed under condition 2 appear red, spots representing genes with similar expression under both conditions appear yellow, and spots representing unexpressed genes appear black. The intensity of the colors can then be used to quantities the level of expression. 4. Class activities: analyzing microarray data 4.1 Web exploration using MAGIC tool In this project we explore gene expression in yeast (‫ )ة‬in response to zinc using a freely available software application called MAGIC tool (MircoArray Genome Imaging and Clustering Tool) to analyze microarray data. Microarray experiments produce data for the expression of thousands of genes. A research conducting an experiment can “mine” his or her data to answer a particular question: In which genes does expression increase the most when these cells are starved? What genes are turned off when these cells are infected by a virus? etc. The work consists of the following steps: 1. Installing MAGIC. 2. Downloading and formatting the microarray data. 3. Converting the array data to numbers representing gene expression and building the expression file. 4. Transforming the data to more useful format. 5. Explore the expression file to get information on gene expression. 4.1.1 Installing MAGIC. MAGIC tool can be found on the GCAT website at http://www.bio.davidson.edu/projects/GCAT/gcat.html 4.1.2 Downloading and formatting the microarray data Visit Stanford MicroArray Database (SMD) available at http://smd.stanford.edu. Choose Search by Datasets. Pick experiments. Select Saccharomyces cerevisiae (‫ز‬$‫) ة ا‬. Select Metals. Click Display Data button to see the results of your search. We will use microarray data comparing the gene expression of yeast grown with two different concentrations of zinc: 3mM (high) and 61 nM (low). 1. Create a folder called GuidedProject within your MAGIC Tool folder and within this folder create another folder called DataFiles. 2. Download the Raw Data and the OriData from experiment ID (819). 3. From the 819.xls extract the gene names column and save it in a text file name it 819genelist.txt. 4.1.3 Building an expression file You are now ready to use MAGIC Tool, which has two functions: (1) Converting raw microarray data to an expression file containing quantitative data representing gene expression level, (2) Analyzing those quantitative data to learn about gene expression. 1. Create a project and call it guided9 and save it in your GuidedProject folder. 2. You need to associate the data files (TIFF files and gene list) with the project. Use Add Directory and select your DataFiles folder. 3. Choose Build Expression File to load the images. 4. Load the gene list (819genelist.txt). 5. Choose Build Expression File | Addressing/Gridding | Create/Edit Grid. 6. Enter the values in Figure 9.3. Figure 9.5: Gridding the microarray data in MAGIC Tool. 7. Once you are happy with the layout of the grid, repeat this process for the remaining 15 grids using an easy shortcut. How to display a valid yeast gene name? 8. Once you have finished all the 16 grids, click done. 9. What the purpose of using flagging? What about the grid 2, col 19, row 22? 10. Use segmentation to find the actual red and green ratios for each, a numeric representation of the relative expression of each gene at 3 mM zinc (red) as compared with 61 nM (green). You can use variance segmentation methods to calculate the intensity of red and green in each spot. 4.1.4 Transforming the expression data 1. Download the ch9guidedexp.exp file and save it in your Guided Project folder (students’ folder). 2. Create a new project call it guided9 and save it in your Guided Project. 3. Associate the file ch9guidedexp.exp to your project guided9. 4. Choose Expression | Working Expression File and choose ch9guidedexp.exp. 5. Choose Expression | View Data. You should have two columns of data. 6. To log transform the data, click Expression| Manipulate Data | Transform 7. To fix the replicates you may choose Expression | Average Replicates (you may skip this step if you are using the downloaded ch9guidedexp.exp file). 4.1.5 Exploring gene expression 1. Choose Expression | Explore and click Gene Matching Criteria in order to select genes of interest from the expression file. 2. From the dialog box include in the presence of zinc. The log-transformed red/green ratio > 2.0. You should see that approximately 15 genes met these criteria. 3. To get the visual picture of the results click Plot Selected Group. You will see a graph with dot to represent each gene whose expression was at least 4 times higher in the presence of 3 nM zinc. 4. Click on a point what do you see?

Use Quizgecko on...
Browser
Browser