FBDA

Dr. Abdul Qader Abbady

Tags

bioinformatics biological foundations molecular biology

Summary

This document contains lecture notes on bioinformatics, specifically focusing on the biological foundations of this field. The notes cover topics including nucleic acids, proteins, DNA, RNA structure, the storage of genetic information, and the central dogma of molecular biology.

Full Transcript

# Lecture 02: Bioinformatics

## The Biological Foundations of Bioinformatics

**Dr. Abdul Qader Abbady**

**FBDA**

## Outline

- Nucleic Acids and Proteins
- Structure of the Nucleic Acids DNA and RNA
- The Storage of Genetic Information
- The Structure of Proteins
  - Primary Structure
  - Secondary Structure
  - Tertiary and Quaternary Structure
- Signal Peptides
- Transmembrane Proteins
- References

## Nucleic Acids and Proteins

- Nucleic acids and proteins are two important classes of macromolecules that play crucial roles in nature and form the basis of all life.
- Deoxyribonucleic acid (DNA) is the carrier of genetic information.
- Ribonucleic acid (RNA) is involved in the biosynthesis of proteins.
- Proteins control the cellular processes of life.

### The basic monomer constituents:

- **Nucleic acids**: nucleotides
- **Proteins**: amino acids

## Structure of DNA and RNA

- The general structure of nucleotides is the same in DNA and RNA (Alberts et al., 2014).

### Nucleotides consist of:

- a pentose
- a phosphoric acid residue
- a heterocyclic base

- In a DNA or RNA strand, nucleotides are linked via chemical bonds between the pentose sugar of one nucleotide and the phosphoric acid residue of the next.

## Structure of DNA and RNA

- The basic framework of nucleic acids is a polynucleotide in which the phosphoric acid forms ester bonds with the 3' OH group of the sugar residue of one nucleotide and the 5' OH group of the sugar of the next nucleotide (a phosphodiester linkage).

### Of the polynucleotide chain:

- At one end, a phosphate group is connected to the 5' oxygen of a pentose sugar.
- At the other end, the 3' hydroxyl group of the pentose sugar is free.

## Structure of DNA and RNA

- Each unit of the basic ribose/phosphoric acid residue structure carries a heterocyclic nucleobase that is connected to the sugar residue via an N-glycosidic linkage.
### The nucleic acids consist of five different bases:

- cytosine, uracil, thymine, adenine, and guanine
- uracil occurs only in RNA
- thymine occurs only in DNA

## Structure of DNA and RNA

- DNA and RNA differ not only in their bases but also in the chemical composition of their sugar residues.
- In RNA, the sugar is a ribose.
- In DNA, the sugar is 2-deoxyribose.

## Structure of DNA and RNA

- DNA consists of two nucleotide strands that combine in an antiparallel orientation so that hydrogen bonds are formed between the bases of each strand, resulting in a ladder-like structure.
- Nucleotides may be abbreviated using the first letter of the corresponding base, and their succession indicates the nucleotide sequence of the nucleic acid strand.

## The Storage of Genetic Information

- DNA consists of four nucleotides that store genetic information. The base sequence is the only variable element on the nucleotide strand, encoding the necessary information to generate proteins.
- Proteins are composed of 20 different amino acids in varying amounts. Each amino acid is encoded by a triplet of bases, termed a codon.
- Information flows from DNA to RNA to proteins.

## The Storage of Genetic Information

- Doublet codons give 4² = 16 possible combinations. This is insufficient to encode 20 amino acids.
- Triplet codons give 4³ = 64 possibilities, which is more than necessary to encode 20 amino acids.
- Information flows from DNA to RNA to proteins.

## The Storage of Genetic Information

- From these theoretical calculations, one can infer that an individual amino acid may be encoded by more than one codon. Therefore, the resulting genetic code is described as being degenerate.
- This code is redundant but unambiguous: several codons may specify the same amino acid, but each codon specifies only one amino acid.
- Information flows from DNA to RNA to proteins.

## The Storage of Genetic Information

- The genetic code in this figure applies almost universally to all living organisms. Some exceptions can be found in mitochondria and ciliates.
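The codon arithmetic above can be checked with a short Python sketch. It enumerates all doublets and triplets over the four DNA bases, counts how many codons map to each amino acid, and translates a toy sequence. The 64-letter string is the standard genetic code written in the conventional T, C, A, G ordering; the names `BASES`, `CODON_TABLE`, and `translate`, and the example sequence, are illustrative choices that do not come from the lecture.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"  # conventional ordering used for codon tables (DNA alphabet)

# Standard genetic code, one letter per codon in TCAG x TCAG x TCAG order.
# '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

doublets = ["".join(p) for p in product(BASES, repeat=2)]
triplets = ["".join(p) for p in product(BASES, repeat=3)]
CODON_TABLE = dict(zip(triplets, AMINO_ACIDS))

# Degeneracy: how many codons encode each amino acid (or a stop signal).
degeneracy = Counter(CODON_TABLE.values())


def translate(coding_dna):
    """Translate a coding-strand DNA sequence codon by codon until a stop codon."""
    protein = []
    for i in range(0, len(coding_dna) - 2, 3):
        amino_acid = CODON_TABLE[coding_dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(len(doublets), len(triplets))   # 16 64
print(degeneracy["L"], degeneracy["M"], degeneracy["*"])  # 6 1 3
print(translate("ATGGCTTGGTAA"))      # MAW
```

Leucine is reached by six codons while methionine has only one, which illustrates the degenerate but unambiguous nature of the code: the mapping from codon to amino acid is a function, but it is not invertible.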
## The Storage of Genetic Information

- Genetic information is encoded in the DNA as the sequence of its bases.
- This information is transferred to messenger RNA (mRNA) during transcription; the pairing of complementary bases ensures that the transfer is unambiguous.
- The final process of building proteins from mRNA is called translation.

## The Storage of Genetic Information

- The central dogma of molecular biology states that the flow of information always proceeds from the genome to the proteome.
- It does not proceed in the reverse direction.
- Exceptions are the reactions catalyzed by the reverse transcriptase and the replicase of RNA viruses.

## The Storage of Genetic Information

- The organization of a gene region differs between prokaryotes and eukaryotes.
- In prokaryotes, gene information is encoded on a continuous DNA stretch.
- In eukaryotes, coding exons are interrupted by noncoding introns (Krebs et al., 2014).

## The Storage of Genetic Information

- Eukaryotic transcription of DNA to mature mRNA (containing information derived only from exons) requires several steps. The introns are removed during the process of splicing.

## The Storage of Genetic Information

- Through alternative splicing, different mRNAs and different proteins can result from one gene.
- Alternative splicing, among other mechanisms, explains why the human genome contains relatively few genes compared with the greater number of proteins actually produced (Claverie 2001; Venter et al. 2001).

## The Structure of Proteins

- The structure of proteins is described at four levels: primary, secondary, tertiary, and quaternary structure.

## The Structure of Proteins

- Proteins are macromolecules that are composed of the 20 naturally occurring amino acids.
- The primary structure is the amino acid sequence.
- Under physiological conditions, proteins fold into characteristic three-dimensional structures that dictate their biological properties and functions (Berg et al., 2015).

## The Structure of Proteins

- The common configuration of natural amino acids is characterized by an amino group, a carboxyl group, and a side chain arranged around a central alpha-carbon atom.

## The Structure of Proteins

- Differences in the side chains distinguish the various amino acids.

## The Structure of Proteins

- The main L-amino acids with three-letter and one-letter codes.
- The colored lines group amino acids with similar properties.

### Properties:

- aliphatic side chains (gray)
- acids and their amides (red)
- basic side chains (blue)
- with a hydroxyl group (magenta)
- aromatic side chains (orange)

## Proteins Primary Structure

- The side chain of each amino acid determines its chemical properties. These can be hydrophobic, polar, acidic, or basic.
- Because there are just 20 amino acids, denatured, unfolded proteins have very similar properties that correspond essentially to a homogeneous cross section of randomly distributed side chains.
- The different properties of functional proteins are based on the three-dimensional conformation (folding) of the protein.

## Proteins Primary Structure

- Peptide bonds connect individual amino acids in a polypeptide chain. Each amino acid is linked via the acid amide bond of its alpha-carboxyl group to the alpha-amino group of the next amino acid. Consequently, polypeptides have free N- and C-termini.
- The chain formed by these linked main-chain atoms is called the protein backbone.
- The primary structure of a polypeptide, i.e., the amino acid sequence from the N- to the C-terminus, can contain between three and several hundred amino acids.
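As a minimal sketch of how a primary structure can be handled computationally, the Python snippet below treats a sequence as a string written from the N- to the C-terminus, counts the peptide bonds, and tallies residues by the side-chain classes named above. The class membership is an assumption based on common textbook groupings, since the figure with the colored groups is not reproduced in these notes; cysteine, methionine, and proline are placed in a hypothetical `other` class. The function name and the example peptide are also illustrative.

```python
from collections import Counter

# Assumed class membership (one-letter codes); the lecture figure may group
# some residues differently.
SIDE_CHAIN_CLASSES = {
    "aliphatic": "GAVLI",
    "acid_or_amide": "DENQ",
    "basic": "KRH",
    "hydroxyl": "ST",
    "aromatic": "FYW",
    "other": "CMP",  # hypothetical catch-all, not a class from the notes
}
CLASS_OF = {
    aa: name for name, members in SIDE_CHAIN_CLASSES.items() for aa in members
}


def describe_primary_structure(sequence):
    """Summarize a non-empty one-letter sequence written N- to C-terminus."""
    return {
        "n_terminus": sequence[0],
        "c_terminus": sequence[-1],
        "residues": len(sequence),
        "peptide_bonds": len(sequence) - 1,  # n residues are joined by n - 1 bonds
        "class_counts": dict(Counter(CLASS_OF[aa] for aa in sequence)),
    }


summary = describe_primary_structure("MKTAYIAKQR")  # made-up peptide
print(summary)
```

A chain of n residues always contains n - 1 peptide bonds, which is why the function reports nine bonds for the ten-residue example.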
## Proteins Secondary Structure

- The secondary structure describes the ordered folding patterns of a polypeptide chain into regular helices (alpha-helix), sheet structures (beta-strand), and irregular elements (turns).
- Turns are built up from three to six amino acids and cover a large conformational space of the protein backbone.
- Loops are another structural element that consists of multiple turns, connecting helices and sheets.

## Proteins Secondary Structure

- Turns are important for protein globularity, since helices and sheets are linear structural elements.
- These three secondary structure elements (alpha-helix, beta-strand, loops) represent the building blocks of the three-dimensional folding pattern of proteins (Koch and Klebe 2009).

## Proteins Secondary Structure

- The key to understanding these more complex structures lies in the geometric properties of the peptide group.
- Linus Pauling and Robert Corey demonstrated in the 1930s and 1940s that the peptide bond is a rigid, planar structure, which can be attributed to the approximately 40% double-bond character of the peptide bond.
- Accordingly, a polypeptide chain can be regarded as a sequentially linked chain of rigid and planar peptide groups.

## Proteins Secondary Structure

- The chain conformation of a polypeptide can therefore be described by the torsion angles around the bonds at the Cα atom: the Cα–N bond (Φ, phi) and the Cα–C bond (Ψ, psi).
- In the planar and fully stretched (all-trans) conformation, all angles are 180°.
- Viewed from the Cα atom, the angles increase with a clockwise rotation.

## Proteins Secondary Structure

- Not all conceivable values for Φ and Ψ are possible, however, owing mainly to steric hindrance caused by the side chains of the amino acids.

## Proteins Secondary Structure

- A Ramachandran plot is a conformation chart of those values that are sterically possible for Φ and Ψ.
- Areas in the Ramachandran plot that correspond to sterically possible values of the angles Φ and Ψ are called permissible areas.
- Those corresponding to values that are not possible are called forbidden areas.

## Proteins Secondary Structure

- Ramachandran plot of the transcription regulator protein GAL4 from Saccharomyces cerevisiae. The amino acids are represented as small black squares. Almost all amino acids lie in preferred, permissible areas (red and yellow).
- Two amino acids (LYS23 and ARG63) are found in slightly forbidden areas of the Ramachandran plot. This means that the combination of the values for Φ and Ψ should theoretically not be possible owing to the steric hindrance of the neighboring side chains, yet it is observed in practice.
- The plot was generated with the program PROCHECK (Laskowski et al. 1993; Rullmann 1996).

## Secondary Structure, alpha-helix

- The polypeptide chain of an alpha-helix displays a pitch of 0.54 nm with 3.6 residues per turn.
- Alpha-helices are stabilized by hydrogen bonds.
- The hydrogen bonds in a helix form within a local part of the polypeptide chain (between the backbone C=O group of residue i and the N-H group of residue i+4), not between neighboring strands.

## Secondary Structure, beta-sheets

- Beta-sheets are stabilized by hydrogen bonds between neighboring beta-strands.
- In beta-strands, each successive side chain is on the opposite side of the plane of the sheet, with a repetition unit of two residues at a distance of 0.7 nm.

## Secondary Structure/alpha-helices and beta-sheets

- Beta-sheets exist in both parallel and antiparallel forms, depending on the relative direction of the hydrogen-bonded polypeptide strands.

## Secondary Structure/alpha-helices and beta-sheets

- Approximately half of a globular protein consists of alpha-helices and beta-sheets. The rest of the protein consists of nonrepetitive turns.
- Turns are responsible for the globularity of proteins, since they allow a large number of different conformations.
- Overall, 158 different conformations of the protein backbone are described for turns (Koch and Klebe 2009).
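The helix and strand dimensions quoted above imply a rise per residue, which makes it easy to estimate how long an idealized segment is. The following Python sketch works through that arithmetic using only the numbers from the notes (0.54 nm pitch, 3.6 residues per turn, 0.7 nm per two-residue repeat); the constant and function names are illustrative, and real segments deviate from these ideal values.

```python
HELIX_PITCH_NM = 0.54            # rise along the helix axis per full turn
HELIX_RESIDUES_PER_TURN = 3.6
STRAND_REPEAT_NM = 0.7           # length of the two-residue repeat in a beta-strand
STRAND_RESIDUES_PER_REPEAT = 2

# Derived quantities
helix_rise_nm = HELIX_PITCH_NM / HELIX_RESIDUES_PER_TURN           # about 0.15 nm
helix_rotation_deg = 360.0 / HELIX_RESIDUES_PER_TURN               # about 100 degrees
strand_rise_nm = STRAND_REPEAT_NM / STRAND_RESIDUES_PER_REPEAT     # about 0.35 nm


def helix_length_nm(n_residues):
    """Idealized axial length of an alpha-helix with n_residues residues."""
    return n_residues * helix_rise_nm


def strand_length_nm(n_residues):
    """Idealized length of an extended beta-strand with n_residues residues."""
    return n_residues * strand_rise_nm


print(round(helix_rise_nm, 3), round(helix_rotation_deg, 1), round(strand_rise_nm, 3))
print(round(helix_length_nm(20), 2), round(strand_length_nm(20), 2))
```

For 20 residues the sketch gives roughly 3 nm for a helix and 7 nm for a strand, so the same stretch of sequence spans less than half the distance when it is coiled into a helix.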
## Tertiary Structure

- The tertiary structure describes the three-dimensional arrangement and placement of secondary structural elements.
- Large polypeptide chains (>200 amino acids) frequently fold into several units termed domains. Normally such domains are composed of 100-200 amino acids with a diameter of approx. 2.5 nm.
- The tertiary structure specifies the protein properties.

## Tertiary Structure

- The structure of the protein is stabilized through the compaction of secondary structural elements and interactions between the amino acids of those elements.
- The amino acid interactions include hydrogen bonds between peptide groups, disulfide bonds between cysteine residues, ionic bonds between charged groups of amino acid side chains, and hydrophobic interactions.

## Quaternary Structure

- The quaternary structure is the arrangement of several polypeptide subunits.
- These are associated in a specific geometry so that a symmetrical complex is formed.
- The assembly of the individual subunits is carried out through noncovalent interactions.

## Signal Peptides

- For many proteins, the site of synthesis is not the site of action. This applies to transmembrane proteins, proteins within the endoplasmic reticulum, and proteins that are secreted or imported into lysosomes.
- Prior to their activation, these proteins must first be transported to the site of action, and this is facilitated by a peptide recognition signal for the cellular transport system.
- The recognition signal is an N-terminal leader sequence (signal peptide) of approx. 15–30 amino acids that precedes the N-terminus of the mature protein.

## Signal Peptides

- According to the signal hypothesis of Günter Blobel and David Sabatini (Blobel and Sabatini 1971), the signal peptide is recognized by a signal recognition particle, which guides the nascent polypeptide chain through the membrane of the endoplasmic reticulum.
- As soon as the signal peptide has passed the membrane, it is specifically cleaved from the nascent polypeptide by a signal peptidase.

## Signal Peptides

- Proteins with a signal peptide are called preproteins, or, in those cases where they also contain propeptides, preproproteins.
- Unlike signal peptides, which are cleaved off during membrane translocation, propeptides are proteolytically removed to activate the protein.

## Signal Peptides

- Schematic illustration of a preproprotein exemplified by cysteine proteases of the papain family. The amino acids of the catalytic triad (Cys25, His159, and Asn175) are each located within the characteristic sequence motifs of cysteine proteases (M1–M3).
- Only a few cysteine proteases have an additional C-terminal extension, the function of which is still not known.

## Signal Peptides

- The presence of a signal peptide gives an important clue as to the site of action of a protein. This knowledge in turn can help clarify function and thus help in determining whether that protein is a suitable target molecule. For this reason, methods for predicting the presence of signal peptides in the primary structure have been developed.
- An example is the program SignalP from the Center for Biological Sequence Analysis (CBS) at the Technical University of Denmark (Petersen et al. 2011).

## Signal Peptides

- The recognition of signal peptides by the signal recognition particle is not due to a conserved amino acid sequence but depends on physicochemical properties.
- A signal peptide usually consists of three parts: the n-region (1–5 usually positively charged amino acids), the h-region (5–15 hydrophobic amino acids), and the c-region (3–7 polar but mostly uncharged amino acids).

## Signal Peptides

- A classical sequence alignment method is therefore unsuitable for the prediction of signal peptides.
  The SignalP program in its fourth version is instead based on neural networks, which are machine learning methods that learn the characteristics of a training data set of known sequences and apply them to the prediction of unknown data.
- The trained neural networks are thus able to judge the properties of amino acids in unknown sequences, thereby allowing the recognition of signal sequences.

## Signal Peptides

- Before the analysis is started, it is important to choose the right organism group, which can be Gram-negative bacteria, Gram-positive bacteria, or eukaryotes.
- The C-score (cleavage site score) comes from a network trained to recognize the cleavage site of SPase I between the signal peptide and the mature protein sequence. The maximum C-score occurs at the position of the first amino acid of the mature protein, that is, one position after the cleavage site.

## Signal Peptides

- The S-score (signal peptide score) comes from a network trained to distinguish signal peptides from other sequences. The S-score is high if the corresponding amino acid is part of the signal peptide. Therefore, amino acids of the mature protein have a low S-score.

## Signal Peptides

- The Y-score (combined cleavage site score) is a geometric mean of the C-score and the gradient of the S-score. It shows where the C-score is high and the S-score has its inflection point. Analysis of the three scores shows the likely cleavage site between amino acids 21 and 22.

## Transmembrane Proteins

- Biological membranes contain integral proteins that have various functions in the cell, such as acting as cell-surface receptors.
- Integration into the membrane lipid bilayer is accomplished by hydrophobic interactions between the protein and the nonpolar chain structures of the lipids.
- The polar head groups of the lipids form hydrogen bonds and ionic bonds with the protein.
- Integral membrane proteins are therefore always amphiphilic molecules that have both hydrophilic and lipophilic regions.

## Transmembrane Proteins

- These proteins are orientated asymmetrically in the membrane. Some membrane proteins are only exposed on one side of the membrane, whereas others completely penetrate the membrane and are exposed on both the extracellular and intracellular sides. The latter are called transmembrane proteins.
- The hydrophobic transmembrane domains are usually formed by alpha-helices.

## Transmembrane Proteins

- The prediction of transmembrane proteins is of great importance for classification and for defining function, as described previously for signal peptides.
- The program TMHMM [tmhmm] of the CBS server in Denmark can predict transmembrane domains.

## Transmembrane Proteins

- TMHMM is based on a hidden Markov model (HMM) that has been trained to detect hydrophobic transmembrane helices. It also predicts the orientation of the individual domains in the membrane, which can be either intracellular or extracellular.

## Transmembrane Proteins

- The graphical output of such a prediction with TMHMM is shown for the transmembrane domains of the G protein-coupled receptor (GPCR) 5-hydroxytryptamine-1B receptor of the mole rat Spalax leucodon ehrenbergi (5H1B-SPAEH).

## Transmembrane Proteins

- Such GPCRs are integral membrane proteins with typically seven transmembrane domains. In the graph, the probability of a transmembrane helix and its intracellular or extracellular localization is plotted along the amino acid sequence.
- Additionally, in the upper part of the figure, a schematic representation of the topology is inserted. The graphical representation of the probabilities also allows the recognition of transmembrane helices of relatively low likelihood.
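To give a feel for why hydrophobic stretches stand out in a sequence, here is a minimal Python sketch of a sliding-window hydropathy scan. This is not TMHMM and not a hidden Markov model. It is a much simpler, older heuristic that uses the Kyte-Doolittle hydropathy scale, and the window length of 19 and cutoff of 1.6 are conventional values for that heuristic rather than parameters taken from the lecture. The function names and the test sequence are illustrative assumptions.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_hydropathy(sequence, window=19):
    """Mean hydropathy of every full window along the sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def hydrophobic_cores(sequence, window=19, threshold=1.6):
    """Return inclusive (first, last) ranges of 0-based window centers whose
    mean hydropathy reaches the threshold; adjacent centers are merged."""
    half = window // 2
    cores = []
    for i, mean in enumerate(window_hydropathy(sequence, window)):
        if mean < threshold:
            continue
        center = i + half
        if cores and center == cores[-1][1] + 1:
            cores[-1] = (cores[-1][0], center)
        else:
            cores.append((center, center))
    return cores


# Synthetic example: a 20-residue leucine stretch between charged flanks.
toy = "DEKRDEKRDE" + "L" * 20 + "DEKRDEKRDE"
print(hydrophobic_cores(toy))        # [(14, 25)]
print(hydrophobic_cores("DEKR" * 10))  # []
```

The same functions could be applied to a real sequence such as the KLRG2 entry listed in these notes, after removing the trailing `*` stop marker, which is not in the scale. A scan like this only flags candidate hydrophobic segments; unlike TMHMM, it says nothing about the orientation of the protein in the membrane.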
## Transmembrane Proteins

- KLRG2:
  MEPPQVPAEAPQPRASEDSPRPERTGWEEPDAQPQELPEKSPSPALSGSPRVPPLSLGYGAFRRLGSCSRELPSPSPSWAEQPRDGEAELEP
  WTASGEPAPASWAPVELQVDVRVKPVGAAGASRAPSPAPSTRFLTVPVPESPAFARRSAPTLQWLPRAPSPGSTWSRGSPLAANATESVSPA
  EGCMVPPGSPACRCRCREPGLTKEDDALLQRAGIDGKKLPRAITLIGLPQYMKSLRWALVVMAVLLAVCTVAVVALASRGGTKCQPCPQGWM
  WSQEQCYYLSEEAQDWEGSQAFCSAHHATLPLLSHTQDFLRKYRITKGSWVGARRGPEGWHWTDGVPLPSQLFPADSEDHPDFSCGGLEEG
  RLVALDCSSPRPWVCARETK*

## Analyses of Protein Structures

- As stated earlier, the prediction of a protein 3D structure from an amino acid sequence is currently not feasible and will not be feasible for the foreseeable future.
- Therefore, experimental methods must be employed to determine protein structures.
- The two primary approaches are X-ray crystallography and high-resolution nuclear magnetic resonance (NMR) spectroscopy.
- A third approach using the electron microscope is useful for large proteins.
- Overall, despite much technological progress, these methods are still very time-consuming and costly, and the successful resolution of a structure is not guaranteed for every protein.

## X-ray crystallography

- X-ray crystallography is a technique used to determine the three-dimensional structure of molecules. It is based on the diffraction of X-rays by the electrons in a crystal.

## High-resolution nuclear magnetic resonance (NMR) spectroscopy

- High-resolution nuclear magnetic resonance (NMR) spectroscopy is a technique used to determine the structure and dynamics of molecules. It is based on the interaction of the magnetic moments of atomic nuclei with an external magnetic field.

## The Electron Microscope

- The electron microscope is a type of microscope that uses a beam of electrons to illuminate a sample and produce an image. It has much higher resolution than a light microscope, allowing scientists to visualize objects as small as a few atoms.
## Protein Data Bank (PDB)

- The PDB is a database of experimentally determined crystal structures of biological macromolecules, coordinated by a consortium located in the USA, Europe, and Japan (Berman et al. 2000).
- Probably the best-known web page of the PDB is that of the Research Collaboratory for Structural Bioinformatics (RCSB).
- The PDB was founded at the Brookhaven National Laboratory in 1971, which is reflected in the frequent use of the name Brookhaven Protein Data Bank.

## Protein Data Bank (PDB)

- About 175,000 macromolecular structures are stored in the PDB database (as of March 2021). These are predominantly proteins but also include DNA and RNA structures and protein-nucleic acid complexes.
- Structures of other macromolecules, for example glycopeptides and polysaccharides, constitute only a very small proportion of the total structures.
- Since 2002, only structures that have been solved experimentally are stored in the PDB database, whereas data of theoretical protein models are kept in their own section [pdb-models].

## Molecular Structure Databases

- The PDB database offers several query options on the main page, including a text-based search and a search by PDB ID or keyword. Further options are available on the search database page, including detailed keyword and BLAST queries.
- A database record summarizes all of the information in the file, which is then detailed on subsequent pages.
- In addition, the molecular structure can be visualized by means of different applets.

## References

- Alberts B, Johnson A, Lewis J, Morgan D, Raff M, Roberts K, Walter P (2014) Molecular Biology of the Cell. Garland Science, New York
- Berg JM, Tymoczko JL, Gatto GJ, Stryer L (2015) Biochemistry, 8th edn. W. H. Freeman
- Claverie JM (2001) What if there are only 30000 human genes? Science 291:1255-1256
- Crick F (1970) Central dogma of molecular biology.
  Nature 227:561-563
- Koch O, Klebe G (2009) Turns revisited: a uniform and comprehensive classification of normal, open, and reverse turn families minimizing unassigned random chain portions. Proteins 74:353-367
- Krebs JE, Goldstein ES, Kilpatrick ST (2014) Lewin's Genes XI. Jones & Bartlett Learning, Burlington
- Laskowski RA, MacArthur MW, Moss DS, Thornton JM (1993) PROCHECK: a program to check the stereochemical quality of protein structures. J Appl Crystallogr 26:283-291
- Rullmann JAC (1996) AQUA, Computer program. Department of NMR Spectroscopy, Bijvoet Center for Biomolecular Research, Utrecht University
- Venter JC, Adams MD, Myers EW, Li PW, Mural RJ et al (2001) The sequence of the human genome. Science 291:1304-1351
- Watson JD, Crick FHC (1953a) Molecular structure of nucleic acids. Nature 171:737-738
- Watson JD, Crick FHC (1953b) Genetical implications of the structure of deoxyribonucleic acid. Nature 171:964-967

## Further Reading

- Amino acids: https://en.wikipedia.org/wiki/Amino_acid
- Biochemistry: https://en.wikipedia.org/wiki/Biochemistry
- NCBI Books: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Books
- Protein structures: http://www.rcsb.org/

## Thank You
