LEC 1 Bioinformatics PDF
Ain Shams University
Ass. Prof. Ghadir El-Housseiny
Summary
This document is an introductory lecture to bioinformatics. It highlights the importance of bioinformatics in biological research and covers topics like the types of biological data, various methods in bioinformatics, history, and overall aims of the subject.
By: Ass. Prof. GHADIR EL-HOUSSEINY
Microbiology and Immunology

Overall Aim of the Course
The course aims to provide an introduction to the field of bioinformatics, with a focus on important bioinformatics tools and resources. It combines theoretical and practical sessions so that students gain practical experience in using various tools and resources.
- Students will be able to search and retrieve information from genomic and proteomic databases (e.g. GenBank, Swiss-Prot), and to analyze their search results using software available on the internet.
- Students will be able to locate sequences, genes and open reading frames within biological sequences, design primers for detection, and align sequences in databases to determine the degree of matching. They will be introduced to phylogenetic analysis.
- Students will be able to perform elementary predictions of protein structure and function.

COURSE CONTENT:
1. An introduction to bioinformatics
2. Biological databases and resources
3. DNA analysis: open reading frame (ORF) analysis
4. Primer design: for detection and for cloning
5. Sequence alignment: BLAST search analysis; MSA
6. Protein sequence analysis and structure prediction
7. Phylogenetic analysis

1-INTRODUCTION TO BIOINFORMATICS
OUTLINE:
- DEFINITION
- HISTORY
- IMPORTANCE
- COMPONENTS
- REVISION ON BASIC CONCEPTS
- APPLICATIONS OF BIOINFORMATICS

What is Bioinformatics:
Bioinformatics is a scientific subdiscipline that uses computer technology to generate, analyze, store, access and disseminate biological data or information (DNA and amino acid sequences).
It is an interdisciplinary field of science that combines biology, chemistry, computer science, information engineering, mathematics and statistics to analyze and interpret biological data.

HISTORY
- David J. Lipman, former director of the NCBI, called Margaret Dayhoff ‘the mother and father of bioinformatics’.
- European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Data Library established at the European Bioinformatics Institute (EBI)
- GenBank at the NCBI (National Center for Biotechnology Information)
- The term “bioinformatics” was coined by Paulien Hogeweg and Ben Hesper in 1978 to mean the study of informatic processes in biotic systems.
NOW, bioinformatics can be regarded as computational molecular biology, which uses computational techniques to study the structure, function, regulation, and interactive network of genes and proteins.

BIOINFORMATICS vs COMPUTATIONAL BIOLOGY
Computational biology is an umbrella term that includes any subdiscipline in biology that uses computer-aided analysis, modeling, and prediction. One example is the modeling of predator-prey relationships in an ecosystem.
In contrast, bioinformatics can be regarded as computational molecular biology.
Therefore, computational biology is much broader in scope, and bioinformatics is a part of it.

Importance of bioinformatics
The field of bioinformatics experienced explosive growth starting in the mid-1990s, driven largely by the Human Genome Project and by rapid advances in DNA sequencing technology.
By 2003, the project had mapped around 85% of the human genome. Work continued, however, and by 2021 the "complete genome" level was reached.
As we collect more and more biodata, bioinformatics will be essential to any scientific discovery. Without bioinformatics and the ability to apply computer science tools to big data, understanding biodata and drawing conclusions from it would be very hard.
All life-science professionals need to know at least the basics of bioinformatics!

The main goals of bioinformatics are:
The primary goal: to increase the understanding of biological processes and to be able to predict biological processes in health and disease. What sets it apart from other approaches, however, is its focus on developing and applying computationally intensive techniques to achieve this goal.
The aims of bioinformatics to reach this goal are three-fold:
- Organize biodata so that it becomes easily accessible and searchable
- Develop software to help analyze biodata
- Analyze and accurately interpret biodata in a biologically meaningful manner

Components of bioinformatics are:
1. Biological data
2. Biological databases
3. Analysis tools

1. Biological Data:
Information derived from living organisms and their products, fed into computers for processing. They include: nucleic acid sequences (DNA, RNA), protein sequences, protein structures, and literature.
The most comprehensive way of obtaining information about the genome of any living organism is to determine the precise order of nucleotides in its complete DNA sequence, a process known as sequencing.
The earlier, traditional method used for DNA sequencing (Sanger’s chain termination method) is quite expensive and time consuming.
Demand for low-cost and highly efficient sequencing gave rise to “Next-Generation Sequencing (NGS)”: high-throughput sequencing technologies that can simultaneously sequence millions or billions of DNA molecules.

2. Biological Databases
A database (DATABANK) is an organized collection of data stored and accessed electronically. A biological database is an organized collection of biological data. Most are public.

3. Analysis Tools
Software programs (computer algorithms and statistics) designed to extract meaningful information from the raw data in biological databases.
Major categories:
- Homology and Similarity tools: used to identify similarities between novel query sequences (of unknown structure and function) and database sequences whose structure and function have been elucidated, e.g. BLAST.
- Protein Function Analysis tools: compare your protein sequence to the secondary protein databases that contain information on motifs, signatures and protein domains, allowing you to approximate the function of your query protein.
- Structural Analysis tools: compare structures with the known structure databases. The function of a protein is more directly a consequence of its structure than of its sequence.
- Sequence Analysis tools: allow you to carry out further, more detailed analysis of your query sequence, including evolutionary analysis and identification of mutations.

Revision: Gene Features on the Prokaryotic Genome
Gene: the functional unit of genetic information responsible for a given trait in an organism. All life forms contain genes.
Genome: the sum of genetic elements that make up the total genetic information in a cell.
The nucleoid (meaning nucleus-like) is an irregularly shaped region within the prokaryotic cell that contains all or most of the genetic material. In contrast to the nucleus of a eukaryotic cell, it is not surrounded by a nuclear membrane.
Prokaryotic genomes generally consist of a single circular double-stranded DNA molecule.
[Figure: DNA and gene]

Structure of Nucleic acids (DNA, RNA): polynucleotides
[Figure: structure of a polynucleotide, labelling the purine or pyrimidine base and the nucleoside]

Important features of DNA structure
1. The DNA molecule consists of a double helix of polynucleotides.
2. The strands are held together by hydrogen bonds.
3. The two strands of DNA are complementary: G pairs with C (3 H bonds) and A pairs with T (2 H bonds).
4. The two strands are arranged in an antiparallel fashion.

Bacterial chromosome:
DNA: linear chromosomes in eukaryotes, circular chromosomes in prokaryotes.
In bacteria: 90% of the DNA encodes proteins and 10% is noncoding (humans??)

Gene expression: It involves two distinct processes. The DNA is first transcribed by RNA polymerase into messenger RNA (mRNA). Ribosomes then attach to the ribosome binding sites on the mRNA and translate the encoded information into a linear polypeptide.
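The base-pairing and antiparallel rules listed under the features of DNA structure, and the transcription step of gene expression, can be illustrated with a short Python sketch. The function names and the example sequence are hypothetical, chosen only for illustration, and the sketch assumes an unambiguous DNA string containing only A, C, G and T.

```python
# Complement pairs follow the base-pairing rule: A with T, G with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(dna: str) -> str:
    """Return the complementary strand written 5'->3' (antiparallel)."""
    return "".join(COMPLEMENT[base] for base in reversed(dna.upper()))


def transcribe(coding_strand: str) -> str:
    """Return the mRNA matching the coding strand (T replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def hydrogen_bonds(dna: str) -> int:
    """Count hydrogen bonds in the duplex: 3 per G-C pair, 2 per A-T pair."""
    return sum(3 if base in "GC" else 2 for base in dna.upper())


example = "ATGGCC"  # hypothetical toy sequence
print(reverse_complement(example))  # GGCCAT
print(transcribe(example))          # AUGGCC
print(hydrogen_bonds(example))      # 16
```

Writing the complement in reverse order is what keeps both strands readable in the 5'->3' direction, which is how sequences are conventionally stored in databases.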
Generally, prokaryotic genes are clustered together as operons.
[Figure: operon regulation, showing the activator protein, repressor protein, inducer and corepressor]

Operon: a functioning unit of DNA containing a cluster of adjacent structural genes, under the control of a single promoter and regulated by a common operator. The genes are transcribed together into a single mRNA strand and translated in the cytoplasm into more than one protein.
An operon is made up of 3 basic DNA components:
- Promoter: a regulatory nucleotide sequence that acts as the binding site for RNA polymerase, which then initiates transcription.
- Operator: a segment of DNA to which a repressor binds. The repressor protein physically obstructs the RNA polymerase from transcribing the genes.
- Structural genes: the genes encoding the amino acid sequences of the proteins that are co-regulated by the operon. All the structural genes of an operon are turned ON or OFF together.
A transcription terminator is a section of nucleic acid sequence that marks the end of a gene or operon and mediates transcriptional termination.

A regulatory gene is not always included within the operon, but it is important in its function.
Regulatory gene: a gene involved in controlling the expression of one or more other genes, that is, a gene involved in turning the transcription of structural genes on or off.
- E.g. a gene that codes for a repressor protein that inhibits transcription.
- E.g. genes that code for activator proteins, which bind to an "activator-binding site" and cause an increase in transcription of a nearby gene.
An inducer (small molecule) can displace a repressor (protein) from the operator site (DNA), resulting in an uninhibited operon.
Alternatively, a corepressor can bind to the repressor to allow its binding to the operator site.

On mRNA:
Start codon (AUG, which both codes for methionine and serves as an initiation site) indicates where translation may start.
Stop codon (UAA, UAG, UGA) is a codon that signals the termination of translation of the current protein.
A ribosome binding site (RBS) is a sequence of nucleotides upstream of the start codon of an mRNA transcript that is responsible for recruiting a ribosome during the initiation of translation.

Genetic code: the exact sequence of DNA nucleotides, read as codons, that determines the sequence of amino acids in protein synthesis.
The genetic code can be expressed as either RNA codons or DNA codons.
The genetic code is degenerate because several different codons can specify the same amino acid.
[Figure: a series of codons in part of a messenger RNA (mRNA) molecule. This mRNA molecule will instruct a ribosome to synthesize a protein according to this code.]

Standard genetic code
Nearly all genes are encoded with a single scheme, called the standard genetic code, which is used to translate nucleotide triplets into the corresponding amino acids.
[Figure: standard DNA genetic code]
[Figure: the standard RNA codon table organized in a wheel]

Inverse table for the standard genetic code (compressed using IUPAC notation)
The inverse table can be used to deduce a possible triplet code if the amino acid is known.
Amino acid | DNA codons | Compressed
Ala, A | GCT, GCC, GCA, GCG | GCN
Arg, R | CGT, CGC, CGA, CGG; AGA, AGG | CGN, AGR; or CGY, MGR
Asn, N | AAT, AAC | AAY
Asp, D | GAT, GAC | GAY
Asn or Asp, B | AAT, AAC; GAT, GAC | RAY
Cys, C | TGT, TGC | TGY
Gln, Q | CAA, CAG | CAR
Glu, E | GAA, GAG | GAR
Gln or Glu, Z | CAA, CAG; GAA, GAG | SAR
Gly, G | GGT, GGC, GGA, GGG | GGN
His, H | CAT, CAC | CAY
Ile, I | ATT, ATC, ATA | ATH
Leu, L | CTT, CTC, CTA, CTG; TTA, TTG | CTN, TTR; or CTY, YTR
Lys, K | AAA, AAG | AAR
Met, M | ATG |
Phe, F | TTT, TTC | TTY
Pro, P | CCT, CCC, CCA, CCG | CCN
Ser, S | TCT, TCC, TCA, TCG; AGT, AGC | TCN, AGY
Thr, T | ACT, ACC, ACA, ACG | ACN
Trp, W | TGG |
Tyr, Y | TAT, TAC | TAY
Val, V | GTT, GTC, GTA, GTG | GTN
START | ATG |
STOP | TAA, TGA, TAG | TRA, TAR

The nucleic acid notation was first formalized by the International Union of Pure and Applied Chemistry (IUPAC) in 1970. This universally accepted notation uses G, C, A, and T to represent the four nucleotides commonly found in deoxyribonucleic acids (DNA).
Degenerate base symbols are an IUPAC representation for a position in a DNA sequence that can have multiple possible alternatives.

IUPAC degenerate base symbols
Symbol | Description | Bases represented | No.
A | Adenine | A | 1
C | Cytosine | C | 1
G | Guanine | G | 1
T | Thymine | T | 1
U | Uracil | U | 1
W | | A, T | 2
S | | C, G | 2
M | | A, C | 2
K | | G, T | 2
R | | A, G | 2
Y | | C, T | 2
B | Not A | C, G, T | 3
D | Not C | A, G, T | 3
H | Not G | A, C, T | 3
V | Not T | A, C, G | 3
N | Any one base | A, C, G, T | 4
- | Gap | | 0

Major Applications in Bioinformatics
1. Sequence analysis
2. Genome annotation
3. Analysis of gene expression
4. Analysis of regulation
5. Analysis of protein expression
6. Analysis of mutations in cancer
7. Prediction of protein structure
8. Comparative genomics
9. Computational evolutionary biology

1. Sequence analysis
Sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to analytical methods to understand its features:
- Identification of intrinsic features of the sequence, such as reading frames and regulatory elements.
- Comparison of sequences in order to find similarity.
Methodologies used include sequence alignment.
Sequence alignment is a way of arranging sequences of DNA, RNA, or protein for comparison, to identify regions of similarity that may be a consequence of evolutionary relationships between the sequences.

2. Genome annotation
Genome annotation is the process of identifying the locations of genes and all of the coding regions in a genome and determining what those genes do.
Once a genome is sequenced, it needs to be annotated to make sense of it. This process needs to be automated because most genomes are too large to annotate by hand.
In 1995, a team at The Institute for Genomic Research performed the first complete sequencing and analysis of the genome of the bacterium Haemophilus influenzae.
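As a rough illustration of how automated annotation can begin to locate candidate genes, the sketch below scans the three forward reading frames of a DNA string for open reading frames, using the start codon (ATG) and stop codons (TAA, TAG, TGA) from the revision above. It is a deliberately naive sketch, not how production gene finders work: the function name `find_orfs`, the `min_codons` threshold and the toy sequence are assumptions made here for illustration, and the reverse strand is ignored.

```python
START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    start is 0-based, end is exclusive and includes the stop codon.
    min_codons counts the codons before the stop codon (hypothetical cutoff).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG of the ORF being read
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # resume searching after this stop codon
    return orfs


toy = "CCATGAAATAGGG"  # hypothetical sequence with one short ORF
for start, end, frame in find_orfs(toy):
    print(frame, start, end, toy[start:end])  # 2 2 11 ATGAAATAG
```

Real genomes need a much larger `min_codons` value to avoid reporting chance ORFs, and a full scan would also run on the reverse complement of the sequence.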
Owen White designed and built a software system, which used the gene prediction program GeneMark, to identify the genes encoding all proteins, tRNAs and rRNAs, and to make initial functional assignments.

3. Analysis of gene expression
The expression of many genes can be determined by measuring mRNA levels with multiple techniques, including microarrays.
These techniques are extremely noise-prone, and a major research area in bioinformatics involves developing statistical tools to separate signal from noise.
Such studies are often used to determine the genes implicated in a disorder: one might compare microarray data from cancerous epithelial cells to data from non-cancerous cells to determine the transcripts that are up-regulated and down-regulated in a particular population of cancer cells.
The output of DNA microarrays is a vast body of data that must be processed by computational tools to extract biologically meaningful results.

4. Analysis of regulation
Gene regulation is the complex orchestration of events by which a signal eventually leads to an increase or decrease in the activity of proteins.
Bioinformatics techniques have been applied to explore various steps in this process.
For example, promoter analysis involves the identification and study of sequences in the DNA surrounding the coding region of a gene. These sequences influence the extent to which that region is transcribed into mRNA.

5. Analysis of protein expression
mRNA is not always translated into protein. Proteomics confirms the presence of the protein and provides a direct measure of its quantity.
Protein microarrays and high-throughput (HT) mass spectrometry (MS) can provide a snapshot of the proteins present in a biological sample.
Bioinformatics is heavily involved in making sense of protein microarray and HT MS data; it deals with the problem of matching large amounts of mass data against protein sequence databases, and with the complicated statistical analysis of samples.

6. Analysis of mutations in cancer
In cancer, the genomes of affected cells are rearranged in unpredictable ways. Massive sequencing efforts are used to identify point mutations in a variety of genes in cancer.
Bioinformaticians continue to produce specialized automated systems and software to compare the sequencing results to the human genome sequence.
New technology is employed to identify single nucleotide polymorphisms.
dbSNP is a SNP database from NCBI which lists SNPs in humans.
The OMIM database describes the association between polymorphisms and diseases.

A single nucleotide polymorphism, or SNP (pronounced "snip"), is a variation at a single position in a DNA sequence that is present in more than 1% of a population.
If a SNP occurs within a gene, then the gene is described as having more than one allele. (For example, the G nucleotide may appear in most individuals, but in some the position is occupied by an A. The variations, G or A, are said to be the alleles for this specific position.)
Roughly 90% of the genetic variation that exists between humans is the result of SNPs.
Although the majority of variations do not alter cellular function and thus have no effect, some SNPs contribute to the development of diseases such as cancer.
SNPs are identified and characterized by sequencing the same genomic region in several populations.

Types of SNP:
SNPs in coding regions:
- Synonymous substitutions do not result in a change of amino acid in the protein, due to the degeneracy of the genetic code.
- Nonsynonymous substitutions change the amino acid sequence of the protein:
  - Missense: a single base change that results in a change of amino acid in the protein, which may cause the protein to malfunction and lead to disease.
  - Nonsense: a single base change that results in a premature stop codon and hence a nonfunctional protein product.
SNPs in non-coding regions: SNPs that are not in protein-coding regions may still affect transcription factor binding.
SNPs in non-coding regions can manifest in a higher risk of cancer and may affect mRNA structure and disease susceptibility. A SNP of this type that affects gene expression is referred to as an eSNP (expression SNP) and may be upstream or downstream from the gene.
[Figure: examples of MISSENSE and NONSENSE mutations]

7. Prediction of protein structure
Protein structure prediction is another important application of bioinformatics.
- Primary structure: the amino acid sequence of the polypeptide chain.
- Secondary structures are defined by patterns of hydrogen bonds: alpha helices and beta sheets.
- Tertiary structure refers to the three-dimensional structure of a single polypeptide chain; software tools: I-TASSER and AlphaFold.
- Quaternary structure: the aggregation of two or more individual polypeptide chains that operate as a single functional unit (multimer); programs: AlphaFold-Multimer.

8. Comparative genomics
Comparative genomics is the study of the interrelationships of the genomes of different species.
Whole genomes, or large parts of genomes, of different organisms are compared to study basic biological similarities and differences between organisms.
Comparative genomics has revealed high levels of similarity between closely related organisms, such as humans and chimpanzees, and, more surprisingly, between seemingly distantly related organisms, such as humans and the yeast Saccharomyces cerevisiae.
VISTA is a collection of databases and tools that permit extensive comparative genomics analyses.

9. Computational evolutionary biology
Evolution is the process through which populations and species change over successive generations.
Evolutionary biology is the study of evolution, that is, how species change over time; phylogenetics studies the evolutionary relationships among them.
Molecular evolution is the study of variation and evolution in the molecular components of a cell.
The key molecular aspect of evolution is sequence variation, which is detected by comparing DNA or protein sequences.
Bioinformatics has enabled researchers to trace the evolution of a large number of organisms by measuring changes in their DNA, using different computational tools, rather than through physical or physiological observations alone.
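One simple way to "measure changes in DNA" is to count the positions at which two already-aligned sequences differ and divide by the alignment length. The Python sketch below is a minimal illustration under that assumption: the species names and sequences are invented toy data, the function names are hypothetical, and real phylogenetic tools apply more elaborate substitution models on top of a proper alignment.

```python
from itertools import combinations


def differences(seq_a: str, seq_b: str):
    """Return the 0-based positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences."""
    return len(differences(seq_a, seq_b)) / len(seq_a)


# Hypothetical, pre-aligned toy sequences (not real data).
sequences = {
    "species_A": "ATGGCCATTG",
    "species_B": "ATGGCTATTG",
    "species_C": "ATGACTATAG",
}

for (name_1, seq_1), (name_2, seq_2) in combinations(sequences.items(), 2):
    print(name_1, name_2, differences(seq_1, seq_2), p_distance(seq_1, seq_2))
```

In this toy data, species_A and species_B differ at one site out of ten, while species_A and species_C differ at three, so A and B would be grouped as the more closely related pair.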