Genomics & Proteomics PDF
Manipal
Dr. Naresh K. Mani
Summary
This document provides an overview of genomics and proteomics, including fundamental concepts, methodologies, and applications. It also touches upon the history of genomics and proteomics research.
Full Transcript
Genomics & Proteomics
BIO 4066 – PE VII
Dr. Naresh K. Mani, Associate Professor
Email: [email protected]

CO1: Know the basic facts about all types of genomes, their contents, and sequencing methods.
CO2: Understand DNA sequencing projects, the fundamentals of proteomics, and protein separation.
CO3: Comprehend the various protein identification methods, protein microarrays, and applications of proteomics.

Genomes & Genome organization: Genes and proteins; genomes of bacteria, viruses, archaea, metazoa, and eukaryotes; gene content in eukaryotes; interspersed repeats; pseudogenes; micro- and minisatellites; human genome; HGP.
Genome browsers, Sequencing & NGS: Genome browsers; genome projects – preparing genomic DNA for sequencing; Sanger dideoxy method; fluorescence method; shotgun approach; pyrosequencing; genome mapping (physical & genetic); genomics – SNPs, DNA chips; NGS introduction; Illumina; SOLiD; Ion Torrent; single-molecule; nanopore; functional genomics; isothermal DNA amplification.
Proteomics & Separation: OMICS, genome, proteome, metabolome – need for proteomics – scope of proteomics – current challenges in proteomics; strategies for protein separation; general principles – 1D, 2D, 2DGE; applications – principles of multidimensional liquid chromatography.
Strategies for Protein Identification & Quantification: Introduction – protein identification with antibodies – Edman degradation – mass spectrometry principles & instrumentation – protein identification using data from the mass spectrometer; quantitative proteomics based on 2DGE – multiplexed gels in proteomics – quantitative mass spectrometry.
Interaction proteomics, protein modification in proteomics, protein microarrays, applications of proteomics.

References:
1. S.B. Primrose. Principles of Gene Manipulation and Genomics, 7th Edition. Blackwell Publishing.
2. Benjamin Lewin. 2003. Genes VIII. Oxford University Press.
3. T.A. Brown. Genomes III. Taylor & Francis.
4. Jonathan Pevsner. 2015. Bioinformatics and Functional Genomics. Wiley Blackwell.
5. Daniel Liebler. 2002. Introduction to Proteomics. Humana Press.
6. Richard M. Twyman. 2005. Principles of Proteomics. Garland Science.

"All living forms are the lineal descendants of those which lived long before." - Charles Darwin

1865: Mendel discovers laws of genetics
1900: Rediscovery of Mendel’s genetics
1944: DNA identified as hereditary material
1953: DNA structure
1960s: Genetic code
1977: Advent of DNA sequencing
1975–79: First human genes isolated
1986: DNA sequencing automated
1990: Human genome project officially begins
1995: First whole genome
1999: First human chromosome
2003: ‘Finished’ human genome sequence

Today’s lecture
The cell
Mendel’s contributions
The chromosomal theory of inheritance
The chemical nature of genetic material
Composition and structure of DNA
Genomes of prokaryotes and eukaryotes
Viral Genomes

“Life”: Living things / Non-living things
Definition:
Metabolize (processes that extract energy from the environment).
Replicate (utilize that energy to build new molecules).
Storage (information).
3.5 billion years ago.

Components of Life
Definition: Genes and proteins, the “storehouse of information and executors of cellular life processes”.

Genome sequencing
1. Revealed the blueprint of life of many organisms.
2. The information is exploited to understand living organisms better.
3. Supports research on, and manufacture of, new drugs and diagnostic methods.

Genomics: organization of genes and genomes; mapping of genomes; genome sequencing; annotation of genomes.
Proteomics: proteins expressed in a cell at different times; post-translational modifications; protein–protein interactions.
1. Both help us to understand living organisms as a whole.
2. Both are applied through high-throughput techniques.

Cell: basic structural and functional unit of life.
Living organisms possess a well-defined cellular architecture (controlled by the genes).
The cell stores the information for “heredity”.
The branch of science dealing with heredity and variation is known as genetics.
Mendel’s contributions: “Father of genetics”
Prokaryotes: Haploid. These cells have only one set of genetic material per cell.
Eukaryotes: Diploid. Two sets of genetic material; one set is maternally inherited and the other is paternally inherited.

Chromosomal theory of inheritance: quest for the physical material of inheritance.
Chromosomes: entities carrying the genetic material in the cell. Organized structures: DNA + protein.
Nucleosome: basic unit of the chromosome. DNA + H2A, H2B, H3, H4 (also the H1 linker).
Karyotype: arrangement of chromosomes (size & appearance).
Friedrich Miescher

Genome: Total amount of genetic material, stored as DNA.
The nuclear genome refers to the DNA in the chromosomes contained in the nucleus; in humans, the DNA in the 46 chromosomes. It is the nuclear genome that defines a multicellular organism; it is the same for (almost) all cells of the organism.
Organelle genomes: Mitochondria, Chloroplast, Plasmid.
Transcriptome: Total amount of genetic information that has been transcribed by the cell (RNA). The transcriptome is unique to a cell type and is a measure of gene expression. Different cells within an organism have different transcriptomes, so cell types can be identified by their transcriptome.
Proteome: The cell’s complete protein output.

“GENOME”
DNA as letters: “ATATATAT”
Arranged as sentences / paragraphs / pages: GENES
Combination of pages: Book
Living systems on earth share a “common information storage system”.

“GENOME”: DNA (total DNA content), the fundamental building block of all life.
Polymer of nucleotides.
Nucleotide: phosphate, sugar, nitrogenous base.
Bases: Adenine, Thymine, Guanine, Cytosine.
Complementary base pairing.
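The complementary base pairing mentioned above (A pairs with T, G pairs with C) can be illustrated with a minimal Python sketch. The names `PAIR`, `complement`, and `reverse_complement` are hypothetical, chosen only for this illustration, and the sketch assumes an unambiguous DNA string containing only A, T, G, and C.

```python
# Watson-Crick pairing rules for DNA: A<->T and G<->C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIR[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the complementary strand read in its own 5'->3' direction."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    print(complement("ATGC"))          # TACG
    print(reverse_complement("ATGC"))  # GCAT
```

Because the two strands of a duplex run antiparallel, the reverse complement is the form normally used when the partner strand is written 5'→3'.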
Units: base pairs (kbp, Mbp).

Philosophers and scientists grappled with questions regarding the diversity of life on Earth: Aristotle (384–322 BC), Carl Linnaeus (1707–1778), Lamarck (1744–1829), Haeckel (1879).
[Figure: tree of life; labels include mammals, vertebrates, invertebrates, protozoa.]
Chatton (1937) distinguished prokaryotes (bacteria that lack nuclei) from eukaryotes (having nuclei).
Whittaker (1969) and others described the five-kingdom system: animals, plants, protists, fungi, and monera.
[Figure: five-kingdom system, with Plantae, Fungi, and Animalia labelled; the slide credits Haeckel, 1879.]
In the 1970s and 1980s, Carl Woese and colleagues described the archaea, thus forming a tree of life with three main branches: archaea, bacteria, eukaryotes.

Studying the genome: GENOMICS
“Structure & function of organism, complete set of genetic material”
What is the mother of Genomics? Molecular Biology.

PAX6 gene: belongs to a family of genes that play a critical role in the formation of tissues and organs during embryonic development.
Pax6 alterations: similar phenotypic alterations of eye morphology and function across a wide range of species.

The genomes of cellular organisms vary in size over five orders of magnitude.
Single organism: cells can be of different ploidy, e.g. germ cells are usually haploid and somatic cells diploid.
The size of the haploid genome is also known as the C-value.
Virus: 1–2 × 10^5 bp (largest).
Unicellular eukaryotes: 1–2 × 10^7 bp.

A few more questions to be asked: are some plants really more organizationally complex than humans, as these data imply?
“C-value paradox” refers to the observation that genome size does not uniformly increase with the perceived complexity of organisms, for example in vertebrates relative to invertebrates.
Clue:
1. Check the number of base pairs.
2. 10- to 100-fold variation in size.

Genomes consist of unique sequences of DNA and repeated sequences. The proportions of the two vary in different organisms.
Simpler organisms: mostly unique sequences (genes).
Higher organisms: large amount of repetitive DNA.
The proportion of repeated DNA in different organisms.
Repetitive DNA: tandem repeats and dispersed repeats.

Things to ponder!!
(i) The length of the non-repetitive DNA component increases as we go up the evolutionary tree and reaches a maximum of 2 × 10^9 bp in mammals.
(ii) Many plants and animals have a much higher C-value (large amounts of repetitive DNA).

mRNA hybridization process: binding (annealing) will be mostly to non-repetitive DNA.
Conclusion: Most genes are present in non-repetitive DNA. Genetic complexity is proportional to the content of non-repetitive DNA (not to genome size).

Someone wants to sequence the genome (of a particular phylum).
Bottleneck: Repetitive DNA confounds the assembly of a complete sequence of a genome.
Solution: Select the one with the lowest content of repetitive DNA, e.g. Arabidopsis (125 Mb) and rice (430 Mb).

How to assess genome complexity? Reassociation kinetics.

Prokaryotic Genome
- Bacteria & Archaea
- Physical features of prokaryotic genomes
- Genetic features of prokaryotic genomes
- Eukaryotic organellar genomes

An increase in genome complexity is sometimes accompanied by an increase in the complexity of gene structure.
In some genes the coding sequence is interrupted by non-coding (untranslated) sequences: introns.
Such genes are known as split genes, and the parts of these genes that are translated are known as exons.
Prokaryotes: split genes are rare.
Eukaryotes: split genes are much more common.

Prokaryote genome
Genomes of Bacteria and Archaea are compact. Almost all of their DNA is “functional” (contains genes or gene regulatory elements).
Size: 1 million to 10 million base pairs of DNA, usually in a single, circular chromosome.
Genes in a biochemical pathway or signaling pathway are often clustered together (arranged into operons).
The size of prokaryotic genomes is directly related to their metabolic capabilities: the more genes, the more proteins and enzymes they make.

Eukaryote genome
Genome sizes of eukaryotes are tremendously variable, even within a taxonomic group (the so-called C-value paradox).
Eukaryotic genomes are divided into multiple linear chromosomes; each chromosome contains a single linear duplex DNA molecule.
Eukaryotic genes in a biochemical or signaling pathway are not organized into operons.
Many eukaryotic genes (most human genes) are split; non-coding introns must be removed and the exons spliced together to make a mature mRNA. Introns are “intervening” sequences in genes that do not code for proteins.
Multiple exons in a eukaryotic gene can be spliced in different ways to make multiple mRNAs and multiple proteins from a single gene (alternative splicing). The majority of human genes can be spliced in two or more different ways. Therefore, the actual number of human proteins far exceeds the number of protein-coding genes.
Alternative splicing can produce “tissue-specific” versions of the same gene, where one splice variant is present in, for example, cardiac muscle, while a different splice variant of the same gene is present in skeletal muscle.
The image below shows one hypothetical gene with 3 different possible proteins depending on which exons are included in the final mRNA.

Organism | Chromosome number (diploid) | Base pairs | Predicted number of genes
Saccharomyces cerevisiae (budding yeast) | 16 | 1.25×10^7 | 6,275 (~5,800 functional)
Drosophila melanogaster (fruit fly) | 8 | 1.65×10^8 | 13,600
Caenorhabditis elegans (nematode worm) | 6 | 1.0×10^8 | ~19,000
Canis familiaris (dog) | 78 | 2.4×10^9 | ~19,000
Homo sapiens (human) | 46 | 3.3×10^9 | ~19,000
Mus musculus (mouse) | 40 | 3.4×10^9 | ~20,000
Oryza sativa (rice) | 24 | 4.66×10^8 | ~37,000

Prokaryotic Genomes & Eukaryotic organelles
1. Bacteria & Archaea
2. Physical features of prokaryotic genomes
3. Genetic features of prokaryotic genomes
4. Eukaryotic organellar genomes

Bacterial and archaeal classification: genome size
Bacterial and archaeal genomes range from ~0.5 megabases (Mb) to ~10 Mb.
Bacteria: typically ~0.16 Mb to 10 Mb
Smallest: Candidatus Carsonella ruddii (0.16 Mb)
Largest: Solibacter usitatus Ellin (10 Mb)
Archaea: ~0.5 Mb to ~6 Mb
Smallest: Nanoarchaeum equitans (0.49 Mb)
Largest: Methanosarcina acetivorans (5.75 Mb)

Bacteria: circular, linear, or multipartite genomes
E. coli genome: a single circular DNA molecule, as is the case for the majority of bacterial and archaeal chromosomes.
Linear genome??
Borrelia burgdorferi, Lyme disease (1989):
Transmitted through the bite of deer ticks.
Contains up to 11 copies of a single linear chromosome.
No supercoiling.
Strands are diffused throughout the cell.
Other microorganisms with circular or linear chromosomes: Streptomyces coelicolor, Agrobacterium tumefaciens.

Multipartite prokaryotic genomes
Multipartite: having or involving several or many parts or divisions. The genome is divided into two or more DNA molecules.
Problem: distinguishing a genuine part of the genome from a plasmid.
A plasmid is a small piece of DNA, often but not always circular, that coexists with the main chromosome in a bacterial cell.
A few facts:
1. Some plasmids are able to integrate into the main genome.
2. Some are permanently independent.
3. Their replication process is distinct.
4. Copy numbers can reach a thousand or more in a single cell.
5. Plasmids carry nonessential genes.
Question to ponder: Should we include plasmids in the genome?
1. The chromosome sits in an irregularly shaped structure called the nucleoid.
2. Bacteria can pick up new plasmids from other bacterial cells (during conjugation) or from the environment.
3. Every plasmid has its own ‘origin of replication’, a stretch of DNA that ensures it gets replicated (copied) by the host bacterium.

Vibrio cholerae: causes cholera. Contains two circular chromosomes.
(i) 2.96 Mb (73% of the organism’s 4113 genes)
(ii) 1.07 Mb
Observation: the two DNA molecules together constitute the Vibrio genome.
Larger: carries most of the genes for central cellular activities (such as genome expression and energy generation), as well as the genes that confer pathogenicity.
Smaller: a megaplasmid acquired by an ancestor of Vibrio in the bacterium’s evolutionary past.

Borrelia burgdorferi: linear chromosome of 911 kb (875 genes), plus 19 linear and circular plasmids that contribute another 504 kb and another 478 genes.
Functions: unknown for most of the genes; those identified include genes for membrane proteins and purine biosynthesis.

Conclusion: prokaryotes can have multipartite genomes (rather than a single conventional chromosome), made up of:
One or more bacterial chromosomes, carrying essential genes and located in the nucleoid.
Chromids, carrying genes that the bacterium needs to survive.
Genuine plasmids, whose genes are non-essential to the bacterium.

Genetic features of prokaryotic genomes
Sequence inspection: prokaryotes vs. eukaryotes.
Prokaryotic genomes that have been sequenced allow estimation of the number of genes and their functions.

Gene organization in the E. coli K12 genome
Compact genetic organization with very little space between genes.
Circular gene map.
Intergenic DNA: 11% (distributed throughout the genome). Very little wasted space.
Theory:
1. Compact organization is beneficial to prokaryotes.
2. It enables the genome to be replicated relatively quickly.

E. coli
Facts:
1. Gram-negative, non-spore-forming organism.
2. Optimal growth occurs at 37 °C.
3. Non-pathogenic E. coli strains are used as probiotics.
4. K12 is a laboratory strain and is used for molecular biological studies.
5. Rapid growth rate and simple nutritional requirements.
6. A prevalent species that can be used for host–pathogen interaction studies.
Genome:
Complete genome sequence (1997) by Blattner et al. at the University of Wisconsin.
Salient features:
1. Single circular chromosome of 4,639,221 bp.
2. G+C content of the genome is 50.8 per cent.
3. 88% of the genome codes for 4288 proteins.
4. 0.8% of the genome represents genes coding for rRNA, tRNA, etc.
5. 11% of the genome harbors other regulatory sequences.

Operon: a group of genes involved in a single biochemical pathway.
[Figure: 50 kb segment of the Escherichia coli genome; one label marks genes “separated by a single nucleotide”. The segment runs between nucleotide positions 377 and 50,377. Note that some genes are so close together that they appear to be continuous when drawn at this scale; examples are thrA, thrB, and thrC; caiD and caiC; fixA and fixB; fixC and fixX. (Data from the UCSC Microbial Genome Browser.)]

First feature:
1. Infrequency of repetitive sequences.
2. Most prokaryotic genomes do not possess high-copy-number interspersed repeat families.
3. They do possess insertion sequences (IS) such as IS1 and IS186.
4. IS are transposable elements (able to move around the genome).
5. IS can transfer from one organism to another (between two different species too).
Second feature:
1. Scarcity of introns.
2. E. coli K12 has no discontinuous genes.
3. Introns are uncommon among other bacteria and archaea.
4. The few introns discovered differ from eukaryotic pre-mRNA introns.
5. They are able to self-splice (Note: eukaryotes need catalytic proteins!!).

Prokaryotic genome sizes & number of genes
Genome sizes and numbers of genes vary within individual species.
1. Genome sizes and gene contents differ within a prokaryotic species.
2. Pan-genome concept = core + accessory genome.
3. Core genome: genes possessed by all members of the species.
4. Accessory genome: collection of additional genes present in different strains and isolates of that species.
Core: basic biochemical and cellular activities.
Accessory: biological capability of the species (as a whole).

Organellar Genome
Genes located outside the nucleus were initially called extrachromosomal genes (1950).
Electron microscopic and biochemical studies hinted that DNA molecules might be present in mitochondria and chloroplasts.
Their existence, independent of and distinct from the eukaryotic nuclear genome, was established in 1960.

Origin of organellar genomes (symbiotic association): relics of free-living bacteria.
1. Endosymbiont theory (widely accepted now), unorthodox when proposed (1960).
2. Theory: gene expression processes occurring in organelles are similar to the equivalent processes in bacteria.
3. Evidence: nucleotide composition (organelle genes are more similar to bacterial genes than to eukaryotic nuclear genes).
The algae called “glaucophytes” have photosynthetic structures (cyanelles) that differ from chloroplasts and resemble cyanobacteria.
Cyanelle: external layer of peptidoglycan (remnant of the cyanobacterial cell wall). Its light-harvesting proteins resemble those of free-living cyanobacteria (rather than chloroplasts).

How did this transfer occur?
1. Mass transfer of many genes at once, or a gradual trickle from one site to the other.
2. DNA transfer from organelle to nucleus, and between organelles, still occurs (1980).
3. Partial sequences of chloroplast genomes contained copies of part of the mitochondrial genome.
4. Promiscuous DNA: DNA transferred from one organelle to the other.
Another type of transfer (Arabidopsis):
Mitochondrial genome: contains segments of nuclear DNA as well as 16 fragments of the chloroplast genome.
Nuclear genome of this plant: contains several short segments of the chloroplast and mitochondrial genomes.

Organelle genome: Shape (Circular/Linear)
All eukaryotes: mitochondrial genomes.
Organellar genomes: circular.
Photosynthetic eukaryotes: chloroplast genomes.
Electron microscopy: shows both circular and linear DNA. The linear DNA molecules are fragments of circular genomes (broken during preparation for electron microscopy).
In many eukaryotes, the circular genomes coexist in the organelles with linear versions.
Great deal of variability in different organisms:
Marine algae (dinoflagellates): chloroplast genomes are split into many small circles (each carrying just a single gene).
Mitochondrial genomes of Paramecium, Chlamydomonas, and several yeasts are always linear.

Organelle genome: Size & contents
Mitochondrial genome sizes are variable and unrelated to the complexity of the organism.
Human: compact genetic organization, where the genes are close together with little space between them.
Lower eukaryotes (S. cerevisiae) have larger and less compact mitochondrial genomes, with a number of the genes containing introns.
[Figure legend: ATP6 and ATP8 (overlap), genes for ATPase subunits 6 and 8; COI, COII, and COIII, genes for cytochrome c oxidase subunits I, II, and III; Cytb, gene for apocytochrome b; ND1–ND6, genes for nicotinamide adenine dinucleotide (NADH) dehydrogenase subunits 1–6.]
Mitochondrial genomes display greater variability in gene content, ranging from three for Plasmodium falciparum to 93 for the protozoan Reclinomonas americana.
Chloroplast genomes: less variable in size. Most have a structure similar to that of the rice chloroplast genome.
[Figure: rice chloroplast genome. Genes with known functions are shown.]

General features: organelle genome
1. Organelle genomes specify some of the proteins found in the organelle.
2. Other proteins are coded by nuclear genes, synthesized in the cytoplasm, and transported into the organelle.
Question to ponder: If the cell has mechanisms for transporting proteins into mitochondria and chloroplasts, then why not have all the organellar proteins specified by the nuclear genome?
Comparison of three

Outline: The cell; Mendel’s contributions; The chromosomal theory of inheritance; The chemical nature of genetic material; Composition and structure of DNA; Genomes of prokaryotes and eukaryotes; Model organisms; Viral genomes.
Prokaryotic Genome: Bacteria & Archaea; Bacterial classification (morphology, genome size & disease relevance); Physical features of prokaryotic genomes; Genetic features of prokaryotic genomes; Eukaryotic organellar genomes.

Human Genome
3.2 × 10^9 base pairs. 2001 (10 years).
Challenges of understanding genomes, data, and analysis: improve human welfare (pharmacogenomics).

How do the contents of our genomes determine who we are?
https://www.verywellmind.com/what-is-nature-versus-nurture-2795392
Phenotype = Genotype + Environment + Life history + Epigenetics
Genotype: nuclear and mitochondrial DNA sequence. (For plants, include also the sequence of the chloroplast DNA.)
Phenotype: the collection of your observable traits, other than your DNA sequence (macroscopic properties).
Life history: experiences, physical and psychological environment.
Epigenetic factors: the DNA sequence is the same, but different sets of genes are expressed or silenced in liver, brain, etc.

Eukaryotes:
✓ eu‐ (“true”) and karutos (“having nuts”)
✓ Membrane‐bound nucleus and a cytoskeleton
✓ Genomic DNA organized into chromosomes

Major differences between Eukaryotes and Bacteria and Archaea
Membrane‐bound nucleus, organelles, & a cytoskeleton.
Sexual reproduction.
Bacteria & Archaea: high density of protein-coding genes (E. coli: 0.7% non-coding repeats).
Genome size.

Genes that encode the proteome
Human genome: 23,000 protein-coding genes.
Subtelomeric regions (on all chromosomes) and chromosomes 18 and X are poor in protein-coding genes.
Chromosomes 19 and 22 are relatively rich in protein-coding genes.
Exons (expressed regions) are interrupted by introns (regions spliced out of mRNA and not translated to protein).
Average exon size is about 200 bp.
It is variability in intron size that causes the large size differences among protein-coding genes: the gene for insulin is 1.7 kb long and the dystrophin gene is 2400 kb.

Eukaryotic Nuclear Genomes
1. Nuclear genomes are contained in chromosomes
2. How are the genes arranged in a nuclear genome?
3. How many genes are there and what are their functions?

Chromosomes are much shorter than the DNA molecules they contain
Packaging system: affects how genomes function (prokaryotes or eukaryotes).
How DNA is packaged influences the expression of individual genes.
Biochemical results in the 1970s: breakthrough in understanding DNA packaging.
Knowledge of nuclear DNA and its proteins, the “histones”.
Nuclease protection experiments on chromatin (1974).

Nuclear genomes are contained in chromosomes
Eukaryotes:
1. Nuclear genomes: a set of linear DNA molecules (each contained in a chromosome).
2. Common to all: at least one chromosome.
3. Chromosome number varies.
Fun fact:
1. Yeast: 16 chromosomes
2. Fruit fly: 4 chromosomes
Conclusion: No link between chromosome number and genome size. It is more a reflection of the non-uniformity of the evolutionary process.

Karyogram
Lengths of chromatids.
Location of the centromere relative to the telomeres.
Individual chromosomes are recognized by staining techniques.
Banding pattern (characteristic for a particular chromosome).

Centromeres and Telomeres
Understanding the nucleotide sequence of centromeric DNA (higher eukaryotes):
Centromeric regions are often excluded from genome sequences.
Difficulty in obtaining an accurate reading (highly repetitive structures).
Arabidopsis thaliana (genome sequencing in 2000): centromeres span 0.4–3.0 Mb of DNA each, made up of 178–180 bp repeat sequences.
Repeat sequences: principal component (??).
Multiple copies of a variety of repeats (also found elsewhere in the genome).
Telomeres: essential; they mark the ends of chromosomes.
They allow the cell to distinguish a real end from an unnatural end caused by chromosome breakage.
“Repairing mechanism”
Telomere repeat: 5’-TTAGGG-3’ (humans).
Telomere-binding proteins: TRF1, TRF2 (protect the ends from degradation by nucleases and maintain their length during DNA replication).

Eukaryotic Nuclear Genomes
1. How are the genes arranged in a nuclear genome?
2. How many genes are there and what are their functions?

How are the genes arranged in a nuclear genome?
Gene density in Eukaryotes
[Figure: gene density along the largest of the five Arabidopsis thaliana chromosomes, chromosome 1 (29.1 Mb).]
Arabidopsis thaliana (one of the first eukaryotic genome sequences):
Genes are unevenly distributed.
Genome size: 135 Mb.
Average gene density in the genome is 25 genes/100 kb.
Outside the centromeres and telomeres: 1 to 38 genes/100 kb.
Human genome:
Gene deserts: regions as long as several megabase pairs where gene density is very low.
Distribution of protein-coding genes between different human chromosomes is very uneven.
Chromosome 13: 3.16 genes/Mb
Chromosome 19: 22.61 genes/Mb

Facts:
Gene density varies along the length of a eukaryotic chromosome.
Gene-rich and gene-poor regions are difficult to identify.
The pattern of gene organization varies greatly between different eukaryotes.
Reason: the differences reflect the genetic features and evolutionary histories of the organisms.

Segment of human genome
200 kb segment of the human genome (chromosome 1): nucleotide position 55,000,000 to position 55,200,000.
BSND gene: chloride channel protein, 3 introns.
PCSK9 gene: proprotein convertase subtilisin (metabolism of cholesterol), 11 introns.
USP24 gene: ubiquitin-specific peptidase, 73 introns.
1. A relatively small amount of space is taken up by the coding parts of the genes.
2. Total length of exons is 10,664 bp.
3. Equivalent to 5.33% of the 200 kb segment.
Fun facts:
At 5.33% exons, this is a gene-rich region.
All the exons in the human genome make up only 48 Mb, just 1.5% of the total.
Interspersed repeats: 44% of the genome.

How extensive are the differences in gene organization among eukaryotes?
Answer: Substantial difference.
Working assumption: complexity = number of genes.
The human genome: 3235 Mb (20,441 protein-coding genes).
Saccharomyces cerevisiae: 12.2 Mb, which is about 0.004 times the human genome.
Expected: 0.004 × 20,441 genes = 82 protein-coding genes.
Actual: 6692 protein-coding genes.
C-value paradox.
Conclusion: Space is saved in the genomes of less complex organisms because the genes are more closely packed together.
1. The gene density in the yeast genome is much higher than that for humans.
2. Relatively few of the yeast genes are discontinuous.

How many genes are there and what are their functions?
1. Gene numbers can be misleading
2. Gene catalogs reveal the distinctive features of different organisms
3. Families of genes
4. Pseudogenes and other evolutionary relics

Gene numbers can be misleading
Expectation: the most sophisticated species on the planet has more genes than any other organism. The initial comparison works! But????
Human genome: 20,441 protein-coding genes and 22,219 genes for noncoding RNAs.
Recent years:
1. The count of protein-coding genes has decreased (questionable ORFs are discarded) to about 19,000.
2. The count of non-coding RNA genes has increased.
Initial hypothesis: a single gene specifies a single mRNA and a single protein. The number of proteins suggested 80,000–100,000 genes.
The actual number of protein-coding genes is lower than indicated, because an individual gene can specify more than one protein.
There are many discontinuous genes in the human genome.

Contents of the Human Genome
Protein-coding genes: 2–3% of the overall sequence.
Distributed across the different chromosomes, but not evenly.
The genome also encodes non-protein-coding RNA molecules.
Binding sites for ligands responsible for the regulation of transcription.

Contents of the Human Genome
Repetitive elements: Long and Short Interspersed Elements (LINEs and SINEs) account for 21% and 13% of the genome.
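The yeast gene-number estimate given earlier in this section can be reproduced with a short Python sketch. The genome sizes and gene counts are the ones quoted in the slides; the function names `expected_genes` and `genes_per_mb` are hypothetical. Note that the slides round the size ratio to 0.004 before multiplying, which gives 82; the unrounded ratio gives roughly 77. Either way the estimate is nearly two orders of magnitude below the actual count.

```python
# Figures quoted in the slides.
HUMAN_MB, HUMAN_GENES = 3235.0, 20441
YEAST_MB, YEAST_GENES = 12.2, 6692


def expected_genes(genome_mb: float,
                   ref_mb: float = HUMAN_MB,
                   ref_genes: int = HUMAN_GENES) -> float:
    """Gene count predicted if gene number scaled linearly with genome size."""
    return genome_mb / ref_mb * ref_genes


def genes_per_mb(genes: int, genome_mb: float) -> float:
    """Average gene density in genes per megabase."""
    return genes / genome_mb


if __name__ == "__main__":
    print(f"Slide estimate (ratio rounded to 0.004): {0.004 * HUMAN_GENES:.0f}")
    print(f"Unrounded estimate: {expected_genes(YEAST_MB):.0f}")
    print(f"Actual yeast genes: {YEAST_GENES}")
    print(f"Human density: {genes_per_mb(HUMAN_GENES, HUMAN_MB):.1f} genes/Mb")
    print(f"Yeast density: {genes_per_mb(YEAST_GENES, YEAST_MB):.1f} genes/Mb")
```

The density comparison makes the slide's conclusion concrete: yeast packs its genes far more tightly than the human genome does.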
Highly repeated sequences (minisatellites and microsatellites) may appear as tens or even hundreds of thousands of copies, in aggregate amounting to 15% of the genome.

Contents of the Human Genome

Organization of eukaryotic Genomes into Chromosomes
Genomic DNA is organized into chromosomes.
Karyotyping.
Centromere, telomere.
Deletions or duplications:
▪ Deletion 11q syndrome: trigonencephaly (a triangle‐shaped head), a carp‐shaped mouth, and cardiac defects. Caused by hemizygous deletion of the terminus of chromosome 11q.
▪ Trisomy 21.
▪ In the nucleus: chromosomes, in their unravelled structure, occupy restricted spaces called chromosome territories.
▪ Chromosome-specific fluorescent probes.

Repetitive Sequences in the Eukaryotic Genome
Britten and Kohne (1968): repetitive nature of eukaryotic DNA.
Repetitive DNA:
1. Makes up vast proportions of the eukaryotic genome.
2. Repeated nucleotide sequences of various lengths.
3. Mammals: 60%, Yeasts: 20%.
Importance: diseases, recombination events (duplication or deletion), molecular fossils (evolutionary studies).
Experiments:
1. Genomic DNA was taken from a wide variety of species, sheared, and the DNA strands dissociated.
2. Hint: under appropriate conditions of salt, temperature, and time, the DNA strands re-anneal.
3. DNA reassociation in mouse genomic DNA.
Y-axis: percent of the DNA that remains single stranded.
X-axis: log scale of the product of the initial concentration of DNA (in moles/liter) and the length of time the reaction proceeded (in seconds). This product is written C0t and is called the "Cot" value.

Re-association kinetics depend on:
A. Size or complexity of the genome.
B. Amount of repetitive DNA within the genome.
Repetitive DNA will renature at low C0t values.
Complex and unique DNA sequences will renature at high C0t values.
The first sequences to reanneal are the highly repetitive sequences because so many copies of them exist in the genome, and because they have a low sequence complexity.
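The shape of a C0t curve can be sketched in Python under the assumption of ideal second-order reassociation kinetics, where the fraction of DNA still single stranded is 1 / (1 + k·C0t). The slides do not state this equation, so treat it as an assumed model; the three component fractions and rate constants in `COMPONENTS` are hypothetical values chosen only to show repetitive DNA renaturing at low C0t and unique DNA at high C0t.

```python
def fraction_single_stranded(cot: float, k: float) -> float:
    """Ideal second-order model: fraction single stranded = 1 / (1 + k * C0t)."""
    return 1.0 / (1.0 + k * cot)


def cot_half(k: float) -> float:
    """C0t value at which half of a component has reannealed (equals 1 / k)."""
    return 1.0 / k


# Hypothetical genome: (label, fraction of genome, rate constant k).
# A larger k means faster reannealing, i.e. more copies and lower complexity.
COMPONENTS = [
    ("highly repetitive", 0.10, 1e3),
    ("middle repetitive", 0.20, 1e0),
    ("single copy", 0.70, 1e-3),
]


def genome_fraction_single_stranded(cot: float, components=COMPONENTS) -> float:
    """Weighted sum of the single-stranded fraction over all components."""
    return sum(frac * fraction_single_stranded(cot, k) for _, frac, k in components)


if __name__ == "__main__":
    for exponent in range(-4, 5):
        cot = 10.0 ** exponent
        percent = 100 * genome_fraction_single_stranded(cot)
        print(f"C0t = 1e{exponent:+d}: {percent:5.1f}% single stranded")
```

Plotting the printed values against log C0t would give the stepped curve described in the slides, with one step per kinetic component.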
The second portion of the genome to reanneal is the middle repetitive DNA, and the final portion to reanneal is the single-copy DNA or unique DNA sequence.
Unique or non-repetitive sequences are those found once or a few times within the genome.
Structural genes are typically unique sequences of DNA. The vast majority of proteins in eukaryotic cells are encoded by genes present in one or a few copies.

Dispersed repeated sequences: families of repeated sequences.
Transposons are mobile DNA sequences which migrate to different regions of the genome via transposition.
Long or short; actual mobile elements (transposons or retrotransposons).
Long interspersed elements (LINEs): 1,000–7,000 bp long.
Short interspersed elements (SINEs): 100–400 bp long.

LINEs
LINEs (long interspersed nuclear elements) comprise about 21% of the human genome and consist of repetitive sequences up to 6500 bp long that are adenine-rich at their 3’ ends.
Mammalian diploid genomes have about 500,000 copies of the LINE-1 (L1) family. Other LINE families may also be present, but they are much less abundant than LINE-1.
Full-length LINE-1 family members are 6–7 kb long, although most are truncated elements of about 1–2 kb.

SINEs
SINEs are found in a diverse array of eukaryotic species, including mammals, amphibians, and sea urchins.
Short sequences (about 100–400 bp) with an internal pol III promoter; they do not encode any proteins.
SINEs are derived from tRNA and 7SL RNA genes.
Non-autonomous transposable elements: they need the assistance of L1 elements for transposition.
They occur at much higher density in GC-rich regions (13% of the human genome).
A well-studied SINE family is the Alu family of certain primates. This family is named for the cleavage site for the restriction enzyme AluI typically found in the repeated sequence.
In humans, the Alu family is the most abundant SINE family in the genome, consisting of 200–300 bp sequences repeated as many as a million times and making up about 10% of the human genome.
One Alu repeat is located every 5,000 bp in the genome, on average.
[Figure label: G-C rich region]

Roles of LINEs & SINEs
▪ “Junk DNA”
▪ Both LINEs and SINEs have been incorporated into novel genes, evolving new functionality.
▪ The distribution of these elements has been implicated in some genetic diseases and cancers.
▪ Genomic DNA: stable template of heredity, largely dormant and unchanging (apart from possible point mutations).
▪ LINE insertions: hemophilia, thalassemia, Duchenne muscular dystrophy.
▪ LINE insertion: APC gene of adenocarcinoma cells from a colon cancer patient.
▪ SINEs: hot-spots for recombination.

Tandem Repeats
Moderately and highly repetitive sequences can be clustered together in a tandem array, also known as tandem repeats. In a tandem array, a very short nucleotide sequence is repeated many times in a row. In Drosophila, for example, 19% of the chromosomal DNA is highly repetitive DNA found in tandem arrays.
Depending on the average size of the arrays of repeat units, highly repetitive noncoding DNA belonging to this class can be grouped into three subclasses: satellite, minisatellite, and microsatellite DNA.
▪ Classical satellite DNA: array size 100–5,000 kb
▪ Minisatellite DNA: 100 bp – 20 kb
▪ Microsatellite DNA: >>
Drosophila melanogaster
Humans: polymorphic [genetic marker]
Any negative consequences?
1. Repeats of CAG
2. Huntington’s disease

Pseudogenes
✓ Definition: not actively transcribed or translated
✓ Once functional, but no longer (lack of protein product)
How to recognize pseudogenes? (17,000 in the human genome)
1. Premature stop codon
2. Frame-shift mutation
Functions:
1. Non-functional; roles in recombination
2. Human chromosome 1: 3,141 protein-coding genes, 991 pseudogenes
3. Smallest autosome, chromosome 21: 225 known genes, 59 pseudogenes

Human Genome Project
“It is essentially immoral not to get it [the human genome sequence] done as fast as possible.”
Second World War: “scientific thinking”. Change in genetic make-up?
Post-war effects: radiation damage to human genetic material and its ill effects on subsequent generations.
The Department of Energy (DOE, USA) and the International Commission for Protection Against Environmental Mutagens and Carcinogens (1984): to develop techniques to detect mutations in the survivors. This ignited the idea of the HGP.
Department of Energy (DOE): proposed genome mapping.
Ultimate aim: sequencing the order of 3 billion nucleotides, with computer-aided arrangement of the sequences.
[Diagram labels: Human Genome Initiative; National Research Council (3 years); HGP (1990), 15 years]
The International Human Genome Sequencing Consortium (IHGSC): NIH & DOE, 25 labs from 5 different countries.
Budget: 200 million US$ per year. ELSI: 2% of the budget.

Challenges
1. 1991–95 (FY Plan No. 1), 1996–2000 (FY Plan No. 2) and 2001–03 (FY Plan No. 3).
2. Available mapping and sequencing technologies were not advanced enough to sequence the 3 billion bases of the human genome in the specified time frame.
3. Costly, slow, and not accurate.

Objective
1. To prepare a high-resolution map of the human genome using genetic as well as physical mapping techniques.
2. To determine the order of the nucleotides in all 22 autosomes and the two sex chromosomes (X and Y).
3. To develop high-throughput sequencing technology.
4. To learn the sequence of genomes of model organisms to test the feasibility of different mapping and sequencing techniques.
5. To develop computer tools to store the sequence data and access the stored sequence information for various purposes.
6. To annotate the DNA sequence based on its sequence content, such as ORFs, promoters, terminators, enhancers, and repeat sequences present in the genome.
7. To address ethical, social, and legal issues that might arise pertaining to human genome sequencing and its use.
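Objective 6 mentions annotating open reading frames (ORFs). As a rough sketch of what such annotation involves, the snippet below scans the three forward reading frames of a DNA string for stretches that start with ATG and end at a stop codon. The function name `find_orfs` and the `min_codons` threshold are hypothetical, and the sketch ignores the reverse strand, introns, and nested start codons that real gene-finding tools have to handle.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) tuples for ATG...stop ORFs on the forward strand.

    start is 0-based and inclusive; end is exclusive and includes the stop codon.
    min_codons counts the codons before the stop codon (hypothetical threshold).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    # Toy sequence: ATG AAA TGA sits in frame 2.
    print(find_orfs("CCATGAAATGATT"))  # [(2, 11, 2)]
```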
Human Genome Organization (HUGO)
Objectives of the agency:
▪ To coordinate the sequencing project in order to avoid unnecessary competition among scientists.
▪ To avoid duplication of work and to encourage the exchange of scientific material and data relevant to genome sequencing among scientists.
▪ To act as a “coordinating agency” that conducts training programmes, and as a nodal agency that provides information relevant to genome sequencing to the public.
▪ To solicit opinions so as to address the ethical, legal, and social issues (ELSI) related to genome sequencing.
▪ It is also involved in providing fellowships, training, and course materials related to genome sequencing.
▪ It provided expert advice to governments on developments in genome sequencing.
▪ Meetings and workshops were conducted in a phased manner in different locations to exchange ideas and materials related to the HGP.

Findings of the HGP
▪ About 2% of the human genome is made up of coding DNA.
▪ Genes predicted: 20,000–25,000, far fewer than the 100,000 anticipated.
▪ One gene codes for 2–3 proteins through alternative splicing.
▪ The human gene count is modest: at most about twice that of the tiny fruit fly (13,500) and only slightly above that of the worm (19,000).
▪ Comparison of the human genome with other model organisms shows similarity (orthologous genes).
▪ Genes are not distributed equally along the chromosomes: there are gene-rich and gene-poor regions.
▪ Gene density is not constant across chromosomes: chromosome 19 has the highest gene density, while chromosome 5 has one of the lowest.

Potential applications of HGP
21st century: the HGP offers 50 billion $ for biotechnology products.
1. Not only genome sequencing.
2. Genomes of model organisms / genes.

1. Pharmaceuticals: DNA-based products. Predisposition to many diseases; 4,000 diseases: mutations in various genes.
2. 5.3% of euchromatic regions carry segmental duplications. Solution: high-resolution genetic maps as well as physical maps.
3. Williams syndrome, Charcot-Marie-Tooth region, DiGeorge syndrome region.
Personalized Medicine
“Analysis of patients’ genes and proteins permits selection of drugs and dosages optimal for individual patients”
Pharmacogenomics
▪ Red marrow contains blood stem cells that can become red blood cells or white blood cells.
▪ Acute lymphoblastic leukaemia (ALL): overproduction
▪ ALL is treated with thiopurines: thioguanine, mercaptopurine, azathioprine.
▪ The enzyme thiopurine methyltransferase (TPMT) breaks down the drug.
▪ But a genetic variant produces an inactive enzyme. Unmetabolized drug: toxic to the bone marrow.

Human Genome & Medicine
▪ Vaccinations: pre-emptive strikes against “diseases”; immune system; “pathogens”; smallpox & polio
▪ Individual genetic susceptibility: medical ailments
▪ Alpha 1-antitrypsin: protease inhibitor
▪ Chronic obstructive pulmonary disease (COPD) / emphysema: shortness of breath and cough with sputum production
▪ Heavy smokers @ 50

Human Genome & Medicine
Huntington’s disease: neurodegenerative disorder
Symptoms: uncontrollable dance-like movements, mental disturbance, personality changes
Repeats of CAG (polyglutamine)
▪ Normal gene: 11–28 CAG repeats, 29–34 repeats
▪ Diseased: 35–41 repeats
▪ Severely diseased: >41 repeats

Two Different Groups Worked to Obtain the DNA Sequence of the Human Genome
▪ The HGP is a multinational consortium established by government research agencies and funded publicly.
▪ Celera Genomics is a private company whose former CEO, J. Craig Venter, ran an independent sequencing project.
▪ Differences arose regarding who should receive the credit for this scientific milestone.
▪ On June 26, 2000, the HGP and Celera Genomics held a joint press conference to announce that TOGETHER they had completed ~97% of the human genome.

Human genome project: strategies
Whole genome shotgun sequencing (Celera)
-- Given the computational capacity, this approach is far faster than hierarchical shotgun sequencing
-- The approach was validated using Drosophila
Hierarchical shotgun sequencing (public consortium)
-- 29,000 BAC clones
-- 4.3 billion base pairs
-- It is helpful to assign chromosomal loci to sequenced fragments, especially in light of the large amount of repetitive DNA in the genome
-- Individual chromosomes assigned to centers
9 months, shotgun approach

Published
▪ The International Human Genome Sequencing Consortium published their results in Nature, 409 (6822): 860-921, 2001: “Initial Sequencing and Analysis of the Human Genome”.
▪ Celera Genomics published their results in Science, 291 (5507): 1304-1351, 2001: “The Sequence of the Human Genome”.
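The computational step behind the shotgun strategies described above is reassembling overlapping reads into one sequence. The toy sketch below uses a greedy longest-overlap merge. It is an illustrative assumption, not the assembler used by Celera or the public consortium, and the names `overlap`, `greedy_assemble`, and `min_len` are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when no overlap of at least min_len bases exists.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap.

    Returns the list of contigs left when no overlaps remain.
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing overlaps any more
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    # Three toy reads sampled from ATGGCGTACGTTAGC, given out of order.
    print(greedy_assemble(["CGTTAGC", "ATGGCGTAC", "CGTACGTTA"]))
    # ['ATGGCGTACGTTAGC']
```

A repeat longer than the reads makes several overlaps look equally good, so a greedy merge like this can join the wrong fragments. That is one reason the hierarchical approach first assigns fragments to chromosomal loci.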