Genome Lecture Notes PDF
Cairo University Science
Prof DR Adel Khalil Gohar
Summary
This document presents a lecture on genomes, covering topics such as the definition of a genome, its structure, function, evolution, and mapping. It details different types of genomes, such as prokaryotic, eukaryotic, and viral genomes. The lecture also explains repetitive DNA, including microsatellites and tandem repeats.
Full Transcript
Genome
Prof DR Adel Khalil Gohar

Genomics
“The branch of molecular biology concerned with the structure, function, evolution and mapping of genomes.”

Genome
Genome: all the DNA in a cell.
- All the DNA on all the chromosomes
- Includes genes, intergenic sequences, repeats
- Specifically, it is all the DNA in an organelle.
- Eukaryotes can have 2-3 genomes:
  - Nuclear genome
  - Mitochondrial genome
  - Plastid genome
- If not specified, “genome” usually refers to the nuclear genome.

Viral Genomes
Viral genomes: ssRNA, dsRNA, ssDNA, dsDNA, linear or circular.
Viruses with RNA genomes:
- All ssRNA viruses produce dsRNA molecules.

Prokaryotic Genomes
- Haploid: one circular dsDNA chromosome, anchored by proteins.
- Usually without introns.
- 98% of genome is coding. Relatively high gene density (0.5-10 Mbp, 500-10000 genes).
- Transcription and translation take place in the same compartment.
- Operons: polycistronic transcription units
- Often indigenous plasmids are present.

Plasmids
Plasmids are extrachromosomal circular DNAs:
- Found in bacteria, yeast and other fungi
- Size varies (3,000 bp to 100,000 bp)
- Replicate autonomously; may contain resistance genes
- Can transfer from one bacterium to another or across kingdoms
- Copy number ranges from 1-2 up to about 400 plasmids per cell (multi-copy plasmids)

Eukaryotic Genome
Eukaryotic cells package their DNA as one linear DNA molecule per chromosome.
Six levels of chromosome packing:
1. DNA duplex (2 nm)
2. Nucleosome fiber (10 nm)
3. 30 nm chromatin fiber
4. Coiled chromatin fiber
5. Coiled coil
6. Metaphase chromosome

Gene Numbers & Genome Sizes (C, N and K values)
Gene numbers:
- The number of genes does not correlate with the complexity of an organism.
Genome sizes:
- The C-value = the DNA content of the haploid genome.
- Units of nucleic acid length in which genome sizes are expressed:
  - Kilobase (kb): 10^3 base pairs
  - Megabase (Mb): 10^6 base pairs

The 3 Genomic Paradoxes:
- C-value paradox: Complexity does not correlate with genome size.
- N-value paradox: Complexity does not correlate with gene number.
- K-value paradox: Complexity does not correlate with chromosome number.

Large vs. Small Genomes
Polyploidy: having more than 2 sets of chromosomes in the genome
- Particularly relevant in plants (auto- or allopolyploidy).
The amount of non-coding DNA:
- Highly repetitive DNA: > 100,000 copies/genome
- Moderately repetitive DNA: 100 to 10,000 copies/genome

Eukaryotic Genomes:
- Located on several chromosomes
- Relatively low gene density (50 genes per mm of DNA in humans)
- Contour length of DNA from a single human cell = 2 meters
- Approximately 10^11 cells = total length 2 × 10^11 m (2 × 10^8 km)
- For comparison, the distance between the sun and the earth is 1.5 × 10^8 km
- Human chromosomes vary in length over a 25-fold range
- Carry organelle genomes as well

Mitochondrial Genome (mtDNA)
- Multiple identical circular chromosomes
- Size ~15 kb in animals
- Size ~200 kb to 2,500 kb in plants
- Over 95% of mitochondrial proteins are encoded in the nuclear genome.
- No introns and very few repeats
- Often A+T-rich genomes
- mtDNA is replicated before or during mitosis
- 24 of 37 genes are RNA coding: 22 mt tRNAs and 2 mt ribosomal RNAs (12S, 16S)
- 13 of 37 genes are protein coding (synthesized on ribosomes inside mitochondria): some subunits of respiratory complexes and oxidative phosphorylation enzymes

Chloroplast Genome (cpDNA)
- Multiple circular molecules
- Size ranges from 120 kb to 160 kb
- Similar to mtDNA
- Many chloroplast proteins are encoded in the nucleus (separate signal sequence)

Gene
Molecular definition: the entire nucleic acid sequence necessary for the synthesis of a functional polypeptide (protein chain) or functional RNA.

Repetitive DNA
Tandem Repeats
Tandem repeats occur in DNA when a pattern of two or more nucleotides is repeated, and the repetitions are adjacent to each other.

Microsatellite DNA
- Unit: 1-6 bp.
- Repeat: on the order of 5-100 times.
- Location: generally euchromatic.

Minisatellite DNA
- Unit: 15-400 bp
- Location: generally euchromatic.
- Examples: DNA fingerprints.
- Tandemly repeated but often in dispersed clusters.
- Also called VNTRs (variable number tandem repeats).

Disease Due to Tandem Repeats
Tandemly repetitive DNA can cause diseases:
- Fragile X syndrome: “CGG” is repeated hundreds or even thousands of times, creating a “fragile” site on the X chromosome. It leads to intellectual disability.
- Huntington’s disease: a “CAG” repeat causes a protein to have long stretches of the amino acid glutamine. It leads to a neurological disorder that results in death.

Interspersed Repetitive DNA
- Interspersed repetitive DNA accounts for 25-40% of mammalian DNA. The copies are scattered randomly throughout the genome.
- The units are 100-1000 base pairs long.
- Copies are similar but not identical to each other.
- Transposons (“jumping genes”): interspersed repetitive elements are not stably integrated in the genome; they move from place to place. These are:
  a) Retrotransposons (class I transposable elements, “copy and paste”): copy themselves to RNA and then back to DNA (using reverse transcriptase) to integrate into the genome. Examples: LTR elements, SINEs and LINEs.
  b) Transposons (class II transposable elements, “cut and paste”): use transposases to make a staggered (sticky) cut.
Transposons (also called transposable elements, or “jumping genes”) are DNA sequences that move from one genomic location to another. Repeat sequence units of this type are usually 100 bp to over 10 kb in length, and may appear in over 1 million loci dispersed across the genome.

Noncoding Genomic Elements
Although protein-coding genes are the most studied genomic element, they may not necessarily be the most abundant part of the genome.
Prokaryotic genomes are usually rich in protein-coding gene sequences; for example, they account for approximately 90% of the E. coli genome. In complex eukaryotic genomes, however, their percentage is lower. For example, only about 1.5% of the human genome codes for proteins.

Gene Families
- Definition: A gene family is a group of genes that share important characteristics.

Types and Examples of Gene Families
a) Structural gene families: the genes have a similar sequence of DNA building blocks (nucleotides). Their products (such as proteins) have a similar structure or function.
1) Classical gene families (conserved overall): histones, alpha- and beta-globins.
2) Gene families with large conserved domains (other parts may be poorly conserved): HLH/bZIP box transcription factors.
3) Gene families with short conserved motifs, e.g. DEAD box (Asp-Glu-Ala-Asp), WD (Trp-Asp) repeat.
b) Functional gene families: the proteins produced from these genes work together as a unit or participate in the same process.
1) Regulatory protein gene families
2) Immune system proteins
3) Motor proteins
4) Signal transducing proteins
5) Transporters

Multigene Families
- The classic examples of multigene families of nonidentical genes are two related families of genes that encode globins.

Pseudogenes
Pseudogenes are defective copies of genes. They have lost their protein-coding ability: they have stop codons in the middle of the gene, lack promoters, or are truncated (just fragments of genes). They arise through the accumulation of multiple mutations.
- Processed pseudogenes are copied from mRNA and incorporated into the chromosome but lack protein-coding ability (no introns, poly-A tail present, no promoter).
- Non-processed pseudogenes are the result of tandem gene duplication or transposable element movement. When a functional gene gets duplicated, one copy isn’t necessary for life.

Duplicated Genes
- Encode closely related (homologous) proteins
- Clustered together in the genome
- Formed by duplication of an ancestral gene followed by mutation.

CpG Islands
A region of the genome with a higher frequency of CpG sites than the rest of the genome.
- Formal definition: a CpG island is a region of at least 200 bp with a GC percentage greater than 50%.
- CpG is shorthand for “C-phosphate-G”, that is, cytosine and guanine separated by only one phosphate.
- CpG islands are located in the promoter regions of genes and can therefore play important roles in gene silencing.
- Housekeeping genes: almost all housekeeping genes are associated with at least one CpG island.
- Tissue-specific genes: about 40% of tissue-specific genes are associated with islands.

Genome Sizes
For the least sophisticated organisms, such as Mycoplasma genitalium, a minimal genome is sufficient. For increased organismal complexity, more genetic information and, therefore, a larger genome is needed. As a result, there is a positive correlation between organismal complexity and genome size, especially in prokaryotes. In eukaryotes, however, this correlation becomes much weaker, largely due to the existence of noncoding DNA elements in varying amounts in different eukaryotic genomes. In terms of total gene number, the currently documented range is 182 in the genome of Candidatus Carsonella ...

Protein-Coding Regions of the Genome
The protein-coding regions are the part of the genome that we foremost study and know most about. The content of these regions directly affects protein synthesis and protein diversity in cells. In prokaryotic cells, functionally related protein-coding genes are often arranged next to each other and regulated as a single unit known as an operon.
The gene structure in eukaryotic cells is more complicated. The coding sequences (CDSs) of almost all eukaryotic genes are not continuous but are interrupted by noncoding sequences. The noncoding intervening sequences are called introns (int for intervening), whereas the coding regions are called exons.
During gene transcription, both exons and introns are transcribed. In the subsequent mRNA maturation process, introns are spliced out and exons are joined together for protein translation.

Exome, Transcriptome, Clinical Exome
In the human genome, the average number of exons per gene is 8.8. The titin gene, coding for a large abundant protein in striated muscle, has 363 exons, the most in any single gene, and also has the longest single exon (17,106 bp) among all currently known exons. The total number of currently known exons in the human genome is around 180,000. With a combined size of 30 Mb, they constitute 1% of the human genome. This collection of all exons in the human genome, or in other eukaryotic genomes, is termed the exome. Unlike the transcriptome, which is composed of all actively transcribed mRNAs in a particular sample, the exome includes all exons contained in a genome. Although it only covers a very small percentage of the genome, the exome represents the most important and the best annotated part of the genome. Sequencing of the exome has been used as a popular alternative to whole genome sequencing. While it covers less of the genome, exome sequencing is more cost effective, faster, and easier for data interpretation.

DNA Sequence Mutation and Polymorphism
Although DNA replication is a high-fidelity process and the nucleus maintains an army of DNA repair enzymes, sequence mutation does happen, though at a very low frequency. In general, the rate of mutation in prokaryotic and eukaryotic cells is on the scale of 10^-9 per base per cell division. In multicellular eukaryotic organisms, germline cells have a lower mutation rate than somatic cells. In these organisms, because most cells, including germline cells, undergo multiple divisions in the organism’s lifetime, the per-generation mutation rate is significantly higher.
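The step from a per-division rate to a per-generation rate can be sketched with simple arithmetic. The snippet below is only an illustration, assuming mutations accumulate linearly and independently over divisions; the number of germline divisions (10) is a made-up placeholder, and the diploid genome size of about 6.4 × 10^9 bp is an approximate value not given in the lecture.

```python
def per_generation_rate(per_division_rate: float, germline_divisions: int) -> float:
    """Approximate per-base mutation rate per generation.

    Simplifying assumption: each division contributes the same small,
    independent chance of mutation, so the rates simply add up.
    """
    if per_division_rate < 0 or germline_divisions < 0:
        raise ValueError("rate and division count must be non-negative")
    return per_division_rate * germline_divisions


def expected_new_mutations(rate_per_base: float, genome_size_bp: float) -> float:
    """Expected number of new mutations in a genome of the given size."""
    return rate_per_base * genome_size_bp


if __name__ == "__main__":
    # 1e-9 per base per division is the scale quoted in the text;
    # 10 divisions is a hypothetical value chosen for illustration.
    rate = per_generation_rate(1e-9, 10)
    # ~6.4e9 bp is an approximate diploid human genome size (assumption).
    print(rate, expected_new_mutations(rate, 6.4e9))
```

With these placeholder inputs the result is a per-generation rate ten times the per-division rate and a few dozen new mutations per diploid genome, the same order of magnitude as measured values.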
For example, whole genome sequencing data collected from human blood cell DNA estimate a mutation rate of 1.1 × 10^-8 per base per generation, corresponding to about 70 new mutations in each human diploid genome. Depending on the nature of the change, mutations may have deleterious, neutral, or, rarely, beneficial effects on the organism.
There are various forms of DNA mutations, from single nucleotide substitutions, to small insertions/deletions (or indels), to structural variations (SVs) that involve larger genomic regions. Among these different types of mutations, single nucleotide substitutions, also called point mutations, are the most common. These substitutions can be either transitions or transversions. Transitions involve the substitution of a purine for the other purine (i.e., A↔G) or a pyrimidine for the other pyrimidine (i.e., C↔T). Transversions replace a purine with a pyrimidine or vice versa.

Genome Evolution: Mutation
The spontaneous mutations that lead to sequence variation and polymorphism in a population are also the fundamental force behind genome evolution. Gradual sequence change and diversification of early genomes, over billions of years, have given rise to the extremely large number of genomes that have existed or are functioning in varying complexity today. In this process, existing DNA sequences are constantly modified, duplicated, and reshuffled. Most mutations in protein-coding or regulatory sequences disrupt the protein’s normal function or alter its amount in cells, causing cellular dysfunction and affecting organismal survival. Under rare conditions, however, a mutation can improve existing protein function or lead to the emergence of new functions. If such a mutation offers its host a competitive advantage, it is more likely to be selected and passed on to future generations.

Genome Evolution: Duplication
Gene duplication provides another major mechanism for genome evolution.
If a genomic region containing one or multiple genes is duplicated, resulting in the formation of an SV, the duplicated region is not under selection pressure and therefore becomes substrate for sequence divergence and new gene formation. Although there are other ways of adding new genetic information to a genome, such as interspecies gene transfer, DNA duplication is believed to be a major source of new genetic information.

Gene Family as Model (Olfactory Receptors)
Gene duplication often leads to the formation of gene families. Genes in the same family are homologous, but each member has its specific function and expression pattern. As an example, in the human genome there are 339 genes in the olfactory receptor gene family. Odor perception starts with the binding of odorant molecules to olfactory receptors located on olfactory neurons inside the nose epithelium. To detect different odorants, a combination of different olfactory receptors that are coded by genes in this family is required. Based on their sequence homology, members of this large family can be further grouped into different subfamilies.

DNA Recombination Role in Genome Evolution
DNA recombination, or reshuffling of DNA sequences, also plays an important role in genome evolution. Although it does not create new genetic information, by breaking existing DNA sequences and rejoining them, DNA recombination changes the linkage relationships between different genes and other important regulatory sequences. Without recombination, once a harmful mutation is formed in a gene, the mutated gene will be permanently linked to other nearby functional genes, and it becomes impossible to regroup all the functional genes into the same DNA molecule. Through this regrouping, DNA recombination makes it possible to avoid gradual accumulation of harmful gene mutations. Most DNA recombination events happen during meiosis in the formation of gametes (sperm or eggs) as part of sexual reproduction.
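The regrouping effect of recombination described above can be sketched as a toy model. This is a deliberately simplified illustration, not a model of real meiosis: chromosomes are represented as lists of allele labels, with uppercase letters standing for functional alleles and lowercase letters for harmful mutant alleles. The names `crossover` and `harmful_count` are hypothetical helpers introduced here.

```python
from typing import List, Tuple

Chromosome = List[str]  # one allele label per gene, in linkage order


def crossover(a: Chromosome, b: Chromosome, point: int) -> Tuple[Chromosome, Chromosome]:
    """Single crossover: the two homologs exchange everything from `point` on."""
    if len(a) != len(b):
        raise ValueError("homologous chromosomes must have the same number of loci")
    if not 0 <= point <= len(a):
        raise ValueError("crossover point out of range")
    return a[:point] + b[point:], b[:point] + a[point:]


def harmful_count(chrom: Chromosome) -> int:
    """Count mutant alleles (toy convention: lowercase label = harmful)."""
    return sum(1 for allele in chrom if allele.islower())


if __name__ == "__main__":
    # Each parental chromosome carries one harmful allele at a different gene.
    maternal = ["a", "B", "C"]
    paternal = ["A", "B", "c"]
    rec1, rec2 = crossover(maternal, paternal, 1)
    print(rec1, harmful_count(rec1))
    print(rec2, harmful_count(rec2))
```

In this toy case a crossover between the first and second gene yields one recombinant carrying both harmful alleles and one carrying none, which is the regrouping of functional genes onto a single DNA molecule that linkage alone would prevent.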
Genome Sequencing and Disease Risk
New sequencing technologies have uncovered extensive sequence variation in individual genomes within a population. The extent of this sequence variation was not envisioned in the early days of genetics, not even when the Human Genome Project was completed in 2003. This has gradually led to a paradigm shift in disease diagnosis and prevention. As a result, the public has become more aware of the role of individual genomic makeup in disease development and predisposition. In addition, easier access to our DNA sequence has further prompted us to look into our genome and use that information for preemptive disease prevention. The declining cost of genome sequencing has also enabled the biomedical community to dig deeper into the genomic underpinnings of diseases, by unraveling the linkage between sequence polymorphism in the genome and disease incidence. Following is a brief overview of the major categories of human diseases that have an intimate connection with DNA mutation, polymorphism, genome structure, and epigenomic abnormality.

Mendelian Single-Gene Diseases
The simplest form of hereditary disease is caused by mutation(s) in a single gene, and is therefore called a monogenic or Mendelian disease. For example, sickle cell anemia is caused by a mutation in the HBB gene located on human chromosome 11. This gene codes for the β subunit of hemoglobin, an important oxygen-carrying protein in the blood. A mutation of this gene leads to the replacement of the sixth amino acid, glutamic acid, with another amino acid, valine, in the coded protein. This change of a single amino acid causes a conformational change of the protein, leading to the generation of sickle-shaped blood cells that die prematurely. This disease is recessive, meaning that it only appears when both copies (or alleles) of the gene carry the mutation. In dominant diseases, however, one mutant allele is enough to cause sickness.
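The recessive and dominant rules just stated can be written as a small decision function. This is only a sketch under simplifying assumptions: an autosomal gene with exactly two copies and complete penetrance, neither of which is discussed in the lecture. The function name `is_affected` is a hypothetical helper.

```python
def is_affected(mutant_alleles: int, inheritance: str) -> bool:
    """Return True if a person with this many mutant alleles shows the disease.

    Assumes an autosomal gene (two copies) and complete penetrance.
    """
    if mutant_alleles not in (0, 1, 2):
        raise ValueError("mutant_alleles must be 0, 1 or 2")
    if inheritance == "recessive":
        return mutant_alleles == 2  # both copies must carry the mutation
    if inheritance == "dominant":
        return mutant_alleles >= 1  # one mutant copy is enough
    raise ValueError("inheritance must be 'recessive' or 'dominant'")


if __name__ == "__main__":
    for mode in ("recessive", "dominant"):
        for n in (0, 1, 2):
            print(mode, n, is_affected(n, mode))
```

A carrier with one mutant allele is unaffected under the recessive rule but affected under the dominant rule.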
Huntington’s disease, a neurodegenerative disease that leads to gradual loss of mental faculties and physical control, is such a dominant single-gene disease. It is caused by a mutation in a gene called HTT on human chromosome 4, coding for a protein called huntingtin. The mutation involved is an expanded and unstable trinucleotide (CAG) repeat. Individuals carrying one copy of the mutant HTT gene usually develop the disease later in life.

Complex Diseases That Involve Multiple Genes
Most common diseases, including heart disease, diabetes, hypertension, obesity, and Alzheimer’s disease (AD), are caused by multiple genes. In the case of AD, while its familial or early-onset form can be attributed to one of three genes (APP, PSEN1, and PSEN2), the most common form, sporadic AD, involves a large number of genes. In this type of complex disease, the contribution of each gene is modest, and it is the combined effects of mutations in these genes that predispose an individual to the disease. Besides genetic factors, lifestyle and environmental factors often also play a role in these complex diseases. For example, a history of head trauma, lack of mentally stimulating activities, and high cholesterol levels are all risk factors for developing AD. Because of the number of genes involved and their interactions with nongenetic factors, complex multigene diseases are more challenging to study than single-gene diseases.

Diseases Caused by Genome Instability
Diseases can also occur as consequences of large-scale genomic changes such as rearrangement of large genomic regions, alterations of chromosome number, and general genome instability. For example, when a genome becomes unstable in an organism, it can cause congenital developmental defects, tumorigenesis, premature aging, and so forth. Dysfunction in genome maintenance, such as DNA repair and chromosome segregation, can lead to genome instability.
Fanconi anemia, a disease caused by genome instability, is characterized by growth retardation, congenital malformation, bone marrow failure, high cancer risk, and premature aging. The genome instability in this disease is caused by mutations in a cluster of DNA repair genes, and is manifested by increased mutation rates, cell cycle disturbance, chromosomal breakage, and extreme sensitivity to reactive oxygen species and other DNA damaging agents.
Cancer, to a large degree, is also caused by genome instability. This is hinted at by the fact that two well-known high-risk cancer genes, BRCA1 and BRCA2, are both DNA damage repair genes. Mutations in the two genes greatly increase the susceptibility to tumorigenesis, such as breast and ovarian cancers. In general, many cancers are characterized by chromosomal aberrations and genome structural changes, involving deletion, duplication, and rearrangement of large genomic regions.
The fact that genome instability is intimately related to major aspects of cancer cells, such as cell cycle regulation and DNA damage repair, also points to the important role of genome instability in cancer development.

Epigenomic/Epigenetic Diseases
Besides gene mutations and genome instability, abnormal epigenomic/epigenetic patterns can also lead to diseases. Examples of diseases in this category include fragile X syndrome, ICF syndrome, Rett syndrome, and Rubinstein-Taybi syndrome. In ICF syndrome, for example, the gene DNMT3B is mutated, leading to a deficiency of DNA methyltransferase 3B. Patients with this disease invariably have DNA hypomethylation, and have symptoms such as facial anomaly, immunodeficiency, and chromosome instability.
Cancer, as a genome disease that is caused by more than one genetic/genomic factor, is also characterized by abnormal DNA methylation, including both hypermethylation and hypomethylation. The hypermethylation is commonly observed in the promoter CpG islands of tumor suppressor genes, which leads to their suppressed transcription. The hypomethylation is mostly located in highly repetitive sequences, including tandem repeats in the centromere and interspersed repeats. This lowered DNA methylation has been suggested to play a role in promoting chromosomal rearrangements and genome instability.

Thank you