Molecular Genetics Lecture 5 PDF
Document Details
Uploaded by ObtainableHeather
Summary
These lecture notes cover gene families, practical steps to classify genes, and the genetic map, providing a comprehensive introduction to molecular genetics. The document explores various aspects of gene classification, including functional and structural approaches. It also explains the concept of open reading frames (ORFs) and the practical steps involved in their identification.
Full Transcript
Notes on Molecular Genetics
Lecture #5
Gene families
PRACTICAL STEPS TO CLASSIFY A GENE
The genetic map
Central Dogma

- Gene families

Previously, we discussed gene classification based on the purpose of the study. We learned that researchers in cancer biology use their own terms (for example, oncogenes for the genes responsible for cancer), while researchers in evolution and bioinformatics use specific terms to express homology and origin, such as orthologues and paralogues. These terms do not define gene families. Here, we will discuss two different approaches to classifying genes into gene families.

One possibility is to classify the genes according to their function, as shown in Figure 40. This system has the advantage that the fairly broad functional categories used in Figure 40 can be further subdivided to produce a hierarchy of increasingly specific functional descriptions for smaller and smaller sets of genes. The weakness of this approach is that functions have not yet been assigned to many genes, so this type of classification leaves out a proportion of the total gene set.

One challenge faced by evolutionary and functional genetic studies is that many genes have been identified as “orphan” or lineage-specific genes because homologous sequences cannot be identified outside a limited taxonomic range. Many lineage-specific genes may not be truly novel, however, since sequence divergence can cause homologs to become undetectable by sequence-search methods (Vakirlis et al., 2020; Weisman et al., 2020). Identifying such extremely divergent homologs remains a significant bioinformatic challenge and limits functional inferences derived from homology (Loewenstein et al., 2009).

Figure 40: Gene families in human genome

A more powerful method is to base the classification not on the functions of genes but on the structures of the proteins that they specify.
A protein molecule is constructed from a series of domains, each of which has a particular biochemical function. Examples are the zinc finger, which is one of several domains that enable a protein to bind to a DNA molecule, and the ‘death domain’, which is present in many proteins involved in apoptosis, the process of programmed cell death. Each domain has a characteristic amino acid sequence, perhaps not exactly the same sequence in every example of that domain, but close enough for the presence of a particular domain to be recognizable by examining the amino acid sequence of the protein. The amino acid sequence of a protein is specified by the nucleotide sequence of its gene, so the domains present in a protein can be determined from the nucleotide sequence of the gene that codes for that protein. The genes in a genome can therefore be categorized according to the protein domains that they specify. This method has the advantage that it can be applied to genes whose functions are not known and hence can encompass a larger proportion of the set of genes in a genome.

Domains are the fundamental unit of tertiary structure. A domain can be defined as a polypeptide chain, or a part of a polypeptide chain, that can independently fold into a stable tertiary structure. Domains are built from different combinations of secondary structural elements and motifs. The alpha helices and beta strands of the motifs are adjacent to each other in the three-dimensional structure and connected by loop regions. This combination of structures then folds to form the active form of the protein (Figure 41). Domains are also units of function. Proteins may comprise a single domain or as many as several dozen domains. Based on how the motifs are connected, protein structures can be divided into three main classes: alpha-domain structures, alpha/beta-domain structures, and beta-domain structures.
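As a rough illustration of recognizing a domain from the amino acid sequence alone, the sketch below scans a protein for a C2H2 zinc-finger-like pattern with a regular expression. The pattern is a simplified consensus (C-x(2,4)-C-x(12)-H-x(3,5)-H), and the function name and the test peptide are assumptions made for this example. Real domain assignment would normally rely on profile-based resources such as Pfam or InterProScan, not a single regular expression.

```python
import re

# Simplified C2H2 zinc-finger consensus: C-x(2,4)-C-x(12)-H-x(3,5)-H.
# This is an illustrative approximation, not a curated domain model.
C2H2_PATTERN = re.compile(r"C.{2,4}C.{12}H.{3,5}H")


def find_c2h2_like(protein):
    """Return (start, end, matched_text) for each non-overlapping match.

    Positions are 0-based and the end index is exclusive.
    """
    protein = protein.upper()
    return [(m.start(), m.end(), m.group())
            for m in C2H2_PATTERN.finditer(protein)]


# Hypothetical peptide constructed to contain exactly one match.
peptide = "MKCPECGKSFSQKSNLQKHQRTHTG"
for start, end, text in find_c2h2_like(peptide):
    print(f"C2H2-like motif at {start}-{end}: {text}")
```

A match only suggests that the protein may belong to a zinc-finger family; the near-but-not-identical sequences described above are the reason profile methods are usually preferred over fixed patterns.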
Figure 41: Protein folding sequence “stages”, including domain formation.

PRACTICAL STEPS TO CLASSIFY A GENE

Classifying a gene into a specific gene family typically involves several steps, which can vary depending on the gene family and the organisms being studied. Here is a general outline of the steps involved in classifying a gene into a specific gene family:

1. Sequence Retrieval:
   o Obtain the sequence of the gene of interest. This can be done through databases like GenBank, Ensembl, or UniProt.
2. Sequence Alignment:
   o Align the gene sequence with known sequences of genes in the gene family of interest. Tools like BLAST, Clustal Omega, or MAFFT can be used for alignment.
3. Phylogenetic Analysis:
   o Construct a phylogenetic tree using the aligned sequences to analyze the evolutionary relationships between the gene of interest and other genes in the family.
4. Conserved Domain Analysis:
   o Identify conserved domains or motifs within the gene sequence using tools like InterProScan or Pfam. These conserved domains can help in the classification process.
5. Functional Annotation:
   o Investigate the known functions of genes within the gene family. Functional annotation can provide insights into the role of the gene of interest.
6. Gene Family Database Search:
   o Search gene family databases such as Pfam, PROSITE, or Gene Ontology to see if the gene of interest matches any known gene families.
7. Literature Review:
   o Review relevant scientific literature to understand the characteristics, functions, and classification criteria of the gene family.
8. Validation:
   o Validate the classification by comparing the characteristics of the gene of interest with other members of the gene family and by consulting experts in the field if necessary.
9. Documentation:
   o Document the classification process, including the methods used, data sources, and reasoning behind the classification.
10. Further Analysis:
   o Perform additional analyses such as protein structure prediction, expression pattern analysis, or functional studies to further characterize the gene within the gene family.

By following these steps, researchers can classify a gene into a specific gene family based on sequence similarity, evolutionary relationships, conserved domains, and functional characteristics.

- The genetic map

We have already learnt that bacterial genomes have compact genetic organizations with very little space between genes. To re-emphasize this point, the complete circular gene map of the E. coli K12 genome is shown in Figure 36. There is non-coding DNA in the E. coli genome, but it accounts for only 11% of the total and it is distributed around the genome in small segments that do not show up when the map is drawn at this scale. In this regard, E. coli is typical of all prokaryotes whose genomes have so far been sequenced - prokaryotic genomes have very little wasted space. There are theories that this compact organization is beneficial to prokaryotes, for example by enabling the genome to be replicated relatively quickly, but these ideas have never been supported by hard experimental evidence.

Figure 36: Genetic map of the E. coli 83972 chromosome and the small plasmid pABU. (a) Nucleotide sequence analysis of the E. coli 83972 chromosome: the two outermost circles represent all putative open reading frames (ORFs), separated according to ORF orientation. The following five circles report the results of a two-way genome comparison between E. coli 83972 and one of the following E. coli strains: CFT073 (UPEC), 536 (UPEC), UTI89 (UPEC), MG1655 (K-12) and Sakai (EHEC O157:H7). Genes shared between the strain pair compared are indicated in grey and variable genome regions are indicated in red. The innermost circle represents the G + C distribution.
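The G + C distribution drawn as the innermost circle of a map like Figure 36 can be approximated by computing the G + C fraction in windows along the sequence. The sketch below is a minimal version of that idea; the function names, window size and step are assumptions chosen for illustration and are not taken from the figure's source.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(seq, window, step):
    """Return (window_start, gc_fraction) pairs along seq.

    Only full-length windows are reported; window_start is 0-based.
    """
    return [(i, gc_fraction(seq[i:i + window]))
            for i in range(0, len(seq) - window + 1, step)]


# Toy sequence: a GC-rich half followed by an AT-rich half.
toy = "GGCCGGCCAATTAATT"
for start, gc in gc_windows(toy, window=8, step=8):
    print(start, gc)
```

For a real genome the window would typically be hundreds or thousands of bases wide, so that local composition shifts stand out from base-to-base noise.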
- The Open reading frame

An open reading frame, as related to genomics, is a portion of a DNA sequence that does not include a stop codon (which functions as a stop signal). A codon is a DNA or RNA sequence of three nucleotides (a trinucleotide) that forms a unit of genomic information encoding a particular amino acid or signaling the termination of protein synthesis (stop codon). There are 64 different codons: 61 specify amino acids and 3 are used as stop codons. A long open reading frame is often part of a gene (that is, a sequence directly coding for a protein).

"Open reading frame" is a terrible term that we're stuck with. "Frame" refers to a frame of reference; what is being read ("reading") is the RNA code, and it is read by the ribosomes in order to make a protein. "Open" means that the road is open to keep reading: the ribosome can keep reading the RNA code and adding one amino acid after another.

DNA, though it is a monotonous repetition of As, Cs, Ts, and Gs, has a language, which is transcribed into RNA and then translated into a protein. During translation the mRNA is not read one letter at a time but three letters at a time. Those three letters are called a codon, and each codon, whether it is AAA, UUU or AUG, is interpreted by the ribosome, the molecular machine that makes the protein, as a certain amino acid. AUG codes for one amino acid (methionine), UUU codes for another (phenylalanine), and so on.

So an open reading frame is a stretch of DNA, or of the RNA transcribed from it, through which the ribosome can travel, adding one amino acid after another, before it runs into a codon that does not code for any amino acid. When that happens, no amino acid can be added and the ribosome stops.
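The codon arithmetic above (64 codons, 61 for amino acids, 3 stops) can be checked with a short sketch. It builds the standard genetic code from the conventional T, C, A, G ordering and translates a DNA reading frame until the first stop codon. The function name and the example sequences are assumptions for illustration, and organisms or organelles that use variant genetic codes are not handled.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TTT, TTC, TTA, TTG, TCT, ... order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(seq):
    """Translate DNA (or RNA) from its first base, stopping at the first stop codon."""
    dna = seq.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon: the reading frame is closed here
            break
        protein.append(amino_acid)
    return "".join(protein)


stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
print(len(CODON_TABLE), "codons;", len(stop_codons), "stops:", stop_codons)
print(translate("AUGUUUAAAUAA"))
```

The table is written with T in place of U because it is keyed on DNA; RNA input is converted before lookup.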
An open reading frame can lie in the forward or the reverse direction, as indicated by the arrow orientation of the ORFs in Figure 37.

Figure 37: (A) The genome map of R7M. The orientation of each ORF corresponds to the direction of transcription. Genes within different functional categories are indicated by colors noted below. (B) Relative abundance of R7M structural proteins detected in ESI-MS/MS.

The codons that make the ribosome stop are called stop codons, and a stop codon ends an open reading frame. An open reading frame is sometimes 300 amino acids long, sometimes 600, and sometimes longer. The longer an open reading frame is, that is, the further you can read before reaching a stop codon, the more likely it is to be part of a gene that codes for a protein.

The final confusing thing about an open reading frame is that, because codons are three nucleotides long and DNA has two strands, the ribosome can read an RNA derived from one strand or the other, and the groups of three can start at any of three offsets. You therefore get three reading frames in one direction (Figure 38) and three reading frames in the other direction. In total there are six different reading frames for every piece of DNA, any of which might contain an open reading frame.

Identification of an Open Reading Frame

Figure 38: An example of frameshift in one direction (3 frames).

Websites for bio-informaticians: To find an ORF: https://www.ncbi.nlm.nih.gov/orffinder/

- Operons are characteristic features of prokaryotic genomes

One characteristic feature of prokaryotic genomes illustrated by E. coli is the presence of operons. In the years before genome sequences, it was thought that we understood operons very well; now we are not so sure.
An operon is a group of genes that are located adjacent to one another in the genome, with perhaps just one or two nucleotides between the end of one gene and the start of the next. All the genes in an operon are expressed as a single unit. This type of arrangement is common in prokaryotic genomes. A typical E. coli example is the lactose operon, the first operon to be discovered (Jacob and Monod, 1961), which contains three genes involved in the uptake of the disaccharide sugar lactose and its conversion into its monosaccharide units - glucose and galactose (Figure 39A). The monosaccharides are substrates for the energy-generating glycolytic pathway, so the function of the genes in the lactose operon is to convert lactose into a form that can be utilized by E. coli as an energy source. Lactose is not a common component of E. coli's natural environment, so most of the time the operon is not expressed and the enzymes for lactose utilization are not made by the bacterium. When lactose becomes available, it switches on the operon; all three genes are expressed together, resulting in coordinated synthesis of the lactose-utilizing enzymes.

Figure 39: Two operons of Escherichia coli. (A) The lactose operon. The three genes are called lacZ, lacY and lacA, the first two separated by 52 bp and the second two by 64 bp. All three genes are expressed together: lacY codes for the lactose permease, which transports lactose into the cell; lacZ codes for β-galactosidase, the enzyme that splits lactose into its component sugars - galactose and glucose; and lacA codes for a transacetylase.

Altogether there are almost 600 operons in the E. coli K12 genome, each containing two or more genes, and a similar number are present in Bacillus subtilis. In most cases the genes in an operon are functionally related, coding for a set of proteins that are involved in a single biochemical activity such as utilization of a sugar as an energy source or synthesis of an amino acid.
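Because the genes of an operon sit next to one another with only short gaps, one crude way to propose candidate operons from an annotated genome is to group neighbouring genes that lie on the same strand and are separated by a small intergenic distance. The sketch below illustrates that heuristic only. The 100 bp threshold, the function name, the gene coordinates and strands, and the extra genes geneX and geneY are all hypothetical; only the 52 bp and 64 bp spacings between the lac genes come from the text. Real operon prediction also draws on promoters, terminators and expression data.

```python
def group_candidate_operons(genes, max_gap=100):
    """Group adjacent same-strand genes separated by at most max_gap bases.

    genes: iterable of (name, start, end, strand) with 1-based inclusive coordinates.
    Returns a list of gene-name lists, keeping only groups of two or more genes.
    """
    ordered = sorted(genes, key=lambda g: g[1])
    groups = []
    for gene in ordered:
        if groups:
            previous = groups[-1][-1]
            gap = gene[1] - previous[2] - 1  # bases between the two genes
            if gene[3] == previous[3] and gap <= max_gap:
                groups[-1].append(gene)
                continue
        groups.append([gene])
    return [[g[0] for g in group] for group in groups if len(group) >= 2]


# Hypothetical coordinates; gaps of 52 bp and 64 bp mirror the lac operon spacing.
toy_genes = [
    ("lacZ", 1, 3000, "+"),
    ("lacY", 3053, 4300, "+"),
    ("lacA", 4365, 5000, "+"),
    ("geneX", 6000, 7000, "+"),   # far from lacA: starts a new group
    ("geneY", 7050, 7500, "-"),   # close to geneX but on the opposite strand
]
print(group_candidate_operons(toy_genes))
```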
Link for my research paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2937477/
Bioinformatics I course material: https://drive.google.com/drive/folders/1tt5xFHlXE8FUnBXuxLyDQCvq_--leKaN?usp=share_link

FIND the ORF in the following sequence:
tggcgaatgg gacgcgccct gtagcggcgc attaagcgcg gcgggtgtgg tggttacgcgcagcgtgacc gctacacttg ccagcgccct agcgcccgct cctttcgctt tcttcccttcctttctcgcc acgttcgccg gctttccccg tcaagctcta aatcgggggc tccctttagggttccgattt agtgctttac ggcacctcga ccccaaaaaa cttgattagg gtgatggttcacgtagtggg ccatcgccct gatagacggt ttttcgccct ttgacgttgg agtccacgttctttaatagt ggactcttgt tccaaactgg aacaacactc aaccctatct cggtctattcttttgattta taagggattt tgccgatttc ggcctattgg ttaaaaaatg agctgatttaacaaaaattt aacgcgaatt ttaacaaaat attaacgttt acaatttcag gtggcacttttcggggaaat gtgcgcggaa cccctatttg tttatttttc taaatacatt caaatatgtatccgctcatg agacaataac cctgataaat gcttcaataa tattgaaaaa ggaagagtatgagtattcaa catttccgtg tcgcccttat tccctttttt gcggcatttt gccttcctgtttttgctcac ccagaaacgc tggtgaaagt aaaagatgct gaagatcagt tgggtgcacgagtgggttac atcgaactgg atctcaacag cggtaagatc cttgagagtt ttcgccccgaagaacgtttt ccaatgatga gcacttttaa agttctgcta tgtggcgcgg tattatcccgtattgacgcc gggcaagagc aactcggtcg ccgcatacac tattctcaga atgacttggttgagtactca ccagtcacag aaaagcatct tacggatggc atgacagtaa gagaattatgcagtgctgcc ataaccatga gtgataacac tgcggccaac ttacttctga caacgatcggaggaccgaag gagctaaccg cttttttgca caacatgggg gatcatgtaa ctcgccttgatcgttgggaa ccggagctga atgaagccat accaaacgac gagcgtgaca ccacgatgcctgcagcaatg gcaacaacgt tgcgcaaact attaactggc gaactactta ctctagcttcccggcaacaa ttaatagact ggatggaggc ggataaagtt gcaggaccac ttctgcgctcggcccttccg gctggctggt ttattgctga taaatctgga gccggtgagc gtgggtctcgcggtatcatt gcagcactgg ggccagatgg taagccctcc cgtatcgtag ttatctacacgacggggagt caggcaacta tggatgaacg aaatagacag atcgctgaga taggtgcctcactgattaag cattggtaac tgtcagacca agtttactca tatatacttt agattgatttaaaacttcat ttttaattta aaaggatcta ggtgaagatc ctttttgata atctcatgaccaaaatccct taacgtgagt tttcgttcca 
ctgagcgtca gaccccgtag aaaagatcaaaggatcttct tgagatcctt tttttctgcg cgtaatctgc tgcttgcaaa caaaaaaaccaccgctacca gcggtggttt gtttgccgga tcaagagcta ccaactcttt ttccgaaggt aactggcttc agcagagcgc agataccaaa tactgtcctt ctagtgtagc cgtagttaggccaccacttc aagaactctg tagcaccgcc tacatacctc gctctgctaa tcctgttaccagtggctgct gccagtggcg ataagtcgtg tcttaccggg ttggactcaa gacgatagttaccggataag gcgcagcggt cgggctgaac ggggggttcg tgcacacagc ccagcttggagcgaacgacc tacaccgaac tgagatacct acagcgtgag ctatgagaaa gcgccacgcttcccgaaggg agaaaggcgg acaggtatcc ggtaagcggc agggtcggaa caggagagcgcacgagggag cttccagggg gaaacgcctg gtatctttat agtcctgtcg ggtttcgccacctctgactt gagcgtcgat ttttgtgatg ctcgtcaggg gggcggagcc tatggaaaaacgccagcaac gcggcctttt tacggttcct ggccttttgc tggccttttg ctcacatgtt ctttcctgcg ttatcccctg attctgtgga taaccgtatt accgcctttg agtgagctgataccgctcgc cgcagccgaa cgaccgagcg cagcgagtca gtgagcgagg aagcggaagagcgcctgatg cggtattttc tccttacgca tctgtgcggt atttcacacc gcatatatgg tgcactctca gtacaatctg ctctgatgcc gcatagttaa gccagtatac actccgctatcgctacgtga ctgggtcatg gctgcgcccc gacacccgcc aacacccgct gacgcgccctgacgggcttg tctgctcccg gcatccgctt acagacaagc tgtgaccgtc tccgggagctgcatgtgtca gaggttttca ccgtcatcac cgaaacgcgc gaggcagctg cggtaaagctcatcagcgtg gtcgtgaagc gattcacaga tgtctgcctg ttcatccgcg tccagctcgttgagtttctc cagaagcgtt aatgtctggc ttctgataaa gcgggccatg ttaagggcggttttttcctg tttggtcact gatgcctccg tgtaaggggg atttctgttc atgggggtaatgataccgat gaaacgagag aggatgctca cgatacgggt tactgatgat gaacatgccc ggttactgga acgttgtgag ggtaaacaac tggcggtatg gatgcggcgg gaccagagaaaaatcactca gggtcaatgc cagcgcttcg ttaatacaga tgtaggtgtt ccacagggtagccagcagca tcctgcgatg cagatccgga acataatggt gcagggcgct gacttccgcgtttccagact ttacgaaaca cggaaaccga agaccattca tgttgttgct caggtcgcagacgttttgca gcagcagtcg cttcacgttc gctcgcgtat cggtgattca ttctgctaaccagtaaggca accccgccag cctagccggg tcctcaacga caggagcacg atcatgcgcacccgtggggc cgccatgccg gcgataatgg cctgcttctc gccgaaacgt ttggtggcgggaccagtgac gaaggcttga gcgagggcgt gcaagattcc gaataccgca agcgacaggccgatcatcgt 
cgcgctccag cgaaagcggt cctcgccgaa aatgacccag agcgctgccggcacctgtcc tacgagttgc atgataaaga agacagtcat aagtgcggcg acgatagtcatgccccgcgc ccaccggaag gagctgactg ggttgaaggc tctcaagggc atcggtcgagatcccggtgc ctaatgagtg agctaactta cattaattgc gttgcgctca ctgcccgctt tccagtcggg aaacctgtcg tgccagctgc attaatgaat cggccaacgc gcggggagaggcggtttgcg tattgggcgc cagggtggtt tttcttttca ccagtgagac gggcaacagctgattgccct tcaccgcctg gccctgagag agttgcagca agcggtccac gctggtttgccccagcaggc gaaaatcctg tttgatggtg gttaacggcg ggatataaca tgagctgtcttcggtatcgt cgtatcccac taccgagata tccgcaccaa cgcgcagccc ggactcggtaatggcgcgca ttgcgcccag cgccatctga tcgttggcaa ccagcatcgc agtgggaacgatgccctcat tcagcatttg catggtttgt tgaaaaccgg acatggcact ccagtcgccttcccgttccg ctatcggctg aatttgattg cgagtgagat atttatgcca gccagccagacgcagacgcg ccgagacaga acttaatggg cccgctaaca gcgcgatttg ctggtgacccaatgcgacca gatgctccac gcccagtcgc gtaccgtctt catgggagaa aataatactgttgatgggtg tctggtcaga gacatcaaga aataacgccg gaacattagt gcaggcagcttccacagcaa tggcatcctg gtcatccagc ggatagttaa tgatcagccc actgacgcgttgcgcgagaa gattgtgcac cgccgcttta caggcttcga cgccgcttcg ttctaccatcgacaccacca cgctggcacc cagttgatcg gcgcgagatt taatcgccgc gacaatttgcgacggcgcgt gcagggccag actggaggtg gcaacgccaa tcagcaacga ctgtttgcccgccagttgtt gtgccacgcg gttgggaatg taattcagct ccgccatcgc cgcttccactttttcccgcg ttttcgcaga aacgtggctg gcctggttca ccacgcggga aacggtctgataagagacac cggcatactc tgcgacatcg tataacgtta ctggtttcac attcaccaccctgaattgac tctcttccgg gcgctatcat gccataccgc gaaaggtttt gcgccattcgatggtgtccg ggatctcgac gctctccctt atgcgactcc tgcattagga agcagcccagtagtaggttg aggccgttga gcaccgccgc cgcaaggaat ggtgcatgca aggagatggcgcccaacagt cccccggcca cggggcctgc caccataccc acgccgaaac aagcgctcatgagcccgaag tggcgagccc gatcttcccc atcggtgatg tcggcgatat aggcgccagcaaccgcacct gtggcgccgg tgatgccggc cacgatgcgt ccggcgtaga ggatcgagatctcgatcccg cgaaattaat acgactcact ataggggaat tgtgagcgga taacaattcccctctagaaa taattttgtt taactttaag aaggagatat acatatgaat accgatgttcgtattgagaa agacttttta 
ggagaaaagg agattccgaa agacgcttat tatggcgtacaaacaattcg ggcaacggaa aattttccaa ttacaggtta tcgtattcat ccagaattaattaaatcact agggattgta aaaaaatcag ccgcattagc aaacatggaa gttggcttactcgataaaga agttgggcaa tatatcgtaa aagctgctga cgaagtgatt gaaggaaaatggaatgatca atttattgtt gacccaattc aaggcggggc aggaacttcc attaatatgaatgcaaatga agtgattgct aaccgcgcat tagaattaat gggagaggaa aaaggaaactattcaaaaat tagtccaaac tcccatgtaa atatgtctca atcaacaaac gatgctttccctactgcaac gcatattgct gtgttaagtt tattaaatca attaattgaa actacaaaatacatgcaaca agaattcatg aaaaaagcag atgaattcgc tggcgttatt aaaatgggaagaacgcactt gcaagacgct gttcctattt tattaggaca agagtttgaa gcatatgctcgtgtaattgc ccgcgatatt gaacgtattg ccaatacgag aaacaattta tacgacatcaacatgggtgc aacagcagtc ggcactggct taaatgcaga tcctgaatat ataagcatcgtaacagaaca tttagcaaaa ttcagcggac atccattaag aagtgcacaa catttagtggacgcaactca aaatacagac tgctatacag aagtttcttc tgcattaaaa gtttgcatgatcaacatgtc taaaattgcc aatgatttac gcttaatggc atctggacca cgcgcaggcttatcagaaat cgttcttcct gctcgacaac ctggatcttc tatcatgcct ggtaaagtgaatcctgttat gccagaagtg atgaaccaag tggcattcca agtgttcggt aatgatttaacaattacatc tgcttctgaa gcaggccaat ttgaattaaa tgtgatggaa cctgtgttattcttcaattt aattcaatcg atttcgatta tgactaatgt ctttaaatcc tttacagaaaactgcttaaa aggtattaag gcaaatgaag aacgcatgaa agaatatgtt gagaaaagcattggaatcat tactgcaatt aacccacatg taggctatga aacagctgca aaattagcacgtgaagcata tcttacaggg gaatccatcc gtgaactttg cattaagtat ggcgtattaacagaagaaca gttaaatgaa atcttaaatc catatgaaat gacacatccg ggaattgctggaagaaaata atgagatccg gctgctaaca aagcaccgaa aggaagctga gttggctgctgccaccgctg agcaataact agcataaccc cttggggcct ctaaacgggt cttgaggggttttttgctga aaggaggaac tatatccgga t
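For readers who want to attempt the exercise programmatically instead of through the ORFfinder website, the sketch below scans all six reading frames for stretches that run from an ATG to the first in-frame stop codon. It is a simplified illustration, not a reimplementation of ORFfinder: it assumes ATG is the only start codon, uses the three standard stop codons, ignores ORFs that run off the end of the sequence, and the function names and the `min_codons` parameter are choices made for this example. Spaces, digits and other non-base characters in pasted sequence text are stripped before scanning.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(raw_sequence, min_codons=30):
    """Find ATG...stop ORFs in all six reading frames.

    Returns tuples (strand, frame, start, end, codons), longest first.
    start/end are 0-based positions on the scanned strand (end exclusive,
    stop codon included); codons counts the coding codons without the stop.
    """
    dna = "".join(ch for ch in raw_sequence.upper() if ch in "ACGT")
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens the frame
                elif start is not None and codon in STOP_CODONS:
                    codons = (i - start) // 3
                    if codons >= min_codons:
                        orfs.append((strand, frame + 1, start, i + 3, codons))
                    start = None  # the stop codon closes the frame
    return sorted(orfs, key=lambda orf: orf[4], reverse=True)


# Small hypothetical example: one 5-codon ORF in frame 3 of the forward strand.
print(find_orfs("cc atg aaa aaa aaa aaa taa gg", min_codons=3))
```

To apply it to the exercise, paste the sequence above into a string and call `find_orfs` on it; the longest entries are the most likely candidates for protein-coding regions, in line with the point made earlier that longer open reading frames are more likely to be part of a gene.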