Basic concept of Biological macromolecules: DNA, RNA and proteins

DNA

Deoxyribonucleic acid (DNA) is the hereditary material found in humans and all living organisms. It is a double-stranded molecule with a characteristic twisted helical structure. DNA is made up of nucleotides, and each nucleotide has three components: a sugar (deoxyribose) and a phosphate group, which together form the backbone, and a nitrogen-containing base attached to the sugar. Each strand is a chain of many nucleotides. The nitrogenous bases of one strand are complementary to those of the other strand, which maintains the helical symmetry, and each base pair is held together by hydrogen bonds. The four bases are Adenine (A), Guanine (G), Cytosine (C), and Thymine (T); A is complementary to T, and G to C.

DNA is made up of two helical strands that are coiled around the same axis and intertwine to form a double helix. Depending on the direction of coiling, the helix is either right-handed or left-handed; the right-handed helix is the most stable. The helical chains run antiparallel to each other: one polynucleotide chain runs from 5' to 3' and the other runs from 3' to 5'. The chains are connected to each other through hydrogen bonds between their nitrogenous bases.

Regions rich in G-C pairs have a higher melting temperature because these bases are joined by three hydrogen bonds, which require more energy to break than the two hydrogen bonds that join A-T pairs.

The most widely accepted structure of DNA is the right-handed helix known as the B-form, which is 1.9 nm in diameter. Hydrogen bonding contributes to the specificity of base pairing: Adenine preferentially pairs with Thymine through 2 hydrogen bonds, and Cytosine preferentially pairs with Guanine through 3 hydrogen bonds.
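The pairing rules above (A with T through 2 hydrogen bonds, G with C through 3, strands antiparallel) can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the input is assumed to contain only the letters A, T, G and C, and the hydrogen-bond count is a rough indicator of stability, not a melting-temperature formula.

```python
# Watson-Crick partners and hydrogen bonds per base pair, as stated in the text.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
HYDROGEN_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}


def reverse_complement(strand: str) -> str:
    """Return the partner strand, written 5' to 3'.

    The partner runs antiparallel, so the input is read backwards
    while each base is swapped for its complement.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def hydrogen_bond_count(strand: str) -> int:
    """Total hydrogen bonds holding this strand to its complement."""
    return sum(HYDROGEN_BONDS[base] for base in strand)


def gc_fraction(strand: str) -> float:
    """Fraction of positions that are G or C (0.0 to 1.0)."""
    return (strand.count("G") + strand.count("C")) / len(strand)


if __name__ == "__main__":
    print(reverse_complement("ATGC"))      # GCAT
    print(hydrogen_bond_count("AATT"))     # 8  (four A-T pairs)
    print(hydrogen_bond_count("GGCC"))     # 12 (four G-C pairs)
    print(gc_fraction("ATGC"))             # 0.5
```

Two stretches of equal length can therefore differ in the number of hydrogen bonds to be broken, which is the reason G-C rich regions melt at a higher temperature.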
Base pairing always joins a pyrimidine with a purine. Pyrimidines (Thymine and Cytosine) have a single-ring structure, and purines (Adenine and Guanine) have a double-ring structure. The base pairs A = T and G ≡ C are known as complementary base pairs. Hence, in double-stranded DNA the amount of Adenine is equal to the amount of Thymine, and the amount of Guanine is equal to the amount of Cytosine.

The regularity of the helical structure forms two repeating and alternating spaces: the major and minor grooves. The major groove occurs where the backbones are far apart from each other, and the minor groove occurs where they are close together. These grooves act as base-pair recognition and binding sites for proteins: the major groove exposes base-pair-specific information, while the minor groove is largely base-pair nonspecific.

The double-helical structure of DNA is highly regular. Each turn of the helix spans approximately 10 base pairs, and the length of one turn is 3.4 nm. The major groove is 2.2 nm wide and the minor groove is 1.1 nm wide.

Molecular structure of DNA

Major and Minor grooves
As a result of the double-helical nature of DNA, the molecule has two asymmetric grooves, one smaller than the other. The larger one, the major groove, occurs where the backbones are far apart; the smaller one, the minor groove, occurs where they are close together.

Types of DNA

1. A-DNA
In the A-form, the deoxyribose sugar ring adopts the C3'-endo conformation. The base pairs are displaced away from the central axis towards the major groove. The distance between two adjacent base pairs is 0.29 nm. One turn of the helix contains 11 base pairs with a length of 2.8 nm, which is shorter than in the B-form. However, the helical width is 2.3 nm, which is greater than in the B-form. The major groove is narrow and deep, and the minor groove is wide and shallow.
This form of DNA is favored by low hydration and by repeating units of purines or pyrimidines.

2. B-DNA
The standard, commonly known structure of DNA was described by Watson and Crick and is a right-handed double helix. The two chains run antiparallel to each other, one running from 5' to 3' and the other from 3' to 5', and are joined together by complementary nitrogenous base pairing. The distance between adjacent base pairs is 0.34 nm. One turn of the helix contains 10 base pairs with a length of 3.4 nm. This form of DNA is 1.9 nm in diameter, that is, the width of the helix is 1.9 nm. It has a wide and deep major groove of 2.2 nm, which makes it easily accessible to proteins, and a narrow minor groove of 1.1 nm.

3. Z-DNA
Z-DNA is a left-handed helix and differs markedly from the A- and B-forms. It can form when the DNA has an alternating purine-pyrimidine sequence. The backbone is not a smooth helix but an irregular zig-zag, which results from the alternating purines and pyrimidines. It is longer and thinner than the B- and A-forms. The helical width is 1.8 nm, the smallest among the three forms. The distance between adjacent base pairs is 0.37 nm. One turn of the helix contains 12 base pairs with a length of 4.56 nm. The major groove is flat, and the minor groove is narrow and deep.

Functions of DNA
Storage of genetic information
Transmission of genetic information (inheritance)
Gene expression and protein synthesis
Regulation of cellular activities
Mutation and evolution
Self-replication
Repair mechanisms

RNA

RNA (ribonucleic acid) is a single-stranded nucleic acid molecule made up of ribonucleotides. Each ribonucleotide in the RNA chain consists of a ribose sugar, a phosphate group, and a base. Each ribose sugar carries one of four bases: Adenine (A), Guanine (G), Cytosine (C), or Uracil (U). The base is attached to the ribose sugar by an N-glycosidic bond, and neighbouring nucleotides are linked to each other by phosphodiester bonds.
As RNA comprises many ribonucleotides, the length of the nucleotide chain varies with the type and function of the RNA. RNA is a single-stranded molecule and not a double helix; one consequence is that RNA can form a greater variety of three-dimensional structures than DNA. RNA has ribose sugar in its nucleotides rather than deoxyribose. The two sugars differ by one oxygen atom: ribose carries a hydroxyl group at the 2' position, whereas deoxyribose has only a hydrogen atom there. An RNA strand is synthesized in the 5' to 3' direction from a locally single-stranded region of DNA.

The three-dimensional structure of RNA is critical to its stability and function. Being single-stranded, RNA can form complex structures, and its ribose sugars and bases can be modified by cellular enzymes so that it performs different functions. RNA molecules can even fold back on themselves and form intramolecular hydrogen bonds between complementary regions, producing double-stranded segments that carry out specific functions.

Types of RNA
1. mRNA (messenger RNA)
2. rRNA (ribosomal RNA)
3. tRNA (transfer RNA)

Messenger RNA
mRNA is a single-stranded RNA molecule that is complementary to one of the strands of DNA. It is the version of the genetic material that leaves the nucleus and moves to the cytoplasm, where the corresponding proteins are synthesized. mRNA is of the utmost importance during protein synthesis: as the ribosome moves along the mRNA, it reads the base sequence and uses the genetic code to translate it into a specific protein. The code is read as triplets of nitrogenous bases, referred to as codons.

Ribosomal RNA
rRNA is a single-stranded RNA molecule that forms part of the protein-synthesizing organelle, the ribosome. It is synthesized inside the nucleus, specifically in the nucleolus, where the rRNA-coding genes are present. The synthesized rRNAs are of varying sizes, commonly distinguished as small and large.
These newly synthesized rRNAs combine with ribosomal proteins to form the smaller and larger subunits of the ribosome, respectively. The rRNAs are vital in recognizing conserved regions of incoming mRNAs and tRNAs, thus facilitating their binding and enabling protein synthesis. In addition, rRNA has enzymatic (peptidyl transferase) activity: it catalyzes the formation of the peptide bond between two aligned amino acids during protein synthesis.

Transfer RNA
tRNA is a type of RNA molecule that helps to decode the information present in mRNA sequences into specific proteins. It is encoded by DNA in the cell nucleus and transcribed by RNA polymerase III. The tRNA folds upon itself and forms intramolecular complementary base pairs, which give rise to hydrogen-bonded stems and associated loops that contain nucleotides with modified bases. In two dimensions the structure resembles a cloverleaf with three loops and an open end; tRNAs are usually 75-90 ribonucleotides in length. Each loop-bearing arm has a distinct name and function:
DHU or D arm: has the recognition site for the specific enzyme aminoacyl-tRNA synthetase.
T arm: contains the ribosome recognition site.
Anticodon arm: recognizes and binds to the mRNA present in the ribosome.
The open end with no loop is the site of amino acid attachment, through a bond between the 3' OH of the tRNA and the -COOH group of the amino acid.

In general, tRNA reads the code on the mRNA sequence in the ribosome and delivers the specific amino acid that each codon calls for. It does so along the length of the mRNA, producing a polypeptide chain of amino acids (a protein) in association with other important enzymes such as aminoacyl-tRNA synthetase and peptidyl transferase.

Functions of RNA
The prime function of RNA is in protein synthesis. Without RNA, the information encoded in DNA could never be transcribed to make the essential proteins that a cell needs to maintain its integrity.
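The path from a DNA sequence to codons to a chain of amino acids, as described for mRNA and tRNA above, can be sketched in Python. This is a simplified illustration, not a model of the cellular machinery: it assumes the input is the DNA coding strand written 5' to 3', it uses the standard genetic code, and it ignores splicing and everything upstream of the first AUG. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases U, C, A, G at each position.
# One-letter amino acid codes; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand: str) -> str:
    """mRNA has the same sequence as the coding strand, with U in place of T."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read codons from the first AUG until a stop codon or the end of the mRNA."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, nothing to translate
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    mrna = transcribe("ATGTTTGGCTGA")
    print(mrna)             # AUGUUUGGCUGA
    print(translate(mrna))  # MFG  (Met-Phe-Gly, then the UGA stop codon)
```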
mRNAs are now widely used in the pharmaceutical industry to produce potential vaccines.
mRNAs are also used to develop new categories of medicines.
mRNAs have made the construction of cDNA libraries possible.
rRNAs are structural units of ribosomes, which are essential organelles for protein synthesis.
Ribozymes can help suppress the expression of specific mRNAs by cleaving them, without relying on the host's machinery.
Artificial antisense RNAs can arrest protein synthesis by binding to mRNAs, which has contributed to our ability to combat diseases and mutations.

Protein

Proteins are macromolecules made up of monomers called amino acids, which are the building blocks of all proteins. An amino acid is a simple organic compound consisting of a basic group (-NH2), an acidic group (-COOH), and an organic R group that is unique to each amino acid. It is also called an alpha-amino carboxylic acid. Each molecule has a central carbon atom, called the α-carbon, to which both groups are attached. The remaining two bonds of the central carbon are satisfied by a hydrogen atom and the organic R group. The R group can be as simple as a hydrogen atom (H) or a methyl group (-CH3), or it can be a more complex group. Thus, the α-carbon is asymmetric in all amino acids except glycine, where the α-carbon is symmetric because the R group is a hydrogen atom.

Because of this asymmetry, the amino acids (except glycine) exist in two optically active forms: in a Fischer projection, those with the -NH2 group to the right are designated D-forms, and those with the -NH2 group to the left are designated L-forms. The property of existing in two optically different forms is termed chirality. Amino acids are amphoteric compounds with both acidic and basic groups. They carry a net charge at every pH except the isoelectric point, where they exist as zwitterions with no net charge.

Structure of protein

Primary structure
The primary structure of a protein is the linear sequence of amino acids in a polypeptide chain.
It is determined by the sequence of nucleotides in the gene encoding the protein. This sequence dictates how the protein will fold into its functional three-dimensional shape.

Secondary structure
The secondary structure refers to local folding of the polypeptide chain into specific structures.
Alpha helix: A right-handed coil in which each amino acid forms a hydrogen bond with the amino acid four residues further along the chain.
Beta sheet: Formed by beta strands connected laterally by hydrogen bonds, creating a sheet-like structure.
These structures are stabilized by hydrogen bonds and contribute to the overall shape of the protein.

Tertiary structure
The tertiary structure is the overall three-dimensional shape of a single polypeptide chain. The interactions involved are:
Hydrophobic interactions: Nonpolar side chains tend to cluster away from the aqueous environment.
Hydrogen bonds: Between polar side chains.
Disulfide bridges: Covalent bonds between the sulfur atoms of cysteine residues.
Ionic bonds: Between charged side chains.
The tertiary structure is crucial for the protein's function because it determines the positioning of functional groups.

Quaternary structure
The quaternary structure involves the assembly of multiple polypeptide chains into a functional protein complex. An example is hemoglobin, which consists of four subunits, each with its own tertiary structure. This level of structure is important for the regulation and stability of protein function.

Functions of proteins

1. Enzymatic activity
Enzymes catalyze biochemical reactions, speeding them up and allowing them to occur under physiological conditions. Amylase breaks down starch, and DNA polymerase synthesizes DNA.

2. Structural support
Structural proteins provide integrity and support to cells and tissues. Examples are collagen in connective tissues and keratin in hair and nails.

3. Transport
Transport proteins carry molecules across cell membranes or through the bloodstream.
Hemoglobin transports oxygen in the blood, and membrane transport proteins facilitate the movement of substances across cell membranes.

4. Regulatory functions
Regulatory proteins control various biological processes, including gene expression and cellular signaling. Hormones such as insulin regulate blood glucose levels, and transcription factors regulate gene expression.

5. Defense
Defense proteins protect the body from pathogens and other foreign substances. Antibodies recognize and neutralize pathogens, and complement proteins assist in the immune response.

6. Movement
Motor and contractile proteins facilitate movement at both the cellular and organismal levels. Actin and myosin are the contractile proteins in muscle cells; dynein and kinesin are the motor proteins that move cellular components.

7. Storage
Storage proteins hold essential nutrients and ions. Ferritin stores iron, and casein is a milk protein that stores amino acids.

Nucleotide sequence database

Nucleotide databases are a type of biological database containing genetic information: DNA and RNA sequences from a variety of sources, including whole genomes, transcriptomes, and individual genes. The International Nucleotide Sequence Database Collaboration (INSDC) is a group of three organizations, GenBank, EMBL, and DDBJ, that collect and share nucleotide sequence data.

GenBank
GenBank is a sequence database that contains a collection of annotated nucleic acid sequence data. Its primary purpose is to collect, archive, and distribute sequence data of DNA, RNA, and proteins. It includes various types of genetic material, such as genomic DNA, messenger RNA (mRNA), complementary DNA (cDNA), expressed sequence tags (ESTs), high-throughput raw sequence data, and sequence polymorphisms. The database covers a vast range of sequences from different organisms, including bacteria, archaea, eukaryotes, viruses, and more. The data comes from research laboratories around the world and covers both well-studied model organisms and less-characterized species.
Data Types
Nucleotide Sequences: GenBank primarily houses nucleotide sequences, which include genomic DNA, complementary DNA (cDNA), and RNA sequences.
Protein Sequences: Although GenBank is focused on nucleotide sequences, it also includes protein sequences that are translated from nucleotide sequences.

Records
GenBank records are structured to include several key components:
Locus: The name and length of the sequence.
Definition: A brief description of the sequence.
Accession Number: A unique identifier for the sequence.
Version Number: Indicates updates or changes to the sequence.
Source: Information about the organism or tissue from which the sequence was derived.
Features: Annotations describing functional elements, such as genes, coding regions, and regulatory elements.
Sequence: The actual nucleotide or protein sequence.
References: Citations to scientific papers related to the sequence.
Feature Table: Contains detailed annotations of functional elements, including coding sequences (CDS), introns, exons, and other features.

Researchers submit their sequence data to GenBank through the NCBI submission portal. GenBank is regularly updated with new sequences and annotations; as new information becomes available or corrections are made, the records are updated accordingly. GenBank is freely accessible online. Users can search for sequences using various criteria, including accession numbers, keywords, or the sequences themselves.

NCBI provides several tools for working with GenBank data, including:
BLAST: For comparing sequences to identify similarities.
Entrez: A search and retrieval system for accessing GenBank and other NCBI databases.
GenBank Viewer: For visualizing sequence features and annotations.

Applications
Research: GenBank supports a wide range of research applications, from basic biology to applied sciences. It is crucial for gene discovery, evolutionary studies, and comparative genomics.
Clinical and Therapeutic Applications: Sequence data in GenBank can be used in clinical research for identifying genetic variations linked to diseases and in developing targeted therapies.
Education: GenBank is a valuable resource for teaching and learning about genetics and genomics.

EMBL
The European Molecular Biology Laboratory (EMBL) aims to advance the understanding of biological processes at the molecular level and to contribute to the development of new technologies and methodologies. EMBL was established in 1974 and is supported by 27 member states. The main laboratory is located in Heidelberg, Germany. As a data provider it is focused on the storage and distribution of nucleotide and protein sequences, and it also develops tools to help researchers analyze and interpret this data. EMBL conducts a wide range of research in molecular biology, including genomics, proteomics, structural biology, and systems biology.

Data Types
The European Nucleotide Archive (ENA), EMBL's nucleotide sequence database, stores a variety of data types, including:
Genomic Sequences: Complete genomes, chromosome sequences, and large-scale sequencing projects.
Transcriptomic Data: mRNA sequences and gene expression profiles.
Metagenomic Data: Sequences from environmental and microbiome studies.

The data structure includes:
Accession Numbers: Unique identifiers for sequences.
Metadata: Information about the source organism, sequencing methods, and study details.
Sequence Data: The actual nucleotide sequences and associated features.

ENA and EMBL provide tools for searching and retrieving data, including:
ENA Browser: A web-based tool for exploring nucleotide data.
EBI Search: A powerful search engine for accessing multiple databases managed by EMBL's European Bioinformatics Institute (EBI).

EMBL offers a suite of bioinformatics tools for data analysis, including:
Ensembl: A genome browser that provides detailed annotations and comparative genomics tools.
InterPro: A resource for protein sequence analysis and functional annotation.
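The records described above are identified by an accession number, with a version number that tracks revisions of the sequence. A minimal Python sketch of handling such an identifier follows. It assumes the identifier is written in the common "ACCESSION.VERSION" text form; the names SequenceId and parse_sequence_id are hypothetical, and the identifier used in the example is made up for illustration rather than taken from a real record.

```python
from typing import NamedTuple


class SequenceId(NamedTuple):
    """An accession number plus the version of the sequence it refers to."""
    accession: str
    version: int


def parse_sequence_id(text: str) -> SequenceId:
    """Split an 'ACCESSION.VERSION' string into its two parts.

    Raises ValueError if the version suffix is missing or not a number.
    """
    accession, dot, version = text.strip().partition(".")
    if not accession or not dot or not version.isdigit():
        raise ValueError(f"expected ACCESSION.VERSION, got {text!r}")
    return SequenceId(accession, int(version))


if __name__ == "__main__":
    seq_id = parse_sequence_id("AB000001.2")  # made-up identifier
    print(seq_id.accession)  # AB000001
    print(seq_id.version)    # 2
```

Keeping the version separate from the accession makes it easy to tell whether two references point to the same revision of a sequence or to different ones.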
DDBJ
The DNA Data Bank of Japan (DDBJ) is another nucleotide database that exchanges data with GenBank and EMBL as a member of INSDC. DDBJ collects and exchanges nucleotide sequence data and manages bioinformatics tools for data submission and retrieval. It also develops tools for biological data analysis and organizes bioinformatics training courses in Japanese. DDBJ was established in 1986 and is located at the National Institute of Genetics in Mishima, Japan. DDBJ primarily archives nucleotide sequences, including genomic DNA, complementary DNA (cDNA), and RNA sequences.

Data Records
Accession Number: A unique identifier assigned to each sequence.
Definition: A brief description of the sequence.
Version Number: Indicates revisions or updates to the sequence.
Source: Information about the organism or tissue from which the sequence was obtained.
Features: Annotations detailing the functional elements of the sequence, such as genes, exons, introns, and regulatory regions.
Sequence: The actual nucleotide sequence data.
References: Citations to scientific publications related to the sequence.

Tools and access
DDBJ Search: A web-based tool that allows users to search for and retrieve nucleotide sequences from the DDBJ database. Searches can be conducted using various criteria, such as accession numbers, keywords, or sequence motifs.
DDBJ Web Interface: Provides access to sequence records, feature annotations, and other related information. Users can view, download, and analyze sequences through this interface.
BLAST: DDBJ integrates with BLAST (Basic Local Alignment Search Tool), allowing users to compare sequences against the DDBJ database to identify similarities and potential homologs.

Applications of Nucleotide sequence database
Nucleotide databases are used to identify the gene or the function of a particular nucleotide sequence by comparing an unknown sequence with the known sequences in the database.
Nucleotide databases can be used to study and examine gene expression by using the sequence information stored in the databases.
Nucleotide databases are also used to identify potential drug targets and develop new therapies for genetic diseases.
Nucleotide databases also help in identifying genetic variations that may be linked to diseases, which ultimately helps in the development of diagnostic tools and treatments.
Nucleotide databases can be used in phylogenetic analysis to analyze the evolutionary relationships between organisms by comparing their DNA or RNA sequences.

Genomic databases

A genomic database is a specialized repository designed to store, manage, and provide access to a wide range of data related to genomes, genes, and genetic variations. It is a structured collection of data that focuses on genetic information, including:
Genome Sequences: The complete DNA sequences of organisms.
Gene Annotations: Information about the locations, structures, and functions of genes within the genome.
Genetic Variants: Data on variations such as single nucleotide polymorphisms (SNPs), insertions, deletions, and other mutations.
Functional Annotations: Details on the roles of genes and genetic elements, including regulatory regions and their impacts on phenotype and disease.

A genomic database integrates various data types, such as genomic sequences, gene annotations, variant data, and functional information.
It links with other databases and resources for comprehensive analysis.
Genome browsers are available for visualizing genomic data in a graphical format, including gene structures, variant positions, and regulatory elements.
It has tools for comparing genomes across different species to study evolutionary relationships and conserved elements.
It has tools to interpret the functional implications of genetic variants, such as predicting the impact of mutations on protein function or disease.
Many genomic databases are publicly accessible, providing a valuable resource for researchers worldwide.
Periodic updates incorporate new data, improve annotations, and refine database content based on ongoing research.

Ensembl
Ensembl is a major resource for genome annotation and analysis, developed by the European Bioinformatics Institute (EBI) and the Wellcome Trust Sanger Institute. It provides comprehensive data and tools for exploring the genomes of a wide range of species, with a strong emphasis on vertebrates. Ensembl offers a centralized platform for exploring genomic data, including gene annotations, variations, regulatory elements, and comparative genomics.

Genome Browser
Interactive Visualization: Allows users to explore genomic sequences and gene annotations through a web-based interface. Users can zoom in and out, view different tracks, and customize the display to focus on specific regions or features.
Track Management: Users can view various data tracks, including gene models, regulatory regions, variant data, and expression data.

Gene Annotations
Gene Models: Detailed information on gene structures, including exons, introns, and untranslated regions (UTRs). Annotations include both protein-coding genes and non-coding RNA genes.
Transcript Information: Information on different transcript isoforms of a gene, including their sequences and functional annotations.
Protein-Coding Genes: Functional annotations of protein-coding genes, including domain structures and predicted protein functions.

Variant Data
Variant Integration: Data on genetic variants such as single nucleotide polymorphisms (SNPs) and insertions/deletions (indels). Variants are annotated with information on their potential effects on gene function and phenotype.
Variant Effects: Predictions of how variants might impact gene function, including the potential for causing diseases or altering protein function.
Comparative genomics
Gene Orthologs: Identification of orthologous genes across different species, which helps in understanding evolutionary relationships and functional conservation.
Genomic Alignments: Tools for visualizing and analyzing alignments of genomes from different species, highlighting conserved and divergent regions.

Regulatory data
Regulatory Features: Annotations of regulatory elements such as promoters, enhancers, and transcription factor binding sites.
Diverse Taxa: Ensembl includes data for a wide range of species, from model organisms like humans, mice, and zebrafish to less-studied species, providing broad coverage of the tree of life.

Tools and resources
Ensembl genome browser: The main interface for visualizing and exploring genome data. Users can navigate through different chromosomes, view gene structures, and customize the display of various data tracks.
Variant Effect Predictor (VEP): A tool for predicting the functional consequences of genetic variants, such as their impact on gene coding sequences and regulatory regions.
Compara: A tool for comparative genomics that includes gene orthology predictions, synteny analyses, and phylogenetic trees.
Regulatory Build: A resource for exploring and annotating regulatory regions and their potential effects on gene expression.

Functions
Functional Genomics: Helps researchers understand gene functions, regulatory mechanisms, and the impact of genetic variants on phenotypes.
Comparative Genomics: Supports studies of evolutionary relationships and conserved features across species.
Clinical Research: Assists in annotating genetic variants associated with diseases and predicting their potential impact on health.

UCSC genomic browser
The UCSC (University of California, Santa Cruz) Genome Browser is an online tool used for the visualization and analysis of genomic data. Users can zoom in and out from the whole chromosome view down to the nucleotide sequence level.
This allows for detailed exploration of regions of interest. Users can search for specific genes, chromosomal coordinates, and other genomic features. The browser supports various identifiers, such as gene names, accession numbers, or genomic positions.

The browser contains information on:
Genes and Gene Predictions: Tracks for known genes, mRNA, ESTs (expressed sequence tags), and predicted genes.
Conservation Tracks: Compare the conservation of sequences across species to identify functionally important regions.
Variation Data: Shows SNPs (single nucleotide polymorphisms) and structural variations.
Regulatory Data: Includes regions such as promoters, enhancers, and other transcriptional regulatory elements, often based on experimental data from sources like ENCODE.
Epigenomics: Provides tracks related to DNA methylation, histone modification, and other epigenetic features.

The browser supports multiple genome assemblies (also called genome builds) for many species. For example, human genome builds include hg19 (GRCh37) and hg38 (GRCh38). Users can compare different assemblies, which is useful for tracking changes in annotations between versions.

Tools available:
BLAT (BLAST-Like Alignment Tool): A fast alignment tool that finds sequences in the genome that are similar to a query sequence. It is often used to identify the genomic location of short sequences.
Table Browser: Allows more advanced users to extract data in tabular form, download entire datasets, and filter through various tracks for bulk data analysis.
Gene Sorter: A tool for finding genes with similar expression patterns, functions, or evolutionary relationships.
LiftOver: Converts genome coordinates from one assembly to another, ensuring compatibility across different versions of genome builds.
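Genomic positions of the kind accepted by a genome browser are usually written as a chromosome name followed by a start and end coordinate. The Python sketch below parses such a string. It is an illustration only: it assumes the "chrom:start-end" text form with optional thousands separators and treats the coordinates as 1-based and inclusive; the names GenomicRange, parse_position and span_length are hypothetical, and the coordinates in the example are arbitrary. Note that a position is only meaningful together with the assembly (for example hg19 or hg38) it refers to.

```python
import re
from typing import NamedTuple


class GenomicRange(NamedTuple):
    chrom: str
    start: int  # assumed 1-based, inclusive
    end: int    # assumed 1-based, inclusive


# chromosome name, colon, start, hyphen, end; digits may contain commas
_POSITION = re.compile(r"^(?P<chrom>[\w.]+):(?P<start>[\d,]+)-(?P<end>[\d,]+)$")


def parse_position(text: str) -> GenomicRange:
    """Parse a position such as 'chr1:1,000-2,000' into its components."""
    match = _POSITION.match(text.strip())
    if match is None:
        raise ValueError(f"not a chrom:start-end position: {text!r}")
    start = int(match["start"].replace(",", ""))
    end = int(match["end"].replace(",", ""))
    if start > end:
        raise ValueError("start must not be greater than end")
    return GenomicRange(match["chrom"], start, end)


def span_length(region: GenomicRange) -> int:
    """Number of bases covered, counting both end points."""
    return region.end - region.start + 1


if __name__ == "__main__":
    region = parse_position("chr1:1,000-2,000")
    print(region)               # GenomicRange(chrom='chr1', start=1000, end=2000)
    print(span_length(region))  # 1001
```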
Protein sequence database

Protein sequence databases are repositories of protein-related data, specifically amino acid sequences derived from a variety of sources such as genome sequencing projects, laboratory experiments, and predictive bioinformatics tools. These databases are essential for a range of applications, including protein identification, functional annotation, evolutionary studies, and drug discovery.

UniProt
UniProt (Universal Protein Resource) is one of the most comprehensive and widely used protein sequence and functional information databases.

UniProt Database Structure
UniProt is divided into several interrelated components:

a. UniProtKB (UniProt Knowledgebase)
UniProtKB is the main component of UniProt, consisting of two sections:
Swiss-Prot: A manually curated, high-quality section of UniProtKB. Every entry is reviewed by experts, making Swiss-Prot non-redundant and highly reliable. The curation process involves experimental evidence, literature references, and computational tools. Key features include:
Detailed functional annotations (molecular functions, biological processes, cellular localization).
Post-translational modifications (PTMs), isoforms, domains, and catalytic sites.
Information on pathways, mutations, and variants.
Links to external databases like PDB, KEGG, and OMIM.
TrEMBL: Automatically annotated and unreviewed protein sequences. TrEMBL entries are derived from translations of coding sequences from nucleotide databases (such as EMBL-Bank, GenBank, and DDBJ). Although not manually curated, they contain useful predicted annotations and are constantly updated.

b. UniParc (UniProt Archive)
UniParc is a comprehensive, non-redundant protein sequence archive. It stores every unique protein sequence, regardless of its source or annotation. This archive enables tracking of sequence changes across various databases and versions.
Each unique sequence is assigned a stable UniParc identifier (UPI), allowing users to trace identical sequences across species or databases.

c. UniRef (UniProt Reference Clusters)
UniRef provides clustered sets of protein sequences to reduce redundancy and facilitate faster sequence searches. The clustering process groups similar sequences based on a percentage identity threshold:
UniRef100: Contains all sequences from UniProtKB and UniParc, without redundancy.
UniRef90: Clusters sequences with at least 90% sequence identity.
UniRef50: Clusters sequences with at least 50% identity, providing a representative sequence for each cluster.

Annotations
Protein Function: Detailed descriptions of the biological role of proteins, supported by experimental data or computational predictions. It includes molecular functions (enzymatic activity, binding), biological processes (pathways), and involvement in disease.
Gene Information: Gene names, synonyms, and links to external gene-related resources such as Gene Ontology (GO), Ensembl, and NCBI Gene.
Protein Structure: Predicted and experimentally determined structural information. Links to 3D structure databases such as the Protein Data Bank (PDB) are provided.
Protein Domains and Motifs: Annotations about conserved domains, motifs, and family memberships (often linked to resources like Pfam, InterPro, and PROSITE).
Subcellular Localization: Information about where a protein is located within a cell (e.g., nucleus, cytoplasm, membrane).
Post-Translational Modifications (PTMs): Includes phosphorylation, glycosylation, ubiquitination, and cleavage sites. These modifications are critical for protein function and regulation.
Pathways: The involvement of a protein in metabolic or signaling pathways, often linked to pathway databases like KEGG, Reactome, or MetaCyc.
Sequence Variants and Mutations: Information on natural variants and disease-associated mutations is included, often with links to resources like OMIM (Online Mendelian Inheritance in Man) for clinical data.
Protein Interactions: Data on protein-protein interactions, often linked to external interaction databases like IntAct and STRING.
Evolution and Phylogeny: Information on the evolutionary relationships of proteins, including orthologs and paralogs.

Search and retrieval tools
Basic Search: Allows users to query the database using protein names, gene names, accession numbers, keywords, or specific organisms.
Advanced Search: Enables more refined queries by allowing users to filter results based on annotations like function, domain, PTMs, or subcellular localization.
BLAST: UniProt integrates BLAST (Basic Local Alignment Search Tool) so that users can input a query sequence and search the database for similar protein sequences.
Align: This tool aligns multiple sequences, highlighting conserved regions and differences. It is useful for comparing homologous proteins or isoforms.
Batch Retrieval: Users can input multiple accession numbers or gene names to retrieve a list of corresponding entries.

Integration with external databases
NCBI Gene: For detailed genetic and genomic information.
PDB: For 3D structural data.
KEGG, Reactome: For pathway and metabolic information.
OMIM, ClinVar: For disease-related genetic variants.
Ensembl, UCSC Genome Browser: For genomic sequence data.
Gene Ontology (GO): For consistent functional annotation.
Pfam, InterPro: For protein family and domain classification.

Applications
Proteomics: For the identification and annotation of proteins in large-scale proteomics experiments.
Drug Discovery: To identify protein targets, study protein-ligand interactions, and understand pathways involved in diseases.
Comparative Genomics and Evolution: Used to study evolutionary relationships and functional conservation across species.
Molecular Biology: To explore protein function, structure, interactions, and regulation.
Education and Training: As a comprehensive resource for learning about proteins, pathways, and bioinformatics.

Structural databases
Structural databases are repositories that store 3D structural data of biological molecules, such as proteins, nucleic acids (DNA/RNA), and complex macromolecular assemblies. These databases are crucial in bioinformatics, structural biology, drug discovery, and computational biology for understanding molecular function, interactions, and mechanisms. Important structural databases include PDB and SCOP.

PDB
The Protein Data Bank (PDB) is the largest and most comprehensive repository for 3D structural data of biological macromolecules, such as proteins, nucleic acids, and complex assemblies. The PDB was established in 1971 at Brookhaven National Laboratory and is now managed by the Worldwide Protein Data Bank (wwPDB), a partnership whose members include the Research Collaboratory for Structural Bioinformatics (RCSB PDB). Its primary goal is to provide open access to structural data on biological macromolecules and make this data freely available to researchers worldwide.

Information contained
Types of Structures: The PDB contains 3D structures of biological macromolecules, primarily proteins, but also includes nucleic acids (DNA/RNA), small molecules (ligands), protein-ligand complexes, and large assemblies.
Structure Determination Methods:
X-ray Crystallography: The majority of structures are determined using this method, providing high-resolution data.
Nuclear Magnetic Resonance (NMR): Provides insights into the dynamic structures of proteins and their interactions in solution.
Cryo-Electron Microscopy (cryo-EM): An increasingly popular method for studying large complexes and membrane proteins at high resolution.
Atomic coordinates of each atom in the molecule.
Experimental data, such as electron density maps (for X-ray structures) and chemical shifts (for NMR structures).
Metadata including experimental conditions, citations, structure validation, and quality checks.
The PDB has over 200,000 structures (as of 2024), and new structures are added regularly.

Data Submission and Curation
Data Submission: Researchers submit their experimentally determined structures to the PDB using standardized file formats such as PDB or mmCIF. Data submission includes detailed information about the experimental methods used, refinement statistics, and quality indicators.
Curation: Each submitted structure undergoes a curation process that involves validation and annotation. Curators ensure that the structure meets quality standards and provide detailed annotations for molecular features (secondary structures, domains, ligands, etc.).
Validation Tools: PDB uses various tools to ensure data quality, such as MolProbity, which checks geometric correctness and flags structural issues such as steric clashes, bond-length deviations, and Ramachandran outliers.

Data access and tools
The primary interface for accessing PDB data is the RCSB PDB website, which allows users to search, download, and visualize structures.
Search Features: Users can search for structures based on keywords, molecule names, sequence, structure type, or PDB ID.
Download Options: Structure files can be downloaded in various formats, including PDB and mmCIF.
Visualization: Built-in tools such as JSmol (browser-based) allow users to view and manipulate 3D structures directly on the PDB website.
External tools:
Chimera/PyMOL: Popular molecular visualization software used for detailed visualization and molecular modeling.
ChimeraX: An advanced visualization tool for comparative analysis, structural annotation, and figure generation.
VMD (Visual Molecular Dynamics): Often used for visualizing molecular simulations and large-scale structures.
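
As a rough illustration of what the atomic coordinates in a downloaded PDB-format file look like, the sketch below reads a single ATOM record using the fixed column positions of the legacy PDB format. The sample line, the Atom class, and the function name parse_atom_line are illustrative assumptions and are not part of any PDB tool. In practice an established parser, and the mmCIF format for large structures, would usually be preferred.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    serial: int      # atom serial number
    name: str        # atom name, e.g. "N" or "CA"
    res_name: str    # residue name, e.g. "THR"
    chain_id: str    # chain identifier
    res_seq: int     # residue sequence number
    x: float         # orthogonal coordinates in angstroms
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Parse one ATOM/HETATM record using legacy PDB fixed columns."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return Atom(
        serial=int(line[6:11]),        # columns 7-11
        name=line[12:16].strip(),      # columns 13-16
        res_name=line[17:20].strip(),  # columns 18-20
        chain_id=line[21],             # column 22
        res_seq=int(line[22:26]),      # columns 23-26
        x=float(line[30:38]),          # columns 31-38
        y=float(line[38:46]),          # columns 39-46
        z=float(line[46:54]),          # columns 47-54
    )


# Hypothetical example record; the values are illustrative only.
SAMPLE_LINE = "ATOM      1  N   THR A   1      17.047  14.099   3.625  1.00 13.79           N"

atom = parse_atom_line(SAMPLE_LINE)
print(atom.res_name, atom.chain_id, atom.x, atom.y, atom.z)
```

Because the legacy format is column-based rather than delimiter-based, the fields are sliced by position instead of split on whitespace; this is also one reason the format struggles with very large structures.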

Applications of PDB
Structural Biology: PDB data helps researchers understand how proteins and other macromolecules fold, function, and interact. It allows the study of secondary, tertiary, and quaternary structures, and provides insights into structural motifs and functional domains.
Drug Discovery: PDB is extensively used in rational drug design, where researchers analyze protein-ligand binding sites to design small molecule drugs that can bind and modulate protein function. Structural data supports the identification of potential drug targets and aids in virtual screening and docking studies.
Protein Engineering: Scientists use PDB data to engineer proteins with desired properties by modifying their structure. Understanding the 3D conformation of a protein helps predict the effects of mutations on stability and function.
Comparative Modeling (Homology Modeling): Researchers use known structures in the PDB to predict the structures of homologous proteins whose 3D structures have not been experimentally determined.
Bioinformatics and Computational Biology: PDB data is foundational for algorithms that predict protein structure, function, and interactions. Structural data is integrated into machine learning models for predicting protein folding and evolutionary relationships.
Molecular Dynamics Simulations: PDB structures serve as starting points for simulations that study molecular motions and protein dynamics over time, revealing mechanisms behind conformational changes and interactions.

File formats in PDB
PDB Format: The traditional file format containing atomic coordinates, chain information, and structural features. Though simple, it has limitations with large macromolecular complexes.
mmCIF Format: A more flexible and comprehensive format that overcomes PDB format limitations, especially for large structures. It includes additional metadata and supports better interoperability with modern tools.
PDBML (XML Format): A machine-readable format for automated processing and database integration.

SCOP
SCOP (Structural Classification of Proteins) is a hierarchical database that classifies protein structures based on their structural and evolutionary relationships. SCOP was created to organize the growing number of protein structures into a system that reveals their similarities and differences, helping to understand protein function and evolution.

Hierarchical arrangement
SCOP’s classification system is hierarchical and consists of several levels that describe protein structural and evolutionary relationships:
Class: The most general level of classification, grouping proteins by their overall secondary structure composition. Examples of Classes:
All-α: Proteins dominated by alpha helices.
All-β: Proteins primarily made up of beta sheets.
α/β: Proteins with both alpha helices and beta sheets arranged in repeating patterns.
α+β: Proteins with alpha helices and beta sheets that do not follow a regular pattern.
Small Proteins: Proteins with fewer than 100 amino acids, often stabilized by metal ions or disulfide bonds.
Fold: Proteins in this category share the same major secondary structural elements, which are arranged in the same 3D configuration, although they may not have any evolutionary relationship.
Example: The TIM barrel fold, found in many enzymes, consists of eight alpha helices and eight parallel beta strands.
Superfamily: Proteins within the same superfamily share structural and functional similarities, suggesting a common evolutionary origin, even if their sequence identity is low.
Example: The members of the globin superfamily, which includes hemoglobin and myoglobin, evolved from a common ancestor and share similar folds.
Family: Members of a family share both structural and sequence similarities, indicating closer evolutionary relationships than superfamilies.
Example: Serine proteases like trypsin and chymotrypsin belong to the same family because they have a similar structure and high sequence identity.
Protein Domain: The smallest unit of classification in SCOP. A domain is a distinct functional and structural unit of a protein, often responsible for a specific function.
Example: The SH2 domain, found in many proteins involved in signaling pathways, binds to phosphorylated tyrosines in target proteins.

There are different versions of SCOP: SCOP 1.x, SCOP 2.0, SCOPe (SCOP-extended), and ASTRAL.

SCOP curators use the following criteria to classify proteins:
Structural features: SCOP places heavy emphasis on the overall fold, secondary structure elements, and the arrangement of these elements.
Evolutionary relationships: Proteins that share a common evolutionary ancestor are grouped together, even if their sequence similarity is low.
Functional similarities: Proteins with similar biochemical functions, such as enzymes with similar catalytic mechanisms, are often classified within the same superfamily or family.

Applications
Protein Evolution: SCOP helps researchers understand how protein structures have evolved over time by identifying structural motifs shared by proteins from different species.
Function Prediction: SCOP’s classification helps predict the functions of newly discovered proteins by comparing their structures to those of known proteins in the database.
Protein Structure Prediction: Homology modeling and other structure prediction methods often rely on SCOP to identify structural templates for proteins with unknown structures.
Comparative Genomics: SCOP’s classifications are used to study the structural diversity of proteins across different species, providing insights into evolutionary history and functional divergence.
Drug Discovery: Understanding the structural relationships between proteins can aid in the development of drugs that target specific protein families or superfamilies, particularly for enzymes and receptors.
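
The hierarchical arrangement described for SCOP (class, fold, superfamily, family) can be pictured as a record with one field per level. The sketch below is only an illustration under stated assumptions: the ScopRecord class, the deepest_shared_level function, and the level labels given to each protein are hypothetical and are not copied from an actual SCOP release. It shows how two proteins can be compared to find the most specific level at which their classifications agree.

```python
from dataclasses import dataclass
from typing import Optional

# Levels ordered from most general to most specific, as listed in the text.
LEVELS = ("scop_class", "fold", "superfamily", "family")


@dataclass(frozen=True)
class ScopRecord:
    protein: str
    scop_class: str
    fold: str
    superfamily: str
    family: str


def deepest_shared_level(a: ScopRecord, b: ScopRecord) -> Optional[str]:
    """Return the most specific level at which two records agree, or None."""
    shared = None
    for level in LEVELS:
        if getattr(a, level) != getattr(b, level):
            break  # lower levels cannot be shared once a higher level differs
        shared = level
    return shared


# Illustrative labels only (assumed for this example).
hemoglobin = ScopRecord("hemoglobin", "all-alpha", "globin-like", "globin-like", "globins")
myoglobin = ScopRecord("myoglobin", "all-alpha", "globin-like", "globin-like", "globins")
trypsin = ScopRecord("trypsin", "all-beta", "trypsin-like", "trypsin-like", "serine proteases")

print(deepest_shared_level(hemoglobin, myoglobin))
print(deepest_shared_level(hemoglobin, trypsin))
```

The loop stops at the first level that differs because the hierarchy is nested: two proteins in different folds cannot belong to the same superfamily or family.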
