Lecture 3: Genomic Sequence Databases

Document Details

New Mansoura University

Dr. Rami Elshazli

Tags

genomic sequence databases, bioinformatics, sequencing techniques, biological databases

Summary

This lecture covers genomic sequence databases and sequencing techniques. It describes the classification of biological databases into sequence, structure, and functional categories, and traces sequencing methods from the first generation to recent technologies. This resource provides a basic understanding of bioinformatics procedures.

Full Transcript

Bioinformatics BIO417
Lecture 3: Genomic sequence databases
Prepared by Dr. Rami Elshazli, Associate Professor of Biochemistry and Molecular Genetics

Introduction to genomic sequences

• Bioinformatics involves the use of information technology to collect, store, retrieve, and analyze the enormous amount of biological data available in the form of sequences and structures of proteins and nucleic acids.
• Biological databases are mainly classified into:
  • Sequence databases.
  • Structure databases.
• The first database was created after the sequencing of insulin protein in 1956.
• Insulin was the first protein to be sequenced; it contains 51 amino acid residues.

Sequence Data Generation

• Sequencing techniques have played an essential role in analyzing the biological data of organisms.
• Researchers moved from expensive and time-consuming in vitro and in vivo methods to quick and reliable in silico analysis as a first-line option in biomedical research.

First Generation of Sequencing

• The Sanger technique is categorized as the first generation of sequencing.
• The first genome elucidated with the Sanger technique was a bacteriophage genome of 5374 base pairs.
• The Sanger technique has some drawbacks, such as difficulty in handling complex genomes, and it is still an expensive and time-consuming approach.
• The Maxam and Gilbert sequencing method, which is based on chemical degradation steps, was also widely used as a first-generation sequencing method.
• However, this method is a risky sequencing approach because it uses toxic chemicals to sequence the DNA.

Second Generation of Sequencing

• The second generation of sequencing can generate millions of short reads in parallel.
• This technique is less expensive and less time-consuming than the first generation of sequencing.
• Sequence output is generated without electrophoresis.
• It depends on ligation and synthesis methods.

Third Generation of Sequencing

• To resolve highly complex, repetitive regions of the genome, researchers developed a new generation of sequencing, called the third generation of sequencing.
• This technique is less costly and less time-consuming than the second generation of sequencing.
• This approach can generate long sequence reads.

Classes of Biological Databases

• Biological databases are classified into three major categories:
  • Sequence databases.
  • Structure databases.
  • Functional databases.
• The sequences of proteins and nucleic acids are stored in sequence databases.
• The solved structures of transcripts and proteins are stored in structure databases.
• These databases gather all available DNA, RNA, and protein information and make it freely available.
• Records in NCBI sequence databases are identified by an accession number.
• An accession number is a unique identifier associated with exactly one sequence.
• In addition to the sequence itself, the NCBI databases store some annotation data, such as the name of the species and references to the publication describing that sequence.

Types of Sequence Databases

• The three major databases that store sequence information are:
  • NCBI databases (www.ncbi.nlm.nih.gov).
  • European Molecular Biology Laboratory (EMBL) database (https://www.ebi.ac.uk/).
  • DNA Data Bank of Japan (DDBJ).

Nucleotide Sequence Databases

(1) EMBL/DDBJ/GenBank

• GenBank is a comprehensive, annotated collection of publicly available DNA sequences. It is part of a collaboration that comprises DDBJ, EMBL, and GenBank at NCBI.
• The EMBL database, along with GenBank and DDBJ, plays a pivotal role in the acquisition, storage, and distribution of human genome sequence data.
• DNA and RNA sequences are submitted directly by individual researchers and genome sequencing projects.
• The EMBL Nucleotide Sequence Database (also known as EMBL-Bank) is the primary nucleotide sequence resource maintained by the European Bioinformatics Institute (EBI), situated in the United Kingdom.
• The EBI provides bioinformatics tools for database searching, sequence homology searching, and multiple sequence alignment.
• The DNA Data Bank of Japan (DDBJ) began its DNA data bank activities in 1986.
• The DDBJ collects DNA sequence data mainly from Japanese researchers, and also from researchers all over the world.

(2) Reference Sequence (RefSeq)

https://www.ncbi.nlm.nih.gov/protein/NP_937983.2
https://ncbi.nlm.nih.gov/refseq/

• RefSeq is a commonly used database in genomic and proteomic research.
• The Reference Sequence database is an annotated collection of publicly available nucleotide and protein sequences.
• The RefSeq group works together with many expert groups, including official nomenclature authorities, such as:
  • HUGO Gene Nomenclature Committee (HGNC).
  • UniProt.
• RefSeq sequence records are generated by various methods, depending on the sequence class and organism.
• RefSeq sequence data are retrieved through NCBI's nucleotide and protein databases.

(3) Ensembl

https://www.ensembl.org/index.html

• Ensembl is a genome browser for vertebrate genomes that supports research in comparative genomics, evolution, sequence variation, and transcriptional regulation.
• The Ensembl database supports various publicly available vertebrate genome assemblies.
• The Ensembl database was designed to annotate high-quality draft genome assemblies of different species.
• Ensembl does not produce the genome assemblies.
• It provides annotation on genome assemblies that have been deposited in databases such as GenBank and DDBJ.

Protein Sequence Database

• Different types of protein sequence databases exist, ranging from simple to complex.
• They collect sequence data extracted from many sources, such as translations of the annotated coding regions in GenBank and RefSeq.

(1) GenPept

https://www.ncbi.nlm.nih.gov/protein/NP_937983.2

• The GenPept database is developed by the National Center for Biotechnology Information (NCBI).
• The GenPept database is a collection of sequences based on translations from annotated coding regions in GenBank.
• The GenPept format is text-based and derived from the parent GenBank format.

Commonly used sequence databases and their descriptions

Category | Name | Link | Description
DNA | AFND | allelefrequencies.net | Allele Frequency Net Database
DNA | dbSNP | ncbi.nlm.nih.gov/snp | Database of single nucleotide polymorphisms
DNA | DEG | essentialgene.org | Database of essential genes
DNA | Ensembl | ensembl.org | Ensembl genome browser
DNA | GeneCards | genecards.org | Integrated database of human genes
DNA | MITOMAP | mitomap.org | Human mitochondrial genome database
DNA | 1000 Genomes | 1000genomes.org | A deep catalog of human genetic variation
DNA | RefSeq | ncbi.nlm.nih.gov/refseq | NCBI Reference Sequence Database
Protein | EKPD | ekpd.biocuckoo.org | Eukaryotic Kinase and Phosphatase Database
Protein | ModBase | modbase.compbio.ucsf.edu/ | Database of comparative protein structure models
Protein | PDB | rcsb.org/pdb | Protein Data Bank for 3D structures of biological macromolecules
Protein | UniProt | uniprot.org | Universal protein resource
Protein | TreeFam | treefam.org | Database of phylogenetic trees of animal species
Protein | CATH | cath.biochem.ucl.ac.uk | Protein structure classification
Protein | PDBsum | https://www.ebi.ac.uk/thornton-srv/databases/pdbsum/ | Database that provides an overview of the contents of each 3D macromolecular structure deposited in the Protein Data Bank

(2) UniProt

https://www.uniprot.org/

• UniProt is a comprehensive and freely accessible resource of protein sequence and functional information.
• It is a major protein retrieval database.

(3) AlphaFold

https://alphafold.ebi.ac.uk/
https://alphafold.ebi.ac.uk/entry/O14746
https://doi.org/10.1016/j.arcmed.2024.102970

• AlphaFold is an artificial intelligence (AI) system developed by Google DeepMind that predicts a protein's 3D structure from its amino acid sequence.
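The lecture identifies records by accession number and links to example records such as NP_937983.2 (NCBI protein) and O14746 (AlphaFold). The following is a minimal Python sketch of how such identifiers might be handled in a script. It is an illustration only, not part of the lecture: the function names are hypothetical, the prefix table lists only a few common RefSeq prefixes, and the E-utilities URL assumes the `efetch` endpoint with `rettype=gp` for GenPept-format text. No network request is made; the code only parses strings and builds URLs.

```python
import re
from urllib.parse import urlencode

# A few common RefSeq accession prefixes (not an exhaustive list).
REFSEQ_PREFIXES = {
    "NC_": "genomic (complete molecule)",
    "NM_": "mRNA",
    "NR_": "non-coding RNA",
    "NP_": "protein",
    "XM_": "predicted mRNA",
    "XP_": "predicted protein",
}

# RefSeq-style identifier: two letters, underscore, digits, dot, version.
ACCESSION_RE = re.compile(r"^([A-Z]{2}_\d+)\.(\d+)$")


def parse_refseq(acc_ver):
    """Split 'NP_937983.2' into (accession, version, molecule type)."""
    match = ACCESSION_RE.match(acc_ver)
    if match is None:
        raise ValueError(f"not a RefSeq accession.version: {acc_ver!r}")
    accession = match.group(1)
    version = int(match.group(2))
    molecule = REFSEQ_PREFIXES.get(accession[:3], "unknown")
    return accession, version, molecule


def ncbi_protein_url(acc_ver):
    """Record page in the NCBI protein database (URL pattern from the lecture)."""
    return f"https://www.ncbi.nlm.nih.gov/protein/{acc_ver}"


def genpept_efetch_url(acc_ver):
    """E-utilities URL requesting the record as GenPept flat-file text (assumed parameters)."""
    params = {"db": "protein", "id": acc_ver, "rettype": "gp", "retmode": "text"}
    return "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?" + urlencode(params)


def alphafold_entry_url(uniprot_accession):
    """AlphaFold database entry page (URL pattern from the lecture)."""
    return f"https://alphafold.ebi.ac.uk/entry/{uniprot_accession}"


if __name__ == "__main__":
    print(parse_refseq("NP_937983.2"))
    print(ncbi_protein_url("NP_937983.2"))
    print(genpept_efetch_url("NP_937983.2"))
    print(alphafold_entry_url("O14746"))
```

Keeping the version number separate from the accession reflects the point made in the lecture that an accession is tied to one sequence: when a sequence record is revised, the version after the dot changes while the accession stays the same.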
