Questions and Answers
Which of the following best describes the primary function of the National Center for Biotechnology Information (NCBI)?
- Conducting original research in molecular biology.
- Regulating biotechnology industries.
- Developing new pharmaceutical drugs.
- Advancing science and health by providing access to biomedical and genomic information. (correct)
Within the context of bioinformatics databases, what is meant by 'annotation'?
- The act of computationally predicting protein structures.
- The process of aligning multiple DNA sequences.
- The assignment of descriptive information to genomic elements. (correct)
- The statistical analysis of gene expression data.
Why is the ">" symbol crucial at the beginning of the definition line in FASTA format?
- It indicates the start of the actual sequence data.
- It denotes the end of a sequence.
- It specifies the quality score of the sequence.
- It is a required formatting element for analysis programs to recognize the sequence. (correct)
Which of the following is the most accurate description of a 'sequence flat file' as used in bioinformatics?
What information is contained within the 'LOCUS' field of a GenBank entry?
If a researcher identifies a new variant of a gene sequence and submits it to GenBank, how is the existing entry updated according to GenBank's versioning system?
What is the primary purpose of an accession number in bioinformatics databases?
In a GenBank record, which section provides information about genes, gene products, and regions of biological significance reported in the sequence?
Which of the following is the primary function of the EMBL-EBI?
Within a GenBank file, the ORIGIN section might be left blank or display 'Unreported'. If it does contain data, what does this section primarily provide?
Flashcards
Sequence Data Format
A specific layout or arrangement of text characters, symbols, keywords, and descriptions to identify a sequence and its attributes.
FASTA Format
A simple and widely used format for storing biological sequences (DNA or protein).
GenBank
An online database at the National Center for Biotechnology Information (NCBI) that contains an annotated collection of publicly available DNA sequences.
Accession Number
Version Number
GI Number
Keywords
Source
Feature
Organism (GenBank)
Study Notes
Data Retrieval and Analysis
- The range of online databases and resources provides a wealth of information
- It is crucial to understand which databases exist, which tools enable searching, and which tools analyze data across resources
Searching for a Gene of Interest
- Determining whether you need nucleotide or protein sequences is the first step
- You need to know if you require genomic or RNA-derived nucleotide sequences
- You need to know if you want all possible sequences that exist, or just curated ones
- Retrieving a sequence involves finding and interpreting the annotated information in the sequence features table
- The sequence features table contains information about the different kinds of features such as gene, mRNA, coding region, and tRNA
Gene Identification or Annotation Questions
- The characterization of genomic features using computational and experimental methods helps answer the following:
- Which region codes for a protein?
- Which DNA strand is used to encode the gene?
- Where does the gene start and end?
- Where are the exon-intron boundaries?
- Where are the regulatory sequences for that gene?
Bioinformatics Resources
- Two of the most popular bioinformatics resources are:
- National Center for Biotechnology Information (NCBI)
- European Bioinformatics Institute (EMBL-EBI)
NCBI
- The National Center for Biotechnology Information (NCBI) is an NIH-funded initiative
- It was established to store molecular biology information
- It has grown since the completion of the Human Genome Project and the reduction in sequencing costs
- It develops and maintains a variety of databases and resources
GenBank
- GenBank is the NIH genetic sequence database
- It contains an annotated collection of all publicly available DNA sequences
- The database is updated regularly
NCBI - DNA and RNA
- NCBI provides access to DNA and RNA sequence data via its comprehensive databases
NCBI - Expanding Beyond DNA Data
- NCBI's extensive resources extend beyond just DNA data
EMBL-EBI
- The EMBL-EBI maintains the world's most comprehensive range of freely available and up-to-date molecular databases
- EMBL-EBI also offers online and live training events for using their resources
- The training can be found at this URL: https://www.ebi.ac.uk/training/
Sequence Data Formats
- A sequence data format is a specific layout that uses text, characters, symbols, keywords, and descriptions to identify a sequence and its attributes
- A typical file contains only plain text, numbers, and simple symbols, so it can be read and printed on any computer
FASTA Format
- FASTA stands for "fast all" and is a simple and widely used format for storing biological sequences
- Essential for sequence-data input in various sequence-analysis programs
FASTA Format Crucial Elements
- The definition line starts with the ">" sign and is a crucial element
- Analysis programs will fail to recognize the sequence if the ">" sign is missing
- There should be no space between the ">" sign and the first letter of the definition
- Lines after the header contain the DNA or protein sequence, using a one-letter code, and can be written with or without gaps
FASTA Format Examples
- Example 1:
  >Mouse Oatp-5 protein
  MGEPGKRVGI HRVRCFAKI KVFVYM
- Example 2:
  >Mouse Oatp-5 mRNA
  ATCAATTTAGATTAAAGCTTATATGCA
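The FASTA rules above can be illustrated with a small reader. This is a minimal sketch, not a reference parser: the function name `parse_fasta` is hypothetical, and it assumes each definition line sits on its own line, followed by one or more sequence lines whose internal spaces can be discarded.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (definition, sequence) pairs."""
    records = []
    definition = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if definition is not None:
                records.append((definition, "".join(chunks)))
            definition = line[1:]
            chunks = []
        elif definition is None:
            # Sequence data before any ">" line cannot be assigned to a record.
            raise ValueError("missing '>' definition line")
        else:
            # Drop internal spaces so gapped and ungapped layouts give the same result.
            chunks.append(line.replace(" ", ""))
    if definition is not None:
        records.append((definition, "".join(chunks)))
    return records


# The two example records from these notes, one definition line per record.
example = """>Mouse Oatp-5 protein
MGEPGKRVGI HRVRCFAKI KVFVYM
>Mouse Oatp-5 mRNA
ATCAATTTAGATTAAAGCTTATATGCA
"""

for definition, sequence in parse_fasta(example):
    print(definition, len(sequence))
```

Input that lacks the ">" definition line raises an error here, which mirrors why analysis programs need that character to recognize a sequence.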
Sequence Flat File Format
- When submitting a sequence, relevant information must be included, such as:
- Name of mRNA or gene
- Source
- Author
- Annotation
- Version
- Open reading frame
- Putative translation product
- All information is displayed, along with the sequence, in a flat file
GenBank Data Flat File Components
- The flat file begins with a LOCUS field containing several data elements:
- Locus name
- Sequence length
- Molecule type
- GenBank division
- Modification date
- Example LOCUS: AH002844 4969 bp DNA linear PRI 10-JUN-2016
GenBank Division
- The GenBank division is indicated with a three-letter abbreviation
- The GenBank database is divided into 18 divisions, including:
- PRI (primate sequences)
- ROD (rodent sequences)
- MAM (other mammalian sequences)
- VRT (other vertebrate sequences)
- INV (invertebrate sequences)
- PLN (plant, fungal, and algal sequences)
- BCT (bacterial sequences)
- VRL (viral sequences)
- Examples:
- LOCUS AH002844 4969 bp DNA linear PRI 10-JUN-2016
- LOCUS AF213260 2798 bp mRNA linear ROD 31-JAN-2001
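To show how the LOCUS data elements line up, the sketch below splits a LOCUS line on whitespace. It assumes the eight-token layout of the two examples above (including the "bp" unit and the "linear" topology token); real records may be laid out differently, and the names `parse_locus` and `DIVISIONS` are hypothetical.

```python
# Division codes listed in these notes; GenBank defines further divisions.
DIVISIONS = {
    "PRI": "primate sequences",
    "ROD": "rodent sequences",
    "MAM": "other mammalian sequences",
    "VRT": "other vertebrate sequences",
    "INV": "invertebrate sequences",
    "PLN": "plant, fungal, and algal sequences",
    "BCT": "bacterial sequences",
    "VRL": "viral sequences",
}


def parse_locus(line):
    """Split a LOCUS line into named data elements (assumes 8 whitespace-separated tokens)."""
    tokens = line.split()
    if len(tokens) != 8 or tokens[0] != "LOCUS":
        raise ValueError("unexpected LOCUS layout: " + line)
    return {
        "locus_name": tokens[1],
        "sequence_length": int(tokens[2]),
        "length_unit": tokens[3],
        "molecule_type": tokens[4],
        "topology": tokens[5],
        "division": tokens[6],
        "modification_date": tokens[7],
    }


fields = parse_locus("LOCUS AH002844 4969 bp DNA linear PRI 10-JUN-2016")
print(fields["locus_name"], fields["sequence_length"], fields["molecule_type"])
print(DIVISIONS.get(fields["division"], "division not listed in these notes"))
```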
DEFINITION
- Brief description of the sequence, which includes the source organism, the gene or protein name, or a description of the sequence's function if the sequence is non-coding
- If the sequence has a coding region (CDS), a completeness qualifier follows, such as "complete cds"
- Example: DEFINITION Homo sapiens insulin (INS) gene, complete cds
ACCESSION
- Unique identifier for a sequence record, combining letters and numbers
- Common formats are a single letter followed by five digits (e.g., U12345) or two letters followed by six digits (e.g., AF123456)
- It is a stable way of identifying GenBank entries
- Accession numbers are used for both nucleotide and protein records
- Example: ACCESSION AF213260
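The two accession layouts described above can be checked with a regular expression. This is only a sketch of those two patterns; other accession formats exist and are deliberately not covered, and the names `ACCESSION_PATTERN` and `looks_like_accession` are hypothetical.

```python
import re

# One letter + five digits, or two letters + six digits, as described above.
ACCESSION_PATTERN = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})$")


def looks_like_accession(text):
    """Return True if text matches one of the two accession layouts above."""
    return ACCESSION_PATTERN.match(text) is not None


for candidate in ["U12345", "AF123456", "AF213260", "AF1234"]:
    print(candidate, looks_like_accession(candidate))
```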
Accession Numbers
- Each GenBank record is assigned a unique accession number
- The record includes both a sequence and its annotations
Accession Number Prefixes
- Accession number prefixes indicate sequence types in the GenBank database
- NC_ or NG_: Known RefSeq, genomic regions or assembly
- NM_: Known RefSeq, mRNA
- NR_: Known RefSeq, RNA
- NP_: Known RefSeq, protein
- NT_ or NW_: Known RefSeq, genomic contig or scaffold
- XM_: Model RefSeq, mRNA
- XR_: Model RefSeq, RNA
- XP_: Model RefSeq, protein
- AH: annotated from high-throughput genomic/transcriptomic studies
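The prefix list above is essentially a lookup table, so a short sketch can map an accession to its description. The table copies only the underscore prefixes from these notes; the function name `describe_accession` and the accession "XP_123456" are made up for illustration.

```python
# RefSeq-style prefixes as listed in these notes.
REFSEQ_PREFIXES = {
    "NC_": "Known RefSeq, genomic regions or assembly",
    "NG_": "Known RefSeq, genomic regions or assembly",
    "NM_": "Known RefSeq, mRNA",
    "NR_": "Known RefSeq, RNA",
    "NP_": "Known RefSeq, protein",
    "NT_": "Known RefSeq, genomic contig or scaffold",
    "NW_": "Known RefSeq, genomic contig or scaffold",
    "XM_": "Model RefSeq, mRNA",
    "XR_": "Model RefSeq, RNA",
    "XP_": "Model RefSeq, protein",
}


def describe_accession(accession):
    """Look up the first three characters of an accession in the prefix table."""
    return REFSEQ_PREFIXES.get(accession[:3], "no RefSeq prefix from the list above")


# "XP_123456" is an invented accession; "AF213260" appears in these notes.
print(describe_accession("XP_123456"))
print(describe_accession("AF213260"))
```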
VERSION
- Version is a nucleotide sequence identification number representing a single, specific sequence in GenBank
- If any change occurs to the sequence data, the version number increases (e.g., U12345.1 to U12345.2)
- The accession portion will remain stable
GI
- GI stands for "GenInfo Identifier", which is a sequence identification number for the nucleotide sequence
- If a sequence changes in any way, a new GI number is assigned
- Example: VERSION AF213260.1 GI:12619376
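The VERSION line combines the stable accession, the version number after the dot, and the GI number. The sketch below pulls those three parts out of the example line; it assumes the "accession.version GI:number" layout shown above, and the function name `parse_version_line` is hypothetical.

```python
def parse_version_line(line):
    """Split a VERSION line into accession, version number, and GI number."""
    tokens = line.split()
    if len(tokens) < 2 or tokens[0] != "VERSION":
        raise ValueError("not a VERSION line: " + line)
    # The accession stays stable; the number after the dot changes with the sequence.
    accession, _, version = tokens[1].partition(".")
    gi = None
    for token in tokens[2:]:
        if token.startswith("GI:"):
            gi = int(token[3:])
    return {"accession": accession, "version": int(version), "gi": gi}


print(parse_version_line("VERSION AF213260.1 GI:12619376"))
```

A record whose sequence was revised once would keep the accession AF213260 but carry version 2 and a different GI number.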
KEYWORDS
- Keywords are words or phrases which describe the sequence
- Keywords are generally present in older records
- Examples include "GC rich region", "insulin", "polymorphic variation", and "tandem repeat"
SOURCE
- This field contains free-format information, including an abbreviated form of the organism name, sometimes followed by a molecule type
- Example: SOURCE Homo sapiens (human)
Organism
- This is the formal scientific name for the source organism, and its lineage
- The data is based on the phylogenetic classification scheme used in the NCBI Taxonomy Database
- Example: ORGANISM Homo sapiens Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae; Homo
REFERENCE
- Publications by the authors of the sequence that discuss the data
- References are automatically sorted within the record based on date of publication, showing the oldest references first
Reference Example
- REFERENCE 1
- AUTHORS Beeckmans,S., De Greve,H., Read,J.S., Bourjolly,A., Ncube,I., Deboeck,F., Bouckaert,J. and Van Driessche,E.
- TITLE Purification, characterization and amino acid sequence of the seed lectin from Pterocarpus angolensis
- JOURNAL Unpublished
- REFERENCE 2 (bases 1 to 998)
- AUTHORS De Greve,H.
- TITLE Direct Submission
- JOURNAL Submitted (20-DEC-2001) De Greve H., Genetische Virologie, Vrije Universiteit Brussel, Paardenstraat 65, Sint-Genesius-Rode, B-1640, BELGIUM
- PUBMED 6827840
FEATURES
- Information about genes and gene products, as well as regions of biological significance reported in the sequence
- Features can include regions of the sequence that code for proteins and RNA molecules, as well as other features
Source
- A mandatory feature in each record that summarizes the length of the sequence, the scientific name of the source organism, and the Taxon ID number
- Can also include other information such as map location, strain, clone, tissue type, etc., if provided by submitter
CDS
- Coding sequence
- Region of nucleotides that corresponds with the sequence of amino acids in a protein
- The location includes the start and stop codons, and the feature includes an amino acid translation
Protein ID
- A protein sequence identification number, similar to the Version number of a nucleotide sequence
- Protein IDs consist of three letters followed by five digits, a dot, and a version number
Translation
- The amino acid translation corresponding to the nucleotide coding sequence (CDS)
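To make the link between a CDS and its translation concrete, the sketch below translates a nucleotide coding sequence codon by codon. It assumes the standard genetic code, a sequence that starts in frame, and translation that stops at the first stop codon; the names `CODON_TABLE` and `translate_cds` and the short input sequences are made up for illustration.

```python
BASES = "TCAG"
# Standard genetic code, ordered by first, second, then third base over "TCAG".
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate_cds(cds):
    """Translate an in-frame coding sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for start in range(0, len(cds) - 2, 3):
        # Codons with ambiguous bases are reported as "X".
        amino_acid = CODON_TABLE.get(cds[start:start + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Invented nine-base CDS: start codon, one alanine codon, stop codon.
print(translate_cds("ATGGCCTAA"))
```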
ORIGIN
- The ORIGIN section may be left blank, may appear as "Unreported," or may give the sequence data
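When the ORIGIN section does carry sequence data, the bases are typically printed in numbered, space-separated blocks. The sketch below collects them into one string; it assumes the record ends with a "//" line, and the sample record text and the function name `extract_origin_sequence` are made up for illustration.

```python
def extract_origin_sequence(record_text):
    """Collect the sequence letters that follow the ORIGIN keyword."""
    sequence = []
    in_origin = False
    for line in record_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if line.startswith("//"):
            break  # assumed end-of-record marker
        if in_origin:
            # Keep letters only, dropping position numbers and spaces.
            sequence.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(sequence)


# Invented record fragment reusing the mRNA example from these notes.
sample = """ORIGIN
        1 atcaatttag attaaagctt atatgca
//
"""

print(extract_origin_sequence(sample))
```

An empty ORIGIN section simply yields an empty string with this approach.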
Data Access
- Ensure you are clear what data you are searching for
- Most tools link to all annotations for a particular query, which helps you find what you are looking for
- Portals are provided by both NCBI and EBI to allow searching across all databases with a single query
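Besides the web portals, NCBI records can also be requested programmatically through the E-utilities service. The sketch below only builds a request URL and does not contact the server; treating `nuccore` as the nucleotide database name and `gb` as the flat-file record type is an assumption to check against the E-utilities documentation, and `build_efetch_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, database="nuccore", record_type="gb"):
    """Build (but do not send) a request URL for one record as plain text."""
    query = urlencode({
        "db": database,
        "id": accession,
        "rettype": record_type,
        "retmode": "text",
    })
    return EFETCH_URL + "?" + query


print(build_efetch_url("AF213260"))
```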
Popular Databases
- Gene: One-stop resource for all annotation information for a gene
- PubMed: Extensive biomedical literature database
- Nucleotide: Database of all DNA sequence data
- dbSNP: Database of single nucleotide polymorphisms
- Protein: Database of protein sequences
Supplementary Popular Databases
- RefSeq: Comprehensive, integrated, well-annotated set of reference sequences (genomic, transcript, and protein)
- OMIM: Online Mendelian Inheritance in Man, with a database of human genes and genetic phenotypes
- ClinVar: A comprehensive database of genomic variation and its relationship to human health