Data Retrieval and Analysis

Questions and Answers

Which of the following best describes the primary function of the National Center for Biotechnology Information (NCBI)?

  • Conducting original research in molecular biology.
  • Regulating biotechnology industries.
  • Developing new pharmaceutical drugs.
  • Advancing science and health by providing access to biomedical and genomic information. (correct)

Within the context of bioinformatics databases, what is meant by 'annotation'?

  • The act of computationally predicting protein structures.
  • The process of aligning multiple DNA sequences.
  • The assignment of descriptive information to genomic elements. (correct)
  • The statistical analysis of gene expression data.

Why is the > symbol crucial at the beginning of the definition line in FASTA format?

  • It indicates the start of the actual sequence data.
  • It denotes the end of a sequence.
  • It specifies the quality score of the sequence.
  • It is a required formatting element for analysis programs to recognize the sequence. (correct)

Which of the following is the most accurate description of a 'sequence flat file' as used in bioinformatics?

  • A document displaying the sequence alongside relevant information like gene name, source, and annotations. (correct)

What information is contained within the 'LOCUS' field of a GenBank entry?

  • Locus name, sequence length, molecule type, GenBank division, and modification date. (correct)

If a researcher identifies a new variant of a gene sequence and submits it to GenBank, how is the existing entry updated according to GenBank's versioning system?

  • The accession number remains stable, and the version number is incremented. (correct)

What is the primary purpose of an accession number in bioinformatics databases?

  • To serve as a unique and stable identifier for a sequence record. (correct)

In a GenBank record, which section provides information about genes, gene products, and regions of biological significance reported in the sequence?

  • FEATURES (correct)

Which of the following is the primary function of the EMBL-EBI?

  • To maintain a comprehensive range of freely available molecular databases. (correct)

Within a GenBank file, the ORIGIN section might be left blank or display 'Unreported'. If it does contain data, what does this section primarily provide?

  • The actual nucleotide or amino acid sequence data. (correct)

Flashcards

Sequence Data Format

A specific layout or arrangement of text characters, symbols, keywords, and descriptions to identify a sequence and its attributes.

FASTA Format

A simple and widely used format for storing biological sequences (DNA or protein).

GenBank

An online database at the National Center for Biotechnology Information (NCBI) that contains an annotated collection of publicly available DNA sequences.

Accession Number

A unique identifier for a sequence record, usually a combination of letters and numbers.

Version Number

A nucleotide sequence identification number in GenBank that represents a single, specific sequence.

GI Number

A sequence identification number specific to the nucleotide sequence; a new GI number is assigned if the sequence is altered.

Keywords

A word/phrase describing the sequence, often present in older records within databases like GenBank.

Source

Information including an abbreviated form of the organism name and the molecule type.

Feature

Information about genes and gene products.

Organism (GenBank)

The formal scientific name for the source organism and its lineage.

Study Notes

Data Retrieval and Analysis

  • A wide range of online databases and resources provides a wealth of information
  • To use them well, you need to know which databases exist, which tools search them, and which tools analyze data across resources

Searching for a Gene of Interest

  • Determining whether you need nucleotide or protein sequences is the first step
  • You need to know if you require genomic or RNA-derived nucleotide sequences
  • You need to know if you want all possible sequences that exist, or just curated ones
  • Retrieving a sequence involves finding and interpreting the annotated information in the sequence features table
  • The sequence features table contains information about the different kinds of features such as gene, mRNA, coding region, and tRNA

Gene Identification or Annotation Questions

  • The characterization of genomic features using computational and experimental methods helps answer the following:
  • Which region codes for a protein?
  • Which DNA strand is used to encode the gene?
  • Where does the gene start and end?
  • Where are the exon-intron boundaries?
  • Where are the regulatory sequences for that gene?

Bioinformatics Resources

  • Two of the most popular bioinformatics resources are:
  • National Center for Biotechnology Information (NCBI)
  • European Bioinformatics Institute (EMBL-EBI)

NCBI

  • The National Center for Biotechnology Information (NCBI) is an NIH-funded initiative
  • It was established to store molecular biology information
  • It has grown since the completion of the Human Genome Project and the reduction in sequencing costs
  • It develops and maintains a variety of databases and resources

GenBank

  • GenBank is the NIH genetic sequence database
  • It contains an annotated collection of all publicly available DNA sequences
  • The database is updated regularly

NCBI - DNA and RNA

  • NCBI provides access to DNA and RNA sequence data via its comprehensive databases

NCBI - Expanding Beyond DNA Data

  • NCBI's extensive resources extend beyond just DNA data

EMBL-EBI

  • The EMBL-EBI maintains the world's most comprehensive range of freely available and up-to-date molecular databases
  • EMBL-EBI also offers online and live training events for using their resources
  • The training can be found at this URL: https://www.ebi.ac.uk/training/

Sequence Data Formats

  • A sequence data format is a specific layout that uses text, characters, symbols, keywords, and descriptions to identify a sequence and its attributes
  • A typical file includes text, numbers, and simple signs, and is readable and printable by a computer

FASTA Format

  • FASTA stands for "fast all" and is a simple and widely used format for storing biological sequences
  • It is essential for sequence-data input in various sequence-analysis programs

FASTA Format Crucial Elements

  • The definition line starts with the ">" sign and is a crucial element
  • Analysis programs will fail without the ">" sign included
  • There should be no space between the ">" sign and the first letter of the definition
  • Lines after the header contain the DNA or protein sequence, using a one-letter code, and can be written with or without gaps

FASTA Format Examples

  • Example 1 (protein):
    >Mouse Oatp-5 protein
    MGEPGKRVGI HRVRCFAKI KVFVYM
  • Example 2 (mRNA):
    >Mouse Oatp-5 mRNA
    ATCAATTTAGATTAAAGCTTATATGCA
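
The rules above (a ">" definition line followed by one or more sequence lines) can be sketched as a small reader. This is a minimal illustration, not the parser used by any particular analysis program; the function name `parse_fasta` is hypothetical, and dropping spaces inside sequence lines is an assumption made so that gapped and ungapped layouts give the same result.

```python
def parse_fasta(text):
    """Return a list of (definition, sequence) tuples from FASTA-formatted text."""
    records = []
    definition, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if definition is not None:
                records.append((definition, "".join(chunks)))
            definition, chunks = line[1:], []
        elif definition is None:
            # Without the ">" line the record cannot be recognized.
            raise ValueError("sequence data found before a '>' definition line")
        else:
            # Assumption: spaces inside a sequence line are layout gaps only.
            chunks.append("".join(line.split()))
    if definition is not None:
        records.append((definition, "".join(chunks)))
    return records


example = """>Mouse Oatp-5 protein
MGEPGKRVGI HRVRCFAKI KVFVYM
>Mouse Oatp-5 mRNA
ATCAATTTAGATTAAAGCTTATATGCA
"""
records = parse_fasta(example)
for definition, sequence in records:
    print(definition, len(sequence))
```

Input that lacks the ">" line raises an error here, mirroring the point that analysis programs cannot recognize a sequence without it.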

Sequence Flat File Format

  • When submitting a sequence, relevant information must be included, such as:
  • Name of mRNA or gene
  • Source
  • Author
  • Annotation
  • Version
  • Open reading frame
  • Putative translation product
  • All information is displayed, along with the sequence, in a flat file

GenBank Data Flat File Components

  • A GenBank flat file begins with a LOCUS field containing several data elements:
  • Locus name
  • Sequence length
  • Molecule type
  • GenBank division
  • Modification date
  • Example LOCUS: AH002844 4969 bp DNA linear PRI 10-JUN-2016

GenBank Division

  • The GenBank division is indicated with a three-letter abbreviation
  • The GenBank database is divided into 18 divisions, including:
  • PRI (primate sequences)
  • ROD (rodent sequences)
  • MAM (other mammalian sequences)
  • VRT (other vertebrate sequences)
  • INV (invertebrate sequences)
  • PLN (plant, fungal, and algal sequences)
  • BCT (bacterial sequences)
  • VRL (viral sequences)
  • Examples:
  • LOCUS AH002844 4969 bp DNA linear PRI 10-JUN-2016
  • LOCUS AF213260 2798 bp mRNA linear ROD 31-JAN-2001
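
As a rough sketch, the LOCUS examples above can be split into named fields with plain string handling. The helper `parse_locus` and its dictionary keys are hypothetical names chosen for illustration. The code assumes a nucleotide record laid out exactly like the two examples (name, length, "bp", molecule type, topology, division, date); real records can deviate from this, and the division table holds only the codes listed in these notes.

```python
# Division codes listed in the notes above (GenBank defines more than these).
DIVISIONS = {
    "PRI": "primate sequences",
    "ROD": "rodent sequences",
    "MAM": "other mammalian sequences",
    "VRT": "other vertebrate sequences",
    "INV": "invertebrate sequences",
    "PLN": "plant, fungal, and algal sequences",
    "BCT": "bacterial sequences",
    "VRL": "viral sequences",
}


def parse_locus(line):
    """Split a LOCUS line into named fields (hypothetical helper)."""
    fields = line.split()
    if fields and fields[0] == "LOCUS":
        fields = fields[1:]  # drop the keyword itself
    # Assumption: exactly seven whitespace-separated fields, as in the examples.
    if len(fields) != 7 or fields[2] != "bp":
        raise ValueError(f"unexpected LOCUS layout: {line!r}")
    name, length, _unit, molecule, topology, division, date = fields
    return {
        "locus_name": name,
        "length_bp": int(length),
        "molecule_type": molecule,
        "topology": topology,
        "division": division,
        "division_description": DIVISIONS.get(division, "not in this table"),
        "modification_date": date,
    }


record = parse_locus("LOCUS AF213260 2798 bp mRNA linear ROD 31-JAN-2001")
print(record["division"], "=", record["division_description"])
```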

DEFINITION

  • Brief description of the sequence, including the source organism, the gene or protein name, or, if the sequence is non-coding, some description of its function
  • If the sequence has a coding region (CDS), the description may be followed by a completeness qualifier, such as "complete cds"
  • Example: DEFINITION Homo sapiens insulin (INS) gene, complete cds

ACCESSION

  • Unique identifier for a sequence record, combining letters and numbers
  • It is usually a single letter followed by five digits (e.g., U12345) or two letters followed by six digits (e.g., AF123456)
  • It is a stable way of identifying GenBank entries
  • Accession numbers are used for both nucleotide and protein records
  • Example: ACCESSION AF213260
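
The two accession layouts described above can be checked with a regular expression. This is a minimal sketch: the name `looks_like_accession` is hypothetical, and the pattern covers only the two layouts named in these notes, so it will reject other accession formats that exist in practice.

```python
import re

# One letter + five digits, or two letters + six digits (the layouts in the notes).
ACCESSION_RE = re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}")


def looks_like_accession(text):
    """Return True if text matches one of the two layouts above."""
    return ACCESSION_RE.fullmatch(text) is not None


for candidate in ["U12345", "AF123456", "AF213260", "AF21326", "U12345.1"]:
    print(candidate, looks_like_accession(candidate))
```

The last candidate fails because the ".1" suffix belongs to the VERSION field rather than to the accession itself.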

Accession Numbers

  • Each GenBank record is assigned a unique accession number
  • The record includes both a sequence and its annotations

Accession Number Prefixes

  • Accession number prefixes indicate sequence types in the GenBank database
  • NC_ or NG_: known RefSeq, genomic regions or assembly
  • NM_: known RefSeq, mRNA
  • NR_: known RefSeq, RNA
  • NP_: known RefSeq, protein
  • NT_ or NW_: known RefSeq, genomic contig or scaffold
  • XM_: model RefSeq, mRNA
  • XR_: model RefSeq, RNA
  • XP_: model RefSeq, protein
  • AH: segmented set header in GenBank, not a RefSeq prefix
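
The RefSeq prefixes above lend themselves to a simple lookup table. The sketch below is illustrative only: `describe_prefix` is a hypothetical helper, the descriptions are copied from the list above, and the accession `NM_123456` is a made-up value used purely to exercise the code.

```python
# Prefix descriptions taken from the list above.
REFSEQ_PREFIXES = {
    "NC_": "known RefSeq, genomic regions or assembly",
    "NG_": "known RefSeq, genomic regions or assembly",
    "NM_": "known RefSeq, mRNA",
    "NR_": "known RefSeq, RNA",
    "NP_": "known RefSeq, protein",
    "NT_": "known RefSeq, genomic contig or scaffold",
    "NW_": "known RefSeq, genomic contig or scaffold",
    "XM_": "model RefSeq, mRNA",
    "XR_": "model RefSeq, RNA",
    "XP_": "model RefSeq, protein",
}


def describe_prefix(accession):
    """Return the RefSeq category for an accession, or None if it has no listed prefix."""
    return REFSEQ_PREFIXES.get(accession[:3].upper())


print(describe_prefix("NM_123456"))  # hypothetical accession
print(describe_prefix("AF213260"))   # GenBank accession from the notes, no RefSeq prefix
```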

VERSION

  • Version is a nucleotide sequence identification number representing a single, specific sequence in GenBank
  • If any change occurs to the sequence data, the version number increases (e.g., U12345.1 to U12345.2)
  • The accession portion will remain stable

GI

  • GI stands for "GenInfo Identifier", which is a sequence identification number for the nucleotide sequence
  • If a sequence changes in any way, a new GI number is assigned
  • For example, VERSION AF213260.1 GI:12619376
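
To make the relationship between accession, version, and GI concrete, the VERSION example above can be split into its parts. This is a sketch under assumptions: `parse_version` and `same_record` are hypothetical helpers, and the pattern expects the `ACCESSION.version` form with an optional `GI:` suffix as shown in the example.

```python
import re

VERSION_RE = re.compile(
    r"(?:VERSION\s+)?(?P<accession>[A-Z]+_?\d+)\.(?P<version>\d+)"
    r"(?:\s+GI:(?P<gi>\d+))?"
)


def parse_version(line):
    """Split 'VERSION AF213260.1 GI:12619376' style text into its parts."""
    match = VERSION_RE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"not an accession.version string: {line!r}")
    gi = match.group("gi")
    return {
        "accession": match.group("accession"),
        "version": int(match.group("version")),
        "gi": int(gi) if gi else None,
    }


def same_record(first, second):
    """Two versions belong to the same record when the accession part matches."""
    return parse_version(first)["accession"] == parse_version(second)["accession"]


print(parse_version("VERSION AF213260.1 GI:12619376"))
print(same_record("U12345.1", "U12345.2"))
```

Comparing only the accession part reflects the rule that the accession stays stable while the version number increases with each sequence change.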

KEYWORDS

  • Keywords are words or phrases which describe the sequence
  • Keywords are generally present in older records
  • Examples include GC rich region, insulin, polymorphic variation, and tandem repeat

SOURCE

  • Free-format information that includes an abbreviated form of the organism name, sometimes followed by a molecule type
  • Example: SOURCE Homo sapiens (human)

Organism

  • This is the formal scientific name for the source organism, and its lineage
  • The data is based on the phylogenetic classification scheme used in the NCBI Taxonomy Database
  • Example: ORGANISM Homo sapiens Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae; Homo

REFERENCE

  • Publications by the authors of the sequence that discuss the data
  • References are automatically sorted within the record based on date of publication, showing the oldest references first

Reference Example

 - REFERENCE 1
 - AUTHORS Beeckmans,S., De Greve,H., Read,J.S., Bourjolly,A., Ncube,I., Deboeck,F., Bouckaert,J. and Van Driessche,E.
 - TITLE Purification, characterization and amino acid sequence of the seed lectin from Pterocarpus angolensis
 - JOURNAL Unpublished
 - REFERENCE 2 (bases 1 to 998)
 - AUTHORS De Greve,H.
 - TITLE Direct Submission
 - JOURNAL Submitted (20-DEC-2001) De Greve H., Genetische Virologie, Vrije Universiteit Brussel, Paardenstraat 65, Sint-Genesius-Rode, B-1640, BELGIUM
 - PUBMED  6827840

FEATURES

  • Information about genes and gene products, as well as regions of biological significance reported in the sequence
  • These can include regions of the sequence that code for proteins and RNA molecules, as well as a number of other features

Source

  • A mandatory feature in each record that summarizes the length of the sequence, the scientific name of the source organism, and the Taxon ID number
  • Can also include other information such as map location, strain, clone, tissue type, etc., if provided by submitter

CDS

  • Coding sequence
  • Region of nucleotides that corresponds with the sequence of amino acids in a protein
  • The location includes the start and stop codons, and the CDS feature includes an amino acid translation

Protein ID

  • A protein sequence identification number, similar to the Version number of a nucleotide sequence
  • Protein IDs consist of three letters followed by five digits, a dot, and a version number

Translation

  • The amino acid translation corresponding to the nucleotide coding sequence (CDS)
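
The link between a CDS and its translation can be illustrated by translating a short coding sequence codon by codon. This is a sketch, not GenBank's procedure: it assumes the standard genetic code (the notes do not say which table a given record uses), `translate_cds` is a hypothetical name, and the nine-base input is a toy sequence rather than data from a real record.

```python
from itertools import product

# Standard genetic code, codons ordered with T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate a coding sequence up to (not including) the first stop codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA as well as DNA letters
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of three")
    protein = []
    for start in range(0, len(cds), 3):
        amino_acid = CODON_TABLE[cds[start:start + 3]]
        if amino_acid == "*":  # stop codon ends the translation
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate_cds("ATGGCCTAA"))  # toy CDS: start codon, one codon, stop codon
```

Ambiguous bases such as N are not handled and raise a `KeyError` in this sketch.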

ORIGIN

  • The ORIGIN field may be left blank, may appear as "Unreported," or may be followed by the sequence data

Data Access

  • Ensure you are clear what data you are searching for
  • Most tools link to all annotations for a particular query, which helps you find what you are looking for
  • Portals are provided by both NCBI and EBI to allow searching across all databases with a single query
  • Gene: One-stop resource for all annotation information for a gene
  • PubMed: Extensive biomedical literature database
  • Nucleotide: Database of all DNA sequence data
  • dbSNP: Database of single nucleotide polymorphisms
  • Protein: Database of protein sequences
  • RefSeq: Comprehensive, integrated, well-annotated set of reference sequences (genomic, transcript, and protein)
  • OMIM: Online Mendelian Inheritance in Man, with a database of human genes and genetic phenotypes
  • ClinVar: a comprehensive database of genomic variation and the relationship to human health
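
Besides the web portals, records can also be requested programmatically. The sketch below only builds a request URL for NCBI's E-utilities EFetch service, which is not covered in these notes and is introduced here as an assumption about how one might retrieve a record. The helper name `efetch_url` is hypothetical, no network call is made, and NCBI's usage guidelines should be checked before sending real requests.

```python
from urllib.parse import urlencode

# Base address of the EFetch endpoint of NCBI E-utilities.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, db="nuccore", rettype="gb"):
    """Build an EFetch URL for one record.

    db="nuccore" targets the Nucleotide database and db="protein" the Protein
    database; rettype="gb" asks for a GenBank flat file, rettype="fasta" for FASTA.
    """
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return f"{EFETCH}?{urlencode(params)}"


print(efetch_url("AF213260"))
print(efetch_url("AF213260", rettype="fasta"))
```

Opening the printed URL in a browser, or passing it to an HTTP client, should return the record as plain text.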
