Questions and Answers
Which activity is considered a fundamental aspect of bioinformatics?
- Developing new laboratory techniques for synthesizing proteins.
- Creating new statistical models for climate prediction.
- Analyzing DNA and protein sequences using various programs and databases. (correct)
- Designing computer hardware for biological research.
What role does bioinformatics play in the context of biological education?
- Replacing traditional lab experiments with computer simulations.
- Restricting biological research to computational methods only.
- Helping biologists effectively access databases and use analysis tools. (correct)
- Discouraging interdisciplinary collaboration in biological studies.
Which task relies on bioinformatics for its execution?
- Traditional microscope-based cell counting.
- Culturing cells in a petri dish.
- Analysis of gene variation and expression. (correct)
- Manual sorting of bacterial colonies on agar plates.
How does bioinformatics contribute to clinical applications?
Which of the following is an application of bioinformatics?
What is a key application of bioinformatics in molecular biology?
What is genomics primarily concerned with, as opposed to proteomics?
How do genomics and proteomics differ in terms of the 'unit under study'?
What role do high-throughput techniques play in genomics?
What is a key difference in the nature of study material between genomics and proteomics?
What practical benefit do proteomic studies offer over genomic studies in understanding cell conditions?
Which type of genomics focuses on predicting the functions of proteins?
Which bioinformatics resource focuses on protein families, motifs, and domains?
Which database is a primary nucleotide sequence database located in Japan?
What is the most accurate description of a primary biological database?
How do secondary databases differ from primary databases?
What is the primary function of a biological database?
What is the main advantage of using multiple databases in protein studies?
Which element is a part of the annotation fields in sequence databases?
What is the 'alignment score' in the context of pairwise sequence alignments?
Why is sequence alignment considered a powerful tool?
What is the role of 'null characters' in sequence alignment?
What do FASTA and BLAST primarily achieve?
How does FASTA find matching sequences?
What is meant by 'k-tuples or k-tuplet' in FASTA?
How does BLAST differ from FASTA in sequence alignment?
What is a key utility of BLAST as a bioinformatics tool?
Which of the following is a variant of BLAST used to compare nucleotide sequences against protein sequences?
In the context of a phylogenetic tree, what do the tree's branches represent?
What is transcriptomics primarily concerned with?
How does pharmacogenomics enhance drug safety?
What is a key benefit of pharmacogenomics in healthcare?
What are the initial steps in phylogenetic analysis?
What does the term 'phylogeny' refer to?
What does the genomic era provide?
The BLAST program was developed in 1990 by?
Give an example of a protein databank:
Give an example of a nucleic acid database:
SWISS-PROT is...
The study of gene expression and the level of mRNA in a cell are components of:
Flashcards
Bioinformatics
An interdisciplinary field involving molecular biology, genetics, computer science, mathematics, and statistics to manage and analyze biological data.
Genomics
The study of the entire set of genes in the genome of a cell or organism.
Proteomics
The study of all the proteins produced in a cell.
Study Notes
Bioinformatics
- Bioinformatics is an emerging field applying computers to collect, organize, analyze, manipulate, present, and share biological data.
- It is an interdisciplinary field that involves molecular biology, genetics, computer science, mathematics, and statistics.
- A central part of bioinformatics involves designing and operating biological databases effectively.
- Because the large amounts of nucleotide and protein sequence data generated by research techniques are stored in biological databases, scientists use bioinformatics tools on computers to analyze biological data in their daily research.
- Bioinformatics helps biologists access databases and use analysis tools efficiently, becoming a vital part of biological education.
- As an evolving discipline, bioinformatics uses complex software to retrieve, sort, analyze, predict, and store DNA and protein sequence data.
- A fundamental activity is the sequence analysis of DNA and proteins using web-available programs and databases.
- Pharmaceutical companies employ bioinformaticians to meet and maintain the extensive bioinformatics needs of the industry.
- Besides genome sequence data analysis, bioinformatics is also used for gene variation and expression analysis, and for predicting gene and protein structure and function.
- Bioinformatics also supports critical tasks such as predicting and detecting gene regulation networks and analyzing molecular pathways to understand gene-disease interactions.
- Bioinformatics has clinical applications; for example, genome sequence information allows for the production of human gene products.
Bioinformatics Applications
- Bioinformatics has revolutionized the biological sciences and brought advances and benefits to biotechnology.
- Human genome sequencing was completed in record time because of bioinformatics.
- Bioinformatics applications:
- Mapping biomolecules (DNA, RNA, proteins)
- Identifying nucleotide sequences of functional genes.
- Finding sites that can be cut by restriction enzymes.
- Designing primer sequences for the polymerase chain reaction.
- Predicting functional gene products.
- Tracing the evolutionary trees of genes.
- Predicting the 3-dimensional structure of proteins.
- Molecular modeling of biomolecules.
- Designing drugs for medical treatment.
- Processing large amounts of biological data.
- Developing models of the functions of cells, tissues, and organs.
- Most fields in biological sciences rely on bioinformatics today.
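One of the applications listed above, finding sites that can be cut by restriction enzymes, reduces to searching a DNA string for a recognition sequence. The following is a minimal sketch that assumes the enzyme EcoRI and its recognition site GAATTC; the function name `find_sites` and the example sequence are illustrative and do not come from any particular tool.

```python
def find_sites(dna: str, site: str = "GAATTC") -> list[int]:
    """Return the 0-based start positions of every occurrence of `site` in `dna`."""
    dna = dna.upper()
    site = site.upper()
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start)
        # Step forward by one base so overlapping occurrences are also found.
        start = dna.find(site, start + 1)
    return positions


# Illustrative sequence (not real data) containing two EcoRI sites.
example = "ATGAATTCGGCTAGAATTCTT"
print(find_sites(example))  # [2, 13]
```

A real tool would also scan the reverse strand for non-palindromic sites; GAATTC is its own reverse complement, so that step is skipped here.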
Genomics vs. Proteomics
- Genomics is the study of the entire set of genes in the genome of a cell.
- Proteomics is the study of the entire set of proteins produced by the cell.
- Genomics studies the genetic material present in a cell or organism, while proteomics studies the set of proteins expressed by an organism's genome.
- Genomics studies the genes in an organism, while proteomics studies all the proteins in a cell.
- In genomics, the study unit is the function of genomes; in proteomics, it is the function of proteomes.
- The genome is the same in every cell of an organism, whereas the proteome is dynamic, because protein production differs by tissue according to gene expression.
- Genomics maps, sequences, and analyzes genomes using high-throughput techniques; proteomics uses these techniques to characterize the 3D structure and function of proteins.
- Genomics techniques include sequencing strategies such as directed gene sequencing, the use of expressed sequence tags (ESTs) and single nucleotide polymorphisms (SNPs), and the analysis of sequence data using software and databases.
- Proteomics techniques include extraction and electrophoretic separation of proteins, digestion of proteins with trypsin, amino acid sequencing by mass spectrometry, and the use of the information in protein databases.
- Genomics involves structural and functional genomics, while proteomics involves structural, functional, and expression proteomics.
- Genomics focuses on genome sequencing projects like the Human Genome Project.
- Proteomics focuses on the development of proteome databases like SWISS-2DPAGE and of software for computer-aided drug design.
- Genomic studies are important for understanding the structure, function, location, and regulation of an organism's genes.
- Proteomic studies are important because examining the entire set of proteins produced by a cell type helps explain the cell's structure and function.
- Genomics studies genes in the nucleus, whereas proteins are the functional molecules of the cell and represent its actual conditions, which is why proteomic studies are considered more beneficial for understanding cell conditions.
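The trypsin digestion step mentioned among the proteomics techniques can be simulated in software before matching peptides against a protein database. The sketch below assumes the commonly used rule that trypsin cleaves after lysine (K) or arginine (R) unless the next residue is proline (P); the function name `tryptic_digest` and the example sequence are hypothetical.

```python
def tryptic_digest(protein: str) -> list[str]:
    """Split a protein (one-letter amino acid codes) into tryptic peptides.

    Assumed rule: cleave on the C-terminal side of K or R, except when the
    following residue is P.
    """
    peptides = []
    current = ""
    for i, residue in enumerate(protein):
        current += residue
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(current)
            current = ""
    if current:
        # Keep the trailing peptide that does not end in K or R.
        peptides.append(current)
    return peptides


# Illustrative sequence, not a real database entry.
print(tryptic_digest("MKWVTFISLLRPAGK"))  # ['MK', 'WVTFISLLRPAGK']
```

The R at position 11 is not cleaved because it is followed by P, which is why the second peptide spans it.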
Databases
- Databases are essential for bioinformatics research and applications.
- Many databases exist with various information types, including DNA/protein sequences, molecular structures, phenotypes, and biodiversity.
- Databases may contain empirical data (from experiments), predicted data (from analysis), or both.
- Databases can be specific to an organism, pathway, molecule, or incorporate data from multiple databases.
- Databases differ in format, access mechanism, and whether they are public.
- Commonly used databases include:
- GenBank, UniProt (biological sequence analysis)
- Protein Data Bank (PDB) (structure analysis)
- InterPro, Pfam (protein family and motif finding)
- Sequence Read Archive (Next Generation Sequencing)
- KEGG, BioCyc (Network Analysis: Metabolic Pathway Databases)
- GenoCAD (design of synthetic genetic circuits)
Software and Tools
- Software tools for bioinformatics range from simple command-line tools to complex graphical programs and standalone web-services.
- Many free and open-source software tools for bioinformatics exist, driven in part by the potential for innovative in silico experiments.
- Freely available open code bases make it possible to contribute both to bioinformatics and to the range of open-source software available.
- Open-source tools often act as incubators of ideas or as community-supported plug-ins in commercial applications.
- They may also provide de facto standards and shared object models that help with the challenge of bioinformation integration.
- Open-source software titles include: Bioconductor, BioPerl, Biopython, BioJava, BioJS, BioRuby, Bioclipse, EMBOSS, .NET Bio, Orange with its bioinformatics add-on, Apache Taverna, UGENE and GenoCAD.
- The non-profit Open Bioinformatics Foundation has supported the annual Bioinformatics Open Source Conference (BOSC) since 2000.
- The MediaWiki engine with the WikiOpener extension can be used to build public bioinformatics databases.
Web-Services in Bioinformatics
- SOAP and REST-based interfaces have been developed for bioinformatics applications, allowing applications to use algorithms, data and computing resources remotely.
- The main advantage is that the end user does not have to deal with software and database maintenance.
- The EBI classifies basic bioinformatics services into three categories: SSS (Sequence Search Services), MSA (Multiple Sequence Alignment), and BSA (Biological Sequence Analysis).
Bioinformatics Workflow Management Systems
- A bioinformatics workflow management system is a specialized workflow management system designed for composing and executing a series of computational or data manipulation steps in a bioinformatics application.
- These systems provide an easy-to-use environment for application scientists to create their own workflows.
- These systems provide interactive tools enabling the scientists to execute their workflows and view the results in real time.
- These systems simplify the process of sharing and reusing workflows between scientists.
- These systems enable scientists to track the provenance of the workflow execution results and the workflow creation steps.
- Platforms that provide this service include Galaxy, Kepler, Taverna, UGENE, Anduril, and HIVE.
BioCompute and BioCompute Objects
- In 2014, the US Food and Drug Administration sponsored a conference at the National Institutes of Health Bethesda Campus to discuss reproducibility in bioinformatics.
- Over the next three years, a consortium of stakeholders met regularly to discuss what would define the BioCompute paradigm.
- These stakeholders included representatives from government, industry, and academic entities.
- Session leaders represented numerous branches of the FDA and NIH Institutes and Centers, non-profit entities, and research institutions.
- The non-profit entities included the Human Variome Project and the European Federation for Medical Informatics.
- The research institutions included Stanford, the New York Genome Center, and the George Washington University.
- The BioCompute paradigm would take the form of 'lab notebooks' that allow for the reproducibility, replication, review, and reuse of bioinformatics protocols.
- The US FDA funded this work so that pipelines would be transparent to its regulatory staff.
- In 2016, the group reconvened at the NIH in Bethesda and discussed the potential for a BioCompute object as part of the BioCompute paradigm.
- This work was written up as a "standard trial use" document and as a preprint paper uploaded to bioRxiv. The BioCompute object allows the JSON-ized record to be shared among employees, collaborators, and regulators.
Biological Databases
- Modern genomic research is hallmarked by the generation of enormous amounts of raw sequence data.
- Sophisticated computational methodologies manage the growing volume of genomic data.
- Computer databases manage the staggering amount of information, and associated computer software is used to update, query, and retrieve components of the data stored in the system.
- A simple database might be a single file containing many records, each of which holds the same set of information.
- Databases organize data in a structured set of records for easy retrieval of information.
- Examples include GenBank from NCBI, SwissProt from the Swiss Institute of Bioinformatics, and PIR from the Protein Information Resource.
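The idea of a simple database as a single file of uniformly structured records can be sketched with the widely used FASTA text layout, where each record is a header line starting with ">" followed by sequence lines. The records below are invented for illustration, and `parse_fasta` is a hypothetical helper rather than part of any named library.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Index FASTA-formatted text by record identifier.

    Assumes every header line has an identifier right after '>'.
    """
    records: dict[str, str] = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first whitespace-delimited token.
            current_id = line[1:].split()[0]
            records[current_id] = ""
        elif current_id is not None:
            # Sequence data may be wrapped over several lines.
            records[current_id] += line
    return records


# Hypothetical flat-file "database" with two records.
flat_file = """>seq1 hypothetical example record
ATGGCC
TTAA
>seq2 another hypothetical record
GGGCCC
"""

db = parse_fasta(flat_file)
print(db["seq1"])  # ATGGCCTTAA
```

Holding the records in a dictionary keyed by identifier is what turns the flat file into something that can be queried directly.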
Types of Biological Databases
- Biological databases are divided into two categories: primary and secondary.
Primary Databases
- They are also called archival databases.
- They contain experimentally derived data such as nucleotide sequence, protein sequence or macromolecular structure.
- Researchers submit experimental results directly into the database, so the data are essentially archival.
- Data in primary databases aren't changed after being given an accession number, forming a part of the scientific record.
- Examples include ENA, GenBank, and DDBJ (nucleotide sequence), ArrayExpress Archive and GEO (functional genomics data), and the Protein Data Bank (PDB; coordinates of three-dimensional macromolecular structures).
Secondary Databases
- Secondary databases derive data from analyzing primary data.
- Secondary databases often draw upon information from numerous sources, including other databases (primary and secondary), controlled vocabularies, and the scientific literature.
- These databases are highly curated, using a complex combination of computational algorithms and manual analysis and interpretation for deriving new knowledge from the public record of science.
- Examples: InterPro (protein families, motifs and domains), UniProt Knowledgebase (sequence and functional information on proteins), and Ensembl (variation, function, regulation and more layered onto whole genome sequences).
- However, many data resources have both primary and secondary characteristics.
- For example, UniProt accepts primary sequences derived from peptide sequencing experiments.
- UniProt also infers peptide sequences from genomic information and provides a wealth of additional information, some derived by automated annotation (TrEMBL) and some by manual analysis (SwissProt).
- Specialized databases cater to a particular research interest; FlyBase, the HIV sequence database, and the Ribosomal Database Project each specialize in a particular organism or data type.
Importance of Databases
- Databases act as a storehouse of information.
- Databases are used to store and organize data in such a way that information can be easily retrieved.
- Databases facilitate knowledge discovery, that is, identifying connections between pieces of information that were not known when the information was first entered, and so deriving new insights from raw data.
- Secondary databases provide molecular biologists with a reference library of information investigated by the research community.
- Databases improve data access and indexing and help remove redundancy.
Nucleotide Sequences Databases
- As biology has increasingly turned into a data-rich science, the need for storing and communicating large datasets has grown significantly.
- The nucleotide and protein sequences are stored, as is 3D structural data produced by X-ray crystallography and macromolecular NMR.
- The biological information of nucleic acids is available as sequences, while the data for proteins are available as sequences and structures.
- Sequences are one-dimensional, whereas structures contain three-dimensional data.
- A biological database is a collection of data organized so that it can be easily accessed, managed, and updated, combined with software for processing, archiving, querying, and distributing the data.
- Databases with nucleotide sequences are called "nucleic acid sequence databases."
Nucleic acid Sequence Databases
- The Nucleotide database is a sequence collection from sources like GenBank, RefSeq, TPA and PDB.
- Genome, gene and transcript sequence data provide the foundation for biomedical research and discovery.
- The chief primary databases of raw nucleic acid sequences are GenBank, EMBL, and DDBJ.
- They serve as repositories for raw sequence data and are also called the primary nucleotide sequence databases.
- GenBank is physically located in the USA; the EMBL database, named for the European Molecular Biology Laboratory, is in the UK; and DDBJ, the DNA Data Bank of Japan, is in Japan.
- To stay synchronized, the three databases each accept nucleotide sequence submissions and consistently exchange new and updated data.
- They are primary databases because they house original sequence data, and they collaborate with the Sequence Read Archive (SRA), which archives raw reads from high-throughput sequencing instruments.
- The GenBank sequence database is an open-access, annotated collection of publicly available nucleotide sequences and their protein translations.
- NCBI produces and maintains GenBank as part of the International Nucleotide Sequence Database Collaboration (INSDC).
- GenBank receives sequences from laboratories worldwide, covering more than 100,000 distinct organisms, and has grown at an exponential rate, doubling roughly every 18 months.
- The EMBL (European Molecular Biology Laboratory) Nucleotide Sequence Database is a comprehensive collection maintained at the European Bioinformatics Institute (EBI).
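The doubling time quoted for GenBank can be written as a simple exponential growth law. This is only a sketch that takes the 18-month figure at face value; the symbols N_0 (size at the starting point) and t (elapsed time in months) are introduced here for illustration.

```latex
N(t) = N_0 \cdot 2^{\,t/18}, \qquad t \text{ in months}
```

Under this assumption the collection quadruples in 36 months, and over 10 years (t = 120) it grows by a factor of 2^{120/18} = 2^{6.67}, roughly 100-fold.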
Secondary Databases of Nucleotide Sequences
- Many secondary databases are simply sub-collections of GenBank or EMBL, while others add value through annotations, software, presentation of the information, and cross-references.
- Some of these databases do not hold sequences themselves but gather and present information drawn from other sequence databases.
- The Omniome Database is a microbial resource maintained by TIGR that presents the sequence and annotation of completed genomes.
- Omniome holds information about organisms, such as taxon and Gram stain pattern, the structure and composition of their DNA molecules, and the attributes of protein sequences predicted from the DNA.
- It facilitates meaningful multi-genome searches and analyses, such as genome alignment and comparison of the traits of proteins and genes across genomes.
- FlyBase: a consortium sequenced the entire genome of the fruit fly D. melanogaster to a high degree of completeness and quality.
- ACeDB serves as a repository for the sequence and genetic map of the nematode worm C. elegans, as well as phenotypic information about it.
Protein Databases
- Biology has increasingly turned into a data-rich domain, which has increased the need for storing and communicating large datasets.
- Nucleotide and protein sequences are stored alongside structural data produced by X-ray crystallography and macromolecular NMR.
- The biological information of proteins is available as sequences and structures; sequences are one-dimensional, whereas structures are three-dimensional.
- A biological database is a collection of data that can be easily accessed, managed, and updated.
- A protein database includes datasets on protein amino acid sequence, conformation, structure, and features such as active sites.
- Protein databases are compiled by translating DNA sequences from different gene databases and include structural information.
- They are an important resource because proteins mediate most biological functions.
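Translating DNA sequences into protein sequences, as described above, can be sketched with the standard genetic code. The compact 64-letter string below encodes the standard codon table in T, C, A, G order; the function name `translate` is illustrative, and the sketch assumes reading frame 1 on the coding strand and stops at the first stop codon.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence into one-letter amino acid codes."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # Codons with unknown letters (e.g. N) are reported as X.
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*":
            break  # stop codon ends the protein
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCAAGTAA"))  # MAK
```

Because this is a conceptual translation, the result is a prediction from nucleotide data rather than a directly observed protein.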
Importance of Protein Databases
- The huge amounts of protein structure, function, and sequence data now available have led to the creation and use of protein databases.
- Protein databases have the following uses:
- Comparisons between proteins or protein families provide information on the relationships between proteins within a genome.
- Secondary databases provide annotated, derived data on these proteins.
- These databases help researchers understand the structure and function of a protein.
Primary databases of Protein
- The primary databases hold protein sequences obtained by translating experimentally determined nucleotide sequences.
- These protein sequences are therefore not experimentally derived themselves but result from interpretation of the nucleotide data.
- A number of primary protein sequence databases exist.
Protein Information Resource (PIR) – Protein Sequence Database (PIR-PSD)
- The PIR-PSD is a collaboration among the PIR, Germany's MIPS, and Japan's JIPID.
- The PIR-PSD is a comprehensive, non-redundant, expertly annotated database held in an object-relational DBMS.
- A key characteristic of the PIR-PSD is protein sequence classification using the superfamily concept.
- Sequences in the PIR-PSD are also classified by homology domains and sequence motifs: homology domains correspond to evolutionary building blocks, and sequence motifs represent functional sites or conserved regions.
- This classification approach allows a more complete understanding of the sequence-function-structure relationship.
- SWISS-PROT is another well-known and widely used protein database that likewise provides a high level of annotation.
- Each entry in the database consists of two parts: core data and annotation.
- Core data comprise the sequence, entered in single-letter amino acid code, together with related references and bibliography.
- The organism's taxonomy also forms part of the core information.
- The annotation covers post-translational modifications such as phosphorylation and acetylation, binding sites for calcium, ATP, or zinc, structural features such as alpha helices, beta sheets, and quaternary structure, and similarities to other proteins.
Protein Databank (PDB)
- PDB is a primary protein structure (crystallographic) database for the three-dimensional structures of large biological molecules.
- In spite of the name, PDB archives the three-dimensional structures not only of proteins but also of other biologically important molecules.
- The structures are usually determined by X-ray crystallography, NMR experiments, and molecular modeling.
- TrEMBL is a computer-annotated protein sequence database released as a supplement to SWISS-PROT; it contains the translations of all coding sequences present in the EMBL nucleotide database.
- TrEMBL may therefore contain protein sequences that are never expressed or identified in organisms.
Functional Genomics
- Functional genomics is the identification of genes and their respective functions.
Structural Genomics
- Structural genomics deals with predictions related to the functions of proteins.
Comparative Genomics
- Comparative genomics is a means of understanding the genomes of different species of organisms.
DNA Microarrays
- DNA microarrays measure the levels of gene expression in different tissues, various stages of development and in different diseases.
Annotation
- Annotations are text fields of information about a biosequence that are added to sequence databases.
- Annotation describes aspects such as the function(s) of the protein, post-translational modification(s), domains and sites (for example calcium-binding regions, ATP-binding sites, zinc fingers, homeobox, and kringle), secondary structure, quaternary structure, similarities to other proteins, disease(s) associated with deficiencies in the protein, and sequence conflicts and variants.
Transcriptome
- The transcriptome is the set of mRNA transcripts produced by the genome at any one time.
- All cells of an organism contain the same genome while the dynamic transcriptome varies considerably.
Significance of Transcriptomics
- Because the transcriptome includes all mRNA transcripts in the cell, it reflects the genes that are being expressed at a given time.
- Transcriptomics examines the expression level of mRNAs in a given cell population.
Metabolomics
- Genomics is concerned with the total complement of genes and proteomics with the analysis of the entire set of proteins.
- Metabolomics measures low molecular weight metabolites both qualitatively and quantitatively in any given sample, cell, or tissue, and integrating these data helps in the analysis of gene function.
- The genome expression profiling methods (transcriptome, proteome, and metabolome) were developed in the 'post-genomic era'.
- Comprehensive measurements of the differences between cell types, tissues, organs, and whole organisms allow a full, global comparison of the working parts of the system, which helps probe unknown characteristics of gene function, physiology, and metabolism.
Areas of Metabolomics
- Metabolic analysis is divided into 4 areas:
- Target compound analysis: the quantification of specific metabolites.
- Metabolic profiling: the quantitative and qualitative determination of a group of related compounds or of its individual members.
- Metabolomics: the quantitative and qualitative analysis of all metabolites.
- Metabolic fingerprinting: sample classification by rapid global analysis, without extensive compound identification.
Pharmacogenomics
- Pharmacogenomics studies how a person's inherited genes interact with medicines and affect that person's response to them.
- Because of genetic differences, a drug that is safe for one person may be harmful or cause side effects in another, and the right dose can differ between people; pharmacogenomic testing helps the doctor choose the safest and most effective drug and dose.
- Pharmacogenomics is constantly changing as researchers identify more genetic variations that affect how a drug works.
How Pharmacogenomics Differs from Genetic Testing
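- Standard genetic testing searches for specific genes, such as BRCA1 and BRCA2, to guide preventive or risk-reduction steps like more frequent cancer screening, lifestyle changes, and preventive treatment, whereas pharmacogenomic testing looks for genetic variations that affect how a person responds to a drug.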
Benefits and Challenges of Pharmacogenomics
- Benefits: it may improve patient safety by preventing severe drug reactions, which account for 120,000 hospitalizations each year, and it may improve health care costs and efficiency by helping doctors prescribe the proper medications.
- Challenges: testing can be expensive, and access to certain tests is limited if insurance does not cover the cost; privacy issues also remain, even with federal anti-discrimination laws.
Phylogenetic Analysis Steps
- Building a phylogenetic tree requires four steps: identifying and acquiring a set of homologous DNA or protein sequences, aligning those sequences, estimating a tree from the aligned sequences, and presenting the tree in a way that effectively conveys the relevant information.
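Between alignment and tree estimation, distance-based tree-building methods need a table of pairwise distances. The sketch below uses the p-distance (the proportion of compared sites that differ) as one simple choice; the notes do not prescribe a particular measure, and the aligned sequences and function names are invented for illustration.

```python
def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differences = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            differences += 1
    return differences / compared if compared else 0.0


def distance_matrix(aligned: dict[str, str]) -> dict[tuple[str, str], float]:
    """Return the p-distance for every unordered pair of sequences."""
    names = list(aligned)
    return {
        (m, n): p_distance(aligned[m], aligned[n])
        for i, m in enumerate(names)
        for n in names[i + 1:]
    }


# Hypothetical alignment of three homologous sequences.
aligned = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACG-TCGA"}
for pair, d in distance_matrix(aligned).items():
    print(pair, round(d, 3))
```

Here A and B differ at one of eight sites (0.125), so a distance-based method would tend to group them before joining C.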
Phylogenetic Trees
- A phylogenetic tree is a diagram that represents evolutionary relationships among organisms, but is only a hypothesis.
- The branching pattern of the tree reflects how lineages evolved from a series of common ancestors.
- In phylogenetic trees, two species are more closely related if they have a more recent common ancestor and less closely related if their common ancestor is less recent.
- Phylogenetic trees can be drawn in various equivalent styles without affecting the information they convey.
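The rule that relatedness follows the most recent common ancestor can be made concrete with a toy tree stored as a child-to-parent map. The tree, the node names, and the helper functions below are illustrative assumptions; the sketch also assumes both query nodes are tips of the tree.

```python
# Toy rooted tree: each node maps to its parent (the root has no entry).
PARENT = {
    "human": "primate_ancestor",
    "chimp": "primate_ancestor",
    "primate_ancestor": "mammal_ancestor",
    "mouse": "mammal_ancestor",
    "mammal_ancestor": "root",
    "chicken": "root",
}


def ancestors(node: str, parent: dict[str, str]) -> list[str]:
    """Return the ancestors of `node`, nearest first, ending at the root."""
    path = []
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def mrca(a: str, b: str, parent: dict[str, str]):
    """Most recent common ancestor of two tips, or None if they share none."""
    seen = set(ancestors(a, parent))
    for anc in ancestors(b, parent):
        if anc in seen:
            return anc
    return None


def depth(node: str, parent: dict[str, str]) -> int:
    """Number of branching steps between `node` and the root."""
    return len(ancestors(node, parent))


print(mrca("human", "chimp", PARENT))  # primate_ancestor
print(mrca("human", "mouse", PARENT))  # mammal_ancestor
```

In this toy tree the human-chimp ancestor lies deeper (farther from the root) than the human-mouse ancestor, so it is more recent and human and chimp are the more closely related pair.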
Systems of Classification
- Classification is based primarily on organisms' phylogeny, that is, their evolutionary relationships.
- All modern systems base their classification on the evolutionary relationships among organisms.
- Systems organize species or other groups in ways that reflect lineage from common ancestors
- Species or groups of interest are at the tips of lines referred to as the tree's branches.
Sequence Alignment
- Alignments compare related DNA or protein sequences to capture facts about evolutionary descent or structural function, on the premise that the aligned sequences derive from a common ancestral sequence.
- DNA molecules contain nucleotides, while protein molecules contain amino acids.
- The specific order of nucleotides or amino acids is called the DNA or protein sequence.
- DNA sequences encode protein sequences, and proteins are involved in most biological functions of living cells.
Pairwise Alignment Definitions
- An alphabet is a finite set of letters.
- A sequence is a finite string of letters chosen from the alphabet.
- A null character, represented by the symbol "-", signifies an absent letter.
- An expanded sequence S' is the sequence S with null characters placed at its start, end, or between any two of its characters.
- A global pairwise alignment of sequences S and T is a one-to-one co-linear correspondence of expanded sequences S' and T'.
- A local pairwise alignment of sequences S and T is a one-to-one co-linear correspondence of segments of expanded sequences with no nulls.
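As a minimal sketch of these definitions, the Python below checks whether two strings form a global pairwise alignment of S and T. The function names and the example sequences are hypothetical, the alphabet is assumed not to contain "-", and the rule that no column may pair two nulls is a common convention assumed here rather than something stated in the definitions above.

```python
NULL = "-"  # the null character; assumed not to be a letter of the alphabet


def is_expansion(expanded, original):
    """True if `expanded` is `original` with zero or more nulls inserted."""
    return expanded.replace(NULL, "") == original


def is_global_alignment(s_exp, t_exp, s, t):
    """Check that (s_exp, t_exp) is a global pairwise alignment of s and t."""
    return (
        # One-to-one correspondence: both expanded sequences have equal length.
        len(s_exp) == len(t_exp)
        # Co-linear: removing nulls gives back each original sequence in order.
        and is_expansion(s_exp, s)
        and is_expansion(t_exp, t)
        # Assumption: a column pairing two nulls is not allowed.
        and all(not (a == NULL and b == NULL) for a, b in zip(s_exp, t_exp))
    )


valid = is_global_alignment("ACGT", "A-GT", "ACGT", "AGT")
wrong_length = is_global_alignment("ACGT", "AGT", "ACGT", "AGT")
```

Here `"ACGT"` over `"A-GT"` is a valid global alignment of `ACGT` and `AGT`, while pairing `"ACGT"` with the unexpanded `"AGT"` fails because the two strings cannot be put in one-to-one correspondence.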
Pairwise Alignment Scores
- Selecting among the possible alignments of two sequences requires a scoring function to assess each alignment.
- Optimal alignments are sought by assigning a score to aligning each pair of letters, or a letter with a null, and summing these scores over the alignment.
- A column of a pairwise alignment is the correspondence of a single letter (or null) with a single letter (or null).
- A substitution is a column aligning two letters.
- A substitution score is the score defined for the substitution involving a particular pair of letters.
- An indel is a column aligning a letter with a null.
- An indel score is the score defined for aligning a letter with a null. A gap of length k is composed of k adjacent indels.
- The alignment score is the sum of the substitution and indel scores for an alignment's columns.
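The scoring scheme above can be sketched in a few lines of Python. The function name `alignment_score` and the specific values (match +1, mismatch -1, indel -2) are illustrative assumptions; real analyses typically take substitution scores from a matrix and may score gaps differently.

```python
NULL = "-"


def alignment_score(s_exp, t_exp, match=1, mismatch=-1, indel=-2):
    """Sum column scores of a pairwise alignment given as two expanded sequences.

    The default scores are illustrative assumptions, not standard values.
    """
    if len(s_exp) != len(t_exp):
        raise ValueError("expanded sequences must have the same length")
    score = 0
    for a, b in zip(s_exp, t_exp):
        if a == NULL and b == NULL:
            raise ValueError("a column may not align two nulls")
        if a == NULL or b == NULL:
            score += indel       # indel column: a letter aligned with a null
        elif a == b:
            score += match       # substitution column with identical letters
        else:
            score += mismatch    # substitution column with different letters
    return score


one_indel = alignment_score("ACGT", "A-GT")        # 1 - 2 + 1 + 1
gap_of_two = alignment_score("GATTACA", "GA--ACA")  # five matches, gap of length 2
one_mismatch = alignment_score("ACGT", "AGGT")      # three matches, one mismatch
```

With these assumed scores, the gap of length 2 in the second example contributes two indel scores, consistent with a gap of length k being k adjacent indels.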
FASTA and BLAST
- The sheer size of sequence databases makes fast methods for finding significant local alignments important.
FASTA and BLAST Software
- BLAST and FASTA are two similar programs that search for homologous sequences by detecting excess sequence similarity.
- They provide facilities for comparing DNA and protein sequences against sequence databases.
- BLAST and FASTA are fast because they both perform pairwise sequence alignment using words.
- They work by finding words: short stretches of identical letters shared by two sequences.
- The basic assumption is that related sequences have at least one word in common.
- A word match seeds a region of similarity, which is extended outward from the word and can be joined with neighboring regions into a high-scoring full alignment.
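The word-based idea above can be sketched as follows: index the words of one sequence, look up the words of the other, and extend each exact match outward. This is a simplified illustration under stated assumptions, not the actual BLAST or FASTA implementation: the function names and example sequences are hypothetical, and extension here stops at the first mismatch, whereas real tools extend using scores and thresholds.

```python
from collections import defaultdict


def word_positions(seq, k):
    """Map each word of length k to the positions where it starts in seq."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def shared_words(query, target, k):
    """Return (query_pos, target_pos) pairs where both sequences share a word."""
    index = word_positions(query, k)
    hits = []
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], []):
            hits.append((i, j))
    return hits


def extend_ungapped(query, target, i, j, k):
    """Extend an exact word match in both directions while letters are identical.

    Simplification: stops at the first mismatch and never introduces gaps.
    """
    start_q, start_t = i, j
    while start_q > 0 and start_t > 0 and query[start_q - 1] == target[start_t - 1]:
        start_q -= 1
        start_t -= 1
    end_q, end_t = i + k, j + k
    while end_q < len(query) and end_t < len(target) and query[end_q] == target[end_t]:
        end_q += 1
        end_t += 1
    return query[start_q:end_q]


query = "TTGACCGTAA"   # made-up example sequences
target = "CCGACCGTGG"
hits = shared_words(query, target, 4)
segment = extend_ungapped(query, target, *hits[0], 4)
```

For these made-up sequences, three overlapping 4-letter words are shared, and extending the first one recovers the full identical stretch `GACCGT`.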
Differences in Finding Sequence Alignments
- BLAST is often used to find ungapped, locally optimal alignments.
- FASTA is often used to find similarities between less similar sequences.
BLAST
- BLAST (Basic Local Alignment Search Tool) was developed by Stephen Altschul and colleagues at NCBI in 1990 and is a popular sequence analysis resource.
- BLAST uses heuristics to align a query sequence with all sequences in a database to find high-scoring ungapped segments among related sequences.
- The existence of segments above a threshold indicates pairwise similarity.
- It is used to quickly identify regions of local similarity between two sequences.
- It calculates an expectation value, which estimates the number of matches expected by chance.
- Various forms of BLAST include:
- BLAST-N (nucleotide query against nucleotide sequences)
- BLAST-P (protein query against protein sequences)
- BLAST-X (translated nucleotide query against protein sequences)
- tBLAST-N (protein query against a translated nucleotide database)
- tBLAST-X (translated nucleotide query against a translated nucleotide database, covering all six reading frames)
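The translated forms of BLAST depend on reading a nucleotide sequence in every possible frame: three offsets on the given strand and three on its reverse complement. The Python below is a minimal sketch of that six-frame translation using the standard genetic code; the function names and the example sequence are hypothetical, and this is not how BLAST itself is implemented.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate codon by codon; leftover bases are dropped, and codons
    containing anything other than A, C, G, T become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return translations of frames +1, +2, +3, then -1, -2, -3."""
    dna = dna.upper()
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames


frames = six_frame_translations("ATGGCCTAA")  # made-up example sequence
```

In the output, `*` marks a stop codon. Only one of the six frames usually corresponds to a real coding region, which is why translated searches compare all of them.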
FASTA Definition
- FASTA is a sequence alignment tool used to search for similarities between DNA or protein sequences, using a “hashing” strategy to find runs of k identical residues.
- These strings of k residues, known as k-tuples or ktups, identify regions shared by the query and target sequences, which then become candidates for a fuller alignment.
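A minimal Python sketch of the hashing strategy: store the positions of every k-tuple of the query in a lookup table, scan the target, and count matches by diagonal (target position minus query position). Many matches on one diagonal suggest a shared ungapped region. The function names and sequences are hypothetical, and the later rescoring and joining stages of the real FASTA program are not shown.

```python
from collections import Counter, defaultdict


def ktup_table(seq, k):
    """Hash table from each k-tuple to the positions where it starts in seq."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def diagonal_hits(query, target, k):
    """Count identical k-tuples on each diagonal (target pos minus query pos)."""
    table = ktup_table(query, k)
    counts = Counter()
    for j in range(len(target) - k + 1):
        for i in table.get(target[j:j + k], ()):
            counts[j - i] += 1
    return counts


counts = diagonal_hits("ACGTAC", "TTACGTACGG", 2)  # made-up example sequences
best_diagonal, best_count = counts.most_common(1)[0]
```

In this example the best diagonal is offset 2, meaning the query lines up with the target starting two positions in, where `ACGTAC` occurs exactly. A smaller k finds more matches but is slower and noisier; a larger k is faster but can miss weaker similarities.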