FASTA Algorithm PDF
Document Details
Tags
Summary
This document details the FASTA algorithm, a method for comparing DNA and protein sequences to identify regions of similarity and homology. The algorithm involves several steps including identifying regions with high similarity, rescoring diagonals, applying joining thresholds, and producing a final alignment. The algorithm uses a hashing strategy, scoring matrices (like BLOSUM50 or PAM), and dynamic programming for optimal alignment.
Full Transcript
FASTA Homology vs Similarity When comparing two sequences or structures, the terms of similarity or homology are often used interchangeably to indicate that there is a "close" relationship between the comparison objects. However, the two terms refer to different aspects of the compa...
FASTA Homology vs Similarity When comparing two sequences or structures, the terms of similarity or homology are often used interchangeably to indicate that there is a "close" relationship between the comparison objects. However, the two terms refer to different aspects of the comparison: The homology is a qualitative property of the comparison: you say that sequences are homologous if they have a common ancestral gene, or that share an ancestor. So the term homology indicates that the two entities share a common phylogenetic origin, from which they have evolved differentiated from each other. The similarity and a quantitative property of a comparison. The term similarity has a more general meaning indicating a similarity regardless of the reasons that led it. The similarity is often due to homology but it can also be generated by the case or by adaptive convergence phenomena at both morphological and molecular level. FASTA FASTA is a DNA and protein sequence alignment software package first described (as FASTP) by David J. Lipman and William R. Pearson in 1985. FASTA is pronounced "fast A", and stands for "FAST-All", because it works with any alphabet, and it is an extension of "FAST-P" (protein) and "FAST-N" (nucleotide) alignment. The current FASTA package contains programs for protein:protein, DNA:DNA, protein:translated DNA (with frameshifts), and ordered or unordered peptide searches. Recent versions of the FASTA package include special translated search algorithms that correctly handle frameshift errors (which six-frame- translated searches do not handle very well) when comparing nucleotide to protein sequence data. Lipman, DJ; Pearson, WR (1985). "Rapid and sensitive protein similarity searches". Science. 227 (4693): 1435–41. FASTA In addition to rapid heuristic search methods, the FASTA package provides SSEARCH, an implementation of the optimal Smith- Waterman algorithm. A major focus of the package is the calculation of accurate similarity statistics, so that biologists can judge whether an alignment is likely to have occurred by chance, or whether it can be used to infer homology. The FASTA program in its classic version, is a heuristic program able to search the global sequence similarity. The FASTA package is available from fasta.bioch.virginia.edu. FASTA Summarizing FASTA The final score is Opt. Step 1: Identifying Regions The first step is identifying regions with high similarity by creating a lookup table for the query sequence. This step is also called hashing step. To create the lookup table, the query sequence is first broken down into smaller words known as k-tuples (ktup). When the ktup value is increased, the number of background word hits is reduced. By reducing the number of these background word hits, the algorithm can focus on the more relevant hits, enhancing the overall search speed. k-tuple is usually 2 for proteins and 6 for nucleotide sequences. Once the lookup table is created, it is used to identify matches between the k-tuples in the query sequence and the database sequences. Similar regions are represented as diagonals in a two- dimensional matrix. The ten regions with the highest density of word matches are the high-similarity regions, and these best ten diagonals are saved. Step 2: Re-Scoring In the second step, the ten best diagonals are rescored using suitable scoring matrices. For protein, BLOSUM50 or PAM matrix is used; for DNA sequences, the identity matrix is used. A subregion with the highest score is identified for each of the rescanned diagonal regions. These high-scoring subregions within the diagonals are called initial regions. Step 3: Joining Threshold Next, a score cutoff or the joining threshold is applied that excludes segments unlikely to be part of the final alignment. The library sequences are ranked based on their initial scores. The regions with initial scores above the pre-set threshold are selected and checked to see if they can be joined together. This step introduces gaps between the diagonals while applying gap penalties. The score of the gapped alignment is calculated by subtracting a penalty for each gap, which is used to rank the database sequences by similarity. Step 4: Final Alignment Finally, the gapped alignment is refined to produce the final alignment. This is done by using the banded Smith-Waterman algorithm, which is a dynamic programming algorithm that calculates the optimal score (opt) for alignment. This score is used for statistical calculations. FASTA Summarizing FASTA The final score is Opt. FASTA Programs FASTA was originally developed for comparing protein sequences. The original program was referred to as FASTP. It quickly became a popular tool for sequence alignment and database searching. The program has been continually updated and improved. There are now different FASTA programs available, each used for different types of sequence searches: FASTA compares a DNA query sequence against a database of DNA sequences or a protein query sequence against a database of protein sequences using the FASTA algorithm. SSEARCH performs protein-protein or DNA-DNA comparisons using the Smith-Waterman algorithm. GGSEARCH/ GLSEARCH works using a global alignment algorithm (GGSEARCH) or a combination of global and local alignment algorithms (GLSEARCH) to compare protein and nucleotide sequences. FASTX/ FASTY compares a DNA sequence and a database of protein sequences by translating the DNA sequence into three frames and allowing gaps and frameshifts. TFASTX/ TFASTY compares a protein sequence and a database of DNA sequences. The DNA sequence is translated in six frames – three in the forward direction and three in the reverse direction. FASTF/ TFASTF compares mixed peptide sequences against a protein (FASTF) or translated DNA (TFASTF) databases. FASTS/ TFASTS compares a set of short peptide fragments against the protein (FASTS) or translated DNA (TFASTS) databases.