FASTA Algorithm for Sequence Alignment PDF

Summary

This document describes the FASTA algorithm for database similarity searches of biological sequences. It details the four steps of the algorithm. The document compares FASTA with BLAST, emphasizing the differences in their approaches to sequence alignment.

Full Transcript

Window size and threshold like dot matrix For determining HSP Extension in right and left direction will Continue unless and until the threshold Value drops below 22 due to mismatches a Maximal Segment Pair (MSP) is an ungapped local alignment whose score cannot be i...

Window size and threshold like dot matrix For determining HSP Extension in right and left direction will Continue unless and until the threshold Value drops below 22 due to mismatches a Maximal Segment Pair (MSP) is an ungapped local alignment whose score cannot be im- proved by extending or shortening the alignment; a High scoring Segment Pair (HSP) is a maximal segment pair with score S≥ST , where ST is a similarity score threshold (typically user defined). Developed by William Pearson FASTA is one of the first widely-used database similarity search tools. FASTA (or FastA), an abbreviation for ‘Fast-All’, is a sequence alignment tool that takes nucleotide or protein sequences as input and compares it with existing databases. It was first developed by David J. Lipman and William R. Pearson in 1985 and has since been refined and adapted for various applications. FASTA Programs FASTA was originally developed for comparing protein sequences. The original program was referred to as FASTP. It quickly became a popular tool for sequence alignment and database searching. The program has been continually updated and improved. There are now different FASTA programs available, each used for different types of sequence searches: FASTA compares a DNA query sequence against a database of DNA sequences or a protein query sequence against a database of protein sequences using the FASTA algorithm. SSEARCH performs protein-protein or DNA-DNA comparisons using the Smith-Waterman algorithm. GGSEARCH/ GLSEARCH works using a global alignment algorithm (GGSEARCH) or a combination of global and local alignment algorithms (GLSEARCH) to compare protein and nucleotide sequences. FASTX/ FASTY compares a DNA sequence and a database of protein sequences by translating the DNA sequence into three frames and allowing gaps and frameshifts. TFASTX/ TFASTY compares a protein sequence and a database of DNA sequences. The DNA sequence is translated in six frames – three in the forward direction and three in the reverse direction. FASTF/ TFASTF compares mixed peptide sequences against a protein (FASTF) or translated DNA (TFASTF) databases. FASTS/ TFASTS compares a set of short peptide fragments against the protein (FASTS) or translated DNA (TFASTS) databases. How FASTA Works FASTA works by comparing a query sequence to a database of sequences to identify similar matches. The program uses a heuristic algorithm to quickly search the database and identify the most significant matches. The working mechanism of FASTA is described in the following steps: Step 1: Identifying Regions The first step is identifying regions with high similarity by creating a lookup table for the query sequence. This step is also called hashing step. To create the lookup table, the query sequence is first broken down into smaller words known as k-tuples (ktup). When the ktup value is increased, the number of background word hits is reduced. By reducing the number of these background word hits, the algorithm can focus on the more relevant hits, enhancing the overall search speed. k-tuple is usually 2 for proteins and 6 for nucleotide sequences. Once the lookup table is created, it is used to identify matches between the k-tuples in the query sequence and the database sequences. Similar regions are represented as diagonals in a two-dimensional matrix. The ten regions with the highest density of word matches are the high-similarity regions, and these best ten diagonals are saved. Step 2: Re-Scoring In the second step, the ten best diagonals are rescored using suitable scoring matrices. For protein, BLOSUM50 or PAM matrix is used; for DNA sequences, the identity matrix is used. A subregion with the highest score is identified for each of the rescanned diagonal regions. These high-scoring subregions within the diagonals are called initial regions. Step 3: Joining Threshold Next, a score cutoff or the joining threshold is applied that excludes segments unlikely to be part of the final alignment. The library sequences are ranked based on their initial scores. The regions with initial scores above the pre-set threshold are selected and checked to see if they can be joined together. This step introduces gaps between the diagonals while applying gap penalties. The score of the gapped alignment is calculated by subtracting a penalty for each gap, which is used to rank the database sequences by similarity. Step 4: Final Alignment Finally, the gapped alignment is refined to produce the final alignment. This is done by using the banded Smith-Waterman algorithm, which is a dynamic programming algorithm that calculates the optimal score (opt) for alignment. This score is used for statistical calculations. The main difference between BLAST and FASTA is that BLAST is mostly involved in finding of ungapped, locally optimal sequence alignments whereas FASTA is involved in finding similarities between less similar sequences.

Use Quizgecko on...
Browser
Browser