Bioinformatics Notes PDF
Document Details
Uploaded by PermissibleSerpent1570
Taylor's University
Summary
These notes provide an overview of DNA sequencing technologies, data formats used in bioinformatics (FASTA, FASTQ, SAM, BAM, VCF), protein structure levels, and the importance of data standards and metadata in bioinformatics research. They discuss applications and techniques within the field.
Full Transcript
Explain DNA sequencing, and provide some examples.

2 main technologies in DNA sequencing:
- Sanger sequencing: The first widely used sequencing method, based on chain termination. It is low-throughput and best suited to sequencing individual short fragments.
- NGS (next-generation sequencing): Modern high-throughput methods that can sequence millions of fragments simultaneously, generating vast amounts of data for entire genomes.

Examples of what sequence data is used for:
- Genome annotation: Identifying genes, regulatory elements, and structural features of a genome
- Variant calling: Detecting mutations (SNPs, InDels) that might be associated with diseases
- Evolutionary studies: Comparing sequences across species to understand evolutionary relationships

What are the main types of data format?
1. FASTA
2. FASTQ
3. SAM
4. BAM
5. VCF

Explain more on the FASTA format and purpose.
1. Purpose: Primarily used to represent nucleotide or protein sequences
2. Content: Contains sequence information only, without any quality scores
3. Structure:
   a. The first line starts with a > symbol, followed by a description or identifier for the sequence (e.g. sequence name or accession number).
   b. Subsequent lines contain the actual sequence (DNA, RNA, or protein).
4. Advantages: FASTA is widely used for tasks where only the sequence is needed, such as sequence alignment (BLAST), database searches, or gene annotation

Elaborate more on the FASTQ format and purpose.
1. Purpose: Used to represent nucleotide sequences from sequencing technologies, such as NGS, along with quality scores for each nucleotide
2. Content: Contains both the sequence data and quality scores, which are crucial for evaluating the accuracy of sequencing results.
3. Structure:
   a. Line 1: Starts with an @ symbol followed by a sequence identifier
   b. Line 2: The actual sequence
   c. Line 3: Starts with a + symbol
   d. Line 4: Encodes the quality scores, one character per base in line 2
4. Quality score: The quality score is typically a Phred score, which indicates the likelihood of a sequencing error for each base. Higher scores mean higher confidence in the base call.
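The FASTA and FASTQ layouts described above can be sketched as two small readers. This is a minimal illustration, not a replacement for an established parsing library: the names `read_fasta`, `read_fastq`, `FastqRecord`, and `error_probability` are hypothetical, the sample records are made up, and the FASTQ reader assumes unwrapped four-line records with Phred+33 quality encoding (the `offset` parameter). The Phred relationship used is Q = -10 * log10(P), so P = 10^(-Q/10).

```python
import io
from typing import Iterator, List, NamedTuple, TextIO, Tuple


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; sequence lines are joined together."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_fastq(handle: TextIO, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ file, assuming exactly 4 lines per record."""
    lines = [ln.rstrip("\r\n") for ln in handle if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected a multiple of 4 non-empty lines")
    for i in range(0, len(lines), 4):
        header, sequence, plus, quality = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # Assumption: Phred+33, i.e. score = ASCII code minus 33.
        scores = [ord(ch) - offset for ch in quality]
        yield FastqRecord(header[1:], sequence, scores)


def error_probability(phred: int) -> float:
    """Convert a Phred score Q into an error probability P = 10^(-Q/10)."""
    return 10 ** (-phred / 10)


if __name__ == "__main__":
    # Made-up records, for illustration only.
    fasta = io.StringIO(">seq1 example\nACGTAC\nGGTA\n")
    for name, seq in read_fasta(fasta):
        print(name, len(seq))

    fastq = io.StringIO("@read1\nACGT\n+\nII5!\n")
    for rec in read_fastq(fastq):
        print(rec.identifier, rec.qualities)
```

Reading the whole FASTQ file into a list keeps the sketch short; a reader for real NGS data would stream records instead.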
Elaborate more on protein structure and sequencing.

Structure of a protein can be divided into 4 levels:
1. Primary structure: The sequence of amino acids
2. Secondary structure: Local folding patterns such as alpha-helices or beta-sheets
3. Tertiary structure: The overall 3D structure of a single polypeptide
4. Quaternary structure: The arrangement of multiple protein subunits

Protein sequencing techniques:
- Mass spectrometry (MS): Used to determine the mass-to-charge ratio of peptides
- Edman degradation: A chemical method that sequentially removes amino acids from the N-terminus of a peptide

What are some applications of knowing protein sequences?
- Drug discovery: Identifying potential drug targets based on protein structure and function
- Molecular modeling: Predicting how mutations affect protein structure and function
- Proteomics: Large-scale study of protein expression and modifications in different conditions

What is the importance of data standards & metadata in bioinformatics?
- Ensuring data interoperability: sequence data can be analyzed consistently, regardless of the tools being used
- Reproducibility of results: data standards ensure everyone works with data in the same format, which reduces variability
- Metadata: adding context to data
  - Sample source (e.g., species, tissue type)
  - Sequencing method (e.g., Illumina, PacBio)
  - Experimental conditions (e.g., time points, treatment vs control)
Metadata is essential for proper data interpretation.

What are the key differences between FASTA & FASTQ?

Feature | FASTA | FASTQ
Information | Sequence only | Sequence + quality scores
File size | Smaller, as only sequence data is stored | Larger, as it includes quality data
Common usage | Sequence alignment, database searches | Raw data from NGS, sequence quality assessment
Format | > identifier + sequence | @ identifier, sequence, + line, quality scores

Explain more on the SAM format and purpose.
SAM = Sequence Alignment/Map format. It is a TAB-delimited text format consisting of a header section, which is optional, and an alignment section.
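The header/alignment split described above can be sketched as a small reader. This is an illustration under stated assumptions: the function name `parse_sam` and the sample lines are hypothetical, header lines are recognized by their leading `@`, and each alignment line is split into the 11 mandatory SAM columns, with any remaining columns kept as raw optional tags.

```python
import io
from typing import Dict, List, TextIO, Tuple

# The 11 mandatory columns of a SAM alignment line, in order.
SAM_FIELDS = [
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
]
INT_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


def parse_sam(handle: TextIO) -> Tuple[List[str], List[Dict[str, object]]]:
    """Return (header_lines, alignment_records) from SAM text."""
    headers: List[str] = []
    alignments: List[Dict[str, object]] = []
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("@"):
            # Optional header section: lines such as @HD or @SQ.
            headers.append(line)
            continue
        columns = line.split("\t")
        if len(columns) < len(SAM_FIELDS):
            raise ValueError("alignment line has fewer than 11 columns")
        record: Dict[str, object] = dict(zip(SAM_FIELDS, columns))
        for name in INT_FIELDS:
            record[name] = int(columns[SAM_FIELDS.index(name)])
        record["TAGS"] = columns[len(SAM_FIELDS):]  # optional fields, unparsed
        alignments.append(record)
    return headers, alignments


if __name__ == "__main__":
    # Made-up SAM content, for illustration only.
    text = (
        "@HD\tVN:1.6\tSO:coordinate\n"
        "@SQ\tSN:chr1\tLN:1000\n"
        "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0\n"
    )
    header_lines, records = parse_sam(io.StringIO(text))
    print(len(header_lines), records[0]["RNAME"], records[0]["POS"])
```

Because the reader processes one line at a time, it could be adapted to work on a stream; collecting everything into lists is only a convenience for the sketch.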
The SAM format:
- Is flexible enough to store all the alignment information generated by various alignment programs
- Is simple enough to be easily generated by alignment programs or converted from existing alignment formats
- Is compact in file size
- Allows most operations on the alignment to work on a stream without loading the whole alignment into memory
- Allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus

Explain more on the BAM format and purpose.
BAM (Binary Alignment/Map) format is used to store aligned sequence data in a compressed, binary format. It is the compressed binary equivalent of the SAM format.
1. Purpose: Essential for storing alignment data produced during genome mapping or variant discovery processes. Each read from sequencing is mapped to a reference genome, and BAM stores the results of this alignment
2. Structure:
   a. Stores information about how the reads align to the reference genome, along with metadata such as mapping quality, CIGAR strings (which describe how each read aligns, including matches, insertions, and deletions), and flags for the alignment status
3. Advantages:
   a. It is binary and compressed, meaning it uses much less disk space than the equivalent SAM text file
   b. Indexing a BAM file allows fast retrieval of specific regions of the genome, which is crucial for large-scale genomic studies
4. Use case: Commonly used in workflows such as variant calling, where identifying mutations (e.g., SNPs, InDels) depends on accurately aligned reads.

Explain more on the VCF format and purpose.
Variant Call Format (VCF) is used for storing genetic variants, such as SNPs and InDels, relative to a reference genome.
1. Purpose: VCF files are used to represent the output from variant calling algorithms, listing the variations found when comparing sample genomes to a reference
2. Structure:
   a. Each entry in a VCF file represents a variant and includes information such as:
      i. The position of the variant in the genome
      ii. The reference allele and the alternative allele(s)
      iii. Quality scores indicating how confident the variant call is
      iv. Genotype information for each sample
3. Advantages: VCF is essential for genome-wide association studies (GWAS) and personalized medicine, where identifying genetic variants can reveal disease risk factors or suggest targeted therapies
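
The VCF fields listed above correspond to tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, then an optional FORMAT column and one column per sample). A minimal sketch of reading them follows; the function name `parse_vcf`, the output dictionary keys, and the sample data are hypothetical, and the INFO and FILTER columns are ignored to keep the example short.

```python
import io
from typing import Dict, List, TextIO


def parse_vcf(handle: TextIO) -> List[Dict[str, object]]:
    """Return one dictionary per variant line in VCF text."""
    samples: List[str] = []
    variants: List[Dict[str, object]] = []
    for line in handle:
        line = line.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        if line.startswith("#"):
            # Column header line (#CHROM ...): sample names start at column 10.
            samples = line.split("\t")[9:]
            continue
        columns = line.split("\t")
        chrom, pos, _variant_id, ref, alt, qual = columns[:6]
        genotypes: Dict[str, Dict[str, str]] = {}
        if len(columns) > 9:
            keys = columns[8].split(":")  # FORMAT column, e.g. GT:DP
            for name, value in zip(samples, columns[9:]):
                genotypes[name] = dict(zip(keys, value.split(":")))
        variants.append({
            "chrom": chrom,
            "pos": int(pos),
            "ref": ref,
            "alt": alt.split(","),  # several alternative alleles are allowed
            "qual": None if qual == "." else float(qual),
            "genotypes": genotypes,
        })
    return variants


if __name__ == "__main__":
    # Made-up VCF content, for illustration only.
    text = (
        "##fileformat=VCFv4.2\n"
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\n"
        "chr1\t12345\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:30\n"
    )
    for variant in parse_vcf(io.StringIO(text)):
        print(variant["chrom"], variant["pos"], variant["ref"], variant["alt"])
```

Each returned dictionary carries the four pieces of information named in the structure list: position, reference and alternative alleles, quality, and per-sample genotype.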