Activity 3: Consensus Sequence & Genetic Patterns PDF

# ACTIVITY 3

## Consensus Sequence and Genetic Patterns Seen through Pairwise Sequence Alignment

### Introduction

Pairwise sequence alignment (PSA) is a method of comparing two sequences, whether nucleic acid or protein, to determine how closely they are related. It is primarily used to gain insights into the evolutionary relationships, structural similarities, and functional conservation of sequences believed to be homologous, that is, those that share a common evolutionary history. Paralogous and orthologous sequences are the general types of homologous sequences commonly studied using PSA. While there is no unit of measure for homology, relatedness is often expressed in terms of similarity, identity, and other statistics based on established scoring matrices.

Apart from determining evolutionary relatedness, PSA can be used to analyze sequence data in several other ways. For instance, the potential novelty of a discovered gene or protein sequence can be assessed by comparing it against reference data. Conserved domains or motifs can be identified, and variants or mutations in sequences can be detected. Several phylogenetic tools and algorithms also rely on PSA as a first step in building tree models.

Lastly, PSA can be used to generate a consensus sequence for quality control. For routine bidirectional sequencing, researchers are usually given forward and reverse reads of the same target region. The final sequence can be resolved by using PSA to extract the regions with the highest quality scores. The consensus of both reads is combined into one file, which is used in downstream analysis, i.e., the subsequent activities in this manual.

Beyond consensus building, this activity will explore the genomes of different organisms, comparing them against other genomes or against themselves, to observe similarities, differences, genetic rearrangements, and patterns.
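The idea of resolving a consensus from forward and reverse reads can be illustrated with a minimal Python sketch. It is not SeqTrace's Bayesian consensus algorithm; it is a simplified rule that assumes the two reads are already aligned position by position (the reverse read has been reverse-complemented and its quality scores reversed) and keeps, at each position, the unambiguous base with the higher PHRED score if that score is at least 20. The function names and the toy reads are hypothetical.

```python
from typing import List

# Complement table for unambiguous bases and N only (an assumption of this
# sketch); other IUPAC codes such as M or Y are left unchanged by translate().
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement so a reverse read matches the forward strand."""
    return seq.translate(COMPLEMENT)[::-1]


def simple_consensus(fwd: str, fwd_q: List[int],
                     rev: str, rev_q: List[int],
                     min_q: int = 20) -> str:
    """Build a consensus from two pre-aligned reads of equal length.

    At each position, keep the A/C/G/T call with the highest PHRED score
    among the calls scoring at least min_q. If neither read qualifies,
    emit N. On a tie, the forward read is preferred.
    """
    if not (len(fwd) == len(fwd_q) == len(rev) == len(rev_q)):
        raise ValueError("reads and quality lists must have the same length")

    consensus = []
    for f_base, f_q, r_base, r_q in zip(fwd, fwd_q, rev, rev_q):
        candidates = [
            (q, base)
            for base, q in ((f_base, f_q), (r_base, r_q))
            if base in "ACGT" and q >= min_q
        ]
        if candidates:
            # max() returns the first maximum, so ties favor the forward read.
            consensus.append(max(candidates, key=lambda c: c[0])[1])
        else:
            consensus.append("N")
    return "".join(consensus)


if __name__ == "__main__":
    # Toy forward read and its PHRED scores (hypothetical values).
    forward = "ACGTNA"
    forward_q = [30, 35, 10, 40, 0, 25]

    # Toy reverse read as sequenced; orient it to the forward strand.
    reverse_raw = "TCACGT"
    reverse_raw_q = [20, 38, 12, 12, 15, 33]
    reverse = reverse_complement(reverse_raw)
    reverse_q = reverse_raw_q[::-1]

    print(simple_consensus(forward, forward_q, reverse, reverse_q))
```

In this toy case the third position remains N because both reads score below 20 there, while the N in the forward read at the fifth position is rescued by a high-quality call in the reverse read. A real workflow would first align the two reads, which this sketch deliberately skips.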
### Objectives

At the end of the activity, the students should be able to:

- Create files of clean consensus DNA from bidirectionally sequenced files.
- Retrieve sequences from the genome databases for PSA.
- Align whole genome sequences of representative genomes or chromosomes using the re-dot-able tool.
- Interpret and export the output of the whole genome or chromosomal PSA using different parameters.

### Data and Tools

To access and download the datasets we will use, kindly visit the dedicated online drive for the class. The SeqTrace 0.9.0 installer is also located in the shared drive. Download the installer and follow the installation wizard's instructions until the tool is completely installed and operational.

This activity also requires a stable internet connection to access the genome and/or chromosomal datasets for the dot plot activity at the NCBI database. The re-dot-able application for Windows is available on the dedicated drive. Download the folder and open the software through the EXE file. You may also download the latest version at https://www.bioinformatics.babraham.ac.uk/projects/download.html#redotable.

### Procedure

#### Generating Consensus Sequence

1. The data is a bidirectionally sequenced barcode for unknown bacterial samples isolated and collected from various projects. The 16S rRNA gene was PCR amplified using the 27F/1492R primer pair, which was also subsequently used during the capillary sequencing. The following sequences will be used for this exercise:

   | Sample Code | Size (27F/1492R) | Source |
   |---|---|---|
   | D-D1 | 1149/1198 | Bacteria (Project MinA) |
   | D-D6 | 1322/1329 | Bacteria (Project MinA) |
   | D-E7 | 1217/1263 | Bacteria (Project MinA) |
   | D-010 | 1315/1296 | Bacteria (Project MinA) |
   | G1-3 | 1419/1442 | Bacteria (Project C3PO) |
   | G3-3 | 1436/1436 | Bacteria (Project C3PO) |
   | G4-2D | 1444/1320 | Bacteria (Project C3PO) |
   | G6-3 | 1454/1504 | Bacteria (Project C3PO) |

2. Open the SeqTrace 0.9.0 application. Start by clicking on the black document icon to "Create a new project."
3. When the pop-up menu opens, locate the folder where you saved the sequenced samples, then click "OK". Change the search strings of the trace files to "_27F" and "_1492R" to indicate which is the forward or reverse sequence, respectively. In the "Primer sequences" portion, copy and paste the following primers used during the PCR and sequencing:
   * AGAGTTTGATCMTGGCTCAG as the forward primer, and
   * TACGGYTACCTTGTTACGACTT for the reverse primer.

   DO NOT CLICK OK YET!
4. After setting up the needed inputs for the first tab, click the "Sequence processing" tab next to the current "Trace setting" tab. Lower the value of the "Min. confidence score" to 20 and keep the default consensus algorithm (Bayesian). Likewise, keep the other sequence trimming parameters at their default settings, then click "OK."
5. You can now start adding trace files from the folder you selected for your newly created project. Do this by clicking on the "+" icon. Select all sequences from the folder, then click "Add." Once they finish loading, the trace files will be shown in the main interface, each tagged as either the forward or reverse sequence.
6. Select the first pair of sequences from the list of samples. Once highlighted, click on the "Traces" tab, then select the option "Group selected forward/reverse files."
7. You will then be asked what to name these grouped files. Simply type in the sample code, as both files are sequences of the same specimen. Then click "OK." Do the same for the other sequences.
8. Once done with the grouping, highlight the first sample you want to process, then click on the "View selected trace file(s)..." option. This will open another pop-up screen similar to the Chromas view, but SeqTrace will now allow you to view the files after the pairwise alignment.
9. The new screen will show you different parts as labeled:
   * (a) shows where the general settings are.
   * (b) shows the chromatogram of the reverse sequence.
   * (c) shows the chromatogram of the forward sequence.
   * (b1 and c1) show the PHRED score values of each base call.
   * (d) shows the actual sequence of either the forward or reverse read.
   * (e) shows the consensus of these base calls, which is editable.
10. Look at the portion labeled "e" and locate the first consensus base called. Click on portion "d," and the chromatogram will refresh and direct you to the part (highlighted in yellow) where the base is located.
11. To start the cleaning process, we should first set the criteria for bases to be considered in the final PSA consensus. In this case, we are looking for base calls with a PHRED score of 20 or higher that are present in at least one of the reads (either forward or reverse). The final sequence should be composed only of A, T, C, or G, and cryptic bases (N) should be edited (if the PHRED score and chromatogram peak are good) or removed.
12. If cryptic bases are noted, click on their corresponding base to determine their score and peak. If the PHRED score is below 20 and/or the peaks are messy and overlapping, then we will remove the base as shown below. To remove the base, simply highlight it in the "e" part, then click on the trash icon in the "a" section. Alternatively, you can right-click on the base and then select "Delete selected base(s)."
13. If the cryptic base passes the criteria we set, as shown in the following figure, then we edit the base corresponding to the call in section "e." To edit the base, highlight it in the "e" section, then click on the paper and pencil icon in the "a" section. Alternatively, you can right-click on the base, then select "Modify selected base(s)."
14. When the forward and reverse sequences call contradicting bases, choose only one, following the criteria on PHRED score and chromatogram peak quality.
You may save the changes in the consensus by clicking on the floppy disk icon to save the working sequence (especially when work is in progress).
15. Repeat the process until you reach the end of the consensus sequence at part "e." Once cleaning and editing are done, browse through the chromatogram again to double-check. You can then get the consensus sequence by clicking on the "File" tab and selecting the "Export working sequence" option.
16. In the pop-up menu, type in the name of the sequence file. By default, files will be exported as "FASTA" files. Then click on "Save." Check the file by opening it in your NOTEPAD application.
17. Provide the information being asked in the worksheet.

#### Bacterial Whole Genome PSA

1. Retrieve the whole genome sequences of the following microbes from the NCBI Genome database using the link: https://www.ncbi.nlm.nih.gov/datasets/genome/

   | Organism name | RefSeq Accession | Assembly |
   |---|---|---|
   | Escherichia coli O157:H7 str. Sakai | GCF_000008865.2 | ASM886v2 |
   | Escherichia coli str. K-12 substr. MG1655 | GCF_000005845.2 | ASM584v2 |
   | Salmonella enterica subsp. enterica serovar Typhimurium str. LT2 | GCF_000006945.2 | ASM694v2 |
   | Shigella dysenteriae str. SWHEFF_49 | GCF_022354085.1 | ASM2235408v1 |
   | Bacillus subtilis subsp. subtilis str. 168 | GCF_000009045.1 | ASM904v1 |

2. Type in the name of the organism in the "Selected Taxa" search bar and select the suggested taxon closest to the name. Options for additional filters will be shown in a tabular format containing the following details: Assembly, GenBank Accession, RefSeq Accession, Scientific Name, etc. Notice that the reference genome for the given taxon is marked by a green check near the Assembly accession. Click on the assembly corresponding to the details in the table above.
3. Once you open the page for the Assembly accession, a "Download" button will be visible.
4. Click "Download," and in the pop-up menu change the selected option for "File Source" to "RefSeq Only (1)." Rename the file as you see fit, then click "Download."
5. Do the same for the other bacterial genome assemblies. Each download is compressed into a ZIP file, so extract it first. The assembly is in the innermost folder, in a file named according to its RefSeq accession and Assembly number with the ".fna" extension. You may rename the file for your convenience as well.
6. Now, open the re-dot-able application as shown below. (Reminder: Do not close the black-screened terminal, as this is part of the application.)
7. Using the "File" tab of the application, open the assembly of "Escherichia coli O157:H7 str. Sakai" in the "X" axis.
8. Once the menu opens, locate the file as saved on your device. Although the ".fna" file is FASTA-formatted, its extension may not be listed by default, so change the file type displayed to ALL FILES, then click OPEN.
9. For our first PSA, open "Escherichia coli str. K-12 substr. MG1655" in the "Y" axis. Both the X and Y axes should display the name and the scale of their respective sequence.
10. Under the "File" tab, select "Start Aligning" and allow the software to compute the alignment. The results will be displayed graphically. Save the result using the "File" tab, then select "Save Dotplot". Name the PNG file "E coli vs E coli PSA."
11. Without changing the X-axis, run the remaining PSAs by replacing the Y-axis sequence with each of the other assemblies you downloaded. You should end up with a set of different PSAs, i.e., "E coli vs S enterica PSA," "E coli vs S dysenteriae PSA," and "E coli vs B subtilis PSA."
12. You may explore the tool's output using the other features of re-dot-able. For example, you may change the window size of your alignment using the slider at the right part of the tool. You may also "Zoom in" on specific regions of the alignment by boxing in that portion of the screen using your cursor.
13. Answer the worksheet using the saved images of your PSA.

#### Self-PSA of Bacterial or Chromosomal Genomes

1. PSA is not only useful for aligning a query sequence to a different target sequence. It is also useful for aligning the query sequence to itself. Self-PSA can help us discover regions of repeats, possible motifs repeated in the genome, and palindromic sequences.
2. To start, open "Escherichia coli O157:H7 str. Sakai" in both the "X" and "Y" axes. Save the alignment image per the previously described steps as "Self Escherichia coli O157 H7".
3. Do the same for the other bacterial genomes.
4. We will also use eukaryotic sequences (yeast and human) and see what kinds of patterns can be observed from them. Since eukaryotic genomes are usually large, we will limit ourselves to their smallest chromosomes.
5. To retrieve the FASTA sequence of these chromosomes, visit the Genome Data Viewer of NCBI at https://www.ncbi.nlm.nih.gov/genome/gdv/. Browse the genome of the baker's yeast (Saccharomyces cerevisiae), then select chromosome number 1. Zoom out on the chromosome until the magnification is 0%.
6. A "Download" option can be seen on the right side of the screen. Click the drop-down menu, hover your cursor over the "Download FASTA" option, then click on "FASTA (Visible Range)" as shown below. Rename the downloaded file to "Yeast Chromosome 1".
7. Do the same to download human chromosome number 21.
8. Perform self-PSA on "Yeast Chromosome 1" following the same procedure as above. For "Human Chromosome 21", however, change the window size from the default of 50 to 100 before you "Start Aligning." Do this by clicking the "View" tab and then the "Preferences..." option. A pop-up menu will appear where you can enter a window size of 100 and then click "Save Preferences." Expect the run for "Human Chromosome 21" to take longer.
9. Save the images as per the previous instructions.
Before closing all self-PSA runs, check the information required in the worksheet to guide you on what to do with the results displayed in the re-dot-able interface.

### Acknowledgment

Special thanks to Bicol University for funding the research (Project MinA and Project C3PO) from which all the raw data were generated.

### Data Usage and Sharing Restrictions

All data used as inputs or generated from the laboratory activities is intended solely for academic purposes. Students are prohibited from uploading, sharing, or reusing the data outside of the course without explicit permission from the instructor.

### References

- Andrews, S. (2018). re-dot-able Interactive Dot Plot Tool (v1.2) [Software]. Babraham Institute, Babraham Bioinformatics. Available from: https://www.bioinformatics.babraham.ac.uk/projects/redotable/
- National Center for Biotechnology Information (NCBI) [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; [1988]. Available from: https://www.ncbi.nlm.nih.gov/ [cited 2023 Oct 6]
- Pevsner, J. (2015). Bioinformatics and Functional Genomics (3rd ed.). John Wiley & Sons Inc.
- Stucky, B. J. (2012). SeqTrace: a graphical tool for rapidly processing DNA sequencing chromatograms. Journal of Biomolecular Techniques, 23(3), 90-93. https://doi.org/10.7171/jbt.12-2303-004
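As a supplementary note to the dot plot exercises in this activity, the following Python sketch shows one simple way a dot plot can be computed: a dot is placed at (i, j) whenever the window of bases starting at position i of the X sequence exactly matches the window starting at position j of the Y sequence. This is an assumed, simplified model for illustration only, not re-dot-able's actual implementation; it ignores reverse-complement matches, and the function names and toy sequence are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def dot_plot(x: str, y: str, window: int = 4) -> Set[Tuple[int, int]]:
    """Return all (i, j) where x[i:i+window] exactly equals y[j:j+window]."""
    if window < 1:
        raise ValueError("window must be at least 1")

    # Index every window of y by its sequence so lookups are fast.
    index: Dict[str, List[int]] = defaultdict(list)
    for j in range(len(y) - window + 1):
        index[y[j:j + window]].append(j)

    hits: Set[Tuple[int, int]] = set()
    for i in range(len(x) - window + 1):
        for j in index.get(x[i:i + window], ()):
            hits.add((i, j))
    return hits


def render(hits: Set[Tuple[int, int]], x_len: int, y_len: int, window: int) -> str:
    """Draw the hits as text: columns are X positions, rows are Y positions."""
    rows = []
    for j in range(y_len - window + 1):
        rows.append("".join(
            "*" if (i, j) in hits else "."
            for i in range(x_len - window + 1)
        ))
    return "\n".join(rows)


if __name__ == "__main__":
    # Toy sequence containing a direct repeat of "ACGT" (hypothetical data).
    seq = "ACGTACGT"
    hits = dot_plot(seq, seq, window=4)
    print(render(hits, len(seq), len(seq), window=4))
```

In a self-comparison like this one, the main diagonal is always filled because every window matches itself, and the two off-diagonal dots at (0, 4) and (4, 0) mark the repeated "ACGT". A larger window removes short chance matches, which is the same trade-off explored with the window size setting for Human Chromosome 21.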
