MBB110 Data Analysis for Molecular Biology & Biochemistry (Spring 2025) Lecture 5

Recap from last week

LECTURE
- Learned about common genomics data formats
- Learned about different tools for handling data
- Explored annotations and the types of information they represent
- Recognised the difference between tall and wide data formats

LAB
- Got hands-on experience with some of the tools we discussed in lecture
- Wrote and ran our very first Python script
- Discovered the concept of whitespace and how computers see it
- Learned how we can chain together commands in "pipes"

Lecture 5: Regular expressions and pattern matching
Find common motifs!

Learning objectives

SHARED LIBRARIES: Know how to import from Python libraries.
BIOPYTHON: Become familiar with some BioPython features for manipulating sequences.
PATTERN MATCHING: Become familiar with some advanced methods for pattern matching using command-line tools.
REGULAR EXPRESSIONS: Understand how to apply regular expressions, specifically in Python.

Importing from packages

- Python has many standard modules that are part of the language and thousands of packages that can be installed as extensions to the base language
- A module is a single file containing functions or class definitions
- A package is a collection of modules that share a namespace
- Read about modules here: https://docs.python.org/3/tutorial/modules.html

Where/how does import find packages?

- Using import in your script causes Python to search a series of locations for matching modules (the module search path)
- This includes the directory that contains the script and the installation directory for the Python you're running
- Users can add to this list by modifying a Bash environment variable (PYTHONPATH)

Working with the filesystem

- You may want your code to process multiple files that are not explicitly defined (or are only defined while your script is running), e.g.
  perform some action individually on every file in a specific directory
- This is a perfect use case for loops, but you first need a list of the files
- The glob module is your friend!

Bash-like glob

Reminder: * acts like a wildcard that will match any run of characters (including none)

    from glob import glob

    file_list = glob("*.py")  # equivalent to this in bash: ls *.py
    print(file_list)
    # this is just a list of all the files saved in
    # the scripts directory that match the pattern "*.py"
    ## ['scripts/mbb_functions.py', 'scripts/fasta_parse_2.py', 'scripts/hello_args.py', 'scripts/fasta_parse.py']

Processing many files in a loop

- Getting a list of all files we might want to work with via glob is half the battle
- Modularized functions for the task we want to accomplish make short work of the rest!

    some_files = glob("some_directory/some_pattern*txt")
    some_results = []
    for a_file in some_files:
        one_result = do_a_thing(a_file)  # call a function we can put elsewhere or in this script
        some_results.append(one_result)  # store the result or directly output it

Using BioPython to load FASTA

- Much simpler to rely on existing parsers when they're available, e.g.
  BioPython
- Requires some familiarity with new objects and converting between object types
- BioPython: https://biopython.org/wiki/SeqRecord

From BioPython objects to strings

- Sometimes it's convenient to just work with strings
- The SeqRecord object has a few attributes we use to get at the components (seq, id)

    from Bio import SeqIO

    fasta_file = "/local-scratch/course_files/MBB110/some_human_genes.fa"
    for sr in SeqIO.parse(fasta_file, "fasta"):
        seq_obj = sr.seq          # get the Seq object
        seq_id = sr.name[0:20]    # get the first 20 characters of the header string
        some_string = str(seq_obj)[0:20]
        print(seq_id, some_string)
    ## ENSG00000001626|ENSG GTAGTAGGTCTTTGGCATTA
    ## ENSG00000241644|ENSG ACATTTCAGGGACACCATGA
    ## ENSG00000050327|ENSG GGGGCGGCCGGGCCTGCGCT

Let's summarize multiple sequences

    file_list = glob("/local-scratch/course_files/MBB110/human_genes_chr7/*0000631*.fa")
    num_files = len(file_list)
    print(num_files)  # a manageable size
    ## 11
    for fa_file in file_list:
        for sr in SeqIO.parse(fa_file, "fasta"):
            so = sr.seq
            type(so)
            print(so)
    ##
    ## ATGGCTGCAGCTCCTCCAAGTTACTGTTTTGTTGCCTTCCCTCCACGTGCTAAGGATGGTCTGGTGGTATTTGGGAAAAATTCA
       GCCCGGCCCAGAGATGAAGTGCAAGAGGTTGTGTATTTCTCGGCTGCTGATCACGAACCGGAGAGCAAGGTTGAGTGCACTTAC
       ATTTCAATCGACCAAGTTCCAAGGACCTATGCCATAATGATAAGCAGACCCGCCTGGCTCTGGGGAGCAGAAATGGGAGCCAAT
       GAACATGGAGTGTGCATAGCCAATGAAGCCATCAACACCAGAGAGCCAGCTGCCGAGATAGAAGCCTTGCTGGGGATGGATCTG
       GTCAGGAACGGGCAGGGTGAGCTTGACATACCTGTGAGAAGGTCATGGGCTCCCAGGGAAAGATTTTCCTATAAAACTAGGAAT
       TGA

Searching for patterns in text

- Searching for exactly one thing in strings is easy(ish)
- Where and how many matches are there?
- Approximate matches that fit some specification?

    sentence = "Wubba Lubba Dub Dub"
    if "Dub" in sentence:
        print("match for Dub")
    else:
        print("mwah mwah mwahhhhh")  # sad trombone
    ## match for Dub
    if "dub" in sentence:  # the "in" test is case-sensitive
        print("match for dub")
    else:
        print("mwah mwah mwahhhhh")  # sad trombone
    ## mwah mwah mwahhhhh

Why are patterns important in biology?
- Identify potential transcription factor binding sites
- Predict off-target or degenerate primer binding sites
- Find oligonucleotide repeats or more complex repeats
- …

Regular expressions (regex)

- Structured way to define a search pattern within text
- The most basic form is a simple substring (literal character matches), e.g. Find in Microsoft Word or a search pattern with grep
- Power comes from meta-characters
- Bash globs use * as a meta-character for "zero or more of anything"
- What if we want to be more specific about placement or character types?

Regular expressions for postal codes

What pattern is found in every Canadian postal code?
How can we summarize this pattern? Note that all letters are uppercase.

Letter-Number-Letter-Space-Number-Letter-Number

    H0H 0H0
    V9N 8R9
    V2X 8Z3
    T5A 9A1
    E5J 4N3
    A1B 0C4

Regular expressions for postal codes

What pattern is found in every Canadian postal code?
How can we tell a computer to match text that fits that pattern?

- Let [A-Z] represent all uppercase letters of the alphabet and [0-9] represent all digits
- Let \s represent a whitespace character
- Every postal code could be described as:

    [A-Z][0-9][A-Z]\s[0-9][A-Z][0-9]

Brackets and braces for classes and repetition

    Enclosure   Use
    {m}         Defines the number of repetitions: the preceding item must match exactly m times
    []          Defines a character class (match any one character in the class)
    $           End of string
    *           Zero or more repetitions
    +           One or more repetitions

Using regular expressions in python

    import re

    codes = ["H0H 0H0", "V9N 8R9", "V2X 8Z3", "T5A 9A1"]
    garbage = ["90210", "CCCCCC", "hoh oho", "HOH OHO", "....."]
    # raw string (r"...") so that \s reaches the regex engine unchanged
    pattern = re.compile(r"[A-Z][0-9][A-Z]\s[0-9][A-Z][0-9]")
    for pc in garbage:
        if pattern.match(pc):
            print(pc, "matches pattern", sep=" ")
        else:
            print(pc, "doesn't match pattern", sep=" ")
    ## 90210 doesn't match pattern
    ## CCCCCC doesn't match pattern
    ## hoh oho doesn't match pattern
    ## HOH OHO doesn't match pattern
    ## ..... doesn't match pattern

Special matching characters in python regex

    Character        Use
    .                Any character except newline
    ^                Start of string
    $                End of string
    *                Zero or more repetitions
    +                One or more repetitions
    ^ (first in [])  Invert the behaviour for this set of characters (i.e. non-matching)

Introduction to Wordle

- Goal is to guess a 5-letter word using information gained from earlier guesses
- Grey letters don't exist in the word
- Green letters are in the correct place in the word
- Yellow letters belong elsewhere in the word

Regex for this mystery word

Position:
1) R
2) E
3) not any of R, A, N, L, D, Y
4) not any of R, A, N, L, D, Y
5) X

Regex for this mystery word

    RE[^RANLDY][^RANLDY]X

Cheating at Wordle

    some_words = ["NERDY", "LEARN", "REDOX", "RELAX", "REGEX"]
    pattern = re.compile("RE[^RANLDY][^RANLDY]X")
    for word in some_words:
        if pattern.match(word):
            print(word, "matches regex", sep=" ")
        else:
            print(word, "is wrong!", sep=" ")
    ## NERDY is wrong!
    ## LEARN is wrong!
    ## REDOX is wrong!
    ## RELAX is wrong!
    ## REGEX matches regex

Step 1: load a word file

    all_words = []
    h = open("/local-scratch/course_files/MBB110/words.txt", "r")
    for l in h:
        l = l.rstrip("\n")
        l = l.upper()  # why?
        all_words.append(l)
    h.close()

Step 2: start guessing

    guess1 = "SNAKE"
    pattern1 = re.compile("[^S][^N][^A][^K]E")
    c = 0
    for w in all_words:
        if pattern1.match(w):
            print(w, "matches regex", sep=" ")
            c += 1
        if c > 5:
            break
    ## THERE matches regex
    ## THESE matches regex
    ## WRITE matches regex
    ## WHERE matches regex

Custom puzzle: https://mywordle.strivemath.com/?word=loivp

Next guess

    guess2 = "TEASE"
    pattern2 = re.compile("[^ST][^NE][^A]SE")
    c = 0
    for w in all_words:
        if pattern2.match(w):
            print(w, "matches regex", sep=" ")
            c += 1
        if c > 5:
            break
    ## HOUSE matches regex
    ## CLOSE matches regex
    ## HORSE matches regex
    ## WHOSE matches regex
    ## CAUSE matches regex
    ## NOISE matches regex

One more try

    guess3 = "LARGE"
    pattern3 = re.compile("[^STL]ARSE")
    c = 0
    for w in all_words:
        if pattern3.match(w):
            print(w, "matches regex", sep=" ")
            c += 1
        if c > 5:
            break
    ## PARSE matches regex

Got it!
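The three Wordle patterns were written by hand, one guess at a time. As a minimal sketch of how that step could be automated, the function below builds the same kind of pattern from a list of guesses. The name build_pattern and the feedback encoding ('g' green, 'y' yellow, 'x' grey, one character per letter) are assumptions made for this illustration and are not part of the lecture code. Like the hand-written patterns, it only records which letters are ruled out at each position; it does not check that yellow letters appear somewhere else in the word.

```python
import re

def build_pattern(history):
    """Build a Wordle regex from (guess, feedback) pairs.

    feedback is a hypothetical encoding with one character per letter:
    'g' = green, 'y' = yellow, 'x' = grey.
    """
    length = len(history[0][0])
    fixed = [None] * length                 # green letter known at each position
    excluded = [[] for _ in range(length)]  # letters ruled out at each position
    for guess, feedback in history:
        for i, (letter, mark) in enumerate(zip(guess, feedback)):
            if mark == "g":
                fixed[i] = letter
            elif letter not in excluded[i]:
                # yellow or grey: this letter cannot sit at this position
                excluded[i].append(letter)
    parts = []
    for i in range(length):
        if fixed[i] is not None:
            parts.append(fixed[i])
        elif excluded[i]:
            parts.append("[^" + "".join(excluded[i]) + "]")
        else:
            parts.append(".")
    return "".join(parts)

# feedback strings assume the mystery word is PARSE
history = [("SNAKE", "yxyxg"), ("TEASE", "xxygg")]
pattern_text = build_pattern(history)
print(pattern_text)  # [^ST][^NE][^A]SE

words = ["HOUSE", "TEASE", "PARSE", "SNAKE"]
matches = [w for w in words if re.match(pattern_text, w)]
print(matches)       # ['HOUSE', 'PARSE']
```

Adding the third guess, ("LARGE", "xggxg"), to history reproduces the final pattern [^STL]ARSE.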
What if our list included larger words?

- Regular expressions can be anchored to ensure your match is tied to one or both ends of a string
- e.g. regex "MYC" matches all of "MYC", "c-MYC" and "MYCBP" when the search may start anywhere in the string (as with grep or re.search)
- ^ anchors the match at the start of the string
- $ anchors it at the end of the string
- e.g. "^MYC$" will only match "MYC" but not the others

mRNAdle: guess the coding sequence (CDS)

Given a FASTA file that contains a full-length cDNA sequence, guess which strand encodes protein and find the most likely CDS start and end, i.e.

- Find the start codons on both strands
- Translate from each start codon
- Compare the lengths of proteins

Why is this likely to give you the right answer most of the time?

mRNAdle example

Problem 1: how do we reverse-complement a sequence in Python?

    from Bio.Seq import Seq

    # adjacent string literals are joined, so the sequence contains no spaces or newlines
    cdna = (
        "GCTCCTTCATCATGAACTGGCACATGATCATCTCTGGGCTTATTGTGGTAGTGCTTAAAGTTGTTGGAATGACCTTAT"
        "TTCTACTTTATTTCCCACAGATTTTTAACAAAAGTAACGATGGTTTCACCACCACCAGGAGCTATGGAACAGTCTGCCC"
        "CAAAGACTGG"
    )
    cdna_seq = Seq(cdna)
    comp_cdna = cdna_seq.reverse_complement()
    print(comp_cdna)
    ## CCAGTCTTTGGGGCAGACTGTTCCATAGCTCCTGGTGGTGGTGAAACCATCGTTACTTTTGTTAAAAATCTGTGGGAAA
       TAAAGTAGAAATAAGGTCATTCCAACAACTTTAAGCACTACCACAATAAGCCCAGAGATGATCATGTGCCAGTTCATGA
       TGAAGGAGC

mRNAdle example

Problem 2: how do we find every ATG in both strands?

The string built-in "find" method is not sufficient

    cdna.find("ATG")  # this is just the first and may not be the right reading frame!
    ## 11
    comp_cdna.find("ATG")  # this is just the first and may not be the right reading frame!
    ## 136

mRNAdle example

Problem 2: how do we find every ATG in both strands?
Regular expressions can be used here

    import re

    print("start end seq")
    ## start end seq
    for m in re.finditer("ATG", cdna):
        print(m.start(), m.end(), m.group(0))
    # these are all the positions of an ATG on the plus strand
    ## 11 14 ATG
    ## 23 26 ATG
    ## 68 71 ATG
    ## 117 120 ATG
    ## 141 144 ATG

Dictionaries

- A dictionary is a collection of key-value pairs, where each key is unique, and values can be any data type.
- Similar to a real-world dictionary, where each word (key) has a definition (value).
- Not indexed by position: unlike an array, values are looked up by key rather than by an implicit order (since Python 3.7 a dictionary does remember the order in which keys were inserted)
- Mutable: you can update the dictionary by adding new keys and updating the values of existing keys

Genetic code as a python dictionary

    genetic_code = {
        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
        'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
        'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W',
    }

mRNAdle example

Problem 3: how do we translate into protein?
Use our original genetic_code dictionary and a fancy loop

    # for simplicity, we will only deal with one reading frame here (first match)
    maybe_cds = cdna[11:]
    print("maybe CDS:", maybe_cds[:15])
    ## maybe CDS: ATGAACTGGCACATG
    for i in range(0, len(maybe_cds), 3):
        print(i, maybe_cds[i:i+3], genetic_code[maybe_cds[i:i+3]])
    ## 0 ATG M
    ## 3 AAC N
    ## 6 TGG W

mRNAdle example

Problem 4: Count amino acids before the first STOP (_) and retain the peptide sequence

    num_aa = 0
    peptide = ""
    for i in range(0, len(maybe_cds), 3):
        aa = genetic_code[maybe_cds[i:i+3]]
        if aa != "_":  # "_" is how genetic_code marks a stop codon
            num_aa += 1
            peptide = peptide + aa
        else:
            break  # exit the loop
    print(num_aa, "amino acids before first stop")
    ## 52 amino acids before first stop
    print(peptide)
    ## MNWHMIISGLIVVVLKVVGMTLFLLYFPQIFNKSNDGFTTTRSYGTVCPKDW

Summary

- Genomic sequence data contains many motifs that we use to make predictions about genes and gene products; the most familiar to you right now will be the start codon (ATG) and the three stop codons (TAA, TAG, TGA).
- We can use pattern matching to search for the positions of these motifs, then use code to mimic what happens biologically to determine a possible outcome (i.e. translation).
- Python uses regular expressions, which are syntactically strict statements that will search for motifs according to the rules you specify. You can access these in your scripts by importing the module re.

Lab this week

Searching for patterns!

Thanks! Any questions?
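The mRNAdle problems from the lecture (find every ATG on both strands, translate from each one, compare peptide lengths) can be combined into one short sketch. It is only an illustration under stated assumptions: the helper names reverse_complement, translate_from and longest_orf are hypothetical, the reverse complement is done with plain string methods as a stand-in for BioPython's Seq.reverse_complement(), the codon table is built compactly but holds the same mapping as genetic_code (with "_" for stop), and the input is assumed to contain only uppercase A, C, G and T.

```python
import re

# Standard genetic code, codons enumerated in TCAG order; "_" marks a stop codon
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY__CC_WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS))

def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate_from(seq, start):
    """Translate codon by codon from start until a stop codon or the end of seq."""
    peptide = ""
    for i in range(start, len(seq) - 2, 3):  # only complete codons
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "_":
            break
        peptide += aa
    return peptide

def longest_orf(seq):
    """Return (strand, start, peptide) for the ATG that gives the longest peptide."""
    best = ("+", 0, "")
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for m in re.finditer("ATG", s):
            peptide = translate_from(s, m.start())
            if len(peptide) > len(best[2]):
                best = (strand, m.start(), peptide)
    return best

toy = "CCATGAAATAGCC"  # made-up sequence: ATG AAA TAG on the plus strand
best = longest_orf(toy)
print(best)  # ('+', 2, 'MK')
```

The start position reported for the minus strand is counted along the reverse-complemented string, as in the lecture's comp_cdna.find("ATG") example. Calling longest_orf(cdna) would run the same comparison on the lecture sequence.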
