Biotech4bi3 - Lecture 1 - Python Introduction PDF
McMaster University
Summary
This document is a lecture on Python programming and its applications in bioinformatics, specifically exploring BioPython. The lecture covers various Python concepts such as data types, flow control, functions, and error handling in the context of bioinformatics data. The presentation includes practical examples and code snippets demonstrating the use of BioPython functions.
Full Transcript
BIOTECH4BI3 – BIOINFORMATICS
Lecture 1 – Python review and introduction to BioPython

Getting Python
– Anaconda! (www.anaconda.com/products/individual)
– Get the Python 3.12 version for your platform

Things to Remember
– Python has a large standard library
– Python is built on the idea that new functionality should be easy to add
– The way your code is structured MATTERS
– Designed to be uncluttered, with English-language keywords
– “…clever is not considered a compliment…”

Hello World!

    #! /usr/bin/python
    print("Hello world!")

– Each program starts with a declaration of where to find the Python interpreter (not necessary in Windows)
– The ‘\n’ signifies the newline character (print() also appends one automatically)
– Words to print are surrounded by quotation marks
– Save this to a text file with the extension ‘.py’ and run it
– Use the Python 3.12 installation from www.anaconda.com

Types of Data
– Python has a number of built-in types, such as numbers and strings
– We assign data to variables in Python as follows:

    myNumber=1
    mySentence="You will love my class"

– Operators allow us to manipulate data: +, -, *, /, %, ** (numbers); = (assignment); ==, != (comparison)

Built-in Types in Python
There are many built-in data types, but the following are the most commonly used:

    Type          Example                          Conversion
    Integers      myNumber=1                       int(x)
    Floats        myFloat=2.1                      float(x)
    Lists         myList=[1,2,3,'four']
    Range         myRange=range(0,10,2)
    Strings       myString="biotech"
    Dictionaries  myHash={'Joe':123,'John':456}

Flow Control
Like many other programming languages, Python has familiar constructs to control the flow of a program:
– if/elif/else
– for statements
– while statements
– break
– continue

If/elif/else
– A conditional statement that offers the program a choice of actions
– Has the following structure:

    if True/False:
        operation 1
    elif True/False:
        operation 2
    else:
        operation 3

If/elif/else Example

    #! /usr/bin/python
    var=5
    if var>5:
        print("value is too high\n")
    elif var [...]

[...]
– Multiple FASTA records can be put in a single file
– Do not mix DNA and protein records in the same file
– Reading in a file is called ‘parsing’

FASTA Example

    >gene1 disease response gene
    CCTATTAAAGATAACAAGGAAGAAGATGATATAGATGAAGATGCTTTTGAGGCACTGTTTCAGCTA
    GAGGAGGACCTCGCGAAGCTTGAACGTGAATTGGAAGAGGCACTGAAGGATGATGAACTATTAGGA
    GGGGTAGTAGAAGATGATAACGAAGAAGAAGAAGAAGAAGAATTACCCGTGAATTTGAAAAATTGG
    AGTTGTGATCTCTATGTTAGTCTTCTCAAATTAAGATGTTGCAATCGCACTACCAGATACTCATTA
    AGCACGAAGCAAAGAACATCAATCCATGTGTCTAAAATTTTAAATTGCAGTCCGTATGTAGCTTCA
    AAAAAACTGGACATCTACTTTATGCTTGCACCATTTTCTTGGCATTCATTAGCTAAAACTATATAT
    TTCTAGATAAAGTCATCATAAGTATAGTCGGAAGTTTCAGAACCTGTGGGCCTGTGGCTGTAACAT
    ATTTAAAGAGAATATTTCTACCACTGCTAATTCTGTATCTGTAAATTCTATGTTTCCTTCCAGATA

FASTQ Format
– A common format for data generated by high-throughput DNA sequencing instruments
– Incorporates the base calls together with the probability that each base call is incorrect. These values are called ‘Phred’ scores or ‘quality’ scores
– A FASTQ entry has 4 lines:
– Line 1 – starts with the ‘@’ symbol followed immediately by the sequence name.
  Optional descriptive information can follow the name
– Line 2 – the DNA base calls
– Line 3 – the plus symbol (‘+’)
– Line 4 – the Phred scores for the base calls
Multiple FASTQ records can be present in a file

FASTQ Example

    @SEQ001
    TACGGTAGCTAAGTGAGTAGTACGTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGC
    +
    AAAAAEEEEEEEEEEEEEEAEEEEEEAEEEE6EEEAEEEE/EEEEEEEEEEEEEEEEEEEE
    @SEQ002
    CGATCGTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTA
    +
    A6AAAAAEEAEEEEEEEEAEEEEEEEEAEEEEEEEEEEAEEEEEEEEEEEAEEEEEEEEEE
    @SEQ003
    GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTA
    +
    EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
    @SEQ004
    ATAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG
    +
    E6EEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEAEEEEEEEEEEEEAEEEEEEEEEEEE

Genbank Format – An Example

    LOCUS       AY069118     1502 bp    mRNA    linear   INV 17-DEC-2001
    DEFINITION  Drosophila melanogaster GH13089 full length cDNA.
    ACCESSION   AY069118
    VERSION     AY069118.1  GI:17861571
    KEYWORDS    FLI_CDNA.
    SOURCE      Drosophila melanogaster (fruit fly)
      ORGANISM  Drosophila melanogaster
                Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; Pterygota;
                Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha;
                Ephydroidea; Drosophilidae; Drosophila.
    REFERENCE   1  (bases 1 to 1502)
      AUTHORS   Stapleton,M., Brokstein,P., Hong,L., Agbayani,A., Carlson,J.,
                Champe,M., Chavez,C., Dorsett,V., Farfan,D., Frise,E., George,R.,
                Gonzalez,M., Guarin,H., Li,P., Liao,G., Miranda,A., Mungall,C.J.,
                Nunoo,J., Pacleb,J., Paragas,V., Park,S., Phouanenavong,S., Wan,K.,
                Yu,C., Lewis,S.E., Rubin,G.M. and Celniker,S.
      TITLE     Direct Submission
      JOURNAL   Submitted (10-DEC-2001) Berkeley Drosophila Genome Project,
                Lawrence Berkeley National Laboratory, One Cyclotron Road,
                Berkeley, CA 94720, USA
    COMMENT     Sequence submitted by:
                Berkeley Drosophila Genome Project
                Lawrence Berkeley National Laboratory
                Berkeley, CA 94720
                This clone was sequenced as part of a high-throughput process to
                sequence clones from Drosophila Gene Collection 1 (Rubin et al.,
                [...]
                for sequence accuracy, presence of a polyA tail and contiguity
                within 100 kb in the genome. Thus we believe the sequence to
                reflect accurately this particular cDNA clone. However, there are
                artifacts associated with the generation of cDNA clones that may
                have not been detected in our initial analyses such as internal
                priming, priming from contaminating genomic DNA, retained introns
                due to reverse transcription of unspliced precursor RNAs, and
                reverse transcriptase errors that result in single base changes.
                For further information about this sequence, including its location
                and relationship to other sequences, please visit our Web site
                (http://fruitfly.berkeley.edu) or send email to
                cdna@fruitfly.berkeley.edu.
    FEATURES             Location/Qualifiers
         source          1..1502
                         /organism="Drosophila melanogaster"
                         /strain="y; cn bw sp"
                         /db_xref="taxon:7227"
                         /map="39B3-39B3"
         gene            1..1502
                         /gene="E2f2"
                         /note="alignment with genomic scaffold AE003669"
                         /db_xref="FLYBASE:FBgn0024371"

Bio.SeqIO
– ‘Bio.SeqIO provides a simple uniform interface to input and output assorted sequence file formats’
– The design of this module was based on Perl’s Bio::SeqIO
– The Bio.SeqIO object allows one to iterate over all of the sequence records in a file
– Create the object as follows:

    seqIOObj=SeqIO.parse(FILEHANDLE,FORMAT)

Example – SeqIO

Bio.SeqIO – Format Conversion
– It is easy to use BioPython to convert DNA sequences among different formats OR to use BioPython formatting features to clean up your sequences
– The approach is to open a source file in one format and a sink file in another format
– As you iterate over the source sequences, you write the objects to the sink file

Example – Format Conversion

Bio.SearchIO
– There are a number of bioinformatics tools that are used to search one sequence against a database of others
– The goal is to find similarities between known sequences and unknown sequences, with the hope of assigning a putative function to the unknown ones
– Basic Local Alignment Search Tool (BLAST) is one of the most commonly used bioinformatics tools and was a major advance over the tools that came before it
– BLAST reports can be very detailed and very long. BioPython makes them easy to work with

Example – Perform a BLAST Search

Example – Parse BLAST Result

BLAST Record
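
Returning to the FASTA format covered earlier, it may help to see what ‘parsing’ a FASTA file involves. The sketch below is a minimal standard-library illustration, not BioPython's implementation: it assumes each record starts with a ‘>’ header line and that the sequence may be wrapped over several lines. The function names `read_fasta` and `_make_record`, and the shortened two-record sample, are hypothetical and chosen only for this example.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (name, description, sequence) for each record in a FASTA stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined later
    if header is not None:
        yield _make_record(header, chunks)


def _make_record(header, chunks):
    # The name is the first word of the header; the rest is free-text description.
    name, _, description = header.partition(" ")
    return name, description, "".join(chunks)


# Hypothetical, shortened sample with two records in one "file".
sample = StringIO(
    ">gene1 disease response gene\n"
    "CCTATTAAAG\n"
    "ATAACAAGGA\n"
    ">gene2\n"
    "GGGGTAGTAG\n"
)

records = list(read_fasta(sample))
for name, description, sequence in records:
    print(name, description, len(sequence))
```

In practice, BioPython's `SeqIO.parse(FILEHANDLE, "fasta")` does this job and returns richer record objects; the sketch only shows the idea behind it.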
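
The four-line FASTQ layout and its Phred scores, described earlier, can be illustrated in the same way. This is a rough standard-library sketch under two assumptions that the slides do not state: quality characters use the Phred+33 encoding (ASCII code minus 33), and every record occupies exactly four unwrapped lines. `read_fastq`, `error_probability`, and the sample read are hypothetical names and data for illustration.

```python
from io import StringIO

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / recent Illumina) encoding


def read_fastq(handle):
    """Yield (name, bases, scores) for each four-line FASTQ record."""
    lines = [ln.strip() for ln in handle if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for i in range(0, len(lines), 4):
        header, bases, plus, quals = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(bases) != len(quals):
            raise ValueError(f"{header}: base and quality lengths differ")
        name = header[1:].partition(" ")[0]  # text after the name is optional
        scores = [ord(ch) - PHRED_OFFSET for ch in quals]
        yield name, bases, scores


def error_probability(q):
    """Convert a Phred score Q into the probability that the base call is wrong."""
    return 10 ** (-q / 10)


# Hypothetical short read; real reads are much longer.
sample = StringIO("@SEQ001 optional description\nTACGG\n+\nAE6/A\n")

reads = list(read_fastq(sample))
for name, bases, scores in reads:
    print(name, bases, scores)
```

Records are taken four lines at a time rather than by searching for ‘@’, because ‘@’ is also a legal quality character and can appear at the start of a quality line.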
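
The source/sink approach described under Bio.SeqIO – Format Conversion can also be sketched without BioPython. The example below is a hedged stand-in that converts FASTQ to FASTA using only the standard library; with BioPython, `SeqIO.parse` and `SeqIO.write` would take the place of the hand-written reader and writer. The names `fastq_records` and `convert_fastq_to_fasta`, the `width` parameter, and the sample reads are assumptions made for this illustration.

```python
from io import StringIO


def fastq_records(source):
    """Yield (name, bases) from a FASTQ stream with four lines per record."""
    lines = [ln.strip() for ln in source if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for i in range(0, len(lines), 4):
        header, bases = lines[i], lines[i + 1]
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at line {i + 1}")
        yield header[1:].partition(" ")[0], bases


def convert_fastq_to_fasta(source, sink, width=60):
    """Iterate over the source records and write each one to the sink as FASTA."""
    count = 0
    for name, bases in fastq_records(source):
        sink.write(f">{name}\n")
        # Wrap the sequence so that no line is longer than `width`.
        for start in range(0, len(bases), width):
            sink.write(bases[start:start + width] + "\n")
        count += 1
    return count


# In-memory stand-ins for the source file (FASTQ) and the sink file (FASTA).
source = StringIO(
    "@SEQ001\nTACGGTAGCT\n+\nAAAAAEEEEE\n"
    "@SEQ002\nCGATCG\n+\nA6AAAA\n"
)
sink = StringIO()

written = convert_fastq_to_fasta(source, sink, width=6)
print(written, "records written")
print(sink.getvalue(), end="")
```

With real files, the two `StringIO` objects would be replaced by `open(...)` handles, one for reading and one for writing. Note that the quality scores are discarded, since FASTA has no place to store them.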