Bioinformatics Lecture 1 on Python PDF
McMaster University
Summary
These lecture notes provide an introduction to Python programming, focusing specifically on bioinformatics applications. The lecture covers basic Python syntax, data types, and programming constructs, and introduces the BioPython library, a key tool in bioinformatics. A wealth of Python examples related to various bioinformatic functions are included.
Full Transcript
BIOTECH4BI3 – BIOINFORMATICS
Lecture 1 – Python review and introduction to BioPython

Getting Python
Anaconda! (www.anaconda.com/products/individual)
Get the Python 3.12 version for your platform

Things to Remember
Python has a large standard library
Python is built on the idea that new functionality should be easy to add
The way your code is structured MATTERS
Designed to be uncluttered, and uses English-language keywords
“…clever is not considered a compliment…”

Hello World!
    #! /usr/bin/python
    print("Hello world!")
Each program starts with a declaration of where to find the Python interpreter (not necessary in Windows)
The ‘\n’ signifies the newline character (print() already adds one at the end of its output)
Words to print are surrounded by quotation marks
Save this to a text file with the extension ‘.py’ and run it
Use the Python 3.12 installation from www.anaconda.com

Types of Data
Python has a number of built-in types like numbers and strings
We assign data to variables in Python as follows:
    myNumber=1
    mySentence="You will love my class"
Operators allow us to manipulate data: +, -, *, /, %, ** (numbers); +=, ==, != (note that Python has no ++ operator, and that = assigns while == tests equality)

Built-in Types in Python
There are many built-in data types, but the following are the most commonly used
    Type          Example                          Conversion
    Integers      myNumber=1                       int(x)
    Floats        myFloat=2.1                      float(x)
    Lists         myList=[1,2,3,'four']
    Range         myRange=range(0,10,2)
    Strings       myString="biotech"
    Dictionaries  myHash={'Joe':123,'John':456}

Flow Control
Like many other programming languages, Python has familiar constructs to control the flow of a program
    if/elif/else
    for statements
    while statements
    break
    continue

If/elif/else
A conditional statement that offers the program a choice of actions
Has the following structure
    if True/False:
        operation 1
    elif True/False:
        operation 2
    else:
        operation 3

If/elif/else Example
    #! /usr/bin/python
    var=5
    if var>5:
        print("value is too high\n")
    elif var [...]

[...]
Multiple FASTA records can be put in a single file
Do not mix DNA and protein records in the same file
Reading in a file is called ‘parsing’

FASTA Example
    >gene1 disease response gene
    CCTATTAAAGATAACAAGGAAGAAGATGATATAGATGAAGATGCTTTTGAGGCACTGTTTCAGCTA
    GAGGAGGACCTCGCGAAGCTTGAACGTGAATTGGAAGAGGCACTGAAGGATGATGAACTATTAGGA
    GGGGTAGTAGAAGATGATAACGAAGAAGAAGAAGAAGAAGAATTACCCGTGAATTTGAAAAATTGG
    AGTTGTGATCTCTATGTTAGTCTTCTCAAATTAAGATGTTGCAATCGCACTACCAGATACTCATTA
    AGCACGAAGCAAAGAACATCAATCCATGTGTCTAAAATTTTAAATTGCAGTCCGTATGTAGCTTCA
    AAAAAACTGGACATCTACTTTATGCTTGCACCATTTTCTTGGCATTCATTAGCTAAAACTATATAT
    TTCTAGATAAAGTCATCATAAGTATAGTCGGAAGTTTCAGAACCTGTGGGCCTGTGGCTGTAACAT
    ATTTAAAGAGAATATTTCTACCACTGCTAATTCTGTATCTGTAAATTCTATGTTTCCTTCCAGATA

FASTQ Format
A common format for data generated by high-throughput DNA sequencing instruments
Incorporates the base calls as well as the estimated probability that each base call is wrong. These values are called ‘Phred’ scores or ‘quality’ scores
A FASTQ entry has 4 lines:
  – Line 1 – starts with the ‘@’ symbol followed immediately by the sequence name.
    Optional descriptive information can follow the name
  – Line 2 – the DNA base calls
  – Line 3 – the plus symbol (‘+’)
  – Line 4 – the Phred scores for the base calls
Multiple FASTQ records can be present in a file

FASTQ Example
    @SEQ001
    TACGGTAGCTAAGTGAGTAGTACGTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGC
    +
    AAAAAEEEEEEEEEEEEEEAEEEEEEAEEEE6EEEAEEEE/EEEEEEEEEEEEEEEEEEEE
    @SEQ002
    CGATCGTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTA
    +
    A6AAAAAEEAEEEEEEEEAEEEEEEEEAEEEEEEEEEEAEEEEEEEEEEEAEEEEEEEEEE
    @SEQ003
    GCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTA
    +
    EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
    @SEQ004
    ATAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG
    +
    E6EEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEAEEEEEEEEEEEEAEEEEEEEEEEEE

Genbank Format – An Example
    LOCUS       AY069118     1502 bp    mRNA    linear   INV 17-DEC-2001
    DEFINITION  Drosophila melanogaster GH13089 full length cDNA.
    ACCESSION   AY069118
    VERSION     AY069118.1  GI:17861571
    KEYWORDS    FLI_CDNA.
    SOURCE      Drosophila melanogaster (fruit fly)
      ORGANISM  Drosophila melanogaster
                Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; Pterygota;
                Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha;
                Ephydroidea; Drosophilidae; Drosophila.
    REFERENCE   1  (bases 1 to 1502)
      AUTHORS   Stapleton,M., Brokstein,P., Hong,L., Agbayani,A., Carlson,J.,
                Champe,M., Chavez,C., Dorsett,V., Farfan,D., Frise,E., George,R.,
                Gonzalez,M., Guarin,H., Li,P., Liao,G., Miranda,A., Mungall,C.J.,
                Nunoo,J., Pacleb,J., Paragas,V., Park,S., Phouanenavong,S., Wan,K.,
                Yu,C., Lewis,S.E., Rubin,G.M. and Celniker,S.
      TITLE     Direct Submission
      JOURNAL   Submitted (10-DEC-2001) Berkeley Drosophila Genome Project,
                Lawrence Berkeley National Laboratory, One Cyclotron Road,
                Berkeley, CA 94720, USA
    COMMENT     Sequence submitted by:
                Berkeley Drosophila Genome Project
                Lawrence Berkeley National Laboratory
                Berkeley, CA 94720
                This clone was sequenced as part of a high-throughput process to
                sequence clones from Drosophila Gene Collection 1 (Rubin et al.,
                [...]
                for sequence accuracy, presence of a polyA tail and contiguity
                within 100 kb in the genome. Thus we believe the sequence to
                reflect accurately this particular cDNA clone. However, there are
                artifacts associated with the generation of cDNA clones that may
                have not been detected in our initial analyses such as internal
                priming, priming from contaminating genomic DNA, retained introns
                due to reverse transcription of unspliced precursor RNAs, and
                reverse transcriptase errors that result in single base changes.
                For further information about this sequence, including its
                location and relationship to other sequences, please visit our Web
                site (http://fruitfly.berkeley.edu) or send email to
                [email protected].
    FEATURES             Location/Qualifiers
         source          1..1502
                         /organism="Drosophila melanogaster"
                         /strain="y; cn bw sp"
                         /db_xref="taxon:7227"
                         /map="39B3-39B3"
         gene            1..1502
                         /gene="E2f2"
                         /note="alignment with genomic scaffold AE003669"
                         /db_xref="FLYBASE:FBgn0024371"

Bio.SeqIO
‘Bio.SeqIO provides a simple uniform interface to input and output assorted sequence file formats’
The design of this module was based on Perl’s Bio::SeqIO
The Bio.SeqIO object allows one to iterate over all of the sequence records in a file
Create the object as follows – seqIOObj=SeqIO.parse(FILEHANDLE,FORMAT)

Example – SeqIO

Bio.SeqIO – Format Conversion
It is easy to use BioPython to convert DNA sequences among different formats OR to use BioPython formatting features to clean up your sequences
The approach is to open a source file in one format and a sink file in another format
As you iterate over the source sequences you write the objects to the sink file

Example – Format Conversion

Bio.SearchIO
There are a number of bioinformatics tools that are used to search one sequence against a database of others
The goal of this is to find similarities between known sequences and unknown sequences with the hope of assigning a putative function
Basic Local Alignment Search Tool (BLAST) is one of the most commonly used bioinformatics tools and revolutionized sequence searching
BLAST reports can be very detailed and very long. BioPython makes them easy to work with

Example – Perform a BLAST Search

Example – Parse BLAST Result

BLAST Record
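
Sketch – FASTQ parsing and format conversion in plain Python
The four-line FASTQ layout and the source/sink conversion approach described in the FASTQ Format and Bio.SeqIO – Format Conversion sections can be sketched without BioPython. This is only an illustration of the idea, not how Bio.SeqIO is implemented. The function names (read_fastq, phred_scores, error_probability, fastq_to_fasta) and the two sample reads are hypothetical, and the quality decoding assumes the common Phred+33 encoding, where a score Q relates to the error probability P by Q = -10 * log10(P).

```python
from io import StringIO

# Hypothetical two-read FASTQ file held in memory (not data from the lecture).
EXAMPLE_FASTQ = "@read1 first example read\nACGT\n+\nII5!\n@read2\nTTGA\n+\n5555\n"


def read_fastq(handle):
    """Yield (name, sequence, quality) for each four-line FASTQ record."""
    while True:
        header = handle.readline().strip()
        if not header:  # end of file (or a blank line) stops the iteration
            return
        sequence = handle.readline().strip()
        plus = handle.readline().strip()
        quality = handle.readline().strip()
        if len(header) < 2 or not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # The name is the text right after '@', up to the first whitespace.
        name = header[1:].split()[0]
        yield name, sequence, quality


def phred_scores(quality, offset=33):
    """Decode a quality string into integer Phred scores (assumes Phred+33)."""
    return [ord(char) - offset for char in quality]


def error_probability(score):
    """Convert a Phred score Q into the error probability 10^(-Q/10)."""
    return 10 ** (-score / 10)


def fastq_to_fasta(source, sink):
    """Iterate over FASTQ records in source and write each to sink as FASTA."""
    count = 0
    for name, sequence, _quality in read_fastq(source):
        sink.write(f">{name}\n{sequence}\n")
        count += 1
    return count


if __name__ == "__main__":
    sink = StringIO()
    converted = fastq_to_fasta(StringIO(EXAMPLE_FASTQ), sink)
    print(converted, "records converted")
    print(sink.getvalue(), end="")
    print(phred_scores("II5!"))  # [40, 40, 20, 0]
```

With BioPython installed, SeqIO.parse(handle, "fastq") takes the place of read_fastq and SeqIO.write(records, handle, "fasta") takes the place of the manual write loop; the StringIO objects here stand in for the source and sink files.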