Methods In Molecular Biology WS 2022 PDF
Lecture notes from "Methods in Molecular Biology", WS 2022. The lecture covers DNA repair mechanisms, including base excision repair (BER) and nucleotide excision repair (NER), and discusses how DNA damage can lead to genomic instability. The material also touches on the topic of different types of damage to the DNA and their underlying causes.
VO Methods in molecular biology WS 2022
Lecture 1: DNA repair and remodelling models (Dea Slade)

DNA damage
Happens all the time, caused by exogenous and endogenous factors.
- Exogenous factors:
  o Cigarette smoke, with about 200 carcinogenic compounds such as benzo-a-pyrene, which attaches to guanine and causes bulky adducts that are mutagenic if not repaired. DNA repair is very efficient, but it has a limit and can become non-functional or insufficient.
  o UV light or sunlight: damage by bipyrimidine dimers; two pyrimidines that become attached to each other are called bipyrimidine photoproducts.
  o Radiation (Chernobyl): causes oxidative base damage and single- and double-strand breaks. Some chemotherapeutic agents are called radiomimetic agents because they mimic the effect of radiation by inducing similar types of DNA damage. Topoisomerase inhibitors, which induce either single- or double-strand breaks, are also often used in chemotherapy.
- Endogenous factors:
  o Hydrolysis: water-mediated damage can lead to deamination of cytosine into uracil, and to depurination, where the whole purine is removed (the most frequent lesion; can happen spontaneously).
  o Mitochondrial ROS (reactive oxygen species), radicals etc.
damage DNA either through oxidative base damage or through single- or double-strand breaks.
  o Replication and recombination: replication errors produce mismatches; errors also occur during repair of the programmed double-strand breaks made during recombination in antibody production.
- Crosslinks: crosslinking agents such as cisplatin (chemotherapeutic agent) form crosslinks between bases, usually guanines. They can be within one strand (intrastrand) or between the two strands (interstrand). There is also an endogenous source, aldehydes: after drinking too much alcohol, the alcohol is converted to an aldehyde, which can accumulate and cause crosslinks.
- Alkylation: temozolomide (chemotherapeutic agent), and SAM (S-adenosylmethionine) as an endogenous source of alkylation.

30,000 endogenous lesions / cell / day
The majority is depurination by hydrolysis (18,000); methylation by SAM accounts for 6000 (7-methylguanine) and 1200 (3-methyladenine).

DNA repair mechanisms
- Base damage: base excision repair (BER)
- Nucleotide damage (such as bulky adducts and bipyrimidine photoproducts): nucleotide excision repair (NER)
- Interstrand crosslinks: two pathways, NER and interstrand crosslink repair (ICLR); the difference is that ICLR can only function during DNA replication
- Single-strand breaks: single-strand break repair (SSBR)
- Double-strand breaks: double-strand break repair (DSBR), which has two subpathways: homologous recombination (HR) and non-homologous end joining (NHEJ)
- Mismatches: mismatch repair (MR)

DNA repair defects
These pathways are really important: if one of them is mutated, through a hereditary mutation or a mutation in somatic cells, this can result in many disorders, e.g.
in the picture: defects in BER and SSBR can result in AOA (ataxia with oculomotor apraxia), where people have problems with movement; if NER doesn’t function, patients are very susceptible to UV light and develop skin cancer (XP); defects in NHEJ result in immunodeficiency, because NHEJ is needed to repair the programmed DSBs during antibody formation; and so on.

Point mutations
Point mutations can be classified in different ways. One is by how the base changes:
- Transitions:
  o Pyrimidine to pyrimidine (T -> C)
  o Purine to purine (A -> G)
- Transversions:
  o Pyrimidine to purine (T -> A or G)
  o Purine to pyrimidine (G -> C or T)
The other classification is by functional outcome:
- No mutation: TTC (DNA level) -> AAG (mRNA level) -> Lys (protein level)
- Silent: TTT -> AAA -> Lys
- Nonsense: ATC -> UAG -> STOP
- Missense:
  o Conservative: TCC -> AGG -> Arg
  o Non-conservative: TGC -> ACG -> Thr
Point mutations are thus divided into silent (same amino acid after the mutation), nonsense (a STOP codon instead of an amino acid) and missense (one amino acid is converted into another).

Frameshift mutations
Induced by small insertions or deletions; the reading frame is lost, so you end up with a completely different amino acid sequence. Can also lead to a premature STOP codon.

Structural mutations
Large changes at the level of the chromosome: deletion (one part of the chromosome is removed), duplication (a part is duplicated), inversion, insertion from one chromosome into another, or translocation, when e.g.
the whole chromosome arm moves from one chromosome to another.

Genomic (in)stability
DNA damage -> sufficient repair -> genomic stability is maintained. If DNA repair is dysfunctional for some reason, such as mutations or changes in gene expression, mutations can occur. One pathway can take over for another, but for this to happen, and in general for any damage to be repaired, the cell cycle needs to stop. The protein p53, called the guardian of the genome, is critical for stopping the cell cycle: it arrests the cell cycle at the G1/S transition so that the damage can be repaired, and it can also trigger apoptosis if there is too much damage. The problem is that p53 is mutated in more than 50% of cancers. Genomic instability is one of the hallmarks of cancer.

Nobel prizes in DNA repair
2015: Tomas Lindahl for base excision repair (BER), Aziz Sancar for photoreactivation and nucleotide excision repair (NER), and Paul Modrich for mismatch repair (MR).

Base excision repair (BER) repairs damaged bases
1993: "Although DNA is the carrier of genetic information, it has limited chemical stability. Hydrolysis, oxidation and nonenzymatic methylation of DNA occur at significant rates in vivo, and are counteracted by specific DNA repair processes. The spontaneous decay of DNA is likely to be a major factor of mutagenesis, carcinogenesis and ageing." (Tomas Lindahl)
1. Base deamination
2. Base oxidation
3. Base alkylation

Spontaneous base deamination
E.g.
cytosine deamination (most frequent): happens spontaneously through hydrolysis. Cytosine is converted into uracil by nucleophilic attack of a water molecule at the carbon carrying the amino group; the amino group leaves and the resulting hydroxyl group is converted into a keto group. Functional consequence for base pairing: cytosine normally pairs with guanine, but uracil pairs with adenine. Result: transition mutation -> uracil has to be removed!
- Adenine → Hypoxanthine
- Guanine → Xanthine
- Methylcytosine → Thymine
Adenine and guanine can also be deaminated, but much less frequently. One additional base has consequences for gene expression: methylcytosine deaminates to thymine. If cytosine is methylated, the gene is suppressed -> after deamination there is no silencing any more -> the gene is expressed, which can be dangerous if the gene is e.g. an oncogene.

Base oxidation
- 1500 8-oxoG are generated per cell per day
Guanine → oxidation → 8-oxoguanine
Guanine can be oxidised at position 8 to 8-oxoguanine, which mispairs with adenine instead of pairing with cytosine → transversion mutation.

Base alkylation
Spontaneous base methylation by S-adenosylmethionine (SAM):
- 3-methyladenine: 1200 per cell per day
- 7-methylguanine: 6000 per cell per day
Alkylating agents (temozolomide): O6-methylguanine
SAM can induce the formation of 3-methyladenine or 7-methylguanine; alkylation can also be induced by different types of drugs.

Base excision repair
AP site = abasic site (apurinic/apyrimidinic)
1) DNA glycosylase cleaves the glycosidic bond between the damaged base and deoxyribose
2) AP endonuclease creates a nick in the phosphodiester backbone at the AP site, yielding 3’-OH and 5’-phosphate
3) Phosphodiesterase cleaves the phosphodiester bond and removes the AP site
4) DNA polymerase β fills the gap
5) DNA ligase III seals the ends
First the damaged base needs to be removed by enzymes called
glycosylases, which cleave the glycosidic bond between the base and the deoxyribose. There are different glycosylases depending on the damage; for uracil it is uracil glycosylase. Now we have an abasic (AP) site. An endonuclease cleaves the phosphodiester bond 5’ of the AP site, and a phosphodiesterase then cleaves 3’ of the AP site -> the whole nucleotide is gone = single-strand break, an intermediate of base excision repair (single-strand break repair is a subpathway of base excision repair). A polymerase then synthesises the missing nucleotide using the intact strand as a template, and finally the ligase seals the ends.

Why is base excision repair so important?
- 18,000 purines/cell/day spontaneously undergo depurination because the N-glycosyl linkage between the purine base and deoxyribose hydrolyses → spontaneous depurination
- 100-500 bases/cell/day spontaneously deaminate → spontaneous deamination

What about the phosphodiester bond?
Water is too weak a nucleophile -> nothing happens in DNA. RNA, however, is highly susceptible to degradation: DNA has no hydroxyl group at the 2’ position, but RNA has one → RNA is less stable than DNA.
- Under alkaline conditions the 2’-OH group of ribose is deprotonated and acts as a nucleophile that hydrolyses the phosphodiester bond

Nucleotide excision repair (NER) repairs UV lesions
- 15 min of sun exposure yields 10^5 (100,000) lesions/keratinocyte
Cyclobutane pyrimidine dimers (CPD) and (6-4) pyrimidine photoproducts (6-4PP): the most frequent types of linkage are pyrimidine dimers.
NER deals with different types of lesions; it has a broad specificity and recognises different types of DNA damage.

NER repairs bulky adducts
N-hydroxy-aminofluorene, aflatoxin B1 and benzo-a-pyrene attach to guanine.

NER repairs crosslinks
Cisplatin, often used in chemotherapy. Crosslinking agents cause different degrees of DNA distortion.

What do these different lesions do to DNA?
They distort the DNA: the lesions are bulky, so the duplex becomes unstable and takes on single-stranded character (the duplex is no longer tight). The distortion and the single-stranded character are recognised by the repair enzymes.

Structural basis of DDB2 binding to 6-4PP
Example of how the helix is distorted: the damaged region is single-stranded because it separates from the double helix -> this is recognised by the proteins.

Nucleotide excision repair
- XPC-RAD23B recognises lesions that thermodynamically destabilise DNA duplexes
- TFIIH consists of 10 subunits, including the helicases XPB and XPD
- XPA is the central component of the NER complex, ensuring that all NER factors are in the right place for the incision to occur
- XPF first makes the 5’ incision
- XPG makes the 3’ incision
The pathway is divided into two main categories:
- Global genome NER: operates all the time; different proteins recognise the different types of damage; recognition is always due to distortion
- Transcription-coupled NER: operates in conjunction with transcription; CSA and CSB travel together with RNA Pol II, stop transcription and recognise lesions -> allows NER to repair the lesion before transcription continues
After recognition, TFIIH unwinds the DNA to allow the enzymes to cut 5’ and 3’ of the damaged site, so that the stretch containing the damaged nucleotide is removed. The following steps are as in base excision repair (a polymerase fills the gap and a ligase seals it).

Why is nucleotide excision repair so important?
- Congenital mutations in XP (xeroderma pigmentosum) genes (James Cleaver, 1968, and Richard Setlow, 1969)
- 1 in 250,000
- Clinical manifestations:
  o Skin changes (light sensitivity)
  o Skin cancer
If the genes involved in the pathway are mutated -> UV lesions cannot be repaired -> patients can develop XP and skin cancer.

Mismatch repair (MR) repairs replication errors
Mismatches during replication
- Transition mismatches are repaired more efficiently
- T-G, G-T and C-T mismatches are most frequent
Small insertions or deletions caused by strand slippage in long repetitive sequences
- Small insertions or deletions escape proofreading by replicative polymerases
Why are these 3 mismatches the most frequent? All three contain thymidine, because it is the most abundant.
In repetitive regions the polymerase slips, which can lead to an insertion; this is usually not recognised by the polymerase because it escapes the proofreading activity.

Mismatch repair efficiency
- Deletions are repaired most efficiently

Mismatch repair
- MutS = Msh2-Msh6 heterodimer, recognises the mismatch
- ATP and mismatch binding induce a conformational change in MutS such that it forms a clamp that can move along DNA and recruit MutL
- MutL = Mlh1-Pms2 heterodimer, is activated by PCNA to incise the nascent strand
- PCNA enables strand discrimination as it is loaded asymmetrically at replication forks
- EXO1 degrades the incised strand
First the damage needs to be recognised by MutS -> this leads to a conformational change that switches the dimer into a sliding clamp, which can translocate along the DNA and recruits the MutL dimer. MutL has a nucleolytic activity -> it cleaves the DNA 5’ and 3’ of the mismatch. How do we know which strand is the newly synthesised one (the one that carries the error)? (In bacteria the new strand is marked by its missing methylation, but here?)
PCNA is loaded asymmetrically and so marks the new strand. EXO1 chops off the oligonucleotide that contains the mismatch; then, again, a polymerase fills the gap and a ligase seals it.

Why is mismatch repair so important?
- Replication fidelity: 10^-10 errors/nucleotide
- Without mismatch repair: 10^-7 errors/nucleotide
- Mismatch repair-defective tumours are characterised by microsatellite instability, as mismatch repair is crucial for the repair of deletions or insertions resulting from replication of repetitive sequences
- Hereditary nonpolyposis colorectal cancer (HNPCC) is caused by mutations in mismatch repair genes (Msh2 and Mlh1)
Without MR, replication fidelity is reduced 1000-fold. MR is mutated in colorectal cancer; microsatellite instability means that small insertions and deletions appear in repetitive regions.

Double strand break repair
Different sources: radiation etc., but also replication forks meeting single-strand breaks that weren’t repaired: there is an SSB, the replication fork arrives and collapses, resulting in a double-strand break.

DNA double strand breaks (DSBs)
Topoisomerase inhibitors: topoisomerases resolve positive and negative supercoils. There are 2 types: Topo I makes SSBs and Topo II makes DSBs. Inhibitors freeze the topoisomerase after it has cleaved the DNA; the break does not get resealed -> SSBs and DSBs accumulate.

VDJ recombination
Process for generating antibodies. The V, D and J regions are part of the variable region of the antibody; it is variable through reshuffling of these elements. How?
Programmed induction of double-strand breaks by the RAG nuclease (an enzyme), then recombination and repair by non-homologous end joining. When NHEJ is mutated and not working -> immunodeficiency.

Sensing double strand DNA breaks (DSBs): PARP1 and MRN
First step: recognition and sensing of DSBs. Two important protein complexes: PARP1 and the MRN complex. PARP1 is an enzyme that creates a posttranslational modification; it modifies itself first and then deposits the modification on its substrates. At the site of damage the closest substrates are the histones. When histones get PARylated, this leads to chromatin relaxation and nucleosome disassembly -> important because it gives all the factors needed for repair access to the damage.

Poly(ADP-ribosyl)ation
The cofactor is NAD, which consists of ADP-ribose and nicotinamide. PARP1 takes the ADP-ribose and attaches it to a protein acceptor, then generates glycosidic bonds between ADP-ribosyl units (up to 200 units of ADP-ribose) to create long chains, which can be linear or branched.

PARylation-induced chromatin relaxation
Electron microscopy of PARylated pancreatic chromatin

DNA damage sensing and processing: MRN (Mre11/Rad50/Nbs1)
MRN consists of Mre11, which recognises the DSB; Rad50, which upon binding changes to a parallel conformation that is important for keeping the ends of the DSB together, so the DNA doesn’t dissociate; and Nbs1, which is critical for signalling. How?

Signalling DSBs: Phosphorylation
- NBS1 recruits ATM kinase
- ATM kinase phosphorylates histone H2AX
- MDC1 binds gammaH2AX
- NBS1 binds MDC1 and propagates gammaH2AX
NBS1 recruits ATM kinase, the central kinase because it has multiple substrates (thousands, regulating DNA repair, gene expression, apoptosis and so on). ATM kinase then phosphorylates H2AX, which is a histone variant. Why is it used as a marker of DSBs?
Because gammaH2AX recruits the reader protein MDC1, which recruits another MRN complex, which recruits ATM, which phosphorylates more H2AX -> this goes on and on, far away from the damage site. There are antibodies which recognise phosphorylated H2AX.

Methods to quantify DSBs
- H2AX is a histone variant
- H2AX is phosphorylated by the ATM kinase in response to DNA damage
- The phosphorylation spreads >100 kb away from the sites of damage
- It recruits DNA damage signalling proteins (MDC1)
These antibodies can be used to detect or count DSBs, e.g. after exposure to radiation: the cells are irradiated and the nuclei stained with an antibody that recognises gammaH2AX. (With only a few phosphorylated histones one wouldn’t be able to detect the break; the spreading makes it visible.) But the method is not good for following the repair process; for that, use:

Methods to quantify DSBs
- Comet assay: visualise and analyse comet tails, which come from DNA breaks; a method to analyse the number of breaks in a cell. Do not extract the DNA: it would be sheared and more damage would be created. Instead, take whole cells and embed them in an agarose gel on top of a microscope slide. Treat the cells to lyse and degrade proteins, membranes etc., under either neutral conditions (mainly DSBs) or alkaline conditions (SSBs as well; under alkaline conditions the DNA is denatured into single strands), so that only DNA is left -> put the slides into a special electrophoresis chamber -> the DNA migrates in the electric field -> stain the DNA and visualise it -> no damage in the cell = dot, damage = comet tail. Why?
The DNA does not migrate synchronously, because longer fragments migrate more slowly than short fragments -> more fragments = larger tail.
- Pulsed-field gel electrophoresis (PFGE): special gel electrophoresis with 2 cathodes and 2 anodes; the field can be switched and the angle changed, which helps to separate large fragments (genome level). Picture of the gel: the first lane is not a marker but a genome (1 million bp) that was cut with a restriction enzyme to make the fragments smaller, so that DSBs can be analysed. In the second lane the genome was irradiated and is completely fragmented. It was rebuilt over time -> after 3 hours all fragments had been completely reassembled.

γH2AX kinetics
Why is γH2AX not a good marker to monitor the repair of DSBs? After 1 Gy (dose of radiation) there still appears to be quite a lot of damage even after 10 or 15 h. If we start with e.g. 20 DSBs, it looks as if we still have 5 DSBs even after 25 h, but we do not! Problem: γH2AX is not removed at the same time as the DSBs are repaired; it lags behind.

γH2AX vs PFGE
Comparison of γH2AX with pulsed-field gel electrophoresis in the same cell type, 20 Gy for PFGE and 1 Gy for γH2AX: by PFGE (20 Gy), 80% of the DSBs were repaired after 2 h, while by γH2AX (1 Gy) only about 30-40% appeared repaired after 2 h. γH2AX is only reliable for counting DSBs, not for following their repair.

Signalling DSBs: Ubiquitination
- MDC1 recruits the ubiquitin ligases RNF8 and RNF168
- UBC13/RNF8 ubiquitinates H1 and recruits RNF168
- RNF168 ubiquitinates H2A
- BRCA1 binds H2Aub; 53BP1 binds H2AK15ub and H4K20me2
We just talked about phosphorylation of H2AX to γH2AX as signalling, which is only the first step. The second step is ubiquitination: MDC1 also recruits UBC13/RNF8, which ubiquitinates histone H1 and recruits RNF168, which ubiquitinates H2A. The polyubiquitination of H2A serves to recruit the BRCA1 complex, which signals homologous recombination; in combination with other histone modifications it recruits 53BP1, which signals NHEJ (again, the details are not important for the exam!). Through histone modifications, different repair pathways are signalled.
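
The recruitment order described above can be written down as a simple chain, which makes the logic of a depletion experiment explicit: everything upstream of the depleted factor should still be detected, everything downstream should be lost. A minimal sketch in Python, assuming a strictly linear chain (the feedback loop in which MDC1 recruits further MRN and ATM is ignored). The names CASCADE and predict_signals are made up for illustration, and the step labels are only shorthand for the steps in these notes.

```python
# Simplified, linear order of DSB signalling steps taken from the notes.
# Assumption: no feedback loops and no parallel branches are modelled.
CASCADE = [
    "ATM",          # recruited by NBS1
    "gammaH2AX",    # H2AX phosphorylated by ATM
    "MDC1",         # reader of gammaH2AX
    "RNF8",         # ubiquitin ligase recruited by MDC1
    "H1ub",         # H1 ubiquitinated by UBC13/RNF8
    "RNF168",       # recruited after H1 ubiquitination
    "H2Aub",        # H2A ubiquitinated by RNF168
    "BRCA1/53BP1",  # readers that signal HR or NHEJ
]


def predict_signals(depleted, cascade=CASCADE):
    """Return {step: True/False}: is the step still expected when `depleted` is missing?

    Steps upstream of the depleted factor stay True; the depleted factor
    and everything downstream of it become False.
    """
    if depleted not in cascade:
        raise ValueError(f"unknown step: {depleted}")
    cut = cascade.index(depleted)
    return {step: position < cut for position, step in enumerate(cascade)}


if __name__ == "__main__":
    for step, present in predict_signals("RNF8").items():
        print(f"{step:12s} {'present' if present else 'lost'}")
```

Read this way, a factor is placed in the pathway by finding the markers that survive its depletion (upstream) and those that disappear (downstream).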
How to determine if there are any defects in DNA damage signalling?
WB: pATM, pSMC1, pChk2
IF: MDC1, 53BP1, BRCA1

How to determine if your protein is involved in DNA damage signalling?
Check whether the signalling step is disturbed when the protein is degraded, depleted or knocked out. Example: RNF8, which is necessary for the ubiquitination of histones and functions downstream of gammaH2AX and upstream of BRCA1. How was that determined? RNF8 was silenced and the cells were stained by immunofluorescence: what happens to gammaH2AX and BRCA1? After silencing, the amount of gammaH2AX stays the same but BRCA1 decreases.

Repairing DSBs
- Non-homologous end joining (NHEJ) and homologous recombination (HR)
- Homologous recombination (HR) cannot occur during G1
NHEJ can function throughout the whole cell cycle, but HR can only function in S phase and G2, because HR needs a sister chromatid, which only exists after replication.

Non-homologous end joining (NHEJ)
i. Ku70/Ku80 bind to DSB ends
ii. Ku70/Ku80 recruit DNA-dependent protein kinase (DNA-PK), DNA ligase IV (LIG4), XRCC4 and XRCC4-like factor (XLF)
iii. DNA-PK phosphorylates Artemis to stimulate its nuclease activity, required for DSB end processing
iv. LIG4/XRCC4/XLF ligate the ends
The Ku70/Ku80 proteins recognise the ends of the DSB and recruit the kinase DNA-PK and other proteins. DNA-PK phosphorylates and activates Artemis, which trims the ends in case there are damaged nucleotides etc. The LIG4/XRCC4/XLF complex ligates the ends together.

Homologous recombination
i. PARP1 and MRN sense DSB ends and recruit CtIP-BRCA1 to the DSB end
ii. MRE11 and CtIP resect DSB ends (short-range resection)
iii. EXO1, DNA2 and BLM mediate long-range resection to generate a long 3’ ssDNA tail coated by RPA
iv. BRCA2 displaces RPA from ssDNA. RAD51 forms filaments on ssDNA
v. The RAD51 nucleofilament invades the homologous duplex and forms a D-loop
vi. Polymerases extend the invading strand using the invaded donor molecule as a template. A double Holliday junction is formed
vii. Nucleases (SLX4, MUS81-EME1, GEN1) resolve Holliday junctions, generating crossovers, or the BTR complex (BLM-TOPO3-RMI1-RMI2) dissolves Holliday junctions (no crossovers)
It starts with sensing of the DSB by PARP1 and MRN. MRE11 and CtIP then perform short-range resection, followed by long-range resection, which creates long regions of ssDNA coated by RPA. BRCA2 removes RPA and allows the association of RAD51, which invades the homologous duplex to form a D-loop. The polymerase then copies the missing sequence, creating a double Holliday junction -> can be resolved to a crossover (CO) or dissolved to a non-crossover (NCO).

Why is DSBR so important?
- DSBR is required for faithful repair of DSBs, the most lethal form of DNA damage
- NHEJ is required for generating antibodies by V(D)J recombination and class switch recombination (CSR)
- Mutations in NHEJ lead to severe immunodeficiency
- Mutations in HR lead to cancer and premature ageing
Risk of cancer due to BRCA1/2 mutations -> about 50% for breast or ovarian cancer!

How to know if your protein is important for the DNA damage response?
- MTS assay: to check whether a protein is important, knock it out or silence it and then see whether the cells are sensitive to DNA damage.
Put cells in a 96-well plate, treat them with a DNA-damaging agent, let them recover for a few days and then apply MTS. Metabolically active (healthy) cells reduce MTS to formazan using NADH -> colour change from yellow to purple -> measure with a spectrophotometer -> determine survival.
- Colony formation assay: plate cells on a dish at very low numbers (with too many cells the colonies are not separated and are hard to count), expose them to DNA-damaging agents and let them recover for 1-2 weeks. Surviving cells form colonies, which are stained with crystal violet and counted.

Does it form ionizing radiation-induced foci (IRIF)?
If your protein is involved, it will be recruited. Some proteins, like BRCA1 and RAD51, form foci because many molecules are recruited; this is not possible for proteins that are recruited in low numbers (1 or 2), like Ku70/Ku80.

And if it doesn’t? Chromatin immunoprecipitation & laser-induced micro-irradiation microscopy

ChIP
Crosslink proteins to the DNA using a crosslinking agent, shear the DNA, immunoprecipitate your protein of interest (POI) with a specific antibody, then degrade the protein proteolytically. Result: the DNA that was bound to the POI; analyse the pieces by qPCR or sequencing. What does that mean in the context of DSBs? Use restriction enzymes → ChIP at DNA damage site(s):
- I-SceI: has a recognition site that is not present in our genome; introduce the site first and create a cell line, then add I-SceI to induce damage, then ChIP-qPCR
- I-PpoI (200-300 sites per human genome): the recognition sequence is present in the human genome (no introduction needed); add the enzyme, then ChIP and qPCR with primers that flank the recognition site (more efficient). An amplified region means that the protein was bound to that region. By taking cells at different time points, the kinetics of recruitment can be determined.

DIvA (DSB inducible via AsiSI) cell line
- AsiSI: 8 bp recognition sequence
- Ca. 1000 sites in the human genome, 10% efficiently cleaved
The cell line expresses AsiSI fused to the oestrogen receptor, which keeps it in the cytoplasm. (Inducible expression systems are leaky, meaning there is baseline expression; AsiSI acts in the nucleus, so the fusion to the receptor makes sure it is not active before induction.) After addition of a ligand that binds the receptor, the fusion protein moves into the nucleus.
- DIvA ChIP
Use the system genome-wide to see which repair pathway is active at which break and how the epigenetic context matters: immunoprecipitate RAD51 for HR and XRCC4 for NHEJ, then analyse the peaks to see which protein was recruited.

Laser-induced micro-irradiation and IF
Very quick, but cannot be done genome-wide (microscopy-based system); good for inducing damage in a highly localised manner. PARG-YFP in the nucleus: induce damage, follow recruitment by fluorescence; the kinetics can also easily be determined.
(Prof. Slade said a few times that only the systems need to be learned, not the exact protein names!)

Lecture 2: NGS - Next generation sequencing I (Andreas Sommer)
What is NGS? How does it work? What is it good for?

DNA/RNA Sequencing
- Any method that can be used to determine the specific order of the 4 nucleotides in a strand of DNA or RNA.
- Key method in (molecular) biology
- Service offered by many companies and core facilities

NGS: Historical development
At the end of the 19th century it became clear that the nucleus is involved in heredity, and chromosomes were described. In the 1940s DNA came into the picture, and in the 1950s Watson and Crick discovered the double-helix structure. tRNA was sequenced by chopping it into small pieces and analysing them physically (UV absorption, mass, charge etc.). Maxam and Gilbert, and independently Sanger, developed the first useful DNA sequencing methods; the Maxam-Gilbert method was messier and not as good as Sanger’s.

Central dogma, genetic code (1950-1960)
The double strand opens up to replicate; DNA -> transcription to RNA -> translation to protein. Then the triplet genetic code was discovered.

Sanger sequencing (1977)
- Dideoxynucleotides (ddNTPs)
- Random termination of polymerase elongation
- 4 reactions per sample
- Radioactive labelling
- Visual readout
- Max. 1 kb read length
We have a sample/template, which has to consist of a single DNA fragment and must contain a stretch of known sequence so that a primer can bind. Whenever the DNA polymerase incorporates a ddNTP into the growing chain, it stops because it cannot add any more nucleotides. The sample is split into 4 reactions; polymerase and dNTPs are added to all of them, and a different ddNTP (ddATP, ddTTP, ddCTP, ddGTP) to each, so that the chain terminates wherever that ddNTP is incorporated -> each reaction contains all possible fragments ending in that base. The fragments are separated on a big electrophoresis gel run for days -> bands are visualised through the radioactively labelled nucleotides and the gel is read by eye.

Automated Sanger sequencing (1986)
- Fluorescently labelled nucleotides
- Capillary electrophoresis
- 1 reaction per sample
- 96 samples in parallel
- Laser detection
- Digital output
No more radioactivity; instead fluorescently labelled nucleotides. The reactions no longer need to be separated, because each nucleotide has its own dye; microcapillary
electrophoresis with detection by a laser beam -> digital readout with peaks. It was used for the first large projects.

Human genome project (1990-2003)
- First full human genome
- Large sequencing centres from 6 countries
- Hundreds of dedicated scientists
- Approx. 10 years to the first draft, 13 years project duration
- > 3 billion USD
→ demand for faster & cheaper sequencing technologies
Shortly afterwards the market reacted:
"A new generation of non-Sanger-based sequencing technologies has delivered on its promise of sequencing DNA at unprecedented speed, thereby enabling impressive scientific achievements and novel biological applications." Nature Methods: Method of the Year 2007
It became possible, and then common, to sequence thousands (nowadays millions) of fragments in parallel. Massive parallelisation of sequencing reactions -> NGS, high throughput.
Sequencing costs, NovaSeq X (Illumina) specs: 1.3 million €, 20,000 human genomes per year, max. 50 billion reads, 16 Tb per run, 200 dollars per human genome (newly announced).

NGS vs. Sanger
Sanger sequencing has been supplanted by "Next-Gen" sequencing methods, especially for large-scale, automated genome analyses. However, the Sanger method remains in wide use, primarily for smaller-scale projects.
Small-scale experiments:
- Cloning → check
- PCR → check
- Small genomic regions: HLA, rRNA, …
Costs per sample: NGS >200 €, Sanger approx. 5 €
Sanger was only replaced by NGS for larger projects!

Sequence is…
- Storage of genetic information
- Information flow
- Structure (folding, looping, etc.)
- Function (e.g. ribosome, tRNA)
- Interaction (DNA-DNA, DNA-RNA, DNA/RNA-protein)
- Beyond AT/UGC (DNA/RNA base modifications, epigenetics)
A sequence is not only genetic information; it also determines structure, which is important (tRNA!), function and interactions (which depend on sequence specificity for binding etc.);
it can also carry additional information such as modifications and epigenetic marks.

NGS Glossary
- Fragment
  o A short stretch of nucleic acid resulting from the fragmentation of longer stretches. The required size of a fragment is specific to the type of experiment and the capabilities of the sequencer.
- Read
  o Data output from the analysis of a single fragment (sequence).
- Read length
  o The number of bases read per fragment, i.e. the maximum length of fragment that can be sequenced at a time (given in bases).
- Read depth
  o Number of times a nucleotide is read.
- Coverage
  o Average read depth.

Summary: history and development
NGS…
- Evolved from Sanger sequencing
- Means the massive parallelisation of sequencing reactions
- Delivers up to billions of reads per experiment
- Substitutes Sanger sequencing for large(r) projects
- Makes formerly prohibitively expensive experiments affordable

Many technologies available? Not really…
Sequencing Platforms/Technologies
Currently there are only 4 different platforms, each with its own sequencing method; the ones in development are all based on sequencing by synthesis.

NGS Technologies
2nd generation: amplification (of signal) required
- High throughput: 1M-50B reads
- Short reads: 50-600 bp
- MGI: DNA nanoball sequencing
- Illumina: sequencing by synthesis
3rd generation: single-molecule sequencing
- Low throughput: 1k-10M reads
- Long reads: 1 kbp - 1 Mbp+
- PacBio: SMRT sequencing
- ONT: nanopore sequencing
One way to classify current technologies: do they need amplification?
PacBio and Nanopore sequence single molecules, while MGI and Illumina amplify the material. Sequencing single molecules has a drawback: it is very difficult to achieve the same throughput as with amplification. On the other hand, very long reads are possible. An amplified signal is a mixed signal that comes from many molecules and deteriorates over time; this is not the case when you look at one molecule.

Sequencing Experiment
Select molecules of interest → Prepare library → Sequence → Bioinf. analysis
- Sequencing library
o A set of nucleic acid fragments which has undergone all processing steps and is ready for actual sequencing.
How do we do it? First select the molecules of interest; the classical way is to extract DNA/RNA from the cells. Then library preparation: generate a sequencing library, i.e. the mix of molecules we want to analyse. Then sequencing, and lastly bioinformatic analysis of the result.

Sequencing by Synthesis (Illumina)
- Market leader; >75% of all sequencing systems
- Benchtop (MiSeq, iSeq) to population screening systems (NovaSeq)
- Short read sequencing (36-600 bp)
- >200 protocols/applications available
There are many different Illumina instruments that differ in price, output and run time.

Sequencing by Synthesis: Library preparation by ligation
The easiest way to prepare a library is by ligation. We start with fragments, e.g. chopped genomic DNA, cDNA fragments, RNA, the result of a ChIP, etc.; they should be smaller than 1 kb. The ends are repaired and made blunt, a 3’ A tail is added and adaptors are ligated to the fragments -> the insert is our fragment of interest, flanked by adaptors. The adaptors are usually Y-shaped to avoid adaptor-adaptor dimers. The fragments are then amplified, which linearizes them and enriches the ones that are flanked by adaptors -> sequencing

Sequencing by Synthesis: Adaptors, Multiplexing, Paired End reading
- P5/P7
o Flow cell binding sites -> Clustering (amplification on flow cell surface)
- Index 1, 2 (barcodes)
o Unique sample identifiers -> Multiplexing
- Multiplex
o A library containing various samples labelled with barcodes/indices.
- Rd1 SP, Rd2 SP
o Read 1 and Read 2 sequencing primer binding sites -> initiation of sequencing
- Paired-end (read) sequencing
o A method of reading a fragment where the fragment is first read from one end and then from the other.
The P5 and P7 ends bind to the surface of the flow cell and are also used to amplify the material. An Illumina run is extremely powerful and delivers far more output than a single sample needs. Multiplexing: mix many samples, sequence them all together and separate the reads by index. Then there are the Read 1 and Read 2 sequencing primer sites; the polymerase needs a primer to start synthesis. Illumina introduced paired-end sequencing, which doubled the sequence output from one day to the next: after the first read the fragment is flipped, a complementary strand is made and read from the other end -> we can read 300 bases from one side and, after flipping the fragment, another 300 bases from the other side, doubling the sequencing output of one reaction.

Multiplexing (available for all methods)
- Available for all sequencing methods!
E.g. we have 2 samples that get different indices (index 1: CATTCG, index 2: AACTGA). Pool them -> sequence them -> sequence output to a data file in which each entry is composed of an index and a read -> the demultiplexer sorts the data by index into the 2 samples -> further processing

Sequencing by Synthesis: Library preparation by Tagmentation
- Transposomes
o Hyperactive Tn5 (transposase) with reduced binding specificity, loaded with adapter sequences
Second method for library preparation: tagmentation. It uses transposomes, built on a bacterial enzyme called transposase that cuts and pastes transposable elements. In the cell it is highly regulated because transposition can destroy the genome. For this method it was made hyperactive and less dependent on its binding sequence -> it binds everywhere, cuts the DNA and inserts the sequence it is loaded with (part of the adaptor sequence) into the fragments. The indices and P5/P7 are then added by PCR -> full library. Cheap and fast, fewer steps (more widely used).

Sequencing by Synthesis: Library Preparation by Amplification
Third method to prepare a library: amplification, the simplest one. Just PCR the region of interest with primers that carry an overhang (part of the adaptor sequence) to create the flanks, then a second PCR to add the rest of the adaptor. Very fast and cheap, but it only works if you know the flanking regions and if the region is not too big. Used e.g. for 16S libraries: the primers bind in regions conserved across species and the reads extend into the variable region, which allows the reads to be assigned to species.

Sequencing by Synthesis: Loading
Before sequencing we need to prepare the flow cell: a glass slide with channels, each with an inlet and an outlet.
The surface of these channels is coated with oligos that are complementary to the adaptor sequences. The adaptor-flanked fragments are introduced and hybridise to the surface of the flow cell; the oligo serves as primer for a polymerase reaction that creates a complementary strand -> the original strand is removed.

Illumina: Bridge Amplification, Clustering
The signal of a single fragment is not strong enough to be detected -> bridge amplification: the free end of the fragment also hybridises to a flow cell oligo, which again serves as primer for amplification. The strands are linearized again, meaning the 2 strands are now each bound to one oligo. This is repeated -> cluster formation: thousands of copies of the same fragment -> all are linearized; now the sequencing reaction is started by binding the sequencing primer.

Illumina: Sequencing by Synthesis
Max. 2x300 bp (MiSeq), up to 10 billion colonies per flow cell (NovaSeq)
Now the sequencing; thousands of molecules per cluster do the same thing at the same time. Special nucleotides are used: they are fluorescently labelled, the label is chemically cleavable, and they carry a 3’ block that stops the reaction, similar to ddNTPs, but in Illumina the block is removable. The polymerase binds, incorporates e.g. a T opposite an A and stops -> the cluster can be imaged. Depending on which nucleotide was incorporated we see a different colour (different wavelength), which is tracked and recorded. Then the fluorescent dye and the block are cleaved off -> the next nucleotide can be incorporated, and so on. Sometimes there are errors: the polymerase cannot bind, or 2 nucleotides are incorporated, etc. -> some molecules are ahead or lag behind and we get a mixed signal -> after 300 bases the signal gets really bad.

Illumina: Sequencing by Synthesis / results
We expect 1 error in 1000 = base call accuracy 99.9%
Billions of colonies as result. 2 possibilities: random clustering and patterned flow cells. On patterned flow cells the fragments can only start at specific positions -> higher densities of colony formation. With random clustering, colonies can grow into each other; these are excluded. It can happen that you have millions of colonies but only 10,000 reads. Why? Because of bad quality.

PacBio: Single Molecule Real Time Sequencing
- Long read platform
- 3rd generation: single molecule sequencing
- Continuous polymerase activity (monitored in real time)
A long read platform without amplification: single molecule, real time. The polymerase activity is recorded as a video, not as images. There are millions of small wells, each of which sequences one DNA molecule.

PacBio SMRT: Library Preparation (Ligation)
- SMRTbell library: circular adapters
- Pre-binding of polymerase
- Insert size: HiFi 20 kb, CLR 30-150 kb
The most typical library preparation, very similar to the ligation protocol of Illumina, but with 2 main differences: the adaptors are circular, and the polymerase is pre-bound before the DNA is loaded onto the chip.

PacBio: Single Molecule Real Time Sequencing
The golden surface of the SMRT cell consists of millions of tiny wells, each of which fits a single polymerase with a single DNA molecule. The polymerase attaches to the bottom of the well and fluorescently labelled nucleotides are added. Whenever the polymerase incorporates a nucleotide, the fluorescent label is released and a small pulse of light is emitted, again at a wavelength that depends on the colour/nucleotide.
Because the adaptors are circular, the polymerase opens up the double-stranded library molecule into one circle, so the DNA can go round in circles and be sequenced several times.

PacBio SMRT sequencing: CLR vs. CCS (HiFi)
- Continuous long reads (CLR): longest read mode
- HiFi reads: long accurate reads
The fragment moves around and is read by the polymerase. Advantage: the polymerase will introduce errors at some point, and these errors can be identified when the same molecule is sequenced several times -> a consensus sequence can be built and the errors removed -> high accuracy.
With this sequencing method you can go either for long reads or for high accuracy.

PacBio SMRT sequencing: Data
- Half of data in reads: >30,000 bp
- Output per SMRT cell: up to 10 Gb
- Reads per SMRT cell: around 400,000
Read length distribution: sometimes the polymerase reads for a very long time, sometimes it stops very quickly. A typical quality control parameter is the N50: the read length at which 50% of all sequenced bases are contained in reads of this length or longer.

Short reads vs. long reads
Advantages of long reads:
- Access to high GC content regions
- Resolution of complex regions of the genome (e.g. MHC)
- Correct mapping of repetitive regions
- Resolution of structural variation
- Differentiation of paralogous regions
- Phasing (resolution of allele sequences)
Trade-off: throughput (# of reads), cost per Gb
Putting together long reads is much easier than putting together short reads; long reads give access to regions with high GC content, complex regions, repetitive regions, and so on.

ONT: Nanopore sequencing
- Long read platform
- 3rd generation: single molecule sequencing
- Biological nanopores embedded in a biological membrane
- Only portable system (MinION)
First sequencing instrument that is portable. The flow cell is disposable; after the run it is thrown away. The MinION is connected to a computer via USB: load the flow cell and start sequencing.

ONT: Nanopore sequencing
- Motor protein binds to the pore and slows transport through the pore
- Tether sequence anchors the fragment on the membrane surface
A library for nanopore sequencing is created either by ligation or with transposases. Here we need a tether sequence, which helps the fragment attach to the membrane, and a motor protein, which has two functions.

ONT: Nanopore sequencing
- Better pores lead to more accurate output
- 3rd generation: single molecule sequencing
The motor protein attaches to the pore to move the DNA through it, and it acts as a brake: without the motor protein the DNA would pass so fast that it could not be detected. With the slower passage the different nucleotides can be detected and distinguished.
The nucleotides are read in a completely different way: we look at the current that flows through the pore. Whenever a nucleotide passes through, the current changes, and each nucleotide changes it in a different way. The signal is a mixture of about five nucleotides that are in the pore at the same time and all change the current -> poor base-calling accuracy (many errors compared to other NGS methods).

ONT: Nanopore sequencing: results
Reads of 2 Mb: chromosomes can be sequenced from end to end.

MGI DNA Nanoball Sequencing
- Rolling circle amplification
- Linear amplification produces fewer errors than PCR
- Circularization encompasses sequencing primer binding site
At the centre of nanoball sequencing is the generation of DNA nanoballs, rather than ligation, tagmentation or PCR as above. You circularize your sequence of interest together with an insertion that carries a sequencing primer binding site and perform a rolling circle amplification. This is not exponential: every copy starts from the same template, so there are fewer errors than with PCR. The copies stay attached to each other -> DNA nanoball

MGI DNA Nanoball Sequencing
Take the DNA nanoballs and load them on a flow cell that is patterned with an array of spots where the DNA nanoballs can bind and attach.

MGI DNA Nanoball Sequencing: CoolMPS
3 different sequencing methods: standard MPS, HotMPS and CoolMPS (standard and hot are similar to Illumina). CoolMPS is also sequencing by synthesis but uses different nucleotides: they carry the block but are not directly labelled. Instead, base-specific antibodies that recognise the blocked nucleotide carry the fluorescent signal. Once the base is incorporated and imaged, the antibodies are washed away, the block is removed, and the sequence is obtained by repeating the cycle.

NGS technologies comparison
Which instrument to use when?
Illumina needs amplification; costs are quite low, read lengths are short, but there are many different protocols and the throughput is high. PacBio has high costs but really long read lengths, few protocols and low throughput.

NGS applications by technology
- Whole genome re-sequencing, SNPs: Illumina, MGI, PacBio (phasing)
- Sequencing large genome rearrangements: PacBio, ONT
- Counting applications (ChIP-seq, RNA-seq): Illumina
- De novo whole genome assembly: PacBio, ONT
- Small genomes: any benchtop sequencer
- Fast, long reads, determining subspecies: ONT

Experimental considerations
- Sequencing method
- Sequencing instrument
- Read length / coverage
- Multiplexing
- Library preparation
- Technical and biological replicates, statistics
- Bioinformatics
Points to consider in a typical sequencing experiment: choose the method and platform, and the instrument depending on the throughput you need. Think about the read length and about how many samples you can put in one multiplex to sequence as efficiently as possible. Choose the library preparation. Think about statistics: do I need technical or biological replicates, and how many? And bioinformatics: the result files are big and specialized programs are required.

SUMMARY: Methods
- 4 relevant systems available (as of 2022): Illumina, ONT, PacBio, MGI
- Different sequencing methods: sequencing by synthesis, nanopore sequencing, SMRT sequencing, nanoball sequencing
- 2nd and 3rd generation sequencing instruments
- Differences in read length, throughput, cost
- Sophisticated bioinformatic requirements to analyse data

Lecture 3: NGS – Next generation sequencing II (Andreas Sommer)
What is it good for?

Sequencing Experiment: Flexible approach
NGS is a versatile and flexible technology!
Select molecules of interest → Prepare library → Sequence → Bioinf. analysis
- Variable input, selection of molecules of interest. >200 Illumina protocols
- Sequence any library which meets requirements.
- Continuous tool development.
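The planning questions from the experimental considerations (read length, coverage, how many samples per multiplex) come down to simple arithmetic. The following is a minimal sketch in Python, assuming the glossary definition of coverage as average read depth (sequenced bases divided by genome size). The function names and the example numbers are illustrative assumptions, not values from the lecture, and real runs lose reads to quality filtering and uneven pooling.

```python
import math


def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average read depth: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Number of reads required to reach a target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


def samples_per_run(reads_per_run: int, reads_per_sample: int) -> int:
    """How many equally sized samples fit into one multiplexed run."""
    return reads_per_run // reads_per_sample


# Hypothetical example: 3 Gb genome, 150 bp reads, 30x target coverage,
# and a run that delivers 10 billion reads.
per_sample = reads_needed(30, 150, 3_000_000_000)
print(per_sample)                                   # reads per sample
print(samples_per_run(10_000_000_000, per_sample))  # samples per run
```

Paired-end sequencing doubles the bases per fragment, so for a 2x150 bp run the effective read_length per fragment would be 300 in this sketch.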
Applications
With so many protocols we can nowadays target every cellular nucleic acid: mRNA, tRNA, ncRNA, cRNA, genome, modified bases, interactions between DNA-DNA, DNA-RNA etc. We can look at transcription, translation, dynamics of degradation etc. We can also compare cells and tissues with each other.

De novo Sequencing
- Challenges:
o DNA extraction (HMW DNA!) -> optimise protocols
o Sequencing errors/bias -> use of multiple platforms
o Assembly -> use long reads
o Annotation: genome annotation is a multi-level process that includes prediction of protein-coding genes, as well as other functional genome units such as structural RNAs, tRNAs, small RNAs, pseudogenes, control regions, direct and inverted repeats, insertion sequences, transposons and other mobile elements.
- Applied in:
o Research: Ecology, Evolution, Genetics
o Agriculture industry + animal breeding
o Biomedical research: improvement of assemblies of model organisms
First we need to extract the DNA. We need large fragments (high molecular weight DNA), so the size has to be retained throughout the extraction. Sequencing errors and bias: every system has limitations (read length, output, errors), therefore different platforms are used in one project to correct the readouts. Assembly: putting the fragments together; the shorter the reads, the more difficult. The assembly is completely useless without an annotation: an assembly is just a long stretch of nucleotides, and to make sense of it we need to find the regions in the genome that code for genes, are regulatory etc.

De novo Sequencing
Every dot represents one genome assembly of a different species; after 2010 more species started to be sequenced, after 2016 whole genomes.

A de novo Chromosome-Level Genome Assembly of the White-tailed Deer
One example: deer grow and shed their antlers regularly. Which genes are driving this?
Focus on the technology behind this. PacBio was used, which has 2 modes: one maximises the read length, the other is circular consensus sequencing, which reads the same molecule several times; here the fragment size is limited so that it can be read multiple times -> accurate reads. The SMRT cell is the unit that is loaded, Sequel is the current system, 24 h movies: you select how long the polymerase is watched in action. Hyper library construction: to correct the PacBio sequences, Illumina was used for short read sequencing, with a library prepared by ligation. Shotgun: chop the whole genome and sequence it. Chromatin conformation capture sequencing libraries were prepared using the Omni-C kit from Dovetail Genomics → new

Chromatin Conformation Capture Sequencing
Chromosomes are organized in an extremely densely packed way; the aim is to analyse the folding into different loops. Different protocols were developed that analyse different interactions: 3C, 4C, 5C, Hi-C, ChIP-loop etc.
- 3C: one vs. one interactions (can also be done with PCR)
- 4C: one vs. all; which interactions does this one locus have with the rest of the genome
- 5C: many vs. many
- Hi-C: all vs. all; currently a hot topic

Chromatin Conformation Capture Sequencing
- Captures and maps chromatin interactions.
- Used to understand cell division, transcriptional regulation, development and the correction of whole-genome assemblies (only intrachromosomal interactions expected).
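The all vs. all readout of Hi-C is usually summarised as a contact matrix: each read pair links two genomic positions, and the pairs are counted per pair of genomic bins. A minimal sketch in Python, assuming read pairs have already been mapped to positions on a single assembled sequence; the function name, the bin size and the toy pairs are illustrative assumptions and not part of any specific Hi-C software.

```python
def contact_matrix(pairs, bin_size, n_bins):
    """Count read pairs per pair of genomic bins (symmetric matrix).

    pairs: iterable of (pos1, pos2), the mapped positions of the two reads
    of one pair on the same assembled sequence.
    """
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos1, pos2 in pairs:
        i, j = pos1 // bin_size, pos2 // bin_size
        matrix[i][j] += 1
        if i != j:  # keep the matrix symmetric without double counting the diagonal
            matrix[j][i] += 1
    return matrix


# Toy example: a 3 kb sequence split into three 1 kb bins.
toy_pairs = [(100, 900), (150, 2500), (2100, 2900), (120, 180)]
for row in contact_matrix(toy_pairs, bin_size=1000, n_bins=3):
    print(row)
```

In a correct assembly most contacts are expected close to the diagonal, because positions that are near each other on the chromosome are crosslinked most often; strong off-diagonal signal hints at a misassembly.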
First the DNA is crosslinked, which creates bonds between DNA regions that are near each other. It is cut with a restriction enzyme that leaves overhangs -> short fragments. The ends are repaired and marked with biotin, then ligated -> circular molecule. The molecules are sheared and pulled down with beads that bind the biotin. This specifically selects molecules that carry biotin, and because the biotin was added where two molecules were ligated together, everything that is pulled down comes from two originally different molecules. Library preparation and sequencing: one read comes from one molecule and the other read from the other molecule. The result is a big file, usually represented as a plot -> now we know that this part of x and this part of y were next to each other when we did the crosslinking → we find out where DNA molecules interact with each other. It can also be used to correct whole genome assemblies (correct: only the diagonal in the plot).

De novo sequencing of human genomes
30x PacBio: on average every base was covered by 30 reads. 100x Illumina PCR-free sequencing: library preparation by ligation that omits the final PCR (1 PCR, not 2; small fragments are not over-amplified, which gives a slightly better result but is much more difficult to handle); to generate clusters the library still has to be amplified. In addition BioNano optical maps and single cell DNA template strand sequencing (Strand-seq, which can differentiate between Watson and Crick strands).

Optical Mapping (Bionano)
- No “real” sequencing method
- Produces maps of nicking enzyme binding sites.
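What such a map of nicking enzyme sites looks like can be sketched in silico: find every occurrence of the enzyme's recognition motif in a sequence and record the distances between neighbouring labels. A minimal Python sketch under stated assumptions: the motif GCTCTTC is only an example recognition sequence chosen for illustration, only the given strand is searched, and the function names are hypothetical.

```python
def label_positions(sequence: str, motif: str) -> list:
    """0-based start positions of every motif occurrence on the given strand."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions


def label_distances(positions: list) -> list:
    """Distances between neighbouring labels: the pattern an optical map shows."""
    return [b - a for a, b in zip(positions, positions[1:])]


# Toy sequence with two labelling sites (motif is an illustrative assumption).
toy_seq = "AAGCTCTTCAAAAGCTCTTCGG"
sites = label_positions(toy_seq, "GCTCTTC")
print(sites, label_distances(sites))
```

Comparing the label distances predicted from an assembly with the distances measured on the imaged molecules is what makes the optical map usable as a reference for assembly correction.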
Optical mapping is not really a sequencing technology; there is no base-by-base sequencing. Sequence-specific nicking enzymes produce nicks in one of the DNA strands. Taq polymerase is added and repairs the nicked strand by incorporating fluorescently labelled nucleotides; this part of the chromosome now lights up.
Nanochannels: take the densely packed, labelled DNA and apply a current. The DNA moves and has to pass pillars where it becomes more and more disentangled; the complex gets more and more linear until it is fully linearized. The molecules are imaged -> reference map for the assembly
➔ It is not only used for whole genome assembly correction but also to analyse structural variants, e.g. for cancer patients.

Ancient DNA Sequencing – Pääbo - Nobel Prize 2022
A High-coverage Genome Sequence from an Archaic Denisovan Individual
- Ancient DNA is heavily degraded.
- Long overhangs
- C -> U conversion of DNA.
- Minimal amounts of DNA.
- Contamination issues.
- Single strand library preparation yields substantially more library material and minimises adapter-adapter dimers.
- Short fragments -> short read sequencing
DNA from a Denisovan bone was sequenced, a new subspecies that had not been classified before. There were many problems, listed above, and to analyse the DNA he had to come up with a new library preparation. How does it work? Both strands are sequenced, which doubles the yield: normally adaptors are ligated to dsDNA to get one library molecule, but here the DNA is heated to separate the strands -> 2 library molecules. Splinter oligos are used: one carries random nucleotides at its end (which hybridise with the single strand), the other carries a biotin. The junction is ligated -> the biotin can be captured with beads. Then the full library preparation is finished and the library is sequenced. Long reads are of no use because the DNA is highly degraded.

Sequencing of populations
- Challenges:
o Library preparation and sequencing costs -> 2nd gen systems (max. output)
o Data management, data interpretation
- Applied in:
o Research: Ecology, Evolution
o (Pre-)clinical research
- WGS vs. reduced sequencing space
Challenges in sequencing populations: you want to sequence hundreds or thousands of individuals, analyse variants and link them to diseases etc. -> whole genome analysis gets very expensive -> ways to make it simpler: reduce the sequencing space you look at. 2 examples:

RAD-seq, Restriction site associated DNA
- Sequencing at restriction enzyme cutting sites.
- Many variations of the original protocol: ddRAD-seq, GBS, ezRAD, nextRAD
- Focusing on a subset of the genome (0.1-10%).
- Quick and cost-effective analysis of thousands of genetic markers in large populations.
First example, RAD-seq: only parts of the genome are sequenced, typically 0.1-10%. We only look at sites where a given restriction enzyme cuts: cut, size selection of the fragments, prepare libraries, add indices for multiplexing, sequence. Wherever the restriction enzyme cuts you should get reads; this is enough to categorize variants.

Exome sequencing
- Genome reduced to protein coding sequence.
- Requires well studied genome.
- Focusing on a subset of the genome (1%).
- Quick and cost-effective analysis of thousands of genetic markers in large populations.
Second example, also reducing the sequencing space, this time by hybridisation. Exons make up 1-2% of the genome. The sequence of the exons has to be known, so a well studied genome is required. Library preparation by ligation, then the sequences of interest are pulled down via hybridisation with oligo probes that are tagged with biotin -> capture and sequence the fragments bound by the target-specific probes (many mutations responsible for diseases are in the exons).

Characterizing microorganism communities
Metagenomics
- Challenges:
o Representing the true composition of a community of microorganisms (genome bias).
o Contamination
o Data interpretation
- Applied in:
o Research: Ecology, Pre-clinical (gut, skin)
o Biotechnology
Broad application range: everything in a specific environment is sequenced (ocean, hair, gut etc.), everywhere where microorganisms interact with each other. Communities are susceptible to change if the environment changes. There are two ways to sequence microorganism communities: amplicon sequencing and metagenomic (shotgun) sequencing. It is not easy to represent the composition of a community.

Characterizing microorganism communities
Shotgun sequencing: extract the DNA, fragment it, make a sequencing library, sequence it -> analysis of the data is tricky: multispecies assembly.
Amplification (16S, 18S, ITS): only specific regions that differ between species are analysed, classically 16S, 18S or ITS, which contain variable regions. The 16S gene is too long for short reads, so only 2 or 3 of the variable regions are sequenced. Use primers that hybridise to the conserved regions, sequence, classify the reads: much simpler. Nowadays PacBio is used more often because it can cover the full 16S region.

Conserved and variable regions
3rd gen sequencing for maximal resolution 16S sequencing

Re-Analysis of 16S Amplicon Sequencing Data Reveals Soil Microbial Population Shifts in Rice Fields under Drought Condition
The microbiome in plant and soil changes under drought: relative abundance of different species/organisms, shift over time during drought.

Comparing cells and tissues
- Many features which can be compared: genome (mutations, rearrangements), gene expression, methylation, chromosomal arrangement, …
- Bulk and single cell methods
- Most common NGS application
- Universally applied by all research areas
Bulk experiment: thousands or millions of cells are pooled and treated the same; the outcome is a mix of reads.

Cancer genome sequencing
SNP calling, rearrangements:
- Phenotype
- Adaptation
- Extreme genome aberrations
Rearrangements in cancer genomes can lead to extreme genome aberrations (SNP = single nucleotide polymorphism); also adaptation of a tumour to drugs.

RNA sequencing (RNA-seq)
mRNA-seq links genomic information and function!
This is one way to sequence RNA; there are also sRNA, microRNA, ncRNAs, etc. Standard method: mRNA has a poly-A tail, so poly-T oligos are used as primers. The mRNA is fragmented and primed, and the first cDNA strand is synthesized -> RNA/DNA hybrid. The RNA is removed and the second cDNA strand is synthesized. The fragments can now enter the standard ligation protocol and are sequenced. The more reads map to a gene, the more highly it is expressed; some genes are expressed similarly, some more and others less.

RNA-seq visualization
Different ways to visualize the data, e.g. volcano plot or scatterplot -> pattern analysis to make sense of it

Direct RNA sequencing (ONT)
Allows to sequence and analyse RNA modifications!
Oxford Nanopore has a protocol that sequences RNA directly. Adaptors are ligated directly to the RNA. A cDNA strand is synthesized, but only so that the library molecule can get down to the pore; the strand that is actually passed through the pore is the RNA strand. This is more difficult because RNA breaks more often than DNA, but since the RNA itself is sequenced we can also see the modifications, which are functional and would otherwise be lost -> the only protocol that can currently sequence RNA directly

Sequencing DNA base modification
- DNA base modifications are of functional relevance.
- Epigenetic layer of information.
- Best studied: 5-Methylcytosine, more than 30 described.
- Aberrant modifications in diseases (biomarker)
- Sequenced indirectly (2nd gen) or directly (3rd gen seq, signal distortion)
DNA base modifications can also be sequenced and can be used as biomarkers in diseases; cancer alters the methylation status of the DNA. With 3rd gen sequencing (PacBio and ONT) base modifications are sequenced directly: we get the signal, but slightly shifted. In PacBio, when the polymerase reaches a modified base in the template it incorporates the corresponding base but takes slightly longer because of the modification; since a movie is recorded, the time difference is visible. In Nanopore, a modified base passing through the pore gives a slightly different signal.
In 2nd gen sequencing this has to be done indirectly, because after amplification the base modification is gone. To detect 5-methylcytosine a bisulfite conversion is done: all unmethylated C are converted to U (this also happens naturally over time but is induced here) and are then sequenced as T. A methylated C is not converted and remains C. Compare the converted with the unconverted sequence -> determine the modified Cs

Single cell sequencing
Single cell sequencing examines the sequence information from individual cells with optimized next generation sequencing (NGS) technologies, providing a higher resolution of cellular differences and a better understanding of the function of an individual cell in the context of its microenvironment.
E.g. scRNA-seq, scATAC-seq, scDNA-seq
If you sequence many cells together (bulk) you get a blurry result; single cell sequencing gives a much more fine-grained picture.

Single cell RNA-seq
Low throughput: Smart-seq2
High throughput: 10x Genomics
Two methods. Low throughput: a single cell is put in each well; costly for many samples. High throughput: cells and reagents are encapsulated in small oil droplets and a barcode is added to each droplet; afterwards the emulsion can be broken.
And they can all be treated together – each droplet marked by barcode – each droplet one cell -> every cell marked, Single cell RNA-seq Identify cell types, in complexes you have many different cells in different stages that have different jobs, to identify superpopulations immune cell sequencing is the method of choice, you can reconstruct cell hierarchy – what is the fate of a cell type? Even populations are shifting over time, also you can interfering regulatory networks, Spatial transcriptomics Old: FISH or Y2H, New: in sito capture, most common used at this moment, take a section and put it on an array, has different spots where each spot carries different oligos, release the RNA from the section by opening up the membranes, flow into the spots where the oligos capture them, spots have an sample barcode and also an xy barcode telling us in which spot the RNA was captured and where exactly it comes from, ChIP-seq and other enrichment methods Method based on an enrichment, you can enrich via size selection, ChIP (antibody) or phenol-chloroform extraction (gelgradients) - ChIP-seq for Protein – DNA interaction – where binds protein – selection through antibodies - DNase-seq, ATAC-seq, Faire-seq; look for open chromatin - MNase-seq; mapping the positions of histones, Data integration Understanding the organization of a genomic locus by applying multiple sequencing methods Integrade different sequencing data, using different methods, protocols and put them together to understand the molecular mechanism, 28 VO Methods in molecular biology WS 2022 COVID-19 Applications - Full genome sequencing (genome, trace infection routes) enrichment by hybridization - RNA sequencing, patient lung tissue - Metagenomics, interaction with microbiome - Population screening (amplify regions of interest -> Sequence, monitor mutations) Summary: Applications - Hundreds of protocols available - Enrichment/hybridisation/selection methods add diversity - Most cellular nucleic acids can be 
analysed - Data integration from multiple applications - Low input methods allow sequencing of single cells - Spatial transcriptomics combines gene expression status and localization Lecture 4: Genome editing with a focus on CRISPR Methodology (Krzysztof Chylinski) Biological machines CRISPR cas9 is a biological machine – using it for humanity, PCR or restriction enzymes are natural things that we use in an professional way, as well as CRISPR cas9 CRISPR - Clustered Regularly Interspaced short palindromic Repeats - Adaptive immune system in >40% of bacteria and most archaea. - It uses short RNAs to target and guide endonucleases for destruction of invading nucleic acids. - It can memorize novel invading sequences to provide future immunity. Bacteria are attacked by phages etc. -> have adaptive immune system to protect themselves, it uses short RNA molecules to find specific DNA/RNA sequences and cut them – can memorize novel invading sequences, CRISPR How does it work? CRISPR disculstererd regulatory interspaced short palindromic repeats – describes a small locus that is composed of identical units -> repeats usually palindromic, and between them are unique sequences 20-30 nucleotides long, that derived from foreign genetic elements, upstream of this locus is an operon of cas genes (CRISPR associated), contain the whole machinery for CRISPR to function, because CRISPR is a mobile genetic element itself (can travel to another host f.e. 
by phages).
The repeat-spacer array is transcribed as a single transcript called pre-crRNA. The palindromic repeats form hairpins, and between them are the unique sequences. In a maturation step the transcript is processed, usually by a Cas RNase that cuts within each repeat. This yields small RNAs that each contain one spacer sequence flanked by parts of the repeats; they are now called crRNAs. Each crRNA associates with Cas proteins to form an executioner complex carrying one RNA.
Interference step: this RNA finds a matching nucleic acid, binds it and triggers the executioner complex to cut the target DNA, which causes its degradation.
Adaptation step: if a phage with a new genetic element arrives in a bacterial cell that has no CRISPR spacer against it yet, the cell has a chance (1/1,000,000) to cut out a piece of the foreign DNA and integrate it into the repeat-spacer array. Two proteins, Cas1 and Cas2, are responsible for this.
CRISPR-Cas: PAM sequence
Not only does the small CRISPR RNA have to be complementary to the DNA sequence; next to the binding sequence there must also be an additional small motif, called PAM (protospacer adjacent motif). It is a few nucleotides long and is not bound by the CRISPR RNA.
CRISPR diversity
CRISPR systems are extremely diverse. There are 2 different classes, Class 1 and Class 2, and within them different types with different sets of active proteins. In the class 1 types the executioner complex is formed by several proteins. For genome engineering class 2 is more interesting, because it is easier to use a single protein for the process rather than 6 different ones.
Type II mechanism of action
Different: the locus/operon is smaller. We have cas9, cas1, cas2 and csn2, i.e. 4 proteins. Cas1, Cas2 and Csn2 are mostly responsible for new spacer integration, while Cas9 does most of the job when it comes to cutting the DNA. The system also has a tracrRNA.
How does it work?
Transcription of the whole CRISPR array as one transcript gives the pre-crRNA. The tracrRNA is a small RNA that is complementary to the CRISPR repeats. It binds to each repeat, forming a dsRNA structure that is recognized by RNase III, which cuts within each of the duplexes. Cas9 recognizes the tracrRNA and protects part of it from degradation, and the transcript is processed into short CRISPR RNAs. We now have a three-component complex of Cas9, crRNA and tracrRNA. The crRNA carries the spacer, which recognizes and binds the complementary invading DNA; the tracrRNA is needed for structure. The result is DNA degradation.
Target cleavage: Cas9
- Cas9 is a large, multidomain protein.
- Cas9 has two endonuclease domains: central HNH and split RuvC domain.
Cas9 acquired different nuclease domains over time, RuvC and HNH, each of which cuts one DNA strand. It also has a PAM-interacting region and a recognition lobe.
Target recognition
The two upper strands are the target DNA, with a TGG (NGG) PAM sequence for Strep. pyogenes Cas9. Upstream of the PAM is the targeted 20-nucleotide sequence, which is bound to the 20-nucleotide spacer of the crRNA, while the repeat part of the crRNA is bound to the tracrRNA. The tracrRNA does not make contact with the target DNA, and no RNA makes contact with the PAM sequence.
Target recognition
- Target DNA sequence complementary to crRNA = protospacer.
- PAM: Protospacer Adjacent Motif
- Cas9 binds to PAM and opens the DNA.
- crRNA binds to complementary DNA forming an R-loop.
The spacer of the RNA binds the DNA by complementarity. Problem: the DNA sequence is already base-paired, because it is double-stranded! Solution: Cas9 recognizes the PAM sequence and jumps from one NGG to another NGG on the DNA. When Cas9 binds to a PAM it opens the DNA next to it (completely independently of the RNAs), so one of the strands becomes accessible for crRNA binding. Binding opens the DNA even more, and upon full binding of the crRNA to the DNA, Cas9 uses its 2 nuclease domains to cut the DNA upstream of the PAM. Cas9 cuts in a very precise way, making a blunt cut three nucleotides upstream of the PAM.
Cas9 orthologs
So far we have always talked about S. pyogenes Cas9 (NGG). Different species have different Cas9 proteins with different PAM sequences, different sizes etc.; a big Cas9 is hard to package into a virus.
Single guide RNA (gRNA)
A single RNA combining features of crRNA and tracrRNA. In nature it is a 3-component system: Cas9, crRNA and tracrRNA. This was made even simpler by combining crRNA and tracrRNA into one guide RNA.
From idea to application…
- July 2012: Jinek and Chylinski
o Emmanuelle Charpentier lab and Jennifer Doudna lab
▪ Cas9: dual-RNA-guided endonuclease using crRNA and tracrRNA
▪ Single guide RNA
▪ Proposed genome editing tool
- January 2013: Mali, Cong and Ran
o George Church lab and Feng Zhang lab
▪ Cas9 shown to be an active genome editing tool in human cell lines
Idea: use the 20 nucleotides that are derived from the spacer to program it to cut almost any DNA sequence, as long as it is next to a PAM.
Genome engineering
Programmable nucleases: ZFN and TALENs
- Modified crops, human cells, cell lines, mice, zebrafish, rats, worms and cattle.
- Proposed therapies, clinical trial for HIV therapy with ZFNs.
CRISPR-Cas could be adopted so fast because genome engineering tools already existed: zinc finger nucleases (ZFNs) and TALENs also allow specific DNA sequences to be targeted. ZFNs and TALENs are engineered enzymes (they do not occur in nature) composed of 2 domains: one is the programmable DNA-binding domain and the other is the DNA cleavage domain, a monomer of the FokI nuclease. On its own a monomer cannot cut DNA, but if we put 2 TALENs or 2 ZFNs together, the two FokI monomers form an active nuclease and cut.
In TALENs and ZFNs the DNA sequence is recognized by specific small domains that are put together in a specific order to recognize the DNA sequence. TALEN domains recognize single nucleotides, while ZFN domains recognize triplets. They were quite good and were used, so why was CRISPR a revolution? It does the same thing, but faster, cheaper and more easily.
- Most of the things we do now with CRISPR were possible before, but were too expensive with the previous genome engineering methods!
DNA break repair
DNA break repair can be error free, can result in gene deletion, or, by precise insertion, in gene correction.
CRISPR/Cas9, ZFNs and TALENs can cut DNA precisely, but how do we do the actual engineering? If the double-strand break is induced in vivo, it has to be repaired. This can happen in an error-free way: the two ends are often simply glued together by NHEJ. In the case of Cas9, if the repair was error free, Cas9 comes and cuts again until there is a mistake. This can be a deletion or insertion of a few nucleotides (indels), which can cause a frameshift in a coding region. For HR (homologous recombination) we additionally provide the cell with a piece of DNA that encodes the change we want to introduce. It needs to be flanked by sequences that are identical to the genomic DNA, the homology regions. There is then a chance that this piece of DNA is used as a template for homologous repair, either to insert it or just to repair the break.
How to make a knock-out?
Remove, replace, disturb
A gene has a promoter, start and stop codons, introns and exons, and 5' and 3' ends. If you want to remove this gene or its product, there are three options. Remove: take out all the introns and exons by inducing a double-strand break at the start codon and at the stop codon and removing everything in between; this is the actual deletion. Replace: introduce a piece of DNA with regions identical to the locus and try to replace one of the exons by something different. Disturb: make a break in one of the first exons and hope that a mistake in the repair introduces a small indel that leads to a stop codon (nonsense mutation), so that only a non-functional protein is produced.
Modern genome engineering: programmable nucleases
Short insertions or deletions (indels) can lead to functional gene KO
An indel can cause a frameshift or gene deletion. -1 indel: an A was deleted and the protein sequence changes completely. +1 indel: an A was inserted, causing a frameshift in which one of the triplets now codes for a STOP codon, giving a shorter protein.
CRISPR/Cas9
To target our sequence we have to make an sgRNA in which the first 20 nucleotides are complementary to the target sequence; the rest of the sgRNA is always the same. Cas9 with the sgRNA binds next to the PAM and cleaves the DNA, producing a double-strand break.
Single guide RNA (gRNA)
Spacer, repeat-derived and tracrRNA-derived complementary regions, and a linker loop. It consists of 2 parts: a 20 nt targeting sequence (the spacer), which is variable, and the 90 nt gRNA scaffold, which is constant. At about 110 nucleotides it is already quite short, but the shorter the easier to make, so shorter sgRNAs are on the market.
gRNA design
gRNA targeting sequence: 20 nt target sequence next to NGG PAM + gRNA constant scaffold
PAM is not in the gRNA sequence!
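A minimal sketch of this spacer selection, assuming a plain DNA string and the S. pyogenes NGG PAM. The function names (reverse_complement, find_spacers, with_5prime_g) are hypothetical and only for illustration; real design tools also score activity and off-targets, which this sketch does not attempt. The optional 5' G reflects the U6 promoter requirement mentioned in these notes.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def find_spacers(seq, spacer_len=20):
    """List candidate spacers next to an NGG PAM on both strands.

    Returns tuples (strand, pam_start, spacer, pam). pam_start is the index of
    the PAM in the strand that was scanned: the given sequence for "+", its
    reverse complement for "-". The PAM itself is never part of the spacer.
    """
    hits = []
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        # i is the position of the N in NGG; a full spacer must fit 5' of it
        for i in range(spacer_len, len(s) - 2):
            if s[i + 1:i + 3] == "GG":
                hits.append((strand, i, s[i - spacer_len:i], s[i:i + 3]))
    return hits


def with_5prime_g(spacer):
    """Prepend a G if the spacer does not already start with one (U6 promoter)."""
    return spacer if spacer.startswith("G") else "G" + spacer


if __name__ == "__main__":
    example = "ATGCATGCATGCATGCATGCAGGTT"  # made-up sequence, not a real gene
    for strand, pos, spacer, pam in find_spacers(example):
        print(strand, pos, with_5prime_g(spacer), pam)
```

For a forward-strand hit the spacer is simply the 20 nt directly 5' of the PAM. Scanning the reverse complement with the same rule corresponds to searching for CCN on the given strand and reading the other strand 5' to 3'.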
The strand complementary to the one our gRNA binds has the same sequence as our spacer, so to design the spacer just take the 20 nt upstream of the PAM. Specialized systems used to express gRNAs require a G at the very beginning; if there is no naturally occurring G, people add one.
gRNA design
A G needs to be at the 5' end for transcription with the U6 promoter! How? First look for PAMs (NGG): 3 possibilities in the example. Now look at the 20 nucleotides upstream of each PAM; if there is no G at the beginning, add one. The PAM is not part of the gRNA! Because DNA is double stranded, we of course also look at the other strand (CCN instead of NGG), which gives two additional possibilities; read in the other direction, always 5' to 3'. Because the PAM has only 3 nucleotides, it is very common, and for most genes you can design hundreds of gRNAs.
Cas9 double mutant
Inactivation of both Cas9 nuclease domains turns Cas9 into an RNA-guided DNA-binding protein.
Cas9 has two different endonuclease domains, RuvC and HNH, and each domain cuts one strand of DNA, either the complementary strand that binds the gRNA or the noncomplementary strand. Because they are well described, it was easy to point-mutate both to create a dead Cas9 (dCas9): a protein that cannot cut anymore but still binds, i.e. an RNA-guided DNA-binding protein. Why is it useful?
Cas9: beyond DNA cleavage
We can fuse things to Cas9 and bring them to the DNA for different approaches. Example: dCas9 fused to an additional regulator (transcriptional repressor or activator) regulates a specific gene, decreasing or increasing its expression level. You can also bring modifiers, GFP, enzymes etc. to the DNA.
CRISPR-Craze
- Cell lines, somatic cells, stem cells
- Model organisms: rabbit, bee, chicken, mouse, frog, cattle, yeast, rat, worms, fly, pig, etc.
all you can imagine
gRNA activity is unpredictable
Problem: this can cause severe mutations, and it is hard for us to find solutions. We can design hundreds of gRNAs for one gene, but they all have different activities. Experiment: 19 different gRNAs against GFP, where the % of deletion can be easily measured; some had almost no activity, others over 95% activity.
Can we predict how active the gRNAs are? Many have tried; 8 different algorithms were tested, but nothing really works. Some work better than others and can at least be used to raise the chance of success: take the gRNA that is predicted to work best.
CRISPR/Cas9 specificity
- Sequence-dependent off-targeting
- Sequence-independent effects
o Immunogenicity
o Possible integration
o Cas9 expression
What if there is a nearly identical target sequence in the genome with just one mismatch? The gRNA still binds (with one mismatch) and there is a high chance that Cas9 cleaves (even if there are only 17 matches), so Cas9 cuts off-target sites. Just as not all targets are cleaved, not all off-targets are cleaved either. Where the mismatch is located also determines whether the gRNA is going to bind and cut: if the mismatch is near the PAM, the DNA does not open up further and the gRNA leaves, but if it is further away, where the DNA is already opened, the gRNA stays.
➔ Where and how many mismatches are located can be analysed! Use the gRNA with the fewest off-targets.
There are also sequence-independent effects. Immunogenicity: Cas9 usually comes from a pathogen, Strep. pyogenes, so we can have antibodies against the bacterium or even against Cas9. Possible integration: if we introduce a plasmid encoding Cas9, it can by chance (0.1-0.5%) integrate somewhere into the genome. We also do not know what the effect of prolonged Cas9 expression would be; this was done in mice, which seem fine even after generations, but we still cannot be sure.
Avoiding off-target activity
- Algorithms can predict guide RNAs with low off-target activity.
- Cas9 nickase can be used to prevent off-targets.
- Lower expression level of Cas9 reduces off-target cleavage.
- Purified Cas9 protein introduced to the cells has very low off-target activity.
- Cas9 mutants with lowered off-target activity were constructed. (Cas9-HFs, Cas9 1.1)
Introducing purified Cas9 protein or such mutants reduces off-target cleavage.
DNA break repair
When it comes to applications in therapy, we do not want to delete genes, we want to correct them: introduce a DNA piece with the “correct version” and hope it is inserted. Unfortunately this is a random process (0.1-20%); the other breaks get repaired without error or with a nondisruptive or disruptive error (deletions).
Imperfections of the method
Take a group of cells encoding a blue protein into which we introduce CRISPR/Cas9. Not every cell is going to take up Cas9 (transformation is never 100%). Of those that do, some turn white because they have a full deletion of our blue gene, and some turn light blue (heterozygous deletion, 50% of homozygous deletion). So if transformation is 10%, the method is (10/10) 1% successful. In the example, 2 cells then used the introduced DNA as a template -> green protein.
Using the “imperfect method”
- Models for basic research,
- Modified organisms in biotechnology and medicine
o Modified pigs for organ transplantation
How can we use an imperfect method? Use CRISPR-Cas9 to create models for basic research (1 or 10% is enough): treat 100 mice, get 5 with the intended mutation, and breed them to get a population. It is also enough for biotechnology and medicine. Pig organs are commonly used for transplantations; the problem is that they carry retroviruses that can activate themselves and harm the recipient. A lab used gRNAs to remove all these viruses from the pigs and then bred the animals.
CRISPR/Cas9 for therapy
- Lung cancer
- HIV infection
- Beta thalassemia
- Duchenne muscular dystrophy
- Huntington disease
- Retinitis pigmentosa
How to use it for therapy?
There are 2 different approaches. Direct delivery: into the body or organ. Cell-based therapy: take cells from the patient, modify them outside of the body and use the successfully edited cells. Direct delivery is very challenging.
Duchenne muscular dystrophy
Usually caused by a nonsense mutation in exon 23 (middle). Experiment in mouse: do not make a correction (inefficient); instead use two different gRNAs to completely cut out the exon that carries the STOP codon. The protein is then missing a bit that is luckily not so important. The mice do better but are still far away from the WT.
ATTR Amyloidosis
A group of diseases in which amyloid plaques are found in different body organs, affecting different muscles, neurons etc. One important consequence is heart failure. The liver produces the TTR protein, which is distributed all over the body and can form amyloid plaques.
The idea: take Cas9 mRNA and sgRNA, pack them into lipid particles and inject them into the veins. They find their way into the liver, and by day 28 a 93% mean and 98% maximum TTR reduction was seen.
Haemoglobin disorders
Example of cell-based therapy. A group of diseases where haemoglobin is affected/mutated. One of them is sickle cell anaemia, with a different morphology of the red blood cells: sickle-shaped blood cells can form painful or fatal clots that can also affect organs. Haemoglobin is a tetramer of two alpha-globins and two beta-globins. There are two different alpha-like globins and a few different beta-like globins encoded, and the different combinations exist at different stages of development: embryo, foetus and adult.
In the adult we have alpha/beta. If beta is defective, as in sickle cell anaemia, one way to treat it would be to go back to the foetal stage: instead of alpha/beta we use alpha/gamma.
→ One way of treatment: specific drugs like hydroxyurea can cause a shift in expression, so that gamma-globin is expressed again.
Another approach is gene and cell therapy: use a virus that either infects cells all over the body or infects bone marrow cells that were taken from the patient and are then put back. You can express gamma-globin alongside beta-globin, hoping that it is taken up by the cells. -> New idea: use CRISPR for it.
The 2 globin genes (gamma and beta) are more or less in the same locus, and the switch between them is epigenetically programmed. The locus has an enhancer, the LCR, which jumps between gamma- and beta-globin, so only one is active at a time. The switch is caused by a protein called BCL11A, which binds to a specific sequence upstream of the locus. Use CRISPR/Cas9 to target this specific binding site, making a disruption that inactivates BCL11A binding -> shift from beta- to gamma-globin. We do not try to repair the gene but use a knockout to treat the patient; the first patient was healed in 2018.
Cancer immunotherapy
Use CRISPR/Cas9 to enhance cancer immunotherapy. Normally cancer cells are recognized by the immune system, and most of them are wiped out before they form an actual cancer. Some cancers can still be recognized but escape: they exploit the PD-1/PD-L1 interaction to say they are “normal”. The tumour cell has an antigen that the T cell recognizes as something to kill, but then PD-1 on the T cell binds to its ligand on the tumour cell -> the T cell ignores the danger and the tumour escapes.
With anti-PD-1 treatment, antibodies bind PD-1 or PD-L1 so that they cannot interact anymore, and the T cell recognizes and kills the tumour cell. A different way: take T cells from the body of a patient and delete PD-1 from these T cells using CRISPR/Cas9, getting superkiller T cells; inject them back into the
body and try to eradicate the tumour.
Cancer immunotherapy
- 2016; China:
o PD-1 KO cells in small cell lung cancer
- 2017; China:
o PD-1 KO T cells for 21 esophageal cancer patients
o Anti-CD19 CRISPR-mediated CAR-T cells for lymphoma
- 2018; USA:
o CRISPR gene-edited T cells with deleted TCR and PD-1 in 18 sarcoma, multiple myeloma and melanoma patients
The knockout makes the treatment more efficient; it might be approved quite soon.
Beyond therapy
- Antimicrobials
- Revival of extinct species (mammoth)
- Biosensors
- Gene drives
Not discussed further.
Lecture 5: Proteomics: Basics and applications (Christopher Gerner)
Proteins
- “actual” actors of cell-biological activities
- Functional cell state may determine the abundance, modifications, localisation, interaction partners, enzymatic activity
- Sensitive and quantitative assays exist but are not comprehensive (2D-PAGE, LC-MS/MS...)
- Analysis meaningful with respect to:
o Marker proteins for diseases, risk for diseases, stratification markers
o Activation of signal transduction pathways (phosphorylation, proteolytic activation etc.)
o Characterisation of cell responses to altered environmental conditions
For bioinformatics, genetic mutations are stable, which means it does not matter when the sample is taken: the information is stable and always there. This is very different for post-genomic techniques, because protein expression and what proteins are doing can change rather quickly. For example, a viral infection can be asymptomatic, make you sick, or even kill you. Should we use genomics, transcriptomics or proteomics? Use all of them and you get different answers. Genomics reveals a genetic property that makes you more vulnerable than others, but if there are environmental factors, medication etc.
you need other methods. Proteomics provides insight into a given state that may even change in the course of time, so you have to know or identify the other influencing factors. Proteomics can address questions such as why one cell is more vulnerable to infection than another. In proteomics we are interested not only in which protein it is, but also where it is, with whom it interacts and what modifications it shows.
Proteins are made of amino acids
- All 20 amino acids have the same basic structure, but differ in their side chains.
- A polypeptide is formed by chaining amino acids. The bonds between consecutive amino acids are called peptide bonds. When two amino acids are bound, a water molecule is released.
Physico-chemical properties of amino acids
As each amino acid has its own molecular formula, you can assign it a molecular mass. There are two different terms. The average mass considers the isotopic mixture of the molecules you have. In mass spectrometry these different isotopes can be distinguished and a monoisotopic peak can be assigned; the monoisotopic mass is that of the molecule built from the lightest isotopes, e.g.
C12, H1, which are the most abundant. Typically the amino acids have distinct masses, with the exception of isoleucine and leucine, which have the same molecular formula and therefore the same mass, so they cannot be distinguished by mass spectrometry.
Basic considerations
- Proteins are linear assemblies of amino acids
- The sequence of amino acids defines the identity of a protein
- Partial sequences may be sufficient for protein identification
- Partial sequences and homologies are searched with “Blast”
The AA sequence defines the protein, but we do not need to determine the whole sequence to say which protein it was -> analyse peptides, not intact proteins. For analysis the protein is cut into pieces (peptides), which are sequenced; a BLAST database search then shows which protein contains the given amino acid sequence. The ambiguity is very low: typically there is one candidate, or sometimes a protein family (e.g. sharing an ATP-binding domain).
UniProt is the most important protein database: www.uniprot.org
Identification of proteins
- Western analysis: proteins are separated by SDS-PAGE, blotted to a membrane (nitrocellulose or PVDF) and detected using a specific antibody. A secondary antibody is typically linked with one of several detection methods such as an enzymatic reaction or chemiluminescence. This method is sensitive and efficient but only allows investigation of known candidate proteins. Note: the false discovery rate is much higher in comparison to modern mass spectrometry
- Edman degradation: chemical sequencing requiring HPLC separation of each amino acid; old-fashioned and reliable but expensive
- Peptide mass fingerprinting: purified proteins are digested and the masses of the peptides determined by mass spectrometry (MS). Comparison of in silico digests with experimental data allows protein identification
- PSD, CID, MS/MS: isolated peptides are fragmented using PSD (post source decay), CID (collision induced dissociation) or other methods.
The resulting spectra support the identification of the corresponding amino acid sequence. This is the main method for protein identification nowadays. Alternative fragmentation techniques include, amongst others, HCD (higher energy collisional dissociation) and ETD (electron transfer dissociation)
In a western blot we can only estimate the protein; in mass spectrometry we can calculate it.
Identification of 2D-spots
1. cutting the spot
2. digestion with trypsin
3. MS-analysis of peptides
In a 2D protein separation each dot represents a protein of a given charge. Cut the given protein out of the gel. How do we determine its identity? A mass spectrometer is a kind of molecular balance that determines the molecular weight of a given molecule. Why not use the molecular weight of the intact protein to identify it? Too many candidates. New idea: with a database of AA sequences I can easily calculate the peptides that would be formed upon digestion of these proteins. Trypsin is a site-specific enzyme, so we can predict where it cuts, and from the sequence of each peptide we can easily calculate its mass. This is the in silico digest, which is then compared with the mass spectrometry data.
Mass spectrometry
- Molecular weight determination based on a molecule’s mass to charge ratio.
- First mass spectrometer built 1907 by J.J. Thomson
Sample (e.g. tryptic digest) -> ionise (ion source) -> analyse (mass spectrometer)
Two conditions: the molecules must be ionised and in the gas phase.
Matrix-assisted Laser Desorption/Ionisation (MALDI)
In MALDI, molecules generally form [M+H]^+ ions.
To get a protein/peptide into the gas phase with MALDI, a nitrogen laser (337 nm) is fired at a salt matrix; the salt evaporates and the proteins on top of these small crystals get into the gas phase.
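The in silico digest and the [M+H]^+ masses can be sketched as follows. This is a minimal illustration, not the lecture's software: the function names are hypothetical, the residue masses are standard monoisotopic values rounded to five decimals, and the rule "trypsin cuts after K or R, but not before P" is a common convention assumed here rather than something stated in the notes.

```python
# Monoisotopic residue masses in Da (amino acid minus one water), rounded.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # one H2O per peptide (free N- and C-terminus)
PROTON = 1.00728   # mass added per charge in [M+zH]^z+


def tryptic_digest(protein):
    """Split a protein sequence after K or R, except when followed by P."""
    seq = protein.upper()
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        next_aa = seq[i + 1] if i + 1 < len(seq) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])  # C-terminal peptide without K/R
    return peptides


def monoisotopic_mass(peptide):
    """Neutral monoisotopic mass M of a peptide."""
    return sum(MONO[aa] for aa in peptide.upper()) + WATER


def mz(peptide, charge=1):
    """m/z of the [M + charge*H] ion; charge=1 gives the [M+H]^+ seen in MALDI."""
    return (monoisotopic_mass(peptide) + charge * PROTON) / charge


if __name__ == "__main__":
    for pep in tryptic_digest("MKRPLAK"):  # made-up sequence
        print(pep, round(mz(pep), 4))
```

Because L and I share one entry value in the table, peptides that differ only in leucine versus isoleucine get identical masses, which is exactly why mass alone cannot tell them apart. A peptide mass fingerprint search would compare such calculated m/z lists for every database protein against the measured peaks.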