Systems Biology - Lecture Notes

Summary

These lecture notes provide an introduction to systems biology and biological networks. They cover topics such as the philosophical approach of reductionism, contrasting it with the systems-biology perspective, and the significance of large-scale data (like genomics). Topics in omics (related to the study of whole biological systems), such as genomics, proteomics, and metabolomics, are also included.

Full Transcript

Table of Contents {#indholdsfortegnelse.Overskrift}
===================

- [Block 1](#blok-1)
  - [Lecture 01 (September 5) - Intro 1](#lecture-01-september-5---intro-1)
  - [Lecture 02 (September 12) - Intro 2](#lecture-02-september-12---intro-2)
  - [Lecture 03 (September 19) - Intro 3](#lecture-03-september-19---intro-3)
- [Block 2](#section)
  - [Lecture 04 (September 26) - Yeast Systems Biology 1](#lecture-04-september-26---yeast-systems-biology-1)
  - [Lecture 05 (October 3) - Yeast Systems Biology 2](#lecture-05-october-3---yeast-systems-biology-2)
  - [Lecture 06 (October 10) - Yeast Systems Biology 3](#lecture-06-october-10---yeast-systems-biology-3)
- [Block 3](#blok-3)
  - [Lecture 08 (October 31) - Systems Biology in Biomedical Research (Heart diseases) 1](#lecture-08-october-31---systems-biology-in-biomedical-research-heart-diseases-1)
  - [Lecture 09 (November 7) - Systems Biology in Biomedical Research (Heart diseases) 2](#lecture-9-november-7---systems-biology-in-biomedical-research-heart-diseases-2)
  - [Lecture 10 (November 14) - Protein isoforms 1](#lecture-10-november-14---protein-isoforms-1)
  - [Lecture 11 (November 21) - Protein isoforms 2](#lecture-11-november-21---protein-isoforms-2)
  - [Lecture 12 (November 28) - Integrating multiple omics data types for cancer research](#lecture-12-november-28---integrating-multiple-omics-data-types-for-cancer-research)

Block 1 {#blok-1}
======

Lecture 01 (September 5) - Intro 1
----------------------------------

**Lecture:** *Introduction to Systems Biology and biological networks*

**Definition of Reductionism:**

- Reductionism is a philosophical approach that involves breaking down complex systems into their simpler, fundamental components. The premise is that by understanding the individual parts, one can ultimately understand the whole system.
- In biology, this has traditionally involved studying individual genes, proteins, or cellular processes in isolation to grasp the workings of more complex biological phenomena.

**Reductionism vs.
systems biology:**

**Reductionism ("Normal" Biology) -- Bottom-Up Approach** (right side of the figure, blue text)

- **Bottom-Up Approach**: This is the traditional method in biological research, where scientists start at the most basic level and build their understanding upwards.
- This approach focuses on **breaking down complex biological systems into their simplest parts** to understand how they work individually before piecing together the overall picture. It is often used in molecular biology, biochemistry, and genetics.

**Systems Biology -- Top-Down Approach** (left side of the figure, red text)

- **Top-Down Approach**: In contrast to reductionism, systems biology begins with the larger, more complex systems (like ecosystems or whole organisms) and moves downward to understand how the integrated components contribute to the function and behavior of the system.

**Large-scale data:**

1. **Systems Biology Overview:** Systems biology focuses on understanding how interactions within biological networks lead to various functions and behaviors within an organism.
2. **Importance of Large-Scale Data:** The phrase "system-wide measurements" underscores the need for extensive data collection across various biological components (genes, proteins, metabolites, etc.). To fully understand the intricate web of interactions within a biological system, a comprehensive dataset is essential. Such large-scale data encompasses multiple layers of biological information, ranging from **genomics** and **transcriptomics** to **proteomics** and **metabolomics**.
3. **Applications in Biology:** Large-scale data in systems biology enables numerous applications, such as identifying biomarkers for diseases, understanding drug mechanisms, and improving metabolic engineering for industrial applications.
By leveraging extensive datasets, researchers can make more accurate predictions about cellular behavior and devise strategies for targeted interventions.

**Omics**: The study of the entirety, or complete set, of something within a biological context. This approach is widely used in systems biology to understand broad and complex biological functions.

- Examples include **genomics** (the study of the entire genome), **proteomics** (the study of all proteins), and **metabolomics** (the study of metabolites).

**Genomics:** Genomics is the study of the structure, function, and interactions of genes within the entire genome, with a focus on how they influence biological processes and organisms.

**Proteomics:** Proteomics is the large-scale study of proteins.

- Proteins are the key functional molecules in living organisms. They are responsible for virtually every process within cells, and their structures define their functions.

Different levels of protein structure:

- **Primary**
- **Secondary**
- **Tertiary**
- **Quaternary**

![](media/image5.png)

**Primary structure:**

- The primary structure is the linear sequence of amino acids linked together by peptide bonds.
- This sequence is encoded by the corresponding gene.
- The primary structure determines the protein's final 3D shape and function.

**Secondary structure:**

- The two common motifs in protein secondary structure are the alpha helix and the beta sheet.
- Secondary structure refers to the local folding patterns of the polypeptide chain. It is primarily stabilized by hydrogen bonds between the backbone atoms.
- **Alpha helix:** A coiled structure stabilized by hydrogen bonds between every fourth amino acid (the backbone C=O of residue i and the N-H of residue i+4), giving a helical shape.
- **Beta sheet:** Consists of beta strands connected laterally by hydrogen bonds, forming a sheet-like structure.

**Tertiary structure:**

- The tertiary structure is the complex 3D shape formed by the folding of the secondary structures.
- It refers to the overall spatial arrangement of a polypeptide chain, including interactions between the side chains (R groups) of amino acids.
- Tertiary structure determines the protein's functional regions (e.g., active sites in enzymes).

**Quaternary structure:**

- The quaternary structure describes a protein composed of multiple polypeptide chains.
- It arises when two or more polypeptide chains (subunits) come together to form a functional protein complex.

Understanding these protein structures is fundamental in proteomics, as the structure of a protein directly impacts its function.

- Studying proteomics involves identifying and characterizing these proteins, including their expression levels, interactions, modifications, and involvement in cellular pathways.

**Metabolomics** is the study of small molecules (metabolites) within a biological system, providing insights into biochemical activity and physiological states.

- It uses techniques like mass spectrometry and NMR to analyze metabolite profiles, often for applications in medicine, biomarker discovery, and understanding disease mechanisms.

**Protein-protein interaction (PPI) networks** represent the interactions between proteins within a cell:

![](media/image7.png)

The left side of the figure shows a network diagram with nodes (circles) and edges (lines).

- Each node represents a protein.
- Each edge represents a physical interaction between two proteins.
- The network shows how multiple proteins are interconnected, highlighting clusters and hubs (central proteins that interact with many others).

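
The node, edge, and hub vocabulary above can be made concrete with a small sketch. The protein names and interactions are hypothetical placeholders (not data from the lecture), and the hub cutoff of three partners is an arbitrary assumption chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical interaction list: each tuple is one edge (a physical interaction).
edges = [
    ("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P1", "P5"),
    ("P2", "P3"), ("P5", "P6"),
]

def build_network(edge_list):
    """Return an undirected adjacency map: protein -> set of interaction partners."""
    network = defaultdict(set)
    for a, b in edge_list:
        network[a].add(b)
        network[b].add(a)
    return dict(network)

def find_hubs(network, min_degree=3):
    """Proteins with at least min_degree partners (the cutoff is an assumption)."""
    return sorted(p for p, partners in network.items() if len(partners) >= min_degree)

network = build_network(edges)
# Degree of a node = number of edges attached to it.
degrees = {protein: len(partners) for protein, partners in network.items()}
hubs = find_hubs(network)

print(degrees)
print(hubs)
```

In this toy network, P1 is the only protein that reaches the cutoff, so it is the only hub; real PPI networks would be loaded from an interaction database rather than typed in by hand.
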
Lecture 02 (September 12) - Intro 2
-----------------------------------

**Protein-protein interaction networks**

**The proteome is more dynamic than the genome:**

**Protein complexes:**

The proteome, the entire set of proteins expressed by a genome, cell, tissue, or organism, is more dynamic and complex than the genome. This is due to several factors:

- **Folding:** After synthesis, proteins fold into specific three-dimensional structures that determine their function. Misfolding can lead to loss of function or disease.
- **PTMs (Post-Translational Modifications):** Proteins undergo various chemical modifications (e.g., phosphorylation, glycosylation) after translation, affecting their activity, stability, and interactions.
- **Half-life of proteins:** Proteins have varying lifespans, with some being quickly degraded and others persisting longer. This dynamic turnover allows cells to adapt rapidly to changes.
- **Interactions:** Proteins interact with other molecules and proteins, forming complexes that carry out specific cellular functions. These interactions and complexes are vital for most cellular processes.

**Binary interactions:**

Binary interaction methods detect direct interactions between two proteins. Examples of these methods include:

- **Yeast Two-Hybrid (Y2H) System**: A molecular biology technique used to discover protein-protein interactions by testing whether two proteins can bind to each other in yeast cells.
- **Protein Fragment Complementation Assays (PCA)**: Techniques in which a reporter protein (such as a fluorescent protein like GFP or YFP) is split into fragments, and an interaction between two proteins is inferred if the fragments come together to reconstitute a functional reporter.

**Yeast Two-Hybrid (Y2H) method:**

- The Yeast Two-Hybrid method is an in vivo assay, meaning it is conducted within a living organism (in this case, yeast cells).
This allows researchers to study protein-protein interactions in a natural cellular environment, providing insight into how these interactions occur in living cells.

**Transcription Factor Domains:**

The method takes advantage of the fact that transcription factors typically have two distinct domains:

- **DNA Binding Domain (DBD):** Responsible for binding to specific DNA sequences.
- **Transcription Activation Domain (AD):** Activates the transcription of a target gene.

In the Y2H system, these domains are separated and each is fused to a different protein. A functional transcription factor is recreated only if the AD and DBD are brought together by an interacting protein pair.

**Mechanism of Y2H:**

**Protein "Bait" and "Prey":**

- A protein of interest (the "bait," labeled "X" in the diagram) is fused to the DBD.
- A library of potential interacting proteins (the "prey," labeled "Y") is fused to the AD.

When the "bait" and "prey" proteins interact, they bring the AD and DBD into proximity, reconstituting a functional transcription factor. This reconstituted transcription factor can then bind to the upstream activating sequence (UAS) of a **reporter gene** and activate its transcription.

**Advantages of Yeast Two-Hybrid (Y2H):**

1. **Direct PPI Detection:** Allows for the identification of specific, direct PPIs.
2. **High Sensitivity:** Can detect both transient (short-lived) and stable interactions.
3. **In Vivo Context:** Performed in living yeast cells, providing a biologically relevant environment.
4. **Independent of Natural Protein Levels:** Enables controlled expression of proteins, ensuring that low endogenous levels do not hinder detection.

**Limitations of Yeast Two-Hybrid (Y2H):**

1. **False Positives:**
   - Some proteins can activate the reporter gene without a true interaction (auto-activation).
   - Detected interactions may be artificial and not biologically relevant.
2. **False Negatives:**
   - Certain interactions are missed if they require additional factors (e.g., co-factors or other proteins) that are absent in the yeast system.
3. **Limited Complexity:** Restricted to testing two proteins at a time, making it challenging to study complex networks or multi-protein interactions.
4. **Yeast-Specific Interactions:** Interactions detected in yeast may not accurately represent those occurring in other organisms or natural environments.

**Protein Fragment Complementation Assays (PCA):**

Protein Fragment Complementation Assays (PCA) are in vivo techniques used to study protein-protein interactions (PPIs) in the natural cellular environment of living cells. Like Yeast Two-Hybrid (Y2H), PCA detects interactions between proteins, but its underlying mechanism differs significantly: PCA relies on the reconstitution of a split reporter protein, which signals the interaction.

**Principle of PCA:**

**Split Reporter Protein:**

- PCA is based on a reporter protein that is split into two non-functional fragments:
  - **N-terminal fragment:** The first half of the reporter protein.
  - **C-terminal fragment:** The second half of the reporter protein.
- Each fragment is fused to a protein of interest:
  - Protein A (bait) is fused to one fragment.
  - Protein B (prey) is fused to the other fragment.

**Mechanism:**

- If Protein A and Protein B interact, their proximity facilitates the reassembly of the two fragments of the reporter protein.
- Reconstitution of the reporter protein restores its activity, generating a measurable signal such as fluorescence, luminescence, or enzymatic activity.
- This signal indicates a physical interaction between the two proteins.

**Advantages of PCA:**

1. **Sensitive and Versatile:** Detects both transient and stable PPIs across various organisms and conditions.
2. **In Vivo Relevance:** Performed in living cells, reflecting natural biological conditions.
3. **Quantitative Output:** Provides measurable signals like fluorescence or luminescence.
4. **Flexible Reporters:** Supports diverse reporter proteins tailored to experimental needs.

**Key Limitations of PCA:**

1. **False Positives:** Overexpression or spontaneous reporter reassembly can produce misleading signals.
2. **False Negatives:** Misfolding or missing co-factors may prevent detection.
3. **Structural Interference:** Reporter fusions may disrupt protein structure or activity.
4. **Limited to Binary Interactions:** Less effective for studying multi-protein complexes.

**Comparison to Y2H:**

  **Feature**                **PCA**                                    **Y2H**
  -------------------------- ------------------------------------------ ----------------------------------------------
  **Detection**              Reconstitution of split reporter           Reconstitution of transcription factor
  **Reporter Type**          Enzymatic, fluorescent, or luminescent     Transcription-activated gene expression
  **Versatility**            Applicable in various organisms            Limited to yeast cells
  **Complex Interactions**   Limited to binary interactions             Limited to binary interactions
  **False Positives**        Non-specific fragment reassembly           Auto-activation or indirect interactions
  **False Negatives**        Improper folding or stringent conditions   Variability in results or missing co-factors

**Affinity chromatography**

Affinity chromatography is a method used to purify a specific protein from a complex mixture by exploiting the protein's binding affinity for a particular ligand (e.g., an antibody, antigen, enzyme substrate, or receptor).

**Key Feature:** This technique selectively isolates proteins, providing a high degree of purification through specific interactions.

**Affinity Purification/Mass Spectrometry (AP-MS):**

AP-MS combines affinity chromatography with mass spectrometry to study protein-protein interactions and identify protein complexes.

- A "bait" protein (the protein of interest) is tagged with an affinity tag, enabling its isolation along with any interacting proteins ("prey").
- The purified proteins are then identified using mass spectrometry to determine the co-purifying partners that form a complex with the bait.

**Purpose:** To determine protein-protein interactions and identify functional units in cellular processes.

**Bait (Y-axis) and Prey (X-axis):**

- **Bait Proteins (Y-axis):** Tagged proteins isolated in individual experiments.
- **Prey Proteins (X-axis):** Proteins co-purified with the bait, representing interaction partners.

**Workflow for Identifying Protein Complexes:**

The typical workflow for identifying protein complexes using affinity purification and mass spectrometry:

A. **Bait Protein:** A tagged bait protein captures interacting proteins in the cell, aiding in their purification.
B. **Protein Complex Formation:** The bait forms complexes with its natural partners, which are isolated using an affinity column.
C. **SDS-PAGE:** The purified complex is separated by SDS-PAGE to resolve proteins by size.
D. **Mass Spectrometry:** Protein bands are digested into peptides and analyzed by mass spectrometry to identify interaction partners.

**Strengths of AP-MS (Affinity Purification Mass Spectrometry):**

- **High Specificity:** Selectively isolates target proteins and their true interaction partners.
- **Effective for Strong Interactions:** Captures stable protein complexes and strong transient interactions.
- **Physiological Relevance:** Mimics natural cellular conditions, providing biologically relevant results.

**Limitations:**

- May miss complexes that do not form under the experimental conditions.
- Cannot distinguish direct from indirect interactions (e.g., A and B co-purify only because both bind C), so direct contacts remain unclear.
- Less effective for weak or highly transient interactions.

**Y2H vs.
AP-MS:**

**Y2H (Yeast Two-Hybrid):**

- **Method:** Uses genetic reconstitution of a transcription factor to detect protein-protein interactions in yeast.
- **Advantages:**
  - Identifies direct binary interactions.
  - Effective for discovering novel interactions, especially nuclear ones.
- **Limitations:**
  - Restricted to interactions occurring in yeast, which may not reflect other organisms.
  - Does not capture complex multi-protein interactions.

**AP-MS:**

- **Method:** Tagged bait proteins isolate interacting partners, followed by mass spectrometry to identify associated proteins.
- **Advantages:**
  - Identifies multi-protein complexes and complex networks.
  - Provides quantitative and physiologically relevant data.
- **Limitations:**
  - Requires specialized equipment and is resource-intensive.
  - May include indirect or non-specific interactions that require validation.

**Spoke vs. matrix:**

**Spoke model:**

- A simplified way to represent protein interactions.
- Uses a central "hub" protein (a protein that has many interaction partners) that connects to several other "spoke" proteins.
- Each interaction is directly linked to the central hub, and no direct interactions are assumed between the spoke proteins themselves. This approach is often used in high-throughput interaction studies where a central protein is tested for binding with many others.

**Matrix model:**

- A broad representation that considers possible interactions between all proteins involved.
- Every protein can potentially interact with every other protein, forming a network of connections.
- The multiple connections indicate a more complex network in which interactions are not limited to a central hub.
- Used when protein interactions are complex, involving multiple proteins interacting in various configurations, such as in large multiprotein complexes or signaling networks.

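
The difference between the two models is easiest to see by expanding one purification into edges. The sketch below is a minimal illustration that treats the AP-MS bait as the hub; the function names and the bait/prey labels are hypothetical, not taken from the lecture.

```python
from itertools import combinations

def spoke_edges(bait, preys):
    """Spoke model: the bait is linked to every prey; preys are not linked to each other."""
    return {frozenset((bait, prey)) for prey in preys if prey != bait}

def matrix_edges(bait, preys):
    """Matrix model: every pair of proteins in the purification is linked."""
    members = sorted({bait, *preys})
    return {frozenset(pair) for pair in combinations(members, 2)}

# Hypothetical pull-down: bait "A" co-purifies with preys "B", "C", and "D".
spoke = spoke_edges("A", ["B", "C", "D"])
matrix = matrix_edges("A", ["B", "C", "D"])

print(len(spoke), len(matrix))  # prints "3 6"
```

For a purification with n proteins, the spoke model yields n - 1 edges and the matrix model n(n - 1)/2. The spoke edges are always a subset of the matrix edges, so the matrix model can only add candidate interactions (and with them potential false positives), never remove any.
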
![](media/image9.png)

Lecture 03 (September 19) - Intro 3
--------------------------------------------------------

**Protein network representations**

**Regulatory interactions (protein-DNA):**

- **Gene A regulates Gene B:** The yellow arrow between Gene A and Gene B represents a regulatory interaction, where Gene A influences the expression or activity of Gene B.
- **Protein-DNA interaction**: This type of interaction typically involves a protein encoded by Gene A binding to a DNA sequence near Gene B to regulate its transcription.
- **Example**: In many biological systems, transcription factors (proteins from Gene A) bind to the promoter regions of DNA near Gene B, thereby increasing or decreasing its transcription.

**Functional Complex (Protein-Protein Interaction, PPI):**

- **Gene A binds to Gene B**: The green line between Gene A and Gene B represents a **protein-protein interaction**, where the proteins encoded by these genes bind to each other to form a functional complex.
- **Substrate relationship**: Gene B's protein product may be a substrate of Gene A's product, meaning that the protein from Gene A could modify, activate, or otherwise interact with the protein produced by Gene B.
- **Protein-Protein Interactions (PPIs)**: These interactions are essential in forming **functional complexes**, like enzymes binding their substrates or structural proteins assembling into larger molecular structures. PPIs are crucial for almost all biological processes, including signal transduction, cell division, and metabolic control.

**Metabolic Pathways:**

- **Gene A produces a reaction product that is a substrate for Gene B**: The red arrow between Gene A and Gene B represents a relationship within a **metabolic pathway**, where the product of one reaction (catalyzed by the protein from Gene A) becomes the substrate for the next reaction (catalyzed by the protein from Gene B).
- **Sequential biochemical reactions**: This type of interaction is common in metabolic pathways, such as glycolysis or the citric acid cycle, where each gene's protein product performs one step in a multi-step conversion of metabolites.
- **Example**: In glycolysis, the enzyme hexokinase (from Gene A) catalyzes the phosphorylation of glucose, producing glucose-6-phosphate, which then becomes the substrate for the next enzyme (encoded by Gene B) in the pathway.

**Other types of molecular interaction networks:**

- **Genetic interactions**:
  - These networks represent interactions between genes, where the combined effect of mutations in two or more genes can lead to new or unexpected phenotypes (such as synthetic lethality).
- **Metabolic reactions**:
  - These networks map out the biochemical reactions within cells, where enzymes (proteins) catalyze the conversion of metabolites.
- **Co-expression interactions**:
  - Co-expression networks are based on the observation that genes or proteins that are co-expressed (expressed together across conditions) may be functionally related.
  - Such networks are commonly used in systems biology to infer functional relationships or regulatory mechanisms.
- **Text mining interactions**:
  - These networks are generated through computational tools that mine large amounts of scientific literature to identify reported biological interactions.
  - Text mining helps researchers stay up to date with vast amounts of published data and can generate hypotheses for new experimental studies.
- **Association networks**:
  - Association networks represent correlations between various biological objects, such as genes, proteins, or metabolites, without necessarily indicating direct interactions.
  - These networks can be useful for identifying biomarkers or for understanding complex traits in genetics.

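
As an illustration of how a co-expression network might be derived, the sketch below links two genes when their expression profiles are strongly correlated across conditions. The notes do not specify a similarity measure; Pearson correlation and the 0.9 cutoff are assumptions made for this example, and the gene names and expression values are invented.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def coexpression_edges(profiles, threshold=0.9):
    """Link every gene pair whose |correlation| reaches the (assumed) threshold."""
    edges = []
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, round(r, 3)))
    return edges

# Hypothetical expression levels measured across four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 1.0, 3.5, 2.0],
}

edges = coexpression_edges(profiles)
print(edges)
```

Only geneA and geneB rise and fall together here, so they form the single edge. Such an edge indicates an association, not a physical interaction, which is why co-expression networks are used to generate hypotheses about function rather than to prove binding.
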
**Topology** (from Greek *topos*, "place") is the study of the most fundamental properties of space, particularly those that remain invariant under continuous deformations such as stretching, bending, or twisting. These properties include:

- **Connectedness**: The idea that two or more points are in some way linked or connected without any breaks or gaps.
- **Continuity**: How objects behave when smoothly transformed, without tearing or gluing.

**Types of networks:**

![](media/image11.png)

**A: Random Network** (Aa): A random network has nodes connected in a relatively uniform and random manner, with no visible pattern of connectivity. In these networks, each node has an equal probability of connecting to any other node.

- **Degree Distribution (P(k)):** The degree distribution for a random network is a **bell-shaped curve**, indicating that most nodes have a moderate number of connections, with very few having either very high or very low connectivity.
- **Clustering Coefficient (C(k)):** The clustering coefficient in random networks remains relatively **flat**, meaning that a node's tendency to form tightly interconnected groups does not depend on its degree (number of connections).

**B: Scale-Free Network** (Ba): A scale-free network has a few nodes (hubs) with many connections, while the majority of nodes have only a few links. This structure follows a power-law distribution: the probability that a node has $k$ connections falls off as a power of $k$, $P(k) \sim k^{-\gamma}$.

- **Degree Distribution (P(k)):** On a **log-log plot**, the degree distribution of a scale-free network is a straight line with a steep downward slope, indicating that most nodes have a low number of connections and only a few hubs have many connections. This is characteristic of power-law behavior.
- **Clustering Coefficient (C(k)):** In scale-free networks, the clustering coefficient also remains **flat** or low, as connections are distributed sporadically and hubs are not necessarily part of tightly clustered groups.

**C: Hierarchical Network** (Ca): A hierarchical network shows a more organized structure with a clear hierarchy. It combines the characteristics of scale-free networks with an additional layer of clustering and modularity. Smaller clusters of nodes connect to form larger clusters, producing a nested structure. This is often observed in complex biological systems, where nodes are organized into functional modules.

- **Degree Distribution (P(k)):** The degree distribution for a hierarchical network also follows a **power-law distribution**, but with a slower decay. This reflects the presence of highly connected hubs, as in scale-free networks, but with additional organization in the form of clusters or groups of nodes that are more interconnected than in a pure scale-free network.
- **Clustering Coefficient (C(k)):** In hierarchical networks, the clustering coefficient **declines with increasing degree**, appearing as a downward-sloping straight line on a log-log plot (approximately $C(k) \sim k^{-1}$). This indicates that nodes with fewer connections tend to be part of tightly clustered groups, while highly connected hubs are less likely to be part of dense clusters. This pattern reveals the hierarchical structure of the network, where smaller groups form larger, looser structures.

**Special Instance (Hierarchical Network as a Type of Scale-Free Network)**:

- **Hierarchical networks** are a special type of **scale-free network**. This means that in addition to having a few highly connected nodes (hubs), the network is organized in layers, with nodes at higher levels connecting to those at lower levels.
- This hierarchical arrangement supports efficient organization and communication within the network, and is often found in biological and social systems where functions are divided among different levels of the hierarchy.

**Network descriptors:**

1. **Node Degree or Connectivity (k)**:
   - The **node degree** is the number of connections a particular node (e.g., a gene or protein) has to other nodes in the network. In a biological network, this indicates how many interactions a given molecule has.
   - The **average degree** ⟨k⟩ is the mean number of connections per node across the entire network. A high average degree means that, on average, nodes in the network are highly interconnected.
2. **Shortest Path (l)**:
   - The shortest path between two nodes is the minimum number of edges (connections) that must be traversed to go from one node to the other.
   - The **average path length** ⟨l⟩ is the average of all shortest paths in the network, representing how easily information or influence can spread across the network.
   - **Network diameter** is the longest shortest path between any two nodes, representing the overall size or reachability of the network.
   - **Betweenness centrality** measures how often a node acts as a bridge along the shortest path between other nodes. Nodes with high betweenness centrality play a critical role in controlling information flow in the network.
3. **Degree Distribution P(k)**:
   - The degree distribution **P(k)** gives the probability that a randomly selected node has exactly **k** connections. This distribution describes how connectivity is spread across the network.
   - In biological networks, a "scale-free" degree distribution, where few nodes have many connections (hubs) and most have few, is often observed. Such networks are robust to random failures but vulnerable to attacks on highly connected hubs.
4. **Clustering Coefficient (C)**:
   - The **clustering coefficient** of a node measures the likelihood that its neighboring nodes are also connected to each other. This reflects how tightly knit a node's local neighborhood is.
   - The **average clustering coefficient** ⟨C⟩ gives the overall tendency of nodes to cluster together in the network.
In biological systems, high clustering may indicate functional modules or groups of proteins/genes that interact closely to perform a specific biological function.

![](media/image13.png)

**Clustering coefficient C**

The **clustering coefficient** quantifies the degree to which nodes in a network tend to cluster together. It measures the likelihood that two neighbors of a given node are also connected to each other, thus forming a triangle. The clustering coefficient is particularly useful for understanding the local connectivity of a node within a network.

The clustering coefficient $C_{i}$ of node $i$ is given by:

$$C_{i} = \frac{2 n_{i}}{k(k - 1)}$$

Where:

- $n_{i}$ is the number of edges between the neighbors of node $i$.
- $k$ is the degree of node $i$, i.e., the number of neighbors node $i$ has.
- The factor $k(k - 1)/2$ is the maximum possible number of edges between the neighbors of node $i$.

This formula calculates how many of the possible connections between the neighbors of node $i$ are actually present. For example, a node with $k = 4$ neighbors and $n_{i} = 3$ edges among them has $C_{i} = (2 \cdot 3)/(4 \cdot 3) = 0.5$. The ratio takes a value between 0 and 1:

- A **clustering coefficient of 1** indicates that all of the node's neighbors are connected to each other.
- A **clustering coefficient of 0** indicates that none of the neighbors are connected to each other.

Block 2
======

Lecture 04 (September 26) - Yeast Systems Biology 1
---------------------------------------------------

**Yeast cell cycle:**

The yeast cell cycle is divided into four main phases:

1. **G1 Phase (Gap 1)**: The cell grows in size and prepares for DNA replication. In this phase, the cell decides whether to continue through the cell cycle or enter a resting state (G0). This phase is tightly regulated, as passing through the G1 checkpoint commits the cell to division.
2. **S Phase (Synthesis)**: DNA replication occurs during this phase, where the entire yeast genome is duplicated in preparation for cell division.
3. **G2 Phase (Gap 2)**: Following DNA replication, the cell undergoes further growth and prepares for mitosis. The G2/M checkpoint ensures that the DNA is fully replicated and any damage is repaired before division.
4. **M Phase (Mitosis)**: The final stage of the cell cycle, where the cell divides its replicated DNA equally between two daughter cells. Mitosis is followed by cytokinesis, the physical separation of the cell into two cells.

- **Mitosis** is the process in which the nucleus divides, and it includes several sub-stages:
  - **Prophase:** Chromatin condenses into visible chromosomes, the nuclear membrane begins to break down, and the mitotic spindle starts to form.
  - **Metaphase:** Chromosomes align at the metaphase plate, ensuring that each daughter cell will receive one chromatid from each chromosome.
  - **Anaphase:** Sister chromatids are pulled apart toward opposite poles of the cell by the spindle fibers.
  - **Telophase:** The separated chromatids reach opposite poles, nuclear membranes re-form around each set of chromosomes, and the chromosomes begin to uncoil back into chromatin.
- **Cytokinesis** follows mitosis and involves the division of the cytoplasm, resulting in two genetically identical daughter cells.

**The Cell Cycle Checkpoints:**

- **G₁/S Checkpoint:** Ensures the cell is ready for DNA synthesis. The cell checks its size, nutrient availability, and the integrity of the DNA before proceeding to the S phase.
- **G₂/M Checkpoint:** Ensures that DNA replication is complete and checks for DNA damage before the cell proceeds to mitosis.
- **Metaphase Checkpoint:** Ensures that all chromosomes are properly attached to the spindle fibers before the cell proceeds to anaphase.

**Cell Cycle Regulation:**

- The cell cycle is regulated by specific proteins, such as **cyclins** and **cyclin-dependent kinases (CDKs)**. These molecules act at various points in the cycle to ensure smooth progression from one phase to the next.
Any errors in cell cycle regulation can lead to uncontrolled cell division, which is often associated with cancer. - The **core machinery** of the cell cycle is largely conserved throughout eukaryotes (from yeast to humans). - Multicellular organisms in particular have extensive and sophisticated control mechanisms (**checkpoints**) that verify that one step (e.g. DNA replication) is complete and error-free before the cell proceeds to the next step (e.g. segregation of the chromosomes). **Yeast as a model organism:** Why yeast is a convenient model system: - Single cell - Haploid in the vegetative state - Genome very well characterized - Easy to grow - Very well established experimental platform ![](media/image15.png)**CDC28 (CDK1) -- regulation** **Cell Cycle Phases:** The cell cycle is divided into distinct stages, which the CDK1 complexes control as follows: - **G1 Phase (Gap 1)**: - This phase is represented at the far left of the diagram, where the cell grows and prepares for DNA synthesis. - The *Cln3/Cdc28* complex acts first to initiate progression through G1, followed by *Cln1/Cln2/Cdc28*, ensuring that the cell commits to division. - **S Phase (Synthesis)**: - The middle of the diagram corresponds to DNA replication (S phase), where the *Clb5/Clb6/Cdc28* complexes are essential. These cyclin-CDK complexes initiate DNA synthesis and ensure that the genetic material is copied correctly. - **M Phase (Mitosis)**: - On the right side of the figure, the cell prepares for mitosis with the *Clb1/Clb2/Cdc28* complexes facilitating entry into and progression through M phase. - These complexes ensure that chromosomes are segregated correctly and that the cell undergoes cytokinesis, producing two daughter cells. **Inhibitory Signals:** - A red arrow highlights a point of negative regulation, possibly indicating a checkpoint or inhibitory mechanism that ensures the cell cycle does not proceed in the presence of DNA damage or incomplete replication.
The term **just-in-time synthesis** refers to the regulated production of proteins and enzymes at the precise moment they are needed during the cell cycle. The diagram is organized around the four major phases of the cell cycle: **G₁ phase**, **S phase**, **G₂ phase**, and **M phase** (mitosis). It highlights the different protein complexes and regulators that are synthesized \"just in time\" for each phase of the cell cycle, ensuring that each phase proceeds smoothly and in order. - This ensures that cellular resources are efficiently utilized, and the timing of key processes is tightly controlled. - The figure demonstrates how specific groups of proteins are synthesized at different stages of the cell cycle to support various critical cellular events such as DNA replication, chromatin regulation, cell division, and cytokinesis. **Arrest-and-release synchronization** - **Temperature-sensitive mutants:** Unable to pass a certain point in the cell cycle at high temperatures. - **Alpha factor:** Arrests all cells at the same stage by triggering the mating response. **Yeast mating types:** In yeast, specifically *Saccharomyces cerevisiae*, there are two mating types: **\"a\"** and **\"α\"** (alpha). - These mating types are haploid, meaning they have one set of chromosomes. - Yeast cells can switch between these two mating types, and cells of opposite mating type can mate to form diploid cells, which contain two sets of chromosomes. - This process is a fundamental aspect of yeast biology and is crucial for sexual reproduction and genetic diversity. **Time-series vs. comparative studies** **Comparative Studies:** These experiments focus on identifying **differentially expressed genes** by comparing two conditions, such as **Condition A vs. Condition B**. - The goal is to pinpoint genes that show significant differences in expression levels between the two states.
- **Typical Analysis**: - Statistical tests, such as the **t-test**, are used to determine whether the observed differences are statistically significant. **Time-Series Experiments:** These studies investigate how **gene expression changes over time**, aiming to identify genes that exhibit specific expression patterns across different time points. - Unlike comparative studies, a simple t-test is **not applicable**, as the focus is not on comparing two groups but on analyzing trends and patterns over time. - **Typical Analysis**: - Techniques such as clustering, regression models, or pattern recognition algorithms are used to detect and characterize these dynamic expression changes. **periodicity analysis of microarray data** **Is the gene significantly regulated?** - **p(regulation)**: This p-value is used to determine whether a gene is significantly regulated during the cell cycle. - The method compares the **actual standard deviation** of the gene expression data to a **randomly sampled background distribution**. - If the gene\'s expression varies significantly from the random background, it is considered significantly regulated. **Is the expression pattern periodic?** - **p(periodicity)**: This p-value is used to determine if the gene\'s expression follows a periodic pattern. - The method compares the **actual Fourier signal** (which captures periodic components in a dataset) to a distribution generated by **randomly shuffling the data points** within each expression profile. - A periodic pattern suggests that the gene expression rises and falls in a regular manner, corresponding to specific phases of the cell cycle. - **Fourier analysis** is used here as it helps detect periodic signals by breaking down the time series of gene expression into its constituent frequencies, which are then compared to random permutations. 
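The shuffling test for p(periodicity) described above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the lecture or its source publication: the function names, the use of a single known period, the number of permutations, and the toy expression profile are all assumptions made for the example.

```python
import cmath
import math
import random


def fourier_score(times, values, period):
    """Magnitude of the Fourier component of a mean-centred profile at one period."""
    mean = sum(values) / len(values)
    omega = 2 * math.pi / period
    total = sum((v - mean) * cmath.exp(-1j * omega * t)
                for t, v in zip(times, values))
    return abs(total) / len(values)


def periodicity_p_value(times, values, period, n_perm=1000, seed=0):
    """Empirical p-value: how often a shuffled profile scores at least as high."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = fourier_score(times, values, period)
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # destroys the time ordering, keeps the values
        if fourier_score(times, shuffled, period) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical profile: 12 samples, 10 minutes apart, assumed 60-minute cycle.
times = list(range(0, 120, 10))
profile = [math.cos(2 * math.pi * t / 60) for t in times]
print(periodicity_p_value(times, profile, period=60))
```

Shuffling keeps the set of expression values (and therefore the standard deviation) unchanged, so this test isolates the timing of the changes from their size, which is why regulation is scored separately.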
**Combined Score with Penalty Function** - **Combi score**: This combined score is the result of integrating both p-values from the regulation and periodicity tests. - It incorporates a **penalty function**, which ensures that only genes that are both significantly regulated and show a periodic pattern receive a high score. - This approach reduces false positives, ensuring that genes flagged as periodic are not only fluctuating but also showing biologically relevant and significant patterns. Lecture 05 (October 3) - Yeast Systems Biology 2 ------------------------------------------------ **Ontology:** - An ontology is a structured framework that represents a set of concepts within a domain and the relationships between those concepts. - **Well-defined Terms**: Each term is precisely defined to avoid doubt. This specificity ensures that all users understand the terms in the same way. - **Well-defined Relationships**: Ontologies specify how different terms relate to each other, which helps clarify their interactions and hierarchies. **Gene Ontology (GO):**\ The Gene Ontology (GO) is a structured framework in molecular biology providing a standardized vocabulary for describing the functions of genes and their products, enabling consistency in data sharing and analysis. 1. **Set of Terms and Relationships**:\ GO comprises terms describing molecular functions, biological processes, and cellular components. Each term is connected to others in a hierarchical structure that mirrors biological relationships. 2. **Standardized Vocabulary**:\ GO serves as a common language that reduces ambiguity and improves reproducibility in research, ensuring consistent usage across studies and databases. 3. **Gene Annotation**:\ GO is used to annotate genes and their products, linking them to relevant functions and roles in biological processes. This allows researchers to analyze genes in specific biological contexts. 
**Ontologies in GO:**\ The GO classification system consists of three main categories: - **Biological Process (BP)**: Represents chains of molecular events or activities carried out by gene products, such as cell division, signal transduction, or metabolic pathways. - **Molecular Function (MF)**: Describes the fundamental activities performed by gene products, such as catalysis or binding to molecules. - **Cellular Component (CC)**: Indicates where a gene product is active within the cell, such as organelles (e.g., nucleus, mitochondria) or general areas like the cytoplasm. ![](media/image18.png)Together, these categories provide a consistent vocabulary for cross-species research in genomics and bioinformatics. **Relationships Between Categories**:\ GO terms are connected through relationships that help define how gene functions occur within cellular contexts: - **Molecular Function**: The basic activities of gene products, which occur in specific cellular components and contribute to biological processes. - **Biological Process**: Chains of molecular events involving multiple gene products, localized within specific cellular components. - **Cellular Component**: Locations within the cell where gene products are active, providing a context for where functions and processes take place **Types of Relationships**: - **\"is\_a\"**: Indicates that one term is a subtype of another (e.g., \"kinase activity\" is a type of \"catalytic activity\"). - **\"part\_of\"**: Describes when a term is part of a larger process or structure (e.g., \"nucleus\" is part of the \"cell\"). - **\"has\_part\"**: Indicates a term includes another smaller component. - **\"regulates\"**: Defines when one process or function affects another, such as upregulation or downregulation of gene expression. 
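As a rough illustration of how \"is\_a\" and \"part\_of\" relationships link terms, the sketch below stores a handful of edges and walks upward from a term to collect its ancestors. The edge list is a simplified toy built from the examples above, not real GO data, and the function name `ancestors` is an assumption for this example.

```python
from collections import deque

# Toy edges as (child, relationship, parent); simplified, not the real GO graph.
EDGES = [
    ("protein kinase activity", "is_a", "kinase activity"),
    ("kinase activity", "is_a", "catalytic activity"),
    ("catalytic activity", "is_a", "molecular_function"),
    ("nucleus", "part_of", "cell"),
]


def ancestors(term, edges, relations=("is_a", "part_of")):
    """Return every term reachable from `term` by following the given relations upward."""
    parents = {}
    for child, rel, parent in edges:
        if rel in relations:
            parents.setdefault(child, set()).add(parent)
    seen = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(ancestors("protein kinase activity", EDGES)))
print(sorted(ancestors("nucleus", EDGES)))
```

Walking upward like this is what allows a gene annotated with a specific term to be counted under the broader terms above it as well.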
![](media/image20.png) **Structure of GO:** The Gene Ontology (GO) is structured hierarchically, allowing for the organization of biological terms in a manner that reflects their relationships and functions within biological systems. **Large scale data analysis** **GO Slims**:\ GO Slims are simplified versions of the Gene Ontology, providing a high-level overview of key biological concepts without the details of the full ontology. They are used for summarizing gene annotations in broad categories and are available in both generic and organism-specific versions (e.g., goslim\_plant, goslim\_yeast). **Advantages of GO Slims**: - They provide a broad overview of biological functions, reducing complexity and making it easier to summarize large datasets. - Useful for organizing high-level summaries of cellular components and biological activities, focusing on the main biological categories rather than detailed subcategories. **Overrepresentation / enrichment analysis** measures whether the observed number of proteins or genes in a particular category exceeds the number that would be expected by random chance. - **Formula:** Enrichment=Number observed/Number expected - The analysis aims to answer the question: **Is a larger proportion of genes in the subset involved in a specific compartment, process, or function than expected by random chance?** - If the observed number is much higher than expected, the category is considered enriched, indicating a potential biological significance. **Statistical Test:** - Once enrichment is identified, a **statistical test** is applied to assess how unexpected the enrichment is, or whether it could be due to chance. - Common statistical tests include: - **Chi-squared test**: This test is used to assess whether the observed frequencies differ significantly from the expected frequencies. 
- **Fisher's exact test**: Often used when sample sizes are small, this test is appropriate for analyzing contingency tables and determining if there is a significant association between two categorical variables (e.g., whether genes in a subset are more likely to belong to a particular functional category than expected). **Fisher's exact test in R:** Fisher\'s Exact Test is a statistical method used to determine if there are nonrandom associations between two categorical variables. It is particularly useful for small sample sizes or when the assumptions of the Chi-square test are not met. Here, the matrix represents: - **Row 1**: Receptors (5 in cluster, 2 not in cluster) - **Row 2**: Non-receptors (8 in cluster, 37 not in cluster) 1. **Conducting the Test**: - The `fisher.test()` function is then called on the matrix: `fisher.test(m)` 2. **Results of the Test**: The output provides: - **p-value**: 0.007641. This value is the probability of observing a table at least as extreme as the data, assuming the null hypothesis (no association) is true. A low p-value (typically \< 0.05) suggests that the observed association is statistically significant. - **Alternative Hypothesis**: States that the true odds ratio is not equal to 1, i.e. that the two variables are associated. - **Confidence Interval**: 1.465373 to 132.948587. This interval (95% by default) provides a range for the odds ratio, indicating the uncertainty around the estimated effect size. - **Odds Ratio**: 10.81661. This estimate indicates that the odds of being a receptor in the cluster are approximately 10.82 times greater than the odds of being a receptor outside the cluster. ![](media/image22.png) - The table is divided into four cells as follows: - a and b: Row X for the presence or absence of condition Y. - c and d: Row $\bar{X}$ (negation of X) for the presence or absence of Y. - The columns correspond to Y (the outcome or condition) and $\bar{Y}$ (the absence of the outcome).
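To make the receptor example concrete, here is a minimal Python sketch of a two-sided Fisher's exact test on the same 2×2 table, together with the observed/expected enrichment ratio. It is written in plain Python for illustration and is not the R code from the lecture; the function names are assumptions. Note that the plain cross-product ratio computed here is not the same quantity as the odds ratio reported by R, which is a conditional maximum-likelihood estimate.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum all tables that are no more probable than the observed one.
    p_value = sum(prob(x) for x in range(lo, hi + 1)
                  if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p_value)


def enrichment(a, b, c, d):
    """Observed count in the top-left cell divided by its expected count."""
    expected = (a + b) * (a + c) / (a + b + c + d)
    return a / expected


# Rows: receptors, non-receptors. Columns: in cluster, not in cluster.
a, b, c, d = 5, 2, 8, 37
print(fisher_exact_two_sided(a, b, c, d))  # compare with R's reported p-value
print(enrichment(a, b, c, d))              # observed / expected receptors in cluster
print((a * d) / (b * c))                   # sample cross-product odds ratio
```

With these counts, 1.75 receptors would be expected in the cluster by chance (13 × 7 / 52), against 5 observed.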
The odds ratio, $OR = \frac{a/b}{c/d} = \frac{ad}{bc}$, measures the ratio of the odds of an outcome occurring in the presence of an exposure (or condition) compared to the odds of the outcome occurring in the absence of the exposure. Lecture 06 (October 10) - Yeast Systems Biology 3 ------------------------------------------------- **Correlation between mRNA and protein abundance** ![](media/image24.png) **Scatter Plot**: - The left side of the figure displays a scatter plot where: - **X-axis**: Represents the number of mRNA copies per cell (on a logarithmic scale). - **Y-axis**: Represents the number of protein copies per cell (also on a logarithmic scale). - Each point in the plot corresponds to a specific gene, indicating its mRNA and protein levels. - The general trend shows a positive correlation between mRNA and protein abundance, suggesting that higher levels of mRNA typically lead to increased protein production. However, this correlation is not perfect, indicating that other factors can influence protein levels. **Histograms**: - The right side of the figure features two histograms: - **Blue Histogram**: Represents the distribution of mRNA copies per cell. - **Median**: 17 copies of mRNA per cell. - **Red Histogram**: Represents the distribution of protein copies per cell. - **Median**: 50,000 copies of protein per cell. The histograms highlight the differences in abundance between mRNA and protein, with proteins generally being present in much higher copy numbers than their corresponding mRNA transcripts. **Conclusion:** The correlation between mRNA and protein abundance is a foundational concept in systems biology and gene expression analysis. It underscores the complexity of cellular regulation and the need to consider multiple layers of biological data for a comprehensive understanding of cellular processes.
This graph shows the growth of **public RNA-seq datasets**  - The data is divided into two categories: - **Bulk RNA-seq** (shown in red): This refers to RNA-seq performed on pooled cell populations. - **Single-cell RNA-seq** (shown in blue): This is RNA sequencing performed on individual cells, providing higher resolution at the cellular level. - The number of public RNA-seq datasets has grown exponentially over time, especially since 2014, with bulk RNA-seq dominating the early years. - By 2020, single-cell RNA-seq shows significant growth, but bulk RNA-seq remains the larger portion of available datasets. **Comparison:** - **Transcriptomics** datasets have seen continuous and rapid growth, particularly driven by the adoption of single-cell technologies in recent years. - **Proteomics** datasets also show a sharp rise, but with a slight downturn after 2020, possibly due to a temporary focus shift during the pandemic or changes in available resources. Within transcriptomics, various methods are used to quantify RNA transcripts to study gene expression levels: **Methods to Quantify Transcripts:** - **Probe-based Methods**: These include techniques like microarrays, where probes designed for specific RNA sequences bind to complementary transcripts. The intensity of the signal reflects the amount of RNA present. Microarrays are useful for analyzing gene expression across large sets of known genes. - **Sequencing-based Methods**: RNA sequencing (RNA-Seq) is the most common sequencing-based approach. In RNA-Seq, RNA is converted to cDNA, then sequenced to quantify RNA levels. This method is powerful because it can detect known and novel transcripts, and quantify a wide range of expression levels with high precision. 
**Gene Set Enrichment Analysis (GSEA)** Gene Set Enrichment Analysis (GSEA) is a powerful tool for interpreting transcriptomic data by determining whether specific gene sets show statistically significant differences in expression under different conditions **Two Methods of GSEA:** 1. **Overrepresentation Analysis (ORA/OA)**: - This method focuses on whether a set of genes is **overrepresented** among a list of differentially expressed genes. In this approach, genes are typically divided into two categories (e.g., significant vs. not significant), and the analysis tests whether the genes in the set appear more frequently in the significant group than expected by chance. 2. **Functional Class Scoring (FCS)**: - FCS involves scoring each gene set based on the collective behavior of all the genes in the set, rather than focusing solely on differentially expressed genes. This method takes into account **the entire range of expression values** across the dataset. ![](media/image26.png) - **Example**: Instead of selecting only differentially expressed genes, FCS assesses whether a predefined set of genes, as a whole, tends to have higher or lower expression values in one condition compared to another. **Overrepresentation Analysis (OA):** - **Method Overview:**\ Quantifies whether a gene set is overrepresented in the significant group compared to the background, using statistical tests like Fisher's exact test or chi-square test to assess enrichment. - **Pros:** - Simple: Focuses on presence/absence of genes in the significant group. - Widely applicable: Suitable for diverse gene sets and experimental conditions. - **Cons:** - Binary classification: Requires a strict significance threshold (e.g., p-value cutoff), potentially missing borderline significant or biologically relevant genes. - Ignores gene ranking: Does not account for effect size, fold change, or other differential expression measures. - Sensitive to threshold: Results vary with chosen significance cutoff. 
**Functional Class Scoring (FCS):** - **Method Overview:**\ Analyzes how the ranking of genes in a gene set (based on expression or other measures) compares to random expectations. Evaluates the entire dataset instead of separating significant and non-significant genes. Methods like Gene Set Enrichment Analysis (GSEA) rank all genes and determine if genes in a set cluster at the top or bottom of the ranked list compared to random chance. - **Pros:** - No arbitrary thresholds: Considers all genes based on ranking or biological contribution. - More sensitive: Detects subtle, coordinated changes in gene sets. - Ideal for complex data: Effective for high-throughput datasets (e.g., RNA-seq). - **Cons:** - Complexity: Requires sophisticated algorithms and is harder to interpret. - Data-intensive: Needs comprehensive gene rankings, which may not always be available. - Ranking bias: Sensitive to how rankings are generated (e.g., fold change vs. p-value). Blok 3 ====== Lecture 08 (October 31) - Systems Biology in Biomedical Research (Heart diseases) 1 ----------------------------------------------------------------------------------- **The human heart and congenital heart defects** 1. **Anatomy and Function**: - The human heart is a muscular organ roughly the size of a fist, located slightly left of the center of the chest. - It consists of four chambers: two upper chambers (atria) and two lower chambers (ventricles). - The right side of the heart receives deoxygenated blood from the body and pumps it to the lungs for oxygenation. The left side receives oxygen-rich blood from the lungs and pumps it out to the rest of the body. **Congenital Heart Defects (CHDs)** 1. **Definition and Overview**: - Congenital heart defects are structural problems with the heart that are present at birth. They affect the heart\'s walls, valves, or blood vessels. - CHDs are among the most common birth defects, affecting nearly 1 in 100 live births. 2.
**Types of Congenital Heart Defects**: - **Atrial Septal Defect (ASD)**: A hole in the wall that separates the two upper chambers of the heart (atria). - **Ventricular Septal Defect (VSD)**: A hole in the wall that separates the two lower chambers of the heart (ventricles). - **Tetralogy of Fallot**: A combination of four heart defects that result in insufficient oxygenated blood reaching the body. - **Coarctation of the Aorta**: A narrowing of the aorta, which can increase blood pressure in the heart and lead to heart failure. The table divides the CHDs into different phenotypic stages based on the timing of their appearance during heart development: **Early, Intermediate, Late phenotypes,** and **Cardiomyocyte growth and organization**. **Early Phenotypes** These defects manifest in the early stages of heart development, primarily affecting the initial formation and structure of the heart tube and looping processes. Early phenotypes often disrupt foundational processes, leading to more complex abnormalities later. **Intermediate Phenotypes** Intermediate phenotypes occur as the heart structure further develops and differentiates. These defects often involve the heart's internal structures, such as septa and valves, which are crucial for directing blood flow. **Late Phenotypes** Late phenotypes manifest during the final stages of heart development and often involve more specific structural features, such as the outflow tracts and ventricular structures. These defects may affect blood flow dynamics and heart efficiency. **Cardiomyocyte Growth and Organization** ![](media/image28.png)This category includes defects related to the growth and structural organization of cardiomyocytes (heart muscle cells), affecting the heart\'s functional capacity and resilience. 
**The Cardiovascular system** The cardiovascular system is essential for sustaining life, as it ensures the continuous movement of blood, maintaining the supply of oxygen and nutrients to tissues while removing carbon dioxide and other waste products. - Veins and arteries play complementary roles, with veins carrying oxygen-poor blood back to the heart and arteries distributing oxygen-rich blood throughout the body. The cardiovascular system relies on the coordinated functions of arteries, veins, and capillaries to ensure efficient blood flow and nutrient delivery throughout the body. Arteries distribute oxygen-rich blood, veins return deoxygenated blood, and capillaries allow for essential exchanges between blood and tissue cells. This system plays a crucial role in maintaining the body's internal environment and supporting cellular functions through continuous blood circulation. ![](media/image30.png) **Ventricular septal defect (VSD)** **A) Normal Heart** 1. **Blood Flow in a Healthy Heart**: - In a normal heart, the right side (right atrium and right ventricle) receives deoxygenated blood from the body. This blood is pumped to the lungs via the pulmonary artery, where it becomes oxygenated. - Oxygenated blood returns to the left side of the heart (left atrium and left ventricle) and is pumped out through the aorta to supply the body with oxygen-rich blood. 2. **Ventricular Septum**: - The ventricular septum is the thick wall that divides the left and right ventricles, preventing any mixing of oxygenated and deoxygenated blood between these chambers in a healthy heart. - This separation ensures that deoxygenated blood flows toward the lungs and oxygenated blood is directed to the body. **(B) Heart with Ventricular Septal Defect** 1. **Septal Opening**: - A VSD is an opening in the ventricular septum, which can vary in size and location. 
The defect allows blood to pass from the left ventricle (high-pressure oxygenated blood) to the right ventricle (lower-pressure deoxygenated blood). - This mixing of blood reduces the overall oxygen content that reaches the body and increases the workload on the heart. 2. **Blood Mixing and Circulatory Impact**: - Due to the pressure gradient, oxygenated blood from the left ventricle leaks into the right ventricle. This leads to mixing of oxygen-rich and oxygen-poor blood. - This abnormal flow can lead to \"left-to-right shunting,\" where oxygenated blood recirculates back to the lungs instead of being distributed to the body, leading to an inefficient use of oxygen. - The increased volume of blood passing through the pulmonary circulation can lead to over-circulation in the lungs and potentially cause complications like pulmonary hypertension. 3. **Alternate Locations for VSD**: - VSDs can occur at different points within the ventricular septum. Some defects may be close to the atrioventricular (AV) valves, while others might be more centrally located. The location and size of the defect determine the severity and the impact on heart function. - In the illustration, one defect is shown in the middle of the septum, while another possible location is highlighted lower on the ventricular septum. **Heart development:** **1. Specification of Cardiac Precursor Cells** - **Definition**: Cardiac precursor (progenitor) cells are specialized mesoderm-derived cells destined to become cardiomyocytes (heart muscle cells), endothelial cells (vessel lining), and fibroblasts (supporting cells). - **Embryonic Origin**: These cells arise from mesodermal progenitor fields during early development. - **Key Signaling Pathways**: - **Bone Morphogenetic Proteins (BMPs)**: Crucial for early heart development and differentiation. - **Wnt Signaling**: Regulates cell fate and cardiac precursor specification. 
- **Nodal and FGF**: Drive mesoderm formation and promote cardiac tissue development. **2. Migration and Fusion of Cardiac Precursor Cells into the Primitive Heart Tube** After specification, cardiac precursor cells migrate centrally to form the **primitive heart tube**, the earliest heart structure. The primitive heart tube is the early form of the heart, a simple tubular structure that eventually undergoes complex transformations to become the fully developed, multi-chambered heart. The tube consists of several regions: - **Sinus venosus**: the inflow tract, which will later contribute to parts of the atria. - **Atrium**: the early stage of the upper chambers of the heart. - **Ventricle**: the precursor to the lower chambers of the heart. - **Bulbus cordis**: a region that will add to the right ventricle and outflow tracts. **3. Heart Looping** - **Process**: The straight primitive heart tube twists and folds, creating a three-dimensional structure that resembles the mature heart\'s shape. - **Purpose of Looping**: Ensures correct alignment of heart chambers and inflow/outflow tracts. - **Outcomes**: Proper looping establishes distinct chambers and supports coordinated contraction. **4. Heart Chamber Formation**: Following heart looping, the heart tube begins to develop into four distinct chambers: two atria and two ventricles. Septa (walls) form to divide the heart internally: - **Septum Primum and Septum Secundum**: These structures contribute to the formation of the interatrial septum, separating the right and left atria. - **Interventricular Septum**: This septum forms to separate the right and left ventricles, guided by the growth of endocardial cushions and other structural changes within the heart. Additionally, **valves** begin to develop to ensure unidirectional blood flow between the chambers. These include the atrioventricular valves (mitral and tricuspid) and semilunar valves (aortic and pulmonary). 
As chamber formation progresses, the heart starts to exhibit distinct pathways for blood flow, supporting the development of pulmonary and systemic circulations. **5. Septation and Valve Formation**: Septation is the process of forming internal walls (septa) within the heart to completely separate the chambers. **The atrial septum** divides the atria and **The ventricular septum** divides the ventricles. Valves also begin to form at this stage, ensuring unidirectional blood flow between the chambers and out through the major arteries. Proper septation and valve formation are essential for the heart's functionality, as they prevent the mixing of oxygenated and deoxygenated blood and maintain efficient circulation. **Human protein interaction networks (PINs)** PINs map how proteins interact to carry out cellular functions, such as signaling, metabolism, gene regulation, and growth. **Protein interaction networks** **Yeast Protein Interaction Network** - **Organism**: The network shown represents protein interactions in **yeast** (Saccharomyces cerevisiae), a model organism widely studied due to its relatively simple structure, fast growth, and genetic similarities to more complex organisms. - **Network Structure**: The yeast protein interaction network is composed of clusters and sub-clusters, each depicting proteins with higher interaction frequencies among themselves. These clusters provide insight into **protein complexes** and **functional modules** that carry out specific tasks within the cell. **Key Features of the Network** 1. **Clusters**: - **Highly Connected Clusters**: Some regions of the network show tightly interconnected proteins, forming dense clusters. These clusters often correspond to **protein complexes** or **functional units** where multiple proteins work closely together to perform a single function. 
- **Example Functions**: In yeast, such clusters might represent complexes involved in DNA replication, transcription, metabolic pathways, or cellular structure. 2. **Hubs**: - **Central Nodes (Hubs)**: Certain proteins, known as **hubs**, have many connections to other proteins. These are typically essential proteins involved in multiple cellular pathways, reflecting their importance in maintaining network integrity. - **Role of Hubs**: Hub proteins often act as central regulators or structural components within the network. They play critical roles in signaling pathways, **The InWeb** The InWeb is a large-scale resource for **protein-protein interaction (PPI)** data, integrating multiple databases to create a curated and comprehensive network of human protein interactions. It is frequently updated and widely used in research on cellular functions, disease mechanisms, and drug discovery. **1. Data Sources and Integration** - InWeb integrates data from numerous established PPI databases, including: - **MINT** (Molecular INTeraction Database) - **BIND** (Biomolecular Interaction Network Database) - **DIP** (Database of Interacting Proteins) - **GRID** (General Repository for Interaction Datasets) - **HPRD** (Human Protein Reference Database) - **KEGG** (Kyoto Encyclopedia of Genes and Genomes) - **REACTOME** (a pathway database) - **IntAct** (an open-source molecular interaction database) - **Curated sources** to ensure data accuracy and reliability. These sources cover experimentally validated and computationally predicted interactions, providing a broad and reliable view of PPIs. **2. Process of PPI Transfer and Transformation:** - PPI data is transferred between organisms, often to humans, based on high similarity in protein structure and function, enabling inference of interactions through conserved functions across species. - Data is reformatted into a standardized structure for consistency and ease of analysis. **3. 
Automated Scoring of Interactions:** - InWeb uses automated scoring to evaluate interactions, considering factors such as experimental reliability, multiple source confirmations, and evolutionary conservation. - This prioritizes high-confidence interactions, focusing research on the most relevant data. **4. Regular Updates and Scale of the Network:** - InWeb includes \~433,000 interactions among 10,300 human proteins and is updated biannually, providing a comprehensive framework for studying protein networks. **The network will contain many false positives and false negatives. It is a possibility space.** **Properties of disease proteins** Disease-associated proteins often differ from non-disease proteins in **protein interaction networks (PINs)**, offering insights into how their dysfunction contributes to disease. Key properties include: - **Centrality**: Disease proteins frequently act as **hubs**, interacting with many other proteins, which places them in crucial positions within cellular pathways. Mutations in these proteins can have widespread effects. - **Essentiality**: Many disease-related proteins are essential for cell survival and function. Disruption of these proteins often leads to severe cellular dysfunction, contributing to diseases like cancer or neurological disorders. **Network modularity** Modularity refers to the division of a network into smaller, densely connected **clusters** or **modules**, where proteins interact more frequently within the same module than with those in other parts of the network. - **Functional Significance**: Modules often align with specific **biological functions** or processes, such as signaling pathways or metabolic activities. - **Disease Modules**: Disease-associated proteins often cluster within specific modules, forming **disease modules**. These modules highlight groups of proteins whose dysfunction contributes to particular diseases, such as Alzheimer's or other neurodegenerative disorders. 
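The idea that proteins interact more often within a module than across modules can be checked by simple counting. The sketch below is only an illustration: the protein names, the edge list, and the module assignment are hypothetical, and `module_edge_stats` is a helper invented for this example, not part of InWeb or any other tool.

```python
def module_edge_stats(edges, modules):
    """Count edges that stay inside one module vs. edges that cross modules."""
    # Map each protein to the name of the module it belongs to.
    node_to_module = {
        node: name for name, members in modules.items() for node in members
    }
    within = between = 0
    for a, b in edges:
        if node_to_module[a] == node_to_module[b]:
            within += 1
        else:
            between += 1
    return within, between


# Hypothetical toy network: two triangles joined by a single edge.
edges = [
    ("A", "B"), ("A", "C"), ("B", "C"),   # module m1
    ("D", "E"), ("D", "F"), ("E", "F"),   # module m2
    ("C", "D"),                           # bridge between the modules
]
modules = {"m1": {"A", "B", "C"}, "m2": {"D", "E", "F"}}

within, between = module_edge_stats(edges, modules)
print(f"within-module edges: {within}, between-module edges: {between}")
```

A partition with many within-module edges and few between-module edges is what the notes call modular; real analyses use more formal scores, which are not covered here.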
**The Local Hypothesis** The local hypothesis posits that proteins involved in the same disease tend to interact closely within the network, forming distinct clusters called **disease modules**. - **Implications**: According to this hypothesis, if a protein is known to be associated with a disease, its interaction partners are more likely to be implicated in the same or related disease processes. This principle aids in identifying disease-associated proteins and understanding the molecular basis of diseases through network analysis. **Network Modularity: Types of Modules** **Disease Module**: A disease module consists of network components (e.g., proteins or genes) that collectively contribute to a specific cellular function or pathway. Disruption of this module can lead to disease. - **Characteristics**: Tightly interconnected components working together to maintain cellular homeostasis. Disruptions (e.g., mutations, protein misfolding) can trigger chain reactions, leading to pathological states. - **Significance**: Identifying disease modules aids in understanding disease mechanisms and developing targeted therapies. Example: Alzheimer's disease modules reveal proteins or pathways involved in neurodegeneration. **Functional Module**: Functional modules are groups of nodes (e.g., proteins, genes) within a network that share a similar function or are involved in a common biological process. - **Characteristics**: Represent specific functions like DNA repair, cell cycle regulation, or metabolism. Nodes interact frequently, reflecting their cooperative role. - **Applications**: Helps predict unknown protein functions, study biological organization, and identify critical players in cellular processes. **Topological Module**: A topological module is a dense cluster within a network where nodes have a higher proportion of links (connections) to other nodes within the module than to nodes outside the module. 
- **Characteristics**: Topological modules are defined by structural density rather than specific biological function. Often corresponds to functionally or evolutionarily conserved regions in the network. **Identification of modules** **Step-by-Step Process for Module Identification** 1. ![](media/image32.png)**Selection of Candidate Proteins**: - **Identify Proteins**: Select proteins of interest based on their association with a biological function, pathway, or disease. - **Relevance**: Careful selection ensures the analysis focuses on proteins likely involved in the process under study. 2. **Interaction Mapping Using InWeb**: - **Database Query**: Use the InWeb database to extract protein-protein interaction (PPI) data for the candidate proteins and their interaction partners. - **Visual Representation**: Interaction networks are color-coded (e.g., green for candidate proteins and blue for interaction partners) for clarity and easier interpretation. 3. **Network Formation**: - **Create Network**: Construct a network where nodes represent proteins, and edges represent interactions. - **Complexity**: Networks often contain numerous nodes and edges, requiring further analysis to identify functionally related clusters. 4. **Module Identification Using MCODE**: - **Clustering**: Apply the MCODE algorithm to identify **topological modules**---densely connected sub-clusters in the network. - **Output**: MCODE highlights modules likely to represent functional groups, providing insights into biological processes or pathways. **MCODE** MCODE (Molecular Complex Detection) is a widely used algorithm in network analysis, specifically designed to identify densely connected regions or clusters within networks. These clusters, or \"complexes,\" are often interpreted as functionally related groups, such as protein complexes in biological networks. **Three Stages of MCODE** MCODE detects complexes in a network through a three-step process: 1. 
**Node Weighting**:\ Each node is assigned a weight based on its \"core-clustering coefficient,\" reflecting its connectivity within its local neighborhood. Nodes in dense regions receive higher weights, indicating their likelihood of being part of a complex. 2. **Complex Prediction**:\ Highly weighted nodes are selected as \"seeds,\" and the algorithm expands around them by adding directly connected nodes that meet specific connectivity criteria. This forms predicted complexes. 3. **Post-processing**:\ Predicted complexes are refined by removing loosely connected nodes that do not meet the required density threshold, ensuring robust and tightly connected clusters. **Scoring and Ranking of Complexes**: - After identifying the complexes, MCODE assigns a score to each one based on its density and connectivity. - Complexes are then ranked according to their scores, allowing researchers to prioritize clusters that are more densely connected and potentially more biologically significant. Higher scores generally indicate a stronger, more cohesive complex, making it a candidate for further biological analysis. **Applications of MCODE**: - MCODE is widely used in biological network analysis, particularly in protein-protein interaction (PPI) networks, where researchers aim to identify protein complexes that play key roles in cellular functions. Beyond biology, MCODE\'s approach to detecting dense clusters can be applied to any network where identifying tightly connected groups is valuable, such as social networks or ecological networks. **Sticky Proteins and Topological Filtering** **Sticky Proteins**: - Highly abundant proteins that co-purify with others due to nonspecific binding, often creating experimental noise. - These interactions are not biologically relevant and can obscure genuine functional relationships. **Solution -- Topological Filtering**: 1. 
**Assumptions**: - Sticky proteins exhibit widespread, nonspecific interactions, while true protein-protein interactions occur within dense, functionally related clusters. 2. **Filtering Criteria**: - Non-seed proteins must direct a significant proportion (e.g., 10%--20%) of their interactions toward seed proteins to remain in the network. - This removes weak or nonspecific interactions, refining the network to focus on biologically meaningful connections. Lecture 9 (November 7) - Systems Biology in Biomedical Research (Heart diseases) 2 ---------------------------------------------------------------------------------- **p-value:** A p-value quantifies the probability of observing a test statistic as extreme or more extreme than the one obtained, assuming the **null hypothesis (H₀)** is true. - **Interpretation**: - **Low p-value (e.g., \< 0.05)**: Suggests that the observed data is unlikely under H₀, providing evidence to reject the null hypothesis in favor of the alternative hypothesis (H₁). - **High p-value**: Indicates that the observed data is consistent with H₀, meaning there is insufficient evidence to reject it. - **Purpose**: Helps determine the statistical significance of the results in hypothesis testing. **Null Hypothesis (H₀):** A default assumption that there is no effect, no relationship, or no difference between groups being studied. - **Purpose**: Provides a starting point for statistical testing, allowing researchers to evaluate whether evidence supports rejecting this assumption. **Alternative Hypothesis (H₁):** Opposes the null hypothesis, proposing that there **is** an effect, relationship, or difference. - **Purpose**: Represents what researchers aim to demonstrate. H₁ can be supported by evidence but never definitively proven true. **One-Tailed Test** Tests for an effect in one specific direction (greater or less than a value). - Example: Testing whether a drug increases blood pressure, without testing for a decrease. 
- **Use**: More sensitive to effects in one direction but inappropriate if effects could occur in both directions. **Two-Tailed Test** Tests for an effect in either direction (increase or decrease). - Example: Testing if a drug affects blood pressure in any way. - **Use**: Suitable when the effect's direction is unknown or any significant deviation is relevant. **E-value / Expect-value** The E-value, also known as the Expect-value, represents the number of unrelated hits (matches) that could be found by chance with an equal or better alignment score. - In simpler terms, it predicts how many random matches would appear in the database search with a score as high or higher than the given alignment score due to purely stochastic (random) factors. - **A lower E-value** indicates a more significant match, implying that the observed alignment is less likely due to chance. **k-core** A **k-core** is a concept in graph theory used to identify densely connected subgraphs within a larger network. It is particularly useful in network analysis for identifying robust, interconnected clusters of nodes. **Definition of k-core** - A **k-core** is defined as a subgraph in which each node has a degree of at least k within the subgraph. In other words, every node in the k-core is connected to at least k other nodes in that subgraph. - The **degree** of a node refers to the number of edges (connections) it has with other nodes. **Example in the Diagram** In the diagram: - The red nodes form a **2-core**, meaning each node in this subgraph has at least 2 connections (degree ≥2) with other nodes in the subgraph. - ![](media/image34.png)The **green node** is connected to only one node in this subgraph, which does not satisfy the requirement of having at least 2 connections within the 2-core. Therefore, it is **not part of the 2-core**. - Nodes that do not meet the degree requirement for a particular k-core are excluded from that k-core. 
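The k-core definition above translates directly into a pruning procedure: repeatedly remove any node whose degree inside the remaining subgraph is below k. The sketch below illustrates this in plain Python. The node names (`R1`..`R3` for the red nodes, `G` for the green node) and the exact edges are assumptions chosen to mimic the diagram, which is not reproduced in these notes.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbours)} from an edge list."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def k_core(adjacency, k):
    """Return the nodes of the k-core by repeatedly pruning nodes with degree < k."""
    # Work on a copy so the caller's adjacency map is left untouched.
    core = {node: set(neigh) for node, neigh in adjacency.items()}
    changed = True
    while changed:
        changed = False
        for node in list(core):
            if len(core[node]) < k:
                # Removing a node lowers the degree of each of its neighbours.
                for neigh in core[node]:
                    core[neigh].discard(node)
                del core[node]
                changed = True
    return set(core)


# Hypothetical layout: three red nodes form a triangle, the green node hangs off R1.
edges = [("R1", "R2"), ("R2", "R3"), ("R1", "R3"), ("G", "R1")]
adjacency = build_adjacency(edges)

print(sorted(k_core(adjacency, 2)))  # the triangle survives, G is pruned
print(sorted(k_core(adjacency, 3)))  # no node has 3 neighbours inside the core
```

Pruning has to be repeated until nothing changes, because removing one node can push a neighbour below the degree threshold.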
**What Does This Tell Us?** - **High Density**: A CCC (core clustering coefficient) of 0.83 suggests that the subgraph is very densely connected. In biological networks like protein-protein interaction networks, this high density can be indicative of a **tight protein complex**, where the proteins are strongly interacting with each other. - **Interpretation**: A high CCC value (close to 1) implies that most of the nodes within the subgraph are well connected, which is typical for functional modules or complexes in biological systems. **Highest k-core of Each Node:** - **Finding the Highest k-core**: The highest k-core of a node is the k-core with the largest value of **k** that still contains the node. - In this case, the node labeled **u** is part of a **3-core** (colored yellow), indicating that the highest k-core for node **u** is 3. - **kmax, u = 3**: This shows that the highest k-core for node **u** is **3**. It means node **u** is connected to at least 3 other nodes in the same subgraph. **Core Clustering Coefficient (CCC):** - **Definition**: The **Core Clustering Coefficient (CCC)** for a given k-core is the ratio of the actual number of edges in the subgraph (core) to the maximum possible number of edges in that subgraph. - **Formula**: CCC = (number of actual edges in the subgraph) / (total possible edges in the subgraph) - **Calculation**: - The network shown contains a **3-core** (yellow nodes), in which every node is connected to at least 3 other nodes of the core. - **Number of edges**: In the 3-core subgraph, there are **9 edges** (as shown by the solid lines between the yellow nodes). - **Number of possible edges**: A complete graph on *n* nodes has [*n*(*n* − 1)/2]{.math.inline} edges. The total of 10 possible edges used here corresponds to a core of 5 nodes: [5 \* 4/2 = 10]{.math.inline}. - The CCC for the 3-core is calculated as: [*CCC*~*u*~ = 9/10 = 0.9]{.math.inline} - This means the CCC of the 3-core is **0.9** (or 90%), indicating a very dense subgraph where most of the potential connections between nodes are present. 
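The CCC ratio can be reproduced with a few lines of Python. This is a sketch under an assumption: the ratio 9/10 in the calculation implies a core of 5 nodes (a complete graph on 5 nodes has 10 edges), so the example uses five hypothetical nodes `a`..`e` with one edge missing. The function name `core_clustering_coefficient` is chosen for this illustration only.

```python
from itertools import combinations


def core_clustering_coefficient(nodes, edges):
    """Ratio of edges present among `nodes` to the edges of a complete graph on them."""
    nodes = set(nodes)
    # Keep only edges with both endpoints inside the subgraph, ignoring duplicates
    # and self-loops.
    internal = {
        frozenset(edge)
        for edge in edges
        if set(edge) <= nodes and len(set(edge)) == 2
    }
    possible = len(nodes) * (len(nodes) - 1) // 2
    return len(internal) / possible if possible else 0.0


# Hypothetical 3-core: all 10 pairs among five nodes, except the pair (d, e).
nodes = ["a", "b", "c", "d", "e"]
edges = [pair for pair in combinations(nodes, 2) if pair != ("d", "e")]

print(len(edges))                                 # number of edges present
print(core_clustering_coefficient(nodes, edges))  # edges present / edges possible
```

In this toy core, nodes `d` and `e` each keep three neighbours, so the subgraph is still a 3-core even though one edge is missing.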
**The weight of a node u:** [*W*~*u*~ = *k*~*max*, *u*~ \* *CCC*~*u*~]{.math.inline} - Start with the node *u* that has the highest weight, [*W*~*u*~]{.math.inline}, in the network. - ![](media/image36.png)Recursively move outward from this node, including further nodes *i* in the complex as long as the weight of node *i* satisfies: [*W*~*i*~ ≥ *x* \* *W*~*u*~]{.math.inline} \ [0 \
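The seed-and-expand step described above can be sketched as follows. This is a simplified illustration, not the MCODE implementation: node weights are supplied directly as hypothetical numbers instead of being computed from each node's neighbourhood, the network is invented, and the cutoff `x` is assumed to be a fraction between 0 and 1.

```python
from collections import deque


def expand_complex(adjacency, weights, x):
    """Grow a complex from the highest-weight seed u, adding reachable nodes i
    whose weight satisfies W_i >= x * W_u."""
    seed = max(weights, key=weights.get)
    threshold = x * weights[seed]
    complex_nodes = {seed}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        for neigh in adjacency[node]:
            # Only step onto neighbours that pass the weight threshold.
            if neigh not in complex_nodes and weights[neigh] >= threshold:
                complex_nodes.add(neigh)
                queue.append(neigh)
    return seed, complex_nodes


# Hypothetical network and weights (W = kmax * CCC would be computed beforehand).
adjacency = {
    "u": {"a", "b"},
    "a": {"u", "b", "c"},
    "b": {"u", "a"},
    "c": {"a", "d"},
    "d": {"c"},
}
weights = {"u": 2.7, "a": 2.4, "b": 2.0, "c": 1.0, "d": 0.5}

seed, members = expand_complex(adjacency, weights, x=0.7)
print(seed, sorted(members))
```

A larger `x` makes the threshold stricter and yields smaller, denser complexes. A smaller `x` lets the complex grow into more weakly connected regions.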
