Lecture 9 - scRNA-Seq PDF
Mohawk College
Summary
This document presents a lecture on single-cell RNA sequencing (scRNA-seq). It covers topics such as the method's definition, importance, applications, data processing, and quality control.
Full Transcript
Lecture 9 – Single Cell RNA-Seq
BIOTECH 4BI3 - Bioinformatics

Where are we going?
[Course roadmap diagram: DNA Sequencing; Sequencing Quality Control; DNA Assembly; DNA Read Mapping; Genome Annotation; Expression Analysis; Marker-Trait Associations; Population Analysis; Genotyping; Polymorphism Discovery]

Learning Outcomes
Describe the basics of scRNA-seq molecular biology, understanding how gene expression is captured at the single-cell level.
Outline the main steps in scRNA-seq data processing, from quality control to normalization.
Identify methods used for differential gene expression (DGE) analysis in single-cell data.
Explain how cell lineage and RNA velocity techniques provide dynamic insights into cellular differentiation and development.
Apply pathway and functional enrichment analysis to interpret biological significance in identified cell clusters.

Introduction to scRNA-seq
Definition: A high-resolution method that analyzes gene expression at the level of individual cells.
Importance: Captures cellular heterogeneity and uncovers subpopulations within tissues.
Applications: Developmental biology, cancer research, immunology, neuroscience, signal transduction.

Why Single-Cell Analysis?
Bulk RNA-seq averages signals across many cells, potentially masking individual cell differences.
scRNA-seq isolates each cell’s transcriptome, uncovering variations and rare cell types.
Useful for complex tissues with diverse cell types.
Carr et al (2020) 10.3389/fmed.2020.00021

Applications of scRNA-seq in Research
Cell type identification: ScRNA-seq can reveal new cell types and identify different cell types and biomarkers.
Tissue heterogeneity: ScRNA-seq can characterize tissue heterogeneity by revealing rare cell types that can have a big impact on health and disease.
Drug target discovery: ScRNA-seq can help discover and develop new drug targets.
Cell development pathways: ScRNA-seq can reconstruct cell development pathways.
Immune profiling: ScRNA-seq can be used for immune profiling.
Cancer profiling: ScRNA-seq can be used to map and analyze copy number variation (CNV) in cancer.

Single Cell Library Platforms
There are a few different platforms to create sequencing libraries from single cells.
10X Genomics is the most popular.
There are a number of technologies used to create single-cell sequencing libraries:

Platform | Throughput | Sensitivity | Data Quality | Technical Considerations
10x Genomics Chromium | High (thousands of cells per run) | Moderate | Reliable, suited for high-throughput applications | High cost, easy to use, compatible with many downstream analyses
Fluidigm C1 | Low (up to hundreds of cells per run) | High (good for low-abundance transcripts) | Reliable, high gene detection rates | Lower throughput, higher sensitivity, labor-intensive setup
WaferGen iCell8 | Medium (hundreds of cells per run) | High (good for low-abundance transcripts) | Reliable, consistent transcript coverage | Moderate cost, requires specialized equipment
Illumina/Bio-Rad ddSEQ | Moderate (hundreds to thousands) | Moderate | Reliable, though lower gene detection than C1 | Lower cost, user-friendly, but limited compatibility with certain analyses

10X Genomics Chromium
Key Features:
Uses microfluidics to encapsulate individual cells into droplets, each containing unique barcodes.
Employs a 3’-tag sequencing method, capturing the 3’ end of each mRNA transcript.
Advantages:
High Throughput: Processes thousands of cells per run, making it suitable for large-scale studies.
Efficiency: Cost-effective for large sample sizes with diverse cell populations.
Compatibility: Works with multiple downstream analyses, including transcriptomics and multiomics.

10X Genomics Chromium Prep
Swaminath, Sharmada & Russell, Alistair. (2024)

scRNA-Seq Data Processing

Sequencing Data QC
Why Quality Control Matters:
Ensures data reliability and accuracy.
Identifies potential issues early, reducing errors in downstream analysis.
Key Quality Metrics:
Base Quality Scores: Assess sequencing accuracy.
Read Length Distribution: Verifies consistency and expected read length.
Adapter Sequence Detection: Identifies any leftover adapter sequences.

Key QC Metrics
Base Quality Scores: High scores indicate more reliable base calls. Quality usually declines toward the end of the read.
Read Length Distribution: Uniform read length across sequences is ideal. Shorter or variable lengths may indicate sequencing issues or degradation.
Adapter Sequence Presence: Adapters should be removed before downstream analysis. Common tools for adapter trimming: Trimmomatic, Cutadapt.

Read Mapping in scRNA-Seq
Mapping aligns sequencing reads to a reference to identify gene expression levels.
Types of Mapping:
1. Genome Mapping: Aligns reads to the entire genome, capturing all potential sequences. Detects novel splicing events and unannotated regions.
2. Transcriptome Mapping: Aligns reads to known spliced transcripts. Faster and more resource-efficient than genome mapping, but may miss novel transcripts. Ideal for well-annotated genomes where most transcripts are known.
3. Augmented Transcriptome Mapping: The reference is the transcriptome plus additional information such as full-length unspliced transcripts or excised intronic sequences. Detects known transcripts and novel splicing events. Balances speed and accuracy for complex analyses.

10X Genomics Read Structure

Cell Barcode Correction
Barcodes are unique DNA sequences used to identify reads from individual cells in scRNA-seq experiments.
Why Correction is Necessary:
Errors in barcodes, from sequencing or synthesis mistakes, can misassign reads to the wrong cell.
Accurate barcode correction ensures data integrity, preserving the identity of each cell.
Types of Barcode Errors:
1. Sequencing Errors
Random errors in base calling that result in mismatched nucleotides in barcodes.
Typically seen as single or few nucleotide changes.
2. Synthesis Errors
Occur during the chemical synthesis of barcode sequences.
Often introduces systematic errors that can affect multiple sequences in similar ways.

Methods for Cell Barcode Correction
Algorithmic Approaches:
Hamming Distance Correction: calculates the difference between barcodes, correcting those that differ by only 1–2 bases.
Cluster-Based Correction: groups similar barcodes and assigns them to the most probable correct sequence within each cluster.
Filtering Techniques:
Ambiguous Barcode Filtering: barcodes with many errors are excluded from analysis to maintain data quality.
Consensus-Based Correction: Uses data patterns to predict the most likely original barcode sequence, particularly useful in high-throughput systems where some errors are predictable.

Challenges with Cell Barcode Correction
Distinguishing True Variability: Difficulties in differentiating true biological diversity from technical errors in barcodes.
Over-Correction Risks: Excessive corrections can mistakenly alter barcodes that represent unique cells.
Data Noise: Large datasets may contain high levels of barcode noise, complicating correction.
Computational Demand: Sophisticated corrections require extensive computational resources.
Future Directions:
Machine Learning Integration: Algorithms using machine learning to better detect and correct errors without compromising unique cell identities.
Improved Error Models: Enhanced error models to predict and correct for both random and systematic errors in barcode data.

Unique Molecular Identifiers
Recall that UMIs are short, random sequences added to each mRNA molecule before PCR amplification.
UMIs uniquely tag each transcript, allowing us to distinguish between uniquely sequenced transcripts and PCR duplicates.
Why UMIs are Important:
Help reduce amplification bias by distinguishing unique transcripts from duplicated reads.
Ensure more accurate quantification of gene expression.
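The two ideas above, Hamming-distance barcode correction and UMI-based duplicate removal, can be illustrated together with a minimal sketch. It assumes a known barcode whitelist, a maximum correction distance of one base, and reads already reduced to (barcode, UMI, gene) triples; the function names, the toy sequences, and the gene labels are all hypothetical and do not correspond to any particular pipeline.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def correct_barcode(observed, whitelist, max_dist=1):
    """Return the matching whitelist barcode, or None if none or several are within max_dist."""
    if observed in whitelist:
        return observed
    hits = [wl for wl in whitelist if hamming(observed, wl) <= max_dist]
    # Ambiguous barcodes (more than one candidate) are filtered rather than guessed.
    return hits[0] if len(hits) == 1 else None


def count_umis(reads, whitelist):
    """Count distinct UMIs per (cell barcode, gene), so PCR duplicates collapse to one."""
    seen = defaultdict(set)
    for barcode, umi, gene in reads:
        cell = correct_barcode(barcode, whitelist)
        if cell is not None:
            seen[(cell, gene)].add(umi)
    return {key: len(umis) for key, umis in seen.items()}


# Toy data (assumed): 4-base barcodes and UMIs, far shorter than real ones.
whitelist = {"AAAA", "CCCC", "GGTT"}
reads = [
    ("AAAA", "ACGT", "GeneA"),
    ("AAAT", "ACGT", "GeneA"),  # one-base barcode error, same UMI: a PCR duplicate
    ("AAAA", "TTGA", "GeneA"),  # same cell and gene, new UMI: a second molecule
    ("CCCC", "ACGT", "GeneB"),
    ("ACGT", "GGGG", "GeneB"),  # three or more mismatches to every barcode: discarded
]
counts = count_umis(reads, whitelist)
print(counts)
```

Here three reads for GeneA in cell AAAA reduce to two molecules, because two of them share a UMI. This sketch treats UMIs as error-free; handling sequencing errors inside the UMI itself is a separate step.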
Graph-Based UMI Resolution
Collect all of the UMIs present at a particular locus.
Resolving UMIs:
Each UMI is represented as a node in a graph.
Nodes (UMIs) that are similar, typically differing by one base, are connected.
Uses clustering to identify which UMIs are likely duplicates.
Quantifying UMIs:
The number of unique UMIs per gene is counted to estimate transcript abundance accurately.
Ensures that the final counts reflect actual biological molecules, not technical duplicates.

Graph-Based UMI Resolution
Smith et al., Genome Res. 2017 Mar;27(3):491-499

Challenges of UMI Resolution
Benefits of Using UMIs:
Reduces PCR amplification bias, enhancing the accuracy of transcript quantification.
Allows for higher-resolution data, especially in high-throughput experiments.
Challenges in UMI Resolution:
Error Correction: Misreads in UMI sequences can introduce errors that are hard to distinguish from true duplicates.
Complexity of Graph-Based Approaches: Computationally intensive, requiring careful tuning to balance accuracy with processing time.

Empty Droplet Removal
Once the unique UMIs have been identified from each cell, further quality control is needed on the count data.
Empty droplets are droplets in single-cell RNA sequencing that contain no cells.
These droplets can still capture environmental RNA from the solution, leading to background noise in the data.
Removing empty droplets is essential to avoid inaccurate expression profiles.

Strategies for Empty Droplet Removal
Threshold-Based Filtering: Sets a minimum threshold for the number of detected transcripts per droplet. Droplets with transcript counts below this threshold are likely empty and are excluded.
Ambient RNA Profiling: Identifies gene expression patterns characteristic of ambient RNA (background RNA in solution). Detects and removes droplets dominated by these patterns, marking them as empty.
Statistical Methods (e.g., EmptyDrops): Uses statistical models to distinguish real cells from empty droplets based on transcript distribution. The EmptyDrops algorithm applies a Monte Carlo method to classify droplets, enhancing accuracy in detecting empty droplets (https://doi.org/10.1186/s13059-019-1662-y).

Empty Droplet Removal

Doublet Detection
What is a doublet:
Doublets occur when two or more cells are captured together in a single droplet.
This results in mixed gene expression profiles, leading to inaccurate data if not detected.
Why doublet detection is important:
Doublets can create artificial cell types or clusters, impacting downstream analyses.
Removing doublets ensures each droplet represents a single cell.

Doublet Removal
Density-Based Clustering: Cells with unusually high gene or UMI counts are flagged as potential doublets.
Gene Expression Patterns: Doublets often show mixed profiles from two distinct cell types, which can be identified through unique or marker gene expression.
Doublet Detection Tools:
Scrublet: Uses simulations to identify doublets based on expected cell-to-cell gene expression similarity.
DoubletFinder: A popular R package that uses clustering to detect doublets based on gene expression profiles.

Doublet Detection

Count Data Normalization
Proper representation and normalization of count data reduces noise and bias, allowing meaningful comparisons between cells and conditions.
Raw Counts: Direct counts of RNA transcripts per gene in each cell (after processing). Used as a starting point, but unprocessed counts can be influenced by technical factors like sequencing depth.
Normalized Counts: Adjusts counts to account for differences in sequencing depth or cell size. Common methods: CPM (Counts Per Million), TPM (Transcripts Per Million).

Count Data Normalization
Log-Transformed Counts: Applies a logarithmic transformation to normalized counts.
Helps stabilize variance across genes, making data more suitable for visualizations and clustering.
Scaled Counts: Further standardizes counts, often by centering and scaling each gene across all cells. Useful for dimensionality reduction techniques, like PCA and clustering.

Variance Stabilization
[Figure: replicate 1 vs replicate 2 compared under log2 and DESeq rlog transformations]
Some packages prefer data that has not been stabilized to better model the variation among samples.

Overview of scRNA-Seq Analysis
Goal of Analysis:
Identify patterns in gene expression across individual cells.
Discover unique cell types, functional states, and biological pathways.
Types of Analyses:
Dimensionality Reduction
Clustering
Differential Gene Expression
Cell Type Annotation
[Workflow diagram]
1. Preprocessing (Quality Control, Trimming, Mapping, Quant)
2. Normalization (TMM, TPM, Stabilization)
3. Dimensionality Reduction (PCA, t-SNE, UMAP)
4. Clustering (Cell type specific clustering)
5. Differential Gene Expression (MAST, Seurat, scanpy)
6. Advanced Analysis (gene network analysis, temporal ordering, etc.)

Dimensionality Reduction
Purpose: Reduces the complexity of high-dimensional data, making it easier to visualize and analyze.
Common Methods:
PCA (Principal Component Analysis): Identifies major patterns in data, often as a first step.
t-SNE (t-Distributed Stochastic Neighbor Embedding): Clusters similar cells together for visualization.
UMAP (Uniform Manifold Approximation and Projection): Preserves both global and local data structure, commonly used for single-cell data.
Tools: Seurat (R) and Scanpy (Python) provide built-in functions for these methods.

Principal Component Analysis
PCA is a dimensionality reduction technique that simplifies large datasets.
It transforms the data into a set of new variables (principal components) that summarize the original data.
How It Works:
Principal Components: Each component captures a different aspect of variability in the data, with the first few components explaining the most variance.
Reduction in Complexity: PCA reduces high-dimensional data (many genes per cell) to fewer components, making analysis easier.
Why PCA is Useful in scRNA-seq:
Noise Reduction: Focuses on key patterns, filtering out noise and minor variations.
Preparation for Clustering: Simplifies the data, making it more manageable for clustering algorithms.
Visualization: Allows 2D or 3D plotting of cells, helping visualize relationships between cell populations.

Principal Component Analysis
https://setosa.io/ev/principal-component-analysis/

t-SNE (t-Distributed Stochastic Neighbour Embedding)
t-SNE is a nonlinear dimensionality reduction technique that visualizes high-dimensional data in 2D or 3D by grouping similar data points together.
It emphasizes local structure, which makes it easier to identify clusters within complex data.
How It Works:
Neighborhood Preservation: t-SNE calculates probabilities that represent similarities between data points in high-dimensional space, then replicates these similarities in lower-dimensional space.
Cost Function: Minimizes differences between high-dimensional and low-dimensional relationships, ensuring that similar cells are grouped closely.
Why t-SNE is Useful in scRNA-seq:
Cluster Visualization: Shows clear, visually distinct clusters, helping identify unique cell populations.
Focus on Local Structure: Emphasizes relationships among similar cells, which is beneficial for diverse single-cell data.
Limitations: t-SNE is primarily for visualization and doesn’t preserve global structure.

PCA vs t-SNE
https://newsletter.theaiedge.io/p/formulating-and-implementing-the

UMAP (Uniform Manifold Approximation and Projection)
UMAP is a nonlinear dimensionality reduction technique that reduces high-dimensional data to 2D or 3D while preserving both local and global data structure.
How It Works:
Preserves Local and Global Structure: UMAP tries to maintain relationships at both small (local) and larger (global) scales, offering a more comprehensive view of data patterns.
Manifold Learning: UMAP assumes the data lies on a "manifold" (a lower-dimensional space embedded in high-dimensional space) and maps this manifold in fewer dimensions.
Why UMAP is Useful in scRNA-seq:
Improved Visualization: Often provides clearer and more stable clusters compared to t-SNE.
Better Reproducibility: UMAP results are generally more consistent across multiple runs, making it ideal for single-cell data.
Flexible for Clustering: Helps reveal cellular diversity and can highlight both

t-SNE vs UMAP
UMAP | t-SNE | Use Case
Preserves both local and global structure | Focuses primarily on local structure | Use UMAP when overall relationships between cell types are needed, such as in complex tissue structures.
More reproducible across multiple runs | Less reproducible, with variations between runs | Use UMAP if reproducibility is important, like in projects comparing results across datasets.
Faster and more scalable for large datasets | Slower and more computationally intensive | Use UMAP for large-scale datasets (e.g., whole-tissue or multi-condition scRNA-seq studies).
Suitable for both discrete clusters and continuous data | Best suited for datasets with clear, well-separated clusters | Use UMAP if your data includes gradual transitions, such as developmental or differentiation pathways.

Use UMAP when you need a balance of local and global structure, reproducible results, faster performance, or a method suited for gradual changes in cell states.
Use t-SNE when you need to emphasize tightly-knit clusters and don’t require consistent, large-scale reproducibility across runs.

PCA vs t-SNE vs UMAP
https://mbernste.github.io/posts/dim_reduc/

Cluster Analysis in scRNA-Seq
Purpose of Clustering:
Groups cells with similar gene expression profiles, allowing us to identify distinct cell types or functional states.
Essential for uncovering cellular diversity within complex tissues.
Types of Clustering:
Graph-Based Clustering: Ideal for single-cell data, captures relationships between cells effectively.
Hierarchical Clustering: Groups cells into a tree structure to show relationships at various levels.

Graph-Based Clustering
How It Works:
Represents cells as “nodes” in a graph, connecting them based on gene expression similarity.
Cells with similar expression are more strongly connected, forming clusters.
Key Algorithms:
Louvain Algorithm: Optimizes clusters by maximizing connections within each group.
Leiden Algorithm: Builds on Louvain, improving accuracy and stability.
Advantages:
Efficient and well-suited for large, high-dimensional single-cell data.
Captures subtle relationships between cells, revealing distinct populations.

Louvain Clustering
The Louvain algorithm is a method for community detection in graphs, widely used for clustering cells in scRNA-seq data by grouping similar cells based on gene expression profiles.
How the Louvain Algorithm Works
1. Graph Construction: Each cell is represented as a node in a graph. Nodes (cells) are connected by edges, with edge weights representing the similarity between cells (e.g., based on gene expression).
2. Initial Step: Local Clustering: Each cell starts in its own community (cluster). The algorithm iteratively examines whether moving a node from one community to another increases the modularity (a measure of the strength of division of a network into communities).
3. Modularity Optimization: Modularity is a score that indicates how densely connected nodes within communities are, compared to connections between communities. The algorithm moves each node into the neighboring community if it increases modularity, grouping nodes with high similarity into the same community.
4. Community Aggregation: After modularity cannot be improved at the local level, each community is collapsed into a single node.
The process repeats on this “coarser” graph, with new communities formed based on modularity optimization.
5. Iterative Process: The algorithm continues iterating, forming larger communities with each pass until modularity no longer increases. The final communities represent the clusters of cells, revealing distinct groups or cell types.

Louvain Clustering

Hierarchical Clustering
A clustering technique that builds a tree-like structure, called a dendrogram, to represent relationships between cells.
Cells are initially treated as individual clusters, which are progressively merged based on similarity.
Two Approaches:
1. Agglomerative Approach: Each cell starts as its own cluster. Clusters that are most similar in gene expression are merged step-by-step, forming larger clusters.
2. Divisive Approach: Starts with all cells in a single cluster, which is then split into smaller, more distinct clusters.
Dendrogram Structure: Shows the merging sequence, with closely related clusters near each other and more distinct clusters further apart.
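The agglomerative approach described above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the lecture does not specify a distance or linkage, so Euclidean distance and average linkage are assumptions here, and the two-dimensional "cells" stand in for coordinates in a reduced space such as the first two principal components.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(c1, c2, points):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Returns the final clusters and the merge log, which is the
    information a dendrogram would draw.
    """
    clusters = [[i] for i in range(len(points))]  # every cell starts alone
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return [sorted(c) for c in clusters], merges


# Toy coordinates (assumed): two well-separated groups of three cells.
cells = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
clusters, merges = agglomerative(cells, n_clusters=2)
print(clusters)
for left, right, dist in merges:
    print(left, "+", right, "at distance", round(dist, 3))
```

The merge log lists small distances first and larger ones later, which is the order a dendrogram displays from leaves to root. The exhaustive pairwise search is cubic in the number of cells, so it is only suitable for small illustrations.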
Advantages in scRNA-seq:
Clear Relationships: Reveals hierarchical relationships, useful in studying

Hierarchical Clustering

Dimensionality Reduction vs Clustering
Aspect | Dimensionality Reduction | Cluster Analysis
Purpose | Reduces the complexity of high-dimensional data | Groups cells based on similar gene expression profiles
Goal | Simplifies data for easier visualization and analysis | Identifies distinct cell types or states
Common Methods | PCA (Principal Component Analysis), t-SNE, UMAP | Graph-based clustering (Louvain, Leiden), hierarchical clustering
Order in Workflow | Performed before clustering | Performed after dimensionality reduction
Function | Highlights major patterns or differences between cells | Defines biologically meaningful groups of cells
Output | Lower-dimensional representation (e.g., 2D/3D plot) | Distinct clusters representing potential cell types or states
Importance in Workflow | Reduces noise, improves clustering performance, and | Helps interpret cellular diversity and identify unique populations

Differential Gene Expression
Identifies genes with different expression levels between clusters or conditions.
Helps reveal cell type-specific markers or functional states within clusters.
Applications:
1. Identifying Marker Genes: Finds genes that uniquely identify each cell type.
2. Comparing Conditions: Studies changes in gene expression between healthy and diseased cells.
3. Pathway Analysis: Links differentially expressed genes to specific pathways for functional insights.
Tools for DGE:
Seurat: Popular in R, provides built-in DGE functions for scRNA-seq.
MAST and edgeR: Tools for robust statistical testing of differential expression in single-cell data.

Differential Gene Expression
Identifies genes that are expressed at significantly different levels across clusters, conditions, or cell types.
Purpose:
Reveals unique markers for each cell type.
Helps identify biological differences between clusters.
Provides insights into gene functions and cellular states.
Types of Comparisons:
Between Clusters: Identify genes that differentiate cell types.
Across Conditions: Compare gene expression in diseased vs. healthy cells.

DEG Statistics
Challenges in Single-Cell DGE: High variability and low counts in single-cell data require specialized methods.
Common Statistical Methods:
Wilcoxon Rank Sum Test: Non-parametric test used to compare expression between groups.
Likelihood Ratio Test (LRT): Tests whether gene expression models differ significantly between conditions.
MAST: Accounts for “dropouts” (zero values due to low detection sensitivity) common in scRNA-seq data.

DEG Tools
Popular DGE Tools:
Seurat: Built-in DGE functions that use statistical tests to identify DEGs across clusters.
DESeq2 and edgeR: R-based tools originally for bulk RNA-seq, adapted for single-cell datasets.
MAST: Handles dropout data common in scRNA-seq, ideal for sparse expression matrices.
Choosing a Tool: Consider factors like dataset size, technical variation, and integration into your workflow.

Visualizing DEG
Common Visualization Techniques:
Volcano Plot: Shows fold changes vs. significance, highlighting key DEGs.
Heatmap: Visualizes expression patterns of DEGs across clusters or conditions.
Dot Plot: Displays DEG expression levels within clusters, showing both presence and abundance.
Interpreting Results:
Marker Genes: Identify genes uniquely expressed in certain cell types.
Pathway Enrichment: Link DEGs to biological pathways for functional insights.
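To make the Wilcoxon rank sum test listed under DEG Statistics concrete, the following is a minimal sketch for a single gene compared between two clusters. It assumes a two-sided test with a tie-corrected normal approximation and no continuity correction, so its p-values may differ slightly from those reported by a statistics package; the expression values are invented for illustration.

```python
import math


def rank_with_ties(values):
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_rank_sum(x, y):
    """Return (U statistic for x, approximate two-sided p-value)."""
    n1, n2 = len(x), len(y)
    pooled = list(x) + list(y)
    ranks = rank_with_ties(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    # Tie correction matters in scRNA-seq because many counts are zero.
    tie_counts = {}
    for v in pooled:
        tie_counts[v] = tie_counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_counts.values())
    n = n1 + n2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u1, 1.0  # all values identical: no evidence of a difference
    z = (u1 - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the normal
    return u1, p


# Invented normalized expression of one gene in two clusters.
cluster_a = [5, 7, 8, 9, 12, 6, 10, 11]
cluster_b = [0, 0, 1, 2, 0, 3, 1, 4]
u_stat, p_value = wilcoxon_rank_sum(cluster_a, cluster_b)
print(u_stat, p_value)
```

In a real analysis this test would be repeated for every gene, and the resulting p-values would need multiple-testing correction before calling DEGs.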
scRNA-Seq vs Bulk RNA-Seq
Aspect | scRNA-seq | Bulk RNA-seq | Consideration
Cell-to-Cell Variability | High variability between cells due to individual differences, resulting in biological noise | Averages expression across many cells, reducing individual variability | Use statistical models that account for high variability in scRNA-seq, like zero-inflated models
Dropouts (Zero-Inflation) | Many zero values due to “dropout” events where genes are not detected in certain cells (sparse data) | Minimal dropout effect, as gene expression is measured consistently | DGE tools for scRNA-seq must handle zero inflation to accurately detect DEGs
Read Depth per Cell | Lower read depth per cell due to data spread across thousands of cells | Higher read depth per sample, allowing for better detection of low-expressed genes | scRNA-seq requires robust statistical tests and normalization methods that adjust for variable read depth
Normalization | Needs specialized normalization to adjust for library size, cell-specific variability, and technical artifacts | Uses simpler normalization methods like TPM or RPKM | scRNA-seq requires more complex normalization, such as scaling or regression-based methods to correct for cell cycle and other artifacts
Cell-Type-Specific DEGs | Enables DGE analysis within specific cell types or clusters, allowing high-resolution insights | Provides an average expression across mixed cell types, which may mask differences | scRNA-seq allows fine-grained DGE analysis, but results need careful interpretation to differentiate biological differences from technical noise
Statistical Power & Replication | Individual cells are often treated as replicates, though true biological replication is challenging | Uses biological replicates (e.g., tissue samples), providing reliable DGE estimates | scRNA-seq may require sophisticated statistical approaches or pseudoreplication to ensure accurate and robust DGE results

Pathway Analysis and Functional Enrichment
Purpose:
Identifies biological pathways, cellular functions, and processes associated with differentially expressed genes (DEGs).
Helps interpret the biological roles of specific cell types, conditions, or states in the dataset.
Key Concepts:
Pathway Analysis: Links DEGs to biochemical pathways (e.g., signaling, metabolic).
Functional Enrichment: Identifies overrepresented biological functions (e.g., immune response) within DEG lists.

Pathway Enrichment
Over-Representation Analysis (ORA): Compares observed gene counts in pathways/functions with what is expected by chance. Identifies pathways significantly enriched in DEGs.
Gene Set Enrichment Analysis (GSEA): Ranks all genes by expression changes and determines if genes in specific pathways are enriched at the top of the ranking. Useful for subtle expression differences and more sensitive to small changes.
Network-Based Analysis: Maps DEGs onto interaction networks (e.g., protein-protein interactions) to identify functional modules.

Popular Enrichment Tools
DAVID (Database for Annotation, Visualization, and Integrated Discovery): Provides enrichment analysis for functional terms, pathways, and GO categories.
GSEA (Gene Set Enrichment Analysis): Detects enriched gene sets within the ranked list of genes.
Reactome and KEGG: Offer pathway-specific insights, linking DEGs to curated pathways like metabolic or signaling pathways.
clusterProfiler (R package): Integrates with Bioconductor and provides enriched GO terms and pathway analysis for scRNA-seq data.

Interpreting Enrichment
Key Considerations:
Biological Relevance: Focus on pathways/functions that match known biology of cell types or conditions.
Pathway Redundancy: Similar pathways can appear enriched due to overlapping genes.
Applications:
Identify Core Pathways: Understand which pathways drive cellular behavior (e.g., immune response in macrophages).
Hypothesis Generation: Identify new biological roles or therapeutic targets for specific cell types or disease states.
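The comparison at the heart of over-representation analysis, observed overlap versus what is expected by chance, is commonly computed as a one-sided hypergeometric tail probability. The sketch below shows that calculation under this assumption; the function name, argument names, and gene counts are hypothetical, and real tools additionally correct for testing many pathways at once.

```python
from math import comb


def ora_p_value(n_background, n_pathway, n_degs, n_overlap):
    """Probability of seeing at least n_overlap pathway genes among n_degs
    genes drawn without replacement from n_background genes."""
    total = comb(n_background, n_degs)
    upper = min(n_pathway, n_degs)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_degs - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Invented numbers: 20,000 measured genes, a 100-gene pathway,
# 200 DEGs, 8 of which fall in the pathway.
expected_overlap = 200 * 100 / 20000  # overlap expected by chance
p = ora_p_value(n_background=20000, n_pathway=100, n_degs=200, n_overlap=8)
print(expected_overlap, p)

# Small case that can be checked by hand: (C(5,3)*C(15,1) + C(5,4)) / C(20,4)
p_small = ora_p_value(n_background=20, n_pathway=5, n_degs=4, n_overlap=3)
print(p_small)
```

The choice of background matters: using only genes detected in the experiment, rather than every annotated gene, usually gives a more conservative result.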
Pathway Enrichment Analysis

Cell Lineage Analysis
What is Cell Lineage Analysis?
Cell lineage analysis traces the development and differentiation of cells over time.
Helps identify the progression from stem cells or progenitors to fully differentiated cell types.
Why Use scRNA-seq for Lineage Tracing?
Captures gene expression profiles at single-cell resolution, providing snapshots of cells at different stages.
Allows for identifying and ordering cells along a developmental or differentiation pathway.
Spanjaard et al, 2018 https://www.nature.com/articles/nbt.4124

Approaches to Cell Lineage Analysis
Pseudotime Analysis:
Orders cells along a “trajectory” that reflects their progression from one state to another (e.g., stem cell to neuron).
Assumes cells at different points along the trajectory represent different stages in a continuous process.
Tools: Monocle, Slingshot, and scVelo.
Lineage Trees:
Builds branching trees that show how cells diverge from common ancestors into specialized cell types.
Useful for visualizing multiple differentiation paths.
Advantages: Provides insights into developmental processes and identifies transitional cell states.

RNA Velocity
What is RNA Velocity?
RNA velocity predicts the “future state” of each cell based on the direction and rate of change in gene expression.
It uses the balance of unspliced (newly transcribed) and spliced (mature) mRNA to infer the trajectory of cells in a developmental or differentiation process.
Velocity Calculation:
Counts of spliced and unspliced mRNAs are obtained for each gene in each cell.
The transcriptional dynamics are modeled to predict where each cell “moves” next in gene expression space.
RNA velocity vectors are generated, showing the direction and magnitude of change in cell state.
Tools for RNA Velocity:
Velocyto: First popularized RNA velocity by estimating future states from scRNA-seq data.
scVelo: An advanced Python tool that refines the method, offering improved accuracy and scalability for larger datasets.
Why RNA Velocity is Useful:
Dynamic Insights: Unlike static scRNA-seq snapshots, RNA velocity provides a temporal perspective, allowing researchers to observe cells in motion.
Applications: Used in studying cellular transitions in development, differentiation, and disease progression.

RNA Velocity
https://journals.plos.org/ploscompbiol/article?id=10.1371/
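The velocity calculation outlined in the RNA Velocity section can be illustrated for a single gene with a simplified steady-state model. In this sketch, velocity is taken as unspliced minus gamma times spliced, where gamma (the degradation-to-splicing ratio) is estimated as the least-squares slope through the origin across all cells. Fitting on all cells is a simplification assumed here for brevity, the counts are invented, and the function names are hypothetical rather than taken from Velocyto or scVelo.

```python
def fit_gamma(spliced, unspliced):
    """Least-squares slope through the origin of unspliced vs spliced counts."""
    denom = sum(s * s for s in spliced)
    if denom == 0:
        return 0.0  # gene has no spliced counts; no slope can be estimated
    return sum(s * u for s, u in zip(spliced, unspliced)) / denom


def gene_velocity(spliced, unspliced):
    """Return gamma and the per-cell velocity u - gamma * s for one gene."""
    gamma = fit_gamma(spliced, unspliced)
    return gamma, [u - gamma * s for s, u in zip(spliced, unspliced)]


# Invented counts for one gene across five cells.
spliced = [1.0, 2.0, 4.0, 6.0, 8.0]
unspliced = [1.5, 1.0, 2.0, 3.0, 2.5]
gamma, velocities = gene_velocity(spliced, unspliced)
print(round(gamma, 3))
for s, u, v in zip(spliced, unspliced, velocities):
    direction = "up" if v > 0 else "down"
    print(s, u, round(v, 3), direction)
```

A positive value means the cell holds more unspliced mRNA than the steady-state line predicts, suggesting the gene is being induced; a negative value suggests it is being repressed. Combining these per-gene values across many genes gives the velocity vector for each cell.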