Genomic Medicine – Genomic Data (NGS)
Ajou University
2024
김규태
Summary
This document covers genomic medicine and genomic data (NGS). It describes biomedical science research through genome analysis, big-data genomic analysis, and the types and uses of genome sequencing data.
Full Transcript
Genomic Medicine – Genomic Data (NGS)
2024-11-26
김규태
Department of Physiology, Ajou University School of Medicine
Department of Biomedical Sciences, Ajou University Graduate School

Biomedical science research through big-data genomic analysis

What is Genome?
https://en.wikipedia.org/wiki/Genome
In the fields of molecular biology and genetics, a genome is all genetic material of an organism. It consists of DNA (or RNA in RNA viruses). The genome includes both the genes (the coding regions) and the noncoding DNA, as well as mitochondrial DNA and chloroplast DNA. The study of the genome is called genomics.

What is Genome?
▪ Biological systems are fundamentally ‘digital’ in nature: they store, copy, and process their information encoded in the letters A, C, G and T.
▪ The major evolutionary advantage of a digital medium for storing genetic information is that it can persist across thousands of generations, while analog signals would be diluted from generation to generation by basic chemical diffusion.

What is Genome?
Digits to DNA and back again
[Figure: original image data encoded to DNA; retrieved image decoded from DNA] Shipman et al., Nature, 2017
DNA… Future Hard Drives??
DNA 1g ≈ 100,000,000 Gb (100,000 Tb)

What is Genome?
Writing/Editing Genome | Reading Genome

What is Genome? Reading Genome
Human Genome Project (HGP)
Early days: a DNA-sequencing lab in 1994
[Figure-only slides: Reading Genome]

What is Genome? Writing/Editing Genome
Emmanuelle Charpentier, Jennifer Doudna
CRISPR genome editing
2020 Nobel Prize in Chemistry

What is Genome?
in your days | in near future…?

Biomedical science research through big-data genomic analysis

Why ‘Big’ data? How big?
Total number of cells ≈ 37.2 trillion
Total mass of DNA ≈ 60 grams
Total length of genome ≈ 3 billion base pairs

Why ‘Big’ data? How big?
https://www.pacb.com/blog/the-evolution-of-dna-sequencing-tools/
[Figure-only slides: Why ‘Big’ data? How big?]

Why ‘Big’ data? How big?
BI (BioInformatics) | Data Sciences | Computational Biology
> Computational algorithms (machine learning, artificial intelligence, etc.)
> Statistics, Mathematics

Why ‘Big’ data? How big?
WGS (Whole-genome sequencing) data

Why ‘Big’ data? How big?
In terms of hardware
▪ How would you mine data of such huge size? With your own PC?
▪ What if there are several samples or multiple types of sequencing data in a study?
In terms of software
▪ Which methods/algorithms would you apply for efficient data processing?
▪ What if there are no known approaches available to investigate your study? Can you devise a brilliant algorithm for the first time?

Biomedical science research through big-data genomic analysis

What are types of sequencing data?
Decoding such digital information in a ‘comprehensive fashion’ can facilitate unprecedented access to actionable insights into the most fundamental, long-standing questions in biomedical sciences.
To understand better…
More annotations (GPS, borders, labels, building names)
Multi-layered information (traffic status, landmarks for tourists, locals, hipsters, etc.)
[Figure-only slides: What are types of sequencing data?]

What are types of sequencing data?
I need more … to map micro-environmental dependencies in cancer evolution and inter-clonal collaboration.
> Morphology-guided single-cell profiling to uncover the complex landscape of cancer evolution in the spatial context.
> Single-cell multi-omics to link genetic, epigenetic and transcriptional information in cancer evolution.
> Screening causal epigenetic aberrations that impact cancer evolution and cell persistence.

Biomedical science research through big-data genomic analysis

Hypothesis first | Data first
[Diagram: two research cycles. Hypothesis first: Observation/Question, Hypothesis, Testing with experiment, Evaluation, Application. Data first: Collecting and analyzing data, Hypothesis, Testing with experiment, Evaluation, Application.]
[Photo captions: Robert Weinberg; Todd Golub; Director, Founding Core Institute Member]

How can we analyze such sequencing data?
Data mining: the process of finding out [...] out of a huge amount of impurities in data.

How can we analyze such sequencing data?
The first step is… to process raw materials, so that the data can be broken down easily.
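To put the "How big?" question from the earlier slides in numbers, the following sketch estimates data volumes from the 3 billion base-pair genome length quoted there. The 30x coverage and the two bytes per sequenced base (one base character plus one quality character, ignoring read headers and compression) are assumptions chosen for illustration, not figures from the lecture, and the function names are hypothetical.

```python
GENOME_LENGTH_BP = 3_000_000_000  # total genome length quoted in the slides


def two_bit_genome_bytes(genome_length_bp: int) -> int:
    """Bytes needed to store one genome copy at 2 bits per base (A, C, G, T)."""
    return genome_length_bp * 2 // 8


def raw_fastq_bytes(genome_length_bp: int, coverage: int, bytes_per_base: int = 2) -> int:
    """Rough uncompressed read volume: sequenced bases times bytes per base.

    bytes_per_base=2 is an assumption (one base letter plus one quality letter).
    """
    return genome_length_bp * coverage * bytes_per_base


if __name__ == "__main__":
    genome_gb = two_bit_genome_bytes(GENOME_LENGTH_BP) / 1e9
    # 30x is an assumed sequencing depth, not a value from the lecture.
    wgs_gb = raw_fastq_bytes(GENOME_LENGTH_BP, coverage=30) / 1e9
    print(f"One genome at 2 bits per base: {genome_gb:.2f} GB")
    print(f"Raw 30x WGS reads, uncompressed: {wgs_gb:.0f} GB")
```

Under these assumptions one genome copy fits in under 1 GB, while the raw reads of a single WGS sample come to roughly 180 GB before compression, which is why multi-sample studies quickly outgrow a personal computer.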
How can we analyze such sequencing data?
Kim KT*, Lee H*, Lee H* et al., 2015 Genome Biology
[Figure-only slide: How can we analyze such sequencing data?]

How can we analyze such sequencing data?
Machine learning is an application of artificial intelligence (AI) that gives systems the ability to learn and improve automatically from experience without being explicitly programmed.

Computational Biology

Why Computational Biology? Answers from MIT graduates
▪ More efficient and in-depth way to explore biology
▪ DNA is a massive dataset
▪ It answers questions not easily solvable by traditional experimental biology
▪ Computational biology and simulations can help deconvolve results from experiments
▪ It’s interesting and new
▪ There are tons of biological datasets waiting to be analyzed
▪ Biology benefits from approximation
▪ More and more sequencing data are coming out
▪ It might be the biggest frontier of computing today
▪ Because you can use others’ datasets and then get good research done on a budget
▪ It has the potential to do whatever you want without waiting for experiments
▪ Life itself is digital
▪ Efficiency (reduce experimental space to cover)
▪ Ability to visualize
▪ Pattern finding
▪ There are rules
▪ It’s all about data

Understanding the molecular etiology of complex disorders with a quantifiable certainty

Deciphering the molecular etiology of complex disorders with a quantifiable certainty
[Diagram: three overlapping fields]
Bio-Medical Science: Physiology, Genetics, Immunology, Cancer Biology
Computational Biology: Single-Cell Analysis, Multi-Omics Integration, Immuno-Genomics, Cancer Genomics
Computer Science: Mathematical Algorithms, Computer Programming, Visualization, Statistics
Fundamental Biomedical Problems

Understanding the molecular etiology of complex disorders with a quantifiable certainty
Biological systems are fundamentally ‘digital’ in nature by storing, copying, and processing their information encoded in the letters A, C, G, and T.
[Caption residue: (genetic fingerprinted); Director, Institute for Computational Medicine, New York University School of Medicine]

Sequencing reads from
Genome | Transcriptome | Methylome | Metagenome

Understanding Genetic Variation
Genetic variation: the differences in DNA sequences among individuals within a population
Importance: the foundation of diversity in living organisms and a key driver of evolution

Types of Genetic Variation
Single Nucleotide Polymorphisms (SNPs)
- Definition: Variations at a single nucleotide position.
- Example: A change from adenine (A) to guanine (G).
- Impact: Can affect gene function or regulation.
Insertions and Deletions (Indels)
- Definition: Addition or loss of small DNA segments.
- Example: An insertion of a sequence or a deletion of a few base pairs.
- Impact: Can cause frameshift mutations if occurring in coding regions.
Cardoso et al., Front. Bioeng. Biotechnol., 2015
Copy Number Variations (CNVs)
- Definition: Changes in the number of copies of a particular gene or genomic region.
- Example: Duplication or deletion of large DNA segments.
- Impact: Can affect gene dosage and lead to diseases.
Structural Variants
- Definition: Large-scale alterations in chromosome structure.
- Examples: Translocations, inversions, and large deletions or duplications.
- Impact: Can disrupt gene function or regulation.

Sources of Genetic Variation
→ Produce new genes and alleles and increase genetic variation
Mutations
- Definition: Permanent changes in DNA sequence.
- Causes: Errors in DNA replication, environmental factors (e.g., UV radiation, chemicals).
Recombination
- Definition: Exchange of genetic material during meiosis.
- Importance: Increases genetic diversity by creating new allele combinations.
Gene Flow
- Definition: Transfer of genetic material between populations.
- Importance: Introduces new genetic variants into a population.
Sexual reproduction | Genetic drift | Environmental variance

Application in Medicine and Research
1. Personalized Medicine
→ Tailoring medical treatment to individual characteristics based on genetic profile.
Applications:
- Pharmacogenomics: Studying how genes affect drug response (e.g., genetic variants in CYP450 enzymes that influence drug metabolism)
- Targeted Therapies: Developing therapies targeting specific genetic mutations (e.g., HER2 inhibitors in breast cancer, EGFR inhibitors in lung cancer)

Application in Medicine and Research
2. Disease Diagnosis and Risk Assessment
- Genetic Testing: Testing for specific genetic mutations associated with diseases (e.g., BRCA1/2 testing for cancer risk, CFTR testing for cystic fibrosis)
- Newborn Screening: Screening newborns for genetic disorders (e.g., congenital hypothyroidism)
- Polygenic Risk Scores (PRS): Calculating disease risk based on multiple genetic variants (e.g., PRS for cardiovascular diseases, diabetes)

Application in Medicine and Research
3. Understanding Genetic Basis of Diseases
- Identifying Disease-Associated Genes: Using genetic studies to find disease-related genes (e.g., APOE in Alzheimer's, HBB in sickle cell anemia)
- Functional Genomics: Studying how genetic variations affect gene function (e.g., CRISPR for gene function studies, eQTL analysis)
Tuuli Lappalainen group, Science, 2020
Xue et al., Nat Comm., 2018

Application in Medicine and Research
4. Population Genetics and Evolution
- Studying Genetic Diversity: Analyzing genetic variation to understand human diversity (e.g., Human Genome Diversity Project, 1000 Genomes Project)
- Tracing Human Evolution: Using genetic data to trace ancestry and evolution (e.g., past migration and admixture of worldwide human populations, mitochondrial DNA, Y-chromosome analysis)
Genetic ancestry and admixture dating of ancient populations from Xinjiang and its vicinity. (The genomic origins of the Bronze Age Tarim Basin mummies) Fan Zhang et al., Nature, 2021

Application in Medicine and Research
5. Agricultural Biotechnology
- Crop Improvement: Developing crops with desirable traits (e.g., GMOs)
- Animal Breeding: Improving livestock breeds for desirable traits (e.g., selective breeding, genetic testing for traits)
The Tomato Genome Consortium, Nature, 2012

DNA sequencing: targeted / WES / WGS
Targeted Sequencing
- Definition: Sequencing specific regions of the genome that are of particular interest
- Applications: Diagnostic purposes, research, identifying genetic mutations
- Advantages: Cost-effective, higher depth of coverage in targeted regions
Whole Exome Sequencing (WES)
- Definition: Sequencing all the protein-coding regions (exons) of the genome
- Applications: Studying Mendelian disorders, cancer genetics, rare diseases
- Advantages: Focuses on functionally relevant parts, more cost-effective than WGS
Whole Genome Sequencing (WGS)
- Definition: Sequencing the entire genome, including coding and non-coding regions
- Applications: Disease research, evolutionary studies, personalized medicine
- Advantages: Captures all genetic variations, identifies structural variants, CNVs
Tarek Atia, Biocell, 2019

Company | Website | Notes
23 and Me | http://www.23andme.com | Consumer genomics
Allere Laboratory | http://www.allerelabs.com/about.php | NGS-based health and wellness
Alpha Genomix | http://www.alphagenomix.com/ | Clinical NGS/molecular diagnostics center
Ancestry.com | http://www.ancestry.com | Integrating genealogies with DNA analysis
Ariana Pharma | http://www.arianapharma.com | Multiple research/clinical genomics applications
Asuragen | http://www.asuragen.com | Clinical NGS, Ion Torrent, FFPE capabilities
Biocrates | http://www.biocrates.com/ | Metabolomics
Biodiscovery | http://www.biodiscovery.com | Genomics analysis consulting
Caris Life Sciences | http://www.carislifesciences.com/platforms/ | Multi-platform tumor profiling for precision cancer therapy
Castle Medical | http://www.castlemedical.com | Clinical tests include NGS
CD Genomics | http://www.cd-genomics.com/ | Full-service genomics provider
Claritas Genomics | http://claritasgenomics.com/ | Exome sequence interpretation for rare disease diagnosis
CompanionDx | http://www.companiondxlab.com/ | Clinical NGS
Counsyl | https://www.counsyl.com | Carrier screening via NGS or targeted genotyping
Cypher Genomics | http://cyphergenomics.com | Clinical interpretation of NGS data
deCode Genetics | http://www.decode.com | Clinical genomics
DNA Link | https://www.dnalink.com | NGS, microarray, personal genomics, forensics, etc.
Edge Bio | http://www.edgebio.com | CLIA exome sequencing
Foundation Medicine | http://www.foundationmedicine.com | Cancer diagnostics (including NGS)
Fulgent Diagnostics | http://fulgentdiagnostics.com/ | Clinical NGS
Gene TLC | http://www.genetic.com | Personalized genomics
GenebyGene | http://www.genebygene.com | Services include ancestry, health research and paternity
GeneDx | http://www.genedx.com | Clinical, including whole exome
Geneyouin | https://www.geneyouin.ca | Direct-to-consumer NGS and sequence analysis
Genomic Engenharia Molecular | https://www.genomic.com/ | Paternity testing/consumer/clinical genomics
Good Start Genetics | http://www.goodstartgenetics.com | Carrier screening
Health In Code | http://www.healthincode.com | Genetic diagnosis of inherited cardiovascular diseases: NGS/Sanger
Lab solutions (.net) | http://www.labsolutions.net/ | Drug screening and confirmatory testing
MapMyGenome | http://mapmygenomein/ | Consumer genomics
Millennium Health Labs | http://www.millenniumhealth.com/ | Personalized medicine
MolecularMD | http://www.molecularmd.com | Comprehensive CRO with significant NGS and bioinformatics capabilities
Multiplicom | http://www.multiplicom.com | Dx kits for targeted resequencing
Mycroarray | http://mycroarray.com/ | Custom microarrays and capture bait libraries
Myriad Genetics | https://www.myriad.com | Clinical genetics and genomics services
Natera | http://www.natera.com | Clinical prenatal genetic testing
Navigenics | http://www.navigenics.com | Consumer genomics (Assimilated by the Thermo Collective)
Nextcode | https://www.nextcode.com/ | Delivering the resources developed at deCODE Genetics to the clinical domain
NextGen Diagnostics | http://nextgendx.biz/ | Clinical and research genomics
Oxford Gene Technology | http://www.ogt.co.uk | RNA-seq, targeted sequencing, familial/trio analysis and advanced analysis
Paradigm Diagnostics | http://www.paradigmdx.org | Clinical interpretation of NGS data
Pathgroup | http://www.pathgroup.com | Pathology services including clinical NGS
Pathogenica | http://www.pathogenica.com | Clinical sequencing/NGS
Pathway Genomics | http://www.pathway.com | Personal genomics/nutrition
Personal Genome Diagnostics | http://www.personalgenome.com | Clinical cancer exome sequencing through interpretation
Personalis | http://www.personalis.com | Clinical interpretation of NGS data
Prevention Genomics | http://www.preventiongenetics.com | Clinical genetic screening
Prognosys | http://www.prognosysbio.com/sequensys | NGS sequencing and analysis services
qGenomics | https://www.ggenomics.com | Genomics for human health
QUEST Diagnostics | http://www.QuestDiagnostics.com | Clinical diagnostics including clinical genomics (see Genomic Vision)
Response Genetics | http://www.responsegenetics.com | Genetic and genomic approaches to cancer diagnostics
Rheumakit | http://www.rheumakit.com | NGS-based rheumatology diagnostics
Sequenta | http://sequentainc.com | Clinical NGS and immunology
Sophia Genetics | http://www.sophiagenetics.com | Integrated clinical NGS dry lab service
Sorensen Genomics | http://www.sorensongenomics.com | Clinical and forensic genomics
Synexa Life Sciences | http://www.synexagroup.com | Clinical human genomics
UD-GenoMed | http://www.ud-genomed.hu | Full-service clinical genomics lab
XDx | http://www.xdx.com | Clinical, expression-based transplant monitoring

http://grouthbio.com/Genome_Software_Service.php
https://www.illumina.com/products/by-area/oncology/cancer-panels.html
https://www.foundationmedicine.com/test/foundationone-cdx
https://www.macrogen.com/ko/business/diagnosis/cancer-genome
https://www.kr-geninus.com/html/service/service01.html

RNA sequencing
Kukurba et al., Cold Spring Harb Protoc 2015

Common analysis goals of RNA-seq analysis
- Gene expression and differential expression
- Alternative expression analysis
- Transcript discovery and annotation
- Allele specific expression
- Mutation discovery
- Fusion detection
- RNA editing

Rising KEYWORDs
[Figure: Google search trend (2004 – 2024) and publications in PubMed (1945 – 2024) for the keywords genome, transcriptome, epigenome, metagenome, microbiome, probiotics]

Association of microbiota with diseases
Microbiota-Gut-Brain axis
Brain → Gut → Microbiota
Microbiota → Gut → Brain
[Diagram labels: Brain, Gut, Microbiota]
Their functions are expanded to have critical roles in infections, autoimmune diseases, cancer, and neurological and psychiatric disorders (e.g., autism, depression, dementia).

Fecal Microbiota Transplant (FMT)
Gut microbiota play a crucial role in our health, and an imbalance can lead to various diseases. FMT is a treatment that transplants gut microbiota from a healthy donor to a patient to restore the balance of the patient's gut microbiota. The transplanted microbiota interact with the host's immune system to help restore health.

Fecal Microbiota Transplant (FMT)
Clinical studies with microbiome modulations
Cullin N et al., Cancer Cell,

Now, there are three causes of disease easily correctable!
Gene | Microbiome | Environment/Lifestyle

Purposes of Metagenome Analysis
▶ What species constitute the gut microbiota?
▶ What is the diversity of the gut microbiota ecosystem?
▶ Is the gut microbiota ecosystem stable?
▶ How does the gut microbiota distribution differ between diseased individuals and healthy individuals?
▶ What genes are present in the gut microbiota?
▶ Are there any functional issues with the gut microbiota?
▶ Is there potential for diagnosis and treatment using the gut microbiota?

Detection in host reads
[Figure] (unpublished)

Microbiome composition and diversity in CRC Tissues vs. Adjacent Non-Cancer Tissues
[Figure: Normal (n=54) vs. CRC (n=54)] (unpublished)

Microbiome composition and diversity in CRC Tissues vs. Adjacent Non-Cancer Tissues
[Figure: p value < 0.05; FC; log2(1.5); Enriched in Normal vs. Enriched in CRC] (Under review)

Approaches of Metagenome Analysis
 | Shotgun | 16S rRNA amplicon
Costs | High sequencing & computational cost | Inexpensive
Taxonomic resolution | Possible at species and strain level | Limited at genus level
Gene/functional analysis | Direct analysis with gene expression profiles | Assumptions based on known references
Plasmids/Phages/Viruses | Possible to detect | Not detectable
Host contamination | Host DNA is removed to avoid unnecessary sequencing costs | Applicable to high host DNA contamination

Detection in host reads
Walker et al., Bioinformatics, 2018
Dohlman et al., Cell Host & Microbe, 2021

Lab-work procedure
Anahtar et al., J. Vis. Exp, 2016

Lab-work procedure
https://help.ezbiocloud.net/mtp-pipeline/
Starting with raw sequencing reads (.fastq)

How can we analyze such sequencing data?
The first step is… to process raw materials, so that the data can be broken down easily.

Data types
.fastq

Data types
.fastq (extension of .fasta + BQ, base quality)
Line | Description
1 | Information of each read, starting with ‘@’
2 | Actual (biological) sequence
3 | Delimiter, ‘+’
4 | Base quality in Phred score

Data types
.sam/.bam (summarizing position, quality and structure for each read)

Data types
Visualization of .bam in IGV (Integrative Genomics Viewer)
[Figure labels: G/A heterozygote; depth of coverage; individual reads aligned to the reference (.bam); reference sequence]

Data types
[Figure: file types along the analysis workflow: .fastq, .bam, .vcf/maf, .tsv/txt]
Kim KT, Lee HW, Lee HO et al., 2015 Genome Biol.
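The four-line .fastq record layout described above can be read with a few lines of Python. The sketch below is illustrative only and not part of the lecture: it assumes the common Phred+33 ASCII encoding of base qualities and unwrapped (single-line) sequences, and the names `FastqRecord`, `read_fastq`, and `phred_to_error_probability` are hypothetical.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str              # line 1 without the leading '@'
    sequence: str          # line 2, the biological sequence
    qualities: List[int]   # line 4 decoded to Phred scores


def read_fastq(handle: TextIO, phred_offset: int = 33) -> Iterator[FastqRecord]:
    """Yield one record per four lines; phred_offset=33 is an assumed encoding."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in: {header!r}")
        scores = [ord(char) - phred_offset for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


def phred_to_error_probability(score: int) -> float:
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


# Two made-up reads used only to exercise the parser.
EXAMPLE = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!5?I\n"

if __name__ == "__main__":
    for record in read_fastq(io.StringIO(EXAMPLE)):
        mean_q = sum(record.qualities) / len(record.qualities)
        print(record.name, record.sequence, f"mean Q = {mean_q:.1f}")
```

Reading records lazily with a generator keeps memory use flat, which matters because real .fastq files are far too large to load at once. Production pipelines would normally rely on an established parser rather than hand-written code like this.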
Data types
[Diagram: samples; WES: .fastq → (mapping) → .bam → (variant calling) → .vcf → (annotation) → .txt; further labels: .vcf, .txt (raw), .bam; RNA-seq: .fastq, genes.txt]

WES processing pipeline
[Diagram: WES: .fastq → (mapping) → .bam → (variant calling) → .vcf → (annotation) → .txt]

Processing pipeline

Pre-processing reads (initial mapping)
BWA-MEM

Pre-processing reads (marking duplicates)
▶ Duplicates: non-independent measurements of a sequence
- Sampled from the same template of DNA
- Violate the assumptions of variant calling
- Errors in sample/library prep get propagated to all the duplicates
→ among duplicates, pick the ‘best’ copy
→ mitigates the effects of errors

Pre-processing reads (local realignment around indels to correct mapping errors)

Pre-processing reads (Base Recalibration, BQSR to correct sequencer errors)
Sequencers make systematic errors in base quality scores
BQSR corrects the quality scores (not the bases)

Variant Discovery

Variant Discovery (matched Tumor/Normal samples from the same individual)
Kwon et al., Cancer Discovery, 2021
TMB and mutational signatures
Mutational landscape
Cancer evolution

Why Computational Biology? Answers from MIT graduates
▪ More efficient and in-depth way to explore biology
▪ DNA is a massive dataset
▪ It answers questions not easily solvable by traditional experimental biology
▪ Computational biology and simulations can help deconvolve results from experiments
▪ It’s interesting and new
▪ There are tons of biological datasets waiting to be analyzed
▪ Biology benefits from approximation
▪ More and more sequencing data are coming out
▪ It might be the biggest frontier of computing today
▪ Because you can use others’ datasets and then get good research done on a budget
▪ It has the potential to do whatever you want without waiting for experiments
▪ Life itself is digital
▪ Efficiency (reduce experimental space to cover)
▪ Ability to visualize
▪ Pattern finding
▪ There are rules
▪ It’s all about data

Thank you