Applied Analytics for Business – Healthcare PDF

Applied Analytics for Business – Healthcare
Dr. Ravi Shankar
Prof – Data Science (AI-ML)
Dean – Enterprise Solutions & Academic Collaborations
https://www.linkedin.com/in/ravi-shankar-70584b1/
[email protected]
S4 & S5, AUG 31st, 2023
• ML application in HC
• Classifying DNA Sequences
• Ten Algorithms

Faculty Profile: Dr. Ravi Shankar
INTRODUCTION
• Background: PhD - Econometrics, Entrepreneur, Start-up Mentor, Social Impact Investor, Data Science Thought Leader, Senior Venture Partner
• Current Focus: AI adoption in Business, Deep Learning, XAI, ML Operations, Design Thinking, OB & Strategy
• Sectoral Exposure: multiple sectors straddling BFSI / Pharma / Manufacturing / Retail / Tech
• Overall Experience: 30 Yrs.

Applied Analytics for Business - Healthcare: Learning Journey
Five sessions of 1.5 hours each
Sessions: First & Second session, Third session, Fourth session, Fifth A session, Fifth B session
Topics: Introduction; Healthcare 4.0 + HC - AMM; AI-ML Real World Applications – I; AI-ML Real World Applications – II; Data Governance; DNA Classification Dataset + Text Mining; DNA Classification Algorithms

ML ALGORITHMS TO CLASSIFY DNA SEQUENCES
• BIOINFORMATICS
• GENOMICS
• E-COLI DNA SEQUENCES
• UCI REPOSITORY
• TEXT TO NUMERICAL DATA
• BUILDING & TRAINING
• COMPARE & CONTRAST
• INFERENCES

STRUCTURE & LEARNING OUTCOMES…
• About Bioinformatics & Genomics
• Building & evaluating supervised classification algorithms to classify DNA sequences

Python Code Design & Flow
1. Import data from the UCI repository
2. Convert text inputs to numerical data
3. Build and train classification algorithms
4. Compare and contrast classification algorithms
https://youtu.be/v1cTNhiZ2_c
https://youtu.be/mmgIClg0Y1k

GENOMICS – 101
ALPHA FOLD
Levinthal's paradox (the theory of protein folding): because an unfolded polypeptide chain has a very large number of degrees of freedom, the molecule has an astronomical number of possible conformations.
An estimate of 10^300 was made in one of his papers. For example, a polypeptide of 100 residues will have 99 peptide bonds, and therefore 198 different phi and psi bond angles. If each of these bond angles can be in one of three stable conformations, the protein may misfold into a maximum of 3^198 different conformations (including any possible folding redundancy). Therefore, if a protein were to attain its correctly folded configuration by sequentially sampling all the possible conformations, it would require a time longer than the age of the universe. This is true even if conformations are sampled at rapid (nanosecond or picosecond) rates. The "paradox" is that most small proteins fold spontaneously on a millisecond or even microsecond time scale. The solution to this paradox has been established by computational approaches to protein structure prediction.
https://www.youtube.com/watch?v=TCCjZe0y4Qc
https://youtu.be/gg7WjuFs8F4

1. What is bioinformatics?
https://youtu.be/iCISHYdrCOs

What is bioinformatics?
• Bioinformatics, n. The science of information and information flow in biological systems, esp. of the use of computational methods in genetics and genomics. (Oxford English Dictionary)
• "The mathematical, statistical and computing methods that aim to solve biological problems using DNA and amino acid sequences and related information." -- Fredj Tekaia
• "I do not think all biological computing is bioinformatics, e.g. mathematical modelling is not bioinformatics, even when connected with biology-related problems. In my opinion, bioinformatics has to do with management and the subsequent use of biological information, particular genetic information." -- Richard Durbin

• DNA & Nucleotides
The nucleotides which make up DNA are adenine (A), thymine (T), cytosine (C), and guanine (G). In RNA, the thymine is replaced with uracil (U).
Together, these small chemical compounds make up the genetic code of an organism, with their arrangement coding for the production of a number of proteins. Adenine can only bond with thymine, and cytosine can only bond with guanine. This means, for example, that when a strand of DNA is examined, if there's an A on one end of a rung, a T must be on the other. Adenine and thymine form a base pair in DNA, as do cytosine and guanine.
https://youtu.be/2JUu1WqidC4

What are Base Pairs?
DNA contains base pairs of nucleotides. Base pairs are pairs of nucleotides joined by hydrogen bonds, found in DNA and RNA. DNA is typically double-stranded, with a structure which resembles a ladder, each base pair making up a single rung of the ladder. Base pairs have a number of interesting properties which make them topics of interest, and understanding how base pairs work is important to many geneticists.

• What are Base Pairs?
Adenine and guanine are both types of molecules known as purines, while thymine and cytosine are pyrimidines. Two purines are too large to fit together on one rung of the ladder, while two pyrimidines are too small to span it. This means that adenine cannot form a base pair with guanine, and thymine cannot form a base pair with cytosine.
https://youtu.be/NTO_1oDE_HU

• Promoter
A promoter is a sequence of DNA needed to turn a gene on or off. The process of transcription is initiated at the promoter. Usually found near the beginning of a gene, the promoter has a binding site for the enzyme used to make a messenger RNA (mRNA) molecule.

BIOINFORMATICS in a Nutshell
Biology Is Extremely Complex, Indeed!!!
“… We think that physics is complicated because it is hard for us to understand, and because physics books are full of difficult mathematics.
But the objects that physicists study are still basically simple objects. … The objects and phenomena that a physics book describes are simpler than a single cell in the body of its author. …”

Hierarchy of Organization
molecule (chlorophyll) → organelle (chloroplast) → cell (plant cell) → tissue (plant epithelial) → organ (leaf) → organism (maple tree) → population (maple population) → ecosystem → biome

Amazing Organization: Emergent Properties
[Diagram: the same hierarchy, molecule through biome]
“Each level of biological organization has emergent properties.”
We know very little about biology as a whole.

What is Bioinformatics?
Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data. As an interdisciplinary field of science, bioinformatics combines computer science, statistics, mathematics, and engineering to study and process biological data. (Wikipedia)

Bioinformatics derives knowledge from computer analysis of biological data. These can consist of the information stored in the genetic code, but also experimental results from various sources, patient statistics, and scientific literature. Research in bioinformatics includes method development for storage, retrieval, and analysis of the data. Bioinformatics is a rapidly developing branch of biology and is highly interdisciplinary, using techniques and concepts from informatics, statistics, mathematics, chemistry, biochemistry, physics, and linguistics. It has many practical applications in different areas of biology and medicine.
-- Michael Nilges & Jens P. Linge, Institut Pasteur

Bioinformatics in My Opinion!!!
Bioinformatics is an interdisciplinary subject that uses knowledge and techniques from computer science, mathematics, statistics, information technology and linguistics to extract information from massive biological data.
Synonyms of BIOINFORMATICS
• computational biology
• biocomputing
• computational molecular biology

Bioinformatics ≠ Computer + Biology
Computer Scientist | Bioinformatician | Biologist
Bioinformaticians are the bridge between these groups.

Central Dogma
How does information flow?
Ref: http://genius.com/Biology-genius-the-central-dogma-annotated

Sequence = strings
DNA = {A, T, C, G}
RNA = {A, U, C, G}
protein = {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}

CATCAGCTCCACGCATCAGCGACTACACATTCGACTCAGCATCGACTACGCATCAGCTCCACGCATCAGCGACT
ACACATTCGACTCAGCATCGACTACGCATCAGCTCCACGCATCAGCGACTACACATTCGACTCAGCATCGACTA
CGCATCAGCTCCACGCATCAGCGACTACACATTCGACTCAGCATCGACTACGCATCAGCTCCACGCATCAGCGA
CTACACATTCGACTCAGCATCGACTACGCATCAGCTCCACGCATCAGCGACTACACATTCGACTCAGCATGACT

MSFQDIQQSEHFLLRPSEKVQKLETSQWPLLLKNFDKLNVLTNHYVPIPSGCSPLKRSIEDYVKSGFINLDKPA
NPSSHEVVAWAKRILKVDKTGHSGTLDPKVTGCLIVCIERATRLVKSQQGAGKEYVCIFHLHSPVEDEQKVAKN
IERLTGALFQRPPLISAVKRQLRVRTVYESKMLEYDKDKGMGVFWVSCEAGTYIRTMCVHLGLFLGVGGQMQEL
RRVRSGINSEKEGLVTMHDILDAQWLYENHKDESYLRRAIKPLEALLTSHKRVIMKDTAVNALCYGAKIMLPGV

Main Types of Biological Data
• Sequence Data
• Structural Data
• Profile Data

(Some) Areas of Bioinformatics
• Biodatabase
• Sequence Analysis
• Structural Bioinformatics
• Microarray Data Analysis
• Systems Biology

Biodatabase
Why Do Biologists Need Databases?
PubMed The World Largest Biodatabases http://www.ncbi.nlm.nih.gov Growth of GeneBank Ref: http://www.nlm.nih.gov/about/2015CJ.html PDBJ KEGG Database Pfam Database Genome is a book of life ATTCGACTCAGCATCGACTACGCATCAGCTCCACGCATCAG CGACTACACATTCGACTCAGCATCGACTACGCATCAGCTCC ACGCATCAGCGACTACACATTCGACTCAGCATCGACTACGC ATCAGCTCCACGCATCAGCGACTACACATTCGACTCAGCAT CGACTACGCATCAGCTCCACGCATCAGCGACTACACATTCG ACTCAGCATCGACTACGCATCAGCTCCACGCATCAGCGACT ACACATTCGACTCAGCATCGACTACGCATCAGCTCCACGCA TCAGCGACTACACATTCGACTCAGCATCGACTACGCATCAG CTCCACGCATCAGCGACTACACATTCGACTCAGCATCGACT ACGCATCAGCTCCACGCATCAGCGACTACACATTCGACTCA GCATCGACTACGCATCAGCTCCACGCATCAGCGACTACACA TTCGACTCAGCATCGACTACGCATCAGCTCCACGCATCAGC GACTACACATTCGACTCAGCATGACTACACATTCGACTCAG CATCGACTACGCATCAGCTCCACGCATCAGCGACTACACAT TCGACTCAGCATCGACTACGCATCAGCTCCACGCATCAGCG ACTACACATTCGACTCAGCATCGACTACGCATCAGCTCCAC GCATCAGCGACTACACATTCGACTCAGCATCGACTACGCAT CAGCTCCACGCATCAGCGACTACACATTCGACTCAGCATCG ACTACGCATCAGCTCCACGCATCAGCGACTACACATTCGAC TCAGCATCGACTACGCATCAGCTCCACGCATCAGCGACTAC ACATTCGACTCAGCATCGACTACGCATCAGCTCCACGCATC AGCGACTACACATTCGACTCAGCATCGACTACGCATCAGCT CCACGCATCAGCGACTAAAAACTCGCGCCTACAGCGCATCA GCATACGACTACAACGACAGCAGCAGCAGCAGCAGCAGCAG CAGCGCCCCAGAAGAGAGAGAACACATTCGACTCAGCATCG ACTACGCATCAGCTCCACGCATTCAGCTCCACTACCGACGA TTAATCTACTACTACTCCCCTATTTCACCTATTTACATCAC AAAACCGACTCGACATCAGCTCTTCGCATCAGCTACGACGC ATCAAGCAGACGACTACGACCGCGCGACAGCAGCGACACTC CCGCGCAACCAACAGATAGATAGATAGAAAAACCGACTCGA CATCAGCTCTTCGCATCAGCTACGACGCATCAAGCAGACGA CTACGACCGCGCGACAGCAGCGACACTCCCGCGCAACCAAC AGATAGATAGATAGAAAAACCGACTCGACATCAGCTCTTCG CATCAGCTACGACGCATCAAGCAGACGACTACGACCGCGCG ACAGCAGCGACACTCCCGCGCAACCAACAGATAGATAGATA GAAAAACCGACTCGACATCAGCTCTTCGCATCAGCTACGAC GCATCAAGCAGACGACTACGACCGCGCGACAGCAGCGACAC TCCCGCGCAACCAACAGATAGATAGATAGAAAAACCGACTC GACATCAGCTCTTCGCATCAGCTACGACGCATCAAGCAGAC GACTACGACCGCGCGACAGCAGCGACACTCCCGCGCAACCA ACAGATAGATAGATAGAAAACCGACTCGACATCAGCTCTTC GCATCAGCTACGACGCATCAAGCAGACGACTACGACCGCGC GACAGCAGCGACACTCCCGCGCAACCAACAGATAGATAGAT 
AGAAAAACCGACTCGACATCAGCTCTTCGCATCAGCTACGA CGCATCAAGCAGACGACTACGACCGCGCGACAGCAGCGACA CTCCCGCGCAACCAACAGATAGATAGATAGAAAAACCGACT CGACATCAGCTCTTCGCATCAGCTACGACGCATCAAGCAGA CGACTACGACCGCGCGACAGCAGCGACACTCCCGCGCAACC AACAGATAGATAGATAGAAAAACCGACTCGCTACGACGCAT CAAGCAGACGACTACGACCGCGCGACAGCAGCGACACTCCC GCGCAACCAACAGATAGATAGATAGAAAAACCGACTCATCC GCCCCCCCCCCGCGCGCCGAACTAGACATCAGCTCTTCGCA TCAGCTACGACGCATCAAGCAGACGACTACGACCGCGCGAC AGCAGCGACACTCCCGCGCAACCAACAGATAGATAGATAGA

Genome Sequencing: think big!!!
• The first bacterial genome (Haemophilus influenzae)
• The first eukaryotic genome (Saccharomyces cerevisiae)
• The first archaeal genome (Methanococcus jannaschii)
• The first plant genome (Arabidopsis thaliana)

$1000 per Genome in 2015
The $1,000 genome
With a unique programme, the US government has managed to drive the cost of genome sequencing down towards a much-anticipated target.
[Chart labels: Moore's law for computing costs; cost of genome sequencing; cost axis starting at US$100 million; years 2002 to 2008]

In Silicon Valley, Moore's law seems to stand on equal footing with the natural laws codified by Isaac Newton. Intel co-founder Gordon Moore's iconic observation that computing power tends to double, and that its price therefore halves, every 2 years has held true for nearly 50 years with only minor revision. But as an exemplar of rapid change, it is the target of playful abuse from genome researchers. In dozens of presentations over the past few years, scientists have compared the slope of Moore's law with the swiftly dropping costs of DNA sequencing. For a while they kept pace, but since about 2007, it has not even been close. The price of sequencing an average human genome has plummeted from about US$10 million to a few thousand dollars in just six years. That does not just outpace Moore's law: it makes the once-powerful predictor of unbridled progress look downright sedate.
And just as the easy availability of personal computers changed the world, the breakneck pace of genome-technology development has revolutionized bioscience research. It is also set to cause seismic shifts in medicine.

In the eyes of many, a fair share of the credit for this success goes to a grant scheme run by the US National Human Genome Research Institute (NHGRI). Officially called the Advanced Sequencing Technology awards, it is known more widely as the $1,000 and $100,000 genome programmes. Started in 2004, the scheme has awarded grants to 97 groups of academic and industrial scientists, including some at every major sequencing company. It has encouraged mobility and cooperation among technologists, and helped to launch dozens of competing companies, staving off the stagnation that many feared would take hold after the Human Genome Project wrapped up in 2003.

“The major companies in the space have really changed the way people do sequencing, and it all started with the NHGRI funding,” says Gina Costa, who has worked for five influential companies and is now a vice-president at Cypher Genomics, a genome-interpretation firm in San Diego, California.

BY ERIKA CHECK HAYDEN
[Chart labels: $10 million; "As next-generation sequencers entered the market, the price dropped precipitously."]

A GIANT'S LEGACY
The $1,000 genome programme, now close to achieving its goal, will award its final grants this year. As technology enthusiasts look to future challenges, the coming milestone raises questions about how the roughly $230-million government programme managed to achieve such success, and whether its winning formula can be applied elsewhere. It benefited from fortuitous timing and the lack of an entrenched industry. But Jeffery Schloss, director of the division of genome sciences at the NHGRI in Bethesda, Maryland, who has run the programme from its inception, says that its achievements also suggest that there are ways to navigate public–private partnerships successfully.
“One of our challenges is to figure out what is the right role for the government; to not get in the way, but feed the pipeline of private-sector technology development,” he says. The quest to sequence the first human genome was a massive …
[Chart labels: $1 million, $100,000, $10,000, $1,000; years 2010 and 2012; "The price of sequencing a whole human genome hovers around $5,000 and is expected to drop even lower."]
© 2014 Macmillan Publishers Limited. All rights reserved
Modified from: Hayden EC. (2014) Nature 507:294–295.

Human Genome
The human genome is 380,000 times longer than the sequence shown here.

From Gene to Genome
Human Genome Project
Achievements beyond HGP

20 Years of Bacterial Genome Sequencing
[Chart: number of genomes sequenced (0 to 16,000) by year]
Land M, et al. (2015) Funct Integr Genomics 15:141–161.

Deinococcus radiodurans
• deinos: unusual
• extraordinarily resistant to oxidative stress, including desiccation and radiation
• survives radiation of around 3–5 million rad (100 rad can kill a human)

Comparative Genomics
[Diagram: genome 1, genome 2 and genome 3, labelled with core genes and decorative genes]

Microbiome
We are entering the new era of omics, a wide variety of large-scale, multi-dimensional biology.
From Standalone Biology to ‘Omics’ Study
Features of the omics approach:
• high-throughput, data-driven, holistic, top-down methods
• understanding cell metabolism in one ‘integrated system’
• high-output, requires bioinformatics to analyze & manipulate

Omics Study Relies on Central Dogma
DNA (GENOMICS) → transcription → mRNA (TRANSCRIPTOMICS) → translation → Protein (PROTEOMICS) → metabolism → Metabolite (METABOLOMICS)

Understanding the Genome Is Not Enough
• genomics is static
• we don't know the set of genes that are expressed in a particular condition
• some phenotypes are a consequence of gene interaction (emergent property)
• many changes happen in the downstream processes of genetic information (not in DNA)

Microarray chip
DNA Chip
Microarray Technology
Microarray Data

Clustering of Microarray Data
• microarray data clustering
• the trees on the top and left are just dendrograms
• always plots genes versus conditions
• intensity of each color represents level of expression

Background
• Explore the world of bioinformatics by using supervised classification models, such as K-nearest neighbor (KNN) algorithms, support vector machines, and other common classifiers, to classify short E. coli DNA sequences.
• We will use a dataset from the UCI Machine Learning Repository that has 106 DNA sequences, with 57 sequential nucleotides (“base-pairs”) each.

E. coli is Critical to Genetic Advances
The microorganism Escherichia coli (E. coli) has a long history in the biotechnology industry and is still the microorganism of choice for most gene cloning experiments. Although E. coli is known by the general population for the infectious nature of one particular strain (O157:H7), few people are aware of how versatile and widely used it is in research as a common host for recombinant DNA (new genetic combinations from different species or sources).

How E. coli Makes a Difference:
E. coli is an incredibly versatile tool for genetic engineers; as a result, it has been instrumental in producing an amazing range of medicines and technologies. It has even, according to Popular Mechanics, become the first prototype for a bio-computer: "In a modified E. coli 'transcriptor,' developed by Stanford University researchers March 2007, a strand of DNA stands in for the wire and enzymes for the electrons. Potentially, this is a step towards building working computers within living cells that could be programmed to control gene expression in an organism." Such a feat could only be accomplished with the use of an organism that is well understood, easy to work with, and able to replicate quickly.

Reasons E. coli Is Used for Gene Cloning (thoughtco.com)
• Genetic Simplicity
• Growth Rate
• Safety
• Well Studied
• Foreign DNA Hosting
• Ease of Care
Escherichia coli, also known as E. coli, is a Gram-negative, facultative anaerobic, rod-shaped, coliform bacterium of the genus Escherichia that is commonly found in the lower intestine of warm-blooded organisms (endotherms).

• Data set from UCI repository
• 106 E. coli gene sequences
• 57 nucleotides each
Index of /ml/machine-learning-databases/molecular-biology/promoter-gene-sequences (uci.edu)
• Number of Instances: 106.
• Number of Attributes: 59
  -- class (positive or negative)
  -- instance name
  -- 57 sequential nucleotide ("base-pair") positions

Python Implementation
• IMPORT DATASETS AND LIBRARIES
• PERFORM EXPLORATORY DATA ANALYSIS AND VISUALIZATION
• PREPARE THE DATA BEFORE TRAINING THE AI/ML MODEL
• BUILD AND TRAIN supervised classification algorithms

Importing Libraries
We will be using some common Python libraries, such as pandas and numpy. Furthermore, for the machine learning side of this project, we will be using sklearn. Import these libraries first to confirm that they are correctly installed.
import pandas
import numpy
import sklearn

Classification Algorithms
• Linear Discriminant Analysis
• Logistic Regression
• Gaussian NB
• SVC
• KNN
• Decision Tree
• Random Forest
• Bagging
• AdaBoost
• GradientBoosting
• ExtraTrees

Data & Modeling Approach
• Preprocessing / EDA - Check for Duplicates
• Split Data to Training and Validation set
• Spot-Check Algorithms & Normalized Models
• Train & Save Trained Model
• Performance Measure Metrics
• Validation
• Tuning
• Feature Extraction

Preprocessing
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, auc, roc_curve
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import RFE
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer, Binarizer, LabelEncoder

Assignments
1. Generate & publish results for the remaining classifiers. Compare & contrast.
2. Replicate Research Paper - "Analysis of DNA Sequence Classification Using CNN and Hybrid Models", Hindawi, Computational and Mathematical Methods in Medicine, Volume 2021, Article ID 1835056, 12 pages, https://doi.org/10.1155/2021/1835056

Thank you
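
The "convert text inputs to numerical data" and "compare and contrast" steps of the Python flow could be sketched roughly as below. This is a minimal sketch under stated assumptions, not the course notebook: the record layout (class, instance name, sequence, separated by commas) is taken from the attribute list in the dataset description, but the exact whitespace in the file is assumed; the one-hot encoding, the helper names parse_record, one_hot and spot_check, the three classifiers and their settings are all illustrative choices. spot_check needs scikit-learn and imports it lazily, so the encoding helpers run without it.

```python
from typing import Dict, List, Tuple

NUCLEOTIDES = "acgt"  # assumed column order for the one-hot encoding


def parse_record(line: str) -> Tuple[int, str, str]:
    """Split one 'class,name,sequence' record into (label, name, sequence).

    Assumes '+' marks the positive class; anything else is treated as negative.
    """
    label, name, seq = (field.strip() for field in line.split(",", 2))
    return (1 if label == "+" else 0), name, seq.lower()


def one_hot(seq: str) -> List[int]:
    """Encode each nucleotide as four 0/1 indicators, concatenated in order."""
    row: List[int] = []
    for base in seq:
        row.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return row


def spot_check(X, y, folds: int = 5) -> Dict[str, float]:
    """Mean cross-validated accuracy per classifier (requires scikit-learn)."""
    # Imported here so the encoding helpers above work without scikit-learn.
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    # Illustrative subset of the classifiers listed in the slides.
    models = {
        "KNN": KNeighborsClassifier(n_neighbors=3),
        "SVC": SVC(kernel="linear"),
        "LogisticRegression": LogisticRegression(max_iter=1000),
    }
    cv = KFold(n_splits=folds, shuffle=True, random_state=1)
    return {
        name: float(cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean())
        for name, model in models.items()
    }


if __name__ == "__main__":
    # Two made-up records (hypothetical names and sequences, not UCI data).
    records = ["+,EX1,\t\tacgt", "-,EX2,\t\tttga"]
    parsed = [parse_record(r) for r in records]
    X = [one_hot(seq) for _, _, seq in parsed]
    y = [label for label, _, _ in parsed]
    print(len(X), "rows,", len(X[0]), "columns, labels:", y)
```

With the 57-nucleotide sequences described above, one_hot would give 57 × 4 = 228 columns per row; stacking the rows gives the X passed to spot_check, with the parsed labels as y.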
