Machine Learning and Bioinformatics Lecture Notes PDF
Summary
These lecture notes cover machine learning concepts, including supervised and unsupervised learning, and deep learning, with an application focus on bioinformatics. The material includes various types of neural networks like fully connected and convolutional networks, and discusses applications such as DNA motif identification and SNP genotyping.
Full Transcript
Lecture 11b: Machine Learning and Bioinformatics
BIO4BI3 - Bioinformatics

Artificial Intelligence
https://rapidminer.com/blog/artificial-intelligence-machine-learning-deep-learning/

Machine Learning
With tight ties to statistics, ML is the study and application of algorithms that enable computers to learn from data and perform a task without relying on explicit rules.
The paradigm is to provide data to train the computer system, then test it with unseen data.
Supervised learning: a model is built that best transforms input data into known outputs. Once the model is satisfactory, it can be applied to make novel predictions.
Classification and regression are types of supervised learning.
Unsupervised learning: a model of the data is built without the use of known outputs. These techniques are used for clustering, pattern discovery, dimensionality reduction, and feature learning.

Supervised Learning
Support Vector Machine
Linear Regression
Logistic Regression
https://docs.opencv.org
https://en.wikipedia.org/wiki/Linear_regression
https://docs.microsoft.com/en-us/azure/machine-learning/studio/algorithm-ch

Unsupervised Learning
K-Means Clustering
Hierarchical Clustering
Principal Component Analysis
https://www.mathworks.com/help/stats/kmeans.html
https://stackabuse.com/hierarchical-clustering-with-python-and-scikit-learn/
http://www.gettinggeneticsdone.com/2011/10/new-dimension-to-principal-components_27.html

Deep Learning
Uses layers of artificial neurons to build layered models for data analysis.
Applied to both supervised and unsupervised learning.
"Deep" refers to the number of layers in the architecture.
There are many types of neural networks: fully connected, convolutional, recurrent.
The theory behind DL is decades old, but recent advances in hardware and access to large datasets have led to its increased use.
Deep learning does not require manual feature engineering for the model.

Artificial Neuron
A function that takes a number of inputs.
The inputs are multiplied by weights.
These values are summed.
A bias may be added to the sum.
The sum is then processed by an activation function to propagate the signal.
https://www.learnopencv.com/understanding-activation-functions-in-deep-learning/

Multilayer Perceptron

Activation Functions
The activation function adds a non-linearity to the neural network, which allows the network to approximate essentially any continuous function.
Originally modelled on the biological neuron: an input threshold must be met before the signal is propagated to other neurons.
Sigmoid or logistic activation: bounds the output between 0 and 1, centred on 0.5.
Tanh: bounds the output between -1 and 1, centred on zero.
Rectified Linear Unit (ReLU): outputs the greater of the input or zero.

Sigmoid Activation Function
https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6

Tanh Activation Function
https://i.stack.imgur.com/Mg9s9.png

ReLU Activation Function
https://medium.com/@kanchansarkar/relu-not-a-differentiable-function-why-used-in-gradient-based-optimization-7fef3a4cecec

Feed Forward
An artificial neural network with no cycles in the network graph.
Single-layer perceptron
Multi-layer perceptrons
The first and most basic types of neural networks.
Most commonly trained through back-propagation.
https://en.wikipedia.org/wiki/Feedforward_neural_network

Back-propagation
The most popular method to train deep neural networks.
During training, the difference between the known results (ground truth) and the output of the network is calculated.
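
The artificial neuron, the three activation functions, and the difference between network output and ground truth described in these notes can be sketched in a few lines of Python. This is a minimal illustration only: the input values, weights, bias, and the choice of half the squared difference as the error measure are assumptions made for the example, not values from the lecture.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic activation: output bounded between 0 and 1, with sigmoid(0) = 0.5."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x: float) -> float:
    """Tanh activation: output bounded between -1 and 1, centred on zero."""
    return math.tanh(x)


def relu(x: float) -> float:
    """Rectified linear unit: the greater of the input or zero."""
    return max(0.0, x)


def neuron(inputs, weights, bias, activation):
    """Multiply inputs by weights, sum them, add the bias, apply the activation."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(total)


def squared_error(predicted: float, target: float) -> float:
    """One possible error measure (an assumption here): half the squared difference."""
    return 0.5 * (predicted - target) ** 2


# Hypothetical example values, chosen only for illustration.
inputs = [1.0, 2.0]
weights = [0.5, -0.25]
bias = 0.0

output = neuron(inputs, weights, bias, sigmoid)  # weighted sum is 0.0, so output is 0.5
error = squared_error(output, target=1.0)        # 0.5 * (0.5 - 1.0) ** 2 = 0.125
```

Swapping `sigmoid` for `tanh` or `relu` in the call to `neuron` changes only how the summed signal is squashed before it is passed on.
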
This difference is the error term, or cost.
The error on the output layer is calculated using a ‘loss function’.
The error is then propagated back through the network, and the contribution of each neuron to the error (its delta) is calculated.
For each weight, the delta of the neuron it feeds into is multiplied by the input arriving on that weight to find the gradient.
The gradient multiplied by a ‘learning rate’ is subtracted from each weight.
The dataset is fed through the network again, and the process is repeated until the error reaches a minimum.

Fully Connected Neural Network
A network in which each neuron in layer n-1 is connected to each neuron in layer n.
Often the final layer(s) of a neural network are fully connected.
A linear series of data (called a tensor) is used as input to the input layer.
The input tensor can be multi-dimensional.
Over-fitting must be watched for.

Convolutional Neural Network
A popular class of neural networks for analyzing image data.
Re-invigorated research in deep neural networks when researchers from the University of Toronto showed substantially improved image recognition performance.
Fully connected neural networks don’t preserve spatial relationships.
CNNs are collections of convolution layers, pooling layers, and fully connected layers.
Convolution layers apply kernels of fixed size, called filters, to the input image.
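
Applying a fixed-size filter to an image can be sketched in plain Python as follows. This is a hedged illustration, not how any particular framework implements it: there is no padding, the stride is 1, and the kernel is not flipped (the cross-correlation form that deep learning libraries usually call convolution). The 4x4 image and the hand-written edge filter are hypothetical; in a trained CNN the filter values would be learned.

```python
def convolve2d(image, kernel):
    """Slide a kernel over an image (no padding, stride 1) and return the feature map."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    feature_map = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Weighted sum of the image patch currently under the kernel.
            total = 0.0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Hypothetical 4x4 image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-written 2x2 filter that responds to a dark-to-bright step from left to right.
edge_filter = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, edge_filter)
# Each row of the 3x3 feature map is [0.0, 2.0, 0.0]: the filter fires only at the edge.
```

Because the same small filter is reused at every position, the output keeps the spatial layout of the input, which is the property fully connected layers lack.
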
Convolution filters learn the features of an image.
Later convolution layers combine earlier learned features into more complex features.
The fully connected layers are used to learn the classification of the image.

Convolutional Neural Networks
https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/

CNN Learned Filters
https://www.youtube.com/watch?v=pM68et1o3Zk

CNN Classifier

CNN for Medical Imaging
https://ars.els-cdn.com/content/image/1-s2.0-S1361841517301299-gr4.jpg

Transformers
Introduced in 2017 in “Attention is all you need”.
Originally an encoder-decoder approach for NLP.
Expanded the idea of ‘self-attention’.
Transformers allow positional information to be encoded all at once.
Transformers can be trained in parallel.
State-of-the-art LLMs are transformer based.
Transformers are replacing CNNs and RNNs/LSTMs in many problem domains.
https://jalammar.github.io/illustrated-transformer/
https://towardsdatascience.com/transformer-neural-network-step-by-step-breakdown-of-the-beast-b3e096dc857f

Transformers and attention
https://towardsdatascience.com/transformer-neural-network-step-by-step-breakdown-of-the-beast-b3e096dc857f

Transformers and self-attention
https://towardsdatascience.com/transformer-neural-network-step-by-step-breakdown-of-the-beast-b3e096dc857f

Receiver Operating Characteristic Curve
https://cdn-images-1.medium.com/max/1600/1*hf2fRUKfD-hCSw1ifUOCpg.png

Receiver Operating Characteristic Curve
https://images.radiopaedia.org/images/11592546/d785cb6dcc2fe33dcd316f1c90ad1d.jpg

Confusion Matrix / Sensitivity vs Specificity
Sensitivity: the ability to correctly identify true positives.
Specificity: the ability to correctly identify true negatives.
https://www.dataschool.io/content/images/2015/01/confusion_matrix2.png

Applications in Bioinformatics
Deep learning approaches are quickly gaining popularity in bioinformatics research.
Bioinformatics commonly deals with very large datasets, which are ideal for training neural networks.
Biological systems are noisy and complex, which makes feature engineering for traditional machine learning approaches difficult and often not robust.
Neural networks are able to learn the important features of a system from multiple sources of experimental data.

Applications in Bioinformatics
“Deep Learning in Bioinformatics”, Min et al.
https://arxiv.org/vc/arxiv/papers/1603/1603.06430v3.pdf

Identification of DNA Binding Motifs
‘DeepBind’ uses a convolutional neural network to identify DNA motifs associated with protein binding.
Nature Biotechnology volume 33, pages 831–838 (2015)

Identification of RNA Binding Motifs
‘iDeepS’ uses two CNNs combined with an LSTM RNN to predict RNA motifs associated with protein binding.
The authors combined primary sequence information and predicted RNA 2-D structure to achieve SOTA predictions.
BMC Genomics 2018 19:511

SNP Genotyping
Google’s ‘Deep Variant’ encodes DNA read mappings and associated data as an RGB image.
The images are then analyzed by a CNN to identify polymorphisms.
Nature Biotechnology volume 36, pages 983–987 (2018)

SNP Genotyping
The Broad Institute’s DL SNP discovery system.
Regions surrounding potential polymorphisms are identified.
Information regarding the sequence alignment, read characteristics, and quality is encoded in a ‘read tensor’.
These tensors are then evaluated by a non-CNN to identify polymorphisms.
https://gatkforums.broadinstitute.org/gatk/discussion/10996/deep-learning-in-gatk4

Phenotype/Genotype Prediction
A CNN-LSTM framework to classify Arabidopsis plants by variety (ecotype).
It uses a CNN to develop visual features for classification and an LSTM to use changes in the plant over time (growth) to classify the plants into ecotypes.
Plant Methods 2018 14:66

DNA-BERT2
A deep learning model based on Google’s state-of-the-art NLP architecture, BERT (bidirectional encoder representations from transformers).
Genomic DNA can be regarded as containing a regulatory ‘grammar’.
DNA-BERT can learn those underlying rules.
Pre-trained on human genomic sequence and annotations.
With a small amount of training data, DNA-BERT has achieved SOTA performance in predicting promoters, splice sites, and transcription factor binding sites.
Can be used for non-human analysis with a small amount of additional training.

AlphaFold
In silico protein structure prediction is one of the ‘grand challenges’ of biology.
There are an estimated 10^300 possible conformations for a typical protein. It would take longer than the age of the universe to evaluate all of them for a single protein.
A computational protein folding competition named CASP has been held since 1994.
Using ~170,000 structures from the PDB and 200 million predicted protein sequences, DeepMind developed a deep learning architecture called AlphaFold.
In 2020, AlphaFold2 achieved a median accuracy score (GDT) of 92.4 across all categories. A score of around 90 is considered comparable to experimental results.
The 2024 Nobel Prize in Chemistry was awarded in part to members of the AlphaFold team.

AlphaFold2
The team released the predicted structures of every predicted protein across a large number of model species.
Others are expanding on their work: RoseTTAFold was inspired by AlphaFold and achieves almost comparable predictions with much less computational power.
Meta recently released its own protein folding software, which is a 60X speed-up over AlphaFold2.
Advances are being reported in predicting protein complex structures based on AlphaFold2.
It is being applied to drug design and artificial antibody design.
It isn’t perfect, but it is a dramatic improvement in our ability to go from raw DNA sequence to the 3D structure of a protein.

State of the Art: AlphaFold3
The latest AI-based protein structure prediction tool from Google DeepMind.
Improvements:
Predicts structures for proteins, nucleic acids, ligands, and ions.
Diffusion-based architecture for enhanced accuracy.
Performance Highlights:
Protein-ligand docking: outperforms classical tools (e.g., AutoDock).
Protein-nucleic acid interactions: higher accuracy than RoseTTAFold2NA.
Antibody-antigen modeling: significant improvements over AF-Multimer.
Challenges:
Static structure predictions.
Stereochemical errors and occasional clashing atoms.

AlphaFold3

Summary
Applying deep learning methods to bioinformatics is increasing in popularity.
The high-volume, complex data from bioinformatics is a good fit for artificial neural network techniques.
The ability to avoid feature development before making predictions is very attractive.
Some are unsure about the ‘black-box’ nature of these systems, but progress is being made on understanding how NNs make their predictions.
This area will continue to grow as we work to integrate larger and more diverse datasets.
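
As a closing illustration, the back-propagation steps covered earlier in these notes (compute the error, find the delta, form the gradient, subtract it scaled by a learning rate, repeat) can be sketched for a single sigmoid neuron. This is a toy sketch under stated assumptions: the logical-OR dataset, the squared-error loss, the learning rate of 1.0, and the 5000 passes are all hypothetical choices, and the gradient for each weight is taken as the delta times the input arriving on that weight.

```python
import math


def sigmoid(x):
    """Logistic activation, bounded between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def predict(weights, bias, inputs):
    """Forward pass of one neuron: weighted sum plus bias, then sigmoid."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def total_error(weights, bias, dataset):
    """Sum of half squared differences between outputs and ground truth."""
    return sum(0.5 * (predict(weights, bias, x) - t) ** 2 for x, t in dataset)


def train(dataset, learning_rate=1.0, epochs=5000):
    """Repeatedly feed the dataset through the neuron and adjust its weights."""
    weights = [0.0] * len(dataset[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in dataset:
            output = predict(weights, bias, inputs)
            # Delta: derivative of the squared-error loss passed back through the sigmoid.
            delta = (output - target) * output * (1.0 - output)
            # Gradient for each weight is the delta times the input on that weight.
            weights = [w - learning_rate * delta * x for w, x in zip(weights, inputs)]
            bias -= learning_rate * delta
    return weights, bias


# Hypothetical toy dataset: the logical OR of two binary inputs.
or_data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]

error_before = total_error([0.0, 0.0], 0.0, or_data)
weights, bias = train(or_data)
error_after = total_error(weights, bias, or_data)
predictions = [round(predict(weights, bias, x)) for x, _ in or_data]
```

A network with hidden layers repeats the same update for every layer, passing each layer's deltas backwards through its weights to obtain the deltas of the layer before it.
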