Lecture 4 - Data Mining for eHealth

Outline
- What is Data Mining?
- Data Mining Common Tasks
- Algorithms
- Performance and Interpretability

What is Data Mining?
“[Data Mining] is defined as the process of discovering patterns in data. The process must be automatic or (more usually) semi-automatic. The patterns discovered must be meaningful…”
Witten and Frank, “Data Mining: Practical Machine Learning Tools and Techniques”, 2005

From Data to Knowledge
The bioinformatics knowledge gap is the disparity between the amount of biological data being generated and the ability to analyse and interpret it. It is caused by the rapid generation of vast amounts of biological data, primarily due to advances in technology such as high-throughput sequencing. The gap is widened further by the lack of user-friendly tools for analysing and interpreting this data, which makes it difficult for non-experts to make sense of it. As a result the available data is underutilised: much of it remains unanalysed and its potential insights unexplored. Bridging the gap requires more accessible data analysis tools and training for researchers in data analysis and interpretation.

KDD Process
The KDD (Knowledge Discovery in Databases) process is a method for extracting useful knowledge from large volumes of data. It begins with the selection of relevant datasets, followed by preprocessing to clean and normalize the data. A data mining algorithm is then applied to discover patterns and interesting structures within the data. The resulting patterns must be interpreted and evaluated for their validity and usefulness. This process can help bridge the bioinformatics knowledge gap by turning raw data into interpretable knowledge.
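The selection and preprocessing steps of the KDD process can be sketched in a few lines of Python. This is only a minimal illustration: the patient records, the attribute names (`age`, `ldl`, `notes`) and the helper functions are hypothetical, and dropping incomplete records and min-max scaling are just one possible choice of cleaning and normalization.

```python
# Hypothetical patient records; attribute names and values are illustrative only.
records = [
    {"id": 1, "age": 40, "ldl": 100.0, "notes": "n/a"},
    {"id": 2, "age": 60, "ldl": 180.0, "notes": "n/a"},
    {"id": 3, "age": 50, "ldl": None, "notes": "n/a"},  # missing value
    {"id": 4, "age": 80, "ldl": 140.0, "notes": "n/a"},
]


def select(records, attributes):
    """Selection: keep only the attributes relevant to the analysis."""
    return [{a: r[a] for a in attributes} for r in records]


def clean(records):
    """Preprocessing (cleaning): drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def normalise(records):
    """Preprocessing (normalization): min-max scale each attribute to [0, 1]."""
    if not records:
        return []
    scaled = [dict(r) for r in records]
    for attribute in records[0]:
        values = [r[attribute] for r in records]
        lowest, highest = min(values), max(values)
        span = highest - lowest
        for r in scaled:
            # A constant attribute carries no information; map it to 0.0.
            r[attribute] = (r[attribute] - lowest) / span if span else 0.0
    return scaled


prepared = normalise(clean(select(records, ["age", "ldl"])))
for row in prepared:
    print(row)
```

The output of this stage would then be passed to a data mining algorithm, and the patterns it finds would still need to be interpreted and evaluated.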
Attribute Selection
- In theory: having more attributes results in more accurate patterns, never less
- In practice: having irrelevant attributes may “confuse” data mining algorithms
- Manual selection: using domain knowledge from experts
- Automatic selection: reduces the dimensionality by deleting unsuitable attributes based on heuristics

Attribute Construction
- The original representation of the search space might not be ideal
- A new representation (new attributes) makes regularities more apparent

Data Mining Tasks
- Descriptive: Clustering
- Predictive: Regression, Classification

Clustering
- Consists of finding a finite set of categories (clusters) to describe the data
- Examples are grouped into categories so that the similarity of examples within a cluster is maximised and the similarity of examples from different clusters is minimised

Association
- Consists of finding rules that represent patterns in the data, by identifying relationships (associations) between attributes
- Market Basket Analysis: items that are bought together are placed together
- Healthcare: identifying patients with demands for similar treatments and services

Clustering: Example
- Exploits the fact that features often go together
- Patients who have high LDL (low-density lipoprotein, “bad” cholesterol) and low HDL tend to have high BMI

Regression
- Consists of finding a model (function) that maps a given input (attributes, values) to a numeric prediction

Applications of Regression
- Forecasting: predicting economic growth based on market indicators
- Medical diagnosis: predicting the trajectory of recovery of a patient after injury or surgery

Regression in eHealth data mining can be used to predict health outcomes based on a variety of factors, including patient history, lifestyle and genetic data.
It can be used to identify trends and patterns in health data, contributing to the development of personalized medicine and treatment plans. Regression models can also be used to predict future healthcare needs, allowing for better planning and resource allocation. Challenges with using regression in eHealth data mining include dealing with noisy or incomplete data, ensuring patient privacy, and interpreting complex model outputs.

Classification
- Consists of finding a model that is able to predict the value of the class attribute of an example based on the values of a set of attributes
- Each record (example) belongs to a predefined class
- Each example consists of two parts: a set of predictor attributes and a class attribute
- Goal: discover a relationship which allows us to predict the class of an example, given its predictor attributes
- The relationship is discovered using a training set, where the class of each example is known; it is then used to predict the class of examples in the test set, whose class is unknown

Data Mining Classification Algorithms
- Artificial Neural Networks (ANNs)
- Support Vector Machines (SVMs)
- Decision Tree Induction
- Rule Induction

Artificial Neural Networks
- (Partially) inspired by the observation that biological learning systems are built of very complex webs of interconnected neurons
- Built out of a densely interconnected set of simple units
- Each unit takes a number of real-valued inputs (possibly the outputs of other units) and produces a single real-valued output (which may become the input to many other units)
- Well suited for problems in which the training data correspond to noisy, complex sensor data, as often recorded in healthcare contexts

Artificial Neural Networks (ANNs) are a powerful tool for data mining in eHealth, able to handle complex, multi-dimensional, and noisy data.
ANNs can learn patterns and relationships from training data, making them useful for predicting health outcomes and diagnosing diseases. They can handle both categorical and numerical data, both of which are common in healthcare datasets. ANNs are flexible and can adapt as new data becomes available, which suits the ever-evolving field of eHealth. However, they can be computationally intensive and may require significant resources to train and deploy effectively. ANNs are also hard to interpret because of their complexity and are often referred to as "black box" models. Efforts to improve their interpretability are ongoing.

Support Vector Machines
- Binary linear classifier
- Goal: find the maximum margin hyperplane, the plane that gives the greatest separation between classes

Support Vector Machines (SVMs) are a commonly used method in eHealth data mining, particularly for classification tasks. They work well with high-dimensional data, which is common in healthcare, where each patient can have hundreds of recorded features. By using the kernel trick, SVMs can separate classes even when the data is not linearly separable, and they are known for their robustness against overfitting. They can be used to predict disease occurrence or to classify patients based on their health records. However, like ANNs, SVMs can be computationally intensive, especially with large datasets, and may be difficult to interpret because of their mathematical complexity. Despite these challenges, their high accuracy makes them a valuable tool in eHealth data mining.
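To make the idea of a maximum margin hyperplane more concrete, the sketch below computes the margin of two candidate separating lines on a tiny, made-up 2D dataset and picks the one with the larger margin. It does not train an SVM: a real solver searches over all possible hyperplanes, whereas the two candidates here (and the data points) are assumptions chosen purely for illustration.

```python
import math

# Toy, linearly separable data (made up): (x1, x2) points with labels +1 / -1.
points = [
    ((1.0, 1.0), -1),
    ((2.0, 1.0), -1),
    ((4.0, 4.0), +1),
    ((5.0, 4.0), +1),
]


def margin(w, b, points):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    The result is positive only if every point lies on the correct side.
    """
    norm = math.hypot(*w)
    return min(y * (w[0] * x[0] + w[1] * x[1] + b) / norm for x, y in points)


# Two hand-picked candidate hyperplanes, each given as (w, b).
candidates = {
    "vertical line x1 = 3": ((1.0, 0.0), -3.0),
    "diagonal line x1 + x2 = 5.5": ((1.0, 1.0), -5.5),
}

for name, (w, b) in candidates.items():
    print(name, round(margin(w, b, points), 3))

# The preferred hyperplane is the one whose closest point is furthest away.
best = max(candidates, key=lambda name: margin(*candidates[name], points))
print("larger margin:", best)
```

Both lines separate the two classes, but the diagonal one leaves more room between the boundary and the closest examples, which is the property an SVM optimises for.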
Decision Tree Induction
- Comprehensible graphical representation of a classification model
- Internal nodes correspond to attribute tests (decision nodes)
- Leaf nodes correspond to the predicted class values
- To classify an example:
  - The tree is traversed top-down from the root node towards a leaf node
  - Branches are selected according to the outcome of the attribute tests represented by internal nodes, until a leaf node is reached
  - The class value associated with the leaf node is the class label predicted for the example

Evaluation and Interpretation
- Accuracy is the most common measure of quality
- Simply: the proportion of correct predictions
- But more useful measures exist: sensitivity and specificity

Measure of Test Performance
- Accuracy: the likelihood of a correct positive or negative prediction
  $\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
- True-Positive Rate (TPR) or Sensitivity: the likelihood that a diseased patient has a positive test
  Can be expressed as a conditional probability: $P(\text{PositiveTest} \mid \text{DiseaseIsPresent})$
- True-Negative Rate (TNR) or Specificity: the likelihood that a healthy patient has a negative test
  Can be expressed as a conditional probability: $P(\text{NegativeTest} \mid \text{DiseaseIsAbsent})$
- A desirable test will have:
  - High TPR (sensitivity or true-positive rate)
  - High TNR (specificity or true-negative rate)

Issues
- Most tests produce a continuous output, which then has to be interpreted as positive/negative
- Example: PCR test for HIV. The amount of the virus present in the blood can be measured
- The values of sensitivity and specificity depend on the particular cut-off value or threshold chosen to distinguish normal from abnormal results
- Tradeoff between increasing/decreasing the threshold:
  - Lowering the number of false positive tests increases the specificity but decreases the sensitivity, because the number of false negative tests increases
  - Lowering the number of false negative tests increases the sensitivity but decreases the specificity, because the number of false positive tests increases
- Cut-off value tradeoff: is it better to tolerate false negatives (missed cases) or false positives (healthy people classified as diseased)?
  - If the disease is serious and life-saving therapy is available, the test should minimise the number of false negative results, at the cost of increased false positives
  - If the disease is not serious and the therapy has risks, the test should minimise the number of false positive results

Evaluation and Interpretation
- “Black box” models: Artificial Neural Networks, Support Vector Machines. Models cannot be easily interpreted
- “White box” models: Decision Trees. Models can be easily interpreted
- Importance of comprehensibility (example):
  - The military trained an ANN to classify images of tanks into enemy vs friendly tanks
  - Accuracy of the ANN on the test set was very high
  - When deployed in the field (corresponding to “future data”), the accuracy rate was poor
  - Later, users noted that all photos of friendly tanks were taken on a sunny day and all photos of enemy tanks on an overcast day
  - The ANN had learned to discriminate between the colours of the sky
  - If the model had been comprehensible, such a trivial mistake would have been noticed immediately

Summary
- From Data to Knowledge
- Attribute/feature selection and construction
- Descriptive and predictive data mining
  - Neural networks
  - Support vector machines
  - Decision trees
- Evaluating and interpreting outputs
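The accuracy, sensitivity and specificity measures, and the cut-off tradeoff discussed under Measure of Test Performance, can be illustrated with a short sketch. The test scores and disease labels below are made up, the function names are hypothetical, and a score at or above the threshold is assumed to count as a positive test. The sketch also assumes that both diseased and healthy patients are present, otherwise the divisions would fail.

```python
def confusion_counts(scores, diseased, threshold):
    """Count TP, TN, FP, FN when a score >= threshold is read as a positive test."""
    tp = tn = fp = fn = 0
    for score, has_disease in zip(scores, diseased):
        positive = score >= threshold
        if positive and has_disease:
            tp += 1
        elif positive:
            fp += 1
        elif has_disease:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity (TPR) and specificity (TNR) from the four counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Made-up continuous test results and the true disease status of each patient.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]
diseased = [False, False, False, True, False, True, True, True]

for threshold in (0.35, 0.55):
    print(threshold, metrics(*confusion_counts(scores, diseased, threshold)))
```

With these made-up numbers, the lower threshold (0.35) catches every diseased patient (sensitivity 1.0) but misclassifies one healthy patient (specificity 0.75), while the higher threshold (0.55) does the opposite (sensitivity 0.75, specificity 1.0). Accuracy is 0.875 in both cases, which shows why accuracy alone can hide the tradeoff.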