Data Science for Engineers - MIT-WPU PDF
DATA SCIENCE
T. Y. (B.Tech. Bio Engineering)
Course: Data Science for Engineers
Course Credit: 3 (Theory-2hr, Lab-2hr)
Course Instructor: Dr. Ankita Agarwal

UNIT-III MACHINE LEARNING
1. Introduction to Machine Learning: Supervised and Unsupervised Learning
2. Splitting datasets: Training and Testing
3. Regression: Simple Linear Regression
4. Classification: Naïve Bayes classifier
5. Clustering: K-means
6. Evaluating model performance, Python libraries for ML

INTRODUCTION: MACHINE LEARNING
Artificial Intelligence (AI) makes a computer act/think like a human.
Data science is an AI subset that deals with data methods, scientific analysis, and statistics, all used to gain insight and meaning from data.
Machine learning (ML) is a subset of AI that teaches computers to learn things from provided data.

INTRODUCTION: MACHINE LEARNING
“Machine Learning allows machines to learn and make predictions based on their experience (data).”
Definition by Tom Mitchell (1998): Machine Learning is the study of algorithms that improve their performance P at some task T with experience E. A well-defined learning task is given by <P, T, E>.

Defining the Learning Task: Improve on task T, with respect to performance metric P, based on experience E

Q. Define the learning task for automated handwritten word recognition.
T: Recognizing handwritten words
P: Percentage of words correctly classified
E: Database of human-labeled images of handwritten words

Q. Define a learning task for an automated spam filter.
T: Categorizing email messages as spam or legitimate
P: Percentage of email messages correctly classified
E: Database of emails, some with human-given labels

A Few Applications of Machine Learning
Recognizing patterns:
– Handwritten or spoken words
– Medical images
Generating patterns:
– Generating images or motion sequences
Recognizing anomalies:
– Unusual credit card transactions
– Unusual patterns of sensor readings in a nuclear power plant
Prediction:
– Future stock prices or currency exchange rates
Personalized medicine:
– Predicting suitable medicines from an individual’s genetic profile

MACHINE LEARNING - TYPES
Supervised: Supervised learning is when we train the machine using well-labelled data, which means each training example is already tagged with the correct answer. The supervised learning algorithm analyses this training data (the set of training examples). The machine is then given a new set of examples, for which the trained model should produce the correct outcome (label).
Unsupervised: Unsupervised learning is the training of a machine using information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance. Here the task of the machine is to group unsorted information according to similarities, patterns, and differences, without any prior training on labeled data.

SUPERVISED VS UNSUPERVISED ML

SUPERVISED MACHINE LEARNING - TYPES
Supervised learning is classified into two categories of algorithms:
– Classification: A classification problem is when the output variable is a category, such as “red” or “blue”, “disease” or “no disease”, “active site” or “non-active site”.
– Regression: A regression problem is when the output variable is a real value, such as “dollars” or “weight”. Example: predicting the catalytic activity of an enzyme.
Supervised learning deals with “labeled” data. This implies that some data is already tagged with the correct answer.
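The difference between the two categories can be sketched with a toy predictor. The sketch below uses a one-nearest-neighbour rule chosen only for brevity. It is not a method introduced on these slides at this point, and all data values and the helper name `nearest_neighbor_predict` are hypothetical. It illustrates that the same "learn from labeled pairs, then predict for a new x" setup yields classification when the labels are categories and regression when they are real numbers.

```python
# Toy labeled data (hypothetical values, for illustration only).
# Each training example is a (feature, label) pair.
classification_data = [(1.0, "no disease"), (1.5, "no disease"),
                       (3.0, "disease"), (3.5, "disease")]
regression_data = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0), (4.0, 40.0)]


def nearest_neighbor_predict(training_data, x_new):
    """Return the label of the training example whose feature is closest to x_new."""
    _, nearest_label = min(training_data, key=lambda pair: abs(pair[0] - x_new))
    return nearest_label


# Categorical output: a classification problem.
predicted_class = nearest_neighbor_predict(classification_data, 3.2)

# Real-valued output: a regression problem.
predicted_value = nearest_neighbor_predict(regression_data, 2.9)

print(predicted_class)
print(predicted_value)
```

In both calls the algorithm is the same; only the type of the label changes, and that is what separates classification from regression.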
Types:
– Regression: Logistic Regression (despite the name, typically used for classification)
– Classification: Binary Classification, Multi-class Classification
– Naive Bayes Classifiers
– K-NN (k nearest neighbors) Classifiers
– Decision Trees (Random Forest, Gradient Boosting, AdaBoost)
– Support Vector Machine (SVM)

Supervised Learning: Regression
Given (x1, y1), (x2, y2), ..., (xn, yn)
Learn a function f(x) to predict y given x
– y is real-valued == regression

Supervised Learning: Classification
Given (x1, y1), (x2, y2), ..., (xn, yn)
Learn a function f(x) to predict y given x
– y is categorical == classification

Advantages of Supervised Learning:
– Helps to optimize performance criteria with the help of prior experience.
– Helps to solve various types of real-world computation problems.
– Allows estimating or mapping the result to a new sample.
– We have complete control over choosing the number of classes we want in the training data.

Disadvantages of Supervised Learning:
– Classifying big data can be challenging.
– Training needs a lot of computational time.
– It requires a properly labelled data set.

UNSUPERVISED MACHINE LEARNING - TYPES
Unsupervised learning is classified into two categories of algorithms:
– Clustering: A clustering problem is where you want to discover the inherent groupings in the data (dividing by similarity). Example: grouping customers by their purchasing behaviour for targeted marketing.
– Association: An association rule learning problem is where you want to discover rules that describe large portions of your data (identifying patterns).
Example: people that buy X also tend to buy Y; used for purchase recommendations.

Unsupervised Learning
Given x1, x2, ..., xn (without labels)
Output hidden structure behind the x’s
– E.g., clustering (grouping)

UNSUPERVISED MACHINE LEARNING - TYPES
Clustering types:
– Exclusive (each object is part of one subset only)
– Overlapping (an object can belong to one or more subsets)
– Agglomerative (set of nested clusters)
– Probabilistic (model based on a probability distribution function)

UNSUPERVISED ML - TYPES
Types:
– Hierarchical clustering (agglomerative clustering)
– K-means clustering (partitioning-based method)
– Principal Component Analysis (PCA) (dimensionality reduction)
– Singular Value Decomposition (SVD) (dimensionality reduction)

Advantages of unsupervised learning:
– Dimensionality reduction can be easily accomplished.
– Capable of finding previously unknown patterns in data.
– Flexibility: Unsupervised learning can be applied to a wide variety of problems, including clustering, anomaly detection, and association rule mining.
– Exploration: Unsupervised learning allows for the exploration of data and the discovery of novel and potentially useful patterns that may not be apparent from the outset.
– Low cost: Unsupervised learning is often more time-efficient than supervised learning because it doesn’t require labeled data, which can be time-consuming to produce.

Disadvantages of unsupervised learning:
– Difficult to measure accuracy due to the lack of predefined answers during training.
– The results often have lower accuracy.
– Lack of guidance: Unsupervised learning lacks the guidance and feedback provided by labeled data, which can make it difficult to know whether the discovered patterns are relevant or useful.
– Sensitivity to data quality: Unsupervised learning can be sensitive to data quality, including missing values, outliers, and noisy data.
– Scalability: Unsupervised learning can be computationally complex, particularly for large datasets or complex algorithms, which can limit its scalability.

SUPERVISED VS UNSUPERVISED ML
Parameters | Supervised machine learning | Unsupervised machine learning
Input data | Algorithms are trained using labeled data. | Algorithms are used against data that is not labeled.
Computational complexity | Simpler method | Computationally complex
Accuracy | Highly accurate | Less accurate
No. of classes | No. of classes is known | No. of classes is not known
Algorithms used | Linear and logistic regression, classification methods like random forest, support vector machine, neural network, etc. | K-means clustering, hierarchical clustering, etc.
Output | Desired output is given to the machine. | Desired output is not given.
Training data | Uses labeled training data to infer the model. | No labeled training data is used.
Model | We can test our model against known labels. | We cannot test our model against known labels.
Example | Optical character recognition | Find a face in an image

Basic Steps of Machine Learning
1. Understand the problem and goals
2. Collect prior knowledge (data) of the domain
3. Data integration, selection, cleaning, pre-processing, etc.
4. Split the data into training and testing sets
5. Learn (train) the models
6. Interpret results (accuracy of models)
7. Consolidate and deploy discovered knowledge (predict on new data)

When a dataset is large enough, it is good practice to split it into training and test sets; the former is used for training the model and the latter for testing its performance.

    >>> from sklearn.model_selection import train_test_split
    >>> # X: feature matrix, Y: labels; hold out 25% of the samples as the test set
    >>> X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=1)

MACHINE LEARNING ALGORITHMS
Linear Regression
Regression analysis can be defined as the process of developing a mathematical model that can be used to predict one variable from another variable.
Linear regression is a supervised machine learning (statistical) technique in which we use a linear model to describe the relationship between a dependent variable and one or more independent variables.
– If only one independent variable (x) is analysed, the technique is called simple linear regression.
– If multiple independent variables (x1, x2, …) are analysed, it is called multiple linear regression.
– If multiple dependent variables (y1, y2, …) are predicted, it is called multivariate linear regression.

Properties of the linear regression line
– The regression line always passes through the point given by the mean of the independent variable (x) and the mean of the dependent variable (y).
– The regression line minimizes the sum of the squares of the residuals. The residuals are the differences between the actual and estimated function values on the training examples.

Linear Regression: Example
Experience | Salary
12 | 87000
35 | 187000
23 | 117000
13 | 89000
18 | 110000
[Figure: scatter plots of Salary against Experience]

We can solve this using linear regression:
y = b0 + b1 * x
– y = dependent variable (Salary)
– x = independent variable (Experience)
– b0 = constant term
– b1 = coefficient of the relationship between x and y (b1 explains the change in y with a change in x by one unit; in other words, if we increase the value of x by one unit, b1 is the resulting change in the value of y)

Example
If a student scored 80 in the Maths test, what grade would we expect her to make in Statistics? How well does the regression equation fit the data?
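One way to work through both questions is sketched below. It assumes that ordinary least squares is the intended fitting method and that the coefficient of determination R² is an acceptable measure of fit; neither is stated explicitly on the slide. The data are the five (Maths, Stat) mark pairs of this example; the variable names are illustrative.

```python
# Marks of the five students in this example (Maths = x, Statistics = y).
maths = [95, 85, 80, 70, 60]
stats = [85, 95, 70, 65, 70]

n = len(maths)
mean_x = sum(maths) / n
mean_y = sum(stats) / n

# Sums of squares and cross-products around the means.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(maths, stats))
s_xx = sum((x - mean_x) ** 2 for x in maths)
s_yy = sum((y - mean_y) ** 2 for y in stats)

# Least-squares estimates for the line y = b0 + b1 * x.
b1 = s_xy / s_xx
b0 = mean_y - b1 * mean_x

# Expected Statistics mark for a Maths mark of 80.
predicted_stat = b0 + b1 * 80

# For simple linear regression, R^2 equals the squared correlation of x and y.
r_squared = s_xy ** 2 / (s_xx * s_yy)

print(round(b1, 3), round(b0, 3), round(predicted_stat, 2), round(r_squared, 2))
# 0.644 26.781 78.29 0.48
```

The slope matches the slide's b1 = 0.644. The intercept comes out as 26.781 at full precision; the slide's b0 = 26.768 is what results when b1 is first rounded to 0.644 (77 - 0.644 * 78). Under these assumptions the expected Statistics mark is about 78, and an R² of roughly 0.48 means the line explains about half of the variation in the Statistics marks.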
RollNo | Maths Marks | Stat Marks
1 | 95 | 85
2 | 85 | 95
3 | 80 | 70
4 | 70 | 65
5 | 60 | 70
b1 = 0.644
b0 = 26.768

Classification Process
Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
– The set of tuples used for model construction is the training set.
– The model is represented as classification rules, decision trees, or mathematical formulae.
Model usage: for classifying future or unknown objects
– Estimate the accuracy of the model:
  – The known label of each test sample is compared with the classified result from the model.
  – Accuracy rate is the percentage of test set samples that are correctly classified by the model.
  – The test set must be independent of the training set; otherwise the accuracy estimate is over-optimistic and over-fitting goes undetected.
– If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.

Step 1: Model Construction
Training Data -> Classification Algorithms -> Classifier (Model)
Example of a learned rule: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’

Step 2: Model Usage
The classifier is evaluated on the testing data and then applied to new data.
New Data: (Jeff, Professor, 4). Tenured?

Bayesian Classification
– A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities.
– Foundation: based on Bayes’ theorem.
– Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable to decision tree and selected neural network classifiers.

Bayesian Theorem
– Let X be a data sample (“evidence”) whose class label is unknown.
– Let H be the hypothesis that X belongs to class C.
– Classification is to determine P(H|X) (posterior probability), the probability that the hypothesis holds given the observed data sample X.
– P(H) (prior probability): the initial probability. E.g., X will buy a computer, regardless of age, income, …
– P(X): the probability that the sample data is observed.
– P(X|H) (likelihood): the probability of observing the sample X, given that the hypothesis holds. E.g., given that X will buy a computer, the probability that X is 31..40 with medium income.
– Bayes’ theorem relates these quantities: P(H|X) = P(X|H) P(H) / P(X)

Naïve Bayes Theorem
– Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-dimensional attribute vector X = (x1, x2, …, xn).
– Suppose there are k classes C1, C2, …, Ck.
– Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X).
– This can be derived from Bayes’ theorem: P(Ci|X) = P(X|Ci) P(Ci) / P(X)
– Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized.
– The classifier predicts that X belongs to Ci if the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes.

Bayesian Classification
Training Data
Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’
Data sample X = (age, income, student, credit_rating)

Bayesian Classification - Example
Test for X = (age