Machine Learning PDF
Document Details
Christ Deemed to be University
Summary
This document is a set of notes on machine learning, detailing different types of machine learning, algorithms for each type, and their applications in areas like marketing, biology, and finance.
Full Transcript
What is Machine Learning?
Machine Learning is a field of study that gives computers the ability to learn without being explicitly programmed [Arthur Samuel, 1959].
Machine Learning is the process by which a computer can work more accurately as it collects and learns from the data it is given [Mike Roberts].
○ Example: As a user writes more text messages on a phone, the phone learns more about the messages and common vocabulary, and can predict (autocomplete) their words faster and more accurately.
Machine Learning is a subfield of artificial intelligence and is closely related to applied mathematics and statistics.

Applications of machine learning in data science
- Finding place names or persons in text (Classification)
- Identifying people based on pictures or voice recordings (Classification)
- Proactively identifying car parts that are likely to fail (Regression)
- Identifying tumors and diseases (Classification)
- Predicting the number of eruptions of a volcano in a period (Regression)

Modeling Process [in the phase of data modelling]
1 Feature engineering and model selection
2 Training the model
3 Model validation and selection
4 Applying the trained model to unseen data
Multiple techniques can be chained or combined:
○ Chain: the output of the first model becomes the input for the second model.
○ Combine: train the models independently and combine their results (ensemble learning).
A model consists of constructs of information called features or predictors and a target or response variable; the model's goal is to predict the target variable.
Example: tomorrow's high temperature. The variables that help to predict it are usually today's temperature, cloud movements, current wind speed and so on.
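As a minimal sketch of features and a target for this example (assuming pandas is available; the column names and values are illustrative, not taken from the notes):

```python
import pandas as pd

# Each row is one observation; column names are assumed for illustration only.
data = pd.DataFrame({
    "today_temp_c":    [21.0, 19.5, 23.2, 18.4],  # today's temperature
    "wind_speed_kmh":  [12.0,  8.5, 15.0, 20.3],  # current wind speed
    "cloud_cover_pct": [40,   75,   10,   90],    # proxy for cloud movements
    "tomorrow_high_c": [22.5, 20.1, 24.8, 17.9],  # target / response variable
})

X = data.drop(columns=["tomorrow_high_c"])  # features (predictors)
y = data["tomorrow_high_c"]                 # target the model should predict
```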
Engineering features and selecting a model
With feature engineering we must come up with and create possible predictors for the model. Features can be:
○ Taken directly from the data set
○ Scattered among different data sets
○ Created by applying a transformation that combines multiple inputs
  Example: interaction variables. The impact of either single variable is low, but if both are present their impact becomes immense. This is especially true in chemical and medical environments. For instance, although vinegar and bleach are fairly harmless by themselves, mixing them results in poisonous chlorine gas [a gas that killed thousands during World War I].
○ Derived features [the output of one model becomes part of another model]
○ Availability bias

Training the model
○ When the initial features are created, a model can be trained on the data.
○ Present the model with data from which it can learn.

Validating a model
○ A good model has two properties:
  - good predictive power
  - it generalizes well to data it hasn't seen
○ Two common error measures in machine learning:
  - Classification error rate, for classification problems: the percentage of observations in the data set that the model mislabeled; lower is better.
  - Mean squared error, for regression problems: the average of the squared prediction errors. Squaring the errors has two consequences:
    ○ You can't cancel out a wrong prediction in one direction with a faulty prediction in the other direction.
    ○ Bigger errors get even more weight than they otherwise would: small errors remain small whereas big errors are enlarged and definitely draw your attention.

Validation strategies
- Hold-out: divide your data into a training set with X% of the observations and keep the rest as a hold-out data set (a data set that is never used for model creation).
- K-fold cross-validation: divide the data set into k parts and use each part once as a test data set while using the others as a training data set. Its advantage is that all the data available in the data set is used.
- Leave-one-out: the same approach as k-fold, but with k equal to the number of observations: one observation is always left out and the model is trained on the rest of the data. It is used only on small data sets.

Predicting new observations
The process of applying the model to new data is called model scoring.
○ Prepare a data set that has features exactly as defined by the model.
○ Apply the model to the new data set; this results in a prediction.

Process (1): Model Construction
Classification algorithms learn a classifier (model) from the training data, for example:
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Process (2): Using the Model in Prediction
The classifier is evaluated on testing data and then applied to unseen data, for example: (Jeff, Professor, 4) → Tenured?

Types of Machine Learning

Supervised Machine Learning
Supervised learning is the type of machine learning in which machines are trained using well "labelled" training data, and on the basis of that data, machines predict the output. Labelled data means some input data is already tagged with the correct output. In supervised learning, the training data provided to the machines works as the supervisor that teaches the machines to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher. Supervised learning is a process of providing input data as well as correct output data to the machine learning model. The aim of a supervised learning algorithm is to find a mapping function to map the input variable (x) to the output variable (y).
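A minimal sketch of this training-and-validation loop (assuming scikit-learn is available; the dataset and the choice of classifier are illustrative, not prescribed by the notes):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Labelled data: X holds the input features, y the correct outputs.
X, y = load_iris(return_X_y=True)

# Hold-out validation: keep 30% of the observations out of model creation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Train (fit) the model on the labelled training data.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Model scoring: apply the trained model to data it has not seen.
predictions = model.predict(X_test)

# Classification error rate = fraction of mislabeled observations; lower is better.
error_rate = 1 - accuracy_score(y_test, predictions)
print(f"Classification error rate: {error_rate:.3f}")
```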
How Supervised Learning Works
Suppose we have a dataset of different types of shapes, which includes squares, rectangles, triangles, and polygons. The first step is to train the model for each shape:
- If the given shape has four sides, and all the sides are equal, it will be labelled as a square.
- If the given shape has three sides, it will be labelled as a triangle.
- If the given shape has six equal sides, it will be labelled as a hexagon.
After training, we test our model using the test set, and the task of the model is to identify the shape. The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of its number of sides and predicts the output.

Steps Involved in Supervised Learning
- First determine the type of training dataset.
- Collect/gather the labelled training data.
- Split the dataset into a training dataset and a test dataset.
- Determine the input features of the training dataset, which should carry enough information for the model to accurately predict the output.
- Determine the suitable algorithm for the model, such as a support vector machine, decision tree, etc.
- Execute the algorithm on the training dataset.
- Evaluate the accuracy of the model by providing the test set; if the model predicts the correct output, the model is accurate.

Types of Supervised Learning
- Regression algorithms are used if there is a relationship between the input variable and the output variable. They are used for the prediction of continuous variables, such as weather forecasting, market trends, etc.
- Classification algorithms are used when the output variable is categorical, which means there are two classes, such as Yes/No, Male/Female, True/False, etc.

Regression Algorithms
○ Linear Regression [Linear Regression (Python Implementation) - GeeksforGeeks]
○ Regression Trees
○ Non-Linear Regression
○ Bayesian Linear Regression
○ Polynomial Regression
Classification Algorithms
○ Random Forest
○ Decision Trees
○ Logistic Regression
○ Support Vector Machines

Discrete vs. continuous variables
- Discrete variables represent counts (e.g., the number of objects in a collection). A statistical variable that assumes a finite set of data and a countable number of values is called a discrete variable. A discrete variable assumes independent values and can be graphically represented by isolated points. Examples:
  ○ Number of printing mistakes in a book
  ○ Number of road accidents in New Delhi
  ○ Number of siblings of an individual
- Continuous variables represent measurable amounts (e.g., water volume or weight). A quantitative variable which takes on an infinite set of data and an uncountable number of values is known as a continuous variable. A continuous variable assumes any value in a given range or continuum and can be indicated on a graph with the help of connected points. Examples:
  ○ Height of a person
  ○ Age of a person
  ○ Profit earned by the company

Unsupervised Learning
Unsupervised learning is the training of a machine using information that is neither classified nor labeled, allowing the algorithm to act on that information without guidance. Here the task of the machine is to group unsorted information according to similarities, patterns, and differences without any prior training on the data.
Unlike supervised learning, no teacher is provided, which means no training will be given to the machine. Therefore the machine is restricted to finding the hidden structure in unlabeled data by itself. For instance, suppose it is given a set of images containing both dogs and cats which it has never seen before. The machine has no idea about the features of dogs and cats, so it cannot categorize the images as "dogs" and "cats". But it can categorize them according to their similarities, patterns, and differences: the pictures can easily be split into two parts, the first containing all pictures with dogs in them and the second containing all pictures with cats in them. Nothing was learned beforehand, which means there is no training data or examples. Unsupervised learning allows the model to work on its own to discover patterns and information that were previously undetected. It mainly deals with unlabelled data.

Clustering
Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other and dissimilar to the data points in other groups. It is basically a grouping of objects on the basis of similarity and dissimilarity between them. For example, data points that lie close together in a scatter plot can be classified into one single group; in the slide's figure, three such clusters can be distinguished. (A short code sketch follows the list of applications below.)

Applications of clustering in different fields:
- Marketing: It can be used to characterize and discover customer segments for marketing purposes.
- Biology: It can be used for classification among different species of plants and animals.
- Libraries: It is used for clustering different books on the basis of topics and information.
- Insurance: It is used to acknowledge the customers and their policies and to identify frauds.
- City planning: It is used to make groups of houses and to study their values based on their geographical locations and other factors.
- Earthquake studies: By learning the earthquake-affected areas we can determine the dangerous zones.
- Image processing: Clustering can be used to group similar images together, classify images based on content, and identify patterns in image data.
- Genetics: Clustering is used to group genes that have similar expression patterns and to identify gene networks that work together in biological processes.
- Finance: Clustering is used to identify market segments based on customer behavior, identify patterns in stock market data, and analyze risk in investment portfolios.
- Customer service: Clustering is used to group customer inquiries and complaints into categories, identify common issues, and develop targeted solutions.
- Manufacturing: Clustering is used to group similar products together, optimize production processes, and identify defects in manufacturing processes.
- Medical diagnosis: Clustering is used to group patients with similar symptoms or diseases, which helps in making accurate diagnoses and identifying effective treatments.
- Fraud detection: Clustering is used to identify suspicious patterns or anomalies in financial transactions, which can help in detecting fraud or other financial crimes.
- Traffic analysis: Clustering is used to group similar patterns of traffic data, such as peak hours, routes, and speeds, which can help in improving transportation planning and infrastructure.
- Social network analysis: Clustering is used to identify communities or groups within social networks, which can help in understanding social behavior, influence, and trends.
- Cybersecurity: Clustering is used to group similar patterns of network traffic or system behavior, which can help in detecting and preventing cyberattacks.
- Climate analysis: Clustering is used to group similar patterns of climate data, such as temperature, precipitation, and wind, which can help in understanding climate change and its impact on the environment.
- Sports analysis: Clustering is used to group similar patterns of player or team performance data, which can help in analyzing player or team strengths and weaknesses and making strategic decisions.
- Crime analysis: Clustering is used to group similar patterns of crime data, such as location, time, and type, which can help in identifying crime hotspots, predicting future crime trends, and improving crime prevention strategies.
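A minimal clustering sketch (assuming scikit-learn is available; the data is synthetic and the choice of k-means is illustrative, not prescribed by the notes). No labels are given to the algorithm; it groups the points purely by similarity:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Unlabeled data: only feature values, no target is passed to the algorithm.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Group the points into 3 clusters based on similarity of their features.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(labels[:10])              # cluster index assigned to the first 10 points
print(kmeans.cluster_centers_)  # coordinates of the discovered group centers
```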
Association Rules
An association rule is a rule-based method for finding relationships between variables in a given dataset. These methods are frequently used for market basket analysis, allowing companies to better understand relationships between different products. Understanding the consumption habits of customers enables businesses to develop better cross-selling strategies and recommendation engines. Examples of this can be seen in Amazon's "Customers Who Bought This Item Also Bought" recommendations or Spotify's "Discover Weekly" playlist. While there are a few different algorithms used to generate association rules, such as Apriori, Eclat, and FP-Growth, the Apriori algorithm is the most widely used.

Semi-Supervised Learning
Semi-supervised learning is a type of machine learning that falls in between supervised and unsupervised learning. It is a method that uses a small amount of labeled data and a large amount of unlabeled data to train a model. The goal of semi-supervised learning is to learn a function that can accurately predict the output variable based on the input variables, similar to supervised learning. However, unlike supervised learning, the algorithm is trained on a dataset that contains both labeled and unlabeled data. Semi-supervised learning is particularly useful when there is a large amount of unlabeled data available, but it is too expensive or difficult to label all of it.

Intuitively, one may imagine the three types of learning algorithms as follows:
- Supervised learning: a student is under the supervision of a teacher at both home and school.
- Unsupervised learning: a student has to figure out a concept by himself.
- Semi-supervised learning: a teacher teaches a few concepts in class and gives questions as homework which are based on similar concepts.
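As a minimal sketch of the semi-supervised idea (assuming scikit-learn is available; the dataset and the self-training approach are illustrative assumptions, not taken from the notes), most labels are hidden and the classifier learns from the few labeled points plus the unlabeled ones:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = load_iris(return_X_y=True)

# Keep labels for only ~10% of the data; mark the rest as unlabeled (-1).
rng = np.random.RandomState(0)
y_partial = np.where(rng.rand(len(y)) < 0.1, y, -1)

# Self-training: fit on the few labeled points, then iteratively add
# confident predictions on the unlabeled points as pseudo-labels.
base = LogisticRegression(max_iter=1000)
model = SelfTrainingClassifier(base).fit(X, y_partial)

print(model.score(X, y))  # accuracy measured against the full (hidden) labels
```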