Lecture 7: An Introduction to the World of Machine Learning
Indian Institute of Technology Bombay
Keerthana Vinod Kumar
Summary
This document is a lecture on machine learning, covering topics such as Linear Algebra, Vectors, and Matrices.
Full Transcript
Lecture 7
Email: [email protected]
Phone no: 9400543249
An Introduction to the World of Machine Learning
Instructor: Keerthana Vinod Kumar, PMRF Scholar, Koita Centre for Digital Health, Indian Institute of Technology Bombay

ML WORKFLOW

Module 4: Basic Math for ML: Fundamentals of Linear Algebra, Calculus, Statistics, Probability

Why Maths?
Statistical measures.
All computations happen in matrix format. For instance, a computer sees an image as a 2D or 3D matrix.
Understanding the mathematical representation of data allows ML models to process and learn from it effectively.
Model evaluation: metrics like precision, recall, and error involve math.
Areas: Linear Algebra, Statistics, Probability, Calculus.

Linear Algebra: Vectors
A vector is a quantity that has two independent properties: magnitude and direction.
Examples: velocity, increase/decrease in temperature, etc.

1) Physics-based approach
Speed = 80 km/h: magnitude, but no direction (scalar quantity).
Velocity = 80 km/h towards north: magnitude and direction (vector quantity).

Linear Algebra: Vectors
2) Mathematical approach
For a vector from $(x_1, y_1)$ to $(x_2, y_2)$, the magnitude is $|\vec{v}| = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$.
If the starting point is at $(x, y)$ and the endpoint is at the origin, the formula becomes $|\vec{v}| = \sqrt{x^2 + y^2}$.

Direction of a vector: $\tan\theta = P / B$, where P is the perpendicular and B is the base of the triangle the vector forms with the axis.
Vector 1 = (3, 3) = 3i + 3j: $\tan\theta = 3/3 = 1$, so $\theta = 45°$.
Vector at (-3, 2): $\tan\theta = 3/2 = 1.5$, so $\theta \approx 56.3°$.

Linear Algebra: Vectors
3) Computational approach
Scalar (int/float): a single value, e.g. 12th GPA 8.0 or 8.4, UG % 94% or 92%, salary 60000 or 80000.
Vector (int/float): e.g. a row vector [3 -4 8] or a column vector with entries 5, 4, 9.

Linear Algebra: Vectors
Addition of Vectors [figure: worked example]

Linear Algebra: Vectors
Subtraction of Vectors [figure: worked example]

Linear Algebra: Matrix
Scalar (int/float): a single value.
Vector (int/float): either one row or one column, e.g. row [3 -4 8], column with entries 5, 4, 9.
Matrix: multiple rows and/or multiple columns, e.g.
 5   8  10
 4  -8   3
 9   4   6

Linear Algebra: Matrix
Shape of Matrix
The shape is written as (number of rows) X (number of columns).
3 X 3 Matrix:
 5   8  10
 4  -8   3
 9   4   6
2 X 2 Matrix:
 5   8
 4  -8
3 X 2 Matrix:
 5  10
 4  -8
 9   4

Linear Algebra: Matrix
Types of Matrix [figure]

Linear Algebra: Matrix
Transpose of Matrix
The transpose of a matrix is found by interchanging its rows and columns. It is denoted by the letter "T" in the superscript of the given matrix. For example, if A is the given matrix, its transpose is written A' or $A^T$.

Linear Algebra: Matrix
Dataset for Housing: a 4 X 12 matrix [figure]

Linear Algebra: Matrix operations
Addition of Matrix
Two matrices can be added only if they have the same shape, i.e., both have the same number of rows and columns.

Linear Algebra: Matrix operations
Subtraction of Matrix
Two matrices can be subtracted only if they have the same shape, i.e., both have the same number of rows and columns.

Linear Algebra: Matrix operations
Multiplication by scalar [figure]

Linear Algebra: Matrix operations
Multiplying 2 matrices
Check the shapes first: two matrices can be multiplied only if the number of columns of the first equals the number of rows of the second. Otherwise they cannot be multiplied.

Linear Algebra: Matrix operations
Multiplying 2 matrices
Example: a 2 X 2 matrix multiplied by a 2 X 2 matrix [figure]
https://colab.research.google.com/drive/1N3CoTFLROg2kA47WuI82nrMGctK7gaOF?usp=sharing

Statistics
What is Statistics?
Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data.
Statistics helps us understand data better.
Positive correlation or negative correlation? Example: the more it rains, the less you need to water the garden.

Statistics
Role of ML in Statistics?
Train an ML model with lots of data --> the model learns patterns/relations in the data --> it predicts accordingly.
Data is an integral part of machine learning.
Example: train a model on records of people with and without diabetes ---> the model finds relations between the factors and uses them for future predictions and better analysis.

Lecture 6

Recap
"Dog", "Cat", "Cow" [Type of data?]
"Excellent", "Good", "Bad" [Type of data?]
Types of categorical data
One hot Encoding
Data standardization

ML WORKFLOW

Data standardization
Data standardization comes into the picture when features of the input data set have large differences between their ranges, or simply when they are measured in different units (e.g., pounds, meters, miles, etc.).
Whatever distance-based model you apply to such a data set, a large-valued feature such as salary will dominate smaller-valued features such as age/purchased and contribute more to the distance computation, simply because its values are bigger.
To prevent this, transform the features to comparable scales using standardization.
https://colab.research.google.com/drive/1C7je6LX9h4kDPSbjbOKbAYvBcPVi91ZM?usp=sharing

Exercise
1. Read placement_subset.csv
2. Do some exploratory analysis [head(), tail(), info(), unique values, NA values, shape of data, etc.]
3. Use value_counts() to find the counts in the column 'Placement Status'
4. Data cleaning: check for missing values
5. Encode features and label

Train-Test split
The training set is used for training the model, and the testing set is used to test your model.
This allows you to train your models on the training set, and then test their accuracy on the unseen testing set.
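A minimal sketch of the split described above, assuming scikit-learn and NumPy are installed. The toy arrays X and y and the helper split_shapes are hypothetical and made up for illustration; they do not come from the lecture notebook. The sketch only shows how the test_size argument changes the number of rows that end up in the training and testing sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each, and binary labels
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

def split_shapes(test_size):
    """Split X and y and return the shapes of the four resulting arrays."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=31
    )
    return X_train.shape, X_test.shape, y_train.shape, y_test.shape

shapes_80_20 = split_shapes(0.2)  # 80% train, 20% test
shapes_70_30 = split_shapes(0.3)  # 70% train, 30% test

print("80:20 split:", shapes_80_20)
print("70:30 split:", shapes_70_30)
```

Fixing random_state makes the shuffle reproducible, so repeated runs put the same rows in each set.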
Train-Test split
from sklearn.model_selection import train_test_split
# X holds the features and y the labels; 20% of the rows are held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=31)

Exercise
1. Read exercise.csv
2. Do some exploratory analysis [head(), tail(), info(), unique values, NA values, shape of data, etc.]
3. Data cleaning: check for missing values and either remove them or impute mean values
4. Encode features and label
5. Split data into train and test: explore the option of a 70:30 split and check the difference in X_train and y_train
If you have any doubts, let me know.
To submit, click here: Assignment 3

Lecture 5

Recap
Outliers
Data problems:
Data quality
Duplicate data
Missing values
Data imbalance
Data bias

Encoding for categorical data
Categorical data represents specific categories or groups.
It is non-numerical and consists of labels or qualitative values rather than numerical values.
Categorical variables are an essential part of data analysis, but they cannot be directly processed by machine learning models.
To fit this data into a machine learning model, it needs to be converted into numerical data.
In machine learning, categorical data is typically represented using the "object" or "string" data type.
Examples
Gender: categorical variable with categories such as "Male" and "Female".
Marital Status: categorical variable with categories such as "Married" and "Single".
Occupation: categorical variable with categories such as "Teacher", "Engineer", "Doctor", etc.
These labels have no specific order of preference. Because they are string labels, they must be converted to numbers, and a careless conversion can make a machine learning model misinterpret them as having some sort of hierarchy.
One approach to solve this problem is label encoding, where we assign a numerical value to each label, for example Male and Female mapped to 0 and 1. But this can add bias to our model: it may start giving higher preference to Female because 1 > 0, whereas ideally both labels are equally important in the dataset. To deal with this issue we use the One Hot Encoding technique.

Encoding for categorical data
Nominal variables represent categories without any specific order or ranking between them. The categories are simply distinct groups.
Ordinal variables represent categories that have a natural order or ranking between them. The categories can be ranked based on some criteria or scale.

Encoding for categorical data
Common approaches for converting ordinal and nominal categorical variables to numerical values:
Ordinal Encoding: for ordinal data
One-Hot Encoding: for nominal (unordered categorical) data

1. Ordinal Encoding
Each unique category value is assigned an integer value. For example, "Excellent" is 1, "Good" is 2, and "Bad" is 3.

2. One hot Encoding
For categorical variables where no ordinal relationship exists.
We do not want the model to assume a natural ordering between categories.

One Hot Encoding Examples
One Hot Encoding creates a separate column for each label, here Male and Female. Wherever there is a Male, the value is 1 in the Male column and 0 in the Female column, and vice versa.
Let's understand with an example: consider the data where fruits, their corresponding categorical values, and prices are given.
Fruit   | Categorical value of fruit | Price
apple   | 1                          | 5
mango   | 2                          | 10
apple   | 1                          | 15
orange  | 3                          | 20

The output after applying one-hot encoding on the data is given as follows:

apple | mango | orange | price
1     | 0     | 0      | 5
0     | 1     | 0      | 10
1     | 0     | 0      | 15
0     | 0     | 1      | 20

Notebook: Lecture_5.ipynb (Colab)

ML WORKFLOW

Data standardization
Data standardization comes into the picture when features of the input data set have large differences between their ranges, or simply when they are measured in different units (e.g., pounds, meters, miles, etc.).
Whatever distance-based model you apply to such a data set, a large-valued feature such as salary will dominate smaller-valued features such as age/purchased and contribute more to the distance computation, simply because its values are bigger.
To prevent this, transform the features to comparable scales using standardization.

Data standardization
Z-score is one of the most popular methods to standardize data: each value x is replaced by $z = (x - \mu) / \sigma$.
Mean (μ): a measure of central tendency.
Standard Deviation (SD or σ): a measure of the dispersion or spread of a dataset around its mean.
SD quantifies how much individual values in the dataset deviate from the mean. A low standard deviation indicates that the data points tend to be close to the mean, while a high standard deviation indicates that the data points are spread out over a wider range of values.

Normalization vs. Standardization
Two commonly used methods for bringing features with different ranges onto comparable scales.

S.No. | Normalization | Standardization
1. | Minimum and maximum values of the features are used for scaling. | Mean and standard deviation are used for scaling.
2. | It is used when features are of different scales. | It is used when we want to ensure zero mean and unit standard deviation.
3. | Scales values to [0, 1] or [-1, 1]. | It is not bounded to a certain range.
4. | It is strongly affected by outliers. | It is much less affected by outliers.
5. | Scikit-Learn provides a transformer called MinMaxScaler for normalization. | Scikit-Learn provides a transformer called StandardScaler for standardization.
6. | It is useful when we don't know the distribution. | It is useful when the feature distribution is Normal (Gaussian).
7. | It is often called Scaling Normalization. | It is often called Z-Score Normalization.

Lecture 4

Recap
Data types: List, Set, Dictionary
Libraries in Python: NumPy, pandas
More libraries (this lecture): Matplotlib, seaborn
https://colab.research.google.com/drive/1MYmM87alEbHoUBX5tGgrhon6i9Mkw4E6?usp=sharing

Steps in ML

Module 3: Data collection and processing: Collection of data, Importing data through Kaggle API, Handling missing values, Data standardization, Label encoding, Handling imbalanced datasets

Data comes from everywhere, but in different forms.

What is data?
Data is a crucial component in the field of Machine Learning.
A dataset is a collection of data points arranged as a matrix.
It refers to the set of observations or measurements that can be used to train a machine-learning model.
It is a collection of records and their attributes.
An attribute is a characteristic of an object.
A collection of attributes describes an object.
Data can help us solve specific problems.

Data analysis pipeline
Mining is not the only step in the analysis process.
Preprocessing: real data is noisy, incomplete and inconsistent. Data cleaning is required to make sense of the data and make it more reliable and usable. The goal is to transform raw data into a structured format that is consistent and accurate.
Techniques: Sampling, Dimensionality Reduction, Feature Selection
Post-Processing: make the data actionable and useful to the user: statistical analysis of importance and visualization.

Data Quality
Examples of data quality problems:
Noise and outliers
Missing values
Duplicate data

Data Quality: Noise
Noise refers to modification of original values.
Example: distortion of a person's voice when talking on a poor phone.

Data Quality: Outliers
Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set.

Data Quality: Missing Values
Reasons for missing values:
Information is not collected (e.g., people decline to give their age and weight).
Attributes may not be applicable to all cases (e.g., annual income is not applicable to children).
Handling missing values:
Eliminate data objects
Estimate missing values
Ignore the missing value during analysis
Replace with all possible values (weighted by their probabilities)

Data Quality: Duplicate Data
A data set may include data objects that are duplicates, or almost duplicates, of one another.
This is a major issue when merging data from heterogeneous sources.
Example: the same person with multiple email addresses.
Data cleaning: the process of dealing with duplicate data issues.

Data Quality: Some more problems
Data imbalance: some classes or categories in the data may have a disproportionately high or low number of corresponding samples. As a result, they risk being under-represented in the model.
Data bias: depending on how the data, subjects and labels themselves are chosen, the model could propagate inherent biases on gender, age or region. Data bias is difficult to detect and remove.

Where to collect data?
1. Open Source Datasets
Free to use, easy to find and very time-effective to use.
The best sources for public datasets are:
a. Kaggle (by far my favorite source!) [https://www.kaggle.com/]
b. UCI Machine Learning [https://archive.ics.uci.edu/]
c. Google Dataset search [https://datasetsearch.research.google.com/]
2. Build Synthetic Datasets
Generated through computer programs.
Particularly useful when adequate real-world data cannot be obtained (or is very hard to obtain).
Risk of introducing bias into the data.
At present, artificial data alone is not enough to train advanced machine learning algorithms.
3. Manual Data Generation
Collect data by yourself.

Importing data through Kaggle API
Automate data download from Kaggle directly, saving time and effort.
Seamless integration for handling big data scenarios and scaling analysis.
https://colab.research.google.com/drive/1q5fDdiWkx2ktJ0X18yGycg1EmBgaYITA?usp=sharing

Lecture 3

Recap
Mutable Objects:
1. List
2. Set
3. Dictionary

Lecture 3a
Basics of Python
https://colab.research.google.com/drive/1_BB93F4WYel1B1lv3WGyc8jv7IGOaP6z?usp=sharing

Lecture 3b
Libraries in Python
https://colab.research.google.com/drive/1UatKNUsOxxlkMQSE5tFcS5-g35JhtHfM?usp=sharing

Lecture 2
Introduction to Machine Learning

Recap
Machine learning
AI vs ML vs DL
ANN
Structured data
Unstructured data

Types of ML
Reinforcement learning
More general than supervised/unsupervised learning.
Learns from interaction with the environment to achieve a goal.
Reinforcement learning works on a feedback-based process, in which an agent (a software component) automatically explores its surroundings by trial and error: taking actions, learning from experience, and improving its performance.
The agent is rewarded for each good action and punished for each bad action; hence the goal of reinforcement learning is to maximize the rewards.

Terms used in Reinforcement Learning
Agent: an entity that can perceive/explore the environment and act upon it.
Environment: the situation in which an agent is present or by which it is surrounded. In RL, we assume a stochastic environment, which means it is random in nature.
Action: actions are the moves taken by an agent within the environment.
State: the situation returned by the environment after each action taken by the agent.
Reward: feedback returned to the agent from the environment to evaluate the action of the agent.

Semi-supervised learning
A technique that uses a small portion of labeled data and lots of unlabeled data to train a predictive model.
Self-learning

Classification Tasks in Machine Learning
1) Binary Classification:
>> classify the input data into two mutually exclusive categories
>> labeled in a binary format: true and false; spam and not spam, etc.

Classification Tasks in Machine Learning
2) Multi-Class Classification: has more than two mutually exclusive class labels
a) one vs one
b) one vs rest

Classification Tasks in Machine Learning
3) Multi-Label Classification: tries to predict 0 or more classes for each input

Classification Tasks in Machine Learning
4) Imbalanced Classification: the number of examples is unevenly distributed across the classes

Module 2: Python Basics for ML: Google Colaboratory for Python – Getting Systems Ready, Python Basics, Basic libraries needed for ML: Numpy, Pandas, Matplotlib, Seaborn and Sklearn
Google Colaboratory allows you to write, run, and share Python code within your browser.
Starting a document: https://colab.research.google.com/
Colab file on basics of Python: https://colab.research.google.com/drive/1JTGcvTm5ng7TS8I5fkE-17yfFEhqeMvR?usp=sharing

Introduction to Machine Learning

Course Plan
Attendance: 10%
Quiz/Assignment: 20%
Mid-sem: 30%
End-sem: 40%
Attendance is compulsory.
Do not share the meeting link with anyone.

Course Content
Module 1: Basics of ML: Introduction to ML, Artificial Intelligence vs Machine Learning vs Deep Learning, Types of Machine Learning: Supervised, Unsupervised, Reinforcement Learning, Supervised Learning & its Types, Unsupervised Learning & its Types, Deep Learning – Basics
Module 2: Python Basics for ML: Google Colaboratory for Python – Getting Systems Ready, Python Basics, Basic libraries needed for ML: Numpy, Pandas, Matplotlib, Seaborn and Sklearn
Module 3: Data Collection and Processing: Collection of data, importing data through Kaggle API, Handling missing values, Data standardization
Module 4: Basic Math for ML: Fundamentals of Linear Algebra, Calculus, Statistics, Probability
Module 5: Training ML Models: Machine Learning Model, Selecting a model for training, Model Optimization Techniques, Model Evaluation
Module 6: Classification Models in ML: Logistic Regression (LR), Support Vector Machines (SVM), Decision Tree Classification, Random Forest Classification, Naive Bayes, K-Nearest Neighbors
Module 7: Regression Models in ML: Linear Regression, Logistic Regression, Support Vector Machine Regression, Decision Tree Regression, Random Forest Regression
Module 8: Clustering Models in ML: K-Means Clustering, Hierarchical Clustering
Module 9: ML projects with Python: Disease prediction projects

What is Artificial Intelligence?
Artificial Intelligence suggests that machines can mimic humans in:
Talking
Thinking
Learning
Planning
Understanding
Artificial Intelligence is also called Machine Intelligence and Computer Intelligence.
Example: Amazon Echo is a smart speaker that uses Alexa, the virtual assistant AI technology developed by Amazon.

What is AI?
Non-intelligent machines | Intelligent machines
Cannot make their own decisions | Can make decisions on their own
Perform a given task |

Applications of Artificial Intelligence
Machine translation, such as Google Translate
Self-driving vehicles, such as Tesla
AI robots, such as Sophia and Aibo
Speech recognition applications, like Apple's Siri or OK Google

What is Machine Learning?
Machine Learning is a subfield of Artificial Intelligence.
"Teaching machines to imitate human intelligence"
It is a technique to implement AI in which machines learn from data by themselves without being explicitly programmed.

Applications of Machine learning
Sales forecasting for different products
Fraud analysis in banking
Product recommendations
Stock price prediction
[figure labels: Dog, Cat]

ML in layman's terms
What is this object? "It's a car."
A human can learn from past experience and make decisions on their own.
Let us ask the same question to him: what is this object?
Why couldn't he answer? [But he is a human being. He can observe and learn.]
Show him labelled examples: CAR, CAR, BIKE, BIKE.
Let us ask the same question now: "It's a car."

What about a machine?
Machines follow instructions. A machine cannot take decisions on its own.
We can ask a machine to perform operations such as:
Addition
Comparison
Multiplication
Print
Division
Plotting a chart
We want a machine to act like a human:
To identify this object
To predict the price in future
To recognize a face
Natural language: to understand and correct grammar, e.g. "I made met him yesterday"

What do we do?
Just as we did with the human, we need to provide experience to the machine.

What is Machine Learning?
This experience is what we call data, or the training dataset.
So, we first need to provide a training dataset to the machine.
Then, devise algorithms and execute programs on the data, with respect to the underlying target tasks.
Then, using the programs:
Identify required rules
Extract required patterns
Identify relations
So that the machine can derive inferences from the data.

"Machine learning refers to a system capable of the autonomous acquisition and integration of knowledge."
"Learning is any process by which a system improves performance from experience."

Machine Learning
Definition by Tom Mitchell (1998): Machine Learning is the study of algorithms that
-- improve their performance P
-- at some task T
-- with experience E

In summary
Given a machine learning problem:
Identify and create the appropriate dataset
Perform computation to learn the required rules, patterns and relations
Output the decision

Why Machine Learning?
No human experts: industrial/manufacturing control, mass spectrometer analysis, drug design
Black-box human expertise: face/handwriting/speech recognition, driving a car, flying a plane
Rapidly changing phenomena: financial modeling, diagnosis, fraud detection
Need for customization/personalization: personalized news reader, movie/book recommendation
Automating automation: getting computers to program themselves. Let the data do the work instead!

What is ML?
Example: predict whether a given email is spam or ham (not spam).

What is Deep learning?
Deep Learning is a subset of Machine Learning.
Deep Learning is responsible for the AI boom of recent years.
Deep learning is an advanced type of ML that handles complex tasks like image recognition.

Machine Learning | Deep Learning
A subset of AI | A subset of Machine Learning
Uses smaller datasets | Uses larger datasets
Trained by humans | Learns on its own
Creates simple algorithms | Creates complex algorithms

AI vs ML vs DL
AI: the ability of machines to imitate intelligent human behavior.
ML: an application of AI that allows a system to automatically learn and improve from experience.
DL: an application of ML that uses complex algorithms and deep neural networks to train a model.

What is DL?
Deep learning is a subset of machine learning that uses Artificial Neural Networks (ANN) to learn from data.
Deep learning algorithms can work with an enormous amount of both structured and unstructured data.
It deals with algorithms inspired by the structure and function of the human brain.
Structured data is typically organized in a tabular format, such as a database with rows and columns.
Unstructured data lacks a predefined data model or structure. Examples include text, images, audio, and video.

Applications of Deep learning
Cancer tumor detection
Music generation
Image coloring
Object detection

Artificial Neural Network (ANN)
Relationship between the biological neural network and the artificial neural network:

Biological Neural Network | Artificial Neural Network
Dendrites | Inputs
Cell nucleus | Nodes
Synapse | Weights
Axon | Output

Fig. Architecture of ANN: Input layer, Hidden layer 1, Hidden layer 2, Output layer
Input Layer: this layer receives the initial input data.
Hidden Layers: these layers come between the input and output layers and are responsible for learning patterns in the data.
Output Layer: this layer produces the final output of the neural network.

Types of Machine Learning
Supervised
Unsupervised/semi-supervised
Reinforcement

Supervised learning
Based on supervision: we train the machine using a "labelled" dataset, and based on the training, the machine predicts the output.
The main goal of the supervised learning technique is to map the input variable (x) to the output variable (y).
[figure: Classification, ƒ( , ) = CAR]

Supervised learning
1) Classification:
>> The goal is to categorize input data into predefined classes or categories;
>> The algorithm learns from labeled training data;
>> Classification algorithms are used to solve problems in which the output variable is categorical, such as "Yes" or "No", Male or Female, Red or Blue, etc.
Example: a common example is email spam detection.
Given a set of emails labeled as spam or not spam, a classification algorithm can learn to predict whether a new, unseen email is spam or not based on features such as the content, sender, and subject.
Classification is about predicting a class or discrete value.

Supervised learning: Classification
[figure: Classification, ƒ( , ) = CAR]

Supervised learning: Classification
[figure: a labelled dataset (Elephant, Tiger) trains a classifier; asked "Identify this animal?", it answers "Elephant"]

Supervised learning: Regression
2) Prediction / Regression: making an inference about a target variable based on input features by modelling the relationship between input and output variables, for example a linear relationship.
Regression is used to predict continuous output variables, such as market trends, weather, etc.
[figure: Regression, ƒ( , ) = 20500.50]
Regression is about predicting a quantity or continuous value.

Unsupervised learning
As its name suggests, there is no need for supervision.
The machine is trained using an unlabeled dataset and produces its output without any supervision.
The main aim of an unsupervised learning algorithm is to group or categorize the unsorted dataset according to similarities, patterns, and differences.
It looks for meaningful patterns through:
1. Clustering
2. Associations

Unsupervised learning: Clustering
Clustering is a way to group objects into clusters such that the objects with the most similarities remain in one group and have few or no similarities with the objects of other groups.

Unsupervised learning: Associations
Association finds interesting relations among variables within a large dataset.
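
The clustering idea described above can be sketched with K-Means, one of the clustering models listed in the course content. This is only an illustrative sketch, assuming scikit-learn and NumPy are installed. The toy points and the choice of n_clusters=2 are hypothetical and made up for the example; they are not from the lecture notebooks. No labels are given to the model, yet it places nearby points in the same group.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated groups of 2-D points, no labels
points = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # group near (1, 1)
    [8.0, 8.0], [8.2, 7.9], [7.8, 8.1],   # group near (8, 8)
])

# n_clusters=2 is chosen by hand because this toy data clearly has two groups
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(points)  # one cluster id (0 or 1) per point

print("Cluster id of each point:", labels)
print("Cluster centres:")
print(model.cluster_centers_)
```

Which group receives id 0 and which receives id 1 is arbitrary; what matters is that points in the same group share an id.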