Unit 2: Machine Learning
Datasets in Machine Learning: - In machine learning, a dataset is a collection of data used to train, validate, and test machine learning models. A dataset typically consists of a set of examples, each consisting of a set of input features (also known as independent variables or predictors) and a target variable (also known as the dependent variable or response variable). A dataset can be thought of as a table, where each row represents a single example or observation, and each column represents a feature or variable. The target variable is the variable that the machine learning model is trying to predict or classify.

Datasets can be categorized into different types based on their characteristics, such as:
1. Structured datasets: Datasets with well-defined schemas, such as tables with rows and columns, like relational databases.
2. Unstructured datasets: Datasets without a predefined schema, such as images, audio files, or text documents.

Need of Dataset: - The need for datasets in machine learning is multifaceted. Here are some of the key reasons why datasets are essential:
1. Train models: Datasets are used to train machine learning models to learn patterns and relationships.
2. Evaluate models: Datasets are used to evaluate the performance of machine learning models.
3. Tune hyperparameters: Datasets are used to tune hyperparameters to optimize model performance.
4. Select models: Datasets are used to select the best machine learning model for a particular problem.

Machine Learning Life Cycle: - The life cycle is a loop of eight stages: problem definition → data collection → data preprocessing → model selection → model training → model evaluation → model deployment → model maintenance.
1. Problem Definition: Identify a problem or opportunity you want to solve. Example: Predict which products will sell well during the holiday season.
2. Data Collection: Gather relevant data to help solve the problem. Example: Collect sales data, customer info, product details, and marketing campaign data.
3. Data Preprocessing: Clean and prepare the data for machine learning. Example: Remove missing values, normalize data, and select important features.
4. Model Selection: Choose a suitable machine learning algorithm. Example: Select a regression algorithm to predict sales.
5. Model Training: Train the model using the prepared data. Example: Train the regression model on the sales data.
6. Model Evaluation: Check how well the model performs. Example: Evaluate the model's accuracy in predicting sales.
7. Model Deployment: Put the model into action in a real-world environment. Example: Use the model to predict sales on the retailer's website.
8. Model Maintenance: Monitor and update the model as needed. Example: Retrain the model with new data to ensure it remains accurate.

Data Pre-processing: - Data pre-processing involves transforming raw data into an understandable format. It is essential because raw data often contains noise, missing values, and inconsistencies. Effective data pre-processing helps improve the accuracy and efficiency of the machine learning model.

1. Handling Missing Values: - Handling missing values involves replacing or removing missing data points in a dataset so that the machine learning model can be trained accurately.
Example: Suppose we have a dataset of student grades with some missing values (NaN marks a missing entry):

STUDENT ID   MATH GRADE   SCIENCE GRADE   ENGLISH GRADE
1            90           80              70
2            80           90              NaN
3            70           NaN             90
4            NaN          80              80

We can handle missing values by replacing them with the mean or median of the respective column. For example, we can replace the missing Math Grade for Student 4 with the mean of the Math Grade column, (90 + 80 + 70) / 3 = 80.
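As a minimal sketch of mean imputation with pandas (the DataFrame below mirrors the toy grades table above, with None marking missing entries):

    import pandas as pd

    # Toy grades data mirroring the table above; None marks a missing value.
    grades = pd.DataFrame({
        "student_id": [1, 2, 3, 4],
        "math":    [90, 80, 70, None],
        "science": [80, 90, None, 80],
        "english": [70, None, 90, 80],
    })

    # Replace each missing value with the mean of its own column.
    # Student 4's Math Grade becomes (90 + 80 + 70) / 3 = 80.
    subjects = ["math", "science", "english"]
    grades[subjects] = grades[subjects].fillna(grades[subjects].mean())
    print(grades)

Replacing with the median instead (grades[subjects].median()) is more robust when a column contains outliers.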
2. Data Normalization: - Data normalization involves scaling numerical data to a common range, usually between 0 and 1, to prevent features with large ranges from dominating the model. (A code sketch follows section 4 below.)
Example: Suppose we have a dataset of house prices with features like number of bedrooms and square footage:

HOUSE ID   NUMBER OF BEDROOMS   SQUARE FOOTAGE   PRICE
1          3                    1500             200000
2          4                    2500             350000
3          2                    1000             150000

We can normalize the Square Footage feature to the range 0 to 1 with min-max scaling, (x - min) / (max - min):

HOUSE ID   NUMBER OF BEDROOMS   SQUARE FOOTAGE (NORMALIZED)   PRICE
1          3                    0.33                          200000
2          4                    1.00                          350000
3          2                    0.00                          150000

3. Feature Scaling: - Feature scaling involves transforming features to have similar scales or ranges to prevent features with large ranges from dominating the model. (See the same sketch after section 4 below.)
Example: Suppose we have a dataset of customer information with features like age and income:

CUSTOMER ID   AGE   INCOME
1             25    50000
2             30    70000
3             35    90000

We can scale the Age feature to a similar range (here, dividing each age by 50):

CUSTOMER ID   AGE (SCALED)   INCOME
1             0.5            50000
2             0.6            70000
3             0.7            90000

4. Handling Outliers: - Handling outliers involves identifying and removing or transforming data points that are significantly different from the rest of the data.
Example: Suppose we have a dataset of exam scores with an outlier:

STUDENT ID   EXAM SCORE
1            80
2            90
3            100
4            150

We can identify the outlier (150) and remove it or transform it to a more reasonable value.
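As referenced in sections 2 and 3, here is a minimal sketch of both operations using scikit-learn: MinMaxScaler for min-max normalization, and StandardScaler as one common form of feature scaling (the age table above instead divides by a constant; both approaches put features on comparable scales). The arrays mirror the toy tables above:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    square_footage = np.array([[1500.0], [2500.0], [1000.0]])
    ages = np.array([[25.0], [30.0], [35.0]])

    # Min-max normalization: (x - min) / (max - min), mapping into [0, 1].
    print(MinMaxScaler().fit_transform(square_footage).ravel())
    # -> [0.333, 1.0, 0.0] (approx.)

    # Standardization: (x - mean) / std, giving zero mean and unit variance.
    print(StandardScaler().fit_transform(ages).ravel())
    # -> [-1.225, 0.0, 1.225] (approx.)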
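And for section 4, a sketch of removing or transforming the outlier, assuming exam scores are meant to lie in the range 0 to 100:

    import numpy as np

    scores = np.array([80, 90, 100, 150])

    # Remove: keep only scores inside the valid range.
    cleaned = scores[(scores >= 0) & (scores <= 100)]
    print(cleaned)    # [ 80  90 100] -- the outlier 150 is dropped

    # Or transform: clip extreme values back into the valid range.
    clipped = np.clip(scores, 0, 100)
    print(clipped)    # [ 80  90 100 100]

On larger samples where no hard valid range is known, a common alternative is to flag points outside 1.5 times the interquartile range from the quartiles.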
5. Data Transformation: - Data transformation involves converting data from one format to another to make it more suitable for modeling.
Example: Suppose we have a dataset with a categorical variable like color:

OBJECT ID   COLOR
1           Red
2           Blue
3           Green

We can transform the categorical variable into numerical variables using one-hot encoding:

OBJECT ID   RED   BLUE   GREEN
1           1     0      0
2           0     1      0
3           0     0      1
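A minimal one-hot encoding sketch with pandas (note that pd.get_dummies generates column names of the form color_<value> rather than the bare names in the table above):

    import pandas as pd

    objects = pd.DataFrame({"object_id": [1, 2, 3],
                            "color": ["Red", "Blue", "Green"]})

    # Expand the categorical column into one 0/1 indicator column per value.
    encoded = pd.get_dummies(objects, columns=["color"], dtype=int)
    print(encoded)   # columns: object_id, color_Blue, color_Green, color_Red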
Difference between Artificial Intelligence and Machine Learning:
1. Definition - AI: Broad field aiming to create systems that mimic human intelligence. ML: Subfield of AI focused on algorithms that learn from data.
2. Scope - AI: Broad; includes ML, robotics, expert systems, NLP, etc. ML: Narrow; specific to learning from data and improving over time.
3. Goal - AI: Simulate human intelligence and perform complex tasks. ML: Analyze data to find patterns and make decisions/predictions.
4. Functionality - AI: Encompasses reasoning, problem-solving, understanding language, etc. ML: Focuses on learning from data, making predictions, and improving accuracy.
5. Data Dependency - AI: Can operate with or without large amounts of data. ML: Highly dependent on data for training and making predictions.
6. Adaptability - AI: Can adapt and make decisions based on predefined rules and logic. ML: Continuously improves performance as more data is provided.
7. Examples - AI: Robotics, NLP (e.g., Siri, Alexa), Expert Systems. ML: Spam filtering, Recommendation Systems (e.g., Netflix, Amazon), Image Recognition.
8. Techniques Used - AI: Rule-based systems, search algorithms, optimization, etc. ML: Supervised learning, Unsupervised learning, Reinforcement learning.
9. Applications - AI: Autonomous vehicles, Virtual Assistants, Robotics, Expert Systems. ML: Fraud Detection, Predictive Maintenance, Customer Segmentation.
10. Development Complexity - AI: Higher complexity due to broad scope and diverse applications. ML: Complex, but more focused on algorithms and data.

Examples to Illustrate the Differences:
Artificial Intelligence (AI) Example - Virtual Assistants: AI-powered virtual assistants like Siri or Alexa can understand and process human language, perform tasks such as setting reminders, playing music, and providing weather updates, all of which require simulating aspects of human intelligence.
Machine Learning (ML) Example - Spam Filtering: An email system that uses ML algorithms to identify and filter out spam messages. The system learns from labeled examples of spam and non-spam emails and improves its accuracy over time by analyzing new data.

Basics of Neural Networks
What is a Neural Network: - A neural network is a machine learning model inspired by the structure and function of the human brain. It is a collection of interconnected nodes, or "neurons", that process and transmit information.

Basic Components:
1. Neurons (Nodes):
   a. Analogous to biological neurons: each neuron receives inputs, processes them, and produces an output.
   b. Activation function: determines the output of a neuron based on the weighted sum of its inputs. Common activation functions include ReLU, Sigmoid, and Tanh.
2. Layers:
   a. Input layer: the first layer of the neural network, which receives the input data.
   b. Hidden layers: layers between the input and output layers where computations are performed. A neural network can have one or more hidden layers.
   c. Output layer: the final layer, which produces the output of the network.

Structure of a Neural Network: - Neural networks are typically organized in layers:
1. Input Layer: Receives the input data.
2. Hidden Layers: Perform transformations on the input data. Each hidden layer consists of multiple neurons.
3. Output Layer: Produces the final result.

Example: Neural Network for Spam Classification
I. Input Layer
   a. Features: Each neuron in the input layer represents a feature of the email. Common features might include:
      i. Frequency of specific words (e.g., "free", "win")
      ii. Length of the email
      iii. Presence of certain phrases or keywords
      iv. Email metadata (e.g., number of links, sender's address)
   b. Example: If the email has 100 features, the input layer will have 100 neurons.
II. Hidden Layers
   a. Purpose: These layers process the features from the input layer to extract patterns and relationships. Each neuron in a hidden layer applies an activation function to the weighted sum of its inputs to produce an output.
   b. Activation Functions: Common choices include ReLU (Rectified Linear Unit) or sigmoid, which introduce non-linearity into the model.
   c. Example: With two hidden layers of 50 neurons each, the network learns complex patterns by combining and transforming the input features.
III. Output Layer
   a. Purpose: The output layer provides the final classification of the email. For a binary problem (Spam vs. Not Spam), you can use either a single output neuron with a sigmoid activation, or two output neurons, one per class.
   b. Activation Function: With two output neurons, the softmax function converts the raw output scores into probabilities that sum to 1.
   c. Example:
      i. Neuron 1: Probability of the email being Spam
      ii. Neuron 2: Probability of the email being Not Spam
IV. Training the Network
   a. Loss Function: A loss function, like binary cross-entropy for binary classification, measures how well the network's predictions match the true labels.
   b. Optimizer: An optimization algorithm (e.g., SGD, Adam) adjusts the weights of the neurons to minimize the loss function through backpropagation.
   c. Epochs: The network is trained over multiple epochs, iterating through the training data to improve accuracy. (A code sketch follows the summary below.)

Summary
1. Input Layer: Represents email features (e.g., word frequencies, email length).
2. Hidden Layers: Process features to learn patterns (e.g., presence of spammy words).
3. Output Layer: Classifies the email into Spam or Not Spam with probabilities.
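The walkthrough above maps onto a few lines of Keras. The sketch below follows the example's shapes (100 input features, two hidden layers of 50 ReLU neurons) but uses a single sigmoid output neuron, which pairs with the binary cross-entropy loss from step IV; the data is random placeholder data, not a real email corpus:

    import numpy as np
    from tensorflow import keras

    # Placeholder data: 1000 "emails" with 100 numeric features each,
    # and random 0/1 labels standing in for Not Spam / Spam.
    X = np.random.rand(1000, 100)
    y = np.random.randint(0, 2, size=1000)

    model = keras.Sequential([
        keras.layers.Input(shape=(100,)),             # input layer: 100 features
        keras.layers.Dense(50, activation="relu"),    # hidden layer 1
        keras.layers.Dense(50, activation="relu"),    # hidden layer 2
        keras.layers.Dense(1, activation="sigmoid"),  # output: P(spam)
    ])

    # Step IV: binary cross-entropy loss, Adam optimizer, several epochs.
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2)

    # Predict the spam probability for a new email's feature vector.
    print(model.predict(X[:1]))

With the two-neuron softmax output described in step III instead, the final layer would be Dense(2, activation="softmax") and the loss would change to categorical cross-entropy; the single-sigmoid form is the more common choice for binary problems.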