Summary

This document provides an introduction to machine learning, focusing on various types of learning such as supervised, unsupervised, and reinforcement learning. It also covers key concepts like correlation, regression, and sampling methods. This information is useful for those learning about AI and machine-learning.

Full Transcript

# TE7704 AI & ML

## Unit 2: Supervised Learning

### Machine Learning

Machine learning (ML) is a branch of artificial intelligence (AI) that focuses on using data and algorithms to enable AI to imitate the way that humans learn, gradually improving its accuracy.

### Types of Machine Learning

- Supervised machine learning
  - Support vector machines
  - Decision trees
  - Neural networks
    - Recurrent neural networks
    - Convolutional neural networks
    - Dense neural networks
    - Linear networks
- Unsupervised machine learning
  - Clustering
    - Density-based spatial clustering of applications with noise (DBSCAN)
    - K-means clustering
  - Association rule learning
  - Dimensionality reduction
    - Principal component analysis
    - Isomap
- Reinforcement learning

### Supervised Learning in ML

A model based on supervised learning requires both previous data and the corresponding previous results as input. Training on this data helps the model predict results more accurately. The input data is labelled: if an algorithm has to differentiate between fruits, each item in the collection has to be labelled with its fruit type. In supervised learning, the data is divided into classes. Supervised learning covers classification and regression tasks, using algorithms such as naïve Bayes, SVM, KNN, and decision trees.

### Unsupervised Learning in ML

Unsupervised learning needs no previous results (labels) as input. The model learns on its own from the data you give it. Here the data is not labelled, and the algorithm helps the model form clusters of similar data. For example, given data about dogs and cats, the model processes the data and trains itself. Since it has no previous experience of the data, it forms clusters based on similarities between features: data points with dog-like features end up in one cluster, and those with cat-like features in another. Clustering is the main method in unsupervised learning.
We have studied algorithms like K-means clustering in the previous articles, along with mathematical concepts such as Euclidean distance and Manhattan distance.

### Semi-supervised Learning Method

This is a combination of supervised and unsupervised learning, which helps to reduce the shortcomings of both. In supervised learning, labelling the data is manual work and becomes very costly when the data set is huge. In unsupervised learning, the areas of application are very limited. Semi-supervised learning is used to reduce these problems. The model first trains under unsupervised learning, so that most of the unlabelled data is divided into clusters. Labels are then generated for the remaining unlabelled data, and classification can be carried out with ease. This technique is very useful in areas like speech recognition and analysis, protein classification, and text classification. It is a type of hybrid learning problem.

### Reinforcement Learning in ML

Reinforcement learning trains models to make decisions. It is a fascinating type of learning to study and one of the most researched fields in ML. The algorithm makes the model learn based on feedback.

**Example:** Let's say you have a dog and you are trying to train it to sit. You give certain instructions to the dog to make it learn. If the dog executes the instruction perfectly, it gets a biscuit as a reward. If not, it gets nothing. After some tries, the dog learns that it gets a biscuit if it sits. This is the gist of reinforcement learning. The reward is the feedback the dog receives for sitting.

This approach has various applications in real life. It is helpful in building self-driving cars, and it also helps in various types of simulations.

These were the four most popular methods of ML.
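The dog-training example above can be sketched as a tiny reward-driven loop. This is only an illustrative sketch under stated assumptions: the action names, the `train_dog` function, the exploration rate `epsilon`, and the running-average update are all hypothetical choices made for this example and do not come from the course notes.

```python
import random

def train_dog(episodes=200, epsilon=0.1, seed=0):
    """Learn which action earns a biscuit, using reward feedback only."""
    rng = random.Random(seed)
    actions = ["bark", "lie_down", "sit"]   # hypothetical action set
    value = {a: 0.0 for a in actions}       # estimated reward per action
    count = {a: 0 for a in actions}         # how often each action was tried

    def reward(action):
        # Feedback from the trainer: a biscuit (1.0) only for sitting.
        return 1.0 if action == "sit" else 0.0

    for episode in range(episodes):
        if episode < len(actions):
            action = actions[episode]              # try every action once
        elif rng.random() < epsilon:
            action = rng.choice(actions)           # explore occasionally
        else:
            action = max(actions, key=value.get)   # exploit the best estimate
        count[action] += 1
        # Running average of the rewards observed for this action.
        value[action] += (reward(action) - value[action]) / count[action]
    return value

learned = train_dog()
best_action = max(learned, key=learned.get)
print(best_action, learned)
```

The model is never told that sitting is correct; it only sees the reward after each attempt, and its value estimates end up favouring the rewarded action.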
## Bi-variate measures of relationship: correlation and regression

Correlation is the statistical analysis of the relationship or dependency between two variables. It allows us to study both the strength and the direction of the relationship between two sets of variables.

Studying correlation can be very useful in many data science tasks. First, it is a key component of exploratory data analysis: the initial study we conduct on our data to see how it looks, to summarize its main characteristics, and to discover patterns and anomalies. Second, correlations have many real-world applications. They can help us answer questions such as whether there is a link between democracy and economic growth, or whether car use correlates with the level of air pollution.

- **Positive Correlation:** Two variables are said to be positively correlated when their values move in the same direction.
- **Neutral Correlation:** No relationship in the change of variables X and Y. In this case, the values are completely random and do not show any sign of correlation.
- **Negative Correlation:** Variables X and Y are negatively correlated when their values change in opposite directions.

Plotting our data is the fastest and most effective way to discover the type of correlation between variables. The typical way to visualize the dependency between two variables is with a scatterplot.

**Correlation coefficients**

A correlation coefficient is a statistical summary that measures the strength and direction of the association between two variables. One advantage of correlation coefficients is that they estimate the correlation between two variables in a standardized way: the value of the coefficient is always on the same scale, ranging from -1.0 to 1.0. There are several correlation coefficients. Choosing one or another depends on what we know about the structure of the relationship between the variables.
However, data scientists normally stick to the most famous one: Pearson's correlation coefficient.

## Statistics for Machine Learning

### What is Statistics?

Statistics is the discipline concerned with collecting, organizing, analyzing, interpreting, and presenting data. In simpler terms, it's the art and science of understanding data: a tool used to make sense of information and draw conclusions.

### Types of Statistics

- **Descriptive statistics:**
  - It summarizes data using different plots and charts suited to different kinds of data (numerical and categorical), such as bar plots, pie charts, scatter plots, and histograms.
  - Descriptive statistics can be computed on a sample as well as on population data, but in practice we rarely have access to the full population.
- **Inferential statistics:**
  - It draws a sample from the population and uses that sample to infer (draw conclusions about) the population.
  - This means we perform tests on the sample data and reach a conclusion about that specific population. We use various techniques to draw conclusions, including data visualization and manipulation.

### Types of data

- **Numerical Data:**
  - Numerical data simply means numbers. It is further divided into two categories:
  - **Discrete numerical variables:** These variables take countable values, for example, rank in the classroom or the number of faculty members in a department.
  - **Continuous numerical variables:** These variables can take any value within a range and are not restricted to fixed steps, for example, the salary of an employee.
- **Categorical Data:**
  - Categorical data means categories, typically stored as strings or characters, such as name and color.
  - **Ordinal Variables:** These variables have values that can be ranked, such as a student's grade (A, B, C) or a level (high, medium, low).
  - **Nominal Variables:** These variables cannot be ranked; they simply name categories, such as colors or subjects.

### Measures of Central Tendency

- **Mean:** The average of all data points.
- **Median:** The middle value of the data when sorted in ascending order. It is robust to outliers.
- **Mode:** The most frequent value in the data.

### Measures of Spread

- **Range:** The difference between the highest and lowest values in the data.
- **Standard Deviation:** A measure of how spread out the data is around the mean.
- **Outliers:** Values that are far away from the other values in the data.
- **Quartiles:** Quartiles divide a sorted list of numbers into quarters.
- **Interquartile Range (IQR):** A measure of dispersion: the difference between the upper (75th percentile) and lower (25th percentile) quartiles.

**Example of Interquartile Range on Business Data: Employee Commute Time**

Let's say a company wants to analyze the commute times of its employees to understand the typical range and identify potential outliers. They collect data from 48 employees on their daily commute time (in minutes).

**Data (sorted in ascending order):** 10, 12, 15, 15, 18, 20, 20, 22, 23, 25, 25, 25, 28, 30, 30, 30, 32, 35, 35, 35, 38, 40, 40, 40, 42, 45, 45, 45, 48, 50, 50, 52, 55, 55, 58, 60, 60, 62, 65, 65, 68, 70, 70, 75, 78, 80, 85, 90

**Calculating the Interquartile Range (IQR):**

1. **Find the median (Q2):** With 48 values, the median is the average of the 24th and 25th values: (40 + 42) / 2 = 41 minutes.
2. **Find the first quartile (Q1):** The median of the lower 24 values is the average of the 12th and 13th values: (25 + 28) / 2 = 26.5 minutes.
3. **Find the third quartile (Q3):** The median of the upper 24 values is the average of the 36th and 37th values: (60 + 60) / 2 = 60 minutes.
4. **Calculate the IQR:** 60 - 26.5 = 33.5 minutes.

**Interpretation:** The IQR of 33.5 minutes is the width of the range within which the middle 50% of employee commute times fall: about half of the employees commute between 26.5 and 60 minutes. Software that interpolates between values may report slightly different quartiles.

**Identifying Potential Outliers:** The IQR can also be used to identify potential outliers, which are data points that fall significantly outside the typical range. A common rule of thumb is to treat any data point below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR as a potential outlier.

- **Lower bound:** 26.5 - 1.5 * 33.5 = -23.75
- **Upper bound:** 60 + 1.5 * 33.5 = 110.25

Since commute times cannot be negative, there can be no outliers on the lower end. Any commute time above 110.25 minutes would be considered a potential outlier; the largest value in this data set is 90 minutes, so none is flagged.

**Business Implications:** Understanding the IQR of employee commute times can help the company make informed decisions regarding:

- **Workplace flexibility**: A large IQR might suggest offering flexible work arrangements or remote work options to accommodate employees with longer commutes.
- **Office location**: If the majority of employees have long commute times, the company might consider relocating to a more accessible location.
- **Employee well-being**: Long commutes can contribute to stress and reduced productivity. The company might explore initiatives like providing transportation assistance or promoting carpooling to alleviate the burden of long commutes.

This example demonstrates how the IQR can be used to understand the spread of business data, identify potential outliers, and inform decision-making. You can apply this concept to various other business datasets like sales figures, customer satisfaction scores, or production times.

### Mean Absolute Deviation (MAD)

The mean absolute deviation (MAD), also called the absolute deviation from the mean, describes the variation in a data set. In simple words, it is the average absolute distance of each point in the set from the mean.

**Example:** Let's consider a restaurant that wants to analyze the variability in customer waiting times to improve its service efficiency. They collect data on the waiting time (in minutes) for 10 customers.

**Data (Wait times in Minutes):** 5, 8, 12, 3, 6, 10, 7, 9, 4, 11

**Calculating the Mean Absolute Deviation (MAD):**

- **Step 1) Calculate the mean:** Mean = 75 / 10 = 7.5 minutes.
- **Step 2) Calculate the absolute deviations:**

Data Point | Deviation from Mean | Absolute Deviation
------- | -------- | --------
5 | 7.5 - 5 = 2.5 | 2.5
8 | 7.5 - 8 = -0.5 | 0.5
12 | 7.5 - 12 = -4.5 | 4.5
3 | 7.5 - 3 = 4.5 | 4.5
6 | 7.5 - 6 = 1.5 | 1.5
10 | 7.5 - 10 = -2.5 | 2.5
7 | 7.5 - 7 = 0.5 | 0.5
9 | 7.5 - 9 = -1.5 | 1.5
4 | 7.5 - 4 = 3.5 | 3.5
11 | 7.5 - 11 = -3.5 | 3.5

- **Step 3) Calculate the MAD:** The absolute deviations sum to 25, so MAD = 25 / 10 = 2.5 minutes.

**Interpretation:** The MAD of 2.5 minutes indicates that, on average, customer waiting times deviate from the mean waiting time of 7.5 minutes by 2.5 minutes. This gives a sense of the typical dispersion of waiting times.

### Variance

Variance measures how far the data points lie from the mean. The only difference from MAD is that the deviations are squared instead of taken in absolute value.

**Example:** Let's consider the same restaurant example:

Data (Wait times in Minutes): 5, 8, 12, 3, 6, 10, 7, 9, 4, 11

**Calculating the Variance:**

- **Step 1) Calculate the squared deviations:**

Data Point | Deviation from Mean | Squared Deviation
------- | -------- | --------
5 | 2.5 | 6.25
8 | -0.5 | 0.25
12 | -4.5 | 20.25
3 | 4.5 | 20.25
6 | 1.5 | 2.25
10 | -2.5 | 6.25
7 | 0.5 | 0.25
9 | -1.5 | 2.25
4 | 3.5 | 12.25
11 | -3.5 | 12.25

- **Step 2) Calculate the variance:** The squared deviations sum to 82.5, so Variance = 82.5 / 10 = 8.25 minutes². (Dividing by n - 1 = 9 instead gives the sample variance, about 9.17 minutes².)

**Interpretation:** The variance of 8.25 minutes² represents the average squared deviation of waiting times from the mean. While the variance provides a measure of dispersion, its units (minutes²) are not directly interpretable in the context of waiting times.

### Comparing MAD and Variance:

- Both MAD and variance measure the variability or dispersion of data.
- MAD is easier to interpret as it is expressed in the same units as the original data.
- Variance is more sensitive to outliers due to the squaring of deviations.
- Variance is widely used in statistical analysis and forms the basis for other measures like standard deviation.

### Business Implications:

- **Identify areas for improvement:** A high MAD or variance suggests inconsistent waiting times, indicating potential issues with service processes.
- **Set realistic expectations:** The restaurant can use the average waiting time and MAD to communicate expected waiting times to customers.
- **Track performance over time:** Monitoring changes in MAD and variance can help assess the effectiveness of interventions aimed at reducing waiting times.
- **Compare performance against competitors:** Benchmarking waiting time variability against competitors can provide insights into areas where the restaurant can improve its service efficiency.

This example demonstrates how MAD and variance can be used to analyze the variability of business data and inform decision-making. You can apply these concepts to various other business datasets like sales figures, production output, or employee performance metrics.

## Sampling

### What is Sampling?

Sampling is a process in statistical analysis where researchers take a predetermined number of observations from a larger population. Sampling allows researchers to conduct studies about a large group by using a small portion of the population.

### Scenario:

A large telecommunications company with millions of customers wants to assess customer satisfaction with its services. Surveying every customer is impractical and costly. Instead, the company decides to use sampling to gather insights from a representative subset of its customer base.

- **Population:** All customers of the telecommunications company (millions).
- **Sampling Frame:** A list of all customer phone numbers or email addresses.
- **Dataset:** The dataset would consist of the survey responses from the selected sample of customers. It might include variables like:
  - Customer ID: Unique identifier for each customer.
  - Age: Customer's age group.
  - Service Plan: Customer's current service plan.
  - Location: Customer's geographic location.
  - Satisfaction Score: Overall satisfaction with the company's services, for example, on a scale of 1-5.
  - Reasons for Satisfaction/Dissatisfaction: Open-ended or multiple-choice questions about specific aspects of the service, for example, network quality, customer support, billing.

### Sampling Methods

- **1. Simple Random Sampling:**
  - Each customer in the sampling frame has an equal chance of being selected. This can be achieved using a random number generator or specialized software.
  - **Sample Size:** For example, the company decides to sample 1,000 customers.
  - **Data Collection:** The selected customers are contacted via phone or email and asked to participate in a satisfaction survey.
  - **Advantages:** Simple to implement, unbiased representation of the population.
  - **Disadvantages:** May not capture specific customer segments (e.g., high-value customers) adequately.
- **2. Stratified Random Sampling:**
  - The population is divided into strata (groups) based on relevant characteristics, for example, age, service plan, location. Then, a simple random sample is drawn from each stratum.
  - **Strata:**
    - Age: <30, 30-50, >50
    - Service Plan: Basic, Premium, Family
    - Location: Urban, Suburban, Rural
  - **Sample Size:** For example, 100 customers from each of the nine strata listed above (total sample size = 900).
  - **Data Collection:** Customers are randomly selected from each stratum and contacted for the survey.
  - **Advantages:** Ensures representation from all relevant customer segments, allows for comparisons between strata.
  - **Disadvantages:** More complex to implement than simple random sampling, requires knowledge of population characteristics.

## ML Use Cases in Engineering

- **Anomaly Detection and Predictive Maintenance:**
  - **Anomaly Detection:** Machine learning can be used to identify unusual patterns in data that may indicate an issue or anomaly.
  - **Predictive Maintenance:** Machine learning can be used to predict when equipment is likely to fail, enabling proactive maintenance and reducing downtime.
- **Recommendation Systems:**
  - **Product Recommendations:** Machine learning can be used to recommend products or services based on user behavior and preferences.
  - **Content Recommendations:** Machine learning can be used to recommend content, such as movies or music, based on user behavior and preferences.
- **Predictive Analytics**
  - **Forecasting:** Machine learning can be used to forecast future events or trends based on historical data.
  - **Risk Assessment:** Machine learning can be used to assess the likelihood of a particular outcome or event.

**Recommendations for Finance:**

- **Credit Risk Assessment:** Machine learning can be used to assess the creditworthiness of loan applicants.
- **Portfolio Optimization:** Machine learning can be used to optimize investment portfolios based on market trends and risk analysis.

**Recommendations for Marketing:**

- **Customer Segmentation:** Machine learning can be used to segment customers based on their behavior and preferences.
- **Targeted Advertising:** Machine learning can be used to target advertisements to specific groups of customers based on their behavior and preferences.

**Recommendations for Education:**

- **Personalized Learning:** Machine learning can be used to recommend personalized learning plans for students based on their learning style and abilities.
- **Grading and Feedback:** Machine learning can be used to provide automated grading and feedback to students.

## Regression

- **Simple Linear Regression:**
  - It is a type of regression analysis where the relationship between two variables is assumed to be linear.
  - **Formula:** $y = b_0 + b_1 X_1$
    - $y$ = Dependent variable (DV)
    - $X_1$ = Independent variable (IV)
    - $b_0$ = Constant
    - $b_1$ = Coefficient

**Ordinary Least Squares**

- It is a method used to estimate the coefficients of a linear regression model by minimizing the sum of squared errors.
- It aims to find the line that best fits the data by minimizing the sum of squared distances between the actual values and the values predicted by the model.
- **Formula:** $\sum (y - \hat{y})^2 \to \min$
  - $y$ = Actual value of the dependent variable
  - $\hat{y}$ = Predicted value of the dependent variable

**Example of Simple Linear Regression:** This dataset contains information about a company's advertising spend and the corresponding sales revenue. The goal is to use simple linear regression to model the relationship between advertising spend and sales revenue, and then use the model to make predictions about future sales based on advertising spend.

**Data:**

Advertising Spend | Sales Revenue
------- | --------
$10,000 | $50,000
$15,000 | $65,000
$20,000 | $80,000
$25,000 | $95,000
$30,000 | $110,000
$35,000 | $125,000
$40,000 | $140,000
$45,000 | $155,000
$50,000 | $170,000
$55,000 | $185,000

This dataset can be used to build a simple linear regression model that predicts the sales revenue based on the advertising spend. The model will take the form:

**Sales Revenue = a + b * Advertising Spend**

- where 'a' is the y-intercept and 'b' is the slope of the line (the constant $b_0$ and the coefficient $b_1$ in the formula above).

By analyzing this dataset using simple linear regression, you can:

- Determine the relationship between advertising spend and sales revenue.
- Estimate the values of the coefficients 'a' and 'b' that best fit the data.
- Use the regression model to predict the sales revenue for a given advertising spend.
- Assess the goodness of fit of the model and the statistical significance of the relationship.
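The estimation step above can be sketched with the standard closed-form least-squares solution for a single predictor, applied to the advertising table. This is a minimal sketch, not a full analysis: the function name `fit_simple_ols` and the 60,000 spend used for the prediction are hypothetical choices for illustration, and a real project would more likely rely on a statistics library that also reports significance tests.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) minimizing the sum of squared errors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Advertising spend and sales revenue from the table above (in dollars).
spend = [10_000, 15_000, 20_000, 25_000, 30_000,
         35_000, 40_000, 45_000, 50_000, 55_000]
revenue = [50_000, 65_000, 80_000, 95_000, 110_000,
           125_000, 140_000, 155_000, 170_000, 185_000]

a, b = fit_simple_ols(spend, revenue)

# Sum of squared errors: the quantity that OLS minimizes.
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(spend, revenue))

# Prediction for a hypothetical spend that is not in the table.
predicted = a + b * 60_000

print(f"a = {a:.2f}, b = {b:.2f}, SSE = {sse:.2f}, prediction = {predicted:.2f}")
```

The points in this table lie exactly on a straight line, so the fit returns an intercept of 20,000 and a slope of 3 with zero squared error. Real advertising data would scatter around the line and leave a non-zero sum of squared errors.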
This type of analysis can be useful for businesses to understand the impact of their advertising investments on sales, and to make more informed decisions about their marketing strategies and budgets.

## Sampling Distribution (Z, t, F, chi-squared statistic)

## Hypothesis Testing
