Full Transcript

SILVER OAK COLLEGE OF COMPUTER APPLICATION
SUBJECT: MACHINE LEARNING
TOPIC: INTRODUCTION TO MACHINE LEARNING

Content:
Machine learning, a subfield of artificial intelligence (AI), is the ability of computers to learn from experience and improve without being explicitly programmed. Traditional programming means writing specific instructions for a computer to follow, whereas machine learning enables computers to learn from data and enhance their performance over time. The main objective of machine learning is to create models and algorithms that can forecast future outcomes, make decisions, or find patterns in data.

In the context of artificial intelligence, machine learning plays a crucial role in enabling computers to process and analyze large amounts of data, leading to better decision-making and problem-solving capabilities. The ability of machine learning models to learn from past experience and adapt to new situations makes them powerful tools for a wide range of applications.

Use Cases of Machine Learning
Machine learning has become an integral part of many industries because it can analyze large datasets and generate valuable insights. Common use cases include:
- Image and Speech Recognition: Machine learning algorithms can be trained to recognize and classify images or transcribe spoken language, powering technologies such as facial recognition and voice assistants.
- Recommender Systems: Many online platforms use machine learning to recommend personalized content, products, or services based on user preferences and behavior, enhancing the user experience.
- Fraud Detection in Finance: Machine learning can identify patterns and anomalies in financial transactions, helping detect fraudulent activity and secure financial systems.
- Healthcare Applications: Machine learning models can assist in disease diagnosis, predict patient outcomes, and recommend personalized treatment plans based on patient data.
- Predictive Maintenance in Manufacturing: By analyzing sensor data from machines, machine learning can predict potential failures, enabling proactive maintenance and reducing downtime.
- Natural Language Processing: Machine learning powers language translation, sentiment analysis, chatbots, and virtual assistants, making human-computer interaction more intuitive.

Types of Machine Learning
Supervised Learning: Supervised learning involves training a model on labeled data, where the desired output is known. The model learns to map input features to the correct output labels, allowing it to make predictions on new, unseen data. Supervised learning is commonly used for classification, where the model predicts categories, and regression, where it predicts continuous values.
Example: Handwriting Recognition - Given a dataset of handwritten digits labeled with their corresponding numbers, a supervised learning algorithm can learn to recognize and classify new handwritten digits.

Unsupervised Learning: In unsupervised learning, the model deals with unlabeled data, meaning there are no predefined output labels. Instead, the algorithm identifies patterns, clusters, or structures in the data without explicit guidance.
Example: Customer Segmentation - Given customer data without any predefined segments, an unsupervised learning algorithm can group similar customers together based on their purchasing behavior and demographics.

Semi-Supervised Learning: Semi-supervised learning is a combination of supervised and unsupervised learning.
It uses a small amount of labeled data together with a larger amount of unlabeled data to make predictions. This approach is particularly useful when obtaining large labeled datasets is expensive or time-consuming.
Example: Sentiment Analysis - In a dataset of customer reviews, a semi-supervised learning algorithm can use a small subset of labeled reviews to train a sentiment classifier and then apply it to the rest of the data to classify sentiments.

Reinforcement Learning: Reinforcement learning involves an agent learning from interaction with an environment in order to achieve specific goals. The agent receives feedback in the form of rewards or penalties based on its actions, guiding it toward the best strategy for achieving its objectives.
Example: Game Playing - An AI agent can use reinforcement learning to learn optimal strategies by taking actions, receiving rewards for good moves and penalties for bad moves, ultimately improving its gameplay over time.

Machine Learning Modeling Flow
The process of developing machine learning models involves several key steps:
1. Data Collection: Gather relevant data from various sources. The quality and size of the data play a crucial role in the success of the machine learning model.
2. Data Preprocessing: Before feeding the data into the model, it must be cleaned, transformed, and prepared for analysis so that it is in a suitable format for the model to process effectively.
3. Feature Engineering: Select and extract the most relevant features from the data to use as inputs for the model. Proper feature selection can significantly affect the model's performance.
4. Model Selection: Choose an appropriate machine learning algorithm for the task. Different algorithms have different strengths and weaknesses, and the right choice depends on the nature of the data and the problem to be solved.
5. Model Training: Train the selected algorithm on the labeled data so that it learns the underlying patterns and relationships between the features and the target variable.
6. Model Evaluation: After training, evaluate the model's performance on separate data that it has not seen before. This assesses how well the model generalizes to new, unseen data.
7. Model Tuning: Fine-tune the model by adjusting hyperparameters or making other modifications to improve its accuracy and generalization.

Supervised vs. Unsupervised Learning

Challenges of Machine Learning
1. Data Quality: Poor-quality or biased data can lead to inaccurate predictions and unreliable models. Ensuring high-quality, representative data is crucial for the success of machine learning applications.
2. Overfitting: Overfitting occurs when a model performs well on training data but fails to generalize to new, unseen data. It is essential to detect and prevent overfitting in order to build models that make accurate predictions on new data.
3. Interpretability: Some complex machine learning models are hard to interpret, making it difficult to understand their decision-making process. Interpretable models are preferred in applications where transparency is crucial.
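A minimal sketch of the modeling flow described above (collect, preprocess, train, evaluate, tune), written with scikit-learn; the dataset, pipeline steps, and hyperparameter grid are illustrative assumptions rather than part of the original notes:

# Illustrative end-to-end flow: collect -> preprocess -> train -> evaluate -> tune
from sklearn.datasets import load_breast_cancer          # stand-in for "data collection"
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)

# Hold out unseen data for evaluation (step 6)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing (step 2) and model selection/training (steps 4-5) in one pipeline
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])

# Model tuning (step 7): try a small hyperparameter grid with cross-validation
grid = GridSearchCV(pipe, param_grid={"clf__C": [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)

# Model evaluation (step 6) on data the model has not seen
print("test accuracy:", accuracy_score(y_test, grid.predict(X_test)))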
SILVER OAK COLLEGE OF COMPUTER APPLICATION
SUBJECT: MACHINE LEARNING
TOPIC: Unit 2 - Supervised Learning (Linear Regression)

Supervised Learning
- In supervised learning, the machine is trained on a set of labeled data, which means that each input is paired with the desired output. The machine then learns to predict the output for new input data. Supervised learning is often used for tasks such as classification, regression, and object detection.
- The training data provided to the machine acts as a supervisor that teaches the machine to predict the output correctly. It applies the same concept as a student learning under the supervision of a teacher.
- Supervised learning is the process of providing input data as well as the correct output data to the machine learning model. The aim of a supervised learning algorithm is to find a mapping function that maps the input variable (x) to the output variable (y).
- In the real world, supervised learning can be used for risk assessment, image classification, fraud detection, spam filtering, and more.

How Supervised Learning Works
In supervised learning, models are trained on a labeled dataset, where the model learns about each type of data. Once training is complete, the model is tested on test data (a subset of the original data held out from training) and then predicts the output.

Steps Involved in Supervised Learning:
- First, determine the type of training dataset.
- Collect/gather the labeled training data.
- Split the dataset into training, test, and validation sets.
- Determine the input features of the training dataset, which should carry enough information for the model to accurately predict the output.
- Determine a suitable algorithm for the model, such as a support vector machine or a decision tree.
- Execute the algorithm on the training dataset. Sometimes we need validation sets to tune the control parameters; these are held out from the training data.
- Evaluate the accuracy of the model by providing the test set. If the model predicts the correct outputs, the model is accurate.

Types of Supervised Machine Learning Algorithms
1. Regression
Regression algorithms are used when there is a relationship between the input variable and a continuous output variable. They are used for the prediction of continuous values, such as weather forecasting and market trends. Popular regression algorithms under supervised learning include:
- Linear Regression
- Regression Trees
- Non-Linear Regression
- Bayesian Linear Regression
- Polynomial Regression

2. Classification
Classification algorithms are used when the output variable is categorical, meaning there are discrete classes such as Yes/No, Male/Female, or True/False; spam filtering is a typical example. Popular classification algorithms include:
- Random Forest
- Decision Trees
- Logistic Regression
- Support Vector Machines

A minimal linear regression sketch follows this section.

Advantages of Supervised Learning:
- With the help of supervised learning, the model can predict outputs on the basis of prior experience.
- In supervised learning, we can have an exact idea about the classes of objects.
- Supervised learning models help us solve various real-world problems such as fraud detection and spam filtering.

Disadvantages of Supervised Learning:
- Supervised learning models are not well suited to handling very complex tasks.
- Supervised learning cannot predict the correct output if the test data is very different from the training data.
- Training requires a lot of computation time.
- In supervised learning, we need sufficient knowledge about the classes of objects.
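Since this unit focuses on linear regression, here is a small sketch of the workflow described above (split the data, fit a model that maps x to y, evaluate on held-out data). The synthetic data and coefficient values are assumptions made purely for illustration:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data: y is roughly a linear function of x plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1.0, size=200)

# Split into training and test sets (a validation set could be split off the same way)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Learn the mapping function from x to y
model = LinearRegression()
model.fit(X_train, y_train)

print("learned slope:", model.coef_[0], "intercept:", model.intercept_)
print("test MSE:", mean_squared_error(y_test, model.predict(X_test)))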
Predicting Categorical Data Using Classification Algorithms
Classification algorithms are supervised machine learning algorithms that use labeled data (the training dataset) to train classifier models. These models then predict outcomes as accurately as possible when new data (the testing dataset) is fed to them. The outcome predicted by a classification algorithm is categorical in nature: these algorithms classify observations into a specific set of classes, such as classifying a text message as a transaction or a promotion through an SMS filter on a phone.

Overview of Classification Algorithms
Classification techniques predict the discrete class label(s) to which data elements belong. For example, weather prediction with the class labels 'hot' and 'cold' is a classification problem; because there are only two classes, it is called binary classification.
A few more examples of classification problems:
- Speech recognition
- Face detection
- Spam text/e-mail classification
- Stock market prediction
- Breast cancer detection
- Employee attrition prediction

How Do Classification Algorithms Work?
A classifier uses known (training) data to learn how the given input (independent) variables relate to the target (dependent) variable. In the weather example above, the outside temperatures of previous days serve as the training data. This data is fed into the classifier; if it is trained accurately, it will be able to predict future weather conditions. We use binary classifiers when there are only two classes and multi-class classifiers when there are more than two.

K-Nearest Neighbor (KNN) Algorithm for Machine Learning
- K-Nearest Neighbor is one of the simplest machine learning algorithms, based on the supervised learning technique.
- The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category most similar to the available categories.
- The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a suitable category using the K-NN algorithm.
- K-NN can be used for regression as well as classification, but it is mostly used for classification problems.
- K-NN is a non-parametric algorithm, which means it makes no assumption about the underlying data.
- It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs the computation at classification time. At the training phase the KNN algorithm just stores the dataset, and when it gets new data, it classifies that data into the category most similar to the new data.

Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification we can use the KNN algorithm, since it works on a similarity measure. Our KNN model will find the features of the new image that are most similar to the cat and dog images and, based on the most similar features, put it in either the cat or the dog category.

Why Do We Need a K-NN Algorithm?
Suppose there are two categories, Category A and Category B, and we have a new data point x1. Which of these categories does this data point belong to? To solve this type of problem, we need a K-NN algorithm.
With the help of K-NN, we can easily identify the category or class of a new data point. Consider the diagram below.

How Does K-NN Work?
The working of K-NN can be explained with the following algorithm:
Step-1: Select the number K of neighbors.
Step-2: Calculate the Euclidean distance from the new point to the existing data points.
Step-3: Take the K nearest neighbors according to the calculated Euclidean distance.
Step-4: Among these K neighbors, count the number of data points in each category.
Step-5: Assign the new data point to the category for which the number of neighbors is maximum.
Step-6: The model is ready.

First, we choose the number of neighbors, for example k=5. Next, we calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, as studied in geometry; for two points (x1, y1) and (x2, y2) it is sqrt((x2 - x1)^2 + (y2 - y1)^2). By calculating the Euclidean distance we find the nearest neighbors, say three nearest neighbors in Category A and two nearest neighbors in Category B, so the new point is assigned to Category A. Consider the image below. A short code sketch of this procedure follows this section.

How to Select the Value of K in the K-NN Algorithm?
Some points to remember while selecting the value of K:
- There is no particular way to determine the best value for K, so we need to try several values and pick the best one.
- A commonly used starting value for K is 5.
- A very low value for K, such as K=1 or K=2, can be noisy and sensitive to outliers.
- Large values for K smooth out noise, but can blur the boundaries between categories and increase computation.

Advantages of the KNN Algorithm:
- It is simple to implement.
- It is robust to noisy training data.
- It can be more effective if the training data is large.

Disadvantages of the KNN Algorithm:
- The value of K always needs to be determined, which can be complex at times.
- The computation cost is high, because the distance to every training sample must be calculated for each new data point.
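A small sketch of the K-NN steps above written from scratch (Euclidean distance, k=5, majority vote). The toy points and category labels are assumptions made up for illustration:

import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k=5):
    """Classify new_point by majority vote among its k nearest neighbours."""
    # Steps 2-3: Euclidean distance to every stored point, then keep the k closest
    distances = sorted(
        (math.dist(p, new_point), label) for p, label in zip(train_points, train_labels)
    )
    nearest = [label for _, label in distances[:k]]
    # Steps 4-5: count labels among the k neighbours and pick the majority category
    return Counter(nearest).most_common(1)[0][0]

# Toy data: Category A clustered near (1, 1), Category B near (5, 5)
points = [(1, 1), (1, 2), (2, 1), (2, 2), (5, 5), (5, 6), (6, 5)]
labels = ["A", "A", "A", "A", "B", "B", "B"]

print(knn_predict(points, labels, new_point=(2.5, 2.0), k=5))  # expected: "A"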
Random Forest Algorithm
Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It can be used for both classification and regression problems in ML. It is based on the concept of ensemble learning, which is the process of combining multiple classifiers to solve a complex problem and improve the performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of those predictions, predicts the final output. A greater number of trees in the forest leads to higher accuracy and helps prevent overfitting.

Assumptions for Random Forest
Since the random forest combines multiple trees to predict the class of the dataset, it is possible that some decision trees predict the correct output while others do not; together, however, the trees predict the correct output. Two assumptions for a better random forest classifier are:
- There should be some actual signal in the feature variables of the dataset so that the classifier can predict accurate results rather than guesses.
- The predictions from the individual trees must have very low correlations with each other.

Why Use Random Forest?
Some points that explain why we should use the Random Forest algorithm:
- It takes less training time compared to many other algorithms.
- It predicts output with high accuracy, and it runs efficiently even on large datasets.
- It can maintain accuracy even when a large proportion of the data is missing.

How Does the Random Forest Algorithm Work?
Random Forest works in two phases: first, create the random forest by combining N decision trees; second, make predictions with each tree created in the first phase. The working process can be explained in the following steps (and in the diagram):
Step-1: Select K random data points from the training set.
Step-2: Build a decision tree associated with the selected data points (subset).
Step-3: Choose the number N of decision trees you want to build.
Step-4: Repeat Steps 1 and 2 until N trees are built.
Step-5: For a new data point, find the prediction of each decision tree and assign the new data point to the category that wins the majority vote.

The working of the algorithm can be better understood with the following example:
Example: Suppose there is a dataset that contains multiple fruit images. This dataset is given to the Random Forest classifier. The dataset is divided into subsets and each subset is given to one decision tree. During the training phase, each decision tree produces a prediction result, and when a new data point occurs, the Random Forest classifier predicts the final decision based on the majority of results. Consider the image below, and the code sketch after this section.

Applications of Random Forest
There are four main sectors where Random Forest is mostly used:
1. Banking: The banking sector mostly uses this algorithm for identifying loan risk.
2. Medicine: With the help of this algorithm, disease trends and disease risks can be identified.
3. Land Use: We can identify areas of similar land use with this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.

Advantages of Random Forest
- Random Forest is capable of performing both classification and regression tasks.
- It is capable of handling large datasets with high dimensionality.
- It enhances the accuracy of the model and helps prevent overfitting.

Disadvantages of Random Forest
- Although Random Forest can be used for both classification and regression tasks, it is less well suited to regression tasks.
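A brief sketch of the two-phase idea above using scikit-learn's RandomForestClassifier; the dataset and parameter values are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Phase 1: build N decision trees, each on a bootstrap subset of the training data
forest = RandomForestClassifier(n_estimators=100,      # N trees
                                max_features="sqrt",   # random feature subset per split
                                random_state=1)
forest.fit(X_train, y_train)

# Phase 2: each tree votes; the majority vote is the final prediction
print("test accuracy:", accuracy_score(y_test, forest.predict(X_test)))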
Decision Tree Classification Algorithm
- Decision Tree is a supervised learning technique that can be used for both classification and regression problems, but it is mostly preferred for solving classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents an outcome.
- In a decision tree there are two kinds of nodes: decision nodes and leaf nodes. Decision nodes are used to make a decision and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches.
- The decisions or tests are performed on the basis of the features of the given dataset.
- It is a graphical representation for obtaining all the possible solutions to a problem/decision based on given conditions.
- It is called a decision tree because, like a tree, it starts at the root node, which expands into further branches and constructs a tree-like structure.
- To build a tree, we use the CART algorithm, which stands for Classification and Regression Tree.
- A decision tree simply asks a question and, based on the answer (Yes/No), splits further into subtrees. The diagram below explains the general structure of a decision tree.

Why Use Decision Trees?
There are various algorithms in machine learning, so choosing the best algorithm for the given dataset and problem is the main point to remember when creating a machine learning model. Two reasons for using a decision tree are:
- Decision trees usually mimic human thinking while making a decision, so they are easy to understand.
- The logic behind a decision tree can be easily understood because it has a tree-like structure.

Decision Tree Terminologies

How Does the Decision Tree Algorithm Work?
In a decision tree, to predict the class of a given record, the algorithm starts from the root node of the tree. It compares the value of the root attribute with the corresponding attribute of the record and, based on the comparison, follows the branch and jumps to the next node. For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues this process until it reaches a leaf node of the tree. The complete process can be summarized in the following steps:
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain the possible values of the best attribute.
Step-4: Generate the decision tree node containing the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step 3. Continue this process until a stage is reached where the nodes cannot be classified further; these final nodes are the leaf nodes.

Example: Suppose a candidate has a job offer and wants to decide whether to accept it or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by ASM). The root node splits further into the next decision node (Distance from the office) and one leaf node, based on the corresponding labels. The next decision node splits into one decision node (Cab facility) and one leaf node. Finally, that decision node splits into two leaf nodes (Offer accepted and Offer declined). Consider the diagram below.

Attribute Selection Measures
When implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure (ASM). Using this measure, we can easily select the best attribute for the nodes of the tree. There are two popular ASM techniques (a worked sketch follows this section):
- Information Gain
- Gini Index

Information Gain:
- Information gain measures the change in entropy after a dataset is segmented on an attribute. It calculates how much information a feature provides about a class.
- According to the value of information gain, we split the node and build the decision tree.
- A decision tree algorithm always tries to maximize information gain, and the node/attribute with the highest information gain is split first. It can be calculated with the formula:
Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]

Entropy: Entropy is a metric that measures the impurity of a given attribute. It specifies the randomness in the data. Entropy can be calculated as:
Entropy(S) = -P(yes) * log2(P(yes)) - P(no) * log2(P(no))
where P(yes) and P(no) are the proportions of "yes" and "no" samples in S.

Gini Index:
- The Gini index is a measure of impurity or purity used while creating a decision tree in the CART (Classification and Regression Tree) algorithm.
- An attribute with a low Gini index should be preferred over one with a high Gini index.
- The CART algorithm uses the Gini index to create splits, and it only creates binary splits.
- The Gini index can be calculated using the formula:
Gini Index = 1 - sum over classes j of (P_j)^2
where P_j is the proportion of samples of class j in the node.
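A small worked sketch of the two attribute selection measures above, computed in plain Python on a hypothetical yes/no split; the sample counts are made up purely for illustration:

import math

def entropy(labels):
    """Entropy(S) = -sum of p * log2(p) over the classes present in labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gini(labels):
    """Gini index = 1 - sum of p^2 over the classes present in labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

# Hypothetical parent node: 9 "yes" and 5 "no" samples
parent = ["yes"] * 9 + ["no"] * 5
# A candidate attribute splits it into two child nodes
left = ["yes"] * 6 + ["no"] * 1
right = ["yes"] * 3 + ["no"] * 4

weighted_child_entropy = (len(left) / len(parent)) * entropy(left) \
                       + (len(right) / len(parent)) * entropy(right)
info_gain = entropy(parent) - weighted_child_entropy

print("entropy(parent):", round(entropy(parent), 3))
print("information gain of the split:", round(info_gain, 3))
print("gini(parent):", round(gini(parent), 3))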
Logistic Regression
What is Logistic Regression?
- Logistic regression is used for binary classification. It uses the sigmoid function, which takes the independent variables as input and produces a probability value between 0 and 1.
- For example, with two classes, Class 0 and Class 1: if the value of the logistic function for an input is greater than 0.5 (the threshold value), the input belongs to Class 1; otherwise it belongs to Class 0.
- Logistic regression predicts the output of a categorical dependent variable. Therefore, the outcome must be a categorical or discrete value: Yes or No, 0 or 1, True or False, and so on. However, instead of giving exactly 0 or 1, it gives probabilistic values that lie between 0 and 1.

Types of Logistic Regression
On the basis of the categories, logistic regression can be classified into three types:
1. Binomial: In binomial logistic regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
2. Multinomial: In multinomial logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep".
3. Ordinal: In ordinal logistic regression, there can be 3 or more possible ordered types of the dependent variable, such as "low", "medium", or "high".

Logistic Function - Sigmoid Function
- The sigmoid function is a mathematical function used to map the predicted values to probabilities. It maps any real value to a value within the range 0 to 1.
- The output of logistic regression must be between 0 and 1 and cannot go beyond this limit, so it forms an "S"-shaped curve. This S-shaped curve is called the sigmoid function or the logistic function: sigmoid(z) = 1 / (1 + e^(-z)).
- In logistic regression we use the concept of a threshold value, which decides between 0 and 1: values above the threshold tend toward 1, and values below the threshold tend toward 0.
- We pass the input z through the sigmoid function and obtain the probability between 0 and 1, i.e., the predicted y. If the output of the sigmoid function is more than 0.5, we classify the outcome as 1 (YES), and if it is less than 0.5, we classify it as 0 (NO). For example, if the output is 0.75, we can say in terms of probability that there is a 75 percent chance that the patient will suffer from cancer. A small sketch of this sigmoid-plus-threshold idea follows this section.

How Does Logistic Regression Work?
- Prepare the data: The data should be in a format where each row represents a single observation and each column represents a different variable. The target variable (the variable you want to predict) should be binary (yes/no, true/false, 0/1).
- Train the model: We teach the model by showing it the training data. This involves finding the values of the model parameters that minimize the error on the training data.
- Evaluate the model: The model is evaluated on held-out test data to assess its performance on unseen data.
- Use the model to make predictions: After the model has been trained and assessed, it can be used to forecast outcomes on new data.

Advantages of the Logistic Regression Algorithm
- Logistic regression performs well when the data is linearly separable.
- It does not require many computational resources and is highly interpretable.
- Scaling of input features is not a strict requirement, and it needs relatively little hyperparameter tuning.
- It is easy to implement and to train.
- It gives a measure of how relevant a predictor is (coefficient size) and the direction of its association (positive or negative).
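A minimal sketch of the sigmoid-plus-threshold idea described above, in plain Python; the weight, bias, and feature values are hypothetical numbers chosen only for illustration:

import math

def sigmoid(z):
    """Map any real value z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w=1.2, b=-3.0, threshold=0.5):
    """Linear score z = w*x + b, squashed by the sigmoid, then thresholded at 0.5."""
    p = sigmoid(w * x + b)
    return (1 if p >= threshold else 0), p

for x in [1.0, 2.5, 4.0]:
    label, prob = predict(x)
    print(f"x={x}: probability of class 1 = {prob:.2f} -> predicted class {label}")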
Support Vector Machine (SVM) Algorithm
- Support Vector Machine (SVM) is a powerful machine learning algorithm used for linear or non-linear classification, regression, and even outlier detection tasks.
- SVMs can be used for a variety of tasks, such as text classification, image classification, spam detection, handwriting identification, gene expression analysis, face detection, and anomaly detection.
- SVM algorithms are very effective because they try to find the maximum separating hyperplane between the different classes in the target feature.
- SVM is a supervised machine learning algorithm used for both classification and regression. Although it can handle regression problems, it is best suited to classification.
- The main objective of the SVM algorithm is to find the optimal hyperplane in an N-dimensional space that separates the data points of the different classes in the feature space.

Types of SVM
SVM can be of two types:
- Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two classes with a single straight line, the data is termed linearly separable, and the classifier used is the linear SVM classifier.
- Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot be classified with a straight line, the data is termed non-linear, and the classifier used is the non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM Algorithm
Hyperplane: There can be multiple lines/decision boundaries that segregate the classes in n-dimensional space, but we need to find the best decision boundary that helps classify the data points. This best boundary is known as the hyperplane of the SVM.
Support Vectors: The data points or vectors that are closest to the hyperplane and that affect the position of the hyperplane are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.

How Does SVM Work?
The working of the SVM algorithm can be understood through an example. Suppose we have a dataset with two tags (green and blue) and two features, x1 and x2. We want a classifier that can classify a pair (x1, x2) of coordinates as either green or blue. Consider the image below. The SVM algorithm helps find the best line or decision boundary; this best boundary or region is called the hyperplane. The SVM algorithm finds the points from both classes that are closest to the line; these points are called support vectors. The distance between these vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
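A short sketch of fitting a maximum-margin (linear) SVM with scikit-learn's SVC; the generated blobs dataset and the parameter values are illustrative assumptions:

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two linearly separable clusters standing in for the "green" and "blue" tags
X, y = make_blobs(n_samples=100, centers=2, random_state=6)

# Linear kernel: find the maximum-margin hyperplane between the two classes
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("support vectors per class:", clf.n_support_)
print("hyperplane coefficients:", clf.coef_[0], "intercept:", clf.intercept_[0])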
Non-Linear SVM:
If data is linearly arranged, we can separate it with a straight line, but non-linear data cannot be separated with a single straight line. Consider the image below. To separate such data points, we need to add one more dimension. For linear data we have used two dimensions, x and y, so for non-linear data we add a third dimension z, which can be calculated as:
z = x^2 + y^2
With this extra dimension, the data points become separable by a plane in the higher-dimensional space, which corresponds to a non-linear boundary in the original two dimensions.
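A brief sketch of this idea: the concentric-circles dataset cannot be split by a straight line, but adding the feature z = x^2 + y^2 (or, in practice, using a non-linear kernel) makes it separable. The dataset and parameters are illustrative assumptions:

import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Non-linear data: one class forms a ring around the other
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Option 1: add the extra dimension z = x^2 + y^2 by hand, then use a linear SVM
z = (X ** 2).sum(axis=1).reshape(-1, 1)
X3 = np.hstack([X, z])
linear_on_lifted = SVC(kernel="linear").fit(X3, y)
print("accuracy with explicit z = x^2 + y^2:", linear_on_lifted.score(X3, y))

# Option 2: let a non-linear (RBF) kernel do the lifting implicitly
rbf = SVC(kernel="rbf").fit(X, y)
print("accuracy with RBF kernel:", rbf.score(X, y))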
SILVER OAK COLLEGE OF COMPUTER APPLICATION
SUBJECT: MACHINE LEARNING
TOPIC: Unit 3 - Unsupervised Learning

Definition, Key Characteristics, and Applications

Definition
Machine learning (ML) is a branch of artificial intelligence (AI) that focuses on developing algorithms and statistical models that enable computers to perform specific tasks without using explicit instructions. Instead, they rely on patterns and inference derived from data.

Key Characteristics
1. Data-Driven: ML systems learn from data rather than being programmed with explicit instructions. The quality and quantity of data play a crucial role in the performance of the model.
2. Algorithms: Various algorithms are used to analyze data, build models, and make predictions. These include supervised learning (e.g., regression, classification), unsupervised learning (e.g., clustering, dimensionality reduction), and reinforcement learning.
3. Training and Testing: ML models are trained on a dataset to learn patterns or relationships. After training, they are tested on a separate dataset to evaluate their performance and generalizability.

Applications: The Problem Setup of Discovering Patterns
Identifying patterns, trends, and correlations is an essential task that allows decision-makers to extract important insights from a sea of data. Recognizing these critical characteristics in data matters in sectors such as banking, healthcare, marketing, and more. Patterns are recurring sequences or groupings in data and are frequently hidden beneath the surface. They make future events predictable by providing an elementary understanding of the underlying structure. Recognizing trends, in turn, means determining the trajectory of data points over time; this temporal viewpoint benefits forecasting and strategic decision-making.

Clustering Algorithm: K-Means
K-Means is a popular clustering algorithm used in unsupervised machine learning to partition a dataset into K distinct, non-overlapping subsets or clusters. Here is a brief overview of how it works and some key points.

How K-Means Works
1. Initialization: Choose K initial centroids (these can be selected randomly or using some heuristic).
2. Assignment Step: Assign each data point to the nearest centroid based on a distance metric (commonly Euclidean distance).
3. Update Step: Recalculate the centroids by taking the mean of all points assigned to each centroid.
4. Repeat: Repeat the assignment and update steps until the centroids no longer change significantly or a specified number of iterations is reached.
5. Convergence: The algorithm converges when the centroids stabilize, meaning there is no significant change in their positions or in the cluster assignments.

K-Means Clustering Algorithm
K-Means Clustering is an unsupervised learning algorithm that is used to solve clustering problems in machine learning or data science. In this topic, we will learn what the K-means clustering algorithm is and how it works, along with the Python implementation of K-means clustering.

What is the K-Means Algorithm?
K-Means Clustering is an unsupervised learning algorithm that groups an unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process: if K=2 there will be two clusters, for K=3 there will be three clusters, and so on. The diagram below explains the working of the K-Means clustering algorithm.

How Does the K-Means Algorithm Work?
The working of the K-Means algorithm is explained in the following steps:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids (they need not come from the input dataset).
Step-3: Assign each data point to its closest centroid, which forms the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat Step 3, i.e., reassign each data point to the new closest centroid of its cluster.
Step-6: If any reassignment occurred, go to Step 4; otherwise go to FINISH.
Step-7: The model is ready.

Dimensionality Reduction: Principal Component Analysis
Principal Component Analysis (PCA) is a linear technique used to reduce the dimensionality of data by projecting it onto a new set of orthogonal axes (principal components) that maximize variance. These principal components are linear combinations of the original features and are arranged in descending order of variance.

Key Characteristics
1. Variance Maximization: PCA seeks the directions (principal components) that capture the maximum variance in the data. The first principal component accounts for the most variance, the second for the next highest variance orthogonal to the first, and so on.
2. Orthogonality: Principal components are orthogonal (uncorrelated) to each other. This ensures that each component represents a unique dimension of the data.
3. Linear Transformation: PCA is a linear method, meaning it transforms the data using linear combinations of the original features. It does not capture non-linear relationships in the data.
4. Eigenvalues and Eigenvectors: PCA involves calculating the eigenvalues and eigenvectors of the covariance matrix of the data. The eigenvectors represent the principal components, while the eigenvalues indicate the amount of variance captured by each component.
5. Data Centering: PCA requires the data to be centered (i.e., the mean subtracted) so that the principal components are computed based on the variance about the mean.
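A short sketch applying the two unsupervised techniques above with scikit-learn: K-Means for clustering and PCA for dimensionality reduction. The iris data, K=3, and two retained components are illustrative assumptions:

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)        # unlabeled use: the targets are ignored
X = StandardScaler().fit_transform(X)    # centering/scaling, since PCA expects centered data

# K-Means: choose K, then iterate the assignment and centroid-update steps until convergence
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print("cluster sizes:", [int((labels == k).sum()) for k in range(3)])

# PCA: project the 4-dimensional data onto the 2 directions of maximum variance
pca = PCA(n_components=2)
X2 = pca.fit_transform(X)
print("variance explained by the first two components:", pca.explained_variance_ratio_)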
