Summary

This document provides an introduction to machine learning in R, covering classification and clustering. It discusses data mining techniques and the use of R in data analysis for various applications.

Full Transcript

Machine Learning in R: Classification and Clustering

Data Mining
Data mining is the use of efficient techniques to analyze large collections of data and to extract useful, possibly unexpected, patterns from them.
Big data refers to huge amounts of data, to complex data, or to a huge amount of complex data.

Huge amounts of data:
Twitter: 300 million tweets every day.
Wikipedia: 4 million articles.
Facebook: 500 million users.
Walmart: 20 million transactions per day.
Genomic sequences (http://www.1000genomes.org/page.php): the full sequences of 1000 individuals, at 3×10^9 nucleotides per person, give 3×10^12 nucleotides in total.

Complex data:
Multiple types of data: tables, time series, images, graphs, etc.
Spatial (geographic information), temporal (a state in time), or directional (directions or angles) aspects.
Interconnected data of different types: from a mobile phone we can collect the user's location, friendship information, check-ins to venues, opinions through Twitter, images through cameras, and queries to search engines.

Efficient tools for data mining

What is Machine Learning?
Machine learning is an important component of the growing field of data science.
Machine learning is a subset of AI that involves training algorithms to learn patterns from data.
A model is the result of applying a machine learning algorithm to data. Models represent what a machine learning system has learned from the training data in a data mining project.

Machine Learning Elements:

Branches of Machine Learning
Supervised Learning (Classification): Logistic Regression, Decision Tree, Random Forest, K Nearest Neighbor, Naïve Bayes.
Unsupervised Learning (Clustering): K-means, Hierarchical clustering.

What is Supervised Learning?
Definition: A type of machine learning where the model is trained on labeled data, meaning each training example is paired with an output label.
Goal: To learn a mapping from inputs to outputs.
Examples: classification tasks (e.g., spam detection) and regression tasks (e.g., predicting house prices).
Applications of Supervised Learning: image classification, speech recognition, medical diagnosis.

Supervised Learning: More Control, Less Bias
Aim: Prediction.
Supervised algorithms are also called predictive algorithms.
An error function evaluates the model's predictions. The algorithm repeats this evaluation to optimize the model for as long as the fit can be improved.
The optimized model can then make predictions for new observations.

Types of Supervised Learning
Regression: predicting continuous values. A regression problem is one where the output variable is a real value, such as “dollars” or “weight”. Examples: predicting the price of a house from its features, predicting the impact of SAT/GRE scores on college admissions, predicting sales from input parameters.
Classification: assigning data to categories according to some characteristics. In a classification problem the output variable is categorical, such as “red” or “blue”, or “disease” and “no disease”. Classification is the problem of identifying which of a set of categories an observation belongs to.

Applications of Machine Learning in Daily Life
Gmail classification: Gmail sorts emails into the groups Primary, Promotions, Social, and Updates, and labels emails as important.
Gmail spam filter: Gmail filters 99.9% of spam messages.
Image recognition on social media: Facebook uses facial recognition to tag people in users’ photos. When a new photo is uploaded, it automatically detects faces and suggests friends to tag.
Medical diagnosis: recognizing cancerous tissue based on some features (cancerous tissue versus healthy tissue).

What is Unsupervised Learning?
Definition: A type of machine learning that deals with unlabeled data, focusing on identifying patterns and structures within the data.
Goal: To find hidden patterns or intrinsic structures in the input data.
Examples: clustering (e.g., customer segmentation) and dimensionality reduction (e.g., PCA).
Applications of Unsupervised Learning: market basket analysis, anomaly detection.

Unsupervised Learning: Speed and Scale
Aim: Description.
Deciphering the relationships, correlations, and hidden structure of data.
Unsupervised algorithms are also called descriptive algorithms.
There is no error function to evaluate the model. At no point do we know the correct output with certainty.
No new data (no prediction).

Types of Unsupervised Learning
Clustering: splitting the dataset into groups based on similarity.
Dimensionality reduction: reducing the number of variables in a data set.
Anomaly detection: identifying unusual data points in a data set.
Association mining: identifying sets of items in a data set that frequently occur together.

Applications of Machine Learning in Daily Life
Amazon’s “Frequently bought together” recommendations: a recommender system analyzes buyer baskets and detects cross-category purchase correlations. The company aims to create more effective up-selling and cross-selling strategies and to suggest products based on how frequently particular items are found together in one shopping cart.
Netflix recommendation engine: shows users content based on their likes and what they watch. Netflix uses a machine learning algorithm to learn users’ likes and dislikes, then uses this data to evaluate which content a user may like and recommends it to them.

Applications of Machine Learning in Genetics
Gene clustering: gene expression data hide vital information needed to understand the biological processes that take place in a particular organism in relation to its environment. Deciphering the hidden patterns in gene expression data offers a major opportunity to strengthen the understanding of functional genomics.
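
The clustering idea described above (splitting a dataset into groups based on similarity, with no labels) can be sketched with k-means, one of the clustering methods listed earlier. This is a minimal illustration in plain Python rather than R, using made-up two-dimensional points. The function name `kmeans`, the choice of initial centers, and the data are assumptions for the example and do not come from the slides.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Basic k-means: alternate an assignment step and a center-update step."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centers.append(centers[k])  # keep the center of an empty cluster
        if new_centers == centers:  # nothing moved, so the clustering is stable
            break
        centers = new_centers
    return labels, centers

# Hypothetical unlabeled data: two visually separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = kmeans(points, centers=[points[0], points[3]])
print(labels)   # cluster index assigned to each point
print(centers)  # final cluster centers
```

Note that nothing in the sketch checks the result against a known answer: the cluster indices are a description of the data, which matches the point above that unsupervised learning has no error function.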
Key differences:
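
One difference described in the earlier sections is that supervised algorithms have an error function computed against known labels, while unsupervised algorithms do not. A minimal sketch of the supervised side, in plain Python rather than R: a 1-nearest-neighbor classifier (a special case of the K Nearest Neighbor method listed earlier) evaluated on held-out labeled data. The function names, feature values, and labels are hypothetical and chosen only for illustration.

```python
import math

def predict_1nn(train_x, train_y, x):
    """Return the label of the training point closest to x."""
    nearest = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    return train_y[nearest]

def error_rate(train_x, train_y, test_x, test_y):
    """Error function: fraction of test points whose predicted label is wrong."""
    wrong = sum(
        predict_1nn(train_x, train_y, x) != y for x, y in zip(test_x, test_y)
    )
    return wrong / len(test_x)

# Hypothetical labeled data: each feature pair comes with a known class label.
train_x = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 8.5)]
train_y = ["healthy", "healthy", "disease", "disease"]
test_x = [(1.2, 1.4), (8.5, 9.0), (5.5, 5.5)]
test_y = ["healthy", "disease", "healthy"]

# Evaluate the model against labels it has not seen.
test_error = error_rate(train_x, train_y, test_x, test_y)
print(test_error)

# Predict the class of a new, unlabeled observation.
new_label = predict_1nn(train_x, train_y, (2.0, 1.0))
print(new_label)
```

Because every test point has a known label, the model's output can be scored and the model compared with alternatives, which is the step a clustering algorithm lacks.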
