Machine Learning Lectures - PDF
Summary
These lectures provide an overview of machine learning, including various concepts and techniques, alongside a bibliography of related works.
Full Transcript
Machine learning 1 / 23

Bibliography
1. Larose, Daniel T. Discovering Knowledge in Data: An Introduction to Data Mining. Wiley, 2014.
2. Han, Jiawei; Kamber, Micheline. Data Mining: Concepts and Techniques. Elsevier, 2006.
3. Tan, Pang-Ning; Steinbach, Michael; Kumar, Vipin. Introduction to Data Mining. Pearson, 2014.
4. James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning. New York: Springer, 2013.
5. Raschka, Sebastian; Liu, Yuxi; Mirjalili, Vahid. Machine Learning with PyTorch and Scikit-Learn. Packt, 2022.
2 / 23

Bibliography
1. Saed Sayad. Data Mining Map. http://www.saedsayad.com/data_mining_map.htm
2. Analytics, Data Mining, and Data Science. http://www.kdnuggets.com/
3. Kaggle. https://www.kaggle.com/datasets?fileType=csv
3 / 23

What big data looks like? The collection of the University of Lodz Library contains approximately 2.8 million volumes. If the average size of a document was 1 MB (although it is usually larger), the library would take up 30 terabytes. Meanwhile, the database of courier shipments in a logistics company is about 20 terabytes. 4 / 23

Machine learning. Machine learning (ML) is a field of computer science that studies algorithms and techniques for automating solutions to complex problems. The term machine learning was coined around 1960 and consists of two words: machine, which corresponds to a computer, robot, or other device, and learning, which refers to an activity intended to acquire or discover event patterns, something we humans are good at. 5 / 23

Machine learning. Machine learning is often also referred to as data mining or predictive analytics. 6 / 23

Relation with Artificial Intelligence (AI). Artificial intelligence (AI) is a much broader field of study than machine learning (ML is a subfield of AI). AI is about making machines intelligent using multiple approaches, whereas ML is essentially about one approach: making machines that can learn to perform tasks. An example of an AI approach that is not based on learning is the development of expert systems. 7 / 23

Definition of data mining (Graham Williams). Data mining is the art and science of intelligent data analysis. The aim is to discover meaningful insights and knowledge from data. Discoveries are often expressed as models, and we often describe data mining as the process of building models. A model captures, in some formulation, the essence of the discovered knowledge. A model can be used to assist in our understanding of the world. Models can also be used to make predictions. 8 / 23

Data mining vs machine learning.
1. Data mining is a technique of discovering different kinds of patterns inherent in a data set that are precise, new, and useful. Data mining works as a subset of business analytics and is similar to experimental studies. Data mining has its origins in databases and statistics.
2. Machine learning includes algorithms that automatically improve through data-based experience. Machine learning is a way to find a new algorithm from experience. It includes the study of algorithms that can automatically extract patterns from data. Machine learning utilizes data mining techniques and other learning algorithms to construct models of what is happening behind certain information so that it can predict future results.
Data mining and machine learning are areas that influence each other and have many things in common. 9 / 23

What is not DM and ML. Data mining and machine learning are not OLAP.
10 / 23

What is not DM and ML. Reporting (OLAP-style) questions contrasted with data mining and machine learning questions:
1. How many customers who bought a suit bought a shirt? — What product did the customers who bought the suit buy?
2. Which customers are not paying back the loan? — What credit risk does the customer pose?
3. Which customers have not renewed their insurance policies? — Which customers may leave for another company?
11 / 23

Data analysis process. The most commonly used approach is the Cross-Industry Standard Process for Data Mining (CRISP-DM, 1996):
1. Problem Understanding or Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment
12 / 23

Data analysis process - Problem Understanding or Business Understanding. This initial phase focuses on understanding the project aims and requirements from a business perspective, then converting this knowledge into a data analysis problem definition and a preliminary plan designed to achieve the aims. 13 / 23

Data analysis process - Data Understanding. The data understanding phase starts with an initial data collection and proceeds with activities intended to get familiar with the data and to identify data quality problems. After this step you should know the answers to the following questions:
1. Where did the data come from?
2. Who collected them and what methods did they use?
3. What do the rows and columns in the data mean?
4. Are there any obscure symbols or abbreviations in the data?
14 / 23

Data analysis process - Data Preparation. The data preparation phase covers all activities needed to construct the final dataset from the initial raw data. Data preparation most often requires:
1. joining several data sets,
2. reducing the number of variables to only those which will be relevant for the process,
3. data cleaning (removal of anomalies, reformatting, normalisation, handling missing data).
15 / 23

Data analysis process - Modeling. In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. 16 / 23

Data analysis process - Evaluation. Process evaluation consists of: determining whether the model or models meet the assumptions established in the first stage (quality and efficiency); verifying whether there are any important business or research objectives that have not been taken into account; deciding on the further use of the results. 17 / 23

Data analysis process - Deployment. Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use. 18 / 23

Data analysis engineering - summary. http://www.kdnuggets.com/2017/02/analytics-grease-monkeys.html and https://www.purpleslate.com/what-is-data-mining/ 19 / 23

Python 4 ML. Why is Python preferred for machine learning? Python is known for its readability and simplicity, making it easy for beginners to grasp and valuable for experts due to its clear and intuitive syntax. Python offers many libraries and frameworks for machine learning and data analysis, such as Scikit-learn, TensorFlow, PyTorch, Keras, and Pandas. These libraries provide prebuilt functions and utilities for mathematical operations, data manipulation, and machine learning tasks, reducing the need to write code from scratch. 20 / 23

Python 4 ML. Why is Python preferred for machine learning? Python has a large and active community, providing ample tutorials, forums, and documentation for support, troubleshooting, and collaboration.
The community ensures regular updates and optimization of libraries, keeping them up to date with the latest features and performance improvements. Python's flexibility makes it suitable for projects of any scale, from small experiments to large, complex systems, and across various stages of software development and machine learning workflows. 21 / 23

Python 4 ML.
NumPy - fundamental for scientific computing with Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of high-level mathematical functions to operate on these arrays.
Pandas - essential for data manipulation and analysis, Pandas provides data structures and operations for manipulating numerical tables and time series. It is ideal for data cleaning, transformation, and analysis.
Matplotlib - great for creating static, interactive, and animated visualizations in Python. Matplotlib is highly customizable and can produce graphs and charts of publication quality.
22 / 23

Python 4 ML.
Scikit-learn - provides a range of supervised and unsupervised learning algorithms via a consistent interface. It includes methods for classification, regression, clustering, and dimensionality reduction, as well as tools for model selection and evaluation.
SciPy - built on NumPy, SciPy extends its capabilities by adding more sophisticated routines for optimization, regression, interpolation, and eigenvector decomposition, making it useful for scientific and technical computing.
23 / 23

Thank you for your attention!!! 24 / 23

Machine learning 1 / 43

Data analysis process. Cross-Industry Standard Process for Data Mining (CRISP-DM, 1996):
1. Problem Understanding or Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment
The quality of the data and the amount of useful information that it contains are key factors that determine how well a machine learning algorithm can learn. Therefore, it is absolutely critical to examine and preprocess a dataset before we feed it to a machine learning algorithm. 2 / 43

Data. The word data is the plural of datum. 3 / 43

Data. The term was first used by Euclid, in the work Dedomena. In Euclid's work, a datum is a quantity resulting directly from the terms of a given problem. 4 / 43

Data in the form of records.
Each record (object, sample, observation) is described by a set of attributes (variables).
Each observation (record) has a fixed number of attributes (i.e. a fixed tuple length), so it can be considered as a vector in a multidimensional space whose dimension equals the number of attributes.
The data set can be represented as a matrix of type $m \times n$, where each of the $m$ rows corresponds to an observation and each of the $n$ columns corresponds to an attribute: $D = \{x_{ij}\}$, $i = 1, \dots, m$, $j = 1, \dots, n$.
5 / 43

Data in the form of records. 6 / 43

Data in the form of records. Input variables are also referred to as independent variables, observed variables or descriptive variables. Output variables are dependent on the input variables; they are referred to as target, response or dependent variables. 7 / 43

Transaction-based sets. Each transaction (purchase, observation) is assigned a vector; transaction components denote goods, objects, etc. 8 / 43

Data in graph form. The vertices of the graph are used to store the data, while the edges indicate the relationships between the data. 9 / 43
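To make the record/attribute picture concrete, here is a minimal sketch in Python (pandas assumed available; the column names and values are invented for illustration) of a dataset represented as an m x n table of records, with the descriptive variables separated from the target variable:

```python
# A hypothetical toy dataset: m observations (rows) described by n attributes (columns).
import pandas as pd

records = pd.DataFrame({
    "age":     [25, 41, 33, 58],           # quantitative attribute
    "gender":  ["F", "M", "F", "M"],       # qualitative (categorical) attribute
    "income":  [3200, 5400, 4100, 6100],   # quantitative attribute
    "default": ["no", "no", "yes", "no"],  # output / target (dependent) variable
})

X = records.drop(columns="default")   # input / independent / descriptive variables
y = records["default"]                # output / dependent / response variable

print(records.shape)   # (m, n) = (4, 4)
print(X.head())
```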
Data quality. Machine learning algorithms are very sensitive to the quality of the source data; GIGO (Garbage In, Garbage Out): the results of processing incorrect data will be wrong regardless of the correctness of the processing procedure. Data properties: completeness, correctness, actuality. 10 / 43

Noise in the data - label noise: inconsistent observations; classification errors - observations that are labelled as a class other than the actual class. Often, the term noisy data is used as a synonym for corrupted data. 11 / 43

Noise in the data - attribute noise: this refers to incorrect values of one or more attributes.
Wrong attribute values (1.02, green, class = positive) - we assume that an attribute has a bad value.
Missing or unknown attribute values (2.05, ?, class = negative) - we do not know the value of the second attribute.
Incomplete attributes or values (=, green, class = positive) - the system cannot understand and correctly interpret the values.
Outliers.
12 / 43

Types of Variable. The data type of attributes helps analysts select the correct method for data analysis and visualization plots. We can divide attributes into just two types:
qualitative (non-measurable, categorical) - refers to names or labels of categorized variables; cannot be uniquely characterised by numbers; for example product name, brand name, zip code, state, gender, marital status, or the size of a T-shirt: small, medium, or large;
quantitative (numeric) - presented as integer or real values, e.g. the height of a person or the number of articles sold.
13 / 43

Types of Variable. Qualitative data:
the set of available values is always limited (e.g. in the case of months, to 12); for this reason the values of categorical variables are called labels;
nominal attributes - the value of a nominal attribute can be the symbol or name of an item. The values are categorical, qualitative, and unordered in nature, such as product name, brand name, zip code, state, gender, and marital status;
ordinal attributes - refer to names or labels with a meaningful order or ranking. These attributes measure subjective qualities, which is why they are used in surveys for customer satisfaction ratings, product ratings, and movie rating reviews. Customer satisfaction ratings appear in the following order: 1: Very dissatisfied, 2: Somewhat dissatisfied, 3: Neutral, 4: Satisfied, 5: Very satisfied;
determining the distance between values is only possible within the framework of an adopted model; it is impossible to perform arithmetic operations on them.
14 / 43

Types of Variable. Quantitative data: the values can be compared with each other; it is possible to determine the distance between values; arithmetic operations can be performed on them; the values can be discrete (integers), i.e. a finite or countable set of values, or continuous (real numbers). 15 / 43

Types of Variable - dataset. 16 / 43

Types of Variable. Variables can take one or multiple values. Single-valued variables are otherwise known as constants. Single-valued variables should not be used in the data analysis process as they carry no information. Remark: it should be checked whether a given variable is single-valued in the selected sample or in the entire source data set - if some values of a variable occur very rarely (e.g. once in 500 000 cases), we will probably not find them in a sample of 10 000 rows.
By removing such a variable from the training dataset of a data mining model, we may not only degrade the accuracy of its results, but also prevent it from recognising unusual cases, possibly the most interesting ones. 17 / 43

Types of Variable. Some variables are used to uniquely identify the observation; this could be, for example, some official identification number. The identifier can also be, for example, the date of the observation. Identifiers are not used in the model. 18 / 43

Types of Variable. Another type of variable not useful for predictive models is the monotonic variable. The values of such variables constantly increase or decrease. This type of variable is very common; for example, the values of all time-related variables (such as invoice date or date of birth) are increasing. 19 / 43

Knowledge and information (D. Larose). Let us assume that we run a local shop and that we register all the details in the shop's database. We know our customers' details and what they buy each day. E.g. Alex, Jessica and Paul visit the shop every Sunday and buy candles. What we store in the database is just the data. Every time we want to know who the visitors are who buy the candles, we can search the database and get the answer. This is information. If we want to know how many candles were sold on each day of the week, then we can again direct a query to the database and get the answer - this is also information. 20 / 43

Knowledge and information. But suppose we have many other customers who also buy candles from us every Sunday (mostly with some level of freedom), and all of them are Christians (going to church). So we can conclude that Alex, Jessica and Paul must also be Christians. The religion of Alex, Jessica and Paul was not recorded in the database, so we could not retrieve it as information. We learned it indirectly; it is knowledge that we discovered. Of course, it is possible that our findings about Alex, Jessica and Paul are wrong. Therefore it is important that our knowledge and findings are evaluated correctly. 21 / 43

Data pre-processing. Data pre-processing involves cleaning and transformation of data to prepare it for mining. It is estimated that data preprocessing is 70-80% of the process of knowledge discovery. For example, a database may contain fields that are out of date or redundant, records with missing values, outliers, data in a format unsuitable for machine learning models, or values incompatible with principles or common sense. 22 / 43

Data pre-processing. To illustrate the necessity of data cleaning, we will examine the following example (D. Larose): we will analyse the personal data from the following table (the example table is not reproduced in this transcript). 23 / 43

Descriptive Statistics. Data Mining Map: http://www.saedsayad.com/data_mining_map.htm. The field of statistics helps us to gain an understanding of our data and to quantify what our data and results look like. It also provides us with mechanisms to measure how well our application is performing and to prevent certain machine learning pitfalls (such as under/overfitting). 24 / 43

Distribution. A distribution is a representation of how often values appear within a dataset. 25 / 43

Statistical measures. There are two types of these measures:
central tendency measures - these measure where most of the values are located, or where the center of the distribution is located;
spread or dispersion measures - these measure how the values of the distribution are spread across the distribution's range (from the lowest value to the highest value).
26 / 43

Measures of central tendency. Measures of central tendency include the following:
Mean - this is what you might commonly refer to as an average. We calculate it by summing all of the numbers in the distribution and then dividing by their count. The mean of a sample $x_1, x_2, \dots, x_n$, usually denoted by $\bar{x}$, is
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
27 / 43

Measures of central tendency.
Median - if we sort all of the numbers in our distribution from lowest to highest, this is the number that separates the lowest half of the numbers from the highest half. For a sorted sample $x_1 \le x_2 \le \dots \le x_n$, the median is
$m = x_{(n+1)/2}$ if $n$ is odd, and $m = \frac{1}{2}\left(x_{n/2} + x_{n/2+1}\right)$ if $n$ is even.
28 / 43

Measures of central tendency. Mode - this is the most frequently occurring value in the distribution. 29 / 43

Measures of central tendency. If the mean and median are significantly different, then we can expect that some observations for a given distribution are far from the mean. These are outlier points. 30 / 43

Measures of spread or dispersion. Measures of dispersion include the following:
Maximum - the highest value of the distribution.
Minimum - the lowest value of the distribution.
Range - the difference between the maximum and minimum.
Variance - calculated by taking each value in the distribution, computing its difference from the distribution's mean, squaring this difference, summing the squared differences, and dividing by the number of values in the distribution.
Standard deviation - the square root of the variance.
Quantiles/quartiles - similar to the median, these measures define cut-off points in the distribution where a certain number of lower values are below the measure and a certain number of higher values are above it.
31 / 43

Variance. The variance of a sample $x_1, x_2, \dots, x_n$ is of the form
$s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$,
where $\bar{x}$ is the mean value. The standard deviation expresses a sort of average distance of the data values from the arithmetic mean:
by taking $x_i - \bar{x}$, you are finding the literal difference between the value and the mean of the sample;
by squaring the result, $(x_i - \bar{x})^2$, we put a greater penalty on outliers, because squaring a large error only makes it much larger;
by dividing by the number of items in the sample, we take (literally) the average squared distance between each point and the mean.
32 / 43

Standard deviation. The standard deviation of a sample $x_1, x_2, \dots, x_n$ is of the form
$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$,
where $\bar{x}$ is the mean value. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range. 33 / 43

Quartile. A quartile divides the sorted data points into four parts, or quarters, of more-or-less equal size. The three main quartiles are as follows:
the first quartile $Q_1$ is defined as the middle number between the smallest number (minimum) and the median of the data set. It is also known as the lower or 25th empirical quartile, as 25% of the data is below this point;
the second quartile $Q_2$ is the median of the data set; thus 50% of the data lies below this point;
the third quartile $Q_3$ is the middle value between the median and the highest value (maximum) of the data set. It is known as the upper or 75th empirical quartile, as 75% of the data lies below this point.
34 / 43
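The measures defined above can be computed directly with pandas/NumPy. A minimal sketch on a small, invented sample; note that the slide formulas divide by n, so the population versions are requested with ddof=0 (pandas defaults to the sample versions, which divide by n-1):

```python
import numpy as np
import pandas as pd

x = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

print("mean:     ", x.mean())            # (1/n) * sum(x_i)
print("median:   ", x.median())
print("mode:     ", x.mode().tolist())   # most frequent value(s)
print("variance: ", x.var(ddof=0))       # divides by n, as in the slide formula
print("std dev:  ", x.std(ddof=0))       # square root of the variance
print("min/max:  ", x.min(), x.max())
print("range:    ", x.max() - x.min())
print("quartiles:", np.percentile(x, [25, 50, 75]))   # Q1, Q2 (median), Q3
```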
Quantiles. q-quantiles are values that partition a finite set of values into q subsets of (nearly) equal sizes. In statistics and probability, quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities, or dividing the observations in a sample in the same way. Common quantiles have special names, such as quartiles (four groups), quintiles (five groups), deciles (ten groups), and percentiles (100 groups). 35 / 43

Skewness. Skewness measures the symmetry of a distribution. It shows how much the distribution deviates from a normal distribution. Its value can be zero, positive, or negative. A zero value represents a perfectly symmetric, normal-like shape of a distribution. Positive skewness is shown by the tail pointing toward the right - that is, outliers are skewed to the right and the data are stacked up on the left. Negative skewness is shown by the tail pointing toward the left - that is, outliers are skewed to the left and the data are stacked up on the right. Positive skewness occurs when the mean is greater than the median and the mode; negative skewness occurs when the mean is less than the median and the mode. 36 / 43

Kurtosis. Kurtosis measures the tailedness (thickness of the tails) compared to a normal distribution. High kurtosis is heavy-tailed, which means more outliers are present in the observations; low kurtosis is light-tailed, which means fewer outliers are present in the observations. There are three types of kurtosis shapes: mesokurtic, platykurtic, leptokurtic. 37 / 43

Kurtosis. A normal distribution, with zero excess kurtosis (raw kurtosis equal to 3), is known as a mesokurtic distribution. A platykurtic distribution has negative excess kurtosis and is thin-tailed compared to a normal distribution. A leptokurtic distribution has positive excess kurtosis (raw kurtosis greater than 3) and is fat-tailed compared to a normal distribution. 38 / 43

Covariance. Covariance measures the relationship between a pair of variables. It shows the degree of change in the variables - that is, how a change in one variable affects the other variable. Its value ranges from minus infinity to plus infinity. The problem with covariance is that it does not allow effective conclusions because it is not normalized. 39 / 43

Correlation. Correlation shows how variables are related to each other. Correlation ranges from -1 to 1. A negative value means that an increase in one variable causes a decrease in the other, i.e. the variables move in opposite directions. A positive value means that an increase in one variable causes an increase in the other (or a decrease in one causes a decrease in the other), i.e. the variables move in the same direction. A zero value means that there is no linear relationship between the variables, or that the variables are independent of each other. 40 / 43

Correlation. The method parameter (e.g. of pandas' corr() function) can take one of the following three values (see the sketch below):
pearson: standard correlation coefficient;
kendall: Kendall's tau correlation coefficient;
spearman: Spearman's rank correlation coefficient.
41 / 43

Correlation. Spearman's rank correlation coefficient is Pearson's correlation coefficient computed on the ranks of the observations. It is a non-parametric measure of rank correlation. It assesses the strength of the association between two ranked variables. Ranked variables are ordinal numbers, arranged in order: first we rank the observations, then we compute the correlation of the ranks. It can be applied to both continuous and discrete ordinal variables. When the distribution of the data is skewed or affected by outliers, Spearman's rank correlation is used instead of Pearson's correlation because it makes no assumptions about the data distribution. 42 / 43
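Assuming the method parameter mentioned above refers to pandas' DataFrame.corr(), here is a minimal sketch on invented data showing the three correlation methods:

```python
import pandas as pd

df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6, 7, 8],
    "exam_score":    [52, 55, 61, 60, 68, 74, 80, 85],
})

print(df.corr(method="pearson"))    # standard (linear) correlation coefficient
print(df.corr(method="spearman"))   # Pearson's correlation computed on ranks
print(df.corr(method="kendall"))    # Kendall's tau rank correlation
```

For this roughly monotonic toy data all three coefficients come out close to 1; they diverge when the relationship is nonlinear or when outliers are present.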
Correlation. Kendall's rank correlation coefficient, or Kendall's tau coefficient, is a non-parametric statistic used to measure the association between two ordinal variables. It is a type of rank correlation. It measures the similarity or dissimilarity between two variables. If both variables are binary, then Pearson's = Spearman's = Kendall's tau. 43 / 43

Thank you for your attention!!! 44 / 43

Machine learning 1 / 28

Managing Data.
1. Problem Understanding or Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment
2 / 28

Normal distribution. In machine learning, the Gaussian distribution is also known as the normal distribution. It is a continuous probability distribution that is symmetrical about the mean, and the majority of data falls within one standard deviation of the mean. It is characterized by its bell-shaped curve. Its density is
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$,
where $x$ represents the variable, $\mu$ the mean, $\sigma$ the standard deviation, and $e$ the base of the natural logarithm. 3 / 28

Normal distribution. $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$ 4 / 28

Normal distribution. The standard deviations are used to subdivide the area under the normal curve. Each subdivided section defines the percentage of data which falls into the specific region of the graph. 5 / 28

Normal distribution. The Empirical Rule, also known as the 68-95-99.7 rule, quantifies the proportion of data falling within certain intervals around the mean in a normal distribution. It provides a quick way to estimate the spread of data without performing detailed calculations. 6 / 28

Normal distribution. A smaller standard deviation results in a narrower and taller bell curve, indicating that data points are clustered closely around the mean. Conversely, a larger standard deviation leads to a wider and shorter bell curve, suggesting that data points are more spread out from the mean. 7 / 28

Normal distribution. The standard normal distribution has a mean (central value) of 0 and a standard deviation of 1. 8 / 28

Machine learning methods that use the Gaussian distribution.
In algorithms such as linear regression, logistic regression, and Gaussian mixture models, it is often assumed that the observed data is generated from a Gaussian distribution. This simplifies the model and allows for efficient parameter estimation.
In Bayesian machine learning, the Gaussian distribution is commonly used as a prior distribution over model parameters. This prior distribution reflects our beliefs about the parameters before observing any data and is updated to a posterior distribution using Bayes' theorem.
Anomaly detection, where the goal is to identify rare events or outliers in the data: anomalies are detected based on the likelihood of the data under the Gaussian distribution.
Dimensionality reduction - Principal Component Analysis (PCA) finds the directions of maximum variance in the data, which correspond to the principal components.
Kernel methods - the Gaussian kernel is commonly used in kernelized machine learning algorithms, such as Support Vector Machines (SVMs).
9 / 28

Central limit theorem. Data analysis methods involve hypothesis testing and deciding confidence intervals. Many statistical tests assume that the population is normally distributed. The central limit theorem is the core of hypothesis testing. According to this theorem, the sampling distribution of the mean approaches a normal distribution as the sample size increases. Also, the mean of the sample gets closer to the population mean, and the standard deviation of the sample mean gets smaller. This theorem is essential for working with inferential statistics, helping data analysts figure out how samples can be used to gain insights about the population (see the simulation sketch below). 10 / 28
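A minimal simulation sketch of the central limit theorem; the exponential population and the number of repetitions are arbitrary choices for illustration, while the sample sizes follow the lecture's example (50, 100, 200, 500):

```python
# Means of samples drawn from a non-normal (exponential) population:
# as the sample size grows, the histogram of sample means approaches a normal curve.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
sample_sizes = [50, 100, 200, 500]

fig, axes = plt.subplots(1, 4, figsize=(16, 3))
for ax, n in zip(axes, sample_sizes):
    # 1000 sample means, each computed from a sample of size n
    means = rng.exponential(scale=1.0, size=(1000, n)).mean(axis=1)
    ax.hist(means, bins=30)
    ax.set_title(f"sample size n = {n}")
plt.tight_layout()
plt.show()
```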
Central limit theorem. The accompanying diagram (not reproduced in this transcript) shows four histograms for sample sizes of 50, 100, 200, and 500. As the sample size increases, the histogram approaches a normal curve. 11 / 28

Collecting samples. A sample is a small set of the population used for data analysis purposes. Sampling is a method or process of collecting sample data from various sources. It is the most crucial part of data collection, and the success of an experiment depends upon how well the data is collected. If anything goes wrong with sampling, it will hugely affect the final interpretations. It is usually impossible to collect data for the whole population; sampling helps researchers to infer the population from the sample and reduces the survey cost and the workload of collecting and managing data. 12 / 28

Data Visualization. Data visualization is the initial move in the data analysis process toward easily understanding and communicating information. It helps analysts to understand patterns, trends, outliers, distributions, and relationships. Data visualization represents information and data in graphical form using visual elements such as charts, graphs, plots, and maps. 13 / 28

Data Visualization with Python. Python offers various libraries for data visualization, such as:
Matplotlib - the first Python data visualization library; many other libraries are built on top of it or designed to work in tandem with it during analysis. Some libraries, like pandas and Seaborn, are wrappers over matplotlib that allow you to access a number of matplotlib's methods with less code;
Seaborn;
Bokeh - based on The Grammar of Graphics: the idea that any data graphic can be created by combining data with layers of plot components such as axes, tickmarks, gridlines, dots, bars, and lines;
Plotly - interactive plots, and also some charts you won't find in most libraries, like contour plots, dendrograms, and 3D charts;
geoplotlib - a toolbox for creating maps and plotting geographical data;
missingno - a visual summary of a dataset.
14 / 28

Data Visualization - comparison. A comparison visualization is used to illustrate the difference between two or more items at a given point in time or over a period of time. A commonly used comparison chart is the boxplot. Boxplots are typically used to compare the distribution of a continuous feature against the values of a categorical feature. A boxplot visualizes the five summary statistics (minimum, first quartile, median, third quartile, and maximum) and all outlying points individually. 15 / 28

Visualizing numeric features - boxplots. A common visualization of the five-number summary is a boxplot, also known as a box-and-whisker plot. The boxplot displays the center and spread of a numeric variable in a format that allows you to quickly obtain a sense of its range and compare it to other features. 16 / 28

Data Visualization - relationship. Relationship visualizations are used to illustrate the correlation between two or more variables. Scatterplots are one of the most commonly used relationship visualizations; they are typically used when both features are continuous. 17 / 28
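A minimal sketch of a comparison visualization (boxplot) and a relationship visualization (scatterplot), using seaborn and matplotlib on invented data; the group/height/weight columns are assumptions made purely for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "group":  ["A"] * 50 + ["B"] * 50,                                   # categorical feature
    "height": np.concatenate([rng.normal(165, 6, 50), rng.normal(178, 7, 50)]),
    "weight": np.concatenate([rng.normal(60, 8, 50), rng.normal(78, 9, 50)]),
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
sns.boxplot(data=df, x="group", y="height", ax=ax1)              # comparison: continuous vs categorical
ax1.set_title("Comparison: boxplot")
sns.scatterplot(data=df, x="height", y="weight", hue="group", ax=ax2)  # relationship: two continuous features
ax2.set_title("Relationship: scatterplot")
plt.tight_layout()
plt.show()
```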
Data Visualization - distribution. Distribution visualizations show the statistical distribution of the values of a feature. One of the most commonly used distribution visualizations is the histogram. With a histogram you can show the spread and skewness of the data for a particular feature. 18 / 28

Data Visualization - distribution. A histogram divides the feature values into a predefined number of portions, or bins, that act as containers for values within the same range. A histogram is composed of a series of bars with heights indicating the count, or frequency, of values falling within each of the equal-width bins partitioning the values. 19 / 28

Data Visualization - composition. A composition visualization shows the component makeup of the data. Stacked bar charts and pie charts are two of the most commonly used composition visualizations. With a stacked bar chart, you can show how a total value is divided into parts or highlight the significance of each part relative to the total value. 20 / 28

Data Visualization - heatmap. A heatmap is an analytical tool that uses a color scale to graphically represent values. 21 / 28

Data Visualization - pair plot. A pair plot, also known as a scatterplot matrix, is a matrix of graphs that enables the visualization of the relationship between each pair of variables in a dataset. It combines histograms and scatter plots, providing a unique overview of the dataset's distributions and correlations. 22 / 28

Data Visualization - pair plot. The primary purpose of a pair plot is to simplify the initial stages of data analysis by offering a comprehensive snapshot of potential relationships within the data. 23 / 28

Data Visualization - pair plot. Pair plots enable data scientists to:
visualize distributions - understand the distribution of single variables;
identify relationships - observe linear or nonlinear relationships between variables;
detect anomalies - spot outliers that may indicate errors or unique insights.
24 / 28

Data Visualization - pair plot. Pair plots enable data scientists to:
find trends - linear or nonlinear relationships that suggest predictability;
find clusters - groups of data points that share similar characteristics, hinting at subpopulations within the dataset;
find correlations - the strength and direction of relationships between variables.
25 / 28

Data Visualization. 26 / 28

Thank you for your attention!!! 27 / 28

Machine learning 1 / 52

Managing Data.
1. Problem Understanding or Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment
2 / 52

Data cleaning. Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. 3 / 52

Missing Values. There are several reasons why data could be missing: changes in data collection methods, human error, combining various datasets, human bias, and others. 4 / 52

Missing Values. It is important to try to understand whether there is a reason or pattern behind the missing values. For example, particular groups of people may not respond to certain questions in a survey (a short inspection sketch in pandas follows below). 5 / 52

Handling missing values. Removal - remove all instances with features that have a missing value. This is a destructive approach and can result in the loss of valuable information and patterns that would have been useful in the machine learning process; it should be used only when the impact of removing the affected instances is relatively small, or when all other approaches to dealing with missing data have been exhausted or are infeasible. 6 / 52
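A minimal pandas sketch (invented data) for inspecting how much is missing and whether a pattern exists, before choosing between removal and imputation:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 33, 58, np.nan],
    "income": [3200, 5400, np.nan, 6100, 4800],
    "group":  ["A", "A", "B", "B", "B"],
})

print(df.isna().sum())    # number of missing values per column
print(df.isna().mean())   # fraction missing per column
# Does one group account for most of the missing ages? (a possible pattern)
print(df.groupby("group")["age"].apply(lambda s: s.isna().mean()))
print(df.dropna())        # removal: keep only complete rows
```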
Handling missing values. Imputation is the use of a systematic approach to fill in missing data with the most probable substitute values.
Random imputation - uses a randomly selected observed value as the substitute for a missing value. The disadvantage of this approach is that it ignores useful information or patterns in the data when selecting substitute values.
Distribution-based imputation - the substitute value for a missing feature value is chosen based on the probability distribution of the observed values of the feature. This approach is often used for categorical values, where the mode of the feature is used as the substitute for the missing value.
Mean or median imputation - uses the mean or median of the observed values as the substitute for the missing value.
Predictive imputation - uses a predictive model (regression or classification) to predict the missing value. With this approach, the feature with the missing value is considered the dependent variable (class or response), while the other features are considered the independent variables.
7 / 52

Handling outliers. A data point is an outlier if it is more than $1.5 \cdot IQR$ above the third quartile or below the first quartile. Said differently, low outliers are below $Q_1 - 1.5 \cdot IQR$ and high outliers are above $Q_3 + 1.5 \cdot IQR$. Outliers are data points that are distant from most of the similar points. Outliers cause problems when building predictive models, such as long model training times, poor accuracy, an increase in error variance, or a decrease in normality. 8 / 52

Handling outliers.
Box plot - we can use a box plot to group the data points through quartiles. It groups the data points between the first and third quartile into a rectangular box and displays the outliers as individual points using the interquartile range.
Scatter plot.
Z-score - the Z-score is a parametric approach to detecting outliers. It assumes a normal distribution of the data; an outlier lies in the tail of the normal distribution, far from the mean.
9 / 52

Transforming the Data. As part of the data preparation process, it is often necessary to modify or transform the structure or characteristics of the data to meet the requirements of a particular machine learning approach, to enhance our ability to understand the data, or to improve the efficiency of the machine learning process. Feature scaling brings all the features to the same level of magnitude. 10 / 52

Z-score standardization. Z-score, or zero-mean normalization, results in values that have a mean of 0 and a standard deviation of 1:
$x^* = \frac{x - \bar{x}}{\sigma}$,
where $x^*$ is the new value and $x$ is the value from the dataset. 11 / 52

Min-Max Normalization. With min-max normalization, we transform the original data to the interval $[0, 1]$:
$x^* = \frac{x - \min(x_i)}{\max(x_i) - \min(x_i)}$
(see the code sketch below).

[...]

AUC $\ge$ 0.7 - an acceptable/fair classifier; 0.7 > AUC $\ge$ 0.6 - a poor classifier; 0.6 > AUC $\ge$ 0.5 - no discrimination.
Confusion matrix (rows: observed, columns: predicted):
observed 1: TP, FN
observed 0: FP, TN
$FPR = \frac{FP}{TN + FP}$, $TPR = \frac{TP}{TP + FN}$. 47 / 48

Thank you for your attention!!! 48 / 48
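A minimal sketch tying the preparation steps above together on one invented numeric column: median imputation, the 1.5 * IQR outlier rule, z-score standardization, and min-max normalization, written directly in pandas so the code mirrors the formulas:

```python
import numpy as np
import pandas as pd

x = pd.Series([4.0, 5.0, np.nan, 6.0, 5.5, 4.8, 19.0])   # 19.0 is a likely outlier

x = x.fillna(x.median())                                  # median imputation

q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)] # 1.5 * IQR rule
print("outliers:\n", outliers)

z = (x - x.mean()) / x.std(ddof=0)                        # z-score: mean 0, std 1
mm = (x - x.min()) / (x.max() - x.min())                  # min-max: values in [0, 1]
print(pd.DataFrame({"raw": x, "z_score": z, "min_max": mm}))
```

In practice the same transformations are usually applied with scikit-learn's scalers fitted on the training data only, so that the test data is scaled with the training statistics.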
Machine learning 1 / 15

Machine learning models. Machine learning models are algorithms that can find patterns in, or make predictions on, unseen data. 2 / 15

Errors in Machine Learning. In any machine learning algorithm we need to focus mainly on three important concepts:
predicted values - values produced by the model;
actual values - the original values;
error - the difference between the actual and predicted output.
3 / 15

Errors in Machine Learning. There are two main types of errors present in any machine learning model:
irreducible errors - errors that will always be present in a machine learning model, because of unknown variables, and whose values cannot be reduced. They are caused by variables that have a direct influence on the output but are not captured by the model;
reducible errors - errors whose values can be further reduced to improve a model. They arise because the model's output function does not match the desired output function, and they can be optimized:
bias - the difference between our actual and predicted values (bias is related to training error);
variance - our model's sensitivity to fluctuations in the data (variance is related to testing error).
4 / 15

Bias. Bias is the difference between our actual and predicted values. Bias reflects the simplifying assumptions our model makes about the data in order to be able to predict new data. When the bias is high, the assumptions made by our model are too basic and the model cannot capture the important features of our data. This means that the model has not captured the patterns in the training data and hence cannot perform well on the testing data either; in other words, the model cannot perform well on new data. 5 / 15

Variance. Variance is our model's sensitivity to fluctuations in the data. Variance shows how much our model learns from noise, which causes it to treat trivial features as important. High variance occurs when a model learns the training data's noise and random fluctuations rather than the underlying pattern. As a result, the model performs well on the training data but poorly on the testing data. 6 / 15

Bias-Variance Tradeoff. An optimized model will be sensitive to the patterns in our data, but at the same time will be able to generalize to new data. For this, both the bias and the variance should be low, so as to prevent overfitting and underfitting. 7 / 15

Bias-Variance Tradeoff. The bias-variance tradeoff is the balance between bias and variance. Achieving it lets us capture the essential patterns in the data while ignoring the noise present in it. The bias-variance tradeoff helps to optimize the error of our model and keep it as low as possible. 8 / 15

Underfitting and Overfitting. Underfitting and overfitting both introduce error and reduce the generalizability of the model (the ability of the model to generalize to future, unseen data). They are also opposed to each other: somewhere between a model that underfits and has bias, and a model that overfits and has variance, is an optimal model that balances the bias-variance trade-off. 9 / 15

Underfitting. Reasons for underfitting:
the model is too simple, so it may not be capable of representing the complexities in the data;
the input variables used to train the model are not appropriate representations of the underlying factors influencing the target variable;
the size of the training dataset is not sufficient;
incorrectly selected model parameters;
variables are not scaled.
10 / 15

Underfitting. Techniques to reduce underfitting:
increase model complexity;
increase the number of features, performing feature engineering;
remove noise from the data;
increase the number of epochs or the duration of training to get better results;
add data to the training set.
11 / 15

Overfitting. Reasons for overfitting: the model is too complex; the size of the training data. 12 / 15

Overfitting. Techniques to reduce overfitting:
increasing the training data can improve the model's ability to generalize to unseen data and reduce the likelihood of overfitting;
improving the quality of the training data reduces overfitting by focusing on meaningful patterns and mitigating the risk of fitting noise or irrelevant features;
reduce model complexity;
early stopping during the training phase (keep an eye on the loss during training; as soon as the loss begins to increase, stop training);
use dropout (randomly deactivating neurons during training) for neural networks.
13 / 15

Bias-Variance Tradeoff. 14 / 15

Thank you for your attention!!! 15 / 15

Machine learning 1 / 36

Evaluating and improving the performance of a model: evaluating performance; improving performance (parameter tuning, ensemble methods). 2 / 36

Ensemble learning. Ensemble learning assumes that we may not always be able to find the optimal set of hyperparameters for a single model and that, even if we did, the model may not always be able to capture all the underlying patterns in the data. Therefore, instead of simply focusing on optimizing the performance of a single model, we should use several complementary weak models to build a much more effective and powerful model. 3 / 36

Ensemble learning. Ensemble learning is a method of combining the results produced by different learners into one format, with the aim of producing better classification and regression results. The most common methods: bagging, random forest, boosting. 4 / 36

Ensemble learning.
Bagging is a voting method which first uses bootstrap sampling to generate different training sets, and then uses these training sets to build different base learners. The bagging method employs a combination of base learners to make a better prediction.
Random forest uses the classification results voted on by many classification trees. The idea is simple: a single classification tree gives a single classification result for a single input vector, but a random forest grows many classification trees, obtaining multiple results from a single input. A random forest therefore uses the majority vote of all the decision trees to classify data, or the average output for regression (see the code sketch below).
Boosting is similar to bagging. What makes boosting different is that it constructs the base learners in sequence, where each successive learner is built on the prediction residuals of the preceding learner. By creating complementary learners, it uses the mistakes made by previous learners to train the next base learner.
5 / 36

Ensemble learning. Methods of constructing ensemble models may involve modifying:
the training data - the additional training data sets used by the individual classifiers are created by repeated sampling with replacement;
the lists of variables - the training data of individual classifiers contain only variables selected (in the simplest case, randomly) from the original training data set;
the machine learning algorithm - individual classifiers are learned using the same training dataset, but each algorithm has different parameters.
6 / 36
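A minimal scikit-learn sketch comparing a single decision tree with bagging and a random forest on a synthetic dataset; all parameter values are illustrative, not tuned:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data standing in for a real problem.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

models = {
    "single tree":   DecisionTreeClassifier(random_state=0),
    # bagging: bootstrap samples of the training data + voting over base trees
    "bagging":       BaggingClassifier(DecisionTreeClassifier(),
                                       n_estimators=100, random_state=0),
    # random forest: bagging plus random feature selection at each split
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:14s} cross-validated accuracy: {scores.mean():.3f}")
```

On most runs the ensembles score noticeably higher than the single tree, illustrating the point of combining weak learners; exact numbers depend on the synthetic data.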
Disadvantages.
Bagging: if exactly the same predictors are used to build each tree, it may happen, for variables that are very strongly correlated with the phenomenon being explained, that each tree selects the same variable for the first split, even though other variables may be only slightly less correlated with the target variable.
Random forest: a random forest is a stable and non-overfitted model. Its weakness is its sometimes weak predictive ability.
7 / 36

Advantages. Ensemble methods combine the prediction power of each single learner into a strong learner. The trees in random forests and bagging can be developed in parallel computations. 8 / 36

Boosting. Boosting starts with a simple or weak classifier and gradually improves it by reweighting the misclassified observations, so that each new classifier can learn from the previous ones. A boosting algorithm is a sequential ensemble method in which the residuals/misclassified points of the previous learner are improved upon in the next run of the algorithm. 9 / 36

Boosting. The idea of boosting is to "boost" weak learners, such as a single decision tree, into strong learners. An alternative to repeated sampling is to generate modified copies of the training data by modifying a weight vector. Base models $h_1, h_2, \dots, h_m$ would then be generated using the same training set, but with different weight vectors $w_1, w_2, \dots, w_m$.
Let us assume that we have $n$ points in our training dataset and we can assign a weight $w_i$ ($1 \le i \le n$) to each point.
During $m$ iterations, we reweight each point according to the classification result in each iteration: if the point is correctly classified, we decrease its weight; otherwise, we increase it.
When the iteration process is finished, we obtain the $m$ fitted models $f_i(x)$ ($1 \le i \le m$).
10 / 36

Boosting - algorithm. 11 / 36

Boosting - example.
Let us assume that we have ten observations. Each observation has an initial weight of 0.1.
We build a decision tree that misclassifies, for example, four observations (7, 8, 9 and 10).
We calculate the sum of the weights of these misclassified observations, which is 0.4 (we denote it by $\epsilon$). We continue to use the value of $\epsilon$ as the measure used to update the weights and to determine the weight of the model.
12 / 36

Boosting - example.
Let us calculate $\alpha = 0.5 \cdot \log\frac{1-\epsilon}{\epsilon}$.
The new weights for the misclassified observations are calculated as $e^{\alpha}$ multiplied by the old weight.
So $\alpha = 0.5 \cdot \log\frac{1-0.4}{0.4} = 0.2027$, and the new weights for observations 7, 8, 9 and 10 are equal to $e^{\alpha} \times 0.1 = 0.1225$.
13 / 36

Boosting - example.
This new model will again make errors. Suppose it misclassifies observations 1 and 8; their current weights are 0.1 and 0.1225, respectively.
Thus, the new value of $\epsilon = 0.1 + 0.1225 = 0.2225$, and $\alpha = 0.6256$. We use this value to modify the weights of the misclassified observations: observation 1 now gets weight $0.1 \cdot e^{\alpha} = 0.1869$, and observation 8 now gets weight $0.1225 \cdot e^{\alpha} = 0.229$ (these calculations are reproduced in the code sketch below).
14 / 36

Boosting. The idea of boosting is to "boost" weak learners, such as a single decision tree, into strong learners. Finally, we obtain the final prediction through a weighted average of each tree's prediction, where the weight, $\beta$, is based on the quality of each tree. 15 / 36

AdaBoost. AdaBoost, also called Adaptive Boosting, is a technique in machine learning used as an ensemble method. The most common base algorithm used with AdaBoost is a decision tree with one level, that is, a decision tree with only one split. These trees are also called decision stumps. The disadvantage of AdaBoost is its lack of robustness to noisy data and to outliers, because the algorithm tries to perfectly match each observation. 16 / 36
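A minimal sketch reproducing the weight-update arithmetic of the worked boosting example above; the natural logarithm is assumed in the formula for alpha, which matches the numbers quoted in the slides:

```python
import numpy as np

weights = np.full(10, 0.1)                  # ten observations, initial weight 0.1 each

# Round 1: observations 7, 8, 9, 10 (indices 6-9) are misclassified.
mis = [6, 7, 8, 9]
eps = weights[mis].sum()                    # epsilon = 0.4
alpha = 0.5 * np.log((1 - eps) / eps)       # alpha = 0.2027
weights[mis] *= np.exp(alpha)               # 0.1 -> 0.1225
print(round(alpha, 4), np.round(weights, 4))

# Round 2: observations 1 and 8 (indices 0 and 7) are misclassified.
mis = [0, 7]
eps = weights[mis].sum()                    # epsilon = 0.1 + 0.1225 = 0.2225
alpha = 0.5 * np.log((1 - eps) / eps)       # alpha = 0.6256
weights[mis] *= np.exp(alpha)               # weights become 0.1869 and 0.229
print(round(alpha, 4), np.round(weights, 4))
```

The printed values match those in the slides; a full AdaBoost implementation would additionally normalize the weights and combine the per-round models with the alpha values as voting weights.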
Gradient Descent - idea f