S3_1361 Minutes_Teaching_Aid_Lecture 1 DMML.pptx
Full Transcript
Lecture 1: Machine Learning
Dr Mahmoud Aldraimli
[email protected]
OPEN DOOR POLICY – JUST COME IN
Copland Campus – 7th Floor, Office 7.118
© 2023 Mahmoud Aldraimli. All rights reserved.

We extract coal for energy. We mine data for ________
A: Trends
B: Patterns
C: Knowledge
D: Insights

Mining, in its everyday sense, refers to the extraction of valuable minerals. In the 21st century, data is the most expensive mineral. Through data mining, we examine a given dataset to extract useful information, uncover patterns and identify relationships.

Where is data stored?

Which skills do we need?
Areas of skills

Is there a process flow for mining data?
CRISP-DM is not the only framework; there are others.

Data Preparation
Analysing data that has not been carefully screened for problems can produce misleading results. Examining the quality of the data therefore comes first, before running any analysis. Data preparation is often the most important phase of a machine learning project, and it consumes a large portion of the project's time.

Data preparation can include handling missing data values, resolving inconsistencies, transforming the data, formatting the data, and other tasks. These tasks are performed to improve the modelling output and to prevent a Garbage In – Garbage Out scenario.
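As a rough illustration of the screening step described above, the Python sketch below checks a single column for missing entries, non-numeric entries and implausible values before any analysis is run. The function name, the list of missing-value markers, the sample values and the 0 to 120 plausibility range are all assumptions made for this example; they do not come from the lecture.

```python
def screen_numeric_column(values, low, high, missing_markers=("", "NA", "?")):
    """Sort raw entries of one column into missing, non-numeric,
    out-of-range and clean values. All thresholds are assumptions."""
    report = {"missing": 0, "not_numeric": [], "out_of_range": [], "clean": []}
    for raw in values:
        # Treat None and marker strings as missing entries.
        text = "" if raw is None else str(raw).strip()
        if text in missing_markers:
            report["missing"] += 1
            continue
        # Entries that cannot be parsed as numbers are inconsistencies.
        try:
            number = float(text)
        except ValueError:
            report["not_numeric"].append(raw)
            continue
        # Parsed numbers outside the plausible range are flagged, not deleted.
        if low <= number <= high:
            report["clean"].append(number)
        else:
            report["out_of_range"].append(number)
    return report


# Hypothetical raw "age" column as it might arrive from a data source.
ages = ["30", "35", None, "six", "46", "145", " 58 ", "NA"]
report = screen_numeric_column(ages, low=0, high=120)
print(report)
```

The sketch only reports problems; deciding what to do with the flagged entries is a separate step and depends on the study.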
Figure (Garbage In – Garbage Out): bad-quality, wrong, or useless data fed into Machine Learning produces pointless, biased, meaningless, useless results.

Where does data come from?

“There are in fact no numbers and no letters, we have codified our existence to bring it down to human size, to make it comprehensible, we have created a scale so we can forget its unfathomable scale.”
– Film: Lucy (2014) –

“To measure is to know”
– Lord Kelvin (1824 - 1907) –
In the pursuit of knowledge, data is a collection of values that convey information, describing quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpreted.

“The type of measure used places constraints on which statistics can be used”
– Stanley Smith Stevens (1906-1973) –
A) If 1 and 2 are natural numbers, find 1 + 2 = 3
B) If 1% and 2% are percentages, find 1% + 2% = 3%
C) If 1: Smoker, 2: Non-smoker, find 1 + 2 = Not Allowed
In cases A and B the values have arithmetic significance, so addition is allowed. In case C the values are categorical, so addition makes no sense.

Define these variables’ scale of measurement
T-shirt size? {XS, S, M, L, XL, XXL, XXXL} (Ordinal, Nominal)
Age in years? {30, 35, 46, 58, …} (Numeric)
Students’ grades? {10%, 20%, … 100%} (Ratio, Numeric)
Customer’s satisfaction score? {1, 2, 3, 4, 5} (Discrete, Ordinal, Numeric)
Has a disease? {Yes, No} (Binary, Nominal)
Car colours? {Black, White, Blue, Red} (Nominal)
Age of students? {1, 2, six, 7, 145, eight, twenty-one, …} (String)

Importance of measurement scale
Smoking Status codes: 3: Current Smoker, 2: Occasional Smoker, 1: Former Smoker, 0: Non-Smoker

Subject ID    Smoking Status    Count of cigarettes per day
001           1                 1
002           2                 2
003           3                 3
004           ?                 ?
005           3                 3
006           0                 0
007           3                 3

Can you guess the missing values?
Mode? = 3
Average? (1+2+3+3+0+2)/6 = 1.8 ≈ 2

(a) The type of measurement could prohibit the use of some techniques.
(b) Respecting it eliminates the production of illogical estimates of values.
(c) It lets you incorporate the precision of the present values where needed.

Missing Data
Sometimes you get a table, but some values are not there!
An unobserved value.
A deleted value (removed because it was considered an error).
An unrecorded value.
An unobtainable value.
An unknown value.
An inaccessible value.
A lost-forever value.

Missing value investigation
Understand:
The cause of your data missingness
The pattern of missing data
The proportion of missing data

Missingness by severity
01 Structured Missingness
A value is missing from the data for a valid reason. The data is missing because it should not occur, considering the other variables. (Non-smokers are missing the number of cigarettes consumed per day.)
02 Missing Completely at Random (MCAR)
The missing value has nothing to do with its assumed value or with the values of other variables. (Covid test result missing: was the sample damaged?)
03 Missing at Random (MAR)
Missing at random means that the tendency for a data point to be missing is not related to the missing data itself, but is related to some of the observed data in the dataset. (Certain professions missing income values!)
04 Missing Not at Random (MNAR)
The data is missing due to unobserved data, i.e. data we do not have, or it is related to factors we did not account for. (Customer service calls missing duration: perhaps the call was dropped from the queue or went unanswered.) We do not know.

Missingness proportion (P) and severity (S)
Structured Missingness: least on the scale of severity (S). It may affect a large portion (P) of records.
MCAR: low on the scale of severity (S). It affects a very small portion (P) of records.
MAR: moderate on the scale of severity (S). It affects a larger portion (P) of records than MCAR.
MNAR: very high severity (S). It is likely to affect a large portion (P) of records.

Missingness by preference
01 Item Non-Response
This is missingness at the record level: for a given subject, answers to specific questions are absent. The subject answered most of the questionnaire but decided to leave a few questions unanswered.
02 Unit Non-Response
The unit is the subject. Missingness refers to the complete absence of a record’s responses.

Item Non-response
For complete-case analysis, how many records can you use for analysis?
Only 7 records have complete data (no missingness). For complete-case analysis, remove any record with missing data. This means you deleted 41% of your data! VERY BAD

Unit Non-response
For complete-case analysis, how many records can you use for analysis?
10 records have complete data (no missingness). For complete-case analysis, remove any record with missing data. This means you deleted 16% of your data!

Non-Response, Spread (S) and Threat (T)
Unit Non-Response: it tends to have a larger spread (S) in a dataset. It poses less threat (T) because more records are retained in complete-case analysis.
Item Non-Response: it tends to have less spread (S) in a dataset. It poses a high threat (T) in complete-case analysis.

Options for dealing with missing data
Avoid: if you are collecting data, ensure you have a solid survey design, a reliable collection procedure and data maintenance. Check the resources if you are buying data.
Ignore: leave them out of your analysis. This may mean dropping variables with large missingness, deleting records with missing values, or both.
Top-up: include as many predictors as possible in a model so that the “missing at random” assumption is reasonable. Or add more records with observations, which is usually expensive.
Treat: find information or make assumptions! Obtain reasonable parameter estimates about the relationship between variables to provide the best guess.

Your options, so choose carefully
Avoiding missing data must always be your first choice.
Considerations: Cause, Domain, Pattern, Spread, Volume.

Missing Data Treatments
Deletion
Simple Imputation
Model Imputation

Deletion
One way is simply to get rid of the problem of missing data.
Case-wise deletion removes any record (case) with at least one missing observation across all variables and retains the rest.
Pair-wise deletion discards records depending on the case study; records are retained according to their completeness in the target variables for the specific analysis.
Variable dropping removes a whole variable for all records in the dataset.

Simple Imputation
Methods that substitute a missing value with a new observation based on simple or no assumptions.
Simple imputation examples:
Logical-rule and information imputation
Mean, median or mode imputation
Random imputation
Hot-deck and cold-deck
LOCF and BOCF (last / baseline observation carried forward)
Indicator substitution

Model-Based Imputation
Methods that use a predictive model to estimate (not predict) the missing observations.
The dataset is split into two types of subsets for the variable under evaluation: one subset holds the present observations only (used to build the model), and the other subset contains all missing values that require substitution.
Model-based imputation examples:
Regression
Maximum Likelihood
Expectation Maximisation
Machine Learning (KNN)
Hybrid Methods
Multiple Imputation

A useful book for missing data
(a) The book has a detailed part on the “HOW” of the techniques mentioned in this lecture.
(b) It also has some practical considerations, including software options.

Which is greater: 1 kg of apples or 1 m of electric cable?
A: 1 kg of apples
B: 1 m cable
C: They are equal
D: Convert them
Answer: They cannot be compared.

THE PROBLEM OF MAGNITUDE
Which astronaut is closer to commander OBLONSKY?
A: WANG (D = 2000 centimetre)
B: WARREN (D = 20 metre)

SCALING
Pay attention to units of measurement.
To you: 2 km = 2000 m = 2000000 mm.
To a machine learning algorithm such as KNN, everything is unitless (dimensionless): 2, 2000, 2000000.
Some machine learning algorithms are badly impacted by varied magnitudes of measurements.

FEATURE SCALING
SCALING is a transformation that enlarges (increases) or shrinks (diminishes) objects (values) by a scale factor that is the same in all directions.
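The effect of units on a distance-based learner such as KNN can be sketched in a few lines of Python. The records, feature names and helper functions below are hypothetical and chosen only for illustration. The slides do not name a specific scaling method at this point, so min-max scaling is used here as one common choice, not as the lecture's method.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two records, ignoring units."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def nearest(query, candidates):
    """Name of the candidate closest to the query (1-nearest neighbour)."""
    return min(candidates, key=lambda name: euclidean(query, candidates[name]))


def min_max_scale(rows):
    """Rescale every feature (column) to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi != lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]


def nearest_scaled(query, candidates):
    """Scale the query and candidates together, then find the nearest."""
    names = list(candidates)
    scaled = min_max_scale([query] + [candidates[n] for n in names])
    return nearest(scaled[0], dict(zip(names, scaled[1:])))


def km_to_mm(record):
    """Same record, with the distance feature expressed in millimetres."""
    distance_km, weight_kg = record
    return (distance_km * 1_000_000, weight_kg)


# Hypothetical records: (distance, weight in kg).
query_km = (2.0, 70.0)
candidates_km = {"A": (2.5, 71.0), "B": (2.1, 80.0)}
query_mm = km_to_mm(query_km)
candidates_mm = {name: km_to_mm(rec) for name, rec in candidates_km.items()}

print(nearest(query_km, candidates_km))         # A
print(nearest(query_mm, candidates_mm))         # B: the unit change flips the answer
print(nearest_scaled(query_km, candidates_km))  # A
print(nearest_scaled(query_mm, candidates_mm))  # A: same answer in either unit
```

With raw values, the nearest neighbour depends on whether distance is recorded in kilometres or millimetres. After scaling, the choice of unit no longer changes the result.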
STANDARDIZATION is a scaling process that makes observations conform to a standard measurement: each observation is expressed in units (fractions or multiples) of the feature's standard deviation σ, typically after subtracting the mean.

OUTLIERS AND EXTREME VALUES
Figure labels: consistent points; outliers or extreme observations.

OUTLIERS AND EXTREME VALUES: DEFINE
You define them! The definition is subjective, depending on the study.
Outliers appear to be inconsistent with the rest of the data values.
Extreme values reside at the boundaries of all values for an input feature.

OUTLIERS DETECTION
STATISTICAL TESTS: there are many of them, for example the Interquartile Range (IQR) test.
HISTOGRAM PLOT: visualize univariate data to find outliers.
SCATTER PLOT: visualize multivariate data to find outliers.
COMMON SENSE! By examining one variable or by examining related variables.

OUTLIERS TREATMENT
KEEP or REMOVE? Factors to consider:
ERRORS
CRITICAL RESULTS
RE-COLLECTABLE
VERIFIABLE
THE PRESENCE OF LOTS OF OUTLIERS
MORE DATA AVAILABLE

CHECKPOINT
The two multidimensional plots opposite show a raw dataset and its prepared version. Which preparation method could have changed the data view most?
RAW DATA / PREPARED DATA
A: Imputing Missing Values
B: Removing Outliers
C: Type Conversion
D: Scaling

The effect of scaling raw data

MODELLING
Machine Learning

KEY AREAS IN
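The Interquartile Range (IQR) test named on the outlier-detection slide can be sketched as follows. This is a minimal illustration, not the lecture's own procedure: the 1.5 × IQR fences are the usual convention and are an assumption here, as are the "inclusive" quartile method, the function names and the sample values.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR.
    k = 1.5 is a common convention, assumed here."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def iqr_outliers(values, k=1.5):
    """Values falling outside the fences, in their original order."""
    low, high = iqr_fences(values, k)
    return [v for v in values if v < low or v > high]


# Hypothetical ages in years; 145 is an implausible entry.
ages = [30, 35, 46, 58, 41, 38, 145, 33, 52, 44]
print(iqr_fences(ages))    # (13.625, 72.625)
print(iqr_outliers(ages))  # [145]
```

The test only flags candidates. Whether a flagged value is kept or removed is still a judgement that depends on the study.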