Data Understanding - Machine Learning Applications - 2023/2024
Document Details
Technische Universität Darmstadt
2023
Uwe Klingauf
Summary
These lecture notes cover machine learning applications, including data understanding and explorative data analysis. The document details the methodologies of data mining, focusing on the initial steps of data collection, data description, and exploration.
Full Transcript
Machine Learning Applications
Winter semester 2023/2024
Prof. Dr.-Ing. Uwe Klingauf
Lecture III: Data Understanding and Exploratory Data Analysis
01.11.2023 | Machine Learning Applications | Data Understanding and Exploratory Data Analysis | Prof. Dr.-Ing. Uwe Klingauf

Agenda
1. What is Data?
2. Data Collection
3. Exploratory Data Analysis
   1. Descriptive Statistics
   2. Exploratory Statistics
4. Data Quality
5. Imbalanced Data

Introduction to Data Understanding
WHAT IS DATA?

What is data?
The word data is the plural of Latin datum, „something given“.
data. (n.d.) American Heritage® Dictionary of the English Language, Fifth Edition. (2011). Retrieved September 6 2022 from https://www.thefreedictionary.com/data

What is data?
Data is information, usually in the form of facts or statistics that can be analysed.
data. (n.d.) Collins COBUILD English Usage. (1992, 2004, 2011, 2012). Retrieved September 6 2022 from https://www.thefreedictionary.com/data

What is data?
Representation of facts, concepts, or instructions in a formalized manner suitable for communication, interpretation, or processing by humans or by automatic means. Any representation such as characters or analog quantities to which meaning is or might be assigned.
data. (n.d.) Dictionary of Military and Associated Terms. (2005). Retrieved September 6 2022 from https://www.thefreedictionary.com/data

From Data To Knowledge
Data Mining: discovering knowledge in large amounts of data
[https://profisee.com/data-quality-what-why-how-who/]

Data Mining Methodology: CRISP-DM
The process of data mining always starts with an understanding of the context!
[Figure: CRISP-DM phases: Business Understanding, Data Understanding, Data Preparation, Data Modeling, Evaluation, Deployment]
Business Understanding consists of the following steps:
▪ Determine business objectives: Understand what the customer wants to accomplish and define business criteria.
▪ Assess situation: Determine the available resources and requirements and assess possible risks.
▪ Determine data mining goals: Starting from the business objectives, determine goals and success criteria for the data analysis.

Data Mining Methodology: Business Understanding
Depending on the business context, the goals and requirements for the data mining differ:
▪ Application scenario
▪ Tolerance for erroneous estimations
▪ Safety-critical applications

Data Mining Methodology: CRISP-DM
After a good understanding of the business task, the available data should be examined carefully with regard to the given task!
Data Understanding:
▪ Collect data: Acquire the data and, if necessary, integrate it into already existing data sets.
▪ Describe data: Examine the data and document its properties like data quantity and data format.
▪ Explore data: Get familiar with the data and discover first insights. Visualize the data and identify relationships among the data.
▪ Verify data quality: Identify and document any quality issues.

Data Mining Methodology: Data Understanding
Purpose of Data Understanding
Overarching question: Is the available data suitable to fulfill the intended purpose?
▪ Getting to know the data
  ▪ What attributes make up the data?
  ▪ What kind of values does each attribute have?
  ▪ How are the values distributed?
▪ Useful and indispensable step for data preprocessing and modeling
▪ GIGO: garbage in, garbage out
! When using machine learning in engineering, it makes sense to consider the acquisition of the necessary data right from the start. This ensures that an appropriate amount of data is available.

Data Understanding
DATA COLLECTION

Where does the data come from?
▪ Existing Databases
▪ Data Acquisition
▪ Modeling & Simulation

Where does the data come from? Data Acquisition
Physical System → Sensors → Digital Acquisition System (DAQ: Amplification, Filtering, A/D Conversion) → Data Transport & Storage → Signal Processing

Technical data comes from a wide variety of sources
[Figures: example data sources]
Source: https://www.rheinenergie.com/media/pictures/double_teaser/Monteur_Tablet_Heizungsrohre_Keller_DT_DoubleTeaserDesktop_1x.jpg
Source: https://store.arduino.cc/arduinouno-rev3
Source: https://www.dspace.com/de/gmb/home/products/hw/simulator_hardware/scalexio.cfm#145_39435_3
Source: https://www.kistler.com/de/produkte/komponenten/accelerometer-undbeschleunigungssensoren/?pfv_metrics=metric
Source: http://www.ni.com/pdf/productflyers/compactrio-controller.pdf
Source: https://www.windowspro.de/news/excel-tabelle-allen-events-fuersicherheits-system-logs/02926.html

Different kinds of data
Structured
▪ High degree of organization
▪ E.g. stored in a relational database or a csv-file
▪ Main focus of this lecture
Unstructured
▪ No organizational form
▪ E.g. text or image files
▪ Maintenance data / Protocols → Natural Language Processing
  Example: TExxx - 10.12.2018: Complaint: ACC FRA T/FSU VIB ON ENG1´N1 UP TO 5,7 UNITS ON TAKE OFF, ON CROUISE 1,7 UNITS N1 Action: FOUND ON ENG1 FAN BLADES DEBRISE OF BIRDS AND SOME FAN BLADES SHAWLING (SEE ALSO TExxx BIRDSTRIKE)
▪ Object detection → Computer Vision → Lecture XIII

Data Acquisition: Need for labeled data
Sensors generate huge amounts of data in modern industry processes
Image source: https://wgp.de/de/industrie-4-0/
Use of sensor data for supervised machine learning tasks like predictive quality or predictive maintenance:
▪ Data have to be labeled
▪ Which data represents an anomaly in the process or a fault?
▪ Labeling often is a labor-intensive task
Google reCAPTCHA: „reCAPTCHA offers more than just spam protection. Every time our CAPTCHAs are solved, that human effort helps digitize text, annotate images, and build machine learning datasets. This in turn helps preserve books, improve maps, and solve hard AI problems.“
→ Acquisition of data using sensors is easy, labeling is much more complicated!

Data Understanding
EXPLORATORY DATA ANALYSIS

Exploratory Data Analysis
Descriptive statistics: Characterization of the data through metrics and graphics
Examples:
▪ Location and variance metrics
▪ Graphical methods
Exploratory statistics: Search for patterns in the data and development of hypotheses
Examples:
▪ Relational metrics
▪ Graphical methods
Goals:
▪ Make best use of the data
▪ Understand the structure of the data
For confirmation, the developed hypotheses must be tested on different data sets → Confirmatory Data Analysis

Exploratory Data Analysis
DESCRIPTIVE STATISTICS

Attribute Types
The type of an attribute depends on the possible values that the attribute can have.
▪ Nominal
▪ Ordinal
▪ Numeric
  ▪ Interval-scaled
  ▪ Ratio-scaled

Attribute Types: Nominal Attributes
▪ Categorical data
▪ No meaningful order
▪ Not quantitative
Examples:
▪ Color: Green, Blue, Yellow
▪ Matriculation number: 2869322, 3156478, 2986785

Attribute Types: Ordinal Attributes
▪ Values have a meaningful order
▪ Difference between successive values not known
Examples:
▪ Size: Small, Medium, Large
▪ Grades: 1.3, 1.7, 2.0

Attribute Types: Numeric Attributes
▪ Quantitative values
▪ Values have a meaningful order
▪ Differences between values can be quantified
▪ Values can be discrete or continuous
Interval-scaled attributes:
▪ Arbitrary zero-point
▪ Ratios and multiples not meaningful
▪ Example: Temperature. 2°C is not twice as warm as 1°C
Ratio-scaled attributes:
▪ Inherent zero-point
▪ Ratios and multiples can be quantified
▪ Example: Distance. 2 km is twice as long as 1 km

Basic statistical descriptions: Central Tendency
There are several ways to measure the center of a data distribution
▪ Mean: average value, $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$
  ▪ For numeric data
▪ Median: middle value, separates the data into two equal-sized halves
  ▪ For numeric and ordinal data
▪ Mode: value that occurs most frequently
  ▪ For numeric, ordinal and nominal data

Basic statistical descriptions: Central Tendency
Which measure should be used?
▪ The mean is very sensitive to outliers
▪ The median is a better measure for skewed data
Example for skewed data: Net equivalised income in Germany, 2021:
▪ Mean: 29,090 €
▪ Median: 25,015 €
Source: Destatis
[Han, Kamber, Pei: Data Mining, Concepts and Techniques]
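
The effect of an outlier on mean, median and mode can be sketched in a few lines of Python. This is only an illustration using the standard-library `statistics` module; the helper name `central_tendency` and the toy income values are hypothetical and not taken from the Destatis figures above.

```python
import statistics


def central_tendency(values):
    """Return mean, median and mode of a list of numeric values."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
    }


# Hypothetical toy incomes (in thousand euros); the last value is an outlier.
incomes = [20, 22, 22, 25, 27, 30, 250]
summary = central_tendency(incomes)

# The single outlier pulls the mean far above the median.
print(summary)
```

With these toy values the mean lies well above the median, which mirrors the skewed income example: one extreme value shifts the mean but leaves the median untouched.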

Basic statistical descriptions: Dispersion of the data
▪ Range = maximum − minimum
▪ Quantiles: split the data into equal-size sets
  ▪ Quartiles Q1, Q2 (= median) and Q3
  ▪ Percentiles
[Han, Kamber, Pei: Data Mining, Concepts and Techniques]
→ None of these values is very informative on its own; several values have to be used to describe a data distribution

Basic Statistical Descriptions: Visualization of a Data Distribution
Boxplot according to Tukey, based on the 5-point description:
▪ Minimum
▪ Lower quartile Q1
▪ Median Q2
▪ Upper quartile Q3
▪ Maximum
The interquartile range spans from Q1 to Q3.
Tukey, J. W.: Exploratory Data Analysis. Pearson Publishing, Cambridge (1977)
Picture Source: https://www.infragistics.com/community/blogs/b/tim_brock/posts/demystifying-box-and-whisker-plots-part-1

Basic Statistical Descriptions: Dispersion of the data
Definition: Variance and standard deviation measure how much a data distribution spreads around the arithmetic mean.
▪ Variance: $V = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2$
▪ Standard deviation: $\sigma = \sqrt{V}$
[Curvebreakers (curvebreakerstestprep.com)]

Basic Statistical Descriptions: Skewness and Kurtosis
Definition: The skewness is a measure for the asymmetry of a data distribution.
▪ Skewness: $S = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{x_i - \bar{x}}{\sigma} \right)^3$
Definition: The kurtosis is a measure for the tailedness of a data distribution.
▪ Kurtosis: $K = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{x_i - \bar{x}}{\sigma} \right)^4$
▪ Excess: $\gamma = K - 3$
[Figure: example distributions illustrating the sign of S (left-skewed: S < 0) and of the excess γ. Picture source: Wikipedia]

Use Case: Exploratory Data Analysis for Aircraft Engine Data
C-MAPSS dataset
▪ Run-to-failure simulations of a commercial aircraft engine¹
▪ Various sensor data available
▪ Constant operating conditions and end of life due to fault in the high pressure compressor (HPC)
Goal: Predict the remaining useful life (RUL) based on current and historical sensor data
¹ Paper: Saxena et al.: „Damage propagation modeling for aircraft engine run-to-failure simulation“

Parameter  Name         Description                       Unit
0          Engine       Engine Number                     -
1          Cycle        Cycle Number                      -
2          Altitude     Altitude                          1000 ft
3          Mach Number  Mach Number                       -
4          TRA          Thrust Resolver Angle             -
5          T2           Total temperature at fan inlet    °R
6          T24          Total temperature at LPC outlet   °R
7          T30          Total temperature at HPC outlet   °R
8          T50          Total temperature at LPT outlet   °R
9          P2           Pressure at fan inlet             psia
10         P15          Total pressure in bypass-duct     psia
11         P30          Total pressure at HPC outlet      psia
12         Nf           Physical fan speed                rpm
13         Nc           Physical core speed               rpm
14         Epr          Engine pressure ratio             -
15         Ps30         Static pressure at HPC outlet     psia
16         Phi          Ratio of fuel flow to Ps30        pps/psi
17         NRf          Corrected fan speed               rpm
18         NRc          Corrected core speed              rpm
19         BPR          Bypass ratio                      -
20         farB         Burner fuel-air ratio             -
21         htBleed      Bleed enthalpy                    -
22         Nf_dmd       Demanded fan speed                rpm
23         PCNfR_dmd    Demanded corrected fan speed      rpm
24         W31          HPT coolant bleed                 lbm/s
25         W32          LPT coolant bleed                 lbm/s

Use Case Aircraft Engines: Descriptive Statistics
▪ Run-to-failure data of 100 engines
▪ Constant operating conditions
  ▪ Altitude = 0 (sea level)
  ▪ Mach number = 0
  ▪ Thrust Resolver Angle = 100 (max thrust)
Descriptive statistics of the operating conditions:
                    Altitude [1000 ft]  Mach Number  TRA
Mean                -0.000009           0.000002     100
Standard deviation  0.002187            0.000293     0
Median              0.000000            0.000000     100
Analysis of the engine lifetime:
▪ Median lifetime of 200 engine cycles
▪ Large deviation in lifetime
  ▪ Some engines make more than 350 cycles
  ▪ Some make less than 150 cycles
→ The age (number of cycles already experienced) alone is not sufficient to predict the remaining lifetime

Use Case Aircraft Engines: Descriptive Statistics of sensor data
Descriptive statistics of the sensor data:
                    T2      T24     T30      T50      P2      P15     P30      …
Mean                518.67  642.68  1590.52  1408.93  14.620  21.609  553.368  …
Standard deviation  0.00    0.50    6.13     9.00     0.000   0.001   0.885    …
Median              518.67  642.64  1590.10  1408.04  14.620  21.610  553.44   …
▪ Some sensors have a much larger deviation than others
▪ Mean and median not always equal: skewed data distributions
▪ Several sensor parameters have a standard deviation of 0
→ These sensors are always constant and therefore not relevant for the remaining useful life prediction
→ For further analysis, these sensors can be neglected

Use Case Aircraft Engines: Descriptive Statistics of sensor data - Histograms
[Figure: histograms of T50 and P30]
Distribution of temperature sensor T50 (skewness $S = 0.44$):
▪ Approximately normally distributed, but skewed
▪ Positively skewed: more high-value outliers
Distribution of pressure sensor P30 (skewness $S = -0.39$):
▪ Approximately normally distributed, but skewed
▪ Negatively skewed: more low-value outliers
In RUL prediction, outliers can be due to system faults close to the end of the lifetime; this needs to be investigated further!

Exploratory Data Analysis
EXPLORATORY STATISTICS

Correlation: Signal Processing
Definition: Cross-correlation is a measure of similarity between a random signal $x(t)$ and a time-shifted random signal $y(t)$. For non-complex functions $x(t)$ and $y(t)$:
$R_{xy}(\tau) = \lim_{T \to \infty} \frac{1}{T} \int_{-T/2}^{T/2} x(t)\, y(t + \tau)\, dt$
Definition: Auto-correlation is a measure of similarity of a random signal $x(t)$ with its shifted version.
$R_{xx}(\tau) = \lim_{T \to \infty} \frac{1}{T} \int_{-T/2}^{T/2} x(t)\, x(t + \tau)\, dt$
[Puthusserypady: Applied Signal Processing]

Correlation: Statistical Correlation
Definition: The Pearson correlation coefficient is a statistical measure of the strength of a linear relationship between paired data.
$r_{xy} = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2} \cdot \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2}}$
Value range: $r_{xy} \in [-1, 1]$
▪ 1 is total positive linear correlation
▪ 0 is no linear correlation
▪ −1 is total negative linear correlation

Correlation: Pearson correlation coefficient
Prerequisites:
▪ Linearity
▪ Attributes must be interval-scaled or binary
Interpretation:
▪ Susceptible to outliers
▪ Statistical significance depends on the sample size → Correlation might be purely coincidental
▪ Correlation vs. Causality
  ▪ E.g. spurious correlations
Example of a spurious correlation:
▪ Number of McDonald's sites in Germany and patients in hospitals in Germany
▪ Correlation: 0.9954
Source: in accordance with N. Zellmer, https://scheinkorrelation.jimdo.com/
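
The Pearson formula above translates almost directly into code. The following is a minimal sketch in plain Python, assuming two equally long lists of numbers; the function name `pearson` and the toy data are illustrative choices, not part of the lecture material.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: mean product of the deviations (covariance with 1/n).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # Denominator: product of the two standard deviations (1/n variant).
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy data (hypothetical): a perfect linear relation and a purely quadratic one.
r_linear = pearson([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
r_quadratic = pearson([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])
print(r_linear, r_quadratic)
```

The second toy pair depends deterministically on x (y = x²) and still yields a coefficient of zero, which illustrates the linearity prerequisite: the coefficient only captures linear relationships.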

Correlation: Pearson correlation coefficient
Four data sets with the same correlation coefficient $r = 0.816$
[Anscombe (1973): Graphs in Statistical Analysis]
Image source: https://matheguru.com/stochastik/korrelation-korrelationskoeffizient.html

Alternative Correlation Coefficients: Rank correlation
E.g. for ordinal data → Calculation of the Pearson correlation coefficient is not possible
Alternative:
1. Calculate the rank of the raw values
2. Calculate the Pearson correlation coefficient of the ranks
→ The obtained coefficient is called Spearman's rho $r_s$
X        4  1  2  6  5
Y        3  5  5  5  7
Rank(X)  3  1  2  5  4
Rank(Y)  1  3  3  3  5
Take the average rank for equal values.
Here: $r_s = 0.2236$

Alternative Correlation Coefficients: Rank correlation
Definition: The Spearman rank correlation coefficient or Spearman's rho is a statistical measure of the monotonic relationship between two variables.
▪ Spearman's rho always lies in the interval $[-1, 1]$
▪ $r_s = 1$ → Increasing monotonic trend
▪ $r_s = -1$ → Decreasing monotonic trend
https://stackabuse.com/calculating-spearmans-rank-correlation-coefficient-in-python-with-pandas/

Data Visualization: Overview
▪ Essential part of an exploratory data analysis
▪ Illustrate the data and detect first trends and patterns
Visualization of more than three dimensions is often difficult
▪ Complex relationships between multiple features cannot be depicted
▪ Dimensionality reduction techniques can help for visualization
▪ See lecture IV
[Figures: 1D, 2D and 3D example plots. Images: Mathworks]
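
The two-step rank correlation procedure from the rank correlation slide (rank the raw values with average ranks for ties, then apply the Pearson coefficient to the ranks) can be sketched as follows. The helper names `average_ranks`, `pearson` and `spearman` are illustrative; the data are the X and Y values from the table above.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2 + 1  # average of the 1-based positions
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient (the 1/n factors cancel out)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y))
    return num / den


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


x = [4, 1, 2, 6, 5]
y = [3, 5, 5, 5, 7]
rho = spearman(x, y)
print(average_ranks(x), average_ranks(y), round(rho, 4))
```

For the slide's example the ranks come out as 3, 1, 2, 5, 4 and 1, 3, 3, 3, 5, and the rounded coefficient reproduces the stated value of 0.2236.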

Data Visualization: Nominal Data
[Figures: example charts. Images: Tibco, Mathworks]
▪ There are many different types of charts and diagrams
▪ Use depends on the intended purpose
  ▪ What shall be depicted?
  ▪ Data type (nominal / numeric)
  ▪ Number of dimensions
▪ When used in a presentation, the addressee must be taken into account
▪ Graphs should always be clearly labeled

Data Visualization: Misleading graphs
! Graphs can be misleading and exaggerate trends

Use Case Aircraft Engines - Continuation: Exploratory Statistics
Sensor correlation analysis (using Pearson correlation)
[Figures: sensor correlation matrix; correlation with RUL]
▪ Some high positive and negative correlations between the sensor data
  ▪ The information from these sensors might be redundant
▪ Several high correlations to the remaining useful life
  ▪ Probably important sensors for a RUL prediction
Developed hypothesis: The aging of the aircraft engines can be seen in the sensor curves T50, P30, Ps30 and phi.

Use Case Aircraft Engines: Data Visualization
Sensor data over time for three different engines
▪ Confirmation of the hypothesis: Aging is visible in the sensor data
▪ Temperature and pressure are constant at the beginning
▪ Temperature T50 increases towards the end of life
▪ Pressure P30 decreases towards the end of life
Developed hypothesis: An increase in temperature / a decrease in pressure signifies aging of the engines or a beginning fault.
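
The idea from the sensor correlation analysis, that highly correlated sensors may carry redundant information, can be sketched as a simple pairwise screening. Everything specific in this sketch is an assumption made for illustration: the helper `redundant_pairs`, the threshold of 0.9, and the toy sensor columns, which are not C-MAPSS values.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y))
    return num / den


def redundant_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for all column pairs with |r| >= threshold."""
    pairs = []
    for name_a, name_b in combinations(sorted(columns), 2):
        r = pearson(columns[name_a], columns[name_b])
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs


# Hypothetical toy readings: sensor_b is roughly twice sensor_a, sensor_c is unrelated.
toy_sensors = {
    "sensor_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "sensor_b": [2.1, 3.9, 6.2, 8.0, 9.9],
    "sensor_c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
pairs = redundant_pairs(toy_sensors)
print(pairs)
```

A pair flagged this way is only a candidate for redundancy; whether one of the two sensors can really be dropped still has to be judged with domain knowledge and the plots discussed above.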

Use Case Aircraft Engines: Data Visualization, Pair plots
Pair plots help to understand the relationship between features:
▪ Negative correlation between T50 and P30
▪ Relationship with the RUL depends on the RUL value itself
  ▪ Low RUL: positive correlation to P30 and negative correlation to T50
  ▪ High RUL: no correlation to the sensor data visible
→ The correlations are time-dependent
→ This is not visible when using only the Pearson correlation coefficient

Physical interpretation (based on statistical analysis)
[Figure: fault in the high pressure compressor (HPC), Ps30 (static pressure at HPC exit), change of other components and sensors (P30, T50, φ), health index/status, RUL prognosis]
Parameter  Description                      Unit
Ps30       Static pressure at HPC outlet    psia
P30        Total pressure at HPC outlet     psia
T50        Total temperature at LPT outlet  °R
φ          Ratio of fuel flow to Ps30       pps/psi

Data Understanding
DATA QUALITY

Assessing data quality: How to create trust and confidence
Definition: Data quality: Data are of high quality if they are suitable for their intended use in operations, decision support and planning.
Definition: Meta data: Structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information source.
Data quality → Information content → Model quality → Trust & confidence

Insufficient data quality
Example: „A salmon swimming down the river“
[Figure: AI generated pictures]

Assessing the data quality: Classes and dimensions for data quality assessment
[Figure: classes and dimensions of data quality arranged around an "Area of high data quality": Understandability, Expressiveness, Unambiguousness, Documentation, Conformity, System inherence, Availability, Access & reference, Safety & security, Latency, Processing, Reliability, Accuracy/variance, Authenticity, Consistency, Objectivity, Redundancy, Credibility, Relevance, Timeliness, Completeness, Coherence]

Assessing the data quality: Cost Benefit Analysis
Investment trade-off for data quality
[Figure: costs over a data quality metric, with curves for Total Costs, Protection and Handling, and a marked cost optimum]
High cost for: Cleaning & Pre-Processing, Selection, …
VS
High cost for: Hardware, Acquisition, Processes, Testing & Auditing, …
For each task, the optimal data quality must be assessed with regard to total costs!

Typical data errors
Examples from drone flight data
▪ Missing Values

Signal Acquisition: Reasons for errors can be found in all steps
Physical System → Sensors → Digital Acquisition System (DAQ: Amplification, Filtering, A/D Conversion) → Data Transport & Storage → Signal Processing
We want to gain specific information from the data. But all steps of the data acquisition process influence the signal!

Illustration of measurement errors
Errors can be classified into three types that can overlap.
Definition: Measurement errors are samples that do not represent the real physical value. However, having only the sensor's information makes it difficult to classify a measurement error.
[Figure: measurement value vs. real value over time for each error type]
▪ Systematic errors: Bias / Offset; related to accuracy
▪ Random errors: Variance; related to precision; Signal to noise ratio (SNR): $SNR_{dB} = 10 \log_{10} \frac{P_{Signal}}{P_{Noise}}$
▪ Dynamic errors: Lag / Delay; 1st order delay: $G(s) = \frac{1}{1 + Ts}$

Common data set issues
IMBALANCED DATA

Class Imbalance
Imbalanced data is a common problem in classification data sets!
Number of samples: $N_{majority\ class} \gg N_{minority\ class}$
Example: Fault diagnosis for predictive maintenance
▪ E.g. diagnosis of aircraft engines
▪ Real-world data
  ▪ Lots of data of healthy engines
  ▪ Low amount of data for faulty engines (safety critical)
Problems related to class imbalance:
▪ Bias towards majority class when learning ML models
  ▪ Higher weighting of minority classes
  ▪ Heavier penalization of misclassifications
▪ Bias of evaluation metrics like accuracy
  ▪ Use alternative evaluation metrics like precision or recall

Resampling methods: Creation of a balanced data set
Oversampling
▪ Create random copies of the minority class instances
▪ Instances can appear multiple times
▪ Copying of the samples increases the likelihood of overfitting!
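
Random oversampling as described above can be sketched in plain Python. This is a minimal illustration, assuming samples and labels are given as parallel lists; the function name `random_oversample`, the fixed seed and the toy engine labels are hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority instances until all classes are equally large."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the majority class
    out_samples, out_labels = list(samples), list(labels)
    for cls, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == cls]
        # Draw with replacement, so instances can appear multiple times.
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


# Hypothetical toy data: 8 healthy engines, 2 faulty engines.
samples = [[float(i)] for i in range(10)]
labels = ["healthy"] * 8 + ["faulty"] * 2
balanced_samples, balanced_labels = random_oversample(samples, labels)
print(Counter(balanced_labels))
```

Because the added rows are exact copies, a model can memorize them, which is the overfitting risk named in the last bullet. Resampling is normally applied to the training split only, so that the evaluation data keeps the real class distribution.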
Undersampling
▪ Random downsampling of the majority class instances
▪ No artificially generated data
▪ Loss in available amount of data can lead to reduced performance
Image: TurinTech

Data Augmentation: Creation of a balanced data set
SMOTE (Synthetic Minority Oversampling Technique)¹: a popular oversampling technique
1. Select a sample $x_1$ from the minority class
2. Calculate its k nearest neighbors
3. Randomly select one of the nearest neighbors, $x_2$
4. Draw a value $t$ from a uniform distribution on $[0, 1]$
5. Calculate the synthetic sample $x' = x_1 + t (x_2 - x_1)$
6. Repeat the process for other minority samples
Image DOI: 10.7717/peerj-cs.1280/fig-2
→ No copying of existing samples
→ Potentially less risk of overfitting compared to random oversampling
→ Easy implementation, but high computational costs for high-dimensional data
¹ Paper: Chawla et al.: SMOTE: Synthetic Minority Over-sampling Technique

Topics of Today - Summary
▪ Introduction and Methodology
▪ Data Collection
▪ Exploratory Data Analysis
▪ Data Quality
▪ Imbalanced Data
▪ Next Lecture: Data Preparation and Preprocessing

THANK YOU!

References
▪ Cleve, Jürgen; Lämmel, Uwe: Data Mining. 3. Auflage. De Gruyter. (2020)
▪ Han, Jiawei; Kamber, Micheline; Pei, Jian: Data Mining. Concepts and techniques. 3rd ed. (Online). Elsevier professional. (2011)
▪ Jebb, Andrew T.; Parrigon, Scott; Woo, Sang Eun: Exploratory data analysis as a foundation of inductive research. In: Human Resource Management Review 27 (2). (2017)
▪ Zins, Chaim: Conceptual approaches for defining data, information, and knowledge. In: J. Am. Soc. Inf. 58 (4). (2007)
▪ Bernstein, Jay H.: The Data-Information-Knowledge-Wisdom Hierarchy and its Antithesis. (2009)
▪ Puthusserypady, Sadasivan: Applied Signal Processing. Now Publishers. (2021)
▪ Anscombe, F. J.: Graphs in Statistical Analysis. In: The American Statistician 27 (1). (1973)
▪ Cleff, Thomas: Deskriptive Statistik und explorative Datenanalyse. Springer. (2015)
▪ Chawla, Nitesh V.; Bowyer, Kevin W.; Hall, Lawrence O.; Kegelmeyer, W. Philip: SMOTE: Synthetic Minority Oversampling Technique. In: Journal of Artificial Intelligence Research Vol. 16 (2002)
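
Appendix: sketch of the SMOTE steps
As a closing illustration, the six SMOTE steps listed under Data Augmentation can be sketched in plain Python. This is a simplified sketch and not the reference implementation from Chawla et al.: the function name `smote`, the brute-force Euclidean neighbor search, the random choice of the base sample, and the toy points are all assumptions made for illustration.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # Step 1: select a minority sample x1.
        i = rng.randrange(len(minority))
        x1 = minority[i]
        # Step 2: its k nearest neighbors (brute force, Euclidean distance).
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(x1, p))
        # Step 3: randomly pick one of the neighbors as x2.
        x2 = rng.choice(others[:k])
        # Step 4: draw t uniformly (random() samples from [0, 1)).
        t = rng.random()
        # Step 5: interpolate x' = x1 + t * (x2 - x1).
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(x1, x2)))
    return synthetic


# Hypothetical two-dimensional minority samples.
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 0.8), (1.2, 1.9)]
new_points = smote(minority, n_new=6)
print(new_points)
```

Each synthetic point lies on the line segment between a minority sample and one of its neighbors, so no existing sample is copied. The brute-force neighbor search also hints at why the cost grows for large, high-dimensional data sets.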