Feature Engineering Module 2 PDF

Summary

This document discusses feature engineering, focusing on Module 2. It covers various data preprocessing techniques, including data cleaning, integration, reduction, and transformation. The document explains concepts like object identification, derivable data, chi-square tests, and correlation analysis. It provides methods to identify and address data conflicts.

Full Transcript


FEATURE ENGINEERING Module 2

DATA PREPROCESSING TECHNIQUES
▪Data cleaning can be applied to remove noise and correct inconsistencies in the data.
▪Data integration merges data from multiple sources into a coherent data store.
▪Data reduction can reduce data size by, for instance, aggregating, eliminating redundant features, or clustering.
▪Data transformations (e.g., normalization) may be applied, where data are scaled to fall within a smaller range such as 0.0 to 1.0.

2. DATA INTEGRATION
▪Data integration is the merging of data from multiple data stores.
▪Redundant data occurs for the following reasons:
- Object identification: the same attribute or object may have different names in different databases.
- Derivable data: one attribute may be a "derived" attribute in another table.
▪Careful integration helps reduce and avoid redundancies and inconsistencies in the resulting data set.
▪This can improve the accuracy and speed of the subsequent machine learning process.

A. ENTITY IDENTIFICATION PROBLEM
Object identification: the same attribute or object may have different names in different databases, e.g., A.cust-id = B.cust-#, or Bill Clinton = William Clinton.
Schema integration and object matching can be tricky. Solution: integrate metadata from the different sources, identify real-world entities across multiple data sources, and detect and resolve tuple duplication and data value conflicts.
Ex: How can the data analyst or the computer be sure that customer_id in one database and cust_number in another refer to the same attribute? The metadata for each attribute include its name, meaning, data type, range of permitted values, and null rules for handling blank, zero, or null values. Such metadata can be used to help avoid errors in schema integration, and to ensure that functional dependencies and referential constraints in the source system match those in the target system.
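The metadata-based matching described above can be sketched in Python. This is a minimal illustration, not a standard API: the metadata fields and the likely_same_attribute helper are hypothetical names invented for this example.

```python
# Minimal sketch of metadata-based attribute matching for the entity
# identification problem. Two attributes with different names are flagged
# as candidate matches when their metadata (data type, permitted value
# range, null rule) agree. All field names here are hypothetical.
def likely_same_attribute(meta_a, meta_b):
    """Return True when the non-name metadata of two attributes agree."""
    keys = ("dtype", "min", "max", "nulls_allowed")
    return all(meta_a[k] == meta_b[k] for k in keys)

cust_id = {"name": "customer_id", "dtype": "int",
           "min": 1, "max": 999999, "nulls_allowed": False}
cust_number = {"name": "cust_number", "dtype": "int",
               "min": 1, "max": 999999, "nulls_allowed": False}

print(likely_same_attribute(cust_id, cust_number))  # True
```

A real integration pipeline would also compare value distributions and sample overlaps, since identical metadata alone does not prove two columns refer to the same real-world attribute.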
B. REDUNDANCY AND CORRELATION ANALYSIS
Derivable data: one attribute may be a "derived" attribute in another table. For example, if one data set has the customer's age and another has the customer's date of birth, age is a redundant attribute because it can be derived from the date of birth.
Some redundancies can be detected by correlation analysis. Given two attributes, such analysis measures how strongly one attribute implies the other, based on the available data. For nominal data, we use the chi-square test. For numeric attributes, we can use the correlation coefficient and covariance, both of which assess how one attribute's values vary from those of another.

CHI-SQUARE TEST
▪The chi-square test is a statistical procedure for determining the relationship between two nominal (categorical) attributes, i.e., measuring the difference between observed and expected counts.
▪It helps determine whether a difference between two categorical variables is due to chance or to a relationship between them.
▪χ² = Σ (O − E)² / E, where O is the observed value and E is the expected value.
▪The degrees of freedom in a statistical calculation represent the number of values that can vary. For a contingency table, df = (r − 1)(c − 1), where r is the number of rows and c is the number of columns.
▪If the observed chi-square test statistic exceeds the critical value, the null hypothesis can be rejected.

HOW TO SOLVE CHI-SQUARE PROBLEMS?
1. State the hypotheses. H0 (null): there is no link between the two attributes.
2. Calculate the expected frequencies.
3. Compute the chi-square statistic.
4. Determine the degrees of freedom: df = (r − 1)(c − 1).
5. Find the critical value and compare.
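The steps above can be sketched as a small Python routine computed from scratch. The 2x2 contingency table below is hypothetical, invented for illustration; it is not the data from the slides.

```python
# Chi-square test of independence for two nominal attributes.
# Hypothetical 2x2 table: rows = gender, columns = preferred reading.
def chi_square(observed):
    """Return (chi2, df) for a contingency table given as a list of rows."""
    rows, cols = len(observed), len(observed[0])
    row_tot = [sum(r) for r in observed]
    col_tot = [sum(observed[i][j] for i in range(rows)) for j in range(cols)]
    grand = sum(row_tot)
    chi2 = 0.0
    for i in range(rows):
        for j in range(cols):
            expected = row_tot[i] * col_tot[j] / grand      # step 2
            chi2 += (observed[i][j] - expected) ** 2 / expected  # step 3
    df = (rows - 1) * (cols - 1)                            # step 4
    return chi2, df

table = [[90, 10],   # hypothetical counts: fiction, non-fiction
         [30, 70]]
chi2, df = chi_square(table)
# Step 5: for df = 1, the critical value at the 0.001 level is 10.83.
print(chi2, df, chi2 > 10.83)  # 75.0 1 True
```

Since 75.0 exceeds 10.83, the null hypothesis of independence would be rejected for this made-up table.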
Ex 1: Apply a chi-square test to the given data to determine the correlation between gender and preferred reading.
Solution:
Step 3: Compute the chi-square statistic.
Step 4: Determine df: df = (r − 1)(c − 1) = (2 − 1)(2 − 1) = 1.
Step 5: Find the critical value and compare. For 1 degree of freedom, the chi-square value needed to reject the hypothesis at the 0.001 significance level is 10.83. Since our computed value is above this, we can reject the hypothesis that gender and preferred reading are independent, and conclude that the two attributes are (strongly) correlated for the given group of people, i.e., reading habit and gender are dependent.

Ex 2: Use the chi-square test to verify that the two attributes pass status and location of stay are correlated for the given group of people. Since our computed value is above the critical value, we can reject the null hypothesis: the attributes are dependent.

COVARIANCE AND CORRELATION OF NUMERIC DATA
▪Covariance is a measure of the extent to which two random variables change together; it signifies the direction of the linear relationship between the two variables.
▪Correlation is a measure of how strongly two random variables are related to each other.
▪Covariance can vary between −∞ and +∞; correlation ranges between −1 and +1.

COVARIANCE
Variables of data can express relations. When two variables share the same tendency, we speak of covariance. Variance tells you how a single variable varies; covariance tells you how two variables vary together. A positive covariance means the variables are positively related, while a negative covariance means the variables are inversely related. The covariance between two variables X and Y is defined as
cov(X, Y) = (1/n) Σᵢ (xᵢ − x̄)(yᵢ − ȳ)
where n is the length of both sets and x̄, ȳ are the means of X and Y.
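A minimal sketch of the covariance formula above, together with the Pearson correlation discussed next (covariance divided by the product of the standard deviations). The data points are made up; a perfectly linear pair is used so the expected results are easy to check by hand.

```python
import math

def covariance(x, y):
    """Population covariance: mean of products of deviations from the means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n

def pearson(x, y):
    """Pearson correlation: covariance normalized by both standard deviations."""
    sx = math.sqrt(covariance(x, x))   # std dev of x
    sy = math.sqrt(covariance(y, y))   # std dev of y
    return covariance(x, y) / (sx * sy)

x = [1, 2, 3, 4]
y = [2, 4, 6, 8]   # y = 2x, a perfectly linear relationship
print(covariance(x, y), round(pearson(x, y), 6))  # 2.5 1.0
```

The positive covariance (2.5) tells us only the direction of the relationship; the correlation, being scale-free, also tells us the strength, and here it is (numerically) 1, the perfect-correlation case described below.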
PEARSON'S CORRELATION
▪Correlation shows whether, and how strongly, pairs of variables are related. Pearson's correlation measures the linear correlation between two variables X and Y:
r = cov(X, Y) / (σX σY)
where σX and σY are the standard deviations of X and Y.
▪Pearson's correlation is always between −1 and +1, where the magnitude depends on the degree of correlation. If r = −1 or +1, the variables are perfectly correlated, so one variable can predict the other very well.
Properties:
- If Pearson's correlation is −1, the variables are perfectly negatively correlated.
- If Pearson's correlation is +1, the variables are perfectly positively correlated.
- If Pearson's correlation is 0, the variables are not linearly correlated.
- Pearson's correlation captures linear (first-order) correlations, but not nonlinear correlations.
- It does not work well in the presence of outliers.

EX: CALCULATE COVARIANCE AND PEARSON CORRELATION COEFFICIENT
Conclusion: the covariance between temperature and number of customers is 22.46. Since the covariance is positive, temperature and number of customers have a positive relationship: as temperature rises, so does the number of customers. The correlation coefficient of 0.8 shows that the strength of the correlation between temperature and number of customers is very strong.

TUPLE DUPLICATION
▪In addition to detecting redundancies between attributes, duplication should also be detected at the tuple level.
▪Duplicate tuples generally occur due to inaccurate data entry, or to updates applied to some copies of the same data but not others.
▪A tuple is a finite, ordered sequence of values, such as numbers or other objects.
▪In Python, duplicates can be removed with set() and tuple(): convert the tuple to a set, which removes duplicates, and then convert it back with tuple(). Note that this does not preserve the original order.

DATA VALUE CONFLICT DETECTION
▪Data conflict means that data merged from different sources do not match, because the same information is represented differently in the different data sets.
▪Ex: the price of a hotel room may be represented in different currencies in different cities.
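As a sketch of resolving the hotel-price conflict above, the snippet below standardizes prices from different sources into a single currency before integration. The conversion rates and the standardize_price helper are invented for illustration; real systems would pull rates from an authoritative source.

```python
# Resolving a data value conflict: room prices merged from different
# sources use different currencies, so standardize everything to USD
# before integration. Rates below are hypothetical, not real quotes.
TO_USD = {"USD": 1.0, "EUR": 1.1, "INR": 0.012}

def standardize_price(amount, currency):
    """Convert a price to USD using the (hypothetical) rate table."""
    return round(amount * TO_USD[currency], 2)

prices = [(120.0, "USD"), (100.0, "EUR"), (9000.0, "INR")]
print([standardize_price(a, c) for a, c in prices])  # [120.0, 110.0, 108.0]
```

After standardization, the prices are directly comparable, so duplicate detection and aggregation across sources become meaningful.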
Approaches for detecting and handling such conflicts include:
- Data profiling and exploration
- Standardization
- Duplicate detection algorithms (cross-referencing)
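The set()/tuple() deduplication described under Tuple Duplication can be sketched as follows; the records are hypothetical.

```python
# Removing duplicate tuples, as described above: convert to a set
# (which discards duplicates), then back to a tuple.
records = (("alice", 30), ("bob", 25), ("alice", 30))

deduped = tuple(set(records))            # order NOT guaranteed
ordered = tuple(dict.fromkeys(records))  # duplicates removed, order kept

print(sorted(deduped))   # [('alice', 30), ('bob', 25)]
print(ordered)           # (('alice', 30), ('bob', 25))
```

The dict.fromkeys() variant is a common alternative when the original tuple order must survive deduplication, since dictionaries preserve insertion order.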
