FINALS COVERAGE
LESSON 4.2

DATA TRANSFORMATION
- A process of transforming data from one format to another.
- It aims to transform the data values into a corresponding format, scale, or unit.
- An important step in data preprocessing and a prerequisite.

POSSIBLE OPTIONS FOR DATA TRANSFORMATION:
1. NORMALIZATION - A way to scale a specific variable to fall within a small, specific range.
   a. Min-max normalization - Transforming values to a new scale.
   b. Z-score standardization - Transforming a numerical variable to a standard normal distribution.
2. ENCODING AND BINNING
   a. Binning - The process of transforming numerical variables into categorical counterparts.
      - Equal-width partitioning - Divides the range into N intervals of equal size.
      - Equal-depth partitioning - Divides the values into N intervals, each containing approximately the same number of samples.
   b. Encoding - The process of transforming categorical values into binary or numerical counterparts.
      - Binary encoding (unsupervised) - Takes the value 0 or 1 to indicate the absence or presence of a category.
      - Class-based encoding (supervised)
        - Discrete class - Replace each category of the categorical variable with its CORRESPONDING PROBABILITY.
        - Continuous class - Replace each category of the categorical variable with its CORRESPONDING AVERAGE.

(The three sketches below illustrate these options.)
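A minimal sketch of the two normalization options, assuming a small made-up numeric column (the values are purely illustrative, not from the notes):

```python
import numpy as np

values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # hypothetical numeric column

# Min-max normalization: rescale the values onto a new [0, 1] scale.
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score standardization: rescale to mean 0 and standard deviation 1,
# i.e. toward a standard normal distribution.
z_score = (values - values.mean()) / values.std()

print(min_max)  # [0.   0.25 0.5  0.75 1.  ]
print(z_score)  # mean 0, standard deviation 1
```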
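A minimal sketch of the two partitioning schemes for binning, again on made-up values. Equal width cuts the range into same-size intervals, while equal depth cuts at quantiles so each bin holds about the same number of samples:

```python
import numpy as np

values = np.array([5, 8, 9, 12, 15, 21, 24, 26, 30])  # hypothetical values
n_bins = 3

# Equal-width partitioning: divide the range [min, max] into N equal intervals.
width_edges = np.linspace(values.min(), values.max(), n_bins + 1)
width_bins = np.digitize(values, width_edges[1:-1])  # bin index per value

# Equal-depth partitioning: cut at quantiles so bins have roughly equal counts.
depth_edges = np.quantile(values, np.linspace(0, 1, n_bins + 1))
depth_bins = np.digitize(values, depth_edges[1:-1])

print(width_bins)  # [0 0 0 0 1 1 2 2 2] -> bin counts 4, 2, 3
print(depth_bins)  # [0 0 0 1 1 1 2 2 2] -> bin counts 3, 3, 3
```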
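A minimal sketch of the two encoding styles, assuming a hypothetical Trend predictor and a Yes/No class column (the column names echo the worked samples at the end of these notes):

```python
import pandas as pd

# Hypothetical data: a categorical predictor and a discrete Yes/No class.
df = pd.DataFrame({
    "Trend": ["Up", "Down", "Up", "Flat", "Up", "Down"],
    "Cheat": ["Yes", "No", "Yes", "No", "No", "Yes"],
})

# Binary encoding (unsupervised): one 0/1 indicator column per category.
binary = pd.get_dummies(df["Trend"], prefix="Trend").astype(int)

# Class-based encoding, discrete class (supervised): replace each category
# with its corresponding probability of the class of interest, P(Cheat = Yes).
prob = df.groupby("Trend")["Cheat"].apply(lambda s: (s == "Yes").mean())
df["Trend_encoded"] = df["Trend"].map(prob)

print(binary)  # Trend_Down, Trend_Flat, Trend_Up indicator columns
print(df)      # "Up" maps to 2/3, since 2 of the 3 "Up" rows are Yes
```

For a continuous class, the same groupby would take the average of a numeric target column instead of a probability (see the continuous-class sample at the end).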
DATA CLEANING
- Addresses anomalies.
- It is the process of altering data in a given storage resource.
- Attempts to fill in missing values.

DATA CLEANING TASKS:
1. Fill in missing values
   - Prediction - Uses one or several predictors.
2. Cleaning noisy data
   Solutions:
   a. Binning - Transforming numerical values into categorical counterparts.
   b. Clustering - Grouping data into a corresponding cluster.
   c. Regression - Utilizing a simple regression line.
   d. Combined computer and human inspection - Detecting suspicious values.
   (NEEDS TO BE ROUNDED OFF)
3. Identifying outliers
   Solution: Box plot

DATA CLEANING STEPS:
1. Monitor the errors
2. Standardization of the mining processes - Reduces the risk of duplication.
3. Validation of the data accuracy
4. Scrub for duplicate data - To identify the duplicates.
5. Analyze
6. Communicate with the team

DATA REDUCTION
- A process of obtaining a reduced representation of the data set.

DATA REDUCTION STRATEGIES:
a. Sampling - Utilizing a smaller representative sample that will generalize the entire population.

   TYPES OF SAMPLING:
   1. Simple random sampling - There is an equal probability of selecting any object.
   2. Sampling without replacement - Selected objects are removed from the population.
   3. Sampling with replacement - Objects are not removed from the population.
   4. Stratified sampling - Split the data into several partitions, then draw samples from each partition.
   (A sketch of these sampling types appears at the end of these notes.)

b. Feature subset selection - Reduces the dimensionality of data by eliminating redundant features.

   FEATURE SUBSET SELECTION TECHNIQUES:
   1. Brute-force approach - Try all possible feature subsets.
   2. Embedded approaches - Feature selection occurs naturally as part of the data mining algorithm.
   3. Filter approaches - Features are selected before data mining.
   4. Wrapper approaches - Use the data mining algorithm as a black box.

c. Feature creation - Creating new attributes that can capture the important information.

   FEATURE CREATION METHODS:
   1. Feature extraction
   2. Mapping data to a new space
   3. Feature construction

LESSON 5

CLASSIFICATION
- The prediction of a class or category.
- Predicting the class variable.
- Historical data are used to build a model.

REGRESSION - The prediction of a numerical value.

CLASSIFICATION ALGORITHMS:
1. ZeroR - The simplest method; it relies only on the target and ignores all predictors.
2. OneR - Simple yet accurate; it generates one rule for each predictor and keeps the rule with the smallest error.
3. Naïve-Bayes - A frequency-based classifier that uses a probabilistic framework.
4. Decision Tree - Builds classification models in the form of a tree structure.
5. Nearest Neighbours - An intuitive method that classifies unlabeled data based on distance to the labeled examples.
6. Artificial Neural Network (ANN) - A network of perceptrons or nodes.
7. Support Vector Machine (SVM) - Performs classification by finding a plane that separates the classes.
8. Ensemble - Combines a set of classifiers to predict the class of previously unseen records.
9. Random Forests - A relatively modern algorithm; an ensemble of decision trees.

Model Evaluation - A methodology used to find the model that best represents the data.

LESSON 6

Association Rule Mining - A rule-based method for discovering relationships between variables.
Clustering - The task of assigning a set of objects into groups called clusters.
K-means Clustering - A basic partitional clustering approach that groups objects based on their attributes.
Centroid - The center point of a cluster.
Sequential Pattern Mining - Concerned with finding statistically relevant patterns; performed by growing subsequences.
Sequence - An ordered list of elements.
Hierarchical Clustering - Produces a set of nested clusters.
Dendrogram - A tree-like diagram that records the sequence of merges or splits.
Text Mining - Also known as text data mining or knowledge discovery in text; a semi-automated process of extracting knowledge.

TEXT MINING STEPS:
1. Establish the corpus
2. Create the term-document matrix
3. Extract knowledge from the term-document matrix

Social Media Sentiment Analysis - Takes two main types of textual information: facts and opinions.
Sentiment Analysis (Opinion Mining) - The computational study of opinions.
Regression - A data mining task of predicting the target's value.
Regression Analysis - The most widely used statistical technique.
Multiple Linear Regression - A method used to model the linear relationship between a target and several predictors.
Indicator Variables (dummy variables) - Used to model qualitative variables in regression.
Multicollinearity - The inflation of the variance of coefficient estimates, caused by correlated predictors.
Logistic Regression - Predicts the probability of an outcome.
Large Variances - Imply unstable predictions.

DATA CLEANING SAMPLE (NEEDS TO BE ROUNDED OFF):
- TO GET THE TAXABLE INCOME: Add all the present data and divide by their count. {all data / 7}
- TO GET THE 'YES' TAXABLE INCOME (only possible if there is a Cheat column): Add all the 'Yes' entries in the Taxable Income column and divide by their count, excluding the missing 'Yes' entry. {(85k + 90k) / 2}
- TO GET THE 'NO' TAXABLE INCOME (only possible if there is a Cheat column): Add all the 'No' entries in the Taxable Income column and divide by their count, excluding the missing 'No' entries. {all 'No' data / 5}

EQUAL WIDTH BINNING

EQUAL DEPTH BINNING

BINARY ENCODING
- The indicator columns are set to 1 depending on what is in column 1 (Trend). If it is "Up", set "Trend_Up" to 1 and put zero in "Trend_Down/Flat".

CLASS-BASED ENCODING
- Discrete class - The count of whatever class is in the parentheses of the probability in Table 1, for example "(Yes)", becomes the numerator (dividend), and the denominator (divisor) is the total count of No and Yes; so 2 / 3 ≈ 0.66.
- Continuous class - Just add all the values with the same trend and divide by the frequency (how many there are). For example, for "Up": 21 + 24 + 26 = 71, then 71 / 3, since there are three "Up" rows in Table 1. Whatever averages come out, place them in Table 2.
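A sketch tying the BINARY ENCODING and CLASS-BASED ENCODING samples above together; Table 1 here is hypothetical except for the values 21, 24, and 26, which are taken from the continuous-class example:

```python
import pandas as pd

# Hypothetical Table 1 with a Trend column and a numeric column.
table1 = pd.DataFrame({
    "Trend": ["Up", "Down", "Up", "Flat", "Up"],
    "Value": [21, 18, 24, 20, 26],  # 21, 24, 26 are the "Up" values from the notes
})

# Binary encoding: Trend_Up is 1 where Trend is "Up" and 0 otherwise, etc.
for category in ["Up", "Down", "Flat"]:
    table1[f"Trend_{category}"] = (table1["Trend"] == category).astype(int)

# Continuous class: average the numeric column per trend and place the
# results in Table 2, e.g. (21 + 24 + 26) / 3 = 71 / 3 for "Up".
table2 = table1.groupby("Trend")["Value"].mean().round(2)

print(table1)
print(table2)  # Up -> 23.67
```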
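A sketch of the missing-value fill from the DATA CLEANING SAMPLE above, assuming a made-up Taxable Income / Cheat table: only the 85k and 90k 'Yes' figures come from the notes; the five 'No' incomes and the three missing rows are illustrative stand-ins chosen to match the counts ({.../7}, {.../2}, {.../5}):

```python
import pandas as pd

# Hypothetical table: 7 present incomes (2 Yes + 5 No) and 3 missing entries.
df = pd.DataFrame({
    "Taxable Income": [125.0, 100.0, 70.0, 120.0, None, 60.0, None, 85.0, None, 90.0],
    "Cheat":          ["No",  "No",  "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
})

# Overall mean: add all present data and divide by their count (here 7).
overall_mean = round(df["Taxable Income"].mean())

# Class-conditional means: average the present values within each Cheat
# group, e.g. (85 + 90) / 2 for 'Yes', then fill each missing entry with
# its own group's rounded-off mean.
group_means = df.groupby("Cheat")["Taxable Income"].transform("mean").round()
df["Taxable Income"] = df["Taxable Income"].fillna(group_means)

print(overall_mean)  # 650 / 7, rounded off
print(df)            # 'Yes' gap -> 88 (87.5 rounded), 'No' gaps -> 95
```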
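Finally, the sampling sketch promised in the DATA REDUCTION section, illustrating sampling with and without replacement and stratified sampling on a hypothetical ten-record population:

```python
import pandas as pd

# Hypothetical population with a label column to stratify on.
population = pd.DataFrame({
    "id": range(10),
    "label": ["A"] * 6 + ["B"] * 4,
})

# Simple random sampling without replacement: selected objects are
# removed from the pool, so no record can be drawn twice.
without = population.sample(n=4, replace=False, random_state=0)

# Sampling with replacement: objects stay in the pool and may repeat.
with_repl = population.sample(n=4, replace=True, random_state=0)

# Stratified sampling: split the data into partitions by label and draw
# the same fraction from each partition.
stratified = population.groupby("label").sample(frac=0.5, random_state=0)

print(without, with_repl, stratified, sep="\n\n")
```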