Data Categorization Lecture Notes PDF
Uploaded by StainlessCoral
Faculty of Computer Science and Information Technology
Summary
These lecture notes cover various aspects of data categorization in data science. They discuss the different forms of datasets, data in data science, and data categorization, including the NOIR typology (nominal, ordinal, interval, and ratio scales) and binary data (symmetric and asymmetric).
Full Transcript
Outline
Different forms of datasets
Data in data science
Data categorization
NOIR typology
Nominal scale
  Binary: symmetric, asymmetric
Ordinal scale
Interval and ratio scales

Types of dataset: (1) Record data
Relational records: relational database tables, highly structured
Data matrix: e.g., numerical matrix, crosstabs
Transaction data
Document data: term-frequency vector (matrix) of text documents

Types of dataset: (2) Graphs and networks

Data in Data Science
Entity: A particular thing is called an entity or object.
Attribute: An attribute is a measurable or observable property of an entity.
Data (measurement): A measurement of an attribute is called data.
Computers can manage all types of data (e.g., audio, video, text, etc.).

Data Categorization
NOIR stands for N: Nominal, O: Ordinal, I: Interval, R: Ratio.

Classification of Scales of Measurement
[Diagram: NOIR classification. Categorical (qualitative) data covers the nominal scale (binary, either symmetric or asymmetric; ternary; others) and the ordinal scale (alphabetically, numerically, or literally ordered). Numeric (quantitative) data covers the interval and ratio scales (discrete or continuous).]

Quantitative data and qualitative data
There are two general types of data, quantitative and qualitative, and both are equally important.

Properties of data
The following four properties (operations) of data are pertinent.

#   Property         Operation     Type
1   Distinctiveness  = and ≠       Categorical (qualitative)
2   Order            <, ≤, >, ≥    Categorical (qualitative)
3   Addition         + and -       Numerical (quantitative)
4   Multiplication   * and /       Numerical (quantitative)

Nominal scale
Definition: A variable that takes a value from a set of mutually exclusive codes that have no logical order is known as a nominal variable.
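As a minimal sketch of the nominal-variable definition above, the following Python snippet models a nominal variable with the standard-library `enum.Enum`. The `BloodGroup` class, the `supports_order` helper, and the sample values are illustrative names chosen here, not part of the lecture; the point is only that nominal codes can be tested for equality but carry no ordering.

```python
from collections import Counter
from enum import Enum


class BloodGroup(Enum):
    """Hypothetical nominal variable: mutually exclusive codes, no logical order."""
    A = "A"
    B = "B"
    AB = "AB"
    O = "O"


def supports_order(x, y):
    """Return True if `x < y` is defined, False if Python rejects the comparison."""
    try:
        x < y
        return True
    except TypeError:
        return False


# Made-up observations for illustration.
samples = [BloodGroup.A, BloodGroup.O, BloodGroup.A, BloodGroup.AB]

# Distinctiveness (= and ≠) is the only comparison a nominal scale offers.
same = BloodGroup.A == BloodGroup.A
different = BloodGroup.A != BloodGroup.B

# Plain Enum members define no ordering, so "A < B" is rejected.
ordered = supports_order(BloodGroup.A, BloodGroup.B)

# Counting observations per category remains meaningful.
counts = Counter(samples)
```

Using a plain `Enum` rather than integers is a deliberate choice in this sketch: integer codes such as 0 and 1 would silently allow `<` and `+`, which the nominal scale does not support.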
Examples
Gender: coded with letters or numbers, {M, F} or {1, 0}
Blood groups: coded with strings, {A, B, AB, O}
Rhesus (Rh) factors: coded with symbols, {+, -}
Country code: coded with numbers, e.g., 048, 040

Nominal scale: Notes
The nominal scale labels categories of data using a consistent naming convention.
The labels can be numbers, letters, or strings.
Nominal data thus partition a set of data into categories.
The number of categories should be two (binary) or more (ternary, etc.), but finite.
Nominal data may be numerical in form, but the numerical values have no mathematical interpretation. For example, prisoners may be numbered 100, 101, …, 110, but 100 + 110 = 210 is meaningless. The numbers are simply labels.
Two labels may be identical (=) or dissimilar (≠).
These labels do not have any ordering among themselves. For example, we cannot say blood group B is better or worse than group A.
Labels from two different attributes can be combined to give another nominal variable. For example, blood group with Rh factor (A+, A-, AB+, etc.).

Binary scale
Definition: A nominal variable with exactly two mutually exclusive categories that have no logical order is known as a binary variable.
Examples: Switch: {ON, OFF}; Attendance: {True, False}; Entry: {Yes, No}; etc.
Note: A binary variable is a special case of a nominal variable that takes only two possible values.

Symmetric and Asymmetric Binary Scale
The two choices of a binary variable may have unequal importance.
Symmetric binary variable: the two choices of the binary variable have equal importance.
Example: Gender = {male, female} (usually of equal probability).
Asymmetric binary variable: the two choices of the binary variable have unequal importance.
Example: medical test (positive vs. negative).
Convention: assign 1 to the most important outcome (e.g., COVID positive).

Ordinal scale
Definition: Ordered nominal data are known as ordinal data, and the variable that generates them is called an ordinal variable.
Example: Shirt size = {S, M, L, XL, XXL}
Note: Values of an ordinal variable can be compared literally or using relational operators (<, ≤, >, ≥).

Operations on ordinal data
Relational operators can usually be used on ordinal data.
The summary measures mode and median can be used on ordinal data.
Ordinal data can be ranked (numerically, alphabetically, etc.).
Calculations based on order are permitted (such as count, min, max, etc.).
Note: A numerical variable can be transformed into an ordinal variable and vice versa, but with a loss of information. For example, Age [1, …, 100] → {young, middle-aged, old}.

Interval scale
Definition: Interval data are measured along a numerical scale that has equal intervals between adjacent values.
Note:
Interval data have well-defined intervals.
Interval data do not have a true zero. For example, for temperature in Celsius and Fahrenheit, 0° does not mean the absence of temperature (no heat).

Operations on interval data
We can add to or subtract from interval data. For example: date1 + x days = date2.
Subtraction can also be performed between two values. For example: current date - date of birth = age.
Negation (changing the sign) and multiplication by a constant are permitted.
All operations defined on ordinal data are also valid here.
Transformations of the form (… + d), i.e., affine transformations, are permissible.
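The date examples given for interval data can be sketched with Python's standard `datetime` module, which treats calendar dates much like interval data: a date plus a duration is a date, and the difference of two dates is a duration. The specific dates and variable names below are made-up values for illustration, not taken from the lecture.

```python
from datetime import date, timedelta

# Made-up example values (assumptions for this sketch).
date1 = date(2024, 1, 10)
x_days = timedelta(days=30)

# date1 + x days = date2
date2 = date1 + x_days

# current date - date of birth = age (here expressed as a duration in days)
birth = date(2000, 3, 15)
today = date(2024, 3, 14)
age_in_days = (today - birth).days

# Adding two dates to each other is not defined, and Python rejects it.
try:
    date1 + date2
    dates_can_be_added = True
except TypeError:
    dates_can_be_added = False
```

The result of `today - birth` is a `timedelta`, not a `date`, which mirrors the idea that only differences between interval values carry a meaningful magnitude.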
Other one-to-one non-linear transformations (e.g., log, exp, sin, etc.) can also be applied.

Continuous and discrete data
Discrete data can only take on certain individual values.
Continuous data can take on any value in a certain range.

Ratio scale
Definition: Ratio data are measured along a numerical scale that has equal distances between adjacent values and a true zero.
Note:
Ratio data may be on a linear or non-linear scale.
Both interval and ratio data can be stored in the same data types (e.g., integer, float, double).
All ratio data are interval data, but the reverse is not true.

Operations on ratio data
All arithmetic operations on interval data are applicable to ratio data.
In addition, multiplication, division, etc. are allowed.
Any linear transformation of the form (ax + b)/c can be applied.
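To make the interval/ratio distinction concrete, the sketch below compares a ratio computed on an interval scale (temperature in Celsius, as in the notes) with one computed on a ratio scale (length in metres, an example chosen here and not taken from the lecture). The helper names and sample values are hypothetical.

```python
def celsius_to_fahrenheit(c):
    """Affine change of unit: the zero point moves, so ratios are not preserved."""
    return c * 9 / 5 + 32


def metres_to_centimetres(m):
    """Pure rescaling: the true zero stays at zero, so ratios are preserved."""
    return m * 100


# Interval scale: 20 °C looks like "twice" 10 °C only in this particular unit.
c1, c2 = 10.0, 20.0
ratio_celsius = c2 / c1
ratio_fahrenheit = celsius_to_fahrenheit(c2) / celsius_to_fahrenheit(c1)

# Ratio scale: 4 m is twice 2 m in any unit of length.
m1, m2 = 2.0, 4.0
ratio_metres = m2 / m1
ratio_centimetres = metres_to_centimetres(m2) / metres_to_centimetres(m1)

# Differences on the interval scale keep their proportions after conversion.
diff_celsius = c2 - c1
diff_fahrenheit = celsius_to_fahrenheit(c2) - celsius_to_fahrenheit(c1)
```

Here the Celsius ratio is 2 while the same two temperatures in Fahrenheit (50 and 68) give a different ratio, whereas the length ratio is unchanged by the unit conversion. That is the practical consequence of a scale having, or lacking, a true zero.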