Data Categorization PDF
Summary
This document provides a presentation explaining data categorization within data analytics. It explores various data scales, including nominal, ordinal, interval, and ratio scales, and clarifies their properties and distinctions.
Full Transcript
Data Analytics (CS401/CS634)
#2. Data Categorization

Data in Data Analytics
◦ Entity: A particular thing is called an entity or object.
◦ Attribute: An attribute is a measurable or observable property of an entity.
◦ Data: A measurement of an attribute is called data.
Note
◦ Data defines an entity.
◦ Computers can manage all types of data (e.g., audio, video, text, etc.).

Data in Data Analytics
◦ In general, many types of data can be used to measure the properties of an entity.
◦ A good understanding of data scales (also called scales of measurement) is important.
◦ Depending on the scale of measurement, different techniques are used to derive hitherto unknown knowledge, in the form of patterns, associations, anomalies, or similarities, from a volume of data.

Data, Data Sets, Elements, Variables, and Observations

Company        Annual Sales ($M)   Earn/Share ($)
Dataram              73.10              0.86
EnergySouth          74.00              1.67
Keystone            365.70              0.86
LandCare            111.40              0.33
Psychemedics         17.60              0.13

◦ Element names: the company names in the first column.
◦ Variables: Annual Sales ($M) and Earn/Share ($).
◦ Data set: the whole table.

NOIR Classification of Scales of Measurement

NOIR classification
◦ The most commonly recommended scales of measurement are
  N: Nominal
  O: Ordinal
  I: Interval
  R: Ratio
◦ The NOIR scale is the fundamental building block on which the extended data types are built.

NOIR Classification
[Figure: NOIR classification tree. Nominal and Ordinal are Categorical (Qualitative); Interval and Ratio are Numeric (Quantitative). Other node labels: Binary (Symmetric, Asymmetric), Ternary, Others; Alphabetically, Numerically, and Literally Ordered; Discrete, Continuous.]

Properties of data
◦ The following FOUR properties (operations) of data are pertinent.

#   Property          Operation      Type
1.  Distinctiveness   = and ≠        Categorical (Qualitative)
2.  Order             <, ≤, >, ≥     Categorical (Qualitative)
3.  Addition          + and -        Numerical (Quantitative)
4.  Multiplication    * and /        Numerical (Quantitative)

NOIR summary
◦ Nominal (with the distinctiveness property only)
◦ Ordinal (with the distinctiveness and order properties only)
◦ Interval (with the additive property + the properties of ordinal data)
◦ Ratio (with the multiplicative property + the properties of interval data)

Nominal scale
Definition
◦ A variable that takes a value from a set of mutually exclusive codes that have no logical order is known as a nominal variable.
Examples
◦ Gender (Letters or Numbers): {M, F} or {1, 0}
◦ Blood groups (String): {A, B, AB, O}
◦ Rhesus (Rh) factors (Symbols): {+, -}

Nominal scale: more
◦ Data are labels or names used to identify an attribute of the element.
◦ The labels can be numbers, letters, strings, enumerated constants, or other keyboard symbols.
◦ A nonnumeric label or a numeric code may be used.
◦ Nominal data divide a set of data into categories.
◦ The number of categories should be two (binary) or more (ternary, etc.), but countably finite.

Nominal scale: more
◦ Nominal data may be numerical in form, but the numerical values have no mathematical interpretation.
  ◦ For example, prisoners may be numbered 100, 101, …, 110, but 100 + 110 = 210 is meaningless. The numbers are simply labels.
◦ Two labels may be identical (=) or dissimilar (≠).
◦ These labels do not have any ordering among themselves.
  ◦ For example, we cannot say that blood group B is better or worse than group A.
◦ Labels (from two different attributes) can be combined to give another nominal variable.
  ◦ For example, blood group with Rh factor (A+, A-, AB+, etc.)

Binary scale
Definition
◦ A nominal variable with exactly two mutually exclusive categories that have no logical order is known as a binary variable.
Examples
◦ Switch: {ON, OFF}
◦ Attendance: {True, False}
◦ Pass: {Yes, No}

Symmetric and Asymmetric Binary Scale
◦ Different binary variables may have unequal importance.
◦ A binary variable is symmetric if both of its states are equally valuable, that is, there is no preference as to which outcome should be coded as 1.
  ◦ Example: Gender = {male, female} (usually of equal probability).
◦ A binary variable is asymmetric if the outcomes of its states are not equally important.
  ◦ Example: Food preference = {V, NV}; positive or negative outcomes of a disease test.

Operations on Nominal variables
◦ Summary statistics applicable to nominal data are the mode, the contingency coefficient, etc.
◦ Arithmetic operations (+, -, *, and /) and relational (ordering) operations (<, >, ≤, etc.) are not permitted; only the equality tests = and ≠ apply.
◦ The allowed operations are accessing (read, check, etc.) and re-coding (into another non-overlapping symbol set, that is, a one-to-one mapping), etc.
◦ Nominal data can be visualized using line charts, bar charts, pie charts, etc.
◦ Two or more nominal variables can be combined to generate another nominal variable.
  ◦ Example: Gender (M, F) × Marital status (S, M, D, W); blood group with Rh factor (A+, A-, AB+, etc.)

Ordinal scale
Definition
◦ Ordered nominal data are known as ordinal data, and the variable that generates them is called an ordinal variable.
◦ "Ordinal" indicates "order".
◦ Ordinal data are categorical (qualitative) data that have a naturally occurring order, but the differences between values are unknown.
Example
◦ Shirt size: {S, M, L, XL, XXL}
◦ How happy are you with the customer service?: {1 - Very Unhappy, 2 - Unhappy, 3 - Neutral, 4 - Happy, 5 - Very Happy}
◦ Students of a university are classified by their class standing using a nonnumeric label such as Freshman, Sophomore, Junior, or Senior.
Note
◦ The values assumed by an ordinal variable can be ordered among themselves, as each pair of values can be compared literally or by using relational operators (<, ≤, >, ≥).
◦ The data have the properties of nominal data, and the order or rank of the data is meaningful.
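The operations allowed on nominal and ordinal data, as described above, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample lists are made up, the numeric codes in `recode` are arbitrary, and the helper `is_smaller` is a hypothetical name, not something taken from the slides.

```python
from collections import Counter
from itertools import product
from statistics import median_low

# Nominal data: only equality tests, counting, and re-coding are meaningful.
blood_groups = ["A", "O", "B", "O", "AB", "O", "A"]  # made-up sample
mode_group, mode_count = Counter(blood_groups).most_common(1)[0]

# One-to-one re-coding into another symbol set (the codes are arbitrary labels).
recode = {"A": 0, "B": 1, "AB": 2, "O": 3}
recoded = [recode[g] for g in blood_groups]

# Combining two nominal variables gives another nominal variable.
blood_with_rh = [g + rh for g, rh in product(["A", "B", "AB", "O"], ["+", "-"])]

# Ordinal data: categories plus an order, but no meaningful differences.
SIZES = ["S", "M", "L", "XL", "XXL"]  # order taken from the shirt-size example
rank = {size: i for i, size in enumerate(SIZES)}


def is_smaller(a: str, b: str) -> bool:
    """Compare two shirt sizes by their rank in SIZES (hypothetical helper)."""
    return rank[a] < rank[b]


shirts = ["M", "S", "XL", "M", "L"]  # made-up sample
sorted_shirts = sorted(shirts, key=rank.__getitem__)

# The median is defined through ranks; median_low always returns an actual rank.
median_shirt = SIZES[median_low(rank[s] for s in shirts)]
```

Note that the sketch never adds or averages the codes: the integers in `recode` and `rank` serve only as labels or positions, which is the point of the nominal and ordinal scales.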
Operation on Ordinal data
◦ Relational operators can usually be used on ordinal data.
◦ The summary measures mode and median can be used on ordinal data.
◦ Ordinal data can be ranked (numerically, alphabetically, etc.); hence, any percentile measure of ordinal data can be found.
◦ Calculations based on order are permitted (such as count, min, max, etc.).
◦ Spearman's rank correlation coefficient can be used as a measure of the strength of association between two sets of ordinal data.
◦ A numerical variable can be transformed into an ordinal variable and vice versa, but with a loss of information.
  ◦ For example, Age [1, …, 100] → {young, middle-aged, old}

Interval scale
Definition
◦ Interval-scale variables are continuous measurements on a roughly linear scale.
◦ Example: latitude, longitude, weather, temperature, calendar dates, etc.
◦ The data have the properties of ordinal data, and the interval between observations is expressed in terms of a fixed unit of measure.
◦ Interval data are always numeric.
Note
◦ Interval data have well-defined intervals.
◦ Interval data are measured on a numeric scale (with positive, zero, and negative values).
◦ Interval data have a zero point (origin). However, the origin does not imply a true absence of the measured characteristic.
  ◦ For example, for temperature in Celsius or Fahrenheit, 0° does not mean an absence of temperature, that is, no heat.

Operation on Interval data
◦ Addition can be performed.
  ◦ For example: date1 + x days = date2
◦ Subtraction can also be performed.
  ◦ For example: current date - date of birth = age
◦ Negation (changing the sign) and multiplication by a constant are permitted.
◦ All operations defined on ordinal data are also valid here.
◦ Linear (e.g., cx + d) or affine transformations are permissible.
◦ Other one-to-one non-linear transformations (e.g., log, exp, etc.) can also be applied.

Operation on Interval data
Note
◦ Interval data can be transformed to the nominal or ordinal scale, but with loss of information.
◦ Interval data can be graphed using a histogram, frequency polygon, etc.
◦ Two interval values cannot be meaningfully multiplied or divided by each other, because their ratio depends on the arbitrary zero point; multiplication by a constant remains permitted.
◦ True to their quantitative character, interval data support almost all statistical analyses. These include, but are not limited to, the mean, mode, and median.

Ratio scale
Definition
◦ Interval data with a true (absolute) zero are called ratio data.
◦ Example: weight, height, temperature on the Kelvin scale, intensity of an earthquake on the Richter scale, sound intensity in decibels, cost of an article, population of a country, etc.
Note
◦ All ratio data are interval data, but the reverse is not true.
◦ On the ratio scale, both differences between data values and ratios of (non-zero) data pairs are meaningful.
◦ Ratio data may be on a linear or non-linear scale.
◦ Both interval and ratio data can be stored in the same data type (i.e., integer, float, double, etc.).

Operation on Ratio data
◦ All arithmetic operations on interval data are applicable to ratio data.
◦ In addition, multiplication, division, etc. are allowed.
◦ Any linear transformation of the form (ax + b)/c is allowed.

Self Study
◦ Data Cube
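The difference between interval and ratio data covered in the slides above can be checked numerically. The following Python sketch rests on assumptions: the dates, temperatures, and weights are arbitrary illustrative values, `celsius_to_kelvin` and `age_group` are hypothetical helper names, and the age cut-offs 30 and 60 are invented for the example (the slides give only the labels).

```python
from datetime import date, timedelta

# Interval data: differences are meaningful, ratios are not.
date1 = date(2024, 1, 1)             # arbitrary illustrative date
date2 = date1 + timedelta(days=30)   # date1 + x days = date2
elapsed_days = (date2 - date1).days  # subtraction yields a meaningful interval


def celsius_to_kelvin(c: float) -> float:
    """Shift the arbitrary Celsius origin to the absolute zero of the Kelvin scale."""
    return c + 273.15


c_low, c_high = 10.0, 20.0
naive_ratio = c_high / c_low  # 2.0, yet 20 °C is not "twice as hot" as 10 °C
true_ratio = celsius_to_kelvin(c_high) / celsius_to_kelvin(c_low)

# The difference survives the change of origin; the ratio does not.
diff_c = c_high - c_low
diff_k = celsius_to_kelvin(c_high) - celsius_to_kelvin(c_low)

# Ratio data: a true zero makes ratios meaningful.
weights_kg = (40.0, 80.0)
weight_ratio = weights_kg[1] / weights_kg[0]  # 80 kg really is twice 40 kg


def age_group(age: int) -> str:
    """Map a numeric age to an ordinal label; the cut-offs are assumed."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle-aged"
    return "old"


# Binning loses information: 25 and 29 become indistinguishable.
labels = [age_group(a) for a in (25, 29, 45, 70)]
```

Comparing `naive_ratio` with `true_ratio` shows why ratios are reserved for scales with a true zero, while `diff_c` and `diff_k` agree because a shift of origin leaves intervals unchanged.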