Introduction to Data Science PDF
NOUFIA National University
Dr. Amira Abdelatey
Summary
This presentation provides an introduction to data science, focusing on data categorization and different types of data. It explains various data types such as record data, graph data, ordered data, and their properties (e.g., distinctiveness, order, addition, multiplication).
Full Transcript
Introduction to Data Science
Dr. Amira Abdelatey

Just a minute to mark your attendance

Outline
Different forms of datasets
Data in data science
Data categorization
NOIR typology
  Nominal scale
    Binary: symmetric, asymmetric
  Ordinal scale
  Interval and ratio scales

Types of dataset: (1) Record data
Relational records: relational tables, highly structured
Data matrix, e.g., numerical matrix, crosstabs
Transaction data
Document data: term-frequency vector (matrix) of text documents

Types of dataset: (2) Graphs and networks

Types of dataset: (3) Ordered data

Data in Data Science
Entity: A particular thing is called an entity or object.
Attribute: An attribute is a measurable or observable property of an entity.
Data: A measurement of an attribute is called data.
Computers can manage all types of data (e.g., audio, video, text).

Data Categorization
There are two general types of data, quantitative and qualitative, and both are equally important. You use both types to demonstrate effectiveness, importance, or value.

Properties of data
The following four properties (operations) of data are pertinent.

#   Property         Operation     Type
1.  Distinctiveness  = and ≠       Categorical (Qualitative)
2.  Order            <, ≤, >, ≥    Categorical (Qualitative)
3.  Addition         + and -       Numerical (Quantitative)
4.  Multiplication   * and /       Numerical (Quantitative)

Data in Data Science
In general, there are many types of data that can be used to measure the properties of an entity. The NOIR typology names four scales:
N: Nominal
O: Ordinal
I: Interval
R: Ratio

Classification of scales of measurement
[Figure: NOIR classification tree. Nominal and Ordinal are categorical (qualitative); Interval and Ratio are numeric (quantitative). Sub-branches shown in the figure: binary (symmetric, asymmetric), ternary, others; alphabetically ordered, numerically ordered, literally ordered; discrete, continuous.]

Nominal scale
Definition: A variable that takes a value among a set of mutually exclusive codes that have no logical order is known as a nominal variable.
Examples:
Gender: letters or numbers, {M, F} or {1, 0}
Blood groups: strings, {A, B, AB, O}
Rhesus (Rh) factors: symbols, {+, -}
Country code: e.g., 048, 040

Nominal scale
Note:
The nominal scale is used to label categories of data using a consistent naming convention.
The labels can be numbers, letters, or strings.
Nominal data thus divide a set of data into categories.
The number of categories should be two (binary) or more (ternary, etc.), but countably finite.

Nominal scale
Note:
Nominal data may be numerical in form, but the numerical values have no mathematical interpretation. For example, prisoners may be numbered 100, 101, …, 110, but 100 + 110 = 210 is meaningless; the numbers are simply labels.
Two labels may be identical (=) or dissimilar (≠).
These labels have no ordering among themselves. For example, we cannot say blood group B is better or worse than group A.
Labels from two different attributes can be combined to give another nominal variable, for example blood group with Rh factor (A+, A-, AB+, etc.).

Binary scale
Definition: A nominal variable with exactly two mutually exclusive categories that have no logical order is known as a binary variable.
Examples: Switch: {ON, OFF}; Attendance: {True, False}; Entry: {Yes, No}; etc.
Note: A binary variable is a special case of a nominal variable that takes only two possible values.

Symmetric and asymmetric binary scale
The two values of a binary variable may have unequal importance.
Symmetric binary variable: the two choices have equal importance. Example: Gender = {male, female}, usually of equal probability.
Asymmetric binary variable: the two choices have unequal importance. Example: medical test (positive vs. negative).
Convention: assign 1 to the most important outcome (e.g., COVID positive).

Operations on nominal variables
Summary statistics applicable to nominal data are mode, contingency correlation, etc.
Arithmetic (+, -, *, /) and order comparisons (<, ≤, >, ≥) are not permitted; nominal labels can only be tested for equality (= and ≠).
Two or more nominal variables can be combined to generate another nominal variable. Example: Gender (M, F) × Marital status (S, M, D, W).

Ordinal scale
Definition: Ordered nominal data are known as ordinal data, and the variable that generates them is called an ordinal variable.
Example: Shirt size = {S, M, L, XL, XXL}
Note: Values of an ordinal variable can be compared literally or using relational operators (<, ≤, >, ≥).

Operations on ordinal data
Relational operators can usually be used on ordinal data.
The summary measures mode and median can be used on ordinal data.
Ordinal data can be ranked (numerically, alphabetically, etc.).
Calculations based on order are permitted (such as count, min, max, etc.).
Note: A numerical variable can be transformed into an ordinal variable and vice versa, but with a loss of information. For example, Age in [1, …, 100] can be mapped to {young, middle-aged, old}.

Interval scale
Definition: Interval data are measured along a numerical scale that has equal intervals between adjacent values.
Note:
Interval data have well-defined intervals.
Interval data do not have a true zero. For example, for temperature in Celsius or Fahrenheit, 0° does not mean the absence of temperature (no heat).

Operations on interval data
We can add to or subtract from interval data. For example: date1 + x days = date2.
Subtraction can also be performed. For example: current date - date of birth = age.
Negation (changing the sign) and multiplication by a constant are permitted.
All operations defined on ordinal data are also valid here.
Linear (affine) transformations, i.e., multiplying by a constant and adding a constant d, are permissible.
Other one-to-one non-linear transformations (e.g., log, exp, sin, etc.) can also be applied.
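
The interval-scale operations above can be sketched in Python. This is only an illustration and not part of the original slides: the dates and temperatures are made-up values, and celsius_to_fahrenheit is a hypothetical helper name. It shows that adding an offset and taking differences are meaningful on an interval scale, and that an affine transformation preserves ratios of differences but not ratios of the raw values.

```python
from datetime import date, timedelta

# Adding an offset to an interval-scale value gives another value on the same scale.
date1 = date(2024, 1, 1)  # hypothetical start date
date2 = date1 + timedelta(days=30)

# Subtracting two interval-scale values gives a duration.
elapsed_days = (date2 - date1).days


def celsius_to_fahrenheit(c: float) -> float:
    """Affine transformation F = 1.8 * C + 32 between two interval scales."""
    return 1.8 * c + 32.0


# Ratios of differences survive the affine transformation.
ratio_of_gaps_c = (40.0 - 20.0) / (20.0 - 10.0)
ratio_of_gaps_f = (
    celsius_to_fahrenheit(40.0) - celsius_to_fahrenheit(20.0)
) / (celsius_to_fahrenheit(20.0) - celsius_to_fahrenheit(10.0))

# Ratios of the raw values do not, because neither scale has a true zero.
ratio_c = 20.0 / 10.0
ratio_f = celsius_to_fahrenheit(20.0) / celsius_to_fahrenheit(10.0)

print(date2, elapsed_days)
print(ratio_of_gaps_c, ratio_of_gaps_f)
print(ratio_c, ratio_f)
```

The last two printed ratios differ, which is why statements such as "20 °C is twice as hot as 10 °C" are not meaningful on an interval scale.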
Continuous and discrete data
Discrete data can only take on certain individual values.
Continuous data can take on any value in a certain range.

Ratio scale
Definition: Ratio data are measured along a numerical scale that has equal distances between adjacent values and a true zero.
Note:
Ratio data may be on a linear or non-linear scale.
Both interval and ratio data can be stored in the same data type (i.e., integer, float, double, etc.).
All ratio data are interval data, but the reverse is not true.

Operations on ratio data
All arithmetic operations on interval data are applicable to ratio data.
In addition, multiplication, division, etc. are allowed.
Linear transformations of the form (ax + b)/c can be applied.
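
As a rough illustration of why the true zero matters for ratio data, the Python sketch below compares a ratio-scale quantity (mass) with an interval-scale one (temperature in Celsius). It is not part of the original slides: the values are made up, the helper names kg_to_lb and celsius_to_kelvin are hypothetical, and the pound conversion factor is approximate.

```python
def kg_to_lb(kg: float) -> float:
    """Change of unit on a ratio scale: multiplication by a constant (approximate factor)."""
    return kg * 2.20462


def celsius_to_kelvin(c: float) -> float:
    """Shift from an interval scale (Celsius) to a ratio scale (Kelvin)."""
    return c + 273.15


# Mass has a true zero, so "twice as heavy" holds in any unit.
mass_ratio_kg = 80.0 / 40.0
mass_ratio_lb = kg_to_lb(80.0) / kg_to_lb(40.0)

# Celsius has no true zero, so the ratio of raw Celsius values is not meaningful.
temp_ratio_c = 40.0 / 20.0
# On the Kelvin scale, which has a true zero, the same two temperatures give a different ratio.
temp_ratio_k = celsius_to_kelvin(40.0) / celsius_to_kelvin(20.0)

print(mass_ratio_kg, mass_ratio_lb)
print(temp_ratio_c, temp_ratio_k)
```

The mass ratio is the same in both units, while the temperature ratio changes once the values are moved to a scale with a true zero.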