Data Mining - Data Types & Sources PDF

Summary

This document provides a brief explanation of data mining, including different types of data sets, databases, data warehouses, data marts, and their corresponding characteristics. It also outlines the different types of variables and their properties (quantitative, categorical).

Full Transcript

31/01/2024

Data Mining - Data Types & Sources

Database, Data Warehouse, Data Mart, Data Set …?

Rows: observations, examples or cases.
Columns: variables or attributes.
Note that RapidMiner uses the term "examples" for rows of data.

Database: an organized grouping of information within a specific structure.
Relational databases: designed using many tables which relate to one another in a logical fashion.
Relational databases generally contain dozens or even hundreds of tables, depending upon the size of the organization.
By relating tables to one another, we can reduce redundancy of data and improve database performance (normalization).
OLTP (online transaction processing) systems: very efficient for high-volume activities such as cashiering, where many items are recorded via bar code scanners in a very short period of time.
For analysis: not very efficient, because retrieving data from multiple tables at the same time requires writing a query containing joins.
Query: simply a method of retrieving data from database tables for viewing (SQL).

Data warehouse: a large database that has been denormalized and archived.
Denormalization: the process of intentionally combining some tables into a single table, even though this may introduce duplicate data in some columns (in other words, attributes).
OLAP (online analytical processing) systems: aim to reduce the number of joins necessary to query related data, thereby speeding up the process of analyzing our data.
Data warehouses generally contain archived data.

Data set: a subset of a database or a data warehouse.
Usually denormalized so that only one table is used.
Creating a data set may involve several steps, including appending or combining tables from source database tables, or simplifying some data expressions.
Example of simplifying a data expression: changing a date/time format from '10-DEC-2002 12:21:56' to '12/10/02'.
A data set may be made up of a representative sample of a larger set of data, or it may contain all observations.

Data mart: an organizational data store, similar to a data warehouse, but often created with the needs of specific business units in mind (e.g. Marketing or Customer Service), for reporting and management purposes.
Intentionally created by an organization to be a type of one-stop shop where employees throughout the organization can find data they might be looking for.
Data marts may contain wonderful data, prime for data mining activities, but they must be known, current, and accurate to be useful.
They should also be well-managed in terms of privacy and security.

Some Useful Data Mining Definitions

Dependent and independent variables

Characteristics of Variables
Quantitative: counts and measures (numerical). Discrete (e.g. number of children) vs. continuous (e.g. temperature, height).
Categorical: groupings or labels on which arithmetic operations are not performed, e.g. nationality, religion, political party affiliation.
Nominal: no order required (e.g. gender).
Ordinal: has to be in order (e.g. observed levels: Low, Medium, High).
Ratio: ordered, with a true zero point in the scale (e.g. money).
Interval: ordered, with consistent intervals between values but no true zero (e.g. temperature in Celsius or Fahrenheit; SAT scores, where a score of 0 does not imply an absence of ability).
Understanding the scale of measurement is important for checking that the procedure chosen for analysis is appropriate.

Population
Sample
Random Sample
Cross-sectional
Time Series
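The data set steps described in the transcript (combining source tables, then simplifying a data expression) can be sketched in Python using the standard sqlite3 module. This is only an illustration under stated assumptions: the customers and sales tables, their columns, and every row are hypothetical and do not come from the lecture. Only the date/time conversion mirrors the example given in the slides.

```python
import sqlite3
from datetime import datetime

# Hypothetical normalized source tables; names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales (
        sale_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        sold_at TEXT,
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO sales VALUES
        (10, 1, '10-DEC-2002 12:21:56', 19.99),
        (11, 1, '11-DEC-2002 09:05:00', 5.00),
        (12, 2, '11-DEC-2002 17:40:13', 42.50);
""")


def simplify_date(raw: str) -> str:
    """Convert text like '10-DEC-2002 12:21:56' to 'MM/DD/YY'.

    %b assumes English month abbreviations (the default "C" locale).
    """
    parsed = datetime.strptime(raw, "%d-%b-%Y %H:%M:%S")
    return parsed.strftime("%m/%d/%y")


# Step 1: combine the source tables with a join.
rows = conn.execute("""
    SELECT s.sale_id, c.name, s.sold_at, s.amount
    FROM sales AS s
    JOIN customers AS c ON c.customer_id = s.customer_id
    ORDER BY s.sale_id
""").fetchall()

# Step 2: simplify the date expression, leaving one flat table.
data_set = [
    (sale_id, name, simplify_date(sold_at), amount)
    for sale_id, name, sold_at, amount in rows
]

for example in data_set:
    print(example)
```

The customer name 'Ana' appears in two rows of the result. That is the duplicate data that denormalization accepts in exchange for not having to write the join again during analysis.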

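The point that the scale of measurement determines which analysis procedures are valid can be illustrated with a short Python sketch. All values below are made up for illustration, and the choice of summaries (mode, median, mean, ratio) is one common reading of the four scales, not something prescribed in the lecture.

```python
from statistics import mean, mode

# Illustrative values only; none of these come from the lecture.
nationality = ["FR", "DE", "FR", "IT"]                 # nominal
risk_level = ["Low", "High", "Medium", "Low", "High"]  # ordinal
temperature_c = [10.0, 20.0, 30.0]                     # interval
price = [10.0, 20.0, 30.0]                             # ratio

# Nominal: only counting is meaningful, so the mode is a valid summary.
most_common_nationality = mode(nationality)

# Ordinal: order is meaningful, so sort by an explicit ranking and take
# the middle value (the list has an odd number of items).
LEVEL_ORDER = {"Low": 0, "Medium": 1, "High": 2}
ranked = sorted(risk_level, key=LEVEL_ORDER.__getitem__)
median_risk = ranked[len(ranked) // 2]

# Interval: differences and means are meaningful, but ratios are not,
# because the zero point is arbitrary. The "ratio" changes with the unit.
mean_temperature = mean(temperature_c)
ratio_celsius = temperature_c[1] / temperature_c[0]
ratio_kelvin = (temperature_c[1] + 273.15) / (temperature_c[0] + 273.15)

# Ratio: a true zero makes ratios meaningful (20 really is twice 10).
price_ratio = price[1] / price[0]

print(most_common_nationality, median_risk, mean_temperature)
print(ratio_celsius, round(ratio_kelvin, 4), price_ratio)
```

The two temperature ratios differ even though they describe the same pair of readings, which is why 20 °C cannot be called "twice as warm" as 10 °C, whereas the price ratio stays the same in any currency unit.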