Summary

This document appears to be a textbook or study guide on database concepts. It covers basic database concepts, the relational database model, DBMS functions and tools, data security, schemas, further aspects of database management, and further database models and analysis (data warehousing and data mining). It is aimed at an undergraduate computer science student.

Full Transcript

Table of Contents

A.1 Basic Concepts
  1.1 Data and information
  1.2 Information system and a database
    Database vs spreadsheet
  1.3 The need for databases
  1.4 Transactions, states and updates
    Transactions · States · Updates
  1.5 Database transaction
    Properties (ACID)
  1.6 Concurrency
  1.7 Queries and updates
    Queries · Updates
  1.8 Data validation and verification
    Data validation · Data verification
A.2 The relational database model
  2.1 DBMS and RDBMS
    Database management system (DBMS) · Non-relational databases · Relational database management system (RDBMS)
  2.2 Functions and tools of a DBMS
  2.3 Data security
    Authentication and access control · Encryption · Audit trail · Backup and recovery · Data masking
  2.4 Schema
  2.5 Conceptual, logical and physical schema
    The conceptual schema · The logical schema · The physical schema · Role of the DBMS
  2.6 Data dictionary
  2.7 Data definition language
  2.8 Importance of data modelling
  2.9 Database terms
  2.10 Relationships within a database
  2.11 Issues with redundant data
    Data inconsistency · Increased storage requirements · Potential for data anomalies · Security risks · Integrity and reliability · Challenges
  2.12 Referential integrity
  2.13 Normalisation
    1st normal form · 2nd normal form · 3rd normal form
  2.14 Characteristics of a normalised database
  2.15 Different data types
  2.16 Entity-relationship diagram
    Relationships
  2.17 Construct a 3NF database
  2.18 View
  2.19 Simple vs complex query
  2.20 Methods to construct a query
A.3 Further aspects of database management
  3.1 Database administrator
  3.2 Interaction of end-users
    Structured query language (SQL) · Query by example (QBE) · Visual queries · Natural language interfaces (NLIs)
  3.3 Database recovery
    Types of data loss · Methods · Recovery strategies (backup and restore, point-in-time recovery, replication, storage snapshots, failover systems) · Recovery techniques (immediate repairs, deferred repairs, reconstruction) · Social and ethical considerations (cost vs importance of data, access to recovery systems, compliance with laws and regulations)
  3.4 How integrated database systems function
  3.5 Use of databases
  3.6 Privacy of personal data, responsibilities
    Ethical responsibilities of data holders · Legislation · Technology methods · Human methods
  3.7 Open to interrogation
    Necessity of access · Social and ethical implications
  3.8 Data matching and data mining
    Data matching · Data mining
A.4 Further database models and database analysis
  4.1 Database models and their characteristics
  4.2 Object-oriented vs relational databases
    Data definition · Data manipulation · Data integrity
  4.3 Data warehousing
  4.4 Applications of data warehousing
  4.5 Time dependence of data warehousing
  4.6 Updates in a data warehouse in real time
  4.7 Evaluation of data warehousing
  4.8 Need for ETL processes
  4.9 ETL for data cleaning
    Extract · Transform · Load
  4.10 Discovering patterns using mining
    Cluster analysis · Associations · Classifications · Sequential patterns · Forecasting
  4.11 Evaluation of data mining
    Case studies · Social impacts and ethical considerations
  4.12 Predictive modelling
    Decision tree induction · Neural networks
  4.13 Database segmentation
  4.14 Link analysis
  4.15 Deviation detection
Miscellaneous
  Constraints on relationship types

A.1 Basic Concepts

1.1 Data and information
- Data is a collection of facts, meaningless on its own
- Information puts data into an understandable context

1.2 Information system and a database
Database
- An organised collection of related data
Information system
- The overall system that is capable of searching and providing useful information through queries
- Consists of tasks, people, roles and technology
- A database is a subset of an information system
- A database is focused on storing and managing data; an information system has broader functions that process data into meaningful information

Database vs spreadsheet
- Spreadsheet: mainly manipulates data
- Database: mainly stores data and generates queries to view specific data

1.3 The need for databases
- Data organisation
- Data integrity (data is accurate and consistent)
- Data security (control access)
- Scalability
- Performance
- Data sharing (allow collaboration; share data across different systems and applications)

1.4 Transactions, states and updates
Essential for maintaining data consistency and integrity.
Transactions
- A logical unit of work
- Consist of one or more database operations that must be executed together
- Ensure either all operations are completed or none are completed at all
States
- The condition of the database
- Used to keep track of changes
Updates
- Changes made to the database
- Use a locking mechanism, which ensures that only one user can update a record at a time, preventing conflicts and inconsistencies

1.5 Database transaction
- A logical unit of work
- A sequence of operations that are executed as a single unit of work
Properties (ACID)
- Atomicity: the transaction is the smallest, indivisible unit of work; either all of its operations are applied or none are
- Consistency: a change made by a transaction must leave the database in a valid overall state, preserving its rules and constraints
- Isolation: transactions are executed in isolation; changes made by one transaction are not visible to other transactions until completion
  - Implemented with record locking: records that are affected cannot be changed by other transactions
- Durability: once completed, changes are permanently stored in non-volatile storage and are not lost in a system failure

1.6 Concurrency
Concurrency in data sharing refers to the ability of multiple users or applications to access and manipulate the same data simultaneously.
Concurrency control is the process of managing concurrent access to data in order to maintain data consistency and integrity.
Techniques
- Locking: lock records that are being modified
  - Prevents two or more database users from updating the same data at the same time
  - If the database as a whole is locked, only one database session can apply any updates
- Time-stamping: assign a unique timestamp to each transaction that accesses the database; use the timestamps to decide which transaction is given priority
- Multi-version concurrency control (MVCC): create multiple versions of the data; each user/application works with a version that reflects the state at the time its transaction began
Concurrency problems
- Temporary update problem: an item updated by one transaction is read by another transaction before the first transaction is committed or rolled back
- Incorrect summary problem: one transaction applies an aggregate function to some records while another transaction is updating those records; the aggregate function may use some values from before the update and some from after it
- Lost update problem: an update made to a data item by one transaction is lost because it is overwritten by an update from another transaction
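To make the ACID and concurrency ideas above concrete, here is a minimal sketch of an atomic transfer using Python's built-in sqlite3 module; the account table, the balances and the "no negative balance" rule are invented for illustration, not taken from the notes.

```python
import sqlite3

# A bank transfer treated as one atomic transaction (illustrative schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL NOT NULL)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.commit()

def transfer(amount, src, dst):
    try:
        # Both UPDATEs run inside one transaction: either both apply or neither does.
        conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?", (amount, src))
        conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?", (amount, dst))
        # Consistency rule for this example: balances must not go negative.
        (balance,) = conn.execute("SELECT balance FROM account WHERE id = ?", (src,)).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.commit()      # durability: the change is now permanent
    except Exception:
        conn.rollback()    # atomicity: undo the partial update
        raise

transfer(30, 1, 2)          # succeeds: balances become 70 and 80
try:
    transfer(500, 1, 2)     # fails: rolled back, balances unchanged
except ValueError:
    pass
print(conn.execute("SELECT * FROM account ORDER BY id").fetchall())
# [(1, 70.0), (2, 80.0)]
```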
1.7 Queries and updates
Queries
- Used to retrieve data, e.g. SELECT in SQL, find() in other systems
- Allow data analysis such as sorting, grouping, filtering and aggregating data
Updates
- Used to modify data, e.g. INSERT, UPDATE, DELETE in SQL; save(), remove() in other systems
- Maintain the accuracy and integrity of data
- Correct errors, update records, delete obsolete data

1.8 Data validation and verification
Data validation
- The process of checking whether data entered into a system is consistent with predefined rules and constraints
- Checks for errors, invalid data and wrong data types
Data verification
- Checking whether the data in the database is a true representation of the original data source
- Usually performed on a periodic basis, such as during data migrations or when integrating data from multiple sources

A.2 The relational database model

2.1 DBMS and RDBMS
Database management system (DBMS)
A collection of programs that enables users to store, modify and extract information from a database.
Benefits
- Data organisation and management
- Data security and privacy
- Data consistency and integrity
- Data sharing and collaboration
- Data analysis and reporting
→ provides a centralised, controlled and efficient environment for managing data
Relational database management system (RDBMS)
- Data represented in tables; unique records
- Contains links between tables

2.2 Functions and tools of a DBMS
- Data modelling
- Data storage and retrieval
- Concurrency control
- Data integrity and security
- Backup and recovery

2.3 Data security
Authentication and access control
- Control who can access the system
Encryption
- Data is stored securely on disk or in memory; makes the data meaningless to anyone without the key
Audit trail
- Tracks all activities, recording all changes, login attempts and other events
Backup and recovery
- Create copies of the database at regular intervals
Data masking
- Protect sensitive data by replacing it with fictitious (fabricated) data
- Used during testing and training, when the original data is not required
Minimising table access
- Isolate users (e.g. teachers) from the data tables they do not need
- Create views and user-defined functions to support user access requirements rather than giving direct access to the tables

2.4 Schema
The logical structure of a database, defining the organisation and relationships among the data elements or objects within the database.
- A blueprint or plan for the database
- Specifies the types of data, relationships and constraints

2.5 Conceptual, logical and physical schema
The conceptual schema
The high-level representation of the database. Describes the overall logical structure without including details of physical storage.
Contents
- Entities and attributes
- Relationships
- Independent of software and hardware
Design considerations
- User interface
- Stakeholder input
The logical schema
Translates the conceptual schema into a more detailed, software-specific framework.
Describes the structure in terms of the data model that the DBMS understands.
Contents
- Tables, attributes, types
- Keys
Design considerations
- Normalisation: a process, up to 3NF, to reduce redundancy
- DBMS-specific
- Separate from physical storage detail
The physical schema
The lowest level of schema; deals with physical storage.
Contents
- Storage files: files and file structure
- Access paths: indexes, pointers, etc.
Design considerations
- Performance optimisation: indexing, partitioning, materialised views
- Hardware-specific: storage space allocation, data compression techniques
Role of the DBMS
The DBMS acts as an intermediary between the physical database and the users. It relies on the schema definitions to ensure that data is accessed and stored according to the rules from the schema.

2.6 Data dictionary
A collection of metadata such as object names, data types, sizes, classifications and relationships with other data assets. Acts as a reference guide on a dataset.
Contents
- Metadata
  - Names, types and sizes of data elements
  - Constraints and primary keys
- Table definitions
  - Relationships with other tables
- Index information
  - Speeds up data retrieval by recording which fields indices are built on
- User information
  - Access privileges, security settings

2.7 Data definition language
Data Definition Language (DDL) is the subset of SQL used to define and modify the structure of a database.
Functions
- Create tables and relationships (specify the structure and attributes)
- Enforce data integrity (specify constraints)
- Facilitate database management (modify existing fields or relationships)
- Support data security (specify access permissions)
Importance
- Transforms abstract models into tangible database structures
- Consistent, standard language across various database systems
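As a small illustration of DDL and of the data-dictionary idea from 2.6, the sketch below creates two related tables in SQLite and then reads back the metadata the DBMS keeps about them; the table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define structure, keys and constraints (illustrative schema).
conn.executescript("""
CREATE TABLE department (
    dept_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL UNIQUE
);
CREATE TABLE employee (
    emp_id  INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    salary  REAL CHECK (salary >= 0),
    dept_id INTEGER NOT NULL,
    FOREIGN KEY (dept_id) REFERENCES department(dept_id)  -- relationship between tables
);
""")

# A data-dictionary-style listing, read from the DBMS's own metadata catalogue.
for (table,) in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'"):
    print(table)
    for cid, col, coltype, notnull, default, pk in conn.execute(f"PRAGMA table_info({table})"):
        print(f"  {col:10} {coltype:8} not_null={bool(notnull)} primary_key={bool(pk)}")
```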
2.8 Importance of data modelling
A blueprint of the database structure and the relationships between the data entities. Creates a visual representation of the database.
Importance
- Data consistency and accuracy (the model defines rules, constraints and relationships)
- Efficiency (reduces data redundancy)
- Flexibility
- Collaboration (between stakeholders; easy to understand)
- Maintainability (a clear understanding → control over the impact of changes)

2.9 Database terms
- Table: collection of related data organised into rows and columns
- Record: collection of data that represents a single entity in a table; a row
- Field: a single piece of data stored in a record
- Primary key: field (or combination of fields) that uniquely identifies each record
- Secondary key: field that is not the primary key but can be used to access and query data in the table
- Foreign key: field that refers to the primary key of another table; draws a relationship between two tables
- Candidate key: field that could be used as the primary key but is not currently used for that purpose
- Composite primary key: primary key that consists of two or more fields in a table
- Join: a database operation that combines two or more tables based on a related field

2.10 Relationships within a database
Also known as cardinality
- The uniqueness of data values contained in an attribute

2.11 Issues with redundant data
Data inconsistency
Different instances of the same data do not agree.
Real-world implications
- Conflicting information leads to confusion and mistrust
- Reports are incorrect, impacting the decisions made
- Causes computational errors
Techniques to avoid
- Atomic transactions
- Normalisation
- Data cleansing
Increased storage requirements
Problems
- Financial costs
- Slower search and retrieval times
Techniques to avoid
- Normalisation
- Data compression
- Efficient indexing
Potential for data anomalies
Irregularities and inconsistencies during modification
- Duplication makes insertions, deletions and updates more error-prone
- Forgetting to change the value in any one record → data inconsistency
Techniques to avoid
- Normalisation
- Cascading updates and deletes
Security risks
- Redundant data increases the number of potential attack points for malicious actors
Integrity and reliability
- Errors and inconsistencies lower data quality
Methods
- Constraints
- Referential integrity
- Audit trails
Challenges
- Complex design
- Evolving nature of data
- Balancing efficiency and redundancy
- Data warehousing: some controlled redundancy may improve performance

2.12 Referential integrity
Rules the database follows to ensure that the relationships between tables remain consistent and accurate, even as records are updated or deleted.
- Achieved through the use of foreign keys
Methods
- Cascading update: automatically propagates changes to related records in child tables when a primary key value is updated in the parent table
- Cascading delete: automatically deletes related records in child tables when the corresponding record in the parent table is deleted
Problems (if not enforced)
- Foreign keys can be entered that do not match a corresponding primary key in the related table
- Mismatched records

2.13 Normalisation
- Splitting tables in a database and arranging the data into 1NF, 2NF and 3NF is called normalisation
- Removes redundant data so that data is stored efficiently

1st normal form
Rules:
- All field names must be unique
- Values in a field should be from the same domain
  - Same type of data: first name and last name stored in one field are different data and should be split into two fields
- Values in fields should be atomic
  - A single value only
  - Easier to search, sort, query and maintain
- No two records can be identical
- Each table needs a primary key
  - Composite primary key: made from two or more fields
  - E.g. two students with the same name cannot be clearly identified by one field alone; other information such as birth date and course can be included in the key

2nd normal form
Rules:
- Data is already in 1NF
- Any partial dependencies have been removed
Dependency: one field relies on a key.
Partial dependency: the value of a non-primary attribute can be determined using only part of the (composite) primary key.
https://byjus.com/gate/partial-depeandency-in-dbms-notes/
Fixing a many-to-many relationship
- Create a linking table
- Assign the primary keys from the two initial tables as the composite key for the new linking table
- Create 1:M relationships joined by the new table, e.g. table 1 = student, linking table = student_takes_course, table 2 = course
3rd normal form
- The data is already in 2NF
- Any transitive dependencies have been removed
  - A transitive dependency is one where the value of a field is determined by the value of another field that is not part of the primary key
- → makes update anomalies impossible

2.14 Characteristics of a normalised database
- Minimal data redundancy
- Consistent data
- Reduced update anomalies
  - Each table contains only a single, logically related category of data → updates are made only once
- Increased scalability
  - Data is organised into tables and isolated
  - The database structure can remain largely unchanged when extending with new data types/fields
  - → the applications interacting with the database are minimally affected
- Improved query performance
  - Data is organised into smaller, more manageable tables
- Key-dependent
  - Every non-key column is directly dependent on the key
- Reduced redundancy, fewer anomalies, better efficiency

2.15 Different data types
- Words (text)
  - VARCHAR (variable-length string)
    - Numbers such as phone numbers are often stored as text (they can start with 0)
  - CHAR
  - TEXT (long text that exceeds the VARCHAR limit)
- Numbers
  - Integer
  - Float (fractional components, approximate)
  - Decimal (precise)
- Binary
  - BLOB (binary large object): large binary objects such as images
  - BINARY, VARBINARY
- Boolean
- Date/time
  - Timestamp
- Currency
  - Currency value and number of decimal places
- Object
  - Cannot be entered via keyboard, e.g. music, pictures; could be hyperlinks

2.16 Entity-relationship diagram
An entity is a real-world object that is distinguishable from other objects.
- Each entity has attributes
- An entity type defines a set of entities that have the same attributes
- An entity set is the collection of all entities of a particular entity type
Notation: entities are drawn as rectangles, attributes as circles.
Relationships
- A relationship type among two or more entity types defines a set of associations between entities from those types
- Relationship set: a set of instances of the relationship type
- Relationships are drawn as diamonds
Relationships can also have attributes
- The attribute must exist only to facilitate the relationship between the two entities
- The entities must be able to exist without the relationship attribute

2.17 Construct a 3NF database
I need help — a worked sketch follows below.
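A minimal sketch of constructing a 3NF structure for the student/course example used in 2.13, written as SQLite DDL run from Python; the column choices are illustrative assumptions, not taken from the notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # enforce referential integrity in SQLite

# Each table holds one logically related category of data (no repeating groups,
# no partial or transitive dependencies); the M:N link is resolved by a linking table.
conn.executescript("""
CREATE TABLE student (
    student_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    birth_date TEXT
);
CREATE TABLE course (
    course_id  INTEGER PRIMARY KEY,
    title      TEXT NOT NULL
);
CREATE TABLE student_takes_course (            -- linking table for the M:N relationship
    student_id INTEGER NOT NULL REFERENCES student(student_id),
    course_id  INTEGER NOT NULL REFERENCES course(course_id),
    grade      TEXT,                            -- attribute belonging to the relationship itself
    PRIMARY KEY (student_id, course_id)         -- composite primary key
);
""")

conn.execute("INSERT INTO student VALUES (1, 'Ada', '2006-05-01')")
conn.execute("INSERT INTO course VALUES (10, 'Databases')")
conn.execute("INSERT INTO student_takes_course VALUES (1, 10, 'A')")

# This insert violates referential integrity (there is no course 99) and is rejected.
try:
    conn.execute("INSERT INTO student_takes_course VALUES (1, 99, 'B')")
except sqlite3.IntegrityError as err:
    print("rejected:", err)
```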
2.18 View
A subset of a database that is based on a query.
- Views can save frequently used queries, which can be complex
- Dynamic views: automatically updated
- Static views: manually updated
Queries can involve
- Selecting fields
- Filtering data
- Sorting data
- Grouping data
- Calculating data

2.19 Simple vs complex query
Simple query: a basic request, typically involving only one table and a small number of fields; can be created using simple query languages or a graphical user interface.
Complex query: a sophisticated request involving multiple tables and complex operations.
Key differences
- Complexity
- Purpose: retrieval vs data manipulation and analysis
- Performance: simple queries are faster
- Ease of use

2.20 Methods to construct a query
- Graphical user interfaces
- Query languages
  - SQL: relational databases
  - LINQ (Language-Integrated Query): a .NET framework component used to query collections and databases
- Stored procedures
  - Sets of precompiled SQL statements
- Data access layers
  - A way to abstract the database from the application code
- Object-relational mapping (ORM)
  - Maps database tables to object-oriented code
  - Results can be returned as objects
  - Works with object-oriented code
- Web-based interfaces
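To illustrate views and the simple-vs-complex distinction, the sketch below saves a join-plus-aggregate query as a view in SQLite and then queries the view like a table; the schema reuses the illustrative student/course idea from the earlier sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE enrolment (student_id INTEGER, course TEXT, grade INTEGER);

INSERT INTO student VALUES (1, 'Ada'), (2, 'Lin');
INSERT INTO enrolment VALUES (1, 'Databases', 85), (1, 'Networks', 78), (2, 'Databases', 91);

-- A complex query (join + grouping + aggregation) saved as a reusable view.
CREATE VIEW student_averages AS
SELECT s.name, COUNT(*) AS courses, AVG(e.grade) AS avg_grade
FROM student s JOIN enrolment e ON e.student_id = s.student_id
GROUP BY s.student_id;
""")

# A simple query against the view: one "table", a handful of fields.
for name, courses, avg_grade in conn.execute("SELECT * FROM student_averages"):
    print(name, courses, round(avg_grade, 1))
# Ada 2 81.5
# Lin 1 91.0
```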
A.3 Further aspects of database management

3.1 Database administrator
A database administrator (DBA) is responsible for the design, implementation, maintenance and management of an organisation's databases.
Responsibilities
- Design and implementation: physical storage, organisation, logical relationships, security and access controls
- Maintenance and performance tuning: monitoring performance, regular backups
- Data security: implement access controls, data encryption
- User management
- Data modelling and architecture
- Monitoring and troubleshooting
- Training and support

3.2 Interaction of end-users
DBA
- Specialised tools, e.g. command-line interfaces
Internal employees
- Sales representatives can use a web-based interface
- Managers can use reporting tools
External customers
- Web-based interfaces, mobile applications
- Allow customers to place orders, track shipments, etc.
- Easy to use and intuitive

Structured query language (SQL)
A domain-specific language.
Key operations
- Create: CREATE TABLE, ALTER TABLE
- CRUD operations
  - C: create — INSERT
  - R: read — SELECT
  - U: update — UPDATE
  - D: delete — DELETE
Transaction control
- BEGIN TRANSACTION, COMMIT, ROLLBACK
Advanced
- Subqueries and joins: nested queries; combining rows from two or more tables
- Stored procedures: manually triggered
- Triggers: automatic responses to certain events

Query by example (QBE)
The user provides an example of the data they are looking for, as opposed to writing a structured query.
Principles
- Template-filled queries: the system generates a query based on a form
- Graphical interface: more approachable
Complex queries
- Can join tables
- Can aggregate and group data (SUM, COUNT, etc.)

Visual queries
An intuitive interface for database querying.
Characteristics
- Direct-manipulation interfaces (drag and drop)
- Visualisation

Natural language interfaces (NLIs)
Allow users to write queries in their natural language, which the system translates into executable queries.
Features
- Natural language processing (NLP) to understand user input
- Query translation
Pro: user friendly
Con: interpretation issues
Practical use
- Conversational queries
- Good NLIs will ask for clarification when needed

3.3 Database recovery
The process of restoring a database to a consistent state after a failure or an error.
Types of data loss
- Accidental deletion
- Corruption: software bugs, hardware failures, cyber attacks
- Disasters: floods, fires, earthquakes that physically damage the storage
Methods
Log recovery / point-in-time recovery / rollback
- A backup is recorded in the log files and the database is restored to that point
- Transactions after that point are added back in
- Error caused by a transaction → undo it
Backup and restore
- BACKUP and RESTORE SQL commands
- Periodic backups, full or incremental (incremental backups only record changes made since the last backup)
Deferred update
- Changes made to the database are not immediately written to disk, but are instead held in memory until a commit point is reached
- All changes are written in a single batch
- Much less resource-intensive; transactions are atomic
- Recovery consults the transaction log, and any transactions that were not fully written to the database (deferred) are redone
- No undo: if a failure occurs before the commit, there is no need to undo any changes, since none were yet made
- Redo: if a failure occurs after the commit but before the changes are fully applied, the changes can be redone using the log records
Mirroring / instant recovery
- Creating a duplicate copy of the database on a separate server
- Redirects user workloads to the backup server
- When recovery is complete, the original system is used again, with the interim transactions added
- High availability, fast recovery times
- Complex to set up and maintain
Social and ethical considerations
Cost vs importance of data — consider:
- Financial value
- Social impact
- Ethical obligations (personal and sensitive information)
Access to recovery systems
- Restricted access to preserve privacy
Compliance with laws and regulations
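As a small, concrete example of the backup-and-restore idea (the other recovery methods above are infrastructure-level), the sketch below copies a live SQLite database into a backup file using the sqlite3 module's backup API; the file and table names are illustrative.

```python
import sqlite3

# Live database (illustrative data, kept in memory to stay self-contained).
live = sqlite3.connect(":memory:")
live.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
live.execute("INSERT INTO orders (item) VALUES ('keyboard')")
live.commit()

# Full backup: copy every page of the live database into a backup file.
with sqlite3.connect("shop_backup.db") as backup:
    live.backup(backup)
backup.close()

# "Restore": after data loss, reopen the backup copy (or copy it back over the live file).
restored = sqlite3.connect("shop_backup.db")
print(restored.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # 1
```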
3.4 How integrated database systems function
A system that provides a centralised, unified view of data from multiple sources.
- Provides a single point of access for users
Core aspects
- Centralised management
- Redundancy reduction
- Data synchronisation across the system
Outline
1. Data collection: data from various sources is consolidated into a single database or data warehouse
2. Data integration: data is integrated into a single, unified format
3. Data cleansing: identify and correct errors, inconsistencies and duplicates
4. Data storage: store the data in a centralised database or data warehouse
5. Data access: e.g. SQL queries, data visualisation, reporting tools
Architecture
- Database servers
- Database application layer (the applications that interact with the database)
- Database schema

3.5 Use of databases
Stock control
- Manage inventory levels and track sales
- Information about each product, e.g. price, quantity available
- When a sale is made → update the stock level
- Reports help with forecasting and ordering
Police records
- Store and manage information about crimes, suspects and victims
- E.g. location, type of crime, evidence, witnesses
- Identify patterns, track suspects, solve crimes
Health records
- Manage patient health records
- E.g. name, age, medical history, diagnoses, medications, test results
- Make informed decisions, monitor patient progress
Employee data
- E.g. name, address, contact information, job title
- Manage payroll, track performance, provide benefits to employees

3.6 Privacy of personal data, responsibilities
Ethical responsibilities of data holders
- Respecting autonomy: give users the freedom to manage their data
- Preventing harm, e.g. from identity theft or financial fraud
- Maintaining trust: users entrust that their data will be protected
Legislation
Data protection act
- Lawful, fair and transparent processing
- Purpose limitation: data is not further processed in a manner that is not explicitly stated
- Data minimisation: limited to what is necessary
- Integrity and confidentiality: security of personal data
Computer misuse act
- Unauthorised access to computer material
- Unauthorised acts with intent to impair
Technology methods
Access controls
- Role-based access control (RBAC)
- Least-privilege principle: users are given the minimum levels of permission needed
Data encryption
- End-to-end encryption (E2EE): data is encrypted on the sender's system and only the recipient is able to decrypt it
- Public key infrastructure: a pair of keys used to encrypt and decrypt
Data anonymisation and pseudonymisation
- Anonymisation: process personal data so that the data subject is not identifiable
- Pseudonymisation: replace private identifiers with fake identifiers or pseudonyms, reducing risk
Regular audits and monitoring
- Use software to detect unusual patterns
Human methods
- Employee training: policies and procedures for protecting personal data; detecting and reporting privacy breaches
- Access controls
- Background checks
- Privacy policies and notices are clearly stated
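A minimal sketch of the pseudonymisation idea from 3.6: replacing a direct identifier with a keyed hash so that records can still be linked without storing the identifier itself. The use of HMAC-SHA-256, the secret key and the field names are illustrative assumptions.

```python
import hmac
import hashlib

SECRET_KEY = b"keep-this-out-of-the-database"   # held separately from the data

def pseudonymise(identifier: str) -> str:
    """Replace a direct identifier with a stable pseudonym (keyed hash)."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()[:16]

patients = [
    {"patient_id": "HK1234567", "diagnosis": "asthma"},
    {"patient_id": "HK7654321", "diagnosis": "diabetes"},
]

# The same patient always maps to the same token, so linking and aggregation
# still work, but the real identifier is never stored with the records.
pseudonymised = [
    {"patient_ref": pseudonymise(p["patient_id"]), "diagnosis": p["diagnosis"]}
    for p in patients
]
for row in pseudonymised:
    print(row)
```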
3.7 Open to interrogation
Third parties such as the police or medical services may need to interrogate database systems. The access must protect individuals' rights while enabling the effective functioning of justice and security systems.
Necessity of access
Law enforcement
- Criminal investigation: gather evidence; access records of suspects, victims and witnesses
- Compliance and regulation: industries such as healthcare and finance are subject to strict regulations; ensure that organisations are complying with them
National security concerns
- Counter-terrorism: patterns that might indicate attacks
- Foreign intelligence
Public welfare and administration
- Health emergencies: medical records of patients
- Disaster response: provide insights into the resources and personnel available for the response
Social and ethical implications
Privacy
- Intrusion must be justified, monitored and regulated
- Personal data includes sensitive information
Accountability and public perception
- Public trust in government

3.8 Data matching and data mining
Data matching
Data matching connects pieces of information that relate to the same entity but are stored in different data sources.
Objective
- Provide a unified view of data spread across various data sets
- E.g. linking patient records held by different hospitals, matching customer information from separate databases
Techniques
- Exact matching: the same attribute (e.g. social security number) is compared for a direct match
- Fuzzy matching: values are similar but not necessarily identical; handles data that may contain errors, misspellings, abbreviations or variations in formatting
Implications
- Accuracy of matches: impacts the integrity of the data
- Data handling protocols: only allow authorised access

Data mining
An analytical process designed to explore large sets of data in search of patterns and systematic relationships, then validate the findings by applying the detected patterns to new subsets of data.
Objective
- Extract predictive information and translate it into actionable intelligence
- E.g. retailers predicting customer behaviour, financial institutions assessing loan risks
Techniques
- Classification, e.g. fraud detection, email spam filtering
- Clustering, e.g. grouping customers with similar behaviours in marketing
- Association rules, e.g. finding products commonly purchased together
Privacy and data security
- Personally identifiable information can be exposed → privacy concerns
- Could lead to profiling or discrimination
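The sketch below illustrates exact vs fuzzy matching using Python's standard difflib module; the names, the 0.85 similarity threshold and the record layout are illustrative assumptions.

```python
from difflib import SequenceMatcher

hospital_a = [{"nid": "A123", "name": "Chan Tai Man"},
              {"nid": "B456", "name": "Maria Garcia"}]
hospital_b = [{"nid": "A123", "name": "CHAN Tai-Man"},
              {"nid": "Z999", "name": "Maria Gracia"}]   # typo in the surname

def similarity(a: str, b: str) -> float:
    """Normalised string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

for rec_a in hospital_a:
    for rec_b in hospital_b:
        if rec_a["nid"] == rec_b["nid"]:
            # Exact matching: identical identifier in both sources.
            print("exact match :", rec_a["name"], "<->", rec_b["name"])
        elif similarity(rec_a["name"], rec_b["name"]) >= 0.85:
            # Fuzzy matching: similar but not identical values (typos, formatting).
            print("fuzzy match :", rec_a["name"], "<->", rec_b["name"])
```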
A.4 Further database models and database analysis

4.1 Database models and their characteristics
Hierarchical database
- Organises data into a tree-like structure, with each record having a single parent and zero or more children
- E.g. IBM's Information Management System (IMS)
Features
- Easy addition and deletion
- One-to-many relationships
- Data redundancy
- Querying for data at the bottom of the tree is very slow

Network database system
- Organises data into a graph-like structure, with data represented as nodes (records) that can have multiple parents and children
- Uses lots of pointers
- Records are connected through sets
- E.g. Integrated Data Store; a project management system where tasks have dependencies on multiple other tasks
- E.g. LinkedIn or Facebook members and connections (social networks)
Features
- Conceptually simple, easy to design
- Handles one-to-many and many-to-many relationships
- Data access is more flexible
- Maintenance is very complex; uses lots of pointers
- Modifying any record requires a large number of pointer adjustments

Relational database management system (RDBMS)
- Data represented in tables
- Allows the creation of relationships between tables (through foreign keys)
- E.g. MySQL, Oracle
- E.g. inventory management, accounting
Features
- Changes in the database structure do not affect data access
- Easy to understand
- Data independence, structure independence
- Easier to maintain
- Mapping of objects is very difficult; no object-oriented paradigm
- Hardware overheads

Object-oriented DBMS
- Organises data into objects and allows for the creation of classes and inheritance
- Objects contain both data and behaviour
- Supports encapsulation, inheritance and polymorphism
- E.g. MongoDB, Apache Cassandra (strictly, these are document and wide-column NoSQL stores rather than pure object-oriented DBMSs)
Features
- Handles many different types of data (pictures, videos, text and numbers), whereas a relational database handles a single kind of structured data
- Code reusability
- Low maintenance costs, because most tasks are encapsulated
- No universally defined data model
- Lack of support for security

Spatial database model
- Handles geometric data; specifically designed to store and query spatial information like maps, plots and blueprints for spatial analysis
- A digital map stores information about the locations of various features; each feature is represented as a point, line or polygon and can have attributes, e.g. name, population, elevation
Features
- Data related to space and geometry
- Functions for calculating distance, spatial joins
Examples
- A city planning department managing land use and property boundaries
- Google Maps
- National weather services

Multi-dimensional database model
- Organises data into a multidimensional array, where each dimension represents a different feature of the data
- Designed for large volumes of data
- E.g. business intelligence and data warehousing applications; a retail chain analysing sales
- E.g. Walmart; Netflix (data about the video streaming service → viewer demographics, habits, preferences)
Features
- Multiple dimensions
- Measures: quantitative values that represent the data being analysed
- Aggregation: combining and summarising large amounts of data into a more compact form for analysis; analyse from different perspectives, e.g. aggregate sales data by year, region or product category
- Slice and dice: selectively filter and view data based on specific dimensions or subsets of dimensions
- OLAP (online analytical processing) technology: allows faster retrieval and manipulation
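A small sketch of the measures / aggregation / slice-and-dice ideas in the multi-dimensional model, using plain Python dictionaries; the sales records and dimension names are made up for illustration.

```python
from collections import defaultdict

# Each record has dimensions (year, region, product) and a measure (sales).
facts = [
    {"year": 2023, "region": "North", "product": "Tea",    "sales": 120},
    {"year": 2023, "region": "South", "product": "Tea",    "sales": 90},
    {"year": 2024, "region": "North", "product": "Coffee", "sales": 200},
    {"year": 2024, "region": "North", "product": "Tea",    "sales": 150},
]

def aggregate(records, *dims):
    """Roll up the sales measure along the chosen dimensions."""
    totals = defaultdict(int)
    for r in records:
        key = tuple(r[d] for d in dims)
        totals[key] += r["sales"]
    return dict(totals)

print(aggregate(facts, "year"))               # {(2023,): 210, (2024,): 350}
print(aggregate(facts, "region", "product"))  # totals per (region, product)

# Slice: fix one dimension (year = 2024), then dice the slice by region.
slice_2024 = [r for r in facts if r["year"] == 2024]
print(aggregate(slice_2024, "region"))        # {('North',): 350}
```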
4.2 Object-oriented vs relational databases
Data definition
Relational databases
- Fixed schema
- Minimise redundancy through normalisation
Object-oriented databases
- Employ classes and objects
- Handle complex data and relationships more naturally
Data manipulation
Relational databases
- SQL
- Excellent at handling simple and complex queries, but can be verbose for complex operations
Object-oriented databases
- Object methods; more intuitive
- Direct representation of actions
- Lack a universal query language
Data integrity
Relational databases
- Constraints and transactions ensure data integrity
- Changes are controlled
Object-oriented databases
- Also support ACID properties, but maintenance of integrity is more complex
- Objects can encapsulate integrity rules → potentially better maintenance over time

4.3 Data warehousing
A repository for storing and managing large amounts of data from multiple sources over an extended period.
- Supports queries and analysis rather than transaction processing and operational systems
- Often contains large amounts of historical data
Elements
- A relational database
- An extraction, loading and transformation (ELT) solution for preparing the data
- Statistical analysis, reporting and data mining
- Client analysis tools for visualisation
- Machine learning algorithms
Characteristics
- Subject-oriented: focused on a functional area (e.g. sales) rather than the operations or transactions that generate the data; broken down into subject-specific areas
- Integrated: data from different sources is brought into a common data model and format
- Non-volatile and stable
- Time-dependent: contains historical data and a means to query data by date

4.4 Applications of data warehousing
Strategic planning
- Executives can perform high-level analysis of historical data to guide long-term strategy
Business modelling
- Simulate various business scenarios and outcomes using data from the warehouse
Performance management
- Measure and manage performance against benchmarks and objectives
Data mining
- Uncover patterns and relationships in large amounts of data
Customer behaviour analysis
Supply chain management
Market analysis

4.5 Time dependence of data warehousing
Data warehousing is time-dependent because it stores historical as well as current data.
- Data is organised into time periods
Importance
- Trend analysis and historical reporting
- Consistent view of data over time (across multiple sources)
- Time-series comparisons: compare current periods against historical ones

4.6 Updates in a data warehouse in real time
Stream processing
- Capture data in real time as it is generated and load it directly into the warehouse
- Suitable for high-velocity data streams, e.g. log data, social media feeds
Change data capture (CDC)
- Monitor source systems for changes and capture only the changes as they occur, rather than reading the entire data set
Batch processing
- Update the warehouse periodically, with minimal delay; process data in batches
- Suitable for large datasets, e.g. sales data, financial data
Advantages of real-time updates
- Operational efficiency: streamline operations by using up-to-date information
- Customer service: immediate feedback
- Fraud detection: fraud is identified swiftly
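A minimal sketch of change data capture / incremental extraction: only rows changed since the last load are pulled from the source into the warehouse. The last_modified column, the watermark variable and the in-memory SQLite databases are illustrative assumptions.

```python
import sqlite3

source = sqlite3.connect(":memory:")     # operational source system
warehouse = sqlite3.connect(":memory:")  # data warehouse

source.executescript("""
CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL, last_modified TEXT);
INSERT INTO sales VALUES (1, 10.0, '2024-01-01'), (2, 25.0, '2024-01-03');
""")
warehouse.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL, last_modified TEXT)")

last_loaded = "0000-00-00"   # watermark: timestamp of the previous load

def incremental_load():
    """Capture only rows changed since the last load (CDC-style extraction)."""
    global last_loaded
    changed = source.execute(
        "SELECT id, amount, last_modified FROM sales WHERE last_modified > ?",
        (last_loaded,)).fetchall()
    warehouse.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?, ?)", changed)
    warehouse.commit()
    if changed:
        last_loaded = max(row[2] for row in changed)   # advance the watermark
    return len(changed)

print(incremental_load())   # 2  (initial load picks up both rows)
source.execute("INSERT INTO sales VALUES (3, 40.0, '2024-01-05')")
source.commit()
print(incremental_load())   # 1  (only the newly changed row is extracted)
```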
4.7 Evaluation of data warehousing
Pros
- Improved decision making
- Centralised data: a single version of the truth
- Data consistency
- Data integration (different sources)
- Scalability (large volumes)
- Improved performance (e.g. OLAP cubes, data marts)
- Data mining readiness
Cons
- High cost
- Complexity: requires specialised skills
- Data latency for batch processing
- Maintenance requirements
- Data privacy and security: stores and analyses large amounts of sensitive data, so there is a risk of security breaches

4.8 Need for ETL processes
Unified data view
- Standardise disparate data formats into a single format
- Different sources usually hold data in multiple formats
Improved data quality
- Cleansing: remove inaccuracies and inconsistencies
- Enrichment: provide additional relevant information
Efficiency and scalability
- Automated processing
- Reduces errors

4.9 ETL for data cleaning
Extract
- Extract from multiple sources with varying formats
- Parsing: locates and identifies individual data elements, then isolates them into target files
- Full vs incremental extraction
- Accuracy checks
Transform
- Standardisation: conform to a common format
- Cleansing: remove duplicates, correct errors, handle missing values (drop the record or fill with a value)
- Data mapping: source data fields are matched to their destination counterparts
Load
- Staging area: where additional checks and transformations can occur
- Load strategies: bulk loading; trickle (incremental, batch) loading
- Batch updates: refresh at set intervals
- Real-time updates: maintain data freshness
Process before loading
- Disable any constraints (makes loading faster)
- Enable constraints after loading (data consistency)
- Maintain referential integrity (check unique identifiers)

4.10 Discovering patterns using mining
Cluster analysis
Partitioning of a data set into subsets so that the data in each cluster shares some common trait.
K-means clustering
- Partitions n observations into k clusters, where each observation belongs to the cluster with the nearest mean
1. Initialise: choose k data points as cluster centres
2. Assign points: assign each point to the closest cluster centre
3. Recompute centres: update each cluster centre to the mean of its assigned points
4. Repeat until the assignments no longer change
Hierarchical clustering
- Creates a tree of clusters (a dendrogram) which shows the arrangement of the clusters
Density-based clustering
- Connects areas of high example density into clusters, allowing for arbitrarily shaped distributions
- Parameters: min_samples and eps (eps defines the maximum distance between neighbours)
1. Identify a dense region; points within it are core samples
2. If at least min_samples data points lie within distance eps of a point, it forms a cluster; core samples that are closer than eps to each other are clustered together
3. Otherwise the point is labelled as noise
4. The cluster grows until there are no more core samples to add
5. Pick a new point and repeat
Use cases
- Market research, segmenting customers
- Better for exploratory analysis
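A compact sketch of the k-means steps listed above, in plain Python with no external libraries; the 2-D points and k = 2 are invented for illustration.

```python
import math
import random

def kmeans(points, k, iterations=100):
    """K-means: choose k centres, assign points to the nearest centre, recompute, repeat."""
    centres = random.sample(points, k)                    # 1. initialise
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                                  # 2. assign to nearest centre
            nearest = min(range(k), key=lambda i: math.dist(p, centres[i]))
            clusters[nearest].append(p)
        new_centres = [
            tuple(sum(coords) / len(c) for coords in zip(*c)) if c else centres[i]
            for i, c in enumerate(clusters)               # 3. recompute centres as means
        ]
        if new_centres == centres:                        # 4. repeat until stable
            break
        centres = new_centres
    return centres, clusters

random.seed(1)
points = [(1, 1), (1.5, 2), (2, 1.5),    # one compact group
          (8, 8), (8.5, 9), (9, 8.5)]    # another compact group
centres, clusters = kmeans(points, k=2)
print(centres)
```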
Associations
Association rule mining finds correlations among large sets of data items.
- Uses rules that determine the likelihood of an association existing between items
- E.g. if a customer buys bread, they are 80% likely to also buy milk (the correlation bread → milk)
- E.g. shelf-space allocation or inventory management in retail: identify products that are frequently purchased together

Classifications
Finding a model that describes and distinguishes data classes, for the purpose of using the model to predict the class of objects whose class label is unknown.
Decision trees
- A tree-like model of decisions and their possible consequences
- A hierarchy of if/else questions
- Recursive partitioning of the data until each region contains only a single target value (a single class)
Support vector machines
- A supervised machine learning algorithm that classifies data by finding an optimal line or hyperplane
- Support vectors: the data points closest to the hyperplane; only these points influence the hyperplane
Bayesian networks
- A statistical model that represents a set of variables and their conditional dependencies via a directed acyclic graph
E.g. the financial sector assessing loan applications, medical diagnosis predicting patient risk, e-commerce product recommendations
Classification is better for structured, targeted queries.

Sequential patterns
Finding statistically relevant patterns that occur over time; used to predict future events.
GSP (generalised sequential pattern) algorithm
- Stores sequences and transactions
E.g. analysing customer shopping sequences, web page visits, scientific data (DNA sequences)

Forecasting
Predicting future values of a particular entity based on historical data.
Time series analysis
- Analyse time-ordered sequence data
Regression analysis
- Models the relationship between a dependent and an independent variable (target and predictor)
E.g. used in stock market analysis, economic forecasting, inventory studies

4.11 Evaluation of data mining
Case studies
Fraud detection
- Anomaly detection techniques (classification)
Benefits
- Financial savings
- Security enhancement of transactions and personal data
Target marketing
- Predictive analytics → tailored messages
Benefits
- Increased customer retention
- Return on investment (ROI) improvement
Social impacts and ethical considerations
- Privacy concerns
- Data security: data breach risks
- Misuse of information: discrimination; use as a surveillance tool
- Transparency and accountability: algorithmic transparency; regulatory compliance

4.12 Predictive modelling
A statistical technique used to forecast future outcomes based on historical data.
Pros
- Predict sales → optimise inventory levels, helping prevent overstock and understock situations
- Forecast demand → efficient production planning and proactive restocking, leading to improved customer satisfaction
Cons
- Relies on historical data, which may not account for rapid shifts in market demand
- May need frequent updates as customer preferences change

Decision tree induction
Uses a tree-like model of decisions and their possible consequences.
Concepts and construction
- Root node: represents the entire population
- Splitting: divide a node into two or more sub-nodes based on learnt conditions
- Decision node: a sub-node that splits into further sub-nodes
- Leaf/terminal node: does not split further; holds the predicted outcome
Pros
- Easy to interpret
- Handles both numerical and categorical data
- Minimal data cleaning required
- Invariant to scaling of the data
Cons
- Prone to overfitting
- Poor generalisation
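A minimal sketch of decision tree induction, assuming scikit-learn is available; the tiny loan-style data set and the feature names are invented for illustration.

```python
# Assumes scikit-learn is installed (pip install scikit-learn).
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [income (thousands), existing_debts]; label: 1 = approve loan, 0 = reject.
X = [[20, 3], [25, 4], [60, 1], [75, 0], [40, 2], [80, 2]]
y = [0,       0,       1,       1,       0,       1]

# Induce the tree: recursive partitioning of the data into if/else questions.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The learnt hierarchy of if/else questions.
print(export_text(tree, feature_names=["income", "debts"]))

# Predict the class of a new, unlabelled applicant.
print(tree.predict([[55, 1]]))   # e.g. array([1])
```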
Neural networks
Consist of interconnected nodes (neurons) that process data in a layered architecture.
- Highly adept at modelling non-linear relationships
Architecture
- Input layer
- Hidden layers: computations and feature extraction
- Output layer
Backpropagation
- Sends errors backwards through the network, adjusting weights and biases to minimise the error between the ground truth and the predictions
Pros
- Can derive non-linear relationships
Cons
- Black-box nature: difficult to interpret; ethical considerations
- Higher computational resource requirements

4.13 Database segmentation
The process of partitioning a database into groups that are similar based on chosen parameters, in order to make marketing and operations decisions efficiently; the partitioning of a database according to some feature the rows have in common.
Methods
- Behavioural segmentation: patterns, e.g. purchase history
- Value-based segmentation
- Needs-based segmentation
Reasons/benefits
- Performance: smaller subsets of the database are faster to query and update
- Security: limit access to sensitive data
- Maintenance is easier
- More scalable: distribute the load across multiple servers
- Compliance: the Payment Card Industry Data Security Standard (PCI DSS) requires credit card data to be stored in a separate segment

4.14 Link analysis
The use of rules to establish associations between individual records in a data set.
- Maps complex networks in an understandable manner
Purpose
- Discover relationships
- Identify structures, e.g. hierarchical structures such as levels of command
- Measure the influence of particular nodes
Techniques
- Graph-based visualisation
- Matrix models: rows and columns; intersection points represent the presence or absence of a relationship
- Centrality measures: calculate degree centrality, betweenness centrality and closeness centrality to determine the importance of various nodes
Examples
- Identifying key relationships
- Finding clusters
- Fraud detection: detecting patterns
- Customer analysis

4.15 Deviation detection
Statistical techniques used to detect outliers that do not fit the assumed model.
- Identify unusual or unexpected data that may indicate errors, outliers or anomalies
Significance
- Maintain data integrity
- Risk management: mitigate potential threats
- Operational efficiency: pinpoint areas that may indicate inefficiency in a business process
Process
1. Normal profile establishment
2. Statistical modelling (representing the normal behaviour)
3. Outlier identification: flag data points falling outside the bounds of the modelled normal parameters
4. Contextual analysis: determine whether outliers are due to legitimate factors or are genuine anomalies
Techniques
- Z-score (how far a data point is from the mean, in standard deviations) — a small sketch follows after the Miscellaneous notes below
- IQR (interquartile range) method
- Cluster analysis
- Machine learning classification
Applications
- Security: fraud detection, network intrusion detection
- Marketing: customer relationship management
- Health sector: disease outbreak tracking; patient care (unusual changes to health records)
- Public sector: resource allocation (under- or over-utilisation of resources)

Miscellaneous
Constraints on relationship types
We need to specify the number of relationship instances that an entity can participate in.
- E.g. a department can have many employees → a 1:M ratio
- E.g. borrowing books: one student can borrow 3 books → 1:3
  - Student → borrows ← Book
- If a book is a separate entity accessed through the student, repetition is reduced, which reduces data redundancy
- Compound key: date of issue, HKID and ISBN together can uniquely identify a record
  - Less reliable
  - More commonly another ID is created instead, e.g. a borrow ID
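The z-score technique referenced in 4.15, as a short self-contained sketch using Python's statistics module; the transaction amounts and the threshold of 2.5 standard deviations are illustrative assumptions.

```python
from statistics import mean, stdev

# Daily transaction amounts; the last value is an obvious outlier.
amounts = [102, 98, 105, 97, 101, 99, 103, 100, 96, 480]

mu = mean(amounts)
sigma = stdev(amounts)

def z_score(x: float) -> float:
    """How many standard deviations x lies from the mean."""
    return (x - mu) / sigma

# Flag anything more than 2.5 standard deviations from the mean as a deviation.
outliers = [x for x in amounts if abs(z_score(x)) > 2.5]
print(outliers)   # [480]
```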
