INFO90002 Lecture 8 PDF
The University of Melbourne
Summary
This document covers the disadvantages of distributed database management systems, including complexity of management and control, data integrity issues, security concerns, and the lack of common standards. It also discusses replication and horizontal and vertical partitioning strategies, and emphasizes the trade-off between availability and consistency in distributed database systems. A second part introduces data warehousing: OLTP versus OLAP systems, the characteristics of a data warehouse, and dimensional (star schema) modelling.
Full Transcript
Distributed DBMS disadvantages

There are many disadvantages because it is a complicated technology. They include complexity of management and control, data integrity, and security.

For data integrity, one approach is a master manager that decides, based on its own algorithms, how to reconcile conflicting updates. A more interesting and reliable approach is the master-slave design. Out of all the servers that hold the distributed data, only one is the master server and the others are slaves. For reading, all servers are equally available, and the reader uses whichever is closest. For updating and inserting new records (writing), only the master server gets written to: first priority for an update always goes to the master server, regardless of where the writing process originates. The write is then propagated to the slaves based on their availability (a minimal setup sketch appears at the end of this section).

There is a period of inconsistency when only the master record is updated, and then a stage when only the master and some slave records are updated. It could take several minutes or longer before the new version of the record is propagated to all servers where it is replicated. Whether this is a problem depends on the application. It is not a problem for social media, where data availability matters more than having the very latest update, but it would be for something like banking.

Security
The more servers there are, the larger the attack surface for hackers, leading to more attacks. The administrators differ from site to site, and every extra server is an extra chance for a breach into the database.

Lack of standards
Different relational DDBMS vendors use different protocols, leading to challenges in interoperability and migration. Over the years a lot of data gets accumulated, and it is not that simple to copy it from one place to another.

Increased training and maintenance costs
These are due to a more complex IT infrastructure:
- Increased disk storage ($)
- Fast intra- and inter-site network infrastructure ($$$)
- Clustering software ($$$$)
- Network speed ($$$$$)

Important points about distributed databases
- Location transparency – the user does not need to know where particular data are stored
- Local autonomy – a node can continue to function for local users if connectivity to the network is lost
- Trade-offs – availability vs consistency, and synchronous vs asynchronous updates
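As a concrete illustration of the master-slave propagation described above, here is a minimal setup sketch using PostgreSQL logical replication. PostgreSQL is only an example system (the lecture does not name one), and the table, publication, subscription, and connection names are illustrative placeholders.

    -- On the master (publisher) node: publish changes to one table.
    CREATE PUBLICATION orders_pub FOR TABLE orders;

    -- On each slave (subscriber) node: subscribe to the master.
    -- The orders table must already exist on the subscriber.
    CREATE SUBSCRIPTION orders_sub
        CONNECTION 'host=master.example.com dbname=shop user=repl'
        PUBLICATION orders_pub;

Writes still go only to the master; the subscribers apply the changes asynchronously, which is exactly the window of inconsistency described above.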
Location Transparency

A user (or program) accessing data normally would not know on which server that data is located. In most cases, we do not access data directly. Some websites, like Skyscanner, do not actually have their own database: they aggregate other websites and go into the databases of different companies to see who is offering what. Their software knows how to query those different databases, but it does not need to know the actual location of the data; it just needs to know the API and how to connect to that data. Requests to retrieve or update data from any site are automatically forwarded by the application software to the database management system, which knows which servers to query.

All data in the network appears to users as a single logical database stored at one site. For us, it is as if everything is in one place on one computer, which makes things easier; behind the scenes, however, a single query can join data from tables at multiple sites.

Local Autonomy

A local administrator administers the local server with the database. They have control over the local data, decide on security settings, and can see the logs of transactions happening on their server. When one node fails, users may have replicated data on other nodes, but it is the local administrator who recovers the database after a failure.

Trade-offs

Asynchronous updates are what happens in social media: when you upload something, it exists on one server only, and it propagates to the other servers only when there is a break in the workload. Synchronous updates happen only when all nodes are available for the update, which can take some time for an update request to be processed.

Distribution options

There are three main options, plus a fourth that combines them:
- Data replication – data is copied across sites, and can still be asynchronous. Its benefit is faster response: if I want data from Tokyo, I will get it from a server in Tokyo. It can be a waste of hard disk space, since the same data is replicated across multiple sites, and keeping the database updated and consistent across all those sites can be a headache.
- Horizontal partitioning – a table is split into groups of rows, and the rows are stored where they are needed.
- Vertical partitioning – a table is split by columns, stored according to the administrators' needs.
- A combination of the above – one popular example splits a table horizontally, with London data on the London server and Paris data on the Paris server, while logs and lookup tables are replicated across all servers.

Horizontal partitioning

This is essentially splitting the table into groups of rows and storing them where needed. The example groups records by geographical location, which is quite a common approach. Note: using UNION we can put those rows back together into one table. This is very helpful in distributed databases for collecting data across different partitions and putting it into one set of results (see the sketch at the end of this section). Some servers have faster CPUs and some slower, and network speeds can also differ, so access speeds are inconsistent.

Vertical Partitioning

This is when a database designer decides which columns to group together. The example stores different columns on different servers; it is also possible for all the columns to be stored at the same location. The advantages and disadvantages of vertical partitioning are the same as for horizontal partitioning, except:
- Combining data across partitions is more difficult because it requires joins (instead of unions). Joins take more effort for the CPU, and using joins is always slower than unions (a join sketch follows below).
- Vertical partitioning has the advantage for queries that aggregate columns, whereas horizontal partitioning suits queries that select entire rows.
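For the horizontal partitioning example above, here is a minimal sketch, assuming customer tables partitioned by city (all table and column names are illustrative):

    -- Each partition holds the rows for its own location and has
    -- identical columns, so the partitions can be appended directly.
    SELECT customer_id, name, city FROM customer_london
    UNION ALL
    SELECT customer_id, name, city FROM customer_paris;

UNION ALL is used rather than UNION here because the partitions hold disjoint rows, so there are no duplicates to remove and the duplicate check can be skipped.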
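For vertical partitioning, the corresponding sketch, again with illustrative names. The primary key has to be repeated in every partition so that the rows can be re-assembled:

    -- customer_contact(customer_id, name, phone) on one server,
    -- customer_billing(customer_id, credit_limit, balance) on another.
    SELECT c.customer_id, c.name, b.balance
    FROM   customer_contact c
           JOIN customer_billing b ON b.customer_id = c.customer_id;

This join is the extra CPU cost mentioned above: the DBMS has to match rows on the key, whereas a union only appends one result set to another.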
Replication – advantages

- High reliability due to redundant copies of the data.
- Fast access to data at the location where it is most accessed.
- No need to integrate data from multiple locations – unions are not needed (though joins may still be needed between different tables), and there are no complicated distributed integrity routines: replicated data is refreshed at scheduled intervals.
- Decoupled nodes do not affect data availability – transactions proceed even if some nodes are down. If my server goes down, I can still get my data from other servers, as all the data is replicated.
- Reduced network traffic at prime time, if updates can be delayed.

This is currently popular as a way of achieving high availability for global systems – most SQL and NoSQL databases offer replication.

Replication – disadvantages

- More storage space is needed – each server stores a copy of the data.
- Data integrity – you can retrieve incorrect data if updates have not yet arrived, which may or may not be a problem depending on the situation.
- Network communication capabilities – updates can place heavy demand on telecommunications networks, and high-speed networks come at high cost.

Synchronous updates

Data is continuously kept up to date. This is important for stock trading, banking, or anything that needs real-time (more up-to-date) information. Users anywhere can access the data and get the same answer. If any copy of a data item is updated anywhere on the network, the same update is immediately applied to all other copies, or the update is aborted. This ensures data integrity and minimizes the complexity of knowing where the most recent copy of the data is located. It can, however, result in slow response times and high network usage:
- The DDBMS spends time checking that an update is accurately and completely propagated across the network.
- The committed updated record must be identical on all servers.
(A two-phase commit sketch appears at the end of this section.)

Asynchronous updates

There is always the chance of some delay in propagating data updates to the remote databases.

CAP Theorem

The CAP theorem is a concept in distributed systems which states that it is impossible for a distributed database system to simultaneously provide more than two of the following three guarantees: consistency, availability, and partition tolerance. During a network partition, a distributed system must choose between consistency and availability.

Distributed DBMS architecture

Databases are accessed through application programs, so we have an application program talking to the DBMS: a local DBMS plus its distributed aspect. Through the network, it then talks to other servers with similar setups in their architecture. A user with a browser connects to the server; possibly the data comes from the local node, if everything sits there, and if not, the server talks to the other databases on the network, collects the data, and brings it back.

Network partitions

Here the situation is quite vague, and a question should not and would not be asked like this, because it depends on what context we are talking about and what database design we are going for. When data is distributed across multiple servers, how do we know which data is on which server? We have lookup tables, or distributed database catalogues, for this. A distributed catalogue manages metadata about the data objects that are distributed across the nodes or servers of a distributed database system. Three popular management schemes for distributed catalogues are centralized catalogues, fully replicated catalogues, and partially replicated catalogues.

Functions of a distributed DBMS

- Put data together by locating it with a distributed catalogue (metadata).
- Determine the location from which to retrieve data, and process query components.
- Translate between nodes with different local DBMSs (using middleware).
- Maintain data consistency (via multiphase commit protocols).
- Global primary key control – for example, when we have employees in America and employees in Australia and both sets of employee IDs start from 1, the DDBMS is responsible for reconciling this (see the sketch below).
- Scalability.
- Security, concurrency, query optimization, failure recovery.
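One common way to implement global primary key control is to give each site a non-overlapping key range. A minimal sketch using PostgreSQL-style sequences; the names and the two-site odd/even split are illustrative assumptions:

    -- On the Australian node: odd employee IDs.
    CREATE SEQUENCE employee_id_seq INCREMENT BY 2 START WITH 1;

    -- On the American node: even employee IDs.
    CREATE SEQUENCE employee_id_seq INCREMENT BY 2 START WITH 2;

    -- Both nodes define the table the same way; each draws IDs
    -- from its local sequence, so the ranges never collide.
    CREATE TABLE employee (
        employee_id BIGINT PRIMARY KEY DEFAULT nextval('employee_id_seq'),
        name        VARCHAR(100) NOT NULL
    );

An alternative is to use globally unique identifiers (UUIDs) as keys, which avoids coordination between sites entirely, at the cost of larger, non-sequential key values.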
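The multiphase commit protocols mentioned in the functions list are typically two-phase commit: every node first prepares the transaction, and it is committed everywhere only if every node prepared successfully. A sketch of the per-node commands, using PostgreSQL's two-phase commit syntax as one example (the transaction name is a placeholder, and the server must have max_prepared_transactions set above zero):

    BEGIN;
    UPDATE account SET balance = balance - 100 WHERE account_id = 1;
    -- Phase 1: make the transaction durable, but not yet visible.
    PREPARE TRANSACTION 'txn_42';

    -- Phase 2, issued once every node has prepared successfully:
    COMMIT PREPARED 'txn_42';

    -- If any node failed to prepare, all nodes roll back instead:
    -- ROLLBACK PREPARED 'txn_42';

This waiting for every node to acknowledge each phase is exactly where the slow response time of synchronous updates comes from.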
Data warehousing lecture portion

This part covers the differences between transactional (operational) and informational (dimensional) databases, the characteristics of a data warehouse, the overall architecture of a data warehouse, and star schemas.

We need to collect data because managers want answers to many questions about how their businesses are operating. To get solid answers, relational databases collect a lot of data for us and automate business processes, including accounting, inventory, purchasing and sales. They are used to run day-to-day business operations and create huge efficiencies. However, they do not show us the big picture over the years.

OLTP databases

Transaction processing supports daily (routine, repetitive) operations, which is mundane but crucial, and it has become even more important with the growth of the internet. A transaction is a collection of read/write operations, processed as one unit, reliably and efficiently, with no data loss due to interference and failures (operating system, program, disk, …).

OLTP (online transaction processing) characteristics

Databases are great, but they do not give the big picture. The data warehouse, by contrast, is a centralized repository populated from different sources:
- It is populated from operational databases and external data sources.
- Its data is optimized for reporting.
- Its data is integrated and transformed.

OLTP versus OLAP systems

OLAP runs on a special, purpose-built data warehouse. The kinds of queries OLAP works with are: find the average, find the total over a certain period of time, find the highest or lowest values during a period in relation to some data. It holds historic data, which is read-only: historic data cannot change because it is from the past – if a product was bought, it was bought. Not every transaction is stored; transactions are aggregated.

Data warehouse

A point to remember about the differences between transactional and informational systems: in transactional databases, data is normalized into 3NF, whereas in informational databases it is not normalized, and it is often faster to merge several tables into one. In informational databases, tables are denormalised into very long rows of data, because in this case that gives faster queries.

Transactional (operational) database questions

These can be simple queries. For example, a customer calls about a forgotten membership card and asks for their membership number; the customer service center accesses one table only to find the membership number, as long as the customer provides a phone number. One table is affected and one row is accessed.

It is a very different story with analytical questions, which are more complex. For example: how many customers purchased more than $500 worth of alcohol in our Melbourne stores this year? Such a question can affect every table and millions of rows. In the data warehouse, queries help answer these analytical questions: aggregation functions are used, and quite often several dimensions are analyzed in relation to the question. (The sketch below contrasts these two query styles.)
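A minimal sketch contrasting the two query styles, with illustrative table and column names (and an arbitrary date standing in for "this year"):

    -- Transactional lookup: one table, one row.
    SELECT membership_number
    FROM   customer
    WHERE  phone = '0400 000 000';

    -- Analytical question: aggregates over millions of rows
    -- joined across several tables.
    SELECT c.customer_id, SUM(s.dollar_sales) AS alcohol_spend
    FROM   sales s
           JOIN customer c  ON c.customer_id = s.customer_id
           JOIN store    st ON st.store_id   = s.store_id
           JOIN product  p  ON p.product_id  = s.product_id
    WHERE  st.city = 'Melbourne'
      AND  p.category = 'Alcohol'
      AND  s.sale_date >= DATE '2024-01-01'
    GROUP  BY c.customer_id
    HAVING SUM(s.dollar_sales) > 500;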
Characteristics of a DW (data warehouse)

- Subject oriented – data warehouses are organized around particular subjects (sales, customers, products).
- Validated, integrated data – the data collected there from different systems is converted to a common format, which allows comparison and consolidation of data from different sources. Essentially, data from the various sources is validated before being stored in the data warehouse; any manual entering of data will have errors.
- Time dimension – very useful, as it allows us to build the database around our time-based needs: historical data, trend analysis, snapshots.

Data marts and data mining

Data mart: contains a subset of the data warehouse information, focused on a specific topic.
Data mining: a process in which algorithms are applied to information to uncover patterns and relationships that are otherwise difficult to find, since humans are only capable of processing so much data.

Dimensional modelling

This is essentially a simple and restricted type of ER model, also known as star schema design. A dimensional model consists of a fact table and several dimension tables, and sometimes there are hierarchies in the dimensions.

Example for the business analyst world

In this kind of question, we first determine what we need to figure out by looking at the important nouns, and then understand what we want to know. Time will also be a dimension in most cases. The transactional database is a lot more complex, and we will not worry about it, as we will be converting it into a data warehouse. For this question, because we want to know the revenue, we want to know dollar sales; and when it is about sales, it is often about quantities too, leading us to include unit sales. You then have to decide what the dimensions will be, where each dimension has a primary key that becomes a PFK within the fact table.

Fact table example

Data is grouped by time stamp.

Star schema

Time is an interesting dimension because it is one thing not discussed much in transactional databases. With time, you have to read the case study very carefully to determine what kind of time is wanted: monthly, weekly, quarterly, etc. Here we have 1 fact and 4 dimensions. Normally there are 3-4 dimensions, because this is how managers analyze data, and the time dimension is not always obvious – something to watch out for and include, FOR THE EXAM.

Dimension hierarchies

A product could belong to a product type, which in turn is part of a product group. The examples are quite intuitive: a hammer is a tool, and it belongs in the hardware category (group). It is kind of evident in the data. We can keep the hierarchy as part of the dimension; we do not always have to separate it into its own sub-dimensions.

Designing a dimensional model

1. Choose a business process.
2. Choose the measured facts (usually numeric, additive quantities) – what management wants, like profit margins, etc.
3. Choose the granularity of the fact table.
4. Choose the dimensions – if we are analyzing revenue for an airline, it will be about how many flights, how many planes, how much petrol was used, etc.
5. Complete the dimension tables.

Embedded hierarchies in dimensional tables

Design outcomes: normalized vs denormalised

You will have repeated data under denormalization, which is a problem in transactional databases but not in data warehouses: the data is read-only and cannot be changed, so duplication is fine. (The sketch below puts these pieces together.)
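Putting the pieces together, a minimal star schema sketch in SQL; every table and column name here is an illustrative assumption, not the lecture's actual example:

    CREATE TABLE dim_time (
        time_key  INT PRIMARY KEY,
        full_date DATE,
        month     INT,
        quarter   INT,
        year      INT
    );

    CREATE TABLE dim_product (
        product_key   INT PRIMARY KEY,
        product_name  VARCHAR(100),
        -- Embedded hierarchy: type and group repeat across rows
        -- (denormalised), which is fine in a read-only warehouse.
        product_type  VARCHAR(50),   -- e.g. 'Tool'
        product_group VARCHAR(50)    -- e.g. 'Hardware'
    );

    CREATE TABLE dim_store (
        store_key  INT PRIMARY KEY,
        store_name VARCHAR(100),
        city       VARCHAR(50)
    );

    -- Fact table: the measured facts plus one PFK per dimension.
    CREATE TABLE fact_sales (
        time_key     INT REFERENCES dim_time,
        product_key  INT REFERENCES dim_product,
        store_key    INT REFERENCES dim_store,
        dollar_sales NUMERIC(10,2),
        unit_sales   INT,
        PRIMARY KEY (time_key, product_key, store_key)
    );

    -- A typical analytical query: revenue by product group per quarter.
    SELECT t.year, t.quarter, p.product_group,
           SUM(f.dollar_sales) AS revenue
    FROM   fact_sales f
           JOIN dim_time    t ON t.time_key    = f.time_key
           JOIN dim_product p ON p.product_key = f.product_key
    GROUP  BY t.year, t.quarter, p.product_group;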