Summary

This document provides an introduction to data storage and operations, including the design, implementation, and support of stored data. It details the two sub-activities: Database support and Database technology support. It also lists specific goals and activities, and provides a context diagram for data storage and operations.

Full Transcript


CHAPTER 6: Data Storage and Operations

1. Introduction

Data Storage and Operations includes the design, implementation, and support of stored data, to maximize its value throughout its lifecycle, from creation/acquisition to disposal (see Chapter 1). Data Storage and Operations includes two sub-activities:

Database support focuses on activities related to the data lifecycle, from initial implementation of a database environment, through obtaining, backing up, and purging data. It also includes ensuring the database performs well. Monitoring and tuning are critical to database support.

Database technology support includes defining technical requirements that will meet organizational needs, defining technical architecture, installing and administering technology, and resolving issues related to technology.

Database administrators (DBAs) play key roles in both aspects of data storage and operations. The role of DBA is the most established and most widely adopted data professional role, and database administration practices are perhaps the most mature of all data management practices. DBAs also play dominant roles in data operations and data security. (See Chapter 7.)

Figure 54 Context Diagram: Data Storage and Operations

Definition: The design, implementation, and support of stored data to maximize its value.

Drivers: Business Drivers, Technical Drivers

Inputs: Data Architecture, Data Requirements, Data Models, Service Level Agreements

Activities:
1. Manage Database Technology
   1. Understand Database Technology Characteristics (P)
   2. Evaluate Database Technology (D)
   3. Manage and Monitor Database Technology (O)
2. Manage Database Operations
   1. Understand Requirements (P)
   2. Plan for Business Continuity (P)
   3. Develop Database Instances (D)
   4. Manage Database Performance (C, O)
   5. Manage Test Data Sets (O)
   6. Manage Data Migration (O)

Suppliers, Participants, and Consumers: Database Administrator, Data Architect, Data Modeler, Software Developer, Application Testing Team, Infrastructure Operations

Deliverables: Database Technology Evaluation Criteria, Database Environments, Migrated/Replicated/Versioned Data, Business Continuity Plans, Database Performance OLA

Tools: Data Modeling Tools, Database Monitoring Tools, Database Management Tools, Developer Support Tools

Techniques: Change Implementation Path, Physical Naming Standards, Data Lifecycle Management, Script Usage for All Changes

Metrics: Data Storage Metrics, Performance Metrics, Operations Metrics, Service Metrics

Legend: (P) Planning, (C) Control, (D) Development, (O) Operations

1.1 Business Drivers

Companies rely on their information systems to run their operations. Data Storage and Operations activities are crucial to organizations that rely on data. Business continuity is the primary driver of these activities. If a system becomes unavailable, company operations may be impaired or stopped completely. A reliable data storage infrastructure for IT operations minimizes the risk of disruption.

1.2 Goals and Principles

The goals of data storage and operations include:
- Managing the availability of data throughout the data lifecycle
- Ensuring the integrity of data assets
- Managing the performance of data transactions

Data Storage and Operations represent a highly technical side of data management.
DBAs and others involved in this work can do their jobs better, and help the overall work of data management, when they follow these guiding principles:

Identify and act on automation opportunities: Automate database development processes, developing tools and processes that shorten each development cycle, reduce errors and rework, and minimize the impact on the development team. In this way, DBAs can adapt to more iterative (agile) approaches to application development. This improvement work should be done in collaboration with data modeling and Data Architecture.

Build with reuse in mind: Develop and promote the use of abstracted and reusable data objects that prevent applications from being tightly coupled to database schemas (the so-called object-relational impedance mismatch). A number of mechanisms exist to this end, including database views, triggers, functions and stored procedures, application data objects and data-access layers, XML and XSLT, ADO.NET typed data sets, and web services. The DBA should be able to assess the best approach for virtualizing data. The end goal is to make using the database as quick, easy, and painless as possible.

Understand and appropriately apply best practices: DBAs should promote database standards and best practices as requirements, but be flexible enough to deviate from them when given acceptable reasons for the deviation. Database standards should never be a threat to the success of a project.

Connect database standards to support requirements: For example, the Service Level Agreement (SLA) can reflect DBA-recommended and developer-accepted methods of ensuring data integrity and data security. The SLA should reflect the transfer of responsibility from the DBAs to the development team if the development team will be coding their own database update procedures or data access code.

Set expectations for the DBA role in project work: Ensuring that the project methodology includes onboarding the DBA in the project definition phase can help throughout the system development lifecycle (SDLC). The DBA can understand project needs and support requirements up front. Communication during analysis and design will improve by clarifying expectations about DBA tasks, standards, work effort, and timelines for development work. Teams should also clarify expectations for support after implementation.

1.3 Essential Concepts

1.3.1 Database Terms

Database terminology is specific and technical. In working as a DBA or with DBAs, it is important to understand the specifics of this technical language:

Database: Any collection of stored data, regardless of structure or content. Some large databases refer to instances and schemas.

Instance: An execution of database software controlling access to a certain area of storage. An organization will usually have multiple instances executing concurrently, using different areas of storage. Each instance is independent of all other instances.

Schema: A subset of database objects contained within the database or an instance. Schemas are used to organize objects into more manageable parts. Usually, a schema has an owner and an access list. Common uses of schemas are to isolate objects containing sensitive data from the general user base, or to isolate read-only views from the underlying tables in relational databases. Schema can also be used to refer to a collection of database structures with something in common.

Node: An individual computer hosting either processing or data as part of a distributed database.
Database abstraction means that a common application programming interface (API) is used to call database functions, such that an application can connect to multiple different databases without the programmer having to know all the function calls for all possible databases. ODBC (Open Database Connectivity) is an example of an API that enables database abstraction. Advantages include portability; disadvantages include the inability to use database-specific functions that are not common across databases.

1.3.2 Data Lifecycle Management

DBAs maintain and assure the accuracy and consistency of data over its entire lifecycle through the design, implementation, and usage of any system that stores, processes, or retrieves data. The DBA is the custodian of all database changes. While many parties may request changes, the DBA defines the precise changes to make to the database, implements the changes, and controls the changes.

Data lifecycle management includes implementing policies and procedures for acquisition, migration, retention, expiration, and disposition of data. It is prudent to prepare checklists to ensure all tasks are performed at a high level of quality. DBAs should use a controlled, documented, and auditable process for moving application database changes to the Quality Assurance or Certification (QA) and Production environments. A manager-approved service request or change request usually initiates the process. The DBA should have a back-out plan to reverse changes in case of problems.

1.3.3 Administrators

The role of Database Administrator (DBA) is the most established and the most widely adopted data professional role. DBAs play dominant roles in Data Storage and Operations, and critical roles in Data Security (see Chapter 7), the physical side of data modeling, and database design (see Chapter 5). DBAs provide support for development, test, QA, and special use database environments.

DBAs do not exclusively perform all the activities of Data Storage and Operations. Data stewards, data architects, network administrators, data analysts, and security analysts participate in planning for performance, retention, and recovery. These teams may also participate in obtaining and processing data from external sources.

Many DBAs specialize as Production, Application, Procedural, and Development DBAs. Some organizations also have Network Storage Administrators (NSAs) who specialize in supporting the data storage system separately from the data storage applications or structures. In some organizations, each specialized role reports to a different organization within IT. Production DBAs may be part of production infrastructure or application operations support groups. Application, Development, and Procedural DBAs are sometimes integrated into application development organizations. NSAs are usually connected to Infrastructure organizations.
1.3.3.1 Production DBA

Production DBAs take primary responsibility for data operations management, including:
- Ensuring the performance and reliability of the database, through performance tuning, monitoring, error reporting, and other activities
- Implementing backup and recovery mechanisms to ensure data can be recovered if lost in any circumstance
- Implementing mechanisms for clustering and failover of the database, if continual data availability is a requirement
- Executing other database maintenance activities, such as implementing mechanisms for archiving data

As part of managing data operations, Production DBAs create the following deliverables:
- A production database environment, including an instance of the DBMS (Database Management System) on the supporting server, of a sufficient size and capacity to ensure adequate performance, configured for the appropriate level of security, reliability, and availability. Database System Administration is responsible for the DBMS environment.
- Mechanisms and processes for controlled implementation of changes to databases in the production environment
- Mechanisms for ensuring the availability, integrity, and recoverability of data in response to all circumstances that could result in loss or corruption of data
- Mechanisms for detecting and reporting any error that occurs in the database, the DBMS, or the data server
- Database availability, recovery, and performance in accordance with service level agreements
- Mechanisms and processes for monitoring database performance as workloads and data volumes vary

1.3.3.2 Application DBA

An application DBA is responsible for one or more databases in all environments (development / test, QA, and production), as opposed to database systems administration for any of these environments. Sometimes, application DBAs report to the organizational units responsible for development and maintenance of the applications supported by their databases. There are pros and cons to staffing application DBAs. Application DBAs are viewed as integral members of an application support team, and by focusing on a specific database they can provide better service to application developers. However, application DBAs can easily become isolated from the broader DBA team and lose sight of the organization's overall data needs and common practices. Application DBAs collaborate closely with data analysts, modelers, and architects.

1.3.3.3 Procedural and Development DBAs

Procedural DBAs lead the review and administration of procedural database objects. A procedural DBA specializes in the development and support of procedural logic controlled and executed by the DBMS: stored procedures, triggers, and user-defined functions (UDFs). The procedural DBA ensures this procedural logic is planned, implemented, tested, and shared (reused).

Development DBAs focus on data design activities, including creating and managing special use databases, such as sandbox or exploration areas. In many cases, these two functions are combined under one position.

1.3.3.4 NSA

Network Storage Administrators are concerned with the hardware and software supporting data storage arrays. Networked storage array systems have different needs and monitoring requirements than simple database systems.

1.3.4 Database Architecture Types

A database can be classified as either centralized or distributed.
A centralized system manages a single database, while a distributed system manages multiple databases on multiple systems. A distributed system's components can be classified, depending on the autonomy of the component systems, into two types: federated (autonomous) or non-federated (non-autonomous). Figure 55 illustrates the difference between centralized and distributed.

Figure 55 Centralized vs. Distributed

1.3.4.1 Centralized Databases

Centralized databases have all the data in one system in one place. All users come to the one system to access the data. For certain restricted data, centralization can be ideal, but for data that needs to be widely available, centralized databases have risks. For example, if the centralized system is unavailable, there are no other alternatives for accessing the data.

1.3.4.2 Distributed Databases

Distributed databases make possible quick access to data over a large number of nodes. Popular distributed database technologies are based on commodity hardware servers. They are designed to scale out from single servers to thousands of machines, each offering local computation and storage. Rather than relying on hardware to deliver high availability, the database management software itself is designed to replicate data amongst the servers, thereby delivering a highly available service on top of a cluster of computers. Database management software is also designed to detect and handle failures. While any given computer may fail, the system overall is unlikely to.

Some distributed databases implement a computational paradigm named MapReduce to further improve performance. In MapReduce, the data request is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, data is co-located on the compute nodes, providing very high aggregate bandwidth across the cluster. Both the filesystem and the application are designed to handle node failures automatically.

1.3.4.2.1 Federated Databases

Federation provisions data without additional persistence or duplication of source data. A federated database system maps multiple autonomous database systems into a single federated database. The constituent databases, sometimes geographically separated, are interconnected via a computer network. They remain autonomous yet participate in a federation to allow partial and controlled sharing of their data. Federation provides an alternative to merging disparate databases. Data federation does not actually integrate data in the constituent databases; instead, data interoperability manages the view of the federated databases as one large object (see Chapter 8). In contrast, a non-federated database system is an integration of component DBMSs that are not autonomous; they are controlled, managed, and governed by a centralized DBMS.

Federated databases are best for heterogeneous and distributed integration projects such as enterprise information integration, data virtualization, schema matching, and Master Data Management. Federated architectures differ based on levels of integration with the component database systems and the extent of services offered by the federation. An FDBMS can be categorized as either loosely or tightly coupled (see Figure 56).

Figure 56 Federated Databases

Loosely coupled systems require component databases to construct their own federated schema.
A user will typically access other component database systems by using a multi-database language, but this removes any level of location transparency, forcing the user to have direct knowledge of the federated schema. A user imports the data required from other component databases and integrates it with their own to form a federated schema.

Tightly coupled systems consist of component systems that use independent processes to construct and publish an integrated federated schema, as illustrated in Figure 57. The same schema can apply to all parts of the federation, with no data replication.

Figure 57 Coupling

1.3.4.2.2 Blockchain Database

Blockchain databases are a type of federated database used to securely manage financial transactions. They can also be used for contract management or exchange of health information. There are two types of structures: individual records and blocks. Each transaction has a record. The database creates chains of time-bound groups of transactions (blocks); each block also contains information from the previous block in the chain. Hash algorithms are used to create information about the transactions stored in a block as that block becomes the end of the chain. Once a new block is created, the old block's hash should never change, which means that no transactions contained within that block may change. Any change to transactions or blocks (tampering) will be apparent when the hash values no longer match.

1.3.4.3 Virtualization / Cloud Platforms

Virtualization (also called cloud computing) provides computation, software, data access, and storage services that do not require end-user knowledge of the physical location and configuration of the system that delivers the service(s). Parallels are often drawn between the concept of cloud computing and the electricity grid: end users consume power without needing to understand the component devices or infrastructure required to provide the service. However, virtualization can be on-premises or off-premises.

Cloud computing is a natural evolution of the widespread adoption of virtualization, service-oriented architectures, and utility computing. Here are some methods for implementing databases on the cloud:

Virtual machine image: Cloud platforms allow users to purchase virtual machine instances for a limited time. It is possible to run a database on these virtual machines. Users can either upload their own machine image with a database installed on it, or use ready-made machine images that already include an optimized installation of a database.

Database-as-a-service (DaaS): Some cloud platforms offer options for using a database-as-a-service, without physically launching a virtual machine instance for the database. In this configuration, application owners do not have to install and maintain the database on their own. Instead, the database service provider is responsible for installing and maintaining the database, and application owners pay according to their usage.

Managed database hosting on the cloud: Here the database is not offered as a service; instead, the cloud provider hosts the database and manages it on the application owner's behalf.

DBAs, in coordination with network and system administrators, need to establish a systematic, integrated project approach that includes standardization, consolidation, virtualization, and automation of data backup and recovery functions, as well as security of these functions:
Standardization/consolidation: Consolidation reduces the number of data storage locations an organization has, including the number of data stores and processes within a data center. Based on Data Governance policy, Data Architects and DBAs may develop standard procedures that include identifying mission-critical data, the duration of data retention, data encryption procedures, and data replication policies.

Server virtualization: Virtualization technologies allow equipment, such as servers from multiple data centers, to be replaced or consolidated. Virtualization lowers capital and operational expenses and reduces energy consumption. Virtualization technologies are also used to create virtual desktops, which can then be hosted in data centers and rented out on a subscription basis. Gartner views virtualization as a catalyst for modernization (Bittman, 2009). Virtualization gives data storage operations much more flexibility in provisioning storage in local or cloud environments.

Automation: Data automation involves automating tasks such as provisioning, configuration, patching, release management, and compliance.

Security: The security of data on virtual systems needs to be integrated with the existing security of physical infrastructures (see Chapter 7).

1.3.5 Database Processing Types

There are two basic types of database processing. ACID and BASE are on opposite ends of a spectrum, so the coincidental names matching the ends of a pH scale are helpful. The CAP Theorem is used to define how closely a distributed system may match either ACID or BASE.

1.3.5.1 ACID

The ACID acronym (Atomicity, Consistency, Isolation, Durability) names the constraints for achieving reliability within database transactions. For decades, it has provided transaction processing with a reliable foundation on which to build. (Jim Gray established the concept; Haerder and Reuter (1983) coined the term ACID.)

Atomicity: All operations are performed, or none of them is, so that if one part of the transaction fails, then the entire transaction fails.

Consistency: The transaction must meet all rules defined by the system at all times and must void half-completed transactions.

Isolation: Each transaction is independent unto itself.

Durability: Once complete, the transaction cannot be undone.

ACID technologies are the dominant tools in relational database storage; most use SQL as the interface.

1.3.5.2 BASE

The unprecedented increase in data volumes and variability, the need to document and store unstructured data, the need for read-optimized data workloads, and the subsequent need for greater flexibility in scaling, design, processing, cost, and disaster recovery gave rise to the diametric opposite of ACID, appropriately termed BASE:

Basically Available: The system guarantees some level of availability to the data even when there are node failures. The data may be stale, but the system will still give and accept responses.

Soft State: The data is in a constant state of flux; while a response may be given, the data is not guaranteed to be current.

Eventual Consistency: The data will eventually be consistent through all nodes and in all databases, but not every transaction will be consistent at every moment.

BASE-type systems are common in Big Data environments. Large online organizations and social media companies commonly use BASE implementations, as immediate accuracy of all data elements at all times is not necessary.
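To make the ACID side of this comparison concrete, the following is a minimal sketch (not part of the DMBOK text) using Python's built-in sqlite3 module; the account table, column names, and amounts are illustrative assumptions only. It shows atomicity and consistency: if any statement in the transaction violates a rule, the whole unit of work is rolled back and the stored balances are unchanged.

```python
import sqlite3

# In-memory database for illustration; a real system would use durable storage.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, "
             "balance INTEGER NOT NULL CHECK (balance >= 0))")
conn.execute("INSERT INTO account (id, balance) VALUES (1, 100), (2, 50)")
conn.commit()

def transfer(conn, from_id, to_id, amount):
    """Move funds between accounts as a single atomic transaction."""
    try:
        with conn:  # transaction scope: commits on success, rolls back on error
            conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?",
                         (amount, from_id))
            conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?",
                         (amount, to_id))
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance; neither UPDATE persists.
        print("Transfer rejected; transaction rolled back")

transfer(conn, 1, 2, 70)    # succeeds: balances become 30 and 120
transfer(conn, 1, 2, 500)   # violates the CHECK constraint: balances stay 30 and 120
print(conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall())
```

In a BASE-style store, by contrast, the two updates might land on different nodes and only become consistent eventually, which is the trade-off summarized in the table that follows.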
Table 12 summarizes the differences between ACID and BASE.

Table 12 ACID vs BASE

Item | ACID | BASE
Casting (data structure) | Schema must exist; table structure exists; columns data typed | Dynamic; adjust on the fly; store dissimilar data
Consistency | Strong consistency available | Strong, eventual, or none
Processing Focus | Transactional; row/column | Key-value stores; wide-column stores
History | 1970s application storage | 2000s unstructured storage
Scaling | Product dependent | Automatically spreads data across commodity servers
Origin | Mixture | Open-source
Transaction | Yes | Possible

1.3.5.3 CAP

The CAP Theorem (Brewer's Theorem) was developed in response to the shift toward distributed systems (Brewer, 2000). The theorem asserts that a distributed system cannot comply with all parts of ACID at all times. The larger the system, the lower the compliance. A distributed system must instead trade off between properties:

Consistency: The system must operate as designed and expected at all times.

Availability: The system must be available when requested and must respond to each request.

Partition Tolerance: The system must be able to continue operations during occasions of data loss or partial system failure.

The CAP Theorem states that at most two of the three properties can exist in any shared-data system, as illustrated in Figure 58.

Figure 58 CAP Theorem

An interesting use of this theorem drives the Lambda Architecture design discussed in Chapter 14. Lambda Architecture uses two paths for data: a Speed path, where availability and partition tolerance are most important, and a Batch path, where consistency and availability are most important.

1.3.6 Data Storage Media

Data can be stored on a variety of media, including disks, volatile memory, and flash drives. Some systems can combine multiple storage types. The most commonly used are Disk and Storage Area Networks (SAN), In-Memory, Columnar Compression Solutions, Virtual Storage Area Networks (VSAN), cloud-based storage solutions, Radio Frequency Identification (RFID), digital wallets, data centers, and private, public, and hybrid cloud storage. (See Chapter 14.)

1.3.6.1 Disk and Storage Area Networks (SAN)

Disk storage is a very stable method of storing data persistently. Multiple types of disk can exist in the same system. Data can be stored according to usage patterns, with less-used data stored on slower-access disks, which are usually cheaper than high-performance disk systems. Disk arrays can be collected into Storage Area Networks (SAN). Data movement on a SAN may not require a network, as data can be moved on the backplane.

1.3.6.2 In-Memory

In-Memory databases (IMDB) are loaded from permanent storage into volatile memory when the system is turned on, and all processing occurs within the memory array, providing faster response time than disk-based systems. Most in-memory databases also have features to set and configure durability in case of unexpected shutdown. If the application can be reasonably assured to fit most or all of its data into memory, then significant optimization is available from in-memory databases. IMDBs provide more predictable access time to data than disk storage mechanisms do, but they require a larger investment. IMDBs provide functionality for real-time processing of analytics and are generally reserved for this purpose due to the investment required.

1.3.6.3 Columnar Compression Solutions

Columnar-based databases are designed to handle data sets in which data values are repeated to a great extent.
For example, in a table with 256 columns, a lookup for a value that exists in a row will retrieve all the data in the row (and be somewhat disk-bound). Columnar storage reduces this I/O bandwidth by storing column data using compression, where a value such as state is stored as a pointer to a table of states, compressing the master table significantly.

1.3.6.4 Flash Memory

Recent advances in memory storage have made flash memory, or solid state drives (SSDs), an attractive alternative to disks. Flash memory combines the access speed of memory-based storage with the persistence of disk-based storage.

1.3.7 Database Environments

Databases are used in a variety of environments during the systems development lifecycle. When testing changes, DBAs should be involved in designing the data structures in the Development environment. The DBA team should implement any changes to the QA environment, and must be the only team implementing changes to the Production environment. Production changes must adhere strictly to standard processes and procedures.

While most data technology is software running on general-purpose hardware, occasionally specialized hardware is used to support unique data management requirements. Types of specialized hardware include data appliances: servers built specifically for data transformation and distribution. These servers integrate with existing infrastructure either directly, as a plug-in, or peripherally, as a network connection.

1.3.7.1 Production Environment

The production environment is the technical environment where all business processes occur. Production is mission-critical: if this environment ceases to operate, business processes will stop, resulting in bottom-line losses, as well as a negative impact on customers who are unable to access services. In an emergency, or for public service systems, unexpected loss of function can be disastrous.

The production environment is the 'real' environment from a business perspective. For a reliable production environment, other non-production environments must exist and be used appropriately. For example, production environments should not be used for development and testing, as these activities put production processes and data at risk.

1.3.7.2 Pre-production Environments

Pre-production environments are used to develop and test changes before such changes are introduced to the production environment. In pre-production environments, issues with changes can be detected and addressed without affecting normal business processes. In order to detect potential issues, the configuration of pre-production environments must closely resemble the production environment.

Due to space and cost, it is usually not possible to exactly replicate production in the pre-production environments. The closer on the development path the non-production environment is to the production environment, the more closely it needs to match the production environment. Any deviation from the production system equipment and configuration can itself create issues or errors that are unrelated to the change, complicating issue research and resolution.

Common types of pre-production environments include development, test, support, and special use environments.

1.3.7.2.1 Development

The development environment is usually a slimmer version of the production environment. It generally has less disk space, fewer CPUs, less RAM, etc.
Developers use this environment to create and test code for changes in separate environments, which are then combined in the QA environment for full integration testing. Development can have many copies of production data models, depending on how development projects are managed. Larger organizations may give individual developers their own environments to manage, with all appropriate rights.

The development environment should be the first place any patches or updates are applied for testing. This environment should be isolated from, and on different physical hardware than, the production environments. Due to the isolation, data from production systems may need to be copied to the development environments. However, in many industries, production data is protected through regulation. Do not move data from production environments without first determining what restrictions there are on doing so. (See Chapter 7.)

1.3.7.2.2 Test

The test environment is used to execute quality assurance and user acceptance testing and, in some cases, stress or performance tests. In order to prevent test results from being distorted due to environmental differences, the test environment ideally has the same software and hardware as the production environment. This is especially important for performance testing. Test environments may or may not be connected via network to production systems in order to read production data, but test environments should never write to production systems. Test environments serve many uses:

Quality Assurance Testing (QA): Used to test functionality against requirements.

Integration Testing: Used for testing as a whole multiple parts of a system that have been developed or upgraded independently.

User Acceptance Testing (UAT): Used for testing system functionality from a user perspective. Use Cases are the most common inputs for testing performed in this environment.

Performance Testing: Used to perform high-volume or high-complexity tests at any time, rather than having to wait for off hours or adversely affecting production system peak time.

1.3.7.2.3 Sandboxes or Experimental Environments

A sandbox is an alternate environment that allows read-only connections to production data and can be managed by the users. Sandboxes are used to experiment with development options, test hypotheses about data, or merge production data with user-developed data or supplemental data obtained from external sources. Sandboxes are valuable, for example, when performing a Proof-of-Concept.

A sandbox environment can either be a sub-set of the production system, walled off from production processing, or a completely separate environment. Sandbox users often have CRUD rights over their own space so that they can quickly validate ideas and options for changes to the system. The DBAs usually have little to do with these environments other than setting them up, granting access, and monitoring usage. If the sandbox areas are situated in production database systems, they must be isolated in order to avoid adversely affecting production operations. These environments should never write back to the production systems. Sandbox environments can be handled by virtual machines (VMs), unless licensing costs for separate instances become prohibitive.

1.3.8 Database Organization

Data storage systems provide a way to encapsulate the instructions necessary to put data on disks and manage processing, so developers can simply use instructions to manipulate data.
Databases are organized in three general ways: hierarchical, relational, and non-relational. These classes are not mutually exclusive (see Figure 59). Some database systems can read and write data organized in both relational and non-relational structures. Hierarchical databases can be mapped to relational tables. Flat files with line delimiters can be read as tables with rows, and one or more columns can be defined to describe the row contents.

Figure 59 Database Organization Spectrum (hierarchical / tree schema at the more controlled end of the spectrum, non-relational at the less controlled end)

1.3.8.1 Hierarchical

Hierarchical database organization is the oldest database model, used in early mainframe DBMSs, and is the most rigid of structures. In hierarchical databases, data is organized into a tree-like structure with mandatory parent/child relationships: each parent can have many children, but each child has only one parent (also known as a 1-to-many relationship). Directory trees are an example of a hierarchy. XML also uses a hierarchical model. A hierarchical database can be represented as a relational database, although the actual structure is that of a tree traversal path.

1.3.8.2 Relational

People sometimes think that relational databases are named for the relation between tables. This is not the case. Relational databases are based on set theory and relational algebra, where data elements or attributes (columns) are related into tuples (rows). (See Chapter 5.) Tables are sets of relations with identical structure. Set operations (like union, intersect, and minus) are used to organize and retrieve data from relational databases, in the form of Structured Query Language (SQL). In order to write data, the structure (schema) has to be known in advance (schema on write). Relational databases are row-oriented.

The database management system (DBMS) of a relational database is called an RDBMS. A relational database is the predominant choice for storing data that constantly changes. Variations on relational databases include Multidimensional and Temporal.

1.3.8.2.1 Multidimensional

Multidimensional database technologies store data in a structure that allows searching using several data element filters simultaneously. This type of structure is used most frequently in Data Warehousing and Business Intelligence. Some of these database types are proprietary, although most large databases have cube technology built in as objects. Access to the data uses a variant of SQL called MDX, or Multidimensional eXpression.

1.3.8.2.2 Temporal

A temporal database is a relational database with built-in support for handling data involving time. The temporal aspects usually include valid time and transaction time. These attributes can be combined to form bi-temporal data.

Valid time is the timeframe when a fact is true with respect to the entity it represents in the real world. Transaction time is the period during which a fact stored in the database is considered true.

It is possible to have timelines other than Valid Time and Transaction Time, such as Decision Time, in the database. In that case, the database is called a multi-temporal database as opposed to a bi-temporal database. Temporal databases enable application developers and DBAs to manage current, proposed, and historical versions of data in the same database.

1.3.8.3 Non-relational

Non-relational databases can store data as simple strings or complete files.
Data in these files can be read in a variety of ways. Non-relational databases may be row-oriented, but this is not required. A non-relational database provides a mechanism for storage and retrieval of data that employs less constrained consistency models than traditional relational databases. Motivations for this approach include simplicity of design, horizontal scaling, and finer control over availability.

Non-relational databases are usually referred to as NoSQL databases. The differentiating factor is the storage structure itself, where the data structure is no longer bound to a tabular relational design. It could be a tree, a graph, a network, or a key-value pairing. The NoSQL tag emphasizes that some editions may in fact support conventional SQL directives. These databases are often highly optimized data stores intended for simple retrieval and appending operations. The goal is improved performance, especially with respect to latency and throughput. NoSQL databases are used increasingly in Big Data and real-time web applications. (See Chapter 5.)

1.3.8.3.1 Column-oriented

Column-oriented databases are used mostly in Business Intelligence applications because they can compress redundant data. For example, a state ID column stores only its unique values, instead of one value for each of a million rows.

There are trade-offs between column-oriented (non-relational) and row-oriented (usually relational) organization:

Column-oriented organization is more efficient when an aggregate needs to be computed over many rows. This only holds true for a notably smaller subset of all columns of data, because reading that smaller subset can be faster than reading all data.

Column-oriented organization is more efficient when new values of a column are supplied for all rows at once, because that column data can be written efficiently to replace old column data without touching any other columns for the rows.

Row-oriented organization is more efficient when many columns of a single row are required at the same time, and when row size is relatively small, as the entire row can be retrieved with a single disk seek.

Row-oriented organization is more efficient when writing a new row if all of the row data is supplied at the same time; the entire row can be written with a single disk seek.

In practice, row-oriented storage layouts are well suited for Online Transaction Processing (OLTP)-like workloads, which are more heavily loaded with interactive transactions. Column-oriented storage layouts are well suited for Online Analytical Processing (OLAP)-like workloads (e.g., data warehouses), which typically involve a smaller number of highly complex queries over all data (possibly terabytes).

1.3.8.3.2 Spatial

A spatial database is optimized to store and query data that represents objects defined in a geometric space. Spatial databases support several primitive types (simple geometric shapes such as box, rectangle, cube, cylinder, etc.) and geometries composed of collections of points, lines, and shapes.

Spatial database systems use indexes to quickly look up values, but the way that most databases index data is not optimal for spatial queries. Instead, spatial databases use a spatial index to speed up database operations.

Spatial databases can perform a wide variety of spatial operations. As per the Open Geospatial Consortium standard, a spatial database may perform one or more of the following operations:

Spatial Measurements: Computes line length, polygon area, the distance between geometries, etc.
Spatial Functions: Modifies existing features to create new ones; for example, by providing a buffer around them, intersecting features, etc.

Spatial Predicates: Allows true/false queries about spatial relationships between geometries. Examples include whether two polygons overlap, or whether there is a residence located within a mile of a given point.

Geometry Constructors: Creates new geometries, usually by specifying the vertices (points or nodes) which define the shape.

Observer Functions: Queries that return specific information about a feature, such as the location of the center of a circle.

1.3.8.3.3 Object / Multi-media

A multimedia database includes a Hierarchical Storage Management system for the efficient management of a hierarchy of magnetic and optical storage media. It also includes a collection of object classes, which represents the foundation of the system.

1.3.8.3.4 Flat File Database

A flat file database describes any of various means to encode a data set as a single file. A flat file can be a plain text file or a binary file. Strictly, a flat file database consists of nothing but data, and contains records that may vary in length and delimiters. More broadly, the term refers to any database that exists in a single file in the form of rows and columns, with no relationships or links between records and fields except the structure. Plain text files usually contain one record per line. A list of names, addresses, and phone numbers, written by hand on a sheet of paper, is an example of a flat file database. Flat files are used not only as data storage tools in DBMS systems, but also as data transfer tools. Hadoop databases use flat file storage.

1.3.8.3.5 Key-Value Pair

Key-value pair databases contain sets of two items: a key identifier and a value. There are a few specific uses of these types of databases:

Document Databases: Document-oriented databases contain collections of files, including both structure and data. Each document is assigned a key. More advanced document-oriented databases can also handle both complete and incomplete documents. Document databases may use XML or JSON (JavaScript Object Notation) structures.

Graph Databases: Graph databases store key-value pairs where the focus is on the relationships between the nodes, rather than on the nodes themselves.

1.3.8.3.6 Triplestore

A data entity composed of subject-predicate-object is known as a triple. In Resource Description Framework (RDF) terminology, a triple is composed of a subject that denotes a resource, a predicate that expresses a relationship between the subject and the object, and the object itself. A triplestore is a purpose-built database for the storage and retrieval of triples in the form of subject-predicate-object expressions.

Triplestores can be broadly classified into three categories: native triplestores, RDBMS-backed triplestores, and NoSQL triplestores. Native triplestores are implemented from scratch and exploit the RDF data model to efficiently store and access RDF data. RDBMS-backed triplestores are built by adding an RDF-specific layer to an existing RDBMS. NoSQL triplestores are currently being investigated as possible storage managers for RDF. Triplestore databases are best for taxonomy and thesaurus management, linked data integration, and knowledge portals.

1.3.9 Specialized Databases

Some specialized situations require specialized types of databases that are managed differently from traditional relational databases.
Examples include:

Computer Assisted Design and Manufacturing (CAD / CAM) applications require an object database, as do most embedded real-time applications.

Geographical Information Systems (GIS) make use of specialized geospatial databases, which have at least annual updates to their Reference Data. Some specialized GIS are used for utilities (electric grid, gas lines, etc.), for telecom network management, or for ocean navigation.

Shopping-cart applications, found on most online retail websites, make use of XML databases to initially store the customer order data, and may be used in real time by social media databases for ad placement on other websites.

Some of this data is then copied into one or more traditional OLTP (Online Transaction Processing) databases or data warehouses. In addition, many off-the-shelf vendor applications may use their own proprietary databases. At the very least, their schemas will be proprietary and mostly concealed, even if they sit on top of traditional relational DBMSs.

1.3.10 Common Database Processes

All databases, no matter the type, share the following processes in some way.

1.3.10.1 Archiving

Archiving is the process of moving data off immediately accessible storage media and onto media with lower retrieval performance. Archives can be restored to the originating system for short-term use. Data that is not actively needed to support application processes should be moved to an archive on less-expensive disk, tape, or a CD / DVD jukebox. Restoring from an archive should be a matter of simply copying the data from the archive back into the system.

Archival processes must be aligned with the partitioning strategy to ensure optimal availability and retention. A robust approach involves:
- Creating a secondary storage area, preferably on a secondary database server
- Partitioning existing database tables into archival blocks
- Replicating the data that is needed less often to the separate database
- Creating tape or disk backups
- Creating database jobs that periodically purge unneeded data

It is wise to schedule regular tests of archive restoration to avoid surprises in an emergency. When changes are made to the technology or structure of a production system, the archive also needs to be evaluated to ensure that data moved from the archive into current storage will be readable. There are several ways of handling out-of-synch archives:

- Determine if or how much of the archive is required to be preserved. What is not required can be considered purged.
- For major changes in technology, restore the archives to the originating system before the technology change, upgrade or migrate to the new technology, and re-archive the data using the new technology.
- For high-value archives where the source database structures change, restore the archive, make any changes to the data structures, and re-archive the data with the new structure.
- For infrequent-access archives where the source technology or structure changes, keep a small version of the old system running with limited access, and extract from the archives using the old system as needed.

Archives that are not recoverable with current technology are useless, and keeping old machinery around to read archives that cannot otherwise be read is neither efficient nor cost-effective.
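As an illustration of the archive-and-purge pattern described above, here is a minimal sketch (not part of the DMBOK text) using Python's built-in sqlite3 module. The orders table, the 90-day retention window, and the use of an attached database as the secondary storage area are illustrative assumptions, not prescriptions from the chapter.

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")                      # stand-in for the production database
conn.execute("ATTACH DATABASE ':memory:' AS archive")   # stand-in for the secondary (archive) storage area

conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT, detail TEXT)")
conn.execute("CREATE TABLE archive.orders (id INTEGER PRIMARY KEY, created_at TEXT, detail TEXT)")

# Sample rows: one recent order, one old enough to archive.
now = datetime.now()
conn.execute("INSERT INTO orders VALUES (1, ?, 'recent order')",
             ((now - timedelta(days=5)).isoformat(),))
conn.execute("INSERT INTO orders VALUES (2, ?, 'old order')",
             ((now - timedelta(days=400)).isoformat(),))
conn.commit()

def archive_and_purge(conn, cutoff):
    """Copy rows older than the cutoff to the archive area, then purge them from primary storage."""
    with conn:  # single transaction: either both steps happen or neither does
        conn.execute("INSERT INTO archive.orders SELECT * FROM orders WHERE created_at < ?", (cutoff,))
        conn.execute("DELETE FROM orders WHERE created_at < ?", (cutoff,))

archive_and_purge(conn, (now - timedelta(days=90)).isoformat())
print(conn.execute("SELECT count(*) FROM orders").fetchone())          # (1,) remains in primary storage
print(conn.execute("SELECT count(*) FROM archive.orders").fetchone())  # (1,) moved to the archive
```

A production job would typically also verify the copied rows before deleting them and write an audit record, in keeping with the controlled, documented, and auditable change processes described earlier in this chapter.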
1.3.10.2 Capacity and Growth Projections

Think of a database as a box, the data as fruit, and overhead (indexes, etc.) as packing material. The box has dividers, and fruit and packing material go in the cells:

First, decide the size of the box that will hold all the fruit and any packing material needed: that is the capacity. How much fruit goes into the box, and how quickly? How much fruit comes out of the box, and how quickly?

Decide if the box will stay the same size over time, or must be expanded over time to hold more fruit. This projection of how much and how quickly the box must expand to hold incoming fruit and packing material is the growth projection. If the box cannot expand, the fruit must be taken out as fast as it is put in, and the growth projection is zero.

How long should the fruit stay in the cells? If the fruit in one cell gets dehydrated over time, or for any reason becomes less useful, should that fruit be put in a separate box for longer-term storage (i.e., archived)? Will there ever be a need to bring that dehydrated fruit back into the main box? Moving the fruit to another box, with the ability to move it back into the first box, is an important part of archiving. This allows the box to not have to be expanded quite as often or as much.

If a fruit becomes too stagnant to use, throw that fruit away (i.e., purge the data).

1.3.10.3 Change Data Capture (CDC)

Change data capture refers to the process of detecting that data has changed and ensuring that information relevant to the change is stored appropriately. Often referred to as log-based replication, CDC is a non-invasive way to replicate data changes to a target without affecting the source. In a simplified CDC context, one computer system has data that may have changed from a previous point in time, and a second computer system needs to reflect the same change. Rather than sending the entire database over the network to reflect just a few minor changes, the idea is to send just what changed (the deltas), so that the receiving system can make the appropriate updates.

There are two different methods to detect and collect changes: data versioning, which evaluates columns that identify rows that have changed (e.g., last-update-timestamp columns, version-number columns, status-indicator columns), and reading logs that document the changes and enable them to be replicated in secondary systems.

1.3.10.4 Purging

It is incorrect to assume that all data will reside forever in primary storage. Eventually, the data will fill the available space, and performance will begin to degrade. At that point, data will need to be archived, purged, or both. Just as importantly, some data will degrade in value and is not worth keeping. Purging is the process of completely removing data from storage media such that it cannot be recovered.

A principal goal of data management is that the cost of maintaining data should not exceed its value to the organization. Purging data reduces costs and risks. Data to be purged is generally deemed obsolete and unnecessary, even for regulatory purposes. Some data may become a liability if kept longer than necessary. Purging it reduces the risks that it may be misused.

1.3.10.5 Replication

Data replication means that the same data is stored on multiple storage devices.
In some situations, having duplicate databases is useful, such as in a high-availability environment where spreading the workload among identical databases on different hardware, or even in different data centers, can preserve functionality during peak usage times or disasters. Replication can be active or passive:

Active replication is performed by recreating and storing the same data at every replica from every other replica.

Passive replication involves recreating and storing data on a single primary replica and then transforming its resultant state to other secondary replicas.

Replication has two dimensions of scaling: horizontal data scaling has more data replicas, while vertical data scaling has data replicas located geographically further away. Multi-master replication, where updates can be submitted to any database node and then ripple through to other servers, is often desired, but it increases complexity and cost.

Replication transparency occurs when data is replicated between database servers so that the information remains consistent throughout the database system and users cannot tell, or even know, which database copy they are using.

The two primary replication patterns are mirroring and log shipping (see Figure 60). In mirroring, updates to the primary database are replicated immediately (relatively speaking) to the secondary database, as part of a two-phase commit process. In log shipping, a secondary server receives and applies copies of the primary database's transaction logs at regular intervals.

The choice of replication method depends on how critical the data is, and how important it is that failover to the secondary server be immediate. Mirroring is usually a more expensive option than log shipping. For one secondary server, mirroring is effective; log shipping may be used to update additional secondary servers.

Figure 60 Log Shipping vs. Mirroring

1.3.10.6 Resiliency and Recovery

Resiliency in databases is the measurement of how tolerant a system is to error conditions. If a system can tolerate a high level of processing errors and still function as expected, it is highly resilient. If an application crashes upon the first unexpected condition, that system is not resilient. If the database can detect and either abort or automatically recover from common processing errors (a runaway query, for example), it is considered resilient. There are always some conditions that no system can detect in advance, such as a power failure, and those conditions are considered disasters.

Three recovery types provide guidelines for how quickly recovery takes place and what it focuses on:

Immediate recovery from some issues can sometimes be achieved through design; for example, by predicting and automatically resolving issues, such as those that might be caused by a failover to a backup system.

Critical recovery refers to a plan to restore the system as quickly as possible in order to minimize delays or shutdowns of business processes.

Non-critical recovery means that restoration of function can be delayed until systems that are more critical have been restored.

Data processing errors include data load failures, query return failures, and obstacles to completing ETL or other processes.
Common ways of increasing resilience in data processing systems are to trap and re-route data causing errors, to detect and ignore data causing errors, and to implement flags in processing for completed steps, to avoid reprocessing data or repeating completed steps when restarting a process.

Each system should require a certain level of resiliency (high or low). Some applications may require that any error halts all processing (low resiliency), while others may only require that the errors be trapped and re-routed for review, if not outright ignored. For extremely critical data, the DBA will need to implement a replication pattern in which data moves to a secondary copy as soon as it is captured.

1.3.10.7 Retention

Data retention refers to how long data is kept available. Data retention planning should be part of the physical database design. Retention requirements also affect capacity planning.

Data Security also affects data retention plans, as some data needs to be retained for specific timeframes for legal reasons. Failure to retain data for the appropriate length of time can have legal consequences. Likewise, there are also regulations related to purging data. Data can become a liability if kept longer than specified. Organizations should formulate retention policies based on regulatory requirements and risk management guidelines. These policies should drive specifications for purging and archiving of data.

1.3.10.8 Sharding

Sharding is a process where small chunks of the database are isolated and can be updated independently of other shards, so replication is merely a file copy. Because the shards are small, refreshes/overwrites may be optimal.

2. Activities

The two main activities in Data Storage and Operations are Database Technology Support and Database Operations Support. Database Technology Support is specific to selecting and maintaining the software that stores and manages the data. Database Operations Support is specific to the data and processes that the software manages.

2.1 Manage Database Technology

Managing database technology should follow the same principles and standards used for managing any technology. The leading reference model for technology management is the Information Technology Infrastructure Library (ITIL), a technology management process model developed in the United Kingdom. ITIL principles apply to managing data technology (see http://bit.ly/1gA4mpr).

2.1.1 Understand Database Technology Characteristics

It is important to understand how technology works, and how it can provide value in the context of a particular business. The DBA, along with the rest of the data services team, works closely with business users and managers to understand the data and information needs of the business. DBAs and Database Architects combine their knowledge of available tools with the business requirements in order to suggest the best possible applications of technology to meet organizational needs.

Data professionals must first understand the characteristics of a candidate database technology before determining which to recommend as a solution. For example, database technologies that do not have transaction-based capabilities (e.g., commit and rollback) are not suitable for operational situations supporting Point-of-Sale processes.

Do not assume that a single type of database architecture or DBMS works for every need. Most organizations have multiple database tools installed, to perform a range of functions, from performance tuning to backups, to managing the database itself.
Only a few of these tool sets have mandated standards.

2.1.2 Evaluate Database Technology

Selecting strategic DBMS software is particularly important. DBMS software has a major impact on data integration, application performance, and business productivity. Some of the factors to consider when selecting DBMS software include:

- Product architecture and complexity
- Volume and velocity limits, including streaming rate
- Application profile, such as transaction processing, Business Intelligence, and personal profiles
- Specific functionality, such as temporal calculation support
- Hardware platform and operating system support
- Availability of supporting software tools
- Performance benchmarks, including real-time statistics
- Scalability
- Software, memory, and storage requirements
- Resiliency, including error handling and reporting

Some factors are not directly related to the technology itself, but rather to the
