

DATABASE FUNDAMENTALS AND ADVANCED CONCEPTS
Chapter 1: The Three Database Revolutions

Early Database Systems

Definition: A database is an organized collection of data. The term became common in the late 1960s, but the concept of organizing data has been important throughout human history.

Historical Examples of Data Organization

Books: Dictionaries and encyclopedias are structured datasets in physical form.
Libraries: The preindustrial equivalents of modern database systems.
Punched cards: Used in the 19th century for purposes such as programming fabric looms, producing census statistics, and controlling player pianos.

Emergence of Digital Databases

The first revolution in databases came with the advent of electronic computers after World War II. Early digital computers were used for mathematical functions and data manipulation, such as processing encrypted military communications.

Storage Media Evolution

Early "databases" used paper tape and magnetic tape, which allowed only sequential access to data. The introduction of spinning magnetic disks in the mid-1950s enabled direct, high-speed access to individual records.

Development of Indexing Methods

Indexing methods such as ISAM (Indexed Sequential Access Method) allowed fast record-oriented access. This advance led to the creation of the first OLTP (Online Transaction Processing) systems.

The Pre-DBMS Era

Early electronic databases were controlled entirely by the application: there were databases, but no Database Management Systems (DBMS).

The First Database Revolution

Productivity issues: Each application had to write its own data-handling code, which was inefficient and led to frequent data corruption when that code contained errors.

The need for a DBMS: To address these problems, the concept of a Database Management System (DBMS) emerged. A DBMS externalizes database-handling logic from applications, reducing programmer workload while safeguarding data integrity and performance.
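The jump from tape-style sequential access to disk-style indexed access can be illustrated with a minimal sketch. This is a toy in-memory illustration of the general idea behind indexing, not ISAM itself; the record contents are invented for the example.

```python
# Toy illustration of why indexes mattered: an index mapping a record's
# key to its position lets a lookup jump straight to the record instead
# of scanning every record in order.

records = [
    {"id": 101, "name": "Ada"},
    {"id": 205, "name": "Grace"},
    {"id": 309, "name": "Edgar"},
]

def sequential_lookup(key):
    """Tape-style access: examine records in order until the key matches."""
    for rec in records:
        if rec["id"] == key:
            return rec
    return None

# Disk-style access: build an index once, then use it for direct lookups.
index = {rec["id"]: pos for pos, rec in enumerate(records)}

def indexed_lookup(key):
    pos = index.get(key)
    return records[pos] if pos is not None else None

assert sequential_lookup(205) == indexed_lookup(205)
```

A real ISAM file kept the index on disk alongside the sequentially ordered data, but the trade-off is the same: a small amount of extra structure buys direct, record-oriented access.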
These first-generation systems enforced a schema and a fixed access path for navigating records, such as retrieving the orders for a specific customer.

The Mainframe Era

First-generation databases ran on mainframe systems, primarily IBM's. By the early 1970s, two major DBMS models were competing:

Network model: Formalized by the CODASYL standard and implemented in databases such as IDMS.
Hierarchical model: A simpler approach, exemplified by IBM's IMS (Information Management System).

Hierarchical and network database systems dominated the era of mainframe computing and powered the vast majority of computer applications up until the late 1970s. However, these systems had notable drawbacks. These navigational databases were extremely inflexible in both data structure and query capability: generally, only queries that could be anticipated during the initial design phase were possible, and it was extremely difficult to add new data elements to an existing system. Because early database systems focused on simple CRUD operations, complex analytic queries required extensive coding. As businesses relied ever more heavily on computer systems, demand for analytic reports grew, creating backlogs in IT departments and a generation of programmers writing repetitive COBOL report code.

The Second Database Revolution

No single person has had more influence over database technology than Edgar Codd. Codd received a mathematics degree from Oxford shortly after the Second World War and subsequently immigrated to the United States, where he worked for IBM on and off from 1949 onwards. As a "programming mathematician," Codd worked on some of IBM's very first commercial electronic computers. In the late 1960s, he was working at an IBM laboratory in San Jose, California. Codd was very familiar with the databases of the day, and he harbored significant reservations about their design.
In particular, he felt that:

Existing databases were too hard to use. Databases of the day could be accessed only by people with specialized programming skills.

Existing databases lacked a theoretical foundation. Codd's mathematical background led him to think about data in terms of formal structures and logical operations; he regarded existing databases as using arbitrary representations that neither ensured logical consistency nor provided a way to deal with missing information.

Existing databases mixed logical and physical implementations. The representation of data matched the format of the physical storage, rather than offering a logical representation of the data that a nontechnical user could comprehend.

Codd published an internal IBM paper outlining his ideas for a more formalized model for database systems, which led to his 1970 paper "A Relational Model of Data for Large Shared Data Banks". This classic paper contained the core ideas of the relational model, which became the most significant, almost universal, model for database systems for a generation.

Relational Theory

The intricacies of relational database theory can be complex and are beyond the scope of this introduction. At its essence, however, the relational model describes how a given set of data should be presented to the user, rather than how it should be stored on disk or in memory. Key concepts of the relational model include:

Tuples: unordered sets of attribute values. In an actual database system, a tuple corresponds to a row, and an attribute to a column value.
Relations: collections of distinct tuples. A relation corresponds to a table in relational database implementations.
Constraints: rules that enforce consistency of the database. Key constraints are used to identify tuples and the relationships between tuples.
Operations on relations, such as joins, projections, and unions.
These operations always return relations. In practice, this means that a query on a table returns data in a tabular format.

A row in a table should be identifiable, and efficiently accessible, by a unique key value, and every column in that row must be dependent on that key value and no other identifier. Arrays and other structures that contain nested information are therefore not directly supported.

Levels of conformance to the relational model are described by the various "normal forms." Third normal form is the most common level. Database practitioners typically remember its definition with the mnemonic that all non-key attributes must be dependent on "the key, the whole key, and nothing but the key—So Help Me Codd"!

The First Relational Databases

The initial response to the relational model was lukewarm, with vendors like IBM hesitant to embrace it. They were skeptical of Codd's claim that existing databases were fundamentally flawed, and doubted that a system could offer high performance without being finely tuned to specific access mechanisms. They questioned whether a database could efficiently handle any type of data access a user might need.

Despite this skepticism, IBM began developing a prototype relational database system, known as System R, in 1974. The project proved that relational databases could perform adequately, and it introduced the SQL language. Around the same time, Mike Stonebraker at Berkeley developed a relational database system called INGRES, which used a non-SQL query language called QUEL.

Larry Ellison, a technically sophisticated entrepreneur with experience at Amdahl, recognized the potential of relational databases after learning about Codd's work and System R.
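The relational concepts above (relations as tables, key constraints, and operations such as joins and projections that themselves return relations) can be illustrated with a small sketch using SQLite via Python's standard sqlite3 module. The customers/orders schema here is invented for the example.

```python
import sqlite3

# A hypothetical two-relation schema: customers and their orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,  -- key constraint: identifies a tuple
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        amount      REAL
    );
    INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders VALUES (10, 1, 99.5), (11, 1, 25.0), (12, 2, 14.95);
""")

# A join followed by a projection onto (name, amount). The result is
# itself a relation: rows and columns, regardless of physical storage.
rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    WHERE c.name = 'Acme'
""").fetchall()
print(sorted(rows))  # [('Acme', 25.0), ('Acme', 99.5)]
```

Note also that the schema is in third normal form in miniature: each non-key column (a customer's name, an order's amount) depends only on its own table's key.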
In 1977, he founded the company that would later become Oracle Corporation, which released the first commercially successful relational database system.

Database Wars!

During the late 1970s and early 1980s, minicomputers emerged, challenging and eventually ending the dominance of mainframes. Although not truly "mini" by today's standards, minicomputers required fewer specialized facilities and enabled mid-sized companies to own their own computing infrastructure. This shift created demand for new databases compatible with the operating systems running on these hardware platforms.

By 1981, IBM had released SQL/DS, a commercial relational database that ran only on IBM mainframes, limiting its impact on the growing minicomputer market. In contrast, Ellison's Oracle database, released in 1979, quickly gained popularity on minicomputers from companies such as Digital and Data General. At the same time, the Berkeley INGRES project led to the creation of the commercial Ingres database. Oracle and Ingres became key competitors, battling for dominance in the early minicomputer relational database market.

By the mid-1980s, the advantages of relational databases were well recognized, even where the underlying theory was not fully understood. The SQL language, adopted by all major vendors including Ingres, significantly boosted productivity for report writing and analytic queries. The rise of 4GLs (fourth-generation programming languages), which integrated well with relational databases, further supported this trend. In addition, minicomputers, offering better price/performance than mainframes, became popular in the midmarket, where relational databases were the dominant choice.

Relational databases became so dominant that vendors of older database systems began labeling their offerings as relational to stay relevant.
This led Codd to formulate his 12 (actually 13, numbered 0 through 12) rules to differentiate genuine relational databases from impostors.

Over the following decades, many new database systems emerged, including Sybase, Microsoft SQL Server, Informix, MySQL, and DB2. Despite their claims of superior performance, availability, functionality, or cost-effectiveness, they all adhere to three core principles: Codd's relational model, the SQL language, and the ACID transaction model.

Client-server Computing

By the late 1980s, the relational model had become the dominant paradigm in both mindshare and market share, particularly with the rise of client-server computing. Whereas minicomputers had been, in effect, small mainframes, with all processing done centrally and users interacting through basic terminals, a new application architecture was emerging.

The rise of IBM PC-based microcomputers and graphical user interfaces such as Microsoft Windows led to the client-server model. In this setup:

Client side: Presentation logic and application logic typically run on a PC with a graphical interface.
Server side: A database server, typically on a minicomputer, handles data management. Application logic can also reside on the server in the form of stored procedures.

Client-server architecture offered a far richer user experience than the green-screen terminals of the past. By the early 1990s, nearly all new applications adopted the client-server model, relying on relational database management systems (RDBMS) and SQL for communication between client and server.

The Relational Plateau

After the initial enthusiasm for object-oriented databases faded, relational databases remained dominant and essentially unchallenged until the mid-2000s. From 1995 to 2005, no major new database systems emerged; the market was saturated with RDBMSs, and the relational model's hold was strong.
This period, which saw the Internet expand from a niche interest to a global network, highlights the enduring power and dominance of the relational database model.

The Third Database Revolution

By the mid-2000s, the relational database appeared firmly established, with no signs of radical change ahead. Ongoing innovation within existing relational systems was expected, but the dominance of the relational model seemed secure. However, the rise of massive web-scale applications introduced new demands that traditional client-server architectures, and by extension relational databases, could not adequately address through incremental improvement alone. This shift marked the beginning of the end for the era of complete relational database supremacy.

Google and Hadoop

By 2005, Google had become the largest website in the world, facing data challenges far beyond the capabilities of traditional relational databases. The sheer volume and velocity of its data forced Google to develop new hardware and software solutions to manage and process it.

In 2003, Google described the Google File System (GFS), the distributed file system underpinning its storage architecture. In 2004, it described MapReduce, a distributed parallel processing framework used to create its web indexes, and by 2006 it had revealed BigTable, a distributed structured database designed to handle massive amounts of data.

These innovations, along with other technologies developed primarily at Google, laid the foundation for the Hadoop project, which matured at Yahoo! and gained rapid adoption starting in 2007. More than anything else, the Hadoop ecosystem marked a significant shift away from the era of relational database dominance, addressing the needs of big data in ways that traditional relational systems could not.
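The MapReduce idea mentioned above can be sketched in a few lines: a map phase emits key/value pairs, a shuffle groups values by key, and a reduce phase aggregates each group. This single-process word-count toy only illustrates the programming model; real MapReduce shards both phases across many machines.

```python
from collections import defaultdict

# Toy single-process sketch of the MapReduce programming model.
documents = ["the quick fox", "the lazy dog", "the fox"]

def map_phase(doc):
    # Emit one (word, 1) pair per word occurrence.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Aggregate all values emitted for one key.
    return key, sum(values)

pairs = [p for doc in documents for p in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["the"])  # 3
```

Because each map call and each reduce call is independent, the framework can run them in parallel across a cluster and rerun failed pieces, which is what made the model suited to web-scale data.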
Cloud Computing

Around 2008, cloud computing became a major focus for organizations and startups, marking a shift from client-server applications to web-based applications hosted in the cloud. Amazon's Elastic Compute Cloud (EC2) and related services, part of Amazon Web Services (AWS), offered scalable hosting solutions and inspired similar platforms from Google and Microsoft.

Traditional relational databases struggled with the scalability demands of cloud computing, leading to the rise of non-relational databases such as Amazon's SimpleDB and DynamoDB, which are better suited to the elastic needs of cloud-based applications.

Document Databases

Programmers remained frustrated by the mismatch between the object-oriented and relational models, and object-relational mapping (ORM) provided only limited relief. Around 2004, the rise of AJAX enabled richer web experiences by letting JavaScript in the browser communicate with backends, initially using XML and later the more compact JSON format.

JSON became the standard for serializing objects, leading to the development of document databases such as Couchbase and MongoDB that store JSON directly, bypassing the relational model. This approach simplifies storage and appeals to programmers for its ease of use.

The "NewSQL"

In 2007, Michael Stonebraker and his team published an influential paper arguing that the traditional relational database architecture, which had been largely uniform across major RDBMS products, was outdated given changes in hardware and the diversity of modern workloads. They proposed specialized database designs optimized for specific tasks, leading to the development of H-Store, a distributed in-memory database, and C-Store, a columnar database. These innovations became foundational to NewSQL databases, which retain relational principles while departing from traditional architectures such as those of Oracle and SQL Server.
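The columnar idea behind C-Store can be sketched minimally: storing each column contiguously means an analytic query reads only the columns it needs, instead of every field of every row. This is an illustrative toy with invented data (amounts in integer cents), not C-Store's actual design, which adds compression, sorting, and disk-aware layouts.

```python
# Toy contrast between row-oriented and column-oriented layouts.
rows = [
    ("Acme",   2019, 9950),  # (customer, year, amount in cents)
    ("Globex", 2019, 1495),
    ("Acme",   2020, 2500),
]

# Column-oriented layout: one contiguous list per attribute.
names, years, amounts = (list(col) for col in zip(*rows))

# An analytic aggregate over one attribute touches a single column;
# a row store would have to read every field of every row.
total = sum(amounts)
print(total)  # 13945
```

The same trade-off runs the other way for OLTP: fetching one whole record is cheap in a row store but touches every column list in a column store, which is why Stonebraker's argument led to specialized engines rather than one replacement architecture.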
The Nonrelational Explosion

In the late 2000s, a surge of new database systems emerged, with 2008-2009 being particularly prolific. Many of these, including MongoDB, Cassandra, and HBase, became significant players in the market. These systems initially lacked a unified name; "NoSQL" eventually became the accepted term despite its limitations. "NoSQL" databases break away from the traditional relational model and the SQL language, focusing on different approaches to data handling. By 2011, "NewSQL" had been coined to describe modern databases that modify or enhance relational principles. The term "Big Data" later became mainstream, referring to technologies, such as Hadoop, that handle large, unstructured datasets.

Conclusion: One Size Doesn't Fit All

Thank you. Any questions?
