Chapter-1.pdf
Document Details
Uploaded by RespectableObsidian7450
Tags
Related
- Chapter 2 Database Concepts PDF
- Database Management - Info-Management PDF
- WKL4_Database Systems and Networks Slides only.pdf
- Principles of Information Systems Chapter 5: Database System and Big Data PDF
- Lecture 6 - Database Systems Lecture Notes
- Principles Of Information Systems Lecture 5 - Database Systems(1) PDF
Full Transcript
Chapter 1 Database Systems Learning Objectives After completing this chapter, you will be able to: 1-1 Define the difference between data and information 1-5...
Chapter 1 Database Systems Learning Objectives After completing this chapter, you will be able to: 1-1 Define the difference between data and information 1-5 Identify flaws in file system data management 1-2 Describe what a database is, the various types of databases, 1-6 Outline the main components of the database system and why they are valuable assets for decision making 1-7 Describe the main functions of a database management 1-3 Explain the importance of database design system (DBMS) 1-4 Outline how modern databases evolved from file systems PREVIEW Organizations use data to keep track of their day-to-day operations. Such data is used to generate information, which in turn is the basis for good decisions. Data is likely to be managed most efficiently when it is stored in a database. Databases are involved in almost all facets and activities of our daily lives: from school to work, medical care, government, nonprofit organizations, and houses of worship. In this chapter, you will learn what a database is, what it does, and why it yields better results than other data management methods. You will also learn about various types of databases and why database design is so important. Databases evolved from the need to manage large amounts of data in an organized and efficient manner. In the early days, computer file systems were used to organize such data. Although file system data management is now largely outmoded, understanding the characteristics of file systems is important because file systems are the source of serious data management limitations. In this chapter, you will also learn how the database system approach helps eliminate most of the shortcomings of file system data management. Copyright 2023 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 4 Part 1: Database Concepts Data Files and Available Formats MS Access Oracle MS SQL MySQL Ch01_Text Yes Yes Yes Yes Ch01_Problems Yes Yes Yes Yes Data Files available on cengage.com 1-1 Why Databases? So, why do we need databases? In today’s world, data is ubiquitous (abundant, global, every- where) and pervasive (unescapable, prevalent, persistent). From birth to death, we generate and consume data. The trail of data starts with the birth certificate and continues all the way to a death certificate (and beyond!). In between, each individual produces and consumes enormous amounts of data. As you will see in this book, databases are the best way to store and manage data. Databases make data persistent and shareable in a secure way. As you look at Figure 1.1, can you identify some of the data generated by your own daily activities? Figure 1.1 The Pervasive Nature of Databases A Day in Susan’s Life See how many databases she interacts with each day In the morning, Susan Later, she picks up In the afternoon, she At night, she plans for a trip and Before going to bed, goes grocery shopping her prescription at orders some items buys airline tickets and hotel Susan checks her social the pharmacy online reservations online media accounts Where are the product data Where is the pharmacy Where are the product Where does the online What particular customer stored? inventory data stored? and stock data stored? travel website get the data is kept by each airline and hotel data from? website? Is the product quantity in What data about each Where does the system stock updated at checkout? product will be in the get the data to What customer data would Where is the data about the Does she pay with a credit inventory data? generate product be kept by the website? friends and groups stored? card? “recommendations” to What data is kept about the customer? Where would the customer Where are the “likes” stored each customer and where data be stored? and what would they be is it stored? used for? Products Products Products Flights Users Sales Sales Sales Hotels Friends Customers Customers Customers Customers Posts Copyright 2023 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. Chapter 1: Database Systems 5 Data is not only ubiquitous and pervasive; it is also essential for organizations to survive and prosper. Imagine trying to operate a business without knowing who your customers are, what products you are selling, who is working for you, who owes you money, and to whom you owe money. All businesses have to keep this type of data and much more. Just as important, they must have that data available to decision makers when necessary. It can be argued that the ultimate purpose of all business information systems is to help businesses use information as an organizational resource. At the heart of all of these systems are the collection, storage, aggregation, manipulation, dissemination, and management of data. Depending on the type of information system and the characteristics of the business, this data could vary from a few megabytes on just one or two topics to petabytes covering hun- dreds of topics within the business’s internal and external environment. Telecommunications companies, such as Sprint and AT&T, are known to have systems that keep data on trillions of phone calls, with new data being added to the system at speeds up to 70,000 calls per second! Not only do these companies have to store and manage immense collections of data, but they must be able to find any given fact in that data quickly. Consider the case of Internet search staple Google. While Google is reluctant to disclose many details about its data storage speci- fications, it is estimated that the company responds to over 91 million searches per day across a collection of data that is several terabytes in size. Impressively, the results of these searches are available almost instantly. How can these businesses process this much data? How can they store it all, and then quickly retrieve just the facts that decision makers want to know, just when they want to know it? The answer is that they use databases. Databases, as explained in detail throughout this book, are specialized structures that allow computer-based systems to store, manage, and retrieve data very quickly. Virtually all modern business systems rely on databases. Therefore, a good understanding of how these structures are created and their proper use is vital for any information systems professional. Even if your career does not take you down the amazing path of database design and development, databases will be a key component of the systems that you use. In any case, you will probably make decisions in your career based on informa- tion generated from data. Thus, it is important that you know the difference between data and information. 1-2 Data versus Information To understand what drives database design, you must understand the difference between data and information. Data consists of raw facts. The word raw indicates that the facts have not data yet been processed to reveal their meaning. For example, suppose that a university tracks Raw facts, or facts that have not yet been processed to data on faculty members for reporting to accrediting bodies. To get the data for each faculty reveal their meaning to the member into the database, you would provide a screen to allow for convenient data entry, end user. complete with drop-down lists, combo boxes, option buttons, and other data-entry validation controls. Figure 1.2(a) shows a simple data-entry form from a software package named Sedona. When the data is entered into the form and saved, it is placed in the underlying database as raw data, as shown in Figure 1.2(b). Although you now have the facts in hand, they are not particularly useful in this format. Reading through hundreds of rows of data for faculty members does not provide much insight into the overall makeup of the faculty. There- fore, you transform the raw data into a data summary like the one shown in Figure 1.2(c). Now you can get quick answers to questions such as “What percentage of the faculty in the Information Systems (INFS) department are adjuncts?” In this case, you can quickly deter- mine that 20 percent of the INFS faculty members are adjunct faculty. Because graphics can enhance your ability to quickly extract meaning from data, you show the data summary pie chart in Figure 1.2(d). Copyright 2023 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 6 Part 1: Database Concepts Figure 1.2 Transforming Raw Data into Information a) Data entry screen b) Raw data c) Information in summary format d) Information in graphical format information Information is the result of processing raw data to reveal its meaning. Data processing can The result of processing raw be as simple as organizing data to reveal patterns or as complex as making forecasts or drawing data to reveal its meaning. inferences using statistical modeling. To reveal meaning, information requires context. For Information consists of transformed data and example, an average temperature reading of 105 degrees does not mean much unless you also facilitates decision making. know its context: Is this reading in degrees Fahrenheit or Celsius? Is this a machine tempera- ture, a body temperature, or an outside air temperature? Information can be used as the foun- knowledge dation for decision making. For example, the data summary for the faculty can provide The body of information and facts about a specific accrediting bodies with insights that are useful in determining whether to renew accreditation subject. Knowledge implies for the university. familiarity, awareness, Keep in mind that raw data must be properly formatted for storage, processing, and pre- and understanding of sentation. For example, dates might be stored in Julian calendar formats within the database, information as it applies but displayed in a variety of formats, such as day-month-year or month/day/year, for different to an environment. A key characteristic is that new purposes. Respondents’ yes/no responses might need to be converted to a Y/N or 0/1 format knowledge can be derived for data storage. More complex formatting is required when working with complex data types, from old knowledge. such as sounds, videos, or images. In this “information age,” production of accurate, relevant, and timely information is the key to good decision making. In turn, good decision making is the key to business survival in a global market. We are now said to be entering the “knowledge age.”1 Data is the foundation of information, which is the bedrock of knowledge—that is, the body of information and facts about a specific subject. Knowledge implies familiarity, aware- ness, and understanding of information as it applies to an environment. A key characteristic of knowledge is that “new” knowledge can be derived from “old” knowledge. 1 Peter Drucker coined the phrase “knowledge worker” in 1959 in his book Landmarks of Tomorrow. In 1994, Esther Dyson, George Keyworth, and Dr. Alvin Toffler introduced the concept of the “knowledge age.” Copyright 2023 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. Chapter 1: Database Systems 7 Let’s summarize some key points: Data constitutes the building blocks of information. Information is produced by processing data. Information is used to reveal the meaning of data. Accurate, relevant, and timely information is the key to good decision making. Good decision making is the key to organizational survival in a global environment. The previous paragraphs have explained the importance of data and how the processing of data is used to reveal information that in turn generates “actionable” knowledge. Let’s explore a simple example of how this works in the real world. In today’s information-centric society, you use smartphones on a daily basis. These devices have advanced GPS functionality that constantly tracks your whereabouts. This data is stored and shared with various applications. When you get a new smartphone, you can use the map application to go places and to set up your home address (now the phone knows where you live!). The GPS feature in your phone tracks your daily locations. In some cases, the information generated is very helpful: it can help you navigate to various locations and even to find where you parked your car. Figure 1.3 shows screenshots from the smartphone of one of the authors. The phone “knows” that this is about the time he goes home and tells him how long it is going to take to get there. It also tells him where he parked his car; if he clicks the Parked Car icon, it will open a map so he can locate the car. Figure 1.3 Smartphone Tracking Source: Morris Coronel Furthermore, in terms of privacy issues, your smartphone may know more about your activities than you imagine. For example, suppose that every Wednesday night you go to the gym and play indoor soccer with your friends. Next Wednesday night, 20 minutes before you leave home, your phone pops up a message saying “19 minutes to [gym address]. Traffic is light.” The phone has been storing GPS data on your movements to develop patterns based on days, times, and locations to generate this knowledge. It can then associate such knowledge as your daily activities provide more data points. Imagine that on Wednesday when you go to the Magic Box gym to play soccer, when you arrive you use Facebook on your phone to check in to the gym. Now, your phone also knows the name of the place where you go every Wednesday night. As you can see from this example, knowledge and information require timely and accurate data management data. Such data must be properly generated and stored in a format that is easy to access and A process that focuses on process. In addition, like any basic resource, the data environment must be managed carefully. data collection, storage, and retrieval. Common data Data management is a discipline that focuses on the proper generation, storage, and retrieval management functions of data. Given the crucial role that data plays, it should not surprise you that data management include addition, deletion, is a core activity for any business, government agency, service organization, or charity. modification, and listing. Copyright 2023 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 8 Part 1: Database Concepts database A shared, integrated 1-3 Introducing the Database computer structure that Efficient data management typically requires the use of a computer database. A database is a houses a collection of related data. A database contains shared, integrated computer structure that stores a collection of the following: two types of data: end- End-user data—that is, raw facts of interest to the end user user data (raw facts) and metadata. Metadata, or data about data, through which the end-user data is integrated and managed metadata Data about data; that is, data The metadata describes the data characteristics and the set of relationships that links the about data characteristics data found within the database. For example, the metadata component stores information such and relationships. See also as the name of each data element, the type of values (numeric, dates, or text) stored on each data dictionary. data element, and whether the data element can be left empty. The metadata provides informa- tion that complements and expands the value and use of the data. In short, metadata presents a more complete picture of the data in the database. Given the characteristics of metadata, you might hear a database described as a “collection of self-describing data.” database management A database management system (DBMS) is a collection of programs that manages the system (DBMS) database structure and controls access to the data stored in the database. In a sense, a database The collection of programs resembles a very well-organized electronic filing cabinet in which powerful software (the that manages the database DBMS) helps manage the cabinet’s contents. structure and controls access to the data stored in the database. 1-3a Role and Advantages of the DBMS The DBMS serves as the intermediary between the user and the database. The database struc- ture itself is stored as a collection of files, and the only way to access the data in those files is through the DBMS. Figure 1.4 emphasizes the point that the DBMS presents the end user (or application program) with a single, integrated view of the data in the database. The DBMS receives all application requests and translates them into the complex operations required to fulfill those requests. The DBMS hides much of the database’s internal complexity from the Figure 1.4 The DBMS Manages the Interaction between the End User and the Database End users Database structure Metadata Application Data request Customers DBMS Single http:// End-user Database View of data Invoices data Management Systems Integrated End users Application Products request Data Copyright 2023 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. Chapter 1: Database Systems 9 application programs and users. The application program might be written by a programmer using a programming language, such as Python, Java, or C#, or it might be created through a DBMS utility program. Having a DBMS between the end user’s applications and the database offers some import- ant advantages. First, the DBMS enables the data in the database to be shared among multiple applications or users. Second, the DBMS integrates the many different users’ views of the data into a single all-encompassing data repository. Because data is the crucial raw material from which information is derived, you must have a good method to manage such data. As you will discover in this book, the DBMS helps make data management more efficient and effective. In particular, a DBMS provides these advantages: Improved data sharing. The DBMS helps create an environment in which end users have better access to more and better-managed data. Such access makes it possible for end users to respond quickly to changes in their environment. Improved data security. The more users access the data, the greater the risks of data security breaches. Corporations invest considerable amounts of time, effort, and money to ensure that corporate data is used properly. A DBMS provides a framework for better enforcement of data privacy and security policies. Better data integration. Wider access to well-managed data promotes an integrated view of the organization’s operations and a clearer view of the big picture. It becomes much easier to see how actions in one segment of the company affect other segments. Minimized data inconsistency. Data inconsistency exists when different versions of the data inconsistency same data appear in different places. For example, data inconsistency exists when a A condition in which different versions of the same data company’s sales department stores a sales representative’s name as Bill Brown and the yield different (inconsistent) company’s personnel department stores that same person’s name as William G. Brown, or results. when the company’s regional sales office shows the price of a product as $45.95, and its national sales office shows the same product’s price as $43.95. The probability of data inconsistency is greatly reduced in a properly designed database. Improved data access. The DBMS makes it possible to produce quick answers to ad hoc queries. From a database perspective, a query is a specific request issued to the DBMS query for data manipulation—for example, to read or update the data. Simply put, a query is a A question or task asked by question, and an ad hoc query is a spur-of-the-moment question. The DBMS sends back an end user of a database in the form of SQL code. an answer (called the query result set) to the application. For example, when dealing A specific request for data with large amounts of sales data, end users might want quick answers to questions (ad hoc manipulation issued by the queries). Some examples are the following: end user or the application to the DBMS. What was the dollar volume of sales by product during the past six months? ĎĎ ad hoc query What is the sales bonus figure for each of our salespeople during the past three months? ĎĎ A “spur-of-the-moment” How many of our customers have credit balances of $3,000 or more? question. ĎĎ Improved decision making. Better-managed data and improved data access make it possi- query result set ble to generate better-quality information, on which better decisions are based. The quality The collection of data rows of the information generated depends on the quality of the underlying data. Data quality returned by a query. is a comprehensive approach to promoting the accuracy, validity, and timeliness of the data. data quality While the DBMS does not guarantee data quality, it provides a framework to facilitate data A comprehensive approach quality initiatives. Data quality concepts will be covered in more detail in Chapter 16, to ensuring the accuracy, Database Administration and Security. validity, and timeliness of data. Increased end-user productivity. The availability of data, combined with the tools that transform data into usable information, empowers end users to make quick, informed decisions that can make the difference between success and failure in the global economy. Copyright 2023 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 10 Part 1: Database Concepts The advantages of using a DBMS are not limited to the few just listed. In fact, you will discover many more advantages as you learn more about the technical details of databases and their proper design. 1-3b Types of Databases A DBMS can be used to build many types of databases. Each database stores a particular collec- tion of data and is used for a specific purpose. Over the years, as technology and innovative uses of databases have evolved, different methods have been used to classify databases. For example, databases can be classified by the number of users supported, where the data is located, the type of data stored, the intended data usage, and the degree to which the data is structured. The number of users determines whether the database is classified as single user or mul- single-user database tiuser. A single-user database supports only one user at a time. In other words, if user A is A database that supports only one user at a time. using the database, users B and C must wait until user A is done. A single-user database that runs on a personal computer is called a desktop database. In contrast, a multiuser database desktop database supports multiple users at the same time. When the multiuser database supports a relatively A single-user database that small number of users (usually fewer than 50) or a specific department within an organization, runs on a personal computer. it is called a workgroup database. When the database is used by the entire organization and multiuser database supports many users (more than 50, usually hundreds) across many departments, the database A database that supports is known as an enterprise database. multiple concurrent users. Location might also be used to classify the database. For example, a database that supports workgroup database data located at a single site is called a centralized database. A database that supports data A multiuser database that distributed across several different sites is called a distributed database. The extent to which usually supports fewer than a database can be distributed and the way in which such distribution is managed are addressed 50 users or is used for a in detail in Chapter 12, Distributed Database Management Systems. specific department in an organization. Both centralized and decentralized (distributed) databases require a well-defined infra- structure (hardware, operating systems, network technologies, etc.) to implement and operate enterprise database the database. Typically, the infrastructure is owned and maintained by the organization that The overall company data creates and operates the database. But in recent years, the use of cloud databases has been representation, which provides support for present growing in popularity. A cloud database is a database that is created and maintained using and expected future needs. cloud data services, such as Microsoft Azure or Amazon AWS. These services, provided by third-party vendors, provide defined performance measures (data storage capacity, required centralized database throughput, and availability) for the database, but do not necessarily specify the underlying A database located at a single site. infrastructure to implement it. The data owners do not have to know, or be concerned about, what hardware and software are being used to support their databases. The performance capa- distributed database bilities can be renegotiated with the cloud provider as the business demands on the database A logically related database change. For example, 3M Health Information Systems, the world’s largest provider of health that is stored in two or more physically independent sites. care analytics software in hospitals, used Amazon’s AWS cloud database services to consolidate its multiple IT centers. 3M did not have to buy, install, configure, or maintain any hardware, cloud database operating systems, or network devices. It simply purchased storage and processing capacity for A database that is created its data and applications. As the demands on the databases increased, additional processing and and maintained using cloud services, such as Microsoft storage capabilities could be purchased as needed. As a result, server provisioning processes Azure or Amazon AWS. that previously took 10 weeks to complete could be done in mere minutes. This allows the company to be more responsive to the needs of customers and innovate faster. general-purpose In some contexts, such as research environments, a popular way of classifying databases is database according to the type of data stored in them. Using this criterion, databases are grouped into two A database that contains a wide variety of data used in categories: general-purpose and discipline-specific databases. General-purpose databases multiple disciplines. contain a wide variety of data used in multiple disciplines—for example, a census database that contains general demographic data and the LexisNexis and ProQuest databases that discipline-specific contain newspaper, magazine, and journal articles for a variety of topics. Discipline-specific database A database that contains databases contain data focused on specific subject areas. The data in this type of database is data focused on a specific used mainly for academic or research purposes within a small set of disciplines. Examples of subject area. discipline-specific databases are financial data stored in databases such as CompuStat or CRSP Copyright 2023 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. Chapter 1: Database Systems 11 (Center for Research in Security Prices), geographic information system (GIS) databases that operational database store geospatial and other related data, and medical databases that store confidential medical A database designed history data. primarily to support a The most popular way of classifying databases today, however, is based on how they will be company’s day-to-day used and on the time sensitivity of the information gathered from them. For example, transac- operations. Also known as tions such as product or service sales, payments, and supply purchases reflect critical day-to-day a transactional database, OLTP database, or production operations. Such transactions must be recorded accurately and immediately. A database that is database. designed primarily to support a company’s day-to-day operations is classified as an operational database, also known as an online transaction processing (OLTP) database, transactional online transaction database, or production database. In contrast, an analytical database focuses primarily on processing (OLTP) storing historical data and business metrics used exclusively for tactical or strategic decision database See operational database. making. Such analysis typically requires extensive “data massaging” (data manipulation) to produce information on which to base pricing decisions, sales forecasts, market strategies, and transactional database so on. Analytical databases allow the end user to perform advanced analysis of business data See operational database. using sophisticated tools. production database Typically, analytical databases comprise two main components: a data warehouse and See operational database. an online analytical processing front end. The data warehouse is a specialized database analytical database that stores data in a format optimized for decision support. The data warehouse contains A database focused primarily historical data obtained from the operational databases as well as data from other external on storing historical data and sources. Online analytical processing (OLAP) is a set of tools that work together to provide business metrics used for an advanced data analysis environment for retrieving, processing, and modeling data from tactical or strategic decision the data warehouse. In recent times, this area of database application has grown in impor- making. tance and usage, to the point that it has evolved into its own discipline: business intelli- data warehouse gence. The term business intelligence describes a comprehensive approach to capture A specialized database and process business data with the purpose of generating information to support business that stores historical and decision making. Chapter 13, Business Intelligence and Data Warehouses, covers this topic aggregated data in a format optimized for decision in detail. support. Databases can also be classified to reflect the degree to which the data is structured. Unstructured data is data that exists in its original (raw) state—that is, in the format in which online analytical it was collected. Therefore, unstructured data exists in a format that does not lend itself to the processing (OLAP) processing that yields information. Structured data is the result of formatting unstructured A set of tools that provide advanced data analysis for data to facilitate storage, use, and generation of information. You apply structure (format) based retrieving, processing, and on the type of processing that you intend to perform on the data. Some data might not be ready modeling data from the data (unstructured) for some types of processing, but they might be ready (structured) for other warehouse. types of processing. For example, the data value 37890 might refer to a zip code, a sales value, business intelligence or a product code. If this value represents a zip code or a product code and is stored as text, A set of tools and processes you cannot perform mathematical computations with it. On the other hand, if this value rep- used to capture, collect, resents a sales transaction, it must be formatted as numeric. integrate, store, and analyze To further illustrate the concept of structure, imagine a stack of printed paper invoices. If data to support business you want to merely store these invoices as images for future retrieval and display, you can scan decision making. them and save them in a graphic format. On the other hand, if you want to derive information unstructured data such as monthly totals and average sales, such graphic storage would not be useful. Instead, you Data that exists in its could store the invoice data in a (structured) spreadsheet format so that you can perform the original, raw state; that is, in requisite computations. Actually, most data you encounter is best classified as semistructured. the format in which it was collected. Semistructured data has already been processed to some extent (usually using formatting tags or some markup language) but does not conform to the strict tabular format typical of the structured data relational model. For example, if you look at a typical webpage, the data is presented in a pre- Data that has been formatted arranged format to convey some information. The database types mentioned thus far focus on to facilitate storage, use, and information generation. the storage and management of highly structured data. However, corporations are not limited to the use of structured data. They also use semistructured and unstructured data. Just think semistructured data of the valuable information that can be found on company emails, memos, and documents such Data that has already been as procedures, rules, and webpages. Unstructured and semistructured data storage and processed to some extent. management needs are being addressed through a new generation of databases known as Copyright 2023 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 12 Part 1: Database Concepts Table 1.1 Types of Databases Product Number of Users Data Location Data Usage XML Single Multiuser User Workgroup Enterprise Centralized Distributed Operational Analytical MS Access X X X X MS SQL Server X* X X X X X X X IBM DB2 X* X X X X X X X MySQL X X X X X X X X Oracle RDBMS X* X X X X X X X *Vendor offers single-user/personal or Express DBMS versions XML databases. Extensible Markup Language (XML) is a special language used to represent Extensible Markup Language (XML) and manipulate data elements in a textual format. An XML database supports the storage and A metalanguage used to management of semistructured XML data. represent and manipulate Table 1.1 compares the features of several well-known database management systems. data elements. Unlike other With the emergence of the web and Internet-based technologies as the basis for the new markup languages, XML “social media” generation, great amounts of data are being stored and analyzed. Social media permits the manipulation of a document’s data elements. refers to web and mobile technologies that enable “anywhere, anytime, always on” human interactions. Websites such as Google, Facebook, Instagram, Twitter, and LinkedIn capture XML database vast amounts of data about end users and consumers. This data grows exponentially and requires A database system that stores the use of specialized database systems. For example, as of 2020, over 500 million tweets were and manages semistructured XML data. posted every day on Twitter, and that number continues to grow. As a result, the MySQL data- base Twitter was using to store user content was frequently overloaded by demand.2 Facebook social media faces similar challenges. With over 500 terabytes of data coming in each day, it stores over Web and mobile 100 petabytes of data in a single data storage file system. From this data, its database scans over technologies that enable “anywhere, anytime, always 200 terabytes of data each hour to process user actions, including status updates, picture on” human interactions. requests, and billions of “Like” actions.3 Over the past few years, this new breed of specialized database has grown in sophistication and widespread usage. Currently, this new type of database NoSQL is known as a NoSQL database. The term NoSQL (Not only SQL) is generally used to describe A new generation of a new generation of DBMS that is not based on the traditional relational database model. DBMS that is not based on the traditional relational NoSQL databases are designed to handle the unprecedented volume of data, variety of data database model. types and structures, and velocity of data operations that are characteristic of these new business requirements. You will learn more about this type of system in Chapter 2, Data Models. This section briefly mentioned the many types of databases. As you learned earlier, a data- base is a computer structure that houses and manages end-user data. One of the first tasks of a database professional is to ensure that end-user data is properly structured to derive valid and timely information. For this, good database design is essential. 1-4 Why Database Design Is Important A problem that has evolved with the use of personal productivity tools such as spreadsheets and desktop database programs is that users typically lack proper data-modeling and data- base design skills. People naturally have a “narrow” view of the data in their environment. 2 www.internetlivestats.com/twitter-statistics/ 3 Josh Constine, “How big is Facebook’s data? 2.5 billion pieces of content and 500+ terabytes of data ingested every day,” Tech Crunch, August 22, 2012, https://techcrunch.com/2012/08/22/ how-big-is-facebooks-data-2-5-billion-pieces-of-content-and-500-terabytes-ingested-every-day/ Copyright 2023 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. Chapter 1: Database Systems 13 For example, consider a student’s class schedule. The schedule probably contains the student’s identification number and name, class code, class description, class credit hours, class instructor name, class meeting days and times, and class room number. In the mind of the student, these various data items compose a single unit. If a student organization wanted to keep a record of the schedules of its members, an end user might make a spreadsheet to store the schedule information. Even if the student makes a foray into the realm of desktop databases, he or she is likely to create a structure composed of a single table that mimics his or her view of the schedule data. As you will learn in the coming chapters, translating this type of narrow view of data into a single two-dimensional table structure is a poor database design choice. Database design refers to the activities that focus on the design of the database structure database design that will be used to store and manage end-user data. A database that meets all user require- The process that yields the description of the database ments does not just happen; its structure must be designed carefully. In fact, database design structure and determines is such a crucial aspect of working with databases that most of this book is dedicated to the the database components. development of good database design techniques. Even a good DBMS will perform poorly with The second phase of the a badly designed database. database life cycle. Data is one of an organization’s most valuable assets. Data on customers, employees, orders, and receipts is all vital to the existence of a company. Tracking key growth and perfor- mance indicators are also vital to strategic and tactical plans to ensure future success; there- fore, an organization’s data must not be handled lightly or carelessly. Thorough planning to ensure that data is properly used and leveraged to give the company the most benefit is just as important as proper financial planning to ensure that the company gets the best use from its financial resources. Because current-generation DBMSs are easy to use, an unfortunate side effect is that many computer-savvy business users gain a false sense of confidence in their ability to build a functional database. These users can effectively navigate the creation of database objects, but without the proper understanding of database design, they tend to produce flawed, overly simplified structures that prevent the system from correctly storing data that corresponds to business realities, which produces incomplete or erroneous results when the data is retrieved. Consider the data shown in Figure 1.5, which illustrates the efforts of an organization to keep records about its employees and their skills. Some employees have not passed a certification test in any skill, while others have been certified in several skills. Some certified skills are shared by several employees, while other skills have no employees that hold those certifications. Figure 1.5 Employee Skills Certification in a Poor Design How to count how How to produce Is Basic Database Why are there many employees are What if an employee an alphabetical Manipulation the blanks in rows certified in Basic acquires a fourth listing of same as Basic DB 9 and 10? Database certification? employees? Manipulation? Manipulation? Do we add another column? Copyright 2023 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 14 Part 1: Database Concepts Based on this storage of the data, notice the following problems: It would be difficult, if not impossible, to produce an alphabetical listing of employees based on their last names. To determine how many employees are certified in Basic Database Manipulation, you would need a program that counts the number of those certifications recorded in Skill1 and places it in a variable. Then the count of those certifications in Skill2 could be calcu- lated and added to the variable. Finally, the count of those certifications in Skill3 could be calculated and added to the variable to produce the total. If you redundantly store the name of a skill with each employee who is certified in that skill, you run the risk of spelling the name differently for different employees. For example, the skill Basic Database Manipulation is also entered as Basic DB Manipulation for at least one employee in Figure 1.5, which makes it difficult to get an accurate count of employees who have the certification. The structure of the database will have to be changed by adding more columns to the table when an employee is certified in a fourth skill. It will have to be modified again if an employee is certified in a fifth skill. Contrast this poor design with that shown in Figure 1.6 where the design has been improved by decomposing the data into three related tables. These tables contain all of the same data that Figure 1.6 Employee Skills Certification in a Good Design Database name: Ch01_Text Table name: CERTIFIED Table name: EMPLOYEE Table name: SKILL Copyright 2023 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. Chapter 1: Database Systems 15 was represented in Figure 1.5, but the tables are structured so that you can easily manipulate the data to view it in different ways and answer simple questions. With the improved structure in Figure 1.6, you can use simple commands in a standard data manipulation language to do the following: Produce an alphabetical listing of employees by last name: SELECT * FROM EMPLOYEE ORDER BY EMPLOYEE_LNAME; Determine how many employees are certified in Basic Database Manipulation: SELECT COUNT(*) FROM SKILL JOIN CERTIFIED ON SKILL.SKILL_ID = CERTIFIED.SKILL_ID WHERE SKILL_NAME = 'Basic Database Manipulation'; You will learn more about these commands in Chapter 7, Introduction to Structured Query Language (SQL). Note that because each skill name is stored only once, the names cannot be spelled or abbreviated differently for different employees. Also, the additional certification of an employee with a fourth or fifth skill does not require changes to the structure of the tables. Proper database design requires the designer to precisely identify the database’s expected use. Designing a transactional database emphasizes accurate and consistent data and opera- tional speed. Designing a data warehouse database emphasizes the use of historical and aggre- gated data. Designing a database to be used in a centralized, single-user environment requires a different approach from that used in the design of a distributed, multiuser database. This book emphasizes the design of transactional, centralized, single-user, and multiuser databases. Chapters 12 and 13 also examine critical issues confronting the designer of distributed and data warehouse databases. Designing appropriate data repositories of integrated information using the two- dimensional table structures found in most databases is a process of decomposition. The inte- grated data must be decomposed properly into its constituent parts, with each part stored in its own table. Further, the relationships between these tables must be carefully considered and implemented so the integrated view of the data can be recreated later as information for the end user. A well-designed database facilitates data management and generates accurate and valuable information. A poorly designed database is likely to become a breeding ground for difficult-to-trace errors that may lead to poor decision making, and poor decision making can lead to the failure of an organization. Database design is simply too important to be left to luck. That’s why college students study database design, why organizations of all types and sizes send personnel to database design seminars, and why database design consultants often make an excellent living. 1-5 Evolution of File System Data Processing Understanding what a database is, what it does, and the proper way to use it can be clarified by considering what a database is not. A brief explanation of the evolution of file system data processing can be helpful in understanding the data access limitations that databases attempt to overcome. Understanding these limitations is relevant to database designers and developers because database technologies do not make these problems magically disappear— database technologies simply make it easier to create solutions that avoid these problems. Creating database designs that avoid the pitfalls of earlier systems requires that the designer understand these problems and how to avoid them; otherwise, the database technologies are no better (and are potentially even worse!) than the technologies and techniques they have replaced. Copyright 2023 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 16 Part 1: Database Concepts data processing (DP) 1-5a Manual File Systems specialist Historically, data was kept in paper-and-pencil manual systems organized to facilitate the The person responsible for expected use of the data. Typically, this was accomplished through a system of file folders developing and managing a computerized file processing and filing cabinets. However, as organizations grew and reporting requirements became more system. complex, keeping data in a manual file system became more difficult. Therefore, companies looked to computer technology for help. field A character or group of characters (alphabetic or numeric) that has a specific 1-5b Computerized File Systems meaning. A field is used to Generating reports from manual file systems was slow and cumbersome. In fact, some business define and store data. managers faced government-imposed reporting requirements that led to weeks of intensive record effort each quarter, even when a well-designed manual system was used. Therefore, a A logically connected set data processing (DP) specialist was hired to create a computer-based system that would track of one or more fields that data and produce required reports. describes a person, place, or thing. Initially, the computer files within the file system were similar to the manual files. A simple example of a customer data file for a small insurance company is shown in Figure 1.7. (You will file discover later that the file structure shown in Figure 1.7, although typically found in early file A collection of related systems, is unsatisfactory for a database.) records. For example, a file might contain data about the The description of computer files requires a specialized vocabulary. Every discipline students currently enrolled at develops its own terminology to enable its practitioners to communicate clearly. The basic file Gigantic University. vocabulary shown in Table 1.2 will help you to understand subsequent discussions more easily. Figure 1.7 Contents of the Customer File Database name: Ch01_Text C_NAME = Customer name A_NAME = Agent name C_PHONE = Customer phone A_PHONE = Agent phone C_ADDRESS = Customer address TP = Insurance type C_ZIP = Customer zip code AMT = Insurance policy amount, in thousands of $ REN = Insurance renewal date Table 1.2 Basic File Terminology Term Definition Data Raw facts, such as a telephone number, a birth date, a customer name, and a year-to-date (YTD) sales value. Data has little meaning unless it has been organized in some logical manner. Field A character or group of characters (alphabetic or numeric) that has a specific meaning. A field is used to define and store data. Record A logically connected set of one or more fields that describes a person, place, or thing. For example, the fields that constitute a record for a customer might consist of the customer’s name, address, phone number, date of birth, credit limit, and unpaid balance. File A collection of related records. For example, a file might contain data about the students currently enrolled at Gigantic University. Copyright 2023 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. Chapter 1: Database Systems 17 Using the proper file terminology in Table 1.2, you can identify the file components shown in Figure 1.7. The CUSTOMER file contains 10 records. Each record is composed of nine Online Content fields: C_NAME, C_PHONE, C_ADDRESS, C_ZIP, A_NAME, A_PHONE, TP, AMT, and The databases used in each chapter are available REN. The 10 records are stored in a named file. Because the file in Figure 1.7 contains cus- at www.cengage.com. tomer data for the insurance company, its filename is CUSTOMER. Throughout the book, When business users wanted data from the computerized file, the DP specialist had to Online Content boxes highlight material related create programs to retrieve the data from the file(s), manipulate it in whatever manner the to chapter content on user had requested, and present it as a printed report. For example, the Sales department at the website. the insurance company created a file named SALES, which helped track daily sales efforts. Then, the DP specialist created the AGENT file shown in Figure 1.8 to automate Personnel department payroll processing and other personnel functions. The data in the AGENT file allowed the Personnel department to write checks, keep track of taxes paid, and summarize insurance coverage, among other tasks. Figure 1.8 Contents of the Agent File A_NAME = Agent name YTD_PAY = Year-to-date pay A_PHONE = Agent phone YTD_FIT = Year-to-date federal income tax paid A_ADDRESS = Agent address YTD_FICA = Year-to-date Social Security taxes paid ZIP = Agent zip code YTD_SLS = Year-to-date sales HIRED = Agent date of hire DEP = Number of dependents As more and more computerized files were developed, the problems with this type of file system became apparent. The issues centered around having many data files that contained related—often overlapping—data with no means of controlling or managing the data consis- tently across all of the files. Each file in the system used its own application program to store, retrieve, and modify data. The advent of computer files to store company data was significant; it established a land- mark in the use of computer technologies and represented a considerable step forward in a business’s ability to process data. Previously, users had direct, hands-on access to all of the business data. But they didn’t have the tools to convert that data into the information they needed. The creation of computerized file systems gave them improved tools for manipu- lating the company data to create new information. However, it had the additional effect of introducing a schism between the end-users and their data. The desire to close the gap between the end-users and the data influenced the development of many types of computer technologies, system designs, and uses (and misuses) of many technologies and techniques. However, such developments created a split between the ways DP specialists and end-users viewed the data. From the DP specialist’s perspective, the computer files within the file system were cre- ated to be similar to the manual files. Data management programs were created to add to, update, and delete data from the file. From the end user’s perspective, the systems separated the users from the data. As the users’ competitive environment pushed them to make more and more decisions in less time, users became frustrated by the delay between conceiving of a new way to create information from the data and the point when the DP specialist actually created the programs to generate that information. Copyright 2023 Cengage Learning. All Rights Reserved. May not be copied, scanned, or duplicated, in whole or in part. Due to electronic rights, some third party content may be suppressed from the eBook and/or eChapter(s). Editorial review has deemed that any suppressed content does not materially affect the overall learning experience. Cengage Learning reserves the right to remove additional content at any time if subsequent rights restrictions require it. 18 Part 1: Database Concepts 1-5c File System Redux: Modern End-User Productivity Tools The users’ desire for direct, hands-on access to data helped to fuel the adoption of personal computers for business use. Although not directly related to file system evolution, the ubiqui- tous use of personal productivity tools can introduce the same problems as the old file systems. Business users widely use personal computer spreadsheet programs such as Microsoft Excel to enter data in a series of rows and columns to manipulate the data using a wide range of functions. The popularity of spreadsheet applications has enabled users to conduct sophis- ticated data analysis that has greatly enhanced their ability to understand the data and make better decisions. Unfortunately, as in the old adage “When the only tool you have is a hammer, every problem looks like a nail,” users have become so adept at working with spreadsheets that they tend to use them to complete tasks for which spreadsheets are not appropriate. A common misuse of spreadsheets is as a substitute for a database. Interestingly, end users often take the limited data to which they have direct access and place it in a spreadsheet format similar to that of the traditional, manual data storage systems—which is precisely what the early DP specialists did when creating computerized data files. Due to the large number of users with spreadsheets, each making separate copies of the data, the resulting “file system” of spreadsheets suffers from the same problems as the file systems created by the early DP specialists, which are outlined in the next section. 1-6 Problems with File System Data Processing The file system method of organizing and managing data was a definite improvement over the manual system, and the file system served a useful purpose in data management for over two decades—a very long time in the computer era. Nonetheless, many problems and limitations became evident in this approach. A critique of the file system method serves two major purposes: Understanding the shortcomings of the file system enables you to understand the devel- opment of modern databases. Failure to understand such problems is likely to lead to their duplication in a database environment, even though database technology makes it easy to avoid them. The following problems associated with file systems severely challenge the types of infor- mation that can be created from the data as well as the accuracy of the information: Lengthy development times. The first and most glaring problem with the file system approach is that even the simplest data-retrieval task requires extensive programming. With the older file systems, programmers had to specify what must be done and how to do it. As you will learn in upcoming chapters, modern databases use a nonprocedural data manipulation language that allows the user to specify what must be done without specifying how. Difficulty of getting quick answers. The need to write programs to produce even the simplest reports makes ad hoc queries impossible. Harried DP specialists who worked with mature file systems often received numerous requests for new reports. They were often forced to say that the report will be ready “next week” or even “next month.” If you need the information now, getting it next week or next month will not serve your information needs. Complex system administration. System administration becomes more difficult as the number of files in the system expands. Even a simple file system with a few files requires creating and maintaining several