What is Big Data and How it Started
What is Big Data, and how did it start? Welcome. In this video, I will help you understand the following things. So let's start with the first item on the list.

We are software engineers, and if not yet, at least you are aspiring to become one. As software engineers, we design and develop software. Right? These software systems do a variety of things. Can we categorize software systems, or software engineering itself? We can try. Categorizing software systems accurately and crafting a complete list of all possible categories is almost impossible. However, we can still start with the following categories.

In the beginning, engineers and scientists created system software such as operating systems and device drivers. Later, many companies started creating new programming languages and software development tools, such as C++, Java, and Visual Studio. Then we saw an evolution of desktop applications such as Microsoft Word, PowerPoint, graphics design software, and many more. While all of this was happening in the software industry, some companies started to focus on creating data processing technologies and applications. In fact, we saw the evolution of the COBOL programming language, designed specifically for business data processing. COBOL, short for Common Business-Oriented Language, was the first of its kind. COBOL allowed us to store data in files, create index files, and process data efficiently. However, data processing later shifted from COBOL to relational databases such as Oracle and Microsoft SQL Server. Side by side, we also saw the evolution of the internet, and hence engineers started working on websites and mobile applications. All of this happened alongside the evolution of programming languages, tools, technologies, and systems that help software engineers do their work more quickly and simplify the complexities of software development. This trend of simplifying software engineering started to emerge as platform development. We saw Hadoop emerge as a data lake platform, and the cloud emerged as a platform offering hundreds of services. Machine learning and AI are also emerging as a new software engineering category. These eight categories are not enough to capture the full expansion of software engineering and its applications. However, the list can give you an overview of the evolution of information technology.

I want to draw your attention to data processing applications. You can think of COBOL as the first serious attempt at enabling data processing, and COBOL was designed in 1959. The Oracle database achieved the next major success in enabling data processing, and Oracle was founded in 1977. So data processing has always been at the center of the software industry. And the reason is straightforward: everything else will come and go, but data will only grow. Data processing is an evergreen field for software engineers. Technologies may evolve, but the requirement to process more and more data at a faster speed will only become more critical.

But how do we create data processing applications? We have used RDBMS technology for many decades. Some popular RDBMS systems are Oracle, SQL Server, PostgreSQL, MySQL, Teradata, and Exadata. They all come with different capabilities, but the core technology is the same. These RDBMS systems offered us three main features to help us develop data processing applications:
1. SQL, an easy data query language
2. Scripting languages such as PL/SQL and Transact-SQL
3. Interfaces for other programming languages, such as JDBC and ODBC

So we used SQL for querying data and PL/SQL for doing things that we could not do using SQL alone. These systems also offered interfaces such as ODBC and JDBC so we could interact with the data from our programming languages. Things were perfect. We could create data processing applications using these technologies. But then the data itself started to evolve. What does that mean? Let me explain.

In the beginning, we saw data in rows and columns. COBOL stored and accessed this data in plain text files such as CSV files. RDBMS products such as Oracle stored the same data in more advanced file formats, such as DBF files. We do not care about the file format used by the RDBMS; we still see the data as a table of rows and columns. But with time, we developed new data formats such as JSON and XML. RDBMS was not ready to store and process these JSON and XML data files; it needed us to define a row-column structure first and then store the data. Many RDBMS products added this capability over time and started to allow storing and processing JSON and XML data files. But by then, we had developed even more formats such as txt, pdf, doc, jpg, png, gif, mp3, mp4, and many more. RDBMS technology was not designed to handle such an expansion of file formats.

So basically, we developed three categories of data: structured, semi-structured, and unstructured. Structured data comes in a standardized format such as a row-column table. Semi-structured data does not obey the tabular structure; however, it has another well-defined structure, such as the key/value structure of JSON or XML. Pick up anything from a JSON or XML document, and you can represent it as a key/value pair. Such data is known as semi-structured data. Why semi-structured? Because it does not have a tabular structure, but it does have some key/value structure. The last category is unstructured data, where we do not see any definite pattern or structure. Unstructured data comes as text files, PDFs and other documents, images, and video files. The small sketch after this section shows the same record as CSV, JSON, and XML to make the difference concrete.
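To make the three categories concrete, here is a small illustrative sketch. The customer record, its field names, and its values are invented purely for this illustration and are not taken from the lecture. The same information appears first as structured CSV, then as semi-structured JSON and XML:

id,name,city
101,Asha,Pune
102,Ravi,Mumbai

{ "id": 101, "name": "Asha", "city": "Pune", "phones": ["98200-11111", "98200-22222"] }

<customer>
  <id>101</id>
  <name>Asha</name>
  <city>Pune</city>
  <phones>
    <phone>98200-11111</phone>
    <phone>98200-22222</phone>
  </phones>
</customer>

Every element in the JSON and XML versions reduces to a key/value pair (city and Pune, for example), which is why they are called semi-structured, while something like a scanned invoice image or an mp4 recording carries no such structure at all and is therefore unstructured.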
With the growth of the internet, social media, mobile apps, and so on, businesses are now collecting more semi-structured and unstructured data. Structured data is now only a small portion of the overall data available within an organization. However, RDBMS failed to bring support for unstructured data. The point is straight: a modern data processing application must handle the following problems. It must be able to handle all three varieties of data; I mean, we need methods to store and process structured, semi-structured, and unstructured data. The overall data volume is also relatively high these days; companies collect terabytes and petabytes of data in a short period, so we need capabilities to handle large volumes of data. And modern data is generated at a very high speed. It takes very little time for an organization to collect petabytes of volume, and they do not have weeks or months to process it; we want to process it faster, in hours and minutes. So the velocity at which data is collected is very high, and the required processing velocity is also high. These three problems combined are known as the Big Data problem.

We had RDBMS systems developed and matured over many decades. However, RDBMS was not designed to handle Big Data problems. So the industry needed a Big Data management platform that could do the following:

1. Store high volumes of data arriving at a high velocity
2. Accommodate structured, semi-structured, and unstructured data variety
3. Process high volumes of a variety of data at a high velocity

Make sense? Great! Let me summarize the critical takeaways of the discussion so far.

- Data processing is one of the critical business requirements.
- We used RDBMS for decades to develop data processing applications.
- However, the advent of the internet brought new data processing challenges: we started collecting a variety of data. Structured data was already there, but we now also began collecting semi-structured and unstructured data. Businesses started collecting high data volumes, in petabytes, and they needed to collect and process that data at a high velocity.
- This new data challenge is popularly known as the Big Data problem.
- The Big Data problem is defined using the 3Vs of Big Data: Variety, Volume, and Velocity.
- RDBMS failed to handle the Big Data problem, and the industry needed a new approach or platform to handle it.

Scientists and engineers came up with many approaches to solve the Big Data problem. However, we saw two main categories: the monolithic approach and the distributed approach. The monolithic approach designs one large and robust system that handles all the requirements. Teradata and Exadata are examples. These two systems mainly support only structured data, so we cannot call them Big Data systems, but they are designed using a monolithic approach. In the distributed approach, we take many smaller systems and bring them together to solve a bigger problem. A monolithic system uses one massive machine with a vast capacity of CPU, RAM, and disk storage. A distributed system, in contrast, uses a cluster of computers connected to work as a single system. The combined capacity of the cluster may be equal to or even higher than that of a single monolithic system.

These two approaches are compared on the following criteria: scalability, fault tolerance and high availability, and cost-effectiveness. Let's try to understand these concepts.

So what is scalability? Scalability is the ability of a system to increase or decrease performance in response to demand. What does that mean? Look at the two approaches. Let's say we have ten TB of data stored in both systems, and we have 100 concurrent users reading and writing data. Assume both systems are working perfectly fine. But over time, the data volume and the number of concurrent users increase. Both systems have limited capacity; I mean, the CPU, memory, and disk size are fixed. As the data volume and concurrent users increase, these systems will reach their maximum capacity. What happens when they reach their maximum capacity? Can we increase it? If yes, the system is scalable. If not, it is not. Both approaches are scalable, but scaling a monolithic system is complex. You may have to call the vendor and request them to increase the capacity. They will take some time, bring some more chips, open the system, and increase the tower's height. This approach is known as vertical scalability: you increase the height of a single system by adding more resources to that one machine. On the other side, the distributed system is also scalable.
But scaling a distributed system is as simple as adding a few more computers to the network. This approach is known as horizontal scalability: you add new members to the cluster and increase the length of the network. Horizontal scalability takes less time because you can easily buy some new machines. Vertical scalability requires coordination with the system vendor and help from them, so it may take much longer. Make sense? Great!

Fault tolerance and high availability are the next evaluation criteria, so let's try to understand them. Look at both systems. What happens if a CPU burns out, a network card fails, or the system's motherboard fails? The monolithic system will stop working. Right? So the monolithic system may not tolerate a hardware failure. If a hardware component of a monolithic system fails, the system may stop working, and your application will not remain available to your users. What happens if one computer fails in the cluster? The other computers keep working. Right? So a distributed system can tolerate many failures. A single machine failure in a cluster only reduces the capacity; the overall system remains working. Make sense?

The last criterion is cost-effectiveness. The distributed architecture uses a cluster of computers. You can start with a small cluster and keep your initial investment as low as needed, and you can add more machines at a later stage as your requirements grow. You can even use average-quality machines available at a reasonable price, or use a cloud environment to get machines at rental prices and build your cluster there. These options make the distributed approach more cost-effective and economical. Monolithic systems, however, are expensive. Scaling them takes a lot of time, so you may have to start with a large machine even if you need a smaller one. Make sense? Great!

So we discussed the Big Data problem and learned that the industry needed a new approach or platform to handle it. We evaluated the monolithic and distributed approaches and concluded the following:

- Distributed systems are horizontally scalable, and monolithic systems are vertically scalable. Horizontal scalability is better and more desirable.
- Distributed systems can tolerate hardware failures and may offer greater availability. In contrast, monolithic systems may not withstand hardware failures and offer lower availability.
- Distributed systems can start with a small cluster and add more computers as the business grows, so the distributed approach is more economical. In comparison, monolithic systems are complex and time-consuming to scale, so you must estimate your resource needs for the medium term and provision enough capacity upfront to sustain the growth. That makes monolithic systems less economical than distributed systems.

Make sense? Great! Now let's come back to the summary. We learned that the industry needed a new approach or platform to handle the Big Data problem. Engineers evaluated the monolithic and distributed approaches to design a new data processing system and handle Big Data requirements, and the evaluation indicated that a distributed system could be the better choice. And that's where a new system called Hadoop came into existence. Hadoop was a revolutionary Big Data processing platform that grabbed immense attention and massive adoption. So what is Hadoop? Hadoop came up as a new data processing platform to solve Big Data problems.
The Hadoop platform was designed and developed in layers. The core platform layer offered three capabilities:

1. Distributed cluster formation, or the cluster operating system
2. Data storage and retrieval on the distributed cluster, or distributed storage
3. Distributed data processing using the Java programming language, or the Map-Reduce framework

As discussed earlier, the distributed system was the preferred approach, so Hadoop was developed as the operating system of a distributed cluster. And that was the most critical capability. Why? Let's try to understand. We can create data processing software, install it on one computer, and run it there. The computer comes with an operating system. Right? The operating system allows you to use the computer's resources, such as CPU, memory, and disk, to run your applications. If you do not have an operating system on the computer, you cannot run any software on it. Make sense? Similarly, when you create a cluster of computers, you need a cluster operating system. The cluster operating system allows you to use the cluster's resources, such as CPU, memory, and disk. But a cluster operating system has more complex work to do. Why? Because the CPU, memory, and disk in a cluster setup are spread across many computers. We are not using a single computer where everything is available on one machine. A cluster is a group of computers, and the cluster operating system makes it work like a single large computer. That was the first capability of the Hadoop platform: it allowed us to use a cluster of computers as a single large computer.

Hadoop also allowed us to store data and retrieve it back. The data was internally distributed across the hard disks of the cluster. However, Hadoop offered us a distributed storage system and allowed us to save and read a data file just as we would on a single machine.

The next critical feature of Hadoop was to let us write a data processing application in Java and run it on the cluster. So Hadoop offered us a new distributed data processing framework, known as the Map-Reduce framework, for processing large volumes of data. The point is straight: the Hadoop platform was designed to use a cluster of computers as a single large machine. You can read, write, and program data processing applications and run them on the Hadoop cluster without realizing that you are working on a cluster. Hadoop simplified distributed computing for developers. It saved the developer community from learning the complexities of distributed computing and parallel processing. We develop a data processing application on Hadoop the same way we develop it for a monolithic system, but the program runs on a distributed computing platform and uses parallel processing. And that's why Hadoop grabbed immense attention and popularity.
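To give you a feel for what a Map-Reduce program looks like, here is a minimal word-count sketch written against the Hadoop MapReduce Java API. This example is not from the lecture; it is the classic illustration, and the class names and the input/output paths (passed as command-line arguments) are placeholders. The mapper emits a (word, 1) pair for every word it sees, and the reducer adds up the counts for each word; Hadoop distributes the input splits, shuffles the pairs, and runs the tasks in parallel across the cluster.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // The mapper receives one line of text at a time and emits (word, 1)
  // for every word it finds in that line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // The reducer receives every count emitted for a given word,
  // gathered from all mappers, and writes the total.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // The driver only describes the job; the Hadoop cluster decides where
  // the map and reduce tasks actually run.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input folder on the cluster
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output folder on the cluster
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Notice that the code never mentions which machine runs which task. That is exactly the simplification described above: you write the program as if for a single machine, and the platform handles the distribution and parallelism.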
As I said earlier, the Hadoop platform was designed and developed in layers. The core layers offered a cluster operating system, distributed storage, and the Map-Reduce framework. However, the Hadoop community developed many other tools on top of the Hadoop core platform. Some of the most popular are the Hive database, the HBase database, the Pig scripting language, the Sqoop data ingestion tool, and the Oozie workflow tool. In the next lesson, I will cover more details about Hadoop and these tools. However, I want to conclude this lecture with the high-level features of the Hadoop platform and how they compare with the RDBMS. So let's go back to RDBMS and recall its critical capabilities. RDBMS offered the following key capabilities:

1. Data storage
2. SQL query language over the stored data
3. Scripting languages to process data, such as PL/SQL and Transact-SQL
4. Interfaces for other programming languages

Hadoop supported most of these RDBMS capabilities and also met the Big Data requirements. Hadoop offers petabyte-scale data storage; in contrast, RDBMS was good for storing terabytes of data. Hadoop also allowed us to store structured, semi-structured, and unstructured data on the platform, whereas RDBMS supported only structured and, later, semi-structured data. The Hadoop platform also offered SQL queries using an additional component called Hive, so we could configure Hive on the Hadoop platform and run SQL queries on the data just as we would on any other RDBMS system. Hadoop also offered scripting languages such as Apache Pig, allowed us to code Java data processing applications, and offered JDBC/ODBC connectivity. The point is straight: RDBMS still dominates for small to medium-sized structured data processing requirements. However, for Big Data processing, Hadoop started taking the lead. Make sense? Great!

That's all for this video. The following lecture will go deeper into the Hadoop platform and help you understand the details of distributed computing and parallel processing using Hadoop. See you again. Keep learning and keep growing.