IT301 1st Term Final Coverage-1-3.pdf
Uploaded by DeadCheapOwl4393
IT 301 – ADVANCED DATABASE SYSTEMS

Chapter 2: Introduction to Big Data and NoSQL

Big Data

Big Data refers to extremely large and complex datasets that traditional data processing software cannot manage effectively.

These datasets can be structured, semi-structured, or unstructured and can come from various sources like social media, sensors, transactions, and more.

Organizations use Big Data analytics to uncover hidden patterns, correlations, and insights, driving better decision-making.

Big Data technologies, such as Apache Hadoop, Apache Spark, and various data warehousing solutions, are often used to store and analyze vast amounts of information.

Organizations can leverage Big Data analytics to improve customer experiences, optimize operations, and identify new business opportunities.

Techniques such as machine learning and data mining are commonly employed to extract valuable insights from Big Data.

Example: Social media platforms like Facebook and Twitter generate massive amounts of data daily, including posts, comments, likes, and shares. Analyzing this data can provide insights into user behavior, trends, and sentiment.

6V's of Big Data

1. Volume:

This refers to the immense amount of data generated every second from multiple sources like business transactions, social media, sensors, mobile devices, and more.

The scale of the data, which can range from terabytes to petabytes and beyond, makes it a defining characteristic of Big Data.

For instance, companies like Google and Facebook manage data on an exabyte scale.

2. Velocity:

This aspect of Big Data deals with the speed at which data flows from various sources.

This could include real-time data streaming and the rapid generation of data from sources like IoT devices, social media platforms, and online transactions.

High velocity necessitates timely processing and often real-time analysis.

For example, stock exchanges process millions of transactions per day, each needing to be captured and analyzed promptly.

3. Variety:

Big Data comes in various formats: structured, semi-structured, and unstructured.

Structured data is organized and easily searchable (like in databases), semi-structured data is not organized in a predefined manner but still has some organizational properties (like JSON files), and unstructured data has no specific format or structure (like video, audio, and text).

The diversity of data types, from simple text files to complex video feeds, adds to the complexity of processing Big Data.

4. Veracity:

This refers to the quality and accuracy of data.

Given the vast sources of Big Data, ensuring that the data is accurate, credible, and reliable is a challenge.

Data veracity is crucial for making sound decisions, as inaccurate data can lead to faulty conclusions and bad business decisions.

5. Variability:

This property addresses the inconsistencies and variations in data flow rates.

Data can be highly inconsistent, with periodic peaks, making it challenging to manage and analyze.

For example, social media sentiment can change rapidly, affecting the flow and nature of data.

6. Value:

One of the most important aspects of Big Data is the value that can be derived from it.

It's not just about collecting large volumes of data but about extracting meaningful insights for better decision-making and strategic business moves.

The real worth of Big Data lies in the ability to convert it into valuable insights.

DATABASES FOR BIG DATA

Databases for Big Data are specialized systems designed to store, manage, and process large volumes of complex data efficiently.

They are crucial in handling the characteristics of Big Data, namely its volume, velocity, variety, veracity, and value.

The landscape of databases for Big Data can be broadly divided into two categories: SQL (Structured Query Language) databases, often referred to as traditional or relational databases, and NoSQL (Not Only SQL) databases, which were developed as a response to the limitations of SQL databases in handling Big Data.

HADOOP

Hadoop is an open-source framework that allows for distributed storage and processing of large datasets using a network of computers.

It consists of several key components:
- Hadoop Distributed File System (HDFS): A distributed file system that stores data across multiple machines, providing high-throughput access to application data.
- MapReduce: A programming model for processing large datasets in parallel across a Hadoop cluster.
- YARN (Yet Another Resource Negotiator): A resource management layer that allows multiple applications to share resources in the cluster.
- Hadoop Common: The libraries and utilities needed by other Hadoop modules.

Hadoop is designed to run on commodity hardware, making it a cost-effective solution for processing large amounts of data.

Its scalability allows organizations to start small and grow as their data needs expand.

Additionally, Hadoop's ecosystem includes tools like Apache Hive for data warehousing, Apache Pig for data processing, and Apache HBase for real-time data access, which enhance its capabilities further.

Example: A retail company may use Hadoop to analyze customer purchase patterns by processing transaction logs, inventory data, and customer feedback. By utilizing Hadoop's distributed processing capabilities, the company can handle large datasets efficiently.

NOSQL

NoSQL databases are particularly well-suited for handling semi-structured and unstructured data, such as logs, JSON objects, and multimedia files.

They support horizontal scaling, meaning organizations can add more servers to accommodate increased data loads without significantly affecting performance.

This makes NoSQL databases a popular choice for applications that require fast access to large volumes of data, like real-time analytics and content management systems.

JSON stands for JavaScript Object Notation.
JSON is a lightweight data-interchange format.
JSON is plain text written in JavaScript object notation.
JSON is used to send data between computers.
JSON is language independent.

Valid JSON (keys must be strings in double quotes):
{"name":"John"}

JavaScript object literal (unquoted key, so not valid JSON):
{name:"John"}

NoSQL Real World Use Case and Examples

Netflix uses NoSQL databases to store and manage massive amounts of data, including customer profiles, viewing histories, and content recommendations. NoSQL databases allow Netflix to handle large volumes of data and provide fast, reliable access to data across a distributed network.

Uber uses NoSQL databases to handle the massive amounts of data generated by its ride-sharing platform, including driver and rider profiles, trip histories, and real-time location data. NoSQL databases provide the scalability and flexibility needed to handle high traffic volumes and changing data models.

Airbnb uses NoSQL databases to store and manage data for its booking platform, including property listings, guest profiles, and booking histories. NoSQL databases allow Airbnb to handle large volumes of unstructured data and provide fast, reliable access to data across a distributed network.

NewSQL

NewSQL databases aim to provide the scalability of NoSQL systems while maintaining the ACID (Atomicity, Consistency, Isolation, Durability) properties of traditional SQL databases.

They are designed to handle high transaction rates and support SQL-like querying. Examples include:
- Google Spanner: A globally distributed database service with strong consistency.
- CockroachDB: A distributed SQL database that provides high availability and strong consistency.

NewSQL databases are suitable for applications that require high performance and reliability while using familiar SQL semantics.

NewSQL databases combine the benefits of traditional SQL databases with the scalability of NoSQL systems.

They use distributed architectures to support high transaction rates while offering SQL support for complex queries and relational data modeling.

This makes NewSQL databases ideal for applications like online transaction processing (OLTP) and e-commerce platforms, where both speed and reliability are crucial.

Example: CockroachDB is used by companies like DoorDash for its ability to handle high transaction volumes with strong consistency guarantees. It allows DoorDash to manage orders and deliveries effectively while ensuring data integrity.

Working with Graph Databases using Neo4j

Graph databases, like Neo4j, are designed to represent and store data in graph structures, consisting of nodes, edges, and properties.

They excel in managing relationships and are ideal for applications where connections between data points are crucial, such as social networks, recommendation engines, and fraud detection.

Key features of Neo4j include:
- Flexible Schema: Allows for easy changes to the data model without extensive migrations.
- Cypher Query Language: A powerful and expressive query language designed specifically for working with graph data.
- ACID Compliance: Ensures reliable transactions and data integrity.

Neo4j's graph model is particularly effective for applications where relationships play a significant role.

For instance, recommendation systems can utilize graph algorithms to suggest products or content based on user preferences and behaviors.

Neo4j supports various graph algorithms, such as PageRank and community detection, to analyze connections within the data.

The use of Cypher for querying allows developers to express complex relationships intuitively, making it easier to work with connected data.
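
The graph model described in the Neo4j section (nodes, edges, and properties) and its use for recommendations can be illustrated with a small sketch. This is plain Python, not Neo4j or Cypher; the users, products, the BOUGHT relationship, and the helper names `bought_by` and `recommend` are all hypothetical sample data chosen for illustration.

```python
from collections import Counter

# Nodes: id -> properties (hypothetical sample data, not from the notes).
nodes = {
    "alice": {"label": "User"},
    "bob": {"label": "User"},
    "carol": {"label": "User"},
    "laptop": {"label": "Product"},
    "mouse": {"label": "Product"},
    "keyboard": {"label": "Product"},
}

# Edges: (start node, relationship type, end node, properties).
edges = [
    ("alice", "BOUGHT", "laptop", {"year": 2024}),
    ("alice", "BOUGHT", "mouse", {"year": 2024}),
    ("bob", "BOUGHT", "laptop", {"year": 2023}),
    ("bob", "BOUGHT", "keyboard", {"year": 2023}),
    ("carol", "BOUGHT", "mouse", {"year": 2024}),
    ("carol", "BOUGHT", "keyboard", {"year": 2024}),
]


def bought_by(user):
    """Return the set of products a user is connected to via BOUGHT."""
    return {end for start, rel, end, _ in edges
            if start == user and rel == "BOUGHT"}


def recommend(user):
    """Rank products bought by users who share a purchase with `user`."""
    mine = bought_by(user)
    scores = Counter()
    for other, props in nodes.items():
        if props["label"] != "User" or other == user:
            continue
        theirs = bought_by(other)
        if mine & theirs:  # the two users are linked through a shared product
            for product in theirs - mine:
                scores[product] += 1
    return scores.most_common()


print(recommend("alice"))  # [('keyboard', 2)]
```

A graph database stores these relationships directly, so a traversal like "users who bought what I bought" does not have to be rebuilt from separate tables each time it is asked.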
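
The MapReduce programming model listed among the Hadoop components can be sketched as a word count. This is a single-process simulation in plain Python, not the Hadoop API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count`, as well as the sample input, are assumptions made for illustration. On a real cluster the map and reduce steps would run in parallel on different machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input standing in for lines of a large log file.
logs = ["big data big insights", "data drives decisions"]
print(word_count(logs))
# {'big': 2, 'data': 2, 'insights': 1, 'drives': 1, 'decisions': 1}
```

Because each line is mapped independently and each key is reduced independently, the work can be split across many machines, which is what makes the model suit large datasets.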
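
The two JSON snippets in the NoSQL section, {"name":"John"} and {name:"John"}, differ only in whether the key is quoted. A minimal sketch using Python's standard `json` module shows the consequence: the first parses, the second is rejected. The helper name `try_parse` and the extra `age` field are assumptions added for illustration.

```python
import json

valid_text = '{"name":"John"}'   # key in double quotes: valid JSON
invalid_text = '{name:"John"}'   # unquoted key: JavaScript object literal, not JSON


def try_parse(text):
    """Return the parsed object, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


print(try_parse(valid_text))     # {'name': 'John'}
print(try_parse(invalid_text))   # None

# Serializing a Python dict back to JSON text always quotes the keys.
print(json.dumps({"name": "John", "age": 30}))  # {"name": "John", "age": 30}
```

Since JSON is plain text and language independent, the same string could be produced by one program and parsed by another written in a different language.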