IT301 1st Term Final Coverage - with Comments.pdf
IT 301 – ADVANCED DATABASE SYSTEMS

Chapter 2: Introduction to Big Data and NoSQL

Big Data

Big Data refers to extremely large and complex datasets that traditional data processing software cannot manage effectively. These datasets can be structured, semi-structured, or unstructured and can come from various sources like social media, sensors, transactions, and more.

Organizations use Big Data analytics to uncover hidden patterns, correlations, and insights, driving better decision-making. Big Data technologies, such as Apache Hadoop, Apache Spark, and various data warehousing solutions, are often used to store and analyze vast amounts of information.

Organizations can leverage Big Data analytics to improve customer experiences, optimize operations, and identify new business opportunities. Techniques such as machine learning and data mining are commonly employed to extract valuable insights from Big Data.

Example: Social media platforms like Facebook and Twitter generate massive amounts of data daily, including posts, comments, likes, and shares. Analyzing this data can provide insights into user behavior, trends, and sentiment.

6V’s of Big Data

1. Volume:
This refers to the immense amount of data generated every second from multiple sources like business transactions, social media, sensors, mobile devices, and more. The scale of the data, which can range from terabytes to petabytes and beyond, makes it a defining characteristic of Big Data. For instance, companies like Google and Facebook manage data on an exabyte scale.

2. Velocity:
This aspect of Big Data deals with the speed at which data flows from various sources. This could include real-time data streaming and the rapid generation of data from sources like IoT devices, social media platforms, and online transactions. High velocity necessitates timely processing and often real-time analysis. For example, stock exchanges process millions of transactions per day, each needing to be captured and analyzed promptly.

3. Variety:
Big Data comes in various formats: structured, semi-structured, and unstructured. Structured data is organized and easily searchable (like in databases), semi-structured data is not organized in a predefined manner but still has some organizational properties (like JSON files), and unstructured data has no specific format or structure (like video, audio, and text). The diversity of data types, from simple text files to complex video feeds, adds to the complexity of processing Big Data.

4. Veracity:
This refers to the quality and accuracy of data. Given the many sources of Big Data, ensuring that the data is accurate, credible, and reliable is a challenge. Data veracity is crucial for making sound decisions, as inaccurate data can lead to faulty conclusions and bad business decisions.

5. Variability:
This property addresses the inconsistencies and variations in data flow rates. Data can be highly inconsistent, with periodic peaks, making it challenging to manage and analyze. For example, social media sentiment can change rapidly, affecting the flow and nature of data.

6. Value:
One of the most important aspects of Big Data is the value that can be derived from it. It’s not just about collecting large volumes of data but about extracting meaningful insights for better decision-making and strategic business moves. The real worth of Big Data lies in the ability to convert it into valuable insights.

DATABASES FOR BIG DATA

Databases for Big Data are specialized systems designed to store, manage, and process large volumes of complex data efficiently. They are crucial in handling the characteristics of Big Data, namely its volume, velocity, variety, veracity, variability, and value.

The landscape of databases for Big Data can be broadly divided into two categories: SQL (Structured Query Language) databases, often referred to as traditional or relational databases, and NoSQL (Not Only SQL) databases, which were developed as a response to the limitations of SQL databases in handling Big Data.

HADOOP

Hadoop is an open-source framework that allows for distributed storage and processing of large datasets using a network of computers. It consists of several key components:
- Hadoop Distributed File System (HDFS): A distributed file system that stores data across multiple machines, providing high-throughput access to application data.
- MapReduce: A programming model for processing large datasets in parallel across a Hadoop cluster.
- YARN (Yet Another Resource Negotiator): A resource management layer that allows multiple applications to share resources in the cluster.
- Hadoop Common: The libraries and utilities needed by other Hadoop modules.

Hadoop is designed to run on commodity hardware, making it a cost-effective solution for processing large amounts of data. Its scalability allows organizations to start small and grow as their data needs expand.

Additionally, Hadoop's ecosystem includes tools like Apache Hive for data warehousing, Apache Pig for data processing, and Apache HBase for real-time data access, which enhance its capabilities further.

Example: A retail company may use Hadoop to analyze customer purchase patterns by processing transaction logs, inventory data, and customer feedback. By utilizing Hadoop's distributed processing capabilities, the company can handle large datasets efficiently.

NOSQL

NoSQL databases are particularly well-suited for handling semi-structured and unstructured data, such as logs, JSON objects, and multimedia files. They support horizontal scaling, meaning organizations can add more servers to accommodate increased data loads without significantly affecting performance. This makes NoSQL databases a popular choice for applications that require fast access to large volumes of data, like real-time analytics and content management systems.

- JSON stands for JavaScript Object Notation.
- JSON is a lightweight data-interchange format.
- JSON is plain text written in JavaScript object notation.
- JSON is used to send data between computers.
- JSON is language independent.

In JSON, keys must be strings written with double quotes: {"name":"John"}
In a JavaScript object literal, keys may be written without quotes: {name:"John"}

NoSQL Real-World Use Cases and Examples

Netflix uses NoSQL databases to store and manage massive amounts of data, including customer profiles, viewing histories, and content recommendations. NoSQL databases allow Netflix to handle large volumes of data and provide fast, reliable access to data across a distributed network.

Uber uses NoSQL databases to handle the massive amounts of data generated by its ride-sharing platform, including driver and rider profiles, trip histories, and real-time location data. NoSQL databases provide the scalability and flexibility needed to handle high traffic volumes and changing data models.

Airbnb uses NoSQL databases to store and manage data for its booking platform, including property listings, guest profiles, and booking histories. NoSQL databases allow Airbnb to handle large volumes of unstructured data and provide fast, reliable access to data across a distributed network.

NewSQL

NewSQL databases aim to provide the scalability of NoSQL systems while maintaining the ACID (Atomicity, Consistency, Isolation, Durability) properties of traditional SQL databases. They are designed to handle high transaction rates and support SQL-like querying. Examples include:
- Google Spanner: A globally distributed database service with strong consistency.
- CockroachDB: A distributed SQL database that provides high availability and strong consistency.

NewSQL databases are suitable for applications that require high performance and reliability while using familiar SQL semantics.

NewSQL databases combine the benefits of traditional SQL databases with the scalability of NoSQL systems. They use distributed architectures to support high transaction rates while offering SQL support for complex queries and relational data modeling. This makes NewSQL databases ideal for applications like online transaction processing (OLTP) and e-commerce platforms, where both speed and reliability are crucial.

Example: CockroachDB is used by companies like DoorDash for its ability to handle high transaction volumes with strong consistency guarantees. It allows DoorDash to manage orders and deliveries effectively while ensuring data integrity.

Working with Graph Databases using Neo4j

Graph databases, like Neo4j, are designed to represent and store data in graph structures, consisting of nodes, edges, and properties. They excel in managing relationships and are ideal for applications where connections between data points are crucial, such as social networks, recommendation engines, and fraud detection.

Key features of Neo4j include:
- Flexible Schema: Allows for easy changes to the data model without extensive migrations.
- Cypher Query Language: A powerful and expressive query language designed specifically for working with graph data.
- ACID Compliance: Ensures reliable transactions and data integrity.

Neo4j's graph model is particularly effective for applications where relationships play a significant role. For instance, recommendation systems can utilize graph algorithms to suggest products or content based on user preferences and behaviors.

Neo4j supports various graph algorithms, such as PageRank and community detection, to analyze connections within the data.

The use of Cypher for querying allows developers to express complex relationships intuitively, making it easier to work with connected data.
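
The MapReduce programming model named in the HADOOP section can be illustrated without a cluster. What follows is a minimal single-process sketch in Python, not Hadoop's actual API: the function names map_phase, shuffle, reduce_phase, and word_count, as well as the sample lines, are hypothetical and chosen only for illustration. It counts words by mapping each line to (word, 1) pairs, grouping the pairs by key, and summing each group.

```python
from collections import defaultdict

def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce step: combine the values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    # On a real cluster, different nodes would map different blocks of input
    # in parallel; here everything runs in one process.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

# Hypothetical sample input standing in for a large distributed dataset.
sample_lines = ["big data needs big storage", "data flows fast"]
counts = word_count(sample_lines)
print(counts)
```

Because each map call looks at one line only and each reduce call looks at one key only, the work can be split across machines, which is the property the MapReduce model relies on.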
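
The two snippets shown with the JSON notes in the NOSQL section, {"name":"John"} and {name:"John"}, differ in one important way: JSON requires keys to be double-quoted strings, while a JavaScript object literal allows unquoted keys. The sketch below uses Python's standard json module to check which form parses as JSON. The helper name is_valid_json is hypothetical.

```python
import json

def is_valid_json(text):
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

json_text = '{"name":"John"}'      # key is a double-quoted string
js_literal_text = '{name:"John"}'  # JavaScript object literal with an unquoted key

print(is_valid_json(json_text))        # True
print(is_valid_json(js_literal_text))  # False

# Parsing turns JSON text into a native structure (a dict in Python),
# and dumps turns it back into text that another program can read.
record = json.loads(json_text)
print(record["name"])                  # John
print(json.dumps(record))
```

The round trip through loads and dumps is what "language independent" means in practice: the text form is the shared contract, and each language maps it to its own data types.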
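
The graph model described in the Neo4j section (nodes, edges, and properties) and its use for recommendations can be sketched in plain Python. This is not Neo4j or Cypher: the data, the FOLLOWS relationship type, and the functions neighbors and recommend are all hypothetical, and the ranking rule (count how many followed accounts also follow a candidate) is just one simple choice assumed for illustration.

```python
from collections import Counter

# Nodes carry properties. All names and values are made up for illustration.
nodes = {
    "alice": {"label": "User", "city": "Manila"},
    "bob": {"label": "User", "city": "Cebu"},
    "carol": {"label": "User", "city": "Davao"},
    "dave": {"label": "User", "city": "Manila"},
}

# Edges are (source, relationship type, target) triples.
edges = [
    ("alice", "FOLLOWS", "bob"),
    ("alice", "FOLLOWS", "carol"),
    ("bob", "FOLLOWS", "dave"),
    ("carol", "FOLLOWS", "dave"),
    ("carol", "FOLLOWS", "alice"),
]

def neighbors(node, rel_type):
    """Return the nodes reachable from node over one edge of the given type."""
    return [dst for src, rel, dst in edges if src == node and rel == rel_type]

def recommend(user):
    """Suggest accounts followed by the accounts that user follows,
    ranked by how many of those accounts follow them."""
    already = set(neighbors(user, "FOLLOWS")) | {user}
    scores = Counter(
        candidate
        for friend in neighbors(user, "FOLLOWS")
        for candidate in neighbors(friend, "FOLLOWS")
        if candidate not in already
    )
    return [name for name, _ in scores.most_common()]

suggestions = recommend("alice")
print(suggestions)
print([nodes[name]["city"] for name in suggestions])
```

The recommendation is a two-hop traversal. In a relational database the same question would need a self-join on the follow table, which is the kind of relationship-heavy query graph databases are built to handle directly.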
Other Information in the PPT:

Relational databases organize data in structured tables with fixed schemas, while NoSQL databases, such as document-based, graph-based, and key-value stores, offer more flexible data storage options without strict structure.

Business Intelligence (BI) encompasses the tools and techniques used to turn data into actionable insights, helping organizations make informed decisions.

A Data Warehouse consolidates data from different sources, allowing for complex queries and reporting to aid in business analytics.

Business Intelligence is not focused on transactional data entry, but rather on analyzing and interpreting data for insights.

Document-based NoSQL databases are specifically designed to handle unstructured data in the form of documents.

One of the significant challenges associated with Big Data is the complexity of processing and gaining insights from large datasets.

A Data Mart is used to meet the specific analytical needs of a particular business unit.

The primary goal of data mining is to analyze data to uncover trends that may not be immediately visible.

A data warehouse consolidates data from multiple sources, enhancing the ability to perform comprehensive analysis and generate reports.

Business Intelligence tools help organizations visualize data and derive insights that support strategic decisions.

MongoDB is an example of a NoSQL database that stores data in a flexible, document-oriented format, differing from traditional relational databases.

Hadoop enables the distributed storage and processing of large datasets across clusters of computers.

The primary purpose of data mining is to identify trends and correlations that can inform business strategies and decisions.

In the context of Big Data, "Variety" refers to the diversity of data types that need to be managed and analyzed.

Key-value store NoSQL databases organize data in pairs, allowing for quick retrieval of values based on their associated keys.

While OLTP systems manage daily transactional data, they are not typically part of a data warehouse architecture, which focuses on analytical processing.

OLAP is designed for analytical purposes, allowing users to perform multidimensional analysis of business data.

Historical data / time-variant / subject-oriented / nonvolatile: This characteristic defines a Data Warehouse, as it holds historical data that remains unchanged and is organized around specific subjects to facilitate analysis.

Data mining involves analyzing vast amounts of data to uncover patterns, trends, and insights that can inform business decisions.

Supervised learning algorithm / regression / predictive analytics or modeling: This technique is commonly used in data mining to understand relationships between variables and make predictions based on historical data.

ETL is a crucial process in data warehousing that involves extracting data from various sources, transforming it into a usable format, and loading it into a data warehouse for analysis.

Versioning / historical or temporal data management / time-variant: This term describes how a Data Warehouse maintains historical data that changes over time to reflect past states of the data.

NoSQL databases are often chosen for Big Data applications because they can easily scale to accommodate growing data volumes and diverse data types.

Tableau is a popular Business Intelligence tool used for data visualization, enabling users to create interactive and shareable dashboards.

In the context of Big Data, "Velocity" refers to the rapid pace at which new data is created and needs to be analyzed.

"Schema-less" in NoSQL databases allows for easy modification of data structures without the need for extensive reconfiguration.

The primary purpose of a Data Warehouse is to provide a centralized repository for data that can be analyzed for insights.

NoSQL Database: This type of database is best suited for handling large volumes of unstructured data, such as text or multimedia content.

Denormalization in a data warehouse design helps optimize data retrieval for reporting purposes by reducing the complexity of joins.

These techniques are commonly used in data mining to identify patterns in customer purchasing behavior.

ETL (Extract, Transform, Load): This sequence describes the ETL process for integrating data from various sources into a data warehouse.

Data warehouse optimization technique: These techniques help improve query performance by optimizing how data is accessed and organized in a data warehouse.

Graph databases are specifically designed to manage and analyze relationships between users, making them suitable for social media platforms.

These Big Data technologies are ideal for implementing real-time analytics and processing of large datasets.

These data mining techniques can help predict future sales based on historical data patterns.

Data warehouse optimization technique: These methods can help optimize performance in a data warehouse that struggles with large volumes of data.

These tools facilitate data visualization, making it easier to interpret sales performance data.

Use a combination of document-based, key-value, and graph databases based on the data type and access patterns.

Polyglot persistence: This approach leverages the strengths of various NoSQL database types for diverse datasets.

Design the system to route transactional data to relational databases and analytical data to NoSQL databases, ensuring proper integration between them.

Combination of SQL and NoSQL databases: This hybrid approach optimizes both transactional and analytical workloads.

Use technologies like Apache Kafka, Apache Storm, or Apache Flink, and integrate them with a data lake or distributed file system. This setup enables efficient real-time analytics in a Big Data environment.

A comprehensive data mining project plan must cover all essential stages to ensure successful analysis.

Effective integration is crucial for leveraging BI tools to provide meaningful data visualizations.

Use data lakes for storing raw, unstructured data and data warehouses for structured, processed data, based on the company’s need for flexibility and processing power. This recommendation provides the best of both worlds in data storage and analysis.

Hadoop and HDFS are designed to handle the storage and processing of large datasets, making them suitable for managing log data in real-time.

Use clustering and classification techniques. These data mining methods can effectively segment a customer base for targeted marketing strategies.

Data integration: This approach helps improve business decision-making by offering a holistic view of the data.

These techniques help distribute data across multiple servers and optimize data retrieval, addressing performance bottlenecks in NoSQL databases.

Use BI tools like Tableau or Power BI to create interactive dashboards and reports.
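
The ETL sequence described above (extract from sources, transform into a usable format, load into a data warehouse) can be sketched end to end with Python's standard library. This is an illustrative sketch under stated assumptions, not a production pipeline: the CSV content, the sales table, and the function names extract, transform, and load are hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data. A real pipeline would read files, APIs, or databases.
RAW_CSV = """order_id,customer,amount
1, Ana ,100.50
2,ben,
3,Cara,75.25
"""

def extract(text):
    """Extract: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with no amount, tidy names, and cast types."""
    cleaned = []
    for row in rows:
        if not row["amount"].strip():
            continue  # skip incomplete records
        cleaned.append(
            (int(row["order_id"]), row["customer"].strip().title(), float(row["amount"]))
        )
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # in-memory database standing in for a warehouse
load(transform(extract(RAW_CSV)), conn)
total = conn.execute("SELECT COUNT(*), SUM(amount) FROM sales").fetchone()
print(total)
```

Keeping the three steps as separate functions mirrors the order of the acronym and lets each stage be tested or replaced on its own.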
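
The notes above mention key-value stores and techniques that distribute data across multiple servers without naming those techniques. One common technique of this kind is hash-based sharding, and the sketch below assumes it purely for illustration. The class ShardedKeyValueStore and its methods are hypothetical, and each Python dict stands in for one server.

```python
import hashlib

class ShardedKeyValueStore:
    """Toy key-value store that spreads keys across several in-memory shards."""

    def __init__(self, shard_count):
        # Each dict stands in for one server.
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # A stable hash sends the same key to the same shard on every run,
        # unlike Python's built-in hash() for strings.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key, value):
        self.shards[self._shard_for(key)][key] = value

    def get(self, key, default=None):
        # Only one shard is consulted, so a lookup does not scan every server.
        return self.shards[self._shard_for(key)].get(key, default)

store = ShardedKeyValueStore(shard_count=3)
for i in range(9):
    store.put(f"user:{i}", {"id": i})

print(store.get("user:4"))
print([len(shard) for shard in store.shards])
```

With plain modulo hashing, changing the number of shards moves most keys to a different shard, which is why distributed stores often use consistent hashing instead.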