Lecture-15.pdf

Lecture 15: Big Data Techniques
Khushnur Binte Jahangir, Lecturer
Department of Computer Science and Engineering
United International University

What is Big Data?
“Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.” (Gartner)
Image source: https://opencirrus.org/big-data-analytics-vs-data-mining/

Example: Netflix’s Big Data Transformation
o Initial Phase: Started as a DVD rental-by-mail service.
o Game-Changer: Leveraged customer data for insights and transformation.
Big Data Technologies:
o Recommendation Engine: Personalized content suggestions via machine learning.
o Streaming Infrastructure: Scalable, reliable tech (Apache Kafka, AWS) supports millions of simultaneous streams globally.
o Data Analytics: Tools like Apache Hadoop and Spark process vast datasets in real time, guiding content acquisition, production, and marketing.
Impact on Growth:
o Enhanced Personalization: Increased user engagement and reduced churn.
o Global Expansion: Tailored content for regional preferences (200M+ subscribers).
o Original Content: Data-driven production of hit shows like Stranger Things.

Characteristics of Big Data (The 4Vs)
o Volume: Data Quantity
o Velocity: Data Speed
o Variety: Data Types
o Veracity: Data Reliability

Big Data: Volume
Refers to the sheer size of data being generated and stored. The scale of the data is massive, often reaching petabytes or even exabytes.
Examples:
o As of June 2022, more than 500 hours of video were uploaded to YouTube every minute. This massive volume of data makes it impossible to manage with traditional systems.
o Walmart deals with big data. They handle more than 1 million customer transactions every hour, importing more than 2.5 petabytes of data into their database. This is about 167 times the amount of information contained in all the books in the US Library of Congress.
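To give a feel for these magnitudes, the figures quoted above can be turned into a short back-of-the-envelope calculation. This is only a sketch: it assumes decimal units (1 PB = 1,000 TB) and takes the quoted numbers at face value; the variable names are illustrative.

```python
# Back-of-the-envelope arithmetic for the Volume examples.
# Assumption: decimal units, 1 PB = 1,000 TB.
TB_PER_PB = 1_000

# YouTube: 500 hours of video uploaded per minute (figure quoted above).
hours_per_minute = 500
hours_per_day = hours_per_minute * 60 * 24

# Walmart: 2.5 PB, described as about 167 times the Library of Congress.
walmart_tb = 2.5 * TB_PER_PB
implied_library_tb = walmart_tb / 167

print(f"Video uploaded per day: {hours_per_day:,} hours")
print(f"Implied Library of Congress size: {implied_library_tb:.1f} TB")
```

Under these assumptions, the upload rate works out to 720,000 hours of video per day, and the "167 times" comparison implies a book collection of roughly 15 TB.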
Big Data: Velocity
This refers to the speed at which new data is generated, collected, and processed. Big Data systems need to handle this rapid influx of information in real time or near real time.
Examples:
o Stock market data, where prices and transactions are updated every fraction of a second, must be processed instantaneously to make real-time decisions.
o More than 3.5 billion searches are made on Google per day. Also, the number of Facebook users is increasing by approximately 22% year over year.

Big Data: Variety
Big Data comes from multiple sources and exists in various formats, including structured (databases), unstructured (videos, social media posts), and semi-structured data (XML, JSON files).
Examples:
o A company might need to integrate data from their customer service emails (unstructured), their transaction database (structured), and social media mentions (semi-structured).
o IoT (Internet of Things) devices, such as sensors, smart appliances, and connected devices, continuously generate different types of data in real time, including structured data (e.g., temperature readings) and unstructured data (e.g., video footage from security cameras).
Image source: https://www.vecteezy.com/vector-art/40814044-data-variety-icon-in-vector-logotype

Big Data: Veracity
This addresses the trustworthiness, accuracy, and reliability of the data. With Big Data, there can be inconsistencies, biases, or noise that must be filtered out to ensure the data is useful.
Examples:
o In healthcare, patient data may come from various sources like sensors, doctor’s notes, and lab results. Ensuring this data is accurate and consistent is critical for effective patient care.
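The kind of filtering that veracity calls for can be illustrated with a small sketch. Everything here is hypothetical: the `Reading` record, the sample values, and the plausibility range of 20 to 250 beats per minute are assumptions chosen for illustration, not clinical guidance or part of any real system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Reading:
    """Hypothetical heart-rate reading collected from one of several sources."""
    patient_id: str
    source: str
    heart_rate: Optional[int]  # beats per minute; None if the value is missing


def clean(readings):
    """Drop missing, implausible, and exactly duplicated readings."""
    seen = set()
    kept = []
    for r in readings:
        if r.heart_rate is None:           # missing value
            continue
        if not 20 <= r.heart_rate <= 250:  # assumed plausibility range
            continue
        if r in seen:                      # exact duplicate record
            continue
        seen.add(r)
        kept.append(r)
    return kept


raw = [
    Reading("p1", "sensor", 72),
    Reading("p1", "sensor", 72),    # duplicate of the previous record
    Reading("p2", "sensor", None),  # missing value
    Reading("p3", "notes", 900),    # implausible, likely a typing error
    Reading("p4", "lab", 88),
]
cleaned = clean(raw)
print(len(raw), "->", len(cleaned))
```

Real pipelines usually log or quarantine rejected records instead of silently dropping them, so that systematic problems in a source can be traced.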
Importance of Big Data
o Driving Business Strategies: Big Data helps businesses uncover patterns and trends, enabling data-driven decisions that fuel growth, boost efficiency, and enhance customer satisfaction.
o Unlocks Insights Beyond Human Perception: Big data analytics can reveal trends and patterns invisible to traditional methods or human analysis.
o Cost Savings: Big Data tools such as Hadoop and cloud-based analytics can bring cost advantages to a business when large amounts of data have to be stored. These tools also help in identifying more efficient ways of doing business.
o Time Reductions: The high speed of tools like Hadoop and in-memory analytics makes it easy to identify new sources of data, which helps businesses analyze data immediately and make quick decisions based on what they learn.

Big Data: Use Cases
Healthcare:
o Volume & Variety: Big Data deals with massive, diverse datasets (patient records, sensor data, genomics, and medical imaging) which are too large for traditional databases to handle. This provides comprehensive insights for ML models to make precise diagnoses and treatment plans.
o Real-Time Data Processing: Big Data technologies like Apache Hadoop and Spark can process large-scale health data in real time, enabling hospitals to monitor patient vitals, predict potential health crises, and react instantly, which ML or AI alone would struggle with.
Retail:
o Handling Complex, Unstructured Data: Retailers generate vast amounts of structured (transactions) and unstructured (social media, reviews) data. Big Data technologies enable the storage, processing, and aggregation of these diverse data sources, feeding ML models with robust datasets for dynamic pricing and customer personalization.
o Scalability: Big Data tools allow scalable analysis of global data streams, from customer interactions to supply chain events, providing the infrastructure for ML/AI to build better models for inventory management and sales forecasting.
Finance:
o High-Frequency, Real-Time Data Processing: Big Data systems can process and analyze millions of transactions per second, identifying potential fraud or risk events in real time, something ML and AI models rely on but cannot handle alone without Big Data infrastructure.
o Regulatory & Compliance Data: Financial sectors often require analyzing vast datasets for regulatory reporting. Big Data technologies enable organizations to manage these massive data stores efficiently and ensure compliance, beyond what ML/AI alone can do.

Big Data Technologies
Big Data technologies are divided into 4 fields:
1. Data Storage
2. Data Mining
3. Data Analytics
4. Data Visualization

Data Storage: Apache Hadoop
What is Hadoop?
o A widely used Big Data technology designed to handle large-scale data and file systems.
o Uses the Hadoop Distributed File System (HDFS) for managing massive datasets across multiple machines.
o Features parallel processing using the MapReduce framework, enabling efficient handling of tasks in batches.
Key Features:
o Batch Processing: Processes data in batches, well suited to analyzing large volumes of data.
o Low Cost: Uses commodity hardware, reducing the cost of large-scale data processing.
o Java-Based: Hadoop is written in Java. It has been an Apache Software Foundation project since 2006, and version 1.0 was released in 2011.
Real-Life Use Case: NextBio uses Hadoop’s MapReduce and HBase to process multi-terabyte datasets of the human genome, making genome data analysis faster and more efficient.

Data Storage: NoSQL Databases
NoSQL databases are designed to handle large volumes of unstructured or semi-structured data that doesn’t fit into traditional relational databases.
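Why semi-structured data sits awkwardly in a fixed relational schema can be shown with a minimal sketch. Plain Python dictionaries stand in for stored documents here; this is not the API of any particular database, and the records, field names, and the `find` helper are all hypothetical.

```python
import json

# Two hypothetical customer records with different shapes.
documents = [
    {"_id": 1, "name": "Asha", "email": "asha@example.com"},
    {
        "_id": 2,
        "name": "Rafi",
        "devices": [{"type": "sensor", "unit": "C"}],
        "tags": ["iot", "beta"],
    },
]


def find(docs, field, value):
    """Return documents whose field equals value or, for list fields, contains it."""
    matches = []
    for doc in docs:
        stored = doc.get(field)  # a missing field is simply None, not an error
        if stored == value or (isinstance(stored, list) and value in stored):
            matches.append(doc)
    return matches


# A single relational table would need a column for every field seen anywhere,
# leaving most cells empty and flattening the nested "devices" list.
columns = sorted({key for doc in documents for key in doc})
iot = find(documents, "tags", "iot")

print(columns)
print(json.dumps(iot[0]))
```

Each document carries its own structure, so a new field can appear in one record without changing the others.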
MongoDB:
o Flexible document-oriented database used to store JSON-like documents.
o First released in February 2009 by MongoDB Inc. (then named 10gen). It is written in a combination of C++, Python, JavaScript, and Go.
Cassandra:
o Apache Cassandra is a free, open-source NoSQL database designed to manage large amounts of data across multiple servers, ensuring high availability and no single point of failure.
o Originally developed at Facebook for its inbox search feature, released as open source in 2008, and later maintained by the Apache Software Foundation. It is written in Java.

Data Mining
RapidMiner:
o A data science software offering a robust graphical user interface (GUI) to create, deliver, manage, and maintain predictive analytics.
o Developed in 2001 by Ralf Klinkenberg, Ingo Mierswa, and Simon Fischer at the Technical University of Dortmund. Initially known as YALE (Yet Another Learning Environment).
o Java-based centralized solution.
Elasticsearch:
o An open-source, real-time distributed search and analytics engine. Handles both structured and unstructured data with high scalability, up to petabytes.
o Can replace document-based databases like MongoDB and RavenDB in some workloads.
o Widely used in enterprise search by large organizations and sites such as Wikipedia, GitHub, and Stack Overflow.

Data Analytics: Apache Spark
What is Apache Spark?
o A widely used Big Data technology known for its in-memory computing capabilities that enhance operational speed.
o Provides a generalized execution model to support a wide range of applications, with APIs in Java, Scala, Python, and R.
o Started in 2009 at UC Berkeley's AMPLab and later donated to the Apache Software Foundation.
Key Features:
o Real-Time Streaming: Processes real-time data using batching and windowing techniques.
o Spark Components: Includes MLlib, GraphX, and SparkR for machine learning and data science.
o Written primarily in Scala, and usable from Java, Scala, Python, and R.
Integration with Hadoop:
o Can work independently or with Hadoop for storage and processing.
o Uses Hadoop primarily for storage, as it has its own cluster management for computation.
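The idea of windowing a stream can be sketched in a few lines. This is plain Python and deliberately not Spark's API: it only illustrates grouping timestamped events into fixed, non-overlapping ("tumbling") windows and counting per window. The function name, the event tuples, and the 10-second window are assumptions made for illustration.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per key in fixed, non-overlapping time windows.

    events: iterable of (timestamp_seconds, key) pairs.
    Returns a dict mapping window start time to {key: count}.
    """
    windows = defaultdict(lambda: defaultdict(int))
    for timestamp, key in events:
        # Every timestamp falls into exactly one window.
        window_start = (timestamp // window_seconds) * window_seconds
        windows[window_start][key] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}


# Hypothetical stream of (seconds since start, event type) pairs.
events = [
    (1, "play"), (4, "pause"), (9, "play"),
    (12, "play"), (19, "pause"), (21, "play"),
]
result = tumbling_window_counts(events, window_seconds=10)
print(result)
```

A streaming engine applies the same grouping incrementally as small batches of events arrive, and also has to deal with late or out-of-order events, which this sketch ignores.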
Data Visualization: Tableau
What is Tableau?
o One of the fastest and most powerful data visualization tools, widely used in the business intelligence industry.
o Helps analyze data quickly and creates visualizations in the form of dashboards and worksheets.
Development & Language:
o Developed by Tableau Software, which was founded in 2003 and went public in May 2013.
o Written in multiple languages like Python, C, C++, and Java.
Comparable tools: Other business intelligence products in the same space include IBM Cognos, Qlik, and Oracle Hyperion.

Reading References
o Book: Big Data For Dummies, by Judith S. Hurwitz, Alan Nugent, Fern Halper, and Marcia Kaufman
o https://www.geeksforgeeks.org/popular-big-data-technologies/
o https://www.javatpoint.com/big-data-technologies
o https://www.geeksforgeeks.org/5-vs-of-big-data/
o https://en.wikipedia.org/wiki/Apache_Hadoop
o https://en.wikipedia.org/wiki/Apache_Spark

We will continue with Challenges of Big Data.

The End
