Big Data Explained PDF
Document Details
Uploaded by WiseZeal
UCM
Summary
This document provides a general overview of big data concepts. It discusses the massive volume of data being generated and the challenges of storing and processing it. The 5 Vs of big data (velocity, variety, volume, veracity, and value) and sources of big data are mentioned.
Full Transcript
BIG DATA

"Data is absolutely key to everything we are doing. It is needed in all areas and fuels the insights that help us make better decisions in all aspects of the business." – Kathleen Hogan (Chief People Officer at Microsoft)

GLOBAL VOLUME OF DATA:
- In 2025, it is estimated that the world will generate 175 zettabytes (ZB) of data (1 ZB = 1 billion gigabytes). In 2010 the figure was just 2 ZB.
- Internet users generate around 2,500,000 GB of data every day.
- 90% of the world's data was generated in the last two years.

THE 5 Vs OF BIG DATA:
1. VELOCITY → batch, near time, real time, streams
2. VARIETY → structured, unstructured, semi-structured, all of the above
3. VOLUME → terabytes, records, transactions, tables, files
4. VERACITY → trustworthiness, authenticity, origin, reputation, accountability
5. VALUE → statistical, events, correlations, hypothetical

SOURCES OF DATA:
Main sources include:
- Facebook
- Twitter (500,000 tweets per minute)
- Instagram (347,222 posts per minute)
- IoT (75 billion connected devices generating data) – sensors

STORAGE OF GENERATED DATA:
Less than 20% of global data is stored in relational databases. This is a small percentage, but an important one: it covers bank databases, hospital records, customer data, and so on. The remaining 80% of global data is unstructured (text, images, video). This data is stored in big data architectures, in the cloud, and in NoSQL databases.

BIG DATA STORAGE:
Different technologies are needed to store, process, and analyze volumes of data that cannot be managed with traditional databases.

STORAGE IN HDFS (HADOOP DISTRIBUTED FILE SYSTEM): this setup is designed to handle large volumes of data across multiple servers.
➔ It divides data into small blocks (typically 128 MB or 256 MB) and distributes them across different nodes (servers)
➔ It provides high redundancy (copies of data) to ensure that data is not lost if a node fails
➔ Ideal for storing large amounts of unstructured or semi-structured data

DATA LAKES: a centralized repository that stores flat files of all types of data (structured, semi-structured, and unstructured). Data is stored raw, exactly as it is generated, with no transformation. A data lake is used when you need to store large volumes of diverse raw data for long-term analysis, or when you don't yet know what type of analysis you will perform later.

NoSQL
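The HDFS block-splitting and replication scheme described above can be sketched in a few lines. This is a simplified model, not real HDFS code: the block size and replication factor mirror common HDFS defaults (128 MB blocks, 3 replicas), and the node names are hypothetical; real HDFS placement is also rack-aware, which this round-robin sketch ignores.

```python
# Simplified model of HDFS-style block splitting and replica placement.
import math
from itertools import cycle

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, a typical HDFS block size
REPLICATION = 3                  # HDFS's default replication factor

def plan_blocks(file_size_bytes, nodes):
    """Split a file into fixed-size blocks and assign each block's
    replicas to nodes round-robin (no rack awareness)."""
    n_blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    node_cycle = cycle(nodes)
    plan = []
    for block_id in range(n_blocks):
        replicas = [next(node_cycle) for _ in range(REPLICATION)]
        plan.append((block_id, replicas))
    return plan

# A 1 GB file on a hypothetical 5-node cluster: 8 blocks, 3 copies each,
# so losing any single node never loses a block.
layout = plan_blocks(1 * 1024**3, ["node1", "node2", "node3", "node4", "node5"])
print(len(layout))   # 8
print(layout[0])     # (0, ['node1', 'node2', 'node3'])
```

The 3-way replication is what gives HDFS its redundancy: every block survives the failure of any one node holding a copy.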
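The "store it raw, decide later" idea behind a data lake can be illustrated with a minimal landing script. This is only a sketch of the pattern: the folder layout (`datalake/raw/events/dt=...`), the JSON-lines format, and the event fields are all illustrative assumptions, not anything specified in the notes above.

```python
# Minimal sketch of landing raw events in a date-partitioned data lake.
import json
from datetime import date
from pathlib import Path

def land_raw(event: dict, root: str = "datalake/raw/events") -> Path:
    """Append one event, exactly as received (no transformation),
    to today's partition as a JSON line."""
    partition = Path(root) / f"dt={date.today().isoformat()}"
    partition.mkdir(parents=True, exist_ok=True)
    out = partition / "part-0000.jsonl"
    with out.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")  # raw payload, schema-on-read
    return out

# Hypothetical sensor reading landing in the lake as-is.
path = land_raw({"source": "sensor-42", "temp_c": 21.5})
```

Because nothing is transformed on the way in, any future analysis (or none at all) remains possible: the schema is applied only when the data is read.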