Summary

This document provides a high-level overview of big data, covering topics such as the global volume of data, the five Vs of big data (velocity, variety, volume, veracity, and value), data sources, and data storage methods including Hadoop Distributed File System (HDFS) and data lakes. The material also discusses the advantages and functionalities of NoSQL databases in handling big data.

Full Transcript

**BIG DATA**

"Data is absolutely key to everything we are doing. It is needed in all areas and fuels the insights that help us make better decisions in all aspects of the business." -- *Kathleen Hogan* (Chief People Officer at Microsoft)

**GLOBAL VOLUME OF DATA:**

- In 2025, it is estimated that the world will generate 175 zettabytes (ZB) of data. (1 ZB = 1 trillion gigabytes, i.e. 10^21 bytes.)
- In 2010, the figure was just 2 ZB.
- Internet users generate around 2,500,000 GB every day.
- 90% of the data was generated in the last 2 years.

**THE 5 Vs OF BIG DATA:**

1. VELOCITY: batch, near real time, real time, streams
2. VARIETY: structured, unstructured, semi-structured, all of the above
3. VOLUME: terabytes, records, transactions, tables, files
4. VERACITY: trustworthiness, authenticity, origin, reputation, accountability
5. VALUE: statistical, events, correlations, hypothetical

**SOURCES OF DATA:**

Main sources are:

- Facebook
- Twitter (500,000 tweets per minute)
- Instagram (347,222 posts per minute)
- IoT (75 billion connected devices generating data) -- sensors

**STORAGE OF GENERATED DATA:**

Less than 20% of global data is stored in relational databases. This is a small percentage, but it is important for handling data such as bank, hospital, and customer records.

80% of global data is unstructured (text, images, video). This data is stored in big data architectures, in the cloud, and in NoSQL databases.

**BIG DATA STORAGE:**

Different technologies are needed to store, process, and analyze volumes of data that cannot be managed with traditional databases.

**[STORAGE IN HDFS (HADOOP DISTRIBUTED FILE SYSTEM):]** This setup is designed to handle large volumes of data across multiple servers.

- It divides data into fixed-size blocks (typically 128 MB or 256 MB) and distributes them across different nodes (servers).
- It provides high redundancy (copies of the data) to ensure that data is not lost if a node fails.
- It is ideal for storing large amounts of unstructured or semi-structured data.

**[DATA LAKES:]** A centralized repository that stores flat files of all types of data (structured, semi-structured, and unstructured). Data is stored raw, as it is generated, with no transformation. A data lake is used when you need to store large volumes of diverse, raw data for long-term analysis, or when you don't yet know what type of analysis you will perform later.

**[NoSQL]**
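
Returning to the HDFS description above, the idea of splitting a file into blocks and keeping several copies on different nodes can be illustrated with a small Python sketch. This is a simplified model, not HDFS code: the function names, the node names, and the round-robin placement are hypothetical, and the replication factor of 3 is an assumption made for the example (real HDFS placement also takes rack topology into account).

```python
BLOCK_SIZE_MB = 128   # typical block size mentioned in the notes
REPLICATION = 3       # assumed number of copies per block


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block for a file of the given size."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block may be smaller
    return sizes


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + i) % len(nodes)] for i in range(replication)]
        for block_id in range(num_blocks)
    }


def surviving_blocks(placement, failed_node):
    """Return the ids of blocks that still have a copy after one node fails."""
    return [
        block_id
        for block_id, holders in placement.items()
        if any(node != failed_node for node in holders)
    ]


nodes = ["node-a", "node-b", "node-c", "node-d"]   # hypothetical cluster
blocks = split_into_blocks(1000)                   # a 1000 MB file
placement = place_replicas(len(blocks), nodes)

print(len(blocks), blocks[-1])   # 8 104
print(placement[0])              # ['node-a', 'node-b', 'node-c']
print(len(surviving_blocks(placement, "node-a")) == len(blocks))  # True
```

Because every block lives on more than one node, losing a single node leaves at least one copy of each block available, which is the redundancy property the notes describe.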
