Big Data Introduction

Questions and Answers

Approximately how much of the world's data was generated in the two years preceding the creation of the presentation?

  • 90% (correct)
  • 30%
  • 60%
  • 10%

Which of the following is a common source of big data, according to the slides?

  • Ancient cave paintings
  • Telepathic communication
  • Human activity like emails and photos (correct)
  • Dream analysis

By which year was it estimated that a third of the world's data would pass through the cloud?

  • 2020 (correct)
  • 2016
  • 2025
  • 2030

In what year did 73% of organizations already invest or plan to invest in big data infrastructures?

  • 2016 (correct)

The data from mobile devices and people, in relation to the total size of the digital universe in 2013, was approximately:

  • 17% (correct)

In the context of Big Data benefits, what is one way hospitals can use big data?

  • To detect critical conditions in patients (correct)

When was the concept of Big Data initially recognized?

  • 1997 (correct)

Which of the following is an example of how big data can be applied in education?

  • Analyzing student reactions to predict success (correct)

When did Google publish its first paper on the Google File System?

  • 2003 (correct)

What does the acronym GIGO stand for?

  • Garbage In, Garbage Out (correct)

Within the context of customer service, Big Data makes which of the following possible?

  • 360 degree view of the customer (correct)

In Big Data, upselling refers to what?

  • Selling other expensive products (correct)

How can Big Data analysis help in fraud prevention, as suggested in the slides?

  • By detecting unusual purchase patterns (correct)

What would a predictive maintenance system do in aerospace engineering?

  • Detect potential faults in aircraft (correct)

What is the purpose of flight simulation in aerospace engineering, using big data?

  • To train pilots virtually under various conditions (correct)

What do organizations analyze to prevent and detect attacks in real-time?

  • Log information (correct)

In healthcare, what is one of the ways Big Data Analytics can help?

  • Early epidemic alert (correct)

According to the presentation, what is a limitation of the traditional approach to data management?

  • Complex database management (correct)

What does the concept of distributed computing refer to?

  • Executing a computer process across different machines (correct)

What is one of the key aspects of distributed computing?

  • Scalability (correct)

What is the name of the algorithm that Google used to solve the problem of processing large datasets?

  • MapReduce (correct)

Which component divides a task into small parts, assigns them to multiple computers, and collects results?

  • MapReduce Algorithm (correct)

Which 'V' of Big Data refers to the enormous quantities of data?

  • Volume (correct)

Which 'V' of Big Data refers to the different types and formats of data?

  • Variety (correct)

Which of the following describes a node in a Big Data architecture?

  • A server in a Big Data network (correct)

What is the name referring to a set of servers within the same Big Data network?

  • Cluster (correct)

What does 'information processing' meet in Big Data architecture?

  • A wide variety of processing requirements (correct)

What are the major uses of Big Data listed in the presentation?

  • Targeted ads, fraud prevention, and healthcare improvement (correct)

Which big data V refers to the rate at which data is received?

  • Velocity (correct)

Which big data V refers to the data's accuracy?

  • Veracity (correct)

Which of these is NOT a component of the Hadoop Ecosystem?

  • Microsoft Word (correct)

What is the role of SQOOP in the Hadoop ecosystem?

  • Data Collection (correct)

What is a key element of Datamining?

  • Statistical methods (correct)

What is data heterogeneity?

  • Using machines with different architectures (correct)

Which of the following is a goal of Hadoop?

  • Process data on many computers (correct)

What is one way that retailers can use Big Data?

  • Both options are correct (correct)

What is the use case for Big Data in the domain of aviation?

  • All of the above (correct)

What is a good use of Big Data in fraud detection?

  • Detecting groups of suspicious purchases (correct)

Which of the following actions can NOT be completed with Big Data?

  • Increasing the amount of available oil (correct)

What can happen if there are network problems when using Big Data?

  • All of the above (correct)

What is one of the challenges that is facing Big Data currently?

  • Lack of skilled workers (correct)

Flashcards

What is Big Data?

A collection of large datasets that cannot be processed using traditional computing techniques.

What happened in 2003?

Google published a paper on the Google File System and revealed the first secrets of its success.

What happened in 2005?

Doug Cutting and Michael Cafarella, at Yahoo, created Nutch Search Engine inspired by Google.

Why is Big Data Important?

Analytics of data that can improve business decisions.

What is cross-selling?

Detect complementary items a customer might buy.

What is Upselling?

Selling customers more expensive products.

Fraud prevention without big data.

Using rules-based systems to flag potentially fraudulent transactions.

Material Innovation

Using data to identify lightweight, high-strength materials for better fuel efficiency and structural integrity.

Autonomous Drones

Designing UAVs for applications like reconnaissance, delivery, or environmental monitoring.

Security Intelligence

Using big data analytics to stop hackers and cyberattackers.

What are the 4 Vs of Big Data?

Volume, velocity, variety, and veracity.

What is Volume?

The term refers to enormous quantities of data.

What is Velocity?

New data being created and needing to be processed very quickly.

What is Variety?

Data comes from a wide variety of sources and resides in many different formats.

What is Veracity?

Data quality can vary greatly, affecting analysis.

What is a node?

A server belonging to a Big Data network.

What is a Cluster?

Set of servers in the same Big Data network.

Tasks within a Big Data architecture

Collect data, store data, process data, and analyze the results.

Data Storage task

Store the volume and variety of data in a cost-effective manner.

Information Processing

Meet a wide variety of processing requirements.

Integration task

Integrate data from various sources into storage in real-time.

Security task

Authentication, authorization, accounting, data protection

Study Notes

  • Introduction to Big Data presented by Pr. Kaoutar EL HANDRI on 2/16/2025.
  • The plan includes:
  • Introduction to Big Data.
  • Hadoop.
  • Hadoop Ecosystem.
  • NoSQL Databases.
  • ML & Big Data.

Big Data Overview

  • 2.5 quintillion bytes (2.5 × 10^18) of data are generated every day, and 90% of the world's data was created in the last two years.
  • Big Data sources include:
  • Human activity like emails, photos, video, logs, and likes.
  • Machine activity from sensors.
  • Meters such as electric meters, vehicles, and household appliances.
  • Institution and company data like schedules and regional statistics.
  • Open APIs such as Twitter and Google.
  • Within five years, more than 50 billion smart connected devices will exist.
  • It was estimated that by 2020 a third of the world's data would pass through the cloud.
  • The Hadoop market was forecast to grow at a compound annual rate of 58%, surpassing $1 billion by 2020.
  • By 2016, 73% of organizations had invested or planned to invest in big data infrastructure.
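
The compound annual growth figure above can be made concrete with a little arithmetic. The sketch below applies the standard compound growth formula, value × (1 + rate)^years, to a starting value of 100. That starting value is a hypothetical index chosen for illustration, not a market figure from the presentation.

```python
def compound_growth(start_value, annual_rate, years):
    """Value after `years` of growth at a fixed compound annual rate."""
    return start_value * (1 + annual_rate) ** years


# Hypothetical starting index of 100; only the 58% rate comes from the notes.
for year in range(4):
    print(year, round(compound_growth(100, 0.58, year), 1))
```

At 58% per year the value more than doubles every two years, which is why a market can reach a large size from a small base within a short forecast window.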

Big Data Benefits

  • Big data supports risk-aware decisions based on real-time transactional data.
  • Big data helps detect customer feelings and reactions, and helps hospitals detect critical, life-threatening conditions in patients.
  • Big data can predict weather patterns, and identify criminals and threats from video, audio and data streams.
  • It also predicts student success based on gathered statistics and patterns (Big Data in Education domain).
  • It also helps predict the rate at which people interact during a pandemic.

What is Big Data

  • Big data is a collection of large datasets that cannot be processed using traditional computing techniques.
  • Big data has become a complete subject, which involves various tools, techniques and frameworks.
  • Big data involves data created by different devices and applications.
  • Big Data is typically stored in non-relational (NoSQL) databases rather than traditional relational ones.
  • The notion of Big Data started in 1997 and the technology in 2003.

How Big Data was born

  • The United States is the precursor of Big Data through Google, Yahoo, and Apache.
  • Google established its leadership as a search engine in 2000.
  • In 2003, Google published a paper on the Google File System, revealing the first secrets of its success.
  • Google published its MapReduce paper in 2004.
  • Doug Cutting and Michael Cafarella created Nutch Search Engine in 2005.
  • Hadoop was created in 2006.

Why Big Data is Important

  • Big Data can play a significant economic and environmental role
  • Potential beneficiaries:
  • Government agencies
  • National economies
  • Multinational companies
  • Private companies
  • Small and medium-sized enterprises
  • Individuals
  • Many organizations discard up to 80% of the data they generate.
  • Critical business decisions are often based on data in relational databases, which often represents less than 20% of all generated business data.
  • "Garbage In, Garbage Out" (GIGO) means that the quality of output is determined by the quality of input.

Big Data Use Cases

  • 360° View of the Customer
  • Fraud Prevention
  • Data Warehouse Offload
  • Price Optimization
  • Recommendation Engines
  • Social Media Analysis and Response (e.g., rumors)
  • Preventive Maintenance and Support
  • Internet of Things
  • Cross-selling (offering complementary items) and upselling (offering more expensive products) to customers.
  • Detecting whether a customer might defect to a competitor.
  • Suggesting potential discounts that could lower the customer’s rate.
  • Suggesting appropriate responses or services to sales staff by analyzing customers’ language to detect their current emotions.
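
As a rough illustration of how cross-selling candidates might be found, the sketch below counts how often pairs of items appear in the same basket. The basket data and the function name `complementary_items` are hypothetical, and production recommendation engines use considerably more sophisticated methods.

```python
from collections import Counter
from itertools import combinations


def complementary_items(baskets, item, top_n=3):
    """Return the items most often bought together with `item`."""
    pair_counts = Counter()
    for basket in baskets:
        # Count each unordered pair of distinct items once per basket.
        for a, b in combinations(sorted(set(basket)), 2):
            pair_counts[(a, b)] += 1
    related = Counter()
    for (a, b), n in pair_counts.items():
        if a == item:
            related[b] += n
        elif b == item:
            related[a] += n
    return [name for name, _ in related.most_common(top_n)]


# Hypothetical purchase history, invented for illustration only.
baskets = [
    ["phone", "case", "charger"],
    ["phone", "case"],
    ["laptop", "mouse"],
    ["phone", "charger", "case"],
]
print(complementary_items(baskets, "phone"))
```

Items that co-occur most often with the chosen product are returned first, so they are natural candidates to suggest at checkout.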

Fraud Prevention with Big Data

  • Without big data, credit card issuers relied on basic rules-based systems to flag possible fraud.
  • For example, a customer service agent might call to confirm a transaction if the card of a customer who lives in Hawaii was used to rent a car in Casablanca.
  • Big data-driven fraud prevention systems can use data such as recent purchases of airline tickets, sunscreen, and a new swimsuit to better determine fraud likelihood.
  • Historical patterns and predictive analytics can further discern fraud potential.
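
The contrast between the two approaches can be sketched in a few lines. This is a toy example under stated assumptions: the transaction fields, country codes, and the fixed list of "travel hint" purchases are all hypothetical, and a real system would rely on statistical models rather than a hand-written list.

```python
def rule_based_flag(txn, home_country):
    """Basic rule: flag any transaction made outside the home country."""
    return txn["country"] != home_country


def context_aware_flag(txn, home_country, recent_purchases):
    """Flag a foreign transaction only if no recent purchase suggests travel."""
    if txn["country"] == home_country:
        return False
    # Hypothetical travel indicators, taken from the example in the notes.
    travel_hints = {"airline ticket", "sunscreen", "swimsuit"}
    return not any(p in travel_hints for p in recent_purchases)


# Customer lives in the US (Hawaii); the card is used for a car rental in Morocco.
txn = {"item": "car rental", "country": "MA"}
recent = ["airline ticket", "sunscreen", "swimsuit"]

print(rule_based_flag(txn, "US"))             # flagged by the simple rule
print(context_aware_flag(txn, "US", recent))  # not flagged once context is considered
```

The simple rule raises an alert for every trip abroad, while the context-aware check stays quiet when the customer's recent purchases are consistent with travel.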

Big Data Use Cases in Aerospace Engineering

  • Aircraft Design Optimization: Computational tools and AI enhance aerodynamics, reduce fuel consumption, and improve overall aircraft performance.
  • Spacecraft Navigation: Develop algorithms and systems for precise orbital maneuvers, planetary landings, and deep-space exploration.
  • Predictive Maintenance: IoT and machine learning are used to detect potential faults in aircraft systems.
  • Flight Simulation: Virtual environments train pilots and test aircraft under various conditions without physical risk.
  • Material Innovation: Lightweight, high-strength materials enhance fuel efficiency and structural integrity.
  • Autonomous Drones: UAVs designed for reconnaissance, delivery, and environmental monitoring.
  • Noise Reduction: Technologies minimize aircraft noise during takeoff and landing.

Security Intelligence

  • Organizations use big data analytics to stop hackers and cyberattackers.
  • Servers commonly generate log files and these log files are analyzed.
  • Organizations analyze internal and external log data to prevent and detect attacks in real time.
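
A minimal sketch of this kind of log analysis is shown below. It assumes a simplified, hypothetical log format of one `<ip> <event>` pair per line and flags addresses with repeated failed logins; the format, the `LOGIN_FAILED` event name, and the threshold are illustrative assumptions, not part of the presentation.

```python
from collections import Counter


def suspicious_ips(log_lines, threshold=3):
    """Return the IPs with at least `threshold` failed logins, sorted."""
    failures = Counter()
    for line in log_lines:
        # Assumed line format: "<ip> <event>", e.g. "10.0.0.5 LOGIN_FAILED".
        ip, event = line.split(maxsplit=1)
        if event.strip() == "LOGIN_FAILED":
            failures[ip] += 1
    return sorted(ip for ip, n in failures.items() if n >= threshold)


# Hypothetical log excerpt, invented for illustration only.
logs = [
    "10.0.0.5 LOGIN_FAILED",
    "10.0.0.7 LOGIN_OK",
    "10.0.0.5 LOGIN_FAILED",
    "10.0.0.9 LOGIN_FAILED",
    "10.0.0.5 LOGIN_FAILED",
]
print(suspicious_ips(logs))
```

At big data scale the same counting would run continuously over streams of log records from many servers rather than over a small in-memory list.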

Big Data in Healthcare and Life Sciences

  • Problem: Large amounts of real-time information come from wireless monitoring devices used at home by postoperative patients and those with chronic illnesses.
  • Big Data Analytics can help with intensive care unit monitoring, remote patient monitoring, and early epidemic alerts.

Limitations of the Traditional Approach

  • Limitations:
  • Structured data (tables).
  • Normal forms
  • It requires DBMS optimization and complex database management.
  • Datamining
  • Statistical methods for knowledge extraction.
  • A model is built first and then validated; data is sampled so that it fits in memory.
  • Distributed computing and HPC (High Performance Computing).
  • Concentrated effort on "computationally intensive" problems.

Distributed Computing & Google Solutions

  • Execution of a computer process on a multitude of different machines (a cluster of machines) in a transparent way.
  • Scalability: new machines can be added for the calculation if necessary.
  • Heterogeneity: the machines may have different architectures.
  • Fault tolerance: a faulty machine in the cluster should not cause the overall computation to fail.
  • Transparency: the cluster as a whole must be usable as a single "traditional" machine.
  • Google uses an algorithm called MapReduce, which divides a task into small parts, assigns those parts to many computers connected over the network, and collects the results to form the final result dataset.
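
The divide, assign, and collect steps can be sketched with the classic word-count example. This is a single-process simulation in plain Python, not Google's or Hadoop's implementation: the function names `map_phase`, `shuffle`, and `reduce_phase` are illustrative, and on a real cluster each chunk would be handled by a different node.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Group all emitted values by key between the map and reduce steps."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values collected for each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


# The input is split into small parts; here each part is one string.
chunks = ["big data is big", "data is everywhere"]
mapped = [map_phase(c) for c in chunks]  # each call could run on its own machine
result = reduce_phase(shuffle(mapped))
print(result)
```

Because each map call depends only on its own chunk, the work can be spread across as many machines as there are chunks, and only the grouped intermediate pairs need to travel over the network.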

Solutions Diagram

  • The diagram shows the following Hadoop ecosystem components and their connections:
  • Oozie.
  • HCatalog.
  • Pig.
  • Hive.
  • Mahout.
  • Drill.
  • Avro.
  • Sqoop.
  • Flume.
  • ZooKeeper.
  • MapReduce.
  • YARN.
  • HBase.
  • HDFS.

The 4 Vs of Big Data

  • The concept of Big Data is now often used to define an idea/methodology
  • The 4 Vs are:
  • Volume: Enormous quantities of data measured in terabytes or petabytes.
  • Velocity: New data is created rapidly and needs to be processed quickly.
  • Variety: Data comes from a wide variety of sources and resides in many different formats, such as text files, images, video, audio files, presentations, spreadsheets, email messages, and databases.
  • Veracity: The quality of captured data can vary greatly, affecting the accuracy of analysis.

Another View of The Vs of Big Data

  • Volume: Size of data
  • Velocity: The speed at which data is generated
  • Validity: Data Quality
  • Variability: Dynamic behavior
  • Variety: Types of data
  • Veracity: Data accuracy
  • Venue: Distributed Heterogeneous Data from Multiple Platforms
  • Vocabulary: Data Models
  • Value: Useful data
  • Vagueness: Confusion over the meaning of Big Data

Big Data Architecture

  • Node: server belonging to a Big Data network
  • Cluster: set of servers in the same Big Data network
  • Nodes can be physically heterogeneous (different hardware configurations).
  • Big Data management systems manage this heterogeneity.
  • Unlimited storage
  • Automatic replication management
  • Collect data efficiently.
  • Store data in an efficient and cost-effective manner.
  • Process the data in real-time.
  • Analyze the results.
  • Store the volume and variety of data in a cost-effective manner
  • Meet a wide variety of information processing requirements, including batch processing, ad hoc queries, real-time stream processing, search, and machine learning.
  • Security: Authentication, authorization, accounting, data protection
  • Operations: Provision, manage, monitor, and schedule resources
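
To make the idea of automatic replication across nodes more tangible, the sketch below places each data block on several distinct nodes and checks that every block survives the loss of one node. The round-robin placement, the replication factor of 3, and the node and block names are assumptions made for illustration; real systems use more elaborate placement policies.

```python
def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds cluster size")
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def available_blocks(placement, failed_node):
    """Blocks that still have at least one replica after one node fails."""
    return [block for block, holders in placement.items()
            if any(node != failed_node for node in holders)]


# Hypothetical four-node cluster and three data blocks.
cluster = ["node1", "node2", "node3", "node4"]
blocks = ["b0", "b1", "b2"]
placement = place_replicas(blocks, cluster)
print(placement)
print(available_blocks(placement, "node2"))
```

Since every block lives on more than one node, a single faulty machine does not make any data unavailable, which is the fault tolerance property a cluster is expected to provide.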

Big Data Challenges

  • Lack of knowledgeable professionals to run modern technologies and tools.
  • Lack of understanding of Massive Data
  • Data Growth Issues
  • Confusion during tool selection
  • Integrating Data from a Spread of Sources
  • Securing Data
  • Network problems in connecting datacenters
  • Example: at Facebook, "configuration changes on the back-end routers which coordinate network traffic between data centers" interrupted communication between its data centers.
  • Facebook lost more than $60 million in revenue.
  • On July 19, 2024, travelers at LaGuardia Airport in New York were stranded due to a global IT outage.
  • Airport operations were disrupted, passengers waited for flight updates, and terminals were crowded.
  • The technological failure affected systems worldwide.

Description

Introduction to Big Data, its sources, and the increasing amount of data generated daily. Covers human and machine activity, open APIs, and forecasts for Hadoop market growth and cloud data usage.
