Questions and Answers
Approximately how much of the world's data was generated in the two years preceding the creation of the presentation?
- 90% (correct)
- 30%
- 60%
- 10%
Which of the following is a common source of big data, according to the slides?
- Ancient cave paintings
- Telepathic communication
- Human activity like emails and photos (correct)
- Dream analysis
By which year was it estimated that a third of the world's data would pass through the cloud?
- 2020 (correct)
- 2016
- 2025
- 2030
In what year did 73% of organizations already invest or plan to invest in big data infrastructures?
The data from mobile devices and people, in relation to the total size of the digital universe in 2013, was approximately:
In the context of Big Data benefits, what is one way hospitals can use big data?
When was the concept of Big Data initially recognized?
Which of the following is an example of how big data can be applied in education?
When did Google publish its first paper on the Google File System?
What does the acronym GIGO stand for?
Within the context of customer service, Big Data makes which of the following possible?
In Big Data, upselling refers to what?
How can Big Data analysis help in fraud prevention, as suggested in the slides?
What would a predictive maintenance system do in aerospace engineering?
What is the purpose of flight simulation in aerospace engineering, using big data?
What do organizations analyze to prevent and detect attacks in real-time?
In healthcare, what is one of the ways Big Data Analytics can help?
According to the presentation, what is a limitation of the traditional approach to data management?
What does the concept of distributed computing refer to?
What is one of the key aspects of distributed computing?
What is the name of the algorithm that Google used to solve the problem of processing large datasets?
Which component divides a task into small parts, assigns them to multiple computers, and collects results?
Which 'V' of Big Data refers to the enormous quantities of data?
Which 'V' of Big Data refers to the different types and formats of data?
Which of the following describes a node in a Big Data architecture?
What is the name referring to a set of servers within the same Big Data network?
What does 'information processing' meet in Big Data architecture?
What are the major uses of Big Data listed in the presentation?
Which big data V refers to the rate at which data is received?
Which big data V refers to the data's accuracy?
Which of these is NOT a component of the Hadoop Ecosystem?
What is the role of SQOOP in the Hadoop ecosystem?
What is a key element of Datamining?
What is data heterogeneity?
Which of the following is a goal of Hadoop?
What is one way that retailers can use Big Data?
What is the use case for Big Data in the domain of aviation?
What is a good use of Big Data in fraud detection?
Which of the following actions can NOT be completed with Big Data?
What can happen if there are network problems when using Big Data?
What is one of the challenges that is facing Big Data currently?
Flashcards
What is Big Data?
A collection of large datasets that cannot be processed using traditional computing techniques.
What happened in 2003?
Google published a paper on the Google File System and revealed the first secrets of its success.
What happened in 2005?
Doug Cutting and Michael Cafarella, at Yahoo, created Nutch Search Engine inspired by Google.
Why is Big Data Important?
What is cross-selling?
What is Upselling?
Fraud prevention without big data.
Material Innovation
Autonomous Drones
Security Intelligence
What are the 4 Vs of Big Data?
What is Volume?
What is Velocity?
What is Variety?
What is Veracity?
What is a node?
What is a Cluster?
Tasks to work within the architecture of Big Data.
Data Storage task
Information Processing
Integration task
Security task
Study Notes
- Introduction to Big Data presented by Pr. Kaoutar EL HANDRI on 2/16/2025.
- The plan includes:
  - Introduction to Big Data.
  - Hadoop.
  - Hadoop Ecosystem.
  - NoSQL Databases.
  - ML & Big Data.
Big Data Overview
- About 2.5 quintillion bytes (2.5 × 10^18) of data are generated every day, and 90% of the world's data was created in the last two years.
- Big Data sources include:
  - Human activity like emails, photos, video, logs, and likes.
  - Machine activity from sensors.
  - Meters such as electric meters, vehicles, and household appliances.
  - Institution and company data like schedules and regional statistics.
  - Open APIs such as Twitter and Google.
- Within five years, more than 50 billion smart connected devices will exist.
- It was estimated that by 2020 a third of the world's data would pass through the cloud.
- The Hadoop market was forecast to grow at a compound annual rate of 58%, surpassing $1 billion by 2020.
- By 2016, 73% of organizations had invested or were planning to invest in big data infrastructures.
Big Data Benefits
- Big data supports risky decisions by basing them on real-time transactional data.
- Big data helps detect customer feelings and reactions, and in hospitals it helps detect life-threatening conditions.
- Big data can predict weather patterns and identify criminals and threats from video, audio, and data streams.
- In education, it predicts student success based on gathered statistics and patterns.
- It also helps predict the rate at which people interact during a pandemic.
What is Big Data
- Big data is a collection of large datasets that cannot be processed using traditional computing techniques.
- Big data has become a subject in its own right, involving various tools, techniques, and frameworks.
- Big data involves data created by different devices and applications.
- Big data is typically stored in non-relational (NoSQL) databases rather than in traditional relational ones.
- The notion of Big Data dates from 1997; the technology followed in 2003.
How Big Data was born
- The United States pioneered Big Data through Google, Yahoo, and Apache.
- Google established its leadership as a search engine in 2000.
- In 2003, Google published a paper on the Google File System, revealing the first secrets of its success.
- MapReduce was introduced in 2004.
- Doug Cutting and Michael Cafarella created the Nutch search engine in 2005.
- Hadoop was created in 2006.
Why Big Data is Important
- Big Data can play a significant economic and environmental role.
- Potential beneficiaries:
  - Government agencies
  - National economies
  - Multinational companies
  - Private companies
  - Small and medium-sized enterprises
  - Individuals
- Many organizations discard up to 80% of the data they generate.
- Critical business decisions are often based on data in relational databases, which represents less than 20% of all generated business data.
- "Garbage In, Garbage Out" (GIGO) means that the quality of output is determined by the quality of input.
Big Data Use Cases
- 360° View of the Customer
- Fraud Prevention
- Data Warehouse Offload
- Price Optimization
- Recommendation Engines
- Social Media Analysis and Response (ex. rumors)
- Preventive Maintenance and Support
- Internet of Things
- Cross-selling (offering complementary items) and upselling (offering more expensive products) to customers.
- Big data can detect whether a customer might defect to a competitor.
- It can suggest potential discounts that could lower the customer’s rate.
- It can suggest appropriate responses or services to sales staff by analyzing customers’ language to detect their current emotions.
Fraud Prevention with Big Data
- Without big data, credit card issuers used basic rules-based systems to flag possible fraud.
- For example, a customer service agent might call to confirm a transaction if a credit card was used to rent a car in Casablanca while the cardholder lives in Hawaii.
- Big data-driven fraud prevention systems can use data such as past purchases of airline tickets, sunscreen, and a new swimsuit to better determine fraud likelihood.
- Historical patterns and predictive analytics can further discern fraud potential.
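The contrast between a basic rule and a data-driven check can be sketched in a few lines of Python. This is only an illustration: the `Transaction` and `Customer` classes, the `TRAVEL_SIGNALS` set, and the numeric weights are all assumptions made for the example, not part of the presentation or of any real fraud system.

```python
from dataclasses import dataclass, field


@dataclass
class Transaction:
    amount: float
    location: str  # where the card was used


@dataclass
class Customer:
    home_location: str
    recent_purchases: list[str] = field(default_factory=list)


# Hypothetical purchase categories that hint at upcoming travel.
TRAVEL_SIGNALS = {"airline ticket", "sunscreen", "swimsuit"}


def rule_based_flag(tx: Transaction, customer: Customer) -> bool:
    """Basic rule: flag any transaction made away from the home location."""
    return tx.location != customer.home_location


def context_aware_score(tx: Transaction, customer: Customer) -> float:
    """Return a fraud likelihood in [0, 1], lowered by travel-related purchases."""
    if tx.location == customer.home_location:
        return 0.0
    score = 0.9  # assumed starting score for an out-of-area transaction
    matches = TRAVEL_SIGNALS & set(customer.recent_purchases)
    score -= 0.25 * len(matches)  # assumed weight per travel signal
    return max(score, 0.0)


traveler = Customer("Hawaii", ["airline ticket", "sunscreen", "swimsuit"])
homebody = Customer("Hawaii", ["groceries"])
rental = Transaction(amount=300.0, location="Casablanca")

print(rule_based_flag(rental, traveler))      # the simple rule flags both customers
print(context_aware_score(rental, traveler))  # lower score: travel was expected
print(context_aware_score(rental, homebody))  # higher score: no sign of travel
```

The rule alone treats both customers the same way, while the score separates the customer whose purchase history suggests a trip from the one whose history does not.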
Big Data Use Cases in Aerospace Engineering
- Aircraft Design Optimization: Computational tools and AI enhance aerodynamics, reduce fuel consumption, and improve overall aircraft performance.
- Spacecraft Navigation: Develop algorithms and systems for precise orbital maneuvers, planetary landings, and deep-space exploration.
- Predictive Maintenance: IoT and machine learning are used to detect potential faults in aircraft systems.
- Flight Simulation: Virtual environments train pilots and test aircraft under simulated conditions without physical risk.
- Material Innovation: Lightweight, high-strength materials enhance fuel efficiency and structural integrity.
- Autonomous Drones: UAVs designed for reconnaissance, delivery, and environmental monitoring.
- Noise Reduction: Technologies minimize aircraft noise during takeoff and landing.
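As a rough illustration of the predictive maintenance idea, the sketch below flags sensor readings that deviate strongly from the readings just before them. The function name `detect_anomalies`, the window size, the threshold, and the vibration values are all hypothetical; real systems would typically use trained models rather than this simple rolling z-score.

```python
from statistics import mean, stdev


def detect_anomalies(readings, window=5, threshold=3.0):
    """Return indices whose reading deviates strongly from the preceding window."""
    anomalies = []
    for i in range(window, len(readings)):
        history = readings[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat history: any change at all counts as a deviation.
            if readings[i] != mu:
                anomalies.append(i)
            continue
        if abs(readings[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical engine vibration readings (arbitrary units) with one spike.
vibration = [1.0, 1.1, 0.9, 1.0, 1.05, 1.0, 0.95, 4.0, 1.0, 1.1]
print(detect_anomalies(vibration))  # → [7]
```

Only the spike at index 7 is reported; the readings after it are compared against a window that already contains the spike, so they are not flagged.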
Security Intelligence
- Organizations use big data analytics to stop hackers and cyberattackers.
- Servers routinely generate log files, which are then analyzed.
- Organizations analyze internal and external log data to prevent and detect attacks in real time.
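A minimal sketch of log analysis, assuming a made-up log format of `<timestamp> <ip> <event>` and a hypothetical `LOGIN_FAILED` event name: it counts failed logins per address and reports the addresses that reach a threshold. Production systems work on streaming data at much larger scale, but the counting idea is the same.

```python
from collections import Counter

# Hypothetical log lines in the assumed "<timestamp> <ip> <event>" format.
LOG_LINES = [
    "2024-01-01T10:00:01 10.0.0.5 LOGIN_FAILED",
    "2024-01-01T10:00:02 10.0.0.5 LOGIN_FAILED",
    "2024-01-01T10:00:03 10.0.0.9 LOGIN_OK",
    "2024-01-01T10:00:04 10.0.0.5 LOGIN_FAILED",
    "2024-01-01T10:00:05 10.0.0.7 LOGIN_FAILED",
    "malformed line",
]


def suspicious_ips(lines, max_failures=3):
    """Return the sorted IPs with at least `max_failures` failed logins."""
    failures = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip lines that do not match the assumed format
        _, ip, event = parts
        if event == "LOGIN_FAILED":
            failures[ip] += 1
    return sorted(ip for ip, count in failures.items() if count >= max_failures)


print(suspicious_ips(LOG_LINES))  # → ['10.0.0.5']
```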
Big Data in Healthcare and Life Sciences
- Problem: Large amounts of real-time information come from wireless monitoring devices that postoperative patients and patients with chronic illnesses use at home.
- Big Data Analytics can help by monitoring intensive care units, supporting remote monitoring, and providing early epidemic alerts.
Limitations of the Traditional Approach
- Limitations:
  - Structured data (tables).
  - Normal forms.
  - It requires DBMS optimization and complex database management.
- Datamining:
  - Statistical methods for knowledge extraction.
  - A model is built first and then validated; data is sampled so that it fits in memory.
- Distributed computing and HPC (High Performance Computing):
  - Effort is concentrated on "computationally intensive" problems.
Distributed Computing & Google Solutions
- Distributed computing is the execution of a computer process on many different machines (a cluster of machines) in a transparent way.
- Scalability: new machines can be added for the calculation if necessary.
- Heterogeneity: the machines may have different architectures.
- Fault tolerance: a faulty machine in the cluster should not produce an error.
- Transparency: the cluster as a whole must be usable as a single "traditional" machine.
- Google uses an algorithm called MapReduce, which divides a task into small parts, assigns those parts to many computers connected over the network, and collects the results to form the final result dataset.
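The divide, assign, and collect steps of MapReduce can be sketched with a word count, the usual teaching example. This version runs in a single Python process, so each text chunk only stands in for the part of the task that a real cluster would send to a separate machine; the function names are chosen for the illustration and are not Google's or Hadoop's API.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def map_reduce(chunks):
    # Each chunk stands in for the part of the task sent to one machine.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


chunks = ["big data is big", "data is everywhere"]
print(map_reduce(chunks))  # → {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

Because each call to `map_phase` depends only on its own chunk, the map step could run on many machines at once, which is what makes the approach scale.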
Solutions Diagram
- The diagram shows the following solutions and their connections:
  - Oozie.
  - HCatalog.
  - Pig.
  - Hive.
  - Mahout.
  - Drill.
  - Avro.
  - Sqoop.
  - Flume.
  - ZooKeeper.
  - MapReduce.
  - YARN.
  - HBase.
  - HDFS.
The 4 Vs of Big Data
- The concept of Big Data is now often used to define an idea/methodology.
- The 4 Vs are:
  - Volume: Enormous quantities of data, measured in terabytes or petabytes.
  - Velocity: New data is created rapidly and needs to be processed quickly.
  - Variety: Data comes from a wide variety of sources and resides in many different formats, such as text files, images, video, audio files, presentations, spreadsheets, email messages, and databases.
  - Veracity: The quality of captured data can vary greatly, affecting the accuracy of analysis.
Another View of The Vs of Big Data
- Volume: Size of data
- Velocity: The speed at which data is generated
- Validity: Data Quality
- Variability: Dynamic behavior
- Variety: Types of data
- Veracity: Data accuracy
- Venue: Distributed Heterogeneous Data from Multiple Platforms
- Vocabulary: Data Models
- Value: Useful data
- Vagueness: Confusion over the meaning of Big Data
Big Data Architecture
- Node: server belonging to a Big Data network
- Cluster: set of servers in the same Big Data network
- Nodes can be physically heterogeneous (different hardware configurations).
- Big Data management systems manage this heterogeneity.
- Unlimited storage
- Automatic replication management
- Collect data efficiently.
- Store data in an efficient and cost-effective manner.
- Process the data in real-time.
- Analyze the results.
- Data storage: Store the volume and variety of data in a cost-effective manner.
- Information processing: Meet a wide variety of information processing requirements, including batch processing, ad hoc queries, real-time stream processing, search, and machine learning.
- Security: Authentication, authorization, accounting, data protection
- Operations: Provision, manage, monitor, and schedule resources
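To make the node, cluster, and replication ideas concrete, here is a small sketch under stated assumptions. The `Node` and `Cluster` classes, the least-loaded placement rule, and the replication factor of 3 are choices made for this example and do not describe how any particular Big Data system places data.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    capacity_gb: int  # nodes may differ in hardware; not used for placement here
    blocks: set = field(default_factory=set)


class Cluster:
    def __init__(self, nodes, replication=3):
        self.nodes = list(nodes)
        self.replication = replication  # assumed number of copies per block

    def store(self, block_id):
        """Place copies of a block on the least-loaded distinct nodes."""
        if self.replication > len(self.nodes):
            raise ValueError("not enough nodes for the replication factor")
        by_load = sorted(self.nodes, key=lambda node: len(node.blocks))
        targets = by_load[: self.replication]
        for node in targets:
            node.blocks.add(block_id)
        return [node.name for node in targets]

    def locate(self, block_id, failed=()):
        """Return the nodes still holding a block, ignoring failed ones."""
        return [node.name for node in self.nodes
                if block_id in node.blocks and node.name not in failed]


cluster = Cluster([Node("n1", 500), Node("n2", 1000), Node("n3", 2000), Node("n4", 250)])
print(cluster.store("block-1"))                  # → ['n1', 'n2', 'n3']
print(cluster.locate("block-1", failed={"n1"}))  # → ['n2', 'n3']
```

Because every block lives on several nodes, losing one node still leaves readable copies, which is the point of automatic replication.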
Big Data Challenges
- Lack of knowledgeable professionals to run modern technologies and tools.
- Lack of understanding of Massive Data
- Data Growth Issues
- Confusion during tool selection
- Integrating Data from a Spread of Sources
- Securing Data
- Network problems in connecting datacenters
- At Facebook, "configuration changes on the back-end routers which coordinate network traffic between data centers" interrupted communication.
- Facebook lost more than $60 million in revenue.
- On July 19, 2024, travelers at LaGuardia Airport in New York were stranded due to a global IT outage.
- Airport operations were disrupted, passengers waited for flight updates, and terminals were crowded.
- The technological failure affected systems worldwide.
Description
Introduction to Big Data, its sources, and the increasing amount of data generated daily. Covers human and machine activity, open APIs, and forecasts for Hadoop market growth and cloud data usage.