Hadoop Lecture
Summary
This lecture introduces Hadoop, a framework for processing large datasets. It covers key concepts, features, and resources. The lecture discusses the challenges of storing and processing big data, and how Hadoop addresses them.
Full Transcript
Hadoop Lecture

Key Questions to Answer
- Why Hadoop?
- What is Hadoop?
- How to Hadoop?
- Examples of Hadoop

What is Big Data?
Wikipedia: big data – an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications.

Data Creation Growth Projections

Who is generating Big Data?
- Social media
- User tracking & engagement
- Homeland security
- eCommerce
- Financial services
- Real-time search

That is a lot of data …

What are the Key Features of Big Data? The 4 Vs:
- Volume: petabyte-scale data (e.g., social media)
- Velocity: high sensor and event throughput
- Variety: structured, semi-structured, and unstructured data
- Veracity: unclean, imprecise, unclear data

Philosophy to Scale for Big Data: Divide and Conquer
- Divide work
- Combine results

Distributed processing is non-trivial:
- How to assign tasks to different workers in an efficient way?
- What happens if tasks fail?
- How do workers exchange results?
- How to synchronize distributed tasks allocated to different workers?

Big data storage is challenging:
- Data volumes are massive
- Reliably storing PBs of data is hard
- All kinds of failures: disk, hardware, and network failures
- The probability of failure increases with the number of machines …

One popular solution: Hadoop
Hadoop Cluster at Yahoo! (Credit: Yahoo)

Hadoop offers:
- Redundant, fault-tolerant data storage
- A parallel computation framework
- Job coordination

Programmers no longer need to worry about:
- Q: Where is the file located?
- Q: How to handle failures & data loss?
- Q: How to divide computation?
- Q: How to program for scaling?
A little history on Hadoop
- Hadoop is an open-source implementation based on the Google File System (GFS) and MapReduce papers from Google
- Hadoop was created by Doug Cutting and Mike Cafarella in 2005
- Hadoop was donated to Apache in 2006

Hadoop Stack
- Computation: MapReduce
- Storage: HDFS

Hadoop Resources
- Hadoop at ND: http://ccl.cse.nd.edu/operations/hadoop/
- Apache Hadoop Documentation: http://hadoop.apache.org/docs/current/
- Data-Intensive Text Processing with MapReduce: http://lintool.github.io/MapReduceAlgorithms/
- Hadoop: The Definitive Guide: http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520