Hadoop and HDFS PDF - IBM Training Material 2018

Summary

This document provides training material from IBM on Hadoop and HDFS (Hadoop Distributed File System). It covers topics such as the need for big data strategies, Hadoop architecture, hardware considerations, parallel data processing, and the advantages and disadvantages of using Hadoop. The training material includes a summary and checkpoint questions.

Full Transcript

Hadoop and HDFS
Data Science Foundations
© Copyright IBM Corporation 2018. Course materials may not be reproduced in whole or in part without the written permission of IBM.

Unit objectives
§ Understand the basic need for a big data strategy in terms of parallel reading of large data files and internode network speed in a cluster
§ Describe the nature of the Hadoop Distributed File System (HDFS)
§ Explain the function of the NameNode and DataNodes in a Hadoop cluster
§ Explain how files are stored and blocks ("splits") are replicated

Introduction to Big Data and Data Analytics © Copyright IBM Corporation 2018

The importance of Hadoop
"We believe that more than half of the world's data will be stored in Apache Hadoop within five years" - Hortonworks

Hardware improvements through the years
CPU speeds:
§ 1990: 44 MIPS at 40 MHz
§ 2010: 147,600 MIPS at 3.3 GHz
RAM memory:
§ 1990: 640K conventional memory (256K extended memory recommended)
§ 2010: 8-32GB (and more)
Disk capacity:
§ 1990: 20MB
§ 2010: 1TB
Disk latency (speed of reads and writes): not much improvement in the last 7-10 years, currently around 70-80 MB/sec
How long does it take to read 1TB of data (at 80 MB/sec)?
§ 1 disk: 3.4 hrs
§ 10 disks: 20 min
§ 100 disks: 2 min
§ 1000 disks: 12 sec

What hardware is not used for Hadoop*
RAID
§ Not suitable for the data in a large cluster
§ Wanted: Just-a-Bunch-Of-Disks (JBOD): multiple hard disk drives operated as individual, independent drives
Linux Logical Volume Manager (LVM)
§ HDFS and GPFS are already abstractions that run on top of disk filesystems - LVM is an abstraction that can obscure the real disk
Solid-state disk (SSD)
§ Low latency of SSD is not useful for streaming file data
§ Low storage capacity for high cost (currently not commodity hardware)
*RAID is often used on Master Nodes (but never Data Nodes) as part of fault tolerance mechanisms

Parallel data processing is the answer
There has been:
§ GRID computing: spreads processing load
§ distributed workload: hard to manage applications, overhead on developer
§ parallel databases: Db2 DPF, Teradata, Netezza, etc. (distribute the data)
Challenges:
§ heterogeneity
§ openness
§ security
§ scalability
§ concurrency
§ fault tolerance
§ transparency
Distributed computing: multiple computers appear as one supercomputer, communicate with each other by message passing, and operate together to achieve a common goal

What is Hadoop?
Apache open source software framework for reliable, scalable, distributed computing over massive amounts of data:
§ hides underlying system details and complexities from the user
§ developed in Java
Consists of 4 subprojects:
§ MapReduce
§ Hadoop Distributed File System (HDFS)
§ YARN
§ Hadoop Common
Supported by many Apache/Hadoop-related projects:
§ HBase, ZooKeeper, Avro, etc.
Meant for heterogeneous commodity hardware.

Hadoop open source projects
Hadoop is supplemented by an extensive ecosystem of open source projects.
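The disk read-time figures on the hardware slide earlier (3.4 hours down to 12 seconds for 1TB) follow from simple arithmetic: total size divided by the aggregate transfer rate of the disks read in parallel. A minimal sketch, assuming the slide's 80 MB/sec per-disk rate and a decimal 1 TB; the function name is illustrative, not Hadoop code:

```python
# Why parallel reading is the answer: ideal time to read 1 TB at
# ~80 MB/sec per disk when the data is spread over N disks read in parallel.

TOTAL_MB = 1_000_000   # 1 TB (decimal) expressed in MB
RATE_MB_S = 80         # sustained transfer rate per disk, from the slide

def read_time_seconds(total_mb, disks, rate_mb_s=RATE_MB_S):
    """Ideal parallel read time, ignoring seek and coordination overhead."""
    return total_mb / (disks * rate_mb_s)

for disks in (1, 10, 100, 1000):
    secs = read_time_seconds(TOTAL_MB, disks)
    print(f"{disks:>4} disks: {secs / 3600:5.1f} h  ({secs:9.1f} s)")
# 1 disk comes out at ~3.5 h and 1000 disks at ~12.5 s, matching the slide.
```

The point of the sketch: adding disks (and the nodes they live in) is the only lever that meaningfully reduces read time, since per-disk latency has barely improved.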
Advantages and disadvantages of Hadoop
Hadoop is good for:
§ processing massive amounts of data through parallelism
§ handling a variety of data (structured, unstructured, semi-structured)
§ using inexpensive commodity hardware
Hadoop is not good for:
§ processing transactions (random access)
§ work that cannot be parallelized
§ low-latency data access
§ processing lots of small files
§ intensive calculations with small amounts of data

Timeline for Hadoop

Hadoop: Major components
Hadoop Distributed File System (HDFS)
§ where Hadoop stores data
§ a file system that spans all the nodes in a Hadoop cluster
§ links together the file systems on many local nodes to make them into one large file system that spans all the data nodes of the cluster
MapReduce framework
§ how Hadoop understands and assigns work to the nodes (machines)
§ evolving: MR v1, MR v2, etc.
§ morphed into YARN and other processing frameworks

Brief introduction to HDFS and MapReduce
Driving principles:
§ data is stored across the entire cluster
§ programs are brought to the data, not the data to the program
Data is stored across the entire cluster (the DFS):
§ the entire cluster participates in the file system
§ blocks of a single file are distributed across the cluster
§ a given block is typically replicated as well for resiliency
[Diagram: a logical file divided into four numbered blocks; each block is replicated and distributed across the nodes of the cluster]

Hadoop Distributed File System (HDFS) principles
Distributed, scalable, fault tolerant, high throughput
Data access through MapReduce
Files split into blocks (aka splits)
3 replicas for each piece of data by default
Can create, delete, and copy, but cannot update
Designed for streaming reads, not random access
Data locality is an important concept: processing data on or near the physical storage to decrease transmission of data

HDFS architecture
Master/Slave architecture
Master: NameNode
§ manages the file system namespace and metadata (FsImage, Edits Log)
§ regulates client access to files
Slave: DataNode
§ many per cluster
§ manages storage attached to the nodes
§ periodically reports status to NameNode
[Diagram: a NameNode holding the metadata for File1 (blocks a, b, c, d), with DataNodes each storing a subset of the replicated blocks]

HDFS blocks
HDFS is designed to support very large files.
Each file is split into blocks: the Hadoop default is 128 MB.
Blocks reside on different physical DataNodes.
Behind the scenes, each HDFS block is supported by multiple operating system blocks.
[Diagram: one 128 MB HDFS block backed by many smaller OS blocks]
If a file or a chunk of the file is smaller than the block size, only the needed space is used. For example, with a 64 MB block size, a 210 MB file is split as: 64 MB + 64 MB + 64 MB + 18 MB.

HDFS replication of blocks
Blocks of data are replicated to multiple nodes:
§ behavior is controlled by the replication factor, configurable per file
§ default is 3 replicas
Approach:
§ first replica goes on any node in the cluster
§ second replica on a node in a different rack
§ third replica on a different node in the second rack
The approach cuts inter-rack network bandwidth, which improves write performance.

Setting rack network topology (Rack Awareness)
Rack topology is defined by a script which specifies which node is in which rack, where the "rack" is the network switch to which the node is connected, not the metal framework where the nodes are physically stacked. The script is referenced by the net.topology.script.file.name property in the Hadoop configuration file core-site.xml. For example:
net.topology.script.file.name = /etc/hadoop/conf/rack-topology.sh
The network topology script receives as arguments one or more IP addresses of nodes in the cluster. It returns on stdout a list of rack names, one for each input. One simple approach is to use IP addressing of 10.x.y.z, where x = cluster number, y = rack number, z = node within rack, and an appropriate script to decode this into y/z.

Compression of files
File compression brings two benefits:
§ reduces the space needed to store files
§ speeds up data transfer across the network or to/from disk
But is the data splittable? (necessary for parallel reading)
Use codecs, such as org.apache.hadoop.io.compress.SnappyCodec.

Compression format | Algorithm | Filename extension | Splittable?
DEFLATE            | DEFLATE   | .deflate           | No
gzip               | DEFLATE   | .gz                | No
bzip2              | bzip2     | .bz2               | Yes
LZO                | LZO       | .lzo / .cmx        | Yes, if indexed in preprocessing
LZ4                | LZ4       | .lz4               | No
Snappy             | Snappy    | .snappy            | No

Which compression format should I use?
Most to least effective:
§ use a container format (SequenceFile, Avro, ORC, or Parquet)
§ for files, use a fast compressor such as LZO, LZ4, or Snappy
§ use a compression format that supports splitting, such as bz2 (slow), or one that can be indexed to support splitting, such as LZO
§ split files into chunks and compress each chunk separately using a supported compression format (whether it is splittable does not matter) - choose a chunk size so that compressed chunks are approximately the size of an HDFS block
§ store files uncompressed
Take advantage of compression - but the compression format should depend on file size, data format, and tools used.

NameNode startup
The NameNode does preserve some state about the files (name, path, size, block size, block IDs, etc.), just not the physical location of the blocks.
1. NameNode reads fsimage into memory
2. NameNode applies edits log changes
3. NameNode waits for block data from the DataNodes
§ NameNode does not store the physical-location information of the blocks
§ NameNode exits SafeMode when 99.9% of blocks have at least one copy accounted for
SafeMode for the NameNode is essentially a read-only mode for the HDFS cluster, where it does not allow any modifications to the file system or blocks. Normally the NameNode leaves SafeMode automatically after the DataNodes have reported that most file system blocks are available.
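The startup sequence and SafeMode rule described above can be sketched in a few lines. This is a toy model, not Hadoop code: the data structures and operation names are ours, and the 99.9% constant mirrors the slide (in real HDFS it is the dfs.namenode.safemode.threshold-pct setting):

```python
# Toy model of NameNode startup: (1) read fsimage, (2) replay the edits log,
# (3) wait for DataNode block reports before leaving SafeMode.
# Note: block *locations* are not in the fsimage; they come only from reports.

SAFEMODE_THRESHOLD = 0.999  # "99.9% of blocks have at least one copy"

def start_namenode(fsimage, edits_log, reported_blocks):
    # 1. Read the checkpointed namespace (path -> list of block IDs) into memory.
    namespace = dict(fsimage)
    # 2. Apply each logged change on top of the checkpoint.
    for op, path, blocks in edits_log:
        if op == "add":
            namespace[path] = blocks
        elif op == "delete":
            namespace.pop(path, None)
    # 3. Count how many known blocks have at least one replica reported.
    all_blocks = {b for blocks in namespace.values() for b in blocks}
    accounted = all_blocks & set(reported_blocks)
    in_safemode = len(accounted) < SAFEMODE_THRESHOLD * max(len(all_blocks), 1)
    return namespace, in_safemode

# The checkpoint knows /old; the edits log adds /new and deletes /old.
ns, safemode = start_namenode(
    fsimage={"/old": ["blk_1"]},
    edits_log=[("add", "/new", ["blk_2", "blk_3"]), ("delete", "/old", None)],
    reported_blocks=["blk_2"],   # only one of the two live blocks reported
)
print(ns)        # {'/new': ['blk_2', 'blk_3']}
print(safemode)  # True: 1 of 2 blocks accounted for, below 99.9%
```

The sketch shows why the NameNode must wait on the DataNodes even though the namespace itself is fully recovered from fsimage plus the edits log.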
[Diagram: (1) fsimage is read into the NameNode; (2) the edits log is read and applied; (3) DataNodes send block information to the NameNode]

NameNode files (as stored on the NameNode's local file system)
[root@vm hdfs]# cd /hadoop/hdfs
[root@vm hdfs]# ls -l
total 12
drwxr-x---. 3 hdfs hadoop 4096 Apr 20 05:45 data
drwxr-xr-x. 3 hdfs hadoop 4096 Apr 20 05:45 namenode
drwxr-xr-x. 3 hdfs hadoop 4096 Apr 20 05:45 namesecondary
[root@vm hdfs]# ls -l namenode/current
total 53296
...
-rw-r--r--. 1 hdfs hadoop 1048576 Apr 17 04:00 edits_0000000000000047319-0000000000000048279
-rw-r--r--. 1 hdfs hadoop 1048576 Apr 17 04:33 edits_0000000000000048280-0000000000000048317
-rw-r--r--. 1 hdfs hadoop 1048576 Apr 17 04:50 edits_0000000000000048318-0000000000000048340
-rw-r--r--. 1 hdfs hadoop 1048576 Apr 17 05:46 edits_0000000000000048341-0000000000000048809
-rw-r--r--. 1 hdfs hadoop 1048576 Apr 20 05:00 edits_0000000000000048810-0000000000000049264
-rw-r--r--. 1 hdfs hadoop 1048576 Apr 20 05:17 edits_0000000000000049265-0000000000000049322
-rw-r--r--. 1 hdfs hadoop 1048576 Apr 20 05:44 edits_inprogress_0000000000000049323
-rw-r--r--. 1 hdfs hadoop 1322806 Apr 16 23:39 fsimage_0000000000000046432
-rw-r--r--. 1 hdfs hadoop      62 Apr 16 23:39 fsimage_0000000000000046432.md5
-rw-r--r--. 1 hdfs hadoop 1394702 Apr 20 04:11 fsimage_0000000000000048809
-rw-r--r--. 1 hdfs hadoop      62 Apr 20 04:11 fsimage_0000000000000048809.md5
-rw-r--r--. 1 hdfs hadoop       6 Apr 20 05:37 seen_txid
-rw-r--r--. 1 hdfs hadoop     207 Apr 20 04:11 VERSION

Adding a file to HDFS: replication pipelining
1. File is added to NameNode memory by persisting info in the edits log
2. Data is written in blocks to DataNodes
§ DataNode starts a chained copy to two other DataNodes
§ if at least one write for each block succeeds, the write is successful
[Diagram: (1) the client talks to the NameNode, which determines which DataNodes will store the replicas of each block; the edits log is changed in memory and on disk; (2) the client API sends the data block to the first DataNode; (3) the first DataNode daisy-chain-writes to the second, and the second to the third, with an acknowledgment back to the previous node; the first DataNode then confirms to the NameNode that replication is complete]

Managing the cluster*
Adding a data node:
§ start the new DataNode (pointing to the NameNode)
§ if required, run the balancer to rebalance blocks across the cluster: hadoop balancer
Removing a node:
§ simply remove the DataNode
§ better: add the node to the exclude file and wait till all blocks have been moved
§ can be checked in the server admin console: server:50070
Checking filesystem health:
§ use: hadoop fsck

HDFS-2 NameNode HA (High Availability)
HDFS-2 adds NameNode High Availability.
The Standby NameNode needs filesystem transactions and block locations for fast failover.
Every filesystem modification is logged to at least 3 quorum JournalNodes by the active NameNode:
§ the Standby NameNode applies changes from the JournalNodes as they occur
§ a majority of the JournalNodes defines reality
§ split brain is avoided by the JournalNodes (they will only allow one NameNode to write to them)
DataNodes send block locations and heartbeats to both NameNodes.
The memory state of the Standby NameNode is very close to that of the Active NameNode:
§ much faster failover than a cold start
[Diagram: JournalNode1, JournalNode2, and JournalNode3 above an Active NameNode and a Standby NameNode; DataNode1 through DataNodeX report to both NameNodes]

Secondary NameNode
During operation the primary NameNode cannot merge the fsimage and edits log. This is done on the Secondary NameNode:
§ every couple of minutes, the Secondary NameNode copies the new edits log from the primary NameNode
§ merges the edits log into the fsimage
§ copies the new merged fsimage back to the primary NameNode
Not HA, but faster startup time:
§ the Secondary NameNode does not have a complete image; in-flight transactions would be lost
§ the primary NameNode needs to merge less during startup
Was temporarily deprecated because of NameNode HA but has some advantages:
§ no need for quorum nodes, less network traffic, fewer moving parts
[Diagram: new edits log entries are copied from the primary NameNode to the Secondary NameNode; the merged fsimage is copied back]

Possible FileSystem setup approaches
Hadoop 2 with HA:
§ no single point of failure
§ wide community support
Hadoop 2 without HA (or Hadoop 1.x in older versions):
§ copy the namedir to NFS (RAID)
§ have a virtual IP for the backup NameNode
§ still some failover time to read blocks; no instant failover, but less overhead

Federated NameNode (HDFS2)
New in Hadoop 2: NameNodes (NN) can be federated.
§ Historically, NameNodes can become a bottleneck on huge clusters
§ One million blocks or ~100TB of data require roughly one GB of RAM in the NN
Blockpools:
§ the administrator can create separate blockpools/namespaces with different NNs
§ DataNodes register on all NNs
§ DataNodes store data of all blockpools (otherwise set up separate clusters)
§ a new ClusterID identifies all NNs in a cluster
§ a namespace and its block pool together are called a Namespace Volume
§ you define which blockpool to use by connecting to a specific NN
§ each NameNode still has its own separate backup/secondary/checkpoint node
Benefits:
§ one NN failure will not impact other blockpools
§ better scalability for large numbers of file operations

fs: file system shell (1 of 4)
File System Shell (fs):
§ invoked as follows: hadoop fs
§ example: listing the current directory in HDFS: hadoop fs -ls .
§ note that the current directory is designated by dot ("."), the "here" symbol in Linux/UNIX
§ if you want the root of the HDFS file system, you would use slash ("/")

fs: file system shell (2 of 4)
FS shell commands take URIs as arguments:
§ URI format: scheme://authority/path
Scheme:
§ for the local filesystem, the scheme is file
§ for HDFS, the scheme is hdfs
Authority is the hostname and port of the NameNode:
hadoop fs -copyFromLocal file:///myfile.txt hdfs://localhost:9000/user/virtuser/myfile.txt
Scheme and authority are often optional:
§ defaults are taken from the configuration file core-site.xml

fs: file system shell (3 of 4)
Many POSIX-like commands:
§ cat, chgrp, chmod, chown, cp, du, ls, mkdir, mv, rm, stat, tail
Some HDFS-specific commands:
§ copyFromLocal, put, copyToLocal, get, getmerge, setrep
copyFromLocal / put:
§ copy files from the local file system into fs: hadoop fs -copyFromLocal <localsrc> <dst> or hadoop fs -put <localsrc> <dst>

fs: file system shell (4 of 4)
copyToLocal / get:
§ copy files from fs into the local file system: hadoop fs -copyToLocal [-ignorecrc] [-crc] <src> <localdst> or hadoop fs -get [-ignorecrc] [-crc] <src> <localdst>
Creating a directory: mkdir
hadoop fs -mkdir /newdir

Unit summary
§ Understand the basic need for a big data strategy in terms of parallel reading of large data files and internode network speed in a cluster
§ Describe the nature of the Hadoop Distributed File System (HDFS)
§ Explain the function of the NameNode and DataNodes in a Hadoop cluster
§ Explain how files are stored and blocks ("splits") are replicated

Checkpoint
1. True or False? Hadoop systems are designed for transaction processing.
2. List the Hadoop open source projects.
3. What is the default number of replicas in a Hadoop system?
4. True or False? One of the driving principles of Hadoop is that the data is brought to the program.
5. True or False? At least 2 NameNodes are required for a standalone Hadoop cluster.