COS20028 Big Data Architecture and Application PDF
Document Details
Swinburne University of Technology
Dr Pei-Wei Tsai
Summary
The document contains lecture notes for COS20028, Big Data Architecture and Application. It covers development tips and techniques for MapReduce programming in Hadoop: debugging strategies, testing with LocalJobRunner and in pseudo-distributed mode, writing and viewing log files with log4j, retrieving job information with counters, reuse of objects, and Map-only jobs.
Full Transcript
COS20028 – BIG DATA ARCHITECTURE AND APPLICATION
Dr Pei-Wei Tsai (Lecturer, Tutor, and Unit Convenor)
[email protected], EN508d
Week 5: Writing MapReduce Program – III

2 DEVELOPMENT TIPS AND TECHNIQUES

3 Strategies for Debugging MapReduce Code
◦ Debugging MapReduce code is not easy. Each instance of a Mapper runs as a separate task (often on a different machine), so it is difficult to attach a debugger to the process and difficult to catch "edge cases".
◦ Very large volumes of data mean that unexpected input is likely to appear; code which expects all data to be well-formed is likely to fail.

4 Debugging Tips
◦ Test locally whenever possible, with sampled small-scale data.
◦ Start small and build incrementally.
◦ Make as much of your code as possible Hadoop-agnostic; this makes it easier to test.
◦ Write unit tests.
◦ Test in pseudo-distributed mode.
◦ Finally, test on the cluster.
◦ Code defensively: ensure that input data is in the expected format, expect things to go wrong, and catch exceptions.

5 Testing Strategies
◦ When testing in pseudo-distributed mode, ensure that you are testing with an environment similar to that on the real cluster:
◦ Same amount of RAM allocated to the task JVMs.
◦ Same Hadoop version.
◦ Same Java version.
◦ Same third-party library versions.

6 Testing Locally Using LocalJobRunner
◦ Hadoop can run MapReduce in a single, local process. This does not require any Hadoop daemons to be running and uses the local file system instead of HDFS.
◦ This is known as LocalJobRunner mode and is a very useful way of quickly testing incremental changes to code.

7 Testing via LocalJobRunner
◦ To run in LocalJobRunner mode, add the appropriate configuration lines to the driver code (one form applies to Hadoop 1.x only, the other to Hadoop 2.x and above); a driver sketch is given after these slide notes.
◦ Or, set these options on the command line if your driver uses ToolRunner:
◦ -fs is equivalent to -D fs.default.name
◦ -jt is equivalent to -D mapred.job.tracker (Hadoop 1.x only)
◦ hadoop jar yourjar.jar YourDriver -fs=file:/// -jt=local input output

8 Limitations of LocalJobRunner Mode
◦ The Distributed Cache does NOT work.
◦ The job can only specify a single Reducer.
◦ Some "beginner" mistakes may not be caught. For example, attempting to share data between Mappers will work (which would not be valid on a real cluster) because the code is running in a single JVM.
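The driver configuration lines referred to on slide 7 appear only as screenshots in the original slides, so they are not in the transcript. The following is a minimal sketch of what such a ToolRunner-based driver might look like, assuming the standard Hadoop property names (mapred.job.tracker and fs.default.name for Hadoop 1.x, mapreduce.framework.name and fs.defaultFS for Hadoop 2.x and above); YourMapper and YourReducer are hypothetical stand-ins for your own classes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class YourDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();

        // Hadoop 1.x only: point the job at the local job tracker and the local file system.
        // conf.set("mapred.job.tracker", "local");
        // conf.set("fs.default.name", "file:///");

        // Hadoop 2.x and above: run the MapReduce job in-process on the local file system.
        conf.set("mapreduce.framework.name", "local");
        conf.set("fs.defaultFS", "file:///");

        Job job = Job.getInstance(conf, "LocalJobRunner test");
        job.setJarByClass(YourDriver.class);
        job.setMapperClass(YourMapper.class);     // hypothetical Mapper class
        job.setReducerClass(YourReducer.class);   // hypothetical Reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new YourDriver(), args));
    }
}

Because the driver goes through ToolRunner, the same job can also be forced into local mode from the command line with -fs=file:/// -jt=local, as shown on slide 7, without changing the code.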
9 LocalJobRunner Mode in Eclipse
◦ In the VM used in the lab, Eclipse can also run Hadoop code in LocalJobRunner mode from within the IDE. This is Hadoop's default behaviour when no configuration is provided.
◦ This allows rapid development iterations (agile programming).

10 LocalJobRunner Mode in Eclipse
◦ Find the corresponding setting in Eclipse.

11 LocalJobRunner Mode in Eclipse
◦ Select Java Application and then click the "New" button.
◦ Verify that the Project and Main Class fields are pre-filled correctly.

12 LocalJobRunner Mode in Eclipse
◦ Specify values in the Arguments tab: the local input/output files and any configuration options needed when your job runs.
◦ Define breakpoints if desired.
◦ Execute the application in run mode or debug mode.

13 LocalJobRunner Mode in Eclipse
◦ Review the output in the Eclipse console window.

14 WRITING AND VIEWING LOG FILES

15 Using stdout and stderr
◦ A tried-and-true debugging technique is to write to stdout or stderr.
◦ If running in LocalJobRunner mode, you will see the results of System.err.println() on your console.
◦ If running on a cluster, that output will not appear on your console; it is visible via Hadoop's Web UI.

16 Hadoop YARN Job Tracking
◦ Hadoop YARN is an integrated platform which provides the job-tracking function.
◦ When executing your job, you will see a URL pop up in the displayed information. Right-click the URL and click "Open Link" to see the tracking result.

17–18 LOGFILE VIA WEB UI

19 Why Logging is Better than Printing
◦ println statements rapidly become awkward: turning them on and off in your code is tedious and leads to errors.
◦ Logging provides much finer-grained control over what gets logged, when something gets logged, and how something is logged.

20 log4j Library
◦ Hadoop uses log4j to generate all its log files.
◦ Your Mappers and Reducers can also use log4j; all the initialisation is handled for you by Hadoop.
◦ Add the log4j .jar file from your CDH distribution to your classpath when you reference the log4j classes.

21 log4j Library
◦ Simply send strings to loggers, tagged with severity levels:
LOGGER.trace("message");
LOGGER.debug("message");
LOGGER.info("message");
LOGGER.warn("message");
LOGGER.error("message");

22–24 USING LOG4J TO STORE LOGS
◦ https://logging.apache.org/log4j/2.x/manual/api.html

25 Avoiding Expensive Operations
◦ By sending a parameter via the ToolRunner, we can decide whether to display the debugging information without modifying the code.

26 Avoiding Expensive Operations
◦ Put a logger in the Reducer for outputting information (a combined logging sketch is given after these slide notes).

27 CHECK THE LOG CONTENT

28 Find Your Job ID

29 LOGFILE VIA WEB UI

30–33 Find the Job in Web UI
◦ Click "Logs" to view the log file.

34 Find the Job in Web UI
◦ Mapper log file.

35 Find the Job in Web UI
◦ Reducer log file.

36 Log Output Restrictions
◦ If you suspect the input data of being faulty, you may be tempted to log the (key, value) pairs your Mapper receives. This is reasonable for small amounts of input data, but if your job runs across 500GB of input data, you could be writing up to 500GB of log files. Remember to think at scale when doing so.
◦ Instead, wrap vulnerable sections of code in try {…} blocks and write logs in the catch {…} blocks; this way you only log the critical data.

37 RETRIEVING JOB INFORMATION WITH COUNTERS

38 What are Counters?
◦ Counters provide a way for Mappers or Reducers to pass aggregate values back to the driver after the job has completed. Their values are also visible from the JobTracker's Web UI, and they are reported on the console when the job ends.
◦ They are very basic: just a name and a value. The value can be incremented within the code.

39 What are Counters?
◦ Counters are collected into Groups; within a group, each Counter has a name.
◦ Example: a group of counters called RecordType with the names TypeA, TypeB, and TypeC. The appropriate Counter is incremented as each record is read in the Mapper.

40 Counters
◦ Counters are set and incremented via the method:
context.getCounter(group, name).increment(amount);
◦ Example:
context.getCounter("RecordType", "A").increment(1);

41 What are Counters?
◦ To retrieve Counters in the Driver code after the job is complete, use code such as the examples below (a fuller sketch follows these slide notes):
long typeARecords = job.getCounters().findCounter("RecordType", "A").getValue();
long typeBRecords = job.getCounters().findCounter("RecordType", "B").getValue();

42 Important!
◦ Do NOT rely on a counter's value from the Web UI while a job is running. Due to possible speculative execution, a counter's value could appear larger than the actual final value; modifications to counters from subsequently killed or failed tasks will be removed from the final count.

43 REUSE OF OBJECTS

44 Reuse of Objects
◦ It is generally good practice to reuse objects instead of creating many new objects. Creating new objects increases the usage of RAM and the overhead of memory allocation.

45–47 The Alternative
◦ A better practice is to create objects that are frequently used, but whose values do not need to be kept for long, outside of the method (a sketch is given after these slide notes).

48 MAP-ONLY JOB

50 Map-Only MapReduce Jobs
◦ Many types of jobs in practice actually fit the case where only a Mapper is needed.
◦ Examples: file format conversion, input data sampling, image processing, and Extract, Transform, Load (ETL) data processing.

51 Map-Only MapReduce Jobs
◦ To create a Map-only job, set the number of Reducers to 0 in the Driver code with job.setNumReduceTasks(0);
◦ Call the job.setOutputKeyClass and job.setOutputValueClass methods to specify the output types.
◦ Anything written using the context.write method in the Mapper will be written to HDFS, rather than being passed on as intermediate data. One file per Mapper will be written.
◦ A driver sketch is given after these slide notes.
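Slides 22–26 show the log4j usage only as screenshots. The sketch below ties together slides 20–26 and 36: a hypothetical SumReducer logs through log4j, switches its expensive per-key debug output on and off via a made-up job property name, job.debug, passed through ToolRunner as -D job.debug=true, and logs malformed input only from a catch block.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.log4j.Logger;

public class SumReducer extends Reducer<Text, Text, Text, IntWritable> {

    private static final Logger LOGGER = Logger.getLogger(SumReducer.class);

    private boolean debugOn = false;   // controlled from the command line, not by editing code

    @Override
    protected void setup(Context context) {
        // "job.debug" is an invented property name for illustration; pass it with
        // ToolRunner, e.g.  hadoop jar yourjar.jar YourDriver -D job.debug=true input output
        debugOn = context.getConfiguration().getBoolean("job.debug", false);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (Text value : values) {
            try {
                sum += Integer.parseInt(value.toString());
            } catch (NumberFormatException e) {
                // Log only the problem records, not every (key, value) pair (slide 36).
                LOGGER.warn("Malformed value for key " + key + ": " + value, e);
            }
        }
        if (debugOn) {
            // Per-key output is skipped entirely unless the flag is set (slides 25-26).
            LOGGER.debug("key=" + key + " sum=" + sum);
        }
        context.write(key, new IntWritable(sum));
    }
}

Even when job.debug is set, the LOGGER.debug() call is still subject to the log4j level configured for the task, which is the finer-grained control described on slide 19.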
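The counter calls on slides 40–41 can be placed in context with a short sketch. The Mapper below is hypothetical, and the rule "records starting with A are TypeA" is invented purely for illustration; only the getCounter and findCounter calls come from the slides.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RecordTypeMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Invented classification rule, for illustration only.
        String recordType = value.toString().startsWith("A") ? "TypeA" : "TypeB";

        // Increment the appropriate Counter in the RecordType group (slide 40).
        context.getCounter("RecordType", recordType).increment(1);

        context.write(new Text(recordType), value);
    }
}

// In the driver, after job.waitForCompletion(true) has returned (slide 41):
long typeARecords = job.getCounters().findCounter("RecordType", "TypeA").getValue();
long typeBRecords = job.getCounters().findCounter("RecordType", "TypeB").getValue();
System.out.println("TypeA=" + typeARecords + ", TypeB=" + typeBRecords);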
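The code on slides 45–47 is likewise shown only as screenshots. Below is a sketch of the pattern they describe, using a word-count style Mapper as an illustrative example: the Writable objects are created once as fields and refilled with set() on every call, instead of being constructed anew inside map().

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Created once per task and reused for every record, instead of
    // allocating new Text/IntWritable objects inside map().
    private final Text word = new Text();
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);              // overwrite the reused object
                context.write(word, ONE);     // avoids new Text(token) and new IntWritable(1)
            }
        }
    }
}

The same change could be applied to the counter Mapper above, turning new Text(recordType) into a reusable field that is set() on each call.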
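Finally, a driver sketch for the Map-only configuration on slide 51. It mirrors the LocalJobRunner driver earlier in these notes; MapOnlyMapper is a hypothetical Mapper class.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MapOnlyDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "map-only example");
        job.setJarByClass(MapOnlyDriver.class);
        job.setMapperClass(MapOnlyMapper.class);   // hypothetical Mapper; no Reducer class is set

        job.setNumReduceTasks(0);                  // 0 Reducers makes this a Map-only job

        // With no Reducers these describe the Mapper's output,
        // which context.write() sends straight to HDFS, one file per Mapper.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MapOnlyDriver(), args));
    }
}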
52 Texts and Resources
◦ Unless stated otherwise, the materials presented in this lecture are taken from:
◦ White, Tom. Hadoop: The Definitive Guide, 4th ed. Sebastopol, CA: O'Reilly, 2015.
◦ Sammer, Eric. Hadoop Operations, 1st ed. Sebastopol, CA: O'Reilly, 2012.
◦ Gates, Alan, and Daniel Dai. Programming Pig: Dataflow Scripting with Hadoop, 2nd ed. Beijing: O'Reilly, 2017.

53 Unit Summary
◦ MapReduce code explanation.

54 WEEK 09 SUMMARY