COS20028_BDAA_week06.pdf

Full Transcript


COS20028 – BIG DATA ARCHITECTURE AND APPLICATION
Dr Pei-Wei Tsai (Lecturer, Tutor, and Unit Convenor)
[email protected], EN508d
Week 6: Partitioners and Data Format

PARTITIONER

MapReduce Pipeline
◦ With 1 Mapper, 1 Combiner and 1 Reducer: Input -> Mapper -> Combiner -> Reducer -> Output.
◦ With 3 Mappers, 3 Combiners and 1 Reducer: each Mapper feeds its own Combiner, and all Combiner outputs flow into the single Reducer.
◦ With 3 Mappers, 3 Combiners and 2 Reducers: something must decide which Reducer each intermediate pair goes to. That component is the Partitioner.
◦ The Partitioner determines which Reducer each intermediate key-value pair goes to.

The Typical Outlook of the Partitioner
◦ Partitioner(inter_key, inter_value, num_reducers) -> partition

The Default Partitioner
◦ The default Partitioner is the HashPartitioner.
◦ It uses the Java hashCode method.
◦ It guarantees that all pairs with the same key go to the same Reducer: the output of the hash function for the same input key is always the same.

Example of a Hash Partitioner
◦ Output from Mapper: (Bird, 1), (Cat, 1), (Cat, 1), (Dog, 1), (Pig, 1)
◦ After the Partitioner, Reducer 1 receives: (Cat, 1), (Cat, 1)
◦ Reducer 2 receives: (Bird, 1), (Dog, 1), (Pig, 1)

No. of Reducers You Need
◦ It is customisable according to your design.
◦ By default, a single Reducer is used to gather all results together in the same place.
◦ With a single Reducer, all key-value pairs are collected in sorted order, which is handy for composing a sorted result.
◦ However, it is potentially resource heavy for the node running the Reducer, and the job can take a long time to run.

Jobs which Require a Single Reducer
◦ A single Reducer is required if the job needs to output a single file.
◦ Alternatively, the TotalOrderPartitioner can be used.
  ◦ It uses an externally generated file which contains information about the intermediate key distribution.
  ◦ It partitions data such that all keys which go to the first Reducer are smaller than any which go to the second.
  ◦ In this way, multiple Reducers can be used: concatenating the Reducers' output files results in a totally ordered list.

Jobs which Require a Fixed Number of Reducers
◦ Some jobs will require a specific number of Reducers.
◦ Examples: gathering data by month, by weekday, etc.

Jobs with a Variable Number of Reducers
◦ Many jobs can be run with a variable number of Reducers.
◦ The developer must decide how many to specify, distributing the workload.
◦ Typical way to determine how many Reducers to specify:
  ◦ Test the job with a relatively small test dataset.
  ◦ Extrapolate to calculate the amount of intermediate data expected from the "real" input data.
  ◦ Use that to calculate the number of Reducers which should be specified.
◦ Also estimate the number of Reduce slots likely to be available on the cluster (the slide's drive-through analogy): if more Reducers are specified than the cluster's capacity, the total waiting time can be longer.

WRITE CUSTOMISED PARTITIONERS

Custom Partitioners
◦ In some cases, you will prefer to write your own Partitioner.
◦ For example, your key is a custom WritableComparable which contains a pair of values (a, b), and you decide that all keys with the same value for a need to go to the same Reducer.

[Figure: example pie charts of an income budget split into categories such as Saving, Food, Socialisation, Insurance, Household, Entertainment and Transportation, with the Insurance slice further broken down into Medical, Car and Property insurance.]

Custom Partitioners
◦ Custom Partitioners are needed when performing a secondary sort.
◦ They are also useful for avoiding potential performance issues, such as one Reducer having to deal with many very large lists of values.

Creating a Custom Partitioner
◦ Create a class that extends Partitioner.
◦ Override the getPartition method.
◦ Return an integer between 0 and one less than the number of Reducers.

Using a Custom Partitioner
◦ Specify the custom Partitioner in your driver code.
◦ If you need to set up variables in your Partitioner, implement Configurable.
◦ If a Hadoop object implements Configurable, its setConf() method will be called once, when it is instantiated.
◦ You can therefore set up variables in the setConf() method, which your getPartition() method will then be able to access.

SETTING UP VARIABLES FOR YOUR PARTITIONER: DRIVER CODE

SETTING UP VARIABLES FOR YOUR PARTITIONER: PARTITIONER
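The two "Setting Up Variables for Your Partitioner" slides refer to code listings that were not captured in this transcript. The sketch below is a minimal illustration of the pattern the surrounding slides describe, not the listing from the slides themselves: the pair-key class TextPair, its getFirst() accessor and the property name "partition.case.sensitive" are assumptions made for illustration only.

    import org.apache.hadoop.conf.Configurable;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Sketch only: TextPair and "partition.case.sensitive" are assumed, not from the slides.
    public class FirstFieldPartitioner extends Partitioner<TextPair, IntWritable>
            implements Configurable {

        private Configuration conf;
        private boolean caseSensitive;   // variable set up once in setConf()

        @Override
        public void setConf(Configuration conf) {
            // Called once, when the Partitioner is instantiated.
            this.conf = conf;
            this.caseSensitive = conf.getBoolean("partition.case.sensitive", true);
        }

        @Override
        public Configuration getConf() {
            return conf;
        }

        @Override
        public int getPartition(TextPair key, IntWritable value, int numReduceTasks) {
            // Hash only the first half of the (a, b) key, so all keys that share
            // the same "a" value go to the same Reducer.
            String first = key.getFirst().toString();
            if (!caseSensitive) {
                first = first.toLowerCase();
            }
            // Returns an integer between 0 and one less than the number of Reducers.
            return (first.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }

In the driver, the Partitioner and the property it reads would be registered with calls along the lines of job.getConfiguration().setBoolean("partition.case.sensitive", false), job.setPartitionerClass(FirstFieldPartitioner.class) and job.setNumReduceTasks(2).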
DATA INPUT AND OUTPUT

Data Types in Hadoop
◦ Writable: defines a de/serialisation protocol. Every data type in Hadoop is a Writable.
◦ WritableComparable: defines a sort order. All keys must be WritableComparable.
◦ IntWritable, LongWritable, Text, ...: concrete classes for the different data types.

"Box" Classes in Hadoop
◦ Hadoop's built-in data types are "box" classes. They contain a single piece of data:
  ◦ Text: string
  ◦ IntWritable: int
  ◦ LongWritable: long
  ◦ FloatWritable: float
  ◦ etc.
◦ Writable defines the wire transfer format: how the data is serialised and deserialised.

Creating a Complex Writable
◦ Assume we want a tuple (a, b) as the value emitted from a Mapper. We could artificially construct it by:

    Text t = new Text(a + "," + b);
    ...
    String[] arr = t.toString().split(",");

◦ This is inelegant, problematic (e.g. if a or b contained commas) and not always practical (it doesn't easily work for binary objects).
◦ Solution: create your own Writable object.

The Writable Interface

    public interface Writable {
        void readFields(DataInput in) throws IOException;
        void write(DataOutput out) throws IOException;
    }

◦ The readFields() and write() methods define how your custom object will be serialised and deserialised by Hadoop.
◦ The DataInput and DataOutput interfaces support:
  ◦ boolean
  ◦ byte, char (Unicode: 2 bytes)
  ◦ double, float, int, long
  ◦ String (Unicode or UTF-8)
  ◦ line until line terminator
  ◦ unsigned byte, short
  ◦ byte array

Example Custom Writable: DateWritable
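The code listing for the DateWritable slide above was not captured in the transcript. The following is a minimal sketch, assuming the same three int fields (month, day, year) that appear in the WritableComparable version later in the lecture; it shows only the serialisation side, and the listing on the slide may have differed.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    // Minimal sketch of a custom Writable holding a date as three ints.
    public class DateWritable implements Writable {
        int month, day, year;

        public DateWritable() { }   // no-arg constructor, needed because Hadoop instantiates Writables by reflection

        public DateWritable(int month, int day, int year) {
            this.month = month;
            this.day = day;
            this.year = year;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            // Serialise the fields in a fixed order.
            out.writeInt(month);
            out.writeInt(day);
            out.writeInt(year);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            // Deserialise in exactly the same order as write().
            month = in.readInt();
            day = in.readInt();
            year = in.readInt();
        }
    }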
Binary Objects
◦ Solution: use byte arrays.
◦ Write idiom: serialise the object to a byte array, write the byte count, then write the byte array.
◦ Read idiom: read the byte count, create a byte array of the proper size, read the byte array, then deserialise the object.

WritableComparable
◦ WritableComparable is a sub-interface of Writable.
◦ A class implementing it must provide the compareTo, hashCode and equals methods.
◦ All keys in MapReduce must be WritableComparable.

Making DateWritable a WritableComparable

    class DateWritable implements WritableComparable<DateWritable> {
        int month, day, year;
        // Constructors omitted for brevity

        public void readFields(DataInput in) throws IOException { ... }
        public void write(DataOutput out) throws IOException { ... }

        public boolean equals(Object o) {
            if (o instanceof DateWritable) {
                DateWritable other = (DateWritable) o;
                return this.year == other.year
                    && this.month == other.month
                    && this.day == other.day;
            }
            return false;
        }

        public int compareTo(DateWritable other) {
            if (this.year != other.year) {
                return (this.year < other.year ? -1 : 1);
            } else if (this.month != other.month) {
                return (this.month < other.month ? -1 : 1);
            } else if (this.day != other.day) {
                return (this.day < other.day ? -1 : 1);
            }
            return 0;
        }

        public int hashCode() {
            int seed = 123;
            return this.year * seed + this.month * seed + this.day * seed;
        }
    }

Using Custom Types in MapReduce Jobs
◦ Use methods in Job to specify your custom key/value types.
◦ For the output of Mappers:

    job.setMapOutputKeyClass(...);
    job.setMapOutputValueClass(...);

◦ For the output of Reducers:

    job.setOutputKeyClass(...);
    job.setOutputValueClass(...);

◦ Input types are defined by the InputFormat.

SAVING BINARY DATA USING SEQUENCEFILES AND AVRO DATA FILES

SequenceFiles
◦ SequenceFiles are files containing binary-encoded key-value pairs.
◦ They work naturally with Hadoop data types, and include metadata which identifies the data types of the key and value.
◦ Three file types in one: uncompressed, record-compressed and block-compressed.
◦ Often used in MapReduce, especially when the output of one job will be used as the input for another, via SequenceFileInputFormat and SequenceFileOutputFormat.

Directly Accessing SequenceFiles
◦ It is possible to directly access SequenceFiles from your code:

    Configuration config = new Configuration();
    SequenceFile.Reader reader =
        new SequenceFile.Reader(FileSystem.get(config), path, config);
    Text key = (Text) reader.getKeyClass().newInstance();
    IntWritable value = (IntWritable) reader.getValueClass().newInstance();
    while (reader.next(key, value)) {
        // do something here
    }
    reader.close();

Strong Points and Problems with SequenceFiles
◦ Because they are binary, they are faster to read and write than text-formatted files.
◦ They are typically accessible via the Java API only.
◦ If the definition of the key or value class changes, existing files become unreadable.
◦ Many small files can cause memory overhead, but this can be avoided with proper arrangements.
◦ Record compression compresses much less than block compression, so record-compressed files are larger; this adds overhead and spins up more Mappers.
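Complementing the SequenceFile.Reader snippet above, the following sketch writes a small SequenceFile directly from code. It is not from the slides; it uses the same older-style SequenceFile API as the Reader example, and the output path counts.seq is just a placeholder.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Minimal sketch: write a few key-value pairs into a SequenceFile.
    public class SequenceFileWriteExample {
        public static void main(String[] args) throws IOException {
            Configuration config = new Configuration();
            FileSystem fs = FileSystem.get(config);
            Path path = new Path("counts.seq");   // example output path

            SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, config, path, Text.class, IntWritable.class);
            try {
                // Append key-value pairs; types must match those declared above.
                writer.append(new Text("Cat"), new IntWritable(2));
                writer.append(new Text("Dog"), new IntWritable(1));
            } finally {
                writer.close();
            }
        }
    }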
An Alternative to SequenceFiles: Avro
◦ Apache Avro is a serialisation format which is becoming a popular alternative to SequenceFiles.
◦ Self-describing file format.
◦ Compact file format.
◦ Portable across multiple languages: support for C, C++, Java, Python, Ruby and others.
◦ Compatible with Hadoop, via the AvroMapper and AvroReducer classes.

ISSUES TO CONSIDER WHEN USING FILE COMPRESSION

Hadoop and Compressed Files
◦ Hadoop understands a variety of file compression formats.
◦ If a compressed file is included as one of the files to be processed, Hadoop will automatically decompress it and pass the decompressed contents to the Mapper.
◦ If the file is not in a "splittable" format, it can only be decompressed by starting at the beginning of the file and continuing on to the end.

Non-Splittable File Formats in Hadoop
◦ If the MapReduce framework receives a non-splittable file, it passes the entire file to a single Mapper.
◦ This can result in one Mapper running for far longer than the others.
◦ Typically, it is not a good idea to use GZip to compress MapReduce input files.

Splittable Compression Formats: LZ4
◦ LZ4 is a splittable compression format under certain conditions.
◦ To make an LZ4 file splittable, you must first index the file.
◦ The index file contains information about how to break the LZ4 file into splits that can be decompressed independently.

Common Compression Codecs Supported in Hadoop

    Codec    File extension   Splittable?          Degree of compression   Compression speed
    Gzip     .gz              No                   Medium                  Medium
    Bzip2    .bz2             Yes                  High                    Slow
    Snappy   .snappy          No                   Medium                  Fast
    LZ4      .lz4             No, unless indexed   Medium                  Fast

Source: https://www.dummies.com/programming/big-data/hadoop/compressing-data-in-hadoop/

Compressing Output SequenceFiles with Snappy
◦ Specify output compression in the Job object.
◦ Specify block or record compression; block compression is recommended for the Snappy codec.
◦ Set the compression codec to the Snappy codec in the Job object.

Texts and Resources
◦ Unless stated otherwise, the materials presented in this lecture are taken from:
◦ White, Tom. Hadoop: The Definitive Guide, 4th ed. Sebastopol, CA: O'Reilly, 2015.
◦ Sammer, Eric. Hadoop Operations, 1st ed. Sebastopol, CA: O'Reilly, 2012.
◦ Gates, Alan and Dai, Daniel. Programming Pig: Dataflow Scripting with Hadoop, 2nd ed. Beijing: O'Reilly, 2017.

Unit Summary
◦ MapReduce code explanation.

WEEK 09 SUMMARY
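As a closing illustration of several APIs mentioned in this lecture (registering key/value types in the Job, SequenceFile output, and Snappy block compression), here is a hedged driver sketch. It is not the listing behind the "Compressing Output SequenceFiles with Snappy" slide; the Mapper and Reducer classes (WordMapper, SumReducer) are hypothetical stand-ins, and the wiring shown is one reasonable arrangement rather than the definitive one.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile.CompressionType;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class SnappySequenceFileDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count to compressed SequenceFile");
            job.setJarByClass(SnappySequenceFileDriver.class);

            // Hypothetical Mapper and Reducer classes, stand-ins for earlier examples.
            job.setMapperClass(WordMapper.class);
            job.setReducerClass(SumReducer.class);

            // Key/value types for the Mapper output and the final (Reducer) output;
            // a custom WritableComparable key would be registered the same way.
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Write the output as a SequenceFile, block-compressed with Snappy.
            job.setOutputFormatClass(SequenceFileOutputFormat.class);
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
            SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }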
