HADOOP Pig.pptx
Lakehead University
Hadoop Pig
by Abed Alkhateeb, 2023

Pig
- Apache Pig is an abstraction over MapReduce.
- It is a tool/platform used to analyze large data sets by representing them as data flows.
- Pig is generally used with Hadoop; all data manipulation operations in Hadoop can be performed using Apache Pig.

Where not to use Pig?
- Completely unstructured data (video, audio, human-readable text).
- Performance-critical jobs: Pig is slow compared to hand-written MapReduce.
- When you need more power to optimize your code.
- Real-time ETL tasks.
- Pinpointing a single record in a large dataset.

Where to use Pig?
- Pig is a data flow language.
- It runs on top of Hadoop to create complex jobs that process large volumes of data quickly and efficiently.
- Pig consumes any type of data: structured, semi-structured, and unstructured.
- Pig code is considerably easier and faster to write than Java code for preprocessing tasks.
- Pig provides the common data operations (filter, join, ordering) in a single pipeline.
- Pig provides nested data types (bags, tuples, and maps).

Features of Pig
- Rich set of operators: it provides many operators to perform operations like join, sort, filter, etc.
- Ease of programming: Pig Latin is similar to SQL, so a Pig script is easy to write if you are good at SQL.
- Optimization opportunities: tasks in Apache Pig optimize their execution automatically, so programmers need to focus only on the semantics of the language.
- Extensibility: using the existing operators, users can develop their own functions to read, process, and write data.
- UDFs: Pig provides the facility to create user-defined functions in other programming languages such as Java and to invoke or embed them in Pig scripts.
- Handles all kinds of data: Apache Pig analyzes all kinds of data, both structured and unstructured.
It stores the results in HDFS.

Applications of Apache Pig
- To process huge data sources such as web logs
- To perform data processing for search platforms
- To process time-sensitive data loads
- Quick prototyping of algorithms

Use cases
- ICM UW: discovering knowledge from large collections of documents (academic papers).
- LinkedIn: discovering "people you may know".
- Mendeley: finding trending keywords over time.
- Nokia (OVI Maps): exploring unstructured data sets from logs, database dumps, and data feeds.
- Real web: computing statistics on …

Pig Architecture

Pig Basic Structure
- Script: Pig can run a script file that contains Pig commands. E.g. Pig runs the commands in script.pig.
- Grunt: an interactive shell for running Pig commands. Grunt can also run Pig scripts using run and exec.
- Embedded: Pig can be run from within Java programs.

Pig Execution Modes

Four Basic Data Types
- Atom: a simple atomic value (int, long, double, string). ex: ‘Abhi’
- Tuple: a sequence of fields, each of which can be any of the data types. ex: (‘Abhi’, 14)
- Bag: a collection of tuples of potentially varying structures; it can contain duplicates.
- Map: an associative array; the key must be a chararray, but the value can be any type.
- ex (surviving fragments of the Bag and Map examples): (‘Manu’, … (14, {(‘Abhi’), …

Complex Data Types

Pig Latin
- Pig provides a high-level language known as Pig Latin for writing data analysis programs.
- The language provides various operators with which programmers can develop their own functions for reading, writing, and processing data.
- To analyze data using Apache Pig, programmers write scripts in Pig Latin.
- All these scripts are internally converted to MapReduce tasks.
- Apache Pig has a component known as the Pig Engine that accepts Pig Latin scripts as input and converts them into MapReduce jobs.

Pig Latin Relational Operators

Sample Dataset

Conceptual Data Flow

Pig Script

Most Clicked Pages by Pig Latin

Nulls

Diagnostic Operators & UDF Statements
Pig Latin diagnostic operators:
- DESCRIBE: prints a relation's schema.
- EXPLAIN: prints the logical and physical plans.
- ILLUSTRATE: shows a sample execution of the logical plan, using a generated subset of the input.
Pig Latin UDF statements:
- REGISTER: registers a JAR file with the Pig runtime.
- DEFINE: creates an alias for a UDF, streaming script, or command specification.

DESCRIBE
Use the DESCRIBE operator to review the fields and data types of a relation.

EXPLAIN
Use the EXPLAIN operator to review the logical, physical, and MapReduce execution plans that are used to compute the specified relation.
The logical plan shows a pipeline of operators to be executed to build the relation. Type checking and backend-independent optimizations (such as applying filters early on) also apply.

EXPLAIN: Logical Plan

EXPLAIN: Physical Plan
The physical plan shows how the logical operators are translated to backend-specific physical operators. Some backend optimizations also apply.
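
The remark about backend-independent optimizations "such as applying filters early on" can be made concrete with a small sketch. The code below is plain Python, not Pig, and does not reflect how Pig's optimizer is implemented. The sample records, the `join_on_user` helper, and the `is_adult` predicate are all hypothetical names chosen for illustration. It only shows why moving a filter ahead of a join is safe (same result) and cheaper (fewer rows reach the join).

```python
# Toy illustration in plain Python (not Pig): applying a filter before a join
# gives the same result as filtering afterwards, but the join sees fewer rows.

# Hypothetical sample data: (user, url) click records and (user, age) profiles.
clicks = [("amy", "/home"), ("bob", "/cart"), ("amy", "/cart"), ("cal", "/home")]
users = [("amy", 25), ("bob", 17), ("cal", 40)]


def join_on_user(left, right):
    """Nested-loop join on the first field; also reports how many pairs were compared."""
    joined = [(u, url, age) for (u, url) in left for (v, age) in right if u == v]
    return joined, len(left) * len(right)


def is_adult(age):
    """Hypothetical filter predicate used by both plans."""
    return age >= 18


# Plan A: join first, then filter the joined rows.
joined_all, compared_a = join_on_user(clicks, users)
plan_a = [row for row in joined_all if is_adult(row[2])]

# Plan B: filter the users first, then join the smaller input.
adults = [(u, age) for (u, age) in users if is_adult(age)]
plan_b, compared_b = join_on_user(clicks, adults)

print(plan_a == plan_b)        # both plans produce the same rows
print(compared_a, compared_b)  # plan B compares fewer pairs than plan A
```

In Pig itself the programmer would not rewrite the script this way by hand; the point of the logical plan shown by EXPLAIN is that such reordering can be applied automatically when it does not change the result.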
EXPLAIN: MapReduce Plan

ILLUSTRATE
The ILLUSTRATE command is used to demonstrate a "good" example input data set, judged by three measurements:
1. Completeness
2. Conciseness
3. Degree of realism

Pig Latin - Creating UDF

Run UDF

Where to find useful Pig Latin scripts?
PiggyBank - Pig’s repository of user-contributed functions
- load/store functions (e.g. from XML)
- datetime and text functions
- math and stats functions
DataFu - LinkedIn's collection of Pig UDFs
- statistics functions (quantiles, variance, etc.)
- convenient bag functions (intersection, union, etc.)
- utility functions (assertions, random numbers, MD5, distance between lat/long pairs), PageRank

References
- Pig Tutorial, https://cwiki.apache.org/confluence/display/PIG/PigTutorial, last accessed Oct 5, 2023.
- https://pig.apache.org/docs/latest/test.html#:~:text=Use%20the%20ILLUSTRATE%20operator%20to,Example%20Data%20for%20Dataflow%20Programs, last accessed Oct 5, 2023.

Q&A