IT3061 – Massive Data Processing and Cloud Computing
Year 3, Semester 2
Data Processing Practical 2

Working in HDFS

Before running the WordCount example, we need to create an input text file and move it to HDFS. First, create an input test file in your local file system:

    echo "This is a hadoop tutorial test" > wordcount.txt

Next, move this file into HDFS: create a folder, copy the input file from the local filesystem to HDFS, and list the content on HDFS. The -put command below assumes that wordcount.txt was created in /home/cloudera/temp.

    ls
    hdfs dfs -mkdir /user/cloudera/input
    hdfs dfs -put /home/cloudera/temp/wordcount.txt /user/cloudera/input
    hdfs dfs -ls /user/cloudera/input
    Found 1 items
    -rw-r--r-- 1 cloudera cloudera 31 2015-01-15 18:04 /user/cloudera/input/wordcount.txt

Note that on a fresh Cloudera VM there is a “/user” folder in HDFS but not in the local filesystem. This illustrates that the local file system and HDFS are separate, and that the Linux “ls” and the HDFS “-ls” operate on them independently.

    ls /user
    ls: cannot access /user: No such file or directory
    hdfs dfs -ls /user

To see the content of a file on HDFS, use the -cat subcommand:

    hdfs dfs -cat /user/cloudera/input/wordcount.txt
    This is a hadoop tutorial test

Additional Note

For large files, if you want to view just the first or last part, you can pipe the output of the -cat subcommand through your local shell's more or tail. For example:

    hdfs dfs -cat wc-out/* | more

Running the WordCount Example

Now we are going to run a MapReduce example, WordCount, which is commonly used to illustrate how MapReduce works. It returns a list of all the words that appear in a text file together with the number of times each word appears. The output shows each word found and its count, one word per line.

We need to locate the example programs on the sandbox VM. On the Cloudera Quickstart VM, they are packaged in the jar file “hadoop-mapreduce-examples.jar”. Running that jar file without any argument gives you a list of the available examples.

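To make the expected shape of the WordCount output concrete, the same counting can be sketched locally with standard shell tools. This is only an illustration and does not use Hadoop. The function name wordcount_local is made up for this sketch, and the pipeline stages are labelled map, shuffle and reduce by loose analogy with MapReduce.

```shell
#!/bin/sh
# Local illustration of what WordCount computes. It does NOT run on Hadoop.
# wordcount_local is a hypothetical helper name used only in this sketch.
#
# Stages, by loose analogy with MapReduce:
#   map:     tr splits the input into one whitespace-delimited word per line
#   shuffle: sort brings identical words next to each other (byte order)
#   reduce:  uniq -c counts each run of identical words
# The final awk prints "word<TAB>count", one word per line.
wordcount_local() {
    tr -s '[:space:]' '\n' | grep -v '^$' | LC_ALL=C sort | uniq -c |
        awk '{ print $2 "\t" $1 }'
}

# Try it on the same sentence that the practical writes to wordcount.txt.
printf 'This is a hadoop tutorial test\n' | wordcount_local
```

Every word in the sample sentence occurs once, so each line of the result ends in 1. Words are compared case-sensitively, so “This” and “this” would be counted as two different words.
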
To run the WordCount example on the input file that we just moved to HDFS, use the following command. Note: make sure the YARN (MR2) service is running before you run it (check in Cloudera Manager).

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /user/cloudera/input/wordcount.txt /user/cloudera/output

The output folder is specified as “/user/cloudera/output” in the above command. Finally, check the output of the WordCount example in the output folder.
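
One possible way to check the output folder is sketched below. It assumes that the job wrote its results into files named part-r-* inside /user/cloudera/output, which is the usual naming for reducer output; the names on your VM may differ. The helper names run and check_output are made up for this sketch. If the hdfs client is not installed, the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: inspect the WordCount output folder used in the command above.
# Assumption: results are in files named part-r-* (usual reducer output names).
OUT_DIR=/user/cloudera/output

# run is a hypothetical helper: execute the command if the hdfs client is
# available, otherwise just print it (dry run) so the script works anywhere.
run() {
    if command -v hdfs >/dev/null 2>&1; then
        "$@"
    else
        echo "would run: $*"
    fi
}

check_output() {
    # List the files the job produced.
    run hdfs dfs -ls "$OUT_DIR"
    # Print the word counts. The glob is quoted so that HDFS expands it,
    # not the local shell.
    run hdfs dfs -cat "$OUT_DIR/part-r-*"
}

check_output
```

If you rerun the job, note that WordCount will not start while the output folder already exists. Remove the folder first, for example with hdfs dfs -rm -r /user/cloudera/output.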
