Big Data - Session 7 (Computer Science)

Computer Science, Third Year
Big Data

Q1. Define each of the following.

1. Parallel Data Processing
Parallel data processing involves the simultaneous execution of multiple sub-tasks that collectively comprise a larger task.
Example: a task can be divided into three sub-tasks that are executed in parallel on three different processors within the same machine.

2. Distributed Data Processing
Distributed data processing builds on parallel processing, but it is achieved through physically separate machines that are networked together as a cluster.

3. Hadoop
Hadoop is an open-source framework for large-scale data storage and data processing.

Q2. Define processing workload and compare its types.

Processing workload: the amount and nature of data that is processed within a certain amount of time.

There are two types:

Batch processing
- Known as offline processing.
- Queries can be complex and involve multiple joins.
- Example: OLAP systems.

Transactional processing
- Known as online processing.
- Data is processed interactively, without delay.
- Queries involve fewer joins.
- Example: OLTP and operational systems.

Q3. Discuss how data is processed in MapReduce.

MapReduce is a widely used implementation of a batch processing framework that relies on parallel processing. It is based on the principle of divide-and-conquer: it divides a big problem into a collection of smaller problems that can each be solved quickly.

A single processing run of the MapReduce processing engine is known as a MapReduce job. Each MapReduce job is composed of a map task and a reduce task, and each task consists of multiple stages:

1. Map: the dataset file is divided into multiple smaller splits.
2. Combine: the combine function summarizes a mapper's output before it gets processed by the reducer.
3. Partition: the partitioner, the last stage of the map task, divides the output from the combiner into partitions.
4. Shuffle: the key-value output from all partitioners is copied across the network to the nodes running the reduce task.
5. Sort: the key-value pairs are sorted according to their keys.
6. Reduce: the final stage of the reduce task. The reducer either summarizes its input or emits the output without making any changes.
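The six stages above can be illustrated with a small word-count sketch in plain Python. This is only a single-process simulation written for these notes, not Hadoop's API: the function names (split_input, map_stage, combine_stage, partition_stage, shuffle_and_sort, reduce_stage, run_job) and the choice of word counting as the task are assumptions made for illustration. The mappers and reducers run one after another here, whereas a real cluster would run them in parallel on separate nodes.

```python
from collections import defaultdict
from zlib import crc32


def split_input(lines, num_splits):
    """Divide the dataset into smaller splits, one per (simulated) mapper."""
    return [lines[i::num_splits] for i in range(num_splits)]


def map_stage(split):
    """Map: emit a (word, 1) key-value pair for every word in the split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def combine_stage(pairs):
    """Combine: summarize one mapper's output before it is sent onward."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def partition_stage(pairs, num_reducers):
    """Partition: assign each key to a reducer (crc32 is deterministic)."""
    partitions = defaultdict(list)
    for key, value in pairs:
        partitions[crc32(key.encode()) % num_reducers].append((key, value))
    return partitions


def shuffle_and_sort(all_partitions, num_reducers):
    """Shuffle: gather each reducer's pairs from every mapper, then sort by key."""
    reducer_inputs = []
    for r in range(num_reducers):
        merged = [pair for parts in all_partitions for pair in parts.get(r, [])]
        reducer_inputs.append(sorted(merged))
    return reducer_inputs


def reduce_stage(sorted_pairs):
    """Reduce: sum the partial counts for each key."""
    totals = defaultdict(int)
    for key, value in sorted_pairs:
        totals[key] += value
    return dict(totals)


def run_job(lines, num_mappers=2, num_reducers=2):
    """Run one simulated MapReduce job (sequentially, for illustration only)."""
    all_partitions = []
    for split in split_input(lines, num_mappers):
        combined = combine_stage(map_stage(split))
        all_partitions.append(partition_stage(combined, num_reducers))
    result = {}
    for reducer_input in shuffle_and_sort(all_partitions, num_reducers):
        result.update(reduce_stage(reducer_input))
    return result


if __name__ == "__main__":
    print(run_job(["big data big", "data processing", "big cluster"]))
```

For the three sample lines, the job counts "big" three times, "data" twice, and "processing" and "cluster" once each. Because every key is routed to exactly one reducer by the partitioner, the reducers' outputs never overlap and can simply be merged.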
