Big Data - Session 7 (Computer Science)
Document Details
Uploaded by WellRegardedLosAngeles
Summary
These are notes for a computer science course on Big Data. The session covers parallel data processing, distributed data processing, and Hadoop. It also covers MapReduce.
Full Transcript
Computer Science, Third Year: Big Data

Q1. Define each of the following.

1. Parallel Data Processing
Parallel data processing involves the simultaneous execution of multiple sub-tasks that collectively comprise a larger task. Example: a task can be divided into three sub-tasks that are executed in parallel on three different processors within the same machine.

2. Distributed Data Processing
Distributed data processing builds on parallel processing, but it is achieved through physically separate machines that are networked together as a cluster.

3. Hadoop
Hadoop is an open-source framework for large-scale data storage and data processing.

Q2. Define processing workload and compare its types.

Processing workload: the amount and nature of data that is processed within a certain amount of time.

Two types:

Batch processing
- Known as offline processing.
- Queries can be complex and involve multiple joins.
- Example: OLAP systems.

Transactional processing
- Known as online processing.
- Data is processed interactively, without delay.
- Queries involve fewer joins.
- Example: OLTP and operational systems.

Q3. Discuss how data is processed in MapReduce.

MapReduce is a widely used implementation of a batch processing framework (parallel processing). It is based on the principle of divide-and-conquer: it divides a big problem into a collection of smaller problems that can each be solved quickly.

A single processing run of the MapReduce processing engine is known as a MapReduce job. Each MapReduce job is composed of a map task and a reduce task, and each task consists of multiple stages:

1. Map: the dataset file is divided into multiple smaller splits.
2. Combine: a combine function summarizes a mapper's output before it gets processed by the reducer.
3. Partition: the last stage of the map task. The partitioner divides the output from the combiner into partitions.
4. Shuffle: the key-value output from all partitioners is copied across the network to the nodes running the reduce task.
5. Sort: the key-value pairs are sorted according to their keys.
6. Reduce: the final stage of the reduce task. The reducer either summarizes its input or emits the output without making any changes.
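
To make the MapReduce stages from Q3 concrete, here is a minimal single-process sketch in Python that counts words. It is only an illustration of the data flow and does not use Hadoop's API. The function names (`map_stage`, `combine_stage`, `partition_stage`, `shuffle_and_sort`, `reduce_stage`, `run_job`), the sample splits, and the first-character partitioning rule are all assumptions made for this example. In a real cluster each split would be handled by a mapper on a separate node.

```python
from collections import defaultdict


def map_stage(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def combine_stage(pairs):
    """Combine: summarize one mapper's output locally before it is sent on."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def partition_stage(pairs, num_reducers):
    """Partition: assign each key to a reducer (toy rule: first character)."""
    partitions = defaultdict(list)
    for key, value in pairs:
        partitions[ord(key[0]) % num_reducers].append((key, value))
    return partitions


def shuffle_and_sort(all_partitions, num_reducers):
    """Shuffle: gather each reducer's pairs from every mapper; Sort: order by key."""
    reducer_inputs = {r: [] for r in range(num_reducers)}
    for partitions in all_partitions:
        for reducer_id, pairs in partitions.items():
            reducer_inputs[reducer_id].extend(pairs)
    return {r: sorted(pairs) for r, pairs in reducer_inputs.items()}


def reduce_stage(sorted_pairs):
    """Reduce: summarize the values that share the same key."""
    totals = defaultdict(int)
    for key, value in sorted_pairs:
        totals[key] += value
    return dict(totals)


def run_job(splits, num_reducers=2):
    """Run one simulated MapReduce job over a list of text splits."""
    mapper_outputs = [
        partition_stage(combine_stage(map_stage(split)), num_reducers)
        for split in splits
    ]
    reducer_inputs = shuffle_and_sort(mapper_outputs, num_reducers)
    result = {}
    for pairs in reducer_inputs.values():
        result.update(reduce_stage(pairs))
    return result


# Hypothetical input: a dataset already divided into three splits.
splits = ["big data big", "data processing", "big processing job"]
word_counts = run_job(splits)
print(word_counts)
```

Because every occurrence of a given key is routed to the same reducer by the partitioner, each reducer can total its keys independently of the others. That is why the final counts do not depend on how many reducers are used.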