SOEN 363 Data Systems for Software Engineers Lecture 1 Introduction PDF
Essam Mansour
Summary
This document is an introduction to data systems, specifically for software engineers. It covers the basics of data management, big data, database models, and query processing. The document aims to provide a foundational understanding to those studying data systems in computer science.
Full Transcript
SOEN 363 - Data Systems for Software Engineers
Lecture 1: Introduction
Essam Mansour

Outline
▪ Motivation ✓
▪ Course Overview and Administrivia
▪ A Primer on Databases

On the Verge of a Disruptive Century: Breakthroughs
▪ Gene Sequencing and Biotechnology
▪ Ubiquitous Computing
▪ Smaller, Faster, Cheaper Sensors
▪ Faster Communication

A Common Theme is Data
▪ The amount of data is only growing…
▪ 1.2 Zettabytes in 2010 (1 ZB = 10^21 B, or 1 billion TB)

We Live in a World of Data
▪ Nearly 500 Exabytes per day are generated by the Large Hadron Collider experiments (not all recorded!)
▪ 2.9 million emails are sent every second
▪ 20 hours of video are uploaded to YouTube every minute
▪ 24 PBs of data are processed by Google every day
▪ 50 million tweets are generated per day
▪ 700 billion total minutes are spent on Facebook each month
▪ 72.9 items are ordered on Amazon every second

Data and Big Data
▪ The value of data as an organizational asset is widely recognized
▪ Data is growing explosively along three main dimensions:
  ▪ “Volume”, or the amount of data
  ▪ “Velocity”, or the speed of data
  ▪ “Variety”, or the range of data types and sources
▪ What is Big Data?
  ▪ It is the proliferation of data that floods organizations on a daily basis
  ▪ It is high-volume, high-velocity, and/or high-variety information assets
  ▪ It requires new forms of processing to enable fast mining, enhanced decision-making, insight discovery and process optimization

What Do We Do With Data and Big Data?
▪ Store, Share, Query, Mine, Encrypt … and more!
▪ We want to do these seamlessly and fast...

Using Diverse Interfaces & Devices
▪ Mobile Devices
▪ Computers
▪ Consumer Electronics
▪ Personal Monitors and Sensors
▪ …and even appliances
▪ We also want to access, share and process our data from all of our devices, anytime, anywhere!

Data is Becoming Critical to Our Lives
▪ Domains of Data: Health, Science, Education, Work, Environment, Finance … and more

Why Study Databases?
▪ Data is everywhere and is critical to our lives
▪ Data needs to be recorded, maintained, accessed and manipulated correctly, securely, efficiently and effectively
  ▪ At the “low end”: scramble to web-scale (a mess!)
  ▪ At the “high end”: scientific applications
▪ Database management systems (DBMSs) are indispensable software for achieving such goals
▪ The principles and practices of DBMSs are now an integral part of computer science curricula
  ▪ They encompass OS, languages, theory, AI, multimedia, and logic, among others
As such, the study of database systems can prove to be richly rewarding in more ways than one!

EXAMPLE - MODEL A DATABASE FOR THE UNIVERSITY

PROBLEM WITH FLAT FILES
▪ Scaling issues
▪ Integrity issues
▪ System recovery issues
▪ Concurrent edits to files
▪ How to build another application?
▪ What if changes need to be made to how the data is physically stored?

1970’s - RELATIONAL DATA MODEL
▪ Programmers were rewriting IMS code every time the database schema changed
▪ Abstract databases to avoid this issue
▪ Decouple logical structure from physical structure
  ▪ Store the database in a simple data structure
  ▪ Use a high-level language to access data
  ▪ Physical storage is left to the DBMS implementation
▪ Proposed by Edgar F. Codd

Outline
▪ Motivation
▪ Course Overview and Administrivia ✓
▪ A Primer on Databases

Course Objectives
In this course we aim at studying:
▪ How to design and implement databases from ‘cradle-to-grave’
▪ How to query and manipulate databases
▪ How to refine and speed up data retrieval and manipulation
▪ How to construct buffer and disk space managers, query optimizers, and concurrency managers for DBMSs
▪ Big Data, Hadoop, BigTable, parallel and distributed DBMSs, NoSQL and NewSQL
These objectives span three perspectives: Application-Centric, Systems-Centric & Theory-Centric, and Advanced Topics.

NoSQL Databases
▪ To this end, a new class of databases emerged, which mainly follow the BASE properties (Basically Available, Soft state, Eventual consistency)
  ▪ These were dubbed NoSQL databases
  ▪ E.g., Amazon’s Dynamo and Google’s Bigtable
▪ Main characteristics of NoSQL databases include:
  ▪ No strict schema requirements
  ▪ No strict adherence to ACID* properties
  ▪ Consistency is traded in favor of Availability
*ACID: Atomicity, Consistency, Isolation, and Durability

Types of NoSQL Databases
▪ Here is a limited taxonomy of NoSQL databases:
  ▪ Document Stores
  ▪ Graph Databases
  ▪ Key-Value Stores
  ▪ Columnar Databases

List of Topics
Considered: a reasonably critical and comprehensive understanding.
Masterful: a powerful and illuminating understanding.
1. The Entity-Relationship Model
2. The Relational Model
3. SQL
4. Data Storage and Organization
5. Tree-Based and Hash-Based Indexing
6. Query Evaluation and Optimization
7. Advanced Topics: Distributed Databases, Hadoop, and NoSQL and NewSQL Databases

Learning Outcomes
❖ After finishing this course you will be able to:
1. Describe a wide range of data involved in real-world organizations using the entity-relationship (ER) data model
2. Explain how to translate an ER diagram into a relational database
3. Indicate how SQL builds upon relational calculus and algebra and effectively apply SQL to create, query and manipulate relational databases
4. Appreciate how DBMSs work
5. Have practical experience in manipulating and managing files of fixed-length and variable-length records on disks
6. Create and operate various static and dynamic tree-based (e.g., ISAM and B+ trees) and hash-based (e.g., extendable and linear hashing) indexing schemes
7. Explain and evaluate various algorithms for relational operations (e.g., join) using techniques such as iteration, indexing and partitioning
8. Analyze and apply different query evaluation plans and describe the various tasks of a typical relational query optimizer
9. Identify alternative architectures for distributed databases, and describe how data can be partitioned and distributed across networked nodes of a DBMS
10. Appreciate the scale of Big Data, discuss some popular analytics engines for Big Data processing and indicate the applicability of NoSQL databases for Big Data storage
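
To make the third learning outcome and the earlier flat-file discussion a little more concrete, here is a minimal sketch of a university database. It is not taken from the lecture: the schema (`student`, `course`, `enrolled`), the sample rows, and the helper names are all hypothetical. It assumes Python's built-in `sqlite3` module as a stand-in DBMS. It tries to show two points from the slides: data is accessed through a high-level language while physical storage is left to the DBMS, and integrity rules are enforced by the DBMS rather than by each application.

```python
import sqlite3


def build_university_db():
    """Create a tiny in-memory university database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    # SQLite does not enforce foreign keys unless this is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE student (sid INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE course  (cid TEXT PRIMARY KEY, title TEXT NOT NULL);
        CREATE TABLE enrolled (
            sid INTEGER NOT NULL REFERENCES student(sid),
            cid TEXT    NOT NULL REFERENCES course(cid),
            PRIMARY KEY (sid, cid)
        );
    """)
    conn.executemany("INSERT INTO student VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
    conn.execute("INSERT INTO course VALUES (?, ?)", ("SOEN363", "Data Systems"))
    conn.execute("INSERT INTO enrolled VALUES (?, ?)", (1, "SOEN363"))
    conn.commit()
    return conn


def students_in(conn, cid):
    """Declarative query: we state what we want; the DBMS decides how to fetch it."""
    rows = conn.execute(
        "SELECT s.name FROM student s JOIN enrolled e ON e.sid = s.sid "
        "WHERE e.cid = ? ORDER BY s.name",
        (cid,),
    )
    return [name for (name,) in rows]


def try_enroll(conn, sid, cid):
    """Return True if the enrollment is accepted, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO enrolled VALUES (?, ?)", (sid, cid))
        return True
    except sqlite3.IntegrityError:
        return False


if __name__ == "__main__":
    db = build_university_db()
    print(students_in(db, "SOEN363"))
    print(try_enroll(db, 2, "SOEN363"))   # valid new enrollment
    print(try_enroll(db, 99, "SOEN363"))  # unknown student, rejected by the foreign key
```

With flat files, every application touching the enrollment file would have to re-implement the duplicate and unknown-student checks itself. Here they are declared once in the schema, and the query does not mention how or where the rows are stored.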