Data Engineering Concepts Quiz
40 Questions


Questions and Answers

Which of the following are essential considerations when choosing a data storage system?

  • Data warehouse architecture (correct)
  • Operating system version
  • Source system type (correct)
  • Data temperature (correct)

Data generation occurs only at the beginning of the data engineering lifecycle.

    False (B)

    What is "hot data"?

    The most frequently accessed data.

    ______ data is rarely retrieved and is well suited to archival storage systems.

    Cold

    Match the source system types with an example of each:

    Relational database systems = MySQL
    NoSQL = MongoDB
    IoT = sensors in industrial equipment
    Data streams = real-time event log

    Data engineers do not need to know how to communicate with technical teams.

    False (B)

    In which phases of the data engineering lifecycle is the security layer present?

    Security is present in all phases of the data engineering lifecycle.

    Multiversion timestamping in concurrency control allows multiple transactions to access the same data element concurrently, but only for reading.

    False (B)

    Optimistic techniques in concurrency control assume that conflicts are ______ and that it is more efficient to let transactions proceed without delays.

    rare

    Which of the following phases does not occur in optimistic techniques?

    Processing (A)

    Explain the main principle of multiversion timestamp-based concurrency control.

    Multiversion timestamp-based concurrency control allows multiple transactions to access the same data element concurrently by creating a new version of the element for every change. On a read, the version that guarantees serializability is selected. Older versions are deleted once they are no longer needed.

    Match each term with its definition:

    Multiversion concurrency control = Lets transactions access different versions of a data element while guaranteeing serializability
    Optimistic techniques = Assume that conflicts are rare and let transactions proceed without delays
    Basic timestamp-based concurrency control = Assumes only one version of a data element exists and allows only one transaction at a time to access it

    What is the final account balance after transactions T1 and T2?

    £190 (D)

    An uncommitted dependency occurs when one transaction can see the intermediate results of another transaction before it has committed.

    True (A)

    What is serializability?

    Identifying the transaction executions that guarantee consistency.

    The ______ problem occurs when a transaction reads several values while a second transaction updates some of them during the first one's execution.

    inconsistent analysis

    Match the following problems with their descriptions:

    Lost update problem = The second transaction's update is lost
    Uncommitted dependency = Access to intermediate results
    Inconsistent analysis problem = Values are read while being updated

    How can the lost update problem be avoided?

    Prevent T1 from reading balx before the update is completed. (C)

    Serializability increases transaction parallelism.

    False (B)

    What balance should balx have after transaction T4 if T4 rolls back its update?

    £100

    The sequence of the reads/writes of a set of transactions is called a ______.

    schedule

    What is the main problem with an incorrectly ordered locking schedule?

    Transactions release locks too early. (A)

    The Two-Phase Locking (2PL) protocol permits the acquisition of new locks during the shrinking phase.

    False (B)

    What are the two phases of the 2PL protocol?

    The growing phase and the shrinking phase.

    The 2PL protocol prevents the ______ update problem.

    lost

    Match the problems with their 2PL-based solutions:

    Lost update = Preventing the Lost Update Problem
    Uncommitted dependency = Preventing the Uncommitted Dependency Problem
    Inconsistent analysis = Preventing the Inconsistent Analysis Problem
    Cascading rollback = Preventing the Cascading Rollback Problem

    What happens if transaction T14 fails?

    All transactions that depend on it must also be rolled back. (D)

    Even when the 2PL protocol is followed, a problem with releasing locks can still arise.

    True (A)

    What does total serializability of transactions mean?

    Transactions must be executed in an order such that they behave as if they had run sequentially.

    In the 2PL protocol, the release of all locks is postponed until the end of the transaction to prevent ______.

    cascading rollback

    What happens if a transaction's timestamp is smaller than the timestamp of the last transaction that wrote the given element?

    The transaction is aborted and restarted (D)

    Each transaction receives its timestamp based on the preceding transaction.

    False (B)

    What problems are associated with deadlock detection and recovery?

    Choosing a deadlock victim, deciding how far to roll the transaction back, and avoiding starvation.

    One of the main goals of ______ is to prevent transactions from deadlocking.

    timestamping

    Match the following timestamping components with their definitions:

    read-timestamp = the timestamp of the last transaction that read the element
    write-timestamp = the timestamp of the last transaction that wrote the element
    ts(T) = the timestamp of the current transaction T
    conflict = a situation in which transactions try to change the same data at the same time

    What is the advantage of timestamping compared with other transaction-management methods?

    It makes reads and writes more efficient without locking (B)

    With timestamping, there is no need to roll transactions back when no conflict exists.

    True (A)

    What identifier does the DBMS create to determine the relative start time of a transaction?

    A timestamp

    A conflict can be resolved by ______ and restarting the transaction.

    rolling the transaction back

    What happens to a transaction if its timestamp is greater than the timestamp of the writing transaction?

    The transaction is aborted and restarted (D)

    Study Notes

    Data Engineering

    • Data engineering is about the movement, manipulation, and management of data.
    • Endless definitions of data engineering exist.
    • Data engineering is a set of operations aimed at creating interfaces and mechanisms for the flow and access of information.
    • Dedicated specialists—data engineers—maintain data so that it remains available and usable by others.
    • The data engineering field is a superset of business intelligence and data warehousing, bringing more elements from software engineering.

    Definition

    • Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information, supporting downstream use cases (analysis and machine learning).
    • Data engineering is the intersection of security, data management, DataOps, data architecture, orchestration, and software engineering.
    • A data engineer manages the data engineering lifecycle, beginning with getting data from source systems and ending with serving data for use cases such as analysis or machine learning.

    The Data Engineering Lifecycle

    • Comprises stages turning raw data into a useful product for consumption by analysts, data scientists, and others.
    • Stages include generation, storage, ingestion, transformation, and serving data.
    • Undercurrents (security, data management, DataOps, data architecture, orchestration, and software engineering) cut across all stages of the data engineering lifecycle.

    Generation

    • A source system is the origin of data used in the data engineering lifecycle.
    • Relational database systems, NoSQL stores, and IoT devices are examples of source systems.
    • Source systems must be evaluated thoroughly (essential characteristics, data persistence, frequency of generation, errors, presence of a schema).

    Storage

    • A place to store ingested data, running across the entire process (ingestion, transformation, and serving).
    • Key engineering questions focus on selecting a storage system (data warehouse, data lakehouse, database, or object storage).
    • Data is categorized into temperatures based on access frequency (hot, lukewarm, and cold).
    • There is no one-size-fits-all storage recommendation.

    Ingestion

    • Data ingestion from source systems (normally outside direct control).
    • Source systems and ingestion are significant bottlenecks.
    • Batch versus streaming data
      • Data is inherently streaming
      • Batch ingestion simplifies processing in large chunks.
      • Real-time data is available shortly after creation.
    • Push versus pull models (a minimal pull-model sketch follows this list)
      • Push model: Source system writes data to a target.
      • Pull model: Data is retrieved from the source system.
      • Hybrid model combines both approaches.
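    A minimal pull-model sketch in Python, using sqlite3 in-memory databases as stand-ins for a hypothetical source system and target (table and column names are illustrative):

```python
import sqlite3

# Stand-ins: in practice the source is a production system outside
# the data team's direct control.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

source.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])
target.execute("CREATE TABLE orders (id INTEGER, amount REAL)")

def pull_batch(last_seen_id: int) -> int:
    """Pull model: the ingestion job queries the source for new rows.

    (A push model would invert this: the source writes to the target.)"""
    rows = source.execute(
        "SELECT id, amount FROM orders WHERE id > ?", (last_seen_id,)
    ).fetchall()
    target.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return max((r[0] for r in rows), default=last_seen_id)

checkpoint = pull_batch(last_seen_id=0)
print(target.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # -> 2
```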

    Transformation

    • Data needs transformation from its original format into a usable form for downstream use cases.
    • Data is mapped to the correct types and cleaned (see the sketch after this list).
    • This can include normalisation and creation of new features.
    • Transformation is often included in other stages of the life cycle.
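    A minimal Python sketch of this stage, assuming hypothetical raw rows with string-typed fields: types are cast, values cleaned, and a new feature derived.

```python
from datetime import date

raw_rows = [
    {"order_id": "1", "amount": " 9.50 ", "order_date": "2024-01-03"},
    {"order_id": "2", "amount": "20",     "order_date": "2024-01-04"},
]

def transform(row: dict) -> dict:
    """Map raw string fields to proper types and derive a new feature."""
    amount = float(row["amount"].strip())              # cast and clean
    order_date = date.fromisoformat(row["order_date"])
    return {
        "order_id": int(row["order_id"]),
        "amount": amount,
        "order_date": order_date,
        "is_large_order": amount >= 10.0,              # simple derived feature
    }

clean_rows = [transform(r) for r in raw_rows]
print(clean_rows[0])
```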

    Serving Data - Analytics

    • Data has value as it is used.
    • Analytics is essential for data endeavors.
    • Business intelligence describes the past and current business state.
    • Operational analytics focuses on the present and detailed operations.
    • Embedded analytics is customer-facing and dramatically increases demand for reports.
    • Multitenancy allows multiple customers to share data in a unified view for analysis and ML, via logical views with controls.

    Serving Data - Machine Learning

    • A feature store is a tool that combines data engineering and ML engineering.
    • A solid data foundation should be built before investing in ML resources.

    Reverse ETL

    • Takes processed data (analytics, scored models, etc.) from the data engineering lifecycle output and feeds it back into source systems.
    • Useful for pushing data back to production systems or SaaS platforms, especially in SaaS-heavy business environments.

    Major Undercurrents Across the Data Engineering Lifecycle

    • Data engineering encompasses traditional enterprise concepts and practices.
    • Includes security (access control, data governance, discoverability, data integrity, and modelling), orchestration (coordination, scheduling tasks), software engineering (programming, software design), and DataOps (Agile methodologies, DevOps, SPC).
    • Data Architecture – analyze trade-offs.
    • Data Management – discoverability, security, and accountability.

    Data Management and Governance

    • Data management involves developing, implementing, and supervising plans, policies, and procedures that enhance the value of data throughout its life cycle.
    • Covers data governance, modelling, lineage, integration, and interoperability.
    • Data governance includes discoverability, security, and accountability and is crucial for data quality, integrity, security, and usability.

    Orchestration

    • A process coordinating many jobs to run efficiently.
    • Differs from schedulers in how job dependencies are managed, typically as a directed acyclic graph (DAG), for example in Apache Airflow (see the sketch below).
    • Orchestration systems include job history, visualization, and alerting capabilities.
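    A minimal orchestration sketch, assuming Apache Airflow 2.4+ (the dag_id, task names, and callables are illustrative): dependencies are declared as a DAG, and the orchestrator adds history, visualization, and alerting around it.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling from the source system")

def transform():
    print("casting types, cleaning, deriving features")

def load():
    print("serving data to the warehouse")

# The >> operator declares edges of the DAG; the scheduler runs
# tasks only when their upstream dependencies have succeeded.
with DAG(dag_id="daily_pipeline", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)
    t1 >> t2 >> t3   # extract -> transform -> load
```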

    DataOps

    • Maps best practices of Agile, DevOps, and statistical process control (SPC) to data.
    • Aims to improve the efficiency and quality of data products.
    • Core technical elements: automation; monitoring and observability; and incident response.

    Software Engineering

    • Core data processing code (e.g., SQL)
    • Development of open-source frameworks (e.g., Hadoop ecosystem)
    • Streaming
    • Infrastructure as Code (IaC)
    • Pipeline as code is a core concept for present-day orchestration systems touching every stage of the data engineering lifecycle
    • General-purpose problem-solving.

    Enterprise Architecture

    • In TOGAF, "enterprise" encompasses all of an organization's information and technology services, processes, and infrastructure.
    • Enterprise Architecture (EA) is an organizational model aligning strategy, operations, and technology for success.
    • Enterprise architecture designs systems to support change in the enterprise, enabled by flexible and reversible decisions.

    Data Architecture

    • Reflects current and future state of data systems supporting organizational data needs and strategy.
    • Part of an overall enterprise architecture
    • Describes the structure and interaction of data types, logical data assets, physical data assets, and data management resources
    • Data architecture is the design of systems that support the growing data needs of an organization, by supporting flexible and reversible data decisions.
    • Serves business requirements; it is flexible and maintainable.
    • It is not static - it's in a constant state of evolving.

    Principles of Good Data Architecture

    • AWS and Google architecture principles focus on operational excellence, security, reliability, performance efficiency, cost optimisation, sustainability, and automation.
    • Examples and types of data architecture (data warehouse and mart, data lake, data lakehouse, modern data stack, Lambda architecture, Kappa architecture, architecture for IoT, and data mesh).

    Data Warehouse

    • A subject-oriented, integrated, nonvolatile, and time-variant collection of data to support business decisions.
    • Central hub for reporting and analysis.
    • Modern cloud data warehouses are scalable and accessible to smaller companies.

    ETL vs. ELT

    • ETL (Extract, Transform, Load): Data is transformed before loading into the warehouse.
    • ELT (Extract, Load, Transform): Data is loaded into the warehouse first, then transformed (see the sketch below).
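    A minimal Python sketch contrasting the two patterns, using an in-memory sqlite3 database as a stand-in warehouse (table names are illustrative): ETL transforms in the pipeline before loading, while ELT loads raw data and transforms it with SQL inside the warehouse.

```python
import sqlite3

raw = [("1", " 9.50 "), ("2", "20")]
wh = sqlite3.connect(":memory:")   # stand-in for the warehouse

# --- ETL: transform in the pipeline, then load the clean result ---
wh.execute("CREATE TABLE etl_orders (id INTEGER, amount REAL)")
clean = [(int(i), float(a.strip())) for i, a in raw]   # transform first
wh.executemany("INSERT INTO etl_orders VALUES (?, ?)", clean)

# --- ELT: load raw data first, transform inside the warehouse ---
wh.execute("CREATE TABLE raw_orders (id TEXT, amount TEXT)")
wh.executemany("INSERT INTO raw_orders VALUES (?, ?)", raw)
wh.execute("""
    CREATE TABLE elt_orders AS
    SELECT CAST(id AS INTEGER) AS id,
           CAST(TRIM(amount) AS REAL) AS amount
    FROM raw_orders
""")

print(wh.execute("SELECT * FROM elt_orders").fetchall())
```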

    Data Marts

    • Subset of a warehouse, focusing on a single department or business unit.
    • Makes analytical data easier and more accessible for analysis and reporting.
    • Provides additional transformation steps beyond the initial ETL/ELT.

    Data Lake vs. Data Lakehouse

    • Data lake: Dumps all unstructured and structured data into a central location.
    • Data lakehouse: combines the controls and structure of a data warehouse with data managed in object storage.

    Modern Data Stack

    • Cloud-based, plug-and-play, modular, cost-effective data architecture.
    • Typical components: pipelines, storage, transformation, data management, monitoring, visualization, exploration.

    Lambda Architecture

    • Reaction to the need to analyze streamed data, with independent systems for batch, stream, and serving tasks.
    • Immutability and append-only data for stream and batch processing.

    Kappa Architecture

    • Aims to improve data handling (ingestion, storage, and serving)
    • Real-time and batch processing can be applied seamlessly.
    • Event-based approach.

    Architecture for IoT

    • The Internet of Things (IoT) is a collection of devices, computers, sensors, mobile devices, smart home devices, and anything else connecting to the internet.
    • The architecture handles data from IoT devices.

    Data Mesh

    • A response to the complexity of traditional, centralized data architectures.
    • Inverts the traditional structure, focusing on decentralization
    • Uses data as a product.

    Physical Tables

    • Data stored in tables (grouped into schemas)
    • Contain columns and are managed with commands such as CREATE TABLE and ALTER TABLE.
    • Several physical table types: permanent, transient, and temporary
    • Hybrid Unistore tables.
    • Metadata tables.
    • External tables.
    • Directory tables

    Snowflake Views, Materialised Views, Streams, and Tasks

    • Snowflake Views: store SELECT statements over physical objects as an object (time-saving, maintainable, reusable).
    • Materialised Views: physical tables storing a view's results, used for frequently queried data (they incur a cost because results are physically stored).
    • Streams: capture data changes in underlying sources, tracking inserts, deletions, and updates via metadata.
    • Tasks: schedule and automate data loading/transformation using SQL, run serially or in parallel (see the sketch below).
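    A hedged sketch of these four objects as Snowflake SQL issued from Python; it assumes the snowflake-connector-python package, placeholder credentials, and illustrative table, stream, and warehouse names (orders, orders_stream, my_wh):

```python
import snowflake.connector  # pip install snowflake-connector-python

# Placeholder credentials; substitute a real account to run this.
conn = snowflake.connector.connect(user="USER", password="PASSWORD",
                                   account="ACCOUNT")

statements = [
    # View: stores the SELECT itself; it is re-run on every query.
    """CREATE VIEW recent_orders AS
       SELECT * FROM orders WHERE o_date >= CURRENT_DATE - 7""",
    # Materialised view: results are stored physically (extra storage cost).
    """CREATE MATERIALIZED VIEW daily_totals AS
       SELECT o_date, SUM(amount) AS total FROM orders GROUP BY o_date""",
    # Stream: tracks inserts, deletes, and updates on the underlying table.
    "CREATE STREAM orders_stream ON TABLE orders",
    # Task: runs SQL on a schedule, here consuming the stream's changes.
    """CREATE TASK load_new_orders
       WAREHOUSE = my_wh SCHEDULE = '5 MINUTE'
       AS INSERT INTO orders_history SELECT o_date, amount FROM orders_stream""",
    "ALTER TASK load_new_orders RESUME",  # tasks are created suspended
]
for sql in statements:
    conn.cursor().execute(sql)
```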

    Constraints and Enforcement

    • Constraints define integrity and consistency.
    • Constraints types: PRIMARY KEY, UNIQUE, FOREIGN KEY, NOT NULL.
    • Enforcement: actively monitoring integrity (see the sketch below).
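    A minimal sketch of declared and enforced constraints, using sqlite3 (table names are illustrative; note that SQLite enforces foreign keys only when the pragma is enabled):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # enable FK enforcement in SQLite

conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        email       TEXT NOT NULL UNIQUE
    )
""")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id)
    )
""")

conn.execute("INSERT INTO customers VALUES (1, 'a@example.com')")
try:
    # Violates the FOREIGN KEY constraint: customer 99 does not exist.
    conn.execute("INSERT INTO orders VALUES (1, 99)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```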

    Keys Taxonomy

    • Business keys: hold organizational meaning.
    • Surrogate keys: no special meaning; a random unique identifier.
    • Sequence: an independent database object generating sequential (not necessarily gap-free) integers.
    • Alternate key: a unique key other than the primary key.
    • One table may have several keys.

    Conceptual Modeling

    • Based on Kimball's Dimensional Modelling approach, a logical model aligned with business operations, providing a framework for the overall structure of data.

    Define the Business Process

    • Capture relevant metrics concerning the operational activities and transactions occurring in a business.

    Determine the Grain

    • The lowest detail level necessary to analyze a business process, usually reflecting atomic transaction levels.

    Determine the Dimensions

    • Entities answer who, what, where, and other descriptive context questions; these entities become dimensions.

    Identify the Facts

    • Identify the transactions among business dimensions associated with relevant metrics (mapped in the bus matrix).

    The Conceptual Diagram from a Bus Matrix

    • Illustrates the relationships and interdependencies of the business processes and entities. Only a few fact tables may be created in a conceptual model.

    Types of Dimensions

    • Some types of dimensions include conforming, junk, degenerate, and role playing dimensions.

    Slowly Changing Dimensions

    • Dimensions whose attributes change over time; the technique tracks historical versions of the data (a Type 2 sketch follows this list).
    • Common types:
      • Type 0 (retain original)
      • Type 1 (overwrite)
      • Type 2 (add new record)
      • Type 3 (add attributes)
      • Type 4 (historical table)
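    A minimal Python sketch of a Type 2 update, with illustrative column names: the current row is expired and a new version appended instead of overwriting.

```python
from datetime import date

# Current dimension row for a customer (Type 2 keeps full history).
dim_customer = [
    {"sk": 1, "customer_id": 42, "city": "Bratislava",
     "valid_from": date(2020, 1, 1), "valid_to": None, "current": True},
]

def scd2_update(rows, customer_id, new_city, today):
    """Type 2: close the old record and add a new one.

    (Type 1 would simply overwrite 'city' in place, losing history.)"""
    for row in rows:
        if row["customer_id"] == customer_id and row["current"]:
            row["valid_to"] = today          # expire the old version
            row["current"] = False
    rows.append({"sk": max(r["sk"] for r in rows) + 1,
                 "customer_id": customer_id, "city": new_city,
                 "valid_from": today, "valid_to": None, "current": True})

scd2_update(dim_customer, 42, "Kosice", date(2024, 6, 1))
for r in dim_customer:
    print(r)
```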

    Bridge Table

    • Resolves M:N relationships by sitting between fact and dimension tables.

    Fact Tables

    • Primary table for fact-based modelling.
    • Lowest granularity level.
    • Measures are captured from business processes, not from the report.
    • Contains foreign keys for dimensions.
    • Measures are used for different query types and aggregations.
    • The primary key in fact tables is frequently a composite key.

    Types of Facts

    • Types of fact tables: additive, semi-additive, and non-additive.
    • Additive facts can be summed across all dimensions.
    • Semi-additive facts can be summed across some dimensions only.
    • Non-additive facts cannot be summed (see the sketch after this list).
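    A minimal Python sketch of the distinction, assuming hypothetical account-balance snapshots: deposits are additive, balances only semi-additive.

```python
daily_balances = [  # balance is a classic semi-additive measure
    {"account": "A", "day": 1, "balance": 100, "deposits": 100},
    {"account": "A", "day": 2, "balance": 150, "deposits": 50},
    {"account": "B", "day": 1, "balance": 200, "deposits": 200},
    {"account": "B", "day": 2, "balance": 200, "deposits": 0},
]

# Additive: deposits can be summed across accounts AND across days.
total_deposits = sum(r["deposits"] for r in daily_balances)              # 350

# Semi-additive: balances sum meaningfully across accounts for one day...
total_day2 = sum(r["balance"] for r in daily_balances if r["day"] == 2)  # 350
# ...but summing balances across days (100+150+200+200) is meaningless.

print(total_deposits, total_day2)
```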

    Transaction Fact Table

    • Most common fact table structure, one row for each transaction
    • Contains measures relating to business processes.
    • Grain is the lowest granularity, date is a key dimension variable.
    • Additive measures; tables are usually large and continually growing (easy to update by appending).

    Periodic Fact Table

    • Snapshots of data based on certain periods (daily, weekly, monthly, quarterly, annually).
    • Useful for getting an overview of KPIs; usually derived from transaction fact tables.
    • Smaller in size than transaction fact tables.

    Accumulating Fact Tables

    • One row per event/product for its entire lifetime.
    • Has a beginning and an end.
    • Multiple date columns enable data updates and handle complex situations during the processing of an order or an insurance claim.

    Star Schema

    • Fact table sits in the centre of the schema.
    • Dimension tables radiate outwards in a star-like structure.
    • Easier querying and understanding of data relationships (see the query sketch below).
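    A minimal star-schema query sketch using sqlite3 (the fact and dimension tables are illustrative): the central fact table joins outwards to its dimensions, and measures are aggregated by dimension attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Fact table in the centre, dimension tables radiating outwards.
conn.executescript("""
    CREATE TABLE dim_date    (date_sk INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE dim_product (product_sk INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales  (date_sk INTEGER, product_sk INTEGER, amount REAL);

    INSERT INTO dim_date    VALUES (1, 2024);
    INSERT INTO dim_product VALUES (10, 'books');
    INSERT INTO fact_sales  VALUES (1, 10, 25.0), (1, 10, 15.0);
""")

# Typical star query: join facts to dimensions, group by dimension attributes.
rows = conn.execute("""
    SELECT d.year, p.category, SUM(f.amount) AS total
    FROM fact_sales f
    JOIN dim_date d    ON d.date_sk = f.date_sk
    JOIN dim_product p ON p.product_sk = f.product_sk
    GROUP BY d.year, p.category
""").fetchall()
print(rows)   # -> [(2024, 'books', 40.0)]
```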

    Snowflake Schema

    • Extends the star schema concept.
    • Further normalizes dimension tables into sub-dimensions to manage hierarchies.
    • Can have more complicated queries and potentially affect performance because of the number of required joins.

    Hybrid Architecture

    • Combines both relational and star schema models.

    Data Vault Methodology

    • Created by Dan Linstedt
    • Overcomes disadvantages of previous data modeling methodologies by storing all raw data
    • Uses hashed keys instead of composite keys.
    • Data marts are built on top of data vault views.
    • Handles structured and unstructured data at scale

    Wide Table or One Big Table (OBT)

    • Denormalised table storing pre-aggregated data.
    • Optimized for fast query performance.
    • Can be more intuitive for end users.
    • Ideal for smaller projects; cost-effective in terms of storage.

    Dimensional Modelling (DM)

    • Technique focusing on modelling for faster data retrieval
    • Denormalized structure
    • Data grouping by business categories

    Kimball's DWH Methodology

    • Bottom-up approach to data warehousing.
    • Data marts are created first.
    • Use of dimensional models (star or snowflake).

    Program and Project Planning

    • Establish the goals of the project, plan for resources and time.
    • Define required steps including business requirements, data profiling, business process documentation, defining data models, logical and physical modeling.
    • Define a project team with separate roles (Data Architect, Data Engineer, Analyst, BI Analyst, PM)

    Business Requirements Definition

    • Includes gathering process requirements for the data warehouse.
    • Uses interviews, data analysis, and data profiling.
    • Essential for creating the business process list
    • Includes criteria such as data readiness, data quality, complexity, and business priority.

    Business Processes

    • Activities performed by the organization (e.g., sales, supplier orders).
    • Identifying processes that create fact data and defining appropriate grain.
    • Crucial step to establish a data warehouse, defines the scope of necessary data and analysis processes.

    Documentation - Business Processes List

    • Evaluation considering data readiness, data quality, complexity, and business priority for each process.

    Documentation - User Stories

    • Understanding user needs and their expectations.
    • Describes who wants what and for what purpose ("As a business analyst, I want to understand...").
    • Helps define required data (columns, etc.)

    Documentation - Business Processes Matrix

    • Tool used to implement a dimensional data warehouse
    • Defines the high-level entities, business processes, and dimensions and how they map to facts.
    • Prioritized project directions/workloads.


    Related Documents

    Data Engineering PDF

    Description

    Test your knowledge of fundamental data engineering concepts. This quiz covers questions on the data lifecycle, security layers, and concurrency control. Find out how well you understand topics such as "hot data" and optimistic techniques.
