Summary

This document summarizes core data engineering concepts: advanced relational database topics (normalization, views, indexing, constraints, stored procedures, triggers, window functions, transactions), NoSQL databases, big data frameworks, stream processing, data pipelines, data visualization, data privacy, and cloud computing. It is intended as a study aid for the field of data engineering.

Full Transcript


Data engineering

1. Advanced relational databases
   1.1. Normalization
   1.2. Views
   1.3. Indexing
   1.4. Constraints
   1.5. Stored procedures
   1.6. Triggers
   1.7. Window functions
   1.8. Database transactions
2. NoSQL/Non-relational databases
   2.1. Introduction to NoSQL
   2.2. BASE
   2.3. Concepts of distributed systems
   2.4. MongoDB
   2.5. Apache Cassandra
   2.6. Graph databases
   2.7. Cypher by Neo4j
3. Big data Frameworks
   3.1. Big data Frameworks
   3.2. Apache Hadoop
   3.3. Apache Spark
4. Stream Processing
   4.1. Basic concepts of stream processing
   4.2. Time windowing
   4.3. Stream processing frameworks
   4.4. Apache Spark Streaming
   4.5. Difference Flink and Spark
   4.6. Examples
5. Building data pipelines
   5.1. The Data Engineering Lifecycle
   5.2. Common Data Pipeline Patterns
   5.3. Data Ingestion with Apache Kafka
   5.4. Kafka Fundamentals
   5.5. Separation of compute and data
   5.6. Orchestration
   5.7. Big Data Programming Models/Frameworks
   5.8. Workflow Management using Apache Airflow
6. Data visualization
   6.1. Introduction
   6.2. Ways to visualize data
   6.3. Workflow
   6.4. Visualization principles
   6.5. Colors
   6.6. Dashboards
   6.7. Strategies to create effective dashboards
   6.8. Examples
7. Data Privacy and General Data Protection Regulation
   7.1. Privacy
   7.2. Security
   7.3. Threat modeling
   7.4. Privacy enhancing technologies
8. Cloud computing
   8.1. The role of the cloud
   8.2. Historian systems
   8.3. Virtualization and orchestration
   8.4. Cloud service models
   8.5. Examples

1. Advanced relational databases

A relational DBMS (database management system) stores data in the form of tables/relations
➔ Data storage format is independent of its usage

Relation: a 2-dimensional table
  Rows: hold info about entities (specific entries)
    No two rows may be identical
  Columns: hold info about attributes
    All entries in a column are of the same kind (data type)
    Each column has a unique name
  Cells: hold a single value

Keys: one or more columns used to identify rows in a table
  Composite key: consists of two or more columns
  Candidate key: determines all other columns in a row (unique!)
  Primary key: candidate key selected as the primary one to identify rows
    Only one per table
    May be a composite key
    Is short, numeric, not null and ideally never changes
    Every table should have one
  Surrogate key: artificial column added to a relation to serve as primary key
    Supplied by the DBMS (e.g. autoincrementing)
    Short, numeric and unchanging
    Artificial values, meaningless to the user
    Usually hidden
  Foreign key: primary key of one relation that is placed in another table to form a link between the relations
    May be composite
    Refers to a pk in another table
    Referential integrity constraint: the values of the foreign key should always be existing primary key values

1.1. Normalization
Organizing a database efficiently
➔ Avoiding redundancy
➔ Eliminating inconsistent dependencies
BUT if you start from a well-designed UML or EER diagram you will end up with normalized tables

First normal form (1NF)
  Eliminate duplicate columns
  Each attribute of a table must have atomic values (no lists of values in a cell)
Second normal form (2NF)
  From a table in 1NF
  All the non-key columns are dependent on the table's entire primary key (only relevant with a composite pk)
  ➔ Eliminate redundant data
  ➔ Narrow tables to a single purpose
Third normal form (3NF)
  From 2NF
  Contains only columns that are non-transitively dependent on the pk
  ➔ Eliminate all data that is not dependent on the pk
Example

1.2. Views
A table that is not physically stored in the database
  Name has to be unique
  Can refer to tables and other views
Data in a view is retrieved from other tables in the database: queries can be executed on a view
Example: view definition and queries
Why use views?
  To give a name to queries you execute frequently and reuse the result of these queries as if it is a real table
  ➔ Hides complexity
  Security: to limit the number of rows/columns a user can see
    e.g. only access to data about people in the same department
    e.g. only access to a reduced set of columns from a table

1.3. Indexing
Data structure used to locate and quickly access data
➔ Faster data retrieval
Index entries act like pointers to table rows to quickly determine rows that match a certain condition
B-tree
  Common data structure for a typical DBMS
  Indexes generalize the binary search tree

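To make the view and index ideas concrete, here is a minimal sketch using Python's built-in sqlite3 module; the table and column names (employee, department, salary) are invented for illustration, not taken from the course material.

```python
import sqlite3

conn = sqlite3.connect(":memory:")          # throwaway in-memory database
cur = conn.cursor()

cur.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, department TEXT, salary REAL)")
cur.executemany("INSERT INTO employee (name, department, salary) VALUES (?, ?, ?)",
                [("Ann", "IT", 3200), ("Bob", "IT", 2900), ("Cleo", "HR", 3100)])

# A view gives a name to a frequently used query and hides columns (complexity/security)
cur.execute("CREATE VIEW it_staff AS SELECT name, department FROM employee WHERE department = 'IT'")
print(cur.execute("SELECT * FROM it_staff").fetchall())

# An index (typically a B-tree) speeds up lookups on the indexed column
cur.execute("CREATE INDEX idx_employee_department ON employee(department)")
print(cur.execute("SELECT name FROM employee WHERE department = 'HR'").fetchall())

conn.close()
```
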
1.4. Constraints
Conditions that should be true at all times
➔ The DBMS keeps a close eye on them to enforce them
NOT NULL: ensures that a column cannot have a NULL value
UNIQUE: ensures that all values in a column are different
PRIMARY KEY: uniquely identifies each row in a table
FOREIGN KEY: a field in one table that refers to the pk in another table
CHECK: ensures that the values in a column satisfy a specific condition
DEFAULT: sets a default value for a column if no value is specified

Referential integrity
The integrity of a row that contains a foreign key depends on the integrity of the row that it references
➔ If a pk is deleted or updated, you destroy the meaning of any rows that contain that value as a fk
The fk constraint is used to prevent actions that would destroy links between tables
Child table: table with the fk
Parent table: table with the pk

Referential actions
When an UPDATE or DELETE operation affects a key value in the parent table that has matching rows in the child table, the result depends on the referential action specified by the ON UPDATE and ON DELETE subclauses of the FOREIGN KEY clause
  RESTRICT: rejects the delete/update for the parent table
  CASCADE: deletes/updates the row from the parent table and automatically does the same in the child table
  SET NULL: deletes/updates the row from the parent table and sets the foreign key in the child table to NULL
  NO ACTION: rejects all changes

1.5. Stored procedures
A procedure with declarative and procedural SQL statements (e.g. a query), stored in the database
Can be called at any time from a program or another stored procedure
Why
  When the same operation is frequently repeated
  For a safe and consistent execution of operations
+ Less load on the client
+ Improved performance (less data exchange and network traffic between client and server)
  o Pre-compiled and optimized SQL statements
+ Safer
  o Users and client applications are not required to have direct access to the tables in the database
+ Reusable and easy to maintain
- More load on the server
- DBMS dependent (no standard)
  o Migration to another DBMS provider can be difficult
- Business logic ends up in the database

1.6. Triggers
Stored procedures that automatically execute in response to certain events on a specific table in a database
Activated before or after an insert/update/delete operation
Why
  Audit data modification: keep history of data, independent from the client app
  (Complex) validation
  Formatting (e.g. phone numbers)
  Business security
Keep in mind: access to existing/new data
  INSERT trigger: NEW data
  DELETE trigger: OLD data
  UPDATE trigger: OLD data & NEW data
More than one row can be referenced in a trigger, but MySQL only supports row-level triggers (the trigger is called for every row involved)
Example: the old data of the book is kept and stored somewhere else while the new data is stored in the table
Trigger types
Events
Temporal triggers: code that can be triggered to execute at a certain moment in time (SCHEDULE AT) or repeated at a regular interval (SCHEDULE EVERY)

1.7. Window functions
Performs an aggregate-like operation on a set of query rows (an aggregate operation groups query rows into a single result row)
Produces a result for each query row
Concepts
  Current row: row for which function evaluation occurs
  Window: the query rows related to the current row
  Partition: the group of query rows the window function operates over (rows related to the current row, e.g. sharing a partition key)
  Frame: subset of the partition, defining how many rows in front of and behind the current row the function looks at; can "move" within a partition depending on the location of the current row
Example

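The following sketch combines an audit trigger and a window function, again with Python's sqlite3 and invented table names; window functions need SQLite 3.25 or newer (bundled with recent Python versions), and MySQL trigger syntax differs slightly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, department TEXT, salary REAL)")
cur.execute("CREATE TABLE salary_audit (employee_id INTEGER, old_salary REAL, new_salary REAL)")

# Trigger: automatically keep a history row whenever a salary is updated (audit use case)
cur.executescript("""
CREATE TRIGGER trg_salary_audit AFTER UPDATE OF salary ON employee
BEGIN
    INSERT INTO salary_audit VALUES (OLD.id, OLD.salary, NEW.salary);
END;
""")

cur.executemany("INSERT INTO employee (name, department, salary) VALUES (?, ?, ?)",
                [("Ann", "IT", 3200), ("Bob", "IT", 2900), ("Cleo", "HR", 3100)])
cur.execute("UPDATE employee SET salary = 3000 WHERE name = 'Bob'")
print(cur.execute("SELECT * FROM salary_audit").fetchall())   # [(2, 2900.0, 3000.0)]

# Window function: one result per row, aggregated over a partition (requires SQLite >= 3.25)
rows = cur.execute("""
SELECT name, department, salary,
       AVG(salary) OVER (PARTITION BY department) AS dept_avg
FROM employee
""").fetchall()
print(rows)
conn.close()
```
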
1.8. Database transactions
What if there is concurrent use of the database? Multiple concurrent operations on the database by multiple users and clients
➔ System failure
Examples: bank accounts, a bidding web app, a seat reservation system, archiving of students

ACID
A set of properties used to ensure the reliability and consistency of relational databases
For databases that handle many small simultaneous transactions
  Atomicity: transactions are atomic
    One or more operations = "one unit of work"
    All changes succeed (commit) or everything is undone (rollback)
  Consistency: the database is always in a consistent state
    A transaction cannot be completed if it would leave the database in an inconsistent state
  Isolation: operations within a transaction are isolated from the operations of other transactions
    This prevents conflicts between transactions
    Ensures that the results of a transaction are not visible to other transactions until the transaction is complete
  Durability: once committed, a transaction is guaranteed to become permanent

Read conditions
What happens when two or more transactions operate on the same data at the same time?
  Dirty reads: a transaction reads uncommitted changes made by another transaction
  Repeatable reads: the data reads are guaranteed to look the same if read again during the same transaction
  Phantom reads: new records added to or deleted from the database are detectable by transactions that started prior to the insert/delete

Isolation levels (ISO standard)
Prevent read errors due to multiple transactions
  Read uncommitted: a transaction can read the uncommitted data of other transactions (= "dirty" read)
    Do not use it in multithreaded applications!
  Read committed: a transaction will never read uncommitted changes from other transactions (default level for most databases)
  Repeatable read: a transaction is guaranteed to get the same data on multiple reads of the same rows until the transaction ends
  Serializable: none of the tables/rows the transaction touches will change during the transaction (= exclusive read)
    Highest level (likely to cause performance bottlenecks)

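A minimal sketch of atomicity (commit vs. rollback) using sqlite3, in the spirit of the bank account example; the account table and the transfer helper are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.commit()

def transfer(amount):
    """Move money from account 1 to account 2 as one atomic unit of work."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on exception
            conn.execute("UPDATE account SET balance = balance - ? WHERE id = 1", (amount,))
            new_balance = conn.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0]
            if new_balance < 0:
                raise ValueError("insufficient funds")   # forces a rollback
            conn.execute("UPDATE account SET balance = balance + ? WHERE id = 2", (amount,))
    except ValueError:
        pass  # both updates are undone together

transfer(80.0)    # commits: balances become 20 and 130
transfer(500.0)   # rolls back: balances stay 20 and 130
print(conn.execute("SELECT * FROM account").fetchall())
```
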
2. NoSQL/Non-relational databases

2.1. Introduction to NoSQL
Refers to a family of databases that are:
  Non-relational in architecture
    o Not the standard row-and-column type RDBMS
  Designed to handle "big data"
    o Scale horizontally
    o Share data more easily
    o Use a global unique key to simplify data sharding
  Simpler to develop app functionality for than RDBMS
    o More agile development
The majority have their roots in the open source community

Characteristics
  Are built to scale horizontally
  Share data more easily than RDBMS
  Use a global unique key to simplify data sharding
  Are more use-case specific than RDBMS
  Are more developer friendly than RDBMS
  Allow more agile development via flexible schemas

Benefits
  Scalability: scale across clusters of servers and even datacenters
  Performance
  Availability: data runs on a cluster of servers with multiple copies of the data
  Cost: reduced due to cloud architecture
  Flexible schema
  Varied data structures
  Specialized capabilities

Categories (a rough sketch of the data models follows this list)

Key-value: all data is stored as a key in a HashMap with its corresponding value
  + Scale well
  + Shard easily
  - Not intended for complex queries
  Use-cases:
    Quick, simple CRUD operations on non-interconnected data
    Single-key operations only
      o Storing and retrieving session information for web applications
      ▪ Storing in-app user profiles and preferences
      ▪ Shopping cart data for online stores
  Unsuitable use-cases:
    Complex queries
      o Can only do atomic single-key operations (operations that can guarantee access to a single value)
      o The value of a key is opaque to the database, so it is hard to index and query
    Data interconnected with many-to-many relationships
      o Social networks
      o Recommendation engines
    When high-level consistency is required for multi-operation transactions with multiple keys
    When apps run queries based on value instead of key

Document: each piece of data is considered a document containing visible values that can be queried, with atomic operations on single documents
  + Each document offers a flexible schema
  + Content of documents can be indexed
  + Horizontally scalable
  + Allows sharding across multiple nodes (via unique keys)
  Use-cases:
    Each instance can be represented by a new document
      o Event logging for apps and processes: each event is represented by a new document
      o Online blogs: each user, post, like, comment, … represented by a document
      o Operational datasets and metadata for web and mobile apps
  Unsuitable use-cases:
    Transactions that operate over multiple documents
      o ACID transactions required
    Data that does not fit an aggregate-oriented design
      o If data naturally falls into a normalized tabular model

Column: data is organized in column families: several rows, each with a unique key, belonging to one or more columns; a row is a key-value pair where the key is mapped to a value that is a set of columns
  + Columns can have a TTL parameter
  Use-cases:
    Large amounts of sparse data
    Deployment across clusters of nodes
    Event logging and blogs
    Counters
    Data with an expiration value
  Unsuitable use-cases:
    Traditional ACID transactions

Graph: stores info in entities (nodes) and relationships (edges)
  - Do not shard well
  + ACID transactions
  Use-cases:
    Highly connected and related data
      o Social networking
    Routing, spatial and map apps
    Recommendation engines
  Unsuitable use-cases:
    Applications that require horizontal scaling
    When trying to update all nodes with a certain parameter

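As a rough sketch of how the same record might be shaped under each category, here are plain Python structures; the shopping-cart example and all field names are invented, and real stores add their own APIs on top of these shapes.

```python
# Key-value: the value is opaque to the database; lookup by key only
kv_store = {
    "cart:session-42": '{"user": "ann", "items": ["book", "pen"]}',
}

# Document: the same data, but fields are visible and can be indexed/queried
document = {
    "_id": "session-42",
    "user": "ann",
    "items": [{"sku": "book", "qty": 1}, {"sku": "pen", "qty": 2}],
}

# Column family: a row key mapped to a set of named columns (wide-column model)
column_row = ("session-42", {"user": "ann", "item:book": 1, "item:pen": 2})

# Graph: nodes (entities) and labelled edges (relationships)
nodes = [("ann", "User"), ("book", "Product")]
edges = [("ann", "ADDED_TO_CART", "book")]
```
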
2.2. BASE
A model used to describe the characteristics of a distributed database management system (DBMS)
Based on the idea that a distributed database can still be highly available and consistent, even if it is not immediately consistent
Favors availability over consistency
  Basically Available: the system should be available for read and write operations most of the time, even if it is not possible to guarantee 100% availability
  Soft state: the state of the system can change over time, even if there are no updates being made to the system
    The system may be distributed across multiple machines, and the state of each machine may not be immediately consistent with the others
  Eventually consistent: the system will eventually become consistent, even if it is not immediately consistent
    ➔ The system can continue functioning and serving requests while it is in the process of becoming consistent
Use-cases:
  Worldwide online services
  Social media apps
  Marketing and customer service companies

2.3. Concepts of distributed systems
A collection of multiple interconnected databases spread across various (physical) locations

Fragmentation
Sharding / partitioning
Large pieces of data are broken into smaller pieces
Keys can be grouped lexically or by their id
  e.g. all records starting with A-C, all records with id 124

Replication
All data fragments are stored redundantly in two or more sites
➔ Protection of data against node failures
➔ Increased availability
BUT replicated data needs to be synchronized or else we get inconsistencies
+ Reliability and availability
+ Improved performance
+ Query processing time reduced
+ Ease of growth/scale
+ Continuous availability

Challenges
No or very limited transaction support
Concurrency control: how to keep data consistent with multiple users accessing the database
  WRITES/READS go to a single node per fragment of data (data is synchronized in the background)
  WRITE operations go to all nodes holding a fragment of data
  READS go to a subset of nodes, per the configured consistency
Developer-driven consistency of data

CAP theorem
A distributed system can only provide two of the three elements: Consistency (every read sees the most recent write), Availability (every request receives a response) and Partition tolerance
Partition tolerance: a communication break happens within a distributed system
  The cluster must continue to work despite any number of breaks between nodes
  Distributed systems cannot avoid these partitions and must be tolerant of them!
  ➔ Basic feature of NoSQL
  ➔ CAP theorem: the system must choose CP or AP

From RDBMS to NoSQL
Data-driven model to query-driven model
  o RDBMS starts from the data integrity and the relations between entities
  o NoSQL starts from the queries and creates models based on the way the application will interact with the data
Normalized to denormalized data
  o RDBMS starts from normalized data and then builds queries
  o NoSQL structures data based on the queries
From the ACID to the BASE model
  NoSQL systems do not support transactions and joins
A mix of both is also possible!

2.4. MongoDB
Designed to store and retrieve large amounts of data in the form of documents
Easy to access through indexing
Supports various data types
The database schema can be flexible
  ➔ The schema is changed as needed without involving complex data definition language statements
Complex data analysis can be done on the server using aggregation pipelines
The scalability makes it easier to work across the globe
Enables real-time analysis of your data

Documents: in binary JSON (BSON) format, which supports various data types
Collection: group of stored documents
Database: stores collections

Where to use MongoDB
  Large and unstructured data
  Complex data
  Flexible data
  Highly scalable applications
  Self-managed, hybrid, or cloud hosted
+ Flexibility with schema
+ Code-first approach
+ Evolving schema
+ Unstructured data
+ Querying and analytics (using MQL)
  o Has a wide range of operators
  o For complex analysis use aggregation pipelines (see the sketch below)
+ High availability
  o Resilience through redundancy
  o No system maintenance downtime
  o No upgrade downtime

Use-cases for MongoDB
IoT
  o Billions of devices around the world
  o Scale
  o Expressive querying
  o Vast amount of data
E-commerce
  o Products with different attributes
Real-time analytics
  o Quick response to challenges
  o Simplified ETL (Extraction, Transformation, Load)
  o Real-time gaming and finance
  o Globally scalable
  o No downtime
  o Supports rapid development
Finance
  o Speed
  o Security
  o Reliability

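A minimal pymongo sketch of documents, collections and an aggregation pipeline; it assumes a MongoDB instance on localhost:27017, and the database, collection and field names are invented for illustration.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["shop"]                      # database
orders = db["orders"]                    # collection

# Documents with a flexible schema: the second order carries an extra field
orders.insert_many([
    {"customer": "ann", "amount": 30.0, "items": ["book"]},
    {"customer": "ann", "amount": 12.5, "items": ["pen"], "coupon": "WELCOME"},
    {"customer": "bob", "amount": 99.9, "items": ["headset"]},
])
orders.create_index("customer")          # document fields can be indexed

# Aggregation pipeline: server-side analysis (total spent per customer)
pipeline = [
    {"$group": {"_id": "$customer", "total": {"$sum": "$amount"}, "orders": {"$sum": 1}}},
    {"$sort": {"total": -1}},
]
for row in orders.aggregate(pipeline):
    print(row)

client.close()
```
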
2.5. Apache Cassandra
A column-based database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure
A highly scalable, high-performance database that is well suited for storing large amounts of data that need to be accessed quickly, such as real-time analytics data

Keyspace: logical entity that contains one or more tables
Table: logical entity that organizes data storage at cluster and node level (according to a declared schema)
  Data is organized in tables containing rows and columns
  Tables can be created, dropped, and altered at runtime without blocking updates and queries
  To create a table, you must define a primary key and other data columns (regular columns)
  Static table: contains a primary key with only a partition key
  Dynamic table: contains a primary key with a clustering key
Primary key: mandatory subset of the declared columns that cannot be changed once declared
  Optimizes read performance for queries
  Provides uniqueness to the entries
Partition key: mandatory – determines data locality in the cluster based on partitions
Clustering key: optional – sorts the data within the partitions (see the sketch after this section)

Basic rules of data modeling
Choose a primary key that optimizes the query execution time and a clustering key that starts answering your query and spreads the data uniformly in the cluster
➔ Minimize the number of partitions read in order to answer the query

Distributed and decentralized
Runs on multiple machines while it appears as a unified whole to the user
There is no center node; every node performs the same function
Example cluster: 10 nodes across 2 datacenters
Distributing data starts with a query

Data replication and multiple data center support
Data is automatically replicated across multiple nodes and distributed evenly across the nodes
➔ Highly fault tolerant and supports fast linear-scale performance
Replicas: nodes that contain the same piece of data (a partition) as other nodes
Racks: grouped sets of nodes that do not contain the same replicas
➔ Replicas are spread around through different racks

Availability vs. consistency
Data in a Cassandra database is always available
The consistency of the data is tunable
➔ The consistency level can be set per operation
CAP theorem: Cassandra favours availability over consistency

High availability and fault tolerance
Peer-to-peer architecture: each node "gossips" to the nodes to its left and right
Temporary/permanent node failures are immediately recognized by the other nodes in the cluster
Nodes reconfigure the data distribution once nodes are taken out of the cluster
Failed requests can be transmitted to other nodes

Fast and linear scalability
Scales horizontally by adding new nodes to the cluster
Performance increases linearly with the number of added nodes
New nodes are automatically assigned tokens from existing nodes
Adding and removing nodes is done seamlessly

High write throughput
Writes can be distributed in parallel to all nodes holding replicas
By default there is no reading before writing: writes are done in node memory and later flushed to disk
All disk writes are sequential, append-like operations

Use cases
eCommerce website
  o Storing transactions
  o Website interactions for prediction of customer behavior
  o Status of orders/users' transactions
  o Users' profiles and shopping history
Timeseries
  o Monitoring servers' access logs
  o Weather updates from sensors
  o Tracking packages
Online services
  o Users' authentication for access to services
  o Tracking users' activity in the application

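A small sketch of the partition/clustering key idea using the cassandra-driver package; it assumes a Cassandra node on localhost, and the keyspace, table and column names are invented for illustration.

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS sensors
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("sensors")

# Dynamic table: the partition key (sensor_id) determines data locality,
# the clustering key (reading_time) sorts rows inside each partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS readings (
        sensor_id text,
        reading_time timestamp,
        value double,
        PRIMARY KEY ((sensor_id), reading_time)
    ) WITH CLUSTERING ORDER BY (reading_time DESC)
""")

session.execute(
    "INSERT INTO readings (sensor_id, reading_time, value) VALUES (%s, toTimestamp(now()), %s)",
    ("sensor-1", 21.5),
)

# The query hits a single partition, which is what the data model is optimized for
for row in session.execute("SELECT * FROM readings WHERE sensor_id = %s", ("sensor-1",)):
    print(row.sensor_id, row.reading_time, row.value)

cluster.shutdown()
```
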
2.6. Graph databases
A set of nodes (entities) and relationships (edges) that connect them
Graph database models
  Edge-labelled graphs: only edges can be labelled
  Property graphs: both edges and nodes can be labelled

2.7. Cypher by Neo4j
Neo4j is a graph database management system, designed to store and manage large amounts of data represented as a graph, with nodes representing entities and edges representing relationships between those entities
+ Written in Java so easy to use
+ Scalable
+ Reliable
+ Graph-native database
+ Powerful query language: Cypher
  o Syntax similar to SQL (see the sketch below)
Use-cases
Applications that need to store and analyze complex relationships between data:
  Social networks
  Recommendation engines
  Fraud detection systems

Labeled property graphs
Node: represents entities or instances
  Has a label and a set of properties
  Keys are strings and values are arbitrary data types
Relationships: represent the relationship between the nodes
  Directed edge between two nodes
  Has a label and can have a set of properties
Properties: information associated to nodes or relationships
  In the form of arbitrary key-value pairs

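A minimal Cypher sketch via the official neo4j Python driver; it assumes a local Neo4j instance, and the credentials, labels and property names are placeholders.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Create two labelled nodes and a directed, labelled relationship with a property
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:FOLLOWS {since: 2023}]->(b)",
        a="Ann", b="Bob",
    )
    # Pattern matching: who does Ann follow?
    result = session.run(
        "MATCH (:Person {name: $name})-[:FOLLOWS]->(p:Person) RETURN p.name AS followed",
        name="Ann",
    )
    print([record["followed"] for record in result])

driver.close()
```
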
3. Big data Frameworks

3.1. Big data Frameworks
Big data
Extremely large data sets that are too large and complex to be processed and analyzed using traditional data processing tools and techniques
Often characterized by the … V's

Five V's of Big Data (according to IBM)
  Volume: huge amount of data
  Velocity: data is generated at high speed
  Veracity: inconsistencies and uncertainty in data
  Variety: different formats of data from various sources
  Value: extract useful data

Scalability
The system remains effective when there is a significant increase in the number of resources and users
Vertical scaling (scale up): installing more processors, memory and faster hardware
  Typically within a single server
  Usually involves a single instance of an operating system
  + Most software can easily take advantage of vertical scaling
  + Easy to manage and install hardware within a single machine
  - Requires a more powerful system than currently needed to handle future workloads; initially the additional performance is not fully utilized
  - Not possible to scale up after a certain limit
  - Limited by the machine cost
Horizontal scaling (scale out): distribute the workload across many servers
  ➔ Multiple independent machines are added together to improve the processing capability
  ➔ Typically, multiple instances of the operating system (replicas) are running on separate machines
  + Increases performance in small steps as needed
  + Financial investment to upgrade is relatively small
  + Can scale out the system as much as needed
  - Software has to handle all data distribution
  - Parallel processing complexities
  - Limited number of software packages available that take advantage of horizontal scaling

Big data frameworks
Specialized software platforms designed to process and analyze large amounts of data efficiently
They provide a set of tools, libraries, and other resources that can be used to build and run big data applications and to gain insights from large data sets

3.2. Apache Hadoop
A popular open-source big data processing framework designed to store and process large amounts of data efficiently
Consists of a number of components, which together form a distributed computing platform for big data applications
Written primarily in Java (with some native code in C)
Scales from a single server to many machines instead of relying on one larger machine
Detects and handles failures at the application layer
+ Processes large volumes of data in a parallel, distributed manner
+ Linear processing of huge datasets
- Reads and writes from disk, which slows processing down

The main components of the Hadoop ecosystem

Hadoop Distributed File System (HDFS): a distributed file system designed to store large amounts of data across a cluster of machines
+ High-throughput access to large amounts of data
+ Highly fault tolerant: detects faults and automatically recovers quickly, ensuring continuity and reliability
  ➔ Capable of storing data even if some of the nodes in the cluster fail
+ Compatibility and portability: portable across a variety of hardware setups and compatible with several underlying operating systems
Data is stored as files
Files are divided into blocks
Blocks are replicated and distributed across DataNodes in the cluster
DataNodes manage blocks and receive instructions from the master node to perform tasks such as creating, deleting and replicating blocks
If a client application wants to read a file, it sends a request to the master node (NameNode) of the cluster
The master responds with the location of the blocks that form the file
The client application retrieves the blocks from the appropriate DataNodes

MapReduce: programming model for processing and analyzing large amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner (a minimal sketch follows below)
It is composed of two main functions:
Map task: processes data in parallel
  Takes a set of data and converts it into another set of data
  Individual elements are broken down into tuples (key/value pairs)
  Input data is in the form of a file or directory, stored in HDFS
  Each map task is assigned to a compute node called a worker
  Intermediate key-value pairs produced are stored on the local disks of the workers
Reduce task: combines the results of the map function to produce the final output
  Takes the output from a map as an input
  Combines those data tuples into a smaller set of tuples

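To make the map/shuffle/reduce roles concrete, here is a word-count sketch in plain Python that simulates the flow locally; it is not a real Hadoop job, and the input documents are invented.

```python
from collections import defaultdict

documents = ["big data needs big clusters", "spark and hadoop process big data"]

# Map task: break each input record into intermediate (key, value) tuples
def map_task(line):
    for word in line.split():
        yield (word, 1)

# Shuffle: group intermediate pairs by key (done by the framework between map and reduce)
groups = defaultdict(list)
for doc in documents:
    for key, value in map_task(doc):
        groups[key].append(value)

# Reduce task: combine the values for each key into a smaller set of tuples
def reduce_task(key, values):
    return (key, sum(values))

word_counts = [reduce_task(k, v) for k, v in groups.items()]
print(sorted(word_counts))
```
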
In Hadoop 1.0, MapReduce was also in charge of resource management, but this had a single-master, multiple-slave architecture with a single point of failure; this led to the introduction of YARN.

YARN (Yet Another Resource Negotiator): resource management platform responsible for allocating resources to applications running on the Hadoop cluster
It is designed to manage the entire lifecycle of big data applications, from resource allocation to job scheduling and monitoring
In a Hadoop cluster running YARN, there are a few main components:
  Client: submits map-reduce jobs
  Resource Manager: arbitrates resources among all the applications in the system
    Receives processing requests, forwards them to the corresponding Node Manager and allocates resources to complete the requests
    o Scheduler: allocates resources to the various running applications
    o Applications Manager: accepts the application and negotiates the first container (a collection of physical resources) from the Resource Manager
  Node Manager: responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting this to the Resource Manager/Scheduler
  Application Master: one per job submitted to the framework; responsible for negotiating resources with the Resource Manager
  Container: a collection of physical resources (RAM, CPU cores and disk) on a single node
Workflow of these components

Hive: data warehouse software
  Facilitates reading, writing and managing large datasets residing in distributed storage using SQL
Mahout: to create scalable machine learning algorithms
Pig: to analyze large datasets; consists of a high-level language to express data analysis programs, coupled with infrastructure for evaluating them
HBase: provides Bigtable-like capabilities on top of Hadoop and HDFS
Zookeeper: centralized service for distributed applications providing configuration information, naming, distributed synchronization and group services
  ➔ Makes distributed systems easier to manage; many applications skimp on services that make propagating changes reliable

Limitations of Apache Hadoop
  Hard to manage and administer: cumbersome operational complexity
  Verbose, general batch-processing MapReduce API; boilerplate setup code required ➔ brittle fault tolerance
  Repeated disk I/O for large batches of data: jobs with many pairs of MR tasks write intermediate results to the local disk for the subsequent stage of the operation ➔ large MR jobs can run for hours or days
  Not ideal for combining with other workloads, e.g. ML, streaming

3.3. Apache Spark
Open-source data processing engine for parallel large-scale data processing
Used for a wide range of data processing tasks (batch processing, stream processing, and machine learning)
Unified: easier data analytics and more efficient to write
Computing engine: loads data from storage systems and performs the computation; it is not permanent storage itself

Storage
Persistent storage systems = storage devices that retain data after power to the device is shut off
Apache Spark is compatible with a variety of storage systems:
  Cloud storage systems (Azure Storage, Amazon S3)
  Distributed file systems (HDFS)
  Key-value stores (Apache Cassandra)
  Message buses (Apache Kafka)

Cluster management
Handles resource sharing between Spark applications
  YARN: specially designed for Hadoop workloads
    o High degree of scalability and fault tolerance
  Mesos: designed for all kinds of workloads
    Abstracts CPU, memory, storage and other compute resources away from machines
    Resource management and scheduling across entire datacenter and cloud environments
    Handles workloads in a distributed environment through dynamic resource sharing and isolation
  Kubernetes: automating deployment, scaling and management of containerized applications
    o High degree of scalability and fault tolerance
  Local mode: run Spark applications on a single machine's operating system
  Standalone: good for small Spark clusters
    o Easy to use
    o Not as scalable and fault tolerant

SparkContext
Part of the Spark Core engine
Coordinates an independent set of processes
Connects to the several types of cluster managers
Sets up executors on nodes in the cluster
Sends the application to the executors
Sends tasks to the executors to run

A Spark application consists of two processes:
Driver: the heart of a Spark application; sits on a node in the cluster and maintains all relevant information during the lifetime of the application
  Maintaining information about the Spark application
  Responding to a user's program or input
  Analyzing, distributing, and scheduling work across the executors
Executors: responsible for actually executing the work assigned to them
  Executing code assigned to them by the driver
  Reporting the state of the computation to the driver node

The workflow for a process:
1) Each application gets its own executor processes and runs tasks in multiple threads
  + Applications are isolated from each other, both on the scheduling side and on the executor side
  - Data cannot be shared across different Spark applications without writing it to an external storage system
2) Spark is agnostic to the underlying cluster manager
3) The driver program listens for and accepts incoming connections from its executors throughout its lifetime
  ➔ The driver program must be network addressable from the worker nodes
4) Because the driver schedules tasks on the cluster, it should run close to the worker nodes

Resilient Distributed Dataset (RDD) (a PySpark sketch of RDDs and DataFrames follows below)
A read-only multiset of data items distributed over a cluster of machines
Immutable: cannot be changed once created, BUT can be transformed using operations such as map, filter, and reduce
Time and cost efficient

RDD transformations: create a new Spark RDD from an existing one by passing a dataset to a function and returning a new dataset
  Narrow transformations: each input partition contributes to only one output partition
  Wide transformations: input partitions contribute to multiple output partitions and vice versa
RDD actions: instruct Spark to compute a result from a series of transformations
  View data in the console
  Collect data to native objects in the respective language
  Write to output data sources

RDD characteristics:
Dependencies: a list of dependencies instructs Spark how an RDD is constructed
  Spark can recreate an RDD from these dependencies and replicate operations on it
  ➔ Gives RDDs resiliency and reproducibility of results
Partitions (with some locality information): split the work to parallelize computation on partitions across executors
  Spark uses locality information to send work to executors close to the data
  ➔ Less data is transmitted over the network

Apache Spark APIs
An API is a software interface providing a service to other pieces of software
DataFrame API: a high-level API for working with data in Apache Spark
  Designed to make it easier for developers to work with structured data, such as tables, by providing a set of high-level operations that can be performed on the data
  Distributed in-memory tables (collections of objects of type Row) with named columns and schemas
  Operations such as adding or changing the names and data types of columns create new DataFrames while the previous versions are preserved
Dataset API: Datasets are strongly typed, so each element in a Dataset has a specific data type, such as integer or string
  ➔ Allows the Dataset API to provide a more expressive and type-safe interface for working with data, as well as to perform additional optimization under the hood

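A minimal local PySpark sketch of the RDD and DataFrame abstractions, assuming pyspark is installed; the data and application name are invented.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-vs-dataframe").getOrCreate()
sc = spark.sparkContext

# RDD: immutable distributed collection; transformations are lazy, actions trigger computation
rdd = sc.parallelize(["big data needs big clusters", "spark processes big data"])
counts = (rdd.flatMap(lambda line: line.split())     # narrow transformation
             .map(lambda word: (word, 1))            # narrow transformation
             .reduceByKey(lambda a, b: a + b))       # wide transformation (shuffle)
print(counts.collect())                              # action

# DataFrame API: higher-level, named columns and a schema
df = spark.createDataFrame([("Ann", "IT", 3200), ("Bob", "IT", 2900), ("Cleo", "HR", 3100)],
                           ["name", "department", "salary"])
df.groupBy("department").avg("salary").show()

spark.stop()
```
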
Libraries
Apache Spark includes a number of libraries that are designed to make it easier for developers to work with big data
Some of the main libraries:
  Core: general execution engine that all other functionality is built upon
  Spark SQL: data abstraction to support structured and semi-structured data
  Streaming: leverages Spark Core's fast scheduling capability to perform streaming analytics; it ingests data in mini-batches and performs RDD transformations on them
  MLlib: distributed machine learning framework
  GraphX: distributed graph-processing framework

Spark characteristics
Speed:
  Benefits from cheap servers with large memory and multiple cores
  Performs efficient multithreading and parallel processing and generates compact code for execution
  Computes queries as a directed acyclic graph (DAG) to construct an efficient computational graph
Ease of use:
  Fundamental abstraction of a simple logical data structure (RDD)
  Other, higher-level structured data abstractions are built upon the RDD
Modularity:
  Operations can be applied across many types of workloads and expressed in many programming languages
Extensibility:
  Decouples storage and compute: focuses on a fast, parallel computation engine rather than on storage

Why use Spark
  Builds upon Hadoop
  Highly fault tolerant and parallel
  Supports in-memory storage for intermediate results
  Offers easy and composable APIs in multiple programming languages
  Supports other workloads in a unified manner

4. Stream Processing
Use-cases:
  Notifications and alerting: a notification or alert should be triggered if an event or series of events occurs
  Real-time reporting and decision making: real-time dashboards that any employee can look at
  Incremental or real-time ETL (extract, transform, load): incorporate new data within seconds, enabling users to query it faster
  Online machine learning: train a model on a combination of streaming and historical data from multiple users

4.1. Basic concepts of stream processing
Batch processing vs. stream processing
  Batch processing: data is bounded, with a start and end in a job, and the job finishes after processing that finite data
  Stream processing: processing unbounded data coming in real time, continuously, for days, months, years, …
Micro-batching
  Process that ingests batch data in smaller, more frequent chunks
  Frequent (and usually small) batches can be considered streams

Bounded vs. unbounded streams
  Bounded streams:
    Consist of finite and unchanging data
    Have a defined start and end
    Ordered ingestion is not required; bounded data can always be sorted
  Unbounded streams:
    Have a start but no defined end
    Data is provided as generated and must continuously be processed
    Events must be ingested in a specific order

Stream vs. real-time processing
  Stream processing:
    Processes continuously generated data from different sources
    Processes streaming data incrementally without having access to all data
  Real-time processing:
    Must guarantee a response within specified time constraints; fails if not completed within that time
    Real-time: latency calculated in milliseconds
    Near real-time: extended to (sub-)seconds

Latency and throughput
  Throughput: the volume of data that passes through a network in a given period (the actual amount of data transmitted through the network)
    Impacts how much data can be transmitted in a period of time
    Measured in megabytes per second (MBps)
  Latency: the time delay when sending data
    Higher latency causes a network delay
    Measured in milliseconds (ms)
  Bandwidth: the maximum amount of data that the network can transmit

Continuous vs. micro-batch processing
Continuous processing:
  Continually listens to messages
  Lowest possible latency when the total input rate is relatively low
  Lower maximum throughput
  Fixed topology of operators that cannot be moved at runtime
Micro-batch processing:
  Accumulates small batches of input data, then processes each batch
  Higher base latency
  Often achieves high throughput
  Uses dynamic load-balancing techniques to handle changing workloads

Stateless vs. stateful processing
  Stateless processing: does not retain any state associated with the current message after it has been processed
    e.g. filtering an incoming stream of user records by a field and writing the filtered messages to their own stream
  Stateful processing: records some state about a message even after processing it
    e.g. counting the number of unique users of a website every two minutes requires storing information about each user seen thus far for de-duplication

4.2. Time windowing
Event time vs. processing time
  Event time: the timestamp that is inserted into each record at the source, when the event originally occurred
  Processing time: the timestamp that is inserted into each record at the streaming application, when the record is received
  Out-of-order events: the order in which the events occur and the order in which they are observed by the system differ

Windowing
Breaks the data stream into mini-batches or finite streams to apply different transformations on it
A window is a period over which data is aggregated or processed
  Tumbling window: segments a data stream into distinct time segments and performs a function against them
    They repeat, do not overlap, and an event cannot belong to more than one tumbling window
  Hopping window: hops forward in time by a fixed period
    Events can belong to more than one hopping window result set
  Sliding window: outputs events only for points in time when the content of the window actually changes (when an event enters or exits the window)
    Every window has at least one event
  Session window: groups events that arrive at similar times, filtering out periods of time where there is no data
    Begins when the first event occurs
    If events keep occurring within the specified timeout from the last ingested event, the window extends to include the new events until the max duration is reached
    Otherwise the window is closed at the timeout
  Snapshot window: groups events that have the same timestamp

4.3. Stream processing frameworks
Apache Flink (one tuple at a time)
A framework and distributed processing engine
Tuples represent a single record of data and are processed one at a time, in the order in which they are received
➔ Tuples are immediately processed and the result is emitted
Characteristics:
  For stateful computations over unbounded and bounded data streams
  Designed to run in all common cluster environments
  Performs computations at in-memory speed and at any scale
  Capable of integrating with all popular cluster resource managers, e.g. YARN, Mesos, Docker and Kubernetes
  Utilizes external storage systems, e.g. HDFS, S3, HBase, Kafka, Apache Flume, Cassandra, or any RDBMS
  Uses a master/slave architecture with a JobManager and TaskManagers
Advantages:
  + Highly efficient: high throughput with low latency
  + Exactly-once processing: each tuple is processed only one time, even if there are failures
  + Flexible windowing: dynamically analyzes and optimizes tasks
  + Savepoints

Apache Kafka Streams (actually not a framework)
A distributed data streaming platform that comes packaged with Apache Kafka
Can build real-time streaming data pipelines and applications
Advantages:
  + Low entry barrier and easy integration with other applications
  + Low latency
  + No need to have standard message brokers

Apache Storm
A distributed real-time big data processing system, originally developed by BackType
One of the first open-source stream processing frameworks developed to operate in a distributed environment
Advantages and disadvantages:
  + Efficiently performs at low latency
  + Highest ingestion rates
  - Lower adoption than other frameworks

Apache Samza
A distributed stream processing framework developed at LinkedIn in 2013
Offers built-in integrations with Apache Kafka, AWS Kinesis, Azure EventHubs, ElasticSearch and Apache Hadoop
Runs stream processing as a managed service by integrating with popular cluster managers like Apache YARN
Advantages:
  + State is stored on disk, so a job can maintain more state than would fit in memory
  + Offers incremental checkpoints and host affinity
  + Stable and fault tolerant
Tasks: logical units of parallelism
  Each task processes a subset of the input partitions and has its own storage
Worker: a container
  Runs one or more tasks
Coordinator: manages the assignment of tasks across the individual containers

4.4. Apache Spark Streaming
Spark Structured Streaming API
Represents a stream of data as a table to which data is continuously appended
➔ Developers can use SQL and the DataFrame/Dataset APIs to build applications (a small sketch follows below)

Input sources:
  Rate (for testing): automatically generates data with 2 columns (timestamp and value)
  Socket (for testing): listens to the specified socket and ingests any data
  File: listens to a particular directory for streaming data, e.g. CSV, JSON, ORC, or Parquet
  Kafka: reads data from Apache Kafka and is compatible with the Kafka broker

Output modes:
  Append mode: Spark outputs only newly processed rows since the last trigger
  Update mode: Spark outputs only updated rows since the last trigger
  Complete mode: Spark outputs all the rows it has processed so far

Output sinks: store results in external storage
  Console sink: displays the content of the DataFrame on the console
  File sink: stores the contents of a DataFrame in a file within a directory
    Supported file formats are CSV, JSON, ORC, and Parquet
  Kafka sink: publishes data to a Kafka topic
  Foreach sink: applies a function to each row of a DataFrame
  ForeachBatch sink: applies a function to each micro-batch of a DataFrame

Triggers:
  Default: executes a micro-batch as soon as the previous one finishes
  Fixed-interval micro-batches: specifies the interval at which the micro-batches are executed, e.g. 1 minute, 30 seconds or 1 hour
  One-time micro-batch: executes only one micro-batch to process all available data and then stops

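A small Structured Streaming sketch combining the rate test source, a tumbling event-time window, the console sink in complete mode, and a fixed-interval trigger; it assumes pyspark is installed and runs in local mode, and the window size and rate are arbitrary choices.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.master("local[*]").appName("structured-streaming-demo").getOrCreate()

# Input source: 'rate' (for testing) generates rows with 'timestamp' and 'value' columns
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Tumbling window on event time: non-overlapping 10-second segments, aggregated per window
counts = stream.groupBy(window(stream.timestamp, "10 seconds")).count()

# Output sink: console, complete mode (all rows so far), fixed-interval micro-batch trigger
query = (counts.writeStream
               .outputMode("complete")
               .format("console")
               .trigger(processingTime="10 seconds")
               .start())

query.awaitTermination(30)   # run for ~30 seconds, then stop
query.stop()
spark.stop()
```
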
4.5. Difference Flink and Spark
Apache Flink and Apache Spark are both open-source distributed processing frameworks designed for big data analytics. However, there are several key differences between the two frameworks:
Programming model:
  o Flink uses a flexible dataflow programming model, which allows developers to build arbitrary data processing pipelines using a variety of operators
  o Spark uses a more restricted programming model based on transformations and actions on distributed datasets
Execution model:
  o Flink's execution model is based on a continuous streaming model, which means that it processes data one record at a time as it is received
  o Spark's execution model is based on micro-batching, which means that it processes data in small batches (typically on the order of milliseconds)
Fault tolerance: both Flink and Spark support fault tolerance
  o Flink uses a checkpoint-based model: it periodically saves the state of the streaming application and can recover from failures by restoring the latest checkpoint
  o Spark uses a lineage-based model: it tracks the transformations applied to a dataset and can recompute lost data on failure
API support: Flink and Spark both support a wide range of APIs, including SQL, DataFrames/Datasets, and machine learning libraries
  They differ in the level of support and the specific features that are supported

4.6. Examples
Twitter:
  Twitter processes a large volume of data in real time (tweets, likes, …)
  Uses Apache Storm to process the data streams, extract insights and trigger actions based on the data
  These insights are used to optimize the platform for users and advertisers
LinkedIn:
  LinkedIn processes a large amount of data in real time, including user activity data, job postings, and other types of data
  Uses Kafka, an open-source distributed streaming platform developed by the Apache Software Foundation
  Processes and analyzes streams of data in real time to support a wide range of use cases, including real-time analytics, activity tracking, and data integration
  LinkedIn has also developed a number of custom stream processing systems that are built on top of Kafka, including a stream processing engine called Samza and a real-time data integration platform called Databus

5. Building data pipelines
Data pipelines
Sets of processes that move and transform data from various sources to a destination where new value can be derived

5.1. The Data Engineering Lifecycle
Generation
Source system, e.g. IoT device, application message queue, …
A data engineer consumes data from a source system but typically does not own the source system

Storage
Storage runs across the entire data engineering lifecycle, occurring in multiple places in a data pipeline (it interacts closely with the ingestion, transformation and serving phases)
Data access frequency determines the "temperature" of the data:
  Hot data: is commonly retrieved multiple times per day
  Lukewarm data: might be accessed every week/month or so
  Cold data: is seldom queried

Ingestion
The gathering of the data
Source systems and ingestion represent the most significant bottlenecks of the lifecycle
Characteristics:
  Batch versus streaming:
    o Batch: data comes in in parts
    o Streaming: data comes in continuously
  Pull versus push:
    o Pull-based system: target systems retrieve data from the source system at the moment they need it, by making a request to the source system and waiting for a response with the desired data
      Often used when the target system needs to have the most up-to-date data from the source or when data transfer needs to be triggered by the target system
    o Push-based system: the source system actively sends data to the target system without a request
      The target system then processes it as needed
      Often used when data transfer needs to be triggered by the source

Transformation
Change the data from its original form into something useful
➔ Create value in the data pipeline

Serving
Doing something with the data
Unconsumed data has no value

5.2. Common Data Pipeline Patterns
ELT and ETL
Patterns used to feed data into a data warehouse:
  Extract: gathers the data from various sources
  Load: brings the raw data (ELT) or the transformed data (ETL) to its final destination
  Transform: raw data is combined and formatted to be useful for analysts and visualization tools
Extracting and loading are often tightly coupled
➔ The combination of loading and extracting is sometimes called data ingestion
ELT > ETL: in a column-based warehouse everything is already easily accessible, and from there you can transform it into what you need for a specific purpose

EtLT subpattern (a toy sketch follows below)
Although ELT became more popular, it became clear that doing a light transformation after extraction but before loading is still beneficial
➔ Extract – transform – Load – Transform
Examples:
  Deduplicate records in a table
  Parse URL parameters into individual components

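A toy sketch of the EtLT idea using only the standard library: the small "t" deduplicates records and parses URL parameters before loading, while heavier modelling is left for after the load. The record layout and URLs are invented.

```python
from urllib.parse import urlparse, parse_qs

# Extract: raw click events as they might arrive from a source system (invented layout)
raw_events = [
    {"user": "ann", "url": "https://shop.example/item?sku=123&ref=mail"},
    {"user": "ann", "url": "https://shop.example/item?sku=123&ref=mail"},   # duplicate
    {"user": "bob", "url": "https://shop.example/item?sku=456&ref=ad"},
]

# transform: deduplicate records and parse URL parameters into individual components
seen, staged = set(), []
for event in raw_events:
    key = (event["user"], event["url"])
    if key in seen:
        continue
    seen.add(key)
    params = {k: v[0] for k, v in parse_qs(urlparse(event["url"]).query).items()}
    staged.append({"user": event["user"], **params})

# Load: hand the staged rows to the warehouse loader (here we just print them)
print(staged)   # [{'user': 'ann', 'sku': '123', 'ref': 'mail'}, {'user': 'bob', 'sku': '456', 'ref': 'ad'}]
```
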
5.3. Data Ingestion with Apache Kafka
Apache Kafka
The shift to event-driven systems has begun
➔ Need for a single platform to connect everyone to every event
➔ To ingest a real-time stream of events
➔ Need to store all events for a historical view
Examples:
  Automotive: a car is a distributed system of sensors and services generating events in real time, improving safety and user experience
  Real-time e-commerce: new merchants, increased speed at which mobile applications are delivered to customers, enabled a full 360° view of customers, enhanced performance and monitoring, projected savings of millions of dollars
  Online gaming: data pipelining, increased reliability, accurate real-time data, ability to process data at scale, faster ramp time
  Customer 360: improved data integration, increased up-sell and cross-sell opportunities, increased scalability and flexibility, saved costs
  Government: near real-time events and better data quality, increased efficiency, ability to change their organization, produce and store population data from several sources, reduce welfare crime through strengthened identity management, provide better privacy and meet GDPR requirements
Company examples:
  LinkedIn: activity streams, operational metrics, data bus – 400 nodes, 18k topics, 220B msg/day (peak 3.2M msg/s), May 2014
  Netflix: real-time monitoring and event processing
  Twitter: as part of their Storm real-time data pipelines
  Spotify: log delivery (from 4h down to 10s), Hadoop

5.4. Kafka Fundamentals
Publish/subscribe messaging
The sender (publisher) of a piece of data (message) does not direct it to a specific receiver (subscriber), but classifies the message into a certain class
A subscriber subscribes to receive messages of a certain class
Traditional setups end up with multiple messaging queues, which has several downsides:
  Duplication
  Maintaining multiple systems
  Expansion
Enter Kafka to solve the publish/subscribe problem (a minimal producer/consumer sketch follows below)
  Producers: write data to brokers
  Consumers: read data from brokers
  Data is stored in topics
  Topics are split into partitions
  Partitions are replicated
Zookeeper: helps consumers, producers and brokers to manage data

Messages: key/value pairs
Batch: collection of messages
  Typically produced to the same topic
Topic: named container for similar events
  Data can be duplicated between topics
  Durable logs of events:
    o Append only
    o Can only seek by offset, not indexed
  Events are immutable; the retention period is configurable
Partitions: ordered and immutable sequences of messages which are continually appended to
Producer: creates new messages for a specific topic
  Assigns the message to a partition through a key
Consumer: reads the data from a topic
  Offsets: an integer value that continually increases to keep track of what has been read
Broker: a single Kafka server
  Receives messages from producers, assigns offsets and commits messages to storage on disk
Cluster: set of brokers that work together
  One broker acts as the controller of the cluster
  The leader of a partition is the broker that owns that partition
  A partition may be replicated and assigned to multiple brokers
Replication: copies of data for fault tolerance
  There is always one lead partition and N-1 followers
  In general, writes and reads happen on the leader
  An invisible process to most developers
  Tunable in the producer

Design of Kafka
Sequential I/O:
  Each topic has an ever-growing log (a list of files)
  A message is addressed by a log offset
  Everything is stored sequentially
Zero-copy principle

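A minimal producer/consumer sketch using the kafka-python package; it assumes a broker on localhost:9092, and the topic name, keys and payloads are invented.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

TOPIC = "clickstream"

# Producer: classifies messages into a topic; the key determines the partition
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode(),
)
producer.send(TOPIC, key="user-ann", value={"page": "/home"})
producer.send(TOPIC, key="user-bob", value={"page": "/cart"})
producer.flush()

# Consumer: subscribes to the topic and tracks its position with offsets
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers="localhost:9092",
    group_id="demo-group",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,          # stop iterating when no new messages arrive
    value_deserializer=lambda v: json.loads(v.decode()),
)
for message in consumer:
    print(message.partition, message.offset, message.key, message.value)
```
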
5.5. Separation of compute and data
Trend to separate the systems where data is stored from the systems where data is computed
Example of a big data flow:

5.6. Orchestration
The process of coordinating many jobs to run as quickly and efficiently as possible on a scheduled cadence
➔ Abstract data access across storage systems
➔ Virtualize all the data
➔ Present the data via standardized APIs with a global namespace to data-driven applications
Orchestration ≠ scheduling  Schedulers are only aware of time (triggering tasks at specific times)
Workflow composition:
Combine different (heterogeneous) analytical tasks
Should provide guidance for domain experts to define and manage the entire pipeline
Uses several techniques: script-based, event-based or adaptive orchestration
Workflow mapping:
Map the graph of analytic tasks to big data programming platforms, cloud resources and edge resources
Cross-layer resource configuration

Directed Acyclic Graph (DAG)
The form in which an orchestration engine builds in metadata (data about data) on job dependencies
Directed: each edge has a direction
Acyclic: no loops

Orchestration challenges
Complexity: orchestration processes can become very complex
Heterogeneous architectures: different systems need to collaborate
Automating data cleansing and stitching
Regulations and compliance
Data governance

5.7. Big Data Programming Models/Frameworks
Current approaches and techniques
Cloud platform integration: mask the heterogeneity of different cloud platforms; a uniform way of accessing cloud resources
Two approaches:
o Standardization approach: standardize interfaces, but it is very difficult to agree on common standards
o Intermediation approach: an intermediate layer (or middleware) that hides proprietary APIs (e.g. OpenNebula – use case booking.com)
Resource provisioning: select the cloud resources needed to execute the tasks
Two approaches:
o Static resource provisioning: decides which resources are needed before the execution of the workflow; not able to scale dynamically while running
o Dynamic resource provisioning: decides which resources are needed during execution

5.8. Workflow Management using Apache Airflow
A platform to programmatically author, schedule and monitor workflows using DAGs

Overview
Main components: scheduler, workers, webserver
The scheduler

Features
Incremental loading:
Airflow defines discrete time slots to run processes (e.g. every day, every month, ...)
Each DAG run processes only the data for the corresponding time slot (the data's delta) instead of having to reprocess the entire data set every time
Backfilling:
A new or updated DAG is applied to historical schedule intervals that occurred in the past
Create (or backfill) new data for historical intervals

Tasks and operators
Operators have a single piece of responsibility and perform a single piece of work (task)
A minimal DAG sketch follows below.
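A minimal sketch of such a DAG, assuming Airflow 2.x; the DAG id, task names and the extract/transform callables are hypothetical, and only illustrate scheduling per time slot, incremental loading via the logical date, and backfilling via catchup:

```python
# Illustrative Airflow DAG: two dependent tasks, daily schedule, backfill enabled.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(ds, **_):
    # 'ds' is the logical date of the run, so each run only touches that day's delta
    print(f"extracting events for {ds}")


def transform(ds, **_):
    print(f"transforming events for {ds}")


with DAG(
    dag_id="daily_events_pipeline",   # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",       # one run per discrete time slot
    catchup=True,                     # backfill all past intervals since start_date
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    extract_task >> transform_task    # DAG edge: a directed, acyclic dependency
```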
6. Data visualization

6.1. Introduction

6.2. Ways to visualize data
Our brains are very good at recognizing visual patterns, BUT we are change blind

Univariate data (values)
Consists of observations on only one characteristic or attribute
Altimeter:
Boxplot:
Violin plot: the width of each curve corresponds with the approximate frequency of data points in each region
Histogram:
Bargram:
Word cloud:
Single numbers:

Bivariate data (values)
Data on each of two variables, where each value of one variable is paired with a value of the other variable
Anscombe's quartet: a group of datasets (x, y) that have the same mean, standard deviation and regression line, but which are qualitatively different
➔ Shows the importance of looking at datasets graphically instead of only focusing on statistical properties
Scatterplot:
Time series: only add a line between dots if there is a relationship to show
Area chart:

Trivariate data (values)
Observations are represented by triplets of values (x, y, z), where the third variable z is typically numeric and continuous, whereas x and y can be either numerical (discrete or continuous) or categorical
Scatterplot matrix:
Linked histogram:
Time series:
Dot diagram:
Choropleth:

Hypervariate data (values)
Dimension > 3
Parallel coordinates:
A scatterplot can be represented by two lines
Six objects, represented by seven attributes
The trade-off between A and B and the correlation between B and C are immediately apparent
The trade-off between B and E and the correlation between C and G are not
Star plot: same as parallel coordinates, but all axes cross each other
Mosaic plot:
Chernoff faces:
Multidimensional icons: represent eight attributes of a dwelling

Lines (relations)
Networks:
Chord diagram:
Sankey diagram:
Flow map:

Maps and diagrams
Venn diagram:
Cluster map:
Tree representations:
Cone tree:
Tree map:
Slice-and-dice construction:
Hyperbolic tree:
Sunburst chart:
Enclosure diagrams (and circle packing):
Pie chart:

6.3. Workflow
1) Question: what is studied?
2) Acquire: get the data needed to study the subject
3) Parse: the data is analyzed and divided into components to better understand its structure and meaning
4) Filter: filter the parsed data files and drop unnecessary values/files
5) Mine: get relevant values from the parsed data files
6) Represent: put the data in a compatible diagram
7) Refine: refine the diagram to enhance clarity; add contrast with colors
8) Interact: highlight interesting information visible in the diagram
Example of the workflow
1) Question: how do zip codes relate to geographic data?
2) Acquire
3) Parse
4) Filter
5) Mine
6) Represent
7) Refine: refine the diagram to enhance clarity; add contrast with colors
8) Interact: highlight interesting information visible in the diagram

6.4. Visualization principles
General design principles:
1. Show data above all else
2. Maximize the data-ink ratio, within reason (see the sketch after this list)
3. Erase non-data ink, within reason
4. Erase redundant data-ink
5. Revise and edit
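A small sketch of principles 2–4, assuming matplotlib; the region names and revenue values are made up, and the specific choices (removing spines, grid and y-ticks, labelling bars directly) are one possible way to strip non-data and redundant ink, not the course's reference solution:

```python
# Illustrative data-ink maximization: erase non-data ink, then erase redundant data-ink.
import matplotlib.pyplot as plt

regions = ["North", "East", "South", "West"]
revenue = [42, 35, 51, 28]

fig, ax = plt.subplots()
bars = ax.bar(regions, revenue, color="steelblue")

# Erase non-data ink: frame, grid and y-axis ticks add no information here
for side in ("top", "right", "left"):
    ax.spines[side].set_visible(False)
ax.grid(False)
ax.set_yticks([])

# Erase redundant data-ink: label the bars directly instead of keeping a y-axis scale
for bar, value in zip(bars, revenue):
    ax.text(bar.get_x() + bar.get_width() / 2, value + 1, str(value), ha="center")

ax.set_title("Revenue per region")
plt.show()
```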
Data-ink ratio = data-ink / total ink used to print the graphic
This should be maximized as much as possible, within reason
Maximize data-ink ratio:

Pre-attentive processing
The subconscious accumulation of information from the environment
Parallel processing; happens before attentive processing
Attentive processing = consciously focusing on specific information while other information is ignored; sequential processing
The "odd one out" can quickly be identified pre-attentively

Encoding methods
Magnitude estimation: the accuracy of judgement of encoded quantitative data differs per encoding
The bottom three encodings in the ranking should be used if we need to encode hypervariate data
Bertin's guidance: shows the suitability of various encoding methods to support common tasks
Encoding quantitative, ordinal and categorical data:
Mackinlay:

6.5. Colors
Can help us break camouflage, BUT take colorblindness into account
➔ Red, green, yellow and blue are good choices
Use color hue for categorical data
Use saturation for ordinal and quantitative data
Use luminance for ordinal and quantitative data
Sequential color schemes: sets of colors logically arranged from high to low; a stepped sequence is represented by sequential lightness steps
Diverging color schemes: sets based on two different hues that diverge from a shared light color (the midpoint) toward darker colors of different hues at each extreme
Qualitative color schemes: use differences in hue to represent nominal differences
(A small colormap sketch follows at the end of this section.)
CONCLUSION: color guidelines
Use more saturated colors for small symbols, thin lines, or small areas
Use less saturated colors for large areas
Use easy-to-remember and consistent color codes in color palettes
Red, green, blue and yellow are hard-wired into the brain as primaries: if it is necessary to remember a color coding, these colors are the first that should be considered

Haloing effect
Enhancing the edges; luminance contrast as a highlighting method
Chromostereopsis
A visual illusion whereby an impression of depth is conveyed in a two-dimensional color image
 If two pure colors that lie far apart in wavelength are used in the same image, our eyes cannot focus on both of them
CONCLUSION: interaction between colors
Different colors can be used for highlighting and for creating 3D effects
Use colors that are less saturated
Surround the contrasting colors with a background that decreases the effect of their different wavelengths
Separate contrasting colors
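A hedged illustration of the three scheme types using matplotlib's built-in colormaps; the mapping of scheme type to colormap name ('Blues', 'RdBu', 'tab10') is my own choice of example, not prescribed by the course:

```python
# Sequential, diverging and qualitative color schemes rendered as horizontal strips.
import numpy as np
import matplotlib.pyplot as plt

gradient = np.linspace(0, 1, 256).reshape(1, -1)
schemes = {
    "Sequential (ordered low -> high)": "Blues",
    "Diverging (two hues around a light midpoint)": "RdBu",
    "Qualitative (distinct hues for categories)": "tab10",
}

fig, axes = plt.subplots(len(schemes), 1, figsize=(6, 2))
for ax, (label, cmap) in zip(axes, schemes.items()):
    ax.imshow(gradient, aspect="auto", cmap=cmap)  # draw the colormap as a strip
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_title(label, fontsize=8, loc="left")
plt.tight_layout()
plt.show()
```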
6.6. Dashboards
A visual display of all the most important information needed to achieve one or more objectives
The data should be displayed on a single screen so all information is available at a glance
➔ Small, concise, clear, intuitive display mechanisms
➔ Customizable
Categories

Common mistakes in dashboard design
Exceeding the boundaries of a single screen: requiring scrolling/clicking
o Fragmenting data into separate screens:
▪ Separated into discrete screens to which one must navigate
▪ Separated into different instances of a single screen that are accessed through some form of interaction
Supplying insufficient context for the data
o No units, no labels, …
Displaying excessive detail or precision
o Time expressed to the second, too many digits after the comma, irrelevant data
Choosing a deficient measure
o Use of measures that fail to directly express the intended message
o The graph does not emphasize the deviation from a target
Choosing inappropriate display media
o A common problem with pie charts: pie charts are (even when used correctly) difficult to interpret
o Encoding of useless values onto a map
Introducing meaningless variety
o Different types of representations for the same type of data
Using poorly designed display media
o Legends force our eyes to go back and forth between the graph and the legend
o Random order of data in the chart vs. in the legend
o Bright colors cause sensory overkill; only use them to highlight important things
o Colors that are too much alike
o Occlusion: some data blocks the visibility of other data
Encoding quantitative data inaccurately
o The Y or X axis of a chart should always start at 0
Arranging data poorly
o The most important data should stand out the most, starting in the upper-left corner
o Data that is to be compared should be arranged to encourage comparisons
o Data that should not be compared should be visualized in different ways
Highlighting important data ineffectively or not at all
o If everything has color, nothing will attract attention
Cluttering the display
o Using useless and dysfunctional decoration
Misusing or overusing color
o Too much color undermines its power
Designing an unattractive visual display

6.7. Strategies to create effective dashboards
The fundamental challenge of dashboard design is to effectively display a great deal of often disparate data in a small amount of space
Characteristics of a well-designed dashboard
Well organized
Condensed, primarily in the form of summaries and exceptions
Specific to and customized for the dashboard's audience and objectives
Displayed using concise and often small media that communicate the data and its message in the clearest and most direct way possible.
Condensing information
Summarization: representing a set of numbers as a single number
The two most common summaries are sums and averages
Exceptions: the handful of critical values out of hundreds that require attention
Reducing the non-data pixels
Eliminate all unnecessary non-data pixels
De-emphasize and regularize the non-data pixels that remain
Enhancing the data pixels
Eliminate all unnecessary data pixels
Highlight the most important data pixels that remain
Designing dashboards for usability
Organize information to support meaning and use:
Organize groups according to business functions, entities, and use
Co-locate items that belong to the same group
Delineate groups using the least visible means
Support meaningful comparisons
Discourage meaningless comparisons
Make the viewing experience aesthetically pleasing:
Choose colors appropriately
Choose the right font
Design for use as a launch pad:
Dashboards should almost always be designed for interaction
Types of interactions: drilling down into details, slicing the data to narrow the field of focus

6.8. Examples

Example 1
+ Color has been used only when needed
+ The prime real estate on the screen has been used for the most important data
+ Small, concise display media have been used to support the display of a dense set of data in a small amount of space
+ Some measures have been presented both graphically and as text
+ The display of quarter-to-date revenue per region combines the actual and pipeline values in the form of stacked bars
+ White space alone has been used to delineate and group data
+ The dashboard has not been cluttered with instructions and descriptions that will seldom be needed

Example 2
- Relies almost entirely on text to communicate
o The only visuals are the green, light red, and vibrant red hues
o Dashboards are meant to provide immediate insight
o Text requires reading, a serial process that is much slower than the parallel processing of a visually oriented dashboard that makes good use of pre-attentive attributes
- To compare actual measures to their targets, mental math is required
- Numbers have been center-justified, rather than right-justified.
o This makes them harder to compare when scanning up and down a column
- All four quarters of the current year have been given equal emphasis
o A sales manager would have greater interest in the current quarter
o The design of the dashboard should have focused on the current quarter and comparatively reduced emphasis on the other quarters
- The subdued shade of red and the equally subdued shade of green might not be distinguishable for colorblind people
- The numbers that ought to stand out most are the hardest to read against the dark red background

Example 3
- The grid lines that appear in the tables are not needed at all
o Even if they were needed, they should have been muted visually
- The grid lines that appear in the graphs are also unnecessary
o They distract from the data
- The drop shadows on the bars and lines in two of the graphs and on the pie chart are visual fluff
o These elements serve only to distract
- All of the numbers in the tables have been expressed as percentages
o If those who use this dashboard only care about performance relative to targets, this is fine, but it is likely that they will want a sense of the actual amounts as well
- The pie chart is not the most effective display medium
o Assuming that it is worthwhile to display how the 90% probability portion of the revenue pipeline is distributed among the regions, a bar graph with the regions in ranked order would have communicated this information more effectively
- Overall, this dashboard exhibits too many bright colors
o The dashboard as a whole is visually overwhelming and fails to feature the most important data
- There is no comparison of trends in the revenue history
o The 12-month revenue history shown in the line graph is useful, but it would also have been useful to see this history per region and per product, to allow the comparison of trends

Example 4
- The display media were not well chosen
o The circular representations of time-series data using hues to encode states of performance are good, BUT for the purpose of showing history they are not as intuitive or informative as a linear display, such as a line graph
- None of the measures that appear on the left side of the dashboard is revealed beyond its performance state
o Knowing the actual revenue amount and more about how it compares to the target would certainly be useful to a sales manager
- The circular display mechanisms treat all periods of time equally
o There is no emphasis on the current quarter
- Gradient fill colors in the background of the bar graphs add meaningless visual interest
o They also influence perception of the values encoded by the bars in subtle ways
o Bars that extend into the darker sections of the gradient appear slightly different from those that extend only into the lighter sections

7. Data Privacy and General Data Protection Regulation
7.1. Privacy
The claim of individuals, groups and institutions to determine for themselves when, how and to what extent information about them is communicated to others

3 dimensions of privacy
Personal privacy: protecting a person against unnecessary interference (such as physical searches) and against information that violates his/her moral sense
Territorial privacy: protecting a physical area surrounding a person that may not be violated without the knowledge of that person
Informational privacy: deals with the gathering, compilation and selective dissemination of information

How to protect privacy
Laws and developments by the government
Institutions, groups or individuals can engage in self-regulation for fair information practices, using house rules or codes of conduct
Individuals can protect their data themselves by using privacy-enhancing technologies (PETs) (encryption, coding, concealing, de-identification, …)
Educating consumers and IT professionals about privacy
General Data Protection Regulation (GDPR): the regulation through which the government guarantees privacy

Types of data
Explicit data: data explicitly exchanged with others; mostly information that you want to share
Implicit metadata: data that is sent along, probably without you knowing
Inferred data: data that can be derived from both types together
EXAMPLE: Bob takes pictures of Alice
The action that sends the pictures automatically to the cloud triggers the inference of the inferred data
In the metadata you can find tagged people and places, location coordinates and even camera specs

Basic privacy principles
Lawfulness of processing:
The person from whom the data is gathered should be informed about how it is processed, where it is stored, for how long and who has access to it (in detail)
The person needs to consent to this and give approval of all detailed steps
Data minimisation and avoidance:
Minimization of data collection, use, sharing, linkability, retention
Data should be adequate, relevant and not excessive ➔ only record what is needed
Purpose specification and purpose binding:
"Non-sensitive" data does not exist ➔ any data can carry knowledge that can be misused
The purpose of the data should be clearly stated, and you have to stick to that purpose
EXAMPLE: an access card for buildings can, but should not, be used to track the working times of employees
Transparency and intervenability:
Openness about developments, practices and policies with respect to personal data
Appropriate security:
Personal data should be protected within reason against loss, unauthorised access, destruction, use, modification or disclosure
Accountability:
The data controller should be accountable for complying with measures which give effect to the principles

Privacy threat types (the LINDDUN model)
A threat is a potential negative action or event, facilitated by a vulnerability, that results in an unwanted impact on a computer system or application
Can be intentional or accidental
A threat model is used to help with reasoning about and finding flaws in a system

Data disclosure: unnecessary use of data
Excessively collecting, storing, processing or sharing personal data
The data minimization principle is key here: only the strictly necessary data should be collected or disclosed
Can be explicit (intended) or implicit (through metadata or derived from other data)
Unnecessary can mean:
Unnecessary data types: data type sensitivity and the level of detail and encoding
Excessive data volume: the amount and frequency of the data processing, and the number of involved data subjects, can be too high
Unnecessary processing: the further data treatment (analysis, propagation or retention) can be considered not strictly necessary
Excessive exposure: the number of parties with whom the data is shared can be considered too high, as can how widely accessible the data is
EXAMPLES:
Health monitoring apps should restrict the collection and processing of a patient's medical parameters to those strictly needed, e.g. a dietary app should not record heart rate measurements
When someone posts on a social network, the post often also includes personal information about other people
When a user unsubscribes from a newsletter, the system should no longer retain their email address
When a website uses third-party tracking and analytics services, personal data about the website user and their site usage is transferred to external parties
A heatmap built from the Strava data of US soldiers disclosed their running trajectories to the public, so people could infer that there was a military base there

Linking: learning more about an individual/group by matching data items together
May lead to unwanted privacy implications
Data items or user actions can be linked through:
identifiers: a username, IP address, phone model, sensor type, …
the combination of data items or sets: a browser fingerprint, a location trace
profiling, derivation or inference
EXAMPLES:
Singling out a (pseudonymous) individual: a person with email address '[email protected]' has posted several hate messages on an online discussion board, so the account is suspended (but we cannot identify this person)
Tying 'anonymous' user actions together: a travel booking website may track browser fingerprints and increase rates if a user revisits the site for the same flight/hotel details

Identifying: learning the identity of an individual, determination of the person's details
Identity can be revealed through leaks, deduction or inference in cases where this is not intentional or desired
Can be done through:
identified data: when the system explicitly maintains the link to the identity, or refers to the identity
identifiable data: the identity can be derived indirectly (from pseudonyms, identity-revealing content or through a small anonymity set)
o Identifying threats are also related to linking
o (Re-)identification can also result in unawareness and non-compliance issues
EXAMPLES:
The processing of identified info: when subscribing to a newspaper website, users are forced to register with their full name and address, which enables the identification of webpage views
o The user can no longer browse the newspaper anonymously
The data subject is distinguishable from others
o E.g. querying the average salary for women at a small company when there is only a single female employee
Revealing attributes: when a customer submits a feedback form and provides very detailed info, they can be uniquely identified
o E.g. buying a car at a dealer's and sending feedback including car make, model, dealer and zip code
NOTE! Linking vs. identifying is like playing "guess who" vs.
winning "guess who"

Detecting: deducing subject involvement by observing the existence of relevant information
Merely knowing that the data exists is sufficient to infer more data
Can be done through:
Observed communications: the communication to a system or service, or between different systems, can be observed
Observed application side-effects: actions in an application have side-effects that can be observed, e.g. temporary files in the file system
Evoked system responses: the system leaks the existence of certain data through its responses
These threats also relate to linking and identifying, as the deduced info can be used to extract more info on an individual
EXAMPLES:
Wireless transmissions are detectable by other devices nearby
A deleted app on a smartphone may leave behind a trail of configuration and temporary files, which can be used by an adversary to learn more about the phone owner
By attempting to send an email to a suspected mail user and observing any bounces, an adversary may derive that certain email addresses are indeed in use and that the recipient is a registered user
By attempting to access a celebrity's health record in a rehab facility and receiving an 'insufficient access rights' reply, one can infer that the celebrity has been or is being treated for an addiction, even without having access to the actual record

Non-repudiation: proof of a claim about an individual
The system maintains evidence, attributed to an individual, regarding a certain action or fact, which causes loss of plausible deniability (log files, digital signatures, document metadata, watermarked data)
EXAMPLES:
Certain word processors retain author and revision metadata within the document, and thus prevent deniability claims regarding document authorship or review
Read notifications provide evidence about a user having opened/read a message
The data collected by a menstruation tracking app could be used as evidence for the prosecution of individuals in the enforcement of abortion laws (Roe v. Wade)

Unawareness: insufficiently informing about the processing of personal data, and lack of data subject control
Caused by:
Lack of transparency: data subjects are not aware of the collection and/or processing of their personal data or the personal data of others
Lack of feedback: data subjects are insufficiently informed about the privacy impact they may cause to others by using the system
Lack of intervenability: data subjects cannot access or manage their personal data
EXAMPLES:
Traffic cameras that collect and process personal data using face recognition and license plate recognition techniques, without informing individuals
When using a social media platform, users are not informed about the implications of sharing pictures that include other people
When data subjects cannot access or modify their own data, or (easily) update their privacy settings

Non-compliance: lack of adherence to legislation, regulation, standards and best practices
Main characteristics:
Unlawfulness
