CONCEPTS IN DATA MANAGEMENT
LIBFEXDLMDMCDM01

MASTHEAD

Publisher: The London Institute of Banking & Finance, 8th Floor, Peninsular House, 36 Monument Street, London EC3R 8LJ, United Kingdom
Administrative Centre Address: 4-9 Burgate Lane, Canterbury, Kent CT1 2XJ, United Kingdom

LIBFEXDLMDMCDM01
Version No.: 001-2024-0328
Norman Hofer
Cover image: Adobe Stock, 2024.

© 2024 The London Institute of Banking & Finance. This course book is protected by copyright. All rights reserved. This course book may not be reproduced and/or electronically edited, duplicated, or distributed in any form without written permission from The London Institute of Banking & Finance. The authors/publishers have identified the authors and sources of all graphics to the best of their abilities. However, if any erroneous information has been provided, please notify us accordingly.

TABLE OF CONTENTS

Introduction
  Signposts Throughout the Course Book
  Learning Objectives

Unit 1 The Data Processing Lifecycle
  1.1 Data Ingestion and Integration
  1.2 Data Processing
  1.3 Data Storage
  1.4 Data Analysis
  1.5 Reporting

Unit 2 Data Protection and Security
  2.1 Ethics in Data Handling
  2.2 Data Protection Principles
  2.3 Data Encryption
  2.4 Data Masking Strategy
  2.5 Data Security Principles & Risk Management

Unit 3 Distributed Data
  3.1 System Reliability and Data Replication
  3.2 Data Partitioning
  3.3 Processing Frameworks for Distributed Data

Unit 4 Data Quality and Data Governance
  4.1 Data and Process Integration
  4.2 Data Virtualization
  4.3 Data as a Service
  4.4 Data Governance

Unit 5 Data Modelling
  5.1 Entity Relationship Model
  5.2 Data Normalization
  5.3 Star and Snowflake Schema

Unit 6 Metadata Management
  6.1 Types of Metadata
  6.2 Metadata Repositories

Appendix
  List of References
  List of Tables and Figures

INTRODUCTION

WELCOME

SIGNPOSTS THROUGHOUT THE COURSE BOOK

This course book contains the core content for this course.
Additional learning materials can be found on the learning platform, but this course book should form the basis for your learning.

The content of this course book is divided into units, which are divided further into sections. Each section contains only one new key concept to allow you to quickly and efficiently add new learning material to your existing knowledge. At the end of each section of the digital course book, you will find self-check questions. These questions are designed to help you check whether you have understood the concepts in each section.

For all modules with a final exam, you must complete the knowledge tests on the learning platform. You will pass the knowledge test for each unit when you answer at least 80% of the questions correctly. When you have passed the knowledge tests for all the units, the course is considered finished and you will be able to register for the final assessment. Please ensure that you complete the evaluation prior to registering for the assessment.

Good luck!

LEARNING OBJECTIVES

In this course, you will learn about data management, which is concerned with developing strategies to govern data in a secure, cost-effective, and efficient way in order to preserve and increase the value of the data. Concepts in Data Management covers all data lifecycle processes, from data collection to analysis, with the final objective of transforming data into valuable information that supports operational decision-making.

The course covers these data management topics to provide you with the necessary skills and knowledge for professional data management. It does not go into detail on each subject but gives you a sound overview of the field.

In the first unit, you will learn about the data processing lifecycle, where data ingestion, storage, analysis, and reporting are discussed. Data protection and security are addressed in the second unit, covering essential topics such as ethics, data protection principles, and their technical solutions. The third unit concerns distributed data, where important technical aspects such as data replication, partitioning, and processing frameworks are reviewed. Data quality and data governance are the topics of the fourth unit, covering essential concepts such as Data as a Service (DaaS), data virtualization techniques, and general principles for data governance. Unit 5 is about data modeling, comprising entity relationship models, data normalization, and data schemas. Finally, metadata management is discussed in the sixth unit, covering the different types of metadata and metadata repositories.

UNIT 1
THE DATA PROCESSING LIFECYCLE

STUDY GOALS

On completion of this unit, you will be able to...

– plan and manage the different phases of the data processing lifecycle.
– select adequate technologies for data ingestion and data integration.
– distinguish batch and stream processing and implement a lambda architecture.
– select an adequate storage type.
– discuss different types of data analysis and data visualization techniques.

1. THE DATA PROCESSING LIFECYCLE

Introduction

In recent years, many technological solutions have emerged for analytical data applications, providing professional digitalization, storage, and analysis capabilities. Consequently, companies have engaged in constructing analytical data applications to optimize the efficiency of their business processes.
This is achieved, for instance, by shifting business management toward a data-driven approach, where decisions are based on evidence from data rather than traditional business rules.

Nevertheless, to do so, companies need IT professionals with the skills and knowledge to construct such data-analytical applications. Several complex processes must be resolved to transform the collected data into actionable information.

In this unit, you will learn about the data processing lifecycle, which describes the processes needed to transform data from its source into interpretable information. The data processing lifecycle comprises five phases:

– Data ingestion and integration: responsible for data collection from various sources and transformation of the data into a homogeneous storage format.
– Data processing: responsible for processing and transforming the data.
– Data storage: responsible for storing the data adequately.
– Data analysis: responsible for providing tools to analyze the data with, e.g., machine learning.
– Reporting: Business Intelligence applications or static reports and presentations.

In the following sections, we obtain an overview of the entire data lifecycle and learn about technical concepts, pertinent technologies, and frameworks.

1.1 Data Ingestion and Integration

The first phase of the data processing lifecycle is data ingestion and integration. Its processes collect data from sources and integrate it into storage in a suitable format. Data integration may require data transformation to conform to the format defined at the storage level. This is why the first two phases of the data lifecycle, data ingestion and processing, are often not clearly separated in the real world. Nevertheless, we try to shed light on each aspect separately in the following sections.

With the advances in digitalization technologies, the rise of the Internet of Things (IoT), so-called social networks, and mobile devices, data can be obtained from many distinct sources in various formats. In the following, some examples are given for data ingestion from different sources. When it comes to integrating data from multiple sources, we have to be aware of data heterogeneity. Loading data from different sources naturally involves numerous data types, formats, and storage solutions.

Data heterogeneity

Apart from the data format, for instance, spreadsheets, XML or JSON files, weblogs and cookies, SQL queries, and flat files, data might be integrated as structured, semi-structured, or unstructured data.

Data is classified as structured data when it conforms to a well-defined data schema. An example would be a data record with a person's name, address, and date of birth. Structured data often fits into a table. Each data point is a row of the table, and its attributes are the table columns. Each column contains a specific data type, such as a number, a string, or a boolean (true/false value). Structured data is usually stored in relational SQL databases but also in structured text files, such as CSV files, or binary files, such as Excel spreadsheets. Structured data usually proceeds from other systems, such as relational databases, a data warehouse, or an organization's legacy data systems.

Semi-structured data refers to data with some degree of schema that is not entirely defined by a data model. Examples include Hypertext Markup Language (HTML) files, Extensible Markup Language (XML) files, and JavaScript Object Notation (JSON) files.
In these cases, some structural information, such as tags, is available to separate elements, but the elements' size, order, and content can differ. Semi-structured data is often used on the web and with IoT devices. An example of integrating such data could be collecting information via an API to read the messages of an IoT sensor. The following shows two examples of semi-structured data in JSON format, one from a Twitter message and one from an IoT device.

Table 1: Semi-structured data

Twitter message:

{
  "created_at": "2026-10 18:16:26 2021",
  "id": 10401183211789707,
  "id_str": "732534959601",
  "text": "I'm quite sure the best way to solve the issue is getting in contact with the company …",
  "user": {},
  "entities": {}
}

IoT device:

{
  "timestamp": "2026-01T17:13:52.443Z",
  "type": "K2",
  "mac": "B32773625F",
  "name": "Sensor1",
  "temperature": 12.4,
  "humidity": 16.2
}

Unstructured data is data lacking any formal description of its schema. Examples are text files (DOCX or PDF), image files (JPG), video files (MP4), or audio files (WAV). Although such data are often easily interpretable by humans, they need preprocessing by computer systems to extract information. An example of ingesting unstructured data could be the integration of images from a video surveillance system or the collection of text documents uploaded by a customer to a repository.

Data integration and ingestion frameworks

From a technological point of view, data ingestion and integration are often implemented together as one technical solution. Here, we obtain a brief overview of the most prevalent frameworks for data integration and ingestion, but remember that these frameworks are also used for data processing.

Data integration frameworks have been used for a long time in data warehousing applications, described as an ETL process. ETL stands for the Extraction, Transformation, and Loading of data. The raw data are usually structured, and the transformation brings the data into a homogeneous format according to the structure defined at the destination. Transformation may comprise data cleaning, feature engineering, and enrichment, i.e., adding information from other data sources. For example, an ETL process can capture the measurements of a temperature sensor and integrate them, together with other sensor readings and the local weather forecast, into a unified monitoring system. The transformation step would convert the measurement, represented as an electric signal, to degrees Celsius, discard non-plausible readings (cleaning), and enrich the data record by adding the current weather forecast for the next two hours.

Apart from traditional ETL tools, there are other approaches to modern data ingestion, such as IoT hubs, digital twins, data pipeline orchestrators, tools for bulk imports, and data streaming platforms. It is beyond the scope of this overview, but take note that ETL tools are not the only option at this stage of the data lifecycle.
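To make the ETL idea concrete, the following is a minimal sketch in Python of the temperature-sensor example above. The conversion factor, the plausibility range, the forecast lookup, and the output file name are illustrative assumptions and not part of any specific ETL tool.

import json

RAW_TO_CELSIUS = 0.1          # assumed conversion factor from the raw sensor signal
PLAUSIBLE_RANGE = (-40, 60)   # readings outside this range are discarded (cleaning)

def extract(raw_message: str) -> dict:
    """Parse one semi-structured JSON message from the sensor API."""
    return json.loads(raw_message)

def transform(record: dict, forecast: dict) -> dict | None:
    """Convert the raw signal to degrees Celsius, clean, and enrich the record."""
    temperature = record["raw_signal"] * RAW_TO_CELSIUS
    if not PLAUSIBLE_RANGE[0] <= temperature <= PLAUSIBLE_RANGE[1]:
        return None  # implausible reading, drop it
    return {
        "sensor": record["name"],
        "timestamp": record["timestamp"],
        "temperature_c": round(temperature, 1),
        "forecast_next_2h": forecast.get(record["name"], "unknown"),  # enrichment
    }

def load(records: list[dict], path: str) -> None:
    """Append the transformed records to the destination file (the storage format)."""
    with open(path, "a", encoding="utf-8") as destination:
        for entry in records:
            destination.write(json.dumps(entry) + "\n")

messages = ['{"name": "Sensor1", "timestamp": "2026-01-01T17:13:52Z", "raw_signal": 124}',
            '{"name": "Sensor1", "timestamp": "2026-01-01T17:14:52Z", "raw_signal": 9999}']
forecast = {"Sensor1": "light rain"}

cleaned = [transform(extract(m), forecast) for m in messages]
load([r for r in cleaned if r is not None], "monitoring.jsonl")

Note how each of the three ETL steps stays small and independent; in a real tool, the same structure is configured rather than hand-coded.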
Features of a data integration tool

A modern data integration tool should fulfill several requirements (Alpoim et al., 2019, p. 605):

– support for different protocols to collect data from a wide variety of sources
– broad support for the hardware and operating systems on which data integration processes can be implemented
– scalability and adaptability
– integrated capabilities for data transformation operations, including fundamental and complex transformations
– security mechanisms to protect the data in the data transformation pipeline

In addition, many data integration tools also support visualizations of the data flow.

Challenges

One of the most critical challenges of data integration is the increasing variety and veracity of data sources. Connected to this is the challenge of processing large amounts of data at high velocity. There is also an increasing interest in implementing machine learning directly on edge devices, for example, through federated learning, which poses new data ingestion and integration challenges. Another vital challenge concerns cybersecurity. Data ingestion needs to be secure, trusted, and accountable. Data ingestion systems must use encryption to protect data during transport from the source to the data destination.

1.2 Data Processing

In the last couple of years, special data processing frameworks have been designed with the capability to efficiently store, access, and process large amounts of data in a reasonable time. The key to achieving this is that modern data processing frameworks can distribute storage and processing over several nodes.

Conceptually, a data processing framework transforms data in several steps. In some cases, the transformation is modeled as a Directed Acyclic Graph (DAG). In this approach, each task has an input and produces some output. The output of one task can be the input to another task, and the connection is directed, meaning that dependencies are clearly defined and there are no feedback loops in the pipeline. An orchestrator supervises the correct execution of the pipeline tasks, usually comprising a scheduler for task coordination, an executor to run the tasks, and a metadata component to monitor the pipeline state.

Batch versus stream data ingestion

Traditionally, ETL was used in batch processing, primarily for business intelligence use cases, where raw data were collected at given time intervals from data sources, transformed, and loaded into a centralized storage system. The ETL process can be automated by scheduled jobs so that the data aggregation is repetitive and does not require human intervention. For example, a typical batch job would be to collect all orders for the last month and create a monthly report based on these data.

Streaming data ingestion is an alternative to batch processing. With the rise of connected devices, social media, and mobile applications, data became available with much greater volume and velocity, requiring data capture in real time (or near real time). In batch processing, data are collected, e.g., once a day or weekly. Streaming data ingestion shortens these intervals to sub-seconds, or it processes data in an event-driven approach in which data (events) are processed as soon as they are available. In this case, the data-producing source notifies the stream processor of the availability of new data/events. Event-driven solutions are often implemented as a publish-subscribe architecture based on messages/events.
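The contrast can be sketched in a few framework-free Python lines; the in-memory queue below merely stands in for a real publish-subscribe message broker and is not a production pattern.

from queue import Queue

# Batch: the input is bounded, so a scheduled job can aggregate it in one run.
def monthly_report(orders: list[dict]) -> dict:
    return {"orders": len(orders), "revenue": sum(o["amount"] for o in orders)}

print(monthly_report([{"amount": 120.0}, {"amount": 80.5}]))

# Stream: events are processed one by one as soon as the producer publishes them.
events: Queue = Queue()          # stands in for a publish-subscribe broker

def publish(event: dict) -> None:
    events.put(event)            # the source notifies the processor of a new event

def consume() -> None:
    while not events.empty():
        event = events.get()
        print("processed event:", event)

publish({"sensor": "Sensor1", "temperature_c": 12.4})
publish({"sensor": "Sensor1", "temperature_c": 12.6})
consume()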
But you may have noticed that defining batch and stream processing merely by the length of the time intervals between data entries is somewhat flawed. Who decides which time interval is short enough to require stream processing, or long enough to count as batch processing? There is a more fundamental difference between the approaches than the length of the time intervals between data entries. The key conceptual difference between batch and stream processing is that batch data are bounded, whereas streaming data are unbounded. This means that, in batch processing, we know the size and number of data entries before we start the processing job. For streaming data, this is not the case, as these data do not begin or end anywhere.

The use cases of systems where data are ingested as streams include smart home applications, IoT applications for infrastructure monitoring and logistics, health monitoring applications, and retail monitoring applications, to name just a few.

Lambda architecture

Until now, we have divided data pipelines into streaming or batch processing. The frameworks and tools, like the ETL tools discussed above, are not limited to data ingestion but can also be used for more elaborate data processing before or after the data are stored. In the following, we will investigate how batch and stream processing can be combined into one processing architecture to leverage the advantages of both approaches.

The lambda architecture is a data-processing design pattern that aims to handle massive amounts of data by supporting both batch and stream processing (Alghamdi & Bellaiche, 2021, p. 563). It comprises a batch layer and a stream layer, the latter referred to as the speed layer. It uses an abstraction of the underlying implementation to provide a uniform programming interface that keeps the same application logic for both batch and stream processing. The objective is to take advantage of each type of processing approach: it combines the more complex analytical capabilities of batch processing with the quick, low-latency insights of stream processing. In the lambda architecture, the batch layer, as the name suggests, handles data processing in batches, while the speed layer implements stream processing capabilities. As a level of abstraction, the serving layer combines the results of the batch and speed layers into a unified data view that we can query.

The lambda architecture is very common in practice, but there is an easier-to-implement alternative, the kappa architecture (Feick, Kleer & Kohn, 2018, p. 2). In the kappa architecture, there is no real distinction between the speed and batch layers; they are integrated into one processing layer capable of both batch and stream processing. This simplifies the architecture, making it easier to set up and maintain, and reduces the attack surface in terms of the system's security.
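The lambda pattern can be sketched in plain Python as follows: a batch view is precomputed over the bounded master dataset, a speed view aggregates the most recent unbounded events, and a serving layer merges both on query. The data shapes and the merge rule are simplified assumptions, not a reference implementation.

from collections import defaultdict

master_dataset = [("Sensor1", 12.0), ("Sensor1", 12.4), ("Sensor2", 18.1)]  # historical data
recent_events = []                                                          # not yet seen by a batch run

def batch_layer() -> dict:
    """Recomputed periodically over the complete master dataset (accurate, high latency)."""
    view = defaultdict(list)
    for sensor, value in master_dataset:
        view[sensor].append(value)
    return {sensor: sum(vals) / len(vals) for sensor, vals in view.items()}

def speed_layer() -> dict:
    """Updated per event for data the last batch run has not seen yet (fast, approximate)."""
    view = defaultdict(list)
    for sensor, value in recent_events:
        view[sensor].append(value)
    return {sensor: sum(vals) / len(vals) for sensor, vals in view.items()}

def serving_layer(sensor: str) -> float:
    """Merge both views into one queryable result; here simply the mean of the two averages."""
    batch_view, speed_view = batch_layer(), speed_layer()
    views = [v[sensor] for v in (batch_view, speed_view) if sensor in v]
    return sum(views) / len(views)

recent_events.append(("Sensor1", 13.0))   # a new event arrives on the speed path
print(serving_layer("Sensor1"))           # combines historical and near real-time information

In a kappa architecture, the two aggregation functions above would collapse into a single stream-processing layer that also replays historical data.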
Data processing solutions in the cloud

Batch, stream, and hybrid processing frameworks have been implemented in many open-source Apache projects and technical solutions subsumed under the Hadoop ecosystem. Some examples include MapReduce, a method invented by Google for processing large amounts of data at high speed through parallelization; Apache Spark, an open-source framework for parallel processing; and Apache Kafka, an open-source event streaming platform. However, for many use cases, resource allocation is more efficient if we do not set up, administer, and maintain these solutions on our own but use a managed service offered by a cloud provider. In this section, we obtain an overview of typical data processing frameworks offered by cloud providers.

Note that there are many providers out there, of which Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP) are the three with the largest market share. It is beyond the scope of this section to go into detail about all available services and providers. Here, we use Microsoft Azure to obtain a brief overview of the possibilities in a typical cloud processing environment. Ultimately, we will see an example of a lambda architecture on Amazon Web Services.

Batch processing on Microsoft Azure

The Azure cloud platform provides solutions for both batch and stream processing. The following diagram shows a brief overview of the frameworks involved in an exemplary batch data analytical application with the Azure cloud. The processed data are often stored in a structured format in an analytical data store to be analyzed subsequently with an analytical or reporting tool. Notice that data processing and storage switched places in comparison to the data lifecycle. This is only one example, demonstrating that data projects are always use-case specific and that the definitions and principles have to be adapted accordingly.

Azure offers several analytical services for batch processing:

– Azure Synapse is a distributed data warehouse offering advanced analytics capabilities.
– Azure Data Lake Analytics offers analytical services optimized for Azure Data Lake.
– HDInsight is a managed offering for Hadoop technologies, such as MapReduce, Hive (a data warehouse solution built on Apache Hadoop), and Pig (an extension to Apache Hadoop for analyzing large data sets).
– Databricks is a managed large-scale analytical platform based on Spark.

Stream processing on Microsoft Azure

Similar to the batch example above, the following diagram shows an exemplary architecture for a stream processing application on Azure. Some examples of services with stream processing capabilities on Azure include:

– HDInsight with Spark Streaming or Storm
– Databricks
– Azure Stream Analytics
– Azure Functions
– Azure App Service WebJobs
– IoT and Event Hubs

When choosing an adequate solution for stream processing, we should consider the supported programming languages, the programming paradigm (declarative or imperative), and the pricing model. The cost of a streaming application can be per streaming unit (e.g., Azure Stream Analytics), per active cluster hour (e.g., HDInsight and Databricks), or per executed function (e.g., Azure Functions).

Also relevant for the choice of an appropriate streaming solution are the offered connectors. The following gives an overview of the out-of-the-box data sources and sinks for some exemplary streaming solutions on the Azure platform.
Table 2: Integration capabilities of stream processing frameworks

Service: Azure Stream Analytics
Sources: Azure Event Hub, Azure IoT Hub, Azure Blob Storage
Sinks: Azure Data Lake Store, Azure SQL Database, Storage Blobs, Event Hubs, Power BI, Table Storage, Service Bus Queue, Cosmos DB, Azure Functions

Service: HDInsight with Storm/Spark
Sources: Event Hubs, IoT Hubs, Storage Blobs, Azure Data Lake Store
Sinks: HDFS, Kafka, Storage Blobs, Azure Data Lake Storage, Cosmos DB

Service: Apache Spark with Azure Databricks
Sources: Event Hubs, IoT Hubs, Storage Blobs, Azure Data Lake Store, Kafka, HDFS Storage
Sinks: HDFS, Kafka, Storage Blobs, Azure Data Lake Storage, Cosmos DB

Service: Azure App Service WebJobs
Sources: Service Bus, Storage Queues, Storage Blobs, Event Hubs, Webhooks, Cosmos DB, Files
Sinks: Service Bus, Storage Queues, Storage Blobs, Event Hubs, Webhooks, Cosmos DB, Files

Lambda architecture on AWS

The following diagram shows an example application using several services offered by AWS to implement a cloud lambda architecture. Each of the layers in the lambda architecture can be built using various services available on the AWS platform. Here, the batch layer processes sensor data from a thermostat ingested into Kinesis Firehose (an AWS service for Extract, Transform, Load) and stores the results in the Amazon Simple Storage Service (S3), a cloud-based distributed file system. AWS Glue is used for batch processing to analyze the efficiency of the thermostat compared to historical data. The speed layer, in this example Kinesis Analytics, analyzes the data in near real time for anomaly detection in the thermostat device readings. Finally, the outcomes from both layers are stored in a data warehouse called Athena. Amazon QuickSight, a reporting tool, is used to report on and further investigate the data.

1.3 Data Storage

Storage is saving data on a physical device. While primary storage refers to holding data in memory and the CPU during the execution of a computer program, secondary storage solutions are needed to persist data on physical hardware devices in a non-volatile way so that the information can be retrieved later. Physical secondary storage includes Hard Disk Drives (HDDs), USB flash drives, SD cards, and Solid State Drives (SSDs), to name a few. These devices are often used as direct attached storage, i.e., storage drives directly connected to a computer.

However, storage solutions deployed over a network, such as cloud storage solutions, can store large amounts of data and are becoming the standard for data-intensive applications. Cloud storage involves an infrastructure of interconnected servers designed to distribute the data across the physical storage of individual machines.

Serialization

Regarding data persistence, data is usually stored in a file system supported by the operating system. Such file systems organize the information hierarchically in folders and provide efficient access. Internally, a file can be stored in a binary or text format, referring to the data encoding. In both cases, the content of a file is a sequence of bits. For text files, the bits represent characters. Text files can be created with a text editor and stored as plain text (.txt), rich text (.rtf), or as a table with comma-separated values (.csv). In the case of binary storage, the sequence of bytes is not human-readable but optimized for access by the computer. One advantage of binary over textual data is that binary formats typically compress the data better, allowing for smaller files and faster access. Examples of binary files are images, audio, or video files that require interpretation by a computer program to read their content. We call the process of translating human-readable text data into binary data serialization, while the translation from binary back into human-readable text data is called de-serialization.
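As a small illustration of the difference, the following Python sketch serializes the same record once as human-readable JSON text and once in a binary format (Python's pickle, chosen purely for illustration; real systems use formats such as Avro or Parquet).

import json
import pickle

record = {"name": "Sensor1", "timestamp": "2026-01-01T17:13:52Z", "temperature": 12.4}

text_form = json.dumps(record)      # serialization to a human-readable text format
binary_form = pickle.dumps(record)  # serialization to a binary format

print(text_form)                    # readable: {"name": "Sensor1", ...}
print(binary_form[:20])             # raw bytes, only meaningful to a program
print(json.loads(text_form) == pickle.loads(binary_form))  # de-serialization restores the record -> True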
Data storage types

Depending on the specific requirements of a given use case, we can choose from different data storage approaches, each with several solutions. In general, we can distinguish five fundamental data storage categories:

– file systems
– data lakes
– relational databases
– data warehouses
– NoSQL databases

In the following, we will obtain an overview of each of these categories.

File systems

Local file systems are well known from a computer's operating system as the hierarchy of drives, directories, subdirectories, and files. Simple cloud-based file systems are commonly used in organizations as a collaborative platform to store and share files remotely. Examples of managed commercial file sharing products include Sync, pCloud, Icedrive, Google Drive, SharePoint, OneDrive, and Dropbox. These managed services typically include automated backups, synchronization of data, user-specific file/folder access, versioning, data security services, and data management via a user-friendly interface. Technologically, these solutions are distributed file systems usually used to store unstructured data objects. However, file sharing solutions usually focus on the sharing and access control aspects of relatively small datasets and not so much on handling large distributed datasets.

Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) is a distributed data storage technology providing high availability and fault tolerance by keeping redundant copies of the data. Although HDFS presents a unified view of a resource at the logical level (a file or folder), the resources are internally split into blocks that are stored on different nodes (machines). HDFS scales seamlessly both vertically (by increasing the nodes' capacities) and horizontally (by adding nodes to the cluster), supporting thousands of nodes capable of storing petabytes of data. Like local file systems, data storage is organized hierarchically, allowing file storage in directories and subdirectories.

The architecture of HDFS follows a master-slave configuration. The basic components are NameNodes and DataNodes. The NameNode is the master of the system, managing access to the resources and keeping the system's metadata. Internally, it maintains a table that maps data blocks to DataNodes. The DataNodes are responsible for the actual data storage. Usually, they comprise commodity hardware with several disks for large storage capacity. Hadoop was designed on the assumption that commodity hardware occasionally fails, therefore implementing high tolerance for failures and high availability, with failover management addressed at the application level. Data distribution across several cluster nodes comes with another advantage besides fault tolerance: data processing in a distributed architecture is highly parallelizable by running a processing task simultaneously on several machines. HDFS is part of the Hadoop ecosystem and the Apache open-source software family.
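To get a feel for the block-based storage model, the following back-of-the-envelope sketch assumes HDFS's common defaults of a 128 MB block size and a replication factor of 3; both values are configurable per cluster, so treat this purely as an illustration.

import math

BLOCK_SIZE_MB = 128      # assumed default HDFS block size
REPLICATION_FACTOR = 3   # assumed default number of copies per block

def hdfs_footprint(file_size_mb: float) -> tuple[int, float]:
    """Return (number of blocks, upper bound on raw cluster storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return blocks, blocks * BLOCK_SIZE_MB * REPLICATION_FACTOR

# A 1 GB file is split into 8 blocks; with 3 replicas it occupies up to 3 GB of raw capacity,
# and each block can be processed on a different DataNode in parallel.
print(hdfs_footprint(1024))   # -> (8, 3072)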
Many technologies are available for computation based on data stored in HDFS, such as MapReduce, Tez, and Hive. The Hadoop ecosystem comprises several technological layers and different frameworks at each of these layers. For example, on top of HDFS, we can use MapReduce or Spark as processing frameworks. MapReduce can be used to define the program logic for the processing. It splits larger tasks into a mapping and a reducing part. The individual steps are parallelized across many cluster nodes, allowing for operations on massive datasets.

Resources required for data processing are managed by YARN (Yet Another Resource Negotiator). It parallelizes processing for distributed computing on the nodes, dividing the computations into smaller tasks that are distributed between the nodes. In addition, ZooKeeper coordinates the cluster by monitoring the machines and assigning roles to each of them, for example, NameNode or DataNode. YARN and ZooKeeper are services running in the background that usually do not require much attention.

Managed cloud file systems

For many data-related projects, it is more efficient not to set up HDFS and other Hadoop or Apache technologies on our own servers but to choose a managed service offered by a cloud provider. In many ways, most of these managed services are similar to HDFS and follow the same underlying technological paradigm. There are many solutions, and going into detail for each is beyond the scope of this section. Instead, we acquaint ourselves with a few examples.

Amazon Simple Storage Service (S3) uses distributed storage for objects, where the files and their metadata are stored in buckets. The Amazon Elastic File System (EFS) is used for shared file storage of Elastic Compute Cloud (EC2) instances, the cloud-based virtual machines on AWS.

On the Google Cloud Platform (GCP), Cloud Storage is a managed service for storing text and binary objects. Like S3, it uses the fundamental technological approach previously described for HDFS.

Azure Blob Storage is a member of a family of storage services on Azure called Storage Accounts. The other Storage Account services are File Storage, a mountable file system focusing on shared access; Queues for stream or micro-batch processing; and Table Storage, a column-oriented NoSQL datastore. Data in Blob Storage are stored in containers, where we can structure them into virtual folders. Internally, the data use flat storage, meaning that this hierarchy is only virtual and not implemented in the underlying physical storage of the cluster's machines. To make use of the performance advantages of an actual hierarchical namespace, we can easily convert an Azure Blob Storage into an Azure Data Lake.

Data lakes

Data lakes are large storage repositories for raw and unstructured data. Usually, data is integrated for the entire organization, or at least several departments, from various data sources. The ingested data are stored with metadata, enabling many different analyses in a user-defined fashion. In most cases, data lakes use different stages for various processing levels. For example, in Databricks' Delta Lake, raw and untreated data are dumped into the bronze stage (Databricks, 2023). In the silver stage, these data are refined and preprocessed. Ultimately, data in the gold stage are entirely prepared and directly usable for analytical purposes, such as training machine learning models.
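The staging idea can be sketched with a few pandas steps; the column names and cleaning rules below are illustrative assumptions and do not use any Delta Lake API.

import pandas as pd

# Bronze: raw, untreated records exactly as ingested (including a broken reading).
bronze = pd.DataFrame({
    "sensor": ["Sensor1", "Sensor1", "Sensor2"],
    "timestamp": ["2026-01-01 17:00", "2026-01-01 18:00", "2026-01-01 17:00"],
    "temperature": [12.4, None, 18.1],
})

# Silver: refined and preprocessed (typed timestamps, invalid readings removed).
silver = (
    bronze.assign(timestamp=pd.to_datetime(bronze["timestamp"]))
          .dropna(subset=["temperature"])
)

# Gold: fully prepared, analysis-ready aggregate, e.g., a daily mean per sensor.
gold = (
    silver.groupby(["sensor", silver["timestamp"].dt.date])["temperature"]
          .mean()
          .reset_index(name="mean_temperature")
)

print(gold)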
Azure Data Lake Store and the AWS Data Lake solution are both cloud data lakes built on the distributed storage solution of the respective provider, i.e., Blob Storage and the Simple Storage Service (S3). Google also offers a data lake service based on Cloud Storage.

Relational databases

While the data storage solutions described so far are primarily oriented toward unstructured and semi-structured data, there are database storage solutions designed to store structured data with a pre-defined schema.

Relational databases, also called Relational Database Management Systems (RDBMS), are the traditional solution for storing structured data. They conform to a schema-based data model where data are stored in tables. The records are the table's rows, and the attributes are the table's columns. The ACID properties (atomicity, consistency, isolation, and durability) describe a set of database transaction properties that guarantee reliable transactions and data validity. Typical for RDBMS are referential integrity and the concept of data normalization, i.e., the process of organizing data so that it looks similar in all fields and records. Consequently, the Structured Query Language (SQL) can efficiently query the records. Popular commercial RDBMSs include Microsoft SQL Server, Oracle Database, and many others. Open-source relational databases include MariaDB, PostgreSQL, MySQL, and SQLite.

Managed cloud databases

Many cloud providers offer managed Databases as a Service (DBaaS). For example, Google Cloud Platform offers Cloud SQL. On AWS, we can use the Relational Database Service (RDS) and Aurora. Azure offers the Azure SQL Database, and many other cloud providers offer similar solutions. These managed solutions automatically scale according to the workload, partition and distribute the data across a cluster of several machines, and perform automated backups, security audits, and patches.

Data warehouses

Enterprise applications require aggregating data from different sources in a central repository to analyze all relevant information. Such a repository can be either a data lake or a data warehouse (or both simultaneously). While data lakes are primarily used for unstructured data, data warehouses typically store structured data. Data warehouses are designed to integrate data from various sources, aggregating it in a semantically clear, homogeneous format so that SQL queries and massive analytical processes can later exploit it. In that respect, data warehouses are, at their core, similar to relational databases. Some examples of data warehouse solutions are AWS Redshift, Microsoft Azure Synapse, Google Cloud BigQuery, and Snowflake.

NoSQL storage solutions

Non-relational or not-only-SQL databases, also referred to as NoSQL databases, are storage solutions for structured, semi-structured, and unstructured data. Despite being a diverse family of considerably different storage solutions, these databases share some common characteristics. For example, while relational databases typically enforce a data schema on write, NoSQL databases usually do not exert this limitation. This makes NoSQL storage more flexible.
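The schema-on-write behaviour of relational databases can be seen in a few lines of Python with the built-in sqlite3 module: the table structure and column types are fixed before any record is written, and writes that do not fit the schema are rejected. The table and column names are made up for this sketch.

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for illustration

# The schema is declared up front; every row written later must fit these typed columns.
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, sensor TEXT, temperature REAL)")
conn.execute("INSERT INTO readings (sensor, temperature) VALUES (?, ?)", ("Sensor1", 12.4))

# Writing a column that is not part of the schema fails at write time.
try:
    conn.execute("INSERT INTO readings (sensor, humidity) VALUES (?, ?)", ("Sensor1", 55))
except sqlite3.OperationalError as error:
    print("rejected on write:", error)

print(conn.execute("SELECT sensor, temperature FROM readings").fetchall())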
For instance, let's assume that we stored the temperature measured by an IoT sensor, and a new generation of this sensor can also measure air humidity. In NoSQL databases, this is not an issue: we simply store the data with the new schema (an additional attribute) alongside the existing entries. Sometimes, people say there is no schema in NoSQL data stores. But we usually infer some schema, as we could not use the data at all otherwise. It is more precise to say that relational databases enforce a schema on write, while in NoSQL databases, the schema is inferred on read. Note that this data model allows for write flexibility but has some downsides. For example, we cannot be sure that a particular data entry abides by a specific structure. Also, joining two datasets is less efficient than in relational databases. In our example, let us assume that we would like to join the humidity attribute with another dataset containing information about the sensor manufacturers. In a relational database, we could use an indexed sensor ID field as a key to join the two tables quickly. In a NoSQL database, we would have to scan through each document containing the temperature and humidity readings to find a possible ID and then perform the join.

At this point, a note of warning about common misconceptions regarding relational and NoSQL databases is advisable. Relational databases have been around and prevalent for several decades (and continue to be), while NoSQL databases have been developed comparatively recently. This coincides with the tendency towards distributed systems over the last few years, which has led to the conception that NoSQL databases comprise distributed architectures with all their implications, while relational databases are not built on distributed systems. It should be clear that this conception is incorrect and that many relational databases today run on distributed clusters. The main difference between relational and NoSQL databases is not their distributed or local nature. As a matter of fact, flexible schemas and referential integrity might be considered the main differences between relational and NoSQL databases.

The many fundamentally different NoSQL storage solutions are commonly distinguished into one of the following types: key-value-oriented, document-oriented, column-oriented, and graph-oriented databases. In the following, we will briefly learn about the main characteristics of each.

Key-value-oriented databases

Key-value stores are similar to dictionaries, where each value is mapped to a key. In our IoT sensor example, a data entry might contain two keys, "temperature" and "humidity" (maybe also "id"), and the respective values. Redis, Memcached, etcd, Riak KV, LevelDB, and Amazon SimpleDB are some examples of key-value-oriented databases. Typical use cases include storing user profiles and session information in web applications, the content of shopping carts, product details in e-commerce, structural information for system maintenance, such as IP forwarding tables, and IoT readings, as shown in the following table.

Table 3: Key-value storage (IoT readings)

key          value
timestamp    10:00:00
temperature  22
humidity     55
timestamp    11:00:00
temperature  23
timestamp    12:00:00
temperature  25
humidity     50

Document-oriented databases

Document-oriented databases follow a key-value approach but combine object information in collections of key-value pairs. This also allows for nested data structures. Documents can be encoded as XML, JSON, or BSON (binary-encoded JSON) files. Well-known document databases are MongoDB, Couchbase, CouchDB, Google Cloud Firestore, RethinkDB, and Amazon DocumentDB.
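A document collection and the schema-on-read behaviour described above can be mimicked with plain Python dictionaries; no specific document database API is implied here.

# Two documents in the same collection; the second one adds a field the first lacks.
collection = [
    {"_id": 1, "sensor": "Sensor1", "temperature": 22},
    {"_id": 2, "sensor": "Sensor1", "temperature": 23, "humidity": 55},  # new sensor generation
]

# Schema on read: the structure is interpreted only when the data is used,
# so readers must handle attributes that may be missing.
for document in collection:
    humidity = document.get("humidity")          # None if the attribute is absent
    print(document["_id"], document["temperature"], humidity)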
The following table shows an exemplary collection of two documents describing journal articles. Both articles share common fields but are flexible in others. Also, notice the nested structure of the first document.

Table 4: Document-oriented storage (collection of articles)

Article 1:

{
  "_id": "Article_1",
  "title": "NoSQL Database",
  "authors": {
    "name": "Peter Moran",
    "email": "[email protected]"
  },
  "category": "Data storage",
  "date": "01-04-2020"
}

Article 2:

{
  "_id": "Article_2",
  "title": "MongoDB",
  "author": "Anna Noah",
  "category": "Data storage",
  "date": "01-06-2020",
  "pages": "15",
  "journal": "DM Review",
  "volume": "3"
}

Column-oriented databases

In relational databases, records are stored by row. For example, two data entries for a simple IoT sensor might be stored like this:

Id, temperature, humidity
1, 22, 18
2, 24, 17

This data structure is efficient for append-only writes: to append a new sensor reading, a row is added to the database. For analytical purposes, however, we are typically interested in specific columns and aggregations over them. For instance, reading only the humidity over time becomes highly inefficient, as all rows must be read just to discard most of the information.

Column-oriented databases, also called wide column stores, store records by column rather than by row and are optimized for analytical applications involving frequent aggregations. Data stored in the same column are located in contiguous regions of the physical drive. The IoT sensor example from above, in a column-oriented datastore, translates to the following:

Id, 1, 2
Temperature, 22, 24
Humidity, 18, 17

Storing information like this allows for more direct access to columns (only a few lines must be read) and faster column-based aggregations. In addition, this structure also allows for more efficient data compression. Furthermore, columns can be grouped by similar access patterns into column families, and the entire "table" is called a keyspace in many column-oriented databases. The above example might be grouped by sensor id and readings in the following keyspace:

Column-family 1: Id, 1, 2
Column-family 2: Temperature, 22, 24
Column-family 2: Humidity, 18, 17

Widely used column-oriented datastores include Cassandra, HBase, Microsoft Azure Table Storage, and Amazon Keyspaces.

Graph-oriented databases

The last family of NoSQL databases is graph-oriented. Graph databases are designed for heavily interconnected data. Their data model comprises nodes representing entities and edges representing relationships between these entities. The relationships are directional connections represented as an edge between two nodes. Typical use cases are topological maps, routing systems, and social networks. In e-commerce applications, these databases can be used for product recommendations based on click behavior, previous purchases, and product recommendations by friends. Graph-based databases are also adequate for managing business processes and supply chains.

The following example shows a movie database describing the relationships between movies and persons. A person can direct, act in, follow, or produce a film. Finding all movies associated with a particular person is fast, easy, and efficient with this database type.
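The movie example can be mimicked with a small adjacency structure in Python; a real graph database stores and indexes such relationships natively, so the node names and relationship types below are purely a conceptual sketch.

# Directed edges: (source node, relationship, target node)
edges = [
    ("Anna", "DIRECTED", "Movie A"),
    ("Anna", "PRODUCED", "Movie B"),
    ("Peter", "ACTED_IN", "Movie A"),
    ("Peter", "FOLLOWS", "Anna"),
]

def movies_of(person: str) -> list[tuple[str, str]]:
    """Return all (relationship, movie) pairs that connect a person to a movie node."""
    return [(rel, target) for source, rel, target in edges
            if source == person and target.startswith("Movie")]

print(movies_of("Anna"))   # -> [('DIRECTED', 'Movie A'), ('PRODUCED', 'Movie B')]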
Neo4j is by far the most widely used graph database. Alternatives include JanusGraph, TigerGraph, and NebulaGraph.

Multi-model databases

Finally, there are databases providing multi-model APIs. For example, Azure Cosmos DB is a NoSQL datastore providing APIs for classical SQL queries, MongoDB, Cassandra, or a graph-based approach using Gremlin. Thereby, Cosmos DB supports multiple data models, such as document-oriented, column-oriented, key-value-oriented, and graph-oriented storage. Other examples of multi-model databases include Amazon DynamoDB, MarkLogic, Aerospike, Google Cloud Bigtable, Ignite, ArangoDB, OrientDB, Apache Drill, Amazon Neptune, Apache Druid, and GraphDB. Also, many databases can be categorized into one of the previously described NoSQL database categories but also provide a secondary database model.

In addition, many cloud providers offer specialized services for specific use cases, such as cache or in-memory databases. For example, Azure Cache for Redis enables in-memory processing based on a key-value-oriented Redis database, while AWS offers a similar solution with ElastiCache.

1.4 Data Analysis

Until this point, we have ingested data into unified data storage, processing the data on the way. But storing data is never an end in itself; we rather aim to derive insights from it as added value. Data analysis aims to gain insights into the data and extract information and knowledge from it. The analysis can be descriptive, predictive, or prescriptive.

– Descriptive data analytics focuses on the explanation of present or past events. Statistical analyses of historical data, such as spreadsheets or visual infographics, provide useful insights into the data and its distribution by presenting statistical summaries. Data-driven models can also be used for more sophisticated analyses, such as root-cause analysis, to identify the causal factors of past events, for example, the causes of a system failure.
– Predictive data analysis aims to predict future events, such as future stock prices or the probability of a customer churning. Data-driven models are constructed on historical data and learn the underlying patterns and correlations to predict some outcome. The constructed models allow projecting past events into the future, e.g., predicting the stock prices of the next day given the current and past stock prices.
– Prescriptive data analysis investigates the outcomes of different scenarios with models aiming to recommend the decisions that lead to the most favorable predicted future events. For example, these methods are used in climate impact research or to maximize efficiency in production site settings.

In this section, we will obtain an overview of some data analytics categories, including machine learning, deep learning, and time series analysis.

Machine Learning

Machine Learning (ML) is the ability of programs to automatically extract informative patterns from data, i.e., without explicit instructions on how to do so. Machine learning creates data-driven models and can be subdivided into unsupervised and supervised learning. Artificial intelligence can be considered a larger category to which machine learning belongs, and deep learning is a specific field of machine learning.

Machine learning is often performed on tabular, structured data stored in a data lake or data warehouse. Each row of the table represents an observation; we call this a sample or data point.
The columns represent the respective attributes, and we call them features (except for the to-be-predicted column, which is called the label). A common approach is to use a subset of the labeled dataset for the model's training, while the other part of the labeled data is used to assess the model's prediction quality. This separation is called data partitioning into a training and a testing set.

It is beyond the scope of this section to go into detail about all available machine learning algorithms. Instead, we will briefly learn about popular approaches at a very high level.

Supervised learning

In supervised learning, we have samples with known labels, and we train a model using this information to predict the label for unseen samples. Labels are the assignment of explicit information to each record. The labeled dataset is used to learn the relationship between the set of attributes x1, ..., xn, also called features, and the target variable y. The learning phase is called the training phase, where a model is constructed. This model can then be used to predict the target variable for unlabeled samples, a process called inference.

For example, a supervised machine learning task could be to train a model, based on past emails that we know to be spam, to predict whether or not future emails should be considered spam and filtered out. This is an example of a classification task, but there are also supervised ML algorithms that predict numbers (e.g., tomorrow's air temperature), and we call these types of problems regression tasks. Note that, in this context, regression algorithms are not limited to linear regression, and numerous more elaborate techniques do not imply linear relationships.

Typical supervised classification algorithms include Logistic Regression, Decision Trees, Random Forest, Support Vector Machines (SVM), Naïve Bayes, k-Nearest Neighbors (kNN), and Gradient Boosting. For supervised regression tasks, we can use, for example, Linear, Ridge, Lasso, or Elastic Net Regression, Decision Trees, Random Forest, or Support Vector Regression (SVR).

Unsupervised learning

Opposed to this, unsupervised learning algorithms discover underlying patterns in the data without labels. An example could be clustering the participants of a larger satisfaction survey into homogeneous groups to develop targeted solutions for each. As an overview, some of the most widely used techniques include the following (a short sketch of both the supervised and unsupervised workflows follows this list).

– Clustering: assigning data points to clusters that are not known before the analysis. Typical algorithms include k-Means, Hierarchical Clustering, and DBSCAN.
– Anomaly Detection: similar to clustering, data points are assigned to groups, but in this case specifically as "regular" or "irregular" data points. Typical applications of anomaly detection are the discovery of fraud in financial or insurance systems and the detection of intrusions in information systems. Algorithms include Local Outlier Factor (LOF), One-Class SVMs, and Isolation Forests.
– Dimensionality Reduction: large datasets with many columns/features are hard to overlook. Dimensionality reduction aims to transform the data to reduce the number of columns while preserving the information as well as possible. Typical algorithms include Principal Component Analysis (PCA), t-SNE, and Linear Discriminant Analysis (LDA).
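The following is a minimal scikit-learn sketch of the two workflows on a small synthetic dataset; the dataset, algorithm choices, and parameters are placeholders for illustration, not recommendations.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans

# Synthetic labeled data: X holds the features, y holds the label column.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Supervised: partition into training and testing sets, train, then assess prediction quality.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

# Unsupervised: ignore the labels and group the samples into clusters instead.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", [int((clusters == c).sum()) for c in (0, 1)])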
Deep Learning

A sub-category of machine learning is deep learning. One method that falls into this category is Artificial Neural Networks (ANNs). These networks are inspired by biological neural networks consisting of neurons (nodes) and their connections. We assign a positive weight to emphasize a strong connection between nodes and use negative weights to inhibit such a connection.

The basic unit in ANNs is the perceptron. In the following figure of a simple linear perceptron, multiple weighted inputs are added together, and the output is activated if a certain threshold is passed. This simple architecture is often elaborated by using particular activation functions and concatenating multiple layers into a more extensive network.

Multi-layer perceptrons are more complex and can be used to solve nonlinearly separable problems. We call the additional layers hidden layers. To find suitable weights for a neural network model, we can use the backpropagation algorithm, a common method for training artificial neural networks by tracing errors back through the network.

There is no fixed number of hidden layers, but when there is a considerable number of them, we call the technique deep learning, meaning that the network is deep in terms of layers. Deep learning algorithms are highly parallel, making GPUs, with their high number of processing units, particularly suitable for training these models.

Another deep learning example is Convolutional Neural Networks (CNNs), a performant solution for object recognition in images (and other use cases). The architecture of such a CNN, with its several layers, allows for automatically extracting informative features.

Despite not strictly being a deep learning example, Reinforcement Learning can be used for optimizing decision-making. Reinforcement learning uses rewards for agents in a simulation that optimize their state by interacting with the simulation environment. These simulations can use deep learning algorithms, but they can also use other algorithms. The reinforcement approach is depicted in the following illustration.

One particularity of neural networks is that many pre-trained general-purpose models can be retrained to match a particular use case. We call this transfer learning, and we achieve it by removing the last couple of layers from the network and training it with our specific data.

Despite being a popular buzzword, deep learning and neural networks are not magic algorithms superior to all others. There are certain situations and use cases for which they perform well, typically in scenarios with a lot of training data or non-tabular data, such as images or videos. Impressive advances have been achieved in image processing, natural language processing, and automatic feature learning applications. In cases where the data is scarce (only a couple of hundred samples) and strictly tabular, other algorithms might be a better choice.

Time series analysis

Time series analysis refers to the analysis of data indexed by time. Several techniques exist to learn patterns in historical observations in order to forecast future values, such as Holt-Winters smoothing techniques, Autoregressive Moving Average (ARMA) models and extensions like ARIMA, SARIMA, and SARIMAX models, and ensemble models such as TBATS. Time series analysis is quite similar to the supervised machine learning approach. Still, for the latter, we use multiple features to predict the label, whereas for the former, we forecast future values solely based on the one time-indexed variable (and time-lagged versions of itself). Nevertheless, the boundaries have become blurry, and libraries for automated feature extraction from time series have been developed so that these features can be used in a machine learning manner. From a data management perspective, we should remember that the time index is highly relevant for time series analysis and should be treated accordingly in the data system.
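As a small illustration, the following sketch fits a Holt-Winters (exponential smoothing) model with statsmodels on a toy monthly series; the series, the additive trend/seasonality settings, and the 12-month period are made-up assumptions for demonstration.

import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# A toy monthly series with a trend and a simple yearly seasonal pattern (36 observations).
index = pd.date_range("2021-01-01", periods=36, freq="MS")
values = [10 + 0.3 * i + 3 * ((i % 12) in (5, 6, 7)) for i in range(36)]
series = pd.Series(values, index=index)

# Holt-Winters: level + additive trend + additive seasonality with a 12-month period.
model = ExponentialSmoothing(series, trend="add", seasonal="add", seasonal_periods=12).fit()
print(model.forecast(6))   # forecast the next six months from the time-indexed variable alone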
MLOps and CRISP-DM

Regarding the development of machine learning applications, frameworks exist to streamline these activities. They are referred to as machine learning operations (MLOps), and a canonical framework for the development of machine learning models is the Cross-Industry Standard Process for Data Mining (CRISP-DM). As can be seen in the following figure, many steps of the CRISP-DM process model require close collaboration between data management and machine learning teams.

1.5 Reporting

Reporting is the final phase of the data processing cycle. It provides visualizations, dashboards, and text reports that allow users to gain insights from the information aggregated in the previous stages.

At this stage, we often use methods subsumed under Business Intelligence (BI). BI reporting refers to the processes and tools for data analysis used to obtain actionable insights into an organization's operations. But remember that these techniques are not limited to business decisions; they apply to any data-driven decision, e.g., in urban planning, logistics, environmental policies, spatial decision support tools, and many more. The objective is to gain understanding and improve decisions based on information rather than intuition or traditional rules. Organizations monitor metrics and Key Performance Indicators (KPIs), i.e., meaningful metrics that can be used to measure the performance of an activity of an organization, relevant to their use cases.

Insights from BI

The use of BI allows a variety of insights:

– decision-making based on evidence
– real-time analytics
– details about processes
– discovery of business opportunities
– control of planning commitments
– monitoring the business with key performance indicators

Accordingly, BI applications aim to reflect the reality of the organization by considering information from the entire organization, coming from both internal and external data sources. For example, a BI application to monitor production costs might be based on information about material costs and operational efficiency that can be read from the organization's ERP or MES system. However, insights into the competitiveness of a product might be derived from external data, such as customer reviews. Studies on the expected profitability of a product may be obtained from both internal and external data, considering the production process, the prices of competitors, and the expected market acceptance.

Modern BI tools are designed to be easy to use and address both data integration and analysis. Most BI tools offer a Graphical User Interface (GUI) for data integration to create ETL processes. The data analytical part of BI tools typically focuses on the visual representation and interactive exploration of the data by the users. For this, BI tools support the creation of data dashboards and interactive web-based reports. There are several modern BI tools on the market, such as Microsoft Power BI, Tableau, and Qlik Sense, to name a few.

SUMMARY

The data processing lifecycle transforms raw data into information. For this, several transformations are performed.
SUMMARY

The data processing lifecycle transforms raw data into information. For this, several transformations are performed. Data ingestion tackles the challenge of integrating raw data from various sources into a standardized format. The processing phase focuses on different methods that can be conducted in batches, streams, or a combination of both. Data persistence is resolved in the storage phase, where the aggregated data can be stored with various database or file storage solutions. Each storage solution has its strengths and weaknesses related to the characteristics of the application or the data. After persisting the aggregated data, its analysis by means of machine learning or reporting tools is possible. Machine learning derives insights from the data with sophisticated models and algorithms, which learn from the data and provide diagnostic, predictive, or prescriptive models. Reporting applications aim to provide interfaces to access and analyze the data, mainly at the descriptive level. For this, BI applications are a popular solution, as they provide the possibility to explore the data interactively with an emphasis on its graphical representation.

UNIT 2

DATA PROTECTION AND SECURITY

STUDY GOALS

On completion of this unit, you will be able to...

– understand the need for ethical principles in data handling.
– know the principal data protection principles.
– know different alternatives for data encryption.
– understand data masking strategies and know tools for their implementation.
– define a risk management plan according to the most important data security principles.

2. DATA PROTECTION AND SECURITY

Introduction

With the rise of digitalization and the massive collection of data, appropriate measures are needed to protect individual rights in the era of big data. First, it is necessary to review how individuals' rights can be affected by massive data collection and analysis. Ethical principles are fundamental to ensure proper conduct in data analysis. In this context, the General Data Protection Regulation (GDPR) has established different data protection principles which must be addressed to protect the rights of individuals. The requirement of security for private data has a technological impact on information systems, as proper data protection solutions are needed. Encryption and data masking are modern techniques to protect the confidentiality of data, and these solutions can be applied in different scenarios. Data protection in information systems is mandatory and often a non-trivial endeavor. It should be addressed by a proper security risk management plan providing mitigation strategies for security threats in order to meet the different data security principles.

2.1 Ethics in Data Handling

With the accelerated digitalization of our everyday lives in recent decades, the collection of personal data has also increased. Accordingly, concerns arose regarding the protection of privacy and the possible misuse of personal data. As countermeasures, data protection legislations have defined the basic rules for data protection and privacy in data handling. Nevertheless, beyond the legislative regulations, there is a need to care about ethical data management. These considerations gave rise to the legislative regulations and should be fully understood by everybody designing modern data systems.

Ethical principles

Despite the legislative efforts for data protection, technological advances are often faster than legislation, and new challenges may arise which may not have been considered in the current data protection legislation.
For this, it is important to know about the ethical principles for data handling, which are the basis for legislative regulations. Such principles are transparency, fairness, and respect regarding the treatment of personal data (Jobin, Ienca & Vayena, 2019, p. 1). Ethical data handling assures the protection of privacy, freedom, and autonomy of individuals and thereby encourages individuals' trust in digitalization.

Transparency

Transparency means that the data subject (the person whose data is stored or processed) and the data handler should be clear about what is done with the data and for which reason. This means in the first place that the collected user content and the intended use of the data (purpose) should be clearly defined. For this purpose, transparent and clearly written policies should be defined and provided. Technically, privacy-respecting policies should be set as default configurations; for example, on websites, the default configuration of cookies should be restricted to technically necessary ones. If the data subject gives consent to store and process additional data for reasons of convenience, it should be clear what data are collected and for what purpose. The reasons for storing and analyzing data should be described, and organizations should follow certification schemes regarding data protection. If decision-making is automated in any service of an organization, it should be clear which ethical aspects are considered in the decision to avoid discrimination.

Fairness

Fairness focuses on the impact of data handling on persons and their interests. The use of personal data should be fair for all involved parties, misuse should be avoided, and the impact of failures should be considered. To avoid discrimination, sensitive personal data, such as race, religion, political preferences, sexual orientation, or disability, are not allowed to be used in automated decision-making. The data should only be used in the context to which the user has given consent, and the organization should provide the possibility for users to claim the correction of data.

Respect

The principle of respect refers to the consideration for the persons behind the data. As data managers, we should first look after the interests of these persons before considering the benefit of an organization that derives value from the collection of their data.

Discussion on the ethical management of data

Unethical data handling happens when the individual's rights are harmed or when the individual loses control of the data. For instance, personal data should not be collected unless the user has given his or her consent and authorization for a particular use. Especially in big data applications, the use of data collection and data analysis is under controversy. Big data relies on the collection of massive amounts of data from a variety of sources and their aggregation by data linking to obtain enriched information. Currently, big data is used in manifold applications, ranging from recommender systems, supply-chain optimization, and medical research to rating applications for credit, housing, education, or employment. The first concern is inappropriate data sharing. This problem arises when personal data is shared with other stakeholders without the consent of the individual. The individual should always have control of the data by being fully informed about the purpose of data analysis and the identity of the data handler.
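One way such consent can be enforced technically is to make purpose-bound consent flags part of the data model and to filter on them before any processing. The following is a minimal sketch of this idea; it assumes the Python pandas library, and all identifiers, column names, and records are hypothetical.

import pandas as pd

# Hypothetical user records with purpose-bound consent flags
users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "email": ["a@example.com", "b@example.com", "c@example.com"],
    "consent_marketing": [True, False, True],
    "consent_analytics": [True, True, False],
})

def records_for_purpose(df, purpose):
    # Return only records whose owners consented to this specific purpose
    return df[df["consent_" + purpose]]

# Only users 1 and 3 may receive marketing e-mails
print(records_for_purpose(users, "marketing")["email"].tolist())

Storing consent in this purpose-bound form also makes it easier to demonstrate later why a particular record was processed.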
A common practice to legally use data in big data applications is to apply a privacy-preserving transformation to the data, also referred to as data anonymization, to remove the personal information of individuals in such a way that this information is not recoverable by appropriate means. In such cases, the anonymized data can be used without the need to comply with further personal data protection rules. Nevertheless, data anonymization is not the ultimate solution. It has been demonstrated that, even with anonymized data, the identity of individuals can in some cases be inferred from other characteristics of the data. A famous example is Netflix's $1 million contest intended to improve its movie recommendation system based on anonymized historical records of movie choices from customers. The contest was canceled in 2010 as researchers raised concerns regarding the anonymization of the data and warned about the risk of identification of the customers. In fact, the anonymized data set still comprised information on gender, zip code of residence, and age, which are sufficient to narrow the identity of a customer down to a reduced number of possibilities (New York Times, 2010).
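The risk can be made visible with a simple check on the remaining quasi-identifiers, in the spirit of a k-anonymity analysis. The following minimal sketch counts how many records share each combination of gender, zip code, and birth year; it assumes the Python pandas library and uses purely fictitious records.

import pandas as pd

# Fictitious "anonymized" records: direct identifiers removed,
# but quasi-identifiers (gender, zip code, birth year) remain
records = pd.DataFrame({
    "gender": ["f", "f", "m", "m", "f"],
    "zip": ["10115", "10115", "20095", "20095", "80331"],
    "birth_year": [1985, 1990, 1985, 1985, 1972],
    "rating": [4, 5, 3, 2, 5],
})

# Size of each quasi-identifier group; a group of size 1 means that the
# corresponding person is uniquely identifiable despite "anonymization"
group_sizes = records.groupby(["gender", "zip", "birth_year"]).size()
print(group_sizes[group_sizes == 1])

In the Netflix case, exactly such small groups of quasi-identifiers were enough to raise re-identification concerns.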
Identification of individuals is also a concern in profiling. According to Schermer (2011), “Profiling is the process of discovering correlations between data in databases that can be used to identify and represent a human or nonhuman subject (individual or group) and/or the application of profiles (sets of correlated data) to individuate and represent a subject or to identify a subject as a member of a group or category”. Profiling relies on data mining, which can either be descriptive or predictive. In the case of descriptive data mining, profiles are groupings discovered from the data, and the outcome is the description of the characteristics and relationships of the discovered groups. In predictive data mining, the relationship between the characteristics and class membership is learned from labeled data. The data model learns the underlying correlations from the data and is able to predict, with a given certainty, that the individual belongs to a certain group.

Predictive profiling can be an advantage in many applications where the characterization of the individual is key, for example, recommendation systems, personalized services, or applications based on anomaly detection. However, the use of profiling is not without controversy. According to Schermer (2011), “the most significant risks associated with profiling and data mining are discrimination, de-individualization and information asymmetries.” Discrimination can happen when biased data is used to train a predictive model so that the profiling derived from the model exhibits discrimination, or when the results are used in a discriminating way. Regarding de-individualization, profiling applies the group characteristics to an individual according to the group membership predicted by the model. This can lead to unfair treatment, as the individual may not necessarily share all characteristics of the profiled group and should not be judged by a group characteristic he or she does not have. Profiling can also stigmatize individuals and thus negatively affect social ties. As data mining and profiling create insights about individuals, so-called information asymmetries may arise. “Information asymmetries may influence the level playing field between government and citizens, and between businesses and consumers, upsetting the current balance of power between different parties” (Schermer, 2011). For example, an unethical use of profiling would be denying service to an individual based on a profile-based decision. An individual could be denied credit because he or she belongs to a group that is considered not creditworthy, although his or her individual aptitude for obtaining credit would be excellent. Moreover, the negative decision taken by the profiling algorithm may not be explainable, i.e., it may not be clear to the individual for which reason and based on which information the decision was taken. In fact, data models for profiling should exclude any sensitive personal attribute to guarantee that the model will not make discriminating decisions based on sensitive attributes, such as gender, race, or political beliefs.

Risks of data privacy in the digital society

Social networks, such as Twitter, Facebook, Instagram, or LinkedIn, are places where users voluntarily reveal certain Personally Identifiable Information (PII), such as images of themselves or personal information (Kayes & Iamnitchi, 2017). Nevertheless, the publication of such information can be a risk, as it can be exploited for social engineering, phishing, or even identity theft. Also, the publication of location information can lead to stalking or even demographic re-identification, as location, gender, and date of birth can lead to narrow identification. Another risk to privacy is the systematic collection of users' data by so-called data-aggregation companies. Such companies collect personal data from public profiles of online networks and other information available online with the aim of commercializing this information to third parties, such as insurance or rating companies (Bonneau et al., 2009). Another important stakeholder with the ability to collect a huge amount of personal data is governments, through the rise of digital administration. Often, governments hold sensitive personal data (gender, age, race, etc.), which requires additional protection. Another threat attributed to governments and derived from digitalization and online records is public surveillance (Nissenbaum, 2004). This term refers to the activity of government agents using different digital media, such as online information, video, or other data, to systematically surveil individuals. This activity could harm the autonomy and the right to privacy of citizens and create a disservice.

From a technological point of view, the Internet of Things (IoT) is another emerging technology with an impact on personal data. IoT devices are capable of gathering huge amounts of sensitive data, for example in health monitoring or in the context of smart city applications. Often, these devices include diverse technical platforms and, in contrast to computers and servers, are typically not under the physical control of the system administrators but located somewhere in the open. Accordingly, IoT devices face different cybersecurity threats, and there is intense ongoing research on methods to better protect privacy in IoT applications (Ziegler, 2019).

Limited access to technology is also a risk that may cause partial, incorrect, or non-representative data. This is especially a risk for applications that automatically collect data from digital applications.
The lack of access to technology may be a cause for the underrepresentation of certain groups in such systems, for example, the elderly or people with lower income. An example is Street Bump, an IoT application using the accelerometers of drivers' smartphones to automatically detect damage on roads in order to schedule road maintenance (Crawford, 2013). The results generated by this application might be considered unfair, as access to the technology, in this case smartphones and cars, was not the same in all regions. In consequence, the regions with a lower rate of access to technology were underrepresented in the collected data and therefore penalized in the maintenance plans.

Discrimination by algorithms

As mentioned before, automated decision-making by algorithms can lead to discrimination. In many applications, such as spam filtering, fraud detection, or intrusion detection, such models completely automate a person's decision. Partly automated decisions use the outcome of an algorithm to assist humans in decision-making. There is a lively discussion about the impact of discrimination by algorithms (Zuiderveen, 2020), especially for data-driven algorithms where the outcome is not explicitly programmed but derived from the learned relationships in the training data. The concern comes from the fact that data-driven models may learn and reproduce biases or prejudices from the provided data. In machine learning, patterns are learned from the data distribution, so that when biased data is used, the model will preserve such bias. As a countermeasure, some best practices can be applied (Barocas & Selbst, 2016), such as the use of unbiased training data, attention to class balance, or adequate feature selection to assure fine-grained predictions also for minority groups.

It is worth noting that real-world data are most likely biased, for example, the information available on the Internet. Search engines, for instance, provide information according to the quantity and diversity of information available from external sources. Google's search algorithm was accused of being racist, as the images returned in a search for teenagers were stereotyped when the search term included the race of the teenagers. Allen (2016) described the apparently racist behavior of the Google search algorithm, which showed smiling and happy images for the keywords “three white teenagers” and, on the other hand, disproportionately many mug shots for the keywords “three black teenagers”. In response, Google argued that the different outcomes of their search algorithm rely on the tagged content published on the Internet, where different content and quantities of information are published for each search term, and the search algorithm only indexes existing content.

An example demonstrating that prejudices tend to be preserved in automated decision-making is the automated admission assessment application of a higher education medical institution in the UK in the 1980s (Lowry & Macpherson, 1988). The admission assessment model was constructed on the former admission records of the institution and exhibited discrimination against women and some minority groups because, in the past, the admissions had been made by humans based on these criteria. The reason the model exhibited this behavior was that it simply learned the underlying discriminating relationships from the training data.
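As a minimal illustration of one such sanity check, the following sketch inspects the historical label distribution per group before a model is trained. It assumes the Python pandas library, and the admission records shown are entirely fictitious.

import pandas as pd

# Fictitious historical admission records used as training data
train = pd.DataFrame({
    "gender":   ["f", "m", "m", "f", "m", "m", "f", "m"],
    "admitted": [0, 1, 1, 0, 1, 0, 1, 1],
})

# Admission rate per group in the historical labels; a strong skew is a
# warning sign that a model trained on this data may reproduce the bias
print(train.groupby("gender")["admitted"].mean())

Such a check does not remove bias by itself, but it makes the potential problem visible before a model is trained and deployed.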
2.2 Data Protection Principles

There are many influential regulations built on data protection principles. For systems handling data of EU citizens, the EU General Data Protection Regulation (GDPR) is one of the most important regulations to consider. It became effective in May 2018, providing rules for data protection and privacy of EU citizens' personal data (Regulation (EU) 2016/679). The GDPR describes a framework for data protection during the collection, storage, and processing of data. This framework comprises the definition of the roles of the stakeholders, several data protection principles, and the description of the rights and obligations of the different stakeholders. In this section, we will use the GDPR as an example to show how data protection principles can be defined and implemented into legal regulations. We will also learn about our obligations when we design and create data systems.

GDPR roles

The GDPR framework distinguishes the roles of the different parties involved in the data handling process:

– Data subjects are individuals (persons) from whom personal data is collected.
– Data controllers hold the collected data and define the way in which the data are collected and processed.
– Data processors collect and process data in the name of and commissioned by the data controllers.

Data protection principles of the GDPR

Regulation (EU) 2016/679 (2016) constitutes several guiding principles for data protection, based on the ethical principles for data handling, that aim to define how data should be processed.

Fairness, lawfulness, and transparency

In the first place, the reason for collecting data should be set up on a legal basis. The GDPR considers the possibilities of consent, public interest, or legitimate interests as options for legal grounds. Data can be collected and processed when the data subject authorizes it, i.e., gives consent. In the case of the collection of sensitive data, explicit consent of the data subject is required. Data collection by legitimate interest refers to situations where data are collected to fulfill a legal requirement, for example, to accomplish administrative obligations. Data should be handled with fairness, i.e., in a fair and reasonable fashion from the data subject's perspective. Data collection should be transparent by clearly defining and explaining in a straightforward way which data are collected, for whom, for which purpose, and for how long they will be stored.

Purpose limitation

Purpose limitation means that data are only stored and processed for the clearly specified and legitimate primary purpose. Other or incompatible storage or processing purposes are not permitted. For example, address data that was collected to ship articles to customers must not be used to send advertisements without additional consent.

Data minimization

Collected personal data should be appropriate, relevant, limited, and indispensable for accomplishing the purpose of collection. Prior to starting data collection, the minimum quantity of data to be collected for the given purpose should be defined. For example, an organization may only collect health information from its employees if such information is necessary for some legitimate use, for example, workplace risk assessment. Otherwise, the organization is not authorized to collect such a type of personal data.

Accuracy

Accuracy refers to the obligation of the data controller and processor to assure that personal data is correct and up-to-date.
The organization has the obligation to carry out rectifications of the data if required by the individual or as a consequence of other modifications that might render a data entry outdated.

Storage limitation

Storage limitation means that data should only be kept as long as it is necessary for the given purpose. A policy should define how long the data is stored and when it is deleted or anonymized. Technically, this should be performed automatically within the data system.

Integrity and confidentiality

Integrity and confidentiality refer to the obligation of the data controller and processor to keep the data secure and treat it as sensitive information. This includes protection against unauthorized or unlawful access, processing, loss, destruction, or damage.

Accountability

Not really a principle of its own, accountability emphasizes the data controller's responsibility to comply with the GDPR.

Exceptions and special cases

Exceptions to the prohibition on processing personal data are allowed if they are provided for in union law or member state law, and if justified by the public interest. Such an exception may be made, for example, for health purposes, to ensure public health. Such exceptions, justified in the public interest, include, in particular, health monitoring and the prevention or control of contagious diseases and other serious health hazards (Regulation (EU) 2016/679, 2016, p. 9).

Apart from this, there are also exceptions based on legitimate interest, for example, in the case of a group of companies or a group of entities that are assigned to a central organization and have a legitimate interest in processing personal data within the group of companies for internal administrative purposes, including the processing of personal data of customers and employees (Regulation (EU) 2016/679, 2016, p. 10).
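Several of these principles translate directly into technical requirements for data systems. The storage limitation principle described above, for instance, implies an automated retention job that deletes or anonymizes expired records. The following is a minimal sketch of such a job, assuming Python with a SQLite database; the table name, column name, and retention period are purely hypothetical.

import sqlite3
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=365)  # hypothetical policy: keep order data for one year

def purge_expired(conn: sqlite3.Connection) -> None:
    cutoff = datetime.now(timezone.utc) - RETENTION
    # Delete personal records whose retention period has expired;
    # created_at is assumed to be stored as an ISO 8601 timestamp
    conn.execute(
        "DELETE FROM customer_orders WHERE created_at < ?",
        (cutoff.isoformat(),),
    )
    conn.commit()

# In practice, such a job would run on a schedule (e.g., nightly).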
Rights of data subjects

Furthermore, the GDPR describes the rights individuals have regarding the collection and processing of their personal data.

– Right of information about which personal data are collected and processed.
– Right to know the source of personal data: Data controllers must inform data subjects about data processors collecting or processing personal data. For example, if user A visits website B, which combines the digital behavior of the user with additional personal data, such as the user's age obtained from data provider C, the website owner B has to inform user A about this data enrichment.
– Right of access: The individual has the right to obtain a copy of the personal data.
– Right to rectification: The possibility to request data corrections.
– Right to erasure: Sometimes also referred to as the right to be forgotten; the individual can request the deletion of the personal data.
– Right to object to processing and right to restrict processing: The individual can limit the way the data controller handles the data by objecting to the processing (right to object), which means that the data controller needs to stop processing the data. The request for restriction refers to processing constraints. For example, while a claim regarding the correctness of data is resolved, the individual can request the restriction of processing the data.
– Right to be notified: Data subjects must be notified about further data processing activities, deletion of the data, breaches, and unauthorized accesses to the data.
– Right to data portability: This right aims to enable the free flow of data by giving the individual the right to request personal data in a portable format (if technologically feasible).
– Right not to be subject to profiling: The GDPR defines profiling as “any form of automated processing of personal data evaluating the personal aspects relating to a natural person, in particular to analyse or predict aspects concerning the data subject's performance at work, economic situation, health, personal preferences or interests, reliability or behaviour, location or movements” (Regulation (EU) 2016/679, 2016, p. 14). In this case, the data controller needs to carry out a Data Impact Assessment before putting the automated decision application into practice to assess the privacy-related risk. The legitimacy of the use of such automated decision applications is also subject to the explicit authorization of the data subject, except where it is authorized by law (e.g., tax fraud detection). Data subjects also have the right to obtain explanatory information about the logic of the automated decision-making process.

Data security

Data security is necessary to guarantee data privacy. This concerns the physical security of premises and workstations, as well as the security of systems. In addition, we should design data systems in a way that minimizes the risk in case of data breaches. This can include anonymization and the restriction to collect only the absolutely necessary data. The GDPR defines two principles in that regard, referring to “data protection by design and data protection by default” (Regulation (EU) 2016/679, 2016, p. 15). The former says that data protection principles should be technically implemented into the very architecture of data systems from the first stage of development, while the latter means that the highest possible data protection settings should be selected by default throughout the entire system. Furthermore, a risk assessment should be conducted considering possible data breaches, also listing the likelihood, severity, and countermeasures for each risk.

Translation of regulations to technical solutions

To safeguard data, the data controller needs to ensure the confidentiality, integrity, availability, and resilience of the processing systems, which refers to different measures. These measures include strong access control to systems, effective restoring mechanisms in case of technical incidents, protection of servers against external threats, and a closed and controlled system of data processing. The data controller should also apply anonymization or pseudonymization and encryption to protect personal data. In the following, we learn about anonymization and pseudonymization, and (in the next section) encryption.

Anonymization

Data anonymization is a technique to remove Personally Identifiable Information (PII). This is a common solution for enabling data analysis in certain situations:

– data sharing, for example, in applications joining aggregated data from various data sources
– data analysis based on sensitive data, for example, in medic