EGT308 AI Solution Architect Project
Summary
This document provides a summary of topics for a project related to AI solution architecture and data engineering. It covers key concepts such as what big data architecture is, designing data processing pipelines, different data ingestion methods, and strategies for data visualization.
Full Transcript
EGT308 AI Solution Architect Project
Topic 6: Data Engineering for Solution Architecture
In this topic, you will learn how to handle and manage your big data needs.

In the internet and digitization era, data is being generated everywhere with high velocity and volume. Getting insight from these huge amounts of data at a fast pace is challenging. We need to innovate continuously to ingest, store, and process this data to derive business outcomes. We will learn/recap the following:
o What is big data architecture?
o Designing a big data processing pipeline
o Data ingestion, storage, processing, and analytics
o Data visualization
o Designing big data architectures

1. What is big data architecture?
In big data architecture, the general flow of a significant data pipeline starts with data and ends with insight. How you get from start to finish depends on a lot of factors. The following diagram illustrates a data workflow pipeline that will help you design your data architecture:
Figure 1: Big data pipeline for data architecture design
The best way to architect data solutions while considering latency is to determine how to balance throughput with cost, because higher performance, and subsequently reduced latency, usually results in a higher price.

As shown in the preceding diagram, the standard workflow of the big data pipeline includes the following steps:
o Data is collected (ingested) by an appropriate tool.
o The data is stored persistently.
o The data is processed or analyzed. The data processing/analysis solution takes the data from storage, performs operations, and then stores the processed data again.
o The data is then used by other processing/analysis tools, or by the same tool again, to get further answers from the data.
o To make the answers useful to business users, they are visualized using a business intelligence (BI) tool or fed into an ML algorithm to make future predictions. Once the appropriate answers have been presented to the user, this gives them insight into the data that they can then use to make further business decisions.

2. Designing big data processing pipelines
One of the critical mistakes in many big data architectures is handling multiple stages of the data pipeline with one tool. A fleet of servers managing the end-to-end data pipeline, from data storage and transformation to visualization, may be the most straightforward architecture, but it is also the most vulnerable to breakdowns in the pipeline. Such a tightly coupled big data architecture typically does not provide the best possible balance of throughput and cost for your needs.

When designing a data architecture, use the FLAIR data principles: Findability, Lineage, Accessibility, Interoperability, Reusability.
F: Findability. The ability to view which data assets are available and to access metadata, including ownership, data classification, and other mandatory attributes for data governance and compliance.
L: Lineage. The ability to find the data origin, trace data back, and understand and visualize data as it flows from data sources to consumption.
A: Accessibility. The ability to request a security credential granting entitlement to access the data asset. It also requires a networking infrastructure to facilitate efficient access.
I: Interoperability. Data is stored in a format that will be accessible to most, if not all, internal processing systems.
R: Reusability. Data is registered with a known schema, and attribution of the data source is clear. This may encompass Master Data Management (MDM) concepts.
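To make findability and reusability concrete, the following minimal sketch (not part of the original slides) registers a hypothetical dataset and its schema in the AWS Glue Data Catalog with boto3; the database name, table name, columns, and S3 location are illustrative assumptions.

```python
# Hypothetical sketch: register a dataset in the AWS Glue Data Catalog so that it is
# findable (searchable metadata), reusable (known schema), and attributable (owner
# recorded). Database, table, and S3 path are illustrative placeholders.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create a logical database (namespace) for the domain's data assets.
glue.create_database(DatabaseInput={"Name": "sales_domain"})

# Register a table with an explicit schema, owner, and classification so that
# consumers can discover and reuse it without guessing the layout.
glue.create_table(
    DatabaseName="sales_domain",
    TableInput={
        "Name": "orders_raw",
        "Owner": "data-engineering-team",                     # ownership for governance
        "Parameters": {"classification": "parquet", "pii": "false"},
        "StorageDescriptor": {
            "Columns": [
                {"Name": "order_id", "Type": "string"},
                {"Name": "order_ts", "Type": "timestamp"},
                {"Name": "amount", "Type": "double"},
            ],
            "Location": "s3://example-data-lake/raw/orders/",  # placeholder bucket
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
        "TableType": "EXTERNAL_TABLE",
    },
)
```

Recording ownership and classification alongside the schema is what lets downstream teams discover and reuse the asset without contacting the producer.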
Big data architects recommend decoupling the pipeline between ingestion, storage, processing, and getting insight. There are various tools and processes to consider at each stage when designing a big data architecture pipeline.

3. Data ingestion, storage, processing, and analytics
Data ingestion is the act of collecting data for transfer and storage. There are many places from which data can be onboarded. Predominantly, data ingestion falls into one of the following categories: databases, streams, logs, and files. Among these, databases are the most popular. Use the type of data your environment collects, and how it is collected, to determine what kind of ingestion solution is ideal for your needs.

Transactional data storage must be able to store and retrieve data quickly. End users need quick and straightforward access to the data, which makes app and web servers the ideal ingestion methods. For the same reasons, NoSQL and Relational Database Management System (RDBMS) databases are usually the best solutions for these kinds of processes.

Data transmitted through individual files is typically ingested from connected devices. File data generally does not require storage and retrieval as fast as transactional data. For file data, the transfer is often one-way, where data is produced by multiple resources and ingested into a single object or file storage for later use.

Stream data such as clickstream logs should be ingested through an appropriate solution such as Apache Kafka or Fluentd. Initially, these logs are stored in stream storage solutions such as Kafka, so they are available for real-time processing and analysis. (A minimal Kafka producer sketch follows the tool list below.)

Some popular open-source tools for data ingestion and transfer:
o Apache Sqoop: Sqoop is part of the Hadoop ecosystem project and helps to transfer data between Hadoop and relational data stores such as an RDBMS. Sqoop allows you to import data from a structured data store into the Hadoop Distributed File System (HDFS) and to export data from HDFS into a structured data store.
o Apache DistCp: DistCp stands for distributed copy and is part of the Hadoop ecosystem. The DistCp tool is used to copy large data within a cluster or between clusters.
o Apache Flume: Flume is open-source software mainly used to ingest a large amount of log data. Apache Flume collects and aggregates data to Hadoop reliably and in a distributed manner. Flume facilitates streaming data ingestion and allows analytics.
More open-source projects are available for streaming, such as Apache Storm and Apache Samza, which provide a means of reliably processing unbounded data streams.
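Since the slides name Apache Kafka as a typical landing point for clickstream and log data, here is a minimal producer sketch, assuming the kafka-python client, a local broker, and a hypothetical clickstream topic; it is an illustration, not part of the original material.

```python
# Minimal clickstream producer sketch using the kafka-python client.
# Broker address, topic name, and event structure are illustrative assumptions.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],              # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_click(user_id: str, page: str) -> None:
    """Send one clickstream event; Kafka retains it for downstream consumers."""
    event = {"user_id": user_id, "page": page, "ts": time.time()}
    producer.send("clickstream", value=event)           # topic name is a placeholder

publish_click("user-123", "/products/42")
producer.flush()   # block until buffered events are delivered
```

Because the events land in the stream store first, several independent consumers (real-time analytics, archival to object storage) can read them later without coupling to the producer.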
One of the most common mistakes when setting up storage for a big data environment is using one solution, frequently an RDBMS, to handle all of the data storage requirements. The best solution for your environment might be a combination of storage solutions that carefully balance latency with cost. An ideal storage solution uses the right tool for the right job, weighing multiple factors related to your data against the storage choice associated with each.

Choosing a data store depends upon the following factors:
o How structured is your data?
o How quickly does new data need to be available for querying?
o What is the size of the data being ingested?
o What is the total volume of data, and what is its growth rate?
o What will it cost to store and query the data in any particular location?
Once you have determined all the characteristics of your data and understand its structure, you can assess which solution you need to use for your data storage.

4. Visualizing data
The following are some of the most popular data visualization platforms, which help you to prepare reports with data visualizations as per your business requirements.
o Amazon QuickSight is a cloud-based BI tool for enterprise-grade data visualizations. It comes with a variety of visualization graph presets such as line graphs, pie charts, treemaps, heat maps, histograms, and so on. Amazon QuickSight has a data-caching engine known as the Super-fast, Parallel, In-memory Calculation Engine (SPICE), which helps render visualizations quickly.
o Kibana is an open-source data visualization tool used for stream data visualization and log exploration. Kibana also provides popular visualization charts such as histograms, pie charts, and heat maps, and offers built-in geospatial support.
o Tableau is one of the most popular BI tools for data visualization. It uses a visual query engine, which is a purpose-built engine for analyzing big data faster than traditional queries. Tableau offers a drag-and-drop interface and the ability to blend data from multiple resources.
o Spotfire uses in-memory processing for faster response times, enabling it to handle extensive datasets from various resources. It provides the ability to plot your data on a geographical map and share it on Twitter.
o Jaspersoft enables self-service reporting and analysis. It also offers drag-and-drop designer capability.
o Power BI is a popular BI tool provided by Microsoft. It provides self-service analytics with a variety of visualization choices.

5. Designing big data architectures
Big data solutions comprise data ingestion, storage, transformation, and visualization in a repeated manner to run daily business operations. You can build these workflows using open-source or cloud technologies. Some big data architecture patterns:
o Data lake architecture
o Lakehouse architecture
o Data mesh architecture
o Streaming data architecture

Data lake architecture
A data lake is a centralized repository for both structured and unstructured data. The data lake is a combination of the different kinds of data found in the corporation. It has become the place where you can offload all enterprise data to a low-cost storage system such as Amazon S3. You have access to data using a generic API and open file formats, such as Apache Parquet and ORC. The lake stores data as is, using open-source file formats to enable direct analytics and machine learning uses. The data lake is becoming a popular way to store and analyze large volumes of data in a centralized repository.
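As a small illustration of landing data in the lake's raw layer in an open format, the following sketch (not from the slides) writes a pandas DataFrame to Amazon S3 as Parquet, reusing the hypothetical example-data-lake bucket from the catalog sketch above.

```python
# Minimal sketch: land a batch of records in the data lake's raw layer as Parquet.
# Bucket name, key, and columns are illustrative assumptions; pyarrow and s3fs must
# be installed, and AWS credentials come from the usual environment configuration.
import pandas as pd

orders = pd.DataFrame(
    {
        "order_id": ["o-1001", "o-1002"],
        "order_ts": pd.to_datetime(["2024-01-05 10:32:00", "2024-01-05 10:35:12"]),
        "amount": [49.90, 120.00],
    }
)

# Write the batch as a Parquet file under the table's registered S3 location.
# A production pipeline would typically also partition the data, for example by date.
orders.to_parquet(
    "s3://example-data-lake/raw/orders/orders_2024-01-05.parquet",
    engine="pyarrow",
    index=False,
)
```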
Benefits of a data lake
o Data ingestion from various sources: Data lakes let you store and analyze data from multiple sources, such as relational and non-relational databases and streams, in one centralized location for a single source of truth.
o Collecting and efficiently storing data: A data lake can ingest any kind of data structure, including semi-structured and unstructured data, without the need for any schema.
o Scaling up with the volume of generated data: Data lakes allow you to separate the storage and compute layers so you can scale each component separately.
o Applying analytics to data from different sources: With a data lake, you can determine the schema on read and create a centralized data catalog of data collected from various resources. This enables you to perform quick ad hoc analysis.
Figure: Object store for data lake

Data lake architecture in AWS
Data is ingested into centralized storage from various resources, such as relational databases and master data files. All of the data is stored in the raw layer of the data lake in its original format. This data is cataloged and transformed using the AWS Glue service. AWS Glue is a serverless data cataloging and ETL service based on the Spark framework in the AWS cloud platform. Transformed data is stored in the data lake's processed layer, where it can be consumed for different purposes. Data engineers can run ad hoc queries using Amazon Athena, a serverless query service built on top of managed Presto instances, and use SQL to query the data directly from Amazon S3. Business analysts can use Amazon QuickSight, Tableau, or Power BI to build visualizations for business users, or load selective data into Amazon Redshift to create a data warehouse mart. Finally, data scientists can consume this data using Amazon SageMaker to perform machine learning.
Figure: Data lake architecture in the AWS platform

Lakehouse architecture
A new architecture paradigm called lakehouse architecture has emerged to address the limitations of data lakes and data warehouses. Lakehouse architecture aims to leverage the benefits of both: the scale of a data lake to ingest and store an ever-increasing amount of data in open formats that customers want to analyze, and the user-friendliness of SQL queries and the guarantees of a data warehouse. The main aspects of lakehouse architecture are:
o Data storage in open data formats
o Decoupled storage and compute
o Transactional guarantees
o Support for diverse consumption needs
o Secure and governed
Amazon Redshift Spectrum is a tool that provides the ability to query data from the data lake (such as S3) without storing the data in the data warehouse. This allows the transformation of Amazon S3 from a data lake architecture (purely storing raw data) into a lakehouse architecture.
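The ad hoc, query-in-place pattern described above, with Athena running SQL directly over files in S3, can be sketched as follows. This is an illustrative example rather than part of the slides; it queries the hypothetical orders_raw table and sales_domain database introduced earlier, with a placeholder results bucket.

```python
# Minimal sketch: run an ad hoc SQL query against data in S3 using Amazon Athena.
# The Glue database/table and the results bucket are illustrative placeholders.
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Start the query; Athena reads the Parquet files registered in the Glue catalog.
run = athena.start_query_execution(
    QueryString="SELECT COUNT(*) AS order_count, SUM(amount) AS revenue FROM orders_raw",
    QueryExecutionContext={"Database": "sales_domain"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
query_id = run["QueryExecutionId"]

# Poll until the query finishes (a production job would use backoff or an event hook).
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows[1:]:  # the first row holds the column headers
        print([col.get("VarCharValue") for col in row["Data"]])
```

The same "query the lake in place" idea underlies Redshift Spectrum: compute is brought to the open-format files rather than copying them into the warehouse first.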
Data mesh architecture
The major difference between data mesh and data lake architecture is that, rather than trying to combine multiple domains into a centrally managed data lake, data is intentionally left distributed. Data mesh provides a pattern that allows a large organization to connect multiple data lakes/lakehouses within large enterprises, and to facilitate sharing with partners, academia, and even competitors. Data mesh marks a welcome architectural and organizational paradigm shift in how we manage large analytical datasets. The paradigm is founded on four principles:
o Domain-oriented decentralization of ownership and architecture
o Data served as a product
o Federated data governance with centralized audit controls
o Common access that makes data consumable

Streaming data architecture
Streaming data is one of the fastest-growing data segments. You need to ingest real-time data from various resources, such as video, audio, application logs, website clickstreams, and IoT telemetry data, and process it quickly to provide fast business insights. Streaming data architecture is different because it needs to process a continuous stream of massive data at very high velocity. Often this data is semi-structured and needs a good amount of processing to get actionable insights. While designing a streaming data architecture, you need to scale data storage easily while getting real-time pattern identification from time-series data.

As an example, data is ingested from a wind farm to understand wind turbine health and speed. It is important to control wind turbines in real time to avoid costly repairs when wind speeds exceed the limit that a wind turbine can handle. The wind turbine data is ingested into Kinesis Data Streams using AWS IoT. Kinesis Data Streams can retain the streaming data for up to a year and provides replay capability. The stream is fanned out to deliver the data to multiple consumers, where you can transform the data using Lambda and store it in Amazon S3 for further analytics using Amazon Kinesis Data Firehose. (A minimal ingestion sketch follows the summary below.)
Figure: Streaming data analytics for IoT data

Summary
o What is big data architecture?
o Designing a big data processing pipeline
o Data ingestion, storage, processing, and analytics
o Data visualization
o Designing big data architectures
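To make the streaming ingestion step of the wind turbine example concrete, the following sketch (not part of the original slides) publishes telemetry records to a Kinesis data stream with boto3. The stream name and record fields are assumptions, and in the architecture described above the records would arrive via AWS IoT rather than from a direct producer.

```python
# Minimal sketch: publish wind turbine telemetry to an Amazon Kinesis data stream.
# Stream name and record fields are illustrative assumptions; in the architecture
# above, AWS IoT would forward device messages into the stream instead of this
# direct producer.
import json
import time

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def publish_telemetry(turbine_id: str, wind_speed_ms: float, rpm: float) -> None:
    """Send one telemetry record; the partition key keeps a turbine's records ordered on one shard."""
    record = {
        "turbine_id": turbine_id,
        "wind_speed_ms": wind_speed_ms,
        "rpm": rpm,
        "ts": time.time(),
    }
    kinesis.put_record(
        StreamName="wind-turbine-telemetry",        # placeholder stream name
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=turbine_id,
    )

publish_telemetry("turbine-07", wind_speed_ms=18.4, rpm=14.2)
```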