
aws-big-data-in-production.pdf


Full Transcript


Course Overview Hi, my name is Matthew Alexander, and welcome to my course, AWS Big Data in Production. I'm a software engineer at Lucid Software, where my forte is creating repeatable, robust solutions to big data problems. With more than 2.5 quintillion bytes of data generated every day, it is no wonder that new techniques for efficiently processing increasing amounts of data, as well as architectural patterns to support such large datasets, are introduced at such a rapid pace. In this course, we will cover topics adjacent to big data that in turn will help you effectively practice big data in your own organization. Some of the major topics we will cover include automating architecture deployments through CloudFormation, securing your data, several supported patterns for controlling costs, and, possibly most important, visualizing data with AWS QuickSight. By the end of this course, you will have a better holistic set of skills pertaining to your big data practices. I hope that you'll join me as we seek to strengthen some of the foundational principles of big data practices with the AWS Big Data in Production course at Pluralsight. Automating Governance with CloudFormation: Introduction Hi, my name is Matthew Alexander, and welcome to my course on using AWS Big Data in Production. In this module, we will learn about some of the tools AWS provides for centralized governance, and how to use them to deploy and maintain infrastructure related to AWS's big data offerings. We'll start off with a high-level overview of one of AWS's offerings for centralized governance, CloudFormation. Along the way, we'll touch on some of CloudFormation's more advanced features, including provisioning infrastructure and application services in multiple AWS regions and accounts. Additionally, with the power that CloudFormation provides in provisioning resources, we will investigate some of the supported patterns for controlling and auditing costs. In the end, we will demonstrate how CloudFormation can greatly benefit Globomantics, a fictitious company, by ensuring robust, repeatable deployments for Globomantics' underlying infrastructure in an effort to support a big data platform. One of the great, but also problematic, trends in the software industry is using shiny words for old concepts. Mine today is governance. Governance is the activity or action of governing something. Governance can take many forms. Maybe it's a mechanism to ensure the application servers always have the latest security patches, or it could be a mechanism to ensure that deploying software is repeatable and atomic, either succeeding or failing completely. One of the most important principles when dealing with governance is that best intentions never work. Always introduce some type of mechanism that will help you realize what your best intentions actually are. So what is CloudFormation? CloudFormation is AWS's infrastructure-as-code offering. Using a simple formatted text file, you can provision entire applications and networking backbones, or stacks as they are called, in multiple regions and across different AWS accounts. Like many other governance tools such as Puppet or Chef, CloudFormation uses a declarative programming model, meaning you describe your desired end state in terms of resources, and AWS will take care of the rest. CloudFormation provides lots of features, enabling you to build very complex components.
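Since CloudFormation templates come up throughout this course, it helps to see how small one can be. The following is a minimal illustrative sketch of the declarative model described above, not a template from the course; the logical ID is made up:

AWSTemplateFormatVersion: '2010-09-09'
Description: Minimal illustrative stack that declares a single resource
Resources:
  ExampleDataBucket:        # hypothetical logical ID; CloudFormation creates and manages this bucket
    Type: AWS::S3::Bucket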
Notwithstanding this, CloudFormation requires careful design and planning to deliver something that avoids future headaches, as goals and requirements invariably change with time. CloudFormation's core concepts can be broken down into primarily four components. Overarching these four components is the CloudFormation stack. Stacks are named containers, much like a folder is to a set of documents. In our case, instead of documents, stacks contain parameters, mappings, resources, and outputs. Other, more advanced concepts do exist, such as dynamic parameters, drift detection, stack sets, and integrating CloudFormation with AWS CodePipeline. I won't spend too much time on them in this module, but I would encourage you to research them on your own. Notwithstanding the more advanced concepts, the previously mentioned core foundational concepts provide the building blocks to do some pretty amazing things. Let's go ahead and dig into some of these concepts as we examine how CloudFormation is used to build out Globomantics' underlying infrastructure. CloudFormation: Parameters Near the top of CloudFormation's core concepts are parameters. Although simple to understand, parameters often go underutilized, missing out on hidden value that enables engineers to create complex and resilient software components. CloudFormation parameters define and capture a stack's user input. Parameters act as variables and provide capabilities for input validation across different data types, even specifying an allowed set of values. Where appropriate, default values can be used to allow for optional user input. Parameters, like variables, can be referenced throughout the stack template, and we will see plenty of examples of this functionality throughout this module. Parameters are also one of the best-kept secrets to controlling costs. Using parameters, we can lock down which EC2 instance types are available in different AWS regions, or how many EC2 instances can be provisioned. Let's take a look at a concrete example to understand not only a parameter's obvious functionality, but also how a parameter might be used more meaningfully inside of a CloudFormation template. In this example, a simple parameter named Environment is declared. The value for the Environment parameter can be either development, staging, or production, with a default value of development. Reading the parameter's description, we are given to understand that the Environment parameter actually corresponds to an AWS resource tag named Environment. Building up one layer, what is not immediately apparent is that this Environment parameter, when applied to our stack's resources, can actually allow us to see environment-specific billing for resources in our AWS account through the Cost and Usage Reports application. This same approach is used in the CloudFormation template designed for Globomantics' underlying infrastructure. The specified parameters contain a description and both default and allowed values. Like the canned example, the Environment parameter is also used to define resource tags and descriptions, and eventually to improve cost and usage reports. Lastly, the Environment parameter acts as part of a mapping key for determining the appropriate subnet or VPC CIDR block. At the end of the day, parameters are exceptional at the functionality they provide, but they require careful consideration as to when to define them.
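As a concrete reference, here is a hedged sketch of the kind of Environment parameter described above; the description text is mine, not the course's exact template wording:

Parameters:
  Environment:
    Type: String
    Description: Applied as the Environment resource tag; drives environment-specific billing in Cost and Usage Reports
    Default: development
    AllowedValues:
      - development
      - staging
      - production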
CloudFormation: Mappings In this section, we will talk about CloudFormation mappings, specifically, how CloudFormation mappings provide a solid systematic approach for defining static configuration within a CloudFormation template. Likewise, we will also discuss their usage, both appropriate and not, as mappings can often be elaborate and overwhelming. A CloudFormation mapping is none other than a named object with key-value pairs. If you're familiar with JSON, a mapping is essentially a JSON object. Mappings are very useful for environment-specific properties, such as an environment-specific AMI mapping which associates environments and regions to Amazon Machine Image (AMI) IDs. In general, it's best to consider mappings as sitting on roughly the same tier as a C header file definition, as they define a set of constants available for use in an implementation. One thing to keep in mind as you research your own examples is that mappings are often extremely verbose, contributing to incredibly large CloudFormation templates and causing some confusion. Best practices for mappings are generally vague, and most of the mapping examples you will see contain code copied and pasted from somewhere else. Notwithstanding the vagueness, mappings are excellent at constraining certain criteria about your environment that you don't wish to have specified through input parameters. For Globomantics, mappings are essential to building a robust global infrastructure. They define static configuration for non-overlapping VPC CIDR blocks. These pre-defined non-overlapping VPC CIDR blocks are separated by region and environment, which, when used in a shared VPC setting, allow Globomantics to deploy and connect applications in up to 256 different worldwide regions. What makes this approach even more developer friendly is that in order to achieve this functionality, all that is needed is input from the user as to which environment they would like to deploy. As a final note, remember that parameters and mappings often come as a pair, meaning that a parameter might act as part of the mapping key when retrieving some specific configuration. This scenario will be explored more in depth later. CloudFormation: Resources CloudFormation resources are at the heart and core of CloudFormation stacks, and deserve our utmost attention. CloudFormation resources can range from DynamoDB tables, EC2 instances, and Auto Scaling groups, to VPCs, subnets, and CloudWatch alarms. Resources are declared in such a way that they define a directed acyclic graph, or DAG for short. Behind the scenes, this means that CloudFormation will build an ordered execution list for creating your resources, so that a resource's dependencies are created before the resources that depend on them. As you define resources in your CloudFormation template, make sure you constantly check back with the provided documentation, as it is a great resource. Often there will be new features released, and the resources documentation will be updated to reflect that. Resource definitions are commonly paired with intrinsic functions. Intrinsic functions introduce different types of dynamic behavior into templates, such as allowing for dynamic access to mappings or conditionally creating stack resources based upon certain parameters. We will examine intrinsic functions later as we dive into the actual provisioning of Globomantics' networking backbone. Often, I find it best to see an example in code to really give a good contextual background.
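The example discussed next is not reproduced verbatim in this transcript, so here is a sketch in the same spirit; the mapping name, the sizing values, and the logical IDs are illustrative, and the referenced launch configuration and subnets are assumed to be declared elsewhere in the template:

Mappings:
  EnvironmentSizing:                 # illustrative mapping keyed by the Environment parameter
    development:
      MinSize: '1'
      MaxSize: '2'
      DesiredCapacity: '1'
    staging:
      MinSize: '1'
      MaxSize: '4'
      DesiredCapacity: '1'
    production:
      MinSize: '2'
      MaxSize: '6'
      DesiredCapacity: '2'

Resources:
  ApplicationAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      # FindInMap pulls the per-environment values out of the mapping above
      MinSize: !FindInMap [EnvironmentSizing, !Ref Environment, MinSize]
      MaxSize: !FindInMap [EnvironmentSizing, !Ref Environment, MaxSize]
      DesiredCapacity: !FindInMap [EnvironmentSizing, !Ref Environment, DesiredCapacity]
      LaunchConfigurationName: !Ref ApplicationLaunchConfiguration   # assumed to be defined elsewhere
      VPCZoneIdentifier:
        - !Ref ApplicationSubnetA                                    # assumed subnet parameters or imports
        - !Ref ApplicationSubnetB
      Tags:
        - Key: Environment
          Value: !Ref Environment          # the input parameter flows straight into the resource tag
          PropagateAtLaunch: true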
In the provided example, an AutoScalingGroup resource type is defined. Of special note, the properties MaxSize, MinSize, and DesiredCapacity all reference an intrinsic function called FindInMap. FindInMap does exactly what you would think, it references a CloudFormation mapping and retrieves a particular key value. In a similar vein, the Tags property defines a single tag with a name, environment, and for its value, references an input parameter, Environment Parameter. Many of the resources defined for Globomantics networking implement the same pattern. Based upon an input parameter, a mapping value is then inserted into a Resources property. In conclusion, as you seek to implement your own templates, just be considerate of this pattern and how you can leverage it effectively. CloudFormation: Outputs Let's briefly review CloudFormation outputs, but examine more closely how they can be utilized to form multi-stack architectures. CloudFormation outputs export metadata about CloudFormation stack resources. In the way of semantics, outputs generally reference resource properties. Once exported, they can be referenced by other stacks in the same region. This is an excellent feature. it really allows for efficient design and evolution of your architecture. For example, imagine you create a CloudFormation template for a particular application. In the application, you define an elastic IP, which needs to be used by another service. Instead of copying the elastic IP from the console or overcrowding your already existing CloudFormation template, you can export the elastic IP and reference it from another CloudFormation template. Be aware, however, that outputs must be unique across an AWS account and region. Because of this constraint, outputs greatly benefit from namespacing, for example, stack name - export name. These patterns are easily seen inside Globomantics' use of CloudFormation. Globomantics separates VPC creation from application subnets using namespaced exports. As can be seen here, every newly created VPC exports its own ID, as well as public route table and CIDR block. In conclusion, unfortunately, or fortunately, depending on your perspective, CloudFormation outputs deserve a formalized discussion regarding what they should be and how they should be named. Controlling and Auditing Costs With so much power in our hands when using CloudFormation, mechanisms need to be put into place to control and audit AWS costs. Using CloudFormation's core concepts, we can begin to piece together some mechanisms to ensure that our best intentions are realized in trying to make this happen. In this section, we will examine some good principles for controlling and auditing costs with CloudFormation. Currently, CloudFormation provides several mechanisms for controlling costs throughout your AWS environment. The first and foremost mechanism to be considered are parameters. Parameters, as mentioned before, provide an allowed values property to constrain stack inputs. These are exceptionally useful, and they can dictate, for example, allowed EC2 instance types, or even the number of EC2 instances. Secondly, and up one layer, are resource properties. Resource properties often go hand-in-hand with CloudFormation mappings or parameters, however, this need not always be the case. Hard-coding configuration can often be an effective instrument instead of the often used dynamic configuration through mappings or parameters. 
This may be especially useful when declaring DynamoDB tables and ensuring that the resource properties for both read and write capacity units are always statically defined. While the previously mentioned mechanisms are proactive, there exist reactive mechanisms as well. First among those are CloudWatch dashboards. With CloudWatch dashboards, you can visualize costs on a periodic basis and react accordingly. In harmony with CloudWatch dashboards, CloudWatch alarms alert you as you breach thresholds. These thresholds could be as simple as the number of EC2 instances in your AutoScalingGroup or the predicted cost for a particular service day over day. While each mechanism provides its own value, it is important to weigh the pros and cons of each and apply them where needed. When implementing more reactive measures such as dashboards and alarms, there are several key takeaways. First, take time to identify all metrics, including the necessary correlations, that provide good answers to relevant business questions. I would say that generally, from my experience, engineers don't actively consider what questions will need to be answered about the data being ingested. Asking these types of questions beforehand means that, in addition to seeing the number of EC2 instances in our AutoScalingGroup on the dashboard, we also visualize the event that caused our EC2 instance count to drop or increase. Secondly, create dashboards using the identified metrics to capture meaningful insight. A subtle component of this idea is to think in groups when you create the dashboard, so that similar questions can be answered from a similar location on the page. In an emergency, no one wants to scroll through several pages of a dashboard to find out why a metric at the top is the value that it is. Lastly, potentially use horizontal annotations to provide at-a-glance contextualized information. At-a-glance is a very important phrase. Internalize it. Introducing mechanisms to provide contextual information will significantly improve response times during on-call events, and generally yield more meaningful business insights. I have experienced this scenario countless times. Always err on the side of caution when debating whether to include more context in a dashboard. One final thought about CloudWatch alarms is that a lot of planning and thought can and should go into alarm definitions. An example helps in this discussion. For the present scenario, the alarm is configured to fire if a particular service's estimated charges exceed a certain level. Although this alarm helps in controlling and auditing costs, take care that the baselines you define are reasonable to avoid alarm noise. Besides estimated charges, there are numerous dimensions that can be alarmed upon to help you verify that your service is behaving as expected. In summary, let's review some of the principles and best practices we covered for controlling and auditing AWS costs. Always remember that you can control costs proactively through both parameters and resource properties; on the reactive side, remember to visualize those costs through custom dashboards. Always seek to do what you can to monitor and audit these costs, because, inevitably, things will fall through. This can be done through dashboards and alarms. And as was mentioned before, remember that alarms should be both preemptive and responsive.
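To make those mechanisms concrete, here are two hedged sketches, one proactive and one reactive; the table definition, capacity numbers, service name, threshold, and referenced SNS topic are all illustrative, and the estimated-charges metric requires billing alerts to be enabled (it is only published in us-east-1):

Resources:
  EventSummaryTable:
    Type: AWS::DynamoDB::Table
    Properties:
      AttributeDefinitions:
        - AttributeName: EventId
          AttributeType: S
      KeySchema:
        - AttributeName: EventId
          KeyType: HASH
      ProvisionedThroughput:           # hard-coded capacity acts as a proactive cost control
        ReadCapacityUnits: 5
        WriteCapacityUnits: 5

  EstimatedChargesAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Reactive control that fires when a single service's estimated charges exceed the baseline
      Namespace: AWS/Billing
      MetricName: EstimatedCharges
      Dimensions:
        - Name: Currency
          Value: USD
        - Name: ServiceName
          Value: AmazonEC2
      Statistic: Maximum
      Period: 21600                    # billing metrics are only published every few hours
      EvaluationPeriods: 1
      Threshold: 100                   # illustrative baseline; tune it to avoid alarm noise
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - !Ref BillingAlertTopic       # assumed SNS topic defined elsewhere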
By utilizing each of these principles, you will almost be guaranteed that your expenditures are both controlled and auditable on multiple dimensions. Deploying the Networking Backbone Without further ado, let's go ahead and demonstrate using CloudFormation to deploy Globomantics' networking backbone to support its future big data offerings. During the demonstration, we will cover deploying Globomantics' networking stack. We will also take a quick look at the CloudFormation Designer and what mechanisms are available to manage our newly created stack. Lastly, we will discuss how we can use the networking backbone we have created to integrate Globomantics' big data application framework in the future. Let's start off with a brief description of Globomantics use of CloudFormation in setting up a robust network stack. Globomantics' networking follows a very hierarchal pattern, starting with the creation of an environmental VPC. After the creation of an environmental VPC, application-level network stacks are created, referencing the necessary environmental VPC attributes such as VPCId, CIDRBlock, and public and/or private route tables. All application=level network stacks create their own single use subnets in two different availability zones for redundancy. Both VPC and subnet resources are created with resource sharing in mind, meaning that a single AWS account owns all networking related to a specific environment and delegates subnets to child AWS accounts as necessary. Let's go now to the console to provision some of these resources. When creating a CloudFormation stack, head to the CloudFormation console. From there, click on Create new stack. You'll then walk through a series of steps to provision your resources. In our case, we have an existing VPC template that needs to be provisioned. Template files can be uploaded from multiple locations including S3 and the local file system. S3 is very useful when using AWS CodePipeline, as you can build a CI/CD pipeline for application infrastructure very simply. For this example, I'll click on Upload a template file and select the Globomantics VPC template. As a side note, as CloudFormation is a regional service, make sure that you are in the region you want to be in. I can't tell you how many times I've gone looking for resources and not find them, only to find out later that my browser cookies had pointed me to another AWS region. Prior to creating the resources, let's take a look at how they appear in the CloudFormation designer, and make sure that the template has valid syntax. You can also validate template syntax locally with the validate template command in the AWS CLI. From my experience, the designer gets mixed thoughts among engineers. Some love it, some hate it and never use it. I found it useful to give me a visual check on what it is that I'm creating. Here I can see a VPC public route table, etc., and all appears to be good. As an extra measure of validation, let's validate the template here and make sure that it will actually provision successfully. Executing a validate template command provides that quick and early feedback to avoid future headaches. Always do it. So all seems to be good, so let's continue and finalize these resources. Many of these steps are pretty mundane, and so the best practice is to automate stack creation through CodePipeline and StackSets. As AWS creates a stack, notification events will occasionally appear, letting me know the status of my individual resources being created. 
It looks like everything succeeded. At this point we're only halfway done, as we don't have any applications or other networking running inside our VPC. Given how much data Globomantics might process, let's create an underlying networking foundation for our complex event processor in an effort to crunch some big data. As was mentioned previously, Globomantics separated VPC creation from application networking. This means that I will need to create a new stack for the complex event processor application's networking. Knowing this beforehand, I have one ready that creates two public subnets in different AZs. The same steps that were used to create the environmental VPC are used to create the application's networking. Very similar to the VPC creation, several resources can be seen from their respective consoles. These public subnets can then be used in yet another CloudFormation stack which creates EMR clusters for Apache Flink or Apache Spark processing, or AutoScalingGroups running a standard application framework. You should be aware that there is a strong trend toward fine-grained microservices instead of giant monolithic applications. CloudFormation templates are not immune from this debate. You will see some very massive templates, and some very modular, fine-grained ones as well. Both have their uses and respective goals. Through experience, discerning the invisible boundaries between templates will become easier. In Globomantics' case, the decision to make more fine-grained modular components allows for easier evolution of the underlying service architecture. This is due to the decoupling of application networking from VPC networking. There are no hard-and-fast rules to follow, as infrastructure design is as much an art as it is a hard science. In conclusion, in this module we discussed the foundational components of CloudFormation, including parameters, mappings, resources, and outputs, and how they each play a part in the overarching evolution of a chosen architecture. We also demonstrated deploying templates and managing them, including controlling and auditing costs. To finish, we discussed how the CloudFormation template that was created for Globomantics provides not only the ability to evolve, but also a smooth integration path for future applications. Securing Data with IAM and Encryption at Rest Hi, my name is Matthew Alexander, and welcome back to AWS Big Data in Production. In this module, we will learn about securing data through IAM, Identity and Access Management, as well as encrypting data at rest. Following from the first module, the foundational components introduced there to instrument scalable Infrastructure as Code will provide the mechanisms needed to create a well-architected support structure for big data, in an effort to ensure that customer data is secure as it flows from one point to another. At the end of the first module, multiple CloudFormation templates were created. Primary among those was the networking backbone template that enabled the creation of a single VPC per AWS region and stage, where stage was one of three potential values: Development, Staging, or Production. This structure was built around the concept of AWS's shared VPCs. With this approach, each unique application and environment could be located in an isolated AWS account to promote strict isolation. This gave Globomantics a solid foundation and the capability to expand operations worldwide if needed.
With the networking backbone out of the way, we ended by creating a CloudFormation template that provided isolated networking for each potential application. In an effort to promote resiliency, the application was spread across two subnets, each of which was located in a different availability zone. It would have been very easy to create application-specific resources in the single VPC template, however, this would have been agonizing and painful to deal with as Globomantics expands in the future, especially coordinating updates to the Stack template when multiple teams may need new application networking. For the remainder of this module, we will look to establish a multi-prong support structure for securing application data as it flows from one point to another. First among the available mechanisms will be IAM, or Identity and Access Management. IAM will be used to secure resource-specific access control policies to ensure that only privileged users and/or applications have access to customer data. Second, we will examine S3, Amazon's Simple Storage Service. In particular, S3 encryption schemes will be examined for securing data at rest, as well as resource-based policies that enable or disable access. Additionally, S3's access control lists will be briefly mentioned. Lastly, we will look at securing data before it even leaves the server with encrypted elastic block storage volumes. In the end, with each of the mentioned mechanisms, we will have high confidence that customer data is secure at each touchdown point. Securing Data: IAM In the industry, we often talk about battle-hardened applications. Applications that fall under this category have earned their stars through being stretched, ripped, and toppled over since their inception. From my experience, AWS's IAM service is just that. It provides a critically centralized service that spans all of AWS's product offerings. Although there are some gotchas involved with AWS IAM limits, or even some of the supported design patterns, IAM's feature set is enough to handle some of the most complex use cases out there. For our purposes today, we will only look to cover a very limited portion of IAM's feature set, specifically policies and their supported design patterns. Among the first to be covered will be inline policies, which are ad hoc permissions associated with an IAM role, user or group. Secondly, managed policies. Managed policies require more careful design, but are much more flexible, and can be natively applied at the same time to multiple different entity types. Although a generalist policy syntax will cover a majority of the user cases, policy conditions will be delved into to provide just that little bit more when finer-grain tools are needed. Lastly, we will check out the preferred mechanisms for applying resource identity and attribute-based policies to Globomantics' big data platform. Some time ago, I was asked to add a resource permission to an IAM role in Lucid's Staging environment. I added the permission, tried to save, and a very informative error appeared. The error stated that the inline policy for the IAM role had reached its character count limit. Needless to say, I had to revert my efforts, deduplicate certain permission statements in the inline policy, and then migrate functionality to custom manage policies. 
This leads to the first best practice of managing IAM's access control policies, namely that inline policies are useful for proofs of concept or one-time fixes, but should be migrated to customer managed policies as soon as possible. Migrating to customer managed policies is not as easy as it sounds. Customer managed policies require careful design, as they still have policy limits. Notwithstanding the design you choose to implement, make sure that you follow this everlasting principle: always grant least privilege, no matter the policy type. While least privilege is a founding principle of secure access, policy conditions provide very fine-grained tooling when specificity is needed. Here is a simple example of a policy condition that restricts access to a particular resource based upon the source IP. With policy conditions, multiple conditions can be checked at the same time. This is tantamount to applying a number of IAM's global policy conditions, coupled with service-specific, action-derived condition keys. An example of this could be locking down access to a DynamoDB table such that writes are only allowed for a primary key associated with the user's user ID, or providing field-specific read-only access to obscure sensitive information. In the end, make sure that a consistent naming scheme is used across all key-value pairs. Consistent naming is important, as AWS does not strictly apply case sensitivity when evaluating rules. This can cause headaches for debugging when a policy contains a condition for a specific resource tag, but one of your resources carries two tags whose names differ only in character case, each with a different value. As essential as thinking at lower levels can be, it is just as appropriate to think about the big picture. AWS provides several different levels of policies, namely identity, resource, and attribute. The most commonly implemented approach to resource-level isolation is that of identity-based policies. Identity-based policies apply resource-level permissions to an entity. A commonly used example is an AWS EC2 instance profile, which allows or disallows resource-level permissions for a particular EC2 instance. In this scenario, an EC2 instance assumes an identity with the identity's associated permissions. This is the approach taken by Globomantics when granting permissions to the complex event processor application. Resource-based policies are specific to AWS product offerings, and general support for them is limited. Selecting identity or resource-based policies comes down to your use case. In practice, it turns out that resource-based policies excel at granting and managing cross-account or external-account access, and internal usage is best left to identity-based policies. Attribute-based policies can best be described as providing an additional dimension to role-based access control, or RBAC for short. A very succinct example of RBAC is when you grant permission to a role for a particular resource. Attribute-based access control would go one step further and specify that only the subset of entities granted that role which also possess a particular attribute, for example, a specific department, would actually have access to the provided resource. Although powerful, always attempt to use RBAC first, and then apply ABAC if the occasion arises. Encrypting Data at Rest: S3 S3 provides various features for securing customer data at rest. Usage of each feature arises from application-specific use cases.
In this section, we will discuss each of the encryption options available and their intended use cases. Very much the standard for encryption at rest, S3 provides AES 256 server-side encryption. This option is very simple, at no extra charge to you, and S3 handles all the details. I would always recommend enabling this option as a good start. Many security compliance certifications require that data is encrypted at rest, and generally this option should suffice. Perhaps the most complex solution S3 offers is called customer managed server-side encryption. In this encryption scheme, S3 handles encrypting the data, however, the caller must specify the encryption algorithm and the data encryption key whereby the data is encrypted. However, that's not all. The customer must also provide the same data encryption key when requesting the object to be used in decryption. This subtle detail manes that when using customer managed server-side encryption, the customer must maintain all data encryption keys and the objects to which they are associated. I personally have used this encryption scheme, and it arose from a contractual agreement my company made with a customer. There are a lot of details and a lot of potential errors, including huge data loss. A lot of design and planning are necessary to make this work. Somewhere in the middle of the road is using AWS's KMS offering to encrypt S3 data. Personally, I like this option the best, as you don't have to manage encryption keys with a huge potential for data loss, and you are still able to rotate encryption keys on a regular basis. However, the downside to this is that in order for you to retrieve your data, AWS's KMS service has to be reachable by S3. Hand-in-hand with encryption are access control lists. Access control lists are canned policies that can be applied to individual objects. Essentially, this gives you the ability to create a shared storage layer, where in addition to potentially encrypting individual objects with customer-managed server-side encryption, you isolate who can access that specific object. Bucket-level ACLs are still available, however, as applicable. In the context of encryption, specifying a default encryption of AES 256 may be applicable when granting a log delivery group ACL, so that the logs from a public S3 bucket serving a static website are secure as they are delivered to an S3 bucket of your choosing. In conclusion, encryption should often be jointly considered with the policies you define for your S3 bucket and the consuming application. Encrypting Data at Rest: EBS One of the last touch points for customer data that we will cover is EBS volumes. Every EC2 instance that you create attaches a root volume and one or more additional volumes as configured. An EBS volume should be thought of as the hard drive component of your computer. In terms of AWS, one or more hard drives are allowed to be mounted at different directory paths on your EC2 instance. Each volume has a number of features and constraints associated with it. Some volumes support high throughput, no matter the size of the hard disk. Other volumes constrain this down, such that you only get so many operations per second until you increase the size of the volume, for example, from 16 GB to 32 GB. AWS provides managed encryption through the use of a default encryption key. 
This approach is very easy to configure, however, with this approach it is impossible for you to rotate encryption keys or provide different encryption keys to different volumes and reduce the attack surface if your encryption key gets compromised. Alternatively, AWS provides a customer-managed approach. In this approach, everything is the same as previously mentioned before, except you specify the encryption key ID using AWS KMS, and you manage the key rotation yourself. Both mechanisms are well suited to ensure data is encrypted at rest, however, it should be noted that you cannot change the encryption key after the volume is created. For Globomantics, this functionality is enabled by default, such that every EC2 instance that gets created starts with an encrypted EBS volume. In the end, both approaches work equally well, but depending on your use case it may be better to choose one over the other. Creating the Application Support Structure In this section, we will use the knowledge we have gained throughout this module to create the application support structure for Globomantics' big data platform. Several events will take place in this demo. We will create the application support structure including IAM roles, policies, and S3 buckets, and additionally we will take a high-level look at the complex event processor application. As we do that, we won't go into intricate details in order to save focus on the overall approach. The most important event is that we will actually deploy a new CloudFormation template for our application, following which we will generate some sample data to verify some functionality. To start, let's take a look at the application code. In our effort to deal with potentially huge amounts of streaming data, there are several application frameworks available. While there are many to choose from, I chose to use Apache Flink, which is a dedicated stream processing framework. Each Apache Flink job is a standalone Java application. Here we can see that in order for the application to run successfully, several command line arguments are needed, namely port, parameter, and destination. Each of these arguments have meaning in the overall context of the application. To give that context, this application when initialized will start an HTTP server listening on the specified TCP port. In order to send streaming data to our application, the HTTP server expects data to be sent via HTTP get requests with a query parameter containing the data. After performing some complex processing on the incoming data, the relevant information is output to our destination, which in our case will be an S3 bucket. Apache Flink has the ability to run various jobs concurrently. This means that our application may not always succeed on the first run. To address this, we have configured Flink to restart our application up to three times if it fails to start. Additionally, we have requested that the running job checkpoint itself every 10 seconds. Checkpointing essentially means persisting its own internal state to our S3 bucket. As data starts to flow into our application, we transform the simple string input from our query parameter into a paired data structure consisting of the event ID and its count. Every 10 seconds, the application will sum the events by their IDs. For the complex event processing, we apply a sequential window pattern, which essentially means that if in the first 10 seconds an event comes in more than five times, we keep it in memory. 
Then for the next 10 seconds we determine if the event increased in volume, for example, instead of 5 times, 10 times. This simple approach captures when a business event starts to trend and increases in frequency. For those events which match our pattern, we output them in string form to our S3 destination. Let's move on to the CloudFormation portion. The CloudFormation side of things creates the necessary support structure for this to happen. We first need an S3 bucket that will capture our events of interest. We encrypt this bucket to ensure that we keep data secure. Secondly, our application needs the ability to write to our S3 bucket. To support this, we create several resources, namely an IAM role, policy, and EC2 instance profile. In an effort to reduce the attack surface of our application, we restrict which TCP ports are open through the use of a security group. For the actual application, in an effort to keep the CloudFormation template simple, I opted to create an AutoScaling group instead of using AWS EMR. Through the use of the launch configuration resource block device mappings, each EC2 instance that is created in the AutoScaling group will have an encrypted EBS volume. Each time an EC2 instance initializes, a series of actions are run on the host as defined by the user data section. Finally, an AutoScalingGroup resource is created. Should we decide to update this CloudFormation template in the future, we made sure that we wouldn't take down all of our servers at once by specifying an update policy that makes sure that only one server is updated at a time. Let's go ahead and build this application using a mvn clean package. What this will do is it will package in all of the dependencies into one single Java jar that can then be uploaded to Apache Flink through the Web UI. Following the same pattern that we have done previously, we'll go ahead and create the CloudFormation stack. Now, I fast-forwarded through most of this because we have done it before, but the only difference here is that I created an EC2 key pair name that was used to then initialize the EC2 instances that were created so we could SSH into them. This CloudFormation stack takes about 15 minutes to create. With the stack created, I'll go then to the resources of the stack and get the AutoScalingGroup resource, and be able to use that to go to the Apache Flink web console. I'll click on the AutoScalingGroup resource, opening up a new tab. I'll then be able to go to the EC2 instance that was created and view its specifications and descriptions and metadata surrounding it. One of the things of particular note for me is the IP address of the EC2 instance. Using this, we are then able to connect to Apache Flink and their web console and upload any of the new jobs that we had just created. Going to Submit new job, I'll find the package that was built using mvn clean package, and then upload that. I fast-forwarded, as it takes about a couple of minutes for it to upload. After the upload, I'll then be able to specify the parameters that we talked about in the very beginning, specifying port 8181 and then the query parameter msg, message. And the destination actually is the S3 bucket that we created in a CloudFormation template. Going back for the CloudFormation screen, I can then grab the S3 bucket name, pass it in, and it's important that it's an actual absolute path, and then go ahead and submit the job. As the job starts to run, I'm then able to view any metadata surrounding it. 
As I can see right now, it's created, and has just transitioned into the running state. With the application job running, I'll go ahead and generate some sample data and verify that the functionality is actually working as expected. Going back to the CloudFormation screen, I can then go to the S3 bucket and see if any data was uploaded. And as we can see here, based upon our checkpointing every 10 seconds, data was generated, and we can then download it and see what the alerts were. And there we have it, an Apache Flink big data streaming application running on infrastructure provisioned by CloudFormation, processing events in a complex event processing framework and outputting alerts in textual form to an S3 bucket. Later, we'll take that information and visualize it with QuickSight. Monitoring Availability with CloudWatch Hi, my name is Matthew Alexander, and welcome back to AWS Big Data in Production. In this module, we will learn about monitoring application availability through CloudWatch, and in the process dive deeper into the founding principle behind logs and metrics, that of observability. Much like governance, observability is a buzzword. Originally the term comes from mathematics and found its way into control theory. Observability is defined as the ability to determine a system's internal workings based solely upon its outputs. Of special note here is the phrase internal workings, and by that I mean everything. In an ideal world, if implemented properly, observability techniques would allow application developers or system admins, upon getting a single alert, to diagnose and fix an issue at the onset. In the real world, things are not as easy. This type of achievement is either extremely difficult or too costly to implement. With these types of constraints placed on observability, it is common for engineering efforts to isolate the components that provide the biggest value with moderate effort. So, what are those components? In no particular order, first we have logs. Logs are emphatically the most common outputs used to observe a system's inner workings. Logs are partially predefined textual output. I purposely say partially because logs may output captured runtime configuration or exceptions. Generally, logs are segregated into various levels: trace, debug, info, warn, and error. Levels will vary by platform and framework. The second component is metrics. Metrics are quantitative measurements of the system. Some examples could be the number of HTTP requests that an application receives, or even the number of errors that are caught while processing a specific operation. In a similar vein to logs, metrics can have thresholds configured around them. When this takes place, alerts are created. Let's dive a little deeper into each of these components so as to better understand how to use them effectively. Logs will eventually be either your biggest blessing or your greatest curse. To avoid the latter, there are a couple of truths about logging that would be good to remember. First, logging is more than just outputting text. Logs should have a purpose and selectively output information that will help in debugging errors. One of the great debates that you may encounter as you improve logging within your organization is which log level should be recorded. My personal preference is to output only error-level logs to a centralized log store and to keep warn, info, and below on a local hard disk.
If these logs are needed, then a command line utility can be used to get them from the host. This leads to the second point, namely that the format of what you log is something to be considered carefully. A common format is JSON, however, JSON can add some overhead as it introduces extra bytes for each log output. If such a case applies to you, then using a delimited format may suit you better. Lastly, no matter what format you choose or what log level you record, logs should connect easily to one another very much like a train of transactions. This pattern helps significantly when your services are built in a distributed manner. The social buzzword for this type of logging is distributed tracing. When considering metrics, almost every vendor has their own terminology and/or format for how they are measured. What you should remember, though, is that there are a set of overarching best practices that will help you effectively use metrics. These best practices will be covered in more detail later, however, as a simple overview, the first best practice is to namespace your metrics, meaning to isolate and contain metrics submitted by one application or product so a metric with the same name submitted by another application or product doesn't corrupt your data. The next principle is that every metric contains a name. Often, I have seen companies prefix their metric names and suffer immense amount of pain when they wish to extract aggregate metrics across a product or distributed application. I would recommend avoiding this pattern and use very short, specific metric names. Each metric should also contain one or more dimensions. Dimensions are additional metadata surrounding the metric, if you will, an attribute about the data point. An example of this would be for HTTP requests outputting the dimension, status code equals 500. Lastly, every metric has a type. Types can vary widely by vendor, but some common ones are counters, gauges, and rates. Using these foundational best practices will help you greatly as you seek to increase the observability of your platform or product. Armed with this information, we can review Globomantics' current infrastructural platform for big data and understand ways in which it can be more observable. Currently we have an end user, which could be a web page or an internal application communicating with our backend service. For our backend service, we have Apache Flink running a complex event processor in up to two subnets with potentially three instances. Our service is listening on a supplied TCP port, and when an incoming event matches our predefined pattern, the event gets output in batch to S3. Our application in its current state is missing logs and metrics. What we will look to do over the course of this module is add CloudWatch support in the form of logs, metrics, dashboards, and other features to achieve a higher level of observability. Let's get started. CloudWatch: Concepts Conceptually, CloudWatch can be broken down into several manageable components. First are logs and metrics. CloudWatch provides both basic and complex support for logging, and divides logs conceptually into both log groups and log streams. Log streams are generally defined as log events that come from the same source, the same application, the same log file. Log groups operate at a higher level and encapsulate a group of log streams, applying the same retention, monitoring, and access control settings to each one. 
Log groups can be encrypted using AWS KMS, however, this option is only available through the AWS CLI at this time. CloudWatch metrics follow from our previous discussion and contain namespaces, metric names, dimensions, and metric types. To visualize both logs and metrics, CloudWatch provides capabilities for creating dashboards. Dashboards can be used to create a unified view of various metrics across a system, or to create an operational run book that can be used in the event of a production outage. In line with a DevOps-centric approach, CloudWatch enables the creation of alarms and notifications to relevant personnel in the event that an alarm threshold is breached. Generally, notifications are done through SNS, and can be configured to work with email or SMS. One of CloudWatch's most powerful features in my opinion is that both logs and metrics can be combined together such that CloudWatch will parse the logs and submit metrics in an appropriate scheme. This functionality is called filters. This can be extremely beneficial for generating metrics from server-access logs. all-in-all, through different combinations of CloudWatch components, all types of engineers and operators can effectively meet the requirements and demands placed upon them. CloudWatch: Logs Having already had a brief introduction to CloudWatch logs, let's head to the console in an effort to get familiar with creating log groups and configuring their retention period. CloudWatch logs can be accessed through the CloudWatch console. Generally, the link of interest is on the left-hand side of the screen, however, AWS loves to update the console design, so it may be in a different position. Clicking on the Logs hyperlink, I'm brought to the Logs user interface. We have no log groups currently, and I'm able to get started. What we want to do now is create a log group. Give it a name, but remember that log groups are composed of a collection of log streams, and log streams are specific to a source of information, so I want my naming to be a little bit more generic to potentially encapsulate a wide variety of log streams and their sources. For our big data application with Globomantics, I'm going to name the log group ComplexEventProcessor. And the log group is created. Now because we don't want to increase our AWS bill by a lot, we need to configure the log data retention period. Because we're in a development environment, we don't need logs to stay around for more than a single day. By clicking on the retention period, I can specify one day and then update the log group configuration. Notice also that each log group can have subscriptions and log insights enabled. Although these topics won't be covered here, subscriptions allow for the streaming of log data to Elasticsearch, and/or AWS Lambda for additional processing or indexing. Enabling Log Insights allows users to query logs in a much more complicated manner, such as filtering, selecting specific fields, generating statistics, and more. With this log group created, we can then set up our application to send our logs. Later on, we will create log groups using CloudFormation. CloudWatch: Metrics Conceptually, we have covered a lot of ground with CloudWatch metrics. In this section, we will look at some of the finer details, including some of AWS's predefined metrics, and how it is that we can publish our own custom metrics. 
Currently, our big data application for Globomantics uses a couple of AWS services, and with each managed service AWS provides predefined metrics, some for free and others for an associated cost. With our usage of S3, we get the following metrics for free, namely bucket size and bytes and number of objects. Request metrics for S3 are available for an additional cost. These metrics include request count broken down by type, and several additional dimensions including bucket name and storage type, among others. As we are also using AutoScaling groups, the underlying EC2 instances provide numerous metrics, much too long to list here. We'll head to the CloudWatch metrics console later and examine ways in which we can see what metrics have been provided by each service. Whenever you use a new service, make sure to check on the service's public documentation regarding what metrics are available. Often, new insights can be derived, but they are generally absent, because in our day-to-day tasks we completely forget to look back, at least I know I do. Now that we've briefly covered AWS provided metrics, let's look at how we can create our own custom metrics and get them published to CloudWatch. Here we have an example where we create an Amazon CloudWatch client. Additionally, we create a metric called Requests with type Count and value 10. We then associate a dimension to the Requests metric to indicate that each of those requests had a status of 404. After creating the metric, we then use the CloudWatch client to publish the metric to a custom namespace. Here we ignore the result, however, it is always good practice to assert that what you sent was acknowledged and received. With this approach, we can publish custom metrics to enhance the observability of our application. In conclusion, make sure that before you publish metrics, you understand how they will be used. This process leads to much more meaningful metrics and better usability. CloudWatch: Configuring Log Groups and Streams with CloudFormation In this section, we will use CloudFormation to encapsulate our creation of log groups and log streams. Previously we created our log group for the complex event processor inside the AWS console. Very powerful, but doing this operation manually incurs additional overhead, as now for each environment we'd need to do the same thing over again. Unfortunately, CloudFormation does not have the ability to inherit and manage preexisting resources. This means that we will migrate the previously defined log group to our application's CloudFormation template. Creating these two resources is very simple. At the top of our Resources section, we have defined a new resource type, LogGroup, and below it LogStream. The LogGroup's name is derived from the environment, and the specified retention period is referenced through an environment to days mapping. This allows us to map different retention periods to each unique environment. The log stream resource is simply given a name, and references the above-mentioned log group. As we are updating an existing CloudFormation template, I wanted to take a moment to look at CloudFormation change sets. A change set is an important element in CloudFormation that indicates what changes will take place in the CloudFormation stack. Be forewarned that sometimes CloudFormation will take a very destructive approach to architectural evolution by demolishing a modified resource and recreating it from scratch. 
As we update the development ComplexEventProcessor application stack, this doesn't seem to be the case. If ever you are in doubt, you can go to the CloudFormation resource page and under the field update requires, you will know if modification for that property destroys the resource. Now that we have created a managed log group and stream, we will next take a look at using the log group and log stream by installing the CloudWatch agent using our cloud configuration resource. CloudWatch: Installing the Agent With the desire to have some better observability around our application, let's install the CloudWatch agent on our EC2 instances using our launch configuration resource. This approach of configuring each instance at runtime is very effective for what our current needs are, however, generally this would be done through creating a static AMI that had the CloudWatch agent and all other necessary components preinstalled. Preconfiguring a static AMI would reduce start-up time for our Auto Scaling instances, and allow us to pin down software versions for our application supporting infrastructure. Because we have introduced new dependencies to our startup scripts, it is worthwhile to reevaluate our usage of the User Data section of our launch configuration. Currently we issue commands and hope that all goes well. What happens if an error takes place during the process? Does our instance continue to start up, or does it fail in the CloudFormation template? The current functionality is that the instance will continue to start up and start successfully, however, the template User Data section will fail. This means that the instance is available inside of the Auto Scaling group, but is not fully functional or fully provisioned as of yet. In order to address this issue, CloudFormation introduced the CloudFormation::Init metadata property for launch configuration resources. The CloudFormation::Init metadata property allows you to use declarative programming to specify your server's end state, and AWS takes care of retries, packet file downloads, etc. So looking at our CloudFormation template, let's see how we migrated our User Data section to CloudFormation::Init. Previously, we defined package dependency installation through the User Data section. With CloudFormation::Init, all we need to do is specify the package name with an authorized list of versions, where an empty list means the latest version. For downloading Flink, we originally used a curl command without retries. After migrating, all we need to do is specify the desired extraction directory, and all other functionality stays the same. As for each of the commands that were run to move optional Java dependencies into the correct folder, nothing of particular note has changed after the change to CloudFormation::Init. Lastly, prior to migration, we needed to perform said operation on Flink's configuration file. This is no longer the case, as we can instruct CloudFormation::Init to create files on demand. CloudFormation::Init properties can be found in the relevant online documentation. It will not be our intent to cover everything in depth here. From our previous template, a potential security vulnerability existed where anyone was able to upload jar applications to the complex event processor through Apache Flink's Web UI. 
To address that vulnerability, instead of uploading through the Apache Flink web interface, we will create a new S3 bucket for holding our deployment package, upload our generated package to that S3 bucket, and then ask CloudFormation::Init to download the jar into the appropriate directory so our complex event processor application can run. This adjustment requires another small CloudFormation template, called ComplexEventProcessorSupport, in addition to adding the relevant IAM permissions for downloading the jar from the S3 bucket. The only resource created in this template is the S3 artifact bucket. Coming back full circle, our original intent was to install the CloudWatch agent for publishing application logs to CloudWatch. Installing the agent happens by downloading and installing the CloudWatch agent's binary package. A configuration file is then created to ensure that the agent picks up the appropriate files and sends them to our centralized log storage.

Prior to deploying this new stack, let's build our application using mvn clean package. With the package built, we can head to the AWS console and create the new support CloudFormation stack. Much of this has been done before, so we are just following the same pattern and can speed this up a little. When the S3 bucket gets created, we'll head to the Resources section, click on the S3 bucket link, and upload the artifact we built with mvn clean package to its final destination. As a side note, if you have a slower internet connection this could take a significant amount of time. With the jar file uploaded to our S3 bucket, we'll head back to the CloudFormation console and update the stack template that was used to create the application in the first place. Much of this, like before, has already been done, except that we have changed the input parameters to allow specifying the jar file name. We'll head over to the AWS S3 console, specifically to the bucket created from the support template, grab the key of the S3 object, and input that into our new CloudFormation template parameter. With the changes we've made, we'll make sure that CloudFormation doesn't act destructively and destroy resources that we created previously. As many of the updates can take a long time, I've fast-forwarded until the stack was completely updated. With the stack updated, let's head over to the Resources section, go into the Auto Scaling group, and identify the instance that was created so we can get its IP address, go to the Apache Flink console, and verify that we are no longer able to upload jobs through the UI. We'll then generate some traffic using curl to verify the functionality of our newly deployed application. Now that we have verified our functionality, let's head back to the CloudWatch console to verify that we're actually getting our logs. And, as we can see, there they are. In conclusion, CloudFormation provides many tools, each best suited to different scenarios. In our case, we started out using the User Data section and then migrated to CloudFormation::Init's declarative programming model to ensure that our instances come up successfully when they are provisioned and updated.
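For reference, the CloudWatch agent reads a JSON configuration file that tells it which files to ship and where. A minimal sketch is shown below; the file path, log group, and log stream names are assumptions of mine rather than the exact values used in the course.

```json
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/opt/flink-1.8.2/log/*.log",
            "log_group_name": "dev-complex-event-processor",
            "log_stream_name": "dev-complex-event-processor-stream"
          }
        ]
      }
    }
  }
}
```

The agent is then typically started with its amazon-cloudwatch-agent-ctl helper pointed at this configuration file.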
CloudWatch: Publishing Custom Metrics
In reviewing the current state of the world, our application now has increased observability through CloudWatch logging. The next step is publishing custom metrics. In this section, we will update our code sample to include custom metrics so that when an event matches our specified pattern we emit a counter metric. Additionally, we will update our IAM role permissions to allow publishing metrics to CloudWatch. In our application code, let's review the changes. As a recap, we have defined a pattern which looks for events that breach an initial threshold of 5 within 10 seconds and then, within the next 10 seconds, increase by 10 events. As we get matches to this pattern, we output them to S3 in batches. The first change is inside the code where we select matches to our pattern: we now create a new metric with a single dimension, namely pattern type. The metric name is Count with unit Count. Our namespace for this metric is Globomantics/ComplexEventProcessor. After creating it, we publish the metric using the asynchronous CloudWatch client. The async client is more appealing here because we don't want to hold up pattern matching in case our communication with CloudWatch is delayed for any reason. The second change was to follow good semantic versioning practice and increment the version of our packaged jar. With these changes, let's go ahead and re-package our application using mvn clean package. After re-packaging, we will upload the new version to our application's S3 artifact bucket. As we touched on when uploading our first packaged application, the S3 artifact bucket can be found from the Resources section in the CloudFormation stack window. Do make sure to take note of the package name, as it will be used when we update our CloudFormation template to take advantage of the new application version. AWS provides many different views of this data, especially for S3. Once uploaded, we will update our current stack to redeploy our application using the specified rolling update policy. The last change needed for this functionality was to add the cloudwatch:PutMetricData permission to the IAM role used by our ComplexEventProcessor. I have often found that this step gets missed; in general, each time you add another dependency on an AWS SDK, make sure to capture what permissions it needs. With everything done, let's generate some traffic and see if we can get some metrics sent to CloudWatch. Going back to the CloudWatch console, we click on Metrics instead of Logs, and we're brought to a screen where we can graph many of the custom and predefined metrics that AWS provides. There we have it. In summary, publishing custom metrics is very easy and can greatly improve any application's observability.

CloudWatch: Dashboards and Alarms
With our application now submitting logs and metrics, let's visualize our new-found observability through a simple dashboard and configure an alarm based on the metrics we have created. The changes needed for this functionality are all contained in the CloudFormation template. We have specified several new resources, namely a CloudWatch alarm, a dashboard, an SNS topic, and an SNS topic resource policy. Additionally, we added an input parameter to specify where notifications should be sent when alarm thresholds are breached. Let's briefly cover each new resource.
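Before that, here is a rough sketch of the metric-publishing change described at the start of this section. I am assuming the AWS SDK for Java v2 here, and the class name and dimension value are placeholders of mine; the course's actual code may differ in naming and SDK version.

```java
import software.amazon.awssdk.services.cloudwatch.CloudWatchAsyncClient;
import software.amazon.awssdk.services.cloudwatch.model.Dimension;
import software.amazon.awssdk.services.cloudwatch.model.MetricDatum;
import software.amazon.awssdk.services.cloudwatch.model.PutMetricDataRequest;
import software.amazon.awssdk.services.cloudwatch.model.StandardUnit;

public class MatchMetricsPublisher {

    private final CloudWatchAsyncClient cloudWatch = CloudWatchAsyncClient.create();

    /** Emit a Count metric each time the event pattern matches. */
    public void recordMatch(String patternType) {
        Dimension dimension = Dimension.builder()
                .name("PatternType")          // single dimension: pattern type
                .value(patternType)
                .build();

        MetricDatum datum = MetricDatum.builder()
                .metricName("Count")
                .unit(StandardUnit.COUNT)
                .value(1.0)
                .dimensions(dimension)
                .build();

        PutMetricDataRequest request = PutMetricDataRequest.builder()
                .namespace("Globomantics/ComplexEventProcessor")
                .metricData(datum)
                .build();

        // Asynchronous call so pattern matching is never blocked on CloudWatch
        cloudWatch.putMetricData(request);
    }
}
```

With that sketch in mind, let's walk through the new CloudFormation resources.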
AWS Simple Notification Service, SNS, is used as the medium for sending notifications to a provided email address. A resource-based policy is needed to make sure that SNS allows CloudWatch to publish messages to the defined SNS topic. With the SNS topic defined, a CloudWatch alarm is created, referencing the custom CloudWatch metric we created previously. The threshold we have here is very tight, leaving little room for error. Ideally, it will be adjusted over time as we learn what a normal trend looks like for our application's pattern matches. Having declared these resources, we create a simple dashboard which reflects our interests. Rather than going through the update process one more time, I have pre-updated the CloudFormation stack with the new configuration. We can see that we achieved our intended result: a very simple dashboard, but functional in the most important ways. In order to test our new alarm, in the background I've sent new traffic to the server in an effort to trigger the alert. As we can see, it works great. In summary, dashboards and alarms are vital and often left to the last minute. Through CloudFormation, this forgotten pattern can be made simple, repeatable, and robust.

Integrating AWS Auto Scaling
Maybe one of the most exciting aspects of AWS is the dynamic nature of its product offerings. This is surely the case with AWS Auto Scaling policies. In this section, we will look to improve our current Auto Scaling group configuration by implementing scaling policies that react to changes in our environment. A small number of changes are needed to introduce this functionality. First, we need to update the Auto Scaling group size mapping. In an effort to shift responsibility for who owns these mappings, we will move the MaxSize property for the Auto Scaling group to an input parameter instead of a mapping, while the MinSize property will be set statically at 1. This creates a gap between the minimum and maximum number of servers, allowing our soon-to-be-created Auto Scaling policy to increase the number of EC2 instances automatically. This change has already been made for each of the environments. Second, we need to create the new Auto Scaling policy that references our chosen metric. Of special note, there are several different types of Auto Scaling policies, namely target tracking, step scaling, and simple scaling. Target tracking will monitor and adjust the number of servers according to a target metric, for example, CPU utilization at 50 percent. Step scaling adjusts the number of servers based on a step scale, for example, if the breach is below 10 add 1 server, but if it is above 10 add 2. Simple scaling does not use steps at all and always adjusts the number of servers by a fixed amount. Choosing which policy makes sense requires a little more understanding of our environment. Several common choices are to monitor the number of incoming bytes, outgoing bytes, or memory usage; however, knowing our application, and that the data sent to it is very small, scaling by the number of bytes does not seem ideal, nor does memory usage. CPU usage seems the most appropriate here, therefore a single target tracking Auto Scaling policy has been created to keep our Auto Scaling group's average CPU utilization around 50 percent.
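Expressed in CloudFormation, such a target tracking policy looks roughly like the sketch below; the logical IDs are placeholders of mine, not necessarily those used in the course's template.

```yaml
CpuTargetTrackingPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AutoScalingGroupName: !Ref ComplexEventProcessorAutoScalingGroup
    PolicyType: TargetTrackingScaling
    TargetTrackingConfiguration:
      PredefinedMetricSpecification:
        # Track the group's average CPU utilization
        PredefinedMetricType: ASGAverageCPUUtilization
      # Add or remove instances to keep average CPU near 50%
      TargetValue: 50.0
```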
Lastly, before making these changes we referenced our application by the single IP address of our EC2 server. However, now that we may have more than one server, how do we distribute incoming HTTP traffic evenly, or even know which IP address to use for a newly created EC2 instance? The solution is to create an elastic load balancer for our application. Elastic load balancers sit in front of Auto Scaling groups and evenly distribute traffic across all EC2 servers. An additional benefit is that the newly created load balancer provides a single URL we can communicate with. Here we see the resource definition, where the load balancer is associated with our current Auto Scaling group, listening on ports 8081 and 8181 and forwarding traffic to the relevant backend ports. In an attempt to secure our application more fully, a new security group was created for the load balancer, and the existing security group for our complex event processor was modified to only accept traffic from the load balancer. Additional properties are available for elastic load balancers, but they will not be covered here in any detail. As a final step, let's deploy the updated template and generate some basic traffic for testing. To do that, we need to get the DNS name of the elastic load balancer that was created, and we will use it instead of the instance address we used previously when executing our curl requests against the application. With these resources created, we can be far more confident that our resources can handle the ups and downs of customer traffic.

Putting It All Together
Looking back at this module, we've covered a lot of information, so let's recap. In the beginning, our system had a single customer from which we would receive input; maybe the customer was a web page or some internal backend service. The input then flowed through to the application and, if the event stream matched our specified pattern, we would output the event IDs to S3 in batches. Now, whenever the event stream matches our pattern, we also output custom metrics to CloudWatch along with our application's logs. Each of these items can then be pattern matched, and alarm notifications are sent to us via AWS SNS when our metrics breach our alarms. When our scaling thresholds are breached, the Auto Scaling policy scales up the number of servers, and the load balancer distributes traffic evenly across our EC2 instances. Additionally, we created a framework for building streaming HTTP big data applications using CloudFormation, where we can just upload a newly packaged application to S3, update our template, and sit back and relax. Okay, well maybe not relax, but we can rest assured that we have increased our toolset for creating robust solutions to big data problems. Even after all that we have done, there are a good number of improvements that could be made. One, mentioned before, is migrating the current functionality from CloudFormation::Init to a preconfigured AMI. This doesn't necessarily mean abandoning CloudFormation altogether; CloudFormation::Init can still be used to provision a static EC2 instance from which you then create an AMI. Second, we are currently running Apache Flink in standalone mode. This works well for processing events independently of each other; however, when examining events that are related, it is best to use Apache Flink's cluster mode through AWS EMR. With the functionality that AWS EMR provides, you are then able to partition data in a consistent pattern.
In the end, through the addition of logs, metrics, dashboards, and alarms, we have created a resilient, observable system.

Visualizing Data with QuickSight
Hi, my name is Matthew Alexander, and welcome back to AWS Big Data in Production. In this module, we will learn about the importance of visualizing data and one of the mechanisms provided by AWS to do it, QuickSight. In the current technological ecosystem we find ourselves in, data has become a clear business priority. Notwithstanding how important data might be, the truth is that data is only powerful when it is understood. This brings us to the central theme of this module, namely that visualizing data in a comprehensible manner will provide more business value than almost anything else you can do. Rather than take this movement for granted, let's analyze some of the clear reasons why data visualization wins out. First, data, when visualized properly, can enable business insights. This holds true for a wide range of fields including marketing, engineering, analytics, research, and many more. One of my primary responsibilities as an engineer at Lucid is to improve site stability, and without data, and the proper visualization of that data, I don't fare very well at my job. Second, visualizations often communicate on common ground, conveying concepts in a universal manner. This becomes quite clear when you have projects that span multiple departments and/or countries. Lastly, visualizations take advantage of the way the human mind processes information. There's an often-repeated cliché that says go with the grain, and when it comes to conveying information, why not transmit it on the level the human mind understands best. AWS QuickSight attempts to address each of these points through multiple avenues: it scales to thousands of users so information is easily available to the masses, it provides interactive visualizations that let onlookers apply their own creativity in processing information, and it empowers individuals with machine-learning-backed insights. Let's dive into some of QuickSight's core concepts before demoing its functionality.

QuickSight: Concepts
In this section, we will focus on describing QuickSight's core concepts. QuickSight primarily begins with data sources. Data sources provide the input data on which QuickSight can perform a wide variety of functions. While QuickSight supports numerous data sources, including S3, Postgres, MySQL, and Snowflake, to name a few, make sure that your use case is supported, or determine how you might get your data into one of the supported sources. After importing data from a specified data source, QuickSight will then allow analyses to be run against the data. It is important to know that QuickSight supports only a small number of data formats, so in order to visualize the data our application outputs, I have updated the ComplexEventProcessor to emit line-delimited JSON objects. Some of the more advanced functionality provided by QuickSight comes in the form of functions. Functions provide a robust toolkit for developing keen business insight, offering aggregation, numeric, and conditional operations, among others. Lastly, AWS QuickSight provides filtering capabilities when working with imported data sources. I have seen a great deal of benefit come from being able to deep-dive into a large data set through filtering.
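To make that output format concrete, a couple of line-delimited records might look something like the sketch below. These records are illustrative only; I am inferring the start.id and start.count fields from the field names we will see in QuickSight shortly, and the application's actual schema may differ.

```json
{"start": {"id": "event-42", "count": 7}, "end": {"id": "event-42", "count": 19}}
{"start": {"id": "event-77", "count": 5}, "end": {"id": "event-77", "count": 16}}
```

Each JSON object sits on its own line, and QuickSight flattens the nested attributes into fields such as start.id and start.count when the data is imported.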
While these are some of QuickSight's foundational concepts, there are many others to explore through the online documentation. Prior to visiting the QuickSight console, let's take a look at the manifest file that we'll use to import our data set. Here we have a manifest file that uses a URI prefix pointing to the S3 bucket we created in our CloudFormation template. With this S3 prefix, QuickSight can grab all of the data under all of the "directories" that S3 provides, and here we specify that every one of our files is in JSON format. Let's head to the QuickSight console and demo some of the functionality we have been talking about. While there, we will visualize the data we generated and stored in S3, and we will create a simple analysis and look into what business insights we might eventually derive. Let's head to the QuickSight console by typing in QuickSight and selecting the first menu option that appears. If you haven't set up QuickSight before, it will bring you to a dialog where you can register for QuickSight and provide all the necessary information. After the page comes up, we'll go to Manage data and click on New data set, where we'll select S3, give our data source a name, in this example ComplexEventProcessor, and then select Upload to upload the manifest.json file we talked about a little earlier. After that, we'll click Connect, and it is always worth previewing the data before importing it. As the preview comes up, we can see that QuickSight was able to process the data correctly. The data is shown in tabular form, and we can see all of the fields our application generated. After we save and visualize, we're brought to the analysis component of QuickSight, where we can select fields and visual types to visualize our data. Here we've selected a Count of Records by event ID, surfaced as start.id. We can also see that we have two different event IDs for a total of 28 records. We'll add another visual where we look at the total number of events that came through: clicking on start.count and then start.id, we can use a pie chart or a donut chart to visualize how much data has actually come through to our application. Although these are simple building blocks, QuickSight provides all of the foundational components we need to create complex, intuitive, and interactive dashboards and analyses against the data we have generated. One thing we could investigate, for example, is whether our event counts are increasing over time or simply holding to a static trend. With this in mind, there's so much more that we can do with AWS QuickSight, and I'll leave that for you to investigate on your own time.

QuickSight: Creating Dashboards
In this section, let's look at formalizing our analysis and publishing it as a dashboard. Coming back to the AWS QuickSight console, we'll click on All dashboards and see that there's nothing in there yet. Going back to the analysis we originally created, I've renamed it to Trends, and we're going to polish up some of the charts and diagrams we have to make them more official, because dashboards, by their very nature, are meant to be shared. Here we'll rename one of these charts to Event ID Record Count and the other to Event ID Sum of Records.
Now I'm sure there are plenty of other names you could give them that would be more meaningful. I have often seen that when we create and visualize data, it ends up being only for internal use, meaning it is valuable and meaningful mainly to the person who created it. Dashboards, by contrast, are meant to be shared and to communicate information, and AWS QuickSight provides a really great mechanism for doing that. Next we'll click on Share and give our dashboard a name, where we have the option to allow advanced filtering and CSV downloads. We are then given the option to share the dashboard with others; since this is just for demonstration purposes, the answer is going to be no. Once the dashboard is created, it looks visually stunning. It's a great experience: people have the ability to filter the data, and when they view the dashboard they can go back and see the dashboards they have recently viewed. It's a very good experience for promoting, publishing, and marketing the data you have visualized. In the end, data is extremely powerful, but it is most powerful when it is shared in a medium that can effect change, and that is why we use dashboards.

Conclusion
For this last section, let's take some time to review everything that has happened in this module, as well as the course overall. Throughout this module, we have talked about the importance of visualizing data, both for small and large groups of users. Visualizations take advantage of how the human mind natively works and provide a common framework for communication across different boundary lines, be it culture, country, or department. In line with this, AWS QuickSight makes scaling interaction with data possible through its robust toolset, providing support for up to thousands of users. Lastly, we demonstrated QuickSight's ability to create and publish dashboards, with options for sharing them with others and making the data available offline through CSV downloads. The importance of each of these tools should never be underestimated. Looking back, we have covered a significant amount of material throughout this course. We started off by automating deployments with CloudFormation for Globomantics' networking backbone, utilizing several different templates as necessary. Inside each of the developed CloudFormation templates, we introduced mechanisms for controlling and auditing costs, including parameters, among others. As our ComplexEventProcessor application generated data, we sought to secure that data through IAM, using both resource-based and identity-based policies, in addition to S3 and EBS encryption at rest. With each improvement we made to the application, we addressed gaps in observability by monitoring availability with CloudWatch. In the end, we created a robust framework for deploying and maintaining streaming big data applications. By using each of the tools we have talked about, you can rest assured that you are improving the big data practices within your organization. I hope you've enjoyed it, and thank you for joining me on this great adventure with Pluralsight's course, AWS Big Data in Production.
