ETL with Azure Cookbook PDF
Document Details
Uploaded by MotivatedComputerArt753
2020
Christian Cote, Matija Lah, Madina Saitakhmetova
Tags
Summary
This book provides practical recipes for building modern ETL (Extract, Transform, Load) solutions using Microsoft Azure. It covers topics like setting up Azure Data Studio, creating SQL Azure databases, and utilizing Azure Data Factory. Ideal for developers and data professionals.
Full Transcript
ETL with Azure Cookbook Practical recipes for building modern ETL solutions to load and transform data from any source Christian Cote Matija Lah Madina Saitakhmetova BIRMINGHAM—MUMBAI ETL with Azure Cookbook Copyright © 2020 Packt Publishing All rights reserved. No part of this book may be r...
ETL with Azure Cookbook Practical recipes for building modern ETL solutions to load and transform data from any source Christian Cote Matija Lah Madina Saitakhmetova BIRMINGHAM—MUMBAI ETL with Azure Cookbook Copyright © 2020 Packt Publishing All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews. Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book. Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information. Commissioning Editor: Sunith Shetty Acquisition Editor: Ali Abidi Senior Editor: David Sugarman Content Development Editor: Nathanya Dias Technical Editor: Manikandan Kurup Copy Editor: Safis Editing Project Coordinator: Aishwarya Mohan Proofreader: Safis Editing Indexer: Pratik Shirodkar Production Designer: Vijay Kamble First published: September 2020 Production reference: 1290920 Published by Packt Publishing Ltd. Livery Place 35 Livery Street Birmingham B3 2PB, UK. ISBN 978-1-80020-331-0 www.packt.com Packt.com Subscribe to our online digital library for full access to over 7,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website. Why subscribe? Spend less time learning and more time coding with practical eBooks and Videos from over 4,000 industry professionals Improve your learning with Skill Plans built especially for you Get a free eBook or video every month Fully searchable for easy access to vital information Copy and paste, print, and bookmark content Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at packt.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details. At www.packt.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks. Contributors About the authors Christian Cote is an IT professional with more than 15 years of experience working on data warehouse, big data, and business intelligence projects. Christian has developed expertise in data warehousing and data lakes over the years and has designed many ETL/BI processes using a range of tools on multiple platforms. He's presented at several conferences and code camps. He currently co-leads the SQL Server PASS chapter. He is also a Microsoft Data Platform Most Valuable Professional (MVP). Matija Lah has more than 18 years of experience working with Microsoft SQL Server, mostly from architecting data-centric solutions in the legal domain. His contributions to the SQL Server community have led to him being awarded the MVP Professional award (Data Platform) between 2007 and 2017/2018. He spends most of his time on projects involving advanced information management and natural language processing, but often finds time to speak at events related to Microsoft SQL Server where he loves to share his experience with the SQL Server platform. Madina Saitakhmetova is a developer specializing in BI. She has been in IT for 15 years, working with Microsoft SQL,.NET, Microsoft BI, Azure, and building BI solutions for medical, educational, and engineering companies. Her adventure with Microsoft BI began with Analysis Services and SSIS, and in later years she has been building her expertise in ETL/ELT, both on-premises and in the cloud. Finding patterns, automating processes, and making BI teams work more efficiently are challenges that drive her. During the past few years, BIML has become an important part of her work, increasing its efficiency and quality. About the reviewers Steph Martin is a senior customer engineer in the FastTrack for Azure team at Microsoft. She has a background in SQL Server database design and development and extensive knowledge of SSIS and ETL patterns and practices. Steph helps customers to achieve success with their projects, providing architecture design reviews, best practice guidance, and in-depth knowledge sharing across the platform, covering migration to Azure as well as cloud-first deployments. She has wide-ranging experience of Azure data services, specializing in Azure SQL Database and Managed Instance, and Azure Data Factory. Steph is an active member of the data platform community, running a PASS affiliated meetup in the UK and helping to co-ordinate the volunteer team at SQLBits. Aaditya Pokkunuri is an experienced Senior Database Engineer with a demonstrated history of working in information technology and services industry. He is skilled in performance tuning, SQL Database, requirements analysis, databases, and servers, SSIS, and SSRS. He possesses strong knowledge on Clustering, SQL server high-availability options, and ITIL of processes. Aaditya's expertise lies in Windows administration tasks, active directory, and Microsoft Azure technologies. He is a strong information technology professional with a Bachelor of Technology degree, specializing in computer science and engineering from SASTRA University, Tamil Nadu. Packt is searching for authors like you If you're interested in becoming an author for Packt, please visit authors. packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea. Table of Contents Preface 1 Getting Started with Azure and SSIS 2019 Technical requirements 14 Getting ready 38 Installing Microsoft SQL Server How to do it… 39 2019 Integration Services 14 How it works… 43 Getting ready 15 Installing Azure Data Studio 43 How to do it… 16 Getting ready 43 How it works… 31 How to do it… 44 Installing SQL Server How it works… 45 Management Studio 31 Creating an Azure subscription 45 Getting ready 32 Getting ready 46 How to do it… 32 How to do it… 46 How it works… 38 How it works… 49 Installing SQL Server Data Tools 38 There's more… 49 2 Introducing ETL Creating a SQL Azure database 52 Getting ready 60 Getting ready 52 How to do it… 60 How to do it… 52 How it works… 61 How it works… 60 Creating a simple ETL package 62 Connecting SQL Management Getting ready 62 Studio to Azure SQL 60 How to do it… 62 ii Table of Contents How it works… 72 Getting ready 72 How to do it… 72 Loading data before its How it works… 75 transformation72 3 Creating and Using SQL Server 2019 Big Data Clusters Technical requirements 78 How to do it… 96 Creating a SQL Server 2019 Big How it works… 107 Data Cluster 78 Loading data into a Big Data Getting ready 79 Cluster107 How to do it… 83 Getting ready 107 How it works… 90 How to do it… 108 Stopping and starting Azure How it works… 118 resources90 Extracting data from Getting ready 91 a Big Data Cluster 119 How to do it… 91 Getting ready 119 How it works… 94 How to do it… 119 Setting up a Big Data Cluster 94 How it works… 123 Getting ready 95 4 Azure Data Integration Setting up the Azure Feature Setting up an Azure storage Pack in SSIS 126 account129 Getting ready 126 Getting ready 130 How to do it… 126 How to do it… 130 How it works… 127 How it works… 132 Setting up a Java Runtime Creating an on-demand Azure Environment128 HDInsight cluster 132 Getting ready 128 Getting ready 133 How to do it… 128 How to do it… 133 How it works… 129 How it works… 137 Table of Contents iii Transforming data with Hive 137 Transforming data with Pig 154 Getting ready 137 Getting ready 154 How to do it… 138 How to do it… 154 How it works… 153 How it works… 175 5 Extending SSIS with Custom Tasks and Transformations Technical requirements 178 There's more… 200 Designing a Custom Control Deploying SSIS customizations 201 Flow Task 178 Getting ready 201 Getting ready 180 How to do it… 201 How to do it… 180 How it works… 210 How it works… 189 Upgrading a Custom Data Flow Designing a Custom Data Flow Component211 Component190 Getting ready 211 Getting ready 191 How to do it… 212 How to do it… 191 How it works… 217 How it works… 198 6 Azure Data Factory Data factory creation 219 Getting ready 238 Getting ready 221 How to do it… 238 How to do it… 221 How it works… 262 How it works… 224 Triggering and monitoring our Copying data from the internet 224 pipeline263 Getting ready 224 Getting ready 263 How to do it… 224 How to do it… 263 How it works… 237 How it works… 276 Moving and transforming data 237 iv Table of Contents 7 Azure Databricks Creating a Databricks Getting ready 285 workspace278 How to do it… 286 Getting ready 278 How it works… 293 How to do it… 279 Using Delta Lake 293 How it works… 282 Getting ready 294 Creating a cluster in our How to do it… 294 workspace282 How it works… 300 Getting ready 283 There's more… 301 How to do it… 283 Writing in Azure SQL Server 301 How it works… 285 Getting ready 301 Connecting to an internet data How to do it… 302 source 285 How it works… 307 8 SSIS Migration Strategies Running SSIS packages in ADF 310 ADF325 Getting ready 310 Getting ready 325 How to do it… 310 How to do it… 326 How it works… 324 How it works… 342 Rewriting an SSIS package using 9 Profiling data in Azure Technical requirements 344 Getting ready 350 Using SQL in Spark 344 How to do it… 351 How it works… 353 Getting ready 344 How to do it… 345 Using pandas profiling 354 How it works… 350 Getting ready 354 Using Spark built-in functions 350 How to do it… 354 How it works… 362 Table of Contents v 10 Manage SSIS and Azure Data Factory with Biml Technical requirements 366 Biml388 Installing prerequisites 366 Getting ready 388 How to do it… 367 How to do it… 388 How it works… 368 How it works… 390 There's more… 391 Generating T-SQL select statements for all tables 369 Generating a mass change Getting ready 369 to stored procedures 391 How to do it… 369 Getting ready 392 How it works… 371 How to do it… 393 There's more… 372 How it works… 397 There's more… 399 Generating T-SQL to drop and create all indexes 372 Generating an SSIS package Getting ready 373 containing a Data Flow Task 400 How to do it… 373 Getting ready 400 How it works… 374 How to do it… 400 How it works… 403 Generating a basic SSIS package375 Getting ready 375 Generating your first Azure How to do it… 376 Data Factory 404 How it works… 379 Getting ready 404 There's more… 379 How to do it… 406 How it works… 414 Generating an SSIS package There's more… 415 with Execute SQL Tasks 380 Getting ready 381 Loading multiple files using How to do it… 381 Data Factory 415 How it works… 386 Getting ready 415 There's more… 387 How to do it… 416 How it works… 428 Using Convert SSIS package to See also 429 Other Books You May Enjoy Index Preface Extract, Transform, Load (ETL) is one of the most common and tedious procedures for moving and processing data from one database to another. With the help of this book, you will be able to speed up the process by designing efficient ETL solutions using the Azure services available for handling and transforming any data to suit your requirements. With this cookbook, you will become well versed in all the features of SQL Server Integration Services (SSIS) to perform data migration and ETL tasks that integrate with Azure. You will learn how to transform data in Azure and understand how legacy systems perform ETL on-premises using SSIS. Later chapters will get you up to speed with connecting and retrieving data from SQL Server 2019 Big Data Clusters, and even show you how to extend and customize the SSIS toolbox using custom-developed tasks and transformations. This ETL book also contains practical recipes for moving and transforming data with Azure services, such as Data Factory and Azure Databricks, and lets you explore various options for migrating SSIS packages to Azure. Toward the end, you will find out how to profile data in the cloud and automate package development with Business Intelligence Markup Language (BIML). By the end of this book, you will have developed the skills you need to create and automate ETL solutions on-premises as well as in Azure. Who this book is for This book is for data warehouse architects, ETL developers, or anyone else who wants to build scalable ETL applications in Azure. Those looking to extend their existing on-premises ETL applications to use big data and a variety of Azure services, or others interested in migrating existing on-premises solutions to the Azure cloud platform, will also find the book useful. Familiarity with SQL Server services is necessary to get the most out of this book. viii Preface What this book covers Chapter 1, Getting Started with Azure and SSIS 2019, describes, in step-by-step fashion, how to set up SQL Server 2019 to deploy the features that are used throughout the book. It also covers how to set up an Azure subscription. Chapter 2, Introducing ETL, explains why ETL is needed, where it comes from, and the differences between ETL and ELT. Chapter 3, Creating and Using SQL Server 2019 Big Data Clusters, discusses how to create, deploy, connect to, and retrieve data from SQL Server 2019 Big Data Clusters. Chapter 4, Azure Data Integration, talks about the Azure Feature Pack, which allows SSIS to integrate Azure data from blob storage and HDInsight clusters. Chapter 5, Extending SSIS with Custom Tasks and Transformations, talks about extending and customizing the built-in toolset with custom-developed control flow tasks and data flow transformations. Chapter 6, Azure Data Factory, talks about using Azure Data Factory to move and transform data in the cloud. Chapter 7, Azure Databricks, talks about using Azure Databricks to transform data in the cloud. Chapter 8, SSIS cloud Migration Strategies, talks about scaling out SSIS packages to Azure. Chapter 9, Profiling data in Azure, contains various recipes that show you how to perform data profiling in Azure Databricks. Chapter 10, Manage SSIS and Azure Data Factory with BIML, contains various recipes that demonstrate how to manage SSIS and Azure Data Factory with BIML using metadata. To get the most out of this book Basic or working knowledge of SQL Server, SSIS, and Azure is needed, as is basic knowledge of ETL techniques and Azure SQL Server databases. You also need a personal computer running Microsoft Windows 10. Preface ix Some of the recipes in this cookbook contain partial, incomplete solutions that you will complete by following the instructions in the recipe. If, for some reason, you get stuck and are unable to complete the tasks successfully, the complete copies of the solutions are available in the Solution folder of the corresponding chapter. Please, use these files as a reference when working on the recipes; they are there to assist you, not to prevent you from learning by performing the work yourself. If you are using the digital version of this book, we advise you to type the code yourself or access the code via the GitHub repository (link available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code. Download the example code files You can download the example code files for this book from your account at www. packt.com. If you purchased this book elsewhere, you can visit www.packtpub.com/ support and register to have the files emailed directly to you. You can download the code files by following these steps: 1. Log in or register at www.packt.com. 2. Select the Support tab. 3. Click on Code Downloads. 4. Enter the name of the book in the Search box and follow the onscreen instructions. Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of: 1. WinRAR/7-Zip for Windows 2. Zipeg/iZip/UnRarX for Mac 3. 7-Zip/PeaZip for Linux x Preface The code bundle for the book is also hosted on GitHub at https://github.com/ PacktPublishing/ETL-with-Azure-Cookbook In case there's an update to the code, it will be updated on the existing GitHub repository. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out! Download the color images We also provide a PDF file that has color images of the screenshots/diagrams used in this book. You can download it here: https://static.packt-cdn.com/ downloads/9781800203310_ColorImages.pdf. Conventions used There are a number of text conventions used throughout this book. Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "For the location, use hivesource as the folder path and sales. dat as the filename." A block of code is set as follows: INSERT OVERWRITE TABLE ETLInAzure.SalesAgg SELECT storename , zipcode , SUM(unitcost) AS unitcost, AVG(unitprice) AS unitprice, SUM(salesamount) AS salesamount, SUM(salesquantity) AS salesquantity, CalendarMonth FROM etlinazure.salesource GROUP BY storename, zipcode, CalendarMonth; Any command-line input or output is written as follows: USE master; GO CREATE DATABASE Staging; GO Preface xi Bold: Indicates a new term, an important word, or words that you see onscreen. For example, words in menus or dialog boxes appear in the text like this. Here is an example: "Click on the list beside the Enumerator property and select Foreach Azure Blob Enumerator." Tips or important notes Appear like this. Sections In this book, you will find several headings that appear frequently (Getting ready, How to do it..., How it works..., There's more..., and See also). To give clear instructions on how to complete a recipe, use these sections as follows: Getting ready This section tells you what to expect in the recipe and describes how to set up any software or any preliminary settings required for the recipe. How to do it… This section contains the steps required to follow the recipe. How it works… This section usually consists of a detailed explanation of what happened in the previous section. There's more… This section consists of additional information about the recipe in order to make you more knowledgeable about the recipe. See also This section provides helpful links to other useful information for the recipe. xii Preface Get in touch Feedback from our readers is always welcome. General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected]. Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata, selecting your book, clicking on the Errata Submission Form link, and entering the details. Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material. If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors. packtpub.com. Reviews Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you! For more information about Packt, please visit packt.com. 1 Getting Started with Azure and SSIS 2019 In this cookbook, we are going to cover Extract, Transform, Load (ETL) development with Microsoft Azure. We will start with Microsoft SQL Server 2019 Integration Services (SSIS) and then gradually move on to data management capabilities in Azure. Throughout this cookbook we will use hands-on examples, which will not only provide you with genuine first-hand experience in SSIS development but also demonstrate how to design a working ETL solution or a module that could be built into an ETL solution. Let's start by establishing your development environment. We are going to guide you through the installation of a new SQL Server instance, including all the components needed for on-premises as well as cloud-based SSIS solution development. You are also going to install the tools used in administering and maintaining SQL Server instances – either on-premises or in the cloud. You are also going to install the tools used in developing, testing, and deploying your SSIS solutions. You will finish this chapter by creating an Azure subscription, which will allow your on-premises development environment to extend into the cloud. This chapter covers the following recipes that will help you get started with SSIS development on-premises as well as in the cloud: Installing Microsoft SQL Server 2019 Integration Services Installing SQL Server Management Studio 14 Getting Started with Azure and SSIS 2019 Installing SQL Server Data Tools Installing Azure Data Studio Creating an Azure subscription Technical requirements In order to install the necessary components, you will need a workstation – either a physical personal computer or a virtual one – running a compatible Microsoft Windows operating system. SQL Server 2019 can be installed on any edition of the Microsoft Windows Server operating system, starting with Windows Server 2016, or even on its desktop edition, Windows 10, starting with version TH1, build 1507. Depending on the features installed, SQL Server is going to require between 6 and roughly 8 GB of free space on the hard drive, and at least 1 GB of available system memory, though around 4 GB is the recommended minimum. Practically any 64-bit CPU is supported. Important note You can find more detailed information about SQL Server 2019 hardware and software requirements in the online vendor documentation (SQL docs) article entitled SQL Server 2019: Hardware and software requirements, at https:// docs.microsoft.com/en-us/sql/sql-server/install/ hardware-and-software-requirements-for-installing- sql-server-ver15. Speaking of online documentation, your workstation is going to need internet access – even to perform the installations. Installing Microsoft SQL Server 2019 Integration Services You are going to install a new instance of SQL Server on your workstation; if you prefer, you can also upgrade an existing SQL Server instance, as noted in the recipe. The installation will include all the features necessary to design SSIS solutions generally, not just to perform the work presented in this book. Installing Microsoft SQL Server 2019 Integration Services 15 The Client Tools Software Development Kit (SDK), an otherwise optional SQL Server component, must be installed on the workstation used in SSIS development. It contains the.NET managed assemblies with design-time access to the SSIS runtime, without which SSIS development simply cannot take place. The installation of the SQL Client Connectivity SDK, another optional component, is recommended, though not necessary. It contains SQL Server Native Client connectivity resources that you might need in database application development. Important note None of the SDKs mentioned would ever need to be installed on a server used for hosting data, as such a server would normally not be used for SSIS development. Getting ready You need access to the SQL Server 2019 installation media. Since this is going to be a typical development workstation installation, you can use the free, specialized SQL Server 2019 Developer Edition, available for download at https://www.microsoft. com/en-us/sql-server/sql-server-downloads. Alternatively, you can also use a free trial edition of SQL Server 2019 Standard, or the Data Center editions, but please note that their use is limited to 180 days. Important note The Developer edition of SQL Server 2019 is a full-featured free edition, licensed for use as a development and test database in a non-production environment. Either download the installation media from the website or use the installation media provided by the administrator in your organization. If you decide to use the Developer edition of SQL Server 2019, download the executable from the website, and save it to the local drive on the workstation that you will be using for the recipes in this book. If you prefer to use the Evaluation edition of SQL Server 2019, you will have to sign up on the website by providing some of your personal information before being able to access the installation media. 16 Getting Started with Azure and SSIS 2019 How to do it… Start your workstation and log in with an account with administrative operating system privileges. Make sure the workstation has access to the SQL Server 2019 installation media – for instance, make sure that the Developer edition executable file is available on the local drive: 1. Locate the installation media, and double-click the installation executable (for instance, SQL2019-SSEI-Dev.exe for SQL Server 2019 Developer Edition) to start the installation. When prompted by the operating system, which will ask whether you want to allow the application to make changes to your device, as shown in the following screenshot, click Yes to continue: Figure 1.1 – The SQL Server installation User Account Control dialog 2. At the beginning of the installation, you are asked whether you would like to use the basic settings, customize them, or simply download the rest of the installation media. Tip Generally, you will only download the installation media if you plan to install SQL Server on additional devices. On the other hand, a very good reason for having the media available locally would be in case you need to repair the installation later. Select Custom to continue. Installing Microsoft SQL Server 2019 Integration Services 17 3. You now have to select the location of the installation files. Use the Default folder unless you prefer to use an alternative location. When ready, click Install to continue. Depending on your internet connection, it should take just a few minutes to download and extract the files. 4. After the installation files have been prepared, the SQL Server Installation Center window will open, as shown in the following screenshot: Figure 1.2 – SQL Server Installation Center On the Installation page, select the New SQL Server stand-alone installation or add features to an existing installation option to start the setup wizard. 18 Getting Started with Azure and SSIS 2019 5. On the Product Key page, either specify your free edition (Developer or Evaluation) or provide the product key of a licensed edition that you are allowed to install. Important note Do not install the Express edition of SQL Server 2019, as it does not include SSIS, nor does it come with many of the other features that you will need for the recipes in this cookbook. 6. Specify the appropriate edition, as shown in the following screenshot: Figure 1.3 – Specifying the SQL Server edition Click Next to continue. 7. On the License Terms page, read and accept the license terms, and then click Next to continue. Installing Microsoft SQL Server 2019 Integration Services 19 8. On the Microsoft Update page, shown in the following screenshot, check Use Microsoft Update to check for updates to include the latest updates for the product, unless you prefer to install them later: Figure 1.4 – Adding available updates to the installation Click Next to continue. The setup program should now perform a few checks of your system to verify whether the installation can proceed. 9. On the Install Rules page, you should see the system validation results. If there are errors, click on the link in the Status column to access each error message. Depending on the error, the installation might not proceed until you correct the cause or might even have to be aborted and restarted after the problem has been resolved. 20 Getting Started with Azure and SSIS 2019 If there are warnings, the installation should allow you to continue; however, you should inspect the warning messages anyway, as shown in the following screenshot, as additional activities might be needed during or after the installation in order for the SQL Server instance or one or more of the shared features to work as expected: Figure 1.5 – Install Rules Tip If SQL Server has not been installed on this workstation before or has not yet been configured for external access, you might see the Windows Firewall warning, as shown in the screenshot. In this particular case, the warning points to information on configuring the Windows firewall in order to allow external access to the SQL Server instance. You do not have to make any changes to your workstation's firewall at this time. When ready, click Next to continue. If the installation detects an existing SQL Server instance, an additional step is added, shown in the following screenshot, asking you to either create a new SQL Server instance or add features to an existing one: Installing Microsoft SQL Server 2019 Integration Services 21 Figure 1.6 – Installation Type This cookbook assumes that you will be using a workstation where SQL Server has not been installed before, but you can also use an existing SQL Server instance if you prefer. 10. On the Feature Selection page, select the following features to be installed: a) Database Engine Services b) PolyBase Query Service for External Data c) Java connector for HDFS data sources d) Client Tools Connectivity e) Integration Services f) Client Tools SDK g) SQL Client Connectivity SDK 22 Getting Started with Azure and SSIS 2019 You can see part of the selection in the following screenshot: Figure 1.7 – SQL Server 2019 Feature Selection When ready, click Next to continue. 11. If you are installing this SQL Server instance on a workstation where SQL Server has not been installed before, you can create either a new default instance or a named instance. Otherwise, your options will depend on what parts of SQL Server are already installed on the system: Installing Microsoft SQL Server 2019 Integration Services 23 Figure 1.8 – SQL Server 2019 Instance Configuration If possible, use the default instance, as the preceding screenshot shows; however, if you decide on using a named instance instead, we recommend that you use MSSQL2019 as its name. Throughout this book, we will refer to this SQL Server instance either as localhost (the default instance) or MSSQL2019 – in both cases, this will mean the same SQL Server instance. When ready, click Next to continue. 24 Getting Started with Azure and SSIS 2019 12. On the PolyBase Configuration page, leave the default settings unchanged, as shown in the following screenshot, and click Next to continue: Figure 1.9 – PolyBase Configuration 13. On the Java Install Location page, leave the default option selected to install the version of the Java runtime environment included in the installation, as shown in the following screenshot, unless you are installing to an environment where a different version is required by your administrators: Figure 1.10 – Java Install Location Installing Microsoft SQL Server 2019 Integration Services 25 If you have to install a different version, select the second option, and use Browse to locate the installation files. When ready, click Next to continue. 14. On the Server Configuration page, on the Service Accounts tab, change the Startup Type setting of the SQL Server Agent feature to Automatic. Leave all other settings on this tab unchanged, as displayed in the screenshot that follows: Figure 1.11 – SQL Server service accounts 15. On the Collation tab of the same page, you can leave the default collation for the database engine; however, it is recommended to use a Windows collation instead. To change the collation, click Customize next to the Database Engine Collation setting, and then in the Customize the SQL Server 2019 Database Engine Collation dialog, select the Windows collation designator and sort order option with the following settings: a) Collation designator: Latin1_General_100 b) Accent-sensitive: Checked c) Char/Varchar Storage Options: UTF-8 26 Getting Started with Azure and SSIS 2019 The recommended settings are shown in the following screenshot: Figure 1.12 – Setting the default collation for the SQL Server instance Click OK to confirm the settings. Review the settings on the Collation tab; the full name of the collation should be Latin1_General_100_CI_AS_SC_UTF8, as shown in the following screenshot: Figure 1.13 – The default SQL Server collation Installing Microsoft SQL Server 2019 Integration Services 27 When ready, click Next to continue. 16. On the Database Engine Configuration page, on the Server Configuration tab, shown in the following screenshot, leave Windows authentication mode selected for Authentication Mode: Figure 1.14 – Database Engine Configuration Add the account you are currently using to the SQL Server administrators role by clicking the Add Current User button. After a few moments, your username should appear in the SQL Server administrators list box. You do not have to make any changes to the rest of the settings on this page unless you want to change the location of the database files, which you can do on the Data Directories tab. 28 Getting Started with Azure and SSIS 2019 Important note For the purposes of this cookbook, you can use the default locations on the C: drive for the database files; however, in a real-life environment – even if solely for testing purposes – it is recommended to host the database files on a different drive, not the one used for the operating system and other installed features files. When ready, click Next to continue. 17. On the Ready to Install page, review the installation options listed in Steps 5 through 16. When you are sure that everything has been configured in accordance with the instructions in this recipe, click Install to start the automated part of the installation. On modern hardware, the installation should complete within a few minutes. 18. Once the installation has completed, inspect the installation status of each installed feature. If any errors are encountered during the installation, they will be marked with a Failed installation status. When you select each feature reported as failed, additional information about the failure should be displayed in the Details section: Figure 1.15 – The final step of the SQL Server 2019 installation Installing Microsoft SQL Server 2019 Integration Services 29 Inspect any error messages and address the cause of each problem based on the information provided. You can repeat the installation after you have resolved the problems. Click Close to complete the installation. One more task awaits you, and then the installation can be considered finished (for the time being, at least). In order to allow the PolyBase services access to the SQL Server instance, you must allow connections to the instance through the TCP/IP protocol. 19. In the Windows Start menu, in the Microsoft SQL Server 2019 folder, locate SQL Server 2019 Configuration Manager and start it: Figure 1.16 – SQL Server Configuration Manager In Configuration Manager, shown in the preceding screenshot, under SQL Server Services, you should see that the SQL Server instance that you just installed is running; however, both SQL Server PolyBase services will either be Stopped or in the Change pending… state. The reason for this is that the services are unable to connect to the SQL Server instance. They use TCP/IP to connect, and on a newly installed SQL Server instance, this protocol is not enabled. 30 Getting Started with Azure and SSIS 2019 20. Under SQL Server Network Configuration | Protocols for MSSQLSERVER, shown in the following screenshot, right-click the disabled TCP/IP protocol, and select Enable from the shortcut menu: Figure 1.17 – Enabling the TCP/IP protocol A warning is displayed, telling you that in order for the changes to take effect, the affected service must be restarted. Click OK to close the warning. 21. Back under SQL Server Services, right-click the SQL Server (MSSQLSERVER) service, and select Restart from the shortcut menu: Figure 1.18 – Restarting the SQL Server instance Installing SQL Server Management Studio 31 22. Wait for the service to restart. When prompted to stop or restart any other dependent service, confirm those as well. After the SQL Server instance has restarted, the PolyBase services should start as well. However, if they stay in the Change Pending… state for an extended period of time, try starting them manually. If even that fails, you might have to restart the workstation. Close SQL Server 2019 Configuration Manager. How it works… You have just successfully installed a new SQL Server instance. You will use it to complete the rest of the recipes in this cookbook; of course, it can also be used for other development work involving the SQL Server platform. If you later determine that you need additional features, you can add them by starting the installation again from SQL Server Installation Center. Important note If you have installed the Developer edition of the product, please remember that it has not been licensed for production use. You are only allowed to use it for application development and testing. If you have installed an Evaluation edition of the product, remember that the license expires 180 days after the installation. Installing SQL Server Management Studio SQL Server Management Studio (or SSMS, for short) is the principal tool used in SQL Server administration, maintenance, and development. It can be used to administer relational databases hosted on the Database Engine, or analytical databases and solutions hosted in Analysis Services. It can be used to administer the SQL Server Integration Services and Database (SSISDB) catalog, used for SSIS project deployments. You can even use it to administer the legacy SSIS service, used for legacy SSIS package deployments. Important note While most SSIS development activities will take place in SQL Server Data Tools (SSDT), you should still install SSMS on the workstation. Its programming capabilities complement SSDT, whereas its administrative and maintenance capabilities far exceed those available in SSDT. 32 Getting Started with Azure and SSIS 2019 Of course, SSMS supports numerous programming languages and data management standards typically associated with data management: Transact-SQL (T-SQL, for short) to manipulate data in relational databases, Multidimensional Expressions (MDX) and Data Mining Extensions (DMX) for data analysis and knowledge extraction, and XML for Analysis (XMLA) to administer SQL Server's Analysis Services (SSAS) databases. You can even use SSMS to create and edit XML documents. Getting ready For several years now, SSMS has been delivered as a standalone tool, no longer closely associated with a particular version of SQL Server. This decision allows Microsoft to develop the tool in its own development cycle, pretty much completely independently of the development cycle they use for the SQL Server platform. It also grants the development team behind the tool more flexibility in terms of the features being deployed or upgraded in each version. SSMS is, therefore, no longer included as part of the SQL Server installation, as it used to be in the past, and must be installed separately. It is available for download at https://docs.microsoft.com/en-us/sql/ssms/download-sql-server- management-studio-ssms. The download site can also be accessed from SQL Server Installation Center. Tip It is recommended to always use the latest version of SSMS, as it typically contains the latest security, stability, and user-experience fixes. The latest features and capabilities are, naturally, also only available in the most recent version of SSMS. In fact, this is true for every tool that you will be installing in this and later chapters. How to do it… Use your favorite internet browser to navigate to the SSMS download site listed previously, or start SQL Server Installation Center – it is located in the Start menu, in the Microsoft SQL Server 2019 folder – and then, on the Installation page, select the Install SQL Server Management Tools option. Installing SQL Server Management Studio 33 You will start this recipe by installing SSMS and complete it by downloading a sample database from GitHub and deploying it to the SQL Server instance you installed in the previous recipe, Installing Microsoft SQL Server 2019 Integration Services: 1. In the Microsoft docs article entitled Download SQL Server Management Studio (SSMS), inspect the latest information about the download. Pay close attention to any warnings or considerations that the vendor might have provided. When ready, click the Download SQL Server Management Studio (SSMS) HTML link to initiate the download. 2. If prompted to save the executable to your drive, click Save, wait for the download to complete, and then click Run to start the installation. If the download is performed in the background, use Windows Explorer to locate the file in your downloads folder, and start the execution from there – for instance, by executing the SSMS-Setup-ENU.exe file. 3. On the first screen of the installation dialog, leave the default location unchanged, and click Install to continue. 4. The installation should complete within a few minutes. If the installation is interrupted by an error, investigate the cause based on the information provided in the installation dialog. Otherwise, click Close to finish the installation. 5. In the Windows Start menu, locate the Microsoft SQL Server Tools 18 folder, and in it the Microsoft SQL Server Management Studio 18 shortcut. Open it. 6. In the Connect to Server dialog, select the SQL Server instance you created in the previous recipe, Installing Microsoft SQL Server 2019 Integration Services. The instance should now be displayed in the SSMS Object Explorer. 7. Open SQL Server Installation Center, and on the Resources page, locate the SQL Samples Web Site link. Follow the link to navigate to the Microsoft SQL Server Samples site. In your internet browser, scroll down to the SQL Server Samples Repository section, and click on the link to the Wide World Importers sample database web page. 8. On the new page, locate the WideWorldImporters-Full.bak file and click the link to start the download. Depending on your internet connection, the download should complete within a few minutes. 34 Getting Started with Azure and SSIS 2019 9. In SSMS, in Object Explorer, right-click the Databases node, and then select Restore database… from the shortcut menu, as shown in the following screenshot: Figure 1.19 – Restoring a SQL Server database 10. In the Restore Database dialog, on the General page, select Device as the source, and open the Select backup devices dialog by clicking the ellipsis icon to the right of the Device text box, as shown in the following screenshot: Installing SQL Server Management Studio 35 Figure 1.20 – Selecting the backup device 11. In the Select backup devices dialog, click Add to open the Locate Backup File dialog, and then navigate the filesystem on the left to locate the file you downloaded in Step 8. Select the file and click OK to confirm the selection. 12. In the Select backup devices dialog, click OK to confirm the selection. 13. In the Restore Database dialog, click OK again to start the restore operation. It should take about a minute to restore the sample database. Once the message box opens with the Database 'WideWorldImporters' restored successfully message, the database should be available on the selected instance. If there are any errors, inspect the error messages, make the necessary corrections, and repeat the process accordingly. 36 Getting Started with Azure and SSIS 2019 14. After the database has been restored successfully, click OK to close the message box, which also closes the Restore Database dialog. Now that you have your first tool installed, you can change two more settings that you might need later. 15. In SSMS, open a new query window, connect to the master database, and then use the following T-SQL command to inspect the Hadoop connectivity instance settings: EXEC sp_configure @configname = 'hadoop connectivity'; In the execution result, observe the run_value column; if its value is 0, change it to 7 by executing the following T-SQL commands: EXEC sp_configure @configname = 'hadoop connectivity', @configvalue = 7; reconfigure; By turning this configuration setting on, you allow your SQL Server instance to connect to remote Hadoop or Azure Blob storage locations. By using the configuration value of 7, you allow connections to all versions and both editions of Hortonworks, as well as to the Azure Blob storage. Tip More information about this setting is available in the vendor documentation, at https://docs.microsoft.com/en-us/sql/database- engine/configure-windows/polybase-connectivity- configuration-transact-sql. You should be familiar with the setting and its possible values before using the feature in a production environment. 16. Use the following T-SQL command to check whether the PolyBase service is enabled on your SQL Server instance: EXEC sp_configure @configname = 'polybase enabled'; Installing SQL Server Management Studio 37 If the value of the run_value column is 0, that means that PolyBase is not enabled; enable it by changing the configuration value to 1 by executing the following T-SQL commands: EXEC sp_configure @configname = 'polybase enabled', @configvalue = 1; RECONFIGURE; Tip More information about this setting, and the PolyBase services in general, is available in the vendor documentation, at https://docs.microsoft. com/en-us/sql/relational-databases/polybase/ polybase-installation. You should be familiar with the setting and its possible states before using the feature in a production environment. 17. If all the commands complete successfully, close SSMS. This time, you do not have to save the script or the solution. 18. Open SQL Server 2019 Configuration Manager, and restart the MSSQLSERVER instance, as you did in Step 20 of the previous recipe, Installing Microsoft SQL Server 2019 Integration Services. Important note The SQL Server instance must be restarted in order for these two configuration settings to take effect, as they affect the dependent PolyBase services. Also, you cannot restart a SQL Server instance running PolyBase from within SSMS when Object Explorer is connected to the service and prevents the service from being stopped. After the instance is restarted, you can close Configuration Manager. 38 Getting Started with Azure and SSIS 2019 How it works… You have installed the principal development and administration tool for SQL Server. You will be using SSMS throughout this cookbook. Of course, if you have worked with SQL Server before, you should already be familiar with SSMS and its capabilities. If this is the first time you have used SSMS, do not worry – every feature used in the recipes in this book will be explained when needed. You finished the recipe by deploying a sample database to the newly installed SQL Server instance. You will use this database in other recipes of this cookbook, and you are free to use it whenever you are exploring SQL Server and its capabilities. Installing SQL Server Data Tools SSDT is a specialized edition of the Microsoft Visual Studio integrated development environment (IDE). It can be used to develop a variety of projects related to SQL Server, such as relational and Azure databases, Analysis Services data models, Reporting Services reports, or Integration Services projects, to name just a few. In the past, SSDT used to be a standalone tool, requiring only a subset of Visual Studio features to be installed. However, starting with Visual Studio 2019 (or SQL Server 2019), SSDT represents an extension (or better: a set of extensions) of the principal Visual Studio IDE. To use the 2019 version of SSDT, you first need to install Visual Studio 2019, and then add the SSDT extensions, followed by the extensions needed to develop Analysis Services, Integration Services, and/or Reporting Services projects. This does increase the complexity of the installation procedure, but it also more closely follows the standard approach of adding features to the core Visual Studio installation. Getting ready If you do not have a valid Visual Studio 2019 Professional or Enterprise license at your disposal, you can also use the free Community edition of Visual Studio 2019. All editions are available to download at https://docs.microsoft.com/en-us/sql/ssdt/ download-sql-server-data-tools-ssdt. This location can also be accessed from SQL Server Installation Center. Installing SQL Server Data Tools 39 How to do it… Use your favorite internet browser to navigate to the Visual Studio 2019 download site listed previously, or start SQL Server Installation Center, and then on the Installation page, select the Install SQL Server Data Tools option. This recipe assumes that Visual Studio 2019 has not yet been installed on the workstation, and the instructions will show you how to install Visual Studio 2019 Community Edition: 1. On the Download SQL Server Data Tools (SSDT) for Visual Studio web page, in the Install SSDT with Visual Studio 2019 section, locate the link to Visual Studio 2019 Community, and follow that link to the Visual Studio 2019 Downloads page. Click FREE DOWNLOAD in the Community column to start the download operation and save the file to the local drive on your workstation. 2. When prompted, click Run to start the installation; otherwise, use Windows Explorer to navigate to your folder, and start the installation by opening the vs_Community.exe file. 3. The installation begins with the setup of a few prerequisites. You can take this moment to inspect the Microsoft Privacy Statement and the Microsoft Software License Terms by following the links provided in the message box: Figure 1.21 – Visual Studio Installer When ready, click Continue. The installation now downloads and extracts the files, after which the Installing — Visual Studio 2019 Community dialog should open. 40 Getting Started with Azure and SSIS 2019 4. On the Workloads page, in the Desktop & Mobile section, check.NET desktop development, and in the Other Toolsets section, check Data storage and processing: Figure 1.22 – The Visual Studio 2019 installation details Check the list of items in the right column – the same items should be listed as the ones shown in the screenshot. 5. Leave the Install while downloading option selected in the combo box next to the Install button. When ready, click Install to continue. The installation should take a few minutes, depending on your internet connection, after which the Visual Studio will greet you with the Sign in screen. Installing SQL Server Data Tools 41 6. Click Not now, maybe later to skip the sign-in process for the time being. 7. In the next dialog, select your development settings and your color theme, and click Start Visual Studio. Of course, you can also change these settings in SSDT later. 8. After a few minutes, Visual Studio 2019 should open, with the Open recent/Get Started dialog. At this time, you only need to access the Visual Studio 2019 IDE to install additional SQL Server extensions. Click Continue without code. 9. Visual Studio opens without any solutions or projects loaded. In the Extensions menu, select the Manage Extensions command. 10. In the Manage Extensions dialog, on the Online | Visual Studio Marketplace page, use the search text box located at the top right to search for integration services, as shown in the following screenshot: Figure 1.23 – The Visual Studio 2019 extensions 42 Getting Started with Azure and SSIS 2019 In the search result list, locate SQL Server Integration Services Projects and click Download. 11. An internet browser window should open, prompting you to save the file to your local drive. Save the file to the local drive on your workstation. 12. If prompted by the browser to run the file, click Run to start the installation; otherwise, use Windows Explorer to navigate to your downloads folder, and open the Microsoft.DataTools.IntegrationServices.exe file to start the installation. 13. In the first dialog, select the language to be used during the installation. You can use any of the supported languages; however, note that these instructions assume that you selected the English language. When ready, click OK to continue. 14. In the next dialog, simply click Next to continue. 15. In the next dialog, make sure that your edition of Visual Studio 2019 is selected as the installation target, but do not click Install yet. 16. Switch to Visual Studio 2019, close the Manage Extensions dialog, and then close Visual Studio as well. Important note During the installation of a Visual Studio extension, changes might have to be made to Visual Studio files, which cannot be done if these files are open. Therefore, this is the perfect moment during the installation to close Visual Studio. If files are open, an error message might be displayed, prompting you to close them before being able to continue. 17. Switch back to the Visual Studio extensions installer, and now do click Install to continue. 18. If prompted by the operating system whether to allow SQL Server Integration Services Projects to make changes to your device, click Yes to continue. The installation should complete within a few minutes. 19. In the final dialog, click Close to finish the installation. Close the internet browser window that was used to download the SQL Server Integration Services Projects installer, and Visual Studio Installer, if they are still open. Installing Azure Data Studio 43 How it works… You have installed Visual Studio 2019 with the features needed to complete the rest of the recipes in this book. These recipes will cover typical SSIS development, but also control flow and data flow customization. You can use the same setup to complete many other development tasks, and if needed, you can install any additional features at any time, by running Visual Studio Installer – available from the Windows Start menu. Important note In order to continue using the Community edition of Visual Studio 2019 beyond the 30-day evaluation period, you will have to sign in with a Microsoft Live account. You can use the same account that you will be using later in this chapter to create the Azure subscription. A few days before the evaluation period expires, if you try to open SSDT, you will be prompted to sign in, but you can also sign in at any time. Installing Azure Data Studio Azure Data Studio (ADS) represents a new database administration, maintenance, and development tool to be used with Microsoft data platforms on-premises or in the cloud. In contrast to SSMS, not only can ADS be installed on Windows, but also on macOS or Linux. ADS offers a new approach to development with a visually and functionally rich T-SQL editor, IntelliSense, and code snippets. It visually integrates execution results with the T-SQL queries that were executed to generate them and allows you to export query results in several standard formats. The aim of ADS is not to completely replace SSMS as the principal data administration, maintenance, and development tool, but rather to complement it, especially as a cross- platform tool. If you are new to ADS, or need help in determining how to balance your work between ADS and SSMS, we recommend the What is Azure Data Studio? introductory article, available online at https://docs.microsoft.com/en-us/sql/azure-data- studio/what-is. Getting ready The ADS installation is available online, at https://docs.microsoft.com/en-us/ sql/azure-data-studio/download-azure-data-studio. 44 Getting Started with Azure and SSIS 2019 How to do it… Use your favorite internet browser to navigate to the ADS download site listed previously: 1. On the Download and install Azure Data Studio web page, locate the recommended User Installer link, and follow it to start the download. When prompted, save the file to the local drive on your workstation. Depending on your internet connection, the download should complete within a minute. 2. When prompted by the browser, click Run to start the installation; otherwise, use Windows Explorer to navigate to your downloads folder, and initiate the installation there. 3. Review the license agreement, and then click Next to continue. 4. On the Select Destination Location page, leave the default folder selected, and click Next to continue. 5. On the Select Start Menu Folder page, leave the default folder name, and click Next to continue. 6. On the Select Additional Tasks page, you can check the Create a desktop icon option if you want to place the ADS shortcut on the desktop. You can also check the Register Azure Data Studio as an editor for supported file types option if you want all files normally associated with SSMS to be opened in ADS by default: Figure 1.24 – Selecting additional tasks in the ADS setup Creating an Azure subscription 45 If you are not planning to replace SSMS with ADS at this time, leave the default settings unchanged, as shown in the screenshot. When ready, click Next to continue. 7. On the Ready to Install page, review the setup options. When ready, click Next to start the automated part of the installation. The installation should complete within a few minutes. 8. On the final page, uncheck the Launch Azure Data Studio option, and click Finish to close the installer. You do not have to start ADS at this time. How it works… You have now installed ADS, the newest addition to the Microsoft data platform development and administration toolset. Some of the recipes in this cookbook will require the use of features that are available in ADS, but not in SSMS. Creating an Azure subscription One of the objectives of this cookbook is to show you how to integrate your on-premises data warehousing solutions with cloud-based data storage, maintenance, and consumption capabilities of the Microsoft data platform. One of the prerequisites needed to follow this objective is an active Azure subscription. Azure provides a variety of services and features that you can use to host your data off-premises, extend your own on-premises and cloud-based data management solutions with cloud-based services provided by Microsoft and Microsoft Partner Network, and even host your own solutions as cloud-based services to be used by your clients. If you already have access to a valid Azure subscription, you are welcome to use it for all the relevant recipes in this book; otherwise, you can use this recipe to create a new one. An Azure free subscription will provide you with access to the services for 12 months – free of charge. A number of services are even provided at no cost independently of the subscription period. The free subscription includes a credit of 200 USD to be used for paid services for a period of 30 days from the subscription activation. This period should be enough for you to run all of the cloud-oriented recipes provided in this cookbook. 46 Getting Started with Azure and SSIS 2019 Important note Certain paid services in Azure incur costs while online, even if they are not actually being used. This will be pointed out in each recipe dealing with such services, and the instructions on how to either turn a feature off or remove it completely will be provided. However, by using Azure, you take full responsibility for all the costs involved in using the services provided therein. Getting ready Azure is an online service, located at https://portal.azure.com/. Generally, you would use an internet browser to access it; however, on the latest versions of the Windows operating system, you can also use the Azure portal application. The instructions in this cookbook use an internet browser to access the Azure portal. How to do it… Use your favorite internet browser and navigate to the Azure portal website: 1. On the Azure Portal app landing page, click Continue to Azure Portal website. 2. On the Sign in page, do not sign in with your existing Microsoft Azure credentials, but instead, click Create one! to create a new account. On the next screen, provide a valid email address for the account. Use the address of an existing email account that you have access to and is not currently associated with an existing Azure subscription. Alternatively, you can use a phone number instead of an email. If you prefer to use a phone number, click Use a phone number instead, and the form will allow you to enter it. Again, use an existing phone number of a smart device that you have access to and that is not currently associated with an Azure account. Creating an Azure subscription 47 You can even create a new email account. If you prefer to use a new email account, click Get a new email address, and the form will allow you to create a new outlook.com or hotmail.com email account. When ready, click Next to continue. 3. On the next screen, create a password for the account. It is recommended that you follow the standard recommendations on how to create a safe password. 4. On the next screen, provide your country/region information and the date of your birth. When ready, click Next to continue. An automated confirmation email will be sent to the email address you used earlier, or a message will be sent to your phone if you used a phone number. 5. Copy the code you were sent in the message from the Microsoft account team into the text box of the Verify email form. When ready, click Next to continue. 6. On the next screen, you will be required to prove that you are a real person by entering the characters you recognize in the screenshot. If the characters are difficult to read, click New for a new image, or click Audio for the images to be read out through your audio device. When you have entered the correct characters, click Next to continue. 7. After the account has been created, the Welcome to Azure web page will open, prompting you to start the tour. Click Maybe later to close the dialog. In a new free Azure account, none of the paid services can be accessed until the trial is activated. You do not have to activate it at this time; this can be done when you go through the recipes in the following chapters in which paid Azure services are required. You can close the internet browser. However, if you do want to activate the trial, continue on to Step 8. 48 Getting Started with Azure and SSIS 2019 8. To start the free trial, click Start: Figure 1.25 – Activating the Microsoft Azure Free Trial 9. Follow the instructions on the web page. You will have to provide some of your personal information, including an email address and a valid phone number. Important note If Azure is not available in your region, please contact Microsoft, and they will be able to assist you further. To complete the activation, your identity will be verified by phone using the method you select – via a text message or through a call from the service center. As part of the identity verification, you will also have to provide valid credit card details. Creating an Azure subscription 49 Important note Credit card information will not be used unless you upgrade your free subscription to a pay-as-you-go subscription. Before you can complete the activation procedure, you also have to read and accept the subscription agreement. Important note We recommend that you familiarize yourself with the subscription agreement, the details of the offer, and the privacy statement before continuing. 10. When ready, click Sign up to activate your Azure free trial. The activation should complete within a few minutes. How it works… You have just created a free Azure subscription; it will not only allow you to complete most of the recipes in this cookbook but will also allow you to explore additional services that you can use for your personal or professional needs. Important note One of the purposes of this cookbook is to introduce SSIS development in the cloud, using more than just the basic features of SQL Server or Azure. Unfortunately, the free Azure subscription does not provide access to more advanced features (such as the Azure Kubernetes Service or Azure Databricks); therefore, in order to complete specific recipes, you will have to upgrade your account. Each of those recipes will contain a special warning and provide the appropriate instructions on how to proceed. There's more… If you decide on using Azure for more than just learning about SSIS development, you will eventually have to upgrade your account. If you design applications or provide services that can also be hosted in Azure, you might even consider joining the Microsoft Partner Network yourself. 2 Introducing ETL When I first started in the data warehousing business, something like 20 years ago, I was asked in an interview to define ETL. Being at my first job interview, I had no clue what the interviewer meant by ETL. Luckily, the interviewer kind of liked me and hired me anyway. He told me that I would know all about ETL quite soon. Being in data warehouse businesses for many years, and more recently a data engineer, ETL is what has kept me busy most of the time since then. ETL stands for Extract, Transform, and Load. ETL is a data moving technique that has been used in various forms since the first enterprise data warehouses' inceptions. Microsoft formalized the ETL concept near the end of the 1990s with a tool called DTS: Data Transformation Service. This ETL tool, aimed at helping database administrators load data into and from SQL Server, used SQL and ActiveX to move and transform data on-premises. Microsoft brought its ETL tool to the cloud with the introduction of Azure Data Factory (ADF). In 2018, Microsoft extensively overhauled ADF to create Azure Data Factory v2, which allowed the user to complete many tasks within ADF that had previously required the use of more software. Another commonly used Azure ETL tool is Databricks. This tool uses Apache Spark as a compute service, allowing developers to use many languages to develop their transformations: Python, Scala, R, and SQL. Java can also be used to develop shared components to be used by many ETL pipelines. 52 Introducing ETL Doing ETL is a necessary step for any data warehouse or data science project. It is used in various forms and shapes in IT for tasks such as the following: Storing procedures or script used in reports: Data is extracted from a data source first and transformed every time a column is created; a calculation is done for various reports' sections. BI tools such as Power BI: This kind of tool has a model in which we can add measures or columns that fill some missing attributes in the data source. Data warehouse and science projects: Every time a program or script cleans up data or transforms it for specific consumption purposes, we are doing ETL. In the next chapters, we will explore all these tools and give you access to recipes that will show you how to do ETL in Azure. In this chapter, we will cover the following recipes: Creating a SQL Azure database Connecting SQL Server Management Studio Creating a simple ETL package Loading data before its transformation Creating a SQL Azure database To do ETL, we need to have a source to query data and a target (often called a sink) to land the transformed data. Our first recipe will create a simple SQL Azure database that will be used as both the source and sink in all recipes of this chapter. Getting ready In Chapter 1, Getting Started with Azure and SSIS 2019, you were shown how to install SQL Server on-premises, SQL Server Management Studio (SSMS), and Visual Studio 2019 with the SSIS extension. This chapter will show you how to set up SQL Server in Azure and the recipes will use this version of SQL Server. How to do it… In a browser, go to the Microsoft Azure portal using the following URL: https:// portal.azure.com/#home. Creating a SQL Azure database 53 On the main portal page, we'll create a resource group to group all Azure services together. It's much easier to find and manage them later.: 1. From the main portal page, click on the + sign to create a new resource, as shown in the following screenshot: Figure 2.1 – Create a resource in Azure 2. In the search box, type resource. Click on the search result Resource group that appears, as in the following screenshot: Figure 2.2 – Choose Resource group from the search list 3. In the Resource group blade, click Create to create a new resource group, as shown in the following screenshot: Figure 2.3 – Create a resource group after selection 54 Introducing ETL 4. Select the subscription, give the resource group a name, and choose the region, as shown in the following screenshot: Figure 2.4 – New resource group properties The region you select here will be the default for all the resources you create in the resource group in the future. Ideally, you should choose a location not too far from where you live. We can always override this setting in each resource we create in the future. Click on Review + create at the bottom left of the blade to validate the entries we made. Once validated, click Create to create the resource group. 5. Once the resource group is created, click Go to resource group from the activity log message that is displayed. Or you can click on Home on the left-hand menu and click on your resource group in the Recent resource list. Once in the resource group, click on Create resources as shown in the following screenshot: Figure 2.5 – Add a resource to a resource group Creating a SQL Azure database 55 6. From the blade that appears, select Databases on the left and click on SQL Database as shown in the following screenshot: Figure 2.6 – Select SQL Database 7. The Create SQL Database blade opens. Set the properties as shown in the following screenshot: Figure 2.7 – Database properties Let's look at the database properties: a) Subscription: Your subscription. b) Resource group: EtLInAzureCookBook or the resource group created before. 56 Introducing ETL c) Database name: We're going to restore a sample database that will be used in future recipes. We'll name it AdventureWorksLT. d) Server: Click on Create new below the combo list. The New server blade appears: Figure 2.8 – New server properties Let's look at the server properties: a) Server name: You should use something like etlinazurecookbook all in lower case. Here's the reference from Microsoft for SQL Server naming: https://docs.microsoft.com/en-us/ azure/azure-resource-manager/management/resource-name- rules#microsoftsql. b) Server admin login: ETLAdmin. c) Password: Create a password. d) Confirm password: Confirm the password created. e) Location: Select an Azure location. You should use the same region as the resource group created before. lick on OK to create the server. f) C g) Back in the database creation blade, set the SQL elastic pool option to No. Creating a SQL Azure database 57 h) Compute + Storage: Click on the Configure database link. We'll select the Basic configuration for now. i) From the Configure blade, click on the Looking for basic, standard, premium? link as shown in the following screenshot: Figure 2.9 – Change server capacity Select Basic. On the Create SQL Database blade, click on Networking. The Networking blade appears as shown in the following screenshot: Figure 2.10 – Server Networking blade 58 Introducing ETL Set Connectivity method to Public endpoint. In the Firewall rules section, set Allow Azure services and resources to access this server to Yes. This will be needed in later chapters when we use Azure services such as Azure Data Factory. Also, set Add current client IP address to Yes. This will whitelist your current PC IP address and therefore allow you to connect to the database using SQL Server Management Studio (SSMS) and SQL Server Integration Services (SSIS). Click on Additional settings. The Additional settings blade appears. In the Use existing data setting, click on Sample as shown in the following screenshot: Figure 2.11 – Server Additional settings Click on Review + create and then Create to create the database. 8. Now, we'll check whether the sample database has been created correctly. Go into the resource group and click on the AdventureWorksLT (etlinazurecookbook/ AdventureWorksLT) resource. The database blade opens. As shown in the following screenshot, click on Query editor (preview): Creating a SQL Azure database 59 Figure 2.12 – Database query editor Enter the login and password. The query editor opens. Expand the Tables folder, right-click on any table, and click Select Top 1000 Rows from the submenu that appears, as shown in the following screenshot: Figure 2.13 – Database Select Top 1000 Rows 9. Click the Run button to execute the query. 60 Introducing ETL Let's move on to the next section! How it works… We created a SQL server in Azure and we stored a sample database on it: AdventureWorksLT. Everything was done using a web browser, with no need to install anything on a PC. The next recipes will show how to use that database with on-premises software. Connecting SQL Management Studio to Azure SQL This recipe will show you how to connect to our Azure SQL database from your computer. Getting ready This recipe assumes that you have installed SSMS. How to do it… This is a very short and simple recipe! Let's begin by opening SSMS: 1. We're going to connect to the Azure database that we created in the previous recipe, Creating a SQL Azure database. Log into your Azure subscription and navigate to the ETLInAzureCookBook resource group. 2. Click on the AdventureWorksLT SQL database to open its blade. Copy the server name found at the top right as shown in the following screenshot: Figure 2.14 – Get the server name from the Azure subscription Connecting SQL Management Studio to Azure SQL 61 3. Using SSMS, log into the server using the username and password that we used in Step 7 of the previous recipe: Creating a SQL Azure database. Once connected, the Object Explorer shows the database and its related objects. Expanding the Tables folder lists all the tables, as shown in the following screenshot: Figure 2.15 – Azure SQL SSMS Object Explorer That's it! We have connected to our Azure SQL server from an on-premises program. How it works… Connecting to a SQL Azure database is not as simple as a regular on-premises database. We must open firewall rules to access the database on-premises. This is done by design to prevent any malicious data from accessing our data in Azure. But, once we are connected, we can query the data as simply as if it was a regular SQL Server database on-premises. 62 Introducing ETL Creating a simple ETL package This recipe will show you how to build a simple ETL package that does the following: Extract data from the AdventureWorksLT SalesLT.Customer table. Transform data: Concatenating all customer-name-related information into a single column called FullName. Load the data into a new table called SalesLT.CustomerFullName. The table we create will be used in the final recipe in this chapter, Loading data before its transformation, to explain ELT. Getting ready This recipe assumes that you have installed SSMS, Visual Studio 2019, and SSIS. How to do it… Let's begin by opening Visual Studio 2019: 1. Select Create a new project from the main screen. 2. Select Integration Services Project from the screen that appears, as shown in the following screenshot: Figure 2.16 – Visual Studio – creating a new Integration Services Project Creating a simple ETL package 63 3. As shown in the following screenshot, name the project AdventureWorksETL and select a folder for it. Click Create to complete the project creation: Figure 2.17 – SSIS project configuration 4. The project is created and opened, along with a package named Package.dtsx. Right-click in the Connection Managers section at the bottom of the package and select New OLE DB Connection… as shown in the following screenshot: Figure 2.18 – Add a New OLE DB Connection 64 Introducing ETL 5. Click on New…. Enter the database name, select SQL Server Authentication, and type the username and password. Select AdventureWorksLT as the database name and click on OK, as shown in the following screenshot: Figure 2.19 – New OLE DB connection configuration 6. Back on the connection manager configuration window, click on OK to complete the connection manager creation, as shown in the following screenshot: Figure 2.20 – New OLE DB connection 7. Click on the F2 function key on your keyboard to rename the connection manager to cmgr_etlinazurecookbook.database.windows.net.AdventureWorksLT. ETLAdmin. We simply prefix the name with cmgr_. This is a standard naming convention in SSIS that allows package creation standards. Creating a simple ETL package 65 8. Right-click on the connection manager and select Convert to Project Connection from the contextual menu, as shown in the following screenshot: Figure 2.21 – Convert to Project Connection manager Having a project connection manager makes the connection available on every package we create from now on. The connection manager will simply be present on all the packages in the project so we will not need to add a new connection, repeating the preceding steps, every time we add a new package to the project. 9. From the SSIS toolbox on the left, drag and drop a Data Flow Task onto the package surface. Name it DFT_SalesLT_CustomerFullName. This is a naming convention that indicates this dataflow is processing data for SalesLT.CustomerFullName. 10. Double-click on the Data Flow Task and add an OleDB data source from the Other sources section of the SSIS toolbox. Call it OLEDBSRC_SalesLT_Customer. This indicates that the source we are using is an OLEDB source, and extracts data from the SalesLT.Customer table. 11. Double-click on the OLEDB connection manager and set it to the one we created before. Set Data access mode to SQL Command. Then type the following SQL statement inside the SQL command textbox: SELECT [CustomerID] ,[NameStyle] ,[Title] ,[FirstName] ,[MiddleName] ,[LastName] ,[Suffix] FROM [SalesLT].[Customer] 66 Introducing ETL Click on Preview to ensure that the query works correctly. Click on OK to close the OLE DB source component. 12. Drag and drop a Derived Column transform onto the Data Flow and attach it to the OLE DB source created earlier. Rename it DER_FullName, indicating that this is adding a new column called FullName to the pipeline. 13. Double-click on it to open it. In the Derived column name field, type FullName. In the Expression field, type the following expression and click OK: TRIM(REPLACENULL(Title,"") + " " + REPLACENULL(FirstName,"") + " " + TRIM(REPLACENULL(MiddleName,"") + " ") + REPLACENULL(LastName,"") + " " + REPLACENULL(Suffix,"")) 14. From Other Destinations in the SSIS toolbox, add an OLE DB destination to the Data Flow and attach it to the DER_FullName derived column transform created in Step 13 of this recipe. Rename it OLEDBDST_SalesLT_CustomerFullName to indicate that we're loading data into the SalesLT.CustomerFullName table. 15. Double-click on it to open it. Make sure the OLEDB connection manager is set to the one we created before and Table access mode is set to Table or view – fast load. 16. Click on the New… button beside Name of the table or the view and set the properties as shown in the following screenshot: Figure 2.22 – Create a table in OLE DB Destination Editor Creating a simple ETL package 67 17. Make sure to use the table name [SalesLT].[CustomerFullName]. Failure to do so will fail other steps of the recipe. The SQL DDL script to create the table is the following: CREATE TABLE [SalesLT].[CustomerFullName] ( [CustomerID] int, [NameStyle] bit, [Title] nvarchar(8), [FirstName] nvarchar(50), [MiddleName] nvarchar(50), [LastName] nvarchar(50), [Suffix] nvarchar(10), [FullName] nvarchar(350) ) 18. Click on OK to close the Create Table window and return to the previous screen. Click on Mappings on the left of OLE DB Destination Editor. Your screen should look like the following screenshot. Click on OK to close the editor and return to the Data Flow Task: Figure 2.23 – Destination OLE DB Destination Editor Mapping tab 68 Introducing ETL 19. To correctly align the Data Flow content, right-click on the taskbar and select Layout from the submenu as shown in the following screenshot: Figure 2.24 – Add a Layout button on the toolbar 20. Now, select all items in the Data Flow task (Ctrl + A) and click on the Make same width icon on the Layout toolbar. 21. For the final finishing touch, we'll auto-align all transforms on the Data Flow. From the Format menu, select Auto Layout and Diagram as shown in the following screenshot: Figure 2.25 – Auto Layout Diagram menu selection Creating a simple ETL package 69 The Data Flow should now look like the following screenshot: Figure 2.26 – Dataflow layout 22. Right-click on the arrow (path) between the DER_FullName transform and OLEDBDST and select Enable Data Viewer as shown in the following screenshot: Figure 2.27 – Data Flow Enable Data Viewer 70 Introducing ETL 23. Close the Data Viewer. Now, right-click anywhere on the Data Flow task background and select Execute Task as shown in the following screenshot: Figure 2.28 – Execute the dataflow The Data Flow Task should execute successfully, and you should see the Data Viewer with the new column added as shown in the following screenshot: Figure 2.29 – Data Viewer during dataflow execution 24. Now, right-click on the path between the derived column and destination again to Disable Data Viewer as we don't need it anymore. Stop the execution of the Data Flow by clicking the Stop button. 25. As a final step, we'll make our package idempotent, that is, making sure that anytime we execute it, the destination data will be the same. Creating a simple ETL package 71 Go back to the package's control flow and add an Execute T-SQL task from the Favorite Tasks section of the SSIS toolbox. Double-click on it to open it. Set the properties as follows: a) Name: SQL_Truncate_SalesLT_CustomerFullName b) Connection: the connection manager of the package c) SQLSourceType: Direct Input d) S QL Statement: TRUNCATE TABLE SalesLT.CustomerFullName Click on OK to close Execute SQL Task Editor. Back on the control flow, attach the Data Flow task to the Execute SQL Task just created. 26. Like we did for the Dataflow items, select both tasks and, from the Layout toolbar, click on Make same width. From the Format menu, select, Format > Auto Layout > Diagram. Your package should look like the following screenshot: Figure 2.30 – Control flow after autolayout 27. Save the package, right-click on the package in the Solution Explorer (at the right of Visual Studio), and select Rename from the submenu. Rename the package to SalesLT.CustomerFullName.dtsx. 28. Finally, execute the package. Again, right-click on it and select Execute package from the submenu that appears. The package will execute, and your screen should look like the following screenshot: Figure 2.31 – Control flow after successful execution 72 Introducing ETL How it works… This recipe showed a simple ETL design pattern. Most ETL packages are more complex but they usually follow the same pattern: Source > Transformation > Load into a destination. The transformation phase is quite simple: it uses an expression to concatenate the name-related columns, taking care of the fact that the column may be NULL. The outer TRIM command takes care of the space at the front and end of the column because the Title and Suffix columns might be NULL. It is the same for the inner TRIM column that is taking care of the NULL MiddleName column. Loading data before its transformation ELT is very similar to ETL, but with a crucial difference: the order of the transform and load steps are inverted. This is very useful with big data in the cloud or when we do not have an ETL tool on-premises. This recipe will be much simpler than the previous one, as we'll implement ELT using a database, so no tools are involved here except for calling the ELT task. It also relies on the previous recipe, Creating a simple ETL package, since we're going to use the SalesLT.CustomerFullName table data to implement the ELT pattern. There are essentially two parts to this recipe: 1. Extract and load data into our data lake. Here, we don't have a real data lake; we're using AdventureWorksLT on Azure to mimic the data lake concept. 2. Transform the data inside the database using a simple SQL script. We're going to add the FullName column to the SalesLT.Customer table and update it using this script. Getting ready This recipe assumes that you have installed SSMS, Visual Studio 2019, and SSIS. How to do it… Let's dig into the recipe: 1. In the SSIS solution, right-click on the SSIS packages folder and select New SSIS package. Rename the package from Package1 to SalesLT.Customer. 2. Drag and drop Execute SQL Task from the Favorites section of the SSIS toolbox onto the control flow and name it SQL_ELT_SalesLT_Customer. Loading data before its transformation 73 3. We're going to add the FullName column to SalesLT.Customer. Open SSMS and connect it to our Azure database. Make sure that the context is set to the AdventureWorksLT database as shown in the following screenshot: Figure 2.32 – SSMS AdventureWorksLT database context 4. In the query window, type the following DDL statement: ALTER TABLE SalesLT.Customer ADD FullName NVARCHAR(350) 5. Execute the command and verify that the FullName column has been successfully added in the Object Explorer as shown in the following screenshot: Figure 2.33 – SSMS Object Explorer FullName column 74 Introducing ETL 6. Go back to the SSIS package. Double-click on the SQL_ELT_SalesLT_Customer task to open Execute SQL Task Editor. Set the properties as follows: a) Connection: cmgr_etlinazurecookbook.database.windows.net.