Snowflake SnowPro Core Certification Study Guide
Sudhanshu Thakur
November 2022

EXAM TOPICS

1.0 Domain: Account and Security
1.1 Explain how to manage Snowflake accounts.
● Account usage
● Information schema
1.2 Outline security principles.
● Multi-factor Authentication (MFA)
● Data Encryption
● Network Security & Policies
● Access Control
● Federated Authentication
● Single Sign-On (SSO)
1.3 Define the entities and roles that are used in Snowflake.
● Outline how privileges can be granted and revoked
● Explain Role Hierarchy and Privilege Inheritance
1.4 Explain the security capabilities associated with each Snowflake edition.
● Private Linking / Data Masking / HIPAA / Periodic Rekeying / Tri-Secret Secure
1.5 Outline Data Governance capabilities in Snowflake.
● Data masking
● Account usage views
● External Tokenization
● Secure Views

2.0 Domain: Virtual Warehouses
2.1 Outline compute principles.
● Credit usage & billing
● Concurrency
● Caching
● Virtual warehouse characteristics and parameters
● Query Profiler
2.2 Explain Virtual Warehouse best practices.
● Scale up vs. scale out
● Types of virtual warehouses
● Management/monitoring

3.0 Domain: Data Movement
3.1 Outline different commands used to load data and when they should be used.
● COPY
● INSERT
● PUT
● GET
● VALIDATE
3.2 Define bulk as compared to continuous data loading methods.
● COPY
● Snowpipe
● Parameter Types
3.3 Define best practices that should be considered when loading data.
● File size
● File Formats
● Folders
3.4 Outline how data can be unloaded from Snowflake to either local storage or cloud storage locations.
● Define formats supported for unloading data from Snowflake
● Define commands that help when unloading data
● Define best practices that should be considered when unloading data
3.5 Explain how to work with and load semi-structured data.
● Supported file formats, data types, and sizes
● VARIANT column
● Flattening the nested structure

4.0 Domain: Performance Management
4.1 Outline best practices for Snowflake performance management on storage.
● Clustering
● Materialized views
● Search Optimization
4.2 Outline best practices for Snowflake performance management on virtual warehouses.
● Query performance and analysis
● Query profiles
● Query history and activity
● SQL optimization
● Caching

5.0 Domain: Snowflake Overview & Architecture
5.1 Outline key components of Snowflake's Cloud data platform.
● Data types
● Continuous data protection
● Cloning
● Types of Caching
● User-Defined Functions (UDFs)
● Web Interface (UI)
● Data Cloud / Data Sharing / Data Marketplace / Data Exchange
5.2 Outline Snowflake data sharing capabilities.
● Account types
● Data Marketplace & Exchange
● Access Control options
● Shares
5.3 Explain how Snowflake is different compared to legacy warehouse solutions.
● Elastic Storage
● Elastic Compute
● Account Management
5.4 Outline the different editions that are available, and the functionality associated with each edition.
● Pricing
● Features
5.5 Identify Snowflake's Partner Ecosystem.
● Cloud Partners
● Connectors
5.6 Outline and define the purpose of Snowflake's three distinct layers.
● Storage Layer
● Compute Layer
● Cloud Services Layer
5.7 Outline Snowflake's catalog and objects.
● Accounts
● Database
● Schema
● Table Types
● View Types
● Data Types
● External Functions
● Stored Procedures

6.0 Domain: Storage and Protection
6.1 Outline Snowflake Storage concepts.
● Micro-partitions
● Metadata Types
● Clustering
● Data Storage
● Stage Types
● File Formats
● Storage Monitoring
6.2 Outline Continuous Data Protection with Snowflake.
● Time Travel
● Fail Safe
● Data Encryption
● Cloning
● Replication
● Master Keys

TABLE OF CONTENTS
Snowflake - Key Concepts
Snowflake Architecture
Snowflake Regions
Account Identifiers
Snowflake Editions (Account Types)
Snowflake Release Schedule
Snowflake Ecosystem
Snowflake Partner Connect
Drivers, Connectors and Client Tools
Snowflake Compliance
Web Interfaces
Classic Web Interface
Snowsight
Sharing Worksheets & Folders
Data Storage
Micro-Partitions
Metadata
DEMO - Micro-Partitions and Metadata Layer Usage
Clustering
DEMO - CLUSTERING KEYS
DEMO - CLUSTERING DEPTH
Reclustering
Data Storage Monitoring
Virtual Data Warehouses
Warehouse Sizes
Query Processing & Concurrency
Impact on Data Loading
Impact on Query Processing
Scaling Up vs. Scaling Out
Multi-Cluster Warehouse
Multi-cluster Warehouse Modes
Multi-Cluster Configuration
Scaling Policies for Multi-Cluster Warehouses
Altering Query Behavior
Scaling of Cloud Services
Query History (classic Web UI only)
Query Profile
SQL Optimization Techniques in Snowflake
Snowflake Account and Object Hierarchy
Object Hierarchy
Database Types in Snowflake
Table Types in Snowflake
Temporary vs. Transient Tables
External Tables
Data Types in Snowflake
Information Schema
Account Usage Views
Access_History View
DEMO: Show last 100 account access records
Login_History View
DEMO: Show last 100 account logins
Load_History View
DEMO: Show last however many data load activities
COPY_HISTORY View
QUERY_HISTORY View
Views in Snowflake
Standard Views
Materialized Views
Secure Views
Developing Applications with Snowflake
Context Functions
Constraints
Transactions Management
User-Defined Functions (UDFs)
Stored Procedures
Caller's Rights vs Owner's Rights
Snowflake Scripting
Working with Variables
External Functions
Snowpark
Snowflake Connector for Spark
Important SQL Commands
Conditional Expressions
Merge/Upsert Command
TRUNCATE Table Command
Subqueries
Correlated vs. Uncorrelated Subqueries
Scalar vs. Non-scalar Subqueries
CTE - Common Table Expressions
Data Loading in Snowflake
Stages
Demo - Show a list of named stages
Encryption of Staged Files
Staging Uncompressed Files
PUT command
PURGE and REMOVE
DROPPING STAGES (Internal / External)
File Format
Transforming Data during Loading
Bulk Data Loading
Data Loading Parameters (Bulk Loading)
DEMO - Configure SnowSQL
DEMO - Bulk Data Loading from Local Filesystem
DEMO - Loading Data From Cloud Storage
DEMO - Cross-cloud Data Loading
Validation options during data loading
DEMO - Validation During Data Load
DEMO - Pre-Loading Validation Mode
Data Loading Best Practices
File Sizing Best Practices
Folders / Partitioning
Continuous Data Loading
Snowpipe
Snowpipe REST API vs. Snowpipe Auto_Ingest
Bulk data loading vs. continuous loading with Snowpipe
DEMO - Continuous Data Loading using REST API
Demo - Continuous Data Loading using Auto-Refresh
Snowflake connector for Kafka
Tasks
DEMO - Automating Data Ingestion using Tasks
DEMO - Creating a data processing workflow using Tasks
Streams
Stream Types
DEMO - Working with Streams
Stream Staleness
Data Pipelines in Snowflake
Data Unloading
Data Unloading Supported Formats
Bulk Unload Process
GET Command
Bulk Unloading into Single or Multiple Files
Data Unloading Best Practices
DEMO - Structured Data Unloading
DEMO - JSON & Parquet Unloading
DEMO - MAX_FILE_SIZE
Demo - Empty String / NULL substitution
Floating-Point Numbers Truncation
Semi-Structured Data
Supported File Formats
Semi-Structured Data Types
VARIANT Data Type
OBJECT Data Type
ARRAY Data Type
Querying Semi-Structured Data
Flatten Function
Loading and Unloading Semi-Structured Data
DEMO - Loading Semi-Structured Data in Snowflake
DEMO - Flatten and Lateral Flatten
Unstructured Data Support
Directory Tables
Security & Access Control
Network Policy
Private Link (AWS)
Authentication
MFA
Federated Authentication & SSO
Single Sign-On [SSO]
Authorization - Roles and Privileges
Securable Object
Role Hierarchy
Demo - Grants and Revokes
Custom Roles in Snowflake
Managed Access Schema
Continuous Data Protection in Snowflake
Data Encryption
Master Keys
Encryption Key Rotation
DEMO - Enable PERIODIC DATA REKEYING
Time Travel
SQL Extensions to enable Time Travel
DEMO - TIME TRAVEL
Fail Safe
Cloning
Business Continuity and Disaster Recovery
Account Object Replication and Failover/Failback
Replication Groups and Failover Groups
Database Replication & Failover/Failback
Primary Database
Enable Replication across Accounts
Client Redirect
Row-Level Security
Row Access Policies
Data Masking - Column-Level Security
Dynamic Data Masking
Masking Policies
Masking Policy Example
External Tokenization
Secure Data Sharing
Direct Share
Direct Sharing - Key Points
Granting Privileges on a Shared Database
DEMO - SECURE DATA SHARING
Data Consumer Limitations
Reader Accounts
Snowflake Data Marketplace
Data Exchange
Performance Tuning
SQL Tuning Tips
Query Pruning
Caching in Snowflake
Caching Types
Result Scan Function
DEMO - Result Scan Function
Cache Reuse in Snowflake
DEMO - Cache Reuse
Search Optimization Service [SOS]
DEMO - Search Optimization Service
Pricing
Virtual Warehouse Pricing
Data Storage Pricing
Cloud Services Pricing
Resource Monitors
One-to-Many Relationship between RMs and VWHs
Create/Modify Resource Monitors

Snowflake - Key Concepts

Snowflake's Data Cloud is powered by an advanced data platform provided as Software-as-a-Service (SaaS). Snowflake enables data storage, processing, and analytic solutions that are faster, easier to use, and far more flexible than traditional offerings.

The Snowflake data platform is not built on any existing database technology or "big data" software platforms such as Hadoop. Instead, Snowflake combines a completely new SQL query engine with an innovative architecture natively designed for the cloud.
To the user, Snowflake provides all of the functionality of an enterprise analytic database, along with many additional special features and unique capabilities.

Snowflake Architecture

Snowflake's architecture is a hybrid of traditional shared-disk and shared-nothing database architectures. Similar to shared-disk architectures, Snowflake uses a central data repository for persisted data that is accessible from all compute nodes in the platform. But similar to shared-nothing architectures, Snowflake processes queries using MPP (massively parallel processing) compute clusters where each node in the cluster stores a portion of the entire data set locally. This approach offers the data management simplicity of a shared-disk architecture, but with the performance and scale-out benefits of a shared-nothing architecture.

Snowflake uses a multi-cluster shared-data architecture consisting of three independent layers:
1. Storage Layer
2. Compute [Virtual Warehouse] Layer
3. Cloud Services Layer

Snowflake Regions

Each Snowflake account is located in a single region. If you wish to use Snowflake across multiple regions, you must maintain a Snowflake account in each of the desired regions. Snowflake replicates data to 3 availability zones (AZs) for disaster-recovery purposes.

Effective May 1, 2020:
● All new accounts, regardless of Snowflake Edition, receive Premier support, which includes 24/7 coverage.
● Standard Edition accounts that were provisioned before this date will continue to receive Standard support until the accounts are transitioned to Premier support.

The government regions are only supported for Snowflake accounts on Business Critical Edition (or higher).

Account Identifiers

The classic web console is now deprecated in favor of the new Snowsight console, e.g.:
● https://app.snowflake.com/us-east-1/tca34621/worksheets

New (October 2022): create a new Organization. All accounts created before the organization feature were implicitly added to a default organization. To grant the ORGADMIN role within an existing account, run the following:

-- Assume the ACCOUNTADMIN role
use role accountadmin;
-- Grant the ORGADMIN role to a user
grant role orgadmin to user sthakur500;
-- List the accounts in the organization
show organization accounts;

New account URLs do not have any reference to the cloud provider or region they are associated with. The new URL format is:
https://orgname-accountname.snowflakecomputing.com

ORG NAME: RJDODFG
https://rjdodfg-tca34621.snowflakecomputing.com
https://rjdodfg-thakazure1.snowflakecomputing.com

The old URL format, based on the account locator, can still be used, but the URL based on the combination of orgname-accountname is the preferred format.
https://tca34621.us-east-1.snowflakecomputing.com
https://vw37125.east-us-2.azure.snowflakecomputing.com
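As a quick check while working with account identifiers, the context functions below return the locator, region, and version for the current session; the two name functions are newer additions that return the organization and account names used in the preferred orgname-accountname URL format, if they are available in your account (the values shown in comments are illustrative, not from this guide's accounts).

-- Account locator, region, and Snowflake version for the current session
SELECT CURRENT_ACCOUNT()  AS account_locator,   -- e.g. TCA34621
       CURRENT_REGION()   AS region,            -- e.g. AWS_US_EAST_1
       CURRENT_VERSION()  AS snowflake_version;

-- Organization/account names used in the orgname-accountname URL format
SELECT CURRENT_ORGANIZATION_NAME() AS org_name,
       CURRENT_ACCOUNT_NAME()      AS account_name;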
Snowflake Editions (Account Types)

You will often be asked what the minimum Snowflake edition is that supports feature X. Each edition includes all the features of the lower editions.

The Snowflake Edition that your organization chooses determines the unit costs for the credits and the data storage you use. Other factors that impact unit costs are the region where your Snowflake account is located and whether it is an On Demand or Capacity account:
● On Demand: usage-based pricing with no long-term licensing requirements.
● Capacity: discounted pricing based on an up-front capacity commitment.

NOTE: You might also be asked to select all editions that support feature X.

Premier support is available for ALL editions (even Standard edition accounts created after May 1, 2020).

Standard

The introductory edition. Fully supported. Time Travel is limited to a single day.

No periodic rekeying of encrypted data in Standard edition. Periodic rekeying is different from key rotation, which happens automatically on all editions every 30 days. Periodic rekeying re-encrypts older files with a new key and completely destroys (retires) the old key.

No support for materialized views in Standard edition.
No support for object tagging in Standard edition.
No support for multi-cluster virtual warehouses in Standard edition.

Enterprise

Everything in Standard edition, plus Time Travel up to 90 days. Meant for large enterprises and organizations. Key differences from Standard edition:
● Time Travel up to 90 days
● Materialized Views
● Search Optimization Service
● Periodic rekeying
● Column-level security - data masking and tokenization
○ Dynamic Data Masking - mask sensitive data using role-based masking policies.
○ External Tokenization - mask data before it lands in Snowflake using an external tokenization provider such as Protegrity. You must use external functions to detokenize (external functions are available in Standard and higher editions), but integration with an external tokenization provider requires Enterprise or higher edition.

Database failover/failback is not available in Enterprise edition; it requires Business Critical or higher.

Business Critical

Business Critical Edition, formerly known as Enterprise for Sensitive Data (ESD), offers even higher levels of data protection to support the needs of organizations with extremely sensitive data, particularly PHI data that must comply with HIPAA and HITRUST CSF regulations.

It includes all the features and services of Enterprise Edition, with the addition of enhanced security and data protection. In addition, database failover/failback adds support for business continuity and disaster recovery.

Key differences from Enterprise edition:
● Support for secure, direct proxy to virtual networks or on-prem data centers (AWS PrivateLink or Azure Private Link)
● Supports HIPAA, PCI-DSS, SOC 1 & 2, and FedRAMP data
● Database replication and failover - enables replicating databases between Snowflake accounts (within the same organization) and keeping the database objects and stored data synchronized. Database replication is supported across regions and across cloud platforms.
○ Note: you cannot replicate to a "lower" edition account.
○ Currently only database objects are replicated. Other object types such as users, roles, shares, resource monitors, virtual warehouses, etc. cannot be replicated.
○ Encryption is more complicated between the primary and replicated accounts; Snowflake handles the key management under the covers. It is even more complicated when Tri-Secret Secure is enabled.
● Query statement encryption
● Tri-Secret Secure - you create a composite key by combining your own customer-managed key (CMK) with Snowflake's account master key.
■ CMK + Snowflake account master key = composite key
■ Revocation of either the CMK or the Snowflake key results in the composite key being revoked (data can no longer be decrypted).
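A minimal sketch of a few of these edition-gated capabilities in SQL (object names such as orders, customers, and email_mask are illustrative; the statements assume Enterprise edition or higher and the required privileges):

-- Extended Time Travel (up to 90 days) - Enterprise+
ALTER TABLE orders SET DATA_RETENTION_TIME_IN_DAYS = 90;

-- Periodic rekeying of encrypted data - Enterprise+ (ACCOUNTADMIN required)
ALTER ACCOUNT SET PERIODIC_DATA_REKEYING = TRUE;

-- Dynamic Data Masking (column-level security) - Enterprise+
CREATE MASKING POLICY email_mask AS (val STRING) RETURNS STRING ->
  CASE WHEN CURRENT_ROLE() IN ('PII_ADMIN') THEN val ELSE '*** MASKED ***' END;

ALTER TABLE customers MODIFY COLUMN email SET MASKING POLICY email_mask;

On Standard edition, the first two statements would be rejected (retention is capped at 1 day and periodic rekeying is unavailable), which is an easy way to remember which features are edition-gated.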
VPS

Virtual Private Snowflake offers the highest level of security for organizations that have the strictest requirements, such as financial institutions and any other large enterprises that collect, analyze, and share highly sensitive data.

It includes all the features and services of Business Critical Edition, but in a completely separate Snowflake environment, isolated from all other Snowflake accounts (i.e. VPS accounts do not share any resources with accounts outside the VPS).

VPS accounts use different naming conventions than the accounts for other Snowflake Editions. This results in a different structure for the hostnames/URLs used to access VPS accounts.

Feature Comparison Matrix
● Standard Edition
○ 1 day of Time Travel
○ dedicated virtual warehouses
○ Fail-safe
● Enterprise Edition (Standard +)
○ up to 90 days of Time Travel
○ multi-cluster virtual warehouses
○ Materialized Views
● Business Critical Edition (Enterprise +)
○ Supports HIPAA, PCI-DSS, SOC 1 & 2, and FedRAMP data
○ Private Link
○ Tri-Secret Secure (customer-managed key)
○ Database Failover / DR
● Virtual Private Snowflake (Business Critical +)
○ Highest security; a completely isolated environment, NO SHARING of resources outside the VPS

Snowflake Release Schedule

New patches and upgrades are released every week. Use the CURRENT_VERSION function to retrieve the Snowflake version number.

SELECT current_version(); -- 6.32.0

Enterprise, Business Critical and VPS editions can get early access to new releases: 24-hour early access to weekly new releases, which can be used for additional testing/validation before each release is deployed to your production accounts.

Answer: 2 days.

Once a new release has been deployed, Snowflake does not move all accounts to the release at the same time. Accounts are moved to the release using a three-stage approach over two days, in the following order, based on their Snowflake Edition:
● Day 1 - Stage 1 (early access) for designated Enterprise accounts.
● Day 1 or 2 - Stage 2 (regular access) for all Standard Edition accounts.
● Day 2 - Stage 3 (final) for all Enterprise Edition and VPS accounts.

The minimum amount of elapsed time between the early access and final stages is 24 hours. This staged approach enables Snowflake to monitor activity as accounts are moved and respond to any issues that may occur.

Update (Oct 2022): what was previously called the New release is now called the Full release, and there is an additional change type, the behavior change release.

Each week, Snowflake deploys two types of planned releases:
● Full release - may include new features, behavior changes, enhancements, updates, and fixes.
● Patch release - includes fixes only.

Snowflake also deploys a behavior change release (BCR) once a month, with each behavior change bundle rolled out over a two-month period.

Snowflake Ecosystem

Snowflake works with a wide array of industry-leading tools and technologies, enabling you to access Snowflake through an extensive network of connectors, drivers, programming languages, and utilities, including:
● Certified partners who have developed cloud-based and on-premises solutions for connecting to Snowflake.
● Other 3rd-party tools and technologies that are known to work with Snowflake.
● Snowflake-provided clients, including SnowSQL (command line interface), connectors for Python and Spark, and drivers for Node.js, JDBC, ODBC, and more.

Technology Partner Listing
● Data Integration
● Business Intelligence (BI)
● Machine Learning & Data Science
● Security, Governance & Observability
● SQL Development & Management
● Native Programmatic Interfaces

Snowflake Partner Connect

The Snowflake Partner Network unlocks the potential of the cloud data platform with a broad array of tools and partners.
Our certified partnerships and integrations enable customers to leverage Snowflake's flexibility, performance, and ease of use to deliver more meaningful data insights. As a customer, whether you are looking for certified services partners to help you migrate or make the most of your Snowflake deployment, or whether you are looking for integrated technologies, the Snowflake Partner Network is the place to start.
● Cloud Partners - AWS, Azure, GCP
● Data Providers - Data Marketplace
● Service Partners - consulting partners
● Technology Partners
● Powered by Snowflake (new)

Snowflake Partner Connect

You need the ACCOUNTADMIN role to connect with Partner Connect tools and services.

All options are correct.

Update (October 2022) - this answer is still correct, but the menu item to access Partner Connect is now under the 'Admin' section.

Update (October 2022) - the download page has now moved to a separate location (developers.snowflake.com). This answer is still correct.

SnowCD, the Snowflake Connectivity Diagnostic Tool, helps users diagnose and troubleshoot their network connection to Snowflake.

Drivers, Connectors and Client Tools
● CLI Tools / SnowSQL
● ODBC / JDBC Drivers
● Python connector
● Node.js Driver
● Spark Connector
● "Go" Snowflake Driver
● SnowCD - Snowflake Connectivity Diagnostic Tool
Recent additions (2021/2022):
● Kafka - Snowflake connector
● PHP PDO driver
● Snowpark
● Streamlit
● .NET driver
● SQL API

Snowflake Compliance

HIPAA, PCI DSS and HITRUST compliance are only available for Business Critical or higher editions.

Web Interfaces
● Classic Web UI
● Snowsight (new)

Classic Web Interface

You can use the classic web interface to perform tasks that would normally be performed using SQL and the command line, including:
● Creating and managing users and other account-level objects (if you have the necessary administrator roles).
● Creating and using virtual warehouses.
● Creating and modifying databases and all database objects (schemas, tables, views, etc.).
● Loading data into tables.
● Submitting and monitoring queries.
● Changing your password and setting user preferences.

Snowsight

Snowsight, the Snowflake web interface, distills Snowflake's powerful SQL support into a unified, easy-to-use experience. Use Snowsight to perform your critical Snowflake operations, including:
● Building and running queries.
● Loading data into tables.
● Monitoring query performance and copy history.
● Creating and managing users and other account-level objects.
● Creating and using virtual warehouses.
● Creating and modifying databases and all database objects.
● Sharing data with other Snowflake accounts.
● Exploring and using the Snowflake Marketplace.

Sharing Worksheets & Folders

Snowsight lets you send links to other Snowflake users to view or run worksheets and folders you own. Optionally, you can allow Snowflake users to duplicate shared worksheets to pursue new directions of analysis. You can share worksheets, folders, and dashboards you own with other Snowflake users in your account.

Worksheets in Snowsight use unique sessions with specific roles and warehouses assigned in the context of the worksheet. To view shared query results, the Snowflake user must use the same role as the session context for the worksheet.

To share a worksheet:
1. Select a worksheet. The worksheet opens.
2. In the upper-right corner of the worksheet, select Share.

To share a folder:
1. Select a folder. The folder opens.
2. In the upper-right corner of the Worksheets page, select Share.
Data Storage

All data in Snowflake tables is automatically divided into micro-partitions, which are contiguous units of storage. Each micro-partition contains between 50 MB and 500 MB of uncompressed data (note that the actual size in Snowflake is smaller because data is always stored compressed). Groups of rows in tables are mapped into individual micro-partitions, organized in a columnar fashion. This size and structure allows for extremely granular pruning of very large tables, which can comprise millions, or even hundreds of millions, of micro-partitions.

Micro-partitions at a glance:
● Type: Micro-partitions
● Architecture: Storage only
● Storage Pattern: Hybrid columnar
● Row Storage: Contiguous
● Partition Size: 50 to 500 MB of uncompressed data (generally a maximum of about 16 MB compressed)
● Updates: Not allowed; micro-partitions are immutable
● Data Clustering: Sorted/ordered along "natural" dimensions such as date or geography
● Performance Optimization: Clustering depth

Clustering depth: the clustering depth for a populated table measures the average depth (1 or greater) of the overlapping micro-partitions for specified columns in a table. The smaller the average depth, the better clustered the table is with regard to the specified columns.

Micro-Partitions Metadata

Snowflake stores metadata about all rows stored in a micro-partition, including:
● The range of values for each of the columns in the micro-partition
● The number of distinct values
● MIN and MAX values
● NULL count
● Additional properties used for both optimization and efficient query processing.

Value ranges can overlap across micro-partitions (for example, several partitions whose ranges all include names starting with "J"). Overlapping is OK in Snowflake.

DEMO - Micro-Partitions and Metadata Layer Usage

Each micro-partition's metadata stores the following:
● Range of values
● Number of distinct values
● MIN and MAX values
● NULL count

Results for a SQL query asking only for the attributes listed above are served from the metadata layer without needing a virtual warehouse. Let's demonstrate this with an example.

-- we are using the AWS Snowflake account
-- set context for the worksheet
use role sysadmin;
use warehouse compute_wh;
use database thakur_demo;
use schema public;

We will be using the BIKE_TRIPS table. Let's run some queries.

worksheet> select count(*) from bike_trips;
-- 1,264,112 rows

The query took 91 ms to run and was served entirely from the metadata layer. Here's the query profiler view.

WHAT DOES THE ORANGE BAR SHOW? Percentage: the fraction of time that this operator consumed within the query step. This information is also reflected in the orange bar at the bottom of the operator node, allowing for easy visual identification of performance-critical operators.

worksheet> select max(trip_id) from bike_trips;
-- no compute warehouse used; entirely served from the metadata

-- let's put a filter condition on the previous query.
worksheet> select max(trip_id) from bike_trips
           where start_station_name = '6th & Chalmers';
-- yep, this needed a compute warehouse, since station name values are not stored in the metadata for each partition.

-- Let's see the count of distinct START_STATION_NAMEs.
worksheet> select count(distinct start_station_name) from bike_trips;
-- 193

Although each micro-partition's metadata records its own number of distinct values, a global distinct count across the whole table cannot be derived from the per-partition counts (the same values can appear in many partitions), so running this query needed a compute warehouse.

Let's look at the query profile. The entire table was scanned, since we were looking for a distinct count of values over the entire table (so no pruning; this was a full table scan).
-- How about if we retrieve the DISTINCT values of START_STATION_NAME?
worksheet> select distinct start_station_name from bike_trips;
-- results shown

Clearly the entire table had to be scanned again to get the distinct station names.

Look at the 'Initialization' step. I believe this is the compute warehouse auto-starting (resuming), much like a screensaver waking up.

-- now let's run the same exact query again.
worksheet> select distinct start_station_name from bike_trips;
-- results are shown, but the profiler indicated that the results were served from the Results Cache (no warehouse needed in this case).

Remember, the Results Cache is available for 24 hours; each time the result is reused within that window, another 24-hour TTL is granted to the results.

worksheet> select start_station_name, avg(duration_minutes)
           from bike_trips
           group by start_station_name;
-- as expected, a compute warehouse was engaged to run this query.

worksheet> select avg(duration_minutes)
           from bike_trips
           where start_station_name = '6th & Chalmers';
-- let's review the query profiler view for this statement
-- this is not a very efficient query, as it involved scanning 11 out of the 12 micro-partitions. We may benefit from clustering the table differently, perhaps using START_STATION_NAME as the clustering key. See the next section.

Only MIN, MAX, and count values are stored in the metadata layer for each micro-partition. Filtering on a given value will need a compute warehouse to run.

Clustering

In Snowflake, as data is inserted/loaded into a table, clustering metadata is collected and recorded for each micro-partition created during the process. Snowflake then leverages this clustering information to avoid unnecessary scanning of micro-partitions during querying, significantly accelerating the performance of queries that reference these columns.

In general, Snowflake produces well-clustered data in tables; however, over time, particularly as DML occurs on very large tables (as defined by the amount of data in the table, not the number of rows), the data in some table rows might no longer cluster optimally on desired dimensions.

Note: since rows in Snowflake are stored contiguously, it is possible to have overlapping values in micro-partitions. Consider a 100-million-row customer table that stores customers' demographic information. The city or state values must end up in lots of micro-partitions, since there are only so many locations customers come from. This type of overlap is acceptable in Snowflake. The measure of this "overlap" is calculated using a metric called clustering depth. A smaller depth is desirable for better query performance.

Clustering depth can be used for a variety of purposes, including:
● Monitoring the clustering "health" of a large table, particularly over time as DML is performed on the table.
● Determining whether a large table would benefit from explicitly defining a clustering key.

The clustering depth for a table is not an absolute or precise measure of whether the table is well-clustered. Ultimately, query performance is the best indicator of how well-clustered a table is:
● If queries on a table are performing as needed or expected, the table is likely well-clustered.
● If query performance degrades over time, the table is likely no longer well-clustered and may benefit from clustering.

Clustering keys are not intended for all tables.
The size of a table, as well as the query performance for the table, should dictate whether to define a clustering key for the table. In particular, to see performance improvements from a clustering key, a table has to be large enough (in the multi-TB range) to consist of a sufficiently large number of micro-partitions, and the column(s) defined in the clustering key have to provide sufficient filtering to select a subset of these micro-partitions.

In general, tables in the multi-terabyte (TB) range will experience the most benefit from clustering, particularly if DML is performed regularly/continually on these tables. Also, before explicitly choosing to cluster a table, Snowflake strongly recommends that you test a representative set of queries on the table to establish some performance baselines.

You can include any number of columns or expressions in the clustering key; however, more is not always better. Do not include more than 3 columns in a cluster key, and ensure that low-cardinality columns (the ones with fewer distinct values) are listed first in the cluster key definition.

To improve the clustering of the underlying table micro-partitions, you can always manually sort rows on key table columns and re-insert them into the table; however, performing these tasks could be cumbersome and expensive. Instead, Snowflake supports automating these tasks by designating one or more table columns/expressions as a clustering key for the table. A table with a clustering key defined is considered to be clustered.

You can cluster materialized views, as well as tables. The rules for clustering tables and materialized views are generally the same. For a few additional tips specific to materialized views, see Materialized Views and Clustering and Best Practices for Materialized Views.

In general, if a column (or expression) has higher cardinality, then maintaining clustering on that column is more expensive. The cost of clustering on a unique key might be more than the benefit of clustering on that key, especially if point lookups are not the primary use case for that table.

If you want to use a column with very high cardinality as a clustering key, Snowflake recommends defining the key as an expression on the column, rather than on the column directly, to reduce the number of distinct values. The expression should preserve the original ordering of the column so that the minimum and maximum values in each partition still enable pruning.

For example, if a fact table has a TIMESTAMP column c_timestamp containing many discrete values (many more than the number of micro-partitions in the table), then a clustering key could be defined on the column by casting the values to dates instead of timestamps (e.g. to_date(c_timestamp)). This would reduce the cardinality to the total number of days, which typically produces much better pruning results. As another example, you can truncate a number to fewer significant digits by using the TRUNC function and a negative value for the scale, e.g., TRUNC(123456789, -5).

If you are defining a multi-column clustering key for a table, the order in which the columns are specified in the CLUSTER BY clause is important. As a general rule, Snowflake recommends ordering the columns from lowest cardinality to highest cardinality. Putting a higher-cardinality column before a lower-cardinality column will generally reduce the effectiveness of clustering on the latter column. THIS MAY SEEM BACKWARDS IF YOU COME FROM AN OLTP BACKGROUND, WHERE COMPOSITE KEY INDEXES GO FROM HIGH TO LOW CARDINALITY.

CLUSTER BY command examples (see the representative sketch below):
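The original guide showed a screenshot of CLUSTER BY commands at this point; the statements below are a representative sketch of the same pattern (the events table, its columns, and the key expressions are illustrative, not taken from the original screenshot):

-- Define a cluster key at table creation time
create table events (
    event_type    string,
    event_ts      timestamp,
    event_ts_text string,    -- timestamp stored as text, e.g. '2022-11-15 08:30:00'
    payload       variant
)
cluster by (event_type, to_date(event_ts));

-- Add or change the cluster key on an existing table;
-- the raw timestamp is reduced to a date (or a substring of the text value)
-- so the key expression stays low-cardinality
alter table events cluster by (event_type, to_date(event_ts));
alter table events cluster by (event_type, substring(event_ts_text, 1, 10));

-- Remove the cluster key altogether
alter table events drop clustering key;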
In examples like these, a substring or date function is applied to the timestamp column to make the cluster key have a lower cardinality. Clustering on the raw timestamp column would waste a lot of compute, as every value in the column is likely to be unique (high cardinality). Reducing the value to a date lowers the cardinality and makes the cluster key more efficient.

TIP: do not add a cluster key at the time of table creation. Let the table work itself out for a while before deciding whether to put a cluster key on it. Remember, adding clustering on a smaller table (less than a TB) often has minimal impact on query performance but could add significant compute cost to maintain it (particularly if the table changes frequently).

AUTOMATIC CLUSTERING CAN BE SUSPENDED OR RESUMED.

DEMO - CLUSTERING KEYS

-- set context for the worksheet
use role sysadmin;
use warehouse compute_wh;
use database thakur_demo;
use schema public;

-- let's start with a simple query
SELECT MAX(duration_minutes) FROM bike_trips;
-- this is a metadata-based query. No compute warehouse needed.

-- now let's add a filter condition to the query above.
SELECT MAX(duration_minutes) FROM bike_trips
WHERE start_station_name = '6th & Chalmers';
-- 1.01s
-- this definitely needed a warehouse to run, since individual
-- column values are not stored in the metadata for each micro-partition.
-- also notice how this is an inefficient query, as it required scanning
-- 11 out of 12 micro-partitions.
-- Can we do better?

-- let us define a cluster key on the START_STATION_NAME column.
-- now this is a small table, and defining a cluster key is
-- generally not recommended for tables smaller than the TB range,
-- but let's try it for the sake of demonstration.
-- any number of columns can be included in the cluster key, but it is generally
-- recommended to use 3 or fewer, ordered from low cardinality to high. This may seem
-- backwards for someone coming from an Oracle background, where indexes
-- are typically placed on higher-cardinality columns first. But here
-- in Snowflake we define cluster keys on low-cardinality columns so that
-- rows with common key values have a chance to be co-located in the
-- same micro-partitions. This results in better pruning of micro-partitions.
-- natural clustering gets overridden once we define a cluster key.
ALTER TABLE BIKE_TRIPS CLUSTER BY (START_STATION_NAME);
-- done

The table definition now shows a Cluster By key listing. 'LINEAR' indicates standard (linear) clustering on the listed key expressions - essentially a simple sort ordering applied to the cluster key.

-- now let's rerun the previous query.
SELECT MAX(duration_minutes) FROM bike_trips
WHERE start_station_name = '6th & Chalmers';
-- 707ms
-- this time only 1 out of 2 micro-partitions was scanned, although now the data itself is stored in just 2 micro-partitions compared to 12 before.
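As noted above, Automatic Clustering can be suspended and resumed per table, and its credit consumption can be reviewed afterwards. A minimal sketch (the table name reuses the demo's BIKE_TRIPS, and the ACCOUNT_USAGE view shown is subject to latency):

-- Pause and resume background reclustering for a clustered table
ALTER TABLE bike_trips SUSPEND RECLUSTER;
ALTER TABLE bike_trips RESUME RECLUSTER;

-- Review credits consumed by Automatic Clustering over the last 7 days
SELECT table_name,
       SUM(credits_used) AS clustering_credits
FROM snowflake.account_usage.automatic_clustering_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY table_name
ORDER BY clustering_credits DESC;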
DEMO - CLUSTERING DEPTH

Clustering depth computes the average depth of the table according to the specified columns (or the clustering key defined for the table). The average depth of a populated table (i.e. a table containing data) is always 1 or more. The smaller the average depth, the better clustered the table is with regard to the specified columns.

-- let's see this in action
-- set context for the worksheet
use role sysadmin;
use warehouse compute_wh;
use database thakur_demo;
use schema public;

-- we will use the DEMOGRAPHIC_STREAMING table for this demo
worksheet> select count(*) from demographic_streaming;
-- 2.36 million records

-- the access_date column was loaded as VARCHAR;
-- let's cast it as a DATETIME data type and create
-- a new cluster key based on the converted value
worksheet> ALTER TABLE DEMOGRAPHIC_STREAMING CLUSTER BY (cast(access_date as datetime));
-- good

-- let's see how well the clustering worked
worksheet> select min(cast(access_date as datetime)) as min_access_date,
                  max(cast(access_date as datetime)) as max_access_date
           from DEMOGRAPHIC_STREAMING;
-- Dec 09 to Dec 10 -- this was a bad cluster key (only two distinct days of data)

-- let's look at the cluster depth for this table
worksheet> select system$clustering_depth('DEMOGRAPHIC_STREAMING');
-- overall table cluster depth is 2

worksheet> select system$clustering_depth('DEMOGRAPHIC_STREAMING', '(cast(access_date as datetime))');
-- depth is 3 (which is also the total number of partitions for this table, which means this is a poor choice for a cluster key, or the table is just too small for it to have any impact)

worksheet> select system$clustering_depth('DEMOGRAPHIC_STREAMING', '(cast(access_date as datetime), age)');
-- depth is 3, same as the max number of micro-partitions

-- clustering_information is another system function, which provides broader information regarding the cluster key
worksheet> select system$clustering_information('DEMOGRAPHIC_STREAMING', '(cast(access_date as datetime), age)');

Reclustering

As DML operations (INSERT, UPDATE, DELETE, MERGE, COPY) are performed on a clustered table, the data in the table might become less clustered. Periodic/regular reclustering of the table is required to maintain optimal clustering.

During reclustering, Snowflake uses the clustering key for a clustered table to reorganize the column data, so that related records are relocated to the same micro-partition. This DML operation deletes the affected records and re-inserts them, grouped according to the clustering key.

Reclustering in Snowflake is automatic; no maintenance is needed.

NOTE: Go easy with cluster keys. Snowflake is not Redshift. Unlike Redshift, tables in Snowflake are automatically tuned by the optimization engine using the micro-partitioning architecture. Defining a cluster key on a small table has minimal impact on performance. For larger tables there could be benefits, but do weigh the pros and cons before defining a bunch of cluster keys.

Snowflake performs automatic reclustering in the background. All you need to do is define a clustering key for each table (if appropriate) and Snowflake manages all future maintenance.

Automatic Clustering consumes Snowflake credits, but does not require you to provide a virtual warehouse. Instead, Snowflake internally manages and achieves efficient resource utilization for reclustering the tables. Your account is billed only for the actual credits consumed by automatic clustering operations on your clustered tables.

Note that, after a clustered table is defined, reclustering does not necessarily start immediately. Snowflake only reclusters a clustered table if it will benefit from the operation.

EXAM QUESTION: This example indicates that the CUSTOMER table is not well-clustered, and hence the query will run slower, for the following reasons:
● The average overlap is too high, indicating that the table is not well-clustered.
● The average depth of 96 is too high (depth ranges from 1 upward).
● Most of the micro-partitions are grouped at the lower end of the histogram, with the majority of micro-partitions having an overlap depth between 64 and 128, meaning that there will be minimal partition pruning.

Data Storage Monitoring

If you have been assigned the ACCOUNTADMIN role (i.e. you serve as the top-level administrator for your Snowflake account), you can use Snowsight or the classic web interface to view data storage across your entire account (see "Getting Started With Snowsight" in the Snowflake documentation):

Snowsight: select Admin » Usage » Storage.
Classic Web Interface: click on Account » Billing & Usage » Average Storage Used.

The Usage page provides detailed information about your Snowflake consumption across all relevant services, in either credits or the appropriate currency. The Usage page is the default page for the Admin area.

Individual Table Storage

Of the three methods, TABLE_STORAGE_METRICS provides the most detailed information, because it includes a breakdown of the physical storage (in bytes) for table data in the following three states of the CDP life-cycle:
● Active (ACTIVE_BYTES column)
● Time Travel (TIME_TRAVEL_BYTES column)
● Fail-safe (FAILSAFE_BYTES column)

Additional details are in the Data Storage Pricing section.
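A minimal sketch of querying this per-table breakdown from the ACCOUNT_USAGE schema (this assumes access to the shared SNOWFLAKE database; the view's contents lag real time):

-- Top 10 tables by active storage, with Time Travel and Fail-safe bytes
SELECT table_catalog,
       table_schema,
       table_name,
       active_bytes,
       time_travel_bytes,
       failsafe_bytes
FROM snowflake.account_usage.table_storage_metrics
ORDER BY active_bytes DESC
LIMIT 10;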
Virtual Data Warehouses

A virtual warehouse is one or more compute clusters that enable customers to load data and perform queries. Customers pay for virtual warehouses using Snowflake credits.

Single-Cluster Configuration

Standard edition only allows a single-cluster configuration. Resizing a single-cluster VWH to a bigger size will always start all servers (nodes) for that configuration. You cannot control the count of nodes in a single-cluster configuration.

Multi-Cluster Configuration

By default, a virtual warehouse consists of a single cluster of compute resources available to the warehouse for executing queries. As queries are submitted to a warehouse, the warehouse allocates resources to each query and begins executing the queries. If sufficient resources are not available to execute all the queries submitted to the warehouse, Snowflake queues the additional queries until the necessary resources become available.

With multi-cluster warehouses, Snowflake supports allocating, either statically or dynamically, additional clusters to make a larger pool of compute resources available.

Warehouse Sizes

Snowflake supports a wide range of virtual warehouse sizes: X-Small, Small, Medium, Large, X-Large, 2X-Large, 3X-Large, and 4X-Large. The size of the virtual warehouse determines how fast queries will run. When a virtual warehouse is not running (that is, when it is suspended), it does not consume any Snowflake credits.

The different sizes of virtual warehouses consume Snowflake credits at different rates, billed by the second with a one-minute minimum.

The "t-shirt size" of your warehouse (XS, S, M, L, XL) will significantly impact your credit usage. Sizing your warehouses to perform well and remain cost-efficient is always a challenge, and it's a bit more art than science. But while you may need some trial and error, there are some rules of thumb for picking your starting point.

Begin with a size that roughly correlates to how much data the warehouse in question will be processing for a given query:
● XS - multiple megabytes (default size if the VWH is created using SnowSQL)
● S - a single gigabyte
● M - tens of gigabytes
● L - hundreds of gigabytes (default size if created using the Web UI)
● XL - terabytes

TIP: A warehouse for data ingestion is generally two sizes smaller than a warehouse for transformation or analytics.

Size specifies the amount of compute resources available per cluster in a warehouse. Snowflake supports the following warehouse sizes (current as of Oct 2022), with per-hour credit consumption doubling at each step up: X-Small (1 credit/hour), Small (2), Medium (4), Large (8), X-Large (16), 2X-Large (32), 3X-Large (64), and 4X-Large (128).

Cost Saving Tips

Enable these settings for "human users":
● LOCK_TIMEOUT (60 seconds)
● STATEMENT_TIMEOUT_IN_SECONDS (600)
● STATEMENT_QUEUED_TIMEOUT_IN_SECONDS (30)

Lock timeout determines how long a query will wait for a resource that is locked. For most users, if you are waiting for more than 60 seconds, there is likely an issue, and there is no reason to waste further warehouse credits.

A statement timeout prevents poorly optimized queries from running all weekend and racking up substantial charges. (Yes, this happens.) For human users who are waiting for results, ten minutes is a reasonable timeout. A user still has the option of increasing this timeout for a given session if they know their query will take an exceptional amount of time. Ten minutes may not be long enough for system users, but some reasonable limits will prevent multi-day queries from creating unexpected costs.

Warehouses that are overworked have queue times that slow queries down. If you do not have a queue timeout, users may sometimes repeatedly queue new queries until your warehouse is working for many hours to clear the queue. Thirty seconds is a good default for human users; if you find that queries are regularly queueing, consider making your warehouse a multi-cluster warehouse that scales on demand.
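A minimal sketch of applying these settings to a (hypothetical) human user. The same session parameters can also be set at the account level, and the two statement timeouts can additionally be set on a warehouse:

-- Timeouts for an interactive "human" user (user name is illustrative)
ALTER USER analyst_jane SET LOCK_TIMEOUT = 60;
ALTER USER analyst_jane SET STATEMENT_TIMEOUT_IN_SECONDS = 600;
ALTER USER analyst_jane SET STATEMENT_QUEUED_TIMEOUT_IN_SECONDS = 30;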
Query Processing & Concurrency

Impact on Data Loading

Increasing the size of a warehouse does not always improve data loading performance. Data loading performance is influenced more by the number of files being loaded (and the size of each file) than by the size of the warehouse. Unless you are bulk loading a large number of files concurrently (i.e. hundreds or thousands of files), a smaller warehouse (Small, Medium, Large) is generally sufficient. Using a larger warehouse (X-Large, 2X-Large, etc.) will consume more credits and may not result in any performance increase.

Impact on Query Processing

The size of a warehouse can impact the amount of time required to execute queries submitted to the warehouse, particularly for larger, more complex queries. In general, query performance scales with warehouse size, because larger warehouses have more compute resources available to process queries.

If queries processed by a warehouse are running slowly, you can always resize the warehouse to provision more compute resources. The additional resources do not impact any queries that are already running, but once they are fully provisioned they become available for use by any queries that are queued or newly submitted.

Larger is not necessarily faster for small, basic queries.

MAX_CONCURRENCY_LEVEL

Defines the level of parallelism for a virtual warehouse. The default is 8. It specifies the concurrency level for SQL statements (i.e. queries and DML) executed by a warehouse. When the level is reached, the operation performed depends on whether the warehouse is a single-cluster or multi-cluster warehouse:
● Single-cluster, or multi-cluster in Maximized mode: statements are queued until already-allocated resources are freed or additional resources are provisioned, which can be accomplished by increasing the size of the warehouse.
● Multi-cluster in Auto-scale mode: additional clusters are started.

MAX_CONCURRENCY_LEVEL can be used in conjunction with the STATEMENT_QUEUED_TIMEOUT_IN_SECONDS parameter to ensure a warehouse is never backlogged.

CANCELING LONG-RUNNING QUERIES

Scaling Up vs. Scaling Out

Multi-Cluster Warehouse

A multi-cluster warehouse is defined by specifying the following properties:
● Maximum number of clusters, greater than 1 (up to 10).
● Minimum number of clusters, equal to or less than the maximum (up to 10).

Additionally, multi-cluster warehouses support all the same properties and actions as single-cluster warehouses, including:
● Specifying a warehouse size.
● Resizing a warehouse at any time.
● Auto-suspending a running warehouse due to inactivity; note that this does not apply to individual clusters, but rather to the entire multi-cluster warehouse.
● Auto-resuming a suspended warehouse when new queries are submitted.

Cluster Count

For most transformation and ingest warehouses, you can leave the cluster default of one minimum and one maximum. However, for analytics warehouses, you may need to scale for usage. As a rule of thumb, you will need a max cluster count of roughly one for every ten concurrent requests. So, if you expect 20 concurrent queries and do not want them to queue, a multi-cluster count of two should be adequate. In this case, the max cluster count should also be two.

It is always best to start small and scale up as needed, so the configuration for the scenario above needs some consideration. Starting with a minimum cluster count of one will provide some cost savings, but as the number of concurrent queries ramps up, this may cause performance issues. If performance issues are encountered, increasing the minimum cluster count to two should be considered.

Multi-cluster Warehouse Modes
● Maximized - the same value for the max and min cluster count.
● Auto-Scaling - different values for the min and max cluster count.
○ Scaling Policy
■ Standard - don't wait; add more clusters as soon as queries start to queue.
■ Economy - wait and see; only add a cluster if there is estimated to be at least about 6 minutes of query load to keep it busy.

Use the ALTER WAREHOUSE command to change the max/min cluster count of a running virtual warehouse.
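A minimal sketch of creating a multi-cluster warehouse in Auto-scale mode and then adjusting it with ALTER WAREHOUSE (the warehouse name and values are illustrative; multi-cluster warehouses require Enterprise edition or higher):

-- Auto-scale mode: min and max cluster counts differ
CREATE WAREHOUSE analytics_wh
  WAREHOUSE_SIZE    = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3          -- roughly one cluster per ~10 concurrent queries
  SCALING_POLICY    = 'STANDARD'
  AUTO_SUSPEND      = 60         -- seconds of inactivity before suspending
  AUTO_RESUME       = TRUE;

-- Change the cluster range or scaling policy on a running warehouse
ALTER WAREHOUSE analytics_wh SET MAX_CLUSTER_COUNT = 4;
ALTER WAREHOUSE analytics_wh SET SCALING_POLICY = 'ECONOMY';

Setting the same value for MIN_CLUSTER_COUNT and MAX_CLUSTER_COUNT would instead run the warehouse in Maximized mode.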
You can choose to run a multi-cluster warehouse in either of the following modes:

Maximized

This mode is enabled by specifying the same value for both the maximum and minimum number of clusters (note that the specified value must be larger than 1). In this mode, when the warehouse is started, Snowflake starts all the clusters so that maximum resources are available while the warehouse is running.

This mode is effective for statically controlling the available compute resources, particularly if you have large numbers of concurrent user sessions and/or queries and the numbers do not fluctuate significantly.

Auto-scale

This mode is enabled by specifying different values for the maximum and minimum number of clusters. In this mode, Snowflake starts and stops clusters as needed to dynamically manage the load on the warehouse:
● As the number of concurrent user sessions and/or queries for the warehouse increases, and queries start to queue due to insufficient resources, Snowflake automatically starts additional clusters, up to the maximum number defined for the warehouse.
● Similarly, as the load on the warehouse decreases, Snowflake automatically shuts down clusters to reduce the number of running clusters and, correspondingly, the number of credits used by the warehouse.

Multi-Cluster Configuration

Available in Enterprise or higher editions of Snowflake. With multi-cluster warehouses, Snowflake supports allocating, either statically or dynamically, a larger pool of resources to each warehouse. Multi-cluster configuration is not available in the Standard edition.

A multi-cluster warehouse is defined by specifying the following properties:
● Maximum number of server clusters, greater than 1 (up to 10).
● Minimum number of server clusters, equal to or less than the maximum (up to 10).

A 4X-Large warehouse can scale up to 10 clusters, each with 128 nodes, so the maximum configuration of the largest multi-cluster warehouse would be 10 x 128 = 1,280 nodes.

Additionally, multi-cluster warehouses support all the same properties and actions as single-cluster warehouses, including:
● Specifying a warehouse size.
● Resizing a warehouse at any time.
● Auto-suspending a running warehouse due to inactivity; note that this does not apply to individual clusters, but rather to the entire warehouse.
● Auto-resuming a suspended warehouse when new queries are submitted.

Scaling Policies for Multi-Cluster Warehouses

To help control the credits consumed by a multi-cluster warehouse running in Auto-scale mode, Snowflake provides scaling policies, which are used to determine when to start or shut down a cluster.

Multi-Cluster Warehouse [Maximized]
Multi-Cluster Warehouse [Auto-Scaled]

The scaling policy for a multi-cluster warehouse only applies if it is running in Auto-scale mode. In Maximized mode, all clusters run concurrently, so there is no need to start or shut down individual clusters.

Snowflake supports the following scaling policies: Standard vs. Economy.

Standard: DO NOT WAIT - the moment you see queries starting to queue up, add a new cluster right away. More $$. The first cluster starts immediately when either a query is queued or the system detects that there's one more query than the currently running clusters can execute. Each successive cluster waits to start 20 seconds after the prior one has started. For example, if your warehouse is configured with 10 max clusters, it can take a full 200+ seconds to start all 10 clusters.

Economy: WAIT AND SEE - maybe it is just a blip; will the queries be in queue for longer than 6 minutes? If so, add another cluster; otherwise let the queries wait in the queue.

SCALING BACK

Scaling back using the Standard policy: quickly shut the extra clusters down. Clusters to be shut down are first marked for shutdown so that their current queries can complete; once the load dissipates on a marked cluster, it is shut down completely. A cluster is shut down after 2 to 3 consecutive successful checks (performed at 1-minute intervals), which determine whether the load on the least-loaded cluster could be redistributed to the other clusters without spinning the cluster up again.

Scaling back using the Economy policy: be lazy about removing clusters. A cluster is shut down only after 5 to 6 consecutive successful checks (performed at 1-minute intervals), which determine whether the load on the least-loaded cluster could be redistributed to the other clusters without spinning the cluster up again.
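One way to judge whether queueing really warrants more clusters is to look at recent warehouse load. A sketch using the INFORMATION_SCHEMA table function (the warehouse name is illustrative, and the role needs the MONITOR privilege on the warehouse):

-- Average running vs. queued query load for the last 8 hours
SELECT start_time,
       avg_running,
       avg_queued_load,          -- sustained values > 0 suggest adding clusters
       avg_queued_provisioning,
       avg_blocked
FROM TABLE(information_schema.warehouse_load_history(
         date_range_start => DATEADD('hour', -8, CURRENT_TIMESTAMP()),
         warehouse_name   => 'ANALYTICS_WH'))
ORDER BY start_time;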
Altering Query Behavior