Intro to Power Query and Power Pivot

Document Details

Uploaded by EffortlessFresno

Cagayan State University

John Omar D. Esguerra

Tags

power query, power pivot, data analysis, excel

Summary

This document provides an introduction to Power Query and Power Pivot for data analysis in Excel. It covers topics such as importing and analyzing large datasets, creating data models, automating data processing, and using Data Analysis Expressions (DAX).

Full Transcript

UNIT 4.1 Intro to Power Query and Power Pivot
John Omar D. Esguerra, Instructor

"The best thing to happen to Excel in 20 years" - Bill Jelen (aka "Mr. Excel")

1. Import and analyze millions of rows of data in Excel: access data from virtually anywhere (database tables, flat files, cloud services, folders, etc.)
2. Quickly build models to blend and analyze data across sources: instantly connect sources and analyze holistic performance across your entire data model
3. Create fully automated data shaping and loading procedures: connect to databases and watch data flow through your model with the click of a button
4. Define calculated measures using Data Analysis Expressions (DAX): no more redundant A1-style "grid" formulas; DAX expressions are flexible, powerful and portable

1. Import & Analyze Millions of Rows
When was the last time you loaded 25,000,000 rows of data into Excel? When you connect to data with Power Query and load it to Excel's Data Model, the data is compressed and stored in memory, NOT in worksheets (no more 1,048,576-row limit!)

2. Build Data Models to Blend Sources
This is an example of a Data Model in "Diagram View", which allows you to create connections between tables. Instead of manually stitching tables together with cell formulas, you create relationships to blend data based on common fields.

3. Automate Your Data Processing
With Power Query, you can filter, shape and transform your raw data before loading it into the data model. Each step is automatically recorded and saved with the query, and applied whenever the source data is refreshed, like a macro!

4. Create Powerful Measures with DAX
Measures are flexible and powerful calculations defined using Data Analysis Expressions (DAX). In this case we're using a DAX time intelligence formula to calculate a 10-day rolling average.

When To Use Power Query & Power Pivot
Use Power Query and Power Pivot when you want to:
1. Analyze more data than can fit into a worksheet
2. Create connections to databases or external sources
3. Blend data across multiple large tables
4. Automate the process of loading and shaping your data
5. Unleash the full business intelligence capabilities of Excel

LESSON 4.1.1 Power Query

Meet Power Query
Power Query (aka "Get & Transform") allows you to:
- Connect to data across a wide range of sources
- Filter, shape, append and transform raw data for further analysis and modeling
- Create stored procedures to automate your data prep (like a macro!)
The Power Query tools live in the Data tab, under the "Get & Transform" section (Excel 2016).

Types of Data Connections: From File, From Database, From Azure, From Online Services, From Other Sources

The Query Editor
Key elements of the editor: the query editing tools, the formula bar (this is "M" code), the table name box (name your table!), the data preview, and the Applied Steps pane. Access the Query Editor by creating a new query and choosing the "Edit" option, or by launching the Workbook Queries pane (Data > Show Queries) and right-clicking an existing query to edit.

Query Editor Tools
- The HOME tab includes general settings and common table transformation tools
- The TRANSFORM tab includes tools to modify existing columns (splitting/grouping, transposing, extracting text, etc.)
- The ADD COLUMN tools create new columns based on conditional rules, text operations, calculations, dates, etc.
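To make the idea of recorded query steps concrete, here is a minimal sketch of the M code a simple query might generate behind the scenes; the file path and column names are assumptions for illustration, not part of the course files:

    let
        // Connect to a source (here, a local CSV file)
        Source = Csv.Document(File.Contents("C:\data\transactions.csv"), [Delimiter = ","]),
        // Each shaping action becomes one recorded, named step
        PromotedHeaders = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
        ChangedType = Table.TransformColumnTypes(PromotedHeaders, {{"quantity", Int64.Type}}),
        FilteredRows = Table.SelectRows(ChangedType, each [quantity] > 0)
    in
        FilteredRows

Every entry in the Applied Steps pane maps to one of these named expressions, and the whole procedure re-runs from top to bottom on each refresh.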
Data Loading Options
When you load data from Power Query, you have several options: load to a Table (stores the data in a new or existing worksheet, so it requires relatively small data sets, since worksheets cap out at 1,048,576 rows), or load outside of worksheets, into the in-memory Data Model.

Creating a Rolling Calendar
1) Create a new blank query (Data > New Query > From Other Sources > Blank Query)
2) In the formula bar, generate a starting date by entering a "literal" (1/1/2013 shown below)
3) Click the fx icon to add a new custom step, and enter the formula exactly as shown
4) Convert the resulting list into a Table (List Tools > To Table) and format the column as a Date
5) Add calculated Date columns (Year, Month, Week, etc.) as necessary using the Add Column tools
(A sketch of the M code behind these steps appears at the end of this section.)

Adding an Index Column
Index Columns contain a list of sequential values that can be used to identify each unique row in a table (typically starting from 0 or 1). These columns are often used to create unique IDs that can be used to form relationships between tables (more on that later!)

Adding a Conditional Column
Conditional Columns allow you to define new fields based on logical rules and conditions (IF/THEN statements). In this case we're creating a new conditional column called "Order Size", which depends on the values in the "quantity" column, as follows:
- If quantity > 5, Order Size = "Large"
- If quantity is from 2-5, Order Size = "Medium"
- If quantity = 1, Order Size = "Small"
- Otherwise, Order Size = "Other"

Grouping & Aggregating Data
Group By allows you to aggregate your data at a different level (i.e. transform daily data into monthly, roll up transaction-level data by store, etc.). In this case we're transforming a daily, transaction-level table into a summary of "quantity" by "product_id". Note that we lose any field not specified in the Group By settings.

Pivoting & Unpivoting
"Pivoting" is a fancy way to describe the process of turning distinct row values into columns ("pivoting") or turning columns into rows ("unpivoting"). Imagine that the table is on a hinge; pivoting is like rotating it from a vertical to a horizontal layout, and unpivoting is like rotating it from horizontal to vertical.
NOTE: Transpose works very similarly, but doesn't recognize unique values; instead, the entire table is transformed so that each row becomes a column and vice versa.

Modifying Workbook Queries
Click on Show Queries to launch the Workbook Queries pane, then right-click any individual query to access common options and tools: Edit (launches the Query Editor), Delete, Rename, Refresh, Duplicate, Merge, Append.

Merging Queries
Merging queries allows you to join tables based on a common column (like VLOOKUP). In this case we're merging the FoodMart_Transactions_1997 table with the Product_Lookup table, which share a "product_id" column.
TIP: Merging adds columns to an existing table.
HEY THIS IS IMPORTANT! Just because you can merge tables, doesn't mean you should. In general, it's better to keep tables separate and define relationships between them (more on that later!)
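Returning to the rolling calendar above, a minimal M sketch of steps 2 and 3 might look like this; the 1/1/2013 start date comes from the example, while using today's date as the end point is an assumption for illustration:

    let
        // Step 2: a literal starting date
        StartDate = #date(2013, 1, 1),
        // Assumption: roll the calendar forward to today's date
        EndDate = DateTime.Date(DateTime.LocalNow()),
        // Step 3: generate one date per day from start through end
        DayCount = Duration.Days(EndDate - StartDate) + 1,
        DateList = List.Dates(StartDate, DayCount, #duration(1, 0, 0, 0))
    in
        DateList

Step 4 then converts the list to a table and types the column as Date, so each refresh extends the calendar automatically.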
Appending Queries
Appending queries allows you to combine (or stack) tables that share a common structure and set of columns. In this case we're appending the FoodMart_Transactions_1998 table to the FoodMart_Transactions_1997 table, since they contain the same set of columns and data types.
TIP: Appending adds rows to an existing table.
PRO TIP: Use the "From Folder" query option to automatically append all files from within the same folder.

Power Query Best Practices
- Give your queries clear and intuitive names, before loading the data. Define names immediately; updating query & table names later can be a headache, especially if you've already referenced them in calculated measures. Don't use spaces in table names (otherwise you have to surround them with single quotes).
- Do as much shaping as possible at the source of the data. Shaping data at the source (i.e. SQL, Access) minimizes the need for complex procedures in Power Query, and allows you to create new models without replicating the same process.
- When working with large tables, only load the data you need. Don't include hourly data when you only need daily, or product-level transactions when you only care about store-level performance; extra data will only slow you down.

LESSON 4.1.2 Power Pivot

Meet Power Pivot
A "Power" Pivot is just like a normal PivotTable, except it sits on top of an entire data model rather than a single table or range. This allows you to:
- Explore massive datasets consisting of multiple sources and tables, using familiar, user-friendly PivotTable tools and options
- Create powerful and flexible calculations using Data Analysis Expressions (DAX)
The Power Pivot tab includes tools to manage the data model and define new measures. (Note: you may need to enable this tab by selecting File > Options > Add-Ins > Manage COM Add-Ins)

Creating a "Power" Pivot Table
Option #1: From the Data Model. Option #2: From the Insert > PivotTable dialog box.

"Normal" Pivots vs. "Power" Pivots
Normal Pivot:
- Can analyze data from one table at a time; multiple tables must be flattened or "stitched" together with cell functions
- Restricted to the data capacity of a single Excel worksheet (1,048,576 rows)
- Limited to relatively basic calculated fields, using a sub-set of Excel functions
Power Pivot:
- Can analyze an entire data model, consisting of multiple tables connected via relationships rather than cell functions
- Virtually unlimited data capacity, as tables are compressed outside of normal worksheets
- Performs complex calculations using Data Analysis Expressions (DAX)
NOTE: It's not the PivotTable itself that's different; it's the data behind it. In the field list, a Power Pivot shows more tables, but no more "Calculated Fields".

No More "Calculated Fields"
Oh rats, where are my calculated fields?? One of the key Power Pivot features is the ability to create much more robust calculated fields, known as measures*. Because these measures interact directly with the data model (including tables stored in memory), traditional cell formulas won't do the trick. Instead, we'll use a new (but familiar) formula language called Data Analysis Expressions (DAX).
*Note: Depending on the version of Excel you're using, you might see these referred to as either "Measures" (Excel 2010, 2016) or "Calculated Fields" (Excel 2013)

Data Analysis Expressions (DAX)
Data Analysis Expressions, commonly known as DAX, is the formula language that drives Power Pivot.
With DAX, you can:
- Add calculated columns and measures to your model, using intuitive syntax
- Go beyond the capabilities of traditional "grid-style" formulas, with powerful functions built specifically to work with relational data
Two places to use DAX: 1) Adding Calculated Columns, 2) Adding Measures

Calculated Columns
Calculated columns allow you to add new, formula-based columns to tables:
- No "A1-style" references; calculated columns refer to entire tables or columns
- Calculated columns are computed at the row level, and values are stored with the table (this eats up memory)
- Calculated columns understand row context; they're great for defining new properties based on information in each row, but generally useless for aggregation (use measures instead)
HEY THIS IS IMPORTANT! As a rule of thumb, ONLY use calculated columns if you want to "stamp" static, fixed values to each row in a table (or use Power Query!). DO NOT use calculated columns for aggregation formulas, or to calculate fields for the "Values" area of a pivot (SUM, AVERAGE, COUNT, etc.)
PRO TIP: Calculated columns are typically placed in the Filters, Slicers, Rows or Columns areas of a pivot.

Creating Calculated Columns
Step 1: In the data model "Data View", choose a table and then select any cell in the "Add Column" section
Step 2: Enter a DAX function in the formula bar (we'll cover specific functions in the next section)
Step 3: Press "Enter", and all cells in the column will update

Calculated Columns: Good & Bad
Good: In this case we've added a calculated column called price_category, which equals "High" if the retail price is >$2, and "Low" otherwise (just like you would write in Excel!). Since calculated columns understand row context, a new value is calculated in each row based on that row's price. This is a valid use of calculated columns; it creates a new row "property" that we can now use to filter or segment any related data within the model.
Bad: Here we're using an aggregation function (SUM) to calculate a new column named total_revenue. Since calculated columns do not understand filter context, the same grand total is returned in every single row of the table. This is not a valid use of calculated columns; these values are statically "stamped" onto the table and can't be filtered, sliced, subdivided, etc.

DAX Measures
Measures are DAX formulas used to generate dynamic values within a PivotTable:
- Like calculated columns, measures reference entire tables or columns (no A1-style or "grid" references)
- Unlike calculated columns, measures don't actually live in the table; they get placed in the values area of a PivotTable and dynamically calculated in each individual cell
- Measures are evaluated based on the filter context of each cell, which is determined by the PivotTable layout (filters, slicers, rows and columns)
HEY THIS IS IMPORTANT! As a rule of thumb, use measures (vs. calculated columns) when a single row can't give you the answer (i.e. requires aggregation). Measures can ONLY be placed in the values area of a PivotTable.
PRO TIP: Use measures to create values that users can explore with a pivot (the Power Pivot version of a "Calculated Field").
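Before moving on, here is a minimal sketch contrasting the two patterns, using FoodMart-style names consistent with this lesson (the exact table and column names are assumptions):

    -- Calculated column: evaluated per row (row context), stored in the table
    price_category = IF ( Product_Lookup[retail_price] > 2, "High", "Low" )

    -- Measure: an aggregation, evaluated per PivotTable cell (filter context)
    Total Quantity := SUM ( Transactions[quantity] )

The column version stamps one static value onto every row; the measure version recomputes in every cell as the pivot layout changes.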
Creating Implicit Measures
STEP 1: Check the box next to a value field in a data table, or manually drag it into the "Values" box
STEP 2: Pat yourself on the back, you just created a measure!
HEY THIS IS IMPORTANT! Before you pop the champagne, there's a catch. When you drag a raw data field into the values section of a pivot, you create what's called an implicit measure. While there's nothing wrong with implicit measures, they are extremely limited. Explicit measures (defined using DAX) will give us much more flexibility, as well as the ability to reuse measures in multiple places (measure trees!). FROM NOW ON, JUST SAY "NO" TO IMPLICIT MEASURES.

Creating Explicit Measures (AutoSum)
AutoSum is a shortcut for creating simple DAX formulas (Sum, Average, Count, Distinct Count, Max and Min). To use AutoSum:
- Click on a cell in the Measures Pane (which sits beneath the data in the "Data View" of the model), within the column you want to evaluate
- Select the AutoSum menu and choose an option from the list
PRO TIP: AutoSum is a nice way to get comfortable with basic DAX and quickly add measures; just don't rely on it when things start to get more complicated!

Creating Explicit Measures (Power Pivot)
In the Measure dialog box, each measure is assigned to a table and given a measure name (as well as an optional description). The Formula pane contains the actual DAX code, as well as options to browse the formula library or check syntax. (Note: just start typing, and "Intellisense" will kick in to help you auto-populate formula names and tables.) Use the Formatting Options to specify a format for each measure.
PRO TIP: Ctrl+scroll adjusts formula text size.

Understanding Filter Context
Measures are calculated based on filter context, which is the set of filters (or "coordinates") determined by the PivotTable layout (filters, slicers, row labels and column labels).
HEY THIS IS IMPORTANT! Each measure cell in the pivot calculates independently, based on its coordinates (think of each cell as an island). When you change the pivot layout (by updating filters/slicers, row labels or column labels), each measure cell detects its new coordinates and then recalculates its value.
Example: the coordinate for one measure cell is Customer_Lookup[customer_city] = "Hidalgo". Given this coordinate, Excel filters down to the "Hidalgo" rows in the Customer_Lookup table, filters all related tables (based on the relationships in the data model), then evaluates the arithmetic defined by the measure (in this case Total Quantity equals the sum of quantity from the transactions data table). This cell does NOT add up the values above it (it's an island, remember?). Total rows represent a lack of filters; since a total cell does not have a customer_city coordinate, it evaluates the Total Quantity measure across the entire, unfiltered Customer_Lookup table.

Filter Context Examples
Example cell coordinates from the pivots shown include:
- Calendar_Lookup[Year] = 1997; Customer_Lookup[customer_country] = "USA"; Customer_Lookup[customer_city] = "Altadena"
- Calendar_Lookup[Year] = 1997; Customer_Lookup[customer_country] = "USA"
- Store_Lookup[store_country] = "Canada"; Store_Lookup[store_city] = "Vancouver"; Product_Lookup[product_brand] = "Akron"
- Calendar_Lookup[Year] = 1998; Customer_Lookup[customer_country] = "USA"
- Calendar_Lookup[Quarter] = 1; Store_Lookup[store_country] = "Canada"; Product_Lookup[product_brand] = "Amigo"
- Customer_Lookup[customer_country] = "USA"

Step-by-Step Measure Calculation
How exactly is this measure calculated?
REMEMBER: This all happens instantly behind the scenes, every time a measure cell calculates.
STEP 1: Detect pivot coordinates and apply filter context (e.g. Store_Lookup[store_country] = "USA")
STEP 2: Carry the filters "downstream" and apply them to all related tables (here, the FoodMart_Transactions and FoodMart_Returns tables)
STEP 3: Evaluate the measure formula against the filtered table: the sum of Transactions[quantity] when store_country = "USA" equals 555,899

Recap: Calculated Columns vs. Measures
Calculated Columns:
- Evaluated in the context of each row of the table to which they belong (row context)
- Append static values to each row in a table and store them in the model, increasing file size
- Only recalculated on data source refresh or changes to component columns
- Primarily used as rows, columns, slicers or filters; calculated columns "live" in tables
Measures:
- Evaluated in the context of each cell of the PivotTable in which they are displayed (filter context)
- Do not create new data in the tables themselves, and do not increase file size
- Recalculated in response to any change in the PivotTable view
- Can only be used as PivotTable values; measures "live" in PivotTables
*Note: Calculated columns CAN be placed in the values area of a pivot, but you can (and should) use a measure instead.

Power Pivot Best Practices
- Avoid using implicit measures whenever possible. Implicit measures are limited in functionality and restricted to the pivot in which they were created; explicit measures are more portable and powerful.
- Don't use a calculated column when a measure will do the trick. Only use calculated columns to "stamp" static, fixed values to each row in a table. Use measures when aggregation is necessary, or to create dynamic values in a pivot.
- Know your data model inside and out! It's easy to produce incorrect results in Power Pivot if you don't respect the model's table relationships, and errors are often difficult to spot without a thorough QA.

LESSON 4.1.3 Data Modeling
John Omar D. Esguerra, Instructor

Meet Excel's Data Model
The Data Model provides simple and intuitive tools for building relational databases directly in Excel. With the data model you can:
- Manage massive datasets that can't fit into worksheets
- Create table relationships to blend data across multiple sources
- Define custom hierarchies and perspectives
Access the Data Model through the Power Pivot tab or the Data tab. (Note: you may need to enable the Power Pivot tab via File > Options > Add-Ins > Manage COM Add-Ins)

The Data Model Window
The Data Model opens in a separate Excel window, where you can view your data tables, calculate new measures, and define table relationships. Note: Closing the Data Model window does NOT close your Excel workbook.

Data View vs. Diagram View
In Data View, tables are organized in tabs; in Diagram View, tables are organized as objects.

Database Normalization
Normalization is the process of organizing the tables and columns in a relational database to reduce redundancy and preserve data integrity. It is commonly used to:
- Eliminate redundant data to decrease table sizes and improve processing speed & efficiency
- Minimize errors and anomalies from data modifications (inserting, updating or deleting records)
- Simplify queries and structure the database for meaningful analysis
In a normalized database, each table should serve a distinct and specific purpose (i.e. product information, calendar fields, transaction records, customer attributes, etc.)
When you don't normalize, you end up with tables full of duplicate records; all of the duplicate product records could be eliminated with a lookup table based on product_id. This may not seem critical now, but minor inefficiencies can become major problems as databases scale in size.

Data Tables vs. Lookup Tables
Models generally contain two types of tables: data (or "fact") tables, and lookup (or "dimension") tables.
- Data tables contain numbers or values, typically at the most granular level possible, with ID or "key" columns that can be used to connect to each lookup table
- Lookup tables provide descriptive, often text-based attributes about each dimension in a table
For example, a Calendar Lookup table provides additional attributes about each date (month, year, weekday, quarter, etc.), and a Product Lookup table provides additional attributes about each product (brand, product name, sku, price, etc.). The data table contains "quantity" values, and connects to the lookup tables via its "date" and "product_id" columns.

Primary & Foreign Keys
Primary keys uniquely identify each row of a lookup table, and match the foreign keys in related data tables. Foreign keys contain multiple instances of each value, and are used to match the primary keys in related lookup tables.

Relationships vs. Merged Tables
"Can't I just merge queries or use LOOKUP or RELATED functions to pull those attributes into the fact table itself, so that I have everything in one place??" - Anonymous confused man
Sure, but it's extremely inefficient. Merging the original fact table fields with attributes from the Calendar Lookup and Product Lookup tables creates redundant data and utilizes significantly more memory and processing power than creating relationships between multiple small tables.

Creating Table Relationships
Option 1: Click and drag relationships in Diagram View. Option 2: Use "Create Relationship" in the Design tab.
TIP: Always drag relationships from the data table to the lookup tables.

Connecting Lookups to Lookups
PRO TIP: Models with multiple related lookup tables are called "snowflake" schemas; models with a single table for each lookup or dimension are called "star" schemas.
This Transactions data table can connect to Store_Lookup using store_id, but does not contain a region_id to connect to the Region_Lookup table. By creating a relationship between Store_Lookup and Region_Lookup (using region_id), we have essentially connected Transactions with Region_Lookup; filter context will now flow all the way down the chain.

Modifying Table Relationships
The Manage Relationships window allows you to create, edit or delete any connection in the data model. Use this to see all table relationships, as well as table names, cardinality and filter direction. Note: double-click a single connection in Diagram View to edit an individual relationship.

Active vs. Inactive Relationships
We can connect the Calendar_Lookup and FoodMart_Transactions tables on both transaction_date and stock_date; however, only one relationship can be active at a time. To make a connection active or inactive, double-click the connection and check the box, or right-click the relationship line itself. (Note: you must deactivate one before activating another!)
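The deck doesn't show the formula side of inactive relationships, but for reference, the standard DAX way to invoke one inside a single measure is the USERELATIONSHIP function. A sketch, assuming the column names above and an existing [Total Quantity] measure:

    -- Evaluate [Total Quantity] over the inactive stock_date relationship,
    -- without changing which relationship is active in the model
    Total Quantity (by Stock Date) :=
    CALCULATE (
        [Total Quantity],
        USERELATIONSHIP ( FoodMart_Transactions[stock_date], Calendar_Lookup[date] )
    )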
Relationship Cardinality
Cardinality refers to the uniqueness of values in a column. In Power Pivot, all relationships in a data model should follow a "one-to-many" cardinality: each column (or "key") used to join tables can only have one instance of each unique value in the lookup table (these are the primary keys), but may have many instances of each unique value in the data table (these are the foreign keys).
In this case we're joining the Calendar_Lookup table to the FoodMart_Transactions data table using the date column as our key. There is only one instance of each date in the lookup table (noted by the "1"), but many instances of each date in the data table (noted by the asterisk "*"), since multiple transactions occur each day.

Bad Cardinality: Many-to-Many
If we try to connect these tables using the product_id field, we'll have a many-to-many relationship, since there are multiple instances of each ID in both tables. Even if we could create this relationship in Power Pivot, how would you know which product was actually sold on each date: Cream Soda or Diet Cream Soda?

Bad Cardinality: One-to-One
In this case, connecting the tables above using the product_id field creates a one-to-one relationship, since each ID only appears once in each table. Unlike many-to-many, there is nothing illegal about this relationship; it's just inefficient. To eliminate the inefficiency, you could simply merge the two tables into a single, valid lookup. Note: this still respects the laws of normalization, since all rows are unique and directly related to the primary key.

Connecting Multiple Data Tables
Here we've loaded a second data table named Returns, containing records of returns by date, product and store. This table connects to each lookup exactly like the Transactions table did, except that there is no way to connect the Returns table to Customer_Lookup. This allows us to analyze data across both tables in the same pivot, as long as we only filter or segment the data using lookups that are common to both. In other words, we know which product was returned, which store it was returned to, and which date the return occurred, but NOT which customer was responsible.
HEY THIS IS IMPORTANT! NEVER try to connect data tables directly to each other; ALWAYS connect them indirectly via shared lookup tables!

Filter Direction is Important
This model includes two data tables (Transactions and Returns), both connected to Calendar_Lookup. Note the filter directions (shown as arrows) in each relationship; in Power Pivot (2016) these will always point from the "one" side of the relationship (lookups) to the "many" side (data tables).* Filtering a table will impact any tables "downstream" of it, as defined by the filter relationship (i.e. the direction of the arrow).
Let's say we're analyzing both Transactions and Returns in the same PivotTable: filtering by the Calendar_Lookup date field will return correctly filtered data from both data tables, but filtering by the Transactions date field will yield unfiltered Returns values.
PRO TIP: Arrange your lookup tables above your data tables in Diagram View to remind you that filters always flow "downstream".
*Note: In Excel 2010/2013 the Diagram View looks a bit different, and arrows point in the opposite direction by default.

Filter Direction is Important (Cont.)
Calendar_Lookup filters flow "down" to both the Transactions and Returns tables, so we can filter or segment those metrics using any field from the Calendar table. Filtering by date in the Transactions table, however, yields incorrect, unfiltered values from the Returns table, since filter context cannot flow "upstream" to the Calendar table.

Hiding Fields from Client Tools
When you hide a field from Client Tools, you make it invisible to tools outside of the data model (i.e. PivotTables). This can be used to prevent users from filtering or segmenting on invalid fields, or to hide irrelevant metrics from view.
PRO TIP: Always hide the foreign key columns in your data tables to prevent users from accidentally filtering on them!

Defining Hierarchies
Hierarchies are groups of nested columns that reflect multiple levels of granularity. For example, a "Geography" hierarchy might include Country, State, and City columns. Each hierarchy is treated as a single item in PivotTables and PivotCharts, allowing users to "drill up" and "drill down" through different levels of the hierarchy in a meaningful way. In Diagram View, drag fields to create a hierarchy, or right-click a field to see the hierarchy options; the hierarchies then appear in Power Pivot.

Data Model Best Practices
- Normalize your data model before you do anything else. Make sure that each table in your model serves a single, distinct purpose. Use relationships vs. merged tables; long & narrow tables are better than short & wide.
- Organize lookup tables above data tables in the Diagram View. This serves as a visual reminder that filters always flow "downstream".
- Hide fields from client tools to prevent invalid filter context. All foreign key columns should be hidden from data tables, so that users are only able to use valid fields for filtering and segmentation.

LESSON 4.1.4 Common DAX Functions

DAX Syntax
Anatomy of a measure: Total Quantity := SUM(Transactions[quantity]). Here [Total Quantity] is the measure name, SUM is the function name, Transactions is the referenced table, and quantity is the referenced column.
Note: Measures are always surrounded in brackets (i.e. [Total Quantity]) when referenced in formulas, so spaces are OK.
Transactions[quantity] is a "fully qualified" column reference, since it's preceded by the table name. Table names with spaces must be surrounded by single quotes: without a space, Transactions[quantity]; with a space, 'Transactions Table'[quantity].
Calculated columns don't always use functions, but measures do: in a calculated column, =Transactions[quantity] returns the value from the quantity column in each row (since it evaluates for each row); in a measure, =Transactions[quantity] will return an error, since Excel doesn't know how to evaluate that as a single value in a pivot (you need some sort of aggregation).
PRO TIP: For column references, use the fully qualified name (i.e. Table[Column]). For measure references, just use the measure name (i.e. [Measure]).
DAX Operators
Arithmetic operators: + addition (2+7), - subtraction (5-3), * multiplication (2*6), / division, ^ exponent (2^5).
Comparison operators: = equal to ([City]="Boston"), > greater than ([Quantity]>10), < less than, >= greater than or equal to ([Unit_Price]>=2.5), <= less than or equal to.
Logical tests follow the familiar Excel IF pattern, e.g. returning "Large" when a value exceeds 30,000, otherwise "Small".

SWITCH & SWITCH(TRUE)
SWITCH() evaluates an expression against a list of values and returns one of multiple possible result expressions:
=SWITCH(<expression>, <value>, <result>, <value>, <result>, ..., <else>)
The expression is any DAX expression that returns a single scalar value, evaluated once per row/case (examples: Calendar_Lookup[month_num], Product_Lookup[product_brand]). Each <value> is paired with a <result> to return for rows/cases that match the expression, and <else> is the value returned if the expression doesn't match any value argument.
Example: =SWITCH(Calendar_Lookup[month_num], 1, "January", 2, "February", ...)
PRO TIP: SWITCH(TRUE()) lets you evaluate a sequence of logical conditions on [retail_price] instead of exact values, e.g. banding prices into "High" if the price is above $3, "Medium" if the price is between $2 and $3, and "Low" otherwise.

CALCULATE
CALCULATE() evaluates a given expression (often a measure) with modified filter context:
=CALCULATE(<expression>, <filter1>, <filter2>, ...)
Filter examples: Store_Lookup[store_country] = "USA", [quantity] >= 5.
PRO TIP: CALCULATE works just like SUMIF or COUNTIF, except it can evaluate measures based on ANY sort of calculation (not just a sum, count, etc.); it may help to think of it like "CALCULATEIF".

CALCULATE (Example)
Why do we see the same repeating value when we add store_country to rows? Shouldn't these cells have filter contexts for Canada and Mexico?
HEY THIS IS IMPORTANT! The CALCULATE function modifies filters and overrules any competing ones defined by the PivotTable coordinates! In this case we've defined a new measure named "USA Transactions", which evaluates the "Total Transactions" measure when the store country equals "USA". The MEXICO cell has a filter context of store_country = "MEXICO" (defined by the row label) AND store_country = "USA" (defined by the CALCULATE function). Both cannot be true at the same time, so the MEXICO filter is overwritten and CALCULATE takes priority.

CALCULATE Changes the Filter Context
If the measure being evaluated contains a CALCULATE function, filter context is modified between Step 1 and Step 2:
STEP 1: Detect pivot coordinates and apply filter context (e.g. Store_Lookup[store_country] = "MEXICO")
Modified: because the measure contains CALCULATE, the filter becomes Store_Lookup[store_country] = "USA"
STEP 2: Carry the filters across all table relationships
STEP 3: Evaluate the formula against the filtered table: Total Transactions where store_country = "USA" = 180,823

FILTER
FILTER() returns a table that represents a subset of another table or expression:
=FILTER(<table>, <filter expression>)
The first argument is the table to be filtered (examples: Store_Lookup, Product_Lookup); the second is a Boolean (True/False) filter expression to be evaluated for each row of the table (examples: Store_Lookup[store_country]="USA", Calendar[Year]=1998, [retail_price]>AVERAGE([retail_price])).
HEY THIS IS IMPORTANT! FILTER is used to add filter context on top of what's already defined by the PivotTable layout. Since FILTER returns a table (as opposed to a scalar), it's almost always used as an input to other functions, like enabling more complex filtering options within a CALCULATE function (or passing a filtered table to an iterator like SUMX).
PRO TIP: Since FILTER iterates through each row in a table, it can be slow and processor-intensive; never use FILTER when a normal CALCULATE function will accomplish the same thing!
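Written out, the two patterns above might look like the following sketch; [Total Transactions] is assumed here to be a simple base measure (the COUNTROWS definition is an assumption, since the deck doesn't show it):

    -- Base measure (assumed definition)
    Total Transactions := COUNTROWS ( FoodMart_Transactions )

    -- CALCULATE with a simple filter: overrides any store_country coordinate
    USA Transactions :=
    CALCULATE ( [Total Transactions], Store_Lookup[store_country] = "USA" )

    -- CALCULATE with a FILTER table, for a condition a simple filter can't express
    Transactions Above Average Price :=
    CALCULATE (
        [Total Transactions],
        FILTER (
            Product_Lookup,
            Product_Lookup[retail_price] > AVERAGE ( Product_Lookup[retail_price] )
        )
    )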
Pro Tip: Filtering with Disconnected Slicers (Part 1)
STEP 1: Create an Excel table containing a list of values to use as thresholds or parameters
STEP 2: Add the table to the Data Model (from the Power Pivot tab)
STEP 3: Make sure that your table loaded, and is NOT connected to any other table in the model

Pro Tip: Filtering with Disconnected Slicers (Part 2)
STEP 4: Place your new table on the pivot as a slicer
STEP 5: Create a measure to capture the slicer selection, then reference it in a FILTER statement within CALCULATE. The "Transactions Under Price Threshold" measure calculates Total Transactions when the product price is below the selected threshold.

FILTER (Examples)
Calculate Total Transactions only for cases where the product price is below a selected threshold; or calculate Total Revenue, but only for USA stores.

ALL
ALL() returns all rows in a table, or all values in a column, ignoring any filters that have been applied:
=ALL(<table> or <column>, [column1], [column2], ...)
The first argument is the table or column that you want to clear filters on (examples: Transactions, Product_Lookup[product_brand]); the optional column arguments list additional columns to clear filters on (example: Customer_Lookup[customer_city], Customer_Lookup[customer_country]).
Notes: If your first parameter is a table, you can't specify additional columns. All columns must include the table name, and come from the same table.
PRO TIP: ALL is like the opposite of FILTER; instead of adding filter context, ALL removes filter context. This is often used when you need unfiltered values that won't be skewed by the PivotTable layout (i.e. category sales as % of total).

ALL (Example)
In this example, we use ALL to calculate total transactions across all rows in the Transactions table, ignoring any filter context from the PivotTable. By dividing the original [Total Transactions] measure (which responds to PivotTable filter context as expected) by the new [All Transactions] measure, we can correctly calculate the percentage of the total no matter how the PivotTable is filtered.

RELATED
RELATED() returns related values in each row of a table using relationships with other tables:
=RELATED(<column>)
The argument is the column that contains the values you want to retrieve (examples: Product_Lookup[product_brand], Store_Lookup[store_country]).
HEY THIS IS IMPORTANT! RELATED works almost exactly like a VLOOKUP function: it uses the relationship between tables (defined by primary and foreign keys) to pull values from one table into a new column of another. Since this function requires row context, it can only be used as a calculated column or as part of an iterator function that cycles through all rows in a table (FILTER, SUMX, MAXX, etc.)
PRO TIP: Avoid using RELATED to create redundant calculated columns unless you absolutely need them, since those extra columns increase file size; instead, use RELATED within a function like FILTER or SUMX.

RELATED (Examples)
Retrieve the retail price from the Product_Lookup table and append it to the Transactions table; or multiply the quantity in each row of the Transactions table by the related retail price from the Product_Lookup table, and sum the results.
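Sketches of the ALL and RELATED patterns just described, assuming the [Total Transactions] base measure and the FoodMart table names used in this lesson:

    -- Remove all filter context from the Transactions table
    All Transactions :=
    CALCULATE ( [Total Transactions], ALL ( Transactions ) )

    -- Share of total that survives any PivotTable layout
    % of Total Transactions :=
    DIVIDE ( [Total Transactions], [All Transactions] )

    -- RELATED needs row context, so it lives in a calculated column or an iterator
    Total Revenue :=
    SUMX (
        Transactions,
        Transactions[quantity] * RELATED ( Product_Lookup[retail_price] )
    )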
Iterator ("X") Functions
Iterator (or "X") functions allow you to loop through the same calculation or expression on each row of a table, and then apply some sort of aggregation to the results (SUM, MAX, etc.):
=SUMX(<table>, <expression>)
The first argument is the table in which the expression will be evaluated (examples: Transactions, FILTER(Transactions, RELATED(Store_Lookup[store_country])="USA")); the second is the expression to be evaluated for each row of the given table (examples: [Total Transactions], Transactions[price] * Transactions[quantity]). The aggregation is then applied to the calculated rows. In this example we're looking at SUMX, but all "X" functions (COUNTX, AVERAGEX, RANKX, MAXX/MINX) follow a similar syntax.
PRO TIP: Imagine the function adding a temporary new column to the table, calculating the value in each row (based on the expression) and then applying the aggregation to that new column (like SUMPRODUCT).

Iterator ("X") Functions (Examples)
Multiply quantity and retail price for each row in the Transactions table and sum the results; or calculate the rank of each product brand, based on total revenue.

Basic Date & Time Functions
- DAY()/MONTH()/YEAR(): returns the day of the month (1-31), month of the year (1-12), or year of a given date; =DAY/MONTH/YEAR(<date>)
- HOUR()/MINUTE()/SECOND(): returns the hour (0-23), minute (0-59), or second (0-59) of a given datetime value; =HOUR/MINUTE/SECOND(<datetime>)
- TODAY()/NOW(): returns the current date or exact time; =TODAY()/NOW()
- WEEKDAY()/WEEKNUM(): returns a weekday number from 1 (Sunday) to 7 (Saturday), or the week number of the year; =WEEKDAY/WEEKNUM(<date>, <return type>)
- EOMONTH(): returns the date of the last day of the month, +/- a specified number of months; =EOMONTH(<start date>, <months>)
- DATEDIFF(): returns the difference between two dates, based on a selected interval; =DATEDIFF(<start date>, <end date>, <interval>)

Basic Date & Time Functions (Examples)
Calculate the time difference between the customer birthdate and the current date, in years; or calculate the end date of the month for each row in the Calendar_Lookup table.

Time Intelligence Formulas
Time Intelligence functions allow you to easily calculate common time comparisons:
- Performance To-Date: =CALCULATE(<expression>, DATESYTD(Calendar[Date])); use DATESQTD for quarters or DATESMTD for months
- Previous Period: =CALCULATE(<expression>, DATEADD(Calendar[Date], -1, MONTH)); select an interval (DAY, MONTH, QUARTER, or YEAR) and the number of intervals to compare (i.e. previous month, rolling 10-day)
- Running Total: =CALCULATE(<expression>, DATESINPERIOD(Calendar[Date], MAX(Calendar[Date]), -10, DAY))
PRO TIP: To calculate a moving average, use the running total calculation above and divide by the number of intervals!
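Concrete versions of these three patterns might look like the following sketch; the Calendar_Lookup[date] column and [Total Revenue] measure are assumed names based on this model, and the 10-day window mirrors the rolling-average example from earlier in the unit:

    -- Performance to-date
    Revenue YTD :=
    CALCULATE ( [Total Revenue], DATESYTD ( Calendar_Lookup[date] ) )

    -- Previous period (one month back)
    Revenue Prev Month :=
    CALCULATE ( [Total Revenue], DATEADD ( Calendar_Lookup[date], -1, MONTH ) )

    -- 10-day running total, then a moving average
    Revenue Running 10 Days :=
    CALCULATE (
        [Total Revenue],
        DATESINPERIOD ( Calendar_Lookup[date], MAX ( Calendar_Lookup[date] ), -10, DAY )
    )

    Revenue 10 Day Avg := DIVIDE ( [Revenue Running 10 Days], 10 )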
Speed & Performance Considerations
- Avoid using unnecessary slicers, or consider disabling cross-filtering. When you use multiple slicers, they "cross-filter" by default; in other words, options in Slicer B are automatically grayed out if they aren't relevant given a selected value in Slicer A. To disable, select Slicer Tools > Slicer Settings and uncheck "Visually indicate items with no data".
- Eliminate redundant columns; keep data tables narrow. Data tables should ideally only contain quantitative values and foreign keys; any extra descriptive columns should live in a related lookup table.
- Imported columns are better than calculated columns. When possible, create calculated columns at the source (i.e. in your raw database) or using Power Query; this is more efficient than processing those calculations in the Data Model/Power Pivot.
- Minimize iterator functions (FILTER, SUMX, etc.). Functions that cycle through each row in a table are "expensive", meaning that they take time and consume processing power.

DAX Best Practices
- Write measures for even the simplest calculations (i.e. Sum of Sales). Once you create a measure it can be used anywhere in the workbook and as an input to other, more complex calculations.
- Break measures down into simple, component parts. DAX is a difficult language to master; focus on practicing and understanding simple components at first, then assemble them into more advanced formulas.
- Reference columns with the table name, and measures alone. Using "fully qualified" column references (preceded by the table name) helps make formulas more readable and intuitive, and differentiates them from measure references.

UNIT 4.2 Intro to Data Visualization
John Omar D. Esguerra, Instructor

Finding the Story in Your Data
Information can be visualized in a number of ways, each of which can provide a specific insight. When you start to work with your data, it's important to identify and understand the story you are trying to tell and the relationship you are looking to show. Knowing this information will help you select the proper visualization to best deliver your message. When analyzing data, search for patterns or interesting insights that can be a good starting place for finding your story, such as:
- TRENDS (example: ice cream sales over time)
- CORRELATIONS (example: ice cream sales vs. temperature)
- OUTLIERS (example: ice cream sales in an unusual region)

What is Data Visualization?
Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.
Why is it important?
- Improves comprehension: helps simplify complex data
- Reveals trends and patterns: makes it easier to spot key insights
- Enhances decision-making: data-driven insights lead to better decisions
- Engages audiences: people are more likely to understand and retain information presented visually

Elements of Data Visualization
- Charts and graphs: the most common tools to represent data visually
- Colors: help differentiate between categories or data series
- Shapes and sizes: used in scatter plots, bubble charts, and other complex visualizations
- Labels and legends: help identify data series, categories, or units of measurement

Common Types of Charts
- Bar chart: used to compare quantities of different categories (example: sales across different regions)
- Line chart: shows trends over time (example: stock price movements over the last 5 years)
- Histogram: shows the distribution of a continuous variable (example: distribution of ages in a population)
- Pie chart: represents parts of a whole (example: market share of companies)
- Scatter plot: shows relationships between two variables (example: correlation between hours studied and exam scores)

Bar Chart
Bar charts are very versatile. They are best used to show change over time, compare different categories, or compare parts of a whole.
Variations of Bar Charts:
- VERTICAL (COLUMN CHART), e.g. "Page Views, By Month": best used for chronological data (time series should always run left to right), or when visualizing negative values below the x-axis
- HORIZONTAL, e.g. "Content Published, By Category": best used for data with long category labels
Variations of Bar Charts (cont.):
- STACKED, e.g. "Monthly Traffic, By Source": best used when there is a need to compare multiple part-to-whole relationships. These can use discrete or continuous data, oriented either vertically or horizontally.
- 100% STACKED, e.g. "Percentage of Content Published, By Month": best used when the total value of each category is unimportant and the percentage distribution of subcategories is the primary message.

Bar Chart Design Best Practices
- Use horizontal labels: avoid steep diagonal or vertical type, as it can be difficult to read
- Space bars appropriately: space between bars should be 1/2 bar width
- Start the y-axis value at 0: starting at a value above zero truncates the bars and doesn't accurately reflect the full value
- Use consistent colors: use one color for bar charts; you may use an accent color to highlight a significant data point
- Order data appropriately: order categories alphabetically, sequentially, or by value

Pie Chart
Pie charts are best used for making part-to-whole comparisons with discrete or continuous data. They are most impactful with a small data set.
Variations of Pie Charts:
- STANDARD: used to show part-to-whole relationships
- DONUT: stylistic variation that enables the inclusion of a total value or design element in the center
THE CASE AGAINST THE PIE CHART: The pie chart is one of the most popular chart types. However, some critics, such as data visualization expert Stephen Few, are not fans. They argue that we are really only able to gauge the size of pie slices if they are in familiar percentages (25%, 50%, 75%, 100%) and positions, because those are common angles. We interpret other angles inconsistently, making it difficult to compare relative sizes and therefore less effective.

Pie Chart Design Best Practices
- Don't use multiple pie charts for comparison: slice sizes are very difficult to compare side-by-side; use a stacked bar chart instead
- Make sure all data adds up to 100%: verify that values total 100% and that pie slices are sized proportionate to their corresponding value
- Visualize no more than 5 categories per chart: it is difficult to differentiate between small values, and depicting too many slices decreases the impact of the visualization. If needed, you can group smaller values into an "other" or "miscellaneous" category, but make sure it does not hide interesting or significant information.
- Order slices correctly. There are two ways to order sections, both of which are meant to aid comprehension. Option 1: place the largest section at 12 o'clock, going clockwise; place the second largest section at 12 o'clock, going counterclockwise; the remaining sections can be placed below, continuing counterclockwise. Option 2: start the largest section at 12 o'clock, going clockwise; place the remaining sections in descending order, going clockwise.

Line Chart
Line charts are used to show time-series relationships with continuous data (e.g. "Direct Marketing Views, By Date"). They help show trend, acceleration, deceleration, and volatility.

Line Chart Design Best Practices
- Include a zero baseline if possible: although a line chart does not have to start at a zero baseline, it should be included if possible. If relatively small fluctuations in data are meaningful (e.g., in stock market data), you may truncate the scale to showcase these variances.
- Don't plot more than 4 lines: if you need to display more, break them out into separate charts for better comparison
- Use solid lines only: dashed and dotted lines can be distracting
- Label the lines directly: this lets readers quickly identify lines and corresponding labels instead of referencing a legend
- Use the right height: plot all data points so that the line chart takes up approximately two-thirds of the y-axis' total scale
Scatter Plot
Scatter plots show the relationship between items based on two sets of variables (e.g. "Revenue, By Product Family"). They are best used to show correlation in a large amount of data.

Scatter Plot Design Best Practices
- Start the y-axis value at 0: starting the axis above zero truncates the visualization of values
- Include more variables: use size and dot color to encode additional data variables
- Use trend lines: these help draw correlation between the variables to show trends
- Don't compare more than 2 trend lines: too many lines make data difficult to interpret

Other Charts: Area Chart
Area charts depict a time-series relationship, but they are different from line charts in that they can represent volume.
Variations of Area Charts:
- Area chart: best used to show or compare a quantitative progression over time
- Stacked area: best used to visualize part-to-whole relationships, helping show how each category contributes to the cumulative total
- 100% stacked area: best used to show the distribution of categories as part of a whole, where the cumulative total is unimportant

Area Chart Design Best Practices
- Start the y-axis value at 0: starting the axis above zero truncates the visualization of values
- Don't display more than 4 data categories: too many will result in a cluttered visual that is difficult to decipher
- Make it easy to read: in stacked area charts, arrange data to position categories with highly variable data on the top of the chart and low variability on the bottom
- Use transparent colors: in standard area charts, ensure data isn't obscured in the background by ordering thoughtfully and using transparency
- Don't use area charts to display discrete data: the connected lines imply intermediate values, which only exist with continuous data

Other Charts: Bubble Chart
Bubble charts are good for displaying nominal comparisons or ranking relationships.
Variations of Bubble Charts:
- BUBBLE PLOT: a scatter plot with bubbles, best used to display an additional variable
- BUBBLE MAP: best used for visualizing values for specific geographic regions

Bubble Chart Design Best Practices
- Make sure labels are visible: all labels should be unobstructed and easily identified with the corresponding bubble
- Size bubbles appropriately: bubbles should be scaled according to area, not diameter
- Don't use odd shapes: avoid adding too much detail or using shapes that are not entirely circular; this can lead to inaccuracies

Other Charts: Heatmap
Heat maps display categorical data, using intensity of color to represent values of geographic areas or data tables (e.g. "States With New Service Contracts").

Heatmap Design Best Practices
- Use a simple map outline: these lines are meant to frame the data, not distract
- Select colors appropriately: some colors stand out more than others, giving unnecessary weight to that data. Instead, use a single color with varying shade or a spectrum between two analogous colors to show intensity. Also remember to intuitively code color intensity according to values.
- Use patterns sparingly: a pattern overlay that indicates a second variable is acceptable, but using multiple is overwhelming and distracting
- Choose appropriate data ranges: select 3-5 numerical ranges that enable fairly even distribution of data between them; use +/- signs to extend high and low ranges
10 Data Design Dos and Don'ts
Designing your data doesn't have to be overwhelming. With a basic understanding of how different data sets should be visualized, along with a few fundamental design tips and best practices, you can create more accurate, more effective data visualizations. Follow these 10 tips to ensure your design does your data justice.
1 | Do use one color to represent each category
2 | Do order data sets using logical hierarchy
3 | Do use callouts to highlight important or interesting information
4 | Do visualize data in a way that is easy for readers to compare values
5 | Do use icons to enhance comprehension and reduce unnecessary labeling
6 | Don't use high contrast color combinations such as red/green or blue/yellow
7 | Don't use 3D charts; they can skew perception of the visualization
8 | Don't add chart junk; unnecessary illustrations, drop shadows, or ornamentations distract from the data
9 | Don't use more than 6 colors in a single layout
10 | Don't use distracting fonts or elements (such as bold, italic, or underlined text)

Data Visualization Best Practices
- Keep it simple: avoid unnecessary clutter; simplicity aids clarity
- Use colors effectively: avoid using too many colors; use color to emphasize important elements
- Use clear labels and titles: titles should be descriptive; labels should be legible and unambiguous
- Ensure accuracy: don't distort data; use appropriate scales and axes
- Tell a story: visualizations should convey a narrative that is easy to follow
- Provide context: make sure viewers understand what the data represents

SPECIAL TOPIC Correlation & Regressions
John Omar D. Esguerra, Instructor

Learning Outcomes
1. Understand the concept of correlation and its importance in data analysis.
2. Perform linear regression to model relationships between variables.
3. Apply logistic regression for binary outcomes.
4. Use Excel to analyze health-related datasets effectively.

What is Correlation?
Correlation is the statistical technique that is used to describe the strength and direction of the relationship between two or more variables. It assesses the relationship between changes in one variable and those in another.

Pearson Correlation Coefficient

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2}}$$

where $r$ is the correlation coefficient, $x_i$ and $y_i$ are the data points in datasets X and Y, and $\bar{x}$ and $\bar{y}$ are the means of datasets X and Y.

Types of Correlation
Range: -1 to +1
- r = +1: perfect positive correlation
- r = -1: perfect negative correlation
- r = 0: no correlation
Interpretation:
- r > 0.7: strong positive correlation
- 0.3 <= r <= 0.7: moderate correlation
- r < 0.3: weak correlation
Scatter plots are used to visualize the relationship between two or more variables (positive, no, or negative correlation).

Example: Performing Correlation in Excel
Exploring the relationship between Age, BMI, Hemoglobin levels, and White Blood Cell (WBC) count.
Dataset example: Column A: Age (years); Column B: BMI (kg/m2); Column C: Hemoglobin (g/dL); Column D: WBC count (10^3/µL)
Method 1 - Syntax: =CORREL(array1, array2)
Method 2 - Data Analysis Toolpak: Data > Data Analysis > Correlation
Health Insight: correlations like these provide valuable indicators for identifying hidden health risks.
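With that layout, a single pairwise correlation can be computed directly in a worksheet cell; a sketch, assuming the data occupies rows 2 through 101:

    =CORREL(A2:A101, C2:C101)

This returns the Pearson r between Age (column A) and Hemoglobin (column C); swapping in columns B or D gives the other pairings, while the Toolpak's Correlation tool produces the full matrix in one pass.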
Correlation Does Not Imply Causation
Just because two variables show a statistical relationship (correlation), it doesn't mean that one variable causes the other to occur.
Example:
- Observation: ice cream sales and drowning rates increase at the same time
- Correlation: there's a positive correlation between ice cream sales and drowning
- Misinterpretation: eating ice cream causes drowning
- Reality: a third factor, hot weather, is causing both to rise
- Conclusion: correlation between ice cream sales and drownings does not mean ice cream causes drownings
Key difference:
- Correlation: a statistical measure that shows how two variables move together (positively or negatively)
- Causation: indicates that one variable directly affects another
Takeaway: always investigate the underlying factors; use controlled experiments or additional evidence to establish causation.

What is Regression?
Regression analysis is a statistical technique that describes the relationship between variables with the goal of modelling and comprehending their interactions. Its primary objective is to form an equation between a dependent variable and one or more independent variables.

Slope-Intercept Formula: $y = mx + b$, where $y$ is the dependent variable (output), $x$ is the independent variable (input), $m$ is the slope of the line (rate of change), and $b$ is the y-intercept (the value of $y$ when $x = 0$).

Simple Linear Regression Formula: $Y = \beta_0 + \beta_1 X + \epsilon$, where $Y$ is the dependent variable (the variable you want to predict), $X$ is the independent variable (the variable you are using to predict $Y$), $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\epsilon$ is the error term (capturing the difference between actual and predicted values).

Example: Performing Regression in Excel
Let's say you want to predict a patient's hemoglobin level based on their age. The simple linear regression formula might look like this:

$$Hemoglobin\ Level = 9.16 + 0.11 \times Age + \epsilon$$

Interpretation: for every additional year of age, hemoglobin level increases by 0.11. The base hemoglobin level (at age 0) is 9.16 g/dL. The error term $\epsilon$ accounts for differences between actual and predicted values.
Using the Data Analysis Toolpak: go to Data > Data Analysis > Regression; select the dependent and independent variables; review the output (intercept, slope, R Square).

Multiple Regression
Multiple linear regression is a statistical method used to model the relationship between a dependent variable and two or more independent variables:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \epsilon$$

where $Y$ is the dependent variable (the variable you want to predict), $X_1, X_2, \dots, X_n$ are the independent variables (the predictors), $\beta_0$ is the intercept (the value of $Y$ when all $X$'s are 0), $\beta_1, \beta_2, \dots, \beta_n$ are the coefficients (slopes showing how each $X$ affects $Y$), and $\epsilon$ is the error term (accounting for variability that is not explained by the model).

Let's say you want to predict a patient's blood pressure (Y) based on multiple factors: $X_1$ = Age, $X_2$ = BMI (Body Mass Index), $X_3$ = Physical Activity Level, $X_4$ = Cholesterol, $X_5$ = Glucose. The equation might look like this:

$$Blood\ Pressure = 0.64 + 0.28 \times Age + 0.43 \times BMI - 0.14 \times Physical\ Activity + 0.11 \times Cholesterol + 0.15 \times Glucose + \epsilon$$

Interpretation:
- For every 1-year increase in age, blood pressure increases by 0.28 units.
- For every 1-unit increase in BMI, blood pressure increases by 0.43 units.
- For every 1-unit increase in physical activity, blood pressure decreases by 0.14 units.
- For every 1-unit increase in cholesterol, blood pressure increases by 0.11 units.
- For every 1-unit increase in glucose, blood pressure increases by 0.15 units.

Summary Output Metrics: Regression Statistics

Regression Statistics
Multiple R           0.669616026
R Square             0.448385622
Adjusted R Square    0.44280248
Standard Error       10.19331399
Observations         500

Multiple R (Correlation Coefficient)
This value represents the strength and direction of the linear relationship between the observed and predicted values of the dependent variable. It is always between -1 and 1:
- Close to 1: strong positive linear relationship
- Close to -1: strong negative linear relationship
- Close to 0: weak or no linear relationship

$$R = \frac{Covariance(X, Y)}{Standard\ deviation(X) \times Standard\ deviation(Y)}$$

R Square (Coefficient of Determination)
This value shows how well the independent variables (X) explain the variance in the dependent variable (Y). R Square values range from 0 to 1, where 0 means the model explains none of the variance, and 1 means it explains all the variance. A higher R² suggests that the model fits the data well and can make more accurate predictions.

Adjusted R Square
Similar to R Square, but adjusted for the number of predictors (variables) in the model. It accounts for the addition of variables that might not improve the model, which makes adjusted R-squared a valuable metric for multiple linear regression:

$$Adjusted\ R^2 = 1 - (1 - R^2) \times \frac{n - 1}{n - k - 1}$$

where $R^2$ is the regular R-squared, $n$ is the number of observations, and $k$ is the number of predictors (variables).

Standard Error
A measure of the accuracy of the regression coefficient estimate. It quantifies the amount of variation or uncertainty in the estimated regression coefficient, and shows the typical distance between the actual data points and the regression line. In simple terms, the standard error tells you how much the estimated regression coefficient is likely to vary from the true population value. A smaller standard error indicates that the estimated coefficient is likely to be very close to the true population value; a larger standard error suggests more uncertainty or variability in the estimated coefficient.

Observations
The number of data points or cases used in the regression analysis.
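As a quick check of the formula, plugging this output's own numbers into it reproduces the Adjusted R Square reported above ($n = 500$ observations, $k = 5$ predictors):

$$Adjusted\ R^2 = 1 - (1 - 0.448386) \times \frac{500 - 1}{500 - 5 - 1} = 1 - 0.551614 \times \frac{499}{494} \approx 0.442802$$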
Summary Output Metrics: ANOVA (Analysis of Variance)

              df    SS          MS          F          Significance F
Regression    5     41722.84    8344.568    80.31063   1.36E-61
Residual      494   51328.4     103.9037
Total         499   93051.24

The ANOVA table helps evaluate the significance of the regression model in explaining the variance in the response variable (Y), and is rarely used for a simple linear regression analysis. The Significance F value reflects the statistical reliability of your results: a Significance F below 0.05 (5%) indicates a reliable, statistically significant model; if it exceeds 0.05, it's better to consider different independent variables for improved model reliability.

df (Degrees of Freedom)
- Regression df: number of independent variables
- Residual df: number of observations minus the number of independent variables minus 1
- Total df: total number of observations minus 1

SS (Sum of Squares)
- Regression SS: variation explained by the regression model
- Residual SS: variation that the model fails to explain
- Total SS: total variation in the data

MS (Mean Square)
Calculated by dividing the sum of squares by their respective degrees of freedom (SS/df).
- Regression MS: used to determine the variance explained by the model per variable
- Residual MS (error variance): average variance not explained by the model

F (F-Statistic)
Ratio of the mean regression sum of squares to the mean residual sum of squares. A higher F-value indicates a more significant model.

Summary Output Metrics: Coefficients Table

                    Coefficients    Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept           0.644574056     3.916605         0.164575   0.869346   -7.05069    8.339833
Age                 0.281571493     0.025081         11.22636   3.33E-26   0.232292    0.330851
BMI                 0.429363447     0.072298         5.938771   5.42E-09   0.287313    0.571414
Physical_Activity   -0.143966279    0.155258         -0.92727   0.354239   -0.44901    0.161082
Cholesterol         0.105346923     0.01049          10.04275   1.02E-21   0.084737    0.125957
Glucose             0.151692224     0.012225         12.40799   5.79E-31   0.127672    0.175712

Coefficients
These are the values that represent the estimated change in the dependent variable for a one-unit change in the independent variable, holding all other variables constant. The Intercept represents the expected value of the dependent variable when all predictors are zero.
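To make the coefficients concrete, here is a purely illustrative prediction using the estimates above; the patient values (age 50, BMI 27, physical activity level 3, cholesterol 190, glucose 100) are invented for this sketch, and the result is on whatever scale this synthetic dataset uses:

$$\hat{Y} = 0.6446 + 0.2816(50) + 0.4294(27) - 0.1440(3) + 0.1053(190) + 0.1517(100) \approx 61.1$$

Note also that Physical_Activity's small t Stat and large P-value (0.354) in the table suggest that its coefficient is not statistically distinguishable from zero in this output.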
