
Fundamentals of Data Analytics.pdf




Module 1: Data Analytics

Data analytics is the science of extracting trends, patterns, and relevant information from raw data to draw conclusions. It has multiple approaches, dimensions, and techniques. Data analytics helps in:
- Scientific decision-making and effective business operations.
- Analyzing data, gaining profits, improving resource use, and enhancing managerial operations.

Problems with Traditional Accounting Methods
- Accounting used to be done in notebooks, which was cumbersome.
- The use of Excel simplified accounting but did not solve all problems.
- Small and medium businesses (SMBs) and startups face issues managing and tracking cash flow.
- A reliable financial management solution is required for business success.

Challenges in Traditional Accounting
- Difficulty in tracking small expenses such as taxes.
- SMBs and startups often need a dedicated financial expert.
- Business owners often double as HR and payroll experts due to a lack of resources.
- Traditional methods made interpreting financial reports difficult.
- Excel-based macros and pivot tables offered insufficient insight.

Data Analytics: Impact on Accounting
- Uncovers valuable insights.
- Identifies process improvements.
- Helps in risk management.
- Adds value to the decision-making process.

How Accountants Use Data Analytics
Tax accountants, investment advisors, and auditors benefit from:
○ Continuous monitoring and verification of large datasets.
○ Reduced errors and more precise recommendations.
○ Faster decisions in investments and tax analysis.

The Data Analysis Process
The process consists of the following iterative phases:
Data Requirements Specification
○ The data required for analysis is based on a question or an experiment.
Data Collection
○ The process of gathering information on the targeted variables identified as data requirements. It ensures that the gathered data is accurate, so that the related decisions are valid.
Data Processing
○ The collected data must be processed or organized for analysis.
Data Cleaning
○ The process of preventing and correcting errors (such as duplicates or incomplete records) in the collected data.
Data Analysis
○ Processed, organized, and cleaned data is ready for analysis.
○ Various data analysis techniques are available to understand, interpret, and derive conclusions based on the requirements.
○ Data visualization may also be used to examine the data in graphical form, to obtain additional insight into the messages within the data.
Communication
○ The results of the data analysis are reported in the format required by the users to support their decisions and further action.

Data Analytics: Overview and Process Flow
Data analytics involves examining and analyzing raw data sets to draw conclusions, derive information, and improve business products and services.

Process Flow
1. Define goals.
2. Identify measurable metrics.
3. List, collect, and extract data from sources.
4. Explore and analyze data.
5. Interpret and visualize data.
6. Make inferences from the data for decision-making.

Data Analytics Life Cycle
The life cycle covers the process of generating, collecting, processing, using, and analyzing data to achieve corporate objectives. It provides a systematic method for managing data and converting it into information that can be used to achieve organizational and project goals, along with guidance and strategies for extracting information from data and moving forward on the proper path.
1. Discovery: Learn about the business domain and assess resources.
2. Data Preparation: Execute extract, load, and transform (ELT).
3. Model Planning: Identify techniques to understand relationships between variables.
4. Model Building: Develop datasets for testing, training, and production.
5. Communicate Results: Deliver reports, briefs, and technical documents.
6. Operationalize: Identify key findings and develop narratives for stakeholders.

Types of Data Analytics (with Amazon as the running example)
1. Descriptive Analytics: Summarizes past events. Techniques include data aggregation and data mining. Example: Amazon's revenue increased on the West Coast last year.
2. Diagnostic Analytics: Identifies the root causes of past events. Techniques include drill-down, data discovery, and correlation. Example: sales training had a good return on investment.
3. Predictive Analytics: Predicts future outcomes using machine learning models. Example: a predicted 10-12% revenue increase.
4. Prescriptive Analytics: Suggests optimal recommendations based on predictions. Example: an optimization plan implemented to maximize revenue.

Benefits of Data Analytics

Decision-Making
Data analytics enables faster, fact-based decision-making. Organizations achieve higher operational efficiency, improved customer satisfaction, and stronger profits.

Cost Reduction
Data analytics reduces failed marketing campaigns, and predictive analytics optimizes logistics and reduces costs.

Data Visualization Tools
Power BI, Tableau, and Logi are commonly used for visualizing multidimensional data.

Other Benefits of Data Analytics
- Identifies potential problems early and provides time to take corrective action.
- Assists in product or service development by aligning with market needs.
- Handles heterogeneous data and provides valuable insights.

Module 2: Terminologies in Data Analytics

Observation: A single row or record of data from the database. The observation is the unit of analysis on which the measurements are taken. It is also known as a case, record, pattern, or row.

Data Sampling: A statistical analysis technique used to select, manipulate, and analyze a representative subset of data points, in order to identify patterns and trends in the larger data set.

Dataset: A collection of data, or the total data captured about a particular use case. It can hold information such as medical, insurance, and loan approval records, and may include collections of images or videos.
Prediction: The goal of prediction is to move from what has happened to the best assessment of what will happen.

Types of Data
Structured Data: Data that is processed, stored, and retrieved in a fixed format. Example: employee details, job positions, and salaries.
Unstructured Data: Data that lacks any specific form or structure. Example: email.
Semi-Structured Data: Data containing both structured and unstructured elements. Example: CSV and JSON documents.

Analyzing Unstructured Data
- About 80% of business data is unstructured.
- Internally generated information is considered unstructured when it does not fit neatly into a database.
- Unstructured information is text-heavy and contains data such as dates, numbers, and facts.
- Unstructured data is primarily used for BI and analytics, not for transaction processing applications.

Retailers and manufacturers analyze unstructured data to:
- Perform sentiment analysis on product reviews.
- Enable targeted marketing.
- Improve customer relationship management processes.

Qualitative and Quantitative Data
Qualitative Data: Data in which objects are classified based on attributes and properties. Example: the softness of skin.
Quantitative Data: Data that can be measured and expressed numerically. Example: your height and shoe size.

Characteristics of each:
- Qualitative data: collection is unstructured; it asks why; it is non-statistical and cannot be computed; it develops an initial understanding and defines the problem.
- Quantitative data: collection is structured; it asks how much or how many; it is statistical and involves numbers; it recommends the final course of action.

Subgroups of Qualitative Data
Nominal Data: Unordered data classified into named categories, with no ranking among the categories. Example: grade classification such as pass or fail for a student's test results.
Ordinal Data: Ordered data assigned to categories in a ranked fashion. Example: feedback on a product with a 1-5 ranking.

Subgroups of Quantitative Data
Discrete Data: Can take only certain values. Example: the number of students in a class.
Continuous Data: Can take any value within a specified range. Example: the share price of a company.

Data Levels of Measurement
A classification that describes the nature of the information within the values assigned to variables:
1. Nominal Level:
○ Numbers in the variable are used only to classify the data.
○ Words, letters, and alphanumeric symbols can be used. Example: people in the female gender category are classified as F, and those in the male gender category as M.
2. Ordinal Level:
○ Depicts an ordered relationship among the variable's observations.
○ Indicates an order of the measurements. Example: a student with a 100% score is assigned the first rank, and a student with a 95% score the second rank.
3. Interval Level:
○ Classifies and orders the measurements.
○ The distances between intervals on the scale are equivalent. Example: on the centigrade scale, the distance between 80°C and 100°C is the same as the distance between 1000°C and 1020°C.
4. Ratio Level:
○ Observations can have a value of zero.
○ Its properties are similar to the interval level, but the meaningful zero on the scale makes it different. Example: weight, where 0 means no weight at all.

Normal Distribution of Data
- The normal distribution is also known as the Gaussian distribution or bell curve.
- It is a perfectly symmetric, bell-shaped distribution with a single peak.
- It is the most important probability distribution in statistics, and most natural phenomena approximately follow it.
- It is denser at the center, and its mean, median, and mode are equal.
- It is continuous and has asymptotic tails.

Basic Statistical Parameters
1. Mean: The average of all data points for a given set of data.
It is measured by adding all the data points and dividing the sum by the number of data points.
2. Variance: The sum of the squared differences between each data point and the mean, divided by the number of data points. It measures how the data distributes itself about the mean.
3. Standard Deviation: The square root of the variance. It shows the extent to which the data varies from the mean.

Key Takeaways
- Structured, unstructured, and semi-structured data are the three types of data.
- Nominal, ordinal, interval, and ratio are the four data levels of measurement.
- The normal distribution is the most important probability distribution in statistics.
- Mean, variance, and standard deviation are the basic statistical parameters.

SMART Questions
- Specific
- Measurable
- Action-oriented
- Relevant
- Time-bound

Avoid when asking questions:
- Leading questions
- Closed-ended questions
- Vague questions

Module 3: Data Cleansing

Data cleansing is the process of identifying and resolving corrupt, inaccurate, or irrelevant data. This critical stage of data processing, also referred to as data scrubbing or data cleaning, boosts the consistency, reliability, and value of your company's data.

Data cleansing vs. data cleaning vs. data scrubbing
Data scrubbing is often viewed as an element of data cleansing that specifically involves removing duplicate, bad, unneeded, or old data from data sets. Data scrubbing also has a different meaning in connection with data storage: it is an automated function that checks disk drives and storage systems to make sure the data they contain can be read, and to identify any bad sectors or blocks.

Why is clean data important?
Business operations and decision-making are increasingly data-driven, as organizations look to use data analytics to improve business performance and gain competitive advantages over rivals. As a result, clean data is a must for BI and data science teams, business executives, marketing managers, sales reps, and operational workers.
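To make the point concrete, here is a minimal Python sketch of how even a single duplicate record distorts a metric until the data is cleansed. The order IDs and amounts are invented for illustration:

```python
# Hypothetical order records; the second entry for order 1002 is an
# accidental duplicate, as might result from merging two exports.
orders = [
    ("1001", 250.0),
    ("1002", 400.0),
    ("1002", 400.0),  # duplicate record
    ("1003", 150.0),
]

# Naive total counts the duplicate and overstates revenue.
naive_total = sum(amount for _, amount in orders)

# Cleansed total: keep only the first occurrence of each order ID.
seen = set()
clean_total = 0.0
for order_id, amount in orders:
    if order_id not in seen:
        seen.add(order_id)
        clean_total += amount

print(naive_total)  # 1200.0
print(clean_total)  # 800.0
```

The key design choice is deduplicating on a unique identifier (here the order ID) rather than on the amount, since distinct orders can legitimately share the same value.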
The Power of Clean Data
A decision's quality depends on the data informing it. With vast amounts of data coming from multiple sources, a data cleansing tool is crucial for accuracy, efficiency, and competitive advantage.

Benefits of Data Scrubbing
- Improved decision-making: Clean data ensures accurate decisions and strategies, preventing wasted time and resources. For example, Domino's Pizza's AnyWare app relies on clean customer data to avoid errors in app development.
- Boosted efficiency: Clean data enhances internal productivity by providing valuable insights into processes such as employee turnover, helping identify high-risk employees through performance reviews and feedback.
- Competitive edge: Meeting customer needs effectively provides a competitive advantage. Data cleansing offers reliable insights, leading to faster responses, quality leads, and improved customer experiences.

Types of Data Errors Fixed by Data Scrubbing
- Typos and invalid data: Corrects misspellings, numerical errors, and missing values.
- Inconsistent data: Standardizes formats and terms across systems.
- Duplicate data: Removes or merges duplicate records from combined data sets.
- Irrelevant data: Eliminates outliers and outdated entries, streamlining data processing and storage.

Steps in the Data Cleaning Process
1. Inspection and profiling: Data is audited to assess quality, identify issues, and document relationships between data elements. This includes data profiling to find errors and discrepancies.
2. Cleaning: Errors are corrected, and inconsistent, duplicate, and redundant data is addressed.
3. Verification: The cleaned data is inspected again to ensure it meets internal quality rules and standards.
4. Reporting: Results are reported to IT and business executives, including the number of issues found and corrected, and updated data quality metrics.

Clean data characteristics include accuracy, completeness, consistency, integrity, timeliness, uniformity, and validity.
Data quality metrics track these characteristics and assess the business impact of data quality problems.

The Benefits of Effective Data Cleansing
- Improved decision-making: Accurate data leads to better results from analytics applications, aiding informed decisions in business strategies, patient care, and government programs.
- More effective marketing and sales: Clean customer data enhances the effectiveness of marketing campaigns and sales efforts.
- Better operational performance: High-quality data helps prevent issues such as inventory shortages and delivery problems, reducing costs and improving customer relationships.
- Increased use of data: Trustworthy data encourages its use, maximizing its value as a corporate asset.
- Reduced data costs: By preventing errors from propagating, data cleansing saves the time and money spent on ongoing data fixes.

Data cleansing is also crucial for data governance programs, ensuring data consistency and proper use.

Data Cleansing Challenges
- Time consumption due to numerous issues and error sources.
- Resolving missing data values.
- Fixing inconsistent data across different business units.
- Addressing data quality issues in big data systems.
- Securing adequate resources and organizational support.
- Managing data silos.

Cleaning with Spreadsheets
Spreadsheets and the data life cycle:
- Plan: Develop organizational standards for formatting, headings, color schemes, and data order to improve communication and efficiency.
- Capture: Connect spreadsheets to other data sources for automatic updates, ensuring current and accurate information.
- Manage: Store, organize, filter, and update data while controlling access and security.
- Analyze: Use formulas and pivot tables for data aggregation, reporting, and visualization to aid decision-making.
- Archive: Use built-in tools to store infrequently used spreadsheets for future reference.
- Destroy: Delete spreadsheets that are no longer needed, or that must be removed for legal or security reasons, following proper protocols.

Spreadsheet Errors and Fixes
Common spreadsheet errors:
- #DIV/0!: The formula divides by zero or by an empty cell (e.g., =B2/B3 when B3 is 0).
- #ERROR!: A parsing error in Google Sheets (e.g., an invalid cell range format).
- #N/A: Data not found (e.g., a referenced value is missing).
- #NAME?: An unrecognized formula or function name (e.g., a misspelled function).
- #NUM!: An invalid numeric value for the calculation (e.g., DATEDIF with incorrect dates).
- #REF!: An invalid cell reference (e.g., a column used in the formula was deleted).
- #VALUE!: A general error with the formula or cells (e.g., text or stray spaces causing issues).

Best Practices to Avoid Errors
- Filter data to reduce complexity.
- Use and freeze headers for clarity.
- Use an asterisk (*) for multiplication, not the letter X.
- Start formulas with an equal sign (=).
- Match open and closed parentheses.
- Choose a readable font and set border colors to white.
- Separate raw data and needed data into different tabs.

Spotting Errors with Conditional Formatting
In Microsoft Excel:
○ Select all cells, then go to Home > Conditional Formatting > Highlight Cell Rules > More Rules.
○ Choose "Use a formula to determine which cells to format" and enter =ISERROR(A1).
○ Set the fill color and apply.
In Google Sheets:
○ Select all cells, then go to Format > Conditional formatting.
○ Under "Custom formula is", enter =ISERROR(A1).
○ Choose a formatting style and apply.
Use conditional formatting to highlight and fix errors in large spreadsheets.

Data Cleaning Checklist
1. Checking spelling: Use the Spelling option under the Review tab to correct misspelled words. Select "Change All" to apply corrections across the worksheet.
2. Highlighting duplicates: Use Conditional Formatting from the Home tab. Create a new rule with a formula such as =COUNTIF($C$8:$C$19,$C8)>1 and set a format to highlight the duplicates.
3. Removing duplicates: Go to the Data tab and click "Remove Duplicates."
Select the column to check for duplicates; each duplicate's entire row is deleted.
4. Replacing text: Use the Find & Select command from the Home tab. Enter the text to be replaced and the new text, then click "Replace All."
5. Changing text case: Apply a formula such as =PROPER(C8) to convert text to proper case, and use the fill handle to autofill.
6. Removing spaces: Use a formula such as =TRIM(CLEAN(SUBSTITUTE(C6,CHAR(160)," "))) to remove unnecessary spaces (including non-breaking spaces) from text.
7. Removing non-printable characters: Apply a formula such as =CLEAN(C8).
8. Fixing numbers: Use =VALUE(F6) to convert text-formatted numbers into numeric format.
9. Fixing times: Convert times stored as decimal hours with =F6/24, then format the cells as Time.
10. Fixing dates: Convert serial numbers to dates by selecting the cells and choosing Short Date or Long Date from the Number Format menu.
11. Merging columns: Use a formula such as =CONCATENATE(C6,"-",D6,"-",E6) to combine columns into one.
12. Distributing cell contents: Use Text to Columns under the Data tab to split combined data into separate cells.
13. Switching rows and columns: Copy the range, select a destination cell, then use Paste Special and choose Transpose.
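Several of the checklist steps above (trimming spaces, removing non-printable characters, fixing text case, converting text-formatted numbers, and removing duplicates) have direct scripted equivalents. A minimal Python sketch, with sample records invented for illustration:

```python
# Python equivalents of a few checklist steps: TRIM/CLEAN-style text
# cleanup, PROPER-style casing, VALUE-style number conversion, and
# Remove Duplicates. The sample rows are invented for this sketch.
import string

PRINTABLE = set(string.printable)

def clean_text(value: str) -> str:
    value = value.replace("\xa0", " ")                       # non-breaking space -> space
    value = "".join(ch for ch in value if ch in PRINTABLE)   # like CLEAN()
    return " ".join(value.split())                           # like TRIM()

raw_rows = [
    {"name": "  alice\xa0smith ", "amount": "100"},
    {"name": "BOB\tJONES",        "amount": "250"},
    {"name": "  alice\xa0smith ", "amount": "100"},  # duplicate row
]

cleaned, seen = [], set()
for row in raw_rows:
    fixed = {
        "name": clean_text(row["name"]).title(),  # like PROPER()
        "amount": float(row["amount"]),           # like VALUE()
    }
    key = tuple(fixed.items())
    if key not in seen:                           # like Remove Duplicates
        seen.add(key)
        cleaned.append(fixed)

print(cleaned)
# [{'name': 'Alice Smith', 'amount': 100.0}, {'name': 'Bob Jones', 'amount': 250.0}]
```

Note that the rows are deduplicated only after cleaning; the two "alice smith" entries differ as raw text and would not be caught beforehand, which mirrors why the checklist fixes formatting before removing duplicates.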
