Unit 1: Introduction to Data Analysis
Burce, Catugas, Cerezo, Manzano, Pascual
Summary
This textbook provides an overview of data analytics, covering topics like data analytics lifecycle, big data analytics, tools and technologies, and future trends. It explains different types of data, such as structured and unstructured, and details various phases of data analytics, including descriptive, predictive, and prescriptive analytics.
Unit I: Introduction to Data Analysis
Burce, Catugas, Cerezo, Manzano, Pascual

Table of Contents
1. Overview of Data Analytics
2. Data Analytics Lifecycle
3. Big Data Analytics
4. Tools and Technologies
5. Future Trends in Data Analytics

Overview of Data Analytics

What is Data Analytics?
Data analytics refers to the process of examining, transforming, and organizing raw data to uncover meaningful patterns, trends, and insights. It involves using statistical techniques, algorithms, and tools to analyze data and draw conclusions that support informed decision-making. The ultimate goal is to convert raw data into actionable insights that can guide business strategies, optimize operations, or predict future trends.

Importance of Data Analytics: It helps organizations make informed decisions, improve operational efficiency, predict trends, and gain a competitive advantage.

Applications: Data analytics is used in fields such as business, healthcare, sports, marketing, and finance. For example, businesses use analytics for customer segmentation, while healthcare uses it for patient diagnostics.

Types of Data:
1. Structured Data - Data that is organized in a specific format or model, usually found in rows and columns, as in databases or spreadsheets.
Examples include transaction records or customer information.
2. Unstructured Data - Data that doesn't have a predefined structure and is not organized in a traditional database. This includes data like social media posts, videos, and emails.

Phases of Data Analytics:
1. Predictive Data Analytics - Predictive analytics may be the most commonly used category of data analytics. Businesses use predictive analytics to identify trends, correlations, and causation. The category can be further broken down into predictive modeling and statistical modeling; however, it's important to know that the two go hand in hand.
For example, an advertising campaign for t-shirts on Facebook could apply predictive analytics to determine how closely conversion rate correlates with a target audience's geographic area, income bracket, and interests. From there, predictive modeling could be used to analyze the statistics for two (or more) target audiences and provide possible revenue values for each demographic.
2. Prescriptive Data Analytics - Prescriptive analytics is where AI and big data combine to help predict outcomes and identify what actions to take. This category can be further broken down into optimization and random testing. Using advancements in machine learning, prescriptive analytics can help answer questions such as "What if we try this?" and "What is the best action?" You can test the relevant variables and even suggest new variables that offer a higher chance of generating a positive outcome.
3. Diagnostic Data Analytics - While not as exciting as predicting the future, analyzing data from the past serves an important purpose in guiding your business. Diagnostic data analytics is the process of examining data to understand cause and effect, or why something happened. Techniques such as drill-down, data discovery, data mining, and correlations are often employed.
Diagnostic data analytics helps answer why something occurred. Like the other categories, it is broken down into two more specific techniques: discovery and alerts, and queries and drill-downs. Queries and drill-downs are used to get more detail from a report. For example, if a sales rep closed significantly fewer deals one month, a drill-down could show that there were fewer workdays due to a two-week vacation. Discovery and alerts notify you of a potential issue before it occurs: for example, an alert about a lower number of staff hours, which could result in a decrease in closed deals. You could also use diagnostic data analytics to "discover" information such as the most qualified candidate for a new position at your company.
4. Descriptive Data Analytics - Descriptive analytics is the backbone of reporting; it is impossible to have business intelligence (BI) tools and dashboards without it. It addresses the basic questions of "how many, when, where, and what." Once again, descriptive analytics can be further separated into two categories: ad hoc reporting and canned reports.
A canned report is one that has been designed previously and contains information on a given subject. An example is a monthly report sent by your ad agency or ad team that details performance metrics on your latest ad efforts.
Ad hoc reports, on the other hand, are designed by you and usually aren't scheduled. They are generated when there is a need to answer a specific business question, and they are useful for obtaining more in-depth information about a specific query. An ad hoc report could focus on your corporate social media profile, examining the types of people who've liked your page and other industry pages, as well as other engagement and demographic information. Its hyperspecificity helps give a more complete picture of your social media audience.
Chances are you won't need to view this type of report a second time (unless there's a major change to your audience).

Data Analytics Lifecycle

What is the data analytics lifecycle?
The data analytics lifecycle is a series of seven phases, each identified as vital for businesses doing data analytics. This lifecycle is based on the popular CRISP-DM process model, an open, cross-industry standard for analytics projects. The phases of the data analytics lifecycle include defining your business objectives, cleaning your data, building models, and communicating with your stakeholders. The lifecycle runs from identifying the problem you need to solve, to running your chosen models against sandboxed data, to finally operationalizing the output of those models by running them on a production dataset. This enables you to answer your initial question and use that answer to inform business decisions.

Why is the data analytics lifecycle important?
The data analytics lifecycle allows you to better understand the factors that affect successes and failures in your business. It is especially useful for finding out why customers behave a certain way; these customer insights are extremely valuable and can help inform your growth strategy. For example, you need a hypothesis to give your study clarity and direction; your data will be easier to analyze if it has been prepared and transformed in advance; and you will have a higher chance of working with an effective model if you have spent time and care selecting the most appropriate one for your particular dataset. Following the data analytics lifecycle ensures you can realize the full value of your data and that all stakeholders are informed of the results and insights derived from the analysis, so they can be acted on promptly.

Phase 1: Discovery
The Discovery phase is the first and most critical step in the Data Analytics Lifecycle.
It focuses on fully understanding the business problem or opportunity that the project aims to address. The team must investigate the issue from all angles, ensuring they have the right context to drive the project forward.

Key Activities:
○ The data science team works closely with stakeholders to understand the business objectives and constraints.
○ They learn and investigate the problem, developing a deep understanding of the business context.
○ The team identifies and documents the data sources needed and available for the project.
○ This phase also involves forming an initial hypothesis that will be tested with data during later stages.

Outcome: A clear definition of the problem, business goals, project plan, and list of available data sources.

Phase 2: Data Collection
The Data Collection phase is about gathering relevant data from internal and external sources. This data will be used to address the business problem.

Key Activities:
○ Identify and gather all required data, whether by querying databases, retrieving log files, or sourcing data from third-party APIs.
○ Data quality is essential at this stage, so the team ensures the data is accurate, complete, and relevant.
○ It's important to assess how well the data matches the initial hypothesis and business goals.

Outcome: Collected, ready-to-use data that meets the project requirements.

Phase 3: Data Preparation
In the Data Preparation phase, the raw data is cleaned, transformed, and made ready for modeling and analysis. This step ensures that the data is of high quality and suitable for building models.

Key Activities:
○ Data exploration, preprocessing, and conditioning are performed to remove inconsistencies.
○ The team uses tools like Hadoop, Alpine Miner, or OpenRefine to explore, load, and transform the data into an analytic sandbox (a controlled environment for analysis).
○ Tasks may include handling missing data, removing duplicates, normalizing, aggregating, and encoding variables.
○ Feature engineering (creating new variables from existing ones) is a significant part of this phase, enabling better model performance.

Outcome: Clean and structured data, ready for model development.

Phase 4: Model Planning
The Model Planning phase is where the team decides which analytical techniques and algorithms to use. The team explores the relationships between variables and selects key variables to test in models.

Key Activities:
○ Develop datasets for training, testing, and validation.
○ The team explores the data using statistical and visualization techniques to identify patterns and relationships between features.
○ Decide on the most suitable modeling techniques, such as regression, classification, or clustering.
○ Tools like MATLAB and Statistica, or programming languages like Python and R, can be used at this stage.

Outcome: A well-defined plan for the types of models to be built, along with selected features for the models.

Phase 5: Model Building
The Model Building phase is focused on constructing and training the models based on the prepared data.

Key Activities:
○ Datasets for training, testing, and production are finalized.
○ The team develops, tests, and iterates on the models using the algorithms chosen during model planning.
○ They consider whether existing tools (e.g., R, Python, Octave, WEKA) are sufficient or whether they need more robust environments for running and scaling models.
○ Models are refined and optimized for accuracy, precision, or other relevant metrics.

Outcome: Working machine learning models ready for evaluation.

Phase 6: Communicate Results
In the Communicate Results phase, the team evaluates the models and shares their findings with stakeholders.
The goal is to translate technical results into actionable business insights.

Key Activities:
○ Compare the outcomes of the models to the established success criteria.
○ Articulate the findings clearly, taking into account any assumptions and potential limitations.
○ Quantify the business value of the model by highlighting key insights.
○ Develop a narrative or presentation that effectively communicates the results to stakeholders, ensuring everyone understands the implications of the analysis.
○ Tools like Power BI, Tableau, or custom visualizations in Python or R can be used to present the findings.

Outcome: A well-communicated set of insights that provides value to the business and aligns with the project goals.

Phase 7: Operationalize
The Operationalize phase involves deploying the model into a real-world environment. The team ensures the model works well at scale and continues to deliver value over time.

Key Activities:
○ The team sets up a pilot project to deploy the model in a controlled environment before full enterprise-wide deployment.
○ Monitor model performance in production and make adjustments based on feedback or changes in the environment.
○ Deliver final reports, code, and documentation to the relevant stakeholders.
○ Tools for this phase may include SQL, MADlib, and cloud platforms such as AWS or Azure for deploying models.

Outcome: The model is deployed and operational, with a feedback loop in place to monitor and refine its performance.

Big Data Analytics

Definition and importance:
Big data is the term used to describe extremely vast and varied sets of structured, semi-structured, and unstructured data that keep growing rapidly over time. These datasets are so huge and complex in volume, velocity, and variety that traditional data management systems cannot store, process, and analyze them.
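To make the scale problem concrete, here is a minimal sketch of incremental processing (the field names and data are hypothetical): rather than loading an entire dataset into memory at once, rows are aggregated one at a time, so memory use stays constant no matter how large the input grows.

```python
import csv
import io

def total_sales_by_region(lines):
    """Aggregate sales per region one row at a time, so memory use
    stays constant regardless of how large the input is."""
    totals = {}
    for row in csv.DictReader(lines):
        region = row["region"]
        totals[region] = totals.get(region, 0.0) + float(row["amount"])
    return totals

# In practice `lines` would be an open file handle streamed from disk;
# this small in-memory sample stands in for a multi-gigabyte file.
sample = io.StringIO("region,amount\nNA,10.5\nEU,3.0\nNA,2.5\n")
print(total_sales_by_region(sample))  # {'NA': 13.0, 'EU': 3.0}
```

When even a single machine streaming the data is too slow, the same aggregation is distributed across many machines, which is exactly what the big data frameworks below are for.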
The rapid growth in data and its accessibility is being driven by advancements in digital technology, including connectivity, mobility, the Internet of Things (IoT), and artificial intelligence (AI). As the volume and variety of data continue to increase, new big data tools are emerging to help companies collect, process, and analyze this information quickly, enabling them to extract the greatest value from it.

Big data refers to vast and varied datasets that are not only enormous in scale but also expand rapidly over time. It is utilized in machine learning, predictive modeling, and other advanced analytical techniques to address business challenges and support informed decision-making. Companies leverage big data within their systems to enhance operational efficiency, deliver superior customer service, design personalized marketing campaigns, and undertake other initiatives that can boost revenue and profits. Businesses that effectively utilize big data gain a competitive edge over those that don't, as they can make quicker and more informed decisions.

Types of Big Data Analytics
1. Descriptive Analytics involves data that is straightforward to read and interpret. It aids in generating reports and visualizations that provide insights into company profits and sales.
2. Diagnostic Analytics helps companies determine the causes of problems. Big data technologies and tools enable users to analyze and retrieve data that clarifies the root of an issue, aiding in its resolution and helping prevent future occurrences.
3. Predictive Analytics examines historical and current data to forecast future outcomes. By leveraging artificial intelligence (AI), machine learning, and data mining, users can analyze data to anticipate market trends.
4. Prescriptive Analytics offers solutions to problems by utilizing AI and machine learning to collect and analyze data, aiding in risk management and decision-making.

Big Data Analytics Tools and Software
1. Apache Hadoop - An open-source framework that enables distributed storage and processing of large datasets using a cluster of commodity hardware. Hadoop uses distributed storage and parallel processing to break enormous amounts of data into smaller workloads, allowing analysts to store and process data quickly.
2. Apache Spark - An open-source analytics engine that handles large amounts of structured and semi-structured data. The platform leverages in-memory computing to process data and includes several libraries to support SQL queries, stream processing, and AI/ML applications.
3. Tableau - Salesforce's big data analytics tool, which lets users visualize and analyze data. With a drag-and-drop interface, users can create interactive dashboards to gain insights. The platform integrates with multiple data sources, giving you a high-level view of critical KPIs. However, navigating Tableau can be difficult for business users, as the platform requires an analytical background.
4. Qlik Sense - A platform that enables business users to perform ad hoc queries and apply advanced data analytics to large datasets. It allows users to create visualizations, receive recommendations, and track real-time data. Qlik Sense can be deployed both on-premises and in the cloud.
5. Power BI - Microsoft's business analytics tool, which integrates seamlessly with the Microsoft ecosystem, enabling businesses to analyze and visualize large datasets. The platform provides features for data analytics, including data modeling, dashboards, and reporting. While it boasts an intuitive interface, Power BI's advanced functionality can be challenging for business users, who may need additional training to fully leverage it.
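Hadoop's core idea (tool 1 above) of splitting a big job into small, independent workloads whose partial results are then merged can be sketched in plain Python. This is a toy stand-in, not Hadoop itself: the "map" step counts words in each chunk (the workloads a real cluster would distribute across nodes), and the "reduce" step merges the partial counts.

```python
from collections import Counter
from functools import reduce

def map_phase(chunk):
    """'Map' step: each worker counts the words in its own chunk."""
    return Counter(chunk.split())

def reduce_phase(left, right):
    """'Reduce' step: merge two partial word counts."""
    return left + right

def word_count(text, n_chunks=4):
    # Split the input into independent workloads, the way a Hadoop
    # cluster distributes file blocks across nodes.
    lines = text.splitlines()
    step = max(1, len(lines) // n_chunks)
    chunks = ["\n".join(lines[i:i + step]) for i in range(0, len(lines), step)]
    partials = [map_phase(c) for c in chunks]  # would run in parallel on a cluster
    return reduce(reduce_phase, partials, Counter())

print(word_count("big data\nbig insights\ndata everywhere"))
```

Because each map call touches only its own chunk, the work parallelizes naturally; the framework's real value is scheduling those chunks across machines and handling failures.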
How does Big Data Analytics work?
1. Collect Data - Data collection varies from one organization to another. Leveraging modern technology, organizations can collect both structured and unstructured data from diverse sources, including cloud storage, mobile applications, in-store IoT sensors, and more.
2. Process Data - After data is collected and stored, it needs to be properly organized to ensure accurate results from analytical queries, particularly when dealing with large and unstructured datasets. As data availability expands rapidly, processing it becomes increasingly challenging for organizations. One approach is batch processing, which handles large blocks of data over time and is suitable for scenarios where there is a longer interval between data collection and analysis. In contrast, stream processing deals with smaller chunks of data in real time, reducing the delay between collection and analysis for faster decision-making. However, stream processing tends to be more complex and often comes with higher costs.
3. Clean Data - Whether data is large or small, it needs to be cleaned to enhance its quality and ensure accurate results. This process involves correctly formatting the data and removing or addressing any duplicates or irrelevant information. Untidy data can obscure meaningful insights and lead to inaccurate conclusions.
4. Analyze Data - Transforming big data into a usable state is a time-consuming process. Once the data is prepared, advanced analytics techniques can convert it into valuable insights. Some of these big data analysis methods include:
Data mining, which analyzes large datasets to uncover patterns and relationships by detecting anomalies and forming data clusters.
Predictive analytics, which leverages an organization's historical data to forecast future trends, helping to identify potential risks and opportunities.
Deep learning, which emulates human learning processes, using artificial intelligence and machine learning to apply layered algorithms that can discover patterns in highly complex and abstract data.

Tools and Technologies

Data Analytics - The 7 Essential Tools
1. Microsoft Excel - Used by data analysts to run basic queries and to create pivot tables, graphs, and charts. It also features a macro programming language called Visual Basic for Applications (VBA).
2. Python - Python is an open-source programming language used to organize and wrangle large sets of data. "Data wrangling" is an industry term for processing data in various formats: merging and grouping data, for example, to get it ready for analysis. Python has many built-in features that help with data wrangling, making it a popular alternative to Microsoft Excel, especially when working with more complicated datasets.
3. R - R is another open-source programming language, used for statistical computing and often serving as a complementary tool to Python. It is particularly popular among data analysts because of its output: R offers a great variety of tools for presenting and communicating the results of data analysis.
4. SAS - SAS is a command-driven software package used for carrying out advanced statistical analysis and data visualization, offering a wide variety of statistical methods and algorithms, customizable options for analysis and output, and publication-quality graphics. It is also one of the most widely used software packages in the industry.
5. SQL - SQL stands for Structured Query Language, a language used to access and manipulate databases. You can think of SQL as a tool that allows you to communicate with and access the data in a database, which is necessary if you want to retrieve data that is useful for analysis. Most large businesses use some form of SQL to store their big data, so learning SQL is essential if you want to become a data analyst.
6. RapidMiner - RapidMiner is a software package used for data mining (uncovering patterns in data), text mining, predictive analytics, and machine learning, and it is used by data analysts and data scientists alike. RapidMiner comes with a wide range of features, including data modeling, validation, and automation.
7. FineReport - FineReport is another business intelligence tool, used to monitor performance, identify trends in data, and create reports and dashboards. It is an especially user-friendly tool that is popular with both data analysts and non-data experts.

Future Trends in Data Analytics

Future trends in data analytics refer to new developments and technologies that are shaping how data is collected, processed, and used. These trends help organizations make better decisions and predictions.

1. Smarter Analytics with AI - First, we have smarter analytics powered by artificial intelligence (AI). AI is becoming more advanced, allowing us to analyze data in ways that are faster, more scalable, and more cost-effective. You can leverage AI to parse vast datasets, predict user behavior, and personalize content at scale, which in turn allows you to understand user behavior and optimize product features to meet customer needs at a much faster pace.
Example: A retail company like Amazon uses AI to recommend products based on previous purchases and browsing behavior, improving the overall shopping experience.
2.
Natural Language Processing (NLP) - Natural language processing is a technology that allows computers to understand, interpret, and respond to human language in a way that feels natural. NLP is revolutionizing the way you perform customer sentiment analysis. It can help you collect and process massive amounts of qualitative data, such as survey responses, support tickets, or social media comments, and transform it into valuable insights you can use to follow a more customer-centric strategy.
Example: A company uses NLP-powered analytics tools to gather and summarize the qualitative feedback it needs for decision-making.
3. Cloud Computing - Another big trend is cloud computing. More and more data analysis is moving to the cloud, which means we can store and process vast amounts of data online rather than on local servers. Cloud computing also offers better accessibility and safer data storage due to its centralized nature, making it a cost-efficient and flexible solution for most companies.
Example: Using Google Cloud or AWS to analyze large datasets.
4. Synthetic Data - Synthetic data, as the name suggests, is data that has been artificially generated when real-world data is insufficient or non-existent. It is used either to perform validation tests or to train AIs.
5. Data Security and Privacy - Finally, as we handle more and more data, data security and privacy are becoming even more critical. With increasing amounts of sensitive information being analyzed, businesses must ensure that this data is protected. Data analytics helps identify potential security threats before they become serious issues. By analyzing patterns and anomalies in network traffic or user behavior, organizations can detect unusual activities that may indicate a security breach or cyberattack.
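As a closing illustration of the anomaly-detection idea behind that last trend, the sketch below (made-up data and an arbitrary cutoff, not a production technique) flags values that sit far from the baseline, the way a monitoring system might flag a spike in login attempts.

```python
import statistics

def find_anomalies(values, z_threshold=2.5):
    """Flag points more than `z_threshold` standard deviations from
    the mean. The cutoff is arbitrary; real systems tune it carefully."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - mean) / stdev > z_threshold]

# Made-up hourly login counts; the 480 could indicate a brute-force attempt.
logins = [12, 15, 11, 14, 13, 12, 480, 14, 13, 12]
print(find_anomalies(logins))  # [480]
```

Real intrusion-detection systems layer far more context on top of this (per-user baselines, seasonality, correlated signals), but the underlying question is the same: which observations deviate too far from normal behavior?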