Data Analytics - Module 1 PDF
Kanad Institute of Engineering and Management
Summary
This document provides an introduction to data analytics, covering different types of data (qualitative, quantitative, discrete, continuous) and the concepts of nominal and ordinal data. It also describes some business applications of data-analytics techniques.
Full Transcript
Data Analytics Skills for Managers
Paper Code: BBABB502, Module 1
BBA Department, Institute of Engineering and Management

Data, Information, Knowledge, and Wisdom

Data
A single observation is known as a data point, and a collection of data points is a data set. In statistics, a reference to "data" means a collection of data, i.e., a data set. Data items are elementary, recorded descriptions of things, events, activities, and transactions; on their own they convey no specific meaning because they lack context and interpretation.

Information
Information is data that has been "cleaned" of errors and further processed in a way that makes it easier to measure, visualize, and analyze for a specific purpose. Information is data that has been shaped into a form that is meaningful and useful to human beings and adds value to the understanding of a subject.

Knowledge
Knowledge is data and/or information that has been organized and processed to convey understanding and experience.

Wisdom
Wisdom is accumulated knowledge, which allows one to understand how to apply concepts to new situations or problems.

Types of Data

Qualitative or Categorical Data
Qualitative data, also known as categorical data, is data that fits into categories. Qualitative data are not numerical, and arithmetic operations cannot be performed on them. We can summarize categorical data by counting the number of observations or computing the proportion of observations in each category. Sometimes categorical data can hold numerical values, but those values have no mathematical meaning. Examples of categorical data are birthdate, favourite sport, and school postcode: the birthdate and school postcode hold numeric values, but those values carry no numerical meaning.

Nominal Data
Nominal data is used to label variables without any order or quantitative value. The colour of hair can be considered nominal data, as one colour cannot be compared with another.
The name "nominal" comes from the Latin word "nomen", which means "name". With nominal data we cannot perform numerical operations, nor can we impose an order to sort the data.

Examples of Nominal Data:
- Colour of hair (Blonde, Red, Brown, Black, etc.)
- Marital status (Single, Widowed, Married)
- Nationality (Indian, German, American)
- Gender (Male, Female, Others)
- Eye colour (Black, Brown, etc.)

Ordinal Data
Ordinal data have a natural ordering, where each value has a position on a scale. These data are used for observations like customer satisfaction, happiness, etc., but we cannot perform arithmetic on them. Ordinal data is qualitative data whose values have some kind of relative position. Ordinal data only shows a sequence and cannot be used for most statistical calculations.

Examples of Ordinal Data:
- Feedback, experience, or satisfaction ratings on a scale of 1 to 10
- Letter grades in an exam (A, B, C, D, etc.)
- Ranking of people in a competition (First, Second, Third, etc.)
- Economic status (High, Medium, Low)
- Education level (Higher, Secondary, Primary)

Quantitative Data
Quantitative data can be expressed in numerical values, which makes it countable and amenable to statistical analysis. Arithmetic operations, such as addition, subtraction, multiplication, and division, can be performed on it. This kind of data answers questions like "how much", "how many", and "how often". Quantitative data can be represented on a wide variety of graphs and charts, such as bar graphs, histograms, scatter plots, box plots, pie charts, and line graphs.

Examples of Quantitative Data:
- Height or weight of a person or object
- Room temperature
- Scores and marks (e.g., 59, 80, 60)
- Time

Discrete Data
The term discrete means distinct or separate. Discrete data contain values that are integers or whole numbers.
These data cannot be broken into decimal or fractional values. Discrete data are countable and have finite values; their subdivision is not possible. These data are represented mainly by bar graphs, number lines, or frequency tables.

Examples of Discrete Data:
- Total number of students present in a class
- Number of employees in a company
- Total number of players who participated in a competition
- Days in a week

Continuous Data
Continuous data can take fractional values. Continuous data represents information that can be divided into ever smaller levels. A continuous variable can take any value within a range, so it has an infinite number of possible values within that range.

Examples of Continuous Data:
- Height of a person
- Speed of a vehicle
- Time taken to finish a piece of work
- Market share price

Data Quality
Data quality tells us how reliable a particular set of data is and whether or not it is good enough for a user to employ in decision-making.

Data Quality Dimensions

Accuracy
Accuracy is the degree to which data correctly describes the "real-world" object or event being described. Analysts should use verifiable sources to confirm accuracy, which is determined by how closely the values agree with verified, correct information sources.

Completeness
Completeness measures the data's ability to deliver all the mandatory values successfully. Completeness is concerned with comprehensiveness: data can be complete even if optional values are missing. As long as the data meets expectations, it is considered complete. A relevant measure for completeness is the percentage of data fields that are populated.

Consistency
Data consistency describes the data's uniformity as it moves across applications and networks and when it comes from multiple sources. Consistency also means that the same datasets stored in different locations should not contradict each other.
Ideally, each datum should be recorded only once, but in practice data is often duplicated for IT, performance, operational, and legacy-system reasons. Consistency might be measured as the percentage of data items deemed consistent.

Timeliness
Timely data should be readily available whenever it is needed. This dimension also covers keeping the data current; data should undergo real-time updates to ensure that it is always available and accessible. What counts as timely will vary depending on the context. A relevant measure for timeliness is the interval between data being generated and data being available.

Uniqueness
Uniqueness means that there is no duplicated or redundant information overlapping across datasets. No record in a dataset should exist multiple times. Analysts use data cleansing and deduplication to address a low uniqueness score.

Validity
Validity is concerned with the degree to which the data makes sense. Data must be collected according to the organization's defined business rules and parameters. The information should also conform to the correct, accepted formats, and all dataset values should fall within the proper range.

Data Science
Data science refers to the scientific management of data and the data-related processes, techniques, and skills used to derive viable information, findings, and knowledge from data belonging to various fields. It is a broad term that covers the collection, extraction, purification, manipulation, enumeration, tabulation, combination, examination, interpretation, simulation, visualization, and other such processes applied to data. The processes and techniques applied to data are drawn from many different disciplines, including computer science, mathematics, and statistical analysis.
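Quality dimensions such as completeness and uniqueness can be measured mechanically. The following is a minimal sketch, not part of the course material: the record list, field names, and helper functions are hypothetical examples of how the "percentage of fields populated" and "no record should exist multiple times" ideas described above could be computed.

```python
# Hypothetical sketch of measuring two data quality dimensions:
# completeness (percentage of mandatory values present) and
# uniqueness (duplicate records beyond the first occurrence of a key).

def completeness(records, mandatory_fields):
    """Percentage of mandatory field values that are present (non-None, non-empty)."""
    total = len(records) * len(mandatory_fields)
    filled = sum(
        1
        for r in records
        for f in mandatory_fields
        if r.get(f) not in (None, "")
    )
    return 100.0 * filled / total if total else 100.0

def duplicate_count(records, key_fields):
    """Number of records beyond the first occurrence of each key value."""
    seen = set()
    dupes = 0
    for r in records:
        key = tuple(r.get(f) for f in key_fields)
        if key in seen:
            dupes += 1
        else:
            seen.add(key)
    return dupes

records = [
    {"id": 1, "name": "Asha", "email": "asha@example.com"},
    {"id": 2, "name": "Ravi", "email": ""},                   # missing email
    {"id": 1, "name": "Asha", "email": "asha@example.com"},   # duplicate of id 1
]

print(completeness(records, ["id", "name", "email"]))  # 88.88... (8 of 9 values present)
print(duplicate_count(records, ["id"]))                # 1
```

In a real organization the mandatory fields and key fields come from the defined business rules mentioned under Validity; the mechanics of counting remain the same.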
Data science finds substantial application in the fields of national defence and safety, medical science, the social sciences, and business management areas such as marketing, production, finance, and even training and development. In simple terms, data science is an all-encompassing term for the tools and methods used to derive insightful information from data.

Big Data
According to an estimate by IBM, 2.5 quintillion bytes of data are created every day. A recent report by DOMO estimates the amount of data generated every minute on popular online platforms:
- Facebook users share nearly 4.16 million pieces of content
- Twitter users send nearly 300,000 tweets
- Instagram users like nearly 1.73 million photos
- YouTube users upload 300 hours of new video content
- Apple users download nearly 51,000 apps
- Skype users make nearly 110,000 new calls
- Amazon receives 4,300 new visitors
- Uber passengers take 694 rides
- Netflix subscribers stream nearly 77,000 hours of video

Big data is the enormous repository of data garnered by organizations from a variety of sources such as smartphones and other multimedia devices, mobile applications, geographical location-tracking devices, remote-sensing and radio-wave reading devices, wireless sensing devices, and other similar sources (Yin and Kaynak 2015). Gartner defines "big data as high-volume, and high velocity or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation" (Gartner Inc. 2021).

Raw Big Data Sources
- Logs: Logs generated by web applications and servers, which can be used for performance monitoring.
- Transactional data: Transactional data generated by applications such as e-commerce, banking, and financial systems.
- Social media: Data generated by social media platforms.
- Databases: Structured data residing in relational databases.
- Sensor data: Sensor data generated by Internet of Things (IoT) systems.
- Clickstream data: Clickstream data generated by web applications, which can be used to analyze users' browsing patterns.
- Surveillance data: Sensor, image, and video data generated by surveillance systems.
- Healthcare data: Healthcare data generated by Electronic Health Record (EHR) and other healthcare applications.
- Network data: Network data generated by network devices such as routers and firewalls.

Structured Data
Structured data conforms to a data model or schema and is often stored in tabular form. It is used to capture relationships between different entities and is therefore most often stored in a relational database. Structured data is frequently generated by enterprise applications and information systems such as ERP and CRM systems. Because of the abundance of tools and databases that natively support structured data, it rarely requires special consideration with regard to processing or storage. Examples of this type of data include banking transactions, invoices, and customer records.

Unstructured Data
Unstructured data is most often categorized as qualitative data, and it cannot be processed and analyzed using conventional data tools and methods. Data that does not conform to a data model or data schema is known as unstructured data. It is estimated that unstructured data makes up most of the data within any given enterprise, and it has a faster growth rate than structured data. Examples of unstructured data include text, video files, audio files, mobile activity, social media posts, satellite imagery, and surveillance imagery. Unstructured data is difficult to deconstruct because it has no predefined data model, meaning it cannot be organized in relational databases, nor can it be directly processed or queried using SQL. If it must be stored within a relational database, it is stored in a table as a Binary Large Object (BLOB).
Alternatively, a Not-only SQL (NoSQL) database, a non-relational database, can be used to store unstructured data alongside structured data.

Semi-structured Data
Semi-structured data has a defined level of structure and consistency but is not relational in nature. This kind of data is commonly stored in files that contain text; for instance, JSON and XML files are common forms of semi-structured data. Because of the textual nature of this data and its conformance to some level of structure, it is more easily processed than unstructured data. Common sources of semi-structured data include electronic data interchange (EDI) files, spreadsheets, RSS feeds, and sensor data. Semi-structured data often has special pre-processing and storage requirements, especially if the underlying format is not text-based.

Metadata
Metadata is simply data about data: a description of the data and its context. It helps to organize, find, and understand data by providing information about a dataset's characteristics and structure. This type of data is mostly machine-generated and can be appended to data. Tracking metadata is crucial to Big Data processing, storage, and analysis because it records the pedigree of the data and its provenance during processing. Typical metadata elements include: title and description; tags and categories; who created the data and when; who last modified it and when; who can access or update it; and attributes such as the file size and resolution of a digital photograph. Big Data solutions rely on metadata, particularly when processing semi-structured and unstructured data.

Characteristics of Big Data

Volume
Big data is data whose volume is so large that specialized tools and frameworks are required to store, process, and analyze it.
Social media applications process billions of messages every day, industrial and energy systems can generate terabytes of sensor data every day, and cab-aggregation applications can process millions of transactions in a day. The volume of data generated by modern IT, industrial, healthcare, Internet of Things, and other systems is growing exponentially, driven by the falling costs of data storage and processing architectures and by the need to extract valuable insights from the data to improve business processes, efficiency, and service to consumers. There is no fixed threshold above which data counts as big data; typically, the term is used for massive-scale data that is difficult to store, manage, and process using traditional databases and data processing architectures.

Velocity
Velocity refers to how fast data is generated. Data from certain sources, such as social media or sensors, can arrive at very high velocities. Velocity is another important characteristic of big data and the primary reason for the exponential growth of data: a high velocity of data causes the accumulated volume to become very large in a short span of time. Some applications have strict deadlines for data analysis (such as trading or online fraud detection), and the data needs to be analyzed in real time. Specialized tools are required to ingest such high-velocity data into the big data infrastructure and analyze it in real time.

Variety
Variety refers to the forms of the data. Big data comes in different forms such as structured, semi-structured, and unstructured data, including text, image, audio, video, and sensor data. Data variety brings challenges for enterprises in terms of data integration, transformation, processing, and storage. Structured data can easily be processed using traditional methods based on relational databases.
Unstructured data, such as sensor data, log data, and data generated by social media, is difficult to process in comparison with structured data. Big data systems need to be flexible enough to handle such a variety of data.

Veracity
Veracity refers to how accurate the data is. The data in a dataset can be part of the signal or part of the noise. Noise is data that cannot be converted into information and thus has no value, whereas signal has value and leads to meaningful information. Data with a high signal-to-noise ratio has more veracity than data with a lower ratio. To extract value from data, it needs to be cleaned to remove noise. Data acquired in a controlled manner, for example via online customer registrations, usually contains less noise than data acquired via uncontrolled sources, such as blog postings; the signal-to-noise ratio of data thus depends on the source and type of the data. Data-driven applications can reap the benefits of big data only when the data is meaningful and accurate, so cleansing the data to filter out incorrect and faulty values is important.

Value
Value is defined as the usefulness of data for an enterprise. The value characteristic is intuitively related to the veracity characteristic: the higher the data's fidelity, the more value it holds for the business. Value also depends on how long data processing takes, because analytics results have a shelf-life; for example, a stock quote delayed by 20 minutes has little to no value for making a trade compared with a quote that is 20 milliseconds old. Value and time are inversely related: the longer it takes for data to be turned into meaningful information, the less value it has for a business. Stale results inhibit the quality and speed of informed decision-making.
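The cleansing step described under Veracity can be sketched in a few lines. This is a toy illustration, not part of the course material: the temperature readings, the error code, and the valid range are hypothetical, and real validity rules come from the organization's business requirements.

```python
# Hypothetical sketch of separating signal from noise in a set of readings
# by applying a simple validity range.

def clean(readings, low, high):
    """Split readings into signal (within the valid range) and noise (outside it)."""
    signal = [r for r in readings if low <= r <= high]
    noise = [r for r in readings if not (low <= r <= high)]
    return signal, noise

# Room-temperature readings in Celsius; -999.0 is an assumed sensor error code.
readings = [21.4, 22.0, -999.0, 21.8, 150.0, 22.3]
signal, noise = clean(readings, low=-40.0, high=60.0)

print(signal)  # [21.4, 22.0, 21.8, 22.3]
print(noise)   # [-999.0, 150.0]
```

Here four of six readings survive cleansing; data from an uncontrolled source would typically yield a lower signal-to-noise ratio and therefore lower veracity.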
Figure 1 provides two illustrations of how value is affected by the veracity of the data and the timeliness of the generated analytic results.

Data Analytics
Data analytics is the application of algorithmic techniques, methods, or code to big data, or to sets of it, to derive useful and pertinent conclusions (Aalst 2016). When the analytical part of data science is applied to big data or raw data in order to derive meaningful insights and information, it is called data analytics. It has gained a lot of attention and practical application across industries for strategic decision-making, theory building, theory testing, and theory disproving. The thrust of data analytics is on the inferential conclusions arrived at after computing analytical algorithms. Data analytics involves manipulating big data to obtain contextual meaning from which business strategies can be formulated. Organizations use a blend of machine-learning algorithms, artificial intelligence, and other systems or tools for data-analytics tasks: insightful decision-making, creative strategy planning, serving consumers in the best manner, and improving performance to increase revenues while ensuring sustainable bottom lines.

Descriptive Analytics
Descriptive analytics answers the question of what has happened; it simply answers the question of what the data shows. Descriptive analytics uses a large amount of data to find out what happened in a business over a given period and how that differs from another comparable period. These statistics help describe patterns in the data and present it in summarized form. Descriptive analytics is one of the most basic forms of analytics, used by any organization to get an overview of what has happened in the business. Using descriptive analytics on historical data, decision-makers within the organization can get a complete view of the trend on which to base their business strategy.
It also helps in identifying the strengths and weaknesses within an organization. Being an elementary analytics technique, it is usually used in conjunction with more advanced techniques, such as predictive and prescriptive analytics, to generate meaningful results.

Diagnostic Analytics
Diagnostic analytics looks into the reasons for or causes of an event, supplementing the findings of descriptive analytics. It answers the question "why, or what led to, a specific event?" by delving into the facts to direct the future course of planning. Diagnostic analytics takes a deep dive into the data and tries to find valuable hidden insights. It aims first to diagnose the problems in the data sets and then to dissect the reasons behind those problems using techniques such as regression or probability analysis. Diagnostic analytics does not generate any new outcome; rather, it provides the reasoning behind already known results. This type of analytics is widely used across fields: in medicine to diagnose the cause of problems, in marketing to find the specific reasons behind consumer behaviour, and even in finance to understand the cause behind an investment decision.

Predictive Analytics
Predictive analytics aims to predict, or prognosticate, what could happen in the future. It answers the question "what events could unfold, or flare up, in the future?" In predictive analytics, historical and transactional data are usually used to identify risks and opportunities for the future. Predictive analytics gives organizations a concrete base on which to plan their future actions. This allows them to make decisions that are more accurate and fruitful than those based on pure assumption or manual analysis of data.
Within the available data sets, predictive analytics searches for patterns or trends in events that could pan out in the future, and then estimates the probabilities of those events. It provides predictive insights in areas such as retailing and commerce (rolling out products aligned with consumer preferences), stock markets (predicting future stock prices), and even project appraisal (forecasting the risks posed).

Prescriptive Analytics
Prescriptive analytics is a branch of data analytics that helps determine the best possible course of action for a particular scenario. Unlike predictive analytics, prescriptive analytics does not predict a direct outcome; rather, it provides a strategy for finding the most optimal solution for a given scenario. Of all the forms of business analytics, prescriptive analytics is the most sophisticated and is capable of bringing the greatest intelligence and value to businesses.

How Prescriptive Analytics Works
Prescriptive analytics usually relies on advanced artificial-intelligence techniques, such as machine learning and deep learning, to learn from and advance with the data it acquires, working as an autonomous system without the need for human intervention. Prescriptive-analytics models can also adjust their results automatically as new data sets become available.

Examples of Prescriptive Analytics
The power of prescriptive analytics can be leveraged by any data-intensive business or government agency. A space agency can use prescriptive analytics to determine whether constructing a new launch site could endanger a species living nearby. This analysis can inform the decision to relocate the species to another location or to change the location of the launch site itself.
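The core prescriptive idea, recommending an action rather than predicting one outcome, can be illustrated with a toy pricing decision. This sketch is not from the course material: the candidate prices, demand estimates, and scenario probabilities are hypothetical, and real prescriptive systems use machine-learning and optimization techniques rather than a hand-written table.

```python
# Toy illustration of prescriptive analytics: evaluate every candidate action
# (here, a selling price) against several demand scenarios and recommend the
# action with the highest expected profit. All numbers are hypothetical.

# Predicted demand (units) at each candidate price under three scenarios,
# together with the probability of each scenario.
scenarios = [
    {"probability": 0.3, "demand": {99: 1200, 109: 1000, 119: 700}},  # strong demand
    {"probability": 0.5, "demand": {99: 900,  109: 700,  119: 450}},  # typical demand
    {"probability": 0.2, "demand": {99: 600,  109: 400,  119: 250}},  # weak demand
]
unit_cost = 60

def expected_profit(price):
    """Probability-weighted profit of charging `price` across all scenarios."""
    return sum(
        s["probability"] * (price - unit_cost) * s["demand"][price]
        for s in scenarios
    )

candidate_prices = [99, 109, 119]
best_price = max(candidate_prices, key=expected_profit)
print(best_price, round(expected_profit(best_price)))  # 99 36270
```

Note the division of labour: the demand table plays the role of the predictive model's output, while the comparison across candidate actions is the prescriptive step.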
Benefits of Prescriptive Analytics
Prescriptive analytics gives an organization the ability to:
- Discover the path to success: prescriptive-analytics models can combine data and operations to provide a road map of what to do and how to do it most efficiently, with minimum error.
- Minimize the time required for planning: the outcomes generated by prescriptive-analytics models reduce the time and effort the organization's data team needs to plan a solution, enabling it to quickly design and deploy an efficient solution.
- Minimize human intervention and error: prescriptive-analytics models are usually fully automated and require very little human intervention, which makes them highly reliable and less prone to error than manual analysis by data scientists.

Applications of Data Analytics in Business

1. Production and Inventory Management
- In product development, for gaining knowledge about customer needs, wants, preferences, and the latest trends.
- In supply chain management, for keeping inbound logistics flowing.
- In inventory management, for maintaining economic order quantity, just-in-time purchasing, and ABC analysis of stock items.
- In production processes, for analyzing the productive efficiency gained from the resources put to use.

2. Sales and Operations Management
- In retail sales management, for product shelf display and replenishment and for running special discount sales and loyalty programs.
- In outbound logistics, to ensure proper physical distribution to different business locations.
- In warehouse and storage management, for maintaining proper upkeep and ready-to-serve stock.

3. Price Setting and Optimization
- In price determination for goods and services, for analyzing indicators such as factor input costs, competitors' price lists, and price-elasticity trends.
- In tax and duty adjustments, regarding the computation of different duties, levies, and taxes.
- In determining features such as discounts, rebates, special prices, and coupons.
- In optimizing input costs and overhead costs to maintain sustainable profitability.

4. Finance and Investments
- In the stock market, to track stock performance, future trends, and a company's future earning potential.
- In capital budgeting, for making investment decisions, dividend decisions, or determining the valuation of a firm.
- In investment banking, for lead book-running and for arriving at merger and amalgamation decisions.
- In credit-rating generation, financial fraud detection and prevention, portfolio creation, and the management of diversification.

5. Marketing Research
- In formulating segmenting, targeting, and positioning strategies.
- In search engine optimization, to return the best and most relevant results for search queries run in real time.
- In advertising, from idea conceptualization to content creation, the design of banners and billboards, and the targeting of advertisements.
- In building recommendation systems in this era of e-commerce, so that products and services reach the appropriate, targeted audiences.
- In consumer-relationship-building activities: maintaining close links and contact with consumers, personalizing marketing activities for brand loyalty, and constantly improving the business's ability to provide memorable consumer experiences.

6. Human Resource Management
- In recruitment and selection, for conducting background checks, screening candidates, and calling eligible candidates for interview.
- In training and development, for building and polishing the skills that employees lack or infusing new skills as trends require.
- In compensation management, for the motivation, retention, and satisfaction of employees through a good mix of pecuniary and non-pecuniary incentives.
- In performance appraisal, for information regarding employee promotions and transfers, career development, and attrition rates.

SAMPLE QUESTIONS
1. How can the knowledge of data analytics be applied in finance and investment management?
2. Analyze how human resource managers have become increasingly dependent on data analytics.
3. How can the knowledge of data analytics be applied in marketing research?
4. Explain the concept of predictive analytics.
5. Explain the concept of prescriptive analytics.