Summary

This document is a study guide on data analytics for CMA Intermediate students. It covers the fundamentals of data science, data processing, and visualization, with examples and explanations to aid understanding. It explains how raw data is turned into information and knowledge, and describes the types of data used in finance and costing.

Full Transcript


In the age of information, we find ourselves engulfed by a seemingly endless sea of data. With each passing day, our world generates an unprecedented volume of information, both structured and unstructured, reflecting the diverse facets of human existence. Amidst this deluge of data lies the key to unlocking valuable insights and understanding the complex patterns that govern our lives.

This book is around 100 pages long and consists of 4 chapters. In these pages, we embark on a journey that begins with the fundamentals of data analytics. From understanding the significance of data and its various types to exploring the art of data collection and organization, we lay the groundwork for the reader to become well-versed in the essentials. We delve into the techniques of data cleaning, preprocessing, and transformation, crucial steps that pave the way for accurate and reliable analysis. This book has been brought together using the ICMAI Module, Past Year Papers, Revision Test Papers and different reference books.

The heart of data analytics lies in extracting knowledge and insights from raw data. This book introduces a diverse array of analytical methods: descriptive analytics that summarizes and interprets data, diagnostic analytics that identifies the root causes of specific events, predictive analytics that forecasts future trends and outcomes, and prescriptive analytics that guides decision-making through actionable recommendations.

Key features of this book:
1. Covers theory explaining the basics of data analytics.
2. Uses simple and concise language for easy and quick understanding.
3. Covers the institute's study material.
4. Logical arrangement of topics.
5. Diagrammatic presentation where needed.
DATA ANALYTICS

INDEX
Chapter 1: Introduction to Data Science
Chapter 2: Data Processing, Organization, Cleaning and Validation
Chapter 3: Data Presentation: Visualization and Graphical Presentation
Chapter 4: Data Analysis and Modelling

1. INTRODUCTION TO DATA SCIENCE FOR BUSINESS DECISION-MAKING

This module includes:
1.1 Meaning, Nature, Properties, Scope of Data
1.2 Types of Data in Finance and Costing
1.3 Digitization of Data and Information
1.4 Transformation of Data to Decision-Relevant Information
1.5 Communication of Information for Quality Decision-Making
1.6 Professional Skepticism regarding Data
1.7 Ethical Use of Data and Information

UNIT-1. MEANING, NATURE, PROPERTIES, SCOPE OF DATA

There is a saying that 'data is the new oil'. Over the last few years, with the advent of increasing computing power and availability of data, the importance and application of data science has grown exponentially. The field of finance and accounts has not remained untouched by this wave. In fact, to become an effective finance and accounts professional, it is very important to understand, analyse and evaluate data sets.

1.1.1. What is data and how is it linked to information and knowledge?
Data is a source of information, and information needs to be processed for gathering knowledge. Any 'data' on its own does not confer any meaning.
The relationship between data, information, and knowledge may be depicted as in Figure 1.1 below:

Figure 1.1: Relationship between data, information and knowledge

The idea of data in the syllabus is frequently referred to as 'raw' data, which is a collection of meaningless text, numbers, and symbols. Examples of raw data could be:
2, 4, 6, 8...
Amul, Nestle, ITC...
36, 37, 38, 35, 36...

Figure 1.2: Raw data (data, information and knowledge)

Figure 1.2 above shows a few data series. It is almost impossible to decipher what these data series are about, because we do not know their exact context. The first series may be a multiplication table of 2; alternatively, it may be the marks obtained by students in a class test with full marks of 20. The second series names a few Indian brands, but we do not know why the names appear here at all. To cut a long story short, we must know the context to which the raw data relates. Any 'data' on its own cannot convey any information.

1.1.2. What is information?
As discussed, data needs to be processed for gathering information. Most commonly, we take the help of computers and software packages for processing data. An exponential growth in the availability of computing power and software packages has led to the growth of data science in recent years. If we say that the first series in Figure 1.2 is really the first four numbers of the multiplication table of 2, and the third series is the highest temperature of Kolkata during the previous four days, we are actually discovering some information out of the raw data. So we may now say:

Information = Data + Context
2, 4, 6, 8... (multiplication table of 2)
Amul, Nestle, ITC... (three FMCG companies listed on the NSE)
36, 37, 38, 35, 36... (highest temperature in Kolkata for the last four days)

Figure 1.3: Information = Data + Context

1.1.3. What is knowledge?
When this information is used for solving a problem, we call it the use of knowledge. By having information about the highest temperatures in Kolkata for a month, we may try to estimate the sale of air conditioners. If our intention is to analyse the profitability of listed FMCG companies in India, the first information we need is the names of the FMCG companies. So we may say:

Knowledge = Information + Application of it
2, 4, 6, 8... (multiplication table of 2) - (the table of 3 should start from 3)
Amul, Nestle, ITC... (three FMCG companies listed on the NSE) - (these three companies' financial performance should be analysed to understand the Indian FMCG sector)
36, 37, 38, 35, 36... (highest temperature in Kolkata for the last four days) - (the sale of ACs may be estimated using this information)

Figure 1.4: Knowledge = Information + Application of it

1.1.4. Nature of data
The magnitude and availability of data have grown exponentially over the years. Data sets may be classified into different groups as below:

(i) Numerical data: Any data expressed as a number is numerical data. In finance, a prominent example is stock price data. Figure 1.5 below shows the daily stock prices of the HUL stock. This is an example of numerical data.

Figure 1.5: Stock price of HUL (Source: finance.yahoo.com)
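As a small illustration of the same idea in code, the sketch below (assuming the pandas library; the closing prices are made up, not actual HUL quotes) shows how bare numbers become information once a context, here trading dates and a ticker, is attached, and how a derived figure such as the daily return is what a decision maker can actually use.

```python
# A minimal sketch: the bare numbers are 'raw data'; attaching a context
# (trading dates, the series name) turns them into information, and a derived
# measure such as the daily return is what a decision-maker can act on.
# The prices below are made up for illustration only.
import pandas as pd

raw_data = [2510.0, 2523.5, 2498.0, 2531.25, 2544.6]   # meaningless on their own

prices = pd.Series(
    raw_data,
    index=pd.date_range("2023-04-03", periods=5, freq="B"),  # context: trading days
    name="HUL_close",
)

daily_return_pct = prices.pct_change().mul(100).round(2)     # information derived from data
print(prices)
print(daily_return_pct)
```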
(ii) Descriptive data: Sometimes information may be presented in the form of a qualitative description. Look at the paragraph in Figure 1.6, extracted from the annual report of HUL (2021-22). This is descriptive data provided by HUL in its annual report, and a user may use it to make a judicious investment decision.

"Leading social and environment change: At Hindustan Unilever, we have always strived to grow our business while protecting the planet and doing good for the people. We believe that to generate superior long-term value, we need to care for all our stakeholders - our consumers, customers, employees, shareholders and above all, the planet and society. We call it the multi-stakeholder model of sustainable growth. With more people entering the consumption cycle and adding to the pressure on natural resources, it will become even more important to decouple growth from environmental impact and drive positive social change."

Figure 1.6: Descriptive data extracted from the HUL annual report (2021-22)

(iii) Graphic data: A picture or graphic may tell a thousand stories. Data may also be presented in the form of a picture or graphics. For example, the stock price of HUL may be presented in the form of a chart (Figure 1.7).

Figure 1.7: Graphic representation of HUL stock prices (Source: google.com)

UNIT-2. TYPES OF DATA IN FINANCE AND COSTING

Data plays a very important role in the study of finance and cost accounting. From the inception of the study of finance, accounting and cost accounting, data has always played an important role. Be it in the form of financial statements or cost statements, finance and accounting professionals have played a significant role in helping management make prudent decisions. The kinds of data used in finance and costing may be quantitative as well as qualitative in nature.

1. Quantitative financial data: By the term 'quantitative data', we mean data expressed in numbers. The availability of quantitative data in finance is significant, as most financial records are maintained in the form of organised numerical data. Stock price data, financial statements etc. are examples of quantitative data.

2. Qualitative financial data: Some data in financial studies may appear in a qualitative format, e.g. text, videos, audio etc. These types of data may be very useful for financial analysis. For example, the 'management discussion and analysis' presented as part of the annual report of a company is mostly in the form of text. This information is useful for getting an insight into the performance of the business. Similarly, key executives often appear for interviews on business channels. These interactions are often goldmines of data and information.

Types of data
There is another way of classifying data. Data may also be classified as:
(i) Nominal
(ii) Ordinal
(iii) Interval
(iv) Ratio

Each gives a distinct set of traits that influences the sort of analysis that may be conducted. The differentiation between the four scale types is based on three basic characteristics:
(a) whether the sequence of answers matters or not,
(b) whether the gap between observations is significant or interpretable, and
(c) the existence or absence of a genuine zero.

We will briefly discuss these four types below:

(i) Nominal scale: The nominal scale is used for categorising data. Under this scale, observations are classified based on certain characteristics. The category labels may contain numbers but have no numerical value. Examples include classifying equities into small-cap, mid-cap, and large-cap categories, or classifying funds as equity funds, debt funds, and balanced funds.

(ii) Ordinal scale: The ordinal scale is used for classifying data and putting it in order. The numbers just indicate an order; they do not specify how much better or worse a stock at a specific price is compared with one at a lower price. An example is ranking the top 10 stocks by P/E ratio.

(iii) Interval scale: The interval scale is used for categorising and ranking using equal intervals. Equal intervals separate neighbouring scale values. Because the scale's zero point is arbitrary, ratios cannot be calculated. Temperature scales are an example: a temperature of 40 degrees is 5 degrees higher than a temperature of 35 degrees, but 0 degrees Celsius does not indicate the absence of temperature. A temperature of 20 degrees is thus not twice as hot as a temperature of 10 degrees.

(iv) Ratio scale: The ratio scale possesses all characteristics of the nominal, ordinal, and interval scales. Data on a ratio scale can not only be classified and ranked but also has equal intervals. A ratio scale has a true zero, meaning that zero has a significant value; this genuine zero allows magnitudes to be compared. Length, time, mass, money, age, etc. are typical examples of ratio scales. For data analysis, a ratio scale may be utilised to measure sales, pricing, market share, and client count.
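The four scales can be mirrored directly in analysis software. The short sketch below is an illustrative example only (assuming pandas; all values are invented) of which operations are meaningful on each scale.

```python
# A small illustrative sketch of the four measurement scales and the
# operations that are meaningful on each. All values are made up.
import pandas as pd

funds = pd.DataFrame({
    # Nominal: labels only -- counting/grouping is meaningful, ordering is not.
    "category": pd.Categorical(["equity", "debt", "balanced", "equity"]),
    # Ordinal: ordered labels -- ranking is meaningful, differences are not.
    "risk_grade": pd.Categorical(
        ["low", "high", "medium", "high"],
        categories=["low", "medium", "high"], ordered=True),
    # Interval: differences are meaningful, but there is no true zero
    # (20 deg C is not 'twice as hot' as 10 deg C).
    "office_temp_c": [20.0, 10.0, 25.0, 30.0],
    # Ratio: a true zero exists, so ratios make sense (a 200 NAV is twice 100).
    "nav": [100.0, 200.0, 150.0, 50.0],
})

print(funds["category"].value_counts())             # nominal: frequencies
print(funds["risk_grade"].max())                    # ordinal: ordering is defined
print(funds["office_temp_c"].diff())                # interval: differences are valid
print(funds["nav"].iloc[1] / funds["nav"].iloc[0])  # ratio: 2.0 is meaningful
```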
UNIT-3. DIGITIZATION OF DATA AND INFORMATION

In plain terms, digitization means the process of converting data and information from analogue to digital format. The data in its original form may be stored as an object, a document or an image. The objective of digitization is to create a digital surrogate of the data and information in the form of binary numbers that facilitates processing using computers. There are primarily two basic objectives of digitization: first, to provide widespread access to data and information to a very large group of users simultaneously; second, to help preserve data for a longer period. One of the largest digitization projects taken up in India is the 'Unique Identification Number' (UID) or 'Aadhaar' (Figure 1.8).

Figure 1.8: UID - Aadhaar - globally one of the largest projects of digitization (Source: https://uidai.gov.in/about-uidai/unique-identification-authority-of-india/vision-mission.html)

Digitization brings some great advantages, which are mentioned below.

Why we digitize
There are many arguments that favour digitization of records. Some of them are:
- Improves classification and indexing of documents, which helps in retrieval of the records.
- Digitized records may be accessed by more than one person simultaneously.
- It becomes easier to reuse data that is difficult to reuse in its present format, e.g. very large maps, data recorded on microfilms etc.
- Helps in work processing.
- Higher integration with business information systems.
- Easier to keep back-up files and to retrieve them during any unexpected disaster.
- Can be accessed from multiple locations through networked systems.
- Increased scope for a rise in organizational productivity.
- Requires less physical storage space.

How do we digitize?
Large institutions take up digitization projects with meticulous planning and execution. The entire process of digitization may be segregated into six phases:

Phase 1: Justification of the proposed digitization project
At the very initiation of the digitization project, the expected benefits of the project need to be identified. The cost aspect of the project and the availability of resources also need to be assessed. Risk assessment is an important part of project assessment: resources that face rapid deterioration may require early digitization. Most importantly, the expected value generation through digitization should be expressed in clear terms.

Phase 2: Assessment
In any institution, not all records are digitized. The data that requires digitization is to be decided on the basis of content and context. Some data may be digitized in a consolidated format and some in a detailed format. The files, tables, documents, expected future use etc. are to be assessed and evaluated. The hardware and software requirements for digitization are also assessed at this stage, and the human resource requirement for executing the digitization project is planned. Risk assessment at this level, e.g. possibilities of natural disasters and/or cyber attacks, also needs to be completed.

Phase 3: Planning
Successful execution of a digitization project needs meticulous planning. There are several stages of planning, e.g. selection of the digitization approach, project documentation, resources management, technical specifications, and risk management. The institution may decide to complete the digitization in-house or through an outsourced agency. It may also be done on-demand or in batches.

Phase 4: Digitization activities
Upon completion of the assessment and planning phases, the digitization activities start. The Wisconsin Historical Society developed a six-phase process: planning, capture, primary quality control, editing, secondary quality control, and storage and management. The planning schedule is prepared at the first stage; calibration of hardware/software and scanning is done next. A primary quality check is done on the output to check its reliability. Cropping, colour correction, assigning metadata etc. are done at the editing stage. A final check of quality is done on randomly selected samples. Finally, user copies are created and uploaded to a dedicated storage space after file validation. The digitization process may be viewed in Figure 1.9 below.

Figure 1.9: The complete digitization process. Source: Bandi, S., Angadi, M. and Shivarama, J. Best practices in digitization: Planning and workflow processes. In Proceedings of the Emerging Technologies and Future of Libraries: Issues and Challenges (Gulbarga University, Karnataka, India, 30-31 January), 2015.
Phase 5: Processes in the care of records
Once the digitization of records is complete, a few additional requirements arise which may be linked to the administration of records. Permission for accessing the data, intellectual control over the data, classification (if necessary), and the upkeep and maintenance of the data are additional requirements for data management.

Phase 6: Evaluation
Once the digitization project is implemented, the final phase should be a systematic determination of the project's merit, worth and significance using objective criteria. The primary purpose is to enable reflection and to help identify changes that would improve future digitization processes.

UNIT-4. TRANSFORMATION OF DATA TO DECISION-RELEVANT INFORMATION

The emergence of big data has changed the world of business like never before. The most important shift has happened in information generation and the decision-making process. There is a strong emergence of analytics that supports a more intensive data-centric and data-driven information generation and decision-making process. The data that surrounds the organization is being harnessed into information that informs and supports prudent decision-making in a judicious and repeatable manner.

The pertinent question here is: what does an enterprise need to do to transform data into relevant information? As noted earlier, not all types of data lead to relevant information for decision making. The biannual KPMG global CFO report says that for today's finance function leaders, the "biggest challenges lie in creating the efficiencies needed to gather and process basic financial data and continue to deliver traditional finance outputs while at the same time redeploying their limited resources to enable higher-value business decision support activities." For understanding the finance functions within an enterprise, we may refer to Figure 1.10 below.

Figure 1.10: Understanding finance functions (Source: KPMG International)

At the 'basics' or foundation of the pyramid (Figure 1.10), data generation may be automated by using ERP and other relevant software and hardware tools. The tools, techniques and processes that comprise the field of data & analytics (D&A) play a significant role in improving the quality of standard daily data and transaction processing. To turn data into user-friendly information, it should go through six core steps:

1. Collection of data: The collection of data may be done with standardized systems in place. Appropriate software and hardware may be used for this purpose. The appointment of trained staff also plays an important role in collecting accurate and relevant data.

2. Organising the data: The raw data needs to be organized in an appropriate manner to generate relevant information. The data may be grouped and arranged in a manner that creates useful information for the target user groups.

3. Data processing: At this step, data needs to be cleaned to remove unnecessary elements. If any data point is missing or not available, that also needs to be addressed. The presentation format for the data also needs to be decided.

4. Integration of data: Data integration is the process of combining data from various sources into a single, unified form.
This step includes the creation of data network sources, a master server, and users accessing the data from the master server. Data integration eventually enables the analytics tools to produce effective, actionable business intelligence.

5. Data reporting: The data reporting stage involves translating the data into a consumable format to make it accessible to users. For example, a business firm should be able to provide summarized financial information, e.g. revenue, net profit etc. The objective is that a user who wants to understand the financial position of the company should get relevant and accurate information.

6. Data utilization: At this ultimate step, data is utilized to back corporate activities and enhance operational efficiency and productivity for the growth of the business. This makes corporate decision-making really 'data driven'.
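A toy sketch of these six steps is given below (assuming pandas; the column names and all figures are invented). It is only meant to show how collection, organisation, processing, integration and reporting fit together in practice, not a prescribed implementation.

```python
# A toy sketch of the six steps with pandas (all figures are made up).
import pandas as pd

# 1. Collection: raw transaction data, possibly from different systems.
sales = pd.DataFrame({"invoice": [101, 102, 103, 104],
                      "region": ["East", "West", "East", None],
                      "amount": [12000.0, None, 8500.0, 15250.0]})
regions = pd.DataFrame({"region": ["East", "West"],
                        "manager": ["Asha", "Ravi"]})

# 2.-3. Organisation and processing: fix missing values, standardise the layout.
sales["region"] = sales["region"].fillna("Unassigned")
sales["amount"] = sales["amount"].fillna(sales["amount"].mean())

# 4. Integration: combine the cleaned data with a reference table.
combined = sales.merge(regions, on="region", how="left")

# 5. Reporting: summarise into a consumable format for the user.
report = combined.groupby("region", as_index=False)["amount"].sum()
print(report)

# 6. Utilisation: the summary can now feed a dashboard or a budgeting decision.
```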
UNIT-5. COMMUNICATION OF INFORMATION FOR QUALITY DECISION-MAKING

Quality information should lead to quality decisions. With the help of well-curated and well-reported data, decision makers should be able to add higher-value business insights leading to better strategic decision making. In a sense, a judicious use of data analytics is essential for the implementation of 'lean finance', which implies optimized finance processes with reduced cost and increased speed, flexibility and quality. By transforming information into a process for quality decision making, the firm should achieve the following abilities:

(i) Logical understanding of wide-ranging structured and unstructured data, and application of that information to corporate planning, budgeting, forecasting and decision support.
(ii) Predicting outcomes more effectively compared with conventional forecasting techniques based on historical financial reports.
(iii) Real-time spotting of emerging opportunities and also of capability gaps.
(iv) Making strategies for responding to uncertain events like market volatility and 'black swan' events through simulation.
(v) Diagnosing, filtering and extracting value from financial and operational information for making better business decisions.
(vi) Recognizing viable advantages to serve customers in a better manner.
(vii) Identifying possible frauds on the basis of data analytics.
(viii) Building impressive and useful dashboards to measure and demonstrate success, leading to effective strategies.

The aim of a data-driven business organization is to develop a business intelligence (BI) system that is not only focused on efficient delivery of information but also provides accurate strategic insight into the operational and financial system. This impacts organizational capabilities in a positive manner, makes the organization resilient to market pressures and creates competitive advantages by serving customers in a better way using data and predictive analytics.

UNIT-6. PROFESSIONAL SCEPTICISM REGARDING DATA

While data analytics is an important tool for decision making, managers should never take an important analysis at face value. A deeper understanding of the hidden insights that lie underneath the surface of the data set needs to be explored, and what appears on the surface should be viewed with some scepticism.

The emergence of new data analytics tools and techniques in the financial environment allows accounting and finance professionals to gain unique insights into data, but at the same time it creates unique challenges in exercising scepticism. As the availability of data is greater now, analysts and auditors are not only getting more information but also facing challenges in managing and investigating red flags. One major concern about the use of data analytics is the likelihood of false positives, i.e. the data may identify potential anomalies that are later found to be reasonable and explainable variations in the data. Studies show that the frequency of false positives increases proportionately with the size and complexity of data. A few studies also show that analysts face problems in determining outliers using data analytics tools.

Professional scepticism is an important focus area for practitioners, researchers, regulators and standard setters. At the same time, professional scepticism may result in additional costs, e.g. strained client relationships and budget overruns. Under such circumstances, it is important to identify and understand the conditions in which finance and audit professionals should apply professional scepticism. A fine balance needs to be kept between costly scepticism and underutilizing data analytics, to keep costs under control.

UNIT-7. ETHICAL USE OF DATA AND INFORMATION

Data analytics can help the decision-making process and make an impact. However, this empowerment for business also comes with challenges. The question is how business organizations can ethically collect, store and use data, and what rights need to be upheld. Data ethics addresses the moral obligations of gathering, protecting and using personally identifiable information, and is at present a major concern for analysts, managers and data professionals. The five basic principles of data ethics that a business organization should follow are:

(i) Regarding ownership: The first principle is that ownership of any personal information belongs to the person. It is unlawful and unethical to collect someone's personal data without their consent. Consent may be obtained through digital privacy policies, signed agreements, or by asking users to agree to terms and conditions. It is always advisable to ask for permission beforehand to avoid future legal and ethical complications. In the case of financial data, some data may be sensitive in nature, and prior permission must be obtained before using it for further analysis.

(ii) Regarding transparency: Maintaining transparency is important while gathering data. The objective with which the company is collecting users' data should be known to the user. For example, if the company is using cookies to track the online behaviour of the user, it should be mentioned to the user through a written policy that cookies will be used for tracking the user's online behaviour and that the collected data will be stored in a secure database to train an algorithm to enhance the user experience. After reading the policy, the user may decide whether or not to accept it. Similarly, while collecting financial data from clients, the purpose for which the data will be used should be clearly stated.
(iii) Regarding privacy: Even if a user allows the company to collect, store and analyse personally identifiable information (PII), that does not imply it should be made publicly available. For companies, it is mandatory to publish some financial information to the public, e.g. through annual reports. However, there may be much confidential information which, if it falls into the wrong hands, may create problems and financial loss. To protect the privacy of data, a data security process should be in place. This may include file encryption, dual-authentication passwords etc. The possibility of a breach of data privacy may also be reduced by de-identifying a dataset.

(iv) Regarding intention: The intention of data analysis should never be to make profits out of others' weaknesses or to hurt others. Collecting data which is unnecessary for the analysis is unethical and should be avoided.

(v) Regarding outcomes: In some cases, even if the intentions are good, the result of data analysis may inadvertently hurt clients and data providers. This is called disparate impact, and it is unethical.

Solved Case 1
Mr. Arjun is working as a data analyst with Manoj Enterprises Limited. He was invited by an educational institute to deliver a lecture on data analysis. He was told that the participants would be fresh graduates who would like to get a glimpse of the emerging field of 'data analysis'. He was planning for the lecture and thinking of the concepts to be covered. In your opinion, which are the fundamental concepts that Arjun should cover in his lecture?

Teaching note - outline for solution:
While addressing the fresh candidates, Arjun may focus on explaining the basic concepts of data analysis. He may initiate the discussion with a brief introduction on 'data'. He may discuss with examples how mere data is not useful for decision making. Next, he may move to a discussion of the links among data, information and knowledge. The participants should get a clear idea about the formation of knowledge using 'raw' data as a resource.

Once the basic concepts about data, information and knowledge are clear in the minds of the participants, Arjun may describe the various types of data, e.g. numerical data, descriptive data and graphical data, explaining the concepts with some real-life examples. Further, he may also discuss another way of looking at data, e.g. the ordinal scale, ratio scale etc. How data analysis is particularly useful for finance and accounting functions may be discussed next. The difference between quantitative and qualitative data can then be discussed with the help of a few practical examples.

However, the key question is how raw data may be transformed into useful information. To explore the answer to this question, Arjun may discuss the six steps to be followed for transforming data into information. The ultimate objective of taking so much pain is to generate quality decisions. This is a subjective area; Arjun may seek inputs from participants and discuss various ways of generating relevant and useful decisions by exploring raw data. During this entire process of quality decision making, one should not forget the ethical aspects, so Arjun should convey the importance of adopting ethical practices in data analysis. At the end, Arjun may close the conversation with a note of thanks.
2. DATA PROCESSING, ORGANIZATION, CLEANING AND VALIDATION

This module includes:
2.1 Development of Data Processing
2.2 Functions of Data Processing
2.3 Data Organization and Distribution
2.4 Data Cleaning and Validation

UNIT-1. DEVELOPMENT OF DATA PROCESSING

Data processing (DP) is the process of organising, categorising, and manipulating data in order to extract information. Information in this context refers to valuable connections and trends that may be used to address pressing issues. In recent years, the capacity and effectiveness of DP have increased manifold with the development of technology. Data processing that used to require a lot of human labour has progressively been superseded by modern tools and technology. The techniques and procedures of DP, i.e. the information extraction algorithms applied to data, have become well developed in recent years; for instance, classification is needed for the treatment of facial data for recognition, and time series analysis is needed for processing stock market data.

The information extracted as a result of DP is also heavily reliant on the quality of the data. Data quality may be affected by several issues like missing data and duplications. There may be other fundamental problems, such as incorrect equipment design and biased data collection, which are more difficult to address. The history of DP can be divided into three phases as a result of technological advancements (Figure 2.1):

Figure 2.1: History of data processing

(i) Manual DP: Manual DP involves processing data without much assistance from machines. Prior to the phase of mechanical DP, only small-scale data processing was possible using manual efforts. In some special cases manual DP is still in use today, typically because the data is difficult to digitize or cannot be read by machines, as in the case of retrieving data from outdated texts or documents.

(ii) Mechanical DP: Mechanical DP processes data using mechanical (not modern computer) tools and technologies. This phase began in 1890 (Bohme et al., 1991), when a system made up of intricate punch card machines was installed by the US Bureau of the Census in order to assist in compiling the findings of a recent national population census. The use of mechanical DP made it quicker and easier to search and compute data than the manual process.

(iii) Electronic DP: Finally, electronic DP replaced the other two, resulting in fewer mistakes and rising productivity. Data processing is now done electronically using computers and other cutting-edge electronics, and is widely used in industry, research institutions and academia.

How are data processing and data science relevant for finance?
The relevance of data processing and data science in the area of finance is increasing every day. The eleven significant areas where data science plays an important role are:

(i) Risk analytics: Business involves risk, particularly in the financial industry. It is crucial to determine the risk factor before making any decisions. For example, a better method for defending the business against potential cybersecurity risks is risk analytics, which is powered by data science. Given that a large portion of a company's risk-related data is 'unstructured', its analysis without data science methods can be challenging and prone to human mistake.

Figure 2.2: Data processing and data science in finance
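One very common risk-analytics calculation is Value at Risk. The sketch below is a minimal, illustrative historical-simulation VaR (assuming NumPy; the returns are randomly generated, not market data), shown only to make the idea of a data-driven risk measure concrete; it is not the specific method referred to in the module.

```python
# A minimal risk-analytics sketch (illustrative only): historical-simulation
# Value at Risk is a percentile of past portfolio returns. The returns below
# are randomly generated, not real market data.
import numpy as np

rng = np.random.default_rng(seed=42)
daily_returns = rng.normal(loc=0.0005, scale=0.012, size=750)  # ~3 years of toy returns

confidence = 0.95
var_95 = -np.percentile(daily_returns, (1 - confidence) * 100)  # loss threshold

portfolio_value = 10_000_000  # Rs. 1 crore, hypothetical
print(f"1-day 95% VaR: {var_95:.2%} of the portfolio, "
      f"i.e. about Rs. {var_95 * portfolio_value:,.0f}")
```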
(ii) Real-time analytics: Prior to significant advances in data engineering (Airflow, Spark, and cloud solutions), all data was historical in nature. Data engineers would discover significance in numbers that were days, weeks, months, or even years old, since that was the only accessible information. It was processed in batches, which meant that no analysis could be performed until a batch of data had been gathered within a predetermined timescale. Consequently, any conclusions drawn from this data were possibly invalid.

With technological advancement and improved hardware, real-time analytics are now available, as data engineering, data science, machine learning, and business intelligence work together to provide the optimal user experience. Thanks to dynamic data pipelines, data streams, and speedier data transmission between source and analyser, businesses can now respond quickly to consumer interactions. With real-time analysis, there are no delays in establishing a customer's worth to an organisation, and credit ratings and transactions are far more precise.

(iii) Customer data management: Data science enables effective management of client data. In recent years, many financial institutions may have processed their data solely through the machine learning capabilities of business intelligence (BI). However, the proliferation of big data and unstructured data has rendered this method significantly less effective for predicting risk and future trends. There are currently more transactions occurring every minute than ever before, thus there is better data accessibility for analysis. Due to the arrival of social media and new Internet of Things (IoT) devices, a significant portion of this data does not conform to the structure of the organised data previously employed. Using methods such as text analytics, data mining, and natural language processing, data science is well-equipped to deal with massive volumes of unstructured new data. Consequently, even though data availability has been enhanced, data science implies that a company's analytical capabilities may also be upgraded, leading to a greater understanding of market patterns and client behaviour.

(iv) Consumer analytics: In a world where choice has never been more crucial, it has become evident that each customer is unique; nonetheless, there have never been more consumers. This contradiction cannot be sustained without the intelligence and automation of machine learning. It is as important to ensure that each client receives a customised service as it is to process their data swiftly and efficiently, without time-intensive individualised analysis. As a consequence, insurance firms are using real-time analytics in conjunction with prior data patterns and quick analysis of each customer's transaction history to eliminate sub-zero consumers, enhance cross-sales, and calculate a consumer's lifetime worth. This allows each financial institution to keep its own degree of security while still reviewing each application individually.
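As a rough illustration of the 'lifetime worth' idea mentioned above, the sketch below (assuming pandas; the customers, amounts and the expected number of future orders are all invented) derives a simple per-customer estimate from transaction history.

```python
# A rough consumer-analytics sketch: estimating a crude 'lifetime worth' per
# customer from purchase history. All transactions are made up.
import pandas as pd

txns = pd.DataFrame({
    "customer": ["C1", "C1", "C2", "C2", "C2", "C3"],
    "amount":   [1200,  800,  300,  450,  250, 5000],
})

per_customer = txns.groupby("customer")["amount"].agg(
    orders="count", avg_order="mean", total_spend="sum")

# A crude lifetime-worth proxy: average order value x expected future orders.
expected_future_orders = 10          # assumed, for illustration only
per_customer["lifetime_worth"] = per_customer["avg_order"] * expected_future_orders
print(per_customer.sort_values("lifetime_worth", ascending=False))
```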
(v) Customer segmentation: Despite the fact that each consumer is unique, it is only possible to comprehend their behaviour after they have been categorised or divided. Customers are frequently segmented based on socioeconomic factors, such as geography, age, and buying patterns. By examining these clusters collectively, organizations in the financial industry and beyond may assess a customer's current and long-term worth. With this information, organisations may eliminate clients who provide little value and focus on those with promise. To do this, data scientists can use automated machine learning algorithms to categorise their clients based on specified attributes that have been assigned relative relevance scores. Comparing these groupings to former customers reveals the expected value of time invested with each client.

(vi) Personalized services: The requirement to customise each customer's experience extends beyond gauging risk assessment. Even major organizations strive to provide customised service to their consumers as a method of enhancing their reputation and increasing customer lifetime value. This is also true for businesses in the finance sector. From customer evaluations to telephone interactions, everything can be studied in a way that benefits both the business and the consumer. By delivering the consumer a product that precisely meets their needs, cross-selling may be facilitated by a thorough comprehension of these interactions. Natural language processing (NLP) and voice recognition technologies dissect these encounters into a series of important points that can identify chances to increase revenue, enhance the customer service experience, and steer the company's future. Due to the rapid progress of NLP research, the potential is yet to be fully realised.

(vii) Advanced customer service: Data science's capacity to give superior customer service goes hand in hand with its ability to provide customised services. As client interactions may be evaluated in real time, more effective recommendations can be offered to the customer care agent managing the customer's case throughout the conversation. Natural language processing can offer chances for practical financial advice based on what the consumer is saying, even if the customer is unsure of the product they are seeking. The customer support agent can then cross-sell or up-sell while efficiently addressing the client's inquiry. The knowledge from each encounter may then be utilised to inform subsequent interactions of a similar nature, hence enhancing the system's efficacy over time.

(viii) Predictive analytics: Predictive analytics enables organisations in the financial sector to extrapolate from existing data and anticipate what may occur in the future, including how patterns may evolve. When prediction is necessary, machine learning is utilised. Using machine learning techniques, pre-processed data may be fed into the system in order for it to learn how to anticipate future occurrences accurately. More information improves the prediction model. Typically, for an algorithm to function in shallow learning, the data must be cleansed and altered. Deep learning, on the other hand, transforms the data without the need for human preparation to establish the initial rules, and so achieves superior performance.
In the case of stock market pricing, machine learning algorithms learn trends from past data over a certain interval (a week, month, or quarter) and then forecast future stock market trends based on this historical information. This allows data scientists to depict expected patterns for end-users in order to assist them in making investment decisions and developing trading strategies.

(ix) Fraud detection: With a rise in financial transactions, the risk of fraud also increases. Tracking incidents of fraud, such as identity theft and credit card scams, and limiting the resulting harm is a primary responsibility for financial institutions. As the technologies used to analyse big data become more sophisticated, so does their capacity to detect fraud early on. Artificial intelligence and machine learning algorithms can now detect credit card fraud significantly more precisely, owing to the vast amount of data accessible from which to draw trends and the capacity to respond in real time to suspect behaviour. If a major purchase is made on a credit card belonging to a consumer who has traditionally been very frugal, the card can be immediately blocked and a notification sent to the card owner. This protects not just the client, but also the bank and the client's insurance carrier. When it comes to trading, machine learning techniques discover irregularities and notify the relevant financial institution, enabling speedy inquiry.

(x) Anomaly detection: Financial services have long placed a premium on detecting abnormalities in a customer's bank account activities, partly because anomalies are only proved to be anomalous after the event happens. Although data science can provide real-time insights, it cannot anticipate singular incidents of credit card fraud or identity theft. However, data analytics can discover instances of unlawful insider trading before they cause considerable harm. The methods for anomaly identification include Recurrent Neural Networks and Long Short-Term Memory models. These algorithms can analyse the behaviour of traders before and after information about the stock market becomes public in order to determine whether they illegally monopolised stock market forecasts and took advantage of investors. Transformers, which are next-generation designs for a variety of applications including anomaly detection, are the foundation of more modern solutions. (A simple illustration of the basic idea of anomaly flagging appears at the end of this unit.)

(xi) Algorithmic trading: Algorithmic trading is one of the key uses of data science in finance. Algorithmic trading happens when an unsupervised computer, using the intelligence supplied by an algorithm, trades on the stock market based on the algorithm's suggestions. As a consequence, it eliminates the risk of loss caused by indecision and human error. The trading algorithm is developed according to a set of stringent rules that decide whether it will trade on a specific market at a specific moment (there is no restriction on which markets algorithmic trading can work on). One approach is reinforcement learning, in which the model is taught using penalties and rewards associated with the rules. Each time a transaction proves to be a poor choice, a reinforcement learning model ensures that the algorithm learns and adapts its rules accordingly. One of the primary advantages of algorithmic trading is the increased frequency of deals.
Based on facts and taught behaviour, the computer can operate in a fraction of a second, without human indecision or thought. Similarly, the machine will only trade when it perceives a profit opportunity according to its rule set, regardless of how rare these chances may be.
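The illustration promised under anomaly detection follows. It is deliberately much simpler than the machine-learning methods described above: a plain z-score on invented card transactions, shown only to convey the basic idea of flagging values that sit far from a customer's usual behaviour.

```python
# A deliberately simple anomaly-flagging sketch (made-up card transactions):
# flag any transaction whose amount lies far from the customer's usual range.
import pandas as pd

txns = pd.DataFrame({
    "txn_id": range(1, 9),
    "amount": [450, 520, 610, 480, 500, 530, 15500, 495],  # one unusually large spend
})

mean, std = txns["amount"].mean(), txns["amount"].std()
txns["z_score"] = (txns["amount"] - mean) / std
txns["flagged"] = txns["z_score"].abs() > 2.0   # threshold chosen for illustration only

print(txns[txns["flagged"]])   # candidate anomalies for manual review
```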
UNIT-2. FUNCTIONS OF DATA PROCESSING

Data processing generally involves the following processes.

(i) Validation: As per the UNECE glossary on statistical data editing (UNECE 2013), data validation may be defined as 'an activity aimed at verifying whether the value of a data item comes from the given (finite or infinite) set of acceptable values.' Simon (2013) defined data validation as "a process which ensures the correspondence of the final (published) data with a number of quality characteristics." Data validation is thus a decision-making process leading to the acceptance or rejection of data. Data is subjected to rules: data is deemed legitimate for the intended final use if it complies with the rules, i.e. the combinations stated by the rules are not violated. The objective of data validation is to assure a particular degree of data quality. In official statistics, however, quality has multiple dimensions: relevance, correctness, timeliness and punctuality, accessibility and clarity, comparability, coherence, and comprehensiveness. Therefore, it is essential to determine which components data validation addresses. (A combined illustration of validation, sorting and aggregation appears after the discussion of aggregation below.)

(ii) Sorting: Data sorting is any procedure that organises data into a meaningful order to make it simpler to comprehend, analyse, and visualise. Sorting is a typical strategy for presenting research data in a manner that facilitates comprehension of the story being told by the data. Sorting can be performed on raw data (across all records) or on aggregated information (in a table, chart, or some other aggregated or summarised output). Summarization, whether statistical or automatic, involves reducing detailed data to its main points.

Typically, data is sorted in ascending or descending order based on actual numbers, counts, or percentages, but it may also be sorted based on variable value labels. Value labels are metadata present in certain applications that let the researcher save labels for each value alternative in a categorical question. The vast majority of software programmes permit sorting by multiple factors. A data collection including region and country fields, for instance, can be sorted by region as the primary sort and subsequently by country; within each sorted region, the country sort will be applied.

When working with any type of data, there are a number of typical sorting applications. One such use is data cleaning, which is the act of sorting data in order to identify anomalies in a data pattern. For instance, monthly sales data can be sorted by month to identify sales volume variations. Sorting is also frequently used to rank or prioritise records. In this instance, data is sorted based on a rank, computed score, or other weighting factor (for example, highest-volume accounts or heavy-usage customers). It is also vitally necessary to order visualisations (tables, charts, etc.) correctly to facilitate accurate data interpretation. In market research, for instance, it is typical to sort the findings of a single-response question by column percentage, i.e. from most answered to least answered, as in a brand preference question. Incorrect ordering frequently results in misunderstanding. Always verify that the most logical sorts are applied to every visualization.

Using sorting functions is an easy idea to comprehend, but there are a few technical considerations to keep in mind. The arbitrary sorting of non-unique data is one such issue. Consider, for example, a data collection comprising region and country variables, as well as several records per region. If a region-based sort is applied, what is the default secondary sort? In other words, how will the data be sorted inside each region? This depends on the application in question. Excel, for instance, will preserve the original order as the default sort order following the execution of the primary sort. SQL databases do not have a default sort order; it depends on other variables, such as the database management system (DBMS) in use, indexes, and other factors. Other programmes may perform extra sorting by default based on the column order. At nearly every level of data processing, the vast majority of analytical and statistical software programmes offer a variety of sorting options.

(iii) Aggregation: Data aggregation refers to any process in which data is collected and summarised. When data is aggregated, individual data rows, which are often compiled from several sources, are replaced with summaries or totals; groups of observed aggregates are replaced with statistical summaries based on those observations. A data warehouse often contains aggregate data, since it may offer answers to analytical inquiries and drastically cut the time required to query massive data sets. A common application of data aggregation is to offer statistical analysis for groups of individuals and to provide relevant summary data for business analysis. Large-scale data aggregation using software tools known as data aggregators is commonplace. Typically, data aggregators comprise functions for gathering, processing, and displaying aggregated data.

Data aggregation enables analysts to access and analyse vast quantities of data in a reasonable amount of time. A single row of aggregate data may represent hundreds, thousands, or even millions of individual data entries. Once data is aggregated, it may be queried rapidly, as opposed to spending processing cycles acquiring each individual data row and aggregating it in real time when it is requested or accessed. As the amount of data kept by businesses continues to grow, aggregating the most significant and frequently requested data can facilitate efficient access.
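The sketch below ties together the validation, sorting and aggregation steps just described (assuming pandas; the rules, columns and figures are illustrative only, not prescribed ones).

```python
# A combined sketch of validation, sorting and aggregation with pandas
# (illustrative rules and figures only).
import pandas as pd

sales = pd.DataFrame({
    "month":  ["Jan", "Feb", "Mar", "Apr"],
    "region": ["East", "East", "West", "West"],
    "units":  [120, -5, 90, 150],          # -5 should fail validation
    "price":  [250.0, 250.0, 260.0, None], # missing price should fail validation
})

# Validation: accept a row only if it satisfies every rule.
rules = (sales["units"].ge(0)) & (sales["price"].notna())
valid, rejected = sales[rules], sales[~rules]
print("Rejected rows:\n", rejected)

# Sorting: order the valid records for easier inspection (e.g. by revenue).
valid = valid.assign(revenue=valid["units"] * valid["price"])
print(valid.sort_values("revenue", ascending=False))

# Aggregation: replace detail rows with summaries per region.
print(valid.groupby("region", as_index=False)["revenue"].sum())
```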
(iv) Analysis: Data analysis is described as the process of cleaning, converting, and modelling data to obtain actionable business intelligence. The objective of data analysis is to extract relevant information from data and make decisions based on this knowledge. Every time we make a decision in our day-to-day lives, we consider what occurred previously or what would occur if we chose a specific option. This is a simple example of data analysis: studying the past or the future and basing judgements on that analysis, by recalling our history or imagining our future. The same task, carried out by an analyst for commercial purposes, is known as data analysis.

Analysis is sometimes all that is required to expand a business. If a firm is not expanding, it must acknowledge past errors and create a new plan to avoid making the same mistakes. And even if the firm is expanding, it must plan to expand even more. All that is required is an analysis of the business data and operations.

Figure 2.3: Some popular data analysis tools

(v) Reporting: Data reporting is the act of gathering and structuring raw data and turning it into a consumable format in order to evaluate the organization's ongoing performance. Data reports can provide answers to fundamental inquiries regarding the status of the firm. They can display the status of certain data within an Excel document or a simple data visualisation tool. Static data reports often employ the same structure over time and collect data from a single source.

A data report is nothing more than a set of documented facts and figures. Consider the population census as an illustration: it is a technical document conveying basic facts about the population and demographics of a country. It may be presented in text or in a graphical manner, such as a graph or chart. Static information may be utilised to evaluate present situations. Financial data such as revenues, accounts receivable, and net profits are often summarised in a company's data reporting. This gives an up-to-date record of the company's financial health, or of a portion of the finances, such as sales. A sales director may report on KPIs based on location, funnel stage, and closing rate in order to present an accurate view of the whole sales pipeline.

Data provides a method for measuring development in many aspects of our lives. It influences both our professional judgements and our day-to-day affairs. A data report indicates where we should devote the most time and money, as well as what needs more organization or attention. In any industry, accurate data reporting plays a crucial role. Utilizing business information in healthcare enables physicians to provide more effective and efficient patient care, hence saving lives. In education, data reports may be utilised to study the relationship between attendance records and seasonal weather patterns, as well as the intersection of acceptance rates and neighbourhood regions.

The most effective business analysts possess specific competencies. An outstanding business analyst must be able to prioritise the most pertinent data. There is no room for error in data reporting, which necessitates thoroughness and attention to detail. The capacity to comprehend and organise enormous volumes of information is another valuable talent. Lastly, the ability to organize and present data in an easy-to-read fashion is essential for all data reporters. Excellence in data reporting does not necessitate immersion in coding or proficiency in analytics. Other necessary talents include the ability to extract vital information from data, to keep things simple, and to avoid data hoarding.

Although static reporting can be precise and helpful, it has limitations. One such instance is the absence of real-time insights. When confronted with a vast volume of data to organise into a usable and actionable format, a report enables senior management or the sales team to provide guidance on future steps.
However, if the layout, data, and formulae are not given in a timely way, they may be out of current context. The reporting of data is vital to an organisation’s business intelligence. The more is an organization’s access to data, the more agile it may be. This can help a firm to maintain its relevance in a market that is becoming increasingly competitive and dynamic. An efficient data reporting system will facilitate the formation of judicious judgments that might steer a business in new areas and provide additional income streams. (vi) Classification: Data classification is the process of classifying data according to important categories so that it may be utilised and safeguarded more effectively. The categorization process makes data easier to identify and access on a fundamental level. Regarding risk management, compliance, and data security, the classification of data is of special relevance. Classifying data entails labelling it to make it searchable and trackable. Additionally, it avoids many duplications of data, which can minimise storage and backup expenses and accelerate the search procedure. The categorization process may sound very technical, yet it is a topic that your organisation’s leadership must comprehend. The categorization of data has vastly improved over time. Today, the technology is employed for a number of applications, most frequently to assist data security activities. AKASH AGARWAL CLASSES CONT.- 8007777042 / 043 29 PROF. NIKITA OSWAL CMA INTER-DATA ANALYTICS However, data may be categorised for a variety of purposes, including facilitating access, ensuring regulatory compliance, and achieving other commercial or personal goals. In many instances, data classification is a statutory obligation, since data must be searchable and retrievable within predetermined deadlines. For the purposes of data security, data classification is a useful strategy that simplifies the application of appropriate security measures based on the kind of data being accessed, sent, or duplicated. Classification of data frequently entails an abundance of tags and labels that identify the data’s kind, secrecy, and integrity. In data classification procedures, availability may also be taken into account. It is common practise to classify the sensitivity of data based on changing levels of relevance or secrecy, which corresponds to the security measures required to safeguard each classification level. Three primary methods of data classification are recognised as industry standards: Classification based on content, examines and interprets files for sensitive data. Context-based classification considers, among other characteristics, application, location, and creator as indirect markers of sensitive information. User-based classification relies on the human selection of each document by the end user. To indicate sensitive documents, user-based classification depends on human expertise and judgement during document creation, editing, review, or distribution. In addition to the classification kinds, it is prudent for an organisation to identify the relative risk associated with the data types, how the data is handled, and where it is stored/sent (endpoints). It is standard practise to divide data and systems into three risk categories. 1. Low risk: If data is accessible to the public and recovery is simple, then this data collection and the mechanisms around it pose a smaller risk than others. 2. Moderate risk: Essentially, they are non-public or internal (to a business or its partners) data. 
However, it is unlikely to be too mission-critical or sensitive to be considered “high risk.” The intermediate category may include proprietary operating processes, cost of products, and certain corporate paperwork. 3. High risk: Anything even vaguely sensitive or critical to operational security falls under the category of high risk. Additionally, data that is incredibly difficult to retrieve (if lost). All sec. ret, sensitive, and essential data falls under the category of high risk. AKASH AGARWAL CLASSES CONT.- 8007777042 / 043 30 PROF. NIKITA OSWAL CMA INTER-DATA ANALYTICS Data classification matrix Data creation and labelling may be simple for certain companies. If there are not a significant number of data kinds or if your firm has fewer transactions, it will likely be easier to determine the risk of your data and systems. However, many businesses working with large volumes or numerous types of data will certainly require a thorough method for assessing their risk. Many utilise a “data categorization matrix” for this purpose. Creating a matrix that rates data and/or systems based on how likely they are to be hacked and how sensitive the data is enables you to rapidly identify how to classify and safeguard all sensitive information (figure 2.4). Risk Confidential Data Sensitive Data Public High Medium Low General Institution The negative impact on The risk for negative The impact on the Impact the institution should impact on the institution should this data be incorrect, institution should this Public data not improperly disclosed, or information not be be available is not available when available when needed Typically low, needed is typically very is typically moderate. (inconvenient but high. not deliberating). Description Access to Confidential Access to Sensitive Access to Public institutional data must institutional data must institutional data be controlled from be requested from, and may be granted to creation to destruction authorized by, the any requester, or it and vail be grained only Functional Security is published with no to those persons Module Representative restrictions. affiliated with, the who is responsible for Public data is not University who require the data. considered such access in order to Access to internal sensitive. perform their job, or to data may be authorized The integrity of those individuals to groups of persons “Public’ data should permitted by law. by their job be protected, and Access to confidential classification or the appropriate data must be responsibilities (“role- Functional Security individually requested based” access), and may Module and then authorized by also be limited by one’s Representative the Functional Security employing unit or must authorize Module Representative- affiliation. replication or who is responsible for copying of the data the data. in order to ensure it AKASH AGARWAL CLASSES CONT.- 8007777042 / 043 31 PROF. NIKITA OSWAL CMA INTER-DATA ANALYTICS Confidential data is Non-Public or Internal remains accurate highly sensitive and may data is moderately overtime. have personal privacy sensitive in nature. consideration, or may Often, Sensitive data is be restricted by federal used for making or state law. decisions, and therefore Information which it’s important this provides access to information remain resources, physical or timely and accurate. virtual. Access Only those individuals EMU employees and EMU affiliates and designated with non- employees who general approved access. 
- have a business need to public with a need know. to know. Figure 2.4: Sample risk classification matrix. Example of data classification Data may be classified as Restricted, Private, or Public by an entity. In this instance, public data are the least sensitive and have the lowest security requirements, whereas restricted data are the most sensitive and have the highest security rating. This form of data categorization is frequently the beginning point for many organisations, followed by subsequent identification and tagging operations that label data based on its enterprise- relatedness, quality, and other categories. The most effective data classification methods include follow-up processes and frameworks to ensure that sensitive data remains in its proper location. Data classification process Classifying data may be a difficult and laborious procedure. Automated systems can assist in streamlining the process, but an organisation must determine the categories and criteria that will be used to classify data, understand and define its objectives, outline the roles and responsibilities of employees in maintaining proper data classification protocols, and implement security standards that correspond with data categories and tags. This procedure will give an operational framework to workers and third parties engaged in the storage, transfer, or retrieval of data, if carried out appropriately. AKASH AGARWAL CLASSES CONT.- 8007777042 / 043 32 PROF. NIKITA OSWAL CMA INTER-DATA ANALYTICS Policies and procedures should be well-defined, respectful of security needs and the confidentiality of data kinds, and simple enough for staff encouraging compliance to comprehend. For example, each category should include information about the types of data included in the categorization, security concerns including rules for accessing, transferring, and keeping data, and the potential risks associated with a security policy breach. Steps for effective data classification. 1. Understanding the current setup: Taking a comprehensive look at the location of the organisation’s current data and any applicable legislation is likely the best beginning point for successfully classifying data. Before one classifies data, one must know what data he is having. 2. Creation of a data classification policy: Without adequate policy, maintaining compliance with data protection standards in an organisation is practically difficult. Priority number one should be the creation of a policy. 3. Prioritize and organize data: Now that a data classification policy is in place, it is time to categorise the data. Based on the sensitivity and privacy of the data, the optimal method to be chosen for tagging it. AKASH AGARWAL CLASSES CONT.- 8007777042 / 043 33 PROF. NIKITA OSWAL CMA INTER-DATA ANALYTICS UNIT -3 DATA ORGANIZATION AND DISTRIBUTION 1. Data Organization: Data organization is the classification of unstructured data into distinct groups. This raw data comprises variables’ observations. As an illustration of data organization, the arrangement of students’ grades in different topics is one example. As time passes and the data volume grows, the time required to look for any information from the data source would rise if it has not previously been structured. Data organization is the process of arranging unstructured data in a meaningful manner. Classification, frequency distribution tables, image representations, graphical representations, etc. are examples of data organization techniques. 
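As a small, hypothetical sketch of this idea, the code below organises a set of raw marks into a frequency distribution table using pandas (an assumed tool; the marks and class intervals are invented for illustration).

import pandas as pd

marks = pd.Series([35, 48, 52, 56, 56, 61, 67, 72, 74, 81, 88, 93])

# Group the raw observations into class intervals of width 10.
bins = range(30, 101, 10)
frequency_table = (
    pd.cut(marks, bins=bins, right=False)
      .value_counts()
      .sort_index()
      .rename("frequency")
)

print(frequency_table)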
Data organization allows us to arrange data in a manner that is easy to understand and manipulate. It is challenging to deal with or analyse raw data. IT workers utilise the notion of data organization in several ways. Many of these are included under the umbrella term “data management.” For instance, data organization includes reordering or assessing the arrangement of data components in a physical record. The analysis of somewhat organized and unstructured data is another crucial component of business data organization. Structured data consists of tabular information that may be readily imported into a database and then utilised by analytics software or other applications. Unstructured data are raw and unformatted data, such as a basic text document with names, dates, and other information spread among random paragraphs. The integration of somewhat unstructured data into a holistic data environment has been facilitated by the development of technical tools and resources. In a world where data sets are among the most valuable assets possessed by firms across several industries, businesses employ data organization methods in order to make better use of their data assets. Executives and other professionals may prioritise data organization as part of a complete plan to expedite business operations, boost business intelligence, and enhance the business model as a whole. The examination of both relatively organized and unstructured data is a crucial component of business data organization. Structured data consists of tabular information that can be readily incorporated into a database and supplied to analytics software or other specific applications. Unstructured data is regarded raw and unformatted, similar to a plain text document in which information is dispersed across random paragraphs. Few specialists have built technological tools and resources to manage substantially unstructured data. AKASH AGARWAL CLASSES CONT.- 8007777042 / 043 34 PROF. NIKITA OSWAL CMA INTER-DATA ANALYTICS These data are incorporated into a comprehensive data ecosystem. Businesses implement data organization techniques to make better use of their data assets. Data assets have a very significant position in the world, since they are owned by businesses in a variety of industries. Data organization is seen as a component of a holistic strategy that facilitates the streamlining of business operations, whether via the acquisition of superior business information or the overall improvement of a business model. 2. Data distribution: Data distribution is a function that identifies and quantifies all potential values for a variable, as well as their relative frequency (probability of how often they occur). Any population with dispersed data is categorised as a distribution. It is necessary to establish the population’s distribution type in order to analyse it using the appropriate statistical procedures. Statistics makes extensive use of data distributions. If an analyst gathers 500 data points on the shop floor, they are of little use to management unless they are categorised or organised in an usable manner. The data distribution approach arranges the raw data into graphical representations (such as histograms, box plots, and pie charts, etc.) and gives relevant information. The primary benefit of data distribution is the estimation of the probability of any certain observation within a sample space. 
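A minimal sketch of this estimation, assuming pandas and NumPy and using invented shop-floor readings, is shown below: the relative frequency of observations falling in each interval is an empirical estimate of the corresponding probability.

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)
cycle_times = pd.Series(rng.normal(loc=12.0, scale=1.5, size=500))  # invented readings

# Relative frequencies estimate the probability of an observation
# falling within each interval of the sample space.
intervals = pd.cut(cycle_times, bins=10)
relative_freq = intervals.value_counts(normalize=True).sort_index()

print(relative_freq.round(3))
# A histogram or box plot of cycle_times would present the same
# distribution graphically, e.g. cycle_times.plot(kind="hist", bins=10).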
Probability distribution is a mathematical model that determines the probabilities of the occurrence of certain test or experiment outcomes. These models are used to specify distinct sorts of random variables (often discrete or continuous) in order to make a choice. One can employ mean, mode, range, probability, and other statistical approaches based on the category of the random variable. 3. Types of distribution: Distributions are basically classified based on the type of data: (i) Discrete distributions: A discrete distribution that results from countable data and has a finite number of potential values. In addition, discrete distributions may be displayed in tables, and the values of the random variable can be counted. Example: rolling dice, selecting a specific amount of heads, etc. (a) Binomial distributions: The binomial distribution quantifies the chance of obtaining a specific number of successes or failures each experiment. Binomial distribution applies to attributes that are categorised into two mutually exclusive and exhaustive classes, such as number of successes/failures and number of acceptances/rejections. Example: When tossing a coin: The likelihood of a coin falling on its head is one-half and the probability of a coin landing on its tail is one-half. AKASH AGARWAL CLASSES CONT.- 8007777042 / 043 35 PROF. NIKITA OSWAL CMA INTER-DATA ANALYTICS (b) Poisson distribution: The Poisson distribution is the discrete probability distribution that quantifies the chance of a certain number of events occurring in a given time period, where the events occur in a well-defined order. Poisson distribution applies to attributes that can potentially take on huge values, but in practise take on tiny ones. Example: Number of flaws, mistakes, accidents, absentees etc. (c) Hypergeometric distribution: The hypergeometric distribution is a discrete distribution that assesses the chance of a certain number of successes in (n) trials, without replacement, from a sufficiently large population (N). Specifically, sampling without replacement. The hypergeometric distribution is comparable to the binomial distribution; the primary distinction between the two is that the chance of success is not the same for all trials in the binomial distribution but it is in the hypergeometric distribution. (d) Geometric distribution: The geometric distribution is a discrete distribution that assesses the probability of the occurrence of the first success. A possible extension is the negative binomial distribution. Example: A marketing representative from an advertising firm chooses hockey players from several institutions at random till he discovers an Olympic participant. (ii) Continuous distributions: A distribution with an unlimited number of (variable) data points that may be represented on a continuous measuring scale. A continuous random variable is a random variable with an unlimited and uncountable set of potential values. It is more than a simple count and is often described using probability density functions (pdf). The probability density function describes the characteristics of a random variable. Normally clustered frequency distribution is seen. Therefore, the probability density function views it as the distribution’s “shape.” Following are the continuous distributions of various types: (i) Normal distribution: Gaussian distribution is another name for normal distribution. It is a bell-shaped curve with a greater frequency (probability density) around the core point. 
As values go away from the centre value on each side, the frequency drops dramatically. In other words, features whose dimensions are expected to fall on either side of the target value with equal likelihood adhere to normal distribution. (ii) Lognormal distribution: A continuous random variable x follows a lognormal distribution if the distribution of its natural logarithm, ln(x), is normal. As the sample size rises, the distribution of the sum of random variables approaches a normal distribution, independent of the distribution of the individuals. AKASH AGARWAL CLASSES CONT.- 8007777042 / 043 36 PROF. NIKITA OSWAL CMA INTER-DATA ANALYTICS (iii) F distribution: The F distribution is often employed to examine the equality of variances between two normal populations. The F distribution is an asymmetric distribution with no maximum value and a minimum value of 0. The curve approaches 0 but never reaches the horizontal axis. (iv) Chi square distributions: When independent variables with standard normal distribution are squared and added, the chi square distribution occurs. Example: y = Z12+ Z22 +Z32 +Z42+…+ Zn2 if Z is a typical normal random variable. The distribution of chi square values is symmetrical and constrained below zero. And approaches the form of the normal distribution as the number of degrees of freedom grows. (v) Exponential distribution: The exponential distribution is a probability distribution and one of the most often employed continuous distributions. Used frequently to represent products with a consistent failure rate. The exponential distribution and the Poisson distribution are closely connected. Has a constant failure rate since its form characteristics remain constant. AKASH AGARWAL CLASSES CONT.- 8007777042 / 043 37 PROF. NIKITA OSWAL CMA INTER-DATA ANALYTICS UNIT -4 DATA CLEANING AND VALIDATION 1. Data Cleaning: Data cleaning is the process of correcting or deleting inaccurate, corrupted, improperly formatted, duplicate, or insufficient data from a dataset. When several data sources are combined, there are numerous chances for data duplication and mis-labelling. Incorrect data renders outcomes and algorithms untrustworthy, despite their apparent accuracy. There is no, one definitive method for prescribing the precise phases of the data cleaning procedure, as the methods differ from dataset to dataset. However, it is essential to build a template for your data cleaning process so that you can be certain you are always doing the steps correctly. Data cleaning is different from data transformation. Data cleaning is the process of removing irrelevant data from a dataset. The process of changing data from one format or structure to another is known as data transformation. Transformation procedures are sometimes known as data wrangling or data munging, since they map and change “raw” data into another format for warehousing and analysis. Steps for data cleaning: (i) Step 1: Removal of duplicate and irrelevant information. Eliminate unnecessary observations from your dataset, such as duplicate or irrelevant observations. Most duplicate observations will occur during data collecting. When you merge data sets from numerous sites, scrape data, or get data from customers or several departments, there are potential to produce duplicate data. De-duplication is one of the most important considerations for this procedure. Observations are deemed irrelevant when they do not pertain to the specific topic you are attempting to study. 
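A rough sketch of this first step, assuming pandas and a hypothetical customer table, is given below; the column names and cut-off years are invented purely for illustration.

import pandas as pd

customers = pd.DataFrame({
    "customer_id": [101, 102, 102, 103, 104],
    "birth_year":  [1985, 1992, 1992, 1958, 1996],
    "spend":       [2500, 1800, 1800, 900, 2200],
})

# De-duplication: keep only the first occurrence of each repeated row.
deduped = customers.drop_duplicates()

# Remove irrelevant observations, e.g. keep only customers born in a
# chosen range (here, illustratively, 1981-1996).
relevant = deduped[deduped["birth_year"].between(1981, 1996)]

print(relevant)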
For instance, if you wish to study data pertaining to millennial clients but your dataset contains observations pertaining to earlier generations, you might exclude these useless observations. This may make analysis more effective and reduce distractions from your core objective, in addition to producing a more manageable and effective dataset. (ii) Step 2: Fix structural errors: When measuring or transferring data, you may detect unusual naming standards, typos, or wrong capitalization. These contradictions may lead to mislabelled classes or groups. For instance, “N/A” and “Not Applicable” may both be present, but they should be examined as a single category. AKASH AGARWAL CLASSES CONT.- 8007777042 / 043 38 PROF. NIKITA OSWAL CMA INTER-DATA ANALYTICS (iii) Step 3: Filter unwanted outliers: Occasionally, you will encounter observations that, at first look, do not appear to fit inside the data you are evaluating. If you have a valid cause to eliminate an outlier, such as erroneous data input, doing so will improve the performance of the data you are analysing. Occasionally, though, the arrival of an outlier will prove a notion you’re working on. Remember that the existence of an outlier does not imply that it is erroneous. This step is required to validate the number. Consider deleting an outlier if it appears to be unrelated to the analysis or an error. (iv) Step 4: Handle missing data: Many algorithms do not accept missing values, hence missing data cannot be ignored. There are several approaches to handle missing data. Although neither is desirable, both should be explored. As a first alternative, the observations with missing values may be dropped, but doing so may result in the loss of information. This should be kept in mind before doing so. As a second alternative, the missing numbers may be entered based on other observations. Again, there is a chance that the data’s integrity may be compromised, as action may be based on assumptions rather than real observations. (v) Step 5: Validation and QA: As part of basic validation, one should be able to answer the following questions at the conclusion of the data cleaning process: (a) Does the data make sense? (b) Does the data adhere to the regulations applicable to its field? (c) Does it verify or contradict your working hypothesis, or does it shed any light on it? (d) Can data patterns assist you in formulating your next theory? (e) If not, is this due to an issue with data quality? False assumptions based on inaccurate or “dirty” data can lead to ineffective company strategies and decisions. False conclusions might result in an uncomfortable moment at a reporting meeting when it is shown that the data does not withstand inspection. Before reaching that point, it is essential to establish a culture of data quality inside the firm. To do this, one should specify the methods that may be employed to establish this culture and also the definition of data quality. AKASH AGARWAL CLASSES CONT.- 8007777042 / 043 39 PROF. NIKITA OSWAL CMA INTER-DATA ANALYTICS 2. Benefits of quality data: Determining the quality of data needs an analysis of its properties and a weighting of those attributes based on what is most essential to the company and the application(s) for which the data will be utilised. Main characteristics of quality data are: (i) Validity (ii) Accuracy (iii) Completeness (iv) Consistency 3. 
Benefits of data cleaning: Ultimately, having clean data would boost overall productivity and provide with the greatest quality information for decision-making. Benefits include: (i) Error correction when numerous data sources are involved. (ii) Fewer mistakes result in happier customers and less irritated workers. (iii) Capability to map the many functions and planned uses of your data. (iv) Monitoring mistakes and improving reporting to determine where errors are originating can make it easier to repair inaccurate or damaged data in future applications. (v) Using data cleaning technologies will result in more effective corporate procedures and speedier decision-making. 4. Data validation: Data validation is a crucial component of any data management process, whether it is about collecting information in the field, evaluating data, or preparing to deliver data to stakeholders. If the initial data is not valid, the outcomes will not be accurate either. It is therefore vital to check and validate data before using it. Although data validation is an essential stage in every data pipeline, it is frequently ignored. It may appear like data validation is an unnecessary step that slows down the work, but it is vital for producing the finest possible outcomes. Today, data validation may be accomplished considerably more quickly than may have imagined earlier. With data integration systems that can include and automate validation procedures, validation may be considered as an integral part of the workflow, as opposed to an additional step. AKASH AGARWAL CLASSES CONT.- 8007777042 / 043 40 PROF. NIKITA OSWAL CMA INTER-DATA ANALYTICS Validating the precision, clarity, and specificity of data is essential for mitigating project failures. Without data validation, one may into run the danger of basing judgments on faulty data that is not indicative of the current situation. In addition to validating data inputs and values, it is vital to validate the data model itself. If the data model is not appropriately constructed or developed, one may encounter problems while attempting to use data files in various programmes and software. The format and content of data files will determine what can be done with the data. Using validation criteria to purify data before to usage mitigates “garbage in, garbage out” problems. Ensuring data integrity contributes to the validity of the conclusions. Types of data validation: 1. Data type check: A data type check verifies that the entered data has the appropriate data type. For instance, a field may only take numeric values. If this is the case, the system should reject any data containing other characters, such as letters or special symbols. 2. Code check: A code check verifies that a field’s value is picked from a legitimate set of options or that it adheres to specific formatting requirements. For instance, it is easy to verify the validity of a postal code by comparing it to a list of valid codes. The same principle may be extended to other things, including nation codes and NIC industry codes. 3. Range check: A range check determines whether or not input data falls inside a specified range. Latitude and longitude, for instance, are frequently employed in geographic data. A latitude value must fall between -90 and 90 degrees, whereas a longitude value must fall between -180 and 180 degrees. Outside of this range, values are invalid. 4. Format check: Numerous data kinds adhere to a set format. 
Date columns that are kept in a fixed format, such as “YYYY-MM-DD” or “DD-MM-YYYY,” are a popular use case. A data validation technique that ensures dates are in the correct format contributes to data and temporal consistency. 5. Consistency check: A consistency check is a form of logical check that verifies that the data has been input in a consistent manner. Checking whether a package’s delivery date is later than its shipment date is one example. 6. Uniqueness check: Some data like PAN or e-mail ids are unique by nature. These fields should typically contain unique items in a database. A uniqueness check guarantees that an item is not put into a database numerous times. Consider the case of a business that collects data on its outlets but neglects to do an appropriate postal code verification. The error might make it more challenging to utilise the data for information and business analytics. Several issues may arise if the postal code is not supplied or is typed incorrectly. AKASH AGARWAL CLASSES CONT.- 8007777042 / 043 41 PROF. NIKITA OSWAL CMA INTER-DATA ANALYTICS In certain mapping tools, defining the location of the shop might be challenging. A store’s postal code will also facilitate the generation of neighbourhood-specific data. Without a postal code data verification, it is more probable that data may lose its value. If the data needs to be recollected or the postal code needs to be manually input, further expenses will also be incurred. A straightforward solution to the issue would be to provide a check that guarantees a valid postal code is entered. The solution may be a drop-down menu or an auto-complete form that enables the user to select a valid postal code from a list. This kind of data validation is referred to as a code validation or code check. Solved Case 1 Maitreyee is working as a data analyst with a financial organisation. She is supplied with a large amount of data, and she plans to use statistical techniques for inferring some useful information and knowledge from it. But, before starting the process of data analysis, she found that the provided data is not cleaned. She knows that before applying the data analysis tools, cleaning the data is essential. In your opinion, what steps Maitreyee should follow to clean the data, and what are the benefits of clean data. Teaching note - outline for solution: The instructor may initiate the discussions with explaining the concept of data cleaning and about the importance of data cleaning. The instructor may also elaborate the consequences of using an uncleaned dataset on the final analysis. She may discuss the steps five steps of data cleaning in detail, such as, (i) Removal of duplicate and irrelevant information (ii) Fix structural errors (iii) Filter unwanted outliers (iv) Handle missing data (iv) Validation and QA AKASH AGARWAL CLASSES CONT.- 8007777042 / 043 42 PROF. NIKITA OSWAL CMA INTER-DATA ANALYTICS At the outset, Maitreyee should focus on answering the following questions: (a) Does the data make sense? (b) Does the data adhere to the regulations applicable to its field? (c) Does it verify or contradict your working hypothesis, or does it shed any light on it? (d) Can data patterns assist you in formulating your next theory? (e) If not, is this due to an issue with data quality? The instructor may close the discussions with explaining the benefits of using clean data, such as: (i) Validity. (ii) Accuracy. (iii) Completeness. (iv) Consistency. AKASH AGARWAL CLASSES CONT.- 8007777042 / 043 43 PROF. 
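The validation checks described in this unit can also be sketched in code. The example below is a hedged illustration only, assuming pandas and an invented store table; the six-digit postal code rule and the column names are assumptions made for the sake of the example, not part of the study material.

import re
import pandas as pd

stores = pd.DataFrame({
    "store_id":    [1, 2, 2, 4],
    "postal_code": ["411001", "4110", "411038", None],
    "latitude":    [18.52, 19.07, 95.00, 15.49],
})

errors = []

# Data type check: latitude should be stored as a numeric column.
if not pd.api.types.is_numeric_dtype(stores["latitude"]):
    errors.append("Latitude column is not numeric.")

# Uniqueness check: store_id should not repeat.
if stores["store_id"].duplicated().any():
    errors.append("Duplicate store_id values found.")

# Range check: latitude must lie between -90 and 90 degrees.
if (~stores["latitude"].between(-90, 90)).any():
    errors.append("Latitude values outside the valid range.")

# Format check: postal codes are assumed here to be six digits.
pattern = re.compile(r"^\d{6}$")
bad_codes = stores["postal_code"].apply(
    lambda x: x is None or not pattern.match(str(x))
)
if bad_codes.any():
    errors.append("Missing or malformed postal codes.")

print(errors if errors else "All validation checks passed.")

In practice such checks would be built into the data integration workflow and run automatically, rather than being applied as a one-off manual script.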
3. DATA PRESENTATION: VISUALIZATION AND GRAPHICAL PRESENTATION

This Module Includes:
3.1 Data Visualization of Financial and Non-Financial Data
3.2 Objective and Function of Data Presentation
3.3 Data Presentation Architecture
3.4 Dashboard, Graphs, Diagrams, Tables, Report Design
3.5 Tools and Techniques of Visualization and Graphical Presentation

UNIT - 1 DATA VISUALIZATION OF FINANCIAL AND NON-FINANCIAL DATA

There is a saying ‘A picture speaks a thousand words’. Numerous sources of in-depth data are now available to management teams, allowing them to better track and anticipate organisational performance. However, obtaining data and presenting it are two distinct and equally essential activities. Data visualization comes into play at this point. Recent studies reveal that top-performing finance directors are more likely than their peers to emphasise data visualization abilities. The capacity to explain complicated ideas, identify informational linkages, and provide captivating narratives resulting from data not only elevates finance’s position in
