Full Transcript

DATA MINING & ANALYTICS Tommie Singleton, Ph.D, CPA, CITP, CISA (256) 762-5252 CHAPTER 3 [email protected] 1 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.1)  Fina...

DATA MINING & ANALYTICS Tommie Singleton, Ph.D, CPA, CITP, CISA (256) 762-5252 CHAPTER 3 [email protected] 1 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.1)  Financial Transactions and Reporting drives data  Marketing, Sales, Process Control are major drivers for promoting data quality  Also produces comparable numbers/data  Also produces facts about certain aspects of the business  Customer Relationship Management (CRM) is modern breadth of data captured and creative use of that data  Must be able to ascertain relevant DM results for specific customers and their behavior  CRM data combined with DM/DA to analyze data and build optimal communication with customers  Examples:  Working customers who have not bought lately  Sales efforts directed at certain demographics  Improve communications with customers 2 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.1)  Knowledge, information and data are NOT synonyms  In different disciplines, they are used differently  Example Computer Science and Business Information Systems have different approaches  Economics and Business Information Systems have divergent views of these terms  Computer Science (some to most of “IT”) use information and data interchangeably  Business use them as two unique definitions  Knowledge can be seen by some disciplines as something not stored in data systems 3 Figure 3.1 Important terms in evolution of data 4 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.1) – Data, Information, Knowledge, Wisdom (3.1.1)  We will use a clear distinction between knowledge and data  Often, starting point is definition of information as knowledge needed for special purposes The Rock (1934) Where is the life we have lost in living? Where is the wisdom we have lost in knowledge? Where is the knowledge we have lost in information? -- T.S. Elliot 5 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.1) – Data, Information, Knowledge, Wisdom (3.1.1)  D = Data which is facts without context  I = Information which is facts with some context and perspective  K = Knowledge which is information used to detect and understand patterns in the data  W = Wisdom which is knowledge and happens when you understand why the patterns are occurring  D = 10, Shayne Singleton, Ellie Singleton, Montana, Eureka (think of it as a cell in Excel)  I = 101 Main Street (an address)  K = Demographics on the person living at 101 Main Street and what they buy (pattern)  W = WHY the person living there buys what they purchase 6 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.1) – Data, Information, Knowledge, Wisdom (3.1.1)  Knowledge may or may not be part of the data itself  Can and should be included in data preparation  Company has a seasonal competitor that affects their sales periodically  Blips in production may be caused by a known fault 7 Figure 3.2 The Evolution of Wisdom 8 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.1) – Sources and Quality of Data (3.1.2)  Data used by organizations to be transformed into information and knowledge can be derived from sources internal, external and both  Operational systems move large amounts of data internally through various systems and processes  Primarily, that internal data is to handle everyday business events, transactions, facts  Generally, they lack the capability to keep a comprehensive history of data  Most financial applications/software systems only handle 1 year of data at a time  CRM keeps years  Data Warehouses were created to capture comprehensive history of data, including melding external data with internal data  Inconsistencies in data, such as duplicative data across systems  Many quality defects affect the data, and have an even greater impact on analysis of information systems  Quality of data is CRITICAL to success of data mining/analysis (internal, external, transform) 9 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.1) – Sources and Quality of Data (3.1.2)  The quality of the data used has a direct and significant impact on the quality of the data analytics based on it  At least the quality and reliability of internal data is in the control of the company  Internal data is normally much more accessible as well  This is NOT the case for external data 10 Figure 3.3 Typical internal and external data in systems 11 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.1) – Sources and Quality of Data (3.1.2)  External data, being generated outside company’s systems, is often needed:  Additional information like credit rating  Reference values like  Bureau of Labor Statistics – https://www.bls.gov  National Institute of Standards & Technology – https://www.nist.gov  Also purchased external data may be necessary  Often the case with DBM and CRM  Internal data often has better quality than external  Can control how it is captured and controls for completeness and accuracy  External data may not match time scope properly  We need annual data based on calendar year while external data may be fiscal year beginning Oct. 1  Even with poor match, external data can be useful to supply additional relevant information 12 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.1) – Measurement Level and Types of Data (3.1.3)  Different terms used to describe different types of data  NOMINAL DATA:  Simplest level of measurement (Latin for “name”)  Type of qualitative data which groups variables into categories  Indicates which category is applicable  Example: A customer may live in an URBAN area, RURAL area, of MIXED area  BINARY DATA:  If there are only two categories (buy, not buy) 13 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.1) – Measurement Level and Types of Data (3.1.3)  ORDINAL DATA:  Where variables have natural ordered categories and distances between categories is unknown  Data exist on an ordinal scale (ranking)  Example: AP Poll for college football  The clothes were the wrong size. Needed a 38” jeans, 36” jeans were sent and is wrong size  The clothes were too big (size is ordinal)  May be classified as:  Too Big  Too Small  Unspecified size  Non-size related  Nominal measurement at four levels 14 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.1) – Measurement Level and Types of Data (3.1.3)  Variables that represent size are referred to as measures, measurements, or METRICS  In DM, term “Metric” includes counts of some data type, like page views, and may correspond to a column of data  Metric variables can be count, continuous, interval or ratio.  Metric level could be a continuous scale  Distance travelled to make a purchase  Does not have to be a whole number  Metric level would be interval (equal between units) if the variable is the number of occurrences  The number of returns for a customer  This is discrete (noncontinuous) data measured on an interval scale  Altmetrics measures interaction based on social web resulting in variables such as the number of hits or mentions across the web  Metric level could be interval or ratio type (zero is meaningful, ratios are constant: 2,4,8…) 15 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.1) – Measurement Level and Types of Data (3.1.3)  Data can also be quantitative, strictly numbers  Qualitative data can be summarized into quantitative data  Analysis of content can be expressed in terms of counts and measured in terms of impact or quantity of relationships  Content analysis may also result in nominal data where categories can be given names 16 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.1) – Measures of Magnitude & Dispersion (3.1.4)  Magnitude mainly:  Mean  Median  Quantiles 17 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.1) – Measures of Magnitude & Dispersion (3.1.4)  Magnitude mainly:  Mean  Weighted Average: 2670/10=26.7  Average: 220/10=22.0  Why is latter higher? 18 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.1) – Measures of Magnitude & Dispersion (3.1.4)  Magnitude mainly:  Median (0,0,12,14,21,24,26,33,34,56)  Middle is between 5 and 6 (21 and 24) = 22.5  Half of the values will be below the median and half above it 19 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.1) – Measures of Magnitude & Dispersion (3.1.4)  Quartiles is a type of Quantile (sextiles = 6)  First quartile has 25% below it  Usually with larger volume of data  Second is median, 50% values below it  Each of the 4 segments make up 25% of the values  Third has 75% of values below it 20 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.1) – Measures of Magnitude & Dispersion (3.1.4)  The measures of magnitude can be used to assess the distribution of interval (metric) data  If the difference between MEAN and MEDIAN is large, then that indicates:  Data is not normally distributed  Data may be skewed  Data may have one or more OUTLIERS  In financial audits, data that has any of those represents RISK and must be analyzed and addressed in the audit  Not normally distributed: either determine WHY or adjust audit for higher level of risk  In other specific business questions, data scientists will need to TRANSFORM the data into normal data  Same for skewed  Identify the OUTLIERS which are sometimes called individually significant items, and include them in the audit procedures regardless of samples  In other specific business questions, it may be best to OMIT OUTLIERS as stats suggests 21 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.1) – Measures of Magnitude & Dispersion (3.1.4)  Dispersion measures how variable the data values are  Range  Arithmetic difference between largest value and smallest value (0 and 56, thus 56)  Inter-quartile range  Difference between first quartile and third quartile  Variance and standard deviation  Both important in data mining and analytics  Example: 95% confidence interval is often relevant in accounting: 2 SD from MEAN  Variance is square of 1 standard deviation 22 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.1) – Data Distributions (3.1.5)  Distribution is the way data values vary  Variation of categorical data can be shown by frequency of occurrences of each level either in a table or a bar chart 23 24 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.2) – Data Partition: Random Samples for Training, Testing and Validation  For marketing/sales business questions, there are usually enough customers available for almost all questions in the analysis  Thus, a representative sample will suffice for quality DM/DA  That approach will be more efficient as it speeds up the calculations for modeling, and it provides samples to test and validate the model (effectiveness and efficiency)  If the database is large enough, then generate several samples for training, testing, and validation purposes  If large enough, can sample without replacement (each case can only be picked once)  If too small, sample with replacement (a case can appear more than once in different samples)  Model is generated (learning) using the training sample, then apply to model to one of the other samples (testing samples), and then validate using other samples 25 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.2) – Data Partition: Random Samples for Training, Testing and Validation  Authors suggest that “large” be 100,000 or more records  Also suggest taking a random sample of 30,000 as effective size  If too few are chosen, may not have enough of the target cases who are in fact the ones whose behavior we want to understand in a simple random sample  May need to use a stratified random sample approach  Random samples are taken from each of the strata given by the target variable 26 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.2) – Data Partition: Random Samples for Training, Testing and Validation  Questions:  What should the split be between strata in the sample?  If target group is less than 1% and the number of target cases is less than 15,000, then a 1:2 or 1:3 split should be used  Choosing only 15,000 from a much larger population of non-target cases risks missing too many of the more unusual cases (target), whereas 1:3 (45,000) lessens the risk 27 28 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.3) – Types of Business Information Systems  Two basic types of IS  Operational systems supporting business processes  Analysis-based IS  My take:  Financial / Accounting related IS dominate the range of IS  Support systems for specialized departments / professions  Marketing/Sales: CRM, Sales (tracking prospects etc.)  R&D  Manufacturing  CAD/CAM  OLAP: BSA-AML, DM/DA  The development of Data Warehouses tends to meld all of these, plus external IS/data 29 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.3.1) – Operational Systems Supporting Business Processes  Almost all businesses use IS to support operations and production  Automating various operational processes for efficiency  Standardizing administrative processes for efficiency and effectiveness (i.e., result is more accurate and complete thus more reliable)  Example: ERP  The growth of IS across organizations can easily lead to duplicative data, redundant data, redundant processes, inconsistency between IS, and inability to automatically move data from one IS to a successor IS  SAIC gig  Maintenance and consistency of data across systems are not always effective enough  That is, end up being heterogeneous systems (vs. homogeneous) 30 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.3.2) – Analysis-Based Information Systems  ABIS: Refers to all systems that are necessary to store the data and make it ready for DM/DA  Example: Data Warehouse, Knowledge-Base Information System (KBIS)  Also include the tools necessary to take full advantage of the possible benefits  Includes a front-end tools to load and transform data  Includes storage components of data effectively  ABIS can consist of Data Warehouses and Data Marts as well as analytical tools (that is, Online Analytical Processing – OLAP including DM tools) 31 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.3.3) – Importance of Information  Entities that use innovative technologies effectively are able to adapt to rapid changes in commercial environment  It can also be a competitive advantage  The large volume of data captured in business operations can be valuable as “data capital”, but could be poorly utilized or even ignored  Data Capital:  Is the value associated with the data  Can only be assessed when data is turned into information  Is often poorly utilized or left idle 32 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.3.3) – Importance of Information  Essential information needed by decision makers is often not made available at the critical time or point of need  Or is not in a form necessary to make creative analysis and decision making  In these IS, the 80/20 rule probably holds true  That is, 80% of the decisions rely on 20% of the data; thus 80% of the data lies idle  That entity may be storing a lot of data, but is not fully using it in decision making  Turning the 80% idle data into INFORMATION may improve decision making and their results 33 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.3.3) – Importance of Information  Examples of questions relevant to decision makers:  Which customers should be made a particular offer?  Which customers are at risk of leaving us (i.e., churning)?  How high is the cross-selling potential for a new product?  What is the lifetime profit that can be achieved with which customers?  How can top prospects with high lifetime values be attracted?  What is the turnover that can be achieved in the next year?  My accounting examples:  What is the lifetime costs of an existing system or operational equipment?  Is there any evidence of fraud in our transactions?  Which product / service has the highest ROI? 34 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.3.3) – Importance of Information  Simply fact is we may NOT be capturing critical data at the time of a specific event, but SHOULD, and if we did, we could make MORE EFFECTIVE decisions  May need to modify software / applications  May need to meld a new system into the overall BIS  Example, implement a Data Warehouse  Sometimes, the data is already there but needs to be transformed first  Sometimes, it is there readily made into INFORMATION that will lead to more effective decisions 35 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.4) – Data Warehouse  DEFINITION: A collection of data which is gathered for the specific purpose of analysis (solving certain types of management problems, learning patterns and relationships that have a significance, or marketing purposes).  For us, meeting broad financial and accounting needs related to knowledge contained in internal and external data  Data in a DW is very different to data in all other IS in an entity.  The relevant data in a DW has been “cleaned” and integrated or processed into higher quality data (because of completeness and accuracy)  DW are designed to provide a broad overview of information / knowledge while an IS typically does not allow such an overview  Mapping of historical data, data history, and external data play a large role in effectiveness of data in a DW  The term is generally understood to mean something that serves as an enterprise-wide DB for a wide range of applications to support analytical tasks and objectives for experts and executives 36 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.4) – Data Warehouse  DEFINITION: A collection of data which is gathered for the specific purpose of analysis (solving certain types of management problems, learning patterns and relationships that have a significance, or marketing purposes).  The DW is operated separately from the operational IS and makes use of data from internal IS as well as external databases  A centralized holistic resource  DW generally seen as topic-oriented DB rather than a concrete DB system  Has separate entity-specific applications for DW  Embraces an underlying concept of combining decision-making projects with decision-related data  Therefore, other DBs are specific to particular software, but DW depends completely upon the ideas an entity wants to explore, and cannot be built mechanically by software systems alone 37 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.4) – Data Warehouse  The contents of a Data Warehouse can be characterized by four main features, which demonstrate the significant differences to operational data 1. Topic Orientation 2. Logical Integration and Homogenization 3. Presence of a Reference Period 4. Low Volatility 38 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.4.1) – Topical Orientation  Operational Systems are oriented towards specific entity units and / or work processes  Data Warehouse is oriented towards matters which affect decisions made by the entity in a specific topic area  The Customer  The Product(s) or Service(s)  The Payments or Cash Flows  Advertising or Sales Campaigns  Such an approach also affects the choice of an effective data model  Operational systems tend to limit data to that of a particular source document  Data Warehouses can easily provide ALL data and information associated with a customer, vendor, or other relevant object regardless of which system the data is in 39 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.4.2) – Logical Integration and Homogenization  Originally, Data Warehouses consisted of data structures based on relational DBs  Recently, DW also use unstructured databases, no-SQL databases included in “big data”  The high volume of data coming from log files, social networks, etc. necessitate a different architecture and way of storing data (e.g., ability to “search” across lots of data for collection of data on a topic much like a Google search does)  The goal of DW and Big Data is an enterprise-wide integration of all relevant data into a consistent set of databases in a continuous system model  Implies cross-functional use of data 40 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.4.3) – Reference Period  Information to be useful for decision support must be provided timely (usually “quickly”)  It is relatively unimportant for data processing to be taking place at a precise time in operational systems  It is much more important to be able to incorporate different periods of time without problems in subsequent analysis  In operational systems, the time factor has only a descriptive role  In DW, time factor is an important structural component  Special factor of DW is that historical data is retained even when data in operational systems is deleted or destroyed (e.g., upgrade to new system cause loss of data) 41 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.4.4) – Low Volatility  Once data is transformed (i.e., “cleaned”), data in a DW should not ever change again or need to be changed  There may be amendments, additional data to the DW as a whole  Example:  Where there are errors in data due to a faulty charging process, the operational system will likely overwrite the old records but a DW will insert new records to show the new process  Where a purchase by a customer was cancelled, the operational system would overwrite the original sale or be delete it. In the DW, both records would exist allowing a more robust analysis of those transactions. 42 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.4.5) – Using the Data Warehouse  The DW is populated by operational data regularly  But thereafter the DW acts independently from the operational system  Main purpose is to support decision making for analytical purposes  A DW allows for efficient read access to very large amounts of data designed in complex structures  Allows for complex queries that involve large amounts of data (what impact does the LARGE aspect have?)  Allows for extensive aggregation of data (what impact does the EXTENSIVE have?)  Usually, DW subject to significant fluctuations with peaks, whereas operational systems tend to be constant or uniformly distributed usage 43 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.4.5) – Using the Data Warehouse  Need to consider 3 different forms of organization: 1. Central Database Warehouse 2. Distributed Database Warehouse 3. Virtual Database Warehouse  Most common form is #1 (Central DB DW)  Where the management of all datasets for front-end applications is on a single system  #2 – Distributed  When different departments might run their own DWs optimized for their needs  A.K.A. Data Mart: prepared data from a DW for special usage  #3 – Virtual  Creating “views” of original data in legacy system  ETL process creates virtual views instead of real data 44 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.5) – Three Components of a DW: DBMS, DB, DBCS  A DW can be seen as a DB system sharing the three components of:  DBMS – DataBase Management System  DB – DataBase  DBCS – DataBase Communication System  DBMS contains metadata on loading, error detection, constraints and validation  DB is the storage of the DW data  DBCS refers to the analysis of data using SQL or other tools 45 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.5.1) – DataBase Management System (DBMS)  Mainly for managing the analysis-oriented DB  Provides the functionality for data definition and manipulation  Has different requirements than operational system 46 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.5.2) – DataBase (DB)  Integrity and consistency of DB in DW are evaluated differently in an operational system, as are data security and availability  Because data in DW is made up of copies of operational datasets with some transformation (e.g., cleaning) of the operational data and any additional information extracted from the data  Example: Operational data may only store salutation (Mr., Ms., Dr.) but in DW may also have marital status, age, or other demographical data. If address changes in operational data, just records new data. DW records old and new address as well as the date it changed  Integrity, consistency, security and availability in DW is greater than operational systems  Storage and access must be optimized in DW to provide timely response to complex queries and analysis, but not lose flexibility in the analysis 47 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.5.3) – DataBase Communication System (DBCS)  DBCS play a prominent role in analysis-oriented systems because without them the data store in a DW can be very difficult to be useful effectively and efficiently  Indispensable part of DW is detailed metadata / meta-databases  Meta data may be essential to perform the analysis effectively 48 49 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.6) – Data Marts  Data Marts are different from Data Warehouse as terms, but both are part of analysis- oriented systems  A Data Mart is a specific collection of data where only the needs of a specific view and use are mapped  Example: Data Mart constructed for customer-based predictive analyses such as prediction of those customers with the highest probability to buy next  Can be thought of as a subset DB of the DW, and simultaneously an entity of its own  Will lead to duplicated data  Why would we do that? (Hint: BIG DATA, COMPLEX queries)  The care of such Data Marts is expensive  Consider the advantages and disadvantages of heterogeneous data structures (i.e., relational, key fields)  Also requires regular updates 50 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.6.1) – Regularly Filled Data Marts  Data Marts are often needed for reporting and OLAP or when data mining aggregated data that is continuously available  Data Mart tables and files are compressed at different levels (e.g., sales figures for the current level)  Depending on definition of information, should be available daily, weekly, or monthly  The shorter the update period, the more important it is for the process of updating to be fully automated or refills are carried out according to fixed rule and within defined procedures  Daily loading will likely occur during the night after operational system data has been completely processed and then accumulated into DW, providing most current information 51 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.6.2) – Comparison between Data Marts and Data Warehouses  In a Data Mart, data is aggregated for the analysis of a major interest  However, there is little flexibility to analyze detailed individual values or to respond to new data requirements to link them  DW do not have that dilemma  Data Marts may be deployed to contain aggregated and possibly transformed data  In the DW, data is stored in finest available granularity  In DW, data is stored as quasi-relational data while Data Marts will be both relational and multi-dimensional 52 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.7) – Typical Example from Online Marketing  Online Marketing often faced with Data Marts that are created regularly (daily) out of a log file data stream or out of a data warehouse counting log file and clickstream data (traffic/path/sticky and e-commerce)  Statisticians or data scientists can use the Data Mart as a starting point for their analysis without investing a lot of time in data preparation  For most websites, the number of users or unique clients is too big, so some of the data marts just include a representative sample of them  Such a data mart can have structure like Figure 3.8 and 3.9 53 Figure 3.9 Figure 3.8 54 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.7) – Typical Example from Online Marketing  It is very likely that the Data mart will have 1,000 variables or more if you are recording the clicks for all the different areas of content  That can be a mass of data (Figure 3.8 show only last 10 slots)  If you know it is important to find the average of clicks, or the development / trend of the clicks by the average browser, then you would need to extend the definition of information (thus the size of the dataset / Data Mart)  Figure 3.10 shows one them, Item #1021 (fashion/clothing) for the earlier example 55 Figure 3.10 56 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.7) – Typical Example from Online Marketing  This example illustrates the need to invest time and maybe money to decide about Data Marts that will be created regularly by the system  If the data scientist has to reconstruct the data every time they need the data, it might cost much more than the benefits  Is more complicated if more than one data scientist is involved 57 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.8) – Unique Data Marts  Two kinds:  Permanently available DM  Created through a unique, sometimes complex, analysis (i.e., Ad Hoc) 58 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.8.1) – Permanent Data Marts  Data Marts that are condensed from history do not change when more data is available  Only need to be constructed ONCE initially and except for known errors in that data, cannot be changed  The role for Permanent DM’s is to be information provider for various analyses  Example: Sales figures 3 years ago, production data 6 months ago (static data)  Basically, this type of DM is a “baseline” to be used in various related Data Mining / Analytics projects 59 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.8.2) – Data Marts Resulting from Complex Analysis  Especially applicable to ad hoc queries and ad hoc data mining / analysis  When it is necessary to consolidate data from the DW from different views and contexts 1. Empower the user (or allow the data scientist) to contact the DW administrator with a request to create a Data Mart according to specified rules 2. Provide the user with an appropriate software solution / tool so they can create the appropriate data marts themselves  These kinds of Data Marts are only done to solve one problem or to carry out a special kind of data analysis  Will need a review of some kind to make sure Data Mart contains fruitful variables 60 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.9) – Data Marts Do’s and Don’ts  Important issues in creating a Data Mart regarding creation process, handling of the Data Mart, and coding/programming aspects 61 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.9.1) – Do’s and Don’ts for Processes 1. Do not forget the relevant background and domain knowledge 2. Do use check sums – they should match the numbers you know from reports 3. Do cross-check between different tables and source systems – check that the results fit together and that they represent the relations and averages you have in mind (sanity check) 4. Do not start the analysis and / or estimation too early 5. Do define meaningful meta-data 6. Do prepare and/or transform the data to a shape that suits the methods you plan to use 7. Do explore the data using easy descriptive analysis and graphical representations 8. Do carry out the peer review (independent review) – if no second analyst is available, review it yourself the next day 62 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.9.2) – Do’s and Don’ts for Handling 1. Do make sure you can identify and trace every record (by suitable ID variables) at every point so as to enable cross-checking with the original data.  Don’t ever drop the ID variables even if they are not specifically useful for the analytics 2. Don’t ever lose a relevant case from a dataset; it can happen easily  In particular, cases where “nothing happened” in one of several tables are likely to disappear during a joint manipulation / join function 3. Do use meaningful units to categorize continuous variables; the units can be determined based on statistics or business rules 4. Do check the distributions of variables 5. Do use meaningful variable names or labels; it is quite handy to show the units in the variable name (lb., day, $) 63 CHAPTER 3: ALL ABOUT DATA  Some Basics (3.9.3) – Do’s and Don’ts for Coding/Programming 1. Do use options to optimize your dataset (if available in the programming language) 2. Do not forget to check the log information in great detail 3. Do structure your code 4. Do use shortcuts, and/or macros for frequently used bits of code 5. Do document the code so that a third person will get a clear picture of how the code was constructed and to ensure “bugs” are able to be detected and to allow for future changes efficiently 64 DATA MINING & ANALYTICS Tommie Singleton, Ph.D, CPA, CITP, CISA (256) 762-5252 CHAPTER 3 [email protected] 65 TUTORIAL: SQL 66 CHAPTER 3: ALL ABOUT DATA  SQL: Structured Query Language (“sequel”)  IBM invented a tool to do certain commonly needed functions for databases (circa 1970)  A “SQL” database is one that allows a SQL tool to perform the standard SQL commands  Search the DB looking for some specific criteria (e.g., list of customers who have not bought anything in the last 3 months)  Add a record  Delete a record  Modify a record 67 68 TUTORIAL: CHECKSUM 69 CHAPTER 3: ALL ABOUT DATA  Checksum  A checksum is a value used to verify the integrity of a file or a data transfer. In other words, it is a sum that checks the validity of data. Checksums are typically used to compare two sets of data to make sure they are the same.... For example, a basic checksum may simply be the number of bytes in a file. 70 Accounting Usage of Checksum Method Sum Count Sum Return Number= Items=x Amount= 71 TUTORIAL: RELATIONAL DATABASES 72 Data Structures & Definitions Invoice Number US-001 Invoice Date 11/02/2019 P.O. # 23122019 Due Date D02/26/2019 Customer Name John Smith Customer Address 2 Court Square Customer City New York Customer State NY Customer Zip 12210 1.Qty Bought 1 1.Item Bought Front and Rear Brake cables 1.Unit Price 100 1. Amount 100 … … Sales Tax 9.06 Invoice Total Amount 154.06 73 74 Data Structures & Definitions CUSTOMERS Invoice Number Customer ID CUSTOMERS Customer Name Invoice Date KEY FIELD: P.O. # Customer Address Due Date Customer City Customer ID Customer Name Customer State Customer Address Customer Zip Customer City Customer State INVOICES Customer Zip 1.Qty Bought Invoice Number INVOICES Invoice Date 1.Item Bought KEY FIELD: 1.Unit Price P.O. # 1. Amount Due Date Customer ID … Customer ID Sales Tax Qty Bought Invoice Total Amount Item Bought Unit Price Amount Return Sales Tax Invoice Total Amount 75

Use Quizgecko on...
Browser
Browser