Summary

This document describes BPM Chapter 5, covering process mining techniques, data quality issues, and event log creation. It discusses extracting, transforming, and loading (ETL) data for process mining, along with event log assumptions and examples related to data sources and formats. The document emphasizes the importance of filtering, scoping, and data analysis in process mining.

Full Transcript

BPM Chapter 5

The goal of process mining is to answer questions about operational processes. Examples are:

- What really happened in the past?
- Why did it happen?
- What is likely to happen in the future?
- When and why do organizations and people deviate?
- How can a process be controlled better?
- How can a process be redesigned to improve its performance?

In the context of BI and data mining, the phrase "Extract, Transform and Load" (ETL) describes the process that involves:

- extracting data from outside sources,
- transforming it to fit operational needs (dealing with syntactical and semantical issues while ensuring predefined quality levels), and
- loading it into the target system, e.g., a data warehouse or relational database.

A data warehouse is a single logical repository of an organization's transactional and operational data. The goal is to unify information such that it can be used for reporting, analysis, forecasting, etc.

Examples of syntactical and semantical issues:

- One data source may identify a patient by her last name and birth date while another data source uses her social security number.
- One data source may use the date format "31-12-2010" whereas another uses the format "2010/12/31".

If a data warehouse already exists, it most likely holds valuable input for process mining. However, even if a data warehouse is present, it does not need to be process-oriented. Whether there is a data warehouse or not, data needs to be extracted and converted into event logs. Scoping is of the utmost importance: often the problem is not the syntactical conversion but the selection of suitable data.

For the moment, we assume that one event log corresponds to one process, i.e., when scoping the data in the extraction step, only events relevant for the process to be analyzed should be included. Depending on the questions and viewpoint chosen, different event logs may be extracted from the same data set.

Once an event log is created, it is typically filtered. Filtering is an iterative process: coarse-grained scoping was done when extracting the data into an event log, and filtering corresponds to fine-grained scoping based on initial analysis results. For example, for process discovery, one can decide to focus on the 10 most frequent activities to keep the model manageable.

Note that process mining results most likely trigger new questions, and these questions may lead to the exploration of new data sources and more detailed data extractions. Several iterations of the extraction, filtering, and mining phases are usually needed.
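The frequency-based filtering just described can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: it assumes a deliberately simplified log representation (a list of traces, each trace a list of activity names) and a hypothetical helper name.

```python
from collections import Counter

def keep_most_frequent(log, n=10):
    """Keep only the n most frequent activities in a simplified log.

    `log` is assumed to be a list of traces, where each trace is a
    list of activity names (real logs carry far more attributes).
    """
    freq = Counter(act for trace in log for act in trace)
    keep = {act for act, _ in freq.most_common(n)}
    # Project every trace onto the selected activities; drop traces
    # that become empty after filtering.
    filtered = [[act for act in trace if act in keep] for trace in log]
    return [trace for trace in filtered if trace]

log = [["a", "b", "c", "d"], ["a", "c", "b", "d"], ["a", "e", "d"]]
print(keep_most_frequent(log, n=3))
# -> [['a', 'b', 'd'], ['a', 'b', 'd'], ['a', 'd']]
```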
Event Logs

Assumptions:

- An event log contains data related to a single process.
- Each event in the log refers to a single process instance, often referred to as a case.
- Events can be related to some activity.

A process is a collection of activities such that the lifecycle of a single instance is described. As a bare minimum, all events have a timestamp.

Optional:

- Resources, i.e., the persons executing the activities.
- Costs associated to events.

A process consists of cases. A case consists of events such that each event relates to exactly one case. Events within a case are ordered. Events can have attributes; examples of typical attribute names are activity, time, costs, and resource. Not all events need to have the same set of attributes. However, events referring to the same activity typically have the same set of attributes.

For convenience we assume the following standard attributes:

- #activity(e) is the activity associated to event e.
- #time(e) is the timestamp of event e.
- #resource(e) is the resource associated to event e.
- #trans(e) is the transaction type associated to event e; examples are schedule, start, complete, and suspend.

Two activity instances can leave the same footprint in the log. Given the footprint of two starts followed by two completes of the same activity, there are two possible scenarios. In one scenario the durations of the two activity instances are 5 and 6; in the other scenario, the durations are 9 and 2. Yet both scenarios leave the same footprint in the event log.

Correlation Problem

The "primary correlation problem" is to relate events to cases, i.e., process instances. It can be addressed by adding information to the log or by using heuristics. The "secondary correlation problem" is relating two events within the same case, e.g., matching a start event to the corresponding complete event. Possible resolutions:

- Assume a first-in-first-out order and pick the first scenario.
- Introduce timeouts when the time between a start event and a complete event is too long. For example, start events that are not followed by a corresponding complete event within 45 minutes are removed from the log.
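The FIFO heuristic and the timeout rule can be combined into a small matching routine. The sketch below is an illustration under simplifying assumptions: events are dicts with hypothetical "activity", "trans", and "time" keys, times are in minutes, and the function name is invented.

```python
def match_instances(events, timeout=45):
    """Pair start/complete events of the same activity FIFO-style.

    `events` is assumed to be a time-ordered list of event dicts.
    Start events whose complete does not arrive within `timeout`
    minutes are removed from consideration.
    """
    open_starts = {}  # activity -> FIFO queue of pending start times
    instances = []
    for e in events:
        queue = open_starts.setdefault(e["activity"], [])
        if e["trans"] == "start":
            queue.append(e["time"])
        elif e["trans"] == "complete":
            # Drop stale starts that timed out before this complete.
            while queue and e["time"] - queue[0] > timeout:
                queue.pop(0)
            if queue:
                start = queue.pop(0)  # FIFO: oldest pending start first
                instances.append((e["activity"], start, e["time"]))
    return instances

# The footprint from the text: two starts (t=0, t=3) followed by two
# completes (t=5, t=9) of activity a. FIFO yields durations 5 and 6;
# the alternative pairing would yield 9 and 2.
events = [
    {"activity": "a", "trans": "start", "time": 0},
    {"activity": "a", "trans": "start", "time": 3},
    {"activity": "a", "trans": "complete", "time": 5},
    {"activity": "a", "trans": "complete", "time": 9},
]
print(match_instances(events))  # [('a', 0, 5), ('a', 3, 9)]
```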
Activities and Classifiers

In process models, activities play a central role. They correspond to:

- transitions in Petri nets,
- tasks in YAWL,
- functions in EPCs,
- state transitions in transition systems, and
- tasks in BPMN.

There may be multiple events referring to the same activity. Some process mining techniques consider the transactional model whereas others just consider atomic events. Sometimes we want to focus only on complete events, whereas at other times the focus may be on withdrawals. This can be supported by filtering (e.g., removing events of a particular type) and by the concept of a classifier. A classifier is a function that maps the attributes of an event onto a label used in the resulting process model; it can be seen as the "name" of the event. In principle there can be many classifiers.

Simple Event Log

A simple event log is just a multi-set of traces over some set of activities A. For example,

L = [⟨a, b, c, d⟩³, ⟨a, c, b, d⟩², ⟨a, e, d⟩]

defines a log containing 6 cases. In total there are 3 × 4 + 2 × 4 + 1 × 3 = 23 events. All cases start with a and end with d. In a simple log there are no attributes; e.g., timestamps and resource information are abstracted from. Moreover, cases and events are no longer uniquely identifiable: the three cases following the sequence ⟨a, b, c, d⟩ in the simple event log cannot be distinguished.

XES Standard Extensions

XES defines five standard extensions:

1. Concept Extension
   a. Defines the name attribute for traces and events.
   b. Also defines the instance attribute for events, used to distinguish different activity instances in the same trace.
2. Life-Cycle Extension
   a. Defines the transition attribute for events.
3. Organizational Extension
   a. Defines three standard attributes for events: resource, role, and group.
4. Time Extension
   a. Defines the timestamp attribute for events.
5. Semantic Extension
   a. Defines the modelReference attribute for all elements in the log.
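To make the extensions concrete, here is a small hand-written XES fragment: a single trace with one event, using the concept, life-cycle, organizational, and time extensions. It is illustrative only, not a complete log; the activity name, case identifier, and resource value are made up.

```xml
<log xes.version="1.0" xmlns="http://www.xes-standard.org/">
  <extension name="Concept" prefix="concept" uri="http://www.xes-standard.org/concept.xesext"/>
  <extension name="Lifecycle" prefix="lifecycle" uri="http://www.xes-standard.org/lifecycle.xesext"/>
  <extension name="Organizational" prefix="org" uri="http://www.xes-standard.org/org.xesext"/>
  <extension name="Time" prefix="time" uri="http://www.xes-standard.org/time.xesext"/>
  <trace>
    <string key="concept:name" value="case-1"/>
    <event>
      <string key="concept:name" value="register request"/>
      <string key="lifecycle:transition" value="complete"/>
      <string key="org:resource" value="Pete"/>
      <date key="time:timestamp" value="2010-12-31T10:30:00+01:00"/>
    </event>
  </trace>
</log>
```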
Challenges when extracting event logs

1. Correlation: Events in an event log are grouped by case, but correlating events can be challenging. Event data may be scattered across multiple tables or systems, making it difficult to identify related events and cases. This includes matching responses to requests in inter-organizational communications. While this is easily addressed in new systems, legacy and interconnected systems require extra effort for event correlation.
2. Timestamps too coarse: Many systems only record dates without timestamps. For instance, hospital systems often log only the date of patient events, not the exact time. This makes it impossible to determine the event order within a single day. Solution 1: Use partial ordering algorithms instead of assuming a total order. Solution 2: Estimate the order using domain knowledge and common patterns.
3. Snapshots: Cases may extend beyond the recorded period; some cases start before or end after the log's timeframe, so event logs only show a snapshot of ongoing processes. For short-duration cases, remove incomplete ones. Known start/end activities make this filtering simple: remove cases missing their beginning or end (see the sketch after this list). Discovering complete processes becomes challenging when case duration approaches the length of the recording period.
4. Scoping: Domain knowledge is needed to locate the required data and to scope it. The desired scope depends on both the available data and the questions that need to be answered.
5. Granularity: Event logs often contain more granular detail than what end-users need. Low-level system events are too detailed for stakeholders reviewing processes. Preprocessing can help by converting low-level patterns into meaningful activities.
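A minimal sketch of the snapshot filter from challenge 3, under the same simplified log representation as before (a list of traces of activity names); the start and end activity names here are hypothetical and depend on the process at hand:

```python
def remove_incomplete_cases(log, start_activity="a", end_activity="d"):
    """Keep only cases that contain both the known first and last activity."""
    return [
        trace for trace in log
        if trace and trace[0] == start_activity and trace[-1] == end_activity
    ]

log = [
    ["a", "b", "c", "d"],  # complete case
    ["b", "c", "d"],       # started before the recording period
    ["a", "e"],            # still running when the snapshot was taken
]
print(remove_incomplete_cases(log))  # [['a', 'b', 'c', 'd']]
```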
Conceptualizing Event Logs

- Each process may have an arbitrary number of activities, but each activity belongs to precisely one process.
- Each case belongs to one process.
- Each activity instance refers to one activity.
- Each activity instance belongs to precisely one case; there may be several activity instances for each activity/case combination.
- Each event refers to one case.
- Each event corresponds to one activity instance; for the same activity instance there may be multiple events.
- Each case attribute refers to one case; each attribute has a name and a value, e.g., "(birthdate, 29-01-1966)".
- Each event attribute refers to one event and is characterized by a name and a corresponding value, e.g., "(costs, $199.99)".

There are different subclasses of case attributes, e.g., the description of a case, the case identifier, the start time of the case, etc. Case attributes are invariant, i.e., they do not change while the corresponding events of the case occur. There are also different subclasses of event attributes, e.g., the time of occurrence of the event, the position in the trace, the transaction type, the resource causing the event, or any other type of attribute data (costs, risk, age, etc.). Per event attribute, it is indicated whether the attribute is mandatory.

The process attribute of an event is optional; if it is missing, we assume there is just one process. The activity instance attribute is also optional. Activity instance data is often missing in datasets, with only complete events recorded. Heuristics can help derive activity instances from start and complete events, but this becomes unreliable with overlapping instances or missing events. Moreover, it is always possible to consider each event as a singleton activity instance. Obviously, such solutions may introduce data quality problems: when start events cannot be related to complete events, it is impossible to measure service times and resource utilization accurately.

Events typically have timestamps that determine their order in a trace, though control-flow algorithms mainly use the event ordering. Multiple events can share timestamps, and timestamps may be coarse-grained (minutes, days), affecting the precision of results. Some algorithms work with partially ordered traces due to timing imprecision or causal relationships, but every event requires at least a timestamp or a position in the trace. Transaction type and resource attributes are optional, along with other possible data attributes (costs, risks, etc.).

Data Quality Issues

We consider:

- 3 main entities (case, activity instance, and event)
- 9 event attributes (case, process, activity, activity instance, timestamp, position, transaction type, resource, and any data)

This allows us to create a classification of data quality problems. The classification is related to the challenges mentioned in the context of XES.

First we consider the main entities (case, activity instance, and event) rather than the attributes. At the entity level there are three potential problems:

- (Missing in log) Events that occurred in reality but were not recorded. For example, a blood sample was taken but not logged in the system.
- (Missing in reality) Events that were recorded but never actually happened. For example, a logged appointment that was cancelled due to an emergency.
- (Concealed in log) Events that exist but are hidden in unstructured data. This includes duplicates, overly broad datasets, and merged data sources that make entity identification difficult.

At the event attribute level, there are three potential problems:

- (Missing attribute) The attribute has not been recorded for a particular event. Example: the timestamp of an event is missing.
- (Incorrect attribute) The recorded value of the event attribute is wrong. Example: an event is related to the wrong case.
- (Imprecise attribute) The value of the event attribute is too imprecise. Example: the value of a timestamp is too coarse-grained, or the address is incomplete.

Recurrence of data quality problems:

- Continuous: the problem always occurs, e.g., the resource attribute is never recorded.
- Intermittent: the problem occurs occasionally, e.g., events are sometimes missing from the log.
- Changing: the problem occurs in certain periods, e.g., before the new software system was installed, only dates were recorded, so events had imprecise timestamps.

In total, ((3 × 3) + (9 × 3)) × 3 = 108 data quality problems can be identified using the three tables.

Guidelines for Logging

Data quality issues often arise because data collection is not prioritized: event data is typically collected as a by-product of other processes. For example, data may exist only for financial tracking or due to arbitrary programming decisions.

The 12 logging guidelines are technology-agnostic and use a simple event model: events are occurrences with references and attributes; references point to objects via identifiers; attributes consist of names and values. To create an event log from such "raw events":

- the events relevant for the process at hand need to be selected,
- events need to be correlated to form process instances (cases),
- events need to be ordered using timestamp information (or have an explicit order), and
- event attributes need to be selected or computed based on the raw data (resource, cost, etc.).

Guideline 1: Reference and attribute names must have clear, consistent meanings across all stakeholders.
Guideline 2: Maintain a structured collection of reference and attribute names, with consensus required for additions.
Guideline 3: Use stable, context-independent references that don't vary by time, region, or language.
Guideline 4: Make attribute values as precise as possible, explicitly indicating any lack of precision.
Guideline 5: Mark uncertainties in event occurrences or attributes with appropriate qualifiers.
Guideline 6: Ensure events are at least partially ordered, either explicitly or through timestamps.
Guideline 7: Include transactional information (start, complete, etc.) and link related events to activity instances.
Guideline 8: Run regular automated checks for log consistency and correctness.
Guideline 9: Keep logging consistent across time and process variants to ensure comparability.
Guideline 10: Keep event data raw; do not aggregate before analysis.
Guideline 11: Use soft deletes instead of removing events, to maintain data provenance.
Guideline 12: Protect privacy while preserving important correlations, through techniques like hashing (see the sketch at the end of this document).

Flattening Reality into Event Logs

Flattening data into event logs is similar to OLAP data aggregation, where data can be viewed from different angles such as product, region, or time. Unlike OLAP's simple data cubes, however, process mining requires event correlation and ordering, making it more complex.

The recommended approach for process mining:

- Build a process-oriented data warehouse for raw event data instead of aggregated data.
- Create appropriate views based on the analysis needs, and convert the data to event logs (XES format); think of this as taking 2D slices of 3D data.
- Apply process mining techniques, filtering as needed, and iterate until the questions are answered.
- Use multiple views if needed for a complete understanding of the process.
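To make the idea of "taking a 2D slice" concrete, the sketch below flattens the same hypothetical raw events in two different ways: once with the order as the case notion and once with the order line. The field names and data are invented for illustration; real extractions would read from database tables.

```python
from collections import defaultdict

# Hypothetical raw events, each referencing two objects (an order and an
# order line). The reference chosen as the case notion determines how
# reality is flattened into an event log.
raw_events = [
    {"activity": "create order", "order": "o1", "line": None,   "time": 1},
    {"activity": "pick item",    "order": "o1", "line": "o1-1", "time": 2},
    {"activity": "pick item",    "order": "o1", "line": "o1-2", "time": 3},
    {"activity": "ship order",   "order": "o1", "line": None,   "time": 4},
]

def flatten(events, case_ref):
    """Build a view: group events by the chosen case notion, order by time."""
    cases = defaultdict(list)
    for e in events:
        if e[case_ref] is not None:  # events lacking this reference fall outside the view
            cases[e[case_ref]].append(e)
    return {c: [e["activity"] for e in sorted(es, key=lambda e: e["time"])]
            for c, es in cases.items()}

print(flatten(raw_events, "order"))  # one case o1 with four events
print(flatten(raw_events, "line"))   # two cases, one 'pick item' event each
```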
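As a closing illustration of Guideline 12, here is a minimal sketch of pseudonymizing the resource attribute with a keyed hash. The key handling is deliberately simplified (a real deployment would store the key securely, outside the log), and the attribute names are the hypothetical ones used in the earlier sketches.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-securely-stored-key"  # assumption: kept outside the log

def pseudonymize(value, key=SECRET_KEY):
    """Replace a sensitive value by a keyed hash.

    The same input always maps to the same pseudonym, so correlations
    (e.g., all events executed by the same person) are preserved,
    while the original identity is not stored in the log.
    """
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:12]

event = {"activity": "register request", "resource": "Pete"}
event["resource"] = pseudonymize(event["resource"])
print(event)  # {'activity': 'register request', 'resource': '<12-char pseudonym>'}
```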
