Noname manuscript No. (will be inserted by the editor)

The Effects of Data Quality on Machine Learning Performance

Lukas Budach¹ · Moritz Feuerpfeil¹ · Nina Ihde¹ · Andrea Nathansen¹ · Nele Noack¹ · Hendrik Patzlaff¹ · Felix Naumann² · Hazar Harmouch²

arXiv:2207.14529v4 [cs.DB] 9 Nov 2022

Received: date / Accepted: date

Abstract Modern artificial intelligence (AI) applications require large quantities of training and test data. This need creates critical challenges not only concerning the availability of such data, but also regarding its quality. For example, incomplete, erroneous, or inappropriate training data can lead to unreliable models that ultimately produce poor decisions. Trustworthy AI applications require high-quality training and test data along many dimensions, such as accuracy, completeness, and consistency. We explore empirically the relationship between six data quality dimensions and the performance of fifteen widely used machine learning (ML) algorithms covering the tasks of classification, regression, and clustering, with the goal of explaining their performance in terms of data quality. Our experiments distinguish three scenarios based on the AI pipeline steps that were fed with polluted data: polluted training data, test data, or both. We conclude the paper with an extensive discussion of our observations.

Keywords Data errors · Data-centric AI · Data Pollution · Explainability · Structured data

Declarations
Funding: Denkfabrik Digitale Arbeitsgemeinschaft im Bundesministerium für Arbeit und Soziales (BMAS)
Availability of data and material: Please visit our repeatability page.
Code availability: Open source code is available here.

Hasso Plattner Institute, University of Potsdam, Germany
1 [email protected]
2 [email protected]

1 Data Quality and AI

The rapid advances in the field of artificial intelligence (AI) represent a great opportunity for further enhancement of many industries and sectors, some of which are critical in nature, such as autonomous driving and medical diagnosis. The potential of AI has been amplified by the recent and anticipated enormous growth of data. However, this precious data raises tedious challenges, such as data quality assessment and, according to [27], data management, data democratization, and data provenance. Until recently, both academia and industry were mainly engaged in introducing new and improving existing machine learning (ML) models, rather than finding remedies for any data challenges that fall beyond trivial cleaning or preparation steps. Nevertheless, the performance of AI-enhanced systems in practice is proven to be bounded by the quality of the underlying training data [9]. Moreover, data have a long lifetime and their use is usually not limited to a specific task, but can continuously be fed into the development of new models to solve new tasks. These observations led to a shift in research focus from a model-centric to a data-centric approach for building AI systems [59]. In 2021, two workshops emerged to discuss the potential of data-centric AI and to initiate an interdisciplinary field that needs expertise from both the data management and ML communities¹.
¹ https://datacentricai.org/neurips21/ and https://hai.stanford.edu/events/data-centric-ai-virtual-workshop

In the field of data management, data quality is a well-studied topic that has been a major concern of organizations for decades, leading to the introduction of standards and quality frameworks [5, 72]. The recent advances in AI have brought data quality back into the spotlight in the context of building "data ecosystems" that cope with emerging data challenges posed by AI-based systems in enterprises [27]. Researchers pointed out such challenges, including data quality issues [29], data life cycle concerns [54], the connection to MLOps [56], and model management [64]. Furthermore, some studies presented a vision of data quality assessment tools [30], an automation of data quality verification [63], or methodologies to summarize the quality of a dataset as datasheets [25], nutritional labels [69], and data cards [70].

In this work, under the umbrella of data-centric AI, we revisit six selected data quality dimensions, namely:
– Consistent representation
– Completeness
– Feature accuracy
– Target accuracy
– Uniqueness
– Target class balance

Our ultimate aim is to observe and understand ML-model behavior in terms of data quality. We test a variety of commonly used ML algorithms for solving classification, clustering, or regression tasks. We analyze the performance of five algorithms per task (15 algorithms in total), covering the spectrum from simple models to complex deep learning models (see Section 4).

Before diving further into the details of our experimental setup, we highlight the broad scope of plausible variations of such an empirical study, showing the full perspective of any possible correlation between data quality and ML-models. As illustrated in Figure 1, there are three main aspects: (1) the model, which can vary from simple models (e.g., decision trees) to complex ones (e.g., based on pre-trained embeddings); (2) the pollution/error type, which can range from synthetically introduced problems to more difficult-to-detect real-world errors; (3) the data quality dimensions, which could be studied individually or by considering several dimensions at once, assuming they are not independent. Together, these dimensions span an enormous experimental space. To gain the necessary basic understanding, we limit our experiments to simple models, synthetic pollution, and individual data quality dimensions. We plan to expand our study to cover the wider variations in future work.

Fig. 1: The wide scope of empirically studying the effect of data quality on ML algorithm performance.

Regarding data quality, we account for two aspects. First, data plays a different role at different stages of the ML-pipeline: Some systems use pre-trained models, and thus the only available data is the "serving" or "testing" data; in many other cases, data scientists also need "training" data to build the models from scratch. Second, training and testing data can be generated or collected by the same process from the same data source, so that they have similar quality. In a more realistic case, training and testing data have different quality, especially when using pre-trained models and different sources or collection processes.
To that end, we consider in this study three scenarios: Training and testing data have the same quality (Scenario 3); the training data have high quality (in terms of the studied quality dimensions) and the testing data have lower quality (Scenario 2); and finally, the testing data have high quality and the data used to build the models are of lower quality (Scenario 1). To vary data quality in each of these scenarios, we apply the notion of data pollution or corruption to create degraded-quality versions of the dataset at hand. For each of the six quality dimensions, we designed a parameterized data polluter to introduce corresponding data errors. While we used real-world data, for several of the datasets we had to manually create a clean version as a baseline to initiate the pollution process. In these cases, we report the performance of ML-models for both the "original" and the "baseline" datasets.

Research in the machine learning community has studied the effects of label noise and missing values, and the data management community has studied the effects of data cleaning on classification, as we discuss in Section 2. Nevertheless, this paper is the first systematic study of the effects of data quality dimensions not only for classification, but also for clustering and regression tasks. We studied this effect also for low-quality training data and not only for test data. Our work on real-world datasets with numerous experiments is a first step not only towards linking ML-model performance to the underlying data quality, but also towards understanding and explaining their connection.

Contributions. We present a comprehensive experimental study to understand the relation between data quality and ML-model performance under the umbrella of data-centric AI, providing:
– A systematic empirical study that investigates the relation between six data quality dimensions and the performance of fifteen ML algorithms.
– A simulation of real-life scenarios concerning data in ML-pipelines. We perform a targeted analysis for cases where serving data, training data, or both are of low quality.
– Practical insights and learned lessons for data scientists. In addition, we raise several questions and point out possible directions for further research.
– Polluters, ML-pipelines, and all datasets as research artifacts that can be extended easily with further quality dimensions, models, or datasets.

Outline. Next, we discuss related work in Section 2. Then, we formally define the six data quality dimensions together with a systematic pollution method for each in Section 3. In Section 4, we briefly introduce the fifteen algorithms for the three AI tasks of classification, regression, and clustering. We describe our experimental setup in Section 5. The results of the empirical evaluation, the core contribution of this paper, are discussed in Section 6. Finally, we summarize our findings in Section 7.

2 Related Work

First, we report on the state of the art in data validation for ML. Then, we discuss related work that studies the influence of data quality on ML-models, namely by conducting an empirical evaluation, by cleaning the data, or by focusing on a specific error type like label noise.

Data validation. Several approaches have emerged to validate ML pipelines as well as the data fed to them, which includes training and serving data (data used in production).
These approaches use the concept of unit tests to help engineers diagnose model-quality issues originating from data errors. For instance, the validation system implemented by Breck et al. [9] and the similar system by Schelter et al. [62] focus on validating serving data given a classification pipeline as a black box. Generally, validation systems check, on the one hand, for traditional data quality dimensions, such as consistency and completeness, and on the other hand for ML-dependent dimensions, such as model robustness and privacy [6]. To help data scientists with the validation task, Schelter et al. [61] introduced the experimental library JENGA. It enables testing ML-models' robustness under data errors in serving data. The authors use the concept of polluters or data corruptions as in our work. However, they do not provide an extensive experimental study, and their focus is on describing the framework.

Task-dependent data quality. Foroni et al. [23] argue that data quality assessment should not be performed in isolation from the task at hand. Our results confirm this for ML-models, as the same "low" quality data has a different effect when used to train different models. Their paper proposes a theoretical framework with a setup similar to our experiments, which evaluates the performance of a task given datasets polluted by various kinds of generated systematic noise. The proposed framework then computes the variation effect factor or the sensitivity factor from the observed results of the task. Unlike our work, however, the authors focused only on polluted training and testing data (Scenario 3 in our paper), and the experiments were conducted on a single dataset to evaluate only three models (one per ML-task), namely Random Forest, k-means, and Linear Least Squares. The authors made observations that agree with our findings, especially the fact that missing values (completeness) are a problem for all ML-tasks.

Data cleaning. Li et al. [42] investigated the impact of cleaning training data, i.e., improving its quality, on the performance of classification algorithms. They obtained a clean version of the training data instead of systematically polluting it as we do in this work. This effort yielded the CleanML benchmark. Furthermore, in their work, the robustness of the data cleaning method had an additional influence on how the classification performance changed with training data of higher quality. They focused on five error types: missing values, outliers, duplicates, inconsistencies, and mislabels. These error types are among the most popular, and thus some of them correspond to the data quality dimensions in our study. The authors observed that cleaning inconsistencies and duplicates is more likely to have low impact, while removing missing values and fixing mislabeled data is more likely to improve the classifier prediction. These observations align with our findings.

The CleanML benchmark datasets have been recently used, among others, by Neutatz et al. [49]. The authors evaluate the ability of AutoML systems, such as AutoSklearn [22], to produce a binary classification pipeline that can overcome the effect of the following types of errors in training data: duplicates, inconsistencies, outliers, missing values, and mislabels. The authors concluded that AutoML can handle duplicates, inconsistencies, and outliers, but not missing values.
The paper also points out that most current benchmark datasets contain only a few real-world errors with insignificant impact on ML-performance even without any cleaning. For this reason, and as mentioned in Section 1, our work uses synthetic errors to better characterize the correlation between ML-model performance and data quality. Clearly, our study aligns with data cleaning efforts, but with a different goal. Data cleaning systems mainly focus on cleaning the data, whereas our observations give data experts an understanding of the effects of data quality issues. They can then use this knowledge to determine the robustness of their insights and decide which specific problems should be tended to and when. For an overview of the progress in cleaning for ML, we refer to [50].

Label noise. The problem of label noise or mislabeling is one of the main concerns of the ML-community and has attracted much interest. This problem is in essence a data quality problem. Frénay and Verleysen [24] surveyed the literature related to classification using training data that suffers from label noise, which is equivalent to the target accuracy dimension in our work. They distinguish several sources of noise, discuss the potential ramifications, and categorize the methods into the classes label noise-robust, label noise cleansing, and label noise-tolerant. They conclude that label noise has diverse ramifications, including degraded classification accuracy, high-complexity learning models, and difficulty in specifying relevant features. Northcutt et al. [52] focus on label noise in test data and its effect on ML-benchmark results and thus on model selection. They estimated an average of 3.3% label noise in their test datasets and found that benchmark results are sensitive even to a small increase in the percentage of mislabeled data points, i.e., a smaller-capacity model could have been sufficient if the test data were correctly labeled, but the label noise led to favoring a more complex model. We report similar behavior in our discussion of the effect of the target accuracy dimension on classification performance and expand the discussion to clustering and regression tasks.

In summary, we present the first systematic empirical study on how both training and testing (serving) data quality affects not only classification but all three ML tasks. We also provide a clear definition for each of the data quality dimensions and a respective method to systematically pollute the data.

3 Data quality dimensions and data pollution

We present the definition of the six selected data quality dimensions, along with our methods to systematically pollute a dataset along those dimensions. In this work, we use the ML terms feature and sample to refer to column and row, respectively. During the pollution, we assume that the features' data types and the placeholders that represent missing values in each feature are given.

3.1 Consistent Representation

A dataset is consistent in its representation if no feature has two or more unique values that are semantically equivalent, i.e., each real-world entity or concept is referred to by only one representation. For example, in a feature "city", New York shall not also be represented as NYC or NY.

Definition 1 The degree of inconsistency of a feature c, denoted as InCons(c), is the ratio of the minimum number of replacement operations required to transform it into a consistent state to the number of samples in the dataset.
This definition applies only to categorical features, i.e., strings or integers that encode categorical values, whereas numerical features and dates are considered to be consistent and their InCons(c) = 0. The degree of consistency of a feature cannot be derived by subtracting λcr from 1, because it also depends on the number of representations of an original value (see Figure 2).

Fig. 2: Relation of pollution and quality for consistent representation with different numbers of representations per original value.

Definition 2 We define the degree of consistency of a dataset d with f features as follows.

  Consistency(d) = 1 - (1/f) * Σ_{i=1..f} InCons(c_i)    (1)

Pollution. We have two inputs: First, the percentage of samples to be polluted, λcr, defined by a value between 0 and 1, and second, for each unique value v of a pollutable feature, the number of representations kv for that value (including v itself). For each categorical feature, we randomly choose the samples to be polluted. Then, we generate kv - 1 new representations for each unique value v of its values. The new representations of a string value are produced as new, non-existing values by appending a trailing ascending number to the end of the value, whereas for integers, new integers are added after the maximum existing one. The chosen samples' entries at this feature are replaced by a randomly picked fresh representation of the original value.
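For illustration, a minimal sketch of such a polluter for a single categorical string feature could look as follows (written against pandas; the function name, its parameters, and the exact way fresh representations are generated are our own illustration, not necessarily the released implementation):

```python
import numpy as np
import pandas as pd

def pollute_consistent_representation(df, feature, fraction, k, seed=0):
    """Replace `fraction` of the samples of a categorical string feature with
    one of k-1 freshly generated, semantically equivalent representations
    (here: the original value plus a trailing counter), chosen at random."""
    rng = np.random.default_rng(seed)
    polluted = df.copy()
    # k representations per unique value: the original plus k-1 new ones
    reps = {v: [v] + [f"{v}_{i}" for i in range(1, k)]
            for v in polluted[feature].unique()}
    n_polluted = int(fraction * len(polluted))
    rows = rng.choice(polluted.index, size=n_polluted, replace=False)
    for row in rows:
        original = polluted.at[row, feature]
        # pick one of the fresh representations of the original value
        polluted.at[row, feature] = rng.choice(reps[original][1:])
    return polluted
```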
3.2 Completeness

The problem of missing values exists in many real-world datasets. Some of these values are actually missing, e.g., missing readings due to a failure in a sensor, while others are represented by a placeholder, such as "unknown" or "NaN". For example, when medical sensors monitor temperature, blood pressure, and other health information of a person and one sensor fails for a period of time, there are no values present for this sensor and time in the recorded dataset. In the case of a survey form containing optional input fields, personal attributes like the employment status could be given as "unknown". The respective value represented by a placeholder could just as well be absent: when processing data automatically, for example in a data frame, the value itself usually exists, but it is not informative or useful for analysis and is therefore equivalent to the value being missing, decreasing the completeness of the dataset.

Definition 3 The completeness of a feature c is the ratio of the number of non-missing values to n, the total number of samples in this dataset. The completeness of a dataset d is defined as follows.

  Completeness(d) = 1 - (1/f) * Σ_{i=1..f} missing(c_i)/n    (2)

A dataset with a completeness of 1 has no missing values in it. In the case of a completeness of 0, the whole dataset consists only of missing values, except for the target feature. For ML, samples with a missing value for the target feature are usually removed from the dataset, as they cannot be used for training. Thus, we exclude the target feature while computing completeness. For example, consider a dataset that has 4 features and 4 samples. We exclude the target feature and only consider the remaining 3 features. This means that there are 12 values in total for computing the completeness. If 2 values are missing in the dataset, regardless of their feature and sample, the completeness of the dataset is 5/6.

Pollution. We inject missing values "completely at random" [44] according to a specified pollution percentage λc for each feature. If there are already missing values in the feature, we account for those values and inject only the remaining number of missing values necessary to reach λc. A feature-specific placeholder is used to represent all missing values injected into this feature. The placeholder value does not carry information except that the value is missing. A typical placeholder is Not a Number (NaN). Many implementations of ML algorithms cannot handle NaN values. For this reason, we choose placeholder representations that can be used in computations, but lie outside the usual domain of a feature and are still distinguishable from the actual non-missing values of the feature, for example, -1 for an "age" feature, or the string "empty" for a "genre" categorical feature in a movie table. We manually selected the used placeholders, as this task requires some domain knowledge to determine suitable values.

A placeholder value representation can count as imputation [61]. Most imputation methods, such as taking the mean, reconstruct some amount of information based on other observed values. As the placeholder does not contain information related to the data and no reconstruction is involved, we still consider a placeholder representation as pollution. Further, comparing different imputation methods would drift apart from the completeness dimension because it would interfere with other dimensions like accuracy. We do not make any assumptions about the underlying distribution and dependencies of missing data in our datasets. The probability of a value being missing is not influenced by any other observed or unobserved value in the data. In other terms, we only consider data that is Missing Completely at Random (MCAR) [44] in our experiments.
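As an illustration of both Equation 2 and the pollution step, the following sketch injects MCAR values into one feature up to a target fraction and computes the dataset completeness; the function and argument names are ours, and the per-feature placeholders are assumed to be supplied as described above.

```python
import numpy as np
import pandas as pd

def pollute_completeness(df, feature, target_fraction, placeholder, seed=0):
    """Inject values 'missing completely at random' into `feature` until the
    fraction of missing (placeholder) entries reaches `target_fraction`.
    Existing placeholder entries are counted towards that fraction."""
    rng = np.random.default_rng(seed)
    polluted = df.copy()
    already_missing = polluted[feature] == placeholder
    n_target = int(target_fraction * len(polluted))
    n_to_inject = max(0, n_target - int(already_missing.sum()))
    candidates = polluted.index[~already_missing]
    rows = rng.choice(candidates, size=n_to_inject, replace=False)
    polluted.loc[rows, feature] = placeholder
    return polluted

def completeness(df, placeholders, target_feature):
    """Equation (2): average fraction of non-missing values over all
    non-target features; `placeholders` maps feature name -> placeholder."""
    features = [c for c in df.columns if c != target_feature]
    missing_fractions = [(df[c] == placeholders[c]).mean() for c in features]
    return 1 - float(np.mean(missing_fractions))
```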
3.3 Feature Accuracy

ML-models learn correlations in datasets; thus it is desirable to have no errors in their values. For synthetic datasets, one can ensure that this is the case. However, real-world data may contain erroneous values from various sources, e.g., erroneous user input. Feature accuracy reflects to which extent the feature values in a given dataset equal their respective ground truth values. The more cells deviate from their actual value and the more pronounced this deviation is, the lower the feature accuracy.

Definition 4 The feature (column) accuracy measures the deviation of its values from their respective ground truth values. For a categorical feature c, we define the feature accuracy as follows.

  cFAcc(c) = 1 - mismatches(c)/n    (3)

where mismatches(c) denotes the number of values in the feature c that are different from the ground truth and n is the number of samples in the dataset. The ratio of erroneous values is subtracted from 1, meaning that 1 is the best possible quality and 0 is the worst possible quality. For numerical features, we define feature accuracy as follows.

  nFAcc(c) = 1 - avg_dist(c) / mean_gt(c)    (4)

  avg_dist(c_i) = (1/n) * Σ_{j=0..n-1} |gt_{i,j} - v_{i,j}|    (5)

where avg_dist(c) is the average of the absolute distances between the ground truth and the values in c (see Equation 5) and mean_gt(c) is the mean of the ground truth values of c. In Equation 5, j is the index of a specific sample. Hence, gt_{i,j} denotes the ground truth and v_{i,j} the value of the sample with index j in feature i. As for the categorical features, a quality of 1 indicates a clean feature. In contrast to the categorical feature quality measure, the numerical measure can fall below 0 and has no defined lower bound. However, we found that all reasonably polluted datasets also yield a numerical quality > 0.

The feature accuracy of an entire dataset consists of two metrics: the average feature accuracy of all categorical features, cFAccuracy, and the average feature accuracy of all numerical features, nFAccuracy. This is caused by the fact that with numeric features all samples are polluted, while with categorical features only a certain percentage of the samples are polluted. Therefore, the feature accuracy of both feature types is calculated differently, which leads to both feature types having different accuracy ranges. The feature accuracy quality measure of all categorical features, cFAccuracy, is defined as the average of the feature accuracy of all categorical features, as can be seen in Equation 6. Similarly, Equation 7 shows that the feature accuracy quality measure of all numeric features, nFAccuracy, is defined as the average of all per-feature accuracies. The numbers of categorical and numeric features are given by n_cat and n_num, respectively.

  cFAccuracy(d) = (1/n_cat) * Σ_{i=0..n_cat-1} cFAcc(c_i)    (6)

  nFAccuracy(d) = (1/n_num) * Σ_{i=0..n_num-1} nFAcc(c_i)    (7)

Pollution. The polluter takes three arguments. The first argument, λfa, is a dictionary that maps feature names to float numbers in the interval [0.0, 1.0]. It describes the level of pollution that should be applied to each given feature. λfa can also be defined as a single float number, meaning that the same level of pollution is applied to all features. The second and third polluter arguments each contain a complete list of the available categorical and numeric feature names.

The pollution is executed differently depending on the feature type. For categorical features, the level of pollution λfa for a specific feature c determines the percentage of samples to be polluted. The samples to pollute are chosen randomly. However, the seed for this selection is fixed to (1) ensure reproducibility of the results and (2) allow a level of pollution λfa(c) = 0.2 to be a direct extension of λfa(c) = 0.1. The randomly selected samples are polluted by exchanging the current category with a random, but different, category from feature c's domain. Consequently, a level of pollution λfa(c) = 1.0 for a categorical feature c means that the categories of all samples are updated, and λfa(c) = 0.0 indicates that all categories of c stay the same. For numeric features, we add normally distributed noise to all samples of the feature c: noise(c) = X · mean_gt(c), where X are random samples drawn from the normal Gaussian distribution N(µ, σ²) with µ = 0 and σ² = λfa. The level of pollution λfa determines the standard deviation of the normal distribution and thus denotes how widely it is spread. Again, it is ensured that the same seed is used per feature in consecutive pollution runs on the same dataset to keep the behavior consistent and comparable.
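The two pollution mechanisms can be sketched as follows (our own illustration, not the released code); the noise scale follows the σ² = λfa reading of the definition above.

```python
import numpy as np
import pandas as pd

def pollute_categorical_feature(df, feature, level, seed=0):
    """Swap the category of a `level` fraction of randomly chosen samples for a
    different category from the feature's observed domain."""
    rng = np.random.default_rng(seed)
    polluted = df.copy()
    domain = polluted[feature].unique()
    rows = rng.choice(polluted.index, size=int(level * len(polluted)), replace=False)
    for row in rows:
        current = polluted.at[row, feature]
        others = [v for v in domain if v != current]
        if others:
            polluted.at[row, feature] = rng.choice(others)
    return polluted

def pollute_numeric_feature(df, feature, level, seed=0):
    """Add zero-mean Gaussian noise with variance `level` (std = sqrt(level)),
    scaled by the mean of the clean feature, to every sample of the feature."""
    rng = np.random.default_rng(seed)
    polluted = df.copy()
    scale = polluted[feature].mean()
    noise = rng.normal(loc=0.0, scale=np.sqrt(level), size=len(polluted)) * scale
    polluted[feature] = polluted[feature] + noise
    return polluted
```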
3.4 Target Accuracy

For each sample in a dataset, the target feature contains either a class/label in classification tasks or a numeric value in regression tasks. There might be some incorrect labels due to human or machine error, e.g., an angry dog labeled as a "wolf".

Definition 5 The target accuracy of a dataset is the deviation of its target feature values from their respective ground truth values. For a categorical target, the target accuracy is the ratio of correct values in the target feature.

  cTAccuracy(d) = 1 - mismatches(target)/n    (8)

For a target with numerical values, we define the target accuracy as follows.

  nTAccuracy(d) = 1 - avg_dist(target) / mean_gt(target)    (9)

where avg_dist is the average of the absolute distances (Manhattan distance) between the ground truth and the target feature values and mean_gt is the mean of the ground truth values.

The definition of the target accuracy of a dataset is equivalent to the definition of the feature accuracy of its target feature (see previous section). Nevertheless, the target feature is the most important feature because of its influence on prediction performance. Thus, it is beneficial to study its accuracy separately. By scaling with the mean, we obtain a measurement that is less dependent on the actual target domain and, thus, more comparable between different datasets. This, in theory, allows for a negative quality metric where on average every target value is more than one mean away from its ground truth. We disregard those cases and define 0 as the lowest possible quality metric in our experiments.

Pollution. Naturally, we used the same pollution method as for feature accuracy, based on the target type. In general, this polluter takes a single argument for the degree of pollution, λta. For a categorical target, this value is interpreted as the fraction of data points that should be polluted. The pollution itself replaces the current label with a randomly chosen label that differs from the current one. For numerical targets, λta is interpreted as the variance of normally distributed noise that is scaled by the mean of the original target value distribution and added onto the target values of the specified subset (e.g., train or test data).
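For the categorical case, the label flip can be sketched in a few lines; the function below is our own illustration and operates on the target column of whichever subset (train or test) is being polluted.

```python
import numpy as np

def flip_labels(y, fraction, seed=0):
    """Replace a `fraction` of labels with a different, randomly chosen label
    from the observed label set (Section 3.4, categorical targets)."""
    rng = np.random.default_rng(seed)
    y_polluted = np.asarray(y).copy()
    labels = np.unique(y_polluted)
    rows = rng.choice(len(y_polluted), size=int(fraction * len(y_polluted)),
                      replace=False)
    for i in rows:
        candidates = labels[labels != y_polluted[i]]
        y_polluted[i] = rng.choice(candidates)
    return y_polluted
```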
3.5 Uniqueness

Redundant data does not provide additional information to the ML-model for the training process. Thus, de-duplication is a common step in ML pipelines to avoid overfitting. Furthermore, there is a plethora of existing research on how to detect duplicates and how to remove them [13, 35]. In this report, we evaluate how the number of duplicates present in a dataset influences the performance of ML-models. We investigate whether pre-processing datasets to remove duplicates is an essential step toward improved ML performance. According to Chen et al., there are different definitions of when to consider two samples as duplicates, such as having identical primary keys or equality in every feature of the two samples [12]. In practice, there are also often non-exact duplicates, where some features differ slightly, for example timestamps. Introducing non-exact duplicates in the polluter would not only interfere with the redundancy dimension, but also with consistent representation (see Section 3.1). Exact duplicates are easy to detect. Yet, this step is still expensive, especially for large datasets. In this regard, ideally, there are as few duplicates as possible in a dataset. Therefore, we only consider fully equal samples as duplicates.

Definition 6 The uniqueness of a dataset d is the fraction of unique samples within the dataset. To make it possible to obtain a quality metric of 0, we subtract 1 from the denominator and numerator. We normalize the value as follows.

  Uniqueness(d) = (unique_samples(d) - 1) / (n - 1)    (10)

A dataset with a quality metric of 1 does not contain any duplicates, whereas a quality metric of 0 refers to a dataset containing only one unique record – even if the dataset contains many records overall.

Pollution. Input datasets can contain duplicates themselves. To pollute a dataset along the uniqueness dimension, we first remove all existing exact duplicates. This allows us to pollute the dataset in an incremental fashion. Then, we add exact duplicates of randomly selected samples to the dataset. Note that we increase the dataset size to avoid data loss that could affect ML-model performance. The number of added duplicates is determined by the duplication factor ρ: For each class cl with n_cl samples, we add dup_cl = (ρ - 1) · n_cl duplicates to avoid changing the class balance (next section). Thus, the size of the polluted dataset is n · ρ and its uniqueness is 1/ρ. The duplication factor ρ ranges from 1, meaning no pollution is applied, to potentially infinity.

It is important to decide how often each sample appears in the polluted dataset. One trivial approach would be to duplicate each sample by the same factor. One issue of this approach is its limited applicability to real-world scenarios. In reality, the number of duplicates per sample depends on the data domain. Manually inserted form data, for example, probably contains a normally distributed number of duplicates due to human errors, with a mean of 2. Working with sensor data, due to misconfiguration of sensors, the distribution of duplicate counts could be uniform. Analyzing web index data based on web traffic, the duplicates could be distributed according to the Zipf distribution. Thus, the polluter needs to be flexible regarding the distribution of duplicate counts per sample. For each randomly selected sample from a class cl, we add x duplicates of this sample and then continue sampling and adding duplicates to reach dup_cl. We draw x from a pre-defined distribution: We try uniform, normal, and Zipf distributions, in addition to adding a single duplicate of each selected sample. The duplicate counts generated by the specified distribution function define only the number of duplicates to add for the respective sample each time the sample is randomly selected for duplication. For this reason, the actual resulting distribution of duplicate cluster sizes after pollution likely differs from the specified distribution. This can happen especially with large duplication factors, but is inevitable, as otherwise classes with a low number of sampled duplicate counts could limit the number of generated elements.
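The following sketch captures the core of this procedure for one duplication factor ρ and a caller-supplied duplicate-count distribution; it omits some details of the actual implementation (e.g., warnings and exact rounding), and the names are ours.

```python
import numpy as np
import pandas as pd

def pollute_uniqueness(df, target_feature, rho, draw_count, seed=0):
    """Remove existing exact duplicates, then add exact duplicates of randomly
    selected samples per class until class cl has (rho - 1) * n_cl duplicates,
    preserving the class balance; Uniqueness(d) of the result is 1 / rho.
    `draw_count()` returns how many copies to add for one selected sample."""
    rng = np.random.default_rng(seed)
    unique = df.drop_duplicates().reset_index(drop=True)
    chunks = [unique]
    for _, group in unique.groupby(target_feature):
        to_add = int(round((rho - 1) * len(group)))
        added = 0
        while added < to_add:
            sample = group.iloc[[rng.integers(len(group))]]
            copies = min(max(1, int(draw_count())), to_add - added)
            chunks.append(pd.concat([sample] * copies, ignore_index=True))
            added += copies
    return pd.concat(chunks, ignore_index=True)

# Example: normally distributed duplicate counts (mean 2), duplication factor 2
# polluted = pollute_uniqueness(df, "label", rho=2,
#                               draw_count=lambda: round(np.random.normal(2, 1)))
```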
3.6 Target Class Balance

Many ML approaches assume a relatively equal number of samples per target class, i.e., a balanced dataset, to achieve satisfactory performance. Clustering or classification algorithms on top of a highly imbalanced dataset may fail at identifying structures or even miss smaller classes completely. For example, the k-Means algorithm suffers from the "uniform effect", i.e., it recognizes clusters of approximately uniform sizes even if that is not the case in the input data [39].

Definition 7 Given a dataset d with m target classes cl_1, ..., cl_m of n_cl_1, ..., n_cl_m samples per class, respectively, and ∀ i, j : 1 ≤ i < j ≤ m ⟺ n_cl_i ≤ n_cl_j, the target class imbalance is defined as the sum of the pairwise differences between the number of samples per class:

  ImBalance(d) = (1/2) * Σ_{i,j ∈ 1..m} |n_cl_i - n_cl_j|    (11)

As the worst imbalance case, we assume a maximally imbalanced dataset w that has ⌈m/2⌉ classes with nc_max samples while the remaining classes have 0 samples, where nc_max is the maximum number of samples that a class can have (see Figure 3). The target class imbalance of such a dataset is ε = ⌈m/2⌉ · ⌊m/2⌋ · nc_max. This is clearly a hypothetical case, as no class exists if it has 0 samples; otherwise we could add infinitely many classes with 0 samples to each dataset. However, this constructed worst case allows us to define the target class balance quality measure.

Fig. 3: Example of worst-case target class balance. The plot starts below zero to clearly show classes with no samples.

Definition 8 The target class balance of a dataset d is the deviation from its imbalance score, normalized by the imbalance score of the worst case.

  Balance(d) = 1 - ImBalance(d) / ε    (12)

If all classes in the dataset have the same number of samples, then Balance(d) is maximal and equals 1. Contrarily, Balance(d) reaches its minimum (lim_{d→w} Balance(d) = 0) if the balance of the classes in the dataset approaches the hypothetical worst case.
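To make Definitions 7 and 8 concrete, a direct computation of ImBalance(d) and Balance(d) from per-class sample counts might look as follows (a sketch; the function names are ours, and taking nc_max as the largest observed class is an assumption).

```python
import numpy as np
from itertools import combinations

def imbalance(class_counts):
    """Equation (11): sum of pairwise absolute differences between per-class
    sample counts. Summing over unordered pairs i < j already accounts for the
    factor 1/2 in the ordered-pair formulation."""
    return sum(abs(a - b) for a, b in combinations(class_counts, 2))

def balance(class_counts, nc_max=None):
    """Equation (12): 1 - ImBalance(d) / eps, where eps is the imbalance of the
    hypothetical worst case (ceil(m/2) classes of nc_max samples, rest empty)."""
    m = len(class_counts)
    if nc_max is None:
        nc_max = max(class_counts)  # assumption: largest observed class size
    eps = int(np.ceil(m / 2)) * (m // 2) * nc_max
    return 1 - imbalance(class_counts) / eps

# A perfectly balanced dataset yields Balance(d) = 1
print(balance([100, 100, 100]))  # -> 1.0
```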
Pollution. We have two inputs for pollution: the degree of imbalance λcb and the number of samples in the polluted version ñ. We can choose ñ arbitrarily as a multiple of m or calculate it from the data as the number of samples from the original dataset needed to produce the maximum pollution level. In both cases, its validity at the maximum imbalance level and for the balanced dataset is checked once more. In case of an invalid sample count, the next valid, smaller possible sample count is calculated and used while a warning is presented to the user. For calculating ñ, we also consider that each class must have a minimum number of samples in the original dataset to be able to produce this imbalance. If this requirement cannot be satisfied, a new number of total samples ñ in the imbalanced dataset is iteratively determined until it is possible to create the maximal imbalance.

We use λcb, which determines the severity of the class imbalance, to calculate the number of samples per class in the polluted version. It is a number in the interval [0, 1] and it is not directly linked to Balance(d): λcb = 0 creates a fully balanced dataset, which is used as a baseline for comparing all the imbalanced datasets to, as the original dataset may be imbalanced itself; λcb = 1 creates the most heavily imbalanced dataset. Note that the dataset produced by λcb = 1 is not the hypothetical worst case mentioned in the definition section above. Instead, it produces a dataset where the smallest class has 0% of the samples of the largest class and where our restriction of constant changes in sample counts between the classes still stands. While mathematically this polluted dataset with one class completely removed would be the most imbalanced, this does not suit the purpose of examining the effects of class imbalance on the ML process. Therefore, we restrict the most heavily imbalanced dataset to have a class c_m containing the maximal number of s_max samples and a class c_1 containing the minimal number of s_min = ⌈0.01 · s_max⌉ samples. Due to this calculation, the class balance polluter works best if the class c_m has at least s_max = 100 samples. This state of imbalance is reached at a degree of imbalance λcb < 1. However, anything above this degree is ignored by the polluter. The degree of imbalance λcb for any polluted dataset version d̃ satisfies:

  λcb = ImBalance(d̃) / ((m+1)/3 · ñ)

To pollute, we order the target classes by their sizes in descending order and, for classes with the same size, by class ID in ascending order. Using the defined order, we create a class imbalance with an equal difference ∆ between each consecutive pair of classes. The used samples of each class are randomly selected, i.e., in the polluted dataset, the following holds: ∀ 1 < j ≤ m : (ñ_c_{j-1} ≤ ñ_c_j) ∧ (ñ_c_j - ñ_c_{j-1}) = ∆. An example of such a distribution can be seen in Figure 4 in the depicted Balance 2. This choice has been made to both simplify the pollution process, by removing all the other possible methods of creating an imbalance, and make it more easily reproducible, as no random component is needed to determine the per-class sample counts in the polluted dataset. Figure 4 shows two further per-class sample count distributions, Balance 1 and Balance 3, which exemplify the need to limit the pollution method, as they differ in per-class sample counts but still produce the same dataset quality.

Fig. 4: Examples of possible per-class sample distributions and their corresponding quality.

To create a class imbalance, we calculate the class size of a balanced dataset, ñ/m, with ñ samples and m classes. Then, we iteratively add/remove samples from the classes based on their order: We remove samples from all classes at indices ⌊m/2⌋ - 1 and below, and add samples to all classes at indices above that, unless m is odd, in which case the size of the class at index ⌊m/2⌋ stays constant.
4 Machine Learning tasks

In this section, we introduce the 15 algorithms that we employed for the three tasks of classification, regression, and clustering.

Classification. We picked a variety of classification algorithms that fall into different categories. We included two linear classification models: logistic regression (LogR) [46] and support vector machine (SVM) [16]; a tree-based model: decision tree (DT) [11]; a k-nearest neighbors (KNN) classifier [3]; and finally a neural network-based multi-layer perceptron (MLP) [68].

Regression. Regression is a supervised learning method that uses labelled data to learn the dependency between features and a continuous target variable. There is a plethora of regression algorithms. To evaluate how error-prone different categories of regression algorithms are and how more complex algorithms within one category perform, we compare five of the most widely used approaches for regression from three categories of regression algorithms. We use two linear-regression-based algorithms: linear regression (LR) [47] and ridge regression (RR) [32]; two tree-based algorithms: decision tree regression (DT) [11] and random forest regression (RF) [10]; and finally a deep-learning-based algorithm: a simple multi-layer perceptron (MLP) [68].

The most basic regression approach is linear regression. It learns a linear relationship between the features, plus an intercept term [47]. The second algorithm in the field of linear regression is ridge regression, which is an improved linear regression approach. It adds L2 regularization to linear regression to prevent correlated features from influencing the error drastically and to prevent extremely large values in the weights [32]. The first tree-based algorithm we use is decision tree regression. It introduces axis-parallel split criteria for various features, and thus creates a hierarchical split depending on the met conditions [11]. According to [40], decision tree algorithms are unstable. This means minor changes in the data can lead to larger changes in the tree layout or changes in predictions. An improved version of decision tree regression is random forest regression. It uses ensemble learning with multiple decision trees to predict the target variable. As a result, it is less error-prone and more stable [10]. Deep learning is a machine learning approach that is more and more commonly used in all fields of machine learning. Thus, it can also be used for regression. We use a simple multi-layer perceptron in this category of algorithms [68].

Clustering. Clustering algorithms aim to find groups of samples with similar features in a given dataset. Since clustering is an unsupervised approach, our clustering models are not trained but directly receive all data as test data for the prediction of the target label. Depending on dataset properties, such as dimensionality, distribution, or data types, current research recommends different clustering algorithms. These can be grouped into different categories or algorithm families. We decided to use one algorithm from each of the five most commonly used categories of clustering algorithms [58]: the Gaussian Mixture Clustering algorithm [57] from the distribution-based family, the k-Means [65] and k-Prototypes [36] algorithms from the centroid-based family, the Agglomerative Clustering algorithm [17] from the hierarchical family, the Ordering Points to Identify Cluster Structure (OPTICS) algorithm [4] from the density-based family, and a Deep Autoencoder neural network from the deep-learning-based family [67].

The Gaussian Mixture Clustering algorithm is a distribution-based approach and assumes the input data to be drawn from a mixture of Gaussian distributions [57]. It aims to fit a Gaussian distribution to each cluster, whereby a sample can be assigned to several clusters, each with a certain percentage. Thus, the mean and the standard deviation describe the shape of the clusters, which allows for any kind of elliptical shape. The algorithm starts with random distribution parameters per cluster. In each iteration, the probability of each sample belonging to a particular cluster is calculated. The probability is higher the closer the sample is to the distribution's center. Afterwards, the parameters of the Gaussian distributions are updated according to the probabilities. A new iteration is performed until the distributions no longer change significantly.

For the centroid-based category, we either apply the k-Means [65] algorithm for datasets with predominantly numerical features or the k-Prototypes [36] algorithm for datasets with a large number of categorical features. The former is initialized with a random center point for each of the given target classes. Thereafter, each sample is assigned to its closest center point and all center point vectors are updated with the mean of all sample vectors currently assigned to the respective center point. This procedure is repeated until the center points do not change their positions much. The k-Prototypes algorithm represents a combination of the k-Means and k-Modes algorithms. The latter can only deal with categorical features, utilizing the number of total mismatches (dissimilarities) between the samples instead of the actual distance like k-Means does. Thus, for categorical features, k-Prototypes assigns each sample to the cluster with which it has the fewest dissimilarities, and for numerical features, it applies k-Means.

The hierarchical family of clustering algorithms is represented by the Agglomerative Clustering algorithm [17]. It expresses the hierarchy of the clusters as a tree, i.e., the leaves are clusters with a single sample each and the root is one big cluster containing all samples. Starting with one cluster per sample, the algorithm merges two clusters into one per iteration based on a given distance measure. Therefore, in each step, each cluster is combined with the cluster closest to it until the root of the tree is reached. Depending on how many clusters should be built, this procedure can be stopped earlier.

We selected the Ordering Points to Identify Cluster Structure (OPTICS) algorithm for the density-based clustering category [4]. On a high level, the OPTICS algorithm works similarly to the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm, identifying areas of high density within the data and extending clusters outwards from these cores [20]. In contrast to the DBSCAN algorithm, OPTICS uses these identified core regions to calculate a reachability distance for each data point to the closest core. Using this reachability distance and a calculated point ordering, a reachability plot is generated. From this, clusters can be identified by searching for extreme changes in the reachability distance between neighboring points, at which point clusters are then separated and, if applicable, outliers identified. This approach is more versatile than the approach DBSCAN takes, justifying our selection of this clustering algorithm for our use case, where we do not want to run hyperparameter optimization. The sklearn implementation of the OPTICS algorithm we use in our work differs slightly from the approach originally described by [4] in the way in which cores are initially identified [18].

To represent the deep-learning-based clustering methods, we chose to implement a Deep Autoencoder neural network to aid with the clustering process. The Autoencoder is used as a data preprocessing step in this approach, transforming the input data, which may lie in a high-dimensional space or generally a space not conducive to clustering, into the lower-dimensional code space, which may be more suitable for the clustering task. The Autoencoder itself is trained to learn how to map the input data into the code space and then decode it again while trying to minimize the reconstruction error, i.e., maximize the similarity of the input data and the reconstructed data [67]. After the Autoencoder is trained, the encoder portion of the network is used to encode the dataset to be clustered, and a classical clustering algorithm is applied to the encoded information [45]. After experimentation, we selected the Gaussian Mixture algorithm over the k-Means clustering algorithm for this task. Our Autoencoder is a very basic neural network and is built dynamically, depending on the number of features in the input dataset. Each of its encoder layers is a linear layer halving the number of features from its input to its output. The ReLU function is applied after each linear layer on its outputs. The encoder layers halve the number of features until only two dimensions remain, at which point we have reached the code space. The decoder has the same but inverted architecture as the encoder portion of the network.
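Based on this description, a dynamically built autoencoder of that shape could look roughly like the following PyTorch sketch; this is our reconstruction of the described architecture, not the authors' code, and details such as whether ReLU also follows the final decoder layer are assumptions.

```python
from torch import nn

class Autoencoder(nn.Module):
    """Encoder: linear layers that roughly halve the feature count, each
    followed by ReLU, down to a 2-dimensional code space; the decoder mirrors
    the encoder."""
    def __init__(self, n_features: int, code_dim: int = 2):
        super().__init__()
        dims = [n_features]
        while dims[-1] // 2 > code_dim:
            dims.append(dims[-1] // 2)
        dims.append(code_dim)
        encoder_layers, decoder_layers = [], []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            encoder_layers += [nn.Linear(d_in, d_out), nn.ReLU()]
        reversed_dims = dims[::-1]
        for d_in, d_out in zip(reversed_dims[:-1], reversed_dims[1:]):
            decoder_layers += [nn.Linear(d_in, d_out), nn.ReLU()]
        self.encoder = nn.Sequential(*encoder_layers)
        self.decoder = nn.Sequential(*decoder_layers)

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Training (see Section 5.3) would minimize nn.MSELoss() between x and
# forward(x) with Adam (learning rate 0.003) for 200 epochs on batches of 128;
# the trained encoder output is then clustered, e.g., with Gaussian Mixture.
```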
5 Experimental setup

This chapter gives an overview of our implementation and introduces our datasets together with the parameterization and performance measures of the analyzed ML-models.

5.1 Hardware

We ran our experiments on a DELTA D12z-M2-ZR machine. The server has an AMD EPYC 7702P Xeon (2.00 GHz-3.35 GHz, 64-core) processor and 512 GB DDR4-3200 DIMM RAM, and runs Ubuntu 20.04 LTS Server Edition. The server has two NVIDIA Quadro RTX GPUs (5000/16 GB, A6000/48 GB).

5.2 Implementation

The implementation of the polluters and ML pipelines is written in Python (version 3.9.7) using the scikit-learn [53] (version 1.0.1) and PyTorch [21] (version 1.10.2, supporting NVIDIA CUDA 10.2 [15]) libraries.

We evaluate the performance of the ML-models in three scenarios: Scenario 1 – polluted training set; Scenario 2 – polluted test set; and Scenario 3 – polluted training and test sets. Note that those scenarios are only considered for classification and regression, as clustering does not have a separate training and test set. To create the scenarios, we randomly split the data with a stratified 80:20 split into training and test set. We then pollute the training and test sets separately. We varied the ratio of pollution between 0 and 1 in increments of 0.1. For consistent representation, we tested kv = 2 and kv = 5, where kv = 2 means adding one new representation per v in the polluted dataset. For uniqueness experiments, we varied ρ from 1 to 5 in steps that lead to a linear quality decrease of 0.1 per step, i.e., 10/9, 10/8, etc. We make the cut at a ρ of 5, because for lower quality ρ increases faster than linearly, e.g., ρ would have to be 10 for a quality of 0.1, which would make the experiments much slower. Regarding duplicate count distribution functions, all samples get a duplicate count of 1 and we used a normal distribution with mean 1 and standard deviation 5.

Before applying any ML-model, we one-hot encode the categorical features. Then, we measure the performance of the respective ML-models, given the specific scenario for the specific dataset polluted with the specific polluter configuration. We run each polluter configuration five times with a different random seed, i.e., we obtain five results per ML-model in that setting, which we then aggregate by averaging.

For regression, we discretized the data with manually specified bin-step sizes before the stratified split. We use the discretized version also for the class balance and uniqueness polluters, as they require the target feature to consist of discrete classes. For all other polluters, we use the original continuous representation. The target feature distribution in our regression datasets is mostly close to a normal distribution. When applying the class balance polluter, this would lead to a heavily decreased dataset size of much less than 50% after balancing. That is why, when using the class balance polluter, we discard the discretized classes with very few samples, which would result in a balanced dataset with such a small size. To allow consistent comparisons with the original dataset, i.e., have the same classes contained in the data, we discard those classes with few samples from the original dataset too for the experiments with the class balance polluter.
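The evaluation loop for the three scenarios can be summarized by a sketch like the following, where `pollute` stands for any of the parameterized polluters above and the model and metric are placeholders; it mirrors the described setup rather than reproducing the actual pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def run_scenario(X, y, pollute, scenario, seeds=range(5)):
    """Scenario 1: polluted training set; Scenario 2: polluted test set;
    Scenario 3: both. Results of the five seeded runs are averaged.
    Categorical features are assumed to be one-hot encoded beforehand."""
    scores = []
    for seed in seeds:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=seed)
        if scenario in (1, 3):
            X_train = pollute(X_train, seed=seed)
        if scenario in (2, 3):
            X_test = pollute(X_test, seed=seed)
        model = DecisionTreeClassifier(random_state=12345).fit(X_train, y_train)
        scores.append(accuracy_score(y_test, model.predict(X_test)))
    return float(np.mean(scores))
```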
5.3 Model Parameters

All parameters not explicitly stated are kept at their default values in scikit-learn.

Classification. For LogR, we increase max_iter to 2 000 for better convergence. For a multi-class dataset, we fit a binary problem for each label. The maximum number of iterations for the multi-layer perceptron is 1 000. For SVM, we use a linear kernel and scale the input data with the sklearn StandardScaler, as the time to converge would otherwise make it infeasible given the large number of experiments we run.

Regression. For LR and RF, we set n_jobs to -1 to use all available processors. We seed all algorithms that provide a random_state parameter with 12345, which are all algorithms except LR. For MLP, we increase the max_iter parameter, which defines the maximum number of epochs, to 3 000 so that MLP is able to converge when trained on the original datasets.

Clustering. The actual number of clusters is passed to the k-Means/k-Prototypes, Gaussian Mixture, and Agglomerative algorithms. If categorical features exist, the Agglomerative algorithm uses Gower's distance measure [26]. We set the affinity parameter to "precomputed" (a pre-calculated distance matrix is to be employed) and the linkage parameter to "average", which defines the distance between two clusters as the average distance between all samples in the first and all samples in the second cluster. For OPTICS, we only defined the minimum cluster size parameter min_cluster_size, which specifies how many samples need to be in a cluster for it to be classified as such, as 100. We chose this number based on a combination of experiments and knowledge about our data. We are certain that in our original datasets, each cluster contains substantially more samples than 100. Our Autoencoder is trained for 200 epochs and optimized using the Adam optimizer [38] with a learning rate of 0.003 and mean squared error loss. The dataset is split into 80% train and 20% test data and loaded in batches of 128 shuffled samples. For all models that need a random seed, we used 42 as a seed.
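Translated into scikit-learn (version 1.0.1, as stated above) calls, the non-default settings correspond roughly to the sketch below; the chosen estimator classes, the placeholder cluster count, and anything not named in the text are our assumptions.

```python
from sklearn.linear_model import LogisticRegression, LinearRegression, Ridge
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.cluster import AgglomerativeClustering, OPTICS

SEED = 12345  # regression seed from the text; DT and KNN keep their defaults

classifiers = {
    # one binary problem per label (one-vs-rest), max_iter increased to 2000
    "LogR": LogisticRegression(max_iter=2000, multi_class="ovr"),
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="linear")),
    "MLP": MLPClassifier(max_iter=1000),
}
regressors = {
    "LR": LinearRegression(n_jobs=-1),
    "RR": Ridge(random_state=SEED),
    "RF": RandomForestRegressor(n_jobs=-1, random_state=SEED),
    "MLP": MLPRegressor(max_iter=3000, random_state=SEED),
}
clusterers = {
    # fit on a precomputed Gower distance matrix when categorical features exist;
    # n_clusters=3 is a placeholder for the dataset's actual class count
    "Agglomerative": AgglomerativeClustering(
        n_clusters=3, affinity="precomputed", linkage="average"),
    "OPTICS": OPTICS(min_cluster_size=100),
}
```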
5.4 Datasets

To investigate the correlation between the studied data quality dimensions and the chosen ML algorithms, we use the nine datasets shown in Table 1. We chose them from a variety of domains, sample sizes, and characteristics. Our choice was also influenced by the ML-task that the dataset is used for.

Table 1: Overview of the used datasets after pre-processing.

  Task            Name           # Samples   # Features   # Categorical
  Classification  Credit             1,000        20            13
                  Contraceptive      1,473         9             7
                  Telco-Churn        7,032        19            16
  Regression      Houses             1,460        79            46
                  IMDB               5,993        12             8
                  Cars              15,157         8             3
  Clustering      Bank               7,500         3             2
                  Covertype          7,504        54            44
                  Letter             7,514        16             0

Classification. IBM's Telco Customer Churn dataset represents 7 043 customers of a fictional telecommunications company [34]. It contains personal information about customers (e.g., gender and seniority) and their contracts (e.g., type of contract and monthly charges). The target variable Churn describes whether the customer cancelled their contract within the last month. We dropped the customerID feature because of its lack of information content for the classification task.

The original German Credit dataset was donated to the UCI Machine Learning Repository in 1994 by Prof. Hans Hofmann [33]. A corrected version was later introduced by Ulrike Grömping, who identified inconsistencies in the coding table and corrected them [28] (we use the corrected version). The German Credit dataset contains a stratified sample of 1 000 credits between the years 1973 and 1975 from a southern German bank [28]. It contains personal data about the people who applied for a credit (e.g., marital status or age) and about the credit itself (e.g., purpose or duration). The target variable tells whether a customer complied with the conditions of the contract or not.

The Contraceptive Method Choice dataset was part of the 1987 National Indonesia Contraceptive Prevalence Survey, asking non-pregnant married women about their contraceptive methods [43]. The dataset consists of 1 473 samples containing personal information about the wife's and husband's education, age, the number of children, and much more. The classification task is to determine the contraceptive method choice out of No-use, Long-term, and Short-term with the target variable Contraceptive method used.

Regression. The Houses dataset was created as a modern replacement for the widely used but outdated Boston Housing dataset [31]. Located in Ames, Iowa, it contains features of houses in the city to determine their sales price [14]. We use the dataset in the form it was presented in a Kaggle challenge [37]. There, only the training set includes the sale prices, which is why we take the training set with 1 460 samples from the challenge. We removed the Id feature because its only purpose is to identify houses uniquely. This results in 79 features and the target feature. There are missing values (NaN) in five features, which we replaced by computationally usable placeholders. One numerical feature, the year a garage was built, contains information related to the categorical feature garage type and has missing values for observations where the garage type indicates that there is no garage anyway. In those cases, we set 0 as a placeholder for the garage year to represent the meaning in context with the garage type and to distinguish them from the MCAR values that the polluter injects in our experiments. For the remaining four features with missing values, their occurrence is independent of other features. We treat them as MCAR values and represent them as placeholders outside the feature domains.

The IMDB dataset contains features and ratings for films and series that were retrieved from the IMDB website [55]. We removed all samples where the rating is missing because it is the target attribute. For the remaining 5 993 samples, we remove the name feature, as it only identifies the films and series. There are 12 features remaining apart from the target feature, of which two inherently numeric features contain missing values as text placeholders. For the first feature, duration, where the missing values are unrelated to other features, we set a numeric placeholder outside the domain to be able to process the feature values as numbers instead of categories. For the second feature, episodes, there are only missing values for film elements. We do not count those as MCAR values because they are related to the fact that the element is a film and therefore contain information, which is why we set the placeholder to 0 here.

The Cars dataset collects listings for used cars of different manufacturers [1]. It contains technical attributes of the cars, as well as model, year, and tax.
The sales price is the feature to be predicted. The data is stored in files grouped by manufacturer. We use only the data of VW.

Clustering. The Letter dataset contains 20 000 samples of different statistical measures of character images [19, 66]. Each sample describes one of the 26 capital letters in the English alphabet and was generated by randomly distorting pictures of the letters in 20 different fonts. The measured statistics were scaled, bounding the feature values to integers i in the range 0 ≤ i ≤ 15. As this dataset does not contain categorical features, it is the only dataset that the consistent representation polluter cannot be applied to. This is a deliberate decision, as the majority of existing clustering datasets do not contain categorical features, and we would like to examine the other data quality dimensions on a clustering-typical dataset.

The Bank dataset was created through marketing campaigns of a banking institution conducted via phone calls. It contains different characteristic