A Method for Generating Synthetic Longitudinal Health Data
Lucy Mosquera, Khaled El Emam, Lei Ding, Vishal Sharma, Xue Hua Zhang, Samer El Kababji, Chris Carvalho, Brian Hamilton, Dan Palfrey, Linglong Kong, Bei Jiang, Dean T. Eurich
Summary
This paper assesses the feasibility of generating synthetic administrative health data using a recurrent deep learning model. The data come from 120,000 individuals in Alberta Health's administrative health database. The study evaluates the similarity of the synthetic data to the real data by assessing structure and general patterns and by recreating specific analyses. Results indicate suitable similarity and the potential for sharing synthetic data for research purposes while addressing privacy concerns.
Mosquera et al. BMC Medical Research Methodology (2023) 23:67. https://doi.org/10.1186/s12874-023-01869-w

RESEARCH (Open Access)

A method for generating synthetic longitudinal health data

Lucy Mosquera1,2, Khaled El Emam1,2,3*, Lei Ding4, Vishal Sharma5, Xue Hua Zhang1, Samer El Kababji2, Chris Carvalho6, Brian Hamilton7, Dan Palfrey8, Linglong Kong4, Bei Jiang4 and Dean T. Eurich5

Abstract

Getting access to administrative health data for research purposes is a difficult and time-consuming process due to increasingly demanding privacy regulations. An alternative method for sharing administrative health data would be to share synthetic datasets where the records do not correspond to real individuals, but the patterns and relationships seen in the data are reproduced. This paper assesses the feasibility of generating synthetic administrative health data using a recurrent deep learning model. Our data comes from 120,000 individuals from Alberta Health's administrative health database. We assess how similar our synthetic data is to the real data using utility assessments that evaluate the structure and general patterns in the data, as well as by recreating a specific analysis commonly applied to this type of administrative health data. We also assess the privacy risks associated with the use of this synthetic dataset.
Generic utility assessments used the Hellinger distance to quantify the difference in distributions between the real and synthetic datasets for event types (0.027), attributes (mean 0.0417), and Markov transition matrices (order 1 mean absolute difference: 0.0896, sd: 0.159; order 2 mean Hellinger distance: 0.2195, sd: 0.2724). The Hellinger distance between the joint distributions was 0.352, and random cohorts generated from the real and synthetic data had a mean Hellinger distance of 0.3 and a mean Euclidean distance of 0.064, indicating small differences between the distributions in the real data and the synthetic data. By applying a realistic analysis to both real and synthetic datasets, Cox regression hazard ratios achieved a mean confidence interval overlap of 68% for adjusted hazard ratios among 5 key outcomes of interest, indicating that synthetic data produces analytic results similar to real data. The privacy assessment concluded that the attribution disclosure risk associated with this synthetic dataset was substantially less than the typical 0.09 acceptable risk threshold. Based on these metrics, our results show that our synthetic data is suitably similar to the real data and could be shared for research purposes, thereby alleviating concerns associated with the sharing of real data in some circumstances.

Keywords: Synthetic data, Administrative health data, Data privacy, Data sharing

Background

It is often difficult for analysts and researchers to get access to high quality individual-level health data for research purposes.
For example, despite funder and journal expectations for authors to share their data [1–3], an analysis of the success rates of getting individual-level data for research projects from authors found that the percentage of the time these efforts were successful varied significantly and was generally low at 58%, 46%, 14%, and 0%. Some researchers note that getting access to datasets from authors can take from 4 months to 4 years. Data access through independent data repositories can also take months to complete [8, 9].

*Correspondence: Khaled El Emam, [email protected]. Full list of author information is available at the end of the article.

Concerns about patient privacy, coupled with increasingly strict privacy regulations, have contributed to the challenges noted above. For instance, privacy concerns by patients and regulators have acted as a barrier to the sharing of health data [10, 11]. A recent review of health data infrastructure in Canada concluded that (mis)interpretations of privacy laws and a general "privacy chill" incentivize risk-averse behavior among data custodians, stifling data access and research. An analysis of data sharing practices for studies funded by CIHR found non-trivial gaps in data availability. A number of approaches are available to address these concerns: consent, anonymization, and data synthesis.

Patient (re-)consent is one legal basis for making data available to researchers for secondary purposes. However, it is often impractical to get retroactive consent under many circumstances, and there is significant evidence of consent bias.

Anonymization is one approach to making clinical and administrative data available for secondary analysis. However, there have recently been repeated claims of successful re-identification attacks on anonymized data [15–21], eroding public and regulators' trust in this approach [21–31].

Data synthesis is a more recent approach for creating non-identifiable health information that can be shared for secondary analysis by researchers [32, 33]. Researchers have noted that synthetic data does not have an elevated identity disclosure (privacy) risk [34–41], and recent empirical evaluations have demonstrated low disclosure risk. Synthetic data generation has the potential to unlock historically siloed and difficult-to-access datasets for secondary analysis, including research.

There are synthetic health datasets currently available to a broad research community, such as: the NIH National COVID Cohort Collaborative (N3C), the CMS Data Entrepreneur's Synthetic Public Use Files, synthetic cardiovascular and COVID-19 datasets available from the CPRD in the UK [45, 46], A&E data from NHS England, cancer data from Public Health England, a synthetic dataset from the Dutch cancer registry, synthetic variants of the French public health system claims and hospital dataset (SNDS), and South Korean data from the Health Insurance Review and Assessment service (the national health insurer).

Multiple methods have been developed for the generation of cross-sectional synthetic health data [52–59]. The synthesis of longitudinal data is more challenging because patients can have long sequences of events that need to be incorporated into the generative models. Longitudinal data captures events and transactions over time, such as those in electronic medical records, insurance claims datasets, and prescription records. As we summarize below, published methods thus far are not suitable for the synthesis of realistic longitudinal data because many of them would only have worked with curated data where the messiness of real-world data has been taken out.

In this article we present a recurrent neural network (RNN) model for the generation of synthetic longitudinal health data. The model was empirically tested on Alberta's administrative health records. Individuals were selected for this cohort if they received a prescription for an opioid during the 7-year study window. Data available for this cohort of patients included demographic information, laboratory tests, prescription history, emergency department visits, hospitalizations, and death. The synthesized data utility was evaluated using generic metrics to compare the real data with the synthetic data, and a traditional time-to-event analysis of opioid use was performed on both datasets and the results compared. This type of analysis is the cornerstone of most health services research. The privacy risk associated with the synthetic dataset was assessed using an attribution disclosure risk assessment on the synthetic data.

Methods

In this section we describe the requirements for a generative model that captures the patterns in complex longitudinal clinical datasets, and an RNN architecture to meet those requirements. We also describe how we evaluate the utility and privacy risks of the generated datasets. The utility of the generated data can be evaluated using two approaches: general purpose utility metrics and a workload-aware evaluation. The former evaluates the extent to which the characteristics and structure of the synthetic data are similar to those of the real data, and the latter compares the model results and conclusions of a substantive analysis on opioid use utilizing the synthetic and real datasets. We performed both types of utility assessment.

Requirements for synthesizing longitudinal health data

We first present a series of requirements for the synthesis of longitudinal health data. This allows us to be precise in evaluating previous work and in setting express target criteria for our generative model. These requirements are intended to capture: (a) the characteristics of real longitudinal datasets that have received minimal curation, to ensure that the synthesized datasets are realistic and that the generative models will work with real health data, and (b) the characteristics of the generative models themselves, to ensure that they are scalable and generalizable. Our requirements are as follows:

1. The original dataset that is synthesized is a combination of: (a) longitudinal data (i.e., multiple events over time from the same patient), and (b) cross-sectional data (i.e., measures that are fixed and are not repeated, such as demographic information).
2. The length of the longitudinal sequence varies across patients in the original datasets. Patients with acute conditions may have very few events, whereas complex patients with chronic conditions may have a very large number of events.
3. The original datasets are heterogeneous, with a combination of: (a) categorical or discrete features, (b) continuous features, and (c) categorical variables with high cardinality (e.g., diagnosis codes and procedure codes).
4. Outliers and rare events should be retained in the original dataset, since real data will have such events in them.
5. The data may have many missing values, leading to sparse datasets (i.e., missing data are not removed from the original datasets that are synthesized).
6. The generative model can take into account the previous information about the patients in the sequence.
7. The generative model should be developed based on existing data rather than requiring manual intervention by clinicians to seed it or correct it.

Our objective was to construct a generative model that would meet these requirements.

Previous approaches

Multiple methods have been proposed in the literature for synthesizing longitudinal health data, each with their own strengths and limitations. These are summarized in Table 1. None of them meet all of our requirements, making the case for additional research and generative model architectures to meet the requirements above.

Table 1 Literature review of key characteristics of previous works for generating longitudinal synthetic health data. Columns are keyed to the requirements above: R.1a cross-sectional, R.1b longitudinal, R.2 sequence length, R.3a categorical, R.3b continuous, R.3c high cardinality, R.4 outliers removed, R.5 missing values present in data, R.6 considers all previous information, R.7 model informed by clinicians. N/A refers to not applicable, while N/D refers to not described.

Title | R.1a | R.1b | R.2 | R.3a | R.3b | R.3c | R.4 | R.5 | R.6 | R.7
Variational Autoencoder Modular Bayesian Networks (VAMBN) for Simulation of Heterogeneous Clinical Study Data | No | Yes | Fixed | Yes | Yes | Yes | N/D | Yes | Yes | No
Machine learning for comprehensive forecasting of Alzheimer's Disease progression | No | Yes | Varied | Yes | Yes | No | N/D | Yes | No | No
Design and Validation of a Data Simulation Model for Longitudinal Healthcare Data | No | Yes | Varied | Yes | No | Yes | N/D | No | Yes | No
Privacy-Preserving Generative Deep Neural Networks Support Clinical Data Sharing | No | Yes | Fixed | No | Yes | No | Yes | No | Yes | No
Analyzing Medical Research Results Based on Synthetic Data and Their Relation to Real Data Results: Systematic Comparison From Five Observational Studies | Yes | No | N/A | Yes | Yes | No | N/D | Yes | N/A | No
Synthetic Event Time Series Health Data Generation | Yes | Yes | Fixed | Yes | Yes | No | Yes | No | Yes | No
Data-driven approach for creating synthetic electronic medical records | No | Yes | Varied | Yes | Yes | Yes | N/D | N/D | Yes | No
Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record | Yes | Yes | Varied | Yes | Yes | Yes | N/D | No | Yes | Yes
Real-valued (medical) time series generation with recurrent conditional GANs | No | Yes | Fixed | No | Yes | N/A | Yes | No | Yes | No
Generating Multi-label Discrete Patient Records using Generative Adversarial Networks | Yes | No | N/A | Yes | No | No | N/D | Yes | No | No
Data Synthesis based on Generative Adversarial Networks | Yes | Yes | Fixed | Yes | Yes | Yes | N/D | N/D | Yes | No
Generation and Evaluation of Privacy Preserving Synthetic Health Data | Yes | No | N/A | Yes | Yes | Yes | No | No | No | No
Generation of Heterogeneous Synthetic Electronic Health Records using GANs | Yes | No | N/A | Yes | Yes | Yes | Yes | N/D | No | No
Generating Electronic Health Records with Multiple Data Types and Constraints | Yes | No | N/A | Yes | Yes | Yes | Yes | N/D | No | No
Ensuring electronic medical record simulation through better training, modeling, and evaluation | Yes | No | N/A | Yes | No | Yes | Yes | N/D | No | No
Generative Adversarial Networks for Electronic Health Records: A Framework for Exploring and Evaluating Methods for Predicting Drug-Induced Laboratory Test Trajectories | No | Yes | Fixed | No | Yes | N/A | Yes | No | Yes | No
Synthesizing electronic health records using improved generative adversarial networks | Yes | No | N/A | Yes | No | No | Yes | N/D | Yes | No
Generating Fake Data Using GANs for Anonymizing Healthcare Data | Yes | Yes | Fixed | Yes | Yes | No | Yes | N/D | No | No
CorGAN: Correlation-Capturing Convolutional Generative Adversarial Networks for Generating Synthetic Healthcare Records | Yes | No | N/A | Yes | Yes | No | N/D | N/D | N/A | No
Generation and evaluation of synthetic patient data | Yes | No | N/A | Yes | Yes | No | No | N/D | N/A | No
Generating and Evaluating Synthetic UK Primary Care Data: Preserving Data Utility & Patient Privacy | Yes | No | N/A | Yes | Yes | No | No | N/D | N/A | No
SMOOTH-GAN: Towards Sharp and Smooth Synthetic EHR Data Generation | Yes | No | N/A | Yes | Yes | No | Yes | N/D | N/A | No
Continuous Patient-Centric Sequence Generation via Sequentially Coupled Adversarial Learning | No | Yes | Varied | No | Yes | N/A | Yes | No | Yes | No
Medical Time-Series Data Generation using Generative Adversarial Networks | No | Yes | Varied | Yes | Yes | No | N/D | N/D | No | No

Data characteristics

We used a cohort of patients previously derived and published to evaluate trends in opioid use in the province of Alberta, Canada. The following administrative databases from Alberta Health from 2012 to 2018 were linked by the encrypted personal health number (PHN):

1. The Provincial Registry and Vital Statistics database for patient demographics and mortality. We used the age, sex, vital statistics, and date of last follow-up. An additional covariate, the Elixhauser comorbidity score, was derived based on physician, emergency department, or hospitalization ICD-9/10 codes.
2. Dispensation records for pharmaceuticals from the Alberta Netcare Pharmaceutical Information Network (PIN). We restricted the data to dispensations of either of two commonly dispensed opioids of interest in our data (morphine and oxycodone) and dispensations of antidepressant medications, since these were the focus of the time-to-event analysis.
3. The Ambulatory Care Classification System, which provides data on all services delivered while under the care of the Emergency Department. This included the date of the visit, primary diagnostic codes, and the resource intensity weight. The resource intensity weight is a measure used in the province to determine the amount of resources used during the visit. The primary diagnostic code was included as an ICD-10 code.
4. The Discharge Abstract Database, which provides data similar to Ambulatory Care but pertaining to inpatient hospital admissions. Information on hospitalizations was restricted to the date of admission, the primary diagnostic code, and the resource intensity weight.
5. Provincial laboratory data, which includes all outpatient laboratory tests in the province. We only considered results for 3 common labs conducted in the province (ALT, eGFR, HCT) and the associated date of testing.

The structure of the data is illustrated in Fig. 1. There is a demographic table with basic characteristics of patients, and a set of transactional tables with a one-to-many relationship between the demographic table and the transactional tables. Therefore, each patient may have multiple events occurring over time. Using the PHN, observations for a single individual from multiple transactional tables may be linked together. Each observation in the transactional tables includes the date of the event relative to the start of the study period. This means that a group of observations from the same individual can be sorted according to the relative date, yielding a chronological order of an individual's interactions with the health system.

Fig. 1 Representation of the dataset. Note that the demographics information contains a single observation per individual, where each individual is identified using a personal health number (PHN). This PHN links the demographics table to all other tables in the dataset, where all other tables may have multiple observations per individual.

Each event, whether it is a visit or a lab test, has a different set of attributes. Therefore, the event characteristics are a function of the event type. For example, a hospitalization event will record the relative date of the hospitalization, the length of stay, the diagnostic code, and the resource intensity weight. Additionally, all event types include an attribute to describe the timing of the event. In this work we model time using sojourn time, or the time in days since the last event for that individual.

The basic patient characteristics and event characteristics are heterogeneous in data type. This means that some will be categorical variables, some will be continuous, some binary, and some discrete ordered variables. For example, age is a continuous patient characteristic, while the diagnostic code associated with an emergency department visit is a categorical event characteristic.

Generative model description

To synthesize complex longitudinal health data, we use an RNN. RNNs model input sequences using a memory representation that is aimed at capturing temporal dependencies. Vanilla RNNs, however, suffer from the problem of vanishing gradients and thus have difficulty capturing long-term dependencies. Long short-term memory (LSTM) units and the gated recurrent unit (GRU) were conceived to overcome this limitation. This work implements an LSTM to model and synthesize observations over time. The generated data was then evaluated in terms of data utility. This generative model was implemented in Python version 3.8 using PyTorch version 1.7.

Table 2 provides the exact dimensionality of the original datasets. A random subset of 100,000 patients, drawn from a population of 300,000 subjects 18 years of age and over who received a dispensation for morphine or oxycodone between Jan 1, 2012 and Dec 31, 2018, was used to train our generative model and included in our analyses. For these patients, we truncated the events at the 95th percentile, which means that the maximum number of events an individual can have was 1,000. Details of the dataset preparation for the modeling are provided in Additional file 1: Appendix 1: Data Pre-processing.

Table 2 Dimensionality of the original data tables for the 100,000 individuals used for training

Table Name | Number of Rows | Number of Columns
Age_sex_comorbidity | 100,000 | 4
Drug_data | 9,975,950 | 7
ED_visits | 1,748,083 | 5
Hosp_admit | 84,669 | 5
Labs | 2,199,574 | 3
Reg_file | 100,000 | 2
Vital_stats | 4,200 | 6

Model structure

Our generative model was a form of conditional LSTM where the final predicted outputs are conditional on the baseline characteristics. The model architecture, including which datasets are provided as inputs vs. predicted as outputs, is described in Fig. 2. The input data corresponds to n individuals at t − 1 time points (i.e., the set T ∈ {1, 2, 3, …, t − 1}) for event labels (yielding an array of dimensions [n, t − 1]) and event attributes (yielding an array of dimensions [n, t − 1, A], where A is the number of attributes), as well as the B baseline characteristics for each individual. The output consists of predictions corresponding to n individuals at t − 1 time points (i.e., the set T ∈ {2, 3, 4, …, t}) for the event labels and event attributes. These predictions are used during training to calculate the model loss, or during data generation as the subsequent synthetic events.

This model consists of 3 components: embedding layers, an LSTM, and output layers.

The embedding layers are used to map integer-encoded categorical features to a series of continuous features. The benefit of this embedding is that the transformation mapping the discrete features to the set of continuous features is altered and improved throughout training. This allows for a continuous-space representation of the categorical features that picks up similarity between related categories. Embedding occurs independently for each of the baseline characteristics (age, sex, comorbidity index), the event labels, and the event attributes.

The LSTM estimates a representation of the hidden state given the prior event labels and attributes. The embedded event attributes and the embedded event labels are concatenated prior to being input to the LSTM. If the LSTM receives observations corresponding to times T ∈ {1, 2, 3, …, t − 1}, then the output of the hidden state will correspond to times T ∈ {2, 3, 4, …, t}. In addition to the predictions, the LSTM outputs the complete hidden state, which describes the current state of all elements of the model. The complete hidden state is used during data synthesis as a way of accounting for historical events.

The output layers are a set of linear transformations that take as input the concatenation of the output of the LSTM and the embedded baseline characteristics. These output layers make the predictions for the next time points generated by the LSTM, conditioned on the baseline characteristics.

Fig. 2 Diagram of the overall RNN architecture

Model training

During training, loss is calculated for both the event labels and the event attributes, with masking applied to the event attributes so that only attributes measured for the true event label contribute to the loss. This makes training more efficient, as masking the loss for the event attributes restricts the model to learning how to predict each attribute only when it is measured for a given event label.

Thus, the objective function for training is to minimize the total loss over the model parameters θ, where the tradeoff parameter λ controls the relative importance of the label loss and the attribute loss:

$$\min_{\theta}\;\left(\mathrm{loss}_{\text{labels}} + \lambda \cdot \mathrm{loss}_{\text{attributes}}\right)$$

During training, data is provided to the model in tensors of 120 time points. Individuals have their data grouped into chunks of up to 120 sequential events, with 0s introduced to pad chunks shorter than 120 observations. This is desirable as it produces data that is uniform and much less sparse than if we were to pad up to the true maximum number of observations per individual of 1,000.

The event label loss is calculated using cross-entropy loss between the predicted event labels and the true event labels.
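As a concrete illustration of this masked loss scheme, here is a toy pure-Python sketch (ours, not the authors' PyTorch implementation; the event labels, logit values, and the `relevant` mapping of labels to measured attributes are all made up for illustration):

```python
import math

def cross_entropy(logits, true_idx):
    # -log softmax probability of the true class, computed from raw scores
    log_norm = math.log(sum(math.exp(z) for z in logits))
    return -logits[true_idx] + log_norm

def masked_loss(label_logits, true_label, attr_logits, true_attrs, relevant):
    # The label loss is always computed; the attribute loss only sums over
    # the attributes actually measured for the true event label.
    loss_labels = cross_entropy(label_logits, true_label)
    loss_attributes = sum(
        cross_entropy(attr_logits[i], true_attrs[i])
        for i in relevant[true_label]  # mask: skip irrelevant attributes
    )
    return loss_labels, loss_attributes

# Toy setup: 3 event labels; label 0 ("lab test") measures attribute 0 only.
relevant = {0: [0], 1: [1], 2: []}
loss_lab, loss_att = masked_loss(
    label_logits=[2.0, 0.5, 0.1], true_label=0,
    attr_logits=[[1.0, 0.2], [0.3, 0.3]], true_attrs=[0, 1],
    relevant=relevant,
)
```

In the paper's model, the same idea is applied per time point and aggregated over individuals and time, with the weight λ trading off the two loss components.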
Formally, the label loss is:

$$\mathrm{loss}_{\text{labels}} = \frac{1}{Nt} \sum_{n=1}^{N} \sum_{t=1}^{t} \left( -x_{\text{label}\,n,t}[\mathrm{true}_{n,t}] + \log \sum_{j=0}^{C} \exp\!\big(x_{\text{label}\,n,t}[j]\big) \right)$$

where x_label n,t is the vector of predicted probabilities for the event label for individual n at time t, x_label n,t[j] is the predicted probability that individual n at time t has the event with label j, and true_n,t is the true event label for individual n at time t.

Next, cross-entropy loss is calculated for the attributes associated with the true event label. For example, if the next time point is truly a lab test, then the model loss for the event attributes is the sum of the cross-entropy between the real lab test name and the predicted lab test name and the cross-entropy between the real lab test result and the predicted lab test result. This masked form of loss for the event attributes is desirable as it allows the model to focus on learning the relevant features at each time point, rather than constantly predicting missing values.

If we define the indicator function 1(A_i | true_n,t) to check whether a given attribute A_i is relevant for a given true event label true_n,t, then the cross-entropy loss for the attributes is calculated as:

$$\mathrm{loss}_{\text{attributes}} = \operatorname{mean} \sum_{n=1}^{N} \sum_{t=1}^{t} \sum_{i=1}^{A} \mathbb{1}(A_i \mid \mathrm{true}_{n,t}) \left( -x_{n,t,i}[\mathrm{true}_{A_i,n,t}] + \log \sum_{j=0}^{C} \exp\!\big(x_{n,t,i}[j]\big) \right)$$

where true_Ai,n,t is the true value for individual n's attribute i at time t, and x_n,t,i is the vector of predicted probabilities for individual n's attribute i at time t among the C possible classes for attribute i.

Hyperparameter optimization was performed using a training set of 100,000 individuals and a validation set of 20,000 individuals. Hyperparameters explored include the batch size, number of training epochs, optimization algorithm, learning rate, number of layers within the LSTM, hidden size of the LSTM, embedding sizes for the event labels, event attributes, and baseline characteristics, and the weighting of the different event types and event attributes during calculation of the training loss. Training was performed on an Nvidia P4000 graphics card and was coordinated using Ray Tune.

Synthetic data generation

After training the model as described in the previous section, synthetic data generation consists of two phases: generation of baseline characteristics and starting values, followed by generation of the longitudinal event data. Baseline characteristics and the values for the first observed event are generated using a sequential tree-based synthesis method [88, 89]. Using a scheme similar to sequential imputation [90, 91], trees are used quite extensively for the synthesis of health and social sciences data [52–59, 92]. With these types of models, a variable is synthesized by using the values earlier in the sequence of characteristics as predictors.

These synthesized values are then fed into the trained model to generate the remaining events for each synthetic individual. The goal behind using sequential tree-based synthetic values as the baseline characteristics and starting values for the LSTM model is that they will better reproduce the characteristics of the real population than randomly sampled starting values.

To generate the longitudinal event data, the output of the sequential tree-based synthesis is iteratively fed into the LSTM model. At each iteration, the model uses the synthetic data from the previous time point, as well as the hidden state of the model if available, to predict the next time point. These predictions consist of predicted event labels and event attributes. Based on the predicted event label, all non-relevant event attributes are masked and set to missing. For example, if the next time point is predicted to be a lab test event, the lab test name, lab test result, and sojourn time event attributes will be retained while all others are set to missing. This masking during data generation is important to ensure that the data the model sees during generation matches the format of the data seen during training. Data synthesis proceeds in this iterative fashion until the model has generated event data up to the maximum sequence length. In post-processing, each sequence is trimmed such that, if available, sequences terminate when a 'last observation' event type is observed.

Generic utility assessments

Generic utility assessments aim to evaluate the similarity between a real and a synthetic dataset without any specific use case or analysis in mind. Two types of methods were used depending on whether we were evaluating the utility of the cross-sectional vs. the longitudinal portion of the data. All generic utility assessments were completed using Python version 3.8.

Event distribution comparisons

The simplest generic utility assessments are to compare the number and distribution of events generated for each synthetic individual to the number and distribution of events in the real data. To compare the number of events per individual, the distributions are plotted as histograms and the means are compared. To compare the distribution of events in the real and synthetic data, the observed probability distribution over event types is calculated for each dataset. This corresponds to what proportion of events belongs to each event type. These probability distributions are then plotted and compared as bar charts. Additionally, these distributions are compared by calculating the Hellinger distance between the two distributions. Hellinger distance is an interpretable metric for assessing the similarity of probability distributions that is bounded between 0 and 1, where 0 corresponds to no difference.

Comparing the distribution of event attributes

Another simple metric for assessing the similarity between the real and synthetic datasets is to compare the marginal distributions of each event attribute. For this assessment, we apply the Hellinger distance (as defined above) to the discrete probability distributions for each event attribute. Careful consideration is taken to tabulate the probability distributions for each event attribute using only observations with an event label that is relevant for that attribute. This ensures that we are comparing the distributions of each attribute without the padded/missing values. To summarize the Hellinger distance values calculated for each event attribute, they are plotted in a bar chart.

Comparison of transition matrices

The next method we applied for the utility evaluation of synthetic data is to compute the similarity between the real-data and synthetic-data transition matrices. A transition matrix reflects the probability of transitioning from one event to another. These transition probabilities can be estimated empirically by looking at the proportion of times that a particular event follows another one. For example, consider sequence data with four events, A, B, C, and D, where C is a terminal event, meaning that if C occurs the sequence terminates. If 40% of the time an event B follows an event A, then we can say that the transition from A to B has a probability of 0.4. The transition matrix is the complete set of these transition probabilities. Creating such a transition matrix assumes that the next event observed depends on only one previous event. This can be quite limiting and does not account for longer-term relationships in the data. However, transition matrices can be extended to the kth order, where k corresponds to the number of previous events considered when calculating the transition probabilities.

An example of a 2nd order transition matrix is shown in Table 3. Here we have the two previous events along with the transition probabilities. The rows indicate the previous states, and the columns indicate the next state. Note that each row needs to add up to 1 because the sum of the total transitions from a pair of consecutive states must be 1. Also, there are no previous states with a C event in them because in our example that is a terminal event.

Table 3 An example of a transition matrix with an order of 2, which means that the two previous events are considered. We assume that C is a terminal event

Previous states | A | B | C | D
AB | 0.31 | 0.29 | 0.39 | 0.00
BA | 0.42 | 0.21 | 0.22 | 0.16
AD | 0.64 | 0.11 | 0.08 | 0.18
DA | 0.38 | 0.05 | 0.23 | 0.34
BD | 0.41 | 0.31 | 0.26 | 0.02
DB | 0.01 | 0.16 | 0.57 | 0.26
AA | 0.20 | 0.40 | 0.30 | 0.10
BB | 0.36 | 0.34 | 0.25 | 0.04
DD | 0.34 | 0.48 | 0.17 | 0.01

The transition matrices for the real and synthetic datasets can be compared by calculating the Hellinger distance between each row in the real transition matrix and the corresponding row in the synthetic transition matrix.

Analysis-specific utility assessments

Generic utility assessments are agnostic to the future analyses of the synthetic data and compare the real and synthetic datasets in terms of distributional and structural similarity. In contrast, workload-aware or analysis-specific utility assessments compare the real and synthetic datasets by performing the same substantive analysis on both and comparing the results. For this dataset we conducted an analysis-specific utility assessment by applying a time-to-event analysis to both the real and synthetic datasets and comparing the results. Our primary outcome was a composite endpoint of all-cause emergency department visits, hospitalization, or death during the follow-up.
The secondary outcomes included each component of the composite endpoint The lower the Hellinger distance values, the closer the separately, as well as to evaluate cause specific admis- transition structure between the two datasets. In this sions to hospital for pneumonia (ICD code J18) as a st and 2nd order tran- work we report utility for both the 1 prototypical example of a cause specific endpoint. sition matrices. First, all variables in both the synthetic and real data were compared using standard descriptive statistics Multivariate Hellinger distance (e.g., means, medians). Second, standardized mean dif- A multivariate Hellinger distance can be derived from the ferences (SMD) were used to statistically compare our multivariate normal Bhattacharyya distance. This variables of interest between the synthetic and real metric is bound between 0 and 1 and hence is an easily data. SMD was selected as given our large sample size, interpreted generic measure of overall similarity of the mul- small clinically unimportant differences, are likely to be tivariate distribution between the real and synthetic data- statistically different when using t-tests or chi squared sets. This metric has also been shown to be highly predictive test. A SMD greater than 0.1 is deemed as a potentially of synthetic data utility for logistic regression analyses. clinically important difference, a threshold often recom- mended for declaring imbalance in pharmacoepidemio- logic research. Utility of random cohorts Using Cox proportional hazards regression models, All the utility assessments described thus far are con- unadjusted and adjusted hazard ratios (HRs) and 95% ducted on the whole dataset. However, when analyzing confidence intervals were calculated to assess the risk longitudinal data, it is quite common to generate queried associated with either morphine or oxycodone and our data (i.e., a cohort) from the whole dataset. 
Therefore, it outcomes of interest in both the synthetic and real data is beneficial to compare the cohorts generated by queries separately. Start of follow-up began on the date of the on the real and synthetic datasets. A query defines the first dispensation for either morphine or oxycodone. inclusion and exclusion criteria for the cohort. All subjects were prospectively followed until outcome For this assessment, we used a fuzzy SQL method. This of interest or censoring defined as the date of termi- will generate a large number of random semantically and nation of Alberta Health coverage or 31 March 2018, syntactically correct SELECT random queries that are providing a maximum follow-up of 7 years. Finally, the simultaneously applied to both real and synthetic data- estimates derived from the real and synthetic datasets sets. The similarity between the resultant real and syn- were directly statistically compared. Morphine served thetic cohorts are compared using the Hellinger distance as the reference group for all estimates. Potential con- for distributions and normalized Euclidean distance for founding variables included in all multivariate models aggregate results (e.g., the average of a continuous vari- included age, sex, Elixhauser comorbidity score, use of able in the cohort). antidepressant medications, and our 3 laboratory vari- Such SQL fuzzers are used to test database manage- ables (ALT, eGFR, HCT). To compare the confidence ment systems (DBMSs) for any bugs or vulnerabilities intervals estimated for HRs from real vs synthetic data-. In our context we apply a similar concept to gen- set, confidence interval overlap was used. All analy- erate random cohorts. More details about our imple- ses were performed using STATA/MP 15.1 (StataCorp., mentation are included in Additional file 1: Appendix 2: College Station, TX). Random Cohort Utility Assessment. Content courtesy of Springer Nature, terms of use apply. Rights reserved. Mosquera et al. 
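The masked attribute loss described in the methods can be sketched in NumPy. This is an illustrative reconstruction, not the authors' code: the array shapes (N, T, A, C), the function name, and averaging over only the relevant (unmasked) entries are our own assumptions about how the indicator term is applied.

```python
import numpy as np

def masked_attribute_loss(logits, targets, relevance):
    """Cross-entropy over event attributes, counting only the attributes
    relevant to the true event label (a sketch of the loss_attributes term).

    logits:    (N, T, A, C) array of raw scores x_{n,t,i}[j]
    targets:   (N, T, A) integer array of true classes true_{A_i,n,t}
    relevance: (N, T, A) 0/1 mask playing the role of 1[A_i | true_{n,t}]
    """
    # log-sum-exp over the class axis, stabilised by the max score
    m = logits.max(axis=-1, keepdims=True)
    lse = m[..., 0] + np.log(np.exp(logits - m).sum(axis=-1))
    # pick out the score of the true class: -x[true] + log sum_j exp x[j]
    true_score = np.take_along_axis(logits, targets[..., None], axis=-1)[..., 0]
    ce = -true_score + lse
    # average over the relevant entries only
    return (ce * relevance).sum() / relevance.sum()
```

Masking the loss this way keeps attributes that are padded or irrelevant for the predicted event type from contributing gradient, which mirrors the masking the paper applies during both training and generation.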
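The empirical k-th order transition matrices and their row-wise Hellinger comparison can be sketched as follows (Python, as used for the paper's generic utility assessments; the function names and the A/B/C/D event alphabet are illustrative, not taken from the paper's code):

```python
import numpy as np
from collections import Counter

EVENTS = ["A", "B", "C", "D"]  # C treated as terminal, as in the Table 3 example

def transition_matrix(sequences, k=1):
    """Empirical k-th order transition probabilities.
    Rows are k-tuples of previous events, columns are the next event."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k):
            counts[(tuple(seq[i:i + k]), seq[i + k])] += 1
    states = sorted({prev for prev, _ in counts})
    mat = np.zeros((len(states), len(EVENTS)))
    for r, prev in enumerate(states):
        for c, nxt in enumerate(EVENTS):
            mat[r, c] = counts[(prev, nxt)]
    mat /= mat.sum(axis=1, keepdims=True)  # each row sums to 1
    return states, mat

def hellinger(p, q):
    """Hellinger distance between two discrete distributions, bounded in [0, 1]."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
```

To compare a real and a synthetic dataset one would compute `hellinger(real_mat[r], synth_mat[r])` for each shared row r and then summarize with the mean and median, as reported in the results.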
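The confidence-interval overlap used to compare real and synthetic HR estimates can be computed as the average fraction of each interval covered by their intersection. This is one common definition from the synthetic-data literature; the paper's exact formula may differ, so treat this as a sketch.

```python
def ci_overlap(l1, u1, l2, u2):
    """Average proportion of each confidence interval (l1, u1) and (l2, u2)
    covered by their intersection; 1.0 means identical, 0.0 means disjoint."""
    inter = min(u1, u2) - max(l1, l2)
    if inter <= 0:
        return 0.0  # the intervals do not overlap
    return 0.5 * (inter / (u1 - l1) + inter / (u2 - l2))
```

For example, `ci_overlap(0.66, 0.75, 0.69, 0.77)` would quantify the agreement of the real and synthetic composite-endpoint estimates; note that this particular definition need not reproduce the exact percentages reported in Table 7.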
Privacy assessment

To quantify the privacy risks in the synthetic data we evaluated attribution disclosure risk. This privacy assessment is designed to evaluate the risk that an adversary could match a synthetic record with a real record and, if a re-identification were to occur, whether the adversary would learn something new about that person. The quasi-identifiers used for this assessment were: age, sex, a death indicator, and a hospitalization indicator. For this assessment, we consider two directions of attack: a population-to-sample attack and a sample-to-population attack.

We use the common threshold for the disclosure of clinical trial and other types of health data, 0.09 [100–106], which is the threshold used by the European Medicines Agency for their Policy 0070 anonymization guidance and by Health Canada for its Public Release of Clinical Information guidance. This is equivalent to a minimal group size of 11 under a maximum risk scenario.

Results

Model parameters

Hyperparameter training was conducted for a variety of aspects of the model implementation. By selecting the values within a search range that minimized validation loss, an optimal model was selected. The complete set of optimal hyperparameter values can be found in Additional file 1: Appendix 3: Optimal Model Parameters.

Generic utility assessments

The generic utility results are summarized in Table 4. They are reviewed in more detail below.

Table 4 Summary of the generic utility assessments results

Metric                                                      Result
Percent difference in sequence lengths                      0.4%
Hellinger distance of event distribution                    0.027
Hellinger distance of event attributes:
  Mean (SD)                                                 0.0417
  Median (IQR)                                              0.0303 (0.0333)
Hellinger distance of Markov transition matrices of order 1:
  Mean (SD)                                                 0.0896 (0.159)
  Median (IQR)                                              0.0209 (0.0303)
Hellinger distance of Markov transition matrices of order 2:
  Mean (SD)                                                 0.2195 (0.2724)
  Median (IQR)                                              0.0597 (0.4401)
Multivariate Hellinger distance                             0.352
Utility of 100 random cohorts:
  Hellinger distance:
    Mean (SD)                                               0.3039 (0.0674)
    Median (IQR)                                            0.3128 (0.0358)
  Normalized Euclidean distance:
    Mean (SD)                                               0.0639 (0.1145)
    Median (IQR)                                            0.0146 (0.0596)

The sequence lengths in the synthetic dataset matched the real dataset quite closely (percent difference in mean sequence length 0.4%), as illustrated in Fig. 3. The distribution of events observed across all synthetic patients matched the distribution of events in the real dataset quite closely (Hellinger distance 0.027), as illustrated in Fig. 4.

Fig. 3 Sequence length comparison between the real and synthetic datasets. Overall, the synthetic data has a similar distribution of sequence lengths to the real data (real mean & SD: 58.14, 68.57 vs synthetic mean & SD: 58.39, 75.16)

Fig. 4 Event distribution comparison between the real and synthetic datasets

Comparing the distribution of event attributes, the synthetic data again matches the distributions seen in the real data closely, as shown in the Hellinger distance histogram in Fig. 5 (mean Hellinger distance 0.0417). The difference between the real and synthetic transition matrices was smaller for first-order Markov transition matrices (Fig. 6) than for second-order transition matrices (mean Hellinger distance 0.0896 vs 0.2195), indicating that short-term dependencies may be modelled better than long-term dependencies. The multivariate Hellinger distance shows a distance of 0.352 between the real and synthetic datasets, indicating that the multivariate distributions are moderately similar. The random cohort utility assessment showed a mean Hellinger distance across 100 random cohorts of 0.3039 (standard deviation: 0.0674) and a mean normalized Euclidean distance of 0.0639 (standard deviation: 0.1145). This indicates that randomly generated sub-cohorts in the real and synthetic datasets are quite similar. The multivariate Hellinger distance and the random cohort Hellinger results are also similar to each other, demonstrating some consistency across utility evaluations.

Fig. 5 Hellinger distance for each event attribute

Fig. 6 First order Markov transition matrices for the real and synthetic datasets and the absolute difference in transition matrices. Note that the heatmaps have different scales

Workload aware assessment

The workload-aware assessment of utility was conducted on 75,660 real patient records and 75,660 synthetic records. Standardized mean differences (SMD) indicated that no clinically important differences were noted with respect to demographics and the comorbidity score between the real and synthetic data (Table 5). For example, between the real and synthetic data the mean age was 43.32 vs 44.79 (SMD 0.078), 51.0% vs 52.5% male (SMD 0.029), and the Elixhauser comorbidity score was 0.96 vs 1.05 (SMD 0.055). However, differences that would be considered potentially clinically important were noted for laboratory data, with standardized mean differences between the real and synthetic data > 0.1, a threshold often recommended for declaring imbalance.

Table 5 Comparison of trial characteristics across the real and synthetic datasets

                            Real (n = 75,660)    Synthetic (n = 75,660)   SMD
Age                                                                       0.078
  Mean (SD)                 43.32 (17.87)        44.79 (19.83)
  Median (IQR)              42.00 [27.00]        43.00 [30.00]
Sex, n (%)                                                                0.029
  Male                      38,623 (51.0)        39,711 (52.5)
  Female                    37,037 (49.0)        35,949 (47.5)
Elixhauser                                                                0.055
  Mean (SD)                 0.96 (1.58)          1.05 (1.63)
  Median (IQR)              0.00 [1.00]          0.00 [2.00]
ALT                                                                       0.099
  Mean (SD)                 31.67 (63.90)        40.72 (111.92)
  Median (IQR)              24.00 [18.00]        26.00 [19.00]
eGFR                                                                      0.112
  Mean (SD)                 85.82 (23.56)        83.11 (25.05)
  Median (IQR)              87.00 [41.00]        84.00 [38.00]
HCT                                                                       0.291
  Mean (SD)                 0.42 (0.05)          0.41 (0.06)
  Median (IQR)              0.42 [0.05]          0.41 [0.06]
CACS-RIW                                                                  0.002
  Mean (SD)                 0.05 (0.07)          0.05 (0.07)
  Median (IQR)              0.03 [0.03]          0.03 [0.03]
RIW                                                                       0.002
  Mean (SD)                 1.40 (2.73)          1.40 (2.40)
  Median (IQR)              0.77 [0.82]          0.81 [0.84]
Opioid utilization, n (%)
  Morphine                  1758 (2.3)           2649 (3.5)               0.070
  Oxycodone                 73,902 (97.7)        73,011 (96.5)
Antidepressant use, n (%)   28,224 (37.3)        29,651 (39.2)            0.039

The cumulative follow-up time post-receipt of the index opioid prescription and the outcomes of interest for the real and synthetic data are summarized in Table 6. Based on SMD, cumulative follow-up time (mean of 1474.48 vs 1077.88; SMD: 0.530) and mortality (3299 vs 1440; SMD: 0.141) yielded notable differences between the real and synthetic datasets.

Table 6 Outcomes of interest for both real and synthetic datasets

                                               Real (N = 75,660)   Synthetic (N = 75,660)   SMD
Total follow-up time, mean (SD)                1474.48 (772.23)    1077.88 (722.44)         0.530
Mortality, n (%)                               3299 (4.4)          1440 (1.9)               0.141
Hospitalization, n (%)                         22,495 (29.7)       21,582 (28.5)            0.027
Emergency room visit, n (%)                    64,376 (85.1)       65,193 (86.2)            0.031
Composite endpoint, n (%)                      64,848 (85.7)       65,497 (86.6)            0.025
Diagnosis of pneumonia (ICD10: J189), n (%)    505 (2.2)           472 (2.2)                0.004

After adjustment for age, sex, use of antidepressants, and laboratory data, the Cox proportional hazards were similar between the real and synthetic datasets. In the real data, oxycodone was associated with a 29% reduction in time to the composite endpoint compared to morphine (adjusted HR (aHR) 0.71, 95% CI 0.66–0.75). A similar reduction was observed in the synthetic dataset, with a 27% reduction in time to event (aHR 0.73, 95% CI 0.69–0.77) (Fig. 7 and Table 7). With respect to our secondary outcomes, similar trends were observed, with small differences noted in time to event between the synthetic and real data, with the exception of all-cause mortality (Fig. 7). With respect to all-cause mortality, although both the real and synthetic data would provide similar conclusions on the effect of oxycodone on mortality, the estimated effect was higher in the real data, with only a 38% confidence interval overlap (aHR 0.29 (95% CI 0.25, 0.33) vs aHR 0.35 (95% CI 0.29, 0.41)).

The confidence intervals and point estimates in the adjusted Cox regression analysis are also similar and would lead researchers to reach the same conclusion for many applications whether they analyzed the real or synthetic datasets. For the adjusted models, the mean confidence interval overlap is 68%. This indicates that the conclusions drawn from the synthetic datasets comfortably overlap those drawn from the real data.

Fig. 7 Adjusted hazard ratios for outcomes of interest in the synthetic data compared to the real data

Table 7 Adjusted hazard ratios and confidence interval overlap for outcomes of interest in real and synthetic datasets

Outcome                 Real data            Synthetic data       CI overlap (%)
Mortality               0.29 (0.25, 0.33)    0.35 (0.29, 0.41)    38%
Hospitalization         0.62 (0.57, 0.67)    0.64 (0.6, 0.68)     77%
Emergency room visit    0.76 (0.71, 0.81)    0.74 (0.71, 0.78)    76%
Composite endpoint      0.71 (0.66, 0.75)    0.73 (0.69, 0.77)    72%
Pneumonia               0.79 (0.5, 1.26)     0.7 (0.48, 1.03)     81%

Privacy assessment results

The privacy assessment showed that the population-to-sample risk was 0.001476 and the sample-to-population risk was 0.001474. Given that both risk values are substantially lower than the acceptable risk threshold of 0.09, we can conclude that the attribution disclosure risk associated with this synthetic dataset is acceptably low.

Discussion and conclusions

Summary

This project has generated realistic synthetic data for complex longitudinal administrative health records. Modelling events over time using a form of conditional LSTM has allowed us to learn patterns in the data over time, as well as how these trends relate to fixed baseline characteristics. The masking implemented during model training has allowed us to work with sparse attribute data from a variety of sources in a single model. Overall, this method of generating synthetic longitudinal health data has performed quite well from a data utility perspective.

Generic univariate and multivariate utility metrics based on the Hellinger distance varied from a low of 0.01 for event attributes to 0.35 for the joint distributions. Random cohort generation also had a mean Hellinger distance of 0.3 between real and synthetic cohorts generated from longitudinal data.

Our model learns and recreates patterns in the heterogeneous attributes, accounting for the pattern of relevant attributes based on event type. The generated sequences have event lengths that are consistent with the real data (percent difference in mean sequence length 0.4%). Baseline characteristics were synthesized to be consistent with the distributions in the real data (SMD 0.05 or lower) and to exert reasonable influence on the progression of events. There were differences in the univariate lab results between the real and synthetic datasets.

The multivariate Cox models incorporating the main variables of interest and confounders used to predict multiple outcomes were similar between the real and synthetic data, with substantial confidence interval overlap on the effect of oxycodone (mean CI overlap above 68%). Our work has shown the ability of synthetic data to reproduce the results of traditional epidemiologic analyses reasonably well. Additionally, we have demonstrated in this study that the privacy risks associated with this synthetic dataset are acceptably low when considering population-to-sample attacks (estimated risk: 0.001476) and sample-to-population attacks (estimated risk: 0.001474).

Contributions of this work

The conditional LSTM generative model described in this paper has worked well with real-world complex longitudinal data that has received minimal curation. This method allows the synthesis of associated cross-sectional and longitudinal health data, where the measures included correspond to a variety of medical events (e.g., prescriptions, doctor visits, etc.) and data types (e.g., continuous, binary, categorical). The longitudinal data generated varies in the number of observations per individual, reflecting the structure of real electronic health data. The model selected is easy to train and automatically adapts as the number of events, event attributes, or complexity of attributes changes.

We have also assessed the utility of the generated synthetic data using generic and workload-aware assessments, which have shown the similarity of our generated data to the real data on most univariate measures and for multivariate models. The privacy assessment has shown that the risks from the generated synthetic data are below generally accepted risk thresholds.

Architecturally, the generative model has a number of features which make it suitable for this type of data:

- Combining a tabular generative model as an input to the longitudinal generative model.
- Using masking on the loss function to focus only on the relevant attributes at a particular point in time.
- Dynamically weighting the loss for event attributes and event labels.
- The multiple embedding layers allow the model to handle heterogeneous data types.

The above features enabled the model to learn the patterns in the original dataset.

Limitations and future work

Our methods truncated the maximum sequence length at the 95th percentile. This means that data from individuals with the greatest number of interactions with the healthcare system are neither modelled nor synthesized, and therefore our synthetic data may not be applicable to those interested in assessing the impacts of individuals with high healthcare utilization.

This generative model was designed to learn and reproduce the relationships seen in the training dataset without tuning or optimizing for a specific analysis. Higher utility results may be achieved by tuning a synthesis model to a specific analysis; however, that may come at the cost of the generalizability of the data generated.

The approach we used in this study to compare the confidence intervals between the real and synthetic datasets did not account for the additional variance introduced by synthesis. While combining rules similar to those used for multiple imputation can be used to account for the additional variance [109, 110], some authors have suggested that parameter estimates and confidence intervals computed from a single synthetic dataset can still be valid. Future work should examine the additional benefit of considering this multiple imputation approach.

While there is a body of work on the synthesis of medical images and other data types, our focus in this paper was on structured longitudinal data. The synthesis of multi-modal data would be an important direction for future research.

Supplementary Information

The online version contains supplementary material available at https://doi.org/10.1186/s12874-023-01869-w.

Additional file 1: Appendix 1. Data Pre-Processing. Appendix 2. Random Cohort Utility Assessment. Appendix 3. Optimal Model Parameters.

Authors' contributions

KEE, LM, DE, and VS designed the study, performed the analysis, and wrote the paper. LD, LK, and BJ provided methodology and statistical advice and analysis. XHZ and SK performed components of the analysis. CC and DP supported the design of the study, provided project management, regulatory consultation, and coordination among all the parties. BH supported the privacy analysis and regulatory consultation. All authors approved the final manuscript.

Funding

This work was partially funded by the Canada Research Chairs program through the Canadian Institutes of Health Research, a Discovery Grant RGPIN-2016-06781 from the Natural Sciences and Engineering Research Council of Canada, Health Cities and the Institute of Health Economics, in partnership with Replica Analytics Ltd., the University of Alberta, and Alberta Innovates. The work on the fuzzy cohort utility measurement was supported by the Bill and Melinda Gates Foundation. This study is based in part on data provided by the Alberta SPOR SUPPORT (AbSPORU) Unit, Alberta Health Services, and Alberta Health. The interpretation and conclusions contained herein are those of the researchers and do not necessarily represent the views of the Government of Alberta, Alberta Health Services, or AbSPORU. Neither the Government of Alberta, Alberta Health, Alberta Health Services, nor AbSPORU expresses any opinion in relation to this study.

Availability of data and materials

The data that support the findings of this study were obtained from Alberta Health Services, but restrictions apply to the availability of these data, which were used under license for the current study, and so they are not publicly available. The dataset can be requested from Alberta Health Services under their data sharing program: https://www.albertahealthservices.ca/research/Page16074.aspx. An illustrative example of the analysis code is available from the authors upon request. Code to run the utility of random cohorts fuzzy SQL assessment is available here: https://github.com/skababji-ehil/fuzzy_sql.

Declarations

Ethics approval and consent to participate

This study was approved by the Research Ethics Board of the University of Alberta (#Pro00083807). All research was performed in accordance with relevant guidelines/regulations. Patient consent was not required / was waived by the research ethics board for this secondary analysis.

Consent for publication

N/A

Competing interests

This work was performed in collaboration with Replica Analytics Ltd. This company is a spin-off from the Children's Hospital of Eastern Ontario Research Institute. KEE is co-founder, SVP, and has equity in this company. LM and XHZ are data scientists employed by Replica Analytics Ltd. LD, VS, CC, BH, DP, LK, BJ, SK and DE have no competing interests.

Author details

1 Replica Analytics Ltd, Ottawa, ON, Canada. 2 Children's Hospital of Eastern Ontario Research Institute, 401 Smyth Road, Ottawa, ON K1J 8L1, Canada. 3 School of Epidemiology and Public Health, University of Ottawa, Ottawa, ON, Canada. 4 Department of Mathematical and Statistical Sciences, University of Alberta, Edmonton, AB, Canada. 5 School of Public Health, University of Alberta, Edmonton, AB, Canada. 6 Health Cities, Edmonton, AB, Canada. 7 B W Hamilton Consulting Inc., Edmonton, AB, Canada. 8 Institute of Health Economics, Edmonton, Alberta, Canada.

Received: 18 April 2022  Accepted: 19 February 2023

References

17. Sweeney L, Su Yoo J, Perovich L, Boronow KE, Brown P, Brody JG. Re-identification risks in HIPAA Safe Harbor data: a study of data from one environmental health study. J Technol Sci. 2017;2017082801:1–70.
18. Su Yoo J, Thaler A, Sweeney L, Zang J. Risks to patient privacy: a re-identification of patients in Maine and Vermont statewide hospital data. J Technol Sci. 2018;2018100901:1–62.
19. Sweeney L. Matching known patients to health records in Washington State data. Cambridge: Harvard University, Data Privacy Lab; 2013. Available: https://dataprivacylab.org/projects/wa/1089-1.pdf. Accessed 9 July 2019.
20. Sweeney L, von Loewenfeldt M, Perry M. Saying it's anonymous doesn't make it so: re-identifications of 'anonymized' law school data. J Technol Sci. 2018;2018111301:1–108.
21. Zewe A. Imperiled information: students find website data leaks pose greater risks than most people realize: Harvard John A. Paulson School of Engineering and Applied Sciences; 2020. https://www.seas.harvard.edu/