HIV Viral Load Hotspots Prediction in Kenya (2023)
2023
Nancy Kagendi and Matilu Mwau
Summary
This research article details a machine learning model developed to predict HIV viral load hotspots in Kenya. The model utilizes real-world data from the Ministry of Health to identify high-risk areas for increased HIV viral load. The study aims to provide an early warning system for healthcare administrators to optimize treatment and resource allocation.
Full Transcript
RESEARCH ARTICLE

A Machine Learning Approach to Predict HIV Viral Load Hotspots in Kenya Using Real-World Data

Nancy Kagendi and Matilu Mwau*
Kenya Medical Research Institute, Nairobi, Kenya.
*Address correspondence to: [email protected]

Citation: Kagendi N, Mwau M. A Machine Learning Approach to Predict HIV Viral Load Hotspots in Kenya Using Real-World Data. Health Data Sci. 2023;3:Article 0019. https://doi.org/10.34133/hds.0019
Submitted 7 November 2022. Accepted 25 April 2023. Published 2 October 2023.
Copyright © 2023 Nancy Kagendi and Matilu Mwau. Exclusive licensee Peking University Health Science Center. No claim to original U.S. Government Works. Distributed under a Creative Commons Attribution License 4.0 (CC BY 4.0).
Downloaded from https://spj.science.org on July 25, 2024

Background: Machine learning models are not in routine use for predicting HIV status. Our objective is to describe the development of a machine learning model to predict HIV viral load (VL) hotspots as an early warning system in Kenya, based on routinely collected data by affiliate entities of the Ministry of Health. Based on World Health Organization’s recommendations, hotspots are health facilities with ≥20% people living with HIV whose VL is not suppressed. Prediction of VL hotspots provides an early warning system to health administrators to optimize treatment and resources distribution.

Methods: A random forest model was built to predict the hotspot status of a health facility in the upcoming month, starting from 2016. Prior to model building, the datasets were cleaned and checked for outliers and multicollinearity at the patient level. The patient-level data were aggregated up to the facility level before model building. We analyzed data from 4 million tests and 4,265 facilities. The dataset at the health facility level was divided into train (75%) and test (25%) datasets.

Results: The model discriminates hotspots from non-hotspots with an accuracy of 78%.
The F1 score of the model is 69% and the Brier score is 0.139. In December 2019, our model correctly predicted 434 VL hotspots in addition to the observed 446 VL hotspots.

Conclusion: The hotspot mapping model can be essential to antiretroviral therapy programs. This model can support decision-makers in identifying VL hotspots ahead of time using cost-efficient, routinely collected data.

Introduction

Globally, in 2021, 38.4 million people were living with HIV/AIDS, and 650,000 died of HIV-related illnesses. Even with different lines of successful HIV treatment regimens available via global funding, the HIV epidemic remains critical in sub-Saharan Africa (SSA), which accounts for two-thirds of the world’s HIV+ population. Before treatments were available, the life expectancy of an HIV-infected person was 10 to 12 years, but with appropriate treatment, it is now possible to live a healthy and full life with HIV.

Sub-Saharan African countries have the highest rates of HIV prevalence. Kenya jointly bears the burden of being the third largest epidemic region, in terms of people living with HIV (PLHIV), along with Uganda and Mozambique. In the past decade, Kenya has made steady progress in combating the HIV epidemic by implementing progressive disease management policies and procedures. This has resulted in a 44% decrease in new HIV infections from 2010 to 2019. Kenya is committed to eliminating HIV as a public health threat by the end of 2030. However, by 2020, the target of reducing new HIV infections by 75% was not achieved. In 2021, 15 counties saw high levels of new infections, thus reversing progress made in the past decade. Missed opportunities to provide HIV testing and treatment services are a primary barrier to a steady decline in new HIV infections.

Kenya has led the way in HIV viral load (VL) monitoring in SSA, with an increase in the number of facilities conducting VL tests since 2016 and an efficient laboratory system. The Kenyan government, together with non-governmental organizations (NGOs) and local community volunteers, is actively trying to curb the spread of HIV by linking patients to health facilities, educating local communities, distributing treatments, and monitoring patients who are on treatment. International organizations such as the United States Agency for International Development and Clinton Health Access Initiative support various initiatives to ensure that PLHIV have access to treatment and care. Kenya’s Ministry of Health strives to improve the VL testing program by strengthening patient tracking mechanisms and VL result utilization. Kenya has a rich and consolidated VL data dashboard that contains routinely collected VL test result information. This dashboard can be leveraged to assess the VL program, follow patient behavior toward medication and treatment, introduce new policies, and identify areas that need improvement. The National AIDS and STI Control Programme (NASCOP) is a unit within the Ministry of Health that coordinates HIV and AIDS programs in Kenya. Its publicly available dashboard is an analytic platform that stores collected data and follows trends in VL test data and patient outcomes.

Monitoring patients based on CD4 counts and clinical signs often delays proper care, thus increasing the risk of developing antiretroviral drug resistance (DR). Measurement of HIV-1 RNA levels via a VL test is an essential tool to monitor a patient’s HIV status. Per local standard of care, a patient is VL non-suppressed if they have a VL of at least 1,000 copies/ml after a minimum of 6 months on antiretroviral therapy with adherence.

Regular VL monitoring in patients can reduce morbidity and mortality due to VL non-suppression (VLNS) by optimizing treatment of failing/virally non-suppressed patients. A nation’s approach toward tackling an epidemic is a public health issue at an administrative level (e.g., health facility, subcounty, and county) rather than at the individual level. Patient VL data can be aggregated at an administrative level and followed to provide a time-series trend of data.

With the emergence of real-world data, data science, and machine learning (ML) methods in the field of public health, the availability of NASCOP VL data has immense potential yet to be exploited. Identifying hotspots of public health threats has been done by epidemiologists in the past [12,13]. Several studies have explored HIV hotspot identification to quickly detect centers of HIV transmission and implement effective interventions by screening at-risk patients. In a novel approach, we use available VL datasets to predict an HIV hotspot, at a future time point, at an administrative level. Use of routinely collected data is cost-efficient as no extra resources are spent on the collection of data.

ML models have the potential to unlock answers from large and complex datasets. Herein, we describe the use of ML [...]

[...] the health facilities that provide these services.

The health facility data are obtained from the website “Kenya Master Health Facility List” (http://kmhfl.health.go.ke/#/home). This website contains information like geographical location (names of counties and subcounties), administrative location (ward), ownership (name of the private or government entity), regulatory body type, and services offered by all Kenyan health facilities and community units.

We also used pseudonymized laboratory patient-level VL data from NASCOP collected between January 2015 and December 2019. Some predictor variables of interest are demographic characteristics such as age and gender; dates of VL testing, collection of test specimen, and dispatch of test results; sample type, antiretroviral regimen, test result, and test justification.

These 2 datasets were linked by key variable(s), health facility name and code, that were common in both data files. The laboratory data contained the names of the facilities that were used as a unique key variable to link the patient-level data with the facility data.

Data setup

Data cleaning is the first important task in a data science project. The source patient data had a unique identifier, patient ID, to label the pseudonymized patient data. The patient IDs are system-generated codes received with the laboratories dataset, and the authors of this article had no knowledge of their algorithm. Prior to model development, we conducted rigorous data pre-processing. The patient ID data field was refined to rule out data entry errors and repetition of unique IDs with a combination of originally available patient ID, gender, date of birth, and county. All character variables were converted to lowercase, and any leading and trailing spaces were removed and collapsed as one single word. This was particularly helpful to clean special characters from names of facilities and counties. This exercise also ensured proper matching of rows when 2 datasets were being linked with a character variable. The date variables were harmonized to reflect the yyyy-mm-dd format. We attributed dates to the first of the corresponding month if only month and year were present. VL test results had values in various formats, e.g., a numeric entry such as “4”, “1,000”, etc.; “< LDL copies/ml”, “< 150 cp/ml”, etc.; and texts like “Target Not Targeted”, “Sample not received”, “not detected”, “INVALID”, etc. All valid test results were cleaned thoroughly and classified into 3 categories: lower than the detection limit (LDL), low level viremia (LLV), and high viral load (HVL). We attribute a VL test result as LDL if the VL is < [...]

[...] >20% VLNS HIV patients. The outcome variable of hotspot vs. non-hotspot is derived from the VL test result. We combined LLV and LDL outcomes to indicate VLS status while HVL indicated VLNS status of a patient. Thus, VL test results were treated as dichotomous variables where a patient could either be VLS or VLNS. Our dataset had approximately 70% non-hotspot vs. 30% hotspot facilities, indicating a slight imbalance.

The predictor variables used were derived from the laboratory and health facility datasets. The aggregation of data at the facility level offered the opportunity to feature engineer predictor variables that may be relevant at the facility level. Since our data were time specific (month–year) at each facility, we summarized the existing variables at this granularity. A few examples of the derived predictor variables are as follows:

1. Number of VL tests and patients per facility/per month–year
2. Number of VL tests and patients per regimen per facility/per month–year
3. Number of regimens per facility/per month–year
4. Number of patients in pre-defined age brackets per facility/per month–year
5. Average and standard deviation of time difference between 2 time points of test logistics (e.g., collected and tested) per facility/per month–year
6. Average, maximum, and minimum times between 2 VL tests for a patient per facility/per month–year

We also created several lagged predictor variables, including the lagged value of hotspot status in the previous months. Feature engineering of predictor variables alters the feature space by creating new variables that add information and value to the learning process. Since our dataset had a limited number of predictor variables restricted to facility data and laboratory VL tests, deriving new features at the facility level strengthened our model development effort.
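The aggregation and lagging steps described above can be illustrated with a minimal sketch in plain Python. It is not the authors' pipeline: the record fields (`facility`, `month`, `patient_id`, `copies_per_ml`), the function names, and the toy records are hypothetical. The 1,000 copies/ml cut-off and the ≥20% hotspot rule come from the text; counting a patient as non-suppressed when any of their tests in a month reaches the cut-off, and taking the lag from the previous recorded month, are simplifying assumptions.

```python
from collections import defaultdict

VLNS_COPIES_PER_ML = 1000  # text: non-suppressed at >= 1,000 copies/ml
HOTSPOT_SHARE = 0.20       # abstract: hotspot at >= 20% non-suppressed


def normalize_name(name):
    """Lowercase and trim a character field before linking datasets."""
    return name.strip().lower()


def facility_month_rows(tests):
    """Aggregate patient-level tests to one row per (facility, month)."""
    groups = defaultdict(list)
    for t in tests:
        groups[(normalize_name(t["facility"]), t["month"])].append(t)
    rows = {}
    for key, group in groups.items():
        patients = {t["patient_id"] for t in group}
        # Assumption: a patient is non-suppressed in a month if any of
        # their tests that month is at or above the threshold.
        vlns = {t["patient_id"] for t in group
                if t["copies_per_ml"] >= VLNS_COPIES_PER_ML}
        share = len(vlns) / len(patients)
        rows[key] = {
            "n_tests": len(group),
            "n_patients": len(patients),
            "vlns_share": share,
            "hotspot": share >= HOTSPOT_SHARE,
        }
    return rows


def add_lagged_hotspot(rows):
    """Add each facility's hotspot status from its previous recorded month."""
    months_by_facility = defaultdict(list)
    for facility, month in rows:
        months_by_facility[facility].append(month)
    for facility, months in months_by_facility.items():
        previous = None  # no lag exists for a facility's first month
        for month in sorted(months):  # "yyyy-mm" strings sort chronologically
            rows[(facility, month)]["hotspot_lag1"] = previous
            previous = rows[(facility, month)]["hotspot"]
    return rows


# Hypothetical toy records, not data from the study.
tests = [
    {"facility": " Clinic A ", "month": "2019-10", "patient_id": "p1", "copies_per_ml": 50},
    {"facility": "clinic a", "month": "2019-10", "patient_id": "p2", "copies_per_ml": 1500},
    {"facility": "Clinic A", "month": "2019-11", "patient_id": "p1", "copies_per_ml": 40},
    {"facility": "clinic a", "month": "2019-11", "patient_id": "p3", "copies_per_ml": 200},
]
rows = add_lagged_hotspot(facility_month_rows(tests))
```

In this toy input, the three spellings of the facility name collapse to one key, so the October and November rows belong to the same facility and the November row carries October's hotspot status as its lagged predictor.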
Outliers were checked for each predictor variable and treated as missing. Imputation of missing values was done at the facility-level dataset. Character variables were treated as factors and missing values were assigned a separate class. Numeric variables were imputed by replacing missing values with the average of the available value for each variable. We assumed that our data were missing completely at random (MCAR) since missing data implied data entry issues, or lost or damaged samples in the lab. Several imputation methods were considered, but we decided to use mean imputation because it is simple to implement and works for MCAR data. We checked for multicollinearity in the final dataset and dropped predictor variables that were correlated in order to remove redundant information from data.

We developed a random forest model to predict HIV hotspots in Kenya, 1 month ahead in time, via a supervised learning approach. The outcome variable was hotspot status of a health facility and predictor variables were derived from the laboratory and facility datasets. A random forest classification model is an ensemble of several decision trees that classifies by voting for the most popular class (Fig. 1).

Fig. 1. Representation of a simplified random forest model (green and blue dots represent 2 distinct classes).

Table 1. Confusion matrix prototype.

                     | Predicted hotspot                                  | Predicted non-hotspot
Actual hotspot       | True positive (TP)                                 | False negative (FN) (type II error or missed opportunity)
Actual non-hotspot   | False positive (FP) (type I error or false alarm)  | True negative (TN)

The model dataset was partitioned into train and test datasets in a 75–25 split. Of the original data, 25% was set aside to test the proposed model. Note that, since our target was imbalanced, we split the data based on the outcome variable such that the imbalance proportion was retained in both train and test datasets.
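As a rough illustration of three ideas from this passage (a split that preserves the class ratio, classification by majority vote across trees, and the four cells of Table 1), the following plain-Python sketch may help. It does not train a random forest and is not the authors' code; the function names, the seed, and the toy labels are hypothetical, and only the 75–25 split and the roughly 30%/70% class balance are taken from the text.

```python
import random
from collections import Counter


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx), keeping the class ratio in both parts."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def majority_vote(tree_votes):
    """Forest prediction: the class predicted by the most trees."""
    return Counter(tree_votes).most_common(1)[0][0]


def confusion_counts(actual, predicted, positive=1):
    """Count the four cells of Table 1, with hotspot as the positive class."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if a == positive:
            counts["TP" if p == positive else "FN"] += 1
        else:
            counts["FP" if p == positive else "TN"] += 1
    return counts


# Hypothetical labels: 30% hotspot (1) and 70% non-hotspot (0), as in the text.
labels = [1] * 120 + [0] * 280
train_idx, test_idx = stratified_split(labels)

# Three hypothetical trees vote on one facility; two of them say "hotspot".
forest_prediction = majority_vote([1, 0, 1])

# Hypothetical actual vs. predicted labels for four facilities.
cells = confusion_counts(actual=[1, 1, 0, 0], predicted=[1, 0, 1, 0])
```

Splitting within each class separately is what keeps the 30%/70% proportion in both parts. With an even number of trees a tie is possible, and this sketch would then return whichever tied class was seen first in the vote list.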
Due to data imbalance, reporting just model accuracy is not appropriate. We introduce several other performance metrics that validate model performance for imbalanced data. Sensitivity is the proportion of truly identified hotspots (true positive) among all actual hotspots (positive), and precision is the model’s ability to predict a true hotspot (true positive) as opposed to predicting a non-hotspot as hotspot (false positive) (Table 1).

Typically, we aim to reduce both type I and type II errors (Table 1), but with a fixed sample size, minimizing both errors simultaneously is not always feasible. Thus, we try to minimize the error that would have serious repercussions if not controlled. In this setting, missing a hotspot could adversely impact treatment behavior and delivery of many vulnerable patients. Our motivation was, thus, to reduce the type II error (Sensitivity = 1 – type II error) or increase the proportion of truly identified hotspots among all hotspots. On the other hand, we also did not want our model to produce too many false alarms, i.e., reduce the model precision. Too many false alarms can cause wastage of resources and cripple administrative policies in hours of need, especially in low- and middle-income countries (LMICs). Thus, we chose a threshold value of predicted score that optimizes the trade-off between sensitivity and precision.

The Brier score was a measure developed to scale the accuracy of weather forecasts based on Euclidean distance between the actual outcome and the predicted probability assigned to the outcome for each observation. The score ranges between 0 and 1, with lower scores being desirable. Other common model metrics like specificity, F1 score, and area under the curve (AUC) of receiver operating characteristic (ROC) and precision–recall (PR) curves were also reported. The F1 score is a harmonic mean of sensitivity and precision metrics. The higher the value (ranges between 0 and 100%) of AUC on the ROC and PR curves, the better is the model fit.

We have, further, explored model performance using data beyond 2019. We used laboratory and health facility data from January 2020 to March 2022. We wanted to perform a validation with completely out-of-sample data to test the robustness of our model. All model performance metrics were derived based on this dataset along with data characteristics visualizations.

Results

Data characteristics

The VL data from 2015 to 2019 had more than 4 million test records. The tests increased over time: while 176,415 tests were performed in 2015, around 1.5 million tests were performed in 2019. This was a result of the increase in number of facilities joining NASCOP, improved test turnaround time of the VL testing laboratories, and scaled up patient outreach for VL tests. The data repository consisted of VL test records of over 2 million patients generated in 4,265 health facilities, of which 3,707 are uniquely linked to patients. Routine VL tests were usually done annually, but a VL test could be done at any time prior to the annual routine test if a patient’s health deteriorated or if there was a life-changing event like pregnancy.

Only 7 variables in our dataset had the issue of missingness (Fig. 2). There are more female patients (68%) tested than male patients due to their health-seeking behavior during childbearing ages; their distribution over the years is captured in Fig. 3. In every age group, 60% to 80% HIV+ patients were virologically suppressed at the time of their test (Fig. 4); nevertheless, more than 20% of the patients were non-suppressed, with at least 1,000 copies/ml, in the younger age categories (i.e.,

[...] hotspot against non-hotspots in the upcoming month was applied on the test dataset to validate its performance. We achieved an accuracy of 78% on the test dataset. The sensitivity is 70% and the specificity is 83% (Table 3).

Our model had an F1 score of 69% with 68% precision. The AUC of the ROC curve was 86% and the PR AUC was 79% (Fig. 5), and the model had a Brier score of 0.139.

While VL results in prior months were the 2 most important variables, feature engineered variables with time of treatment initiation, testing, test collection, test dispatch to laboratories, and test receipt times were some of the important variables. We also saw some factors related to the health facilities like county, ward, regulatory body, and subcounty feature as important variables in the model. The variable importance bar plot is added as a Supplementary Material (Fig. S1).

Hotspot visualization

Predicted HIV hotspot distribution at the county and subcounty levels for December 2019 is included (Fig. 6); the prediction was made in November 2019. The color ranges from light yellow to brown; light yellow means a lower percent of hotspots and brown means a higher percent of hotspots. % Hotspots were calculated based on all facilities present in that area (as