Michael Pencina, PhD, vice dean for data science at Duke University School of Medicine, spoke about using algorithms to assess differences in stroke risk among Americans.
Michael Pencina, PhD, corresponding author of a recently published study, and colleagues evaluated several algorithms, along with two artificial intelligence assessment methods, for predicting a patient’s risk of stroke in the next 10 years. Published in the Journal of the American Medical Association, the findings showed that all algorithms were worse at stratifying stroke risk for Black individuals than for White individuals, regardless of gender.1
Specifically, the study focused on risk ordering, that is, how likely a patient is to experience stroke in comparison with other patients, which is a critical approach for allocating limited medical resources.2 The study showed that a simple approach using answers to patient questions was the most accurate in assessing stroke risk at a population level, whereas machine learning methods failed to improve model performance.
In a recent interview with NeurologyLive®, Pencina discussed the main findings from the study, including an overview of the research and how generalizable the results are within the United States. Pencina, professor in the Department of Biostatistics and Bioinformatics and director of AI Health at Duke University School of Medicine, also discussed unexpected findings from the study as well as the focus of future investigations into health disparities in stroke.
NeurologyLive®: Can you give a brief overview of the study and, more specifically, your hypothesis going into the research?
Michael Pencina, PhD: The study was motivated by the need to do better in prevention of stroke. Stroke is the fifth largest cause of mortality in the US, and it's associated with very high morbidity. We know that the incidence rates of stroke are much higher among Black adults as compared with White adults. We also know that there are a number of predictive algorithms that estimate the risk of stroke. There is a Framingham Heart Study function; there is a function from the REGARDS study, using self-reported data; and there are the pooled cohort equations, originally designed for the larger endpoint of atherosclerotic cardiovascular disease (ASCVD), which involves heart attacks and strokes together, to inform the lipid guidelines. We wanted to see if stroke-specific functions do better than the ASCVD pooled cohort equations function. We also wanted to see how the functions performed in different groups defined by race (Black and White individuals, with a sufficient sample size), as well as in men and women and in older and younger individuals. Finally, we wanted to see if novel machine learning methods improve performance of the existing functions.
We took data from large NIH-funded cohorts: Framingham, MESA (Multi-Ethnic Study of Atherosclerosis), as well as REGARDS through our collaboration. We used the dbGaP repository in collaboration with investigators from the University of Alabama at Birmingham. We put the data together and applied the 3 existing functions, namely the pooled cohort equations, the REGARDS function, and the Framingham function, to our harmonized data, and then performed preplanned analyses. What we found was a big difference in model discrimination, defined by the C index, or the ability to risk rank individuals, between the performance of any of these 3 functions among Black versus White adults. Strikingly, the performance was much better among White adults than Black adults. The c-statistics were in the 0.7 range for White adults and in the 0.6 range for Black adults, and results were statistically significant for all 3 functions. We also found that model calibration was best for the REGARDS self-reported function. When we used advanced machine learning and deep learning techniques, we did not see any improvement in model performance.
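Editor's note: for readers unfamiliar with the two metrics mentioned above, the following is a minimal illustrative sketch, not the study's actual method. It assumes a simplified binary outcome (stroke or no stroke within the follow-up window) and ignores censoring, which a real time-to-event analysis would need to handle. The function names and the toy numbers are hypothetical. The c-statistic is the share of (event, non-event) patient pairs in which the patient who had the event was assigned the higher predicted risk; the observed-to-expected ratio is one simple summary of calibration.

```python
from itertools import product


def c_statistic(risks, events):
    """Fraction of (event, non-event) pairs ranked correctly by predicted risk.

    Simplified binary-outcome version; ties in predicted risk count as half.
    0.5 means no better than chance, 1.0 means perfect risk ordering.
    """
    cases = [r for r, e in zip(risks, events) if e == 1]
    controls = [r for r, e in zip(risks, events) if e == 0]
    if not cases or not controls:
        raise ValueError("need at least one event and one non-event")
    score = 0.0
    for case_risk, control_risk in product(cases, controls):
        if case_risk > control_risk:
            score += 1.0
        elif case_risk == control_risk:
            score += 0.5
    return score / (len(cases) * len(controls))


def observed_expected_ratio(risks, events):
    """Observed event rate divided by mean predicted risk.

    A value near 1 suggests good overall calibration; above 1 means the
    model underestimates risk on average, below 1 means it overestimates.
    """
    observed = sum(events) / len(events)
    expected = sum(risks) / len(risks)
    return observed / expected


# Hypothetical toy data: predicted 10-year risks and observed outcomes.
toy_risks = [0.1, 0.2, 0.3, 0.4]
toy_events = [0, 1, 0, 1]
print(c_statistic(toy_risks, toy_events))
print(observed_expected_ratio(toy_risks, toy_events))
```

In this toy example, three of the four (event, non-event) pairs are ordered correctly. Computing the same quantity separately within each subgroup is one way a gap like the 0.7 versus 0.6 difference described above would become visible.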
How generalizable are the results, particularly within the United States?
We have access to NIH-funded cohort studies. The Framingham data is more local, whereas the REGARDS study is primarily in the southeast of the US, and that's a large cohort. Both MESA and ARIC (Atherosclerosis Risk in Communities) have centers in different parts of the country. I would say that the results should be fairly generalizable, maybe with the understanding that this is cohort data, meaning that the quality of data collection is most likely much better than what we would see if we used, for example, electronic health records data.
Based on the results, did you find anything that was surprising or unexpected from your original hypothesis?
We did not expect that performance in model discrimination, or risk ordering, would be so vastly worse for Black adults than for White adults. We also did not expect that the REGARDS function, which is based on self-report (asking the patient whether they have diabetes or whether they smoke, rather than measuring blood glucose or taking more precise measurements), would hold its own, perform similarly to the other ones, and actually have better calibration. It might be related to the fact that REGARDS is the largest cohort available to us. It was also interesting, and maybe slightly unexpected, that the nonspecific function, the one that is aimed at predicting strokes mixed with heart attacks, did as well in model discrimination as the other functions. Our finding that machine learning did not contribute much was not entirely surprising, given our previous work on fairly standard clinical data that is not complex. There is growing evidence that the ability of machine learning to contribute something meaningful using data that's less complex is generally not high. Our results confirm that.
What would you say should be focused on in future investigations involving health disparities?
The fact that we found this major disparity in model discrimination in stroke risk ordering implies 2 things. One is that we may not be capturing all the risk factors that are at play for stroke. I think this is a call for better, broader data collection. We need to go out there, work with communities, work with individual patients, to get their true data and capture maybe more relevant data. I'm not talking about clinical data or fancy biomarkers, I'm talking about social determinants of health. I'm talking about social context, variables that we traditionally don't have, or have in limited amounts in these cohorts. That's probably one conclusion.
The second conclusion is that the performance of the REGARDS function based on self-report was as good as the other ones and had the best calibration, which creates a promising avenue for preventive strategies. Because if we don't have to measure risk factors and we can get a reasonably accurate portrayal of stroke risk, we can really go out to the community, partner with community organizations, with churches, with any groups that want to work with us, predict the risk of stroke, and start implementing preventive strategies. Again, the rates of stroke are too high. The science and the medicine are here to prevent it. But we need to get to the patients, raise awareness, accurately estimate the risk, and start preventive treatments and strategies.
Transcript edited for clarity.