Artificial intelligence models often play a key role in medical diagnosis, particularly in the analysis of images such as X-rays. Research has shown, however, that these models do not perform equally well across all demographic groups, and they often perform worse on women and minority patients.
The models have also shown some unexpected abilities. In 2022, researchers at MIT discovered that AI models can accurately predict a patient's race from chest X-rays, something even the most skilled radiologists cannot do. A recent study by the same team shows that the models most accurate at predicting demographic data also exhibit the largest "fairness biases": discrepancies in how accurately they diagnose images of people of different races or genders. The researchers say the findings suggest these models may be relying on "demographic shortcuts" when making diagnostic assessments, leading to inaccurate results for women, Black patients, and other groups.
"It is well known that high-capacity machine learning models predict human demographics such as self-reported race, gender, or age very well. This work reaffirms that ability, and then links that ability to performance deficiencies among different groups, which had not been done before," says Marzyeh Ghassemi, an associate professor of electrical engineering and computer science at MIT, a member of MIT's Institute for Medical Engineering and Science, and the senior author of the study.
Researchers also found that they could retrain models in ways that improve their fairness. However, their "debiasing" approaches worked best when the models were tested on the same types of patients on whom they were trained, such as patients from the same hospital. When these models were applied to patients from different hospitals, biases reappeared.
"I think the main takeaways are first, thoroughly evaluate any external model on your own data because any fairness guarantees provided by model developers on their training data may not transfer to your population. Second, whenever enough data is available, you should train models on your own data," says Haoran Zhang, a student at MIT and one of the lead authors of the new paper. MIT student Yuzhe Yang is also a lead author of the paper, which was published today in the journal Nature Medicine. Judy Gichoya, an assistant professor of radiology and imaging sciences at Emory University School of Medicine, and Dina Katabi, the Thuan and Nicole Pham professor of electrical engineering and computer science at MIT, are also authors of the paper.
As of May 2024, the FDA has approved 882 AI-supported medical devices, of which 671 are intended for use in radiology. Since 2022, when Ghassemi and her colleagues demonstrated that these diagnostic models can accurately predict race, they and other researchers have shown that such models are also very good at predicting gender and age, even though the models were not trained for those tasks.
"Many popular machine learning models have superhuman demographic prediction capabilities—radiologists cannot detect self-reported race from a chest X-ray," says Ghassemi. "These are models that are good at predicting disease, but during training, they also learn to predict other things that may not be desirable."
In this study, the researchers wanted to explore why these models do not work equally well for certain groups. They particularly wanted to see if the models were using demographic shortcuts to make predictions that ended up being less accurate for some groups. These shortcuts can appear in AI models when they use demographic attributes to determine the presence of a medical condition instead of relying on other image features.
Using publicly available chest X-rays from Beth Israel Deaconess Medical Center in Boston, the researchers trained models to predict whether patients had one of three different medical conditions: fluid buildup in the lungs, lung collapse, or heart enlargement. They then tested the models on X-rays that were not included in the training data.
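The article does not include the team's training code, but the setup it describes is a standard multi-label image classifier. The sketch below is a minimal illustration of that kind of pipeline, assuming a PyTorch/torchvision backbone and a hypothetical `ChestXrayDataset` that yields image tensors together with a three-finding label vector; it is not the authors' implementation.

```python
# Minimal sketch of the kind of training setup described above (not the authors' code).
# Assumes a hypothetical PyTorch Dataset that yields (image_tensor, label_vector) pairs,
# where the label vector covers three findings: effusion (fluid buildup in the lungs),
# pneumothorax (lung collapse), and cardiomegaly (heart enlargement).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import models

NUM_FINDINGS = 3  # effusion, pneumothorax, cardiomegaly

# Pretrained image backbone with a new multi-label classification head.
model = models.densenet121(weights=models.DenseNet121_Weights.DEFAULT)
model.classifier = nn.Linear(model.classifier.in_features, NUM_FINDINGS)

criterion = nn.BCEWithLogitsLoss()  # each finding is an independent yes/no label
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_epoch(model: nn.Module, loader: DataLoader) -> float:
    """One pass over the training X-rays; returns the mean loss."""
    model.train()
    total = 0.0
    for images, labels in loader:
        optimizer.zero_grad()
        logits = model(images)
        loss = criterion(logits, labels.float())
        loss.backward()
        optimizer.step()
        total += loss.item()
    return total / max(len(loader), 1)
```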
Overall, the models performed well, but most showed "fairness biases"—i.e., discrepancies in accuracy rates for men and women, and for white and Black patients.
The models could also predict the gender, race, and age of the subjects from the X-rays. Additionally, there was a significant correlation between each model's accuracy in making demographic predictions and the size of its fairness biases. This suggests that the models may be using demographic categorizations as shortcuts for making their disease predictions.
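One way to make the reported correlation concrete: for each model, measure how well it predicts a demographic attribute (its AUC) and the gap in diagnostic AUC between subgroups, then correlate the two quantities across models. The sketch below illustrates that calculation with scikit-learn and SciPy; the variable names and numbers are placeholders, not results from the study.

```python
# Sketch of the analysis pattern described above: relate each model's ability to
# predict demographics to the size of its fairness gap. Illustrative only.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import roc_auc_score

def fairness_gap(y_true, y_score, group):
    """Difference in diagnostic AUC between the best- and worst-served subgroups."""
    y_true, y_score, group = map(np.asarray, (y_true, y_score, group))
    aucs = [roc_auc_score(y_true[group == g], y_score[group == g])
            for g in np.unique(group)]
    return max(aucs) - min(aucs)

# Hypothetical per-model summaries collected elsewhere (placeholder values):
#   demo_auc[i] = AUC of model i at predicting the demographic attribute
#   gap[i]      = fairness_gap(...) of model i on the diagnostic task
demo_auc = np.array([0.71, 0.80, 0.86, 0.93])
gap = np.array([0.02, 0.04, 0.05, 0.09])

r, p = pearsonr(demo_auc, gap)
print(f"demographic-prediction AUC vs. fairness gap: r={r:.2f}, p={p:.3f}")
```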
The researchers then tried to reduce fairness biases using two types of strategies. They trained one set of models to optimize "subgroup robustness," meaning the models were rewarded for performing better on the subgroup on which they performed worst, and penalized if their error rate for one group was higher than for the others.
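A common way to implement this kind of subgroup-robust objective is to optimize the loss of the worst-performing group in each batch (in the spirit of group DRO). The sketch below shows that idea under those assumptions; it is not necessarily the exact method used in the paper.

```python
# Sketch of a worst-group ("subgroup robustness") objective, not the authors' exact method.
# Assumes multi-label logits of shape (batch, findings) and an integer tensor `group`
# identifying each sample's demographic subgroup.
import torch
import torch.nn.functional as F

def worst_group_loss(logits: torch.Tensor,
                     labels: torch.Tensor,
                     group: torch.Tensor) -> torch.Tensor:
    """Return the loss of the worst-performing subgroup in the batch, so the
    optimizer is pushed hardest where the errors are largest."""
    per_sample = F.binary_cross_entropy_with_logits(
        logits, labels.float(), reduction="none"
    ).mean(dim=1)  # one loss value per sample
    group_losses = [per_sample[group == g].mean() for g in torch.unique(group)]
    return torch.stack(group_losses).max()
```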
For the other set of models, the researchers used "adversarial" approaches to force them to remove all demographic information from the images. Both strategies proved quite effective, the researchers found.
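Adversarial removal of demographic information is typically built with an auxiliary head that tries to predict the demographic attribute from the model's features, combined with a gradient reversal layer so the feature extractor is pushed to discard that signal. The sketch below shows one such generic construction; it is an assumption about the general technique, not the paper's specific architecture.

```python
# Sketch of a gradient-reversal adversarial debiasing setup (a common pattern,
# not necessarily the one used in the study).
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips (and scales) gradients on the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class AdversarialDebiaser(nn.Module):
    def __init__(self, feature_dim: int, num_findings: int, num_groups: int, lam: float = 1.0):
        super().__init__()
        self.lam = lam
        self.disease_head = nn.Linear(feature_dim, num_findings)
        self.demographic_head = nn.Linear(feature_dim, num_groups)

    def forward(self, features: torch.Tensor):
        disease_logits = self.disease_head(features)
        # Reversed gradients make the shared backbone *worse* at encoding demographics,
        # while the demographic head still tries its best to recover them.
        reversed_feats = GradReverse.apply(features, self.lam)
        demo_logits = self.demographic_head(reversed_feats)
        return disease_logits, demo_logits
```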
"For in-distribution data, you can use existing state-of-the-art methods to reduce fairness biases without significant compromises in overall performance," says Ghassemi. "Subgroup robustness methods force models to be sensitive to prediction errors in specific groups, and adversarial methods try to remove group information completely."
However, these approaches only worked when the models were tested on data from the same types of patients on whom they were trained—for example, only patients from the Beth Israel Deaconess Medical Center dataset.
When the researchers took the models that had been debiased on the Beth Israel Deaconess Medical Center data and tested them on patients from five other hospital datasets, they found that overall accuracy remained high, but some models still showed significant fairness biases.
"If you debias a model on one set of patients, that fairness does not necessarily hold when you switch to a new set of patients from another hospital at another location," says Zhang.
This is concerning because in many cases, hospitals use models developed on data from other hospitals, especially when purchasing an off-the-shelf model, the researchers say.
"We found that even state-of-the-art models that are optimally performed on data similar to their training datasets are not optimal—that is, they do not make the best trade-off between overall performance and subgroup performance—in new environments," says Ghassemi. "Unfortunately, this is likely how the model is applied. Most models are trained and validated with data from one hospital or one source, and then widely applied."
Researchers found that models that were debiased using adversarial approaches showed slightly greater fairness when tested on new patient groups compared to those debiased with subgroup robustness methods. They now plan to develop and test additional methods to see if they can create models that make fairer predictions on new datasets.
The findings suggest that hospitals using these AI models should evaluate their effectiveness on their own patient populations before putting them to use to ensure they do not produce inaccurate results for certain groups.
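In practice, such a local evaluation amounts to scoring the model on a site's own labeled cases and comparing performance across demographic subgroups. The sketch below illustrates one way a hospital might run that audit; the column names are assumptions for illustration, not part of the study.

```python
# Sketch of a local pre-deployment audit: per-subgroup AUC on a site's own data,
# plus the gap to the best-served subgroup. Column names are hypothetical.
import pandas as pd
from sklearn.metrics import roc_auc_score

def audit_by_subgroup(df: pd.DataFrame, label_col: str,
                      score_col: str, group_col: str) -> pd.DataFrame:
    """Per-subgroup AUC and the gap to the best-performing subgroup.
    Assumes each subgroup contains both positive and negative cases."""
    rows = []
    for g, sub in df.groupby(group_col):
        rows.append({
            "subgroup": g,
            "n": len(sub),
            "auc": roc_auc_score(sub[label_col], sub[score_col]),
        })
    out = pd.DataFrame(rows)
    out["gap_to_best"] = out["auc"].max() - out["auc"]
    return out

# Example usage with hypothetical local results:
# audit = audit_by_subgroup(local_results, "effusion_label", "model_score", "self_reported_race")
```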
The research was funded by a Google Research Scholar Award, the Harold Amos Medical Faculty Development Program of the Robert Wood Johnson Foundation, RSNA Health Disparities, the Lacuna Fund, the Gordon and Betty Moore Foundation, the National Institute of Biomedical Imaging and Bioengineering, and the National Heart, Lung, and Blood Institute.
Source: Massachusetts Institute of Technology
Creation time: 2 July 2024