New AI tool helps computers break through language barriers in medical data

Data scientists have created an AI model that helps computers understand medical data written in different languages, with the aim of reducing bias and improving accuracy.

The new model, called Med-UniC, could help doctors and researchers better understand medical data from different parts of the world.

Medical data is often collected in different languages, with varying syntax, semantics and medical terminology. These discrepancies between different languages make it difficult for computers to understand and analyse medical data accurately.

In a new study published in NeurIPS 2023, data scientists Che Liu, Dr Sibo Cheng, Dr César Quilodrán Casas and Dr Rossella Arcucci from Imperial’s Data Science Institute and the Department of Earth Sciences and Engineering, in partnership with The Ohio State University, Peking University, The Chinese University of Hong Kong and The Hong Kong University of Science and Technology, found that differences between languages cause community bias, which can affect the performance of downstream tasks such as patient care.

They therefore proposed a new technique, known as Cross-lingual Text Alignment Regularization (CTR), that teaches computers to recognise the meaning of words and phrases that are similar across different languages.

Their work aims to ensure that doctors and researchers around the globe who work with medical information receive the most accurate information possible, regardless of the language they speak.

Community bias in cross-lingual medical data

Vision-language pre-training (VLP) is an emerging method used to analyse multimodal data, for example images and text. In medicine, it is used to train artificial intelligence models to understand medical data such as images and their accompanying reports, which may be written in multiple languages.

However, English, despite not being the primary native language for the vast majority of the global population, remains the dominant language used to train these machine-learning models, which leads to biases against non-English-speaking populations.

This is particularly prevalent in medical applications and can have a number of negative downstream effects on tasks related to medical data analysis. For example, biased models may produce inaccurate diagnoses or treatment recommendations, which can lead to poorer patient care.

Similarly, biased models may perpetuate existing health disparities by providing different levels of care to different communities.

Introducing Med-UniC – “a super-smart translator”

To mitigate community bias in these medical models, the Imperial team designed a framework called Med-UniC, which improves accuracy by aligning the meaning of medical reports across multiple languages.

Different languages may use different words or phrases to describe the same medical condition, and Med-UniC helps AI models account for these discrepancies.

Co-author Che Liu explains: “Think of Med-UniC as a super-smart translator. Not only does it translate the words in the reports, but it also ensures that the essence or meaning behind those words matches the X-ray image in the same way, regardless of the language.”

The Imperial-designed tool works by using a technique called Cross-lingual Text Alignment Regularisation, which is optimised through latent language disentanglement. This means it separates the language-specific information from the shared information in the medical reports, helping to ensure that cross-lingual representation is not biased towards any specific language community.
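As a rough illustration of what an alignment regulariser of this kind can look like, the sketch below penalises differences between the embeddings of the same reports written in two languages. It is not the authors' implementation: the decorrelation-style penalty, the function names and the `off_diag_weight` parameter are all assumptions made for this example, and the article does not specify the actual loss.

```python
import math


def standardise_columns(batch):
    """Return the feature columns of a batch, each centred and scaled to unit variance."""
    n, d = len(batch), len(batch[0])
    columns = []
    for j in range(d):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        # Fall back to 1.0 for a constant column to avoid dividing by zero.
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / n) or 1.0
        columns.append([(x - mean) / std for x in col])
    return columns


def cross_correlation(emb_a, emb_b):
    """Feature-by-feature correlation matrix between two paired batches of embeddings."""
    cols_a, cols_b = standardise_columns(emb_a), standardise_columns(emb_b)
    n = len(emb_a)
    return [[sum(x * y for x, y in zip(ca, cb)) / n for cb in cols_b] for ca in cols_a]


def alignment_penalty(emb_a, emb_b, off_diag_weight=0.1):
    """Hypothetical regulariser: small when the two languages give matching features.

    Row i of emb_a and row i of emb_b are assumed to embed the same report
    in two different languages (for example English and Spanish).
    """
    c = cross_correlation(emb_a, emb_b)
    d = len(c)
    # Each feature should correlate with its counterpart in the other language.
    on_diag = sum((1.0 - c[i][i]) ** 2 for i in range(d))
    # Different features should not carry redundant information.
    off_diag = sum(c[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
    return on_diag + off_diag_weight * off_diag


# Toy 2-dimensional embeddings for four reports.
english = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
aligned_spanish = [row[:] for row in english]          # same features as English
misaligned_spanish = [[b, a] for a, b in english]      # features swapped

print(alignment_penalty(english, aligned_spanish))
print(alignment_penalty(english, misaligned_spanish))
```

In this toy setting the penalty is lowest when both languages produce the same features for the same report, which is the behaviour a training procedure would reward.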

The team also incorporated a variety of ‘visual backbones’ to help the models learn and extract visual features from medical images, helping to improve their overall accuracy.

Next steps

The Imperial team focused primarily on English and Spanish in this work but hopes to expand to a broader range of languages to further minimise language bias.

According to Liu: “This work is the first to identify community bias in medical VLP stemming from diverse language communities and illustrates the importance of inclusivity beyond English-speaking communities. We hope our new framework will help improve healthcare systems worldwide by addressing this bias.”

'Med-UniC: Unifying Cross-Lingual Medical Vision-Language Pre-Training by Diminishing Bias' by Wan et al., published in NeurIPS 2023.