• Synthetic Breast Ultrasound Images: A Study to Overcome Medical Data Sharing Barriers

    2024-12-02

    Professor JianQiao Zhou’s team at Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, has developed a latent diffusion model named CoLDiT. This model generates high-quality breast ultrasound images across different BI-RADS categories to overcome medical data sharing barriers. The study, titled “Synthetic Breast Ultrasound Images: A Study to Overcome Medical Data Sharing Barriers,” was published in the journal Research (DOI: 10.34133/research.0532).

     

    Medical big data holds immense potential for enhancing healthcare quality and advancing medical research. However, cross-center sharing of medical data, essential for constructing large and diverse datasets, raises privacy concerns and the risk of personal information misuse. Several methods have been developed to address this problem, but each has limitations: de-identification is prone to re-identification attacks, and differential privacy often compromises data utility by introducing noise. In regions with strict data-sharing regulations, federated learning has been proposed as a potential solution, enabling collaborative model training without sharing raw data; however, it remains vulnerable to privacy leakage through model updates or the final model. Safe and efficient medical data sharing therefore remains an urgent, open problem.
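As a concrete illustration of the noise-versus-utility trade-off mentioned above, the toy sketch below releases a dataset mean under the Laplace mechanism, a standard technique for achieving epsilon-differential privacy. The data, value bounds, and epsilon settings are illustrative assumptions, not drawn from the study.

```python
import numpy as np

def private_mean(values, epsilon, lo=0.0, hi=1.0, seed=0):
    """Release an epsilon-DP estimate of the mean via the Laplace mechanism.

    For n values bounded in [lo, hi], the sensitivity of the mean is
    (hi - lo) / n, so the required noise scale is sensitivity / epsilon.
    """
    rng = np.random.default_rng(seed)
    n = len(values)
    sensitivity = (hi - lo) / n
    noise = rng.laplace(0.0, sensitivity / epsilon)
    return float(np.mean(values)) + noise

# Toy data: 1,000 values in [0, 1].
data = np.random.default_rng(1).random(1000)
true_mean = float(data.mean())

loose = private_mean(data, epsilon=1.0)    # weak privacy, little noise
strict = private_mean(data, epsilon=0.01)  # strong privacy, much more noise
```

Smaller epsilon means stronger privacy but a noisier (less useful) released statistic, which is the utility cost noted above.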

     

    To address these challenges, Professor Zhou's team developed CoLDiT, a conditional latent diffusion model with a diffusion transformer (DiT) backbone, capable of generating high-resolution breast ultrasound images conditioned on BI-RADS category (BI-RADS 3, 4a, 4b, 4c, and 5) (see Figure 1). The training set for CoLDiT comprised 9,705 breast ultrasound images from 5,243 patients across 202 hospitals, acquired with ultrasound systems from a variety of vendors to ensure data diversity and comprehensiveness.


     

    Figure 1: Examples of CoLDiT-generated breast US images featuring different BI-RADS categories (including BI-RADS 3, 4a, 4b, 4c, and 5).
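To make the class-conditional generation idea concrete, the toy sketch below follows the shape of DDPM-style reverse sampling with a class condition. The latent size, step count, noise schedule, class embedding, and the stand-in denoiser are all illustrative assumptions; CoLDiT's actual DiT noise predictor and latent decoder are large learned networks, not shown here.

```python
import numpy as np

BIRADS = ["3", "4a", "4b", "4c", "5"]  # condition labels from the paper
D, T = 16, 50                          # toy latent dim and diffusion steps

betas = np.linspace(1e-4, 2e-2, T)     # linear noise schedule (assumption)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def embed(label):
    """Toy one-hot embedding of the BI-RADS condition."""
    e = np.zeros(D)
    e[BIRADS.index(label)] = 1.0
    return e

def denoise(z_t, t, cond):
    """Stand-in for the DiT noise predictor: a real model is a transformer
    taking (noisy latent, timestep, condition) and predicting the noise."""
    return 0.1 * z_t + 0.05 * cond

def sample(label, seed=0):
    """Reverse diffusion from pure noise, guided by the class condition."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(D)
    cond = embed(label)
    for t in reversed(range(T)):
        eps = denoise(z, t, cond)
        # DDPM posterior mean update
        z = (z - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            z += np.sqrt(betas[t]) * rng.standard_normal(D)
    return z  # in a real latent diffusion model, a VAE decoder maps z to an image

latent = sample("4a")
```

Changing only the label steers sampling toward a different BI-RADS category while the rest of the pipeline stays fixed.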

     

    To validate privacy protection during image generation, the team conducted a nearest neighbor analysis, confirming that CoLDiT-generated images did not replicate any image from the training set, thus safeguarding patient privacy. For quality assessment, radiologists were invited to evaluate the realism and BI-RADS classification of CoLDiT-generated images. In the realism evaluation, all readers except one senior radiologist (AUC greater than 0.7) achieved AUCs between 0.53 and 0.63, close to chance-level discrimination between real and synthetic images (see Figure 2B). Figure 3 presents examples of real and CoLDiT-generated breast ultrasound images that were labeled oppositely by at least four of the six readers. Furthermore, the overall BI-RADS classification performance on synthetic images was comparable to that on real images for all three radiologists, with two even surpassing their performance on real images (see Figure 2C).


     

    Figure 2: Procedures and results of human evaluation on CoLDiT-generated breast US images.


    Figure 3: Examples of real and CoLDiT-generated breast US images labeled oppositely by at least four out of six readers in the realism evaluation.
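A nearest neighbor memorization check of the kind described above can be sketched as follows. The feature vectors and similarity threshold here are toy assumptions; in practice each image would first be embedded with a pretrained encoder, and a synthetic image whose nearest training neighbor is suspiciously similar would be flagged as potential memorization.

```python
import numpy as np

def nearest_neighbor_check(synthetic, training, threshold=0.99):
    """Return, for each synthetic feature vector, the index of its nearest
    training sample, the cosine similarity to it, and a memorization flag."""
    s = synthetic / np.linalg.norm(synthetic, axis=1, keepdims=True)
    t = training / np.linalg.norm(training, axis=1, keepdims=True)
    sims = s @ t.T                 # pairwise cosine similarities
    nn_idx = sims.argmax(axis=1)   # nearest training sample per synthetic sample
    nn_sim = sims.max(axis=1)
    return nn_idx, nn_sim, nn_sim > threshold

# Toy stand-ins for encoder features of training and synthetic images.
rng = np.random.default_rng(1)
train_feats = rng.standard_normal((100, 32))
synth_feats = rng.standard_normal((10, 32))

idx, sim, flagged = nearest_neighbor_check(synth_feats, train_feats)
```

If no synthetic sample crosses the threshold, the generator is not simply reproducing training images, which is the property the study verified.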

     

    Additionally, the study utilized the synthetic breast ultrasound images for data augmentation in a BI-RADS classification model. The results indicated that after replacing half of the real data in the training set with synthetic data, the model's performance remained comparable to the model trained exclusively with real data (P = 0.81) (see Figure 4).

     

     

    Figure 4: CoLDiT-generated breast US images effectively augment training data for the BI-RADS classification task, achieving performance comparable to using a training set composed solely of real images.
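The augmentation setup described above can be sketched as assembling a mixed training set in which a fraction of the real images is replaced by synthetic ones while the total size stays fixed (the study's 50/50 setting corresponds to a fraction of 0.5). The arrays below are toy placeholders, not the study's data pipeline.

```python
import numpy as np

def mix_training_set(real_x, real_y, synth_x, synth_y,
                     synth_fraction=0.5, seed=0):
    """Replace synth_fraction of the real set with synthetic samples,
    keeping the total number of training examples constant."""
    rng = np.random.default_rng(seed)
    n = len(real_x)
    n_synth = int(n * synth_fraction)
    keep = rng.choice(n, size=n - n_synth, replace=False)         # real kept
    take = rng.choice(len(synth_x), size=n_synth, replace=False)  # synthetic added
    x = np.concatenate([real_x[keep], synth_x[take]])
    y = np.concatenate([real_y[keep], synth_y[take]])
    order = rng.permutation(len(x))  # shuffle the mixed set
    return x[order], y[order]

# Toy features and BI-RADS-style labels (5 classes).
real_x = np.random.default_rng(2).random((100, 8))
real_y = np.repeat(np.arange(5), 20)
synth_x = np.random.default_rng(3).random((60, 8))
synth_y = np.tile(np.arange(5), 12)

x, y = mix_training_set(real_x, real_y, synth_x, synth_y)
```

The same classifier is then trained once on the all-real set and once on the mixed set, and the two models' test performance is compared, which is the comparison behind the reported P = 0.81.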

     

    This study offers several advantages over prior work. First, the use of a large, multicenter dataset ensured diverse data sources from 202 hospitals, encompassing different vendors and device grades. This allowed the model to capture the comprehensive range of variation inherent in real-world breast ultrasound images, leading to more realistic and precise synthetic images. Second, employing a pure transformer backbone instead of the traditional U-Net capitalizes on transformers' exceptional ability to capture long-range dependencies, enabling the model to generate more coherent and detailed images. Third, conditioning image synthesis on BI-RADS labels allows the generation of ultrasound images corresponding to specific BI-RADS categories. This is particularly valuable in medical contexts, where the ability to generate images tailored to specific clinical scenarios is crucial for accurate diagnosis and treatment planning.

     

    Professor Zhou’s team believes that synthetic data, as a privacy-protecting solution, will play a key role in the secure utilization of medical big data, accelerating progress in medical research and clinical applications, and ultimately enhancing the quality of medical services and patient health. In the future, the team plans to integrate generative artificial intelligence with more types of medical imaging data to verify its applicability in different medical scenarios.

     
