MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding


🔥News

  • 🎉 [2025-05-06] MedXpertQA paper is accepted to ICML 2025!
  • 🛠️ [2025-04-08] MedXpertQA has been successfully integrated into OpenCompass! Check out the PR!
  • 💻 [2025-02-28] We release the evaluation code! Check out the Usage.
  • 🌟 [2025-02-20] The leaderboard is live! Check out the results of o3-mini, DeepSeek-R1, and o1!
  • 🤗 [2025-02-09] We release the MedXpertQA dataset.
  • 🔥 [2025-01-31] We introduce MedXpertQA, a highly challenging and comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning!

📖Overview

MedXpertQA includes 4,460 questions spanning 17 specialties and 11 body systems. It comprises two subsets: MedXpertQA Text for text-based medical evaluation and MedXpertQA MM for multimodal medical evaluation. The following figure presents an overview.

More details: the left side of the figure illustrates the diverse data sources, image types, and question attributes; the right side compares typical examples from MedXpertQA MM and a traditional benchmark (VQA-RAD).

Figure: Overview of MedXpertQA.
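
For a quick look at the data, the snippet below is a minimal sketch of loading the two subsets, assuming the released dataset is hosted on the Hugging Face Hub under TsinghuaC3I/MedXpertQA with Text and MM configurations and a test split; verify the exact repository id, configuration names, and field names against the dataset card.

```python
# Minimal sketch of inspecting MedXpertQA (assumptions: HF repo id
# "TsinghuaC3I/MedXpertQA", configs "Text" and "MM", split "test";
# field names printed below are whatever the dataset card defines).
from datasets import load_dataset

text_subset = load_dataset("TsinghuaC3I/MedXpertQA", "Text", split="test")
mm_subset = load_dataset("TsinghuaC3I/MedXpertQA", "MM", split="test")

print(len(text_subset), "text questions")
print(len(mm_subset), "multimodal questions")
print(text_subset[0])  # one question record, e.g. question text, options, label
```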

✨Features

  • Next-Generation Multimodal Medical Evaluation: MedXpertQA MM introduces expert-level medical exam questions with diverse images and rich clinical information, including patient records and examination results, setting it apart from traditional medical multimodal benchmarks with simple QA pairs generated from image captions.
  • Highly Challenging: MedXpertQA introduces high-difficulty medical exam questions and applies rigorous filtering and augmentation, effectively addressing the insufficient difficulty of existing benchmarks like MedQA. The Text and MM subsets are currently the most challenging benchmarks in their respective fields.
  • Clinical Relevance: MedXpertQA incorporates specialty board questions to improve clinical relevance and comprehensiveness by collecting questions corresponding to 17/25 member board exams (specialties) of the American Board of Medical Specialties. It showcases remarkable diversity across multiple dimensions.

Figure: MedXpertQA spans diverse human body systems, medical tasks, and question topics.

  • Mitigating Data Leakage: We perform data synthesis to mitigate data leakage risk and conduct multiple rounds of expert reviews to ensure accuracy and reliability.
  • Reasoning-Oriented Evaluation: Medicine provides a rich and representative setting for assessing reasoning abilities beyond mathematics and code. We develop a reasoning-oriented subset to facilitate the assessment of o1-like models.

📊Leaderboard

We evaluate 17 leading proprietary and open-source LMMs and LLMs, including advanced inference-time scaling models, with a focus on the latest progress in medical reasoning capabilities. Further details are available in the leaderboard and the paper.

🔧Usage

  1. Clone the Repository:

```bash
git clone https://github.com/TsinghuaC3I/MedXpertQA
cd MedXpertQA/eval
```

  2. Install Dependencies:

```bash
pip3 install -r requirements.txt
```

  3. Inference:

```bash
bash scripts/run.sh
```

The run.sh script performs inference by calling main.py, which offers additional features such as multithreading. Additionally, you can modify model/api_agent.py to support more models.
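
As a rough illustration of what adding a new backend might look like, the sketch below shows a minimal agent for an OpenAI-compatible endpoint. The class and method names are hypothetical and are not taken from model/api_agent.py; adapt them to the interface the repository actually uses.

```python
# Hypothetical sketch of a model backend in the spirit of model/api_agent.py.
# Names (SimpleChatAgent, generate) are illustrative, not from the repository.
from openai import OpenAI

class SimpleChatAgent:
    def __init__(self, model: str, base_url: str | None = None):
        # Any OpenAI-compatible endpoint can be targeted via base_url.
        self.client = OpenAI(base_url=base_url)
        self.model = model

    def generate(self, prompt: str) -> str:
        # Single-turn call; MedXpertQA questions arrive as multiple-choice prompts.
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
        )
        return response.choices[0].message.content
```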

  4. Evaluation:

We provide a script eval.ipynb to calculate accuracy on each subset.

Note

Please use this notebook when evaluating QVQ and DeepSeek-R1: through case studies, we found that the answer cleaning function in utils.py is unsuitable for these two models.
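
For reference, the sketch below shows the kind of per-subset accuracy computation eval.ipynb performs. It assumes predictions were dumped as JSON lines with hypothetical fields ("medical_task" as the grouping key, "label" for the gold option, "prediction" for the cleaned model answer); adjust the field names and path to the real inference outputs.

```python
# Illustrative per-subset accuracy, in the spirit of eval.ipynb.
# Field names and the output path are assumptions, not the repository's schema.
import json
from collections import defaultdict

def subset_accuracy(path: str, group_key: str = "medical_task") -> dict[str, float]:
    correct, total = defaultdict(int), defaultdict(int)
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            group = record[group_key]
            total[group] += 1
            if record["prediction"].strip().upper() == record["label"].strip().upper():
                correct[group] += 1
    return {g: correct[g] / total[g] for g in total}

# Example usage (hypothetical output path):
# print(subset_accuracy("outputs/Text/predictions.jsonl"))
```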

📨Contact

⚖️License

This project is licensed under the MIT License.

🎈Citation

If you find our work helpful, please cite it as follows.

```bibtex
@article{zuo2025medxpertqa,
  title={{MedXpertQA}: Benchmarking Expert-Level Medical Reasoning and Understanding},
  author={Zuo, Yuxin and Qu, Shang and Li, Yifei and Chen, Zhangren and Zhu, Xuekai and Hua, Ermo and Zhang, Kaiyan and Ding, Ning and Zhou, Bowen},
  journal={arXiv preprint arXiv:2501.18362},
  year={2025}
}
```
