MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding


🔥News

  • 🎉 [2025-05-06] MedXpertQA paper is accepted to ICML 2025!
  • 🛠️ [2025-04-08] MedXpertQA has been successfully integrated into OpenCompass! Check out the PR!
  • 💻 [2025-02-28] We release the evaluation code! Check out the Usage.
  • 🌟 [2025-02-20] The leaderboard is live! Check out the results of o3-mini, DeepSeek-R1, and o1!
  • 🤗 [2025-02-09] We release the MedXpertQA dataset.
  • 🔥 [2025-01-31] We introduce MedXpertQA, a highly challenging and comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning!

📖Overview

MedXpertQA includes 4,460 questions spanning 17 specialties and 11 body systems. It comprises two subsets: MedXpertQA Text for text-based medical evaluation and MedXpertQA MM for multimodal medical evaluation. The following figure presents an overview.

More details: the left side of the figure illustrates the diverse data sources, image types, and question attributes; the right side compares typical examples from MedXpertQA MM and a traditional benchmark (VQA-RAD).

Figure: Overview of MedXpertQA.
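
For a quick look at the data, the snippet below is a minimal sketch of loading the two subsets, assuming the released dataset is hosted on the Hugging Face Hub under TsinghuaC3I/MedXpertQA with Text and MM configurations and a test split; verify the exact repository id, configuration names, and field names against the dataset card.

```python
# Minimal sketch of inspecting MedXpertQA (assumptions: HF repo id
# "TsinghuaC3I/MedXpertQA", configs "Text" and "MM", split "test";
# field names printed below are whatever the dataset card defines).
from datasets import load_dataset

text_subset = load_dataset("TsinghuaC3I/MedXpertQA", "Text", split="test")
mm_subset = load_dataset("TsinghuaC3I/MedXpertQA", "MM", split="test")

print(len(text_subset), "text questions")
print(len(mm_subset), "multimodal questions")
print(text_subset[0])  # one question record, e.g. question text, options, label
```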

✨Features

  • Next-Generation Multimodal Medical Evaluation: MedXpertQA MM introduces expert-level medical exam questions with diverse images and rich clinical information, including patient records and examination results, setting it apart from traditional medical multimodal benchmarks with simple QA pairs generated from image captions.
  • Highly Challenging: MedXpertQA introduces high-difficulty medical exam questions and applies rigorous filtering and augmentation, effectively addressing the insufficient difficulty of existing benchmarks like MedQA. The Text and MM subsets are currently the most challenging benchmarks in their respective fields.
  • Clinical Relevance: MedXpertQA incorporates specialty board questions to improve clinical relevance and comprehensiveness by collecting questions corresponding to 17/25 member board exams (specialties) of the American Board of Medical Specialties. It showcases remarkable diversity across multiple dimensions.

Figure: MedXpertQA spans diverse human body systems, medical tasks, and question topics.

  • Mitigating Data Leakage: We perform data synthesis to mitigate data leakage risk and conduct multiple rounds of expert reviews to ensure accuracy and reliability.
  • Reasoning-Oriented Evaluation: Medicine provides a rich and representative setting for assessing reasoning abilities beyond mathematics and code. We develop a reasoning-oriented subset to facilitate the assessment of o1-like models.

📊Leaderboard

We evaluate 17 leading proprietary and open-source LMMs and LLMs, including advanced inference-time scaling models, with a focus on the latest progress in medical reasoning capabilities. Further details are available in the leaderboard and the paper.

🔧Usage

  1. Clone the Repository:

```bash
git clone https://github.com/TsinghuaC3I/MedXpertQA
cd MedXpertQA/eval
```

  2. Install Dependencies:

```bash
pip3 install -r requirements.txt
```

  3. Inference:

```bash
bash scripts/run.sh
```

The run.sh script performs inference by calling main.py, which offers additional features such as multithreading. Additionally, you can modify model/api_agent.py to support more models.
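
As a rough illustration of what adding a new backend might look like, the sketch below shows a minimal agent for an OpenAI-compatible endpoint. The class and method names are hypothetical and are not taken from model/api_agent.py; adapt them to the interface the repository actually uses.

```python
# Hypothetical sketch of a model backend in the spirit of model/api_agent.py.
# Names (SimpleChatAgent, generate) are illustrative, not from the repository.
from openai import OpenAI

class SimpleChatAgent:
    def __init__(self, model: str, base_url: str | None = None):
        # Any OpenAI-compatible endpoint can be targeted via base_url.
        self.client = OpenAI(base_url=base_url)
        self.model = model

    def generate(self, prompt: str) -> str:
        # Single-turn call; MedXpertQA questions arrive as multiple-choice prompts.
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
        )
        return response.choices[0].message.content
```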

  4. Evaluation:

We provide a script eval.ipynb to calculate accuracy on each subset.

Note

Please use this notebook when evaluating QVQ and DeepSeek-R1: through case studies, we found that the answer cleaning function in utils.py is unsuitable for these two models.
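
For reference, the sketch below shows the kind of per-subset accuracy computation eval.ipynb performs. It assumes predictions were dumped as JSON lines with hypothetical fields ("medical_task" as the grouping key, "label" for the gold option, "prediction" for the cleaned model answer); adjust the field names and path to the real inference outputs.

```python
# Illustrative per-subset accuracy, in the spirit of eval.ipynb.
# Field names and the output path are assumptions, not the repository's schema.
import json
from collections import defaultdict

def subset_accuracy(path: str, group_key: str = "medical_task") -> dict[str, float]:
    correct, total = defaultdict(int), defaultdict(int)
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            group = record[group_key]
            total[group] += 1
            if record["prediction"].strip().upper() == record["label"].strip().upper():
                correct[group] += 1
    return {g: correct[g] / total[g] for g in total}

# Example usage (hypothetical output path):
# print(subset_accuracy("outputs/Text/predictions.jsonl"))
```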

📨Contact

⚖️License

This project is licensed under the MIT License.

🎈Citation

If you find our work helpful, please cite it as follows.

```bibtex
@article{zuo2025medxpertqa,
  title={{MedXpertQA}: Benchmarking Expert-Level Medical Reasoning and Understanding},
  author={Zuo, Yuxin and Qu, Shang and Li, Yifei and Chen, Zhangren and Zhu, Xuekai and Hua, Ermo and Zhang, Kaiyan and Ding, Ning and Zhou, Bowen},
  journal={arXiv preprint arXiv:2501.18362},
  year={2025}
}
```
