RadBench release
suneeta-mall committed Sep 4, 2024
0 parents commit e2b3a12
Showing 29 changed files with 988 additions and 0 deletions.
2 changes: 2 additions & 0 deletions .github/CODEOWNERS
@@ -0,0 +1,2 @@
* @harrison-ai/ai

15 changes: 15 additions & 0 deletions .github/pull_request_template.md.md
@@ -0,0 +1,15 @@
Please follow the Conventional Commits specification for commit types: <https://www.conventionalcommits.org/en/v1.0.0/#specification>
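For example, a PR title such as `feat(datasets): add FRCR mock exam sheets` follows this convention (`type(scope): description`).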

## Proposed changes

Describe your changes here to communicate to the maintainers why we should accept this pull request.

### Focused Review

If there are parts of this PR that need special attention, please mention them here and tag the most appropriate reviewer.

- Does this test cover all the important cases? [TAG PERSON]

**Related issue:**


22 changes: 22 additions & 0 deletions .github/workflows/pages.yml
@@ -0,0 +1,22 @@
name: Docs
on:
  push:
    #branches:
    #  - main
permissions:
  contents: write
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: install publishing dependencies
        run: make install

      - name: Deploy pages
        run: mkdocs gh-deploy --force
        # run: mkdocs /bin/bash -c "HOME=/tmp python -m mkdocs build"
11 changes: 11 additions & 0 deletions .github/workflows/renovate.yml
@@ -0,0 +1,11 @@
on:
  workflow_dispatch:

name: Renovate

jobs:
  check_dependencies:
    name: Check dependencies
    runs-on: ubuntu-22.04
    steps:
      - uses: actions/checkout@v4
82 changes: 82 additions & 0 deletions .gitignore
@@ -0,0 +1,82 @@
.coverage
.mypy_cache/
.pip.conf
.pytest_cache/
.pytest_logs/
lightning_logs/
.venv/
.vscode
__pycache__

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
outputs/
artifacts/

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/


# Crash log files
crash.log
*.log

# Envvars environment configuration file
.env
.envrc
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
.direnv
.envrc/
.vscode/
.pip.conf
.requirements-no-hashes.txt
.python-version

# Jupyter Notebook
.ipynb_checkpoints

# Temporary caches
*.so
cache/*
.tmp
site
11 changes: 11 additions & 0 deletions Makefile
@@ -0,0 +1,11 @@
.PHONY: install serve clean
.DEFAULT_GOAL := serve

install:
	pip install -r requirements.txt

serve:
	mkdocs serve

clean:
	git clean -Xdf
36 changes: 36 additions & 0 deletions README.md
@@ -0,0 +1,36 @@
![RadBench Logo](docs/resources/logo_font_azure.png)
# RadBench: Radiology Benchmark Framework

[![Documentation](https://img.shields.io/badge/Documentation-blue?style=flat)](https://harrison-ai.github.io/radbench/)

## Overview

RadBench is a radiology benchmark framework developed by [Harrison.ai](https://harrison.ai/). It is designed to evaluate the performance of Harrison.ai's radiology foundation model, `harrison.rad.1`, against other competitive models in the field. The framework employs a rigorous evaluation methodology across three distinct datasets to ensure that models are thoroughly assessed for clinical relevance, accuracy, and case comprehension. These datasets are:

1. [**RadBench Dataset**](docs/datasets/radbench.md): A new visual question-answering dataset designed by Harrison.ai to benchmark radiology models.

2. [**VQA-RAD Dataset**](docs/datasets/vqa-rad.md): A visual question-answering dataset for radiology, available at [Nature Datasets](https://www.nature.com/articles/sdata2018251).

3. [**Fellowship of the Royal College of Radiologists (FRCR) 2B Examination**](docs/datasets/frcr.md): Examination sheets curated for the FRCR 2B Rapids exam, obtained from third parties to ensure fairness in our evaluation process.



## mkdocs dev

To launch mkdocs locally, follow these instructions:

1. Create a Python environment:
```bash
python3 -m venv .venv
. .venv/bin/activate
```

2. Install the dependencies:
```bash
make install
```

3. Start the serving endpoint:
```bash
make serve
```
498 changes: 498 additions & 0 deletions data/radbench/radbench.csv

Large diffs are not rendered by default.

7 changes: 7 additions & 0 deletions docs/datasets/frcr.md
@@ -0,0 +1,7 @@
![RadBench Logo](../resources/logo_font_azure.png)

# FRCR

Medical specialists undertake rigorous examinations before practising radiology. The Fellowship of the Royal College of Radiologists (FRCR) is one such examination. We used a component of this examination, the FRCR 2B Rapids [@FRCR2B], to benchmark radiology foundation models.

While the actual examinations are kept confidential to prevent leakage, mock FRCR examinations are available on various established educational websites. Our FRCR evaluation dataset comprises 70 FRCR examination sheets procured from these established third-party organisations. We sourced this dataset from third parties to ensure fairness in our evaluation process.
37 changes: 37 additions & 0 deletions docs/datasets/radbench.md
@@ -0,0 +1,37 @@
![RadBench Logo](../resources/logo_font_azure.png)

# RadBench Dataset

The RadBench dataset is a collation of clinically relevant, radiology-specific visual questions and answers (VQA) based on plain-film X-rays. This VQA dataset is clinically comprehensive, covering three or more questions per medical image. The radiology images for this set are sourced from [Medpix](https://medpix.nlm.nih.gov/home) and [Radiopaedia](https://radiopaedia.org/). RadBench is curated by medical doctors with expertise in the relevant fields who interpret these images as part of their clinical duties.


![RadBench Overview](../resources/radbench_overview.jpg)

## Overview

The [RadBench dataset](https://github.com/harrison-ai/radbench/blob/main/data/radbench/radbench.csv) is formatted similarly to VQA-Rad [@Lau2018] to ensure ease of use by the medical and radiology communities. Some key differences are:

* **Rich set of possible answers**: The closed questions in the RadBench dataset have an explicitly defined set of possible answers.
* **Level of correctness**: The set of possible answers for a given question is also ordered by relative correctness, to account for the fact that some options can be more incorrect than others. This ordering also helps with differential diagnosis.
* **Multi-turn questionnaire**: Questions are ordered per case by specificity, meaning that if evaluated in the same context they should be asked in that order. For example, "Is there a fracture in the study?" should be asked before "Which side is the fracture on?", as the second question implies the answer to the first.
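
Below is a minimal sketch of how the CSV might be consumed with pandas while respecting this ordering. The column names (`case_id`, `question`, `possible_answers`) are illustrative assumptions, not the published schema; check `data/radbench/radbench.csv` for the actual headers.

```python
# Illustrative sketch only: column names are assumptions, not the
# published schema -- see data/radbench/radbench.csv for the real headers.
import pandas as pd

df = pd.read_csv("data/radbench/radbench.csv")

# Rows are ordered by specificity within each case, so group by case
# and preserve row order to respect the multi-turn design.
for case_id, case in df.groupby("case_id", sort=False):
    for _, row in case.iterrows():
        # `possible_answers` is assumed to be ordered from most to
        # least correct, per the "level of correctness" property above.
        print(case_id, row["question"], row["possible_answers"])
```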


## Why RadBench?

There has been growing concern within the computer vision and deep learning (CV & DL) communities that we have started to overfit to popular existing benchmarks, such as ImageNet [@abs-2006-07159]. We share this concern and worry that radiology foundation models are perhaps also starting to overfit to VQA-Rad [@Lau2018]. Moreover, existing radiology VQA datasets have several shortcomings:

* Some datasets contain questions and answers automatically generated from noisy labels extracted from radiology reports. This leads to unnatural and ambiguous questions that cannot be adequately answered from the image. For instance:
    * The question `In the given Chest X-Ray, is cardiomegaly present in the upper? (please answer yes/no)` (dataset source: ProbMed [@ProbMed2024]) is anatomically impossible to answer, as cardiomegaly is not divided into `upper` and `lower`.
    * Likewise, in the SLAKE dataset [@SLAKE2021], the question `Where is the brain non-enhancing tumor?` is asked of the image `xmlab470/source.jpg`. However, the image is an axial non-contrast T2 MRI of the brain, from which a 'non-enhancing tumor' cannot be identified. The given answer, `Upper Left Lobe`, is also not a valid anatomical region in the brain; it should be `anterior left frontal lobe`.
* Some existing datasets have been curated by non-medical specialists, leading to questions that may be less relevant to everyday clinical work and pathology.
* Existing datasets do not include more than one image per question, whereas many radiology studies include more than one view. A single image does not allow us to evaluate a model's ability to compare multiple images at once, which is a clinically relevant task.
* Existing datasets do not specify the context in which the images should be used. This matters for RadBench because more than one image can be used in a single question. In RadBench, the `<i>` token denotes the location of an image relative to the surrounding words (more specifically, tokens). This allows specific references to the images in the question, e.g. "the first study" or "the second study". As a result, multi-turn comparison questions can now be asked (see the sketch after this list).
* Existing datasets are not selected for clinically challenging cases where the pathology is visually subtle or rare. RadBench deliberately selects a wide range of pathologies across different anatomical regions, with the intention of including challenging cases.
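
As a concrete illustration of the `<i>` token, the sketch below resolves a two-study question into an interleaved text/image sequence. The question string and file names are hypothetical, invented for this example rather than taken from RadBench.

```python
# Hypothetical example: the question text and image paths are invented
# to illustrate the `<i>` image-location token.
question = (
    "Given the first study <i> and the second study <i>, "
    "has the pleural effusion resolved?"
)
images = ["study_1.png", "study_2.png"]

# Split on the token and interleave each image at its marked position.
parts = question.split("<i>")
assert len(parts) - 1 == len(images), "expect one image per <i> token"

sequence = []
for i, text in enumerate(parts):
    if text.strip():
        sequence.append(("text", text.strip()))
    if i < len(images):
        sequence.append(("image", images[i]))

print(sequence)
# [('text', 'Given the first study'), ('image', 'study_1.png'),
#  ('text', 'and the second study'), ('image', 'study_2.png'),
#  ('text', ', has the pleural effusion resolved?')]
```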




## Acknowledgements

We thank [Medpix](https://medpix.nlm.nih.gov/home) and [Radiopaedia](https://radiopaedia.org/), their respective editorial teams, and their contributors, especially the NIH, Frank Gaillard, Andrew Dixon, and other Radiopaedia.org contributors, for creating such a rich library of cases to test radiology expertise.
8 changes: 8 additions & 0 deletions docs/datasets/vqa-rad.md
@@ -0,0 +1,8 @@
![RadBench Logo](../resources/logo_font_azure.png)

# VQA-Rad

VQA-Rad is a dataset of clinically generated visual questions and answers about radiology images [@Lau2018]. The dataset can be downloaded from [Scientific Data](https://www.nature.com/articles/sdata2018251), from [OSF](https://files.osf.io/v1/resources/89kps/providers/osfstorage/?zip=), or alternatively from [Hugging Face](https://huggingface.co/datasets/flaviagiammarino/vqa-rad).
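
For quick experimentation, a minimal sketch of loading the Hugging Face mirror with the `datasets` library is shown below; the split and field names follow that mirror's dataset card, so verify them against the copy you download.

```python
# Requires: pip install datasets
from datasets import load_dataset

# Hugging Face mirror linked above; splits and fields per its dataset card.
vqa_rad = load_dataset("flaviagiammarino/vqa-rad")

sample = vqa_rad["train"][0]
print(sample["question"], "->", sample["answer"])
# sample["image"] holds the associated radiology image (a PIL Image).
```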



14 changes: 14 additions & 0 deletions docs/index.md
@@ -0,0 +1,14 @@
![RadBench Logo](https://harrison-ai.github.io/radbench/resources/logo_font_azure.png)

# RadBench: Radiology Benchmark Framework


## Overview

RadBench is a radiology benchmark framework developed by [Harrison.ai](https://harrison.ai/). It is designed to evaluate the performance of Harrison.ai's radiology foundation model, `harrison.rad.1`, against other competitive models in the field. The framework employs a rigorous evaluation methodology across three distinct datasets to ensure that models are thoroughly assessed for clinical relevance, accuracy, and case comprehension. These datasets are:

1. [**RadBench Dataset**](/datasets/radbench): A new visual question-answering dataset designed by Harrison.ai to benchmark radiology models.

2. [**VQA-RAD Dataset**](/datasets/vqa-rad): A visual question-answering dataset for radiology, available at [Nature Datasets](https://www.nature.com/articles/sdata2018251).

3. [**Fellowship of the Royal College of Radiologists (FRCR) 2B Examination**](/datasets/frcr): Examination sheets curated for the FRCR 2B Rapids exam, obtained from third parties to ensure fairness in our evaluation process.
13 changes: 13 additions & 0 deletions docs/readme.md
@@ -0,0 +1,13 @@
![RadBench Logo](https://harrison-ai.github.io/radbench/resources/logo_font_azure.png)

# RadBench: Radiology Benchmark Framework

## Overview

RadBench is a radiology benchmark framework developed by [Harrison.ai](https://harrison.ai/). It is designed to evaluate the performance of Harrison.ai's radiology foundation model, `harrison.rad.1`, against other competitive models in the field. The framework employs a rigorous evaluation methodology across three distinct datasets to ensure that models are thoroughly assessed for clinical relevance, accuracy, and case comprehension. These datasets are:

1. [**RadBench Dataset**](/datasets/radbench): A new visual question-answering dataset designed by Harrison.ai to benchmark radiology models.

2. [**VQA-RAD Dataset**](/datasets/vqa-rad): A visual question-answering dataset for radiology, available at [Nature Datasets](https://www.nature.com/articles/sdata2018251).

3. [**Fellowship of the Royal College of Radiologists (FRCR) 2B Examination**](/datasets/frcr): Examination sheets curated for the FRCR 2B Rapids exam, obtained from third parties to ensure fairness in our evaluation process.
51 changes: 51 additions & 0 deletions docs/references/refs.bib
@@ -0,0 +1,51 @@
@article{Lau2018,
  title   = {A dataset of clinically generated visual questions and answers about radiology images},
  author  = {Lau, J. J. and Gayen, S. and Ben Abacha, A. and Demner-Fushman, D.},
  journal = {Scientific Data},
  volume  = {5},
  pages   = {180251},
  year    = {2018},
  url     = {https://www.nature.com/articles/sdata2018251}
}

@article{SLAKE2021,
  title   = {SLAKE: A Semantically-Labeled Knowledge-Enhanced Dataset for Medical Visual Question Answering},
  author  = {Liu, B. and Zhan, L. and Xu, L. and Ma, L. and Yang, Y. and Wu, X.},
  journal = {CoRR},
  volume  = {abs/2102.09542},
  year    = {2021},
  url     = {https://arxiv.org/abs/2102.09542}
}


@article{ProbMed2024,
  title   = {Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA},
  author  = {Yan, Q. and He, X. and Yue, X. and Wang, X. E.},
  journal = {CoRR},
  volume  = {abs/2405.20421},
  year    = {2024},
  url     = {https://arxiv.org/abs/2405.20421}
}

@online{FRCR2B,
  author  = {{The Royal College of Radiologists}},
  title   = {FRCR Part 2B (Radiology) - CR2B},
  url     = {https://www.rcr.ac.uk/exams-training/rcr-exams/clinical-radiology-exams/frcr-part-2b-radiology-cr2b/},
  urldate = {2024-08-07}
}

@article{abs-2006-07159,
  author  = {Lucas Beyer and Olivier J. H{\'{e}}naff and Alexander Kolesnikov and Xiaohua Zhai and A{\"{a}}ron van den Oord},
  title   = {Are we done with ImageNet?},
  journal = {CoRR},
  volume  = {abs/2006.07159},
  year    = {2020},
  url     = {https://arxiv.org/abs/2006.07159}
}
Binary file added docs/resources/logo.png
4 changes: 4 additions & 0 deletions docs/resources/logo.svg
Binary file added docs/resources/logo_font_azure.png
Binary file added docs/resources/logo_font_black.png
Binary file added docs/resources/logo_font_mint.png
Binary file added docs/resources/logo_font_white.png