Detailed documentation for running specific subtasks of the experiments.
- ✅ Download source datasets (PhysioNet and SHC)
- ✅ Sample documents from MIMIC-CXR, MIMIC-III, MedAlign, and CORAL to create the FactEHR dataset
- ✅ NLP sentence tokenize FactEHR documents and serialize to disk (spaCy `DocBin`)
- ✅ Generate the fact decomposition prompted dataset using prompt templates and FactEHR documents
Download source datasets from PhysioNet and SHC. Files are saved to `data/datasets/`.
Requirements

- `wget` installed (on macOS, install via Homebrew: `brew install wget`)
- gcloud CLI installed (see instructions) and authenticated via `gcloud auth login`
- A PhysioNet credentialed account and per-dataset signed DUAs
- A connection to the Stanford VPN (required to download SHC datasets, i.e., MedAlign)
- Authentication with Hugging Face via `huggingface-cli login`
```bash
./scripts/datasets/download_physionet_datasets.sh
./scripts/datasets/download_shc_datasets.sh
./scripts/datasets/download_hf_datasets.sh
```
Important

LEGACY: Link the original data to the legacy sampled documents, assign primary keys, and export CSVs for preprocessing.
```bash
python scripts/hotfixes/get_note_provenance.py \
  --path_to_legacy $FACTEHR_LEGACY_DOCS \
  --path_to_datasets data/datasets/ \
  --path_to_output data/datasets/
```
Sample from the source datasets, assign primary keys, and export CSVs for preprocessing. We impose min/max length constraints to control for extreme context lengths during inference.
```bash
python scripts/sample_source_datasets.py \
  --path_to_input data/datasets/raw/ \
  --path_to_output data/datasets/corpora/v2/ \
  --file_name_prefix factehr_v2 \
  --tokenizer tiktoken \
  --min_doc_length 64 \
  --max_doc_length 3840
```
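The length bounds above are token counts under the chosen tokenizer. As a minimal sketch of the filtering logic, assuming tiktoken's `cl100k_base` encoding (the script's actual encoding choice and data layout may differ):

```python
import tiktoken

# Illustrative version of the min/max token-length filter applied during sampling.
enc = tiktoken.get_encoding("cl100k_base")

def within_length_bounds(text: str, min_len: int = 64, max_len: int = 3840) -> bool:
    """Keep documents whose token count falls inside [min_len, max_len]."""
    n_tokens = len(enc.encode(text))
    return min_len <= n_tokens <= max_len

docs = ["Patient admitted with chest pain and dyspnea ...", "Short note."]
kept = [d for d in docs if within_length_bounds(d)]
```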
Tokenize and sentence split clinical documents using an NLP framework ∈ {`medspacy`, `spacy`, `trove`}. See the framework speed benchmarks for more details. This will generate a spaCy `DocBin` file named `factehr_YYYYMMDD.docbin`.
```bash
python scripts/build_docbin_dataset.py \
  --path_to_input data/datasets/corpora/v2/ \
  --path_to_output data/datasets/ \
  --n_procs 4 \
  --batch_size 100 \
  --nlp_framework trove \
  --file_name_prefix factehr_v2
```
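Once written, the `DocBin` can be inspected with spaCy. A minimal sketch, assuming the docs were serialized with a compatible vocab and sentence boundaries set (adjust the pipeline to match how the file was built):

```python
import spacy
from spacy.tokens import DocBin

# Load the serialized documents; a blank English pipeline is assumed here,
# and the vocab must be compatible with the one used at serialization time.
nlp = spacy.blank("en")
doc_bin = DocBin().from_disk("data/datasets/factehr_v2_20240825.docbin")

for doc in doc_bin.get_docs(nlp.vocab):
    # Sentence boundaries were assigned by the chosen NLP framework.
    for sent in doc.sents:
        print(sent.text)
```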
Materialize the prompted version of the FactEHR dataset.
```bash
python scripts/build_fact_decomp_prompted_dataset.py \
  --path_to_input data/datasets/factehr_v2_20240825.docbin \
  --path_to_prompt_dir data/prompt_templates/fact_decomposition/ \
  --path_to_output_dir data/datasets/prompted/ \
  --file_name_prefix fact_decomposition \
  --completion_format messages
```
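Each record in the resulting JSONL pairs a prompt template with a document. As an illustration only (the field names below are assumptions, not the script's exact schema), a record in the `messages` completion format looks roughly like:

```python
# Hypothetical record shape for the "messages" completion format; the actual
# fields produced by build_fact_decomp_prompted_dataset.py may differ.
record = {
    "metadata": {
        "doc_id": "mimic-cxr-00001",            # source document identifier
        "prompt_template": "fact_decomposition", # which template was applied
    },
    "messages": [
        {
            "role": "user",
            "content": "Decompose the following note into a list of atomic facts:\n<note text here>",
        }
    ],
}
```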
- ✅ Run LLM fact decomposition inference on all documents and serialize to disk
- ✅ Prompt tuning for entailment
- ✅ NLP sentence tokenize fact lists and serialize to disk
- ✅ Benchmark existing NLI datasets (MedNLI, SNLI, MultiNLI, SciTail) on SOTA LLMs
- ✅ Generate all entailment pairs for fact precision (`I[note ⇒ fact]`) and fact recall (`I[fact-list ⇒ sentence]`) and serialize to disk
- ✅ Run LLM-as-a-judge inference on all entailment pairs and serialize raw generation outputs to disk
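To make the precision/recall pairing concrete, the entailment pairs can be enumerated directly from a note's sentences and its generated facts. A minimal sketch (names are illustrative, not the pipeline's actual schema):

```python
def build_entailment_pairs(note_sentences: list[str], facts: list[str]):
    """Enumerate (premise, hypothesis) pairs for fact precision and recall.

    Fact precision asks whether the note entails each generated fact
    (I[note => fact]); fact recall asks whether the fact list entails each
    note sentence (I[fact-list => sentence]).
    """
    note = " ".join(note_sentences)
    fact_list = " ".join(facts)
    precision_pairs = [(note, fact) for fact in facts]
    recall_pairs = [(fact_list, sentence) for sentence in note_sentences]
    return precision_pairs, recall_pairs
```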
Evaluate the performance of entailment, entailment+rationale, and entailment+rationale+CoT prompting strategies for entailment. This experiment leverages shc-gpt-4o (requires the SHC VPN) and the Vertex API (requires the Full Traffic VPN).
First, set the following:

```bash
export HUGGINGFACE_HUB_TOKEN={your token here, only needed when running the transformers client}
export FACTEHR_DATA_ROOT={something like /share/pi/nigam/akshays/just-the-facts/data/}
```
Next, download the entailment datasets to the data directory. Because MedNLI comes from PhysioNet, run this first:

```bash
./scripts/datasets/download_physionet_datasets.sh
```

Download the remaining entailment datasets by running:

```bash
./scripts/datasets/download_hf_datasets.sh
```
The one dataset that will not be downloaded by the scripts above is FactEHR (v0, the clinician-annotated dev set). It is currently saved on carina at `/share/pi/nigam/akshays/just-the-facts/data/datasets/raw/entailment/factehr.csv`.

Copy that file to `{$FACTEHR_DATA_ROOT}/datasets/raw/entailment/factehr.csv`.
As of 10/2/24, the following path on carina contains all NLI test sets needed for this experiment; instead of compiling the datasets locally, you can copy the contents of this folder into your directory: `/share/pi/nigam/akshays/just-the-facts/data/datasets/raw/entailment/`
To run the experiment pipeline, first adjust the config settings in `scripts/experiments/run_nli_prompt_tuning_experiment.sh` as needed. The most important setting is the client you want to run. For example:

```bash
models=("medlm-medium") # "meta-llama/Meta-Llama-3-8B-Instruct" "gemini-1.5-flash-002"
client="vertex" # Can be "transformers", "openai-batch", "vertex-batch", "vertex"
```
It first launches the job for binary entailment prompts (`max_new_tokens=25`), followed by the job for rationale entailment prompts (`max_new_tokens=256`).
This command runs the full experiment pipeline:

```bash
scripts/experiments/run_nli_prompt_tuning_experiment.sh <csv_output_path> <final_metrics_output_path> | tee output.log
```
Fact decomposition and entailment both involve running prompts through an LLM client. Once you have a prompted dataset (a JSONL file with the prompted inputs), you can run `scripts/experiments/run_inference_client.sh` with a command like:

```bash
scripts/experiments/run_inference_client.sh [PATH TO PROMPTED JSONL DATA] [MODEL NAME] [CLIENT NAME] [N TMUX SESSIONS] [MAX OUTPUT TOKENS]
```
Example command:

```bash
scripts/experiments/run_inference_client.sh data/datasets/prompted/fact_decomposition_20241009_medalign.jsonl "gemini-1.5-flash-002" "vertex" 5 4000
```
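Under the hood, each tmux session works through a shard of the prompted JSONL. As a rough single-process sketch of the same idea, assuming the Vertex AI Python SDK and the record shape illustrated earlier (the actual client also handles sharding, batching, retries, and output serialization):

```python
import json

import vertexai
from vertexai.generative_models import GenerationConfig, GenerativeModel

# Assumes gcloud authentication and a configured GCP project/region.
vertexai.init(project="your-gcp-project", location="us-central1")
model = GenerativeModel("gemini-1.5-flash-002")
config = GenerationConfig(max_output_tokens=4000)

in_path = "data/datasets/prompted/fact_decomposition_20241009_medalign.jsonl"
with open(in_path) as f_in, open("generations.jsonl", "w") as f_out:
    for line in f_in:
        record = json.loads(line)
        # Concatenate message contents into a single prompt (illustrative only).
        prompt = "\n".join(m["content"] for m in record["messages"])
        response = model.generate_content(prompt, generation_config=config)
        f_out.write(json.dumps({
            "metadata": record.get("metadata"),
            "generation": response.text,
        }) + "\n")
```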
The flow for our experiments is:

1. Generate a prompted dataset for fact decomposition (`scripts/init_all_datasets.sh`)
2. Perform fact decomposition (`scripts/experiments/run_inference_client.sh`)
3. Break the fact decomposition into entailment pairs (`scripts/experiments/create_entailment_file_from_fact_decomp.sh`)
4. Perform entailment evaluation using LLM-as-a-judge (`scripts/experiments/run_inference_client.sh`)