Experiment Pipelines

Detailed documentation for running the individual subtasks in our experiments.

1. Sample & Preprocess Documents

🚧 Status: Subtasks

  • ✅ Download source datasets (PhysioNet and SHC)
  • ✅ Sample documents from MIMIC-CXR, MIMIC-III, MedAlign, CORAL to create FactEHR dataset
  • ✅ NLP sentence tokenize FactEHR documents and serialize to disk (spaCy DocBin).
  • ✅ Generate fact decomposition prompted dataset using prompt templates and FactEHR documents

1. Download source datasets (PhysioNet and SHC)

Download the source datasets from PhysioNet and SHC. Files are saved to data/datasets/.

Requirements

  • wget installed (on macOS, install with Homebrew: brew install wget)
  • gcloud CLI installed (see Google's install instructions) and authenticated (gcloud auth login)
  • PhysioNet credentialed account and per-dataset signed DUAs
  • You must be connected to the Stanford VPN to download SHC datasets (MedAlign)
  • Authenticated with Hugging Face (huggingface-cli login)
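Before running the download scripts, a quick sanity check (a sketch using the tools listed above) confirms everything is in place:

command -v wget gcloud huggingface-cli   # each should resolve to an installed binary
gcloud auth list                         # should list your authenticated account
huggingface-cli whoami                   # should print your Hugging Face username

Then run the download scripts: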
./scripts/datasets/download_physionet_datasets.sh
./scripts/datasets/download_shc_datasets.sh
./scripts/datasets/download_hf_datasets.sh

2. Sample documents from MIMIC-CXR, MIMIC-III, MedAlign, CORAL to create FactEHR dataset

Important

LEGACY: Link original data to the legacy sampled documents, assign primary keys, and export CSVs for preprocessing.

python scripts/hotfixes/get_note_provenance.py \
--path_to_legacy $FACTEHR_LEGACY_DOCS \
--path_to_datasets data/datasets/ \
--path_to_output data/datasets/

Sample from the source datasets, assign primary keys, and export CSVs for preprocessing. We impose min/max length constraints to control for extreme context lengths during inference.

python scripts/sample_source_datasets.py \
--path_to_input data/datasets/raw/ \
--path_to_output data/datasets/corpora/v2/ \
--file_name_prefix factehr_v2 \
--tokenizer tiktoken \
--min_doc_length 64 \
--max_doc_length 3840
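As a spot check on these length limits, you can count tokens for a single document with the same tokenizer family (a sketch; note.txt is a hypothetical input file and cl100k_base is an assumed tiktoken encoding):

python -c "import sys, tiktoken; enc = tiktoken.get_encoding('cl100k_base'); print(len(enc.encode(sys.stdin.read())))" < note.txt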

3. NLP sentence tokenize FactEHR documents and serialize to disk (spaCy DocBin).

Tokenize and sentence-split clinical documents using an NLP framework ∈ {medspacy, spacy, trove}. See the framework speed benchmarks for more details. This generates a spaCy DocBin file named factehr_YYYYMMDD.docbin.

python scripts/build_docbin_dataset.py \
--path_to_input data/datasets/corpora/v2/ \
--path_to_output data/datasets/ \
--n_procs 4 \
--batch_size 100 \
--nlp_framework trove \
--file_name_prefix factehr_v2
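To verify the output, a minimal sketch that loads the DocBin and reports counts (the dated filename is taken from the next step and is an assumption; sentence counts assume sentence boundaries were serialized):

python - <<'EOF'
# Load the serialized docs and report document/sentence counts.
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
db = DocBin().from_disk("data/datasets/factehr_v2_20240825.docbin")
docs = list(db.get_docs(nlp.vocab))
# d.sents is only valid if SENT_START annotations were stored, so guard for it
n_sents = sum(len(list(d.sents)) for d in docs if d.has_annotation("SENT_START"))
print(f"{len(docs)} docs, {n_sents} sentences")
EOF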

4. Generate fact decomposition prompted dataset using prompt templates and FactEHR documents

Materialize the prompted version of the FactEHR dataset.

python scripts/build_fact_decomp_prompted_dataset.py \
--path_to_input data/datasets/factehr_v2_20240825.docbin \
--path_to_prompt_dir data/prompt_templates/fact_decomposition/ \
--path_to_output_dir data/datasets/prompted/ \
--file_name_prefix fact_decomposition \
--completion_format messages
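A quick way to eyeball the result is to pretty-print the first prompted example (the glob assumes the default dated output filename; the exact schema is repo-specific):

head -n 1 data/datasets/prompted/fact_decomposition_*.jsonl | python -m json.tool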

2. Running LLM Experiments

🚧 Status: Subtasks

  • ✅ Run LLM fact decomposition inference on all documents and serialize to disk
  • ✅ Prompt tuning for entailment
  • ✅ NLP sentence tokenize fact list and serialize to disk
  • ✅ Benchmark existing NLI datasets (MedNLI, SNLI, MultiNLI, SciTail) on SOTA LLMs
  • ✅ Generate all entailment pairs for fact precision (I[note ⇒ fact]) and fact recall (I[fact-list ⇒ sentence]) and serialize to disk
  • ✅ Run LLM-as-a-judge inference on all entailment pairs and serialize raw generation outputs to disk

1. Prompt tuning for entailment

Evaluate three prompting strategies for the entailment task: entailment only, entailment+rationale, and entailment+rationale+CoT. This experiment uses shc-gpt-4o (requires SHC VPN) and the Vertex API (requires Full Traffic VPN).

First, set the following environment variables:

export HUGGINGFACE_HUB_TOKEN={your token here; only needed when running the transformers client}
export FACTEHR_DATA_ROOT={something like /share/pi/nigam/akshays/just-the-facts/data/}

Next, download the entailment datasets to the data directory. Because MedNLI comes from PhysioNet, run this first:

./scripts/datasets/download_physionet_datasets.sh

Download the remaining entailment datasets by running:

./scripts/datasets/download_hf_datasets.sh

The one dataset the above scripts will not download is FactEHR (v0, the clinician-annotated dev set). It is currently saved on carina here:

/share/pi/nigam/akshays/just-the-facts/data/datasets/raw/entailment/factehr.csv

Copy that file into this location: ${FACTEHR_DATA_ROOT}/datasets/raw/entailment/factehr.csv

As of 10/2/24, the following path on carina contains all of the NLI test sets needed for this experiment; instead of compiling the datasets locally, you can copy the contents of this folder into your data directory: /share/pi/nigam/akshays/just-the-facts/data/datasets/raw/entailment/
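For example, a sketch of both copy options (paths as given above; assumes carina access and FACTEHR_DATA_ROOT set as described earlier):

mkdir -p "${FACTEHR_DATA_ROOT}/datasets/raw/entailment"
# copy just the clinician-annotated dev set...
cp /share/pi/nigam/akshays/just-the-facts/data/datasets/raw/entailment/factehr.csv "${FACTEHR_DATA_ROOT}/datasets/raw/entailment/"
# ...or copy every compiled NLI test set in one step
cp -r /share/pi/nigam/akshays/just-the-facts/data/datasets/raw/entailment/. "${FACTEHR_DATA_ROOT}/datasets/raw/entailment/"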

To run the experiment pipeline, first adjust the config settings in scripts/experiments/run_nli_prompt_tuning_experiment.sh as needed.

The most important setting is the client you want to run. For example:

models=("medlm-medium") #  "meta-llama/Meta-Llama-3-8B-Instruct"  "gemini-1.5-flash-002"
client="vertex"  # Can be "transformers", "openai-batch", "vertex-batch", "vertex"

The script first launches the job for binary entailment prompts (max_new_tokens=25), followed by the job for rationale entailment prompts (max_new_tokens=256).

This command runs the full experiment pipeline:

scripts/experiments/run_nli_prompt_tuning_experiment.sh <csv_output_path> <final_metrics_output_path> | tee output.log
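For example, with hypothetical output paths:

scripts/experiments/run_nli_prompt_tuning_experiment.sh data/results/nli_predictions.csv data/results/nli_metrics.json | tee output.log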

2. Running fact decomposition and entailment

Fact decomposition and entailment both involve running prompts through an LLM client. Once you have a prompted dataset (a JSONL file of prompted inputs), you can run scripts/experiments/run_inference_client.sh with a command like:

scripts/experiments/run_inference_client.sh [PATH TO PROMPTED JSONL DATA] [MODEL NAME] [CLIENT NAME] [N TMUX SESSIONS] [MAX OUTPUT TOKENS]

Example command:

scripts/experiments/run_inference_client.sh data/datasets/prompted/fact_decomposition_20241009_medalign.jsonl "gemini-1.5-flash-002" "vertex" 5 4000

The flow for our experiments is as follows (a command sketch follows the list):

  1. Generate a prompted dataset for fact decomposition (scripts/init_all_datasets.sh)
  2. Perform fact decomposition (scripts/experiments/run_inference_client.sh)
  3. Break down the fact decomposition into entailment pairs (scripts/experiments/create_entailment_file_from_fact_decomp.sh)
  4. Perform entailment evaluation using LLM as judge (scripts/experiments/run_inference_client.sh)
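Putting the flow together, a hedged end-to-end sketch (file paths are illustrative, and create_entailment_file_from_fact_decomp.sh's arguments are omitted; check each script for its exact interface):

# 1. build the prompted fact decomposition dataset
scripts/init_all_datasets.sh
# 2. run fact decomposition through an LLM client
scripts/experiments/run_inference_client.sh data/datasets/prompted/fact_decomposition_20241009_medalign.jsonl "gemini-1.5-flash-002" "vertex" 5 4000
# 3. convert the generated fact lists into entailment pairs
scripts/experiments/create_entailment_file_from_fact_decomp.sh
# 4. judge each entailment pair with the same inference client
scripts/experiments/run_inference_client.sh [PATH TO ENTAILMENT JSONL] "gemini-1.5-flash-002" "vertex" 5 4000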