diff --git a/demo/experiments/balanced_accuracy/README.md b/demo/experiments/balanced_accuracy/README.md
index 63fa850..5b8bd8c 100644
--- a/demo/experiments/balanced_accuracy/README.md
+++ b/demo/experiments/balanced_accuracy/README.md
@@ -3,11 +3,13 @@ Balanced Accuracy
 - Sentence-level balanced accuracy for the Wikibio dataset
 - For these results, the threshold is set to be 0.5
+- For SelfCheckGPT with BERTScore, set `rescale_with_baseline=True` to ensure that the scores are calibrated in [0.0, 1.0]
 - As the metric is balanced accuracy, NonFact and Factual scenarios yield the same results
 
 | Method | NonFact | NonFact* |
 |----------------------|:------------------:|:------------------:|
 | Random Guessing | 50.00 | 50.00 |
+| SelfCheck-BERTScore | 59.31 | 63.42 |
 | SelfCheck-QA | 62.87 | 60.08 |
 | SelfCheck-NLI | 70.55 | 62.15 |
 | SelfCheck-Prompt (gpt-3.5-turbo) | 76.69 | 65.93 |
 
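
The README states that sentence-level scores are thresholded at 0.5 and evaluated with balanced accuracy (so the NonFact and Factual framings coincide). A minimal sketch of that computation, in plain Python: the labels and scores below are hypothetical placeholders, not taken from the Wikibio results in the table.

```python
# Sketch: sentence-level balanced accuracy at a fixed threshold of 0.5.
# Balanced accuracy = (TPR + TNR) / 2, so it is invariant to swapping
# which class is treated as "positive" (NonFact vs. Factual).

def balanced_accuracy(labels, scores, threshold=0.5):
    # labels: 1 = NonFact sentence, 0 = Factual sentence
    # scores: per-sentence hallucination scores in [0.0, 1.0]
    preds = [int(s >= threshold) for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    pos = sum(labels)
    neg = len(labels) - pos
    tpr = tp / pos  # recall on NonFact sentences
    tnr = tn / neg  # recall on Factual sentences
    return 0.5 * (tpr + tnr)

# Hypothetical example data (not from the Wikibio experiments)
labels = [1, 1, 0, 0, 0, 1]
scores = [0.9, 0.4, 0.2, 0.6, 0.1, 0.8]
print(round(balanced_accuracy(labels, scores) * 100, 2))  # -> 66.67
```

Because random guessing yields TPR = TNR = 0.5 in expectation, the 50.00 baseline row in the table falls out of this definition directly.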