quantify consistency improvement #93
Conversation
… of TestEvaluator (the code is not cleaned)
@JohnShiuMK I'll add my 2 cents here.
Wait, I thought the F is already defined in the way you specified?
@tonyshumlh I have updated the demo to a 2-tailed test; it's ready to review and merge, thanks.
Hold on, the file/code structure is actually ugly. Let me tidy it up a little bit first, sorry.
In this demo of the F-score comparison, I'm comparing the code of week 3 (before refactoring, i.e. the old code base) vs. week 4 (after refactoring). Therefore, I have to keep the old code base (archive/analyze.py) and adjust the ConsistencyEvaluator (archive/llm_eval/consistency_eval.py) so that it also works with the old code.

We may delete them in the future once we have a comparison between newer versions, but for now I think it's better to keep a record of the above comparison in case someone asks for it. In order not to disturb the latest code base, I put everything related to the demo and the old code base under

What do you think? Do you have any better ways to proceed whenever we encounter a situation like this?
…MDS/test-creation into 76-quantify-consistency-improvement
I have added a note here to avoid confusion about the

I think we can merge for now.
The 2-tailed F-test p-value calculation looks good to me.
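For reference, here is a minimal sketch of how a 2-tailed F-test p-value for equality of variances can be computed when comparing the spread of F-scores between the old and new code bases. The function and variable names below are illustrative and are not the repository's ConsistencyEvaluator API; the sample values are made-up placeholders.

```python
# Hypothetical sketch: 2-tailed F-test for Var(old) == Var(new).
# Names and data are illustrative, not taken from this repository.
import numpy as np
from scipy import stats


def f_test_two_tailed(sample_a, sample_b):
    """Return the F statistic and 2-tailed p-value for equal variances."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    f_stat = a.var(ddof=1) / b.var(ddof=1)   # ratio of sample variances
    dfn, dfd = len(a) - 1, len(b) - 1        # numerator / denominator df
    cdf = stats.f.cdf(f_stat, dfn, dfd)
    p_value = 2 * min(cdf, 1 - cdf)          # double the smaller tail
    return f_stat, p_value


# Example: F-scores from the week 3 (old) and week 4 (new) code bases
old_scores = [0.62, 0.58, 0.71, 0.55, 0.66]
new_scores = [0.64, 0.65, 0.63, 0.66, 0.64]
f_stat, p = f_test_two_tailed(old_scores, new_scores)
print(f"F = {f_stat:.3f}, 2-tailed p = {p:.4f}")
```

Doubling the smaller tail of the F distribution gives the 2-tailed p-value, so the test is sensitive to the consistency (variance) being either better or worse after refactoring.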
close #76