Meeting Minutes for Week 5 #99

JohnShiuMK · 2024-05-24T23:34:24Z

Sprint Planning - 2024-05-27 Week 5

Checklist

Add extra column in current checklist, and merge all checklist_sys into checklist #114
- (Yingzi) remove checklist_demo
(Yingzi, Tony) Review checklist items
Expanding checklist items #63 (to be continued in week 6)
Prepare checklist for system development #91 (to be continued in week 6)
- (Tony) Address / reply to Tiffany's comments

System

System Evaluation: Consistency

Add metric (e.g. from regression model) to quantify the improvement of consistency #76
- (John1) Update the consistency F-test from one-tailed to two-tailed (keep it in the Jupyter notebook)
- (optional) Look at a stats package if time permits

System Evaluation: Accuracy

Human Evaluation for Accuracy - LightFM Repo #113 (Tony to wrap it into a markdown and put it into repo)
Prompt Engineering to increase the Accuracy for TestEvaluator #117

591 Requirement

a good clear README (it doesn't have to be complete, but should have the key elements of a good README in place, see: https://github.com/earthcube/EC_Repository_Template/blob/master/TemplateREADME.md or similar) #97 (to be continued in week 6)
a clear GitHub directory structure. #98 (to be continued in week 6)

tonyshumlh · 2024-05-30T17:03:10Z

Mentor Meeting 2024/05/30 - Week 5

Have the ground truth of repo evaluated by human
Have tools for users to assess how much they should trust the tool (consistency, accuracy)
Do multiple runs per repo and draw a box plot of the completeness score per checklist item to show per-item consistency
Plot a histogram of a consistency measure (e.g. standard deviation, SD), i.e. number of repos per SD bin, to illustrate consistency for troubleshooting
Have a Consistency table with columns repo, run, checklist item 1, ..., N to investigate 1) whether there is high variation in scores for a checklist item across repos; 2) whether there is high variation in scores for a repo across checklist items
For 1), the checklist item probably has an issue that requires improvement; for 2), the repo probably has a structural issue
Add a conclusion on whether we/users should trust our results in the Final Presentation, based on the above methods
Based on the Consistency table, we can drill down to a specific repo, or to a specific checklist item across all repos, for Human Ground Truth investigation and comparison
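The Consistency table and the SD histogram described above could be prototyped roughly as follows. This is only a sketch under assumptions: the table is stored in long form, run-to-run sample SD is used as the consistency measure, and the repo names, item names, scores and bin width are all made up for illustration. Plotting is left out; the counts per bin are what a histogram would display.

```python
from collections import Counter, defaultdict
from statistics import mean, stdev

# Hypothetical Consistency table in long form: (repo, run, checklist item, score).
TABLE = [
    ("repo_a", 1, "item_1", 1.0), ("repo_a", 1, "item_2", 0.0),
    ("repo_a", 2, "item_1", 1.0), ("repo_a", 2, "item_2", 0.5),
    ("repo_a", 3, "item_1", 1.0), ("repo_a", 3, "item_2", 1.0),
    ("repo_b", 1, "item_1", 0.5), ("repo_b", 1, "item_2", 0.0),
    ("repo_b", 2, "item_1", 0.5), ("repo_b", 2, "item_2", 1.0),
    ("repo_b", 3, "item_1", 0.5), ("repo_b", 3, "item_2", 0.5),
]


def run_sd(table):
    """SD of the score across runs, for every (repo, checklist item) pair."""
    scores = defaultdict(list)
    for repo, _run, item, score in table:
        scores[(repo, item)].append(score)
    return {key: stdev(values) for key, values in scores.items()}


def mean_sd_by(sds, position):
    """Average the SDs by repo (position 0) or by checklist item (position 1)."""
    groups = defaultdict(list)
    for key, sd in sds.items():
        groups[key[position]].append(sd)
    return {name: mean(values) for name, values in groups.items()}


def sd_histogram(sd_by_repo, bin_width=0.25):
    """Number of repos per SD bin; bin k covers [k * width, (k + 1) * width)."""
    return Counter(int(sd / bin_width) for sd in sd_by_repo.values())


sds = run_sd(TABLE)
by_item = mean_sd_by(sds, 1)   # high value: the checklist item may need work
by_repo = mean_sd_by(sds, 0)   # high value: the repo itself may be the issue
hist = sd_histogram(by_repo)
```

In this toy data `item_2` is unstable in both repos while `item_1` never changes, which is the pattern the notes associate with a checklist item issue rather than a repo issue.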

tonyshumlh · 2024-05-30T22:06:07Z

Partner Meeting 2024/05/30 - Week 5

Checklist

There is a Python API for the Quarto CLI (doc: Quarto_dev) that we can try for calling Quarto from Python

Evaluation Report

Content Revision:
- If none of the files satisfies the Checklist Item, just show "None of test function fulfill ..." in the Observations and skip the content in Function References;
- If the Checklist Item is partially/fully satisfied, keep only the relevant (one/all) test files, functions and line numbers in Function References, along with the corresponding Observations content.
- Observations: (Satisfied/Partial Satisfied/Not Satisfied)
- Move the hyperlink to Functions and remove the Line Numbers key-value
- Remove n_file_tested column in DataFrame and put it as a subheader
Explore whether we can provide learning examples to ChatGPT for the standard case, edge case and error handling
Future Development: Add the project code base into the Tool so that it can point out which part of the code needs the tests for a Not Satisfied Checklist Item
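The Content Revision rules above can be sketched as a small rendering function. This is an illustration only: `render_item`, the dictionary keys, the Markdown link format and the example URL are assumptions, not the tool's actual report code.

```python
def render_item(title, status, functions):
    """Build one report entry for a checklist item.

    status is "Satisfied", "Partial Satisfied" or "Not Satisfied".
    functions is a list of dicts with hypothetical keys: name, url, relevant.
    """
    entry = {"Checklist Item": title, "Observations": status}
    if status == "Not Satisfied":
        # Nothing satisfies the item: skip Function References entirely.
        return entry
    # Keep only the relevant functions; the hyperlink sits on the function
    # name itself, so no separate Line Numbers key is emitted.
    entry["Function References"] = [
        f"[{fn['name']}]({fn['url']})" for fn in functions if fn["relevant"]
    ]
    return entry


# Hypothetical inputs for illustration.
funcs = [
    {"name": "test_load_data", "url": "https://example.com/tests.py#L10", "relevant": True},
    {"name": "test_plot", "url": "https://example.com/tests.py#L42", "relevant": False},
]
satisfied = render_item("Data is loaded correctly", "Satisfied", funcs)
missing = render_item("Model output shape is checked", "Not Satisfied", funcs)
```

Here `satisfied` lists only `test_load_data`, and `missing` carries an Observations value but no Function References key.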

Test Spec Generator

Prompt engineering for well-formatted docstrings in the test spec generation output
(Refer to John's note)
Build an optional function that lets the user either extract all docstrings from the ML system code and feed them into the LLM to generate relevant reference test cases, or feed the docstring of a single ML system function into the LLM to generate relevant reference test cases
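One possible way to extract all function docstrings from a piece of ML system code is Python's standard `ast` module, sketched below. The function names `extract_docstrings` and `build_context`, the sample source and the context layout are hypothetical; how the text is then passed to the LLM is left open.

```python
import ast


def extract_docstrings(source):
    """Map each documented function name in `source` to its docstring."""
    docs = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:  # skip functions without a docstring
                docs[node.name] = doc
    return docs


def build_context(docs):
    """Join the docstrings into one text block to prepend to a prompt."""
    return "\n\n".join(f"{name}:\n{doc}" for name, doc in sorted(docs.items()))


# Hypothetical ML system code for illustration.
SAMPLE = '''
def train(X, y):
    """Fit the model on features X and labels y."""
    return None

def helper():
    return 1
'''

docs = extract_docstrings(SAMPLE)
context = build_context(docs)
```

Undocumented functions such as `helper` are skipped, so the quality of the generated reference test cases would still depend on how well the user documents their own code.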

System Evaluation

Change the F-test from overall completeness score to per-checklist completeness score
Contributing: Add Acknowledgement (refer to Tiffany's py-pkg Github repo)
(Good to have) Think about how Tiffany can perform mutation testing in the future
Have a separate presentation with Dr Rohan Alexander on the morning of June 24/25
Parallelise multiple API calls to speed up

JohnShiuMK · 2024-05-31T01:14:15Z

Partner Meeting Minutes - May 30, 2024

Attendees: John, Orix, Simon (Mentor), Tiffany (Partner), Tony, Yingzi

Checklist for Leader Persona

Consider trying Python API for Quarto CLI for checklist visualization
Consider including test examples respectively for standard case, edge case and error handling as a context for GPT (for evaluator and test spec generator)
Consider including repo example with ground truth as a context for GPT

System for Researcher Persona

To revise the Report output format:
- For a checklist item, if none of the files satisfy, show "Not Satisfied" / "None of test function fulfilled" in the Observations and skip the content in Function References
  - (future dev) add a hyperlink to the part of the code that is relevant to the missing test (requires taking the project code base as context)
- If a checklist item is partially / fully satisfied, show "Partial Satisfied" / "Satisfied", and keep only relevant test files and functions under the Observations and Function References.
- Move the hyperlink directly to the Functions and remove the Line Numbers
To revise the Terminal Report display format:
- Remove n_file_tested column in DataFrame
- Put it as a subheader, e.g. "N files are tested"
To improve prompts for better docstring generation in the test spec generator:
- include test examples for the standard case, edge case and error handling respectively as context for GPT (defined in the checklist)
- provide examples of good-quality test files together with ground truth (defined in the checklist)
- define the numpy docstring format and provide an example/skeleton
- add an optional argument that supplies all the docstrings of the functions in the project repo
  - function docstrings are more the user's responsibility

System Evaluation for Ourselves (System Developer Persona)

(Consistency) Show the F-test on the consistency of per-item scores as well
(Accuracy) For future contributions from users, add an Acknowledgement (refer to Tiffany's py-pkg GitHub repo)
To consider parallelising multiple API calls to speed up
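As a rough illustration of parallelising the API calls, the sketch below uses a thread pool from the standard library, which suits I/O-bound requests. `evaluate_repo` is a hypothetical stand-in that only sleeps to simulate network latency; the real evaluator call, rate limits and retries are not covered here.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def evaluate_repo(repo):
    """Hypothetical stand-in for one LLM API call; sleeps to mimic latency."""
    time.sleep(0.05)
    return {"repo": repo, "n_chars": len(repo)}


def evaluate_all(repos, max_workers=4):
    """Run the calls concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate_repo, repos))


# Hypothetical repo names for illustration.
results = evaluate_all(["repo_a", "repo_bb", "repo_ccc"])
```

`pool.map` preserves the input order, so the results can be matched back to repos and runs without extra bookkeeping.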

Others

Will have a separate presentation with Dr Rohan Alexander on the morning of June 24/25

JohnShiuMK · 2024-06-03T07:25:06Z

move to #127

SoloSynth1 mentioned this issue May 24, 2024

Tidy up the repo folder structure #86

Closed

4 tasks

JohnShiuMK assigned SoloSynth1, JohnShiuMK, tonyshumlh and jinyz8888 May 25, 2024

JohnShiuMK added the admin meeting related label May 25, 2024

This was referenced May 25, 2024

Add metric (e.g. from regression model) to quantify the improvement of consistency #76

Closed

quantify consistency improvement #93

Merged

Meeting Minutes for Week 4 #72

Closed

JohnShiuMK closed this as completed Jun 3, 2024
