Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci: Check commit sha for knowledge contributions #1188

Closed
ckadner opened this issue Jun 12, 2024 · 11 comments
Closed

ci: Check commit sha for knowledge contributions #1188

ckadner opened this issue Jun 12, 2024 · 11 comments
Labels
ci stale stale-bot has marked you as stale

Comments

@ckadner
Copy link
Contributor

ckadner commented Jun 12, 2024

@bjhargrave -- We may want to add a check (linter or separate) to verify knowledge documents are actually "available" with the specified commit sha

So this part of the qna.yaml ...

document:
  repo: https://github.com/juliadenham/Summit_knowledge
  commit: ad185efec4dad89bf611d4b9aa8641155cb25d91
  patterns:
    - BBC_studios.md

Would "translate" to a URL like this ...

url = f"https://raw.githubusercontent.com/{repo}/{commit}/{patterns[0]}"

And then we could use the Github Python API or even just Python requests to check for a 200 HTTP status, similar to this simple curl example:

curl -o /dev/null -w "%{http_code}" \
    -s https://raw.githubusercontent.com/juliadenham/Summit_knowledge/ad185ef/BBC_studios.md

200
@ckadner ckadner added the ci label Jun 12, 2024
@ckadner
Copy link
Contributor Author

ckadner commented Jun 12, 2024

We could even take this further and try to find questions and answers in the document, or, just probe for specific things we know to get lost in translation often, like numbers, symbols, currency, units of measurement, etc.

@bjhargrave
Copy link
Contributor

So there are several levels of checks that could be done:

  1. The commit value is a 40 char hexadecimal number.
  2. The SHA actually exists in the repository.
  3. At that SHA, there is at least one md file which matches the specified patterns (which can have globbing).

(1) is pretty easy. (2) is a little harder. (3) is more complicated and could require the repo to be cloned into a tmp folder to look for matching md files.

I think (1) is doable. (2) and (3) require opening up a URL and I am concerned about random requests from the action runner since a malicious qna.yaml could get our action to hit any provided URL. I don't think we should do (2) or (3). They will be implicitly handled by the SDG check since it must load the knowledge to generate.

@bjhargrave
Copy link
Contributor

url = f"https://raw.githubusercontent.com/{repo}/{commit}/{patterns[0]}"

pattern can be a glob like **.md which is all md files in a repo.

ckadner added a commit to ckadner/taxonomy that referenced this issue Jun 13, 2024
@ckadner
Copy link
Contributor Author

ckadner commented Jun 13, 2024

pattern can be a glob like **.md which is all md files in a repo.

Hmm, the globbing eliminates using Python Requests. PyGithub could list all files in the repo for a given commit, maybe even with a glob filter?

I agree, a local clone might be overkill. Even a shallow, partial clone could potentially be huge.

@ckadner
Copy link
Contributor Author

ckadner commented Jun 13, 2024

I am concerned about random requests from the action runner since a malicious qna.yaml could get our action to hit any provided URL.

Right. But we could could limit our check to GitHub URLs only.

@lehors
Copy link
Contributor

lehors commented Jun 13, 2024

One possibility is to do a simple HTTP HEAD request against the commit to check that at least the URL and commit seem to be valid.

curl -I -o /dev/null -w "%{http_code}" -s \
   https://github.com/juliadenham/Summit_knowledge/commit/ad185efec4dad89bf611d4b9aa8641155cb25d91

200

I agree that going beyond and trying to process the glob might be too involved.

@bjhargrave
Copy link
Contributor

bjhargrave commented Jun 13, 2024

But we could could limit our check to GitHub URLs only.

We could. But the yaml is allowed to use any git repo not just those in github.

https://github.com/juliadenham/Summit_knowledge/commit/ad185efec4dad89bf611d4b9aa8641155cb25d91

This could work for github. We would have to remove the trailing .git from the URL (since the URL is supposed to be a git clone URL - github will redirect git agent requests from the web URL to the git URL as a convenience). But for non-github git URLs, such a URL formulation does not exist. Appending the commit SHA to the git URL as above is purely a github construct.

To support any valid git URL, we would need to use a git agent to fetch the commit SHA from the repo.

@bjhargrave
Copy link
Contributor

To support any valid git URL, we would need to use a git agent to fetch the commit SHA from the repo.

mkdir tmp
git -C tmp init
git -C tmp fetch --dry-run https://github.com/juliadenham/Summit_knowledge.git ad185efec4dad89bf611d4b9aa8641155cb25d91

could be used. The exit code will be non-zero if the commit does not exist. But we would still have to git init an empty repo first.

@lehors
Copy link
Contributor

lehors commented Jun 13, 2024

To support any valid git URL, we would need to use a git agent to fetch the commit SHA from the repo.

I hear you but practically speaking I don't know that anyone is going to use anything other than a github repo for this.
Maybe we could ask whether anyone is using anything else?

Copy link

This issue has been automatically marked as stale because it has not had activity within 60 days. It will be automatically closed if no further activity occurs within 31 days.

@github-actions github-actions bot added the stale stale-bot has marked you as stale label Aug 13, 2024
@bjhargrave
Copy link
Contributor

Closing in favor of instructlab/schema#30

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci stale stale-bot has marked you as stale
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants