Add pattern checks for linting #1374

Closed
wants to merge 3 commits into from
4 changes: 4 additions & 0 deletions .gitignore
@@ -139,3 +139,7 @@ scratch.ipynb

# Ignore config.yaml from the cli
config.yaml

# uv unless we decide we want it in the future
pyproject.toml
uv.lock
35 changes: 35 additions & 0 deletions scripts/check-yaml.py
@@ -3,8 +3,10 @@

# Standard
import argparse
import glob
import os
import pathlib
import re
import sys

# Third Party
@@ -49,8 +51,41 @@ def check(self) -> int:
"The \"%s\" file must be non-empty",
taxonomy.path.with_name(attribution_path.name),
)
# NOTE: The following three warnings are intended for the
# beta only, at the moment, and only to flag issues for
# maintainers to address rather than block on them. We will
# revisit when other content is allowed.
qna_file_path = taxonomy.rel_path.with_name("qna.yaml")
if "knowledge" in qna_file_path.parts:
Contributor

This may be too promiscuous, since the user could put "knowledge" somewhere in their sub-path: compositional_skills/philosophy/knowledge/qna.yaml

I think you need to look only at part 0, the first path component.

Member Author

Ah, fair point. Didn't consider that edge case.
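
To illustrate the edge case raised above, here is a minimal sketch using `pathlib.PurePosixPath`. The helper name `is_knowledge_path` and the sample path `knowledge/science/qna.yaml` are hypothetical and not part of check-yaml.py; the other path is the one from the review comment.

```python
from pathlib import PurePosixPath


def is_knowledge_path(rel_path: PurePosixPath) -> bool:
    """Return True only when the top-level directory is "knowledge"."""
    parts = rel_path.parts
    return len(parts) > 0 and parts[0] == "knowledge"


# Path taken from the review comment: a skill with a "knowledge" subfolder.
skill = PurePosixPath("compositional_skills/philosophy/knowledge/qna.yaml")
# Hypothetical path for a real knowledge contribution.
doc = PurePosixPath("knowledge/science/qna.yaml")

# Membership test, as in the diff: matches "knowledge" anywhere in the path.
membership_hit = "knowledge" in skill.parts
print(membership_hit)            # True, a false positive for a skill

# First-component test: only matches the top-level directory.
print(is_knowledge_path(skill))  # False
print(is_knowledge_path(doc))    # True
```

Checking only the first component keeps the warnings scoped to the knowledge tree, whatever folder names appear deeper in the path.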

qna_file_contents = parser.parse(qna_file_path).contents
Contributor

The qna.yaml is already parsed in the taxonomy object: taxonomy.contents. Why parse it again?

Member Author

Misread the code in schema, really. Thanks for catching it.

for element in qna_file_contents["document"]["patterns"]:
if not re.match('.*.md', element):
Contributor

I thought PDF support was added? Anyway, the regex here would be

re.search(r"\.md$", element)

But what about the pattern folder_of_md_files/*, which is a legitimate value that should not be rejected?

If you want to do more checking here, I don't think you can do it by pattern-matching the YAML contents. You would need to clone the repo, find all files in the repo which match the patterns, and then check that all those files match the desired file types.
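
As a quick sketch of how the two expressions discussed here behave, the snippet below runs both on a few strings. Apart from folder_of_md_files/*, the sample strings are made up for illustration, and the helper names `loose_md` and `strict_md` are hypothetical.

```python
import re


def loose_md(pattern: str) -> bool:
    """The expression from the diff: any character followed by "md", anywhere."""
    return re.match(".*.md", pattern) is not None


def strict_md(pattern: str) -> bool:
    """The suggested expression: the string must end in a literal ".md"."""
    return re.search(r"\.md$", pattern) is not None


samples = ["README.md", "notes.mdx", "amd64.txt", "folder_of_md_files/*", "docs/*"]
results = {s: (loose_md(s), strict_md(s)) for s in samples}

for sample, (loose, strict) in results.items():
    print(f"{sample!r}: loose={loose} strict={strict}")
```

The unescaped dot in the loose expression accepts any character, so it matches strings that merely contain "md". The strict expression rejects every glob pattern, which is the reason the comment suggests checking the files the patterns resolve to rather than the pattern strings themselves.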

Member Author

Yes, PDF support was added in the recent release. That was an oversight on my part as a newbie.

Technically, the .* should match directory patterns when used with re.match(), and it has in the testing I've done. But you're right at the higher level: I probably shouldn't be matching file patterns (I really should just be parsing the string instead of getting into regex at all...) and should be validating the file type from the repo. Was there a reason the prior version at #1192 was dropped, or did it just lack contributor time?
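
The approach suggested in this thread, resolving the patterns against a checkout and then checking file types, might look roughly like the sketch below. It assumes the repo has already been cloned to a local directory and that the patterns are relative globs. The function name `unsupported_files` and the `ALLOWED_SUFFIXES` set are hypothetical; the set reflects the markdown and PDF types mentioned in the discussion.

```python
import tempfile
from pathlib import Path

# Assumption: only these document types are accepted.
ALLOWED_SUFFIXES = {".md", ".pdf"}


def unsupported_files(checkout: Path, patterns: list[str]) -> list[Path]:
    """Return files matched by the patterns whose suffix is not allowed.

    Paths are returned relative to the checkout root.
    """
    bad = []
    for pattern in patterns:
        # Path.glob expands relative patterns such as "folder/*" or "*.md".
        for match in sorted(checkout.glob(pattern)):
            if match.is_file() and match.suffix.lower() not in ALLOWED_SUFFIXES:
                bad.append(match.relative_to(checkout))
    return bad


# Usage with a throwaway directory standing in for the cloned repo.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "docs").mkdir()
    (root / "docs" / "intro.md").write_text("# Intro\n")
    (root / "docs" / "data.csv").write_text("a,b\n")
    result = unsupported_files(root, ["docs/*", "*.md"])
    print(result)
```

This keeps a pattern like folder_of_md_files/* valid and only flags the individual files it resolves to.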

taxonomy.warning(
"The document \"%s\" should be a markdown "
"file.",
element
)
if not re.match(
'https://github\.com\/.*',
qna_file_contents["document"]["repo"]):
taxonomy.warning(
"The document repo \"%s\" needs to be a "
"GitHub-based repository.",
Contributor

No, it doesn't have to be GitHub. Any valid git repo could be used. We just expect that any such git repo can be accessed because any necessary authorization is configured.

Member Author

I've been getting conflicting information on this as I've onboarded onto the project, so I'm validating that with oversight first (left a note in the taxonomy triage channel on InstructLab Slack to get more info). I'll update this when I get an answer there.

qna_file_contents["document"]["repo"]
)
if not re.match(
Contributor

This potential check is discussed here: instructlab/schema#30

If we do want to require SHA values, we should probably do that in the schema. But I am not convinced that is a great idea, since we could allow non-SHA values that have special meanings.
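
Setting aside whether the check belongs here or in the schema, a strict 40-character SHA test could be sketched as follows. This is only an illustration: `looks_like_sha1` is a hypothetical helper, and the sample values are made up. It also shows that `re.match` anchors only at the start of the string, so the expression in the diff accepts trailing characters.

```python
import re

SHA1_RE = re.compile(r"[0-9a-f]{40}")


def looks_like_sha1(commit: str) -> bool:
    """True only if the whole string is exactly 40 lowercase hex characters."""
    return SHA1_RE.fullmatch(commit) is not None


full_sha = "a" * 40                # made-up value with the right shape
with_suffix = full_sha + "-dirty"  # 40 hex characters plus trailing text

# re.match only anchors at the start, so trailing text slips through.
prefix_only = re.match("[0-9a-f]{40}", with_suffix) is not None
print(prefix_only)                   # True

print(looks_like_sha1(with_suffix))  # False
print(looks_like_sha1(full_sha))     # True
print(looks_like_sha1("main"))       # False
```

A strict check like this would also reject branch or tag names, which is the concern about non-SHA values with special meanings.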

'[0-9a-f]{40}',
qna_file_contents["document"]["commit"]):
taxonomy.warning(
"The document commit \"%s\" needs to be an "
"alphanumeric value that represents a commit. "
"\Please check with the reviewers for help.",
qna_file_contents["document"]["commit"]
)
if taxonomy.errors > 0:
exit_code = 1
if taxonomy.warnings > 0:
Contributor

This is wrong since N errors and 1 warning means exit code 0. These 2 lines are not needed since exit_code is initialized to 0.

Member Author

Oh bugger, you're right; I'm resetting the exit code. Sorry; fixing that.
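
The problem described in this thread can be reproduced in isolation. The sketch below uses two hypothetical standalone functions rather than the script's actual `check` method, and assumes, as the review comment states, that `exit_code` starts at 0.

```python
def buggy_exit_code(errors: int, warnings: int) -> int:
    """Mirrors the logic in the diff: the warnings branch resets a failure."""
    exit_code = 0
    if errors > 0:
        exit_code = 1
    if warnings > 0:
        exit_code = 0  # overwrites the 1 set above
    return exit_code


def fixed_exit_code(errors: int, warnings: int) -> int:
    """Errors fail the run; warnings never change the exit code."""
    exit_code = 0
    if errors > 0:
        exit_code = 1
    # No warnings branch: exit_code already defaults to 0.
    return exit_code


print(buggy_exit_code(errors=3, warnings=1))  # 0, the failure is lost
print(fixed_exit_code(errors=3, warnings=1))  # 1
print(fixed_exit_code(errors=0, warnings=2))  # 0
```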

exit_code = 0
if not self.yaml_files:
print("No yaml files specified.")
return exit_code