-
Notifications
You must be signed in to change notification settings - Fork 61
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Revise cookiecutter json in starters #205
Conversation
Signed-off-by: Dmitry Sorokin <[email protected]>
5a60867
to
bb8de1e
Compare
Summarising our discussion offline for posterity: In Kedro's implementation, I agree that indicating this is important for the understandability of the code, though I don't think empty strings are informative but perhaps something like:
Which informs why this default is there and it's lack of effect on the rest of the project generation process. CC: @merelcht |
Signed-off-by: Dmitry Sorokin <[email protected]>
It appears that in previous instances, a recursive link was utilized to indicate that the value would be derived from the codebase when executing the cookiecutter() function, like: |
Just so I understand are you saying that even if we are getting If thats the case I have no issue replacing it with some descriptive placeholder for future contributors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TL;DR I'd prefer informative text (option 4) instead of using a reference like "kedro_version": "{{ cookiecutter.kedro_version }}"
(option 3), on the simple premise that we didn't immediately realise this value had no effect on the actual generated project because it looked like it served a purpose.
Problem Context:
Note
These cookiecutter.json
files will only be used when going through kedro new --starter=<starter name>
; when using kedro new --example=yes
the cookiecutter.json
file in the Kedro framework template project is used instead.
The function of cookiecutter.json
(as understood) is to serve as a place to specify the default values of different variables used during project generation (provides the defaults for cookiecutter prompts and populates placeholders in files like pyproject.toml
). Currently we specify six attributes in the cookiecutter.json
files: project_name
, repo_name
, python_package
, version
, tools
, and example_pipeline
.
This only applies to spaceflights-pandas, spaceflights-pandas-viz, spaceflights-pyspark, and spaceflights-pyspark-viz because they are accessed through both
kedro new --starter=<starter name>
andkedro new --example=yes
When creating a project from these starter templates, the only information users must provide is a project name (either through the command flag --name
/-n
, a config file, or cookiecutter prompts) and as such project_name
, repo_name
, and python_package
need to exist in cookiecutter.json
(values of the latter two are dependent on project_name).
Furthermore, it is not simply the case that version
, tools
, and example_pipeline
are optional to provide, but instead, the user is unable to provide these at all.
If users can't provide this information, why do we need to have defaults for them in cookiecutter.json
?
Short answer: because of kedro new --example=yes
.
As mentioned above, the defaults in cookiecutter.json
are used in two places:
- Cookiecutter prompts for user input (optional to provide default values)
- Cookiecutter placeholders in the template project's files (must provide default values)
For the other templates, neither of these apply because we know this template can only be accessed through kedro new --starter=<starter name>
, and so we know that no tools or example pipeline has been selected. As a result, we don't create any placeholders to capture tools
and example_pipeline
information (version
still has a placeholder).
However, for these specific starters, when we create the project template we can't know how it will be used, and so we don't know if the user can provide information about their tools and example pipeline selection.
We want to store this information in pyproject.toml
when it's provided but we don't know when that is (and two identical templates with different pyproject.toml
files is definitely overkill), so we use cookiecutter placeholders to capture the information when it's provided.
Version is only slightly different; we have the placeholder to capture the information but we know in all cases the user cannot provide this information. Instead this is passed to cookiecutter through Kedro to fill the placeholder.
However, cookiecutter can only see placeholders that it doesn't have values for - without default values what is it expected to populate the placeholders with? It asks us this by throwing an error, and so we include default values in cookiecutter.json
If we need the default values to populate the placeholders, why are we changing them?
Short answer: because Kedro provides its own default values.
cookiecutter.json
is not the only way to provide information to cookiecutter, the default values can be overwritten by providing extra_config
to cookiecutter when calling the project generation process. What we know, that cookiecutter doesn't, is that Kedro will always provide values for version
, tools
, and example_pipeline
, even when unspecified by the user. This means that the values for these attributes in cookiecutter.json
will always be overwritten. This allows us to put whatever values we like in cookiecutter.json
as they will never be used (with the current implementation).
What should we put in cookiecutter.json
?
There are several options:
- The same default values as for the Kedro template
cookiecutter.json
(current implementation)
Pro: Consistent with Kedro template project
Con: Uninformative, lead to this investigation. Would this comment be discoverable enough to ensure this won't be raised again at a future point? I personally don't believe so.
- Empty strings
Pro: Suggests placeholders are populated with some other values (if I remember correctly .toml
files complain about empty strings; if projects are created without error some other value must be provided instead)
Con: Better but still not very informative
"kedro_version": "{{ cookiecutter.kedro_version }}"
,"tools": "{{ cookiecutter.tools }}"
...
Pro: Suggests placeholders are populated with values directly passed to cookiecutter
- if not passed through this would create infinite recursion
Con: Not informative without knowledge of cookiecutter
- Use as a comment to explain the purpose of the attributes, e.g. "default value required by
cookiecutter
- overwritten by Kedro implementation"
Pro: Explicitly informative, even without prior knowledge of how Kedro and cookiecutter
work
Con: Likely unconventional
Signed-off-by: Dmitry Sorokin <[email protected]>
Signed-off-by: Dmitry Sorokin <[email protected]>
Thank you, @AhdraMeraliQB, for the comprehensive description of the issue and the potential solutions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Would love to see some more opinions on this choice
I personally find the comment a bit messy and would prefer to just go for the It would be good to hear what the rest of the team thinks @ankatiyar @noklam @lrcouto In any case, whatever decision is made this should be applied across all starters for consistency. |
I tend to agree @merelcht on this one. It's not something that user can change. It is still a good note that we should keep, maybe somewhere in the codebase or the develop docs if applicable. |
Signed-off-by: Ahdra Merali <[email protected]>
Motivation and Context
Removing the default values of the
tools
andexample_pipeline
variables fromcookiecutter.json
in the four starters is currently not feasible. Variables utilised in Cookiecutter must be declared incookiecutter.json
, even if they are consistently set during the cookiecutter() function call (which overrides values from the file). Omitting these declarations results in Cookiecutter raising anUndefinedVariableInTemplate
error.However, for clarity, it might be beneficial to leave the values blank or indicate their status with a note, such as:
This approach informs users of the special nature of these variables, highlighting that they are internally managed and should not be altered.
Checklist