Abstract more code from your literate code document (*.ipynb, *.Rmd, or .qmd) to scripts (e.g., .R or .py).
Convert your *.ipynb or *.Rmd files into a Quarto document (*.qmd). Edit your Quarto document so that its sole job is to narrate your analysis, display your analysis artifacts (i.e., figures and tables), and nicely format the report.
Write another script, a Makefile (literally called Makefile), to act as a driver script to rule them all.

In this course you will work in assigned teams of three or four (see group assignments in Canvas) to answer a predictive question using a publicly available data set that will allow you to answer that question. To answer this question, you will perform a complete data analysis in R and/or Python, from data import to communication of results, while placing significant emphasis on reproducible and trustworthy workflows.
Your data analysis project will evolve throughout the course from a single, monolithic Jupyter notebook to a fully reproducible and robust data analysis project, comprising:
An example final project from another course (where the project is similar) can be seen here: Breast Cancer Predictor
In this milestone, you will:
Abstract some code from your literate code document (*.Rmd, *.qmd, *.ipynb, etc.) to functions in a modular file (e.g., .R or .py)
Update your project’s computational environment as you add dependencies to your project
Upgrade your project’s computational environment to a container.
Abstract more code from your literate code document (*.ipynb or *.Rmd) to scripts (e.g., .R or .py). You should aim to split the analysis code into four or more R or Python scripts, where the code in each script contributes to a related step in your analysis.
Convert your *.ipynb or *.Rmd files into a Quarto document (*.qmd). Edit your Quarto document so that its sole job is to narrate your analysis, display your analysis artifacts (i.e., figures and tables), and nicely format the report. The goal is that non-data scientists would not be able to tell that code was used to perform your analysis or format your report (i.e., no code should be visible in the rendered report).
Write another script, a Makefile (literally called Makefile), to act as a driver script to rule them all. This script should run the others in sequence, hard-coding the appropriate arguments.
Continue to manage issues professionally.
An example project milestone 2 will soon be available here: https://github.com/UBC-DSCI/predict-airbnb-nightly-price/tree/v2.0.0
For now, it is still a work in progress: https://github.com/UBC-DSCI/predict-airbnb-nightly-price.
Write a Dockerfile to create a custom container for the computational environment for your project. Build your container using GitHub Actions, and publish your container image on DockerHub. Once this is done, shift the development of your project from working in a virtual environment to working in a container!
The Dockerfile is the file used to specify and create the Docker image from which containers can be run to create a reproducible computational environment for your analysis. For this project, we recommend using a base Docker image that already has most of the software dependencies needed for your analysis. Examples of these include the Jupyter core team Docker images (documentation) and the Rocker team Docker images (documentation). When you add other software dependencies to this Dockerfile, ensure that you pin the version of the software that you add.
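For example, a Dockerfile that builds on a Rocker base image and pins its added dependencies might look roughly like the sketch below; the base image tag and package versions are illustrative placeholders, not requirements for your project:

```dockerfile
# A minimal sketch, assuming a Rocker base image for an R-based analysis;
# the tag and package versions are illustrative only.
FROM rocker/rstudio:4.3.2

# Pin the exact version of every dependency added on top of the base image
RUN Rscript -e "install.packages('remotes', repos = 'https://cloud.r-project.org')" \
 && Rscript -e "remotes::install_version('docopt', version = '0.7.1', repos = 'https://cloud.r-project.org')" \
 && Rscript -e "remotes::install_version('testthat', version = '3.2.0', repos = 'https://cloud.r-project.org')"
```

A Python-based project would do the analogous thing on top of a Jupyter core team base image, pinning versions in a conda or pip install step.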
Note: this document should live in the root of your public GitHub.com repository.
In this milestone, we expect you to add a GitHub Actions workflow that automatically builds the image, pushes it to DockerHub, and versions the image (using the corresponding GitHub SHA) and the GitHub repo when changes are pushed to the Dockerfile. You will need to add your DockerHub username and password (naming them DOCKER_USERNAME and DOCKER_PASSWORD, respectively) as GitHub secrets to this repository for this to work. This part is similar to Individual Assignment 2.
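As a rough sketch of what such a workflow could look like — the file path, image name, and trigger are assumptions to adapt to your own repository:

```yaml
# .github/workflows/docker-publish.yml (illustrative sketch)
name: Publish Docker image

on:
  push:
    paths:
      - Dockerfile

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Log in to DockerHub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKER_USERNAME }}
          password: ${{ secrets.DOCKER_PASSWORD }}

      - name: Build and push image
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          # version the image with the commit SHA that changed the Dockerfile
          tags: |
            yourusername/yourproject:latest
            yourusername/yourproject:${{ github.sha }}
```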
Additionally, document how to use your container image in your README. This is important to make it easy for you and your team to shift to a container solution for your computational environment. We highly recommend using Docker Compose so that launching your containers is as frictionless as possible (which makes you more likely to use this tool in your workflow)!
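A minimal docker-compose.yml sketch is shown below; the image name, port, and mounted path are placeholders you would adapt to your own setup:

```yaml
services:
  analysis-env:
    image: yourusername/yourproject:latest
    ports:
      - "8787:8787"               # RStudio server port for Rocker images (8888 for Jupyter)
    volumes:
      - .:/home/rstudio/project   # mount the repo so edits persist outside the container
    environment:
      DISABLE_AUTH: "true"        # convenience only; fine for local, single-user use
```

With something like this in place, the README instructions for launching the environment can be as short as `docker compose up`.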
Abstract more code from your literate code document (*.ipynb, *.Rmd, or .qmd) to scripts (e.g., .R or .py).
This code need not be converted to functions, but can simply be files that call the functions needed to run your analysis. You should aim to split the analysis code into four or more R or Python scripts, where the code in each script contributes to a related step in your analysis. The output of the first script must be the input of the second, and so on. All scripts should have command line arguments, and we expect you to use either the docopt R package or the click Python package for parsing command line arguments.
The scripts could be organized something like this:
A first script that downloads the data from the internet and saves it locally. This should take at least two arguments:
Note: choose more descriptive filenames than the ones used above. These are generic names used for illustrative purposes.
A second script that reads the data from the first script and performs the data cleaning/pre-processing, transforming, and/or partitioning that needs to happen before exploratory data analysis or modeling takes place. This should take at least two arguments:
A third script which creates exploratory data visualization(s) and table(s) that are useful to help the reader/consumer understand the data set. These analysis artifacts should be written to files. This should take at least two arguments:
A fourth script that reads the data from the second script, performs the modeling, and summarizes the results as figure(s) and table(s). These analysis artifacts should be written to files. This should take at least two arguments:
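For instance, the first (download) script written with the Python/click option might look roughly like this; the option names, file paths, and URL are illustrative placeholders, not required names:

```python
# A minimal sketch of the first script using click; names are illustrative only.
import click
import pandas as pd


@click.command()
@click.option('--url', required=True, help="URL of the raw data file to download")
@click.option('--write-to', required=True, help="Local path where the raw data will be saved")
def main(url, write_to):
    """Download data from the web and save it locally, unchanged."""
    data = pd.read_csv(url)
    data.to_csv(write_to, index=False)


if __name__ == '__main__':
    main()
```

It would then be run as, e.g., `python scripts/download_data.py --url=<data URL> --write-to=data/raw/raw_data.csv`, and the later scripts would follow the same pattern with their own arguments.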
Convert your *.ipynb or *.Rmd files into a Quarto document (*.qmd). Edit your Quarto document so that its sole job is to narrate your analysis, display your analysis artifacts (i.e., figures and tables), and nicely format the report. The goal is that non-data scientists would not be able to tell that code was used to perform your analysis or format your report (i.e., no code should be visible in the rendered report). You should do all the things you did for the report in individual assignment 4.
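One way to hide all code in the rendered report is to set execute options in the Quarto document header; a minimal sketch (title and format options are placeholders) is:

```yaml
---
title: "Final report"
format:
  html:
    embed-resources: true
execute:
  echo: false      # hide code in the rendered report
  warning: false   # suppress warnings so only narrative and artifacts remain
---
```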
Abstract some code from your literate code document (*.Rmd, *.qmd, *.ipynb, etc.) to functions in a *.R or *.py file
In every data science project, there is some code that is repetitive, and other code that may not be repetitive in the current project but would likely be very useful in other, related future projects. It is well worth it to abstract such code to functions. This allows the code to be tested with unit tests, to ensure it works as expected, and makes it easily reusable in future work.
Examples of code that is often repetitive in data analysis projects:
Write another script, a Makefile (literally called Makefile), to act as a driver script to rule them all
This script should run the others in sequence, hard-coding the appropriate arguments. This script should:
be well documented (using the project README and comments inside the Makefile to explain what it does and how to use it)
have an all target so that you can easily run the entire analysis from top to bottom by running make all at the command line
have a clean target so that you can easily "undo" your analysis by running make clean at the command line (e.g., deleting all generated data and files).
Tip:
the all target can be a .PHONY target
you can add intermediate targets (e.g., data, analysis, figures, pdf, etc.) so your build process only runs what's necessary during development
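Putting these pieces together, a driver Makefile might look roughly like the sketch below; the script names, arguments, and target files are illustrative assumptions (recipe lines must be indented with tabs):

```makefile
.PHONY: all clean

all: results/model_results.csv

# download the raw data
data/raw/raw_data.csv:
	python scripts/download_data.py --url=https://example.com/data.csv --write-to=data/raw/raw_data.csv

# clean/pre-process the raw data
data/processed/clean_data.csv: data/raw/raw_data.csv
	python scripts/clean_data.py --input=data/raw/raw_data.csv --output=data/processed/clean_data.csv

# fit the model and write the results
results/model_results.csv: data/processed/clean_data.csv
	python scripts/fit_model.py --input=data/processed/clean_data.csv --output=results/model_results.csv

# "undo" the analysis
clean:
	rm -rf data/raw data/processed results
```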
file
-(whose filename will be named after the function, or functions).
-It is OK to have one function per file, or all functions in one file.
-This/these file(s) will live in a sub-directory called R
.
If you are using Python, these functions will live in an .py
file
-(whose filename will be named after the function, or functions).
-Again, it is OK to have one function per file, or all functions in one file.
-This/these file(s) will live in a sub-directory called src
.
You will source
(in the case of R) or import
(in the case of Python)
-these functions in your literate code document (*.ipynb
) to use them in your analysis.
-Tests will live in a test
directory, with files/subdirectories organized as per the testing framework you are using.
We expect that you will abstract at least 3-4 functions from your literate code document. One per group member is the minimum. Of course, if it makes sense to have more than 3-4, you are welcome to increase the number! However, all functions must meet the same standards of software robustness. Your functions will be assessed for their quality (e.g., functions should do one thing, and generally return an object unless they were specifically designed for side effects), usability, readability (follow the tidyverse style guide for R, or the black style guide for Python), documentation, and the quality of the test suite.
If you are using R for your data analysis code, we expect you to use the testthat R package framework for writing software tests. If you are using Python, we expect you to use the pytest Python package framework.
At a minimum, you will be adding the testing package framework as a new dependency to your project (e.g., the testthat R package or the pytest Python package). Thus, you will need to update your project's computational environment as you add this (and potentially other) dependencies to your project. This means that your Dockerfile will need to have this package added, with the version pinned, and the Docker container image will need to be rebuilt. Do not forget to update your project's documentation to reflect these changes.
Continue managing issues effectively through project boards and milestones; make it clear who is responsible for what and which project milestone each task is associated with. In particular, create an issue for each task and/or sub-task needed for this milestone. Each of these issues must be assigned to a single person on the team. We want all of you to get coding experience in the project, and each team member should be responsible for an approximately equal portion of the code.
You will submit three URLs to Canvas in the provided text box for milestone 2:
README.md instructions.
Just before you submit milestone 2, create a release on your project repository on GitHub and name it something like 1.0.0 (how to create a release). This release allows us and you to easily jump to the state of your repository at the time of submission for grading purposes, while you continue to work on your project for the next milestone.
Everyone should contribute equally to all aspects of the project (e.g., code, writing, project management). This should be evidenced by a roughly equal number of commits, pull request reviews, and participation in communication via GitHub issues.
Each group member should work in a GitHub flow workflow: create branches for each feature or fix, which are reviewed and critiqued by at least one other teammate before the pull request is accepted.
You should be committing to git and pushing to GitHub.com every time you work on this project.
Git commit messages should be meaningful. These will be marked. It's OK if one or two are less meaningful, but most should be.
Use GitHub issues to communicate with your teammates (as opposed to email or Slack).
Your question, analysis and visualization should make sense. It doesn’t have to be complicated.
Your analysis should be correct, and run reproducibly given the instructions provided in the README.
You should use proper grammar and full sentences in your README. Point form may occur, but should be less than 10% of your written documents.
R code should follow the tidyverse style guide, and Python code should follow the black style guide.
You should not have extraneous files in your repository that should be ignored.
Abstract more code from your literate code document (*.ipynb, *.Rmd, or .qmd) to scripts (e.g., .R or .py).
Edit your literate code document (*.ipynb or *.Rmd) so that its sole job is to narrate your analysis, display your analysis artifacts (i.e., figures and tables), and nicely format the report.
Write another script, a Makefile (literally called Makefile), to act as a driver script to rule them all.

In this course you will work in assigned teams of three or four (see group assignments in Canvas) to answer a predictive question using a publicly available data set that will allow you to answer that question. To answer this question, you will perform a complete data analysis in R and/or Python, from data import to communication of results, while placing significant emphasis on reproducible and trustworthy workflows.
Your data analysis project will evolve throughout the course from a single, monolithic RMarkdown document, Jupyter notebook, or Quarto document to a fully reproducible and robust data analysis project, comprising:
An example final project from another course (where the project is similar) can be seen here: Breast Cancer Predictor
In this milestone, you will:
Abstract more code from your literate code document (*.ipynb or *.Rmd) to scripts (e.g., .R or .py). This code need not be converted to functions, but can simply be files that call the functions needed to run your analysis. You should aim to split the analysis code into four or more R or Python scripts, where the code in each script contributes to a related step in your analysis.
Edit your literate code document (*.ipynb or *.Rmd) so that its sole job is to narrate your analysis, display your analysis artifacts (i.e., figures and tables), and nicely format the report. The goal is that non-data scientists would not be able to tell that code was used to perform your analysis or format your report (i.e., no code should be visible in the rendered report).
Write another script, a Makefile (literally called Makefile), to act as a driver script to rule them all. This script should run the others in sequence, hard-coding the appropriate arguments.
Abstract some code from your literate code document (*.Rmd, *.qmd, *.ipynb, etc.) to functions in a modular file (e.g., .R or .py)
Update your project’s computational environment as you add dependencies to your project
An example project milestone 3 will soon be available here: https://github.com/UBC-DSCI/predict-airbnb-nightly-price/tree/v2.0.0
For now, it is still a work in progress: https://github.com/UBC-DSCI/predict-airbnb-nightly-price.
Abstract more code from your literate code document (*.ipynb, *.Rmd, or .qmd) to scripts (e.g., .R or .py).
This code need not be converted to functions, but can simply be files that call the functions needed to run your analysis. You should aim to split the analysis code into four or more R or Python scripts, where the code in each script contributes to a related step in your analysis.
Examples of steps in your current document that can be broken up into separate scripts:
The output of the first script must be the input of the second, and so on. All scripts should have command line arguments, and we expect you to use the docopt R package for parsing command line arguments (if you are using Python, we recommend argparse or click).
The scripts could be organized something like this:
A first script that downloads the data from the internet and saves it locally. This should take at least two arguments:
Note: choose more descriptive filenames than the ones used above. These are generic names used for illustrative purposes.
A second script that reads the data from the first script and performs the data cleaning/pre-processing, transforming, and/or partitioning that needs to happen before exploratory data analysis or modeling takes place. This should take at least two arguments:
A third script which creates exploratory data visualization(s) and table(s) that are useful to help the reader/consumer understand the data set. These analysis artifacts should be written to files. This should take at least two arguments:
A fourth script that reads the data from the second script, performs the modeling, and summarizes the results as figure(s) and table(s). These analysis artifacts should be written to files. This should take at least two arguments:
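As a sketch of the docopt option in R for the first script — the script name, option names, and paths are illustrative placeholders:

```r
# A minimal sketch of docopt-based argument parsing for the download script.
"Download a data file from the web and save it locally.

Usage: download_data.R --url=<url> --out_file=<out_file>
" -> doc

library(docopt)

opt <- docopt(doc)

raw_data <- read.csv(opt$url)
write.csv(raw_data, opt$out_file, row.names = FALSE)
```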
Edit your literate code document (*.ipynb or *.Rmd) so that its sole job is to narrate your analysis, display your analysis artifacts (i.e., figures and tables), and nicely format the report.
The goal is that non-data scientists would not be able to tell that code was used to perform your analysis or format your report (i.e., no code should be visible in the rendered report).
You should render your report to either html or PDF. If you are using R Markdown, you should use the bookdown output formats (e.g., bookdown::html_document2 or bookdown::pdf_document2) so you can reference figures, tables, and sections effectively. PDF output needs a TeXLive installation to render the PDF. You did this locally in the computer setup at the beginning of the course; for your container, you may want to look into the rocker/verse base image. If you are using Jupyter for your final report, you should use Jupyter Book, again so you can reference figures, tables, and sections effectively.
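For the R Markdown route, that means a YAML header roughly like the sketch below (the title is a placeholder, and bookdown::pdf_document2 is the analogous PDF output):

```yaml
---
title: "Final report"
output:
  bookdown::html_document2:
    toc: true
---
```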
Write another script, a Makefile (literally called Makefile), to act as a driver script to rule them all
This script should run the others in sequence, hard-coding the appropriate arguments. This script should:
live in the project's root directory and be named Makefile
be well documented (using the project README and comments inside the Makefile to explain what it does and how to use it)
have an all target so that you can easily run the entire analysis from top to bottom by running make all at the command line
have a clean target so that you can easily "undo" your analysis by running make clean at the command line (e.g., deleting all generated data and files).
Tip:
the all target can be a .PHONY target
you can add intermediate targets (e.g., data, analysis, figures, pdf, etc.) so your build process only runs what's necessary during development
-you will be adding either bookdown
or jupyter-book
-as a new dependency to your project.
+
If you are using R, these functions will live in an .R
file
+(whose filename will be named after the function, or functions).
+It is OK to have one function per file, or all functions in one file.
+This/these file(s) will live in a sub-directory called R
.
If you are using Python, these functions will live in an .py
file
+(whose filename will be named after the function, or functions).
+Again, it is OK to have one function per file, or all functions in one file.
+This/these file(s) will live in a sub-directory called src
.
You will source
(in the case of R) or import
(in the case of Python)
+these functions in your literate code document (*.ipynb
) to use them in your analysis.
+Tests will live in a test
directory, with files/subdirectories organized as per the testing framework you are using.
We expect that you will abstract at least 3-4 functions from your literate code document. +One per group member is the minimum. +Of course, if it makes sense to have more than 3-4 you are welcome to increase the number! +However, all functions must have the same standards in regards to software robustness. +Your functions will be assessed for their quality +(e.g., functions should do one thing, +and generally return an object unless they were specifically designed for side-effects), +usability, +readability (follow the tidyverse style guide for R, +or the black style guide for Python), +documentation and quality of the test suite.
+If you are using R for your data analysis code,
+we expect you to use the testthat
R package framework for writing software tests.
+If you are using Python, we expect you to use the pytest
Python package framework.
At a minimum, you will be adding the testing package framework as a new dependency to your project
+(e.g., testthat
R package or pytest
Python package).
Thus, you will need to update your project’s computational environment
as you add this (and potentially other) dependencies to your project.
-This means that your Dockerfile
will need to have this package added,
-with the version pinned,
-and the Docker container image will need to be rebuilt and published to DockerHub.
+This means that your Dockerfile
will need to have this package added, with the version pinned,
+and the Docker container image will need to be rebuilt.
Do not forget to update your project’s documentation to reflect these changes.
You will submit three URLs to Canvas in the provided text box for milestone 3:
README.md instructions.
Up to 50% of your Milestone 1 + Milestone 2 points will be awarded if you have adequately addressed feedback and comments from your repository.
Just before you submit milestone 3, create a release on your project repository on GitHub and name it something like 3.0.0 (how to create a release). This release allows us and you to easily jump to the state of your repository at the time of submission for grading purposes, while you continue to work on your project for the next milestone.
Everyone should contribute equally to all aspects of the project (e.g., code, writing, project management). This should be evidenced by a roughly equal number of commits, pull request reviews, and participation in communication via GitHub issues.
Each group member should work in a GitHub flow workflow: create branches for each feature or fix, which are reviewed and critiqued by at least one other teammate before the pull request is accepted.
You should be committing to git and pushing to GitHub.com every time you work on this project.
Git commit messages should be meaningful. These will be marked. It's OK if one or two are less meaningful, but most should be.
Use GitHub issues to communicate with your teammates (as opposed to email or Slack).
Your question, analysis and visualization should make sense. It doesn’t have to be complicated.
Your analysis should be correct, and run reproducibly given the instructions provided in the README.
You should use proper grammar and full sentences in your README. Point form may occur, but should be less than 10% of your written documents.
R code should follow the tidyverse style guide, and Python code should follow the black style guide.
You should not have extraneous files in your repository that should be ignored.