From 93f403c91f5abf517e94834f2a5944f1b3737c6e Mon Sep 17 00:00:00 2001
From: "Luke W. Johnston"
Date: Sat, 30 Nov 2024 19:01:21 +0100
Subject: [PATCH] docs(project): revised project work to have clearer tasks
 (#35)

---
 appendix/project.qmd  | 77 ++++++++++++++++++++++++++-----------------
 preamble/syllabus.qmd |  7 +++-
 2 files changed, 53 insertions(+), 31 deletions(-)

diff --git a/appendix/project.qmd b/appendix/project.qmd
index 95f14b4..1495111 100644
--- a/appendix/project.qmd
+++ b/appendix/project.qmd
@@ -2,11 +2,20 @@
 To maximize how much you learn and how much you will retain, you as a
 group will take what you learn in the course and apply it to create a
-reproducible project. This project ...
+reproducible project within a server environment. This project ...
+
+We will have created a project folder to work in on the server, as well
+as assigned you and your teammate(s) to the project. Within your
+project, you will also be given a specific set of outputs to create (a
+figure, a table, and a basic report), based on the data we provide to
+you. The outputs will not require using all of the data given.
+
+
 During the last session of the course you will work on this assignment.
-In the last \~20 minutes of this session, the lead instructor will ...
-and re-generate your report to check that it is reproducible.
+In the last \~20 minutes of this session, the lead instructor will go
+into your projects and re-generate your report to check that it is
+reproducible.
 
 ## Specific tasks
 
@@ -16,33 +25,31 @@
 quickly start collaborating together on the project.
 
 Your specific tasks are:
 
-Sequence of steps for project:
-
-- Starting point:
-    - Learning how to identify what file storage format (e.g. csv or
-      SAS dataset) there are and knowing how convert those files into
-      more efficient formats (like Parquet or a SQL database)
-    - Give them a few server environment types, and the same data but
-      with different starting formats.
And then they figure out the
-      next steps based on that information
-    - Multiple data is big enough to prevent doing it normal way (1 Gb
-      or larger?)
-- Explaining why the original data format might not be ideal and then
-  converting the data into more efficient format
-- Identify what the desired sample is for the dataset, only select and
-  filter data they need for analysis
-- Split the data into smaller chunk to prototype code (running code on
-  all the data later)
-- Run basic analysis (descriptive statistics)... Not modeling
-- Implement some code to run with parallel processing
-- Identifying which format data or items can be downloaded, and
-  converting that to that format
-
-Assumptions:
-
-- Assume they have taken the intermediate course (need to know
-  functionals and function-based workflows), and either have read or
-  taken the advanced course or are familiar enough with targets
+1. Review the outputs we want you to create as well as the specific
+   data needed for creating them. Keep these in mind for later tasks.
+2. Identify what resources you have available in the server
+   environment, such as the number of cores and amount of memory. Use
+   this information to guide your coding.
+3. Look into the ... folder that contains the raw data you will use for
+   this project. Identify which file storage format the data is saved
+   in and convert it into a more efficient format if necessary.
+   Depending on which format it is, save the converted data into the
+   ... folder. Make use of parallel processing to speed this up (using
+   `{furrr}`).
+4. Read in a subset of the data that only contains the columns and rows
+   you need for your outputs. Randomly keep a slice of this data to use
+   for prototyping code later.
+5. Working backwards, write out a set of (empty) functions that provide
+   the sequence of steps that will create a specific output. Begin
+   filling in these functions, making sure they work before adding them
+   to the `{targets}` pipeline.
Write the functions and `{targets}`
+   pipeline so that it will run things in parallel.
+6. Incorporate the outputs into a report, and include that in the
+   `{targets}` pipeline.
+7. Comment out the line of code that randomly keeps a slice of the data
+   and then run the `{targets}` pipeline with `targets::tar_make()`.
+8. The project will be complete if you can regenerate all the outputs
+   using only the `{targets}` pipeline.
 
 ## Quick "checklist" for a good project
 
@@ -50,7 +57,17 @@ Assumptions:
 
 What we expect you to do for the group project:
 
+- Use parallel processing.
+- Use a `{targets}` pipeline.
+- Use functional programming (including creating your own functions).
+- Use an efficient file storage format.
+
 What we don't expect:
 
+- Complicated analyses.
+- Complicated figures or tables.
+- Processing that isn't specifically related to creating the assigned
+  output.
+
 Essentially, the group project is a way to reinforce what you learned
 during the course, but in a more relaxed and collaborative setting.

diff --git a/preamble/syllabus.qmd b/preamble/syllabus.qmd
index 59012d5..3c6716f 100644
--- a/preamble/syllabus.qmd
+++ b/preamble/syllabus.qmd
@@ -69,7 +69,12 @@
 To help manage expectations and develop the material for this course,
 we make a few assumptions about *who you are* as a participant in the
 course:
 
-- Assumptions
+Assumptions:
+
+- This course builds on the content of the intermediate course
+  (specifically, using functionals and function-based workflows) and
+  the advanced course (specifically, using `{targets}` to build
+  pipelines).
 
 While we have these assumptions to help focus the content of the
 course, if you have an interest in learning R but don't fit any of the
 above
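As an illustration of how the revised tasks fit together, a `_targets.R` file along these lines could tie the pieces into one pipeline. This is a minimal sketch, not the course's solution: the folder path and all helper function names (`convert_to_parquet()`, `select_analysis_data()`, and so on) are hypothetical placeholders, since the patch deliberately elides the real project folder names, and `{crew}` is only one way to get `{targets}` to run things in parallel.

```r
# _targets.R -- a minimal sketch of the pipeline described in the tasks.
# All paths and helper function names are hypothetical placeholders;
# use the folders and outputs assigned to your project.
library(targets)
library(tarchetypes)

tar_option_set(
  packages = c("arrow", "dplyr"),
  # Run independent targets on parallel workers via {crew}.
  controller = crew::crew_controller_local(workers = 2)
)

# Load the functions you wrote (task 5), e.g. from R/functions.R.
tar_source()

list(
  # Task 3: convert the raw files to a more efficient format; inside
  # convert_to_parquet(), furrr::future_map() could handle the files
  # in parallel.
  tar_target(raw_files, list.files("data/raw", full.names = TRUE)),
  tar_target(parquet_files, convert_to_parquet(raw_files)),
  # Task 4: read only the columns and rows needed for the outputs.
  tar_target(analysis_data, select_analysis_data(parquet_files)),
  # Tasks 5-6: build each assigned output, then the report using them.
  tar_target(descriptives, calculate_descriptives(analysis_data)),
  tar_target(figure, plot_figure(analysis_data)),
  tar_quarto(report, "doc/report.qmd")
)
```

Running `targets::tar_make()` (task 7) would then rebuild only the targets whose code or upstream data changed, which is what makes the final reproducibility check in task 8 cheap to verify.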